AW: Lexical analysis tools for German language data

2012-04-13 Thread Michael Ludwig
 From: Tomas Zerolo

   There can be transformations or inflections, like the s in
   Weihnachtsbaum (Weihnachten/Baum).
 
  I remember from my linguistics studies that the terminus technicus
  for these is Fugenmorphem (interstitial or joint morpheme) [...]
 
 IANAL (I am not a linguist -- pun intended ;) but I've always read
 that as a genitive. Any pointers?

Admittedly, that's what you'd think, and despite linguistics telling me
otherwise I'd maintain there's some truth in it. For this case, however,
consider: die Weihnacht declines like die Nacht, so:

nom. die Weihnacht
gen. der Weihnacht
dat. der Weihnacht
akk. die Weihnacht

As you can see, there's no s to be found anywhere, not even in the
genitive. But my gut feeling, like yours, is that this should indicate
genitive, and I would argue that a well-reasoned gut feeling is at least
as relevant as formalist analysis.

Michael


AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
 Given an input of Windjacke (probably wind jacket in English),
 I'd like the code that prepares the data for the index (tokenizer
 etc) to understand that this is a Jacke (jacket) so that a
 query for Jacke would include the Windjacke document in its
 result set.
 
 It appears to me that such an analysis requires a dictionary-
 backed approach, which doesn't have to be perfect at all; a list
 of the most common 2000 words would probably do the job and fulfil
 a criterion of reasonable usefulness.

A simple approach would obviously be a word list and a regular
expression - see the toy sketch below. There will, however, be nuts
and bolts to take care of. Perhaps you know of a more sophisticated,
proven approach.
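Here's the kind of toy sketch I have in mind (Java; the word list is
invented and would in reality be the 2000 most common head nouns):

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class NaiveDecompounder {

      // Toy word list; a real one would hold a couple of thousand nouns.
      private static final Pattern COMPOUND =
          Pattern.compile("(?i).+?(jacke|baum|musik|betreuer)$");

      // Returns the base term of a compound, or null if none is found.
      public static String baseTerm(String token) {
          Matcher m = COMPOUND.matcher(token);
          return m.matches() ? m.group(1).toLowerCase() : null;
      }

      public static void main(String[] args) {
          System.out.println(baseTerm("Windjacke"));  // jacke
          System.out.println(baseTerm("Orgelmusik")); // musik
          System.out.println(baseTerm("Kartoffel"));  // null
      }
  }

The nuts and bolts are then things like interstitial letters between
the two components and compounds that had better be left alone.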

Michael


AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
 From: Valeriy Felberg

 If you want that query jacke matches a document containing the word
 windjacke or kinderjacke, you could use a custom update processor.
 This processor could search the indexed text for words matching the
 pattern .*jacke and inject the word jacke into an additional field
 which you can search against. You would need a whole list of possible
 suffixes, of course.

Merci, Valeriy - I agree on the feasibility of such an approach. The
list would likely have to be composed of the most frequently used terms
for your specific domain.
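
Sketched against the Solr update-processor API, I imagine something
like this (untested; the field names and the suffix list are made up,
and the UpdateRequestProcessorFactory plus solrconfig.xml wiring are
omitted):

  import java.io.IOException;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;

  public class BaseTermUpdateProcessor extends UpdateRequestProcessor {

      // Made-up, domain-specific list of head nouns to look for.
      private static final String[] SUFFIXES = { "jacke", "hose", "schuh" };

      public BaseTermUpdateProcessor(UpdateRequestProcessor next) {
          super(next);
      }

      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          Object name = doc.getFieldValue("name");
          if (name != null) {
              String token = name.toString().toLowerCase();
              for (String suffix : SUFFIXES) {
                  // Windjacke ends with jacke but is longer than it,
                  // so we take it for a compound and inject the base term.
                  if (token.endsWith(suffix)
                          && token.length() > suffix.length()) {
                      doc.addField("base_terms", suffix);
                  }
              }
          }
          super.processAdd(cmd); // pass the document down the chain
      }
  }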

In our case, it's things people would buy in shops. Reducing overly
complicated and convoluted product descriptions to proper basic terms -
that would do the job. It's like a restaurant boasting fancy but
unintelligible names for dishes that are really just ordinary stuff
like pork and potatoes.

Thinking some more about it, giving sufficient boost to the attached
category data might also do the job. That would shift the burden of
supplying proper semantics to the guys doing the categorization.
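
Assuming the documents carry a category field, the query side might
then look like this with edismax (field names and weights made up):

  q=jacke&defType=edismax&qf=name^1.0+category^4.0

A document categorized under Jacke would then outrank documents that
merely mention the word somewhere in their description.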

 It would slow down the update process but you don't need to split
 words during search.

  On 12 Apr 2012, at 11:52, Michael Ludwig wrote:
 
  Given an input of Windjacke (probably wind jacket in English),
  I'd like the code that prepares the data for the index (tokenizer
  etc) to understand that this is a Jacke (jacket) so that a
  query for Jacke would include the Windjacke document in its
  result set.

A query for Windjacke or Kinderjacke would probably not have to be
de-specialized to Jacke because, well, that's the user input and users
looking for specific things are probably doing so for a reason. If no
matches are found you can still tell them to just broaden their search.

Michael


AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
 From: Markus Jelsma

 We've done a lot of tests with the HyphenationCompoundWordTokenFilter
 using an FOP XML file generated from TeX hyphenation patterns for the
 Dutch language, and have seen decent results. A bonus was that now some
 tokens can be
 stemmed properly because not all compounds are listed in the
 dictionary for the HunspellStemFilter.

Thank you for pointing me to these two filter classes.
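
For the record, wired into a Solr schema.xml I suppose this would look
roughly like the following (file names are placeholders for the
TeX-derived hyphenation grammar and the word lists):

  <fieldType name="text_de_compound" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- file names below are placeholders -->
      <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
              hyphenator="hyph_de.xml"
              dictionary="compound-words-de.txt"
              minWordSize="5" minSubwordSize="4"
              onlyLongestMatch="true"/>
      <filter class="solr.HunspellStemFilterFactory"
              dictionary="de_DE.dic" affix="de_DE.aff"
              ignoreCase="true"/>
    </analyzer>
  </fieldType>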

 It does introduce a recall/precision problem but it at least returns
 results for the many users who do not properly use compounds in
 their search query.

Could you define what the term recall should be taken to mean in this
context? I've also encountered it on the BASIStech website. Okay, I
found a definition:

http://en.wikipedia.org/wiki/Precision_and_recall
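
For anyone else reading along: with tp true positives, fp false
positives and fn false negatives,

  precision = tp / (tp + fp)
  recall    = tp / (tp + fn)

Decompounding buys recall - Windjacke becomes findable under Jacke -
at the price of precision whenever a split goes wrong and pulls in
documents that have nothing to do with the query.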

Dank je wel!

Michael


AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
 From: Walter Underwood

 German noun decompounding is a little more complicated than it might
 seem.
 
 There can be transformations or inflections, like the s in
 Weihnachtsbaum (Weihnachten/Baum).

I remember from my linguistics studies that the terminus technicus for
these is Fugenmorphem (interstitial or joint morpheme). But there aren't
many of them - phrased as a regex, it's /e?[ns]/. The Weihnachtsbaum
in the example above is formed from the singular (die Weihnacht), then s,
then Baum. Still, it's much more complex than, say, English or Italian.

 Internal nouns should be recapitalized, like Baum above.

Casing won't matter for indexing, I think. The way I would go about
obtaining stems from compound words is by using a dictionary of stems
and a regex. We'll see how far that'll take us.

 Some compounds probably should not be decompounded, like Fahrrad
 (fahren/Rad). With a dictionary-based stemmer, you might decide to
 avoid decompounding for words in the dictionary.

Good point.

 Note that highlighting gets pretty weird when you are matching only
 part of a word.

Guess it'll be weird when you get it wrong, like Noten in
Notentriegelung.

 Luckily, a lot of compounds are simple, and you could well get a
 measurable improvement with a very simple algorithm. There isn't
 anything complicated about compounds like Orgelmusik or
 Netzwerkbetreuer.

Exactly.

 The Basis Technology linguistic analyzers aren't cheap or small, but
 they work well.

We will consider our needs and options. Thanks for your thoughts.

Michael


Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht

On 12 Apr 2012, at 17:46, Michael Ludwig wrote:
 Some compounds probably should not be decompounded, like Fahrrad
 (fahren/Rad). With a dictionary-based stemmer, you might decide to
 avoid decompounding for words in the dictionary.
 
 Good point.

More or less: Fahrrad is generally abbreviated to Rad
(even though Rad can mean both wheel and bike).

 Note that highlighting gets pretty weird when you are matching only
 part of a word.
 
 Guess it'll be weird when you get it wrong, like Noten in
 Notentriegelung.

This decomposition should not happen because Noten-triegelung does not
have a valid second component.

 The Basis Technology linguistic analyzers aren't cheap or small, but
 they work well.
 
 We will consider our needs and options. Thanks for your thoughts.

My question remains which domain it aims to cover.
We had such a need for mathematics texts... I would be pleasantly
surprised if, for example, Differenzen-quotient were decompounded.

paul

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
On Apr 12, 2012, at 8:46 AM, Michael Ludwig wrote:

 I remember from my linguistics studies that the terminus technicus for
 these is Fugenmorphem (interstitial or joint morpheme). 

That is some excellent linguistic jargon. I'll file that with hapax legomenon.

If you don't highlight, you can get good results with pretty rough analyzers, 
but highlighting exposes their mistakes, even when those don't affect
relevance. For
example, you can get good relevance just indexing bigrams in Chinese, but it 
looks awful when you highlight them. As soon as you highlight, you need a 
dictionary-based segmenter.

wunder
--
Walter Underwood
wun...@wunderwood.org


Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Markus Jelsma
On Thursday 12 April 2012 18:00:14 Paul Libbrecht wrote:
 On 12 Apr 2012, at 17:46, Michael Ludwig wrote:
  Some compounds probably should not be decompounded, like Fahrrad
  (fahren/Rad). With a dictionary-based stemmer, you might decide to
  avoid decompounding for words in the dictionary.
  
  Good point.
 
 More or less: Fahrrad is generally abbreviated to Rad
 (even though Rad can mean both wheel and bike).
 
  Note that highlighting gets pretty weird when you are matching only
  part of a word.
  
  Guess it'll be weird when you get it wrong, like Noten in
  Notentriegelung.
 
 This decomposition should not happen because Noten-triegelung does not
 have a valid second component.
 
  The Basis Technology linguistic analyzers aren't cheap or small, but
  they work well.
  
  We will consider our needs and options. Thanks for your thoughts.
 
 My question remains which domain it aims to cover.
 We had such a need for mathematics texts... I would be pleasantly
 surprised if, for example, Differenzen-quotient were decompounded.

The HyphenationCompoundWordTokenFilter can do those things, but the words
must be listed in the dictionary or you'll get strange results. It still
yields strange results when it emits tokens that are subwords of a subword.

 
 paul

-- 
Markus Jelsma - CTO - Openindex


Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
On Apr 12, 2012, at 9:00 AM, Paul Libbrecht wrote:

 More or less, Fahrrad is generally abbreviated as Rad.
 (even though Rad can mean wheel and bike)

A synonym could handle this, since fahren would not be a good match. It is
a judgement call, but this seems more like an equivalence (Fahrrad = Rad)
than decompounding.
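
In Solr terms that would be a line in synonyms.txt, applied through a
solr.SynonymFilterFactory in the analysis chain (just a sketch; whether
to map one way or both ways is a judgement call too):

  Fahrrad, Rad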

wunder
--
Walter Underwood
wun...@wunderwood.org