From: Tomas Zerolo
There can be transformations or inflections, like the s in
Weihnachtsbaum (Weihnachten/Baum).
I remember from my linguistics studies that the terminus technicus
for these is Fugenmorphem (interstitial or joint morpheme) [...]
IANAL (I am not a linguist -- pun
Given an input of Windjacke (probably wind jacket in English),
I'd like the code that prepares the data for the index (tokenizer,
etc.) to understand that this is a Jacke (jacket) so that a
query for Jacke would include the Windjacke document in its
result set.
It appears to me that such an
From: Valeriy Felberg
If you want the query jacke to match a document containing the word
windjacke or kinderjacke, you could use a custom update processor.
This processor could search the indexed text for words matching the
pattern .*jacke and inject the word jacke into an additional field
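
A minimal sketch of that idea, written in plain Python rather than as an
actual Solr update processor. The names HEAD_WORDS, compound_heads,
add_heads_field and the field name text_heads are hypothetical and not
part of any Solr API; the sketch only shows the matching and injection
step on a document represented as a dict.

```python
import re

# Hypothetical list of head words to look for; assumed to be curated by hand.
HEAD_WORDS = ["jacke"]


def compound_heads(text, head_words=HEAD_WORDS):
    """Return each head word that ends at least one token of the text."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    for head in head_words:
        # Same pattern as in the post: anything followed by the head word.
        pattern = re.compile(r".*" + re.escape(head))
        if any(pattern.fullmatch(token) for token in tokens):
            found.append(head)
    return found


def add_heads_field(doc, source="text", target="text_heads"):
    """Return a copy of doc with the matched head words in an extra field."""
    enriched = dict(doc)
    enriched[target] = compound_heads(doc.get(source, ""))
    return enriched
```

A query for jacke against the extra field would then find documents whose
text contains windjacke or kinderjacke. The obvious limitation is that the
head words have to be known in advance.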
From: Markus Jelsma
We've done a lot of tests with the HyphenationCompoundWordTokenFilter
using a FOP XML file generated from TeX for the Dutch language and
have seen decent results. A bonus was that now some tokens can be
stemmed properly because not all compounds are listed in the
From: Walter Underwood
German noun decompounding is a little more complicated than it might
seem.
There can be transformations or inflections, like the s in
Weihnachtsbaum (Weihnachten/Baum).
I remember from my linguistics studies that the terminus technicus for
these is Fugenmorphem
On 12 Apr 2012, at 17:46, Michael Ludwig wrote:
Some compounds probably should not be decompounded, like Fahrrad
(fahren/Rad). With a dictionary-based stemmer, you might decide to
avoid decompounding for words in the dictionary.
Good point.
More or less, Fahrrad is generally abbreviated
On Apr 12, 2012, at 8:46 AM, Michael Ludwig wrote:
I remember from my linguistics studies that the terminus technicus for
these is Fugenmorphem (interstitial or joint morpheme).
That is some excellent linguistic jargon. I'll file that with hapax legomenon.
If you don't highlight, you can get
On Thursday 12 April 2012 18:00:14 Paul Libbrecht wrote:
On 12 Apr 2012, at 17:46, Michael Ludwig wrote:
Some compounds probably should not be decompounded, like Fahrrad
(fahren/Rad). With a dictionary-based stemmer, you might decide to
avoid decompounding for words in the dictionary.
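
The two points raised so far (the linking s and words that should stay
whole) can be sketched together as a tiny dictionary-based splitter. This
is an illustration in plain Python, not how any Lucene/Solr filter is
implemented. LEXICON, PROTECTED, LINKERS and decompound are hypothetical
names, the word lists are toy data, and only a two-part split with an
optional s is attempted.

```python
# Toy lexicon of known parts; "weihnacht" and "fahr" are assumed entries.
LEXICON = {"wind", "jacke", "kinder", "weihnacht", "baum", "fahr", "rad"}
# Words that are in the dictionary as a whole and should not be split.
PROTECTED = {"fahrrad"}
# Linking elements to try between the parts: none, or the linking s.
LINKERS = ("", "s")


def decompound(word, lexicon=LEXICON, protected=PROTECTED):
    """Split word into [modifier, head] if both parts are known, else [word]."""
    w = word.lower()
    if w in protected or w in lexicon:
        return [w]
    for i in range(1, len(w)):
        left, head = w[:i], w[i:]
        if head not in lexicon:
            continue
        for linker in LINKERS:
            if not left.endswith(linker):
                continue
            # Strip the linking element before looking up the modifier.
            modifier = left[:len(left) - len(linker)]
            if modifier in lexicon:
                return [modifier, head]
    return [w]
```

With these lists, Windjacke splits into wind and jacke, Weihnachtsbaum into
weihnacht and baum (the s is dropped), and Fahrrad is left alone because it
is protected.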
On Apr 12, 2012, at 9:00 AM, Paul Libbrecht wrote:
More or less, Fahrrad is generally abbreviated as Rad.
(even though Rad can mean wheel and bike)
A synonym could handle this, since fahren would not be a good match. It is
a judgment call, but this seems more like an equivalence Fahrrad = Rad
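
As a rough illustration of treating Fahrrad and Rad as equivalent rather
than decompounding, here is a minimal synonym-expansion sketch in plain
Python. SYNONYM_GROUPS and expand are hypothetical names, and the single
group below is only the example from this thread.

```python
# Each set is one group of terms treated as equivalent (toy data).
SYNONYM_GROUPS = [{"fahrrad", "rad"}]


def expand(term, groups=SYNONYM_GROUPS):
    """Return all terms equivalent to term, or just the term itself."""
    t = term.lower()
    for group in groups:
        if t in group:
            return sorted(group)
    return [t]
```

A query for Rad would then also search for Fahrrad and vice versa, at the
cost of matching Rad in its wheel sense as well.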