Jarek Zgoda wrote:
Message written on 2008-10-16 at 16:21 by Grant Ingersoll:
I'm trying to create a search facility for documents in "broken"
Polish (by broken I mean "not language rules compliant"),
Can you explain what you mean here a bit more? I don't know Polish,
Hi guys,
I do speak Polish :) maybe I can help here a bit.
Some documents (around 15% of the whole pile) contain text entered by
primary school children, which implies many syntactic and
orthographic errors.
document text: "włatcy móch" (in proper Polish this would be "władcy
much")
example terms that should match: "włatcy much", "wlatcy moch",
"wladcy much"
These examples can be classified as "sounds like", and typically
soundexing algorithms are used to address this problem, in order to
generate initial suggestions. After that you can use other heuristic
rules to select the most probable correct forms.
AFAIK, there are no (public) soundex implementations for Polish, in
particular in Java, although there was some research work done on the
construction of a specifically Polish soundex. You could also use the
Daitch-Mokotoff soundex, which comes close enough.
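
To make the "sounds like" idea concrete, here is a minimal sketch of a phonetic key in Python. It is neither Daitch-Mokotoff nor any published Polish soundex: the rules (treating "ch" as "h", merging voiced and voiceless consonants, dropping vowels) and the names strip_diacritics and phonetic_key are assumptions picked only so that the example terms above collapse to the same key.

```python
import unicodedata

# Hypothetical, hand-picked rules for illustration only.
# This is NOT Daitch-Mokotoff and NOT a published Polish soundex.
_NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})
_VOICED_TO_VOICELESS = str.maketrans("bdgwz", "ptkfs")
_VOWELS = set("aeiouy")


def strip_diacritics(text):
    """Remove Polish diacritics; 'ł' has no Unicode decomposition, so map it by hand."""
    decomposed = unicodedata.normalize("NFD", text.translate(_NO_DECOMPOSITION))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def phonetic_key(word):
    """Reduce a word to a crude consonant skeleton."""
    w = strip_diacritics(word.lower())
    w = w.replace("ch", "h")               # "ch" and "h" sound the same
    w = w.translate(_VOICED_TO_VOICELESS)  # "d" and "t" fall into one class, etc.
    return "".join(ch for ch in w if ch not in _VOWELS)


for term in ["włatcy", "władcy", "wlatcy", "wladcy"]:
    print(term, phonetic_key(term))        # each prints the key "fltc"

for term in ["móch", "much", "moch", "mech"]:
    print(term, phonetic_key(term))        # each prints the key "mh"
```

Note that "mech" (moss) gets the same key as "much", so even this toy key already produces a false positive that a later step has to reject.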
Taking the word "włatcy" from my example, I'd like to find documents
containing the words
"wlatcy" (Latin-2 diacritics stripped from the original),
This step is trivial.
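
A minimal sketch of that stripping step in Python, assuming Unicode input (the function name strip_diacritics is made up for this example). One catch worth knowing: "ł" has no canonical decomposition, so Unicode normalization alone leaves it untouched and it needs an explicit mapping.

```python
import unicodedata

# "ł"/"Ł" do not decompose under NFD, so they are mapped explicitly.
_NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})


def strip_diacritics(text):
    """Return text with Polish diacritics replaced by their base letters."""
    # NFD splits e.g. "ó" into "o" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text.translate(_NO_DECOMPOSITION))
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("włatcy móch"))  # wlatcy moch
print(strip_diacritics("władcy"))       # wladcy
```

Applying the same function at index time and at query time makes the accented and unaccented spellings of a word match each other.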
"władcy" (the proper form of this noun) and "wladcy" (Latin-2
accents stripped from the proper form).
And this one is not. It requires using something like soundexing in
order to look up possible similar terms. However ... in this process you
inevitably collect false positives, and you don't have any way in the
input text to determine that they should be rejected. You can only make
this decision based on some external knowledge of Polish, such as:
* a morpho-syntactic analyzer that will determine which combinations of
suggestions are more correct and more probable,
* a language model that for any given soundexed phrase can generate the
most probable original phrases.
Also, knowing the context in which a query is asked may help, but
usually you don't have this information (queries are short).
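
As a rough illustration of the language model idea, the sketch below ranks combinations of per-word suggestions by how often the words and word pairs occur in a reference corpus. Everything here is an assumption made for the example: the function names, the four-phrase corpus (invented for the example, not real data), and the scoring, which is a plain product of counts plus one rather than a proper probability model.

```python
from collections import Counter
from itertools import product


def build_counts(corpus):
    """Count single words and adjacent word pairs in a list of phrases."""
    unigrams, bigrams = Counter(), Counter()
    for phrase in corpus:
        words = phrase.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def rank_phrases(candidates_per_word, unigrams, bigrams):
    """Return every combination of candidates, highest score first."""
    def score(phrase):
        s = 1
        # Adding one to each count keeps unseen words from zeroing the product.
        for word in phrase:
            s *= unigrams[word] + 1
        for pair in zip(phrase, phrase[1:]):
            s *= bigrams[pair] + 1
        return s

    return sorted(product(*candidates_per_word), key=score, reverse=True)


# Toy corpus, invented for this example.
corpus = ["władcy much", "władcy much", "zielony mech", "władcy zamku"]
unigrams, bigrams = build_counts(corpus)

# Suggestions for each query word, e.g. as produced by a phonetic lookup.
candidates = [["władcy", "włatcy"], ["much", "mech"]]
best = rank_phrases(candidates, unigrams, bigrams)[0]
print(" ".join(best))  # władcy much
```

With a realistic corpus the same structure lets the pair "władcy much" win over "władcy mech" even though both second words sound alike.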
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com