Jarek Zgoda wrote:
Message written on 2008-10-16, at 16:21, by Grant Ingersoll:

I'm trying to create a search facility for documents in "broken" Polish (by broken I mean "not language rules compliant"),

Can you explain what you mean here a bit more? I don't know Polish,

Hi guys,

I do speak Polish :) maybe I can help here a bit.


Some documents (around 15% of the whole pile) contain text entered by primary-school children, and that implies many syntactic and orthographic errors.

document text: "włatcy móch" (in proper Polish this would be "władcy much") example terms that should match: "włatcy much", "wlatcy moch", "wladcy much"

These examples can be classified as "sounds like", and typically soundexing algorithms are used to address this problem, in order to generate initial suggestions. After that you can use other heuristic rules to select the most probable correct forms.

AFAIK, there are no (public) soundex implementations for Polish, in particular in Java, although there was some research work done on the construction of a specifically Polish soundex. You could also use the Daitch-Mokotoff soundex, which comes close enough.
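To make the "sounds like" idea concrete, here is a minimal sketch in Python of a toy phonetic key. The rules (folding accents, collapsing a few digraphs, devoicing consonants, dropping vowels) and the names `fold`, `toy_key` and `phrase_key` are my own illustrative assumptions; this is not an established Polish soundex and not Daitch-Mokotoff, just enough to show how the example terms can collapse to one key.

```python
import unicodedata

# Hypothetical, deliberately crude rules; NOT an established Polish soundex.
# "ł" has no Unicode decomposition, so it needs an explicit mapping.
_NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})
# Digraphs collapsed to a single letter (assumed, simplified).
_DIGRAPHS = (("ch", "h"), ("rz", "z"), ("sz", "s"), ("cz", "c"))
# Voiced consonants mapped to their voiceless partners, so d/t etc. collide.
_DEVOICE = str.maketrans("bdgwz", "ptkfs")
_VOWELS = set("aeiouy")


def fold(text):
    """Lowercase the text and strip Polish diacritics."""
    text = unicodedata.normalize("NFD", text.translate(_NO_DECOMPOSITION))
    return "".join(c for c in text if not unicodedata.combining(c)).lower()


def toy_key(word):
    """Return a crude consonant-skeleton key for a single word."""
    w = fold(word)
    for src, dst in _DIGRAPHS:
        w = w.replace(src, dst)
    w = w.translate(_DEVOICE)
    return "".join(c for c in w if c not in _VOWELS)


def phrase_key(phrase):
    """Key every whitespace-separated word of a phrase."""
    return " ".join(toy_key(w) for w in phrase.split())


if __name__ == "__main__":
    for p in ("włatcy móch", "władcy much", "wlatcy moch", "wladcy much"):
        print(p, "->", phrase_key(p))
```

With these rules all four phrases get the same key, but so do unrelated words such as "mucha" and "mech", which is exactly the kind of false positive a coarse key produces.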


Taking word "włatcy" from my example, I'd like to find documents containing words

"wlatcy" (Latin-2 accents stripped from the original),

This step is trivial.
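For example, a minimal sketch in Python, assuming the text is already decoded to Unicode (the function name `strip_polish_accents` is just for illustration). The only catch is "ł", which has no Unicode decomposition and has to be mapped by hand:

```python
import unicodedata

# "ł"/"Ł" do not decompose into base letter + combining mark,
# so they are mapped explicitly before normalization.
_NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})


def strip_polish_accents(text):
    """Return text with Polish diacritics removed, e.g. 'ó' becomes 'o'."""
    decomposed = unicodedata.normalize("NFD", text.translate(_NO_DECOMPOSITION))
    # After NFD, accents are separate combining characters; drop them.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_polish_accents("włatcy móch"))
```

Applying the same folding at index time and at query time makes "włatcy" and "wlatcy" match each other.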

"władcy" (proper form of this noun) and "wladcy" (latin-2 accents stripped from proper form).

And this one is not. It requires using something like soundexing in order to look up possible similar terms. However ... in this process you inevitably collect false positives, and you don't have any way in the input text to determine that they should be rejected. You can only make this decision based on some external knowledge of Polish, such as:

* a morpho-syntactic analyzer that will determine which combinations of suggestions are more correct and more probable,

* a language model that for any given soundexed phrase can generate the most probable original phrases.

Also, knowing the context in which a query is asked may help, but usually you don't have this information (queries are short).
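The language-model option can be sketched roughly as follows, in Python. Everything here is an assumption for illustration: the counts in `UNIGRAMS` and `BIGRAMS` are invented toy numbers (a real model would be estimated from a Polish corpus), and the candidate lists are assumed to come from some earlier sound-alike lookup.

```python
import math
from itertools import product

# Invented toy counts, purely illustrative; NOT real corpus statistics.
UNIGRAMS = {"władcy": 20, "much": 8, "mech": 12, "mucha": 15}
BIGRAMS = {("władcy", "much"): 6}


def score(phrase):
    """Log-probability of a word tuple under an add-one-smoothed bigram model."""
    vocab = len(UNIGRAMS)
    total = sum(UNIGRAMS.values())
    logp = math.log((UNIGRAMS.get(phrase[0], 0) + 1) / (total + vocab))
    for prev, cur in zip(phrase, phrase[1:]):
        numerator = BIGRAMS.get((prev, cur), 0) + 1
        denominator = UNIGRAMS.get(prev, 0) + vocab
        logp += math.log(numerator / denominator)
    return logp


def rank(candidates_per_word):
    """Return every combination of candidates, most probable phrase first."""
    return sorted(product(*candidates_per_word), key=score, reverse=True)


if __name__ == "__main__":
    # Candidates per query word, as a sound-alike lookup might return them.
    ranked = rank([["władcy"], ["much", "mech", "mucha"]])
    print(" ".join(ranked[0]))
```

Enumerating all combinations is fine for short queries; for longer phrases you would want a Viterbi-style search instead.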

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
