Re: Advice on analysis/filtering?

Jarek Zgoda Thu, 16 Oct 2008 07:47:39 -0700

On 2008-10-16, at 16:21, Grant Ingersoll wrote:

I'm trying to create a search facility for documents in "broken" Polish (by broken I mean "not language rules compliant"),
Can you explain what you mean here a bit more? I don't know Polish, but most spoken languages can't be pinned down to a specific set of rules. In other words, the exception is the rule. Or, are you saying the documents use more dialog based, i.e. more informal, as in two people having a conversation?

Some documents (around 15% of the whole pile) contain text entered by primary school children, which implies many syntactic and orthographic errors. The text is indexed "as is" and Solr is able to find exact occurrences, but I'd also like to find documents that contain other variations of the errors, and the proper forms too. And oh, the system will be used by children of the same age, who tend to make similar errors when entering search terms.

searchable by terms in "broken" Polish, but broken in many other ways than documents. See this example:
document text: "włatcy móch" (in proper Polish this would be "władcy much")
example terms that should match: "włatcy much", "wlatcy moch", "wladcy much"
This double brokenness ruled out any Polish stemmers currently available for Lucene and now I am at point 0. The search results do not have to be 100% accurate - some missing results are acceptable,
but "false positives" are not.
There's no such thing in any language. In your example above, what is matching that shouldn't? Is this happening across a lot of documents, or just a few?

Yeah, I know that. By "not acceptable" I mean "not acceptable above some level". Sorry for the confusion.

Taking the word "włatcy" from my example, I'd like to find documents containing the words "wlatcy" (Latin-2 accents stripped from the original), "władcy" (the proper form of this noun) and "wladcy" (Latin-2 accents stripped from the proper form). Issue #1 (stripping accents from the original) seems to be resolvable outside Solr: I can index the texts with the accents already stripped. Issue #2 (finding the proper form of a word) is the most interesting for me. Issue #3 depends on #1 and #2.
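
To make issue #1 concrete, here is a minimal Python sketch of stripping the accents before indexing. It is only an illustration under my own assumptions, not something Solr provides: the function name strip_accents is made up for this example, and it relies on the fact that "ł" has no Unicode decomposition, so it has to be mapped by hand while the remaining Polish diacritics fall out of NFD normalization.

```python
import unicodedata

# "ł"/"Ł" do not decompose into a base letter plus a combining mark,
# so they are mapped explicitly (assumption: plain "l"/"L" is the wanted fold).
_NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})


def strip_accents(text: str) -> str:
    """Return text with Polish diacritics folded to plain ASCII letters."""
    # NFD splits e.g. "ó" into "o" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text.translate(_NO_DECOMPOSITION))
    # Drop the combining marks and keep everything else untouched.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("włatcy móch"))
    print(strip_accents("władcy much"))
```

The same folding would have to be applied to the search terms as well, otherwise a query for "włatcy" would not match the stripped "wlatcy" in the index.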

Is it at all possible using the machinery provided by Solr (I do not own a PhD in linguistics), or should I ask the business to lower their expectations?
Well, I think there are a couple of approaches:
1. You can write your own filter/stemmer/analyzer that you think fixes these issues
2. You can protect the "broken" words and not have them filtered, or filter them differently.
3. You can lower expectations.
One thing to try out is Solr's analysis tool in the admin, and see if you can get a better handle on what is going wrong.


I'll see how far I can go with the spellchecker and fuzzy searches.
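
As a rough picture of what fuzzy matching would have to do for issue #2, here is a small Python sketch of the Levenshtein edit distance, the measure that fuzzy term matching is generally based on. The names edit_distance and is_fuzzy_match and the max_edits threshold are my own assumptions for the illustration, not Solr settings.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def is_fuzzy_match(term: str, word: str, max_edits: int = 1) -> bool:
    """Hypothetical helper: accept word if it is within max_edits of term."""
    return edit_distance(term, word) <= max_edits


if __name__ == "__main__":
    # Accent-stripped forms from the example in this thread.
    print(edit_distance("wlatcy", "wladcy"))
    print(is_fuzzy_match("moch", "much"))
```

With the accents stripped first, both misspellings from my example are a single substitution away from the proper forms, so a tight threshold might keep the false positives down. Short words will still collide with unrelated terms at the same distance, so this needs checking against the real data.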

--
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
[EMAIL PROTECTED]
