Thanks to everyone for the thought, time and effort put into
AutoPhrasingTokenFilter (APTF)! It's a real lifesaver.
While trying to add APTF to my indexing, I discovered that the original (TS)
version throws an exception while indexing a 100MB PDF. The error is

Exception writing document to the index; possible analysis error

The modified (JS) version runs without error, but it removes the tokens used to
create the phrase. They are needed.
Before looking into this I have a question. Solr would normally tokenize the
phrase "the peoples republic of china is" as

the(1) peoples(2) republic(3) of(4) china(5) is(6)
Defining the APTF phrase file as

the Solr admin analysis page reports that the APTF indexer tokenizes the
phrase as

Would it be possible for someone to explain the reasoning behind the
discontinuous token numbering? As it is now
phrase queries such as "republic of china" will fail. And I can't get
proximity queries like "republic of"~10 to work either (though it seems they
should). Wouldn't it be more flexible to return the following
tokenization?

This allows spurious matches such as "peoples peoplesrepublic",
but it seems like this type of event would be very rare. It has the
advantage of allowing phrase queries to continue working the way most users
expect.
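To make the position argument concrete, here is a toy sketch in Python. It is not Lucene or APTF code: phrase_matches is a hypothetical helper that only mimics the rule that an exact phrase query needs its terms at consecutive positions. The "removed" and "stacked" token streams are my own assumptions for illustration. The first assumes the component tokens are dropped and a position gap is left behind; the second is one layout consistent with the "peoples peoplesrepublic" remark, with each phrase token stacked on the position of its last word. Neither is copied from the analysis page.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, int]  # (term, position)


def build_positions(tokens: List[Token]) -> Dict[str, List[int]]:
    """Map each term to the list of positions where it occurs."""
    index: Dict[str, List[int]] = {}
    for term, pos in tokens:
        index.setdefault(term, []).append(pos)
    return index


def phrase_matches(tokens: List[Token], phrase: str) -> bool:
    """Exact phrase match: term i must sit at start + i for some start."""
    index = build_positions(tokens)
    terms = phrase.split()
    for start in index.get(terms[0], []):
        if all((start + i) in index.get(t, []) for i, t in enumerate(terms)):
            return True
    return False


# Standard tokenization, as quoted earlier in this message.
standard = [("the", 1), ("peoples", 2), ("republic", 3),
            ("of", 4), ("china", 5), ("is", 6)]

# ASSUMPTION: component tokens removed, leaving a gap in the numbering.
removed = [("the", 1), ("peoplesrepublicofchina", 2), ("is", 6)]

# ASSUMPTION: components kept, phrase tokens stacked on the last word.
stacked = [("the", 1), ("peoples", 2), ("republic", 3),
           ("peoplesrepublic", 3), ("of", 4), ("china", 5),
           ("peoplesrepublicofchina", 5), ("is", 6)]

print(phrase_matches(standard, "republic of china"))       # True
print(phrase_matches(removed, "republic of china"))        # False
print(phrase_matches(stacked, "republic of china"))        # True
print(phrase_matches(stacked, "peoples peoplesrepublic"))  # True (spurious)
```

Under these assumptions the stacked layout keeps ordinary phrase queries working, at the cost of the rare spurious match shown on the last line.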
Thank you for supporting more than one entity definition per phrase (i.e.
peoplesrepublic and peoplesrepublicofchina). This type of contraction is
common in longer documents, especially when the first used phrase ends with
a preposition. It helps support robust matching.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4184888.html
Sent from the Solr - User mailing list archive at Nabble.com.
