Sorry, I've been away from this list for a while. When I was working with the APTF code I rewrote a big chunk of it and left out the option to keep the original tokens, as I didn't need it at the time. That feature could easily be added back in. I will see if I can find a bit of time for that.
As for the other part of your message, are you suggesting that the token indexes are not correct? The formatting of your message came through garbled, and I'm not sure what you're getting at. Can you explain further, please?

On Sun, Feb 8, 2015 at 3:04 PM, trhodesg <trhodes...@gmail.com> wrote:

> Thanks to everyone for the thought, time and effort put into
> AutoPhrasingTokenFilter (APTF)! It's a real lifesaver.
>
> While trying to add APTF to my indexing, I discovered that the original
> (TS) version throws an exception while indexing a 100MB PDF. The error is
>
> Exception writing document to the index; possible analysis error
>
> The modified (JS) version runs without error, but it removes the tokens
> used to create the phrase. They are needed.
>
> Before looking into this I have a question. Solr would normally tokenize
> the phrase
>
> the peoples republic of china is
>
> as
>
> the(1) peoples(2) republic(3) of(4) china(5) is(6)
>
> Defining the APTF phrase file as
>
> the Solr admin analysis page reports that the APTF indexer tokenizes the
> phrase as
>
> Would it be possible for someone to explain the reasoning behind the
> discontinuous token numbering? As it is now, phrase queries such as
> "republic of china" will fail. And I can't get proximity queries like
> "republic of"~10 to work either (though it seems they should). Wouldn't it
> be more flexible to return the following tokenization
>
> This allows spurious matches such as "peoples peoplesrepublic", but it
> seems like this type of event would be very rare. It has the advantage of
> allowing phrase queries to continue working the way most users think.
>
> Thank you for supporting more than one entity definition per phrase (i.e.
> peoplesrepublic and peoplesrepublicofchina). This type of contraction is
> common in longer documents, especially when the first used phrase ends
> with a preposition. It helps support robust matching.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4184888.html
> Sent from the Solr - User mailing list archive at Nabble.com.
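
To make the position question concrete, here is a minimal Python sketch of why an exact phrase query depends on consecutive token positions. It is a simplified model, not Lucene's or APTF's actual implementation: the function names are made up for illustration, the `standard` positions are the ones quoted in the message, and the `gapped` stream is a hypothetical example of discontinuous numbering, not real APTF output.

```python
from collections import defaultdict


def build_positions(tokens):
    """Map each term to the set of positions at which it occurs."""
    positions = defaultdict(set)
    for term, pos in tokens:
        positions[term].add(pos)
    return positions


def exact_phrase_match(tokens, phrase):
    """Return True if the phrase terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms:
        return False
    positions = build_positions(tokens)
    # Try every occurrence of the first term as a possible phrase start.
    return any(
        all(start + offset in positions.get(term, ())
            for offset, term in enumerate(terms))
        for start in positions.get(terms[0], ())
    )


# Positions quoted in the message for the standard tokenizer.
standard = [("the", 1), ("peoples", 2), ("republic", 3),
            ("of", 4), ("china", 5), ("is", 6)]

# Hypothetical stream with gaps between positions (NOT actual APTF output).
gapped = [("the", 1), ("peoples", 2), ("republic", 4),
          ("of", 6), ("china", 8), ("is", 9)]

print(exact_phrase_match(standard, "republic of china"))  # True
print(exact_phrase_match(gapped, "republic of china"))    # False
```

In this model, any gap between the positions of adjacent phrase terms is enough to make the exact phrase fail, which is the behavior described for "republic of china" in the quoted message.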