Re: Have anyone used Automatic Phrase Tokenization (AutoPhrasingTokenFilterFactory) ?

James Strassburg Thu, 19 Mar 2015 11:39:03 -0700

Sorry, I've been away from this list for a while. When I was
working with the APTF code I rewrote a big chunk of it and left out the
emission of the original tokens, as I didn't need it at the time. That
feature could easily be added back in. I will see if I can find a bit of
time for that.
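
To make "including the original tokens" concrete, here is a minimal sketch in
Python rather than the actual Java filter. The function name, the two example
phrases, and the choice to stack each phrase token at the position of its
first word are all assumptions for illustration, not the APTF implementation:

```python
def autophrase_keep_originals(tokens, phrases):
    """Return (term, position) pairs with 1-based positions.

    Every original token keeps its own position. Each matched phrase is
    additionally emitted as one joined token at the position of its first
    word (an assumed placement, similar to a zero position increment).
    """
    phrase_words = [p.split() for p in phrases]
    out = []
    for i, tok in enumerate(tokens):
        pos = i + 1
        out.append((tok, pos))
        for words in phrase_words:
            # Compare the window starting at this token with the phrase.
            if tokens[i:i + len(words)] == words:
                out.append(("".join(words), pos))
    return out


tokens = "the peoples republic of china is".split()
# Hypothetical phrase list, chosen to mirror the entities named in the thread.
phrases = ["peoples republic", "peoples republic of china"]

for term, pos in autophrase_keep_originals(tokens, phrases):
    print(f"{term}({pos})")
```

With this placement the original tokens stay at consecutive positions 1 to 6,
so a phrase query such as "republic of china" would still line up, while
peoplesrepublic and peoplesrepublicofchina both sit at position 2.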


As for the other part of your message, are you suggesting that the token
indexes are not correct? There is a bit of a formatting issue with the text
and I'm not sure what you're getting at. Can you explain further please?

On Sun, Feb 8, 2015 at 3:04 PM, trhodesg <trhodes...@gmail.com> wrote:

> Thanks to everyone for the thought, time and effort put into
> AutoPhrasingTokenFilter(APTF)! It's a real lifesaver.
> While trying to add APTF to my indexing, i discovered that the original
> (TS) version throws an exception while indexing a 100MB PDF. The error is
>
>   Exception writing document to the index; possible analysis error
>
> The modified (JS) version runs without error, but it removes the tokens
> used to create the phrase. They are needed.
> Before looking into this i have a question; Solr would normally tokenize
> the phrase
>
>   the peoples republic of china is
>
> as
>
>   the(1) peoples(2) republic(3) of(4) china(5) is(6)
>
> Defining the APTF phrase file as
>
> the Solr admin analysis page reports that the APTF indexer tokenizes the
> phrase as
>
> Would it be possible for someone to explain the reasoning behind the
> discontinuous token numbering? As it is now phrase queries such as
> "republic of china" will fail. And i can't get proximity queries like
> "republic of"~10 to work either (though it seems they should). Wouldn't it
> be more flexible to return the following tokenization
>
> This allows spurious matches such as "peoples peoplesrepublic" but it seems
> like this type of event would be very rare. It has the advantage of
> allowing phrase queries to continue working the way most users think.
> Thank you for supporting more than one entity definition per phrase (ie
> peoplesrepublic and peoplesrepublicofchina). This type of contraction is
> common in longer documents, especially when the first used phrase ends with
> a preposition. It helps support robust matching.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4184888.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
