The KeywordTokenizer doesn't do anything to break up the input stream; it treats the entire input to the field as a single token. So I don't think you'll be able to "extract" anything starting with that tokenizer.
Look at the admin/analysis page to see a step-by-step breakdown of what your analyzer chain does. Be sure to check the "verbose" checkbox...

Best
Erick

On Thu, Jun 9, 2011 at 12:35 PM, Adam Estrada <estrada.adam.gro...@gmail.com> wrote:
> Erick,
>
> I totally understand that, BUT the KeywordTokenizerFactory does a really
> good job extracting phrases (or what look like phrases) from my data. I
> don't know why exactly, but it does. I am going to continue working
> through it to see if I can't figure it out ;-)
>
> Adam
>
> On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> The problem here is that none of the built-in filters or tokenizers
>> have a prayer of recognizing what #you# think are phrases, since it'll
>> be unique to your situation.
>>
>> If you have a list of phrases you care about, you could substitute a
>> single token for the phrases you care about...
>>
>> But the overriding question is: what determines a phrase you're
>> interested in? Is it a list, or is there some heuristic you want to apply?
>>
>> Or could you just recognize them at query time and make them into a
>> literal phrase (i.e. with quotation marks)?
>>
>> Best
>> Erick
>>
>> On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada <estrada.adam.gro...@gmail.com> wrote:
>> > All,
>> >
>> > I am at a bit of a loss here, so any help would be greatly appreciated.
>> > I am using the DIH to grab data from a DB. The field that I am most
>> > interested in has anywhere from one word to several paragraphs' worth
>> > of free text. What I would really like to do is pull out phrases like
>> > "Joe's coffee shop" rather than the three individual words. I have
>> > tried the KeywordTokenizerFactory, and that seems to do what I want
>> > for the most part, but it is not actually tokenizing anything, so it's
>> > not creating the tokens that I need for further analysis in apps like
>> > Mahout.
>> >
>> > We can play with the combination of tokenizers and filters all day
>> > long and see what the results are after a quick reindex. I typically
>> > just view them in Solritas as facets, which may be part of the problem
>> > for me too. Does anyone have an example fieldType they can share with
>> > me that shows how to extract phrases, if they are there, from the data
>> > I described earlier? Am I even going about this the right way? I am
>> > using today's trunk build of Solr, and here is what I have munged
>> > together this morning.
>> >
>> > <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"
>> >     autoGeneratePhraseQueries="true">
>> >   <analyzer>
>> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> >     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>> >     <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
>> >     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>> >     <filter class="solr.EnglishPossessiveFilterFactory"/>
>> >     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>> >     <filter class="solr.ASCIIFoldingFilterFactory"/>
>> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>> >     <filter class="solr.TrimFilterFactory"/>
>> >   </analyzer>
>> > </fieldType>
>> >
>> > Thanks,
>> > Adam
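[Editor's note for readers of the archive] Erick's point about why this chain produces no phrases can be sketched outside Solr. The snippet below is a hypothetical Python simulation, not Lucene code: `keyword_tokenize`, `whitespace_tokenize`, and `shingles` are illustrative stand-ins that mimic what KeywordTokenizerFactory, WhitespaceTokenizerFactory, and ShingleFilterFactory roughly do. Because KeywordTokenizer emits the whole field as one token, the shingle step has nothing to combine.

```python
# Hypothetical sketch of Solr analysis behavior; these are NOT Lucene APIs.

def keyword_tokenize(text):
    # KeywordTokenizer-like: the entire input becomes a single token.
    return [text]

def whitespace_tokenize(text):
    # WhitespaceTokenizer-like: split the input on whitespace.
    return text.split()

def shingles(tokens, max_size=4, output_unigrams=True):
    # ShingleFilter-like: emit word n-grams of size 2..max_size,
    # joined by a space, optionally alongside the original unigrams.
    out = list(tokens) if output_unigrams else []
    for n in range(2, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

text = "Joe's coffee shop"

# One token in, so no multi-word shingles can form:
print(shingles(keyword_tokenize(text)))     # ["Joe's coffee shop"]

# Three tokens in, so the filter can build phrase-like shingles:
print(shingles(whitespace_tokenize(text)))
# ["Joe's", 'coffee', 'shop', "Joe's coffee", 'coffee shop', "Joe's coffee shop"]
```

This suggests the direction Erick is pointing at: if the goal is phrase-like tokens for Mahout, the shingle filter needs a tokenizer that actually splits the text (e.g. a whitespace-style tokenizer) ahead of it in the chain.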