The KeywordTokenizer doesn't do anything to break up the input stream; it treats the entire input to the field as a single token. So I don't think you'll be able to "extract" anything starting with that tokenizer.
Look at the admin/analysis page to see a step-by-step breakdown of what your analyzer chain does. Be sure to check the "verbose" checkbox...

Best
Erick

On Thu, Jun 9, 2011 at 12:35 PM, Adam Estrada <estrada.adam.gro...@gmail.com> wrote:
> Erick,
>
> I totally understand that, BUT the KeywordTokenizerFactory does a really
> good job extracting phrases (or what look like phrases) from my data. I
> don't know why exactly, but it does. I am going to continue working
> through it to see if I can't figure it out ;-)
>
> Adam
>
> On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> The problem here is that none of the built-in filters or tokenizers
>> have a prayer of recognizing what #you# think are phrases, since it'll
>> be unique to your situation.
>>
>> If you have a list of phrases you care about, you could substitute a
>> single token for the phrases you care about...
>>
>> But the overriding question is: what determines a phrase you're
>> interested in? Is it a list, or is there some heuristic you want to apply?
>>
>> Or could you just recognize them at query time and make them into a
>> literal phrase (i.e. with quotation marks)?
>>
>> Best
>> Erick
>>
>> On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada <estrada.adam.gro...@gmail.com> wrote:
>> > All,
>> >
>> > I am at a bit of a loss here, so any help would be greatly appreciated.
>> > I am using the DIH to grab data from a DB. The field that I am most
>> > interested in has anywhere from one word to several paragraphs' worth
>> > of free text. What I would really like to do is pull out phrases like
>> > "Joe's coffee shop" rather than the three individual words. I have
>> > tried the KeywordTokenizerFactory, and that seems to do what I want
>> > for the most part, but it is not actually tokenizing anything, so it's
>> > not creating the tokens that I need for further analysis in apps like
>> > Mahout.
>> >
>> > We can play with the combination of tokenizers and filters all day
>> > long and see what the results are after a quick reindex. I typically
>> > just view them in Solritas as facets, which may be part of the problem
>> > for me too. Does anyone have an example fieldType they can share with
>> > me that shows how to extract phrases, if they are there, from the data
>> > I described earlier? Am I even going about this the right way? I am
>> > using today's trunk build of Solr, and here is what I have munged
>> > together this morning.
>> >
>> > <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"
>> >     autoGeneratePhraseQueries="true">
>> >   <analyzer>
>> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> >     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>> >     <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
>> >     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>> >     <filter class="solr.EnglishPossessiveFilterFactory"/>
>> >     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>> >     <filter class="solr.ASCIIFoldingFilterFactory"/>
>> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>> >     <filter class="solr.TrimFilterFactory"/>
>> >   </analyzer>
>> > </fieldType>
>> >
>> > Thanks,
>> > Adam
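[Editor's note for readers of the archive] Erick's point about why this chain produces no phrases can be sketched outside Solr. The snippet below is a hypothetical Python simulation, not Lucene code: `keyword_tokenize`, `whitespace_tokenize`, and `shingles` are illustrative stand-ins that mimic what KeywordTokenizerFactory, WhitespaceTokenizerFactory, and ShingleFilterFactory roughly do. Because KeywordTokenizer emits the whole field as one token, the shingle step has nothing to combine.

```python
# Hypothetical sketch of Solr analysis behavior; these are NOT Lucene APIs.

def keyword_tokenize(text):
    # KeywordTokenizer-like: the entire input becomes a single token.
    return [text]

def whitespace_tokenize(text):
    # WhitespaceTokenizer-like: split the input on whitespace.
    return text.split()

def shingles(tokens, max_size=4, output_unigrams=True):
    # ShingleFilter-like: emit word n-grams of size 2..max_size,
    # joined by a space, optionally alongside the original unigrams.
    out = list(tokens) if output_unigrams else []
    for n in range(2, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

text = "Joe's coffee shop"

# One token in, so no multi-word shingles can form:
print(shingles(keyword_tokenize(text)))     # ["Joe's coffee shop"]

# Three tokens in, so the filter can build phrase-like shingles:
print(shingles(whitespace_tokenize(text)))
# ["Joe's", 'coffee', 'shop', "Joe's coffee", 'coffee shop', "Joe's coffee shop"]
```

This suggests the direction Erick is pointing at: if the goal is phrase-like tokens for Mahout, the shingle filter needs a tokenizer that actually splits the text (e.g. a whitespace-style tokenizer) ahead of it in the chain.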