Re: tokenizer of solr
Jack,

Thanks so much for this info. It's awesome.

Ming

On Thu, Apr 11, 2013 at 7:32 PM, Jack Krupansky wrote:
> In that case, use the types="wdfftypes.txt" attribute of WDF and map "@"
> and "_" to ALPHA as shown in:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>
> -- Jack Krupansky
Re: tokenizer of solr
In that case, use the types="wdfftypes.txt" attribute of WDF (the word delimiter filter) and map "@" and "_" to ALPHA as shown in:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

-- Jack Krupansky

-----Original Message-----
From: Mingfeng Yang
Sent: Thursday, April 11, 2013 8:50 PM
To: solr-user@lucene.apache.org
Subject: Re: tokenizer of solr

Looks like it's due to the word delimiter filter. Does anyone know whether the "protected" file supports regular expressions?

Ming
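For readers following along, the suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions: the field type name "text_tweet", the surrounding tokenizer and lowercase filter, and the splitOnNumerics="0" setting are not from the thread. The types file format follows the example on the wiki page linked above.

```xml
<!-- Assumed contents of wdfftypes.txt, placed in the same conf directory as schema.xml:
       @ => ALPHA
       _ => ALPHA
-->
<!-- Hypothetical field type; the name "text_tweet" is an assumption, not from the thread. -->
<fieldType name="text_tweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Whitespace tokenizer, as suggested earlier in the thread. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- types= points at the mapping file so "@" and "_" are treated as letters. -->
    <!-- splitOnNumerics="0" is an addition here, not part of the advice above: -->
    <!-- by default the filter also splits where letters meet digits. -->
    <filter class="solr.WordDelimiterFilterFactory"
            types="wdfftypes.txt"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Whether "@" should be mapped as well probably depends on whether a search for jpc_108 without the "@" is expected to match; the Analysis page in the Solr admin UI is a good place to check what tokens actually come out.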
Re: tokenizer of solr
Looks like it's due to the word delimiter filter. Does anyone know whether the "protected" file supports regular expressions?

Ming

On Thu, Apr 11, 2013 at 4:58 PM, Jack Krupansky wrote:
> Try the whitespace tokenizer.
>
> -- Jack Krupansky
Re: tokenizer of solr
Try the whitespace tokenizer.

-- Jack Krupansky

-----Original Message-----
From: Mingfeng Yang
Sent: Thursday, April 11, 2013 7:48 PM
To: solr-user@lucene.apache.org
Subject: tokenizer of solr

Dear Solr users and developers,

I am trying to index some documents, some of which are Twitter messages, and we have a problem when indexing retweets.

Say a Twitter user named "jpc_108" posts a tweet, and then someone retweets his message, so that @jpc_108 becomes part of the tweet's text body.

It seems that before indexing, Solr's tokenizer factory turns "@jpc_108" into "jpc" and "108", so when we search for jpc_108, it's not there anymore.

Is there any way we can keep "jpc_108" when it appears as "@jpc_108"?

Thanks,
Ming
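As a rough illustration of the whitespace tokenizer suggestion, a field type along these lines might be declared in schema.xml. The name "text_ws_tweet" and the lowercase filter are assumptions for the sketch, not something stated in the thread.

```xml
<!-- Hypothetical field type; the name "text_ws_tweet" is an assumption, not from the thread. -->
<fieldType name="text_ws_tweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits on whitespace only, so "@jpc_108" leaves the tokenizer as a single token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercasing is optional; included here so searches are case-insensitive. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that any filter placed after the tokenizer, such as a word delimiter filter, can still split the token again.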
tokenizer of solr
Dear Solr users and developers,

I am trying to index some documents, some of which are Twitter messages, and we have a problem when indexing retweets.

Say a Twitter user named "jpc_108" posts a tweet, and then someone retweets his message, so that @jpc_108 becomes part of the tweet's text body.

It seems that before indexing, Solr's tokenizer factory turns "@jpc_108" into "jpc" and "108", so when we search for jpc_108, it's not there anymore.

Is there any way we can keep "jpc_108" when it appears as "@jpc_108"?

Thanks,
Ming