Re: tokenizer of solr

2013-04-12 Thread Mingfeng Yang
Jack,

Thanks so much for this info.  It's awesome.

Ming


On Thu, Apr 11, 2013 at 7:32 PM, Jack Krupansky wrote:

> In that case, use the types="wdfftypes.txt" attribute of WDF and map "@"
> and "_" to ALPHA as shown in:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Mingfeng Yang
> Sent: Thursday, April 11, 2013 8:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: tokenizer of solr
>
>
> Looks like it's due to the word delimiter filter. Does anyone know whether
> the "protected" file supports regular expressions?
>
> Ming
>
>
> On Thu, Apr 11, 2013 at 4:58 PM, Jack Krupansky wrote:
>
>> Try the whitespace tokenizer.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Mingfeng Yang
>> Sent: Thursday, April 11, 2013 7:48 PM
>> To: solr-user@lucene.apache.org
>> Subject: tokenizer of solr
>>
>> Dear Solr users and developers,
>>
>> I am trying to index some documents, some of which are Twitter messages,
>> and we have a problem when indexing retweets.
>>
>> Say a Twitter user named "jpc_108" posts a tweet, and then someone retweets
>> his message, so @jpc_108 becomes part of the tweet text body.
>>
>> It seems that before indexing, Solr's tokenizer factory turns "@jpc_108"
>> into "jpc" and "108", so when we search for jpc_108, it is no longer there.
>>
>>
>> Is there any way we can keep "jpc_108" when it appears as "@jpc_108"?
>>
>> Thanks,
>> Ming-
>>
>>
>


Re: tokenizer of solr

2013-04-11 Thread Jack Krupansky
In that case, use the types="wdfftypes.txt" attribute of WDF and map "@" and 
"_" to ALPHA as shown in:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory.
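
For illustration, a minimal sketch of what such a configuration might look like in schema.xml. The field type name "text_tweet" and the surrounding analyzer chain are hypothetical and not taken from this thread; the assumed contents of wdfftypes.txt are shown inside the XML comment. Setting splitOnNumerics="0" is an additional assumption: with the default setting the filter may still split "jpc_" from "108" at the letter-to-digit boundary even after "_" is mapped to ALPHA.

```xml
<!-- Hypothetical field type; the name "text_tweet" is illustrative only. -->
<fieldType name="text_tweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so "@jpc_108" reaches the filter intact. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--
      types points to a character-type mapping file in the conf directory.
      Assumed contents of wdfftypes.txt, one mapping per line:

        @ => ALPHA
        _ => ALPHA
    -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnNumerics="0"
            types="wdfftypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With "@" mapped to ALPHA, the leading "@" should stay in the indexed token; leaving "@" out of the mapping file should let the filter drop it as a delimiter instead. The Analysis page of the Solr admin UI can be used to check which tokens each stage actually produces.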

-- Jack Krupansky

-----Original Message-----
From: Mingfeng Yang
Sent: Thursday, April 11, 2013 8:50 PM
To: solr-user@lucene.apache.org
Subject: Re: tokenizer of solr

Looks like it's due to the word delimiter filter. Does anyone know whether
the "protected" file supports regular expressions?

Ming


On Thu, Apr 11, 2013 at 4:58 PM, Jack Krupansky wrote:

Try the whitespace tokenizer.

-- Jack Krupansky

-----Original Message-----
From: Mingfeng Yang
Sent: Thursday, April 11, 2013 7:48 PM
To: solr-user@lucene.apache.org
Subject: tokenizer of solr

Dear Solr users and developers,

I am trying to index some documents, some of which are Twitter messages, and
we have a problem when indexing retweets.

Say a Twitter user named "jpc_108" posts a tweet, and then someone retweets
his message, so @jpc_108 becomes part of the tweet text body.

It seems that before indexing, Solr's tokenizer factory turns "@jpc_108"
into "jpc" and "108", so when we search for jpc_108, it is no longer there.

Is there any way we can keep "jpc_108" when it appears as "@jpc_108"?

Thanks,
Ming-





Re: tokenizer of solr

2013-04-11 Thread Mingfeng Yang
Looks like it's due to the word delimiter filter. Does anyone know whether
the "protected" file supports regular expressions?

Ming


On Thu, Apr 11, 2013 at 4:58 PM, Jack Krupansky wrote:

> Try the whitespace tokenizer.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Mingfeng Yang
> Sent: Thursday, April 11, 2013 7:48 PM
> To: solr-user@lucene.apache.org
> Subject: tokenizer of solr
>
> Dear Solr users and developers,
>
> I am trying to index some documents, some of which are Twitter messages, and
> we have a problem when indexing retweets.
>
> Say a Twitter user named "jpc_108" posts a tweet, and then someone retweets
> his message, so @jpc_108 becomes part of the tweet text body.
>
> It seems that before indexing, Solr's tokenizer factory turns "@jpc_108"
> into "jpc" and "108", so when we search for jpc_108, it is no longer there.
>
>
> Is there any way we can keep "jpc_108" when it appears as "@jpc_108"?
>
> Thanks,
> Ming-
>


Re: tokenizer of solr

2013-04-11 Thread Jack Krupansky

Try the whitespace tokenizer.
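
A minimal sketch of a field type built on the whitespace tokenizer, assuming a schema.xml-based setup. The field type name "text_ws_tweet" is hypothetical, and the lowercase filter is an optional addition not mentioned in this thread.

```xml
<!-- Hypothetical field type; the name "text_ws_tweet" is illustrative only. -->
<fieldType name="text_ws_tweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokens are split on whitespace only; punctuation such as "@" and "_" is kept. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Optional: lowercase tokens so matching is case-insensitive. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With only whitespace splitting, "@jpc_108" should be indexed as a single token that still carries the leading "@", so a query for jpc_108 without the "@" would likely not match unless a later filter removes that prefix.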

-- Jack Krupansky

-----Original Message-----
From: Mingfeng Yang 
Sent: Thursday, April 11, 2013 7:48 PM 
To: solr-user@lucene.apache.org 
Subject: tokenizer of solr 


Dear Solr users and developers,

I am trying to index some documents, some of which are Twitter messages, and
we have a problem when indexing retweets.

Say a Twitter user named "jpc_108" posts a tweet, and then someone retweets
his message, so @jpc_108 becomes part of the tweet text body.

It seems that before indexing, Solr's tokenizer factory turns "@jpc_108"
into "jpc" and "108", so when we search for jpc_108, it is no longer there.


Is there any way we can keep "jpc_108" when it appears as "@jpc_108"?

Thanks,
Ming-


tokenizer of solr

2013-04-11 Thread Mingfeng Yang
Dear Solr users and developers,

I am trying to index some documents, some of which are Twitter messages, and
we have a problem when indexing retweets.

Say a Twitter user named "jpc_108" posts a tweet, and then someone retweets
his message, so @jpc_108 becomes part of the tweet text body.

It seems that before indexing, Solr's tokenizer factory turns "@jpc_108"
into "jpc" and "108", so when we search for jpc_108, it is no longer there.


Is there any way we can keep "jpc_108" when it appears as "@jpc_108"?

Thanks,
Ming-