Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jacob Elder
On Tue, Nov 30, 2010 at 10:07 AM, Robert Muir  wrote:

> On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder  wrote:
> > Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> > past, we were using a patched version of StandardTokenizer which treated
> > @twitteruser and #hashtag better, but this became a release engineering
> > nightmare so we switched to Whitespace.
>
> in this case, have you considered using a CharFilter (e.g.
> MappingCharFilter) before the tokenizer?
>
> This way you could map your special things such as @ and # to some
> other string that the tokenizer doesn't split on,
> e.g. # => "HASH_".
>
> then your #foobar goes to HASH_foobar.
> If you want searches of "#foobar" to only match "#foobar" and not also
> "foobar" itself, and vice versa, you are done.
> Maybe you want searches of #foobar to only match #foobar, but searches
> of "foobar" to match both "#foobar" and "foobar".
> In this case, you would probably use a worddelimiterfilter w/
> preserveOriginal at index-time only, followed by a StopFilter
> containing HASH, so you index HASH_foobar and foobar.
>
> Anyway, I think you have a lot of flexibility to reuse
> standardtokenizer but customize things like this without maintaining
> your own tokenizer; this is the purpose of CharFilters.
>

That worked brilliantly. Thank you very much, Robert.

-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Robert Muir
On Wed, Dec 1, 2010 at 12:25 PM, Jacob Elder  wrote:
>
> What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the
> current stable StandardTokenizer handle CJK?
>

yes


Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jacob Elder
On Wed, Dec 1, 2010 at 11:01 AM, Robert Muir  wrote:

> (Jonathan, I apologize for emailing you twice, I meant to hit reply-all)
>
> On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind 
> wrote:
> >
> > Wait, standardtokenizer already handles CJK and will put each CJK char into
> > its own token?  Really? I had no idea!  Is that documented anywhere, or do
> > you just have to look at the source to see it?
> >
>
> Yes, you are right, the documentation should have been more explicit:
> in previous releases it doesn't say anything about how it tokenizes
> CJK. But it does tokenize CJK this way, and tags the tokens
> with the "CJ" token type.
>
> I think the documentation issue is "fixed" in branch_3x and trunk:
>
>  * As of Lucene version 3.1, this class implements the Word Break rules from the
>  * Unicode Text Segmentation algorithm, as specified in
>  * <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>.
> (from
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
> )
>
> So you can read the UAX#29 report and then you know how it tokenizes text.
> You can also just use this demo app to see how the new one works:
> http://unicode.org/cldr/utility/breaks.jsp (choose "Word")
>

What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the
current stable StandardTokenizer handle CJK?

-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Robert Muir
(Jonathan, I apologize for emailing you twice, I meant to hit reply-all)

On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind  wrote:
>
> Wait, standardtokenizer already handles CJK and will put each CJK char into
> its own token?  Really? I had no idea!  Is that documented anywhere, or do you
> just have to look at the source to see it?
>

Yes, you are right, the documentation should have been more explicit:
in previous releases it doesn't say anything about how it tokenizes
CJK. But it does tokenize CJK this way, and tags the tokens
with the "CJ" token type.

I think the documentation issue is "fixed" in branch_3x and trunk:

 * As of Lucene version 3.1, this class implements the Word Break rules from the
 * Unicode Text Segmentation algorithm, as specified in
 * <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>.
(from
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java)

So you can read the UAX#29 report and then you know how it tokenizes text.
You can also just use this demo app to see how the new one works:
http://unicode.org/cldr/utility/breaks.jsp (choose "Word")


Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jonathan Rochkind

On 11/29/2010 5:43 PM, Robert Muir wrote:

On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind  wrote:

* As a tokenizer, I use the WhitespaceTokenizer.

* Then I apply a custom filter that looks for CJK chars, and re-tokenizes
any CJK chars into one-token-per-char. This custom filter was written by
someone other than me; it is open source; but I'm not sure if it's actually
in a public repo, or how well documented it is.  I can put you in touch with
the author to try and ask. There may also be a more standard filter other
than the custom one I'm using that does the same thing?


You are describing what standardtokenizer does.



Wait, standardtokenizer already handles CJK and will put each CJK char
into its own token?  Really? I had no idea!  Is that documented
anywhere, or do you just have to look at the source to see it?


I had assumed that standardtokenizer didn't have any special handling of 
bytes known to be UTF-8 CJK, because that wasn't mentioned in the 
documentation -- but it does?   That would be convenient and not require 
my custom code.


Jonathan



Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Robert Muir
On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder  wrote:
> Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> past, we were using a patched version of StandardTokenizer which treated
> @twitteruser and #hashtag better, but this became a release engineering
> nightmare so we switched to Whitespace.

in this case, have you considered using a CharFilter (e.g.
MappingCharFilter) before the tokenizer?

This way you could map your special things such as @ and # to some
other string that the tokenizer doesn't split on,
e.g. # => "HASH_".

then your #foobar goes to HASH_foobar.
If you want searches of "#foobar" to only match "#foobar" and not also
"foobar" itself, and vice versa, you are done.
Maybe you want searches of #foobar to only match #foobar, but searches
of "foobar" to match both "#foobar" and "foobar".
In this case, you would probably use a worddelimiterfilter w/
preserveOriginal at index-time only, followed by a StopFilter
containing HASH, so you index HASH_foobar and foobar.

Anyway, I think you have a lot of flexibility to reuse
standardtokenizer but customize things like this without maintaining
your own tokenizer; this is the purpose of CharFilters.
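
For the archives, a minimal schema.xml sketch of the chain described above
might look like the following. The type name, mapping file, and stopword
file are invented for illustration, so treat it as a starting point rather
than a drop-in config:

<!-- mappings.txt (hypothetical), used by the char filter below:
     "#" => "HASH_"
     "@" => "AT_"
-->
<fieldType name="text_social" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- rewrite # and @ before tokenization so the tokenizer keeps them -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- split HASH_foobar into HASH + foobar while keeping the original -->
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
    <!-- markers.txt (hypothetical) lists HASH and AT, so the index ends up
         with HASH_foobar and foobar but not the bare marker tokens -->
    <filter class="solr.StopFilterFactory" words="markers.txt"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>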


Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Jacob Elder
Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
past, we were using a patched version of StandardTokenizer which treated
@twitteruser and #hashtag better, but this became a release engineering
nightmare so we switched to Whitespace.

Perhaps I could rephrase the question as follows:

Is there a literal configuration example of what this wiki article suggests:

http://wiki.apache.org/solr/SchemaXml#Indexing_same_data_in_multiple_fields

Further, could I then use copyFields to get those back into a single field?
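
Concretely, the kind of thing I have in mind from that page (field and type
names invented) would be:

<field name="body" type="text_ws" indexed="true" stored="true"/>
<field name="body_cjk" type="text_cjk" indexed="true" stored="false"/>

<!-- index the same raw text a second time, with the CJK analysis -->
<copyField source="body" dest="body_cjk"/>

As I understand it, copyField copies the raw source text into the destination
field before analysis, so it can fan one source out to several
differently-analyzed fields, but it can't merge already-analyzed tokens back
into one field.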

On Mon, Nov 29, 2010 at 5:39 PM, Robert Muir  wrote:

> On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder  wrote:
> > StandardTokenizer doesn't handle some of the tokens we need, like
> > @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese
> > or Korean. Am I wrong about that?
>
> it uses the unigram method for CJK ideographs... the CJKtokenizer just
> uses the bigram method; it's just an alternative method.
>
> the whitespace tokenizer doesn't work at all for CJK though, so give up on that!
>



-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Jacob Elder
+1

That's exactly what we need, too.

On Mon, Nov 29, 2010 at 5:28 PM, Shawn Heisey  wrote:

> On 11/29/2010 3:15 PM, Jacob Elder wrote:
>
>> I am looking for a clear example of using more than one tokenizer for a
>> single source field. My application has a single "body" field which until
>> recently was all Latin characters, but we're now encountering both English
>> and Japanese words in a single message. Obviously, we need to be using CJK
>> in addition to WhitespaceTokenizerFactory.
>>
>
> What I'd like to see is a CJK filter that runs after tokenization
> (whitespace in my case) and doesn't do anything but handle the CJK
> characters.  If there are no CJK characters in the token, it should do
> nothing at all.  The CJK tokenizer does a whole host of other things that I
> want to handle myself.
>
> Shawn
>
>


-- 
Jacob Elder
@jelder
(646) 535-3379


RE: Good example of multiple tokenizers for a single field

2010-11-30 Thread jan.kurella
We had the same problem for our fields, and we wrote a Tokenizer using the icu4j
library, breaking tokens at script changes and handling them according to the
script and the configured BreakIterators.
This works out very well, as we also add the "script" information to the token,
so later filters can easily process them without checking the tokens again
for being some CJK token (or Greek or Russian or Hebrew or, or, or...).

After this you can then add any filter (n-gram, dictionary segmenter) to make
your tokens better.

Jan
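
For what it's worth, the Lucene/Solr trunk and 3.x analysis-extras contrib now
ships an ICU-based tokenizer that behaves much like this (script-aware
segmentation via ICU BreakIterators, with a script attribute on each token).
A hedged sketch, assuming those contrib jars are on the classpath (it is not
part of Solr 1.4):

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- segments text per script using ICU BreakIterators -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>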

>-----Original Message-----
>From: ext Jacob Elder [mailto:jel...@locamoda.com]
>Sent: Monday, 29 November 2010 23:15
>To: solr-user@lucene.apache.org
>Subject: Good example of multiple tokenizers for a single field
>
>I am looking for a clear example of using more than one tokenizer for a
>single source field. My application has a single "body" field which until
>recently was all Latin characters, but we're now encountering both English
>and Japanese words in a single message. Obviously, we need to be using CJK
>in addition to WhitespaceTokenizerFactory.
>
>I've found some references to using copyFields or NGrams but I can't quite
>grasp what the whole solution would look like.
>
>--
>Jacob Elder
>@jelder
>(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Shawn Heisey

On 11/29/2010 3:15 PM, Jacob Elder wrote:

I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single "body" field which until
recently was all Latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in addition to WhitespaceTokenizerFactory.


What I'd like to see is a CJK filter that runs after tokenization 
(whitespace in my case) and doesn't do anything but handle the CJK 
characters.  If there are no CJK characters in the token, it should do 
nothing at all.  The CJK tokenizer does a whole host of other things 
that I want to handle myself.


Shawn



Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind  wrote:
>
> * As a tokenizer, I use the WhitespaceTokenizer.
>
> * Then I apply a custom filter that looks for CJK chars, and re-tokenizes
> any CJK chars into one-token-per-char. This custom filter was written by
> someone other than me; it is open source; but I'm not sure if it's actually
> in a public repo, or how well documented it is.  I can put you in touch with
> the author to try and ask. There may also be a more standard filter other
> than the custom one I'm using that does the same thing?
>

You are describing what standardtokenizer does.
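
In schema terms, that means the whitespace-tokenizer-plus-custom-CJK-filter
chain collapses to a single tokenizer. A minimal sketch, with an invented
type name:

<fieldType name="text_std" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer emits one token per CJK character,
         tagged with the "CJ" token type -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>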


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jonathan Rochkind
You can only use one tokenizer on a given field, I think. But a tokenizer
isn't in fact the only thing that can tokenize; an ordinary filter can
change tokenization too, so you could use two filters in a row.


You could also write your own custom tokenizer that does what you want,
although I'm not entirely sure that turning exactly what you say into
code would actually do what you want. I think it's more complicated: I
think you'll need a tokenizer that looks for contiguous blocks of bytes
that are UTF-8 CJK and does one thing to them, and contiguous blocks of
bytes that are not UTF-8 CJK and does another thing to them, rather than
just "first do one to the whole string and then do the other."


Dealing with mixed-language fields is tricky; I know of no good
general-purpose solutions, in part just because of the semantics involved.


If you have some strings for the field you know are CJK, and others you
know are English, the easiest thing to do is NOT put them in the same 
field, but put them in different fields, and use dismax (for example) to 
search both fields on query.  But if you can't even tell at index time 
which is which, or if you have strings that themselves include both CJK 
and English interspersed with each other, that might not work.
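
A sketch of that query-time side, with hypothetical field names: a dismax
handler in solrconfig.xml that searches both fields at once:

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- body and body_cjk stand in for the differently-analyzed fields -->
    <str name="qf">body body_cjk</str>
  </lst>
</requestHandler>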


For my own case, where everything is just interspersed in the fields and 
I don't really know what language it is, here's what I do, which is 
definitely not great for CJK, but is better than nothing:


* As a tokenizer, I use the WhitespaceTokenizer.

* Then I apply a custom filter that looks for CJK chars, and 
re-tokenizes any CJK chars into one-token-per-char. This custom filter 
was written by someone other than me; it is open source; but I'm not 
sure if it's actually in a public repo, or how well documented it is.  I 
can put you in touch with the author to try and ask. There may also be a 
more standard filter other than the custom one I'm using that does the 
same thing?


Jonathan

On 11/29/2010 5:30 PM, Jacob Elder wrote:

The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through CJK, then
Whitespace.

If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?

On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma
wrote:


You can use only one tokenizer per analyzer. You'd better use separate
fields + fieldTypes for different languages.


I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single "body" field which until
recently was all Latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in addition to WhitespaceTokenizerFactory.

I've found some references to using copyFields or NGrams but I can't quite
grasp what the whole solution would look like.





Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder  wrote:
> StandardTokenizer doesn't handle some of the tokens we need, like
> @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
> Korean. Am I wrong about that?

it uses the unigram method for CJK ideographs... the CJKtokenizer just
uses the bigram method; it's just an alternative method.

the whitespace tokenizer doesn't work at all for CJK though, so give up on that!
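
If the bigram method is preferred instead, the CJK tokenizer wires in the
same way. A minimal sketch, with an invented type name:

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- CJKTokenizer emits overlapping bigrams of adjacent CJK characters -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>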


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
StandardTokenizer doesn't handle some of the tokens we need, like
@twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
Korean. Am I wrong about that?

On Mon, Nov 29, 2010 at 5:31 PM, Robert Muir  wrote:

> On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder  wrote:
> > The problem is that the field is not guaranteed to contain just a single
> > language. I'm looking for some way to pass it first through CJK, then
> > Whitespace.
> >
> > If I'm totally off-target here, is there a recommended way of dealing
> > with mixed-language fields?
> >
>
> maybe you should consider a tokenizer like StandardTokenizer, which
> works reasonably well for most languages.
>



-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder  wrote:
> The problem is that the field is not guaranteed to contain just a single
> language. I'm looking for some way to pass it first through CJK, then
> Whitespace.
>
> If I'm totally off-target here, is there a recommended way of dealing with
> mixed-language fields?
>

maybe you should consider a tokenizer like StandardTokenizer, which
works reasonably well for most languages.


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through CJK, then
Whitespace.

If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?

On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma
wrote:

> You can use only one tokenizer per analyzer. You'd better use separate
> fields + fieldTypes for different languages.
>
> > I am looking for a clear example of using more than one tokenizer for a
> > single source field. My application has a single "body" field which until
> > recently was all Latin characters, but we're now encountering both English
> > and Japanese words in a single message. Obviously, we need to be using CJK
> > in addition to WhitespaceTokenizerFactory.
> >
> > I've found some references to using copyFields or NGrams but I can't quite
> > grasp what the whole solution would look like.
>



-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Markus Jelsma
You can use only one tokenizer per analyzer. You'd better use separate fields + 
fieldTypes for different languages.

> I am looking for a clear example of using more than one tokenizer for a
> single source field. My application has a single "body" field which until
> recently was all Latin characters, but we're now encountering both English
> and Japanese words in a single message. Obviously, we need to be using CJK
> in addition to WhitespaceTokenizerFactory.
> 
> I've found some references to using copyFields or NGrams but I can't quite
> grasp what the whole solution would look like.


Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single "body" field which until
recently was all Latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in addition to WhitespaceTokenizerFactory.

I've found some references to using copyFields or NGrams but I can't quite
grasp what the whole solution would look like.

-- 
Jacob Elder
@jelder
(646) 535-3379