Re: Good example of multiple tokenizers for a single field
On Tue, Nov 30, 2010 at 10:07 AM, Robert Muir wrote:
> On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder wrote:
> > Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> > past, we were using a patched version of StandardTokenizer which treated
> > @twitteruser and #hashtag better, but this became a release engineering
> > nightmare, so we switched to Whitespace.
>
> In this case, have you considered using a CharFilter (e.g.
> MappingCharFilter) before the tokenizer?
>
> This way you could map your special things such as @ and # to some
> other string that the tokenizer doesn't split on, e.g. # => "HASH_".
> Then your #foobar becomes HASH_foobar.
>
> If you want searches of "#foobar" to only match "#foobar" and not also
> "foobar" itself, and vice versa, you are done.
>
> Maybe you want searches of #foobar to only match #foobar, but searches
> of "foobar" to match both "#foobar" and "foobar". In this case, you
> would probably use a WordDelimiterFilter with preserveOriginal at
> index time only, followed by a StopFilter containing HASH, so you
> index HASH_foobar and foobar.
>
> Anyway, I think you have a lot of flexibility to reuse
> StandardTokenizer and customize things like this without maintaining
> your own tokenizer; this is the purpose of CharFilters.

That worked brilliantly. Thank you very much, Robert.

--
Jacob Elder
@jelder
(646) 535-3379
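The CharFilter approach described above maps to a schema.xml fieldType along these lines; this is only a sketch, and the fieldType name and mapping filename are invented for illustration:

```xml
<!-- Sketch only: fieldType name and mapping filename are illustrative. -->
<fieldType name="text_twitter" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Runs on the raw characters before tokenization, so the
         tokenizer never sees # or @ at all -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-twitter.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Here mapping-twitter.txt would contain lines like `"#" => "HASH_"` and `"@" => "AT_"`. Whether the tokenizer keeps `HASH_foobar` as a single token depends on how it treats underscores, so the result is worth checking in Solr's analysis page (analysis.jsp) before relying on it.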
Re: Good example of multiple tokenizers for a single field
On Wed, Dec 1, 2010 at 12:25 PM, Jacob Elder wrote:
> What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the
> current stable StandardTokenizer handle CJK?

Yes.
Re: Good example of multiple tokenizers for a single field
On Wed, Dec 1, 2010 at 11:01 AM, Robert Muir wrote:
> (Jonathan, I apologize for emailing you twice, I meant to hit reply-all)
>
> On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind wrote:
> >
> > Wait, StandardTokenizer already handles CJK and will put each CJK char
> > into its own token? Really? I had no idea! Is that documented anywhere,
> > or do you just have to look at the source to see it?
>
> Yes, you are right, the documentation should have been more explicit:
> in previous releases it doesn't say anything about how it tokenizes
> CJK. But it does tokenize them this way, and tags them with the "CJ"
> token type.
>
> I think the documentation issue is "fixed" in branch_3x and trunk:
>
> * As of Lucene version 3.1, this class implements the Word Break rules
> * from the Unicode Text Segmentation algorithm, as specified in
> * <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>.
>
> (from
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java)
>
> So you can read the UAX#29 report and then you know how it tokenizes text.
> You can also use this demo app to see how the new one works:
> http://unicode.org/cldr/utility/breaks.jsp (choose "Word")

What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the
current stable StandardTokenizer handle CJK?

--
Jacob Elder
@jelder
(646) 535-3379
Re: Good example of multiple tokenizers for a single field
(Jonathan, I apologize for emailing you twice, I meant to hit reply-all)

On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind wrote:
>
> Wait, StandardTokenizer already handles CJK and will put each CJK char
> into its own token? Really? I had no idea! Is that documented anywhere,
> or do you just have to look at the source to see it?

Yes, you are right, the documentation should have been more explicit: in
previous releases it doesn't say anything about how it tokenizes CJK. But
it does tokenize them this way, and tags them with the "CJ" token type.

I think the documentation issue is "fixed" in branch_3x and trunk:

* As of Lucene version 3.1, this class implements the Word Break rules
* from the Unicode Text Segmentation algorithm, as specified in
* <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>.

(from
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java)

So you can read the UAX#29 report and then you know how it tokenizes text.
You can also use this demo app to see how the new one works:
http://unicode.org/cldr/utility/breaks.jsp (choose "Word")
Re: Good example of multiple tokenizers for a single field
On 11/29/2010 5:43 PM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind wrote:
> > * As a tokenizer, I use the WhitespaceTokenizer.
> >
> > * Then I apply a custom filter that looks for CJK chars, and
> > re-tokenizes any CJK chars into one-token-per-char. This custom filter
> > was written by someone other than me; it is open source, but I'm not
> > sure if it's actually in a public repo, or how well documented it is.
> > I can put you in touch with the author to try and ask. There may also
> > be a more standard filter other than the custom one I'm using that
> > does the same thing?
>
> You are describing what StandardTokenizer does.

Wait, StandardTokenizer already handles CJK and will put each CJK char into
its own token? Really? I had no idea! Is that documented anywhere, or do
you just have to look at the source to see it?

I had assumed that StandardTokenizer didn't have any special handling of
characters known to be CJK, because that wasn't mentioned in the
documentation -- but it does? That would be convenient and not require my
custom code.

Jonathan
Re: Good example of multiple tokenizers for a single field
On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder wrote:
> Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> past, we were using a patched version of StandardTokenizer which treated
> @twitteruser and #hashtag better, but this became a release engineering
> nightmare, so we switched to Whitespace.

In this case, have you considered using a CharFilter (e.g.
MappingCharFilter) before the tokenizer?

This way you could map your special things such as @ and # to some other
string that the tokenizer doesn't split on, e.g. # => "HASH_". Then your
#foobar becomes HASH_foobar.

If you want searches of "#foobar" to only match "#foobar" and not also
"foobar" itself, and vice versa, you are done.

Maybe you want searches of #foobar to only match #foobar, but searches of
"foobar" to match both "#foobar" and "foobar". In this case, you would
probably use a WordDelimiterFilter with preserveOriginal at index time
only, followed by a StopFilter containing HASH, so you index HASH_foobar
and foobar.

Anyway, I think you have a lot of flexibility to reuse StandardTokenizer
and customize things like this without maintaining your own tokenizer;
this is the purpose of CharFilters.
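The second behavior described here (a bare "foobar" query matching both #foobar and foobar) might be sketched as an index-time-only analyzer chain like the following. All names are hypothetical, and the exact token splits depend on the tokenizer's underscore handling, so this would need verifying in Solr's analysis page:

```xml
<!-- Sketch only: names are illustrative, not a tested config. -->
<fieldType name="text_twitter" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-twitter.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- preserveOriginal keeps HASH_foobar while also emitting the
         split parts HASH and foobar -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            preserveOriginal="1"/>
    <!-- hash-stopwords.txt lists the bare marker token HASH -->
    <filter class="solr.StopFilterFactory" words="hash-stopwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-twitter.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

The intent is that #foobar indexes as both HASH_foobar and foobar; a query for #foobar becomes HASH_foobar and matches only the hashtag form, while a query for foobar matches both.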
Re: Good example of multiple tokenizers for a single field
Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
past, we were using a patched version of StandardTokenizer which treated
@twitteruser and #hashtag better, but this became a release engineering
nightmare, so we switched to Whitespace.

Perhaps I could rephrase the question as follows: Is there a literal
configuration example of what this wiki article suggests?

http://wiki.apache.org/solr/SchemaXml#Indexing_same_data_in_multiple_fields

Further, could I then use copyField to get those back into a single field?

On Mon, Nov 29, 2010 at 5:39 PM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder wrote:
> > StandardTokenizer doesn't handle some of the tokens we need, like
> > @twitteruser, and as far as I can tell, doesn't handle Chinese,
> > Japanese or Korean. Am I wrong about that?
>
> It uses the unigram method for CJK ideographs... the CJKTokenizer just
> uses the bigram method; it's just an alternative method.
>
> Whitespace doesn't work at all though, so give up on that!

--
Jacob Elder
@jelder
(646) 535-3379
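A literal version of what that wiki section suggests could look like the following; the field and type names are made up for illustration. Note that copyField copies the raw source text before analysis, so it can fan one input out to several differently-analyzed fields, but it cannot merge already-tokenized output back into one field; the usual pattern is to search across both fields instead.

```xml
<!-- Two analyses of the same text, via copyField (illustrative names). -->
<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="body"     type="text_ws"  indexed="true" stored="true"/>
<field name="body_cjk" type="text_cjk" indexed="true" stored="false"/>

<!-- The raw text of body is re-analyzed under text_cjk in body_cjk -->
<copyField source="body" dest="body_cjk"/>
```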
Re: Good example of multiple tokenizers for a single field
+1 That's exactly what we need, too.

On Mon, Nov 29, 2010 at 5:28 PM, Shawn Heisey wrote:
> On 11/29/2010 3:15 PM, Jacob Elder wrote:
> > I am looking for a clear example of using more than one tokenizer for
> > a single source field. My application has a single "body" field which
> > until recently was all Latin characters, but we're now encountering
> > both English and Japanese words in a single message. Obviously, we
> > need to be using CJK in addition to WhitespaceTokenizerFactory.
>
> What I'd like to see is a CJK filter that runs after tokenization
> (whitespace in my case) and doesn't do anything but handle the CJK
> characters. If there are no CJK characters in the token, it should do
> nothing at all. The CJK tokenizer does a whole host of other things that
> I want to handle myself.
>
> Shawn

--
Jacob Elder
@jelder
(646) 535-3379
RE: Good example of multiple tokenizers for a single field
We had the same problem for our fields, so we wrote a tokenizer using the
ICU4J library. It breaks tokens at script changes and handles each run
according to its script and the configured BreakIterators. This works out
very well. We also add the script information to the token, so later
filters can easily process tokens without re-checking whether each one is
CJK (or Greek, or Russian, or Hebrew, or...). After this you can then
apply any filter (n-gram, dictionary segmenter) to make your tokens
better.

Jan

> -----Original Message-----
> From: ext Jacob Elder [mailto:jel...@locamoda.com]
> Sent: Monday, 29 November 2010 23:15
> To: solr-user@lucene.apache.org
> Subject: Good example of multiple tokenizers for a single field
>
> I am looking for a clear example of using more than one tokenizer for a
> single source field. My application has a single "body" field which
> until recently was all Latin characters, but we're now encountering both
> English and Japanese words in a single message. Obviously, we need to be
> using CJK in addition to WhitespaceTokenizerFactory.
>
> I've found some references to using copyField or NGrams but I can't
> quite grasp what the whole solution would look like.
>
> --
> Jacob Elder
> @jelder
> (646) 535-3379
Re: Good example of multiple tokenizers for a single field
On 11/29/2010 3:15 PM, Jacob Elder wrote:
> I am looking for a clear example of using more than one tokenizer for a
> single source field. My application has a single "body" field which
> until recently was all Latin characters, but we're now encountering both
> English and Japanese words in a single message. Obviously, we need to be
> using CJK in addition to WhitespaceTokenizerFactory.

What I'd like to see is a CJK filter that runs after tokenization
(whitespace in my case) and doesn't do anything but handle the CJK
characters. If there are no CJK characters in the token, it should do
nothing at all. The CJK tokenizer does a whole host of other things that I
want to handle myself.

Shawn
Re: Good example of multiple tokenizers for a single field
On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind wrote:
>
> * As a tokenizer, I use the WhitespaceTokenizer.
>
> * Then I apply a custom filter that looks for CJK chars, and
> re-tokenizes any CJK chars into one-token-per-char. This custom filter
> was written by someone other than me; it is open source, but I'm not
> sure if it's actually in a public repo, or how well documented it is. I
> can put you in touch with the author to try and ask. There may also be a
> more standard filter other than the custom one I'm using that does the
> same thing?

You are describing what StandardTokenizer does.
Re: Good example of multiple tokenizers for a single field
You can only use one tokenizer on a given field, I think. But a tokenizer
isn't in fact the only thing that can tokenize; an ordinary filter can
change tokenization too, so you could use two filters in a row. You could
also write your own custom tokenizer that does what you want.

I'm not entirely sure that turning exactly what you say into code will do
what you want, though; I think it's more complicated. I think you'll need
a tokenizer that looks for contiguous runs of CJK characters and does one
thing to them, and contiguous runs of non-CJK characters and does another
thing to them, rather than just "first do one to the whole string and then
do another."

Dealing with mixed-language fields is tricky; I know of no general-purpose
good solutions, in part just because of the semantics involved. If you
have some strings for the field you know are CJK, and others you know are
English, the easiest thing to do is NOT put them in the same field, but
put them in different fields, and use dismax (for example) to search both
fields at query time. But if you can't even tell at index time which is
which, or if you have strings that themselves include both CJK and English
interspersed with each other, that might not work.

For my own case, where everything is just interspersed in the fields and I
don't really know what language it is, here's what I do, which is
definitely not great for CJK, but is better than nothing:

* As a tokenizer, I use the WhitespaceTokenizer.

* Then I apply a custom filter that looks for CJK chars, and re-tokenizes
any CJK chars into one-token-per-char. This custom filter was written by
someone other than me; it is open source, but I'm not sure if it's
actually in a public repo, or how well documented it is. I can put you in
touch with the author to try and ask. There may also be a more standard
filter other than the custom one I'm using that does the same thing?
Jonathan

On 11/29/2010 5:30 PM, Jacob Elder wrote:
> The problem is that the field is not guaranteed to contain just a single
> language. I'm looking for some way to pass it first through CJK, then
> Whitespace.
>
> If I'm totally off-target here, is there a recommended way of dealing
> with mixed-language fields?
>
> On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma wrote:
> > You can use only one tokenizer per analyzer. You'd better use separate
> > fields + fieldTypes for different languages.
> >
> > > I am looking for a clear example of using more than one tokenizer
> > > for a single source field. My application has a single "body" field
> > > which until recently was all Latin characters, but we're now
> > > encountering both English and Japanese words in a single message.
> > > Obviously, we need to be using CJK in addition to
> > > WhitespaceTokenizerFactory.
> > >
> > > I've found some references to using copyField or NGrams but I can't
> > > quite grasp what the whole solution would look like.
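The "different fields plus dismax" idea Jonathan mentions could be wired up roughly as follows; the field names, fieldType names, and handler name are hypothetical, with one fragment each from schema.xml and solrconfig.xml:

```xml
<!-- schema.xml: one field per language, both populated at index time
     (field and type names are illustrative) -->
<field name="body_en"  type="text_en"  indexed="true" stored="true"/>
<field name="body_cjk" type="text_cjk" indexed="true" stored="false"/>

<!-- solrconfig.xml: a dismax handler that searches both at query time -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">body_en body_cjk</str>
  </lst>
</requestHandler>
```

A document then scores on whichever field's analysis matches the query best, which sidesteps having to detect the language of each incoming string.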
Re: Good example of multiple tokenizers for a single field
On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder wrote:
> StandardTokenizer doesn't handle some of the tokens we need, like
> @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese
> or Korean. Am I wrong about that?

It uses the unigram method for CJK ideographs... the CJKTokenizer just
uses the bigram method; it's just an alternative method.

Whitespace doesn't work at all though, so give up on that!
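To see the unigram/bigram difference in practice, two minimal fieldTypes (illustrative names) can be compared side by side in Solr's analysis page: StandardTokenizer emits one token per CJK ideograph, while CJKTokenizer emits overlapping two-character tokens.

```xml
<!-- Unigram behavior: each CJK ideograph becomes its own token -->
<fieldType name="text_unigram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Bigram behavior: overlapping pairs of adjacent CJK characters -->
<fieldType name="text_bigram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```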
Re: Good example of multiple tokenizers for a single field
StandardTokenizer doesn't handle some of the tokens we need, like
@twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese
or Korean. Am I wrong about that?

On Mon, Nov 29, 2010 at 5:31 PM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder wrote:
> > The problem is that the field is not guaranteed to contain just a
> > single language. I'm looking for some way to pass it first through
> > CJK, then Whitespace.
> >
> > If I'm totally off-target here, is there a recommended way of dealing
> > with mixed-language fields?
>
> Maybe you should consider a tokenizer like StandardTokenizer, which
> works reasonably well for most languages.

--
Jacob Elder
@jelder
(646) 535-3379
Re: Good example of multiple tokenizers for a single field
On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder wrote:
> The problem is that the field is not guaranteed to contain just a single
> language. I'm looking for some way to pass it first through CJK, then
> Whitespace.
>
> If I'm totally off-target here, is there a recommended way of dealing
> with mixed-language fields?

Maybe you should consider a tokenizer like StandardTokenizer, which works
reasonably well for most languages.
Re: Good example of multiple tokenizers for a single field
The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through CJK, then
Whitespace.

If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?

On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma wrote:
> You can use only one tokenizer per analyzer. You'd better use separate
> fields + fieldTypes for different languages.
>
> > I am looking for a clear example of using more than one tokenizer for
> > a single source field. My application has a single "body" field which
> > until recently was all Latin characters, but we're now encountering
> > both English and Japanese words in a single message. Obviously, we
> > need to be using CJK in addition to WhitespaceTokenizerFactory.
> >
> > I've found some references to using copyField or NGrams but I can't
> > quite grasp what the whole solution would look like.

--
Jacob Elder
@jelder
(646) 535-3379
Re: Good example of multiple tokenizers for a single field
You can use only one tokenizer per analyzer. You'd better use separate
fields + fieldTypes for different languages.

> I am looking for a clear example of using more than one tokenizer for a
> single source field. My application has a single "body" field which
> until recently was all Latin characters, but we're now encountering both
> English and Japanese words in a single message. Obviously, we need to be
> using CJK in addition to WhitespaceTokenizerFactory.
>
> I've found some references to using copyField or NGrams but I can't
> quite grasp what the whole solution would look like.
Good example of multiple tokenizers for a single field
I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single "body" field which until
recently was all Latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in addition to WhitespaceTokenizerFactory.

I've found some references to using copyField or NGrams but I can't quite
grasp what the whole solution would look like.

--
Jacob Elder
@jelder
(646) 535-3379