Re: indexing Chinese language
On Mon, Feb 16, 2009 at 4:30 PM, revathy arun wrote:
> When I index Chinese content using the Chinese tokenizer and analyzer in Solr 1.3, some of the Chinese text files are getting indexed but others are not.

Are you sure your analyzer handles that content correctly? If you are not sure, you can use the Analysis link on the Solr admin page to check what it does with your text.

--
regards
j.L (I live in Shanghai, China)
Re: indexing Chinese language
First: you do not have to restart Solr. You can build a new index from the new data and then switch Solr over to it; the shell scripts that ship with Solr can help with that.

Second: you also do not have to restart Solr if you just keep the unique id the same. For example, with an old document id:1, title:hi and a new document id:1, title:welcome, simply index the new document; Solr deletes the old one and inserts the new one, like a replace, although it takes more time and resources. You can check the number of indexed documents on the Solr admin page.

On Fri, Jun 5, 2009 at 7:42 AM, Fer-Bj wrote:
> What we usually do to reindex is:
>
> 1. stop solr
> 2. rm -rf data (that is, remove everything in /opt/solr/data/)
> 3. mkdir data
> 4. start solr
> 5. start reindexing
>
> With this we're sure about not having old copies of the index.

--
regards
j.L (I live in Shanghai, China)
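A minimal SolrJ sketch of the replace-by-id behaviour j.L describes. The Solr URL, the id/title field names, and the CommonsHttpSolrServer client are assumptions for a Solr 1.3-era setup, so adjust them to your own schema and client version:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ReplaceById {
    public static void main(String[] args) throws Exception {
        // Hypothetical local Solr instance with "id" as the unique key.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Index the original document: id=1, title=hi.
        SolrInputDocument oldDoc = new SolrInputDocument();
        oldDoc.addField("id", "1");
        oldDoc.addField("title", "hi");
        server.add(oldDoc);
        server.commit();

        // Index a new document with the same unique key: id=1, title=welcome.
        // Solr deletes the old version and inserts this one -- an effective replace.
        SolrInputDocument newDoc = new SolrInputDocument();
        newDoc.addField("id", "1");
        newDoc.addField("title", "welcome");
        server.add(newDoc);
        server.commit();  // after this commit only the "welcome" document is searchable
    }
}

The same overwrite happens with a plain XML post to /update; what drives it is the uniqueKey field declared in schema.xml.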
Re: indexing Chinese language
What we usually do to reindex is:

1. stop solr
2. rm -rf data (that is, remove everything in /opt/solr/data/)
3. mkdir data
4. start solr
5. start reindexing

With this we're sure about not having old copies of the index.

To check the index size we do:

cd data
du -sh
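Not what Fer-Bj describes above, but a possible alternative worth noting: the index can also be cleared in place with a delete-by-query, so Solr never has to be stopped. A rough SolrJ sketch, assuming a local Solr at http://localhost:8983/solr:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ClearIndexInPlace {
    public static void main(String[] args) throws Exception {
        // Hypothetical local Solr instance; adjust the URL to your deployment.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Delete every document, then commit so the deletions take effect.
        server.deleteByQuery("*:*");
        server.commit();

        // ... reindex all documents here ...

        // An optimize merges segments and lets Solr drop files belonging to the
        // old index, so the data directory shrinks back down.
        server.optimize();
    }
}

The trade-off is that deleted documents and old segments stay on disk until the commit/optimize, so the directory can temporarily be larger than after a clean wipe.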
Re: indexing Chinese language
I can't tell what that analyzer does, but I'm guessing it uses n-grams? Maybe consider trying https://issues.apache.org/jira/browse/LUCENE-1629 instead?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

On Thursday, June 4, 2009, Fer-Bj wrote:
> We are trying Solr 1.3 with the Paoding Chinese Analyzer, and after reindexing the index size went from 1.5 GB to 2.7 GB.
> Is that expected behavior? Is there any switch or trick to avoid ending up with an index more than double the size?
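LUCENE-1629 is the Smart Chinese Analyzer contribution. As a rough sketch of what using it looks like, here is an example written against a current Lucene release (the attribute-based TokenStream API shown here differs from the Lucene 2.4 bundled with Solr 1.3, and the sample sentence is only an illustration):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SmartChineseDemo {
    public static void main(String[] args) throws Exception {
        // SmartChineseAnalyzer segments simplified Chinese into words
        // (dictionary plus statistical model) instead of overlapping bigrams.
        Analyzer analyzer = new SmartChineseAnalyzer();
        TokenStream ts = analyzer.tokenStream("content", new StringReader("我喜欢搜索引擎"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());  // prints one segmented word per line
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}

Because word segmentation emits far fewer tokens than overlapping bigrams, it usually produces a smaller index and more precise matches.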
Re: indexing Chinese language
Hmmm, are you quite sure that you emptied the index first and didn't just add all the documents a second time to the index?

Also, when you say the index almost doubled, were you looking only at the size of the *directory*? SOLR might have been holding a copy of the old index open while you built a new one...

Best
Erick

On Thu, Jun 4, 2009 at 2:20 AM, Fer-Bj wrote:
> We are trying Solr 1.3 with the Paoding Chinese Analyzer, and after reindexing the index size went from 1.5 GB to 2.7 GB.
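One quick way to check Erick's first question (whether the documents really were added a second time) is to compare the document count before and after the reindex. A hedged SolrJ sketch, assuming a local Solr instance:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CountDocs {
    public static void main(String[] args) throws Exception {
        // Hypothetical local Solr instance; adjust the URL to your deployment.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // A match-all query; numFound is the number of live (non-deleted) documents.
        QueryResponse rsp = server.query(new SolrQuery("*:*"));
        long numFound = rsp.getResults().getNumFound();
        System.out.println("documents in the index: " + numFound);

        // If this count doubled after the reindex, the documents were added a second
        // time under new ids; if it stayed the same, the extra disk usage is old
        // segments or deleted copies that have not been cleaned up yet.
    }
}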
Re: indexing Chinese language
We are trying Solr 1.3 with the Paoding Chinese Analyzer, and after reindexing the index size went from 1.5 GB to 2.7 GB.

Is that expected behavior?

Is there any switch or trick to avoid ending up with an index more than double the original size?
Re: indexing Chinese language
CharFilter can normalize (convert) traditional Chinese to simplified Chinese, or vice versa, if you define mapping.txt. Here is a sample of Chinese character normalization:

https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG

See SOLR-822 for the details:

https://issues.apache.org/jira/browse/SOLR-822

Koji

revathy arun wrote:
> Since Chinese has many different subtypes, such as standard Chinese, simplified Chinese, etc., which of these does the Chinese tokenizer support?
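SOLR-822 plugs a CharFilter in front of the tokenizer via schema.xml and a mapping file. Purely as an illustration of the idea, here is a sketch using the MappingCharFilter class from a current Lucene release rather than the Solr 1.3-era patch; the two traditional-to-simplified mappings are placeholders for a real mapping.txt:

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class NormalizeTraditional {
    public static void main(String[] args) throws Exception {
        // Two placeholder mappings; a real mapping.txt covers thousands of characters.
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("體", "体");  // traditional -> simplified
        builder.add("語", "语");
        NormalizeCharMap map = builder.build();

        // The char filter rewrites the character stream before the tokenizer sees it.
        Reader in = new MappingCharFilter(map, new StringReader("繁體中文語言"));
        char[] buf = new char[64];
        int len = in.read(buf);
        System.out.println(new String(buf, 0, len));  // prints the normalized text
        in.close();
    }
}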
Re: indexing Chinese language
Hi,

While some of the characters in simplified and traditional Chinese do differ, the Chinese tokenizer doesn't care - it simply creates n-gram tokens.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
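To see the n-gram behaviour Otis describes, one can print what a CJK-style analyzer emits for a short string. This sketch targets a current Lucene release (Solr 1.3 bundles an older Lucene with a different TokenStream API), and CJKAnalyzer stands in for whatever analyzer revathy is actually using:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShowBigrams {
    public static void main(String[] args) throws Exception {
        // CJKAnalyzer emits overlapping two-character tokens (bigrams),
        // regardless of whether the input is simplified or traditional Chinese.
        Analyzer analyzer = new CJKAnalyzer();
        TokenStream ts = analyzer.tokenStream("content", new StringReader("中文分词"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());  // prints 中文, 文分, 分词
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}

Every adjacent pair of characters becomes a token, so simplified and traditional text are treated exactly the same way.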
indexing Chinese language
Hi,

When I index Chinese content using the Chinese tokenizer and analyzer in Solr 1.3, some of the Chinese text files are getting indexed but others are not.

Since Chinese has many different subtypes, such as standard Chinese, simplified Chinese, etc., which of these does the Chinese tokenizer support, and is there any method to determine the type of Chinese used in a file?

Rgds