Re: indexing Chinese language

2009-06-04 Thread Fer-Bj

We are trying Solr 1.3 with the Paoding Chinese analyzer, and after reindexing
the index size went from 1.5 GB to 2.7 GB.

Is that expected behavior?

Is there any switch or trick to avoid the index file size more than doubling?

Koji Sekiguchi-2 wrote:

 [quoted message trimmed]

-- 
View this message in context: 
http://www.nabble.com/indexing-Chienese-langage-tp22033302p23864358.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: indexing Chinese language

2009-06-04 Thread Erick Erickson
Hmmm, are you quite sure that you emptied the index first and didn't just add
all the documents a second time to the index?

Also, when you say the index almost doubled, were you looking only
at the size of the *directory*? SOLR might have been holding a copy
of the old index open while you built a new one...

Best
Erick

On Thu, Jun 4, 2009 at 2:20 AM, Fer-Bj fernando.b...@gmail.com wrote:


 [quoted message trimmed]




Re: indexing Chinese language

2009-06-04 Thread Otis Gospodnetic

I can't tell what that analyzer does, but I'm guessing it uses n-grams?
Maybe consider trying https://issues.apache.org/jira/browse/LUCENE-1629 instead?

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Fer-Bj fernando.b...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, June 4, 2009 2:20:03 AM
 Subject: Re: indexing Chinese language
 
 
 [quoted message trimmed]



Re: indexing Chinese language

2009-06-04 Thread Fer-Bj

What we usually do to reindex is:

1. Stop Solr.
2. rm -r data (that is, remove everything in /opt/solr/data/).
3. mkdir data
4. Start Solr.
5. Start the reindex. With this we're sure we have no old copies of the
index.

To check the index size we do:
cd data
du -sh
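The steps above can be sketched as a small script. Whether Solr runs as a service, and the /opt/solr path, come from the post; everything else here is an assumption, and the script defaults to a scratch directory so it can be tried safely.

```shell
# Wipe-and-reindex sketch of the five steps above.
# SOLR_HOME defaults to a scratch directory for a safe dry run;
# point it at the real installation (e.g. /opt/solr) for real use.
SOLR_HOME=${SOLR_HOME:-$(mktemp -d)}

# 1. stop Solr (service name is an assumption)
# service solr stop

# 2+3. remove the old index and recreate an empty data directory
rm -rf "$SOLR_HOME/data"
mkdir -p "$SOLR_HOME/data"

# 4. start Solr again
# service solr start

# 5. kick off the reindex from the source system, then check the size:
du -sh "$SOLR_HOME/data"
```

With the old directory gone before indexing starts, the final `du -sh` measures only the newly built index.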



Otis Gospodnetic wrote:
 
 
 [quoted message trimmed]

-- 
View this message in context: 
http://www.nabble.com/indexing-Chienese-langage-tp22033302p23879730.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: indexing Chinese language

2009-06-04 Thread James liu
First: you don't have to restart Solr. You can replace the old data with the
new data and tell Solr to use the new index for searching; the shell scripts
shipped with Solr can help with this.

Second: you don't have to restart Solr if you just keep the IDs the same. For
example, with an old document id:1, title:hi and a new document id:1,
title:welcome, indexing the new data will delete the old document and insert
the new one, like a replace, but it takes more time and resources.

You can find the number of indexed documents on the Solr admin page.
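The first suggestion, building the new index next to the old one and swapping it in, might look like the sketch below. The directory layout is an assumption, and Solr still has to be told to reopen its searcher (for example via a commit or the distribution scripts) after the swap.

```shell
# Build-beside-and-swap sketch; paths default to a scratch directory.
DATA_DIR=${DATA_DIR:-$(mktemp -d)}
mkdir -p "$DATA_DIR/index"        # the live index Solr is serving
mkdir -p "$DATA_DIR/index.new"    # reindex into this directory instead

# ... run the full reindex into index.new here ...
touch "$DATA_DIR/index.new/segments.demo"   # stand-in for real index files

# Swap the new index in, keeping the old one as a rollback copy
mv "$DATA_DIR/index" "$DATA_DIR/index.old"
mv "$DATA_DIR/index.new" "$DATA_DIR/index"
```

Because the live index is never deleted in place, search keeps working until the swap, and `index.old` can be moved back if the new index is bad.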


On Fri, Jun 5, 2009 at 7:42 AM, Fer-Bj fernando.b...@gmail.com wrote:


 [quoted message trimmed]




-- 
regards
j.L ( I live in Shanghai, China)


Re: indexing Chinese language

2009-06-04 Thread James liu
On Mon, Feb 16, 2009 at 4:30 PM, revathy arun revas...@gmail.com wrote:

 Hi,

 When I index Chinese content using the Chinese tokenizer and analyzer in Solr
 1.3, some of the Chinese text files are getting indexed but others are not.


Are you sure your analyzer handles it well?

If you are not sure, you can use the analysis link on the Solr admin page to check it.



 Since Chinese has many different variants, such as Standard Chinese and
 Simplified Chinese, which of these does the Chinese tokenizer support, and
 is there any method to detect the type of Chinese from the file?

 Rgds




-- 
regards
j.L ( I live in Shanghai, China)


Re: indexing Chinese language

2009-02-17 Thread Koji Sekiguchi
CharFilter can normalize (convert) Traditional Chinese to Simplified Chinese
or vice versa, if you define mapping.txt. Here is a sample of Chinese
character normalization:


https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG

See SOLR-822 for the details:

https://issues.apache.org/jira/browse/SOLR-822

Koji
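As a sketch of how that CharFilter plugs in (the field type name and the CJK tokenizer are illustrative choices, not from the thread; MappingCharFilterFactory landed with SOLR-822, so it needs Solr 1.4-era code rather than stock 1.3):

```xml
<!-- schema.xml sketch: normalize characters via mapping.txt before
     tokenizing. The fieldType name and tokenizer are illustrative. -->
<fieldType name="text_zh" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Each line of mapping.txt holds one rule, e.g. `"廣" => "广"` to fold a traditional form to its simplified one.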


revathy arun wrote:

[quoted message trimmed]




indexing Chinese language

2009-02-16 Thread revathy arun
Hi,

When I index Chinese content using the Chinese tokenizer and analyzer in Solr
1.3, some of the Chinese text files are getting indexed but others are not.

Since Chinese has many different variants, such as Standard Chinese and
Simplified Chinese, which of these does the Chinese tokenizer support, and is
there any method to detect the type of Chinese from the file?

Rgds


Re: indexing Chinese language

2009-02-16 Thread Otis Gospodnetic
Hi,

While some of the characters in Simplified and Traditional Chinese do differ,
the Chinese tokenizer doesn't care: it simply creates n-gram tokens.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch 
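A toy sketch of the n-gram behavior described above (illustrative only, not the actual Lucene tokenizer code): each pair of adjacent CJK characters becomes one overlapping bigram token, which is also part of why n-gram indexes of Chinese text tend to be large.

```python
def cjk_bigrams(text):
    """Emit overlapping character bigrams, the way CJK n-gram
    tokenizers index Chinese text (illustrative sketch only)."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) == 1:
        return chars  # a lone character is emitted as-is
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

print(cjk_bigrams("中文分词"))  # ['中文', '文分', '分词']
```

Note that n characters yield n-1 two-character tokens, and every character (except the first and last) is stored twice, regardless of whether the input is Simplified or Traditional.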





From: revathy arun revas...@gmail.com
To: solr-user@lucene.apache.org
Sent: Monday, February 16, 2009 4:30:47 PM
Subject: indexing Chinese language

[quoted message trimmed]