Re: How to index large set data

2009-05-25 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Mon, May 25, 2009 at 10:56 AM, nk 11 nick.cass...@gmail.com wrote:
 Hello
 Interesting thread. One request, please: because I don't have much experience
 with Solr, could you please use full terms and not DIH, RES, etc.?

nk11,
DIH = DataImportHandler
RES = ?

It is unavoidable that we end up using short names out of laziness/lack of
time. But if you ever come across one, do not hesitate to ask. We will be
more than glad to clarify.

 Thanks :)

Re: How to index large set data

2009-05-24 Thread Jianbin Dai

Hi Paul,

Hope you have a great weekend so far.
I still have a couple of questions you might be able to help me with:

1. In your earlier email, you said "if possible, you can set up multiple DIH,
say /dataimport1, /dataimport2 etc., and split your files and can achieve
parallelism".
I am not sure if I understand it right. I put two requestHandlers in
solrconfig.xml, like this:

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">./data-config.xml</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport2"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">./data-config2.xml</str>
  </lst>
</requestHandler>


and created data-config.xml and data-config2.xml.
Then I ran the command:
http://host:8080/solr/dataimport?command=full-import

But only one data set (the first one) was indexed. Did I get something wrong?
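(Presumably each handler has to be triggered through its own URL for both
data sets to be imported; a sketch, using the same host and port as above:

  http://host:8080/solr/dataimport?command=full-import
  http://host:8080/solr/dataimport2?command=full-import )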


2. I noticed that after Solr indexed about 8M documents (around two hours),
it gets very, very slow. I used the top command in Linux and noticed that RES
is at 1g of memory. I did several experiments; every time RES reaches 1g, the
indexing process becomes extremely slow. Is this memory limit set by the JVM?
And how can I set the JVM memory when I use DIH through the web command
full-import?

Thanks!


JB




--- On Fri, 5/22/09, Noble Paul നോബിള്‍  नोब्ळ् noble.p...@corp.aol.com wrote:

 From: Noble Paul നോബിള്‍  नोब्ळ् noble.p...@corp.aol.com
 Subject: Re: How to index large set data
 To: Jianbin Dai djian...@yahoo.com
 Date: Friday, May 22, 2009, 10:04 PM
 On Sat, May 23, 2009 at 10:27 AM,
 Jianbin Dai djian...@yahoo.com
 wrote:
 
   Hi Paul, but in your previous post, you said there is
 already an issue for writing to Solr in multiple threads
   SOLR-1089. Do you think using solrj alone would be better
 than DIH?
 
 nope
 you will have to do indexing in multiple threads
 
  if possible, you can set up multiple DIH, say /dataimport1,
 /dataimport2 etc and split your files and can achieve
 parallelism
 
 
Re: How to index large set data

2009-05-24 Thread nk 11
Hello
Interesting thread. One request, please: because I don't have much experience
with Solr, could you please use full terms and not DIH, RES, etc.?

Thanks :)


Re: How to index large set data

2009-05-22 Thread Jianbin Dai

About 2.8M total docs were created. Only the first run finished; in my 2nd
try, it hangs there forever at the end of indexing (I guess right before
commit), with CPU usage at 100%. A total of 5GB (2050) of index files were
created. Now I have two problems:
1. Why does it hang there and fail?
2. How can I speed up the indexing?


Here is my solrconfig.xml:

    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>3000</ramBufferSizeMB>
    <mergeFactor>1000</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <unlockOnStartup>false</unlockOnStartup>





Re: How to index large set data

2009-05-22 Thread Grant Ingersoll
Can you parallelize this? I don't know that the DIH can handle it,
but having multiple threads sending docs to Solr is the best
performance-wise, so maybe you need to look at alternatives to pulling
with DIH and instead use a client to push into Solr.





--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: How to index large set data

2009-05-22 Thread Noble Paul നോബിള്‍ नोब्ळ्
There is already an issue for writing to Solr in multiple threads: SOLR-1089.
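(For reference, that issue is presumably
https://issues.apache.org/jira/browse/SOLR-1089 in the project's JIRA.)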



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: How to index large set data

2009-05-22 Thread Otis Gospodnetic

Hi,

Those settings are a little crazy.  Are you sure you want to give Solr/Lucene
3G to buffer documents before flushing them to disk?  Are you sure you want to
use a mergeFactor of 1000?  Check the logs to see if there are any errors.
Look at the index directory to see if Solr is actually still writing to it
(file sizes are changing, number of files is changing).  kill -QUIT the JVM
pid to see where things are stuck, if they are stuck...
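(A sketch of that last step, assuming a standard JDK and that the container's
stdout goes to a log file:

  jps -l          # find the Solr JVM's pid
  kill -QUIT pid  # JVM prints a full thread dump to its stdout

The dump shows what every thread is doing, e.g. whether indexing is sitting
inside a Lucene segment merge.)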


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: How to index large set data

2009-05-22 Thread Jianbin Dai

I don't know exactly how this 3G RAM buffer is used. But what I noticed was
that both the index size and the file number kept increasing, yet it was stuck
in the commit.
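(For what it's worth: ramBufferSizeMB is the amount of indexed data Lucene
buffers in RAM before flushing a new segment to disk, so a value of 3000
means a flush only happens after roughly 3GB of buffered index data.)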



Re: How to index large set data

2009-05-22 Thread Jianbin Dai

If I do the XML parsing by myself and use an embedded client to do the push,
would it be more efficient than DIH?





Re: How to index large set data

2009-05-22 Thread Otis Gospodnetic

If the file numbers and index size were increasing, that means Solr was still
working.  It's possible it's taking extra long because of such high settings.
Bring them both down and try again.  For example, don't go over 20 with
mergeFactor, and try just 1GB for ramBufferSizeMB.
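A sketch of the corresponding solrconfig.xml fragment with those moderated
values (the other settings carried over from the earlier mail):

    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>1024</ramBufferSizeMB>
    <mergeFactor>20</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>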


Good luck!

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: How to index large set data

2009-05-22 Thread Noble Paul നോബിള്‍ नोब्ळ्
No need to use an embedded SolrServer. You can use SolrJ with streaming
in multiple threads.
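
A minimal sketch of what that could look like with SolrJ's
StreamingUpdateSolrServer (the URL, field names, and queue/thread counts
below are made-up examples, not taken from this thread):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PushIndexer {
    public static void main(String[] args) throws Exception {
        // Buffers up to 100 docs and streams them to Solr
        // over 4 background threads.
        SolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

        // Parse your XML files however you like; dummy docs here.
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("text", "body of document " + i);
            server.add(doc);  // queued; worker threads push it to Solr
        }
        server.commit();      // single commit at the end
    }
}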


-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: How to index large set data

2009-05-22 Thread Jianbin Dai

Hi Paul, but in your previous post, you said there is already an issue for
writing to Solr in multiple threads, SOLR-1089. Do you think using solrj alone
would be better than DIH?
Thanks and have a good weekend!




How to index large set data

2009-05-21 Thread Jianbin Dai

Hi,

I have about 45GB of XML files to be indexed. I am using DataImportHandler. I
started the full import 4 hours ago, and it's still running.
My computer has 4GB of memory. Any suggestions on solutions?
Thanks!

JB


  



Re: How to index large set data

2009-05-21 Thread Erick Erickson
This isn't much data to go on. Do you have any idea what your throughput is?
How many documents are you indexing? One 45GB doc or 4.5 billion 10-character
docs?
Have you looked at any profiling data to see how much memory is being
consumed?
Are you IO bound or CPU bound?
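(As an aside, one way to check that last point on Linux: watch the CPU line
in top, where a high %wa (iowait) with otherwise idle CPU suggests IO bound,
while a sustained ~100% us suggests CPU bound; iostat -x 5 shows per-device
utilization if iostat is installed.)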

Best
Erick





Re: How to index large set data

2009-05-21 Thread Noble Paul നോബിള്‍ नोब्ळ्
Check the status page of DIH and see if it is working properly, and if yes,
what the rate of indexing is.
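(For reference, the status page is presumably the handler's own URL, e.g.

  http://host:8080/solr/dataimport?command=status

which reports, among other things, how many documents have been processed and
the time elapsed while an import is running.)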



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: How to index large set data

2009-05-21 Thread Jianbin Dai

Hi Paul,

Thank you so much for answering my questions. It really helped.
After some adjustment, basically setting mergeFactor to 1000 from the default
value of 10, I could finish the whole job in 2.5 hours. I checked that during
the run, only around 18% of memory is being used, and VIRT is always 1418m. I
am thinking it may be restricted by the JVM memory setting. But I run the data
import command through the web, i.e.,
http://host:port/solr/dataimport?command=full-import, so how can I set the
memory allocation for the JVM?
Thanks again!

JB






Re: How to index large set data

2009-05-21 Thread Noble Paul നോബിള്‍ नोब्ळ्
What is the total no. of docs created?  I guess it may not be memory
bound. Indexing is mostly an IO bound operation. You may be able to
get better perf if an SSD (solid state disk) is used.




-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com