RE: indexer threading?

2010-04-27 Thread Wawok, Brian
Hi Alex,

Were you ever able to get the indexing machine to do more than about 1 CPU's 
worth of work?  I am also curious how BinaryRequestWriter compares to the 
StreamingUpdateSolrServer that I was using... 
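
For concreteness, here is a minimal sketch of the two client setups I have in 
mind (written against the Solr 1.4-era SolrJ API; the URL and field names are 
placeholders, not my actual code):

    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class WriterSetup {
        public static void main(String[] args) throws Exception {
            String url = "http://localhost:8983/solr"; // placeholder URL

            // What I was using: StreamingUpdateSolrServer queues adds
            // (queue size 100 here) and drains them with 4 background
            // threads, sending XML updates over HTTP.
            StreamingUpdateSolrServer streaming =
                new StreamingUpdateSolrServer(url, 100, 4);

            // The alternative: a plain HTTP server whose updates go over
            // Solr's compact javabin wire format instead of XML.
            CommonsHttpSolrServer binary = new CommonsHttpSolrServer(url);
            binary.setRequestWriter(new BinaryRequestWriter());

            // Either server indexes documents the same way:
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");              // placeholder fields
            doc.addField("text", "hello world");
            binary.add(doc);
            binary.commit();
        }
    }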



Brian

-----Original Message-----
From: Alexey Serba [mailto:ase...@gmail.com] 
Sent: Tuesday, April 27, 2010 9:53 AM
To: solr-user@lucene.apache.org
Subject: Re: indexer threading?

Hi Brian,

I was testing indexing performance on a high-CPU box recently and ran
into the same issue. I tried different indexing methods (XML,
CSVRequestHandler, and SolrJ + BinaryRequestWriter with multiple
threads). The last method is indeed the fastest. I believe the
multi-threaded approach gives you better performance if you have
complex text analysis. I had very simple analysis (WhitespaceTokenizer
only), and the performance boost from adding threads was not very
impressive, though it was there. I guess that with simple text
analysis, overall performance comes down to synchronization overhead.
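
For reference, the multi-threaded SolrJ method was roughly the following
sketch (not my exact code; the URL, field names, and batching are
placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ThreadedIndexer {
        public static void main(String[] args) throws Exception {
            // One shared, thread-safe server using the javabin writer.
            final CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            server.setRequestWriter(new BinaryRequestWriter());

            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int t = 0; t < 4; t++) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // Each worker posts its documents in batches.
                            List<SolrInputDocument> batch =
                                new ArrayList<SolrInputDocument>();
                            for (int i = 0; i < 1000; i++) {
                                SolrInputDocument doc = new SolrInputDocument();
                                doc.addField("id",
                                    Thread.currentThread().getName() + "-" + i);
                                doc.addField("text", "document body " + i);
                                batch.add(doc);
                            }
                            server.add(batch);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            server.commit();
        }
    }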

I tried to profile the application during the indexing phase for CPU
times and monitors, and it seems that most of the blocking is on the
following methods:
- DocumentsWriter.doBalanceRAM
- DocumentsWriter.getThreadState
- SolrIndexWriter.ensureOpen

I don't know the guts of Solr/Lucene in that much detail, so I can't
draw any conclusions. Are there any configuration techniques to improve
indexing performance in a multi-threaded scenario?
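
One guess, since doBalanceRAM is what flushes the in-memory buffer: giving
the index writer a larger RAM buffer in solrconfig.xml should at least make
those flushes less frequent. Something like the following, where 256 is just
an example value (the default in this era is 32):

    <indexDefaults>
      <!-- larger RAM buffer: fewer doBalanceRAM flushes -->
      <ramBufferSizeMB>256</ramBufferSizeMB>
    </indexDefaults>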

Alex

On Mon, Apr 26, 2010 at 6:52 PM, Wawok, Brian brian.wa...@cmegroup.com wrote:
 Hi,

 I was wondering how the multi-threading of the indexer works. I am 
 using SolrJ to stream documents to a server. As I add more threads on the 
 client side, I slowly see both speed and CPU usage go up on the indexer side. 
 Once I hit about 4 threads, my indexer is at 100% CPU usage (of 1 CPU on a 
 4-way box) and will not do any more work. It is pretty fast, doing something 
 like 75k lines of text per second, but I would really like to use all 4 CPUs 
 on the indexer. Is this just a limitation of Solr, or is this a limitation of 
 using SolrJ and document streaming?


 Thanks,


 Brian



indexer threading?

2010-04-26 Thread Wawok, Brian
Hi,

I was wondering how the multi-threading of the indexer works. I am using 
SolrJ to stream documents to a server. As I add more threads on the client 
side, I slowly see both speed and CPU usage go up on the indexer side. Once I 
hit about 4 threads, my indexer is at 100% CPU usage (of 1 CPU on a 4-way box) 
and will not do any more work. It is pretty fast, doing something like 75k 
lines of text per second, but I would really like to use all 4 CPUs on the 
indexer. Is this just a limitation of Solr, or is this a limitation of using 
SolrJ and document streaming?


Thanks,


Brian


RE: solr best practice to submit many documents

2010-04-08 Thread Wawok, Brian
As a follow-up for anyone who is watching...

I changed from posting 100,000 separate documents via individual posts in 
Python to streaming with the Java client (not sure if it uses XML or CSV 
behind the scenes), and my time to send the data dropped from 340 seconds to 
88 seconds.

I may try to figure out a better way to stream from Python, but for now it 
looks like I need to use the very fast Java client...


Thanks for all the ideas, guys!


Brian

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: Wednesday, April 07, 2010 8:50 PM
To: solr-user@lucene.apache.org
Subject: Re: solr best practice to submit many documents

Streaming XML input (or CSV if you can make that happen) works fine. If
the file is local, you can do a curl request that would normally upload a
file via POST, but give it this parameter: stream.file=/full/path/name.xml

Solr will then read the file locally instead of receiving it over HTTP.
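
For example (localhost and the file path are placeholders, and remote
streaming must first be enabled with enableRemoteStreaming="true" on the
requestParsers element in solrconfig.xml):

    curl 'http://localhost:8983/solr/update?stream.file=/full/path/name.xml&commit=true'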

On Wed, Apr 7, 2010 at 9:18 AM, Wawok, Brian brian.wa...@cmegroup.com wrote:
 I don't think I want to stream from Java; text munging in Java is a PITA.
 I would rather stream from a script, so I need a more general solution.

 The streaming document interface looks interesting; let me see if I can
 figure out how to achieve the same thing without a Java client...


 Brian

 -----Original Message-----
 From: Paolo Castagna [mailto:castagna.li...@googlemail.com]
 Sent: Wednesday, April 07, 2010 11:11 AM
 To: solr-user@lucene.apache.org
 Subject: Re: solr best practice to submit many documents

 Hi Brian,
 I had similar questions when I began trying to evaluate Solr.

 If you use Java and SolrJ you might find these useful:

  - http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
  - http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html

 I am also interested in knowing the best and most efficient way
 to index a large number of documents.

 Paolo

 Wawok, Brian wrote:
 Hello,

 I am using SOLR for some proof of concept work, and was wondering if anyone 
 has some guidance on a best practice.

 Background:
 Nightly we get a delivery of a few thousand reports. Each report is between 
 1 and 500,000 pages.
 For my proof of concept I am using a single 100,000-page report.
 I want to see how fast I can make SOLR handle this single report, and then 
 see how we can scale out to meet the total indexing demand (if needed).

 Trial 1:

 1)      Set up a Solr server on server A with the default settings. Added a 
 few new fields to index, including a full-text index of the report.

 2)      Set up a simple Python script on server B. It splits the report into 
 100,000 small documents, pulls out a few key fields to be sent along for 
 indexing, and uses a Python implementation of curl to shove the documents 
 into the server (with 4 threads posting away).

 3)      After all 100,000 documents are posted, we post a commit and let the 
 server index.


 I was able to get this method to work; it took around 340 seconds for 
 the posting and 10 seconds for the indexing. I am not sure if that indexing 
 speed is a red herring and it was really doing a little bit of the indexing 
 during the posts, or what.

 Regardless, it seems less than ideal to make 100,000 requests to the server 
 to index 100,000 documents. Does anyone have an idea for how to make this 
 process more efficient? Should I look into making an XML document with 
 100,000 documents enclosed? Or what will give me the best performance? Will 
 this be much better than what I am seeing with my post method? I am not 
 against writing a custom parser on the SOLR side, but if there is already a 
 way in SOLR to send many documents efficiently, that is better.


 Thanks!

 Brian Wawok






-- 
Lance Norskog
goks...@gmail.com


solr best practice to submit many documents

2010-04-07 Thread Wawok, Brian
Hello,

I am using SOLR for some proof of concept work, and was wondering if anyone has 
some guidance on a best practice.

Background:
Nightly we get a delivery of a few thousand reports. Each report is between 1 
and 500,000 pages.
For my proof of concept I am using a single 100,000-page report.
I want to see how fast I can make SOLR handle this single report, and then see 
how we can scale out to meet the total indexing demand (if needed).

Trial 1:

1)  Set up a Solr server on server A with the default settings. Added a few 
new fields to index, including a full-text index of the report.

2)  Set up a simple Python script on server B. It splits the report into 
100,000 small documents, pulls out a few key fields to be sent along for 
indexing, and uses a Python implementation of curl to shove the documents into 
the server (with 4 threads posting away).

3)  After all 100,000 documents are posted, we post a commit and let the 
server index.


I was able to get this method to work; it took around 340 seconds for the 
posting and 10 seconds for the indexing. I am not sure if that indexing speed 
is a red herring and it was really doing a little bit of the indexing during 
the posts, or what.

Regardless, it seems less than ideal to make 100,000 requests to the server to 
index 100,000 documents. Does anyone have an idea for how to make this process 
more efficient? Should I look into making an XML document with 100,000 
documents enclosed? Or what will give me the best performance? Will this be 
much better than what I am seeing with my post method? I am not against 
writing a custom parser on the SOLR side, but if there is already a way in 
SOLR to send many documents efficiently, that is better.
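
Something like the following, with many doc elements inside a single add 
command? (The field names here are just placeholders for illustration.)

    <add>
      <doc>
        <field name="id">page-1</field>
        <field name="text">...text of page one...</field>
      </doc>
      <doc>
        <field name="id">page-2</field>
        <field name="text">...text of page two...</field>
      </doc>
      <!-- ...and so on for the rest of the 100,000 documents -->
    </add>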


Thanks!

Brian Wawok



RE: solr best practice to submit many documents

2010-04-07 Thread Wawok, Brian
I don't think I want to stream from Java; text munging in Java is a PITA. I 
would rather stream from a script, so I need a more general solution.

The streaming document interface looks interesting; let me see if I can figure 
out how to achieve the same thing without a Java client...


Brian

-----Original Message-----
From: Paolo Castagna [mailto:castagna.li...@googlemail.com] 
Sent: Wednesday, April 07, 2010 11:11 AM
To: solr-user@lucene.apache.org
Subject: Re: solr best practice to submit many documents

Hi Brian,
I had similar questions when I began trying to evaluate Solr.

If you use Java and SolrJ you might find these useful:

  - http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
  - http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html

I am also interested in knowing the best and most efficient way
to index a large number of documents.

Paolo

Wawok, Brian wrote:
 Hello,
 
 I am using SOLR for some proof of concept work, and was wondering if anyone 
 has some guidance on a best practice.
 
 Background:
 Nightly we get a delivery of a few thousand reports. Each report is between 
 1 and 500,000 pages.
 For my proof of concept I am using a single 100,000-page report.
 I want to see how fast I can make SOLR handle this single report, and then 
 see how we can scale out to meet the total indexing demand (if needed).
 
 Trial 1:
 
 1)  Set up a Solr server on server A with the default settings. Added a 
 few new fields to index, including a full-text index of the report.
 
 2)  Set up a simple Python script on server B. It splits the report into 
 100,000 small documents, pulls out a few key fields to be sent along for 
 indexing, and uses a Python implementation of curl to shove the documents 
 into the server (with 4 threads posting away).
 
 3)  After all 100,000 documents are posted, we post a commit and let the 
 server index.
 
 
 I was able to get this method to work; it took around 340 seconds for the 
 posting and 10 seconds for the indexing. I am not sure if that indexing 
 speed is a red herring and it was really doing a little bit of the indexing 
 during the posts, or what.
 
 Regardless, it seems less than ideal to make 100,000 requests to the server 
 to index 100,000 documents. Does anyone have an idea for how to make this 
 process more efficient? Should I look into making an XML document with 
 100,000 documents enclosed? Or what will give me the best performance? Will 
 this be much better than what I am seeing with my post method? I am not 
 against writing a custom parser on the SOLR side, but if there is already a 
 way in SOLR to send many documents efficiently, that is better.
 
 
 Thanks!
 
 Brian Wawok