As a follow-up for anyone who is watching...

I changed from sending 100,000 separate posts in Python to streaming with the
Java client (not sure whether it's XML or CSV behind the scenes), and my time
to send the data dropped from 340 seconds to 88 seconds.

I may try to figure out a better way to stream from Python, but for now it
looks like I need to use the very fast Java client.
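
For anyone who wants to try the same thing, the Java side boils down to roughly
the following. This is a minimal sketch using SolrJ's StreamingUpdateSolrServer
(the class Paolo linked below) -- I'm assuming that is the client in question,
and the URL, queue size, thread count, and field names are placeholders, not our
actual report code:

    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class StreamReport {
        public static void main(String[] args) throws Exception {
            // Queue up to 1,000 docs and drain the queue with 4 background threads.
            StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);
            for (int page = 0; page < 100000; page++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "report-1-page-" + page);  // placeholder fields
                doc.addField("text", "page body goes here");
                server.add(doc);  // queued and streamed, not one request per doc
            }
            server.blockUntilFinished();  // wait for the queue to drain
            server.commit();              // single commit at the end
        }
    }

As far as I can tell it keeps a few connections open and streams the queued
documents over them instead of doing one HTTP request per document, which
appears to be where most of the speedup came from.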


Thanks for all the ideas, guys!


Brian

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: Wednesday, April 07, 2010 8:50 PM
To: solr-user@lucene.apache.org
Subject: Re: solr best practice to submit many documents

Streaming XML input (or CSV, if you can make that happen) works fine. If
the file is local, you can do the same curl call that would normally upload
a file via POST, but give it this parameter instead: stream.file=/full/path/name.xml

Solr will read the file locally instead of through HTTP.
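
For example, assuming the default example port and that remote streaming is
enabled in solrconfig.xml, the whole 100,000-document file (one <add> element
containing many <doc> elements) can be indexed with a single request, from curl
or from any HTTP client. A rough Java sketch of that one request -- host, port,
and file path are just placeholders:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class StreamFileUpdate {
        public static void main(String[] args) throws Exception {
            // One request to the update handler; Solr opens the file itself.
            String url = "http://localhost:8983/solr/update?commit=true"
                + "&stream.file="
                + URLEncoder.encode("/full/path/name.xml", "UTF-8")
                + "&stream.contentType="
                + URLEncoder.encode("text/xml;charset=utf-8", "UTF-8");
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            System.out.println("HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }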

On Wed, Apr 7, 2010 at 9:18 AM, Wawok, Brian <brian.wa...@cmegroup.com> wrote:
> I don't think I want to stream from Java; text munging in Java is a PITA.
> I would rather stream from a script, so I need a more general solution.
>
> The streaming document interface looks interesting; let me see if I can
> figure out how to achieve the same thing without a Java client.
>
>
> Brian
>
> -----Original Message-----
> From: Paolo Castagna [mailto:castagna.li...@googlemail.com]
> Sent: Wednesday, April 07, 2010 11:11 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr best practice to submit many documents
>
> Hi Brian,
> I had similar questions when I began to try out and evaluate Solr.
>
> If you use Java and SolrJ you might find these useful:
>
>  - http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
>  - http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html
>
> I am also interested in knowing the best and most efficient way to index
> a large number of documents.
>
> Paolo
>
> Wawok, Brian wrote:
>> Hello,
>>
>> I am using SOLR for some proof-of-concept work, and was wondering if anyone
>> has any guidance on best practices.
>>
>> Background:
>> We get a nightly delivery of a few thousand reports. Each report is between
>> 1 and 500,000 pages.
>> For my proof of concept I am using a single 100,000-page report.
>> I want to see how fast I can make SOLR handle this single report, and then
>> see how we can scale out to meet the total indexing demand (if needed).
>>
>> Trial 1:
>>
>> 1) Set up a Solr server on server A with the default settings. Added a
>> few new fields to index, including a full-text index of the report.
>>
>> 2) Set up a simple Python script on server B. It splits the report into
>> 100,000 small documents, pulls out a few key fields to send along for
>> indexing, and uses a Python implementation of curl to shove the documents
>> into the server (with 4 threads posting away).
>>
>> 3) After all 100,000 documents are posted, we send a commit and let the
>> server index.
>>
>>
>> I was able to get this method to work, and it took around 340 seconds for
>> the posting and 10 seconds for the indexing. I am not sure whether that
>> indexing speed is a red herring and it was really doing a little bit of the
>> indexing during the posts, or what.
>>
>> Regardless, it seems less than ideal to make 100,000 requests to the server
>> to index 100,000 documents. Does anyone have an idea for how to make this
>> process more efficient? Should I look into making a single XML document with
>> all 100,000 documents enclosed? What will give me the best performance, and
>> will it be much better than what I am seeing with my post method? I am not
>> against writing a custom parser on the SOLR side, but if there is already a
>> way in SOLR to send many documents efficiently, that would be better.
>>
>>
>> Thanks!
>>
>> Brian Wawok
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com
