Re: Faster Solr Indexing

2012-03-19 Thread Peyman Faratin
Hi Erick, Dmitry, and Mikhail,

Thank you all for your time. I tried all of the suggestions below and am happy
to report that indexing speeds have improved. There were several confounding
problems, including:

- a bank of ~20 regexes that were poorly optimized and recompiled at each
indexing step
- single-threaded indexing
- not using StreamingUpdateSolrServer
- excessive logging

However, the biggest bottleneck was two Lucene searches (across ~9MM docs)
performed while building each Solr document. Indexing sped up after
precomputing these values offline.
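For anyone who hits the same regex problem: the fix was essentially to compile
the patterns once instead of per document. A minimal sketch (the pattern names
are illustrative, not our actual regexes):

    import java.util.regex.Pattern;

    public class FieldCleaner {
        // Compiled once at class-load time, not on every indexing call;
        // Pattern.compile() is expensive, reusing the Pattern is cheap.
        private static final Pattern CONTROL_CHARS =
            Pattern.compile("\\p{Cntrl}+");
        private static final Pattern WHITESPACE_RUNS =
            Pattern.compile("\\s+");

        public static String clean(String raw) {
            String s = CONTROL_CHARS.matcher(raw).replaceAll(" ");
            return WHITESPACE_RUNS.matcher(s).replaceAll(" ").trim();
        }
    }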

Thank you all for your help. 

best

Peyman 




Re: Faster Solr Indexing

2012-03-12 Thread Erick Erickson
How have you determined that it's the Solr add? By timing the call on the
SolrJ side, or by looking at the machine where Solr is running? This is the
very first thing you have to answer. You can get a rough idea with any
simple profiler (say, Activity Monitor on a Mac, Task Manager on a Windows
box). The point is just to see whether the indexer machine is being
well utilized. I'd guess it's not, actually.

One quick experiment would be to try StreamingUpdateSolrServer
(SUSS), which can have multiple threads
firing at Solr at once. It is possible that much of your time is spent
waiting for I/O.
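For concreteness, a minimal SolrJ 3.x sketch of that experiment (the URL,
queue size, and thread count are illustrative):

    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SussExperiment {
        public static void main(String[] args) throws Exception {
            // Buffers up to 10000 docs and drains them with 4 background
            // threads, so add() returns quickly instead of blocking on a
            // synchronous HTTP round trip per call.
            StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(
                "http://localhost:8983/solr", 10000, 4);

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("content", "hello world");
            server.add(doc);
            server.commit();   // commit once when the batch is done
        }
    }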

Once you have that question answered, you can refine. But until you
know which side of the wire the problem is on, you're flying blind.

To both Yandong and Peyman:
These times are quite surprising. Running everything locally on my laptop,
I'm indexing between 5K and 7K documents/second. The source is
the Wikipedia dump.

I'm particularly surprised by the difference Yandong is seeing across the
various analysis chains. The first thing I'd back off is
MaxPermSize; 512M is huge for this parameter.
If you're getting that kind of time differential and your CPU isn't
pegged, you're probably swapping, in which case you need
to give the process more memory. I'd just take MaxPermSize
out completely as a start.
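Starting from the flags Yandong posted, that would mean launching with
something like this (the heap sizes are illustrative, not a recommendation):

    java -Xms1024M -Xmx2048M -jar start.jar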

Not sure if you've seen this page; something there might help.
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

But throw a profiler at the indexer as a first step, just to see
where the problem is: CPU or I/O.

Best
Erick



Re: Faster Solr Indexing

2012-03-11 Thread Mikhail Khludnev
Dmitry,

Since you bring up logging, don't forget that JDK logging, the default in
3.x, is not very performant. Logback is much faster.
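Switching is mostly a classpath change: replace the slf4j-jdk14 binding with
logback-classic (plus logback-core) and drop in a minimal logback.xml, e.g.
(illustrative, not a tuned config):

    <configuration>
      <appender name="FILE" class="ch.qos.logback.core.FileAppender">
        <file>solr.log</file>
        <encoder>
          <pattern>%d %-5level %logger{24} - %msg%n</pattern>
        </encoder>
      </appender>
      <root level="WARN">
        <appender-ref ref="FILE"/>
      </root>
    </configuration>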

Peyman,
1. Shingling has performance implications; it can cost a lot. Why aren't term
positions and phrase queries enough for you?
2. Some time ago there was a similar thread caused by superfluous shingling,
so it's worth double-checking that you aren't producing more shingles than
you really need (Captain Obvious speaking).
3. When I have a performance problem, the first thing I do is run a profiler
or sampler.
4. The way to look inside Lucene indexing is to enable infoStream; you'll
get a lot of information.
5. Are all of your CPU cores utilized? If they aren't, index in multiple
threads; it scales. Post several indexing requests in parallel. Be aware that
DIH doesn't work with multiple threads yet (SOLR-3011).
6. Some time ago I needed very high throughput and fell into the classic
producer-consumer trap. The indexing app (a slightly hacked DIH) pulled data
from JDBC while Solr indexing sat idle, then pushed the constructed documents
to Solr synchronously and sat idle itself while Solr consumed them. As a
result, the overall time was the sum of the producing and consuming times. I
organized an asynchronous buffer and reduced it to the maximum of the two.
Double-check that your total is the maximum of producing and consuming, not
their sum; I used perf4j to trace those times. (A sketch follows this list.)
7. As your data is huge, you can try to employ cluster magic: spread your
docs between two Solr instances and search them in parallel (SolrShards,
SolrCloud), though I've never done that myself. If you don't want to search
in parallel, you can copy index shards between boxes to get a full replica on
each box, but I haven't heard of that working out of the box.
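Here is a hedged sketch of the asynchronous buffer from item 6, using a plain
BlockingQueue with SolrJ (class names, queue size, batch size, and thread
count are illustrative, not from my actual code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class AsyncIndexer {
        public static void main(String[] args) throws Exception {
            final SolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            // A bounded queue decouples the JDBC producer from the Solr
            // consumers, so total time approaches max(produce, consume)
            // instead of their sum.
            final BlockingQueue<SolrInputDocument> queue =
                new ArrayBlockingQueue<SolrInputDocument>(10000);

            ExecutorService consumers = Executors.newFixedThreadPool(4);
            for (int i = 0; i < 4; i++) {
                consumers.submit(new Runnable() {
                    public void run() {
                        List<SolrInputDocument> batch =
                            new ArrayList<SolrInputDocument>(1000);
                        try {
                            while (true) {
                                batch.add(queue.take()); // blocks when empty
                                if (batch.size() == 1000) {
                                    solr.add(batch);     // one request/1000 docs
                                    batch.clear();
                                }
                            }
                        } catch (Exception e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                });
            }
            // Producer side (e.g. a JDBC loop) just calls queue.put(doc);
            // shutdown and poison-pill handling omitted to keep this short.
        }
    }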

Regards


Re: Faster Solr Indexing

2012-03-11 Thread Dmitry Kan
One approach we have taken is decreasing the Solr logging level for
the posting session, described here (implemented for 1.4, but it should
be easy to port to 3.x):

http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
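If you run Solr embedded in the same JVM, the quick runtime version of this
on 3.x (which uses JDK logging by default) looks roughly like the sketch
below; for a standalone Solr, set the level in logging.properties or on the
admin logging page instead. runBulkIndexing() stands in for whatever your
posting loop is:

    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Raise the threshold for all org.apache.solr loggers before a bulk
    // posting session, then restore the previous level afterwards.
    Logger solrLog = Logger.getLogger("org.apache.solr");
    Level previous = solrLog.getLevel();   // may be null (inherited)
    solrLog.setLevel(Level.WARNING);       // silence the per-add INFO lines
    runBulkIndexing();                     // hypothetical posting routine
    solrLog.setLevel(previous);            // null restores inheritance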



-- 
Regards,

Dmitry Kan


Re: Faster Solr Indexing

2012-03-11 Thread Yandong Yao
I have similar issues using DIH:
org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
consumes most of the time when indexing 10K rows (each row is about 70K):
-  DIH nextRow takes about 10 seconds in total
-  if the index uses a whitespace tokenizer and lowercase filter, the
addDoc() calls take about 80 seconds
-  if the index uses a whitespace tokenizer, lowercase filter, and WDF, then
addDoc takes about 112 seconds
-  if the index uses a whitespace tokenizer, lowercase filter, WDF, and
Porter stemmer, then addDoc takes about 145 seconds
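For concreteness, the third chain above corresponds to a field type roughly
like this in schema.xml (the type name and WDF parameters are illustrative,
not my actual schema):

    <fieldType name="text_wdf" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- WDF splits on case and letter/number transitions and can
             multiply the token stream, consistent with the jump from
             80 to 112 seconds above -->
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1"/>
      </analyzer>
    </fieldType>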

We have more than a million rows in total, and I am wondering whether I am
doing something wrong, or whether there is any way to improve the performance
of addDoc()?

Thanks very much in advance!


Here is the configuration:
1) JVM:  -Xms256M -Xmx1048M -XX:MaxPermSize=512m
2) Solr version 3.5
3) solrconfig.xml (almost copied from Solr's example/solr directory):

  <indexDefaults>
    <!-- The element names below were stripped by the list archive; they are
         restored as a best guess from the Solr 3.5 example solrconfig.xml.
         Only the values survived in the original mail. -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>64</ramBufferSizeMB>
    <maxFieldLength>2147483647</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <!-- a bare value of "1" appeared here; its element name was lost -->
    <lockType>native</lockType>
  </indexDefaults>



Faster Solr Indexing

2012-03-10 Thread Peyman Faratin
Hi

I am trying to index 12MM docs faster than is currently happening in Solr
(using SolrJ). We have identified Solr's add method as the bottleneck (and not
commit, which is tuned OK through mergeFactor, maxRamBufferSize, and JVM
RAM).

Adding 1000 docs is taking approximately 25 seconds. We are making sure we add
and commit in batches. And we've tried both CommonsHttpSolrServer and
EmbeddedSolrServer (assuming that removing the HTTP overhead would speed
things up), but the difference is marginal.
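A minimal sketch of the kind of batched add/commit loop described above
(simplified for the list; not our production code):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        // Adds docs in batches of 1000; the server.add(batch) call is where
        // the ~25 seconds per 1000 docs is being spent.
        public static void indexAll(Iterable<SolrInputDocument> docs,
                                    SolrServer server) throws Exception {
            List<SolrInputDocument> batch =
                new ArrayList<SolrInputDocument>(1000);
            for (SolrInputDocument doc : docs) {
                batch.add(doc);
                if (batch.size() == 1000) {
                    server.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) server.add(batch);
            server.commit();   // single commit at the end, not per batch
        }
    }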

The docs being indexed have on average 20 fields, mostly indexed but none
stored. The major size contributors are two fields:

- content, and
- shingledContent (populated using copyField of content).

The length of the content field is (likely) Gaussian distributed (a few large
docs of 50-80K tokens, but the majority around 2K tokens). We use
shingledContent to support phrase queries and content for unigram queries
(following the advice of the Solr Enterprise Search Server book, p. 305,
section "The Solution: Shingling").

Clearly the size of the docs contributes to the slow adds (confirmed by
removing these two fields, which halves the indexing time). We've also tried
compressed=true, but that is not working.

Any guidance on how to support our application logic (without having to change
the schema too much) and increase the indexing speed (from the current 212
days for 12MM docs) would be much appreciated.

thank you

Peyman