Re: Any tips for indexing large amounts of data?

2009-04-10 Thread Noble Paul നോബിള്‍ नोब्ळ्
They don't usually turn off the slave, but it is not a bad idea if
you can take it offline. It is a logistical headache, though.

BTW, do you have a very good cache hit ratio? Then it makes sense to autowarm.
--Noble

On Fri, Apr 10, 2009 at 4:07 PM, sunnyfr johanna...@gmail.com wrote:

 OK, but how do people handle frequent updates on a large database with a lot of
 queries against it?
 Do they turn off the slave during the warmup?


 Noble Paul നോബിള്‍  नोब्ळ् wrote:

 On Thu, Apr 9, 2009 at 8:51 PM, sunnyfr johanna...@gmail.com wrote:

 Hi Otis,
 How did you manage that? I have an 8-core machine with 8GB of RAM and an 11GB
 index
 for 14M docs, with 5 updates every 30 min, but my replication kills
 everything.
 My segments are merged too often, so the full index is replicated and the
 caches are lost, and I've no idea what I can do now.
 Some help would be brilliant;
 btw I'm using Solr 1.4.


 sunnyfr, whether the replication is full or delta, the caches are
 lost completely.

 You can think of partitioning the index into separate Solr instances,
 updating one partition at a time, and performing distributed search.

 Thanks,


 Otis Gospodnetic wrote:

 Mike is right about the occasional slow-down, which appears as a pause
 and
 is due to large Lucene index segment merging.  This should go away with
 newer versions of Lucene where this is happening in the background.

 That said, we just indexed about 20MM documents on a single 8-core
 machine
 with 8 GB of RAM, resulting in nearly 20 GB index.  The whole process
 took
 a little less than 10 hours - that's over 550 docs/second.  The vanilla
 approach before some of our changes apparently required several days to
 index the same amount of data.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Mike Klaas mike.kl...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Monday, November 19, 2007 5:50:19 PM
 Subject: Re: Any tips for indexing large amounts of data?

 There should be some slowdown in larger indices as occasionally large
 segment merge operations must occur.  However, this shouldn't really
 affect overall speed too much.

 You haven't really given us enough data to tell you anything useful.
 I would recommend trying to do the indexing via a webapp to eliminate
 all your code as a possible factor.  Then, look for signs to what is
 happening when indexing slows.  For instance, is Solr high in cpu, is
 the computer thrashing, etc?

 -Mike

 On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:

 Hi,

 Thanks for answering this question a while back. I have made some
 of the suggestions you mentioned. ie not committing until I've
 finished indexing. What I am seeing though, is as the index get
 larger (around 1Gb), indexing is taking a lot longer. In fact it
 slows down to a crawl. Have you got any pointers as to what I might
 be doing wrong?

 Also, I was looking at using MultiCore solr. Could this help in
 some way?

 Thank you
 Brendan

 On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:


 : I would think you would see better performance by allowing auto
 commit
 : to handle the commit size instead of reopening the connection
 all the
 : time.

 if your goal is fast indexing, don't use autoCommit at all ...
  just
 index everything, and don't commit until you are completely done.

 autoCommitting will slow your indexing down (the benefit being
 that more
 results will be visible to searchers as you proceed)




 -Hoss














 --
 --Noble Paul








-- 
--Noble Paul


Re: Any tips for indexing large amounts of data?

2009-04-09 Thread sunnyfr

Hi Otis,
How did you manage that? I have an 8-core machine with 8GB of RAM and an 11GB index
for 14M docs, with 5 updates every 30 min, but my replication kills everything.
My segments are merged too often, so the full index is replicated and the caches
are lost, and I've no idea what I can do now.
Some help would be brilliant;
btw I'm using Solr 1.4.

Thanks,


Otis Gospodnetic wrote:
 
 Mike is right about the occasional slow-down, which appears as a pause and
 is due to large Lucene index segment merging.  This should go away with
 newer versions of Lucene where this is happening in the background.
 
 That said, we just indexed about 20MM documents on a single 8-core machine
 with 8 GB of RAM, resulting in nearly 20 GB index.  The whole process took
 a little less than 10 hours - that's over 550 docs/second.  The vanilla
 approach before some of our changes apparently required several days to
 index the same amount of data.
 
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 - Original Message 
 From: Mike Klaas mike.kl...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Monday, November 19, 2007 5:50:19 PM
 Subject: Re: Any tips for indexing large amounts of data?
 
 There should be some slowdown in larger indices as occasionally large  
 segment merge operations must occur.  However, this shouldn't really  
 affect overall speed too much.
 
 You haven't really given us enough data to tell you anything useful.   
 I would recommend trying to do the indexing via a webapp to eliminate  
 all your code as a possible factor.  Then, look for signs to what is  
 happening when indexing slows.  For instance, is Solr high in cpu, is  
 the computer thrashing, etc?
 
 -Mike
 
 On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:
 
 Hi,

 Thanks for answering this question a while back. I have made some  
 of the suggestions you mentioned. ie not committing until I've  
 finished indexing. What I am seeing though, is as the index get  
 larger (around 1Gb), indexing is taking a lot longer. In fact it  
 slows down to a crawl. Have you got any pointers as to what I might  
 be doing wrong?

 Also, I was looking at using MultiCore solr. Could this help in  
 some way?

 Thank you
 Brendan

 On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:


 : I would think you would see better performance by allowing auto  
 commit
 : to handle the commit size instead of reopening the connection  
 all the
 : time.

 if your goal is fast indexing, don't use autoCommit at all ...
  just
 index everything, and don't commit until you are completely done.

 autoCommitting will slow your indexing down (the benefit being  
 that more
 results will be visible to searchers as you proceed)




 -Hoss


 
 
 
 
 
 




Re: Any tips for indexing large amounts of data?

2009-04-09 Thread Glen Newton
For Solr / Lucene:
- use -XX:+AggressiveOpts
- If available, huge pages can help. See
http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html
 I haven't yet followed up with my Lucene performance numbers using
huge pages: the gain is 10-15% for large indexing jobs.

For Lucene:
- multi-thread using java.util.concurrent.ThreadPoolExecutor (see the sketch below)
(http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html
  6.4 million full-text articles + metadata indexed, resulting in an 83GB
index; these are old numbers: things are down to ~10 hours now)
- while multithreading on multicore is particularly good, it also
improves performance on a single core, for small (<6, YMMV) numbers of
threads and good I/O (test for your particular configuration)
- Use multiple indexes and merge at the end
- As per http://developers.sun.com/learning/javaoneonline/2008/pdf/TS-5515.pdf
use a separate ThreadPoolExecutor per index (as above), reducing queue
contention. This is giving me an additional ~10%. I will blog about
this in the near future...
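
For illustration (not part of Glen's original message), here is a minimal sketch
of the multi-threaded approach he describes: a fixed thread pool feeding a shared
Lucene IndexWriter, which is thread-safe. The IndexWriter construction, the record
type, and the buildDocument() helper are placeholders; Document/Field and writer
APIs differ across Lucene versions, so adapt them to yours.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class ParallelIndexer {

    // Feed every record to a shared IndexWriter from a fixed thread pool.
    // 'records' and buildDocument() are hypothetical stand-ins for your own
    // data source and Document-construction logic.
    public static void indexAll(final IndexWriter writer, List<String> records,
                                int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final String record : records) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        Document doc = buildDocument(record);
                        writer.addDocument(doc);   // IndexWriter is thread-safe
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);   // wait for all indexing tasks
        writer.close();                            // single flush/close at the very end
    }

    private static Document buildDocument(String record) {
        Document doc = new Document();
        // add Fields built from 'record' here; Field constructors differ across
        // Lucene versions, so this is left as a placeholder
        return doc;
    }
}

Glen's further point is to run one such pool (and one index) per partition and
merge the indexes at the end, which avoids contention on a single task queue.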

-glen

2009/4/9 sunnyfr johanna...@gmail.com:

 Hi Otis,
 How did you manage that? I've 8 core machine with 8GB of ram and 11GB index
 for 14M docs and 5 update every 30mn but my replication kill everything.
 My segments are merged too often sor full index replicate and cache lost and
  I've no idea what can I do now?
 Some help would be brilliant,
 btw im using Solr 1.4.

 Thanks,


 Otis Gospodnetic wrote:

 Mike is right about the occasional slow-down, which appears as a pause and
 is due to large Lucene index segment merging.  This should go away with
 newer versions of Lucene where this is happening in the background.

 That said, we just indexed about 20MM documents on a single 8-core machine
 with 8 GB of RAM, resulting in nearly 20 GB index.  The whole process took
 a little less than 10 hours - that's over 550 docs/second.  The vanilla
 approach before some of our changes apparently required several days to
 index the same amount of data.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Mike Klaas mike.kl...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Monday, November 19, 2007 5:50:19 PM
 Subject: Re: Any tips for indexing large amounts of data?

 There should be some slowdown in larger indices as occasionally large
 segment merge operations must occur.  However, this shouldn't really
 affect overall speed too much.

 You haven't really given us enough data to tell you anything useful.
 I would recommend trying to do the indexing via a webapp to eliminate
 all your code as a possible factor.  Then, look for signs to what is
 happening when indexing slows.  For instance, is Solr high in cpu, is
 the computer thrashing, etc?

 -Mike

 On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:

 Hi,

 Thanks for answering this question a while back. I have made some
 of the suggestions you mentioned. ie not committing until I've
 finished indexing. What I am seeing though, is as the index get
 larger (around 1Gb), indexing is taking a lot longer. In fact it
 slows down to a crawl. Have you got any pointers as to what I might
 be doing wrong?

 Also, I was looking at using MultiCore solr. Could this help in
 some way?

 Thank you
 Brendan

 On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:


 : I would think you would see better performance by allowing auto
 commit
 : to handle the commit size instead of reopening the connection
 all the
 : time.

 if your goal is fast indexing, don't use autoCommit at all ...
  just
 index everything, and don't commit until you are completely done.

 autoCommitting will slow your indexing down (the benefit being
 that more
 results will be visible to searchers as you proceed)




 -Hoss














-- 



Re: Any tips for indexing large amounts of data?

2009-04-09 Thread Glen Newton
 - As per
 http://developers.sun.com/learning/javaoneonline/2008/pdf/TS-5515.pdf
Sorry, the presentation covers a lot of ground: see slide #20:
"Standard thread pools can have high contention for task queue and
other data structures when used with fine-grained tasks."
[I haven't yet implemented work stealing]

-glen

2009/4/9 Glen Newton glen.new...@gmail.com:
 For Solr / Lucene:
 - use -XX:+AggressiveOpts
 - If available, huge pages can help. See
 http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html
  I haven't yet followed up with my Lucene performance numbers using
 huge pages: the gain is 10-15% for large indexing jobs.

 For Lucene:
 - multi-thread using java.util.concurrent.ThreadPoolExecutor
 (http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html
  6.4 million full-text articles + metadata indexed, resulting in an 83GB
 index; these are old numbers: things are down to ~10 hours now)
 - while multithreading on multicore is particularly good, it also
 improves performance on a single core, for small (<6, YMMV) numbers of
 threads and good I/O (test for your particular configuration)
 - Use multiple indexes and merge at the end
 - As per http://developers.sun.com/learning/javaoneonline/2008/pdf/TS-5515.pdf
 use a separate ThreadPoolExecutor per index (as above), reducing queue
 contention. This is giving me an additional ~10%. I will blog about
 this in the near future...

 -glen

 2009/4/9 sunnyfr johanna...@gmail.com:

 Hi Otis,
 How did you manage that? I have an 8-core machine with 8GB of RAM and an 11GB index
 for 14M docs, with 5 updates every 30 min, but my replication kills everything.
 My segments are merged too often, so the full index is replicated and the caches
 are lost, and I've no idea what I can do now.
 Some help would be brilliant;
 btw I'm using Solr 1.4.

 Thanks,


 Otis Gospodnetic wrote:

 Mike is right about the occasional slow-down, which appears as a pause and
 is due to large Lucene index segment merging.  This should go away with
 newer versions of Lucene where this is happening in the background.

 That said, we just indexed about 20MM documents on a single 8-core machine
 with 8 GB of RAM, resulting in nearly 20 GB index.  The whole process took
 a little less than 10 hours - that's over 550 docs/second.  The vanilla
 approach before some of our changes apparently required several days to
 index the same amount of data.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Mike Klaas mike.kl...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Monday, November 19, 2007 5:50:19 PM
 Subject: Re: Any tips for indexing large amounts of data?

 There should be some slowdown in larger indices as occasionally large
 segment merge operations must occur.  However, this shouldn't really
 affect overall speed too much.

 You haven't really given us enough data to tell you anything useful.
 I would recommend trying to do the indexing via a webapp to eliminate
 all your code as a possible factor.  Then, look for signs to what is
 happening when indexing slows.  For instance, is Solr high in cpu, is
 the computer thrashing, etc?

 -Mike

 On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:

 Hi,

 Thanks for answering this question a while back. I have made some
 of the suggestions you mentioned. ie not committing until I've
 finished indexing. What I am seeing though, is as the index get
 larger (around 1Gb), indexing is taking a lot longer. In fact it
 slows down to a crawl. Have you got any pointers as to what I might
 be doing wrong?

 Also, I was looking at using MultiCore solr. Could this help in
 some way?

 Thank you
 Brendan

 On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:


 : I would think you would see better performance by allowing auto
 commit
 : to handle the commit size instead of reopening the connection
 all the
 : time.

 if your goal is fast indexing, don't use autoCommit at all ...
  just
 index everything, and don't commit until you are completely done.

 autoCommitting will slow your indexing down (the benefit being
 that more
 results will be visible to searchers as you proceed)




 -Hoss














 --





-- 



Re: Any tips for indexing large amounts of data?

2009-04-09 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Thu, Apr 9, 2009 at 8:51 PM, sunnyfr johanna...@gmail.com wrote:

 Hi Otis,
 How did you manage that? I have an 8-core machine with 8GB of RAM and an 11GB index
 for 14M docs, with 5 updates every 30 min, but my replication kills everything.
 My segments are merged too often, so the full index is replicated and the caches
 are lost, and I've no idea what I can do now.
 Some help would be brilliant;
 btw I'm using Solr 1.4.


sunnyfr, whether the replication is full or delta, the caches are
lost completely.

You can think of partitioning the index into separate Solr instances,
updating one partition at a time, and performing distributed search.
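
For illustration (not part of Noble's message), a minimal SolrJ sketch of querying
such partitions with the standard shards parameter follows; the host names, core
URLs, and query are placeholders, and it assumes SolrJ (Solr 1.3+).

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedSearchExample {
    public static void main(String[] args) throws Exception {
        // any one node can coordinate the distributed request
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://host1:8983/solr");
        SolrQuery q = new SolrQuery("title:lucene");
        // search across all partitions; only the partition currently being
        // updated loses its caches
        q.set("shards", "host1:8983/solr,host2:8983/solr,host3:8983/solr");
        QueryResponse rsp = server.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}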

 Thanks,


 Otis Gospodnetic wrote:

 Mike is right about the occasional slow-down, which appears as a pause and
 is due to large Lucene index segment merging.  This should go away with
 newer versions of Lucene where this is happening in the background.

 That said, we just indexed about 20MM documents on a single 8-core machine
 with 8 GB of RAM, resulting in nearly 20 GB index.  The whole process took
 a little less than 10 hours - that's over 550 docs/second.  The vanilla
 approach before some of our changes apparently required several days to
 index the same amount of data.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Mike Klaas mike.kl...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Monday, November 19, 2007 5:50:19 PM
 Subject: Re: Any tips for indexing large amounts of data?

 There should be some slowdown in larger indices as occasionally large
 segment merge operations must occur.  However, this shouldn't really
 affect overall speed too much.

 You haven't really given us enough data to tell you anything useful.
 I would recommend trying to do the indexing via a webapp to eliminate
 all your code as a possible factor.  Then, look for signs to what is
 happening when indexing slows.  For instance, is Solr high in cpu, is
 the computer thrashing, etc?

 -Mike

 On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:

 Hi,

 Thanks for answering this question a while back. I have made some
 of the suggestions you mentioned. ie not committing until I've
 finished indexing. What I am seeing though, is as the index get
 larger (around 1Gb), indexing is taking a lot longer. In fact it
 slows down to a crawl. Have you got any pointers as to what I might
 be doing wrong?

 Also, I was looking at using MultiCore solr. Could this help in
 some way?

 Thank you
 Brendan

 On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:


 : I would think you would see better performance by allowing auto
 commit
 : to handle the commit size instead of reopening the connection
 all the
 : time.

 if your goal is fast indexing, don't use autoCommit at all ...
  just
 index everything, and don't commit until you are completely done.

 autoCommitting will slow your indexing down (the benefit being
 that more
 results will be visible to searchers as you proceed)




 -Hoss














-- 
--Noble Paul


Re: Any tips for indexing large amounts of data?

2007-11-22 Thread Otis Gospodnetic
Brendan - yes, this is 64-bit Linux, and the JVM got a 5.5 GB heap, though it
could have worked with less.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Brendan Grainger [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Wednesday, November 21, 2007 1:24:05 PM
Subject: Re: Any tips for indexing large amounts of data?

Hi Otis,

Thanks for this. Are you using a flavor of linux and is it 64bit? How  
much heap are you giving your jvm?

Thanks again
Brendan

On Nov 21, 2007, at 2:03 AM, Otis Gospodnetic wrote:

 Mike is right about the occasional slow-down, which appears as a  
 pause and is due to large Lucene index segment merging.  This  
 should go away with newer versions of Lucene where this is  
 happening in the background.

 That said, we just indexed about 20MM documents on a single 8-core  
 machine with 8 GB of RAM, resulting in nearly 20 GB index.  The  
 whole process took a little less than 10 hours - that's over 550  
 docs/second.  The vanilla approach before some of our changes  
 apparently required several days to index the same amount of data.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Mike Klaas [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Monday, November 19, 2007 5:50:19 PM
 Subject: Re: Any tips for indexing large amounts of data?

 There should be some slowdown in larger indices as occasionally large
 segment merge operations must occur.  However, this shouldn't really
 affect overall speed too much.

 You haven't really given us enough data to tell you anything useful.
 I would recommend trying to do the indexing via a webapp to eliminate
 all your code as a possible factor.  Then, look for signs to what is
 happening when indexing slows.  For instance, is Solr high in cpu, is
 the computer thrashing, etc?

 -Mike

 On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:

 Hi,

 Thanks for answering this question a while back. I have made some
 of the suggestions you mentioned. ie not committing until I've
 finished indexing. What I am seeing though, is as the index get
 larger (around 1Gb), indexing is taking a lot longer. In fact it
 slows down to a crawl. Have you got any pointers as to what I might
 be doing wrong?

 Also, I was looking at using MultiCore solr. Could this help in
 some way?

 Thank you
 Brendan

 On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:


 : I would think you would see better performance by allowing auto
 commit
 : to handle the commit size instead of reopening the connection
 all the
 : time.

 if your goal is fast indexing, don't use autoCommit at all ...
  just
 index everything, and don't commit until you are completely done.

 autoCommitting will slow your indexing down (the benefit being
 that more
 results will be visible to searchers as you proceed)




 -Hoss












Re: Any tips for indexing large amounts of data?

2007-11-21 Thread Brendan Grainger

HI Otis,

Thanks for the reply. I am using a pretty vanilla approach right
now and it's taking about 30 hours to build an index of about 5.5Gb.
Can you please tell me about some of the changes you made to optimize
the indexing process?


Thanks
Brendan

On Nov 21, 2007, at 2:27 AM, Otis Gospodnetic wrote:

Just tried a search for "web" on this index - 1.1 seconds.  This
matches about 1MM of about 20MM docs.  Redo the search, and it's 1
ms (cached).  This is without any load or serious benchmarking,
clearly.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Eswar K [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Wednesday, November 21, 2007 2:11:07 AM
Subject: Re: Any tips for indexing large amounts of data?

Hi otis,

I understand that is slightly off track question, but I am just curious to
know the performance of Search on a 20 GB index file. What has been your
observation?

Regards,
Eswar

On Nov 21, 2007 12:33 PM, Otis Gospodnetic  
[EMAIL PROTECTED]

wrote:


Mike is right about the occasional slow-down, which appears as a pause and
is due to large Lucene index segment merging.  This should go away with
newer versions of Lucene where this is happening in the background.

That said, we just indexed about 20MM documents on a single 8-core machine
with 8 GB of RAM, resulting in nearly 20 GB index.  The whole process took a
little less than 10 hours - that's over 550 docs/second.  The vanilla
approach before some of our changes apparently required several days to
index the same amount of data.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, November 19, 2007 5:50:19 PM
Subject: Re: Any tips for indexing large amounts of data?

There should be some slowdown in larger indices as occasionally large
segment merge operations must occur.  However, this shouldn't really
affect overall speed too much.

You haven't really given us enough data to tell you anything useful.
I would recommend trying to do the indexing via a webapp to eliminate
all your code as a possible factor.  Then, look for signs to what is
happening when indexing slows.  For instance, is Solr high in cpu, is
the computer thrashing, etc?

-Mike

On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:


Hi,

Thanks for answering this question a while back. I have made some
of the suggestions you mentioned. ie not committing until I've
finished indexing. What I am seeing though, is as the index get
larger (around 1Gb), indexing is taking a lot longer. In fact it
slows down to a crawl. Have you got any pointers as to what I might
be doing wrong?

Also, I was looking at using MultiCore solr. Could this help in
some way?

Thank you
Brendan

On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:



: I would think you would see better performance by allowing auto
commit
: to handle the commit size instead of reopening the connection
all the
: time.

if your goal is fast indexing, don't use autoCommit at all ...

 just

index everything, and don't commit until you are completely done.

autoCommitting will slow your indexing down (the benefit being
that more
results will be visible to searchers as you proceed)




-Hoss

















Re: Any tips for indexing large amounts of data?

2007-11-21 Thread Brendan Grainger

Hi Otis,

Thanks for this. Are you using a flavor of Linux, and is it 64-bit? How
much heap are you giving your JVM?


Thanks again
Brendan

On Nov 21, 2007, at 2:03 AM, Otis Gospodnetic wrote:

Mike is right about the occasional slow-down, which appears as a  
pause and is due to large Lucene index segment merging.  This  
should go away with newer versions of Lucene where this is  
happening in the background.


That said, we just indexed about 20MM documents on a single 8-core  
machine with 8 GB of RAM, resulting in nearly 20 GB index.  The  
whole process took a little less than 10 hours - that's over 550  
docs/second.  The vanilla approach before some of our changes  
apparently required several days to index the same amount of data.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, November 19, 2007 5:50:19 PM
Subject: Re: Any tips for indexing large amounts of data?

There should be some slowdown in larger indices as occasionally large
segment merge operations must occur.  However, this shouldn't really
affect overall speed too much.

You haven't really given us enough data to tell you anything useful.
I would recommend trying to do the indexing via a webapp to eliminate
all your code as a possible factor.  Then, look for signs to what is
happening when indexing slows.  For instance, is Solr high in cpu, is
the computer thrashing, etc?

-Mike

On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:


Hi,

Thanks for answering this question a while back. I have made some
of the suggestions you mentioned. ie not committing until I've
finished indexing. What I am seeing though, is as the index get
larger (around 1Gb), indexing is taking a lot longer. In fact it
slows down to a crawl. Have you got any pointers as to what I might
be doing wrong?

Also, I was looking at using MultiCore solr. Could this help in
some way?

Thank you
Brendan

On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:



: I would think you would see better performance by allowing auto
commit
: to handle the commit size instead of reopening the connection
all the
: time.

if your goal is fast indexing, don't use autoCommit at all ...

 just

index everything, and don't commit until you are completely done.

autoCommitting will slow your indexing down (the benefit being
that more
results will be visible to searchers as you proceed)




-Hoss












Re: Any tips for indexing large amounts of data?

2007-11-20 Thread Otis Gospodnetic
Mike is right about the occasional slow-down, which appears as a pause and is 
due to large Lucene index segment merging.  This should go away with newer 
versions of Lucene where this is happening in the background.

That said, we just indexed about 20MM documents on a single 8-core machine with 
8 GB of RAM, resulting in nearly 20 GB index.  The whole process took a little 
less than 10 hours - that's over 550 docs/second.  The vanilla approach before 
some of our changes apparently required several days to index the same amount 
of data.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, November 19, 2007 5:50:19 PM
Subject: Re: Any tips for indexing large amounts of data?

There should be some slowdown in larger indices as occasionally large  
segment merge operations must occur.  However, this shouldn't really  
affect overall speed too much.

You haven't really given us enough data to tell you anything useful.   
I would recommend trying to do the indexing via a webapp to eliminate  
all your code as a possible factor.  Then, look for signs to what is  
happening when indexing slows.  For instance, is Solr high in cpu, is  
the computer thrashing, etc?

-Mike

On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:

 Hi,

 Thanks for answering this question a while back. I have made some  
 of the suggestions you mentioned. ie not committing until I've  
 finished indexing. What I am seeing though, is as the index get  
 larger (around 1Gb), indexing is taking a lot longer. In fact it  
 slows down to a crawl. Have you got any pointers as to what I might  
 be doing wrong?

 Also, I was looking at using MultiCore solr. Could this help in  
 some way?

 Thank you
 Brendan

 On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:


 : I would think you would see better performance by allowing auto  
 commit
 : to handle the commit size instead of reopening the connection  
 all the
 : time.

 if your goal is fast indexing, don't use autoCommit at all ...
 just
 index everything, and don't commit until you are completely done.

 autoCommitting will slow your indexing down (the benefit being  
 that more
 results will be visible to searchers as you proceed)




 -Hoss








Re: Any tips for indexing large amounts of data?

2007-11-20 Thread Eswar K
Hi Otis,

I understand this is a slightly off-track question, but I am just curious to
know the performance of search on a 20 GB index file. What has been your
observation?

Regards,
Eswar

On Nov 21, 2007 12:33 PM, Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 Mike is right about the occasional slow-down, which appears as a pause and
 is due to large Lucene index segment merging.  This should go away with
 newer versions of Lucene where this is happening in the background.

 That said, we just indexed about 20MM documents on a single 8-core machine
 with 8 GB of RAM, resulting in nearly 20 GB index.  The whole process took a
 little less than 10 hours - that's over 550 docs/second.  The vanilla
 approach before some of our changes apparently required several days to
 index the same amount of data.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Mike Klaas [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Monday, November 19, 2007 5:50:19 PM
 Subject: Re: Any tips for indexing large amounts of data?

 There should be some slowdown in larger indices as occasionally large
 segment merge operations must occur.  However, this shouldn't really
 affect overall speed too much.

 You haven't really given us enough data to tell you anything useful.
 I would recommend trying to do the indexing via a webapp to eliminate
 all your code as a possible factor.  Then, look for signs to what is
 happening when indexing slows.  For instance, is Solr high in cpu, is
 the computer thrashing, etc?

 -Mike

 On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:

  Hi,
 
  Thanks for answering this question a while back. I have made some
  of the suggestions you mentioned. ie not committing until I've
  finished indexing. What I am seeing though, is as the index get
  larger (around 1Gb), indexing is taking a lot longer. In fact it
  slows down to a crawl. Have you got any pointers as to what I might
  be doing wrong?
 
  Also, I was looking at using MultiCore solr. Could this help in
  some way?
 
  Thank you
  Brendan
 
  On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:
 
 
  : I would think you would see better performance by allowing auto
  commit
  : to handle the commit size instead of reopening the connection
  all the
  : time.
 
  if your goal is fast indexing, don't use autoCommit at all ...
  just
  index everything, and don't commit until you are completely done.
 
  autoCommitting will slow your indexing down (the benefit being
  that more
  results will be visible to searchers as you proceed)
 
 
 
 
  -Hoss
 
 







Re: Any tips for indexing large amounts of data?

2007-11-20 Thread Eswar K
That's great.

At what index size do you think we should look at partitioning the
index file?

Eswar

On Nov 21, 2007 12:57 PM, Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 Just tried a search for "web" on this index - 1.1 seconds.  This matches
 about 1MM of about 20MM docs.  Redo the search, and it's 1 ms (cached).
  This is without any load or serious benchmarking, clearly.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Eswar K [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Wednesday, November 21, 2007 2:11:07 AM
 Subject: Re: Any tips for indexing large amounts of data?

 Hi otis,

 I understand that is slightly off track question, but I am just curious
  to
 know the performance of Search on a 20 GB index file. What has been
  your
 observation?

 Regards,
 Eswar

 On Nov 21, 2007 12:33 PM, Otis Gospodnetic [EMAIL PROTECTED]
 wrote:

  Mike is right about the occasional slow-down, which appears as a
  pause and
  is due to large Lucene index segment merging.  This should go away
  with
  newer versions of Lucene where this is happening in the background.
 
  That said, we just indexed about 20MM documents on a single 8-core
  machine
  with 8 GB of RAM, resulting in nearly 20 GB index.  The whole process
  took a
  little less than 10 hours - that's over 550 docs/second.  The vanilla
  approach before some of our changes apparently required several days
  to
  index the same amount of data.
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
  - Original Message 
  From: Mike Klaas [EMAIL PROTECTED]
  To: solr-user@lucene.apache.org
  Sent: Monday, November 19, 2007 5:50:19 PM
  Subject: Re: Any tips for indexing large amounts of data?
 
  There should be some slowdown in larger indices as occasionally large
  segment merge operations must occur.  However, this shouldn't really
  affect overall speed too much.
 
  You haven't really given us enough data to tell you anything useful.
  I would recommend trying to do the indexing via a webapp to eliminate
  all your code as a possible factor.  Then, look for signs to what is
  happening when indexing slows.  For instance, is Solr high in cpu, is
  the computer thrashing, etc?
 
  -Mike
 
  On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:
 
   Hi,
  
   Thanks for answering this question a while back. I have made some
   of the suggestions you mentioned. ie not committing until I've
   finished indexing. What I am seeing though, is as the index get
   larger (around 1Gb), indexing is taking a lot longer. In fact it
   slows down to a crawl. Have you got any pointers as to what I might
   be doing wrong?
  
   Also, I was looking at using MultiCore solr. Could this help in
   some way?
  
   Thank you
   Brendan
  
   On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:
  
  
   : I would think you would see better performance by allowing auto
   commit
   : to handle the commit size instead of reopening the connection
   all the
   : time.
  
   if your goal is fast indexing, don't use autoCommit at all ...
   just
   index everything, and don't commit until you are completely done.
  
   autoCommitting will slow your indexing down (the benefit being
   that more
   results will be visible to searchers as you proceed)
  
  
  
  
   -Hoss
  
  
 
 
 
 
 






Re: Any tips for indexing large amounts of data?

2007-11-19 Thread Brendan Grainger

Hi,

Thanks for answering this question a while back. I have made some of
the suggestions you mentioned, i.e. not committing until I've finished
indexing. What I am seeing, though, is that as the index gets larger (around
1Gb), indexing is taking a lot longer. In fact it slows down to a
crawl. Have you got any pointers as to what I might be doing wrong?


Also, I was looking at using MultiCore solr. Could this help in some  
way?


Thank you
Brendan

On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:



: I would think you would see better performance by allowing auto  
commit
: to handle the commit size instead of reopening the connection all  
the

: time.

if your goal is fast indexing, don't use autoCommit at all ... just
index everything, and don't commit until you are completely done.

autoCommitting will slow your indexing down (the benefit being that  
more

results will be visible to searchers as you proceed)




-Hoss





Re: Any tips for indexing large amounts of data?

2007-11-19 Thread Mike Klaas
There should be some slowdown in larger indices as occasionally large  
segment merge operations must occur.  However, this shouldn't really  
affect overall speed too much.


You haven't really given us enough data to tell you anything useful.   
I would recommend trying to do the indexing via a webapp to eliminate  
all your code as a possible factor.  Then, look for signs as to what is
happening when indexing slows.  For instance, is Solr high in CPU, is
the computer thrashing, etc?


-Mike

On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:


Hi,

Thanks for answering this question a while back. I have made some  
of the suggestions you mentioned. ie not committing until I've  
finished indexing. What I am seeing though, is as the index get  
larger (around 1Gb), indexing is taking a lot longer. In fact it  
slows down to a crawl. Have you got any pointers as to what I might  
be doing wrong?


Also, I was looking at using MultiCore solr. Could this help in  
some way?


Thank you
Brendan

On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:



: I would think you would see better performance by allowing auto  
commit
: to handle the commit size instead of reopening the connection  
all the

: time.

if your goal is fast indexing, don't use autoCommit at all ... just
index everything, and don't commit until you are completely done.

autoCommitting will slow your indexing down (the benefit being  
that more

results will be visible to searchers as you proceed)




-Hoss







Re: Any tips for indexing large amounts of data?

2007-11-02 Thread Brendan Grainger
Thanks so much for your suggestions. I am attempting to index 550K  
docs at once, but have found I've had to break them up into smaller  
batches. Indexing seems to stop at around 47K docs (the index reaches  
264M in size at this point). The index itself eventually grows to
about 2Gb. I am using embedded Solr and adding a document with code
very similar to this:




private void addModel(Model model) throws IOException {
    UpdateHandler updateHandler = solrCore.getUpdateHandler();
    AddUpdateCommand addcmd = new AddUpdateCommand();

    DocumentBuilder builder = new DocumentBuilder(solrCore.getSchema());

    builder.startDoc();
    builder.addField("id", "Model:" + model.getUuid());
    builder.addField("class", "Model");
    builder.addField("uuid", model.getUuid());
    builder.addField("one_facet", model.getOneFacet());
    builder.addField("another_facet", model.getAnotherFacet());

    // .. other fields

    addcmd.doc = builder.getDoc();
    addcmd.allowDups = false;
    addcmd.overwritePending = true;
    addcmd.overwriteCommitted = true;
    updateHandler.addDoc(addcmd);
}

I have other 'Model' objects I'm adding also.

Thanks

On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:



: I would think you would see better performance by allowing auto  
commit
: to handle the commit size instead of reopening the connection all  
the

: time.

if your goal is fast indexing, don't use autoCommit at all ... just
index everything, and don't commit until you are completely done.

autoCommitting will slow your indexing down (the benefit being that  
more

results will be visible to searchers as you proceed)




-Hoss





RE: Any tips for indexing large amounts of data?

2007-10-31 Thread Jeryl Cook
Usability consideration:
Not really answering your question, but I must comment that searching on up to
100K items makes faceted navigation very effective, but it becomes much less
effective past 100K. You may want to consider breaking up the 500K documents
into categories (a typical breadcrumb) of 100K each for faceted browsing.
 
 Jeryl Cook 



 To: solr-user@lucene.apache.org
 From: [EMAIL PROTECTED]
 Subject: Any tips for indexing large amounts of data?
 Date: Wed, 31 Oct 2007 10:30:50 -0400

 Hi,

 I am creating an index of approx 500K documents. I wrote an indexing
 program using embedded Solr: http://wiki.apache.org/solr/EmbeddedSolr
 and am seeing probably a 10-fold increase in indexing speeds. My
 problem, though, is that if I try to reindex say 20K docs at a time it
 slows down considerably. I currently batch my updates in lots of 100
 and between batches I close and reopen the connection to Solr like so:

 private void openConnection(String environment) throws
         ParserConfigurationException, IOException, SAXException {
     System.setProperty("solr.solr.home", SOLR_HOME);
     solrConfig = new SolrConfig("solrconfig.xml");
     solrCore = new SolrCore(SOLR_HOME + "data/" + environment,
             solrConfig, new IndexSchema(solrConfig, "schema.xml"));
     logger.debug("Opened solr connection");
 }

 private void closeConnection() {
     solrCore.close();
     solrCore = null;
     logger.debug("Closed solr connection");
 }

 Does anyone have any pointers or see anything obvious I'm doing wrong?

 Thanks

 PS Sorry if this is posted twice.

Re: Any tips for indexing large amounts of data?

2007-10-31 Thread scott.tabar
Greetings Brendan,

In the solrconfig.xml file, under the updateHandler, there is an autoCommit
setting.

It looks like:

<autoCommit>
  <maxDocs>1000</maxDocs>
  <maxTime>1000</maxTime>
</autoCommit>

I would think you would see better performance by allowing auto commit to 
handle the commit size instead of reopening the connection all the time.

I believe the maxTime is in milliseconds.

Let us know if this helps,
   Scott Tabar

 Brendan Grainger [EMAIL PROTECTED] wrote: 
Hi,

I am creating an index of approx 500K documents. I wrote an indexing
program using embedded Solr: http://wiki.apache.org/solr/EmbeddedSolr
and am seeing probably a 10-fold increase in indexing speeds. My
problem, though, is that if I try to reindex say 20K docs at a time it
slows down considerably. I currently batch my updates in lots of 100
and between batches I close and reopen the connection to Solr like so:

 private void openConnection(String environment) throws
         ParserConfigurationException, IOException, SAXException {
     System.setProperty("solr.solr.home", SOLR_HOME);
     solrConfig = new SolrConfig("solrconfig.xml");
     solrCore = new SolrCore(SOLR_HOME + "data/" + environment,
             solrConfig, new IndexSchema(solrConfig, "schema.xml"));
     logger.debug("Opened solr connection");
 }

 private void closeConnection() {
     solrCore.close();
     solrCore = null;
     logger.debug("Closed solr connection");
 }

Does anyone have any pointers or see anything obvious I'm doing wrong?

Thanks


PS Sorry if this is posted twice.



Re: Any tips for indexing large amounts of data?

2007-10-31 Thread Chris Hostetter

: currently batch my updates in lots of 100 and between batches I close and
: reopen the connection to solr like so:

: private void closeConnection() {
: solrCore.close();
: solrCore = null;
: logger.debug("Closed solr connection");
: }
: 
: Does anyone have any pointers or see anything obvious I'm doing wrong?

You haven't really shown us much about what you are actually doing to
index your docs so that we can see what might be taking time, but I can
tell you that there is absolutely no reason whatsoever to close your
SolrCore in the middle of a large indexing job.



-Hoss



Re: Any tips for indexing large amounts of data?

2007-10-31 Thread Chris Hostetter

: I would think you would see better performance by allowing auto commit 
: to handle the commit size instead of reopening the connection all the 
: time.

if your goal is fast indexing, don't use autoCommit at all ... just 
index everything, and don't commit until you are completely done.

autoCommitting will slow your indexing down (the benefit being that more 
results will be visible to searchers as you proceed)
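
For illustration (not part of Hoss's message), a minimal SolrJ sketch of this
approach follows: batch the adds and issue a single commit only when everything
has been indexed. The URL, field names, and batch size are placeholders, and it
assumes SolrJ (Solr 1.3+) rather than the embedded approach used earlier in the
thread.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 500000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("text", "body of document " + i);
            batch.add(doc);
            if (batch.size() == 1000) {      // send in batches to bound memory
                server.add(batch);           // add only -- no commit here
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();                     // one commit when everything is indexed
    }
}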




-Hoss