Re: Indexing taking so much time to complete.

2012-02-25 Thread Erick Erickson
Right. My situation is simple: I have a 32G dump of
Wikipedia data in one big XML file. I can parse it and
dump it into a (local) Solr instance at 5-7K
records/second. But it's stupid-simple, just a few
fields and no database involved, and much of the 32G
is XML markup. That serves to illustrate
that the size of the data to be imported isn't much
information to go on...

bq: 60GB data set that I will need to index for the
project I am working on would take over three days
to import using the configuration that I have now.

OK, the first thing I'd do is figure out what's taking
the time. Consider switching to SolrJ for your indexing
process; it can make debugging much easier.
Here's a blog post:
http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/
When you start getting to 60G of data to import,
you might want finer control over what you're
doing and better error reporting, as well as
being better able to pinpoint where your problems
are.

You can also just spin through the data-retrieval
part, without sending anything to Solr, to answer the
first question you need to answer: "what's taking the
time?" Is it fetching the data? Sending it to Solr? Do
you have Tika in here somewhere? Network latency?
Once you have the SolrJ process set up, you can
selectively remove steps to determine what
the bottleneck is and go from there.
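
The "remove a step and time it" idea might look roughly
like the sketch below. It is plain Java, not SolrJ: the
class name IndexingProbe, the source/sink parameters and
the sendEnabled switch are all made up for illustration,
and the sink is only a stand-in for whatever call actually
sends a batch of documents to Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical timing scaffold; none of these names come from SolrJ.
class IndexingProbe {

    /** Documents seen and nanoseconds spent in each stage of one run. */
    static final class Result {
        long docs;
        long fetchNanos;
        long sendNanos;
    }

    /**
     * Pulls records until the source returns null, batching them for the sink.
     * With sendEnabled == false the sink is never called, so the run measures
     * data retrieval on its own.
     */
    static Result run(Supplier<String> source, Consumer<List<String>> sink,
                      int batchSize, boolean sendEnabled) {
        Result r = new Result();
        List<String> batch = new ArrayList<>();
        while (true) {
            long t0 = System.nanoTime();
            String rec = source.get();            // stage 1: fetch one record
            r.fetchNanos += System.nanoTime() - t0;
            if (rec == null) {
                break;                            // source exhausted
            }
            r.docs++;
            batch.add(rec);
            if (batch.size() >= batchSize) {
                r.sendNanos += flush(batch, sink, sendEnabled);
            }
        }
        r.sendNanos += flush(batch, sink, sendEnabled); // final partial batch
        return r;
    }

    /** Sends a copy of the batch (if enabled), clears it, returns elapsed nanos. */
    private static long flush(List<String> batch, Consumer<List<String>> sink,
                              boolean sendEnabled) {
        long elapsed = 0;
        if (sendEnabled && !batch.isEmpty()) {
            long t0 = System.nanoTime();
            sink.accept(new ArrayList<>(batch));  // stage 2: send to the indexer
            elapsed = System.nanoTime() - t0;
        }
        batch.clear();
        return elapsed;
    }

    public static void main(String[] args) {
        // Stand-in source: 10,000 fake records, then null.
        int[] remaining = {10_000};
        Supplier<String> source = () -> remaining[0]-- > 0 ? "record" : null;
        // Stand-in sink: replace with the real call that sends a batch to Solr.
        Consumer<List<String>> sink = docs -> { };

        Result r = run(source, sink, 1000, false); // fetch-only pass
        System.out.printf("docs=%d fetch=%dms send=%dms%n",
                r.docs, r.fetchNanos / 1_000_000, r.sendNanos / 1_000_000);
    }
}
```

Run it once with sendEnabled set to false and once with it
set to true. If the fetch-only pass already takes most of
the wall-clock time, the bottleneck is upstream of Solr.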

Hope that helps
Erick


RE: Indexing taking so much time to complete.

2012-02-25 Thread Mike O'Leary
What's your secret?

OK, that question is not the kind recommended in the UsingMailingLists 
suggestions, so I will write again soon with a description of my data and what 
I am trying to do, and ask more specific questions. And I don't mean to hijack 
the thread, but I am in the same boat as the poster.

I just started working with Solr less than two months ago, and after beginning 
with a completely naïve approach to indexing database contents with 
DataImportHandler and then making small adjustments to improve performance as I 
learned about them, I have gotten some smaller datasets to import in a 
reasonable amount of time, but the 60GB data set that I will need to index for 
the project I am working on would take over three days to import using the 
configuration that I have now. Obviously you're doing something different than 
I am...

What things would you say have made the biggest improvement in indexing 
performance with the 32GB data set that you mentioned? How long do you think it 
would take to index that same data set if you used Solr more or less out of the 
box with no attempts to improve its performance?
Thanks,
Mike


Re: Indexing taking so much time to complete.

2012-02-25 Thread Erick Erickson
You have to tell us a lot more about what you're trying to do. I can
import 32G in about 20 minutes, so obviously you're doing
something different than I am...
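
For a sense of scale, the two figures in this thread can
be turned into rough throughput numbers. The sketch below
is only arithmetic: it assumes 1 GB = 1024 MB and treats
"about 20 minutes" and "more than 20 hours" as exact, so
the results are ballpark values. The class and method
names are made up for illustration.

```java
// Hypothetical helper; the inputs are the figures quoted in this thread.
class ThroughputCheck {

    /** Megabytes per second, assuming 1 GB = 1024 MB. */
    static double mbPerSecond(double gigabytes, double seconds) {
        return gigabytes * 1024.0 / seconds;
    }

    public static void main(String[] args) {
        double fast = mbPerSecond(32, 20 * 60);    // 32G in about 20 minutes
        double slow = mbPerSecond(50, 20 * 3600);  // 50 GB in 20 hours
        System.out.printf("32G in 20 minutes: %.1f MB/s%n", fast);
        System.out.printf("50 GB in 20 hours: %.2f MB/s%n", slow);
        System.out.printf("ratio: %.0fx%n", fast / slow);
    }
}
```

That works out to roughly 27.3 MB/s against 0.71 MB/s, a
gap of about 38x. A gap that size points at the pipeline
rather than at the raw size of the data.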

Perhaps you might review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Sat, Feb 25, 2012 at 12:00 AM, Suneel wrote:
> Hi All,
>
> I am using Apache Solr 3.1 and trying to cache 50 GB of records, but it
> is taking more than 20 hours, which makes updating records very painful.
>
> 1. Is there any way to reduce the caching time, or is this time normal
> for 50 GB of records?
>
> 2. What is delta-import? Would it help me cache only the updated
> records rather than caching all records again?
>
> Please help me with the questions above.
>
>
> Thanks & Regards,
>
> -
> Suneel Pandey
> Sr. Software Developer