Right. My situation is simple: I have a 32GB dump of
Wikipedia data in a big XML file. I can parse it and
dump it into a (local) Solr instance at 5-7K
records/second. But it's stupid-simple, just a few
fields and no database involved, and much of the 32GB
is XML markup anyway. Still, it serves to illustrate
that the size of the data to be imported isn't much
information to go on...
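For what it's worth, the parsing side of that can be done with a streaming parser so the whole 32GB never sits in memory. A minimal sketch using plain StAX; the `<page>`/`<title>` structure here is a made-up stand-in for the real Wikipedia dump schema:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.Reader;
import java.io.StringReader;

public class StreamParse {
    // Count <page> elements without ever loading the whole document.
    public static int countPages(Reader in) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int pages = 0;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "page".equals(r.getLocalName())) {
                pages++; // in real code, build a document for Solr here
            }
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<dump><page><title>A</title></page>"
                   + "<page><title>B</title></page></dump>";
        System.out.println(countPages(new StringReader(xml))); // prints 2
    }
}
```

The point of the streaming approach is that memory stays flat no matter how big the file is, which is what makes the 5-7K records/second rate sustainable.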
bq: 60GB data set that I will need to index for the
project I am working on would take over three days
to import using the configuration that I have now.
OK, the first thing I'd do is figure out what's taking
the time. Consider switching to SolrJ for your indexing
process; it can make debugging things much
easier. Here's a blog post:
http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/
When you start getting to 60G of data to import,
you might want finer control over what you're
doing, better error reporting, etc. as well as
being better able to pinpoint where your problems
are.
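By way of illustration, a SolrJ indexing process usually boils down to batching documents and sending each batch as it fills, with a single commit at the end. This is only a sketch: the client and document types are stubbed out with a plain Consumer so the batching logic stands alone, whereas real code would use SolrJ's server/client and SolrInputDocument classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchIndexer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sender; // stands in for solrServer.add(docs)
    private final List<T> buffer = new ArrayList<>();
    private int sent = 0;

    public BatchIndexer(int batchSize, Consumer<List<T>> sender) {
        this.batchSize = batchSize;
        this.sender = sender;
    }

    public void add(T doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) flush();
    }

    // Send whatever is buffered; call once more at the end, then commit.
    public void flush() {
        if (!buffer.isEmpty()) {
            sender.accept(new ArrayList<>(buffer));
            sent += buffer.size();
            buffer.clear();
        }
    }

    public int sent() { return sent; }

    public static void main(String[] args) {
        BatchIndexer<String> ix = new BatchIndexer<>(2, batch ->
                System.out.println("sending batch of " + batch.size()));
        ix.add("doc1"); ix.add("doc2"); ix.add("doc3");
        ix.flush(); // flush the final partial batch; commit once after this
        System.out.println("total sent: " + ix.sent()); // prints 3
    }
}
```

Batching like this (hundreds to a few thousand docs per add, one commit at the end) rather than adding and committing one document at a time is usually the first big win when moving off a naive indexing loop.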
And you can do things like spin through just the
data-retrieval part to answer the first question you
need to answer: what's taking the time? Is it
fetching the data? Sending it to Solr? Do you
have Tika in here somewhere? Network latency?
If you set up the SolrJ process, you can selectively
remove steps from the process to determine what
the bottleneck is and go from there.
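Concretely, the isolation step can be as simple as timing each stage with the later stages replaced by no-ops. A hypothetical sketch; the fetch/send loops here are stand-ins for your actual database query and Solr add calls:

```java
public class Bottleneck {
    // Run a stage and report elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable stage) {
        long t0 = System.nanoTime();
        stage.run();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        // Stand-ins for the real stages of the pipeline.
        Runnable fetchOnly = () -> {
            for (int i = 0; i < 1000; i++) { /* read a row, then discard it */ }
        };
        Runnable fetchAndSend = () -> {
            for (int i = 0; i < 1000; i++) { /* read a row AND send it to Solr */ }
        };

        System.out.println("fetch only:   " + timeMillis(fetchOnly) + " ms");
        System.out.println("fetch + send: " + timeMillis(fetchAndSend) + " ms");
        // If the two numbers are close, indexing isn't the bottleneck;
        // look at the retrieval side (queries, joins, Tika, network) instead.
    }
}
```

The same trick works stage by stage: fetch only, fetch + transform, fetch + transform + send. Whichever addition makes the number jump is where to spend your tuning effort.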
Hope that helps
Erick
On Sat, Feb 25, 2012 at 8:55 PM, Mike O'Leary wrote:
> What's your secret?
>
> OK, that question is not the kind recommended in the UsingMailingLists
> suggestions, so I will write again soon with a description of my data and
> what I am trying to do, and ask more specific questions. And I don't mean to
> hijack the thread, but I am in the same boat as the poster.
>
> I just started working with Solr less than two months ago. After beginning
> with a completely naïve approach to indexing database contents with
> DataImportHandler, and then making small adjustments to improve performance
> as I learned about them, I have gotten some smaller datasets to import in a
> reasonable amount of time. But the 60GB data set that I will need to index
> for the project I am working on would take over three days to import using
> the configuration that I have now. Obviously you're doing something
> different than I am...
>
> What things would you say have made the biggest improvement in indexing
> performance with the 32GB data set that you mentioned? How long do you think
> it would take to index that same data set if you used Solr more or less out
> of the box with no attempts to improve its performance?
> Thanks,
> Mike
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Saturday, February 25, 2012 2:51 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing taking so much time to complete.
>
> You have to tell us a lot more about what you're trying to do. I can import
> 32G in about 20 minutes, so obviously you're doing something different than I
> am...
>
> Perhaps you might review:
> http://wiki.apache.org/solr/UsingMailingLists
>
> Best
> Erick
>
> On Sat, Feb 25, 2012 at 12:00 AM, Suneel wrote:
>> Hi All,
>>
>> I am using Apache Solr 3.1 and trying to index 50 GB of records, but it
>> is taking more than 20 hours, which makes updating records very painful.
>>
>> 1. Is there any way to reduce the indexing time, or is this time normal
>> for 50 GB of records?
>>
>> 2. What is delta-import? Would it let me index only the updated records
>> rather than re-indexing all of them?
>>
>>
>>
>> Please help me with the questions above.
>>
>>
>> Thanks & Regards,
>>
>> -
>> Suneel Pandey
>> Sr. Software Developer
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Indexing-taking-so-much-time-to-complete-tp3774464p3774464.html
>> Sent from the Solr - User mailing list archive at Nabble.com.