Why would I need faster hardware if my current hardware isn't reaching its max capacity?

I'm already using separate machines for querying and indexing, so queries aren't affected while indexing. Pulling an optimized snapshot isn't even noticeable on the query machines.

Thijs


On 20-5-2010 17:25, Dennis Gearon wrote:
It takes that long to do indexing? I'm HOPING to have a site that has somewhere
between the low tens of millions and billions of documents.

Sounds to me like I will DEFINITELY need a cloud account at indexing time. For 
the original author of this thread, that's what I'd recommend.

1/ Optimize as best as you can on one machine.
2/ Set up an Amazon EC2 (Elastic Compute Cloud) account. Shard the indexing
over 5-10 machines during indexing. Combine the index, then shut down the EC2
instances. You could probably get it down to 1/2 hour, without impacting your
current queries.
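
A minimal sketch of how step 2 might split documents across indexing machines, assuming each document has a numeric ID and that a plain modulo over the ID spreads the load evenly enough. The class and method names are hypothetical, and combining the resulting indexes afterwards is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decides which indexing machine receives which document.
class ShardRouter {

    /** Maps a document ID to a shard in [0, numShards); floorMod keeps negative IDs in range. */
    static int shardFor(long objectId, int numShards) {
        return (int) Math.floorMod(objectId, (long) numShards);
    }

    /** Groups the given IDs into one list per shard. */
    static List<List<Long>> partition(List<Long> objectIds, int numShards) {
        List<List<Long>> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shards.add(new ArrayList<>());
        }
        for (long id : objectIds) {
            shards.get(shardFor(id, numShards)).add(id);
        }
        return shards;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long id = 0; id < 20; id++) {
            ids.add(id);
        }
        List<List<Long>> shards = partition(ids, 5);
        for (int i = 0; i < shards.size(); i++) {
            System.out.println("shard " + i + " gets " + shards.get(i).size() + " documents");
        }
    }
}
```

If the IDs are unevenly distributed, hashing the ID before taking the modulo may balance the shards better.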


Dennis Gearon


--- On Thu, 5/20/10, Nagelberg, Kallin<knagelb...@globeandmail.com>  wrote:

From: Nagelberg, Kallin<knagelb...@globeandmail.com>
Subject: RE: Machine utilization while indexing
To: "'solr-user@lucene.apache.org'"<solr-user@lucene.apache.org>
Date: Thursday, May 20, 2010, 8:16 AM
How about putting a BlockingQueue
(http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html)
between your document creator and the SolrServer? Give it a size
of 10,000 or so, with one thread feeding it and another
thread waiting for it to get nearly full and then draining
it. Take the drained results and add them to the server
(maybe try not using StreamingUpdateSolrServer). Something like
that worked well for me with about 5,000,000 documents of
~5k each, taking about 8 hours.
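
A minimal sketch of that producer/consumer arrangement, using only java.util.concurrent. It is not the code I actually ran: the BatchSink interface and all class and method names are hypothetical, integers stand in for documents, and the consumer drains up to a fixed batch size as soon as anything is available rather than waiting for the queue to get nearly full. In a real indexer the sink would hand each batch to SolrJ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

class QueueFeeder {

    /** Hypothetical sink; a real indexer would send the batch to the Solr server here. */
    interface BatchSink<T> {
        void send(List<T> batch);
    }

    /** Drains the queue in batches until the producer is done and the queue is empty. */
    static <T> int drainLoop(BlockingQueue<T> queue, int batchSize,
                             BooleanSupplier producerDone, BatchSink<T> sink)
            throws InterruptedException {
        int total = 0;
        while (true) {
            // Wait briefly for one element so the loop does not spin on an empty queue.
            T first = queue.poll(100, TimeUnit.MILLISECONDS);
            if (first == null) {
                if (producerDone.getAsBoolean() && queue.isEmpty()) {
                    return total;
                }
                continue;
            }
            List<T> batch = new ArrayList<>(batchSize);
            batch.add(first);
            queue.drainTo(batch, batchSize - 1); // take whatever else is ready, up to the batch size
            sink.send(batch);
            total += batch.size();
        }
    }

    /** Runs one producer thread and drains on the calling thread; returns the number of items sent. */
    static int run(int docCount, int capacity, int batchSize) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        AtomicBoolean done = new AtomicBoolean(false);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < docCount; i++) {
                    queue.put(i); // blocks while the queue is full, which bounds memory use
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done.set(true); // set only after the last put
            }
        });
        producer.start();
        try {
            int sent = drainLoop(queue, batchSize, done::get, batch -> { });
            producer.join();
            return sent;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("sent " + run(25_000, 10_000, 1_000) + " documents");
    }
}
```

The bounded queue is the point: put() blocks the document creator when the consumer falls behind, so documents cannot pile up in memory.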

-Kallin Nagelberg

-----Original Message-----
From: Thijs [mailto:vonk.th...@gmail.com]

Sent: Thursday, May 20, 2010 11:02 AM
To: solr-user@lucene.apache.org
Subject: Machine utilization while indexing

Hi.

I have a question about how I can get Solr to index quicker
than it does at the moment.

I have to index (and re-index) some 3-5 million documents.
These documents are preprocessed by a Java application that
effectively combines multiple database tables with each other
to form the SolrInputDocument.

What I'm seeing, however, is that the queue of documents that
are ready to be sent to the Solr server exceeds my preset limit,
which tells me that Solr somehow can't process the documents
fast enough.

(I have created my own queue in front of SolrJ's
StreamingUpdateSolrServer, as it would not process the documents
fast enough, causing OutOfMemoryErrors due to the large number
of documents building up in its queue.)

I have an index that for 95% consists of IDs (Long). We
don't do any analysis on the fields that are being indexed.
The schema is rather straightforward.

Most fields look like:
<fieldType name="long" class="solr.LongField" omitNorms="true"/>
<field name="objectId" type="long" stored="true" indexed="true" required="true" />
<field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>

The relevant part of solrconfig.xml:
<indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>100</mergeFactor>
    <RAMBufferSizeMB>256</RAMBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>
    <lockType>single</lockType>
</indexDefaults>


The machines I'm testing on have an
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
with 4GB of RAM, running Linux, Java version 1.6.0_17,
Tomcat 6 and Solr version 1.4.

What I'm seeing is that the network almost never reaches
more than 10% of the 1GB/s connection, and that CPU
utilization is always below 25% (one core is used, not the
others). I don't see heavy disk I/O.
Also, while indexing the memory consumption is:
Free memory: 212.15 MB Total memory: 509.12 MB Max memory: 2730.68 MB

Also, in the beginning (with an empty index) I get 2ms
per insert, but this slows to 18-19ms per insert.
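
To put those timings in perspective, here is a back-of-the-envelope estimate, assuming documents are inserted strictly one after another at a constant per-insert cost (the class and method names are hypothetical). Under that assumption, 3-5 million documents take roughly 1.7 to 2.8 hours at 2ms per insert, and roughly 15 to 26 hours at 18-19ms per insert.

```java
// Hypothetical helper for estimating total indexing time from a per-insert cost.
class IndexTimeEstimate {

    private static final double MS_PER_HOUR = 3_600_000.0;

    /** Total hours needed to insert `docs` documents sequentially at `msPerInsert` each. */
    static double hours(long docs, double msPerInsert) {
        return docs * msPerInsert / MS_PER_HOUR;
    }

    public static void main(String[] args) {
        long[] docCounts = {3_000_000L, 5_000_000L};
        double[] costsMs = {2.0, 18.0, 19.0};
        for (long docs : docCounts) {
            for (double cost : costsMs) {
                System.out.printf("%d docs at %.0f ms/insert: %.1f hours%n",
                        docs, cost, hours(docs, cost));
            }
        }
    }
}
```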

Are there any tips/tricks I can use to speed up my
indexing? I have a feeling that my machine is capable of
doing more (using more CPUs). I just can't figure out how.

Thijs

