Re: Solr indexing configuration help

Yonik Seeley Wed, 28 May 2008 19:00:57 -0700

Not sure why you would be getting an OOM from just indexing, and with
the 1.5G heap you've given the JVM.
Have you tried Sun's JVM?


-Yonik

On Wed, May 28, 2008 at 7:35 PM, gaku113 <[EMAIL PROTECTED]> wrote:
>
> Hi all Solr users/developers/experts,
>
> I have the following scenario and would appreciate any advice on tuning my
> Solr master server.
>
> I have a field in my schema that is indexed (but not stored) and holds about
> 10,000 ids for each document.  This field is expected to dominate the size of
> the document.  Each id can contain up to 6 characters.  I figure that there
> are two alternatives for this field: one is to use a multi-valued string
> field, and the other is to pass a whitespace-delimited string to Solr and
> have Solr tokenize that string on whitespace (the text_ws fieldType).
> The master server is expected to receive a constant stream of updates.
>
> The estimated document size ranges from 50k to 100k for a single
> document.  (I know this is quite large.)  The number of documents is expected
> to be around 200,000 on each master server, and there can be multiple master
> servers (sharding).  I would like the master to handle more docs too, if I
> can find a way.
>
> Currently, I'm running some basic stress tests to simulate the indexing
> side on the master server.  The stress test continuously adds new
> documents at a rate of about 10 documents every 30 seconds.  Autocommit is
> being used (limits of 50 docs and 180 seconds), but I have no idea if this
> is the preferred way.  The goal is to keep adding new documents until we
> get at least 200,000 documents (or about 20GB of index) on the master (or
> even more if the server can handle it).
>
> What I saw in the indexing stress test is that the master server stopped
> responding after a while (for example, it no longer answered pings) once
> there were about 30k documents.  The log entries are mostly:
> java.lang.OutOfMemoryError: Java heap space
> OR
> Ping query caused exception: null (this is probably caused by the OOM
> problem)
>
> There were also a few cases where the Java process disappeared entirely.
>
> Questions:
> 1)      Is it better to use the multi-valued string field or the text_ws field
> for this large field?
> 2)      Is it better to have more outstanding docs per commit or more frequent
> commits, in terms of making the most of server resources?  What is the
> preferred way to commit documents, assuming that the Solr master receives
> updates frequently?  How many updated docs should there be before issuing a
> commit?
> 3)      How can I avoid the OOM problem in my case?  I'm already using
> -Xms1536M -Xmx1536M on a 2-GB machine.  Is that not enough?  I'm concerned
> that adding more RAM would just delay the OOM problem.  Are there any
> additional JVM options to consider?
> 4)      Any recommendations for the master server configuration, so that I can
> maximize the number of indexed docs?
> 5)      How can I disable caching on the master altogether, since queries
> won't hit the master?
> 6)      Is an average doc size of 50k-100k too large for Solr, or is Solr even
> the right tool?  If not, are there any alternatives?  If we are able to
> reduce the size of docs, can we expect to index more documents?
>
> The following is information about the software, hardware, and configuration:
>
> Solr version (solr nightly build on 5/23/2008)
>        Solr Specification Version: 1.2.2008.05.23.08.06.59
>        Solr Implementation Version: nightly
>        Lucene Specification Version: 2.3.2
>        Lucene Implementation Version: 2.3.2 652650
>        Jetty: 6.1.3
>
> Schema.xml (the sections that I think are relevant to the master server)
>
>    <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> omitNorms="true"/>
>    <fieldType name="text_ws" class="solr.TextField"
> positionIncrementGap="100">
>      <analyzer>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      </analyzer>
>    </fieldType>
>
>    <field name="id" type="string" indexed="true" stored="true"
>           required="true"/>
>    <field name="hex_id_multi" type="string" indexed="true" stored="false"
>           multiValued="true" omitNorms="true"/>
>    <field name="hex_id_string" type="text_ws" indexed="true" stored="false"
>           omitNorms="true"/>
>
> <uniqueKey>id</uniqueKey>
>
> Solrconfig.xml
>  <indexDefaults>
>    <useCompoundFile>false</useCompoundFile>
>    <mergeFactor>10</mergeFactor>
>    <maxBufferedDocs>500</maxBufferedDocs>
>    <ramBufferSizeMB>50</ramBufferSizeMB>
>    <maxMergeDocs>5000</maxMergeDocs>
>    <maxFieldLength>20000</maxFieldLength>
>    <writeLockTimeout>1000</writeLockTimeout>
>    <commitLockTimeout>10000</commitLockTimeout>
>
>    <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
>    <mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>
>    <lockType>single</lockType>
>  </indexDefaults>
>
>  <mainIndex>
>    <useCompoundFile>false</useCompoundFile>
>    <ramBufferSizeMB>50</ramBufferSizeMB>
>    <mergeFactor>10</mergeFactor>
>    <!-- Deprecated -->
>    <maxBufferedDocs>500</maxBufferedDocs>
>    <maxMergeDocs>5000</maxMergeDocs>
>    <maxFieldLength>20000</maxFieldLength>
>    <unlockOnStartup>false</unlockOnStartup>
>  </mainIndex>
>  <updateHandler class="solr.DirectUpdateHandler2">
>
>    <autoCommit>
>      <maxDocs>50</maxDocs>
>      <maxTime>180000</maxTime>
>    </autoCommit>
>    <listener event="postCommit" class="solr.RunExecutableListener">
>      <str name="exe">solr/bin/snapshooter</str>
>      <str name="dir">.</str>
>      <bool name="wait">true</bool>
>    </listener>
>  </updateHandler>
>
>  <query>
>    <maxBooleanClauses>50</maxBooleanClauses>
>    <filterCache
>      class="solr.LRUCache"
>      size="0"
>      initialSize="0"
>      autowarmCount="0"/>
>    <queryResultCache
>      class="solr.LRUCache"
>      size="0"
>      initialSize="0"
>      autowarmCount="0"/>
>    <documentCache
>      class="solr.LRUCache"
>      size="0"
>      initialSize="0"
>      autowarmCount="0"/>
>    <enableLazyFieldLoading>true</enableLazyFieldLoading>
>
>    <queryResultWindowSize>1</queryResultWindowSize>
>    <queryResultMaxDocsCached>1</queryResultMaxDocsCached>
>    <HashDocSet maxSize="1000" loadFactor="0.75"/>
>    <listener event="newSearcher" class="solr.QuerySenderListener">
>      <arr name="queries">
>        <lst> <str name="q">user_id</str> <str name="start">0</str> <str
> name="rows">1</str> </lst>
>        <lst><str name="q">static newSearcher warming query from
> solrconfig.xml</str></lst>
>      </arr>
>    </listener>
>    <listener event="firstSearcher" class="solr.QuerySenderListener">
>      <arr name="queries">
>        <lst> <str name="q">fast_warm</str> <str name="start">0</str> <str
> name="rows">10</str> </lst>
>        <lst><str name="q">static firstSearcher warming query from
> solrconfig.xml</str></lst>
>      </arr>
>    </listener>
>    <useColdSearcher>false</useColdSearcher>
>    <maxWarmingSearchers>4</maxWarmingSearchers>
>  </query>
>
> Replication:
>        The snappuller is scheduled to run every 15 mins for now.
>
> Hardware:
>        AMD (2.1GHz) dual core with 2GB ram 160GB SATA harddrive
>
> OS:
>        Fedora 8 (64-bit)
>
> JVM version:
>        java version "1.7.0"
> IcedTea Runtime Environment (build 1.7.0-b21)
> IcedTea 64-Bit Server VM (build 1.7.0-b21, mixed mode)
>
> Java options:
>        java  -Djetty.home=/path/to/solr/home -d64 -Xms1536M -Xmx1536M
> -XX:+UseParallelGC -jar start.jar
>
>
> --
> View this message in context: 
> http://www.nabble.com/Solr-indexing-configuration-help-tp17524364p17524364.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
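A rough back-of-the-envelope check of the numbers in the quoted message, sketched in Python. All constants are copied from the message above. The function names are made up for illustration, and the model assumes documents arrive at a perfectly steady rate and that an autocommit fires as soon as either limit is reached, which is a simplification of what Solr actually does.

```python
# Figures copied from the quoted message; nothing here is measured.
DOCS_PER_BATCH = 10          # stress test adds about 10 documents ...
BATCH_INTERVAL_S = 30        # ... every 30 seconds
TARGET_DOCS = 200_000        # goal per master server
DOC_SIZE_KB = (50, 100)      # estimated size range of one document
AUTOCOMMIT_MAX_DOCS = 50     # <autoCommit><maxDocs>
AUTOCOMMIT_MAX_TIME_S = 180  # <autoCommit><maxTime> is 180000 ms


def seconds_to_reach(target_docs, docs_per_batch, batch_interval_s):
    """Wall-clock seconds needed to add target_docs at a steady rate."""
    return target_docs / docs_per_batch * batch_interval_s


def commit_interval_s(docs_per_batch, batch_interval_s, max_docs, max_time_s):
    """Approximate seconds between autocommits: whichever limit is hit first.

    Assumption: steady arrival rate, commit triggered by the first limit reached.
    """
    seconds_to_fill = max_docs / docs_per_batch * batch_interval_s
    return min(seconds_to_fill, max_time_s)


def raw_size_gb(target_docs, doc_size_kb):
    """Raw input size in decimal GB (this is not the on-disk index size)."""
    return target_docs * doc_size_kb / 1_000_000


if __name__ == "__main__":
    total_s = seconds_to_reach(TARGET_DOCS, DOCS_PER_BATCH, BATCH_INTERVAL_S)
    interval = commit_interval_s(DOCS_PER_BATCH, BATCH_INTERVAL_S,
                                 AUTOCOMMIT_MAX_DOCS, AUTOCOMMIT_MAX_TIME_S)
    print(f"time to reach target: {total_s / 86_400:.1f} days")
    print(f"approx. seconds between autocommits: {interval:.0f}")
    print(f"approx. number of commits: {total_s / interval:.0f}")
    for kb in DOC_SIZE_KB:
        print(f"raw input at {kb}k per doc: {raw_size_gb(TARGET_DOCS, kb):.0f} GB")
```

Under these assumptions the 50-doc limit is reached after about 150 seconds, before the 180-second maxTime, so the test would commit roughly every 150 seconds, or about 4,000 times on the way to 200,000 documents. Each of those commits also runs the postCommit snapshooter listener from the quoted solrconfig.xml. The 20 GB figure corresponds to raw input at 100k per document; the actual index size will differ.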
