Another revelation...
I can see a time difference in the Solr output for adding these
documents when I watch it in real time.
Here are some rows from the 3.5 Solr server:

Jan 23, 2013 11:57:23 AM org.apache.solr.core.SolrCore execute
INFO: [gxdResult] webapp=/solr path=/update/javabin
params={wt=javabin&version=2} status=0 QTime=6196
Jan 23, 2013 11:57:23 AM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[RNA in situ-1386104, RNA in situ-1351487, RNA in situ-1363917,
RNA in situ-1377125, RNA in situ-1371738, RNA in situ-1378746, RNA in
situ-1383410, RNA in situ-1362712, ... (1001 adds)]} 0 6266
Jan 23, 2013 11:57:23 AM org.apache.solr.core.SolrCore execute
INFO: [gxdResult] webapp=/solr path=/update/javabin
params={wt=javabin&version=2} status=0 QTime=6266
Jan 23, 2013 11:57:24 AM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[RNA in situ-1371578, RNA in situ-1377716, RNA in situ-1378151,
RNA in situ-1360580, RNA in situ-1391657, RNA in situ-1370288, RNA in
situ-1388236, RNA in situ-1361465, ... (1001 adds)]} 0 6371
Jan 23, 2013 11:57:24 AM org.apache.solr.core.SolrCore execute
INFO: [gxdResult] webapp=/solr path=/update/javabin
params={wt=javabin&version=2} status=0 QTime=6371
Jan 23, 2013 11:57:24 AM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[RNA in situ-1350555, RNA in situ-1350887, RNA in situ-1379699,
RNA in situ-1373773, RNA in situ-1374004, RNA in situ-1372265, RNA in
situ-1373027, RNA in situ-1380691, ... (1001 adds)]} 0 6440
Jan 23, 2013 11:57:24 AM org.apache.solr.core.SolrCore execute



And here are some from the 4.0 Solr server:

Jan 23, 2013 3:40:22 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-115650, RNA in situ-4109, RNA in situ-107614, RNA in
situ-86038, RNA in situ-19647, RNA in situ-1422, RNA in situ-119536, RNA
in situ-77775, RNA in situ-86825, RNA in situ-91009, ... (1001 adds)]} 0
3105
Jan 23, 2013 3:40:23 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-38103, RNA in situ-15797, RNA in situ-79946, RNA in
situ-124877, RNA in situ-62025, RNA in situ-67908, RNA in situ-70527, RNA
in situ-20581, RNA in situ-107574, RNA in situ-96497, ... (1001 adds)]} 0
2689
Jan 23, 2013 3:40:24 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-35518, RNA in situ-50512, RNA in situ-109961, RNA in
situ-113025, RNA in situ-33729, RNA in situ-116967, RNA in situ-133871,
RNA in situ-55287, RNA in situ-67367, RNA in situ-8617, ... (1001 adds)]}
0 2367
Jan 23, 2013 3:40:28 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-105749, RNA in situ-125415, RNA in situ-14667, RNA in
situ-41067, RNA in situ-1099, RNA in situ-86169, RNA in situ-90834, RNA in
situ-114639, RT-PCR-26160, RNA in situ-79745, ... (1001 adds)]} 0 3401
Jan 23, 2013 3:40:28 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-82061, RNA in situ-96965, RNA in situ-22677, RNA in
situ-52637, RNA in situ-131842, RNA in situ-31863, RNA in situ-111656, RNA
in situ-120509, RNA in situ-29659, RNA in situ-63579, ... (1001 adds)]} 0
3580
Jan 23, 2013 3:40:31 PM
org.apache.solr.update.processor.LogUpdateProcessor finish



I know that they aren't the same exact documents (like I said, there are
millions to load), but the times look pretty much like this for all of
them.

Can someone help me interpret these times? It *appears* to me that the
inserts are happening just as fast in 4.0 as in 3.5, if not faster, BUT
the gaps between the LogUpdateProcessor timestamps are much longer in
4.0.
I do not have the <updateLog> tag anywhere in my solrconfig.xml. So why
does it look to me like it is spending a lot of time logging? It shouldn't
really be logging anything, right? Bear in mind that these inserts happen
in threads that are pushing to Solr concurrently. So if 4.0 is logging
somewhere that 3.5 didn't, then the file-locking on that log file could be
slowing me down.
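
One way to read these entries, as a rough sketch rather than a definitive tool: assuming the last number of each LogUpdateProcessor entry is the request's elapsed time in milliseconds (in the 3.5 output it matches the QTime on the line that follows), the Java below pulls out each entry's timestamp and elapsed value so the per-request cost can be compared with the wall-clock gap between entries. The class name UpdateLogTimes, its methods, and the sample string are made up for illustration; the sample keeps only the first id of each add list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Solr): summarizes LogUpdateProcessor entries.
class UpdateLogTimes {

    /** One parsed entry: when it was logged and its trailing elapsed value. */
    static final class Entry {
        final int secondOfDay;     // log timestamp as seconds since midnight
        final long elapsedMillis;  // assumed to be elapsed milliseconds

        Entry(int secondOfDay, long elapsedMillis) {
            this.secondOfDay = secondOfDay;
            this.elapsedMillis = elapsedMillis;
        }
    }

    // Timestamp, the LogUpdateProcessor line, then the "(N adds)]} 0 <elapsed>" tail.
    // DOTALL lets ".*?" cross line breaks, since entries wrap over several lines.
    private static final Pattern ENTRY = Pattern.compile(
        "(\\d{1,2}):(\\d{2}):(\\d{2}) ([AP]M)\\s+"
            + "org\\.apache\\.solr\\.update\\.processor\\.LogUpdateProcessor finish"
            + ".*?\\(\\d+ adds\\)\\]\\}\\s+\\d+\\s+(\\d+)",
        Pattern.DOTALL);

    static List<Entry> parse(String log) {
        List<Entry> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(log);
        while (m.find()) {
            int hour = Integer.parseInt(m.group(1)) % 12;  // 12 o'clock becomes 0
            if (m.group(4).equals("PM")) {
                hour += 12;
            }
            int second = hour * 3600
                + Integer.parseInt(m.group(2)) * 60
                + Integer.parseInt(m.group(3));
            entries.add(new Entry(second, Long.parseLong(m.group(5))));
        }
        return entries;
    }

    /** Average elapsed value per request, in milliseconds. */
    static double meanElapsedMillis(List<Entry> entries) {
        long total = 0;
        for (Entry e : entries) {
            total += e.elapsedMillis;
        }
        return entries.isEmpty() ? Double.NaN : (double) total / entries.size();
    }

    /** Seconds between the first and last entry (same-day timestamps assumed). */
    static int wallClockSpanSeconds(List<Entry> entries) {
        if (entries.size() < 2) {
            return 0;
        }
        return entries.get(entries.size() - 1).secondOfDay - entries.get(0).secondOfDay;
    }

    public static void main(String[] args) {
        // Two entries from the 4.0 output above, with the add lists abbreviated.
        String sample =
            "Jan 23, 2013 3:40:22 PM\n"
            + "org.apache.solr.update.processor.LogUpdateProcessor finish\n"
            + "INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}\n"
            + "{add=[RNA in situ-115650, ... (1001 adds)]} 0\n"
            + "3105\n"
            + "Jan 23, 2013 3:40:23 PM\n"
            + "org.apache.solr.update.processor.LogUpdateProcessor finish\n"
            + "INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}\n"
            + "{add=[RNA in situ-38103, ... (1001 adds)]} 0\n"
            + "2689\n";
        List<Entry> entries = parse(sample);
        System.out.println("entries parsed:      " + entries.size());
        System.out.println("mean elapsed (ms):   " + meanElapsedMillis(entries));
        System.out.println("wall-clock span (s): " + wallClockSpanSeconds(entries));
    }
}
```

Run against the full output, the comparison to look at is mean elapsed time versus wall-clock span. With several threads sending at once, requests overlap, so the gap between entries can be much shorter than any single request's elapsed time. The timestamps only have one-second resolution, so a span measured over a handful of entries is approximate.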

-Kevin

On 1/23/13 12:03 PM, "Kevin Stone" <kevin.st...@jax.org> wrote:

>I'm still poking around trying to find the differences. I found a couple
>things that may or may not be relevant.
>First, when I start up my 3.5 solr, I get all sorts of warnings that my
>solrconfig is old and will run using 2.4 emulation.
>Of course I had to upgrade the solrconfig for the 4.0 instance (which I
>already described). I am curious if there could be some feature I was
>taking advantage of in 2.4 that doesn't exist now in 4.0. I don't know.
>
>Second, when I look at the console logs for my servers (3.5 and 4.0) and
>run the indexer against each, I see a subtle difference in the printout
>when it connects to the Solr core.
>The 3.5 version prints this out:
>webapp=/solr path=/update
>params={waitSearcher=true&wt=javabin&commit=true&softCommit=false&version=2}
>{commit=} 0 2722
>
>
>The 4.0 version prints this out:
>webapp=/solr path=/update/javabin
>params={wt=javabin&commit=true&waitFlush=true&waitSearcher=true&version=2}
>status=0 QTime=1404
>
>
>
>The params for the update handler seem ever so slightly different. The 3.5
>version (the one that runs fast) has the setting softCommit=false.
>The 4.0 version does not print that setting, but instead prints the
>setting waitFlush=true.
>
>These could be irrelevant, but I thought I should add the information.
>
>-Kevin
>
>On 1/23/13 11:42 AM, "Kevin Stone" <kevin.st...@jax.org> wrote:
>
>>Do you mean commenting out the <updateLog>...</updateLog> tag? Because
>>that I already commented out. Or do I also need to remove the entire
>><updateHandler> tag? Sorry, I am not too familiar with everything in the
>>solrconfig file. I have a tag that essentially looks like this:
>>
>><updateHandler class="solr.DirectUpdateHandler2"></updateHandler>
>>
>>
>>Everything inside is commented out.
>>
>>-Kevin
>>
>>On 1/23/13 11:21 AM, "Mark Miller" <markrmil...@gmail.com> wrote:
>>
>>>It's hard to guess, but I might start by looking at what the new
>>>UpdateLog is costing you. Take its definition out of solrconfig.xml and
>>>try your test again. Then let's take it from there.
>>>
>>>- Mark
>>>
>>>On Jan 23, 2013, at 11:00 AM, Kevin Stone <kevin.st...@jax.org> wrote:
>>>
>>>> I am having some difficulty migrating our Solr indexing scripts from
>>>>Solr 3.5 to Solr 4.0. Notably, I am trying to track down why our
>>>>indexing performance in Solr 4.0 is about 5-10 times slower. Querying
>>>>is still quite fast.
>>>>
>>>> The code adds documents in groups of 1000, and sends each group to
>>>>Solr in a thread. The documents are somewhat large, including maybe
>>>>30-40 different field types, mostly multivalued. Here are some snippets
>>>>of the code we used in 3.5.
>>>>
>>>>
>>>> MultiThreadedHttpConnectionManager mgr =
>>>>     new MultiThreadedHttpConnectionManager();
>>>> HttpClient client = new HttpClient(mgr);
>>>> CommonsHttpSolrServer server =
>>>>     new CommonsHttpSolrServer("some url for our index", client);
>>>> server.setRequestWriter(new BinaryRequestWriter());
>>>>
>>>>
>>>> Then, we delete the index, and proceed to generate documents and load
>>>>the groups in a thread that looks kind of like this. I've omitted some
>>>>overhead for handling exceptions, and retry attempts.
>>>>
>>>>
>>>> class DocWriterThread implements Runnable
>>>> {
>>>>     CommonsHttpSolrServer server;
>>>>     Collection<SolrInputDocument> docs;
>>>>     private int commitWithin = 50000; // milliseconds (50 seconds)
>>>>
>>>>     public DocWriterThread(CommonsHttpSolrServer server,
>>>>                            Collection<SolrInputDocument> docs)
>>>>     {
>>>>         this.server = server;
>>>>         this.docs = docs;
>>>>     }
>>>>
>>>>     public void run()
>>>>     {
>>>>         // add the batch, using the commitWithin feature
>>>>         // (exception handling and retries omitted here)
>>>>         server.add(docs, commitWithin);
>>>>     }
>>>> }
>>>>
>>>>
>>>> Now, I've had to change some things to get this to compile with the
>>>>Solr 4.0 libraries. Here is what I tried to convert the above code to.
>>>>I don't know if these are the correct equivalents, as I am not familiar
>>>>with Apache HttpComponents.
>>>>
>>>>
>>>>
>>>> ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager();
>>>> DefaultHttpClient client = new DefaultHttpClient(mgr);
>>>> HttpSolrServer server =
>>>>     new HttpSolrServer("some url for our solr index", client);
>>>> server.setRequestWriter(new BinaryRequestWriter());
>>>>
>>>>
>>>>
>>>>
>>>> The thread method is the same, but uses HttpSolrServer instead of
>>>>CommonsHttpSolrServer.
>>>>
>>>> We also had an old solrconfig (not sure what version, but it is
>>>>pre-3.x and had mostly default values) that I had to replace with a
>>>>4.0-style solrconfig.xml. I don't want to post the entire file (as it is
>>>>large), but I copied one from the Solr 4.0 examples and made a couple of
>>>>changes. First, I wanted to turn off transaction logging, so essentially
>>>>I have a line like this (everything inside is commented out):
>>>>
>>>>
>>>> <updateHandler class="solr.DirectUpdateHandler2"></updateHandler>
>>>>
>>>>
>>>> And I added a handler for javabin:
>>>>
>>>>
>>>> <requestHandler name="/update/javabin"
>>>>                 class="solr.BinaryUpdateRequestHandler">
>>>>   <lst name="defaults">
>>>>     <str name="stream.contentType">application/javabin</str>
>>>>   </lst>
>>>> </requestHandler>
>>>>
>>>> I'm not sure what other configurations I should look at. I would think
>>>>that there should be a big, obvious reason why the indexing performance
>>>>would drop nearly tenfold.
>>>>
>>>> Against our 3.5 instance I timed our index load, and it adds roughly
>>>>40,000 documents every 3-8 seconds.
>>>>
>>>> Against our 4.0 instance it adds 40,000 documents every 70-75 seconds.
>>>>
>>>> This isn't the end of the world, and I would love to use the new join
>>>>feature in solr 4.0. However, we have many different indexes with
>>>>millions of documents, and this kind of increase in load time is
>>>>troubling.
>>>>
>>>>
>>>> Thanks for your help.
>>>>
>>>>
>>>> -Kevin
>>>>
>>>>
>>>> The information in this email, including attachments, may be
>>>>confidential and is intended solely for the addressee(s). If you
>>>>believe
>>>>you received this email by mistake, please notify the sender by return
>>>>email as soon as possible.
>>>
>>
>>
>
>


