Grant,

I am messing with the script, and with your tip I expect I can
make it loop over as many releases as needed.
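
For what it is worth, the loop I have in mind is roughly the following;
run_one_release is just a placeholder for the real per-revision steps
(check out the revision, build, wipe solr/data, start Tomcat, post the
CSV, time it, shut Tomcat down):

    #!/bin/sh
    # Sketch only: run_one_release stands in for the real per-revision work.
    run_one_release() {
        rev=$1
        echo "testing Solr r$rev"
        # ... checkout/build/index/time steps go here ...
    }

    for rev in 643465 734796 758795; do
        run_one_release "$rev"
    done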

I did run it again using the full file, this time on my iMac:-
        643465    took  22min 14sec             2008-04-01
        734796          73min 58sec             2009-01-15
        758795          70min 55sec             2009-03-26
I then ran it again using only the first 1M records:-
        643465    took  2m51.516s               2008-04-01
        734796          7m29.326s               2009-01-15
        758795          8m18.403s               2009-03-26
This time with commit=true:-
        643465    took  2m49.200s               2008-04-01
        734796          8m27.414s               2009-01-15
        758795          9m32.459s               2009-03-26
This time with commit=false&overwrite=false:-
        643465    took  2m46.149s               2008-04-01
        734796          3m29.909s               2009-01-15
        758795          3m26.248s               2009-03-26
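
In case you want to reproduce these, each of the runs above is
essentially one curl POST to the CSV handler with only the query
parameters varied. The host, port and gaz.csv file name below are just
stand-ins for my setup:

    # default run (no extra parameters)
    curl 'http://localhost:8080/solr/update/csv' \
         --data-binary @gaz.csv -H 'Content-type:text/plain; charset=utf-8'

    # commit=true run
    curl 'http://localhost:8080/solr/update/csv?commit=true' \
         --data-binary @gaz.csv -H 'Content-type:text/plain; charset=utf-8'

    # commit=false&overwrite=false run
    curl 'http://localhost:8080/solr/update/csv?commit=false&overwrite=false' \
         --data-binary @gaz.csv -H 'Content-type:text/plain; charset=utf-8'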

Just read your latest post. I will apply the patches and retest
the above.

>Can you try adding &overwrite=false and running against the latest  
>version?  My current working theory is that Solr/Lucene has changed  
>how deletes are handled such that work that was deferred before is now  
>not deferred as often.  In fact, you are not seeing this cost paid (or  
>at least not noticing it) because you are not committing, but I  
>believe you do see it when you are closing down Solr, which is why it  
>takes so long to exit.
It can take ages! (>15 min to get Tomcat to quit.) Also, my script does
have a separate commit step, which does not take any time!
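
For clarity, that commit step is just a separate request to the XML
update handler, something like (host and port as in my setup):

    curl 'http://localhost:8080/solr/update' --data-binary '<commit/>' \
         -H 'Content-type:text/xml; charset=utf-8'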

>I also think that Lucene adding fsync() into  
>the equation may cause some slow down, but that is a penalty we are  
>willing to pay as it gives us higher data integrity.
Data integrity is always good. However, if performance seems
unreasonable, users/customers tend to take things into their
own hands and kill the process or machine. That tends to be
very bad for data integrity.

>So, depending on how you have your data, I think a workaround is to:
>Add a field that contains a single term identifying the data type for  
>this particular CSV file, i.e. something like field: type, value:  
>fergs-csv
>Then, before indexing, you can issue a Delete By Query: type:fergs-csv  
>and then add your CSV file using overwrite=false.  This amounts to a  
>batch delete followed by a batch add, but without the add having to  
>issue deletes for each add.
OK... but for these test cases I am starting off with an empty
index. The script does an "rm -rf solr/data" before Tomcat is launched,
so I do not understand how the above helps, UNLESS there are duplicate
gaz entries.
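
Just so I am reading you right, I take the workaround to be roughly the
following (type / fergs-csv as in your example; host, port and gaz.csv
are stand-ins for my setup, and the extra type column would have to be
added to the CSV itself):

    # 1. batch delete anything previously loaded from this CSV
    curl 'http://localhost:8080/solr/update' \
         -H 'Content-type:text/xml; charset=utf-8' \
         --data-binary '<delete><query>type:fergs-csv</query></delete>'

    # 2. re-add the whole file without per-document deletes
    curl 'http://localhost:8080/solr/update/csv?overwrite=false' \
         --data-binary @gaz.csv -H 'Content-type:text/plain; charset=utf-8'

    # 3. one commit at the end
    curl 'http://localhost:8080/solr/update' --data-binary '<commit/>' \
         -H 'Content-type:text/xml; charset=utf-8'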

>In the meantime, I'm trying to see if I can pinpoint down a specific  
>change and see if there is anything that might help it perform better.
>
>-Grant
>

-- 

===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================