Yonik

>Another thought I just had - do you have autocommit enabled?
>
No; not as far as I know!

The solrconfig.xml files from the two versions are equivalent as best I
can tell, and they are exactly as provided in the download. The only
changes were made by the attached script and should not affect committing.
Finally, the indexing command has commit=true, which I think means a
single commit is done at the end of the file?
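On the autocommit question, a rough sanity check can be scripted (a sketch only: the here-doc sample below stands in for a real solr/conf/solrconfig.xml, and this is a line-oriented grep, not a proper XML parse):

```shell
# Sketch: check a solrconfig.xml for an uncommented <autoCommit> block.
# The sample below mimics the 1.3 example config, where the block ships
# commented out; point $conf at a real solr/conf/solrconfig.xml instead
# to check a live install.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- autocommit is commented out by default
  <autoCommit>
    <maxDocs>10000</maxDocs>
  </autoCommit>
  -->
</updateHandler>
EOF
# Drop comment spans line-wise, then look for the tag.
if sed -e '/<!--/,/-->/d' "$conf" | grep -q '<autoCommit>'; then
  status="enabled"
else
  status="disabled"
fi
echo "autocommit appears to be: $status"
rm -f "$conf"
```

With the commented-out sample above, this reports "disabled".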

Regards Fergus.


>A lucene commit is now more expensive because it syncs the files for
>safety.  If you commit frequently, this could definitely cause a
>slowdown.
>
>-Yonik
>
>On Wed, Nov 26, 2008 at 10:54 AM, Fergus McMenemie <[EMAIL PROTECTED]> wrote:
>> Hello Grant,
>>
>> Not much good with Java profilers (yet!) so I thought I
>> would send a script!
>>
>> Details... details! Having decided to produce a script to
>> replicate the 1.2 vs 1.3 speed problem, the required rigor
>> revealed a lot more.
>>
>> 1) The faster version I have previously referred to as 1.2
>>   was actually a "1.3-dev" I had downloaded as part of the
>>   solr bootcamp class at ApacheCon Europe 2008. The ID
>>   string in its CHANGES.txt document is:-
>>   $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $
>>
>> 2) I did actually download and speed test a version of 1.2
>>   from the internet. Its CHANGES.txt ID is:-
>>   $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
>>   Speed-wise it was about the same as 1.3 at 64min. It also
>>   had lots of charset issues, so it is ignored from now on.
>>
>> 3) The version I was planning to use, till I found this
>>   speed issue, was the "latest" official version:-
>>   $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
>>   I also verified the behavior with a nightly build:-
>>   $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $
>>
>> Anyway, the following script indexes the content in 22min
>> for the 1.3-dev version and takes 68min for the newer releases
>> of 1.3. I took the conf directory from the 1.3-dev (bootcamp)
>> release and used it to replace the conf directory from the
>> official 1.3 release. The 3x slowdown was still there; it is
>> not a configuration issue!
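The conf swap amounts to the comparison sketched below (throwaway directories with identical files stand in for the real solrbc/conf and solr13/conf trees, so the commands run anywhere):

```shell
# Sketch: confirm two solr conf directories are equivalent with a
# recursive diff. Temp dirs stand in for solrbc/conf and solr13/conf.
a=$(mktemp -d); b=$(mktemp -d)
echo '<config/>' > "$a/solrconfig.xml"
echo '<config/>' > "$b/solrconfig.xml"
if diff -r "$a" "$b" > /dev/null; then
  verdict="identical"
else
  verdict="different"
fi
echo "conf dirs are $verdict"
rm -rf "$a" "$b"
```

On the real trees, any line of diff output would point at a config difference worth ruling out.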
>> =================================
>>
>>
>>
>>
>>
>>
>> #! /bin/bash
>>
>> # This script assumes a /usr/local/tomcat link to whatever version
>> # of tomcat you have installed. I have "apache-tomcat-5.5.20" Also
>> # /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml.
>> # All the following was done as root.
>>
>>
>> # I have a directory /usr/local/ts which contains four versions of solr: the
>> # "official" 1.2, two 1.3 releases, and a version of 1.2 or a 1.3beta
>> # I got while attending a solr bootcamp. I indexed the same content using the
>> # different versions of solr as follows:
>> cd /usr/local/ts
>> if [ "" ]  # always false: put any non-empty string here to redo the setup below
>> then
>>   echo "Starting afresh"
>>   sleep 5 # allow time for me to interrupt!
>>   cp -Rp apache-solr-bc/example/solr      ./solrbc  #bc = bootcamp
>>   cp -Rp apache-solr-nightly/example/solr ./solrnightly
>>   cp -Rp apache-solr-1.3.0/example/solr   ./solr13
>>
>>   # the gaz is regularly updated and its name keeps changing :-) The page
>>   # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the latest
>>   # version.
>>   curl "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip" > geonames.zip
>>   unzip -q geonames.zip
>>   # delete corrupt blips!
>>   perl -i -n -e 'print unless
>>       ($. > 2128495 and $. < 2128505) or
>>       ($. > 5944254 and $. < 5944260)
>>       ;' geonames_dd_dms_date_20081118.txt
>>   #following was used to detect bad short records
>>   #perl -a -F\\t -n -e 'print "line $. is bad with ",scalar(@F)," args\n" if (@F != 26);' geonames_dd_dms_date_20081118.txt
>>
>>   # my set of fields and copyfields for the schema.xml
>>   fields='
>>   <fields>
>>      <field name="UNI"           type="string" indexed="true"  stored="true" required="true" />
>>      <field name="CCODE"         type="string" indexed="true"  stored="true"/>
>>      <field name="DSG"           type="string" indexed="true"  stored="true"/>
>>      <field name="CC1"           type="string" indexed="true"  stored="true"/>
>>      <field name="LAT"           type="sfloat" indexed="true"  stored="true"/>
>>      <field name="LONG"          type="sfloat" indexed="true"  stored="true"/>
>>      <field name="MGRS"          type="string" indexed="false" stored="true"/>
>>      <field name="JOG"           type="string" indexed="false" stored="true"/>
>>      <field name="FULL_NAME"     type="string" indexed="true"  stored="true"/>
>>      <field name="FULL_NAME_ND"  type="string" indexed="true"  stored="true"/>
>>      <!--field name="text"       type="text"   indexed="true"  stored="false" multiValued="true"/-->
>>      <!--field name="timestamp"  type="date"   indexed="true"  stored="true"  default="NOW" multiValued="false"/-->
>>   '
>>   copyfields='
>>      </fields>
>>      <copyField source="FULL_NAME" dest="text"/>
>>      <copyField source="FULL_NAME_ND" dest="text"/>
>>   '
>>
>>   # add in my fields and copyfields
>>   perl -i -p -e "print qq($fields) if s/<fields>//;"           solr*/conf/schema.xml
>>   perl -i -p -e "print qq($copyfields) if s[</fields>][];"     solr*/conf/schema.xml
>>   # change the unique key and mark the "id" field as not required
>>   perl -i -p -e "s/<uniqueKey>id/<uniqueKey>UNI/i;"            solr*/conf/schema.xml
>>   perl -i -p -e 's/required="true"//i if m/<field name="id"/;' solr*/conf/schema.xml
>>   # enable remote streaming in solrconfig file
>>   perl -i -p -e 's/enableRemoteStreaming="false"/enableRemoteStreaming="true"/;' solr*/conf/solrconfig.xml
>>   fi
>>
>> # some constants to keep the curl command shorter
>> skip="MODIFY_DATE,RC,UFI,DMS_LAT,DMS_LONG,FC,PC,ADM1,ADM2,POP,ELEV,CC2,NT,LC,SHORT_FORM,GENERIC,SORT_NAME"
>> file=`pwd`"/geonames_dd_dms_date_20081118.txt"  # the file unzipped above
>>
>> export JAVA_OPTS=" -Xmx512M -Xms512M -Dsolr.home=`pwd`/solr -Dsolr.solr.home=`pwd`/solr"
>>
>> echo 'Getting ready to index the data set using solrbc (bc = bootcamp)'
>> /usr/local/tomcat/bin/shutdown.sh
>> sleep 15
>> if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
>>   then
>>   echo "Tomcat would not shutdown"
>>   exit
>>   fi
>> rm -r /usr/local/tomcat/webapps/solr*
>> rm -r /usr/local/tomcat/logs/*.out
>> rm -r /usr/local/tomcat/work/Catalina/localhost/solr
>> cp apache-solr-bc/example/webapps/solr.war /usr/local/tomcat/webapps
>> rm solr # rm the symbolic link
>> ln -s solrbc solr
>> rm -r solr/data
>> /usr/local/tomcat/bin/startup.sh
>> sleep 10 # give solr time to launch and setup
>> echo "Starting indexing at " `date` " with solrbc (bc = bootcamp)"
>> time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
>>
>> echo "Getting ready to index the data set using solrnightly"
>> /usr/local/tomcat/bin/shutdown.sh
>> sleep 15
>> if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
>>   then
>>   echo "Tomcat would not shutdown"
>>   exit
>>   fi
>> rm -r /usr/local/tomcat/webapps/solr*
>> rm -r /usr/local/tomcat/logs/*.out
>> rm -r /usr/local/tomcat/work/Catalina/localhost/solr
>> cp apache-solr-nightly/example/webapps/solr.war /usr/local/tomcat/webapps
>> rm solr # rm the symbolic link
>> ln -s solrnightly solr
>> rm -r solr/data
>> /usr/local/tomcat/bin/startup.sh
>> sleep 10 # give solr time to launch and setup
>> echo "Starting indexing at " `date` " with solrnightly"
>> time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
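For anyone squinting at the curl lines above: separator=%09 and escape=%00 are just the URL-encoded tab and NUL characters. A sketch of assembling the same URL from named parts (host, port, and file path are assumptions matching the script):

```shell
# Sketch: build the CSV update URL from named pieces so the
# URL-encoded parameters are easier to audit. %09 = tab, %00 = NUL.
base="http://localhost:8080/solr/update/csv"
file="/usr/local/ts/geonames_dd_dms_date_20081118.txt"
url="$base?commit=true&escape=%00&separator=%09&stream.file=$file"
echo "$url"
```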
>>
>>
>>
>>
>>>On Nov 20, 2008, at 9:18 AM, Fergus McMenemie wrote:
>>>
>>>> Hello Grant,
>>>>
>>>>> Were you overwriting the existing index or did you also clean out the
>>>>> Solr data directory, too?  In other words, was it a fresh index, or
>>>>> an
>>>>> existing one?  And was that also the case for the 22 minute time?
>>>>
>>>> No, in each case it was a new index. I store the indexes (the "data"
>>>> dir)
>>>> outside the solr home directory. For the moment I, rm -rf the index
>>>> dir
>>>> after each edit to the solrconfig.xml or schema.xml file and reindex
>>>> from scratch. The relaunch of tomcat recreates the index dir.
>>>>
>>>>> Would it be possible to profile the two instance and see if you
>>>>> notice
>>>>> anything different?
>>>> I don't understand this. Do you mean run a profiler against the tomcat
>>>> image as indexing takes place, or somehow compare the indexes?
>>>
>>>Something like JProfiler or any other Java profiler.
>>>
>>>>
>>>>
>>>> I was thinking of making a short script that replicates the results
>>>> and posting it here; would that help?
>>>
>>>
>>>Very much so.
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Grant
>>>>>
>>>>> On Nov 19, 2008, at 8:25 AM, Fergus McMenemie wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a CSV file with 6M records which took 22min to index with
>>>>>> solr 1.2. I then stopped tomcat replaced the solr stuff inside
>>>>>> webapps with version 1.3, wiped my index and restarted tomcat.
>>>>>>
>>>>>> Indexing the exact same content now takes 69min. My machine has
>>>>>> 2GB of RAM and tomcat is running with $JAVA_OPTS -Xmx512M -Xms512M.
>>>>>>
>>>>>> Are there any tweaks I can use to get the original index time
>>>>>> back. I read through the release notes and was expecting a
>>>>>> speed up. I saw the bit about increasing ramBufferSizeMB and set
>>>>>> it to 64MB; it had no effect.
>>>>>> --
>>
>> --
>>
>> ===============================================================
>> Fergus McMenemie               Email:[EMAIL PROTECTED]
>> Techmore Ltd                   Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets             Analyst Programmer
>> ===============================================================
>>

-- 

===============================================================
Fergus McMenemie               Email:[EMAIL PROTECTED]
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================
