Another thought I just had - do you have autocommit enabled?

A Lucene commit is now more expensive because it syncs the files for
safety.  If you commit frequently, this could definitely cause a
slowdown.
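
A quick way to check, assuming the stock example solrconfig.xml layout
(the <autoCommit> block inside <updateHandler> ships commented out) and
the paths from the script below:

  grep -B2 -A4 autoCommit /usr/local/ts/solr/conf/solrconfig.xml

An uncommented <autoCommit> with a small <maxDocs> or <maxTime> would
mean many extra commits (and fsyncs) during the CSV load.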

-Yonik

On Wed, Nov 26, 2008 at 10:54 AM, Fergus McMenemie <[EMAIL PROTECTED]> wrote:
> Hello Grant,
>
> Not much good with Java profilers (yet!) so I thought I
> would send a script!
>
> Details... details! Having decided to produce a script to
> replicate the 1.2 vs 1.3 speed problem, the required rigor
> revealed a lot more.
>
> 1) The faster version I have previously referred to as 1.2
>   was actually a "1.3-dev" I had downloaded as part of the
>   solr bootcamp class at ApacheCon Europe 2008. The ID
>   string in the CHANGES.txt document is:-
>   $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $
>
> 2) I did actually download and speed test a version of 1.2
>   from the internet. Its CHANGES.txt id is:-
>   $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
>   Speed-wise it was about the same as 1.3 at 64min. It also
>   had lots of charset issues, so it is ignored from now on.
>
> 3) The version I was planning to use, till I found this
>   speed issue, was the "latest" official version:-
>   $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
>   I also verified the behavior with a nightly build.
>   $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $
>
> Anyway, the following script indexes the content in 22min
> for the 1.3-dev version and takes 68min for the newer releases
> of 1.3. I took the conf directory from the 1.3-dev (bootcamp)
> release and used it to replace the conf directory from the
> official 1.3 release. The 3x slowdown was still there; it is
> not a configuration issue!
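>
> (Roughly what the conf swap looked like - a sketch using the directory
> names from the script below, not necessarily the exact commands I ran:)
>
>   mv solr13/conf solr13/conf.orig
>   cp -Rp solrbc/conf solr13/conf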
> =================================
>
>
>
>
>
>
> #! /bin/bash
>
> # This script assumes a /usr/local/tomcat link to whatever version
> # of tomcat you have installed. I have "apache-tomcat-5.5.20" Also
> # /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml.
> # All the following was done as root.
>
>
> # I have a directory /usr/local/ts which contains four versions of solr. The
> # "official" 1.2 along with two 1.3 releases, and a version of 1.2 or a 1.3 beta
> # I got while attending a solr bootcamp. I indexed the same content using the
> # different versions of solr as follows:
> cd /usr/local/ts
> # Note: the empty test below means this one-off setup block is skipped;
> # put any non-empty string between the quotes to run it (it copies the
> # solr homes and downloads and cleans the gazetteer data).
> if [ "" ]
> then
>   echo "Starting afresh"
>   sleep 5 # allow time for me to interrupt!
>   cp -Rp apache-solr-bc/example/solr      ./solrbc  #bc = bootcamp
>   cp -Rp apache-solr-nightly/example/solr ./solrnightly
>   cp -Rp apache-solr-1.3.0/example/solr   ./solr13
>
>   # the gaz is regularly updated and its name keeps changing :-) The page
>   # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the latest
>   # version.
>   curl "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip" > geonames.zip
>   unzip -q geonames.zip
>   # delete corrupt blips!
>   perl -i -n -e 'print unless
>       ($. > 2128495 and $. < 2128505) or
>       ($. > 5944254 and $. < 5944260)
>       ;' geonames_dd_dms_date_20081118.txt
>   # normalise the name so the rest of the script need not care which
>   # release of the gaz was downloaded
>   ln -sf geonames_dd_dms_date_20081118.txt geonames.txt
>   # following was used to detect bad short records:
>   # perl -a -F\\t -n -e 'print "line $. is bad with ",scalar(@F)," args\n" if (@F != 26);' geonames_dd_dms_date_20081118.txt
>
>   # my set of fields and copyfields for the schema.xml
>   fields='
>   <fields>
>      <field name="UNI"           type="string" indexed="true"  stored="true" required="true" />
>      <field name="CCODE"         type="string" indexed="true"  stored="true"/>
>      <field name="DSG"           type="string" indexed="true"  stored="true"/>
>      <field name="CC1"           type="string" indexed="true"  stored="true"/>
>      <field name="LAT"           type="sfloat" indexed="true"  stored="true"/>
>      <field name="LONG"          type="sfloat" indexed="true"  stored="true"/>
>      <field name="MGRS"          type="string" indexed="false" stored="true"/>
>      <field name="JOG"           type="string" indexed="false" stored="true"/>
>      <field name="FULL_NAME"     type="string" indexed="true"  stored="true"/>
>      <field name="FULL_NAME_ND"  type="string" indexed="true"  stored="true"/>
>      <!--field name="text"       type="text"   indexed="true"  stored="false" multiValued="true"/-->
>      <!--field name="timestamp"  type="date"   indexed="true"  stored="true"  default="NOW" multiValued="false"/-->
>   '
>   copyfields='
>      </fields>
>      <copyField source="FULL_NAME" dest="text"/>
>      <copyField source="FULL_NAME_ND" dest="text"/>
>   '
>
>   # add in my fields and copyfields
>   perl -i -p -e "print qq($fields) if s/<fields>//;"           solr*/conf/schema.xml
>   perl -i -p -e "print qq($copyfields) if s[</fields>][];"     solr*/conf/schema.xml
>   # change the unique key and mark the "id" field as not required
>   perl -i -p -e "s/<uniqueKey>id/<uniqueKey>UNI/i;"            solr*/conf/schema.xml
>   perl -i -p -e 's/required="true"//i if m/<field name="id"/;' solr*/conf/schema.xml
>   # enable remote streaming in solrconfig file
>   perl -i -p -e 's/enableRemoteStreaming="false"/enableRemoteStreaming="true"/;' solr*/conf/solrconfig.xml
>   fi
>
> # some constants to keep the curl command shorter
> skip="MODIFY_DATE,RC,UFI,DMS_LAT,DMS_LONG,FC,PC,ADM1,ADM2,POP,ELEV,CC2,NT,LC,SHORT_FORM,GENERIC,SORT_NAME"
> file=`pwd`"/geonames.txt"
>
> export JAVA_OPTS=" -Xmx512M -Xms512M -Dsolr.home=`pwd`/solr -Dsolr.solr.home=`pwd`/solr"
>
> echo 'Getting ready to index the data set using solrbc (bc = bootcamp)'
> /usr/local/tomcat/bin/shutdown.sh
> sleep 15
> if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
>   then
>   echo "Tomcat would not shutdown"
>   exit
>   fi
> rm -r /usr/local/tomcat/webapps/solr*
> rm -r /usr/local/tomcat/logs/*.out
> rm -r /usr/local/tomcat/work/Catalina/localhost/solr
> cp apache-solr-bc/example/webapps/solr.war /usr/local/tomcat/webapps
> rm solr # rm the symbolic link
> ln -s solrbc solr
> rm -r solr/data
> /usr/local/tomcat/bin/startup.sh
> sleep 10 # give solr time to launch and setup
> echo "Starting indexing at " `date` " with solrbc (bc = bootcamp)"
> time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
>
> echo "Getting ready to index the data set using solrnightly"
> /usr/local/tomcat/bin/shutdown.sh
> sleep 15
> if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
>   then
>   echo "Tomcat would not shutdown"
>   exit
>   fi
> rm -r /usr/local/tomcat/webapps/solr*
> rm -r /usr/local/tomcat/logs/*.out
> rm -r /usr/local/tomcat/work/Catalina/localhost/solr
> cp apache-solr-nightly/example/webapps/solr.war /usr/local/tomcat/webapps
> rm solr # rm the symbolic link
> ln -s solrnightly solr
> rm -r solr/data
> /usr/local/tomcat/bin/startup.sh
> sleep 10 # give solr time to launch and setup
> echo "Starting indexing at " `date` " with solrnightly"
> time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
>
>
>
>
>>On Nov 20, 2008, at 9:18 AM, Fergus McMenemie wrote:
>>
>>> Hello Grant,
>>>
>>>> Were you overwriting the existing index or did you also clean out the
>>>> Solr data directory, too?  In other words, was it a fresh index, or
>>>> an
>>>> existing one?  And was that also the case for the 22 minute time?
>>>
>>> No, in each case it was a new index. I store the indexes (the "data" dir)
>>> outside the solr home directory. For the moment I rm -rf the index dir
>>> after each edit to the solrconfig.xml or schema.xml file and reindex
>>> from scratch. The relaunch of tomcat recreates the index dir.
>>>
>>>> Would it be possible to profile the two instances and see if you
>>>> notice anything different?
>>> I don't understand this. Do you mean run a profiler against the tomcat
>>> image as indexing takes place, or somehow compare the indexes?
>>
>>Something like JProfiler or any other Java profiler.
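>>
>>(One low-tech way to do that - a sketch, assuming a Sun JDK 5/6 with the
>>bundled hprof agent: add it to tomcat's JAVA_OPTS, run the indexing, then
>>read the java.hprof.txt written when tomcat shuts down.)
>>
>>  export JAVA_OPTS="$JAVA_OPTS -agentlib:hprof=cpu=samples,depth=10"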
>>
>>>
>>>
>>> I was thinking of making a short script that replicates the results
>>> and posting it here; would that help?
>>
>>
>>Very much so.
>>
>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> Grant
>>>>
>>>> On Nov 19, 2008, at 8:25 AM, Fergus McMenemie wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a CSV file with 6M records which took 22min to index with
>>>>> solr 1.2. I then stopped tomcat, replaced the solr stuff inside
>>>>> webapps with version 1.3, wiped my index and restarted tomcat.
>>>>>
>>>>> Indexing the exact same content now takes 69min. My machine has
>>>>> 2GB of RAM and tomcat is running with $JAVA_OPTS -Xmx512M -Xms512M.
>>>>>
>>>>> Are there any tweaks I can use to get the original index time
>>>>> back? I read through the release notes and was expecting a
>>>>> speed-up. I saw the bit about increasing ramBufferSizeMB and set
>>>>> it to 64MB; it had no effect.
>>>>> --
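>
> (For reference: ramBufferSizeMB sits in the <indexDefaults> section of
> the stock 1.3 example solrconfig.xml; a one-liner in the style of the
> script above would set it - a sketch, not necessarily how I applied it:)
>
>   perl -i -p -e 's[<ramBufferSizeMB>\d+</ramBufferSizeMB>][<ramBufferSizeMB>64</ramBufferSizeMB>];' solr*/conf/solrconfig.xml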
>
> --
>
> ===============================================================
> Fergus McMenemie               Email:[EMAIL PROTECTED]
> Techmore Ltd                   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================
>
