Hello Grant,

Not much good with Java profilers (yet!), so I thought I would send a script!
Details... details! Having decided to produce a script to replicate the 1.2 vs 1.3
speed problem, the required rigour revealed a lot more:

1) The faster version I have previously referred to as 1.2 was actually a
   "1.3-dev" I had downloaded as part of the solr bootcamp class at ApacheCon
   Europe 2008. The ID string in the CHANGES.txt document is:
   $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $

2) I did actually download and speed-test a version of 1.2 from the internet.
   Its CHANGES.txt id is:
   $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
   Speed-wise it was about the same as 1.3 at 64min. It also had lots of
   charset issues and is ignored from now on.

3) The version I was planning to use, until I found this speed issue, was the
   "latest" official version:
   $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
   I also verified the behaviour with a nightly build:
   $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $

Anyway, the following script indexes the content in 22min for the 1.3-dev
version and takes 68min for the newer releases of 1.3. I took the conf
directory from the 1.3-dev (bootcamp) release and used it to replace the conf
directory from the official 1.3 release. The 3x slowdown was still there; it
is not a configuration issue!

=================================
#! /bin/bash

# This script assumes a /usr/local/tomcat link to whatever version
# of tomcat you have installed. I have "apache-tomcat-5.5.20". Also
# /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml.
# All the following was done as root.

# I have a directory /usr/local/ts which contains four versions of solr: the
# "official" 1.2, two 1.3 releases, and a version of 1.2 or a 1.3 beta
# I got while attending a solr bootcamp. I indexed the same content using the
# different versions of solr as follows:
cd /usr/local/ts

if [ "" ]   # manual toggle: put any non-empty string here to redo the one-off setup
then
   echo "Starting afresh"
   sleep 5 # allow time for me to interrupt!
   cp -Rp apache-solr-bc/example/solr      ./solrbc      # bc = bootcamp
   cp -Rp apache-solr-nightly/example/solr ./solrnightly
   cp -Rp apache-solr-1.3.0/example/solr   ./solr13

   # the gaz is regularly updated and its name keeps changing :-) The page
   # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the latest
   # version.
   curl "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip" > geonames.zip
   unzip -q geonames.zip

   # delete corrupt blips!
   perl -i -n -e 'print unless ($. > 2128495 and $. < 2128505) or ($. > 5944254 and $. < 5944260);' geonames_dd_dms_date_20081118.txt

   # the following was used to detect bad short records
   #perl -a -F\\t -n -e 'print "line $. is bad with ",scalar(@F)," args\n" if (@F != 26);' geonames_dd_dms_date_20081118.txt

   # my set of fields and copyfields for the schema.xml
   fields='
   <fields>
   <field name="UNI" type="string" indexed="true" stored="true" required="true" />
   <field name="CCODE" type="string" indexed="true" stored="true"/>
   <field name="DSG" type="string" indexed="true" stored="true"/>
   <field name="CC1" type="string" indexed="true" stored="true"/>
   <field name="LAT" type="sfloat" indexed="true" stored="true"/>
   <field name="LONG" type="sfloat" indexed="true" stored="true"/>
   <field name="MGRS" type="string" indexed="false" stored="true"/>
   <field name="JOG" type="string" indexed="false" stored="true"/>
   <field name="FULL_NAME" type="string" indexed="true" stored="true"/>
   <field name="FULL_NAME_ND" type="string" indexed="true" stored="true"/>
   <!--field name="text" type="text" indexed="true" stored="false" multiValued="true"/-->
   <!--field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/-->
   '
   copyfields='
   </fields>
   <copyField source="FULL_NAME" dest="text"/>
   <copyField source="FULL_NAME_ND" dest="text"/>
   '

   # add in my fields and copyfields
   perl -i -p -e "print qq($fields) if s/<fields>//;" solr*/conf/schema.xml
   perl -i -p -e "print qq($copyfields) if s[</fields>][];" solr*/conf/schema.xml

   # change the unique key and mark the "id" field as not required
   perl -i -p -e "s/<uniqueKey>id/<uniqueKey>UNI/i;" solr*/conf/schema.xml
   perl -i -p -e 's/required="true"//i if m/<field name="id"/;' solr*/conf/schema.xml

   # enable remote streaming in solrconfig file
   perl -i -p -e 's/enableRemoteStreaming="false"/enableRemoteStreaming="true"/;' solr*/conf/solrconfig.xml
fi

# some constants to keep the curl command shorter
skip="MODIFY_DATE,RC,UFI,DMS_LAT,DMS_LONG,FC,PC,ADM1,ADM2,POP,ELEV,CC2,NT,LC,SHORT_FORM,GENERIC,SORT_NAME"
file=`pwd`"/geonames_dd_dms_date_20081118.txt"   # the unzipped gaz file

export JAVA_OPTS=" -Xmx512M -Xms512M -Dsolr.home=`pwd`/solr -Dsolr.solr.home=`pwd`/solr"

echo 'Getting ready to index the data set using solrbc (bc = bootcamp)'
/usr/local/tomcat/bin/shutdown.sh
sleep 15
if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
then
   echo "Tomcat would not shutdown"
   exit
fi
rm -r /usr/local/tomcat/webapps/solr*
rm -r /usr/local/tomcat/logs/*.out
rm -r /usr/local/tomcat/work/Catalina/localhost/solr
cp apache-solr-bc/example/webapps/solr.war /usr/local/tomcat/webapps
rm solr          # rm the symbolic link
ln -s solrbc solr
rm -r solr/data
/usr/local/tomcat/bin/startup.sh
sleep 10         # give solr time to launch and setup
echo "Starting indexing at " `date` " with solrbc (bc = bootcamp)"
time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"

echo "Getting ready to index the data set using solrnightly"
/usr/local/tomcat/bin/shutdown.sh
sleep 15
if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
then
   echo "Tomcat would not shutdown"
   exit
fi
rm -r /usr/local/tomcat/webapps/solr*
rm -r /usr/local/tomcat/logs/*.out
rm -r /usr/local/tomcat/work/Catalina/localhost/solr
cp apache-solr-nightly/example/webapps/solr.war /usr/local/tomcat/webapps
rm solr          # rm the symbolic link
ln -s solrnightly solr
rm -r solr/data
/usr/local/tomcat/bin/startup.sh
sleep 10         # give solr time to launch and setup
echo "Starting indexing at " `date` " with solrnightly"
time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
=================================

>On Nov 20, 2008, at 9:18 AM, Fergus McMenemie wrote:
>
>> Hello Grant,
>>
>>> Were you overwriting the existing index or did you also clean out the
>>> Solr data directory, too? In other words, was it a fresh index, or an
>>> existing one? And was that also the case for the 22 minute time?
>>
>> No, in each case it was a new index. I store the indexes (the "data" dir)
>> outside the solr home directory. For the moment I rm -rf the index dir
>> after each edit to the solrconfig.xml or schema.xml file and reindex
>> from scratch. The relaunch of tomcat recreates the index dir.
>>
>>> Would it be possible to profile the two instances and see if you notice
>>> anything different?
>>
>> I don't understand this. Do you mean run a profiler against the tomcat
>> image as indexing takes place, or somehow compare the indexes?
>
>Something like JProfiler or any other Java profiler.
>
>> I was thinking of making a short script that replicates the results,
>> and posting it here; would that help?
>
>Very much so.
>
>>> Thanks,
>>> Grant
>>>
>>> On Nov 19, 2008, at 8:25 AM, Fergus McMenemie wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a CSV file with 6M records which took 22min to index with
>>>> solr 1.2. I then stopped tomcat, replaced the solr stuff inside
>>>> webapps with version 1.3, wiped my index and restarted tomcat.
>>>>
>>>> Indexing the exact same content now takes 69min. My machine has
>>>> 2GB of RAM and tomcat is running with $JAVA_OPTS -Xmx512M -Xms512M.
>>>>
>>>> Are there any tweaks I can use to get the original index time
>>>> back? I read through the release notes and was expecting a
>>>> speed-up. I saw the bit about increasing ramBufferSizeMB and set
>>>> it to 64MB; it had no effect.
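One footnote on telling the builds apart: since the trees are identified only
by the Subversion $Id$ line near the top of CHANGES.txt, a quick check before
each timing run avoids mixing them up. A minimal sketch, assuming the same
directory names the script uses under /usr/local/ts:

```shell
# Print the first $Id$ line from each unpacked Solr tree, so the log of a
# timing run records exactly which build was under test.
for d in apache-solr-bc apache-solr-nightly apache-solr-1.3.0
do
    if [ -f "$d/CHANGES.txt" ]
    then
        printf '%s: %s\n' "$d" "$(grep '\$Id:' "$d/CHANGES.txt" | head -n 1)"
    else
        echo "$d: no CHANGES.txt found"
    fi
done
```

Run from /usr/local/ts this prints one line per tree, e.g. the bootcamp build
reports the 643465/gsingers revision quoted above.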
>>>> --
-- 
===============================================================
Fergus McMenemie               Email:[EMAIL PROTECTED]
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===============================================================