Another thought I just had - do you have autocommit enabled? A Lucene commit is now more expensive because it syncs the files for safety. If you commit frequently, this could definitely cause a slowdown.
-Yonik

On Wed, Nov 26, 2008 at 10:54 AM, Fergus McMenemie <[EMAIL PROTECTED]> wrote:
> Hello Grant,
>
> Not much good with Java profilers (yet!) so I thought I
> would send a script!
>
> Details... details! Having decided to produce a script to
> replicate the 1.2 vs 1.3 speed problem, the required rigor
> revealed a lot more.
>
> 1) The faster version I have previously referred to as 1.2
>    was actually a "1.3-dev" I had downloaded as part of the
>    solr bootcamp class at ApacheCon Europe 2008. The ID
>    string in the CHANGES.txt document is:-
>    $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $
>
> 2) I did actually download and speed test a version of 1.2
>    from the internet. Its CHANGES.txt id is:-
>    $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
>    Speed wise it was about the same as 1.3 at 64min. It also
>    had lots of char set issues and is ignored from now on.
>
> 3) The version I was planning to use, till I found this
>    speed issue, was the "latest" official version:-
>    $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
>    I also verified the behavior with a nightly build.
>    $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $
>
> Anyway, the following script indexes the content in 22min
> for the 1.3-dev version and takes 68min for the newer releases
> of 1.3. I took the conf directory from the 1.3dev (bootcamp)
> release and used it to replace the conf directory from the
> official 1.3 release. The 3x slow down was still there; it is
> not a configuration issue!
> =================================
>
> #! /bin/bash
>
> # This script assumes a /usr/local/tomcat link to whatever version
> # of tomcat you have installed. I have "apache-tomcat-5.5.20". Also
> # /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml.
> # All the following was done as root.
>
> # I have a directory /usr/local/ts which contains four versions of solr.
> # The "official" 1.2 along with two 1.3 releases and a version of 1.2 or a
> # 1.3beta I got while attending a solr bootcamp. I indexed the same content
> # using the different versions of solr as follows:
> cd /usr/local/ts
> if [ "" ]
> then
>     echo "Starting from a-fresh"
>     sleep 5 # allow time for me to interrupt!
>     cp -Rp apache-solr-bc/example/solr ./solrbc #bc = bootcamp
>     cp -Rp apache-solr-nightly/example/solr ./solrnightly
>     cp -Rp apache-solr-1.3.0/example/solr ./solr13
>
>     # the gaz is regularly updated and its name keeps changing :-) The page
>     # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the latest
>     # version.
>     curl "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip" > geonames.zip
>     unzip -q geonames.zip
>     # delete corrupt blips!
>     perl -i -n -e 'print unless
>         ($. > 2128495 and $. < 2128505) or
>         ($. > 5944254 and $. < 5944260)
>         ;' geonames_dd_dms_date_20081118.txt
>     #following was used to detect bad short records
>     #perl -a -F\\t -n -e ' print "line $. is bad with ",scalar(@F)," args\n" if (@F != 26);' geonames_dd_dms_date_20081118.txt
>
>     # my set of fields and copyfields for the schema.xml
>     fields='
>     <fields>
>     <field name="UNI" type="string" indexed="true" stored="true" required="true" />
>     <field name="CCODE" type="string" indexed="true" stored="true"/>
>     <field name="DSG" type="string" indexed="true" stored="true"/>
>     <field name="CC1" type="string" indexed="true" stored="true"/>
>     <field name="LAT" type="sfloat" indexed="true" stored="true"/>
>     <field name="LONG" type="sfloat" indexed="true" stored="true"/>
>     <field name="MGRS" type="string" indexed="false" stored="true"/>
>     <field name="JOG" type="string" indexed="false" stored="true"/>
>     <field name="FULL_NAME" type="string" indexed="true" stored="true"/>
>     <field name="FULL_NAME_ND" type="string" indexed="true" stored="true"/>
>     <!--field name="text" type="text" indexed="true" stored="false" multiValued="true"/ -->
>     <!--field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/-->
>     '
>     copyfields='
>     </fields>
>     <copyField source="FULL_NAME" dest="text"/>
>     <copyField source="FULL_NAME_ND" dest="text"/>
>     '
>
>     # add in my fields and copyfields
>     perl -i -p -e "print qq($fields) if s/<fields>//;" solr*/conf/schema.xml
>     perl -i -p -e "print qq($copyfields) if s[</fields>][];" solr*/conf/schema.xml
>     # change the unique key and mark the "id" field as not required
>     perl -i -p -e "s/<uniqueKey>id/<uniqueKey>UNI/i;" solr*/conf/schema.xml
>     perl -i -p -e 's/required="true"//i if m/<field name="id"/;' solr*/conf/schema.xml
>     # enable remote streaming in solrconfig file
>     perl -i -p -e 's/enableRemoteStreaming="false"/enableRemoteStreaming="true"/;' solr*/conf/solrconfig.xml
> fi
>
> # some constants to keep the curl command shorter
> skip="MODIFY_DATE,RC,UFI,DMS_LAT,DMS_LONG,FC,PC,ADM1,ADM2,POP,ELEV,CC2,NT,LC,SHORT_FORM,GENERIC,SORT_NAME"
> file=`pwd`"/geonames.txt"
>
> export JAVA_OPTS=" -Xmx512M -Xms512M -Dsolr.home=`pwd`/solr -Dsolr.solr.home=`pwd`/solr"
>
> echo 'Getting ready to index the data set using solrbc (bc = bootcamp)'
> /usr/local/tomcat/bin/shutdown.sh
> sleep 15
> if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
> then
>     echo "Tomcat would not shutdown"
>     exit
> fi
> rm -r /usr/local/tomcat/webapps/solr*
> rm -r /usr/local/tomcat/logs/*.out
> rm -r /usr/local/tomcat/work/Catalina/localhost/solr
> cp apache-solr-bc/example/webapps/solr.war /usr/local/tomcat/webapps
> rm solr # rm the symbolic link
> ln -s solrbc solr
> rm -r solr/data
> /usr/local/tomcat/bin/startup.sh
> sleep 10 # give solr time to launch and setup
> echo "Starting indexing at " `date` " with solrbc (bc = bootcamp)"
> time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
>
> echo "Getting ready to index the data set using solrnightly"
> /usr/local/tomcat/bin/shutdown.sh
> sleep 15
> if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
> then
>     echo "Tomcat would not shutdown"
>     exit
> fi
> rm -r /usr/local/tomcat/webapps/solr*
> rm -r /usr/local/tomcat/logs/*.out
> rm -r /usr/local/tomcat/work/Catalina/localhost/solr
> cp apache-solr-nightly/example/webapps/solr.war /usr/local/tomcat/webapps
> rm solr # rm the symbolic link
> ln -s solrnightly solr
> rm -r solr/data
> /usr/local/tomcat/bin/startup.sh
> sleep 10 # give solr time to launch and setup
> echo "Starting indexing at " `date` " with solrnightly"
> time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
>
>
>>On Nov 20, 2008, at 9:18 AM, Fergus McMenemie wrote:
>>
>>> Hello Grant,
>>>
>>>> Were you overwriting the existing index or did you also clean out the
>>>> Solr data directory, too? In other words, was it a fresh index, or an
>>>> existing one? And was that also the case for the 22 minute time?
>>>
>>> No, in each case it was a new index.
>>> I store the indexes (the "data" dir) outside the solr home directory.
>>> For the moment I rm -rf the index dir after each edit to the
>>> solrconfig.xml or schema.xml file and reindex from scratch. The
>>> relaunch of tomcat recreates the index dir.
>>>
>>>> Would it be possible to profile the two instances and see if you
>>>> notice anything different?
>>> I don't understand this. Do you mean run a profiler against the tomcat
>>> image as indexing takes place, or somehow compare the indexes?
>>
>>Something like JProfiler or any other Java profiler.
>>
>>>
>>> I was thinking of making a short script that replicates the results,
>>> and posting it here, would that help?
>>
>>Very much so.
>>
>>>>
>>>> Thanks,
>>>> Grant
>>>>
>>>> On Nov 19, 2008, at 8:25 AM, Fergus McMenemie wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a CSV file with 6M records which took 22min to index with
>>>>> solr 1.2. I then stopped tomcat, replaced the solr stuff inside
>>>>> webapps with version 1.3, wiped my index and restarted tomcat.
>>>>>
>>>>> Indexing the exact same content now takes 69min. My machine has
>>>>> 2GB of RAM and tomcat is running with $JAVA_OPTS -Xmx512M -Xms512M.
>>>>>
>>>>> Are there any tweaks I can use to get the original index time
>>>>> back? I read through the release notes and was expecting a
>>>>> speed up. I saw the bit about increasing ramBufferSizeMB and set
>>>>> it to 64MB; it had no effect.
>>>>> --
>
> --
>
> ===============================================================
> Fergus McMenemie               Email:[EMAIL PROTECTED]
> Techmore Ltd                   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================
>