Hello, I am putting together some performance comparisons of LuSql[1] and Solr's Data Import Request Handler[2] with JdbcDataSource[3]. I want to make sure I am comparing apples with apples, so I would appreciate the community's help in checking that I am.

First, LuSql uses Lucene's StandardAnalyzer[4] by default. The Javadocs indicate it uses StandardTokenizer[5], StandardFilter[6], LowerCaseFilter[7], and StopFilter[8]. I have created a fieldType in my Solr configuration's schema.xml that I hope is equivalent to this:

    <fieldType name="textN" class="solr.TextField" positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
      </analyzer>
    </fieldType>

Is this equivalent?

Queries

The two queries I am using in the evaluation run against the MySQL database of 6.4m journal article metadata records[9] that I have used in previous Lucene indexing and searching work[10].

Here is the LuSql command line for the first query:

    java ca.nrc.cisti.lusql.core.LuSqlMain \
      -q "select Publisher.name as pub, Journal.title as jo, Article.rawUrl as textpath, Journal.issn, Volume.number as vol,Volume.coverYear as year, Issue.number as iss, Article.id as id,Article.title as ti, Article.abstract, Article.startPage as startPage,Article.endPage as endPage from Publisher, Journal, Volume, Issue, Article where Publisher.id = Journal.publisherId and Journal.id = Volume.journalId and Volume.id = Issue.volumeId and Issue.id = Article.issueId" \
      -c "jdbc:mysql://blue01/dartejos?user=USER&password=PASS" \
      -n 500000 -v -l testsolr0

Here is the corresponding Solr data-config.xml file:

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://blue01/dartejos" user="USER" password="PASS"/>
      <document name="products">
        <entity name="item"
                query="select Publisher.name as pub, Journal.title as jo, Article.rawUrl as textpath,
                       Journal.issn, Volume.number as vol,Volume.coverYear as year, Issue.number as iss,
                       Article.id,Article.title as ti, Article.abstract,
                       Article.startPage as startPage,Article.endPage as endPage
                       from Publisher, Journal, Volume, Issue, Article
                       where Publisher.id = Journal.publisherId and Journal.id = Volume.journalId
                       and Volume.id = Issue.volumeId and Issue.id = Article.issueId limit 500000">
          <field column="id" name="id" />
          <field column="jo" name="id" />
          <field column="issn" name="id" />
          <field column="vol" name="id" />
          <field column="year" name="id" />
          <field column="iss" name="id" />
          <field name="abstract" column="abstract"/>
          <field name="title" column="title"/>
          <field name="pub" column="pub"/>
          <field name="textpath" column="textpath"/>
          <field name="startPage" column="startPage"/>
          <field name="endPage" column="endPage"/>
        </entity>
      </document>
    </dataConfig>

Here is the LuSql command line for the second query:

    java ca.nrc.cisti.lusql.core.LuSqlMain \
      -q "select Publisher.name as pub, Journal.title as jo, Article.rawUrl as textpath, Journal.issn, Volume.number as vol,Volume.coverYear as year, Issue.number as iss, Article.id as id,Article.title as ti, Article.abstract, Article.startPage as startPage,Article.endPage as endPage from Publisher, Journal, Volume, Issue, Article where Publisher.id = Journal.publisherId and Journal.id = Volume.journalId and Volume.id = Issue.volumeId and Issue.id = Article.issueId" \
      -c "jdbc:mysql://blue01/dartejos?user=USER&password=PASS" \
      -n 500000 -v -l testsolr1 \
      -Q "id|select Keyword.string as keyword from ArticleKeywordJoin, Keyword where ArticleKeywordJoin.keywordId=@ and ArticleKeywordJoin.articleId = Keyword.id" \
      -Q "id|select concat(lastName,\', \', firstName) as fullAuthor from ArticleAuthorJoin, Author where ArticleAuthorJoin.articleId = @ and ArticleAuthorJoin.authorId = Author.id" \
      -Q "id|select referencedArticleId as citedId from Reference where Reference.referencingArticleId = @"

Here is the corresponding Solr data-config.xml file:

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://blue01/dartejos" user="USER" password="PASS"/>
      <document name="products">
        <entity name="item"
                query="select Publisher.name as pub, Journal.title as jo, Article.rawUrl as textpath,
                       Journal.issn, Volume.number as vol,Volume.coverYear as year, Issue.number as iss,
                       Article.id,Article.title as ti, Article.abstract,
                       Article.startPage as startPage,Article.endPage as endPage
                       from Publisher, Journal, Volume, Issue, Article
                       where Publisher.id = Journal.publisherId and Journal.id = Volume.journalId
                       and Volume.id = Issue.volumeId and Issue.id = Article.issueId limit 500000">
          <field column="id" name="id" />
          <field column="jo" name="id" />
          <field column="issn" name="id" />
          <field column="vol" name="id" />
          <field column="year" name="id" />
          <field column="iss" name="id" />
          <field name="abstract" column="abstract"/>
          <field name="title" column="title"/>
          <field name="pub" column="pub"/>
          <field name="textpath" column="textpath"/>
          <field name="startPage" column="startPage"/>
          <field name="endPage" column="endPage"/>
          <entity name="keyword"
                  query="select Keyword.string as keyword from ArticleKeywordJoin, Keyword
                         where ArticleKeywordJoin.keywordId='${item.id}' and ArticleKeywordJoin.articleId = Keyword.id">
            <field name="keyword" column="keyword" />
          </entity>
          <entity name="fullAuthor"
                  query="select concat(lastName,', ', firstName) as fullAuthor from ArticleAuthorJoin, Author
                         where ArticleAuthorJoin.articleId = '${item.id}' and ArticleAuthorJoin.authorId = Author.id">
            <field name="fullAuthor" column="fullAuthor" />
          </entity>
          <entity name="referencedArticleId"
                  query="select referencedArticleId as citedId from Reference
                         where Reference.referencingArticleId ='${item.id}'">
            <field name="referencedArticleId" column="referencedArticleId" />
          </entity>
        </entity>
      </document>
    </dataConfig>

Does this configuration look right? That is, does it represent an equivalent or close-enough configuration for a valid and useful performance comparison of LuSql and the Solr Data Import Request Handler with JdbcDataSource?

Note: I am using Solr 1.3, however I have replaced the Lucene 2.9 jars in webapps/solr/WEB-INF/lib with Lucene 1.4 jars, which I am also using with LuSql.

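Since the plan is to time indexing at several heap sizes, the LuSql side of the runs could be driven by a small script along the following lines. This is only a sketch under stated assumptions: the function names (build_command, time_command, sweep) and the heap values in the usage comment are made up for illustration, and the heap is assumed to be set with the standard JVM -Xmx option. It covers only the command-line LuSql runs; the Solr import runs inside the servlet container, so its heap would have to be set on the container and the import timed separately.

```python
import subprocess
import time


def build_command(heap, main_class, args, java="java"):
    """Return the argv list for one JVM run with the given maximum heap, e.g. '512m'."""
    return [java, "-Xmx" + heap, main_class] + list(args)


def time_command(cmd):
    """Run cmd to completion and return (exit_code, elapsed_wall_clock_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd)
    return result.returncode, time.perf_counter() - start


def sweep(heaps, main_class, args, java="java"):
    """Time one run per heap size; returns a list of (heap, exit_code, seconds)."""
    rows = []
    for heap in heaps:
        code, seconds = time_command(build_command(heap, main_class, args, java))
        rows.append((heap, code, seconds))
    return rows


# Hypothetical usage (heap sizes are placeholders; lusql_args would hold the
# -q, -c, -n, -v and -l options from the command lines above as a list):
#
#   for heap, code, seconds in sweep(["256m", "512m", "1g"],
#                                    "ca.nrc.cisti.lusql.core.LuSqlMain",
#                                    lusql_args):
#       print(heap, code, round(seconds, 1))
```

Passing the command as a list avoids a second layer of shell quoting around the SQL strings. Each run should write to a fresh index directory (the -l argument) so that one run does not affect the next.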
The tests will involve measuring the time to complete indexing of the above two queries with varying heap sizes.

thanks,
Glen

[1] http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
[2] http://wiki.apache.org/solr/DataImportHandler
[3] http://wiki.apache.org/solr/DataImportHandler#head-210d4735264367dd07c61ec67dacb8581c57eb17
[4] http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/standard/StandardAnalyzer.html
[5] http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/standard/StandardTokenizer.html
[6] http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/standard/StandardFilter.html
[7] http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/LowerCaseFilter.html
[8] http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/StopFilter.html
[9] http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html
[10] http://zzzoot.blogspot.com/search?q=lucene