Hello,

I am putting together some performance comparisons of LuSql[1] and
Solr's Data Import Request Handler[2] with JdbcDataSource[3]. I want to
make sure I am comparing apples with apples, so I would appreciate the
community's help in checking that the two setups are equivalent.

First, LuSql uses Lucene's StandardAnalyzer[4] by default. The Javadoc
indicates that it uses StandardTokenizer[5], StandardFilter[6],
LowerCaseFilter[7], and StopFilter[8]. I have created a fieldType in
my Solr configuration's schema.xml that I hope is equivalent to
this:

    <fieldType name="textN" class="solr.TextField" positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
      </analyzer>
    </fieldType>

Is this equivalent?

Queries
The two queries in the evaluation run against the MySQL database of
6.4m journal article metadata records[9] that I have used in previous
Lucene indexing and searching work[10].

Here is the LuSql command line for the first query:
 java ca.nrc.cisti.lusql.core.LuSqlMain -q "select  Publisher.name as
pub, Journal.title as jo, Article.rawUrl as textpath, Journal.issn,
Volume.number as vol,Volume.coverYear as year, Issue.number as iss,
Article.id as id,Article.title as ti, Article.abstract,
Article.startPage as startPage,Article.endPage as endPage from
Publisher, Journal, Volume, Issue, Article where Publisher.id =
Journal.publisherId and Journal.id = Volume.journalId and Volume.id =
Issue.volumeId and Issue.id = Article.issueId" -c
"jdbc:mysql://blue01/dartejos?user=USER&password=PASS" -n 500000
-v -l testsolr0

Here is the corresponding Solr data-config.xml file:
<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://blue01/dartejos" user="USER" password="PASS"/>
    <document name="products">
        <entity name="item" query="select  Publisher.name as pub,
Journal.title as jo, Article.rawUrl as textpath, Journal.issn,
Volume.number as vol,Volume.coverYear as year, Issue.number as iss,
Article.id,Article.title as ti, Article.abstract, Article.startPage as
startPage,Article.endPage as endPage from Publisher, Journal, Volume,
Issue, Article where Publisher.id = Journal.publisherId and Journal.id
= Volume.journalId and Volume.id = Issue.volumeId and Issue.id =
Article.issueId  limit 500000">
            <field column="id" name="id" />
            <field column="jo" name="id" />
            <field column="issn" name="id" />
            <field column="vol" name="id" />
            <field column="year" name="id" />
            <field column="iss" name="id" />
            <field name="abstract" column="abstract"/>
            <field name="title" column="title"/>
            <field name="pub" column="pub"/>
            <field name="textpath" column="textpath"/>
            <field name="startPage" column="startPage"/>
            <field name="endPage" column="endPage"/>
        </entity>
    </document>
</dataConfig>



Here is the LuSql command line for the second query:

 java ca.nrc.cisti.lusql.core.LuSqlMain -q "select  Publisher.name as
pub, Journal.title as jo, Article.rawUrl as textpath, Journal.issn,
Volume.number as vol,Volume.coverYear as year, Issue.number as iss,
Article.id as id,Article.title as ti, Article.abstract,
Article.startPage as startPage,Article.endPage as endPage from
Publisher, Journal, Volume, Issue, Article where Publisher.id =
Journal.publisherId and Journal.id = Volume.journalId and Volume.id =
Issue.volumeId and Issue.id = Article.issueId" -c
"jdbc:mysql://blue01/dartejos?user=USER&password=PASS" -n 500000
-v -l testsolr1 -Q "id|select Keyword.string as keyword from
ArticleKeywordJoin, Keyword where ArticleKeywordJoin.keywordId=@  and
ArticleKeywordJoin.articleId =  Keyword.id"  -Q "id|select
concat(lastName,\', \', firstName) as fullAuthor   from
ArticleAuthorJoin, Author where ArticleAuthorJoin.articleId = @   and
ArticleAuthorJoin.authorId = Author.id"  -Q "id|select
referencedArticleId as citedId   from Reference where
Reference.referencingArticleId = @"

Here is the corresponding Solr data-config.xml file:

<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://blue01/dartejos" user="USER" password="PASS"/>
    <document name="products">
        <entity name="item" query="select  Publisher.name as pub,
Journal.title as jo, Article.rawUrl as textpath, Journal.issn,
Volume.number as vol,Volume.coverYear as year, Issue.number as iss,
Article.id,Article.title as ti, Article.abstract, Article.startPage as
startPage,Article.endPage as endPage from Publisher, Journal, Volume,
Issue, Article where Publisher.id = Journal.publisherId and Journal.id
= Volume.journalId and Volume.id = Issue.volumeId and Issue.id =
Article.issueId  limit 500000">
            <field column="id" name="id" />
            <field column="jo" name="id" />
            <field column="issn" name="id" />
            <field column="vol" name="id" />
            <field column="year" name="id" />
            <field column="iss" name="id" />
            <field name="abstract" column="abstract"/>
            <field name="title" column="title"/>
            <field name="pub" column="pub"/>
            <field name="textpath" column="textpath"/>
            <field name="startPage" column="startPage"/>
            <field name="endPage" column="endPage"/>

            <entity name="keyword" query="select Keyword.string as
keyword from ArticleKeywordJoin, Keyword where
ArticleKeywordJoin.keywordId='${item.id}' and
ArticleKeywordJoin.articleId = Keyword.id">
                 <field name="keyword" column="keyword" />
            </entity>

            <entity name="fullAuthor" query="select concat(lastName,',
', firstName) as fullAuthor from ArticleAuthorJoin, Author where
ArticleAuthorJoin.articleId = '${item.id}'   and
ArticleAuthorJoin.authorId = Author.id">
                <field name="fullAuthor" column="fullAuthor" />
            </entity>

            <entity name="referencedArticleId" query="select
referencedArticleId as citedId  from Reference where
Reference.referencingArticleId ='${item.id}'">
                <field name="referencedArticleId"
column="referencedArticleId" />
            </entity>
        </entity>
    </document>
</dataConfig>

Does this configuration look right, i.e. does it represent an
equivalent or close-enough configuration for a valid and useful
performance comparison of LuSql and the Solr Data Import Request
Handler with JdbcDataSource?

Note: I am using Solr 1.3, however I have replaced the Lucene 2.9 jars
in webapps/solr/WEB-INF/lib with Lucene 1.4 jars, which I am also
using with LuSql.

The tests will involve measuring the time to complete indexing of the
above two queries with varying heap sizes.
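
As a rough sketch of how the heap-size runs could be scripted for the
LuSql side: the heap values and the per-run index names
(testsolr0_<heap>) below are placeholders I have not settled on, and
the -q and -c arguments from the command lines above are left out for
brevity. The script only prints the commands (a dry run); the heap is
set with the standard JVM -Xmx option.

```shell
#!/bin/sh
# Hypothetical heap sizes; the real values are still to be decided.
HEAPS="256m 512m 1g"

# Build the LuSql command line for one heap size.
# The -q (SQL) and -c (JDBC URL) arguments shown earlier must be appended.
lusql_cmd() {
  heap="$1"
  printf 'java -Xmx%s ca.nrc.cisti.lusql.core.LuSqlMain -n 500000 -v -l testsolr0_%s' "$heap" "$heap"
}

for heap in $HEAPS; do
  # Dry run: print the timed command instead of executing it.
  echo "time $(lusql_cmd "$heap")"
done
```

Writing each run to its own index directory keeps a later run from
appending to, or being slowed by, the index left by an earlier one.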

thanks,

Glen

[1] http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
[2] http://wiki.apache.org/solr/DataImportHandler
[3] http://wiki.apache.org/solr/DataImportHandler#head-210d4735264367dd07c61ec67dacb8581c57eb17
[4] http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/standard/StandardAnalyzer.html
[5] http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/standard/StandardTokenizer.html
[6] http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/standard/StandardFilter.html
[7] http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/LowerCaseFilter.html
[8] http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/StopFilter.html
[9] http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html
[10] http://zzzoot.blogspot.com/search?q=lucene
