[jira] Updated: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

Shai Erera (JIRA) Wed, 10 Jun 2009 06:02:35 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Shai Erera updated LUCENE-1595:
-------------------------------

    Attachment: LUCENE-1595.patch

Some updates:
# Added a log.step config parameter to PerfTask, and implemented the logging of 
messages in tearDown. Also introduced getLogMessage(int recsCount), which 
subclasses can override.
#* Overrode getLogMessage in the relevant tasks that used to log messages 
themselves, such as AddDocTask, DeleteDocTask, WriteLineDocTask ... I also 
removed the logging code from these tasks.
# Added a ConsumeContentSource task together with readContentSource.alg - this 
can be used to simply read from a content source, if we want to measure the 
performance of a particular impl.
# Removed the "xerces" class name from EnwikiContentSource (read more below).
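
The log.step / getLogMessage arrangement in the first item can be sketched 
roughly as follows. This is not the code from the patch: SketchTask and 
SketchAddDocTask are hypothetical stand-ins for PerfTask and AddDocTask, and 
the reading of log.step as "log once every N records" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class LogStepSketch {

    /** Hypothetical stand-in for a benchmark task; not the real PerfTask API. */
    static abstract class SketchTask {
        private final int logStep;   // assumed meaning of log.step: log every N records
        private int recsCount = 0;
        final List<String> log = new ArrayList<>();

        SketchTask(int logStep) {
            this.logStep = logStep;
        }

        /** The actual work of the task; subclasses no longer log here. */
        abstract void doLogic();

        final void runOnce() {
            doLogic();
            tearDown();
        }

        /** Logging lives in one place, driven by the configured step. */
        void tearDown() {
            recsCount++;
            if (logStep > 0 && recsCount % logStep == 0) {
                log.add(getLogMessage(recsCount));
            }
        }

        /** Default message; subclasses override this to customize the text. */
        String getLogMessage(int recsCount) {
            return "processed " + recsCount + " records";
        }
    }

    /** Hypothetical task that only overrides the message, not the logging. */
    static class SketchAddDocTask extends SketchTask {
        SketchAddDocTask(int logStep) {
            super(logStep);
        }

        @Override
        void doLogic() {
            // a real task would add a document to the index here
        }

        @Override
        String getLogMessage(int recsCount) {
            return "added " + recsCount + " docs";
        }
    }

    public static void main(String[] args) {
        SketchTask task = new SketchAddDocTask(2);
        for (int i = 0; i < 5; i++) {
            task.runOnce();
        }
        task.log.forEach(System.out::println);
    }
}
```

With the step kept in the base class, a task only decides what its message 
says, not when it is emitted.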

I changed EnwikiContentSource so that it no longer explicitly requests a Xerces 
SAXParser. It now falls back to the JRE's default SAXParser, which is Xerces 
anyway.
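
For illustration only, here is a minimal sketch of asking the JRE for its 
default SAX parser without naming an implementation class. It uses the standard 
javax.xml.parsers API; the countPages helper and the tiny inline XML are made up 
for the example, and the patch may obtain its reader differently.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class DefaultSaxParserSketch {

    /** Counts <page> elements using whatever SAX parser the JRE provides. */
    static int countPages(String xml) throws Exception {
        // No implementation class name is given, so the JRE default is used.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        final int[] pages = {0};
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("page".equals(qName)) {
                    pages[0]++;
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return pages[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>A</title></page>"
                   + "<page><title>B</title></page></mediawiki>";
        System.out.println("pages: " + countPages(xml));
        // Shows which parser implementation the JRE actually picked.
        System.out.println(SAXParserFactory.newInstance().newSAXParser()
                .getXMLReader().getClass().getName());
    }
}
```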

I wanted to remove the Xerces .jar, but when I attempted to read the 
enwiki-20090306-pages-articles.xml, it failed w/ an AIOOBE, so I don't think we 
can remove the .jar yet.
BTW, in LUCENE-1591 I reported that I am not able to parse that particular 
enwiki version, either w/ or w/o Xerces, whereas Mike succeeded. So I don't 
know whether this enwiki version is defective or whether it's a problem on 
Windows.

Anyway, the bottom line is we cannot remove the Xerces .jar.

I think this patch is ready for commit. All benchmark tests pass.

> Split DocMaker into ContentSource and DocMaker
> ----------------------------------------------
>
>                 Key: LUCENE-1595
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1595
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch
>
>
> This issue proposes some refactoring to the benchmark package. Today, 
> DocMaker has two roles: collecting documents from a collection and preparing 
> a Document object. These two should actually be split into ContentSource and 
> DocMaker, which will use a ContentSource instance.
> ContentSource will implement all the methods of DocMaker, like 
> getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
> 1591, by having a basic ContentSource that offers input stream services, and 
> wraps a file (for example) with a bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
> The idea is that collecting the Enwiki documents, for example, should be the 
> same whether I create documents using DocState, add payloads or index 
> additional metadata. Same goes for Trec and Reuters collections, as well as 
> LineDocMaker.
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
> 99% the same and 99% different. Most of their differences lie in the way they 
> read the data, while most of the similarity lies in the way they create 
> documents (using DocState).
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker 
> (just for the reuse of DocState). Also, other DocMakers do not use that 
> DocState today, something they could have gotten for free with the proposed 
> refactoring.
> So by having an EnwikiContentSource, ReutersContentSource and others (TREC, 
> Line, Simple), I can write several DocMakers, such as DocStateMaker, 
> ConfigurableDocMaker (one which accepts all kinds of config options) and 
> custom DocMakers (payload, facets, sorting), pass a ContentSource instance 
> to them, and reuse the same DocMaking algorithm with many content sources, as 
> well as the same ContentSource algorithm with many DocMaker implementations.
> This will also give us the opportunity to perf test content sources alone 
> (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
> creating a Document object.
> I've already done so in my code environment (I extend the benchmark package 
> for my application's purposes) and I like the flexibility I have. I think 
> this can be a nice contribution to the benchmark package, which can result in 
> some code cleanup as well.
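
The split proposed in the issue description can be sketched roughly as below. 
Only the names ContentSource, DocMaker, getNextDocData and makeDocument come 
from the issue text; everything else is an assumption made for the example: the 
DocData shape, LineSource, BasicDocMaker, MetadataDocMaker, the 
"title<TAB>body" line format, returning null at end of input, and the use of a 
Map in place of a Lucene Document. Only gzip is shown because the JDK has no 
built-in bzip stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ContentSourceSketch {

    /** Raw record produced by a content source (hypothetical shape). */
    static final class DocData {
        final String title;
        final String body;
        DocData(String title, String body) { this.title = title; this.body = body; }
    }

    /** Reads records from a collection; knows nothing about documents. */
    interface ContentSource {
        DocData getNextDocData() throws IOException; // null when exhausted (assumption)
    }

    /** Builds a document from the next record; a Map stands in for Document. */
    interface DocMaker {
        Map<String, String> makeDocument(ContentSource source) throws IOException;
    }

    /** One "title<TAB>body" record per line, read from any InputStream. */
    static final class LineSource implements ContentSource {
        private final BufferedReader reader;
        LineSource(InputStream in) {
            reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        }
        @Override public DocData getNextDocData() throws IOException {
            String line = reader.readLine();
            if (line == null) return null;
            int tab = line.indexOf('\t');
            return tab < 0 ? new DocData(line, "")
                           : new DocData(line.substring(0, tab), line.substring(tab + 1));
        }
    }

    /** Basic maker: copies title and body into fields. */
    static final class BasicDocMaker implements DocMaker {
        @Override public Map<String, String> makeDocument(ContentSource source)
                throws IOException {
            DocData data = source.getNextDocData();
            if (data == null) return null;
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("title", data.title);
            doc.put("body", data.body);
            return doc;
        }
    }

    /** Custom maker: same records, plus one extra metadata field. */
    static final class MetadataDocMaker implements DocMaker {
        private final DocMaker delegate = new BasicDocMaker();
        @Override public Map<String, String> makeDocument(ContentSource source)
                throws IOException {
            Map<String, String> doc = delegate.makeDocument(source);
            if (doc != null) {
                doc.put("bodyLength", String.valueOf(doc.get("body").length()));
            }
            return doc;
        }
    }

    /** Helper that gzips a string in memory so the example needs no files. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String lines = "First\tsome body text\nSecond\tmore text\n";
        // The same source class works over a plain or a gzip-wrapped stream.
        ContentSource plain = new LineSource(
                new ByteArrayInputStream(lines.getBytes(StandardCharsets.UTF_8)));
        ContentSource zipped = new LineSource(
                new GZIPInputStream(new ByteArrayInputStream(gzip(lines))));
        // Either maker can be paired with either source.
        System.out.println(new BasicDocMaker().makeDocument(plain));
        System.out.println(new MetadataDocMaker().makeDocument(zipped));
    }
}
```

Because a source can be drained without ever calling a DocMaker, reading speed 
can be measured on its own, which is the perf-testing point made in the 
description.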

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
