[ 
https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006377#comment-13006377
 ] 

Michael McCandless commented on LUCENE-2958:
--------------------------------------------

bq. For example, one DocMaker can decide to split each doc into N tiny docs. 
Another can choose to add facets to it. Yet another can do complex analysis on 
it and produce richer documents.

I like the flexibility to "enrich" the docs produced by the source
(set up facets, semantic extraction, etc.) but the ability to split up
docs is... dangerous, I think.

Ie, it feels to me like DocData should in fact just be a Document. The
two-step process we have now (fill fields in a DocData, then
separately ask this DocData to make one or more Documents) feels
wrong.

Splitting a big doc into N smaller docs can't be done well.  It's
synthetic data (eg 20 docs in a row will have same title/data) and so
you'll draw synthetic conclusions.

The enrichment can "simply" be a Document processing pipeline that
runs after the source document was produced from the line file.  EG,
UIMA.

When we run perf tests w/ luceneutil, we do in fact do this split, but
then we shuffle the resulting line file so that you don't see 20 docs
w/ the same title in a row.  Such runs skew eg compression results,
since a given term foo in the title will have 20 adjacent docIDs
assigned and thus be unnaturally easy to compress.  Likewise for the
date field, which makes the NRQ performance unnaturally good.

If you really want to chop docs up, do it as a pre-processing step
when building your line file...
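
A minimal sketch of such a pre-processing step, to make the split-then-
shuffle idea concrete.  This is not the luceneutil code; the class and
method names are hypothetical, and it assumes a tab-separated
title/date/body line layout and that the whole line file fits in memory:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Lucene or luceneutil.
class LineFileShuffleSketch {

    // Split one big doc into several lines; every chunk repeats the
    // title and date (assumed layout: title TAB date TAB body).
    static List<String> splitDoc(String title, String date, String body, int chunkChars) {
        if (chunkChars <= 0) {
            throw new IllegalArgumentException("chunkChars must be > 0");
        }
        List<String> lines = new ArrayList<>();
        for (int start = 0; start < body.length(); start += chunkChars) {
            int end = Math.min(body.length(), start + chunkChars);
            lines.add(title + "\t" + date + "\t" + body.substring(start, end));
        }
        return lines;
    }

    // Return a shuffled copy so chunks of one doc no longer sit next
    // to each other; a fixed seed keeps the result reproducible.
    static List<String> shuffled(List<String> lines, long seed) {
        List<String> copy = new ArrayList<>(lines);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    // Longest run of consecutive lines that share the same title.
    static int longestTitleRun(List<String> lines) {
        int best = 0;
        int run = 0;
        String prev = null;
        for (String line : lines) {
            int tab = line.indexOf('\t');
            String title = tab < 0 ? line : line.substring(0, tab);
            run = title.equals(prev) ? run + 1 : 1;
            best = Math.max(best, run);
            prev = title;
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.addAll(splitDoc("Doc A", "2011-03-14", "aaaaaaaaaaaaaaaaaaaa", 2));
        lines.addAll(splitDoc("Doc B", "2011-03-15", "bbbbbbbbbbbbbbbbbbbb", 2));
        System.out.println("longest run before shuffle: " + longestTitleRun(lines));
        System.out.println("longest run after shuffle:  " + longestTitleRun(shuffled(lines, 42L)));
    }
}
```

With only two source docs the shuffle cannot break up runs very well;
the point is only that the split and the shuffle both happen before
the line file is handed to the benchmark.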

bq. Before that, you'd have to write a DocMaker for every such combination. 
E.g., if you wanted to add facets, you'd need to write a DocMaker per source of 
data with the same impl.

But, if LineDocSource returned a Document, can't you take that
Document and run with it?  We'd still have a single class that pulls
Documents from a line file, just different "Document processors" that
run after it.
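
Roughly what I have in mind, as a sketch only.  To keep it
self-contained a Map stands in for Lucene's Document, and everything
here is an assumption made for illustration (the class name, the
tab-separated title/date/body layout, the "year" and "bodyLength"
fields); it is not the benchmark module's API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: one source class, many pluggable processors.
class DocPipelineSketch {

    // The single "source" step: turn one line into a document.
    // Assumed layout: title TAB date TAB body.
    static Map<String, String> parseLine(String line) {
        String[] parts = line.split("\t", 3);
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", parts.length > 0 ? parts[0] : "");
        doc.put("date", parts.length > 1 ? parts[1] : "");
        doc.put("body", parts.length > 2 ? parts[2] : "");
        return doc;
    }

    // Example processor: derive a facet-like "year" field from the date.
    static Map<String, String> addYearFacet(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        String date = doc.getOrDefault("date", "");
        if (date.length() >= 4) {
            out.put("year", date.substring(0, 4));
        }
        return out;
    }

    // Example processor: record the body length as an extra field.
    static Map<String, String> addBodyLength(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        out.put("bodyLength", String.valueOf(doc.getOrDefault("body", "").length()));
        return out;
    }

    static List<UnaryOperator<Map<String, String>>> exampleProcessors() {
        List<UnaryOperator<Map<String, String>>> processors = new ArrayList<>();
        processors.add(DocPipelineSketch::addYearFacet);
        processors.add(DocPipelineSketch::addBodyLength);
        return processors;
    }

    // Parse once, then run every processor in order on the result.
    static Map<String, String> process(String line,
                                       List<UnaryOperator<Map<String, String>>> processors) {
        Map<String, String> doc = parseLine(line);
        for (UnaryOperator<Map<String, String>> processor : processors) {
            doc = processor.apply(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        String line = "Some title\t2011-03-14\tSome body text";
        System.out.println(process(line, exampleProcessors()));
    }
}
```

Adding facets for a new data source then means reusing addYearFacet
(or whatever the processor is) rather than writing another DocMaker.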

I'm still not really seeing why DocData is needed, except for the
somewhat dangerous split-up-docs case.

But we don't need to change/fix this, today...


> WriteLineDocTask improvements
> -----------------------------
>
>                 Key: LUCENE-2958
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2958
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch, 
> LUCENE-2958.patch
>
>
> Make WriteLineDocTask and LineDocSource more flexible/extendable:
> * allow to emit lines also for empty docs (keep current behavior as default)
> * allow more/less/other fields
