[ https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006377#comment-13006377 ]
Michael McCandless commented on LUCENE-2958:
--------------------------------------------

bq. For example, one DocMaker can decide to split each doc into N tiny docs. Another can choose to add facets to it. Yet another can do complex analysis on it and produce richer documents.

I like the flexibility to "enrich" the docs produced by the source (set up facets, semantic extraction, etc.) but the ability to split up docs is... dangerous, I think.

Ie, it feels to me like DocData should in fact just be a Document. The two-step process we have now (fill fields in a DocData, then, separately, ask this DocData to make one or more Documents) feels wrong.

Splitting a big doc into N smaller docs can't be done well. It's synthetic data (eg 20 docs in a row will have the same title/data) and so you'll draw synthetic conclusions.

The enrichment can "simply" be a Document processing pipeline that runs after the source document was produced from the line file. EG, UIMA.

When we run perf tests w/ luceneutil, we do in fact do this split, but then we shuffle the resulting line file so that you don't see 20 docs w/ the same title in a row. Without the shuffle, results such as compression are skewed, since a given term foo in the title will have 20 adjacent docIDs assigned and thus be unnaturally easy to compress. Likewise for the date field, which makes the NRQ performance unnaturally good.

If you want to chop docs up, really you should do it as a pre-processing step in building your line file...

bq. Before that, you'd have to write a DocMaker for every such combination. E.g., if you wanted to add facets, you'd need to write a DocMaker per source of data with the same impl.

But, if LineDocSource returned a Document, can't you take that Document and run with it? We'd still have a single class that pulls a Document from a line file, just different "Document processors" that run after it.

I'm still not really seeing why DocData is needed, except for the somewhat dangerous split-up-docs case.
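To make the "single source, then Document processors" idea concrete, here is a rough sketch. It is not the benchmark API: a plain Map stands in for Lucene's Document, the line is assumed to be title TAB date TAB body, the date is assumed to start with a four-digit year, and every name in it (DocPipelineSketch, fromLine, process, addYearFacet, addBodyLength) is made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/**
 * Sketch only: a Map stands in for Lucene's Document, and all names here
 * are hypothetical, not part of contrib/benchmark.
 */
final class DocPipelineSketch {

    /** Builds one "document" from one line, assumed to be title TAB date TAB body. */
    static Map<String, String> fromLine(String line) {
        String[] parts = line.split("\t", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 tab-separated fields: " + line);
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", parts[0]);
        doc.put("date", parts[1]);
        doc.put("body", parts[2]);
        return doc;
    }

    /** Runs each processor in order; one document in, one document out (no splitting). */
    static Map<String, String> process(Map<String, String> doc,
                                       List<UnaryOperator<Map<String, String>>> processors) {
        Map<String, String> current = doc;
        for (UnaryOperator<Map<String, String>> processor : processors) {
            current = processor.apply(current);
        }
        return current;
    }

    /** Example enrichment: a made-up facet field taken from the first 4 chars of the date. */
    static Map<String, String> addYearFacet(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        String date = doc.get("date");
        if (date != null && date.length() >= 4) {
            out.put("facet.year", date.substring(0, 4));
        }
        return out;
    }

    /** Example enrichment: a made-up field recording the body length in chars. */
    static Map<String, String> addBodyLength(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        String body = doc.get("body");
        out.put("body.length", String.valueOf(body == null ? 0 : body.length()));
        return out;
    }

    public static void main(String[] args) {
        // The same source feeds any combination of processors.
        List<UnaryOperator<Map<String, String>>> processors =
            Arrays.asList(DocPipelineSketch::addYearFacet, DocPipelineSketch::addBodyLength);
        Map<String, String> doc =
            process(fromLine("A title\t2011-03-14\thello world"), processors);
        System.out.println(doc);
    }
}
```

The point of the shape is that adding facets (or anything else) is one more processor in the list, not one more DocMaker per data source, and no processor can turn one doc into N.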
But we don't need to change/fix this, today...

> WriteLineDocTask improvements
> -----------------------------
>
> Key: LUCENE-2958
> URL: https://issues.apache.org/jira/browse/LUCENE-2958
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/benchmark
> Reporter: Doron Cohen
> Assignee: Doron Cohen
> Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch
>
>
> Make WriteLineDocTask and LineDocSource more flexible/extendable:
> * allow to emit lines also for empty docs (keep current behavior as default)
> * allow more/less/other fields