[ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516996 ]
Michael McCandless commented on LUCENE-971:
-------------------------------------------

This looks great!

One alternate approach here would be to create a WikipediaDocMaker (implementing the DocMaker interface) that pulls directly from the XML file and feeds documents into the alg. Then, to make a line file, one could create an alg that pulls docs from WikipediaDocMaker and uses the WriteLineDoc task to create the line-by-line file.

One benefit of this approach is that creating docs of a certain size (10 tokens, 100 tokens, etc.) would become a one-step process (a single alg) instead of what I think is a two-step process now (make the first line file, then reprocess it into a second line file). Another benefit is that you could make wikipedia tasks that pull directly from the XML file and not even use a line file as an intermediary.

Steve, do you think this would be a hard change? I think it should be easy, except I'm not sure how to do this with SAX, since SAX is "in control". You sort of need coroutines. Or maybe one thread is running SAX and putting doc data into a shared queue, and then the other thread (the normal "main" thread that benchmark runs) would pull from this queue?

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with the indexLineFile task.

--
This message is automatically generated by JIRA.
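The producer/consumer idea in the comment above (one thread running SAX and feeding a shared queue, while the benchmark's main thread pulls from it) can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual API: the element names, the poison-pill marker, and the class name are all hypothetical, and it only extracts page titles from a toy XML snippet rather than a real enwiki dump.

```java
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of the two-thread approach: SAX stays "in control"
// on its own thread and pushes parsed doc data onto a bounded queue; the
// consumer (standing in for the benchmark's main thread) pulls from it.
public class QueueSketch {
    static final String POISON = "__END__"; // illustrative end-of-input marker

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>A</title></page>"
                   + "<page><title>B</title></page></mediawiki>";
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);

        // Producer thread: runs the SAX parser, blocks on put() if the
        // consumer falls behind (the queue provides the back-pressure).
        Thread parser = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        private StringBuilder buf;
                        @Override public void startElement(String uri, String local,
                                String qName, Attributes atts) {
                            if ("title".equals(qName)) buf = new StringBuilder();
                        }
                        @Override public void characters(char[] ch, int start, int len) {
                            if (buf != null) buf.append(ch, start, len);
                        }
                        @Override public void endElement(String uri, String local,
                                String qName) throws SAXException {
                            if ("title".equals(qName)) {
                                try {
                                    queue.put(buf.toString());
                                } catch (InterruptedException e) {
                                    throw new SAXException(e);
                                }
                                buf = null;
                            }
                        }
                    });
                queue.put(POISON); // tell the consumer there are no more docs
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        parser.start();

        // Consumer: the "main" thread pulls one doc at a time, as the
        // benchmark tasks would.
        for (String title; !(title = queue.take()).equals(POISON); ) {
            System.out.println("doc: " + title);
        }
        parser.join();
    }
}
```

A DocMaker built this way would hide the queue behind its makeDocument() call: each call is just a take() on the queue, so the rest of the benchmark never sees that SAX is driving a separate thread.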