[ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Parkes updated LUCENE-971:
---------------------------------

    Attachment: LUCENE-971.patch.txt

Okay. Here's an update to the patch.

Changes:

1) EnwikiDocMaker replaces ExtractWikipedia

2) A sample algorithm is provided (and used by the build.xml file, which could be removed if desired)

3) A bug in LineDocMaker is fixed

It was storing both the title and the date in the title field (small enough that it doesn't need its own JIRA?).

4) LineDocMaker was made derivable-from

Much of the code in LineDocMaker is useful in EnwikiDocMaker, so I made it a base class (it's inheritance for implementation, not abstraction, so it could be changed, of course).

5) Made LineDocMaker and WriteLineDocTask multi-character safe

Or at least I tried to. Wikipedia has non-ASCII characters in it. To make LineDocMaker work as a base class, I made it use an explicit FileInputStream, which is required so that SAX can extract the encoding correctly. I made WriteLineDocTask always write UTF-8 so that I can get non-ASCII characters into the output file. UTF-8 seems like the best encoding for line files?

At the same time, I made LineDocMaker assume UTF-8 (unless told otherwise by a derived class like EnwikiDocMaker) so that the line files created by EnwikiDocMaker/WriteLineDocTask can be read by LineDocMaker without loss.

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt, LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
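
The encoding change in item 5 can be illustrated with a minimal sketch. This is not the code from the patch: the class name Utf8LineFileSketch, the tab separator, and the title/date/body field order are assumptions made for illustration, and it uses current JDK APIs rather than what the benchmark code used at the time. The point is only that both the writer and the reader name UTF-8 explicitly instead of relying on the platform default, so non-ASCII titles survive the round trip.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the LineDocMaker/WriteLineDocTask from the patch.
public class Utf8LineFileSketch {
    // Assumed field separator: one document per line, fields split by a tab.
    static final char SEP = '\t';

    // Tabs and line breaks inside a field would corrupt the line format,
    // so they are replaced with spaces before writing.
    static String clean(String field) {
        return field.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ');
    }

    // Writes each {title, date, body} triple as one UTF-8 encoded line.
    static void writeLines(Path file, List<String[]> docs) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8))) {
            for (String[] doc : docs) {
                out.write(clean(doc[0]));
                out.write(SEP);
                out.write(clean(doc[1]));
                out.write(SEP);
                out.write(clean(doc[2]));
                out.write('\n');
            }
        }
    }

    // Reads the file back, decoding explicitly as UTF-8.
    static List<String[]> readLines(Path file) throws IOException {
        List<String[]> docs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                Files.newInputStream(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Limit of 3 keeps any trailing empty body field.
                docs.add(line.split("\t", 3));
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("linedocs", ".txt");
        List<String[]> docs = new ArrayList<>();
        docs.add(new String[] {"Zürich", "2007-08-01", "Größte Stadt der Schweiz"});
        writeLines(tmp, docs);
        for (String[] doc : readLines(tmp)) {
            System.out.println(Arrays.toString(doc));
        }
        Files.delete(tmp);
    }
}
```

If either side fell back to a single-byte platform default encoding, a title such as "Zürich" could come back garbled, which is the loss the patch aims to avoid.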