[
https://issues.apache.org/jira/browse/LUCENE-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Uwe Schindler reassigned LUCENE-4220:
-------------------------------------
Assignee: Uwe Schindler
> Replace benchmarks crazy HTML parser by a nekohtml 10-liner
> -----------------------------------------------------------
>
> Key: LUCENE-4220
> URL: https://issues.apache.org/jira/browse/LUCENE-4220
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/benchmark
> Affects Versions: 4.0-ALPHA
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Fix For: 4.0, 5.0
>
> Attachments: LUCENE-4220.patch, LUCENE-4220.patch, LUCENE-4220.patch
>
>
> The benchmark module contains a JavaCC-based HTML parser which of course
> violates all specs, is huge, and is error-prone.
> I can replace it with a NekoHTML-based one (approx. 10 - 20 lines of code).
> NekoHTML is an extension for Xerces (which we already use to read Wikipedia)
> that produces SAX events or a DOM tree out of an HTML file using standard XML
> APIs. We could also use Tika, but I refuse to download the Internet to get
> Tika running just for parsing an HTML file.
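
To illustrate the SAX-based approach described above, here is a minimal sketch of a handler that collects the title and body text from SAX events. It is not the code from the attached patch: the class name HtmlTextHandler and its methods are hypothetical. So that it runs without extra jars, it is driven by the JDK's built-in XML parser on well-formed XHTML. With NekoHTML on the classpath, the XMLReader would presumably be an org.cyberneko.html.parsers.SAXParser instead, which also accepts malformed HTML and by default reports element names in upper case (hence the case-insensitive comparisons).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not the class from LUCENE-4220.patch.
class HtmlTextHandler extends DefaultHandler {
  private final StringBuilder title = new StringBuilder();
  private final StringBuilder body = new StringBuilder();
  private boolean inTitle = false;
  private int suppressed = 0; // nesting depth inside <script> or <style>

  private static boolean isSuppressed(String name) {
    return name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style");
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    if (qName.equalsIgnoreCase("title")) {
      inTitle = true;
    } else if (isSuppressed(qName)) {
      suppressed++;
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if (qName.equalsIgnoreCase("title")) {
      inTitle = false;
    } else if (isSuppressed(qName)) {
      suppressed--;
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (inTitle) {
      title.append(ch, start, length);
    } else if (suppressed == 0) {
      body.append(ch, start, length);
    }
  }

  String getTitle() {
    return title.toString().trim();
  }

  String getBody() {
    // Collapse runs of whitespace so the text is usable as a document body.
    return body.toString().replaceAll("\\s+", " ").trim();
  }

  // Parses well-formed XHTML with the JDK's SAX parser (assumption: swap in
  // NekoHTML's SAXParser as the XMLReader to handle real-world HTML).
  static HtmlTextHandler parse(String xhtml) throws Exception {
    XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    HtmlTextHandler handler = new HtmlTextHandler();
    reader.setContentHandler(handler);
    reader.parse(new InputSource(new StringReader(xhtml)));
    return handler;
  }

  public static void main(String[] args) throws Exception {
    HtmlTextHandler h = parse(
        "<html><head><title>Hello</title></head>"
            + "<body><p>Some <b>bold</b> text</p></body></html>");
    System.out.println(h.getTitle());
    System.out.println(h.getBody());
  }
}
```

Because the handler only depends on the standard org.xml.sax interfaces, the choice of parser stays a one-line change, which is the point of using standard XML APIs here.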
--
This message is automatically generated by JIRA.