[ https://issues.apache.org/jira/browse/LUCENE-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185759#comment-13185759 ]
Steven Rowe edited comment on LUCENE-3690 at 1/13/12 7:08 PM:
--------------------------------------------------------------

This patch contains a feature-complete version.

Changes from the previous patch:

# Now substituting newlines instead of spaces for block-level elements; this corresponds more closely to on-screen layout, enables sentence segmentation, and doesn't change offsets.
# Supplementary characters are now accepted in tags.
# Switched accepted tag names from {{[:XID_Start:]}} and {{[:XID_Continue:]}} Unicode properties to the more relaxed {{[:ID_Start:]}} and {{[:ID_Continue:]}} properties, in order to broaden the range of recognizable input. (The improved security afforded by the {{XID_*}} properties is irrelevant to what a {{CharFilter}} does.)
# Uppercase character entity variants for "quot", "copy", "gt", "lt", "reg", and "amp" (from Dawid Weiss's SOLR-882 patch) are now accepted.
# MS-Word-generated broken processing instructions ({{<? ... />}} instead of {{<? ... ?>}}) are now accepted.
# Added several tests, including parsing a full MS-Word-2010-generated HTML file.

Left to do:

# Move javadocs from {{BaseCharFilter.addOffCorrectMap()}} to {{o.a.l.analysis.charfilter}} package level javadoc file.
# Rename the existing {{HTMLStripCharFilter}} to {{ClassicHTMLStripCharFilter}}; move it to Solr {{o.a.s.analysis}} package; deprecate it; and make it package private.
# Rename {{JFlexHTMLStripCharFilter}} to {{HTMLStripCharFilter}}.
# Enable Solr back-compat (but not Lucene back-compat, since {{HTMLStripCharFilter}} has never been released as part of Lucene) by making {{HTMLStripCharFilterFactory}} instantiate {{ClassicHTMLStripCharFilter}} if the {{luceneMatchVersion}} parameter is {{LUCENE_35}} or earlier, and otherwise instantiate the new {{HTMLStripCharFilter}}.

was (Author: steve_rowe):

This patch contains a feature-complete version.

Changes from the previous patch:

# Now substituting newlines instead of spaces for block-level elements; this corresponds more closely to on-screen layout, enables sentence segmentation, and doesn't change offsets.
# Supplementary characters are now accepted in tags.
# Switched accepted tag names from {{[:XID_Start:]}} and {{[:XID_Continue:]}} Unicode properties to the more relaxed {{[:ID_Start:]}} and {{[:ID_Continue:]}} properties, in order to broaden the range of recognizable input. (The improved security afforded by the {{XID_*}} properties is irrelevant to {{CharFilter}}s' function.)
# Uppercase character entity variants for "quot", "copy", "gt", "lt", "reg", and "amp" (from Dawid Weiss's SOLR-882 patch) are now accepted.
# MS-Word-generated broken processing instructions ({{<? ... />}} instead of {{<? ... ?>}}) are now accepted.
# Added several tests, including parsing a full MS-Word-2010-generated HTML file.

Left to do:

# Move javadocs from {{BaseCharFilter.addOffCorrectMap()}} to {{a.o.l.analysis.charfilter}} package level javadoc file.
# Rename the existing {{HTMLStripCharFilter}} to {{ClassicHTMLStripCharFilter}}; move it to Solr {{o.a.s.analysis}} package; deprecate it; and make it package private.
# Rename {{JFlexHTMLStripCharFilter}} to {{HTMLStripCharFilter}}.
# Enable Solr back-compat (but not Lucene back-compat, since {{HTMLStripCharFilter}} has never been released as part of Lucene) by making {{HTMLStripCharFilterFactory}} instantiate {{ClassicHTMLStripCharFilter}} if the {{luceneMatchVersion}} parameter is {{LUCENE_35}} or earlier, and otherwise instantiate the new {{HTMLStripCharFilter}}.
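
To make two of the recognition changes listed above concrete (tag names built from ID_Start/ID_Continue code points, including supplementary characters, and the MS-Word-style {{<? ... />}} processing-instruction terminator), here is a rough stand-alone sketch. It is not the JFlex grammar from the patch: the class and method names are hypothetical, and the JDK's {{Character.isUnicodeIdentifierStart}}/{{isUnicodeIdentifierPart}} are used only as an approximation of the {{[:ID_Start:]}}/{{[:ID_Continue:]}} properties (the JDK "part" check also admits ignorable characters).

```java
import java.util.regex.Pattern;

/** Illustrative only; hypothetical names, not code from the LUCENE-3690 patch. */
final class HtmlStripSketch {

    // Accepts the well-formed "<? ... ?>" terminator as well as the
    // MS-Word-generated "<? ... />" variant.
    private static final Pattern PROCESSING_INSTRUCTION =
        Pattern.compile("<\\?.*?(?:\\?>|/>)", Pattern.DOTALL);

    /**
     * Approximates "ID_Start followed by ID_Continue*" and iterates by code
     * point, so supplementary characters (surrogate pairs) are handled.
     */
    static boolean isTagName(String name) {
        if (name.isEmpty()) {
            return false;
        }
        int first = name.codePointAt(0);
        if (!Character.isUnicodeIdentifierStart(first)) {
            return false;
        }
        int i = Character.charCount(first);
        while (i < name.length()) {
            int cp = name.codePointAt(i);
            if (!Character.isUnicodeIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    /** True if the whole string looks like a (possibly broken) processing instruction. */
    static boolean isProcessingInstruction(String text) {
        return PROCESSING_INSTRUCTION.matcher(text).matches();
    }

    public static void main(String[] args) {
        // U+10400 is a supplementary-plane letter, used here as a sample tag name start.
        String supplementary = new String(Character.toChars(0x10400)) + "x";
        System.out.println(isTagName("div"));
        System.out.println(isTagName(supplementary));
        System.out.println(isProcessingInstruction("<?xml:namespace prefix = o />"));
    }
}
```

A real char filter would fold these checks into the scanner rules rather than test whole strings; the sketch only shows which inputs the relaxed rules are meant to admit.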
> JFlex-based HTMLStripCharFilter replacement
> -------------------------------------------
>
> Key: LUCENE-3690
> URL: https://issues.apache.org/jira/browse/LUCENE-3690
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 3.5, 4.0
> Reporter: Steven Rowe
> Assignee: Steven Rowe
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3690.patch, LUCENE-3690.patch, LUCENE-3690.patch
>
>
> A JFlex-based HTMLStripCharFilter replacement would be more performant and
> easier to understand and maintain.