[ https://issues.apache.org/jira/browse/ANY23-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340807#comment-15340807 ]
ASF GitHub Bot commented on ANY23-280:
--------------------------------------

GitHub user ansell commented on the issue:

    https://github.com/apache/any23/pull/24

    If we are going to be modifying the public API, we should probably be aiming for a 2.0 release; otherwise the version numbers are arbitrary.

> Refactor ContentExtractor to improve extraction flexibility
> -----------------------------------------------------------
>
>                 Key: ANY23-280
>                 URL: https://issues.apache.org/jira/browse/ANY23-280
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core, extractors
>    Affects Versions: 1.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.2
>
>
> As discussed on ANY23-247, the
> [ContentExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44]
> is simply not fit for purpose. This issue was discovered and the cause has
> plagued our builds ever since. Any extractors which implement
> [BaseRDFExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java]
> are based on the Extractor.ContentExtractor and hence work off of an
> 'unfixed' raw data stream as opposed to a more flexible model such as the
> [TagSoupDOMExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60].
> This issue should refactor RDF extractors to enable more flexibility and to
> avoid issues we encounter with the strict SAX parsing logic.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)