[ https://issues.apache.org/jira/browse/ANY23-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519145#comment-17519145 ]
ASF GitHub Bot commented on ANY23-280:
--------------------------------------

erdemtuna commented on code in PR #24:
URL: https://github.com/apache/any23/pull/24#discussion_r845529396

##########
api/src/main/java/org/apache/any23/configuration/Configuration.java:
##########
@@ -33,7 +33,7 @@
      * Checks whether a property is defined or not in configuration.
      *

Review Comment:
   sar

> Refactor ContentExtractor to improve extraction flexibility
> -----------------------------------------------------------
>
>                 Key: ANY23-280
>                 URL: https://issues.apache.org/jira/browse/ANY23-280
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core, extractors
>    Affects Versions: 1.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Blocker
>             Fix For: 2.2
>
>
> As discussed on ANY23-247, the
> [ContentExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44]
> is simply not fit for purpose. This issue was discovered and the cause has
> plagued our builds ever since. Any extractors which implement
> [BaseRDFExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java]
> are based on the Extractor.ContentExtractor and hence work off of an
> 'unfixed' raw data stream as opposed to a more flexible model such as the
> [TagSoupDOMExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60].
> This issue should refactor RDF extractors to enable more flexibility and to
> avoid issues we encounter with the strict SAX parsing logic.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)