[ https://issues.apache.org/jira/browse/ANY23-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228968#comment-15228968 ]
ASF GitHub Bot commented on ANY23-280:
--------------------------------------

GitHub user lewismc opened a pull request:

https://github.com/apache/any23/pull/24

Initial move towards addressing ANY23-280 Refactor ContentExtractor to improve extraction flexibility

Hi Folks,

This is an initial crack at addressing https://issues.apache.org/jira/browse/ANY23-280

Essentially, the main API difference is the complete removal of

```
public interface ContentExtractor extends Extractor<InputStream>
```

from the Extractor interface in the api module.

This patch has a long way to go, with numerous failing tests; however, I wanted to post it for feedback. Although Any23 still builds with -DskipTests, without that flag the failing tests are as follows:

```
Results :

Failed tests:
  Any23Test.testDemoCodeSnippet1:201
  Any23Test.testN3Detection1:92->assertDetection:661
  Any23Test.testN3Detection2:97->assertDetection:661
  Any23Test.testTTLDetection:87->assertDetection:661
  RoverTest.testRunMultiURLs:104->runWithMultiSourcesAndVerify:134 Unexpected number of statements.

Tests in error:
  Any23Test.testProgrammaticExtraction:279 » NullPointer
  CSVExtractorTest.testExtractionCommaSeparated:49->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
  CSVExtractorTest.testExtractionEmptyValue:112->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
  CSVExtractorTest.testExtractionSemicolonSeparated:64->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
  CSVExtractorTest.testExtractionTabSeparated:79->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
  CSVExtractorTest.testTypeManagement:94->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
  RDFa11ExtractorTest>AbstractRDFaExtractorTestCase.testDrupalTestPage:124->AbstractExtractorTestCase.assertExtract:217->AbstractExtractorTestCase.assertExtract:200->AbstractExtractorTestCase.extract:185 » NullPointer
  RDFaExtractorTest>AbstractRDFaExtractorTestCase.testDrupalTestPage:124->AbstractExtractorTestCase.assertExtract:217->AbstractExtractorTestCase.assertExtract:200->AbstractExtractorTestCase.extract:185 » NullPointer

Tests run: 403, Failures: 5, Errors: 8, Skipped: 11
```

You will see that some of the failing tests also concern https://issues.apache.org/jira/browse/ANY23-267.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lewismc/any23 ANY23-280

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/any23/pull/24.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #24

----
commit 801f2f93967bfd1295700223085eef3f54181517
Author: Lewis John McGibbney <lewis.j.mcgibb...@jpl.nasa.gov>
Date:   2016-04-06T19:44:35Z

    Initial move towards addressing ANY23-280 Refactor ContentExtractor to improve extraction flexibility

----

> Refactor ContentExtractor to improve extraction flexibility
> -----------------------------------------------------------
>
>                 Key: ANY23-280
>                 URL: https://issues.apache.org/jira/browse/ANY23-280
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core, extractors
>    Affects Versions: 1.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.2
>
>
> As discussed on ANY23-247, the
> [ContentExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44]
> is simply not fit for purpose. This issue was discovered and the cause has
> plagued our builds ever since. Any extractors which implement
> [BaseRDFExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java]
> are based on the Extractor.ContentExtractor and hence work off of an
> 'unfixed' raw data stream as opposed to a more flexible model such as the
> [TagSoupDOMExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60].
> This issue should refactor RDF extractors to enable more flexibility and to
> avoid issues we encounter with the strict SAX parsing logic.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)