[jira] [Commented] (ANY23-280) Refactor ContentExtractor to improve extraction flexibility

ASF GitHub Bot (JIRA) Wed, 06 Apr 2016 12:51:12 -0700

    [ https://issues.apache.org/jira/browse/ANY23-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228968#comment-15228968 ]


ASF GitHub Bot commented on ANY23-280:
--------------------------------------

GitHub user lewismc opened a pull request:

    https://github.com/apache/any23/pull/24

    Initial move towards addressing ANY23-280 Refactor ContentExtractor to improve extraction flexibility

    Hi Folks,
    This is an initial crack at addressing https://issues.apache.org/jira/browse/ANY23-280
    Essentially, the main API difference is the complete removal of ```public interface ContentExtractor extends Extractor<InputStream>``` from the Extractor interface in the api module.
    This patch has a long way to go, with numerous failing tests; however, I wanted to post it for feedback.
    Although Any23 still builds with -DskipTests, without that flag the failing tests are as follows:
    ```
    Results :
    
    Failed tests:
      Any23Test.testDemoCodeSnippet1:201
      Any23Test.testN3Detection1:92->assertDetection:661
      Any23Test.testN3Detection2:97->assertDetection:661
      Any23Test.testTTLDetection:87->assertDetection:661
      RoverTest.testRunMultiURLs:104->runWithMultiSourcesAndVerify:134 Unexpected number of statements.
    Tests in error:
      Any23Test.testProgrammaticExtraction:279 » NullPointer
      CSVExtractorTest.testExtractionCommaSeparated:49->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
      CSVExtractorTest.testExtractionEmptyValue:112->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
      CSVExtractorTest.testExtractionSemicolonSeparated:64->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
      CSVExtractorTest.testExtractionTabSeparated:79->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
      CSVExtractorTest.testTypeManagement:94->AbstractExtractorTestCase.dumpModelToRDFXML:714 » Runtime
      RDFa11ExtractorTest>AbstractRDFaExtractorTestCase.testDrupalTestPage:124->AbstractExtractorTestCase.assertExtract:217->AbstractExtractorTestCase.assertExtract:200->AbstractExtractorTestCase.extract:185 » NullPointer
      RDFaExtractorTest>AbstractRDFaExtractorTestCase.testDrupalTestPage:124->AbstractExtractorTestCase.assertExtract:217->AbstractExtractorTestCase.assertExtract:200->AbstractExtractorTestCase.extract:185 » NullPointer
    Tests run: 403, Failures: 5, Errors: 8, Skipped: 11
    ```
    You will see that some of the tests concern https://issues.apache.org/jira/browse/ANY23-267 as well.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lewismc/any23 ANY23-280

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/any23/pull/24.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #24
    
----
commit 801f2f93967bfd1295700223085eef3f54181517
Author: Lewis John McGibbney <[email protected]>
Date:   2016-04-06T19:44:35Z

    Initial move towards addressing ANY23-280 Refactor ContentExtractor to improve extraction flexibility

----


> Refactor ContentExtractor to improve extraction flexibility
> -----------------------------------------------------------
>
>                 Key: ANY23-280
>                 URL: https://issues.apache.org/jira/browse/ANY23-280
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core, extractors
>    Affects Versions: 1.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.2
>
>
> As discussed on ANY23-247, the
> [ContentExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44]
> is simply not fit for purpose. The problem was discovered there, and its cause
> has plagued our builds ever since. Any extractors which extend
> [BaseRDFExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java]
> are based on Extractor.ContentExtractor and hence work from an
> 'unfixed' raw data stream, as opposed to a more flexible model such as the
> [TagSoupDOMExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60].
> This issue should refactor the RDF extractors to enable more flexibility and to
> avoid the issues we encounter with the strict SAX parsing logic.
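
The contrast drawn in the issue description, between an extractor fed a raw stream and one fed an already parsed document, can be illustrated with a minimal, self-contained sketch. This is not Any23 code: the `Extractor`, `ContentExtractor`, and `DomExtractor` interfaces below are simplified, hypothetical stand-ins, the JDK's strict XML parser stands in for the strict SAX parsing mentioned above, and the "repaired" document is simply a hand-written well-formed string rather than the output of a real repair step.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class ExtractorSketch {

    // Hypothetical, simplified extractor contract (assumption, not the Any23 API).
    interface Extractor<I> {
        String run(I input) throws IOException;
    }

    // Stream-based flavour: receives the raw, unrepaired bytes.
    interface ContentExtractor extends Extractor<InputStream> {}

    // DOM-based flavour: receives a document that something upstream already parsed.
    interface DomExtractor extends Extractor<Document> {}

    // Strict parse with the JDK XML parser; malformed markup is reported as an IOException.
    static Document strictParse(InputStream in) throws IOException {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (SAXException | ParserConfigurationException e) {
            throw new IOException("strict parse failed: " + e.getMessage(), e);
        }
    }

    // The stream extractor must parse the raw input itself, so it inherits every parse failure.
    static final ContentExtractor STREAM_ROOT =
            in -> strictParse(in).getDocumentElement().getTagName();

    // The DOM extractor never touches raw markup; it only reads the document it is given.
    static final DomExtractor DOM_ROOT =
            doc -> doc.getDocumentElement().getTagName();

    static InputStream stream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    // Returns true when the stream-based extractor cannot process the markup.
    static boolean streamExtractorFails(String markup) {
        try {
            STREAM_ROOT.run(stream(markup));
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        String malformed = "<root><item>a</root>";          // unclosed <item>
        String wellFormed = "<root><item>a</item></root>";  // what a repair step might yield

        System.out.println("stream extractor fails on raw input: "
                + streamExtractorFails(malformed));
        Document repaired = strictParse(stream(wellFormed));
        System.out.println("DOM extractor root element: " + DOM_ROOT.run(repaired));
    }
}
```

The point of the sketch is only the shape of the two contracts: an extractor typed on `InputStream` has to cope with whatever bytes arrive, whereas one typed on `Document` can rely on an upstream step to have dealt with malformed input first.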



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
