[jira] [Commented] (ANY23-304) Add extractor for OpenIE
[ https://issues.apache.org/jira/browse/ANY23-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139047#comment-16139047 ]

Hudson commented on ANY23-304:
--

SUCCESS: Integrated in Jenkins build Any23-trunk #1498 (See [https://builds.apache.org/job/Any23-trunk/1498/])

ANY23-304 Add extractor for OpenIE (lewis.mcgibbney: rev 2ecfbff1dddaf57689b725feddba47c7921f726d)
* (edit) core/src/main/java/org/apache/any23/extractor/xpath/XPathExtractor.java
* (add) openie/src/test/java/org/apache/any23/openie/OpenIEExtractorTest.java
* (edit) core/src/main/java/org/apache/any23/extractor/html/EmbeddedJSONLDExtractor.java
* (edit) core/src/main/java/org/apache/any23/rdf/RDFUtils.java
* (edit) core/src/main/java/org/apache/any23/util/StreamUtils.java
* (edit) pom.xml
* (add) openie/src/main/java/org/apache/any23/openie/OpenIEExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/html/GeoExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/html/TagSoupParser.java
* (add) openie/src/main/java/org/apache/any23/openie/OpenIEExtractorFactory.java
* (edit) api/src/main/java/org/apache/any23/vocab/Vocabulary.java
* (edit) api/src/main/resources/default-configuration.properties
* (edit) core/src/main/java/org/apache/any23/extractor/yaml/YAMLExtractor.java
* (add) openie/pom.xml
* (edit) api/src/main/java/org/apache/any23/configuration/DefaultModifiableConfiguration.java
* (add) test-resources/src/test/resources/org/apache/any23/extractor/openie/example-openie.html
* (edit) api/src/main/java/org/apache/any23/configuration/DefaultConfiguration.java
* (edit) core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java
* (edit) core/src/test/java/org/apache/any23/extractor/yaml/YAMLExtractorTest.java

ANY23-304 update package names and introduce Service Loading for OpenIE (lewis.mcgibbney: rev 2f54725049f0cbc152e9e27045c0f06e93c24647)
* (edit) openie/src/main/java/org/apache/any23/extractor/openie/OpenIEExtractor.java
* (edit) openie/src/main/resources/META-INF/services/org.apache.any23.extractor.ExtractorFactory
* (edit) openie/src/main/java/org/apache/any23/extractor/openie/OpenIEExtractorFactory.java
* (edit) cli/pom.xml

ANY23-304 Address comments from ansell (lewis.mcgibbney: rev 1b0c5ff22bb61a9cd992b909c776592a081216e4)
* (edit) cli/pom.xml
* (edit) src/site/apt/dev-microformat-extractors.apt
* (edit) cli/src/test/java/org/apache/any23/cli/ToolRunnerTest.java
* (edit) src/site/apt/configuration.apt
* (edit) src/site/apt/plugin-basic-crawler.apt
* (edit) src/site/apt/any23-plugins.apt
* (edit) plugins/basic-crawler/pom.xml
* (edit) src/site/apt/plugin-office-scraper.apt
* (edit) src/site/apt/extractors.apt
* (edit) openie/src/main/java/org/apache/any23/extractor/openie/OpenIEExtractor.java
* (edit) src/site/apt/dev-data-extraction.apt
* (edit) src/site/apt/dev-xpath-extractor.apt
* (edit) src/site/apt/dev-csv-extractor.apt
* (edit) src/site/apt/dev-data-conversion.apt
* (edit) openie/src/test/java/org/apache/any23/openie/OpenIEExtractorTest.java
* (edit) src/site/apt/dev-validation-fix.apt
* (edit) src/site/apt/getting-started.apt

ANY23-304 increase number of extractors found (lewis.mcgibbney: rev 6d5c39e57b5e8a4dd29da27e3137c396dd1ffbd9)
* (edit) plugins/integration-test/src/test/java/org/apache/any23/plugin/PluginIT.java

ANY23-304 implement temporary file reader within test logic (lewis.mcgibbney: rev b39d2201440c5f1297e99365744ac3fd9b4f9d90)
* (edit) openie/src/test/java/org/apache/any23/openie/OpenIEExtractorTest.java

ANY23-304 Add extractor for OpenIE (lewis.mcgibbney: rev ef14614473f608d275eecd4c10b3ab2e50391167)
* (edit) cli/pom.xml
* (edit) plugins/integration-test/src/test/java/org/apache/any23/plugin/PluginIT.java

ANY23-304 skip tests in openie module (lewis.mcgibbney: rev c40b7888b9978bc81e6cbe1e05ea77af50367bed)
* (edit) openie/pom.xml

> Add extractor for OpenIE
>
> Key: ANY23-304
> URL: https://issues.apache.org/jira/browse/ANY23-304
> Project: Apache Any23
> Issue Type: Bug
> Components: core, extractors
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 2.1
>
> I'm going to start work on an extractor which uses the OpenIE library
> https://github.com/allenai/openie-standalone
> This will provide us with the ability to execute structured extractions from
> unstructured content essentially taking Any23 in a new direction.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[DISCUSS] Prepare Any23 2.1 Release Candidate?
Hi Folks,

We've addressed 10 issues for the 2.1 development drive... Is anyone interested in us pushing a 2.1 release? I think it would be good to keep the activity levels up.

Lewis

--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
[jira] [Resolved] (ANY23-304) Add extractor for OpenIE
[ https://issues.apache.org/jira/browse/ANY23-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved ANY23-304.
Resolution: Fixed
[jira] [Commented] (ANY23-304) Add extractor for OpenIE
[ https://issues.apache.org/jira/browse/ANY23-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139025#comment-16139025 ]

ASF GitHub Bot commented on ANY23-304:
--

Github user asfgit closed the pull request at: https://github.com/apache/any23/pull/34
[GitHub] any23 pull request #34: ANY23-304 Add extractor for OpenIE
Github user asfgit closed the pull request at: https://github.com/apache/any23/pull/34 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (ANY23-304) Add extractor for OpenIE
[ https://issues.apache.org/jira/browse/ANY23-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138920#comment-16138920 ]

ASF GitHub Bot commented on ANY23-304:
--

Github user lewismc commented on the issue: https://github.com/apache/any23/pull/34

Hi @ansell, in my last commit I've pushed a couple of (hopefully) satisfying additions, namely
* removal of the openie module from the CLI (meaning that the OpenIE extractor is not executed by default during normal unit test execution)
* addition of some class loading logic which improves the flexibility of extractor detection based upon the presence of the OpenIE extractor.

The OpenIE tests are now not executed by default... this will dramatically reduce 1) the time taken by the tests, and 2) the memory required to execute them.

Thanks for any final review.
Lewis
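The "class loading logic" and the detection of extractors "based upon the presence" of an optional module can be illustrated with the JDK's standard `java.util.ServiceLoader`. The following is only a minimal sketch under stated assumptions, not the code from the pull request: `ExtractorDiscoverySketch`, `SketchExtractorFactory`, and the probed class name are hypothetical names invented for this example, and Any23's real `ExtractorFactory` interface is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ExtractorDiscoverySketch {

    /** Hypothetical stand-in for an extractor factory SPI (not Any23's real interface). */
    public interface SketchExtractorFactory {
        String getExtractorName();
    }

    /**
     * Collects the names of every factory the ServiceLoader can find.
     * Providers are declared in a classpath resource named
     * META-INF/services/<binary name of the interface>; if an optional
     * module is absent from the classpath, its providers simply do not appear.
     */
    public static List<String> discoverExtractorNames() {
        List<String> names = new ArrayList<>();
        for (SketchExtractorFactory factory : ServiceLoader.load(SketchExtractorFactory.class)) {
            names.add(factory.getExtractorName());
        }
        return names;
    }

    /** Returns true if the named class can be loaded, without initializing it. */
    public static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, ExtractorDiscoverySketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // With no provider file on the classpath, the discovered list is empty.
        System.out.println("Discovered extractors: " + discoverExtractorNames());
        // Hypothetical class name, used only to show the presence check.
        System.out.println("Optional extractor present: "
                + isClassPresent("org.example.hypothetical.OptionalExtractor"));
    }
}
```

With this kind of discovery, a test that counts registered extractors can adapt its expectation to whether the optional module is on the classpath, rather than hard-coding a single number.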
[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE
Github user lewismc commented on the issue: https://github.com/apache/any23/pull/34

Hi @ansell, in my last commit I've pushed a couple of (hopefully) satisfying additions, namely
* removal of the openie module from the CLI (meaning that the OpenIE extractor is not executed by default during normal unit test execution)
* addition of some class loading logic which improves the flexibility of extractor detection based upon the presence of the OpenIE extractor.

The OpenIE tests are now not executed by default... this will dramatically reduce 1) the time taken by the tests, and 2) the memory required to execute them.

Thanks for any final review.
Lewis
[jira] [Commented] (ANY23-280) Refactor ContentExtractor to improve extraction flexibility
[ https://issues.apache.org/jira/browse/ANY23-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138824#comment-16138824 ]

ASF GitHub Bot commented on ANY23-280:
--

Github user lewismc commented on a diff in the pull request: https://github.com/apache/any23/pull/24#discussion_r134831365

--- Diff: api/src/main/java/org/apache/any23/extractor/Extractor.java ---
    @@ -39,22 +38,6 @@
         /**
          * This interface specializes an {@link Extractor} able to handle
    -     * {@link java.io.InputStream} as input format.
    -     */
    -    public interface ContentExtractor extends Extractor {
--- End diff --

@jgrzebyta yes this is correct... we do not always wish to assume that the input is structured in XML or a subset thereof... syntax-strict extractors are prone to breakage. Our aim in Any23 should be to provide flexibility in the extraction logic rather than strict, fragile extraction logic.

> Refactor ContentExtractor to improve extraction flexibility
>
> Key: ANY23-280
> URL: https://issues.apache.org/jira/browse/ANY23-280
> Project: Apache Any23
> Issue Type: Improvement
> Components: core, extractors
> Affects Versions: 1.1
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 2.1
>
> As discussed on ANY23-247, the
> [ContentExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44]
> is simply not fit for purpose. This issue was discovered and the cause has
> plagued our builds ever since. Any extractors which implement
> [BaseRDFExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java]
> are based on the Extractor.ContentExtractor and hence work off of an
> 'unfixed' raw data stream as opposed to a more flexible model such as the
> [TagSoupDOMExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60].
> This issue should refactor RDF extractors to enable more flexibility and to
> avoid issues we encounter with the strict SAX parsing logic.
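The fragility of syntax-strict parsing described in this thread can be seen with nothing more than the JDK's own XML parser. The sketch below is an illustration only and does not reproduce any Any23 code: `StrictParsingSketch` and `parsesStrictly` are hypothetical names, and the sample markup is made up. It feeds a raw stream to a strict parser and reports whether parsing succeeds.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictParsingSketch {

    /**
     * Returns true if the markup parses as well-formed XML, false if the
     * strict parser rejects it with a SAXException.
     */
    public static boolean parsesStrictly(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors but keeps them off stderr.
            builder.setErrorHandler(new DefaultHandler());
            InputStream in = new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8));
            builder.parse(in);
            return true;
        } catch (SAXException e) {
            // A single syntax error aborts the whole parse: nothing is extracted.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String wellFormed = "<html><body><p>ok</p></body></html>";
        String realWorld = "<html><body><p>unclosed<br></body></html>";
        System.out.println("well-formed parses: " + parsesStrictly(wellFormed));
        System.out.println("real-world parses:  " + parsesStrictly(realWorld));
    }
}
```

The second input is ordinary HTML as found on the web, yet a strict parser rejects it outright. An extractor working on a repaired DOM (the issue mentions the TagSoup-based model) would presumably still have content to work with, which is the flexibility argued for here.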
[GitHub] any23 pull request #24: Initial move towards addressing ANY23-280 Refactor C...
Github user lewismc commented on a diff in the pull request: https://github.com/apache/any23/pull/24#discussion_r134831365

--- Diff: api/src/main/java/org/apache/any23/extractor/Extractor.java ---
    @@ -39,22 +38,6 @@
         /**
          * This interface specializes an {@link Extractor} able to handle
    -     * {@link java.io.InputStream} as input format.
    -     */
    -    public interface ContentExtractor extends Extractor {
--- End diff --

@jgrzebyta yes this is correct... we do not always wish to assume that the input is structured in XML or a subset thereof... syntax-strict extractors are prone to breakage. Our aim in Any23 should be to provide flexibility in the extraction logic rather than a strict, fragile extraction logic.
[jira] [Commented] (ANY23-280) Refactor ContentExtractor to improve extraction flexibility
[ https://issues.apache.org/jira/browse/ANY23-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138292#comment-16138292 ]

ASF GitHub Bot commented on ANY23-280:
--

Github user jgrzebyta commented on a diff in the pull request: https://github.com/apache/any23/pull/24#discussion_r134738201

--- Diff: api/src/main/java/org/apache/any23/extractor/Extractor.java ---
    @@ -39,22 +38,6 @@
         /**
          * This interface specializes an {@link Extractor} able to handle
    -     * {@link java.io.InputStream} as input format.
    -     */
    -    public interface ContentExtractor extends Extractor {
--- End diff --

@lewismc Why do you remove `ContentExtractor`? I assume that if the content is neither HTML nor XML, the developer should create a new extractor extending `Extractor` directly. Am I right?
[GitHub] any23 pull request #24: Initial move towards addressing ANY23-280 Refactor C...
Github user jgrzebyta commented on a diff in the pull request: https://github.com/apache/any23/pull/24#discussion_r134738201

--- Diff: api/src/main/java/org/apache/any23/extractor/Extractor.java ---
    @@ -39,22 +38,6 @@
         /**
          * This interface specializes an {@link Extractor} able to handle
    -     * {@link java.io.InputStream} as input format.
    -     */
    -    public interface ContentExtractor extends Extractor {
--- End diff --

@lewismc Why do you remove `ContentExtractor`? I assume that if the content is neither HTML nor XML, the developer should create a new extractor extending `Extractor` directly. Am I right?