[jira] [Commented] (ANY23-304) Add extractor for OpenIE

2017-08-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139047#comment-16139047
 ] 

Hudson commented on ANY23-304:
--

SUCCESS: Integrated in Jenkins build Any23-trunk #1498 (See [https://builds.apache.org/job/Any23-trunk/1498/])
ANY23-304 Add extractor for OpenIE (lewis.mcgibbney: rev 2ecfbff1dddaf57689b725feddba47c7921f726d)
* (edit) core/src/main/java/org/apache/any23/extractor/xpath/XPathExtractor.java
* (add) openie/src/test/java/org/apache/any23/openie/OpenIEExtractorTest.java
* (edit) core/src/main/java/org/apache/any23/extractor/html/EmbeddedJSONLDExtractor.java
* (edit) core/src/main/java/org/apache/any23/rdf/RDFUtils.java
* (edit) core/src/main/java/org/apache/any23/util/StreamUtils.java
* (edit) pom.xml
* (add) openie/src/main/java/org/apache/any23/openie/OpenIEExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/html/GeoExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/html/TagSoupParser.java
* (add) openie/src/main/java/org/apache/any23/openie/OpenIEExtractorFactory.java
* (edit) api/src/main/java/org/apache/any23/vocab/Vocabulary.java
* (edit) api/src/main/resources/default-configuration.properties
* (edit) core/src/main/java/org/apache/any23/extractor/yaml/YAMLExtractor.java
* (add) openie/pom.xml
* (edit) api/src/main/java/org/apache/any23/configuration/DefaultModifiableConfiguration.java
* (add) test-resources/src/test/resources/org/apache/any23/extractor/openie/example-openie.html
* (edit) api/src/main/java/org/apache/any23/configuration/DefaultConfiguration.java
* (edit) core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java
* (edit) core/src/test/java/org/apache/any23/extractor/yaml/YAMLExtractorTest.java
ANY23-304 update package names and introduce Service Loading for OpenIE (lewis.mcgibbney: rev 2f54725049f0cbc152e9e27045c0f06e93c24647)
* (edit) openie/src/main/java/org/apache/any23/extractor/openie/OpenIEExtractor.java
* (edit) openie/src/main/resources/META-INF/services/org.apache.any23.extractor.ExtractorFactory
* (edit) openie/src/main/java/org/apache/any23/extractor/openie/OpenIEExtractorFactory.java
* (edit) cli/pom.xml
ANY23-304 Address comments from ansell (lewis.mcgibbney: rev 1b0c5ff22bb61a9cd992b909c776592a081216e4)
* (edit) cli/pom.xml
* (edit) src/site/apt/dev-microformat-extractors.apt
* (edit) cli/src/test/java/org/apache/any23/cli/ToolRunnerTest.java
* (edit) src/site/apt/configuration.apt
* (edit) src/site/apt/plugin-basic-crawler.apt
* (edit) src/site/apt/any23-plugins.apt
* (edit) plugins/basic-crawler/pom.xml
* (edit) src/site/apt/plugin-office-scraper.apt
* (edit) src/site/apt/extractors.apt
* (edit) openie/src/main/java/org/apache/any23/extractor/openie/OpenIEExtractor.java
* (edit) src/site/apt/dev-data-extraction.apt
* (edit) src/site/apt/dev-xpath-extractor.apt
* (edit) src/site/apt/dev-csv-extractor.apt
* (edit) src/site/apt/dev-data-conversion.apt
* (edit) openie/src/test/java/org/apache/any23/openie/OpenIEExtractorTest.java
* (edit) src/site/apt/dev-validation-fix.apt
* (edit) src/site/apt/getting-started.apt
ANY23-304 increase number of extractors found (lewis.mcgibbney: rev 6d5c39e57b5e8a4dd29da27e3137c396dd1ffbd9)
* (edit) plugins/integration-test/src/test/java/org/apache/any23/plugin/PluginIT.java
ANY23-304 implement temporary file reader within test logic (lewis.mcgibbney: rev b39d2201440c5f1297e99365744ac3fd9b4f9d90)
* (edit) openie/src/test/java/org/apache/any23/openie/OpenIEExtractorTest.java
ANY23-304 Add extractor for OpenIE (lewis.mcgibbney: rev ef14614473f608d275eecd4c10b3ab2e50391167)
* (edit) cli/pom.xml
* (edit) plugins/integration-test/src/test/java/org/apache/any23/plugin/PluginIT.java
ANY23-304 skip tests in openie module (lewis.mcgibbney: rev c40b7888b9978bc81e6cbe1e05ea77af50367bed)
* (edit) openie/pom.xml


> Add extractor for OpenIE
> 
>
> Key: ANY23-304
> URL: https://issues.apache.org/jira/browse/ANY23-304
> Project: Apache Any23
> Issue Type: Bug
> Components: core, extractors
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 2.1
>
>
> I'm going to start work on an extractor which uses the OpenIE library 
> https://github.com/allenai/openie-standalone
> This will provide us with the ability to execute structured extractions from 
> unstructured content essentially taking Any23 in a new direction.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[DISCUSS] Prepare Any23 2.1 Release Candidate?

2017-08-23 Thread lewis john mcgibbney
Hi Folks,
We've addressed 10 issues for the 2.1 development drive...
Is anyone interested in us pushing a 2.1 release? I think it would be good
to keep the activity levels up.
Lewis

-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


[jira] [Resolved] (ANY23-304) Add extractor for OpenIE

2017-08-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ANY23-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved ANY23-304.

Resolution: Fixed



[jira] [Commented] (ANY23-304) Add extractor for OpenIE

2017-08-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139025#comment-16139025
 ] 

ASF GitHub Bot commented on ANY23-304:
--

Github user asfgit closed the pull request at:

https://github.com/apache/any23/pull/34




[GitHub] any23 pull request #34: ANY23-304 Add extractor for OpenIE

2017-08-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/any23/pull/34


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (ANY23-304) Add extractor for OpenIE

2017-08-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138920#comment-16138920
 ] 

ASF GitHub Bot commented on ANY23-304:
--

Github user lewismc commented on the issue:

https://github.com/apache/any23/pull/34
  
Hi @ansell, in my last commit I've pushed a couple of (hopefully) 
satisfying additions, namely
 * removal of the openie module from the CLI (meaning that the OpenIE 
extractor is not executed during normal unit test execution)
 * addition of some class loading logic which improves the flexibility of 
extractor detection based upon the presence of the OpenIE extractor.

The OpenIE tests are now not executed by default... this will 
dramatically reduce 1) the time taken by the tests, and 2) the memory required 
to execute the tests.
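
The class loading logic mentioned above appears to correspond to the `META-INF/services/org.apache.any23.extractor.ExtractorFactory` file touched in the commits for this issue. A minimal sketch of that discovery pattern using `java.util.ServiceLoader` could look as follows. `DemoExtractorFactory` and `ExtractorDiscovery` are hypothetical names used only for illustration; they are not Any23 classes, and the real factory interface has a different shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical stand-in for an extractor factory SPI (assumption, not the Any23 API).
interface DemoExtractorFactory {
    String getExtractorName();
}

final class ExtractorDiscovery {

    // Collects the names of every factory registered on the classpath.
    // ServiceLoader looks for provider class names in a resource called
    // META-INF/services/<fully qualified interface name>.
    static List<String> discoverExtractorNames() {
        List<String> names = new ArrayList<>();
        for (DemoExtractorFactory factory : ServiceLoader.load(DemoExtractorFactory.class)) {
            names.add(factory.getExtractorName());
        }
        return names;
    }

    public static void main(String[] args) {
        // With no provider jar on the classpath the list is simply empty,
        // so an optional module can be absent without breaking detection.
        System.out.println("Discovered extractors: " + discoverExtractorNames());
    }
}
```

The point of the pattern is that an optional module only contributes an extractor when its jar (and its services file) is actually present, which fits the goal of leaving the module out of the CLI by default.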

Thanks for any final review.
Lewis




[GitHub] any23 issue #34: ANY23-304 Add extractor for OpenIE

2017-08-23 Thread lewismc
Github user lewismc commented on the issue:

https://github.com/apache/any23/pull/34
  
Hi @ansell, in my last commit I've pushed a couple of (hopefully) 
satisfying additions, namely
 * removal of the openie module from the CLI (meaning that the OpenIE 
extractor is not executed during normal unit test execution)
 * addition of some class loading logic which improves the flexibility of 
extractor detection based upon the presence of the OpenIE extractor.

The OpenIE tests are now not executed by default... this will 
dramatically reduce 1) the time taken by the tests, and 2) the memory required 
to execute the tests.

Thanks for any final review.
Lewis




[jira] [Commented] (ANY23-280) Refactor ContentExtractor to improve extraction flexibility

2017-08-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138824#comment-16138824
 ] 

ASF GitHub Bot commented on ANY23-280:
--

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/any23/pull/24#discussion_r134831365
  
--- Diff: api/src/main/java/org/apache/any23/extractor/Extractor.java ---
@@ -39,22 +38,6 @@
 
 /**
  * This interface specializes an {@link Extractor} able to handle
- * {@link java.io.InputStream} as input format.
- */
-public interface ContentExtractor extends Extractor {
--- End diff --

@jgrzebyta yes this is correct... we do not always wish to assume that the 
input is structured in XML or a subset thereof... syntax-strict extractors are 
prone to breakage. Our aim in Any23 should be to provide flexibility in the 
extraction logic rather than a strict, fragile extraction logic.
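
To illustrate the point about not tying every extractor to XML-shaped input, here is a minimal sketch of an extractor that implements a generic contract directly and picks its own input type. `SketchExtractor` and `PlainTextLineExtractor` are hypothetical names; the interface is a deliberately simplified stand-in and not the Any23 `Extractor` interface.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified extractor contract (assumption, not the Any23 API).
// The input type is a type parameter, so no intermediate interface is required.
interface SketchExtractor<I> {
    List<String> run(I input);
}

// Content that is neither HTML nor XML: implement the generic contract
// directly and declare InputStream as the input type.
final class PlainTextLineExtractor implements SketchExtractor<InputStream> {

    @Override
    public List<String> run(InputStream input) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(input, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    lines.add(trimmed); // each non-blank line counts as one extraction
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "first fact\n\n  second fact  \n".getBytes(StandardCharsets.UTF_8));
        System.out.println(new PlainTextLineExtractor().run(in));
    }
}
```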


> Refactor ContentExtractor to improve extraction flexibility
> ---
>
> Key: ANY23-280
> URL: https://issues.apache.org/jira/browse/ANY23-280
> Project: Apache Any23
> Issue Type: Improvement
> Components: core, extractors
> Affects Versions: 1.1
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 2.1
>
>
> As discussed on ANY23-247, the 
> [ContentExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44]
>  is simply not fit for purpose. This issue was discovered and the cause has 
> plagued our builds ever since. Any extractors which implement 
> [BaseRDFExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java]
>  are based on the Extractor.ContentExtractor and hence work off of an 
> 'unfixed' raw data stream as opposed to a more flexible model such as the 
> [TagSoupDOMExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60].
> This issue should refactor RDF extractors to enable more flexibility and to 
> avoid issues we encounter with the strict SAX parsing logic.
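
The fragility of strict SAX parsing described in the issue can be reproduced with the JDK's own parser. The sketch below is only an illustration of the failure mode, under the assumption that typical web markup contains unclosed tags; it does not use TagSoup or any Any23 code, and `StrictParseDemo` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class StrictParseDemo {

    // Returns true when the markup is well-formed XML, false when the
    // strict SAX parser rejects it with a fatal error.
    static boolean parsesStrictly(String markup) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler()); // DefaultHandler rethrows fatal errors
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. an unclosed element
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed input is accepted.
        System.out.println(parsesStrictly("<p>line one<br/>line two</p>"));
        // HTML-style unclosed <br> is rejected by the strict parser.
        System.out.println(parsesStrictly("<p>line one<br>line two</p>"));
    }
}
```

An extractor fed from a raw stream inherits this all-or-nothing behaviour, whereas one working on an already repaired DOM does not see the malformed input at all.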





[GitHub] any23 pull request #24: Initial move towards addressing ANY23-280 Refactor C...

2017-08-23 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/any23/pull/24#discussion_r134831365
  
--- Diff: api/src/main/java/org/apache/any23/extractor/Extractor.java ---
@@ -39,22 +38,6 @@
 
 /**
  * This interface specializes an {@link Extractor} able to handle
- * {@link java.io.InputStream} as input format.
- */
-public interface ContentExtractor extends Extractor {
--- End diff --

@jgrzebyta yes this is correct... we do not always wish to assume that the 
input is structured in XML or a subset thereof... syntax-strict extractors are 
prone to breakage. Our aim in Any23 should be to provide flexibility in the 
extraction logic rather than a strict, fragile extraction logic.




[jira] [Commented] (ANY23-280) Refactor ContentExtractor to improve extraction flexibility

2017-08-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138292#comment-16138292
 ] 

ASF GitHub Bot commented on ANY23-280:
--

Github user jgrzebyta commented on a diff in the pull request:

https://github.com/apache/any23/pull/24#discussion_r134738201
  
--- Diff: api/src/main/java/org/apache/any23/extractor/Extractor.java ---
@@ -39,22 +38,6 @@
 
 /**
  * This interface specializes an {@link Extractor} able to handle
- * {@link java.io.InputStream} as input format.
- */
-public interface ContentExtractor extends Extractor {
--- End diff --

@lewismc Why are you removing `ContentExtractor`? I assume that if the 
content is neither HTML nor XML, the developer should create a new extractor 
extending `Extractor` directly. Am I right? 




[GitHub] any23 pull request #24: Initial move towards addressing ANY23-280 Refactor C...

2017-08-23 Thread jgrzebyta
Github user jgrzebyta commented on a diff in the pull request:

https://github.com/apache/any23/pull/24#discussion_r134738201
  
--- Diff: api/src/main/java/org/apache/any23/extractor/Extractor.java ---
@@ -39,22 +38,6 @@
 
 /**
  * This interface specializes an {@link Extractor} able to handle
- * {@link java.io.InputStream} as input format.
- */
-public interface ContentExtractor extends Extractor {
--- End diff --

@lewismc Why are you removing `ContentExtractor`? I assume that if the 
content is neither HTML nor XML, the developer should create a new extractor 
extending `Extractor` directly. Am I right? 

