[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-02-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357252#comment-16357252
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

ferrerod commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-364184599
 
 
   Thank you Lewis, I just sent an email: 
   https://www.mail-archive.com/user@nutch.apache.org/msg15974.html


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-02-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357245#comment-16357245
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

ferrerod commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-364184599
 
 
   Thank you Lewis, I just sent an email.




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-02-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356630#comment-16356630
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-364034381
 
 
   Hi @ferrerod this is a good question and I would like to answer it on the 
mailing list. Please ask it on user@
   http://nutch.apache.org/mailing_lists.html




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356551#comment-16356551
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

ferrerod commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-364018728
 
 
   @nmaro @lewismc I'm new to Nutch and stumbled upon this merged PR after 
realizing I wanted much better crawl results for sites that use schema.org. 
This is likely the wrong place to pose my questions, but it's my first 
exposure to Any23 within Nutch 1.x. Please do redirect me to a better place to 
post questions. I may not even be asking the right questions, but here goes:
   
   Q: How do I enable the Any23 microdata parsing / indexing capabilities 
introduced with this PR? Do I replace `parse-(html|tika)|index-(basic|anchor)` 
in `plugin.includes` with something like
   `parse-(html|tika|any23)|index-(basic|anchor|any23)`?
   
   Q: How do I expose the discovered microdata items to an end-user system such 
as Solr? For example, what are the microdata items, and how should I map them 
to Solr in `solrindex-mapping.xml`?
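
   A minimal sketch of how the two candidate `plugin.includes` values in the 
question behave as regular expressions. It rests on two assumptions that this 
thread does not confirm: that the plugin's id is `any23` (inferred only from 
the directory name src/plugin/any23), and that Nutch tests each plugin id 
against `plugin.includes` as a whole-string match. `PluginIncludesSketch` and 
`isIncluded` are hypothetical names used for illustration; they are not Nutch 
APIs.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks whether a plugin id would be
// selected by a plugin.includes-style regular expression.
class PluginIncludesSketch {

  // Assumption: the whole plugin id must match the expression.
  static boolean isIncluded(String pluginIncludes, String pluginId) {
    return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
  }

  public static void main(String[] args) {
    // The value proposed in the question above.
    String proposed = "parse-(html|tika|any23)|index-(basic|anchor|any23)";
    // An alternative that lists the assumed plugin id "any23" on its own.
    String alternative = "parse-(html|tika)|index-(basic|anchor)|any23";

    // "any23" is neither "parse-any23" nor "index-any23", so this is false.
    System.out.println(isIncluded(proposed, "any23"));
    // Here "any23" is its own alternative, so this is true.
    System.out.println(isIncluded(alternative, "any23"));
  }
}
```

   If both assumptions hold, the proposed value would select plugins named 
`parse-any23` and `index-any23` rather than `any23`. The Solr field mapping is 
a separate question and is not covered by this sketch.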




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356523#comment-16356523
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-364011897
 
 
   Hi @ferrerod thank you for posting. I think this merely has to do with the 
Javadoc links not being available
   https://github.com/apache/nutch/blob/master/default.properties#L43-L51
   If you are able to fix it, then please by all means open a PR and we can 
review. Thank you




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356509#comment-16356509
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

ferrerod commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-364007152
 
 
   On a Mac with JDK 8 installed, I ran into a failure in the javadoc task 
complaining about the Java version. On closer inspection, the failure 
condition was tripping on ant.java.version being equal to 1.6: running ant -v 
reported that the JRE in my JAVA_HOME (JDK 8) is 1.6! Super strange...
   
   I removed the ant.java.version checks in the javadoc task and reran...
   
   ant zip-bin with Java 8 finished successfully! However, the reason I'm 
posting here is that I noticed 19 errors and 106 warnings in the javadoc task. 
Here are the first few errors it encountered:
   
 [javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:30: error: package org.apache.any23 does not exist
 [javadoc] import org.apache.any23.Any23;
 [javadoc]^
 [javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:31: error: package org.apache.any23.extractor does not exist
 [javadoc] import org.apache.any23.extractor.ExtractionException;
 [javadoc]  ^
 [javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:32: error: package org.apache.any23.writer does not exist
 [javadoc] import org.apache.any23.writer.BenchmarkTripleHandler;
 [javadoc]   ^
 [javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:33: error: package org.apache.any23.writer does not exist
 [javadoc] import org.apache.any23.writer.NTriplesWriter;
 [javadoc]   ^
 [javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:34: error: package org.apache.any23.writer does not exist
 [javadoc] import org.apache.any23.writer.TripleHandler;
 [javadoc]   ^
 [javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:35: error: package org.apache.any23.writer does not exist
 [javadoc] import org.apache.any23.writer.TripleHandlerException;
 [javadoc]   ^
 [javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:43: error: package org.ccil.cowan.tagsoup does not exist
 [javadoc] import org.ccil.cowan.tagsoup.XMLWriter;
 [javadoc]  ^
 [javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:44: error: package org.ccil.cowan.tagsoup.jaxp does not exist
 [javadoc] import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;
 [javadoc]   ^
 [javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:87: error: cannot find symbol
 [javadoc] Any23Parser(String url, String htmlContent, String contentType, String... extractorNames) throws TripleHandlerException {
   




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323068#comment-16323068
 ] 

Hudson commented on NUTCH-1129:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3495 (See [https://builds.apache.org/job/Nutch-trunk/3495/])
Add patch from NUTCH-1129 (thilo: [https://github.com/apache/nutch/commit/6a9d0823757f2ffec86e1d32d789f5d00fa04667])
* (add) src/plugin/any23/build-ivy.xml
* (add) src/plugin/any23/build.xml
* (add) src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
* (add) src/plugin/any23/src/java/org/apache/nutch/any23/package-info.java
* (edit) default.properties
* (edit) src/plugin/parse-tika/howto_upgrade_tika.txt
* (add) src/plugin/any23/sample/BBC_News_Scotland.html
* (edit) src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag/RelTagIndexingFilter.java
* (add) src/plugin/any23/src/java/org/apache/nutch/any23/Any23IndexingFilter.java
* (add) src/plugin/any23/howto_upgrade_any23.txt
* (edit) ivy/ivysettings.xml
* (add) src/plugin/any23/sample/microdata_basic.html
* (edit) src/plugin/build.xml
* (add) src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23ParseFilter.java
* (add) src/plugin/any23/plugin.xml
* (add) src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23IndexingFilter.java
* (add) src/plugin/any23/ivy.xml





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-11 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323026#comment-16323026
 ] 

Moreno Feltscher commented on NUTCH-1129:
-

[~lewismc]: Thanks for merging! A special thank you goes out to my amazing 
co-workers who did a great job on this :-) cc [~thilohaas]



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322963#comment-16322963
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r161078442
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,175 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.*;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.Collections;
+
+import org.apache.any23.Any23;
+import org.apache.any23.extractor.ExtractionException;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.protocol.Content;
+import org.ccil.cowan.tagsoup.XMLWriter;
+import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.xml.sax.InputSource;
+import org.xml.sax.SAXException;
+import org.xml.sax.XMLReader;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. The supported formats can be found at <a href="http://any23.apache.org">Apache Any23</a>.
+ * In this implementation triples are written as Notation3 e.g.
+ * "2014/03/31 13:53:03"@en-gb .
+ * and triples are identified within output triple streams by the presence of '\n'.
+ * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
+ * In order to use another/other writers implementing the
+ * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
+ * interface, we will most likely need to identify an alternative data characteristic
+ * which we can use to split triples streams.
+ * 
+ */
+public class Any23ParseFilter implements HtmlParseFilter {
+
+  /** Logging instance */
+  public static final Logger LOG = LoggerFactory.getLogger(Any23ParseFilter.class);
+
+  private Configuration conf = null;
+
+  /**
+   * Constant identifier used as a Key for writing and reading
+   * triples to and from the metadata Map field.
+   */
+  public final static String ANY23_TRIPLES = "Any23-Triples";
+
+  public static final String ANY_23_EXTRACTORS_CONF = "any23.extractors";
+
+  private static class Any23Parser {
+
+    Set triples = null;
+
+    Any23Parser(String url, String htmlContent, String... extractorNames) throws TripleHandlerException {
+      triples = new TreeSet();
+      try {
+        parse(url, htmlContent, extractorNames);
+      } catch (URISyntaxException e) {
+        throw new RuntimeException(e.getReason());
+      } catch (IOException e) {
+        e.printStackTrace();
+      }
+    }
+
+    /**
+     * Maintains a {@link java.util.Set} containing the triples
+     * @return a {@link java.util.Set} of triples.
+     */
+    private Set getTriples() {
+      return triples;
+    }
+
+    private void parse(String url, String htmlContent, String... extractorNames) throws URISyntaxException, IOException, TripleHandlerException {
+      Any23 any23 = new Any23(extractorNames);
+      any23.setMIMETypeDetector(null);
+
+      try {
+        // Fix input to avoid extraction error (https://github.com/semarglproject/semargl/issues/37#issuecomment-69381281)
 
 Review comment:
   I will happ
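
The class Javadoc in the diff above says that triples are identified in the 
output stream by the presence of '\n'. A minimal sketch of that splitting step 
follows, assuming the writer emits one triple per line. `TripleSplitSketch` and 
`splitTriples` are hypothetical names; this is not the code from the pull 
request, and the example.org triples are made-up sample data.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of newline-based triple splitting; not the
// implementation from the pull request.
class TripleSplitSketch {

  // Splits a newline-delimited triple stream into a sorted set without
  // duplicates, skipping blank lines.
  static Set<String> splitTriples(String tripleStream) {
    Set<String> triples = new TreeSet<>();
    for (String line : tripleStream.split("\n")) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) {
        triples.add(trimmed);
      }
    }
    return triples;
  }

  public static void main(String[] args) {
    // Made-up sample data: two distinct triples, one of them repeated.
    String stream =
        "<http://example.org/s> <http://example.org/p> \"b\" .\n"
      + "<http://example.org/s> <http://example.org/p> \"a\" .\n"
      + "<http://example.org/s> <http://example.org/p> \"b\" .\n";
    for (String triple : splitTriples(stream)) {
      System.out.println(triple);
    }
  }
}
```

A sorted set mirrors the TreeSet used in the diff and drops repeated triples. 
As the Javadoc notes, a serialization that does not put one triple on each line 
would need a different splitting rule.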

[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321927#comment-16321927
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

nmaro commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160902020
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,175 @@
+  public static final String ANY_23_EXTRACTORS_CONF = "any23.extractors";
 
 Review comment:
   Done




> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project 

[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321928#comment-16321928
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

nmaro commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-356877048
 
 
   @lewismc all comments addressed.




> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 





[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321925#comment-16321925
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

nmaro commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160901970
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,175 @@
+  try {
+// Fix input to avoid extraction error 
(https://github.com/semarglproject/semargl/issues/37#issuecomment-69381281)
 
 Review comment:
   I'm going to 

[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320754#comment-16320754
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-356688205
 
 
   Thank you @mfeltscher 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320750#comment-16320750
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

mfeltscher commented on a change in pull request #205: WIP: NUTCH-1129 
microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160756114
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -15,26 +15,32 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-# Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] [-s ] [-sm 
]  
-#-i|--index  Indexes crawl results into a configured indexer
-#-w|--wait   NUMBER[SUFFIX] Time to wait before generating a new 
segment when no URLs
-#are scheduled for fetching. Suffix can be: s for second,
-#m for minute, h for hour and d for day. If no suffix is
-#specified second is used by default.
-#-D  A Java property to pass to Nutch calls
-#-s  Path to seeds file(s)
-#-sm Path to sitemap URL file(s)
-#CrawlDirDirectory where the crawl/link/segments dirs are saved
-#NumRounds   The number of rounds to run this crawl for
+# Usage: crawl [options]  
 
 Review comment:
   @lewismc I went ahead and merged master back into this feature branch so 
this is solved - see 1e2c84882b9d218c88752a93c4adaf167ae9355d




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320566#comment-16320566
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160729605
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -15,26 +15,32 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-# Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] [-s ] [-sm 
]  
-#-i|--index  Indexes crawl results into a configured indexer
-#-w|--wait   NUMBER[SUFFIX] Time to wait before generating a new 
segment when no URLs
-#are scheduled for fetching. Suffix can be: s for second,
-#m for minute, h for hour and d for day. If no suffix is
-#specified second is used by default.
-#-D  A Java property to pass to Nutch calls
-#-s  Path to seeds file(s)
-#-sm Path to sitemap URL file(s)
-#CrawlDirDirectory where the crawl/link/segments dirs are saved
-#NumRounds   The number of rounds to run this crawl for
+# Usage: crawl [options]  
 
 Review comment:
   Unfortunately, I don't think this content should be included in this patch.




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320568#comment-16320568
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160729887
 
 

 ##
 File path: src/plugin/any23/ivy.xml
 ##
 @@ -0,0 +1,51 @@
+
+
+
+
+
+  
+
+http://nutch.apache.org"/>
+
+Apache Nutch
+
+  
+
+  
+
+  
+
+  
+
+
+  
+
+  
+
+
 
 Review comment:
   Why are both commons-rdf-api and owlapi included here?




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320569#comment-16320569
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160730238
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,175 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.*;
 
 Review comment:
   Please use explicit imports




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320562#comment-16320562
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160729357
 
 

 ##
 File path: ivy/ivysettings.xml
 ##
 @@ -64,6 +82,8 @@
   
   
   
+  
+  
 
 Review comment:
   
   
   This can be removed... is there any compelling reason to have it included?
   




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320570#comment-16320570
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160731346
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,175 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.*;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.Collections;
+
+import org.apache.any23.Any23;
+import org.apache.any23.extractor.ExtractionException;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.protocol.Content;
+import org.ccil.cowan.tagsoup.XMLWriter;
+import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.xml.sax.InputSource;
+import org.xml.sax.SAXException;
+import org.xml.sax.XMLReader;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. The supported formats can be found at <a href="http://any23.apache.org">Apache Any23</a>.
+ * In this implementation triples are written as Notation3 e.g.
+ *  "2014/03/31 13:53:03"@en-gb .
+ * and triples are identified within output triple streams by the presence of '\n'.
+ * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
+ * In order to use another/other writers implementing the
+ * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
+ * interface, we will most likely need to identify an alternative data characteristic
+ * which we can use to split triples streams.
+ * 
+ */
+public class Any23ParseFilter implements HtmlParseFilter {
+
+  /** Logging instance */
+  public static final Logger LOG = LoggerFactory.getLogger(Any23ParseFilter.class);
+
+  private Configuration conf = null;
+
+  /**
+   * Constant identifier used as a Key for writing and reading
+   * triples to and from the metadata Map field.
+   */
+  public final static String ANY23_TRIPLES = "Any23-Triples";
+
+  public static final String ANY_23_EXTRACTORS_CONF = "any23.extractors";
 
 Review comment:
   If we are going to introduce a new option, we need to actually include it in `nutch-default.xml`.
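
For illustration only (this is not code from the pull request): a minimal sketch of how a comma-separated `any23.extractors` value could be turned into the `String...` extractor names that `Any23Parser` accepts. The class name `ExtractorConfigSketch`, the method `parseExtractors`, the comma-separated format, and the fallback value `html-microdata` are all assumptions; the thread does not say which default would go into `nutch-default.xml`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the pull request. */
class ExtractorConfigSketch {

  /** Assumed fallback when the property is unset or blank. */
  static final String DEFAULT_EXTRACTOR = "html-microdata";

  /**
   * Splits a raw property value on commas, trims each name and drops
   * empty entries. Falls back to the assumed default if nothing remains.
   */
  static String[] parseExtractors(String rawValue) {
    List<String> names = new ArrayList<>();
    if (rawValue != null) {
      for (String part : rawValue.split(",")) {
        String name = part.trim();
        if (!name.isEmpty()) {
          names.add(name);
        }
      }
    }
    if (names.isEmpty()) {
      return new String[] { DEFAULT_EXTRACTOR };
    }
    return names.toArray(new String[0]);
  }

  public static void main(String[] args) {
    // Example values only; the real property value is whatever the site config defines.
    System.out.println(String.join("|", parseExtractors(" html-microdata , html-rdfa11 ")));
    System.out.println(String.join("|", parseExtractors(null)));
  }
}
```

Documenting the option in `nutch-default.xml` would give users one place to see both the property name and whatever default the project settles on, instead of having to read a fallback out of the plugin code.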



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320565#comment-16320565
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160729250
 
 

 ##
 File path: ivy/ivysettings.xml
 ##
 @@ -34,6 +34,12 @@
   https://repository.apache.org/content/repositories/snapshots/";
 override="false"/>
+  http://svn.apache.org/repos/asf/any23/repo-ext/";
+override="false"/>


[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320567#comment-16320567
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160730852
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,175 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.*;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.Collections;
+
+import org.apache.any23.Any23;
+import org.apache.any23.extractor.ExtractionException;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.protocol.Content;
+import org.ccil.cowan.tagsoup.XMLWriter;
+import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.xml.sax.InputSource;
+import org.xml.sax.SAXException;
+import org.xml.sax.XMLReader;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. The supported formats can be found at <a href="http://any23.apache.org">Apache Any23</a>.
+ * In this implementation triples are written as Notation3 e.g.
+ *  "2014/03/31 13:53:03"@en-gb .
+ * and triples are identified within output triple streams by the presence of '\n'.
+ * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
+ * In order to use another/other writers implementing the
+ * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
+ * interface, we will most likely need to identify an alternative data characteristic
+ * which we can use to split triples streams.
+ * 
+ */
+public class Any23ParseFilter implements HtmlParseFilter {
+
+  /** Logging instance */
+  public static final Logger LOG = LoggerFactory.getLogger(Any23ParseFilter.class);
+
+  private Configuration conf = null;
+
+  /**
+   * Constant identifier used as a Key for writing and reading
+   * triples to and from the metadata Map field.
+   */
+  public final static String ANY23_TRIPLES = "Any23-Triples";
+
+  public static final String ANY_23_EXTRACTORS_CONF = "any23.extractors";
+
+  private static class Any23Parser {
+
+    Set<String> triples = null;
+
+    Any23Parser(String url, String htmlContent, String... extractorNames) throws TripleHandlerException {
+      triples = new TreeSet<String>();
+      try {
+        parse(url, htmlContent, extractorNames);
+      } catch (URISyntaxException e) {
+        throw new RuntimeException(e.getReason());
+      } catch (IOException e) {
+        e.printStackTrace();
+      }
+    }
+
+    /**
+     * Maintains a {@link java.util.Set} containing the triples
+     * @return a {@link java.util.Set} of triples.
+     */
+    private Set<String> getTriples() {
+      return triples;
+    }
+
+    private void parse(String url, String htmlContent, String... extractorNames) throws URISyntaxException, IOException, TripleHandlerException {
+      Any23 any23 = new Any23(extractorNames);
+      any23.setMIMETypeDetector(null);
+
+      try {
+        // Fix input to avoid extraction error (https://github.com/semarglproject/semargl/issues/37#issuecomment-69381281)
 
 Review comment:
   Ideally thi

[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320564#comment-16320564
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160729309
 
 

 ##
 File path: ivy/ivysettings.xml
 ##
 @@ -47,6 +53,18 @@
   pattern="${maven2.pattern.ext}"
   m2compatible="true"
   />


[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320563#comment-16320563
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160729392
 
 

 ##
 File path: ivy/ivysettings.xml
 ##
 @@ -75,6 +95,8 @@
   
   
   
+  
+  
 
 Review comment:
   
   
   This can be removed... is there any compelling reason to have it included?
   




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320057#comment-16320057
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

nmaro commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-356566774
 
 
   @lewismc Requested changes done - please note that
   
   * I had to extend the elastic http plugin to handle lists of Map objects 
that it previously just stringified
   * Any23 couldn't detect as many triples as you expected in your tests, had 
to lower the number - but it's good enough for us for now, people can still 
expand the any23 scope if they find out what the problem is
   * Data is now indexed as follows (example after crawling 
`https://smartive.ch/jobs`):
   
   ```
   "structured_data": [
     {
       "node": "",
       "value": "\"IE-edge,chrome=1\"@de",
       "key": "",
       "short_key": "X-UA-Compatible"
     },
     {
       "node": "",
       "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
       "key": "",
       "short_key": "description"
     },
     {
       "node": "",
       "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
       "key": "",
       "short_key": "viewport"
     },
     {
       "node": "",
       "value": "\"width=device-width,initial-scale=1\"@de",
       "key": "",
       "short_key": "viewport"
     },
     {
       "node": "",
       "value": "\"ie=edge\"@de",
       "key": "",
       "short_key": "x-ua-compatible"
     }
   ],
   ```
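
As a rough illustration of how one `structured_data` entry relates to a triple: the sketch below splits N-Triples-style output on '\n' (the split character described in the plugin's Javadoc) and derives `short_key` from the last segment of the predicate URI. The class `TripleIndexSketch`, the example URIs in its `main` method, and the `short_key` rule are assumptions inferred from the output above; this is not the code in the pull request, and it keeps the angle brackets on `node` and `key`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not the indexing code from the pull request. */
class TripleIndexSketch {

  /** Turns newline-separated triples into one map per triple. */
  static List<Map<String, String>> toStructuredData(String nTriples) {
    List<Map<String, String>> result = new ArrayList<>();
    for (String rawLine : nTriples.split("\n")) {
      String line = rawLine.trim();
      if (line.isEmpty()) {
        continue;
      }
      // Subject and predicate contain no spaces; the object literal may.
      String[] parts = line.split(" ", 3);
      if (parts.length < 3) {
        continue; // not a complete triple
      }
      String object = parts[2];
      if (object.endsWith(".")) {
        // Drop the statement terminator, keep dots inside the literal.
        object = object.substring(0, object.length() - 1).trim();
      }
      Map<String, String> entry = new LinkedHashMap<>();
      entry.put("node", parts[0]);
      entry.put("value", object);
      entry.put("key", parts[1]);
      entry.put("short_key", shortKey(parts[1]));
      result.add(entry);
    }
    return result;
  }

  /** Assumed rule: text after the last '#' or '/' of the predicate URI. */
  static String shortKey(String predicate) {
    String uri = predicate.replaceAll("^<|>$", "");
    int cut = Math.max(uri.lastIndexOf('#'), uri.lastIndexOf('/'));
    return cut >= 0 ? uri.substring(cut + 1) : uri;
  }

  public static void main(String[] args) {
    // Made-up URIs; the real ones are not visible in the output above.
    String sample = "<http://example.org/page> <http://example.org/vocab#viewport> "
        + "\"width=device-width,initial-scale=1\"@de .\n";
    System.out.println(toStructuredData(sample));
  }
}
```

Splitting on '\n' only works for line-oriented serializations such as the one the plugin currently writes; as the Javadoc notes, another `TripleHandler` writer would need a different way to separate triples.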
 




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320056#comment-16320056
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

nmaro commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-356566774
 
 
   @lewismc Requested changes done - please note that
   
   * I had to extend the elastic http plugin to handle lists of Map objects 
that it previously just stringified
   * Any23 couldn't detect as many triples as you expected in your tests, had 
to lower the number - but it's good enough for now, people can still expand the 
any23 scope if they find out what the problem is
   * Data is now indexed as follows (example after crawling 
`https://smartive.ch/jobs`):
   
   ```
   "structured_data": [
     {
       "node": "",
       "value": "\"IE-edge,chrome=1\"@de",
       "key": "",
       "short_key": "X-UA-Compatible"
     },
     {
       "node": "",
       "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
       "key": "",
       "short_key": "description"
     },
     {
       "node": "",
       "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
       "key": "",
       "short_key": "viewport"
     },
     {
       "node": "",
       "value": "\"width=device-width,initial-scale=1\"@de",
       "key": "",
       "short_key": "viewport"
     },
     {
       "node": "",
       "value": "\"ie=edge\"@de",
       "key": "",
       "short_key": "x-ua-compatible"
     }
   ],
   ```




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317037#comment-16317037
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-356091293
 
 
   @nmaro yes please submit a PR, thanks




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316547#comment-16316547
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

nmaro commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-356013970
 
 
   @lewismc ok. I added an ant target that runs the tests of a single plugin, 
e.g. `ant -Dplugin=any23 test-plugin` (it works). Can I check that in?




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316533#comment-16316533
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-356010006
 
 
   @nmaro I run them in Eclipse. By the way, I have been working hard on Any23 
to improve microdata extraction and a bunch of other stuff. We will be 
releasing Any23 2.2 reasonably soon, so we can make the Any23 upgrade here in 
Nutch as well.




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316504#comment-16316504
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

nmaro commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-356003572
 
 
   @lewismc quick question: what is your favorite way to run single plugin 
tests?
   
   We are trying with
   `ant compile-core-test && ant runtime && ./runtime/local/bin/nutch junit org.apache.nutch.parse.metatags.TestAny23ParseFilter`
   but we are encountering some problems. Is this the recommended way?




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316271#comment-16316271
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-355961648
 
 
   Excellent @nmaro 




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316266#comment-16316266
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

nmaro commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160138730
 
 

 ##
 File path: 
src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23IndexingFilter.java
 ##
 @@ -0,0 +1,54 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import static org.junit.Assert.*;
+
+import java.nio.ByteBuffer;
+
+import org.apache.avro.util.Utf8;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.storage.WebPage;
 
 Review comment:
   I'm going to adapt them




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316225#comment-16316225
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

nmaro commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160133797
 
 

 ##
 File path: 
src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23IndexingFilter.java
 ##
 @@ -0,0 +1,54 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import static org.junit.Assert.*;
+
+import java.nio.ByteBuffer;
+
+import org.apache.avro.util.Utf8;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.storage.WebPage;
 
 Review comment:
   Should I just remove these 2 tests? Or adapt them for 1.X?




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316223#comment-16316223
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

nmaro commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r160133648
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,165 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.*;
+import org.apache.nutch.protocol.Content;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. Currently it supports the following
+ * input formats:
+ * RDF/XML, Turtle, Notation 3
+ * RDFa with RDFa1.1 prefix mechanism
+ * Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, hReview,
+ * License, XFN and Species
+ * HTML5 Microdata: (such as Schema.org)
+ * CSV: Comma Separated Values with separator autodetection..
+ * In this implementation triples are written as Notation3 e.g.
+ *  
 "2014/03/31 
13:53:03"@en-gb .
+ * and triples are identified within output triple streams by the presence of '\n'.
+ * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
+ * In order to use another/other writers implementing the
+ * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
+ * interface, we will most likely need to identify an alternative data characteristic
+ * which we can use to split triples streams.
+ * 
+ *
+ */
+public class Any23ParseFilter implements HtmlParseFilter {
+
+  /** Logging instance */
+  public static final Logger LOG = 
LoggerFactory.getLogger(Any23ParseFilter.class);
+
+  private Configuration conf = null;
+
+  /** Constant identifier used as a Key for writing and reading
+   * triples to and from the metadata Map field.
+   */
+  public final static String ANY23_TRIPLES = "Any23-Triples";
+
+  private static class Any23Parser {
+
+Set triples = null;
+
+Any23Parser(String url, String htmlContent) throws TripleHandlerException {
+  triples = new TreeSet();
+  try {
+parse(url, htmlContent);
+  } catch (URISyntaxException e) {
+throw new RuntimeException(e.getReason());
+  } catch (IOException e) {
+e.printStackTrace();
+  }
+}
+
+/**
+ * Maintains a {@link java.util.Set} containing the triples
+ * @return a {@link java.util.Set} of triples.
+ */
+public Set getTriples() {
+  return triples;
+}
+
+private void parse(String url, String htmlContent) throws 
URISyntaxException, IOException, TripleHandlerException {
+  Any23 any23 = new Any23();
+  ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  TripleHandler tHandler = new NTriplesWriter(baos);
+  BenchmarkTripleHandler bHandler = new BenchmarkTripleHandler(tHandler);
+  try {
+any23.extract(htmlContent, url, "text/HTML", "UTF-8", bHandler);
+  } catch (Exception e) {
+e.printStackTrace();
+  } finally {
+tHandler.close();
+bHandler.close();
+  }
+  //This merely prints out a report of the Any23 extraction.
+  LOG.info("Any23 

[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312710#comment-16312710
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

dwirz commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r159827833
 
 

 ##
 File path: src/plugin/parse-tika/ivy.xml
 ##
 @@ -44,6 +44,10 @@
   
   
   
+  
 
 Review comment:
   d0329a5




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312711#comment-16312711
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

dwirz commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r159827859
 
 

 ##
 File path: ivy/ivy.xml
 ##
 @@ -66,6 +66,13 @@

 

+   
 
 Review comment:
   f02fb23




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16310219#comment-16310219
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r159516036
 
 

 ##
 File path: src/plugin/parse-tika/ivy.xml
 ##
 @@ -44,6 +44,10 @@
   
   
   
+  
 
 Review comment:
   Why are these being excluded?
 




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16310220#comment-16310220
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r159515689
 
 

 ##
 File path: ivy/ivy.xml
 ##
 @@ -66,6 +66,13 @@

 

+   
 
 Review comment:
   The dependencies need to go into the plugin ivy.xml. They do not belong here.




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303889#comment-16303889
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r158713724
 
 

 ##
 File path: 
src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23IndexingFilter.java
 ##
 @@ -0,0 +1,54 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import static org.junit.Assert.*;
+
+import java.nio.ByteBuffer;
+
+import org.apache.avro.util.Utf8;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.storage.WebPage;
 
 Review comment:
   None of these tests are compatible with the master branch. This must have 
been forked from a previous patch I produced for Nutch 2.x. I'm going to go 
ahead and fix this just now.




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303883#comment-16303883
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r158712206
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,165 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.*;
+import org.apache.nutch.protocol.Content;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. Currently it supports the following
+ * input formats:
+ * RDF/XML, Turtle, Notation 3
+ * RDFa with RDFa1.1 prefix mechanism
+ * Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, hReview,
+ * License, XFN and Species
+ * HTML5 Microdata: (such as Schema.org)
+ * CSV: Comma Separated Values with separator autodetection..
+ * In this implementation triples are written as Notation3 e.g.
+ *  
 "2014/03/31 
13:53:03"@en-gb .
+ * and triples are identified within output triple streams by the presence of '\n'.
+ * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
+ * In order to use another/other writers implementing the
+ * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
+ * interface, we will most likely need to identify an alternative data characteristic
+ * which we can use to split triples streams.
+ * 
+ *
+ */
+public class Any23ParseFilter implements HtmlParseFilter {
+
+  /** Logging instance */
+  public static final Logger LOG = 
LoggerFactory.getLogger(Any23ParseFilter.class);
+
+  private Configuration conf = null;
+
+  /** Constant identifier used as a Key for writing and reading
+   * triples to and from the metadata Map field.
+   */
+  public final static String ANY23_TRIPLES = "Any23-Triples";
+
+  private static class Any23Parser {
+
+Set triples = null;
+
+Any23Parser(String url, String htmlContent) throws TripleHandlerException {
+  triples = new TreeSet();
+  try {
+parse(url, htmlContent);
+  } catch (URISyntaxException e) {
+throw new RuntimeException(e.getReason());
+  } catch (IOException e) {
+e.printStackTrace();
+  }
+}
+
+/**
+ * Maintains a {@link java.util.Set} containing the triples
+ * @return a {@link java.util.Set} of triples.
+ */
+public Set getTriples() {
+  return triples;
+}
+
+private void parse(String url, String htmlContent) throws 
URISyntaxException, IOException, TripleHandlerException {
+  Any23 any23 = new Any23();
+  ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  TripleHandler tHandler = new NTriplesWriter(baos);
 
 Review comment:
   Ignore the comment. 
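
   The Javadoc in the diff above says triples are recognised in the writer output by the presence of '\n', and the code collects them in a `TreeSet` after writing through `NTriplesWriter` into a `ByteArrayOutputStream`. A minimal sketch of that splitting step, using only the JDK, might look like the following. `TripleSplitter` and its `split` method are hypothetical names chosen for illustration and are not taken from the patch; skipping blank lines and `#` comment lines is an assumption added here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of the patch under review. */
final class TripleSplitter {

  /**
   * Splits line-oriented triple output (one triple per line) into a
   * sorted, de-duplicated set of triple strings.
   */
  static Set<String> split(byte[] writerOutput) {
    Set<String> triples = new TreeSet<>();
    String text = new String(writerOutput, StandardCharsets.UTF_8);
    for (String line : text.split("\n")) {
      String triple = line.trim();
      // Assumption: blank lines and '#' comment lines carry no triple.
      if (!triple.isEmpty() && !triple.startsWith("#")) {
        triples.add(triple);
      }
    }
    return triples;
  }
}
```

   With the stream from the diff, a call would be `TripleSplitter.split(baos.toByteArray())`. As the Javadoc already notes, this only works for serializations that put one triple on each line; a writer that spreads a triple over several lines would need a different delimiter.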



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302480#comment-16302480
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

kamaci commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-353730775
 
 
   @hostingnuggets master branch is 1.x




> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302355#comment-16302355
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

hostingnuggets commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-353720441
 
 
   @lewismc You mention that this PR will get merged into the master branch of 
this repo. Which version of Nutch is on the master branch: 1.x or 2.x?




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299036#comment-16299036
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r158127564
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,155 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.Collections;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.protocol.Content;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. The supported formats can be found at <a href="http://any23.apache.org">Apache Any23</a>.
+ * In this implementation triples are written as Notation3 e.g.
+ * "2014/03/31 13:53:03"@en-gb .
+ * and triples are identified within output triple streams by the presence of '\n'.
+ * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
+ * In order to use another/other writers implementing the
+ * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
+ * interface, we will most likely need to identify an alternative data characteristic
+ * which we can use to split triples streams.
+ * 
+ *
+ */
+public class Any23ParseFilter implements HtmlParseFilter {
+
+  /** Logging instance */
+  public static final Logger LOG = LoggerFactory.getLogger(Any23ParseFilter.class);
+
+  private Configuration conf = null;
+
+  /** Constant identifier used as a Key for writing and reading
+   * triples to and from the metadata Map field.
+   */
+  private final static String ANY23_TRIPLES = "Any23-Triples";
 
 Review comment:
   This needs to be changed from private to a wider visibility (for example 
public) if it is to be used in later indexing filter tasks.
   Currently, declaring it private results in a compiler error.
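
The visibility problem raised in this review comment can be illustrated with a minimal, self-contained sketch. `ParseSide` and `IndexSide` are hypothetical stand-ins for the parse filter and a later indexing filter, and a plain `Map` stands in for the Nutch metadata container; only the constant name and its value are taken from the diff.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the parse filter: it owns the metadata key.
final class ParseSide {
  // The key has to be visible outside this class. If it were declared
  // 'private', the reference in IndexSide below would not compile.
  public static final String ANY23_TRIPLES = "Any23-Triples";

  // Stores the extracted triples under the shared key.
  static Map<String, String> parse(String triples) {
    Map<String, String> metadata = new HashMap<>();
    metadata.put(ANY23_TRIPLES, triples);
    return metadata;
  }
}

// Hypothetical stand-in for a later indexing step living in another class.
final class IndexSide {
  // Reads the triples back using the key exported by ParseSide.
  static String readTriples(Map<String, String> metadata) {
    return metadata.get(ParseSide.ANY23_TRIPLES);
  }
}
```

Package-private visibility would also be enough if both classes stayed in the same package; the sketch uses public because that makes no assumption about where the indexing code lives.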



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299035#comment-16299035
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r158127564
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,155 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.Collections;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.protocol.Content;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. The supported formats can be found at <a href="http://any23.apache.org">Apache Any23</a>.
+ * In this implementation triples are written as Notation3 e.g.
+ * "2014/03/31 13:53:03"@en-gb .
+ * and triples are identified within output triple streams by the presence of '\n'.
+ * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
+ * In order to use another/other writers implementing the
+ * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
+ * interface, we will most likely need to identify an alternative data characteristic
+ * which we can use to split triples streams.
+ * 
+ *
+ */
+public class Any23ParseFilter implements HtmlParseFilter {
+
+  /** Logging instance */
+  public static final Logger LOG = LoggerFactory.getLogger(Any23ParseFilter.class);
+
+  private Configuration conf = null;
+
+  /** Constant identifier used as a Key for writing and reading
+   * triples to and from the metadata Map field.
+   */
+  private final static String ANY23_TRIPLES = "Any23-Triples";
 
 Review comment:
   This needs to be changed from private to a wider visibility (for example 
public) if it is to be used in later indexing filter tasks.




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298980#comment-16298980
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r158117002
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,155 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.Collections;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.protocol.Content;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. The supported formats can be found at <a href="http://any23.apache.org">Apache Any23</a>.
+ * In this implementation triples are written as Notation3 e.g.
+ * "2014/03/31 13:53:03"@en-gb .
+ * and triples are identified within output triple streams by the presence of '\n'.
+ * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
+ * In order to use another/other writers implementing the
+ * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
+ * interface, we will most likely need to identify an alternative data characteristic
+ * which we can use to split triples streams.
+ * 
+ *
+ */
+public class Any23ParseFilter implements HtmlParseFilter {
+
+  /** Logging instance */
+  public static final Logger LOG = LoggerFactory.getLogger(Any23ParseFilter.class);
+
+  private Configuration conf = null;
+
+  /** Constant identifier used as a Key for writing and reading
+   * triples to and from the metadata Map field.
+   */
+  private final static String ANY23_TRIPLES = "Any23-Triples";
+
+  private static class Any23Parser {
+
+    Set<String> triples = null;
+
+    Any23Parser(String url, String htmlContent) throws TripleHandlerException {
+      triples = new TreeSet<String>();
+      try {
+        parse(url, htmlContent);
+      } catch (URISyntaxException e) {
+        throw new RuntimeException(e.getReason());
+      } catch (IOException e) {
+        e.printStackTrace();
+      }
+    }
+
+    /**
+     * Maintains a {@link java.util.Set} containing the triples
+     * @return a {@link java.util.Set} of triples.
+     */
+    private Set<String> getTriples() {
+      return triples;
+    }
+
+    private void parse(String url, String htmlContent) throws URISyntaxException, IOException, TripleHandlerException {
+      Any23 any23 = new Any23();
+      ByteArrayOutputStream baos = new ByteArrayOutputStream();
+      TripleHandler tHandler = new NTriplesWriter(baos);
+      BenchmarkTripleHandler bHandler = new BenchmarkTripleHandler(tHandler);
+      try {
+        any23.extract(htmlContent, url, "text/HTML", "UTF-8", bHandler);
+      } catch (Exception e) {
+        e.printStackTrace();
+      } finally {
+        tHandler.close();
+        bHandler.close();
+      }
+      //This merely prints out a report of the Any23 extraction.
+      LOG.info("Any23 report: " + bHandler.report());
 
 Review comment:

[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298981#comment-16298981
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r158116805
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,165 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.*;
+import org.apache.nutch.protocol.Content;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. Currently it supports the following
+ * input formats:
+ * RDF/XML, Turtle, Notation 3
+ * RDFa with RDFa1.1 prefix mechanism
+ * Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, hReview,
+ * License, XFN and Species
+ * HTML5 Microdata: (such as Schema.org)
+ * CSV: Comma Separated Values with separator autodetection.
+ * In this implementation triples are written as Notation3 e.g.
+ * "2014/03/31 13:53:03"@en-gb .
+ * and triples are identified within output triple streams by the presence of '\n'.
+ * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
+ * In order to use another/other writers implementing the
+ * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
+ * interface, we will most likely need to identify an alternative data characteristic
+ * which we can use to split triples streams.
+ * 
+ *
+ */
+public class Any23ParseFilter implements HtmlParseFilter {
+
+  /** Logging instance */
+  public static final Logger LOG = LoggerFactory.getLogger(Any23ParseFilter.class);
+
+  private Configuration conf = null;
+
+  /** Constant identifier used as a Key for writing and reading
+   * triples to and from the metadata Map field.
+   */
+  public final static String ANY23_TRIPLES = "Any23-Triples";
+
+  private static class Any23Parser {
+
+    Set<String> triples = null;
+
+    Any23Parser(String url, String htmlContent) throws TripleHandlerException {
+      triples = new TreeSet<String>();
+      try {
+        parse(url, htmlContent);
+      } catch (URISyntaxException e) {
+        throw new RuntimeException(e.getReason());
+      } catch (IOException e) {
+        e.printStackTrace();
+      }
+    }
+
+    /**
+     * Maintains a {@link java.util.Set} containing the triples
+     * @return a {@link java.util.Set} of triples.
+     */
+    public Set<String> getTriples() {
+      return triples;
+    }
+
+    private void parse(String url, String htmlContent) throws URISyntaxException, IOException, TripleHandlerException {
+      Any23 any23 = new Any23();
+      ByteArrayOutputStream baos = new ByteArrayOutputStream();
+      TripleHandler tHandler = new NTriplesWriter(baos);
 
 Review comment:
   
http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TurtleWriter.html
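
For context on why the choice of writer matters here: the class Javadoc in the diff says triples are identified in the output stream by the presence of '\n'. A rough sketch of that splitting step, under the assumption that the writer emits one complete statement per line, might look as follows. `TripleSplitter` and `split` are hypothetical names and are not part of the pull request; the sketch uses only the JDK so that it does not guess at Any23 signatures.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not taken from the pull request.
final class TripleSplitter {
  // Splits line-oriented triple output into a sorted, de-duplicated set.
  // Assumes every non-empty line holds exactly one complete statement.
  static Set<String> split(ByteArrayOutputStream baos) {
    Set<String> triples = new TreeSet<>();
    String output = new String(baos.toByteArray(), StandardCharsets.UTF_8);
    for (String line : output.split("\n")) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) {
        triples.add(trimmed);
      }
    }
    return triples;
  }
}
```

This assumption would not hold for Turtle in general, since a Turtle document can carry prefix directives and can spread one subject's statements over several lines using `;` and `,`. Switching to a Turtle writer would therefore likely need a different way of separating statements, which is the caveat the Javadoc already raises.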



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298856#comment-16298856
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

dwirz commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r158098455
 
 

 ##
 File path: src/plugin/any23/src/java/org/apache/nutch/any23/package-info.java
 ##
 @@ -0,0 +1,21 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+/**
+ * @author lewismc
 
 Review comment:
   c866518 and 747d939




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298849#comment-16298849
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

dwirz commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r158097510
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,165 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.*;
 
 Review comment:
   b2199a8




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298845#comment-16298845
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

dwirz commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r158097340
 
 

 ##
 File path: build.xml
 ##
 @@ -1031,7 +1031,8 @@
 
 
 
-
+
+
 
 Review comment:
   a1bcad5




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298848#comment-16298848
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

dwirz commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r158097490
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,165 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;
 
 Review comment:
   b2199a8




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298847#comment-16298847
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

dwirz commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r158097393
 
 

 ##
 File path: src/plugin/any23/howto_upgrade_any23.txt
 ##
 @@ -0,0 +1,8 @@
+1. Upgrade Any23 dependency in trunk/ivy/ivy.xml
 
 Review comment:
   c380e2d




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298846#comment-16298846
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

dwirz commented on a change in pull request #205: WIP: NUTCH-1129 microdata for 
Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r158097385
 
 

 ##
 File path: ivy/ivy.xml
 ##
 @@ -66,6 +66,7 @@

 

+
 
 Review comment:
   c380e2d




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298843#comment-16298843
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

dwirz commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-353141496
 
 
   Hey @lewismc, thanks for your comments. I am working on it. 🙂 👍 




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298119#comment-16298119
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r157964888
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,165 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.*;
 
 Review comment:
   Please use explicit imports.




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298124#comment-16298124
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r157965740
 
 

 ##
 File path: src/plugin/any23/src/java/org/apache/nutch/any23/package-info.java
 ##
 @@ -0,0 +1,21 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+/**
+ * @author lewismc
 
 Review comment:
   Remove the lewismc author tag and add a comment similar to the one in 
```Any23ParseFilter```




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298120#comment-16298120
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r157965525
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,165 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.*;
+import org.apache.nutch.protocol.Content;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. Currently it supports the following
+ * input formats:
+ * RDF/XML, Turtle, Notation 3
+ * RDFa with RDFa1.1 prefix mechanism
+ * Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, hReview,
+ * License, XFN and Species
+ * HTML5 Microdata: (such as Schema.org)
+ * CSV: Comma Separated Values with separator autodetection..
+ * In this implementation triples are written as Notation3 e.g.
+ *  
 "2014/03/31 
13:53:03"@en-gb .
+ * and triples are identified within output triple streams by the presence of '\n'.
+ * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
+ * In order to use another/other writers implementing the
+ * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
+ * interface, we will most likely need to identify an alternative data characteristic
+ * which we can use to split triples streams.
+ * 
+ *
+ */
+public class Any23ParseFilter implements HtmlParseFilter {
+
+  /** Logging instance */
+  public static final Logger LOG = LoggerFactory.getLogger(Any23ParseFilter.class);
+
+  private Configuration conf = null;
+
+  /** Constant identifier used as a Key for writing and reading
+   * triples to and from the metadata Map field.
+   */
+  public final static String ANY23_TRIPLES = "Any23-Triples";
+
+  private static class Any23Parser {
+
+Set triples = null;
+
+Any23Parser(String url, String htmlContent) throws TripleHandlerException {
+  triples = new TreeSet();
+  try {
+parse(url, htmlContent);
+  } catch (URISyntaxException e) {
+throw new RuntimeException(e.getReason());
+  } catch (IOException e) {
+e.printStackTrace();
+  }
+}
+
+/**
+ * Maintains a {@link java.util.Set} containing the triples
+ * @return a {@link java.util.Set} of triples.
+ */
+public Set getTriples() {
+  return triples;
+}
+
+private void parse(String url, String htmlContent) throws URISyntaxException, IOException, TripleHandlerException {
+  Any23 any23 = new Any23();
+  ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  TripleHandler tHandler = new NTriplesWriter(baos);
+  BenchmarkTripleHandler bHandler = new BenchmarkTripleHandler(tHandler);
+  try {
+any23.extract(htmlContent, url, "text/HTML", "UTF-8", bHandler);
+  } catch (Exception e) {
+e.printStackTrace();
+  } finally {
+tHandler.close();
+bHandler.close();
+  }
+  //This merely prints out a report of the Any23 extraction.
+  LOG.info("Any2

[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298116#comment-16298116
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r157965004
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,165 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.*;
+import org.apache.nutch.protocol.Content;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. Currently it supports the following
+ * input formats:
 
 Review comment:
   Please remove the list. Just link to the Any23 Webpage.




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298117#comment-16298117
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r157964854
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,165 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;
 
 Review comment:
   Please remove this wildcard and use explicit imports.
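
As an illustration of the request above, the wildcard `java.util.*` could be replaced by the individual classes the file uses. The quoted diff only shows `Set` and `TreeSet` from `java.util` in use, so the sketch assumes those two are sufficient. The class name `ExplicitImportsSketch` and its helper method are hypothetical and not part of the pull request.

```java
// Explicit imports instead of "import java.util.*;"
import java.util.Set;
import java.util.TreeSet;

public class ExplicitImportsSketch {

  /** Returns an empty, sorted set, as the parser in the diff does for its triples. */
  static Set<String> newTripleSet() {
    return new TreeSet<>();
  }

  public static void main(String[] args) {
    Set<String> triples = newTripleSet();
    triples.add("b");
    triples.add("a");
    System.out.println(triples); // prints [a, b]
  }
}
```

With explicit imports a reader can see at a glance which `java.util` types the class depends on, which is presumably the motivation for the review comment.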




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298118#comment-16298118
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r157964683
 
 

 ##
 File path: src/plugin/any23/howto_upgrade_any23.txt
 ##
 @@ -0,0 +1,8 @@
+1. Upgrade Any23 dependency in trunk/ivy/ivy.xml
 
 Review comment:
   You can go ahead and upgrade this when you make the update to the Any23 2.1 
dependency.




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298121#comment-16298121
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r157965249
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,165 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.*;
+import org.apache.nutch.protocol.Content;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. Currently it supports the following
+ * input formats:
+ * RDF/XML, Turtle, Notation 3
+ * RDFa with RDFa1.1 prefix mechanism
+ * Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, hReview,
+ * License, XFN and Species
+ * HTML5 Microdata: (such as Schema.org)
+ * CSV: Comma Separated Values with separator autodetection..
+ * In this implementation triples are written as Notation3 e.g.
+ *  
 "2014/03/31 
13:53:03"@en-gb .
+ * and triples are identified within output triple streams by the presence of '\n'.
+ * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
+ * In order to use another/other writers implementing the
+ * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
+ * interface, we will most likely need to identify an alternative data characteristic
+ * which we can use to split triples streams.
+ * 
+ *
+ */
+public class Any23ParseFilter implements HtmlParseFilter {
+
+  /** Logging instance */
+  public static final Logger LOG = LoggerFactory.getLogger(Any23ParseFilter.class);
+
+  private Configuration conf = null;
+
+  /** Constant identifier used as a Key for writing and reading
+   * triples to and from the metadata Map field.
+   */
+  public final static String ANY23_TRIPLES = "Any23-Triples";
+
+  private static class Any23Parser {
+
+Set triples = null;
+
+Any23Parser(String url, String htmlContent) throws TripleHandlerException {
+  triples = new TreeSet();
+  try {
+parse(url, htmlContent);
+  } catch (URISyntaxException e) {
+throw new RuntimeException(e.getReason());
+  } catch (IOException e) {
+e.printStackTrace();
+  }
+}
+
+/**
+ * Maintains a {@link java.util.Set} containing the triples
+ * @return a {@link java.util.Set} of triples.
+ */
+public Set getTriples() {
+  return triples;
+}
+
+private void parse(String url, String htmlContent) throws URISyntaxException, IOException, TripleHandlerException {
+  Any23 any23 = new Any23();
+  ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  TripleHandler tHandler = new NTriplesWriter(baos);
 
 Review comment:
   Please write the triples as Turtle; it is easier to read, and hence to debug, 
if this breaks in the future.
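
The Javadoc quoted in the diff above says the filter identifies individual triples in the serialized stream by the presence of '\n'. A minimal sketch of that splitting step follows, assuming the writer output is already available as a string with one triple per line. The class and method names are hypothetical, the example IRIs are made up, and no Any23 API is called.

```java
import java.util.Set;
import java.util.TreeSet;

public class TripleSplitSketch {

  /** Splits newline-delimited triples into a sorted, de-duplicated set. */
  static Set<String> splitTriples(String serialized) {
    Set<String> triples = new TreeSet<>();
    for (String line : serialized.split("\n")) {
      String trimmed = line.trim();
      // Skip blank lines so empty strings are never stored as triples.
      if (!trimmed.isEmpty()) {
        triples.add(trimmed);
      }
    }
    return triples;
  }

  public static void main(String[] args) {
    // Hypothetical line-based output: one complete triple per line.
    String serialized =
        "<http://example.org/s> <http://example.org/p> \"b\" .\n"
            + "<http://example.org/s> <http://example.org/p> \"a\" .\n";
    for (String triple : splitTriples(serialized)) {
      System.out.println(triple);
    }
  }
}
```

Turtle allows a single statement to span several lines and to depend on prefix declarations, so a switch to Turtle would probably need a different rule for separating triples than the newline split sketched here.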



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298122#comment-16298122
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r157964410
 
 

 ##
 File path: build.xml
 ##
 @@ -1031,7 +1031,8 @@
 
 
 
-
+
+
 
 Review comment:
   Please add entries for the ```javadoc``` and ```eclipse``` targets as well. 
   
   Additionally, please add entries to ```default.properties``` as appropriate




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298115#comment-16298115
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r157964486
 
 

 ##
 File path: ivy/ivy.xml
 ##
 @@ -66,6 +66,7 @@

 

+
 
 Review comment:
   Upgrade this to 2.1 and correct formatting.




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298123#comment-16298123
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r157965804
 
 

 ##
 File path: 
src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23IndexingFilter.java
 ##
 @@ -0,0 +1,54 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import static org.junit.Assert.*;
 
 Review comment:
   Remove wildcard and use explicit imports


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275742#comment-16275742
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

hostingnuggets commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-348718590
 
 
   @lewismc Sorry, I'm not a Java dev, but I would still like to help if you 
can assist me a bit. Do I understand correctly that the first step would be 
to take the 16 files that @thilohaas modified (two commits in total), apply 
them to the master branch of Nutch, see if it works and, if so, create a new 
branch and submit a pull request?




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-12-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275659#comment-16275659
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-348705362
 
 
   @hostingnuggets I don't see why not. If you feel like submitting a PR then I 
will review it.




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-11-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259481#comment-16259481
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

hostingnuggets commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-345760233
 
 
   Is it planned to have this patch available also in the Nutch 2.x branch?




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154107#comment-16154107
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

thilohaas commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-327262863
 
 
   Sadly I'm currently too busy, but will definitely look into it as soon as 
possible. 
   Do you maybe have an idea of how to pass an array or hash of strings to the 
filter (see my comment on the PR)? That would let me simplify the process 
and come up with an alternative way of storing triples on the documents.
   
   BTW, the Any23 web service seems to be broken: it fails on every website 
I've tried, Google included: 
http://any23.org/any23/?format=best&uri=https%3A%2F%2Fgoogle.com&validation-mode=none
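
   On the question of passing several strings to the filter: one possible 
approach, sketched here under assumptions, is to keep one triple per value 
under a single key instead of one joined string. As far as I know Nutch's 
`Metadata` class can hold multiple values per name (`add` / `getValues`), but 
the sketch deliberately uses a plain `Map` as a stand-in so that it runs with 
the JDK alone. The class and method names are hypothetical; only the key 
`Any23-Triples` is taken from the PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TripleMetadataSketch {

  // Same key string the PR defines as ANY23_TRIPLES.
  static final String ANY23_TRIPLES = "Any23-Triples";

  /** Splits N-Triples text into one statement per non-empty line. */
  static List<String> splitTriples(String nTriples) {
    List<String> triples = new ArrayList<>();
    for (String line : nTriples.split("\n")) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) {
        triples.add(trimmed);
      }
    }
    return triples;
  }

  /** Stores every triple as its own value under one key (a stand-in for multi-valued metadata). */
  static Map<String, List<String>> toMetadata(String nTriples) {
    Map<String, List<String>> metadata = new LinkedHashMap<>();
    metadata.put(ANY23_TRIPLES, splitTriples(nTriples));
    return metadata;
  }

  public static void main(String[] args) {
    String sample =
        "<http://example.org/s> <http://example.org/p> \"a\" .\n"
            + "<http://example.org/s> <http://example.org/q> \"b\" .\n";
    for (String triple : toMetadata(sample).get(ANY23_TRIPLES)) {
      System.out.println(triple);
    }
  }
}
```

   Keeping the triples as separate values means the indexing side does not 
have to re-split a joined string, which is the part that depends on the 
newline convention of the N-Triples writer.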
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154087#comment-16154087
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-327258936
 
 
   I get a parser error using the [Any23 
Webservice](http://any23.org/any23/?format=best&uri=http%3A%2F%2Fmcdonalds.jobs%2Fsalt-lake-city-ut%2Fgeneral-manager%2F2947B6E7B04147FFBEE1445E66D7EA67%2Fjob%2F&validation-mode=validate-fix&report=on&annotate=on)
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154034#comment-16154034
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-327245880
 
 
   @lewismc Here's one of the URLs that I've tried:
   
   
http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/
   
   BTW, the previous patch was able to parse the Microdata without problems. :)
   
   EDIT, here's the full output:
   ```
   Thread FetcherThread has no more work available
   Using queue mode : byHost
   -finishing thread FetcherThread, activeThreads=1
   Fetcher: throughput threshold: -1
   Thread FetcherThread has no more work available
   Fetcher: throughput threshold retries: 5
   -finishing thread FetcherThread, activeThreads=1
   fetcher.maxNum.threads can't be < than 50 : using 50 instead
   -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=1
   -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=1
   Thread FetcherThread has no more work available
   -finishing thread FetcherThread, activeThreads=0
   -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=0
   -activeThreads=0
   Fetcher: finished at 2017-09-05 17:25:43, elapsed: 00:00:08
   Parsing : 20170905172529
   /home/simoncpu/nutch/runtime/local/bin/nutch parse -D 
mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D 
mapreduce.map.output.compress=true -D mapreduce.task.skip.start.attempts=2 -D 
mapreduce.map.skip.maxrecords=1 crawl-dir/segments/20170905172529
   ParseSegment: starting at 2017-09-05 17:25:45
   ParseSegment: segment: crawl-dir/segments/20170905172529
   Error parsing: http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
   Parsed (225ms):http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/
   ParseSegment: finished at 2017-09-05 17:25:51, elapsed: 00:00:06
   CrawlDB update
   /home/simoncpu/nutch/runtime/local/bin/nutch updatedb -D 
mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D 
mapreduce.map.output.compress=true crawl-dir/crawldb 
crawl-dir/segments/20170905172529
   CrawlDb update: starting at 2017-09-05 17:25:53
   CrawlDb update: db: crawl-dir/crawldb
   CrawlDb update: segments: [crawl-dir/segments/20170905172529]
   CrawlDb update: additions allowed: true
   CrawlDb update: URL normalizing: false
   CrawlDb update: URL filtering: false
   CrawlDb update: 404 purging: false
   CrawlDb update: Merging segment data into db.
   CrawlDb update: finished at 2017-09-05 17:25:59, elapsed: 00:00:05
   Link inversion
   /home/simoncpu/nutch/runtime/local/bin/nutch invertlinks crawl-dir/linkdb 
crawl-dir/segments/20170905172529
   LinkDb: starting at 2017-09-05 17:26:01
   LinkDb: linkdb: crawl-dir/linkdb
   LinkDb: URL normalize: true
   LinkDb: URL filter: true
   LinkDb: internal links will be ignored.
   LinkDb: adding segment: crawl-dir/segments/20170905172529
   LinkDb: finished at 2017-09-05 17:26:06, elapsed: 00:00:04
   Dedup on crawldb
   /home/simoncpu/nutch/runtime/local/bin/nutch dedup crawl-dir/crawldb
   DeduplicationJob: starting at 2017-09-05 17:26:07
   Deduplication: 0 documents marked as duplicates
   Deduplication: Updating status of duplicate urls into crawl db.
   Deduplication finished at 2017-09-05 17:26:15, elapsed: 00:00:07
   Indexing 20170905172529 to index
   /home/simoncpu/nutch/runtime/local/bin/nutch index crawl-dir/crawldb -linkdb 
crawl-dir/linkdb crawl-dir/segments/20170905172529
   Segment dir is complete: crawl-dir/segments/20170905172529.
   Indexer: starting at 2017-09-05 17:26:17
   Indexer: deleting gone documents: false
   Indexer: URL filtering: false
   Indexer: URL normalizing: false
   Active IndexWriters :
   ElasticRestIndexWriter
   elastic.rest.host : hostname
   elastic.rest.port : port
   elastic.rest.index : elastic index command
   elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 
250)
   elastic.rest.max.bulk.size : elastic bulk index length. (default 
2500500 ~2.5MB)
   
   
   Indexer: number of documents indexed, deleted, or skipped:
   Indexer: finished at 2017-09-05 17:26:23, elapsed: 00:00:05
   Cleaning up index if possible
   /home/simoncpu/nutch/runtime/local/bin/nutch clean crawl-dir/crawldb
   Wed Sep 6 01:26:28 DST 2017 : Finished loop with 1 iterations
   ```
 
--

[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154026#comment-16154026
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-327245880
 
 
   @lewismc Here's one of the URLs that I've tried:
   
   
http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/
   
   BTW, the previous patch was able to parse the Microdata without problems. :)
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153922#comment-16153922
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-327229664
 
 
   @thilohaas can you consider the comments above please?
   
   @simoncpu thank you for trying out the patch... please keep providing 
feedback. Did you manage to debug the source of the ParseException? The URL you 
provide is not actually available... have you tried it on anything else? An 
example would be https://www.w3.org
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153357#comment-16153357
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-327062894
 
 
   @thilohaas I tested this on a website with Microdata, but it can't index 
anything...
   
   EDIT: The error is:
   `Error parsing: http://example.org/website-with-microdata: 
org.apache.nutch.parse.ParseException: Unable to successfully parse content`
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-09-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153074#comment-16153074
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-327062894
 
 
   @thilohaas I tested this on a website with Microdata, but it can't index 
anything...
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-08-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149041#comment-16149041
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-326309710
 
 
   OK this is an issue. The solution is to address 
https://issues.apache.org/jira/browse/ANY23-264
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-08-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149030#comment-16149030
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-326307268
 
 
   @lewismc It still didn't work, so I just grabbed the jar file at: 
http://svn.apache.org/repos/asf/any23/repo-ext/org/apache/commons/commons-csv/1.0-SNAPSHOT-rev1148315/.
 :)
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147988#comment-16147988
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-326107912
 
 
   @simoncpu this may be intermittent... please report back here if it does not 
resolve itself. I am aware that this SNAPSHOT dependency has given us problems 
in the past. We may need to push a fix somewhere in Any23 e.g. upgrade the 
commons-csv library.
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-08-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147972#comment-16147972
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-326104881
 
 
   I tried building using the updated patch but got this:
   
   ```
   [ivy:resolve] WARN: ::
   [ivy:resolve] WARN: ::  UNRESOLVED DEPENDENCIES ::
   [ivy:resolve] WARN: ::
   [ivy:resolve] WARN: :: 
org.apache.commons#commons-csv;1.0-SNAPSHOT-rev1148315: not found
   [ivy:resolve] WARN: ::
   ```
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-08-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135513#comment-16135513
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata 
for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r134292923
 
 

 ##
 File path: 
src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,165 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.*;
+import org.apache.nutch.protocol.Content;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. Currently it supports the following
+ * input formats:
+ * RDF/XML, Turtle, Notation 3
+ * RDFa with RDFa1.1 prefix mechanism
+ * Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, hReview,
+ * License, XFN and Species
+ * HTML5 Microdata: (such as Schema.org)
+ * CSV: Comma Separated Values with separator autodetection..
+ * In this implementation triples are written as Notation3 e.g.
+ *  "2014/03/31 13:53:03"@en-gb .
+ * and triples are identified within output triple streams by the presence of '\n'.
+ * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
+ * In order to use another/other writers implementing the
+ * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
+ * interface, we will most likely need to identify an alternative data characteristic
+ * which we can use to split triples streams.
+ * 
+ *
+ */
+public class Any23ParseFilter implements HtmlParseFilter {
+
+  /** Logging instance */
+  public static final Logger LOG = LoggerFactory.getLogger(Any23ParseFilter.class);
+
+  private Configuration conf = null;
+
+  /** Constant identifier used as a Key for writing and reading
+   * triples to and from the metadata Map field.
+   */
+  public final static String ANY23_TRIPLES = "Any23-Triples";
+
+  private static class Any23Parser {
+
+    Set<String> triples = null;
+
+    Any23Parser(String url, String htmlContent) throws TripleHandlerException {
+      triples = new TreeSet<String>();
+      try {
+        parse(url, htmlContent);
+      } catch (URISyntaxException e) {
+        throw new RuntimeException(e.getReason());
+      } catch (IOException e) {
+        e.printStackTrace();
+      }
+    }
+
+    /**
+     * Maintains a {@link java.util.Set} containing the triples
+     * @return a {@link java.util.Set} of triples.
+     */
+    public Set<String> getTriples() {
+      return triples;
+    }
+
+    private void parse(String url, String htmlContent) throws URISyntaxException, IOException, TripleHandlerException {
+      Any23 any23 = new Any23();
+      ByteArrayOutputStream baos = new ByteArrayOutputStream();
+      TripleHandler tHandler = new NTriplesWriter(baos);
+      BenchmarkTripleHandler bHandler = new BenchmarkTripleHandler(tHandler);
+      try {
+        any23.extract(htmlContent, url, "text/HTML", "UTF-8", bHandler);
+      } catch (Exception e) {
+        e.printStackTrace();
+      } finally {
+        tHandler.close();
+        bHandler.close();
+      }
+      //This merely prints out a report of the Any23 extraction.
+      LOG.info("Any2

[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-08-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135514#comment-16135514
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#discussion_r134293186
 
 

 ##
 File path: src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
 ##
 @@ -0,0 +1,165 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.any23;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;
+
+import org.apache.any23.Any23;
+import org.apache.any23.writer.BenchmarkTripleHandler;
+import org.apache.any23.writer.NTriplesWriter;
+import org.apache.any23.writer.TripleHandler;
+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.*;
+import org.apache.nutch.protocol.Content;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+
+/**
+ * This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. Currently it supports the following
+ * input formats:
 
 Review comment:
   To be honest, the comment including a list of the supported formats is not 
really necessary. You can just link back to the any23.apache.org homepage for a 
list of supported formats.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-08-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135488#comment-16135488
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

thilohaas commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-323806868
 
 
   Sorry, I accidentally added changes from another local test branch. 
It should be cleaned up now and contain only changes relevant to the any23 plugin.
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-07-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102901#comment-16102901
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-318291768
 
 
   Hi @simoncpu, there is no way we can merge this code into the master branch of 
Nutch... it is simply too much of a change.
   This patch needs to be reduced in size to be considered.
   Thank you for all contributions to Nutch. We welcome them all, but we need to make 
sure that the software is high quality and **stable**.
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-07-27 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102898#comment-16102898
 ] 

Lewis John McGibbney commented on NUTCH-1129:
-

We need some sort of reasonable response here...
Currently, this issue is too large.
Sebastian's comments are valid. Can you please consider addressing them so that 
we can then work with this patch?



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102322#comment-16102322
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-318189750
 
 
   Will try this patch while waiting for it to be merged into the official 
repo... thanks, man! :)
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101795#comment-16101795
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-318087705
 
 
   Hi @thilohaas, this patch is too large for us to merge into the Nutch master 
branch...
   Can you please separate out your code to implement Microdata support? We can 
then review that patch alone.
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-07-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094256#comment-16094256
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

thilohaas opened a new pull request #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205
 
 
   
 



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2014-04-28 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982827#comment-13982827
 ] 

Sebastian Nagel commented on NUTCH-1129:


Hi [~lewismc], not yet. But I had a look at the patch. Looks good, in general! 
Some comments:
* The dependency on the any23 jar is also in ivy/ivy.xml. Is a global dependency 
required? We recently had a discussion about that topic 
[@user|http://mail-archives.apache.org/mod_mbox/nutch-user/201404.mbox/%3C535615BA.3050601%40raytion.com%3E].
* All extracted triples are finally stored in one multi-valued field, each 
triple represented as a string. That's not an optimal representation regarding 
two (are there more?) possible use cases: extracting and indexing key-value pairs 
as structured content (cf. 
[@dev|http://mail-archives.apache.org/mod_mbox/nutch-dev/201204.mbox/%3C4F8DEC5B.8070705%40googlemail.com%3E]),
 and indexing into some triple store (as a new indexer back-end).
* Similarly: isn't there a more efficient way to pass triples from the parse to 
the indexing filter than tab-separated in a huge string (there may be many 
triples in one document!)?

The latter two points aren't blockers by any means. But we should think about 
evolving the plugin to make it really usable.
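
To illustrate the second point above: a line of N-Triples output can be split into its subject, predicate and object rather than carried around as one opaque string. A rough sketch, assuming well-formed N-Triples lines in which the subject and predicate contain no whitespace; the Triple class and parseLine method are hypothetical names, not Nutch or Any23 API:

```java
class StructuredTripleSketch {

  /** Hypothetical value type with one field per triple component. */
  public static final class Triple {
    public final String subject;
    public final String predicate;
    public final String object;

    Triple(String subject, String predicate, String object) {
      this.subject = subject;
      this.predicate = predicate;
      this.object = object;
    }
  }

  /** Splits one N-Triples line into subject, predicate and object. */
  public static Triple parseLine(String line) {
    String s = line.trim();
    // Drop the terminating " ." of the statement, if present.
    if (s.endsWith(".")) {
      s = s.substring(0, s.length() - 1).trim();
    }
    // The limit of 3 keeps whitespace inside a literal object intact.
    String[] parts = s.split("\\s+", 3);
    if (parts.length != 3) {
      throw new IllegalArgumentException("Not a triple: " + line);
    }
    return new Triple(parts[0], parts[1], parts[2]);
  }

  public static void main(String[] args) {
    Triple t = parseLine(
        "<http://example.org/a> <http://example.org/name> \"Apache Nutch\" .");
    // The predicate could serve as an index field name, the object as its value.
    System.out.println(t.predicate + " = " + t.object);
  }
}
```

With the components separated, an indexing filter could map predicates to field names, or a triple-store back-end could consume the three parts directly.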

> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.3, 1.9
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2014-04-27 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982688#comment-13982688
 ] 

Lewis John McGibbney commented on NUTCH-1129:
-

Did anyone get an opportunity to try this out on 2.x?



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2014-03-31 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955759#comment-13955759
 ] 

Lewis John McGibbney commented on NUTCH-1129:
-

During ApacheCon I'll port this to trunk. Unless someone else wishes to do so 
:) :) :)



[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2012-02-15 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208345#comment-13208345
 ] 

Lewis John McGibbney commented on NUTCH-1129:
-

Yeah, you're right Markus. The Any23 libraries are parsers for extracting stuff 
like microdata; we would rely upon Tika for content extraction. Currently in 
Any23 I think we're stuck way back at 0.6 or something, so there is obviously 
work to be done here. I've been looking at 
https://svn.apache.org/viewvc/nutch/trunk/src/plugin/microformats-reltag/
I'll work towards reusing as much of the Tika stuff we have as possible.

> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2012-02-15 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208309#comment-13208309
 ] 

Markus Jelsma commented on NUTCH-1129:
--

This is a parser plugin, right? How will this work if we, for example, would like 
to parse microdata with any23 and use Tika's BoilerpipeContentHandler for 
extraction? In the current BP patch we use multiple content handlers to parse 
all in one go, so I wonder if this could be implemented as such.

Please correct me if I'm wrong :)
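
The idea of feeding several content handlers from a single parse can be sketched with the JDK's SAX API alone. This is only an illustration under assumptions: the Tee, TextCollector and ItempropCollector classes are hypothetical stand-ins, not the Boilerpipe patch, Tika or Any23 code, and the input is assumed to be well-formed XHTML:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TeeHandlerSketch {

  /** Forwards each SAX event to every delegate, so one parse feeds them all. */
  static final class Tee extends DefaultHandler {
    private final List<ContentHandler> delegates;

    Tee(ContentHandler... delegates) {
      this.delegates = Arrays.asList(delegates);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
        throws SAXException {
      for (ContentHandler h : delegates) {
        h.startElement(uri, local, qName, atts);
      }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
      for (ContentHandler h : delegates) {
        h.endElement(uri, local, qName);
      }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
      for (ContentHandler h : delegates) {
        h.characters(ch, start, length);
      }
    }
  }

  /** Collects plain text, standing in for a content-extraction handler. */
  public static final class TextCollector extends DefaultHandler {
    public final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
      text.append(ch, start, length);
    }
  }

  /** Collects itemprop attribute values, standing in for a microdata handler. */
  public static final class ItempropCollector extends DefaultHandler {
    public final List<String> props = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
      String prop = atts.getValue("itemprop");
      if (prop != null) {
        props.add(prop);
      }
    }
  }

  /** Parses the document once and dispatches events to all given handlers. */
  public static void parseOnce(String xhtml, ContentHandler... handlers) throws Exception {
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xhtml)), new Tee(handlers));
  }

  public static void main(String[] args) throws Exception {
    TextCollector text = new TextCollector();
    ItempropCollector props = new ItempropCollector();
    parseOnce("<div><span itemprop=\"name\">Nutch</span></div>", text, props);
    System.out.println(text.text + " " + props.props);
  }
}
```

Whether a microdata extractor can be plugged in this way depends on it exposing a SAX-style handler, which is the open question in the comment above.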





[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2012-02-10 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205750#comment-13205750
 ] 

Lewis John McGibbney commented on NUTCH-1129:
-

Hi Markus. I'm really gutted about this one; I've not had time to sort it out. 
I want to say the following things though.
- Any23 is now available on repository.apache.org [1]; however, I think we need 
to change our ivy resolver to fetch these 0.7.0-snapshots. Should be pretty 
trivial though.
- Any23 already has a crawler plugin implementation (nothing like the stuff we 
offer in Nutch ;0)). I'm not familiar with the code, but it might be worth a 
swatch [2]. Unfortunately the documentation is not great at all, as I'm sure 
you'll agree.

[1] https://repository.apache.org/index.html#nexus-search;quick~org.apache.any23
[2] https://svn.apache.org/viewvc/incubator/any23/trunk/plugins/basic-crawler/

> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.5
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 





[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2012-02-09 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204586#comment-13204586
 ] 

Markus Jelsma commented on NUTCH-1129:
--

Hi guys, anything new on this one? 





[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2011-12-20 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173845#comment-13173845
 ] 

Hudson commented on NUTCH-1129:
---

Integrated in Nutch-trunk #1699 (See 
[https://builds.apache.org/job/Nutch-trunk/1699/])
NUTCH-1129 Add freegenerator, domainstats and crawldbscanner to log4j

markus : 
http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221185
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/log4j.properties






[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2011-12-20 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173076#comment-13173076
 ] 

Hudson commented on NUTCH-1129:
---

Integrated in nutch-trunk-maven #69 (See 
[https://builds.apache.org/job/nutch-trunk-maven/69/])
NUTCH-1129 Add freegenerator, domainstats and crawldbscanner to log4j

markus : 
http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221185
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/log4j.properties






[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2011-09-26 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114692#comment-13114692
 ] 

Lewis John McGibbney commented on NUTCH-1129:
-

Thanks Julien. To be honest, it would be nice for the latter of your comments to 
materialise. I'll keep this issue open to track the progress.

> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.4
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2011-09-24 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114077#comment-13114077
 ] 

Julien Nioche commented on NUTCH-1129:
--

Any23 might graduate into a Tika subproject; if not, it should be available as a 
Tika parser and we'll get it automatically. 
