[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-02-27 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170580#comment-15170580
 ] 

Adnane B. commented on NUTCH-:
--

Please let me know if this issue does not exist with any other persistent 
storage configuration.

> re-fetch deletes all  metadata except _csh_ and _rs_
> 
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and 
> hbase-0.98.8-hadoop2
>Reporter: Adnane B.
>Assignee: Lewis John McGibbney
>
> This problem happens the second time I crawl a page:
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch  -all
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> Second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> I can reproduce it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce it easily, add the following to nutch-site.xml:
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 minute)</description>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170557#comment-15170557
 ] 

ASF GitHub Bot commented on NUTCH-961:
--

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/92#discussion_r54332201
  
--- Diff: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java ---
@@ -109,7 +114,18 @@ public Parse getParse(String url, WebPage page) {
 HTMLDocumentImpl doc = new HTMLDocumentImpl();
 doc.setErrorChecking(false);
 DocumentFragment root = doc.createDocumentFragment();
-DOMBuilder domhandler = new DOMBuilder(doc, root);
+   // DOMBuilder domhandler = new DOMBuilder(doc, root);
+ContentHandler domHandler;
+// Check whether to use Tika's BoilerplateContentHandler
+if (useBoilerpipe) {
+LOG.debug("Using Tikas's Boilerpipe with Extractor: " + boilerpipeExtractorName);
--- End diff --

This could also use the more efficient slf4j parameterized-message convention, e.g.
logger.debug("The entry is {}.", entry);
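
A minimal, JDK-only sketch of why the placeholder form tends to be cheaper. The class `ParameterizedLoggingSketch`, its `debug` and `format` helpers and the `Entry` type are hypothetical stand-ins written for illustration, not slf4j's API; the only assumption carried over from slf4j is that `{}` placeholders are filled in after the level check.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an slf4j Logger, only to illustrate deferred formatting.
final class ParameterizedLoggingSketch {

  /** Counts how often an Entry is converted to text. */
  static final AtomicInteger TO_STRING_CALLS = new AtomicInteger();

  /** An argument whose string conversion we can observe. */
  static final class Entry {
    private final String name;
    Entry(String name) { this.name = name; }
    @Override public String toString() {
      TO_STRING_CALLS.incrementAndGet();
      return name;
    }
  }

  /** Replaces each "{}" in the template with the next argument, left to right. */
  static String format(String template, Object... args) {
    StringBuilder out = new StringBuilder();
    int from = 0;
    for (Object arg : args) {
      int at = template.indexOf("{}", from);
      if (at < 0) break;                       // more arguments than placeholders
      out.append(template, from, at).append(arg);
      from = at + 2;
    }
    return out.append(template.substring(from)).toString();
  }

  /** Formats only when debug is enabled; returns null when the message is skipped. */
  static String debug(boolean debugEnabled, String template, Object... args) {
    return debugEnabled ? format(template, args) : null;
  }

  public static void main(String[] args) {
    Entry entry = new Entry("ArticleExtractor");

    // Concatenation converts the argument before the logger can drop the message.
    debug(false, "Using extractor: " + entry);
    System.out.println("toString calls after concatenation: " + TO_STRING_CALLS.get());

    // Placeholder style: the argument is only converted if the level is enabled.
    debug(false, "Using extractor: {}", entry);
    System.out.println("toString calls after placeholder: " + TO_STRING_CALLS.get());

    System.out.println(debug(true, "Using extractor: {}", entry));
  }
}
```

With debug disabled, the counter should only move for the concatenated call, which is the cost the slf4j convention avoids.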


> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961.patch, NUTCH-961.patch, NUTCH-961v2.patch, 
> nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
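> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>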
> {code}





[GitHub] nutch pull request: Add the boilerpipe parsing adapted from NUTCH-...

2016-02-27 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/92#discussion_r54332201
  
--- Diff: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java ---
@@ -109,7 +114,18 @@ public Parse getParse(String url, WebPage page) {
 HTMLDocumentImpl doc = new HTMLDocumentImpl();
 doc.setErrorChecking(false);
 DocumentFragment root = doc.createDocumentFragment();
-DOMBuilder domhandler = new DOMBuilder(doc, root);
+   // DOMBuilder domhandler = new DOMBuilder(doc, root);
+ContentHandler domHandler;
+// Check whether to use Tika's BoilerplateContentHandler
+if (useBoilerpipe) {
+LOG.debug("Using Tikas's Boilerpipe with Extractor: " + boilerpipeExtractorName);
--- End diff --

This could also use the more efficient slf4j parameterized-message convention, e.g.
logger.debug("The entry is {}.", entry);


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170556#comment-15170556
 ] 

ASF GitHub Bot commented on NUTCH-961:
--

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/92#discussion_r54332193
  
--- Diff: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parse.tika;
+
+import java.lang.ClassLoader;
+import java.lang.InstantiationException;
+import java.util.WeakHashMap;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.tika.parser.html.BoilerpipeContentHandler;
+import de.l3s.boilerpipe.BoilerpipeExtractor;
+import de.l3s.boilerpipe.extractors.*;
+
+class BoilerpipeExtractorRepository {
+
+public static final Log LOG = LogFactory.getLog(BoilerpipeExtractorRepository.class);
+public static final WeakHashMap<String, BoilerpipeExtractor> extractorRepository = new WeakHashMap<String, BoilerpipeExtractor>();
+ 
+/**
+ * Returns an instance of the specified extractor
+ */
+public static BoilerpipeExtractor getExtractor(String boilerpipeExtractorName) {
+  // Check if there's no instance of this extractor
+  if (!extractorRepository.containsKey(boilerpipeExtractorName)) {
+// FQCN
+boilerpipeExtractorName = "de.l3s.boilerpipe.extractors." + boilerpipeExtractorName;
+
+// Attempt to load the class
+try {
+  ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
+  Class extractorClass = loader.loadClass(boilerpipeExtractorName);
+
+  // Add an instance to the repository
+  extractorRepository.put(boilerpipeExtractorName, (BoilerpipeExtractor)extractorClass.newInstance());
+
+} catch (ClassNotFoundException e) {
+  LOG.error("BoilerpipeExtractor " + boilerpipeExtractorName + " not found!");
--- End diff --

With slf4j the logging in this catch block can be structured better, see
http://www.slf4j.org/faq.html#logging_performance
e.g.
```
logger.debug("The entry is {}.", entry);
```
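
A rough sketch of how such a load-and-cache method with a more structured catch could look, under stated assumptions. `ExtractorLoadingSketch` and `getInstance` are hypothetical names, `java.util.logging` stands in for slf4j so the example runs on the JDK alone, and plain `Object` stands in for `BoilerpipeExtractor`. The slf4j call shown in the comment relies on slf4j treating a trailing `Throwable` argument as the exception to log.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;
import java.util.logging.Logger;

final class ExtractorLoadingSketch {

  // java.util.logging stands in for slf4j so the sketch needs no extra jars.
  private static final Logger LOG = Logger.getLogger(ExtractorLoadingSketch.class.getName());

  // Cache keyed by the fully qualified class name, the same key used for lookups.
  private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

  /** Loads and caches an instance of packagePrefix + simpleName, or returns null on failure. */
  static Object getInstance(String packagePrefix, String simpleName) {
    String fqcn = packagePrefix + simpleName;
    Object cached = CACHE.get(fqcn);
    if (cached != null) {
      return cached;
    }
    try {
      Object created = Class.forName(fqcn).getDeclaredConstructor().newInstance();
      // Keep whichever instance reached the cache first.
      Object raced = CACHE.putIfAbsent(fqcn, created);
      return raced != null ? raced : created;
    } catch (ClassNotFoundException e) {
      // slf4j equivalent: LOG.error("Class {} not found!", fqcn, e);
      LOG.log(Level.SEVERE, "Class " + fqcn + " not found!", e);
    } catch (ReflectiveOperationException e) {
      // Covers missing no-arg constructor, access and instantiation failures.
      LOG.log(Level.SEVERE, "Could not instantiate " + fqcn, e);
    }
    return null;
  }

  public static void main(String[] args) {
    Object first = getInstance("java.util.", "ArrayList");
    System.out.println("same instance on second lookup: "
        + (first == getInstance("java.util.", "ArrayList")));
    System.out.println("unknown class gives: " + getInstance("java.util.", "NoSuchExtractor"));
  }
}
```

Passing the exception to the logger keeps the stack trace, and using one key for both lookup and insert means a second request for the same name is served from the cache.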




[GitHub] nutch pull request: Add the boilerpipe parsing adapted from NUTCH-...

2016-02-27 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/92#discussion_r54332193
  
--- Diff: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parse.tika;
+
+import java.lang.ClassLoader;
+import java.lang.InstantiationException;
+import java.util.WeakHashMap;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.tika.parser.html.BoilerpipeContentHandler;
+import de.l3s.boilerpipe.BoilerpipeExtractor;
+import de.l3s.boilerpipe.extractors.*;
+
+class BoilerpipeExtractorRepository {
+
+public static final Log LOG = LogFactory.getLog(BoilerpipeExtractorRepository.class);
+public static final WeakHashMap<String, BoilerpipeExtractor> extractorRepository = new WeakHashMap<String, BoilerpipeExtractor>();
+ 
+/**
+ * Returns an instance of the specified extractor
+ */
+public static BoilerpipeExtractor getExtractor(String boilerpipeExtractorName) {
+  // Check if there's no instance of this extractor
+  if (!extractorRepository.containsKey(boilerpipeExtractorName)) {
+// FQCN
+boilerpipeExtractorName = "de.l3s.boilerpipe.extractors." + boilerpipeExtractorName;
+
+// Attempt to load the class
+try {
+  ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
+  Class extractorClass = loader.loadClass(boilerpipeExtractorName);
+
+  // Add an instance to the repository
+  extractorRepository.put(boilerpipeExtractorName, (BoilerpipeExtractor)extractorClass.newInstance());
+
+} catch (ClassNotFoundException e) {
+  LOG.error("BoilerpipeExtractor " + boilerpipeExtractorName + " not found!");
--- End diff --

With slf4j the logging in this catch block can be structured better, see
http://www.slf4j.org/faq.html#logging_performance
e.g.
```
logger.debug("The entry is {}.", entry);
```




[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170555#comment-15170555
 ] 

ASF GitHub Bot commented on NUTCH-961:
--

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/92#discussion_r54332155
  
--- Diff: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parse.tika;
+
+import java.lang.ClassLoader;
+import java.lang.InstantiationException;
+import java.util.WeakHashMap;
+import org.apache.commons.logging.Log;
--- End diff --

Nutch currently uses Slf4j

org.slf4j.Logger
org.slf4j.LoggerFactory

I think!




[GitHub] nutch pull request: Add the boilerpipe parsing adapted from NUTCH-...

2016-02-27 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/92#discussion_r54332155
  
--- Diff: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parse.tika;
+
+import java.lang.ClassLoader;
+import java.lang.InstantiationException;
+import java.util.WeakHashMap;
+import org.apache.commons.logging.Log;
--- End diff --

Nutch currently uses Slf4j

org.slf4j.Logger
org.slf4j.LoggerFactory

I think!




[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170554#comment-15170554
 ] 

ASF GitHub Bot commented on NUTCH-961:
--

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/92#discussion_r54332145
  
--- Diff: conf/nutch-default.xml ---
@@ -876,6 +876,19 @@
   
 
 
+
+
+<property>
+  <name>tika.boilerpipe</name>
+  <value>false</value>
--- End diff --

Can you provide descriptions of these properties please?




[GitHub] nutch pull request: Add the boilerpipe parsing adapted from NUTCH-...

2016-02-27 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/92#discussion_r54332145
  
--- Diff: conf/nutch-default.xml ---
@@ -876,6 +876,19 @@
   
 
 
+
+
+<property>
+  <name>tika.boilerpipe</name>
+  <value>false</value>
--- End diff --

Can you provide descriptions of these properties please?

