[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time I crawl a page:

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

Second time:

bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all

I'm using MongoDB.



  was:
This problem happens at the the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

seconde time : 

bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch  -all   -->  *** metadatas are delete for all pages already 
crawled  **
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I'm using mongodb




> fetch deletes all  metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 
>Reporter: Adnane B.
>
> This problem happens the second time I crawl a page:
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> Second time:
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> I'm using MongoDB.
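
The reproduction steps above can be sketched as a small script. This is only a convenience wrapper around the commands quoted in the report, not part of Nutch itself. The `NUTCH` variable and the `crawl_round` function are hypothetical names introduced for this sketch. By default the script only prints the commands (dry run). Setting `NUTCH=bin/nutch` would run them, assuming the script is started from a Nutch 2.x runtime directory with a `urls/` seed directory.

```shell
#!/bin/sh
set -e

# NUTCH is an assumption of this sketch: it defaults to a dry run that only
# echoes each command. Export NUTCH=bin/nutch to execute the real commands.
NUTCH="${NUTCH:-echo bin/nutch}"

# One generate/fetch/parse/updatedb round, using the flags from the report.
crawl_round() {
  $NUTCH generate -topN 1000 &&
  $NUTCH fetch -all &&
  $NUTCH parse -force -all &&
  $NUTCH updatedb -all
}

$NUTCH inject urls/   # seed the web table once
crawl_round           # first round
crawl_round           # second round: the reported metadata loss occurs during fetch
```

Comparing a page's stored metadata after the first and second rounds should show whether only `_csh_` and `_rs_` survive, as the report describes.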



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time I crawl a page:

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

Second time:

bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all

I'm using MongoDB.



  was:
This problem happens at the the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

seconde time : 

bin/nutch generate -topN 1000 --> bachid changes for all existing pages
bin/nutch fetch  -all   -->  *** metadatas are delete for all pages already 
crawled  **
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I'm using mongodb




> fetch deletes all  metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 
>Reporter: Adnane B.
>
> This problem happens the second time I crawl a page:
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> Second time:
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> I'm using MongoDB.





[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time I crawl a page:

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

Second time:

bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all

I'm using MongoDB.



  was:
This problem happens at the the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

seconde time : 

bin/nutch generate -topN 1000
bin/nutch fetch  -all   -->  *** metadatas are delete for all pages already 
crawled  **
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I'm using mongodb




> fetch deletes all  metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 
>Reporter: Adnane B.
>
> This problem happens the second time I crawl a page:
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> Second time:
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> I'm using MongoDB.





[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens the second time I crawl a page:

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

Second time:

bin/nutch generate -topN 1000
bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all

I'm using MongoDB.



  was:
This problem happens at the the second update on a crawled a page ** that has 
not changed** with -all option
bin/nutch updatedb -all
not tested with  other options
I'm using mongodb




> fetch deletes all  metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 
>Reporter: Adnane B.
>
> This problem happens the second time I crawl a page:
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> Second time:
> bin/nutch generate -topN 1000
> bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> I'm using MongoDB.





[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Summary: fetch deletes all  metadata except _csh_ and _rs_  (was: updatedb 
deletes all  metadata except _csh_ and _rs_)

> fetch deletes all  metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 
>Reporter: Adnane B.
>
> This problem happens on the second update of a crawled page **that has
> not changed** with the -all option:
> bin/nutch updatedb -all
> Not tested with other options.
> I'm using MongoDB.





[jira] [Updated] (NUTCH-2222) updatedb deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adnane B. updated NUTCH-:
-
Description: 
This problem happens on the second update of a crawled page **that has
not changed** with the -all option:
bin/nutch updatedb -all
Not tested with other options.
I'm using MongoDB.



  was:
This problem happens at the the second update on a crawled page with -all option
bin/nutch updatedb -all
not tested with  other options
I'm using mongodb


> updatedb deletes all  metadata except _csh_ and _rs_
> 
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 
>Reporter: Adnane B.
>
> This problem happens on the second update of a crawled page **that has
> not changed** with the -all option:
> bin/nutch updatedb -all
> Not tested with other options.
> I'm using MongoDB.





[jira] [Created] (NUTCH-2222) updatedb deletes all metadata except _csh_ and _rs_

2016-02-16 Thread Adnane B. (JIRA)
Adnane B. created NUTCH-:


 Summary: updatedb deletes all  metadata except _csh_ and _rs_
 Key: NUTCH-
 URL: https://issues.apache.org/jira/browse/NUTCH-
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 2.3.1
 Environment: Centos 6, mongodb 2.6 and mongodb 3.0 

Reporter: Adnane B.


This problem happens on the second update of a crawled page with the -all option:
bin/nutch updatedb -all
Not tested with other options.
I'm using MongoDB.





[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148758#comment-15148758
 ] 

Hudson commented on NUTCH-961:
--

SUCCESS: Integrated in Nutch-trunk #3347 (See 
[https://builds.apache.org/job/Nutch-trunk/3347/])
NUTCH-961 Expose Tika's Boilerpipe support (markus: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1730694])
* trunk/CHANGES.txt
* trunk/conf/nutch-default.xml
* 
trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java
* 
trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java


> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961.patch, NUTCH-961.patch, NUTCH-961v2.patch, 
> nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> {code}





[jira] [Resolved] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-961.
-
Resolution: Fixed

Committed to trunk in revision 1730694. Thanks, everyone, for the contributions.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961.patch, NUTCH-961.patch, NUTCH-961v2.patch, 
> nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> {code}
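
Because `tika.extractor` defaults to `none`, Boilerpipe stays off until the value is overridden. The sketch below writes the two overrides to a scratch file for review, assuming they are then merged by hand into `conf/nutch-site.xml` (the usual place for local overrides of `nutch-default.xml`). The `BOILERPIPE_FRAGMENT` variable is a hypothetical name introduced here. The property names and values come from the issue description above.

```shell
#!/bin/sh
# Target file for the fragment. BOILERPIPE_FRAGMENT is an assumption of this
# sketch; when unset, a temporary file is created instead.
out="${BOILERPIPE_FRAGMENT:-$(mktemp)}"

# Override the default extractor ("none") and keep the default algorithm explicit.
cat > "$out" <<'EOF'
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
</property>
EOF

echo "Wrote Boilerpipe overrides to $out"
```

The fragment belongs inside the `<configuration>` element of the site file. `DefaultExtractor` and `CanolaExtractor` are the other algorithm values listed in the description.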





[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-961:

Attachment: NUTCH-961.patch

Updated patch. ExtractorRepository was missing.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961.patch, NUTCH-961.patch, NUTCH-961v2.patch, 
> nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> {code}





[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-961:

Fix Version/s: 1.12

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961.patch, NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> {code}





[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-961:

Affects Version/s: 1.11

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961.patch, NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> {code}





[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148642#comment-15148642
 ] 

Markus Jelsma commented on NUTCH-961:
-

Tests pass as expected, and Boilerpipe works as well. Will commit shortly.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961.patch, NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> {code}





[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-961:

Description: 
Tika 0.8 comes with the Boilerpipe content handler which can be used to extract 
boilerplate content from HTML pages. We should see how we can expose 
Boilerpipe in the Nutch configuration.

Use the following properties to enable and control Boilerpipe.

{code}
<property>
  <name>tika.extractor</name>
  <value>none</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
  or CanolaExtractor.
  </description>
</property>
{code}

  was:
Tika 0.8 comes with the Boilerpipe content handler which can be used to extract 
boilerplate content from HTML pages. We should see how we can expose 
Boilerplate in the Nutch cofiguration.

Use the following properties to enable and control Boilerpipe.

<property>
  <name>tika.extractor</name>
  <value>none</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
  or CanolaExtractor.
  </description>
</property>


> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961.patch, NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> {code}





[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-961:

Description: 
Tika 0.8 comes with the Boilerpipe content handler which can be used to extract 
boilerplate content from HTML pages. We should see how we can expose 
Boilerpipe in the Nutch configuration.

Use the following properties to enable and control Boilerpipe.

<property>
  <name>tika.extractor</name>
  <value>none</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
  or CanolaExtractor.
  </description>
</property>

  was:Tika 0.8 comes with the Boilerpipe content handler which can be used to 
extract boilerplate content from HTML pages. We should see how we can expose 
Boilerplate in the Nutch cofiguration.


> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961.patch, NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>





[jira] [Commented] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148631#comment-15148631
 ] 

Hudson commented on NUTCH-2210:
---

SUCCESS: Integrated in Nutch-trunk #3346 (See 
[https://builds.apache.org/job/Nutch-trunk/3346/])
NUTCH-2210 Upgrade to Tika 1.12 (markus: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1730686])
* trunk/CHANGES.txt
* trunk/ivy/ivy.xml
* trunk/src/plugin/parse-tika/ivy.xml
* trunk/src/plugin/parse-tika/plugin.xml


> Upgrade to Tika 1.12
> 
>
> Key: NUTCH-2210
> URL: https://issues.apache.org/jira/browse/NUTCH-2210
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: NUTCH-2210.patch
>
>
> Upgrade to Tika 1.12 when it is released. Keep in mind,  module="rome" rev="0.9"/> in ivy.xml must be removed.





[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-961:

Attachment: NUTCH-961.patch

Patch for trunk.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961.patch, NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.





[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148632#comment-15148632
 ] 

Hudson commented on NUTCH-1233:
---

SUCCESS: Integrated in Nutch-trunk #3346 (See 
[https://builds.apache.org/job/Nutch-trunk/3346/])
NUTCH-1233 Rely on Tika for outlink extraction (markus: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1730687])
* trunk/CHANGES.txt
* 
trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMBuilder.java
* 
trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* 
trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java


> Rely on Tika for outlink extraction
> ---
>
> Key: NUTCH-1233
> URL: https://issues.apache.org/jira/browse/NUTCH-1233
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, 
> NUTCH-1233-1.6-2.patch, NUTCH-1233.patch, NUTCH-1233.patch, post-1233-2.txt, 
> post-1233.txt, pre-1233-2.txt, pre-1233.txt
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be 
> able to use it in Nutch we need Tika to return the rel attr value of each 
> link, which it currently doesn't. There's a patch for Tika 1.1. If that patch 
> is included in Tika and we upgrade to that new version, this issue can be 
> worked on. Here's preliminary code that does both Tika and current outlink 
> extraction. This also includes parts of the Boilerpipe code.





[jira] [Resolved] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1233.
--
Resolution: Fixed

Committed to trunk in revision 1730687.


> Rely on Tika for outlink extraction
> ---
>
> Key: NUTCH-1233
> URL: https://issues.apache.org/jira/browse/NUTCH-1233
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, 
> NUTCH-1233-1.6-2.patch, NUTCH-1233.patch, NUTCH-1233.patch, post-1233-2.txt, 
> post-1233.txt, pre-1233-2.txt, pre-1233.txt
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be 
> able to use it in Nutch we need Tika to return the rel attr value of each 
> link, which it currently doesn't. There's a patch for Tika 1.1. If that patch 
> is included in Tika and we upgrade to that new version, this issue can be 
> worked on. Here's preliminary code that does both Tika and current outlink 
> extraction. This also includes parts of the Boilerpipe code.





[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1233:
-
Affects Version/s: 1.11

> Rely on Tika for outlink extraction
> ---
>
> Key: NUTCH-1233
> URL: https://issues.apache.org/jira/browse/NUTCH-1233
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, 
> NUTCH-1233-1.6-2.patch, NUTCH-1233.patch, NUTCH-1233.patch, post-1233-2.txt, 
> post-1233.txt, pre-1233-2.txt, pre-1233.txt
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be 
> able to use it in Nutch we need Tika to return the rel attr value of each 
> link, which it currently doesn't. There's a patch for Tika 1.1. If that patch 
> is included in Tika and we upgrade to that new version, this issue can be 
> worked on. Here's preliminary code that does both Tika and current outlink 
> extraction. This also includes parts of the Boilerpipe code.





[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1233:
-
Fix Version/s: 1.12

> Rely on Tika for outlink extraction
> ---
>
> Key: NUTCH-1233
> URL: https://issues.apache.org/jira/browse/NUTCH-1233
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, 
> NUTCH-1233-1.6-2.patch, NUTCH-1233.patch, NUTCH-1233.patch, post-1233-2.txt, 
> post-1233.txt, pre-1233-2.txt, pre-1233.txt
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be 
> able to use it in Nutch we need Tika to return the rel attr value of each 
> link, which it currently doesn't. There's a patch for Tika 1.1. If that patch 
> is included in Tika and we upgrade to that new version, this issue can be 
> worked on. Here's preliminary code that does both Tika and current outlink 
> extraction. This also includes parts of the Boilerpipe code.





[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1233:
-
Component/s: parser

> Rely on Tika for outlink extraction
> ---
>
> Key: NUTCH-1233
> URL: https://issues.apache.org/jira/browse/NUTCH-1233
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, 
> NUTCH-1233-1.6-2.patch, NUTCH-1233.patch, NUTCH-1233.patch, post-1233-2.txt, 
> post-1233.txt, pre-1233-2.txt, pre-1233.txt
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be 
> able to use it in Nutch we need Tika to return the rel attr value of each 
> link, which it currently doesn't. There's a patch for Tika 1.1. If that patch 
> is included in Tika and we upgrade to that new version, this issue can be 
> worked on. Here's preliminary code that does both Tika and current outlink 
> extraction. This also includes parts of the Boilerpipe code.





[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148590#comment-15148590
 ] 

Markus Jelsma commented on NUTCH-1233:
--

Awesome! Everything works as expected since the Tika 1.12 upgrade. The number of 
outlinks is as expected with and without LinkContentHandler. Tests pass. Will 
commit shortly.

> Rely on Tika for outlink extraction
> ---
>
> Key: NUTCH-1233
> URL: https://issues.apache.org/jira/browse/NUTCH-1233
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, 
> NUTCH-1233-1.6-2.patch, NUTCH-1233.patch, NUTCH-1233.patch, post-1233-2.txt, 
> post-1233.txt, pre-1233-2.txt, pre-1233.txt
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be 
> able to use it in Nutch we need Tika to return the rel attr value of each 
> link, which it currently doesn't. There's a patch for Tika 1.1. If that patch 
> is included in Tika and we upgrade to that new version, this issue can be 
> worked on. Here's preliminary code that does both Tika and current outlink 
> extraction. This also includes parts of the Boilerpipe code.





[jira] [Resolved] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2210.
--
Resolution: Fixed

Committed to trunk in revision 1730686.


> Upgrade to Tika 1.12
> 
>
> Key: NUTCH-2210
> URL: https://issues.apache.org/jira/browse/NUTCH-2210
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: NUTCH-2210.patch
>
>
> Upgrade to Tika 1.12 when it is released. Keep in mind,  module="rome" rev="0.9"/> in ivy.xml must be removed.





[jira] [Commented] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148572#comment-15148572
 ] 

Markus Jelsma commented on NUTCH-2210:
--

Test passes, will commit shortly.

> Upgrade to Tika 1.12
> 
>
> Key: NUTCH-2210
> URL: https://issues.apache.org/jira/browse/NUTCH-2210
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: NUTCH-2210.patch
>
>
> Upgrade to Tika 1.12 when it is released. Keep in mind,  module="rome" rev="0.9"/> in ivy.xml must be removed.





[jira] [Updated] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-16 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2210:
-
Attachment: NUTCH-2210.patch

Patch for trunk. 

> Upgrade to Tika 1.12
> 
>
> Key: NUTCH-2210
> URL: https://issues.apache.org/jira/browse/NUTCH-2210
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: NUTCH-2210.patch
>
>
> Upgrade to Tika 1.12 when it is released. Keep in mind,  module="rome" rev="0.9"/> in ivy.xml must be removed.





[jira] [Commented] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148489#comment-15148489
 ] 

Markus Jelsma commented on NUTCH-2197:
--

Hello Arun - no, this is not applied to 2.3.1. The plugins are similar, though, 
so with some effort you could patch 2.x.

> Add solr5 solrcloud indexer support
> ---
>
> Key: NUTCH-2197
> URL: https://issues.apache.org/jira/browse/NUTCH-2197
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Jurian Broertjes
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.12
>
> Attachments: NUTCH-2197.patch, NUTCH-2197.patch, NUTCH-2197.patch
>
>
> Nutch cannot index to Solr5. Proper SolrCloud support is also missing.


