[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-------------------------
    Description:
This problem happens the second time I crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

second time :
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all

I'm using mongodb

  was:
This problem happens the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

second time :
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all

I'm using mongodb

> fetch deletes all metadata except _csh_ and _rs_
> ------------------------------------------------
>
>                 Key: NUTCH-
>                 URL: https://issues.apache.org/jira/browse/NUTCH-
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 2.3.1
>         Environment: Centos 6, mongodb 2.6 and mongodb 3.0
>            Reporter: Adnane B.
>
> This problem happens the second time I crawl a page
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> second time :
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> I'm using mongodb

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
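For anyone trying to reproduce the report, the two crawl rounds can be written down as a small script. This is only a sketch: the command strings are copied from the description above, while the helper names (`crawl_rounds`, `run`) and the dry-run default are assumptions added for illustration, and nothing here inspects the store to confirm the metadata loss.

```python
import shlex
import subprocess

# Commands copied from the report; "urls/" and "-topN 1000" are the reporter's values.
FIRST_ROUND = [
    "bin/nutch inject urls/",
    "bin/nutch generate -topN 1000",
    "bin/nutch fetch -all",
    "bin/nutch parse -force -all",
    "bin/nutch updatedb -all",
]
# The second round repeats every step except the initial inject.
SECOND_ROUND = FIRST_ROUND[1:]


def crawl_rounds():
    """Return the full command sequence for both rounds, in order."""
    return FIRST_ROUND + SECOND_ROUND


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(command)
        if not dry_run:
            # Hypothetical invocation: assumes the current directory is the Nutch runtime.
            subprocess.run(shlex.split(command), check=True)


if __name__ == "__main__":
    run(crawl_rounds())
```

With `dry_run=False` the script would have to be started from the Nutch runtime directory; according to the report, the metadata is lost during the second `fetch -all`.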
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-------------------------
    Description:
This problem happens the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

second time :
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all

I'm using mongodb

  was:
This problem happens the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

second time :
bin/nutch generate -topN 1000 --> bachid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all

I'm using mongodb
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-------------------------
    Description:
This problem happens the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

second time :
bin/nutch generate -topN 1000 --> bachid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all

I'm using mongodb

  was:
This problem happens the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

second time :
bin/nutch generate -topN 1000
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all

I'm using mongodb
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-------------------------
    Description:
This problem happens the second time a crawl a page

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

second time :
bin/nutch generate -topN 1000
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all

I'm using mongodb

  was:
This problem happens at the second update on a crawled a page ** that has not changed** with -all option

bin/nutch updatedb -all

not tested with other options

I'm using mongodb
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-------------------------
    Summary: fetch deletes all metadata except _csh_ and _rs_  (was: updatedb deletes all metadata except _csh_ and _rs_)
[jira] [Updated] (NUTCH-2222) updatedb deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-------------------------
    Description:
This problem happens at the second update on a crawled a page ** that has not changed** with -all option

bin/nutch updatedb -all

not tested with other options

I'm using mongodb

  was:
This problem happens at the second update on a crawled page with -all option

bin/nutch updatedb -all

not tested with other options

I'm using mongodb
[jira] [Created] (NUTCH-2222) updatedb deletes all metadata except _csh_ and _rs_
Adnane B. created NUTCH-:
-------------------------

             Summary: updatedb deletes all metadata except _csh_ and _rs_
                 Key: NUTCH-
                 URL: https://issues.apache.org/jira/browse/NUTCH-
             Project: Nutch
          Issue Type: Bug
          Components: crawldb
    Affects Versions: 2.3.1
         Environment: Centos 6, mongodb 2.6 and mongodb 3.0
            Reporter: Adnane B.

This problem happens at the second update on a crawled page with -all option

bin/nutch updatedb -all

not tested with other options

I'm using mongodb
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148758#comment-15148758 ]

Hudson commented on NUTCH-961:
------------------------------

SUCCESS: Integrated in Nutch-trunk #3347 (See [https://builds.apache.org/job/Nutch-trunk/3347/])
NUTCH-961 Expose Tika's Boilerpipe support (markus: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1730694])
* trunk/CHANGES.txt
* trunk/conf/nutch-default.xml
* trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java
* trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: BoilerpipeExtractorRepository.java,
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch,
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch,
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch,
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch,
> NUTCH-961.patch, NUTCH-961.patch, NUTCH-961v2.patch,
> nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to
> extract boilerplate content from HTML pages. We should see how we can expose
> Boilerplate in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> {code}
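The two properties named in the issue default to no extraction (`tika.extractor` is `none`), so enabling Boilerpipe means overriding them locally. The following is a minimal sketch of generating such an override. It assumes the usual Hadoop-style `<property>`/`<name>`/`<value>` layout and that overrides are placed in `conf/nutch-site.xml`; the function name `boilerpipe_overrides` is invented for illustration. The property names and allowed values are taken from the description above.

```python
import xml.etree.ElementTree as ET

# Algorithm names listed as valid in the issue description.
ALLOWED_ALGORITHMS = {"DefaultExtractor", "ArticleExtractor", "CanolaExtractor"}


def boilerpipe_overrides(algorithm="ArticleExtractor"):
    """Build a <configuration> element that switches Boilerpipe on."""
    if algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unknown Boilerpipe algorithm: {algorithm}")
    configuration = ET.Element("configuration")
    for name, value in (
        ("tika.extractor", "boilerpipe"),
        ("tika.extractor.boilerpipe.algorithm", algorithm),
    ):
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return configuration


if __name__ == "__main__":
    # Print the fragment; merging it into an existing site file is left to the reader.
    print(ET.tostring(boilerpipe_overrides(), encoding="unicode"))
```

Rejecting unknown algorithm names up front is a choice made for this sketch; how Nutch itself reacts to an unrecognised value is not stated in the issue.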
[jira] [Resolved] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-961.
---------------------------------
    Resolution: Fixed

Committed to trunk in revision 1730694. Thanks everyone for contributions.
[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------
    Attachment: NUTCH-961.patch

Updated patch. ExtractorRepository was missing.
[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------
    Fix Version/s: 1.12
[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------
    Affects Version/s: 1.11
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148642#comment-15148642 ]

Markus Jelsma commented on NUTCH-961:
-------------------------------------

Tests pass as expected and Boilerpipe as well. Will commit shortly.
[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------
    Description:
Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerplate in the Nutch configuration.

Use the following properties to enable and control Boilerpipe.

{code}
<property>
  <name>tika.extractor</name>
  <value>none</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
  or CanolaExtractor.
  </description>
</property>
{code}

  was:
Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerplate in the Nutch configuration.

Use the following properties to enable and control Boilerpipe.

<property>
  <name>tika.extractor</name>
  <value>none</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
  or CanolaExtractor.
  </description>
</property>
[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------
    Description:
Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerplate in the Nutch configuration.

Use the following properties to enable and control Boilerpipe.

<property>
  <name>tika.extractor</name>
  <value>none</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
  or CanolaExtractor.
  </description>
</property>

  was: Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerplate in the Nutch configuration.
[jira] [Commented] (NUTCH-2210) Upgrade to Tika 1.12
[ https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148631#comment-15148631 ]

Hudson commented on NUTCH-2210:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3346 (See [https://builds.apache.org/job/Nutch-trunk/3346/])
NUTCH-2210 Upgrade to Tika 1.12 (markus: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1730686])
* trunk/CHANGES.txt
* trunk/ivy/ivy.xml
* trunk/src/plugin/parse-tika/ivy.xml
* trunk/src/plugin/parse-tika/plugin.xml

> Upgrade to Tika 1.12
> --------------------
>
>                 Key: NUTCH-2210
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2210
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-2210.patch
>
>
> Upgrade to Tika 1.12 when it is released. Keep in mind, module="rome" rev="0.9"/> in ivy.xml must be removed.
[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------
    Attachment: NUTCH-961.patch

Patch for trunk.
[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148632#comment-15148632 ]

Hudson commented on NUTCH-1233:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3346 (See [https://builds.apache.org/job/Nutch-trunk/3346/])
NUTCH-1233 Rely on Tika for outlink extraction (markus: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1730687])
* trunk/CHANGES.txt
* trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMBuilder.java
* trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java

> Rely on Tika for outlink extraction
> -----------------------------------
>
>                 Key: NUTCH-1233
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1233
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch,
> NUTCH-1233-1.6-2.patch, NUTCH-1233.patch, NUTCH-1233.patch, post-1233-2.txt,
> post-1233.txt, pre-1233-2.txt, pre-1233.txt
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be
> able to use it in Nutch we need Tika to return the rel attr value of each
> link, which it currently doesn't. There's a patch for Tika 1.1. If that patch
> is included in Tika and we upgraded to that new version this issue can be
> worked on. Here's preliminary code that does both Tika and current outlink
> extraction. This also includes parts of the Boilerpipe code.
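The issue hinges on each extracted link carrying its `rel` value, so that a crawler can decide whether to follow it. The sketch below illustrates that idea with Python's standard `html.parser`. It is not Tika's or Nutch's implementation, and the names `OutlinkCollector` and `followable_outlinks` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkCollector(HTMLParser):
    """Collect (absolute URL, rel value) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href:
            # Keep the rel value next to the link, which is what the issue asks Tika to do.
            rel = attributes.get("rel") or ""
            self.links.append((urljoin(self.base_url, href), rel))


def followable_outlinks(html, base_url):
    """Return outlinks whose rel value does not contain the nofollow token."""
    collector = OutlinkCollector(base_url)
    collector.feed(html)
    return [
        url
        for url, rel in collector.links
        if "nofollow" not in rel.lower().split()
    ]
```

Without the `rel` value the filter in `followable_outlinks` could not be written at all, which is presumably why the issue waited on the Tika change.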
[jira] [Resolved] (NUTCH-1233) Rely on Tika for outlink extraction
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1233.
----------------------------------
    Resolution: Fixed

Committed to trunk in revision 1730687.
[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1233:
---------------------------------
    Affects Version/s: 1.11
[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1233:
---------------------------------
    Fix Version/s: 1.12
[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1233:
---------------------------------
    Component/s: parser
[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148590#comment-15148590 ]

Markus Jelsma commented on NUTCH-1233:
--------------------------------------

Awesome! Everything works as expected since the Tika 1.12 upgrade. The number of outlinks is as expected with and without LinkContentHandler. Tests pass. Will commit shortly.

> Rely on Tika for outlink extraction
> -----------------------------------
>
>                 Key: NUTCH-1233
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1233
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch,
>                      NUTCH-1233-1.6-2.patch, NUTCH-1233.patch, NUTCH-1233.patch,
>                      post-1233-2.txt, post-1233.txt, pre-1233-2.txt, pre-1233.txt
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be
> able to use it in Nutch we need Tika to return the rel attr value of each
> link, which it currently doesn't. There's a patch for Tika 1.1. If that patch
> is included in Tika and we upgrade to that new version, this issue can be
> worked on. Here's preliminary code that does both Tika and current outlink
> extraction. This also includes parts of the Boilerpipe code.
[jira] [Resolved] (NUTCH-2210) Upgrade to Tika 1.12
[ https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-2210.
----------------------------------
    Resolution: Fixed

Committed to trunk in revision 1730686.

> Upgrade to Tika 1.12
> --------------------
>
>                 Key: NUTCH-2210
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2210
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-2210.patch
>
>
> Upgrade to Tika 1.12 when it is released. Keep in mind, module="rome"
> rev="0.9"/> in ivy.xml must be removed.
[jira] [Commented] (NUTCH-2210) Upgrade to Tika 1.12
[ https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148572#comment-15148572 ]

Markus Jelsma commented on NUTCH-2210:
--------------------------------------

Test passes, will commit shortly.

> Upgrade to Tika 1.12
> --------------------
>
>                 Key: NUTCH-2210
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2210
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-2210.patch
>
>
> Upgrade to Tika 1.12 when it is released. Keep in mind, module="rome"
> rev="0.9"/> in ivy.xml must be removed.
[jira] [Updated] (NUTCH-2210) Upgrade to Tika 1.12
[ https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2210:
---------------------------------
    Attachment: NUTCH-2210.patch

Patch for trunk.

> Upgrade to Tika 1.12
> --------------------
>
>                 Key: NUTCH-2210
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2210
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-2210.patch
>
>
> Upgrade to Tika 1.12 when it is released. Keep in mind, module="rome"
> rev="0.9"/> in ivy.xml must be removed.
[jira] [Commented] (NUTCH-2197) Add solr5 solrcloud indexer support
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148489#comment-15148489 ]

Markus Jelsma commented on NUTCH-2197:
--------------------------------------

Hello Arun - no, this is not applied to 2.3.1. The plugins are similar, though, so with some effort you could patch 2.x.

> Add solr5 solrcloud indexer support
> -----------------------------------
>
>                 Key: NUTCH-2197
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2197
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.11
>            Reporter: Jurian Broertjes
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.12
>
>         Attachments: NUTCH-2197.patch, NUTCH-2197.patch, NUTCH-2197.patch
>
>
> Nutch cannot index to Solr5. Also, proper SolrCloud support is missing.