[jira] [Commented] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter
[ https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509167#comment-14509167 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-1985:
-------------------------------------------------------

Should we commit this for the 1.10 release, or wait for 1.11?

Adding a main() method to the MimeTypeIndexingFilter
----------------------------------------------------

Key: NUTCH-1985
URL: https://issues.apache.org/jira/browse/NUTCH-1985
Project: Nutch
Issue Type: Improvement
Components: indexer, metadata, plugin
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
Labels: features, patch, test
Fix For: 1.10
Attachments: NUTCH-1985.patch

This makes it very easy to test different rules files and to check the expressions used to filter content based on the detected MIME type. Until now the only way to check this was to do test crawls and inspect the data stored in Solr/Elasticsearch.

This allows calling the class using the {{bin/nutch plugin}} command, something like:

{{bin/nutch plugin mimetype-filter org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}

Two options are accepted: {{-h, --help}} for showing the help, and {{-rules}} for specifying the rules file to use. This makes it easy to try different rules files until you get the desired behavior.

After invoking the class, enter one valid MIME type per line. The output is the same MIME type prefixed with a {{+}} or {{-}} sign, indicating whether the given MIME type is allowed or denied, respectively.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
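The input/output contract described in NUTCH-1985 (one MIME type in, the same MIME type out with a leading + or -) can be illustrated with a small Java sketch. This is not the plugin's code: the class MimeTypeCheckSketch, the check method, and the prefix-based allow list are assumptions made purely for illustration, and the real rules file syntax is not shown in this thread.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: NOT the MimeTypeIndexingFilter implementation.
final class MimeTypeCheckSketch {

    // Assumed rule model: a MIME type is allowed when it starts with any listed prefix.
    static String check(List<String> allowedPrefixes, String mimeType) {
        boolean allowed = false;
        for (String prefix : allowedPrefixes) {
            if (mimeType.startsWith(prefix)) {
                allowed = true;
                break;
            }
        }
        // Echo the MIME type back with '+' (allowed) or '-' (denied) in front.
        return (allowed ? "+" : "-") + mimeType;
    }

    public static void main(String[] args) {
        // Hypothetical rules; the actual rules file format may differ.
        List<String> rules = Arrays.asList("text/", "application/pdf");
        for (String mimeType : new String[] {"text/html", "image/png"}) {
            System.out.println(check(rules, mimeType));
        }
    }
}
```

The point of the sketch is only the shape of the feedback loop: change the rules, feed in a few MIME types, and read the sign, without running a test crawl.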
Re: [PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 04232015
s/1.8/1.10/ right? If so +1!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Thursday, April 23, 2015 at 2:14 PM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: [PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 04232015

Hi Folks,
Does anyone have an issue with the above proposal?
Thanks
Lewis

--
Lewis
[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8
[ https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509501#comment-14509501 ]

Lewis John McGibbney commented on NUTCH-1994:
---------------------------------------------

Would like to commit by EoB today if no other issues. Thanks [~tpalsulich]

Upgrade to Apache Tika 1.8
--------------------------

Key: NUTCH-1994
URL: https://issues.apache.org/jira/browse/NUTCH-1994
Project: Nutch
Issue Type: Improvement
Components: build, parser
Affects Versions: 1.10, 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 1.10, 2.3.1
Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch

Tika 1.8 was released this morning. Lets upgrade then release Nutch trunk.
[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8
[ https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509514#comment-14509514 ]

Tyler Palsulich commented on NUTCH-1994:
----------------------------------------

Happy to help, [~lewismc]!
[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8
[ https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509393#comment-14509393 ]

Lewis John McGibbney commented on NUTCH-1994:
---------------------------------------------

Anyone to review? I can roll a release (or assist anyone else if they would like to learn/help) once we make this upgrade.
[PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 04232015
Hi Folks, Does anyone have an issue with the above proposal? Thanks Lewis -- *Lewis*
[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8
[ https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509522#comment-14509522 ]

Lewis John McGibbney commented on NUTCH-1994:
---------------------------------------------

Dynamite [~tpalsulich] I'll get you on IRC tomorrow.
[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8
[ https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509678#comment-14509678 ]

Sebastian Nagel commented on NUTCH-1994:
----------------------------------------

+1
Unsubscribe
Hi, I want to unsubscribe the email list. Best, Mengxian
Unsubscribe
Hi, I want to unsubscribe the email list. Best, Zhaohui -- Zhaohui Zhang Dept. of Chemical Engineering, University of Southern California Addr: 2611 Portland Street, Los Angeles, CA, USA 90007 Mobile:(+1)213-880-8321 Email: zhaoh...@usc.edu; happy...@gmail.com; zhaohuizhang2...@gmail.com;
[jira] [Resolved] (NUTCH-1994) Upgrade to Apache Tika 1.8
[ https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1994.
-----------------------------------------
Resolution: Fixed

Committed revision 1675723 in trunk
Committed revision 1675724 in 2.X
[jira] [Commented] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter
[ https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509873#comment-14509873 ]

Lewis John McGibbney commented on NUTCH-1985:
---------------------------------------------

[~jorgelbg] +1 please commit against trunk :)
[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509898#comment-14509898 ]

Julien Nioche commented on NUTCH-2000:
--------------------------------------

[~lewismc] reverted to 1.10 as this is a blocker. Will investigate it further as soon as I find the time to do so but in the meantime if someone could try and reproduce it that would be great.

Link inversion fails with .locked already exists.
-------------------------------------------------

Key: NUTCH-2000
URL: https://issues.apache.org/jira/browse/NUTCH-2000
Project: Nutch
Issue Type: Bug
Affects Versions: 1.9
Reporter: Julien Nioche
Priority: Blocker
Fix For: 1.10

using standard crawl script with a brand new test dir in local mode I am getting

Link inversion
/data/BLABLABLA/runtime/local/bin/nutch invertlinks /data/BLABLABLA/testCrawl2//linkdb /data/BLABLABLA/testCrawl2//segments/20150423114335
LinkDb: java.io.IOException: lock file /data/BLABLABLA/testCrawl2/linkdb/.locked already exists.

PS: 2000!
Build failed in Jenkins: Nutch-trunk #3083
See https://builds.apache.org/job/Nutch-trunk/3083/changes

Changes:

[lewismc] NUTCH-1994 Upgrade to Apache Tika 1.8

--
[...truncated 5538 lines...]
[echo] Testing plugin: urlfilter-validator
[junit] WARNING: multiple versions of ant detected in path for junit
[junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.tika.TestRTFParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.017 sec
[junit] Running org.apache.nutch.urlfilter.validator.TestUrlValidator
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.025 sec

init:
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml
compile:
[echo] Compiling plugin: urlnormalizer-ajax
deps-test-compile:
compile-test:
[javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-ajax/test
jar:
deps-test:
deploy:
copy-generated-lib:
test:
[echo] Testing plugin: urlnormalizer-ajax
[junit] Running org.apache.nutch.tika.TestRobotsMetaProcessor
[junit] WARNING: multiple versions of ant detected in path for junit
[junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 sec

init:
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml
compile:
[echo] Compiling plugin: urlnormalizer-basic
deps-test-compile:
compile-test:
[javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test
jar:
deps-test:
deploy:
copy-generated-lib:
test:
[echo] Testing plugin: urlnormalizer-basic
[junit] WARNING: multiple versions of ant detected in path for junit
[junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.938 sec

init:
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml
compile:
[echo] Compiling plugin: urlnormalizer-host
deps-test-compile:
compile-test:
[javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test
[junit] Running org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer
jar:
deps-test:
deploy:
copy-generated-lib:
test:
[echo] Testing plugin: urlnormalizer-host
[junit] WARNING: multiple versions of ant detected in path for junit
[junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.189 sec

init:
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml
compile:
[echo] Compiling plugin: urlnormalizer-pass
deps-test-compile:
compile-test:
[javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test
jar:
deps-test:
deploy:
copy-generated-lib:
test:
[echo] Testing plugin: urlnormalizer-pass
[junit] WARNING: multiple versions of ant detected in path for junit
[junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer
[junit] Running org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.407 sec

init:
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading
[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8
[ https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509899#comment-14509899 ]

Hudson commented on NUTCH-1994:
-------------------------------

FAILURE: Integrated in Nutch-trunk #3083 (See [https://builds.apache.org/job/Nutch-trunk/3083/])
NUTCH-1994 Upgrade to Apache Tika 1.8 (lewismc: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1675723)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/plugin/parse-tika/ivy.xml
* /nutch/trunk/src/plugin/parse-tika/plugin.xml
[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509916#comment-14509916 ]

Lewis John McGibbney commented on NUTCH-2000:
---------------------------------------------

ACK
Re: Unsubscribe
Email dev-unsubscr...@nutch.apache.org

You unsub the same way you subbed. It's just a different email.

--
Jimmy

On Thu, Apr 23, 2015 at 1:23 PM, Zhaohui Zhang happy...@gmail.com wrote:

Hi, I want to unsubscribe the email list.
Best,
Zhaohui

--
Zhaohui Zhang
Dept. of Chemical Engineering, University of Southern California
Addr: 2611 Portland Street, Los Angeles, CA, USA 90007
Mobile:(+1)213-880-8321
Email: zhaoh...@usc.edu; happy...@gmail.com; zhaohuizhang2...@gmail.com;
Unsubscribe
Hi, I want to unsubscribe the email list. Best, Zhaohui -- Zhaohui Zhang PhD Student at University of Southern California Mobile: (213)-880-8321 Email: zhaoh...@usc.edu yuan...@usc.edu
[jira] [Created] (NUTCH-2001) SubCollection Field Name incorrect in nutch-default.xml
Jeff Cocking created NUTCH-2001:
--------------------------------

Summary: SubCollection Field Name incorrect in nutch-default.xml
Key: NUTCH-2001
URL: https://issues.apache.org/jira/browse/NUTCH-2001
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.9, 1.8
Reporter: Jeff Cocking
Priority: Minor
Fix For: 1.10

SubcollectionIndexingFilter.java is looking for the following variable in nutch-default.xml (at line 56):

    fieldName = conf.get("subcollection.default.fieldname", "subcollection");

nutch-default.xml lists the following:

    <property>
      <name>subcollection.default.field</name>
      <value>subcollection</value>
      <description>
      The default field name for the subcollections.
      </description>
    </property>

The field name for nutch-default.xml should be changed from subcollection.default.field to subcollection.default.fieldname.
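The practical effect of the mismatch reported in NUTCH-2001 can be shown with a short Java sketch. It assumes java.util.Properties as a stand-in for the Hadoop Configuration object that Nutch uses, and the class name ConfigKeyMismatchSketch and the override value "my_field" are hypothetical. The idea is that a value set under the documented key is never read, so the lookup silently falls back to its default.

```java
import java.util.Properties;

// Illustrative only: java.util.Properties stands in for the Hadoop Configuration used by Nutch.
final class ConfigKeyMismatchSketch {

    // Same lookup shape as the line quoted from SubcollectionIndexingFilter.java.
    static String fieldName(Properties conf) {
        return conf.getProperty("subcollection.default.fieldname", "subcollection");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();

        // Key as documented in nutch-default.xml; "my_field" is a hypothetical override.
        conf.setProperty("subcollection.default.field", "my_field");
        System.out.println(fieldName(conf)); // override is ignored, default is returned

        // Key the code actually reads.
        conf.setProperty("subcollection.default.fieldname", "my_field");
        System.out.println(fieldName(conf)); // override now takes effect
    }
}
```

Because the default value happens to equal the value shipped in nutch-default.xml, the problem would only surface when someone tries to override the field name using the documented key.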
[jira] [Updated] (NUTCH-2001) SubCollection Field Name incorrect in nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Cocking updated NUTCH-2001:
-------------------------------
Attachment: NUTCH-2001-1.x.patch

SubCollection Field Name incorrect in nutch-default.xml
-------------------------------------------------------

Key: NUTCH-2001
URL: https://issues.apache.org/jira/browse/NUTCH-2001
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.8, 1.9
Reporter: Jeff Cocking
Priority: Minor
Fix For: 1.10
Attachments: NUTCH-2001-1.x.patch
Original Estimate: 10m
Remaining Estimate: 10m

SubcollectionIndexingFilter.java is looking for the following variable in nutch-default.xml (at line 56):

    fieldName = conf.get("subcollection.default.fieldname", "subcollection");

nutch-default.xml lists the following:

    <property>
      <name>subcollection.default.field</name>
      <value>subcollection</value>
      <description>
      The default field name for the subcollections.
      </description>
    </property>

The field name for nutch-default.xml should be changed from subcollection.default.field to subcollection.default.fieldname.
[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8
[ https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509995#comment-14509995 ]

Hudson commented on NUTCH-1994:
-------------------------------

SUCCESS: Integrated in Nutch-nutchgora #1412 (See [https://builds.apache.org/job/Nutch-nutchgora/1412/])
NUTCH-1994 Upgrade to Apache Tika 1.8 (lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1675724)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/ivy/ivy.xml
* /nutch/branches/2.x/src/plugin/parse-tika/howto_upgrade_tika.txt
* /nutch/branches/2.x/src/plugin/parse-tika/ivy.xml
* /nutch/branches/2.x/src/plugin/parse-tika/plugin.xml
[jira] [Updated] (NUTCH-1958) Remove scoring-opic from nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1958:
---------------------------------------
Fix Version/s: (was: 1.10) 1.11

Remove scoring-opic from nutch-default.xml
------------------------------------------

Key: NUTCH-1958
URL: https://issues.apache.org/jira/browse/NUTCH-1958
Project: Nutch
Issue Type: Improvement
Affects Versions: 2.3, 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 2.4, 1.11

I propose we remove scoring-opic from nutch-default. We all know it is flawed for any kind of incremental crawl, which most of us do. It is also useless for a single crawl: if you must crawl all records of a domain, using OPIC to prioritize URLs makes no sense. It also confuses users, as we have seen in the past and recently [1].

What do you think?

[1]: http://lucene.472066.n3.nabble.com/Nutch-documents-have-huge-scores-in-Solr-td4192064.html
[jira] [Updated] (NUTCH-2000) Link inversion fails with .locked already exists.
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-2000:
---------------------------------------
Fix Version/s: (was: 1.10) 1.11

Link inversion fails with .locked already exists.
-------------------------------------------------

Key: NUTCH-2000
URL: https://issues.apache.org/jira/browse/NUTCH-2000
Project: Nutch
Issue Type: Bug
Affects Versions: 1.9
Reporter: Julien Nioche
Fix For: 1.11

using standard crawl script with a brand new test dir in local mode I am getting

Link inversion
/data/BLABLABLA/runtime/local/bin/nutch invertlinks /data/BLABLABLA/testCrawl2//linkdb /data/BLABLABLA/testCrawl2//segments/20150423114335
LinkDb: java.io.IOException: lock file /data/BLABLABLA/testCrawl2/linkdb/.locked already exists.

PS: 2000!
[jira] [Updated] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java
[ https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1947:
---------------------------------------
Fix Version/s: (was: 1.10) 1.11

Overhaul o.a.n.parse.OutlinkExtractor.java
------------------------------------------

Key: NUTCH-1947
URL: https://issues.apache.org/jira/browse/NUTCH-1947
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 2.3, 1.9
Reporter: Lewis John McGibbney
Fix For: 2.4, 1.11

Right now in both trunk and 2.X, the [OutlinkExtractor.java|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java] class needs a bit of TLC. It references JDK 1.5 in a few places, there are misleading URL entries, and it has some deprecated methods which we could ideally remove.
[jira] [Commented] (NUTCH-1963) CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
[ https://issues.apache.org/jira/browse/NUTCH-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509876#comment-14509876 ]

Lewis John McGibbney commented on NUTCH-1963:
---------------------------------------------

[~gostep] is this issue addressed in NUTCH-1959?

CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
---------------------------------------------------------------------------

Key: NUTCH-1963
URL: https://issues.apache.org/jira/browse/NUTCH-1963
Project: Nutch
Issue Type: Bug
Components: commoncrawl
Affects Versions: 1.10
Reporter: Lewis John McGibbney
Fix For: 1.10

When invoking the commoncrawldump tool with the *-gzip* option and *-mimetype application/pdf*, I get the following stack trace, which results in a failure of the task:

{code}
java.lang.RuntimeException: file name 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf' is too long ( > 100 bytes)
	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674)
	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275)
	at org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400)
	at org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236)
{code}

The workaround is to skip the *-gzip* option and defer compression to a later task. That is only a workaround, not a solution. We need to fix this in order for the tool to work as designed and required.
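To see why the file name in the NUTCH-1963 stack trace is rejected, the following Java sketch measures an entry name against the 100-byte name field of a classic tar header. It is a stand-alone pre-check using only the JDK, not the Commons Compress or Nutch code; the class TarNameLengthSketch and its methods are hypothetical, and the exact boundary handling inside the library may differ. Separately, Commons Compress's TarArchiveOutputStream provides setLongFileMode with modes such as LONGFILE_GNU and LONGFILE_POSIX, which may be one direction for a fix, though this thread does not settle that.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a pre-check, not the Commons Compress or Nutch implementation.
final class TarNameLengthSketch {

    // Size in bytes of the name field in a classic tar header.
    static final int TAR_NAME_LIMIT = 100;

    // Length of the entry name once encoded as UTF-8 bytes.
    static int encodedLength(String entryName) {
        return entryName.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the encoded name is longer than the classic tar name field.
    static boolean exceedsLimit(String entryName) {
        return encodedLength(entryName) > TAR_NAME_LIMIT;
    }

    public static void main(String[] args) {
        // File name taken from the stack trace in the issue description.
        String name = "Socio-Economic%20Impact%20of%20Ebola%20on%20Households"
                + "%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf";
        System.out.println(encodedLength(name) + " bytes, exceeds limit: " + exceedsLimit(name));
    }
}
```

Note that the percent-encoded spaces each cost three bytes, which is part of why URL-derived names hit the limit quickly.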
[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509920#comment-14509920 ]

Lewis John McGibbney commented on NUTCH-2000:
---------------------------------------------

Julien... I wonder if the 2nd URI path is OK?

/data/BLABLABLA/testCrawl2//segments/20150423114335

Note the '//'

YES :) :) 2000th
[jira] [Commented] (NUTCH-2001) SubCollection Field Name incorrect in nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509997#comment-14509997 ]

Jeff Cocking commented on NUTCH-2001:
-------------------------------------

Attached is a patch I created from a clean download of Nutch Trunk.

SubCollection Field Name incorrect in nutch-default.xml
-------------------------------------------------------

Key: NUTCH-2001
URL: https://issues.apache.org/jira/browse/NUTCH-2001
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.8, 1.9
Reporter: Jeff Cocking
Priority: Minor
Fix For: 1.10
Attachments: NUTCH-2001-1.x.patch
Original Estimate: 10m
Remaining Estimate: 10m

SubcollectionIndexingFilter.java is looking for the following variable in nutch-default.xml (at line 56):

    fieldName = conf.get("subcollection.default.fieldname", "subcollection");

nutch-default.xml lists the following:

    <property>
      <name>subcollection.default.field</name>
      <value>subcollection</value>
      <description>
      The default field name for the subcollections.
      </description>
    </property>

The field name for nutch-default.xml should be changed from subcollection.default.field to subcollection.default.fieldname.
[jira] [Commented] (NUTCH-1969) URL Normalizer properly handling slashes
[ https://issues.apache.org/jira/browse/NUTCH-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509880#comment-14509880 ]

Lewis John McGibbney commented on NUTCH-1969:
---------------------------------------------

+1 for commit [~markus.jel...@openindex.io]

URL Normalizer properly handling slashes
----------------------------------------

Key: NUTCH-1969
URL: https://issues.apache.org/jira/browse/NUTCH-1969
Project: Nutch
Issue Type: New Feature
Components: plugin
Affects Versions: 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.10
Attachments: NUTCH-1969.patch

This is a URL normalizer we use that is simple to use and generate for dealing with hosts that mix up slash suffixed URL's with non-slash suffixed URL's. It is similar to the host normalizer, reducing the number of duplicates while crawling.

It takes newline-delimited rules. Each rule is a host, separated by either a tab or whitespace from a + (PLUS) or - (MINUS) sign denoting whether or not a slash is to be added to the path. The normalizer ignores pages that look like files with extensions, see tests.

Note: the normalizer must be enhanced to take host/path prefixes rather than hosts as the first argument of a rule, because some hosts need different rules depending on the root path. For example,
* example.org/cms/news/1/2/3/4 is a CMS that doesn't accept slashes; if they are suffixed, the user is redirected to a non-slash page;
* example.org/files/a/b/ wants to do it just the other way around.
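The rule format and behavior described in NUTCH-1969 can be sketched in Java as follows. This is a guess at the shape of the logic based only on the description above, not the attached patch: the class SlashRuleSketch, its methods, the host names, and the choice to leave URLs with a query or fragment untouched are all assumptions for illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: NOT the normalizer from the NUTCH-1969 patch.
final class SlashRuleSketch {

    // Parses lines of the form "<host><tab or spaces><+|->"; malformed lines are skipped.
    static Map<String, Character> parseRules(Iterable<String> lines) {
        Map<String, Character> rules = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && (parts[1].equals("+") || parts[1].equals("-"))) {
                rules.put(parts[0], parts[1].charAt(0));
            }
        }
        return rules;
    }

    // '+' ensures a trailing slash on the path, '-' strips it; other URLs pass through.
    static String normalize(Map<String, Character> rules, String url) {
        URI uri = URI.create(url);
        Character rule = rules.get(uri.getHost());
        String path = uri.getPath();
        if (rule == null || path == null || path.isEmpty() || path.equals("/")) {
            return url; // no rule for this host, or nothing to normalize
        }
        if (uri.getQuery() != null || uri.getFragment() != null) {
            return url; // assumption: keep the sketch simple and skip these
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        if (lastSegment.contains(".")) {
            return url; // looks like a file with an extension
        }
        if (rule == '+' && !path.endsWith("/")) {
            return url + "/";
        }
        if (rule == '-' && path.endsWith("/")) {
            return url.substring(0, url.length() - 1);
        }
        return url;
    }

    public static void main(String[] args) {
        // Hypothetical hosts and rules.
        Map<String, Character> rules =
                parseRules(Arrays.asList("cms.example.org\t-", "files.example.org +"));
        System.out.println(normalize(rules, "http://cms.example.org/news/1/"));
        System.out.println(normalize(rules, "http://files.example.org/a/b"));
        System.out.println(normalize(rules, "http://files.example.org/a/report.pdf"));
    }
}
```

Keying the rules by host alone reproduces the limitation noted in the issue: a single host such as example.org cannot carry one rule for /cms and the opposite rule for /files without host/path prefixes.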
[jira] [Updated] (NUTCH-2000) Link inversion fails with .locked already exists.
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-2000:
--------------------------------
Priority: Blocker (was: Major)
[jira] [Updated] (NUTCH-2000) Link inversion fails with .locked already exists.
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-2000:
--------------------------------
Fix Version/s: (was: 1.11) 1.10
[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8
[ https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509492#comment-14509492 ]

Tyler Palsulich commented on NUTCH-1994:
----------------------------------------

Applied and tested both patches, both look good to me!
Build failed in Jenkins: Nutch-trunk #3087
See https://builds.apache.org/job/Nutch-trunk/3087/ -- [...truncated 5611 lines...] test: [echo] Testing plugin: urlfilter-validator [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.tika.TestRTFParser [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.024 sec [junit] Running org.apache.nutch.urlfilter.validator.TestUrlValidator [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-ajax deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-ajax/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-ajax [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.tika.TestRobotsMetaProcessor [junit] Running org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.013 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-basic deps-test-compile: compile-test: [javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-basic [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.193 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-host deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-host [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.419 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-pass deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing 
plugin: urlnormalizer-pass [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.315 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin:
[jira] [Resolved] (NUTCH-1963) CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
[ https://issues.apache.org/jira/browse/NUTCH-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1963. - Resolution: Fixed Assignee: Giuseppe Totaro Addressed within NUTCH-1959 Thank you [~gostep] CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked --- Key: NUTCH-1963 URL: https://issues.apache.org/jira/browse/NUTCH-1963 Project: Nutch Issue Type: Bug Components: commoncrawl Affects Versions: 1.10 Reporter: Lewis John McGibbney Assignee: Giuseppe Totaro Fix For: 1.10 When invoking the commoncrawldump tool with the *-gzip* option and *-mimetype application/pdf* I get the following stack trace, which results in a failure of the task {code} java.lang.RuntimeException: file name 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf' is too long ( > 100 bytes) at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674) at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275) at org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400) at org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236) {code} The workaround consists of not using the *-gzip* option and delaying compression until a later task; however, this is a workaround and not a solution. We need to fix this in order for the tool to work as designed and required. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1973) Job Administration end point for the REST service
[ https://issues.apache.org/jira/browse/NUTCH-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510097#comment-14510097 ] Lewis John McGibbney commented on NUTCH-1973: - This commit accidentally removed the NUTCH-1927 property from nutch-default.xml. The commit at revision 1675735 adds it back in. Excellent catch [~gostep] Job Administration end point for the REST service - Key: NUTCH-1973 URL: https://issues.apache.org/jira/browse/NUTCH-1973 Project: Nutch Issue Type: Sub-task Reporter: Sujen Shah Assignee: Chris A. Mattmann Fix For: 1.10 Attachments: NUTCH-1973.patch This sub task deals with implementing the functionality documented at https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510167#comment-14510167 ] Hudson commented on NUTCH-1927: --- FAILURE: Integrated in Nutch-trunk #3084 (See [https://builds.apache.org/job/Nutch-trunk/3084/]) Add back in NUTCH-1927 property to nutch-default as revoved during commit @1675022 (lewismc: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1675735) * /nutch/trunk/conf/nutch-default.xml Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing --- Key: NUTCH-1927 URL: https://issues.apache.org/jira/browse/NUTCH-1927 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Labels: available, patch Fix For: 1.10 Attachments: NUTCH-1927.2015-04-16.patch, NUTCH-1927.2015-04-17.patch, NUTCH-1927.Mattmann.041115.patch.txt, NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt, test_NUTCH-1927.2015-04-17.txt Based on discussion on the dev list, to use Nutch for some security research valid use cases (DDoS; DNS and other testing), I am going to create a patch that allows a whitelist: {code:xml} <property> <name>robot.rules.whitelist</name> <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value> <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.</description> </property> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
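The whitelist value described in NUTCH-1927 is a plain comma-separated string, so the lookup it implies can be sketched in a few lines. The following is only an illustration under stated assumptions: the class name {{RobotRulesWhitelistSketch}} and its methods are hypothetical and are not taken from the attached patches, and matching is assumed to be an exact, case-insensitive comparison of host names or IP addresses.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper (not part of Nutch): holds the entries of a
// comma-separated whitelist value such as the one in the property above.
final class RobotRulesWhitelistSketch {
    private final Set<String> entries;

    RobotRulesWhitelistSketch(String commaSeparated) {
        // Split on commas, trim whitespace, drop empty entries and
        // lower-case everything so that lookups are case-insensitive.
        this.entries = Arrays.stream(commaSeparated.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(s -> s.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
    }

    // Assumption: an entry matches only if it equals the host or IP exactly.
    boolean isWhitelisted(String hostOrIp) {
        return hostOrIp != null
                && entries.contains(hostOrIp.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        RobotRulesWhitelistSketch whitelist = new RobotRulesWhitelistSketch(
                "132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov");
        System.out.println(whitelist.isWhitelisted("hostname.apache.org"));
        System.out.println(whitelist.isWhitelisted("example.org"));
    }
}
```

A fetcher would presumably consult such a check before deciding whether to parse robot rules for a host; how the real patch wires this in is not shown in this thread.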
Build failed in Jenkins: Nutch-trunk #3084
See https://builds.apache.org/job/Nutch-trunk/3084/changes Changes: [lewismc] Add back in NUTCH-1927 property to nutch-default as revoved during commit @1675022 -- [...truncated 5373 lines...] [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.945 sec [junit] Running org.apache.nutch.urlfilter.validator.TestUrlValidator [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.029 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-ajax deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-ajax/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-ajax [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.tika.TestRTFParser [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.017 sec [junit] Running org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: 
urlnormalizer-basic deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-basic [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.tika.TestRobotsMetaProcessor [junit] Running org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.196 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-host deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-host [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.998 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-pass deps-test-compile: compile-test: [javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-pass [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.423 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading
[jira] [Commented] (NUTCH-1963) CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
[ https://issues.apache.org/jira/browse/NUTCH-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510062#comment-14510062 ] Giuseppe Totaro commented on NUTCH-1963: Hi [~lewismc]. Yes, [NUTCH-1959|https://issues.apache.org/jira/browse/NUTCH-1959] includes support for long file names: {noformat} tarOutput.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU); {noformat} Thanks, Giuseppe CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked --- Key: NUTCH-1963 URL: https://issues.apache.org/jira/browse/NUTCH-1963 Project: Nutch Issue Type: Bug Components: commoncrawl Affects Versions: 1.10 Reporter: Lewis John McGibbney Fix For: 1.10 When invoking the commoncrawldump tool with the *-gzip* option and *-mimetype application/pdf* I get the following stack trace, which results in a failure of the task {code} java.lang.RuntimeException: file name 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf' is too long ( > 100 bytes) at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674) at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275) at org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400) at org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236) {code} The workaround consists of not using the *-gzip* option and delaying compression until a later task; however, this is a workaround and not a solution. We need to fix this in order for the tool to work as designed and required. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
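The 100-byte limit in the stack trace comes from the fixed-size name field of the classic tar header, which is why switching the output stream to {{LONGFILE_GNU}} resolves it. As a rough, JDK-only sketch of which entry names are affected: the class and method names below are hypothetical, UTF-8 encoding is an assumption, and the exact boundary condition used by Commons Compress may differ from this simplified check.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Nutch or Commons Compress): flags tar
// entry names that do not fit the classic 100-byte header name field.
final class TarNameLengthSketch {
    // Size of the name field in a classic tar header, in bytes.
    static final int TAR_NAME_FIELD_BYTES = 100;

    // Simplified assumption: a name needs a long-name extension when its
    // UTF-8 encoding is longer than the header field.
    static boolean needsLongNameSupport(String entryName) {
        return entryName.getBytes(StandardCharsets.UTF_8).length > TAR_NAME_FIELD_BYTES;
    }

    public static void main(String[] args) {
        String fromStackTrace = "Socio-Economic%20Impact%20of%20Ebola%20on%20Households"
                + "%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf";
        System.out.println(needsLongNameSupport(fromStackTrace));
        System.out.println(needsLongNameSupport("short.pdf"));
    }
}
```

Note that URL-encoded names grow quickly, since every space becomes the three bytes {{%20}}, so crawled file names can cross the limit even when the original title looks short.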
[jira] [Commented] (NUTCH-1997) Add CBOR magic header to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508541#comment-14508541 ] Luke sh commented on NUTCH-1997: I am working on the update. Add CBOR magic header to CommonCrawlDataDumper output --- Key: NUTCH-1997 URL: https://issues.apache.org/jira/browse/NUTCH-1997 Project: Nutch Issue Type: Improvement Components: tool Reporter: Giuseppe Totaro Priority: Minor Attachments: NUTCH-1997.patch For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} wraps a single string value, representing the JSON text, into CBOR. For instance, using the Unix {{hexdump}} tool, we can see that, as expected, the first byte of all files is 0x7A (the first three bits are 011, that is the major type for strings, and the following 5 bits are 11010, meaning a uint32_t encodes the length of the following text), and the following 4 bytes (an unsigned 32-bit integer) encode the correct length of the text (as described in [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is currently included in the file (a list of CBOR tags is available [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]). In order to add support for CBOR detection using Apache Tika (as described in [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be great if the {{CommonCrawlDataDumper}} tool were able to add the self-describing CBOR magic header ([Tag 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded output files. Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] for supporting me on this work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
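To make the byte layout discussed in NUTCH-1997 concrete, here is a minimal JDK-only sketch. It is not the code from NUTCH-1997.patch; the class and method names are hypothetical. It encodes a text string with a 4-byte length (head byte 0x7A, i.e. major type 3 with additional information 26, as defined in RFC 7049) and then prepends the self-describing tag 55799, whose encoding is the three bytes 0xD9 0xD9 0xF7.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch (not the actual Nutch patch): builds the byte layout
// "self-describe tag + CBOR text string" by hand.
final class CborSelfDescribeSketch {
    // Tag 55799 (0xD9F7) as a CBOR tag head: 0xD9 means major type 6 (tag)
    // with a 2-byte argument, followed by the tag number 0xD9 0xF7.
    static final byte[] SELF_DESCRIBE_TAG = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    // Encodes text as a CBOR text string with a 4-byte big-endian length.
    // Head byte 0x7A = 011 (major type 3, text) + 11010 (uint32 length follows).
    // Always using the 4-byte form is valid CBOR, though not the shortest form.
    static byte[] encodeTextString(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + utf8.length);
        buf.put((byte) 0x7A);
        buf.putInt(utf8.length); // ByteBuffer is big-endian by default
        buf.put(utf8);
        return buf.array();
    }

    // Prepends the self-describing tag so that detectors can match on the
    // fixed leading bytes D9 D9 F7.
    static byte[] withSelfDescribeTag(byte[] cbor) {
        byte[] out = new byte[SELF_DESCRIBE_TAG.length + cbor.length];
        System.arraycopy(SELF_DESCRIBE_TAG, 0, out, 0, SELF_DESCRIBE_TAG.length);
        System.arraycopy(cbor, 0, out, SELF_DESCRIBE_TAG.length, cbor.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] out = withSelfDescribeTag(encodeTextString("{\"url\":\"http://example.org/\"}"));
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            hex.append(String.format("%02x ", out[i] & 0xFF));
        }
        System.out.println(hex.toString().trim());
    }
}
```

The leading three bytes give a detector such as Tika a fixed magic sequence to match on, which the bare 0x7A head byte plus a variable length cannot provide.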
[jira] [Commented] (NUTCH-1998) Add support for user-defined file extension to CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508520#comment-14508520 ] Luke sh commented on NUTCH-1998: Hi [~gostep], this patch works. I ran a quick test with the command option -extension cbor, and I was at least able to see that the cbor extension was appended. Thanks Add support for user-defined file extension to CommonCrawlDataDumper Key: NUTCH-1998 URL: https://issues.apache.org/jira/browse/NUTCH-1998 Project: Nutch Issue Type: Improvement Components: tool Reporter: Giuseppe Totaro Priority: Minor Attachments: NUTCH-1998.patch The {{CommonCrawlDataDumper}} tool is able to generate CBOR-encoded files, extracted from Nutch crawled data, using the Common Crawl format. By default, {{CommonCrawlDataDumper}} uses the original file extension. We are going to add support for a command-line option (e.g., {{-extension}}) that allows the user to provide a file extension to use in place of the original one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
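The behaviour requested in NUTCH-1998 boils down to swapping the extension of each output file name. A small sketch of that step follows, assuming that the user-supplied extension replaces the original one (as the issue description says) and is simply added when the name has none; {{FileExtensionSketch.withExtension}} is a hypothetical helper, not the method used in NUTCH-1998.patch.

```java
// Hypothetical helper (not from the Nutch patch): replaces the extension of
// a file name with a user-defined one, e.g. the value given to -extension.
final class FileExtensionSketch {
    static String withExtension(String fileName, String extension) {
        // Accept both "cbor" and ".cbor" as input.
        String ext = extension.startsWith(".") ? extension.substring(1) : extension;
        int slash = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        int dot = fileName.lastIndexOf('.');
        // Treat the dot as an extension separator only if it lies inside the
        // last path segment and is not that segment's first character.
        String base = (dot > slash + 1) ? fileName.substring(0, dot) : fileName;
        return base + "." + ext;
    }

    public static void main(String[] args) {
        System.out.println(withExtension("report.pdf", "cbor"));
        System.out.println(withExtension("segments/index", ".cbor"));
    }
}
```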
[jira] [Commented] (NUTCH-1997) Add CBOR magic header to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508522#comment-14508522 ] Luke sh commented on NUTCH-1997: Thanks a lot [~gostep], highly appreciated. This patch works too: I ran a quick test and was able to see that the magic tag is added at the beginning of the cbor file. Thanks Luke Add CBOR magic header to CommonCrawlDataDumper output --- Key: NUTCH-1997 URL: https://issues.apache.org/jira/browse/NUTCH-1997 Project: Nutch Issue Type: Improvement Components: tool Reporter: Giuseppe Totaro Priority: Minor Attachments: NUTCH-1997.patch For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} wraps a single string value, representing the JSON text, into CBOR. For instance, using the Unix {{hexdump}} tool, we can see that, as expected, the first byte of all files is 0x7A (the first three bits are 011, that is the major type for strings, and the following 5 bits are 11010, meaning a uint32_t encodes the length of the following text), and the following 4 bytes (an unsigned 32-bit integer) encode the correct length of the text (as described in [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is currently included in the file (a list of CBOR tags is available [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]). In order to add support for CBOR detection using Apache Tika (as described in [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be great if the {{CommonCrawlDataDumper}} tool were able to add the self-describing CBOR magic header ([Tag 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded output files. Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] for supporting me on this work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1997) Add CBOR magic header to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508549#comment-14508549 ] Giuseppe Totaro commented on NUTCH-1997: Great. Thanks [~Lukeliush]. Please let me know if you need support adding CBOR detection to Tika. Thanks a lot. Add CBOR magic header to CommonCrawlDataDumper output --- Key: NUTCH-1997 URL: https://issues.apache.org/jira/browse/NUTCH-1997 Project: Nutch Issue Type: Improvement Components: tool Reporter: Giuseppe Totaro Priority: Minor Attachments: NUTCH-1997.patch For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} wraps a single string value, representing the JSON text, into CBOR. For instance, using the Unix {{hexdump}} tool, we can see that, as expected, the first byte of all files is 0x7A (the first three bits are 011, that is the major type for strings, and the following 5 bits are 11010, meaning a uint32_t encodes the length of the following text), and the following 4 bytes (an unsigned 32-bit integer) encode the correct length of the text (as described in [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is currently included in the file (a list of CBOR tags is available [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]). In order to add support for CBOR detection using Apache Tika (as described in [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be great if the {{CommonCrawlDataDumper}} tool were able to add the self-describing CBOR magic header ([Tag 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded output files. Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] for supporting me on this work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1997) Add CBOR magic header to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508540#comment-14508540 ] Giuseppe Totaro commented on NUTCH-1997: Thanks [~Lukeliush]. Did you verify whether Tika is able to detect these files as cbor? Thanks a lot. Add CBOR magic header to CommonCrawlDataDumper output --- Key: NUTCH-1997 URL: https://issues.apache.org/jira/browse/NUTCH-1997 Project: Nutch Issue Type: Improvement Components: tool Reporter: Giuseppe Totaro Priority: Minor Attachments: NUTCH-1997.patch For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} wraps a single string value, representing the JSON text, into CBOR. For instance, using the Unix {{hexdump}} tool, we can see that, as expected, the first byte of all files is 0x7A (the first three bits are 011, that is the major type for strings, and the following 5 bits are 11010, meaning a uint32_t encodes the length of the following text), and the following 4 bytes (an unsigned 32-bit integer) encode the correct length of the text (as described in [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is currently included in the file (a list of CBOR tags is available [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]). In order to add support for CBOR detection using Apache Tika (as described in [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be great if the {{CommonCrawlDataDumper}} tool were able to add the self-describing CBOR magic header ([Tag 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded output files. Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] for supporting me on this work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-1999) Add http://nutch.apache.org/robots.txt
Julien Nioche created NUTCH-1999: Summary: Add http://nutch.apache.org/robots.txt Key: NUTCH-1999 URL: https://issues.apache.org/jira/browse/NUTCH-1999 Project: Nutch Issue Type: Improvement Components: website Reporter: Julien Nioche http://nutch.apache.org/robots.txt = 404 not found Aren't we funny! Go and tell webmasters to have a robots.txt after that! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (NUTCH-1999) Add http://nutch.apache.org/robots.txt
[ https://issues.apache.org/jira/browse/NUTCH-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-1999: Assignee: Julien Nioche Add http://nutch.apache.org/robots.txt -- Key: NUTCH-1999 URL: https://issues.apache.org/jira/browse/NUTCH-1999 Project: Nutch Issue Type: Improvement Components: website Reporter: Julien Nioche Assignee: Julien Nioche http://nutch.apache.org/robots.txt = 404 not found Aren't we funny! Go and tell webmasters to have a robots.txt after that! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2000) Link inversion fails with .locked already exists.
Julien Nioche created NUTCH-2000: Summary: Link inversion fails with .locked already exists. Key: NUTCH-2000 URL: https://issues.apache.org/jira/browse/NUTCH-2000 Project: Nutch Issue Type: Bug Affects Versions: 1.9 Reporter: Julien Nioche Fix For: 1.10 using standard crawl script with a brand new test dir in local mode I am getting Link inversion /data/BLABLABLA/runtime/local/bin/nutch invertlinks /data/BLABLABLA/testCrawl2//linkdb /data/BLABLABLA/testCrawl2//segments/20150423114335 LinkDb: java.io.IOException: lock file /data/BLABLABLA/testCrawl2/linkdb/.locked already exists. PS: 2000! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Build failed in Jenkins: Nutch-trunk #3085
See https://builds.apache.org/job/Nutch-trunk/3085/ -- [...truncated 5536 lines...] [echo] Testing plugin: urlfilter-validator [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.urlfilter.validator.TestUrlValidator [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.031 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-ajax deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-ajax/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-ajax [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.899 sec [junit] Running org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-basic deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test jar: deps-test: deploy: 
copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-basic [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.tika.TestRTFParser [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.017 sec [junit] Running org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.198 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-host deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-host [junit] Running org.apache.nutch.tika.TestRobotsMetaProcessor [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.412 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-pass deps-test-compile: compile-test: [javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-pass [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.192 sec init: init-plugin: deps-jar: clean-lib: resolve-default:
[jira] [Commented] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter
[ https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510341#comment-14510341 ] Jorge Luis Betancourt Gonzalez commented on NUTCH-1985: --- Committed revision 1675743. Adding a main() method to the MimeTypeIndexingFilter Key: NUTCH-1985 URL: https://issues.apache.org/jira/browse/NUTCH-1985 Project: Nutch Issue Type: Improvement Components: indexer, metadata, plugin Affects Versions: 1.10 Reporter: Jorge Luis Betancourt Gonzalez Priority: Minor Labels: features, patch, test Fix For: 1.10 Attachments: NUTCH-1985.patch This makes it very easy to test different rules files and to check the expressions used to filter content based on the detected MIME type. Until now the only way to check this was to do test crawls and check the stored data in Solr/Elasticsearch. This allows invoking the filter using the {{bin/nutch plugin}} command, something like: {{bin/nutch plugin mimetype-filter org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}} Two options are accepted: {{-h, --help}} for showing the help and {{-rules}} for specifying a rules file to be used. This makes it easy to experiment with different rules files until you get the desired behavior. After invoking the class, a valid MIME type must be entered on each line, and the output will be the same MIME type with a {{+}} or {{-}} sign at the beginning, indicating whether the given MIME type is allowed or denied, respectively. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
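The interactive check described in NUTCH-1985 (one MIME type per input line, echoed back with a leading {{+}} or {{-}}) can be sketched as a small loop. This is only an illustration of that input/output contract: the class name is hypothetical, and the rule itself is passed in as a plain predicate because the rules file format of the real plugin is not described in this thread.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch (not the MimeTypeIndexingFilter code): reads one MIME
// type per line and prefixes it with "+" if allowed or "-" if denied.
final class MimeRuleCheckSketch {
    static List<String> check(BufferedReader in, Predicate<String> allowed) {
        return in.lines()
                .map(String::trim)
                .filter(mime -> !mime.isEmpty()) // skip blank lines
                .map(mime -> (allowed.test(mime) ? "+" : "-") + mime)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in rule for illustration only: allow nothing but text/html.
        Predicate<String> htmlOnly = mime -> mime.equals("text/html");
        BufferedReader in = new BufferedReader(
                new StringReader("text/html\napplication/pdf\n"));
        check(in, htmlOnly).forEach(System.out::println);
    }
}
```

In the real tool the predicate would be derived from the file given with {{-rules}}, and the reader would wrap standard input instead of a fixed string.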
Re: [MASSMAIL]Re: [PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 04232015
+1 - Original Message - From: Chris A Mattmann (3980) chris.a.mattm...@jpl.nasa.gov To: dev@nutch.apache.org Sent: Thursday, April 23, 2015 2:16:09 PM Subject: [MASSMAIL]Re: [PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 04232015 s/1.8/1.10/ right? If so +1! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Thursday, April 23, 2015 at 2:14 PM To: dev@nutch.apache.org dev@nutch.apache.org Subject: [PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 04232015 Hi Folks, Does anyone have an issue with the above proposal? Thanks Lewis -- Lewis
[jira] [Commented] (NUTCH-1997) Add CBOR magic header to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510380#comment-14510380 ]

Luke sh commented on NUTCH-1997:

Notes: The attached cbor file contains the magic bytes for both type xhtml and type cbor. With priority 40 on application/cbor, we will have the following issues.

Problem 1: magic priority 40. application/xhtml+xml has a higher priority (50) than application/cbor (40) [I don't know who assigned 40 to cbor, or why]. So if xhtml gets read and compared first, cbor will not even be placed in the magic estimation list because it has a lower priority. The tests confirm that xhtml does get read and compared with the input file first, so any type with a priority below 50 is disregarded.

Problem 2: again magic priority, this time with 50. In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and cbor) are selected as candidate MIME types and put in the magic estimation list; since the xhtml type gets read first, it is placed above cbor. To break that tie, Tika relies on the decision from the extension method. If the extension method fails to detect the type (for now, let's ignore the metadata hint method for simplicity, but the same applies to it too), then xhtml is eventually returned.

My pull request, to be sent: I am going to set the magic priority of the cbor type to 50, the same as xhtml, because it would probably be risky to discard any of the estimated types without consulting the extension method.

Add CBOR magic header to CommonCrawlDataDumper output
---
Key: NUTCH-1997 URL: https://issues.apache.org/jira/browse/NUTCH-1997 Project: Nutch Issue Type: Improvement Components: tool Reporter: Giuseppe Totaro Priority: Minor Attachments: NUTCH-1997.patch

For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} wraps a single string value, representing the JSON text, into CBOR. For instance, using the Unix {{hexdump}} tool, we can see that, as expected, the first byte of all files is 0x7A (the first three bits are 011, that is the major type for text strings, and the following 5 bits are 11010, meaning a uint32_t encodes the length of the following text), and the following 4 bytes (uint32_t) encode the correct length of the file (as described in [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is currently included in the file (a list of CBOR tags is available [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).

In order to add support for CBOR detection using Apache Tika (as described in [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be great if the {{CommonCrawlDataDumper}} tool were able to add the self-describing CBOR magic header ([Tag 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded output files.

Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] for supporting me on this work.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
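The byte layout discussed in this issue can be illustrated with a small Java sketch. It is not taken from NUTCH-1997.patch: the class {{CborHeaderSketch}} and its method names are hypothetical, and the encoder always uses a 4-byte length only to mirror the layout described in the issue (a general CBOR encoder would normally pick the shortest length form). The byte values come from RFC 7049: a text string with a uint32 length starts with (3 << 5) | 26, and tag 55799 encodes as the three bytes 0xD9 0xD9 0xF7.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class CborHeaderSketch {

    /** Self-describing CBOR tag 55799 (RFC 7049, section 2.4.5). */
    static final byte[] SELF_DESCRIBE_TAG = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** Encodes text as a CBOR text string with a 4-byte length: initial byte, uint32 length, UTF-8 bytes. */
    static byte[] encodeText(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        int n = utf8.length;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write((3 << 5) | 26);   // major type 3 (011), additional info 26 (11010)
        out.write(n >>> 24);        // big-endian uint32 length; write(int) keeps the low 8 bits
        out.write(n >>> 16);
        out.write(n >>> 8);
        out.write(n);
        out.write(utf8, 0, n);
        return out.toByteArray();
    }

    /** Prepends the self-describing tag so that detectors can match on the leading bytes. */
    static byte[] withMagicHeader(byte[] cbor) {
        byte[] result = new byte[SELF_DESCRIBE_TAG.length + cbor.length];
        System.arraycopy(SELF_DESCRIBE_TAG, 0, result, 0, SELF_DESCRIBE_TAG.length);
        System.arraycopy(cbor, 0, result, SELF_DESCRIBE_TAG.length, cbor.length);
        return result;
    }

    /** Top three bits of a CBOR initial byte. */
    static int majorType(byte initial) {
        return (initial & 0xFF) >>> 5;
    }

    /** Low five bits of a CBOR initial byte. */
    static int additionalInfo(byte initial) {
        return initial & 0x1F;
    }

    public static void main(String[] args) {
        byte[] plain = encodeText("{\"url\":\"http://example.org/\"}");
        byte[] tagged = withMagicHeader(plain);
        System.out.printf("untagged starts with 0x%02X, major type %d, additional info %d%n",
                plain[0] & 0xFF, majorType(plain[0]), additionalInfo(plain[0]));
        System.out.printf("tagged starts with 0x%02X 0x%02X 0x%02X%n",
                tagged[0] & 0xFF, tagged[1] & 0xFF, tagged[2] & 0xFF);
    }
}
```

Prepending the tag leaves the wrapped string untouched, so a reader that understands tags can skip the three bytes and decode the same text string as before.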
[jira] [Resolved] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter
[ https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Luis Betancourt Gonzalez resolved NUTCH-1985. --- Resolution: Fixed
[jira] [Issue Comment Deleted] (NUTCH-1997) Add CBOR magic header to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luke sh updated NUTCH-1997:
---
Comment: was deleted (was: Notes: The attached cbor file contains the magic bytes for both type xhtml and type cbor. With priority 40 on application/cbor, we will have the following issues.

Problem 1: magic priority 40. application/xhtml+xml has a higher priority (50) than application/cbor (40) [I don't know who assigned 40 to cbor, or why]. So if xhtml gets read and compared first, cbor will not even be placed in the magic estimation list because it has a lower priority. The tests confirm that xhtml does get read and compared with the input file first, so any type with a priority below 50 is disregarded.

Problem 2: again magic priority, this time with 50. In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and cbor) are selected as candidate MIME types and put in the magic estimation list; since the xhtml type gets read first, it is placed above cbor. To break that tie, Tika relies on the decision from the extension method. If the extension method fails to detect the type (for now, let's ignore the metadata hint method for simplicity, but the same applies to it too), then xhtml is eventually returned.

My pull request, to be sent: I am going to set the magic priority of the cbor type to 50, the same as xhtml, because it would probably be risky to discard any of the estimated types without consulting the extension method.)
Build failed in Jenkins: Nutch-trunk #3086
See https://builds.apache.org/job/Nutch-trunk/3086/changes Changes: [jorgelbg] NUTCH-1985 Adding a main() method to the MimeTypeIndexingFilter -- [...truncated 5373 lines...] copy-generated-lib: test: [echo] Testing plugin: urlfilter-validator [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.urlfilter.validator.TestUrlValidator [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.025 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-ajax deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-ajax/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-ajax [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-basic deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test jar: 
deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-basic [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.224 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-host deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-host [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.438 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-pass deps-test-compile: compile-test: [javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-pass [junit] WARNING: multiple versions of ant detected in path for 
junit [junit] jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.187 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-querystring
[jira] [Commented] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter
[ https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510388#comment-14510388 ] Hudson commented on NUTCH-1985: --- FAILURE: Integrated in Nutch-trunk #3086 (See [https://builds.apache.org/job/Nutch-trunk/3086/]) NUTCH-1985 Adding a main() method to the MimeTypeIndexingFilter (jorgelbg: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1675743) * /nutch/trunk/src/plugin/mimetype-filter/src/java/org/apache/nutch/indexer/filter/MimeTypeIndexingFilter.java