date:20150423

[jira] [Commented] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter

2015-04-23 Thread Jorge Luis Betancourt Gonzalez (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509167#comment-14509167
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1985:
---

Should we commit this for the 1.10 release, or wait for 1.11?

 Adding a main() method to the MimeTypeIndexingFilter
 

 Key: NUTCH-1985
 URL: https://issues.apache.org/jira/browse/NUTCH-1985
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, metadata, plugin
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: features, patch, test
 Fix For: 1.10

 Attachments: NUTCH-1985.patch


 This makes it easy to test different rules files and to check the 
 expressions used to filter content based on the detected MIME type. Until 
 now the only way to check this was to run test crawls and inspect the stored 
 data in Solr/Elasticsearch. 
 The class can be invoked through the {{bin/nutch plugin}} command, 
 for example:
 {{bin/nutch plugin mimetype-filter 
 org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}
 Two options are accepted: {{-h, --help}} shows the help and {{-rules}} 
 specifies the rules file to use, which makes it easy to experiment with 
 different rules files until you get the desired behavior. 
 After invoking the class, enter one valid MIME type per line; 
 the output is the same MIME type prefixed with a {{+}} or {{-}} sign, 
 indicating whether the given MIME type is allowed or denied, 
 respectively.
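
The read-a-line, print-a-verdict behavior described above can be pictured with a small stand-alone sketch. This is not the code from NUTCH-1985.patch: the class name MimeRuleTester, the rule syntax (a + or - sign followed by a regular expression, first match wins) and the deny-by-default fallback are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the interactive check; NOT the NUTCH-1985 patch.
class MimeRuleTester {
    // One rule: an allow/deny flag plus a regex. The syntax is an assumption.
    private static final class Rule {
        final boolean allow;
        final Pattern pattern;

        Rule(boolean allow, Pattern pattern) {
            this.allow = allow;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each rule line is assumed to be '+' or '-' followed by a regex.
    MimeRuleTester(List<String> ruleLines) {
        for (String line : ruleLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    // Returns the MIME type prefixed with '+' (allowed) or '-' (denied).
    String check(String mimeType) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(mimeType).matches()) {
                return (rule.allow ? "+" : "-") + mimeType;
            }
        }
        return "-" + mimeType; // assumed default: deny when no rule matches
    }

    // Reads one MIME type per line from stdin and echoes the verdict.
    public static void main(String[] args) throws IOException {
        MimeRuleTester tester =
                new MimeRuleTester(Arrays.asList("+text/html", "-application/.*"));
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; ) {
            if (!line.trim().isEmpty()) {
                System.out.println(tester.check(line.trim()));
            }
        }
    }
}
```

With the two example rules, typing text/html would be echoed with a leading +, and application/pdf with a leading -. The real plugin reads its rules from the file given with -rules; the in-memory list here only keeps the sketch self-contained.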



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 04232015

2015-04-23 Thread Mattmann, Chris A (3980)

s/1.8/1.10/ right?

If so +1!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Thursday, April 23, 2015 at 2:14 PM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: [PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 04232015

Hi Folks,

Does anyone have an issue with the above proposal?

Thanks

Lewis

-- 
Lewis

[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8

2015-04-23 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509501#comment-14509501
 ] 

Lewis John McGibbney commented on NUTCH-1994:
-

Would like to commit by EoB today if no other issues. Thanks [~tpalsulich]

 Upgrade to Apache Tika 1.8
 --

 Key: NUTCH-1994
 URL: https://issues.apache.org/jira/browse/NUTCH-1994
 Project: Nutch
  Issue Type: Improvement
  Components: build, parser
Affects Versions: 1.10, 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch


 Tika 1.8 was released this morning.
 Let's upgrade, then release Nutch trunk.




[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8

2015-04-23 Thread Tyler Palsulich (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509514#comment-14509514
 ] 

Tyler Palsulich commented on NUTCH-1994:


Happy to help, [~lewismc]!

 Upgrade to Apache Tika 1.8
 --

 Key: NUTCH-1994
 URL: https://issues.apache.org/jira/browse/NUTCH-1994
 Project: Nutch
  Issue Type: Improvement
  Components: build, parser
Affects Versions: 1.10, 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch


 Tika 1.8 was released this morning.
 Let's upgrade, then release Nutch trunk.




[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8

2015-04-23 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509393#comment-14509393
 ] 

Lewis John McGibbney commented on NUTCH-1994:
-

Anyone to review? I can roll a release (or assist anyone else if they would 
like to learn/help) once we make this upgrade. 

 Upgrade to Apache Tika 1.8
 --

 Key: NUTCH-1994
 URL: https://issues.apache.org/jira/browse/NUTCH-1994
 Project: Nutch
  Issue Type: Improvement
  Components: build, parser
Affects Versions: 1.10, 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch


 Tika 1.8 was released this morning.
 Let's upgrade, then release Nutch trunk.




[PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 04232015

2015-04-23 Thread Lewis John Mcgibbney

Hi Folks,
Does anyone have an issue with the above proposal?
Thanks
Lewis

-- 
*Lewis*

[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8

2015-04-23 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509522#comment-14509522
 ] 

Lewis John McGibbney commented on NUTCH-1994:
-

Dynamite [~tpalsulich] I'll get you on IRC tomorrow.

 Upgrade to Apache Tika 1.8
 --

 Key: NUTCH-1994
 URL: https://issues.apache.org/jira/browse/NUTCH-1994
 Project: Nutch
  Issue Type: Improvement
  Components: build, parser
Affects Versions: 1.10, 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch


 Tika 1.8 was released this morning.
 Let's upgrade, then release Nutch trunk.




[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8

2015-04-23 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509678#comment-14509678
 ] 

Sebastian Nagel commented on NUTCH-1994:


+1

 Upgrade to Apache Tika 1.8
 --

 Key: NUTCH-1994
 URL: https://issues.apache.org/jira/browse/NUTCH-1994
 Project: Nutch
  Issue Type: Improvement
  Components: build, parser
Affects Versions: 1.10, 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch


 Tika 1.8 was released this morning.
 Let's upgrade, then release Nutch trunk.




Unsubscribe

2015-04-23 Thread Mengxian Li

Hi,
I want to unsubscribe the email list.

Best,
Mengxian

Unsubscribe

2015-04-23 Thread Zhaohui Zhang

Hi,

I want to unsubscribe the email list.

Best,
Zhaohui


-- 
Zhaohui Zhang
Dept. of Chemical Engineering, University of Southern California
Addr: 2611 Portland Street, Los Angeles, CA, USA  90007
Mobile:(+1)213-880-8321
Email: zhaoh...@usc.edu;
   happy...@gmail.com;
   zhaohuizhang2...@gmail.com;

[jira] [Resolved] (NUTCH-1994) Upgrade to Apache Tika 1.8

2015-04-23 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1994.
-
Resolution: Fixed

Committed revision 1675723 in trunk
Committed revision 1675724 in 2.X

 Upgrade to Apache Tika 1.8
 --

 Key: NUTCH-1994
 URL: https://issues.apache.org/jira/browse/NUTCH-1994
 Project: Nutch
  Issue Type: Improvement
  Components: build, parser
Affects Versions: 1.10, 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch


 Tika 1.8 was released this morning.
 Let's upgrade, then release Nutch trunk.




[jira] [Commented] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter

2015-04-23 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509873#comment-14509873
 ] 

Lewis John McGibbney commented on NUTCH-1985:
-

[~jorgelbg] +1 please commit against trunk :)

 Adding a main() method to the MimeTypeIndexingFilter
 

 Key: NUTCH-1985
 URL: https://issues.apache.org/jira/browse/NUTCH-1985
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, metadata, plugin
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: features, patch, test
 Fix For: 1.10

 Attachments: NUTCH-1985.patch


 This makes it easy to test different rules files and to check the 
 expressions used to filter content based on the detected MIME type. Until 
 now the only way to check this was to run test crawls and inspect the stored 
 data in Solr/Elasticsearch. 
 The class can be invoked through the {{bin/nutch plugin}} command, 
 for example:
 {{bin/nutch plugin mimetype-filter 
 org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}
 Two options are accepted: {{-h, --help}} shows the help and {{-rules}} 
 specifies the rules file to use, which makes it easy to experiment with 
 different rules files until you get the desired behavior. 
 After invoking the class, enter one valid MIME type per line; 
 the output is the same MIME type prefixed with a {{+}} or {{-}} sign, 
 indicating whether the given MIME type is allowed or denied, 
 respectively.




[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-23 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509898#comment-14509898
 ] 

Julien Nioche commented on NUTCH-2000:
--

[~lewismc] reverted to 1.10 as this is a blocker. Will investigate it further 
as soon as I find the time to do so but in the meantime if someone could try 
and reproduce it that would be great.

 Link inversion fails with .locked already exists.
 -

 Key: NUTCH-2000
 URL: https://issues.apache.org/jira/browse/NUTCH-2000
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.9
Reporter: Julien Nioche
Priority: Blocker
 Fix For: 1.10


 using standard crawl script with a brand new test dir in local mode I am 
 getting 
 Link inversion
 /data/BLABLABLA/runtime/local/bin/nutch invertlinks 
 /data/BLABLABLA/testCrawl2//linkdb 
 /data/BLABLABLA/testCrawl2//segments/20150423114335
 LinkDb: java.io.IOException: lock file 
 /data/BLABLABLA/testCrawl2/linkdb/.locked already exists.
 PS: 2000!
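
For readers unfamiliar with this kind of error: a message like the one above is what a marker-file lock reports when the marker is already present at the moment the lock is taken. The following is a minimal sketch of that pattern using only java.nio; the class MarkerLock and its methods are hypothetical and are not the locking code used by LinkDb.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical marker-file lock; not the locking code Nutch's LinkDb uses.
class MarkerLock {
    // Creates <dir>/.locked; fails if the marker is already there.
    static Path acquire(Path dir) throws IOException {
        Path lock = dir.resolve(".locked");
        try {
            // createFile checks for existence and creates in one atomic step
            return Files.createFile(lock);
        } catch (FileAlreadyExistsException e) {
            throw new IOException("lock file " + lock + " already exists.", e);
        }
    }

    // Removes the marker; returns false if it was not present.
    static boolean release(Path dir) throws IOException {
        return Files.deleteIfExists(dir.resolve(".locked"));
    }
}
```

In this pattern, a run that stops between acquire and release leaves the marker behind, and the next acquire fails. Whether that is what happened in this report is not established here.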




Build failed in Jenkins: Nutch-trunk #3083

2015-04-23 Thread Apache Jenkins Server

See https://builds.apache.org/job/Nutch-trunk/3083/changes

Changes:

[lewismc] NUTCH-1994 Upgrade to Apache Tika 1.8

--
[...truncated 5538 lines...]
 [echo] Testing plugin: urlfilter-validator
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.tika.TestRTFParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
0.017 sec
[junit] Running org.apache.nutch.urlfilter.validator.TestUrlValidator
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.025 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-ajax

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-ajax/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-ajax
[junit] Running org.apache.nutch.tika.TestRobotsMetaProcessor
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running 
org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.012 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-basic
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.938 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-host

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test
[junit] Running 
org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-host
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.189 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-pass

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-pass
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running 
org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer
[junit] Running 
org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.407 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading

[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8

2015-04-23 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509899#comment-14509899
 ] 

Hudson commented on NUTCH-1994:
---

FAILURE: Integrated in Nutch-trunk #3083 (See 
[https://builds.apache.org/job/Nutch-trunk/3083/])
NUTCH-1994 Upgrade to Apache Tika 1.8 (lewismc: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1675723)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/plugin/parse-tika/ivy.xml
* /nutch/trunk/src/plugin/parse-tika/plugin.xml


 Upgrade to Apache Tika 1.8
 --

 Key: NUTCH-1994
 URL: https://issues.apache.org/jira/browse/NUTCH-1994
 Project: Nutch
  Issue Type: Improvement
  Components: build, parser
Affects Versions: 1.10, 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch


 Tika 1.8 was released this morning.
 Let's upgrade, then release Nutch trunk.




[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-23 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509916#comment-14509916
 ] 

Lewis John McGibbney commented on NUTCH-2000:
-

ACK

 Link inversion fails with .locked already exists.
 -

 Key: NUTCH-2000
 URL: https://issues.apache.org/jira/browse/NUTCH-2000
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.9
Reporter: Julien Nioche
Priority: Blocker
 Fix For: 1.10


 using standard crawl script with a brand new test dir in local mode I am 
 getting 
 Link inversion
 /data/BLABLABLA/runtime/local/bin/nutch invertlinks 
 /data/BLABLABLA/testCrawl2//linkdb 
 /data/BLABLABLA/testCrawl2//segments/20150423114335
 LinkDb: java.io.IOException: lock file 
 /data/BLABLABLA/testCrawl2/linkdb/.locked already exists.
 PS: 2000!




Re: Unsubscribe

2015-04-23 Thread Michael Joyce

Email dev-unsubscr...@nutch.apache.org


You unsub the same way you subbed. It's just a different email.


-- Jimmy

On Thu, Apr 23, 2015 at 1:23 PM, Zhaohui Zhang happy...@gmail.com wrote:

 Hi,

 I want to unsubscribe the email list.

 Best,
 Zhaohui


 --
 Zhaohui Zhang
 Dept. of Chemical Engineering, University of Southern California
 Addr: 2611 Portland Street, Los Angeles, CA, USA  90007
 Mobile:(+1)213-880-8321
 Email: zhaoh...@usc.edu;
happy...@gmail.com;
zhaohuizhang2...@gmail.com;

Unsubscribe

2015-04-23 Thread Zhaohui Zhang

Hi,

I want to unsubscribe the email list.

Best,
Zhaohui

-- 
Zhaohui Zhang
PhD Student at University of Southern California
Mobile: (213)-880-8321
Email:   zhaoh...@usc.edu yuan...@usc.edu

[jira] [Created] (NUTCH-2001) SubCollection Field Name incorrect in nutch-default.xml

2015-04-23 Thread Jeff Cocking (JIRA)

Jeff Cocking created NUTCH-2001:
---

 Summary: SubCollection Field Name incorrect in nutch-default.xml
 Key: NUTCH-2001
 URL: https://issues.apache.org/jira/browse/NUTCH-2001
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.9, 1.8
Reporter: Jeff Cocking
Priority: Minor
 Fix For: 1.10


SubcollectionIndexingFilter.java is looking for the following variable in 
nutch-default.xml (at line 56):

 fieldName = conf.get("subcollection.default.fieldname", "subcollection");

nutch-default.xml lists the following:

<property>
  <name>subcollection.default.field</name>
  <value>subcollection</value>
  <description>
  The default field name for the subcollections.
  </description>
</property>

The property name in nutch-default.xml should be changed from 
subcollection.default.field to subcollection.default.fieldname.
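
The practical effect of the mismatch can be shown with a short sketch. It uses java.util.Properties as a stand-in for the Hadoop Configuration object so that it runs on its own; the class name FieldNameLookup and the override value collection_tag are made up for illustration. Because the fallback passed to the lookup happens to equal the value listed in nutch-default.xml, the mismatch stays invisible until someone overrides the property under its documented name.

```java
import java.util.Properties;

// Stand-in for the configuration lookup; java.util.Properties replaces
// Hadoop's Configuration here purely to keep the sketch self-contained.
class FieldNameLookup {
    static final String KEY_READ_BY_FILTER = "subcollection.default.fieldname";
    static final String KEY_IN_DEFAULT_XML = "subcollection.default.field";

    // Same shape as the lookup quoted above: read the key, fall back to "subcollection".
    static String fieldName(Properties conf) {
        return conf.getProperty(KEY_READ_BY_FILTER, "subcollection");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();

        // A user overrides the property under the name listed in nutch-default.xml.
        conf.setProperty(KEY_IN_DEFAULT_XML, "collection_tag"); // hypothetical value
        System.out.println(fieldName(conf)); // prints "subcollection": override ignored

        // Setting the key the filter actually reads takes effect.
        conf.setProperty(KEY_READ_BY_FILTER, "collection_tag");
        System.out.println(fieldName(conf)); // prints "collection_tag"
    }
}
```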





[jira] [Updated] (NUTCH-2001) SubCollection Field Name incorrect in nutch-default.xml

2015-04-23 Thread Jeff Cocking (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Cocking updated NUTCH-2001:

Attachment: NUTCH-2001-1.x.patch

 SubCollection Field Name incorrect in nutch-default.xml
 ---

 Key: NUTCH-2001
 URL: https://issues.apache.org/jira/browse/NUTCH-2001
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.8, 1.9
Reporter: Jeff Cocking
Priority: Minor
 Fix For: 1.10

 Attachments: NUTCH-2001-1.x.patch

   Original Estimate: 10m
  Remaining Estimate: 10m

 SubcollectionIndexingFilter.java is looking for the following variable in 
 nutch-default.xml (at line 56):
  fieldName = conf.get("subcollection.default.fieldname", "subcollection");
 nutch-default.xml lists the following:
 <property>
   <name>subcollection.default.field</name>
   <value>subcollection</value>
   <description>
   The default field name for the subcollections.
   </description>
 </property>
 The property name in nutch-default.xml should be changed from 
 subcollection.default.field to subcollection.default.fieldname.




[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8

2015-04-23 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509995#comment-14509995
 ] 

Hudson commented on NUTCH-1994:
---

SUCCESS: Integrated in Nutch-nutchgora #1412 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1412/])
NUTCH-1994 Upgrade to Apache Tika 1.8 (lewismc: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1675724)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/ivy/ivy.xml
* /nutch/branches/2.x/src/plugin/parse-tika/howto_upgrade_tika.txt
* /nutch/branches/2.x/src/plugin/parse-tika/ivy.xml
* /nutch/branches/2.x/src/plugin/parse-tika/plugin.xml


 Upgrade to Apache Tika 1.8
 --

 Key: NUTCH-1994
 URL: https://issues.apache.org/jira/browse/NUTCH-1994
 Project: Nutch
  Issue Type: Improvement
  Components: build, parser
Affects Versions: 1.10, 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch


 Tika 1.8 was released this morning.
 Let's upgrade, then release Nutch trunk.




[jira] [Updated] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-04-23 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1958:

Fix Version/s: (was: 1.10)
   1.11

 Remove scoring-opic from nutch-default.xml
 --

 Key: NUTCH-1958
 URL: https://issues.apache.org/jira/browse/NUTCH-1958
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.3, 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 2.4, 1.11


 I propose we remove scoring-opic from nutch-default. We all know it is flawed 
 for any kind of incremental crawl, which most of us do. It is also useless if 
 you want to perform a single crawl: if you must crawl all records of a 
 domain, using OPIC to prioritize URLs makes no sense. It also confuses 
 users, as we have seen in the past and recently [1].
 What do you think?
 [1]: 
 http://lucene.472066.n3.nabble.com/Nutch-documents-have-huge-scores-in-Solr-td4192064.html




[jira] [Updated] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-23 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2000:

Fix Version/s: (was: 1.10)
   1.11

 Link inversion fails with .locked already exists.
 -

 Key: NUTCH-2000
 URL: https://issues.apache.org/jira/browse/NUTCH-2000
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.9
Reporter: Julien Nioche
 Fix For: 1.11


 using standard crawl script with a brand new test dir in local mode I am 
 getting 
 Link inversion
 /data/BLABLABLA/runtime/local/bin/nutch invertlinks 
 /data/BLABLABLA/testCrawl2//linkdb 
 /data/BLABLABLA/testCrawl2//segments/20150423114335
 LinkDb: java.io.IOException: lock file 
 /data/BLABLABLA/testCrawl2/linkdb/.locked already exists.
 PS: 2000!




[jira] [Updated] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java

2015-04-23 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1947:

Fix Version/s: (was: 1.10)
   1.11

 Overhaul o.a.n.parse.OutlinkExtractor.java 
 ---

 Key: NUTCH-1947
 URL: https://issues.apache.org/jira/browse/NUTCH-1947
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 2.3, 1.9
Reporter: Lewis John McGibbney
 Fix For: 2.4, 1.11


 Right now in both trunk and 2.X, the 
 [OutlinkExtractor.java|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java]
  class needs a bit of TLC. It references JDK 1.5 in a few places, there are 
 misleading URL entries, and it has some @Deprecated methods 
 which we could ideally remove.




[jira] [Commented] (NUTCH-1963) CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked

2015-04-23 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509876#comment-14509876
 ] 

Lewis John McGibbney commented on NUTCH-1963:
-

[~gostep] is this issue addressed in NUTCH-1959?

 CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
 ---

 Key: NUTCH-1963
 URL: https://issues.apache.org/jira/browse/NUTCH-1963
 Project: Nutch
  Issue Type: Bug
  Components: commoncrawl
Affects Versions: 1.10
Reporter: Lewis John McGibbney
 Fix For: 1.10


 When invoking the commoncrawldump tool with the *-gzip* option and *-mimetype 
 application/pdf*, I get the following stack trace, which causes the task 
 to fail:
 {code}
 java.lang.RuntimeException: file name 
 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf'
  is too long ( > 100 bytes)
   at 
 org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674)
   at 
 org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275)
   at 
 org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400)
   at 
 org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236)
 {code}
 The workaround is to omit the *-gzip* option and defer compression to 
 a later task; however, this is only a workaround, not a solution.
 We need to fix this for the tool to work as designed and required.
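
The 100-byte figure in the message is the size of the name field in a classic tar header. The sketch below only illustrates that length check with the file name from the trace; the class TarNameCheck is hypothetical, and the comparison follows the wording of the error message rather than the library's exact boundary condition, which may differ. Commons Compress does let callers choose how long names are handled through TarArchiveOutputStream.setLongFileMode (for example with LONGFILE_GNU or LONGFILE_POSIX); whether that is the fix adopted for this issue is not stated in this thread.

```java
import java.nio.charset.StandardCharsets;

// Sketch of the length check behind the error; the 100-byte limit is the
// size of the name field in a classic tar header.
class TarNameCheck {
    static final int TAR_NAME_FIELD_BYTES = 100;

    // True when the encoded entry name is longer than the header field.
    // Assumption: "longer than" follows the error text, not the library source.
    static boolean needsLongNameHandling(String entryName) {
        return entryName.getBytes(StandardCharsets.UTF_8).length > TAR_NAME_FIELD_BYTES;
    }

    public static void main(String[] args) {
        String name = "Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in"
                + "%20Liberia%20Nov%2019%20(final,%20revised).pdf";
        System.out.println(needsLongNameHandling(name));
    }
}
```

Note that the URL-encoded form of the name makes things worse: every space costs three bytes as %20.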




[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-23 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509920#comment-14509920
 ] 

Lewis John McGibbney commented on NUTCH-2000:
-

Julien... I wonder if the 2nd URI path is OK?
/data/BLABLABLA/testCrawl2//segments/20150423114335
Note the '//'
YES :) :)
2000th

 Link inversion fails with .locked already exists.
 -

 Key: NUTCH-2000
 URL: https://issues.apache.org/jira/browse/NUTCH-2000
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.9
Reporter: Julien Nioche
Priority: Blocker
 Fix For: 1.10


 using standard crawl script with a brand new test dir in local mode I am 
 getting 
 Link inversion
 /data/BLABLABLA/runtime/local/bin/nutch invertlinks 
 /data/BLABLABLA/testCrawl2//linkdb 
 /data/BLABLABLA/testCrawl2//segments/20150423114335
 LinkDb: java.io.IOException: lock file 
 /data/BLABLABLA/testCrawl2/linkdb/.locked already exists.
 PS: 2000!




[jira] [Commented] (NUTCH-2001) SubCollection Field Name incorrect in nutch-default.xml

2015-04-23 Thread Jeff Cocking (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509997#comment-14509997
 ] 

Jeff Cocking commented on NUTCH-2001:
-

Attached is a patch I created from a clean download of Nutch Trunk. 

 SubCollection Field Name incorrect in nutch-default.xml
 ---

 Key: NUTCH-2001
 URL: https://issues.apache.org/jira/browse/NUTCH-2001
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.8, 1.9
Reporter: Jeff Cocking
Priority: Minor
 Fix For: 1.10

 Attachments: NUTCH-2001-1.x.patch

   Original Estimate: 10m
  Remaining Estimate: 10m

 SubcollectionIndexingFilter.java is looking for the following variable in 
 nutch-default.xml (at line 56):
  fieldName = conf.get("subcollection.default.fieldname", "subcollection");
 nutch-default.xml lists the following:
 <property>
   <name>subcollection.default.field</name>
   <value>subcollection</value>
   <description>
   The default field name for the subcollections.
   </description>
 </property>
 The property name in nutch-default.xml should be changed from 
 subcollection.default.field to subcollection.default.fieldname.




[jira] [Commented] (NUTCH-1969) URL Normalizer properly handling slashes

2015-04-23 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509880#comment-14509880
 ] 

Lewis John McGibbney commented on NUTCH-1969:
-

+1 for commit [~markus.jel...@openindex.io]

 URL Normalizer properly handling slashes
 

 Key: NUTCH-1969
 URL: https://issues.apache.org/jira/browse/NUTCH-1969
 Project: Nutch
  Issue Type: New Feature
  Components: plugin
Affects Versions: 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.10

 Attachments: NUTCH-1969.patch


 This is a URL normalizer we use for dealing with hosts that mix up 
 slash-suffixed URLs with non-slash-suffixed URLs. It is simple to use, and 
 its rules are simple to generate.
 It is similar to the host normalizer, reducing the number of duplicates while 
 crawling. It takes newline-delimited rules, each consisting of a host, 
 a tab or whitespace separator, and a + (PLUS) or - (MINUS) sign denoting 
 whether or not a slash is to be added to the path.
 The normalizer ignores pages that look like files with extensions; see the tests.
 Note: the normalizer must be enhanced to take host/path prefixes, rather than 
 hosts, as the first argument of a rule, because some hosts need different rules 
 depending on the root path. For example,
 * example.org/cms/news/1/2/3/4 is a CMS that doesn't accept slashes; if they 
 are suffixed, the user is redirected to a non-slash page;
 * example.org/files/a/b/ wants it just the other way around.
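
The rule format described above can be sketched in a few lines. This is an illustration only, not the code in NUTCH-1969.patch: the class SlashRuleSketch, its method names, and the "last path segment contains a dot" test for file-like URLs are assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of per-host slash rules; not the NUTCH-1969 patch.
class SlashRuleSketch {
    // host -> true means "add a trailing slash", false means "remove it"
    private final Map<String, Boolean> addSlash = new HashMap<>();

    // Parses lines of the assumed form "<host><whitespace><+|->".
    SlashRuleSketch(String rulesText) {
        for (String line : rulesText.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // ignore blank or malformed lines
            }
            if (parts[1].equals("+")) {
                addSlash.put(parts[0], true);
            } else if (parts[1].equals("-")) {
                addSlash.put(parts[0], false);
            }
        }
    }

    // Applies the host's rule to the path; URLs without a rule pass through.
    String normalize(String url) throws URISyntaxException {
        URI uri = new URI(url);
        Boolean add = uri.getHost() == null ? null : addSlash.get(uri.getHost());
        String path = uri.getPath();
        if (add == null || path == null || path.isEmpty() || path.equals("/")) {
            return url;
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        if (lastSegment.contains(".")) {
            return url; // looks like a file with an extension: leave untouched
        }
        String newPath = path;
        if (add && !path.endsWith("/")) {
            newPath = path + "/";
        } else if (!add && path.endsWith("/")) {
            newPath = path.substring(0, path.length() - 1);
        }
        return new URI(uri.getScheme(), uri.getAuthority(), newPath,
                uri.getQuery(), uri.getFragment()).toString();
    }
}
```

The sketch keys its rules on the host alone, so it shares the limitation described in the note: it cannot give example.org/cms/ and example.org/files/ different rules.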




[jira] [Updated] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-23 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2000:
-
Priority: Blocker  (was: Major)

 Link inversion fails with .locked already exists.
 -

 Key: NUTCH-2000
 URL: https://issues.apache.org/jira/browse/NUTCH-2000
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.9
Reporter: Julien Nioche
Priority: Blocker
 Fix For: 1.10


 Using the standard crawl script with a brand new test dir in local mode, I am 
 getting:
 Link inversion
 /data/BLABLABLA/runtime/local/bin/nutch invertlinks 
 /data/BLABLABLA/testCrawl2//linkdb 
 /data/BLABLABLA/testCrawl2//segments/20150423114335
 LinkDb: java.io.IOException: lock file 
 /data/BLABLABLA/testCrawl2/linkdb/.locked already exists.
 PS: 2000!




[jira] [Updated] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-23 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2000:
-
Fix Version/s: (was: 1.11)
   1.10

 Link inversion fails with .locked already exists.
 -

 Key: NUTCH-2000
 URL: https://issues.apache.org/jira/browse/NUTCH-2000
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.9
Reporter: Julien Nioche
Priority: Blocker
 Fix For: 1.10


 Using the standard crawl script with a brand new test dir in local mode, I am 
 getting:
 Link inversion
 /data/BLABLABLA/runtime/local/bin/nutch invertlinks 
 /data/BLABLABLA/testCrawl2//linkdb 
 /data/BLABLABLA/testCrawl2//segments/20150423114335
 LinkDb: java.io.IOException: lock file 
 /data/BLABLABLA/testCrawl2/linkdb/.locked already exists.
 PS: 2000!




[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8

2015-04-23 Thread Tyler Palsulich (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509492#comment-14509492
 ] 

Tyler Palsulich commented on NUTCH-1994:


Applied and tested both patches, both look good to me!

 Upgrade to Apache Tika 1.8
 --

 Key: NUTCH-1994
 URL: https://issues.apache.org/jira/browse/NUTCH-1994
 Project: Nutch
  Issue Type: Improvement
  Components: build, parser
Affects Versions: 1.10, 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch


 Tika 1.8 was released this morning.
 Let's upgrade, then release Nutch trunk.




Build failed in Jenkins: Nutch-trunk #3087

2015-04-23 Thread Apache Jenkins Server

See https://builds.apache.org/job/Nutch-trunk/3087/

--
[...truncated 5611 lines...]
test:
 [echo] Testing plugin: urlfilter-validator
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.tika.TestRTFParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
0.024 sec
[junit] Running org.apache.nutch.urlfilter.validator.TestUrlValidator
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.026 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-ajax

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-ajax/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-ajax
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.tika.TestRobotsMetaProcessor
[junit] Running 
org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.013 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-basic
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running 
org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.193 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-host

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-host
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running 
org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.419 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-pass

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-pass
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
2.315 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin:

[jira] [Resolved] (NUTCH-1963) CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked

2015-04-23 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1963.
-
Resolution: Fixed
  Assignee: Giuseppe Totaro

Addressed within NUTCH-1959
Thank you [~gostep]

 CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
 ---

 Key: NUTCH-1963
 URL: https://issues.apache.org/jira/browse/NUTCH-1963
 Project: Nutch
  Issue Type: Bug
  Components: commoncrawl
Affects Versions: 1.10
Reporter: Lewis John McGibbney
Assignee: Giuseppe Totaro
 Fix For: 1.10


 When invoking the commoncrawldump tool with the *-gzip* option and *-mimetype 
 application/pdf*, I get the following stack trace, which results in a failure 
 of the task:
 {code}
 java.lang.RuntimeException: file name 
 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf'
  is too long ( > 100 bytes)
   at 
 org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674)
   at 
 org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275)
   at 
 org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400)
   at 
 org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236)
 {code}
 The workaround consists of not using the *-gzip* option and instead delaying 
 compression until a later task. However, this is a workaround and not a solution.
 We need to fix this in order for the tool to work as designed and required.
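
To make the failure condition concrete, the check below tests whether an entry name exceeds the 100-byte name field of a classic tar header, which is the limit the exception message refers to. It is a sketch under assumptions: the class and method names are hypothetical, UTF-8 is assumed as the name encoding, and the exact boundary that Commons Compress applies internally may differ slightly from the strict "greater than 100" used here.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; not part of Nutch or Commons Compress. */
final class TarNameLimitSketch {

  /** Size in bytes of the name field in a classic tar header. */
  static final int NAME_FIELD_BYTES = 100;

  /** Returns true if the encoded entry name does not fit in the classic name field. */
  static boolean needsLongNameSupport(String entryName) {
    return entryName.getBytes(StandardCharsets.UTF_8).length > NAME_FIELD_BYTES;
  }

  public static void main(String[] args) {
    String name = "Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in"
        + "%20Liberia%20Nov%2019%20(final,%20revised).pdf";
    System.out.println(needsLongNameSupport(name));
  }
}
```

Names that fail this check need one of the tar long-name extensions; Commons Compress exposes these through `TarArchiveOutputStream.setLongFileMode`, for example with `LONGFILE_GNU`.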




[jira] [Commented] (NUTCH-1973) Job Administration end point for the REST service

2015-04-23 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510097#comment-14510097
 ] 

Lewis John McGibbney commented on NUTCH-1973:
-

This commit accidentally removed the NUTCH-1927 property from nutch-default.xml.
The commit at revision 1675735 adds it back in.
Excellent catch [~gostep]

 Job Administration end point for the REST service
 -

 Key: NUTCH-1973
 URL: https://issues.apache.org/jira/browse/NUTCH-1973
 Project: Nutch
  Issue Type: Sub-task
Reporter: Sujen Shah
Assignee: Chris A. Mattmann
 Fix For: 1.10

 Attachments: NUTCH-1973.patch


 This sub task deals with implementing the functionality documented at 
 https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI




[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-23 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510167#comment-14510167
 ] 

Hudson commented on NUTCH-1927:
---

FAILURE: Integrated in Nutch-trunk #3084 (See 
[https://builds.apache.org/job/Nutch-trunk/3084/])
Add back in NUTCH-1927 property to nutch-default as removed during commit 
@1675022 (lewismc: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1675735)
* /nutch/trunk/conf/nutch-default.xml


 Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
 ---

 Key: NUTCH-1927
 URL: https://issues.apache.org/jira/browse/NUTCH-1927
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
  Labels: available, patch
 Fix For: 1.10

 Attachments: NUTCH-1927.2015-04-16.patch, 
 NUTCH-1927.2015-04-17.patch, NUTCH-1927.Mattmann.041115.patch.txt, 
 NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt, 
 test_NUTCH-1927.2015-04-17.txt


 Based on discussion on the dev list, to use Nutch for some valid security 
 research use cases (DDoS, DNS and other testing), I am going to create a patch 
 that allows a whitelist:
 {code:xml}
 <property>
   <name>robot.rules.whitelist</name>
   <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
   <description>Comma separated list of hostnames or IP addresses to ignore 
 robot rules parsing for.
   </description>
 </property>
 {code}
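
As a rough illustration of how a comma-separated value like the one above could be consulted, the sketch below parses the list into a set and checks a host or IP against it. It is not taken from any of the attached patches: the class and method names are hypothetical, and the trimming and case-insensitive comparison are assumptions.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Illustrative sketch only; not the NUTCH-1927 implementation. */
final class RobotWhitelistSketch {

  /** Splits the property value on commas, trimming blanks and lower-casing entries. */
  static Set<String> parse(String propertyValue) {
    return Arrays.stream(propertyValue.split(","))
        .map(s -> s.trim().toLowerCase(Locale.ROOT))
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toSet());
  }

  /** Returns true if robot rules parsing would be skipped for this host or IP. */
  static boolean skipsRobotRules(String hostOrIp, Set<String> whitelist) {
    return hostOrIp != null && whitelist.contains(hostOrIp.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    Set<String> whitelist = parse("132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov");
    System.out.println(skipsRobotRules("hostname.apache.org", whitelist));
    System.out.println(skipsRobotRules("example.org", whitelist));
  }
}
```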




Build failed in Jenkins: Nutch-trunk #3084

2015-04-23 Thread Apache Jenkins Server

See https://builds.apache.org/job/Nutch-trunk/3084/changes

Changes:

[lewismc] Add back in NUTCH-1927 property to nutch-default as removed during 
commit @1675022

--
[...truncated 5373 lines...]
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
2.945 sec
[junit] Running org.apache.nutch.urlfilter.validator.TestUrlValidator
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.029 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-ajax

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-ajax/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-ajax
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.tika.TestRTFParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
0.017 sec
[junit] Running 
org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.011 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-basic
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.tika.TestRobotsMetaProcessor
[junit] Running 
org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.196 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-host

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-host
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running 
org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.998 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-pass

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-pass
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.423 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading

[jira] [Commented] (NUTCH-1963) CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked

2015-04-23 Thread Giuseppe Totaro (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510062#comment-14510062
 ] 

Giuseppe Totaro commented on NUTCH-1963:


Hi [~lewismc]. Yes, 
[NUTCH-1959|https://issues.apache.org/jira/browse/NUTCH-1959] includes support 
for long filenames:
{noformat}
tarOutput.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU);
{noformat}

Thanks,
Giuseppe

 CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
 ---

 Key: NUTCH-1963
 URL: https://issues.apache.org/jira/browse/NUTCH-1963
 Project: Nutch
  Issue Type: Bug
  Components: commoncrawl
Affects Versions: 1.10
Reporter: Lewis John McGibbney
 Fix For: 1.10


 When invoking the commoncrawldump tool with the *-gzip* option and *-mimetype 
 application/pdf*, I get the following stack trace, which results in a failure 
 of the task:
 {code}
 java.lang.RuntimeException: file name 
 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf'
  is too long ( > 100 bytes)
   at 
 org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674)
   at 
 org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275)
   at 
 org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400)
   at 
 org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236)
 {code}
 The workaround consists of not using the *-gzip* option and instead delaying 
 compression until a later task. However, this is a workaround and not a solution.
 We need to fix this in order for the tool to work as designed and required.




[jira] [Commented] (NUTCH-1997) Add CBOR magic header to CommonCrawlDataDumper output

2015-04-23 Thread Luke sh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508541#comment-14508541
 ] 

Luke sh commented on NUTCH-1997:


I am working on the update.

 Add CBOR magic header to CommonCrawlDataDumper output
 ---

 Key: NUTCH-1997
 URL: https://issues.apache.org/jira/browse/NUTCH-1997
 Project: Nutch
  Issue Type: Improvement
  Components: tool
Reporter: Giuseppe Totaro
Priority: Minor
 Attachments: NUTCH-1997.patch


 For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} 
 wraps a single string value, representing the JSON text, into CBOR. 
 For instance, using the Unix {{hexdump}} tool, we can see that, as expected, 
 the first byte of all files is 0x7A (the first three bits are 011, that 
 is the major type for strings, and the following 5 bits are 11010, meaning 
 a uint32_t encodes the length of following text), and the following 4 bytes 
 (an unsigned 32-bit integer) encode the right length of file (as described in 
 [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is 
 currently included in the file (a list of CBOR tags is available 
 [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
 In order to add support for CBOR detection using Apache Tika (as described in 
 [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be 
 great if {{CommonCrawlDataDumper}} tool is able to add the self-describing 
 CBOR magic header ([Tag 
 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded 
 output files. 
 Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] 
 for supporting me on this work.




[jira] [Commented] (NUTCH-1998) Add support for user-defined file extension to CommonCrawlDataDumper

2015-04-23 Thread Luke sh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508520#comment-14508520
 ] 

Luke sh commented on NUTCH-1998:


Hi [~gostep], this patch works. I ran a quick test with the command option 
-extension cbor, and I was at least able to see that the cbor extension was appended. 

Thanks


 Add support for user-defined file extension to CommonCrawlDataDumper
 

 Key: NUTCH-1998
 URL: https://issues.apache.org/jira/browse/NUTCH-1998
 Project: Nutch
  Issue Type: Improvement
  Components: tool
Reporter: Giuseppe Totaro
Priority: Minor
 Attachments: NUTCH-1998.patch


 {{CommonCrawlDataDumper}} tool is able to generate CBOR-encoded files, 
 extracted from Nutch crawled data, using the Common Crawl format. By default, 
 {{CommonCrawlDataDumper}} uses the original file extension.
 We are going to add support for a command-line option (e.g., {{-extension}}) 
 that allows the user to provide a file extension to use in place of the 
 original one.
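
A minimal sketch of the behaviour described above, assuming the user-supplied extension replaces the original one and that no extension means "keep the original name". The class and method names are hypothetical, and the attached patch may handle names differently.

```java
/** Illustrative sketch only; not the code in NUTCH-1998.patch. */
final class ExtensionOverrideSketch {

  /** Returns fileName with its extension replaced by the given one, if any. */
  static String applyExtension(String fileName, String extension) {
    if (extension == null || extension.isEmpty()) {
      return fileName; // default behaviour: keep the original extension
    }
    String ext = extension.startsWith(".") ? extension.substring(1) : extension;
    int slash = fileName.lastIndexOf('/');
    int dot = fileName.lastIndexOf('.');
    // Only treat the dot as an extension separator if it sits inside the last
    // path segment and is not that segment's first character.
    String base = dot > slash + 1 ? fileName.substring(0, dot) : fileName;
    return base + "." + ext;
  }

  public static void main(String[] args) {
    System.out.println(applyExtension("page.html", "cbor"));
    System.out.println(applyExtension("page.html", null));
  }
}
```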




[jira] [Commented] (NUTCH-1997) Add CBOR magic header to CommonCrawlDataDumper output

2015-04-23 Thread Luke sh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508522#comment-14508522
 ] 

Luke sh commented on NUTCH-1997:


Thanks a lot [~gostep], highly appreciated. This patch works too: I ran a quick 
test and was able to see that the magic tag is added at the beginning of the 
cbor file.

Thanks
Luke

 Add CBOR magic header to CommonCrawlDataDumper output
 ---

 Key: NUTCH-1997
 URL: https://issues.apache.org/jira/browse/NUTCH-1997
 Project: Nutch
  Issue Type: Improvement
  Components: tool
Reporter: Giuseppe Totaro
Priority: Minor
 Attachments: NUTCH-1997.patch


 For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} 
 wraps a single string value, representing the JSON text, into CBOR. 
 For instance, using the Unix {{hexdump}} tool, we can see that, as expected, 
 the first byte of all files is 0x7A (the first three bits are 011, that 
 is the major type for strings, and the following 5 bits are 11010, meaning 
 a uint32_t encodes the length of following text), and the following 4 bytes 
 (an unsigned 32-bit integer) encode the right length of file (as described in 
 [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is 
 currently included in the file (a list of CBOR tags is available 
 [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
 In order to add support for CBOR detection using Apache Tika (as described in 
 [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be 
 great if {{CommonCrawlDataDumper}} tool is able to add the self-describing 
 CBOR magic header ([Tag 
 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded 
 output files. 
 Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] 
 for supporting me on this work.




[jira] [Commented] (NUTCH-1997) Add CBOR magic header to CommonCrawlDataDumper output

2015-04-23 Thread Giuseppe Totaro (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508549#comment-14508549
 ] 

Giuseppe Totaro commented on NUTCH-1997:


Great. Thanks [~Lukeliush]. Please let me know if you need support with 
adding cbor detection to Tika.
Thanks a lot.

 Add CBOR magic header to CommonCrawlDataDumper output
 ---

 Key: NUTCH-1997
 URL: https://issues.apache.org/jira/browse/NUTCH-1997
 Project: Nutch
  Issue Type: Improvement
  Components: tool
Reporter: Giuseppe Totaro
Priority: Minor
 Attachments: NUTCH-1997.patch


 For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} 
 wraps a single string value, representing the JSON text, into CBOR. 
 For instance, using the Unix {{hexdump}} tool, we can see that, as expected, 
 the first byte of all files is 0x7A (the first three bits are 011, that 
 is the major type for strings, and the following 5 bits are 11010, meaning 
 a uint32_t encodes the length of following text), and the following 4 bytes 
 (an unsigned 32-bit integer) encode the right length of file (as described in 
 [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is 
 currently included in the file (a list of CBOR tags is available 
 [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
 In order to add support for CBOR detection using Apache Tika (as described in 
 [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be 
 great if {{CommonCrawlDataDumper}} tool is able to add the self-describing 
 CBOR magic header ([Tag 
 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded 
 output files. 
 Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] 
 for supporting me on this work.




[jira] [Commented] (NUTCH-1997) Add CBOR magic header to CommonCrawlDataDumper output

2015-04-23 Thread Giuseppe Totaro (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508540#comment-14508540
 ] 

Giuseppe Totaro commented on NUTCH-1997:


Thanks [~Lukeliush]. Did you verify whether Tika is able to detect these files as 
cbor?
Thanks a lot.

 Add CBOR magic header to CommonCrawlDataDumper output
 ---

 Key: NUTCH-1997
 URL: https://issues.apache.org/jira/browse/NUTCH-1997
 Project: Nutch
  Issue Type: Improvement
  Components: tool
Reporter: Giuseppe Totaro
Priority: Minor
 Attachments: NUTCH-1997.patch


 For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} 
 wraps a single string value, representing the JSON text, into CBOR. 
 For instance, using the Unix {{hexdump}} tool, we can see that, as expected, 
 the first byte of all files is 0x7A (the first three bits are 011, that 
 is the major type for strings, and the following 5 bits are 11010, meaning 
 a uint32_t encodes the length of following text), and the following 4 bytes 
 (an unsigned 32-bit integer) encode the right length of file (as described in 
 [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is 
 currently included in the file (a list of CBOR tags is available 
 [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
 In order to add support for CBOR detection using Apache Tika (as described in 
 [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be 
 great if {{CommonCrawlDataDumper}} tool is able to add the self-describing 
 CBOR magic header ([Tag 
 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded 
 output files. 
 Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] 
 for supporting me on this work.




[jira] [Created] (NUTCH-1999) Add http://nutch.apache.org/robots.txt

2015-04-23 Thread Julien Nioche (JIRA)

Julien Nioche created NUTCH-1999:


 Summary: Add http://nutch.apache.org/robots.txt
 Key: NUTCH-1999
 URL: https://issues.apache.org/jira/browse/NUTCH-1999
 Project: Nutch
  Issue Type: Improvement
  Components: website
Reporter: Julien Nioche


http://nutch.apache.org/robots.txt = 404 not found

Aren't we funny! Go and tell webmasters to have a robots.txt after that!







[jira] [Assigned] (NUTCH-1999) Add http://nutch.apache.org/robots.txt

2015-04-23 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-1999:


Assignee: Julien Nioche

 Add http://nutch.apache.org/robots.txt
 --

 Key: NUTCH-1999
 URL: https://issues.apache.org/jira/browse/NUTCH-1999
 Project: Nutch
  Issue Type: Improvement
  Components: website
Reporter: Julien Nioche
Assignee: Julien Nioche

 http://nutch.apache.org/robots.txt = 404 not found
 Aren't we funny! Go and tell webmasters to have a robots.txt after that!




[jira] [Created] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-23 Thread Julien Nioche (JIRA)

Julien Nioche created NUTCH-2000:


 Summary: Link inversion fails with .locked already exists.
 Key: NUTCH-2000
 URL: https://issues.apache.org/jira/browse/NUTCH-2000
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.9
Reporter: Julien Nioche
 Fix For: 1.10


Using the standard crawl script with a brand new test dir in local mode, I am 
getting:

Link inversion
/data/BLABLABLA/runtime/local/bin/nutch invertlinks 
/data/BLABLABLA/testCrawl2//linkdb 
/data/BLABLABLA/testCrawl2//segments/20150423114335
LinkDb: java.io.IOException: lock file 
/data/BLABLABLA/testCrawl2/linkdb/.locked already exists.

PS: 2000!





Build failed in Jenkins: Nutch-trunk #3085

2015-04-23 Thread Apache Jenkins Server

See https://builds.apache.org/job/Nutch-trunk/3085/

--
[...truncated 5536 lines...]
 [echo] Testing plugin: urlfilter-validator
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.urlfilter.validator.TestUrlValidator
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.031 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-ajax

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-ajax/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-ajax
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
2.899 sec
[junit] Running 
org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.011 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-basic
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.tika.TestRTFParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
0.017 sec
[junit] Running 
org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.198 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-host

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-host
[junit] Running org.apache.nutch.tika.TestRobotsMetaProcessor
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running 
org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.412 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-pass

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-pass
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running 
org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.192 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:

[jira] [Commented] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter

2015-04-23 Thread Jorge Luis Betancourt Gonzalez (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510341#comment-14510341
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1985:
---

Committed revision 1675743.

 Adding a main() method to the MimeTypeIndexingFilter
 

 Key: NUTCH-1985
 URL: https://issues.apache.org/jira/browse/NUTCH-1985
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, metadata, plugin
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: features, patch, test
 Fix For: 1.10

 Attachments: NUTCH-1985.patch


 This makes it very easy to test different rules files and check the
 expressions used to filter content based on the detected MIME type. Until
 now the only way to check this was to run test crawls and inspect the stored
 data in Solr/Elasticsearch.
 This allows calling the class using the {{bin/nutch plugin}} command,
 something like:
 {{bin/nutch plugin mimetype-filter 
 org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}
 Two options are accepted: {{-h, --help}} for showing the help and {{-rules}}
 for specifying a rules file to be used. This makes it easy to play with
 different rules files until you get the desired behavior.
 After invoking the class, a valid MIME type must be entered on each line,
 and the output will be the same MIME type with a {{+}} or {{-}} sign at the
 beginning, indicating whether the given MIME type is allowed or denied
 respectively.
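
The read-a-line, print-a-verdict loop described above could look roughly like the sketch below. This is not the code from NUTCH-1985.patch: the class name {{MimeRuleCheck}}, the rule syntax (a {{+}} or {{-}} sign followed by a regular expression, first match wins) and the deny-by-default fallback are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical stand-in for the command-line check; not the plugin's actual code.
public class MimeRuleCheck {

    // One rule: allow or deny, plus a regex matched against the whole MIME type.
    private static final class Rule {
        final boolean allow;
        final Pattern pattern;

        Rule(boolean allow, Pattern pattern) {
            this.allow = allow;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Assumed rule syntax: "+regex" or "-regex"; blank lines and "#" comments are skipped.
    public MimeRuleCheck(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Bad rule: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    // Returns the MIME type prefixed with '+' (allowed) or '-' (denied).
    public String check(String mimeType) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(mimeType).matches()) {
                return (rule.allow ? "+" : "-") + mimeType;
            }
        }
        return "-" + mimeType; // assumption: deny when no rule matches
    }

    // Reads one MIME type per line from stdin and prints the verdict for each.
    public static void main(String[] args) {
        MimeRuleCheck checker =
            new MimeRuleCheck(Arrays.asList("+text/html", "-image/.*"));
        try (Scanner in = new Scanner(System.in)) {
            while (in.hasNextLine()) {
                String mime = in.nextLine().trim();
                if (!mime.isEmpty()) {
                    System.out.println(checker.check(mime));
                }
            }
        }
    }
}
```

The rules here are hard-coded only to keep the sketch self-contained; the real tool takes them from the file passed with {{-rules}}.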



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [MASSMAIL]Re: [PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 04232015

2015-04-23 Thread Jorge Luis Betancourt González

+1

- Original Message -
From: Chris A Mattmann (3980) chris.a.mattm...@jpl.nasa.gov
To: dev@nutch.apache.org
Sent: Thursday, April 23, 2015 2:16:09 PM
Subject: [MASSMAIL]Re: [PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 
04232015

s/1.8/1.10/ right?

If so +1!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Thursday, April 23, 2015 at 2:14 PM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: [PROPOSE] Kick off Apache Nutch 1.8 by EoB Friday 04232015

Hi Folks,

Does anyone have an issue with the above proposal?

Thanks

Lewis

-- 
Lewis

[jira] [Commented] (NUTCH-1997) Add CBOR magic header to CommonCrawlDataDumper output

2015-04-23 Thread Luke sh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510380#comment-14510380
 ] 

Luke sh commented on NUTCH-1997:


Notes:
The attached CBOR file contains magic bytes for both the xhtml and cbor types.
With priority 40 on application/cbor, we run into the following issues.

Problem 1: magic priority 40.
application/xhtml+xml has a higher priority (50) than application/cbor
(40) [I don't know who assigned 40 to cbor, or why]. So if xhtml is read
and compared first, cbor will not even be placed in the magic estimation list
because of its lower priority. The tests confirm that xhtml is indeed read and
compared against the input file first, so any type with a priority below 50
is disregarded.


Problem 2: magic priority 50 again.
In Tika, given a file dumped by the Nutch dumper tool, both types
(xhtml and cbor) are selected as candidate MIME types and put
in the magic estimation list. Since xhtml is read first, it is placed
above cbor. To break that tie, Tika relies on the decision from
the extension method. If the extension method fails to detect the type (for now,
let's ignore the metadata hint method for simplicity, but the same applies to
it), then xhtml is eventually returned.

Pull request to be sent: I am going to set the magic priority of the cbor type
to 50, the same as xhtml, because it would probably be risky to discard any
of the estimated types without consulting the extension method.
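
To make the two problems concrete, here is a deliberately simplified model of the selection behaviour described in this comment. It is not Tika's implementation: the class {{MagicTieBreak}}, its {{Candidate}} type and the {{detect}} method are hypothetical names, and the logic only mirrors what the notes above say (keep the highest-priority magic matches, let the extension hint break ties, otherwise return the match that was read first).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the behaviour described above; not Tika code.
public class MagicTieBreak {

    // A MIME type whose magic bytes matched, with its magic priority.
    public static final class Candidate {
        final String type;
        final int priority;

        public Candidate(String type, int priority) {
            this.type = type;
            this.priority = priority;
        }
    }

    // magicMatches is in the order the types were read; extensionHint may be null.
    public static String detect(List<Candidate> magicMatches, String extensionHint) {
        // Only matches sharing the highest priority stay in the estimation list.
        int best = Integer.MIN_VALUE;
        for (Candidate c : magicMatches) {
            best = Math.max(best, c.priority);
        }
        List<String> estimation = new ArrayList<>();
        for (Candidate c : magicMatches) {
            if (c.priority == best) {
                estimation.add(c.type);
            }
        }
        if (estimation.isEmpty()) {
            return "application/octet-stream";
        }
        // A tie is broken by the extension hint when it names one of the candidates.
        if (extensionHint != null && estimation.contains(extensionHint)) {
            return extensionHint;
        }
        // Otherwise the type that was read first wins.
        return estimation.get(0);
    }
}
```

In this model, xhtml at 50 and cbor at 40 always yields xhtml regardless of the hint (Problem 1). With both at 50, an {{application/cbor}} hint yields cbor, while a missing hint still yields xhtml (Problem 2).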


 Add CBOR magic header to CommonCrawlDataDumper output
 ---

 Key: NUTCH-1997
 URL: https://issues.apache.org/jira/browse/NUTCH-1997
 Project: Nutch
  Issue Type: Improvement
  Components: tool
Reporter: Giuseppe Totaro
Priority: Minor
 Attachments: NUTCH-1997.patch


 For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}}
 wraps a single string value, representing the JSON text, into CBOR.
 For instance, using the Unix {{hexdump}} tool, we can see that, as expected,
 the first byte of all files is 0x7A (the first three bits are 011, which
 is the major type for text strings, and the following 5 bits are 11010, meaning
 a uint32_t encodes the length of the following text), and the following 4 bytes
 encode the correct length of the text (as described in
 [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is
 currently included in the file (a list of CBOR tags is available
 [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
 In order to add support for CBOR detection using Apache Tika (as described in
 [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be
 great if the {{CommonCrawlDataDumper}} tool were able to add the self-describing
 CBOR magic header ([Tag 
 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded
 output files.
 Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann]
 for supporting me on this work.
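
As a rough illustration of the byte layout discussed above: the bits 011 and 11010 together form the initial byte 0x7A, and RFC 7049 section 2.4.5 gives the serialization of tag 55799 as the three bytes 0xD9 0xD9 0xF7. The sketch below hand-encodes both. The class {{CborSelfDescribe}} and its method names are hypothetical; this is not the dumper's code, and it always uses the 4-byte length form only because that is what the hexdump output shows.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the byte layout; not the CommonCrawlDataDumper code.
public class CborSelfDescribe {

    // Tag 55799 (self-describe CBOR): 0xD9 = major type 6 with a 2-byte tag number, 0xD9F7 = 55799.
    private static final byte[] SELF_DESCRIBE_TAG = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    // Encodes a text string with initial byte 0x7A and a 4-byte big-endian length.
    public static byte[] encodeText(String json) {
        byte[] utf8 = json.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x7A); // 011 (text string) + 11010 (uint32 length follows)
        out.write((utf8.length >>> 24) & 0xFF);
        out.write((utf8.length >>> 16) & 0xFF);
        out.write((utf8.length >>> 8) & 0xFF);
        out.write(utf8.length & 0xFF);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Prepends the self-describing magic header to an already encoded CBOR item.
    public static byte[] withSelfDescribeTag(byte[] cbor) {
        byte[] tagged = new byte[SELF_DESCRIBE_TAG.length + cbor.length];
        System.arraycopy(SELF_DESCRIBE_TAG, 0, tagged, 0, SELF_DESCRIBE_TAG.length);
        System.arraycopy(cbor, 0, tagged, SELF_DESCRIBE_TAG.length, cbor.length);
        return tagged;
    }

    // Top three bits of an initial byte.
    public static int majorType(byte initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    // Low five bits of an initial byte.
    public static int additionalInfo(byte initialByte) {
        return initialByte & 0x1F;
    }
}
```

With the tag in front, a detector can match on the fixed prefix 0xD9 0xD9 0xF7 instead of guessing from the text string header.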



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter

2015-04-23 Thread Jorge Luis Betancourt Gonzalez (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez resolved NUTCH-1985.
---
Resolution: Fixed




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Issue Comment Deleted] (NUTCH-1997) Add CBOR magic header to CommonCrawlDataDumper output

2015-04-23 Thread Luke sh (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated NUTCH-1997:
---
Comment: was deleted

(was: Notes:
The attached CBOR file contains magic bytes for both the xhtml and cbor types.
With priority 40 on application/cbor, we run into the following issues.

Problem 1: magic priority 40.
application/xhtml+xml has a higher priority (50) than application/cbor
(40) [I don't know who assigned 40 to cbor, or why]. So if xhtml is read
and compared first, cbor will not even be placed in the magic estimation list
because of its lower priority. The tests confirm that xhtml is indeed read and
compared against the input file first, so any type with a priority below 50
is disregarded.


Problem 2: magic priority 50 again.
In Tika, given a file dumped by the Nutch dumper tool, both types
(xhtml and cbor) are selected as candidate MIME types and put
in the magic estimation list. Since xhtml is read first, it is placed
above cbor. To break that tie, Tika relies on the decision from
the extension method. If the extension method fails to detect the type (for now,
let's ignore the metadata hint method for simplicity, but the same applies to
it), then xhtml is eventually returned.

Pull request to be sent: I am going to set the magic priority of the cbor type
to 50, the same as xhtml, because it would probably be risky to discard any
of the estimated types without consulting the extension method.
)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Build failed in Jenkins: Nutch-trunk #3086

2015-04-23 Thread Apache Jenkins Server

See https://builds.apache.org/job/Nutch-trunk/3086/changes

Changes:

[jorgelbg] NUTCH-1985 Adding a main() method to the MimeTypeIndexingFilter

--
[...truncated 5373 lines...]
copy-generated-lib:

test:
 [echo] Testing plugin: urlfilter-validator
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.urlfilter.validator.TestUrlValidator
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.025 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-ajax

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-ajax/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-ajax
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running 
org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.011 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-basic
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running 
org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.224 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-host

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-host
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running 
org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.438 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-pass

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-pass
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running 
org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.187 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-querystring

[jira] [Commented] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter

2015-04-23 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510388#comment-14510388
 ] 

Hudson commented on NUTCH-1985:
---

FAILURE: Integrated in Nutch-trunk #3086 (See 
[https://builds.apache.org/job/Nutch-trunk/3086/])
NUTCH-1985 Adding a main() method to the MimeTypeIndexingFilter (jorgelbg: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1675743)
* 
/nutch/trunk/src/plugin/mimetype-filter/src/java/org/apache/nutch/indexer/filter/MimeTypeIndexingFilter.java





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
