date:20150312

RE: Google Summer of Code 2015 Mentor Registration

2015-03-12 Thread Markus Jelsma

+1
 
-Original message-
 From:Talat Uyarer ta...@uyarer.com
 Sent: Wednesday 11th March 2015 13:45
 To: ment...@community.apache.org; dev@nutch.apache.org
 Subject: Google Summer of Code 2015 Mentor Registration
 
 Nutch PMC,
 
 Please acknowledge my request to become a mentor for Google Summer of
 Code 2015 projects for Apache
 Nutch.
 
 My Melange username is talat.
 
 -- 
 Talat UYARER
 Websitesi: http://talat.uyarer.com
 Twitter: http://twitter.com/talatuyarer
 Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

Re: [jira] [Issue Comment Deleted] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X

2015-03-12 Thread Mohit Bagde

Hi,

My name is Mohit Bagde. I am currently doing my Master's in CS at USC. I
have taken CS572 Information Retrieval and Search Engines under Prof.
Mattmann and have worked on Nutch 1.X as part of the first assignment
which involved crawling with Nutch and integrating with Tika and
subsequently developing a plugin in Nutch. I have also taken INF 550 under
Prof. Kim where I am learning about the HDFS and Map Reduce and I find that
both these subjects have a common point in the JIRA issue NUTCH-1936 which
is about porting Nutch to Hadoop 2.X.

I would like to know, at a very high level, what the requirements for this
project are and what kind of background is required. I would like to submit
a project proposal, but I am not entirely sure what to put into it. I
enjoyed working with Nutch and found the entire experience very
educational. I would like to continue to develop and contribute to Nutch in
any way possible. I would be really obliged if you could give some more
insight into this JIRA issue.

Sincerely,

Mohit Bagde.

On Tue, Mar 10, 2015 at 9:54 PM, Ashwini Tokekar (JIRA) j...@apache.org
wrote:

[
https://issues.apache.org/jira/browse/NUTCH-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ashwini Tokekar updated NUTCH-1936:
---
Comment: was deleted

(was: Thanks Lewis)

GSoC 2015 - Move Nutch to Hadoop 2.X

Key: NUTCH-1936
URL: https://issues.apache.org/jira/browse/NUTCH-1936
Project: Nutch
Issue Type: Task
Components: build
Reporter: Lewis John McGibbney
Labels: gsoc2015
Fix For: 2.4, 1.11

The Nutch PMC [discussed|
http://www.mail-archive.com/dev%40nutch.apache.org/msg16250.html] ideas
for a good 2015 GSoC project. Porting the (trunk) codebase
to [Hadoop 2.X|http://hadoop.apache.org/docs/stable/] appears to be an
attractive option and one which would present an excellent learning
experience for a summer student.
A more comprehensive description of this issue should be included within
either a mentor-defined project description or a successful student
application.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mohit Bagde
Graduate Student,
Computer Science,
University of Southern California,
Los Angeles, CA 90007.

[jira] [Created] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-03-12 Thread Markus Jelsma (JIRA)

Markus Jelsma created NUTCH-1958:


 Summary: Remove scoring-opic from nutch-default.xml
 Key: NUTCH-1958
 URL: https://issues.apache.org/jira/browse/NUTCH-1958
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9, 2.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 2.4, 1.10


I propose we remove scoring-opic from nutch-default. We all know it is flawed 
for any kind of incremental crawl, which most of us do. It is also useless for 
a single crawl: if you must crawl all records of a domain, using OPIC to 
prioritize URLs makes no sense. It also confuses users, as we have seen in the 
past and recently [1].

What do you think?

[1]: 
http://lucene.472066.n3.nabble.com/Nutch-documents-have-huge-scores-in-Solr-td4192064.html




[jira] [Commented] (NUTCH-1956) Members to be public in URLCrawlDatum

2015-03-12 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358349#comment-14358349
 ] 

Sebastian Nagel commented on NUTCH-1956:


+1

 Members to be public in URLCrawlDatum
 -

 Key: NUTCH-1956
 URL: https://issues.apache.org/jira/browse/NUTCH-1956
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.10

 Attachments: NUTCH-1956.patch


 URLCrawlDatum's datum member cannot be accessed from other unit tests.




[GitHub] nutch pull request: NUTCH-1957 using MD5 as part of file path to s...

2015-03-12 Thread renxiawang

GitHub user renxiawang opened a pull request:

https://github.com/apache/nutch/pull/12

NUTCH-1957 using MD5 as part of file path to solve filename collision issue



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/renxiawang/nutch trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/12.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12


commit 23d7d8f62dec166b210cca0f49883580dfbef48d
Author: Renxia Wang renxia.w...@gmail.com
Date:   2015-03-12T10:01:38Z

NUTCH-1957 using MD5 as part of path and filename to solve filename 
collision issue




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (NUTCH-1957) FileDumper output file name collisions

2015-03-12 Thread Renxia Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358439#comment-14358439
 ] 

Renxia Wang commented on NUTCH-1957:


Hi Sebastian,

Thank you for your suggestions. Based on your comment, I resolved this issue and 
sent a pull request here: https://github.com/apache/nutch/pull/12

 FileDumper output file name collisions
 --

 Key: NUTCH-1957
 URL: https://issues.apache.org/jira/browse/NUTCH-1957
 Project: Nutch
  Issue Type: Bug
  Components: tool
Affects Versions: 1.10
Reporter: Renxia Wang
Priority: Minor
  Labels: dumper, filename, tools

 The FileDumper extracts the file base name and extension and uses 
 basename.extension (e.g. given the URL 
 https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the 
 basename.extension will be project.html) as the name of the file to dump. 
 Code from FileDumper.java: 
 String url = key.toString();
 String baseName = FilenameUtils.getBaseName(url);
 String extension = FilenameUtils.getExtension(url);
 ...
 String filename = baseName + "." + extension;
 This introduces file name collisions and leads to loss of data when using 
 bin/nutch dump. 
 Sample logs:
 2015-03-10 23:38:01,192 INFO  tools.FileDumper - Dumping URL: 
 http://beringsea.eol.ucar.edu/data/
 2015-03-10 23:38:01,193 INFO  tools.FileDumper - Skipping writing: 
 [testFileName/.html]: file already exists
 2015-03-10 23:38:16,717 INFO  tools.FileDumper - Dumping URL: 
 http://catalog.eol.ucar.edu/
 2015-03-10 23:38:16,719 INFO  tools.FileDumper - Skipping writing: 
 [testFileName/.html]: file already exists
 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
 https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
 [testFileName/project.html]: file already exists
 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
 https://www.aoncadis.org/contact/Christopher%20Arp/project.html
 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
 [testFileName/project.html]: file already exists
 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
 https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
 [testFileName/project.html]: file already exists
 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
 https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
 [testFileName/project.html]: file already exists
 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
 https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
 [testFileName/project.html]: file already exists
 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
 https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
 [testFileName/project.html]: file already exists
 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
 https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
 [testFileName/project.html]: file already exists
 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
 https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
 [testFileName/project.html]: file already exists
 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
 https://www.aoncadis.org/contact/Mary%20Albert/project.html
 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Skipping writing: 
 [testFileName/project.html]: file already exists
 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Dumping URL: 
 https://www.aoncadis.org/contact/Yarrow%20Axford/project.html
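
The collisions in the log above happen because only the last path segment of
the URL ends up in the file name. A minimal sketch of one way to avoid that,
assuming an MD5 digest of the full URL is prefixed to basename.extension, is
shown below. It is not the code from the attached patch or pull request: the
class name DumpFileNames, both helper methods and the
"<md5>_<basename>.<extension>" layout are hypothetical and chosen only for
illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch.
final class DumpFileNames {

    /** Returns the MD5 digest of the URL as 32 lowercase hex characters. */
    static String md5Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            // %032x left-pads with zeros so the result is always 32 characters.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /**
     * Builds a file name that depends on the whole URL. baseName and
     * extension are the values the original code obtains from FilenameUtils.
     */
    static String uniqueFileName(String url, String baseName, String extension) {
        return md5Hex(url) + "_" + baseName + "." + extension;
    }

    public static void main(String[] args) {
        String a = "https://www.aoncadis.org/contact/Mary%20Albert/project.html";
        String b = "https://www.aoncadis.org/contact/Yarrow%20Axford/project.html";
        // Both URLs share project.html but get different digest prefixes.
        System.out.println(uniqueFileName(a, "project", "html"));
        System.out.println(uniqueFileName(b, "project", "html"));
    }
}
```

With a scheme like this, URLs such as http://beringsea.eol.ucar.edu/data/ and
http://catalog.eol.ucar.edu/, which both reduce to an empty base name, would
also map to distinct files.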




Re: HTTP Post Authentication

2015-03-12 Thread Sebastian Nagel

Hi Tizy,

this should help:
https://wiki.apache.org/nutch/HttpPostAuthentication
http://svn.apache.org/repos/asf/nutch/trunk/conf/httpclient-auth.xml.template

For more details you could also check
https://issues.apache.org/jira/browse/NUTCH-827
https://issues.apache.org/jira/browse/NUTCH-1943

Cheers,
Sebastian

2015-03-12 7:59 GMT+01:00 Tizy Ninan tizy1...@gmail.com:

 Hi,

 Is there any detailed step by step explanation on how to implement
 HTTPPostAuthentication on Nutch 1.10?

 Thanks and Regards,
 Tizy

[jira] [Updated] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-12 Thread Jorge Luis Betancourt Gonzalez (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1962:
--
Attachment: NUTCH-1962.patch

 Need to have mimetype-filter.txt file available by default
 --

 Key: NUTCH-1962
 URL: https://issues.apache.org/jira/browse/NUTCH-1962
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Lewis John McGibbney
 Fix For: 1.10

 Attachments: NUTCH-1962.patch


 By default the mimetype-filter.txt file quoted within nutch-default.xml is 
 not available. We need to provide this as it is a PITA to constantly have to 
 add it to new crawler configurations.
 https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625




[jira] [Commented] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-12 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359894#comment-14359894
 ] 

Lewis John McGibbney commented on NUTCH-1962:
-

+1 commit thanks Jorge

On Thursday, March 12, 2015, Jorge Luis Betancourt Gonzalez (JIRA) 



-- 
*Lewis*


 Need to have mimetype-filter.txt file available by default
 --

 Key: NUTCH-1962
 URL: https://issues.apache.org/jira/browse/NUTCH-1962
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Lewis John McGibbney
 Fix For: 1.10

 Attachments: NUTCH-1962.patch



[jira] [Commented] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-12 Thread Jorge Luis Betancourt Gonzalez (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359931#comment-14359931
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1962:
---

Committed r1666356.

 Need to have mimetype-filter.txt file available by default
 --

 Key: NUTCH-1962
 URL: https://issues.apache.org/jira/browse/NUTCH-1962
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Lewis John McGibbney
 Fix For: 1.10

 Attachments: NUTCH-1962.patch



[jira] [Commented] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359935#comment-14359935
 ] 

Hudson commented on NUTCH-1962:
---

SUCCESS: Integrated in Nutch-trunk #3012 (See 
[https://builds.apache.org/job/Nutch-trunk/3012/])
NUTCH-1962 Need to have mimetype-filter.txt file available by default 
(jorgelbg: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1666356)
* /nutch/trunk/conf/mimetype-filter.txt


 Need to have mimetype-filter.txt file available by default
 --

 Key: NUTCH-1962
 URL: https://issues.apache.org/jira/browse/NUTCH-1962
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Lewis John McGibbney
 Fix For: 1.10

 Attachments: NUTCH-1962.patch



Re: HTTP Post Authentication

2015-03-12 Thread Tizy Ninan

Hi Lewis,

Thank you for the reply.

I tried providing the parameters specified in the httpclient-auth.xml
template file, but while crawling I get the following warnings:

WARN httpclient.Http: Bad auth conf file: root element <credentials> found
in httpclient-auth.xml - must be <auth-configuration>
WARN httpclient.Http: Bad auth conf file: Element <loginPostData> not
recognized in httpclient-auth.xml - expected <credentials>
WARN httpclient.Http: Bad auth conf file: Element <additionalPostHeaders>
not recognized in httpclient-auth.xml - expected <credentials>

The httpclient-auth.xml file is placed in the conf folder. The version of
nutch used is nutch 1.10 (trunk).

Could you please explain what could be wrong?

Thanks,
Tizy


On Fri, Mar 13, 2015 at 1:26 AM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Tizy,

 On Thu, Mar 12, 2015 at 12:20 AM, user-digest-h...@nutch.apache.org
 wrote:

 
  Is there any detailed step by step explanation on how to implement
  HTTPPostAuthentication on Nutch 1.10?
 
 

 https://github.com/apache/nutch/blob/trunk/conf/httpclient-auth.xml.template#L61-L105
 https://wiki.apache.org/nutch/HttpPostAuthentication
 HTH
 Lewis




-- 
Thanks and Regards,
Tizy
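
The first warning quoted in the message above says that the root element of
httpclient-auth.xml was credentials where auth-configuration was expected.
The sketch below reproduces only that one check with the JDK's XML parser,
so a configuration file can be tested before a crawl. It is an illustration
and not Nutch's own validation code: the class name AuthConfRootCheck and its
methods are hypothetical, and the sample documents are reduced to bare
elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Nutch.
final class AuthConfRootCheck {

    /** Root element name the warning message asks for. */
    static final String EXPECTED_ROOT = "auth-configuration";

    /** Parses the XML text and returns the tag name of its root element. */
    static String rootElementName(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            return root.getTagName();
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse auth configuration", e);
        }
    }

    /** True when the document is wrapped in the expected root element. */
    static boolean hasExpectedRoot(String xml) {
        return EXPECTED_ROOT.equals(rootElementName(xml));
    }

    public static void main(String[] args) {
        String wrapped = "<auth-configuration><credentials/></auth-configuration>";
        String bare = "<credentials/>";
        System.out.println(hasExpectedRoot(wrapped));
        System.out.println(hasExpectedRoot(bare));
    }
}
```

A file whose top-level element is credentials fails this check, which matches
the first warning. The other two warnings concern child elements and are not
covered by this sketch.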

[jira] [Created] (NUTCH-1963) CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked

2015-03-12 Thread Lewis John McGibbney (JIRA)

Lewis John McGibbney created NUTCH-1963:
---

 Summary: CommonsCrawlDataDumper is too long ( > 100 bytes) when 
-gzip option invoked
 Key: NUTCH-1963
 URL: https://issues.apache.org/jira/browse/NUTCH-1963
 Project: Nutch
  Issue Type: Bug
  Components: commoncrawl
Affects Versions: 1.10
Reporter: Lewis John McGibbney
 Fix For: 1.10


When invoking the commoncrawldump tool with the *-gzip* option and *-mimetype 
application/pdf*, I get the following stack trace, which results in a failure of 
the task:

{code}
java.lang.RuntimeException: file name 
'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf'
 is too long ( > 100 bytes)
at 
org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674)
at 
org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275)
at 
org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400)
at 
org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236)
{code}

The workaround is to skip the *-gzip* option and defer compression to a later 
task, but that is only a workaround and not a solution.
We need to fix this in order for the tool to work as designed and required.





[jira] [Updated] (NUTCH-1959) Improving CommonCrawlFormat implementations

2015-03-12 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1959:

Attachment: NUTCH-1959.v02.patch

Giuseppe's patch

 Improving CommonCrawlFormat implementations
 ---

 Key: NUTCH-1959
 URL: https://issues.apache.org/jira/browse/NUTCH-1959
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
 Attachments: NUTCH-1959.patch, NUTCH-1959.v02.patch


 {{CommonCrawlFormat}} is an interface for Java classes that implement methods 
 for writing data into Common Crawl format. {{AbstractCommonCrawlFormat}} is 
 an abstract class that implements {{CommonCrawlFormat}} and provides abstract 
 methods for CommonCrawl formatter classes.
 You can find attached a PATCH that includes some improvements for 
 {{CommonCrawlFormat}}-based classes:
 * {{CommonCrawlFormat}} and {{AbstractCommonCrawlFormat}} now provide only 
 the {{getJsonData()}} method, responsible for getting out JSON data.
 * {{AbstractCommonCrawlFormat}} also provides the abstract methods that each 
 subclass has to implement in order to handle JSON objects.
 * {{CommonCrawlFormatSimple}} is a {{StringBuilder}}-based formatter that now 
 also escapes JSON string values.
 This PATCH aims at providing a better interface for implementing/extending 
 {{CommonCrawlFormat}} classes.
 I would really appreciate your feedback.
 Thanks a lot,
 Giuseppe




[jira] [Commented] (NUTCH-1963) CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked

2015-03-12 Thread Giuseppe Totaro (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359127#comment-14359127
 ] 

Giuseppe Totaro commented on NUTCH-1963:


Thanks a lot [~lewismc]. We can solve this problem using 
{{setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU)}} for 
{{TarArchiveOutputStream}} ([Apache Commons 
Compress|http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/archivers/tar/TarArchiveOutputStream.html]).
I will update the patch soon in 
[https://issues.apache.org/jira/browse/NUTCH-1959|NUTCH-1959].
Thank you,
Giuseppe
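
The exception reported in this issue comes from the 100-byte name field of
the classic tar header, which is why a long-file mode such as the
{{LONGFILE_GNU}} setting mentioned above is needed. As a rough illustration,
the sketch below uses only the JDK to test whether an entry name would exceed
that field. The class TarNameCheck and its method are hypothetical, and the
"100 bytes or more" threshold is an assumption about how the library compares
lengths, so treat the boundary case as unverified.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch or Commons Compress.
final class TarNameCheck {

    /** Size in bytes of the name field in a classic tar header. */
    static final int NAME_FIELD_BYTES = 100;

    /**
     * Returns true when the encoded entry name does not fit the classic
     * header, i.e. when a long-file mode would be required to store it.
     */
    static boolean needsLongFileMode(String entryName) {
        int encodedLength = entryName.getBytes(StandardCharsets.UTF_8).length;
        return encodedLength >= NAME_FIELD_BYTES; // assumed threshold
    }

    public static void main(String[] args) {
        String fromTrace = "Socio-Economic%20Impact%20of%20Ebola%20on%20Households"
                + "%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf";
        System.out.println(fromTrace.length() + " bytes, long-file mode needed: "
                + needsLongFileMode(fromTrace));
    }
}
```

The file name from the stack trace is 105 bytes long, so it cannot be stored
in the classic header regardless of how the boundary case is handled.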

 CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
 ---

 Key: NUTCH-1963
 URL: https://issues.apache.org/jira/browse/NUTCH-1963
 Project: Nutch
  Issue Type: Bug
  Components: commoncrawl
Affects Versions: 1.10
Reporter: Lewis John McGibbney
 Fix For: 1.10



[jira] [Updated] (NUTCH-1957) FileDumper output file name collisions

2015-03-12 Thread Renxia Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renxia Wang updated NUTCH-1957:
---
Attachment: NUTCH-1957.patch

 FileDumper output file name collisions
 --

 Key: NUTCH-1957
 URL: https://issues.apache.org/jira/browse/NUTCH-1957
 Project: Nutch
  Issue Type: Bug
  Components: tool
Affects Versions: 1.10
Reporter: Renxia Wang
Priority: Minor
  Labels: dumper, filename, tools
 Attachments: NUTCH-1957.patch



[jira] [Updated] (NUTCH-1957) FileDumper output file name collisions

2015-03-12 Thread Renxia Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renxia Wang updated NUTCH-1957:
---
Patch Info: Patch Available

 FileDumper output file name collisions
 --

 Key: NUTCH-1957
 URL: https://issues.apache.org/jira/browse/NUTCH-1957
 Project: Nutch
  Issue Type: Bug
  Components: tool
Affects Versions: 1.10
Reporter: Renxia Wang
Priority: Minor
  Labels: dumper, filename, tools


[Nutch Wiki] Update of Nutch_1.X_RESTAPI by SujenShah

2015-03-12 Thread Apache Wiki

Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The Nutch_1.X_RESTAPI page has been changed by SujenShah:
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI?action=diff&rev1=2&rev2=3

  Ok
  
  
+ === Configuration ===
+  Configuration's list 
+ 
+ GET /config
+ 
+ 
+ __Response__ contains names of available configurations.
+ 
+   [default,custom-config]
+ 
+ 
+  Configuration parameters 
+ 
+ GET /config/{configuration name}
+ 
+ Examples:
+ GET /config/default
+ GET /config/custom-config
+ 
+ 
+ __Response__ contains parameters with values
+ 
+   {
+anchorIndexingFilter.deduplicate:false,
+crawl.gen.delay:60480,
+db.fetch.interval.default:2592000,
+db.fetch.interval.max:7776000,
+
+
+}
+ 
+ 
+  Get property value 
+ 
+ GET /config/{configuration name}/{property}
+ 
+ Examples:
+ GET /config/default/anchorIndexingFilter.deduplicate
+ 
+ 
+ __Response__ contains parameter's value as string
+ 
+ false
+ 
+ 
+  Create configuration 
+ Creates a new Nutch configuration with the given parameters. If the force field is 
true, an already existing configuration will be overridden; otherwise it will not.
+ 
+ POST /config/{configuration name}
+ 
+ Examples:
+ POST /config/new-config
+{
+   configId:new-config,
+   force:true,
+   params:{anchorIndexingFilter.deduplicate:false,... }
+}
+ 
+ 
+ 
+ __Response__ is created config's id.
+ 
+ new-config
+ 
+ 
+  Delete configuration 
+ 
+ DELETE /config/{configuration name}
+ 
+ Examples:
+ DELETE /config/new-config
+ 
+ 
  === Jobs ===
  This point allows job management, including creation, job information and 
killing of a job.
   Listing all jobs
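
The configuration endpoints added in the wiki change above all hang off one
/config path. The sketch below only builds the corresponding URLs and sends
nothing over the network. The class NutchConfigEndpoints and the base address
http://localhost:8081 are assumptions made for illustration; substitute the
address of your own Nutch server.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch.
final class NutchConfigEndpoints {

    private final URI base;

    NutchConfigEndpoints(String baseUrl) {
        // A trailing slash makes resolve() append to the base instead of
        // replacing its last path segment.
        this.base = URI.create(baseUrl.endsWith("/") ? baseUrl : baseUrl + "/");
    }

    /** GET: names of the available configurations. */
    URI listConfigs() {
        return base.resolve("config");
    }

    /** GET: parameters of one configuration. POST creates it, DELETE removes it. */
    URI config(String configName) {
        return base.resolve("config/" + configName);
    }

    /** GET: value of a single property as a string. */
    URI property(String configName, String propertyName) {
        return base.resolve("config/" + configName + "/" + propertyName);
    }

    public static void main(String[] args) {
        NutchConfigEndpoints api = new NutchConfigEndpoints("http://localhost:8081");
        System.out.println("GET    " + api.listConfigs());
        System.out.println("GET    " + api.property("default", "anchorIndexingFilter.deduplicate"));
        System.out.println("DELETE " + api.config("new-config"));
    }
}
```

Keeping URL construction in one place means the HTTP client used to send the
requests can be swapped without touching the paths.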

HTTP Post Authentication

2015-03-12 Thread Tizy Ninan

Hi,

Is there any detailed step by step explanation on how to implement
HTTPPostAuthentication on Nutch 1.10?

Thanks and Regards,
Tizy

[jira] [Assigned] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper

2015-03-12 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-1960:


Assignee: Chris A. Mattmann

 JUnit test for dump method of CommonCrawlDataDumper
 ---

 Key: NUTCH-1960
 URL: https://issues.apache.org/jira/browse/NUTCH-1960
 Project: Nutch
  Issue Type: Test
Affects Versions: 1.9
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
 Attachments: NUTCH-1960.patch, test-segments.tar.gz


 Hi all,
 attached you can find the patch, which includes an extremely simple JUnit test 
 for the {{dump}} method of the {{CommonCrawlDataDumper}} class.
 Essentially, it checks whether {{dump}} is able to create a given list of files 
 from Nutch segments (in {{testresources}}).
 Thanks a lot,
 Giuseppe



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Work started] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper

2015-03-12 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1960 started by Chris A. Mattmann.

 JUnit test for dump method of CommonCrawlDataDumper
 ---

 Key: NUTCH-1960
 URL: https://issues.apache.org/jira/browse/NUTCH-1960
 Project: Nutch
  Issue Type: Test
Affects Versions: 1.9
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
 Attachments: NUTCH-1960.patch, test-segments.tar.gz


 Hi all,
 attached you can find the patch, which includes an extremely simple JUnit test 
 for the {{dump}} method of the {{CommonCrawlDataDumper}} class.
 Essentially, it checks whether {{dump}} is able to create a given list of files 
 from Nutch segments (in {{testresources}}).
 Thanks a lot,
 Giuseppe



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Work started] (NUTCH-1959) Improving CommonCrawlFormat implementations

2015-03-12 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1959 started by Chris A. Mattmann.

 Improving CommonCrawlFormat implementations
 ---

 Key: NUTCH-1959
 URL: https://issues.apache.org/jira/browse/NUTCH-1959
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
 Attachments: NUTCH-1959.patch


 {{CommonCrawlFormat}} is an interface for Java classes that implement methods 
 for writing data into Common Crawl format. {{AbstractCommonCrawlFormat}} is 
 an abstract class that implements {{CommonCrawlFormat}} and provides abstract 
 methods for CommonCrawl formatter classes.
 Attached you can find a patch that includes some improvements for 
 {{CommonCrawlFormat}}-based classes:
 * {{CommonCrawlFormat}} and {{AbstractCommonCrawlFormat}} now provide only 
 the {{getJsonData()}} method, which is responsible for producing the JSON data.
 * {{AbstractCommonCrawlFormat}} also provides the abstract methods that each 
 subclass has to implement in order to handle JSON objects.
 * {{CommonCrawlFormatSimple}} is a {{StringBuilder}}-based formatter that now 
 also escapes JSON string values.
 This patch aims to provide a better interface for implementing and extending 
 {{CommonCrawlFormat}} classes.
 I would really appreciate your feedback.
 Thanks a lot,
 Giuseppe



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
