RE: Google Summer of Code 2015 Mentor Registration
+1 -Original message- From:Talat Uyarer ta...@uyarer.com Sent: Wednesday 11th March 2015 13:45 To: ment...@community.apache.org; dev@nutch.apache.org Subject: Google Summer of Code 2015 Mentor Registration Nutch PMC, Please acknowledge my request to become a mentor for Google Summer of Code 2015 projects for Apache Nutch. My Melange username is talat. -- Talat UYARER Websitesi: http://talat.uyarer.com Twitter: http://twitter.com/talatuyarer Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
Re: [jira] [Issue Comment Deleted] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X
Hi, My name is Mohit Bagde. I am currently doing my Master's in CS at USC. I have taken CS572 Information Retrieval and Search Engines under Prof. Mattmann and have worked on Nutch 1.X as part of the first assignment, which involved crawling with Nutch, integrating with Tika, and subsequently developing a plugin in Nutch. I have also taken INF 550 under Prof. Kim, where I am learning about HDFS and MapReduce, and I find that both these subjects have a common point in the JIRA issue NUTCH-1936, which is about porting Nutch to Hadoop 2.X. I would like to know, on a very high level, what the requirements for this project are and what kind of background is required. I would like to submit a project proposal but I am not entirely sure what to put into it. I enjoyed working with Nutch and found the entire experience to be very educational. I would like to continue to develop and contribute to Nutch in any way possible. I would be really obliged if you could give some more insight into this JIRA issue. Sincerely, Mohit Bagde. On Tue, Mar 10, 2015 at 9:54 PM, Ashwini Tokekar (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/NUTCH-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwini Tokekar updated NUTCH-1936: --- Comment: was deleted (was: Thanks Lewis) GSoC 2015 - Move Nutch to Hadoop 2.X Key: NUTCH-1936 URL: https://issues.apache.org/jira/browse/NUTCH-1936 Project: Nutch Issue Type: Task Components: build Reporter: Lewis John McGibbney Labels: gsoc2015 Fix For: 2.4, 1.11 The Nutch PMC [discussed|http://www.mail-archive.com/dev%40nutch.apache.org/msg16250.html] ideas for a good 2015 GSoC project. It appears that porting the (trunk) codebase to [Hadoop 2.X|http://hadoop.apache.org/docs/stable/] is an attractive option and one which would present an excellent learning experience for a summer student. 
A more comprehensive description of this issue should be included within either a mentor-defined project description or a successful student application. -- This message was sent by Atlassian JIRA (v6.3.4#6332) -- Mohit Bagde Graduate Student, Computer Science, University of Southern California, Los Angeles, CA 90007.
[jira] [Created] (NUTCH-1958) Remove scoring-opic from nutch-default.xml
Markus Jelsma created NUTCH-1958: Summary: Remove scoring-opic from nutch-default.xml Key: NUTCH-1958 URL: https://issues.apache.org/jira/browse/NUTCH-1958 Project: Nutch Issue Type: Improvement Affects Versions: 1.9, 2.3 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 2.4, 1.10 I propose we remove scoring-opic from nutch-default.xml. We all know it is flawed for any kind of incremental crawl, which most of us do. It is also useless for a single crawl: if you must crawl all records of a domain, using OPIC to prioritize URLs makes no sense. It also confuses users, as we have seen in the past and recently [1]. What do you think? [1]: http://lucene.472066.n3.nabble.com/Nutch-documents-have-huge-scores-in-Solr-td4192064.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1956) Members to be public in URLCrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358349#comment-14358349 ] Sebastian Nagel commented on NUTCH-1956: +1 Members to be public in URLCrawlDatum - Key: NUTCH-1956 URL: https://issues.apache.org/jira/browse/NUTCH-1956 Project: Nutch Issue Type: Task Affects Versions: 1.9 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Trivial Fix For: 1.10 Attachments: NUTCH-1956.patch URLCrawlDatum's datum member cannot be accessed from other unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: NUTCH-1957 using MD5 as part of file path to s...
GitHub user renxiawang opened a pull request: https://github.com/apache/nutch/pull/12 NUTCH-1957 using MD5 as part of file path to solve filename collision issue You can merge this pull request into a Git repository by running: $ git pull https://github.com/renxiawang/nutch trunk Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/12.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12 commit 23d7d8f62dec166b210cca0f49883580dfbef48d Author: Renxia Wang renxia.w...@gmail.com Date: 2015-03-12T10:01:38Z NUTCH-1957 using MD5 as part of path and filename to solve filename collision issue --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (NUTCH-1957) FileDumper output file name collisions
[ https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358439#comment-14358439 ] Renxia Wang commented on NUTCH-1957: Hi Sebastian, Thank you for your suggestions. Based on your comment, I resolved this issue and sent a pull request here: https://github.com/apache/nutch/pull/12 FileDumper output file name collisions -- Key: NUTCH-1957 URL: https://issues.apache.org/jira/browse/NUTCH-1957 Project: Nutch Issue Type: Bug Components: tool Affects Versions: 1.10 Reporter: Renxia Wang Priority: Minor Labels: dumper, filename, tools The FileDumper extracts the file base name and extension and uses basename.extension (e.g. given the URL https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the basename.extension will be project.html) as the name of the dumped file. Code from FileDumper.java: String url = key.toString(); String baseName = FilenameUtils.getBaseName(url); String extension = FilenameUtils.getExtension(url); ... String filename = baseName + "." + extension; This introduces file name collisions and leads to loss of data when using bin/nutch dump. 
Sample logs: 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL: http://beringsea.eol.ucar.edu/data/ 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL: http://catalog.eol.ucar.edu/ 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Carin%20Ashjian/project.html 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Christopher%20Arp/project.html 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Juha%20Alatalo/project.html 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Kerim%20Aydin/project.html 2015-03-10 23:38:46,413 
INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Knut%20Aagaard/project.html 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Mary%20Albert/project.html 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Yarrow%20Axford/project.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
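The idea named in the pull request title (using an MD5 digest as part of the file name) can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class name `Md5FileNameSketch`, the helper `digestPrefixedName`, and the `<md5>_<baseName>.<extension>` layout are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5FileNameSketch {

    /** Returns the lowercase hex MD5 digest of the URL's UTF-8 bytes. */
    static String md5Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Hypothetical naming scheme: prefix the base name with the digest of the
     * full URL, so two URLs ending in the same "project.html" get different names.
     */
    static String digestPrefixedName(String url, String baseName, String extension) {
        String name = md5Hex(url) + "_" + baseName;
        return extension.isEmpty() ? name : name + "." + extension;
    }

    public static void main(String[] args) {
        String a = "https://www.aoncadis.org/contact/Yarrow%20Axford/project.html";
        String b = "https://www.aoncadis.org/contact/Mary%20Albert/project.html";
        System.out.println(digestPrefixedName(a, "project", "html"));
        System.out.println(digestPrefixedName(b, "project", "html"));
    }
}
```

Because the digest is computed over the whole URL, the two `project.html` pages from the sample logs map to different file names, while the readable base name is kept as a suffix.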
Re: HTTP Post Authentication
Hi Tizy, this should help: https://wiki.apache.org/nutch/HttpPostAuthentication http://svn.apache.org/repos/asf/nutch/trunk/conf/httpclient-auth.xml.template For more details you could also check https://issues.apache.org/jira/browse/NUTCH-827 https://issues.apache.org/jira/browse/NUTCH-1943 Cheers, Sebastian 2015-03-12 7:59 GMT+01:00 Tizy Ninan tizy1...@gmail.com: Hi, Is there any detailed step by step explanation on how to implement HTTPPostAuthentication on Nutch 1.10.? Thanks and Regards, Tizy
[jira] [Updated] (NUTCH-1962) Need to have mimetype-filter.txt file available by default
[ https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Luis Betancourt Gonzalez updated NUTCH-1962: -- Attachment: NUTCH-1962.patch Need to have mimetype-filter.txt file available by default -- Key: NUTCH-1962 URL: https://issues.apache.org/jira/browse/NUTCH-1962 Project: Nutch Issue Type: Improvement Components: plugin Reporter: Lewis John McGibbney Fix For: 1.10 Attachments: NUTCH-1962.patch By default the mimetype-filter.txt file quoted within nutch-default.xml is not available. We need to provide this as it is a PITA to constantly have to add it to new crawler configurations. https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1962) Need to have mimetype-filter.txt file available by default
[ https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359894#comment-14359894 ] Lewis John McGibbney commented on NUTCH-1962: - +1 commit thanks Jorge On Thursday, March 12, 2015, Jorge Luis Betancourt Gonzalez (JIRA) -- *Lewis* Need to have mimetype-filter.txt file available by default -- Key: NUTCH-1962 URL: https://issues.apache.org/jira/browse/NUTCH-1962 Project: Nutch Issue Type: Improvement Components: plugin Reporter: Lewis John McGibbney Fix For: 1.10 Attachments: NUTCH-1962.patch By default the mimetype-filter.txt file quoted within nutch-default.xml is not available. We need to provide this as it is a PITA to constantly have to add it to new crawler configurations. https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1962) Need to have mimetype-filter.txt file available by default
[ https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359931#comment-14359931 ] Jorge Luis Betancourt Gonzalez commented on NUTCH-1962: --- Committed r1666356. Need to have mimetype-filter.txt file available by default -- Key: NUTCH-1962 URL: https://issues.apache.org/jira/browse/NUTCH-1962 Project: Nutch Issue Type: Improvement Components: plugin Reporter: Lewis John McGibbney Fix For: 1.10 Attachments: NUTCH-1962.patch By default the mimetype-filter.txt file quoted within nutch-default.xml is not available. We need to provide this as it is a PITA to constantly have to add it to new crawler configurations. https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1962) Need to have mimetype-filter.txt file available by default
[ https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359935#comment-14359935 ] Hudson commented on NUTCH-1962: --- SUCCESS: Integrated in Nutch-trunk #3012 (See [https://builds.apache.org/job/Nutch-trunk/3012/]) NUTCH-1962 Need to have mimetype-filter.txt file available by default (jorgelbg: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1666356) * /nutch/trunk/conf/mimetype-filter.txt Need to have mimetype-filter.txt file available by default -- Key: NUTCH-1962 URL: https://issues.apache.org/jira/browse/NUTCH-1962 Project: Nutch Issue Type: Improvement Components: plugin Reporter: Lewis John McGibbney Fix For: 1.10 Attachments: NUTCH-1962.patch By default the mimetype-filter.txt file quoted within nutch-default.xml is not available. We need to provide this as it is a PITA to constantly have to add it to new crawler configurations. https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: HTTP Post Authentication
Hi Lewis, Thank you for the reply. I tried by providing the parameters specified in the httpclient-auth.xml template file. But while crawling I am getting the following warnings. WARN httpclient.Http: Bad auth conf file: root element credentials found in httpclient-auth.xml - must be auth-configuration WARN httpclient.Http: Bad auth conf file: Element loginPostData not recognized in httpclient-auth.xml - expected credentials WARN httpclient.Http: Bad auth conf file: Element additionalPostHeaders not recognized in httpclient-auth.xml - expected credentials The httpclient-auth.xml file is placed in the conf folder. The version of nutch used is nutch 1.10 (trunk). Could you please explain what could be wrong? Thanks, Tizy On Fri, Mar 13, 2015 at 1:26 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Tizy, On Thu, Mar 12, 2015 at 12:20 AM, user-digest-h...@nutch.apache.org wrote: Is there any detailed step by step explanation on how to implement HTTPPostAuthentication on Nutch 1.10.? https://github.com/apache/nutch/blob/trunk/conf/httpclient-auth.xml.template#L61-L105 https://wiki.apache.org/nutch/HttpPostAuthentication HTH Lewis -- Thanks and Regards, Tizy
[jira] [Created] (NUTCH-1963) CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
Lewis John McGibbney created NUTCH-1963: --- Summary: CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked Key: NUTCH-1963 URL: https://issues.apache.org/jira/browse/NUTCH-1963 Project: Nutch Issue Type: Bug Components: commoncrawl Affects Versions: 1.10 Reporter: Lewis John McGibbney Fix For: 1.10 When invoking the commoncrawldump tool with the *-gzip* option and *-mimetype application/pdf*, I get the following stack trace, which results in a failure of the task {code} java.lang.RuntimeException: file name 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf' is too long ( > 100 bytes) at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674) at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275) at org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400) at org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236) {code} The workaround is to skip the *-gzip* option and defer compression to a later task; however, this is a workaround and not a solution. We need to fix this in order for the tool to work as designed and required. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1959) Improving CommonCrawlFormat implementations
[ https://issues.apache.org/jira/browse/NUTCH-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1959: Attachment: NUTCH-1959.v02.patch Giuseppe's patch Improving CommonCrawlFormat implementations --- Key: NUTCH-1959 URL: https://issues.apache.org/jira/browse/NUTCH-1959 Project: Nutch Issue Type: Improvement Affects Versions: 1.9 Reporter: Giuseppe Totaro Assignee: Chris A. Mattmann Priority: Minor Attachments: NUTCH-1959.patch, NUTCH-1959.v02.patch {{CommonCrawlFormat}} is an interface for Java classes that implement methods for writing data into Common Crawl format. {{AbstractCommonCrawlFormat}} is an abstract class that implements {{CommonCrawlFormat}} and provides abstract methods for CommonCrawl formatter classes. Attached is a PATCH that includes some improvements for {{CommonCrawlFormat}}-based classes: * {{CommonCrawlFormat}} and {{AbstractCommonCrawlFormat}} now provide only the {{getJsonData()}} method, responsible for returning the JSON data. * {{AbstractCommonCrawlFormat}} also provides the abstract methods that each subclass has to implement in order to handle JSON objects. * {{CommonCrawlFormatSimple}} is a {{StringBuilder}}-based formatter that now also escapes JSON string values. This PATCH aims at providing a better interface for implementing/extending {{CommonCrawlFormat}} classes. I would really appreciate your feedback. Thanks a lot, Giuseppe -- This message was sent by Atlassian JIRA (v6.3.4#6332)
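The layering described in this issue (an interface exposing only a JSON getter, an abstract class that defers JSON handling to subclasses, and a StringBuilder-based formatter that escapes string values) might look roughly like the sketch below. It is not the patch itself: every type and method name here except `getJsonData()` is hypothetical, and the set of fields written is an assumption chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CommonCrawlFormatSketch {

    /** Hypothetical stand-in for the interface: a single JSON getter. */
    interface Format {
        String getJsonData();
    }

    /** Hypothetical abstract base: fixes the field order, defers JSON handling. */
    abstract static class AbstractFormat implements Format {
        private final String url;
        private final Map<String, String> metadata;

        AbstractFormat(String url, Map<String, String> metadata) {
            this.url = url;
            this.metadata = metadata;
        }

        protected abstract void startObject();
        protected abstract void writeKeyValue(String key, String value);
        protected abstract void closeObject();
        protected abstract String generateJson();

        @Override
        public String getJsonData() {
            startObject();
            writeKeyValue("url", url);
            for (Map.Entry<String, String> e : metadata.entrySet()) {
                writeKeyValue(e.getKey(), e.getValue());
            }
            closeObject();
            return generateJson();
        }
    }

    /** StringBuilder-based formatter that escapes JSON string values. */
    static class SimpleFormat extends AbstractFormat {
        private final StringBuilder sb = new StringBuilder();
        private boolean first;

        SimpleFormat(String url, Map<String, String> metadata) {
            super(url, metadata);
        }

        @Override protected void startObject() {
            sb.setLength(0); // reset so getJsonData() can be called repeatedly
            first = true;
            sb.append('{');
        }

        @Override protected void writeKeyValue(String key, String value) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(escape(key)).append("\":\"")
              .append(escape(value)).append('"');
        }

        @Override protected void closeObject() { sb.append('}'); }

        @Override protected String generateJson() { return sb.toString(); }

        /** Escapes quotes, backslashes and control characters for a JSON string. */
        static String escape(String s) {
            StringBuilder out = new StringBuilder(s.length());
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                switch (c) {
                    case '"':  out.append("\\\""); break;
                    case '\\': out.append("\\\\"); break;
                    case '\n': out.append("\\n"); break;
                    case '\r': out.append("\\r"); break;
                    case '\t': out.append("\\t"); break;
                    case '\b': out.append("\\b"); break;
                    case '\f': out.append("\\f"); break;
                    default:
                        if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                        else out.append(c);
                }
            }
            return out.toString();
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "He said \"hi\"");
        System.out.println(new SimpleFormat("http://example.org/", metadata).getJsonData());
    }
}
```

Keeping the field order in the abstract class means a second formatter (for instance one backed by a JSON library) would only need to implement the four protected methods.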
[jira] [Commented] (NUTCH-1963) CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
[ https://issues.apache.org/jira/browse/NUTCH-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359127#comment-14359127 ] Giuseppe Totaro commented on NUTCH-1963: Thanks a lot [~lewismc]. We can solve this problem using {{setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU)}} for {{TarArchiveOutputStream}} ([Apache Commons Compress|http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/archivers/tar/TarArchiveOutputStream.html]). I will update the patch soon in [NUTCH-1959|https://issues.apache.org/jira/browse/NUTCH-1959]. Thank you, Giuseppe CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked --- Key: NUTCH-1963 URL: https://issues.apache.org/jira/browse/NUTCH-1963 Project: Nutch Issue Type: Bug Components: commoncrawl Affects Versions: 1.10 Reporter: Lewis John McGibbney Fix For: 1.10 When invoking the commoncrawldump tool with the *-gzip* option and *-mimetype application/pdf*, I get the following stack trace, which results in a failure of the task {code} java.lang.RuntimeException: file name 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf' is too long ( > 100 bytes) at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674) at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275) at org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400) at org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236) {code} The workaround is to skip the *-gzip* option and defer compression to a later task; however, this is a workaround and not a solution. We need to fix this in order for the tool to work as designed and required. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
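To see why the file name in the stack trace is rejected, it may help to check its encoded length against the 100-byte name field of a classic tar header. The sketch below uses only the JDK; the class and method names are hypothetical, and the `>` comparison simply mirrors the limit quoted in the error message (the library's exact boundary handling is not verified here). The fix proposed in the comment, calling `setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU)` on the Commons Compress stream, is shown only as a comment because it needs that library on the classpath.

```java
import java.nio.charset.StandardCharsets;

class TarNameLengthSketch {

    /** Size of the name field in a classic tar header, in bytes. */
    static final int TAR_NAME_FIELD_BYTES = 100;

    /** True if the UTF-8 encoded entry name does not fit the classic name field. */
    static boolean needsLongNameSupport(String entryName) {
        return entryName.getBytes(StandardCharsets.UTF_8).length > TAR_NAME_FIELD_BYTES;
    }

    public static void main(String[] args) {
        String name = "Socio-Economic%20Impact%20of%20Ebola%20on%20Households"
                + "%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf";
        System.out.println("needs long-name support: " + needsLongNameSupport(name));

        // With Apache Commons Compress available, the suggestion from the comment
        // would be applied before adding entries, along the lines of:
        //   tarOutput.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU);
        // where tarOutput is the TarArchiveOutputStream used by the dumper.
    }
}
```

The length is measured in bytes rather than characters because the header stores the encoded name, so non-ASCII names hit the limit sooner than their character count suggests.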
[jira] [Updated] (NUTCH-1957) FileDumper output file name collisions
[ https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renxia Wang updated NUTCH-1957: --- Attachment: NUTCH-1957.patch FileDumper output file name collisions -- Key: NUTCH-1957 URL: https://issues.apache.org/jira/browse/NUTCH-1957 Project: Nutch Issue Type: Bug Components: tool Affects Versions: 1.10 Reporter: Renxia Wang Priority: Minor Labels: dumper, filename, tools Attachments: NUTCH-1957.patch The FileDumper extracts the file base name and extension and uses basename.extension (e.g. given the URL https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the basename.extension will be project.html) as the name of the dumped file. Code from FileDumper.java: String url = key.toString(); String baseName = FilenameUtils.getBaseName(url); String extension = FilenameUtils.getExtension(url); ... String filename = baseName + "." + extension; This introduces file name collisions and leads to loss of data when using bin/nutch dump. 
Sample logs: 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL: http://beringsea.eol.ucar.edu/data/ 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL: http://catalog.eol.ucar.edu/ 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Carin%20Ashjian/project.html 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Christopher%20Arp/project.html 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Juha%20Alatalo/project.html 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Kerim%20Aydin/project.html 2015-03-10 23:38:46,413 
INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Knut%20Aagaard/project.html 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Mary%20Albert/project.html 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Yarrow%20Axford/project.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1957) FileDumper output file name collisions
[ https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renxia Wang updated NUTCH-1957: --- Patch Info: Patch Available FileDumper output file name collisions -- Key: NUTCH-1957 URL: https://issues.apache.org/jira/browse/NUTCH-1957 Project: Nutch Issue Type: Bug Components: tool Affects Versions: 1.10 Reporter: Renxia Wang Priority: Minor Labels: dumper, filename, tools The FileDumper extracts the file base name and extension and uses basename.extension (e.g. given the URL https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the basename.extension will be project.html) as the name of the dumped file. Code from FileDumper.java: String url = key.toString(); String baseName = FilenameUtils.getBaseName(url); String extension = FilenameUtils.getExtension(url); ... String filename = baseName + "." + extension; This introduces file name collisions and leads to loss of data when using bin/nutch dump. Sample logs: 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL: http://beringsea.eol.ucar.edu/data/ 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL: http://catalog.eol.ucar.edu/ 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Carin%20Ashjian/project.html 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Christopher%20Arp/project.html 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html 
2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Juha%20Alatalo/project.html 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Kerim%20Aydin/project.html 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Knut%20Aagaard/project.html 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Mary%20Albert/project.html 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Yarrow%20Axford/project.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[Nutch Wiki] Update of Nutch_1.X_RESTAPI by SujenShah
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The Nutch_1.X_RESTAPI page has been changed by SujenShah: https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI?action=diff&rev1=2&rev2=3 Ok + === Configuration === + Configuration's list + + GET /config + + + __Response__ contains names of available configurations. + + [default,custom-config] + + + Configuration parameters + + GET /config/{configuration name} + + Examples: + GET /config/default + GET /config/custom-config + + + __Response__ contains parameters with values + + { +anchorIndexingFilter.deduplicate:false, +crawl.gen.delay:60480, +db.fetch.interval.default:2592000, +db.fetch.interval.max:7776000, + + +} + + + Get property value + + GET /config/{configuration name}/{property} + + Examples: + GET /config/default/anchorIndexingFilter.deduplicate + + + __Response__ contains parameter's value as string + + false + + + Create configuration + Creates a new Nutch configuration with the given parameters. If the force field is true, an already existing configuration will be overridden; otherwise it is left unchanged. + + POST /config/{configuration name} + + Examples: + POST /config/new-config +{ + configId:new-config, + force:true, + params:{anchorIndexingFilter.deduplicate:false,... } +} + + + + __Response__ is the created config's id. + + new-config + + + Delete configuration + + DELETE /config/{configuration name} + + Examples: + DELETE /config/new-config + + === Jobs === This point allows job management, including creation, job information and killing of a job. Listing all jobs
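The configuration endpoints listed in this wiki change could be exercised from Java along the following lines. This is only a sketch under stated assumptions: the base URL `http://localhost:8081`, the `Content-Type` header, and the class and method names are all illustrative and are not taken from the wiki page. The requests are built but not sent, so the example runs without a Nutch server.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class NutchConfigRequestsSketch {

    // Assumption: address of a locally running Nutch REST server; adjust as needed.
    static final String BASE_URL = "http://localhost:8081";

    /** GET /config : names of the available configurations. */
    static HttpRequest listConfigs() {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/config")).GET().build();
    }

    /** GET /config/{configuration name} : parameters with values. */
    static HttpRequest getConfig(String name) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/config/" + name)).GET().build();
    }

    /** GET /config/{configuration name}/{property} : a single property value. */
    static HttpRequest getProperty(String name, String property) {
        URI uri = URI.create(BASE_URL + "/config/" + name + "/" + property);
        return HttpRequest.newBuilder(uri).GET().build();
    }

    /** POST /config/{configuration name} : create a configuration from a JSON body. */
    static HttpRequest createConfig(String name, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/config/" + name))
                .header("Content-Type", "application/json") // assumed content type
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    /** DELETE /config/{configuration name} : remove a configuration. */
    static HttpRequest deleteConfig(String name) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/config/" + name)).DELETE().build();
    }

    public static void main(String[] args) {
        // The quoting and value types in this body are assumptions; the wiki diff
        // as archived shows the fields configId, force and params without quotes.
        String body = "{\"configId\":\"new-config\",\"force\":\"true\","
                + "\"params\":{\"anchorIndexingFilter.deduplicate\":\"false\"}}";
        HttpRequest[] requests = {
            listConfigs(),
            getProperty("default", "anchorIndexingFilter.deduplicate"),
            createConfig("new-config", body),
            deleteConfig("new-config")
        };
        for (HttpRequest r : requests) {
            System.out.println(r.method() + " " + r.uri());
        }
        // To actually send one: HttpClient.newHttpClient()
        //     .send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Configuration names and property names are assumed to be URL-safe here; names containing spaces or other reserved characters would need encoding before being placed in the path.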
HTTP Post Authentication
Hi, Is there any detailed step by step explanation on how to implement HTTPPostAuthentication on Nutch 1.10.? Thanks and Regards, Tizy
[jira] [Assigned] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-1960: Assignee: Chris A. Mattmann JUnit test for dump method of CommonCrawlDataDumper --- Key: NUTCH-1960 URL: https://issues.apache.org/jira/browse/NUTCH-1960 Project: Nutch Issue Type: Test Affects Versions: 1.9 Reporter: Giuseppe Totaro Assignee: Chris A. Mattmann Priority: Minor Attachments: NUTCH-1960.patch, test-segments.tar.gz Hi all, attached is the PATCH including an extremely simple JUnit test for the {{dump}} method of the {{CommonCrawlDataDumper}} class. Essentially, it checks whether {{dump}} is able to create a given list of files from Nutch segments (in {{testresources}}). Thanks a lot, Giuseppe -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1960 started by Chris A. Mattmann. JUnit test for dump method of CommonCrawlDataDumper --- Key: NUTCH-1960 URL: https://issues.apache.org/jira/browse/NUTCH-1960 Project: Nutch Issue Type: Test Affects Versions: 1.9 Reporter: Giuseppe Totaro Assignee: Chris A. Mattmann Priority: Minor Attachments: NUTCH-1960.patch, test-segments.tar.gz Hi all, attached is the PATCH including an extremely simple JUnit test for the {{dump}} method of the {{CommonCrawlDataDumper}} class. Essentially, it checks whether {{dump}} is able to create a given list of files from Nutch segments (in {{testresources}}). Thanks a lot, Giuseppe -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (NUTCH-1959) Improving CommonCrawlFormat implementations
[ https://issues.apache.org/jira/browse/NUTCH-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1959 started by Chris A. Mattmann. Improving CommonCrawlFormat implementations --- Key: NUTCH-1959 URL: https://issues.apache.org/jira/browse/NUTCH-1959 Project: Nutch Issue Type: Improvement Affects Versions: 1.9 Reporter: Giuseppe Totaro Assignee: Chris A. Mattmann Priority: Minor Attachments: NUTCH-1959.patch {{CommonCrawlFormat}} is an interface for Java classes that implement methods for writing data into Common Crawl format. {{AbstractCommonCrawlFormat}} is an abstract class that implements {{CommonCrawlFormat}} and provides abstract methods for CommonCrawl formatter classes. Attached is a PATCH that includes some improvements for {{CommonCrawlFormat}}-based classes: * {{CommonCrawlFormat}} and {{AbstractCommonCrawlFormat}} now provide only the {{getJsonData()}} method, responsible for returning the JSON data. * {{AbstractCommonCrawlFormat}} also provides the abstract methods that each subclass has to implement in order to handle JSON objects. * {{CommonCrawlFormatSimple}} is a {{StringBuilder}}-based formatter that now also escapes JSON string values. This PATCH aims at providing a better interface for implementing/extending {{CommonCrawlFormat}} classes. I would really appreciate your feedback. Thanks a lot, Giuseppe -- This message was sent by Atlassian JIRA (v6.3.4#6332)