Build failed in Hudson: Nutch-trunk #649
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/649/changes Changes: [kubes] NUTCH-667: Input Format for working with Content in Hadoop Streaming [kubes] NUTCH-665: Search Load Testing Tool [kubes] NUTCH-647: Resolve URLs tool [kubes] NUTCH-647: Resolve URLs tool [kubes] NUTCH-663: Upgrade Nutch to use Hadoop 0.19 [kubes] NUTCH-662: Upgrade Nutch to use Lucene 2.4 -- [...truncated 2151 lines...] A src/plugin/protocol-http/src/test/org/apache A src/plugin/protocol-http/src/test/org/apache/nutch A src/plugin/protocol-http/src/test/org/apache/nutch/protocol A src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http A src/plugin/protocol-http/src/java A src/plugin/protocol-http/src/java/org A src/plugin/protocol-http/src/java/org/apache A src/plugin/protocol-http/src/java/org/apache/nutch A src/plugin/protocol-http/src/java/org/apache/nutch/protocol A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http AU src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/package.html AUsrc/plugin/protocol-http/plugin.xml AUsrc/plugin/protocol-http/build.xml A bin AUbin/nutch A docs A docs/ms A docs/ms/search.html A docs/ms/help.html A docs/ms/about.html A docs/zh A docs/zh/search.html A docs/zh/help.html A docs/zh/about.html A docs/ca A docs/ca/search.html A docs/ca/help.html A docs/ca/about.html A docs/pt A docs/pt/search.html A docs/pt/help.html A docs/pt/about.html A docs/sr AUdocs/sr/search.html AUdocs/sr/help.html AUdocs/sr/about.html A docs/sv A docs/sv/search.html A docs/sv/help.html A docs/sv/about.html A docs/de A docs/de/search.html A docs/de/help.html A docs/de/about.html A docs/fi A docs/fi/search.html A docs/fi/help.html A docs/fi/about.html A docs/en A docs/en/search.html A docs/en/help.html A docs/en/about.html A docs/es A docs/es/search.html A docs/es/help.html A 
docs/es/about.html A docs/fr A docs/fr/search.html AUdocs/fr/help.html A docs/fr/about.html A docs/jp A docs/jp/search.html A docs/jp/help.html A docs/jp/about.html A docs/nl A docs/nl/search.html A docs/nl/help.html A docs/nl/about.html A docs/sh AUdocs/sh/search.html AUdocs/sh/help.html AUdocs/sh/about.html A docs/th A docs/th/search.html A docs/th/help.html A docs/th/about.html A docs/pl A docs/pl/search.html A docs/pl/help.html A docs/pl/about.html A docs/it AUdocs/it/search.html AUdocs/it/help.html AUdocs/it/about.html A docs/img A docs/img/lang AUdocs/img/lang/romanian.png AUdocs/img/lang/bulgarian.png AUdocs/img/lang/spanish.png AUdocs/img/lang/danish.png AUdocs/img/lang/dutch.png AUdocs/img/lang/icelandic.png AUdocs/img/lang/hungarian.png AUdocs/img/lang/russian.png AUdocs/img/lang/japanese.png AUdocs/img/lang/turkish.png AUdocs/img/lang/suomi.png AUdocs/img/lang/lithuanian.png AUdocs/img/lang/czech.png AUdocs/img/lang/greek.png AUdocs/img/lang/galego.png AUdocs/img/lang/polish.png AUdocs/img/lang/latvian.png AUdocs/img/lang/croatian.png AUdocs/img/lang/portuguese.png AUdocs/img/lang/french.png AUdocs/img/lang/swedish.png AUdocs/img/lang/german.png AUdocs/img/lang/chinese.png AUdocs/img/lang/malaysian.png AUdocs/img/lang/korean.png AUdocs/img/lang/arabic.png AUdocs/img/lang/italian.png AUdocs/img/lang/brazil.png AUdocs/img/lang/catala.png AUdocs/img/lang/thai.png AUdocs/img/lang/indonesian.png AUdocs/img/lang/norwegian.png AUdocs/img/lang/english.png AUdocs/img/poweredbynutch_01.gif AUdocs/img/poweredbynutch_02.gif A docs/img/reiter AUdocs/img/reiter/reiter_inactive_le.gif AUdocs/img/reiter/_spacer_cc.gif AUdocs/img/reiter/reiter_inactive_le1.gif AUdocs/img/reiter/bg_subnavi.gif AUdocs/img/reiter/002bg_fle.gif AUdocs/img/reiter/spacer_66.gif AUdocs/img/reiter/ul.gif AUdocs/img/reiter/_bg_reiter.gif AUdocs/img/reiter/logo_nutch.gif AU
[jira] Updated: (NUTCH-668) Domain URL Filter
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-668: --- Attachment: NUTCH-668-1-20081202.patch Includes the DomainURLFilter and test files. Domains can be filtered either by top level domain, ignoring subdomains, or by hostname, through configuration. There is a configuration file where valid domains are placed one per line. Those domains are used to build a set of valid domains against which URLs are validated at runtime. Only URLs that match a domain in the set are considered valid. > Domain URL Filter > - > > Key: NUTCH-668 > URL: https://issues.apache.org/jira/browse/NUTCH-668 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes > Fix For: 1.0.0 > > Attachments: NUTCH-668-1-20081202.patch > > > A URLFilter that adds the ability to filter out URLs by top level domain or > by hostname. A configuration file with a listing of URLs is used to denote > accepted URLs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
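The filtering idea described in this issue can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the attached patch: the class name `DomainFilterSketch`, its methods, and the naive "last two labels of the host" rule for finding a domain are all hypothetical, and the real DomainURLFilter may determine domains differently. Returning null for a rejected URL follows the usual Nutch URLFilter convention.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not the DomainURLFilter from the patch.
class DomainFilterSketch {
    private final Set<String> allowed = new HashSet<>();

    // Each entry mimics one line of the configuration file:
    // either a domain (apache.org) or a full hostname (www.example.com).
    DomainFilterSketch(List<String> lines) {
        for (String line : lines) {
            String entry = line.trim().toLowerCase();
            if (!entry.isEmpty()) {
                allowed.add(entry);
            }
        }
    }

    // Naive assumption: the domain is the last two labels of the host name.
    static String domainOf(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n < 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // Returns the URL if its hostname or its domain is in the allowed set,
    // otherwise null (also null for URLs that cannot be parsed).
    String filter(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null;
            }
            host = host.toLowerCase();
            if (allowed.contains(host) || allowed.contains(domainOf(host))) {
                return url;
            }
            return null;
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

With the entries `apache.org` and `www.example.com`, this sketch accepts any host under apache.org but only the exact host www.example.com.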
[jira] Created: (NUTCH-668) Domain URL Filter
Domain URL Filter - Key: NUTCH-668 URL: https://issues.apache.org/jira/browse/NUTCH-668 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 A URLFilter that adds the ability to filter out URLs by top level domain or by hostname. A configuration file with a listing of URLs is used to denote accepted urls. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-667) Input Format for working with Content in Hadoop Streaming
[ https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-667. Resolution: Fixed Committed with revision 722483 > Input Format for working with Content in Hadoop Streaming > - > > Key: NUTCH-667 > URL: https://issues.apache.org/jira/browse/NUTCH-667 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes >Priority: Minor > Fix For: 1.0.0 > > Attachments: NUTCH-667-1-20081126.patch > > > This is a ContentAsText input format that replaces line endings with spaces > so that Nutch content can be used more effectively inside of Hadoop > streaming jobs, which allow MapReduce jobs to be written in any language that > can communicate with stdin and stdout. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
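The reason for flattening line endings is that Hadoop streaming passes records to the external program one line at a time, so content containing newlines would be split across records. A minimal sketch of that one step, assuming a hypothetical helper `FlattenSketch.flatten` (not part of the committed input format):

```java
// Hypothetical illustration; not the input format committed for NUTCH-667.
class FlattenSketch {
    // Replaces each line ending (CRLF, CR, or LF) with a single space so the
    // whole content fits on one line of a streaming job's stdin.
    static String flatten(String content) {
        return content.replaceAll("\\r\\n|\\r|\\n", " ");
    }
}
```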
[jira] Closed: (NUTCH-667) Input Format for working with Content in Hadoop Streaming
[ https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-667. -- > Input Format for working with Content in Hadoop Streaming > - > > Key: NUTCH-667 > URL: https://issues.apache.org/jira/browse/NUTCH-667 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes >Priority: Minor > Fix For: 1.0.0 > > Attachments: NUTCH-667-1-20081126.patch > > > This is a ContentAsText input format that replaces line endings with spaces > so that Nutch content can be used more effectively inside of Hadoop > streaming jobs, which allow MapReduce jobs to be written in any language that > can communicate with stdin and stdout. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-665) Search Load Testing Tool
[ https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-665. -- > Search Load Testing Tool > > > Key: NUTCH-665 > URL: https://issues.apache.org/jira/browse/NUTCH-665 > Project: Nutch > Issue Type: New Feature > Components: searcher >Affects Versions: 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes >Priority: Minor > Fix For: 1.0.0 > > Attachments: NUTCH-665-20081126-1.patch > > > A tool that spawns a number of threads and executes searches against > configured search servers. This is used for light load testing of search > servers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-665) Search Load Testing Tool
[ https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-665. Resolution: Fixed Committed with revision 722481 > Search Load Testing Tool > > > Key: NUTCH-665 > URL: https://issues.apache.org/jira/browse/NUTCH-665 > Project: Nutch > Issue Type: New Feature > Components: searcher >Affects Versions: 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes >Priority: Minor > Fix For: 1.0.0 > > Attachments: NUTCH-665-20081126-1.patch > > > A tool that spawns a number of threads and executes searches against > configured search servers. This is used for light load testing of search > servers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-647) Resolve URLs tool
[ https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-647. Resolution: Fixed Fix Version/s: 1.0.0 Committed with revision 722478 > Resolve URLs tool > - > > Key: NUTCH-647 > URL: https://issues.apache.org/jira/browse/NUTCH-647 > Project: Nutch > Issue Type: New Feature > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes > Fix For: 1.0.0 > > Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch > > > A tool that takes a listing of urls and attempts to resolve their IP > addresses. Useful for running after the fetcher has run to determine if DNS > problems exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
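The core check such a tool performs (take a URL, extract its host, try to resolve it) can be sketched with the standard `java.net` classes. This is a rough illustration only, not the tool committed for this issue; the class name `ResolveSketch`, the `resolve` method, and the "UNRESOLVED" output label are hypothetical.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;

// Hypothetical illustration; not the tool committed for NUTCH-647.
class ResolveSketch {
    // Returns the resolved IP address as text, or null if the URL has no host,
    // cannot be parsed, or the host cannot be resolved (a possible DNS problem).
    static String resolve(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null;
            }
            return InetAddress.getByName(host).getHostAddress();
        } catch (URISyntaxException | UnknownHostException e) {
            return null;
        }
    }

    // Prints one line per URL given on the command line.
    public static void main(String[] args) {
        for (String url : args) {
            String ip = resolve(url);
            System.out.println(url + "\t" + (ip == null ? "UNRESOLVED" : ip));
        }
    }
}
```

A real tool would presumably read the URL listing from a file or segment and run lookups in parallel; the sketch processes its arguments one at a time.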
[jira] Closed: (NUTCH-647) Resolve URLs tool
[ https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-647. -- > Resolve URLs tool > - > > Key: NUTCH-647 > URL: https://issues.apache.org/jira/browse/NUTCH-647 > Project: Nutch > Issue Type: New Feature > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes > Fix For: 1.0.0 > > Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch > > > A tool that takes a listing of urls and attempts to resolve their IP > addresses. Useful for running after the fetcher has run to determine if DNS > problems exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-663. -- > Upgrade Nutch to use Hadoop 0.19 > > > Key: NUTCH-663 > URL: https://issues.apache.org/jira/browse/NUTCH-663 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes > Fix For: 1.0.0 > > Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, > NUTCH-663-1-20081126.patch > > > Upgrade Nutch to use a newer Hadoop, version 0.19. This includes > performance improvements, bug fixes, and new functionality. Changes some > current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-663. Resolution: Fixed Committed with revision 722477 > Upgrade Nutch to use Hadoop 0.19 > > > Key: NUTCH-663 > URL: https://issues.apache.org/jira/browse/NUTCH-663 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes > Fix For: 1.0.0 > > Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, > NUTCH-663-1-20081126.patch > > > Upgrade Nutch to use a newer Hadoop, version 0.19. This includes > performance improvements, bug fixes, and new functionality. Changes some > current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-662) Upgrade Nutch to use Lucene 2.4
[ https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-662. Resolution: Fixed Committed with revision 722475 > Upgrade Nutch to use Lucene 2.4 > --- > > Key: NUTCH-662 > URL: https://issues.apache.org/jira/browse/NUTCH-662 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes > Fix For: 1.0.0 > > Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, > lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch > > > Upgrade Nutch to use Lucene 2.4. This release changes the Lucene file > format. New indexes created by this Lucene version will NOT be readable by > older versions. Lucene 2.4 can read and update older index formats, although > updating an older format will convert it to the new format. There are also > some performance and functionality improvements. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-662) Upgrade Nutch to use Lucene 2.4
[ https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-662. -- closed > Upgrade Nutch to use Lucene 2.4 > --- > > Key: NUTCH-662 > URL: https://issues.apache.org/jira/browse/NUTCH-662 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes > Fix For: 1.0.0 > > Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, > lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch > > > Upgrade Nutch to use Lucene 2.4. This release changes the Lucene file > format. New indexes created by this Lucene version will NOT be readable by > older versions. Lucene 2.4 can read and update older index formats, although > updating an older format will convert it to the new format. There are also > some performance and functionality improvements. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Pending Commits for Nutch Issues
I agree with John too. You probably meant $0.02, since 0.02 cents is too little. It is usually 2 cents. :-P Regards, Susam Pal On Tue, Dec 2, 2008 at 6:09 PM, John Martyniak <[EMAIL PROTECTED]> wrote: > Is NUTCH-442 going to be part of the 1.0 release? I hope so, Nutch/Solr > integration would be huge. > > just my .02 cents. > > -John > > On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote: > > And here is a list of issues from me that need more discussion/review: >> >> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to >> review for people, for now we can just write a SolrIndexer like Sami >> Siren's and deal with 442 after 1.0. I would be happy to provide such >> a patch. >> >> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I >> don't know how to fix this one but indexing almost always fails with >> index-more enabled. >> >> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate >> fetch interval correctly: I botched it once so now I am afraid to >> commit it :D >> >> NUTCH-626 - fetcher2 breaks out the domain with >> db.ignore.external.links set at cross domain redirects: I am going to >> update the patch and commit it if no objections. >> >> Also, I think NUTCH-658 would be a nice feature for 1.0. >> >> There are some others but these are the most recent and we really >> should push 1.0 out the door already :D >> >> Oh and finally we should do a review of all libraries in nutch >> (libraries in plugins included) and update them to latest versions. I >> am going to open an issue with the intention of updating all the >> libraries that do not require code changes. >> >> -- >> Doğacan Güney >> > >
Re: Pending Commits for Nutch Issues
I agree with John. NUTCH-442 is by far the most popular/watched item in JIRA and, I think, has already been used by enough different people to be deemed reliable. Julien 2008/12/2 John Martyniak <[EMAIL PROTECTED]> > Is NUTCH-442 going to be part of the 1.0 release? I hope so, Nutch/Solr > integration would be huge. > > just my .02 cents. > > -John > > > On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote: > > And here is a list of issues from me that need more discussion/review: >> >> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to >> review for people, for now we can just write a SolrIndexer like Sami >> Siren's and deal with 442 after 1.0. I would be happy to provide such >> a patch. >> >> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I >> don't know how to fix this one but indexing almost always fails with >> index-more enabled. >> >> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate >> fetch interval correctly: I botched it once so now I am afraid to >> commit it :D >> >> NUTCH-626 - fetcher2 breaks out the domain with >> db.ignore.external.links set at cross domain redirects: I am going to >> update the patch and commit it if no objections. >> >> Also, I think NUTCH-658 would be a nice feature for 1.0. >> >> There are some others but these are the most recent and we really >> should push 1.0 out the door already :D >> >> Oh and finally we should do a review of all libraries in nutch >> (libraries in plugins included) and update them to latest versions. I >> am going to open an issue with the intention of updating all the >> libraries that do not require code changes. >> >> -- >> Doğacan Güney >> > > -- DigitalPebble Ltd http://www.digitalpebble.com
Re: Pending Commits for Nutch Issues
Is NUTCH-442 going to be part of the 1.0 release? I hope so, Nutch/Solr integration would be huge. just my .02 cents. -John On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote: And here is a list of issues from me that need more discussion/review: NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to review for people, for now we can just write a SolrIndexer like Sami Siren's and deal with 442 after 1.0. I would be happy to provide such a patch. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I don't know how to fix this one but indexing almost always fails with index-more enabled. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate fetch interval correctly: I botched it once so now I am afraid to commit it :D NUTCH-626 - fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects: I am going to update the patch and commit it if no objections. Also, I think NUTCH-658 would be a nice feature for 1.0. There are some others but these are the most recent and we really should push 1.0 out the door already :D Oh and finally we should do a review of all libraries in nutch (libraries in plugins included) and update them to latest versions. I am going to open an issue with the intention of updating all the libraries that do not require code changes. -- Doğacan Güney
named parameters in crawl command
Hi all, I've defined a couple of custom parameters for bin/nutch, for example a "-conf" parameter to set the conf dir from the command line. To be able to use the crawl command, I have to adjust the for-loop and the if/else statements that process the command line arguments args[] in Crawl.java in order to make my new parameters known to the class, because otherwise it takes the last "unknown" parameter as the URL input directory (last else if statement). Wouldn't it be better to use a named parameter for the URL directory, as for all the other parameters? That way, one wouldn't have to change Nutch core classes to use custom input parameters, because they would simply be discarded if the Java program has no use for them. What do you think? In my opinion the change to version 1.0 would be a good point in time to introduce a slightly different usage of the standard crawl command. Kind regards, Martina
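To make the proposal concrete, here is a rough sketch of argument handling in which the URL directory is a named parameter and unknown parameters are skipped rather than mistaken for it. It is not the existing code in the Crawl class: the class name `NamedArgsSketch` and the parameter name "-urls" are hypothetical, and the sketch assumes every parameter takes exactly one value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not the argument handling in Nutch's Crawl class.
class NamedArgsSketch {
    // "-urls" is an assumed name for the proposed URL directory parameter;
    // the other names are the crawl command's existing options.
    static final Set<String> KNOWN =
        new HashSet<>(Arrays.asList("-urls", "-dir", "-threads", "-depth", "-topN"));

    // Collects "-name value" pairs. Pairs with unknown names (for example a
    // custom "-conf") are skipped instead of being read as the URL directory.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i++) {
            if (args[i].startsWith("-")) {
                String name = args[i];
                String value = args[++i]; // consume the value that follows the name
                if (KNOWN.contains(name)) {
                    options.put(name, value);
                }
            }
        }
        return options;
    }
}
```

For the arguments `-conf myconf -urls urls -depth 3`, this keeps `-urls` and `-depth` and silently drops `-conf`, which a wrapper script such as bin/nutch could have consumed already.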
[jira] Issue Comment Edited: (NUTCH-664) Possibility to update already stored documents.
[ https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651458#action_12651458 ] skhil edited comment on NUTCH-664 at 12/2/08 1:29 AM: --- Good news! So, I'll wait until 1.0 and prepare project for hbase-solr! was (Author: skhil): Good news! So, I'll wait until 1.0 and prepare project for hbase-solr/katta/etc! > Possibility to update already stored documents. > --- > > Key: NUTCH-664 > URL: https://issues.apache.org/jira/browse/NUTCH-664 > Project: Nutch > Issue Type: Wish >Reporter: Sergey Khilkov >Priority: Minor > > We have a huge index of stored documents. It is a high-cost procedure to fetch > a page and merge indexes any time we update some information about a page. The > information can be changed 1-3 times per day. At this moment we have to store > changed info in a database, but in this case we have lots of problems with > sorting, search restrictions and so on. Lucene itself allows deleting a single > document and adding a new one to an existing index. But there is a problem with > Hadoop... As I understand it, the Hadoop filesystem cannot write at > random positions. But it would be a great feature if Nutch were able to > update an already created index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.