Re: no nutch script file under bin directory
Hi: sorry, here's the original discussion that led to the link I accidentally sent twice; I had meant to include it too.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg08621.html

----- Original Message -----
From: Tsengtan A Shuy [EMAIL PROTECTED]
To: Tsengtan A Shuy [EMAIL PROTECTED]; nutch-dev@lucene.apache.org
Sent: Tuesday, July 17, 2007 12:32:49 PM
Subject: RE: no nutch script file under bin directory

BTW, I just found out there is only one web page reference in your last email, so I do not understand why you quoted two discussions.

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com

-----Original Message-----
From: Tsengtan A Shuy [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 17, 2007 12:23 PM
To: 'nutch-dev@lucene.apache.org'
Subject: no nutch script file under bin directory

I followed msg06571.html to check out the trunk, but then found there is no nutch script file under the bin directory. How do you crawl multiple websites without this nutch script file?

-----Original Message-----
From: Kai_testing Middleton [mailto:[EMAIL PROTECTED]
Sent: Monday, July 16, 2007 8:43 AM
To: nutch-dev@lucene.apache.org
Subject: Re: OOM error during parsing with nekohtml

You could try looking at these two discussions:
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html

--Kai
Re: no nutch script file under bin directory
I'm not actually sure ... I think I downloaded and unzipped a nightly build in my /usr/local directory, thus creating this directory: /usr/local/nutch-2007-06-27_06-52-44. Then, from within that directory, I ran the svn command ... if I remember correctly. You can always try just making a 'nutch' directory or a 'nutch0.9' directory, running svn, and seeing whether it creates another subdirectory under that; then move things to where you want.

----- Original Message -----
From: Tsengtan A Shuy [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, July 17, 2007 5:30:18 PM
Subject: RE: no nutch script file under bin directory

This may seem like a silly question, but I need to know it anyway. When I check out the trunk, I should put it into the nutch directory, which should be the latest release directory, e.g. the nutch-0.9 release. Am I right?

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
Re: no nutch script file under bin directory
The nightly builds are all cataloged here:

http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/

The current nightly build is #153, from July 18. For instance, you could do:

wget http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/153/artifact/trunk/build/nutch-2007-07-18_04-01-20.tar.gz

--Kai

----- Original Message -----
From: Tsengtan A Shuy [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Wednesday, July 18, 2007 11:59:52 AM
Subject: RE: no nutch script file under bin directory

Where do you get the nightly build? I followed your referral web page and used

wget http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz

to get it. Then I got a file-not-found error message.

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
Re: OOM error during parsing with nekohtml
You could try looking at these two discussions:

http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html

--Kai

----- Original Message -----
From: Tsengtan A Shuy [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Sent: Monday, July 16, 2007 3:45:59 AM
Subject: RE: OOM error during parsing with nekohtml

I successfully ran the whole-web crawl with my new Ubuntu OS, and I am ready to fix the bug. I need someone to guide me to the most up-to-date source code and the bug assignment. Thank you in advance!!

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com

-----Original Message-----
From: Shailendra Mudgal [mailto:[EMAIL PROTECTED]
Sent: Monday, July 16, 2007 3:05 AM
To: [EMAIL PROTECTED]; nutch-dev@lucene.apache.org
Subject: OOM error during parsing with nekohtml

Hi All,

We are getting an OOM exception while processing http://www.fotofinity.com/cgi-bin/homepages.cgi . We have also applied the NUTCH-497 patch to our source code, but the error still occurs during the parse method. Does anybody have any idea about this?
Here is the complete stack trace:

java.lang.OutOfMemoryError: Java heap space
        at java.lang.String.toUpperCase(String.java:2637)
        at java.lang.String.toUpperCase(String.java:2660)
        at org.cyberneko.html.filters.NamespaceBinder.bindNamespaces(NamespaceBinder.java:443)
        at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:252)
        at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1009)
        at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:639)
        at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:646)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2343)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:265)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:229)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:168)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)

Regards,
Shailendra
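A heap-space OOM in a parse map task can sometimes be worked around, independently of any parser fix, by giving the child task JVMs more memory. A minimal sketch for a Hadoop-0.x-era conf/hadoop-site.xml, assuming the stock mapred.child.java.opts property; the 512m figure is only an illustrative value, not a recommendation from this thread:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- JVM options for each spawned map/reduce child task.
       Raising -Xmx gives the parser more heap to work with. -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
</configuration>
```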
Nutch nightly build and NUTCH-505 draft patch
Recently I successfully applied NUTCH-505_draft_v2.patch as follows:

$ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
$ cd nutch
$ wget https://issues.apache.org/jira/secure/attachment/12360411/NUTCH-505_draft_v2.patch --no-check-certificate
$ sudo patch -p0 < NUTCH-505_draft_v2.patch
$ ant clean
$ ant

However, I also needed other recent nutch functionality, so I downloaded a nightly build:

$ wget http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz

I then attempted to apply the patch to that build using the same steps. I was able to run ant clean, but ant failed with:

build.xml:61: Specify at least one source--a file or resource collection

Do I need to get a source checkout of a nightly build? How would I do that?
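One detail worth noting in the recipe above: patch(1) reads the unified diff from standard input, so the `<` redirect matters. A self-contained toy sketch of the same apply-it workflow, with a --dry-run check first (hypothetical file and patch names, not the real NUTCH-505 patch; assumes GNU patch):

```shell
set -e

# Work in a scratch directory (hypothetical path, recreated each run).
rm -rf /tmp/patch-demo
mkdir -p /tmp/patch-demo
cd /tmp/patch-demo

# A toy "source" file standing in for a real source tree.
printf 'hello\nworld\n' > greeting.txt

# A unified diff that changes "world" to "nutch".
cat > demo.patch <<'EOF'
--- greeting.txt
+++ greeting.txt
@@ -1,2 +1,2 @@
 hello
-world
+nutch
EOF

# First verify the patch applies cleanly without modifying anything...
patch -p0 --dry-run < demo.patch

# ...then apply it for real.
patch -p0 < demo.patch
```

If the dry run reports failed hunks, the checkout and the patch are out of sync, and checking out the revision the patch was made against is usually simpler than forcing the patch through.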
Re: NUTCH-119 :: how hard to fix
wow, setting db.max.outlinks.per.page immediately fixed my problem. It looks like I totally mis-diagnosed things. May I pose two questions:

1) how did you view all the outlinks?
2) how severe is NUTCH-119 - does it occur on a lot of sites?

----- Original Message -----
From: Doğacan Güney [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, June 26, 2007 10:56:32 PM
Subject: Re: NUTCH-119 :: how hard to fix

On 6/27/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:
> I am evaluating nutch+lucene as a crawl and search solution. However, I am finding major bugs in nutch right off the bat. In particular, NUTCH-119: nutch is not crawling relative URLs. I have some discussion of it here: http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html
> Most of the links off www.variety.com, one of my main test sites, have relative URLs. It seems incredible that nutch, which is capable of mapreduce, cannot fetch these URLs. It could be that I would fix this bug if, for other reasons, I decide to go with nutch+lucene. Has anyone tried fixing this problem? Is it intractable? Or are the developers, who are just volunteers anyway, more interested in fixing other problems? Could someone outline the issue for me a bit more clearly so I would know how to evaluate it?

Both this one and the other site you were mentioning (sf911truth) have more than 100 outlinks. Nutch, by default, only stores 100 outlinks per page (db.max.outlinks.per.page). The about.html link happens to be the 105th link or so, so nutch doesn't store it. All you have to do is either increase db.max.outlinks.per.page or set it to -1 (which means: store all outlinks).

--
Doğacan Güney
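The property Doğacan mentions is an ordinary Nutch configuration setting, so it can be overridden in conf/nutch-site.xml rather than editing nutch-default.xml. A minimal sketch, using the -1 (store all outlinks) value discussed above:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Maximum number of outlinks stored per page.
       The default is 100; -1 means store all outlinks. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>
```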
Re: [jira] Commented: (NUTCH-505) Outlink urls should be validated
I can confirm that with NUTCH-505_draft_v2.patch I no longer get outlink urls that contain html mark-up, as I was getting before on www.variety.com.

--Kai Middleton

----- Original Message -----
From: Doğacan Güney (JIRA) [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Monday, June 25, 2007 1:09:26 AM
Subject: [jira] Commented: (NUTCH-505) Outlink urls should be validated

[ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507803 ]

Doğacan Güney commented on NUTCH-505:

btw, for http://www.variety.com/, these are the 'urls' filtered:

http:/
http://www.variety.com//div
http://www.variety.com//div/a
mailto:[EMAIL PROTECTED]
http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '?
http://ad.doubleclick.net/ad/variety.dart/;sz=993x47;ord=' + randomnumber + '?

Since we will not distribute score to these, this patch may also slightly improve scoring.

Outlink urls should be validated
Key: NUTCH-505
URL: https://issues.apache.org/jira/browse/NUTCH-505
Project: Nutch
Issue Type: Improvement
Reporter: Doğacan Güney
Priority: Minor
Attachments: NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch

See discussion here: http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html

Parse plugins may extract garbage urls from pages. We need a url validation system that tests these urls and filters out garbage.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
NUTCH-119 :: how hard to fix
I am evaluating nutch+lucene as a crawl and search solution. However, I am finding major bugs in nutch right off the bat. In particular, NUTCH-119: nutch is not crawling relative URLs. I have some discussion of it here:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html

Most of the links off www.variety.com, one of my main test sites, have relative URLs. It seems incredible that nutch, which is capable of mapreduce, cannot fetch these URLs. It could be that I would fix this bug if, for other reasons, I decide to go with nutch+lucene.

Has anyone tried fixing this problem? Is it intractable? Or are the developers, who are just volunteers anyway, more interested in fixing other problems? Could someone outline the issue for me a bit more clearly so I would know how to evaluate it?