Re: no nutch script file under bin directory

2007-07-18 Thread Kai_testing Middleton
Hi: sorry, here's the original discussion that led to the link I accidentally
sent twice; I had meant to include it too.
http://www.mail-archive.com/[EMAIL PROTECTED]/msg08621.html


- Original Message 
From: Tsengtan A Shuy [EMAIL PROTECTED]
To: Tsengtan A Shuy [EMAIL PROTECTED]; nutch-dev@lucene.apache.org
Sent: Tuesday, July 17, 2007 12:32:49 PM
Subject: RE: no nutch script file under bin directory

BTW, I just found out there is only one web page reference in your last
email, so I do not understand why you quoted two discussions.

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-Original Message-
From: Tsengtan A Shuy [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 17, 2007 12:23 PM
To: 'nutch-dev@lucene.apache.org'
Subject: no nutch script file under bin directory

I followed msg06571.html to check out the trunk.
Then I found there is no nutch script file under the bin directory.
How do you crawl multiple websites without this nutch script file?
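(For reference: the Nutch 0.9 tutorial drives a multi-site crawl through that
script roughly as below; a sketch assuming a built checkout with bin/nutch
present, with the seed file name and values purely illustrative.)

$ mkdir urls
$ echo "http://www.variety.com/" > urls/seeds.txt
$ echo "http://lucene.apache.org/" >> urls/seeds.txt
$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50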

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-Original Message-
From: Kai_testing Middleton [mailto:[EMAIL PROTECTED] 
Sent: Monday, July 16, 2007 8:43 AM
To: nutch-dev@lucene.apache.org
Subject: Re: OOM error during parsing with nekohtml

You could try looking at these two discussions:
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html

--Kai

Re: no nutch script file under bin directory

2007-07-18 Thread Kai_testing Middleton
I'm not actually sure ... I think I downloaded and unzipped a nightly build in
my /usr/local directory, thus creating this directory:
/usr/local/nutch-2007-06-27_06-52-44
Then, from within that directory, I ran the svn command ... if I remember
correctly.

You can always try just making a 'nutch' directory or a 'nutch0.9' directory,
running svn, and seeing whether it creates another subdirectory under that;
then move things to where you want.
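(A sketch of that checkout; svn creates the target directory itself, and a
trunk checkout should already include the bin/nutch script.)

$ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
$ cd nutch
$ ls bin/nutch   # confirm the script is present
$ ant            # build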

- Original Message 
From: Tsengtan A Shuy [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, July 17, 2007 5:30:18 PM
Subject: RE: no nutch script file under bin directory

This may seem like a silly question, but I need to ask it anyway.
When I check out the trunk, I should put it in the nutch directory, which
should be the latest release directory (e.g., the nutch-0.9 release).
Am I right?

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-Original Message-
From: Tsengtan A Shuy [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 17, 2007 12:33 PM
To: 'Tsengtan A Shuy'; nutch-dev@lucene.apache.org
Subject: RE: no nutch script file under bin directory

BTW, I just found out there is only one web page reference in your last
email, so I do not understand why you quoted two discussions.

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-Original Message-
From: Tsengtan A Shuy [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 17, 2007 12:23 PM
To: 'nutch-dev@lucene.apache.org'
Subject: no nutch script file under bin directory

I followed msg06571.html to check out the trunk.
Then I found there is no nutch script file under the bin directory.
How do you crawl multiple websites without this nutch script file?

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-Original Message-
From: Kai_testing Middleton [mailto:[EMAIL PROTECTED] 
Sent: Monday, July 16, 2007 8:43 AM
To: nutch-dev@lucene.apache.org
Subject: Re: OOM error during parsing with nekohtml

You could try looking at these two discussions:
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html

--Kai


Re: no nutch script file under bin directory

2007-07-18 Thread Kai_testing Middleton
The nightly builds are all cataloged here:
http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/
The current nightly build is #153 from July 18.

For instance, you could do:
wget http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/153/artifact/trunk/build/nutch-2007-07-18_04-01-20.tar.gz
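(A sketch of unpacking that tarball and checking for the script; the directory
name is assumed to match the archive name.)

$ tar -xzf nutch-2007-07-18_04-01-20.tar.gz
$ cd nutch-2007-07-18_04-01-20
$ ls bin/nutch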

--Kai

- Original Message 
From: Tsengtan A Shuy [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Wednesday, July 18, 2007 11:59:52 AM
Subject: RE: no nutch script file under bin directory

Where do you get the nightly build? I followed the page you referred to and
used wget
http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz
to get it. Then I got a "file not found" error message.

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-Original Message-
From: Kai_testing Middleton [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 18, 2007 11:35 AM
To: nutch-dev@lucene.apache.org
Subject: Re: no nutch script file under bin directory

I'm not actually sure ... I think I downloaded and unzipped a nightly build
in my /usr/local directory, thus creating this directory:
/usr/local/nutch-2007-06-27_06-52-44
Then, from within that directory, I ran the svn command ... if I remember
correctly.

You can always try just making a 'nutch' directory or a 'nutch0.9'
directory, running svn, and seeing whether it creates another subdirectory
under that; then move things to where you want.

- Original Message 
From: Tsengtan A Shuy [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, July 17, 2007 5:30:18 PM
Subject: RE: no nutch script file under bin directory

This may seem like a silly question, but I need to ask it anyway.
When I check out the trunk, I should put it in the nutch directory, which
should be the latest release directory (e.g., the nutch-0.9 release).
Am I right?

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-Original Message-
From: Tsengtan A Shuy [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 17, 2007 12:33 PM
To: 'Tsengtan A Shuy'; nutch-dev@lucene.apache.org
Subject: RE: no nutch script file under bin directory

BTW, I just found out there is only one web page reference in your last
email, so I do not understand why you quoted two discussions.

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-Original Message-
From: Tsengtan A Shuy [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 17, 2007 12:23 PM
To: 'nutch-dev@lucene.apache.org'
Subject: no nutch script file under bin directory

I followed msg06571.html to check out the trunk.
Then I found there is no nutch script file under the bin directory.
How do you crawl multiple websites without this nutch script file?

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-Original Message-
From: Kai_testing Middleton [mailto:[EMAIL PROTECTED] 
Sent: Monday, July 16, 2007 8:43 AM
To: nutch-dev@lucene.apache.org
Subject: Re: OOM error during parsing with nekohtml

You could try looking at these two discussions:
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html

--Kai

Re: OOM error during parsing with nekohtml

2007-07-16 Thread Kai_testing Middleton
You could try looking at these two discussions:
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html

--Kai

- Original Message 
From: Tsengtan A Shuy [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Sent: Monday, July 16, 2007 3:45:59 AM
Subject: RE: OOM error during parsing with nekohtml

I successfully ran the whole-web crawl on my new Ubuntu OS, and I am ready
to fix the bug. I need someone to guide me to the most up-to-date source
code and the bug assignment.

Thank you in advance!! 

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-Original Message-
From: Shailendra Mudgal [mailto:[EMAIL PROTECTED] 
Sent: Monday, July 16, 2007 3:05 AM
To: [EMAIL PROTECTED]; nutch-dev@lucene.apache.org
Subject: OOM error during parsing with nekohtml

Hi All,

We are getting an OOM exception during the processing of
http://www.fotofinity.com/cgi-bin/homepages.cgi . We have also applied the
NUTCH-497 patch to our source code, but the error actually occurs during the
parse method.
Does anybody have any idea regarding this? Here is the complete stack trace:

java.lang.OutOfMemoryError: Java heap space
        at java.lang.String.toUpperCase(String.java:2637)
        at java.lang.String.toUpperCase(String.java:2660)
        at org.cyberneko.html.filters.NamespaceBinder.bindNamespaces(NamespaceBinder.java:443)
        at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:252)
        at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1009)
        at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:639)
        at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:646)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2343)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:265)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:229)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:168)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
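(Not an answer from this thread, but a common mitigation is to give the
parsing tasks more heap; a sketch with illustrative values, using settings
that existed in Nutch/Hadoop of this era.)

$ export NUTCH_HEAPSIZE=2000   # MB; read by bin/nutch for local runs
$ # for mapreduce jobs, task heap comes from mapred.child.java.opts
$ # in hadoop-site.xml, e.g. <value>-Xmx1024m</value>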


Regards,
Shailendra

Nutch nightly build and NUTCH-505 draft patch

2007-07-02 Thread Kai_testing Middleton
Recently I successfully applied NUTCH-505_draft_v2.patch as follows:

$ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
$ cd nutch
$ wget --no-check-certificate https://issues.apache.org/jira/secure/attachment/12360411/NUTCH-505_draft_v2.patch
$ sudo patch -p0 < NUTCH-505_draft_v2.patch
$ ant clean
$ ant

However, I also needed other recent nutch functionality, so I downloaded a 
nightly build:

$ wget http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz

I then attempted to apply the patch to that build using the same steps.
I was able to run ant clean, but ant failed with:

build.xml:61: Specify at least one source--a file or resource collection

Do I need to get a source checkout of a nightly build? How would I do that?
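(One possible approach, not from this thread: the nightlies are built from
trunk, so a source tree matching a given nightly can be checked out by date
with svn and patched there; a sketch.)

$ svn co -r {2007-06-27} http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch-src
$ cd nutch-src
$ patch -p0 < NUTCH-505_draft_v2.patch
$ ant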



   


Re: NUTCH-119 :: how hard to fix

2007-06-27 Thread Kai_testing Middleton
Wow, setting db.max.outlinks.per.page immediately fixed my problem. It looks
like I totally mis-diagnosed things.

May I pose two questions:
1) how did you view all the outlinks?
2) how severe is NUTCH-119 - does it occur on a lot of sites?
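(For reference: a minimal conf/nutch-site.xml override for that property; a
sketch, where -1 means "store all outlinks", as the reply quoted below
explains.)

$ cat conf/nutch-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>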


- Original Message 
From: Doğacan Güney [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, June 26, 2007 10:56:32 PM
Subject: Re: NUTCH-119 :: how hard to fix

On 6/27/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:
 I am evaluating nutch+lucene as a crawl and search solution.

 However, I am finding major bugs in nutch right off the bat.

 In particular, NUTCH-119: nutch is not crawling relative URLs.  I have some 
 discussion of it here:
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html

 Most of the links off www.variety.com, one of my main test sites, have 
 relative URLs.  It seems incredible that nutch, which is capable of 
 mapreduce, cannot fetch these URLs.

 It could be that I would fix this bug if, for other reasons, I decide to go 
 with nutch+lucene.  Has anyone tried fixing this problem?  Is it intractable? 
  Or are the developers, who are just volunteers anyway, more interested in 
 fixing other problems?

 Could someone outline the issue for me a bit more clearly so I would know how 
 to evaluate it?

Both this one and the other site you were mentioning (sf911truth) have
more than 100 outlinks. Nutch, by default, only stores 100 outlinks per
page (db.max.outlinks.per.page). The link about.html happens to be the
105th link or so, so nutch doesn't store it. All you have to do is either
increase db.max.outlinks.per.page or set it to -1 (which means store all
outlinks).





   
 


-- 
Doğacan Güney

Re: [jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-06-26 Thread Kai_testing Middleton
I can confirm that with NUTCH-505_draft_v2.patch I no longer get outlink URLs
that contain HTML markup, as I was getting before on www.variety.com.

--Kai Middleton

- Original Message 
From: Doğacan Güney (JIRA) [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Monday, June 25, 2007 1:09:26 AM
Subject: [jira] Commented: (NUTCH-505) Outlink urls should be validated


[ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507803 ]

Doğacan Güney commented on NUTCH-505:
-------------------------------------

BTW, for http://www.variety.com/, these are the 'urls' filtered:

http:/
http://www.variety.com//div
http://www.variety.com//div/a
mailto:[EMAIL PROTECTED]
http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '?
http://ad.doubleclick.net/ad/variety.dart/;sz=993x47;ord=' + randomnumber + '?

Since we will not distribute score to these, this patch may also slightly 
improve scoring.


 Outlink urls should be validated
 --------------------------------

                 Key: NUTCH-505
                 URL: https://issues.apache.org/jira/browse/NUTCH-505
             Project: Nutch
          Issue Type: Improvement
            Reporter: Doğacan Güney
            Priority: Minor
         Attachments: NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch


 See discussion here:
 http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
 Parse plugins may extract garbage urls from pages. We need a url validation 
 system that tests these urls and filters out garbage.

-- 
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

NUTCH-119 :: how hard to fix

2007-06-26 Thread Kai_testing Middleton
I am evaluating nutch+lucene as a crawl and search solution.

However, I am finding major bugs in nutch right off the bat.

In particular, NUTCH-119: nutch is not crawling relative URLs.  I have some 
discussion of it here:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html

Most of the links off www.variety.com, one of my main test sites, have relative 
URLs.  It seems incredible that nutch, which is capable of mapreduce, cannot 
fetch these URLs.

It could be that I would fix this bug if, for other reasons, I decide to go 
with nutch+lucene.  Has anyone tried fixing this problem?  Is it intractable?  
Or are the developers, who are just volunteers anyway, more interested in 
fixing other problems?

Could someone outline the issue for me a bit more clearly so I would know how 
to evaluate it?