RE: Nutch/Hardtop on EC2

2007-10-15 Thread Sathyam Y
This turned out to be a simple disk space issue. My bad!! The error message was quite cryptic, though. Thanks. Balachanthar <[EMAIL PROTECTED]> wrote: Hi sathyam, I think there is a problem in your settings. If you can give me your settings, I can check it out. bala -Original Mess

Re: Possible public applications with nutch and hadoop

2007-10-15 Thread Matt Kangas
Hi Andrzej (and everyone else), A few weeks ago, I intended to chime in on your "Scoring API issues" thread, but this new thread is perhaps an even better place to speak up. Time to stop lurking and contribute. :) First, I want to echo Stefan Groschupf's comment several months ago that the N

web-app config files

2007-10-15 Thread Rohit Trivedi
Hi guys, is there any other place within Tomcat where I can place my nutch config files? At the moment they are in WEB-INF/classes, and that's really ugly. I've tried putting them in shared/classes and in the /conf directory, but to no avail. I'd really like to have them somewhere neater - I

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Pike
Hi Chris > There are currently 2 plugins that parse feeds and get them indexed: > parse-rss - older, but gets the job done > feed - newer, and takes advantage of the ability to parse/index feeds in > one step, rather than in many [..] > Parse-rss indexes the whole feed, whereas the feed plugi

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Chris Mattmann
Hi Brian, Sorry for taking so long to reply. Here ya go: > Do you have any URLs for feeds that are reliably parsed and indexed by > the feed parser? I haven't tested/used this plugin in quite a while. There was someone on the nutch-user list before, nutch.newbie, that was doing quite a bit

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Chris Mattmann
Hi Pike, Parse-rss indexes the whole feed, whereas the feed plugin takes advantage of NUTCH-443, which allows Parsers to return multiple Parse objects, which indexes each item in the feed as its own record. HTH, Chris On 10/15/07 7:25 AM, "Pike" <[EMAIL PROTECTED]> wrote: > Hi > >>> I hav
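The difference Chris describes (one record for the whole feed versus one record per feed item) can be illustrated with a minimal sketch. This is not the code of either Nutch plugin; the function names, the sample feed, and the record layout are all hypothetical, and plain Python's standard XML parser stands in for Nutch's Parser API.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item RSS feed used only for illustration.
FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""


def index_whole_feed(feed_url, xml_text):
    """Whole-feed style: a single record holding all of the feed's text."""
    root = ET.fromstring(xml_text)
    text = " ".join(t.strip() for t in root.itertext() if t.strip())
    return [{"url": feed_url, "content": text}]


def index_per_item(xml_text):
    """Per-item style: one record for each <item>, keyed by its link."""
    root = ET.fromstring(xml_text)
    return [
        {"url": item.findtext("link"), "content": item.findtext("title")}
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    print(len(index_whole_feed("http://example.com/feed", FEED)))
    print(len(index_per_item(FEED)))
```

In the first style a search hit points at the feed URL; in the second, each hit points at an individual post.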

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Pike
Hi >> I have this with all results: what is indexed >> seems to be 1 record per feed, containing a >> parsed version of the content including all its items, >> with sometimes bits of xml and html markup in it. >> >> I was assuming this is the intended behaviour ? > > It may well be the intended

Re: ParseException: parser not found for contentType=image/bmp [or how to disallow certain contentTypes from fetching]

2007-10-15 Thread Dennis Kubes
Marcin is correct about the .asp extension and the regex filter, but nutch is not downloading this as an image src. The page itself http://0086jia.com/include/validCode.asp, returns an image with content type of bmp. It looks like a simple captcha to me. Since nutch can't parse this type of

Re: ParseException: parser not found for contentType=image/bmp [or how to disallow certain contentTypes from fetching]

2007-10-15 Thread Marcin Okraszewski
The regex filter just filters URLs, not content types. As the URL ends with .asp it does not fall into the prohibited URL patterns. The problem is that Nutch follows img/@src, so it downloads images. There is a patch for this under http://issues.apache.org/jira/browse/Nutch-488 which allows selec
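Why a suffix-based URL filter cannot catch this page can be shown with a small sketch. The rule list below is hypothetical and only imitates the accept/reject style of a regex URL filter (first matching rule wins); it is not Nutch's actual default rule set, and the `looks_parseable` helper and its list of types are assumptions for illustration.

```python
import re

# Hypothetical rules: "-" rejects a matching URL, "+" accepts it.
# The first rule whose pattern matches decides the outcome.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|bmp|ico|css|zip|exe)$", re.IGNORECASE)),
    ("+", re.compile(r".")),
]

# Hypothetical set of content types a parser is configured for.
PARSEABLE_TYPES = {"text/html", "text/plain"}


def url_passes(url):
    """Decide from the URL string alone, before anything is fetched."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False


def looks_parseable(content_type):
    """Decide from the Content-Type header, known only after fetching."""
    return content_type.split(";")[0].strip().lower() in PARSEABLE_TYPES


if __name__ == "__main__":
    url = "http://0086jia.com/include/validCode.asp"
    print(url_passes(url))               # the .asp suffix is not on the reject list
    print(looks_parseable("image/bmp"))  # the type is only visible in the response
```

The URL passes the filter because nothing in its text says "image"; the mismatch only shows up once the response's content type is known.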

Re: Fetch schedule and unmodified content

2007-10-15 Thread chris sleeman
Thanks for your inputs. Will try it out. -Chris On 10/15/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > > chris sleeman wrote: > > Hi Andrzej, > > > > Thanks for your response. However, I still have a couple of doubts. > > > >> In your case, I would recommend setting a very short interval

Re: Possible public applications with nutch and hadoop

2007-10-15 Thread Andrzej Bialecki
Berlin Brown wrote: Yea, you are right. You have to have a constrained set of domains to search and to be honest, that works pretty well. The only thing, I still get a lot of junk links. I would say that 30% are valid or interesting links while the other is kind of worthless. I guess it is a

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Rick Moynihan
Pike wrote: Hi Ricky, Chris I've not noticed much difference, with both plugins failing on the feedburner feed: - http://feeds.feedburner.com/Techcrunch Strange, but that feed is indeed invalid xml if I wget it. It starts with newlines and ends with comments. Very picky, but that's not all

ParseException: parser not found for contentType=image/bmp [or how to disallow certain contentTypes from fetching]

2007-10-15 Thread eyal edri
Hello, During a fetch, the fetcher failed to retrieve a certain page with the following exception: // url is masked Error parsing: http://*/validCode.asp: org.apache.nutch.parse.ParseException: parser not found for contentType=image/bmp url=http://0086jia.com/include/validCode.asp

Re: Fetch schedule and unmodified content

2007-10-15 Thread Andrzej Bialecki
chris sleeman wrote: Hi Andrzej, Thanks for your response. However, I still have a couple of doubts. In your case, I would recommend setting a very short interval for the main page, and setting longer (default) intervals for other pages. Isn't the fetch interval a system-wide setting? Or ca

Re: Fetch schedule and unmodified content

2007-10-15 Thread chris sleeman
Hi Andrzej, Thanks for your response. However, I still have a couple of doubts. >In your case, I would recommend setting a very short interval for the >main page, and setting longer (default) intervals for other pages. Isn't the fetch interval a system-wide setting? Or can we set it for individ