date:20071015

RE: Nutch/Hardtop on EC2

2007-10-15 Thread Sathyam Y

This turned out to be a simple disk space issue. My bad, although the error message was quite cryptic. Thanks. Balachanthar <[EMAIL PROTECTED]> wrote: Hi Sathyam, I think there is a problem in your settings. If you can give me your settings, I can check them out. bala -Original Mess

Re: Possible public applications with nutch and hadoop

2007-10-15 Thread Matt Kangas

Hi Andrzej (and everyone else), A few weeks ago, I intended to chime in on your "Scoring API issues" thread, but this new thread is perhaps an even better place to speak up. Time to stop lurking and contribute. :) First, I want to echo Stefan Groschupf's comment several months ago that the N

web-app config files

2007-10-15 Thread Rohit Trivedi

Hi guys, is there any other place within Tomcat where I can place my Nutch config files? At the moment they are in WEB-INF/classes, and that's really ugly. I've tried putting them in shared/classes and in the /conf directory, but to no avail. I'd really like to have them somewhere neater - I

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Pike

Hi Chris > There are currently 2 plugins that parse feeds and get them indexed: > parse-rss - older, but gets the job done > feed - newer, and takes advantage of the ability to parse/index feeds in > one step, rather than in many [..] > Parse-rss indexes the whole feed, whereas the feed plugi

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Chris Mattmann

Hi Brian, Sorry for taking so long to reply. Here ya go: > Do you have any URLs for feeds that are reliably parsed and indexed by > the feed parser? I haven't tested/used this plugin in quite a while. There was someone on the nutch-user list before, nutch.newbie, that was doing quite a bit

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Chris Mattmann

Hi Pike, Parse-rss indexes the whole feed, whereas the feed plugin takes advantage of NUTCH-443, which allows Parsers to return multiple Parse objects, which indexes each item in the feed as its own record. HTH, Chris On 10/15/07 7:25 AM, "Pike" <[EMAIL PROTECTED]> wrote: > Hi > >>> I hav
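As a rough illustration of the difference Chris describes between one record per feed and one record per item, here is a minimal Python sketch. It is not Nutch code and does not use the Nutch Parser API; the function names, the record layout, and the sample feed are all hypothetical and exist only to show the two indexing shapes side by side.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only for this illustration.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""


def index_whole_feed(feed_url, xml_text):
    """Whole-feed style: one record whose text merges every item."""
    root = ET.fromstring(xml_text)
    titles = [item.findtext("title", default="") for item in root.iter("item")]
    return [{"url": feed_url, "text": " ".join(titles)}]


def index_per_item(xml_text):
    """Per-item style: each feed item becomes its own record."""
    root = ET.fromstring(xml_text)
    return [
        {
            "url": item.findtext("link", default=""),
            "text": item.findtext("title", default=""),
        }
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    print(index_whole_feed("http://example.com/feed", SAMPLE_FEED))
    print(index_per_item(SAMPLE_FEED))
```

With the per-item shape, a search hit can point at the individual post's link instead of at the feed URL.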

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Pike

Hi >> I have this with all results: what is indexed >> seems to be 1 record per feed, containing a >> parsed version of the content including all its items, >> with sometimes bits of xml and html markup in it. >> >> I was assuming this is the intended behaviour ? > > It may well be the intended

Re: ParseException: parser not found for contentType=image/bmp [or how to disallow certain contentTypes from fetching]

2007-10-15 Thread Dennis Kubes

Marcin is correct about the .asp extension and the regex filter, but Nutch is not downloading this as an image src. The page itself, http://0086jia.com/include/validCode.asp, returns an image with content type of bmp. It looks like a simple captcha to me. Since Nutch can't parse this type of

Re: ParseException: parser not found for contentType=image/bmp [or how to disallow certain contentTypes from fetching]

2007-10-15 Thread Marcin Okraszewski

The regex filter filters only URLs, not content types. As the URL ends with .asp, it does not fall into the prohibited URL patterns. The problem is that Nutch follows img/@src, so it downloads images. There is a patch for this under http://issues.apache.org/jira/browse/Nutch-488 which allows selec
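To make the point above concrete, here is a minimal Python sketch of why a suffix-based URL filter lets validCode.asp through even though the response is image/bmp. This is not Nutch's filter implementation; the suffix pattern, the set of parseable types, and the function names are assumptions chosen for illustration.

```python
import re

# Hypothetical suffix pattern, loosely in the spirit of a crawler's URL filter rules.
PROHIBITED_SUFFIXES = re.compile(
    r"\.(gif|jpg|jpeg|png|bmp|ico|css|zip|exe)$", re.IGNORECASE
)

# Hypothetical set of media types for which a parser is available.
PARSEABLE_TYPES = {"text/html", "text/plain", "application/xhtml+xml"}


def url_allowed(url):
    """Return True if the URL does not end in a prohibited suffix."""
    return PROHIBITED_SUFFIXES.search(url) is None


def content_type_parseable(content_type):
    """Return True if the media type (parameters ignored) has a parser."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in PARSEABLE_TYPES


if __name__ == "__main__":
    url = "http://0086jia.com/include/validCode.asp"
    # The URL check sees only the ".asp" suffix, so the URL passes.
    print(url_allowed(url))
    # The content type is known only after the fetch, and it has no parser.
    print(content_type_parseable("image/bmp"))
```

The URL check runs before the fetch, so it can never see the Content-Type header. Rejecting such responses needs a second check after the response arrives.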

Re: Fetch schedule and unmodified content

2007-10-15 Thread chris sleeman

Thanks for your inputs. Will try it out. -Chris On 10/15/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > > chris sleeman wrote: > > Hi Andrzej, > > > > Thanks for your response. However, I still have a couple of doubts. > > > >> In your case, I would recommend setting a very short interval

Re: Possible public applications with nutch and hadoop

2007-10-15 Thread Andrzej Bialecki

Berlin Brown wrote: Yea, you are right. You have to have a constrained set of domains to search and to be honest, that works pretty well. The only thing, I still get a lot of junk links. I would say that 30% are valid or interesting links while the other is kind of worthless. I guess it is a

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Rick Moynihan

Pike wrote: Hi Ricky, Chris I've not noticed much difference, with both plugins failing on the feedburner feed: - http://feeds.feedburner.com/Techcrunch Strange, but that feed is indeed invalid xml if I wget it. It starts with newlines and ends with comments. Very picky, but that's not all

ParseException: parser not found for contentType=image/bmp [or how to disallow certain contentTypes from fetching]

2007-10-15 Thread eyal edri

Hello, During a fetch, the fetcher failed to retrieve a certain page with the following exception: // url is masked Error parsing: http://*/validCode.asp: org.apache.nutch.parse.ParseException: parser not found for contentType=image/bmp url=http://0086jia.com/include/validCode.asp

Re: Fetch schedule and unmodified content

2007-10-15 Thread Andrzej Bialecki

chris sleeman wrote: Hi Andrzej, Thanks for your response. However, I still have a couple of doubts. In your case, I would recommend setting a very short interval for the main page, and setting longer (default) intervals for other pages. Isn't the fetch interval a system wide setting? Or ca

Re: Fetch schedule and unmodified content

2007-10-15 Thread chris sleeman

Hi Andrzej, Thanks for your response. However, I still have a couple of doubts. >In your case, I would recommend setting a very short interval for the >main page, and setting longer (default) intervals for other pages. Isn't the fetch interval a system wide setting? Or can we set it for individ

15 matches
