This turned out to be a simple disk space issue. My bad, although the error
message was quite cryptic.
Thanks.
Balachanthar <[EMAIL PROTECTED]> wrote:
Hi sathyam,
I think there is a problem in your settings. If you can give me your settings,
I can check them out.
bala
-Original Mess
Hi Andrzej (and everyone else),
A few weeks ago, I intended to chime in on your "Scoring API issues"
thread, but this new thread is perhaps an even better place to speak
up. Time to stop lurking and contribute. :)
First, I want to echo Stefan Groschupf's comment several months ago that
the N
Hi guys,
Is there any other place within Tomcat where I can put my Nutch config
files? At the moment they are in WEB-INF/classes, and that's really
ugly. I've tried putting them in shared/classes and in the /conf
directory, but to no avail. I'd really like to have them somewhere neater
- I
Hi Chris
> There are currently 2 plugins that parse feeds and get them indexed:
> parse-rss - older, but gets the job done
> feed - newer, and takes advantage of the ability to parse/index feeds in
> one step, rather than in many
[..]
> Parse-rss indexes the whole feed, whereas the feed plugi
Hi Brian,
Sorry for taking so long to reply. Here ya go:
> Do you have any URLs for feeds that are reliably parsed and indexed by
> the feed parser?
I haven't tested/used this plugin in quite a while. There was someone on
the nutch-user list before, nutch.newbie, that was doing quite a bit
Hi Pike,
Parse-rss indexes the whole feed as one record, whereas the feed plugin takes
advantage of NUTCH-443, which allows Parsers to return multiple Parse objects,
so each item in the feed is indexed as its own record.
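
To illustrate the difference, here is a minimal Python sketch. It is not the code of either plugin; the sample feed, the function names and the record fields are all made up for illustration. It only contrasts "one record for the whole feed" with "one record per item".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, invented for this illustration.
FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""


def whole_feed_record(xml_text):
    """One record holding all the text of the feed (the behaviour described for parse-rss)."""
    root = ET.fromstring(xml_text)
    text = " ".join(t.strip() for t in root.itertext() if t.strip())
    return [{"content": text}]


def per_item_records(xml_text):
    """One record per <item> (the behaviour described for the feed plugin)."""
    root = ET.fromstring(xml_text)
    return [
        {"url": item.findtext("link"), "title": item.findtext("title")}
        for item in root.iter("item")
    ]


print(len(whole_feed_record(FEED)))  # 1
print(len(per_item_records(FEED)))   # 2
```

With per-item records, a search hit can point at the individual post rather than at the feed as a whole.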
HTH,
Chris
On 10/15/07 7:25 AM, "Pike" <[EMAIL PROTECTED]> wrote:
> Hi
>
>>> I hav
Hi
>> I have this with all results: what is indexed
>> seems to be 1 record per feed, containing a
>> parsed version of the content including all its items,
>> with sometimes bits of xml and html markup in it.
>>
>> I was assuming this is the intended behaviour ?
>
> It may well be the intended
Marcin is correct about the .asp extension and the regex filter, but
Nutch is not downloading this as an image src. The page itself,
http://0086jia.com/include/validCode.asp, returns an image with a content
type of bmp. It looks like a simple captcha to me. Since Nutch can't
parse this type of
The regex filter only filters URLs, not content types. As the URL ends with .asp,
it does not match any of the prohibited URL patterns. The problem is that Nutch
follows img/@src, so it downloads images. There is a patch for this under
http://issues.apache.org/jira/browse/Nutch-488 which allows selec
Thanks for your input. Will try it out.
-Chris
On 10/15/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> chris sleeman wrote:
> > Hi Andrzej,
> >
> > Thanks for your response. However, I still have a couple of doubts.
> >
> >> In your case, I would recommend setting a very short interval
Berlin Brown wrote:
Yea, you are right. You have to have a constrained set of domains to
search and to be honest, that works pretty well. The only thing, I
still get a lot of junk links. I would say that 30% are valid or
interesting links while the rest are kind of worthless. I guess it is
a
Pike wrote:
Hi Ricky, Chris
I've not noticed much
difference, with both plugins failing on the feedburner feed:
- http://feeds.feedburner.com/Techcrunch
Strange, but that feed is indeed invalid xml if I wget it.
It starts with newlines and ends with comments. Very
picky, but that's not all
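
A quick way to check this kind of thing is sketched below in Python, using the standard library XML parser rather than anything from Nutch. The sample strings are invented, not the actual feedburner content, and `is_well_formed` is a hypothetical helper. With this parser, whitespace before the XML declaration is rejected, while a comment after the root element is accepted.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the standard library parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Invented samples: newlines before the XML declaration, and a trailing comment.
leading_newlines = '\n\n<?xml version="1.0"?><rss><channel/></rss>'
trailing_comment = '<?xml version="1.0"?><rss><channel/></rss><!-- generated -->'

print(is_well_formed(leading_newlines))           # False
print(is_well_formed(trailing_comment))           # True
print(is_well_formed(leading_newlines.lstrip()))  # True
```

Stripping leading whitespace before parsing is a common workaround, assuming the rest of the feed is otherwise well-formed.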
Hello,
During a fetch, the fetcher failed to retrieve a certain page with the
following exception:
// url is masked
Error parsing: http://*/validCode.asp:
org.apache.nutch.parse.ParseException: parser not found for
contentType=image/bmp url=http://0086jia.com/include/validCode.asp
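
The error above involves a URL ending in .asp that is served as image/bmp. As a rough Python sketch of why a URL-suffix rule would not catch this case: the pattern and the set of parseable types below are assumptions made for illustration, not Nutch's actual configuration, and the function names are hypothetical.

```python
import re

# Assumed suffix rule, in the spirit of a URL filter that skips image extensions.
SKIP_SUFFIXES = re.compile(r"\.(gif|jpe?g|png|bmp|ico)$", re.IGNORECASE)

# Assumed set of content types that a parser is available for.
PARSEABLE_TYPES = {"text/html", "text/plain"}


def url_passes_filter(url):
    """The URL filter only sees the URL string, never the response."""
    return SKIP_SUFFIXES.search(url) is None


def can_parse(content_type):
    """The content type is only known after the page has been fetched."""
    base = content_type.split(";", 1)[0].strip().lower()
    return base in PARSEABLE_TYPES


url = "http://0086jia.com/include/validCode.asp"
print(url_passes_filter(url))   # True: ".asp" is not a skipped suffix
print(can_parse("image/bmp"))   # False: no parser for this type
```

So the URL is accepted for fetching, and the mismatch only shows up at parse time.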
chris sleeman wrote:
Hi Andrzej,
Thanks for your response. However, I still have a couple of doubts.
In your case, I would recommend setting a very short interval for the
main page, and setting longer (default) intervals for other pages.
Isn't the fetch interval a system wide setting? Or ca
Hi Andrzej,
Thanks for your response. However, I still have a couple of doubts.
>In your case, I would recommend setting a very short interval for the
>main page, and setting longer (default) intervals for other pages.
Isn't the fetch interval a system wide setting? Or can we set it for
individ