Could NDFS be easily modified so that the master node
sends the Map task to the data replica/task node that
actually has the data locally, alleviating network
traffic load?
In a scenario like this the master node could be
prepped like Google does so that when the job is
nearing completion it could
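The locality idea asked about above can be sketched roughly as follows. This is only an illustration, not NDFS or Nutch code: the class `LocalityScheduler`, the method `pickWorker`, and the host names are all hypothetical, and a real master would have to obtain the replica locations from the file system.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch: prefer a worker that already stores a replica of the split's block. */
final class LocalityScheduler {

    /**
     * @param replicaHosts hosts assumed to hold a replica of the block behind the split
     * @param idleWorkers  workers currently free to run a map task, in preference order
     * @return a data-local worker if one is idle, otherwise the first idle worker, if any
     */
    static Optional<String> pickWorker(Set<String> replicaHosts, List<String> idleWorkers) {
        for (String worker : idleWorkers) {
            if (replicaHosts.contains(worker)) {
                return Optional.of(worker); // data-local: the block is read from local disk
            }
        }
        // No idle worker holds a replica: fall back to any idle worker,
        // which then has to read the block over the network.
        return idleWorkers.stream().findFirst();
    }

    public static void main(String[] args) {
        Set<String> replicas = Set.of("node2", "node5");
        System.out.println(pickWorker(replicas, List.of("node1", "node2", "node3")));
        System.out.println(pickWorker(replicas, List.of("node1", "node3")));
    }
}
```

The fallback branch is a design choice: waiting for a data-local worker to become idle would save network traffic but could delay the job.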
[ http://issues.apache.org/jira/browse/NUTCH-162?page=comments#action_12362274 ]
Paul Baclace commented on NUTCH-162:
The best practice for identifying localization is to use the ISO language and
country code in the form of lowercase language code follo
Andrew McNabb wrote:
SequenceFileInputFormat inputformat = new SequenceFileInputFormat();
RecordReader in = inputformat.getRecordReader(fshandle, split[i], logjob,
nullreporter);
To read sequence files directly outside of MapReduce, just use
SequenceFile directly, e.g., something like:
MyKe
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362272 ]
Paul Baclace commented on NUTCH-153:
> NUTCH-160?
There is slowness and then there is continental drift. The quantifiers should
be used with any regex package unless the
On Mon, Jan 09, 2006 at 03:28:45PM -0800, Doug Cutting wrote:
>
> I'm still not clear why one might need a NullReporter.
To be more clear I should be a little more specific. I had to read in
from a SequenceFile to interpret results of a string of MapReduce
stages. Here's a simplified snippet.
Hi,
I was going over the code and I noticed the following in
class org.apache.nutch.parse.html.HTMLMetaProcessor
method getMetaTagsHelper
the following code would fail in case the meta tags are in upper case
Node nameNode = attrs.getNamedItem("name");
Node equivNode = attrs.get
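One possible way around the case problem described above is to look the attribute up while ignoring case. The sketch below is an assumption-laden illustration, not the actual HTMLMetaProcessor fix: `MetaAttrLookup`, `getNamedItemIgnoreCase`, and `parseElement` are hypothetical names, and it uses the JDK's XML parser (which preserves attribute case) purely to produce a demo `NamedNodeMap`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical sketch of a case-insensitive attribute lookup on a DOM node. */
final class MetaAttrLookup {

    /** Returns the first attribute whose name matches ignoring case, or null if none does. */
    static Node getNamedItemIgnoreCase(NamedNodeMap attrs, String name) {
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            if (attr.getNodeName().equalsIgnoreCase(name)) {
                return attr;
            }
        }
        return null;
    }

    /** Demo helper only: parses one element from an XML string with the JDK parser. */
    static Element parseElement(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo markup", e);
        }
    }

    public static void main(String[] args) {
        NamedNodeMap attrs =
                parseElement("<META NAME=\"robots\" CONTENT=\"noindex\"/>").getAttributes();
        // The exact-case lookup does not find the upper-case attribute.
        System.out.println(attrs.getNamedItem("name"));
        // The case-insensitive lookup does.
        System.out.println(getNamedItemIgnoreCase(attrs, "name").getNodeValue());
    }
}
```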
Andrew McNabb wrote:
One of the great things about open source is that projects can be used
for unintended purposes. In fact, Nutch works well for parallel
computing in general, not just for web indexing. Apparently Google has
thousands of projects that use MapReduce.
The plan is to move NDFS
On Mon, Jan 09, 2006 at 11:45:09AM -0800, Doug Cutting wrote:
> A NullReporter would be easy to define, but I'm not sure why you ask
> since Reporter's are not usually created by user code but rather by
> the MapReduce system.
>
One of the great things about open source is that projects can be us
setting http.content.limit to -1 seems to break text parsing on some files
--
Key: NUTCH-168
URL: http://issues.apache.org/jira/browse/NUTCH-168
Project: Nutch
Type: Bug
Components: fetcher
[ http://issues.apache.org/jira/browse/NUTCH-160?page=all ]
Doug Cutting resolved NUTCH-160:
Fix Version: 0.8-dev
Resolution: Fixed
I just committed this patch. Thanks!
> Use standard Java Regex library rather than org.apache.oro.text.regex
>
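For readers unfamiliar with the standard library mentioned in the issue title, here is a minimal sketch of `java.util.regex` usage. It is not taken from the NUTCH-160 patch; the class `JdkRegexSketch`, the method `hasSessionId`, and the session-id pattern are hypothetical examples.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of matching with the JDK's java.util.regex package. */
final class JdkRegexSketch {

    // Compile once and reuse: Pattern instances are immutable and thread-safe.
    // The pattern itself is an invented example: a "sid" query parameter
    // followed by at least eight hexadecimal digits.
    private static final Pattern SESSION_ID =
            Pattern.compile("[?&]sid=[0-9a-f]{8,}", Pattern.CASE_INSENSITIVE);

    /** Returns true if the URL contains something that looks like a session id. */
    static boolean hasSessionId(String url) {
        // A Matcher is created per input; find() looks for a match anywhere in the string.
        return SESSION_ID.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSessionId("http://example.com/page?sid=0123456789abcdef"));
        System.out.println(hasSessionId("http://example.com/page?sid=12"));
    }
}
```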
... in fact, not really... really unrelated!!!
I'll remove it immediately.
Thanks
On 1/9/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> [EMAIL PROTECTED] wrote:
> > --- lucene/nutch/trunk/src/plugin/build.xml (original)
> > +++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006
> > @@
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362249 ]
Doug Cutting commented on NUTCH-139:
Let me try to be more concrete. I'd prefer that the X-nutch properties be
removed from MetadataNames before this is committed, and mov
Yes, everything is in org.apache now, I believe. Thanks for helping out.
Otis
- Original Message
From: Jerry Russell <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Mon 09 Jan 2006 02:20:02 PM EST
Subject: wiki:commandline options classpaths
I noticed that the command line
Stefan Groschupf wrote:
In Nutch 0.8 the index is not in the segment folder any more.
What was the reason for that? In the context of a web GUI it might be
better to have the index in the segment folder as well, since the
segment folder would then be the single item whose life-cycle has to be managed,
T
Andrew McNabb wrote:
I'm looking at the Reporter interface, and I would like to verify my
understanding of what it is. It appears to me that Reporter.setStatus()
is called periodically during an operation to give a human-readable
description of how far the operation has progressed. Is that correct?
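The usage pattern described in this question, along with the NullReporter idea raised elsewhere in the thread, can be sketched as below. This is not the Nutch API: `StatusReporter` is a hypothetical stand-in for the real Reporter interface (whose exact signature may differ), and `NullStatusReporter`, `RecordingReporter`, and `process` are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of periodic status reporting and a do-nothing reporter. */
final class ReporterSketch {

    /** Stand-in for a reporter interface; the real Nutch signature may differ. */
    interface StatusReporter {
        void setStatus(String status);
    }

    /** A reporter that discards every status, for code run outside the MapReduce system. */
    static final class NullStatusReporter implements StatusReporter {
        @Override
        public void setStatus(String status) {
            // intentionally empty
        }
    }

    /** A reporter that keeps every status in memory, useful for inspection. */
    static final class RecordingReporter implements StatusReporter {
        final List<String> statuses = new ArrayList<>();

        @Override
        public void setStatus(String status) {
            statuses.add(status);
        }
    }

    /** Simulates a long operation that reports a human-readable status every {@code every} records. */
    static int process(int records, int every, StatusReporter reporter) {
        int done = 0;
        for (int i = 0; i < records; i++) {
            done++; // a real task would read and handle one record here
            if (done % every == 0) {
                reporter.setStatus(done + " records processed");
            }
        }
        return done;
    }

    public static void main(String[] args) {
        RecordingReporter recorder = new RecordingReporter();
        process(250, 100, recorder);
        System.out.println(recorder.statuses);
        process(250, 100, new NullStatusReporter()); // same work, statuses dropped
    }
}
```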
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362242 ]
Doug Cutting commented on NUTCH-139:
We can just use different names, rather than two metaData objects: X-nutch
names for derived or other values that are usually protocol
I noticed that the command line options in the wiki have net.nutch.* instead of
the newer org.apache.*. I just wanted to confirm that it's OK to change them all.
(I'm new to this group, so I wanted to check first.)
Thanks,
Jerry
[EMAIL PROTECTED] wrote:
--- lucene/nutch/trunk/src/plugin/build.xml (original)
+++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006
@@ -6,13 +6,14 @@
-
-
+
+
Was this change intentional? It looks unrelated.
Otherwise, this looks great!
Doug
Just a follow-up: I figured out the third exception below (Exception in
thread "main" java.io.IOException: No input directories specified in:
NutchConf..), so no worries there. But the others are still issues.
Matt Zytaruk wrote:
I've been having a lot of trouble lately with the newest nutch src.
I've been having a lot of trouble lately with the newest nutch src. Both
my crawls and parses are failing (for our fetches we crawl and parse at
the same time with just the default nutch config, just to get the
outlinks and update the crawldb, but then later on, after the fetch we
do another pa
I have the same problem too.
I don't understand what is happening.
The CommandRunner returns a -1 exit code, yet the error output is empty
and the standard output contains the expected string ("nutch rocks nutch
rocks nutch rocks").
Everything seems to be OK except the exit code.
Jérôme
On 1/9/06, Piotr Kosior
It fails on my machine on the parse-ext tests. I am not sure what is causing
it yet, and I am afraid I do not have time to investigate it today;
maybe in a few days. I made a small change to make it compile a few days
ago, but all tests passed before I committed it.
Regards
Piotr
Stefan Groschupf wro
Hi Doug,
In Nutch 0.8 the index is not in the segment folder any more.
What was the reason for that? In the context of a web GUI it might be
better to have the index in the segment folder as well, since the
segment folder would then be the single item whose life-cycle has to be managed.
Thanks for an expla
On Mon, 2006-01-09 at 12:07 +0200, Gal Nitzan wrote:
> I am trying to figure out how the required map is set/calculated by
> Nutch.
>
> I have 3 task trackers.
>
> I added one more.
>
> When I run fetch only the initial three are fetching.
>
> I have added the task tracker before calling genera
I am trying to figure out how the required map is set/calculated by
Nutch.
I have 3 task trackers.
I added one more.
When I run fetch only the initial three are fetching.
I have added the task tracker before calling generate (if it has any
meaning)
Thanks,
G.
--
OK. thanks for the patch.
I shall embed it tonight.
I promise :) to let you know...
Gal.
On Mon, 2006-01-09 at 10:53 +0100, Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>
> >Sorry :) no.
> >
> >
> >
>
> Hmm. ok. :) But I think that patch is needed anyway, because now we
> silently assume t
Gal Nitzan wrote:
Sorry :) no.
Hmm. ok. :) But I think that patch is needed anyway, because now we
silently assume that parse plugins will always copy all Content metadata
to ParseData.metadata, while it may not be the case - and it certainly
does not happen if there is a parse error ..
Sorry :) no.
I run fetcher with parse.
This NPE happens for only a few documents and that is the problem :)
On Mon, 2006-01-09 at 09:43 +0100, Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>
> >Hi Andrzej,
> >
> >The value cannot be null is my message :)
> >
> >
> >
>
> :)
>
> I'm guessing
Gal Nitzan wrote:
Hi Andrzej,
The value cannot be null is my message :)
:)
I'm guessing that you are using Fetcher in non-parsing mode, and then
you run ParseSegment as a separate step, right?
Please try the attached patch.
--
Best regards,
Andrzej Bialecki <><