[Nutch-dev] NDFS / map tasks

2006-01-09 Thread Byron Miller
Could NDFS be easily modified so that the master node sends the Map task to the data replica/task node that actually has the data locally, alleviating network traffic load? In a scenario like this the master node could be prepped like Google does so that when the job is nearing completion it could
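
The locality idea raised here can be illustrated with a small sketch. This is not NDFS or Nutch code: the class, the method, and the node names are all hypothetical, and it assumes the master already knows which hosts hold replicas of a block and which task nodes are idle.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch only: not part of NDFS or Nutch. All names are illustrative.
class LocalityScheduler {
    /**
     * Picks a task node for one block. An idle node that already holds a
     * replica is preferred; otherwise the first idle node is used, which
     * means the block has to be read over the network.
     */
    static Optional<String> pickNode(List<String> replicaHosts, List<String> idleNodes) {
        for (String node : idleNodes) {
            if (replicaHosts.contains(node)) {
                return Optional.of(node); // data-local assignment
            }
        }
        return idleNodes.stream().findFirst(); // no idle node has a local replica
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("node2", "node5");
        // node5 is idle and holds a replica, so it is chosen.
        System.out.println(pickNode(replicas, Arrays.asList("node1", "node5")));
        // No idle node holds a replica, so the first idle node is chosen.
        System.out.println(pickNode(replicas, Arrays.asList("node1", "node3")));
    }
}
```

A real scheduler would also have to weigh load and rack placement; the sketch shows only the "prefer the node that has the data" rule.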

[Nutch-dev] [jira] Commented: (NUTCH-162) country code "jp" is used instead of language code "ja" for Japanese

2006-01-09 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-162?page=comments#action_12362274 ] Paul Baclace commented on NUTCH-162: The best practice for identifying localization is to use the ISO language and country code in the form of lowercase language code follo
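
The distinction in the issue title (language code "ja" versus country code "jp") can be seen with the standard java.util.Locale class. This is a minimal sketch using only the JDK; the LocaleCodes class and its tag helper are illustrative and are not taken from Nutch.

```java
import java.util.Locale;

// Illustrative only: shows the language/country split using the JDK's Locale.
class LocaleCodes {
    // Joins the language and country codes with an underscore.
    static String tag(Locale locale) {
        return locale.getLanguage() + "_" + locale.getCountry();
    }

    public static void main(String[] args) {
        Locale japan = Locale.JAPAN;
        System.out.println(japan.getLanguage()); // language code for Japanese
        System.out.println(japan.getCountry());  // country code for Japan
        System.out.println(tag(japan));
    }
}
```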

[Nutch-dev] Re: Reporter interface

2006-01-09 Thread Doug Cutting
Andrew McNabb wrote: SequenceFileInputFormat inputformat = new SequenceFileInputFormat(); RecordReader in = inputformat.getRecordReader(fshandle, split[i], logjob, nullreporter); To read sequence files directly outside of MapReduce, just use SequenceFile directly, e.g., something like: MyKe

[Nutch-dev] [jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-09 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362272 ] Paul Baclace commented on NUTCH-153: > NUTCH-160? There is slowness and then there is continental drift. The quantifiers should be used with any regex package unless the

[Nutch-dev] Re: Reporter interface

2006-01-09 Thread Andrew McNabb
On Mon, Jan 09, 2006 at 03:28:45PM -0800, Doug Cutting wrote: > > I'm still not clear why one might need a NullReporter. To be more clear I should be a little more specific. I had to read in from a SequenceFile to interpret results of a string of MapReduce stages. Here's a simplified snippet.

[Nutch-dev] HTMLMetaProcessor a bug?

2006-01-09 Thread Gal Nitzan
Hi, I was going over the code and noticed that, in class org.apache.nutch.parse.html.HTMLMetaProcessor, method getMetaTagsHelper, the following code would fail in case the meta tags are in upper case: Node nameNode = attrs.getNamedItem("name"); Node equivNode = attrs.get
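
One way to see the concern is with the JDK's own DOM classes, where NamedNodeMap.getNamedItem matches names exactly. The sketch below is not the Nutch code and not a proposed patch; MetaAttrLookup and both of its methods are hypothetical helpers, and whether the HTML parser Nutch uses normalizes attribute case before this point is not something the snippet settles.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Nutch.
class MetaAttrLookup {
    // Returns the first attribute whose name matches ignoring case, or null.
    static Node getNamedItemIgnoreCase(NamedNodeMap attrs, String name) {
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            if (attr.getNodeName().equalsIgnoreCase(name)) {
                return attr;
            }
        }
        return null;
    }

    // Builds a <meta> element carrying a single attribute, for demonstration.
    static Element metaWithAttr(String attrName, String value) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element meta = doc.createElement("meta");
            meta.setAttribute(attrName, value);
            return meta;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        NamedNodeMap attrs = metaWithAttr("NAME", "robots").getAttributes();
        // Exact-case lookup does not find the upper-case attribute.
        System.out.println(attrs.getNamedItem("name"));
        // Case-insensitive lookup finds it.
        System.out.println(getNamedItemIgnoreCase(attrs, "name").getNodeValue());
    }
}
```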

[Nutch-dev] Re: Reporter interface

2006-01-09 Thread Doug Cutting
Andrew McNabb wrote: One of the great things about open source is that projects can be used for unintended purposes. In fact, Nutch works well for parallel computing in general, not just for web indexing. Apparently Google has thousands of projects that use MapReduce. The plan is to move NDFS

[Nutch-dev] Re: Reporter interface

2006-01-09 Thread Andrew McNabb
On Mon, Jan 09, 2006 at 11:45:09AM -0800, Doug Cutting wrote: > A NullReporter would be easy to define, but I'm not sure why you ask > since Reporters are not usually created by user code but rather by > the MapReduce system. > One of the great things about open source is that projects can be us
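
For readers unfamiliar with the idea, a "null" reporter is simply an implementation that discards every status update. The sketch below uses a stand-in Reporter interface with a single setStatus(String) method; the real Nutch interface may declare a different signature, and ReporterDemo and countRecords are hypothetical names used only for illustration.

```java
// Sketch under stated assumptions: Reporter here is a stand-in, not the Nutch interface.
class ReporterDemo {
    // Stand-in for the Reporter interface discussed in the thread.
    interface Reporter {
        void setStatus(String status);
    }

    // A reporter that ignores all status updates.
    static class NullReporter implements Reporter {
        @Override
        public void setStatus(String status) {
            // intentionally does nothing
        }
    }

    // Hypothetical worker that reports progress after each record.
    static int countRecords(String[] records, Reporter reporter) {
        int processed = 0;
        for (String ignored : records) {
            processed++;
            reporter.setStatus("processed " + processed + " of " + records.length);
        }
        return processed;
    }

    public static void main(String[] args) {
        // Code written against Reporter can run outside a framework
        // by passing the no-op implementation.
        System.out.println(countRecords(new String[] {"a", "b", "c"}, new NullReporter()));
    }
}
```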

[Nutch-dev] [jira] Created: (NUTCH-168) setting http.content.limit to -1 seems to break text parsing on some files

2006-01-09 Thread Jerry Russell (JIRA)
setting http.content.limit to -1 seems to break text parsing on some files -- Key: NUTCH-168 URL: http://issues.apache.org/jira/browse/NUTCH-168 Project: Nutch Type: Bug Components: fetcher

[Nutch-dev] [jira] Resolved: (NUTCH-160) Use standard Java Regex library rather than org.apache.oro.text.regex

2006-01-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-160?page=all ] Doug Cutting resolved NUTCH-160: Fix Version: 0.8-dev Resolution: Fixed I just committed this patch. Thanks! > Use standard Java Regex library rather than org.apache.oro.text.regex >

[Nutch-dev] Re: svn commit: r367137 - in /lucene/nutch/trunk/src: java/org/apache/nutch/net/protocols/ plugin/ plugin/lib-http/ plugin/lib-http/src/ plugin/lib-http/src/java/ plugin/lib-http/src/java/

2006-01-09 Thread Jérôme Charron
... in fact, not really... really unrelated! I'll remove it immediately. Thanks On 1/9/06, Doug Cutting <[EMAIL PROTECTED]> wrote: > > [EMAIL PROTECTED] wrote: > > --- lucene/nutch/trunk/src/plugin/build.xml (original) > > +++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006 > > @@

[Nutch-dev] [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362249 ] Doug Cutting commented on NUTCH-139: Let me try to be more concrete. I'd prefer that the X-nutch properties be removed from MetadataNames before this is committed, and mov

[Nutch-dev] Re: wiki:commandline options classpaths

2006-01-09 Thread ogjunk-nutch
Yes, everything is in org.apache now, I believe. Thanks for helping out. Otis - Original Message From: Jerry Russell <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Mon 09 Jan 2006 02:20:02 PM EST Subject: wiki:commandline options classpaths I noticed that the command line

[Nutch-dev] Re: why index not in segment anymore

2006-01-09 Thread Doug Cutting
Stefan Groschupf wrote: In Nutch 0.8 the index is not in the segment folder any more. What was the reason for that? In the context of a web GUI it would maybe be better to have the index also in the segment folder, since the segment folder would be the single item to manage a life-cycle, T

[Nutch-dev] Re: Reporter interface

2006-01-09 Thread Doug Cutting
Andrew McNabb wrote: I'm looking at the Reporter interface, and I would like to verify my understanding of what it is. It appears to me that Reporter.setStatus() is called periodically during an operation to give a human-readable description of how far the progress is so far. Is that correct?

[Nutch-dev] [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-09 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362242 ] Doug Cutting commented on NUTCH-139: We can just use different names, rather than two metaData objects: X-nutch names for derived or other values that are usually protocol

[Nutch-dev] wiki:commandline options classpaths

2006-01-09 Thread Jerry Russell
I noticed that the command line options in the wiki have net.nutch.* instead of the newer org.apache.*. Just wanted to confirm if it's OK to change them all. (I'm new to this group, just wanted to confirm first) Thanks, Jerry

[Nutch-dev] Re: svn commit: r367137 - in /lucene/nutch/trunk/src: java/org/apache/nutch/net/protocols/ plugin/ plugin/lib-http/ plugin/lib-http/src/ plugin/lib-http/src/java/ plugin/lib-http/src/java/

2006-01-09 Thread Doug Cutting
[EMAIL PROTECTED] wrote: --- lucene/nutch/trunk/src/plugin/build.xml (original) +++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006 @@ -6,13 +6,14 @@ - - + + Was this change intentional? It looks unrelated. Otherwise, this looks great! Doug

[Nutch-dev] Re: Crawl and parse exceptions

2006-01-09 Thread Matt Zytaruk
Just a followup, I figured out the 3rd exception below ( Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf..) so no worries there, but the others are still issues. Matt Zytaruk wrote: I've been having a lot of trouble lately with the newest nutch src.

[Nutch-dev] Crawl and parse exceptions

2006-01-09 Thread Matt Zytaruk
I've been having a lot of trouble lately with the newest nutch src. Both my crawls and parses are failing (for our fetches we crawl and parse at the same time with just the default nutch config, just to get the outlinks and update the crawldb, but then later on, after the fetch we do another pa

[Nutch-dev] Re: test suite fails?

2006-01-09 Thread Jérôme Charron
I have the same problem too. I don't understand what happens. In fact, the CommandRunner returns a -1 exit code, but nothing in the error output and the good string in the standard output ("nutch rocks nutch rocks nutch rocks"). All seems to be ok but the exit code. Jérôme On 1/9/06, Piotr Kosior

[Nutch-dev] Re: test suite fails?

2006-01-09 Thread Piotr Kosiorowski
It fails on my machine on parse-ext tests. I am not sure what is causing it yet and I am afraid I do not have time to investigate it today - maybe in a few days. I did a small change to make it compile a few days ago, but all tests went ok before I committed it. Regards Piotr Stefan Groschupf wro

[Nutch-dev] why index not in segment anymore

2006-01-09 Thread Stefan Groschupf
Hi Doug, in Nutch 0.8 the index is not in the segment folder any more. What was the reason for that? In the context of a web GUI it would maybe be better to have the index also in the segment folder, since the segment folder would be the single item to manage a life-cycle, Thanks for an expla

[Nutch-dev] Re: What/how num of required maps is set? OOP Wrong list

2006-01-09 Thread Gal Nitzan
On Mon, 2006-01-09 at 12:07 +0200, Gal Nitzan wrote: > I am trying to figure out how the required map is set/calculated by > Nutch. > > I have 3 task trackers. > > I added one more. > > When I run fetch only the initial three are fetching. > > I have added the task tracker before calling genera

[Nutch-dev] What/how num of required maps is set?

2006-01-09 Thread Gal Nitzan
I am trying to figure out how the required map is set/calculated by Nutch. I have 3 task trackers. I added one more. When I run fetch only the initial three are fetching. I have added the task tracker before calling generate (if it has any meaning) Thanks, G.

[Nutch-dev] Re: NPE in Indexer.java line 184

2006-01-09 Thread Gal Nitzan
OK. thanks for the patch. I shall embed it tonight. I promise :) to let you know... Gal. On Mon, 2006-01-09 at 10:53 +0100, Andrzej Bialecki wrote: > Gal Nitzan wrote: > > >Sorry :) no. > > > > > > > > Hmm. ok. :) But I think that patch is needed anyway, because now we > silently assume t

[Nutch-dev] Re: NPE in Indexer.java line 184

2006-01-09 Thread Andrzej Bialecki
Gal Nitzan wrote: Sorry :) no. Hmm. ok. :) But I think that patch is needed anyway, because now we silently assume that parse plugins will always copy all Content metadata to ParseData.metadata, while it may not be the case - and it certainly does not happen if there is a parse error ..

[Nutch-dev] Re: NPE in Indexer.java line 184

2006-01-09 Thread Gal Nitzan
Sorry :) no. I run fetcher with parse. This NPE happens for only a few documents and that is the problem :) On Mon, 2006-01-09 at 09:43 +0100, Andrzej Bialecki wrote: > Gal Nitzan wrote: > > >Hi Andrzej, > > > >The value cannot be null is my message :) > > > > > > > > :) > > I'm guessing

[Nutch-dev] Re: NPE in Indexer.java line 184

2006-01-09 Thread Andrzej Bialecki
Gal Nitzan wrote: Hi Andrzej, The value cannot be null is my message :) :) I'm guessing that you are using Fetcher in non-parsing mode, and then you run ParseSegment as a separate step, right? Please try the attached patch. -- Best regards, Andrzej Bialecki