[Nutch-dev] Reporter interface

2006-01-06 Thread Andrew McNabb
I'm looking at the Reporter interface, and I would like to verify my understanding of what it is. It appears to me that Reporter.setStatus() is called periodically during an operation to give a human-readable description of how far the progress is so far. Is that correct? If so, is there a reaso

[Nutch-dev] Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
Matt Zytaruk wrote: So will this throw an exception on older segments? or will it just not get the correct metadata? I have a lot of older segments I still need to use. Thanks for your help. The patch that I sent in my previous email handles both versions, so you will be able to use your o

[Nutch-dev] [jira] Commented: (NUTCH-152) TaskRunner io pipes are not setDaemon(true), cleanup and exception errors are incomplete, max heap too small

2006-01-06 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-152?page=comments#action_12362043 ] Paul Baclace commented on NUTCH-152: >re 3: Why is a separate thread needed for stdout? It certainly makes the code easier to read. Using the main thread to read the sub
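
As a rough illustration of the separate-thread approach discussed in this comment, the sketch below copies one stream to another on a daemon thread, so the copier cannot keep the JVM alive by itself. The class name `StreamPump` and its `start` method are hypothetical; this is not TaskRunner's code or the NUTCH-152 patch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, for illustration only (not part of Nutch).
final class StreamPump {
    private StreamPump() {}

    /** Starts a daemon thread that copies everything from in to out until end of stream. */
    static Thread start(InputStream in, OutputStream out) {
        Thread pump = new Thread(() -> {
            byte[] buffer = new byte[4096];
            try {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                out.flush();
            } catch (IOException e) {
                // A closed pipe usually ends up here; there is nothing left to copy.
            }
        }, "stream-pump");
        pump.setDaemon(true); // must be set before start()
        pump.start();
        return pump;
    }
}
```

A caller would typically start one pump for a child process's stdout and another for its stderr, then wait for the process and join both threads.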

[Nutch-dev] Re: Per-page crawling policy

2006-01-06 Thread Andrzej Bialecki
Jack Tang wrote: Hi Andrzej The idea brings vertical search into nutch and definitely it is great:) I think nutch should add information retrieving layer into the whole architecture, and export some abstract interface, say UrlBasedInformationRetrieve(you can implement your url grouping idea here?

[Nutch-dev] Re: creating MapFiles from unsorted data?

2006-01-06 Thread Matt Kangas
Thanks for the quick feedback! I'll use the existing facilities to finish NUTCH-87 for now. There's a good chance that I'll need to do more stuff like this soon, 'tho, and if so, I'll consider patching MapFile. --Matt On Jan 6, 2006, at 2:12 PM, Doug Cutting wrote: Matt Kangas wrote: Cle

[Nutch-dev] Class Cast exception

2006-01-06 Thread Matt Zytaruk
The newest src (as of this morning) of trunk is occasionally giving ClassCastExceptions when doing a crawl, with parsing (and by occasionally I mean this was the only page out of the small list I crawled that it happened on). This is with nothing changed from the defaults and on a server

[Nutch-dev] Re: svn commit: r366550 - /lucene/nutch/trunk/src/java/org/apache/nutch/ipc/Client.java

2006-01-06 Thread Stefan Groschupf
Make it clearer why this optimization is valid. For Stefan. ... Thanks. :-)

[Nutch-dev] Re: creating MapFiles from unsorted data?

2006-01-06 Thread Doug Cutting
Matt Kangas wrote: Clearly this won't scale for a large textfile, so I'm changing it to use a temporary SequenceFile instead. Then I'll sort the SequenceFile, and copy item-by-item into the MapFile. While I'm doing this, I'm wondering if there isn't a way to avoid the 2nd copy. No, not

[Nutch-dev] Re: mapred crawling exception - Job failed!

2006-01-06 Thread Lukas Vlcek
Huh... anybody interested in this? Normally I wouldn't be so pushy, but to me it seems that Nutch dies if it meets a Word document which can't be parsed. This seems like a serious issue to me. Or did I overlook something important/fundamental? Lukas On 1/6/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote: >

[Nutch-dev] [jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Comment: was deleted > Standard metadata property names in the ParseData metadata > -- > > Key: NU

[Nutch-dev] [jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Comment: was deleted > Standard metadata property names in the ParseData metadata > -- > > Key: NU

[Nutch-dev] RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread chris.mattmann
Guys, My apologies for the spamming comments -- I tried to submit my comment through JIRA one time and it kept giving me service unavailable. So I resubmitted like 5 times, on the fifth time it finally went through -- but I guess the other comments went through too. I'll try and remove them right

[Nutch-dev] [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361927 ] Chris A. Mattmann commented on NUTCH-139: - Hi Doug, While it's true that content-length can be computed from the Content's data, wouldn't it also be nice to have it

[Nutch-dev] [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361926 ] Chris A. Mattmann commented on NUTCH-139: - Hi Doug, While it's true that content-length can be computed from the Content's data, wouldn't it also be nice to have it

[Nutch-dev] [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361925 ] Chris A. Mattmann commented on NUTCH-139: - Hi Doug, While it's true that content-length can be computed from the Content's data, wouldn't it also be nice to have it

[Nutch-dev] [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361923 ] Chris A. Mattmann commented on NUTCH-139: - Hi Doug, While it's true that content-length can be computed from the Content's data, wouldn't it also be nice to have it

[Nutch-dev] [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361924 ] Chris A. Mattmann commented on NUTCH-139: - Hi Doug, While it's true that content-length can be computed from the Content's data, wouldn't it also be nice to have it

[Nutch-dev] Nutch Deployment

2006-01-06 Thread Chris Mattmann
Hi Folks, Jerome and I have been thinking a bit about the whole issue of "static" NutchConf, versus removing it and making it a constructor parameter, etc. I personally think that a lot of this issue stems from the fact that the actual source code for nutch, and the what I would call "source dis

[Nutch-dev] [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362013 ] Jerome Charron commented on NUTCH-139: -- Doug, The purpose of this patch is to provide some standard metadata names and to be able to handle erroneous names, not to handle

[Nutch-dev] Re: Class Cast exception

2006-01-06 Thread Matt Zytaruk
Worked perfectly. Thanks -Matt Zytaruk Andrzej Bialecki wrote: Hi, I attached the patch. Please test. Index: ParseData.java === --- ParseData.java (r

[Nutch-dev] Re: problems http-client

2006-01-06 Thread Ken Krugler
I have started to see this problem recently. topN=20 per crawl, but fetched pages = 15 - 17, while error pages = 2000 - 5000. >25000 pages are missing. this is reproducible with nutch0.7.1, both protocol-http and protocol-httpclient are included. Depending on how you have Nutch con

[Nutch-dev] [jira] Resolved: (NUTCH-150) OutlinkExtractor extremely slow on some non-plain text

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ] Doug Cutting resolved NUTCH-150: Fix Version: 0.7.2-dev Resolution: Fixed I just committed this. Thanks, Paul! > OutlinkExtractor extremely slow on some non-plain text >

[Nutch-dev] [jira] Resolved: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Doug Cutting resolved NUTCH-151: Fix Version: 0.8-dev Resolution: Fixed I just committed this. Thanks, Paul! > CommandRunner can hang after the main thread exec is finished and has

[Nutch-dev] [jira] Commented: (NUTCH-152) TaskRunner io pipes are not setDaemon(true), cleanup and exception errors are incomplete, max heap too small

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-152?page=comments#action_12362004 ] Doug Cutting commented on NUTCH-152: re 1,2,5: sounds good. re 3: Why is a separate thread needed for stdout? Can you please elaborate on how this causes problems? re 4: I

[Nutch-dev] Re: Class Cast exception

2006-01-06 Thread Matt Zytaruk
So will this throw an exception on older segments? or will it just not get the correct metadata? I have a lot of older segments I still need to use. Thanks for your help. -Matt Zytaruk Andrzej Bialecki wrote: Matt Zytaruk wrote: Here you go. java.lang.ClassCastException: java.util.ArrayLi

[Nutch-dev] [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362003 ] Doug Cutting commented on NUTCH-139: Also, since the primary use of multiple metadata values should be for protocols where multiple-values are required, the method to add a

[Nutch-dev] Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
Hi, I attached the patch. Please test. -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web; Embedded Unix, System Integration http://www.sigram.com Contact: info at s

[Nutch-dev] Re: [bug?] PRC called emthod require parameter

2006-01-06 Thread Doug Cutting
Okay, here's my patch attached. We don't need an all-new unit test file, when just a few lines are needed there. Does this look right to you? Doug Stefan Groschupf wrote: What bug was that? What is your one-line fix? http://www.nabble.com/RCP-known-limitation-or-bug--t688207.html somet

[Nutch-dev] Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
Matt Zytaruk wrote: Here you go. java.lang.ClassCastException: java.util.ArrayList at org.apache.nutch.parse.ParseData.write(ParseData.java:122) at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51) at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:

[Nutch-dev] [jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362002 ] Doug Cutting commented on NUTCH-153: Paul, Does http://issues.apache.org/jira/browse/NUTCH-160 address this issue too? I.e., is at least part of the problem that oro has

[Nutch-dev] Re: Per-page crawling policy

2006-01-06 Thread Jack Tang
Hi Andrzej The idea brings vertical search into nutch and definitely it is great:) I think nutch should add information retrieving layer into the whole architecture, and export some abstract interface, say UrlBasedInformationRetrieve(you can implement your url grouping idea here?), TextBasedInformat

[Nutch-dev] Re: problems http-client

2006-01-06 Thread AJ Chen
I have started to see this problem recently. topN=20 per crawl, but fetched pages = 15 - 17, while error pages = 2000 - 5000. >25000 pages are missing. this is reproducible with nutch0.7.1, both protocol-http and protocol-httpclient are included. I also see lots of "Response content

[Nutch-dev] [jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-06 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362000 ] Paul Baclace commented on NUTCH-153: > mime.type.magic? The particular run that had problems was using mime.type.magic=true. It turns out that the magic "%!PS-Adobe" wa

[Nutch-dev] Re: Class Cast exception

2006-01-06 Thread Matt Zytaruk
Here you go. java.lang.ClassCastException: java.util.ArrayList at org.apache.nutch.parse.ParseData.write(ParseData.java:122) at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51) at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57) at org.apa

[Nutch-dev] Re: Adaptive fetch interval & unmodified content detection, episode II

2006-01-06 Thread Doug Cutting
Andrzej Bialecki wrote: For efficiency reasons, most of this information is stored and passed to processing jobs inside instances of CrawlDatum - for the key step of DB update any other parts of segments (such as Content, ParseData or ParseText) are not used, which prevents easy access to other

[Nutch-dev] [jira] Commented: (NUTCH-160) Use standard Java Regex library rather than org.apache.oro.text.regex

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-160?page=comments#action_12361999 ] Doug Cutting commented on NUTCH-160: +1 I like this patch. I don't see a need for us to use oro anywhere, since Java now has good builtin regex support. And Java's regex

[Nutch-dev] Re: [bug?] PRC called emthod require parameter

2006-01-06 Thread Stefan Groschupf
What bug was that? What is your one-line fix? http://www.nabble.com/RCP-known-limitation-or-bug--t688207.html something like: Object[] values = method.getReturnType() != null ? (Object[]) Array.newInstance(method.getReturnType(), wrappedValues.length) : new Object[0];
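
The idea in this message, allocating a result array whose component type matches a method's return type, can be sketched as follows. This is only an illustration under stated assumptions, not the patch from the thread: the class `ReturnArrays` and method `allocate` are hypothetical names. One detail worth keeping in mind is that `Method.getReturnType()` reports `void.class` rather than `null` for void methods, and `Array.newInstance` rejects `void.class`, so the sketch checks for void and primitive types explicitly.

```java
import java.lang.reflect.Array;

// Hypothetical helper, for illustration only (not the code committed to Nutch).
final class ReturnArrays {
    private ReturnArrays() {}

    /** Returns a typed array for reference return types, or an empty Object[] otherwise. */
    static Object[] allocate(Class<?> returnType, int length) {
        // isPrimitive() is true for void.class as well as int.class, long.class, etc.
        // None of these can back an Object[], so fall back to an empty array.
        if (returnType == null || returnType.isPrimitive()) {
            return new Object[0];
        }
        return (Object[]) Array.newInstance(returnType, length);
    }
}
```

With this helper, `allocate(String.class, 2)` yields a `String[]` of length 2, while `allocate(void.class, 2)` yields an empty `Object[]`.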

[Nutch-dev] Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
Matt Zytaruk wrote: The newest src (as of this morning) of trunk is occasionally giving ClassCastExceptions when doing a crawl, with parsing (and by occasionally I mean this was the only page out of the small list I crawled that it happened on). This is with nothing changed from the def

[Nutch-dev] Re: [bug?] PRC called emthod require parameter

2006-01-06 Thread Doug Cutting
Stefan Groschupf wrote: Different parameters are sent to each address. So params.length should equal addresses.length, and if params.length==0 then addresses.length==0 and there's no call to be made. Make sense? It might be clearer if the test were changed to addresses.length==0. Yes, t

[Nutch-dev] Re: Normalizing URLs with anchors

2006-01-06 Thread Doug Cutting
Ken Krugler wrote: I'm wondering whether it would also make sense to remove anchor text from URLs. For example, currently these two URLs are treated as different: http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex and http://www.dina.kvl.dk/~sestoft/gcsharp/index.html Is it safe
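
To make the proposal concrete, here is a minimal sketch of dropping the fragment (the part from the first `#` onward) from a URL string. The class `FragmentStripper` and its method are hypothetical and are not Nutch's URL normalizer; whether doing this is safe for every site is exactly the question the thread raises.

```java
// Hypothetical helper, for illustration only (not Nutch's normalizer).
final class FragmentStripper {
    private FragmentStripper() {}

    /** Returns the URL without its fragment, or the input unchanged if it has none. */
    static String stripFragment(String url) {
        // In a URI the fragment starts at the first '#'.
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }
}
```

Applied to the two URLs in the message, both would normalize to `http://www.dina.kvl.dk/~sestoft/gcsharp/index.html`.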

[Nutch-dev] [jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-06 Thread KuroSaka TeruHiko (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12361997 ] KuroSaka TeruHiko commented on NUTCH-153: - Actually, shouldn't turning on the mime.type.magic property do the job that the patch is trying to address? > TextParser i

[Nutch-dev] [jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-06 Thread KuroSaka TeruHiko (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12361995 ] KuroSaka TeruHiko commented on NUTCH-153: - The strings command would work with mostly ASCII text content. It is highly doubtful if we can have a universal strings comm

[Nutch-dev] Re: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting
Chris Mattmann wrote: I've tried removing the 5 copies of the comment, however I can't find a button on JIRA to remove comments. Maybe an administrator for Nutch can do it? I removed the extra comments. No problem. Doug

[Nutch-dev] [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361994 ] Doug Cutting commented on NUTCH-139: Jerome, Some HTTP headers have multiple values. Correctly reflecting that was I thought the primary motivation for adding multiple va

[Nutch-dev] creating MapFiles from unsorted data?

2006-01-06 Thread Matt Kangas
Hi folks, I'm in the process of cleaning up my WhitelistURLFilter (NUTCH-87 on JIRA), and I've got a question about working with org.apache.nutch.io.MapFile. I am parsing a textfile with one key/value pair per line. I want to write this into a new MapFile. MapFile.Writer requires keys to
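
For small inputs, one simple way to obtain keys in order before handing them to an append-only writer is to collect the pairs in a `TreeMap`. The sketch below assumes, as the replies in this thread suggest, that the writer wants sorted keys; it also assumes a tab separates key and value on each line, which the message does not specify. `SortedPairs` is a hypothetical name, and the sketch deliberately stops short of calling any MapFile API. It holds everything in memory, so it would not scale to large text files.

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper, for illustration only (not part of Nutch).
final class SortedPairs {
    private SortedPairs() {}

    /** Parses "key<TAB>value" lines into a map that iterates in ascending key order. */
    static TreeMap<String, String> parse(List<String> lines) {
        TreeMap<String, String> sorted = new TreeMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t'); // assumed separator
            if (tab < 0) {
                continue; // skip lines without a separator
            }
            sorted.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return sorted;
    }
}
```

Iterating over `parse(lines).entrySet()` then yields the pairs in key order, ready to be appended one by one.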

[Nutch-dev] [jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Comment: was deleted > Standard metadata property names in the ParseData metadata > -- > > Key: NU

[Nutch-dev] [jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Comment: was deleted > Standard metadata property names in the ParseData metadata > -- > > Key: NU

[Nutch-dev] Antwort: Re: [VOTE] Commiter access for Stefan Groschupf

2006-01-06 Thread marcel . schnippe
Jérôme Charron <[EMAIL PROTECTED]> 05.01.2006 23:03 Please reply to nutch-dev@lucene.apache.org To nutch-dev@lucene.apache.org Cc Subject Re: [VOTE] Commiter access for Stefan Groschupf +1 >For me, it's 0 >I really like all Stefan's support efforts on mailing lists, all his >brainsto

[Nutch-dev] Re: problems http-client

2006-01-06 Thread Andrzej Bialecki
Jérôme Charron wrote: A related issue is that these two plugins replicate a lot of code. At some point we should try to fix that. See: http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html I have begun working on this. Nobody else? Can I go on?

[Nutch-dev] Re: problems http-client

2006-01-06 Thread Jérôme Charron
> > A related issue is that these two plugins replicate a lot of code. At some point we should try to fix that. See: http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html I have begun working on this. Nobody else? Can I go on? Jérôme -- http://motrech.fr