I'm looking at the Reporter interface, and I would like to verify my
understanding of what it is. It appears to me that Reporter.setStatus()
is called periodically during an operation to give a human-readable
description of the progress so far. Is that correct?
If so, is there a reaso
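A minimal sketch of how I read that contract, assuming a Reporter whose setStatus(String) accepts any free-form message; fetchOne() and the update interval below are made up for illustration, not real Nutch code:
// Sketch only: periodic human-readable status updates from a long-running task.
public void fetchPages(String[] urls, Reporter reporter) throws IOException {
  for (int i = 0; i < urls.length; i++) {
    fetchOne(urls[i]);                 // hypothetical per-URL work
    if (i % 100 == 0) {                // don't flood the status display
      reporter.setStatus("fetched " + i + " of " + urls.length + " pages");
    }
  }
}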
Matt Zytaruk wrote:
So will this throw an exception on older segments? Or will it just not
get the correct metadata? I have a lot of older segments I still need
to use.
Thanks for your help.
The patch that I sent in my previous email handles both versions, so you
will be able to use your o
[
http://issues.apache.org/jira/browse/NUTCH-152?page=comments#action_12362043 ]
Paul Baclace commented on NUTCH-152:
>re 3: Why is a separate thread needed for stdout?
It certainly makes the code easier to read. Using the main thread to read the
sub
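A minimal sketch of the problem being discussed, using only the standard java.lang.Process API (this is not the actual NUTCH-152 patch): if nobody drains the child's stdout, the OS pipe buffer can fill and both processes block, so the stream is read on its own thread while the main thread waits for the exit code.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StdoutDrainSketch {
  public static byte[] run(String[] command) throws IOException, InterruptedException {
    final Process proc = Runtime.getRuntime().exec(command);
    final ByteArrayOutputStream captured = new ByteArrayOutputStream();
    Thread drainer = new Thread(new Runnable() {
      public void run() {
        byte[] buf = new byte[4096];
        int n;
        try {
          InputStream in = proc.getInputStream();   // the child's stdout
          while ((n = in.read(buf)) != -1) {
            captured.write(buf, 0, n);
          }
        } catch (IOException e) {
          // stream closed or child gone; nothing more to drain
        }
      }
    });
    drainer.start();
    proc.waitFor();      // safe: stdout is being consumed concurrently
    drainer.join();
    return captured.toByteArray();
  }
}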
Jack Tang wrote:
Hi Andrzej
The idea brings vertical search into Nutch and it is definitely great :)
I think Nutch should add an information retrieval layer into the whole
architecture, and export some abstract interface, say
UrlBasedInformationRetrieve (you can implement your URL grouping idea
here?
Thanks for the quick feedback! I'll use the existing facilities to
finish NUTCH-87 for now. There's a good chance that I'll need to do
more stuff like this soon, 'tho, and if so, I'll consider patching
MapFile.
--Matt
On Jan 6, 2006, at 2:12 PM, Doug Cutting wrote:
Matt Kangas wrote:
Cle
The newest src (as of this morning) of trunk is occasionally giving
ClassCastExceptions when doing a crawl, with parsing (and by
occasionally I mean this was the only page out of the small list I
crawled that it happened on). This is with nothing changed from the
defaults and on a server
Make it clearer why this optimization is valid. For Stefan.
... Thanks. :-)
Matt Kangas wrote:
Clearly this won't scale for a large textfile, so I'm changing it to
use a temporary SequenceFile instead. Then I'll sort the SequenceFile,
and copy item-by-item into the MapFile.
While I'm doing this, I'm wondering if there isn't a way to avoid the
2nd copy.
No, not
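A sketch of the flow Matt describes. The org.apache.nutch.io constructor and method signatures below are assumed from memory of the Hadoop-style API of that era and may need adjusting; the file names, the tab-separated line format, and NutchFileSystem.get() are all illustrative.
// Sketch only -- signatures are assumed, not checked against the Nutch tree.
// Step 1 is the first copy (text -> unsorted SequenceFile), step 2 sorts it,
// step 3 is the second copy into the MapFile, whose writer requires keys to
// arrive in sorted order.
NutchFileSystem nfs = NutchFileSystem.get();                       // assumed factory
SequenceFile.Writer tmp =
    new SequenceFile.Writer(nfs, "whitelist.seq", UTF8.class, UTF8.class);
BufferedReader in = new BufferedReader(new FileReader("whitelist.txt"));
String line;
while ((line = in.readLine()) != null) {
  int tab = line.indexOf('\t');                                    // "key<TAB>value" lines
  tmp.append(new UTF8(line.substring(0, tab)), new UTF8(line.substring(tab + 1)));
}
in.close();
tmp.close();

new SequenceFile.Sorter(nfs, UTF8.class, UTF8.class)               // sort by key
    .sort("whitelist.seq", "whitelist.sorted");

MapFile.Writer map = new MapFile.Writer(nfs, "whitelist.map", UTF8.class, UTF8.class);
SequenceFile.Reader sorted = new SequenceFile.Reader(nfs, "whitelist.sorted");
UTF8 key = new UTF8();
UTF8 value = new UTF8();
while (sorted.next(key, value)) {
  map.append(key, value);                                          // second copy
}
sorted.close();
map.close();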
Huh...
anybody interested in this?
Normally I wouldn't be so pushy, but to me it seems that Nutch dies if it
meets a Word document which can't be parsed. This seems like a serious
issue to me.
Or did I overlook something important/fundamental?
Lukas
On 1/6/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
Doug Cutting updated NUTCH-139:
---
Comment: was deleted
> Standard metadata property names in the ParseData metadata
> --
>
> Key: NU
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
Doug Cutting updated NUTCH-139:
---
Comment: was deleted
> Standard metadata property names in the ParseData metadata
> --
>
> Key: NU
Guys,
My apologies for the spamming comments -- I tried to submit my comment
through JIRA one time and it kept giving me service unavailable. So I
resubmitted like 5 times; on the fifth time it finally went through -- but I
guess the other comments went through too. I'll try and remove them right
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361927 ]
Chris A. Mattmann commented on NUTCH-139:
-
Hi Doug,
While it's true that content-length can be computed from the Content's data,
wouldn't it also be nice to have it
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361926 ]
Chris A. Mattmann commented on NUTCH-139:
-
Hi Doug,
While it's true that content-length can be computed from the Content's data,
wouldn't it also be nice to have it
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361925 ]
Chris A. Mattmann commented on NUTCH-139:
-
Hi Doug,
While it's true that content-length can be computed from the Content's data,
wouldn't it also be nice to have it
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361923 ]
Chris A. Mattmann commented on NUTCH-139:
-
Hi Doug,
While it's true that content-length can be computed from the Content's data,
wouldn't it also be nice to have it
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361924 ]
Chris A. Mattmann commented on NUTCH-139:
-
Hi Doug,
While it's true that content-length can be computed from the Content's data,
wouldn't it also be nice to have it
Hi Folks,
Jerome and I have been thinking a bit about the whole issue of "static"
NutchConf, versus removing it and making it a constructor parameter, etc. I
personally think that a lot of this issue stems from the fact that the
actual source code for Nutch, and what I would call "source
dis
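As a rough illustration of the two styles under discussion (the property name and the filter classes below are made up; only the static NutchConf.get() accessor is meant to reflect existing code):
// Style 1: the "static" NutchConf -- any class can reach the global config.
class UrlFilterWithGlobalConf {
  private final int maxLen =
      NutchConf.get().getInt("urlfilter.max.length", 100);   // hypothetical property
}

// Style 2: the configuration is a constructor parameter, so each instance can
// carry its own settings and tests can pass in a throwaway NutchConf.
class UrlFilterWithInjectedConf {
  private final int maxLen;
  UrlFilterWithInjectedConf(NutchConf conf) {
    this.maxLen = conf.getInt("urlfilter.max.length", 100);
  }
}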
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362013 ]
Jerome Charron commented on NUTCH-139:
--
Doug,
The purpose of this patch is to provide some standard metadata names and to be
able to handle erroneous names, not to handle
Worked perfectly. Thanks
-Matt Zytaruk
Andrzej Bialecki wrote:
Hi,
I attached the patch. Please test.
Index: ParseData.java
===
--- ParseData.java (r
I have started to see this problem recently. topN=20 per crawl, but
fetched pages = 15 - 17, while error pages = 2000 - 5000. >25000
pages are missing. This is reproducible with Nutch 0.7.1; both protocol-http
and protocol-httpclient are included.
Depending on how you have Nutch con
[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ]
Doug Cutting resolved NUTCH-150:
Fix Version: 0.7.2-dev
Resolution: Fixed
I just committed this. Thanks, Paul!
> OutlinkExtractor extremely slow on some non-plain text
>
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ]
Doug Cutting resolved NUTCH-151:
Fix Version: 0.8-dev
Resolution: Fixed
I just committed this. Thanks, Paul!
> CommandRunner can hang after the main thread exec is finished and has
[
http://issues.apache.org/jira/browse/NUTCH-152?page=comments#action_12362004 ]
Doug Cutting commented on NUTCH-152:
re 1,2,5: sounds good.
re 3: Why is a separate thread needed for stdout? Can you please elaborate on
how this causes problems?
re 4: I
So will this throw an exception on older segments? Or will it just not
get the correct metadata? I have a lot of older segments I still need to
use.
Thanks for your help.
-Matt Zytaruk
Andrzej Bialecki wrote:
Matt Zytaruk wrote:
Here you go.
java.lang.ClassCastException: java.util.ArrayLi
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362003 ]
Doug Cutting commented on NUTCH-139:
Also, since the primary use of multiple metadata values should be for protocols
where multiple values are required, the method to add a
Hi,
I attached the patch. Please test.
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at s
Okay, here's my patch attached.
We don't need an all-new unit test file when just a few lines are
needed there.
Does this look right to you?
Doug
Stefan Groschupf wrote:
What bug was that? What is your one-line fix?
http://www.nabble.com/RCP-known-limitation-or-bug--t688207.html
somet
Matt Zytaruk wrote:
Here you go.
java.lang.ClassCastException: java.util.ArrayList
at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
at
org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:
[
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362002 ]
Doug Cutting commented on NUTCH-153:
Paul,
Does http://issues.apache.org/jira/browse/NUTCH-160 address this issue too?
I.e., is at least part of the problem that oro has
Hi Andrzej
The idea brings vertical search into Nutch and it is definitely great :)
I think Nutch should add an information retrieval layer into the whole
architecture, and export some abstract interface, say
UrlBasedInformationRetrieve (you can implement your URL grouping idea
here?), TextBasedInformat
I have started to see this problem recently. topN=20 per crawl, but
fetched pages = 15 - 17, while error pages = 2000 - 5000. >25000
pages are missing. This is reproducible with Nutch 0.7.1; both protocol-http
and protocol-httpclient are included.
I also see lots of "Response content
[
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362000 ]
Paul Baclace commented on NUTCH-153:
> mime.type.magic?
The particular run that had problems was using mime.type.magic=true. It turns
out that the magic "%!PS-Adobe" wa
Here you go.
java.lang.ClassCastException: java.util.ArrayList
at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
at
org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
at
org.apa
Andrzej Bialecki wrote:
For efficiency reasons, most of this information is stored and passed to
processing jobs inside instances of CrawlDatum; for the key step of DB
update, other parts of segments (such as Content, ParseData or
ParseText) are not used, which prevents easy access to other
[
http://issues.apache.org/jira/browse/NUTCH-160?page=comments#action_12361999 ]
Doug Cutting commented on NUTCH-160:
+1
I like this patch. I don't see a need for us to use oro anywhere, since Java
now has good builtin regex support. And Java's regex
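For reference, a tiny self-contained example of the built-in java.util.regex API that would replace ORO here; the pattern is illustrative only, not the one used by OutlinkExtractor:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSketch {
  public static void main(String[] args) {
    // Deliberately simple URL-ish pattern, purely for illustration.
    Pattern urlPattern = Pattern.compile("https?://\\S+");
    Matcher m = urlPattern.matcher(
        "see http://lucene.apache.org/nutch/ and http://example.com/x");
    while (m.find()) {
      System.out.println(m.group());   // prints each match in turn
    }
  }
}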
What bug was that? What is your one-line fix?
http://www.nabble.com/RCP-known-limitation-or-bug--t688207.html
something like:
Object[] values = method.getReturnType() != null
    ? (Object[]) Array.newInstance(method.getReturnType(), wrappedValues.length)
    : new Object[0];
Matt Zytaruk wrote:
The newest src (as of this morning) of trunk is occasionally giving
ClassCastExceptions when doing a crawl, with parsing (and by
occasionally I mean this was the only page out of the small list I
crawled that it happened on). This is with nothing changed from
the def
Stefan Groschupf wrote:
Different parameters are sent to each address. So params.length
should equal addresses.length, and if params.length==0 then
addresses.length==0 and there's no call to be made. Make sense? It
might be clearer if the test were changed to addresses.length==0.
Yes, t
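A sketch of the invariant as I read it; the method and the callOne() helper below are made up, not the actual org.apache.nutch.ipc code. One parameter goes to each address, so an empty params array means there is nothing to send, and the test could just as well be on addresses.length.
// Illustrative only -- not the real Client.call() implementation.
public Writable[] call(Writable[] params, InetSocketAddress[] addresses)
    throws IOException {
  if (addresses.length == 0) {          // implies params.length == 0 as well
    return new Writable[0];             // no call to be made
  }
  Writable[] results = new Writable[addresses.length];
  for (int i = 0; i < addresses.length; i++) {
    results[i] = callOne(params[i], addresses[i]);   // hypothetical per-address RPC
  }
  return results;
}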
Ken Krugler wrote:
I'm wondering whether it would also make sense to remove the anchor
(fragment) part from URLs. For example, currently these two URLs are
treated as different:
http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex
and
http://www.dina.kvl.dk/~sestoft/gcsharp/index.html
Is it safe
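A self-contained sketch of the normalization Ken is asking about, i.e. dropping everything from the '#' on so the two URLs above compare equal; this is an illustration of the idea, not an existing Nutch normalizer.
public class FragmentStripper {
  // Return the URL without its fragment ("#..." part), if any.
  public static String strip(String url) {
    int hash = url.indexOf('#');
    return hash < 0 ? url : url.substring(0, hash);
  }

  public static void main(String[] args) {
    String a = strip("http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex");
    String b = strip("http://www.dina.kvl.dk/~sestoft/gcsharp/index.html");
    System.out.println(a.equals(b));   // prints true
  }
}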
[
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12361997 ]
KuroSaka TeruHiko commented on NUTCH-153:
-
Actually, shouldn't turning on the mime.type.magic property do the job that the
patch is trying to address?
> TextParser i
[
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12361995 ]
KuroSaka TeruHiko commented on NUTCH-153:
-
The strings command would work with mostly ASCII text content. It is highly
doubtful that we can have a universal strings comm
Chris Mattmann wrote:
I've tried removing the 5 copies of the comment, however I can't find a
button on JIRA to remove comments. Maybe an administrator for Nutch can do
it?
I removed the extra comments. No problem.
Doug
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361994 ]
Doug Cutting commented on NUTCH-139:
Jerome,
Some HTTP headers have multiple values. Correctly reflecting that was, I
thought, the primary motivation for adding multiple va
Hi folks,
I'm in the process of cleaning up my WhitelistURLFilter (NUTCH-87 on
JIRA), and I've got a question about working with
org.apache.nutch.io.MapFile.
I am parsing a textfile with one key/value pair per line. I want to
write this into a new MapFile. MapFile.Writer requires keys to
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
Doug Cutting updated NUTCH-139:
---
Comment: was deleted
> Standard metadata property names in the ParseData metadata
> --
>
> Key: NU
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
Doug Cutting updated NUTCH-139:
---
Comment: was deleted
> Standard metadata property names in the ParseData metadata
> --
>
> Key: NU
Jérôme Charron <[EMAIL PROTECTED]>
05.01.2006 23:03
Please reply to
nutch-dev@lucene.apache.org
To
nutch-dev@lucene.apache.org
Cc
Subject
Re: [VOTE] Commiter access for Stefan Groschupf
+1
>For me, it's 0
>I really like all Stefan's support efforts on mailing lists, all his
>brainsto
Jérôme Charron wrote:
A related issue is that these two plugins replicate a lot of code. At
some point we should try to fix that. See:
http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html
I have begun working on this. Is anybody else? Can I go on?
> > A related issue is that these two plugins replicate a lot of code. At
> > some point we should try to fix that. See:
> >
> >
> http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html
I have begun working on this. Is anybody else? Can I go on?
Jérôme
--
http://motrech.fr