Does maxSegments control the number of segments per level?
Can I be sure that if I have 1 million pages in a certain level, and
assuming I'm not setting the topN parameter (so it is set to its default, MAX LONG),
and I set maxSegments to 4, then for that level I'll have 4 segments of
250K pages each?
Or
On Mon, 7 May 2012 22:52:52 -0700 (PDT), "nutch.bu...@gmail.com"
wrote:
Yeah, I meant an unexpected failure that crashed the job, like OOM.
Regarding topN - the Nutch tutorial says:
"-topN N determines the maximum number of pages that will be retrieved at
each level up to the depth."
Does it m
Yeah, I meant an unexpected failure that crashed the job, like OOM.
Regarding topN - the Nutch tutorial says:
"-topN N determines the maximum number of pages that will be retrieved at
each level up to the depth."
Does it mean that when the limit is reached, no more URLs on this level will
be added
On Mon, 7 May 2012 22:31:43 -0700 (PDT), "nutch.bu...@gmail.com"
wrote:
In a previous discussion about handling of failures in Nutch, it was
mentioned that a broken segment cannot be fixed and its URLs should be
re-crawled.
Thus, it seems that there should be a way to control segment size, so
In a previous discussion about handling of failures in Nutch, it was
mentioned that a broken segment cannot be fixed and its URLs should be
re-crawled.
Thus, it seems that there should be a way to control segment size, so that
one can limit the risk of having to re-crawl a huge number of URLs if o
Hi,
I'm facing a similar issue with certificates. Can you please let me know if
you have solved this issue?
Regards,
Sidd
--
View this message in context:
http://lucene.472066.n3.nabble.com/https-authentication-tp2547192p3968230.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Hi,
I am facing a similar problem. Can you please give an example of how to
pass the trust-source and other details from the command prompt?
Regards,
Sidd
--
View this message in context:
http://lucene.472066.n3.nabble.com/Client-certificate-authentication-tp3209084p3968216.html
Sent from the Nutch - Use
Hi
This is not possible out of the box. You can, however, use the OutlinkExtractor
to find this link. But it is only invoked when the parser doesn't return
outlinks.
Cheers
On Monday 07 May 2012 08:19:19 Mohammad wrk wrote:
> Hi,
>
> Can Nutch be configured to consider the url in the following
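
As a rough illustration of the reply above, the fallback idea (scan the raw text for URLs only when the parser returned no outlinks) could be sketched as follows. This is not Nutch's implementation: the regular expression and the function names `extract_links_from_text` and `outlinks` are assumptions made up for this sketch, and the input is the snippet fragment quoted elsewhere in this thread.

```python
import re

# Hypothetical pattern (an assumption, not the one Nutch uses):
# an absolute http(s) URL that runs until whitespace, a quote,
# an angle bracket, or a parenthesis.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>()]+")


def extract_links_from_text(text):
    """Return the absolute URLs found in raw text, in order, without duplicates."""
    found = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(0)
        if url not in found:
            found.append(url)
    return found


def outlinks(parser_links, raw_text):
    """Use the parser's outlinks if any; otherwise fall back to text extraction."""
    return parser_links if parser_links else extract_links_from_text(raw_text)


if __name__ == "__main__":
    fragment = 'http://www.example.com/link";);">...'
    # The parser found nothing, so the text-based fallback is used.
    print(outlinks([], fragment))
```

The fallback runs only when `parser_links` is empty, which mirrors the behaviour described in the reply; if the parser returns even one outlink, the raw text is never scanned.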
Hi,
Has any of you worked on crawling HTTPS sites with a certificate? Please let
me know.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Re-Crawl-sites-with-hashtags-in-url-tp3954098p3968209.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Hi,
Can Nutch be configured to consider the URL in the following HTML snippet as a
link?
http://www.example.com/link";);">...
Thanks,
Mohammad
Hi,
When I merge indexes using Nutch's IndexMerger, I give as input some folders
that were created by the indexer and get as output the merged index.
The folders that are created by the indexer have this structure:
Indexes/part-0/
The output of the index merger is a folder with this structure:
Hi Remi,
Thank you so much for your reply. We have decided not to take any further
action on this matter, as it is not necessary anymore.
Still, I would like to thank you for your time!
Kind regards,
Roberto Gardenier
-Original message-
From: remi tassing [mailto:tassingr...@gmail
Hi Sebastian,
I have looked at the RFC and I'm convinced that I don't need to take any further
action on this issue, as this website is simply not following the rules.
Just like Twitter... but who cares.
It's not our problem anymore. Thank you so much for your reply!
Kind regards,
Roberto G