Re: Is it possible to control the segment size?

2012-05-07 Thread nutch.bu...@gmail.com
Does maxSegments control the number of segments per level? Do I know for sure that if I have 1 million pages in a certain level, and assuming I'm not setting the topN parameter (so it is set to the default, MAX LONG), and I set maxSegments to 4, then for that level I'll have 4 segments, each with 250K pages? Or

Re: Is it possible to control the segment size?

2012-05-07 Thread Markus Jelsma
On Mon, 7 May 2012 22:52:52 -0700 (PDT), "nutch.bu...@gmail.com" wrote: Yeah, I meant an unexpected failure that crashed the job, like OOM. Regarding topN - the Nutch tutorial says: "-topN N determines the maximum number of pages that will be retrieved at each level up to the depth." Does it m

Re: Is it possible to control the segment size?

2012-05-07 Thread nutch.bu...@gmail.com
Yeah, I meant an unexpected failure that crashed the job, like OOM. Regarding topN - the Nutch tutorial says: "-topN N determines the maximum number of pages that will be retrieved at each level up to the depth." Does it mean that when the limit is reached, no more URLs on this level will be added

Re: Is it possible to control the segment size?

2012-05-07 Thread Markus Jelsma
On Mon, 7 May 2012 22:31:43 -0700 (PDT), "nutch.bu...@gmail.com" wrote: In a previous discussion about handling of failures in Nutch, it was mentioned that a broken segment cannot be fixed and its URLs should be re-crawled. Thus, it seems that there should be a way to control segment size, so

Is it possible to control the segment size?

2012-05-07 Thread nutch.bu...@gmail.com
In a previous discussion about handling of failures in Nutch, it was mentioned that a broken segment cannot be fixed and its URLs should be re-crawled. Thus, it seems that there should be a way to control segment size, so that one can limit the risk of having to re-crawl a huge amount of URLs if o

Re: https authentication

2012-05-07 Thread Siddharth Jain
Hi, I'm facing a similar issue with certificates. Can you please let me know if you have solved this issue? Regards, Sidd -- View this message in context: http://lucene.472066.n3.nabble.com/https-authentication-tp2547192p3968230.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Client certificate authentication

2012-05-07 Thread Siddharth Jain
Hi, I am facing a similar problem. Can you please give an example of how to pass trust-source and other details from the command prompt? Regards, Sidd -- View this message in context: http://lucene.472066.n3.nabble.com/Client-certificate-authentication-tp3209084p3968216.html Sent from the Nutch - Use

Re: link without href

2012-05-07 Thread Markus Jelsma
Hi, This is not possible out of the box. You can, however, use the OutlinkExtractor to find this link. But it is only invoked when the parser doesn't return outlinks. Cheers On Monday 07 May 2012 08:19:19 Mohammad wrk wrote: > Hi, > > Can Nutch be configured to consider the url in the following

Re: Crawl sites with hashtags in url

2012-05-07 Thread Siddharth Jain
Hi, Has any of you worked on crawling HTTPS sites with certificates? Please let me know. -- View this message in context: http://lucene.472066.n3.nabble.com/Re-Crawl-sites-with-hashtags-in-url-tp3954098p3968209.html Sent from the Nutch - User mailing list archive at Nabble.com.

link without href

2012-05-07 Thread Mohammad wrk
Hi, Can Nutch be configured to consider the url in the following html snippet as a link? http://www.example.com/link";);">... Thanks, Mohammad

How do I merge indexes so that the "indexes" folder is merged as well?

2012-05-07 Thread nutch.bu...@gmail.com
Hi, When I merge indexes using Nutch's IndexMerger, I give as input some folders that were created by the indexer and get as output the merged index. The folders that are created by the indexer are of this structure: Indexes/part-0/ the output of the index merger is a folder of this structure:

RE: Crawl sites with hashtags in url

2012-05-07 Thread Roberto Gardenier
Hi Remi, Thank you so much for your reply. We have decided not to take any further action on this matter, as it is not necessary anymore. Still, I would like to thank you for your time! Kind regards, Roberto Gardenier -Original message- From: remi tassing [mailto:tassingr...@gmail

RE: Crawl sites with hashtags in url

2012-05-07 Thread Roberto Gardenier
Hi Sebastian, I have looked at the RFC and I'm convinced that I don't need to take any further action on this issue, as this website is simply not following the rules. Just like Twitter... but who cares. It's not our problem anymore; thank you so much for your reply! Kind regards, Roberto G