RE: [EXT] Re: protocol-foo: How to tell nutch about more URLs to fetch?

2017-09-29 Thread Hiran CHAUDHURI
Hello Sebastian, I extended the protocol-foo implementation so it can serve a small number of URLs. I believe you should no longer see issues with multiple parallel instances of PluginRepository, and protocol-foo should no longer throw exceptions, as it now comes with a better implementation.

RE: [EXT] Re: protocol-foo: How to tell nutch about more URLs to fetch?

2017-09-29 Thread Hiran CHAUDHURI
>> It would be something like a directory listing, or no directory listing but content.
> Have a look at the protocol-file plugin: it wraps a directory listing into an HTML page, similar to what the Apache web server does if there is no index.html in a directory.
Yep. I did it. The protocol-foo

Re: protocol-foo: How to tell nutch about more URLs to fetch?

2017-09-27 Thread Sebastian Nagel
Hi,
> It would be something like a directory listing, or no directory listing but content.
Have a look at the protocol-file plugin: it wraps a directory listing into an HTML page, similar to what the Apache web server does if there is no index.html in a directory.
> It is possible that a
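The idea of wrapping a directory listing into an HTML page can be sketched roughly as follows. This is only an illustration in Python, not the protocol-file plugin's actual code; the function name listing_to_html and the foo://something base URL are hypothetical. The point is that a listing rendered as plain links gives an HTML parser something to extract outlinks from.

```python
from html import escape
from urllib.parse import quote


def listing_to_html(base_url, entries):
    """Render directory entries as a minimal HTML index page (hypothetical helper)."""
    # Make sure relative names are appended below the directory URL.
    if not base_url.endswith("/"):
        base_url += "/"
    title = "Index of %s" % escape(base_url)
    lines = [
        "<html><head><title>%s</title></head><body>" % title,
        "<h1>%s</h1>" % title,
        "<ul>",
    ]
    for name in entries:
        # Percent-encode the entry name so the link is a valid URL.
        href = base_url + quote(name)
        lines.append(
            '<li><a href="%s">%s</a></li>' % (escape(href, quote=True), escape(name))
        )
    lines += ["</ul>", "</body></html>"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(listing_to_html("foo://something", ["a.txt", "sub dir/"]))
```

A URL that points at content rather than a directory would simply be returned as that content, with no generated index page.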

protocol-foo: How to tell nutch about more URLs to fetch?

2017-09-26 Thread Hiran CHAUDHURI
Hi there. While I am trying to create protocol-foo, an implementation for the example protocol with URLs like foo://something, I find it difficult to distinguish when to tell Nutch to search for more URLs and when not to. It would be something like a directory listing, or no directory

Nutch 2.X - Prefered urls to fetch

2013-11-16 Thread glumet
://kunsthalkade.nl - 157 docs http://velodrom.de - 232 docs Is it possible to tell Nutch to prefer some URLs? Or is it possible to tell Nutch to crawl all URLs equally? Thank you, Jan -- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-2-X-Prefered-urls-to-fetch

Preffered urls to fetch

2013-09-27 Thread glumet
in very small number. Is it possible to tell Nutch to specialize in or prefer some URLs? -- View this message in context: http://lucene.472066.n3.nabble.com/Preffered-urls-to-fetch-tp4092361.html Sent from the Nutch - User mailing list archive at Nabble.com.

how to solve No URLs to fetch - check your seed list and URL filters

2012-08-02 Thread veryblues_cn
rootUrlDir = urls
threads = 10
depth = 2
topN = 4
Injector: starting
Injector: crawlDb: mycrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-09 Thread Andy Xue
On 2 April 2012 22:59, jepse j...@jepse.net wrote:
> I have this site: http://www.soccer-forum.de/ When I put it into my browser, it's fine. No robots.txt, no redirect... but simply no URLs to fetch. Where can I find the reasons for not fetching? What do you mean by saying seed list

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-02 Thread jepse
Hey, I have the same problem: no URLs to fetch, for a couple of URLs. I have no clue how to fix that. Did you solve your problem in the meantime? Cheers, Philipp

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-02 Thread remi tassing
There could be a million reasons: seed, filter, authentication... maybe the pages are already crawled... Is there any clue in the log? Remi
On Mon, Apr 2, 2012 at 5:37 PM, jepse j...@jepse.net wrote:
> Hey, I have the same problem: no URLs to fetch, for a couple of URLs. I have no clue how to fix

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-02 Thread jepse
I have this site: http://www.soccer-forum.de/ When I put it into my browser, it's fine. No robots.txt, no redirect... but simply no URLs to fetch. Where can I find the reasons for not fetching? What do you mean by saying seed list? Is there a way to disable all filters at once?

no urls to fetch check your seed list and url filters nutch

2011-07-06 Thread serenity
Hello Friends, I am trying to crawl with Nutch 1.3 and am getting this message: "no urls to fetch check your seed list and url filters nutch". I understand that I need to make some changes in the configuration files, but I am not sure. Earlier, I worked with Nutch 0.9 and I was able to crawl without any

Re: no urls to fetch check your seed list and url filters nutch

2011-07-06 Thread Markus Jelsma
Friends, I am trying to crawl with Nutch 1.3 and am getting this message: *no urls to fetch check your seed list and url filters nutch*. I understand that I need to make some changes in the configuration files, but I am not sure. Earlier, I worked with Nutch 0.9 and I was able to crawl without any issues

Re: No more urls to fetch

2011-06-29 Thread tamanjit.bin...@yahoo.co.in
I forgot to mention that no changes were made to either crawl-urlfilter.txt or regex-urlfilter.txt between a successful crawl and a crawl with the message "no more urls to fetch".
rootUrlDir = urls/$folder/urls.txt
threads = 10
depth = 1
indexer=lucene
topN = 1500
Injector: starting
Injector

Re: No more urls to fetch

2011-06-29 Thread tamanjit.bin...@yahoo.co.in
the pages it has crawled. Is this the only solution? Experienced guys please revert back.

Fwd: No Urls to fetch

2011-06-14 Thread Hannes Carl Meyer
is correct but Nutch doesn't crawl anything. It says that there are *No Urls to Fetch - check your seed list and URL filters.* Am I missing something? Thanks, Hannes C. Meyer www.informera.de

No Urls to fetch

2011-06-13 Thread Adelaida Lejarazu
-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html so, I think the regular expression is correct but Nutch doesn't crawl anything. It says that there are *No Urls to Fetch - check your seed list and URL filters.* Am I missing something? Thanks,

Re: No Urls to fetch

2011-06-13 Thread lewis john mcgibbney
://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html so, I think the regular expression is correct but Nutch doesn't crawl anything. It says that there are *No Urls to Fetch - check your seed list and URL filters.* Am I missing something

Re: No Urls to fetch

2011-06-13 Thread Hannes Carl Meyer
so, I think the regular expression is correct but Nutch doesn't crawl anything. It says that there are *No Urls to Fetch - check your seed list and URL filters.* Am I missing something? Thanks, Hannes C. Meyer www.informera.de

Re: No Urls to fetch

2011-06-13 Thread Adelaida Lejarazu
be for example, http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html so, I think the regular expression is correct but Nutch doesn't crawl anything. It says that there are *No Urls to Fetch - check your seed list and URL filters.* Am I

Stopping at depth=1 - no more URLs to fetch.

2011-04-29 Thread Alex
with more data shows "Stopping at depth=1 - no more URLs to fetch". Do I need to change the Nutch settings for large sites? Thanks, Alex
Here are the logs of the indexing:
Stopping at depth=1 - no more URLs to fetch.
INFO sitesearch.CrawlerUtil: indexHost : Starting an Site

Re: No URLs to fetch - check your seed list and URL filters

2011-02-21 Thread Thomas Anderson
://wiki.apache.org/nutch/NutchHadoopTutorial. When I test crawling the URL http://lucene.apache.org as described in the tutorial, I keep getting `No URLs to fetch - check your seed list and URL filters.' The command used to crawl the sample website is
    bin/nutch crawl lucene.apache.org -dir

Re: No URLs to fetch - check your seed list and URL filters

2011-02-20 Thread Ibrahim Alkharashi
-9]*\.)*apache.org
Ibrahim
On Mon, 2011-02-21 at 13:16 +0800, Thomas Anderson wrote:
> I am learning to set up Nutch to crawl a website through http://wiki.apache.org/nutch/NutchHadoopTutorial. When I test crawling the URL http://lucene.apache.org as described in the tutorial, I keep getting `No URLs
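Since several of these threads come down to whether a seed URL survives the URL filters, here is a small Python sketch for trying seeds against filter rules offline. It is not Nutch code. It assumes the regex-urlfilter.txt convention as I understand it: each rule is a regular expression prefixed with + (accept) or - (reject), the first rule that matches anywhere in the URL decides, and a URL that matches no rule is dropped. Please verify this against the comments in your own regex-urlfilter.txt. The rules and the example.org host below are hypothetical, and Python's re module is not identical to Java's regex engine.

```python
import re

# Hypothetical rules in the assumed "+regex" / "-regex" format.
RULES = r"""
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# accept URLs on example.org and its subdomains (hypothetical host)
+^https?://([a-z0-9-]+\.)*example\.org/
"""


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; no match means the URL is dropped."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for seed in ("http://www.example.org/index.html",
                 "http://example.org",
                 "ftp://example.org/"):
        print(seed, "accepted" if accepts(rules, seed) else "rejected")
```

In this sketch the second seed is rejected only because the accept rule requires a trailing slash after the host, which is the kind of small mismatch worth checking when every seed appears to be filtered out.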