Hello Sebastian,
I extended the protocol-foo implementation so it can serve a small number of
URLs.
I believe you should no longer see issues with multiple parallel instances of
PluginRepository, and protocol-foo should no longer throw exceptions,
as it now comes with a better implementation.
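As a side note: a new protocol plugin only takes effect once it is listed in plugin.includes in nutch-site.xml. A minimal sketch (the value list here is illustrative, not taken from this thread):

```xml
<!-- nutch-site.xml: make sure protocol-foo is in the plugin list
     (surrounding plugin names are illustrative) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-foo|urlfilter-regex|parse-(html|tika)|index-basic|scoring-opic</value>
</property>
```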
>> It would be something like a directory listing, or no directory listing but
>> content.
>Have a look at the protocol-file plugin: it wraps a directory listing into an
>HTML page, similar to what the Apache web server does if there is no
>index.html in a directory.
Yep. I did it. The protocol-foo
Hi,
> It would be something like a directory listing, or no directory listing but
> content.
Have a look at the protocol-file plugin: it wraps a directory listing into an
HTML page, similar to what the Apache web server does if there is no index.html
in a directory.
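The idea described above can be sketched in plain Java (an illustrative stand-in, not the actual protocol-file source; class and method names are made up):

```java
import java.util.Arrays;

// Minimal sketch: render a directory listing as a simple HTML page,
// like Apache httpd does when a directory has no index.html.
public class DirListing {

    public static String toHtml(String dirName, String[] entries) {
        String[] sorted = entries.clone();
        Arrays.sort(sorted);
        StringBuilder sb = new StringBuilder();
        sb.append("<html><head><title>Index of ").append(dirName)
          .append("</title></head><body>\n");
        sb.append("<h1>Index of ").append(dirName).append("</h1>\n<ul>\n");
        for (String name : sorted) {
            // Each entry becomes a link, so a crawler can follow it as an outlink.
            sb.append("<li><a href=\"").append(name).append("\">")
              .append(name).append("</a></li>\n");
        }
        sb.append("</ul>\n</body></html>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtml("/demo", new String[] {"b.txt", "a.txt"}));
    }
}
```

The point of emitting each entry as a link is that the parser then sees the directory's contents as ordinary outlinks to fetch.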
> It is possible that a
Hi there.
While I am trying to create protocol-foo, an implementation for the example
protocol with URLs like foo://something, I see difficulty in distinguishing
when to tell Nutch to search for more URLs and when not to. It would be
something like a directory listing, or no directory
://kunsthalkade.nl - 157 docs
http://velodrom.de - 232 docs
Is it possible to tell Nutch to prefer some URLs? Or is it possible to say
that Nutch should crawl all URLs equally?
Thank you,
Jan
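One knob that addresses the "crawl all URLs equally" side of this question (an assumption about the intent; these are standard Nutch properties, values illustrative) is capping how many URLs per host enter each fetch cycle:

```xml
<!-- nutch-site.xml: cap generated URLs per host so one large site
     cannot dominate a fetch cycle (values illustrative) -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>200</value>
</property>
```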
--
View this message in context:
http://lucene.472066.n3.nabble.com/Preffered-urls-to-fetch-tp4092361.html
Sent from the Nutch - User mailing list archive at Nabble.com.
rootUrlDir = urls
threads = 10
depth = 2
topN = 4
Injector: starting
Injector: crawlDb: mycrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator
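For reference, a parameter dump like the one above maps onto a Nutch 1.x crawl invocation along these lines (the crawl directory name mycrawl is inferred from the crawlDb path shown; treat the exact invocation as an assumption):

```
bin/nutch crawl urls -dir mycrawl -threads 10 -depth 2 -topN 4
```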
On 2 April 2012 22:59, jepse j...@jepse.net wrote:
I have this site: http://www.soccer-forum.de/
When I put it into my browser it's fine: no robots.txt, no redirect... but
simply no URLs to fetch. Where can I find the reasons for not fetching?
What do you mean by "seed list"?
Hey,
I have the same problem: no URLs to fetch, for a couple of URLs. I have no clue
how to fix that. Did you solve your problem in the meantime?
Cheers, Philipp
--
View this message in context:
http://lucene.472066.n3.nabble.com/nutch-says-No-URLs-to-fetch-check-your-seed-list-and-URL-filters-when-trying
It could be one of a million reasons: seed, filters, authentication... maybe
the pages are already crawled.
Is there any clue in the log?
Remi
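Following up on the suggestion to check the log: with a default Nutch 1.x layout (an assumption; paths vary by setup), fetcher and filter messages end up in logs/hadoop.log, so something like this narrows it down:

```
# path assumes the default Nutch 1.x layout
grep -iE "robots|denied|rejected|filter" logs/hadoop.log | tail -n 20
```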
On Mon, Apr 2, 2012 at 5:37 PM, jepse j...@jepse.net wrote:
> Hey,
> I have the same problem: no URLs to fetch, for a couple of URLs. I have no clue
> how to fix
Is there a way to disable all filters at once?
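There is no single off switch, but a common debugging trick (a sketch, not something stated in this thread) is to temporarily make conf/regex-urlfilter.txt accept everything, then reintroduce rules one by one:

```
# conf/regex-urlfilter.txt - debugging only:
# comment out all other rules and keep one catch-all that accepts every URL.
+.
```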
Hello Friends,
I am trying to crawl with Nutch 1.3 and I am getting the message *no urls to
fetch - check your seed list and url filters*. I understand that I need to
make some changes in the configuration files, but I am not sure which. Earlier,
I worked with Nutch 0.9 and I was able to crawl without any issues.
I forgot to mention that no changes were made to either crawl-urlfilter.txt
or regex-urlfilter.txt between a successful crawl and a crawl with the
message "no more urls to fetch".
rootUrlDir = urls/$folder/urls.txt
threads = 10
depth = 1
indexer=lucene
topN = 1500
Injector: starting
Injector
the pages it has crawled.
Is this the only solution? Experienced folks, please reply.
be for example,
http://www.elcorreo.com/vizcaya/20110613/mas-actualidad/politica/lopez-consta-pactado-bildu-201106131023.html
so, I think the regular expression is correct, but Nutch doesn't crawl
anything. It says that there are *No Urls to Fetch - check your seed list
and URL filters.*
Am I missing something?
Thanks,
Hannes C. Meyer
www.informera.de
with more data shows the "Stopping at depth=1 - no more URLs to fetch"
message.
Do I need to change the settings of Nutch for large sites?
Thanks,
Alex
Here are the logs of the indexing:
Stopping at depth=1 - no more URLs to fetch.
INFO sitesearch.CrawlerUtil: indexHost : Starting an Site
://wiki.apache.org/nutch/NutchHadoopTutorial. When testing to
crawl the url http://lucene.apache.org as described in tutorial, I
keep getting `No URLs to fetch - check your seed list and URL
filters.'
The command used to crawl the sample website is
bin/nutch crawl lucene.apache.org -dir
-9]*\.)*apache.org
Ibrahim
On Mon, 2011-02-21 at 13:16 +0800, Thomas Anderson wrote:
> I am learning to set up Nutch to crawl a website through
> http://wiki.apache.org/nutch/NutchHadoopTutorial. When testing a
> crawl of the URL http://lucene.apache.org as described in the tutorial, I
> keep getting `No URLs
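One likely culprit (an assumption based on the truncated command above, not a confirmed diagnosis): Nutch 1.x's crawl command expects its first argument to be a directory of seed files, not a bare hostname. A sketch of the usual setup, with illustrative file names:

```shell
# Create a seed directory containing one URL per line.
mkdir -p urls
echo "http://lucene.apache.org/" > urls/seed.txt

# The crawl would then be run against that directory (shown as a
# comment here because it needs a Nutch installation):
# bin/nutch crawl urls -dir crawl -depth 3 -topN 50
```

The conf/regex-urlfilter.txt rules also have to accept apache.org URLs, or the seeds are filtered out before anything is fetched.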