In a parsing fetcher, if I recall correctly, outlinks are processed in the mapper (at
least when outlinks are followed). If a fetcher's reducer stalls, you may run out of
memory or disk space.
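For reference, parsing inside the fetcher is toggled by the fetcher.parse property; a minimal nutch-site.xml sketch (property name from the stock Nutch configuration):

```xml
<!-- Enable parsing inside the fetcher so outlinks are
     extracted (and can be followed) during the map phase. -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
</property>
```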
-Original message-
From:kaveh minooie ka...@plutoz.com
Sent: Wed 13-Jun-2012 19:28
To: user@nutch.apache.org
Hi
This CrawlDatum's FetchTime is tomorrow in EST
Fetch time: Tue Jun 12 02:59:27 EST 2012
-Original message-
From:Andy Xue andyxuey...@gmail.com
Sent: Mon 11-Jun-2012 11:00
To: user@nutch.apache.org
Subject: Generator: 0 records selected for fetching, exiting ...
Hi all:
Hi,
Nutch will fetch URLs from hosts without a robots.txt, but if fetching robots.txt
throws an UnknownHostException, the URL will fail with the same exception.
Cheers
-Original message-
From:chethan chethan.p...@gmail.com
Sent: Thu 07-Jun-2012 16:16
To: user@nutch.apache.org
Subject: robots.txt
Great work Lewis, Chris, committers and contributors!
Thanks all!
-Original message-
From:lewis john mcgibbney lewi...@apache.org
Sent: Thu 07-Jun-2012 19:01
To: annou...@apache.org; d...@nutch.apache.org; user@nutch.apache.org
Subject: [ANNOUNCE] Apache Nutch 1.5 Released
Hello!
Sounds very interesting. Anyway, Solr can run embedded in a Java application via
EmbeddedSolrServer. You do need to make some changes to the SolrIndexer
tools in Nutch.
Cheers
-Original message-
From:Emre Çelikten e...@celikten.name
Sent: Thu 07-Jun-2012 22:24
To:
-Original message-
From:Matthias Paul magethle.nu...@gmail.com
Sent: Wed 06-Jun-2012 09:47
To: user@nutch.apache.org
Subject: Linkdb empty
Hi all,
hi
I noticed that my linkdb is always empty although I use the generated
segments from the last crawl for the generation of the
-Original message-
From:chethan chethan.p...@gmail.com
Sent: Wed 06-Jun-2012 05:12
To: user@nutch.apache.org
Subject: Nutch topN selection
Hi,
hi
Does the topN threshold consider page score for the selection? If it's set
to, say, 10, does Nutch queue up the 10 top-scoring URLs
-Original message-
From:Andy Xue andyxuey...@gmail.com
Sent: Wed 06-Jun-2012 05:04
To: user@nutch.apache.org
Subject: Behaviour of "urlfilter-suffix" plug-in when dealing with
a URL without filename extension
Hi all:
hi
Does the urlfilter-suffix plug-in prune URL
-Original message-
From:pepe3059 pepe3...@gmail.com
Sent: Wed 06-Jun-2012 02:58
To: user@nutch.apache.org
Subject: RE: threads diminution when fetching page
me again :)
at the end of fetch process, is the regex-urlfilter considered?
No. At the end of the fetch the mapper
-Original message-
From:Andy Xue andyxuey...@gmail.com
Sent: Wed 06-Jun-2012 11:11
To: Markus Jelsma markus.jel...@openindex.io; user@nutch.apache.org
Subject: Re: Behaviour of "urlfilter-suffix" plug-in when dealing
with a URL without filename extension
Hi Markus:
hi
What's the problem with having the seed page? Can you not inject only the /news
pages? Anyway, you can always filter it out later, after the first fetch cycle.
-Original message-
From:Shameema Umer shem...@gmail.com
Sent: Wed 06-Jun-2012 13:02
To: user@nutch.apache.org
Subject:
Hi
Nutch cannot do this by default, and it is tricky to implement because there may not
be one unique referrer per page. What you can try is to add the referrer to
outlinks when parsing records. This referrer can be added to the CrawlDatum's
MetaData, which you can then later use to set the referrer. To set
Hi,
The generator can only do it the other way around, via the addDays parameter. To
make it work your way you can modify the generator to restrict it to documents
younger than 48 hours.
Cheers
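For illustration, the addDays option mentioned above is passed on the generate command line; the paths here are assumptions:

```
bin/nutch generate crawl/crawldb crawl/segments -adddays 2
```

-adddays shifts the generator's notion of the current time forward, so records due within the next two days are selected.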
-Original message-
From:Shameema Umer shem...@gmail.com
Sent: Mon 04-Jun-2012 08:33
To:
This is normal and means the fetcher is finishing all its input URLs and
writing stuff to disk.
-Original message-
From:pepe3059 pepe3...@gmail.com
Sent: Sat 02-Jun-2012 22:15
To: user@nutch.apache.org
Subject: threads diminution when fetching page
Hello, i hope you can help
-Original message-
From:pepe3059 pepe3...@gmail.com
Sent: Mon 04-Jun-2012 20:42
To: user@nutch.apache.org
Subject: RE: threads diminution when fetching page
thank you for your answer Markus
Hi
you mean, until the fetch process finishes, is information stored using hdfs
by
Hi,
That's a patch for the fetcher. The error you are seeing is actually quite simple.
Because you set those two link.ignore parameters to true, no links between the
same domain and host are aggregated; only links from/to external hosts and
domains are. This is a good setting for wide web crawls.
and
link.ignore.limit.domain to false and the link.ignore.internal.xxx can be
set to true? Or should I just set all of the link.ignore.xxx.xxx values
to false?
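For reference, a sketch of the WebGraph link-handling properties being discussed (check nutch-default.xml for the exact defaults in your version):

```xml
<property>
  <name>link.ignore.internal.host</name>
  <value>true</value> <!-- drop links within the same host -->
</property>
<property>
  <name>link.ignore.internal.domain</name>
  <value>true</value> <!-- drop links within the same domain -->
</property>
<property>
  <name>link.ignore.limit.host</name>
  <value>true</value>
</property>
<property>
  <name>link.ignore.limit.domain</name>
  <value>true</value>
</property>
```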
On 5/29/2012 4:43 PM, Markus Jelsma wrote:
Hi,
That's a patch for the fetcher. The error you are seeing is quite simple
actually. Because you set
<value>com.custom.CustomEventFetchScheduler</value>
</property>
How do I include my custom logic so that it gets picked as a part of the
crawl cycle.
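Assuming the custom class extends Nutch's AbstractFetchSchedule and is on the job classpath, it is typically wired in through db.fetch.schedule.class in nutch-site.xml (the class name below is the poster's own example):

```xml
<property>
  <name>db.fetch.schedule.class</name>
  <value>com.custom.CustomEventFetchScheduler</value>
</property>
```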
Regards | Vikas
On Mon, May 21, 2012 at 6:14 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
Yes, you can pass ParseMeta keys to the FetchSchedule as part
Hi,
Yes, this is no problem.
Cheers
-Original message-
From:Dustine Rene Bernasor dust...@thecyberguardian.com
Sent: Thu 24-May-2012 12:58
To: user@nutch.apache.org
Subject: Multiple nutch jobs on a Hadoop cluster simultaneously
Hello
I was wondering, would it be possible to
Please read the description.
-Original message-
From:Tolga to...@ozses.net
Sent: Tue 22-May-2012 11:37
To: user@nutch.apache.org
Subject: Re: PDF not crawled/indexed
What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
-Original message-
From:Bai Shen baishen.li...@gmail.com
Sent: Tue 22-May-2012 19:40
To: user@nutch.apache.org
Subject: URL filtering and normalization
Somehow my crawler started fetching youtube. I'm not really sure why as I
have db.ignore.external.links set to true.
Weird!
Great!
My +1 for a new release based on the state of the codebase.
-Original message-
From:Julien Nioche lists.digitalpeb...@gmail.com
Sent: Tue 22-May-2012 22:19
To: d...@nutch.apache.org
Cc: user@nutch.apache.org
Subject: Re: Apache Nutch release 1.5 RC2
Read
Yes, you can pass ParseMeta keys to the FetchSchedule as part of the
CrawlDatum's meta data as i did with:
https://issues.apache.org/jira/browse/NUTCH-1024
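The mechanism from NUTCH-1024 is configured with the db.parsemeta.to.crawldb property; a sketch, where the key name is purely illustrative:

```xml
<!-- Comma-separated parse-metadata keys to copy into the
     CrawlDatum, making them visible to the FetchSchedule. -->
<property>
  <name>db.parsemeta.to.crawldb</name>
  <value>my-schedule-key</value>
</property>
```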
-Original message-
From:Vikas Hazrati vi...@knoldus.com
Sent: Mon 21-May-2012 13:44
To: user@nutch.apache.org
Subject:
Hi
Which version do you use? It should list the troubling URL. What's the stack
trace?
Cheers
-Original message-
From:Ing. Eyeris Rodriguez Rueda eru...@uci.cu
Sent: Mon 21-May-2012 17:07
To: user@nutch.apache.org
Subject: error parsing some xml
Hi all.
When I try to crawl
)
-Original message-
From: Markus Jelsma markus.jel...@openindex.io
To: user@nutch.apache.org
Sent: Monday, 21 May 2012 11:41:40
Subject: RE: error parsing some xml
Hi
Which version do you use? It should list the troubling URL
-Original message-
From:Matthias Paul magethle.nu...@gmail.com
Sent: Fri 18-May-2012 14:57
To: user@nutch.apache.org
Subject: Exclude certain mime-types
How can I exclude certain mime-types from crawling, for example Word documents?
If I have parse-tika in plugin.includes it
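One possible approach (an assumption, not necessarily the only way) is the urlfilter-suffix plugin: add it to plugin.includes and list the extensions to reject in suffix-urlfilter.txt, for example:

```
# suffix-urlfilter.txt sketch: depending on the filter mode set
# at the top of the file, URLs ending in these suffixes are
# rejected before fetching
.doc
.docx
```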
] Apache Nutch 1.5 release rc #1
When will Nutch 1.5 be released?
Matthias
On Wed, Apr 18, 2012 at 1:46 PM, Bharat Goyal bharat.go...@shiksha.com
wrote:
+1
On Monday 16 April 2012 12:34 PM, Markus Jelsma wrote:
+1
On Mon, 16 Apr 2012 05:43:22 +0000, Mattmann, Chris
yes
On Tuesday 15 May 2012 12:45:28 Taeseong Kim wrote:
Is whole web content download possible, including Flash, images, CSS and
JavaScript? Thank you for your help.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type-text-javascript-tp3983599p3983627.html
Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex
/11/12 9:40 AM, Markus Jelsma wrote:
Ah, that means: don't use the crawl command; do a little shell
scripting to execute the separate crawl cycle commands, see the Nutch
wiki for examples. And don't do solrdedup. Search the Solr wiki for
deduplication.
cheers
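The separate cycle commands can be scripted roughly like this (a sketch: paths, topN, iteration count, and the Solr URL are all assumptions, and exact tool arguments vary by Nutch version):

```shell
#!/bin/sh
# Sketch of a Nutch 1.x crawl loop without the 'crawl' wrapper.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments

for i in 1 2 3; do
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 1000
  SEGMENT=$SEGMENTS/$(ls $SEGMENTS | tail -1)   # newest segment
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb $CRAWLDB $SEGMENT
done
bin/nutch invertlinks crawl/linkdb -dir $SEGMENTS
bin/nutch solrindex http://localhost:8983/solr $CRAWLDB -linkdb crawl/linkdb $SEGMENTS/*
```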
On Fri, 11 May 2012 07
?
Matthias
On Thu, May 10, 2012 at 8:39 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
By default each crawl is iterative. The crawl command is nothing more
than a wrapper around the individual crawl cycle commands. The depth
parameter is nothing
http://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type-text-javascript-tp3983599.html
--
Markus Jelsma - CTO - Openindex
in fact it uses much less memory than it can.
Any idea?
--
View this message in context:
http://lucene.472066.n3.nabble.com/Heap-space-problem-when-running-nutch-on-cluster-tp3983561.html
--
Markus Jelsma - CTO - Openindex
, Markus Jelsma wrote:
thanks
This is a known issue:
https://issues.apache.org/jira/browse/NUTCH-1100
I have not been able to find the bug, nor do I know how to reproduce it
from scratch. If you have a public site with which we can reproduce it,
please comment on the Jira ticket. Make sure you use
mode.
Also I want some urls filtered by my urlfilter to be stored in
an
external
flat file. How can I achieve this.
--
*Thanks Regards*
*
*
*Vijith V*
--
Markus Jelsma
is mentioned. Tried to upgrade to hadoop-core-0.20.203.0.jar but then this is
thrown:
Exception in thread main java.lang.NoClassDefFoundError:
org/apache/commons/configuration/Configuration
Can someone, please, shed some light on this?
Thanks.
Igor
--
Markus Jelsma - CTO - Openindex
and there is plenty of free
space.
All the best,
Igor
On Thu, May 10, 2012 at 10:35 AM, Markus Jelsma wrote:
Plenty of disk space does not mean you have enough room in your
hadoop.tmp.dir which is /tmp by default.
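Pointing hadoop.tmp.dir at a larger volume is a one-property change; the path below is only an example:

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>
```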
On Thu, 10 May 2012 10:26:00 +0200, Igor Salma wrote:
Hi, Adriana, Sebastian,
We
--
Markus Jelsma - CTO - Openindex
--
Markus Jelsma - CTO - Openindex
hi
On Thursday 10 May 2012 15:19:09 Vikas Hazrati wrote:
Hi Markus,
Thanks for your response. My responses inline
On Thu, May 10, 2012 at 12:34 AM, Markus Jelsma
markus.jel...@openindex.iowrote:
hi
On Thu, 10 May 2012 00:26:40 +0530, Vikas Hazrati vi...@knoldus.com
wrote
should upgrade accordingly in trunk.
Thanks
Lewis
On Thu, May 10, 2012 at 1:56 PM, Michael Erickson
erickson.mich...@gmail.com wrote:
On May 10, 2012, at 1:42 AM, Markus Jelsma wrote:
Hi,
On Thu, 10 May 2012 09:10:04 +0300, Tolga to...@ozses.net wrote:
Hi
to
work?
Thanks
Matthias
--
Markus Jelsma - CTO - Openindex
, it works similarly and uses the same signature algorithm as Nutch.
Please consult the Solr wiki page on deduplication.
Good luck
On Thu, 10 May 2012 22:54:37 +0300, Tolga to...@ozses.net wrote:
Hi Markus,
On 05/10/2012 09:42 AM, Markus Jelsma wrote:
Hi,
On Thu, 10 May 2012 09:10:04 +0300
--
Markus
...@gmail.com
[1] http://www8.org/w8-papers/5a-search-query/crawling/
[2] http://www.cse.iitb.ac.in/~soumen/focus/
[3]
http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html
--
Markus Jelsma - CTO - Openindex
that
CrawlDB
would not allow duplicate links to get inside it?
What link deduplication do you mean? CrawlDB records have a unique key
on the URL.
Regards | Vikas
www.knoldus.com
--
Markus Jelsma - CTO - Openindex
--
View this message in context:
http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html
--
Markus Jelsma - CTO - Openindex
many segments of ~N records
are generated.
Markus Jelsma-2 wrote
On Mon, 7 May 2012 22:31:43 -0700 (PDT), nutch.buddy@
nutch.buddy@ wrote:
In a previous discussion about handling of failures in Nutch, it was
mentioned that a broken segment cannot be fixed and its URLs should be re
Hi
Nutch should parse an HTML file with a .txt extension just like a normal
HTML file; at least, it does here. What does your parserchecker say? In
any case, you must strip potential left-over HTML in your Solr analyzer;
if left as-is, it's a bad XSS vulnerability.
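On the Solr side, this stripping is usually done with a char filter in the field type's analyzer chain, roughly (a schema.xml sketch; the field type name is made up):

```xml
<!-- Strip HTML markup from field values before tokenizing. -->
<fieldType name="text_stripped" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```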
Cheers
On Tue, 8 May
/?page=2633pid=1043ELEsite=191;1;db_unfetched;Tue
May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT
1970;0;2592000.0;30.0;500.0;null
Notice the URL starts with an L? (Thus not matching http/https in
another config). Is this some problem with the regex above?
Regards,
Dean Pullen
--
Markus Jelsma
a custom URL Normalizer to get this to work.
But why? It doesn't seem alright.
On Tue, 08 May 2012 14:46:14 +0200, Markus Jelsma
markus.jel...@openindex.io wrote:
I'm not sure this is going to work as a lowercase flag is used on the
regular expressions.
On Tue, 08 May 2012 13:37:47 +0100, Dean Pullen
html snippet
as a link?
<tr onclick="clickOnLink('http://www.example.com/link');">...</tr>
Thanks,
Mohammad
--
Markus Jelsma - CTO - Openindex
.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
Hi,
This is a tough problem indeed. We partially mitigate this problem by
using several regular expressions, linkrank scores with domain limiting
generator for regular crawls and a second shallow crawl, only following
links from the home page.
A custom URLFilter as Ferdy explains is a good
of
reducers, or slightly increase the host or domain limit value.
On Thu, 26 Apr 2012 21:02:58 +0200, Markus Jelsma
markus.jel...@openindex.io wrote:
Hi,
We sometimes see the generator running OOM. This happens because we
either have a too high topN value or too many segments to generate.
In
any case
of that command I don't
see any keywords or description fields :( just the usual ones
(site,title,content,etc).
Am I missing something here?
Also let me know if you need more details or my nutch-site.xml
config file...
Regards
--
Markus Jelsma - CTO - Openindex
to
an indexed document.
From: Markus Jelsma markus.jel...@openindex.io
To: ML mail mlnos...@yahoo.com
Cc: Lewis John Mcgibbney lewis.mcgibb...@gmail.com; user@nutch.apache.org
Sent: Thursday, May 3, 2012 9:32 AM
Subject: Re: Indexing meta tags in Nutch 1.4
FUTURE, CONNECTED TO THE REVOLUTION
http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci
--
Markus Jelsma - CTO - Openindex
.
With kind regard,
Roberto Gardenier
--
Markus Jelsma - CTO - Openindex
Do you have running task trackers and data nodes? Which Nutch job did
you start? Any custom code?
Check the logs of the four Hadoop daemons; there may be something
there.
On Tue, 01 May 2012 16:26:31 +0100, Dean Pullen
dean.pul...@semantico.com wrote:
Hi all,
If this is definitely a
of Nutch info on the web...
http://wiki.apache.org/nutch/
http://wiki.apache.org/nutch/PluginCentral
hth
Lewis
--
Lewis
--
Markus Jelsma - CTO - Openindex
Hi,
We sometimes see the generator running OOM. This happens because we
either have a too high topN value or too many segments to generate. In
any case, a very large amount of records is being generated with the
same (lowest) score and end up in a single reducer. We limit the
generator by
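The per-host/domain limiting referred to here maps to properties like these (a sketch; values are examples, see nutch-default.xml for the exact semantics):

```xml
<property>
  <name>generate.max.count</name>
  <value>1000</value> <!-- max records per host/domain per segment; -1 = unlimited -->
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value> <!-- or "domain" -->
</property>
```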
of monickr: http://monickr.com [3]
01926 813736 | 07973 156616
Links:
--
[1] http://[domain]/solr/
[2] http://www.tellura.co.uk/
[3] http://monickr.com/
--
Markus Jelsma - CTO - Openindex
On Sat, 21 Apr 2012 17:44:49 -0700 (PDT), benmccann
benjamin.j.mcc...@gmail.com wrote:
Hi,
I have a few questions about getting started. Is there a good
tutorial
anywhere?
Questions I have:
* How do I restrict the crawling or saving of pages to only those
matching
certain regexes?
With
the status
in the Hadoop web gui.
I'm doing a local crawl. Does this mean the Hadoop web gui is
unavailable? Is there anyway to check status of a local crawl?
What's the
URL for the hadoop web gui?
Thanks!
-Ben
On Sun, Apr 22, 2012 at 7:33 AM, Markus Jelsma-2 [via Lucene]
ml-node
Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
--
Markus Jelsma - CTO - Openindex
.
On Sun, Apr 15, 2012 at 10:46 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
This error?
[javac] warning: [path] bad path element
/home/markus/projects/apache/nutch/trunk/build/lib/activation.jar: no such file or directory
On Sun, 15 Apr 2012 20:42:42 +0100, Lewis John Mcgibbney
an OutlinkDB can
make a mess out of itself? Should we enforce uniqueness in the meantime?
On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma
markus.jel...@openindex.io wrote:
Hi,
Recently a reducer got killed because of this. Increasing heap did
work but the next job some days later also failed. I
Will provide a patch tomorrow.
https://issues.apache.org/jira/browse/NUTCH-1335
On Mon, 16 Apr 2012 20:19:46 +0200, Markus Jelsma
markus.jel...@openindex.io wrote:
It seems a single URL has about half a million outlinks connected to
it in the OutlinkDB! A pattern of 50 URLs repeats some 100,000
The CrawlDB is not a suitable data source but the WebGraph's NodeDB is.
You could probably write a new MR tool reading the NodeDB and outputting
data in a format such a visualization tool understands.
I think the only real problem would be the size of the data.
On Sun, 15 Apr 2012 12:43:57
, but I
don't see a nodedb folder.
Thanks in advance.
Safdar
On Sun, Apr 15, 2012 at 4:17 PM, Markus Jelsma wrote:
The CrawlDB is not a suitable data source but the WebGraph's NodeDB
is. You could probably write a new MR tool reading the NodeDB and
outputting data in a format
This error?
[javac] warning: [path] bad path element
/home/markus/projects/apache/nutch/trunk/build/lib/activation.jar: no
such file or directory
On Sun, 15 Apr 2012 20:42:42 +0100, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi,
Whilst doing some testing on Nutchgora within
this
functionality?
Best regards,
--Anders Rask
www.findwise.com
--
Markus Jelsma - CTO - Openindex
in order to recrawl sites then the total number of
URLs that
are crawled for one site will not be limited by the
generate.max.count
parameter. Am I right?
Best regards,
--Anders Rask
www.findwise.com
Den 11 april 2012 17:14 skrev Markus Jelsma
markus.jel...@openindex.io:
Check these properties
somewhere? We have had
this URL for a longer time and it happily passed all jobs many times
before.
On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma
markus.jel...@openindex.io wrote:
Hi,
Recently a reducer got killed because of this. Increasing heap did
work but the next job some days later also
Debugging this with a stand-alone Tika would certainly make things
easier. There may be an issue in Tika or even in the parser
implementation itself.
On Wed, 11 Apr 2012 09:37:04 -0700 (PDT), nutch.bu...@gmail.com
nutch.bu...@gmail.com wrote:
I'm running nutch on large xlsx file (100-150mb),
hi,
On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), nutch.bu...@gmail.com
nutch.bu...@gmail.com wrote:
Hi
There are some scenarios of failure in nutch which I'm not sure how
to
handle.
1. I run Nutch on a huge amount of URLs and some kind of OOM
exception is
thrown, or one of those cannot
input file.
Any other insights on these issues will be appreciated
Markus Jelsma-2 wrote
hi,
On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), nutch.buddy@
nutch.buddy@ wrote:
Hi
There are some scenarios of failure in nutch which I'm not sure how
to
handle.
1. I run nutch on a huge amount of urls
Hi,
Recently a reducer got killed because of this. Increasing heap did work
but the next job some days later also failed. I looked at the code and i
cannot seem to find why it would take more than 400MB of RAM to process
outlinks of a single record. We do limit outlinks so the HashSets pages
,
Anastasia
--
View this message in context:
http://lucene.472066.n3.nabble.com/Class-in-the-code-that-handles-parsing-of-html-files-and-selection-of-URLs-tp3890250p3890250.html
--
Markus Jelsma - CTO - Openindex
like a result.
How can I skip this row during my crawl? Is it possible to exclude this
row?
Thank you in advance
alessio
--
*Lewis*
--
Markus Jelsma - CTO - Openindex
manually set the recrawl interval or the crawl
date, or any other explicit way to make Nutch invalidate a page?
We have got 70k+ pages in the index and a full recrawl would take too
long.
Thanks
Jan
--
Markus Jelsma - CTO - Openindex
http://lucene.472066.n3.nabble.com/Relative-urls-interpage-href-anchors-tp3861215p3861215.html
--
Markus Jelsma - CTO - Openindex
http
be the command to do that?
--
View this message in context:
http://lucene.472066.n3.nabble.com/Out-of-the-box-Nutch-indexing-url-sourc
e-to-Solr-tp3855918p3855918.html Sent from the Nutch - User mailing list
archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex
mapred.JobClient: Map output records=5
Regards
Andy
--
Markus Jelsma - CTO - Openindex
, but nothing happened.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Generator-taking-time-tp3848106p3848158
.html Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex
a database that could potentially be locked at any point
in time?
Thanks!
--
View this message in context:
http://lucene.472066.n3.nabble.com/crawl-and-update-one-url-already-in-cra
wldb-tp3848358p3848358.html Sent from the Nutch - User mailing list archive
at Nabble.com.
--
Markus Jelsma
scripting and locking horror and it's an I/O
consumer.
--
View this message in context:
http://lucene.472066.n3.nabble.com/crawl-and-update-one-url-already-in-cra
wldb-tp3848358p3848423.html Sent from the Nutch - User mailing list archive
at Nabble.com.
--
Markus Jelsma - CTO - Openindex
http://lucene.472066.n3.nabble.com/crawl-and-update-one-url-already-in-crawldb-tp3848358p3848665.html
--
Markus Jelsma - CTO - Openindex
This is not supported by Nutch and there's no issue ticket yet. Feel
free to open one.
On Thu, 22 Mar 2012 14:32:26 -0500, thomas.j.lut...@wellsfargo.com
wrote:
Ran across a posting for the Nutch roadmap mentioning support for the
canonical tag.
to alter these settings to point to the non-default Hadoop?
Regards,
Dean.
--
Markus Jelsma - CTO - Openindex
--
Markus Jelsma - CTO - Openindex
dean.pul...@semantico.com wrote:
Thanks for your reply.
I understand what you've said, but how does Nutch know where the
Hadoop jobtracker is running?
Regards,
Dean.
On 20/03/2012 11:03, Markus Jelsma wrote:
This is not a Nutch thing. A Nutch job, any job, is submitted to the
Hadoop Jobtracker
/urls-won-t-get-crawled-tp3650610p3842066.html
--
Markus Jelsma - CTO - Openindex
.
--
Markus Jelsma - CTO - Openindex
of the great technologies.
We would really appreciate feedback as there will undoubtedly be some
errors or data missing.
Thanks
Lewis
[0] http://wiki.apache.org/nutch/NutchHadoopTutorial
--
Markus Jelsma - CTO - Openindex
.
--
Markus Jelsma - CTO - Openindex
this in a larger setup.
thanks !
pvremort
--
Markus Jelsma - CTO - Openindex
http://lucene.472066.n3.nabble.com/Incompatible-format-version-2-expected-1-or-lower-tp3796473p3796473.html
--
Markus Jelsma - CTO - Openindex
records restricted by status:
generate -Dgenerate.restrict.status=status
Thanks.
Alex.
-Original Message-
From: Markus Jelsma
To: user
Cc: nutch-user
Sent: Thu, Mar 1, 2012 10:30 pm
Subject: Re: different fetch interval for each depth urls
Well, you could set a new default fetch