See https://issues.apache.org/jira/browse/NUTCH-1598
-Original message-
> From:Amit Sela
> Sent: Thursday 15th August 2013 11:19
> To: user@nutch.apache.org
> Subject: Nutch 1.7 and ElasticSearch
>
> Hi all,
>
> I want to setup nutch 1.7 with ElasticSearch 0.90.3 as indexer.
> From
indexing
>
> more soon
>
> ps: my next question will be how to script this, those Hadoop command lines
> are doing my head in
>
>
> On Wed, Aug 14, 2013 at 12:48 PM, Markus Jelsma
> wrote:
>
> > Also, the webgraph is not part of indexing. That just has a ScoreU
/opt/nutch/apache-nutch-1.7/build/apache-nutch-1.7.job
> > > org.apache.nutch.indexer.solr.SolrIndexer -solr
> > > http://solr.server.tld:8088/solr/core1/ /user/crawl-1.7-1/crawldb
> > -linkdb
> > > /user/crawl-1.7-1/linkdb -dir /user/crawl-1.7-1/segments
> >
> >
> > On Wed, Aug 14, 2013 at 11:33 AM, Marku
That's right. Check NUTCH-1047, that is what changed:
https://issues.apache.org/jira/browse/NUTCH-1047
-Original message-
> From:Nicholas Roberts
> Sent: Wednesday 14th August 2013 20:11
> To: user@nutch.apache.org
> Subject: Nutch 1.7 on Hadoop Exception in thread "main"
> java.lang.Cl
moval
>
> I didn't test this time around, but I think I did do testing before...
>
> Is there anything that could possibly go wrong? Anything else I can do?
>
>
> On Wed, Aug 7, 2013 at 11:30 AM, Markus Jelsma
> wrote:
>
> > You are sure the patch works? You get differen
You are sure the patch works? You get different text output with
tika.use_boilerpipe enabled and disabled?
-Original message-
> From:Joe Zhang
> Sent: Wednesday 7th August 2013 20:12
> To: user
> Subject: Boilerplate removal
>
> I'm having the following in my nutch-site.xml. Yet the
Hi,
Why move the files if you could just point the webroot to that directory?
Anyway, what's the issue with the regex filter? I think having
+FW13-sample-docs as a rule would pass any URL below that level. I also think a
seed file can contain a space, so there is no need to escape the spaces. The metadata
Yes. Depth actually means: run N crawl cycles, or rounds.
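A hedged sketch of what one such cycle (round) looks like with the individual 1.x commands; paths and the -topN value are illustrative:

```sh
# One crawl cycle; "depth N" runs this sequence N times.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=crawl/segments/$(ls crawl/segments | tail -1)   # newest segment
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"
```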
-Original message-
> From:devang pandey
> Sent: Wednesday 7th August 2013 11:09
> To: user@nutch.apache.org
> Subject: nutch relation between depth parameter and segment
>
> hello ,
> I am new to nutch , I have one question that i
please guide me on how to exactly
> use these commands to read webgraph.
>
>
> On Thu, Aug 1, 2013 at 2:56 PM, Markus Jelsma
> wrote:
>
> > There are reader and dumper tools you can use, see:
> >
> > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache
There are reader and dumper tools you can use, see:
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/
-Original message-
> From:devang pandey
> Sent: Thursday 1st August 2013 8:27
> To: user@nutch.apache.org
> Subject: nutch webgraph analysis
>
> Hello,
cool!
-Original message-
> From:Julien Nioche
> Sent: Monday 29th July 2013 17:45
> To: user@nutch.apache.org
> Subject: 2 day Nutch training course
>
> Hi,
>
> We are planning to run a 2 day Nutch training course this autumn. More
> details on
> http://digitalpebble.blogspot.co.uk/
ulted the expected results, but when I run the crawler, I get
> ~98% Error while Parsing,
>
> I get the following error
>
> *"Unable to successfully parse content URL*"
>
>
>
> On Mon, Jul 29, 2013 at 4:53 PM, Markus Jelsma
> wrote:
>
> &g
Simple, only use parse-tika and patch with NUTCH-961.
https://issues.apache.org/jira/browse/NUTCH-961
Extractor algorithms are fixed, it is not possible to preanalyze a page and
select an extractor accordingly.
-Original message-
> From:imran khan
> Sent: Monday 29th July 2013 11:25
Patch should work with any 1.x, it doesn't change existing sources and only
reads the CrawlDB.
-Original message-
> From:devang pandey
> Sent: Monday 29th July 2013 13:00
> To: user@nutch.apache.org
> Subject: Re: nutch crawldb analytics
>
> @ Markus Jelsma
Using the HostDB tool you can create a database of hosts and dump their
statistics.
https://issues.apache.org/jira/browse/NUTCH-1325
-Original message-
> From:devang pandey
> Sent: Monday 29th July 2013 12:30
> To: user@nutch.apache.org
> Subject: nutch crawldb analytics
>
> Hello
-Original message-
> From:stone2dbone
> Sent: Wednesday 24th July 2013 18:25
> To: user@nutch.apache.org
> Subject: RE: Prevent crawl of parent URL
>
> Thanks Markus. I will give this a try. I did refilter the crawldb. One more
> question:
>
> I'm not good with regex. If I wanted to cr
Hi
-Original message-
> From:stone2dbone
> Sent: Wednesday 24th July 2013 14:56
> To: user@nutch.apache.org
> Subject: Prevent crawl of parent URL
>
> I would like to crawl everything in
>
> http://my.domain.name/dir/subdir
>
> but nothing in its parent
>
> http://my.domain.name/dir/
-Original message-
> From:dogrdon
> Sent: Monday 22nd July 2013 17:02
> To: user@nutch.apache.org
> Subject: RE: Why aren't my path exclusions getting excluded in the Nutch
> index to Solr?
>
> Thanks, I was wondering if some kind of reset like that needed to happen.
>
> I am fai
Check the Solr wiki pages:
http://wiki.apache.org/solr/FunctionQuery
http://wiki.apache.org/solr/SolrRelevancyFAQ
http://wiki.apache.org/solr/ExtendedDisMax
The only thing you can do from Nutch is pass its LinkRank score to the boost
field.
http://wiki.apache.org/nutch/NewScoring
-Original
Don't forget to refilter the database after changes have been made to URL
filters, or the unwanted URLs will be regenerated and fetched.
-Original message-
> From:dogrdon
> Sent: Friday 19th July 2013 18:44
> To: user@nutch.apache.org
> Subject: Why aren't my path exclusions getting excluded in the Nutch
Yes! Boilerpipe is the best open source alternative and has a working patch for
Nutch! There are also some other open source extraction toolkits but they have
not been ported to Tika or do not directly work with SAX ContentHandlers
(usable in Tika) so they would require some work there plus inte
toring Nutch statistics
>
> hey markus, thanks for replying. What I want exactly is to export this
> result (of, say, readdb) to a database so that it is easy to query. So
> what needs to be done to export stats to a database, e.g. postgres? I am
> using nutch 1.4
>
>
> On
What kind of stats are you looking for? Nutch ships with the readdb
tool by default. Use readdb crawl/crawldb -stats.
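readdb -stats prints counter lines to the console. For the follow-up question about getting these numbers into a database such as Postgres, a hypothetical sketch (the parsing function and the sample text are illustrative; the exact output format differs between Nutch versions):

```python
import re

def parse_readdb_stats(text):
    """Turn 'readdb -stats' style output into (metric, count) rows
    that could be INSERTed into a database table."""
    rows = []
    for line in text.splitlines():
        m = re.match(r"(.+?):\s+(\d+)\s*$", line.strip())
        if m:
            rows.append((m.group(1), int(m.group(2))))
    return rows

# Sample shaped like the tool's console output (illustrative values)
sample = ("TOTAL urls:\t1000\n"
          "status 1 (db_unfetched):\t600\n"
          "status 2 (db_fetched):\t400")
print(parse_readdb_stats(sample))
```

Each (metric, count) pair then maps naturally onto one row of a stats table keyed by crawl date.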
-Original message-
> From:devang pandey
> Sent: Monday 15th July 2013 8:27
> To: user@nutch.apache.org
> Subject: Storing Nutch statistics
>
> Hello,
>
> I am usin
Hi,
This is always an interesting problem. You can either buy or build your own
extraction software or be satisfied by what Boilerpipe has to offer. Tika has
support for Boilerpipe and NUTCH-961 has a patch for 2.x as well enabling
Boilerpipe.
https://issues.apache.org/jira/browse/NUTCH-961
B
> hello markus, I have one confusion: should I implement changes in crawl-url
> filter or regex filter?
>
>
> On Wed, Jul 10, 2013 at 3:12 PM, Markus Jelsma
> wrote:
>
> > Hi,
> >
> > Use a regex url filter to filter those URL's and prevent them from be
Hi,
Use a regex url filter to filter those URL's and prevent them from being
crawled again.
Cheers
-Original message-
> From:devang pandey
> Sent: Wednesday 10th July 2013 10:29
> To: user@nutch.apache.org
> Subject: nutch crawling issues
>
> I have a website eg . www.example.com. N
Hi,
Those were removed because I copied it from our own Nutch dist. They aren't
being used anyway, and document-related variables or literals have no place in
indexing backends. They aren't available either in the ES backend we added.
Also, there is no need to modify SolrClean because it simply d
nd notify you guys about the result.
>
> Best.
>
>
> On Tue, Jul 9, 2013 at 1:55 PM, Markus Jelsma
> wrote:
>
> > Hi,
> >
> > Just as i explained. The DistributedUpdateRequestProcessor does that on
> > the Solr node for you. There's an issue at Solr
rver(s). I'll give a shot for your
> patch.
>
> Best
>
>
> On Tue, Jul 9, 2013 at 1:34 PM, Markus Jelsma
> wrote:
>
> > Yes, it only takes URL's for your ensemble because that is how
> > CloudSolrServer works and it is the best method of connecting to
r is enough?
>
> BTW my patch is ready; how am I supposed to attach it?
>
> Best
>
>
> On Tue, Jul 9, 2013 at 1:11 PM, Markus Jelsma
> wrote:
>
> > I attached a patch for support of CloudSolrServer and a Zookeeper
> > ensemble. Use solr.zookee
e.org
> Subject: RE: Regarding crawling https links
>
> How can I make nutch ignore robots.txt file?
>
> Regards,
> Vincent Anup Kuri
>
>
> -----Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Tuesday, July 09, 2013 3:46 PM
That's because the checker tools do not use robots.txt.
-Original message-
> From:Anup Kuri, Vincent
> Sent: Tuesday 9th July 2013 12:14
> To: user@nutch.apache.org
> Subject: RE: Regarding crawling https links
>
> That's for the asp file. When I used Parser Checker, it works perfectly,
I attached a patch for support of CloudSolrServer and a Zookeeper ensemble. Use
solr.zookeeper.hosts and solr.collection to enable it. Patch also required
NUTCH-1486.
https://issues.apache.org/jira/browse/NUTCH-1377
-Original message-
> From:Tuğcem Oral
> Sent: Tuesday 9th July 2013
> what if I have to talk to Solr from Nutch ?
>
>
> On Mon, Jul 8, 2013 at 4:34 PM, Markus Jelsma
> wrote:
>
> > Processing the logs would be easy but since you need some metadata you'll
> > probably need to hack into the Fetcher.java code. The fetcher has several
&g
Processing the logs would be easy but since you need some metadata you'll
probably need to hack into the Fetcher.java code. The fetcher has several inner
classes but you'd need the FetcherThread class which is responsible for the
actual download and anything else that needs to be done there. If y
1.6 to solr 4.3.1 cloud
>
>
> OK then. I generated the corresponding patch. If someone also needs it till
> nutch 1.8 is released, I'd be happy to share.
>
> Best,
>
> Tugcem.
>
>
> On Mon, Jul 8, 2013 at 12:10 PM, Markus Jelsma
> wrote:
>
> >
rOutputFormat.java:48)
> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> 2013-07-08 17:38:47,577 ERROR solr.SolrIndexer - ja
You're not using the correct httpclient jar for the solrj client version. This
is some dependency hell. You'd better build 1.2 from source and change the
versions in ivy. That would be easier, I think.
-Original message-
> From:devang pandey
> Sent: Monday 8th July 2013 13:46
> To: use
> On 08.07.2013 13:20, Markus Jelsma wrote:
> > Ah, i assume this didn't happen with 1.6? 1.7 finally has support for
> > multivalued metadata fields and index-metadata seems to write them
> > all.
>
> Absolutely right! Which is in some cases fine, but very annoying els
Ah, I assume this didn't happen with 1.6? 1.7 finally has support for
multivalued metadata fields, and index-metadata seems to write them all.
-Original message-
> From:Christian Nölle
> Sent: Monday 8th July 2013 13:17
> To: user@nutch.apache.org
> Subject: Drop inproper multiValued fi
owngrade my solr to 3.1 ?? will this work
>
>
> On Mon, Jul 8, 2013 at 4:16 PM, Markus Jelsma
> wrote:
>
> > mm, if dropping jars doesn't work (yields the same error in the logs) i
> > don't know what you could do except upgrading to a more recent Nutch. 1.5
pache.org
> Subject: Re: nutch 1.2 solr 3.6 integration issue
>
> I am very sorry for the wrong reply. I am using the binary.
>
>
> On Mon, Jul 8, 2013 at 4:10 PM, Markus Jelsma
> wrote:
>
> > you're building nutch from source?
> >
> > -Original messa
emoved
> the older ones... I also ran an ant job, but still this stuff is not
> working.
>
>
> On Mon, Jul 8, 2013 at 3:54 PM, Markus Jelsma
> wrote:
>
> > Well, the API hasn't changed i think so you might consider upgrading the
> > solrJ client i
be done
> to support this version of nutch.
>
>
> On Mon, Jul 8, 2013 at 3:46 PM, Markus Jelsma
> wrote:
>
> > Ah, since you're using an old Nutch and an old SolrJ client and that the
> > Javabin format has changed over time, i think your Solr is too new for
duceTask.java:474)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> 2013-07-08 15:17:39,539 ERROR solr.SolrIndexer - java.io.IOException: Job
> failed!
>
>
>
> On Mon, Jul 8, 2013 at 3:41 PM,
an you pls
> suggest me some other way of resolving this issue.
>
>
> On Mon, Jul 8, 2013 at 3:14 PM, Markus Jelsma
> wrote:
>
> > You need to provide the log output. But i think crawl/segments/* is the
> > problem. You must either do seg1 seg2 seg3 or -dir
You need to provide the log output. But I think crawl/segments/* is the
problem. You must either do seg1 seg2 seg3 or -dir segments/. No wildcards
supported!
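To make that concrete, a hedged sketch mirroring the command earlier in this thread (host, paths and segment names are illustrative):

```sh
# Either enumerate each segment explicitly...
bin/nutch org.apache.nutch.indexer.solr.SolrIndexer -solr \
  http://solr.server.tld:8088/solr/core1/ crawl/crawldb -linkdb crawl/linkdb \
  crawl/segments/20130708101500 crawl/segments/20130708113000

# ...or pass the parent directory with -dir; do NOT use crawl/segments/*
bin/nutch org.apache.nutch.indexer.solr.SolrIndexer -solr \
  http://solr.server.tld:8088/solr/core1/ crawl/crawldb -linkdb crawl/linkdb \
  -dir crawl/segments
```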
Cheers
-Original message-
> From:devang pandey
> Sent: Monday 8th July 2013 11:41
> To: user@nutch.apache.org
> Subject: nut
First we need to upgrade to Solr >= 4.3 (NUTCH-1486). Then we'll have to add an
option to index via CloudSolrServer (NUTCH-1377) where you input your Zookeeper
ensemble vs. a target host. Then we can do NUTCH-1480 and write to multiple
individual servers and/or multiple cloud clusters.
The upgr
Hi,
Begin with decreasing the number of records you process per mapper and
increasing the number of mappers. Better parallelism and less recovery work in
case of problems. We usually don't do more than 25k URL's per mapper but do a
lot of mappers instead! Easier to control and debug.
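As a hedged example of that setup (numbers purely illustrative): 40 mappers at 25k URL's each gives a 1M-URL fetch list:

```sh
bin/nutch generate crawl/crawldb crawl/segments -topN 1000000 -numFetchers 40
```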
Cheers.
other computer.
>
> Best,
>
> Tugcem Oral
>
>
>
> On Fri, Jul 5, 2013 at 5:48 PM, Markus Jelsma
> wrote:
>
> > Hi,
> >
> > 1480 and 1377 are different. We already use CloudSolrServer (i haven't
> > added the patch yet) but also use
Hi,
What exactly is the problem you're trying to solve? If there's an issue, please
open a Jira ticket, describe the problem and attach a patch file.
Thanks
Markus
-Original message-
> From:RS
> Sent: Sunday 7th July 2013 11:14
> To: Lewis John Mcgibbney
> Cc: nutch-user
> Subject
The webgraph package is simply not there yet. Please upgrade to a recent
version and you can use the webgraph package.
Cheers,
Markus
-Original message-
> From:devang pandey
> Sent: Monday 8th July 2013 7:44
> To: user@nutch.apache.org
> Subject: not able to use webgraph command in n
Hi,
1480 and 1377 are different. We already use CloudSolrServer (I haven't added
the patch yet) but also use 1480 to write to multiple Solr clusters! Both still
need patches and I haven't had time yet to provide them, although we already
use both features in our Nutch.
I'll try to find som
generate.max.count?
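A hedged nutch-site.xml sketch of that setting (values illustrative; generate.count.mode selects whether the cap counts per host or per domain):

```xml
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>10</value>
</property>
```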
-Original message-
> From:Dennis Yurichev
> Sent: Friday 5th July 2013 5:25
> To: user@nutch.apache.org
> Subject: limit to fetch only N pages from each host?
>
> Hi.
>
> How to limit nutch 2.x to fetch only N (5-10) pages from each host or
> domain?
> I fail to
How many different hosts do you crawl? I see one reducer and only one queue,
and Nutch queues by domain or host. Hosts will always end up in the same queue,
so Nutch will only crawl a lot and very fast if there's a large number of
queues to process.
The only thing you can do then is increase the
Great news, thanks Lewis!
-Original message-
From: Lewis John Mcgibbney
Sent: Tuesday 2nd July 2013 18:32
To: user@nutch.apache.org; d...@nutch.apache.org
Subject: [ANNOUNCE] Apache Nutch v2.2.1 Released
Good Afternoon Everyone,
The Apache Nutch PMC are very pleased to announce the immed
I've got a version of the indexchecker that does that, as well as providing a
telnet server. I was just thinking to open an issue about that this afternoon!
-Original message-
> From:Sebastian Nagel
> Sent: Tuesday 2nd July 2013 22:29
> To: user@nutch.apache.org
> Subject: Re: no dige
Hi,
Increase your memory in the task trackers by setting your Xmx in
mapred.map.child.java.opts.
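A hedged sketch of that setting in mapred-site.xml (heap value illustrative):

```xml
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```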
Cheers
-Original message-
> From:Sznajder ForMailingList
> Sent: Tuesday 2nd July 2013 15:25
> To: user@nutch.apache.org
> Subject: Distributed mode and java/lang/OutOfMemoryError
>
>
Hi,
Nutch can easily scale to many many billions of records, it just depends on how
many and how powerful your nodes are. Crawl speed is not very relevant as it is
always very fast, the problem usually is updating the databases. If you spread
your data over more machines you will increase your
These are available in Nutch 1.7:

headings = h1,h2
  (comma separated list of headings to retrieve from the document)
headings.multivalued = false
  (whether to support multivalued headings)

So if the plugin is enabled those headings are extracted and added to the
document's parse metada
)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass)|iframemeta
> > > So I us parse-html but also tika, text metatags and js. maybe it's to
> > much
> > > ? I copied this configuration from an example I saw. I do know that I
Thanks again Lewis for properly managing the release! Looking forward
already to 1.8!
Cheers
-Original message-
From: Lewis John Mcgibbney
Sent: Thursday 27th June 2013 1:39
To: user@nutch.apache.org; d...@nutch.apache.org
Subject: [ANNOUNCE] Apache Nutch v1.7 Released
N.B. Previous
Looks fine Lewis! +1
-Original message-
From: Lewis John Mcgibbney
Sent: Thursday 27th June 2013 20:00
To: d...@nutch.apache.org; user@nutch.apache.org
Subject: [VOTE] Apache Nutch 2.2.1 RC#1
Hi,
It would be greatly appreciated if you could take some time to VOTE on the
release candidat
er@nutch.apache.org
> Subject: Re: Fetch iframe from HTML (if exists)
>
> How will it affect ? I Crawl with no depth (depth 1) so outlinks don't
> matter and it seems that the urls fetched don't get parsed, or am I
> misunderstanding something ?
>
>
> O
> >
> > Thanks.
> >
> >
> >
> >
> > On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela wrote:
> >
> >> Thanks for the prompt answer!
> >>
> >>
> >> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <
> >> markus.jel...
Hi,
Do I understand you correctly that you want all iframe src attributes on a
given page stored in the iframe field?
The src attributes are not extracted and there is no facility to do so right
now. You should create your own HTMLParseFilter, loop through the document
looking for iframe tags an
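To illustrate the idea only: a real Nutch HTMLParseFilter is written in Java against the parsed DOM, so this Python snippet is just a sketch of the extraction logic, with made-up URLs:

```python
from html.parser import HTMLParser

class IframeSrcCollector(HTMLParser):
    """Collect the src attribute of every iframe tag encountered."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)

collector = IframeSrcCollector()
collector.feed('<html><body>'
               '<iframe src="http://a.example/x"></iframe>'
               '<iframe src="http://b.example/y"></iframe>'
               '</body></html>')
print(collector.srcs)  # → ['http://a.example/x', 'http://b.example/y']
```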
http://mail-archives.apache.org/mod_mbox/nutch-user/
http://lucene.472066.n3.nabble.com/Nutch-f603146.html
-Original message-
> From:Sznajder ForMailingList
> Sent: Tuesday 25th June 2013 14:16
> To: user@nutch.apache.org
> Subject: Mailing list archive
>
> Hi
>
> Where can I consult
Hi All,
>
> I am going to close this thread off and formally close the VOTE for Nutch
> 1.7 Release candidate.
> VOTE'ing has tallied as follows
>
> [7] +1, lets get it released!!!
> Markus Jelsma
> Julien Nioche
> Tejas Patil
> Feng Lu
> Chris Mattmann
>
Yes, that matters indeed! But if you don't normalize, your URL filters may not
work, although that should not be a problem in small crawls or with a limited
number of (good) websites. You could try the following normalizing rule, which
removes very long URL's, as your first rule.
.{256,}
With an empty sub
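A sketch of how that rule could look in regex-normalize.xml, assuming the standard RegexURLNormalizer rule format; the empty substitution drops any URL of 256 or more characters:

```xml
<regex>
  <pattern>.{256,}</pattern>
  <substitution></substitution>
</regex>
```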
Sigs checked out for tgz!
-Original message-
From: Lewis John Mcgibbney
Sent: Friday 21st June 2013 20:41
To: user@nutch.apache.org
Cc: d...@nutch.apache.org
Subject: Re: [VOTE] Apache Nutch 1.7 Release Candidate
Hi Julien,
Done, thanks for the attention to detail.
I wonder if you got to
Nice!
Signatures are ok and everything builds and all tests pass. The pom does still
point to 1.6-SNAPSHOT. Other than that: definite +1!
Cheers
-Original message-
From: lewis john mcgibbney
Sent: Friday 21st June 2013 0:33
To: d...@nutch.apache.org; user@nutch.apache.org
Subject: [VO
Hi Joe,
You don't need a scoring filter for Linkrank. Just follow the wiki and run the
webgraph tool on your segments. Then you can run the linkrank tool on the
webgraph you just created from your segments. Finally use the scoreupdater tool
to write the scores back to your crawldb.
Cheers
htt
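As a hedged sketch of those three steps (paths are illustrative; tool class names and flags follow the 1.x webgraph package, so treat this as an outline rather than exact syntax):

```sh
# 1. Build the webgraph from the segments
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph \
  -webgraphdb crawl/webgraphdb -segmentDir crawl/segments
# 2. Run LinkRank over the webgraph
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank \
  -webgraphdb crawl/webgraphdb
# 3. Write the scores back to the crawldb
bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater \
  -webgraphdb crawl/webgraphdb -crawldb crawl/crawldb
```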
I think for Nutch 2.x, HTMLParseFilter was renamed to ParseFilter. This is
not true for 1.x; see NUTCH-1482.
https://issues.apache.org/jira/browse/NUTCH-1482
-Original message-
> From:Tony Mullins
> Sent: Wed 12-Jun-2013 14:37
> To: user@nutch.apache.org
> Subject: HTMLParseFi
We happily use that filter just as it is shipped with Nutch. Just enabling it
in plugin.includes works for us. To ease testing you can use bin/nutch
org.apache.nutch.net.URLFilterChecker to test filters.
-Original message-
> From:Bai Shen
> Sent: Wed 12-Jun-2013 14:32
> To: user@
Hi,
Sure, you don't need to index the data and can use the individual commands or
the new bin/crawl script.
Cheers
-Original message-
> From:Peter Gaines
> Sent: Wed 12-Jun-2013 13:57
> To: user@nutch.apache.org
> Subject: Running Nutch standalone (without Solr)
>
> Hi There,
>
>
: using Tika within Nutch to remove boiler plates?
>
> So what in your opinion is the most effective way of removing boilerplates
> in Nutch crawls?
>
>
> On Tue, Jun 11, 2013 at 12:12 PM, Markus Jelsma
> wrote:
>
> > Yes, Boilerpipe is complex and difficult to ad
lugins
> available on web ?
>
> Thanks,
> Tony.
>
>
> On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma
> wrote:
>
> > Hi,
> >
> > Yes, you should write a plugin that has a parse filter and indexing
> > filter. To ease maintenance you would w
't use boilerpipe any more? So what do you
> suggest as an alternative?
>
>
> On Tue, Jun 11, 2013 at 5:41 AM, Markus Jelsma
> wrote:
>
> > we don't use Boilerpipe anymore so no point in sharing. Just set those two
> > configuratio
Hi,
Yes, you should write a plugin that has a parse filter and indexing filter. To
ease maintenance you would want to have a file per host/domain containing XPath
expressions, far easier than switch statements that need to be recompiled. The
indexing filter would then index the field values ext
-Original message-
> From:Sourajit Basak
> Sent: Tue 11-Jun-2013 14:50
> To: user@nutch.apache.org
> Subject: RSS based crawl - how to crawl ref links in next round
>
> We are crawling RSS links using a custom plugin. Thats working fine.
>
> Our intention is to crawl the discovered
11-Jun-2013 01:42
> To: user
> Subject: Re: using Tika within Nutch to remove boiler plates?
>
> Marcus, do you mind sharing a sample nutch-site.xml?
>
>
> On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma
> wrote:
>
> > Those settings belong to nutch-site. Enable
Those settings belong to nutch-site. Enable BP and set the correct extractor
and it should work just fine.
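As a hedged sketch, the nutch-site.xml entries could look like this (tika.use_boilerpipe is the property named earlier in this thread; the extractor property name and its value are assumptions based on the NUTCH-961 patch):

```xml
<property>
  <name>tika.use_boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <value>ArticleExtractor</value>
</property>
```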
-Original message-
> From:Lewis John Mcgibbney
> Sent: Sun 09-Jun-2013 20:47
> To: user@nutch.apache.org
> Subject: Re: using Tika within Nutch to remove boiler plates?
>
> Hi J
Please don't break existing scripts; support both lower- and uppercase.
Markus
-Original message-
> From:Lewis John Mcgibbney
> Sent: Fri 31-May-2013 19:11
> To: user@nutch.apache.org
> Subject: Re: Generator -adddays
>
> Seems like a small cli syntax bug.
> Please submit a patch and w
You can either use robots.txt or modify the Fetcher. Fetcher has a
FetchItemQueue for each queue, this also records the CrawlDelay for that queue.
A FetchItemQueue is created by FetchItemQueues.getFetchItemQueue(), here it
sets the CrawlDelay for the queue. You can have a lookup table here that
Hi,
For some reason the fetcher sometimes produces corrupt, unreadable segments. It
then exits with exceptions like "problem advancing post" or "negative array
size exception", etc.
java.lang.RuntimeException: problem advancing post rec#702
at org.apache.hadoop.mapred.Task$ValuesIterat
Hi,
The 1.x indexer takes a -normalize parameter and there you can rewrite your
URL's. Judging from your patterns the RegexURLNormalizer should be sufficient.
Make sure you use the config file containing that pattern only when indexing,
otherwise they'll end up in the CrawlDB and segments. Use
Rodney,
Those are valid URL's but you clearly don't need them. You can either use
filters to get rid of them or normalize them away. Use the
org.apache.nutch.net.URLNormalizerChecker or URLFilterChecker tools to test
your config.
Markus
-Original message-
> From:Rodney Barnett
>
> To: user@nutch.apache.org
> Subject: Re: Does Nutch Checks Whether A Page crawled before or not
>
> Where does Nutch stores that information?
>
> 2013/3/21 Markus Jelsma-2 [via Lucene] <
> ml-node+s472066n4049568...@n3.nabble.com>
>
> > Nutch selects records th
If Nutch exits with an error then the segment is bad, but a failing thread is
not an error that leads to a failed segment. This means the segment is properly
fetched, just that some records failed. Those records will be eligible for
refetch.
Assuming you use the crawl command, the updatedb comm
Nutch selects records that are eligible for fetch, either due to a transient
failure or because the fetch interval has expired. This means that failed
fetches due to network issues are refetched within 24 hours.
Successfully fetched pages are only refetched if the current time exceeds the
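The scheduling rule described here can be sketched as follows (the 30-day figure is Nutch's default fetch interval; the helper functions are illustrative, not Nutch code):

```python
DAY = 24 * 3600

def next_fetch_time(now, succeeded, fetch_interval=30 * DAY):
    """Sketch: a failed (transient) fetch is retried within a day;
    a success waits a full fetch interval (30 days by default)."""
    return now + (fetch_interval if succeeded else DAY)

def eligible(record_fetch_time, now):
    """A record is selected for (re)fetch once the current time
    has reached its scheduled fetch time."""
    return now >= record_fetch_time

now = 0
t_failed = next_fetch_time(now, succeeded=False)
t_success = next_fetch_time(now, succeeded=True)
# Two days later: the failed record is eligible again, the success is not.
print(eligible(t_failed, now + 2 * DAY), eligible(t_success, now + 2 * DAY))
```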
Feng Lu, welcome! :)
-Original message-
> From:Julien Nioche
> Sent: Mon 18-Mar-2013 13:23
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org
> Subject: Re: [WELCOME] Feng Lu as Apache Nutch PMC and Committer
>
> Hi Feng,
>
> Congratulations on becoming a committer and welcome
Hi
You can't do this with -slice but you can merge segments and filter them. This
would mean you'd have to merge the segments for each domain. But that's far too
much work. Why do you want to do this? There may be better ways of achieving
your goal.
-Original message-
> From:Jason S
The default heap size of 1G is just enough for a parsing fetcher with 10
threads. The only problem that may arise is with too large and complicated PDF
files or very large HTML files. If you generate fetch lists of a reasonable size
there won't be a problem most of the time. And if you want to crawl a
Hi,
Regarding politeness, 3 threads per queue is not really polite :)
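For reference, a hedged nutch-site.xml sketch of a more polite setup (values illustrative):

```xml
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
</property>
```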
Cheers
-Original message-
> From:jc
> Sent: Fri 01-Mar-2013 15:08
> To: user@nutch.apache.org
> Subject: Re: a lot of threads spinwaiting
>
> Hi Roland and lufeng,
>
> Thank you very much for your replies, I alr
w can we do that?
>
>
> Feng Lu : Thank you for the reference link.
>
> Thanks - David
>
>
>
> On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma
> wrote:
>
> > The default or the injected interval? The default interval can be set in
> > the config (
interval, in case I
> require to fetch the page before the time interval has passed?
>
>
>
> Thanks very much
> - David
>
>
>
>
>
> On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma
> wrote:
>
> > If you want records to
;
>
> Yes, my first option is different files for different domains.
> The point is: how can I link the files with each domain? Do I need to make
> some changes in the Nutch code, or does the project have a feature to do
> that?
>
> On Tue, 26 Feb 2013 10:33:37 +, Markus Jelsma wrot
Yes, it will support that until you run out of memory. But having a million
expressions is not going to work nicely. If you have a lot of expressions but
can divide them into domains, I would patch the filter so it only executes the
filters that are for a specific domain.
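A sketch of that per-domain partitioning idea in Python (the domains and expressions are made up; the real change would be inside the Java regex URL filter):

```python
import re
from urllib.parse import urlparse

# Hypothetical buckets: expressions grouped by the domain they apply to,
# so only one small bucket runs per URL instead of the whole list.
FILTERS_BY_DOMAIN = {
    "example.com": [re.compile(r"\.pdf$")],
    "example.org": [re.compile(r"/private/")],
}

def rejected(url):
    host = urlparse(url).hostname or ""
    domain = ".".join(host.split(".")[-2:])  # naive registered-domain guess
    return any(p.search(url) for p in FILTERS_BY_DOMAIN.get(domain, []))

print(rejected("http://www.example.com/report.pdf"))  # → True
print(rejected("http://www.example.org/report.pdf"))  # → False
```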
-Original message-
Something seems to be missing here. It's clear that 1.x has more features and
is a lot more stable than 2.x. Nutch 2.x can theoretically perform a lot better
if you are going to crawl on a very large scale, but I still haven't seen any
numbers to support this assumption. Nutch 1.x can easily deal