RE: very long fetch reduce task

2012-06-13 Thread Markus Jelsma
In a parsing fetcher iirc outlinks are processed in the mapper (at least when outlinks are followed). If a fetcher's reducer stalls you may run out of memory or disk space. -Original message- From:kaveh minooie ka...@plutoz.com Sent: Wed 13-Jun-2012 19:28 To: user@nutch.apache.org

RE: Generator: 0 records selected for fetching, exiting ...

2012-06-11 Thread Markus Jelsma
Hi This CrawlDatum's FetchTime is tomorrow in EST Fetch time: Tue Jun 12 02:59:27 EST 2012 -Original message- From:Andy Xue andyxuey...@gmail.com Sent: Mon 11-Jun-2012 11:00 To: user@nutch.apache.org Subject: Generator: 0 records selected for fetching, exiting ... Hi all:
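A rough sketch of why nothing is selected in this case, assuming the generator only picks records whose fetch time is not later than the current time. This is not Nutch's code; `is_due` and its `add_days` argument are hypothetical names used only for illustration, and the "now" value is made up to match the day the question was sent.

```python
from datetime import datetime, timedelta

def is_due(fetch_time, now, add_days=0):
    """Return True if the record may be selected for fetching.

    add_days is a hypothetical knob that pretends the current time
    lies that many days in the future.
    """
    return fetch_time <= now + timedelta(days=add_days)

# Both values are assumed to be in the same time zone.
fetch_time = datetime(2012, 6, 12, 2, 59, 27)  # fetch time quoted in the reply
now = datetime(2012, 6, 11, 11, 0, 0)          # made-up "now" on the day before

print(is_due(fetch_time, now))               # not yet due
print(is_due(fetch_time, now, add_days=1))   # due once "now" is shifted a day
```

With the fetch time a day ahead, the record only becomes eligible once the clock (or a shifted clock) passes it.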

RE: robots.txt UnknownHostException

2012-06-07 Thread Markus Jelsma
Hi, Nutch will fetch URLs without robots.txt, but if robots.txt throws an UnknownHostException, the URL will throw it as well and fail. Cheers -Original message- From:chethan chethan.p...@gmail.com Sent: Thu 07-Jun-2012 16:16 To: user@nutch.apache.org Subject: robots.txt

RE: [ANNOUNCE] Apache Nutch 1.5 Released

2012-06-07 Thread Markus Jelsma
Great work Lewis, Chris, committers and contributors! Thanks all! -Original message- From:lewis john mcgibbney lewi...@apache.org Sent: Thu 07-Jun-2012 19:01 To: annou...@apache.org; d...@nutch.apache.org; user@nutch.apache.org Subject: [ANNOUNCE] Apache Nutch 1.5 Released

RE: Building Lucene index with Nutch 1.4

2012-06-07 Thread Markus Jelsma
Hello! Sounds very interesting. Anyway, Solr can run embedded in a Java application via EmbeddedSolrServer. You do need to make some changes to the SolrIndexer tools in Nutch. Cheers -Original message- From:Emre Çelikten e...@celikten.name Sent: Thu 07-Jun-2012 22:24 To:

RE: Linkdb empty

2012-06-06 Thread Markus Jelsma
-Original message- From:Matthias Paul magethle.nu...@gmail.com Sent: Wed 06-Jun-2012 09:47 To: user@nutch.apache.org Subject: Linkdb empty Hi all, hi I noticed that my linkdb is always empty although I use the generated segments from the last crawl for the generation of the

RE: Nutch topN selection

2012-06-06 Thread Markus Jelsma
-Original message- From:chethan chethan.p...@gmail.com Sent: Wed 06-Jun-2012 05:12 To: user@nutch.apache.org Subject: Nutch topN selection Hi, hi Does the topN threshold consider page score for the selection. If it's set to say 10, does Nutch queue up the 10 top scoring URLs

RE: Behaviour of urlfilter-suffix plug-in when dealing with a URL without filename extension

2012-06-06 Thread Markus Jelsma
-Original message- From:Andy Xue andyxuey...@gmail.com Sent: Wed 06-Jun-2012 05:04 To: user@nutch.apache.org Subject: Behaviour of "urlfilter-suffix" plug-in when dealing with a URL without filename extension Hi all: hi Does the urlfilter-suffix plug-in prune URL

RE: threads disminution when fetching page

2012-06-06 Thread Markus Jelsma
-Original message- From:pepe3059 pepe3...@gmail.com Sent: Wed 06-Jun-2012 02:58 To: user@nutch.apache.org Subject: RE: threads disminution when fetching page me again :) at the end of fetch process, is the regex-urlfilter considered? No. At the end of the fetch the mapper

RE: Behaviour of urlfilter-suffix plug-in when dealing with a URL without filename extension

2012-06-06 Thread Markus Jelsma
-Original message- From:Andy Xue andyxuey...@gmail.com Sent: Wed 06-Jun-2012 11:11 To: Markus Jelsma markus.jel...@openindex.io; user@nutch.apache.org Subject: Re: Behaviour of "urlfilter-suffix" plug-in when dealing with a URL without filename extension Hi Markus: hi

RE: How to write complex rules on regex-urlfilter

2012-06-06 Thread Markus Jelsma
What's the problem with having the seed page? Can you not only inject the /news pages? Anyway, you can always filter it away later after the first fetch cycle. -Original message- From:Shameema Umer shem...@gmail.com Sent: Wed 06-Jun-2012 13:02 To: user@nutch.apache.org Subject:

RE: HTTP REFERER is missing

2012-06-06 Thread Markus Jelsma
Hi Nutch cannot do this by default and it is tricky to implement because there may not be one unique referrer per page. What you can try is to add the referrer to outlinks when parsing records. This outlink can be added to CrawlDatum's MetaData which you can then later use to set the referrer. To set

RE: How to configure nutch to fetch only recent documents

2012-06-04 Thread Markus Jelsma
Hi, The generator can only do it the other way around via the addDays parameter. To make it work your way you can modify the generator to restrict to documents younger than 48 hours. Cheers -Original message- From:Shameema Umer shem...@gmail.com Sent: Mon 04-Jun-2012 08:33 To:

RE: threads disminution when fetching page

2012-06-04 Thread Markus Jelsma
This is normal and means the fetcher is finishing all its input URLs and writing stuff to disk. -Original message- From:pepe3059 pepe3...@gmail.com Sent: Sat 02-Jun-2012 22:15 To: user@nutch.apache.org Subject: threads disminution when fetching page Hello, i hope you can help

RE: threads disminution when fetching page

2012-06-04 Thread Markus Jelsma
-Original message- From:pepe3059 pepe3...@gmail.com Sent: Mon 04-Jun-2012 20:42 To: user@nutch.apache.org Subject: RE: threads disminution when fetching page thank you for your answer Markus Hi you mean, until the fetch process finishes, is information stored using hdfs by

RE: No links to process, is the webgraph empty?

2012-05-29 Thread Markus Jelsma
Hi, That's a patch for the fetcher. The error you are seeing is quite simple actually. Because you set those two link.ignore parameters to true, no links between the same domain and host are aggregated, only links from/to external hosts and domains. This is a good setting for wide web crawls.
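To illustrate the effect described here, a minimal sketch, assuming an "ignore internal links" switch simply drops every link whose source and target share a host. It is not the WebGraph implementation; `keep_link` and `filter_links` are hypothetical helpers, and only the host check is modelled (the domain check is left out).

```python
from urllib.parse import urlsplit

def keep_link(src, dst, ignore_internal_host=True):
    """Keep a link unless it is host-internal and such links are ignored."""
    same_host = urlsplit(src).hostname == urlsplit(dst).hostname
    return not (ignore_internal_host and same_host)

def filter_links(links, ignore_internal_host=True):
    """Return only the (source, target) pairs that survive the filter."""
    return [(s, d) for s, d in links if keep_link(s, d, ignore_internal_host)]

# A crawl restricted to one site yields only internal links.
single_site = [
    ("http://www.example.com/", "http://www.example.com/a"),
    ("http://www.example.com/a", "http://www.example.com/b"),
]
print(filter_links(single_site))         # nothing is left to process
print(filter_links(single_site, False))  # all links are kept
```

For a crawl of a single site, every link is internal, so the filtered list is empty and there is nothing to aggregate.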

RE: No links to process, is the webgraph empty?

2012-05-29 Thread Markus Jelsma
and link.ignore.limit.domain to false and the link.ignore.internal.xxx can be set to true? Or should I just set all of the link.ignore.xxx.xxx values to false? On 5/29/2012 4:43 PM, Markus Jelsma wrote: Hi, That's a patch for the fetcher. The error you are seeing is quite simple actually. Because you set

RE: Setting the Fetch time with a CustomFetchSchedule

2012-05-29 Thread Markus Jelsma
<value>com.custom.CustomEventFetchScheduler</value> </property> How do I include my custom logic so that it gets picked as a part of the crawl cycle. Regards | Vikas On Mon, May 21, 2012 at 6:14 PM, Markus Jelsma markus.jel...@openindex.io wrote: Yes, you can pass ParseMeta keys to the FetchSchedule as part

RE: Multiple nutch jobs on a Hadoop cluster simultaneosuly

2012-05-24 Thread Markus Jelsma
Hi, Yes, this is no problem. Cheers -Original message- From:Dustine Rene Bernasor dust...@thecyberguardian.com Sent: Thu 24-May-2012 12:58 To: user@nutch.apache.org Subject: Multiple nutch jobs on a Hadoop cluster simultaneosuly Hello I was wondering, would it be possible to

RE: PDF not crawled/indexed

2012-05-22 Thread Markus Jelsma
Please read the description. -Original message- From:Tolga to...@ozses.net Sent: Tue 22-May-2012 11:37 To: user@nutch.apache.org Subject: Re: PDF not crawled/indexed What is that value's unit? kilobytes? My PDF file is 4.7mb. On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:

RE: URL filtering and normalization

2012-05-22 Thread Markus Jelsma
-Original message- From:Bai Shen baishen.li...@gmail.com Sent: Tue 22-May-2012 19:40 To: user@nutch.apache.org Subject: URL filtering and normalization Somehow my crawler started fetching youtube. I'm not really sure why as I have db.ignore.external.links set to true. Weird!

RE: Apache Nutch release 1.5 RC2

2012-05-22 Thread Markus Jelsma
Great! My +1 for a new release based on the state of the codebase. -Original message- From:Julien Nioche lists.digitalpeb...@gmail.com Sent: Tue 22-May-2012 22:19 To: d...@nutch.apache.org Cc: user@nutch.apache.org Subject: Re: Apache Nutch release 1.5 RC2 Read

RE: Setting the Fetch time with a CustomFetchSchedule

2012-05-21 Thread Markus Jelsma
Yes, you can pass ParseMeta keys to the FetchSchedule as part of the CrawlDatum's meta data as i did with: https://issues.apache.org/jira/browse/NUTCH-1024 -Original message- From:Vikas Hazrati vi...@knoldus.com Sent: Mon 21-May-2012 13:44 To: user@nutch.apache.org Subject:

RE: error parsing some xml

2012-05-21 Thread Markus Jelsma
Hi Which version do you use? It should list the troubling URL. What's the stack trace? Cheers -Original message- From:Ing. Eyeris Rodriguez Rueda eru...@uci.cu Sent: Mon 21-May-2012 17:07 To: user@nutch.apache.org Subject: error parsing some xml Hi all. When I try to crawl

RE: error parsing some xml

2012-05-21 Thread Markus Jelsma
) - Mensaje original - De: Markus Jelsma markus.jel...@openindex.io Para: user@nutch.apache.org Enviados: Lunes, 21 de Mayo 2012 11:41:40 Asunto: RE: error parsing some xml Hi Which version do you use? It should list the troubling URL

RE: Exclude certain mime-types

2012-05-18 Thread Markus Jelsma
-Original message- From:Matthias Paul magethle.nu...@gmail.com Sent: Fri 18-May-2012 14:57 To: user@nutch.apache.org Subject: Exclude certain mime-types How can I exclude certain mime-types from crawling, for example Word-documents? If I have parse-tika in plugin.includes it

RE: [VOTE] Apache Nutch 1.5 release rc #1

2012-05-18 Thread Markus Jelsma
] Apache Nutch 1.5 release rc #1 When will Nutch 1.5 be released? Matthias On Wed, Apr 18, 2012 at 1:46 PM, Bharat Goyal bharat.go...@shiksha.com wrote: +1 On Monday 16 April 2012 12:34 PM, Markus Jelsma wrote:  +1  On Mon, 16 Apr 2012 05:43:22 +, Mattmann, Chris

Re: webpage download

2012-05-15 Thread Markus Jelsma
yes On Tuesday 15 May 2012 12:45:28 Taeseong Kim wrote: is whole web content download possible? include Flash, Image, CSS, JavaScript

Re: Can't retrieve Tika parser for mime-type text/javascript

2012-05-15 Thread Markus Jelsma
you for your help. -- View this message in context: http://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type -text-javascript-tp3983599p3983627.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: HTTP error 400

2012-05-15 Thread Markus Jelsma
/11/12 9:40 AM, Markus Jelsma wrote: Ah, that means don't use the crawl command and do a little shell scripting to execute the separate crawl cycle commands, see the nutch wiki for examples. And don't do solrdedup. Search the Solr wiki for deduplication. cheers On Fri, 11 May 2012 07

Re: Crawl-tool for iterative crawling?

2012-05-15 Thread Markus Jelsma
? Matthias On Thu, May 10, 2012 at 8:39 PM, Markus Jelsma markus.jel...@openindex.io wrote: By default each crawl is iterative. The crawl command is nothing more than a wrapper around the individual crawl cycle commands. The depth parameter is nothing

Re: Can't retrieve Tika parser for mime-type text/javascript

2012-05-14 Thread Markus Jelsma
://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type-text-javascript-tp3983599.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: Heap space problem when running nutch on cluster

2012-05-13 Thread Markus Jelsma
in fact it uses much less memory than it can. Any idea? -- View this message in context: http://lucene.472066.n3.nabble.com/Heap-space-problem-when-running-nutch-on-cluster-tp3983561.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: HTTP error 400

2012-05-11 Thread Markus Jelsma
, Markus Jelsma wrote: thanks This is a known issue: https://issues.apache.org/jira/browse/NUTCH-1100 I have not been able to find the bug nor do I know how to reproduce it from scratch. If you have a public site with which we can reproduce it please comment on the Jira ticket. Make sure you use

Re: Separate logger for nutch

2012-05-11 Thread Markus Jelsma
mode. Also I want some urls filtered by my urlfilter to be stored in an external flat file. How can I achieve this. -- *Thanks Regards* * * *Vijith V* -- *Thanks Regards* * * *Vijith V* -- *Thanks Regards* * * *Vijith V* -- Markus Jelsma

Re: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local..

2012-05-10 Thread Markus Jelsma
is mentioned. Tried to upgrade to hadoop-core-0.20.203.0.jar but then this is thrown: Exception in thread main java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration Can someone, please, shed some light on this? Thanks. Igor -- Markus Jelsma - CTO - Openindex

Re: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local..

2012-05-10 Thread Markus Jelsma
and there is plenty free space. All the best, Igor On Thu, May 10, 2012 at 10:35 AM, Markus Jelsma wrote: Plenty of disk space does not mean you have enough room in your hadoop.tmp.dir which is /tmp by default. On Thu, 10 May 2012 10:26:00 +0200, Igor Salma wrote: Hi, Adriana, Sebastian, We
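The point about hadoop.tmp.dir can be checked directly: what matters is the free space on the file system that holds that directory, not on the machine as a whole. A minimal sketch, assuming the directory is the system temp directory; `free_gib` and `has_room` are hypothetical helper names, and the threshold is an arbitrary example value.

```python
import shutil
import tempfile

def free_gib(path):
    """Free space, in GiB, on the file system that holds `path`."""
    return shutil.disk_usage(path).free / 2**30

def has_room(path, needed_gib):
    """True if at least `needed_gib` GiB are free where `path` lives."""
    return free_gib(path) >= needed_gib

# Assumption: the temp directory stands in for hadoop.tmp.dir here.
tmp_dir = tempfile.gettempdir()
print(f"{tmp_dir}: {free_gib(tmp_dir):.1f} GiB free")
print("enough for a 10 GiB job:", has_room(tmp_dir, 10))
```

If the temp directory sits on a small partition, pointing hadoop.tmp.dir at a larger volume is the usual way out.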

Re: Make Nutch to crawl internal urls only

2012-05-10 Thread Markus Jelsma
Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Fwd: Re: Make Nutch to crawl internal urls only

2012-05-10 Thread Markus Jelsma
-tp3974397p3976568.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: De-duplication of Nutch parsed data

2012-05-10 Thread Markus Jelsma
hi On Thursday 10 May 2012 15:19:09 Vikas Hazrati wrote: Hi Markus, Thanks for your response. My responses inline On Thu, May 10, 2012 at 12:34 AM, Markus Jelsma markus.jel...@openindex.iowrote: hi On Thu, 10 May 2012 00:26:40 +0530, Vikas Hazrati vi...@knoldus.com wrote

Re: HTTP error 400

2012-05-10 Thread Markus Jelsma
should upgrade accordingly in trunk. Thanks Lewis On Thu, May 10, 2012 at 1:56 PM, Michael Erickson erickson.mich...@gmail.com wrote: On May 10, 2012, at 1:42 AM, Markus Jelsma wrote: Hi, On Thu, 10 May 2012 09:10:04 +0300, Tolga to...@ozses.net wrote: Hi

Re: Crawl-tool for iterative crawling?

2012-05-10 Thread Markus Jelsma
to work? Thanks Matthias -- Markus Jelsma - CTO - Openindex

Re: HTTP error 400

2012-05-10 Thread Markus Jelsma
, it works similar and uses the same signature algorithm as Nutch has. Please consult the Solr wiki page on deduplication. Good luck On Thu, 10 May 2012 22:54:37 +0300, Tolga to...@ozses.net wrote: Hi Markus, On 05/10/2012 09:42 AM, Markus Jelsma wrote: Hi, On Thu, 10 May 2012 09:10:04 +0300

Re: HTTP ERROR 400

2012-05-09 Thread Markus Jelsma
] mailto:krist...@yahoo-inc.com [22] http://webmail.openindex.io/tel:%2B49%20%280%2989%20231%2097%20207 [23] http://webmail.openindex.io/tel:%2B49%20%280%29%20162%2028899%2002 [24] http://webmail.openindex.io/tel:%28408%29%20349%203300 [25] http://webmail.openindex.io/tel:%28408%29%20349%203301 -- Markus

Re: Focused Crawling with Nutch (IndexingFilter:filter)

2012-05-09 Thread Markus Jelsma
...@gmail.com [1] http://www8.org/w8-papers/5a-search-query/crawling/ [2] http://www.cse.iitb.ac.in/~soumen/focus/ [3] http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html -- Markus Jelsma - CTO - Openindex

Re: De-duplication of Nutch parsed data

2012-05-09 Thread Markus Jelsma
that CrawlDB would not allow duplicate links to get inside it? What link deduplication do you mean? CrawlDB records have a unique key on the URL. Regards | Vikas www.knoldus.com -- Markus Jelsma - CTO - Openindex

Re: Make Nutch to crawl internal urls only

2012-05-09 Thread Markus Jelsma
-- View this message in context: http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: Is it possible to control the segment size?

2012-05-08 Thread Markus Jelsma
many segments of ~N records are generated. Markus Jelsma-2 wrote On Mon, 7 May 2012 22:31:43 -0700 (PDT), nutch.buddy@ nutch.buddy@ wrote: In a previous discussion about handling of failures in nutch, it was mentioned that a broken segment cannot be fixed and it's urls should be re

Re: HTML documents with TXT extension

2012-05-08 Thread Markus Jelsma
Hi Nutch should parse an HTML file with a .txt extension just as a normal HTML file, at least, here it does. What does your parserchecker say? In any case you must strip potential left-over HTML in your Solr analyzer, if left like this it's a bad XSS vulnerability. Cheers On Tue, 8 May

Re: Lower case URLs - correct regex?

2012-05-08 Thread Markus Jelsma
/?page=2633pid=1043ELEsite=191;1;db_unfetched;Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;null Notice the URL starts with an L? (Thus not matching http/https in another config). Is this some problem with the regex above? Regards, Dean Pullen -- Markus Jelsma

Re: Lower case URLs - correct regex?

2012-05-08 Thread Markus Jelsma
a custom URL Normalizer to get this to work. But why? It doesn't seem alright. On Tue, 08 May 2012 14:46:14 +0200, Markus Jelsma markus.jel...@openindex.io wrote: I'm not sure this is going to work as a lowercase flag is used on the regular expressions. On Tue, 08 May 2012 13:37:47 +0100, Dean Pullen

Re: link without href

2012-05-07 Thread Markus Jelsma
html snippet as a link? <tr onclick="clickOnLink('http://www.example.com/link');">...</tr> Thanks, Mohammad -- Markus Jelsma - CTO - Openindex

Re: Is it possible to control the segment size?

2012-05-07 Thread Markus Jelsma
. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Avoid crawling nonsense calendar webpage

2012-05-05 Thread Markus Jelsma
Hi, This is a tough problem indeed. We partially mitigate this problem by using several regular expressions, linkrank scores with domain limiting generator for regular crawls and a second shallow crawl, only following links from the home page. A custom URLFilter as Ferdy explains is a good

Re: Generator OOM

2012-05-03 Thread Markus Jelsma
of reducers or, slightly increase the host or domain limit value. On Thu, 26 Apr 2012 21:02:58 +0200, Markus Jelsma markus.jel...@openindex.io wrote: Hi, We sometimes see the generator running OOM. This happens because we either have a too high topN value or too many segments to generate. In any case

Re: Indexing meta tags in Nutch 1.4

2012-05-03 Thread Markus Jelsma
of that command I don't see any keywords or description fields :( just the usual ones (site,title,content,etc). Am I missing something here? Also let me know if you need more details or my nutch-site.xml config file... Regards -- Markus Jelsma - CTO - Openindex http://www.linkedin.com

Re: Indexing meta tags in Nutch 1.4

2012-05-03 Thread Markus Jelsma
to an indexed document. From: Markus Jelsma markus.jel...@openindex.io To: ML mail mlnos...@yahoo.com Cc: Lewis John Mcgibbney lewis.mcgibb...@gmail.com; user@nutch.apache.org Sent: Thursday, May 3, 2012 9:32 AM Subject: Re: Indexing meta tags in Nutch 1.4

Re: fields foreach document

2012-05-02 Thread Markus Jelsma
FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Crawl sites with hashtags in url

2012-05-01 Thread Markus Jelsma
. With kind regard, Roberto Gardenier -- Markus Jelsma - CTO - Openindex

Re: Hadoop not doing anything

2012-05-01 Thread Markus Jelsma
Do you have running task trackers and data nodes? Which Nutch job did you start? Any custom code? Check the logs of the four Hadoop daemons, there may be something there. On Tue, 01 May 2012 16:26:31 +0100, Dean Pullen dean.pul...@semantico.com wrote: Hi all, If this is definitely a

Re: Changing from Indexing Filter

2012-04-27 Thread Markus Jelsma
of Nutch info on the web... http://wiki.apache.org/nutch/ http://wiki.apache.org/nutch/PluginCentral hth Lewis -- Lewis -- Markus Jelsma - CTO - Openindex

Generator OOM

2012-04-26 Thread Markus Jelsma
Hi, We sometimes see the generator running OOM. This happens because we either have too high a topN value or too many segments to generate. In any case, a very large number of records is generated with the same (lowest) score and ends up in a single reducer. We limit the generator by

Re: Good workflow for a regular re-indexing job

2012-04-24 Thread Markus Jelsma
of monickr: http://monickr.com [3] 01926 813736 | 07973 156616 _-- _ Links: -- [1] http://[domain]/solr/ [2] http://www.tellura.co.uk/ [3] http://monickr.com/ -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Help getting started

2012-04-22 Thread Markus Jelsma
On Sat, 21 Apr 2012 17:44:49 -0700 (PDT), benmccann benjamin.j.mcc...@gmail.com wrote: Hi, I have a few questions about getting started. Is there a good tutorial anywhere? Questions I have: * How do I restrict the crawling or saving of pages to only those matching certain regexes? With

Re: Help getting started

2012-04-22 Thread Markus Jelsma
the status in the Hadoop web gui. I'm doing a local crawl. Does this mean the Hadoop web gui is unavailable? Is there anyway to check status of a local crawl? What's the URL for the hadoop web gui? Thanks! -Ben On Sun, Apr 22, 2012 at 7:33 AM, Markus Jelsma-2 [via Lucene] ml-node

Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-04-16 Thread Markus Jelsma
Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Failing to copy activation jar to build/lib

2012-04-16 Thread Markus Jelsma
. On Sun, Apr 15, 2012 at 10:46 PM, Markus Jelsma markus.jel...@openindex.io wrote: This error? [javac] warning: [path] bad path element /home/markus/projects/apache/nutch/trunk/build/lib/activation.jar: no such file or directory On Sun, 15 Apr 2012 20:42:42 +0100, Lewis John Mcgibbney

Re: WebGraph Outlinks.reduce OOM

2012-04-16 Thread Markus Jelsma
an OutlinkDB can make a mess out of itself? Should we enforce uniqueness in the mean time? On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Recently a reducer got killed because of this. Increasing heap did work but the next job some days later also failed. I

Re: WebGraph Outlinks.reduce OOM

2012-04-16 Thread Markus Jelsma
Will provide a patch tomorrow. https://issues.apache.org/jira/browse/NUTCH-1335 On Mon, 16 Apr 2012 20:19:46 +0200, Markus Jelsma markus.jel...@openindex.io wrote: It seems a single URL has about half a million outlinks connected to it in the OutlinkDB! A pattern of 50 URL's repeats a 100.000

Re: How to do detailed postmortem analysis (and visualization) of Nutch crawl data

2012-04-15 Thread Markus Jelsma
The CrawlDB is not a suitable data source but the WebGraph's NodeDB is. You could probably write a new MR tool reading the NodeDB and outputting data in a format such a visualization tool understands. I think the only real problem would be the size of the data. On Sun, 15 Apr 2012 12:43:57

Re: How to do detailed postmortem analysis (and visualization) of Nutch crawl data

2012-04-15 Thread Markus Jelsma
, but I don't see a nodedb folder. Thanks in advance. Safdar On Sun, Apr 15, 2012 at 4:17 PM, Markus Jelsma wrote: The CrawlDB is not a suitable data source but the WebGraph's NodeDB is. You could probably write a new MR tool reading the NodeDB and outputting data in a format

Re: Failing to copy activation jar to build/lib

2012-04-15 Thread Markus Jelsma
This error? [javac] warning: [path] bad path element /home/markus/projects/apache/nutch/trunk/build/lib/activation.jar: no such file or directory On Sun, 15 Apr 2012 20:42:42 +0100, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Whilst doing some testing on Nutchgora within

Re: Limiting Nutch crawl

2012-04-11 Thread Markus Jelsma
this functionality? Best regards, --Anders Rask www.findwise.com -- Markus Jelsma - CTO - Openindex

Re: Limiting Nutch crawl

2012-04-11 Thread Markus Jelsma
in order to recrawl sites then the total number of URLs that are crawled for one site will not be limited by the generate.max.count parameter. Am I right? Best regards, --Anders Rask www.findwise.com Den 11 april 2012 17:14 skrev Markus Jelsma markus.jel...@openindex.io: Check these properties

Re: WebGraph Outlinks.reduce OOM

2012-04-11 Thread Markus Jelsma
somewhere? We have had this URL for a longer time and it happily passed all jobs many times before. On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Recently a reducer got killed because of this. Increasing heap did work but the next job some days later also

Re: Having trouble running nutch on large xlsx files

2012-04-11 Thread Markus Jelsma
Debugging this with a stand-alone Tika would certainly make things easier. There may be an issue in Tika or even in the parser implementation itself. On Wed, 11 Apr 2012 09:37:04 -0700 (PDT), nutch.bu...@gmail.com nutch.bu...@gmail.com wrote: I'm running nutch on large xlsx file (100-150mb),

Re: How to handle failures in nutch?

2012-04-10 Thread Markus Jelsma
hi, On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), nutch.bu...@gmail.com nutch.bu...@gmail.com wrote: Hi There are some scenarios of failure in nutch which I'm not sure how to handle. 1. I run nutch on a huge amount of urls and some kind of OOM exception is thrown, or one of those cannot

Re: How to handle failures in nutch?

2012-04-10 Thread Markus Jelsma
input file. Any other insights on these issues will be appreciated Markus Jelsma-2 wrote hi, On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), nutch.buddy@ nutch.buddy@ wrote: Hi There are some scenarios of failure in nutch which I'm not sure how to handle. 1. I run nutch on a huge amount of urls

WebGraph Outlinks.reduce OOM

2012-04-10 Thread Markus Jelsma
Hi, Recently a reducer got killed because of this. Increasing heap did work but the next job some days later also failed. I looked at the code and i cannot seem to find why it would take more than 400MB of RAM to process outlinks of a single record. We do limit outlinks so the HashSets pages

Re: Class in the code that handles parsing of html files and selection of URLs

2012-04-06 Thread Markus Jelsma
, Anastasia -- View this message in context: http://lucene.472066.n3.nabble.com/Class-in-the-code-that-handles-parsing- of-html-files-and-selection-of-URLs-tp3890250p3890250.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: request about snippets (with attachement)

2012-04-05 Thread Markus Jelsma
like a result. When I can jump this raw during my crawling? Is possible exclude this raw? thank you in adavande alessio -- *Lewis* -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: recrawl a single page explicit

2012-04-02 Thread Markus Jelsma
manuell set the recrawl interval or the crawl date, or any other explicit way to make nutch invalidate a page? We have got 70k+ pages in the index and a full recrawl would take to long. Thanks Jan -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Relative urls, interpage href anchors

2012-03-28 Thread Markus Jelsma
http://lucene.472066.n3.nabble.com/Relative-urls-interpage-href-anchors-tp3861215p3861215.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex http

Re: Out-of-the-box Nutch indexing url source to Solr

2012-03-28 Thread Markus Jelsma
be the command to do that? -- View this message in context: http://lucene.472066.n3.nabble.com/Out-of-the-box-Nutch-indexing-url-sourc e-to-Solr-tp3855918p3855918.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: Too much logging

2012-03-22 Thread Markus Jelsma
mapred.JobClient: Map output records=5* === = Regards Andy -- Markus Jelsma - CTO - Openindex

Re: Generator taking time

2012-03-22 Thread Markus Jelsma
, but nothing happened. -- View this message in context: http://lucene.472066.n3.nabble.com/Generator-taking-time-tp3848106p3848158 .html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: crawl and update one url already in crawldb

2012-03-22 Thread Markus Jelsma
a database that could potentially be locked at any point in time? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/crawl-and-update-one-url-already-in-cra wldb-tp3848358p3848358.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma

Re: crawl and update one url already in crawldb

2012-03-22 Thread Markus Jelsma
scripting and locking horror and it's an I/O consumer. -- View this message in context: http://lucene.472066.n3.nabble.com/crawl-and-update-one-url-already-in-cra wldb-tp3848358p3848423.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: crawl and update one url already in crawldb

2012-03-22 Thread Markus Jelsma
.472066.n3.nabble.com/crawl-and-update-one-url-already-in-cra wldb-tp3848358p3848665.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: canonical tag support

2012-03-22 Thread Markus Jelsma
This is not supported by Nutch and there's no issue ticket yet. Feel free to open one. On Thu, 22 Mar 2012 14:32:26 -0500, thomas.j.lut...@wellsfargo.com wrote: Ran across a posting for the Nutch roadmap mentioning support for the canonical tag.

Re: Nutch 1.4 with Hadoop - how does Nutch know where Hadoop is running

2012-03-20 Thread Markus Jelsma
to alter these settings to point to the non-default Hadoop? Regards, Dean. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Job failed while creating SolrIndex

2012-03-20 Thread Markus Jelsma
at Nabble.com. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Nutch 1.4 with Hadoop - how does Nutch know where Hadoop is running

2012-03-20 Thread Markus Jelsma
dean.pul...@semantico.com wrote: Thanks for your reply. I understand what you've said, but how does Nutch know where the Hadoop jobtracker is running? Regards, Dean. On 20/03/2012 11:03, Markus Jelsma wrote: This is not a Nutch thing. A Nutch job, any job, is submitted to the Hadoop Jobtracker

Re: urls won't get crawled

2012-03-20 Thread Markus Jelsma
/urls-won-t-get-crawled-tp3650610p384206 6.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: urls won't get crawled

2012-03-20 Thread Markus Jelsma
. -- Markus Jelsma - CTO - Openindex

Re: NutchHadoopTutorial Updated

2012-03-20 Thread Markus Jelsma
of the great technologies. We would really appreciate feedback as there will undoubtedly be some errors or data missing. Thanks Lewis [0] http://wiki.apache.org/nutch/NutchHadoopTutorial -- Markus Jelsma - CTO - Openindex

Re: Blacklisted Tasktracker / AlreadyBeingCreatedException

2012-03-16 Thread Markus Jelsma
. -- Markus Jelsma - CTO - Openindex

Re: Nutch as crawler for text analysis: setup ? version ?

2012-03-09 Thread Markus Jelsma
this in a larger setup. thanks ! pvremort -- Markus Jelsma - CTO - Openindex

Re: Incompatible format version 2 expected 1 or lower

2012-03-04 Thread Markus Jelsma
.nabble.com/Incompatible-format-version-2-expected-1-or-lower-tp3796473p3796473.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: different fetch interval for each depth urls

2012-03-03 Thread Markus Jelsma
records restricted by status: generate -Dgenerate.restrict.status=status Thanks. Alex. -Original Message- From: Markus Jelsma To: user Cc: nutch-user Sent: Thu, Mar 1, 2012 10:30 pm Subject: Re: different fetch interval for each depth urls Well, you could set a new default fetch
