RE: Error parsing html

2012-07-12 Thread Markus Jelsma
Strange. Check whether text/html is mapped to parse-tika or parse-html in parse-plugins.xml. You may also want to check Tika's plugin.xml; it must be mapped to * or a regex of content types. -Original message- > From:Sudip Datta > Sent: Thu 12-Jul-2012 20:36 > To: user@nutch.apache.org >
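For reference, a minimal parse-plugins.xml mapping is sketched below; which plugin should handle text/html depends on what you have enabled, so treat the plugin ids as assumptions:

  <parse-plugins>
    <mimeType name="text/html">
      <plugin id="parse-html" />
    </mimeType>
    <mimeType name="*">
      <plugin id="parse-tika" />
    </mimeType>
  </parse-plugins>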

RE: [ANNOUNCEMENT] Apache Nutch v1.5.1 Released

2012-07-10 Thread Markus Jelsma
Great! Thanks Lewis -Original message- > From:lewis john mcgibbney > Sent: Tue 10-Jul-2012 17:01 > To: user@nutch.apache.org; annou...@apache.org; d...@nutch.apache.org > Subject: [ANNOUNCEMENT] Apache Nutch v1.5.1 Released > > Good Afternoon Everyone, > > The Apache Nutch PMC are

RE: NutchField

2012-07-05 Thread Markus Jelsma
Hello, The index-more plugin might run after your custom plugin. You can configure the order in which plugins are run. Please consult the indexingfilter.order directive's description in conf/nutch-default.xml. Cheers, -Original message- > From:Jim Chandler > Sent: Thu 05-Jul-2012
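A sketch of that property for nutch-site.xml; the filters run in the listed order, and the custom filter class name here is hypothetical:

  <property>
    <name>indexingfilter.order</name>
    <value>org.apache.nutch.indexer.basic.BasicIndexingFilter
           com.example.CustomIndexingFilter
           org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  </property>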

RE: Adaptive scheduling, but different

2012-07-05 Thread Markus Jelsma
with any reasonable input... sorry. > Lewis > > On Thu, Jul 5, 2012 at 8:51 AM, Markus Jelsma > wrote: > > Any ideas? > > > > > > > > -Original message- > >> From:Markus Jelsma > >> Sent: Mon 02-Jul-2012 23:05 > >> To:

RE: Adaptive scheduling, but different

2012-07-05 Thread Markus Jelsma
Any ideas? -Original message- > From:Markus Jelsma > Sent: Mon 02-Jul-2012 23:05 > To: user@nutch.apache.org > Subject: Adaptive scheduling, but different > > Hi, > > We use an adaptive scheduler for our crawl, this works fine for most cases > but a specific type of page is crawled

RE: Filtering pages during crawling

2012-07-03 Thread Markus Jelsma
You can try the fetch filter: https://issues.apache.org/jira/browse/NUTCH-828 -Original message- > From:shekhar sharma > Sent: Tue 03-Jul-2012 06:42 > To: user@nutch.apache.org > Subject: Filtering pages during crawling > > Hello, > Is it possible to define a filtering condition in Nu

RE: ParseSegment taking a long time to finish

2012-07-02 Thread Markus Jelsma
Not so odd after all. I should have known it started the reducer at that time, silly me. The parse went perfectly fine in 42 minutes. The problem lies in your regex. Cheers -Original message- > From:sidbatra > Sent: Tue 03-Jul-2012 00:18 > To: user@nutch.apache.org > Subject: RE: Pa

RE: ParseSegment taking a long time to finish

2012-07-02 Thread Markus Jelsma
I've modified the parser to log long running records and ran your segment. There are quite a few records that run for more than a second on one machine with 2x2.4GHz CPU. It, unfortunately, doesn't show me the record it's waiting for. I output a record prior to parsing and after parsing with elapsed

RE: ParseSegment taking a long time to finish

2012-07-02 Thread Markus Jelsma
Regex order matters. Happy to hear the results. Considering your hardware you should parse this amount of pages in less than an hour. And you should decrease your mapper/reducer heap size significantly, it doesn't take 4G of RAM. 1G mapper and 500M reducer is safe enough. You can then allocate
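The heap advice, sketched for mapred-site.xml on Hadoop of this era (older releases only expose the combined mapred.child.java.opts; the split map/reduce keys are an assumption about your Hadoop version):

  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>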

Adaptive scheduling, but different

2012-07-02 Thread Markus Jelsma
Hi, We use an adaptive scheduler for our crawl, this works fine for most cases but a specific type of page is crawled more often than it should. These are usually news or article archives such as news/archive/12345. Most websites generate these pages dynamically. The problem is that whenever a

RE: ParseSegment taking a long time to finish

2012-07-02 Thread Markus Jelsma
You already have that rule configured? Is it one of the first simple expressions you have? How many records are you processing each time, is it roughly the same for all segments? And are you running on Hadoop or pseudo or local? -Original message- > From:sidbatra > Sent: Mon 02-Jul
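Rule order in conf/regex-urlfilter.txt matters because rules are evaluated top to bottom and the first match wins; a sketch with cheap anchored rules ahead of the expensive catch-all (patterns adapted from the defaults):

  # cheap, anchored rejections first
  -\.(gif|jpg|png|css|js|zip|exe)$
  -^(file|ftp|mailto):
  # expensive backreference rule last
  -.*(/[^/]+)/[^/]+\1/[^/]+\1/
  # accept everything else
  +.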

RE: ParseSegment taking a long time to finish

2012-07-02 Thread Markus Jelsma
Hi The log output doesn't tell you what the task is actually doing; it is only Hadoop output and initialization of the URL filters. There should be no real problem with the parser job and URL filter programming in Nutch: we crawl large parts of the internet but the parser never stalls, at least

RE: Solr 4.x and Nutch 1.5

2012-07-02 Thread Markus Jelsma
Check your Solr log. It's likely to trip over the absence of the versioning field. -Original message- > From:Daniel > Sent: Mon 02-Jul-2012 15:05 > To: user@nutch.apache.org > Subject: Solr 4.x and Nutch 1.5 > > Hey, > > i have Solr 4.0 (Nightly-Build) and Nutch 1.5 > And if i go t

RE: How to update the index quickly?

2012-07-02 Thread Markus Jelsma
We have done that too. The biggest problem is not having a reliable lastModified date and indeed inlinks and not knowing whether the document has changed. The inlink problem can be solved with the new Solr update semantics where partial updates are possible. -Original message- > From

RE: Language-focused crawling

2012-07-01 Thread Markus Jelsma
It's a use case for a fetch filter: https://issues.apache.org/jira/browse/NUTCH-828 -Original message- > From:Alexander Aristov > Sent: Sun 01-Jul-2012 20:43 > To: user@nutch.apache.org; safdar.kurei...@gmail.com > Subject: Re: Language-focused crawling > > Hi > > First of all you u

RE: NoSuchMethodError

2012-06-28 Thread Markus Jelsma
The API changed a bit with NUTCH-1230. What version are you using? -Original message- > From:Jim Chandler > Sent: Wed 27-Jun-2012 20:45 > To: user@nutch.apache.org > Subject: NoSuchMethodError > > Greetings, > > I am trying to create a plugin similar to Index-More which uses MimeUti

RE: Using Nutch with Boilerpipe

2012-06-27 Thread Markus Jelsma
> 5. Add the following lines to runtime/local/conf/nutch-site.xml: > <property> > <name>tika.boilerpipe</name> > <value>true</value> > </property> > > Thanks again! > > Cheers,

RE: Using Nutch with Boilerpipe

2012-06-27 Thread Markus Jelsma
getConf().get("tika.boilerpipe.extractor", "ArticleExtractor") > > Still, I am unsure where to specify these variables. Instead I added the > following lines to the java code (and commented the previous lines): > > boolean useBoilerpipe = true; > String b
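Pieced together from the snippets in this thread, the nutch-site.xml properties the NUTCH-961 patch appears to read (a sketch against the patch, not stock 1.5 configuration):

  <property>
    <name>tika.boilerpipe</name>
    <value>true</value>
  </property>
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
  </property>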

RE: Using Nutch with Boilerpipe

2012-06-27 Thread Markus Jelsma
Hi René, It seems NUTCH-961-1.5-1.patch doesn't apply cleanly to the final 1.5 release at all; TikaParser.java has changed a bit between the patch and the release of 1.5. Did you resolve the failed hunks? If so, are you sure Tika is being used for (x)html pages? Nutch by default uses the o

RE: [VOTE] Apache Nutch 1.5.1 Release Candidate

2012-06-26 Thread Markus Jelsma
h-1.x/ > in the package (src and bin). That's cosmetic but not blocking. Also: > permissions of bin/nutch should be 755 (exec bits should be set). > > Beside: Runs (tested local mode only). > > Sebastian > > On 06/26/2012 06:32 PM, Markus Jelsma wrote: > > Thi

RE: [VOTE] Apache Nutch 1.5.1 Release Candidate

2012-06-26 Thread Markus Jelsma
t;extract here" from the file menu > > Not a blocker IMHO > > On 26 June 2012 08:04, Markus Jelsma wrote: > > > Hi, > > > > It builds and runs smoothly but there's something that didn't catch my eye > > with 1.5 since i then used a GUI to unpack t

RE: [VOTE] Apache Nutch 1.5.1 Release Candidate

2012-06-26 Thread Markus Jelsma
Hi, It builds and runs smoothly but there's something that didn't catch my eye with 1.5 since I then used a GUI to unpack the src file: the src and bin packages decompress everything into the cwd, which means no apache-nutch-1.5 folder is created. This was the case with 1.4 and earlier. I believ

RE: Content type config on Parser plugin work improperly

2012-06-25 Thread Markus Jelsma
Hello, Did you add your parser to parse-plugins.xml? Cheers -Original message- > From:Ake Tangkananond > Sent: Mon 25-Jun-2012 16:56 > To: user@nutch.apache.org > Subject: Content type config on Parser plugin work improperly > > Hi experts, > > I am experimenting a feature to add

RE: HTTP REFERER is missing

2012-06-25 Thread Markus Jelsma
ma or > something? Or I have to hack crawling code too like you wrote about protocol > plugin? > > > Markus Jelsma-2 wrote > > > > What you can try is to add the referrer to outlinks when parsing records. > > This outlink can be added to CrawlDatum's MetaData

RE: Near Duplicate Detection in nutch /Solr

2012-06-23 Thread Markus Jelsma
Thanks for your comments. Please consider adding it to the issue so we can keep track of it. -Original message- > From:John McCormac > Sent: Sat 23-Jun-2012 16:36 > To: user@nutch.apache.org > Subject: Re: Near Duplicate Detection in nutch /Solr > > On 23/06/

RE: Near Duplicate Detection in nutch /Solr

2012-06-23 Thread Markus Jelsma
and a bit easier to deal with. -Original message- > From:John McCormac > Sent: Sat 23-Jun-2012 15:11 > To: user@nutch.apache.org > Subject: Re: Near Duplicate Detection in nutch /Solr > > On 23/06/2012 13:17, Markus Jelsma wrote: > > Hello, > > > >

RE: Near Duplicate Detection in nutch /Solr

2012-06-23 Thread Markus Jelsma
e: Near Duplicate Detection in nutch /Solr > > On 23/06/2012 12:14, Markus Jelsma wrote: > > Nutch now has a HostURLNormalizer capable of normalizing source hosts to a > > target host. This prevents duplication of complete websites and bad > > hyperlinks. > > >

RE: Near Duplicate Detection in nutch /Solr

2012-06-23 Thread Markus Jelsma
Nutch now has a HostURLNormalizer capable of normalizing source hosts to a target host. This prevents duplication of complete websites and bad hyperlinks. https://issues.apache.org/jira/browse/NUTCH-1319 -Original message- > From:John McCormac > Sent: Sat 23-Jun-2012 13:08 > To: user@

RE: Near Duplicate Detection in nutch /Solr

2012-06-23 Thread Markus Jelsma
You can use Nutch TextProfileSignature to create a less than exact signature for pages. It can delete some near duplicates. -Original message- > From:parnab kumar > Sent: Sat 23-Jun-2012 10:42 > To: user@nutch.apache.org > Subject: Near Duplicate Detection in nutch /Solr > > Hi, > >
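Enabling it is a single property in nutch-site.xml (db.signature.class defaults to MD5Signature):

  <property>
    <name>db.signature.class</name>
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
  </property>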

RE: Document boosting during indexing

2012-06-22 Thread Markus Jelsma
I am not sure but the boost field may be available. I think it was populated with the document score but you could increase it with a custom filter or some hacking around. -Original message- > From:parnab kumar > Sent: Fri 22-Jun-2012 17:36 > To: user@nutch.apache.org > Subject: Doc

RE: Odd results from nutch-crawl (1.4), and request for inlink command

2012-06-22 Thread Markus Jelsma
Hi, If Nutch finds a relative URL it will be converted to absolute. This means that any URL that does not explicitly start with http:// is going to have the host prefixed. Your domain.com pages produce bad URL's such as http/www. And since this is not http://, it'll end up as http://domain.com/

RE: robots.txt, disallow: with empty string

2012-06-22 Thread Markus Jelsma
I tried debugging your problem but it doesn't seem to exist. I fixed Nutch' RobotParser test [1] but I cannot confirm URL's being disallowed if there is NO value for Disallow: in the robots.txt file. https://issues.apache.org/jira/browse/NUTCH-1408 Test with: $ bin/nutch plugin lib-http org.a
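The robots.txt shape under discussion, for reference — an empty Disallow value, which allows everything:

  User-agent: *
  Disallow: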

RE: getting reports from nutch

2012-06-22 Thread Markus Jelsma
Hi, You can use the domainstats tool to generate counts for domain, host, suffix and tld. There's also the readdb -stats tool that shows your overall statistics. NUTCH-1325 provides the same as readdb -stats but for individual hosts. Cheers -Original message- > From:kaveh minooie
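Hedged invocations (the domainstats argument order varies by Nutch version; run bin/nutch with no arguments for the exact usage):

  $ bin/nutch readdb crawldb -stats
  $ bin/nutch domainstats crawldb/current outdir host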

RE: Parser choking on irregular url

2012-06-21 Thread Markus Jelsma
Hi Lewis, You got fooled by the ampersand switch on Unix terminals that sends a command to the background. The [] integers are Unix process ID's of the commands you have given. $ a&b&c is not one but three commands, sending a and b to the background. Your shell will output the [process ID] if

RE: Deleting file: urls from crawldb that give 404 status

2012-06-20 Thread Markus Jelsma
Sounds like: https://issues.apache.org/jira/browse/NUTCH-1245 Also, with a recent Nutch you can index with a -deleteGone flag. It behaves similar to SolrClean but only on records you just fetched. -Original message- > From:webdev1977 > Sent: Tue 19-Jun-2012 21:40 > To: user@nutch.apac
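A sketch of an indexing run with the flag; the segment path is a placeholder and exact parameters vary by version:

  $ bin/nutch solrindex http://localhost:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/20120620000000 -deleteGone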

RE: HTTP REFERER is missing

2012-06-20 Thread Markus Jelsma
> To: user@nutch.apache.org > Subject: RE: HTTP REFERER is missing > > > Markus Jelsma-2 wrote > > > > Nutch cannot do this by default and is tricky to make because there may > > not be one unique referrer per page. > > > I don't realy need unique ref

RE: robots.txt, disallow: with empty string

2012-06-20 Thread Markus Jelsma
If you're sure Nutch treats an empty string the same as / then please file an issue in Jira so we can track and fix it. Thanks -Original message- > From:Magnús Skúlason > Sent: Wed 20-Jun-2012 18:36 > To: nutch-u...@lucene.apache.org > Subject: robots.txt, disallow: with empty string

RE: Nutch and Solr Redundancy

2012-06-20 Thread Markus Jelsma
-Original message- > From:Lewis John Mcgibbney > Sent: Wed 20-Jun-2012 22:23 > To: user@nutch.apache.org > Subject: Re: Nutch and Solr Redundancy > > Hi Oakage, > > On Wed, Jun 20, 2012 at 9:08 PM, Oakage wrote: > > Okay I've just started researching about nutch and knows that nutch

RE: Nutch 1.5 - "Error: Java heap space" during MAP step of CrawlDb update

2012-06-20 Thread Markus Jelsma
The log you provide doesn't look like the actual mapper log. Can you check it out? The job has output for the main class but also separate logs for each map and reduce task. -Original message- > From:sidbatra > Sent: Wed 20-Jun-2012 20:29 > To: user@nutch.apache.org > Subject: Re: N

RE: very long fetch reduce task

2012-06-13 Thread Markus Jelsma
In a parsing fetcher iirc outlinks are processed in the mapper (at least when outlinks are followed). If a fetcher's reducer stalls you may run out of memory or disk space. -Original message- > From:kaveh minooie > Sent: Wed 13-Jun-2012 19:28 > To: user@nutch.apache.org > Subject: Re

RE: Generator: 0 records selected for fetching, exiting ...

2012-06-11 Thread Markus Jelsma
Hi This CrawlDatum's fetch time is tomorrow in EST: Fetch time: Tue Jun 12 02:59:27 EST 2012 -Original message- > From:Andy Xue > Sent: Mon 11-Jun-2012 11:00 > To: user@nutch.apache.org > Subject: Generator: 0 records selected for fetching, exiting ... > > Hi all: > > This is regardin
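When fetch times sit in the future, the generator's -adddays flag shifts the reference time forward so such records qualify; a hedged example:

  $ bin/nutch generate crawldb crawldb/segments -topN 10000 -adddays 1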

RE: Building Lucene index with Nutch 1.4

2012-06-07 Thread Markus Jelsma
Hello! Sounds very interesting. Anyway, Solr can run embedded in a Java application called EmbeddedSolrServer. You do need to make some changes to the SolrIndexer tools in Nutch. Cheers -Original message- > From:Emre Çelikten > Sent: Thu 07-Jun-2012 22:24 > To: user@nutch.apache.org
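A minimal sketch of running Solr embedded, assuming the Solr 3.x-era SolrJ API; the solr home path and core name are placeholders:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.core.CoreContainer;

  public class EmbeddedIndexer {
    public static void main(String[] args) throws Exception {
      // solr.solr.home must point at a directory with solr.xml and the core's conf/
      System.setProperty("solr.solr.home", "/path/to/solr/home");
      CoreContainer container = new CoreContainer.Initializer().initialize();
      SolrServer server = new EmbeddedSolrServer(container, "collection1");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "http://example.com/");
      doc.addField("title", "Example");
      server.add(doc);
      server.commit();
      container.shutdown();
    }
  }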

RE: [ANNOUNCE] Apache Nutch 1.5 Released

2012-06-07 Thread Markus Jelsma
Great work Lewis, Chris, committers and contributors! Thanks all! -Original message- > From:lewis john mcgibbney > Sent: Thu 07-Jun-2012 19:01 > To: annou...@apache.org; d...@nutch.apache.org; user@nutch.apache.org > Subject: [ANNOUNCE] Apache Nutch 1.5 Released > > (apologies for cr

RE: robots.txt UnknownHostException

2012-06-07 Thread Markus Jelsma
If Nutch runs on a different machine the DNS may not be resolving the host after all. To solve the issue you will have to find a way to resolve the host. Take a look in the Nutch logs. -Original message- > From:Chethan Prasad > Sent: Thu 07-Jun-2012 16:49 > To: Markus Jels

RE: robots.txt UnknownHostException

2012-06-07 Thread Markus Jelsma
t it find more links on > the root page and follow them? > > Thanks, > Chethan > > On Thu, Jun 7, 2012 at 7:49 PM, Markus Jelsma > wrote: > > > Hi, > > > > Nutch will fetch URL's without robots.txt, but if robots.txt throws an

RE: robots.txt UnknownHostException

2012-06-07 Thread Markus Jelsma
Hi, Nutch will fetch URL's without robots.txt, but if robots.txt throws an UnknownHostException, the URL will throw it as well and fail. Cheers -Original message- > From:chethan > Sent: Thu 07-Jun-2012 16:16 > To: user@nutch.apache.org > Subject: robots.txt UnknownHostException > >

RE: HTTP REFERER is missing

2012-06-06 Thread Markus Jelsma
Hi Nutch cannot do this by default and is tricky to make because there may not be one unique referrer per page. What you can try is to add the referrer to outlinks when parsing records. This outlink can be added to CrawlDatum's MetaData which you can then later use to set the referrer. To set t

RE: How to write complex rules on regex-urlfilter

2012-06-06 Thread Markus Jelsma
What's the problem with having the seed page? Can you not only inject the /news pages? Anyway, you can always filter it away later after the first fetch cycle. -Original message- > From:Shameema Umer > Sent: Wed 06-Jun-2012 13:02 > To: user@nutch.apache.org > Subject: How to write co
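One way to restrict the crawl to the /news pages in conf/regex-urlfilter.txt (the host is a placeholder; rules match top to bottom and the final rule rejects everything else):

  +^http://www\.example\.com/news
  -.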

RE: Linkdb empty

2012-06-06 Thread Markus Jelsma
s not used in the crawldb but in the parse job, which is input to the crawldb. > > > > On Wed, Jun 6, 2012 at 10:02 AM, Markus Jelsma > wrote: > > > > -Original message- > >> From:Matthias Paul > >> Sent: Wed 06-Jun-2012 09:47 > >> T

RE: Behaviour of "urlfilter-suffix" plug-in when dealing with a URL without filename extension

2012-06-06 Thread Markus Jelsma
-Original message- > From:Andy Xue > Sent: Wed 06-Jun-2012 11:11 > To: Markus Jelsma ; user@nutch.apache.org > Subject: Re: Behaviour of "urlfilter-suffix" plug-in when dealing > with a URL without filename extension > > Hi Markus: hi > > Thanks f

RE: threads disminution when fetching page

2012-06-06 Thread Markus Jelsma
-Original message- > From:pepe3059 > Sent: Wed 06-Jun-2012 02:58 > To: user@nutch.apache.org > Subject: RE: threads disminution when fetching page > > me again :) > > at the end of fetch process, is the regex-urlfilter considered? No. At the end of the fetch the mapper output is writti

RE: Behaviour of "urlfilter-suffix" plug-in when dealing with a URL without filename extension

2012-06-06 Thread Markus Jelsma
-Original message- > From:Andy Xue > Sent: Wed 06-Jun-2012 05:04 > To: user@nutch.apache.org > Subject: Behaviour of "urlfilter-suffix" plug-in when dealing with > a URL without filename extension > > Hi all: hi > > Does the "urlfilter-suffix" plug-in prune URL which does not have a

RE: Nutch topN selection

2012-06-06 Thread Markus Jelsma
-Original message- > From:chethan > Sent: Wed 06-Jun-2012 05:12 > To: user@nutch.apache.org > Subject: Nutch topN selection > > Hi, hi > > Does the topN threshold consider page score for the selection? If it's set > to say 10, does Nutch queue up the 10 top scoring URLs on a page? Ye

RE: Linkdb empty

2012-06-06 Thread Markus Jelsma
-Original message- > From:Matthias Paul > Sent: Wed 06-Jun-2012 09:47 > To: user@nutch.apache.org > Subject: Linkdb empty > > Hi all, hi > > I noticed that my linkdb is always empty although I use the generated > segments from the last crawl for the generation of the linkdb. Check th

RE: threads disminution when fetching page

2012-06-04 Thread Markus Jelsma
-Original message- > From:pepe3059 > Sent: Mon 04-Jun-2012 20:42 > To: user@nutch.apache.org > Subject: RE: threads disminution when fetching page > > thank you for your answer Markus Hi > > you mean, until the fetch process finishes, is information stored using hdfs > by nutch? mean

RE: threads disminution when fetching page

2012-06-04 Thread Markus Jelsma
This is normal and means the fetcher is finishing all its input URL's and writing stuff to disk. -Original message- > From:pepe3059 > Sent: Sat 02-Jun-2012 22:15 > To: user@nutch.apache.org > Subject: threads disminution when fetching page > > Hello, i hope you can help me > > > i a

RE: How to configure nutch to fetch only recent documents

2012-06-04 Thread Markus Jelsma
Hi, The generator can only do it the other way around via the addDays parameter. To make it work your way you can modify the generator to restrict to documents younger than 48 hours. Cheers -Original message- > From:Shameema Umer > Sent: Mon 04-Jun-2012 08:33 > To: user@nutch.apac

RE: No links to process, is the webgraph empty?

2012-05-30 Thread Markus Jelsma
a, there are no outlinks to > external sites. (If you check the tinymce site, it has links to > microsoft, facebook, etc) So I am thinking my problem is more or less > related to the issue described > here > > https://issues.apache.org/jira/browse/NUTCH-1346 No, that is a fi

RE: Setting the Fetch time with a CustomFetchSchedule

2012-05-29 Thread Markus Jelsma
> overriding the nutch-default.xml: > <property> > <name>db.fetch.schedule.class</name> > <value>com.custom.CustomEventFetchScheduler</value> > </property> > > How do I include my custom logic so that it gets picked as a part of the > crawl cycle. > > Regards | Vikas > > On Mon, May 21, 2012 at 6:14 PM, Markus Jelsma

RE: No links to process, is the webgraph empty?

2012-05-29 Thread Markus Jelsma
ignore.limit.domain to false and the link.ignore.internal.xxx can be > set to true? Or should I just set all of the link.ignore.xxx.xxx values > to false? > > On 5/29/2012 4:43 PM, Markus Jelsma wrote: > > Hi, > > > > That's a patch for the fetcher. The error you

RE: No links to process, is the webgraph empty?

2012-05-29 Thread Markus Jelsma
Hi, That's a patch for the fetcher. The error you are seeing is quite simple actually. Because you set those two link.ignore parameters to true, no links between the same domain and host are aggregated; only links from/to external hosts and domains. This is a good setting for wide web crawls. If
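The webgraph properties in question, sketched for nutch-site.xml when internal links should be kept (both default to true):

  <property>
    <name>link.ignore.internal.host</name>
    <value>false</value>
  </property>
  <property>
    <name>link.ignore.internal.domain</name>
    <value>false</value>
  </property>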

RE: Multiple nutch jobs on a Hadoop cluster simultaneosuly

2012-05-24 Thread Markus Jelsma
Hi, Yes, this is no problem. Cheers -Original message- > From:Dustine Rene Bernasor > Sent: Thu 24-May-2012 12:58 > To: user@nutch.apache.org > Subject: Multiple nutch jobs on a Hadoop cluster simultaneosuly > > Hello > > I was wondering, would it be possible to run multiple nutch jo

RE: Apparently far from last question :)

2012-05-23 Thread Markus Jelsma
You can inspect the CrawlDB with the readdb tool, check if it's there. -Original message- > From:Tolga > Sent: Wed 23-May-2012 14:21 > To: user@nutch.apache.org > Subject: Re: Apparently far from last question :) > > My colleague has just made me realize something. Is it possible tha
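For example, dumping the status of a single record:

  $ bin/nutch readdb crawldb -url http://www.example.com/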

RE: Apache Nutch release 1.5 RC2

2012-05-22 Thread Markus Jelsma
Great! My +1 for a new release based on the state of the codebase. -Original message- > From:Julien Nioche > Sent: Tue 22-May-2012 22:19 > To: d...@nutch.apache.org > Cc: user@nutch.apache.org > Subject: Re: Apache Nutch release 1.5 RC2 > > Read http://people.apache.org/~lewismc/nutc

RE: URL filtering and normalization

2012-05-22 Thread Markus Jelsma
-Original message- > From:Bai Shen > Sent: Tue 22-May-2012 19:40 > To: user@nutch.apache.org > Subject: URL filtering and normalization > > Somehow my crawler started fetching youtube. I'm not really sure why as I > have db.ignore.external.links set to true. Weird! > > I've since add

RE: PDF not crawled/indexed

2012-05-22 Thread Markus Jelsma
Please read the description. -Original message- > From:Tolga > Sent: Tue 22-May-2012 11:37 > To: user@nutch.apache.org > Subject: Re: PDF not crawled/indexed > > What is that value's unit? kilobytes? My PDF file is 4.7mb. > > On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote: > > Yes I

RE: error parsing some xml

2012-05-21 Thread Markus Jelsma
.MapTask.run(MapTask.java:307) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177) > > - Original message - > From: "Markus Jelsma" > To: user@nutch.apache.org > Sent: Monday, 21 May 201

RE: error parsing some xml

2012-05-21 Thread Markus Jelsma
Hi Which version do you use? It should list the troubling URL. What's the stack trace? Cheers -Original message- > From:Ing. Eyeris Rodriguez Rueda > Sent: Mon 21-May-2012 17:07 > To: user@nutch.apache.org > Subject: error parsing some xml > > Hi all. > When I try to crawl I have

RE: Setting the Fetch time with a CustomFetchSchedule

2012-05-21 Thread Markus Jelsma
Yes, you can pass ParseMeta keys to the FetchSchedule as part of the CrawlDatum's meta data as I did with: https://issues.apache.org/jira/browse/NUTCH-1024 -Original message- > From:Vikas Hazrati > Sent: Mon 21-May-2012 13:44 > To: user@nutch.apache.org > Subject: Setting the Fetch ti

RE: [VOTE] Apache Nutch 1.5 release rc #1

2012-05-18 Thread Markus Jelsma
he Nutch 1.5 release rc #1 > > When will Nutch 1.5 be released? > > Matthias > > On Wed, Apr 18, 2012 at 1:46 PM, Bharat Goyal > wrote: > > +1 > > > > > > On Monday 16 April 2012 12:34 PM, Markus Jelsma wrote: > >> > >>

RE: Exclude certain mime-types

2012-05-18 Thread Markus Jelsma
-Original message- > From:Matthias Paul > Sent: Fri 18-May-2012 14:57 > To: user@nutch.apache.org > Subject: Exclude certain mime-types > > How can I exclude certain mime-types from crawling, for example Word-documents? > If I have parse-tika in plugin.includes it will parse them. Do
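Since the fetcher decides by URL before any content type is known, the usual workaround is a suffix rule in conf/regex-urlfilter.txt (a sketch; extend the extension list as needed):

  -(?i)\.(doc|docx|xls|xlsx|ppt|pptx)$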

Re: Crawl-tool for iterative crawling?

2012-05-15 Thread Markus Jelsma
wling" there's the sentence "This also > > > permits ... incremental crawling", as if the crawl command described > > > before (3.1 Using the Crawl Command) couldn't do that. > > > > > > Could someone perhaps improve this part of the tuto

Re: HTTP error 400

2012-05-15 Thread Markus Jelsma
me? > > Regards, > > On 5/11/12 9:40 AM, Markus Jelsma wrote: > > Ah, that means don't use the crawl command and do a little shell > > scripting to execute the separate crawl cycle commands, see the nutch > > wiki for examples. And don't do solrdedup. Sea

Re: Can't retrieve Tika parser for mime-type text/javascript

2012-05-15 Thread Markus Jelsma
the meaning of "-53" > > If necessary ,I can provide the js files. > > Thank you for your help. > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type > -text-javascript-tp3983599p3983627.html Sent from the Nutch - User mailing > list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: webpage download

2012-05-15 Thread Markus Jelsma
yes On Tuesday 15 May 2012 12:45:28 Taeseong Kim wrote: > is whole web content download possible? > > include Flash, Image, CSS, JavaScript

Re: Can't retrieve Tika parser for mime-type text/javascript

2012-05-14 Thread Markus Jelsma

Re: java.lang.NullPointerException:org.apache.hadoop.io.Text.encode(Text.java:388)

2012-05-14 Thread Markus Jelsma
to debug & resolve ?? -- View this message in context: http://lucene.472066.n3.nabble.com/java-lang-NullPointerException-org-apache-hadoop-io-Text-encode-Text-java-388-tp3983600.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: Heap space problem when running nutch on cluster

2012-05-13 Thread Markus Jelsma
ommitted heap usage (bytes): 26456621056. So in fact it uses much less memory than it can. Any idea? -- View this message in context: http://lucene.472066.n3.nabble.com/Heap-space-problem-when-running-nutch-on-cluster-tp3983561.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: Separate logger for nutch

2012-05-11 Thread Markus Jelsma
e > > > existing log. > > > I am running nutch in deploy mode. > > > Also I want some urls filtered by my urlfilter to be stored in an > > external > > > flat file. How can I achieve this. > > > > > > -- > > > *Thanks & Re

Re: HTTP error 400

2012-05-10 Thread Markus Jelsma
How do I exactly "omit solrdedup and use Solr's internal deduplication" instead? I don't even know what any of that means :D I've just used bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 100 to get the error. I have to use all the steps? Regards, On

Re: HTTP error 400

2012-05-10 Thread Markus Jelsma
n instead, it works similarly and uses the same signature algorithm as Nutch. Please consult the Solr wiki page on deduplication. Good luck On Thu, 10 May 2012 22:54:37 +0300, Tolga wrote: Hi Markus, On 05/10/2012 09:42 AM, Markus Jelsma wrote: Hi, On Thu, 10 May 2012 09:10:04 +0300, Tol

Re: Crawl-tool for iterative crawling?

2012-05-10 Thread Markus Jelsma
incremental indexing but I can't find it just now sorry. Lewis On Thu, May 10, 2012 at 5:18 PM, Matthias Paul wrote: Hi all, can the crawl-command also be used for iterative crawls? In older Nutch-versions this was not possible but in 1.5 it seems to work? Thanks Matthias -- Markus J

Re: HTTP error 400

2012-05-10 Thread Markus Jelsma
> Nutch is able to index to Solr 3.6.0, however if not then maybe we > should upgrade accordingly in trunk. > > Thanks > > Lewis > > On Thu, May 10, 2012 at 1:56 PM, Michael Erickson > > wrote: > > On May 10, 2012, at 1:42 AM, Markus Jelsma wrote:

Re: De-duplication of Nutch parsed data

2012-05-10 Thread Markus Jelsma
hi On Thursday 10 May 2012 15:19:09 Vikas Hazrati wrote: > Hi Markus, > > Thanks for your response. My responses inline > > On Thu, May 10, 2012 at 12:34 AM, Markus Jelsma > > wrote: > > hi > > > > > > On Thu, 10 May 2012 00:26:40 +0530,

Fwd: Re: Make Nutch to crawl internal urls only

2012-05-10 Thread Markus Jelsma

Re: Make Nutch to crawl internal urls only

2012-05-10 Thread Markus Jelsma

Re: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local..

2012-05-10 Thread Markus Jelsma
free space. All the best, Igor On Thu, May 10, 2012 at 10:35 AM, Markus Jelsma wrote: Plenty of disk space does not mean you have enough room in your hadoop.tmp.dir which is /tmp by default. On Thu, 10 May 2012 10:26:00 +0200, Igor Salma wrote: Hi, Adriana, Sebastian, We are struggling wit

Re: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local..

2012-05-10 Thread Markus Jelsma
o hadoop-core-0.20.203.0.jar but then this is thrown: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration Can someone, please, shed some light on this? Thanks. Igor -- Markus Jelsma - CTO - Openindex

Re: HTTP error 400

2012-05-09 Thread Markus Jelsma
iling list. Powered by Jetty:// What am I doing wrong? Regards, -- Markus Jelsma - CTO - Openindex

Re: HTTP ERROR 400

2012-05-09 Thread Markus Jelsma

Re: Make Nutch to crawl internal urls only

2012-05-09 Thread Markus Jelsma
anks, James Ford -- View this message in context: http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: De-duplication of Nutch parsed data

2012-05-09 Thread Markus Jelsma
links to get inside it? What link deduplication do you mean? CrawlDB records have a unique key on the URL. Regards | Vikas www.knoldus.com -- Markus Jelsma - CTO - Openindex

Re: Focused Crawling with Nutch (IndexingFilter:filter)

2012-05-09 Thread Markus Jelsma
mail.com [1] http://www8.org/w8-papers/5a-search-query/crawling/ [2] http://www.cse.iitb.ac.in/~soumen/focus/ [3] http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html -- Markus Jelsma - CTO - Openindex

Re: HTTP ERROR 400

2012-05-09 Thread Markus Jelsma

Re: Lower case URLs - correct regex?

2012-05-08 Thread Markus Jelsma
custom URL Normalizer to get this to work. But why? It doesn't seem alright. On Tue, 08 May 2012 14:46:14 +0200, Markus Jelsma wrote: I'm not sure this is going to work as a lowercase flag is used on the regular expressions. On Tue, 08 May 2012 13:37:47 +0100, Dean Pullen wrote: Hi

Re: Lower case URLs - correct regex?

2012-05-08 Thread Markus Jelsma
2633&pid=1043ELE&site=191";1;"db_unfetched";Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;"null" Notice the URL starts with an L? (Thus not matching http/https in another config). Is this some problem with the regex above? Regards, Dean Pullen -- Markus Jelsma - CTO - Openindex

Re: HTML documents with TXT extension

2012-05-08 Thread Markus Jelsma
Hi Nutch should parse an HTML file with a .txt extension just as a normal HTML file, at least, here it does. What does your parserchecker say? In any case you must strip potential left-over HTML in your Solr analyzer, if left like this it's a bad XSS vulnerability. Cheers On Tue, 8 May 2012
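The quick check suggested here (the URL is a placeholder):

  $ bin/nutch parsechecker -dumpText http://www.example.com/page.txt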

Re: Is it possible to control the segment size?

2012-05-07 Thread Markus Jelsma
l how many segments of ~N records are generated. Markus Jelsma-2 wrote On Mon, 7 May 2012 22:31:43 -0700 (PDT), "nutch.buddy@" wrote: In a previous discussion about handling of failures in nutch, it was mentioned that a broken segment cannot be fixed and it's urls sh

Re: Is it possible to control the segment size?

2012-05-07 Thread Markus Jelsma
etried by Hadoop. Any existing way in nutch to do this? Sure, the -topN parameter of the generator tool. -- View this message in context: http://lucene.472066.n3.nabble.com/Is-it-possible-to-control-the-segment-size-tp3970452.html Sent from the Nutch - User mailing list archive at Nabble.com.
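For example, capping each generated segment at roughly 50,000 records:

  $ bin/nutch generate crawldb crawldb/segments -topN 50000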

Re: link without href

2012-05-07 Thread Markus Jelsma
the url in the following html snippet > as a link? > > http://www.example.com/link";);">... > > > Thanks, > Mohammad -- Markus Jelsma - CTO - Openindex
