RE: after 404 -> status switches directly to db_gone (db.fetch.retry.max does not work)

2015-10-19 Thread Markus Jelsma
To: user@nutch.apache.org > Subject: Re: after 404 -> status switches directly to db_gone > (db.fetch.retry.max does not work) > > On Friday, 16 October 2015, 12:57:54, Markus Jelsma wrote: > > Hello - no, that is not how Nutch works. Why would you want it to behave as

RE: after 404 -> status switches directly to db_gone (db.fetch.retry.max does not work)

2015-10-16 Thread Markus Jelsma
Hello - no, that is not how Nutch works. Why would you want it to behave as such anyway? Markus -Original message- > From:Axel Schöner > Sent: Friday 16th October 2015 14:14 > To: user@nutch.apache.org > Subject: after 404 -> status switches directly to db_gone > (db.fetch.retry.max do

RE: JRE/JDK version with Nutch 1.10

2015-10-15 Thread Markus Jelsma
Hello, Nutch runs fine on Java 8. We use OpenJDK 8 on all our nodes. Markus -Original message- > From:mar...@automationdirect.com > Sent: Thursday 15th October 2015 14:47 > To: user@nutch.apache.org > Subject: JRE/JDK version with Nutch 1.10 > > Hi, > > I just wanted to confirm the

RE: Webcast : Apache Nutch on EMR

2015-09-23 Thread Markus Jelsma
Very cool! This is probably going to be useful. -Original message- From: Julien Nioche Sent: Wednesday 23rd September 2015 16:35 To: user@nutch.apache.org; d...@nutch.apache.org Subject: Webcast : Apache Nutch on EMR Hi again, I have uploaded a webcast explaining how to run Nutch on

RE: [ANNOUNCE] New Nutch committer and PMC - Sujen Shah

2015-09-15 Thread Markus Jelsma
Welcome!! -Original message- From: Sujen Shah Sent: Wednesday 16th September 2015 0:58 To: d...@nutch.apache.org Cc: user@nutch.apache.org Subject: Re: [ANNOUNCE] New Nutch committer and PMC - Sujen Shah Hi Everyone, I would like to thank the members of the Apache Nutch PMC for bringing

RE: Compatible Hadoop version with Nutch 1.10

2015-09-14 Thread Markus Jelsma
Hello, we have been running Nutch 1.10 on Hadoop 2.7 for quite some time now. And since Hadoop has kept its binary compatibility, even older Nutch releases should work just as well. Markus -Original message- > From:Imtiaz Shakil Siddique > Sent: Monday 14th September 2015 18:08 > To: user@nutch.apache

RE: Document scores(boost)

2015-09-10 Thread Markus Jelsma
> Subject: RE: Document scores(boost) > > Hello Markus Jelsma, > > Thank you for the advice. But this score calculation is done after the data > is indexed to solr. So when the scores are updated inside the crawldb Solr > won't get it. > > I think a workaround for

RE: Document scores(boost)

2015-09-10 Thread Markus Jelsma
This program is very intensive. Use it only if you really need it. Markus -Original message- > From:Imtiaz Shakil Siddique > Sent: Thursday 10th September 2015 16:04 > To: user@nutch.apache.org > Subject: Re: Document scores(boost) > > Hello Markus Jelsma,

RE: Nutch 1.10 not following links

2015-09-10 Thread Markus Jelsma
Hi - this is usually a URL filter problem. Please check out the configured URL filters. Markus -Original message- > From:spam > Sent: Tuesday 8th September 2015 17:46 > To: user@nutch.apache.org > Subject: Nutch 1.10 not following links > > Sincere greetings to all of the apache users!

RE: Document scores(boost)

2015-09-10 Thread Markus Jelsma
Hello - OPIC is useless in incremental crawls. You can either disable scoring altogether, or use webgraph > linkrank > scoreupdater. Markus -Original message- > From:Imtiaz Shakil Siddique > Sent: Wednesday 9th September 2015 23:09 > To: user@nutch.apache.org > Subject: Document scores(

RE: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-10 Thread Markus Jelsma
Welcome! -Original message- > From:Sebastian Nagel > Sent: Thursday 10th September 2015 0:01 > To: d...@nutch.apache.org > Cc: user@nutch.apache.org > Subject: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra > > Dear all, > > on behalf of the Nutch PMC it is my pleasure to an

Nutch tests from Maven

2015-07-28 Thread Markus Jelsma
Hello - Nutch does not ship unit tests anymore as Maven artifacts, hence we cannot use CrawlDBTestUtil in external projects. Should we ship them? Or just copy the utils? What do you think? Markus

RE: Duplicate pages with and without www. prefix being indexed

2015-07-08 Thread Markus Jelsma
content duplicated? Or is there some way > to write a www. => no-www rule to cover any domain Nutch happens to > encounter in the future? > > Arthur. > > On 2015-07-07 12:59, Markus Jelsma wrote: > > Hello, i added an example to the issue, Hope it helps. > >

RE: Duplicate pages with and without www. prefix being indexed

2015-07-07 Thread Markus Jelsma
st-urlnormalizer.txt? Or can I add a generic rule to cover any host? > > i.e. > www.example1.org .example1.org > www.example2.org .example2.org > www.example3.org .example3.org > ... > > or can I do something along the lines of:- > www.* .($1) > > Arthur. >

RE: Duplicate pages with and without www. prefix being indexed

2015-07-07 Thread Markus Jelsma
You can use the host normalizer for this. https://issues.apache.org/jira/browse/NUTCH-1319 -Original message- > From:Arthur Yarwood > Sent: Tuesday 7th July 2015 0:02 > To: user@nutch.apache.org > Subject: Duplicate pages with and without www. prefix being indexed > > I have a Nutch 1.1
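The www-prefix normalization discussed in this thread boils down to a simple host rewrite. A standalone sketch of the idea in Python (this is not the Nutch plugin API, just the transformation a generic rule would perform):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_www(url):
    # Normalize "www.example.org" and "example.org" to one canonical host,
    # so both variants collapse to a single crawldb/index entry.
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

print(strip_www("http://www.example1.org/a"))  # -> http://example1.org/a
print(strip_www("http://example2.org/b"))      # no www. prefix, unchanged
```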

RE: CXF dependency on 1.10

2015-07-01 Thread Markus Jelsma
maven2/org/apache/cxf/cxf/3.0.4/ > > Not sure why it's giving you an error... > > Cheers, > Chris > > > From: Markus Jelsma [markus.jel...@openindex.io] > Sent: Tuesday, June 23, 2015 7:12 AM > To: user@nutch.apache.org &

CXF dependency on 1.10

2015-06-23 Thread Markus Jelsma
Hi - one of our Maven projects won't build with Nutch 1.10 as a dependency. For some reason it does actually download the CXF dependencies, which are then correctly stored in my local repo, but then complains about not finding them. [ERROR] Failed to execute goal on project openindex-nutch: Could

RE: [MASSMAIL]Re: about boost field extremely high

2015-05-20 Thread Markus Jelsma
> Is there any description about how function scoring-link? i was reading the > source code but don't understand at all. > > Markus are you suggesting me use scoring-link plugin, is this Nutch' LinkRank  > or not? > > I really appreciated your help. > > >

RE: [MASSMAIL]Re: about boost field extremely high

2015-05-20 Thread Markus Jelsma
Yes indeed. But it also makes sense to rely on Lucene's scoring algorithms and custom boosting functions. The problem with generic document boosting is that it can negatively influence your result sets, causing non-relevant but highly scored documents to end up on top. Another alternative is to use Nu

RE: nutch-1-9-not-crawling-url-with-querystring-params

2015-04-09 Thread Markus Jelsma
Check your conf/regex-urlfilter.txt for "# skip URLs containing certain characters as probable queries, etc." and the rule -[?*!@=] -Original message- > From:Rohan Shah > Sent: Thursday 9th April 2015 16:10 > To: user@nutch.apache.org > Subject: nutch-1-9-not-crawling-url-with-querystring-params >
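The -[?*!@=] entry is an ordinary regex rule; a standalone Python sketch of its effect (the URLs are made up, and this mimics only that single rule, not Nutch's full filter chain):

```python
import re

# The stock rule from conf/regex-urlfilter.txt: a leading '-' means
# "reject any URL matching this pattern". The pattern itself matches any
# URL containing ?, *, !, @ or =, i.e. probable query-string URLs.
SKIP_PATTERN = re.compile(r"[?*!@=]")

def accepts(url):
    # Reject on match; otherwise let later rules decide (here: accept).
    return SKIP_PATTERN.search(url) is None

print(accepts("http://example.org/page.html"))  # plain URL passes
print(accepts("http://example.org/page?id=1"))  # query string is rejected
```

Commenting that rule out (or narrowing it) in regex-urlfilter.txt is how query-string URLs get crawled.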

RE: Ignore navigation during index

2015-03-26 Thread Markus Jelsma
Hello - check out NUTCH-961. It adds support for Boilerpipe to Nutch's Tika parser. It's crude but works reasonably well. https://issues.apache.org/jira/browse/NUTCH-961 Markus -Original message- > From:Richardson, Jacquelyn F. > Sent: Thursday 26th March 2015 16:20 > To: user@nutch.apache

RE: [ANNOUNCE] New Nutch committer and PMC - Mo Omer

2015-03-22 Thread Markus Jelsma
Welcome Mohammad! -Original message- From: Mohammed Omer Sent: Sunday 22nd March 2015 18:55 To: user@nutch.apache.org Cc: d...@nutch.apache.org Subject: Re: [ANNOUNCE] New Nutch committer and PMC - Mo Omer Hello all, First, and most importantly, I'd like to send out a thank you to Chris,

RE: Handling servers with wrong Last Modified HTTP header

2015-03-12 Thread Markus Jelsma
Hello Jorge, This is an interesting but very complicated issue. First of all, do not rely on HTTP headers; they are incorrect at any scale larger than very small. This is true for Last-Modified, due to dynamic CMSes, but also for many other headers. You can even expect website descriptions in headers s

RE: Nutch documents have huge scores in Solr

2015-03-12 Thread Markus Jelsma
5 9:40 > To: user > Subject: Re: Nutch documents have huge scores in Solr > > Hi Markus, > > On 10 March 2015 at 13:11, Markus Jelsma wrote: > > > Hello - Adaptive OPIC [1] is supposed to solve the drawbacks OPIC has with > > incremental crawling, scores will c

RE: Nutch documents have huge scores in Solr

2015-03-11 Thread Markus Jelsma
o: user > > Subject: Re: Nutch documents have huge scores in Solr > > > > Hi Markus, > > > > On 10 March 2015 at 13:11, Markus Jelsma wrote: > > > > > Hello - Adaptive OPIC [1] is supposed to solve the drawbacks OPIC has with > > > incremen

RE: Nutch documents have huge scores in Solr

2015-03-10 Thread Markus Jelsma
gal van Hemert | alterNET internet BV > Sent: Tuesday 10th March 2015 13:01 > To: user > Subject: Re: Nutch documents have huge scores in Solr > > Hi Markus, > > On 10 March 2015 at 10:45, Markus Jelsma wrote: > > > Hello Jigal - reading OPIC gives it away. You can

RE: Nutch documents have huge scores in Solr

2015-03-10 Thread Markus Jelsma
Hello Jigal - reading OPIC gives it away. You can check the Nutch records; they must have a very high score, which is added to the NutchDocument as the boost field. If, in Solr, you actually use it, this is what you get. Do not use OPIC, unless you have a reason to. Markus -Original message

RE: Can anyone fetch this page?

2015-02-27 Thread Markus Jelsma
Seems fine to me http://oldservice.openindex.io/extract.php?url=http%3A%2F%2Fwww.nature.com%2Fnature%2Fjournal%2Fv518%2Fn7540%2Ffull%2Fnature14236.html -Original message- > From:Lewis John Mcgibbney > Sent: Friday 27th February 2015 18:56 > To: user@nutch.apache.org > Subject: Can anyo

RE: [MASSMAIL]RE: [MASSMAIL]URL filter plugins for nutch

2015-02-19 Thread Markus Jelsma
are of any content yet, as you explained in your email. > > Sorry if my out-of-time email caused some confusion :) > > Regards, > > - Original Message - > From: "Markus Jelsma" > To: user@nutch.apache.org > Sent: Wednesday, February 18, 2015 6:31:51 P

RE: [ANNOUNCE] New Nutch committer and PMC - Jorge Luis Betancourt Gonzalez

2015-02-19 Thread Markus Jelsma
Cheers!! -Original message- > From:Sebastian Nagel > Sent: Thursday 19th February 2015 18:22 > To: user@nutch.apache.org > Cc: d...@nutch.apache.org > Subject: [ANNOUNCE] New Nutch committer and PMC - Jorge Luis Betancourt > Gonzalez > > Dear all, > > on behalf of the Nutch PMC it i

RE: [MASSMAIL]URL filter plugins for nutch

2015-02-18 Thread Markus Jelsma
Hi Jorge - perhaps i am missing something, but the linkdb cannot hold content derived information such as similarity hashes, nor does it cluster similar URL's as you would want when detecting spider traps. What do you think? Markus -Original message- > From:Jorge Luis Betancourt Gon

RE: URL filter plugins for nutch

2015-02-18 Thread Markus Jelsma
it a plugin (which implements the URLFilter interface). > Hence filter all those URLs whose content is nearly the same as one which has > already been fetched. Would this be possible or am I heading in the wrong > direction? > > Thanks for your patience Markus. > > > Regards,

RE: URL filter plugins for nutch

2015-02-18 Thread Markus Jelsma
nesday 18th February 2015 21:58 > To: user > Subject: Re: URL filter plugins for nutch > > Hi Markus, > > I am looking for the one's with similar in content. > > Regards, > Madan Patil > > On Wed, Feb 18, 2015 at 12:53 PM, Markus Jelsma > wrote: > >

RE: URL filter plugins for nutch

2015-02-18 Thread Markus Jelsma
By near-duplicate you mean similar URL's, or URL's with similar content? -Original message- > From:Madan Patil > Sent: Wednesday 18th February 2015 21:10 > To: user > Subject: URL filter plugins for nutch > > Hi, > > I am working on assignment where I am supposed to use nutch to craw

RE: Crawl Ajax based sites

2015-02-10 Thread Markus Jelsma
Sure: https://issues.apache.org/jira/browse/NUTCH-1323 -Original message- > From:Tizy Ninan > Sent: Tuesday 10th February 2015 9:42 > To: user@nutch.apache.org > Subject: Crawl Ajax based sites > > Hi, > > Does Nutch v1.9 support crawling Ajax based websites? > > Thanks and Regards

RE: how to crawl image first on every round of nutch?

2015-02-08 Thread Markus Jelsma
Implement a ScoringFilter, specifically the generatorSortValue() method, and emit a high float for image MIME types. -Original message- > From:Eyeris RodrIguez Rueda > Sent: Friday 6th February 2015 19:54 > To: user@nutch.apache.org > Subject: how to crawl image first on every round of nu
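As a sketch of what such a scoring filter does: records whose MIME type is an image get a large sort value so the generator emits them first. This is illustrative Python, not the Java ScoringFilter API, and the record fields and boost value are hypothetical:

```python
# Stand-in for a ScoringFilter's generator sort value: boost image
# MIME types so they sort to the front of the fetch list.
IMAGE_BOOST = 100.0  # hypothetical; any value well above normal scores

def generator_sort_value(record):
    score = record["score"]
    if record.get("mime", "").startswith("image/"):
        score += IMAGE_BOOST
    return score

records = [
    {"url": "http://example.org/a.html", "mime": "text/html", "score": 1.2},
    {"url": "http://example.org/b.jpg", "mime": "image/jpeg", "score": 0.4},
]
# The generator sorts candidates by sort value, highest first.
records.sort(key=generator_sort_value, reverse=True)
print([r["url"] for r in records])  # image URL first
```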

RE: How to implement an own crawler for specific tasks with nutch?

2015-02-01 Thread Markus Jelsma
Regarding URL filters, use the domain URL filter to restrict the crawl to a list of hosts. -Original message- > From:Mattmann, Chris A (3980) > Sent: Sunday 1st February 2015 17:47 > To: user@nutch.apache.org > Subject: Re: How to implement an own crawler for specific tasks with nutc

RE: problems with download links of nutch versions

2015-01-14 Thread Markus Jelsma
Hi - you can always get the sources from SVN tags or fetch releases from the Apache archive: http://archive.apache.org/dist/ -Original message- > From:Eyeris RodrIguez Rueda > Sent: Wednesday 14th January 2015 15:14 > To: user@nutch.apache.org > Subject: problems with download links

RE: Problem with time out on QueueFeeder

2015-01-09 Thread Markus Jelsma
Do you have enough memory? 50 threads and PDFs and an older Tika version will get you in trouble. That PDFBox version eats memory! Try upgrading to the latest PDFBox; you can drop jars in manually and reference them in Tika's plugin.xml. M -Original message- > From:Paul Rogers > S

RE: Problems with DomainStatistics

2015-01-07 Thread Markus Jelsma
Hi - it is a strange piece indeed. You cannot just tell it where the crawldb is, you need to tell it where the directory is, so specifying current is ok, but not part-* M -Original message- > From:Lewis John Mcgibbney > Sent: Wednesday 7th January 2015 19:48 > To: user@nutch.apache.

RE: Nutch 1.9 error

2015-01-05 Thread Markus Jelsma
pParser0.9.jar > file. There is no information on the page that tells you where to put the > file and how to tell nutch to use it. > > Have you used this utility or do you know the answers to any of the questions > above? > > Jackie > > -Original Message-

RE: Depth option

2015-01-05 Thread Markus Jelsma
I would recommend to use the domain-urlfilter, it is the most straightforward method of controlling the list of hosts in the crawldb. M -Original message- > From:Shadi Saleh > Sent: Sunday 4th January 2015 16:23 > To: user > Subject: Depth option > > Hello, > > I want to check thi

RE: Nutch stopped after 5 segments

2014-12-28 Thread Markus Jelsma
The segment isn't parsed and didn't write its hyperlinks back to the DB. Parse the segment and then updatedb it. -Original message- > From:Chaushu, Shani > Sent: Sunday 28th December 2014 13:35 > To: user@nutch.apache.org > Subject: Nutch stopped after 5 segments > > Hi all, > I ran

RE: Nutch 1.9 error

2014-12-19 Thread Markus Jelsma
le with Nutch 1.x? > > > > -Original Message- > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > > Sent: Thursday, December 18, 2014 3:09 PM > > To: user@nutch.apache.org > > Subject: RE: Nutch 1.9 error > > > > Hi - the sitem

RE: Nutch 1.9 error

2014-12-19 Thread Markus Jelsma
No, unfortunately not. -Original message- > From:Richardson, Jacquelyn F. > Sent: Friday 19th December 2014 5:16 > To: user@nutch.apache.org > Subject: RE: Nutch 1.9 error > > Is it possible to crawl sitemap.xml file with Nutch 1.x? > > -Original Mes

RE: Nutch 1.9 error

2014-12-18 Thread Markus Jelsma
Hi - the sitemap command is not part of Nutch 1.x, nor does it have a HostDB. I suspect you are using Nutch 2.x commands. -Original message- > From:Richardson, Jacquelyn F. > Sent: Thursday 18th December 2014 20:30 > To: user@nutch.apache.org > Subject: Nutch 1.9 error > > I am using

RE: questions about nutch 1.9

2014-12-16 Thread Markus Jelsma
using https and it > is a limitation for use nutch 1.9 in my university. > > > > > - Original Message - > From: "Markus Jelsma" > To: user@nutch.apache.org > Sent: Tuesday, December 16, 2014 7:46:54 AM > Subject: RE: questions about nutch 1.9 > &g

RE: questions about nutch 1.9

2014-12-16 Thread Markus Jelsma
Hi - can you try the protocol-http plugin instead? It has some support for TLS. -Original message- > From:Eyeris RodrIguez Rueda > Sent: Thursday 11th December 2014 22:18 > To: user@nutch.apache.org > Subject: Re: questions about nutch 1.9 > > Please any help? > > > Hello. > I want to

RE: Nutch 1.9 Fetchers Hung

2014-11-28 Thread Markus Jelsma
I think you're looking at https://issues.apache.org/jira/browse/NUTCH-1182. Logging of hung threads was added in 1.9, so it should happen in 1.8 as well, just without being logged. Markus -Original message- > From:Issam Maamria > Sent: Friday 28th November 2014 15:37 > To: user@nutch.ap

RE: How to parse specific html tag in nutch+solr while crawling

2014-11-27 Thread Markus Jelsma
t; App for Teams > <https://appexchange.salesforce.com/listingDetail?listingId=a0N300B5UPKEA3> > > > > > On Thu, Nov 27, 2014 at 10:33 PM, Markus Jelsma > wrote: > > > You may want to check the headings plugin, it reads content from those > >

RE: How to parse specific html tag in nutch+solr while crawling

2014-11-27 Thread Markus Jelsma
You may want to check the headings plugin, it reads content from those elements and writes them to some field. Very basic. -Original message- > From:Vishal Sharma > Sent: Thursday 27th November 2014 17:59 > To: user > Subject: How to parse specific html tag in nutch+solr while crawl

RE: fetcher.throttle.bandwidth

2014-11-26 Thread Markus Jelsma
Oh i forgot, if you run Nutch locally you can also use trickle to shape traffic, much easier on a per-command basis. It may be difficult to get it to work in distributed mode, because Hadoop itself spawns the child processes. -Original message- > From:Markus Jelsma > Sent: Wednesday 2

RE: fetcher.throttle.bandwidth

2014-11-26 Thread Markus Jelsma
Maybe your network operator can limit the traffic in the switch or router your nodes are attached to? We have done this too at some point and it works fine. Also, if you are on linux, iptables and tc could help you to limit bandwidth on a per-user basis. -Original message- > From:Dan

RE: Processing Pages in Pairs

2014-11-26 Thread Markus Jelsma
Using Solr BlockJoins would probably be the easiest these days unless you really need to process them in Nutch. If you still want to process them simultaneously you can write a custom Solr UpdateRequestProcessor plugin and build the logic there. -Original message- > From:Lewis John Mcg

RE: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Markus Jelsma
(ToolRunner.java:65) > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186) > > Advice me please.. > > > On Mon, Nov 3, 2014 at 5:47 PM, Muhamad Muchlis wrote: > > > Like this ? > > > > > > > > > > > > > > > > > > >

RE: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Markus Jelsma
NUTCH 1.9 > > Like this ? > > http.agent.name > My Nutch Spider > > solr.server.url > http://localhost:8983/solr/ > > On Mon, Nov 3, 2014 at 5:41 PM,

RE: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Markus Jelsma
> Hi Markus, > > Where can I find the solr url setting? -D > > On Mon, Nov 3, 2014 at 5:31 PM, Markus Jelsma > wrote: > > > Well, here it is: > > java.lang.RuntimeException: Missing SOLR URL. Should be set via > > -Dsolr.server.url > > >

RE: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Markus Jelsma
org.apache.nutch.indexer.IndexWriters.(IndexWriters.java:57) > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91) > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at org.apache.nutch.ind

RE: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Markus Jelsma
Hi - see the logs for more details. Markus -Original message- > From:Muhamad Muchlis > Sent: Monday 3rd November 2014 9:15 > To: user@nutch.apache.org > Subject: [Error Crawling Job Failed] NUTCH 1.9 > > Hello. > > I get an error message when I run the command: > > *crawl seed/seed.tx

RE: Nutch vs Lucidworks Fusion

2014-10-13 Thread Markus Jelsma
Hi - anything on this? These are interesting topics so i am curious :) Cheers, Markus -Original message- > From:Markus Jelsma > Sent: Thursday 9th October 2014 0:46 > To: user@nutch.apache.org; a...@getopt.org > Subject: RE: Nutch vs Lucidworks Fusion > > Hi Andrzej - how are you de

RE: Nutch vs Lucidworks Fusion

2014-10-08 Thread Markus Jelsma
Hi Andrzej - how are you dealing with text extraction and other relevant items such as article date and accompanying images? And what about other metadata such as the author of the article or the rating some pasta recipe got? Also, must clients (or your consultants) implement site-specific URL f

RE: Generated Segment Too Large

2014-10-07 Thread Markus Jelsma
Hi - you have been using Nutch for some time already, so aren't you already familiar with the generate.max.count configuration directive, possibly combined with the -topN parameter for the Generator job? With generate.max.count the segment size depends on the number of distinct hosts or domains so it
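For reference, generate.max.count and its companion generate.count.mode are ordinary nutch-site.xml properties; a hedged example (the values shown are illustrative, not recommendations):

```xml
<!-- Illustrative nutch-site.xml fragment: cap each generated segment at
     1,000 URLs per host; the -topN flag on the generate command still
     caps the overall segment size. -->
<property>
  <name>generate.max.count</name>
  <value>1000</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```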

RE: Why are specific URLs not fetched?

2014-10-01 Thread Markus Jelsma
- > From:Jigal van Hemert | alterNET internet BV > Sent: Wednesday 1st October 2014 8:54 > To: user@nutch.apache.org > Subject: Re: Why are specific URLs not fetched? > > Hi, > > On 30 September 2014 11:29, Markus Jelsma wrote: > > Ah, check this out: > &

Re: Why are specific URLs not fetched?

2014-09-30 Thread Markus Jelsma
eptember 2014 11:13:34 Jigal van Hemert | alterNET internet BV wrote: > Hi, > > 2014-09-17 16:43 GMT+02:00 Jigal van Hemert | alterNET internet BV > > : > > Hi, > > > > 2014-09-16 16:15 GMT+02:00 Markus Jelsma : > >> You can check the bin/nutch parse

RE: Solr Indexer Reduce Tasks "fail to report status"

2014-09-29 Thread Markus Jelsma
Hi - i don't think the indexing stage is reached at all, judging from the MapOutputFormat. We sometimes see this happening during the shuffle stage; some mapred limits need to be adjusted to overcome this, but I don't remember which. But you can always decrease the size of a job and just run more

RE: DOCUMENTATION - Nutch and Hidden Services

2014-09-24 Thread Markus Jelsma
Hi - this is really awesome! Is there also a way to use different exit nodes for different fetchers or queues, or can you instruct it to regularly change exit nodes? Markus -Original message- From: Lewis John Mcgibbney Sent: Wednesday 24th September 2014 4:57 To: user@nutch.apache.org; d...

RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Markus Jelsma
Subject: RE: get generated segments from step / fetch all empty segments > > Markus, I have used the maxnum segments but no luck, is it driven by the > size of the segment instead ? > On Sep 22, 2014 9:28 AM, "Markus Jelsma" wrote: > > > You can use maxNumSegment

RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Markus Jelsma
You can use maxNumSegments to generate more than one segment. And instead of passing a list of segment names around, why not just loop over the entire directory, and move finished segments to another one. -Original message- > From:Edoardo Causarano > Sent: Monday 22nd September 2014 15:
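The suggestion above - loop over the segments directory and move finished segments aside - can be sketched like this (directory and segment names are hypothetical, and the Nutch jobs themselves are elided):

```python
import os
import shutil
import tempfile

def drain_segments(src, done):
    # Process every segment directory found under src, then move it to
    # done, so no list of segment names needs to be passed around.
    os.makedirs(done, exist_ok=True)
    for seg in sorted(os.listdir(src)):
        # ... run fetch/parse/updatedb on os.path.join(src, seg) here ...
        shutil.move(os.path.join(src, seg), os.path.join(done, seg))

# Demo with throwaway directories standing in for crawl/segments.
root = tempfile.mkdtemp()
src = os.path.join(root, "segments")
done = os.path.join(root, "segments_done")
os.makedirs(os.path.join(src, "20140922120000"))  # fake segment name
drain_segments(src, done)
print(os.listdir(src), os.listdir(done))  # [] ['20140922120000']
```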

RE: index command failing, no plugins found

2014-09-17 Thread Markus Jelsma
Hi - you must add either the Solr or Elasticsearch indexing plugin to your nutch-site.xml configuration file. You can see conf/nutch-default.xml for examples. User defined config should be in nutch-site.xml. Markus -Original message- > From:Edoardo Causarano > Sent: Wednesday 17th S

RE: Revisiting Loops Job in Nutch Trunk

2014-09-16 Thread Markus Jelsma
Hi - So you are not using it for scoring, right, but to inspect the graph of the web. Then there's certainly no need to weed out loops using the loops algorithm, nor a need to run the LinkRank job. Markus -Original message- > From:Lewis John Mcgibbney > Sent: Thursday 11th Septem

RE: Why are specific URLs not fetched?

2014-09-16 Thread Markus Jelsma
that's not usually the case. -Original message- > From:Jigal van Hemert | alterNET internet BV > Sent: Tuesday 16th September 2014 16:04 > To: user@nutch.apache.org > Subject: Re: Why are specific URLs not fetched? > > Hi, > > Thanks for your reply. > &

RE: Fetch Job Started Failing on Hadoop Cluster

2014-09-16 Thread Markus Jelsma
Hi - you made Nutch believe that hdfs://server1.mydomain.com:9000/user/df/crawldirectory/segments/ is a segment, but it is not. So either no segment was created or written to the wrong location. I don't know what kind of script you are using but you should check the return code of the generato

RE: generatorsortvalue

2014-09-16 Thread Markus Jelsma
Hi - if you need inlinks as input you need to change how Nutch works. By default, inlinks are only used when indexing. So depending on whatever scoring filter you implement, you also need to process inlinks at that stage (generator or updater). This is going to be a costly process because the li

RE: Crawl URL with varying query parameters values

2014-09-16 Thread Markus Jelsma
Hi - you probably have URL filtering enabled, the regex specifically. By default it filters out query strings. Check your URL filters. Markus -Original message- > From:Krishnanand, Kartik > > Sent: Friday 12th September 2014 13:04 >

RE: Why are specific URLs not fetched?

2014-09-16 Thread Markus Jelsma
Hi - it is usually a problem with URL filters, which by default do not accept query strings etc. Check your URL filters. Markus -Original message- > From:Jigal van Hemert | alterNET internet BV > > Sent: Tuesday 16th September 2014 12:24 > To: user@nutch.

Re: Revisiting Loops Job in Nutch Trunk

2014-09-10 Thread Markus Jelsma
Hi - i would not use LinkRank on small-scale crawls, nor for verticals: if internal links are ignored, there are few links to score; if not, the graph is too dense. It is only useful - for me/us - to let the web decide what hosts and pages are popular, so that means large scale. On Wed

RE: Revisiting Loops Job in Nutch Trunk

2014-09-10 Thread Markus Jelsma
m> > Sent: Wednesday 10th September 2014 20:09 > To: Markus Jelsma > Cc: user@nutch.apache.org > Subject: Re: Revisiting Loops Job in Nutch Trunk > > Hi Markus, > Yeah +1 on this one. I was aware of

RE: generatorsortvalue

2014-09-10 Thread Markus Jelsma
Benjamin. > > 2014-09-10 10:48 GMT+02:00 Markus Jelsma: > > > Hi, you can implement a custom FetchSchedule. We use it as well to > > influence how records are sorted. > > Markus > > > > > > > >

RE: Revisiting Loops Job in Nutch Trunk

2014-09-10 Thread Markus Jelsma
Hey Lewis, We didn't use it in the end, but did run the LinkRank on large amounts of data. We then used the scores generated by it for biasing a deduplication algorithm. We tested it thoroughly and never stumbled on issues that could have been resolved using the Loops algorithm. Markus

RE: generatorsortvalue

2014-09-10 Thread Markus Jelsma
Hi, you can implement a custom FetchSchedule. We use it as well to influence how records are sorted. Markus -Original message- > From:Benjamin Derei > Sent: Tuesday 9th September 2014 20:39 > To: user@nutch.apache.org > Su

RE: problems changing domain name for a website

2014-09-01 Thread Markus Jelsma
Yes - you need to erase everything in Solr and reindex it. You can use the regex URL normalizer to rename the data in your crawldb and segments. -Original message- > From:Eyeris RodrIguez Rueda > Sent: Monday 1st September 2014 17:10 > To: user@nutch.apache.org > Subject: problems ch

RE: Nutch 1.7 failing on Hadoop YARN after running for a while.

2014-08-21 Thread Markus Jelsma
There must be something more in the logs, if it's an error produced by Nutch. -Original message- > From:S.L > Sent: Wednesday 20th August 2014 21:31 > To: u...@hadoop.apache.org; user@nutch.apache.org > Subject: Nutch 1.7 failing on Hadoop YARN after running for a while. > > Hi All, >

RE: [RELEASE] Apache Nutch 1.9

2014-08-20 Thread Markus Jelsma
Thanks Lewis!! -Original message- From: Lewis John Mcgibbney Sent: Monday 18th August 2014 22:36 To: user@nutch.apache.org; d...@nutch.apache.org Subject: [RELEASE] Apache Nutch 1.9 Hi Everyone, The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.9, we

RE: Filtering indexing of documents by MIME Type

2014-07-22 Thread Markus Jelsma
Hi - we just modify parse-plugins to only parse what we want to parse, those documents are never indexed anyway, and we can skip the parsing. -Original message- > From:Sebastian Nagel > Sent: Monday 21st July 2014 23:13 > To: user@nutch.apache.org > Subject: Re: Filtering indexing of

RE: Ignoring errors in crawl

2014-07-17 Thread Markus Jelsma
Hi - there's a setting, fetcher.max.exceptions.per.queue, that will dump a queue once it has had too many exceptions. -Original message- > From:Adam Estrada > Sent: Thursday 17th July 2014 16:07 > To: user@nutch.apache.org > Subject: Ignoring errors in crawl > > All, > > I am coming acr

RE: Upgrading nutch 1.8 for having solrj 4.9

2014-07-17 Thread Markus Jelsma
https://issues.apache.org/jira/browse/NUTCH-1486 -Original message- > From:Talat Uyarer > Sent: Thursday 17th July 2014 12:02 > To: user@nutch.apache.org > Subject: Re: Upgrading nutch 1.8 for having solrj 4.9 > > Hi Ali, > > At the present we support Solrj 4.x at branch of Nutch 2.

RE: Feasibility questions regarding my new project

2014-07-02 Thread Markus Jelsma
Hi Daniel, see inline -Original message- > From:Daniel Sachse > Sent: Wednesday 2nd July 2014 18:35 > To: user@nutch.apache.org > Subject: Feasibility questions regarding my new project > > Hey guys, > > I am working on a new SAAS product regarding website metrics. > There are some ba

RE: Changing nutch for update documents instead of add new ones

2014-07-01 Thread Markus Jelsma
Hi, NutchIndexAction is indeed prepared to handle updates but the methods are not implemented. In case of Solr, it still does an internal add/delete for updated documents, and to do so, you must have all fields stored="true". So in almost all cases, it is more efficient not to store all fields

RE: Relationship between fetcher.threads.fetch and fetcher.threads.per.host

2014-06-22 Thread Markus Jelsma
Eeh, this reply was meant for the "Please share your experience of using Nutch in production" topic. Markus -Original message- > From:Markus Jelsma > Sent: Sunday 22nd June 2014 22:54 > To: user@nutch.apache.org > Subject: RE: Relationship between fetcher.threads.fetch and > fetcher.

RE: Relationship between fetcher.threads.fetch and fetcher.threads.per.host

2014-06-22 Thread Markus Jelsma
Hi Meraj, If you see things from another perspective, you may not even need a (very) small crawl delay. Even using a high delay, say 10 seconds, you can still recrawl websites up to 200,000 records large every month, and still quickly discover and index newly found content. If the sites you tar
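The arithmetic behind the 10-second example is easy to check: one polite fetch every 10 seconds from a single queue still adds up to roughly 260,000 fetches per 30-day month, comfortably covering a 200,000-record site:

```python
# Back-of-the-envelope check: with a fixed 10 s crawl delay, a single
# fetch queue still gets through a ~200,000-page site within a month.
delay_s = 10
fetches_per_day = 24 * 60 * 60 // delay_s  # 86,400 s / 10 s = 8,640
fetches_per_month = fetches_per_day * 30   # 259,200
print(fetches_per_day, fetches_per_month)
```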

RE: Elasticsearch & customized indicies

2014-06-18 Thread Markus Jelsma
Chris, does ES allow reuse of the same connection when the indices are in the same cluster? Then you could change the indexing backend to change index target for each host. It would be really hard to open many connections to many different end points because each segment or batch can contain a l

RE: Clarifications regarding re-crawl and Nutch2 storage

2014-06-17 Thread Markus Jelsma
om:Dan Kinder > Sent: Tuesday 17th June 2014 19:24 > To: user@nutch.apache.org > Subject: Re: Clarifications regarding re-crawl and Nutch2 storage > > Thanks again for the quick response, see inline. > > On Mon, Jun 16, 2014 at 4:08 PM, Markus Jelsma > wrote: >

RE: Clarifications regarding re-crawl and Nutch2 storage

2014-06-16 Thread Markus Jelsma
below. > > On Mon, Jun 16, 2014 at 3:01 PM, Markus Jelsma > wrote: > > > Hi Dan, please see inline for comments. > > > > Regards, > > Markus > > > > -Original message- > > > From:Dan Kinder > > > Sent: Monday 16th June 2

RE: Clarifications regarding re-crawl and Nutch2 storage

2014-06-16 Thread Markus Jelsma
Hi Dan, please see inline for comments. Regards, Markus -Original message- > From:Dan Kinder > Sent: Monday 16th June 2014 23:32 > To: user@nutch.apache.org > Subject: Clarifications regarding re-crawl and Nutch2 storage > > Hi there, > > My company currently runs a full-web crawler (

RE: re-crawling with nutch 1.8

2014-06-13 Thread Markus Jelsma
Hi Ali, Nutch does not really re-crawl; it crawls every URL every N interval, with a default of 30 days. Usually one would keep Nutch running indefinitely (e.g. by cron); the URLs will then automatically be `recrawled` every 30 days by default. Markus -Original message- > From:Ali Nazemi
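The 30-day interval mentioned here is the db.fetch.interval.default property, expressed in seconds; a nutch-site.xml fragment shortening it to 7 days (the value is illustrative):

```xml
<!-- nutch-site.xml: revisit pages every 7 days instead of the 30-day
     default (db.fetch.interval.default is expressed in seconds). -->
<property>
  <name>db.fetch.interval.default</name>
  <value>604800</value>
</property>
```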

RE: anchor text in content field

2014-06-13 Thread Markus Jelsma
Hi, In case of the TikaParser you need a customized ContentHandler that remembers its state (being inside an anchor) and ignores characters based on that state in characters(char ch[], int start, int length). Easiest would be to modify parse-tika's DOMBuilder for that. Its use case puzzles me, w

RE: New Apache Nutch Site

2014-06-11 Thread Markus Jelsma
Awesome!!!   -Original message- From:Lewis John Mcgibbney Sent:Wed 11-06-2014 06:13 Subject:New Apache Nutch Site To:user@nutch.apache.org; d...@nutch.apache.org; Hi Folks, I recently attacked [0] which now enables us to run our site as a content management system as opposed to a static ho

RE: Crawling web and intranet files into single crawldb

2014-06-04 Thread Markus Jelsma
lter.txt file (default configuration): http:// https:// ftp:// file:// Even though I removed "file://" or not, the result of nutch URLFilterChecker is still the same. On Wed, Jun 4, 2014 at 7:50 PM, Markus Jelsma wrote: > Remove it from the prefix filter and confir

RE: Crawling web and intranet files into single crawldb

2014-06-04 Thread Markus Jelsma
Wed, Jun 4, 2014 at 7:33 PM, Markus Jelsma wrote: > Hi Bayu, > > You must enable the protocol-file plugin first. Then make sure the file:// > prefix is not filtered via prefix-urlfilter.txt or any other. Now just > inject new URL's and start the crawl. > > > Chee

RE: Crawling web and intranet files into single crawldb

2014-06-04 Thread Markus Jelsma
Hi Bayu,   You must enable the protocol-file plugin first. Then make sure the file:// prefix is not filtered via prefix-urlfilter.txt or any other filter. Now just inject new URLs and start the crawl.   Cheers   -Original message- From:Bayu Widyasanyata Sent:Wed 04-06-2014 14:30 Subject:Crawling
