RE: Nucth 1.7 and ElasticSearch

2013-08-15 Thread Markus Jelsma
See https://issues.apache.org/jira/browse/NUTCH-1598 -Original message- > From:Amit Sela > Sent: Thursday 15th August 2013 11:19 > To: user@nutch.apache.org > Subject: Nucth 1.7 and ElasticSearch > > Hi all, > > I want to setup nutch 1.7 with ElasticSearch 0.90.3 as indexer. > From

RE: Nutch 1.7 on Hadoop Exception in thread "main" java.lang.ClassNotFoundException: org.apache.nutch.indexer.solr.SolrIndexer

2013-08-14 Thread Markus Jelsma
indexing > > more soon > > ps: my next question will be how to script this, those Hadoop command lines > are doing my head in > > > On Wed, Aug 14, 2013 at 12:48 PM, Markus Jelsma > wrote: > > > Also, the webgraph is not part of indexing. That just has a ScoreU

RE: Nutch 1.7 on Hadoop Exception in thread "main" java.lang.ClassNotFoundException: org.apache.nutch.indexer.solr.SolrIndexer

2013-08-14 Thread Markus Jelsma
/opt/nutch/apache-nutch-1.7/build/apache-nutch-1.7.job > > > org.apache.nutch.indexer.solr.SolrIndexer -solr > > > http://solr.server.tld:8088/solr/core1/ /user/crawl-1.7-1/crawldb > > -linkdb > > > /user/crawl-1.7-1/linkdb -dir /user/crawl-1.7-1/seg

RE: Nutch 1.7 on Hadoop Exception in thread "main" java.lang.ClassNotFoundException: org.apache.nutch.indexer.solr.SolrIndexer

2013-08-14 Thread Markus Jelsma
ache.nutch.indexer.solr.SolrIndexer -solr > > > http://solr.server.tld:8088/solr/core1/ /user/crawl-1.7-1/crawldb > > -linkdb > > > /user/crawl-1.7-1/linkdb -dir /user/crawl-1.7-1/segments > > > > > > On Wed, Aug 14, 2013 at 11:33 AM, Marku

RE: Nutch 1.7 on Hadoop Exception in thread "main" java.lang.ClassNotFoundException: org.apache.nutch.indexer.solr.SolrIndexer

2013-08-14 Thread Markus Jelsma
That's right. Check NUTCH-1047, that is what changed: https://issues.apache.org/jira/browse/NUTCH-1047 -Original message- > From:Nicholas Roberts > Sent: Wednesday 14th August 2013 20:11 > To: user@nutch.apache.org > Subject: Nutch 1.7 on Hadoop Exception in thread "main" > java.lang.Cl

RE: Boilerplate removal

2013-08-07 Thread Markus Jelsma
moval > > I didn't test this time around, but I think I did do testing before... > > Anything could possibly go wrong? Anything else I can do? > > > On Wed, Aug 7, 2013 at 11:30 AM, Markus Jelsma > wrote: > > > You are sure the patch works? You get differen

RE: Boilerplate removal

2013-08-07 Thread Markus Jelsma
You are sure the patch works? You get different text output with tika.use_boilerpipe enabled and disabled? -Original message- > From:Joe Zhang > Sent: Wednesday 7th August 2013 20:12 > To: user > Subject: Boilerplate removal > > I'm having the following in my nutchsite.xml. Yet the

RE: file:/// URLS with spaces in path

2013-08-07 Thread Markus Jelsma
Hi, Why move the files if you could just point the webroot to that directory? Anyway, what's the issue with the regex filter? i think having +FW13-sample-docs as rule would pass any URL below that level. I also think a seed file can contain a space so no need to escape the spaces. The metadata

RE: nutch relation between depth parameter and segment

2013-08-07 Thread Markus Jelsma
Yes. Depth actually means, run N crawl cycles or rounds. -Original message- > From:devang pandey > Sent: Wednesday 7th August 2013 11:09 > To: user@nutch.apache.org > Subject: nutch relation between depth parameter and segment > > hello , > I am new to nutch , I have one question that i
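The "N crawl cycles or rounds" above can be scripted with the individual commands. A minimal sketch only: the `nutch` function below is a stub that echoes each command line so the round structure is visible without a Nutch install; substitute the real bin/nutch, and treat the paths and the -topN value as assumptions.

```shell
# Stubbed stand-in for bin/nutch: echoes each command instead of running it.
nutch() { echo "bin/nutch $*"; }

CRAWLDB=crawl/crawldb
# depth=3 is simply three generate/fetch/parse/updatedb rounds.
for round in 1 2 3; do
  nutch generate "$CRAWLDB" crawl/segments -topN 1000
  # Placeholder: in a real run this is the segment that generate just created.
  segment="crawl/segments/$round"
  nutch fetch "$segment"
  nutch parse "$segment"
  nutch updatedb "$CRAWLDB" "$segment"
done
```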

RE: nutch webgraph analysis

2013-08-01 Thread Markus Jelsma
please guide me on how to exactly > use these commands to read webgraph. > > > On Thu, Aug 1, 2013 at 2:56 PM, Markus Jelsma > wrote: > > > There are reader and dumper tools you can use, see: > > > > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache

RE: nutch webgraph analysis

2013-08-01 Thread Markus Jelsma
There are reader and dumper tools you can use, see: http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/ -Original message- > From:devang pandey > Sent: Thursday 1st August 2013 8:27 > To: user@nutch.apache.org > Subject: nutch webgraph analysis > > Hello,

RE: 2 day Nutch training course

2013-07-30 Thread Markus Jelsma
cool! -Original message- > From:Julien Nioche > Sent: Monday 29th July 2013 17:45 > To: user@nutch.apache.org > Subject: 2 day Nutch training course > > Hi, > > We are planning to run a 2 day Nutch training course this autumn. More > details on > http://digitalpebble.blogspot.co.uk/

RE: Nutch HTML Parsers & tika-boilerpipe configuration

2013-07-29 Thread Markus Jelsma
ulted the expected the results, but when I run the crawler, I get > ~98% Error while Parsing, > > I get the following error > > *"Unable to successfully parse content URL*" > > > > On Mon, Jul 29, 2013 at 4:53 PM, Markus Jelsma > wrote: > >

RE: Nutch HTML Parsers & tika-boilerpipe configuration

2013-07-29 Thread Markus Jelsma
Simple, only use parse-tika and patch with NUTCH-961. https://issues.apache.org/jira/browse/NUTCH-961 Extractor algorithms are fixed, it is not possible to preanalyze a page and select an extractor accordingly. -Original message- > From:imran khan > Sent: Monday 29th July 2013 11:25

RE: nutch crawldb analytics

2013-07-29 Thread Markus Jelsma
Patch should work with any 1.x, it doesn't change existing sources and only reads the CrawlDB. -Original message- > From:devang pandey > Sent: Monday 29th July 2013 13:00 > To: user@nutch.apache.org > Subject: Re: nutch crawldb analytics > > @ Markus Jelsma

RE: nutch crawldb analytics

2013-07-29 Thread Markus Jelsma
Using the HostDB tool you can create a database of hosts and dump their statistics. https://issues.apache.org/jira/browse/NUTCH-1325 -Original message- > From:devang pandey > Sent: Monday 29th July 2013 12:30 > To: user@nutch.apache.org > Subject: nutch crawldb analytics > > Hello

RE: Prevent crawl of parent URL

2013-07-25 Thread Markus Jelsma
-Original message- > From:stone2dbone > Sent: Wednesday 24th July 2013 18:25 > To: user@nutch.apache.org > Subject: RE: Prevent crawl of parent URL > > Thanks Markus. I will give this a try. I did refilter the crawldb. One more > question: > > I'm not good with regex. If I wanted to cr

RE: Prevent crawl of parent URL

2013-07-24 Thread Markus Jelsma
Hi -Original message- > From:stone2dbone > Sent: Wednesday 24th July 2013 14:56 > To: user@nutch.apache.org > Subject: Prevent crawl of parent URL > > I would like to crawl everything in > > http://my.domain.name/dir/subdir > > but nothing in its parent > > http://my.domain.name/dir/
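For the question above, a pair of regex-urlfilter.txt rules along these lines should do it (first matching rule wins; host and path here are the asker's example, and the exact rules are a sketch). The grep check below is only a quick way to sanity-test the accept pattern outside of Nutch; for real testing use the URLFilterChecker tool.

```shell
# Hypothetical regex-urlfilter.txt contents (first match wins):
#   +^http://my\.domain\.name/dir/subdir
#   -.
# Sanity-check the accept pattern with grep -E:
pat='^http://my\.domain\.name/dir/subdir'
echo 'http://my.domain.name/dir/subdir/page.html' | grep -qE "$pat" && child=accepted
echo 'http://my.domain.name/dir/other.html'       | grep -qE "$pat" || parent=rejected
echo "$child $parent"   # prints: accepted rejected
```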

RE: Why aren't my path exclusions getting excluded in the Nutch index to Solr?

2013-07-22 Thread Markus Jelsma
-Original message- > From:dogrdon > Sent: Monday 22nd July 2013 17:02 > To: user@nutch.apache.org > Subject: RE: Why aren't my path exclusions getting excluded in the Nutch > index to Solr? > > Thanks, I was wondering if some kind of reset like that needed to happen. > > I am fai

RE: Nutch 1.6/ Solr - order of the search results (tweaking order of the page/pagerank?)

2013-07-22 Thread Markus Jelsma
Check the Solr wiki pages: http://wiki.apache.org/solr/FunctionQuery http://wiki.apache.org/solr/SolrRelevancyFAQ http://wiki.apache.org/solr/ExtendedDisMax The only thing you can do from Nutch is pass its LinkRank score to the boost field. http://wiki.apache.org/nutch/NewScoring -Original

RE: Why aren't my path exclusions getting excluded in the Nutch index to Solr?

2013-07-22 Thread Markus Jelsma
Don't forget to refilter the database after changes have been made to URL filters, or excluded URLs will be regenerated and fetched. -Original message- > From:dogrdon > Sent: Friday 19th July 2013 18:44 > To: user@nutch.apache.org > Subject: Why aren't my path exclusions getting excluded in the Nutch

RE: Nutch how to crawl but not index the site navigation (w/ Solr)

2013-07-17 Thread Markus Jelsma
Yes! Boilerpipe is the best open source alternative and has a working patch for Nutch! There are also some other open source extraction toolkits but they have not been ported to Tika or do not directly work with SAX ContentHandlers (usable in Tika) so they would require some work there plus inte

RE: Storing Nutch statistics

2013-07-15 Thread Markus Jelsma
toring Nutch statistics > > hey markus , thanx for replying .. What I want exactly is that to export > this result of (say readdb )to databse so that it is easy to query . So > what needs to be done to export stats to database for eg postgres.I am > usinh nutch 1.4 > > > On

RE: Storing Nutch statistics

2013-07-15 Thread Markus Jelsma
What kind of stats are you looking for? Nutch ships with the readdb tool by default. Use readdb crawl/crawldb -stats. -Original message- > From:devang pandey > Sent: Monday 15th July 2013 8:27 > To: user@nutch.apache.org > Subject: Storing Nutch statistics > > Hello, > > I am usin

RE: Nutch(2.2.1) How to extract a proper snippet text from a crawled site to display under search result?

2013-07-12 Thread Markus Jelsma
Hi, This is always an interesting problem. You can either buy or build your own extraction software or be satisfied by what Boilerpipe has to offer. Tika has support for Boilerpipe and NUTCH-961 has a patch for 2.x as well enabling Boilerpipe. https://issues.apache.org/jira/browse/NUTCH-961 B

RE: nutch crawling issues

2013-07-10 Thread Markus Jelsma
> hello markus I have one confusion should i implement changes in crawl-url > filter or regex filter > > > On Wed, Jul 10, 2013 at 3:12 PM, Markus Jelsma > wrote: > > > Hi, > > > > Use a regex url filter to filter those URL's and prevent them from be

RE: nutch crawling issues

2013-07-10 Thread Markus Jelsma
Hi, Use a regex url filter to filter those URL's and prevent them from being crawled again. Cheers -Original message- > From:devang pandey > Sent: Wednesday 10th July 2013 10:29 > To: user@nutch.apache.org > Subject: nutch crawling issues > > I have a website eg . www.example.com. N

RE: Indexing from nutch 1.6 to solr 4.3.1 cloud

2013-07-10 Thread Markus Jelsma
Hi, Those were removed because I copied it from our own Nutch dist. They aren't being used anyway and document-related variables or literals have no place in indexing backends. They aren't available either in the ES backend we added. Also, there is no need to modify SolrClean because it simply d

RE: Indexing from nutch 1.6 to solr 4.3.1 cloud

2013-07-09 Thread Markus Jelsma
nd notify you guys about the result. > > Best. > > > On Tue, Jul 9, 2013 at 1:55 PM, Markus Jelsma > wrote: > > > Hi, > > > > Just as i explained. The DistributedUpdateRequestProcessor does that on > > the Solr node for you. There's an issue at Solr

RE: Indexing from nutch 1.6 to solr 4.3.1 cloud

2013-07-09 Thread Markus Jelsma
rver(s). I'll give a shot for your > patch. > > Best > > > On Tue, Jul 9, 2013 at 1:34 PM, Markus Jelsma > wrote: > > > Yes, it only takes URL's for your ensemble because that is how > > CloudSolrServer works and it is the best method of connecting to

RE: Indexing from nutch 1.6 to solr 4.3.1 cloud

2013-07-09 Thread Markus Jelsma
r is enough? > > BTW my patch is ready, how could suppose to attach it? > > Best > > > On Tue, Jul 9, 2013 at 1:11 PM, Markus Jelsma > wrote: > > > I attached a patch for support of CloudSolrServer and a Zookeeper > > ensemble. Use solr.zookee

RE: Regarding crawling https links

2013-07-09 Thread Markus Jelsma
e.org > Subject: RE: Regarding crawling https links > > How can I make nutch ignore robots.txt file? > > Regards, > Vincent Anup Kuri > > > -----Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent: Tuesday, July 09, 2013 3:46 PM

RE: Regarding crawling https links

2013-07-09 Thread Markus Jelsma
That's because the checker tools do not use robots.txt. -Original message- > From:Anup Kuri, Vincent > Sent: Tuesday 9th July 2013 12:14 > To: user@nutch.apache.org > Subject: RE: Regarding crawling https links > > That's for the asp file. When I used Parser Checker, it works perfectly,

RE: Indexing from nutch 1.6 to solr 4.3.1 cloud

2013-07-09 Thread Markus Jelsma
I attached a patch for support of CloudSolrServer and a Zookeeper ensemble. Use solr.zookeeper.hosts and solr.collection to enable it. Patch also required NUTCH-1486. https://issues.apache.org/jira/browse/NUTCH-1377 -Original message- > From:Tuğcem Oral > Sent: Tuesday 9th July 2013

RE: Intercept the current URL that Nutch is about to crawl in Nutch 1.7

2013-07-08 Thread Markus Jelsma
> what if I have to talk to Solr from Nutch ? > > > On Mon, Jul 8, 2013 at 4:34 PM, Markus Jelsma > wrote: > > > Processing the logs would be easy but since you need some metadata your > > probably need to hack into the Fetcher.java code. The fetcher has several &g

RE: Intercept the current URL that Nutch is about to crawl in Nutch 1.7

2013-07-08 Thread Markus Jelsma
Processing the logs would be easy but since you need some metadata you probably need to hack into the Fetcher.java code. The fetcher has several inner classes but you'd need the FetcherThread class which is responsible for the actual download and anything else that needs to be done there. If y

RE: Indexing from nutch 1.6 to solr 4.3.1 cloud

2013-07-08 Thread Markus Jelsma
1.6 to solr 4.3.1 cloud > > > OK then. I generated the corresponding patch. If someone also needs it till > nutch 1.8 is released, I'd be happy to share. > > Best, > > Tugcem. > > > On Mon, Jul 8, 2013 at 12:10 PM, Markus Jelsma > wrote: > > >

RE: nutch 1.2 solr 3.1 integration issue

2013-07-08 Thread Markus Jelsma
rOutputFormat.java:48) > at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216) > 2013-07-08 17:38:47,577 ERROR solr.SolrIndexer - ja

RE: nutch 1.2 solr 3.1 integration issue

2013-07-08 Thread Markus Jelsma
You're not using the correct httpclient jar for the solrj client version. This is some dependency hell. You'd better build 1.2 from source and change the versions in ivy. That would be easier, I think. -Original message- > From:devang pandey > Sent: Monday 8th July 2013 13:46 > To: use

RE: Drop inproper multiValued field

2013-07-08 Thread Markus Jelsma
> On 08.07.2013 13:20, Markus Jelsma wrote: > > Ah, I assume this didn't happen with 1.6? 1.7 finally has support for > > multivalued metadata fields and index-metadata seems to write them > > all. > > Absolutely right! Which is in some cases fine, but very annoying els

RE: Drop inproper multiValued field

2013-07-08 Thread Markus Jelsma
Ah, I assume this didn't happen with 1.6? 1.7 finally has support for multivalued metadata fields and index-metadata seems to write them all. -Original message- > From:Christian Nölle > Sent: Monday 8th July 2013 13:17 > To: user@nutch.apache.org > Subject: Drop inproper multiValued fi

RE: nutch 1.2 solr 3.6 integration issue

2013-07-08 Thread Markus Jelsma
owngrade my solr to 3.1 ?? will this work > > > On Mon, Jul 8, 2013 at 4:16 PM, Markus Jelsma > wrote: > > > mm, if dropping jars doesn't work (yields the same error in the logs) i > > don't know what you could do except upgrading to a more recent Nutch. 1.5

RE: nutch 1.2 solr 3.6 integration issue

2013-07-08 Thread Markus Jelsma
pache.org > Subject: Re: nutch 1.2 solr 3.6 integration issue > > I am very sorry for wrong reply . I am using binary . > > > On Mon, Jul 8, 2013 at 4:10 PM, Markus Jelsma > wrote: > > > you're building nutch from source? > > > > -Original messa

RE: nutch 1.2 solr 3.6 integration issue

2013-07-08 Thread Markus Jelsma
emoved > the older ones ... I also carried an ant job ..But still this stuff is not > working . > > > On Mon, Jul 8, 2013 at 3:54 PM, Markus Jelsma > wrote: > > > Well, the API hasn't changed i think so you might consider upgrading the > > solrJ client i

RE: nutch 1.2 solr 3.6 integration issue

2013-07-08 Thread Markus Jelsma
be done > to support this version of nutch. > > > On Mon, Jul 8, 2013 at 3:46 PM, Markus Jelsma > wrote: > > > Ah, since you're using an old Nutch and an old SolrJ client and that the > > Javabin format has changed over time, i think your Solr is too new for

RE: nutch 1.2 solr 3.6 integration issue

2013-07-08 Thread Markus Jelsma
duceTask.java:474) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216) > 2013-07-08 15:17:39,539 ERROR solr.SolrIndexer - java.io.IOException: Job > failed! > > > > On Mon, Jul 8, 2013 at 3:41 PM,

RE: nutch 1.2 solr 3.6 integration issue

2013-07-08 Thread Markus Jelsma
an you pls > suggest me some other way of resolving this issue. > > > On Mon, Jul 8, 2013 at 3:14 PM, Markus Jelsma > wrote: > > > You need to provide the log output. But i think crawl/segments/* is the > > problem. You must either do seg1 seg2 seg3 or -dir

RE: nutch 1.2 solr 3.6 integration issue

2013-07-08 Thread Markus Jelsma
You need to provide the log output. But i think crawl/segments/* is the problem. You must either do seg1 seg2 seg3 or -dir segments/. No wildcards supported! Cheers -Original message- > From:devang pandey > Sent: Monday 8th July 2013 11:41 > To: user@nutch.apache.org > Subject: nut

RE: Indexing from nutch 1.6 to solr 4.3.1 cloud

2013-07-08 Thread Markus Jelsma
First we need to upgrade to Solr >= 4.3 (NUTCH-1486). Then we'll have to add an option to index via CloudSolrServer (NUTCH-1377) where you input your Zookeeper ensemble vs. a target host. Then we can do NUTCH-1480 and write to multiple individual servers and/or multiple cloud clusters. The upgr

RE: Long crawl keeps failing in fetch phase

2013-07-08 Thread Markus Jelsma
Hi, Begin with decreasing the number of records you process per mapper and increase the number of mappers. Better parallelism and less recovery work in case of problems. We usually don't do more than 25k URL's per mapper but do a lot of mappers instead! Easier to control and debug. Cheers.

RE: Indexing from nutch 1.6 to solr 4.3.1 cloud

2013-07-08 Thread Markus Jelsma
other computer. > > Best, > > Tugcem Oral > > > > On Fri, Jul 5, 2013 at 5:48 PM, Markus Jelsma > wrote: > > > Hi, > > > > 1480 and 1377 are different. We already use CloudSolrServer (i haven't > > added the patch yet) but also use

RE: how about change a liite in the QueueFeeder

2013-07-08 Thread Markus Jelsma
Hi, what exactly is the problem you're trying to solve? If there's an issue, please open a Jira ticket, describe the problem and attach a patch file. Thanks Markus -Original message- > From:RS > Sent: Sunday 7th July 2013 11:14 > To: Lewis John Mcgibbney > Cc: nutch-user > Subject

RE: not able to use webgraph command in nutch 1.2

2013-07-08 Thread Markus Jelsma
The webgraph package is simply not there yet. Please upgrade to a recent version and you can use the webgraph package. Cheers, Markus -Original message- > From:devang pandey > Sent: Monday 8th July 2013 7:44 > To: user@nutch.apache.org > Subject: not able to use webgraph command in n

RE: Indexing from nutch 1.6 to solr 4.3.1 cloud

2013-07-05 Thread Markus Jelsma
Hi, 1480 and 1377 are different. We already use CloudSolrServer (I haven't added the patch yet) but also use 1480 to write to multiple Solr clusters! Both still need patches and I haven't had time yet to provide them although we already use both features in our Nutch. I'll try to find som

RE: limit to fetch only N pages from each host?

2013-07-05 Thread Markus Jelsma
generate.max.count? -Original message- > From:Dennis Yurichev > Sent: Friday 5th July 2013 5:25 > To: user@nutch.apache.org > Subject: limit to fetch only N pages from each host? > > Hi. > > How to limit nutch 2.x to fetch only N (5-10) pages from each host or > domain? > I fail to
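A hedged sketch of the nutch-site.xml properties behind the generate.max.count suggestion (property names as in the Nutch 1.x defaults; the values here are illustrative only):

```xml
<!-- Limit how many URLs per host (or domain) end up in a single fetch list -->
<property>
  <name>generate.max.count</name>
  <value>10</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value> <!-- or "domain" -->
</property>
```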

RE: Nutch scalability tests

2013-07-03 Thread Markus Jelsma
How many different hosts do you crawl? I see one reducer and only one queue and Nutch queues by domain or host. Hosts will always end up in the same queue so Nutch will only crawl a lot and very fast if there's a large number of queues to process. The only thing you can do then is increase the

RE: [ANNOUNCE] Apache Nutch v2.2.1 Released

2013-07-03 Thread Markus Jelsma
Great news, thanks Lewis! -Original message- From: Lewis John Mcgibbney Sent: Tuesday 2nd July 2013 18:32 To: user@nutch.apache.org; d...@nutch.apache.org Subject: [ANNOUNCE] Apache Nutch v2.2.1 Released Good Afternoon Everyone, The Apache Nutch PMC are very pleased to announce the immed

RE: no digest field avaliable

2013-07-02 Thread Markus Jelsma
I've got a version of the indexchecker that does that, as well as providing a telnet server. I was just thinking to open an issue about that this afternoon! -Original message- > From:Sebastian Nagel > Sent: Tuesday 2nd July 2013 22:29 > To: user@nutch.apache.org > Subject: Re: no dige

RE: Distributed mode and java/lang/OutOfMemoryError

2013-07-02 Thread Markus Jelsma
Hi, Increase your memory in the task trackers by setting your Xmx in mapred.map.child.java.opts. Cheers -Original message- > From:Sznajder ForMailingList > Sent: Tuesday 2nd July 2013 15:25 > To: user@nutch.apache.org > Subject: Distributed mode and java/lang/OutOfMemoryError > >
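A sketch of the config entry implied above, for nutch-site.xml or mapred-site.xml depending on your setup; the 2 GB value is only an example:

```xml
<!-- Heap for map tasks; raise -Xmx if tasks die with OutOfMemoryError -->
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```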

RE: Nutch scalability tests

2013-07-02 Thread Markus Jelsma
Hi, Nutch can easily scale to many many billions of records, it just depends on how many and how powerful your nodes are. Crawl speed is not very relevant as it is always very fast, the problem usually is updating the databases. If you spread your data over more machines you will increase your

RE: Parse headings plugin

2013-06-28 Thread Markus Jelsma
These are available in Nutch 1.7: headings (default: h1,h2), a comma-separated list of headings to retrieve from the document, and headings.multivalued (default: false), whether to support multivalued headings. So if the plugin is enabled those headings are extracted and added to the document's parse metada

RE: Fetch iframe from HTML (if exists)

2013-06-27 Thread Markus Jelsma
)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass)|iframemeta > > > So I us parse-html but also tika, text metatags and js. maybe it's to > > much > > > ? I copied this configuration from an example I saw. I do know that I

RE: [ANNOUNCE] Apache Nutch v1.7 Released

2013-06-27 Thread Markus Jelsma
Thanks again Lewis for properly managing the release! Looking forward already to 1.8! Cheers -Original message- From: Lewis John Mcgibbney Sent: Thursday 27th June 2013 1:39 To: user@nutch.apache.org; d...@nutch.apache.org Subject: [ANNOUNCE] Apache Nutch v1.7 Released N.B. Previous

RE: [VOTE] Apache Nutch 2.2.1 RC#1

2013-06-27 Thread Markus Jelsma
Looks fine Lewis! +1 -Original message- From: Lewis John Mcgibbney Sent: Thursday 27th June 2013 20:00 To: d...@nutch.apache.org; user@nutch.apache.org Subject: [VOTE] Apache Nutch 2.2.1 RC#1 Hi, It would be greatly appreciated if you could take some time to VOTE on the release candidat

RE: Fetch iframe from HTML (if exists)

2013-06-26 Thread Markus Jelsma
er@nutch.apache.org > Subject: Re: Fetch iframe from HTML (if exists) > > How will it affect ? I Crawl with no depth (depth 1) so outlinks don't > matter and it seems that the urls fetched don't get parsed, or am I > misunderstanding something ? > > > O

RE: Fetch iframe from HTML (if exists)

2013-06-26 Thread Markus Jelsma
gt; > > > Thanks. > > > > > > > > > > On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela wrote: > > > >> Thanks for the prompt answer! > >> > >> > >> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma < > >> markus.jel...

RE: Fetch iframe from HTML (if exists)

2013-06-25 Thread Markus Jelsma
Hi, Do i understand you correctly if you want all iframe src attributes on a given page stored in the iframe field? The src attributes are not extracted and there is no facility to do so right now. You should create your own HTMLParseFilter, loop through the document looking for iframe tags an

RE: Mailing list archive

2013-06-25 Thread Markus Jelsma
http://mail-archives.apache.org/mod_mbox/nutch-user/ http://lucene.472066.n3.nabble.com/Nutch-f603146.html -Original message- > From:Sznajder ForMailingList > Sent: Tuesday 25th June 2013 14:16 > To: user@nutch.apache.org > Subject: Mailing list archive > > Hi > > Where can I consult

RE: [RESULT] WAS Re: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-24 Thread Markus Jelsma
Hi All, > > I am going to close this thread off and formally close the VOTE for Nutch > 1.7 Release candidate. > VOTE'ing has tallied as follows > > [7] +1, lets get it released!!! > Markus Jelsma > Julien Nioche > Tejas Patil > Feng Lu > Chris Mattmann >

RE: Parse reduce stage take forver

2013-06-24 Thread Markus Jelsma
Yes, that matters indeed! But if you don't normalize, your URL filters may not work, although that should not be a problem in small crawls or with a limited number of (good) websites. You could try the following normalizing rule to remove very long URL's as your first rule. .{256,} With an empty sub
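The `.{256,}` rule with an empty substitution would go into conf/regex-normalize.xml for the regex URL normalizer, roughly like this (a fragment only; it sits inside the file's `<regex-normalize>` root element). Normalizing a long URL to the empty string effectively drops it:

```xml
<!-- First rule: rewrite any URL of 256+ characters to the empty string -->
<regex>
  <pattern>.{256,}</pattern>
  <substitution></substitution>
</regex>
```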

RE: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-21 Thread Markus Jelsma
Sigs checked out for tgz! -Original message- From: Lewis John Mcgibbney Sent: Friday 21st June 2013 20:41 To: user@nutch.apache.org Cc: d...@nutch.apache.org Subject: Re: [VOTE] Apache Nutch 1.7 Release Candidate Hi Julien, Done, thanks for the attention to detail. I wonder if you got to

RE: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-21 Thread Markus Jelsma
Nice! Signatures are ok and everything builds and all tests pass. The pom still points to 1.6-SNAPSHOT. Other than that: a definite +1! Cheers -Original message- From: lewis john mcgibbney Sent: Friday 21st June 2013 0:33 To: d...@nutch.apache.org; user@nutch.apache.org Subject: [VO

RE: Nutch scoring question again

2013-06-16 Thread Markus Jelsma
Hi Joe, You don't need a scoring filter for Linkrank. Just follow the wiki and run the webgraph tool on your segments. Then you can run the linkrank tool on the webgraph you just created from your segments. Finally use the scoreupdater tool to write the scores back to your crawldb. Cheers htt

RE: HTMLParseFilter equivalent in Nutch 2.2 ???

2013-06-12 Thread Markus Jelsma
I think for Nutch 2.x HTMLParseFilter was renamed to ParseFilter. This is not true for 1.x, see NUTCH-1482. https://issues.apache.org/jira/browse/NUTCH-1482 -Original message- > From:Tony Mullins > Sent: Wed 12-Jun-2013 14:37 > To: user@nutch.apache.org > Subject: HTMLParseFi

RE: Suffix URLFilter not working

2013-06-12 Thread Markus Jelsma
We happily use that filter just as it is shipped with Nutch. Just enabling it in plugin.includes works for us. To ease testing you can use the bin/nutch org.apache.nutch.net.URLFilterChecker to test filters. -Original message- > From:Bai Shen > Sent: Wed 12-Jun-2013 14:32 > To: user@

RE: Running Nutch standalone (without Solr)

2013-06-12 Thread Markus Jelsma
Hi, Sure, you don't need to index the data and can use the individual commands or the new bin/crawl script. Cheers -Original message- > From:Peter Gaines > Sent: Wed 12-Jun-2013 13:57 > To: user@nutch.apache.org > Subject: Running Nutch standalone (without Solr) > > Hi There, > >

RE: using Tika within Nutch to remove boiler plates?

2013-06-11 Thread Markus Jelsma
: using Tika within Nutch to remove boiler plates? > > So what in your opinion is the most effective way of removing boilerplates > in Nutch crawls? > > > On Tue, Jun 11, 2013 at 12:12 PM, Markus Jelsma > wrote: > > > Yes, Boilerpipe is complex and difficult to ad

RE: Data Extraction from 100+ different sites...

2013-06-11 Thread Markus Jelsma
lugins > available on web ? > > Thanks, > Tony. > > > On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma > wrote: > > > Hi, > > > > Yes, you should write a plugin that has a parse filter and indexing > > filter. To ease maintenance you would w

RE: using Tika within Nutch to remove boiler plates?

2013-06-11 Thread Markus Jelsma
't use boilerpipe any more? So what do you > suggest as an alternative? > > > On Tue, Jun 11, 2013 at 5:41 AM, Markus Jelsma > wrote: > > > we don't use Boilerpipe anymore so no point in sharing. Just set those two > > configuratio

RE: Data Extraction from 100+ different sites...

2013-06-11 Thread Markus Jelsma
Hi, Yes, you should write a plugin that has a parse filter and indexing filter. To ease maintenance you would want to have a file per host/domain containing XPath expressions, far easier than switch statements that need to be recompiled. The indexing filter would then index the field values ext

RE: RSS based crawl - how to crawl ref links in next round

2013-06-11 Thread Markus Jelsma
-Original message- > From:Sourajit Basak > Sent: Tue 11-Jun-2013 14:50 > To: user@nutch.apache.org > Subject: RSS based crawl - how to crawl ref links in next round > > We are crawling RSS links using a custom plugin. Thats working fine. > > Our intention is to crawl the discovered

RE: using Tika within Nutch to remove boiler plates?

2013-06-11 Thread Markus Jelsma
11-Jun-2013 01:42 > To: user > Subject: Re: using Tika within Nutch to remove boiler plates? > > Marcus, do you mind sharing a sample nutch-site.xml? > > > On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma > wrote: > > > Those settings belong to nutch-site. Enable

RE: using Tika within Nutch to remove boiler plates?

2013-06-10 Thread Markus Jelsma
Those settings belong to nutch-site. Enable BP and set the correct extractor and it should work just fine. -Original message- > From:Lewis John Mcgibbney > Sent: Sun 09-Jun-2013 20:47 > To: user@nutch.apache.org > Subject: Re: using Tika within Nutch to remove boiler plates? > > Hi J

RE: Generator -adddays

2013-05-31 Thread Markus Jelsma
Please don't break existing scripts and support lower and uppercase. Markus -Original message- > From:Lewis John Mcgibbney > Sent: Fri 31-May-2013 19:11 > To: user@nutch.apache.org > Subject: Re: Generator -adddays > > Seems like a small cli syntax bug. > Please submit a patch and w

RE: How to achieve different fetcher.server.delay configuration for different hosts/sub domains?

2013-05-28 Thread Markus Jelsma
You can either use robots.txt or modify the Fetcher. Fetcher has a FetchItemQueue for each queue, this also records the CrawlDelay for that queue. A FetchItemQueue is created by FetchItemQueues.getFetchItemQueue(), here it sets the CrawlDelay for the queue. You can have a lookup table here that

Fetcher corrupting some segments

2013-05-27 Thread Markus Jelsma
Hi, For some reason the fetcher sometimes produces corrupt, unreadable segments. It then exits with exceptions like "problem advancing post" or "negative array size exception", etc. java.lang.RuntimeException: problem advancing post rec#702 at org.apache.hadoop.mapred.Task$ValuesIterat

RE: rewriting urls that are index

2013-04-22 Thread Markus Jelsma
Hi, The 1.x indexer takes a -normalize parameter and there you can rewrite your URL's. Judging from your patterns the RegexURLNormalizer should be sufficient. Make sure you use the config file containing that pattern only when indexing, otherwise they'll end up in the CrawlDB and segments. Use

RE: Period-terminated hostnames

2013-04-18 Thread Markus Jelsma
Rodney, Those are valid URL's but you clearly don't need them. You can either use filters to get rid of them or normalize them away. Use the org.apache.nutch.net.URLNormalizerChecker or URLFilterChecker tools to test your config. Markus -Original message- > From:Rodney Barnett >

RE: Does Nutch Checks Whether A Page crawled before or not

2013-03-20 Thread Markus Jelsma
> To: user@nutch.apache.org > Subject: Re: Does Nutch Checks Whether A Page crawled before or not > > Where does Nutch stores that information? > > 2013/3/21 Markus Jelsma-2 [via Lucene] < > ml-node+s472066n4049568...@n3.nabble.com> > > > Nutch selects records th

RE: How to Continue to Crawl with Nutch Even An Error Occurs?

2013-03-20 Thread Markus Jelsma
If Nutch exits with an error then the segment is bad, a failing thread is not an error that leads to a failed segment. This means the segment is properly fetched but just that some records failed. Those records will be eligible for refetch. Assuming you use the crawl command, the updatedb comm

RE: Does Nutch Checks Whether A Page crawled before or not

2013-03-20 Thread Markus Jelsma
Nutch selects records that are eligible for fetch. It's either due to a transient failure or if the fetch interval has expired. This means that failed fetches due to network issues are refetched within 24 hours. Successfully fetched pages are only refetched if the current time exceeds the

RE: [WELCOME] Feng Lu as Apache Nutch PMC and Committer

2013-03-18 Thread Markus Jelsma
Feng Lu, welcome! :) -Original message- > From:Julien Nioche > Sent: Mon 18-Mar-2013 13:23 > To: user@nutch.apache.org > Cc: d...@nutch.apache.org > Subject: Re: [WELCOME] Feng Lu as Apache Nutch PMC and Committer > > Hi Feng,  > > Congratulations on becoming a committer and welcome

RE: keep all pages from a domain in one slice

2013-03-05 Thread Markus Jelsma
Hi You can't do this with -slice but you can merge segments and filter them. This would mean you'd have to merge the segments for each domain. But that's far too much work. Why do you want to do this? There may be better ways of achieving your goal. -Original message- > From:Jason S

RE: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread Markus Jelsma
The default heap size of 1G is just enough for a parsing fetcher with 10 threads. The only problem that may arise is too-large and complicated PDF files or very large HTML files. If you generate fetch lists of a reasonable size there won't be a problem most of the time. And if you want to crawl a

RE: a lot of threads spinwaiting

2013-03-01 Thread Markus Jelsma
Hi, Regarding politeness, 3 threads per queue is not really polite :) Cheers -Original message- > From:jc > Sent: Fri 01-Mar-2013 15:08 > To: user@nutch.apache.org > Subject: Re: a lot of threads spinwaiting > > Hi Roland and lufeng, > > Thank you very much for your replies, I alr

RE: Nutch Incremental Crawl

2013-02-27 Thread Markus Jelsma
w can we do that? > > > Feng Lu : Thank you for the reference link. > > Thanks - David > > > > On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma > wrote: > > > The default or the injected interval? The default interval can be set in > > the config (
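The default interval mentioned above lives in nutch-site.xml; a sketch (property name from the Nutch 1.x defaults, value in seconds, so 2592000 is 30 days):

```xml
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value> <!-- refetch after 30 days -->
</property>
```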

RE: Nutch Incremental Crawl

2013-02-27 Thread Markus Jelsma
interval , incase if I > require to fetch the page before the time interval is passed? > > > > Thanks very much > - David > > > > > > On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma > wrote: > > > If you want records to

RE: regex-urlfilter file for multiple domains

2013-02-26 Thread Markus Jelsma
> > > Yes, my first option is different files for different domains. > The point is how can I link the files with each domain? Do I need to do > some changes in Nutch code, or does the project have a feature for > that? > > On Tue, 26 Feb 2013 10:33:37 +, Markus Jelsma wrot

RE: regex-urlfilter file for multiple domains

2013-02-26 Thread Markus Jelsma
Yes, it will support that until you run out of memory. But having a million expressions is not going to work nicely. If you have a lot of expressions but can divide them into domains, I would patch the filter so it will only execute expressions that are for a specific domain. -Original message-

RE: Differences between 2.1 and 1.6

2013-02-25 Thread Markus Jelsma
Something seems to be missing here. It's clear that 1.x has more features and is a lot more stable than 2.x. Nutch 2.x can theoretically perform a lot better if you are going to crawl on a very large scale but I still haven't seen any numbers to support this assumption. Nutch 1.x can easily deal
