Re: Scoring-similarity plugin for Nutch 2.3.1

2019-07-02 Thread Gajanan Watkar
and try to apply those to scoring-similarity. > > > Can somebody guide me on this? > > There is currently no Nutch committer actively working on 2.x - just > compare the commit history on > the master and 2.x branches. > > Sebastian > > > > On 6/28/19 12:46 PM,

Scoring-similarity plugin for Nutch 2.3.1

2019-06-28 Thread Gajanan Watkar
Hi all, I am using Nutch 2.3.1 with HBase-1.2.3 as the storage backend on top of a Hadoop-2.5.2 cluster in *deploy mode*, with crawled data being indexed to Solr-6.5.1. I want to add *focused crawling capabilities to Nutch 2.3.1* similar to those provided by the *scoring-similarity plugin for Nutch 1.x*. Can

Re: webapp for Nutch deploy mode

2018-10-20 Thread Gajanan Watkar
once I find time to delve deeper into this issue. -Gajanan On Fri, Oct 19, 2018 at 12:54 AM Lewis John McGibbney wrote: > Hi Gajanan, > Response inline > > On 2018/10/12 07:40:50, Gajanan Watkar wrote: > > Hi all, > > I am using Nutch 2.3.1 with Hbase-1.2.3 as storag

webapp for Nutch deploy mode

2018-10-12 Thread Gajanan Watkar
Hi all, I am using Nutch 2.3.1 with HBase-1.2.3 as the storage backend on top of a Hadoop-2.5.2 cluster in *deploy mode*, with crawled data being indexed to Solr-6.5.1. I want to use the *webapp* for creating, controlling and monitoring crawl jobs in deploy mode. With Hadoop cluster, Hbase and nutchserver st

Re: Unable to get regex-urlfilter working

2018-10-12 Thread Gajanan Watkar
12:19 AM > wrote: > > > > > > > From: Gajanan Watkar > > To: user@nutch.apache.org > > Cc: > > Bcc: > > Date: Wed, 10 Oct 2018 17:19:24 +0530 > > Subject: Re: Unable to get regex-urlfilter working > > I am using Nutch 2.x with HBase as backend storage. > > > > *-Gajanan* > > >

Re: Unable to get regex-urlfilter working

2018-10-10 Thread Gajanan Watkar
I am using Nutch 2.x with HBase as backend storage. *-Gajanan* On Wed, Oct 10, 2018 at 5:17 PM Gajanan Watkar wrote: > Hi all, > > *1. Want to filter all URLs like:* > > http://14538.diarynote.jp/items/music-jp/B5FMG1/ > http://12899diarynote.jp/amp/20150316

Unable to get regex-urlfilter working

2018-10-10 Thread Gajanan Watkar
Hi all, *1. Want to filter all URLs like:* http://14538.diarynote.jp/items/music-jp/B5FMG1/ http://12899diarynote.jp/amp/201503160602121325/ http://15131513marudiarynote.jp/amp/201603181431397340/ http://11621diarynote.jp/amp/20040906174131/ http://14291.diarynote.jp/items/dvd-jp/B00016Z
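
A minimal sketch of how a candidate exclusion pattern for the URLs listed above could be checked before it goes into regex-urlfilter.txt. The pattern itself is an assumption and is not taken from the thread: it treats "host ends in diarynote.jp" as the intent. Python's `re` module is used only as a stand-in for the Java regex engine Nutch uses, and the names `DIARYNOTE` and `is_filtered` are hypothetical.

```python
import re

# Assumed pattern (not from the thread): any http(s) URL whose host
# ends in "diarynote.jp", with or without a dot before it.
# The corresponding regex-urlfilter.txt rule would be the same
# expression prefixed with "-", i.e.  -^https?://[^/]*diarynote\.jp/
DIARYNOTE = re.compile(r"^https?://[^/]*diarynote\.jp/")

def is_filtered(url):
    """Return True if the candidate exclusion pattern matches the URL."""
    return DIARYNOTE.search(url) is not None

# Complete sample URLs from the post, plus one URL that should be kept.
samples = [
    "http://14538.diarynote.jp/items/music-jp/B5FMG1/",
    "http://12899diarynote.jp/amp/201503160602121325/",
    "http://15131513marudiarynote.jp/amp/201603181431397340/",
    "http://11621diarynote.jp/amp/20040906174131/",
    "http://nutch.apache.org/",
]

for url in samples:
    print("EXCLUDE" if is_filtered(url) else "KEEP   ", url)
```

In regex-urlfilter.txt the first matching rule decides, so an exclusion line like this would normally need to sit above any catch-all accept rule.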

Re: Uneven HBase region sizes WAS Re: Nodemanager crashing repeatedly

2018-09-22 Thread Gajanan Watkar
se correct the > record. > BTW, I found the following article written by Elis, to be extremely useful > https://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ > > On Wed, Sep 19, 2018 at 3:55 AM wrote: > > > From: Gajanan Watkar > > To: user@nutch.apach

Re: Nodemanager crashing repeatedly

2018-09-19 Thread Gajanan Watkar
patch for MalformedURLException. I am getting uneven region sizes. Can you advise on pre-splitting the webpage table (i.e. the split points and splitting policy to use) and on the optimum GC setup for the regionserver for efficient Nutch crawling? -Gajanan On Sun, Sep 9, 2018 at 8:34 AM Gajanan Watkar wrote:

Re: Nodemanager crashing repeatedly

2018-09-08 Thread Gajanan Watkar
use the 2.x codebase, you should > use the most recent from SCM e.g. check out master and change to 2.x > branch. > Finally, for now at least, you didn't mention the phase at which the crawl > is failing. Can you provide this? > > On Thu, Sep 6, 2018 at 8:58 AM wrote: >

Nodemanager crashing repeatedly

2018-09-04 Thread Gajanan Watkar
I am running Nutch-2.3.1 over Hadoop-2.5.2 and HBase-1.2.3 with integration to Solr-6.5.1. I have crawled over 10 million pages. But while doing all this I am continuously facing two problems: 1. My Nodemanager is crashing repeatedly during different phases of the crawl. It crashes my Linux session an