Hi Guys
I'm still struggling with this. In summary, my directory structure is as
follows:
/
|_doccontrol
|_DC-10 Incoming Correspondence
|_DC-11 Outgoing Correspondence
If, when I first run Nutch, the folders DC-10 and DC-11 contain all the files
to be indexed, then Nutch crawls everything w
I'm confused as to what the significant differences between 1.x and
2.x are.
Is there some history I could read about why the two were developed in
parallel?
As I'm just starting out with Nutch/Solr/Hadoop, I'd like to know which
path would be best for me to follow.
Hi Lewis
I have a main development server and connect to it from other machines over
LAN and sometimes via WAN.
Vagrant by itself has a highly constrained network config. You either need the
proprietary Vagrant Cloud share, which does SSH tunnels, or you use
configuration management for networking.
A vagrant file w
How about simply including a Vagrantfile for Nutch in
the trunk?
Then someone would:
svn co https:// or;
git clone https://github.com/apache/nutch
cd nutch
vagrant up
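As a sketch of what a checked-in Vagrantfile might contain (the box name, the heredoc approach, and the provisioning packages are assumptions for illustration, not anything agreed on the list):

```shell
# Hypothetical sketch: generate a minimal Vagrantfile for a Nutch dev box.
# Box name ("ubuntu/trusty64") and package choices are assumptions.
mkdir -p /tmp/nutch-vagrant && cd /tmp/nutch-vagrant

cat > Vagrantfile <<'EOF'
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"
  # Install the JDK and Ant needed to build Nutch from source.
  config.vm.provision "shell", inline: <<-SHELL
    apt-get update -y
    apt-get install -y openjdk-7-jdk ant
  SHELL
end
EOF

# Then, from the checkout:  vagrant up   (requires Vagrant + VirtualBox)
```

With a file like this in trunk, the `clone; cd; vagrant up` sequence above would give a ready-to-build VM.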
Cheers,
Chris
++
Chris Mattmann, Ph.D.
Chief Architect
Instrume
Hi Julien,
On Fri, Aug 29, 2014 at 6:01 AM, wrote:
>
> Just out of interest, what sort of analytics do you do and why is it better
> to do it in 2.x than 1.x?
>
Nowhere did I say it was better or worse than in 1.x. Let me be clear here.
I use Nutch 2.x, as I indicated, because it provides me wit
Hi Nicholas,
NOTE: Thread name has changed to reflect diversion on topic.
On Fri, Aug 29, 2014 at 6:01 AM, wrote:
>
> will you use config management like ansible backing vagrant?
>
Well, thanks for the links here. The GitHub repos they have indicate that
their code is GPL-licensed, meaning that
I think it is a great idea to build images with an environment correctly
set up. I think two types of images would be helpful.
1. Development (Virtualbox)
Here we would have Eclipse, the plugin, pseudo-distributed Hadoop, etc.
correctly installed, maybe on an Ubuntu box with 3D acceleration enabled.
Then people can down
No, just do 'bin/crawl' from
the master node. It internally calls the nutch script for the individual
commands, which takes care of sending the job jar to your hadoop cluster,
see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271
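For reference, an invocation from the master node might look like the following (the seed/crawl paths, Solr URL, and round count are placeholders; the argument order follows the 1.x-era crawl script and may differ in your version):

```shell
# Hypothetical example: run the crawl script from runtime/deploy on the
# master node; it submits the job jar to the cluster for each phase.
# 1.x-era argument order: <seedDir> <crawlDir> <solrURL> <numberOfRounds>
SEED_DIR="urls"                           # directory with seed URL lists
CRAWL_DIR="crawl"                         # crawldb/segments location on HDFS
SOLR_URL="http://localhost:8983/solr/"    # placeholder Solr endpoint
ROUNDS=2                                  # generate/fetch/parse cycles

CMD="bin/crawl $SEED_DIR $CRAWL_DIR $SOLR_URL $ROUNDS"
echo "$CMD"   # run this from runtime/deploy with HADOOP_HOME/bin on PATH
```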
On 29 August 2014 15:24, S.L wrote:
> Sorry Jul
+1, great.
I'd like to have a conversation about versioning.
Since we're at 1.9, my suggestion would be to have the
next release in the trunk series (1.x) move to version 3.x,
post 1.9.
Nutch2 remains Nutch and can be worked on there. That
would give us a nice split in the diversionary br
Sorry Julien, I overlooked the directory names.
My understanding is that a Hadoop job is submitted to a cluster by using
the following command on the RM node: bin/hadoop <.job file>
Are you suggesting I submit the script instead of the Nutch .job jar, like
below?
bin/hadoop bin/crawl
On
Dear Iqbal,
Hi,
As far as I know, if you don't need the Gora mapper for using Nutch over
HBase, MySQL, etc., it is better to use version 1.x, since some Nutch
functionality is not implemented in version 2.x and Nutch 1.x provides
better performance for crawling web pages. ES is not difficult ind
As the name runtime/deploy suggests, it is used exactly for that purpose
;-) Just make sure HADOOP_HOME/bin is added to the PATH and run the script,
that's all.
Look at the bottom of the nutch script for details.
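The setup described above might look like this in practice (the Hadoop install path is a placeholder; the cluster-dependent commands are left commented out):

```shell
# Hypothetical sketch: put Hadoop's bin on the PATH so the nutch script
# detects the cluster installation, then run from runtime/deploy.
HADOOP_HOME="/opt/hadoop"            # placeholder; use your actual path
export PATH="$HADOOP_HOME/bin:$PATH"

# cd apache-nutch-1.9/runtime/deploy
# bin/crawl urls crawl http://localhost:8983/solr/ 2   # needs a live cluster
echo "$PATH" | grep -q "$HADOOP_HOME/bin" && echo "hadoop on PATH"
```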
Julien
PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU (
http://s
Thanks, can this be used on a Hadoop cluster?
Sent from my HTC
- Reply message -
From: "Julien Nioche"
To: "user@nutch.apache.org"
Subject: Nutch 1.7 fetch happening in a single map task.
Date: Fri, Aug 29, 2014 9:00 AM
See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_cra
Thanks Julien for the prompt response.
Actually, since the model for the 1.9 version is all plugin-based, I shouldn't
be expecting an ivy.xml like in 2.x to hold an Elastic config. So ignore that
comment.
Yes, I mean HDFS (new to big data and Hadoop). Isn't HBase the default one for
1.9 too?
Perhaps
See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
just go to runtime/deploy/bin and run the script from there.
Julien
On 29 August 2014 13:38, Meraj A. Khan wrote:
> Hi Julien,
>
> I have 15 domains and they are all being fetched in a single map task which
> does not
Hi Julien,
I have 15 domains and they are all being fetched in a single map task, which
does not fetch all the URLs no matter what depth or topN I give.
I am submitting the Nutch job jar, which seems to be using the Crawl.java
class. How do I use the crawl script on a Hadoop cluster? Are there any
Hi Iqbal,
> Am doing a POC to help decide if we should be using Nutch 1.9 or 2.2.1
> version.
>
> We would be indexing our crawled data in ElasticSearch 1.x version.
>
> I know the 2.2.1 version provides OTB support for Elastic 0.x version but
> to use 2.x I need to change the code (ElasticWriter.ja
Hi All,
Am doing a POC to help decide if we should be using Nutch 1.9 or 2.2.1 version.
We would be indexing our crawled data in ElasticSearch 1.x version.
I know the 2.2.1 version provides OTB support for Elastic 0.x version but to
use 2.x I need to change the code (ElasticWriter.java) This me
Hi, Nutch Gurus,
I have a use case that I need to implement and I hope that someone can help.
I have a situation where I need to generate and build URLs dynamically and pass
them to the respective filter.
I want to pass a newly constructed string to the Filter implementation
associated with re
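One way to sanity-check dynamically built URLs against the configured filter chain is the URLFilterChecker helper from Nutch 1.x (the class name and flag come from that era's `bin/nutch`; check your version, as flags may vary):

```shell
# Hypothetical example: stage dynamically built URLs, one per line, then
# feed them to the configured URL filter chain via URLFilterChecker.
printf '%s\n' \
  "http://example.com/page?id=1" \
  "http://example.com/page?id=2" > /tmp/generated-urls.txt

# Run against all activated filters (needs a Nutch runtime on hand):
# bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < /tmp/generated-urls.txt
wc -l < /tmp/generated-urls.txt   # 2 URLs staged for the filter
```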
Hi Meraj,
The generator will place all the URLs in a single segment if they all
belong to the same host, for politeness reasons. Otherwise it will use
whichever value is passed with the -numFetchers parameter in the generation
step.
Why don't you use the crawl script in /bin instead of tinkering wi
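Concretely, the generate step mentioned above might be invoked like this (paths and counts are placeholders; -numFetchers controls how many fetch lists, and hence fetch map tasks, are created):

```shell
# Hypothetical example: ask the generator for 4 fetch lists so the fetch
# phase runs as 4 map tasks (URLs from one host still stay together).
CRAWLDB="crawl/crawldb"        # placeholder crawldb path on HDFS
SEGMENTS="crawl/segments"      # placeholder segments directory
GEN_CMD="bin/nutch generate $CRAWLDB $SEGMENTS -topN 50000 -numFetchers 4"
echo "$GEN_CMD"   # run from runtime/deploy with HADOOP_HOME/bin on PATH
```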
Hi Lewis,
A few comments below.
I use Nutch 2.x as it enables me to do analytics over the data I am
> crawling. This is my justification for trying to maintain and further the
> development on that branch over the last while.
>
Just out of interest, what sort of analytics do you do and why is it