Hi Guys
I'm still struggling with this. In summary, my directory structure is as
follows:
/
|_doccontrol
|_DC-10 Incoming Correspondence
|_DC-11 Outgoing Correspondence
If, when I first run Nutch, the folders DC-10 and DC-11 contain all the files
to be indexed, then Nutch crawls everything w
I'm confused as to what the significant differences between 1.x and
2.x are.
Is there some history I could read about why the two were developed in
parallel?
As I'm just starting out with Nutch/Solr/Hadoop, I'd like to know which
path would be best for me to follow.
Hi Lewis
I have a main development server and connect to it from other machines over
LAN and sometimes via WAN.
Vagrant by itself has a highly constrained network config. You either need the
proprietary Vagrant Cloud share, which does SSH tunnels, or you use
configuration management for networking.
A vagrant file w
How about simply including a Vagrantfile for Nutch in
the trunk?
Then someone would:
svn co https:// or;
git clone https://github.com/apache/nutch
cd nutch
vagrant up
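As a sketch of what a checked-in Vagrantfile might contain (the box name, the heredoc approach, and the provisioning packages are assumptions for illustration, not anything agreed on the list):

```shell
# Hypothetical sketch: generate a minimal Vagrantfile for a Nutch dev box.
# Box name ("ubuntu/trusty64") and package choices are assumptions.
mkdir -p /tmp/nutch-vagrant && cd /tmp/nutch-vagrant

cat > Vagrantfile <<'EOF'
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"
  # Install the JDK and Ant needed to build Nutch from source.
  config.vm.provision "shell", inline: <<-SHELL
    apt-get update -y
    apt-get install -y openjdk-7-jdk ant
  SHELL
end
EOF

# Then, from the checkout:  vagrant up   (requires Vagrant + VirtualBox)
```

With a file like this in trunk, the `clone; cd; vagrant up` sequence above would give a ready-to-build VM.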
Cheers,
Chris
++
Chris Mattmann, Ph.D.
Chief Architect
Instrume
Hi Julien,
On Fri, Aug 29, 2014 at 6:01 AM, wrote:
>
> Just out of interest, what sort of analytics do you do and why is it better
> to do it in 2.x than 1.x?
>
Nowhere did I say it was better or worse than in 1.x. Let me be clear here.
I use Nutch 2.x, as I indicated, because it provides me wit
Hi Nicholas,
NOTE: Thread name has changed to reflect diversion on topic.
On Fri, Aug 29, 2014 at 6:01 AM, wrote:
>
> will you use config management like ansible backing vagrant?
>
Well, thanks for the links here. The GitHub repos they have indicate that
their code is GPL-licensed, meaning that
I think it is a great idea to build images with an environment correctly
set up. I think two types of images would be helpful.
1. Development (Virtualbox)
Here we would have Eclipse, the plugin, pseudo-distributed Hadoop, etc.
correctly installed, maybe on an Ubuntu box with 3D acceleration enabled.
Then people can down
No, just do 'bin/crawl' from
the master node. It internally calls the nutch script for the individual
commands, which takes care of sending the job jar to your hadoop cluster,
see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271
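For reference, an invocation from the master node might look like the following (the seed/crawl paths, Solr URL, and round count are placeholders; the argument order follows the 1.x-era crawl script and may differ in your version):

```shell
# Hypothetical example: run the crawl script from runtime/deploy on the
# master node; it submits the job jar to the cluster for each phase.
# 1.x-era argument order: <seedDir> <crawlDir> <solrURL> <numberOfRounds>
SEED_DIR="urls"                           # directory with seed URL lists
CRAWL_DIR="crawl"                         # crawldb/segments location on HDFS
SOLR_URL="http://localhost:8983/solr/"    # placeholder Solr endpoint
ROUNDS=2                                  # generate/fetch/parse cycles

CMD="bin/crawl $SEED_DIR $CRAWL_DIR $SOLR_URL $ROUNDS"
echo "$CMD"   # run this from runtime/deploy with HADOOP_HOME/bin on PATH
```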
On 29 August 2014 15:24, S.L wrote:
> Sorry Jul
+1, great.
I'd like to have a conversation about versioning.
Since we're at 1.9, my suggestion would be to have the
next release in the trunk series (1.x) move to version 3.x,
post 1.9.
Nutch2 remains Nutch and can be worked on there. That
would give us a nice split in the diversionary br
Sorry Julien, I overlooked the directory names.
My understanding is that a Hadoop job is submitted to a cluster by using
the following command on the RM node: bin/hadoop <.job file>
Are you suggesting I submit the script instead of the Nutch .job jar, like
below?
bin/hadoop bin/crawl
On
Dear Iqbal,
Hi,
As far as I know, if you don't need the Gora mapper for using Nutch over
HBase, MySQL, etc., it is better to use version 1.x, since some Nutch
functionality is not implemented in version 2.x and Nutch 1.x provides
better performance for crawling web pages. ES is not difficult ind
As the name runtime/deploy suggests, it is used exactly for that purpose
;-) Just make sure HADOOP_HOME/bin is added to the PATH and run the script,
that's all.
Look at the bottom of the nutch script for details.
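The setup described above might look like this in practice (the Hadoop install path is a placeholder; the cluster-dependent commands are left commented out):

```shell
# Hypothetical sketch: put Hadoop's bin on the PATH so the nutch script
# detects the cluster installation, then run from runtime/deploy.
HADOOP_HOME="/opt/hadoop"            # placeholder; use your actual path
export PATH="$HADOOP_HOME/bin:$PATH"

# cd apache-nutch-1.9/runtime/deploy
# bin/crawl urls crawl http://localhost:8983/solr/ 2   # needs a live cluster
echo "$PATH" | grep -q "$HADOOP_HOME/bin" && echo "hadoop on PATH"
```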
Julien
PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU (
http://s
Thanks, can this be used on a Hadoop cluster?
Sent from my HTC
- Reply message -
From: "Julien Nioche"
To: "user@nutch.apache.org"
Subject: Nutch 1.7 fetch happening in a single map task.
Date: Fri, Aug 29, 2014 9:00 AM
See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_cra
Thanks Julien for the prompt response.
Actually, since the model for the 1.9 version is all plugin-based, I shouldn't
be expecting an ivy.xml like in 2.x to hold an Elastic config. So ignore that
comment.
Yes, I mean HDFS (new to big data and Hadoop). Isn't HBase the default one for
1.9 too?
Perhaps
See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
just go to runtime/deploy/bin and run the script from there.
Julien
On 29 August 2014 13:38, Meraj A. Khan wrote:
> Hi Julien,
>
> I have 15 domains and they are all being fetched in a single map task which
> does not
Hi Julien,
I have 15 domains and they are all being fetched in a single map task, which
does not fetch all the URLs no matter what depth or topN I give.
I am submitting the Nutch job jar, which seems to be using the Crawl.java
class. How do I use the crawl script on a Hadoop cluster? Are there any
Hi Iqbal,
> Am doing a POC to help decide if we should be using Nutch 1.9 or 2.2.1
> version.
>
> We would be indexing our crawled data in ElasticSearch 1.x version.
>
> I know the 2.2.1 version provides OTB support for Elastic 0.x version but
> to use 2.x I need to change the code (ElasticWriter.ja
Hi All,
Am doing a POC to help decide if we should be using Nutch 1.9 or 2.2.1 version.
We would be indexing our crawled data in ElasticSearch 1.x version.
I know the 2.2.1 version provides OTB support for Elastic 0.x version but to
use 2.x I need to change the code (ElasticWriter.java) This me
Hi, Nutch Gurus,
I have a use case that I need to implement and I hope that someone can help.
I have a situation where I need to generate and build URLs dynamically and pass
them to the respective filter.
I want to pass a newly constructed string to the Filter implementation
associated with re
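One way to sanity-check dynamically built URLs against the configured filter chain is the URLFilterChecker helper from Nutch 1.x (the class name and flag come from that era's `bin/nutch`; check your version, as flags may vary):

```shell
# Hypothetical example: stage dynamically built URLs, one per line, then
# feed them to the configured URL filter chain via URLFilterChecker.
printf '%s\n' \
  "http://example.com/page?id=1" \
  "http://example.com/page?id=2" > /tmp/generated-urls.txt

# Run against all activated filters (needs a Nutch runtime on hand):
# bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < /tmp/generated-urls.txt
wc -l < /tmp/generated-urls.txt   # 2 URLs staged for the filter
```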
Hi Meraj,
The generator will place all the URLs in a single segment if they all
belong to the same host, for politeness reasons. Otherwise it will use
whichever value is passed with the -numFetchers parameter in the generation
step.
Why don't you use the crawl script in /bin instead of tinkering wi
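Concretely, the generate step mentioned above might be invoked like this (paths and counts are placeholders; -numFetchers controls how many fetch lists, and hence fetch map tasks, are created):

```shell
# Hypothetical example: ask the generator for 4 fetch lists so the fetch
# phase runs as 4 map tasks (URLs from one host still stay together).
CRAWLDB="crawl/crawldb"        # placeholder crawldb path on HDFS
SEGMENTS="crawl/segments"      # placeholder segments directory
GEN_CMD="bin/nutch generate $CRAWLDB $SEGMENTS -topN 50000 -numFetchers 4"
echo "$GEN_CMD"   # run from runtime/deploy with HADOOP_HOME/bin on PATH
```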
Hi Lewis,
A few comments below.
I use Nutch 2.x as it enables me to do analytics over the data I am
> crawling. This is my justification for trying to maintain and further the
> development on that branch over the last while.
>
Just out of interest, what sort of analytics do you do and why is it