On Thu, Jan 23, 2014 at 1:36 PM, d_k <mail...@gmail.com> wrote:

> My main concern with the Nutch2Tutorial was that it didn't stand by
> itself. As a newcomer to Nutch I treated the NutchTutorial (for 1.x) with
> suspicion because I didn't know what is relevant for Nutch 2 and what isn't.
> And the Nutch2Tutorial alone is not enough to get you going.
>
> I think this can be addressed by creating a single page or perhaps several
> pages that together cover everything you need to perform a basic crawl:
>
> [*] Configuring the data store
> [**] HBase
> [**] Cassandra
> [*] General Nutch 2 client configuration that is relevant to any store
[1] : http://wiki.apache.org/nutch/Nutch2Tutorial
[2] : http://wiki.apache.org/nutch/Nutch2Cassandra

> [**] MySQL

It is no longer supported in Gora and new Nutch versions, so there is no wiki page for it.

> [*] Crawling
> [**] Crawling step by step (running each step separately)
> [**] Performing a full crawl
> [***] using the crawl script
> [***] using the job file

The commands are the same as in 1.x. The only change needed is in the arguments, which can be worked out from each command's usage message.

Having everything in one place would make things neat. AFAIK, the reason this was not done before was maintenance overhead. If you want to create such a page, feel free to add it. You would need to create a login for the Nutch wiki. If there are issues with that, just share the document in text format and I will add it to the Nutch wiki.

~tejas

> On Wed, Jan 22, 2014 at 1:53 PM, Julien Nioche
> <lists.digitalpeb...@gmail.com> wrote:
>
>> Thanks Tejas!
>>
>> On 22 January 2014 11:51, Tejas Patil <tejas.patil...@gmail.com> wrote:
>>
>>> Moved the old nutchhadooptutorial page from Nutch wiki "Front page" to
>>> "Archive and Legacy".
>>>
>>> ~tejas
>>>
>>> On Wed, Jan 22, 2014 at 5:09 PM, Tejas Patil
>>> <tejas.patil...@gmail.com> wrote:
>>>
>>>> Thanks *Julien* for pointing me to the new "NutchHadoopSingleNodeTutorial"
>>>> wiki page [0]. I will soon remove the old nutchhadooptutorial page
>>>> from the wiki.
>>>>
>>>> [0] : http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial
>>>>
>>>> *@d_k*, there are already tutorials for running Nutch 2.x. See [1] and
>>>> [2]. Those are not as extensive as the tutorial for 1.x [3] but carry the
>>>> steps which are different for 2.x. The remaining steps after datastore
>>>> setup are similar - the only difference being in the command params, which
>>>> can be figured out from the usage, and so they were not duplicated in those
>>>> 2.x tutorials to avoid maintenance overhead.
>>>> Do you think that the 2.x tutorials are inadequate in some regards?
>>>>
>>>> [1] : http://wiki.apache.org/nutch/Nutch2Tutorial
>>>> [2] : http://wiki.apache.org/nutch/Nutch2Cassandra
>>>> [3] : http://wiki.apache.org/nutch/NutchTutorial
>>>>
>>>> Thanks,
>>>> Tejas
>>>>
>>>> On Wed, Jan 22, 2014 at 2:47 AM, d_k <mail...@gmail.com> wrote:
>>>>
>>>>> Actually what I would like to see is a Nutch 2.x tutorial at the same
>>>>> level of detail as the
>>>>> http://wiki.apache.org/nutch/NutchHadoopTutorial
>>>>> What is the process of contributing to that wiki page?
>>>>>
>>>>> On Tue, Jan 21, 2014 at 9:33 PM, Julien Nioche
>>>>> <lists.digitalpeb...@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> The whole thing has been replaced with
>>>>>> http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial
>>>>>> which does exactly what you described. +1 to remove the old
>>>>>> nutchhadooptutorial page
>>>>>>
>>>>>> J.
>>>>>>
>>>>>> On 21 January 2014 17:44, Tejas Patil <tejas.patil...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi nutch-dev,
>>>>>>>
>>>>>>> I was looking at [0] and realized that with the massive number of
>>>>>>> Hadoop setup tutorials out there on the internet, we need not repeat
>>>>>>> the same on the Nutch wiki page and can instead assume that the user
>>>>>>> has already done the Hadoop setup. For convenience, we could direct
>>>>>>> users to the Hadoop wiki page which has Hadoop setup details.
>>>>>>> Plus, I propose the following:
>>>>>>>
>>>>>>> - Section "Downloading Hadoop and Nutch": Remove the Hadoop
>>>>>>> portions and let the Nutch stuff stay.
>>>>>>> - Section "Setting Up The Deployment Architecture" must be removed.
>>>>>>> - Section "Deploy Nutch to Single Machine" and "Deploy Nutch to
>>>>>>> Multiple Machines" can be merged together.
>>>>>>> - Section "Performing a Nutch Crawl", "Testing the Crawl" and
>>>>>>> "Performing a Search" must be merged, and their contents must be
>>>>>>> updated.
>>>>>>> - Section "Rsyncing Code to Slaves" and "Updates" can be completely
>>>>>>> removed.
>>>>>>>
>>>>>>> Any comments?
>>>>>>>
>>>>>>> [0] : http://wiki.apache.org/nutch/NutchHadoopTutorial
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Tejas
>>>>>>
>>>>>> --
>>>>>> Open Source Solutions for Text Engineering
>>>>>>
>>>>>> http://digitalpebble.blogspot.com/
>>>>>> http://www.digitalpebble.com
>>>>>> http://twitter.com/digitalpebble