Hi Julien,

Thank you very much for your answer! I see you noticed I am French ;)
I have added my replies inline below, as you did:

> Bonjour Thomas
>
> answers below
>
> On 10 October 2013 13:10, Thomas COUDERC <[email protected]> wrote:
>> Hi everybody,
>>
>> I'm new to mailing lists so excuse me if I made a mistake. Also, I'm a new
>> dev contributor for Sauce Labs and DynamoDB subjects.
>>
>> I read all tutorials for Nutch 2.x and I made Nutch 2.2.1 working with
>> cassandra 1.2.8 using gora 0.3.
>> I read that Nutch 2.2.1 (and previous versions) can be run on a Hadoop
>> cluster.
>> I also know that gora manage some map reduce operations for backend.
>
> GORA wraps the content from the backends into inputs for Mapreduce.

For which MapReduce tasks? Any task (inject, generate, ...)? Does Gora wrap
the content from the backend without using any MapReduce?

>> I have two questions :
>>
>> 1/ If Nutch is deployed on a Map Reduce cluster, and for example Hbase is
>> used as datastore, where are the Map Reduce tasks distributed? Nutch
>> hadoop cluster or HBase (via Gora).
>
> not clear what you mean by distributed.
> Nutch uses Gora internally to pull the content from the backends and this
> happens on the Hadoop side so to speak, not within the backends.

I don't really understand what you mean. I think I am confused by the fact
that a datastore can itself work on top of a MapReduce system (HBase, maybe
Cassandra too, ...) and that Nutch can also be deployed on top of such a
system. In that case, which of the two does Gora deal with?

>> 2/ In my case, I use Cassandra standalone. If I deploy Nutch 2.2.1 on an
>> Hadoop Cluster, how many of Nutch can fetch URLs? (1 or all?)
>
> I don't understand what you mean by 'how many of Nutch'. The number of
> mappers used for the fetching depends on your configuration, the
> distribution of URLs and the configuration of the Hadoop cluster.

I thought that a Nutch cluster ran as many Nutch instances as it has
machines. For example, with a 5-machine cluster, I expected 5 Nutch
instances to be available, but I think I am totally wrong.
I don't really understand how the Nutch .job file (in the deploy folder)
works or what it represents, and I cannot find any information on that
point. In fact, the question was: can the mappers used for fetching be
located on every machine of the cluster, so that incoming network traffic
is visible on each machine?

Maybe I am really confused on these 2 points:

- For Nutch 2.2.1, when is MapReduce used (jobs, content retrieval, ...)
  and by which component (Nutch, Gora, the datastore, ...)?
- Does it make sense to run a Nutch 2.2.1 cluster (on top of Hadoop) that
  uses a NoSQL datastore (like HBase, which also uses Hadoop)? And why?

I will try to find this information in the source code or on the internet
over the next few days. If you have any links, they would really help me.
Maybe I could then summarize this information as diagrams for the wiki.

Again, thank you very much for your help, Julien.

> HTH
>
> Julien
>
>> Thank you for helping me, and excuse me for my poor English.
>>
>> Thomas
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

