Hi Julien,

Thank you very much for your answer!
I see you noticed that I am French ;)
I have added my answers below, as you did:


> Bonjour Thomas

> answers below


> On 10 October 2013 13:10, Thomas COUDERC <[email protected]> wrote:

>> Hi everybody,
>>
>> I'm new to mailing lists so excuse me if I made a mistake. Also, I'm a new
>> dev contributor for Sauce Labs and DynamoDB subjects.
>>
>> I read all the tutorials for Nutch 2.x and I got Nutch 2.2.1 working with
>> Cassandra 1.2.8 using Gora 0.3.
>> I read that Nutch 2.2.1 (and previous versions) can be run on a Hadoop
>> cluster.
>> I also know that Gora manages some MapReduce operations for the backend.

> GORA wraps the content from the backends into inputs for MapReduce.

For which MapReduce tasks? Any of them (inject, generate, ...)?
Does Gora wrap the content from the backend without using any MapReduce itself?

>> I have two questions :
>>
>> 1/ If Nutch is deployed on a MapReduce cluster and, for example, HBase is
>> used as the datastore, where are the MapReduce tasks distributed? On the
>> Nutch Hadoop cluster or on HBase (via Gora)?

> It is not clear what you mean by distributed.
> Nutch uses Gora internally to pull the content from the backends, and this
> happens on the Hadoop side, so to speak, not within the backends.

I don't really understand what you mean. I think I am a bit confused by the
fact that a datastore can work on top of some MapReduce system (HBase,
maybe Cassandra too, ...) and that Nutch can also be deployed on top of
such a system. In that case, which one does Gora deal with?

>> 2/ In my case, I use Cassandra standalone. If I deploy Nutch 2.2.1 on a
>> Hadoop cluster, how many of Nutch can fetch URLs? (1 or all?)

> I don't understand what you mean by 'how many of Nutch'. The number of
> mappers used for the fetching depends on your configuration, the
> distribution of URLs and the configuration of the Hadoop cluster.

I thought that in a Nutch cluster there were as many Nutch instances as
machines. For example, with a 5-machine cluster, I thought that there
were 5 Nutch instances available, but I think I'm totally wrong. I don't really
understand how the Nutch .job file (in the deploy folder) works and what it
is for. I cannot find any information on that point.
In fact, the question was: can the mappers used for fetching be located on
each machine of the cluster, so that incoming network traffic can be seen
on each machine?
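
To make the "distribution of URLs" point above concrete, here is a minimal
sketch in Python. It is not Nutch code: it only assumes, for illustration,
that URLs are grouped by host before being handed to fetch tasks. The
function name partition_by_host, the task count and the sample URLs are all
hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse
import zlib


def partition_by_host(urls, num_tasks):
    """Group URLs into at most num_tasks buckets, keyed by a stable hash of the host.

    Hypothetical illustration only: every URL of a given host lands in the
    same bucket, so the number of non-empty buckets depends on how the URLs
    are spread across hosts, not only on num_tasks.
    """
    buckets = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc.lower()
        index = zlib.crc32(host.encode("utf-8")) % num_tasks
        buckets[index].append(url)
    return dict(buckets)


if __name__ == "__main__":
    # Made-up sample URLs, for illustration only.
    sample = [
        "http://a.example/page1",
        "http://a.example/page2",
        "http://b.example/",
        "http://c.example/index.html",
    ]
    for task, assigned in sorted(partition_by_host(sample, 5).items()):
        print(task, assigned)
```

Under that assumption, a crawl whose URLs all belong to a single host would
keep only one task busy, whatever the size of the cluster, while URLs spread
over many hosts can be shared among several tasks.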


Perhaps I am really confused about these 2 points:
 - For Nutch 2.2.1, when is MapReduce used (jobs/content retrieval/...)
and by whom (Nutch/Gora/datastore/...)?
 - Does it make sense to use a Nutch 2.2.1 cluster (on top of Hadoop) with
a NoSQL datastore (like HBase, which also uses Hadoop)? And why?

I will try to find this information in the source code or on the
internet over the next few days. If you have some links, they would really
help me.

Maybe I could summarize this information as diagrams for
the wiki.


Again, thank you very much for your help, Julien.


HTH

Julien



>
> Thank you for helping me, and excuse me for my poor English.
>
> Thomas


--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
