Good Luck,
On Fri, Aug 26, 2011 at 7:43 AM, MONTMORY Alain
<alain.montm...@thalesgroup.com> wrote:
Hi,
I will try to respond to your questions inline. I am not a Hadoop expert, but we are facing the same kind of problem (dealing with files that are external to HDFS) in our project, and we use Hadoop.
-----Original Message-----
From: Per Steffensen [mailto:st...@designware.dk]
Sent: Friday, August 26, 2011 13:13
To: mapreduce-user@hadoop.apache.org
Subject: From a newbie: Questions and will MapReduce fit our needs
Hi
We are considering using MapReduce for a project. I am participating in an "investigation" phase where we try to determine whether we would benefit from using the MapReduce framework.
A little bit about the project:
We will be receiving data from the "outside world" in files via FTP. It will be a mix of very small files (50 records/lines) and very big files (5 million+ records/lines). The FTP server will be running in a DMZ where we have no plans of using any Hadoop technology. For every file arriving over FTP we will add a message (just pointing to that file) to an MQ also running in the DMZ - how we do that is not relevant for my questions here.
In the secure zone of our system we plan to run many machines (shards if you like), acting among other things as consumers on the MQ in the DMZ. Their job will be, among other things, to "load" (store in a db, index etc.) the files pointed to by the messages they receive from the MQ. For reasonably small files they will probably just do the "loading" of the entire file themselves. For very big files we would like to have more machines/shards, than just the single machine/shard that happens to receive the corresponding message, participating in "loading" that particular file.
Questions:
- In general, do you think MapReduce will be beneficial for us to use? Please remember that the files to be "loaded" do not live on HDFS. Any description of why you would suggest that we use MapReduce will be very welcome.
Response: Yes, because you could treat the "big file" in parallel, and the parallelisation done by Hadoop is very effective. To treat your file you need an InputFormat class which is able to read it. Here, two solutions (see the sketch after this list):
1. You copy your file into the HDFS file system and use a "FileInputFormat" (for text-based files some are already provided by Hadoop). Inconvenient: the copy may be long... (in our case it is unacceptable) and this copy is an extra cost in the whole treatment.
2. You make your "BigFile" accessible via NFS or another shared FS from the Hadoop cluster nodes. The first job in your treatment pipeline reads the file and splits it by record offset *reference* (Output1: records from 0 to N, Output2: N to M, and so on...). On each OutputX a map task is then launched in parallel, which treats the file (still accessible through the shared FS) from record N to M according to the OutputX info.
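As an illustration of solution 2, here is a minimal sketch using the old (mapred) API. Assumptions that are not from the original mail: the big file is mounted at the same NFS path on every node, records are newline-terminated, and the first job has already written one line per range in the form "path<TAB>startByte<TAB>endByte"; the class name RangeMapper and the sample path are made up. With something like NLineInputFormat (one line per split), each such line becomes one map task that reads only its own byte range of the shared file:

import java.io.IOException;
import java.io.RandomAccessFile;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: each input line describes one "OutputX" range produced by the
// first job, e.g. "/nfs/data/big.txt<TAB>0<TAB>67108864" (path, start, end).
public class RangeMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, NullWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, NullWritable> out, Reporter reporter)
      throws IOException {
    String[] parts = value.toString().split("\t");
    String path = parts[0];               // NFS path, identical on every node
    long start = Long.parseLong(parts[1]);
    long end = Long.parseLong(parts[2]);

    RandomAccessFile raf = new RandomAccessFile(path, "r");
    try {
      raf.seek(start);
      if (start != 0) {
        raf.readLine();                   // skip partial record; the previous range handles it
      }
      String record;
      while (raf.getFilePointer() <= end && (record = raf.readLine()) != null) {
        // "Load" the record here (parse it, insert into the db, or emit it
        // for a later reduce phase); emitting it is just a placeholder.
        out.collect(new Text(record), NullWritable.get());
      }
    } finally {
      raf.close();
    }
  }
}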
- Reading about MapReduce, it sounds like a general framework able to split a "big job" into many smaller "sub-jobs" and have those "sub-jobs" executed concurrently (potentially on different machines), all in all to complete the "big job". This could be used for many other things than "working with files", but then again the examples and some of the descriptions make it sound like it is all only about "jobs working with files". Is MapReduce only useful/concerned with "jobs" related to "working with files", or is it more general-purpose, so that it is useful for any split-big-job-into-many-smaller-jobs-and-have-those-executed-in-parallel problem?
Response: Hadoop is not specialised only in files (although I think that is 99% of its usage...). As I said before, your input is made accessible through the InputFormat interface.
- I believe we will end up having a HDFS over the disks on the machines/shards in the secure zone. Is HDFS a "must have" for MapReduce to work at all? E.g. HDFS might be the way sub-jobs are distributed and/or persisted (so that they will not be forgotten in case of a shard breakdown or something).
Response: Hadoop can work on other file systems (Amazon S3 for example), or with other styles of input (like a NoSQL Cassandra table), but I think you still need at least a small HDFS to store the working space of running jobs. I think most usage relies on HDFS, which takes care of data locality: the JobTracker launches tasks on the nodes which hold the data on their local disks, to avoid network exchange...
- I think it sounds like an overhead to copy the big file (it will have to be deleted after successful "loading") from the FTP server disk in the DMZ to the HDFS in the secure zone, just to be able to use MapReduce to distribute the work of "loading" it. We might want to do it in a way so that each "sub-job" (of a "big job" about loading e.g. a big file big.txt) just points to big.txt together with from- and to-indexes into the file. Each "sub-job" will then only have to read the part of big.txt from from-index to to-index and "load" that. Will we be able to do something like that using MapReduce, or is it all kind of "based on operating on files on the HDFS"?
Response: I don't clearly understand everything you said, but it sounds to me not far from the solution we use, which I proposed to you in the previous response.
- Depending on the answer to the above question, we might want to be able to make the disk on the FTP server "join" the HDFS, in a way so that it is visible, but so that data on it will not get copied in several copies (for redundancy matters) throughout the disks on the shards (the "real" part of the HDFS) - remember the file will have to be deleted as soon as it has been "loaded". Is there such a concept/possibility of making an "external" disk visible from HDFS, to enable MapReduce to work on files on such disks, without the files on such disks automatically being copied to several different other disks (on the shards)?
Response: Hadoop jobs are (generally) Java jobs, so it is still possible to open files external to HDFS provided they can be accessed (through NFS or another shared FS (GlusterFS, GPFS, etc.))..
- As I understand it, each "sub-job" (the result of the split operation) will be run on a new dedicated JVM. It sounds like a big overhead to start a new JVM just to run a "small" job. Is it correct that each "sub-job" will run on its own new JVM that has to be started for that purpose only? If yes, it seems to me like the overhead is only "worth it" for fairly large "sub-jobs". Do you agree?
Response: Due to the Hadoop overhead of launching a task on a TaskTracker, it is not recommended to have tasks running for less than a minute. In the proposed solution you could adjust the running time via the number of records treated in one OutputX split... Remember that the tasks are launched on different computers. With a modern Java JVM the overhead of launching a JVM is not so heavy. Hadoop can (since 0.19) reuse JVMs which already exist to launch similar tasks; see the mapred.job.reuse.jvm.num.tasks property (minimal example below).
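Just as a sketch of that property (the class name is made up; with the old mapred API this is the programmatic equivalent of setting mapred.job.reuse.jvm.num.tasks in the job configuration):

import org.apache.hadoop.mapred.JobConf;

public class JvmReuse {
  public static void enableJvmReuse(JobConf conf) {
    // -1 = reuse one JVM for an unlimited number of this job's tasks on a
    // TaskTracker; the default is 1 (a fresh JVM per task).
    conf.setNumTasksToExecutePerJvm(-1);
  }
}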
If yes, I find the "WordCount" example on http://hadoop.apache.org/common/docs/current/mapred_tutorial.html kinda stupid, because it seems like each "sub-job" is only about handling one single line, and that seems to me to be way too small a "sub-job" to make it "worth the effort" to move it to a remote machine and start a new JVM to handle it. Do you agree that it is stupid (yes, it is just an example, I know), or what did I miss?
Response: 99% of the examples deal with word count... it is a big problem I had to face when I began with Hadoop... and yes, one task to treat one line is not efficient (see the response above...).
- Finally, with respect to side effects. When handling the files we plan to load the records in the files into some kind of database (maybe several instances of a database). It is important that each record only gets inserted into one database once. As I understand it, MapReduce will make every "sub-job" run in several instances concurrently on several different machines, in order to make sure that it is finished quickly even if one of the attempts to handle the particular "sub-job" fails. Is that true?
If yes, isn't that a big problem with respect to "sub-jobs" with side effects (like inserting into a database)? Or is there some kind of built-in assumption that all side effects are done on HDFS, and that HDFS supports some kind of transaction handling, so that it is easy for MapReduce to roll back the side effects of one of the "identical" sub-jobs if two should both succeed?
In general, is it a built-in thing that each sub-job runs in one single transaction, so that it is not possible that a sub-job will "partly" succeed and "partly" fail (e.g. if it has to load 10000 records into a database and succeeds with 9999 of those, it might be stupid to roll it all back in order to try it all over again)?
Response: Have a look at Apache Sqoop; maybe it could help you import/export data into a database. Otherwise you could add a reduce phase to your treatment: in the reduce, the input keys are sorted (and grouped) for the whole data set, and then you can deal with "will only get inserted into one database once" (sketch below).
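A minimal sketch of that idea, assuming the map phase emits each record under a unique record key (the class name and the commented-out insert helper are hypothetical): because the shuffle groups all occurrences of a key together, the reducer sees each record key exactly once and can do the single insert there.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SingleInsertReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, NullWritable> {

  public void reduce(Text recordKey, Iterator<Text> values,
                     OutputCollector<Text, NullWritable> out, Reporter reporter)
      throws IOException {
    // All duplicates of recordKey arrive here together, so the record can be
    // inserted into the database exactly once. A real job should still make
    // the insert idempotent, because task attempts can be re-executed on
    // failure or by speculative execution.
    Text record = values.next();
    // insertIntoDatabase(recordKey, record);   // hypothetical JDBC helper
    out.collect(recordKey, NullWritable.get());
  }
}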
I know my English is not perfect, but I hope you at least get the essence of my questions. I hope you will try to answer all the questions, even though some of them might seem stupid to you. Remember that I am a newbie :-) I have been running through the FAQ, but didn't find any answers to my questions (maybe because they are stupid :-) ). I wasn't able to search the archives of the mailing list, so I quickly gave up finding my answers in "old threads". Can someone point me to a way of searching the archives?
Response: My English is not perfect either!
Extra advice: use a 0.20.xxx version (we use a 0.20.2 Cloudera distribution) and the old API (the 0.21 version and the new API (mapreduce package) are not yet complete and stable; see Todd Lipcon's advice..). Don't be afraid of the many deprecated classes when using the old API... they are not so deprecated. I spent a lot of time at the beginning trying to use the new API..
The Hadoop framework is not so simple to handle. If your files contain text information, consider using a high-level tool like Pig or Hive. If your files contain binary information, consider using the Cascading library (http://www.cascading.org). For us it dramatically simplified the writing (but we have complex queries to do on the binary data held in Hadoop); it depends on the kind of treatment you have to perform...
Hope my response can help you..
Regards,
Alain Montmory
(Thales company)
Regards, Per Steffensen