On Mon, Aug 29, 2011 at 2:04 AM, Per Steffensen <st...@designware.dk> wrote:
> Can you point me to a good place to read about Sqoop? I only find http://incubator.apache.org/projects/sqoop.html and https://cwiki.apache.org/confluence/display/SQOOP. There is really not much to find about what Sqoop can do, how to use it, etc.
Please see the Sqoop user guide: http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html

Thanks,
Arvind

> Regards, Per Steffensen
>
> Peyman Mohajerian wrote:
>
> Hi,
>
> You should definitely take a look at Apache Sqoop, as previously mentioned. If your file is large enough and you have several map tasks running and hitting your database concurrently, you will experience issues at the db level.
> As for speculative tasks (redundant tasks) being run to deal with slow tasks, you have control over that in Hadoop. You can turn off speculative execution, or make sure that when one task has finished, the other one for the same input is shut down.
>
> Good luck,
>
> On Fri, Aug 26, 2011 at 7:43 AM, MONTMORY Alain <alain.montm...@thalesgroup.com> wrote:
>>
>> Hi,
>>
>> I will try to respond to your questions in the text below. I am not a Hadoop expert, but we are facing the same kind of problem (dealing with files that are external to HDFS) in our project, and we use Hadoop.
>>
>> -----Original Message-----
>> From: Per Steffensen [mailto:st...@designware.dk]
>> Sent: Friday, August 26, 2011 1:13 PM
>> To: mapreduce-user@hadoop.apache.org
>> Subject: From a newbie: Questions and will MapReduce fit our needs
>>
>> Hi
>>
>> We are considering using MapReduce for a project. I am participating in an "investigation" phase where we try to determine whether we would benefit from using the MapReduce framework.
>>
>> A little bit about the project:
>> We will be receiving data from the "outside world" in files via FTP. It will be a mix of very small files (50 records/lines) and very big files (5 million+ records/lines). The FTP server will be running in a DMZ where we have no plans of using any Hadoop technology. For every file arriving over FTP we will add a message (just pointing to that file) to an MQ also running in the DMZ - how we do that is not relevant for my questions here. In the secure zone of our system we plan to run many machines (shards if you like) which will, among other things, be consumers on the MQ in the DMZ. Their job will be, among other things, to "load" (store in a db, index, etc.) the files pointed to by the messages they receive from the MQ. For reasonably small files they will probably just do the "loading" of the entire file themselves. For very big files we would like to have more machines/shards participating in "loading" that particular file than just the single machine/shard that happens to receive the corresponding message.
>>
>> Questions:
>>
>> - In general, do you think MapReduce will be beneficial for us to use? Please remember that the files to be "loaded" do not live on HDFS. Any description of why you would suggest that we use MapReduce will be very welcome.
>>
>> Response: Yes, because you can process the "big file" in parallel, and the parallelisation done by Hadoop is very effective. To process your file you need an InputFormat class which is able to read it. There are two solutions here:
>>
>> 1. You copy your file into the HDFS file system and use a "FileInputFormat" (for text-based files some are already provided by Hadoop). The inconvenience is that the copy may take a long time… (in our case it is unacceptable) and this copy is an extra cost in the whole treatment.
>>
>> 2. You make your "big file" accessible via NFS or another shared FS from the Hadoop cluster nodes. The first job in your treatment pipeline reads the file and splits it by record offset reference (Output1: records from 0 to N, Output2: N to M, and so on…). For each OutputX a map task is launched in parallel which processes the file (still accessible through the shared FS) from record N to M according to the OutputX info.
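>>
>> A minimal sketch of such a range-reading map task (old API, untested). The "path;start;end" line format produced by the first splitter job, and the class and field names, are only assumptions to illustrate the idea; the real "loading" (db insert, indexing) would replace the collect() call:
>>
>> import java.io.IOException;
>> import java.io.RandomAccessFile;
>>
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.NullWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapred.MapReduceBase;
>> import org.apache.hadoop.mapred.Mapper;
>> import org.apache.hadoop.mapred.OutputCollector;
>> import org.apache.hadoop.mapred.Reporter;
>>
>> // Each input line describes one range of the external file, e.g.
>> // "/nfs/incoming/big.txt;0;1048576" (path;startByte;endByte). The offsets
>> // are assumed to fall on record (line) boundaries.
>> public class RangeLoadMapper extends MapReduceBase
>>     implements Mapper<LongWritable, Text, Text, NullWritable> {
>>
>>   public void map(LongWritable key, Text value,
>>                   OutputCollector<Text, NullWritable> output, Reporter reporter)
>>       throws IOException {
>>     String[] parts = value.toString().split(";");
>>     String path = parts[0];
>>     long start = Long.parseLong(parts[1]);
>>     long end = Long.parseLong(parts[2]);
>>
>>     // The big file lives on a shared FS (NFS, GlusterFS, ...) mounted on
>>     // every node, so plain java.io is enough; no copy into HDFS is needed.
>>     RandomAccessFile raf = new RandomAccessFile(path, "r");
>>     try {
>>       raf.seek(start);
>>       String record;
>>       while (raf.getFilePointer() < end && (record = raf.readLine()) != null) {
>>         // "load" the record here (insert into the db, index it, ...);
>>         // emitting it keeps the sketch simple
>>         output.collect(new Text(record), NullWritable.get());
>>         reporter.progress(); // keep the task alive during slow loading work
>>       }
>>     } finally {
>>       raf.close();
>>     }
>>   }
>> }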
>>
>> - Reading about MapReduce, it sounds like a general framework able to split a "big job" into many smaller "sub-jobs" and have those "sub-jobs" executed concurrently (potentially on different machines), all in all to complete the "big job". This could be used for many other things than "working with files", but then again the examples and some of the descriptions make it sound like it is all only about "jobs working with files". Is MapReduce only useful for/concerned with "jobs" related to "working with files", or is it more general-purpose, so that it is useful for any split-big-job-into-many-smaller-jobs-and-have-those-executed-in-parallel problem?
>>
>> Response: Hadoop is not only specialised in working with files (though I think that is 99% of its usage…). As I said before, your input is made accessible through the InputFormat interface.
>>
>> - I believe we will end up having an HDFS over the disks on the machines/shards in the secure zone. Is HDFS a "must have" for MapReduce to work at all? E.g. HDFS might be the way sub-jobs are distributed and/or persisted (so that they will not be forgotten in case of a shard breakdown or something).
>>
>> Response: Hadoop can work on other file systems (Amazon S3 for example), or with other styles of input (like a NoSQL Cassandra table), but I think there is still a need for at least a small HDFS to store the working space of running jobs. I think most usage relies on HDFS, which takes care of data localisation: the JobTracker launches tasks on the node which holds the data on its local disk, to avoid network exchange…
>>
>> - I think it sounds like an overhead to copy the big file (it will have to be deleted after successful "loading") from the FTP server disk in the DMZ to the HDFS in the secure zone, just to be able to use MapReduce to distribute the work of "loading" it. We might want to do it in a way so that each "sub-job" (of a "big job" about loading e.g. a big file big.txt) just points to big.txt together with from- and to-indexes into the file. Each "sub-job" will then only have to read the part of big.txt from from-index to to-index and "load" that. Will we be able to do something like that using MapReduce, or is it all kind of "based on operating on files on the HDFS"?
>>
>> Response: I don't clearly understand everything you said, but it sounds to me not far from the solution we use and that I proposed to you in the previous response.
>>
>> - Depending on the answer to the above question, we might want to be able to make the disk on the FTP server "join" the HDFS, in a way so that it is visible, but so that data on it will not get copied in several copies (for redundancy reasons) throughout the disks on the shards (the "real" part of the HDFS) - remember the file will have to be deleted as soon as it has been "loaded". Is there such a concept/possibility of making an "external" disk visible from HDFS, to enable MapReduce to work on files on such disks, without the files on such disks automatically being copied to several different other disks (on the shards)?
>>
>> Response: Hadoop jobs are (generally) Java jobs, so it is still possible to open files external to HDFS, provided they can be accessed (through NFS or another shared FS (GlusterFS, GPFS, etc.))..
>>
>> - As I understand it, each "sub-job" (the result of the split operation) will be run on a new dedicated JVM. It sounds like a big overhead to start a new JVM just to run a "small" job. Is it correct that each "sub-job" will run on its own new JVM that has to be started for that purpose only? If yes, it seems to me like the overhead is only "worth it" for fairly large "sub-jobs". Do you agree?
>>
>> Response: Due to the Hadoop overhead of launching a task on a TaskTracker, it is not recommended to have tasks running for less than a minute. In the proposed solution we can adjust the time via the number of records treated in one OutputX split… Remember that the tasks are launched on different computers. With a modern Java VM the overhead of launching a JVM is not so heavy, and Hadoop tries (since 0.19) to reuse JVMs which already exist to launch similar tasks; see the mapred.job.reuse.jvm.num.tasks property.
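>>
>> For example, a driver for the range-loading job could set this up roughly like the following (untested sketch, old API, Hadoop 0.20.x; class names, paths and values are illustrative only). It also switches off speculative execution, as Peyman suggested, since the map tasks hit the database directly:
>>
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.NullWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapred.FileInputFormat;
>> import org.apache.hadoop.mapred.FileOutputFormat;
>> import org.apache.hadoop.mapred.JobClient;
>> import org.apache.hadoop.mapred.JobConf;
>> import org.apache.hadoop.mapred.TextInputFormat;
>> import org.apache.hadoop.mapred.TextOutputFormat;
>>
>> public class RangeLoadDriver {
>>   public static void main(String[] args) throws Exception {
>>     JobConf conf = new JobConf(RangeLoadDriver.class);
>>     conf.setJobName("range-load");
>>
>>     conf.setMapperClass(RangeLoadMapper.class);  // the mapper sketched earlier
>>     conf.setNumReduceTasks(0);                   // map-only job
>>     conf.setInputFormat(TextInputFormat.class);  // reads the small list of ranges
>>     conf.setOutputFormat(TextOutputFormat.class);
>>     conf.setOutputKeyClass(Text.class);
>>     conf.setOutputValueClass(NullWritable.class);
>>
>>     // reuse an already started task JVM for up to 10 tasks of this job
>>     // (-1 would mean "no limit")
>>     conf.setInt("mapred.job.reuse.jvm.num.tasks", 10);
>>
>>     // the map tasks have side effects (db inserts), so make sure the same
>>     // range is never processed twice in parallel by a speculative duplicate
>>     conf.setMapSpeculativeExecution(false);
>>     conf.setReduceSpeculativeExecution(false);
>>
>>     FileInputFormat.setInputPaths(conf, new Path(args[0]));  // range list on HDFS
>>     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>>     JobClient.runJob(conf);
>>   }
>> }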
>>
>> If yes, I find the "WordCount" example on http://hadoop.apache.org/common/docs/current/mapred_tutorial.html kind of stupid, because it seems like each "sub-job" is only about handling one single line, and that seems to me to be way too small a "sub-job" to make it "worth the effort" to move it to a remote machine and start a new JVM to handle it. Do you agree that it is stupid (yes, it is just an example, I know), or what did I miss?
>>
>> Response: 99% of the examples deal with word count… it is a problem I also had to face when I began with Hadoop… and yes, one task to treat one line is not efficient (see the response above…).
>>
>> - Finally, with respect to side effects. When handling the files we plan to load the records in the files into some kind of database (maybe several instances of a database). It is important that each record will only get inserted into one database once. As I understand it, MapReduce will make every "sub-job" run in several instances concurrently on several different machines, in order to make sure that it is finished quickly even if one of the attempts to handle the particular "sub-job" fails. Is that true?
>> If yes, isn't that a big problem with respect to "sub-jobs" with side effects (like inserting into a database)? Or is there some kind of built-in assumption that all side effects are done on HDFS, and that HDFS supports some kind of transaction handling, so that it is easy for MapReduce to roll back the side effects of one of the "identical" sub-jobs if two should both succeed?
>> In general, is it a built-in thing that each sub-job runs in one single transaction, so that it is not possible that a sub-job will "partly" succeed and "partly" fail (e.g. if it has to load 10000 records into a database and succeeds with 9999 of those, it might be stupid to roll it all back in order to try it all over again)?
>>
>> Response: Have a look at Apache Sqoop; maybe it could help you import/export data into a database. Otherwise you could add a reduce phase to your treatment; in the reduce, the input keys are sorted (and grouped) over the whole data set, and then you can deal with the "will only get inserted into one database once" requirement there.
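>>
>> For example (untested sketch, old API; the choice of key and what you do with the record are up to you): if the mappers emit a (recordKey, record) pair per record, each distinct key reaches exactly one reduce() call, which gives you a single place to decide that a record is written only once:
>>
>> import java.io.IOException;
>> import java.util.Iterator;
>>
>> import org.apache.hadoop.io.NullWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapred.MapReduceBase;
>> import org.apache.hadoop.mapred.OutputCollector;
>> import org.apache.hadoop.mapred.Reducer;
>> import org.apache.hadoop.mapred.Reporter;
>>
>> public class DedupReducer extends MapReduceBase
>>     implements Reducer<Text, Text, Text, NullWritable> {
>>
>>   public void reduce(Text recordKey, Iterator<Text> records,
>>                      OutputCollector<Text, NullWritable> output, Reporter reporter)
>>       throws IOException {
>>     if (records.hasNext()) {
>>       Text first = records.next();  // keep one copy, drop any duplicates
>>       // insert `first` into the database here, or just emit it and let a
>>       // separate loader (or a Sqoop export) handle the insert
>>       output.collect(first, NullWritable.get());
>>     }
>>   }
>> }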
>>
>> I know my English is not perfect, but I hope you at least get the essence of my questions. I hope you will try to answer all the questions, even though some of them might seem stupid to you. Remember that I am a newbie :-) I have been running through the FAQ, but didn't find any answers to my questions (maybe because they are stupid :-) ). I wasn't able to search the archives of the mailing list, so I quickly gave up finding my answers in "old threads". Can someone point me to a way of searching the archives?
>>
>> Response: My English is not perfect either! Extra advice: use a 0.20.xxx version (we use a 0.20.2 Cloudera distribution) and the old API (the 0.21 version and the new API (the mapreduce package) are not yet complete and stable; see Todd Lipcon's advice..). Don't be put off by the many deprecated classes when using the old API… they are not really that deprecated. I spent a lot of time at the beginning trying to use the new API..
>> The Hadoop framework is not so simple to handle. If your file contains text information, consider using a high-level tool like Pig or Hive. If your file contains binary information, consider using the Cascading (www.cascading.org) library. For us it dramatically simplified the writing (but we have complex queries to run on the binary data held in Hadoop); it depends on the kind of treatment you have to perform…
>>
>> I hope my response helps you..
>> Regards, Alain Montmory
>> (Thales company)
>>
>> Regards, Per Steffensen