Sending data to all reducers

2012-08-23 Thread Hamid Oliaei
Hi, I want to broadcast some data to all nodes under Hadoop 0.20.2. I tested the DistributedCache module. Unfortunately, it was too time-consuming, and runtime is important for my work. I want to write an MR job so that a copy of the input data is generated in the output of every reducer. Is that possible? How? I

Re: Sending data to all reducers

2012-08-23 Thread Tim Robertson
So you are trying to run a single reducer on each machine, and all input data regardless of its location gets streamed to each reducer? On Thu, Aug 23, 2012 at 10:41 AM, Hamid Oliaei oli...@gmail.com wrote: Hi, I want to broadcast some data to all nodes under Hadoop 0.20.2. I tested

Re: Sending data to all reducers

2012-08-23 Thread Tim Robertson
Sorry to ask too many questions, but it will help the user list offer you the best advice, as this is not a typical MR use case. - Do you foresee the reducer storing the data on the machine's local file system? - Do you need to use specific input formats for the job, or is it really just text

Re: Sending data to all reducers

2012-08-23 Thread Hamid Oliaei
Hi, First of all, thank you Tim for giving your time. The answer to the first question is yes. My inputs are triples (sub,pre,obj) and they are stored on HDFS. The problem is: after running some MR jobs, some data is generated on all machines, and I want each machine to send part of

Re: Sending data to all reducers

2012-08-23 Thread Tim Robertson
Then I think you might be best exploring running a getmerge on each client. How you trigger that is up to you, but something like Fabric [1] might help. Others might propose different solutions, but it doesn't sound like MR is a natural choice to me. I would expect this is the very fastest way
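
A minimal sketch of the getmerge suggestion above, in Python. It only wraps the `hadoop fs -getmerge` shell command; the function names are made up for illustration, any HDFS directory or local file passed in is the caller's choice, and how the script gets triggered on each machine (Fabric or otherwise) is left out.

```python
import subprocess


def getmerge_command(hdfs_dir, local_file):
    """Build the shell command that concatenates every file under
    hdfs_dir into a single file on the local disk."""
    return ["hadoop", "fs", "-getmerge", hdfs_dir, local_file]


def fetch_merged(hdfs_dir, local_file, run=subprocess.check_call):
    """Run getmerge on this machine; check_call raises if the command fails.

    `run` is injectable (an assumption of this sketch) so the function
    can be exercised on a machine without Hadoop installed.
    """
    run(getmerge_command(hdfs_dir, local_file))
    return local_file
```

Each node would call `fetch_merged` with the job's output directory and a local target path, which pulls the data straight from HDFS instead of routing it through reducers.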

Re: Sending data to all reducers

2012-08-23 Thread Hamid Oliaei
Hi, I will take a look at that; I hope it will be useful for my purpose. Thank you so much. Hamid

Reg: when failures on writing to DB from map\reduce

2012-08-23 Thread Manoj Babu
Hi All, In Sqoop, when exporting from HDFS to a DB: if an export map task fails due to these or other reasons, it will cause the export job to fail. The results of a failed export are undefined. Each export map task operates in a separate transaction. Furthermore, individual map tasks commit their

Re: Sending data to all reducers

2012-08-23 Thread Sonal Goyal
Hamid, I would recommend taking another look at your current algorithm and making sure you are using the MR framework to its strengths. You can evaluate having multiple passes for your map reduce program, or doing a map side join. You mention runtime is important for your system, so make sure you

Re: Question Regarding FileAlreadyExistsException

2012-08-23 Thread Bertrand Dechoux
I don't think so. The client is responsible for deleting the resource beforehand if it might already exist. Correct me if I am wrong. Higher-level solutions (such as Cascading) usually provide a way to define a strategy to handle it: KEEP, REPLACE, UPDATE ...
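
A rough sketch of the "client deletes the output first" approach, written as a Python wrapper around the Hadoop 0.20/1.x shell commands `hadoop fs -test -e` and `hadoop fs -rmr`. The function name and the injectable `run` parameter are assumptions made for illustration; this is not something the framework provides.

```python
import subprocess


def prepare_output_dir(hdfs_path, run=subprocess.call):
    """Delete hdfs_path if it already exists.

    Returns True when a directory was removed, False when nothing was there.
    `run` must return the command's exit code (subprocess.call does).
    """
    # `hadoop fs -test -e` exits with 0 when the path exists.
    exists = run(["hadoop", "fs", "-test", "-e", hdfs_path]) == 0
    if exists:
        # Recursive delete; on newer releases the equivalent is `-rm -r`.
        if run(["hadoop", "fs", "-rmr", hdfs_path]) != 0:
            raise RuntimeError("could not delete " + hdfs_path)
    return exists
```

Calling this right before job submission avoids the FileAlreadyExistsException, at the cost of silently discarding earlier results, so it only suits output that is safe to overwrite.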

Re: Question Regarding FileAlreadyExistsException

2012-08-23 Thread Harsh J
I think this specific behavior irritates a lot of new users. We may as well provide a Generic Option to overwrite the output directory if set. That way, we at least help avoid typing a whole delete command. If you agree, please file an improvement request against the MAPREDUCE project on the ASF JIRA.

Re: Question Regarding FileAlreadyExistsException

2012-08-23 Thread Daniel Hoffman
Well, I'm using the MultipleOutputs capability to create a directory structure with dates. So I'm managing this myself. What I've found, and I could be doing this wrong... is that I still have to tell the Tool that I want to use a TextOutputFormat or a FileOutputFormat, and then, have to tell

Re: Question Regarding FileAlreadyExistsException

2012-08-23 Thread Harsh J
Daniel, Perhaps you want your OutputFormat set as NullOutputFormat. That does not carry any checks for output directory pre-existence. On Thu, Aug 23, 2012 at 9:47 PM, Daniel Hoffman hoffmandani...@gmail.com wrote: Well, I'm using the MultipleOutputs capability to create a directory Structure

Re: Side-loading output from one MR into another?

2012-08-23 Thread Michael Parker
Thanks for the prompt reply! Unfortunately, it's not that small. I'm using the new API; are map side joins accomplished using http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/contrib/utils/join/package-summary.html? Are there any examples which use this package or map side

Running map tasks after all reduces have finished

2012-08-23 Thread Jan Lukavský
Hi all, we are seeing strange behaviour of JobTracker in the following scenario: - job finishes map phase and starts reduce - after the shuffle phase of all reducers we lose a tasktracker that isn't running any reducer - so all remaining reducers are still running in the reduce phase - map

Re: Customized input format

2012-08-23 Thread Dino Kečo
Hi, There is a good example here: http://hadoopchicago.com/tips-tricks/custom-xmlreader-boris-lublinsky-michael-segel/ Regards, Dino Kečo msn: xdi...@hotmail.com mail: dino.k...@gmail.com skype: dino.keco phone: +387 61 507 851 On Thu, Aug 23, 2012 at 11:56 AM, Siddharth Tiwari

Re: Running map tasks after all reduces have finished

2012-08-23 Thread Harsh J
Hey Jan, What version/distribution of Hadoop are you noticing this on? On Thu, Aug 23, 2012 at 2:55 PM, Jan Lukavský jan.lukav...@firma.seznam.cz wrote: Hi all, we are seeing strange behaviour of JobTracker in the following scenario: - job finishes map phase and starts reduce - after the

Re: Running map tasks after all reduces have finished

2012-08-23 Thread Jan Lukavský
Hi, sorry I forgot to mention. We are using cdh3u3. Jan On 23.8.2012 12:08, Harsh J wrote: Hey Jan, What version/distribution of Hadoop are you noticing this on? On Thu, Aug 23, 2012 at 2:55 PM, Jan Lukavský jan.lukav...@firma.seznam.cz wrote: Hi all, we are seeing strange behaviour of

Re: Install Hive and Pig

2012-08-23 Thread Russell Jurney
Install Pig: http://pig.apache.org/docs/r0.10.0/start.html Install Hive: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-InstallingHivefromaStableRelease These blog posts should help you to get started after that.

Re: running a job on single-node setup takes less time than running on a cluster

2012-08-23 Thread Mahsa Mofidpoor
Thank you very much. On Tue, Aug 21, 2012 at 11:46 PM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: Dear Mahsa, Yes what you have observed is expected to happen that way. On a single node cluster -- everything is local. There is network transfer and everything else

building Hadoop source code in Eclipse

2012-08-23 Thread susnata basak
Hi All, While trying to build the Hadoop source code in Eclipse using Maven, following the instructions at http://wiki.apache.org/hadoop/EclipseEnvironment, I noticed that the project layout has changed in the latest development version, so the instructions didn't quite match. I was wondering

Re: building Hadoop source code in Eclipse

2012-08-23 Thread Adam Berry
I just ran through the same thing. The addition for me was an additional Import step, but pointing at the hadoop-yarn-project, and then grabbing all the projects from that. With this I drop to the command line for building with maven. I've tried to work with m2eclipse and this, but so far with

Re: building Hadoop source code in Eclipse

2012-08-23 Thread Harsh J
Hey Adam, I use m2e and it seems to work pretty well for me. Of course, I do not look for a perfectly clean project state (some projects show build issues), and rely on CLI maven commands when I need to compile something properly. But as a reference/editor, using m2e seems to work just fine. On

Re: Learning hadoop

2012-08-23 Thread Varad Meru
Hi Pravin, Studying Hadoop or MapReduce can look like a daunting task if you get your hands dirty at the start. One of the prerequisites for learning Hadoop is good experience in Java. Good analytical skills help a lot as well, and the final secret sauce for being successful is: you need to

RE: building Hadoop source code in Eclipse

2012-08-23 Thread susnata basak
Hey Harsh, I came across a video on building the Hadoop source code on the Cloudera site, but it was using Ant (on an older project layout). If you're able to use m2eclipse, would you like to make a similar video post or document it somewhere? The other issue I ran into - the unit tests weren't

Re: unsubscribe

2012-08-23 Thread Vinod Kumar Vavilapalli
Please see http://hadoop.apache.org/common/mailing_lists.html. You should send an email to user-unsubscr...@hadoop.apache.org to unsubscribe. HTH, +Vinod On Aug 23, 2012, at 5:43 AM, sathyavageeswaran wrote: Once in hadoop, no free exit From: msridha...@inautix.co.in

Re: Learning hadoop

2012-08-23 Thread Serge Blazhiyevskyy
Hi Pravin, I have installation instructions on my blog: hadoopway.blogspot.com Regards, Serge From: Keith Wiley kwi...@keithwiley.com Reply-To: user@hadoop.apache.org Date: Thu, 23

Re: Learning hadoop

2012-08-23 Thread VIGNESH PRAJAPATI
Hey Pravin, I highly recommend you start with Big Data University [www.bigdatauniversity.com]. It covers all the basics of Hadoop and the Hadoop architecture. Thanks Vignesh On 8/23/12, Serge Blazhiyevskyy serge.blazhiyevs...@nice.com wrote: Hi Pravin, I have installation

I/O stats interpretation during concurrent hive M/R runs

2012-08-23 Thread Himanish Kushary
Hi, I am curious about how to interpret the output from iostat on a datanode during an M/R run. I want to understand how to diagnose a disk I/O issue in a Hadoop cluster. Is there any good documentation to help me understand the results from iostat in a Hadoop context? Here are the iostat

Re: I/O stats interpretation during concurrent hive M/R runs

2012-08-23 Thread Himanish Kushary
After sending this message I issued the iostat -dxm 5 command on the DNs. The %util column shows an average value of 70-80, sometimes going up to 90-100 for a few seconds. Does this mean the disk is becoming the bottleneck, or is this normal? On Thu, Aug 23, 2012 at 3:14 PM, Himanish Kushary
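
One way to turn the %util readings described above into something easier to compare across datanodes is to record the peak per device. The Python sketch below assumes the usual `iostat -dx` layout, where a header row contains a `%util` column and each device row follows with the same number of fields; the function name is hypothetical, and column names other than `%util` are not relied upon since they differ between sysstat versions.

```python
def max_util_per_device(iostat_text):
    """Return {device: highest %util seen} from `iostat -dx` style output."""
    peaks = {}
    util_col = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank line between reports
        if "%util" in fields:
            # Header row: remember which column holds %util.
            util_col = fields.index("%util")
            continue
        if util_col is None or len(fields) <= util_col:
            continue  # banner line or anything before the first header
        try:
            util = float(fields[util_col])
        except ValueError:
            continue  # not a device row
        device = fields[0]
        peaks[device] = max(util, peaks.get(device, 0.0))
    return peaks
```

Feeding it the captured output of a few `iostat -dxm 5` intervals gives one number per disk. The iostat man page describes values close to 100% as device saturation, so disks whose peak stays near that level for long stretches are the ones worth a closer look; short spikes on their own are less conclusive.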

Re: Side-loading output from one MR into another?

2012-08-23 Thread Serge Blazhiyevskyy
I have a map-side join example here: http://askhadoop.blogspot.com/2011/12/map-side-join_27.html It is a great way to load data into memory on multiple machines. Regards, Serge On 8/23/12 3:57 PM, Michael Parker michael.g.par...@gmail.com wrote: Actually, I was able to do some tricks and
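
To illustrate the idea of loading data into memory on every machine, here is a small Python sketch of the in-memory (replicated) variant of a map-side join. It is not the code from the linked post: the function names, the tab separator, and the key-first record layout are all assumptions, and it only works when the smaller dataset fits in each mapper's memory.

```python
def load_lookup(lines, sep="\t"):
    """Build {key: value} from the small dataset (done once per mapper)."""
    table = {}
    for line in lines:
        key, value = line.rstrip("\n").split(sep, 1)
        table[key] = value
    return table


def map_join(record, lookup, sep="\t"):
    """Join one record of the large input against the in-memory table.

    Returns the joined line, or None when the key has no match
    (inner-join semantics).
    """
    key, rest = record.rstrip("\n").split(sep, 1)
    if key not in lookup:
        return None
    return sep.join([key, rest, lookup[key]])
```

Because every mapper holds the whole lookup table, no shuffle or reduce phase is needed for the join. When the smaller side is too large for memory, a merge-style map-side join over sorted, identically partitioned inputs or a reduce-side join is the usual fallback.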

Re: Hadoop on EC2 Managing Internal/External IPs

2012-08-23 Thread igor Finkelshteyn
That would work, but wouldn't a much simpler solution just be to force the machines in the cluster to always pass around their external FQDNs, since those will properly resolve to the internal or external IP depending on what machine is asking? Is there no way to just do that? On Aug 23,

About many user accounts in hadoop platform

2012-08-23 Thread Li Shengmei
Hi all, There are many users on our Hadoop platform. Can each of them install their own Hadoop version on the same cluster? I tried to do this but failed. One user account already existed, and that user had installed his own Hadoop. I created another account and installed Hadoop under it. The logs display ERROR

Re: About many user accounts in hadoop platform

2012-08-23 Thread Sonal Goyal
Hi, Do your users want different versions of Hadoop? Or can they share the same hadoop cluster and schedule their jobs? If the latter, Hadoop can be configured to run for multiple users, and each user can submit their data and jobs to the same cluster. Hence you can maintain a single cluster and

Re: Hadoop on EC2 Managing Internal/External IPs

2012-08-23 Thread Aaron Eng
Hi Igor, I don't think there's anything in Hadoop that's going to allow you to have an internal IP assigned to a machine's network interface and to have it advertise the external IP. Even if that were in place, you'd then have to differentiate between requests coming from the other nodes in the

Re: About many user accounts in hadoop platform

2012-08-23 Thread Bertrand Dechoux
You might also want to look at Hadoop On Demand. http://hadoop.apache.org/common/docs/r0.17.0/hod.html But I would not recommend making one cluster per user. Regards Bertrand On Fri, Aug 24, 2012 at 5:50 AM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, Do your users want different

Reading multiple lines from a microsoft doc in hadoop

2012-08-23 Thread Siddharth Tiwari
Hi, I have files in MS Word doc and docx format. They contain entries which are separated by an empty line. Is it possible to read one entry (the lines between two empty lines) at a time? Also, which InputFormat should I use to read doc and docx files? Please help ** Cheers !!!