Re: Question about Flume
You could also consider using WebHDFS instead of sftp / Flume. WebHDFS is a REST API which will allow you to copy data directly into HDFS.

Regards,
Olivier

On 23 January 2014 05:25, sudhakara st sudhakara...@gmail.com wrote:
Hello Kaalu Singh, Flume is the best match for your requirement. First define the storage structure of the data in HDFS and how you are going to process the stored data in HDFS. For very large data sizes, Flume supports multi-hop flows, filtering and aggregation. I think no source plugin is required; a command, script or program which converts your data into a stream of bytes will work with Flume.

On Thu, Jan 23, 2014 at 6:31 AM, Dhaval Shah prince_mithi...@yahoo.co.in wrote:
Fair enough. I just wanted to point out that doing it via a script is going to be a million times faster to implement compared to something like Flume (and arguably more reliable too, with no maintenance overhead). Don't get me wrong, we use Flume for our data collection as well, but our use case is real-time/online data collection and Flume does the job well. So nothing against Flume per se. I was just thinking: if a script becomes a pain down the road, how much throw-away effort are we talking about here, a few minutes to a few hours at most, versus what happens if Flume becomes a pain, a few days to a few weeks of throw-away work.

From: Kaalu Singh kaalusingh1...@gmail.com
To: user@hadoop.apache.org; Dhaval Shah prince_mithi...@yahoo.co.in
Subject: Re: Question about Flume
Sent: Wed, Jan 22, 2014 11:20:52 PM

The closest built-in functionality to the use case I have is the Spooling Directory Source, and I like the idea of using/building software with higher-level languages like Java for reasons of extensibility etc. (and don't like the idea of scripts). However, I am soliciting opinions and can be swayed to change my mind. Thanks for your response, Dhaval - appreciate it.

Regards,
KS

On Wed, Jan 22, 2014 at 2:58 PM, Dhaval Shah prince_mithi...@yahoo.co.in wrote:
Flume is useful for online log aggregation in a streaming format. Your use case seems more like a batch format where you just need to grab the file and put it in HDFS at regular intervals, which can be much more easily achieved by a bash script running on a cron'd basis.

Regards,
Dhaval

From: Kaalu Singh kaalusingh1...@gmail.com
To: user@hadoop.apache.org
Sent: Wednesday, 22 January 2014 5:52 PM
Subject: Question about Flume

Hi, I have the following use case: I have data files getting generated frequently on a certain machine, X. The only way I can bring them into my Hadoop cluster is by SFTPing at certain intervals of time, getting them, and landing them in HDFS. I am new to Hadoop and to Flume. I read up about Flume and it seems like this framework is appropriate for something like this, although I did not see an available 'source' that can do exactly what I am looking for. Unavailability of a 'source' plugin is not a deal breaker for me as I can write one, but first I want to make sure this is the right way to go. So, my questions are:
1. What are the pros/cons of using Flume for this use case?
2. Does anybody know of a source plugin that does what I am looking for?
3. Does anybody think I should not use Flume and instead write my own application to achieve this use case?

Thanks,
KS

--
Regards,
...Sudhakara.st
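For anyone who lands on this thread later, here is a minimal sketch of the cron-driven script Dhaval describes, plus the WebHDFS alternative Olivier mentions. All host names, paths and the key file are hypothetical placeholders; it assumes passwordless SFTP access to machine X and an HDFS client on the box running cron.

    #!/bin/bash
    # Pull new data files from machine X over SFTP and land them in HDFS.
    # Host, key file and paths below are illustrative only.
    set -e
    LOCAL_STAGE=/var/tmp/incoming
    HDFS_DEST=/data/incoming/$(date +%Y/%m/%d)

    mkdir -p "$LOCAL_STAGE"

    # Batch-mode SFTP: fetch everything in the remote drop directory.
    sftp -i ~/.ssh/id_rsa user@machineX <<EOF
    lcd $LOCAL_STAGE
    cd /exports/drop
    get *
    EOF

    # Copy into HDFS, then clear the local staging area only if the put succeeded.
    hdfs dfs -mkdir -p "$HDFS_DEST"
    hdfs dfs -put "$LOCAL_STAGE"/* "$HDFS_DEST"/ && rm -f "$LOCAL_STAGE"/*

    # WebHDFS alternative (two-step create: the name node answers with a 307 redirect
    # to a datanode, and the second PUT uploads the bytes to that Location):
    #   curl -i -X PUT "http://<namenode>:50070/webhdfs/v1/data/incoming/f1.dat?op=CREATE&user.name=me"
    #   curl -i -X PUT -T f1.dat "<Location URL returned by the previous call>"

A cron entry such as "*/15 * * * * /usr/local/bin/pull_from_x.sh" would run it every 15 minutes; the script name and schedule are, again, made up for illustration.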
Streaming jobs getting poor locality
Hi,

I posted a question to Stack Overflow yesterday about an issue I'm seeing, but judging by the low interest (only 7 views in 24 hours, and 3 of them are probably me! :-) it seems like I should switch venue. I'm pasting the same question here in hopes of finding someone with interest. Original SO post is at http://stackoverflow.com/questions/21266248/hadoop-jobs-getting-poor-locality .

I have some fairly simple Hadoop streaming jobs that look like this:

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
  -files hdfs:///apps/local/count.pl \
  -input /foo/data/bz2 \
  -output /user/me/myoutput \
  -mapper "cut -f4,8 -d," \
  -reducer count.pl \
  -combiner count.pl

The count.pl script is just a simple script that accumulates counts in a hash and prints them out at the end - the details are probably not relevant but I can post it if necessary.

The input is a directory containing 5 files encoded with bz2 compression, roughly the same size as each other, for a total of about 5GB (compressed). When I look at the running job, it has 45 mappers, but they're all running on one node. The particular node changes from run to run, but it is always only one node. Therefore I'm achieving poor data locality as data is transferred over the network to this node, and probably achieving poor CPU usage too.

The entire cluster has 9 nodes, all with the same basic configuration. The blocks of the data for all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.

I'm happy to share any requested info from my configuration, but this is a corporate cluster and I don't want to upload any full config files. It looks like this previous thread [ why map task always running on a single node - http://stackoverflow.com/questions/12135949/why-map-task-always-running-on-a-single-node ] is relevant but not conclusive.

Thanks.

--
Ken Williams, Senior Research Scientist
WindLogics http://windlogics.com
Re: Streaming jobs getting poor locality
I think in order to configure a Hadoop job to read the compressed input, you have to specify the compression codec in code or on the command line, like:

-D io.compression.codecs=org.apache.hadoop.io.compress.BZip2Codec

--
Regards,
...Sudhakara.st
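If you do want to force the codec explicitly (later replies note that it is picked up automatically from the .bz2 extension, so this may not be necessary, and keep in mind the property is a comma-separated list, so overriding it replaces the default codec list), -D is a generic option and must come before the streaming-specific flags. A hedged sketch, reusing the hypothetical paths from the original post:

    yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
      -D io.compression.codecs=org.apache.hadoop.io.compress.BZip2Codec \
      -files hdfs:///apps/local/count.pl \
      -input /foo/data/bz2 \
      -output /user/me/myoutput \
      -mapper "cut -f4,8 -d," \
      -reducer count.pl \
      -combiner count.pl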
Re: How to learn hadoop follow Tom White
I am sorry if that site is illegal. I just thought it would be useful.

On Wed, Jan 22, 2014 at 11:43 PM, Amr Shahin amrnab...@gmail.com wrote:
Looks pretty illegal to me as well.

On Thu, Jan 23, 2014 at 11:32 AM, Marco Shaw marco.s...@gmail.com wrote:
I'm pretty sure that site is illegal...

On Jan 23, 2014, at 3:22 AM, Cooleaf cool...@gmail.com wrote:
Thanks all for commenting on my concern. By the way, are the books on that website copyright-free?

2014/1/23 Abirami V abiramipand...@gmail.com:
From the below site we can download Hadoop-related books. Hope this will help others who are interested in learning Hadoop. http://it-ebooks.info/ Click the book title and then click the download ebooks link.
Thanks,
Abi

On Tue, Jan 21, 2014 at 10:28 PM, Harsh J ha...@cloudera.com wrote:
The book's contents should still be very relevant as the APIs haven't changed.

On Wed, Jan 22, 2014 at 11:23 AM, Cooleaf cool...@gmail.com wrote:
Hi folks, I am new to Hadoop and I am trying to learn Hadoop following the book (Hadoop: The Definitive Guide, 2nd edition). I found that the sample code targets 0.20; should I learn and exercise it under the Hadoop 1.0 version? I have installed Hadoop 2.2, which is another branch.
Thanks,
Xiaoguang

--
Harsh J
HDFS buffer sizes
What is the interaction between dfs.stream-buffer-size and dfs.client-write-packet-size? I see that the default for dfs.stream-buffer-size is 4K. Does anyone have experience using larger buffers to optimize large writes? Thanks John
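For reference, both properties live in hdfs-site.xml; the values below are just the shipped defaults from hdfs-default.xml (4 KB and 64 KB respectively), shown to illustrate where the knobs sit rather than as a tuning recommendation. As the reply further down notes, dfs.stream-buffer-size may not actually be consulted by the HDFS write path.

    <property>
      <name>dfs.stream-buffer-size</name>
      <value>4096</value>   <!-- generic stream buffer size, in bytes -->
    </property>
    <property>
      <name>dfs.client-write-packet-size</name>
      <value>65536</value>  <!-- size of each packet the client sends down the write pipeline -->
    </property>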
RE: Streaming jobs getting poor locality
I believe Hadoop can figure out the codec from the file name extension, and the Bzip2 codec is supported in Hadoop as a Java implementation, which is also a SplittableCompressionCodec. So 5GB of bzip2 files generating about 45 mappers is very reasonable, assuming 128M per block.

The question is why ONLY one node will run these 45 mappers. What is described in the original question is not very clear. I am not very familiar with streaming and YARN (it looks like you are using MRv2). So why do you think all the mappers are running on one node? Did someone else run other jobs in the cluster at the same time? What are the memory allocation and configuration on each node of your cluster?

Yong
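For what it's worth, the split arithmetic backs this up: 5 GB compressed is roughly 5 * 1024 MB / 128 MB ≈ 40 block-sized splits, and since splits never cross file boundaries the five separate files round that up a little, which lands right around the observed 45 mappers.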
HDFS federation configuration
Hi,

We tried setting up an HDFS name node federation with 2 name nodes. I am facing a few issues. Can anyone help me understand the points below?

1) How can we assign different namespaces to different name nodes? Where exactly do we need to configure this?
2) After formatting each NN with one cluster id, do we need to set this cluster id in hdfs-site.xml?
3) I am getting an exception saying the data dir is already locked by one of the NNs, but when I don't specify data.dir, it doesn't show the exception. So what could be the issue?

Thanks & Regards,
B Anil Kumar.
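On point 1, the namespace-to-name-node mapping is done in hdfs-site.xml: dfs.nameservices lists the nameservice IDs, and per-ID address properties point each one at its own name node. A minimal sketch with two hypothetical nameservices and hosts (not a complete federation config):

    <property>
      <name>dfs.nameservices</name>
      <value>ns1,ns2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns1</name>
      <value>nn1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.http-address.ns1</name>
      <value>nn1.example.com:50070</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns2</name>
      <value>nn2.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.http-address.ns2</name>
      <value>nn2.example.com:50070</value>
    </property>
    <!-- Each name node also needs its own dfs.namenode.name.dir; if two name nodes on the
         same host point at the same directory, the second one fails with an "already locked"
         storage error, which may be what point 3 is hitting. -->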
hdfs fsck -locations
Hi,

Is hdfs fsck -locations supposed to show every block with its location? Is the location the IP of the datanode?

Thank you,
Mark
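A hedged example (the path and the output line are illustrative): -locations only prints per-block detail when combined with -files and -blocks, and each location is reported as a datanode ip:port pair (the data-transfer port, 50010 by default), not just a bare IP.

    hdfs fsck /user/mark/somefile.dat -files -blocks -locations
    # An output line for one block looks roughly like:
    # 0. blk_1073741825_1001 len=134217728 repl=3 [10.0.0.12:50010, 10.0.0.15:50010, 10.0.0.17:50010]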
RE: Streaming jobs getting poor locality
I cannot explain it. (Your configuration looks fine to me, and you mention that those mappers only run on one node within a run, but can land on different nodes across runs.) But as I said, I am not an expert in YARN, as it is also very new to me. Let's see if someone else on the list can give you some hints. Meanwhile, maybe you can do some tests to help us narrow down the cause, since you just need 'cut' in your mapper:

1) Run an example job, like 'Pi' or 'Write output' from the hadoop-examples jar. Do the mapper tasks run concurrently on multiple nodes in your cluster?
2) If you don't use bzip2 files as input, do you have the same problem with other file types, like plain text?

Yong

From: ken.willi...@windlogics.com
To: user@hadoop.apache.org
Subject: RE: Streaming jobs getting poor locality
Date: Thu, 23 Jan 2014 16:06:03 +

java8964 wrote:
> I believe Hadoop can figure out the codec from the file name extension, and the Bzip2 codec is supported in Hadoop as a Java implementation, which is also a SplittableCompressionCodec. So 5GB of bzip2 files generating about 45 mappers is very reasonable, assuming 128M per block.

Correct - the number of splits seems reasonable to me too, and the codec is indeed figured out automatically. The job does produce the correct output.

> The question is why ONLY one node will run these 45 mappers.

Exactly.

> I am not very familiar with streaming and YARN (it looks like you are using MRv2). So why do you think all the mappers are running on one node? Did someone else run other jobs in the cluster at the same time? What are the memory allocation and configuration on each node of your cluster?

1) I am using Hadoop 2.2.0.
2) I know they're all running on one node because in the Ambari interface (we're using the free Hortonworks distro) I can see that all the Map tasks are assigned to the same IP address. I can confirm it by SSH-ing to that node, where I see all the jobs running, using 'top' or 'ps' or whatever. If I SSH to any other node, I see no jobs running.
3) There are no other jobs running at the same time on the cluster.
4) All data nodes are configured identically, and the config includes:

resourcemanager_heapsize = 1024
nodemanager_heapsize = 1024
yarn_heapsize = 1024
yarn.nodemanager.resource.memory-mb = 98403
yarn.nodemanager.vmem-pmem-ratio = 2.1
yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
yarn.scheduler.minimum-allocation-mb = 512
yarn.scheduler.maximum-allocation-mb = 10240

-Ken
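One more way to confirm the skew without SSH-ing to each box: while the job is running, the standard YARN CLI reports per-node container counts, so a healthy spread shows containers on several NodeManagers instead of piling onto one. A hedged sketch:

    # Per-NodeManager view: the Number-of-Running-Containers column shows where containers landed
    yarn node -list
    # Confirms which application is active and its progress
    yarn application -list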
Re: HDFS buffer sizes
HDFS does not appear to use dfs.stream-buffer-size.
HDFS data transfer is faster than SCP based transfer?
Hello,

I have a use case that requires transferring input files from remote storage using the SCP protocol (via the JSch jar). To optimize this use case, I have pre-loaded all my input files into HDFS and modified my use case so that it copies the required files from HDFS. So, when a tasktracker runs, it copies the required input files from HDFS to its local directory. All my tasktrackers are also datanodes.

I can see that my use case runs faster. The only modification in my application is that files are copied from HDFS instead of transferred using SCP. Also, my use case involves parallel operations (run in the tasktrackers) and they do a lot of file transfer. Now all these transfers are replaced with HDFS copies.

Can anyone tell me why the HDFS transfer is faster, as I witnessed? Is it because it uses TCP/IP? Can anyone give me reasonable explanations to support the decrease in time?

With thanks and regards,
rab
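Not an answer to the "why", but a hedged way to put numbers on the difference for one representative file; the host name and paths are made up, and it should be run from a node that is also a datanode, matching the setup described:

    # Old path: pull one input file over SCP and time it
    time scp user@remote-storage:/exports/input/part-0001.dat /tmp/part-0001.scp.dat

    # New path: pull the same file from HDFS (already pre-loaded) and time it
    time hdfs dfs -get /data/input/part-0001.dat /tmp/part-0001.hdfs.dat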