Re: Question about Flume

2014-01-23 Thread Olivier Renault
You could also consider using WebHDFS instead of sftp / flume. WebHDFS is a
REST API which will allow you to copy data directly into HDFS.
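
For instance, uploading a file via WebHDFS is a two-step PUT with curl; the hostname, port and paths below are placeholders (50070 is the usual default NameNode HTTP port):

# Step 1: ask the NameNode for a write location. No file data is sent yet;
# the response is an HTTP 307 redirect whose Location header points at a DataNode.
curl -i -X PUT "http://NAMENODE_HOST:50070/webhdfs/v1/data/incoming/file1.csv?op=CREATE"

# Step 2: PUT the actual file to the URL returned in the Location header.
curl -i -X PUT -T file1.csv "DATANODE_URL_FROM_LOCATION_HEADER"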

Regards,
Olivier


On 23 January 2014 05:25, sudhakara st sudhakara...@gmail.com wrote:

 Hello Kaalu Singh,

 Flume is a good match for your requirement. First define the storage
 structure of the data in HDFS and how you are going to process it once it is
 stored. For very large data volumes, Flume supports multi-hop flows, filtering,
 and aggregation. I don't think a custom source plugin is required: any command,
 script, or program that converts your data into a stream of bytes will work
 with Flume.


 On Thu, Jan 23, 2014 at 6:31 AM, Dhaval Shah prince_mithi...@yahoo.co.in wrote:

 Fair enough. I just wanted to point out that doing it via a script is
 going to be a million times faster to implement than something like Flume
 (and arguably more reliable too, with no maintenance overhead). Don't get me
 wrong, we use Flume for our data collection as well, but our use case is
 real-time/online data collection and Flume does the job well. So nothing
 against Flume per se. I was just thinking: if a script becomes a pain down
 the road, how much throw-away effort are we talking about? A few minutes to
 a few hours at most, versus a few days to a few weeks of throw-away work if
 Flume becomes a pain.


  --
 From: Kaalu Singh kaalusingh1...@gmail.com
 To: user@hadoop.apache.org; Dhaval Shah prince_mithi...@yahoo.co.in
 Subject: Re: Question about Flume
 Sent: Wed, Jan 22, 2014 11:20:52 PM

   The closest built-in functionality to my use case is the Spooling
 Directory Source, and I like the idea of using/building software in
 higher-level languages like Java for reasons of extensibility, etc. (and
 don't like the idea of scripts).
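
For what it is worth, a minimal sketch of a spooling-directory-to-HDFS Flume agent configuration looks roughly like this; the agent name, directories and HDFS path are placeholders, not from this thread:

# flume.conf -- started with: flume-ng agent -n agent1 -f flume.conf
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Watch a local directory; fully ingested files get a .COMPLETED suffix by default
agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /var/data/incoming
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type     = memory
agent1.channels.ch1.capacity = 10000

# Write the events into HDFS as plain data
agent1.sinks.sink1.type          = hdfs
agent1.sinks.sink1.hdfs.path     = hdfs://namenode:8020/data/incoming
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel       = ch1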

 However, I am soliciting opinions and can be swayed to change my mind.

 Thanks for your response Dhaval - appreciate it.

 Regards
 KS


 On Wed, Jan 22, 2014 at 2:58 PM, Dhaval Shah prince_mithi...@yahoo.co.in
  wrote:

 Flume is useful for online log aggregation in a streaming format. Your
 use case seems more like a batch scenario where you just need to grab the
 files and put them in HDFS at regular intervals, which can be achieved much
 more easily by a bash script run from cron.
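
A rough sketch of such a script, with host names, paths and schedule as placeholders (it assumes passwordless SSH keys are already set up):

#!/bin/bash
# Pull new files from the remote machine and land them in HDFS.
# Run from cron, e.g.:  */15 * * * * /usr/local/bin/ingest.sh
set -e

STAGE=/tmp/ingest-stage                     # local staging area (placeholder)
mkdir -p "$STAGE"

# Fetch the files over SCP/SFTP (remote host and path are placeholders)
scp -q 'user@machine-x:/data/outgoing/*' "$STAGE"/

# Land them in HDFS, then clean up the staging area
hdfs dfs -put "$STAGE"/* /data/incoming/
rm -f "$STAGE"/*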

 Regards,

 Dhaval


 
 From: Kaalu Singh kaalusingh1...@gmail.com
 To: user@hadoop.apache.org
 Sent: Wednesday, 22 January 2014 5:52 PM
 Subject: Question about Flume



 Hi,

 I have the following use case:

 I have data files getting generated frequently on a certain machine, X.
 The only way I can bring them into my Hadoop cluster is by SFTPing to it at
 certain intervals, fetching the files, and landing them in HDFS.

 I am new to Hadoop and to Flume. I read up on Flume, and this framework
 seems appropriate for something like this, although I did not see an
 available 'source' that does exactly what I am looking for. The
 unavailability of a 'source' plugin is not a deal-breaker for me, as I can
 write one, but first I want to make sure this is the right way to go. So, my
 questions are:

 1. What are the pros/cons of using Flume for this use case?
 2. Does anybody know of a source plugin that does what I am looking for?
 3. Does anybody think I should not use Flume and instead write my own
 application to achieve this use case?

 Thanks
 KS





 --

 Regards,
 ...Sudhakara.st





Streaming jobs getting poor locality

2014-01-23 Thread Williams, Ken
Hi,

I posted a question to Stack Overflow yesterday about an issue I'm seeing, but 
judging by the low interest (only 7 views in 24 hours, and 3 of them are 
probably me! :-) it seems like I should switch venue.  I'm pasting the same 
question here in hopes of finding someone with interest.

Original SO post is at 
http://stackoverflow.com/questions/21266248/hadoop-jobs-getting-poor-locality .

*
I have some fairly simple Hadoop streaming jobs that look like this:

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
  -files hdfs:///apps/local/count.pl \
  -input /foo/data/bz2 \
  -output /user/me/myoutput \
  -mapper 'cut -f4,8 -d,' \
  -reducer count.pl \
  -combiner count.pl

The count.pl script is just a simple script that accumulates counts in a hash 
and prints them out at the end - the details are probably not relevant but I 
can post it if necessary.

The input is a directory containing 5 files encoded with bz2 compression, 
roughly the same size as each other, for a total of about 5GB (compressed).

When I look at the running job, it has 45 mappers, but they're all running on 
one node. The particular node changes from run to run, but always only one 
node. Therefore I'm achieving poor data locality as data is transferred over 
the network to this node, and probably achieving poor CPU usage too.

The entire cluster has 9 nodes, all the same basic configuration. The blocks of 
the data for all 5 files are spread out among the 9 nodes, as reported by the 
HDFS Name Node web UI.

I'm happy to share any requested info from my configuration, but this is a 
corporate cluster and I don't want to upload any full config files.

It looks like this previous thread [ why map task always running on a single 
node - 
http://stackoverflow.com/questions/12135949/why-map-task-always-running-on-a-single-node
 ] is relevant but not conclusive.

*

Thanks.

--
Ken Williams, Senior Research Scientist
WindLogics
http://windlogics.com






Re: Streaming jobs getting poor locality

2014-01-23 Thread sudhakara st
I think that in order to configure a Hadoop job to read compressed input, you
have to specify the compression codec in code or on the command line, e.g.
-D io.compression.codecs=org.apache.hadoop.io.compress.BZip2Codec


On Thu, Jan 23, 2014 at 12:40 AM, Williams, Ken ken.willi...@windlogics.com
 wrote:


-- 

Regards,
...Sudhakara.st


Re: How to learn hadoop follow Tom White

2014-01-23 Thread Abirami V
I am sorry if that site is illegal. I just thought it would be useful.


On Wed, Jan 22, 2014 at 11:43 PM, Amr Shahin amrnab...@gmail.com wrote:

 Looks pretty illegal to me as well.


 On Thu, Jan 23, 2014 at 11:32 AM, Marco Shaw marco.s...@gmail.com wrote:

 I'm pretty sure that site is illegal...

 On Jan 23, 2014, at 3:22 AM, Cooleaf cool...@gmail.com wrote:

 Thanks, all, for commenting on my concern. By the way, are the books on
 that website copyright-free?



 2014/1/23 Abirami V abiramipand...@gmail.com

 From the site below we can download Hadoop-related books. Hope this will
 help others who are interested in learning Hadoop.

 http://it-ebooks.info/

 click the book title and click download ebooks link.

 Thanks,
 Abi



 On Tue, Jan 21, 2014 at 10:28 PM, Harsh J ha...@cloudera.com wrote:

 The book's contents should still be very relevant as the APIs haven't
 changed.

 On Wed, Jan 22, 2014 at 11:23 AM, Cooleaf cool...@gmail.com wrote:
  Hi folks,
  I am new to Hadoop and I am trying to learn it by following the
 book (Hadoop: The Definitive Guide, 2nd edition). I found that the sample
 code targets 0.20. Should I learn and exercise it under Hadoop 1.0? I have
 installed Hadoop 2.2, which is another branch.
 
  thanks,
 
  Xiaoguang



 --
 Harsh J







HDFS buffer sizes

2014-01-23 Thread John Lilley
What is the interaction between dfs.stream-buffer-size and 
dfs.client-write-packet-size?
I see that the default for dfs.stream-buffer-size is 4K.  Does anyone have 
experience using larger buffers to optimize large writes?
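
For reference, a sketch of how the two settings appear in hdfs-site.xml; the values shown are just the shipped defaults, not a tuning recommendation:

<!-- illustrative hdfs-site.xml snippet -->
<property>
  <name>dfs.stream-buffer-size</name>
  <value>4096</value>      <!-- 4 KB stream buffer -->
</property>
<property>
  <name>dfs.client-write-packet-size</name>
  <value>65536</value>     <!-- 64 KB packets sent from the client to the datanode pipeline -->
</property>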
Thanks
John



RE: Streaming jobs getting poor locality

2014-01-23 Thread java8964
I believe Hadoop can figure out the codec from the file name extension, and the
bzip2 codec ships with Hadoop as a Java implementation that is also a
SplittableCompressionCodec. So roughly 45 mappers for 5 GB of bzip2 files is
very reasonable, assuming 128 MB per block.
The question is why ONLY one node runs these 45 mappers; the original question
is not very clear on that point.
I am not very familiar with streaming and YARN (it looks like you are using
MRv2). So why do you think all the mappers are running on one node? Was someone
else running other jobs in the cluster at the same time? What are the memory
allocation and configuration on each node of your cluster?
Yong

HDFS federation configuration

2014-01-23 Thread AnilKumar B
Hi,

We tried setting up HDFS name node federation with 2 name nodes. I am facing
a few issues.

Can anyone help me understand the points below?

1) How can we assign different namespaces to different name nodes? Where
exactly do we need to configure this? (See the config sketch below.)

2) After formatting each NN with one cluster id, do we need to set this
cluster id in hdfs-site.xml?

3) I am getting an exception saying the data dir is already locked by one of
the NNs, but when I don't specify data.dir, the exception does not appear.
What could be the issue?
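
Regarding (1), a minimal hdfs-site.xml sketch for two federated name nodes might look like this; the nameservice IDs and host names are placeholders:

<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.ns1</name>
  <value>nn1.example.com:50070</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.ns2</name>
  <value>nn2.example.com:50070</value>
</property>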

Thanks & Regards,
B Anil Kumar.


hdfs fsck -locations

2014-01-23 Thread Mark Kerzner
Hi,

hdfs fsck -locations

is supposed to show every block with its location, right? Is the location the
IP of the datanode?
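
For context, a typical invocation combines -locations with -files and -blocks, since the per-block detail is only printed alongside the block report (the path is a placeholder):

hdfs fsck /user/me/mydata -files -blocks -locations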

Thank you,
Mark


RE: Streaming jobs getting poor locality

2014-01-23 Thread java8964
I cannot explain it (your configuration looks fine to me, and you mention that
the mappers only ever run on one node within a given run, though possibly a
different node from run to run). But as I said, I am not an expert in YARN, as
it is also very new to me. Let's see if someone else on the list can give you
some hints.
In the meantime, maybe you can run some tests to help us narrow down the cause.
Since you just need 'cut' in your mapper:
1) Run an example job, like 'Pi' or 'Write output' from the hadoop-examples
jar. Do the mapper tasks run concurrently on multiple nodes in your cluster?
2) If you don't use bzip2 files as input, do you see the same problem with
other file types, like plain text?
Yong
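
For instance, a quick sanity check along those lines might be the following (the examples jar path varies by distribution), after which the ResourceManager UI should show whether the 20 map tasks land on more than one node:

yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 20 1000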

From: ken.willi...@windlogics.com
To: user@hadoop.apache.org
Subject: RE: Streaming jobs getting poor locality
Date: Thu, 23 Jan 2014 16:06:03 +








java8964 wrote:
 
 I believe Hadoop can figure out the codec from the file name extension, and
 the bzip2 codec ships with Hadoop as a Java implementation that is also a
 SplittableCompressionCodec. So roughly 45 mappers for 5 GB of bzip2 files is
 very reasonable, assuming 128 MB per block.
 
Correct - the number of splits seems reasonable to me too, and the codec is 
indeed figured out automatically.  The job does produce the correct output.
 
 The question is why ONLY one node runs these 45 mappers.

 
Exactly.
 
 I am not very familiar with streaming and YARN (it looks like you are using
 MRv2). So why do you think all the mappers are running on one node? Was
 someone else running other jobs in the cluster at the same time? What are the
 memory allocation and configuration on each node of your cluster?
 
1) I am using Hadoop 2.2.0.



2) I know they’re all running on one node because in the Ambari interface
(we’re using the free Hortonworks distro) I can see that all the map tasks are
assigned to the same IP address. I can confirm it by SSH-ing to that node and
seeing all the tasks running there, using 'top' or 'ps' or whatever. If I SSH
to any other node, I see no tasks running.
 
3) There are no other jobs running at the same time on the cluster.
 
4) All data nodes are configured identically, and the config includes:
resourcemanager_heapsize = 1024
nodemanager_heapsize = 1024
yarn_heapsize = 1024
yarn.nodemanager.resource.memory-mb = 98403
yarn.nodemanager.vmem-pmem-ratio = 2.1
yarn.resourcemanager.scheduler.class = 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
yarn.scheduler.minimum-allocation-mb = 512
yarn.scheduler.maximum-allocation-mb = 10240
 
 
-Ken
 




Re: HDFS buffer sizes

2014-01-23 Thread Arpit Agarwal
HDFS does not appear to use dfs.stream-buffer-size.


On Thu, Jan 23, 2014 at 6:57 AM, John Lilley john.lil...@redpoint.net wrote:







HDFS data transfer is faster than SCP based transfer?

2014-01-23 Thread rab ra
Hello

I have a use case that requires transferring input files from remote storage
using the SCP protocol (via the JSch jar). To optimize it, I have pre-loaded
all my input files into HDFS and modified my application so that it copies the
required files from HDFS instead. So, when the tasktrackers run, they copy the
required input files from HDFS to their local directories. All my tasktrackers
are also datanodes. I can see that my use case now runs faster. The only
modification in my application is that files are copied from HDFS instead of
transferred over SCP. My use case also involves parallel operations (run in
the tasktrackers) that do a lot of file transfer, and all of those transfers
are now HDFS copies.
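
As a point of reference, the HDFS-side copy in a setup like this usually boils down to a single FileSystem call; a minimal sketch with placeholder paths and no error handling:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFetch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // HDFS client
        // Copy one input file from HDFS into the task's local working directory;
        // blocks are streamed over TCP from the datanodes holding the replicas.
        fs.copyToLocalFile(new Path("/data/inputs/part-0001.dat"),  // placeholder HDFS path
                           new Path("/tmp/work/part-0001.dat"));    // placeholder local path
        fs.close();
    }
}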

Can anyone tell me why the HDFS transfer is faster, as I witnessed? Is it
because it uses TCP/IP? Can anyone give me plausible reasons for the decrease
in time?


with thanks and regards
rab