Re: What skills to Learn to become Hadoop Admin

2015-03-07 Thread jay vyas
Setting up vendor distros is a great first step.

1) Running TeraSort and benchmarking is a good step.  You can also run
larger, full-stack Hadoop applications like BigPetStore, which we curate
here: https://github.com/apache/bigtop/tree/master/bigtop-bigpetstore/.
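
For example, a full TeraSort round trip looks roughly like this (the jar
path and data size are illustrative, so adjust them to your layout):

# Generate ~10 GB of input (teragen takes a row count; each row is 100
# bytes), sort it, then validate that the output really is totally ordered.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 100000000 /bench/tera-in
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort /bench/tera-in /bench/tera-out
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teravalidate /bench/tera-out /bench/tera-report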

2) Write some MapReduce or Spark jobs which write data to a persistent
transactional store, such as Solr or HBase.  This is a hugely important
part of real-world Hadoop administration, where you will encounter
problems like running out of memory, possibly CPU oversubscription on
some nodes, and so on.
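
A minimal sketch of the kind of checks that go with this (the table name
and application id below are made up; hbase shell and yarn are the real
tools):

# Create a simple HBase table for a job to write into:
echo "create 'petstore_tx', 'cf'" | hbase shell

# After the MapReduce/Spark job has run, pull its container logs and look
# for the classic memory failures:
yarn logs -applicationId application_1425000000000_0001 \
  | grep -i -e OutOfMemory -e 'Killing container'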

3) Now, did you want to go deeper into the build/setup/deployment of
Hadoop?  It's worth it to try building/deploying/debugging Hadoop
ecosystem components from scratch, by setting up Apache BigTop, which
packages RPM/DEB artifacts and provides puppet recipes for deployment.
It's the original root of both the Cloudera and Hortonworks
distributions, so you will learn something about both by playing with it.
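
A rough sketch of what a from-scratch build looks like with BigTop's
gradle wrapper (task names vary by BigTop version, so list the tasks
first to confirm what your checkout actually supports):

git clone https://github.com/apache/bigtop.git
cd bigtop
./gradlew tasks        # list the available packaging targets
./gradlew hadoop-rpm   # build Hadoop RPMs (use hadoop-deb on Debian/Ubuntu)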

We have some exercises you can use to guide you and get started:
https://cwiki.apache.org/confluence/display/BIGTOP/BigTop+U%3A+Exersizes .
Feel free to join the mailing list with questions.




On Sat, Mar 7, 2015 at 9:32 AM, max scalf oracle.bl...@gmail.com wrote:

 Krish,

 I don't mean to hijack your mail here, but I wanted to find out how/what
 you did for the portion below, as I am trying to go down your path as
 well. I was able to get a 4-5 node cluster running using Ambari and CDH,
 and now I want to take it to the next level. What have you done for the
 part below?

 I have done a web log integration using Flume and Twitter sentiment
 analysis.

 On Sat, Mar 7, 2015 at 12:11 AM, Krish Donald gotomyp...@gmail.com
 wrote:

 Hi,

 I would like to enter the Big Data world as a Hadoop admin, and I have
 set up a 7-node cluster using Ambari, Cloudera Manager, and Apache Hadoop.
 I have installed services like Hive, Oozie, ZooKeeper, etc.

 I have done a web log integration using Flume and Twitter sentiment
 analysis.

 I wanted to understand: what other skills should I learn?

 Thanks
 Krish





-- 
jay vyas


Re: What skills to Learn to become Hadoop Admin

2015-03-07 Thread max scalf
Krish,

I don't mean to hijack your mail here, but I wanted to find out how/what
you did for the portion below, as I am trying to go down your path as
well. I was able to get a 4-5 node cluster running using Ambari and CDH,
and now I want to take it to the next level. What have you done for the
part below?

I have done a web log integration using Flume and Twitter sentiment
analysis.

On Sat, Mar 7, 2015 at 12:11 AM, Krish Donald gotomyp...@gmail.com wrote:

 Hi,

 I would like to enter the Big Data world as a Hadoop admin, and I have
 set up a 7-node cluster using Ambari, Cloudera Manager, and Apache Hadoop.
 I have installed services like Hive, Oozie, ZooKeeper, etc.

 I have done a web log integration using Flume and Twitter sentiment
 analysis.

 I wanted to understand: what other skills should I learn?

 Thanks
 Krish



Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?

2015-03-07 Thread tesm...@gmail.com
 Dear Jonathan,

Would you please describe the process of running EMR-based Hadoop for
$15.00? I tried, and my costs rocketed to something like $60 for one hour.

Regards


On 05/03/2015 23:57, Jonathan Aquilina wrote:

Krish, EMR won't cost you much: with all the testing and data we ran
through the test systems, as well as the large amount of data once
everything was ready, we paid about 15.00 USD. I honestly do not think
the specs there would be enough, as Java can be pretty RAM-hungry.



---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

 On 2015-03-06 00:41, Krish Donald wrote:

 Hi,

I am new to AWS and would like to set up a Hadoop cluster using Cloudera
Manager for 6-7 nodes.

t2.micro on AWS: is it enough for setting up a Hadoop cluster?
I would like to use the free tier for now.

Please advise.

Thanks
Krish


Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?

2015-03-07 Thread Jonathan Aquilina
 

When I was testing, I was using the default setup: 1 master node, 2 core
nodes, and no task nodes. I would spin up the cluster, run the job, and
then terminate it. The term for that is a transient cluster.

When the big data actually needed to be crunched, I changed the setup a
bit. An important note: EMR has a limit of 20 nodes, core and task
combined; a request can be submitted to lift that limitation.

When actually live, I had 1 master node, 3 core nodes (which have HDFS
storage), and 10 task nodes. All instances used were of size m3.large. I
ran another batch of data for 2013 through EMR with this setup in 31
minutes, and that is just the time to run the data, not including cluster
spin-up time.

One thing to note: you do not need to use HDFS storage, as that can and
will drive up the cost quickly, and there is a chance of data corruption
or even data loss if a core node crashes. I have been using Amazon S3 and
pulling the data from there. The biggest advantage is that you can spin
up multiple clusters and share the same data to be processed that way.
Using HDFS has its perks too, but costs can increase drastically as well.
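
For anyone wanting to reproduce that, here is a rough sketch with the AWS
CLI matching the live setup above (the bucket, jar path, and AMI version
are placeholders, not something we actually ran):

aws emr create-cluster \
  --name "transient-batch" \
  --ami-version 3.3.1 \
  --instance-groups \
      InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.large \
      InstanceGroupType=CORE,InstanceCount=3,InstanceType=m3.large \
      InstanceGroupType=TASK,InstanceCount=10,InstanceType=m3.large \
  --log-uri s3://my-bucket/emr-logs/ \
  --steps Type=CUSTOM_JAR,Name=batch,Jar=s3://my-bucket/jobs/batch.jar,Args=[s3://my-bucket/input/,s3://my-bucket/output/] \
  --auto-terminate   # tear the cluster down once the step finishes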

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-03-07 09:54, tesm...@gmail.com wrote: 

 Dear Jonathan,

 Would you please describe the process of running EMR-based Hadoop for
 $15.00? I tried, and my costs rocketed to something like $60 for one hour.

 Regards
 
 On 05/03/2015 23:57, Jonathan Aquilina wrote: 
 
 Krish, EMR won't cost you much: with all the testing and data we ran
 through the test systems, as well as the large amount of data once
 everything was ready, we paid about 15.00 USD. I honestly do not think
 the specs there would be enough, as Java can be pretty RAM-hungry.
 
 ---
 Regards,
 Jonathan Aquilina
 Founder Eagle Eye T
 
 On 2015-03-06 00:41, Krish Donald wrote: 
 
 Hi,

 I am new to AWS and would like to set up a Hadoop cluster using Cloudera
 Manager for 6-7 nodes.

 t2.micro on AWS: is it enough for setting up a Hadoop cluster?
 I would like to use the free tier for now.

 Please advise.

 Thanks
 Krish
 

Snappy Configuration in Hadoop2.5.2

2015-03-07 Thread donhoff_h
Hi, experts.

I met the following problem while configuring the Snappy lib in Hadoop 2.5.2.

My snappy installation home is /opt/snappy
My Hadoop installation home is /opt/hadoop/hadoophome

To configure the snappy path, I tried adding the following environment
variables in /etc/profile and hadoop-env.sh:
export JAVA_LIBRARY_PATH=/opt/hadoop/hadoophome/lib/native:/opt/snappy/lib
export LD_LIBRARY_PATH=/opt/hadoop/hadoophome/lib/native:/opt/snappy/lib

After the configuration, I ran the command hadoop checknative. The result,
shown below, I think means that Hadoop can find the snappy lib:
Native library checking:
hadoop: true /opt/hadoop/hadoop-2.5.2/lib/native/libhadoop.so.1.0.0
zlib:   true /lib64/libz.so.1
snappy: true /opt/snappy/lib/libsnappy.so.1
lz4:    true revision:99
bzip2:  false

But when I ran a MapReduce job, it reported the following error:
Error: java.lang.RuntimeException: native snappy library not available:
SnappyCompressor has not been loaded.
    at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:69)
    at org.apache.hadoop.io.compress.SnappyCodec.createCompressor(SnappyCodec.java:143)
    at org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:98)
    at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:136)

I also tried to set io.compression.codecs, but that did not work either.

The only way I found that worked is to make a soft link, as follows:
ln -s /opt/snappy/lib/libsnappy.so.1.2.1 \
  /opt/hadoop/hadoophome/lib/native/libsnappy.so.1
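
As a sanity check (the paths follow my layout above; the input/output
directories are just examples), the link can be verified by forcing
Snappy output compression on a small job:

hadoop checknative -a
hadoop jar /opt/hadoop/hadoophome/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /tmp/snappy-in /tmp/snappy-out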

I used to configure Snappy in Hadoop 2.4.0 successfully, and as I remember
I only needed to set LD_LIBRARY_PATH in /etc/profile; there was no need to
make such a soft link. Does Hadoop 2.5.2 not support this configuration
anymore? Or is there another way to configure it in Hadoop 2.5.2 that
doesn't require making links or copying the lib into Hadoop's lib/native
directory?

Many Thanks!

sorting in hive -- general

2015-03-07 Thread max scalf
Hello all,

I am new to Hadoop and Hive in general, and I am reading Hadoop: The
Definitive Guide by Tom White. On page 504, in the Hive chapter, Tom says
the following with regard to sorting:

*Sorting and Aggregating*
*Sorting data in Hive can be achieved by using a standard ORDER BY clause.
ORDER BY performs a parallel total sort of the input (like that described
in “Total Sort” on page 261). When a globally sorted result is not
required—and in many cases it isn’t—you can use Hive’s nonstandard
extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*


My question is: what exactly does he mean by a globally sorted result? If
the SORT BY operation produces a sorted file per reducer, does that mean
that at the end of the sort all the reducer outputs are put back together
to give the correct result?


Re: sorting in hive -- general

2015-03-07 Thread Alexander Pivovarov
A SORT BY query produces multiple independently sorted files; ORDER BY
produces just one file. In other words, a globally sorted result is one
total order across all rows; with SORT BY, nothing merges the per-reducer
files afterwards, so you only get order within each file.

Usually SORT BY is used together with DISTRIBUTE BY.

In older Hive versions (0.7) they might be used to implement a local sort
within a partition, similar to RANK() OVER (PARTITION BY A ORDER BY B).
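
A tiny illustration of the difference (the logs table and its host/ts
columns are made up):

# One globally sorted result set:
hive -e "SELECT * FROM logs ORDER BY ts;"

# One sorted file per reducer: DISTRIBUTE BY sends all rows for a given
# host to the same reducer, so each host's rows come out sorted, but the
# per-reducer files are never merged into a single total order.
hive -e "SELECT * FROM logs DISTRIBUTE BY host SORT BY host, ts;"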


On Sat, Mar 7, 2015 at 3:02 PM, max scalf oracle.bl...@gmail.com wrote:

 Hello all,

 I am new to Hadoop and Hive in general, and I am reading Hadoop: The
 Definitive Guide by Tom White. On page 504, in the Hive chapter, Tom says
 the following with regard to sorting:

 *Sorting and Aggregating*
 *Sorting data in Hive can be achieved by using a standard ORDER BY clause.
 ORDER BY performs a parallel total sort of the input (like that described
 in “Total Sort” on page 261). When a globally sorted result is not
 required—and in many cases it isn’t—you can use Hive’s nonstandard
 extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*


 My question is: what exactly does he mean by a globally sorted result? If
 the SORT BY operation produces a sorted file per reducer, does that mean
 that at the end of the sort all the reducer outputs are put back together
 to give the correct result?