Re: What skills to Learn to become Hadoop Admin
Setting up vendor distros is a great first step. From there:

1) Running TeraSort and benchmarking is a good next step (a sketch of a full TeraSort run follows at the end of this mail). You can also run larger, full-stack Hadoop applications like BigPetStore, which we curate here: https://github.com/apache/bigtop/tree/master/bigtop-bigpetstore/

2) Write some MapReduce or Spark jobs which write data to a persistent transactional store, such as SOLR or HBase. This is a hugely important part of real-world Hadoop administration, where you will encounter problems like running out of memory, possibly CPU oversubscription on some nodes, and so on.

3) Now, did you want to go deeper into the build/setup/deployment of Hadoop? It's worth trying to build, deploy, and debug Hadoop ecosystem components from scratch by setting up Apache Bigtop, which packages RPM/DEB artifacts and provides Puppet recipes for distributions. Bigtop is the original root of both the Cloudera and Hortonworks distributions, so you will learn something about both by playing with it. We have some exercises you can use to guide you and get started: https://cwiki.apache.org/confluence/display/BIGTOP/BigTop+U%3A+Exersizes

Feel free to join the mailing list for questions.

-- jay vyas
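For step 1, note that a TeraSort benchmark is really three jobs chained together: generate, sort, validate. Here is a minimal sketch, assuming the stock examples jar shipped with a Hadoop 2.x install; the row count and HDFS paths are only illustrative:

  # generate 10^9 rows of 100 bytes (~100 GB), sort them, then verify the order
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      teragen 1000000000 /benchmarks/terasort-in
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      terasort /benchmarks/terasort-in /benchmarks/terasort-out
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      teravalidate /benchmarks/terasort-out /benchmarks/terasort-report

The number to watch is the wall-clock time of the terasort step itself; rerun it while varying reducer counts and memory settings to see how the cluster responds.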
Re: What skills to Learn to become Hadoop Admin
Krish, I don't mean to hijack your mail here, but I wanted to find out how/what you did for the portion below, as I am trying to go down your path as well. I was able to get a 4-5 node cluster up using Ambari and CDH, and now I want to take it to the next level. What have you done for the below?

"I have done a web log integration using flume and twitter sentiment analysis."

On Sat, Mar 7, 2015 at 12:11 AM, Krish Donald gotomyp...@gmail.com wrote:

Hi, I would like to enter the Big Data world as a Hadoop admin, and I have set up a 7-node cluster using Ambari, Cloudera Manager, and Apache Hadoop. I have installed services like Hive, Oozie, ZooKeeper, etc., and I have done a web log integration using Flume and a Twitter sentiment analysis. I wanted to understand: what other skills should I learn? Thanks, Krish
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
Dear Jonathan,

Would you please describe the process of running EMR-based Hadoop for $15.00? I tried, and my costs rocketed to around $60 for one hour.

Regards

On 05/03/2015 23:57, Jonathan Aquilina wrote:

Krish, EMR won't cost you much. With all the testing and data we ran through the test systems, as well as the large amount of data once everything was ready, we paid about 15.00 USD. I honestly do not think the specs there would be enough, as Java can be pretty RAM-hungry.

--- Regards, Jonathan Aquilina, Founder, Eagle Eye T

On 2015-03-06 00:41, Krish Donald wrote:

Hi, I am new to AWS and would like to set up a Hadoop cluster using Cloudera Manager for 6-7 nodes. Is t2.micro on AWS enough for setting up a Hadoop cluster? I would like to use the free tier for now. Please advise. Thanks, Krish
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
When I was testing, I was using the default setup: 1 master node, 2 core nodes, and no task nodes. I would spin up the cluster, run the work, then terminate it. The term for that is a transient cluster (a sketch of launching one from the AWS CLI follows at the end of this mail).

When the big data needed to be crunched, I changed the setup a bit. An important note: there is a limit of 20 nodes, be it core or task, with EMR; a request can be submitted to lift that limit. When actually live, I had 1 master node, 3 core nodes (which have HDFS storage), and 10 task nodes. All instances used were of size m3.large. I ran another batch of data for 2013 through EMR with this setup in 31 minutes, just to run the data; that is not including cluster spin-up time.

One thing to note: you do not need to use HDFS storage, which can and will drive up the cost quickly, and there is a chance of data corruption or even data loss if a core node crashes. I have been using Amazon S3 and pulling the data from there. The biggest advantage is that you can spawn multiple clusters and share the same data to be processed that way. Using HDFS has its perks too, but costs can drastically increase as well.

--- Regards, Jonathan Aquilina, Founder, Eagle Eye T
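For anyone who wants to try the same pattern, here is a minimal sketch of launching a transient cluster from the AWS CLI. The bucket names, jar path, and AMI version are placeholders, and the flags assume the 2015-era aws emr syntax; check "aws emr create-cluster help" for your CLI version:

  # 1 master / 3 core / 10 task cluster that runs one step and then terminates
  aws emr create-cluster \
      --name "transient-2013-batch" \
      --ami-version 3.3.2 \
      --instance-groups \
          InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.large \
          InstanceGroupType=CORE,InstanceCount=3,InstanceType=m3.large \
          InstanceGroupType=TASK,InstanceCount=10,InstanceType=m3.large \
      --log-uri s3://my-bucket/emr-logs/ \
      --steps Type=CUSTOM_JAR,Name=Crunch2013,Jar=s3://my-bucket/jobs/crunch.jar,Args=[s3://my-bucket/input/2013/,s3://my-bucket/output/2013/] \
      --auto-terminate

With --auto-terminate the cluster disappears as soon as the step finishes, so you only pay for the minutes it ran; reading input from S3 rather than HDFS is what lets several such clusters share one dataset.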
Snappy Configuration in Hadoop 2.5.2
Hi, experts. I've run into the following problem when configuring the Snappy lib in Hadoop 2.5.2.

My Snappy installation home is /opt/snappy. My Hadoop installation home is /opt/hadoop/hadoophome. To configure the Snappy path, I tried to add the following environment variables in /etc/profile and hadoop-env.sh:

  export JAVA_LIBRARY_PATH=/opt/hadoop/hadoophome/lib/native:/opt/snappy/lib
  export LD_LIBRARY_PATH=/opt/hadoop/hadoophome/lib/native:/opt/snappy/lib

After the configuration, I ran the command hadoop checknative. The result, shown below, I think means Hadoop can find the Snappy lib:

  Native library checking:
  hadoop: true /opt/hadoop/hadoop-2.5.2/lib/native/libhadoop.so.1.0.0
  zlib:   true /lib64/libz.so.1
  snappy: true /opt/snappy/lib/libsnappy.so.1
  lz4:    true revision:99
  bzip2:  false

But when I ran a MapReduce job, it reported the following error:

  Error: java.lang.RuntimeException: native snappy library not available: SnappyCompressor has not been loaded.
      at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:69)
      at org.apache.hadoop.io.compress.SnappyCodec.createCompressor(SnappyCodec.java:143)
      at org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:98)
      at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:136)

I also tried to set io.compression.codecs, but it did not work either. The only way I found that worked is to make a soft link as follows:

  ln -s /opt/snappy/lib/libsnappy.so.1.2.1 /opt/hadoop/hadoophome/lib/native/libsnappy.so.1

I used to configure Snappy in Hadoop 2.4.0 successfully; I remember I only needed to set LD_LIBRARY_PATH in /etc/profile, with no need to make such a soft link. Does Hadoop 2.5.2 not support this configuration anymore? Or is there another way to configure it in Hadoop 2.5.2 that doesn't require making links or copying the lib into Hadoop's lib/native directory? Many thanks!
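One likely explanation, offered as an educated guess rather than something verified against the 2.5.2 source: hadoop checknative runs in your login shell, which inherits LD_LIBRARY_PATH from /etc/profile, but the map and reduce task JVMs are launched by the NodeManager and only see the library path set in mapreduce.admin.user.env, whose default points at $HADOOP_COMMON_HOME/lib/native, which would explain why the soft link works. A sketch of overriding it per job, assuming your driver goes through ToolRunner so -D properties are picked up; myjob.jar and com.example.MyJob are placeholders:

  # point the task JVMs at the snappy .so without symlinking it into lib/native
  hadoop jar myjob.jar com.example.MyJob \
      -Dmapreduce.admin.user.env="LD_LIBRARY_PATH=/opt/hadoop/hadoophome/lib/native:/opt/snappy/lib" \
      -Dmapreduce.output.fileoutputformat.compress=true \
      -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
      /input /output

Setting the same mapreduce.admin.user.env value once in mapred-site.xml should apply it cluster-wide, which would avoid both the link and the per-job flag.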
sorting in hive -- general
Hello all, I am new to Hadoop and Hive in general, and I am reading Hadoop: The Definitive Guide by Tom White. On page 504, in the Hive chapter, Tom says the following with regard to sorting:

*Sorting and Aggregating*

*Sorting data in Hive can be achieved by using a standard ORDER BY clause. ORDER BY performs a parallel total sort of the input (like that described in “Total Sort” on page 261). When a globally sorted result is not required—and in many cases it isn’t—you can use Hive’s nonstandard extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*

My question is: what exactly does he mean by a globally sorted result? And if the SORT BY operation produces a sorted file per reducer, does that mean at the end of the sort all the reducers' outputs are put back together to give the correct result?
Re: sorting in hive -- general
A SORT BY query produces multiple independent files; ORDER BY produces just one. Usually SORT BY is used together with DISTRIBUTE BY. In older Hive versions (0.7) they might be used to implement a local sort within a partition, similar to RANK() OVER (PARTITION BY A ORDER BY B).

To your second question: no, the per-reducer files are not merged back together. Each reducer's file is sorted within itself, but there is no ordering across files; a "globally sorted result" means a single total order over all the output rows, which is what ORDER BY gives you. See the sketch below.
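A small illustration of the difference, run from the shell with hive -e; the products table and its columns are invented for the example:

  # ORDER BY: a single globally ordered result (in practice one final
  # reducer, hence one output file)
  hive -e "SELECT name, price FROM products ORDER BY price DESC;"

  # DISTRIBUTE BY + SORT BY: all rows for a given category go to the same
  # reducer, and each reducer's output file is sorted on its own, but the
  # files are never merged into one global order
  hive -e "SELECT category, name, price FROM products
           DISTRIBUTE BY category
           SORT BY category, price DESC;"

Concatenating the SORT BY output files would give you runs of sorted rows, not one sorted whole; that is the difference the book is pointing at.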