Small question

2012-10-03 Thread Abhishek
Hi all, how would the Hive query below be written in Pig Latin? select t2.col1, t3.col2 from table2 t2 join table3 t3 WHERE t3.col2 IS NOT NULL AND t2.col1 LIKE CONCAT(CONCAT('%',t3.col2),'%') Regards Abhi
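Pig Latin has no non-equi join, so a LIKE-join like this is usually expressed as a CROSS followed by a FILTER. A minimal sketch, in which the table paths, loaders, and schemas are all assumptions, and which uses MATCHES (a regex match, so it only behaves like LIKE when col2 contains no regex metacharacters):

```pig
-- Hypothetical sketch: cross the two relations, then filter on the pattern.
-- CROSS is expensive; this only scales if t3 is small.
t2 = LOAD 'table2' AS (col1:chararray);
t3 = LOAD 'table3' AS (col2:chararray);
t3_notnull = FILTER t3 BY col2 IS NOT NULL;
pairs = CROSS t2, t3_notnull;
matched = FILTER pairs BY t2::col1 MATCHES CONCAT('.*', CONCAT(t3_notnull::col2, '.*'));
result = FOREACH matched GENERATE t2::col1, t3_notnull::col2;
```

Note the nested CONCAT, mirroring the original query: in the Pig of this era CONCAT took exactly two arguments.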

Re: Which hardware to choose

2012-10-03 Thread J. Rottinghuis
Of course it all depends... But something like this could work: leave 1-2 GB for the kernel, page cache, tools, overhead, etc.; plan 3-4 GB each for the DataNode and TaskTracker; plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more or less memory per slot. Have 2-3 times as many
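The budget above can be turned into a worked example. A sketch assuming a hypothetical 48 GB machine, with 3 GB taken for each daemon and 3 GB per slot; all numbers are illustrative, not recommendations:

```java
public class SlotBudget {
    // How many task slots fit after the fixed reservations are taken out.
    static int slots(int totalGb) {
        int osReserve = 2;          // kernel, page cache, tools, overhead
        int daemonReserve = 2 * 3;  // DataNode + TaskTracker at ~3 GB each
        int perSlotGb = 3;          // upper end of the 2.5-3 GB per-slot range
        return (totalGb - osReserve - daemonReserve) / perSlotGb;
    }

    public static void main(String[] args) {
        System.out.println(slots(48) + " slots"); // 13 slots for a 48 GB node
    }
}
```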

Re: Which hardware to choose

2012-10-03 Thread Michael Segel
Well... If you're not running HBase, you're less harmed by minimal swapping, so you could push the number of slots and oversubscribe. The only thing I would suggest is that you monitor your system closely as you adjust the number of slots. You have to admit, though, it's fun to tune

Re: Small question

2012-10-03 Thread J. Rottinghuis
Moved common-user@hadoop.apache.org to bcc and added u...@pig.apache.org. This is best asked on the Pig users list. Cheers, Joep On Wed, Oct 3, 2012 at 7:04 AM, Abhishek abhishek.dod...@gmail.com wrote: Hi all, Below hive query in pig latin how to do that. select t2.col1, t3.col2 from table2 t2

Pig vs hive performance

2012-10-03 Thread Abhishek
Hi all, can we discuss the performance of Pig vs. Hive? 1) What is Hive good at? 2) What is Pig good at? 3) Hive's optimizer vs. Pig's optimizer. 4) Hive's limitations vs. Pig's limitations. Regards Abhi Sent from my iPhone

Re: Pig vs hive performance

2012-10-03 Thread TianYi Zhu
From the Amazon web site: http://aws.amazon.com/elasticmapreduce/faqs/#hive-8 Q: When should I use Hive vs. PIG? Hive and PIG both provide high-level data-processing languages with support for complex data types for operating on large datasets. The Hive language is a variant of SQL and so is more

Re: Pig vs hive performance

2012-10-03 Thread Abhishek
Hi Zhu, thanks for the reply. I am running some queries where Hive is slower than Pig. I was also thinking that the Pig optimizer is better than the Hive optimizer. Regards Abhi Sent from my iPhone On Oct 3, 2012, at 7:15 PM, TianYi Zhu tianyi@facilitatedigital.com wrote: from amazon web site:

Re: Pig vs hive performance

2012-10-03 Thread TianYi Zhu
Hi Abhishek, I have no insight into the optimizer. In my opinion, a SQL-like language is hard to optimize, so Hive may be slower than Pig in many cases. Still, for every Hadoop job there must be a best (time or space) sequence of map/reduce phases; you should rewrite your pig/hive

Re: Pig vs hive performance

2012-10-03 Thread Abhishek
Thanks, Zhu, for your reply; your points make sense to me. Regards Abhishek On Oct 3, 2012, at 8:14 PM, TianYi Zhu tianyi@facilitatedigital.com wrote: Hi Abhishek, I've no idea with the optimizer. In my opinion, SQL like programming language is hard to optimize, hive may slower than

Re: Pig vs hive performance

2012-10-03 Thread abhishek dodda
On Wed, Oct 3, 2012 at 7:50 PM, Dan Richelson drichel...@tendrilinc.com wrote: Anecdotally I can say that Pig seems to scale down better than Hive. We see this in tests- hive scripts running small amounts of data take much longer than similar Pig scripts. Hive parallel settings are enabled.

Re: Pig vs hive performance

2012-10-03 Thread Dan Richelson
Anecdotally, I can say that Pig seems to scale down better than Hive. We see this in tests: Hive scripts running small amounts of data take much longer than similar Pig scripts, even with Hive's parallel settings enabled. I think this has to do with the fact that there doesn't seem to be a 'local' mode for
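For comparison, Pig's local mode runs a script in a single JVM with no cluster at all, which is what lets small jobs start quickly. A minimal CLI sketch, assuming the stock pig launcher is on the PATH and the script name is hypothetical:

```sh
# Run the script locally against the local filesystem, no JobTracker involved
pig -x local myscript.pig
```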

Re: Hadoop Jobtracker web administration tool, hadoop job -list and Tasktrackers Web UI's show no information

2012-10-03 Thread Romedius Weiss
Sorry for the double post. Hi! The services are running on the master (NN, JT, TT, DN) and slave (TT, DN) according to jps. In the web UIs the slaves are shown as up and running; I'm getting heartbeats and everything. When running a job, it completes and logs everything to the command prompt. The

Re: Classic(MapReduce 1) cluster in Hadoop 0.23 just won't listen

2012-10-03 Thread Harsh J
Hi, The classic option exists to provide backward compatibility for users wanting to run an MR1 cluster (with JT, etc.). With the inclusion of YARN and MR2 modes of runtime, Apache Hadoop removed MR1 services support: ➜ mapred jobtracker Sorry, the jobtracker command is no longer supported.

Re: Hadoop Jobtracker web administration tool, hadoop job -list and Tasktrackers Web UI's show no information

2012-10-03 Thread Steve Loughran
I have no DHCP, DNS is configured manually (and there might be the problem) etc. hosts on both machines:
172.16.0.1 master vmmaker-ubd64
172.16.0.11 slave vmmaker-slave
127.0.0.1 localhost
# on master
127.0.0.1 vmmaker-ubd64
# on slave
127.0.0.1 vmmaker-slave
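A common cause of "everything looks up but the UIs show nothing" is exactly those last lines: when a machine's own hostname resolves to 127.0.0.1, the daemons register with loopback addresses that no other machine can reach. A hedged sketch of a hosts file without that mapping, reusing the addresses from the message (whether this matches the actual network is an assumption):

```
# identical on master and slave; no "127.0.0.1 <hostname>" lines
127.0.0.1    localhost
172.16.0.1   vmmaker-ubd64 master
172.16.0.11  vmmaker-slave slave
```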

Re: Lib conflicts

2012-10-03 Thread Michael Segel
Yup, I hate that when it happens. You tend to see this more with Avro than anything else. The issue is that in Java, the first class loaded wins. So when Hadoop loads 1.4 first, you can't unload it and replace it with 1.7. The only solution that we found to be workable is to replace the

Re: Lib conflicts

2012-10-03 Thread Harsh J
Hi Ben, As long as the switch of libraries doesn't impact the execution of the Child task code itself, for Apache Hadoop 1.x, using the config mapreduce.user.classpath.first set to true may solve your trouble. On Wed, Oct 3, 2012 at 4:51 PM, Ben Rycroft brycrof...@gmail.com wrote: Hi all, I
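The switch Harsh mentions is an ordinary configuration property. A minimal sketch of setting it, for Apache Hadoop 1.x, either in mapred-site.xml or in the per-job configuration:

```xml
<!-- Put the user's jars ahead of Hadoop's bundled ones on the task classpath -->
<property>
  <name>mapreduce.user.classpath.first</name>
  <value>true</value>
</property>
```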

Re: Lib conflicts

2012-10-03 Thread Jay Vyas
Yup! I hate this issue. It also happens with JSON libs if you have an old Hadoop! It's really easy to dump the exact classpath at runtime using the ((URLClassLoader) ClassLoader.getSystemClassLoader()).getURLs() trick; very useful :) I always do this just to make sure
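Spelled out, the trick looks like the sketch below. Note the cast only works on Java 8 and earlier, where the system class loader really is a URLClassLoader; the fallback to the java.class.path property is an addition of mine for newer JVMs:

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ClasspathDump {
    // Returns one classpath entry per line.
    static String classpathEntries() {
        ClassLoader cl = ClassLoader.getSystemClassLoader();
        StringBuilder sb = new StringBuilder();
        if (cl instanceof URLClassLoader) {              // Java 8 and earlier
            for (URL url : ((URLClassLoader) cl).getURLs()) {
                sb.append(url).append('\n');
            }
        } else {                                         // Java 9+ fallback
            sb.append(System.getProperty("java.class.path"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(classpathEntries());
    }
}
```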

Re: A small portion of map tasks slows down the job

2012-10-03 Thread Hemanth Yamijala
Hi, would reducing the output from the map tasks solve the problem? I.e., are the reducers slowing down because a lot of data is being shuffled? If that's the case, you could see whether the map output size shrinks by using the framework's combiner or an in-mapper combining technique. Thanks
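The in-mapper combining technique mentioned above keeps a per-task map of partial aggregates and emits one (key, partial sum) pair per key instead of one pair per record. A minimal standalone sketch of the idea using plain Java collections rather than Hadoop's Mapper types; the names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class InMapperCombine {
    // Aggregate counts in memory; in a real mapper the map would be a field,
    // filled during map() and emitted once from cleanup().
    static Map<String, Integer> combine(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum); // local sum replaces per-token emits
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(combine(new String[] {"a", "b", "a"}));
    }
}
```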

Re: How to lower the total number of map tasks

2012-10-03 Thread Shing Hing Man
I have followed a suggestion on the given link and set mapred.min.split.size to 134217728. With that setting, I get mapred.map.tasks = 121 (previously it was 242). Thanks for all the replies! Shing From: Romedius Weiss
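For reference, 134217728 bytes is 128 * 1024 * 1024, i.e. 128 MB; assuming the default 64 MB block size of that era, this doubles the effective split size, which is consistent with the map count halving from 242 to 121. A sketch of the setting in mapred-site.xml or the job configuration:

```xml
<!-- Never create splits smaller than 128 MB (134217728 bytes) -->
<property>
  <name>mapred.min.split.size</name>
  <value>134217728</value>
</property>
```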

Counters that track the max value

2012-10-03 Thread Jeremy Lewi
Hi hadoop-users, I'm curious whether there is an implementation somewhere of a counter that tracks the maximum of some value across all mappers or reducers? Thanks J

Re: Counters that track the max value

2012-10-03 Thread Harsh J
Jeremy, here's my shot at it (pardon the quick crappy code): https://gist.github.com/3828246 Basically, you can achieve it in two ways. Requirement: all tasks must increment the designated max counter only AFTER the max has been computed (i.e. in cleanup). 1. All tasks may use the same counter
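Independent of the gist, the core of the approach can be sketched without Hadoop types: each task tracks a running local maximum and reports it exactly once from cleanup(). Since Hadoop counters are summed when aggregated across tasks, the global maximum then has to be taken on the client side over per-task values; the class below only models the per-task half, and its names are hypothetical:

```java
public class MaxTracker {
    private long localMax = Long.MIN_VALUE;

    // Called once per record in map() or reduce().
    void observe(long value) {
        if (value > localMax) localMax = value;
    }

    // Called exactly once, from cleanup(); the result would be written
    // to a per-task counter for the driver to max over.
    long report() {
        return localMax;
    }

    public static void main(String[] args) {
        MaxTracker t = new MaxTracker();
        t.observe(3); t.observe(7); t.observe(5);
        System.out.println(t.report()); // 7
    }
}
```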

GenericOptionsParser

2012-10-03 Thread Koert Kuipers
Why does GenericOptionsParser also remove -Dprop=value options (without setting the system properties)? Those are not Hadoop options but Java options. And why does hadoop jar not accept -Dprop=value before the class name, as java does, like hadoop jar -Dprop=value class? How do I set Java system

[quickstart] Setting up a Cluster with Whirr and Cloudera Manager

2012-10-03 Thread George London
Hi All, I recently had occasion to learn Hadoop/HBase and set up a fully distributed cluster on EC2. To save future Hadoop'ers some of the frustration that involved, I wrote a how-to blog post detailing the necessary steps to get going quickly and avoid the gotchas. As far as I know, this is

Re: GenericOptionsParser

2012-10-03 Thread Harsh J
Koert, system properties ought to go to the JVM args directly; use HADOOP_CLIENT_OPTS to pass those -Ds. This is because system properties are a JVM-level concept and java is the utility that accepts them; we provide a wrapper in the form of hadoop jar. GenericOptionsParser (i.e. Tool - Don't use GOP
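Putting the two halves together, a CLI sketch of the distinction; the jar, class, and property names are assumptions:

```sh
# JVM system properties reach the client JVM via HADOOP_CLIENT_OPTS;
# Hadoop configuration -D options go after the class name, where
# GenericOptionsParser (via Tool/ToolRunner) picks them up.
HADOOP_CLIENT_OPTS="-Dmy.sys.prop=value" \
  hadoop jar myjob.jar com.example.MyJob -D mapred.reduce.tasks=4 in out
```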

Re: Classic(MapReduce 1) cluster in Hadoop 0.23 just won't listen

2012-10-03 Thread Harsh J
I am incorrect on the below: The classic option exists to provide backward compatibility for users wanting to run an MR1 cluster (with JT, etc.). Turns out, classic is just for older clients to run on YARN without any other changes. It roughly translates the RM address from a JT property into

Re: hadoop disk selection

2012-10-03 Thread Andy Isaacson
Moving this to user@ since it's not appropriate for general@. On Fri, Sep 28, 2012 at 11:16 PM, Xiang Hua bea...@gmail.com wrote: Hi, I want to combine 4 local 600 GB disks with 3 * 800 GB disks from a disk array in one datanode. Is there any problem? Performance? The recommended
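The usual recommendation in this situation is JBOD: give the DataNode one data directory per physical disk rather than striping mixed disks together with RAID or LVM, since HDFS already handles replication and round-robins writes across directories. A hedged hdfs-site.xml sketch with hypothetical mount points:

```xml
<!-- One directory per physical disk; paths are illustrative -->
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/dfs/data,/disk2/dfs/data,/disk3/dfs/data,/disk4/dfs/data</value>
</property>
```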