Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

2012-01-10 Thread Harsh J
Hello Hao, I am sorry if I confused you. By CPUs I meant the CPUs visible to your OS (/proc/cpuinfo), so yes, the total number of cores. On 10-Jan-2012, at 12:39 PM, hao.wang wrote: Hi, Thanks for your reply! According to your suggestion, maybe I can't apply it to our Hadoop cluster. Cus,

Calling webservices in Hadoop

2012-01-10 Thread Shreya.Pal
Hi, Is it possible to get data from web services using Hadoop MR jobs? Regards, Shreya

Re: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

2012-01-10 Thread hao.wang
Hi, Thanks for your help; your suggestion is very useful. I have another question: should the sum of map and reduce slots equal the total number of cores? regards! 2012-01-10 hao.wang From: Harsh J Sent: 2012-01-10 16:44:07 To: common-user Cc: Subject: Re: how to set

Re: connection between slaves and master

2012-01-10 Thread Praveen Sripati
Mark, [mark@node67 ~]$ telnet node77 You need to specify the port number along with the server name like `telnet node77 1234`. 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s). Slaves are not able to

Re: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

2012-01-10 Thread Prashant Kommireddi
Hi Hao, Ideally you would want to leave out a core each for the TaskTracker and DataNode processes on each node. The rest could be used for maps and reducers. Thanks, Prashant 2012/1/10 hao.wang hao.w...@ipinyou.com Hi, Thanks for your help; your suggestion is very useful. I have another

Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

2012-01-10 Thread Harsh J
Yes, divide the number of cores between map and reduce slots. Depending on your workload, start with a 4:3 ratio and work your way to better tuning eventually (if you have more map-only jobs, adjust ratio accordingly, etc.). Changing slot params requires TaskTracker restarts alone, not
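
  A minimal sketch of how these two properties are usually set, in mapred-site.xml on each TaskTracker node (the 4:3 split below is just the starting ratio suggested above for an 8-core box; tune it for your workload, and restart the TaskTrackers afterwards):

    <!-- mapred-site.xml on each TaskTracker node (illustrative values) -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>3</value>
    </property>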

RE: has bzip2 compression been deprecated?

2012-01-10 Thread Tony Burton
Thanks all for the advice - one more question on re-reading Harsh's helpful reply. Intermediate (M-to-R) files use a custom IFile format these days. How recent is "these days", and can this addition be pinned down to any one version of Hadoop? Tony -Original Message- From: Harsh J

Re: has bzip2 compression been deprecated?

2012-01-10 Thread Joey Echeverria
Yes. Hive doesn't format data when you load it. The only exception is if you do an INSERT OVERWRITE ... . -Joey On Jan 10, 2012, at 6:08, Tony Burton tbur...@sportingindex.com wrote: Thanks for this Bejoy, very helpful. So, to summarise: when I CREATE EXTERNAL TABLE in Hive, the STORED

Re: has bzip2 compression been deprecated?

2012-01-10 Thread Harsh J
Tony, Sorry for being ambiguous, I was too lazy to search at the time. This has been the case since release 0.18.0. See https://issues.apache.org/jira/browse/HADOOP-2095 for more information. On 10-Jan-2012, at 4:18 PM, Tony Burton wrote: Thanks all for advice - one more question on

Yarn Container Limit

2012-01-10 Thread raghavendhra rahul
Hi, How do I set the maximum number of containers to be executed on each node, so that at any time only that many containers will be running on that node?

Re: Calling webservices in Hadoop

2012-01-10 Thread Jayunit100
At the Cloudera course they said this is a bad idea, but I'm working at a place that does just this... in the reducers. The answer is yes, you can make HTTP requests in Hadoop jobs. I'd like to know more about others' thoughts on this. Is it customary? Jay Vyas MMSB UCHC On Jan 10,
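
  For reference, a rough sketch of what this looks like in a mapper (the URL-per-input-line layout, key/value types, and timeouts are illustrative only; note Harsh's caution below about a whole cluster of tasks hitting one site at once):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch only: each input line is assumed to be a URL; the response body is emitted as the value.
    public class WebServiceMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        URL url = new URL(value.toString());
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        StringBuilder body = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            body.append(line);
          }
        } finally {
          in.close();
          conn.disconnect();
        }
        context.write(value, new Text(body.toString()));
      }
    }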

Re: increase number of map tasks

2012-01-10 Thread GorGo
Hi. I am no expert, but you could try this. Your problem, I guess, is that the record reader reads multiple lines of work (tasks) and gives them to each mapper, so if you only have a few tasks (lines of work in the input file) Hadoop will not spawn multiple mappers. You could try this: make

Re: Calling webservices in Hadoop

2012-01-10 Thread Harsh J
If you are looking to crawl websites, you can take a look at Apache Nutch and how it connects with Apache Hadoop. I'll let others comment on why we do not recommend this, but I can surely think of a case where a large-slotted cluster having all its tasks hitting a particular site at the same

Re: increase number of map tasks

2012-01-10 Thread Robert Evans
Similarly there is the NLineInputFormat that does this automatically. If your input is small it will read in the input and make a split for every N lines of input. Then you don't have to reformat your data files. --Bobby Evans On 1/10/12 8:09 AM, GorGo gylf...@ru.is wrote: Hi. I am no
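
  A sketch of the job setup Bobby describes, using the newer MapReduce API (on older 0.20-based releases you may need org.apache.hadoop.mapred.lib.NLineInputFormat with the JobConf API instead; the path and the 10-lines-per-split value are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    Configuration conf = new Configuration();
    Job job = new Job(conf, "n-line-split demo");
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 10);   // one map task per 10 input lines
    FileInputFormat.addInputPath(job, new Path("/user/demo/work-list.txt"));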

Hadoop PIPES job using C++ and binary data results in data locality problem.

2012-01-10 Thread GorGo
Hi everyone. I am running C++ code using the PIPES wrapper and I am looking for some tutorials, examples or any kind of help with regard to using binary data. My problem is that I am working with large chunks of binary data, and converting it to text is not an option. My first question is thus,

Re: Hadoop PIPES job using C++ and binary data results in data locality problem.

2012-01-10 Thread Robert Evans
I think what you want to try and do is to use JNI rather than pipes or streaming. PIPES has known issues and it is my understanding that its use is now discouraged. The ideal way to do this is to use JNI to send your data to the C code. Be aware that moving large amounts of data through JNI

Re: has bzip2 compression been deprecated?

2012-01-10 Thread Bejoy Ks
Hi Tony Please find responses inline So, to summarise: when I CREATE EXTERNAL TABLE in Hive, the STORED AS, ROW FORMAT and other parameters you mention are telling Hive what to expect when it reads the data I want to analyse, despite not checking the data to see if it meets these criteria?

Re: WritableComparable and the case of duplicate keys in the reducer

2012-01-10 Thread William Kinney
I have noticed this too with one job. Keys that are equal (equals() true, hashCode() equal, and compareTo() == 0) are being sent to multiple reduce tasks, therefore resulting in incorrect output. Any insight? On Sat, Aug 13, 2011 at 11:14 AM, Stan Rosenberg srosenb...@proclivitysystems.com wrote: Hi

Re: WritableComparable and the case of duplicate keys in the reducer

2012-01-10 Thread W.P. McNeill
The Hadoop framework reuses Writable objects for key and value arguments, so if your code stores a pointer to that object instead of copying it you can find yourself with mysterious duplicate objects. This has tripped me up a number of times. Details on what exactly I encountered and how I fixed
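
  A small illustration of the pitfall W.P. describes (the Text types and the buffering are just for the example): copy the value before caching it, because Hadoop hands the reducer the same Writable instance on every iteration.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CopyingReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        List<Text> buffered = new ArrayList<Text>();
        for (Text val : values) {
          // buffered.add(val);          // WRONG: every element ends up pointing at the same reused object
          buffered.add(new Text(val));   // RIGHT: take a defensive copy
        }
        for (Text val : buffered) {
          context.write(key, val);
        }
      }
    }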

Re: WritableComparable and the case of duplicate keys in the reducer

2012-01-10 Thread William Kinney
I'm (unfortunately) aware of this and this isn't the issue. My key object contains only long, int and String values. The job map output is consistent, but the reduce input groups and values for the key vary from one job to the next on the same input. It's like it isn't properly comparing and

Re: WritableComparable and the case of duplicate keys in the reducer

2012-01-10 Thread William Kinney
Naturally after I send that email I find that I am wrong. I was also using an enum field, which was the culprit. On Tue, Jan 10, 2012 at 6:13 PM, William Kinney william.kin...@gmail.comwrote: I'm (unfortunately) aware of this and this isn't the issue. My key object contains only long, int and
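
  For anyone hitting the same thing: an enum's default hashCode() is identity-based and can differ from one JVM to the next, so a key hashCode() built on it can send equal keys to different partitions. A sketch of one way to keep an enum field consistent across write/read, compareTo, hashCode and equals (field and class names here are made up):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class EventKey implements WritableComparable<EventKey> {
      public enum Type { CLICK, VIEW }

      private long timestamp;
      private Type type;

      public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);
        out.writeUTF(type.name());                 // serialize the enum explicitly
      }

      public void readFields(DataInput in) throws IOException {
        timestamp = in.readLong();
        type = Type.valueOf(in.readUTF());
      }

      public int compareTo(EventKey other) {
        if (timestamp != other.timestamp) {
          return timestamp < other.timestamp ? -1 : 1;
        }
        return type.name().compareTo(other.type.name());  // stable ordering by name
      }

      @Override
      public int hashCode() {                      // stable across JVMs, unlike the enum's identity hash
        return (int) (timestamp ^ (timestamp >>> 32)) * 31 + type.name().hashCode();
      }

      @Override
      public boolean equals(Object o) {
        if (!(o instanceof EventKey)) return false;
        EventKey k = (EventKey) o;
        return timestamp == k.timestamp && type == k.type;
      }
    }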

how to specify class name to run in mapreduce job

2012-01-10 Thread T Vinod Gupta
Hi, how can I specify which class's main method to run as a job when I do MapReduce? Let's say my jar has 4 classes and each one of them has a main method. I want to pass the class name in the 'hadoop jar jarfile classname' command. This will be similar to running stock tools inside HBase or other

Re: Yarn Container Limit

2012-01-10 Thread Vinod Kumar Vavilapalli
You can use yarn.nodemanager.resource.memory-mb to set the limit on each NodeManager. You should have a good look at http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html . It has enough information to get you a good distance. HTH. +Vinod On Tue, Jan 10,
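
  A minimal sketch of that setting, in yarn-site.xml on each NodeManager (8192 MB is just an example value): the number of containers that can run concurrently on a node is roughly this total divided by the memory each container requests.

    <!-- yarn-site.xml on each NodeManager (illustrative value) -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>8192</value>
    </property>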

Re: Container launch from appmaster

2012-01-10 Thread Vinod Kumar Vavilapalli
Yes, you can. http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html#Writing_an_ApplicationMaster should give you a very good idea and example code about this. But, the requirements are not hard-fixed. If the scheduler cannot find free resources on

Package/call a binary in hadoop job

2012-01-10 Thread Daren Hasenkamp
Hi, I would like to bundle a binary with a hadoop job and call it from inside the mappers/reducers. The binary is a C++ program that I do not want to re-implement in Java. I want to fork it as a subprocess from inside mappers/reducers and capture the output (on stdout). So, I need to get the

getting file position for a LZO file

2012-01-10 Thread Paul Ho
Hi all, For the TextInputFormat class, the input key is a file position. This is working well. But when I switch to LzoTextInputFormat to read LZO files, the key does not make sense. It does not indicate file position. Is the file position supported with LzoTextInputFormat? Here is a job

Re: Package/call a binary in hadoop job

2012-01-10 Thread Ravi Prakash
Couldn't you write a simple wrapper around your binary, include the binary using the -file option and use Streaming? Or use the distributed cache to copy your binaries to all the compute nodes. On Tue, Jan 10, 2012 at 5:01 PM, Daren Hasenkamp dhasenk...@berkeley.eduwrote: Hi, I would like to
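
  A rough sketch of the distributed-cache approach with the Hadoop 1.x API (the HDFS path, the "mytool" symlink name, and the output handling are all illustrative, not a fixed recipe):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;

    public class BinaryTool {
      // Job setup: ship the binary from HDFS and symlink it into each task's working directory.
      public static void configure(Configuration conf) throws Exception {
        DistributedCache.createSymlink(conf);
        DistributedCache.addCacheFile(new URI("/user/demo/tools/mytool#mytool"), conf);
      }

      // Called from map() or reduce(): fork the binary and capture its stdout.
      public static String run(String arg) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("./mytool", arg);
        pb.redirectErrorStream(true);           // merge stderr into stdout
        Process p = pb.start();
        StringBuilder output = new StringBuilder();
        BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
        try {
          String line;
          while ((line = out.readLine()) != null) {
            output.append(line).append('\n');
          }
        } finally {
          out.close();
        }
        if (p.waitFor() != 0) {
          throw new RuntimeException("mytool exited with non-zero status");
        }
        return output.toString();
      }
    }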

Re: how to specify class name to run in mapreduce job

2012-01-10 Thread Bejoy Ks
Hi Vinod, You can use the format: hadoop jar jarName className. For example: hadoop jar /home/user/sample.jar com.sample.apps.MainClass. Don't specify a main class while packing your jar. This lets you incorporate multiple entry points in the same jar for different functionality.
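
  The other common pattern, sketched below, is the one Hadoop's own examples jar uses: set a single driver as the jar's Main-Class and let org.apache.hadoop.util.ProgramDriver map short names to the real entry points. The "wordcount" mapping reuses Hadoop's bundled example class; "mytool"/MyTool is hypothetical.

    import org.apache.hadoop.util.ProgramDriver;

    public class MyDriver {
      public static void main(String[] args) {
        int exitCode = -1;
        ProgramDriver pgd = new ProgramDriver();
        try {
          pgd.addClass("wordcount", org.apache.hadoop.examples.WordCount.class,
              "counts the words in the input files");
          // pgd.addClass("mytool", MyTool.class, "hypothetical second entry point");
          pgd.driver(args);    // dispatches to the named program's main()
          exitCode = 0;
        } catch (Throwable t) {
          t.printStackTrace();
        }
        System.exit(exitCode);
      }
    }

  With that driver set as the manifest Main-Class, you would run e.g. hadoop jar myjar.jar wordcount input output.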