Reduce > copy at 0.00 MB/s

2012-01-25 Thread praveenesh kumar
Hey, can anyone explain what the reduce > copy phase in the reducer section is? The (K, List(V)) is passed to the reducer. Does reduce > copy represent copying of (K, List(V)) to the reducer from all mappers? I am monitoring my jobs on the cluster using the JobTracker URL. I am seeing for most of my

Connect to HDFS running on a different Hadoop-Version

2012-01-25 Thread Romeo Kienzler
Dear List, we're trying to use a central HDFS storage that can be accessed from various other Hadoop distributions. Do you think this is possible? We're having trouble, but not related to different RPC versions. When trying to access a Cloudera CDH3 Update 2 (cdh3u2) HDFS from BigInsigh

Re: Reduce > copy at 0.00 MB/s

2012-01-25 Thread hadoop hive
I faced the same issue, but after some time, when I balanced the cluster, the jobs started running fine. On Wed, Jan 25, 2012 at 3:34 PM, praveenesh kumar wrote: > Hey, > > Can anyone explain me what is reduce > copy phase in the reducer section ? > The (K,List(V)), is passed to the reducer. Is reduce

Re: Reduce > copy at 0.00 MB/s

2012-01-25 Thread praveenesh kumar
@hadoophive Can you explain what you mean by "balance the cluster"? Thanks, Praveenesh On Wed, Jan 25, 2012 at 4:29 PM, hadoop hive wrote: > i face the same issue but after sumtime when i balanced the cluster the > jobs started running fine, > > On Wed, Jan 25, 2012 at 3:34 PM, praveenesh kumar >wrot

Understanding fair schedulers

2012-01-25 Thread praveenesh kumar
Understanding Fair Schedulers better. Can we create multiple pools in the Fair Scheduler? I guess yes; please correct me. Suppose I have 2 pools in my fair-scheduler.xml: 1. Hadoop-users: Min map: 10, Max map: 50, Min reduce: 10, Max reduce: 50 2. Admin-users: Min map: 20, Max map: 80, Min Re
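The two pools described above could be sketched in a fair-scheduler.xml allocations file roughly like this. This is a hedged sketch, not verified against a running cluster: element names should be checked against the Fair Scheduler documentation for your Hadoop version, and the Admin-users reduce limits are placeholders, since the original message is truncated at that point.

```xml
<?xml version="1.0"?>
<allocations>
  <pool name="Hadoop-users">
    <minMaps>10</minMaps>
    <maxMaps>50</maxMaps>
    <minReduces>10</minReduces>
    <maxReduces>50</maxReduces>
  </pool>
  <pool name="Admin-users">
    <minMaps>20</minMaps>
    <maxMaps>80</maxMaps>
    <!-- reduce limits below are placeholders; the original message is cut off -->
    <minReduces>20</minReduces>
    <maxReduces>80</maxReduces>
  </pool>
</allocations>
```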

Re: Hadoop Terasort Error- "File _partition.lst does not exist"

2012-01-25 Thread Utkarsh Rathore
Thanks Harsh. I'll look into the TaskTracker logs to find any issues with MapReduce and update this thread accordingly. (PS: Sorry for the wide circulation. My mails still don't directly land on common-user@hadoop.apache.org, so I tried posting through Nabble and something got broken. I have mai

Using MultipleOutputs with new API (v1.0)

2012-01-25 Thread Ondřej Klimpera
Hello, I'm trying to develop an application, where Reducer has to produce multiple outputs. In detail I need the Reducer to produce two types of files. Each file will have different output. I found in Hadoop, The Definitive Guide, that new API uses only MultipleOutputs, but working with Mu

Re: Connect to HDFS running on a different Hadoop-Version

2012-01-25 Thread Harsh J
Hello Romeo, Inline… On Wed, Jan 25, 2012 at 4:07 PM, Romeo Kienzler wrote: > Dear List, > > we're trying to use a central HDFS storage in order to be accessed from > various other Hadoop-Distributions. The HDFS you've setup, what 'distribution' is that from? You will have to use that particula

Re: Connect to HDFS running on a different Hadoop-Version

2012-01-25 Thread Michael Segel
BigInsights? ... OK, I'll be nice ... :-) OK, so if I understand your question, you want a single HDFS file system to be used by different 'Hadoop' frameworks (derivatives)? First, it doesn't make sense. I mean it really doesn't make any sense. Second... I don't think it would be possib

Re: Hadoop Terasort Error- "File _partition.lst does not exist"

2012-01-25 Thread Harsh J
It's not your TaskTracker that's failing; your job itself is running locally, not on a JobTracker. This would not work for what you're trying to run. Are you sure you have the right mapred-site.xml configuration from where you launch your job? On Wed, Jan 25, 2012 at 5:12 PM, Utkarsh Rathore w

Re: Connect to HDFS running on a different Hadoop-Version

2012-01-25 Thread alo alt
BigInsights is an IBM product, based on a fork of Hadoop, I think. Mixing totally different stacks makes no sense, and will not work, I guess. - Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 25, 2012, at 1:12 PM, Harsh J wrote: > Hello Romeo, > > Inline… > > On Wed,

Re: Using MultipleOutputs with new API (v1.0)

2012-01-25 Thread Harsh J
What version/release/distro of Hadoop are you using? Apache releases got the new (unstable) API MultipleOutputs only in 0.21+, and it was only very recently backported to branch-1. That said, the next release in 1.x (1.1.0, out soon) will carry the new API MultipleOutputs, but presently no release in

Re: Using MultipleOutputs with new API (v1.0)

2012-01-25 Thread Harsh J
Oh and btw, do not fear the @deprecated 'Old' API. We have undeprecated it in the recent stable releases, and will continue to support it for a long time. I'd recommend using the older API, as that is more feature complete and test covered in the version you use. On Wed, Jan 25, 2012 at 6:09 PM, H

Re: Connect to HDFS running on a different Hadoop-Version

2012-01-25 Thread Rajiv Chittajallu
Did you try using hftp:// instead of hdfs://? This would work across different RPC versions as long as the code base is not from significantly different branches. The EOFException might also be related to an RPC version mismatch. If the release of Hadoop is based off the 0.20.2xx (Hadoop with securit
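As an illustration of the hftp:// suggestion above, a cross-version read or copy typically goes through the read-only HFTP interface on the source cluster. Hostnames and paths here are hypothetical, and 50070 is only the customary NameNode HTTP port; adjust to your clusters.

```
# Read from the remote cluster over HFTP (HTTP-based, more version-tolerant than RPC)
hadoop fs -ls hftp://remote-namenode:50070/user/romeo

# Or copy between clusters with distcp, typically run from the destination side
hadoop distcp hftp://remote-namenode:50070/data hdfs://local-namenode:8020/data
```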

Re: Understanding fair schedulers

2012-01-25 Thread Srinivas Surasani
Praveenesh, You can try specifying "mapred.fairscheduler.pool" as your pool name while running the job. By default, mapred.fairscheduler.poolnameproperty is set to user.name (each job a user runs is allocated to his named pool), and you can also change this property to group.name. Srinivas -- Also,
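For example, the per-job pool assignment Srinivas describes can be passed on the command line. The jar, class, and pool names below are hypothetical, and the -D form only works when the job's driver uses ToolRunner/GenericOptionsParser:

```
hadoop jar wordcount.jar WordCount -Dmapred.fairscheduler.pool=Hadoop-users in out
```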

Re: Using MultipleOutputs with new API (v1.0)

2012-01-25 Thread Ondřej Klimpera
I'm using 1.0.0 beta; I suppose it was a wrong decision to use a beta version. So do you recommend using 0.20.203.X and sticking to the Hadoop: The Definitive Guide approaches? Thanks for your reply. On 01/25/2012 01:41 PM, Harsh J wrote: Oh and btw, do not fear the @deprecated 'Old' API. We have undeprecated i

Re: Using MultipleOutputs with new API (v1.0)

2012-01-25 Thread Harsh J
I recommend sticking to the older APIs for the 1.x release line (know that 1.x is a micro revision over, and a rename of, the 0.20.20x branches [0]). Do not worry about the @deprecated markers; these APIs are still fully available and supported up to 0.23 and beyond, and should give no upgrade worrie
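Following this advice, the old-API MultipleOutputs (org.apache.hadoop.mapred.lib) is wired up roughly as below. This is an unverified sketch, not a complete job: the named output "summary", the key/value types, and the reducer class are all illustrative, and the driver and reducer fragments are shown together for brevity.

```java
// Driver side: declare an extra named output on the JobConf
JobConf conf = new JobConf(MyJob.class);
MultipleOutputs.addNamedOutput(conf, "summary",
    TextOutputFormat.class, Text.class, LongWritable.class);

// Reducer side: open the helper in configure(), close it in close()
public class MyReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  private MultipleOutputs mos;

  public void configure(JobConf job) { mos = new MultipleOutputs(job); }

  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    // Regular records go to 'out'; extra records go to the named output
    mos.getCollector("summary", reporter).collect(key, new LongWritable(1));
  }

  public void close() throws IOException { mos.close(); }
}
```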

Re: Using MultipleOutputs with new API (v1.0)

2012-01-25 Thread Ondřej Klimpera
One more question. I just downloaded Hadoop 0.20.203.0, considered to be the last stable release. What about the JobConf vs. Configuration classes? What should I use to avoid wrong approaches, because JobConf seems to be deprecated? Sorry for bothering you with these questions. I'm just not used to having

Re: Using MultipleOutputs with new API (v1.0)

2012-01-25 Thread Harsh J
Hi, Just set your code to ignore the deprecation warnings for JobConf/etc. - it causes no harm to use it. On Wed, Jan 25, 2012 at 6:32 PM, Ondřej Klimpera wrote: > One more question. Just downloaded Hadoop 0.20.203.0 considered to be last > stable release. What about JobConf vs. Confirguration c

Re: Connect to HDFS running on a different Hadoop-Version

2012-01-25 Thread Michael Segel
Alex, I said I would be nice and hold my tongue when it comes to IBM and their IM pillar products... :-) You could write a client that talks to two different Hadoop versions, but then you would be using hftp, which is what you have under the hood in distcp... But that doesn't seem to be what he

Re: Understanding fair schedulers

2012-01-25 Thread praveenesh kumar
I am running Pig jobs; how can I specify on which pool they should run? Also, do you mean the pool allocation is done job-wise, not user-wise? On Wed, Jan 25, 2012 at 6:14 PM, Srinivas Surasani wrote: > Praveenesh, > > You can try specifying "mapred.fairscheduler.pool" to your pool name while

Re: Reduce > copy at 0.00 MB/s

2012-01-25 Thread hadoop hive
This problem arose after adding a node, so I started the balancer to balance the cluster. On Wed, Jan 25, 2012 at 4:38 PM, praveenesh kumar wrote: > @hadoophive > > Can you explain more by "balance the cluster" ? > > Thanks, > Praveenesh > > On Wed, Jan 25, 2012 at 4:29 PM, hadoop hive wrote: > > >

Re: Reduce > copy at 0.00 MB/s

2012-01-25 Thread Harsh J
The copy phase fetches the map outputs. It may hang for a while if there are no newly completed map outputs to fetch yet. You can raise your reducers' slowstart value to have them not spend so many cycles waiting, but rather start at 80-90% of map completion instead of the default 5%. This helps your M
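The slowstart knob Harsh describes is set per job or cluster-wide in mapred-site.xml. A sketch, with 0.80 mirroring the 80% suggestion above (tune to your workload):

```xml
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
  <!-- Fraction of maps that must complete before reducers are scheduled;
       the default in this era of Hadoop was 0.05 -->
</property>
```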

Re: Understanding fair schedulers

2012-01-25 Thread Harsh J
Set the property in Pig with the 'set' command or other ways: http://pig.apache.org/docs/r0.9.1/cmds.html#set or http://pig.apache.org/docs/r0.9.1/start.html#properties As Srinivas covered earlier, pool allocation can be done per user if you set the scheduler poolnameproperty to "user.name". Per g
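In a Pig script, that could look like the following (the pool name is hypothetical and must match a pool in your allocations file):

```pig
-- route the MapReduce jobs launched by this script to one pool
set mapred.fairscheduler.pool 'Hadoop-users';
```

Alternatively, pass it on the command line when launching the script, e.g. `pig -Dmapred.fairscheduler.pool=Hadoop-users script.pig`.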

Re: Reduce > copy at 0.00 MB/s

2012-01-25 Thread praveenesh kumar
Yeah, I am doing it; currently it's at 20%. I guess I have to raise it more. The funny thing is, it's still happening after map is 100% completed. When map is completed, it should not wait, right? But I see it still gives the same message for some time. Thanks, Praveenesh On Wed, Jan 25, 2012 at 7:29 PM,

Re: Understanding fair schedulers

2012-01-25 Thread praveenesh kumar
I am looking for a solution where we can do it permanently, without specifying these things inside jobs. I want to keep these things hidden from the end user. The end user would just write Pig scripts, and all the jobs submitted by a particular user would get submitted to their respective pools automaticall

Re: Understanding fair schedulers

2012-01-25 Thread praveenesh kumar
Also, with the above-mentioned method, my problem is that I am having one pool per user (that's obviously not a good way of configuring schedulers). How can I allocate multiple users to one pool in the XML properties, so that I don't have to care about giving any options inside my code? Thanks, Praveenesh On Wed

Re: Understanding fair schedulers

2012-01-25 Thread Harsh J
A solution would be to place your users into groups, and use group.name identifier to be the poolnameproperty. Would this work for you instead? On Wed, Jan 25, 2012 at 8:00 PM, praveenesh kumar wrote: > Also, with the above mentioned method, my problem is I am having one > pool/user (thats obvio
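Harsh's suggestion amounts to two pieces of configuration (the group name "ABC" follows the thread's example; the minMaps value is a placeholder): the scheduler derives the pool name from the submitting user's Unix group, and the allocations file defines one pool per group. A hedged sketch:

```xml
<!-- mapred-site.xml: derive the pool from the user's group
     instead of the default user.name -->
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>group.name</value>
</property>

<!-- fair-scheduler.xml: a pool whose name matches the Unix group "ABC" -->
<pool name="ABC">
  <minMaps>10</minMaps>
</pool>
```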

Re: Connect to HDFS running on a different Hadoop-Version

2012-01-25 Thread Romeo Kienzler
Dear all, first of all, the reason for this is that we have a lot of data in Cloudera but want to test BigSheets (from BigInsights) and Datameer using the same HDFS source (instead of reimporting). Thanks a lot for your suggestion. I finally got it working; here are the steps I have done:

Re: Understanding fair schedulers

2012-01-25 Thread praveenesh kumar
Then in that case, will I be using a group name tag in the allocations file, like this inside each pool? <group name="ABC"> 6 Thanks, Praveenesh On Wed, Jan 25, 2012 at 8:08 PM, Harsh J wrote: > A solution would be to place your users into groups, and use > group.name identifier to be the

Re: Understanding fair schedulers

2012-01-25 Thread Harsh J
Not exactly. See, with the poolnameproperty being group.name, the scheduler will map the group name to a pool name. So you need only use <pool name="ABC"> for configuring a group "ABC". Does that make sense? On Wed, Jan 25, 2012 at 8:49 PM, praveenesh kumar wrote: > Then in that case, will I be using group name tag in allocations

Re: When to use a combiner?

2012-01-25 Thread Raj V
Touché! Raj > From: Robert Evans > To: "common-user@hadoop.apache.org"; Raj V > Sent: Wednesday, January 25, 2012 7:36 AM > Subject: Re: When to use a combiner? > > You can use a combiner for average. You just have to write

Re: Understanding fair schedulers

2012-01-25 Thread praveenesh kumar
Okay, got it: same pool name as group name. On Wed, Jan 25, 2012 at 8:51 PM, Harsh J wrote: > Not exactly. See, the poolnameproperty being group.name will map the > group name as a pool name. So you need to only use > for configuring a group "ABC". Does that make sense? > > On Wed, Jan 25,

Re: When to use a combiner?

2012-01-25 Thread Robert Evans
You can use a combiner for average. You just have to write a separate combiner from your reducer.

class MyCombiner {
  // The value is a sum/count pair
  void reduce(Key key, Iterable<Pair> values, Context context) {
    long sum = 0;
    long count = 0;
    for (Pair value : values) {
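Robert's point is that averaging becomes combinable once you carry (sum, count) pairs instead of averages: partial pairs from any grouping of the data merge into the same final result. His Hadoop snippet is cut off by the archive, but the underlying arithmetic can be shown in plain Java (the class and method names here are mine, not from the thread):

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates why (sum, count) pairs make averaging safe to combine:
// combine() plays the combiner, reduce() plays the reducer.
public class AverageCombinerDemo {

    // Combiner step: collapse one slice of values into a (sum, count) pair.
    static long[] combine(long[] values) {
        long sum = 0;
        for (long v : values) sum += v;
        return new long[] { sum, values.length };
    }

    // Reducer step: merge the partial pairs and emit the final average.
    static double reduce(List<long[]> partials) {
        long sum = 0, count = 0;
        for (long[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        List<long[]> partials = new ArrayList<>();
        partials.add(combine(new long[] { 1, 2, 3 }));    // one map's output
        partials.add(combine(new long[] { 4, 5, 6, 7 })); // another map's output
        System.out.println(reduce(partials)); // average of 1..7
    }
}
```

Note that combining the averages themselves (2.0 and 5.5) would weight both slices equally and give the wrong answer; the counts are what keep the merge correct.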

hadoop 1.0.0 installation question

2012-01-25 Thread Naveen M
Hi, I'm trying to set up Hadoop 1.0.0 on my MacBook and notice a bunch of setup files in the 'sbin' directory: -rwxr-xr-x@ 1 root wheel 3392 16 Dec 03:39 hadoop-create-user.sh -rwxr-xr-x@ 1 root wheel 3636 16 Dec 03:39 hadoop-setup-applications.sh -rwxr-xr-x@ 1 root wheel 26777 16 Dec 03:39 ha

Re: hadoop 1.0.0 installation question

2012-01-25 Thread Harsh J
These scripts aren't for tarball installs. They are for package installs, which do not apply to Mac OS X. I haven't a clue what they're even doing in the release tarball. You should file a JIRA issue to have them removed. You just need to follow: http://hadoop.apache.org/common/docs/current/single_