Re: log
This basically happens while running a mapreduce job. When a map reduce job is triggered the job files are put in hdfs with high replication ( replication is controlled by - 'mapred.submit.replication' default value is 10). The job files are cleaned up after the job is completed and hence that could be the reason you are seeing the hdfs file system status as healthy after running the job. On Fri, Apr 19, 2013 at 1:04 PM, Mohit Vadhera project.linux.p...@gmail.com wrote: its one (1). Output is below. ...Status: HEALTHY Total size:903709673179 B Total dirs:2906 Total files: 0 Total blocks (validated): 20906 (avg. block size 43227287 B) Minimally replicated blocks: 20906 (100.0 %) Over-replicated blocks:0 (0.0 %) Under-replicated blocks: 248 (1.1862624 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor:1 Average block replication: 1.0 Corrupt blocks:0 Missing replicas: 2232 (9.646469 %) Number of data-nodes: 1 Number of racks: 1 FSCK ended at Fri Apr 19 03:47:04 EDT 2013 in 2224 milliseconds The filesystem under path '/' is HEALTHY On Fri, Apr 19, 2013 at 12:28 PM, S, Manoj mano...@intel.com wrote: It means that some of your data blocks are not replicated as intended. What is the value of “dfs.replication” in your hadoop-site.xml file? ** ** Can you paste the output of ** ** *bin/hadoop fsck / ** ** -- Manoj ** ** *From:* Mohit Vadhera [mailto:project.linux.p...@gmail.com] *Sent:* Friday, April 19, 2013 12:09 PM *To:* user@hadoop.apache.org *Subject:* log ** ** Can anybody let me know the meaning of the below log plz Target Replicas is 10 but found 1 replica(s). ? /var/lib/hadoop-hdfs/cache/mapred/mapred/staging/test_user/.staging/job_201302180313_0623/job.split: Under replicated BP-2091347308-172.20.3.119-1356632249303:blk_6297333561560198850_70720. Target Replicas is 10 but found 1 replica(s). Thanks,
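For a small cluster like the single-DataNode one in this thread, the warning can be avoided by lowering the submit replication; a minimal sketch for mapred-site.xml (the value 1 simply matches the single DataNode here and is not a general recommendation):

    <property>
      <name>mapred.submit.replication</name>
      <value>1</value>
    </property>

Alternatively, since the staging files are cleaned up when the job finishes, the under-replication warning is transient and can usually be ignored.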
Re: How to change secondary namenode location in Hadoop 1.0.4?
Hi Henry You can change the secondary name node storage location by overriding the property 'fs.checkpoint.dir' in your core-site.xml On Wed, Apr 17, 2013 at 2:35 PM, Henry Hung ythu...@winbond.com wrote: Hi All, What is the property name of Hadoop 1.0.4 to change secondary namenode location? Currently the default in my machine is “/tmp/hadoop-hadoop/dfs/namesecondary”, I would like to change it to “/data/namesecondary” Best regards, Henry
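For reference, a sketch of the corresponding core-site.xml entry (the path is taken from Henry's mail; the SecondaryNameNode has to be restarted for the change to take effect):

    <property>
      <name>fs.checkpoint.dir</name>
      <value>/data/namesecondary</value>
    </property>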
Re: Adjusting tasktracker heap size?
Hi Marcos, You need to size the slots based on the available memory: Available Memory = Total RAM - (memory for OS + memory for Hadoop daemons like DN and TT + memory for any other services running on that node). Now consider the typical MR jobs planned for your cluster. Say your tasks need 1 GB of JVM each to run gracefully; then Possible number of slots = Available Memory / JVM size of each task. Now divide the slots between mappers and reducers. On Mon, Apr 15, 2013 at 11:38 PM, Amal G Jose amalg...@gmail.com wrote: It depends on the type of job that is frequently submitted and the RAM size of the machine. Heap size of tasktracker = (mapslots + reduceslots) * jvm size We can adjust this according to our requirement to fine-tune our cluster. This is my thought. On Mon, Apr 15, 2013 at 4:40 PM, MARCOS MEDRADO RUBINELLI marc...@buscapecompany.com wrote: Hi, I am currently tuning a cluster, and I haven't found much information on what factors to consider while adjusting the heap size of tasktrackers. Is it a direct multiple of the number of map+reduce slots? Is there anything else I should consider? Thank you, Marcos
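A worked example of the arithmetic above, with purely illustrative numbers (none of them come from the thread):

    Total RAM                            = 48 GB
    OS + DN + TT daemons + other         ~  8 GB
    Available Memory                     ~ 40 GB
    JVM size of each task (e.g. -Xmx1g)  =  1 GB
    Possible number of slots             = 40 / 1 = 40, split e.g. into 26 map + 14 reduce slots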
Re: Submitting mapreduce and nothing happens
Hi Amit Are you seeing any errors or warnings in the JT logs? Regards Bejoy KS
Re: VM reuse!
Hi Rahul If you look at larger cluster and jobs that involve larger input data sets. The data would be spread across the whole cluster, and a single node might have various blocks of that entire data set. Imagine you have a cluster with 100 map slots and your job has 500 map tasks, now in that case there should be multiple map tasks in a single task tracker based on slot availability. Here if you enable jvm reuse, all tasks related to a job on a single TaskTracker would use the same jvm. The benefit here is just the time you are saving in spawning and cleaning up jvm for individual tasks. On Tue, Apr 16, 2013 at 2:04 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi, I have a question related to VM reuse in Hadoop.I now understand the purpose of VM reuse , but I am wondering how is it useful. Example. for VM reuse to be effective or kicked in , we need more than one mapper task to be submitted to a single node (for the same job).Hadoop would consider spawning mappers into nodes which actually contains the data , it might rarely happen that multiple mappers are allocated to a single task tracker. And even if a single task nodes gets to run multiple mappers then it might as well run in parallel in multiple VM rather than sequentially in a single VM. I am sure I am missing some link here , please help me find that. Thanks, Rahul
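JVM reuse is enabled per job; a sketch of the relevant Hadoop 1.x property (settable in mapred-site.xml or on the JobConf), where the default of 1 means no reuse:

    <property>
      <name>mapred.job.reuse.jvm.num.tasks</name>
      <value>-1</value>
    </property>

A value of -1 lets any number of tasks of the same job reuse a JVM on a TaskTracker, while a positive value caps the number of tasks run per JVM.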
Re: HW infrastructure for Hadoop
+1 for Hadoop Operations On Tue, Apr 16, 2013 at 3:57 PM, MARCOS MEDRADO RUBINELLI marc...@buscapecompany.com wrote: Tadas, Hadoop Operations has pretty useful, up-to-date information. The chapter on hardware selection is available here: http://my.safaribooksonline.com/book/databases/hadoop/9781449327279/4dot-planning-a-hadoop-cluster/id2760689 Regards, Marcos Em 16-04-2013 07:13, Tadas Makčinskas escreveu: We are thinking to distribute like 50 node cluster. And trying to figure out what would be a good HW infrastructure (Disks – I/O‘s, RAM, CPUs, network). I cannot actually come around any examples that people ran and found it working well and cost effectively. ** ** If anybody could share their best considered infrastructure. Would be a tremendous help not trying to figure it out on our own. ** ** Regards, Tadas ** ** ** **
Re: VM reuse!
When you process larger data volumes, this is the case mostly. :) Say you have a job with smaller input size and if you have 2 blocks on a single node and then the JT may schedule two tasks on the same TT if there are available free slots. So those tasks can take advantage of JVM reuse. Which TT the JT would assign tasks is totally dependent on data locality and availability of task slots. On Tue, Apr 16, 2013 at 5:03 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Ok, Thanks Bejoy. Only in some typical scenarios it's possible , like the one that you have mentioned. Much more number of mappers and less number of mappers slots. Regards, Rahul On Tue, Apr 16, 2013 at 2:40 PM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Rahul If you look at larger cluster and jobs that involve larger input data sets. The data would be spread across the whole cluster, and a single node might have various blocks of that entire data set. Imagine you have a cluster with 100 map slots and your job has 500 map tasks, now in that case there should be multiple map tasks in a single task tracker based on slot availability. Here if you enable jvm reuse, all tasks related to a job on a single TaskTracker would use the same jvm. The benefit here is just the time you are saving in spawning and cleaning up jvm for individual tasks. On Tue, Apr 16, 2013 at 2:04 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi, I have a question related to VM reuse in Hadoop.I now understand the purpose of VM reuse , but I am wondering how is it useful. Example. for VM reuse to be effective or kicked in , we need more than one mapper task to be submitted to a single node (for the same job).Hadoop would consider spawning mappers into nodes which actually contains the data , it might rarely happen that multiple mappers are allocated to a single task tracker. And even if a single task nodes gets to run multiple mappers then it might as well run in parallel in multiple VM rather than sequentially in a single VM. I am sure I am missing some link here , please help me find that. Thanks, Rahul
Re: guessing number of reducers.
Hi Sasha In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers. If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results. In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: jamal sasha jamalsha...@gmail.com Date: Wed, 21 Nov 2012 11:38:38 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: guessing number of reducers. By default the number of reducers is set to 1.. Is there a good way to guess optimal number of reducers Or let's say i have tbs worth of data... mappers are of order 5000 or so... But ultimately i am calculating , let's say, some average of whole data... say average transaction occurring... Now the output will be just one line in one part... rest of them will be empty.So i am guessing i need loads of reducers but then most of them will be empty but at the same time one reducer won't suffice.. What's the best way to solve this.. How to guess optimal number of reducers.. Thanks
Re: guessing number of reducers.
Hi Manoj If you intend to calculate the number of reducers based on the input size, then in your driver class you should get the size of the input dir in hdfs and say you intended to give n bytes to a reducer then the number of reducers can be computed as Total input size/ bytes per reducer. You can round this value and use it to set the number of reducers in conf programatically. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Manoj Babu manoj...@gmail.com Date: Wed, 21 Nov 2012 23:28:00 To: user@hadoop.apache.org Cc: bejoy.had...@gmail.combejoy.had...@gmail.com Subject: Re: guessing number of reducers. Hi, How to set no of reducers in job conf dynamically? For example some days i am getting 500GB of data on heavy traffic and some days 100GB only. Thanks in advance! Cheers! Manoj. On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy andy.kartas...@mpac.cawrote: Bejoy, I’ve read somethere about keeping number of mapred.reduce.tasks below the reduce task capcity. Here is what I just tested: Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity: 1 Reducer – 22mins 4 Reducers – 11.5mins 8 Reducers – 5mins 10 Reducers – 7mins 12 Reducers – 6:5mins 16 Reducers – 5.5mins 8 Reducers have won the race. But Reducers at the max capacity was very clos. J AK47 *From:* Bejoy KS [mailto:bejoy.had...@gmail.com] *Sent:* Wednesday, November 21, 2012 11:51 AM *To:* user@hadoop.apache.org *Subject:* Re: guessing number of reducers. Hi Sasha In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers. If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results. In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster. Regards Bejoy KS Sent from handheld, please excuse typos. -- *From: *jamal sasha jamalsha...@gmail.com *Date: *Wed, 21 Nov 2012 11:38:38 -0500 *To: *user@hadoop.apache.orguser@hadoop.apache.org *ReplyTo: *user@hadoop.apache.org *Subject: *guessing number of reducers. By default the number of reducers is set to 1.. Is there a good way to guess optimal number of reducers Or let's say i have tbs worth of data... mappers are of order 5000 or so... But ultimately i am calculating , let's say, some average of whole data... say average transaction occurring... Now the output will be just one line in one part... rest of them will be empty.So i am guessing i need loads of reducers but then most of them will be empty but at the same time one reducer won't suffice.. What's the best way to solve this.. How to guess optimal number of reducers.. Thanks NOTICE: This e-mail message and any attachments are confidential, subject to copyright and may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are not the intended recipient, please delete and contact the sender immediately. Please consider the environment before printing this e-mail. AVIS : le présent courriel et toute pièce jointe qui l'accompagne sont confidentiels, protégés par le droit d'auteur et peuvent être couverts par le secret professionnel. Toute utilisation, copie ou divulgation non autorisée est interdite. Si vous n'êtes pas le destinataire prévu de ce courriel, supprimez-le et contactez immédiatement l'expéditeur. 
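A minimal sketch of the driver-side calculation described above (the input path and the 1 GB-per-reducer figure are illustrative assumptions):

    // Derive the reducer count from the input size in HDFS
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long inputBytes = fs.getContentSummary(new Path("/user/data/input")).getLength();
    long bytesPerReducer = 1024L * 1024 * 1024;               // assumed target: ~1 GB per reducer
    int reducers = (int) Math.max(1, inputBytes / bytesPerReducer);
    Job job = new Job(conf, "daily-aggregation");
    job.setNumReduceTasks(reducers);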
Re: fundamental doubt
Hi Jamal It is performed at a frame work level map emits key value pairs and the framework collects and groups all the values corresponding to a key from all the map tasks. Now the reducer takes the input as a key and a collection of values only. The reduce method signature defines it. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: jamal sasha jamalsha...@gmail.com Date: Wed, 21 Nov 2012 14:50:51 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: fundamental doubt Hi.. I guess i am asking alot of fundamental questions but i thank you guys for taking out time to explain my doubts. So i am able to write map reduce jobs but here is my mydoubt As of now i am writing mappers which emit key and a value This key value is then captured at reducer end and then i process the key and value there. Let's say i want to calculate the average... Key1 value1 Key2 value 2 Key 1 value 3 So the output is something like Key1 average of value 1 and value 3 Key2 average 2 = value 2 Right now in reducer i have to create a dictionary with key as original keys and value is a list. Data = defaultdict(list) == // python usrr But i thought that Mapper takes in the key value pairs and outputs key: ( v1,v2)and Reducer takes in this key and list of values and returns Key , new value.. So why is the input of reducer the simple output of mapper and not the list of all the values to a particular key or did i understood something. Am i making any sense ??
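To make the grouping concrete, a sketch of an averaging reducer in the new (org.apache.hadoop.mapreduce) API; the framework hands each key together with an Iterable over all values emitted for it by the mappers (the types here are illustrative):

    public static class AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable v : values) {   // all values for this key, already grouped by the framework
                sum += v.get();
                count++;
            }
            context.write(key, new DoubleWritable(sum / count));
        }
    }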
Re: Supplying a jar for a map-reduce job
Hi Pankaj AFAIK You can do the same. Just provide the properties like mapper class, reducer class, input format, output format etc using -D option at run time. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Pankaj Gupta pan...@brightroll.com Date: Tue, 20 Nov 2012 20:49:29 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Supplying a jar for a map-reduce job Hi, I am running map-reduce jobs on Hadoop 0.23 cluster. Right now I supply the jar to use for running the map-reduce job using the setJarByClass function on org.apache.hadoop.mapreduce.Job. This makes my code depend on a class in the MR job at compile. What I want is to be able to run an MR job without being dependent on it at compile time. It would be great if I could use a jar that contains the Mapper and Reducer classes and just pass it to run the map reduce job. That would make it easy to choose an MR job to run at runtime. Is that possible? Thanks in Advance, Pankaj
Re: Strange error in Hive
Hi Mark I noticed,there is no 'Select' clause seen in 'Insert Overwrite'. I believe your table is using a HiveHbase Storage Handler. Ensure that the required jars are given in hive --auxpath. You'll require the following jars Hive Hbase Handler jar Hbase jar Zookeeper jar Guava jar Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Mark Kerzner mark.kerz...@shmsoft.com Date: Wed, 14 Nov 2012 17:05:20 To: Hadoop Useruser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Strange error in Hive Hi, I am trying to insert a table in hive, and I am getting this strange error. Here is what I do insert overwrite table hivetable struct(lpad(ch, 20, ' '),lpad(start, 10, 0),lpad(strand,10,' '),lpad(ref, 3, ' ')), struct(X,mmm,c_count,t_count,mm) from atable; and here is what I get. Any and all ideas are welcome :) Thank you, Mark java.lang.ClassNotFoundException: org/apache/hadoop/hive/hbase/HBaseSerDe Continuing ... java.lang.ClassNotFoundException: org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat Continuing ... java.lang.ClassNotFoundException: org/apache/hadoop/hive/hbase/HiveHBaseTableOutputFormat Continuing ... java.lang.NullPointerException Continuing ... java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:280) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389) at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:62) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389) at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357) at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357) at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:266) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278) at org.apache.hadoop.mapred.Child.main(Child.java:260)
Re: Setting up a edge node to submit jobs
Hi Manoj For an edge node, you need to include the hadoop jars and configuration files in that box like any other node(Use the same version your cluster has). But no need to start any hadoop daemons. You need to ensure that this node is able to connect with all machines in the cluster. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Manoj Babu manoj...@gmail.com Date: Thu, 15 Nov 2012 10:03:24 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Setting up a edge node to submit jobs Hi, How to setup a edge node for a hadoop cluster to submit jobs? Thanks in advance! Cheers! Manoj.
Re: Map-Reduce V/S Hadoop Ecosystem
Hi Yogesh, The development time in Pig and hive are pretty less compared to its equivalent mapreduce code and for generic cases it is very efficient. If your requirement is that complex and you need very low level control of your code mapreduce is better. If you are an expert in mapreduce your code can be efficient as yours would very specific to your app but the MR in hive and pig may be more generic. To just write your custom mapreduce functions, just basic knowledge on java is good. As you are better with java you can understand the internals better. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: yogesh.kuma...@wipro.com Date: Wed, 7 Nov 2012 15:33:07 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Map-Reduce V/S Hadoop Ecosystem Hello Hadoop Champs, Please give some suggestion.. As Hadoop Ecosystem(Hive, Pig...) internally do Map-Reduce to process. My Question is 1). where Map-Reduce program(written in Java, python etc) are overtaking Hadoop Ecosystem. 2). Limitations of Hadoop Ecosystem comparing with Writing Map-Reduce program. 3) for writing Map-Reduce jobs in java how much we need to have skills in java out of 10 (?/10) Please put some light over it. Thanks Regards Yogesh Kumar The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email. www.wipro.com
Re: Data locality of map-side join
Hi Sigurd Mapside joins are efficiently implemented in Hive and Pig. I'm talking in terms of how mapside joins are implemented in hive. In map side join, the smaller data set is first loaded into DistributedCache. The larger dataset is streamed as usual and the smaller dataset in memory. For every record in larger data set the look up is made in memory on the smaller set and there by joins are done. In later versions of hive the hive framework itself intelligently determines the smaller data set. In older versions you can specify the smaller data set using some hints in query. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Sigurd Spieckermann sigurd.spieckerm...@gmail.com Date: Mon, 22 Oct 2012 22:29:15 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Data locality of map-side join Hi guys, I've been trying to figure out whether a map-side join using the join-package does anything clever regarding data locality with respect to at least one of the partitions to join. To be more specific, if I want to join two datasets and some partition of dataset A is larger than the corresponding partition of dataset B, does Hadoop account for this and try to ensure that the map task is executed on the datanode storing the bigger partition thus reducing data transfer (if the other partition does not happen to be located on that same datanode)? I couldn't conclude the one or the other behavior from the source code and I couldn't find any documentation about this detail. Thanks for clarifying! Sigurd
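A hand-rolled sketch of the same idea outside Hive (the tab-delimited layout, the join on the first column, and the cache file are all assumptions): the small data set is loaded into memory in setup() from the DistributedCache, and every streamed record of the large data set is looked up against it.

    // imports: org.apache.hadoop.mapreduce.Mapper, org.apache.hadoop.io.*,
    //          org.apache.hadoop.fs.Path, org.apache.hadoop.filecache.DistributedCache,
    //          java.io.*, java.util.*
    public static class MapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> smallSet = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException {
            // Small data set was added earlier with DistributedCache.addCacheFile(...)
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");          // assumed layout: key \t value
                smallSet.put(parts[0], parts[1]);
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String[] parts = record.toString().split("\t"); // larger data set streamed as usual
            String match = smallSet.get(parts[0]);
            if (match != null) {                            // inner join on the first column
                context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
            }
        }
    }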
Re: Old vs New API
Hi alberto The new mapreduce API is coming to shape now. The majority of the classes available in old API has been ported to new API as well. The Old mapred API was marked depreciated in an earlier version of hadoop (0.20.x) but later it was un-depreciated as all the functionality in old API was not available in new mapreduce API at that point. Now mapreduce API is pretty good and you can go ahead with that for development. AFAIK mapreduce API is the future. Let's wait for a commiter to officially comment on this. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Alberto Cordioli cordioli.albe...@gmail.com Date: Mon, 22 Oct 2012 15:22:41 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Old vs New API Hi all, I am using last stable Hadoop version (1.0.3) and I am implementing right now my first MR jobs. I read about the presence of 2 API: the old and the new one. I read some stuff about them, but I am not able to find quite fresh news. I read that the old api was deprecated, but in my version they do not seem to. Moreover the new api does not have all the features implemented (see for example the package contrib with its classes to do joins). I found this post on the ML: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201002.mbox/%3ca6906bde1002230730s24d6092av1e57b46bad806...@mail.gmail.com%3E but it is very old (2010) and I think that further changes have been made meanwhile. My question is: does make sense to use the new api, instead of the old one? Does this new version providing other functionalities with respect to the older one? Or, given the slow progress in implementation, is better to use the old api? Thanks.
Re: extracting lzo compressed files
Hi Manoj You can get the file in a readable format using hadoop fs -text fileName Provided you have lzo codec within the property 'io.compression.codecs' in core-site.xml A 'hadoop fs -ls' command would itself display the file size. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Manoj Babu manoj...@gmail.com Date: Sun, 21 Oct 2012 13:10:55 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: extracting lzo compressed files Hi, Is there any option to extract the lzo compressed file in HDFS from command line and any option to find the original size of the compressed file. Thanks in Advance! Cheers! Manoj.
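For reference, a sketch of the codec list in core-site.xml; the two LZO codec class names come from the separately installed hadoop-lzo library, so treat them as an assumption about that setup:

    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
    </property>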
Re: Hadoop counter
Hi Jay Counters are reported at the end of a task to JT. So if a task fails the counters from that task are not send to JT and hence won't be included in the final value of counters from that Job. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Jay Vyas jayunit...@gmail.com Date: Fri, 19 Oct 2012 10:18:42 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: Hadoop counter Ah this answers alot about why some of my dynamic counters never show up and i have to bite my nails waiting to see whats going on until the end of the job- thanks. Another question: what happens if a task fails ? What happen to the counters for it ? Do they dissappear into the ether? Or do they get merged in with the counters from other tasks? On Fri, Oct 19, 2012 at 9:50 AM, Bertrand Dechoux decho...@gmail.comwrote: And by default the number of counters is limited to 120 with the mapreduce.job.counters.limit property. They are useful for displaying short statistics about a job but should not be used for results (imho). I know people may misuse them but I haven't tried so I wouldn't be able to list the caveats. Regards Bertrand On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel michael_se...@hotmail.comwrote: As I understand it... each Task has its own counters and are independently updated. As they report back to the JT, they update the counter(s)' status. The JT then will aggregate them. In terms of performance, Counters take up some memory in the JT so while its OK to use them, if you abuse them, you can run in to issues. As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (Number of TT) and the number of counters. In terms of global accessibility... Maybe. The reason I say maybe is that I'm not sure by what you mean by globally accessible. If a task creates and implements a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous. Not to mention that I don't believe that the counters are aggregated until the job ends. It would make sense that the JT maintains a unique counter for each task until the tasks complete. (If a task fails, it would have to delete the counters so that when the task is restarted the correct count is maintained. ) Note, I haven't looked at the source code so I am probably wrong. HTH Mike On Oct 19, 2012, at 5:50 AM, Lin Ma lin...@gmail.com wrote: Hi guys, I have some quick questions regarding to Hadoop counter, - Hadoop counter (customer defined) is global accessible (for both read and write) for all Mappers and Reducers in a job? - What is the performance and best practices of using Hadoop counters? I am not sure if using Hadoop counters too heavy, there will be performance downgrade to the whole job? regards, Lin -- Bertrand Dechoux -- Jay Vyas http://jayunit100.blogspot.com
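For reference, a minimal sketch of a dynamic counter of the kind Jay mentions (the group and counter names are made up):

    // Inside a map() or reduce() method, new API:
    context.getCounter("MyApp", "BAD_RECORDS").increment(1);

    // In the driver, after job.waitForCompletion(true):
    long bad = job.getCounters().findCounter("MyApp", "BAD_RECORDS").getValue();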
Re: Hadoop installation on mac
Hi Suneel You can get the latest stable versions of Hadoop from the following url http://hadoop.apache.org/releases.html#Download To download, choose a mirror and select the stable version (the ones Harsh suggested) you would like to go for. (The 1.0.x releases are the current stable versions.) Regards Bejoy KS
Re: document on hdfs
Hi Murthy Hadoop - The definitive Guide by Tom White has the details on file write anatomy. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: murthy nvvs murthy_n1...@yahoo.com Date: Wed, 10 Oct 2012 04:27:58 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: document on hdfs Hi All, Iam new to Hadoop, i just want to know the writing of files into datanodes in depth. means the file is divided into blocks again the blocks are divided into packets. i need some detailed doc abt the packets movement by using Datapackets Acknowledge packets. Thanks Regards, Murthy
Re: stable release of hadoop
Hi Nisha The current stable versions are the 1.0.x releases. These are well suited for production environments. The 0.23.x/2.x.x releases are of alpha quality and hence not recommended for production. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: nisha nishakulkarn...@gmail.com Date: Tue, 9 Oct 2012 17:09:52 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: stable release of hadoop 17 September, 2012: Release 0.23.3 is this release a stable one and can it be used in production...
Re: What is the difference between Rack-local map tasks and Data-local map tasks?
Definitely, If data local map tasks are more the performance will be improved much. Ideally if data is uniformly distributed across DNs and if you have enough number of map task slots on colocated TTs then most of your map tasks should be Data Local. You may have just a few non data local map tasks when the number of input splits/map tasks are large which is quite common. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: centerqi hu cente...@gmail.com Date: Sun, 7 Oct 2012 23:28:55 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: What is the difference between Rack-local map tasks and Data-local map tasks? Very good explanation, If there is a way to reduce Rack-local map tasks but can increase the Data-local map tasks , Whether to increase performance? 2012/10/7 Michael Segel michael_se...@hotmail.com Rack local means that while the data isn't local to the node running the task, it is still on the same rack. (Its meaningless unless you've set up rack awareness because all of the machines are on the default rack. ) Data local means that the task is running local to the machine that contains the actual data. HTH -Mike On Oct 7, 2012, at 8:56 AM, centerqi hu cente...@gmail.com wrote: hi all When I run hadoop job -status xxx,Output the following some list. Rack-local map tasks=124 Data-local map tasks=6 What is the difference between Rack-local map tasks and Data-local map tasks? -- cente...@gmail.com|Sam -- cente...@gmail.com|齐忠
Re: Multiple Aggregate functions in map reduce program
Hi It is definitely possible. In your map make the dept name as the output key and salary as the value. In the reducer for every key you can initialize a counter and a sum. Add on to the sum for all values and increment the counter by 1 for each value. Output the dept key and the new aggregated sum and count for each key. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: iwannaplay games funnlearnfork...@gmail.com Date: Fri, 5 Oct 2012 12:32:28 To: useru...@hbase.apache.org; u...@hadoop.apache.org; hdfs-userhdfs-user@hadoop.apache.org Reply-To: u...@hadoop.apache.org Subject: Multiple Aggregate functions in map reduce program Hi All, I have to get the count and sum of data for eg if my table is *employeename salary department* A 1000 testing B 2000 testing C 3000 development D 4000 testing E 1000 development F 5000 management I want result like Department TotalSalary count(employees) testing7000 3 development 4000 2 management 5000 1 Please let me know whether it is possible to write a java map reduce for this.I tried this on hive.It takes time for big data.I heard map reduce java code will b faster.IS it true???Or i should go for pig programming?? Please guide.. Regards Prabhjot
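A sketch of the map and reduce described above (the tab delimiter and the column order employeename, salary, department are assumptions about the input file):

    // imports: org.apache.hadoop.mapreduce.{Mapper, Reducer}, org.apache.hadoop.io.*, java.io.IOException
    public static class DeptMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // input line: employeename <tab> salary <tab> department
            String[] f = value.toString().split("\t");
            context.write(new Text(f[2]), new LongWritable(Long.parseLong(f[1])));
        }
    }

    public static class DeptReducer extends Reducer<Text, LongWritable, Text, Text> {
        @Override
        protected void reduce(Text dept, Iterable<LongWritable> salaries, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            long count = 0;
            for (LongWritable s : salaries) {
                sum += s.get();
                count++;
            }
            context.write(dept, new Text(sum + "\t" + count));   // TotalSalary, count(employees)
        }
    }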
Re: hadoop memory settings
Hi Sadak AFAIK HADOOP_HEAPSIZE determines the jvm size of the daemons like NN,JT,TT,DN etc. mapred.child.java.opts and mapred.child.ulimit is used to set the jvm heap for child jvms launched for each map/reduce task launched. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Visioner Sadak visioner.sa...@gmail.com Date: Fri, 5 Oct 2012 13:47:24 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: hadoop memory settings coz i m getting Error occurred during initialization of VM hadoop java.lang.Throwable: Child Error At org.apache.hadoop.mapred.TaskRunner.run whe running a job.:) On Fri, Oct 5, 2012 at 1:39 PM, Visioner Sadak visioner.sa...@gmail.comwrote: Is ther a relation between HADOOP_HEAPSIZE mapred.child.java.opts and mapred.child.ulimit settings in hadoop-env.sh and mapred-site.xml i have a sinngle machine with 2gb ram and running hadoop on psuedo distr mode my HADOOP_HEAPSIZE is set to 256 wat shud i set mapred.child.java.opts and mapred.child.ulimit and how these settings are calculated if my ram is incresed or machine clusters are increased
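For a 2 GB machine in pseudo-distributed mode, something along these lines in mapred-site.xml keeps the per-task child JVMs small (-Xmx200m is only an illustration, and also happens to be the Hadoop 1.x default):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx200m</value>
    </property>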
Re: copyFromLocal
Hi Sadak If you are issuing copyFromLocal from a client/edge node you can copy the files available in the client's lfs to hdfs in cluster. The client/edge node could be a box that has all the hadoop jars and config files exactly same as that of the cluster and the cluster nodes should be accessible from this client. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Kartashov, Andy andy.kartas...@mpac.ca Date: Thu, 4 Oct 2012 16:51:35 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: RE: copyFromLocal I use -put -get commands to bring files in/our of HDFS from/to my home directory on EC2. Then use WinSCP to download files to my laptop. Andy Kartashov MPAC Architecture RD, Co-op 1340 Pickering Parkway, Pickering, L1V 0C4 * Phone : (905) 837 6269 * Mobile: (416) 722 1787 andy.kartas...@mpac.camailto:andy.kartas...@mpac.ca From: Visioner Sadak [mailto:visioner.sa...@gmail.com] Sent: Thursday, October 04, 2012 11:53 AM To: user@hadoop.apache.org Subject: copyFromLocal guys i have hadoop installled in a remote box ... does copyFromLocal method copies data from tht local box only wht if i have to copy data from uses desktop pc(for example E drive) thru my my web application will i have to first copy data to tht remote box using some java code then use copyFromLocal method to copy in to hadoop
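The same copy can also be done programmatically from the client/edge node through the FileSystem API; a sketch with made-up paths (the client's core-site.xml must point at the cluster's NameNode):

    Configuration conf = new Configuration();                  // picks up the cluster configs on the edge node
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/home/user/data.log"),      // local file on the client
                         new Path("/user/hadoop/data.log"));   // destination in HDFS
    fs.close();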
Re: How to lower the total number of map tasks
Hi You need to alter the value of mapred.max.split size to a value larger than your block size to have less number of map tasks than the default. On Tue, Oct 2, 2012 at 10:04 PM, Shing Hing Man mat...@yahoo.com wrote: I am running Hadoop 1.0.3 in Pseudo distributed mode. When I submit a map/reduce job to process a file of size about 16 GB, in job.xml, I have the following mapred.map.tasks =242 mapred.min.split.size =0 dfs.block.size = 67108864 I would like to reduce mapred.map.tasks to see if it improves performance. I have tried doubling the size of dfs.block.size. But themapred.map.tasks remains unchanged. Is there a way to reduce mapred.map.tasks ? Thanks in advance for any assistance ! Shing
Re: How to lower the total number of map tasks
Sorry for the typo, the property name is mapred.max.split.size Also just for changing the number of map tasks you don't need to modify the hdfs block size. On Tue, Oct 2, 2012 at 10:31 PM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi You need to alter the value of mapred.max.split size to a value larger than your block size to have less number of map tasks than the default. On Tue, Oct 2, 2012 at 10:04 PM, Shing Hing Man mat...@yahoo.com wrote: I am running Hadoop 1.0.3 in Pseudo distributed mode. When I submit a map/reduce job to process a file of size about 16 GB, in job.xml, I have the following mapred.map.tasks =242 mapred.min.split.size =0 dfs.block.size = 67108864 I would like to reduce mapred.map.tasks to see if it improves performance. I have tried doubling the size of dfs.block.size. But themapred.map.tasks remains unchanged. Is there a way to reduce mapred.map.tasks ? Thanks in advance for any assistance ! Shing
Re: How to lower the total number of map tasks
Hi Shing Is your input a single file or set of small files? If latter you need to use CombineFileInputFormat. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Shing Hing Man mat...@yahoo.com Date: Tue, 2 Oct 2012 10:38:59 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: How to lower the total number of map tasks I have tried Configuration.setInt(mapred.max.split.size,134217728); and setting mapred.max.split.size in mapred-site.xml. ( dfs.block.size is left unchanged at 67108864). But in the job.xml, I am still getting mapred.map.tasks =242 . Shing From: Bejoy Ks bejoy.had...@gmail.com To: user@hadoop.apache.org; Shing Hing Man mat...@yahoo.com Sent: Tuesday, October 2, 2012 6:03 PM Subject: Re: How to lower the total number of map tasks Sorry for the typo, the property name is mapred.max.split.size Also just for changing the number of map tasks you don't need to modify the hdfs block size. On Tue, Oct 2, 2012 at 10:31 PM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi You need to alter the value of mapred.max.split size to a value larger than your block size to have less number of map tasks than the default. On Tue, Oct 2, 2012 at 10:04 PM, Shing Hing Man mat...@yahoo.com wrote: I am running Hadoop 1.0.3 in Pseudo distributed mode. When I submit a map/reduce job to process a file of size about 16 GB, in job.xml, I have the following mapred.map.tasks =242 mapred.min.split.size =0 dfs.block.size = 67108864 I would like to reduce mapred.map.tasks to see if it improves performance. I have tried doubling the size of dfs.block.size. But themapred.map.tasks remains unchanged. Is there a way to reduce mapred.map.tasks ? Thanks in advance for any assistance ! Shing
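One likely reason the change did not take effect: with FileInputFormat the split size is computed as max(minSplitSize, min(maxSplitSize, blockSize)), so raising only the maximum leaves splits at the block size; to get splits larger than the 64 MB block it is the minimum split size that has to be raised. A sketch of the driver-side setting, offered as a hedged reading of the 1.0.3 behaviour rather than a definitive fix:

    Configuration conf = new Configuration();
    // splitSize = max(minSize, min(maxSize, blockSize)) in FileInputFormat,
    // so a 128 MB split on a 64 MB block needs the *min* split size raised:
    conf.setLong("mapred.min.split.size", 134217728L);
    Job job = new Job(conf, "fewer-maps");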
Re: Add file to distributed cache
Hi Abhishek You can find a simple example of using Distributed Cache here: http://kickstarthadoop.blogspot.co.uk/2011/05/hadoop-for-dependent-data-splits-using.html --Original Message-- From: Abhishek To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: Add file to distributed cache Sent: Oct 2, 2012 05:44 Hi all How do you add a small file to distributed cache in MR program Regards Abhi Sent from my iPhone Regards Bejoy KS Sent from handheld, please excuse typos.
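A minimal sketch of the two sides of the Distributed Cache call (the HDFS path is made up):

    // Driver, before submitting the job:
    DistributedCache.addCacheFile(new URI("/user/abhi/lookup.txt"), job.getConfiguration());

    // Mapper/Reducer setup(): the file is now available locally on every task node
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());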
Re: File block size use
Hi Anna If you want to increase the block size of existing files. You can use a Identity Mapper with no reducer. Set the min and max split sizes to your requirement (512Mb). Use SequenceFileInputFormat and SequenceFileOutputFormat for your job. Your job should be done. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Chris Nauroth cnaur...@hortonworks.com Date: Mon, 1 Oct 2012 21:12:58 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: File block size use Hello Anna, If I understand correctly, you have a set of multiple sequence files, each much smaller than the desired block size, and you want to concatenate them into a set of fewer files, each one more closely aligned to your desired block size. Presumably, the goal is to improve throughput of map reduce jobs using those files as input by running fewer map tasks, reading a larger number of input records. Whenever I've had this kind of requirement, I've run a custom map reduce job to implement the file consolidation. In my case, I was typically working with TextInputFormat (not sequence files). I used IdentityMapper and a custom reducer that passed through all values but with key set to NullWritable, because the keys (input file offsets in the case of TextInputFormat) were not valuable data. For my input data, this was sufficient to achieve fairly even distribution of data across the reducer tasks, and I could reasonably predict the input data set size, so I could reasonably set the number of reducers and get decent results. (This may or may not be true for your data set though.) A weakness of this approach is that the keys must pass from the map tasks to the reduce tasks, only to get discarded before writing the final output. Also, the distribution of input records to reduce tasks is not truly random, and therefore the reduce output files may be uneven in size. This could be solved by writing NullWritable keys out of the map task instead of the reduce task and writing a custom implementation of Partitioner to distribute them randomly. To expand on this idea, it could be possible to inspect the FileStatus of each input, sum the values of FileStatus.getLen(), and then use that information to make a decision about how many reducers to run (and therefore approximately set a target output file size). I'm not aware of any built-in or external utilities that do this for you though. Hope this helps, --Chris On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud annalah...@gmail.com wrote: I would like to be able to resize a set of inputs, already in SequenceFile format, to be larger. I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not get what I expected. The outputs were exactly the same as the inputs. I also tried running a job with an IdentityMapper and IdentityReducer. Although that approaches a better solution, it still requires that I know in advance how many reducers I need to get better file sizes. I was looking at the SequenceFile.Writer constructors and noticed that there are block size parameters that can be used. Using a writer constructed with a 512MB block size, there is nothing that splits the output and I simply get a single file the size of my inputs. What is the current standard for combining sequence files to create larger files for map-reduce jobs? I have seen code that tracks what it writes into the file, but that seems like the long version. I am hoping there is a shorter path. Thank you. Anna
Re: Programming Question / Joining Dataset
Hi Oliver I have scribbled a small post on reduce-side joins; the implementation matches your requirement: http://kickstarthadoop.blogspot.in/2011/09/joins-with-plain-map-reduce.html Regards Bejoy KS
Re: Unit tests for Map and Reduce functions.
Hi Ravi You can take a look at mockito http://books.google.co.in/books?id=Nff49D7vnJcCpg=PA138lpg=PA138dq=mockito+%2B+hadoopsource=blots=IifyVu7yXpsig=Q1LoxqAKO0nqRquus8jOW5CBiWYhl=ensa=Xei=b2pjULHSOIPJrAeGsIHwAgved=0CC0Q6AEwAg#v=onepageq=mockito%20%2B%20hadoopf=false On Thu, Sep 27, 2012 at 2:09 AM, Kai Voigt k...@123.org wrote: I don't know any other unit testing framework. Kai Am 26.09.2012 um 22:37 schrieb Ravi P hadoo...@outlook.com: Thanks Kai, I am exploring MRunit . Are there any other options/ways to write unit tests for Map and Reduce functions. Would like to evaluate all options. - Ravi -- From: hadoo...@outlook.com To: k...@123.org Subject: RE: Unit tests for Map and Reduce functions. Date: Wed, 26 Sep 2012 13:35:57 -0700 Thanks Kai, Which MRUnit jar I should use for Hadoop 0.20 ? https://repository.apache.org/content/repositories/releases/org/apache/mrunit/mrunit/0.9.0-incubating/ - Ravi -- From: k...@123.org Subject: Re: Unit tests for Map and Reduce functions. Date: Wed, 26 Sep 2012 22:21:06 +0200 To: user@hadoop.apache.org Hello, yes, http://mrunit.apache.org is your reference. MRUnit is a framework on top of JUnit which emulates the mapreduce framework to test your mappers and reducers. Kai Am 26.09.2012 um 22:18 schrieb Ravi P hadoo...@outlook.com: Is it possible to write unit test for mapper Map , and reducer Reduce function ? - Ravi -- Kai Voigt k...@123.org -- Kai Voigt k...@123.org
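For completeness, a sketch of what an MRUnit test of a single mapper looks like with the 0.9.0 new-API drivers (MyMapper and the records are made up):

    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new MyMapper());              // org.apache.mrunit.mapreduce.MapDriver
    driver.withInput(new LongWritable(0), new Text("some input line"))
          .withOutput(new Text("some"), new IntWritable(1))
          .runTest();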
Re: Help on a Simple program
Hi If you don't want either key or value in the output, just make the corresponding data types as NullWritable. Since you just need to filter out a few records/itemd from your logs, reduce phase is not mandatory just a mappper would suffice your needs. From your mapper just output the records that match your criteria. Also set number of reduce tasks to zero in your driver class to completely avoid the reduce phase. A sample code would look like public static class Map extends MapperLongWritable, Text, Text, NullWritable { private final static IntWritable one = new IntWritable(1); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { if(-1 != meetConditions(value)) { context.write(value, NullWritable.*get*()); } } } Om your driver class *job.setNumReduceTasks(0);* * * *Alternatively you can specify this st runtime as* hadoop jar xyz.jar com.*.*.* –D mapred.reduce.tasks=0 input/ output/ On Tue, Sep 25, 2012 at 11:38 PM, Matthieu Labour matth...@actionx.comwrote: Hi I am completely new to Hadoop and I am trying to address the following simple application. I apologize if this sounds trivial. I have multiple log files I need to read the log files and collect the entries that meet some conditions and write them back to files for further processing. ( On other words, I need to filter out some events) I am using the WordCount example to get going. public static class Map extends MapperLongWritable, Text, Text, IntWritable { private final static IntWritable one = new IntWritable(1); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { if(-1 != meetConditions(value)) { context.write(value, one); } } } public static class Reduce extends ReducerText, IntWritable, Text, IntWritable { public void reduce(Text key, IterableIntWritable values, Context context) throws IOException, InterruptedException { context.write(key, new IntWritable(1)); } } The problem is that it prints the value 1 after each entry. Hence my question. What is the best trivial implementation of the map and reduce function to address the use case above ? Thank you greatly for your help
Re: Detect when file is not being written by another process
Hi Peter AFAIK oozie has a mechanism to achieve this. You can trigger your jobs as soon as the files are written to a certain hdfs directory. On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan psheri...@millennialmedia.com wrote: These are log files being deposited by other processes, which we may not have control over. We don't want multiple processes to write to the same files — we just don't want to start our jobs until they have been completely written. Sorry for lack of clarity thanks for the response. --Pete From: Bertrand Dechoux decho...@gmail.com Reply-To: user@hadoop.apache.org user@hadoop.apache.org Date: Tuesday, September 25, 2012 12:33 PM To: user@hadoop.apache.org user@hadoop.apache.org Subject: Re: Detect when file is not being written by another process Hi, Multiple files and aggregation or something like hbase? Could you tell use more about your context? What are the volumes? Why do you want multiple processes to write to the same file? Regards Bertrand On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan psheri...@millennialmedia.com wrote: Hi all. We're using Hadoop 1.0.3. We need to pick up a set of large (4+GB) files when they've finished being written to HDFS by a different process. There doesn't appear to be an API specifically for this. We had discovered through experimentation that the FileSystem.append() method can be used for this purpose — it will fail if another process is writing to the file. However: when running this on a multi-node cluster, using that API actually corrupts the file. Perhaps this is a known issue? Looking at the bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a bunch of similar-sounding things. What's the right way to solve this problem? Thanks. --Pete -- Bertrand Dechoux
Re: Job failed with large volume of small data: java.io.EOFException
Hi Jason Are you seeing any errors in your data node logs? Specifically something like 'xceivers count exceeded'. In that case you may need to bump up the value of dfs.datanode.max.xcievers to a higher value. If not, it is possible that you are crossing the upper limit of open files on the Linux boxes that run the DNs. You can verify the current value using 'ulimit -n' and then try increasing it to a much higher value. Regards Bejoy KS
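The property lives in hdfs-site.xml on the DataNodes and needs a DataNode restart; 4096 is shown purely as an illustrative value:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>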
Re: How to make the hive external table read from subdirectories
Hi Nataraj Once you have created a partitioned table you need to add the partitions, only then the data in sub dirs will be visible to hive. After creating the table you need to execute a command like below ALTER TABLE some_table ADD PARTITION (year='2012', month='09', dayofmonth='11') LOCATION '/user/myuser/MapReduceOutput/2012/09/11'; Like this you need to register each of the paritions. After this your query should work as desired. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Nataraj Rashmi - rnatar rashmi.nata...@acxiom.com Date: Thu, 13 Sep 2012 03:04:52 To: user@hadoop.apache.orguser@hadoop.apache.org; bejoy.had...@gmail.combejoy.had...@gmail.com Subject: RE: How to make the hive external table read from subdirectories Thanks for your response. Can someone see if this is ok? I am not getting any records when I query the hive table when I use Partitions. This is how I am creating the table. CREATE EXTERNAL TABLE Data (field1 STRING,field2) PARTITIONED BY(year STRING, month STRING, dayofmonth STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/myuser/MapReduceOutput'; Data dir looks like this. /user/myuser/MapReduceOutput/2012/09/11 When I create the table using '/user/myuser/MapReduceOutput/2012/09/11' as the location, I can query the table and get data back. Please advice, Thanks. From: Bejoy KS [mailto:bejoy.had...@gmail.com] Sent: Wednesday, September 12, 2012 3:09 PM To: user@hadoop.apache.org Subject: Re: How to make the hive external table read from subdirectories Hi Natraj Create a partitioned table and add the sub dirs as partitions. You need to have some logic in place for determining the partitions. Say if the sub dirs denote data based on a date then make date as the partition. Regards Bejoy KS Sent from handheld, please excuse typos. From: Nataraj Rashmi - rnatar rashmi.nata...@acxiom.com Date: Wed, 12 Sep 2012 19:19:19 + To: user@hadoop.apache.orguser@hadoop.apache.org ReplyTo: user@hadoop.apache.org Subject: How to make the hive external table read from subdirectories I have a hive external table created from a hdfs location. How do I make it read the data from all the subdirectories also? Thanks. *** The information contained in this communication is confidential, is intended only for the use of the recipient named above, and may be legally privileged. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please resend this communication to the sender and delete the original message or any copy of it from your computer system. Thank You.
Re: How to make the hive external table read from subdirectories
Hi Natraj Create a partitioned table and add the sub dirs as partitions. You need to have some logic in place for determining the partitions. Say if the sub dirs denote data based on a date then make date as the partition. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Nataraj Rashmi - rnatar rashmi.nata...@acxiom.com Date: Wed, 12 Sep 2012 19:19:19 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: How to make the hive external table read from subdirectories I have a hive external table created from a hdfs location. How do I make it read the data from all the subdirectories also? Thanks.
Re: what's the default reducer number?
Hi Lin The default value for number of reducers is 1 <name>mapred.reduce.tasks</name> <value>1</value> It is not determined by data volume. You need to specify the number of reducers for your mapreduce jobs as per your data volume. Regards Bejoy KS On Tue, Sep 11, 2012 at 4:53 PM, Jason Yang lin.yang.ja...@gmail.com wrote: Hi, all I was wondering what's the default number of reducer if I don't set it in configuration? Will it change dynamically according to the output volume of Mapper? -- YANG, Lin
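Since the value is not derived from the data volume, it has to be set per job. A rough sketch of the two usual ways with the new API (the count of 20, the jar and the driver class are only placeholders):

    // in the driver, on the org.apache.hadoop.mapreduce.Job instance
    job.setNumReduceTasks(20);

or, for a driver that goes through ToolRunner/GenericOptionsParser, on the command line:

    hadoop jar myjob.jar com.example.MyDriver -D mapred.reduce.tasks=20 input_dir output_dir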
Re: what's the default reducer number?
Hi Lin The default values for all the properties are in core-default.xml, hdfs-default.xml and mapred-default.xml Regards Bejoy KS On Tue, Sep 11, 2012 at 5:06 PM, Jason Yang lin.yang.ja...@gmail.com wrote: Hi, Bejoy Thanks for your reply. Where could I find the default value of mapred.reduce.tasks? I have checked the core-site.xml, hdfs-site.xml and mapred-site.xml, but I haven't found it. 2012/9/11 Bejoy Ks bejoy.had...@gmail.com Hi Lin The default value for number of reducers is 1 <name>mapred.reduce.tasks</name> <value>1</value> It is not determined by data volume. You need to specify the number of reducers for your mapreduce jobs as per your data volume. Regards Bejoy KS On Tue, Sep 11, 2012 at 4:53 PM, Jason Yang lin.yang.ja...@gmail.com wrote: Hi, all I was wondering what's the default number of reducer if I don't set it in configuration? Will it change dynamically according to the output volume of Mapper? -- YANG, Lin -- YANG, Lin
Re: Some general questions about DBInputFormat
Hi Yaron Sqoop uses a similar implementation. You can get some details there. Replies inline • (more general question) Are there many use-cases for using DBInputFormat? Do most Hadoop jobs take their input from files or DBs? From my small experience Most MR jobs have data in hdfs. It is useful for getting data out of rdbms to hadoop, sqoop implemenation is an example. • Since all mappers open a connection to the same DBS, one cannot use hundreds of mapper. Is there a solution to this problem? Num of mappers shouldn't be more than the permissible number of connections allowed for that db. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Yaron Gonen yaron.go...@gmail.com Date: Tue, 11 Sep 2012 15:41:26 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Some general questions about DBInputFormat Hi, After reviewing the class's (not very complicated) code, I have some questions I hope someone can answer: - (more general question) Are there many use-cases for using DBInputFormat? Do most Hadoop jobs take their input from files or DBs? - What happens when the database is updated during mappers' data retrieval phase? is there a way to lock the database before the data retrieval phase and release it afterwords? - Since all mappers open a connection to the same DBS, one cannot use hundreds of mapper. Is there a solution to this problem? Thanks, Yaron
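As a reference point, a rough, untested sketch of how DBInputFormat is usually wired up in a driver's run() method with the new API; the driver class, record class, table, columns and JDBC settings below are all made up, and the map-task count is kept low so the number of DB connections stays within the database's limit:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

    Configuration conf = new Configuration();
    // JDBC driver, URL and credentials are placeholders
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "dbuser", "dbpass");

    Job job = new Job(conf, "db import");
    job.setJarByClass(MyDriver.class);
    // OrderRecord is a hypothetical class implementing Writable and DBWritable
    DBInputFormat.setInput(job, OrderRecord.class, "orders",
        null /* conditions */, "order_id" /* orderBy */, "order_id", "amount");
    // few map tasks => few simultaneous connections to the database
    job.getConfiguration().setInt("mapred.map.tasks", 4);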
Re: How to remove datanode from cluster..
Hi Yogesh The detailed steps are available in the hadoop wiki on the FAQ page http://wiki.apache.org/hadoop/FAQ#I_want_to_make_a_large_cluster_smaller_by_taking_out_a_bunch_of_nodes_simultaneously._How_can_this_be_done.3F Regards Bejoy KS On Wed, Sep 12, 2012 at 12:14 AM, yogesh dhari yogeshdh...@live.com wrote: Hello all, I am not getting a clear way to remove a datanode from the cluster. Please explain the decommissioning steps with an example, like how to create the exclude files and the other steps involved in it. Thanks regards Yogesh Kumar
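In short, the HDFS side of those steps looks roughly like the sketch below (the exclude file path and hostname are examples, and dfs.hosts.exclude must already have been set when the NameNode started):

    <!-- hdfs-site.xml on the NameNode -->
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/etc/hadoop/conf/dfs.exclude</value>
    </property>

    # add the node to the exclude file, then ask the NameNode to re-read it
    echo "datanode05.example.com" >> /etc/hadoop/conf/dfs.exclude
    hadoop dfsadmin -refreshNodes
    # wait until 'hadoop dfsadmin -report' shows the node as Decommissioned before stopping it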
Re: Reg: parsing all files file append
Hi Manoj From my limited knowledge on file appends in hdfs , i have seen more recommendations to use sync() in the latest releases than using append(). Let us wait for some commiter to authoritatively comment on 'the production readiness of append()' . :) Regards Bejoy KS On Mon, Sep 10, 2012 at 11:03 AM, Manoj Babu manoj...@gmail.com wrote: Thank you Bejoy. Does file append is production stable? Cheers! Manoj. On Sun, Sep 9, 2012 at 10:19 PM, Bejoy KS bejoy.had...@gmail.com wrote: ** Hi Manoj You can load daily logs into a individual directories in hdfs and process them daily. Keep those results in hdfs or hbase or dbs etc. Every day do the processing, get the results and aggregate the same with the previously aggregated results till date. Regards Bejoy KS Sent from handheld, please excuse typos. -- *From: * Manoj Babu manoj...@gmail.com *Date: *Sun, 9 Sep 2012 21:28:54 +0530 *To: *mapreduce-user@hadoop.apache.org *ReplyTo: * mapreduce-user@hadoop.apache.org *Subject: *Reg: parsing all files file append Hi All, I have two questions, providing info on it will be helpful. 1, I am using hadoop to analyze and to find top n search term metric's from logs. If any new log file is added to HDFS then again we are running the job to find the metrics. Daily we will be getting log files and we are parsing the whole file and getting the metric's. All the log file's are parsed daily to get the latest metric's is there any way is there any way to avoid this? 2, Does file append is production stable? Cheers! Manoj.
Re: Reg: parsing all files file append
Hi Manoj You can load daily logs into a individual directories in hdfs and process them daily. Keep those results in hdfs or hbase or dbs etc. Every day do the processing, get the results and aggregate the same with the previously aggregated results till date. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Manoj Babu manoj...@gmail.com Date: Sun, 9 Sep 2012 21:28:54 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Reg: parsing all files file append Hi All, I have two questions, providing info on it will be helpful. 1, I am using hadoop to analyze and to find top n search term metric's from logs. If any new log file is added to HDFS then again we are running the job to find the metrics. Daily we will be getting log files and we are parsing the whole file and getting the metric's. All the log file's are parsed daily to get the latest metric's is there any way is there any way to avoid this? 2, Does file append is production stable? Cheers! Manoj.
Re: Using hadoop for analytics
Hi Prashant Welcome to Hadoop Community. :) Hadoop is meant for processing large data volumes. Saying that, for your custom requirements you should write your own mapper and reducer that contains your business logic for processing the input data. Also you can have a look at hive and pig, which are tools built on top of map reduce that is highly used for data analysis. Hive supports SQL like queries. If your requirements could be satisfied with Hive or Pig, it is highly recommend to go with those. On Wed, Sep 5, 2012 at 2:12 PM, pgaurav pgauravi...@gmail.com wrote: Hi Guys, I’m 5 days old in hadoop world and trying to analyse this as a long term solution to our client. I could do some rd on Amazon EC2 / EMR: Load the data, text / csv, to S3 Write your mapper / reducer / Jobclient and upload the jar to s3 Start a job flow I tried 2 sample code, word count and csv data process. My question is that to further analyse the data / reporting / search, what should be done? Do I need to implement in Mapper class itself? Do I need to dump the data to the database and then write some custom application? What is the standard way to analysing the data? Thanks Prashant -- View this message in context: http://old.nabble.com/Using-hadoop-for-analytics-tp34391246p34391246.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
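As an illustration of the Hive route, a small sketch over comma-separated files already sitting in HDFS (table name, columns and path are invented):

    CREATE EXTERNAL TABLE page_hits (ts STRING, url STRING, bytes INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/prashant/input/hits';

    -- reporting-style queries are compiled into MapReduce jobs under the covers
    SELECT url, count(*) AS hits
    FROM page_hits
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;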
Re: Replication Factor Modification
Hi You can change the replication factor of an existing directory using '-setrep' http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#setrep The below command will recursively set the replication factor to 1 for all files within the given directory '/user' hadoop fs -setrep -w 1 -R /user On Wed, Sep 5, 2012 at 11:39 PM, Uddipan Mukherjee uddipan_mukher...@infosys.com wrote: Hi, We have a requirement where we have change our Hadoop Cluster's Replication Factor without restarting the Cluster. We are running our Cluster on Amazon EMR. Can you please suggest the way to achieve this? Any pointer to this will be very helpful. Thanks And Regards Uddipan Mukherjee CAUTION - Disclaimer * This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s). If you are not the intended recipient, please notify the sender by e-mail and delete the original message. Further, you are not to copy, disclose, or distribute this e-mail or its contents to any other person and any such actions are unlawful. This e-mail may contain viruses. Infosys has taken every reasonable precaution to minimize this risk, but is not liable for any damage you may sustain as a result of any virus in this e-mail. You should carry out your own virus checks before opening the e-mail or attachment. Infosys reserves the right to monitor and review the content of all messages sent to or from this e-mail address. Messages sent to or from this e-mail address may be stored on the Infosys e-mail system. ***INFOSYS End of Disclaimer INFOSYS***
Re: Replication Factor Modification
Hi Uddipan As Harsh mentioned, replication factor is a client side property . So you need to update the value for 'dfs.replication' in hdfs-site.xml as per your requirement in your edge nodes or from the machines your are copying files to hdfs. If you are using some of the existing DN's for this purpose (as client) you need to update the value in there. No need of restarting the services. On Wed, Sep 5, 2012 at 11:54 PM, Uddipan Mukherjee uddipan_mukher...@infosys.com wrote: Hi, ** ** Thanks for the help. But How I will set the replication factor as desired so that when new files comes in it will automatically take the new value of dfs.replication without a cluster restart. Please note we have a 200 nodes cluster. ** ** Thanks and Regards, Uddipan Mukherjee ** ** *From:* Harsh J [mailto:ha...@cloudera.com] *Sent:* Wednesday, September 05, 2012 7:17 PM *To:* user@hadoop.apache.org *Subject:* Re: Replication Factor Modification ** ** Replication factor is per-file, and is a client-side property. So, this is doable. ** ** 1. Change the replication factor of all existing files (or needed ones):** ** ** ** $ hadoop fs -setrep -R value / ** ** 2. Change the dfs.replication parameter in all client configs to the desired value On Wed, Sep 5, 2012 at 11:39 PM, Uddipan Mukherjee uddipan_mukher...@infosys.com wrote: Hi, We have a requirement where we have change our Hadoop Cluster's Replication Factor without restarting the Cluster. We are running our Cluster on Amazon EMR. Can you please suggest the way to achieve this? Any pointer to this will be very helpful. Thanks And Regards Uddipan Mukherjee CAUTION - Disclaimer * This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s). If you are not the intended recipient, please notify the sender by e-mail and delete the original message. Further, you are not to copy, disclose, or distribute this e-mail or its contents to any other person and any such actions are unlawful. This e-mail may contain viruses. Infosys has taken every reasonable precaution to minimize this risk, but is not liable for any damage you may sustain as a result of any virus in this e-mail. You should carry out your own virus checks before opening the e-mail or attachment. Infosys reserves the right to monitor and review the content of all messages sent to or from this e-mail address. Messages sent to or from this e-mail address may be stored on the Infosys e-mail system. ***INFOSYS End of Disclaimer INFOSYS*** ** ** -- Harsh J
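For reference, the client-side setting is just the snippet below in hdfs-site.xml on the machines that write to HDFS (the value 2 is only an example), and it can also be passed per command without touching any file:

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>

    # one-off override from the shell
    hadoop fs -D dfs.replication=2 -put localfile /user/data/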
Re: Integrating hadoop with java UI application deployed on tomcat
Hi You are running tomact on a windows machine and trying to connect to a remote hadoop cluster from there. Your core site has name fs.default.name/name valuehdfs://localhost:9000/value But It is localhost here.( I assume you are not running hadoop on this windows environment for some testing) You need to have the exact configuration files and hadoop jars from the cluster machines on this tomcat environment as well. I mean on the classpath of your application. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Visioner Sadak visioner.sa...@gmail.com Date: Tue, 4 Sep 2012 15:31:25 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: Integrating hadoop with java UI application deployed on tomcat also getting one more error * org.apache.hadoop.ipc.RemoteException*: Server IPC version 5 cannot communicate with client version 4 On Tue, Sep 4, 2012 at 2:44 PM, Visioner Sadak visioner.sa...@gmail.comwrote: Thanks shobha tried adding conf folder to tomcats classpath still getting same error Call to localhost/127.0.0.1:9000 failed on local exception: java.io.IOException: An established connection was aborted by the software in your host machine On Tue, Sep 4, 2012 at 11:18 AM, Mahadevappa, Shobha shobha.mahadeva...@nttdata.com wrote: Hi, Try adding the hadoop/conf directory in the TOMCAT’s classpath ** ** Ex : CLASSPATH=/usr/local/Apps/hbase-0.90.4/conf:/usr/local/Apps/hadoop-0.20.203.0/conf: ** ** ** ** ** ** Regards, *Shobha M * ** ** *From:* Visioner Sadak [mailto:visioner.sa...@gmail.com] *Sent:* 03 September 2012 PM 04:01 *To:* user@hadoop.apache.org *Subject:* Re: Integrating hadoop with java UI application deployed on tomcat ** ** Thanks steve thers nothing in logs and no exceptions as well i found that some file is created in my F:\user with directory name but its not visible inside my hadoop browse filesystem directories i also added the config by using the below method hadoopConf.addResource( F:/hadoop-0.22.0/conf/core-site.xml); when running thru WAR printing out the filesystem i m getting org.apache.hadoop.fs.LocalFileSystem@9cd8db when running an independet jar within hadoop i m getting DFS[DFSClient[clientName=DFSClient_296231340, ugi=dell]] when running an independet jar i m able to do uploads just wanted to know will i have to add something in my classpath of tomcat or is there any other configurations of core-site.xml that i am missing out..thanks for your help. ** ** On Sat, Sep 1, 2012 at 1:38 PM, Steve Loughran ste...@hortonworks.com wrote: ** ** well, it's worked for me in the past outside Hadoop itself: ** ** http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/utils/DfsUtils.java?revision=8882view=markup ** ** 1. Turn logging up to DEBUG 2. Make sure that the filesystem you've just loaded is what you expect, by logging its value. 
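Once the cluster's real config files and matching jars are on the Tomcat classpath, the webapp code should boil down to something like the sketch below (the config paths are examples); printing the FileSystem class is a quick sanity check, since seeing LocalFileSystem means the cluster config is still not being picked up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    // if the cluster's xml files cannot go on the classpath, point at them explicitly
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.getClass().getName()); // expect DistributedFileSystem
    fs.copyFromLocalFile(new Path("E:/test/GANI.jpg"), new Path("/user/TestDir/"));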
It may turn out to be file:///, because the normal Hadoop site-config.xml isn't being picked up ** ** On Fri, Aug 31, 2012 at 1:08 AM, Visioner Sadak visioner.sa...@gmail.com wrote: but the problem is that my code gets executed with the warning but file is not copied to hdfs , actually i m trying to copy a file from local to hdfs Configuration hadoopConf=new Configuration(); //get the default associated file system FileSystem fileSystem=FileSystem.get(hadoopConf); // HarFileSystem harFileSystem= new HarFileSystem(fileSystem); //copy from lfs to hdfs fileSystem.copyFromLocalFile(new Path(E:/test/GANI.jpg),new Path(/user/TestDir/)); ** ** ** ** __ Disclaimer:This email and any attachments are sent in strictest confidence for the sole use of the addressee and may contain legally privileged, confidential, and proprietary data. If you are not the intended recipient, please advise the sender by replying promptly to this email and then delete and destroy this email and any attachments without any further use, copying or forwarding
Re: Exception while running a Hadoop example on a standalone install on Windows 7
Hi Udayani By default hadoop works well for linux and linux based OS. Since you are on Windows you need to install and configure ssh using cygwin before you start hadoop daemons. On Tue, Sep 4, 2012 at 6:16 PM, Udayini Pendyala udayini_pendy...@yahoo.com wrote: Hi, Following is a description of what I am trying to do and the steps I followed. GOAL: a). Install Hadoop 1.0.3 b). Hadoop in a standalone (or local) mode c). OS: Windows 7 STEPS FOLLOWED: 1.1. I followed instructions from: http://www.oreillynet.com/pub/a/other-programming/excerpts/hadoop-tdg/installing-apache-hadoop.html. Listing the steps I did - a. I went to: http://hadoop.apache.org/core/releases.html. b. I installed hadoop-1.0.3 by downloading “hadoop-1.0.3.tar.gz” and unzipping/untarring the file. c. I installed JDK 1.6 and set up JAVA_HOME to point to it. d. I set up HADOOP_INSTALL to point to my Hadoop install location. I updated my PATH variable to have $HADOOP_INSTALL/bin e. After the above steps, I ran the command: “hadoop version” and got the following information: $ hadoop version Hadoop 1.0.3 Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192 Compiled by hortonfo on Tue May 8 20:31:25 UTC 2012 From source with checksum e6b0c1e23dcf76907c5fecb4b832f3be 2. 2. The standalone was very easy to install as described above. Then, I tried to run a sample command as given in: http://hadoop.apache.org/common/docs/r0.17.2/quickstart.html#Local Specifically, the steps followed were: a. cd $HADOOP_INSTALL b. mkdir input c. cp conf/*.xml input d. bin/hadoop jar hadoop-examples-1.0.3.jar grep input output ‘dfs[a-z.]+’ and got the following error: $ bin/hadoop jar hadoop-examples-1.0.3.jar grep input output 'dfs[a-z.]+' 12/09/03 15:41:57 WARN util.NativeCodeLoader: Unable to load native-hadoop libra ry for your platform... 
using builtin-java classes where applicable 12/09/03 15:41:57 ERROR security.UserGroupInformation: PriviledgedActionExceptio n as:upendyal cause:java.io.IOException: Failed to set permissions of path: \tmp \hadoop-upendyal\mapred\staging\upendyal-1075683580\.staging to 0700 java.io http://java.io.IO.IOException: Failed to set permissions of path: \tmp\hadoop-upendyal\map red\staging\upendyal-1075683580\.staging to 0700 at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689) at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSys tem.java:509) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.jav a:344) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:18 9) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmi ssionFiles.java:116) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Unknown Source) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma tion.java:1121) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:8 50) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261) at org.apache.hadoop.examples.Grep.run(Grep.java:69) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.examples.Grep.main(Grep.java:93) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(Progra mDriver.java:68) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) 3.3. I googled the problem and found the following links but none of these suggestions helped. Most people seem to be getting a resolution when they change the version of Hadoop. a. http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201105.mbox/%3cbanlktin-8+z8uybtdmaa4cvxz4jzm14...@mail.gmail.com%3E b. http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/25837 Is this a problem in the version of Hadoop I selected OR am I doing something wrong? I would appreciate any help with this. Thanks Udayini
Re: reading a binary file
Hi Francesco TextInputFormat reads line by line based on '\n' by default, there the key values is the position offset and the line contents respectively. But in your case it is just a sequence of integers and also it is Binary. Also you require the offset for each integer value and not offset by line. I believe you may have to write your own custom Record Reader to get this done. On Mon, Sep 3, 2012 at 8:38 PM, Francesco Silvestri yuri@gmail.comwrote: Hi Mohammad, SequenceFileInputFormathttp://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/input/SequenceFileInputFormat.html requires the file to be a sequence of key/value stored in binary (i.e., the key is stored in the file). In my case, the key is implicitly given by the position of the value within the file. Thank you, Francesco On Mon, Sep 3, 2012 at 5:01 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Francesco, Have a look at SequenceFileInputFormat : http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/input/SequenceFileInputFormat.html Regards, Mohammad Tariq On Mon, Sep 3, 2012 at 8:26 PM, Francesco Silvestri yuri@gmail.comwrote: Hello, I have a binary file of integers and I would like an input format that generates pairs key,value, where value is an integer in the file and key the position of the integer in the file. Which class should I use? (i.e. I'm looking for a kind of TextinputFormat for binary files) Thank you for your consideration, Francesco
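A rough, untested sketch of such a custom reader for the new API; it assumes the file is a flat sequence of 4-byte big-endian ints, that the file length is a multiple of 4, and that the desired key is the index of the integer within the file:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // InputFormat that hands each split to the reader below
    public class IntInputFormat extends FileInputFormat<LongWritable, IntWritable> {
      @Override
      public RecordReader<LongWritable, IntWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new IntRecordReader();
      }

      // key = index of the int in the file, value = the int itself
      public static class IntRecordReader extends RecordReader<LongWritable, IntWritable> {
        private FSDataInputStream in;
        private long start, end, pos;
        private final LongWritable key = new LongWritable();
        private final IntWritable value = new IntWritable();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
          FileSplit split = (FileSplit) genericSplit;
          Path file = split.getPath();
          FileSystem fs = file.getFileSystem(context.getConfiguration());
          in = fs.open(file);
          // an int belongs to the split that contains its first byte,
          // so round the split start up to the next 4-byte boundary
          start = (split.getStart() + 3) / 4 * 4;
          end = split.getStart() + split.getLength();
          in.seek(start);
          pos = start;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
          if (pos >= end) {
            return false;            // the next int starts in the following split
          }
          key.set(pos / 4);          // position (index) of the integer in the file
          value.set(in.readInt());   // assumes the file length is a multiple of 4
          pos += 4;
          return true;
        }

        @Override public LongWritable getCurrentKey()   { return key; }
        @Override public IntWritable  getCurrentValue() { return value; }
        @Override public void close() throws IOException { if (in != null) in.close(); }

        @Override
        public float getProgress() {
          return end == start ? 0.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
        }
      }
    }

The driver would then simply call job.setInputFormatClass(IntInputFormat.class).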
Re: knowing the nodes on which reduce tasks will run
HI Abhay The TaskTrackers on which the reduce tasks are triggered is chosen in random based on the reduce slot availability. So if you don't need the reduce tasks to be scheduled on some particular nodes you need to set 'mapred.tasktracker.reduce.tasks.maximum' on those nodes to 0. The bottleneck here is that this property is not a job level one you need to set it on a cluster level. A cleaner approach will be to configure each of your nodes with the right number of map and reduce slots based on the resources available on each machine. On Mon, Sep 3, 2012 at 7:49 PM, Abhay Ratnaparkhi abhay.ratnapar...@gmail.com wrote: Hello, How can one get to know the nodes on which reduce tasks will run? One of my job is running and it's completing all the map tasks. My map tasks write lots of intermediate data. The intermediate directory is getting full on all the nodes. If the reduce task take any node from cluster then It'll try to copy the data to same disk and it'll eventually fail due to Disk space related exceptions. I have added few more tasktracker nodes in the cluster and now want to run reducer on new nodes only. Is it possible to choose a node on which the reducer will run? What's the algorithm hadoop uses to get a new node to run reducer? Thanks in advance. Bye Abhay
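For reference, the per-node setting lives in mapred-site.xml of each TaskTracker (the values below are examples), and as noted a TaskTracker restart is needed for a change to take effect:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>0</value> <!-- no reduce slots on this node -->
    </property>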
Re: knowing the nodes on which reduce tasks will run
Hi Abhay You need this value to be changed before you submit your job and restart TT. Modifying this value in mid time won't affect the running jobs. On Mon, Sep 3, 2012 at 9:06 PM, Abhay Ratnaparkhi abhay.ratnapar...@gmail.com wrote: How can I set 'mapred.tasktracker.reduce.tasks.maximum' to 0 in a running tasktracker? Seems that I need to restart the tasktracker and in that case I'll loose the output of map tasks by particular tasktracker. Can I change 'mapred.tasktracker.reduce.tasks.maximum' to 0 without restarting tasktracker? ~Abhay On Mon, Sep 3, 2012 at 8:53 PM, Bejoy Ks bejoy.had...@gmail.com wrote: HI Abhay The TaskTrackers on which the reduce tasks are triggered is chosen in random based on the reduce slot availability. So if you don't need the reduce tasks to be scheduled on some particular nodes you need to set 'mapred.tasktracker.reduce.tasks.maximum' on those nodes to 0. The bottleneck here is that this property is not a job level one you need to set it on a cluster level. A cleaner approach will be to configure each of your nodes with the right number of map and reduce slots based on the resources available on each machine. On Mon, Sep 3, 2012 at 7:49 PM, Abhay Ratnaparkhi abhay.ratnapar...@gmail.com wrote: Hello, How can one get to know the nodes on which reduce tasks will run? One of my job is running and it's completing all the map tasks. My map tasks write lots of intermediate data. The intermediate directory is getting full on all the nodes. If the reduce task take any node from cluster then It'll try to copy the data to same disk and it'll eventually fail due to Disk space related exceptions. I have added few more tasktracker nodes in the cluster and now want to run reducer on new nodes only. Is it possible to choose a node on which the reducer will run? What's the algorithm hadoop uses to get a new node to run reducer? Thanks in advance. Bye Abhay
Re: MRBench Maps strange behaviour
Hi Gaurav You can get the information on the num of map tasks in the job from the JT web UI itself. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Gaurav Dasgupta gdsay...@gmail.com Date: Wed, 29 Aug 2012 13:14:11 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: MRBench Maps strange behaviour Hi Hemanth, Thanks for the reply. Can you tell me how can I calculate or ensure from the counters what should be the exact number of Maps? Thanks, Gaurav Dasgupta On Wed, Aug 29, 2012 at 11:26 AM, Hemanth Yamijala yhema...@gmail.comwrote: Hi, The number of maps specified to any map reduce program (including those part of MRBench) is generally only a hint, and the actual number of maps will be influenced in typical cases by the amount of data being processed. You can take a look at this wiki link to understand more: http://wiki.apache.org/hadoop/HowManyMapsAndReduces In the examples below, since the data you've generated is different, the number of mappers are different. To be able to judge your benchmark results, you'd need to benchmark against the same data (or at least same type of type - i.e. size and type). The number of maps printed at the end is straight from the input specified and doesn't reflect what the job actually ran with. The information from the counters is the right one. Thanks Hemanth On Tue, Aug 28, 2012 at 4:02 PM, Gaurav Dasgupta gdsay...@gmail.com wrote: Hi All, I executed the MRBench program from hadoop-test.jar in my 12 node CDH3 cluster. After executing, I had some strange observations regarding the number of Maps it ran. First I ran the command: hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 3 -maps 200 -reduces 200 -inputLines 1024 -inputType random And I could see that the actual number of Maps it ran was 201 (for all the 3 runs) instead of 200 (Though the end report displays the launched to be 200). Here is the console report: 12/08/28 04:34:35 INFO mapred.JobClient: Job complete: job_201208230144_0035 12/08/28 04:34:35 INFO mapred.JobClient: Counters: 28 12/08/28 04:34:35 INFO mapred.JobClient: Job Counters 12/08/28 04:34:35 INFO mapred.JobClient: Launched reduce tasks=200 12/08/28 04:34:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=617209 12/08/28 04:34:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/08/28 04:34:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/08/28 04:34:35 INFO mapred.JobClient: Rack-local map tasks=137 12/08/28 04:34:35 INFO mapred.JobClient: Launched map tasks=201 12/08/28 04:34:35 INFO mapred.JobClient: Data-local map tasks=64 12/08/28 04:34:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=1756882 Again, I ran the MRBench for just 10 Maps and 10 Reduces: hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -maps 10 -reduces 10 This time the actual number of Maps were only 2 and again the end report displays Maps Lauched to be 10. 
The console output: 12/08/28 05:05:35 INFO mapred.JobClient: Job complete: job_201208230144_0040 12/08/28 05:05:35 INFO mapred.JobClient: Counters: 27 12/08/28 05:05:35 INFO mapred.JobClient: Job Counters 12/08/28 05:05:35 INFO mapred.JobClient: Launched reduce tasks=20 12/08/28 05:05:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6648 12/08/28 05:05:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/08/28 05:05:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/08/28 05:05:35 INFO mapred.JobClient: Launched map tasks=2 12/08/28 05:05:35 INFO mapred.JobClient: Data-local map tasks=2 12/08/28 05:05:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=163257 12/08/28 05:05:35 INFO mapred.JobClient: FileSystemCounters 12/08/28 05:05:35 INFO mapred.JobClient: FILE_BYTES_READ=407 12/08/28 05:05:35 INFO mapred.JobClient: HDFS_BYTES_READ=258 12/08/28 05:05:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1072596 12/08/28 05:05:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=3 12/08/28 05:05:35 INFO mapred.JobClient: Map-Reduce Framework 12/08/28 05:05:35 INFO mapred.JobClient: Map input records=1 12/08/28 05:05:35 INFO mapred.JobClient: Reduce shuffle bytes=647 12/08/28 05:05:35 INFO mapred.JobClient: Spilled Records=2 12/08/28 05:05:35 INFO mapred.JobClient: Map output bytes=5 12/08/28 05:05:35 INFO mapred.JobClient: CPU time spent (ms)=17070 12/08/28 05:05:35 INFO mapred.JobClient: Total committed heap usage (bytes)=6218842112 12/08/28 05:05:35 INFO mapred.JobClient: Map input bytes=2 12/08/28 05:05:35 INFO mapred.JobClient: Combine input records=0 12/08/28 05:05:35 INFO mapred.JobClient
Re: one reducer is hanged in reduce- copy phase
Hi Abhay The map outputs are deleted only after the reducer runs to completion. Is it possible to run the same attempt again? Does killing the child java process or tasktracker on the node help? (since hadoop may schedule a reduce attempt on another node). Yes,it is possible to re attempt the task again for that you need to fail the current attempt. Can I copy the map intermediate output required for this single reducer (which is hanged) and rerun only the hang reducer? It is not that easy to accomplish this. Better fail the task explicitly so that the it is re attempted. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Abhay Ratnaparkhi abhay.ratnapar...@gmail.com Date: Tue, 28 Aug 2012 19:40:58 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: one reducer is hanged in reduce- copy phase Hello, I have a MR job which has 4 reducers running. One of the reduce attempt is pending since long time in reduce-copy phase. The job is not able to complete because of this. I have seen that the child java process on tasktracker is running. Is it possible to run the same attempt again? Does killing the child java process or tasktracker on the node help? (since hadoop may schedule a reduce attempt on another node). Can I copy the map intermediate output required for this single reducer (which is hanged) and rerun only the hang reducer? Thank you in advance. ~Abhay ask_201208250623_0005_r_00http://dpep089.innovate.ibm.com:50030/taskdetails.jsp?tipid=task_201208250623_0005_r_00 26.41% reduce copy(103 of 130 at 0.08 MB/s) 28-Aug-2012 03:09:34
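Failing the stuck attempt from the command line looks roughly like this; the attempt and job ids below are made up, the real ones can be copied from the JobTracker web UI:

    # fail (not kill) the hung reduce attempt so a fresh attempt gets scheduled
    hadoop job -fail-task attempt_201208250623_0005_r_000000_0

    # job status if the ids are not handy
    hadoop job -status job_201208250623_0005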
Re: namenode not starting
Hi Abhay What is the value for hadoop.tmp.dir or dfs.name.dir . If it was set to /tmp the contents would be deleted on a OS restart. You need to change this location before you start your NN. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Abhay Ratnaparkhi abhay.ratnapar...@gmail.com Date: Fri, 24 Aug 2012 12:58:41 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: namenode not starting Hello, I had a running hadoop cluster. I restarted it and after that namenode is unable to start. I am getting error saying that it's not formatted. :( Is it possible to recover the data on HDFS? 2012-08-24 03:17:55,378 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. java.io.IOException: NameNode is not formatted. at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:434) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:270) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:433) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:421) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368) 2012-08-24 03:17:55,380 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: NameNode is not formatted. at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:434) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:270) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:433) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:421) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368) Regards, Abhay
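Moving the metadata off /tmp is a matter of settings like the ones below (the paths are examples); with the old /tmp contents already gone, the namenode will still need a fresh format or a restore after the change:

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.name.dir</name>
      <value>/data/hadoop/dfs/name</value>
    </property>

    <!-- core-site.xml: keeps the other default storage locations off /tmp as well -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data/hadoop/tmp</value>
    </property>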
Re: Streaming issue ( URGENT )
Hi Siddharth Joins are better implemented in hive and pig. Try checking out those and see whether it fits your requirements. If you are still looking for implementing joins using mapreduce, you can take a look at this example which uses MultipleInputs http://kickstarthadoop.blogspot.in/2011/09/joins-with-plain-map-reduce.html Regards Bejoy KS
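With the Hive route, a reduce-side join collapses into a single statement; a small sketch with invented tables and columns:

    SELECT u.name, o.order_id, o.amount
    FROM users u
    JOIN orders o ON (u.user_id = o.user_id);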
Re: Number of Maps running more than expected
Hi Gaurav How many input files are there for the wordcount map reduce job? Do you have input files lesser than a block size? If you are using the default TextInputFormat there will be one task generated per file for sure, so if you have files less than block size the calculation specified here for number of splits won't hold. If small files are there then definitely the number of maps tasks should be more. Also did you change the split sizes as well along with block size? Regards Bejoy KS
Re: help in distribution of a task with hadoop
Hi Bertrand -libjars option works well with the 'hadoop jar' command. Instead of executing your runnable with the plain java 'jar' command use 'hadoop jar' . When you use hadoop jar you can ship the dependent jars/files etc as 1) include them in the /lib folder in your jar 2) use -libjars / -files to distribute jars or files Regards Bejoy KS
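A typical invocation looks like the sketch below (jar names, class and paths are placeholders); note that -libjars and -files are honoured only when the driver runs through ToolRunner/GenericOptionsParser:

    hadoop jar myjob.jar com.example.MyDriver \
      -libjars /local/lib/dep1.jar,/local/lib/dep2.jar \
      -files /local/data/lookup.txt \
      input_dir output_dir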
Re: how to enhance job start up speed?
Hi Matthais When an mapreduce program is being used there are some extra steps like checking for input and output dir, calclulating input splits, JT assigning TT for executing the task etc. If your file is non splittable , then one map task per file will be generated irrespective of the number of hdfs blocks. Now some blocks will be in a different node than the node where map task is executed so time will be spend here on the network transfer. In your case MR would be a overhead as your file is non splittable hence no parallelism and also there is an overhead of copying blocks to the map task node. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Matthias Kricke matthias.mk.kri...@gmail.com Sender: matthias.zeng...@gmail.com Date: Mon, 13 Aug 2012 16:33:06 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: how to enhance job start up speed? Ok, I try to clarify: 1) The worker is the logic inside my mapper and the same for both cases. 2) I have two cases. In the first one I use hadoop to execute my worker and in a second one, I execute my worker without hadoop (simple read of the file). Now I measured, for both cases, the time the worker and the surroundings need (so i have two values for each case). The worker took the same time in both cases for the same input (this is expected). But the surroundings took 17% more time when using hadoop. 3) ~ 3GB. I want to know how to reduce this difference and where they come from. I hope that helped? If not, feel free to ask again :) Greetings, MK P.S. just for your information, I did the same test with hypertable as well. I got: * worker without anything: 15% overhead * worker with hadoop: 32% overhead * worker with hypertable: 53% overhead Remark: overhead was measured in comparison to the worker. e.g. hypertable uses 53% of the whole process time, while worker uses 47%. 2012/8/13 Bertrand Dechoux decho...@gmail.com I am not sure to understand and I guess I am not the only one. 1) What's a worker in your context? Only the logic inside your Mapper or something else? 2) You should clarify your cases. You seem to have two cases but both are in overhead so I am assuming there is a baseline? Hadoop vs sequential, so sequential is not Hadoop? 3) What are the size of the file? Bertrand On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke matthias.mk.kri...@gmail.com wrote: Hello all, I'm using CDH3u3. If I want to process one File, set to non splitable hadoop starts one Mapper and no Reducer (thats ok for this test scenario). The Mapper goes through a configuration step where some variables for the worker inside the mapper are initialized. Now the Mapper gives me K,V-pairs, which are lines of an input file. I process the V with the worker. When I compare the run time of hadoop to the run time of the same process in sequentiell manner, I get: worker time -- same in both cases case: mapper -- overhead of ~32% to the worker process (same for bigger chunk size) case: sequentiell -- overhead of ~15% to the worker process It shouldn't be that much slower, because of non splitable, the mapper will be executed where the data is saved by HDFS, won't it? Where did those 17% go? How to reduce this? Did hadoop needs the whole time for reading or streaming the data out of HDFS? I would appreciate your help, Greetings mk -- Bertrand Dechoux
Re: Hbase JDBC API
Hi Sandeep You can have a look at HbaseStorageHandler which maps the hbase tables to hive tables . Once this mapping is done you can use the hive jdbc to query Hbase tables. See whether this hive Hbase Integration suits your requirement. https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration Regards Bejoy KS
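The mapping described on that page looks roughly like the sketch below (the Hive table, HBase table and column family names are invented):

    CREATE EXTERNAL TABLE hbase_orders (key STRING, amount INT, status STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:amount,d:status")
    TBLPROPERTIES ("hbase.table.name" = "orders");

    -- from here on, ordinary HiveQL (and hence the Hive JDBC driver) reads the HBase table
    SELECT status, count(*) FROM hbase_orders GROUP BY status;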
Re: fs.local.block.size vs file.blocksize
Hi Rahul Better to start a new thread than hijacking others. :) It helps to keep the mailing list archives clean. For learning Java, you need to get some Java books and start off. If you just want to run the wordcount example, just follow the steps in the url below http://wiki.apache.org/hadoop/WordCount To understand more details on the working, I scribbled something a while back, maybe it can help you start off http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html Regards Bejoy KS
Re: Problem with hadoop filesystem after restart cluster
Hi Andy Is your hadoop.tmp.dir or dfs.name.dir configured to /tmp? If so it can happen as /tmp dir gets wiped out on OS restarts Regards Bejoy KS
Re: Reading fields from a Text line
That is a good pointer Harsh. Thanks a lot. But if IdentityMapper is being used shouldn't the job.xml reflect that? But Job.xml always shows mapper as our CustomMapper. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Harsh J ha...@cloudera.com Date: Fri, 3 Aug 2012 13:02:32 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Cc: Mohammad Tariqdonta...@gmail.com Subject: Re: Reading fields from a Text line That is not really a bug. Only if you use @Override will you be really asserting that you've overriden the right method (since new API uses inheritance instead of interfaces). Without that kinda check, its easy to make mistakes and add in methods that won't get considered by the framework (and hence the default IdentityMapper comes into play). Always use @Override annotations when inheriting and overriding methods. On Fri, Aug 3, 2012 at 4:41 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Tariq On further analysis I noticed a odd behavior in this context. If we use the default InputFormat (TextInputFormat) but specify the Key type in mapper as IntWritable instead of Long Writable. The framework is supposed throw a class cast exception.Such an exception is thrown only if the key types at class level and method level are the same (IntWritable) in Mapper. But if we provide the Input key type as IntWritable on the class level but LongWritable on the method level (map method), instead of throwing a compile time error, the code compliles fine . In addition to it on execution the framework triggers Identity Mapper instead of the custom mapper provided with the configuration. This seems like a bug to me . Filed a jira to track this issue https://issues.apache.org/jira/browse/MAPREDUCE-4507 Regards Bejoy KS -- Harsh J
Re: Reading fields from a Text line
Ok Got it now. That is a good piece of information. Thank You :) Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Harsh J ha...@cloudera.com Date: Fri, 3 Aug 2012 16:28:27 To: mapreduce-user@hadoop.apache.org; bejoy.had...@gmail.com Cc: Mohammad Tariqdonta...@gmail.com Subject: Re: Reading fields from a Text line Bejoy, In the new API, the default map() function, if not properly overridden, is the identity map function. There is no IdentityMapper class in the new API, the Mapper class itself is identity by default. On Fri, Aug 3, 2012 at 1:07 PM, Bejoy KS bejoy.had...@gmail.com wrote: That is a good pointer Harsh. Thanks a lot. But if IdentityMapper is being used shouldn't the job.xml reflect that? But Job.xml always shows mapper as our CustomMapper. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Harsh J ha...@cloudera.com Date: Fri, 3 Aug 2012 13:02:32 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Cc: Mohammad Tariqdonta...@gmail.com Subject: Re: Reading fields from a Text line That is not really a bug. Only if you use @Override will you be really asserting that you've overriden the right method (since new API uses inheritance instead of interfaces). Without that kinda check, its easy to make mistakes and add in methods that won't get considered by the framework (and hence the default IdentityMapper comes into play). Always use @Override annotations when inheriting and overriding methods. On Fri, Aug 3, 2012 at 4:41 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Tariq On further analysis I noticed a odd behavior in this context. If we use the default InputFormat (TextInputFormat) but specify the Key type in mapper as IntWritable instead of Long Writable. The framework is supposed throw a class cast exception.Such an exception is thrown only if the key types at class level and method level are the same (IntWritable) in Mapper. But if we provide the Input key type as IntWritable on the class level but LongWritable on the method level (map method), instead of throwing a compile time error, the code compliles fine . In addition to it on execution the framework triggers Identity Mapper instead of the custom mapper provided with the configuration. This seems like a bug to me . Filed a jira to track this issue https://issues.apache.org/jira/browse/MAPREDUCE-4507 Regards Bejoy KS -- Harsh J -- Harsh J
Re: Reading fields from a Text line
Hi Tariq I assume the mapper being used is IdentityMapper instead of XPTMapper class. Can you share your main class? If you are using TextInputFormat an reading from a file in hdfs, it should have LongWritable Keys as input and your code has IntWritable as the input key type. Have a check on that as well. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Mohammad Tariq donta...@gmail.com Date: Thu, 2 Aug 2012 15:48:42 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Re: Reading fields from a Text line Thanks for the response Harsh n Sri. Actually, I was trying to prepare a template for my application using which I was trying to read one line at a time, extract the first field from it and emit that extracted value from the mapper. I have these few lines of code for that : public static class XPTMapper extends MapperIntWritable, Text, LongWritable, Text{ public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{ Text word = new Text(); String line = value.toString(); if (!line.startsWith(TT)){ context.setStatus(INVALID LINE..SKIPPING); }else{ String stdid = line.substring(0, 7); word.set(stdid); context.write(key, word); } } But the output file contains all the rows of the input file including the lines which I was expecting to get skipped. Also, I was expecting only the fields I am emitting but the file contains entire lines. Could you guys please point out the the mistake I might have made. (Pardon my ignorance, as I am not very good at MapReduce).Many thanks. Regards, Mohammad Tariq On Thu, Aug 2, 2012 at 10:58 AM, Sriram Ramachandrasekaran sri.ram...@gmail.com wrote: Wouldn't it be better if you could skip those unwanted lines upfront(preprocess) and have a file which is ready to be processed by the MR system? In any case, more details are needed. On Thu, Aug 2, 2012 at 8:23 AM, Harsh J ha...@cloudera.com wrote: Mohammad, But it seems I am not doing things in correct way. Need some guidance. What do you mean by the above? What is your written code exactly expected to do and what is it not doing? Perhaps since you ask for a code question here, can you share it with us (pastebin or gists, etc.)? For skipping 8 lines, if you are using splits, you need to detect within the mapper or your record reader if the map task filesplit has an offset of 0 and skip 8 line reads if so (Cause its the first split of some file). On Thu, Aug 2, 2012 at 1:54 AM, Mohammad Tariq donta...@gmail.com wrote: Hello list, I have a flat file in which data is stored as lines of 107 bytes each. I need to skip the first 8 lines(as they don't contain any valuable info). Thereafter, I have to read each line and extract the information from them, but not the line as a whole. Each line is composed of several fields without any delimiter between them. For example, the first field is of 8 bytes, second of 2 bytes and so on. I was trying to reach each line as a Text value, convert it into string and using String.subring() method to extract the value of each field. But it seems I am not doing things in correct way. Need some guidance. Many thanks. Regards, Mohammad Tariq -- Harsh J -- It's just about how deep your longing is!
Re: Reading fields from a Text line
Hi Tariq Again I strongly suspect the IdentityMapper is in play here. The reasoning: when the whole input data turns up unchanged in the output file, it is usually the IdentityMapper at work. Due to the mismatch in input key type at the class level and method level the framework is falling back to IdentityMapper. I have noticed this fall back while using the new mapreduce API. public static class XPTMapper extends Mapper<IntWritable, Text, LongWritable, Text> { public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { When you change the input key type to LongWritable at the class level, it is your custom mapper (XPTMapper) being called. Because of some exceptional cases it is just going into the if branch where you are not writing anything out of the mapper, and hence an empty output file. public static class XPTMapper extends Mapper<LongWritable, Text, LongWritable, Text> { public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { To cross check this, try enabling some logging in your code to see exactly what is happening. By the way, are you getting the output of this line in your logs when you change the input key type to LongWritable? context.setStatus("INVALID LINE..SKIPPING"); If so that confirms my assumption. :) Try adding more logs to trace the flow and see what is going wrong. Or you can use MRUnit to unit test your code as the first step. Hope it helps!.. Regards Bejoy KS
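For completeness, a corrected sketch of the mapper being discussed (still the thread's hypothetical XPTMapper and field layout), with matching key types and @Override so the framework cannot silently fall back to the identity map:

    public static class XPTMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
      private final Text word = new Text();

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        if (!line.startsWith("TT")) {
          context.setStatus("INVALID LINE..SKIPPING");
          return; // skip the record instead of falling through
        }
        word.set(line.substring(0, 7)); // first field, as in the original snippet
        context.write(key, word);
      }
    }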
Re: All reducers are not being utilized
Hi Saurab/Steve From my understanding the schedulers in hadoop consider only data locality (for map tasks) and the availability of slots when scheduling tasks on the various nodes. Say you have 3 TT nodes with 2 reducer slots each (assume all slots are free). If we execute a mapreduce job with 3 reduce tasks there is no guarantee that one task will be scheduled on each node. It can well be 2 on one node and 1 on another. Regards Bejoy KS
Re: DBOutputWriter timing out writing to database
Hi Nathan Alternatively you can have a look at Sqoop , which offers efficient data transfers between rdbms and hdfs. Regards Bejoy KS
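A hedged example of pushing the job output into the database with Sqoop instead of DBOutputFormat (the connection string, table and paths are placeholders):

    sqoop export \
      --connect jdbc:mysql://dbhost/reports \
      --username etl -P \
      --table daily_counts \
      --export-dir /user/nathan/job_output \
      --input-fields-terminated-by '\t'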
Re: Reading fields from a Text line
Hi Tariq On further analysis I noticed an odd behavior in this context. If we use the default InputFormat (TextInputFormat) but specify the key type in the mapper as IntWritable instead of LongWritable, the framework is supposed to throw a class cast exception. Such an exception is thrown only if the key types at the class level and method level are the same (IntWritable) in the Mapper. But if we provide the input key type as IntWritable at the class level but LongWritable at the method level (map method), instead of throwing a compile time error, the code compiles fine. In addition, on execution the framework triggers the IdentityMapper instead of the custom mapper provided with the configuration. This seems like a bug to me. Filed a jira to track this issue https://issues.apache.org/jira/browse/MAPREDUCE-4507 Regards Bejoy KS
Re: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text
Hi Harit You need to set the key type as well. If you are using different data types for the key and value in your map output with respect to the reduce output then you need to specify both.

// setting the map output data type classes
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
// setting the final reduce output data type classes
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

Regards Bejoy KS
Re: Disable retries
Hi Marco You can disable retries by setting mapred.map.max.attempts and mapred.reduce.max.attempts to 1. Also if you need to disable speculative execution. You can disable it by setting mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution to false. With these two steps you can ensure that a task is attempted only once. These properties to be set in mapred-site.xml or at job level. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Marco Gallotta ma...@gallotta.co.za Date: Thu, 2 Aug 2012 16:52:00 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Disable retries Hi there Is there a way to disable retries when a mapper/reducer fails? I'm writing data in my mapper and I'd rather catch the failure, recover from a backup (fairly lightweight in this case, as the output tables aren't big) and restart. -- Marco Gallotta | Mountain View, California Software Engineer, Infrastructure | Loki Studios fb.me/marco.gallotta | twitter.com/marcog ma...@gallotta.co.za | +1 (650) 417-3313 Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
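At the job level, the same settings can be applied in the driver before submission; a sketch assuming 'job' is an already created org.apache.hadoop.mapreduce.Job:

    Configuration conf = job.getConfiguration();
    conf.setInt("mapred.map.max.attempts", 1);
    conf.setInt("mapred.reduce.max.attempts", 1);
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);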
Re: Merge Reducers Output
Hi Why not use 'hadoop fs -getMerge outputFolderInHdfs targetFileNameInLfs' while copying files out of hdfs for the end users to consume. This will merge all the files in 'outputFolderInHdfs' into one file and put it in lfs. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Michael Segel michael_se...@hotmail.com Date: Mon, 30 Jul 2012 21:08:22 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Merge Reducers Output Why not use a combiner? On Jul 30, 2012, at 7:59 PM, Mike S wrote: Liked asked several times, I need to merge my reducers output files. Imagine I have many reducers which will generate 200 files. Now to merge them together, I have written another map reduce job where each mapper read a complete file in full in memory, and output that and then only one reducer has to merge them together. To do so, I had to write a custom fileinputreader that reads the complete file into memory and then another custom fileoutputfileformat to append the each reducer item bytes together. this how my mapper and reducers looks like public static class MapClass extends MapperNullWritable, BytesWritable, IntWritable, BytesWritable { @Override public void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException { context.write(key, value); } } public static class Reduce extends ReducerNullWritable, BytesWritable, NullWritable, BytesWritable { @Override public void reduce(NullWritable key, IterableBytesWritable values, Context context) throws IOException, InterruptedException { for (BytesWritable value : values) { context.write(NullWritable.get(), value); } } } I still have to have one reducers and that is a bottle neck. Please note that I must do this merging as the users of my MR job are outside my hadoop environment and the result as one file. Is there better way to merge reducers output files?
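Note the shell command is all lowercase; usage is roughly as below (paths are examples):

    # merge every part file under the job output dir into one local file
    hadoop fs -getmerge /user/mike/job_output /local/reports/result.txt

    # if the single file must live back in HDFS for the downstream users
    hadoop fs -put /local/reports/result.txt /user/mike/merged/result.txt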
Re: Error reading task output
Hi Ben This error happens when the mapreduce job spawns more processes than the underlying OS allows. You need to increase the nproc value if it is still the default one. You can get the current value on Linux using 'ulimit -u'; the default is 1024 I guess. Check that for the user that runs the mapreduce jobs; for a non security enabled cluster it is mapred. You need to increase this to a large value by raising the nproc limits for that user in /etc/security/limits.conf (see the sketch below). If you are running on a security enabled cluster, this value should be raised for the user who submits the job. Regards Bejoy KS
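The entries go into /etc/security/limits.conf on the nodes running the tasks; a sketch assuming the mapred user and a limit of 32000 (pick a value appropriate for the cluster):

    # /etc/security/limits.conf
    mapred    soft    nproc    32000
    mapred    hard    nproc    32000

    # verify after the user logs in again / the daemons are restarted
    su - mapred -c 'ulimit -u'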
Re: Hadoop 1.0.3 start-daemon.sh doesn't start all the expected daemons
Hi Dinesh Try using $HADOOP_HOME/bin/start-all.sh . It starts all the hadoop daemons including TT and DN. Regards Bejoy KS
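For a single-node setup, a quick check after starting is that jps lists all five daemons (output abbreviated):

    $HADOOP_HOME/bin/start-all.sh
    jps
    # expected, roughly: NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker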
Re: Retrying connect to server: localhost/127.0.0.1:9000.
Hi Keith Your NameNode is not up still. What does the NN logs say? Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: anil gupta anilgupt...@gmail.com Date: Fri, 27 Jul 2012 11:30:57 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Retrying connect to server: localhost/127.0.0.1:9000. Hi Keith, Does ping to localhost returns a reply? Try telneting to localhost 9000. Thanks, Anil On Fri, Jul 27, 2012 at 11:22 AM, Keith Wiley kwi...@keithwiley.com wrote: I'm plagued with this error: Retrying connect to server: localhost/127.0.0.1:9000. I'm trying to set up hadoop on a new machine, just a basic pseudo-distributed setup. I've done this quite a few times on other machines, but this time I'm kinda stuck. I formatted the namenode without obvious errors and ran start-all.sh with no errors to stdout. However, the logs are full of that error above and if I attempt to access hdfs (ala hadoop fs -ls /) I get that error again. Obviously, my core-site.xml sets fs.default.name to hdfs://localhost:9000. I assume something is wrong with /etc/hosts, but I'm not sure how to fix it. If hostname returns X and hostname -f returns Y, then what are the corresponding entries in /etc/hosts? Thanks for any help. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me. -- Abe (Grandpa) Simpson -- Thanks Regards, Anil Gupta
Re: KeyValueTextInputFormat absent in hadoop-0.20.205
Hi Tariq KeyValueTextInputFormat is available from the hadoop 1.0.1 version onwards for the new mapreduce API: http://hadoop.apache.org/common/docs/r1.0.1/api/org/apache/hadoop/mapreduce/lib/input/KeyValueTextInputFormat.html Regards Bejoy KS On Wed, Jul 25, 2012 at 8:07 PM, Mohammad Tariq donta...@gmail.com wrote: Hello list, I am trying to run a small MapReduce job that uses KeyValueTextInputFormat with the new API (hadoop-0.20.205.0), but it seems KeyValueTextInputFormat is not included in the new API. Am I correct??? Regards, Mohammad Tariq
Re: Unexpected end of input stream (GZ)
Hi Oleg From the job tracker page, you can get to the failed tasks and see which file split was processed by each task. The split information is available under the status column for each task. The file split information is not available in job history. Regards Bejoy KS On Tue, Jul 24, 2012 at 1:49 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, I got such an exception running a hadoop job:

java.io.EOFException: Unexpected end of input stream
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
at java.io.InputStream.read(InputStream.java:85)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.

As I understood, some of my files are corrupted (I am working with GZ format). I resolved the issue using conf.set("mapred.max.map.failures.percent", "1"), but I don't know which file caused the problem. Question: How can I get the filename which is corrupted? Thanks in advance Oleg.
Re: fail and kill all tasks without killing job.
Hi Jay Did you try 'hadoop job -kill-task <task-attempt-id>'? And is that not working as desired? Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: jay vyas jayunit...@gmail.com Date: Fri, 20 Jul 2012 17:17:58 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: fail and kill all tasks without killing job. Hi guys: I want my tasks to end/fail, but I don't want to kill my entire hadoop job. I have a hadoop job that runs 5 hadoop jobs in a row. I'm on the last of those sub-jobs, and want to fail all tasks so that the task tracker stops delegating them, and the hadoop main job can naturally come to a close. However, when I run hadoop job kill-attempt / fail-attempt, the jobtracker seems to simply relaunch the same tasks with new ids. How can I tell the jobtracker to give up on redelegating?
Re: NameNode fails
Hi Yogesh Is your dfs.name.dir pointing to the /tmp dir? If so, try changing that to any other dir. The contents of /tmp may get wiped out on OS restarts. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: yogesh.kuma...@wipro.com Date: Fri, 20 Jul 2012 06:20:02 To: hdfs-user@hadoop.apache.org Reply-To: hdfs-user@hadoop.apache.org Subject: NameNode fails Hello All :-), I am new to Hdfs. I have installed a single node hdfs and started all nodes; every node gets started and works fine. But when I shut down my system or restart it and then try to start all the nodes again, the Namenode doesn't start. To start it I need to format the namenode, and all data gets washed off :-(. Please help me and suggest how I can recover the namenode from the secondary namenode on a single node setup. Thanks Regards Yogesh Kumar Dhari
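For concreteness, the change suggested above is an hdfs-site.xml override along these lines; the path is a placeholder, and dfs.name.dir / dfs.data.dir are the Hadoop 1.x property names. The namenode should be stopped and the contents of the old name directory copied over (or the namenode re-formatted on a throwaway cluster) before restarting with the new location:

<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/dfs/data</value>
</property>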
Re: Hadoop filesystem directories not visible
Hi Saniya In hdfs the directory exists only as meta data in the name node. There is no real hierarchical existence like normal file system. It is the data in the files that is stored as hdfs blocks distributed across data nodes. You see these hdfs blocks arranged in dfs.data.dir . Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Yuvrajsinh Chauhan yuvraj.chau...@elitecore.com Date: Thu, 19 Jul 2012 15:16:24 To: hdfs-user@hadoop.apache.org Reply-To: hdfs-user@hadoop.apache.org Subject: RE: Hadoop filesystem directories not visible Dear Saniya, I Second to you on this. Am also find exactly the same folder on secondary data node. Also, How can I write files from my external application ? Regards, Yuvrajsinh Chauhan || Sr. DBA || CRESTEL-PSG Elitecore Technologies Pvt. Ltd. 904, Silicon Tower || Off C.G.Road Behind Pariseema Building || Ahmedabad || INDIA [GSM]: +91 9727746022 From: Saniya Khalsa [mailto:saniya.kha...@gmail.com] Sent: 19 July 2012 14:58 To: hdfs-user@hadoop.apache.org Subject: Re: Hadoop filesystem directories not visible Hi Mohammad Tariq, Thanks for the reply!! The path to dfs.data.dir is /app/hadoop/tmp/dfs/data when i go there i find only these : BlocksBeingWriiten Current Detach In_use.lock storage tmp I am unable to see the created directories here. Regards Saniya On Thu, Jul 19, 2012 at 2:39 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Saniya, If you are talking about the local FS, then it will be present at the location specified as the value of 'dfs.data.dir' property in hdfs-site.xml file. Regards, Mohammad Tariq On Thu, Jul 19, 2012 at 1:09 PM, Saniya Khalsa saniya.kha...@gmail.com wrote: Hi, I ran these commands $HADOOP_HOME/bin/hadoop fs -mkdir /tmp $HADOOP_HOME/bin/hadoop fs -mkdir /user The directories got created and I can now see the directories using following commands: [hadoop@master bin]$ ./hadoop fs -ls / Found 5 items drwxr-xr-x - hadoop supergroup 0 2012-07-16 14:11 /app drwxr-xr-x - hadoop supergroup 0 2012-07-17 17:41 /hadoop drwxr-xr-x - hadoop supergroup 0 2012-07-18 14:11 /hbase drwxr-xr-x - hadoop supergroup 0 2012-07-19 14:11 /tmp drwxr-xr-x - hadoop supergroup 0 2012-07-19 17:41 /user I can see this data from both the nodes by typing the command ,but i cannot view the directories created in the file path anywhere.Please tell me how to see these directories created in file system. Thanks
Re: Hadoop filesystem directories not visible
This can be a good reference to start with: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample On Thu, Jul 19, 2012 at 3:42 PM, Mohammad Tariq donta...@gmail.com wrote: Hi Yuvraj, Yes. The starting point for the Hadoop file API is the 'FileSystem' class. Hadoop's FileSystem provides us with the FSDataInputStream and FSDataOutputStream classes for reading and writing files. Regards, Mohammad Tariq On Thu, Jul 19, 2012 at 3:31 PM, Yuvrajsinh Chauhan yuvraj.chau...@elitecore.com wrote: So, I understand that if I want to write a file, I need to change the code of my external application to integrate the Hadoop read/write commands/API. Regards, Yuvrajsinh Chauhan From: Saniya Khalsa [mailto:saniya.kha...@gmail.com] Sent: 19 July 2012 15:27 To: hdfs-user@hadoop.apache.org; bejoy.had...@gmail.com Subject: Re: Hadoop filesystem directories not visible Thanks Bejoy!!
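The wiki page linked above covers exactly this; as a condensed sketch of what writing and reading an HDFS file through the FileSystem API looks like (the path and content are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/demo.txt");

        // Write a file into HDFS
        FSDataOutputStream out = fs.create(file, true);
        out.writeBytes("hello hdfs\n");
        out.close();

        // Read it back
        FSDataInputStream in = fs.open(file);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        System.out.println(reader.readLine());
        reader.close();
    }
}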
Re: Loading data in hdfs
Hi Prabhjot Yes, just use the filesystem commands: hadoop fs -copyFromLocal <local src path> <destn hdfs path> Regards Bejoy KS On Thu, Jul 19, 2012 at 3:49 PM, iwannaplay games funnlearnfork...@gmail.com wrote: Hi, I am unable to use sqoop and want to load data into hdfs for testing. Is there any way by which I can load my csv or text file into the hadoop file system directly without writing code in java? Regards Prabhjot
Re: Jobs randomly not starting
Hi Robert It could be because there are no free slots available in your cluster during job submission time to launch those tasks. Some other tasks may have already occupied the map/reduce slots. When you experience this random issue please verify whether there are free task slots available. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Robert Dyer psyb...@gmail.com Date: Thu, 12 Jul 2012 23:03:02 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Jobs randomly not starting I'm using Hadoop 1.0.3 on a small cluster (1 namenode, 1 jobtracker, 2 compute nodes). My input size is a sequence file of around 280mb. Generally, my jobs run just fine and all finish in 2-5 minutes. However, quite randomly the jobs refuse to run. They submit and appear when running 'hadoop job -list' but don't appear on the jobtracker's webpage. If I manually type in the job ID on the webpage I can see it is trying to run the setup task - the map tasks haven't even started. I've left them to run and even after several minutes it is still in this state. When I spot this, I kill the job and resubmit it and generally it works. A couple of times I have seen similar problems with reduce tasks that get stuck while 'initializing'. Any ideas?
Re: Error using MultipleInputs
Hi Sanchita Try your code after commenting out the following line of code:

//conf.setInputFormat(TextInputFormat.class);

AFAIK this explicitly sets the input format to TextInputFormat instead of the delegating input format configured by MultipleInputs, and hence the job submission fails with 'No input paths specified in job'. Regards Bejoy KS On Thu, Jul 5, 2012 at 5:19 PM, Sanchita Adhya sad...@infocepts.com wrote: Hi, I am using cloudera's hadoop version - Hadoop 0.20.2-cdh3u3 and trying to use MultipleInputs, incorporating a separate mapper class per input, in the following manner:

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IntegrateExisting.class);
    conf.setJobName("IntegrateExisting");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    Path existingKeysInputPath = new Path(args[0]);
    Path newKeysInputPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);

    MultipleInputs.addInputPath(conf, existingKeysInputPath, TextInputFormat.class, MapExisting.class);
    MultipleInputs.addInputPath(conf, newKeysInputPath, TextInputFormat.class, MapNew.class);

    conf.setCombinerClass(ReduceAndFilterOut.class);
    conf.setReducerClass(ReduceAndFilterOut.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileOutputFormat.setOutputPath(conf, outputPath);
    //FileInputFormat.addInputPath(conf, existingKeysInputPath);
    //FileInputFormat.addInputPath(conf, newKeysInputPath);

    JobClient.runJob(conf);
}

Without the commented lines in the above code, the MR job fails with the following error:

12/07/05 16:59:25 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException: No input paths specified in job
Exception in thread "main" java.io.IOException: No input paths specified in job
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:153)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1242)
at org.myorg.IntegrateExisting.main(IntegrateExisting.java:122)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)

Uncommenting the lines leads to the following error in the mappers:

java.lang.ClassCastException: org.apache.hadoop.mapred.FileSplit cannot be cast to org.apache.hadoop.mapred.lib.TaggedInputSplit
at org.apache.hadoop.mapred.lib.DelegatingMapper.map(DelegatingMapper.java:48)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.Child.main(Child.java:264)

I see that MAPREDUCE-1178, which discusses the second error, is included in the CDH3 version. Is there any code missing from the above piece? Thanks for the help. Regards, Sanchita
Re: Hive/Hdfs Connector
Hi Sandeep You can connect to hdfs from a remote machine if that machine is reachable from the cluster and you have the hadoop jars and the right hadoop configuration files. Similarly, you can issue HQL programmatically from your application using the hive jdbc driver. --Original Message-- From: Sandeep Reddy P To: common-user@hadoop.apache.org To: cdh-u...@cloudera.org Cc: t...@cloudwick.com ReplyTo: common-user@hadoop.apache.org Subject: Hive/Hdfs Connector Sent: Jul 5, 2012 20:32 Hi, We have an application which generates SQL queries and connects to an RDBMS using connectors like JDBC for mysql. Now if we generate HQL using our application, is there any way to connect to Hive/Hdfs using connectors? I need help on what connectors I have to use. We don't want to pull data from Hive/Hdfs to the RDBMS; instead we need our application to connect to Hive/Hdfs. -- Thanks, sandeep Regards Bejoy KS Sent from handheld, please excuse typos.
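The JDBC route mentioned above looks much like any other JDBC client. A sketch against the HiveServer1-style driver of that era; the host, port and table are placeholders, and the standalone Hive server is assumed to be running (e.g. started with 'hive --service hiveserver'):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer1 driver class shipped with Hive in this timeframe
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://hiveserver-host:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT * FROM some_table LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}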
Re: change hdfs block size for file existing on HDFS
Hi Anurag, The easiest option would be to set dfs.block.size to 128 MB in your map reduce job. --Original Message-- From: Anurag Tangri To: hdfs-u...@hadoop.apache.org To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: change hdfs block size for file existing on HDFS Sent: Jun 26, 2012 11:07 Hi, We have a situation where all the files that we have are of 64 MB block size. I want to change these files (output of a map job mainly) to 128 MB blocks. What would be a good way to do this migration from 64 MB to 128 MB block files? Thanks, Anurag Tangri Regards Bejoy KS Sent from handheld, please excuse typos.
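Spelled out, the suggestion above is to set dfs.block.size (in bytes) on the job that rewrites the data; a sketch of the relevant driver lines, with the job name as a placeholder and the rest of the copy job elided:

// In the driver of the job that rewrites the data:
Configuration conf = new Configuration();
conf.setLong("dfs.block.size", 134217728L);  // 128 MB in bytes; applies to files this job writes
Job job = new Job(conf, "rewrite with 128MB blocks");
// ... mapper, input/output paths etc. stay as in the original copy job ...

For a plain copy, the same override can also be passed to distcp as a generic option (e.g. hadoop distcp -Ddfs.block.size=134217728 <src> <dst>), since block size is decided by the client writing the file.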
Re: change hdfs block size for file existing on HDFS
Hi Anurag, To add on, you can also change the replication of existing files with 'hadoop fs -setrep': http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#setrep On Tue, Jun 26, 2012 at 7:42 PM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Anurag, The easiest option would be to set dfs.block.size to 128 MB in your map reduce job. --Original Message-- From: Anurag Tangri To: hdfs-u...@hadoop.apache.org To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: change hdfs block size for file existing on HDFS Sent: Jun 26, 2012 11:07 Hi, We have a situation where all the files that we have are of 64 MB block size. I want to change these files (output of a map job mainly) to 128 MB blocks. What would be a good way to do this migration from 64 MB to 128 MB block files? Thanks, Anurag Tangri Regards Bejoy KS Sent from handheld, please excuse typos.
Re: Streaming in mapreduce
Hi Pedro In simple terms, the Streaming API is used in hadoop when your mapper or reducer is written in a language other than java, say ruby or python. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Pedro Costa psdc1...@gmail.com Date: Sat, 16 Jun 2012 10:23:20 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Re: Streaming in mapreduce I still don't get why hadoop streaming is useful. If I have map and reduce functions defined in shell scripts, like the ones below, why should I use Hadoop? cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile On 16/06/2012, at 01:21, Ruslan Al-Fakikh metarus...@gmail.com wrote: Hi Pedro, You can find it here http://wiki.apache.org/hadoop/HadoopStreaming Thanks On Sat, Jun 16, 2012 at 2:46 AM, Pedro Costa psdc1...@gmail.com wrote: Hi, Hadoop mapreduce can be used for streaming. But what is streaming from the point of view of mapreduce? For me, streaming means video and audio data. Why does mapreduce support streaming? Can anyone give me an example of why to use streaming in mapreduce? Thanks, Pedro
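For a concrete picture of what the Streaming jar buys over the plain shell pipeline (the cluster runs many copies of the scripts in parallel on HDFS splits, with the shuffle and sort between them), an invocation looks roughly like this; the input/output paths and script names are placeholders, and the exact jar path and file name vary by release (for a 1.0.x tarball it typically sits under contrib/streaming):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper shellMapper.sh \
    -reducer shellReducer.sh \
    -file shellMapper.sh \
    -file shellReducer.sh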
Re: Setting number of mappers according to number of TextInput lines
Hi Ondrej You can use NLineInputFormat with n set to 10. --Original Message-- From: Ondřej Klimpera To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: Setting number of mappers according to number of TextInput lines Sent: Jun 16, 2012 14:31 Hello, I have a very small input size (kB), but processing it to produce some output takes several minutes. Is there a way to say: the file has 100 lines, I need 10 mappers, where each mapper node has to process 10 lines of the input file? Thanks for advice. Ondrej Klimpera Regards Bejoy KS Sent from handheld, please excuse typos.
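With the old (mapred) API this is a two-line change in the driver; a sketch, assuming the rest of the job setup stays unchanged:

// In the driver (old mapred API):
conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
// each map task gets 10 lines of the input file
conf.setInt("mapred.line.input.format.linespermap", 10);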
Re: [Newbie] How to make Multi Node Cluster from Single Node Cluster
You can follow the documents for 0.20.x. It is almost the same for 1.0.x as well. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Alpha Bagus Sunggono bagusa...@gmail.com Date: Thu, 14 Jun 2012 17:15:16 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: [Newbie] How to make Multi Node Cluster from Single Node Cluster Hello ramon, as newbie as I am. 2012/6/14 ramon@accenture.com At Newbie level just the same. -Original Message- From: Alpha Bagus Sunggono [mailto:bagusa...@gmail.com] Sent: jueves, 14 de junio de 2012 12:01 To: common-user@hadoop.apache.org Subject: [Newbie] How to make Multi Node Cluster from Single Node Cluster Dear All. I've been configuring 3 servers using Hadoop 1.0.x, each as a single node. How do I assemble them into one multi node cluster? When I search for documentation, I've only found configuration for Hadoop 0.20.x. Would you mind assisting me? -- Alpha Bagus Sunggono, CBSP (Certified Brownies Solution Provider)
Re: Map/Reduce | Multiple node configuration
Hi Girish Let me try answering your queries. 1. For multiple nodes I understand I should add the URL of the secondary nodes in the slaves.xml. Am I correct? Bejoy: AFAIK you need to add it in /etc/hosts. 2. What should be installed on the secondary nodes for executing a job/task? Bejoy: In small clusters you have the NameNode and JobTracker on one node, the SecondaryNameNode on another node, and DataNodes and TaskTrackers on all other nodes. 3. I understand I can set the map/reduce classes as a jar to the Job - through the JobConf - so does this mean I need not really install/copy my map/reduce code on all the secondary nodes? Bejoy: There is no difference in submitting jobs as compared to a pseudo node set up. The MapReduce framework distributes the job jar and other required files. It is better to have a client node to launch jobs. 4. How do I route the data to these nodes? Is it required for the Map Reduce to execute on the machines which have the data stored (DFS)? Bejoy: The MR framework takes care of this. Map tasks consider data locality. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Girish Ravi giri...@srmtech.com Date: Tue, 12 Jun 2012 06:55:26 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Map/Reduce | Multiple node configuration Hello Team, I have started to understand Hadoop Mapreduce and was able to set up a single cluster, single node execution environment. I want to now extend this to a multi node environment. I have the following questions and it would be very helpful if somebody can help: 1. For multiple nodes I understand I should add the URL of the secondary nodes in the slaves.xml. Am I correct? 2. What should be installed on the secondary nodes for executing a job/task? 3. I understand I can set the map/reduce classes as a jar to the Job - through the JobConf - so does this mean I need not really install/copy my map/reduce code on all the secondary nodes? 4. How do I route the data to these nodes? Is it required for the Map Reduce to execute on the machines which have the data stored (DFS)? Any samples for doing this would help. Request for suggestions. Regards Girish Ph: +91-9916212114
Re: Need logical help
Hi Girish You can achieve this using reduce-side joins. Use MultipleInputs to parse the two different sets of log files with separate mappers. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Girish Ravi giri...@srmtech.com Date: Tue, 12 Jun 2012 12:59:32 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Need logical help Hi All, I am thinking of a condition where the data in two log files are to be compared; can I use Map-Reduce to do this? I have one log file (LOG1) which has user ID and dept ID, and another log file (LOG2) which has some rows with user ID and dept ID and other data. Can I compare the data where LOG1.userID = LOG2.userID and LOG1.deptID = LOG2.deptID? If so, any suggestion to implement the mapper for this? Regards Girish Ph: +91-9916212114
Re: Need logical help
To add on, have a look at hive and pig. Those are a perfect fit for similar use cases. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Bejoy KS bejoy.had...@gmail.com Date: Tue, 12 Jun 2012 13:04:33 To: mapreduce-user@hadoop.apache.org Reply-To: bejoy.had...@gmail.com Subject: Re: Need logical help Hi Girish You can achieve this using reduce-side joins. Use MultipleInputs to parse the two different sets of log files with separate mappers. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Girish Ravi giri...@srmtech.com Date: Tue, 12 Jun 2012 12:59:32 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Need logical help Hi All, I am thinking of a condition where the data in two log files are to be compared; can I use Map-Reduce to do this? I have one log file (LOG1) which has user ID and dept ID, and another log file (LOG2) which has some rows with user ID and dept ID and other data. Can I compare the data where LOG1.userID = LOG2.userID and LOG1.deptID = LOG2.deptID? If so, any suggestion to implement the mapper for this? Regards Girish Ph: +91-9916212114
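For anyone looking for the shape of the reduce-side join suggested earlier in this thread, here is a minimal sketch using the old (mapred) API. The comma-separated field layout (userID,deptID[,otherData]) and all class names are assumptions made for illustration, not taken from Girish's actual logs:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class LogJoin {

  // Mapper for LOG1 lines of the form "userID,deptID"
  public static class Log1Mapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> out, Reporter rep)
        throws IOException {
      String[] f = value.toString().split(",");
      out.collect(new Text(f[0] + "|" + f[1]), new Text("LOG1"));
    }
  }

  // Mapper for LOG2 lines of the form "userID,deptID,otherData"
  public static class Log2Mapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> out, Reporter rep)
        throws IOException {
      String[] f = value.toString().split(",", 3);
      out.collect(new Text(f[0] + "|" + f[1]), new Text("LOG2\t" + f[2]));
    }
  }

  // A (userID, deptID) key that appears in both logs is a match
  public static class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> out, Reporter rep)
        throws IOException {
      boolean inLog1 = false;
      List<String> log2Rows = new ArrayList<String>();
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("LOG1")) { inLog1 = true; } else { log2Rows.add(v); }
      }
      if (inLog1) {
        for (String row : log2Rows) { out.collect(key, new Text(row)); }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LogJoin.class);
    conf.setJobName("log join");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, Log1Mapper.class);
    MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, Log2Mapper.class);
    conf.setReducerClass(JoinReducer.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}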
Re: set the mapred.map.tasks.speculative.execution=false, but it is not useful.
Hi If your intention is to control the number of attempts each task makes, then the property to be tweaked is mapred.map.max.attempts. The default value is 4; for no map task re-attempts, set it to 1. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Jagat Singh jagatsi...@gmail.com Date: Tue, 12 Jun 2012 17:13:36 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: set the mapred.map.tasks.speculative.execution=false, but it is not useful. Besides speculative execution, tasks can be attempted multiple times due to failures. So you can see 3 attempts there. On Tue, Jun 12, 2012 at 5:08 PM, 林育智 mylinyu...@gmail.com wrote: hi all: I set mapred.map.tasks.speculative.execution=false, but in the userlogs you can find logs for 3 attempts of a map task. Have I missed something? Expect your help. thanks.
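As a quick sketch of where those knobs live when building the job configuration (Hadoop 1.x property names):

Configuration conf = new Configuration();
conf.setBoolean("mapred.map.tasks.speculative.execution", false); // no speculative duplicate attempts
conf.setInt("mapred.map.max.attempts", 1);                        // no re-attempts after a task failure

Keep in mind that with max attempts at 1, a single task failure fails the whole job unless the allowed failure percentage (mapred.max.map.failures.percent) is raised.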
Re: Getting filename in case of MultipleInputs
Hi Subbu, The file/split processed by a mapper could be obtained from WebUI as soon as the job is executed. However this detail can't be obtained once the job is moved to JT history. Regards Bejoy On Thu, May 3, 2012 at 6:25 PM, Kasi Subrahmanyam kasisubbu...@gmail.com wrote: Hi, Could anyone suggest how to get the filename in the mapper. I have gone through the JIRA ticket that map.input.file doesnt work in case of multiple inputs,TaggedInputSplit also doesnt work in case of 0.20.2 version as it is not a public class. I tried to find any other approach than this but i could find none in the search Could anyone suggest a solution other tan these Thanks in advance; Subbu.
Re: updating datanode config files on namenode recovery
Hi Sumadhur, The easier approach is to make the hostname of the new NN the same as the old one; otherwise you'll have to update the new one in the config files across the cluster. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: sumadhur sumadhur_i...@yahoo.com Date: Tue, 1 May 2012 16:16:14 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: updating datanode config files on namenode recovery When a name node goes down and I bring up another machine as the new name node using the backup on a shared folder, do I have to update the config files in each data node to point to the new name node and job tracker manually? Or is there some other way of doing it automatically? Thanks, Sumadhur
Re: reducers and data locality
Hi Mete A custom Partitioner class can control the flow of keys to the desired reducer. It gives you more control over which key goes to which reducer. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: mete efk...@gmail.com Date: Fri, 27 Apr 2012 09:19:21 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: reducers and data locality Hello folks, I have a lot of input splits (10k-50k - 128 mb blocks) which contain text files. I need to process those line by line, then copy the result into a number of roughly equal-sized shards. So I generate a random key (from a range of [0:numberOfShards]) which is used to route the map output to different reducers, and the sizes are more or less equal. I know that this is not really efficient and I was wondering if I could somehow control how keys are routed. For example, could I generate the random keys with hostname prefixes and control which keys are sent to each reducer? What do you think? Kind regards Mete
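A sketch of such a Partitioner with the new API; the key layout (a shard id prefix before a '|' separator) is an assumption made for illustration:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys of the form "shardId|rest" to the reduce partition matching shardId.
public class ShardPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int shardId = Integer.parseInt(key.toString().split("\\|")[0]);
        return shardId % numPartitions;
    }
}

The driver registers it with job.setPartitionerClass(ShardPartitioner.class). Note that this only decides which reduce partition a key lands in, not which physical host runs that reducer; reducer placement is left to the scheduler.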
Re: Reducer not firing
Hi Arko From the naming of the output files, your job has a reduce phase. But the reducer being used is the IdentityReducer instead of your custom reducer. That is the reason you are seeing the same map output in the output files as well. You need to evaluate your code and logs to see why the IdentityReducer is being triggered. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: kasi subrahmanyam kasisubbu...@gmail.com Date: Tue, 17 Apr 2012 19:10:33 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Re: Reducer not firing Could you comment the property where you are setting the number of reducer tasks and see the behaviour of the program once? If you have already tried that, could you share the output? On Tue, Apr 17, 2012 at 3:00 PM, Devaraj k devara...@huawei.com wrote: Can you check the task attempt logs in your cluster and find out what is happening in the reduce phase? By default task attempt logs are present in $HADOOP_LOG_DIR/userlogs/job-id/. There could be some bug in your reducer which is leading to this output. Thanks Devaraj From: Arko Provo Mukherjee [arkoprovomukher...@gmail.com] Sent: Tuesday, April 17, 2012 2:07 PM To: mapreduce-user@hadoop.apache.org Subject: Re: Reducer not firing Hello, Many thanks for the reply. The 'no_of_reduce_tasks' is set to 2. I have a print statement before the code I pasted below to check that. Also I can find two output files, part-r-0 and part-r-1. But they contain the values that have been output by the Mapper logic. Please let me know what I can check further. Thanks a lot in advance! Warm regards Arko On Tue, Apr 17, 2012 at 12:48 AM, Devaraj k devara...@huawei.com wrote: Hi Arko, What is the value of 'no_of_reduce_tasks'? If the number of reduce tasks is 0, then the map task will directly write the map output into the Job output path. Thanks Devaraj From: Arko Provo Mukherjee [arkoprovomukher...@gmail.com] Sent: Tuesday, April 17, 2012 10:32 AM To: mapreduce-user@hadoop.apache.org Subject: Reducer not firing Dear All, I am porting code from the old API to the new API (Context objects) and running on Hadoop 0.20.203.

Job job_first = new Job();
job_first.setJarByClass(My.class);
job_first.setNumReduceTasks(no_of_reduce_tasks);
job_first.setJobName("My_Job");
FileInputFormat.addInputPath(job_first, new Path(Input_Path));
FileOutputFormat.setOutputPath(job_first, new Path(Output_Path));
job_first.setMapperClass(Map_First.class);
job_first.setReducerClass(Reduce_First.class);
job_first.setMapOutputKeyClass(IntWritable.class);
job_first.setMapOutputValueClass(Text.class);
job_first.setOutputKeyClass(NullWritable.class);
job_first.setOutputValueClass(Text.class);
job_first.waitForCompletion(true);

The problem I am facing is that instead of emitting values to the reducers, the mappers are directly writing their output to the OutputPath and the reducers are not processing anything. As read from the online materials that are available, both my Map and Reduce methods use the context.write method to emit the values. Please help. Thanks a lot in advance!! Warm regards Arko
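For reference, one common cause when porting to the new API (not confirmed from Arko's code, which is not fully shown above) is a reduce() method whose signature does not exactly match the one the framework calls; for instance keeping the old API's Iterator parameter instead of Iterable compiles fine but is never invoked, so the identity reducer runs instead. A sketch of the expected shape, with key/value classes chosen to match the driver quoted above and @Override added so the compiler catches such mismatches:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce_First extends Reducer<IntWritable, Text, NullWritable, Text> {
    @Override   // fails to compile if the signature is wrong, instead of silently falling back to the identity reducer
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(NullWritable.get(), value);
        }
    }
}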
Re: map and reduce with different value classes
Hi Bryan You can set different key and value types with the following steps: ensure that the map output key/value types are the reducer input key/value types, and specify them in your Driver class as

//set map output key value types
job.setMapOutputKeyClass(theClass);
job.setMapOutputValueClass(theClass);
//set final/reduce output key value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

If both the map output and reduce output key/value types are the same, you just need to specify the final output types. Regards Bejoy KS On Tue, Apr 17, 2012 at 7:14 AM, Bryan Yeung brye...@gmail.com wrote: Hello Everyone, I'm relatively new to hadoop mapreduce and I'm trying to get this simple modification to the WordCount example to work. I'm using hadoop-1.0.2, and I've included both a convenient diff and also attached my new WordCount.java file. The thing I am trying to achieve is to have the value class that is output by the map phase be different than the value class output by the reduce phase. Any help would be greatly appreciated! Thanks, Bryan

diff --git a/WordCount.java.orig b/WordCount.java
index 81a6c21..6a768f7 100644
--- a/WordCount.java.orig
+++ b/WordCount.java
@@ -33,8 +33,8 @@ public class WordCount {
   }

   public static class IntSumReducer
-      extends Reducer<Text,IntWritable,Text,IntWritable> {
-    private IntWritable result = new IntWritable();
+      extends Reducer<Text,IntWritable,Text,Text> {
+    private Text result = new Text();

     public void reduce(Text key, Iterable<IntWritable> values,
                        Context context
@@ -43,7 +43,7 @@ public class WordCount {
       for (IntWritable val : values) {
         sum += val.get();
       }
-      result.set(sum);
+      result.set("" + sum);
       context.write(key, result);
     }
   }
@@ -58,10 +58,11 @@ public class WordCount {
     Job job = new Job(conf, "word count");
     job.setJarByClass(WordCount.class);
     job.setMapperClass(TokenizerMapper.class);
+    job.setMapOutputValueClass(IntWritable.class);
     job.setCombinerClass(IntSumReducer.class);
     job.setReducerClass(IntSumReducer.class);
     job.setOutputKeyClass(Text.class);
-    job.setOutputValueClass(IntWritable.class);
+    job.setOutputValueClass(Text.class);
     FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
     FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
     System.exit(job.waitForCompletion(true) ? 0 : 1);
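One detail worth flagging in the diff above, offered as an observation rather than a confirmed diagnosis of Bryan's failure: the job still registers IntSumReducer as the combiner, and a combiner must consume and produce the map output key/value types. Once the reducer emits Text values while the mapper emits IntWritable values, the same class can no longer serve both roles. A sketch of the relevant driver lines under that assumption:

// Map output and final output value classes now differ:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);   // what TokenizerMapper emits
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);             // what the modified IntSumReducer emits
// job.setCombinerClass(IntSumReducer.class);    // no longer type-safe after the change;
// either drop the combiner or give it its own Reducer<Text,IntWritable,Text,IntWritable> implementation.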