Re: Hadoop property precedence

2013-07-14 Thread Shekhar Sharma
Check out how writing happens on HDFS.

When the client issues the command "hadoop fs -put local_source
hdfs_destination", the client contacts the NameNode to say it wants to write,
and the NameNode creates a block ID and asks three DataNodes (if the
replication factor on the client side is set to 3) to host the replicas. This
information is sent back to the client, which then starts writing the data to
those DataNodes by forming a pipeline, and the client writes into each block
only as much data as the block size it has set.
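
For example, here is a rough Java sketch of a client-side write (the path,
host and sizes are just placeholders): the replication factor and block size
are taken from the client's Configuration unless they are overridden directly
as arguments to create(), in which case the API arguments win.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientSideWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads the client-side *-site.xml files
        conf.set("dfs.replication", "3");           // client-side default replication
        conf.setLong("dfs.block.size", 64L * 1024 * 1024);  // client-side default block size

        FileSystem fs = FileSystem.get(conf);

        // API arguments override whatever the client config says:
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/user/hduser/sample.txt"),
                true, 4096, (short) 2, 128L * 1024 * 1024);
        out.write("hello hdfs".getBytes("UTF-8"));
        out.close();
        fs.close();
    }
}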


Regards,
Som Shekhar Sharma
+91-8197243810


On Sun, Jul 14, 2013 at 4:24 PM, Harsh J  wrote:

> Replication, block size, etc. are all per-file and pure client
> supplied properties. They either take their default from the client
> config, or directly from an API argument override.
>
> On Sun, Jul 14, 2013 at 4:14 PM, varun kumar  wrote:
> > What Shumin said is correct: the Hadoop configuration is overridden by the
> > client application.
> >
> > We have faced a similar type of issue, where the default replication factor
> > was set to 2 in the Hadoop configuration. But whenever the client
> > application wrote a file, it had 3 copies in the Hadoop cluster. Later, on
> > checking the client application, we found its default replication factor was 3.
> >
> >
> > On Sun, Jul 14, 2013 at 4:51 AM, Shumin Guo  wrote:
> >>
> >> I Think the client side configuration will take effect.
> >>
> >> Shumin
> >>
> >> On Jul 12, 2013 11:50 AM, "Shalish VJ"  wrote:
> >>>
> >>> Hi,
> >>>
> >>>
> >>> Suppose the block size set in the configuration file on the client side
> >>> is 64MB, the block size set in the configuration file on the NameNode side
> >>> is 128MB, and the block size set in the configuration file on the DataNode
> >>> side is something else.
> >>> Please advise: if the client is writing a file to HDFS, which property
> >>> would take effect?
> >>>
> >>> Thanks,
> >>> Shalish.
> >
> >
> >
> >
> > --
> > Regards,
> > Varun Kumar.P
>
>
>
> --
> Harsh J
>


Map slots and Reduce slots

2013-07-14 Thread Shekhar Sharma
Do the properties mapred.map.max.tasks=3 and mapred.reduce.max.tasks=4
mean that the machine has 3 map slots and 4 reduce slots?

Or is there any way I can determine the number of map slots and reduce
slots that I can allocate for a machine?

Let's say I have a machine with 8GB RAM and a dual-core CPU; how can I
determine the optimal number of map and reduce slots for this machine?




Regards,
Som Shekhar Sharma
+91-8197243810


Re: Map slots and Reduce slots

2013-07-14 Thread Shekhar Sharma
Sorry for the wrong property names; I meant the same ones.
I understand what the properties do. Can I add slots at run time to a
particular task tracker depending on the load? As you suggested, we can
determine the slots based on the load, and since the load can be dynamic,
can I dynamically size the task tracker based on, say, the availability of
resources on the task tracker machine?

Regards,
Som Shekhar Sharma
+91-8197243810


On Mon, Jul 15, 2013 at 7:27 AM, Devaraj k  wrote:

>  Hi Shekar,
>
> I assume you are trying with Hadoop-1. There are no properties with the
> names 'mapred.map.max.tasks' and 'mapred.reduce.max.tasks'.
>
> We have these configurations to control the maximum number of map/reduce
> tasks run simultaneously:
>
> mapred.tasktracker.map.tasks.maximum - The maximum number of map tasks
> that will be run simultaneously by a task tracker.
>
> mapred.tasktracker.reduce.tasks.maximum - The maximum number of reduce
> tasks that will be run simultaneously by a task tracker.
>
> For example: if we declare mapred.tasktracker.map.tasks.maximum=3 and
> mapred.tasktracker.reduce.tasks.maximum=4 for a task tracker, it means the
> TT has 3 map slots and 4 reduce slots.
>
> > Let's say I have a machine with 8GB RAM and a dual-core CPU; how can I
> > determine the optimal number of map and reduce slots for this machine?
>
> It purely depends on which type of tasks you are going to run and the load
> of the tasks. Normally each task requires one core to execute, so the number
> of concurrent tasks can be configured based on this. And the memory required
> for a task depends on how much data it is going to process.
>
> Thanks
>
> Devaraj k
>
> *From:* Shekhar Sharma [mailto:shekhar2...@gmail.com]
> *Sent:* 14 July 2013 23:15
> *To:* user@hadoop.apache.org
> *Subject:* Map slots and Reduce slots
>
>
> Do the properties mapred.map.max.tasks=3 and mapred.reduce.max.tasks=4
> mean that the machine has 3 map slots and 4 reduce slots?
>
> Or is there any way I can determine the number of map slots and reduce
> slots that I can allocate for a machine?
>
> Let's say I have a machine with 8GB RAM and a dual-core CPU; how can I
> determine the optimal number of map and reduce slots for this machine?
>
> Regards,
>
> Som Shekhar Sharma
>
> +91-8197243810
>


Re: New hadoop 1.2 single node installation giving problems

2013-07-23 Thread Shekhar Sharma
It's a warning, not an error.

Create a directory and then do ls. (In your case /user/hduser is not
created until the first time you create a directory or put a file there.)

hadoop fs  -mkdir sample

hadoop fs  -ls

If you are getting a permission problem, please check the following:

(1) Have you run the command "hadoop namenode -format" as one user while
you are accessing HDFS as a different user?

On Tue, Jul 23, 2013 at 10:10 PM,  wrote:

> Hi Ashish
>
> In your hdfs-site.xml, within the <configuration> tag you need to have the
> <property> tag, and inside a <property> tag you can have <name>, <value> and
> <description> tags.
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Ashish Umrani 
> *Date: *Tue, 23 Jul 2013 09:28:00 -0700
> *To: *
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Re: New hadoop 1.2 single node installation giving problems
>
> Hey thanks for response.  I have changed 4 files during installation
>
> core-site.xml
> mapred-site.xml
> hdfs-site.xml   and
> hadoop-env.sh
>
>
> I could not find any issues except that all params in the hadoop-env.sh
> are commented out.  Only JAVA_HOME is uncommented.
>
> If you have a quick minute can you please browse through these files in
> email and let me know where could be the issue.
>
> Regards
> ashish
>
>
>
> I am listing those files below.
> *core-site.xml*
> <?xml version="1.0"?>
>
> <configuration>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/app/hadoop/tmp</value>
>     <description>A base for other temporary directories.</description>
>   </property>
>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://localhost:54310</value>
>     <description>The name of the default file system.  A URI whose
>     scheme and authority determine the FileSystem implementation.  The
>     uri's scheme determines the config property (fs.SCHEME.impl) naming
>     the FileSystem implementation class.  The uri's authority is used to
>     determine the host, port, etc. for a filesystem.</description>
>   </property>
> </configuration>
>
>
>
> *mapred-site.xml*
> <?xml version="1.0"?>
>
> <configuration>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>localhost:54311</value>
>     <description>The host and port that the MapReduce job tracker runs
>     at.  If "local", then jobs are run in-process as a single map
>     and reduce task.
>     </description>
>   </property>
> </configuration>
>
>
>
> *hdfs-site.xml*
> <?xml version="1.0"?>
>
> <configuration>
>   <name>dfs.replication</name>
>   <value>1</value>
>   <description>Default block replication.
> The actual number of replications can be specified when the file is
> created.
> The default is used if replication is not specified in create time.
>   </description>
> </configuration>
>
>
>
> *hadoop-env.sh*
> # Set Hadoop-specific environment variables here.
>
> # The only required environment variable is JAVA_HOME.  All others are
> # optional.  When running a distributed configuration it is best to
> # set JAVA_HOME in this file, so that it is correctly defined on
> # remote nodes.
>
> # The java implementation to use.  Required.
> export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_25
>
> # Extra Java CLASSPATH elements.  Optional.
> # export HADOOP_CLASSPATH=
>
>
> All other params in hadoop-env.sh are commented
>
>
>
>
>
>
>
>
> On Tue, Jul 23, 2013 at 8:38 AM, Jitendra Yadav <
> jeetuyadav200...@gmail.com> wrote:
>
>> Hi,
>>
>> You might have missed some configuration (XML tags). Please check all
>> the conf files.
>>
>> Thanks
>> On Tue, Jul 23, 2013 at 6:25 PM, Ashish Umrani 
>> wrote:
>>
>>> Hi There,
>>>
>>> First of all, sorry if I am asking a stupid question.  Being new to the
>>> Hadoop environment, I am finding it a bit difficult to figure out
>>> why it's failing.
>>>
>>> I have installed hadoop 1.2, based on instructions given in the
>>> folllowing link
>>>
>>> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
>>>
>>> All went well and I could do the start-all.sh and the jps command does
>>> show all 5 process to be present.
>>>
>>> However when I try to do
>>>
>>> hadoop fs -ls
>>>
>>> I get the following error
>>>
>>>  hduser@ashish-HP-Pavilion-dv6-Notebook-PC:/usr/local/hadoop/conf$
>>> hadoop fs -ls
>>> Warning: $HADOOP_HOME is deprecated.
>>>
>>> 13/07/23 05:55:06 WARN conf.Configuration: bad conf file: element not
>>> <property>
>>> 13/07/23 05:55:06 WARN conf.Configuration: bad conf file: element not
>>> <property>
>>> 13/07/23 05:55:06 WARN conf.Configuration: bad conf file: element not
>>> <property>
>>> 13/07/23 05:55:06 WARN conf.Configuration: bad conf file: element not
>>> <property>
>>> 13/07/23 05:55:06 WARN conf.Configuration: bad conf file: element not
>>> <property>
>>> 13/07/23 05:55:06 WARN conf.Configuration: bad conf file: element not
>>> <property>
>>> ls: Cannot access .: No such file or directory.
>>> hduser@ashish-HP-Pavilion-dv6-Notebook-PC:/usr/local/hadoop/conf$
>>>
>>>
>>>
>>> Can someone help me figure out whats the issue in my installation
>>>
>>>
>>> Regards
>>> ashish
>>>
>>
>>
>


Re: New hadoop 1.2 single node installation giving problems

2013-07-23 Thread Shekhar Sharma
After starting the cluster, I would suggest always checking whether your
NameNode and JobTracker web UIs are working, and checking the number of live
nodes in both UIs.
Regards,
Som Shekhar Sharma
+91-8197243810


On Tue, Jul 23, 2013 at 10:41 PM, Ashish Umrani wrote:

> Thanks,
>
> But the issue was that there was no directory and hence it was not showing
> anything.  Adding a directory cleared the warning.
>
> I appreciate your help.
>
> Regards
> ashish
>
>
> On Tue, Jul 23, 2013 at 10:08 AM, Mohammad Tariq wrote:
>
>> Hello Ashish,
>>
>> Change the permissions of /app/hadoop/tmp to 755 and see if it helps.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Tue, Jul 23, 2013 at 10:27 PM, Ashish Umrani 
>> wrote:
>>
>>> Thanks Jitendra, Bejoy and Yexi,
>>>
>>> I got past that.  And now the ls command says it can not access the
>>> directory.  I am sure this is a permissions issue.  I am just wondering
>>> which directory and I missing permissions on.
>>>
>>> Any pointers?
>>>
>>> And once again, thanks a lot
>>>
>>> Regards
>>> ashish
>>>
>>> *hduser@ashish-HP-Pavilion-dv6-Notebook-PC:/usr/local/hadoop/conf$
>>> hadoop fs -ls*
>>> *Warning: $HADOOP_HOME is deprecated.*
>>> *
>>> *
>>> *ls: Cannot access .: No such file or directory.*
>>>
>>>
>>>
>>> On Tue, Jul 23, 2013 at 9:42 AM, Jitendra Yadav <
>>> jeetuyadav200...@gmail.com> wrote:
>>>
>>>> Hi Ashish,
>>>>
>>>> Please check the <property> tag in hdfs-site.xml.
>>>>
>>>> It is missing.
>>>>
>>>> Thanks.
>>>> On Tue, Jul 23, 2013 at 9:58 PM, Ashish Umrani >>> > wrote:
>>>>
>>>>> Hey thanks for response.  I have changed 4 files during installation
>>>>>
>>>>> core-site.xml
>>>>> mapred-site.xml
>>>>> hdfs-site.xml   and
>>>>> hadoop-env.sh
>>>>>
>>>>>
>>>>> I could not find any issues except that all params in the
>>>>> hadoop-env.sh are commented out.  Only java_home is un commented.
>>>>>
>>>>> If you have a quick minute can you please browse through these files
>>>>> in email and let me know where could be the issue.
>>>>>
>>>>> Regards
>>>>> ashish
>>>>>
>>>>>
>>>>>
>>>>> I am listing those files below.
>>>>>  *core-site.xml *
>>>>>  
>>>>> 
>>>>>
>>>>> 
>>>>>
>>>>> 
>>>>>   
>>>>> hadoop.tmp.dir
>>>>> /app/hadoop/tmp
>>>>> A base for other temporary directories.
>>>>>   
>>>>>
>>>>>   
>>>>> fs.default.name
>>>>> hdfs://localhost:54310
>>>>> The name of the default file system.  A URI whose
>>>>> scheme and authority determine the FileSystem implementation.  The
>>>>> uri's scheme determines the config property (fs.SCHEME.impl) naming
>>>>> the FileSystem implementation class.  The uri's authority is used
>>>>> to
>>>>> determine the host, port, etc. for a filesystem.
>>>>>   
>>>>> 
>>>>>
>>>>>
>>>>>
>>>>> *mapred-site.xml*
>>>>>  
>>>>> 
>>>>>
>>>>> 
>>>>>
>>>>> 
>>>>>   
>>>>> mapred.job.tracker
>>>>> localhost:54311
>>>>> The host and port that the MapReduce job tracker runs
>>>>> at.  If "local", then jobs are run in-process as a single map
>>>>> and reduce task.
>>>>> 
>>>>>   
>>>>> 
>>>>>
>>>>>
>>>>>
>>>>> *hdfs-site.xml   and*
>>>>>  
>>>>> 
>>>>>
>>>>> 
>>>>>
>>>>> 
>>>>>   dfs.replication
>>>>>   1
>>>>>   Default block replication.
>>>>> The actual number of replications can be specified when the file
>>>>> is created.
>>>>> The defau

Re: New hadoop 1.2 single node installation giving problems

2013-07-23 Thread Shekhar Sharma
If the jar's manifest has no Main-Class entry, pass the fully qualified main
class explicitly:

hadoop jar wc.jar <fully.qualified.MainClass> inputdata outputdestination


Regards,
Som Shekhar Sharma
+91-8197243810


On Tue, Jul 23, 2013 at 10:58 PM, Ashish Umrani wrote:

> Jitendra, Som,
>
> Thanks.  Issue was in not having any file there.  Its working fine now.
>
> I am able to do -ls and could also do -mkdir and -put.
>
> Now is time to run the jar and apparently I am getting
>
> no main manifest attribute, in wc.jar
>
>
> But I believe its because of maven pom file does not have the main class
> entry.
>
> Which I go ahead and change the pom file and build it again, please let me
> know if you guys think of some other reason.
>
> Once again this user group rocks.  I have never seen this quick a response.
>
> Regards
> ashish
>
>
> On Tue, Jul 23, 2013 at 10:21 AM, Jitendra Yadav <
> jeetuyadav200...@gmail.com> wrote:
>
>> Try..
>>
>> *hadoop fs -ls /*
>>
>> Thanks
>>
>>
>> On Tue, Jul 23, 2013 at 10:27 PM, Ashish Umrani 
>> wrote:
>>
>>> Thanks Jitendra, Bejoy and Yexi,
>>>
>>> I got past that.  And now the ls command says it can not access the
>>> directory.  I am sure this is a permissions issue.  I am just wondering
>>> which directory and I missing permissions on.
>>>
>>> Any pointers?
>>>
>>> And once again, thanks a lot
>>>
>>> Regards
>>> ashish
>>>
>>>  *hduser@ashish-HP-Pavilion-dv6-Notebook-PC:/usr/local/hadoop/conf$
>>> hadoop fs -ls*
>>> *Warning: $HADOOP_HOME is deprecated.*
>>> *
>>> *
>>> *ls: Cannot access .: No such file or directory.*
>>>
>>>
>>>
>>> On Tue, Jul 23, 2013 at 9:42 AM, Jitendra Yadav <
>>> jeetuyadav200...@gmail.com> wrote:
>>>
>>>> Hi Ashish,
>>>>
>>>> Please check the <property> tag in hdfs-site.xml.
>>>>
>>>> It is missing.
>>>>
>>>> Thanks.
>>>> On Tue, Jul 23, 2013 at 9:58 PM, Ashish Umrani >>> > wrote:
>>>>
>>>>> Hey thanks for response.  I have changed 4 files during installation
>>>>>
>>>>> core-site.xml
>>>>> mapred-site.xml
>>>>> hdfs-site.xml   and
>>>>> hadoop-env.sh
>>>>>
>>>>>
>>>>> I could not find any issues except that all params in the
>>>>> hadoop-env.sh are commented out.  Only java_home is un commented.
>>>>>
>>>>> If you have a quick minute can you please browse through these files
>>>>> in email and let me know where could be the issue.
>>>>>
>>>>> Regards
>>>>> ashish
>>>>>
>>>>>
>>>>>
>>>>> I am listing those files below.
>>>>>  *core-site.xml *
>>>>>  
>>>>> 
>>>>>
>>>>> 
>>>>>
>>>>> 
>>>>>   
>>>>> hadoop.tmp.dir
>>>>> /app/hadoop/tmp
>>>>> A base for other temporary directories.
>>>>>   
>>>>>
>>>>>   
>>>>> fs.default.name
>>>>> hdfs://localhost:54310
>>>>> The name of the default file system.  A URI whose
>>>>> scheme and authority determine the FileSystem implementation.  The
>>>>> uri's scheme determines the config property (fs.SCHEME.impl) naming
>>>>> the FileSystem implementation class.  The uri's authority is used
>>>>> to
>>>>> determine the host, port, etc. for a filesystem.
>>>>>   
>>>>> 
>>>>>
>>>>>
>>>>>
>>>>> *mapred-site.xml*
>>>>>  
>>>>> 
>>>>>
>>>>> 
>>>>>
>>>>> 
>>>>>   
>>>>> mapred.job.tracker
>>>>> localhost:54311
>>>>> The host and port that the MapReduce job tracker runs
>>>>> at.  If "local", then jobs are run in-process as a single map
>>>>> and reduce task.
>>>>> 
>>>>>   
>>>>> 
>>>>>
>>>>>
>>>>>
>>>>> *hdfs-site.xml   and*
>>>>>  
>>>>> 
>>>>>
>>>>> 
>>>>>
>>>>> 
>>>>>   dfs.r

Re: Is there any way to use a hdfs file as a Circular buffer?

2013-08-07 Thread Shekhar Sharma
Use a CEP tool like Esper or Storm and you will be able to achieve that.
I can give you more input if you provide more details of what you are trying
to achieve.
Regards,
Som Shekhar Sharma
+91-8197243810


On Wed, Aug 7, 2013 at 9:58 PM, Wukang Lin  wrote:

> Hi Niels and Bertrand,
> Thank you for your great advice.
> In our scenario, we need to store a steady stream of binary data into
> circular storage; throughput and concurrency are the most important
> indicators. The first way seems workable, but as HDFS is not friendly to small
> files, this approach may not be smooth enough. HBase is good, but not
> appropriate for us, both for throughput and for storage. MongoDB is quite good
> for web applications, but likewise not suitable for the scenario we face.
> We need a distributed storage system with high throughput, HA, LB and
> security. Maybe it would act much like HBase, which manages a lot of small
> files (HFiles) as a large region; we would manage a lot of small files as a
> large one. Perhaps we should develop it ourselves.
>
> Thank you.
> Lin Wukang
>
>
> 2013/7/25 Niels Basjes 
>
>> A circular file on hdfs is not possible.
>>
>> Some of the ways around this limitation:
>> - Create a series of files and delete the oldest file when you have too
>> much.
>> - Put the data into an hbase table and do something similar.
>> - Use completely different technology like mongodb which has built in
>> support for a circular buffer (capped collection).
>>
>> Niels
>>
>> Hi all,
>>    Is there any way to use an HDFS file as a circular buffer? I mean, if I
>> set a quota on a directory on HDFS and write data to a file in that
>> directory continuously, once the quota is exceeded, can I redirect the writer
>> and write the data from the beginning of the file automatically?
>>
>>
>


Re: Datanode doesn't connect to Namenode

2013-08-07 Thread Shekhar Sharma
Disable the firewall on data node and namenode machines..
Regards,
Som Shekhar Sharma
+91-8197243810


On Wed, Aug 7, 2013 at 11:33 PM, Jitendra Yadav
wrote:

> Your hdfs name entry should be the same on the master and datanodes
>
> * fs.default.name*
> *hdfs://cloud6:54310*
>
> Thanks
> On Wed, Aug 7, 2013 at 11:05 PM, Felipe Gutierrez <
> felipe.o.gutier...@gmail.com> wrote:
>
>> on my slave the process is running:
>> hduser@cloud15:/usr/local/hadoop$ jps
>> 19025 DataNode
>> 19092 Jps
>>
>>
>> On Wed, Aug 7, 2013 at 2:26 PM, Jitendra Yadav <
>> jeetuyadav200...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Your logs show that the process is making an IPC call not to the
>>> namenode; it is hitting the datanode itself.
>>>
>>> Could you please check your datanode process status?
>>>
>>> Regards
>>> Jitendra
>>>
>>> On Wed, Aug 7, 2013 at 10:29 PM, Felipe Gutierrez <
>>> felipe.o.gutier...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> On my slave machine (cloud15) the datanode shows this log. It doesn't
>>>> connect to the master (cloud6).
>>>>
>>>>  2013-08-07 13:44:03,110 INFO org.apache.hadoop.ipc.Client: Retrying
>>>> connect to server: cloud15/192.168.188.15:54310. Already tried 9
>>>> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
>>>> sleepTime=1 SECONDS)
>>>> 2013-08-07 13:44:03,110 INFO org.apache.hadoop.ipc.RPC: Server at
>>>> cloud15/192.168.188.15:54310 not available yet, Z...
>>>>
>>>> But when I type jps command on slave machine DataNode is running. This
>>>> is my file core-site.xml in slave machine (cloud15):
>>>>  
>>>> 
>>>>   hadoop.tmp.dir
>>>>   /app/hadoop/tmp
>>>>   A base for other temporary directories.
>>>> 
>>>>
>>>> 
>>>>   fs.default.name
>>>>   hdfs://cloud15:54310
>>>>   The name of the default file system.  A URI whose
>>>>   scheme and authority determine the FileSystem implementation.  The
>>>>   uri's scheme determines the config property (fs.SCHEME.impl) naming
>>>>   the FileSystem implementation class.  The uri's authority is used to
>>>>   determine the host, port, etc. for a filesystem.
>>>> 
>>>> 
>>>>
>>>> On the master machine I just swap cloud15 for cloud6.
>>>> In the file /etc/hosts I have (192.168.188.15  cloud15) and
>>>> (192.168.188.6   cloud6) lines, and both machines can reach each other
>>>> through ssh without a password.
>>>>
>>>> Am I missing anything?
>>>>
>>>> Thanks in advance!
>>>> Felipe
>>>>
>>>>
>>>> --
>>>> *--
>>>> -- Felipe Oliveira Gutierrez
>>>> -- felipe.o.gutier...@gmail.com
>>>> -- https://sites.google.com/site/lipe82/Home/diaadia*
>>>>
>>>
>>>
>>
>>
>> --
>> *--
>> -- Felipe Oliveira Gutierrez
>> -- felipe.o.gutier...@gmail.com
>> -- https://sites.google.com/site/lipe82/Home/diaadia*
>>
>
>


Re: specify Mapred tasks and slots

2013-08-07 Thread Shekhar Sharma
Use mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml.

The default value is 2, which means that this task tracker will not run more
than 2 reduce tasks at any given point in time.


Regards,
Som Shekhar Sharma
+91-8197243810


On Thu, Aug 8, 2013 at 7:19 AM, Azuryy Yu  wrote:

> Hi Dears,
>
> Can I specify how many slots to use for reduce?
>
> I know we can specify the number of reduce tasks, but does one task occupy one slot?
>
> Is it possible that one task occupies more than one slot in Hadoop-1.1.2?
>
> Thanks.
>


Re: specify Mapred tasks and slots

2013-08-07 Thread Shekhar Sharma
Slots are decided based on the configuration of the machine: number of cores, RAM, etc.

Regards,
Som Shekhar Sharma
+91-8197243810


On Thu, Aug 8, 2013 at 7:19 AM, Azuryy Yu  wrote:

> Hi Dears,
>
> Can I specify how many slots to use for reduce?
>
> I know we can specify the number of reduce tasks, but does one task occupy one slot?
>
> Is it possible that one task occupies more than one slot in Hadoop-1.1.2?
>
> Thanks.
>


Re: Datanode doesn't connect to Namenode

2013-08-08 Thread Shekhar Sharma
If you have removed this property from the slave machines, then your DN
information will be created under the /tmp folder, and once you reboot your
data node machines, the information will be lost.

Sorry, I had not seen the logs, but you don't have to play around with the
properties. The datanode will not come up in a scenario where it is not able
to send the heartbeat signal to the namenode at port 54310.




Do this step by step:

Check whether you can ping every machine and whether you can SSH in a
passwordless manner.

Let's say I have one master machine whose hostname is *Master* and two
slave machines, *Slave0* and *Slave1* (I am assuming the OS used is CentOS).

In the *Master machine* do the following things:

*First disable the firewall* by running the following commands as the root
user:
service iptables save
service iptables stop
chkconfig iptables off

Specify the following properties in the corresponding files:

*mapred-site.xml*

   - mapred.job.tracker (Master:54311)

*core-site.xml*

   - fs.default.name (hdfs://Master:54310)
   - hadoop.tmp.dir (choose some persistent directory)

*hdfs-site.xml*

   - dfs.replication (3)
   - dfs.block.size(64MB)

*Masters file*

   - Master

*Slaves file*

   - Slave0
   - Slave1

*hadoop-env.sh*

   - export JAVA_HOME=


In the *Slave0 machine*:

   - Disable the firewall
   - Set the same properties as you did in the Master machine

In the *Slave1 machine*:

   - Disable the firewall
   - Set the same properties as you did in the Master machine


Once you start the cluster by running the command start-all.sh, check that
ports 54310 and 54311 have been opened by running the command "netstat
-tuplen"; it will show whether the ports are open or not.

Regards,
Som Shekhar Sharma
+91-8197243810


On Thu, Aug 8, 2013 at 4:57 PM, Felipe Gutierrez <
felipe.o.gutier...@gmail.com> wrote:

> Thanks,
> at all files I changed to master (cloud6) and I take off this property
> hadoop.tmp.dir.
>
> Felipe
>
>
> On Wed, Aug 7, 2013 at 3:20 PM, Shekhar Sharma wrote:
>
>> Disable the firewall on data node and namenode machines..
>> Regards,
>> Som Shekhar Sharma
>> +91-8197243810
>>
>>
>> On Wed, Aug 7, 2013 at 11:33 PM, Jitendra Yadav <
>> jeetuyadav200...@gmail.com> wrote:
>>
>>> Your hdfs name entry should be same on master and databnodes
>>>
>>> * fs.default.name*
>>> *hdfs://cloud6:54310*
>>>
>>> Thanks
>>> On Wed, Aug 7, 2013 at 11:05 PM, Felipe Gutierrez <
>>> felipe.o.gutier...@gmail.com> wrote:
>>>
>>>> on my slave the process is running:
>>>> hduser@cloud15:/usr/local/hadoop$ jps
>>>> 19025 DataNode
>>>> 19092 Jps
>>>>
>>>>
>>>> On Wed, Aug 7, 2013 at 2:26 PM, Jitendra Yadav <
>>>> jeetuyadav200...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Your logs showing that the process is creating IPC call not for
>>>>> namenode, it is hitting datanode itself.
>>>>>
>>>>> Check you please check you datanode processes status?.
>>>>>
>>>>> Regards
>>>>> Jitendra
>>>>>
>>>>> On Wed, Aug 7, 2013 at 10:29 PM, Felipe Gutierrez <
>>>>> felipe.o.gutier...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> My slave machine (cloud15) the datanode shows this log. It doesn't
>>>>>> connect to the master (cloud6).
>>>>>>
>>>>>>  2013-08-07 13:44:03,110 INFO org.apache.hadoop.ipc.Client: Retrying
>>>>>> connect to server: cloud15/192.168.188.15:54310. Already tried 9
>>>>>> time(s); retry policy is 
>>>>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
>>>>>> sleepTime=1 SECONDS)
>>>>>> 2013-08-07 13:44:03,110 INFO org.apache.hadoop.ipc.RPC: Server at
>>>>>> cloud15/192.168.188.15:54310 not available yet, Z...
>>>>>>
>>>>>> But when I type jps command on slave machine DataNode is running.
>>>>>> This is my file core-site.xml in slave machine (cloud15):
>>>>>>  
>>>>>> 
>>>>>>   hadoop.tmp.dir
>>>>>>   /app/hadoop/tmp
>>>>>>   A base for other temporary directories.
>>>>>> 
>>>>>>
>>>>>> 
>>>>>>   fs.default.name
>>>>>>   hdfs://cloud15:54310
>>>>>>   The name of the default file system.  A URI whose
>>>>>>   scheme and author

Re: Datanode doesn't connect to Namenode

2013-08-08 Thread Shekhar Sharma
Keep the configuration the same on the datanodes as well for the time being.
The only thing that a datanode or slave machine should know is the Masters
file (that is, who the master is). You also need to tell the slave machine
where your namenode is running, which you specify in the property
fs.default.name, and where your job tracker is running, which you specify in
the property mapred.job.tracker.

Hope you are able to bring up your cluster now.

If you still face issues, you can follow my blog:
http://ksssblogs.blogspot.in/2013/07/multi-node-hadoop-cluster-set-using-vms.html
Regards,
Som Shekhar Sharma
+91-8197243810


On Thu, Aug 8, 2013 at 5:21 PM, Shekhar Sharma wrote:

> if you have removed this property from the slave machines then your DN
> information will be created under /tmp folder and once you reboot your data
> node machines, the information will be lost..
>
> Sorry i have not seen the logs..but you dont have play around the
> properties..
> ...see datanode will not come up in scenario, where it is not able to send
> the heart beat signal to the name node at port 54310
>
>
>
>
> Do step by step :
>
> Check whether you can ping every machine and you can do SSH in password
> less manner
>
> Lets say i have one master machine whose hostname is *Master* and i have
> two slave machines *Slave0* and *Slave1* ( i am assuming the OS used are
> CentOS)
>
> In *master Machine* do the following things:
>
> *First disable the firewall by running the following command:
> *as a root user run the following commands
> service iptables save
> service iptables stop
> chkconfig iptables off
>
> specify the following properties in the corresponding files
>
> *mapred-site.xml*
>
>- mapred.job.tracker (Master:54311)
>
> *core-site.xml*
>
>- fs.default.name (hdfs://Master:54310)
>- hadoop.tmp.dir (choose some persistent directory)
>
> *hdfs-site.xml*
>
>- dfs.replication (3)
>- dfs.block.size(64MB)
>
> *Masters file*
>
>- Master
>
> *Slaves file*
>
>- Slave0
>- Slave1
>
> *hadoop-env.sh*
>
>- export JAVA_HOME=
>
>
> In *slave0 machine
> *
>
>- Disable the firewall
>- Same properties as you did in Masters machine
>
> In *slave 1 machine*
>
>- Disable the firewall
>- Same properties as you did in Master machine
>
>
> Once you start the cluster by running the command start-all.sh, check the
> ports 54310 and 54311 got opened by running the command "netstat
> -tuplen"..it will show whether ports are opened or not
>
> Regards,
> Som Shekhar Sharma
> +91-8197243810
>
>
> On Thu, Aug 8, 2013 at 4:57 PM, Felipe Gutierrez <
> felipe.o.gutier...@gmail.com> wrote:
>
>> Thanks,
>> at all files I changed to master (cloud6) and I take off this property
>> hadoop.tmp.dir.
>>
>> Felipe
>>
>>
>> On Wed, Aug 7, 2013 at 3:20 PM, Shekhar Sharma wrote:
>>
>>> Disable the firewall on data node and namenode machines..
>>> Regards,
>>> Som Shekhar Sharma
>>> +91-8197243810
>>>
>>>
>>> On Wed, Aug 7, 2013 at 11:33 PM, Jitendra Yadav <
>>> jeetuyadav200...@gmail.com> wrote:
>>>
>>>> Your hdfs name entry should be same on master and databnodes
>>>>
>>>> * fs.default.name*
>>>> *hdfs://cloud6:54310*
>>>>
>>>> Thanks
>>>> On Wed, Aug 7, 2013 at 11:05 PM, Felipe Gutierrez <
>>>> felipe.o.gutier...@gmail.com> wrote:
>>>>
>>>>> on my slave the process is running:
>>>>> hduser@cloud15:/usr/local/hadoop$ jps
>>>>> 19025 DataNode
>>>>> 19092 Jps
>>>>>
>>>>>
>>>>> On Wed, Aug 7, 2013 at 2:26 PM, Jitendra Yadav <
>>>>> jeetuyadav200...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Your logs showing that the process is creating IPC call not for
>>>>>> namenode, it is hitting datanode itself.
>>>>>>
>>>>>> Check you please check you datanode processes status?.
>>>>>>
>>>>>> Regards
>>>>>> Jitendra
>>>>>>
>>>>>> On Wed, Aug 7, 2013 at 10:29 PM, Felipe Gutierrez <
>>>>>> felipe.o.gutier...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> My slave machine (cloud15) the datanode shows this log. It does

Re: hadoop debugging tools

2013-08-27 Thread Shekhar Sharma
You can get the stats for a job using rumen.
http://ksssblogs.blogspot.in/2013/06/getting-job-statistics-using-rumen.html

Regards,
Som Shekhar Sharma
+91-8197243810


On Tue, Aug 27, 2013 at 10:54 AM, Gopi Krishna M  wrote:

> Harsh: thanks for the quick response.
>
> we often see an error response such as "Failed(Query returned non-zero
> code: 2, cause: FAILED: Execution Error, return code 2 from
> org.apache.hadoop.hive.ql.exec.MapRedTask)" and then go through all the
> logs to figure out what happened.  I use the jobtracker UI to go to the
> error logs and see what happened.
>
> I was thinking a log parsing tool with a good UI to go through the
> distributed-logs and help you find errors,  get stats on similar effors in
> prev runs etc will be useful.  HADOOP-9861 might help in getting good
> info, but might be still not very easy for quick debugging.
>
> Has anybody faced similar issues as part of their development?  Are there
> any better ways to pin point the cause of error?
>
> Thx
> Gopi | www.wignite.com
>
>
> On Tue, Aug 27, 2013 at 10:42 AM, Harsh J  wrote:
>
>> We set a part of the failure reason as the diagnostic message for a
>> failed task that a JobClient API retrieves/can retrieve:
>>
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/RunningJob.html#getTaskDiagnostics(org.apache.hadoop.mapred.TaskAttemptID)
>> .
>> Often this is
>> 'useless' given the stack trace's top part isn't always carrying the
>> most relevant information, so perhaps HADOOP-9861 may help here once
>> it is checked in.
>>
>> On Tue, Aug 27, 2013 at 10:34 AM, Gopi Krishna M 
>> wrote:
>> > Hi
>> >
>> > We are seeing our map-reduce jobs crashing once in a while and have to
>> go
>> > through the logs on all the nodes to figure out what went wrong.
>>  Sometimes
>> > it is low resources and sometimes it is a programming error which is
>> > triggered on specific inputs..  Same is true for some of our hive
>> queries.
>> >
>> > Are there any tools (free/paid) which help us to do this debugging
>> quickly?
>> > I am planning to write a debugging tool for sifting through the
>> distributed
>> > logs of hadoop but wanted to check if there are already any useful
>> tools for
>> > this.
>> >
>> > Thx
>> > Gopi | www.wignite.com
>>
>>
>>
>> --
>> Harsh J
>>
>
>


Re: how to find process under node

2013-08-29 Thread Shekhar Sharma
Are you trying to find the Java processes running on a node? Then the simplest
thing would be to SSH to that node and run the jps command to get the list of
Java processes.
Regards,
Som Shekhar Sharma
+91-8197243810


On Thu, Aug 29, 2013 at 12:27 PM, suneel hadoop
 wrote:
> Hi All,
>
> what im trying out here is to capture the process which is running under
> which node
>
> this is the unix script which i tried
>
>
> #!/bin/ksh
>
>
> Cnt=$(cat /users/hadoop/unixtest/nodefilename.txt | wc -l)
> cd /users/hadoop/unixtest/
> ls -ltr | awk '{print $9}' > list_of_scripts.txt
> split -l $Cnt list_of_scripts.txt node_scripts
> ls -ltr node_scripts* | awk '{print $9}' > list_of_node_scripts.txt
> for i in nodefilename.txt
> do
> for j in list_of_node_scripts.txt
> do
> node=$i
> script_file=$j
> cat $node\n $script_file >> $script_file
> done
> done
>
>
> exit 0;
>
>
>
> but my result should look like below:
>
>
> node1 node2
> - ---
> process1 proces3
> process2 proces4
>
>
> can some one please help in this..
> thanks in advance..


Re: reading input stream

2013-08-29 Thread Shekhar Sharma
// Minimal sketch: read a text file from HDFS line by line.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);     // open() is an instance method, not static
Path p = new Path("path of the file you would like to read from HDFS");
FSDataInputStream iStream = fs.open(p);
BufferedReader reader = new BufferedReader(new InputStreamReader(iStream));
String str;
while ((str = reader.readLine()) != null) {
    System.out.println(str);
}
reader.close();
Regards,
Som Shekhar Sharma
+91-8197243810


On Thu, Aug 29, 2013 at 12:15 PM, jamal sasha  wrote:
> Hi,
>   Probably a very stupid question.
> I have this data in binary format... and the following piece of code works
> for me in normal java.
>
>
> public class parser {
>
> public static void main(String [] args) throws Exception{
> String filename = "sample.txt";
> File file = new File(filename);
> FileInputStream fis = new FileInputStream(filename);
> System.out.println("Total file size to read (in bytes) : "
> + fis.available());
> BSONDecoder bson = new BSONDecoder();
> System.out.println(bson.readObject(fis));
> }
> }
>
>
> Now finally the last line is the answer..
> Now, I want to implement this on hadoop but the challenge (which I think)
> is.. that I am not reading or parsing data line by line.. rather its a
> stream of data??? right??
> How do i replicate the above code logic.. but in hadoop?


Re: Hadoop client user

2013-08-29 Thread Shekhar Sharma
Put that user in the hadoop group.
And if the user wants to act as a Hadoop client, then the user should be
aware of two properties: fs.default.name, which is the address of the
NameNode, and mapred.job.tracker, which is the address of the JobTracker.
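
For example, a minimal client-side sketch (host names and paths are just
placeholders) that user1 could run once the admin has created and chowned
/user/user1:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientAsUser1 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The two properties a Hadoop-1 client needs to know:
        conf.set("fs.default.name", "hdfs://namenode-host:54310");  // NameNode address
        conf.set("mapred.job.tracker", "jobtracker-host:54311");    // JobTracker address

        FileSystem fs = FileSystem.get(conf);
        fs.mkdirs(new Path("/user/user1/data"));   // works inside user1's own home directory
        fs.close();
    }
}
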
Regards,
Som Shekhar Sharma
+91-8197243810


On Thu, Aug 29, 2013 at 10:55 AM, Harsh J  wrote:
> The user1 will mainly require a home directory on the HDFS, created by
> the HDFS administrator user ('hadoop' in your case): sudo -u hadoop
> hadoop fs -mkdir /user/user1; sudo -u hadoop hadoop fs -chown
> user1:user1 /user/user1. After this, the user should be able to run
> jobs and manipulate files in their own directory.
>
> On Thu, Aug 29, 2013 at 10:21 AM, Hadoop Raj  wrote:
>> Hi,
>>
>> I have a hadoop learning environment on a pseudo distributed mode. It is 
>> owned by the user 'hadoop'.
>>
>> I am trying to get an understanding on how can another user on this box can 
>> act as a Hadoop client and able to create HDFS files and run Map Reduce 
>> jobs. Say I have a Linux user 'user1'.
>>
>> What permissions , privileges and configuration settings are required for 
>> 'user1' to act as a Hadoop client?
>>
>> Thanks,
>> Raj
>
>
>
> --
> Harsh J


Re: Sqoop issue related to Hadoop

2013-08-29 Thread Shekhar Sharma
Look under $HADOOP_HOME/logs/userlogs/<job_id>/<attempt_id>/ on the node that
ran the task; the job history files are under $HADOOP_HOME/logs/history.
Regards,
Som Shekhar Sharma
+91-8197243810


On Thu, Aug 29, 2013 at 10:13 AM, Hadoop Raj  wrote:
> Hi Kate,
>
> Where can I find the task attempt log? Can you specify the location please?
>
>
> Thanks,
> Raj
>
> On Aug 28, 2013, at 7:13 PM, Kathleen Ting  wrote:
>
>> Raj, in addition to what Abe said, please also send the failed task attempt 
>> log
>> attempt_201307041900_0463_m_00_0 as well.
>>
>> Thanks,
>> Kate
>>
>> On Wed, Aug 28, 2013 at 2:25 PM, Abraham Elmahrek  wrote:
>>> Hey Raj,
>>>
>>> It seems like the number of fields you have in your data doesn't match the
>>> number of fields in your RAJ.CUSTOMERS table.
>>>
>>> Could you please add "--verbose" to the beginning of your argument list and
>>> provide the entire contents here?
>>>
>>> -Abe
>>>
>>>
>>> On Wed, Aug 28, 2013 at 9:36 AM, Raj Hadoop  wrote:
>>>>
>>>> Hello all,
>>>>
>>>> I am getting an error while using sqoop export ( Load HDFS file to Oracle
>>>> ). I am not sure the issue might be a Sqoop or Hadoop related one. So I am
>>>> sending it to both the dist lists.
>>>>
>>>> I am using -
>>>>
>>>> sqoop export --connect jdbc:oracle:thin:@//dbserv:9876/OKI --table
>>>> RAJ.CUSTOMERS --export-dir /user/hive/warehouse/web_cust 
>>>> --input-null-string
>>>> '\\N' --input-null-non-string '\\N'  --username <> --password <> -m 1
>>>> --input-fields-terminated-by '\t'
>>>> I am getting the following error -
>>>>
>>>> Warning: /usr/lib/hbase does not exist! HBase imports will fail.
>>>> Please set $HBASE_HOME to the root of your HBase installation.
>>>> Warning: $HADOOP_HOME is deprecated.
>>>> 13/08/28 09:42:36 WARN tool.BaseSqoopTool: Setting your password on the
>>>> command-line is insecure. Consider using -P instead.
>>>> 13/08/28 09:42:36 INFO manager.SqlManager: Using default fetchSize of 1000
>>>> 13/08/28 09:42:36 INFO tool.CodeGenTool: Beginning code generation
>>>> 13/08/28 09:42:38 INFO manager.OracleManager: Time zone has been set to
>>>> GMT
>>>> 13/08/28 09:42:38 INFO manager.SqlManager: Executing SQL statement: SELECT
>>>> t.* FROM RAJ.CUSTOMERS t WHERE 1=0
>>>> 13/08/28 09:42:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is
>>>> /software/hadoop/hadoop/hadoop-1.1.2
>>>> Note:
>>>> /tmp/sqoop-hadoop/compile/c1376f66d2151b48024c54305377c981/RAJ_CUSTOMERS.java
>>>> uses or overrides a deprecated API.
>>>> Note: Recompile with -Xlint:deprecation for details.
>>>> 13/08/28 09:42:40 INFO orm.CompilationManager: Writing jar file:
>>>> /tmp/sqoop-hadoop/compile/c1376f66d2151b48024c54305377c981/RAJ.CUSTOMERS.jar
>>>> 13/08/28 09:42:40 INFO mapreduce.ExportJobBase: Beginning export of
>>>> RAJ.CUSTOMERS
>>>> 13/08/28 09:42:41 INFO manager.OracleManager: Time zone has been set to
>>>> GMT
>>>> 13/08/28 09:42:43 INFO input.FileInputFormat: Total input paths to process
>>>> : 1
>>>> 13/08/28 09:42:43 INFO input.FileInputFormat: Total input paths to process
>>>> : 1
>>>> 13/08/28 09:42:43 INFO util.NativeCodeLoader: Loaded the native-hadoop
>>>> library
>>>> 13/08/28 09:42:43 WARN snappy.LoadSnappy: Snappy native library not loaded
>>>> 13/08/28 09:42:43 INFO mapred.JobClient: Running job:
>>>> job_201307041900_0463
>>>> 13/08/28 09:42:44 INFO mapred.JobClient:  map 0% reduce 0%
>>>> 13/08/28 09:42:56 INFO mapred.JobClient:  map 1% reduce 0%
>>>> 13/08/28 09:43:00 INFO mapred.JobClient:  map 2% reduce 0%
>>>> 13/08/28 09:43:03 INFO mapred.JobClient:  map 4% reduce 0%
>>>> 13/08/28 09:43:10 INFO mapred.JobClient:  map 5% reduce 0%
>>>> 13/08/28 09:43:13 INFO mapred.JobClient:  map 6% reduce 0%
>>>> 13/08/28 09:43:17 INFO mapred.JobClient: Task Id :
>>>> attempt_201307041900_0463_m_00_0, Status : FAILED
>>>> java.io.IOException: Can't export data, please check task tracker logs
>>>>at
>>>> org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
>>>>at
>>>> org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
>>>>at org.apache.hadoop.mapreduc

Re: secondary sort - number of reducers

2013-08-29 Thread Shekhar Sharma
No. The partitioner decides which keys should go to which reducer; the number
of reducers you need to decide yourself (for example, by calling
job.setNumReduceTasks(4) in the driver). The number of reducers depends on
factors like the number of key-value pairs, the use case, etc.
Regards,
Som Shekhar Sharma
+91-8197243810


On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi  wrote:
> So it can't figure out an appropriate number of reducers as it does for
> mappers? In my case Hadoop is using 2100+ mappers and then only 1 reducer.
> Since I'm overriding the partitioner class, shouldn't that decide how many
> reducers there should be, based on how many different partition values are
> returned by the custom partitioner?
>
>
> On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley  wrote:
>>
>> If you don't specify the number of Reducers, Hadoop will use the default
>> -- which, unless you've changed it, is 1.
>>
>> Regards
>>
>> Ian.
>>
>> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi  wrote:
>>
>> I have implemented secondary sort in my MR job and for some reason if i
>> dont specify the number of reducers it uses 1 which doesnt seems right
>> because im working with 800M+ records and one reducer slows things down
>> significantly. Is this some kind of limitation with the secondary sort that
>> it has to use a single reducer .. that kind of would defeat the purpose of
>> having a scalable solution such as secondary sort. I would appreciate any
>> help.
>>
>> Thanks
>> Adeel
>>
>>
>>
>> ---
>> Ian Wrigley
>> Sr. Curriculum Manager
>> Cloudera, Inc
>> Cell: (323) 819 4075
>>
>


Re: secondary sort - number of reducers

2013-08-30 Thread Shekhar Sharma
Is the hash code of that key negative? Do something like this:

return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
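
Here is the same fix in context, a rough self-contained sketch of the
partitioner (I have used Text as the value type for simplicity; swap in
HCatRecord or whatever your job actually emits):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitions on the group part of a composite key of the form "group|sortkey",
// masking the sign bit so the partition number can never be negative.
public class GroupKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numParts) {
        String groupKey = key.toString().split("\\|")[0];
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
    }
}

Remember to register it in the driver with job.setPartitionerClass(...) and to
set the reducer count yourself with job.setNumReduceTasks(...).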

Regards,
Som Shekhar Sharma
+91-8197243810


On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi  wrote:
> OK, so when I specify the number of reducers, e.g. in my example I am using 4
> (for a much smaller data set), it works if I use a single column in my
> composite key. But if I add multiple columns in the composite key separated
> by a delimiter, it then throws the illegal partition error (keys before the
> pipe are group keys, keys after the pipe are the sort keys, and my partitioner
> only uses the group keys):
>
> java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel
> (-1)
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
> at
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
> at
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
> at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>
> public int getPartition(Text key, HCatRecord record, int numParts) {
> //extract the group key from composite key
> String groupKey = key.toString().split("\\|")[0];
> return groupKey.hashCode() % numParts;
> }
>
>
> On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma 
> wrote:
>>
>> No...partitionr decides which keys should go to which reducer...and
>> number of reducers you need to decide...No of reducers depends on
>> factors like number of key value pair, use case etc
>> Regards,
>> Som Shekhar Sharma
>> +91-8197243810
>>
>>
>> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi 
>> wrote:
>> > so it cant figure out an appropriate number of reducers as it does for
>> > mappers .. in my case hadoop is using 2100+ mappers and then only 1
>> > reducer
>> > .. since im overriding the partitioner class shouldnt that decide how
>> > manyredeucers there should be based on how many different partition
>> > values
>> > being returned by the custom partiotioner
>> >
>> >
>> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley  wrote:
>> >>
>> >> If you don't specify the number of Reducers, Hadoop will use the
>> >> default
>> >> -- which, unless you've changed it, is 1.
>> >>
>> >> Regards
>> >>
>> >> Ian.
>> >>
>> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi 
>> >> wrote:
>> >>
>> >> I have implemented secondary sort in my MR job and for some reason if i
>> >> dont specify the number of reducers it uses 1 which doesnt seems right
>> >> because im working with 800M+ records and one reducer slows things down
>> >> significantly. Is this some kind of limitation with the secondary sort
>> >> that
>> >> it has to use a single reducer .. that kind of would defeat the purpose
>> >> of
>> >> having a scalable solution such as secondary sort. I would appreciate
>> >> any
>> >> help.
>> >>
>> >> Thanks
>> >> Adeel
>> >>
>> >>
>> >>
>> >> ---
>> >> Ian Wrigley
>> >> Sr. Curriculum Manager
>> >> Cloudera, Inc
>> >> Cell: (323) 819 4075
>> >>
>> >
>
>


Re: InvalidProtocolBufferException while submitting crunch job to cluster

2013-08-31 Thread Shekhar Sharma
: java.net.UnknownHostException: bdatadev


edit your /etc/hosts file
Regards,
Som Shekhar Sharma
+91-8197243810


On Sat, Aug 31, 2013 at 2:05 AM, Narlin M  wrote:
> Looks like I was pointing to incorrect ports. After correcting the port
> numbers,
>
> conf.set("fs.defaultFS", "hdfs://:8020");
> conf.set("mapred.job.tracker", ":8021");
>
> I am now getting the following exception:
>
> 2880 [Thread-15] INFO
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob  -
> java.lang.IllegalArgumentException: java.net.UnknownHostException: bdatadev
> at
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
> at
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:164)
> at
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:389)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:356)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:124)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2218)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:80)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2252)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2234)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:300)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
> at
> org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:103)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:902)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:896)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:896)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:531)
> at
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:305)
> at
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:180)
> at
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:209)
> at
> org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:100)
> at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:51)
> at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:75)
> at java.lang.Thread.run(Thread.java:680)
> Caused by: java.net.UnknownHostException: bdatadev
> ... 27 more
>
> However nowhere in my code a host named "bdatadev" is mentioned, and I
> cannot ping this host.
>
> Thanks for the help.
>
>
> On Fri, Aug 30, 2013 at 3:04 PM, Narlin M  wrote:
>>
>> I am getting following exception while trying to submit a crunch pipeline
>> job to a remote hadoop cluster:
>>
>> Exception in thread "main" java.lang.RuntimeException: Cannot create job
>> output directory /tmp/crunch-324987940
>> at
>> org.apache.crunch.impl.mr.MRPipeline.createTempDirectory(MRPipeline.java:344)
>> at org.apache.crunch.impl.mr.MRPipeline.(MRPipeline.java:125)
>> at test.CrunchTest.setup(CrunchTest.java:98)
>> at test.CrunchTest.main(CrunchTest.java:367)
>> Caused by: java.io.IOException: Failed on local exception:
>> com.google.protobuf.InvalidProtocolBufferException: Protocol message
>> end-group tag did not match expected tag.; Host Details : local host is:
>> "NARLIN/127.0.0.1"; destination host is: "":50070;
>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1164)
>> at
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>> at com.sun.proxy.$Proxy11.mkdirs(Unknown Source)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>> at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>> at com.sun.proxy.$Proxy11.mkdirs(Unknown Source)
>> at
>> org.apach

Re: InvalidProtocolBufferException while submitting crunch job to cluster

2013-08-31 Thread Shekhar Sharma
Can you please check whether you are able to access HDFS using the Java API,
and also whether you are able to run an MR job?
Regards,
Som Shekhar Sharma
+91-8197243810


On Sat, Aug 31, 2013 at 7:08 PM, Narlin M  wrote:
> The <server_address> that was mentioned in my original post is not
> pointing to bdatadev. I should have mentioned this in my original post,
> sorry I missed that.
>
> On 8/31/13 8:32 AM, "Narlin M"  wrote:
>
>>I would, but bdatadev is not one of my servers, it seems like a random
>>host name. I can't figure out how or where this name got generated. That's
>>what puzzling me.
>>
>>On 8/31/13 5:43 AM, "Shekhar Sharma"  wrote:
>>
>>>: java.net.UnknownHostException: bdatadev
>>>
>>>
>>>edit your /etc/hosts file
>>>Regards,
>>>Som Shekhar Sharma
>>>+91-8197243810
>>>
>>>
>>>On Sat, Aug 31, 2013 at 2:05 AM, Narlin M  wrote:
>>>> Looks like I was pointing to incorrect ports. After correcting the port
>>>> numbers,
>>>>
>>>> conf.set("fs.defaultFS", "hdfs://:8020");
>>>> conf.set("mapred.job.tracker", ":8021");
>>>>
>>>> I am now getting the following exception:
>>>>
>>>> 2880 [Thread-15] INFO
>>>> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob
>>>>-
>>>> java.lang.IllegalArgumentException: java.net.UnknownHostException:
>>>>bdatadev
>>>> at
>>>>
>>>>org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.j
>>>>a
>>>>va:414)
>>>> at
>>>>
>>>>org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.
>>>>j
>>>>ava:164)
>>>> at
>>>>
>>>>org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:
>>>>1
>>>>29)
>>>> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:389)
>>>> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:356)
>>>> at
>>>>
>>>>org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileS
>>>>y
>>>>stem.java:124)
>>>> at
>>>>org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2218)
>>>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:80)
>>>> at
>>>>org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2252)
>>>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2234)
>>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:300)
>>>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
>>>> at
>>>>
>>>>org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissi
>>>>o
>>>>nFiles.java:103)
>>>> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:902)
>>>> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:896)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:396)
>>>> at
>>>>
>>>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformatio
>>>>n
>>>>.java:1332)
>>>> at
>>>>org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:896)
>>>> at org.apache.hadoop.mapreduce.Job.submit(Job.java:531)
>>>> at
>>>>
>>>>org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.su
>>>>b
>>>>mit(CrunchControlledJob.java:305)
>>>> at
>>>>
>>>>org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.start
>>>>R
>>>>eadyJobs(CrunchJobControl.java:180)
>>>> at
>>>>
>>>>org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJ
>>>>o
>>>>bStatusAndStartNewOnes(CrunchJobControl.java:209)
>>>> at
>>>>
>>>>org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:10
>>>>0
>>>>)
>>>> at
>>>>org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:51)
>>>> at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:75)
>>>> at java.lang.Thread.run(Thread.java:680)
>>>> Caused by: java.net.UnknownHostException: bdatadev
>>>> ... 27 more
>>>>
>>>> However nowhere in my code a h

Re: Uploading a file to HDFS

2013-09-26 Thread Shekhar Sharma
It's not the namenode that does the reading or breaking up of the file.
When you run the command "hadoop fs -put <local_source> <hdfs_destination>",
"hadoop" is a script file which is the default client for Hadoop. When the
client contacts the namenode for writing, the NN creates a block ID and asks
3 DNs to host the block (with the replication factor set to 3), and this
information is sent back to the client.

The client buffers 64KB of data on its own side and then pushes that data to
the first DN, from where it is pushed through the pipeline. This process is
repeated until 64MB (one block) of data has been written; if the client wants
to write more, it contacts the NN again for the next block, and the process
continues.

Check out how writing happens in HDFS.
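
In client terms, "hadoop fs -put" boils down to something like this rough
sketch (paths are just placeholders): whichever machine runs it is the one
that reads the local file and streams it to the DataNodes, while the NameNode
only hands out block locations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Reads /local/path/input.txt on this machine and writes it block by
        // block into HDFS through the DataNode pipeline.
        fs.copyFromLocalFile(new Path("/local/path/input.txt"),
                             new Path("/user/hduser/input.txt"));
        fs.close();
    }
}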


Regards,
Som Shekhar Sharma
+91-8197243810


On Thu, Sep 26, 2013 at 3:41 PM, Karim Awara  wrote:
> Hi,
>
> I have a couple of questions about the process of uploading a large file (>
> 10GB) to HDFS.
>
> To make sure my understanding is correct, assuming I have a cluster of N
> machines.
>
> What happens in the following:
>
>
> Case 1:
> assuming i want to uppload a file (input.txt) of size K GBs
> that resides on the local disk of machine 1 (which happens to be the
> namenode only). if I am running the command  -put input.txt {some hdfs dir}
> from the namenode (assuming it does not play the datanode role), then will
> the namenode read the first 64MB in a temporary pipe and then transfers it
> to one of the cluster datanodes once finished?  Or the namenode does not do
> any reading of the file, but rather asks a certain datanode to read the 64MB
> window from the file remotely?
>
>
> Case 2:
>  assume machine 1 is the namenode, but i run the -put command
> from machine 3 (which is a datanode). who will start reading the file?
>
>
>
> --
> Best Regards,
> Karim Ahmed Awara
>
> 
> This message and its contents, including attachments are intended solely for
> the original recipient. If you are not the intended recipient or have
> received this message in error, please notify me immediately and delete this
> message from your computer system. Any unauthorized use or distribution is
> prohibited. Please consider the environment before printing this email.


Re: 2 Map tasks running for a small input file

2013-09-26 Thread Shekhar Sharma
The number of map tasks of a MapReduce job doesn't depend on this
property; it depends on the number of input splits (or, equivalently, the
number of blocks if the input split size equals the block size).

1. What input format are you using? If it is something like NLineInputFormat,
what value of N are you using?

2. What is the property mapred.min.split.size set to? Have you changed it to
something else, or is it the default, which is 1?




Regards,
Som Shekhar Sharma
+91-8197243810


On Thu, Sep 26, 2013 at 4:39 PM, Viji R  wrote:
> Hi,
>
> Default number of map tasks is 2. You can set mapred.map.tasks to 1 to
> avoid this.
>
> Regards,
> Viji
>
> On Thu, Sep 26, 2013 at 4:28 PM, Sai Sai  wrote:
>> Hi
>> Here is the input file for the wordcount job:
>> **
>> Hi This is a simple test.
>> Hi Hadoop how r u.
>> Hello Hello.
>> Hi Hi.
>> Hadoop Hadoop Welcome.
>> **
>>
>> After running the wordcount successfully
>> here r the counters info:
>>
>> ***
>> Job Counters SLOTS_MILLIS_MAPS 0 0 8,386
>> Launched reduce tasks 0 0 1
>> Total time spent by all reduces waiting after reserving slots (ms) 0 0 0
>> Total time spent by all maps waiting after reserving slots (ms) 0 0 0
>> Launched map tasks 0 0 2
>> Data-local map tasks 0 0 2
>> SLOTS_MILLIS_REDUCES 0 0 9,199
>> ***
>> My question why r there 2 launched map tasks when i have only a small file.
>> Per my understanding it is only 1 block.
>> and should be only 1 split.
>> Then for each line a map computation should occur
>> but it shows 2 map tasks.
>> Please let me know.
>> Thanks
>> Sai
>>


Re: 2 Map tasks running for a small input file

2013-09-26 Thread Shekhar Sharma
-Dmapred.tasktracker.map.tasks.maximum=1 ... Guys, this property applies to
the task tracker. Setting it means that particular task tracker will not run
more than 1 map task in parallel.

For example: if a MapReduce job requires 5 map tasks and you set this property
to 1, then only 1 map task will run at a time and the others will wait; once a
task completes, the next one is scheduled.

Could you please send the code you are trying to run (the driver code and the
mapred-site.xml contents)?

You can control the number of map tasks through the input split size
(mapred.min.split.size, mapred.max.split.size and dfs.block.size):

split size = max(minSplitSize, min(maxSplitSize, blockSize))
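As an illustration of that formula, here is a sketch of mine (it assumes the
new org.apache.hadoop.mapreduce API and a 64 MB block size):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizingSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "split-sizing");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // effective split size = max(minSplitSize, min(maxSplitSize, blockSize))
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);  // 32 MB -> roughly 2 map tasks per 64 MB block
        // FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024); // raising the minimum instead merges blocks into fewer splits
    }
}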



Regards,
Som Shekhar Sharma
+91-8197243810


On Thu, Sep 26, 2013 at 6:07 PM, shashwat shriparv
 wrote:
> just try giving -Dmapred.tasktracker.map.tasks.maximum=1 on the command line
> and check how many map task its running. and also set this in
> mapred-site.xml and check.
>
> Thanks & Regards
>
> ∞
>
> Shashwat Shriparv
>
>
>
> On Thu, Sep 26, 2013 at 5:24 PM, Harsh J  wrote:
>>
>> Hi Sai,
>>
>> What Viji indicated is that the default Apache Hadoop setting for any
>> input is 2 maps. If the input is larger than one block, regular
>> policies of splitting such as those stated by Shekhar would apply. But
>> for smaller inputs, just for an out-of-box "parallelism experience",
>> Hadoop ships with a 2-maps forced splitting default
>> (mapred.map.tasks=2).
>>
>> This means your 5 lines is probably divided as 2:3 or other ratios and
>> is processed by 2 different Tasks. As Viji also indicated, to turn off
>> this behavior, you can set the mapred.map.tasks to 1 in your configs
>> and then you'll see only one map task process all 5 lines.
>>
>> On Thu, Sep 26, 2013 at 4:59 PM, Sai Sai  wrote:
>> > Thanks Viji.
>> > I am confused a little when the data is small y would there b 2 tasks.
>> > U will use the min as 2 if u need it but in this case it is not needed
>> > due
>> > to size of the data being small
>> > so y would 2 map tasks exec.
>> > Since it results in 1 block with 5 lines of data in it
>> > i am assuming this results in 5 map computations 1 per each line
>> > and all of em in 1 process/node since i m using a pseudo vm.
>> > Where is the second task coming from.
>> > The 5 computations of map on each line is 1 task.
>> > Is this right.
>> > Please help.
>> > Thanks
>> >
>> >
>> > 
>> > From: Viji R 
>> > To: user@hadoop.apache.org; Sai Sai 
>> > Sent: Thursday, 26 September 2013 5:09 PM
>> > Subject: Re: 2 Map tasks running for a small input file
>> >
>> > Hi,
>> >
>> > Default number of map tasks is 2. You can set mapred.map.tasks to 1 to
>> > avoid this.
>> >
>> > Regards,
>> > Viji
>> >
>> > On Thu, Sep 26, 2013 at 4:28 PM, Sai Sai  wrote:
>> >> Hi
>> >> Here is the input file for the wordcount job:
>> >> **
>> >> Hi This is a simple test.
>> >> Hi Hadoop how r u.
>> >> Hello Hello.
>> >> Hi Hi.
>> >> Hadoop Hadoop Welcome.
>> >> **
>> >>
>> >> After running the wordcount successfully
>> >> here r the counters info:
>> >>
>> >> ***
>> >> Job Counters SLOTS_MILLIS_MAPS 0 0 8,386
>> >> Launched reduce tasks 0 0 1
>> >> Total time spent by all reduces waiting after reserving slots (ms) 0 0
>> >> 0
>> >> Total time spent by all maps waiting after reserving slots (ms) 0 0 0
>> >> Launched map tasks 0 0 2
>> >> Data-local map tasks 0 0 2
>> >> SLOTS_MILLIS_REDUCES 0 0 9,199
>> >> ***
>> >> My question why r there 2 launched map tasks when i have only a small
>> >> file.
>> >> Per my understanding it is only 1 block.
>> >> and should be only 1 split.
>> >> Then for each line a map computation should occur
>> >> but it shows 2 map tasks.
>> >> Please let me know.
>> >> Thanks
>> >> Sai
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>


Re: Hadoop-MapReduce

2013-12-09 Thread Shekhar Sharma
First option: put the jar in the $HADOOP_HOME/lib folder,
and then run the "hadoop classpath" command on your terminal to check
whether the jar has been added.

Second option: put the jar path in the HADOOP_CLASSPATH variable
(hadoop-env.sh file) and restart your cluster.
Regards,
Som Shekhar Sharma
+91-8197243810


On Mon, Dec 9, 2013 at 6:30 PM, Ranjini Rathinam  wrote:
> Hi Subroto Sanyal,
>
> The link  provided about xml, it does not work . The Class written
> XmlContent is not allowed in the XmlInputFormat.
>
> I request you to help , whether this scenaio some one has coded, and needed
> working code.
>
> I have written using SAX Parser too, but eventhough the jars are added in
> classpath THe error is is coming has NoClasFoung Exception.
>
> Please provide sample code for the same.
>
> Thanks in advance,
> Ranjini.R
>
> On Mon, Dec 9, 2013 at 12:34 PM, Ranjini Rathinam 
> wrote:
>>
>>
>>>> Hi,
>>>>
>>>> As suggest by the link below , i have used for my program ,
>>>>
>>>> but i am facing the below issues, please help me to fix these error.
>>>>
>>>>
>>>> XmlReader.java:8: XmlReader.Map is not abstract and does not override
>>>> abstract method
>>>> map(org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.mapred.OutputCollector,org.apache.hadoop.mapred.Reporter)
>>>> in org.apache.hadoop.mapred.Mapper
>>>>  public static class Map extends MapReduceBase implements Mapper
>>>>  {
>>>>^
>>>> ./XmlInputFormat.java:16: XmlInputFormat.XmlRecordReader is not abstract
>>>> and does not override abstract method
>>>> next(java.lang.Object,java.lang.Object) in
>>>> org.apache.hadoop.mapred.RecordReader
>>>> public class XmlRecordReader implements RecordReader {
>>>>^
>>>> Note: XmlReader.java uses unchecked or unsafe operations.
>>>> Note: Recompile with -Xlint:unchecked for details.
>>>> 2 errors
>>>>
>>>>
>>>> i am using hadoop 0.20 version and java 1.6 .
>>>>
>>>> Please suggest.
>>>>
>>>> Thanks in advance.
>>>>
>>>> Regrads,
>>>> Ranjini. R
>>>> On Mon, Dec 9, 2013 at 11:08 AM, Ranjini Rathinam
>>>>  wrote:
>>>>>
>>>>>
>>>>>
>>>>> -- Forwarded message --
>>>>> From: Subroto 
>>>>> Date: Fri, Dec 6, 2013 at 4:42 PM
>>>>> Subject: Re: Hadoop-MapReduce
>>>>> To: user@hadoop.apache.org
>>>>>
>>>>>
>>>>> Hi Ranjini,
>>>>>
>>>>> A good example to look into :
>>>>> http://www.undercloud.org/?p=408
>>>>>
>>>>> Cheers,
>>>>> Subroto Sanyal
>>>>>
>>>>> On Dec 6, 2013, at 12:02 PM, Ranjini Rathinam wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> How to read xml file via mapreduce and load them in hbase and hive
>>>>> using java.
>>>>>
>>>>> Please provide sample code.
>>>>>
>>>>> I am using hadoop 0.20 version and java 1.6. Which parser version
>>>>> should be used.
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> Ranjini
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Hadoop-MapReduce

2013-12-09 Thread Shekhar Sharma
It does work; I have used it long back.

BTW, if it is not working for you, write a custom input format and implement
your own record reader. That would be far easier than breaking your head over
someone else's code.

Break your problem into steps:

(1) First, the XML data is multi-line, meaning multiple lines make up a single
record for you. A record for you might look like:

<person>
   x
   y
</person>

(2) Implement a record reader that looks out for the starting and ending
person tags (check out how RecordReader.java is written).

(3) Once you have the contents between the starting and ending tags, you can
use an XML parser to parse the contents into a Java object and form your own
key/value pairs (a custom key and a custom value).


Hope you have enough pointers to write the code.
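If it helps, here is a rough, stripped-down sketch of such a record reader. It
is my own illustration, not tested code: it assumes the new
org.apache.hadoop.mapreduce API, a <person>...</person> record layout,
uncompressed input, and it makes no attempt to handle records that straddle a
split boundary.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PersonRecordReader extends RecordReader<LongWritable, Text> {
    private static final String START_TAG = "<person>";
    private static final String END_TAG = "</person>";
    private FSDataInputStream in;
    private long end;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        FileSystem fs = fileSplit.getPath().getFileSystem(context.getConfiguration());
        in = fs.open(fileSplit.getPath());
        in.seek(fileSplit.getStart());
        end = fileSplit.getStart() + fileSplit.getLength();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // do not start a new record once we have read past the end of our split
        if (in.getPos() >= end || !readUntil(START_TAG, null)) {
            return false;
        }
        StringBuilder record = new StringBuilder(START_TAG);
        if (!readUntil(END_TAG, record)) {
            return false;
        }
        key.set(in.getPos());
        value.set(record.toString());
        return true;
    }

    // Reads byte by byte until the given tag has been consumed; if buf is non-null,
    // everything read (including the tag itself) is appended to it.
    private boolean readUntil(String tag, StringBuilder buf) throws IOException {
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (buf != null) {
                buf.append((char) b);
            }
            if (b == tag.charAt(matched)) {
                matched++;
                if (matched == tag.length()) {
                    return true;
                }
            } else {
                matched = (b == tag.charAt(0)) ? 1 : 0;
            }
        }
        return false;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return 0.0f; }   // progress reporting omitted in this sketch
    @Override public void close() throws IOException { in.close(); }
}

A matching input format only needs to return this reader from
createRecordReader(); the splitting of files can simply be inherited from
FileInputFormat.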


Regards,
Som Shekhar Sharma
+91-8197243810


On Mon, Dec 9, 2013 at 6:30 PM, Ranjini Rathinam  wrote:
> Hi Subroto Sanyal,
>
> The link  provided about xml, it does not work . The Class written
> XmlContent is not allowed in the XmlInputFormat.
>
> I request you to help , whether this scenaio some one has coded, and needed
> working code.
>
> I have written using SAX Parser too, but eventhough the jars are added in
> classpath THe error is is coming has NoClasFoung Exception.
>
> Please provide sample code for the same.
>
> Thanks in advance,
> Ranjini.R
>
> On Mon, Dec 9, 2013 at 12:34 PM, Ranjini Rathinam 
> wrote:
>>
>>
>>>> Hi,
>>>>
>>>> As suggest by the link below , i have used for my program ,
>>>>
>>>> but i am facing the below issues, please help me to fix these error.
>>>>
>>>>
>>>> XmlReader.java:8: XmlReader.Map is not abstract and does not override
>>>> abstract method
>>>> map(org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.mapred.OutputCollector,org.apache.hadoop.mapred.Reporter)
>>>> in org.apache.hadoop.mapred.Mapper
>>>>  public static class Map extends MapReduceBase implements Mapper
>>>>  {
>>>>^
>>>> ./XmlInputFormat.java:16: XmlInputFormat.XmlRecordReader is not abstract
>>>> and does not override abstract method
>>>> next(java.lang.Object,java.lang.Object) in
>>>> org.apache.hadoop.mapred.RecordReader
>>>> public class XmlRecordReader implements RecordReader {
>>>>^
>>>> Note: XmlReader.java uses unchecked or unsafe operations.
>>>> Note: Recompile with -Xlint:unchecked for details.
>>>> 2 errors
>>>>
>>>>
>>>> i am using hadoop 0.20 version and java 1.6 .
>>>>
>>>> Please suggest.
>>>>
>>>> Thanks in advance.
>>>>
>>>> Regrads,
>>>> Ranjini. R
>>>> On Mon, Dec 9, 2013 at 11:08 AM, Ranjini Rathinam
>>>>  wrote:
>>>>>
>>>>>
>>>>>
>>>>> -- Forwarded message --
>>>>> From: Subroto 
>>>>> Date: Fri, Dec 6, 2013 at 4:42 PM
>>>>> Subject: Re: Hadoop-MapReduce
>>>>> To: user@hadoop.apache.org
>>>>>
>>>>>
>>>>> Hi Ranjini,
>>>>>
>>>>> A good example to look into :
>>>>> http://www.undercloud.org/?p=408
>>>>>
>>>>> Cheers,
>>>>> Subroto Sanyal
>>>>>
>>>>> On Dec 6, 2013, at 12:02 PM, Ranjini Rathinam wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> How to read xml file via mapreduce and load them in hbase and hive
>>>>> using java.
>>>>>
>>>>> Please provide sample code.
>>>>>
>>>>> I am using hadoop 0.20 version and java 1.6. Which parser version
>>>>> should be used.
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> Ranjini
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: External db

2013-12-15 Thread Shekhar Sharma
HDFS is batch-oriented: suitable for OLAP, not for OLTP.

OLTP requires that data can be updated and deleted, basically CRUD operations.
But HDFS doesn't support random writes, so it is not suitable for OLTP, and
that's where the NoSQL databases come in.
Regards,
Som Shekhar Sharma
+91-8197243810


On Sun, Dec 15, 2013 at 6:51 PM, Kishore  wrote:
> Hi experts,
>
> What exactly needed the nosql database like Cassandra, hbase and mongodb when 
> we have hdfs, ofcourse
> It supports oltp, are we storing the results from hdfs and reanalysing in 
> nosql?
>
> Sent from my iPhone


Re: How to set "hadoop.tmp.dir" if I have multiple disks per node?

2013-12-16 Thread Shekhar Sharma
hadoop.tmp.dir is a directory created on the local file system.
For example, suppose you have set the hadoop.tmp.dir property to
/home/training/hadoop.

The dfs/name part under it is created when you format the namenode by running
the command

hadoop namenode -format

and the rest is created when the daemons start. When you open this folder you
will see two subfolders, dfs and mapred.

Note that the same path is also used as the default base for some HDFS-side
directories (for example mapred.system.dir), which is why a
/home/training/hadoop/mapred path shows up on HDFS as well.

Hope this clears things up.
Regards,
Som Shekhar Sharma
+91-8197243810


On Mon, Dec 16, 2013 at 1:42 PM, Dieter De Witte  wrote:
> Hi,
>
> Make sure to also set mapred.local.dir to the same set of output
> directories, this is were the intermediate key-value pairs are stored!
>
> Regards, Dieter
>
>
> 2013/12/16 Tao Xiao 
>>
>> I have ten disks per node,and I don't know what value I should set to
>> "hadoop.tmp.dir". Some said this property refers to a location in local disk
>> while some other said it refers to a directory in HDFS. I'm confused, who
>> can explain it ?
>>
>> I want to spread I/O since I have ten disks per node, so should I set a
>> comma-separated list of directories (which are on different disks) to
>> "hadoop.tmp.dir" ?
>
>


Re: issue when using HDFS

2013-12-16 Thread Shekhar Sharma
Seems like the NameNode is not reachable at node32:9000 (connection refused); check that the NameNode daemon is actually running and listening on that port.
Regards,
Som Shekhar Sharma
+91-8197243810


On Mon, Dec 16, 2013 at 1:40 PM, Geelong Yao  wrote:
> Hi Everyone
>
> After I upgrade the hadoop to CDH 4.2.0 Hadoop 2.0.0,I try to running some
> test
> When I try to upload file to HDFS,error comes:
>
>
>
> node32:/software/hadoop-2.0.0-cdh4.2.0 # hadoop dfs -put
> /public/data/carinput1G_BK carinput1G
> DEPRECATED: Use of this script to execute hdfs command is deprecated.
> Instead use the hdfs command for it.
>
> ls: Call From node32/11.11.11.32 to node32:9000 failed on connection
> exception: java.net.ConnectException: Connection refused; For more details
> see:  http://wiki.apache.org/hadoop/ConnectionRefused
>
>
>
> Something wrong with my setting?
>
> BRs
> Geelong
>
>
> --
> From Good To Great


Re: Hadoop-MapReduce

2013-12-17 Thread Shekhar Sharma
Hello Ranjini,
This error comes when you mix and match the newer and older APIs.

You might have written your program using the newer API while the XML input
format is using the older API.
The older API lives under the package org.apache.hadoop.mapred.

The newer API lives under org.apache.hadoop.mapreduce (with the bundled
input/output formats under org.apache.hadoop.mapreduce.lib).

Check XmlInputFormat.java to see which package's FileInputFormat it uses.
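To make the difference concrete, here is a small illustration of my own (not
code taken from the XML input format in question):

// New API (org.apache.hadoop.mapreduce): Mapper is a class that you extend.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text("line"), value);   // trivial body, just to keep the example compilable
    }
}
// Old API (org.apache.hadoop.mapred): Mapper is an *interface*, implemented on top
// of MapReduceBase with map(key, value, OutputCollector, Reporter) and driven by
// JobConf/JobClient. An input format written against one package cannot be plugged
// into a job written against the other.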


Regards,
Som Shekhar Sharma
+91-8197243810


On Tue, Dec 17, 2013 at 12:55 PM, Ranjini Rathinam
 wrote:
> Hi,
>
> I am using hadoop 0.20 version
>
> In that while exceuting the XmlInformat class
> I am getting the error as
>
> "Error: Found Class  org.apache.hadoop.mapreduce.TaskAttemptContext, but
> interface was excepted,."
>
> Please suggest to fix the error.
>
> Thanks in advance.
>
> Ranjini
>
> On Wed, Dec 11, 2013 at 12:30 PM, Ranjini Rathinam 
> wrote:
>>
>> hi,
>>
>> I have fixed the error , the code is running fine, but this code just
>> split the part of the tag.
>>
>> i want to convert into text format so that i can load them into tables of
>> hbase and hive.
>>
>> I have used the DOM Parser but this parser uses File as Object  but hdfs
>> uses FileSystem.
>>
>> Eg,
>>
>> File fXmlFile = new File("D:/elango/test.xml");
>>
>>  System.out.println(g);
>>  DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
>>  DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
>>  Document doc = dBuilder.parse(fXmlFile);
>>
>>
>> This cant be used as hdfs, because hdfs path  is accessed through
>> FileSystem.
>>
>> I kindly request u to , Please suggest me to fix the above issue.
>>
>> Thanks in advance
>>
>> Ranjini R
>>
>>
>>
>>
>> On Tue, Dec 10, 2013 at 11:07 AM, Ranjini Rathinam
>>  wrote:
>>>
>>>
>>>
>>> -- Forwarded message --
>>> From: Shekhar Sharma 
>>> Date: Mon, Dec 9, 2013 at 10:23 PM
>>> Subject: Re: Hadoop-MapReduce
>>> To: user@hadoop.apache.org
>>> Cc: ssan...@datameer.com
>>>
>>>
>>> It does work i have used it long back..
>>>
>>> BTW if it is not working, write the custom input format and implement
>>> your record reader. That would be far more easy than breaking your
>>> head with others code.
>>>
>>> Break your problem in step:
>>>
>>> (1) First the XML data is multiline...Meaning multiple lines makes a
>>> single record for you...May be a record for you would be
>>>
>>> 
>>>  x
>>>   y
>>> 
>>>
>>> (2) Implement a record reader that looks out for the starting and
>>> ending person tag ( Checkout how RecordReader.java is written)
>>>
>>> (3) Once you got the contents between starting and ending tag, now you
>>> can use a xml parser to parse the contents into an java object and
>>> form your own key value pairs ( custom key and custom value)
>>>
>>>
>>> Hope you have enough pointers to write the code.
>>>
>>>
>>> Regards,
>>> Som Shekhar Sharma
>>> +91-8197243810
>>>
>>>
>>> On Mon, Dec 9, 2013 at 6:30 PM, Ranjini Rathinam 
>>> wrote:
>>> > Hi Subroto Sanyal,
>>> >
>>> > The link  provided about xml, it does not work . The Class written
>>> > XmlContent is not allowed in the XmlInputFormat.
>>> >
>>> > I request you to help , whether this scenaio some one has coded, and
>>> > needed
>>> > working code.
>>> >
>>> > I have written using SAX Parser too, but eventhough the jars are added
>>> > in
>>> > classpath THe error is is coming has NoClasFoung Exception.
>>> >
>>> > Please provide sample code for the same.
>>> >
>>> > Thanks in advance,
>>> > Ranjini.R
>>> >
>>> > On Mon, Dec 9, 2013 at 12:34 PM, Ranjini Rathinam
>>> > 
>>> > wrote:
>>> >>
>>> >>
>>> >>>> Hi,
>>> >>>>
>>> >>>> As suggest by the link below , i have used for my program ,
>>> >>>>
>>> >>>> but i am facing the below issues, please help me to fix these error.
>>> >>>>
>>> >>>>
>>> >>>> XmlReader.java:8: XmlReader.Map is not abstract and does not
>

Re: XmlInputFormat Hadoop -Mapreduce

2013-12-17 Thread Shekhar Sharma
Hi Ranjini,
I have modified the code and it is working perfectly fine for me. Please mail
me at shekhar2...@gmail.com and I will send you the zipped code.

In the code you have written, I really don't understand why the mapper class
emits the key as NullWritable; that doesn't make sense.

If you make use of a reducer after this, there are two possibilities:

(1) Grouping on null happens at the reducer and you see a NullPointerException.
(2) Grouping on null gives you something like null, {val1, val2, ..., valn}.

My suggestion: don't ever use null as an output key from the mapper.
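As an illustration only (the class and field names below are mine, not from
the attached code), the mapper would emit a real key along these lines:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KeyedParserMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable offset, Text xmlRecord, Context context)
            throws IOException, InterruptedException {
        // in real code: parse the XML record here and pull out a meaningful id field
        String empId = xmlRecord.toString();
        context.write(new Text(empId), xmlRecord);   // a non-null key groups cleanly at the reducer
    }
}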

Regards,
Som Shekhar Sharma
+91-8197243810


On Tue, Dec 17, 2013 at 5:42 PM, Ranjini Rathinam
 wrote:
> Hi,
>
> I have attached the code. Please verify.
>
> Please suggest . I am using hadoop 0.20 version.
>
>
> import java.io.IOException;
> import java.util.logging.Level;
> import java.util.logging.Logger;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> //import org.apache.hadoop.mapreduce.lib.input.XmlInputFormat;
>
> public class ParserDriverMain {
>
> public static void main(String[] args) {
> try {
> runJob(args[0], args[1]);
>
> } catch (IOException ex) {
> Logger.getLogger(ParserDriverMain.class.getName()).log(Level.SEVERE, null,
> ex);
> }
>
> }
>
> //The code is mostly self explanatory. You need to define the starting and
> ending tag of to split a record from the xml file and it can be defined in
> the following lines
>
> //conf.set("xmlinput.start", "");
> //conf.set("xmlinput.end", "");
>
>
> public static void runJob(String input,String output ) throws IOException {
>
> Configuration conf = new Configuration();
>
> conf.set("xmlinput.start", "");
> conf.set("xmlinput.end", "");
> conf.set("io.serializations","org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
>
> Job job = new Job(conf, "jobName");
>
> input="/user/hduser/Ran/";
> output="/user/task/Sales/";
> FileInputFormat.setInputPaths(job, input);
> job.setJarByClass(ParserDriverMain.class);
> job.setMapperClass(MyParserMapper.class);
> job.setNumReduceTasks(1);
> job.setInputFormatClass(XmlInputFormatNew.class);
> job.setOutputKeyClass(NullWritable.class);
> job.setOutputValueClass(Text.class);
> Path outPath = new Path(output);
> FileOutputFormat.setOutputPath(job, outPath);
> FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
> if (dfs.exists(outPath)) {
> dfs.delete(outPath, true);
> }
>
>
> try {
>
> job.waitForCompletion(true);
>
> } catch (InterruptedException ex) {
> Logger.getLogger(ParserDriverMain.class.getName()).log(Level.SEVERE, null,
> ex);
> } catch (ClassNotFoundException ex) {
> Logger.getLogger(ParserDriverMain.class.getName()).log(Level.SEVERE, null,
> ex);
> }
>
> }
>
> }
>
>
>
>
>
> import java.io.IOException;
> import java.util.logging.Level;
> import java.util.logging.Logger;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.jdom.Document;
> import org.jdom.Element;
> import org.jdom.JDOMException;
> import org.jdom.input.SAXBuilder;
> import java.io.Reader;
> import java.io.StringReader;
>
> /**
>  *
>  * @author root
>  */
> public class MyParserMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
>
> @Override
> public void map(LongWritable key, Text value1,Context context)throws
> IOException, InterruptedException {
>
> String xmlString = value1.toString();
>  System.out.println("xmlString"+xmlString);
>  SAXBuilder builder = new SAXBuilder();
> Reader in = new StringReader(xmlString);
> String value="";
> try {
>
> Document doc = builder.build(in);
> Element root = doc.getRootElement();
>
> //String tag1
> =root.getChild("tag").getChild("tag1").getTextTrim() ;
>
>// String tag2
> =root.getChil

Re: XmlInputFormat Hadoop -Mapreduce

2013-12-17 Thread Shekhar Sharma
Hello Ranjini,

PFA the source code for XML Input Format.
Also find the output and the input which i have used.

ATTACHED FILES DESCRIPTION:

(1) emp.xml ---> input data for testing
(2) emp_op.tar.gz ---> output: results of the map-only job (I have set the
number of reducers to 0)
(3) src.tar ---> the source files (please create a project in Eclipse and
paste the files in). The code is written with the appropriate package and
source folder.

RUNNING THE JOB:

hadoop jar xml.jar com.xg.hadoop.training.mr.MyDriver -D START_TAG=\<Employee\> -D END_TAG=\</Employee\> emp op

Explanation of the above command:

(1) xml.jar is the jar name which we create either through eclipse or
maven or ant

(2) com.xg.hadoop.training.mr.MyDriver  is the fully qualified driver
class name. It means that MyDriver is residing under package
com.xg.hadoop.training.mr

(3) -D START_TAG=<Employee> as such will not work, because the shell
interprets the unescaped angle brackets as redirection and the tag value never
reaches the job as intended. Therefore you need to escape them, and that is why
it is written as -D START_TAG=\<Employee\>; you can see that the two angle
brackets are escaped. The same explanation goes for -D END_TAG (see the driver
sketch after this list for how these -D values are picked up).


(4) emp is the input data which is present on HDFS

(5) op is the output directory which will be created as part of mapreduce job.
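For completeness, here is a rough standalone sketch (mine, not the attached
MyDriver, which may look different) of how a driver picks up those -D values.
It assumes the driver implements Tool so that ToolRunner/GenericOptionsParser
handles the generic options:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DriverSketch extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();            // the -D options from the command line land here
        String startTag = conf.get("START_TAG");   // e.g. <Employee>
        String endTag = conf.get("END_TAG");       // e.g. </Employee>
        System.out.println("start=" + startTag + " end=" + endTag
                + " input=" + args[0] + " output=" + args[1]);
        return 0;                                  // a real driver would build and submit the Job here
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new DriverSketch(), args));
    }
}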


NOTE:
The number of reducers is explicitly set to ZERO, so this MapReduce job will
always run zero reduce tasks. If you want reduce tasks, you need to change the
driver code.

Hope this helps and you are able to solve your problem. In case you face any
difficulty, please feel free to contact me.


Regards,
K Som Shekhar Sharma

+91-8197243810




On Tue, Dec 17, 2013 at 5:42 PM, Ranjini Rathinam
 wrote:
> Hi,
>
> I have attached the code. Please verify.
>
> Please suggest . I am using hadoop 0.20 version.
>
>
> import java.io.IOException;
> import java.util.logging.Level;
> import java.util.logging.Logger;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> //import org.apache.hadoop.mapreduce.lib.input.XmlInputFormat;
>
> public class ParserDriverMain {
>
> public static void main(String[] args) {
> try {
> runJob(args[0], args[1]);
>
> } catch (IOException ex) {
> Logger.getLogger(ParserDriverMain.class.getName()).log(Level.SEVERE, null,
> ex);
> }
>
> }
>
> //The code is mostly self explanatory. You need to define the starting and
> ending tag of to split a record from the xml file and it can be defined in
> the following lines
>
> //conf.set("xmlinput.start", "");
> //conf.set("xmlinput.end", "");
>
>
> public static void runJob(String input,String output ) throws IOException {
>
> Configuration conf = new Configuration();
>
> conf.set("xmlinput.start", "");
> conf.set("xmlinput.end", "");
> conf.set("io.serializations","org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
>
> Job job = new Job(conf, "jobName");
>
> input="/user/hduser/Ran/";
> output="/user/task/Sales/";
> FileInputFormat.setInputPaths(job, input);
> job.setJarByClass(ParserDriverMain.class);
> job.setMapperClass(MyParserMapper.class);
> job.setNumReduceTasks(1);
> job.setInputFormatClass(XmlInputFormatNew.class);
> job.setOutputKeyClass(NullWritable.class);
> job.setOutputValueClass(Text.class);
> Path outPath = new Path(output);
> FileOutputFormat.setOutputPath(job, outPath);
> FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
> if (dfs.exists(outPath)) {
> dfs.delete(outPath, true);
> }
>
>
> try {
>
> job.waitForCompletion(true);
>
> } catch (InterruptedException ex) {
> Logger.getLogger(ParserDriverMain.class.getName()).log(Level.SEVERE, null,
> ex);
> } catch (ClassNotFoundException ex) {
> Logger.getLogger(ParserDriverMain.class.getName()).log(Level.SEVERE, null,
> ex);
> }
>
> }
>
> }
>
>
>
>
>
> import java.io.IOException;
> import java.util.logging.Level;
> import java.util.logging.Logger;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.jdom.Document;
> import org.jdom.Element;
> import org.jdom.JDOMException;
> import org.jdom.input.SAXBuilder;
> impo

Re: Estimating the time of my hadoop jobs

2013-12-17 Thread Shekhar Sharma
Apart from what Devin has suggested, there are other factors worth noting
when you run your Hadoop cluster on virtual machines.

(1) How many map and reduce slots are there in the cluster?

Since you have not mentioned it, and you are using a 4-node Hadoop cluster
with the default of 2 slots of each kind per node, a total of 8 map slots and
8 reduce slots are present.
What does that mean?
It means that at any time only 8 map tasks and 8 reduce tasks will run in
parallel on your cluster, and the other tasks have to wait.


(2) You have not mentioned whether the 30 GB of data is made up of lots of
smaller files (less than the block size) or of bigger files, so let us do a
simple calculation assuming a single 30 GB file and a block size of 64 MB:

30GB = 30 * 1024 * 1024* 1024 = 32212254720

64MB = 64 * 1024*1024 =67108864


Total Number of blocks the data will be broken  = (32212254720) /
(67108864) = 480 Blocks

This means you will be running 480 map tasks (keeping in mind input split
size = block size). But since you have only 8 map slots, only 8 map tasks run
at a time and the others stay pending.

Assuming all 8 map tasks finish at roughly the same time, you will have
480/8 = 60 map waves.

(3) You also know that each task runs in a separate JVM: for every task a JVM
is created, and after the task finishes the JVM is torn down. This creation and
destruction of JVMs is also a bottleneck.

So try reusing the same JVM; there is an option that lets you reuse it (see
the configuration sketch after this list).

(4) Since you are working with such big data, try using a combiner.

(5) Also try compressing the data and the intermediate output of the mappers
and the reducer output:
   --- first try with sequence files
   --- then try with the Snappy compression codec
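To make points (3), (4) and (5) concrete, here is a small configuration sketch
(my own, using Hadoop 1.x property names; the values are only starting points,
and SnappyCodec needs the native Snappy libraries installed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);     // (3) reuse task JVMs; -1 means no limit per job
        conf.setBoolean("mapred.compress.map.output", true);   // (5) compress intermediate map output
        conf.set("mapred.map.output.compression.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");
        Job job = new Job(conf, "tuned-job");
        // job.setCombinerClass(MyReducer.class);              // (4) add a combiner when the reduce logic allows it
    }
}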


With the above pointers, see if you can bring the timing down to at least an
hour or so. Then, with the same 4 nodes but Hadoop running on separate physical
machines, you should see the job completing in 15-30 minutes [please refer to
Devin's comments].



My suggestion is to get optimal performance on your virtual machines first and
then go for a real Hadoop cluster; you will definitely see the performance
improvement.



Regards,
Som Shekhar Sharma
+91-8197243810


On Tue, Dec 17, 2013 at 6:42 PM, Devin Suiter RDX  wrote:

> Nikhil,
>
> One of the problems you run into with Hadoop in Virtual Machine
> environments is performance issues when they are all running on the same
> physical host. With a VM, even though you are giving them 4 GB of RAM, and
> a virtual CPU and disk, if the virtual machines are sharing physical
> components like processor and physical storage medium, they compete for
> resources at the physical level. Even if you have the VM on a single host,
> or on a multi-core host with multiple disks and they are sharing as few
> resources as possible, there will still be a performance hit when the VM
> information has to pass through the hypervisor layer - co-scheduling
> resources with the host and things like that.
>
> Does that make sense?
>
> It's generally accepted that due to these issues, Hadoop in virtual
> environments does not offer the same performance benefits as a physical
> Hadoop cluster. It can be used pretty well with even low-quality hardware
> though, so so, maybe you can acquire some used desktops and install your
> favorite Linux flavor on them and make a cluster - some people have even
> run Hadoop on Raspberry Pi clusters.
>
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Tue, Dec 17, 2013 at 6:26 AM, Kandoi, Nikhil wrote:
>
>> I know this foolish of me to ask this, because there are a lot of factors
>> that affect this,
>>
>> but why is it taking so much time, can anyone suggest possible reasons
>> for it, or if anyone has faced such issue before
>>
>>
>>
>> Thanks,
>>
>> Nikhil Kandoi
>>
>>  P.S – I am  Hadoop-1.0.3  for this application, so I wonder if this
>> version has got something to do with it.
>>
>>
>>
>> *From:* Azuryy Yu [mailto:azury...@gmail.com]
>> *Sent:* Tuesday, December 17, 2013 4:14 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Estimating the time of my hadoop jobs
>>
>>
>>
>> Hi Kandoi,
>>
>> It depends on:
>>
>> how many cores on each VNode
>>
>> how complicated of your analysis application
>>
>>
>>
>> But I don't think it's normal spent 3hr to process 30GB data even on your
>> *not good* hareware.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Dec 17, 2013 at 6:39 PM, Kandoi, Nikhil 
>> wrote:
>>
>> Hello everyone,

Re: DataNode not starting in slave machine

2013-12-25 Thread Shekhar Sharma
It is running against the local file system (file:///). Note that in your core-site.xml the property name is misspelled as "fs.defualt.name", so it is ignored and the default of file:/// is used; correct it to fs.default.name and restart.

Regards,
Som Shekhar Sharma
+91-8197243810


On Wed, Dec 25, 2013 at 7:01 PM, Vishnu Viswanath
 wrote:
> Hi,
>
> I am getting this error while starting the datanode in my slave system.
>
> I read the JIRA HDFS-2515, it says it is because hadoop is using wrong conf
> file.
>
> 13/12/24 15:57:14 INFO impl.MetricsConfig: loaded properties from
> hadoop-metrics2.properties
> 13/12/24 15:57:14 INFO impl.MetricsSourceAdapter: MBean for source
> MetricsSystem,sub=Stats registered.
> 13/12/24 15:57:14 INFO impl.MetricsSystemImpl: Scheduled snapshot period at
> 10 second(s).
> 13/12/24 15:57:14 INFO impl.MetricsSystemImpl: DataNode metrics system
> started
> 13/12/24 15:57:15 INFO impl.MetricsSourceAdapter: MBean for source ugi
> registered.
> 13/12/24 15:57:15 WARN impl.MetricsSystemImpl: Source name ugi already
> exists!
> 13/12/24 15:57:15 ERROR datanode.DataNode:
> java.lang.IllegalArgumentException: Does not contain a valid host:port
> authority: file:///
> at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:164)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:212)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:244)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:236)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:359)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:321)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1712)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1651)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1669)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1795)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1812)
>
> But how do i check which conf file hadoop is using? or how do i set it?
>
> These are my configurations:
>
> core-site.xml
> --
> 
> 
> fs.defualt.name
> hdfs://master:9000
> 
>
> 
> hadoop.tmp.dir
> /home/vishnu/hadoop-tmp
> 
> 
>
> hdfs-site.xml
> 
> 
> 
> dfs.replication
> 2
> 
> 
>
> mared-site.xml
> 
> 
> 
> mapred.job.tracker
> master:9001
> 
> 
>
> any help,
>


Re: Set number of mappers

2014-01-21 Thread Shekhar Sharma
The number of map tasks is determined by the number of input splits; you can
change the number of map tasks by changing the input split size (see the sketch
below).

The number of reduce tasks, on the other hand, you can set explicitly.
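A small sketch of what that looks like in MRv2 (the property name below is the
Hadoop 2 one; the 128 MB and the 4 reducers are just placeholder values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class WordCountTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // a larger minimum split size -> fewer, bigger splits -> fewer map tasks
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024);
        Job job = Job.getInstance(conf, "wordcount");
        job.setNumReduceTasks(4);    // the reduce task count can be set directly
    }
}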
On 21 Jan 2014 20:25, "xeon"  wrote:

> Hi,
>
> I want to set the number of map tasks in the Wordcount example. Is is
> possible to set this variable in MRv2?
>
> Thanks,
>


Re: Datanode Shutting down automatically

2014-01-24 Thread Shekhar Sharma
Incompatible namespace ID error: it's because you probably formatted the
namenode while the datanode's data folder still has the old namespace ID.

What are the values of the following properties?
dfs.data.dir
dfs.name.dir
hadoop.tmp.dir

The value of each of these properties is a directory on the local file system.

The solution is to open the VERSION file of the namenode (under the
dfs/name/current folder), where you will see the namespace ID near the top;
copy that value into the VERSION file of the datanode that is not coming up
(under the dfs/data/current folder), replacing its namespace ID, and then start
the datanode process.

A dirty hack is to delete the folders (the directories specified by the above
properties) from all the machines, then format your namenode and start your
processes again. Please note that the previous data will be lost.
On 25 Jan 2014 11:55, "Pranav Gadekar"  wrote:

> This is my log file.
>
> 014-01-24 17:24:58,238 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
> /
> STARTUP_MSG: Starting DataNode
> STARTUP_MSG:   host = user/127.0.1.1
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 1.2.1
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r
> 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
> STARTUP_MSG:   java = 1.6.0_27
> /
> 2014-01-24 17:24:58,622 INFO
> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
> hadoop-metrics2.properties
> 2014-01-24 17:24:58,669 INFO
> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
> MetricsSystem,sub=Stats registered.
> 2014-01-24 17:24:58,670 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
> period at 10 second(s).
> 2014-01-24 17:24:58,670 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system
> started
> 2014-01-24 17:24:58,877 INFO
> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi
> registered.
> 2014-01-24 17:24:58,880 WARN
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already
> exists!
> 2014-01-24 17:25:10,778 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
> Incompatible namespaceIDs in /app/hadoop/tmp/dfs/data: namenode namespaceID
> = 102782159; datanode namespaceID = 1227483104
> at
> org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:232)
> at
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:147)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:414)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:321)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1712)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1651)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1669)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1795)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1812)
>
> 2014-01-24 17:25:10,779 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down DataNode at user/127.0.1.1
> /
> 2014-01-24 17:26:13,413 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
> /
> STARTUP_MSG: Starting DataNode
> STARTUP_MSG:   host = user/127.0.1.1
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 1.2.1
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r
> 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
> STARTUP_MSG:   java = 1.6.0_27
> /
> 2014-01-24 17:26:13,510 INFO
> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
> hadoop-metrics2.properties
> 2014-01-24 17:26:13,518 INFO
> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
> MetricsSystem,sub=Stats registered.
> 2014-01-24 17:26:13,518 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
> period at 10 second(s).
> 2014-01-24 17:26:13,518 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system
> started
> 2014-01-24 17:26:13,626 INFO
> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi
> registered.
> 2014-01-24 17:26:13,628 WARN
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already
> exists!
> 2014-01-24 17:26:28,860 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
> Incompatible namespaceIDs in /app/hadoop/tmp/dfs/data: namenode namespaceID

Re: HDFS data transfer is faster than SCP based transfer?

2014-01-25 Thread Shekhar Sharma
When you put or write data into HDFS, about 64 KB of data is buffered on the
client side and then pushed through the pipeline, and this process continues
until 64 MB of data (the block size defined by the client) has been written.

scp, on the other hand, tries to buffer the whole payload on one connection.
Passing chunks of data is faster than passing one large blob.

Please check how writing happens in HDFS; that will give you a clearer picture.
On 24 Jan 2014 10:56, "rab ra"  wrote:

> Hello
>
> I have a use case that requires transfer of input files from remote
> storage using SCP protocol (using jSCH jar).  To optimize this use case, I
> have pre-loaded all my input files into HDFS and modified my use case so
> that it copies required files from HDFS. So, when tasktrackers works, it
> copies required number of input files to its local directory from HDFS. All
> my tasktrackers are also datanodes. I could see my use case has run faster.
> The only modification in my application is that file copy from HDFS instead
> of transfer using SCP. Also, my use case involves parallel operations (run
> in tasktrackers) and they do lot of file transfer. Now all these transfers
> are replaced with HDFS copy.
>
> Can anyone tell me HDFS transfer is faster as I witnessed? Is it because,
> it uses TCP/IP? Can anyone give me reasonable reasons to support the
> decrease of time?
>
>
> with thanks and regards
> rab
>


Re: hadoop report has corrupt block but i can not find any in block metadata

2014-01-25 Thread Shekhar Sharma
Run the fsck command:

hadoop fsck <path> -files -blocks -locations
On 25 Jan 2014 08:04, "ch huang"  wrote:

> hi,maillist:
>this morning nagios alert hadoop has corrupt block ,i checked
> it use "
> hdfs dfsadmin -report" ,from it output ,it did has corrupt blocks
>
> Configured Capacity: 53163259158528 (48.35 TB)
> Present Capacity: 50117251458834 (45.58 TB)
> DFS Remaining: 45289289015296 (41.19 TB)
> DFS Used: 4827962443538 (4.39 TB)
> DFS Used%: 9.63%
> Under replicated blocks: 277
> Blocks with corrupt replicas: 2
> Missing blocks: 0
>
> but i dump all metadata use "
> # sudo -u hdfs hdfs dfsadmin -metasave"
>
> and loock the record which "c: not 0" i can not find any block with
> corrupt replicas,why?
>


Re: HDFS data transfer is faster than SCP based transfer?

2014-01-25 Thread Shekhar Sharma
We have the concept of short-circuit reads, where a client reads block data
directly from the local disk instead of going through the datanode, which
improves read performance. Do we have a similar concept of short-circuit
writes?
On 25 Jan 2014 16:10, "Harsh J"  wrote:

> There's a lot of difference here, although both do use TCP underneath,
> but do note that SCP securely encrypts data but stock HDFS
> configuration does not.
>
> You can also ask SCP to compress data transfer via the "-C" argument
> btw - unsure if you already applied that pre-test - it may help show
> up some difference. Also, the encryption algorithm can be changed to a
> weaker one if security is not a concern during the transfer, via "-c
> arcfour".
>
> On Fri, Jan 24, 2014 at 10:55 AM, rab ra  wrote:
> > Hello
> >
> > I have a use case that requires transfer of input files from remote
> storage
> > using SCP protocol (using jSCH jar).  To optimize this use case, I have
> > pre-loaded all my input files into HDFS and modified my use case so that
> it
> > copies required files from HDFS. So, when tasktrackers works, it copies
> > required number of input files to its local directory from HDFS. All my
> > tasktrackers are also datanodes. I could see my use case has run faster.
> The
> > only modification in my application is that file copy from HDFS instead
> of
> > transfer using SCP. Also, my use case involves parallel operations (run
> in
> > tasktrackers) and they do lot of file transfer. Now all these transfers
> are
> > replaced with HDFS copy.
> >
> > Can anyone tell me HDFS transfer is faster as I witnessed? Is it
> because, it
> > uses TCP/IP? Can anyone give me reasonable reasons to support the
> decrease
> > of time?
> >
> >
> > with thanks and regards
> > rab
>
>
>
> --
> Harsh J
>


Commissioning Task tracker

2014-01-27 Thread Shekhar Sharma
Hello,
I am using Apache Hadoop 1.0.3 and I want to commission a task tracker.
For this I have added the mapred.hosts property to mapred-site.xml and pointed
it at a file. In that file I have given the IP address of a machine.

Then I ran the command "hadoop mradmin -refreshNodes".

Now, when I try to run the task tracker on that machine, it throws an exception
saying the task tracker is not allowed to communicate with the job tracker.

I have also tried the newer property (mapred.jobtracker.hosts.filename); that
was not helpful either, and moreover the property is not recognized by the job
tracker.

Please advise me on how to commission a task tracker.

Regards,
Som Shekhar Sharma
+91-8197243810


Re: Commissioning Task tracker

2014-01-27 Thread Shekhar Sharma
You mean the property name?

Regards,
Som Shekhar Sharma
+91-8197243810


On Mon, Jan 27, 2014 at 9:14 PM, Nitin Pawar wrote:

> I think the file name is "mapred.include"
>
>
> On Mon, Jan 27, 2014 at 9:11 PM, Shekhar Sharma wrote:
>
>> Hello,
>> I am using apache hadoop-1.0.3 version and i want to commission a task
>> tracker.
>> For this i have included a property mapred.hosts in mapred-site.xml and
>> for this property i have mentioned a file. In this file i have given the IP
>> address of a machine..
>>
>> then i ran the command   "hadoop mradmin -refreshNodes"
>>
>> Now what i see, when i am trying to run the task tracker on that machine,
>> it is throwing me an exception that task tracker is not allowed to
>> communicate with the job tracker...
>>
>> I have used the newer property (mapred.jobtracker.hosts.filename) even
>> this was not helpful and more over this property is not recognized by the
>> job tracker.
>>
>> Please advise me how to commission a task tracker
>>
>> Regards,
>> Som Shekhar Sharma
>> +91-8197243810
>>
>
>
>
> --
> Nitin Pawar
>


Re: Commissioning Task tracker

2014-01-27 Thread Shekhar Sharma
That's fine; I have provided the file name as "include".

I have tried this as well, but I am getting the same error.

My xml files are as follows:

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>1048576</value>
  </property>
  <property>
    <name>dfs.hosts</name>
    <value>/usr/local/hadoop/conf/include</value>
  </property>
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/usr/local/hadoop/conf/exclude</value>
  </property>
</configuration>



mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>NameNode:54311</value>
  </property>
  <property>
    <name>mapred.hosts</name>
    <value>/usr/local/hadoop/conf/include</value>
    <final>true</final>
  </property>
</configuration>

My include file has the entry of the machine...



Regards,
Som Shekhar Sharma
+91-8197243810


On Mon, Jan 27, 2014 at 9:28 PM, Nitin Pawar wrote:

> Sorry for incomplete reply
>
> In hadoop 1.2/1.0 , following is the property
>
>  
> mapred.hosts
> ${HADOOP_CONF_DIR}/mapred.include
> Names a file that contains the list of nodes that may
> connect to the jobtracker.  If the value is empty, all hosts are
> permitted.
>   
>
>
>
> On Mon, Jan 27, 2014 at 9:21 PM, Shekhar Sharma wrote:
>
>> U mean property name?
>>
>> Regards,
>> Som Shekhar Sharma
>> +91-8197243810
>>
>>
>> On Mon, Jan 27, 2014 at 9:14 PM, Nitin Pawar wrote:
>>
>>> I think the file name is "mapred.include"
>>>
>>>
>>> On Mon, Jan 27, 2014 at 9:11 PM, Shekhar Sharma 
>>> wrote:
>>>
>>>> Hello,
>>>> I am using apache hadoop-1.0.3 version and i want to commission a task
>>>> tracker.
>>>> For this i have included a property mapred.hosts in mapred-site.xml and
>>>> for this property i have mentioned a file. In this file i have given the IP
>>>> address of a machine..
>>>>
>>>> then i ran the command   "hadoop mradmin -refreshNodes"
>>>>
>>>> Now what i see, when i am trying to run the task tracker on that
>>>> machine, it is throwing me an exception that task tracker is not allowed to
>>>> communicate with the job tracker...
>>>>
>>>> I have used the newer property (mapred.jobtracker.hosts.filename) even
>>>> this was not helpful and more over this property is not recognized by the
>>>> job tracker.
>>>>
>>>> Please advise me how to commission a task tracker
>>>>
>>>> Regards,
>>>> Som Shekhar Sharma
>>>> +91-8197243810
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
>
> --
> Nitin Pawar
>


commissioning and decommissioning a task tracker

2014-02-02 Thread Shekhar Sharma
Hello,

I am not able to commission and decommission a task tracker. I am using Hadoop
1.0.3 and used the "mapred.hosts" property for commissioning a task tracker.

When I look at the job tracker log, it shows that it has picked up the
appropriate file as the include file. But I am still able to run a task tracker
on a machine which is not in that include file.

I have tried the newer property "mapred.jobtracker.hosts.file" as well, but
that didn't work either.

Please advise me.
 Regards,
Som Shekhar Sharma
+91-8197243810


Re: A hadoop command to determine the replication factor of a hdfs file ?

2014-02-08 Thread Shekhar Sharma
Run the fsck command on the file (for example: hadoop fsck /path/to/file -files -blocks); the output shows the replication factor for each file and block.
On 8 Feb 2014 23:03, "Raj Hadoop"  wrote:

> Hi,
>
> Is there a hadoop command to determine the replication factor of a hdfs
> file  ? Please advise.
>
> I know that "fs setrep" only changes the replication factor.
>
> Regards,
> Raj
>


Re: XML to TEXT

2014-02-12 Thread Shekhar Sharma
Which input format are you using? Use an XML input format.
On 3 Jan 2014 10:47, "Ranjini Rathinam"  wrote:

> Hi,
>
> Need to convert XML into text using mapreduce.
>
> I have used DOM and SAX parser.
>
> After using SAX Builder in mapper class. the child node act as root
> Element.
>
> While seeing in Sys out i found thar root element is taking the child
> element and printing.
>
> For Eg,
>
> 100RR
> when this xml is passed in mapper , in sys out printing the root element
>
> I am getting the the root element as
>
> 
> 
>
> Please suggest and help to fix this.
>
> I need to convert the xml into text using mapreduce code. Please provide
> with example.
>
> Required output is
>
> id,name
> 100,RR
>
> Please help.
>
> Thanks in advance,
> Ranjini R
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: Reg:Hive query with mapreduce

2014-02-20 Thread Shekhar Sharma
Assuming you are using TextInputFormat and your data set is comma-separated,
where the second column is empId and the third column is salary, your map
function would look like this:



public class FooMapper extends Mapper<LongWritable, Text, Text, NullWritable>
{
    @Override
    public void map(LongWritable offset, Text empRecord, Context context)
            throws IOException, InterruptedException
    {
        String[] splits = empRecord.toString().split(",");
        double salary = Double.parseDouble(splits[2]);
        if (salary > 12000)
        {
            context.write(new Text(splits[1]), NullWritable.get());
        }
    }
}


Set the number of reduce tasks to zero (see the driver sketch below).

The number of output files will be equal to the number of map tasks in this
case; if you want a single output file, then:

(1) Set mapred.min.split.size to a value larger than the input file size. Only
one map task will then be spawned and you will get one output file.
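A minimal sketch of the matching driver (class names and paths here are
illustrative, assuming the new API and the FooMapper above in the same
package):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalaryFilterDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "empId where sal > 12000");
        job.setJarByClass(SalaryFilterDriver.class);
        job.setMapperClass(FooMapper.class);
        job.setNumReduceTasks(0);                  // map-only job: mapper output is written out directly
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // FileInputFormat.setMinInputSplitSize(job, ...);  // larger than the input => single map task, single output file
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}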




Regards,
Som Shekhar Sharma
+91-8197243810


On Thu, Feb 20, 2014 at 5:55 PM, Ranjini Rathinam wrote:

> Hi,
>
> How to implement the Hive query such as
>
> select * from table comp;
>
> select empId from comp where sal>12000;
>
> in mapreduce.
>
> Need to use this query in mapreduce code. How to implement the above query
> in the code using mapreduce , JAVA.
>
>
> Please provide the sample code.
>
> Thanks in advance for the support
>
> Regards
>
> Ranjini
>
>
>
>
>


Re: Mappers vs. Map tasks

2014-02-26 Thread Shekhar Sharma
Writing a custom input format would be much easier and would give you better
control.
You might be tempted to use the Jackson library to process the JSON, but that
requires you to know your JSON schema up front, and that assumption breaks if
the format of the data changes.

I would suggest writing a custom record reader in which you parse the JSON and
create your own key/value pairs.
On 27 Feb 2014 09:52, "Sugandha Naolekar"  wrote:

> Joao Paulo,
>
> Your suggestion is appreciated. Although, on a side note, what is more
> tedious: Writing a custom InputFormat or changing the code which is
> generating the input splits.?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny wrote:
>
>> If I understood your problem correctly, you have one huge JSON, which is
>> basically a JSONArray, and you want to process one JSONObject of the array
>> at a time.
>>
>> I have faced the same issue some time ago and instead of changing the
>> input format, I changed the code that was generating this input, to
>> generate lots of JSONObjects, one per line. Hence, using the default
>> TextInputFormat, the map function was getting called with the entire JSON.
>>
>> A JSONArray is not good for a mapreduce input since it has a first [ and
>> a last ] and commas between the JSONs of the array. The array can be
>> represented as the file that the JSONs belong.
>>
>> Of course, this approach works only if you can modify what is generating
>> the input you're talking about.
>>
>>
>> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq :
>>
>> In that case you have to convert your JSON data into seq files first and
>>> then do the processing.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <
>>> sugandha@gmail.com> wrote:
>>>
 Can I use SequenceFileInputFormat to do the same?

  --
 Thanks & Regards,
 Sugandha Naolekar





 On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq wrote:

> Since there is no OOTB feature that allows this, you have to write
> your custom InputFormat to handle JSON data. Alternatively you could make
> use of Pig or Hive as they have builtin JSON support.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
> rajeshnagar...@gmail.com> wrote:
>
>> 1 simple way is to remove the new line characters so that the default
>> record reader and default way the block is read will take care of the 
>> input
>> splits and JSON will not get affected by the removal of NL character
>>
>>
>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>> sugandha@gmail.com> wrote:
>>
>>> Ok. Got it. Now I have a single file which is of 129MB. Thus, it
>>> will be split into two blocks. Now, since my file is a json file, I 
>>> cannot
>>> use textinputformat. As, every input split(logical) will be a single 
>>> line
>>> of the json file. Which I dont want. Thus, in this case, can I write a
>>> custom input format and a custom record reader so that, every input
>>> split(logical) will have only that part of data which I require.
>>>
>>> For. e.g:
>>>
>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>> 3.00, "CLAZZ": 42.00, "ROAD_TYPE": 3.00, "END_ID":
>>> 33451.00, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>> "OSM_SOURCE": 1520846283.00, "COST": 0.007058, "OSM_TARGET":
>>> 1520846293.00, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>> 10.00, "OSM_ID": 138697535.00, "START_ID": 33450.00, "KM":
>>> 0.00, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>> 30.00, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 862.869986,
>>> 1458938.970140 ] ] } }
>>> ,
>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>> 3.00, "CLAZZ": 32.00, "ROAD_TYPE": 3.00, "END_ID":
>>> 37016.00, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>> "OSM_SOURCE": 1037135286.00, "COST": 0.003052, "OSM_TARGET":
>>> 1551615728.00, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>> 20.00, "OSM_ID": 89417379.00, "START_ID": 24882.00, "KM":
>>> 0.00, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>> 50.00, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>> 1458829.592709 ] ] } }
>>>
>>> *I want here the every input split to consist of entire type data
>>> and 

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-01 Thread Shekhar Sharma
Don't you think using Flume would be easier? Use the HDFS sink and set a
property to roll the log file every hour.
That way you use a single Flume agent that receives the logs as they are
generated and dumps them directly into HDFS.
If you want to drop unwanted log entries, you can write a custom sink (or an
interceptor) to filter them before they are written to HDFS.

I suppose this would be much easier.
On 2 Mar 2014 12:34, "Fengyun RAO"  wrote:

> Thanks, Simon. that's very clear.
>
>
> 2014-03-02 14:53 GMT+08:00 Simon Dong :
>
>> Reading data for each hour shouldn't be a problem, as for Hadoop or shell
>> you can pretty much do everything with mmddhh* as you can do with mmddhh.
>>
>> But if you need the data for the hour all sorted in one file then you
>> have to run a post processing MR job for each hour's data to merge them,
>> which should be very trivial.
>>
>> With that being a requirement, using a custom partitioner to send all
>> records with in an hour to a particular reducer might be a viable or better
>> option to save the additional MR pass to merge them, given:
>>
>> -You can determine programatically before submitting the job the number
>> of hours covered, then you can call job.setNumOfReduceTasks(numOfHours) to
>> set the number of reducers
>> -The number of hours you cover for each run matches the number of
>> reducers your cluster typically assigns so you won't suffer much
>> efficiency. For example if each run covers last 24 hours and your cluster
>> defaults to 18 reducer slots, it should be fine
>> -You can emit timestamp as the key from the mapper so your partitioner
>> can decide which reducer the record should be send to, and it will be
>> sorted by MR when it reaches the reducer
>>
>> Even with this, you can still use MultipleOutputs to customize the file
>> name each reducer generates for better usability, i.e. instead of
>> part-r-x have it generate mmddhh-r-0.
>>
>> -Simon
>>
>> On Sat, Mar 1, 2014 at 10:13 PM, Fengyun RAO wrote:
>>
>>> Thank you, Simon! It helps a lot!
>>>
>>> We want one file per hour for the sake of the queries that follow.
>>> It would be very convenient to select several specified hours' results.
>>>
>>> We also need each record sorted by timestamp for the processing that follows.
>>> With a set of files per hour, as you show with MultipleOutputs, we
>>> would have to merge-sort them later. Maybe that needs another MR job?
>>>
>>> 2014-03-02 13:14 GMT+08:00 Simon Dong :
>>>
>>> Fengyun,

 Is there any particular reason you have to have exactly 1 file per
 hour? As you probably know already, each reducer will output 1 file, or if
 you use MultipleOutputs as I suggested, a set of files. If you have to fit
 the number of reducers to the number of hours you have from the input, and
 generate the number of files accordingly, it will most likely be at the
 expense of cluster efficiency and performance. A worst case scenario of
 course is if you have a bunch of data all within the same hour, then you
 have to settle for 1 reducer without any parallelization at all.

 A workaround is to use MultipleOutputs to generate a set of files for
 each hour, with the hour being the base name. Or, if you so choose, a
 sub-directory for each hour. For example if you use mmddhh as the base
 name, you will have a set of files for an hour like:

 030119-r-0
 ...
 030119-r-n
 030120-r-0
 ...
 030120-r-n

 Or in a sub-directory:

 030119/part-r-0
 ...
 030119/part-r-n

 You can then use wild card to glob the output either for manual
 processing, or as input path for subsequent jobs.

 -Simon
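A rough sketch of that naming scheme in a reducer (assuming the reduce key is
the epoch hour as in the partitioner sketch above and the values are plain Text
lines; the class name and date format are illustrative):

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Writes each hour's records to files named after the hour,
// e.g. 030119-r-00000 instead of part-r-00000.
public class HourlyFileReducer extends Reducer<LongWritable, Text, NullWritable, Text> {

  private MultipleOutputs<NullWritable, Text> out;
  private final SimpleDateFormat fmt = new SimpleDateFormat("MMddHH"); // JVM default time zone

  @Override
  protected void setup(Context context) {
    out = new MultipleOutputs<NullWritable, Text>(context);
  }

  @Override
  protected void reduce(LongWritable epochHour, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // turn the epoch hour back into a base name such as "030119"
    String baseName = fmt.format(new Date(epochHour.get() * 3600000L));
    for (Text record : values) {
      out.write(NullWritable.get(), record, baseName);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    out.close(); // flushes the per-hour files
  }
}

(Setting LazyOutputFormat as the job's output format avoids the empty default
part-r-* files that would otherwise still be created alongside the hourly ones.)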



 On Sat, Mar 1, 2014 at 7:37 PM, Fengyun RAO wrote:

> Thanks Devin. We don't just want one file. It's complicated.
>
> if the input folder contains data in X hours, we want X files,
> if Y hours, we want Y files.
>
> Obviously, X or Y is unknown at compile time.
>
> 2014-03-01 20:48 GMT+08:00 Devin Suiter RDX :
>
>> If you only want one file, then you need to set the number of
>> reducers to 1.
>>
>> If the size of the data makes the original MR job impractical to use
>> a single reducer, you run a second job on the output of the first, with 
>> the
>> default mapper and reducer (the identity ones), and set that
>> numReducers = 1.
>>
>> Or use the HDFS getmerge function to collate the results into one file.
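For example (the paths are placeholders):

hadoop fs -getmerge /user/rao/job-output /tmp/job-output-merged.txt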
>> On Mar 1, 2014 4:59 AM, "Fengyun RAO"  wrote:
>>
>>> Thanks, but how to set reducer number to X? X is dependent on input
>>> (run-time), which is unknown on job configuration (compile time).
>>>
>>>
>>> 2014-03-01 17:44 GMT+08:00 AnilKumar B :
>>>
 Hi,

 Write the custom partitioner on  and as you mentioned
 set #reducers to X.



>>>
>

>>>
>>
>


Re: Multinode setup..

2015-01-03 Thread Shekhar Sharma
Have you done the following:

(1) Edit the masters and slaves files
(2) Edit mapred-site.xml and core-site.xml
(3) Delete the hadoop temp folder (the directories specified by
dfs.data.dir and dfs.name.dir)
(4) Format the namenode (hadoop namenode -format)
(5) Start the cluster (see the sketch below)
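A rough sketch of steps (1), (2), (4) and (5) for a two-node Hadoop 1.x setup;
the host names, ports and paths are placeholders and must match your own
environment:

# conf/masters
master

# conf/slaves
master
slave1

<!-- conf/core-site.xml (identical on every node) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop-temp</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

# then, on the master only:
bin/hadoop namenode -format
bin/start-dfs.sh
bin/start-mapred.sh

The conf directory has to be the same on every node, and the master needs
passwordless SSH to every host listed in the slaves file for the start
scripts to work.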

Regards,
Som Shekhar Sharma
+91-8197243810

On Sat, Jan 3, 2015 at 12:35 PM, Anil Jagtap  wrote:

> I have a single node already set up as per the PDF instructions.
> I cloned this node and want to set it up as a slave.
> How do I do this?
>
>
> Rgds, Anil
>
> Sent from my iPhone
>
>
> On 3 Jan 2015, at 18:00, "bit1...@163.com"  wrote:
>
> Setting up Hadoop in a real distributed cluster isn't that complex. Which step
> were you stuck at?
>
> --
> bit1...@163.com
>
>
> *From:* Anil Jagtap 
> *Date:* 2015-01-03 14:32
> *To:* user 
> *Subject:* Multinode setup..
> Dear All,
>
> I'm trying to set up a multi-node cluster and I found millions of articles
> on how to configure multiple nodes. I tried a couple of them and I really
> don't understand why I am not able to configure it. I have a feeling that
> these experts omit something in their blogs, thinking that it is not
> necessary to publish since it is a small thing and the person who is
> configuring is good enough to understand, but the fact is there are people
> like me who are completely new to Hadoop, Linux & Java as well.
>
> So does anyone have some kind of material that literally tells me each and
> every step of how to configure a single node, then add another node, and
> finally link them?
>
> Sorry to ask for this granular detail, but trust me, I have wasted a good
> amount of time searching and trying things out. The attached one I found
> helped me in setting up a single node successfully.
>
> I can use Cloudera but it consumes a lot of memory, and hence I want to
> configure my own so I can fit 2 nodes in the same memory that Cloudera
> uses. With any VM changes, Cloudera stops working.
>
> Finally, for now, do I really need all of the components, or should just
> Sqoop, Pig, HBase & Hive be sufficient to practise?
>
> If anyone can share their multi-node VMs, that would be really great.
>
> Thanks to all in advance; expecting good feedback at the earliest
> possible.
>
> Kind Regards,
>
> Anil
> [Attachment: hadoop.pdf (358K)]
>
>


Re: Securing secrets for S3 FileSystems in DistCp

2016-05-03 Thread Shekhar Sharma
Have you used IAM (Identity and Access Management) roles?
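If IAM roles are not an option (the client side is on-premise), note that newer
Hadoop releases let the s3a connector and DistCp read their secrets through the
built-in credential provider API, which may remove the need for the fork. A
rough sketch, assuming a version that supports it (the namenode address,
keystore path and bucket are placeholders):

# store the keys in a JCEKS keystore on HDFS (the command prompts for the values)
hadoop credential create fs.s3a.access.key \
    -provider jceks://hdfs@namenode:8020/user/distcp/s3.jceks
hadoop credential create fs.s3a.secret.key \
    -provider jceks://hdfs@namenode:8020/user/distcp/s3.jceks

# point DistCp at the keystore instead of putting secrets in core-site.xml
hadoop distcp \
    -Dhadoop.security.credential.provider.path=jceks://hdfs@namenode:8020/user/distcp/s3.jceks \
    hdfs:///data/events s3a://my-bucket/events

Lock down the keystore file's HDFS permissions so only the job's user can read it.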
On 3 May 2016 18:11, "Elliot West"  wrote:

> Hello,
>
> We're currently using DistCp and S3 FileSystems to move data from a
> vanilla Apache Hadoop cluster to S3. We've been concerned about exposing
> our AWS secrets on our shared, on-premise cluster. As  a work-around we've
> patched DistCp to load these secrets from a JCEKS keystore. This seems to
> work quite well; however, we're not comfortable relying on a DistCp fork.
>
> What is the usual approach to achieve this with DistCp and is there a
> feature or practice that we've overlooked? If not, might there be value in
> us raising a JIRA ticket and submitting a patch for DistCp to include this
> secure keystore functionality?
>
> Thanks - Elliot.
>