Re: Jobs Stuck

2015-10-04 Thread Mohit Anchlia
I changed my code to reduce the values but I still see that the app is
requiring 1.24 GB. Does it only work when there is an XML file?

conf.set("yarn.app.mapreduce.am.resource.mb", "1000");

conf.set("mapreduce.map.memory.mb", "500");

conf.set("mapreduce.reduce.memory.mb", "500");

On Sun, Oct 4, 2015 at 12:48 PM, Mohit Anchlia 
wrote:

> I just noticed that the memory resources are 1273 MB but my application is
> showing a memory of 1.24 GB. Is that a problem?
>
> On Sun, Oct 4, 2015 at 12:36 PM, Mohit Anchlia 
> wrote:
>
>> I have Hadoop running on 1 node and am trying to test a simple wordcount
>> example. However, the job is ACCEPTED but never gets any resources. I
>> looked in the Scheduler UI and it seems to have all the resources available
>> for execution. Could somebody help with what else could be the problem?
>>
>> root.hdfs: 0.0% used
>>
>> 'root.hdfs' Queue Status
>> Used Resources: 
>> Num Active Applications: 0
>> Num Pending Applications: 1
>> Min Resources: 
>> Max Resources: 
>> Steady Fair Share: 
>> Instantaneous Fair Share: 
>>
>> ID: application_1443983171281_0004
>> User: hdfs
>> Name: wordcount
>> Application Type: MAPREDUCE
>> Queue: root.hdfs
>> Fair Share: 1273
>> StartTime: Sun Oct 4 12:21:42 -0700 2015
>> FinishTime: N/A
>> State: ACCEPTED
>> FinalStatus: UNDEFINED
>> Running Containers: 0
>> Allocated CPU VCores: 0
>> Allocated Memory MB: 0
>> Progress: 
>> Tracking UI: UNASSIGNED
>>
>>
>>
>


Re: Jobs Stuck

2015-10-04 Thread Mohit Anchlia
I just noticed that the memory resources are 1273 MB but my application is showing
a memory of 1.24 GB. Is that a problem?

On Sun, Oct 4, 2015 at 12:36 PM, Mohit Anchlia 
wrote:

> I have Hadoop running on 1 node and am trying to test a simple wordcount
> example. However, the job is ACCEPTED but never gets any resources. I
> looked in the Scheduler UI and it seems to have all the resources available
> for execution. Could somebody help with what else could be the problem?
>
> root.hdfs: 0.0% used
>
> 'root.hdfs' Queue Status
> Used Resources: 
> Num Active Applications: 0
> Num Pending Applications: 1
> Min Resources: 
> Max Resources: 
> Steady Fair Share: 
> Instantaneous Fair Share: 
>
> ID: application_1443983171281_0004
> User: hdfs
> Name: wordcount
> Application Type: MAPREDUCE
> Queue: root.hdfs
> Fair Share: 1273
> StartTime: Sun Oct 4 12:21:42 -0700 2015
> FinishTime: N/A
> State: ACCEPTED
> FinalStatus: UNDEFINED
> Running Containers: 0
> Allocated CPU VCores: 0
> Allocated Memory MB: 0
> Progress: 
> Tracking UI: UNASSIGNED
>
>
>


Jobs Stuck

2015-10-04 Thread Mohit Anchlia
I have Hadoop running on 1 node and am trying to test a simple wordcount
example. However, the job is ACCEPTED but never gets any resources. I
looked in the Scheduler UI and it seems to have all the resources available
for execution. Could somebody help with what else could be the problem?

root.hdfs: 0.0% used

'root.hdfs' Queue Status
Used Resources: 
Num Active Applications: 0
Num Pending Applications: 1
Min Resources: 
Max Resources: 
Steady Fair Share: 
Instantaneous Fair Share: 

ID: application_1443983171281_0004
User: hdfs
Name: wordcount
Application Type: MAPREDUCE
Queue: root.hdfs
Fair Share: 1273
StartTime: Sun Oct 4 12:21:42 -0700 2015
FinishTime: N/A
State: ACCEPTED
FinalStatus: UNDEFINED
Running Containers: 0
Allocated CPU VCores: 0
Allocated Memory MB: 0
Progress: 
Tracking UI: UNASSIGNED
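
One way to see what the single NodeManager is actually advertising to the scheduler
(a job typically sits in ACCEPTED when no node can satisfy the AM container request)
is to ask the ResourceManager for its node reports. A small sketch, assuming a
Hadoop 2.x YARN client on the classpath and the cluster configuration files in the
usual place:

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterCapacityCheck {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();
    try {
      List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
      for (NodeReport node : nodes) {
        System.out.println(node.getNodeId()
            + " capability=" + node.getCapability()   // total memory/vcores the NM offers
            + " used=" + node.getUsed());             // what is currently allocated
      }
    } finally {
      yarn.stop();
    }
  }
}

Comparing the reported capability against the AM request is usually enough to tell
whether the request can ever be satisfied on a single node.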


Re: Starting hadoop on reboot/start

2014-08-20 Thread Mohit Anchlia
Thanks! I ended up creating my own script. I am not sure why these are not
part of the Apache Hadoop tar.


On Wed, Aug 20, 2014 at 10:19 AM, Abdelrahman Kamel 
wrote:

> unsubscribe
>
>
> On Wed, Aug 20, 2014 at 8:18 PM, Ray Chiang  wrote:
>
>> Try taking a peek at the Cloudera distributions.  Look in the tar file in
>> the sbin directory for files like
>>
>> *-daemon.sh
>> *-daemons.sh
>>
>> That might be a good starting point.
>>
>> -Ray
>>
>> On Wed, Aug 20, 2014 at 10:06 AM, Mohit Anchlia 
>> wrote:
>>
>>> Any help would be appreciated. If not I'll go ahead and write these
>>> startup scripts.
>>>
>>>
>>> On Tue, Aug 19, 2014 at 5:49 PM, Mohit Anchlia 
>>> wrote:
>>>
>>>> I installed apache hadoop, however I am unable to find any script that
>>>> I can configure as a service. Does anyone have any steps or scripts that
>>>> can be reused?
>>>>
>>>
>>>
>>
>
>
> --
> Abdelrahman Kamel
>


Re: Starting hadoop on reboot/start

2014-08-20 Thread Mohit Anchlia
Any help would be appreciated. If not I'll go ahead and write these startup
scripts.


On Tue, Aug 19, 2014 at 5:49 PM, Mohit Anchlia 
wrote:

> I installed apache hadoop, however I am unable to find any script that I
> can configure as a service. Does anyone have any steps or scripts that can
> be reused?
>


Starting hadoop on reboot/start

2014-08-19 Thread Mohit Anchlia
I installed apache hadoop, however I am unable to find any script that I
can configure as a service. Does anyone have any steps or scripts that can
be reused?


Re: Compressing map output

2014-07-01 Thread Mohit Anchlia
Yes, it goes away when I comment out the map output compression.

On Tue, Jul 1, 2014 at 6:38 PM, M. Dale  wrote:

>  That looks right. Do you consistently get the error below and the total
> job fails? Does it go away when you comment out the map compression?
>
>
> On 07/01/2014 03:23 PM, Mohit Anchlia wrote:
>
> I am trying to compress map output but when I add the following code I get
> errors. Is there anything wrong that you can point me to?
>
>
> conf.setBoolean("mapreduce.map.output.compress", true);
>
> conf.setClass("mapreduce.map.output.compress.codec", GzipCodec.class,
> CompressionCodec.class);
>
>  14/07/01 22:21:47 INFO mapreduce.Job: Task Id :
> attempt_1404239414989_0008_r_00_1, Status : FAILED
>
> Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error
> in shuffle in fetcher#1
>
> at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:415)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES;
> bailing-out.
>
> at
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:323)
>
> at
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:245)
>
> at
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:347)
>
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
>
>
>


Compressing map output

2014-07-01 Thread Mohit Anchlia
I am trying to compress map output but when I add the following code I get
errors. Is there anything wrong that you can point me to?


conf.setBoolean("mapreduce.map.output.compress", true);

conf.setClass("mapreduce.map.output.compress.codec", GzipCodec.class,
    CompressionCodec.class);

14/07/01 22:21:47 INFO mapreduce.Job: Task Id :
attempt_1404239414989_0008_r_00_1, Status : FAILED

Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error
in shuffle in fetcher#1

at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES;
bailing-out.

at
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:323)

at
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:245)

at
org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:347)

at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
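
For completeness, a driver sketch with the same settings applied through the job
configuration. The class name and the paths taken from args are placeholders, and
swapping GzipCodec for DefaultCodec here is only a way to check whether the shuffle
failure is specific to the gzip codec, not a confirmed fix:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedShuffleDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Turn on compression of the intermediate (map) output.
    conf.setBoolean("mapreduce.map.output.compress", true);
    // DefaultCodec (zlib) is used here to rule out a gzip-specific problem;
    // switch back to GzipCodec.class once the shuffle works.
    conf.setClass("mapreduce.map.output.compress.codec",
        DefaultCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed-shuffle-test");
    job.setJarByClass(CompressedShuffleDriver.class);
    // Mapper/reducer/key/value setup omitted for brevity.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}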


Re: WholeFileInputFormat in hadoop

2014-06-30 Thread Mohit Anchlia
Have you looked at this post:

http://stackoverflow.com/questions/15863566/need-assistance-with-implementing-dbscan-on-map-reduce/15863699#15863699

On Sun, Jun 29, 2014 at 9:01 PM, unmesha sreeveni 
wrote:

> I am trying to implement the DBSCAN algorithm. I referred to the algorithm in
> "Data Mining - Concepts and Techniques (3rd Ed)", chapter 10, page 474.
> In this algorithm we need to find the distance between each pair of points.
> Say my sample input is
> 5,6
> 8,2
> 4,5
> 4,6
>
> So in DBSCAN we have to pick one element and then find the distance to all
> the others.
>
> While implementing this I will not be able to get the whole file in the map
> in order to find the distances.
> I tried some approaches:
> 1. Used WholeFileInput and did the entire algorithm in the map itself - I don't
> think this is a good one. (And it ended up with a heap space error.)
> 2. This one is not implemented, as I thought it is not feasible:
>   - Read one line of the input data set in the driver and write it to a new
> file (say centroid).
>  - This centroid can be read in setup, the distance calculated in the map,
> and the data which satisfies the DBSCAN condition emitted as
> map(id, epsilonneighbour); in the reducer we will be able to aggregate all the
> epsilon neighbours of (5,6) which come from different maps, and then
> find the neighbours of each epsilon neighbour.
>  - The next iteration should also be done: again read the input file and find a
> node which is not visited.
> If the input is a 1 GB file, the MR job executes as many times as there are
> records.
>
>
> Can anyone suggest a better way to do this?
>
> Hope the use case is understandable; else please tell me and I will explain
> further.
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>


Re: WholeFileInputFormat in hadoop

2014-06-29 Thread Mohit Anchlia
I think it will be easier if you give your use case. You would only load the whole
file if you don't want it to split, but there are many ways to solve the issue,
and that's why understanding the use case is helpful.

Sent from my iPhone

> On Jun 29, 2014, at 9:28 AM, unmesha sreeveni  wrote:
> 
> But how is it different from normal execution and parallel MR?
> Although MapReduce is a parallel execution framework, the data into each map is a 
> single input.
> 
> If the whole-file input is just an entire input split instead of the entire 
> input file, it will be useful, right?
> If it is the whole file it can hit a heap space error.
> 
> Please correct me if I am wrong.
> 
> -- 
> Thanks & Regards
> 
> Unmesha Sreeveni U.B
> Hadoop, Bigdata Developer
> Center for Cyber Security | Amrita Vishwa Vidyapeetham
> http://www.unmeshasreeveni.blogspot.in/
> 
> 


Re: WholeFileInputFormat in hadoop

2014-06-28 Thread Mohit Anchlia
It takes the entire file as input. The input format class has a method, isSplitable,
which is overridden to return false. This method determines whether a
file can be split into multiple chunks.
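
A minimal sketch of such an input format (this follows the common whole-file
pattern; the class and reader names here are illustrative, not from a particular
library):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Whole-file input format: one record per file, never split.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // this is the flag that keeps each file in a single split
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }

  static class WholeFileRecordReader
      extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit fileSplit;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.fileSplit = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      // Read the whole file into one value (so the file must fit in memory).
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
  }
}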

On Sat, Jun 28, 2014 at 5:38 AM, Shahab Yunus 
wrote:

> I think it takes the entire file as input. Otherwise it won't be any
> different from the normal line/record-based input format.
>
> Regards,
> Shahab
> On Jun 28, 2014 3:28 AM, "unmesha sreeveni"  wrote:
>
>> Hi
>>
>>   A small clarification:
>>
>>  WholeFileInputFormat takes the entire input file as input or each
>> record(input split) as whole?
>>
>> --
>> *Thanks & Regards *
>>
>>
>> *Unmesha Sreeveni U.B*
>> *Hadoop, Bigdata Developer*
>> *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
>> http://www.unmeshasreeveni.blogspot.in/
>>
>>
>>


Connection refused

2014-06-24 Thread Mohit Anchlia
When I try to run an HDFS API program I get connection refused. However,
hadoop fs -ls and other commands work fine. I don't see anything wrong in
the application web UI.


[mohit@localhost eg]$ hadoop jar hadoop-labs-0.0.1-SNAPSHOT.jar
org.hadoop.qstride.lab.eg.HelloWorld
hdfs://localhost/user/mohit/helloworld.dat

14/06/24 22:11:12 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

Exception in thread "main" java.net.ConnectException: Call From
localhost.localdomain/127.0.0.1 to localhost:8020 failed on connection
exception: java.net.ConnectException: Connection refused; For more details
see: http://wiki.apache.org/hadoop/ConnectionRefused



[mohit@localhost eg]$ hadoop fs -ls /user/mohit/eg

 Found 1 items

-rw-r--r-- 1 mohit hadoop 13 2014-06-24 21:34 /user/mohit/eg/helloworld.dat

[mohit@localhost eg]$
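
A "Call From ... to localhost:8020" from an API program while the shell commands
work often means the program's Configuration did not pick up the cluster's
core-site.xml and fell back to the default NameNode address. A minimal sketch; the
/etc/hadoop/conf path and the commented fs.defaultFS URI below are assumptions and
should match whatever the working hadoop fs commands actually use (on recent
versions, hdfs getconf -confKey fs.defaultFS shows it):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HelloWorldRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Either make sure core-site.xml/hdfs-site.xml are on the classpath,
    // or point at them (or at fs.defaultFS) explicitly.
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    // conf.set("fs.defaultFS", "hdfs://localhost:9000");   // alternative, placeholder URI

    FileSystem fs = FileSystem.get(conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(args[0]));   // e.g. /user/mohit/eg/helloworld.dat
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}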


Re: Unsatisfied link error

2014-06-19 Thread Mohit Anchlia
Could somebody suggest what might be wrong here?

On Wed, Jun 18, 2014 at 5:37 PM, Mohit Anchlia 
wrote:

> I installed hadoop and now when I try to run "hadoop fs" I get this error.
> I am using openjdk 64 bit on a virtual machine on a centos vm. I am also
> listing my environment variable and one specific message I get when running
> resource manager.
>
>
> Environment variables:
>
> export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.55.x86_64
> export HADOOP_HEAPSIZE="500"
> export HADOOP_NAMENODE_INIT_HEAPSIZE="500"
> export HADOOP_HOME=/opt/yarn/hadoop-2.4.0
> export HADOOP_MAPRED_HOME=$HADOOP_HOME/
> export HADOOP_COMMON_HOME=$HADOOP_HOME
> export HADOOP_HDFS_HOME=$HADOOP_HOME/
> export HADOOP_YARN_HOME=$HADOOP_HOME/
> export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
> export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
> export
> HADOOP_CLASSPATH=$CLASSPATH:$HADOOP_CLASSPATH:$HADOOP_HOME/lib/*:$HADOOP_HOME/share/hadoop/tools/lib/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/httpfs/tomcat/lib/*:$HADOOP_HOME/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/*:$HADOOP_HOME/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*
> export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
> export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
>
> -
>
> [yarn@localhost sbin]$ ./yarn-daemon.sh start nodemanager
>
> starting nodemanager, logging to
> /opt/yarn/hadoop-2.4.0/logs/yarn-yarn-nodemanager-localhost.localdomain.out
>
> OpenJDK 64-Bit Server VM warning: You have loaded library
> /opt/yarn/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0 which might have
> disabled stack guard. The VM will try to fix the stack guard now.
>
> It's highly recommended that you fix the library with 'execstack -c
> ', or link it with '-z noexecstack'.
>
> [yarn@localhost sbin]$ jps
>
> 6257 Jps
>
> 3785 ResourceManager
>
> 5746 NodeManager
>
> [yarn@localhost sbin]$
> - Then the error when running hadoop
> fs -ls ---
>
> [root@localhost yarn]# hadoop fs -ls /
>
> SLF4J: Class path contains multiple SLF4J bindings.
>
> SLF4J: Found binding in
> [jar:file:/opt/yarn/hadoop-2.4.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>
> SLF4J: Found binding in
> [jar:file:/opt/yarn/hadoop-2.4.0/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
>
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>
> -ls: Fatal internal error
>
> java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
>
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
>
> at org.apache.hadoop.security.Groups.(Groups.java:64)
>
> at
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>
> at
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>
> at
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:232)
>
> at
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:718)
>
> at
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:703)
>
> at
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:605)
>
> at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:2554)
>
> at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:2546)
>
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2412)
>
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:352)
>
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>
> at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
>
> at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:224)
>
> at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:207)
>
> at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
>
> at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
>
> at org.apache.hadoop.fs.FsShell.run(FsShell.java:255)
>
> at org.apa

Re: Hadoop classpath

2014-06-18 Thread Mohit Anchlia
No, I didn't set that up. Do you know which script it's used in? I have
HADOOP_HOME set up.

On Wed, Jun 18, 2014 at 8:01 PM, bo yang  wrote:

> Hi Mohit,
>
> Did you set up HADOOP_INSTALL?
>
> For me, I did following:
>
> export HADOOP_INSTALL=/usr/local/hadoop-2.4.0
>
>
> Bo
>
>
> On Wed, Jun 18, 2014 at 4:27 PM, Mohit Anchlia 
> wrote:
>
>> I installed Yarn on a single node and now when I try to run hadoop fs I
>> get :
>>
>> Error: Could not find or load main class FsShell
>>
>> It appears to be a HADOOP_CLASSPATH issue and I am wondering how I can
>> build the classpath? Should I find all the jars in HADOOP_HOME?
>>
>
>


Unsatisfied link error

2014-06-18 Thread Mohit Anchlia
I installed Hadoop and now when I try to run "hadoop fs" I get this error.
I am using 64-bit OpenJDK on a CentOS virtual machine. I am also
listing my environment variables and one specific message I get when running the
resource manager.


Environment variables:

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.55.x86_64
export HADOOP_HEAPSIZE="500"
export HADOOP_NAMENODE_INIT_HEAPSIZE="500"
export HADOOP_HOME=/opt/yarn/hadoop-2.4.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME/
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME/
export HADOOP_YARN_HOME=$HADOOP_HOME/
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export
HADOOP_CLASSPATH=$CLASSPATH:$HADOOP_CLASSPATH:$HADOOP_HOME/lib/*:$HADOOP_HOME/share/hadoop/tools/lib/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/httpfs/tomcat/lib/*:$HADOOP_HOME/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/*:$HADOOP_HOME/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

-

[yarn@localhost sbin]$ ./yarn-daemon.sh start nodemanager

starting nodemanager, logging to
/opt/yarn/hadoop-2.4.0/logs/yarn-yarn-nodemanager-localhost.localdomain.out

OpenJDK 64-Bit Server VM warning: You have loaded library
/opt/yarn/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0 which might have
disabled stack guard. The VM will try to fix the stack guard now.

It's highly recommended that you fix the library with 'execstack -c
<libfile>', or link it with '-z noexecstack'.

[yarn@localhost sbin]$ jps

6257 Jps

3785 ResourceManager

5746 NodeManager

[yarn@localhost sbin]$
- Then the error when running hadoop fs
-ls ---

[root@localhost yarn]# hadoop fs -ls /

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in
[jar:file:/opt/yarn/hadoop-2.4.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in
[jar:file:/opt/yarn/hadoop-2.4.0/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

-ls: Fatal internal error

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException

at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)

at org.apache.hadoop.security.Groups.<init>(Groups.java:64)

at
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)

at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)

at
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:232)

at
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:718)

at
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:703)

at
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:605)

at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2554)

at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2546)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2412)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:352)

at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)

at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)

at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:224)

at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:207)

at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)

at org.apache.hadoop.fs.shell.Command.run(Command.java:154)

at org.apache.hadoop.fs.FsShell.run(FsShell.java:255)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)

at org.apache.hadoop.fs.FsShell.main(FsShell.java:308)

Caused by: java.lang.reflect.InvocationTargetException

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)

... 23 more

Caused by: java.lang.UnsatisfiedLinkError:
org.apache.hadoop.sec

Hadoop classpath

2014-06-18 Thread Mohit Anchlia
I installed Yarn on a single node and now when I try to run hadoop fs I get
:

Error: Could not find or load main class FsShell

It appears to be a HADOOP_CLASSPATH issue and I am wondering how I can
build the classpath? Should I find all the jars in HADOOP_HOME?


Hadoop 1.2

2014-06-18 Thread Mohit Anchlia
Does Hadoop MapReduce code compiled against 1.2 work with YARN?

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
</dependency>


Heartbeat info in AM and NM in Yarn

2014-06-13 Thread Mohit Anchlia
It appears that the NM and AM both send heartbeats to the resource manager.
It also appears that they send the health information of the containers
running on the node. I am trying to understand: what is the core difference
between the two?


Compare Yarn with V1

2014-06-12 Thread Mohit Anchlia
Is there a good resource that draws out the similarities and compares YARN's resource
manager, application manager, etc. with the job tracker, task tracker, etc.?


Re: scp files from Hadoop to different linux box

2013-11-08 Thread Mohit Anchlia
I think distcp would be more advantageous as opposed to scp

Sent from my iPhone

On Nov 8, 2013, at 6:08 PM, Krishnan Narayanan  wrote:

> Hi All,
> 
> I am having a scenario where I need to scp files from HDFS to another linux 
> box. I use oozie and hive for generating the data.
> I know I can run shell from local and get files from hadoop and scp, but I 
> would want it to be part of oozie action.
> I tried creating a shell action but it failed to connecet the target box.
> Please throw some light.
> Thanks in advance for helping.
> 
> Thanks
> Krishnan
> 


YARN MapReduce 2 concepts

2013-09-19 Thread Mohit Anchlia
I am going through the concepts of resource manager, application master and
node manager. As I understand it, the resource manager receives the job submission
and launches the application master. It also launches the node manager to monitor
the application master. My questions are:

1. Is the node manager long lived, and does one node manager monitor all the
containers launched on the data nodes?
2. How is resource negotiation done between the application master and the
resource manager? In other words, what happens during this step? Does the
resource manager look at the active and pending tasks and the resources
consumed by those before giving containers to the application master?
3. In the old map reduce cluster, task trackers send
periodic heartbeats to the job tracker nodes. How does this compare to
YARN? It looks like the application master is a task tracker? Little confused
here.
4. It looks like the client polls the application master to get the progress of the
job, but initially the client connects to the resource manager. How does the client
get a reference to the application master? Does it mean that the client gets the
node ip/port from the resource manager where the application master was launched by
the resource manager?


Re: Hardware Selection for Hadoop

2013-05-05 Thread Mohit Anchlia
Multiple NICs provide two benefits: 1) high availability, and 2) increased
network bandwidth when using an LACP-type bonding model.

On Sun, May 5, 2013 at 8:41 PM, Rahul Bhattacharjee  wrote:

>  OK. I do not know if I understand the spindle / core thing. I will dig
> more into that.
>
> Thanks for the info.
>
> One more thing , whats the significance of multiple NIC.
>
> Thanks,
> Rahul
>
>
> On Mon, May 6, 2013 at 12:17 AM, Ted Dunning wrote:
>
>>
>> Data nodes normally are also task nodes.  With 8 physical cores it isn't
>> that unreasonable to have 64GB whereas 24GB really is going to pinch.
>>
>> Achieving highest performance requires that you match the capabilities of
>> your nodes including CPU, memory, disk and networking.  The standard wisdom
>> is 4-6GB of RAM per core, at least a spindle per core and 1/2 to 2/3 of
>> disk bandwidth available as network bandwidth.
>>
>> If you look at the different configurations mentioned in this thread, you
>> will see different limitations.
>>
>> For instance:
>>
>>  2 x Quad cores Intel
>>> 2-3 TB x 6 SATA < 6 disk < desired 8 or more
>>> 64GB mem< slightly larger than necessary
>>> 2 1GBe NICs teaming < 2 x 100 MB << 400MB = 2/3 x 6 x 100MB
>>
>>
>> This configuration is mostly limited by networking bandwidth
>>
>>  2 x Quad cores Intel
>>> 2-3 TB x 6 SATA < 6 disk < desired 8 or more
>>> 24GB mem< 24GB << 8 x 6GB
>>> 2 10GBe NICs teaming< 2 x 1000 MB > 400MB = 2/3 x 6 x 100MB
>>
>>
>> This configuration is weak on disk relative to CPU and very weak on disk
>> relative to network speed.  The worst problem, however, is likely to be
>> small memory.  This will likely require us to decrease the number of slots
>> by half or more making it impossible to even use the 6 disks that we have
>> and making the network even more outrageously over-provisioned.
>>
>>
>>
>>
>> On Sun, May 5, 2013 at 9:41 AM, Rahul Bhattacharjee <
>> rahul.rec@gmail.com> wrote:
>>
>>>  IMHO ,64 G looks bit high for DN. 24 should be good enough for DN.
>>>
>>>
>>> On Tue, Apr 30, 2013 at 12:19 AM, Patai Sangbutsarakum <
>>> patai.sangbutsara...@turn.com> wrote:
>>>
 2 x Quad cores Intel
 2-3 TB x 6 SATA
 64GB mem
 2 NICs teaming

 my 2 cents


  On Apr 29, 2013, at 9:24 AM, Raj Hadoop 
  wrote:

  Hi,

 I have to propose some hardware requirements in my company for a Proof
 of Concept with Hadoop. I was reading Hadoop Operations and also saw
 Cloudera Website. But just wanted to know from the group - what is the
 requirements if I have to plan for a 5 node cluster. I dont know at this
 time, the data that need to be processed at this time for the Proof of
 Concept. So - can you suggest something to me?

 Regards,
 Raj



>>>
>>
>


Re: rack awareness in hadoop

2013-04-20 Thread Mohit Anchlia
And don't forget to look at ulimit settings as well

Sent from my iPhone

On Apr 20, 2013, at 5:07 PM, Marcos Luis Ortiz Valmaseda 
 wrote:

> Like, Aaron say, this problem is related the Linux memory manager.
> You can tune it using the vm.overcommit_memory=1.
> Before to do any change, read all resources first:
> http://www.thegeekstuff.com/2012/02/linux-memory-swap-cache-shared-vm/
> http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
> http://lwn.net/Articles/317814/
> http://www.tldp.org/LDP/tlk/mm/memory.html
> 
> To learn more about how to tune kernel variables for Hadoop applications. 
> Read these links too:
> First, the amazing Hadoop Operations´s book from Eric:
> http://my.safaribooksonline.com/book/databases/hadoop/9781449327279/4dot-planning-a-hadoop-cluster/id2685120
> 
> Hadoop Performance Tuning Guide from AMD:
> http://developer.amd.com.php53-23.ord1-1.websitetestlink.com/wordpress/media/2012/10/Hadoop_Tuning_Guide-Version5.pdf
> 
> Intel® Distribution for Apache Hadoop*  Software: Optimization and Tuning 
> Guide:
> http://hadoop.intel.com/pdfs/IntelDistributionTuningGuide.pdf
> 
> Best wishes.
> 
> 
> 
> 2013/4/20 Aaron Eng 
>> The problem is probably not related to the JVM memory so much as the Linux 
>> memory manager.  The exception is in 
>> java.lang.UNIXProcess.(UNIXProcess.java:148) which would imply this is 
>> happening when trying to create a new process.  The initial malloc for the 
>> new process space is being denied by the memory manager.  There could be 
>> many reasons why this happens, though the most likely is your overcommit 
>> settings and swap space.  I'd suggest reading through these details:
>> 
>> https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
>> 
>> On Sat, Apr 20, 2013 at 4:00 PM, Kishore Yellamraju 
>>  wrote:
>>> All,
>>> 
>>> I have posted this question to CDH ML ,  but i guess i can post it here 
>>> because its a general hadoop question.
>>> 
>>> When the NN or JT gets the rack info, i guess it stores the info in memory. 
>>> can i ask you where in the JVM memory it will store the results ( perm gen 
>>> ?) ? .  I am getting "cannot allocate memory on NN and JT " and they have 
>>> more than enough memory. when i looked at JVM usage stats i can see it 
>>> doesnt have enough perm free space.so if its storing the values in perm gen 
>>>  then there is a chance of this memory issues.
>>> 
>>> 
>>> Thanks in advance !!!
>>> 
>>> 
>>> exception that i see in logs :
>>> 
>>> java.io.IOException: Cannot run program "/etc/hadoop/conf/topo.sh" (in 
>>> directory "/usr/lib/hadoop-0.20-mapreduce"): java.io.IOException: error=12, 
>>> Cannot allocate memory
>>> at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>>> at org.apache.hadoop.util.Shell.runCommand(Shell.java:206)
>>> at org.apache.hadoop.util.Shell.run(Shell.java:188)
>>> at 
>>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:381)
>>> at 
>>> org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:242)
>>> at 
>>> org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:180)
>>> at 
>>> org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
>>> at 
>>> org.apache.hadoop.mapred.JobTracker.resolveAndAddToTopology(JobTracker.java:2750)
>>> at 
>>> org.apache.hadoop.mapred.JobInProgress.createCache(JobInProgress.java:593)
>>> at 
>>> org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:765)
>>> at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:3775)
>>> at 
>>> org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:90)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>> at java.lang.Thread.run(Thread.java:619)
>>> Caused by: java.io.IOException: java.io.IOException: error=12, Cannot 
>>> allocate memory
>>> at java.lang.UNIXProcess.(UNIXProcess.java:148)
>>> at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>>> at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>>> ... 14 more
>>> 2013-04-20 02:07:28,298 ERROR org.apache.hadoop.mapred.JobTracker: Job 
>>> initialization failed:
>>> java.lang.NullPointerException
>>> 
>>> 
>>> -Thanks
>>>  kishore kumar yellamraju |Ground control operations|kish...@rocketfuel.com 
>>> | 408.203.0424
>>> 
>>> 
> 
> 
> 
> -- 
> Marcos Ortiz Valmaseda,
> Data-Driven Product Manager at PDVSA
> Blog: http://dataddict.wordpress.com/
> LinkedIn: http://www.linkedin.com/in/marcosluis2186
> Twitter: @marcosluis2186


Re: Very basic question

2013-04-20 Thread Mohit Anchlia
All dirs start with drw in your example

Sent from my iPhone

On Apr 20, 2013, at 4:46 PM, Raj Hadoop  wrote:

> Thanks Thariq. Following is the list. where are the actual directories . how 
> can i traverse to the directories? can i?
> 
> 2013-04-20 19:45:18.772 java[3742:1603] Unable to load realm info from 
> SCDynamicStore
> drwxr-xr-x   - hadoop supergroup  0 2013-04-20 18:05 /user/hadoop
> drwxr-xr-x   - hadoop supergroup  0 2013-04-20 17:00 
> /user/hadoop/input1
> -rw-r--r--   1 hadoop supergroup 48 2013-04-20 17:00 
> /user/hadoop/input1/sample.txt
> drwxr-xr-x   - hadoop supergroup  0 2013-04-20 17:46 
> /user/hadoop/output
> -rw-r--r--   1 hadoop supergroup  0 2013-04-20 17:46 
> /user/hadoop/output/_SUCCESS
> drwxr-xr-x   - hadoop supergroup  0 2013-04-20 17:45 
> /user/hadoop/output/_logs
> drwxr-xr-x   - hadoop supergroup  0 2013-04-20 17:45 
> /user/hadoop/output/_logs/history
> -rw-r--r--   1 hadoop supergroup  11262 2013-04-20 17:45 
> /user/hadoop/output/_logs/history/job_201304201653_0002_1366494345385_hadoop_word+count
> -rw-r--r--   1 hadoop supergroup  20395 2013-04-20 17:45 
> /user/hadoop/output/_logs/history/job_201304201653_0002_conf.xml
> -rw-r--r--   1 hadoop supergroup 60 2013-04-20 17:46 
> /user/hadoop/output/part-r-0
> drwxr-xr-x   - hadoop supergroup  0 2013-04-20 18:06 
> /user/hadoop/output1
> -rw-r--r--   1 hadoop supergroup  0 2013-04-20 18:06 
> /user/hadoop/output1/_SUCCESS
> drwxr-xr-x   - hadoop supergroup  0 2013-04-20 18:05 
> /user/hadoop/output1/_logs
> drwxr-xr-x   - hadoop supergroup  0 2013-04-20 18:05 
> /user/hadoop/output1/_logs/history
> -rw-r--r--   1 hadoop supergroup  11266 2013-04-20 18:05 
> /user/hadoop/output1/_logs/history/job_201304201653_0004_1366495541487_hadoop_word+count
> -rw-r--r--   1 hadoop supergroup  20396 2013-04-20 18:05 
> /user/hadoop/output1/_logs/history/job_201304201653_0004_conf.xml
> -rw-r--r--   1 hadoop supergroup 60 2013-04-20 18:06 
> /user/hadoop/output1/part-r-0
> 
> 
> 
> From: Mohammad Tariq 
> To: "user@hadoop.apache.org" ; Raj Hadoop 
>  
> Sent: Saturday, April 20, 2013 7:40 PM
> Subject: Re: Very basic question
> 
> do this :
> 
> /Users/hadoop/hadoop-1.0.4/ bin/hadoop fs -lsr /user
> 
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
> 
> 
> On Sun, Apr 21, 2013 at 5:07 AM, Raj Hadoop  wrote:
> This is my folder structure. Can you help me?
> 
> /Users/hadoop/hadoop-1.0.4
> hadoop$ ls -lrt
> total 14960
> drwxr-xr-x   3 hadoop  staff  102 Oct  3  2012 share
> drwxr-xr-x   9 hadoop  staff  306 Oct  3  2012 webapps
> drwxr-xr-x  52 hadoop  staff 1768 Oct  3  2012 lib
> -rw-r--r--   1 hadoop  staff   287807 Oct  3  2012 hadoop-tools-1.0.4.jar
> -rw-r--r--   1 hadoop  staff  2656646 Oct  3  2012 hadoop-test-1.0.4.jar
> -rw-r--r--   1 hadoop  staff  413 Oct  3  2012 
> hadoop-minicluster-1.0.4.jar
> -rw-r--r--   1 hadoop  staff   142452 Oct  3  2012 hadoop-examples-1.0.4.jar
> -rw-r--r--   1 hadoop  staff  3928530 Oct  3  2012 hadoop-core-1.0.4.jar
> -rw-r--r--   1 hadoop  staff  410 Oct  3  2012 hadoop-client-1.0.4.jar
> -rw-r--r--   1 hadoop  staff 6840 Oct  3  2012 hadoop-ant-1.0.4.jar
> drwxr-xr-x  10 hadoop  staff  340 Oct  3  2012 contrib
> drwxr-xr-x   9 hadoop  staff  306 Oct  3  2012 sbin
> -rw-r--r--   1 hadoop  staff10525 Oct  3  2012 ivy.xml
> drwxr-xr-x  13 hadoop  staff  442 Oct  3  2012 ivy
> drwxr-xr-x  69 hadoop  staff 2346 Oct  3  2012 docs
> -rw-r--r--   1 hadoop  staff 1366 Oct  3  2012 README.txt
> -rw-r--r--   1 hadoop  staff  101 Oct  3  2012 NOTICE.txt
> -rw-r--r--   1 hadoop  staff13366 Oct  3  2012 LICENSE.txt
> -rw-r--r--   1 hadoop  staff   446999 Oct  3  2012 CHANGES.txt
> drwxr-xr-x  18 hadoop  staff  612 Oct  3  2012 src
> drwxr-xr-x   4 hadoop  staff  136 Oct  3  2012 c++
> -rw-r--r--   1 hadoop  staff   119875 Oct  3  2012 build.xml
> drwxr-xr-x   4 hadoop  staff  136 Oct  3  2012 libexec
> drwxr-xr-x  19 hadoop  staff  646 Oct  3  2012 bin
> drwxr-xr-x   3 hadoop  staff  102 Apr 20 17:04 myprogs
> drwxr-xr-x  18 hadoop  staff  612 Apr 20 17:16 conf
> -rwx--   1 hadoop  staff   65 Apr 20 17:56 run_raj1.sh
> -rwx--   1 hadoop  staff   66 Apr 20 18:05 run_raj2.sh
> drwxr-xr-x  26 hadoop  staff  884 Apr 20 18:05 logs
> 
> From: Mohammad Tariq 
> To: "user@hadoop.apache.org" ; Raj Hadoop 
>  
> Sent: Saturday, April 20, 2013 7:27 PM
> 
> Subject: Re: Very basic question
> 
> try to look into /user dir inside ur hdfs
> 
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
> 
> 
> On Sun, Apr 21, 2013 at 4:52 AM, Mohammad Tariq  wrote:
> ok..do u remember the command?
> 
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
> 
> 
> On Sun, Apr 21, 2013 at 4:45 AM, Raj Hadoop  wrote:
> Tariq,
> 
> Than

Re: execvp: Permission denied

2013-03-16 Thread Mohit Anchlia
Can you provide some details of the JDK type and version?

Sent from my iPhone

On Mar 13, 2013, at 9:52 AM, Heinz Stockinger  
wrote:

> Hello,
> 
> I've set up Hadoop on two machine and would like to test it with a simple 
> test job. The setup/program works with single-node setup but not with the 
> distributed environment. I get the following error when I run the simple 
> org.myorg.WordCount program:
> 
> bin/hadoop jar examples/wordcount.jar org.myorg.WordCount input2 output22
> 
> 13/03/13 17:40:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
> the arguments. Applications should implement Tool for the same.
> 13/03/13 17:40:56 WARN mapred.JobClient: No job jar file set.  User classes 
> may not be found. See JobConf(Class) or JobConf#setJar(String).
> 13/03/13 17:40:56 INFO input.FileInputFormat: Total input paths to process : 3
> 13/03/13 17:40:56 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 13/03/13 17:40:56 WARN snappy.LoadSnappy: Snappy native library not loaded
> 13/03/13 17:40:56 INFO mapred.JobClient: Running job: job_201303131649_0008
> 13/03/13 17:40:57 INFO mapred.JobClient:  map 0% reduce 0%
> 13/03/13 17:41:07 INFO mapred.JobClient: Task Id : 
> attempt_201303131649_0008_m_04_0, Status : FAILED
> java.lang.Throwable: Child Error
>at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
> 
> attempt_201303131649_0008_m_04_0: execvp: Permission denied
> 
> I'm not sure where the "permission denied" is actually caused. Do you have 
> any hints? My user can access the HDFS formated space.
> 
> Thanks,
> Heinz
> 


Re: Replication factor

2013-03-12 Thread Mohit Anchlia
Does it mean that if I set the replication factor on directory /abc and then run a -put
command to add a file to that directory, it will use the new replication
factor set on /abc?
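
For reference, HDFS tracks the replication factor per file rather than per directory:
a -setrep on /abc changes the files that already exist there, while files added later
take the dfs.replication of the client that writes them. A small sketch (the paths
and the factor of 2 are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // New files get the client-side dfs.replication value at create time.
    conf.setInt("dfs.replication", 2);
    FileSystem fs = FileSystem.get(conf);

    // copyFromLocalFile is the programmatic equivalent of `hadoop fs -put`.
    fs.copyFromLocalFile(new Path("/tmp/data.dat"), new Path("/abc/data.dat"));

    // Replication of an existing file can be changed afterwards as well.
    fs.setReplication(new Path("/abc/data.dat"), (short) 2);
  }
}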

On Tue, Mar 12, 2013 at 2:04 PM, Chris Embree  wrote:

> Aww..  You could've used lmgtfy.com :)
>
>
> On Tue, Mar 12, 2013 at 4:57 PM, varun kumar  wrote:
>
>> http://hadoopblogfromvarun.wordpress.com/
>>
>>
>> On Wed, Mar 13, 2013 at 2:16 AM, Mohit Anchlia wrote:
>>
>>> Is it possible to set replication factor to a different value than the
>>> default at the directory level?
>>
>>
>>
>>
>> --
>> Regards,
>> Varun Kumar.P
>>
>
>


Re: mapred.max.tracker.failures

2013-03-07 Thread Mohit Anchlia
Thanks, this is very helpful.
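
For anyone finding this in the archive: the same knob can also be set per job from
code. A small sketch using the old mapred API (4 shown here only because it is the
default value):

import org.apache.hadoop.mapred.JobConf;

public class TrackerFailureConfig {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Equivalent to setting mapred.max.tracker.failures: after this many task
    // failures of this job on one tasktracker, that tracker is blacklisted for
    // this job only (it can still run tasks of other jobs).
    conf.setMaxTaskFailuresPerTracker(4);
    System.out.println("mapred.max.tracker.failures = "
        + conf.getMaxTaskFailuresPerTracker());
  }
}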

On Wed, Mar 6, 2013 at 10:03 PM, bharath vissapragada <
bharathvissapragada1...@gmail.com> wrote:

> No, its the number of task failures in  a job after which that
> particular tasktracker can be blacklisted *for that job*! Note that it
> can take tasks from other jobs!
>
> On Thu, Mar 7, 2013 at 11:21 AM, Mohit Anchlia 
> wrote:
>  > I am wondering what the correct behaviour is of this parameter? If
> it's set
> > to 4 does it mean job should fail if a job has more than 4 failures?
>


mapred.max.tracker.failures

2013-03-06 Thread Mohit Anchlia
I am wondering what the correct behaviour of this parameter is. If it's set
to 4, does it mean the job should fail if it has more than 4 failures?


Re: Moving data in hadoop

2013-01-24 Thread Mohit Anchlia
Have you looked at distcp?

On Thu, Jan 24, 2013 at 5:55 PM, Raj hadoop  wrote:

> Hi,
>
> Can you please suggest me what is the good way to move 1 peta byte of data
> from one cluster to another cluster?
>
> Thanks
> Raj
>


Re: Loading file to HDFS with custom chunk structure

2013-01-16 Thread Mohit Anchlia
Look at  the block size concept in Hadoop and see if that is what you are 
looking for 

Sent from my iPhone

On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist  
wrote:

> I want to load a SegY file onto HDFS of a 3-node Apache Hadoop cluster.
> 
> To summarize, the SegY file consists of :
> 
> 3200 bytes textual header
> 400 bytes binary header
> Variable bytes data
> The 99.99% size of the file is due to the variable bytes data which is 
> collection of thousands of contiguous traces. For any SegY file to make 
> sense, it must have the textual header+binary header+at least one trace of 
> data. What I want to achieve is to split a large SegY file onto the Hadoop 
> cluster so that a smaller SegY file is available on each node for local 
> processing.
> 
> The scenario is as follows:
> 
> The SegY file is large in size(above 10GB) and is resting on the local file 
> system of the NameNode machine
> The file is to be split on the nodes in such a way each node has a small SegY 
> file with a strict structure - 3200 bytes textual header + 400 bytes binary 
> header + variable bytes data.
> As is obvious, I can't blindly use
> FSDataOutputStream or hadoop fs -copyFromLocal as this may not ensure the 
> format in which the chunks of the larger file are required.
> Please guide me as to how I must proceed.
> 
> Thanks and regards !


Re: Query mongodb

2013-01-16 Thread Mohit Anchlia
Hadoop knows about files and blocks, so you can achieve data locality if you are
accessing files directly.

I think in your case you'll have to develop your own logic that can take
advantage of it.

Sent from my iPhone

On Jan 16, 2013, at 6:56 AM, John Lilley  wrote:

> Um, I think you and I are talking about the same thing, but maybe not?
>  
> Certainly HBase/MongoDB are HDFS-aware, so I would expect that if I am a 
> client program running outside of the Hadoop cluster and I do a query, the 
> database tools will construct query processing such that data is read and 
> processed in an optimal fashion (using MapReduce?), before the aggregated 
> information is shipped to me on the client side. 
>  
> The question I was asking is a little different although hopefully the answer 
> is just as simple.  Can I write mapper/reducer that queries HBase/MongoDB and 
> have MR schedule my mappers such that each mapper is receiving tuples that 
> have been read in a locality-aware fashion?
>  
> john
>  
> From: Mohammad Tariq [mailto:donta...@gmail.com] 
> Sent: Wednesday, January 16, 2013 7:47 AM
> To: user@hadoop.apache.org
> Subject: Re: Query mongodb
>  
> MapReduce framework tries its best to run the jobs on the nodes 
> where  data is located. It is its fundamental nature. You don't have 
> to do anything extra.
>  
> *I am sorry if I misunderstood the question.
>  
> 
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>  
> 
> On Wed, Jan 16, 2013 at 8:10 PM, John Lilley  wrote:
> How does one schedule mappers to read MongoDB or HBase in a 
> data-locality-aware fashion?
> -john
>  
> From: Mohammad Tariq [mailto:donta...@gmail.com] 
> Sent: Wednesday, January 16, 2013 3:29 AM
> To: user@hadoop.apache.org
> Subject: Re: Query mongodb
>  
> Yes. You can use MongoDB-Hadoop adapter to achieve that. Through this adapter 
> you can pull the data, process it and push it back to your MongoDB backed 
> datastore by writing MR jobs.
>  
> It is also 100% possible to query Hbase or JSON files, or anything else for 
> that matter, stored in HDFS.
> 
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>  
> 
> On Wed, Jan 16, 2013 at 3:50 PM, Panshul Whisper  
> wrote:
> Hello,
> Is it possible or how is it possible to query mongodb directly from hadoop.
> 
> Or is it possible to query hbase or json files stored in hdfs in a similar 
> way as we can query the json documents in mongodb.
> 
> Suggestions please.
> 
> Thank you.
> Regards,
> Panshul.
> 
>  
>  


Re: queues in haddop

2013-01-10 Thread Mohit Anchlia
Have you looked at flume?

Sent from my iPhone

On Jan 10, 2013, at 7:12 PM, Panshul Whisper  wrote:

> Hello,
> 
> I have a Hadoop cluster setup of 10 nodes and I am in need of implementing 
> queues in the cluster for receiving high volumes of data.
> Please suggest what will be more efficient to use in the case of receiving 24 
> Million Json files.. approx 5 KB each in every 24 hours :
> 1. Using Capacity Scheduler
> 2. Implementing RabbitMQ and receive data from them using Spring Integration 
> Data pipe lines.
> 
> I cannot afford to loose any of the JSON files received.
> 
> Thanking You,
> 
> -- 
> Regards,
> Ouch Whisper
> 010101010101


hadoop -put command

2012-12-26 Thread Mohit Anchlia
It looks like the hadoop fs -put command doesn't like ":" in file names. Is
there a way I can escape it?


hadoop fs -put /home/mapr/p/hjob.2012:12:26:11.0.dat
/user/apuser/temp-qdc/scratch/merge_jobs

put: java.net.URISyntaxException: Relative path in absolute URI:
hjob.2012:12:26:11.0.dat
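
One workaround that sidesteps the URI parsing of the colons is to go through the
FileSystem API and build the source Path from a file: URI. A sketch, reusing the
local and HDFS paths from the command above:

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFileWithColons {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Building the source Path from a file: URI avoids the
    // "Relative path in absolute URI" error triggered when the name
    // is parsed as scheme:rest because of the colons.
    Path src = new Path(new File("/home/mapr/p/hjob.2012:12:26:11.0.dat").toURI());
    Path dst = new Path("/user/apuser/temp-qdc/scratch/merge_jobs");

    fs.copyFromLocalFile(src, dst);
  }
}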


Re: Merging files

2012-12-22 Thread Mohit Anchlia
Thanks for the info. I was trying not to use NFS because my data might
be 10-20 GB in size for every merge I perform. I'll use pig instead.

In distcp I checked and none of the directories are duplicates. Looking at
the logs it looks like it's failing because all those directories have
sub-directories of the same name.
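
A small sketch of the single-writer alternative quoted further down in the thread
(one target stream, all source files copied into it). It is a single process, so for
10-20 GB a Pig script or an identity MR job will give more write parallelism; the
command-line arguments here are assumed to be the source directories followed by the
target file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeDirs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Last argument is the merged target file; the rest are source dirs,
    // e.g. .../2012/12/20/22/output/appinfo .../2012/12/21/00/output/appinfo
    Path target = new Path(args[args.length - 1]);
    FSDataOutputStream out = fs.create(target);
    try {
      for (int i = 0; i < args.length - 1; i++) {
        for (FileStatus status : fs.listStatus(new Path(args[i]))) {
          if (status.isDir()) {   // isDirectory() on newer APIs
            continue;             // skip sub-directories
          }
          FSDataInputStream in = fs.open(status.getPath());
          try {
            IOUtils.copyBytes(in, out, conf, false);
          } finally {
            IOUtils.closeStream(in);
          }
        }
      }
    } finally {
      out.close();
    }
  }
}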

On Sat, Dec 22, 2012 at 2:05 PM, Ted Dunning  wrote:

> A pig script should work quite well.
>
> I also note that the file paths have maprfs in them.  This implies that
> you are using MapR and could simply use the normal linux command cat to
> concatenate the files if you mount the files using NFS (depending on
> volume, of course).  For small amounts of data, this would work very well.
>  For large amounts of data, you would be better with some kind of
> map-reduce program.  Your Pig script is just the sort of thing.
>
> Keep in mind if you write a map-reduce program (or pig script) that you
> will wind up with as many outputs as you have reducers.  If you have only a
> single reducer, you will get one output file, but that will mean that only
> a single process will do all the writing.  That would be no faster than
> using the cat + NFS method above.  Having multiple reducers will allow you
> to have write parallelism.
>
> The error message that distcp is giving you is a little odd, however,
> since it implies that some of your input files are repeated.  Is that
> possible?
>
>
>
> On Sat, Dec 22, 2012 at 12:53 PM, Mohit Anchlia wrote:
>
>> Tried distcp but it fails. Is there a way to merge them? Or else I could
>> write a pig script to load from multiple paths
>>
>>
>> org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input, there
>> are duplicated files in the sources:
>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
>>
>> at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
>>
>> at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
>>
>> at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
>>
>> at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
>>
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>
>> at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)
>>
>>
>>  On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning wrote:
>>
>>> The technical term for this is "copying".  You may have heard of it.
>>>
>>> It is a subject of such long technical standing that many do not
>>> consider it worthy of detailed documentation.
>>>
>>> Distcp effects a similar process and can be modified to combine the
>>> input files into a single file.
>>>
>>> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>>>
>>>
>>> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish wrote:
>>>
>>>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>>>
>>>>
>>>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J  wrote:
>>>>
>>>>> Yes, via the simple act of opening a target stream and writing all
>>>>> source streams into it. Or to save code time, an identity job with a
>>>>> single reducer (you may not get control over ordering this way).
>>>>>
>>>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <
>>>>> mohitanch...@gmail.com> wrote:
>>>>> > Is it possible to merge files from different locations from HDFS
>>>>> location
>>>>> > into one file into HDFS location?
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Harsh J
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Merging files

2012-12-22 Thread Mohit Anchlia
Tried distcp but it fails. Is there a way to merge them? Or else I could
write a pig script to load from multiple paths


org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input, there
are duplicated files in the sources:
maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo

at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)

at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)

at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)

at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)

at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)


On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning  wrote:

> The technical term for this is "copying".  You may have heard of it.
>
> It is a subject of such long technical standing that many do not consider
> it worthy of detailed documentation.
>
> Distcp effects a similar process and can be modified to combine the input
> files into a single file.
>
> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>
>
> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish wrote:
>
>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>
>>
>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J  wrote:
>>
>>> Yes, via the simple act of opening a target stream and writing all
>>> source streams into it. Or to save code time, an identity job with a
>>> single reducer (you may not get control over ordering this way).
>>>
>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia 
>>> wrote:
>>> > Is it possible to merge files from different locations from HDFS
>>> location
>>> > into one file into HDFS location?
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>


Re: Alerting

2012-12-22 Thread Mohit Anchlia
Need alerting

On Sat, Dec 22, 2012 at 12:44 PM, Mohammad Tariq  wrote:

> MR web UI?Although we can't trigger anything, it provides all the info
> related to the jobs. I mean it would be easier to just go there and and
> have a look at everything rather than opening the shell and typing the
> command.
>
> I'm a bit lazy ;)
>
>  Best Regards,
> Tariq
> +91-9741563634
> https://mtariq.jux.com/
>
>
> On Sun, Dec 23, 2012 at 2:09 AM, Mohit Anchlia wrote:
>
>> Best I can find is hadoop job list so far
>>
>>
>> On Sat, Dec 22, 2012 at 12:30 PM, Mohit Anchlia 
>> wrote:
>>
>>> What's the best way to trigger alert when jobs run for too long or have
>>> many failures? Is there a hadoop command that can be used to perform this
>>> activity?
>>
>>
>>
>


Re: Alerting

2012-12-22 Thread Mohit Anchlia
The best I can find so far is hadoop job -list.
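
The same information is reachable from code, which makes it easier to hook into an
alerting system. A sketch using the old mapred JobClient API; the two-hour threshold
is arbitrary and the println stands in for whatever alerting hook is actually used:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

public class LongRunningJobCheck {
  // Tune to whatever "too long" means for your jobs.
  private static final long MAX_RUNTIME_MS = 2 * 60 * 60 * 1000L;

  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    long now = System.currentTimeMillis();
    for (JobStatus status : client.jobsToComplete()) {   // not-yet-finished jobs
      long runtime = now - status.getStartTime();
      if (runtime > MAX_RUNTIME_MS) {
        // Replace with an email/monitoring call in a real check.
        System.out.println("ALERT: " + status.getJobID()
            + " has been running for " + (runtime / 60000) + " minutes");
      }
    }
  }
}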

On Sat, Dec 22, 2012 at 12:30 PM, Mohit Anchlia wrote:

> What's the best way to trigger alert when jobs run for too long or have
> many failures? Is there a hadoop command that can be used to perform this
> activity?


Alerting

2012-12-22 Thread Mohit Anchlia
What's the best way to trigger an alert when jobs run for too long or have
many failures? Is there a hadoop command that can be used to perform this
activity?


Re: can local disk of reduce task cause the job to fail?

2012-12-09 Thread Mohit Anchlia
The reducer will not start executing until the shuffle and sort phase is complete.

Sent from my iPhone

On Dec 9, 2012, at 4:09 AM, Majid Azimi  wrote:

> Hi guys,
> 
> Hadoop the definitive guide says: reduce tasks will start only when all maps 
> has done their work.  Also this link says:
> 
> >> The shuffle and sort phases occur simultaneously; while map-outputs are 
> >> being fetched they are merged.
> 
> What I have understood is that when a reducer task starts then all data it 
> needs(including a key and associated values) have been transferred to its 
> local node. Am I right? if this is true then, the node running reduce task 
> must have enough storage to hold all values associated with that key, else 
> The job will fail.
> 
> If no, then reduce job starts with some available data and shuffle + sort 
> phase feed reduce task contiguously, thus low storage on node does not cause 
> problem because data is coming on demand.
> 
> which of the two cases actually happen?


Re: Trouble with Word Count example

2012-11-29 Thread Mohit Anchlia
Also check permissions; you are doing sudo.

Sent from my iPhone

On Nov 29, 2012, at 1:00 PM, "Kartashov, Andy"  wrote:

> Maybe you stepped out of your working directory. “$ ls –l”  Do you see your 
> .jar?
>  
> From: Sandeep Jangra [mailto:sandeepjan...@gmail.com] 
> Sent: Thursday, November 29, 2012 3:46 PM
> To: user@hadoop.apache.org
> Subject: Re: Trouble with Word Count example
>  
> Hi Harsh,
>  
>   I tried putting the generic option first, but it throws exception file not 
> found.
>   The jar is in current directory. Then I tried giving absolute path of this 
> jar, but that also brought no luck.
>  
>   sudo -u hdfs hadoop jar word_cnt.jar WordCount2  -libjars=word_cnt.jar 
> /tmp/root/input /tmp/root/output17 
> Exception in thread "main" java.io.FileNotFoundException: File word_cnt.jar 
> does not exist.
> at 
> org.apache.hadoop.util.GenericOptionsParser.validateFiles(GenericOptionsParser.java:384)
> at 
> org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:280)
> at 
> org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:418)
> at 
> org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:168)
> at 
> org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:151)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
>  
>   Also, I have been deleting my jars and the class directory before each new 
> try. So even I am suspicious why do I see this:
> "12/11/29 10:20:59 WARN mapred.JobClient: No job jar file set.  User classes 
> may not be found. See JobConf(Class) or JobConf#setJar(String)."
>  
>   Could it be that my hadoop is running on old jar files (the one with 
> package name "mapred" (not mapreduce))
>   But my program is using new jars as well.
>  
>   I can try going back to old word count example on the apache site and using 
> old jars.
>  
>   Any other pointers would be highly appreciated. Thanks
>  
>   
> 
> On Thu, Nov 29, 2012 at 2:42 PM, Harsh J  wrote:
> I think you may have not recompiled your application properly.
> 
> Your runtime shows this:
> 
> 12/11/29 10:20:59 WARN mapred.JobClient: No job jar file set.  User
> classes may not be found. See JobConf(Class) or
> JobConf#setJar(String).
> 
> Which should not appear, cause your code has this (which I suspect you
> may have added later, accidentally?):
> 
> job.setJarByClass(WordCount2.class);
> 
> So if you can try deleting the older jar and recompiling it, the
> problem would go away.
> 
> Also, when passing generic options such as -libjars, etc., they need
> to go first in order. I mean, it should always be [Classname] [Generic
> Options] [Application Options]. Otherwise, they may not get utilized
> properly.
> 
> On Fri, Nov 30, 2012 at 12:51 AM, Sandeep Jangra
>  wrote:
> > Yups I can see my class files there.
> >
> >
> > On Thu, Nov 29, 2012 at 2:13 PM, Kartashov, Andy 
> > wrote:
> >>
> >> Can you try running jar –tvf word_cnt.jar and see if your static nested
> >> classes WordCount2$Map.class and WordCount2$Reduce.class have actually been
> >> added to the jar.
> >>
> >>
> >>
> >> Rgds,
> >>
> >> AK47
> >>
> >>
> >>
> >>
> >>
> >> From: Sandeep Jangra [mailto:sandeepjan...@gmail.com]
> >> Sent: Thursday, November 29, 2012 1:36 PM
> >> To: user@hadoop.apache.org
> >> Subject: Re: Trouble with Word Count example
> >>
> >>
> >>
> >> Also, I did set the HADOOP_CLASSPATH variable to point to the word_cnt.jar
> >> only.
> >>
> >>
> >>
> >> On Thu, Nov 29, 2012 at 10:54 AM, Sandeep Jangra 
> >> wrote:
> >>
> >> Thanks for the quick response Mahesh.
> >>
> >>
> >>
> >> I am using the following command:
> >>
> >>
> >>
> >> sudo -u hdfs hadoop jar word_cnt.jar WordCount2  /tmp/root/input
> >> /tmp/root/output15  -libjars=word_cnt.jar
> >>
> >> (The input directory exists on the hdfs)
> >>
> >>
> >>
> >> This is how I compiled and packaged it:
> >>
> >>
> >>
> >> javac -classpath
> >> /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar:/usr/lib/hadoop/*  -d
> >> word_cnt WordCount2.java
> >>
> >> jar -cvf word_cnt.jar -C word_cnt/ .
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Nov 29, 2012 at 10:46 AM, Mahesh Balija
> >>  wrote:
> >>
> >> Hi Sandeep,
> >>
> >>
> >>
> >>For me everything seems to be alright.
> >>
> >>Can you tell us how are you running this job?
> >>
> >>
> >>
> >> Best,
> >>
> >> Mahesh.B.
> >>
> >> Calsoft Labs.
> >>
> >> On Thu, Nov 29, 2012 at 9:01 PM, Sandeep Jangra 
> >> wrote:
> >>
> >> Hello everyone,
> >>
> >>
> >>
> >>   Like most others I am also running into some problems while running my
> >> word count example.
> >>
> >>   I tried the various suggestion available on internet, but I guess it;s
> >> time to go on email :)
> >>
> >>
> >>
> >>   Here is the error that I am getting:
> >>
> >>   12/11/29 10:20:59 WARN mapred.JobClient: Use GenericOptionsParser for
> >> parsing the arguments.
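
For reference, a driver that goes through ToolRunner, so that generic
options such as -libjars and -D are parsed before the application
arguments, could look roughly like the sketch below; the class names and
the simple tokenizing mapper are illustrative, not the original WordCount2
code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.length() > 0) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws java.io.IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    // By the time run() is called, args holds only the leftover application
    // arguments: the input and output paths.
    Job job = new Job(getConf(), "wordcount");
    job.setJarByClass(WordCountDriver.class);  // ships the job jar
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner runs GenericOptionsParser first, so -libjars, -D and friends
    // are consumed before run() sees the remaining arguments.
    System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
  }
}

With a layout like this the invocation order becomes, for example,
hadoop jar word_cnt.jar WordCountDriver -libjars extra-deps.jar
/tmp/root/input /tmp/root/output, where -libjars is only needed for
additional dependency jars, since setJarByClass already ships the job jar
itself.
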

Re: Trouble with Word Count example

2012-11-29 Thread Mohit Anchlia
Try giving the full path to the jar, or prefixing it with ./

Sent from my iPhone

On Nov 29, 2012, at 1:00 PM, "Kartashov, Andy"  wrote:

> Maybe you stepped out of your working directory. “$ ls –l”  Do you see your 
> .jar?
>  
> From: Sandeep Jangra [mailto:sandeepjan...@gmail.com] 
> Sent: Thursday, November 29, 2012 3:46 PM
> To: user@hadoop.apache.org
> Subject: Re: Trouble with Word Count example
>  
> Hi Harsh,
>  
>   I tried putting the generic option first, but it throws exception file not 
> found.
>   The jar is in current directory. Then I tried giving absolute path of this 
> jar, but that also brought no luck.
>  
>   sudo -u hdfs hadoop jar word_cnt.jar WordCount2  -libjars=word_cnt.jar 
> /tmp/root/input /tmp/root/output17 
> Exception in thread "main" java.io.FileNotFoundException: File word_cnt.jar 
> does not exist.
> at 
> org.apache.hadoop.util.GenericOptionsParser.validateFiles(GenericOptionsParser.java:384)
> at 
> org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:280)
> at 
> org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:418)
> at 
> org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:168)
> at 
> org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:151)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
>  
>   Also, I have been deleting my jars and the class directory before each new 
> try. So even I am suspicious why do I see this:
> "12/11/29 10:20:59 WARN mapred.JobClient: No job jar file set.  User classes 
> may not be found. See JobConf(Class) or JobConf#setJar(String)."
>  
>   Could it be that my hadoop is running on old jar files (the one with 
> package name "mapred" (not mapreduce))
>   But my program is using new jars as well.
>  
>   I can try going back to old word count example on the apache site and using 
> old jars.
>  
>   Any other pointers would be highly appreciated. Thanks
>  
>   
> 
> On Thu, Nov 29, 2012 at 2:42 PM, Harsh J  wrote:
> I think you may have not recompiled your application properly.
> 
> Your runtime shows this:
> 
> 12/11/29 10:20:59 WARN mapred.JobClient: No job jar file set.  User
> classes may not be found. See JobConf(Class) or
> JobConf#setJar(String).
> 
> Which should not appear, cause your code has this (which I suspect you
> may have added later, accidentally?):
> 
> job.setJarByClass(WordCount2.class);
> 
> So if you can try deleting the older jar and recompiling it, the
> problem would go away.
> 
> Also, when passing generic options such as -libjars, etc., they need
> to go first in order. I mean, it should always be [Classname] [Generic
> Options] [Application Options]. Otherwise, they may not get utilized
> properly.
> 
> On Fri, Nov 30, 2012 at 12:51 AM, Sandeep Jangra
>  wrote:
> > Yups I can see my class files there.
> >
> >
> > On Thu, Nov 29, 2012 at 2:13 PM, Kartashov, Andy 
> > wrote:
> >>
> >> Can you try running jar –tvf word_cnt.jar and see if your static nested
> >> classes WordCount2$Map.class and WordCount2$Reduce.class have actually been
> >> added to the jar.
> >>
> >>
> >>
> >> Rgds,
> >>
> >> AK47
> >>
> >>
> >>
> >>
> >>
> >> From: Sandeep Jangra [mailto:sandeepjan...@gmail.com]
> >> Sent: Thursday, November 29, 2012 1:36 PM
> >> To: user@hadoop.apache.org
> >> Subject: Re: Trouble with Word Count example
> >>
> >>
> >>
> >> Also, I did set the HADOOP_CLASSPATH variable to point to the word_cnt.jar
> >> only.
> >>
> >>
> >>
> >> On Thu, Nov 29, 2012 at 10:54 AM, Sandeep Jangra 
> >> wrote:
> >>
> >> Thanks for the quick response Mahesh.
> >>
> >>
> >>
> >> I am using the following command:
> >>
> >>
> >>
> >> sudo -u hdfs hadoop jar word_cnt.jar WordCount2  /tmp/root/input
> >> /tmp/root/output15  -libjars=word_cnt.jar
> >>
> >> (The input directory exists on the hdfs)
> >>
> >>
> >>
> >> This is how I compiled and packaged it:
> >>
> >>
> >>
> >> javac -classpath
> >> /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar:/usr/lib/hadoop/*  -d
> >> word_cnt WordCount2.java
> >>
> >> jar -cvf word_cnt.jar -C word_cnt/ .
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Nov 29, 2012 at 10:46 AM, Mahesh Balija
> >>  wrote:
> >>
> >> Hi Sandeep,
> >>
> >>
> >>
> >>For me everything seems to be alright.
> >>
> >>Can you tell us how are you running this job?
> >>
> >>
> >>
> >> Best,
> >>
> >> Mahesh.B.
> >>
> >> Calsoft Labs.
> >>
> >> On Thu, Nov 29, 2012 at 9:01 PM, Sandeep Jangra 
> >> wrote:
> >>
> >> Hello everyone,
> >>
> >>
> >>
> >>   Like most others I am also running into some problems while running my
> >> word count example.
> >>
> >>   I tried the various suggestion available on internet, but I guess it;s
> >> time to go on email :)
> >>
> >>
> >>
> >>   Here is the error that I am getting:
> >>
> >>   12/11/29 10:20:59 WARN mapred.JobClient: Use GenericOptionsParser for
> >> parsing the arguments. Applications sh

Re: Assigning reduce tasks to specific nodes

2012-11-28 Thread Mohit Anchlia
Look at the scheduler's locality delay parameter.

Sent from my iPhone

On Nov 28, 2012, at 8:44 PM, Harsh J  wrote:

> None of the current schedulers are "strict" in the sense of "do not
> schedule the task if such a tasktracker is not available". That has
> never been a requirement for Map/Reduce programs and nor should be.
> 
> I feel if you want some code to run individually on all nodes for
> whatever reason, you may as well ssh into each one and start it
> manually with appropriate host-based parameters, etc.. and then
> aggregate their results.
> 
> Note that even if you get down to writing a scheduler for this (which
> I don't think is a good idea, but anyway), you ought to make sure your
> scheduler also does non-strict scheduling of data local tasks for jobs
> that don't require such strictness - in order for them to complete
> quickly than wait around for scheduling in a fixed manner.
> 
> On Thu, Nov 29, 2012 at 6:00 AM, Hiroyuki Yamada  wrote:
>> Thank you all for the comments and advices.
>> 
>> I know it is not recommended to assigning mapper locations by myself.
>> But There needs to be one mapper running in each node in some cases,
>> so I need a strict way to do it.
>> 
>> So, locations is taken care of by JobTracker(scheduler), but it is not 
>> strict.
>> And, the only way to do it strictly is making a own scheduler, right ?
>> 
>> I have checked the source and I am not sure where to modify to do it.
>> What I understand is FairScheduler and others are for scheduling
>> multiple jobs. Is this right ?
>> What I want to do is scheduling tasks in one job.
>> This can be achieved by FairScheduler and others ?
>> 
>> Regards,
>> Hiroyuki
>> 
>> On Thu, Nov 29, 2012 at 12:46 AM, Michael Segel
>>  wrote:
>>> Mappers? Uhm... yes you can do it.
>>> Yes it is non-trivial.
>>> Yes, it is not recommended.
>>> 
>>> I think we talk a bit about this in an InfoQ article written by Boris
>>> Lublinsky.
>>> 
>>> Its kind of wild when your entire cluster map goes red in ganglia... :-)
>>> 
>>> 
>>> On Nov 28, 2012, at 2:41 AM, Harsh J  wrote:
>>> 
>>> Hi,
>>> 
>>> Mapper scheduling is indeed influenced by the getLocations() returned
>>> results of the InputSplit.
>>> 
>>> The map task itself does not care about deserializing the location
>>> information, as it is of no use to it. The location information is vital to
>>> the scheduler (or in 0.20.2, the JobTracker), where it is sent to directly
>>> when a job is submitted. The locations are used pretty well here.
>>> 
>>> You should be able to control (or rather, influence) mapper placement by
>>> working with the InputSplits, but not strictly so, cause in the end its up
>>> to your MR scheduler to do data local or non data local assignments.
>>> 
>>> 
>>> On Wed, Nov 28, 2012 at 11:39 AM, Hiroyuki Yamada 
>>> wrote:
 
 Hi Harsh,
 
 Thank you for the information.
 I understand the current circumstances.
 
 How about for mappers ?
 As far as I tested, location information in InputSplit is ignored in
 0.20.2,
 so there seems no easy way for assigning mappers to specific nodes.
 (I before checked the source and noticed that
 location information is not restored when deserializing the InputSplit
 instance.)
 
 Thanks,
 Hiroyuki
 
 On Wed, Nov 28, 2012 at 2:08 PM, Harsh J  wrote:
> This is not supported/available currently even in MR2, but take a look
> at
> https://issues.apache.org/jira/browse/MAPREDUCE-199.
> 
> 
> On Wed, Nov 28, 2012 at 9:34 AM, Hiroyuki Yamada 
> wrote:
>> 
>> Hi,
>> 
>> I am wondering how I can assign reduce tasks to specific nodes.
>> What I want to do is, for example,  assigning reducer which produces
>> part-0 to node xxx000,
>> and part-1 to node xxx001 and so on.
>> 
>> I think it's abount task assignment scheduling but
>> I am not sure where to customize to achieve this.
>> Is this done by writing some extensions ?
>> or any easier way to do this ?
>> 
>> Regards,
>> Hiroyuki
> 
> 
> 
> 
> --
> Harsh J
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Harsh J
> 
> 
> 
> -- 
> Harsh J
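
To make the getLocations() point concrete, one rough way to influence (not
force) mapper placement is to rebuild the splits with the hosts you prefer.
The input format below is only a sketch: the host names and the round-robin
assignment are assumptions, and the schedulers still treat these locations
as hints rather than strict constraints:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PinnedTextInputFormat extends TextInputFormat {

  private static final String[] HOSTS = { "xxx000", "xxx001" };  // assumed host names

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> original = super.getSplits(job);
    List<InputSplit> pinned = new ArrayList<InputSplit>(original.size());
    for (int i = 0; i < original.size(); i++) {
      FileSplit split = (FileSplit) original.get(i);
      // Re-declare each split's preferred host; the scheduler reads these
      // locations when it hands out map tasks.
      String host = HOSTS[i % HOSTS.length];
      pinned.add(new FileSplit(split.getPath(), split.getStart(),
          split.getLength(), new String[] { host }));
    }
    return pinned;
  }
}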


Re: Is it possible to read a corrupted Sequence File

2012-11-24 Thread Mohit Anchlia
I guess one way might be to write your own DFS reader that ignores the
exceptions and reads whatever it can.

Sent from my iPad

On Nov 23, 2012, at 6:12 PM, Hs  wrote:

> Hi,
> 
> I am running hadoop 1.0.3 and hbase-0.94.0on a 12-node cluster. For unknown 
> operational faults, 6 datanodes  have suffered a complete data loss(hdfs data 
> directory gone).  When I restart hadoop, it reports "The ratio of reported 
> blocks 0.8252".
> 
> I have a folder in hdfs containing many important files in hadoop 
> SequenceFile format. The hadoop fsck tool shows that  (in this folder) 
> 
> Total size:134867556461 B
>  Total dirs:16
>  Total files:   251
>  Total blocks (validated):  2136 (avg. block size 63140241 B)
>   
>   CORRUPT FILES:167
>   MISSING BLOCKS:   405
>   MISSING SIZE: 25819446263 B
>   CORRUPT BLOCKS:   405
>   
> 
> I wonder if I can read these corrupted SequenceFiles with missing blocks 
> skipped ?  Or, what else can I do now to recover these SequenceFiles as much 
> as possible ? 
> 
> Please save me.
> 
> Thanks !
> 
> (Sorry for duplicating this post on user and hdfs-dev list, I do not know 
> where exactly i should put it.)
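
A rough sketch of such a best-effort reader for one SequenceFile is below.
The 64 MB skip size and the strategy of jumping to the next sync marker
after a read error are assumptions; records inside the unreadable regions
are simply lost, and the sync() call itself can still fail if it lands in
another missing block:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SalvageSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
    long fileLen = fs.getFileStatus(path).getLen();
    long skipBytes = 64L * 1024 * 1024;  // assumed HDFS block size

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    try {
      long pos = 0;
      while (pos < fileLen) {
        try {
          if (!reader.next(key, value)) {
            break;  // clean end of file
          }
          pos = reader.getPosition();
          // ... handle the recovered record here ...
        } catch (IOException e) {
          // Unreadable region (e.g. a missing block): jump past it and re-sync
          // to the next sync marker, losing whatever was in between.
          pos = Math.min(pos + skipBytes, fileLen);
          reader.sync(pos);
        }
      }
    } finally {
      reader.close();
    }
  }
}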


Re: Hadoop and Hbase site xml

2012-11-12 Thread Mohit Anchlia
I already have it working using the xml files. I was trying to see which
parameters I need to pass to the conf object. Should I take all the
parameters from the xml files and set them on the conf object?
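
If you do go the key=value route, a rough sketch is to copy every property
into the Configuration yourself. The properties file name below is made up;
typical entries would be things like fs.default.name, mapred.job.tracker and
hbase.zookeeper.quorum, and for an HBase client you would normally start
from HBaseConfiguration.create() instead of a bare Configuration:

import java.io.FileInputStream;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;

public class PropsToConf {
  // Loads a plain key=value file and copies every entry into a Configuration.
  public static Configuration load(String propsFile) throws Exception {
    Properties props = new Properties();
    FileInputStream in = new FileInputStream(propsFile);
    try {
      props.load(in);
    } finally {
      in.close();
    }
    Configuration conf = new Configuration();
    for (String name : props.stringPropertyNames()) {
      conf.set(name, props.getProperty(name));  // e.g. fs.default.name=hdfs://nn:9000
    }
    return conf;
  }
}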

On Mon, Nov 12, 2012 at 7:17 PM, Yanbo Liang  wrote:

> There are two candidate:
> 1) You need to copy your Hadoop/HBase configuration such as
> common-site.xml, hdfs-site.xml, or *hbase-site.xml *file from "etc" or
> "conf" subdirectory of Hadoop/HBase installation directory into the Java
> project directory. Then the configuration of Hadoop/HBase will be auto
> loaded and the client can use directly.
> 2) Explicit set the configuration at your client code, such as:
> conf = new Configuration();
>   conf.set("fs.defaultFS","hdfs://192.168.12.132:9000/");
>
> You can reference the following link:
>
> http://autofei.wordpress.com/2012/04/02/java-example-code-using-hbase-data-model-operations/
>
> 2012/11/13 Mohammad Tariq 
>
>> try copying files from hadoop in hbase to each other's conf directory.
>>
>> Regards,
>> Mohammad Tariq
>>
>>
>>
>> On Tue, Nov 13, 2012 at 5:04 AM, Mohit Anchlia wrote:
>>
>>> Is it necessary to add hadoop and hbase site xmls in the classpath of
>>> the java client? Is there any other way we can configure it using general
>>> properties file using key=value?
>>
>>
>>
>


Re: Reading files from a directory

2012-11-12 Thread Mohit Anchlia
I was actually looking for an example of doing it in Java code, but I
think I've found a way: iterate over all the files returned by the
globStatus() method.
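
Roughly, that looks like the sketch below; the directory URI is made up and
the sketch assumes plain text files:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadDirectory {
  public static void main(String[] args) throws Exception {
    String dir = "hdfs://namenode:9000/data/input";  // illustrative URI
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dir), conf);
    FileStatus[] statuses = fs.globStatus(new Path(dir, "*"));
    if (statuses == null) {
      return;  // nothing matched the pattern
    }
    for (FileStatus status : statuses) {
      if (status.isDir()) {
        continue;  // skip sub-directories
      }
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(status.getPath())));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          // ... process each line ...
        }
      } finally {
        reader.close();
      }
    }
  }
}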

On Mon, Nov 12, 2012 at 5:50 PM, yinghua hu  wrote:

> Hi, Mohit
>
> You can input everything in a directory. See the step 12 in this link.
>
> http://raseshmori.wordpress.com/
>
>
> On Mon, Nov 12, 2012 at 5:40 PM, Mohit Anchlia wrote:
>
>> Using Java dfs api is it possible to read all the files in a directory?
>> Or do I need to list all the files in the directory and then read it?
>
>
>
>
> --
> Regards,
>
> Yinghua
>


Reading files from a directory

2012-11-12 Thread Mohit Anchlia
Using Java dfs api is it possible to read all the files in a directory? Or
do I need to list all the files in the directory and then read it?


Re: Reading from sequence file using java FS api

2012-11-12 Thread Mohit Anchlia
I was simply able to read it using the code below and didn't have to
decompress anything. It looks like the reader detects the codec and
decompresses the file before returning the data to the user.
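
For completeness, the read loop ends up looking roughly like this, reusing
the fs, path and conf objects from the snippet quoted below; the
LongWritable/Text key and value types simply match that snippet and should
be whatever the file was written with:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
try {
  LongWritable key = new LongWritable();
  Text value = new Text();
  // The reader picks the codec up from the file header, so next() already
  // returns decompressed records.
  while (reader.next(key, value)) {
    System.out.println(key + "\t" + value);
  }
} finally {
  reader.close();
}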

On Mon, Nov 12, 2012 at 3:16 PM, Mohit Anchlia wrote:

> I am looking for an example that reads a snappy-compressed sequence file.
> Could someone point me to it? What I have so far is this:
>
>
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(URI.create(uri), conf);
> Path path = new Path(uri);
> SequenceFile.Reader reader = null;
> org.apache.hadoop.io.LongWritable key = new org.apache.hadoop.io.LongWritable();
> org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();
> try {
>   reader = new SequenceFile.Reader(fs, path, conf);
>


Hadoop and Hbase site xml

2012-11-12 Thread Mohit Anchlia
Is it necessary to add hadoop and hbase site xmls in the classpath of the
java client? Is there any other way we can configure it using general
properties file using key=value?


Reading from sequence file using java FS api

2012-11-12 Thread Mohit Anchlia
I am looking for an example that reads a snappy-compressed sequence file.
Could someone point me to it? What I have so far is this:


Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path path = new Path(uri);
SequenceFile.Reader reader = null;
org.apache.hadoop.io.LongWritable key = new org.apache.hadoop.io.LongWritable();
org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();
try {
  reader = new SequenceFile.Reader(fs, path, conf);


Re: Replication

2012-10-30 Thread Mohit Anchlia
Thanks, and if it is not the datanode then I am guessing the namenode
decides the nodes in the replication pipeline?

On Tue, Oct 30, 2012 at 5:36 PM, ranjith raghunath <
ranjith.raghuna...@gmail.com> wrote:

> If your client node is a datanode with your cluster then the first copy
> does get written to that data node.
>
> Experts please feel free to correct me here.
>  On Oct 30, 2012 7:11 PM, "Mohit Anchlia"  wrote:
>
>> With respect to replication if I run pig job from one of the nodes within
>> the Hadoop cluster then do I always end up with writing 1 replica copy to
>> that client node always and remaining 2 replica copies to other nodes?
>>
>>
>


Replication

2012-10-30 Thread Mohit Anchlia
With respect to replication if I run pig job from one of the nodes within
the Hadoop cluster then do I always end up with writing 1 replica copy to
that client node always and remaining 2 replica copies to other nodes?


Re: Hadoop on Isilon problem

2012-10-17 Thread Mohit Anchlia
Look at the directory permissions? RunJar unpacks the job jar into a temp
directory under hadoop.tmp.dir, so that directory has to be writable by the
user submitting the job.

On Wed, Oct 17, 2012 at 12:18 PM, Artem Ervits  wrote:

>  Anyone using Hadoop running on Isilon NAS? I am trying to submit a job
> with a user other than the one running Hadoop and I’m getting the following
> error:
>
> ** **
>
> Exception in thread "main" java.io.IOException: Permission denied
>
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
>
> at java.io.File.checkAndCreate(File.java:1717)
>
> at java.io.File.createTempFile0(File.java:1738)
>
> at java.io.File.createTempFile(File.java:1815)
>
> at org.apache.hadoop.util.RunJar.main(RunJar.java:115)
>
> ** **
>
> ** **
>
> Any ideas?
>
> ** **
>
> ** **
>
> Artem Ervits
>
> Data Analyst
>
> New York Presbyterian Hospital
>
> ** **
>
> 
>
> This electronic message is intended to be for the use only of the named 
> recipient, and may contain information that is confidential or privileged.  
> If you are not the intended recipient, you are hereby notified that any 
> disclosure, copying, distribution or use of the contents of this message is 
> strictly prohibited.  If you have received this message in error or are not 
> the named recipient, please notify us immediately by contacting the sender at 
> the electronic mail address noted above, and delete and destroy all copies of 
> this message.  Thank you.
>
>
> 
>
>
>
>
>


Re: ethernet bonding / 802.3ad / link aggregation

2012-08-26 Thread Mohit Anchlia
On Sun, Aug 26, 2012 at 10:59 AM, Koert Kuipers  wrote:

> Thanks! I will post results. What is the recommended way to measure it? (I
> am on Centos 5.x)
>

Use iperf between the hosts that have the bonded interfaces.

>
> On Sun, Aug 26, 2012 at 1:58 PM, Mohit Anchlia wrote:
>
>>
>>
>> On Sun, Aug 26, 2012 at 10:47 AM, Koert Kuipers wrote:
>>
>>> we are looking at channel bonding / link aggregation (2 x 1 Gbit/s) on
>>> our hadoop slaves.
>>> what is the recommended bonding mode? i found some references to mode 4
>>> (802.3ad, hardware based) and to mode 6 (balance-alb, software based) in
>>> hadoop mailing lists.
>>> thanks! koert
>>>
>>
>> I have configured LACP before and it seems to be the standard. Do post
>> your results; in general don't expect to see 2 Gbps, you'll probably get
>> around 1.5 Gbps or less depending on the hardware vendor.
>>
>
>


Re: ethernet bonding / 802.3ad / link aggregation

2012-08-26 Thread Mohit Anchlia
On Sun, Aug 26, 2012 at 10:47 AM, Koert Kuipers  wrote:

> we are looking at channel bonding / link aggregation (2 x 1 Gbit/s) on
> our hadoop slaves.
> what is the recommended bonding mode? i found some references to mode 4
> (802.3ad, hardware based) and to mode 6 (balance-alb, software based) in
> hadoop mailing lists.
> thanks! koert
>

I have configured LACP before and it seems to be the standard. Do post your
results; in general don't expect to see 2 Gbps, you'll probably get around
1.5 Gbps or less depending on the hardware vendor.


Re: Learning hadoop

2012-08-23 Thread Mohit Anchlia
Start by reading the MapReduce paper and then look at the Hadoop book
(Hadoop: The Definitive Guide).

On Thu, Aug 23, 2012 at 9:19 AM, Pravin Sinha  wrote:

>  Hi,
>
> I am new to Hadoop. What would be the best way to learn  hadoop and eco
> system around it?
>
> Thanks,
> Pravin
>
>


Re: Hadoop Real time help

2012-08-20 Thread Mohit Anchlia
One of the most common use cases is to perform all IO-intensive batch jobs
in HDFS and then load the more structured data, or the output of those jobs,
into HBase or Solr for quick access. But if your dataset is small enough to
fit into memory, you could also cache it in memory. There are various
options depending on your requirements; Bertrand has already highlighted
some of them below.

On Mon, Aug 20, 2012 at 12:37 AM, Bertrand Dechoux wrote:

> The terms are
> * ESP : http://en.wikipedia.org/wiki/Event_stream_processing
> * CEP : http://en.wikipedia.org/wiki/Complex_event_processing
>
> By the way, processing streams in real time tends toward being a pleonasm.
>
> MapReduce follows a batch architecture. You keep data until a given time.
> You then process everything. And at the end you provide all the results.
> Stream processing has by definition a more 'smooth' throughput. Each event
> is processed at a time and potentially each processing could lead to a
> result.
>
> I don't know any complete overview of such tools.
> Esper is well known in that space.
> FlumeBase was an attempt to do something similar (as far as I can tell).
> It shows how an ESP engine fits with log collection using a tool such as
> Flume.
>
> Then you also have other solutions which will allow you to scale such as
> Storm.
> A few people have already considered using Storm for scalability and Esper
> to do the real computation.
>
> Regards
>
> Bertrand
>
>
> On Sun, Aug 19, 2012 at 9:44 PM, Niels Basjes  wrote:
>
>> Is there a "complete" overview of the tools that allow processing streams
>> of data in realtime?
>>
>> Or even better; what are the terms to google for?
>>
>> --
>> Met vriendelijke groet,
>> Niels Basjes
>> (Verstuurd vanaf mobiel )
>> Op 19 aug. 2012 18:22 schreef "Bertrand Dechoux" 
>> het volgende:
>>
>> That's a good question. More and more people are talking about Hadoop
>>> Real Time.
>>> One key aspect of this question is whether we are talking about
>>> MapReduce or not.
>>>
>>> MapReduce greatly improves the response time of any data intensive jobs
>>> but it is still a batch framework with a noticeable latency.
>>>
>>> There is multiple ways to improve the latency :
>>> * ESP/CEP solutions (like Esper, FlumeBase, ...)
>>> * Big Table clones (like HBase ...)
>>> * YARN with a non MapReduce application
>>> * ...
>>>
>>> But it will really depend on the context and the definition of 'real
>>> time'.
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>>
>>>
>>> On Sun, Aug 19, 2012 at 5:44 PM, mahout user wrote:
>>>
 Hello folks,


I am new to hadoop, I just want to get information that how hadoop
 framework is usefull for real time service.?can any one explain me..?

 Thanks.

>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>>
>>
>
>
> --
> Bertrand Dechoux
>


Re: FW: Streaming Issue

2012-08-19 Thread Mohit Anchlia
Are you looking for something like this?

hadoop jar hadoop-streaming.jar -input file1 -input file2

On Sun, Aug 19, 2012 at 11:16 AM, Siddharth Tiwari <
siddharth.tiw...@live.com> wrote:

>
>
> **
>  Hi Friends,
>
> Can you please suggest me how can I pass 3 files as parameters to the
> mapper written in python in hadoop streaming API, which will process data
> from this three different files . Please help.
>
>
>
> ****
> *Cheers !!!*
> *Siddharth Tiwari*
> Have a refreshing day !!!
> *"Every duty is holy, and devotion to duty is the highest form of worship
> of God.” *
> *"Maybe other people will try to limit me but I don't limit myself"*
>


Re: Hadoop Real time help

2012-08-19 Thread Mohit Anchlia
On Sun, Aug 19, 2012 at 8:44 AM, mahout user  wrote:

> Hello folks,
>
>
>I am new to hadoop, I just want to get information that how hadoop
> framework is usefull for real time service.?can any one explain me..?
>
> Thanks.
>

Can you specify your use case? Each use case calls for different design
considerations.


Re: Map Reduce Question

2012-08-17 Thread Mohit Anchlia
See Hadoop: The Definitive Guide and search for the chapter on MapReduce
features.

On Fri, Aug 17, 2012 at 6:20 PM, Manoj Khangaonkar wrote:

> One usage of these is in a secondary sort , which is used , when you
> want the output values from Map sorted (within a key).
>
> You implement a KeyComparator and tell mapreduce to use it to order
> the keys using a composite key.
>
> To ensure that during partioning & Grouping , all the records for a
> key go to the same reducer
> you need to define and set a Partitioner and ValueGroupingComparator.
>
> search for Secondary Sort for more on this topic
>
> regards
>
> MJ
>
> On Fri, Aug 17, 2012 at 12:00 PM, Anbarasan Murthy
>  wrote:
> > Hi,
> >
> > I have a question in mapreduce api.
> >
> > Would like to know the significance of the following items under jobconf
> > class.
> > ValueGroupingComparator
> > KeyComparator
> >
> >
> >
> > Thanks,
> > Anbu.
>
>
>
> --
> http://khangaonkar.blogspot.com/
>
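
To make the roles concrete, a rough secondary-sort wiring in the old JobConf
API could look like the sketch below. The composite Text key layout
("naturalKey" + tab + "secondaryField") and the class names are assumptions
for the sketch, not a fixed recipe:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SecondarySortExample {

  // KeyComparator: sorts map output by natural key first, then by the
  // secondary field, so values reach the reducer in the desired order.
  public static class FullKeyComparator extends WritableComparator {
    public FullKeyComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return a.toString().compareTo(b.toString());
    }
  }

  // Grouping comparator: one reduce() call per natural key, ignoring the
  // secondary field.
  public static class NaturalKeyGroupingComparator extends WritableComparator {
    public NaturalKeyGroupingComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return naturalKey(a).compareTo(naturalKey(b));
    }
    private static String naturalKey(WritableComparable key) {
      return key.toString().split("\t", 2)[0];
    }
  }

  // Partitioner: routes every composite key with the same natural key to the
  // same reducer.
  public static class NaturalKeyPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}
    public int getPartition(Text key, Text value, int numPartitions) {
      String naturalKey = key.toString().split("\t", 2)[0];
      return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  public static void wire(JobConf conf) {
    conf.setOutputKeyComparatorClass(FullKeyComparator.class);
    conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
    conf.setPartitionerClass(NaturalKeyPartitioner.class);
  }
}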


Re: MAPREDUCE-3661

2012-08-12 Thread Mohit Anchlia
On Sun, Aug 12, 2012 at 6:12 AM, Harsh J  wrote:

> Hi Mohit,
>
> If Windows is going to be your primary development platform, I suggest
> you checkout the svn branch-1-win
> (http://svn.apache.org/viewvc/hadoop/common/branches/branch-1-win) and
> build and use that instead. That branch currently targets making
> Hadoop easier to use on Windows, and is a better approach to go
> through than to try to hack things into place.
>
> I'd prefer running on Linux myself, for we primarily develop Hadoop on
> that platform.


All our environments are Linux, except the developer environment, which we
generally prefer to run locally on developers' laptops, and those can be
Windows or Mac. That's where I was trying to get it working on Windows. I
think I am almost there, except for this bug that occurs right at the end
during cleanup.

>
> On Fri, Aug 10, 2012 at 1:49 AM, Mohit Anchlia 
> wrote:
> > I am facing this issue on 0.20.2, is there a workaround for this that I
> can
> > employ? I can create alias scripts to return expected results if that is
> an
> > option.
> >
> > https://issues.apache.org/jira/browse/MAPREDUCE-3661
> >
>
>
>
> --
> Harsh J
>


MAPREDUCE-3661

2012-08-09 Thread Mohit Anchlia
I am facing this issue on 0.20.2, is there a workaround for this that I can
employ? I can create alias scripts to return expected results if that is an
option.

https://issues.apache.org/jira/browse/MAPREDUCE-3661