Re: Create and write files on mounted HDFS via java api

2013-04-20 Thread Hemanth Yamijala
Are you using Fuse for mounting HDFS ?


On Fri, Apr 19, 2013 at 4:30 PM, lijinlong wakingdrea...@163.com wrote:

 I mounted HDFS to a local directory for storage, that is /mnt/hdfs. I can do
 the basic file operations such as create, remove, copy etc. just using Linux
 commands and the GUI. But when I tried to do the same thing in the mounted
 directory via the Java API (not the Hadoop API), I got exceptions. The detailed
 information can be seen here:
 http://stackoverflow.com/questions/16083497/java-exception-when-creating-or-writing-files-on-mounted-hdfs
  .
 Now my question is whether I can do file operations on the mounted HDFS
 via the Java API as I wrote in the URL. If not, what would be the proper way
 to accomplish that?





Re: Mapreduce

2013-04-20 Thread Hemanth Yamijala
As this is a HBase specific question, it will be better to ask this
question on the HBase user mailing list.

Thanks
Hemanth


On Fri, Apr 19, 2013 at 10:46 PM, Adrian Acosta Mitjans 
amitj...@estudiantes.uci.cu wrote:

 Hello:

 I'm working on a project, and I'm using HBase to store the data. I have
  this method that works great but without the performance I'm looking for, so
  what I want is to do the same thing but using MapReduce.


 public ArrayList<MyObject> findZ(String z) throws IOException {

     ArrayList<MyObject> rows = new ArrayList<MyObject>();
     Configuration conf = HBaseConfiguration.create();
     HTable table = new HTable(conf, "test");
     Scan s = new Scan();
     s.addColumn(Bytes.toBytes("x"), Bytes.toBytes("y"));
     ResultScanner scanner = table.getScanner(s);
     try {
         for (Result rr : scanner) {
             if (Bytes.toString(rr.getValue(Bytes.toBytes("x"),
                     Bytes.toBytes("y"))).equals(z)) {
                 rows.add(getInformation(Bytes.toString(rr.getRow())));
             }
         }
     } finally {
         scanner.close();
     }
     return rows;
 }

 The getInformation method takes all the columns and converts the row into a
  MyObject.

 I just want an example or a link to a tutorial that does something like
  this; I want to get a result object as the answer and not a number counting
  words, like in many examples I found.
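
 A minimal sketch of what the MapReduce version could look like, assuming
 HBase's TableMapReduceUtil and SingleColumnValueFilter APIs (the class names
 and output path below are just placeholders, not from this code base):

 import java.io.IOException;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.client.Result;
 import org.apache.hadoop.hbase.client.Scan;
 import org.apache.hadoop.hbase.filter.CompareFilter;
 import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
 import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
 import org.apache.hadoop.hbase.mapreduce.TableMapper;
 import org.apache.hadoop.hbase.util.Bytes;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class FindZJob {

     // The mapper only receives rows that passed the server-side filter.
     public static class FindZMapper extends TableMapper<Text, Text> {
         @Override
         protected void map(ImmutableBytesWritable key, Result value, Context context)
                 throws IOException, InterruptedException {
             // Emit the row key; the value could be whatever getInformation() builds.
             context.write(new Text(Bytes.toString(key.get())), new Text(""));
         }
     }

     public static void main(String[] args) throws Exception {
         String z = args[0];
         Configuration conf = HBaseConfiguration.create();
         Job job = new Job(conf, "find-z");
         job.setJarByClass(FindZJob.class);

         Scan scan = new Scan();
         scan.addColumn(Bytes.toBytes("x"), Bytes.toBytes("y"));
         scan.setCaching(500);
         scan.setCacheBlocks(false);
         // Filter on the region servers so only rows whose x:y value equals z reach the mapper.
         SingleColumnValueFilter filter = new SingleColumnValueFilter(Bytes.toBytes("x"),
                 Bytes.toBytes("y"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes(z));
         filter.setFilterIfMissing(true);
         scan.setFilter(filter);

         TableMapReduceUtil.initTableMapperJob("test", scan, FindZMapper.class,
                 Text.class, Text.class, job);
         job.setNumReduceTasks(0);
         FileOutputFormat.setOutputPath(job, new Path("/tmp/find-z-output"));
         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }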

 My native language is Spanish, so sorry if something is not well written.

 Thanks.
   http://www.uci.cu/




Re: Errors about MRunit

2013-04-20 Thread Hemanth Yamijala
Hi,

If your goal is to use the new API, I am able to get it to work with the
following maven configuration:

<dependency>
  <groupId>org.apache.mrunit</groupId>
  <artifactId>mrunit</artifactId>
  <version>0.9.0-incubating</version>
  <classifier>hadoop1</classifier>
</dependency>

If I switch to the hadoop2 classifier, I get the same errors as what you are
facing. (The hadoop2 build expects the new-API context classes, such as
TaskInputOutputContext, to be interfaces as they are in Hadoop 2.x, while Hadoop
1.0.4 still has them as classes - hence the IncompatibleClassChangeError.)

Thanks
Hemanth


On Sat, Apr 20, 2013 at 3:42 PM, 姚吉龙 geelong...@gmail.com wrote:

 Hi Everyone

 I am testing my MR program with MRUnit; its version
 is mrunit-0.9.0-incubating-hadoop2. My Hadoop version is 1.0.4.
 The error trace is below:

 java.lang.IncompatibleClassChangeError: Found class
 org.apache.hadoop.mapreduce.TaskInputOutputContext, but interface was
 expected
 at
 org.apache.hadoop.mrunit.mapreduce.mock.MockContextWrapper.createCommon(MockContextWrapper.java:53)
  at
 org.apache.hadoop.mrunit.mapreduce.mock.MockMapContextWrapper.create(MockMapContextWrapper.java:70)
 at
 org.apache.hadoop.mrunit.mapreduce.mock.MockMapContextWrapper.init(MockMapContextWrapper.java:62)
  at org.apache.hadoop.mrunit.mapreduce.MapDriver.run(MapDriver.java:217)
 at org.apache.hadoop.mrunit.MapDriverBase.runTest(MapDriverBase.java:150)
  at org.apache.hadoop.mrunit.TestDriver.runTest(TestDriver.java:137)
 at UnitTest.testMapper(UnitTest.java:41)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
  at junit.framework.TestCase.runTest(TestCase.java:168)
 at junit.framework.TestCase.runBare(TestCase.java:134)
  at junit.framework.TestResult$1.protect(TestResult.java:110)
 at junit.framework.TestResult.runProtected(TestResult.java:128)
  at junit.framework.TestResult.run(TestResult.java:113)
 at junit.framework.TestCase.run(TestCase.java:124)
  at junit.framework.TestSuite.runTest(TestSuite.java:232)
 at junit.framework.TestSuite.run(TestSuite.java:227)
  at
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:79)
 at
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
  at
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
 at
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
  at
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
 at
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
  at
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)


 Does anyone have an idea?

 BRs
 Geelong

 --
 From Good To Great



Re: Create and write files on mounted HDFS via java api

2013-04-20 Thread Hemanth Yamijala
Sorry - no. I just wanted to know if you were using FUSE, because I knew of
no other way of mounting HDFS. Basically, I was wondering if some libraries
needed to be on the system path for the Java programs to work.

From your response it looks like you aren't using FUSE. So what are you using
to mount?

Hemanth


On Sat, Apr 20, 2013 at 4:24 PM, 金龙 李 wakingdrea...@live.com wrote:

 Yes, I tried both FUSE and NTFS, but both failed. Have you done this
 before? And do you know why?

 --
 Date: Sat, 20 Apr 2013 15:48:36 +0530
 Subject: Re: Create and write files on mounted HDFS via java api
 From: yhema...@thoughtworks.com
 To: user@hadoop.apache.org


 Are you using Fuse for mounting HDFS ?


 On Fri, Apr 19, 2013 at 4:30 PM, lijinlong wakingdrea...@163.com wrote:

 I mounted HDFS to a local directory for storage, that is /mnt/hdfs. I can do
 the basic file operations such as create, remove, copy etc. just using Linux
 commands and the GUI. But when I tried to do the same thing in the mounted
 directory via the Java API (not the Hadoop API), I got exceptions. The detailed
 information can be seen here:
 http://stackoverflow.com/questions/16083497/java-exception-when-creating-or-writing-files-on-mounted-hdfs
  .
 Now my question is whether I can do file operations on the mounted HDFS
 via the Java API as I wrote in the URL. If not, what would be the proper way
 to accomplish that?






Re: Errors about MRunit

2013-04-20 Thread Hemanth Yamijala
+ user@

Please do continue the conversation on the mailing list, in case others
like you can benefit from / contribute to the discussion.

Thanks
Hemanth

On Sat, Apr 20, 2013 at 5:32 PM, Hemanth Yamijala yhema...@thoughtworks.com
 wrote:

 Hi,

 My code is working with mrunit-0.9.0-incubating-hadoop1.jar as a
  dependency. So, can you pull this from the MRUnit download tarball, add it
  to the dependencies in Eclipse and try? Of course, remove any other MRUnit
  jar you already have.

 Thanks
 Hemanth


 On Sat, Apr 20, 2013 at 5:02 PM, 姚吉龙 geelong...@gmail.com wrote:

 Sorry, I have not used Maven.
  Could you tell me how to set this up with Eclipse?


 BRs
 geelong


 2013/4/20 Hemanth Yamijala yhema...@thoughtworks.com

 Hi,

 If your goal is to use the new API, I am able to get it to work with the
 following maven configuration:

 <dependency>
   <groupId>org.apache.mrunit</groupId>
   <artifactId>mrunit</artifactId>
   <version>0.9.0-incubating</version>
   <classifier>hadoop1</classifier>
 </dependency>

 If I switch to the hadoop2 classifier, I get the same errors as what you
  are facing.

 Thanks
 Hemanth


 On Sat, Apr 20, 2013 at 3:42 PM, 姚吉龙 geelong...@gmail.com wrote:

 Hi Everyone

 I am testing my MR program with MRUnit; its version
  is mrunit-0.9.0-incubating-hadoop2. My Hadoop version is 1.0.4.
 The error trace is below:

 java.lang.IncompatibleClassChangeError: Found class
 org.apache.hadoop.mapreduce.TaskInputOutputContext, but interface was
 expected
 at
 org.apache.hadoop.mrunit.mapreduce.mock.MockContextWrapper.createCommon(MockContextWrapper.java:53)
  at
 org.apache.hadoop.mrunit.mapreduce.mock.MockMapContextWrapper.create(MockMapContextWrapper.java:70)
 at
 org.apache.hadoop.mrunit.mapreduce.mock.MockMapContextWrapper.init(MockMapContextWrapper.java:62)
  at
 org.apache.hadoop.mrunit.mapreduce.MapDriver.run(MapDriver.java:217)
 at
 org.apache.hadoop.mrunit.MapDriverBase.runTest(MapDriverBase.java:150)
  at org.apache.hadoop.mrunit.TestDriver.runTest(TestDriver.java:137)
 at UnitTest.testMapper(UnitTest.java:41)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
  at junit.framework.TestCase.runTest(TestCase.java:168)
 at junit.framework.TestCase.runBare(TestCase.java:134)
  at junit.framework.TestResult$1.protect(TestResult.java:110)
 at junit.framework.TestResult.runProtected(TestResult.java:128)
  at junit.framework.TestResult.run(TestResult.java:113)
 at junit.framework.TestCase.run(TestCase.java:124)
  at junit.framework.TestSuite.runTest(TestSuite.java:232)
 at junit.framework.TestSuite.run(TestSuite.java:227)
  at
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:79)
 at
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
  at
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
 at
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
  at
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
 at
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
  at
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)


  Does anyone have an idea?

 BRs
 Geelong

 --
 From Good To Great





 --
 From Good To Great





Re: Which version of Hadoop

2013-04-20 Thread Hemanth Yamijala
2.x.x provides NN high availability.
http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html

However, it is in alpha stage right now.

Thanks
hemanth


On Sat, Apr 20, 2013 at 5:30 PM, Ascot Moss ascot.m...@gmail.com wrote:

 Hi,

 I am new to Hadoop. From the Hadoop downloads page I can find 4 versions:

 1.0.x / 1.1.x / 2.x.x / 0.23.x


 May I know which one is the latest stable version that provides Namenode
 high availability for production environment?

 regards





Re: How to configure mapreduce archive size?

2013-04-18 Thread Hemanth Yamijala
Well, since the DistributedCache is used by the tasktracker, you need to
update the log4j configuration file used by the tasktracker daemon. And you
need to get the tasktracker log file - from the machine where you see the
distributed cache problem.
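
As a sketch, assuming the stock log4j.properties under the Hadoop conf directory
on that tasktracker machine (the exact logger names may vary by version), the
debug logging could be enabled like this:

# $HADOOP_HOME/conf/log4j.properties - restart the tasktracker afterwards
log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG
log4j.logger.org.apache.hadoop.filecache.TrackerDistributedCacheManager=DEBUG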


On Fri, Apr 19, 2013 at 6:27 AM, xia_y...@dell.com wrote:

 Hi Hemanth,

 ** **

 I tried http://machine:50030. It did not work for me.

 ** **

 In the hbase_home/conf folder, I updated the log4j configuration properties and
  got the attached log. Do you see what is happening for the map reduce job?

 ** **

 Thanks,

 ** **

 Jane

 ** **

 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* Wednesday, April 17, 2013 9:11 PM

 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

 ** **

 The check for cache file cleanup is controlled by the
 property mapreduce.tasktracker.distributedcache.checkperiod. It defaults to
 1 minute (which should be sufficient for your requirement).

 ** **

 I am not sure why the JobTracker UI is inaccessible. If you know where JT
 is running, try hitting http://machine:50030. If that doesn't work, maybe
 check if ports have been changed in mapred-site.xml for a property similar
 to mapred.job.tracker.http.address. 

 ** **

 There is logging in the code of the tasktracker component that can help
 debug the distributed cache behaviour. In order to get those logs you need
 to enable debug logging in the log4j configuration properties and restart
 the daemons. Hopefully that will help you get some hints on what is
 happening.

 ** **

 Thanks

 hemanth

 ** **

 On Wed, Apr 17, 2013 at 11:49 PM, xia_y...@dell.com wrote:

 Hi Hemanth and Bejoy KS,

  

 I have tried both mapred-site.xml and core-site.xml. They do not work. I
  set the value to 50K just for testing purposes; however, the folder size
  already goes to 900M now. As in your email, “After they are done, the
  property will help cleanup the files due to the limit set.” How frequently
  will the cleanup task be triggered?

  

 Regarding the job.xml, I cannot use the JT web UI to find it. It seems when
  Hadoop is packaged within HBase, this is disabled. I am only using HBase
  jobs. I was advised by the HBase people to get help from the Hadoop mailing list.
  I will contact them again.

  

 Thanks,

  

 Jane

  

 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* Tuesday, April 16, 2013 9:35 PM


 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

  

 You can limit the size by setting local.cache.size in the mapred-site.xml
 (or core-site.xml if that works for you). I mistakenly mentioned
 mapred-default.xml in my last mail - apologies for that. However, please
 note that this does not prevent whatever is writing into the distributed
 cache from creating those files when they are required. After they are
 done, the property will help cleanup the files due to the limit set. 

  

 That's why I am more keen on finding what is using the files in the
 Distributed cache. It may be useful if you can ask on the HBase list as
 well if the APIs you are using are creating the files you mention (assuming
 you are only running HBase jobs on the cluster and nothing else)

  

 Thanks

 Hemanth

  

 On Tue, Apr 16, 2013 at 11:15 PM, xia_y...@dell.com wrote:

 Hi Hemanth,

  

 I did not explicitly using DistributedCache in my code. I did not use any
 command line arguments like –libjars neither.

  

 Where can I find job.xml? I am using Hbase MapReduce API and not setting
 any job.xml.

  

 The key point is I want to limit the size of
 /tmp/hadoop-root/mapred/local/archive. Could you help?

  

 Thanks.

  

 Xia

  

 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* Thursday, April 11, 2013 9:09 PM


 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

  

 TableMapReduceUtil has APIs like addDependencyJars which will use
 DistributedCache. I don't think you are explicitly using that. Are you
 using any command line arguments like -libjars etc when you are launching
 the MapReduce job ? Alternatively you can check job.xml of the launched MR
 job to see if it has set properties having prefixes like mapred.cache. If
 nothing's set there, it would seem like some other process or user is
 adding jars to DistributedCache when using the cluster.

  

 Thanks

 hemanth

  

  

  

 On Thu, Apr 11, 2013 at 11:40 PM, xia_y...@dell.com wrote:

 Hi Hemanth,

  

 Attached is some sample folders within my
 /tmp/hadoop-root/mapred/local/archive. There are some jar and class files
 inside.

  

 My application uses MapReduce job to do purge Hbase old data. I am using
 basic HBase MapReduce API to delete rows from Hbase

Re: Run multiple HDFS instances

2013-04-18 Thread Hemanth Yamijala
Are you trying to implement something like namespace federation, that's a
part of Hadoop 2.0 -
http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-project-dist/hadoop-hdfs/Federation.html


On Thu, Apr 18, 2013 at 10:02 PM, Lixiang Ao aolixi...@gmail.com wrote:

 Actually I'm trying to do something like combining multiple namenodes so
 that they present themselves to clients as a single namespace, implementing
 basic namenode functionalities.

 在 2013年4月18日星期四,Chris Embree 写道:

 Glad you got this working... can you explain your use case a little?   I'm
 trying to understand why you might want to do that.


 On Thu, Apr 18, 2013 at 11:29 AM, Lixiang Ao aolixi...@gmail.com wrote:

 I modified sbin/hadoop-daemon.sh, where HADOOP_PID_DIR is set. It works!
  Everything looks fine now.

 Seems direct command hdfs namenode gives a better sense of control  :)

 Thanks a lot.

 在 2013年4月18日星期四,Harsh J 写道:

 Yes you can, but if you want the scripts to work, you should have them
 use a different PID directory (I think it's called HADOOP_PID_DIR)
 every time you invoke them.

 I instead prefer to start the daemons up via their direct command, such
 as hdfs namenode and so on, and move them to the background, with a
 redirect for logging.
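
 A rough sketch of that approach (all the paths and the conf dir below are just
 examples):

 # second instance gets its own conf, pid and log directories
 export HADOOP_CONF_DIR=/opt/hdfs2/conf    # core-site.xml/hdfs-site.xml with different ports
 export HADOOP_PID_DIR=/opt/hdfs2/pids
 export HADOOP_LOG_DIR=/opt/hdfs2/logs
 nohup hdfs namenode > $HADOOP_LOG_DIR/namenode.out 2>&1 &
 nohup hdfs datanode > $HADOOP_LOG_DIR/datanode.out 2>&1 &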

 On Thu, Apr 18, 2013 at 2:34 PM, Lixiang Ao aolixi...@gmail.com
 wrote:
  Hi all,
 
  Can I run mutiple HDFS instances, that is, n seperate namenodes and n
  datanodes, on a single machine?
 
  I've modified core-site.xml and hdfs-site.xml to avoid port and file
  conflicting between HDFSes, but when I started the second HDFS, I got
 the
  errors:
 
  Starting namenodes on [localhost]
  localhost: namenode running as process 20544. Stop it first.
  localhost: datanode running as process 20786. Stop it first.
  Starting secondary namenodes [0.0.0.0]
  0.0.0.0: secondarynamenode running as process 21074. Stop it first.
 
  Is there a way to solve this?
  Thank you in advance,
 
  Lixiang Ao



 --
 Harsh J





Re: How to configure mapreduce archive size?

2013-04-17 Thread Hemanth Yamijala
The check for cache file cleanup is controlled by the
property mapreduce.tasktracker.distributedcache.checkperiod. It defaults to
1 minute (which should be sufficient for your requirement).

I am not sure why the JobTracker UI is inaccessible. If you know where JT
is running, try hitting http://machine:50030. If that doesn't work, maybe
check if ports have been changed in mapred-site.xml for a property similar
to mapred.job.tracker.http.address.
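
For reference, a sketch of the property to look for in mapred-site.xml (the
value shown is just the usual default):

<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50030</value>
</property>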

There is logging in the code of the tasktracker component that can help
debug the distributed cache behaviour. In order to get those logs you need
to enable debug logging in the log4j configuration properties and restart
the daemons. Hopefully that will help you get some hints on what is
happening.

Thanks
hemanth


On Wed, Apr 17, 2013 at 11:49 PM, xia_y...@dell.com wrote:

 Hi Hemanth and Bejoy KS,

 ** **

 I have tried both mapred-site.xml and core-site.xml. They do not work. I
 set the value to 50K just for testing purposes; however, the folder size
 already goes to 900M now. As in your email, “After they are done, the
 property will help cleanup the files due to the limit set.” How frequently
 will the cleanup task be triggered?

 ** **

 Regarding the job.xml, I cannot use the JT web UI to find it. It seems when
 Hadoop is packaged within HBase, this is disabled. I am only using HBase
 jobs. I was advised by the HBase people to get help from the Hadoop mailing list.
 I will contact them again.

 ** **

 Thanks,

 ** **

 Jane

 ** **

 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* Tuesday, April 16, 2013 9:35 PM

 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

 ** **

 You can limit the size by setting local.cache.size in the mapred-site.xml
 (or core-site.xml if that works for you). I mistakenly mentioned
 mapred-default.xml in my last mail - apologies for that. However, please
 note that this does not prevent whatever is writing into the distributed
 cache from creating those files when they are required. After they are
 done, the property will help cleanup the files due to the limit set. 

 ** **

 That's why I am more keen on finding what is using the files in the
 Distributed cache. It may be useful if you can ask on the HBase list as
 well if the APIs you are using are creating the files you mention (assuming
 you are only running HBase jobs on the cluster and nothing else)

 ** **

 Thanks

 Hemanth

 ** **

 On Tue, Apr 16, 2013 at 11:15 PM, xia_y...@dell.com wrote:

 Hi Hemanth,

  

 I did not explicitly using DistributedCache in my code. I did not use any
 command line arguments like –libjars neither.

  

 Where can I find job.xml? I am using Hbase MapReduce API and not setting
 any job.xml.

  

 The key point is I want to limit the size of
 /tmp/hadoop-root/mapred/local/archive. Could you help?

  

 Thanks.

  

 Xia

  

 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* Thursday, April 11, 2013 9:09 PM


 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

  

 TableMapReduceUtil has APIs like addDependencyJars which will use
 DistributedCache. I don't think you are explicitly using that. Are you
 using any command line arguments like -libjars etc when you are launching
 the MapReduce job ? Alternatively you can check job.xml of the launched MR
 job to see if it has set properties having prefixes like mapred.cache. If
 nothing's set there, it would seem like some other process or user is
 adding jars to DistributedCache when using the cluster.

  

 Thanks

 hemanth

  

  

  

 On Thu, Apr 11, 2013 at 11:40 PM, xia_y...@dell.com wrote:

 Hi Hemanth,

  

 Attached is some sample folders within my
 /tmp/hadoop-root/mapred/local/archive. There are some jar and class files
 inside.

  

 My application uses MapReduce job to do purge Hbase old data. I am using
 basic HBase MapReduce API to delete rows from Hbase table. I do not specify
 to use Distributed cache. Maybe HBase use it?

  

 Some code here:

  

Scan scan = *new* Scan();

scan.setCaching(500);// 1 is the default in Scan, which
 will be bad for MapReduce jobs

scan.setCacheBlocks(*false*);  // don't set to true for MR jobs

scan.setTimeRange(Long.*MIN_VALUE*, timestamp);

// set other scan *attrs*

// the purge start time

Date date=*new* Date();

TableMapReduceUtil.*initTableMapperJob*(

  tableName,// input table

  scan,   // Scan instance to control CF and
 attribute selection

  MapperDelete.*class*, // *mapper* class

  *null*, // *mapper* output key

  *null*,  // *mapper* output value

Re: Hadoop fs -getmerge

2013-04-17 Thread Hemanth Yamijala
I don't think that is possible. When we use -getmerge, the destination
filesystem happens to be a LocalFileSystem which extends from
ChecksumFileSystem. I believe that's why the CRC files are getting in.

Would it not be possible for you to ignore them, since they have a fixed
extension?
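
If ignoring them works for you, a rough sketch of what that could look like
(the paths here are placeholders):

hadoop fs -getmerge /user/foo/output /tmp/merged.txt
rm -f /tmp/.merged.txt.crc   # checksum file written alongside by the local ChecksumFileSystem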

Thanks
Hemanth


On Wed, Apr 17, 2013 at 8:09 PM, Fabio Pitzolu fabio.pitz...@gr-ci.comwrote:

 Hi all,
  is there a way to use the getmerge fs command and not generate the
  .crc files in the output local directory?

 Thanks,

  Fabio Pitzolu



Re: How to configure mapreduce archive size?

2013-04-16 Thread Hemanth Yamijala
You can limit the size by setting local.cache.size in the mapred-site.xml
(or core-site.xml if that works for you). I mistakenly mentioned
mapred-default.xml in my last mail - apologies for that. However, please
note that this does not prevent whatever is writing into the distributed
cache from creating those files when they are required. After they are
done, the property will help cleanup the files due to the limit set.
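
For example, a sketch of the property in mapred-site.xml (the value is in
bytes, so this caps the cache at roughly 1 GB; pick whatever limit suits you):

<property>
  <name>local.cache.size</name>
  <value>1073741824</value>
</property>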

That's why I am more keen on finding what is using the files in the
Distributed cache. It may be useful if you can ask on the HBase list as
well if the APIs you are using are creating the files you mention (assuming
you are only running HBase jobs on the cluster and nothing else)

Thanks
Hemanth


On Tue, Apr 16, 2013 at 11:15 PM, xia_y...@dell.com wrote:

 Hi Hemanth,

 ** **

 I did not explicitly use DistributedCache in my code. I did not use any
  command line arguments like –libjars either.

 ** **

 Where can I find job.xml? I am using Hbase MapReduce API and not setting
 any job.xml.

 ** **

 The key point is I want to limit the size of 
 /tmp/hadoop-root/mapred/local/archive.
 Could you help?

 ** **

 Thanks.

 ** **

 Xia

 ** **

 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* Thursday, April 11, 2013 9:09 PM

 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

 ** **

 TableMapReduceUtil has APIs like addDependencyJars which will use
 DistributedCache. I don't think you are explicitly using that. Are you
 using any command line arguments like -libjars etc when you are launching
 the MapReduce job ? Alternatively you can check job.xml of the launched MR
 job to see if it has set properties having prefixes like mapred.cache. If
 nothing's set there, it would seem like some other process or user is
 adding jars to DistributedCache when using the cluster.

 ** **

 Thanks

 hemanth

 ** **

 ** **

 ** **

 On Thu, Apr 11, 2013 at 11:40 PM, xia_y...@dell.com wrote:

 Hi Hemanth,

  

 Attached is some sample folders within my
 /tmp/hadoop-root/mapred/local/archive. There are some jar and class files
 inside.

  

 My application uses MapReduce job to do purge Hbase old data. I am using
 basic HBase MapReduce API to delete rows from Hbase table. I do not specify
 to use Distributed cache. Maybe HBase use it?

  

 Some code here:

  

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
scan.setTimeRange(Long.MIN_VALUE, timestamp);
// set other scan attrs

// the purge start time
Date date = new Date();

TableMapReduceUtil.initTableMapperJob(
    tableName,            // input table
    scan,                 // Scan instance to control CF and attribute selection
    MapperDelete.class,   // mapper class
    null,                 // mapper output key
    null,                 // mapper output value
    job);

job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);

 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* Thursday, April 11, 2013 12:29 AM


 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

  

 Could you paste the contents of the directory ? Not sure whether that will
 help, but just giving it a shot.

  

 What application are you using ? Is it custom MapReduce jobs in which you
 use Distributed cache (I guess not) ? 

  

 Thanks

 Hemanth

  

 On Thu, Apr 11, 2013 at 3:34 AM, xia_y...@dell.com wrote:

 Hi Arun,

  

 I stopped my application, then restarted my hbase (which include hadoop).
 After that I start my application. After one evening, my
 /tmp/hadoop-root/mapred/local/archive goes to more than 1G. It does not
 work.

  

 Is this the right place to change the value?

  

 local.cache.size in file core-default.xml, which is in
 hadoop-core-1.0.3.jar

  

 Thanks,

  

 Jane

  

 *From:* Arun C Murthy [mailto:a...@hortonworks.com]
 *Sent:* Wednesday, April 10, 2013 2:45 PM


 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

  

 Ensure no jobs are running (cache limit is only for non-active cache
 files), check after a little while (takes sometime for the cleaner thread
 to kick in).

  

 Arun

  

 On Apr 11, 2013, at 2:29 AM, xia_y...@dell.com xia_y...@dell.com
 wrote

Re: How to configure mapreduce archive size?

2013-04-11 Thread Hemanth Yamijala
Could you paste the contents of the directory ? Not sure whether that will
help, but just giving it a shot.

What application are you using ? Is it custom MapReduce jobs in which you
use Distributed cache (I guess not) ?

Thanks
Hemanth


On Thu, Apr 11, 2013 at 3:34 AM, xia_y...@dell.com wrote:

 Hi Arun,

 ** **

 I stopped my application, then restarted my hbase (which include hadoop).
 After that I start my application. After one evening, my 
 /tmp/hadoop-root/mapred/local/archive
 goes to more than 1G. It does not work.

 ** **

 Is this the right place to change the value?

 ** **

 local.cache.size in file core-default.xml, which is in
 hadoop-core-1.0.3.jar

 ** **

 Thanks,

 ** **

 Jane

 ** **

 *From:* Arun C Murthy [mailto:a...@hortonworks.com]
 *Sent:* Wednesday, April 10, 2013 2:45 PM

 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

 ** **

 Ensure no jobs are running (cache limit is only for non-active cache
 files), check after a little while (takes sometime for the cleaner thread
 to kick in).

 ** **

 Arun

 ** **

 On Apr 11, 2013, at 2:29 AM, xia_y...@dell.com xia_y...@dell.com
 wrote:



 

 Hi Hemanth,

  

 For the hadoop 1.0.3, I can only find local.cache.size in file
 core-default.xml, which is in hadoop-core-1.0.3.jar. It is not in
 mapred-default.xml.

  

 I updated the value in the file default.xml and changed the value to 50.
  This is just for my testing purposes. However, the folder
  /tmp/hadoop-root/mapred/local/archive already goes to more than 1G now. Looks
  like it does not do the work. Could you advise if what I did is correct?

  

   <name>local.cache.size</name>
   <value>50</value>

  

 Thanks,

  

 Xia

  

 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* Monday, April 08, 2013 9:09 PM
 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

  

 Hi,

  

 This directory is used as part of the 'DistributedCache' feature. (
 http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#DistributedCache).
 There is a configuration key local.cache.size which controls the amount
 of data stored under DistributedCache. The default limit is 10GB. However,
 the files under this cannot be deleted if they are being used. Also, some
 frameworks on Hadoop could be using DistributedCache transparently to you.
 

  

 So you could check what is being stored here and based on that lower the
 limit of the cache size if you feel that will help. The property needs to
 be set in mapred-default.xml.

  

 Thanks

 Hemanth

  

 On Mon, Apr 8, 2013 at 11:09 PM, xia_y...@dell.com wrote:

 Hi,

  

 I am using hadoop which is packaged within hbase -0.94.1. It is hadoop
 1.0.3. There is some mapreduce job running on my server. After some time, I
 found that my folder /tmp/hadoop-root/mapred/local/archive has 14G size.**
 **

  

 How to configure this and limit the size? I do not want  to waste my space
 for archive.

  

 Thanks,

  

 Xia

  

  

 ** **

 --

 Arun C. Murthy

 Hortonworks Inc.
 http://hortonworks.com/

 ** **



Re: Copy Vs DistCP

2013-04-11 Thread Hemanth Yamijala
AFAIK, the cp command works fully from the DFS client. It reads bytes from
the InputStream created when the file is opened and writes the same to the
OutputStream of the file. It does not work at the level of data blocks. A
configuration io.file.buffer.size is used as the size of the buffer used in
copy - set to 4096 by default.
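
In essence (a rough sketch, not the actual FsShell code), the client-side copy
boils down to something like:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SimpleDfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        InputStream in = fs.open(new Path(args[0]));       // streams the blocks from the datanodes
        OutputStream out = fs.create(new Path(args[1]));   // writes a brand-new set of blocks
        // buffer sized by io.file.buffer.size (4096 by default); 'true' closes both streams
        IOUtils.copyBytes(in, out, conf.getInt("io.file.buffer.size", 4096), true);
    }
}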

Thanks
Hemanth


On Thu, Apr 11, 2013 at 9:42 AM, KayVajj vajjalak...@gmail.com wrote:

 If the cp command is not parallel, how does it work for a file partitioned
  across various data nodes?


 On Wed, Apr 10, 2013 at 6:30 PM, Azuryy Yu azury...@gmail.com wrote:

 The cp command is not parallel; it just calls the FileSystem API, even if
  DFSClient has multiple threads.

 DistCp can work well on the same cluster.


 On Thu, Apr 11, 2013 at 8:17 AM, KayVajj vajjalak...@gmail.com wrote:

 The File System copy utility copies files byte by byte, if I'm not wrong.
  Could it be possible that the cp command works with blocks and moves them,
  which could be significantly more efficient?


  Also, how does the cp command work if the file is distributed across
  different data nodes?

 Thanks
 Kay


 On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas jayunit...@gmail.com wrote:

 DistCP is a full-blown MapReduce job (mapper only, where the mappers do
  a fully parallel copy to the destination).

  CP appears (correct me if I'm wrong) to simply invoke the FileSystem and
  issue a copy command for every source file.

  I have an additional question: how is CP, which is internal to a cluster,
  optimized (if at all)?



 On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 shurong@qunar.com wrote:

  Hi,

  I think it's better to use Copy within the same cluster and distCP
  between clusters; the cp command is a Hadoop internal parallel process and
  will not copy files locally.

 --
  麦树荣

  *From:* KayVajj vajjalak...@gmail.com
 *Date:* 2013-04-11 06:20
 *To:* user@hadoop.apache.org
 *Subject:* Copy Vs DistCP
   I have a few questions regarding the usage of DistCP for copying
  files in the same cluster.


  1) Which one is better within the same cluster, and what factors (like
  file size etc.) would influence the usage of one over the other?

   2) When we run a cp command like the one below from a client node of the
  cluster (not a data node), how does the cp command work:
    i) like an MR job, or
   ii) does it copy files locally and then copy them back to the new
  location?

   Example of the copy command:

   hdfs dfs -cp /some_location/file /new_location/

   Thanks, your responses are appreciated.

  -- Kay




 --
 Jay Vyas
 http://jayunit100.blogspot.com







Re: How to configure mapreduce archive size?

2013-04-11 Thread Hemanth Yamijala
TableMapReduceUtil has APIs like addDependencyJars which will use
DistributedCache. I don't think you are explicitly using that. Are you
using any command line arguments like -libjars etc when you are launching
the MapReduce job ? Alternatively you can check job.xml of the launched MR
job to see if it has set properties having prefixes like mapred.cache. If
nothing's set there, it would seem like some other process or user is
adding jars to DistributedCache when using the cluster.
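
As a sketch of how to check this from code before submitting, iterating the
job's Configuration and printing anything cache related (the method name is
just a placeholder):

static void dumpCacheSettings(org.apache.hadoop.mapreduce.Job job) {
    // Configuration is Iterable over its key/value entries.
    for (java.util.Map.Entry<String, String> e : job.getConfiguration()) {
        if (e.getKey().startsWith("mapred.cache")) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}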

Thanks
hemanth




On Thu, Apr 11, 2013 at 11:40 PM, xia_y...@dell.com wrote:

 Hi Hemanth,

 ** **

 Attached is some sample folders within my 
 /tmp/hadoop-root/mapred/local/archive.
 There are some jar and class files inside.

 ** **

 My application uses MapReduce job to do purge Hbase old data. I am using
 basic HBase MapReduce API to delete rows from Hbase table. I do not specify
 to use Distributed cache. Maybe HBase use it?

 ** **

 Some code here:

 ** **

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
scan.setTimeRange(Long.MIN_VALUE, timestamp);
// set other scan attrs

// the purge start time
Date date = new Date();

TableMapReduceUtil.initTableMapperJob(
    tableName,            // input table
    scan,                 // Scan instance to control CF and attribute selection
    MapperDelete.class,   // mapper class
    null,                 // mapper output key
    null,                 // mapper output value
    job);

job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);

 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* Thursday, April 11, 2013 12:29 AM

 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

 ** **

 Could you paste the contents of the directory ? Not sure whether that will
 help, but just giving it a shot.

 ** **

 What application are you using ? Is it custom MapReduce jobs in which you
 use Distributed cache (I guess not) ? 

 ** **

 Thanks

 Hemanth

 ** **

 On Thu, Apr 11, 2013 at 3:34 AM, xia_y...@dell.com wrote:

 Hi Arun,

  

 I stopped my application, then restarted my hbase (which include hadoop).
 After that I start my application. After one evening, my
 /tmp/hadoop-root/mapred/local/archive goes to more than 1G. It does not
 work.

  

 Is this the right place to change the value?

  

 local.cache.size in file core-default.xml, which is in
 hadoop-core-1.0.3.jar

  

 Thanks,

  

 Jane

  

 *From:* Arun C Murthy [mailto:a...@hortonworks.com]
 *Sent:* Wednesday, April 10, 2013 2:45 PM


 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

  

 Ensure no jobs are running (cache limit is only for non-active cache
 files), check after a little while (takes sometime for the cleaner thread
 to kick in).

  

 Arun

  

 On Apr 11, 2013, at 2:29 AM, xia_y...@dell.com xia_y...@dell.com
 wrote:

 ** **

 Hi Hemanth,

  

 For the hadoop 1.0.3, I can only find local.cache.size in file
 core-default.xml, which is in hadoop-core-1.0.3.jar. It is not in
 mapred-default.xml.

  

 I updated the value in file default.xml and changed the value to 50.
 This is just for my testing purpose. However, the folder
 /tmp/hadoop-root/mapred/local/archive already goes more than 1G now. Looks
 like it does not do the work. Could you advise if what I did is correct?**
 **

  

   <name>local.cache.size</name>
   <value>50</value>

  

 Thanks,

  

 Xia

  

 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* Monday, April 08, 2013 9:09 PM
 *To:* user@hadoop.apache.org
 *Subject:* Re: How to configure mapreduce archive size?

  

 Hi,

  

 This directory is used as part of the 'DistributedCache' feature. (
 http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#DistributedCache).
 There is a configuration key local.cache.size which controls the amount
 of data stored under DistributedCache. The default limit is 10GB. However,
 the files under this cannot be deleted if they are being used. Also, some
 frameworks on Hadoop could be using DistributedCache transparently to you.
 

  

 So you could check what is being stored here and based on that lower the
 limit of the cache size if you feel that will help. The property needs to
 be set in mapred-default.xml

Re: How to configure mapreduce archive size?

2013-04-08 Thread Hemanth Yamijala
Hi,

This directory is used as part of the 'DistributedCache' feature. (
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#DistributedCache).
There is a configuration key local.cache.size which controls the amount
of data stored under DistributedCache. The default limit is 10GB. However,
the files under this cannot be deleted if they are being used. Also, some
frameworks on Hadoop could be using DistributedCache transparently to you.

So you could check what is being stored here and based on that lower the
limit of the cache size if you feel that will help. The property needs to
be set in mapred-default.xml.

Thanks
Hemanth


On Mon, Apr 8, 2013 at 11:09 PM, xia_y...@dell.com wrote:

 Hi,

 ** **

 I am using Hadoop, which is packaged within HBase 0.94.1. It is Hadoop
  1.0.3. There is some MapReduce job running on my server. After some time, I
  found that my folder /tmp/hadoop-root/mapred/local/archive has 14G size.

  How to configure this and limit the size? I do not want to waste my space
  on the archive.

 ** **

 Thanks,

 ** **

 Xia

 ** **



Re: Find reducer for a key

2013-03-28 Thread Hemanth Yamijala
Hi,

Not sure if I am answering your question, but this is the background. Every
MapReduce job has a partitioner associated to it. The default partitioner
is a HashPartitioner. You can as a user write your own partitioner as well
and plug it into the job. The partitioner is responsible for splitting the
map outputs key space among the reducers.

So, to know which reducer a key will go to, it is basically the value
returned by the partitioner's getPartition method. For e.g this is the code
in the HashPartitioner:

  public int getPartition(K2 key, V2 value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

mapred.task.partition is the key that defines the partition number of this
reducer.
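
A minimal sketch of putting these together in a reducer's setup() (the keys
file name here is hypothetical, and this follows the mapred.task.partition
approach discussed above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class KeyAwareReducer extends Reducer<Text, Text, Text, Text> {

    private final Set<String> myKeys = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int myPartition = conf.getInt("mapred.task.partition", 0);
        int numReducers = context.getNumReduceTasks();
        HashPartitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();

        // "keys.txt" is a hypothetical side file with one candidate key per line.
        BufferedReader reader = new BufferedReader(new FileReader("keys.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            // Keep only the keys the framework will route to this reducer.
            if (partitioner.getPartition(new Text(line), null, numReducers) == myPartition) {
                myKeys.add(line);
            }
        }
        reader.close();
    }
}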

I guess you can piece together these bits into what you'd want.. However, I
am interested in understanding why you want to know this ? Can you share
some info ?

Thanks
Hemanth


On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli 
cordioli.albe...@gmail.com wrote:

 Hi everyone,

 how can i know the keys that are associated to a particular reducer in
 the setup method?
 Let's assume in the setup method to read from a file where each line
 is a string that will become a key emitted from mappers.
 For each of these lines I would like to know if the string will be a
 key associated with the current reducer or not.

 I read something about mapred.task.partition and mapred.task.id, but I
 didn't understand the usage.


 Thanks,
 Alberto


 --
 Alberto Cordioli



Re: Find reducer for a key

2013-03-28 Thread Hemanth Yamijala
Hmm. That feels like a join. Can't you read the input file on the map side
and output those keys along with the original map output keys? That way
the reducer would automatically get both together.


On Thu, Mar 28, 2013 at 5:20 PM, Alberto Cordioli 
cordioli.albe...@gmail.com wrote:

 Hi Hemanth,

 thanks for your reply.
 Yes, this partially answered to my question. I know how hash
 partitioner works and I guessed something similar.
 The piece that I missed was that mapred.task.partition returns the
 partition number of the reducer.
 So, putting all the pieces together I understand that: for each key in
 the file I have to call the HashPartitioner.
 Then I have to compare the returned index with the one retrieved by
 Configuration.getInt("mapred.task.partition").
 If it is equal then such a key will be served by that reducer. Is this
 correct?


 To answer to your question:
 On the reduce side of an MR job, I want to load some data from a file into an
  in-memory structure. Actually, I don't need to store the whole file
  for each reducer, but only the lines that are related to the keys a
  particular reducer will receive.
 So, my intention is to know the keys in the setup method to store only
 the needed lines.

 Thanks,
 Alberto


 On 28 March 2013 11:01, Hemanth Yamijala yhema...@thoughtworks.com
 wrote:
  Hi,
 
  Not sure if I am answering your question, but this is the background.
 Every
  MapReduce job has a partitioner associated to it. The default
 partitioner is
  a HashPartitioner. You can as a user write your own partitioner as well
 and
  plug it into the job. The partitioner is responsible for splitting the
 map
  outputs key space among the reducers.
 
  So, to know which reducer a key will go to, it is basically the value
  returned by the partitioner's getPartition method. For e.g this is the
 code
  in the HashPartitioner:
 
public int getPartition(K2 key, V2 value,
int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
 
  mapred.task.partition is the key that defines the partition number of
 this
  reducer.
 
  I guess you can piece together these bits into what you'd want..
 However, I
  am interested in understanding why you want to know this ? Can you share
  some info ?
 
  Thanks
  Hemanth
 
 
  On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli
  cordioli.albe...@gmail.com wrote:
 
  Hi everyone,
 
  how can i know the keys that are associated to a particular reducer in
  the setup method?
  Let's assume in the setup method to read from a file where each line
  is a string that will become a key emitted from mappers.
  For each of these lines I would like to know if the string will be a
  key associated with the current reducer or not.
 
  I read something about mapred.task.partition and mapred.task.id, but I
  didn't understand the usage.
 
 
  Thanks,
  Alberto
 
 
  --
  Alberto Cordioli
 
 



 --
 Alberto Cordioli



Re: Find reducer for a key

2013-03-28 Thread Hemanth Yamijala
Hi,

The way I understand your requirement - you have a file that contains a set
of keys. You want to read this file on every reducer and take only those
entries of the set, whose keys correspond to the current reducer.

If the above summary is correct, can I assume that you are potentially
reading the entire intermediate output key space on every reducer? Would
that even work (considering memory constraints, etc.)?

It seemed to me that your solution is implementing what the framework can
already do for you. That was the rationale behind my suggestion. Maybe you
should try and implement both approaches to see which one works better for
you.

Thanks
hemanth


On Thu, Mar 28, 2013 at 6:37 PM, Alberto Cordioli 
cordioli.albe...@gmail.com wrote:

 Yes, that is a possible solution.
 But since the MR job has another scope, the mappers already read other
 files (very large) and output tuples.
 You cannot control the number of mappers, and hence the risk is that a
  lot of mappers will be created, and each of them would also read the other
  file, instead of only a small number of reducers doing so.

 Do you think that the solution I proposed is not so elegant or efficient?

 Alberto

 On 28 March 2013 13:12, Hemanth Yamijala yhema...@thoughtworks.com
 wrote:
  Hmm. That feels like a join. Can't you read the input file on the map
 side
  and output those keys along with the original map output keys.. That way
 the
  reducer would automatically get both together ?
 
 
  On Thu, Mar 28, 2013 at 5:20 PM, Alberto Cordioli
  cordioli.albe...@gmail.com wrote:
 
  Hi Hemanth,
 
  thanks for your reply.
  Yes, this partially answered to my question. I know how hash
  partitioner works and I guessed something similar.
  The piece that I missed was that mapred.task.partition returns the
  partition number of the reducer.
  So, putting al the pieces together I undersand that: for each key in
  the file I have to call the HashPartitioner.
  Then I have to compare the returned index with the one retrieved by
  Configuration.getInt(mapred.task.partition).
  If it is equal then such a key will be served by that reducer. Is this
  correct?
 
 
  To answer to your question:
  In a reduce side of a MR job, I want to load from file some data in a
  in-memory structure. Actually, I don't need to store the whole file
  for each reducer, but only the lines that are related to such keys a
  particular reducers will receive.
  So, my intention is to know the keys in the setup method to store only
  the needed lines.
 
  Thanks,
  Alberto
 
 
  On 28 March 2013 11:01, Hemanth Yamijala yhema...@thoughtworks.com
  wrote:
   Hi,
  
   Not sure if I am answering your question, but this is the background.
   Every
   MapReduce job has a partitioner associated to it. The default
   partitioner is
   a HashPartitioner. You can as a user write your own partitioner as
 well
   and
   plug it into the job. The partitioner is responsible for splitting the
   map
   outputs key space among the reducers.
  
   So, to know which reducer a key will go to, it is basically the value
   returned by the partitioner's getPartition method. For e.g this is the
   code
   in the HashPartitioner:
  
 public int getPartition(K2 key, V2 value,
 int numReduceTasks) {
   return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
 }
  
   mapred.task.partition is the key that defines the partition number of
   this
   reducer.
  
   I guess you can piece together these bits into what you'd want..
   However, I
   am interested in understanding why you want to know this ? Can you
 share
   some info ?
  
   Thanks
   Hemanth
  
  
   On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli
   cordioli.albe...@gmail.com wrote:
  
   Hi everyone,
  
   how can i know the keys that are associated to a particular reducer
 in
   the setup method?
   Let's assume in the setup method to read from a file where each line
   is a string that will become a key emitted from mappers.
   For each of these lines I would like to know if the string will be a
   key associated with the current reducer or not.
  
   I read something about mapred.task.partition and mapred.task.id,
 but I
   didn't understand the usage.
  
  
   Thanks,
   Alberto
  
  
   --
   Alberto Cordioli
  
  
 
 
 
  --
  Alberto Cordioli
 
 



 --
 Alberto Cordioli



Re: Child JVM memory allocation / Usage

2013-03-27 Thread Hemanth Yamijala
Couple of things to check:

Does your class com.hadoop.publicationMrPOC.Launcher implement the Tool
interface ? You can look at an example at (
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Source+Code-N110D0).
That's what accepts the -D params on command line. Alternatively, you can
also set the same in the configuration object like this, in your launcher
code:

Configuration conf = new Configuration();

conf.set("mapred.create.symlink", "yes");
conf.set("mapred.cache.files",
    "hdfs:///user/hemanty/scripts/copy_dump.sh#copy_dump.sh");
conf.set("mapred.child.java.opts",
    "-Xmx200m -XX:+HeapDumpOnOutOfMemoryError "
    + "-XX:HeapDumpPath=./heapdump.hprof "
    + "-XX:OnOutOfMemoryError=./copy_dump.sh");


Second, the position of the arguments matters. I think the command should
be

hadoop jar -Dmapred.create.symlink=yes
-Dmapred.cache.files=hdfs:///user/ims-b/dump.sh#dump.sh
-Dmapred.reduce.child.java.opts='-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'
com.hadoop.publicationMrPOC.Launcher  Fudan\ Univ
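
For reference, a bare-bones sketch of a launcher that implements Tool, so that
ToolRunner parses the -D options into the Configuration before run() is called
(the job wiring here is a placeholder, not the actual Launcher class):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Launcher extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();   // already contains the -D overrides
        Job job = new Job(conf, "publication-poc");
        job.setJarByClass(Launcher.class);
        // Default (identity) mapper; set your own mapper/reducer classes here.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new Launcher(), args));
    }
}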

Thanks
Hemanth


On Wed, Mar 27, 2013 at 1:58 PM, nagarjuna kanamarlapudi 
nagarjuna.kanamarlap...@gmail.com wrote:

 Hi Hemanth/Koji,

 It seems the above script doesn't work for me. Can you look into the following
  and suggest what more I can do?


  hadoop fs -cat /user/ims-b/dump.sh
 #!/bin/sh
 hadoop dfs -put myheapdump.hprof /tmp/myheapdump_ims/${PWD//\//_}.hprof


 hadoop jar LL.jar com.hadoop.publicationMrPOC.Launcher  Fudan\ Univ
  -Dmapred.create.symlink=yes
 -Dmapred.cache.files=hdfs:///user/ims-b/dump.sh#dump.sh
 -Dmapred.reduce.child.java.opts='-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError
 -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'


 I am not able to see the heap dump at  /tmp/myheapdump_ims



 Erorr in the mapper :

 Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
   ... 17 more
 Caused by: java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOf(Arrays.java:2734)
   at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
   at java.util.ArrayList.add(ArrayList.java:351)
   at 
 com.hadoop.publicationMrPOC.PublicationMapper.configure(PublicationMapper.java:59)
   ... 22 more





 On Wed, Mar 27, 2013 at 10:16 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Koji,

 Works beautifully. Thanks a lot. I learnt at least 3 different things
 with your script today !

 Hemanth


 On Tue, Mar 26, 2013 at 9:41 PM, Koji Noguchi knogu...@yahoo-inc.comwrote:

 Create a dump.sh on hdfs.

 $ hadoop dfs -cat /user/knoguchi/dump.sh
 #!/bin/sh
 hadoop dfs -put myheapdump.hprof
 /tmp/myheapdump_knoguchi/${PWD//\//_}.hprof

 Run your job with

 -Dmapred.create.symlink=yes
 -Dmapred.cache.files=hdfs:///user/knoguchi/dump.sh#dump.sh
 -Dmapred.reduce.child.java.opts='-Xmx2048m
 -XX:+HeapDumpOnOutOfMemoryError
 -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'

 This should create the heap dump on hdfs at /tmp/myheapdump_knoguchi.

 Koji


 On Mar 26, 2013, at 11:53 AM, Hemanth Yamijala wrote:

  Hi,
 
  I tried to use the -XX:+HeapDumpOnOutOfMemoryError. Unfortunately,
 like I suspected, the dump goes to the current work directory of the task
 attempt as it executes on the cluster. This directory is cleaned up once
 the task is done. There are options to keep failed task files or task files
 matching a pattern. However, these are NOT retaining the current working
 directory. Hence, there is no option to get this from a cluster AFAIK.
 
  You are effectively left with the jmap option on pseudo distributed
 cluster I think.
 
  Thanks
  Hemanth
 
 
  On Tue, Mar 26, 2013 at 11:37 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:
  If your task is running out of memory, you could add the option
 -XX:+HeapDumpOnOutOfMemoryError
  to mapred.child.java.opts (along with the heap memory). However, I am
 not sure  where it stores the dump.. You might need to experiment a little
 on it.. Will try and send out the info if I get time to try out.
 
 
  Thanks
  Hemanth
 
 
  On Tue, Mar 26, 2013 at 10:23 AM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:
  Hi hemanth,
 
  This sounds interesting, will out try out that on the pseudo cluster.
  But the real problem for me is, the cluster is being maintained by third
 party. I only have have a edge node through which I can submit the jobs.
 
  Is there any other way of getting the dump instead of physically going
 to that machine and  checking out.
 
 
 
  On Tue, Mar 26, 2013 at 10:12 AM, Hemanth Yamijala 
 yhema

Re: Child JVM memory allocation / Usage

2013-03-27 Thread Hemanth Yamijala
Hi,

 Dumping heap to ./heapdump.hprof

 File myheapdump.hprof does not exist.

The file names don't match - can you check your script / command line args.

Thanks
hemanth


On Wed, Mar 27, 2013 at 3:21 PM, nagarjuna kanamarlapudi 
nagarjuna.kanamarlap...@gmail.com wrote:

 Hi Hemanth,

 Nice to see this. I did not know about this till now.

 But one more issue: the dump file did not get created. The
  following are the logs:



 ttempt_201302211510_81218_m_00_0:
 /data/1/mapred/local/taskTracker/distcache/8776089957260881514_-363500746_715125253/cmp111wcd/user/ims-b/nagarjuna/AddressId_Extractor/Numbers
 attempt_201302211510_81218_m_00_0: java.lang.OutOfMemoryError: Java
 heap space
 attempt_201302211510_81218_m_00_0: Dumping heap to ./heapdump.hprof ...
 attempt_201302211510_81218_m_00_0: Heap dump file created [210641441
 bytes in 3.778 secs]
 attempt_201302211510_81218_m_00_0: #
 attempt_201302211510_81218_m_00_0: # java.lang.OutOfMemoryError: Java
 heap space
 attempt_201302211510_81218_m_00_0: # -XX:OnOutOfMemoryError=./dump.sh
 attempt_201302211510_81218_m_00_0: #   Executing /bin/sh -c
 ./dump.sh...
 attempt_201302211510_81218_m_00_0: put: File myheapdump.hprof does not
 exist.
 attempt_201302211510_81218_m_00_0: log4j:WARN No appenders could be
 found for logger (org.apache.hadoop.hdfs.DFSClient).





 On Wed, Mar 27, 2013 at 2:29 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Couple of things to check:

 Does your class com.hadoop.publicationMrPOC.Launcher implement the Tool
 interface ? You can look at an example at (
 http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Source+Code-N110D0).
 That's what accepts the -D params on command line. Alternatively, you can
 also set the same in the configuration object like this, in your launcher
 code:

 Configuration conf = new Configuration();

 conf.set("mapred.create.symlink", "yes");
 conf.set("mapred.cache.files",
     "hdfs:///user/hemanty/scripts/copy_dump.sh#copy_dump.sh");
 conf.set("mapred.child.java.opts",
     "-Xmx200m -XX:+HeapDumpOnOutOfMemoryError "
     + "-XX:HeapDumpPath=./heapdump.hprof "
     + "-XX:OnOutOfMemoryError=./copy_dump.sh");


 Second, the position of the arguments matters. I think the command should
 be

 hadoop jar -Dmapred.create.symlink=yes 
 -Dmapred.cache.files=hdfs:///user/ims-b/dump.sh#dump.sh
 -Dmapred.reduce.child.java.opts='-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError
 -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'
 com.hadoop.publicationMrPOC.Launcher  Fudan\ Univ

 Thanks
 Hemanth


 On Wed, Mar 27, 2013 at 1:58 PM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:

 Hi Hemanth/Koji,

 Seems the above script doesn't work for me.  Can u look into the
 following and suggest what more can I do


  hadoop fs -cat /user/ims-b/dump.sh
 #!/bin/sh
 hadoop dfs -put myheapdump.hprof /tmp/myheapdump_ims/${PWD//\//_}.hprof


 hadoop jar LL.jar com.hadoop.publicationMrPOC.Launcher  Fudan\ Univ
  -Dmapred.create.symlink=yes
 -Dmapred.cache.files=hdfs:///user/ims-b/dump.sh#dump.sh
 -Dmapred.reduce.child.java.opts='-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError
 -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'


 I am not able to see the heap dump at  /tmp/myheapdump_ims



 Erorr in the mapper :

 Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at 
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
 ... 17 more
 Caused by: java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2734)
 at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
 at java.util.ArrayList.add(ArrayList.java:351)
 at 
 com.hadoop.publicationMrPOC.PublicationMapper.configure(PublicationMapper.java:59)
 ... 22 more





 On Wed, Mar 27, 2013 at 10:16 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Koji,

 Works beautifully. Thanks a lot. I learnt at least 3 different things
 with your script today !

 Hemanth


 On Tue, Mar 26, 2013 at 9:41 PM, Koji Noguchi 
 knogu...@yahoo-inc.comwrote:

 Create a dump.sh on hdfs.

 $ hadoop dfs -cat /user/knoguchi/dump.sh
 #!/bin/sh
 hadoop dfs -put myheapdump.hprof
 /tmp/myheapdump_knoguchi/${PWD//\//_}.hprof

 Run your job with

 -Dmapred.create.symlink=yes
 -Dmapred.cache.files=hdfs:///user/knoguchi/dump.sh#dump.sh
 -Dmapred.reduce.child.java.opts='-Xmx2048m
 -XX:+HeapDumpOnOutOfMemoryError
 -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'

 This should create the heap dump on hdfs at /tmp/myheapdump_knoguchi.

 Koji


 On Mar 26, 2013, at 11:53 AM, Hemanth Yamijala wrote:

  Hi,
 
  I tried to use

Re: Child JVM memory allocation / Usage

2013-03-26 Thread Hemanth Yamijala
If your task is running out of memory, you could add the option
-XX:+HeapDumpOnOutOfMemoryError to mapred.child.java.opts (along with the heap
memory). However, I am not sure where it stores the dump. You might need to
experiment a little with it. Will try and send out the info if I get time to
try it out.


Thanks
Hemanth
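
A sketch of the setting (the heap size and dump path here are placeholders;
where the dump actually ends up is discussed elsewhere in this thread):

// in the job driver:
Configuration conf = new Configuration();
conf.set("mapred.child.java.opts",
    "-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heapdump.hprof");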


On Tue, Mar 26, 2013 at 10:23 AM, nagarjuna kanamarlapudi 
nagarjuna.kanamarlap...@gmail.com wrote:

 Hi hemanth,

 This sounds interesting, will out try out that on the pseudo cluster.  But
 the real problem for me is, the cluster is being maintained by third party.
 I only have have a edge node through which I can submit the jobs.

 Is there any other way of getting the dump instead of physically going to
 that machine and  checking out.



 On Tue, Mar 26, 2013 at 10:12 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hi,

 One option to find what could be taking the memory is to use jmap on the
 running task. The steps I followed are:

 - I ran a sleep job (which comes in the examples jar of the distribution
 - effectively does nothing in the mapper / reducer).
 - From the JobTracker UI looked at a map task attempt ID.
 - Then on the machine where the map task is running, got the PID of the
 running task - ps -ef | grep task attempt id
 - On the same machine executed jmap -histo pid

 This will give you an idea of the count of objects allocated and size.
 Jmap also has options to get a dump, that will contain more information,
 but this should help to get you started with debugging.

 For my sleep job task - I saw allocations worth roughly 130 MB.

 Thanks
 hemanth




 On Mon, Mar 25, 2013 at 6:43 PM, Nagarjuna Kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:

 I have a lookup file which I need in the mapper. So I am trying to read
 the whole file and load it into list in the mapper.

 For each and every record Iook in this file which I got from distributed
 cache.

 —
 Sent from iPhone


 On Mon, Mar 25, 2013 at 6:39 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hmm. How are you loading the file into memory ? Is it some sort of
 memory mapping etc ? Are they being read as records ? Some details of the
 app will help


 On Mon, Mar 25, 2013 at 2:14 PM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:

 Hi Hemanth,

 I tried out your suggestion loading 420 MB file into memory. It threw
 java heap space error.

 I am not sure where this 1.6 GB of configured heap went to ?


 On Mon, Mar 25, 2013 at 12:01 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hi,

 The free memory might be low, just because GC hasn't reclaimed what
 it can. Can you just try reading in the data you want to read and see if
 that works ?

 Thanks
 Hemanth


 On Mon, Mar 25, 2013 at 10:32 AM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:

 io.sort.mb = 256 MB


 On Monday, March 25, 2013, Harsh J wrote:

 The MapTask may consume some memory of its own as well. What is your
 io.sort.mb (MR1) or mapreduce.task.io.sort.mb (MR2) set to?

 On Sun, Mar 24, 2013 at 3:40 PM, nagarjuna kanamarlapudi
 nagarjuna.kanamarlap...@gmail.com wrote:
  Hi,
 
  I configured  my child jvm heap to 2 GB. So, I thought I could
 really read
  1.5GB of data and store it in memory (mapper/reducer).
 
  I wanted to confirm the same and wrote the following piece of
 code in the
  configure method of mapper.
 
  @Override
 
  public void configure(JobConf job) {
 
  System.out.println(FREE MEMORY -- 
 
  + Runtime.getRuntime().freeMemory());
 
  System.out.println(MAX MEMORY --- +
 Runtime.getRuntime().maxMemory());
 
  }
 
 
  Surprisingly the output was
 
 
  FREE MEMORY -- 341854864  = 320 MB
  MAX MEMORY ---1908932608  = 1.9 GB
 
 
  I am just wondering what processes are taking up that extra 1.6GB
 of heap
  which I configured for the child jvm heap.
 
 
  Appreciate in helping me understand the scenario.
 
 
 
  Regards
 
  Nagarjuna K
 
 
 



 --
 Harsh J



 --
 Sent from iPhone










Re: Child JVM memory allocation / Usage

2013-03-26 Thread Hemanth Yamijala
Hi,

I tried to use the -XX:+HeapDumpOnOutOfMemoryError option. Unfortunately, as I
suspected, the dump goes to the current working directory of the task attempt
as it executes on the cluster. This directory is cleaned up once the task
is done. There are options to keep failed task files, or task files matching
a pattern. However, these do NOT retain the current working directory.
Hence, there is no option to get this from a cluster AFAIK.

You are effectively left with the jmap option on a pseudo-distributed cluster,
I think.

Thanks
Hemanth
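
For completeness, the options being referred to are the MR1
keep.failed.task.files / keep.task.files.pattern settings (a sketch; as noted
above, they did not preserve the working directory in this test):

// org.apache.hadoop.mapred.JobConf, MR1 API:
JobConf conf = new JobConf();
conf.setKeepFailedTaskFiles(true);              // keep.failed.task.files
conf.setKeepTaskFilesPattern(".*_m_000000_0");  // keep.task.files.pattern (regex is a placeholder)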


On Tue, Mar 26, 2013 at 11:37 AM, Hemanth Yamijala 
yhema...@thoughtworks.com wrote:

 If your task is running out of memory, you could add the option *
 -XX:+HeapDumpOnOutOfMemoryError *
 *to *mapred.child.java.opts (along with the heap memory). However, I am
 not sure  where it stores the dump.. You might need to experiment a little
 on it.. Will try and send out the info if I get time to try out.


 Thanks
 Hemanth


 On Tue, Mar 26, 2013 at 10:23 AM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:

 Hi hemanth,

 This sounds interesting, will out try out that on the pseudo cluster.
  But the real problem for me is, the cluster is being maintained by third
 party. I only have have a edge node through which I can submit the jobs.

 Is there any other way of getting the dump instead of physically going to
 that machine and  checking out.



  On Tue, Mar 26, 2013 at 10:12 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hi,

 One option to find what could be taking the memory is to use jmap on the
 running task. The steps I followed are:

 - I ran a sleep job (which comes in the examples jar of the distribution
 - effectively does nothing in the mapper / reducer).
 - From the JobTracker UI looked at a map task attempt ID.
 - Then on the machine where the map task is running, got the PID of the
 running task - ps -ef | grep task attempt id
 - On the same machine executed jmap -histo pid

 This will give you an idea of the count of objects allocated and size.
 Jmap also has options to get a dump, that will contain more information,
 but this should help to get you started with debugging.

 For my sleep job task - I saw allocations worth roughly 130 MB.

 Thanks
 hemanth




 On Mon, Mar 25, 2013 at 6:43 PM, Nagarjuna Kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:

 I have a lookup file which I need in the mapper. So I am trying to read
 the whole file and load it into list in the mapper.

 For each and every record Iook in this file which I got from
 distributed cache.

 —
 Sent from iPhone


 On Mon, Mar 25, 2013 at 6:39 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hmm. How are you loading the file into memory ? Is it some sort of
 memory mapping etc ? Are they being read as records ? Some details of the
 app will help


 On Mon, Mar 25, 2013 at 2:14 PM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:

 Hi Hemanth,

 I tried out your suggestion loading 420 MB file into memory. It threw
 java heap space error.

 I am not sure where this 1.6 GB of configured heap went to ?


 On Mon, Mar 25, 2013 at 12:01 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hi,

 The free memory might be low, just because GC hasn't reclaimed what
 it can. Can you just try reading in the data you want to read and see if
 that works ?

 Thanks
 Hemanth


 On Mon, Mar 25, 2013 at 10:32 AM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:

 io.sort.mb = 256 MB


 On Monday, March 25, 2013, Harsh J wrote:

 The MapTask may consume some memory of its own as well. What is
 your
 io.sort.mb (MR1) or mapreduce.task.io.sort.mb (MR2) set to?

 On Sun, Mar 24, 2013 at 3:40 PM, nagarjuna kanamarlapudi
 nagarjuna.kanamarlap...@gmail.com wrote:
  Hi,
 
  I configured  my child jvm heap to 2 GB. So, I thought I could
 really read
  1.5GB of data and store it in memory (mapper/reducer).
 
  I wanted to confirm the same and wrote the following piece of
 code in the
  configure method of mapper.
 
  @Override
 
  public void configure(JobConf job) {
 
  System.out.println(FREE MEMORY -- 
 
  + Runtime.getRuntime().freeMemory());
 
  System.out.println(MAX MEMORY --- +
 Runtime.getRuntime().maxMemory());
 
  }
 
 
  Surprisingly the output was
 
 
  FREE MEMORY -- 341854864  = 320 MB
  MAX MEMORY ---1908932608  = 1.9 GB
 
 
  I am just wondering what processes are taking up that extra
 1.6GB of heap
  which I configured for the child jvm heap.
 
 
  Appreciate in helping me understand the scenario.
 
 
 
  Regards
 
  Nagarjuna K
 
 
 



 --
 Harsh J



 --
 Sent from iPhone











Re: Child JVM memory allocation / Usage

2013-03-26 Thread Hemanth Yamijala
Koji,

Works beautifully. Thanks a lot. I learnt at least 3 different things with
your script today !

Hemanth


On Tue, Mar 26, 2013 at 9:41 PM, Koji Noguchi knogu...@yahoo-inc.comwrote:

 Create a dump.sh on hdfs.

 $ hadoop dfs -cat /user/knoguchi/dump.sh
 #!/bin/sh
 hadoop dfs -put myheapdump.hprof
 /tmp/myheapdump_knoguchi/${PWD//\//_}.hprof

 Run your job with

 -Dmapred.create.symlink=yes
 -Dmapred.cache.files=hdfs:///user/knoguchi/dump.sh#dump.sh
 -Dmapred.reduce.child.java.opts='-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError
 -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'

 This should create the heap dump on hdfs at /tmp/myheapdump_knoguchi.

 Koji
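
 For anyone else reading the script, here is an annotated copy (the comments
 are additions for clarity and were not part of the original message):

 #!/bin/sh
 # Runs from the task attempt's working directory when the child JVM hits
 # -XX:OnOutOfMemoryError, after -XX:HeapDumpPath has written myheapdump.hprof there.
 # ${PWD//\//_} is bash substitution that replaces every "/" in that directory
 # path with "_", so each attempt's dump gets a unique, flat file name on HDFS.
 hadoop dfs -put myheapdump.hprof /tmp/myheapdump_knoguchi/${PWD//\//_}.hprof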


 On Mar 26, 2013, at 11:53 AM, Hemanth Yamijala wrote:

  Hi,
 
  I tried to use the -XX:+HeapDumpOnOutOfMemoryError. Unfortunately, like
 I suspected, the dump goes to the current work directory of the task
 attempt as it executes on the cluster. This directory is cleaned up once
 the task is done. There are options to keep failed task files or task files
 matching a pattern. However, these are NOT retaining the current working
 directory. Hence, there is no option to get this from a cluster AFAIK.
 
  You are effectively left with the jmap option on pseudo distributed
 cluster I think.
 
  Thanks
  Hemanth
 
 
  On Tue, Mar 26, 2013 at 11:37 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:
  If your task is running out of memory, you could add the option
 -XX:+HeapDumpOnOutOfMemoryError
  to mapred.child.java.opts (along with the heap memory). However, I am
 not sure  where it stores the dump.. You might need to experiment a little
 on it.. Will try and send out the info if I get time to try out.
 
 
  Thanks
  Hemanth
 
 
  On Tue, Mar 26, 2013 at 10:23 AM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:
  Hi hemanth,
 
  This sounds interesting, will out try out that on the pseudo cluster.
  But the real problem for me is, the cluster is being maintained by third
 party. I only have have a edge node through which I can submit the jobs.
 
  Is there any other way of getting the dump instead of physically going
 to that machine and  checking out.
 
 
 
  On Tue, Mar 26, 2013 at 10:12 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:
  Hi,
 
  One option to find what could be taking the memory is to use jmap on the
 running task. The steps I followed are:
 
  - I ran a sleep job (which comes in the examples jar of the distribution
 - effectively does nothing in the mapper / reducer).
  - From the JobTracker UI looked at a map task attempt ID.
  - Then on the machine where the map task is running, got the PID of the
 running task - ps -ef | grep task attempt id
  - On the same machine executed jmap -histo pid
 
  This will give you an idea of the count of objects allocated and size.
 Jmap also has options to get a dump, that will contain more information,
 but this should help to get you started with debugging.
 
  For my sleep job task - I saw allocations worth roughly 130 MB.
 
  Thanks
  hemanth
 
 
 
 
  On Mon, Mar 25, 2013 at 6:43 PM, Nagarjuna Kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:
  I have a lookup file which I need in the mapper. So I am trying to read
 the whole file and load it into list in the mapper.
 
 
  For each and every record Iook in this file which I got from distributed
 cache.
 
  —
  Sent from iPhone
 
 
  On Mon, Mar 25, 2013 at 6:39 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:
 
  Hmm. How are you loading the file into memory ? Is it some sort of
 memory mapping etc ? Are they being read as records ? Some details of the
 app will help
 
 
  On Mon, Mar 25, 2013 at 2:14 PM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:
  Hi Hemanth,
 
  I tried out your suggestion loading 420 MB file into memory. It threw
 java heap space error.
 
  I am not sure where this 1.6 GB of configured heap went to ?
 
 
  On Mon, Mar 25, 2013 at 12:01 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:
  Hi,
 
  The free memory might be low, just because GC hasn't reclaimed what it
 can. Can you just try reading in the data you want to read and see if that
 works ?
 
  Thanks
  Hemanth
 
 
  On Mon, Mar 25, 2013 at 10:32 AM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:
  io.sort.mb = 256 MB
 
 
  On Monday, March 25, 2013, Harsh J wrote:
  The MapTask may consume some memory of its own as well. What is your
  io.sort.mb (MR1) or mapreduce.task.io.sort.mb (MR2) set to?
 
  On Sun, Mar 24, 2013 at 3:40 PM, nagarjuna kanamarlapudi
  nagarjuna.kanamarlap...@gmail.com wrote:
   Hi,
  
   I configured  my child jvm heap to 2 GB. So, I thought I could really
 read
   1.5GB of data and store it in memory (mapper/reducer).
  
   I wanted to confirm the same and wrote the following piece of code in
 the
   configure method of mapper.
  
   @Override
  
   public void configure(JobConf job) {
  
   System.out.println(FREE MEMORY

Re: How to tell my Hadoop cluster to read data from an external server

2013-03-26 Thread Hemanth Yamijala
The stack trace indicates the job client is trying to submit a job to the
MR cluster and it is failing. Are you certain that the JobTracker is running
(on localhost:54312) at the time of submitting the job?

Regarding using a different file system - it depends a lot on what file
system you are using, and whether it will match the requirements of large
scale distributed processing that Hadoop MR can offer. Suggest you be very
sure about this, before you take that route.

Thanks
Hemanth
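
A quick way to check (a sketch; the port is taken from the log below, adjust it
to whatever mapred.job.tracker is actually set to):

# on the machine that is supposed to run the JobTracker
jps                          # should list a JobTracker process
netstat -an | grep 54312     # should show the port in LISTEN state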


On Tue, Mar 26, 2013 at 4:22 PM, Agarwal, Nikhil
nikhil.agar...@netapp.comwrote:

  Hi,

 Thanks for your reply. I do not know about cascading. Should I google it
 as “cascading in hadoop”? Also, what I was thinking is to implement a file
 system which overrides the functions provided by fs.FileSystem interface in
 Hadoop. I tried to write some portions of the filesystem (for my external
 server) so that it recompiles successfully but when I submit a MR job I get
 the following error:


 13/03/26 06:09:10 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:54312. Already tried 0 time(s).
 13/03/26 06:09:11 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:54312. Already tried 1 time(s).
 13/03/26 06:09:12 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:54312. Already tried 2 time(s).
 13/03/26 06:09:13 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:54312. Already tried 3 time(s).
 13/03/26 06:09:14 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:54312. Already tried 4 time(s).
 13/03/26 06:09:15 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:54312. Already tried 5 time(s).
 13/03/26 06:09:16 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:54312. Already tried 6 time(s).
 13/03/26 06:09:17 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:54312. Already tried 7 time(s).
 13/03/26 06:09:18 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:54312. Already tried 8 time(s).
 13/03/26 06:09:19 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:54312. Already tried 9 time(s).
 13/03/26 06:10:20 ERROR security.UserGroupInformation:
 PriviledgedActionException as:nikhil cause:java.net.ConnectException: Call
 to localhost/127.0.0.1:54312 failed on connection exception:
 java.net.ConnectException: Connection refused
 java.net.ConnectException: Call to localhost/127.0.0.1:54312 failed on
 connection exception: java.net.ConnectException: Connection refused
 at org.apache.hadoop.ipc.Client.wrapException(Client.java:1099)
 at org.apache.hadoop.ipc.Client.call(Client.java:1075)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
 at
 org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:480)
 at org.apache.hadoop.mapred.JobClient.init(JobClient.java:474)
 at org.apache.hadoop.mapred.JobClient.init(JobClient.java:457)
 at org.apache.hadoop.mapreduce.Job$1.run(Job.java:513)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:416)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
 at org.apache.hadoop.mapreduce.Job.connect(Job.java:511)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:499)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
 at org.apache.hadoop.examples.WordCount.main(WordCount.java:67)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:616)
 at
 org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:616)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 Caused by: java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
 at
 org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
 at
 

Re: Child JVM memory allocation / Usage

2013-03-25 Thread Hemanth Yamijala
Hi,

The free memory might be low, just because GC hasn't reclaimed what it can.
Can you just try reading in the data you want to read and see if that works
?

Thanks
Hemanth
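
One more data point that may help (plain JVM behaviour, not Hadoop-specific):
Runtime.freeMemory() is measured against the heap allocated so far
(totalMemory()), not against the -Xmx ceiling (maxMemory()), so a low value on
its own does not mean the heap is nearly full. A small sketch:

Runtime rt = Runtime.getRuntime();
long free  = rt.freeMemory();    // free space within the heap allocated so far
long total = rt.totalMemory();   // heap the JVM has grown to (starts near -Xms)
long max   = rt.maxMemory();     // the -Xmx ceiling
// a rough estimate of what can still be used before a real OutOfMemoryError:
long roughlyAvailable = max - total + free;
System.out.println("ROUGHLY AVAILABLE -- " + roughlyAvailable);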


On Mon, Mar 25, 2013 at 10:32 AM, nagarjuna kanamarlapudi 
nagarjuna.kanamarlap...@gmail.com wrote:

 io.sort.mb = 256 MB


 On Monday, March 25, 2013, Harsh J wrote:

 The MapTask may consume some memory of its own as well. What is your
 io.sort.mb (MR1) or mapreduce.task.io.sort.mb (MR2) set to?

 On Sun, Mar 24, 2013 at 3:40 PM, nagarjuna kanamarlapudi
 nagarjuna.kanamarlap...@gmail.com wrote:
  Hi,
 
  I configured  my child jvm heap to 2 GB. So, I thought I could really
 read
  1.5GB of data and store it in memory (mapper/reducer).
 
  I wanted to confirm the same and wrote the following piece of code in
 the
  configure method of mapper.
 
  @Override
 
  public void configure(JobConf job) {
 
  System.out.println(FREE MEMORY -- 
 
  + Runtime.getRuntime().freeMemory());
 
  System.out.println(MAX MEMORY --- + Runtime.getRuntime().maxMemory());
 
  }
 
 
  Surprisingly the output was
 
 
  FREE MEMORY -- 341854864  = 320 MB
  MAX MEMORY ---1908932608  = 1.9 GB
 
 
  I am just wondering what processes are taking up that extra 1.6GB of
 heap
  which I configured for the child jvm heap.
 
 
  Appreciate in helping me understand the scenario.
 
 
 
  Regards
 
  Nagarjuna K
 
 
 



 --
 Harsh J



 --
 Sent from iPhone



Re: Child JVM memory allocation / Usage

2013-03-25 Thread Hemanth Yamijala
Hmm. How are you loading the file into memory ? Is it some sort of memory
mapping etc ? Are they being read as records ? Some details of the app will
help


On Mon, Mar 25, 2013 at 2:14 PM, nagarjuna kanamarlapudi 
nagarjuna.kanamarlap...@gmail.com wrote:

 Hi Hemanth,

 I tried out your suggestion loading 420 MB file into memory. It threw java
 heap space error.

 I am not sure where this 1.6 GB of configured heap went to ?


 On Mon, Mar 25, 2013 at 12:01 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hi,

 The free memory might be low, just because GC hasn't reclaimed what it
 can. Can you just try reading in the data you want to read and see if that
 works ?

 Thanks
 Hemanth


 On Mon, Mar 25, 2013 at 10:32 AM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:

 io.sort.mb = 256 MB


 On Monday, March 25, 2013, Harsh J wrote:

 The MapTask may consume some memory of its own as well. What is your
 io.sort.mb (MR1) or mapreduce.task.io.sort.mb (MR2) set to?

 On Sun, Mar 24, 2013 at 3:40 PM, nagarjuna kanamarlapudi
 nagarjuna.kanamarlap...@gmail.com wrote:
  Hi,
 
  I configured  my child jvm heap to 2 GB. So, I thought I could really
 read
  1.5GB of data and store it in memory (mapper/reducer).
 
  I wanted to confirm the same and wrote the following piece of code in
 the
  configure method of mapper.
 
  @Override
 
  public void configure(JobConf job) {
 
  System.out.println(FREE MEMORY -- 
 
  + Runtime.getRuntime().freeMemory());
 
  System.out.println(MAX MEMORY --- +
 Runtime.getRuntime().maxMemory());
 
  }
 
 
  Surprisingly the output was
 
 
  FREE MEMORY -- 341854864  = 320 MB
  MAX MEMORY ---1908932608  = 1.9 GB
 
 
  I am just wondering what processes are taking up that extra 1.6GB of
 heap
  which I configured for the child jvm heap.
 
 
  Appreciate in helping me understand the scenario.
 
 
 
  Regards
 
  Nagarjuna K
 
 
 



 --
 Harsh J



 --
 Sent from iPhone






Re: Child JVM memory allocation / Usage

2013-03-25 Thread Hemanth Yamijala
Hi,

One option to find what could be taking the memory is to use jmap on the
running task. The steps I followed are:

- I ran a sleep job (which comes in the examples jar of the distribution -
effectively does nothing in the mapper / reducer).
- From the JobTracker UI looked at a map task attempt ID.
- Then on the machine where the map task is running, got the PID of the
running task - ps -ef | grep <task attempt id>
- On the same machine executed jmap -histo <pid>

This will give you an idea of the count of objects allocated and size. Jmap
also has options to get a dump, that will contain more information, but
this should help to get you started with debugging.

For my sleep job task - I saw allocations worth roughly 130 MB.

Thanks
hemanth
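
The same steps as concrete commands (a sketch; the attempt id and pid are
placeholders):

# on the node where the task attempt is running
ps -ef | grep <task attempt id>              # find the child JVM running the task
jmap -histo <pid> | head -30                 # object counts and sizes, largest first
jmap -dump:format=b,file=task.hprof <pid>    # optional: full heap dump for offline analysis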




On Mon, Mar 25, 2013 at 6:43 PM, Nagarjuna Kanamarlapudi 
nagarjuna.kanamarlap...@gmail.com wrote:

 I have a lookup file which I need in the mapper. So I am trying to read
 the whole file and load it into list in the mapper.

 For each and every record Iook in this file which I got from distributed
 cache.

 —
 Sent from iPhone


 On Mon, Mar 25, 2013 at 6:39 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hmm. How are you loading the file into memory ? Is it some sort of memory
 mapping etc ? Are they being read as records ? Some details of the app will
 help


 On Mon, Mar 25, 2013 at 2:14 PM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:

 Hi Hemanth,

 I tried out your suggestion loading 420 MB file into memory. It threw
 java heap space error.

 I am not sure where this 1.6 GB of configured heap went to ?


 On Mon, Mar 25, 2013 at 12:01 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hi,

 The free memory might be low, just because GC hasn't reclaimed what it
 can. Can you just try reading in the data you want to read and see if that
 works ?

 Thanks
 Hemanth


 On Mon, Mar 25, 2013 at 10:32 AM, nagarjuna kanamarlapudi 
 nagarjuna.kanamarlap...@gmail.com wrote:

 io.sort.mb = 256 MB


 On Monday, March 25, 2013, Harsh J wrote:

 The MapTask may consume some memory of its own as well. What is your
 io.sort.mb (MR1) or mapreduce.task.io.sort.mb (MR2) set to?

 On Sun, Mar 24, 2013 at 3:40 PM, nagarjuna kanamarlapudi
 nagarjuna.kanamarlap...@gmail.com wrote:
  Hi,
 
  I configured  my child jvm heap to 2 GB. So, I thought I could
 really read
  1.5GB of data and store it in memory (mapper/reducer).
 
  I wanted to confirm the same and wrote the following piece of code
 in the
  configure method of mapper.
 
  @Override
 
  public void configure(JobConf job) {
 
  System.out.println(FREE MEMORY -- 
 
  + Runtime.getRuntime().freeMemory());
 
  System.out.println(MAX MEMORY --- +
 Runtime.getRuntime().maxMemory());
 
  }
 
 
  Surprisingly the output was
 
 
  FREE MEMORY -- 341854864  = 320 MB
  MAX MEMORY ---1908932608  = 1.9 GB
 
 
  I am just wondering what processes are taking up that extra 1.6GB
 of heap
  which I configured for the child jvm heap.
 
 
  Appreciate in helping me understand the scenario.
 
 
 
  Regards
 
  Nagarjuna K
 
 
 



 --
 Harsh J



 --
 Sent from iPhone








Re: MapReduce Failed and Killed

2013-03-24 Thread Hemanth Yamijala
Any MapReduce task needs to communicate with the tasktracker that launched
it periodically in order to let the tasktracker know it is still alive and
active. The time for which silence is tolerated is controlled by a
configuration property mapred.task.timeout.

It looks like in your case, this has already been bumped up to 20 minutes
(from the default 10 minutes). It also looks like this is not sufficient.
You could bump this value even further up. However, the correct approach
could be to see what the reducer is actually doing to become inactive
during this time. Can you look at the reducer attempt's logs (which you can
access from the web UI of the Jobtracker) and post them here ?

Thanks
hemanth
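
If the reducer turns out to be genuinely busy rather than hung, one option is
to have it report progress while it works. A sketch against the old mapred.*
API (the class and types are placeholders; if you instead raise the limit, the
property is mapred.task.timeout, in milliseconds):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SlowReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            Text value = values.next();
            // ... expensive per-record work here ...
            reporter.progress();   // resets the mapred.task.timeout inactivity clock
        }
    }
}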


On Fri, Mar 22, 2013 at 5:32 PM, Jinchun Kim cien...@gmail.com wrote:

 Hi, All.

 I'm trying to create category-based splits of Wikipedia dataset(41GB) and
 the training data set(5GB) using Mahout.
 I'm using following command.

 $MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia/chunks -o
 wikipediainput -c $MAHOUT_HOME/examples/temp/categories.txt

 I had no problem with the training data set, but Hadoop showed following
 messages
 when I tried to do a same job with Wikipedia dataset,

 .
 13/03/21 22:31:00 INFO mapred.JobClient:  map 27% reduce 1%
 13/03/21 22:40:31 INFO mapred.JobClient:  map 27% reduce 2%
 13/03/21 22:58:49 INFO mapred.JobClient:  map 27% reduce 3%
 13/03/21 23:22:57 INFO mapred.JobClient:  map 27% reduce 4%
 13/03/21 23:46:32 INFO mapred.JobClient:  map 27% reduce 5%
 13/03/22 00:27:14 INFO mapred.JobClient:  map 27% reduce 6%
 13/03/22 01:06:55 INFO mapred.JobClient:  map 27% reduce 7%
 13/03/22 01:14:06 INFO mapred.JobClient:  map 27% reduce 3%
 13/03/22 01:15:35 INFO mapred.JobClient: Task Id :
 attempt_201303211339_0002_r_00_1, Status : FAILED
 Task attempt_201303211339_0002_r_00_1 failed to report status for 1200
 seconds. Killing!
 13/03/22 01:20:09 INFO mapred.JobClient:  map 27% reduce 4%
 13/03/22 01:33:35 INFO mapred.JobClient: Task Id :
 attempt_201303211339_0002_m_37_1, Status : FAILED
 Task attempt_201303211339_0002_m_37_1 failed to report status for 1228
 seconds. Killing!
 13/03/22 01:35:12 INFO mapred.JobClient:  map 27% reduce 5%
 13/03/22 01:40:38 INFO mapred.JobClient:  map 27% reduce 6%
 13/03/22 01:52:28 INFO mapred.JobClient:  map 27% reduce 7%
 13/03/22 02:16:27 INFO mapred.JobClient:  map 27% reduce 8%
 13/03/22 02:19:02 INFO mapred.JobClient: Task Id :
 attempt_201303211339_0002_m_18_1, Status : FAILED
 Task attempt_201303211339_0002_m_18_1 failed to report status for 1204
 seconds. Killing!
 13/03/22 02:49:03 INFO mapred.JobClient:  map 27% reduce 9%
 13/03/22 02:52:04 INFO mapred.JobClient:  map 28% reduce 9%
 

 Because I just started to learn how to run Hadoop, I have no idea how to
 solve
 this problem...
 Does anyone have an idea how to handle this weird thing?

 --
 *Jinchun Kim*



Re: Too many open files error with YARN

2013-03-21 Thread Hemanth Yamijala
There is a way to confirm if it is the same bug. Can you take a jstack of the
process that has established a connection to 50010 and post it here?

Thanks
hemanth
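
To gather that (a sketch; the pid is whichever process is holding the
CLOSE_WAIT sockets to port 50010):

lsof -p <pid> | grep 50010            # confirm the sockets and their state
jstack <pid> > jstack-<pid>.out       # thread dump showing who holds the unclosed streams
ulimit -n                             # the per-user open-file limit that is being exhausted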


On Thu, Mar 21, 2013 at 12:13 PM, Krishna Kishore Bonagiri 
write2kish...@gmail.com wrote:

 Hi Hemanth  Sandy,

   Thanks for your reply. Yes, that indicates it is in close wait state,
 exactly like below:

 java  30718 dsadm  200u IPv4 1178376459  0t0
  TCP *:50010 (LISTEN)
 java  31512 dsadm  240u IPv6 1178391921  0t0
  TCP node1:51342-node1:50010 (CLOSE_WAIT)

 I just checked in at the link
 https://issues.apache.org/jira/browse/HDFS-3357 it shows 2.0.0-alpha both
 in affect versions and fix versions.

 There is another bug 3591, at
 https://issues.apache.org/jira/browse/HDFS-3591

 which says it is for backporting 3357 to branch 0.23

 So, I don't understand whether the fix is really in 2.0.0-alpha, request
 you to please clarify me.

 Thanks,
 Kishore





 On Thu, Mar 21, 2013 at 9:57 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 There was an issue related to hung connections (HDFS-3357). But the JIRA
 indicates the fix is available in Hadoop-2.0.0-alpha. Still, would be worth
 checking on Sandy's suggestion


 On Wed, Mar 20, 2013 at 11:09 PM, Sandy Ryza sandy.r...@cloudera.comwrote:

 Hi Kishore,

 50010 is the datanode port. Does your lsof indicate that the sockets are
 in CLOSE_WAIT?  I had come across an issue like this where that was a
 symptom.

 -Sandy


 On Wed, Mar 20, 2013 at 4:24 AM, Krishna Kishore Bonagiri 
 write2kish...@gmail.com wrote:

 Hi,

  I am running a date command with YARN's distributed shell example in a
 loop of 1000 times in this way:

 yarn jar
 /home/kbonagir/yarn/hadoop-2.0.0-alpha/share/hadoop/mapreduce/hadoop-yarn-applications-distributedshell-2.0.0-alpha.jar
 org.apache.hadoop.yarn.applications.distributedshell.Client --jar
 /home/kbonagir/yarn/hadoop-2.0.0-alpha/share/hadoop/mapreduce/hadoop-yarn-applications-distributedshell-2.0.0-alpha.jar
 --shell_command date --num_containers 2


 Around 730th time or so, I am getting an error in node manager's log
 saying that it failed to launch container because there are Too many open
 files and when I observe through lsof command,I find that there is one
 instance of this kind of file is left for each run of Application Master,
 and it kept growing as I am running it in loop.

 node1:44871-node1:50010

 Is this a known issue? Or am I missing doing something? Please help.

 Note: I am working on hadoop--2.0.0-alpha

 Thanks,
 Kishore







Re: Too many open files error with YARN

2013-03-20 Thread Hemanth Yamijala
There was an issue related to hung connections (HDFS-3357). But the JIRA
indicates the fix is available in Hadoop-2.0.0-alpha. Still, would be worth
checking on Sandy's suggestion


On Wed, Mar 20, 2013 at 11:09 PM, Sandy Ryza sandy.r...@cloudera.comwrote:

 Hi Kishore,

 50010 is the datanode port. Does your lsof indicate that the sockets are
 in CLOSE_WAIT?  I had come across an issue like this where that was a
 symptom.

 -Sandy


 On Wed, Mar 20, 2013 at 4:24 AM, Krishna Kishore Bonagiri 
 write2kish...@gmail.com wrote:

 Hi,

  I am running a date command with YARN's distributed shell example in a
 loop of 1000 times in this way:

 yarn jar
 /home/kbonagir/yarn/hadoop-2.0.0-alpha/share/hadoop/mapreduce/hadoop-yarn-applications-distributedshell-2.0.0-alpha.jar
 org.apache.hadoop.yarn.applications.distributedshell.Client --jar
 /home/kbonagir/yarn/hadoop-2.0.0-alpha/share/hadoop/mapreduce/hadoop-yarn-applications-distributedshell-2.0.0-alpha.jar
 --shell_command date --num_containers 2


 Around 730th time or so, I am getting an error in node manager's log
 saying that it failed to launch container because there are Too many open
 files and when I observe through lsof command,I find that there is one
 instance of this kind of file is left for each run of Application Master,
 and it kept growing as I am running it in loop.

 node1:44871-node1:50010

 Is this a known issue? Or am I missing doing something? Please help.

 Note: I am working on hadoop--2.0.0-alpha

 Thanks,
 Kishore





Re: map reduce and sync

2013-02-24 Thread Hemanth Yamijala
I am using the same version of Hadoop as you.

Can you look at something like Scribe, which AFAIK fits the use case you
describe.

Thanks
Hemanth


On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi luc...@gmail.com wrote:

 That is exactly what I did, but in my case, it is like if the file were
 empty, the job counters say no bytes read.
 I'm using hadoop 1.0.3 which version did you try?

 What I'm trying to do is just some basic analyitics on a product search
 system. There is a search service, every time a user performs a search, the
 search string, and the results are stored in this file, and the file is
 sync'ed. I'm actually using pig to do some basic counts, it doesn't work,
 like I described, because the file looks empty for the map reduce
 components. I thought it was about pig, but I wasn't sure, so I tried a
 simple mr job, and used the word count to test the map reduce compoinents
 actually see the sync'ed bytes.

 Of course if I close the file, everything works perfectly, but I don't
 want to close the file every while, since that means I should create
 another one (since no append support), and that would end up with too many
 tiny files, something we know is bad for mr performance, and I don't want
 to add more parts to this (like a file merging tool). I think unign sync is
 a clean solution, since we don't care about writing performance, so I'd
 rather keep it like this if I can make it work.

 Any idea besides hadoop version?

 Thanks!

 Lucas



 On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hi Lucas,

 I tried something like this but got different results.

 I wrote code that opened a file on HDFS, wrote a line and called sync.
 Without closing the file, I ran a wordcount with that file as input. It did
 work fine and was able to count the words that were sync'ed (even though
 the file length seems to come as 0 like you noted in fs -ls)

 So, not sure what's happening in your case. In the MR job, do the job
 counters indicate no bytes were read ?

 On a different note though, if you can describe a little more what you
 are trying to accomplish, we could probably work a better solution.

 Thanks
 hemanth


 On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi luc...@gmail.com wrote:

 Helo Hemanth, thanks for answering.
 The file is open by a separate process not map reduce related at all.
 You can think of it as a servlet, receiving requests, and writing them to
 this file, every time a request is received it is written and
 org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.

 At the same time, I want to run a map reduce job over this file. Simply
 runing the word count example doesn't seem to work, it is like if the file
 were empty.

 hadoop -fs -tail works just fine, and reading the file using
 org.apache.hadoop.fs.FSDataInputStream also works ok.

 Last thing, the web interface doesn't see the contents, and command
 hadoop -fs -ls says the file is empty.

 What am I doing wrong?

 Thanks!

 Lucas



 On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Could you please clarify, are you opening the file in your mapper code
 and reading from there ?

 Thanks
 Hemanth

 On Friday, February 22, 2013, Lucas Bernardi wrote:

 Hello there, I'm trying to use hadoop map reduce to process an open
 file. The writing process, writes a line to the file and syncs the
 file to readers.
 (org.apache.hadoop.fs.FSDataOutputStream.sync()).

 If I try to read the file from another process, it works fine, at
 least using
 org.apache.hadoop.fs.FSDataInputStream.

 hadoop -fs -tail also works just fine

 But it looks like map reduce doesn't read any data. I tried using the
 word count example, same thing, it is like if the file were empty for the
 map reduce framework.

 I'm using hadoop 1.0.3. and pig 0.10.0

 I need some help around this.

 Thanks!

 Lucas







Re: Trouble in running MapReduce application

2013-02-23 Thread Hemanth Yamijala
Can you try this ? Pick a class like WordCount from your package and
execute this command:

javap -classpath <path to your jar> -verbose org.myorg.WordCount | grep version

For e.g. here's what I get for my class:

$ javap -verbose WCMapper | grep version
  minor version: 0
  major version: 50

Please paste the output of this - we can verify what the problem is. (For
reference, major version 50 corresponds to Java 6 and 51 to Java 7.)

Thanks
Hemanth


On Sat, Feb 23, 2013 at 4:45 PM, Fatih Haltas fatih.hal...@nyu.edu wrote:

 Hi again,

 Thanks for your help but now, I am struggling with the same problem on a
 machine. As the preivous problem, I just decrease the Java version by Java
 6, but this time I could not solve the problem.

 those are outputs that may explain the situation:

 -
 1. I could not run my own code, to check the system I just tried to run
 basic wordcount example without any modification, except package info.
 **
 COMMAND EXECUTED: hadoop jar my.jar org.myorg.WordCount NetFlow NetFlow.out
 Warning: $HADOOP_HOME is deprecated.

 Exception in thread main java.lang.UnsupportedClassVersionError:
 org/myorg/WordCount : Unsupported major.minor version 51.0
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
 at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
 at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:266)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

 **
 2. Java version:
 
 COMMAND EXECUTED: java -version
 java version 1.6.0_24
 OpenJDK Runtime Environment (IcedTea6 1.11.6)
 (rhel-1.33.1.11.6.el5_9-x86_64)
 OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
 **
 3. JAVA_HOME variable:
 **
 COMMAND EXECUTED: echo $JAVA_HOME
 /usr/lib/jvm/jre-1.6.0-openjdk.x86_64
 
 4. HADOOP version:
 ***
 COMMAND EXECUTED: hadoop version
 Warning: $HADOOP_HOME is deprecated.

 Hadoop 1.0.4
 Subversion
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
 1393290
 Compiled by hortonfo on Wed Oct  3 05:13:58 UTC 2012
 From source with checksum fe2baea87c4c81a2c505767f3f9b71f4
 

 Are these still incompatible with eachother? (Hadoop version and java
 version)


 Thank you very much.


 On Tue, Feb 19, 2013 at 10:26 PM, Fatih Haltas fatih.hal...@nyu.eduwrote:

 Thank you all very much

  On Tuesday, 19 February 2013, Harsh J wrote:

 Oops. I just noticed Hemanth has been answering on a dupe thread as
 well. Lets drop this thread and carry on there :)

 On Tue, Feb 19, 2013 at 11:14 PM, Harsh J ha...@cloudera.com wrote:
  Hi,
 
  The new error usually happens if you compile using Java 7 and try to
  run via Java 6 (for example). That is, an incompatibility in the
  runtimes for the binary artifact produced.
 
  On Tue, Feb 19, 2013 at 10:09 PM, Fatih Haltas fatih.hal...@nyu.edu
 wrote:
  Thank you very much Harsh,
 
  Now, as I promised earlier I am much obliged to you.
 
  But, now I solved that problem by just changing the directories then
 again
  creating a jar file of org. but I am getting this error:
 
  1.) What I got
 
 --
  [hadoop@ADUAE042-LAP-V flowclasses_18_02]$ hadoop jar flow19028pm.jar
  org.myorg.MapReduce /home/hadoop/project/hadoop-data/NetFlow 19_02.out
  Warning: $HADOOP_HOME is deprecated.
 
  Exception in thread main java.lang.UnsupportedClassVersionError:
  org/myorg/MapReduce : Unsupported major.minor version 51.0
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
  at
 
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  at
 java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
  at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
  at 

Re: Reg job tracker page

2013-02-23 Thread Hemanth Yamijala
Yes. It corresponds to the JT start time.

Thanks
hemanth


On Sat, Feb 23, 2013 at 5:37 PM, Manoj Babu manoj...@gmail.com wrote:

 Bharath,
 I can understand that its time stamp.
 what does identifier means? whether is holds the job tracker instance
 started time?

 Cheers!
 Manoj.


 On Sat, Feb 23, 2013 at 5:25 PM, bharath vissapragada 
 bharathvissapragada1...@gmail.com wrote:

 Read it this way  2013/02/21-12:22  (yymmdd-time)


 On Sat, Feb 23, 2013 at 4:20 PM, Manoj Babu manoj...@gmail.com wrote:
 
  Hi All,
 
  What does this identifier means in the job tracker page?
 
  State: RUNNING
  Started: Thu Feb 21 12:22:03 CST 2013
  Version: 0.20.2-cdh3u1, bdafb1dbffd0d5f2fbc6ee022e1c8df6500fd638
  Compiled: Mon Jul 18 09:40:29 PDT 2011 by root from Unknown
  Identifier: 201302211222
 
 
  Thanks in advance.
 
  Cheers!
  Manoj.





Re: map reduce and sync

2013-02-23 Thread Hemanth Yamijala
Hi Lucas,

I tried something like this but got different results.

I wrote code that opened a file on HDFS, wrote a line and called sync.
Without closing the file, I ran a wordcount with that file as input. It did
work fine and was able to count the words that were sync'ed (even though
the file length seems to come as 0 like you noted in fs -ls)

So, not sure what's happening in your case. In the MR job, do the job
counters indicate no bytes were read ?

On a different note though, if you can describe a little more what you are
trying to accomplish, we could probably work a better solution.

Thanks
hemanth
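
A minimal sketch of the kind of writer being described (the path, contents and
sleep are placeholders, not the exact code used in this test; it assumes
fs.default.name points at HDFS):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/tmp/sync-test/words.txt"));
        out.writeBytes("the quick brown fox\n");
        out.sync();              // data becomes visible to readers, though fs -ls may still report length 0
        Thread.sleep(60000L);    // keep the writer alive with the file still open
        out.close();
    }
}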


On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi luc...@gmail.com wrote:

 Helo Hemanth, thanks for answering.
 The file is open by a separate process not map reduce related at all. You
 can think of it as a servlet, receiving requests, and writing them to this
 file, every time a request is received it is written and
 org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.

 At the same time, I want to run a map reduce job over this file. Simply
 runing the word count example doesn't seem to work, it is like if the file
 were empty.

 hadoop -fs -tail works just fine, and reading the file using
 org.apache.hadoop.fs.FSDataInputStream also works ok.

 Last thing, the web interface doesn't see the contents, and command hadoop
 -fs -ls says the file is empty.

 What am I doing wrong?

 Thanks!

 Lucas



 On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Could you please clarify, are you opening the file in your mapper code
 and reading from there ?

 Thanks
 Hemanth

 On Friday, February 22, 2013, Lucas Bernardi wrote:

 Hello there, I'm trying to use hadoop map reduce to process an open
 file. The writing process, writes a line to the file and syncs the file
 to readers.
 (org.apache.hadoop.fs.FSDataOutputStream.sync()).

 If I try to read the file from another process, it works fine, at least
 using
 org.apache.hadoop.fs.FSDataInputStream.

 hadoop -fs -tail also works just fine

 But it looks like map reduce doesn't read any data. I tried using the
 word count example, same thing, it is like if the file were empty for the
 map reduce framework.

 I'm using hadoop 1.0.3. and pig 0.10.0

 I need some help around this.

 Thanks!

 Lucas





Re: map reduce and sync

2013-02-22 Thread Hemanth Yamijala
Could you please clarify, are you opening the file in your mapper code and
reading from there ?

Thanks
Hemanth

On Friday, February 22, 2013, Lucas Bernardi wrote:

 Hello there, I'm trying to use hadoop map reduce to process an open file. The
 writing process, writes a line to the file and syncs the file to readers.
 (org.apache.hadoop.fs.FSDataOutputStream.sync()).

 If I try to read the file from another process, it works fine, at least
 using
 org.apache.hadoop.fs.FSDataInputStream.

 hadoop -fs -tail also works just fine

 But it looks like map reduce doesn't read any data. I tried using the word
 count example, same thing, it is like if the file were empty for the map
 reduce framework.

 I'm using hadoop 1.0.3. and pig 0.10.0

 I need some help around this.

 Thanks!

 Lucas



Re: Database insertion by HAdoop

2013-02-19 Thread Hemanth Yamijala
Sqoop can be used to export as well.

Thanks
Hemanth
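
A sketch of what an export invocation might look like (connection details,
table and directory are placeholders, and the SQL Server connector mentioned
earlier in this thread may need its own options):

sqoop export \
  --connect 'jdbc:sqlserver://<host>:1433;databaseName=<db>' \
  --username <user> --password <password> \
  --table similarity_results \
  --export-dir /user/hadoop/similarity/output \
  --input-fields-terminated-by '\t'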

On Tuesday, February 19, 2013, Masoud wrote:

  Dear Tariq

 No, exactly in opposite way, actually we compute the similarity between
 documents and insert them in database, in every table almost 2/000/000
 records.

 Best Regards

 On 02/19/2013 06:41 PM, Mohammad Tariq wrote:

  Hello Masoud,

So you want to pull your data from SQL server to your Hadoop
 cluster first and then do the processing. Please correct me if I am wrong.
 You can do that using Sqoop as mention by Hemanth sir. BTW, what exactly is
 the kind of processing which you are planning to do on your data.

  Warm Regards,
 Tariq
 https://mtariq.jux.com/
  cloudfront.blogspot.com


 On Tue, Feb 19, 2013 at 6:44 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hi,

  You could consider using sqoop. http://sqoop.apache.org/ there seemed to
 be a SQL connector from Microsoft.
 http://www.microsoft.com/en-gb/download/details.aspx?id=27584

 Thanks
  Hemanth

 On Tuesday, February 19, 2013, Masoud wrote:

  Hello Tariq,

 Our database is sql server 2008,
 and we dont need to develop a professional app, we just need to develop it
 fast and make our experiment result soon.
 Thanks


 On 02/18/2013 11:58 PM, Hemanth Yamijala wrote:

 What database is this ? Was hbase mentioned ?

 On Monday, February 18, 2013, Mohammad Tariq wrote:

 Hello Masoud,

   You can use the Bulk Load feature. You might find it more
 efficient than normal client APIs or using the TableOutputFormat.

  The bulk load feature uses a MapReduce job to output table data
 in HBase's internal data format, and then directly loads the
 generated StoreFiles into a running cluster. Using bulk load will use
 less CPU and network resources than simply using the HBase API.

  For a detailed info you can go here :
 http://hbase.apache.org/book/arch.bulk.load.html

  Warm Regards,
 Tariq
 https://mtariq.jux.com/
  cloudfront.blogspot.com




Re: ClassNotFoundException in Main

2013-02-19 Thread Hemanth Yamijala
I am not sure if that will actually work, because the class is defined to
be in the org.myorg package. I suggest you repackage to reflect the right
package structure.

Also, the error you are getting seems to indicate that you have compiled
using JDK 7. Note that some versions of Hadoop are supported only on JDK 6.
Which version of Hadoop are you using?

Thanks
Hemanth
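
A sketch of the repackaging (assuming the classes were compiled into
wordcount_classes/ with javac -d; input and output paths are placeholders):

jar cvf wordcount.jar -C wordcount_classes .
jar tf wordcount.jar          # should now list org/myorg/WordCount.class at the root
hadoop jar wordcount.jar org.myorg.WordCount <input> <output>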

On Tuesday, February 19, 2013, Fatih Haltas wrote:

 Thank you very much.
 When i tried with wordcount_classes.org.myorg.WordCount, I am getting the
 following error:

 [hadoop@ADUAE042-LAP-V project]$ hadoop jar wordcount_19_02.jar
 wordcount_classes.org.myorg.WordCount
 /home/hadoop/project/hadoop-data/NetFlow 19_02_wordcount.out
 Warning: $HADOOP_HOME is deprecated.

 Exception in thread main java.lang.UnsupportedClassVersionError:
 wordcount_classes/org/myorg/WordCount : Unsupported major.minor version 51.0
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
 at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
 at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:266)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:149)



 On Tue, Feb 19, 2013 at 8:10 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Sorry. I did not read the mail correctly. I think the error is in how the
 jar has been created. The classes start with root as wordcount_classes,
 instead of org.

 Thanks
 Hemanth


 On Tuesday, February 19, 2013, Hemanth Yamijala wrote:

 Have you used the Api setJarByClass in your main program?


 http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Job.html#setJarByClass(java.lang.Class)

 On Tuesday, February 19, 2013, Fatih Haltas wrote:

 Hi everyone,

 I know this is the common mistake to not specify the class adress while
 trying to run a jar, however,
 although I specified, I am still getting the ClassNotFound exception.

 What may be the reason for it? I have been struggling for this problem
 more than a 2 days.
 I just wrote different MapReduce application for some anlaysis. I got this
 problem.

 To check, is there something wrong with my system, i tried to run
 WordCount example.
 When I just run hadoop-examples wordcount, it is working fine.

 But when I add just package org.myorg; command at the beginning, it
 doesnot work.

 Here is what I have done so far
 *
 1. I just copied wordcount code from the apaches own examples source code
 and I just changed package decleration as package org.myorg;
 **
 2. Then I tried to run that command:
  *
 hadoop jar wordcount_19_02.jar org.myorg.WordCount
 /home/hadoop/project/hadoop-data/NetFlow 19_02_wordcount.output
 *
 3. I got following error:
 **
 [hadoop@ADUAE042-LAP-V project]$ hadoop jar wordcount_19_02.jar
 org.myorg.WordCount /home/hadoop/project/hadoop-data/NetFlow
 19_02_wordcount.output
 Warning: $HADOOP_HOME is deprecated.

 Exception in thread main java.lang.ClassNotFoundException:
 org.myorg.WordCount
 at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:266)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

 **
 4. This is the content of my .jar file:
 
 [hadoop@ADUAE042-LAP-V project]$ jar tf wordcount_19_02.jar
 META-INF/
 META-INF/MANIFEST.MF
 wordcount_classes/
 wordcount_classes/org/
 wordcount_classes/org/myorg/
 wordcount_classes/org/myorg/WordCount.class
 wordcount_classes/org/myorg/WordCount$TokenizerMapper.class
 wordcount_classes/org/myorg/WordCount$IntSumReducer.class

Re: JUnit test failing in HDFS when building Hadoop from source.

2013-02-19 Thread Hemanth Yamijala
Hi,

In the past, some tests have been flaky. It would be good if you can search
jira and see whether this is a known issue. Else, please file it, and if
possible, provide a patch. :)

Regarding whether this will be a reliable build, it depends a little bit on
what you are going to use it for. For development / test purposes,
personally, I would still go with it.

Thanks
Hemanth
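
A couple of command sketches (the HDFS module path is an assumption about the
source layout):

mvn test -Dtest=TestHftpURLTimeouts -pl hadoop-hdfs-project/hadoop-hdfs   # rerun just the flaky test
mvn package -Pdist,native -Dtar -DskipTests                               # build without running tests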

On Tuesday, February 19, 2013, Leena Rajendran wrote:

 Hi,

 I am posting for the first time. Please let me know if this needs to go to
 any other mailing list.

 I am trying to build Hadoop from source code, and I am able to
 successfully build until the Hadoop-Common-Project. However, in case of
 HDFS the test called TestHftpURLTimeouts is failing intermittently.
 Please note that when I run the test individually, it passes. I have taken
 care of all steps given in HowToContribute twiki page of Hadoop. Please let
 me know ,

 1. Whether this kind of behaviour is expected
 2. Can this intermittent test case failure be ignored.
 3. Will it be a reliable build if I use the -DskipTests option in mvn
 command

  Adding build results below for the following command : mvn -e  package
 -Pdist,native -Dtar

 Results :

 Failed tests:
 testHftpSocketTimeout(org.apache.hadoop.hdfs.TestHftpURLTimeouts):
 expected:connect timed out but was:null

 Tests run: 705, Failures: 1, Errors: 0, Skipped: 3

 [INFO]
 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Apache Hadoop Main  SUCCESS
 [2.280s]
 [INFO] Apache Hadoop Project POM . SUCCESS
 [2.339s]
 [INFO] Apache Hadoop Annotations . SUCCESS
 [3.916s]
 [INFO] Apache Hadoop Assemblies .. SUCCESS
 [0.323s]
 [INFO] Apache Hadoop Project Dist POM  SUCCESS
 [2.762s]
 [INFO] Apache Hadoop Auth  SUCCESS
 [7.695s]
 [INFO] Apache Hadoop Auth Examples ... SUCCESS
 [1.997s]
 [INFO] Apache Hadoop Common .. SUCCESS
 [14:13.276s]
 [INFO] Apache Hadoop Common Project .. SUCCESS
 [0.026s]
 [INFO] Apache Hadoop HDFS  FAILURE
 [1:36:02.602s]
 [INFO] Apache Hadoop HttpFS .. SKIPPED
 [INFO] Apache Hadoop HDFS BookKeeper Journal . SKIPPED
 [INFO] Apache Hadoop HDFS Project  SKIPPED
 [INFO] hadoop-yarn ... SKIPPED
 [INFO] hadoop-yarn-api ... SKIPPED
 [INFO] hadoop-yarn-common  SKIPPED
 [INFO] hadoop-yarn-server  SKIPPED
 [INFO] hadoop-yarn-server-common . SKIPPED
 [INFO] hadoop-yarn-server-nodemanager  SKIPPED
 [INFO] hadoop-yarn-server-web-proxy .. SKIPPED
 [INFO] hadoop-yarn-server-resourcemanager  SKIPPED
 [INFO] hadoop-yarn-server-tests .. SKIPPED
 [INFO] hadoop-yarn-client  SKIPPED
 [INFO] hadoop-yarn-applications .. SKIPPED
 [INFO] hadoop-yarn-applications-distributedshell . SKIPPED
 [INFO] hadoop-mapreduce-client ... SKIPPED
 [INFO] hadoop-mapreduce-client-core .. SKIPPED
 [INFO] hadoop-yarn-applications-unmanaged-am-launcher  SKIPPED
 [INFO] hadoop-yarn-site .. SKIPPED
 [INFO] hadoop-yarn-project ... SKIPPED
 [INFO] hadoop-mapreduce-client-common  SKIPPED
 [INFO] hadoop-mapreduce-client-shuffle ... SKIPPED
 [INFO] hadoop-mapreduce-client-app ... SKIPPED
 [INFO] hadoop-mapreduce-client-hs  SKIPPED
 [INFO] hadoop-mapreduce-client-jobclient . SKIPPED
 [INFO] hadoop-mapreduce-client-hs-plugins  SKIPPED
 [INFO] Apache Hadoop MapReduce Examples .. SKIPPED
 [INFO] hadoop-mapreduce .. SKIPPED
 [INFO] Apache Hadoop MapReduce Streaming . SKIPPED
 [INFO] Apache Hadoop Distributed Copy  SKIPPED
 [INFO] Apache Hadoop Archives  SKIPPED
 [INFO] Apache Hadoop Rumen ... SKIPPED
 [INFO] Apache Hadoop Gridmix . SKIPPED
 [INFO] Apache Hadoop Data Join ... SKIPPED
 [INFO] Apache Hadoop Extras .. SKIPPED
 [INFO] Apache Hadoop Pipes ... SKIPPED
 [INFO] Apache Hadoop Tools Dist .. SKIPPED
 [INFO] Apache Hadoop Tools ... SKIPPED
 [INFO] Apache Hadoop Distribution  SKIPPED
 [INFO] Apache Hadoop Client 

Re: Database insertion by HAdoop

2013-02-18 Thread Hemanth Yamijala
What database is this ? Was hbase mentioned ?

On Monday, February 18, 2013, Mohammad Tariq wrote:

 Hello Masoud,

   You can use the Bulk Load feature. You might find it more
 efficient than normal client APIs or using the TableOutputFormat.

 The bulk load feature uses a MapReduce job to output table data
 in HBase's internal data format, and then directly loads the
 generated StoreFiles into a running cluster. Using bulk load will use
 less CPU and network resources than simply using the HBase API.

 For a detailed info you can go here :
 http://hbase.apache.org/book/arch.bulk.load.html

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com
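
For illustration only, a minimal driver sketch of the bulk load setup described above. It assumes the HBase 0.94-era mapreduce API; the table name, the paths and the MyBulkLoadMapper class are placeholders for your own code (the mapper would emit ImmutableBytesWritable row keys and Put objects):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-example");
    job.setJarByClass(BulkLoadDriver.class);
    job.setMapperClass(MyBulkLoadMapper.class);        // hypothetical mapper emitting Puts
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HFile staging dir

    HTable table = new HTable(conf, "test");   // "test" is a placeholder table name
    // Wires in the reducer, partitioner and HFileOutputFormat for the target table.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Move the generated StoreFiles into the running table.
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
    }
  }
}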


 On Mon, Feb 18, 2013 at 5:00 PM, Masoud 
 mas...@agape.hanyang.ac.krjavascript:_e({}, 'cvml', 
 'mas...@agape.hanyang.ac.kr');
  wrote:


 Dear All,

 We are going to run the experiments for a scientific paper.
 We must insert data into our database for later analysis; there are almost
 300 tables, each with 2,000,000 records.
 As you know, it takes a lot of time to do this with a single machine,
 so we are going to use our Hadoop cluster (32 machines) and divide the 300
 insertion tasks between them.
 I need some hints to make this go faster:
 1- As I know, we don't need a Reducer; just a Mapper is enough.
 2- So we need to just implement the Mapper class with the needed code.

 Please let me know if there is any point,

 Best Regards
 Masoud







Re: How to understand DataNode usages ?

2013-02-14 Thread Hemanth Yamijala
This seems to be related to the % used capacity at a datanode. The values
are computed for all the live datanodes, and the range / central limits /
deviations are computed based on a sorted list of the values.

Thanks
hemanth


On Thu, Feb 14, 2013 at 2:42 PM, Dhanasekaran Anbalagan
bugcy...@gmail.comwrote:

 Hi Guys,

 On the Datanode UI page they give Datanode usage. What does it actually mean?
 Please guide me on Min, Median, Max and stdev.


 DataNodes usages:   Min %      Median %   Max %      stdev %
                     22.15 %    24.33 %    58.09 %    15.4 %


 -Dhanasekaran

 Did I learn something today? If not, I wasted it.



Re: Java submit job to remote server

2013-02-12 Thread Hemanth Yamijala
Can you please include the complete stack trace and not just the root.
Also, have you set fs.default.name to a hdfs location like
hdfs://localhost:9000 ?

Thanks
Hemanth
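
As a rough sketch of what is being suggested here, the client-side configuration would point at both the remote NameNode and the remote JobTracker before the Job is created (the host names and ports below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RemoteJobConfig {
  public static Job createJob() throws Exception {
    Configuration conf = new Configuration();
    // Remote HDFS NameNode; without this, paths resolve against the local file system.
    conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
    // Remote JobTracker; without this (default "local"), the job runs in-process
    // via the LocalJobRunner and never shows up on the JobTracker web UI.
    conf.set("mapred.job.tracker", "jobtracker.example.com:9001");
    return new Job(conf);
  }
}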

On Wednesday, February 13, 2013, Alex Thieme wrote:

 Thanks for the prompt reply and I'm sorry I forgot to include the
 exception. My bad. I've included it below. There certainly appears to be a
 server running on localhost:9001. At least, I was able to telnet to that
 address. While in development, I'm treating the server on localhost as the
 remote server. Moving to production, there'd obviously be a different
 remote server address configured.

 Root Exception stack trace:
 java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:375)
 at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
 at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
 + 3 more (set debug level logging or '-Dmule.verbose.exceptions=true'
 for everything)

 

 On Feb 12, 2013, at 4:22 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 conf.set(mapred.job.tracker, localhost:9001);

 this means that your jobtracker is on port 9001 on localhost

 if you change it to the remote host and thats the port its running on then
 it should work as expected

 whats the exception you are getting?


 On Wed, Feb 13, 2013 at 2:41 AM, Alex Thieme athi...@athieme.com wrote:

 I apologize for asking what seems to be such a basic question, but I would
 use some help with submitting a job to a remote server.

 I have downloaded and installed hadoop locally in pseudo-distributed mode.
 I have written some Java code to submit a job.

 Here's the org.apache.hadoop.util.Tool
 and org.apache.hadoop.mapreduce.Mapper I've written.

 If I enable the conf.set(mapred.job.tracker, localhost:9001) line,
 then I get the exception included below.

 If that line is disabled, then the job is completed. However, in reviewing
 the hadoop server administration page (
 http://localhost:50030/jobtracker.jsp) I don't see the job as processed
 by the server. Instead, I wonder if my Java code is simply running the
 necessary mapper Java code, bypassing the locally installed server.

 Thanks in advance.

 Alex

 public class OfflineDataTool extends Configured implements Tool {

 public int run(final String[] args) throws Exception {
 final Configuration conf = getConf();
 //conf.set(mapred.job.tracker, localhost:9001);

 final Job job = new Job(conf);
 job.setJarByClass(getClass());
 job.setJobName(getClass().getName());

 job.setMapperClass(OfflineDataMapper.class);

 job.setInputFormatClass(TextInputFormat.class);

 job.setMapOutputKeyClass(Text.class);
 job.setMapOutputValueClass(Text.class);

 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(Text.class);

 FileInputFormat.addInputPath(job, new
 org.apache.hadoop.fs.Path(args[0]));

 final org.apache.hadoop.fs.Path output = new org.a




Re: Cannot use env variables in hodrc

2013-02-08 Thread Hemanth Yamijala
Hi,

Hadoop On Demand is no longer supported with recent releases of Hadoop.
There is no separate user list for HOD related questions.

Which version of Hadoop are you using right now ?

Thanks
hemanth


On Wed, Feb 6, 2013 at 8:59 PM, Mehmet Belgin
mehmet.bel...@oit.gatech.eduwrote:

 Hello again,

 Considering that I have not received any replies, I was wondering whether
 this is not the correct list for hod questions? Please let me know if I
 should better direct this question to another list, a hod-specific one
 perhaps?

 Thank you!



 
  On a related note, env-vars is also being ignored:
 
  env-vars=
 HOD_PYTHON_HOME=/usr/local/packages/python/2.5.1/bin/python2.5
 
  And hod picks the system-default python and terminates with errors
 unless I manually export HOD_PYTHON_HOME.
 
  export HOD_PYTHON_HOME=`which python2.5`
 
  I am also having problems having hod use the cluster I created, but I
 assume those issues are also related.
 
  How can I make sure that hodrc contents are passed correctly into hod?
 
  Thanks a lot in advance!
 
 
  On Feb 5, 2013, at 4:41 PM, Mehmet Belgin wrote:
 
  Hello everyone,
 
  I am setting up Hadoop for the first time, so please bear with me while
 I ask all these beginner questions :)
 
  I followed the instructions to create a hodrc, but looks like I cannot
 user env variables in this file:
 
  error: bin/hod failed to start.
  error: invalid 'java-home' specified in section hod (--hod.java-home):
 ${JAVA_HOME}
  error: invalid 'batch-home' specified in section resource_manager
 (--resource_manager.batch-home): ${RM_HOME}
 
  ... despite the fact that I have  ${JAVA_HOME} and ${RM_HOME}
 correctly defined in my environment. When I replace these variables with
 full explicit paths, it works. I checked the permissions, and everything
 else looks fine.
 
  What am I missing here?
 
  Thanks!



Re: How to find Blacklisted Nodes via cli.

2013-01-30 Thread Hemanth Yamijala
Hi,

Part answer: you can get the blacklisted tasktrackers using the command
line:

mapred job -list-blacklisted-trackers.

Also, I think that a blacklisted tasktracker becomes 'unblacklisted' if it
works fine after some time. Though I am not very sure about this.

Thanks
hemanth


On Wed, Jan 30, 2013 at 9:35 PM, Dhanasekaran Anbalagan
bugcy...@gmail.comwrote:

 Hi Guys,

 How to find Blacklisted Nodes via, command line. I want to see job
 Tracker Blacklisted Nodes and hdfs Blacklisted Nodes.

 And also, how do I clear blacklisted nodes for a clean start? Is the only option
 to restart the service, or is there some other way to clear the blacklisted nodes?

 please guide me.

 -Dhanasekaran.

 Did I learn something today? If not, I wasted it.



Re: Filesystem closed exception

2013-01-30 Thread Hemanth Yamijala
FS Caching is enabled on the cluster (i.e. the default is not changed).

Our code isn't actually mapper code, but a standalone java program being
run as part of Oozie. It just seemed confusing and not a very clear
strategy to leave unclosed resources. Hence my suggestion to get an
uncached FS handle for this use case alone. Note, I am not suggesting to
disable FS caching in general.

Thanks
Hemanth
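
As a small sketch of the second option from the original mail, the standalone code could disable the cache for its own Configuration and then close the handle safely (the path used is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UncachedFsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask for a non-cached FileSystem so closing it cannot invalidate the
    // shared, cached instance used elsewhere in the same JVM.
    conf.setBoolean("fs.hdfs.impl.disable.cache", true);
    FileSystem fs = FileSystem.get(conf);
    try {
      fs.mkdirs(new Path("/tmp/example"));   // placeholder file system work
    } finally {
      fs.close();   // safe here: this handle is private to this code
    }
  }
}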


On Thu, Jan 31, 2013 at 12:19 AM, Alejandro Abdelnur t...@cloudera.comwrote:

 Hemanth,

 Is FS caching enabled or not in your cluster?

 A simple solution would be to modify your mapper code not to close the FS.
 It will go away when the task ends anyway.

 Thx


 On Thu, Jan 24, 2013 at 5:26 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hi,

 We are noticing a problem where we get a filesystem closed exception when
 a map task is done and is finishing execution. By map task, I literally
 mean the MapTask class of the map reduce code. Debugging this we found that
 the mapper is getting a handle to the filesystem object and itself calling
 a close on it. Because filesystem objects are cached, I believe the
 behaviour is as expected in terms of the exception.

 I just wanted to confirm that:

 - if we do have a requirement to use a filesystem object in a mapper or
 reducer, we should either not close it ourselves
 - or (seems better to me) ask for a new version of the filesystem
 instance by setting the fs.hdfs.impl.disable.cache property to true in job
 configuration.

 Also, does anyone know if this behaviour was any different in Hadoop 0.20
 ?

 For some context, this behaviour is actually seen in Oozie, which runs a
 launcher mapper for a simple java action. Hence, the java action could very
 well interact with a file system. I know this is probably better addressed
 in Oozie context, but wanted to get the map reduce view of things.


 Thanks,
 Hemanth




 --
 Alejandro



Re: Filesystem closed exception

2013-01-25 Thread Hemanth Yamijala
Thanks, Harsh. Particularly for pointing out HADOOP-7973.


On Fri, Jan 25, 2013 at 11:51 AM, Harsh J ha...@cloudera.com wrote:

 It is pretty much the same in 0.20.x as well, IIRC. Your two points
 are also correct (for a fix to this). Also see:
 https://issues.apache.org/jira/browse/HADOOP-7973.

 On Fri, Jan 25, 2013 at 6:56 AM, Hemanth Yamijala
 yhema...@thoughtworks.com wrote:
  Hi,
 
  We are noticing a problem where we get a filesystem closed exception
 when a
  map task is done and is finishing execution. By map task, I literally
 mean
  the MapTask class of the map reduce code. Debugging this we found that
 the
  mapper is getting a handle to the filesystem object and itself calling a
  close on it. Because filesystem objects are cached, I believe the
 behaviour
  is as expected in terms of the exception.
 
  I just wanted to confirm that:
 
  - if we do have a requirement to use a filesystem object in a mapper or
  reducer, we should either not close it ourselves
  - or (seems better to me) ask for a new version of the filesystem
 instance
  by setting the fs.hdfs.impl.disable.cache property to true in job
  configuration.
 
  Also, does anyone know if this behaviour was any different in Hadoop
 0.20 ?
 
  For some context, this behaviour is actually seen in Oozie, which runs a
  launcher mapper for a simple java action. Hence, the java action could
 very
  well interact with a file system. I know this is probably better
 addressed
  in Oozie context, but wanted to get the map reduce view of things.
 
 
  Thanks,
  Hemanth



 --
 Harsh J



Re: mappers-node relationship

2013-01-25 Thread Hemanth Yamijala
This may beof some use, about how maps are decided:

http://wiki.apache.org/hadoop/HowManyMapsAndReduces

Thanks
Hemanth

On Friday, January 25, 2013, jamal sasha wrote:

 Hi.
   A very very lame question.
 Does the number of mappers depend on the number of nodes I have?
 How I imagine map-reduce is this.
 For example in word count example
 I have bunch of slave nodes.
 The documents are distributed across these slave nodes.
 Now depending on how big the data is, it will spread across the slave
 nodes.. and that is how the number of mappers is decided.
 I am sure, this is wrong understanding. As in pseudo-distributed node, i
 can see multiple mappers.
 So the question is.. how does a single node machine run multiple mappers? Are
 they run in parallel or sequentially?
 Any resources to learn this?
 Thanks



Re: TT nodes distributed cache failure

2013-01-25 Thread Hemanth Yamijala
Could you post the stack trace from the job logs. Also looking at the task
tracker logs on the failed nodes may help.

Thanks
Hemanth

On Friday, January 25, 2013, Terry Healy wrote:

 Running hadoop-0.20.2 on a 20 node cluster.

 When running a Map/Reduce job that uses several .jars loaded into the
 Distributed cache, several (~4) nodes have their map jobs fails because
 of ClassNotFoundException. All the other nodes proceed through the job
 normally and the jobs completes. But this is wasting 20-25% of my TT nodes.

 Can anyone explain why some nodes might fail to read all the .jars from
 the Distributed cache?

 Thanks



Filesystem closed exception

2013-01-24 Thread Hemanth Yamijala
Hi,

We are noticing a problem where we get a filesystem closed exception when a
map task is done and is finishing execution. By map task, I literally mean
the MapTask class of the map reduce code. Debugging this we found that the
mapper is getting a handle to the filesystem object and itself calling a
close on it. Because filesystem objects are cached, I believe the behaviour
is as expected in terms of the exception.

I just wanted to confirm that:

- if we do have a requirement to use a filesystem object in a mapper or
reducer, we should either not close it ourselves
- or (seems better to me) ask for a new version of the filesystem instance
by setting the fs.hdfs.impl.disable.cache property to true in job
configuration.

Also, does anyone know if this behaviour was any different in Hadoop 0.20 ?

For some context, this behaviour is actually seen in Oozie, which runs a
launcher mapper for a simple java action. Hence, the java action could very
well interact with a file system. I know this is probably better addressed
in Oozie context, but wanted to get the map reduce view of things.


Thanks,
Hemanth


Re: Where do/should .jar files live?

2013-01-22 Thread Hemanth Yamijala
On top of what Bejoy said, just wanted to add that when you submit a job to
Hadoop using the hadoop jar command, the jars which you reference in the
command on the edge/client node will be picked up by Hadoop and made
available to the cluster nodes where the mappers and reducers run.

Thanks
Hemanth


On Wed, Jan 23, 2013 at 8:24 AM, bejoy.had...@gmail.com wrote:

 **
 Hi Chris

 In larger clusters it is better to have an edge/client node where all the
 user jars reside and you trigger your MR jobs from here.

 A client/edge node is a server with hadoop jars and conf but hosting no
 daemons.

 In smaller clusters one DN might act as the client node and you can
 execute your jars from there. Here you have a risk of that DN getting
 filled if the files are copied to hdfs from this DN (as per block placement
 policy one replica would always be on this node)


 In oozie you put your executables into hdfs . But oozie comes at an
 integration level. In initial development phase, developers put jar into
 the LFS on client node, execute and test their code.
 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 *From: * Chris Embree cemb...@gmail.com
 *Date: *Tue, 22 Jan 2013 14:24:40 -0500
 *To: *user@hadoop.apache.org
 *ReplyTo: * user@hadoop.apache.org
 *Subject: *Where do/should .jar files live?

 Hi List,

 This should be a simple question, I think.  Disclosure, I am not a java
 developer. ;)

 We're getting ready to build our Dev and Prod clusters. I'm pretty
 comfortable with HDFS and how it sits atop several local file systems on
 multiple servers.  I'm fairly comfortable with the concept of Map/Reduce
 and why it's cool and we want it.

 Now for the question.  Where should my developers, put and store their jar
 files?  Or asked another way, what's the best entry point for submitting
 jobs?

 We have separate physical systems for NN, Checkpoint Node (formerly 2nn),
 Job Tracker and Standby NN.  Should I run from the JT node? Do I keep all
 of my finished .jar's on the JT local file system?
 Or should I expect that jobs will be run via Oozie?  Do I put jars on the
 local Oozie FS?

 Thanks in advance.
 Chris



Re: passing arguments to hadoop job

2013-01-21 Thread Hemanth Yamijala
Hi,

Please note that you are referring to a very old version of Hadoop. the
current stable release is Hadoop 1.x. The API has changed in 1.x. Take a
look at the wordcount example here:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Example%3A+WordCount+v2.0


But, in principle your method should work. I wrote it using the new API in
a similar fashion and it worked fine. Can you show the code of your driver
program (i.e. where you have main) ?

Thanks
hemanth
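
For reference, a minimal sketch of the same idea against the new (org.apache.hadoop.mapreduce) API: the driver would call job.getConfiguration().setInt("basecount", 20) and the reducer reads it back in setup(). Note that if the same class is also registered as the combiner, the base value can end up being added more than once per key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BaseSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private int baseSum;

  @Override
  protected void setup(Context context) {
    // Read the job-level parameter set by the driver; default to 0 if absent.
    baseSum = context.getConfiguration().getInt("basecount", 0);
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = baseSum;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}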



On Tue, Jan 22, 2013 at 5:22 AM, jamal sasha jamalsha...@gmail.com wrote:

 Hi,
   Lets say I have the standard helloworld program

 http://hadoop.apache.org/docs/r0.17.0/mapred_tutorial.html#Example%3A+WordCount+v2.0

 Now, lets say, I want to start the counting not from zero but from 20.
 So my reference line is 20.

 I modified the Reduce code as following:
  public static class Reduce extends MapReduceBase implements ReducerText,
 IntWritable, Text, IntWritable {
  *private static int baseSum ;*
 *  public void configure(JobConf job){*
 *  baseSum = Integer.parseInt(job.get(basecount));*
 *  *
 *  }*
public void reduce(Text key, IteratorIntWritable values,
 OutputCollectorText, IntWritable output, Reporter reporter) throws
 IOException {
  int sum =* baseSum*;
 while (values.hasNext()) {
   sum += values.next().get();
  }
 output.collect(key, new IntWritable(sum));
   }
  }


 And in main added:
conf.setInt(basecount,20);



 So my hope was this should have done the trick..
 But its not working. the code is running normally :(
 How do i resolve this?
 Thanks



Re: passing arguments to hadoop job

2013-01-21 Thread Hemanth Yamijala
OK. The easiest way I can think of for debugging this is to add a
System.out.println in your Reduce.configure code. The output will come in
the logs specific to your reduce tasks. You can access these logs from the
web ui of the jobtracker. Navigate to your job page from the Jobtracker UI
 reduce   select any task  click on the task log links. Look under
'stdout'.

Thanks
Hemanth


On Tue, Jan 22, 2013 at 11:19 AM, jamal sasha jamalsha...@gmail.com wrote:

 Hi,
   The driver code is actually the same as of java word count old example:
 copying from site
 public static void main(String[] args) throws Exception {
 JobConf conf = new JobConf(WordCount.class);
  conf.setJobName(wordcount);

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  * conf.setInt(basecount,20); // added this line*
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}


 Reducer class
  public static class Reduce extends MapReduceBase implements ReducerText,
 IntWritable, Text, IntWritable {
  *private static int baseSum ;*
 *  public void configure(JobConf job){*
 *  baseSum = Integer.parseInt(job.get(basecount));*
 *  *
 *  }*
   public void reduce(Text key, IteratorIntWritable values,
 OutputCollectorText, IntWritable output, Reporter reporter) throws
 IOException {
  int sum =* baseSum*;
 while (values.hasNext()) {
sum += values.next().get();
 }
  output.collect(key, new IntWritable(sum));
   }
  }

 On Mon, Jan 21, 2013 at 8:29 PM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:
 
  Hi,
 
  Please note that you are referring to a very old version of Hadoop. the
 current stable release is Hadoop 1.x. The API has changed in 1.x. Take a
 look at the wordcount example here:
 http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Example%3A+WordCount+v2.0
 
  But, in principle your method should work. I wrote it using the new API
 in a similar fashion and it worked fine. Can you show the code of your
 driver program (i.e. where you have main) ?
 
  Thanks
  hemanth
 
 
 
  On Tue, Jan 22, 2013 at 5:22 AM, jamal sasha jamalsha...@gmail.com
 wrote:
 
  Hi,
Lets say I have the standard helloworld program
 
 http://hadoop.apache.org/docs/r0.17.0/mapred_tutorial.html#Example%3A+WordCount+v2.0
 
  Now, lets say, I want to start the counting not from zero but from
 20.
  So my reference line is 20.
 
  I modified the Reduce code as following:
   public static class Reduce extends MapReduceBase implements
 ReducerText, IntWritable, Text, IntWritable {
   private static int baseSum ;
   public void configure(JobConf job){
   baseSum = Integer.parseInt(job.get(basecount));
 
   }
public void reduce(Text key, IteratorIntWritable values,
 OutputCollectorText, IntWritable output, Reporter reporter) throws
 IOException {
  int sum = baseSum;
  while (values.hasNext()) {
sum += values.next().get();
  }
  output.collect(key, new IntWritable(sum));
}
  }
 
 
  And in main added:
 conf.setInt(basecount,20);
 
 
 
  So my hope was this should have done the trick..
  But its not working. the code is running normally :(
  How do i resolve this?
  Thanks
 
 



Re: How to unit test mappers reading data from DistributedCache?

2013-01-17 Thread Hemanth Yamijala
Hi,

Not sure how to do it using MRUnit, but should be possible to do this using
a mocking framework like Mockito or EasyMock. In a mapper (or reducer),
you'd use the Context classes to get the DistributedCache files. By mocking
these to return what you want, you could potentially run a true unit test.

Thanks
Hemanth
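
One way to make this unit-testable is to route the cache lookup through an overridable hook, so a test can substitute a plain local file; the class and file names below are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
  protected Path dictionary;

  // Production path: resolve the localized cache files from the job configuration.
  protected Path[] getCacheFiles(Context context) throws IOException {
    return DistributedCache.getLocalCacheFiles(context.getConfiguration());
  }

  @Override
  protected void setup(Context context) throws IOException {
    Path[] cached = getCacheFiles(context);
    dictionary = (cached != null && cached.length > 0) ? cached[0] : null;
  }
}

// Test-only subclass (or use Mockito's spy()/when() to stub the same hook):
class TestableLookupMapper extends LookupMapper {
  @Override
  protected Path[] getCacheFiles(Context context) {
    return new Path[] { new Path("src/test/resources/dictionary.txt") };
  }
}

MRUnit's MapDriver can then be pointed at TestableLookupMapper instead of the real class.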


On Fri, Jan 18, 2013 at 1:37 AM, Barak Yaish barak.ya...@gmail.com wrote:

 Hi,

 I've found MRUnit a very easy to unit test jobs, is it possible as well to
 test mappers reading data from DisributedCache? If yes, can you share an
 example how the test' setup() should look like?

 Thanks.



Re: tcp error

2013-01-16 Thread Hemanth Yamijala
Coincidentally, I faced the same issue just now.

In my case, it turned out that I was running Hadoop daemons in
pseudo-distributed mode, and in between a machine suspend and restart, the
network configuration changed. The logs link was referring to the older IP
address in use in the URL and thus failed when I tried to open it.
Restarting the daemons helped.

I don't think this problem will come in a normal up-and-running production
cluster.

Thanks
hemanth


On Thu, Jan 17, 2013 at 9:48 AM, Hemanth Yamijala yhema...@thoughtworks.com
 wrote:

 At the place where you get the error, can you cross check what the URL is
 that is being accessed ? Also, can you compare it with the URL with pages
 before this that work ?

 Thanks
 hemanth


 On Thu, Jan 17, 2013 at 1:08 AM, jamal sasha jamalsha...@gmail.comwrote:

 I am inside a network where I need proxy settings to access the internet.
 I have a weird problem.

 The internet is working fine.
 But it is one particular instance when i get this error:

 Network Error (tcp_error)

 A communication error occurred: Operation timed out
 The Web Server may be down, too busy, or experiencing other problems
 preventing it from responding to requests. You may wish to try again at a
 later time.

 For assistance, contact your network support team.

 This happens when I use hadoop in local mode.
 I can access the UI interface. I can see the jobs running. but when I try
 to see the logs of each task.. i am not able to access those logs.

 UI-- job--map-- task-- all -- this is where the error is..

 Any clues?
 THanks





Re: Biggest cluster running YARN in the world?

2013-01-15 Thread Hemanth Yamijala
You may get more updated information from folks at Yahoo!, but here is a
mail on hadoop-general mailing list that has some statistics:

http://www.mail-archive.com/general@hadoop.apache.org/msg05592.html

Please note it is a little dated, so things should be better now :-)

Thank
hemanth


On Tue, Jan 15, 2013 at 7:26 AM, Tan, Wangda wangda@emc.com wrote:

 Hi guys,
 I've a question in my head for a long time, what's the biggest cluster
 running YARN? I just heard some rumor about some biggest cluster running
 map-reduce 1.0 with 10,000+ nodes, but rarely heard about such rumor
 about YARN.
 Welcome any message about this, like inside information or rumor :-p.
 --
 Thanks,
 Wangda




Re: Compile error using contrib.utils.join package with new mapreduce API

2013-01-15 Thread Hemanth Yamijala
On the dev mailing list, Harsh pointed out that there is another join
related package:
http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/join/

This seems to be available in 2.x and trunk. Could you check if this
provides functionality you require - so we at least know there is new API
support in later versions ?

Thanks
Hemanth


On Mon, Jan 14, 2013 at 7:45 PM, Hemanth Yamijala yhema...@thoughtworks.com
 wrote:

 Hi,

 No. I didn't find any reference to a working sample. I also didn't find
 any JIRA that asks for a migration of this package to the new API. Not sure
 why. I have asked on the dev list.

 Thanks
 hemanth


 On Mon, Jan 14, 2013 at 6:25 PM, Michael Forage 
 michael.for...@livenation.co.uk wrote:

  Thanks Hemanth

 ** **

 I appreciate your response

 Did you find any working example of it in use? It looks to me like I’d
 still be tied to the old API

 Thanks

 Mike

 ** **

 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* 14 January 2013 05:08
 *To:* user@hadoop.apache.org
 *Subject:* Re: Compile error using contrib.utils.join package with new
 mapreduce API

 ** **

 Hi,

 ** **

 The datajoin package has a class called DataJoinJob (
 http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/contrib/utils/join/DataJoinJob.html
 )

 ** **

 I think using this will help you get around the issue you are facing.

 ** **

 From the source, this is the command line usage of the class:

 ** **

 usage: DataJoinJob inputdirs outputdir map_input_file_format  numofParts
 mapper_class reducer_class map_output_value_class output_value_class
 [maxNumOfValuesPerGroup [descriptionOfJob]]]

 ** **

 Internally the class uses the old API to set the mapper and reducer
 passed as arguments above.

 ** **

 Thanks

 hemanth

 ** **

 ** **

 ** **

 On Fri, Jan 11, 2013 at 9:00 PM, Michael Forage 
 michael.for...@livenation.co.uk wrote:

 Hi

  

 I’m using Hadoop 1.0.4 and using the hadoop.mapreduce API having problems
 compiling a simple class to implement a reduce-side data join of 2 files.
 

 I’m trying to do this using contrib.utils.join and in Eclipse it all
 compiles fine other than:

  

 job.*setMapperClass*(MapClass.*class*);

   job.*setReducerClass*(Reduce.*class*);

  

 …which both complain that the referenced class no longer extends either
 Mapper or Reducer

  It’s my understanding that they should instead extend DataJoinMapperBase
  and DataJoinReducerBase in order to do this.

  

 Have searched for a solution everywhere  but unfortunately, all the
 examples I can find are based on the deprecated mapred API.

 Assuming this package actually works with the new API, can anyone offer
 any advice?

  

 Complete compile errors:

  

 The method setMapperClass(Class? extends Mapper) in the type Job is not
 applicable for the arguments (ClassDataJoin.MapClass)

 The method setReducerClass(Class? extends Reducer) in the type Job is
 not applicable for the arguments (ClassDataJoin.Reduce)

  

 …and the code…

  

 *package* JoinTest;

  

 *import* java.io.DataInput;

 *import* java.io.DataOutput;

 *import* java.io.IOException;

 *import* java.util.Iterator;

  

 *import* org.apache.hadoop.conf.Configuration;

 *import* org.apache.hadoop.conf.Configured;

 *import* org.apache.hadoop.fs.Path;

 *import* org.apache.hadoop.io.LongWritable;

 *import* org.apache.hadoop.io.Text;

 *import* org.apache.hadoop.io.Writable;

 *import* org.apache.hadoop.mapreduce.Job;

 *import* org.apache.hadoop.mapreduce.Mapper;

 *import* org.apache.hadoop.mapreduce.Reducer;

 *import* org.apache.hadoop.mapreduce.Mapper.Context;

 *import* org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 *import* org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

 *import* org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 *import* org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

 *import* org.apache.hadoop.util.Tool;

 *import* org.apache.hadoop.util.ToolRunner;

  

 *import* org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;

 *import* org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;

 *import* org.apache.hadoop.contrib.utils.join.TaggedMapOutput;

  

 *public* *class* DataJoin *extends* Configured *implements* Tool {

 

   *public* *static* *class* MapClass *extends* DataJoinMapperBase {

 

 *protected* Text generateInputTag(String inputFile) {

 String datasource = inputFile.split(-)[0];

 *return* *new* Text(datasource);

 }

 

 *protected* Text generateGroupKey

Re: FileSystem.workingDir vs mapred.local.dir

2013-01-15 Thread Hemanth Yamijala
Hi,

AFAIK, the mapred.local.dir property refers to a set of directories under
which different types of data related to mapreduce jobs are stored - for
e.g. intermediate data, localized files for a job etc. The working
directory for a mapreduce job is configured under a sub directory within
one of the directories configured in this property.

The workingDir property of the FileSystem class simply seems to indicate
the working directory for a given filesystem as set by applications.

They don't seem very related per se, unless I am missing something ?

Thanks
Hemanth
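
A tiny sketch to see the two values side by side (run with whatever configuration is on the classpath; the output will obviously vary by cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class WorkingDirVsLocalDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Per-FileSystem notion: where relative paths are resolved for this client.
    System.out.println("FileSystem working dir : " + fs.getWorkingDirectory());
    // MapReduce framework notion: local disks used for intermediate/localized job data.
    System.out.println("mapred.local.dir       : " + conf.get("mapred.local.dir"));
  }
}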

On Tue, Jan 15, 2013 at 2:54 AM, Jay Vyas jayunit...@gmail.com wrote:

 Hi guys:  What is the relationship between the working directory in the
 FileSystem class (FileSystem.workingDir) and the
 mapred.local.dir property?

 It seems like these would essentially refer to the same thing?
 --
 Jay Vyas
 http://jayunit100.blogspot.com



Re: config file loactions in Hadoop 2.0.2

2013-01-15 Thread Hemanth Yamijala
Hi,

One place where I could find the capacity-scheduler.xml was from source -
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/resources.

AFAIK, the masters file is only used for starting the secondary namenode -
which has in 2.x been replaced by a proper HA solution. So, I think there
is no need for this file anymore. Please refer to this link for more
details on the HA solution:
http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailability.html

Thanks
hemanth


On Wed, Jan 16, 2013 at 3:15 AM, Panshul Whisper ouchwhis...@gmail.comwrote:

 Hello,

 I was setting up Hadoop 2.0.2.
 Since I have already setup Hadoop 1.0.x so i have a fair idea where are
 all the files and what to put in each of them.
 In case of Hadoop 2.0.2, the config files have been moved to [hadoop
 install directory]/etc/hadoop
 but I still cannot find
 capacity-scheduler.xml
 masters  - for listing master nodes

 Please help me setup this version.

 Thanking You,

 --
 Regards,
 Ouch Whisper
 010101010101



Re: Compile error using contrib.utils.join package with new mapreduce API

2013-01-14 Thread Hemanth Yamijala
Hi,

No. I didn't find any reference to a working sample. I also didn't find any
JIRA that asks for a migration of this package to the new API. Not sure
why. I have asked on the dev list.

Thanks
hemanth


On Mon, Jan 14, 2013 at 6:25 PM, Michael Forage 
michael.for...@livenation.co.uk wrote:

  Thanks Hemanth

 ** **

 I appreciate your response

 Did you find any working example of it in use? It looks to me like I’d
 still be tied to the old API

 Thanks

 Mike

 ** **

 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* 14 January 2013 05:08
 *To:* user@hadoop.apache.org
 *Subject:* Re: Compile error using contrib.utils.join package with new
 mapreduce API

 ** **

 Hi,

 ** **

 The datajoin package has a class called DataJoinJob (
 http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/contrib/utils/join/DataJoinJob.html
 )

 ** **

 I think using this will help you get around the issue you are facing.

 ** **

 From the source, this is the command line usage of the class:

 ** **

 usage: DataJoinJob inputdirs outputdir map_input_file_format  numofParts
 mapper_class reducer_class map_output_value_class output_value_class
 [maxNumOfValuesPerGroup [descriptionOfJob]]]

 ** **

 Internally the class uses the old API to set the mapper and reducer passed
 as arguments above.

 ** **

 Thanks

 hemanth

 ** **

 ** **

 ** **

 On Fri, Jan 11, 2013 at 9:00 PM, Michael Forage 
 michael.for...@livenation.co.uk wrote:

 Hi

  

 I’m using Hadoop 1.0.4 and using the hadoop.mapreduce API having problems
  compiling a simple class to implement a reduce-side data join of 2 files.

 I’m trying to do this using contrib.utils.join and in Eclipse it all
 compiles fine other than:

  

 job.*setMapperClass*(MapClass.*class*);

   job.*setReducerClass*(Reduce.*class*);

  

 …which both complain that the referenced class no longer extends either
 Mapper or Reducer

  It’s my understanding that they should instead extend DataJoinMapperBase
  and DataJoinReducerBase in order to do this.

  

 Have searched for a solution everywhere  but unfortunately, all the
 examples I can find are based on the deprecated mapred API.

 Assuming this package actually works with the new API, can anyone offer
 any advice?

  

 Complete compile errors:

  

 The method setMapperClass(Class? extends Mapper) in the type Job is not
 applicable for the arguments (ClassDataJoin.MapClass)

 The method setReducerClass(Class? extends Reducer) in the type Job is
 not applicable for the arguments (ClassDataJoin.Reduce)

  

 …and the code…

  

 *package* JoinTest;

  

 *import* java.io.DataInput;

 *import* java.io.DataOutput;

 *import* java.io.IOException;

 *import* java.util.Iterator;

  

 *import* org.apache.hadoop.conf.Configuration;

 *import* org.apache.hadoop.conf.Configured;

 *import* org.apache.hadoop.fs.Path;

 *import* org.apache.hadoop.io.LongWritable;

 *import* org.apache.hadoop.io.Text;

 *import* org.apache.hadoop.io.Writable;

 *import* org.apache.hadoop.mapreduce.Job;

 *import* org.apache.hadoop.mapreduce.Mapper;

 *import* org.apache.hadoop.mapreduce.Reducer;

 *import* org.apache.hadoop.mapreduce.Mapper.Context;

 *import* org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 *import* org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

 *import* org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 *import* org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

 *import* org.apache.hadoop.util.Tool;

 *import* org.apache.hadoop.util.ToolRunner;

  

 *import* org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;

 *import* org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;

 *import* org.apache.hadoop.contrib.utils.join.TaggedMapOutput;

  

 *public* *class* DataJoin *extends* Configured *implements* Tool {

 

   *public* *static* *class* MapClass *extends* DataJoinMapperBase {

 

 *protected* Text generateInputTag(String inputFile) {

 String datasource = inputFile.split(-)[0];

 *return* *new* Text(datasource);

 }

 

 *protected* Text generateGroupKey(TaggedMapOutput aRecord) {

 String line = ((Text) aRecord.getData()).toString();

 String[] tokens = line.split(,);

 String groupKey = tokens[0];

 *return* *new* Text(groupKey);

 }

 

 *protected* TaggedMapOutput generateTaggedMapOutput(Object value)
 {

 TaggedWritable retv = *new* TaggedWritable((Text) value);

 retv.setTag(*this*.inputTag);

 *return* retv

Re: log server for hadoop MR jobs??

2013-01-13 Thread Hemanth Yamijala
To add to that, log aggregation is a feature available with Hadoop 2.0
(where mapreduce is re-written to YARN). The functionality is available via
the History Server:
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html

Thanks
hemanth




On Sat, Jan 12, 2013 at 12:08 AM, shashwat shriparv 
dwivedishash...@gmail.com wrote:

 Have a look on flume..


 On Fri, Jan 11, 2013 at 11:58 PM, Xiaowei Li sell...@gmail.com wrote:

 ct all log generated from






 ∞
 Shashwat Shriparv




Re: JobCache directory cleanup

2013-01-11 Thread Hemanth Yamijala
Hmm. Unfortunately, there is another config variable that may be affecting
this: keep.task.files.pattern

This is set to .* in the job.xml file you sent. I suspect this may be
causing a problem. Can you please remove this, assuming you have not set it
intentionally ?

Thanks
Hemanth



On Fri, Jan 11, 2013 at 3:28 PM, Ivan Tretyakov itretya...@griddynamics.com
 wrote:

 Thanks for replies!

 keep.failed.task.files set to false.
 Config of one of the jobs attached.


 On Fri, Jan 11, 2013 at 5:44 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Good point. Forgot that one :-)


 On Thu, Jan 10, 2013 at 10:53 PM, Vinod Kumar Vavilapalli 
 vino...@hortonworks.com wrote:



 Can you check the job configuration for these ~100 jobs? Do they have
 keep.failed.task.files set to true? If so, these files won't be deleted. If
 it doesn't, it could be a bug.

 Sharing your configs for these jobs will definitely help.

 Thanks,
 +Vinod


 On Wed, Jan 9, 2013 at 6:41 AM, Ivan Tretyakov 
 itretya...@griddynamics.com wrote:

 Hello!

 I've found that jobcache directory became very large on our cluster,
 e.g.:

 # du -sh /data?/mapred/local/taskTracker/user/jobcache
 465G/data1/mapred/local/taskTracker/user/jobcache
 464G/data2/mapred/local/taskTracker/user/jobcache
 454G/data3/mapred/local/taskTracker/user/jobcache

 And it stores information for about 100 jobs:

 # ls -1 /data?/mapred/local/taskTracker/persona/jobcache/  | sort |
 uniq | wc -l





 --
 Best Regards
 Ivan Tretyakov

 Deployment Engineer
 Grid Dynamics
 +7 812 640 38 76
 Skype: ivan.tretyakov
 www.griddynamics.com
 itretya...@griddynamics.com



Re: queues in haddop

2013-01-11 Thread Hemanth Yamijala
Queues in the capacity scheduler are logical data structures into which
MapReduce jobs are placed to be picked up by the JobTracker / Scheduler
framework, according to some capacity constraints that can be defined for a
queue.

So, given your use case, I don't think the Capacity Scheduler is going to
directly help you (since you only spoke about data-in, and not processing).

So yes, something like Flume or Scribe would be a better fit here.

Thanks
Hemanth

On Fri, Jan 11, 2013 at 11:34 AM, Harsh J ha...@cloudera.com wrote:

 Your question in unclear: HDFS has no queues for ingesting data (it is
 a simple, distributed FileSystem). The Hadoop M/R and Hadoop YARN
 components have queues for processing data purposes.

 On Fri, Jan 11, 2013 at 8:42 AM, Panshul Whisper ouchwhis...@gmail.com
 wrote:
  Hello,
 
  I have a hadoop cluster setup of 10 nodes and I an in need of
 implementing
  queues in the cluster for receiving high volumes of data.
  Please suggest what will be more efficient to use in the case of
 receiving
  24 Million Json files.. approx 5 KB each in every 24 hours :
  1. Using Capacity Scheduler
  2. Implementing RabbitMQ and receive data from them using Spring
 Integration
  Data pipe lines.
 
  I cannot afford to loose any of the JSON files received.
 
  Thanking You,
 
  --
  Regards,
  Ouch Whisper
  010101010101



 --
 Harsh J



Re: JobCache directory cleanup

2013-01-10 Thread Hemanth Yamijala
Hi,

On Thu, Jan 10, 2013 at 5:17 PM, Ivan Tretyakov itretya...@griddynamics.com
 wrote:

 Thanks for replies!

 Hemanth,
 I could see following exception in TaskTracker log:
 https://issues.apache.org/jira/browse/MAPREDUCE-5
 But I'm not sure if it is related to this issue.

  Now, when a job completes, the directories under the jobCache must get
 automatically cleaned up. However it doesn't look like this is happening in
 your case.

 So, If I've no running jobs, jobcache directory should be empty. Is it
 correct?


That is correct. I just verified it with my Hadoop 1.0.2 version

Thanks
Hemanth




 On Thu, Jan 10, 2013 at 8:18 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hi,

 The directory name you have provided is 
 /data?/mapred/local/taskTracker/persona/jobcache/.
 This directory is used by the TaskTracker (slave) daemons to localize job
 files when the tasks are run on the slaves.

 Hence, I don't think this is related to the parameter 
 mapreduce.jobtracker.retiredjobs.cache.size, which is a parameter
 related to the jobtracker process.

 Now, when a job completes, the directories under the jobCache must get
 automatically cleaned up. However it doesn't look like this is happening in
 your case.

 Could you please look at the logs of the tasktracker machine where this
 has gotten filled up to see if there are any errors that could give clues ?
 Also, since this is a CDH release, it could be a problem specific to that
 - and maybe reaching out on the CDH mailing lists will also help

 Thanks
 hemanth

 On Wed, Jan 9, 2013 at 8:11 PM, Ivan Tretyakov 
 itretya...@griddynamics.com wrote:

 Hello!

 I've found that jobcache directory became very large on our cluster,
 e.g.:

 # du -sh /data?/mapred/local/taskTracker/user/jobcache
 465G/data1/mapred/local/taskTracker/user/jobcache
 464G/data2/mapred/local/taskTracker/user/jobcache
 454G/data3/mapred/local/taskTracker/user/jobcache

 And it stores information for about 100 jobs:

 # ls -1 /data?/mapred/local/taskTracker/persona/jobcache/  | sort | uniq
 | wc -l

 I've found that there is following parameter:

 <property>
   <name>mapreduce.jobtracker.retiredjobs.cache.size</name>
   <value>1000</value>
   <description>The number of retired job status to keep in the cache.
   </description>
 </property>

 So, if I got it right it intended to control job cache size by limiting
 number of jobs to store cache for.

 Also, I've seen that some hadoop users uses cron approach to cleanup
 jobcache:
 http://grokbase.com/t/hadoop/common-user/102ax9bze1/cleaning-jobcache-manually
  (
 http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201002.mbox/%3c99484d561002100143s4404df98qead8f2cf687a7...@mail.gmail.com%3E
 )

 Are there other approaches to control jobcache size?
 What is more correct way to do it?

 Thanks in advance!

 P.S. We are using CDH 4.1.1.

 --
 Best Regards
 Ivan Tretyakov

 Deployment Engineer
 Grid Dynamics
 +7 812 640 38 76
 Skype: ivan.tretyakov
 www.griddynamics.com
 itretya...@griddynamics.com





 --
 Best Regards
 Ivan Tretyakov

 Deployment Engineer
 Grid Dynamics
 +7 812 640 38 76
 Skype: ivan.tretyakov
 www.griddynamics.com
 itretya...@griddynamics.com



Re: Not committing output in map reduce

2013-01-10 Thread Hemanth Yamijala
Is this the same as:
http://stackoverflow.com/questions/6137139/how-to-save-only-non-empty-reducers-output-in-hdfs?

i.e. LazyOutputFormat, etc. ?
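
A minimal sketch of that approach with the new mapreduce API (so empty part-* files are not created):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputConfig {
  public static void configure(Job job, String outputDir) {
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Wrap the real output format so a part file is only created when the
    // first record is actually written by a task.
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(outputDir));
  }
}

The _SUCCESS marker is controlled separately; if memory serves, the committer can be told not to write it by setting mapreduce.fileoutputcommitter.marksuccessfuljobs to false.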


On Thu, Jan 10, 2013 at 4:51 PM, Pratyush Chandra 
chandra.praty...@gmail.com wrote:

 Hi,

 I am using s3n as file system. I do not wish to create output folders and
 file if there is no output from RecordReader implementation. Currently it
 creates empty part* files and _SUCCESS. Is there a way to do so ?

 --
 Pratyush Chandra



Re: JobCache directory cleanup

2013-01-10 Thread Hemanth Yamijala
Good point. Forgot that one :-)


On Thu, Jan 10, 2013 at 10:53 PM, Vinod Kumar Vavilapalli 
vino...@hortonworks.com wrote:



 Can you check the job configuration for these ~100 jobs? Do they have
 keep.failed.task.files set to true? If so, these files won't be deleted. If
 it doesn't, it could be a bug.

 Sharing your configs for these jobs will definitely help.

 Thanks,
 +Vinod


 On Wed, Jan 9, 2013 at 6:41 AM, Ivan Tretyakov 
 itretya...@griddynamics.com wrote:

 Hello!

 I've found that jobcache directory became very large on our cluster, e.g.:

 # du -sh /data?/mapred/local/taskTracker/user/jobcache
 465G/data1/mapred/local/taskTracker/user/jobcache
 464G/data2/mapred/local/taskTracker/user/jobcache
 454G/data3/mapred/local/taskTracker/user/jobcache

 And it stores information for about 100 jobs:

 # ls -1 /data?/mapred/local/taskTracker/persona/jobcache/  | sort | uniq
 | wc -l




Re: JobCache directory cleanup

2013-01-09 Thread Hemanth Yamijala
Hi,

The directory name you have provided is
/data?/mapred/local/taskTracker/persona/jobcache/.
This directory is used by the TaskTracker (slave) daemons to localize job
files when the tasks are run on the slaves.

Hence, I don't think this is related to the parameter
mapreduce.jobtracker.retiredjobs.cache.size,
which is a parameter related to the jobtracker process.

Now, when a job completes, the directories under the jobCache must get
automatically cleaned up. However it doesn't look like this is happening in
your case.

Could you please look at the logs of the tasktracker machine where this has
gotten filled up to see if there are any errors that could give clues ?
Also, since this is a CDH release, it could be a problem specific to that -
and maybe reaching out on the CDH mailing lists will also help

Thanks
hemanth

On Wed, Jan 9, 2013 at 8:11 PM, Ivan Tretyakov
itretya...@griddynamics.comwrote:

 Hello!

 I've found that jobcache directory became very large on our cluster, e.g.:

 # du -sh /data?/mapred/local/taskTracker/user/jobcache
 465G/data1/mapred/local/taskTracker/user/jobcache
 464G/data2/mapred/local/taskTracker/user/jobcache
 454G/data3/mapred/local/taskTracker/user/jobcache

 And it stores information for about 100 jobs:

 # ls -1 /data?/mapred/local/taskTracker/persona/jobcache/  | sort | uniq |
 wc -l

 I've found that there is following parameter:

 <property>
   <name>mapreduce.jobtracker.retiredjobs.cache.size</name>
   <value>1000</value>
   <description>The number of retired job status to keep in the cache.
   </description>
 </property>

 So, if I got it right it intended to control job cache size by limiting
 number of jobs to store cache for.

 Also, I've seen that some hadoop users uses cron approach to cleanup
 jobcache:
 http://grokbase.com/t/hadoop/common-user/102ax9bze1/cleaning-jobcache-manually
  (
 http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201002.mbox/%3c99484d561002100143s4404df98qead8f2cf687a7...@mail.gmail.com%3E
 )

 Are there other approaches to control jobcache size?
 What is more correct way to do it?

 Thanks in advance!

 P.S. We are using CDH 4.1.1.

 --
 Best Regards
 Ivan Tretyakov

 Deployment Engineer
 Grid Dynamics
 +7 812 640 38 76
 Skype: ivan.tretyakov
 www.griddynamics.com
 itretya...@griddynamics.com



Re: Why the official Hadoop Documents are so messy?

2013-01-08 Thread Hemanth Yamijala
Hi,

I am not sure if your complaint is as much about the changing interfaces as
it is about documentation.

Please note that versions prior to 1.0 did not have stable interfaces as a
major requirement. Not by choice, but because the focus was on seemingly
more important functionality, stability, performance etc. Specifically with
respect to the shell commands you refer to, they were going through the
same evolution. From now, 1.x releases will not change these kind of public
interfaces and Apis.

I don't intend that documentation is unimportant. Just that this might be
less of an issue now, post the 1.x release. As others have mentioned, it
would be great if you can participate to improve documentation by filing or
fixing jiras.

Thanks
Hemanth

On Tuesday, January 8, 2013, javaLee wrote:

 For example, look at the documentation for the HDFS shell guide:

 In 0.17, the prefix of HDFS shell is hadoop dfs:
 http://hadoop.apache.org/docs/r0.17.2/hdfs_shell.html

 In 0.19, the prefix of HDFS shell is hadoop fs:
 http://hadoop.apache.org/docs/r0.19.1/hdfs_shell.html#lsr

 In 1.0.4,the prefix of HDFS shell is hdfs dfs:
 http://hadoop.apache.org/docs/r1.0.4/file_system_shell.html#ls


 Reading the official Hadoop documents is such a struggle.
 As an end user, I am confused...



Re: Reg: Fetching TaskAttempt Details from a RunningJob

2013-01-07 Thread Hemanth Yamijala
Hi,

In Hadoop 1.0, I don't think this information is exposed. The
TaskInProgress is an internal class and hence cannot / should not be used
from client applications. The only way out seems to be to screen scrape the
information from the Jobtracker web UI.

If you can live with completed events, then there is something called
TaskCompletionEvents that seem to provide some of this information. You
could look at JobClient.getTaskCompletionEvents.

Please note that in Hadoop 2.0 where mapreduce has been re-architected into
YARN, there are JSON APIs that seem to expose the information you require:

http://hadoop.apache.org/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html#Response_Examples

Look here for taskAttempts

Thanks
hemanth
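
A rough sketch of the completed-events route mentioned above, using the old mapred client API (the job ID string below is a placeholder):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskCompletionEvent;

public class TaskAttemptReport {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    JobClient client = new JobClient(conf);
    RunningJob job = client.getJob(JobID.forName("job_201301060000_0001"));

    // Events are fetched in pages; keep asking from the last offset.
    int from = 0;
    TaskCompletionEvent[] events;
    while ((events = job.getTaskCompletionEvents(from)).length > 0) {
      for (TaskCompletionEvent event : events) {
        System.out.println(event.getTaskAttemptId()
            + " status=" + event.getTaskStatus()
            + " tracker=" + event.getTaskTrackerHttp()
            + " runTimeMs=" + event.getTaskRunTime());
      }
      from += events.length;
    }
  }
}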


On Sun, Jan 6, 2013 at 8:30 PM, Hadoop Learner
hadooplearner1...@gmail.comwrote:

 Hi All,

   Working on a requirement of hadoop Job Monitoring. Requirement is to get
 every Task attempt details of a Running Job. Details are following :

 Task Attempt Start Time
 Task Tracker Name where Task Attempt is executing
 Errors or Exceptions in a running Task Attempt

 I have Implemented JobClient, RunningJob, JobStatus, TaskReport APIs to
 get the Task Attempt IDs( I have the task Attempt IDs for which I need
 required details). By using these APIs I was unable to get Task Attempt
 details as per my requirement mentioned above. Can you please help me with
 some ways to get required details using Java APIs.

 Any pointers will be helpful.

 (Note : I have tried using TaskInProgress API, But not sure how to get
 TaskInProgress instance from a task ID. Also noticed TaskInProgress class
 is not visible to my classes)

 Thanks and Regards,
 Shyam



Re: Differences between 'mapped' and 'mapreduce' packages

2013-01-07 Thread Hemanth Yamijala
From a user perspective, at a high level, the mapreduce package can be
thought of as having user facing client code that can be invoked, extended
etc as applicable from client programs.

The mapred package is to be treated as internal to the mapreduce system,
and shouldn't directly be used unless no alternative in the mapreduce
package is available.

Thanks
Hemanth
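
For a concrete feel of the difference, here are two equivalent skeleton mappers, one per package (identity-style, trimmed to the essentials):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old API: org.apache.hadoop.mapred (interface-based, OutputCollector/Reporter).
class OldApiMapper extends org.apache.hadoop.mapred.MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value,
      org.apache.hadoop.mapred.OutputCollector<Text, Text> output,
      org.apache.hadoop.mapred.Reporter reporter) throws IOException {
    output.collect(new Text("line"), value);
  }
}

// New API: org.apache.hadoop.mapreduce (abstract class, single Context object).
class NewApiMapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text("line"), value);
  }
}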



On Mon, Jan 7, 2013 at 11:44 PM, Oleg Zhurakousky 
oleg.zhurakou...@gmail.com wrote:

 What is the difference between the two?
 It seems like an MR job could be configured using one or the other (e.g.,
 extends MapReduceBase implements Mapper, or extends Mapper)

 Cheers
 Oleg


Re: Skipping entire task

2013-01-06 Thread Hemanth Yamijala
Hi,

Are tasks being executed multiple times due to failures? Sorry, it was not
very clear from your question.

Thanks
hemanth


On Sat, Jan 5, 2013 at 7:44 PM, David Parks davidpark...@yahoo.com wrote:

 Thinking here... if you submitted the task programmatically you should be
 able to capture the failure of the task and gracefully move past it to your
 next tasks.

 To say it in a long-winded way:  Let's say you submit a job to Hadoop, a
 java jar, and your main class implements Tool. That code has the
 responsibility to submit a series of jobs to hadoop, something like this:

 try{
   Job myJob = new MyJob(getConf());
   myJob.submitAndWait();
 }catch(Exception uhhohh){
   //Deal with the issue and move on
 }
 Job myNextJob = new MyNextJob(getConf());
 myNextJob.submit();

 Just pseudo code there to demonstrate my thought.

 David



 -Original Message-
 From: Håvard Wahl Kongsgård [mailto:haavard.kongsga...@gmail.com]
 Sent: Saturday, January 05, 2013 4:54 PM
 To: user
 Subject: Skipping entire task

 Hi, hadoop can skip bad records

 http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-c
 ode.
 But it is also possible to skip entire tasks?

 -Håvard

 --
 Håvard Wahl Kongsgård
 Faculty of Medicine 
 Department of Mathematical Sciences
 NTNU

 http://havard.security-review.net/




Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer

2012-12-30 Thread Hemanth Yamijala
If it is a small number, A seems the best way to me.

On Friday, December 28, 2012, Kshiva Kps wrote:


 Which one is current ..


 What is the preferred way to pass a small number of configuration
 parameters to a mapper or reducer?





 *A.  *As key-value pairs in the jobconf object.

 * *

 *B.  *As a custom input key-value pair passed to each mapper or reducer.

 * *

 *C.  *Using a plain text file via the Distributedcache, which each mapper
 or reducer reads.

 * *

 *D.  *Through a static variable in the MapReduce driver class (i.e., the
 class that submits the MapReduce job).



 *Answer: B*





Re: Selecting a task for the tasktracker

2012-12-27 Thread Hemanth Yamijala
Hi,

Firstly, I am talking about Hadoop 1.0. Please note that in Hadoop 2.x and
trunk, the Mapreduce framework is completely revamped to Yarn (
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
and you may need to look at different interfaces for building your own
scheduler.

In 1.0, the primary function of the TaskScheduler is the assignTasks
method. Given a TaskTracker object as input, this method figures out how
many free map and reduce slots exist in that particular tasktracker and
selects one or more task that can be scheduled on it. Since task selection
is the primary responsibility and the granularity is at a task level, the
class is called TaskScheduler.

The method of choosing a job and then a task within the job is customised
by the different schedulers already present in Hadoop. Also, the core logic
of selecting a map task with data locality optimizations is not implemented
in the schedulers per se, but they rely on the JobInProgress object in
MapReduce framework for achieving the same.

To implement your own Scheduler, it may be best to look at the sources of
existing schedulers: JobQueueTaskScheduler, CapacityTaskScheduler or
FairScheduler.  In particular, the last two are in the contrib modules of
mapreduce, and hence will be fairly independent to follow. Their build
files will also tell you how to resolve any compile problems like the one
you are facing.

Thanks
Hemanth




On Thu, Dec 27, 2012 at 4:10 PM, Yaron Gonen yaron.go...@gmail.com wrote:

 Hi,
 If I understand correctly, the job scheduler (why is the class called
 TaskScheduler?) is responsible for assigning the task whose split is as
 close as possible to the tasktracker.
 Meaning that the job scheduler is responsible for two things:

1. Selecting a job.
2. Once a job is selected, assigning the closest task to the tasktracker
that sent the heartbeat.

 Is this correct?

 I want to write my own job scheduler to change the logic above, but it
 says The type TaskScheduler is not visible.
 How can I write my own scheduler?

 thanks



Re: Sane max storage size for DN

2012-12-13 Thread Hemanth Yamijala
This is a dated blog post, so it would help if someone with current HDFS
knowledge can validate it:
http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/
.

There is a bit about the RAM required for the Namenode and how to compute
it:

You can look at the 'Namespace limitations' section.

Thanks
hemanth
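
As a quick sanity check on the rule of thumb that comes up in the quoted discussion below (roughly 1 GB of NameNode heap per million blocks, ignoring replication and the per-file/directory objects), the arithmetic for ~9 PB at a 128 MB block size looks like this:

public class NameNodeHeapEstimate {
  public static void main(String[] args) {
    double dataGb = 9.0 * 1024 * 1024;           // ~9 PB expressed in GB
    double blockMb = 128.0;                      // HDFS block size in MB
    double blocks = dataGb * 1024 / blockMb;     // ~75.5 million blocks
    double heapGb = blocks / 1000000.0;          // ~1 GB of heap per million blocks
    System.out.printf("blocks = %.0f, approx NN heap = %.0f GB%n", blocks, heapGb);
  }
}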


On Thu, Dec 13, 2012 at 10:57 AM, Mohammad Tariq donta...@gmail.com wrote:

 Hello Chris,

  Thank you so much for the valuable insights. I was actually using the
 same principle. I did the blunder and did the maths for entire (9*3)PB.

 Seems I am higher than you, that too without drinking ;)

 Many thanks.


 Regards,
 Mohammad Tariq



 On Thu, Dec 13, 2012 at 10:38 AM, Chris Embree cemb...@gmail.com wrote:

 Hi Mohammed,

 The amount of RAM on the NN is related to the number of blocks... so
 let's do some math. :)  1G of RAM to 1M blocks seems to be the general rule.

 I'll probably mess this up so someone check my math:

 9 PB ~ 9,216 TB ~ 9,437,184 GB of data.  Let's put that in 128 MB blocks:
  according to kcalc that's 75,497,472 blocks of 128 MB each.
 Unless I missed this by an order of magnitude (entirely possible... I've
 been drinking since 6), that sounds like 76G of RAM (above OS requirements).
  128G should kick its ass; 256G seems like a waste of $$.

 Hmm... That makes the NN sound extremely efficient.  Someone validate me
 or kick me to the curb.

 YMMV ;)


 On Wed, Dec 12, 2012 at 10:52 PM, Mohammad Tariq donta...@gmail.comwrote:

 Hello Michael,

   It's an array. The actual size of the data could be somewhere
 around 9PB(exclusive of replication) and we want to keep the no of DNs as
 less as possible. Computations are not too frequent, as I have specified
 earlier. If I have 500TB in 1 DN, the no of DNs would be around 49. And, if
 the block size is 128MB, the no of blocks would be 201326592. So, I was
 thinking of having 256GB RAM for the NN. Does this make sense to you?

 Many thanks.

 Regards,
 Mohammad Tariq



 On Thu, Dec 13, 2012 at 12:28 AM, Michael Segel 
 michael_se...@hotmail.com wrote:

 500 TB?

 How many nodes in the cluster? Is this attached storage or is it in an
 array?

 I mean if you have 4 nodes for a total of 2PB, what happens when you
 lose 1 node?


 On Dec 12, 2012, at 9:02 AM, Mohammad Tariq donta...@gmail.com wrote:

 Hello list,

   I don't know if this question makes any sense, but I would
 like to ask, does it make sense to store 500TB (or more) data in a single
 DN? If yes, then what should be the spec of other parameters *viz*. NN
 & DN RAM, N/W etc? If no, what could be the alternative?

 Many thanks.

 Regards,
 Mohammad Tariq









Re: attempt* directories in user logs

2012-12-10 Thread Hemanth Yamijala
However, in the case Oleg is talking about, the attempts are:
attempt_201212051224_0021_m_00_0
attempt_201212051224_0021_m_02_0
attempt_201212051224_0021_m_03_0

These aren't multiple attempts of a single task, are they ? They are
actually different tasks. If they were multiple attempts, I would expect
the last digit to get incremented, like attempt_201212051224_0021_m_00_0
and attempt_201212051224_0021_m_00_1, for instance.

It looks like at least 3 different tasks were launched on this node. One of
them could be the setup task. Oleg, how many map tasks does the Jobtracker UI
show for this job?
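
To see the difference programmatically, here is a small sketch using the
org.apache.hadoop.mapreduce API (worth verifying against your Hadoop version):

  // Parse one of the attempt directory names from the listing above:
  TaskAttemptID attempt =
      TaskAttemptID.forName("attempt_201212051224_0021_m_02_0");
  TaskID task = attempt.getTaskID(); // the task id - the name minus the trailing attempt number
  int attemptNo = attempt.getId();   // 0 here; this is the digit that increments on re-attempts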

Thanks
hemanth


On Tue, Dec 11, 2012 at 12:19 AM, Vinod Kumar Vavilapalli 
vino...@hortonworks.com wrote:


 MR launches multiple attempts for single Task in case of TaskAttempt
 failures or when speculative execution is turned on. In either case, a
 given Task will only ever have one successful TaskAttempt whose output will
 be accepted (committed).

 Number of reduces is set to 1 by default in mapred-default.xml - you
 should explicitly set it to zero if you don't want reducers.

 By master, I suppose you mean JobTracker. JobTracker doesn't show all the
 attempts for a given Task, you should navigate to per-task page to see that.


 Thanks,
 +Vinod Kumar Vavilapalli
 Hortonworks Inc.
 http://hortonworks.com/

 On Dec 9, 2012, at 6:53 AM, Oleg Zhurakousky wrote:

 I am studying user logs on the two-node cluster that I have set up and I was
 wondering if anyone can shed some light on these 'attempt*' directories

 $ ls

 attempt_201212051224_0021_m_00_0  attempt_201212051224_0021_m_03_0
  job-acls.xml
 attempt_201212051224_0021_m_02_0  attempt_201212051224_0021_r_00_0

 I mean it's obvious that it's talking about 3 attempts for the Map task and 1
 attempt for the reduce task. However my current MR job only results in some
 output written to attempt_201212051224_0021_m_00_0. Nothing in the
 reduce part (understandably, since I don't even have a reducer), so my
 questions are:

 1. The two extra M attempts. . . what are they?
 2. Why was there an attempt to do a Reduce when no reducer was
 provided/implemented?
 3. Why did my master node only have 1 attempt for the M task while the slave had
 all that's displayed and questioned above? (The 'ls' output above is from the
 slave node.)

 Thanks
 Oleg





Re: Map tasks processing some files multiple times

2012-12-06 Thread Hemanth Yamijala
David,

You are using FileNameTextInputFormat. This is not in Hadoop source, as far
as I can see. Can you please confirm where this is being used from ? It
seems like the isSplittable method of this input format may need checking.
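
For comparison, a minimal sketch of a non-splittable wrapper around
TextInputFormat (purely an illustration using the new org.apache.hadoop.mapreduce
API - this is not your class):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

  public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
      return false; // one split, and hence one map task, per file
    }
  }

Note that the framework spells the method isSplitable (single 't'), so an
override with a different name or signature will compile but never be called.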

Another thing, given you are adding the same input format for all files, do
you need MultipleInputs ?

Thanks
Hemanth


On Thu, Dec 6, 2012 at 1:06 PM, David Parks davidpark...@yahoo.com wrote:

 I believe I just tracked down the problem, maybe you can help confirm if
 you’re familiar with this.


 I see that FileInputFormat is specifying that gzip files (.gz extension)
 from s3n filesystem are being reported as *splittable*, and I see that
 it’s creating multiple input splits for these files. I’m mapping the files
 directly off S3:


    Path lsDir = new Path(
        "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

    MultipleInputs.addInputPath(job, lsDir,
        FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);


 I see in the map phase, based on my counters, that it’s actually
 processing the entire file (I set up a counter per file input). So the 2
 files which were processed twice had 2 splits (I now see that in some debug
 logs I created), and the 1 file that was processed 3 times had 3 splits
 (the rest were smaller and were only assigned one split by default anyway).
 


 Am I wrong in expecting all files on the s3n filesystem to come through as
 not-splittable? This seems to be a bug in hadoop code if I’m right.


 David


 *From:* Raj Vishwanathan [mailto:rajv...@yahoo.com]
 *Sent:* Thursday, December 06, 2012 1:45 PM
 *To:* user@hadoop.apache.org
 *Subject:* Re: Map tasks processing some files multiple times


 Could it be due to spec-ex? Does it make a difference in the end?


 Raj

 --

 *From:* David Parks davidpark...@yahoo.com
 *To:* user@hadoop.apache.org
 *Sent:* Wednesday, December 5, 2012 10:15 PM
 *Subject:* Map tasks processing some files multiple times


 I’ve got a job that reads in 167 files from S3, but 2 of the files are
 being mapped twice and 1 of the files is mapped 3 times.

  

 This is the code I use to set up the mapper:

  

    Path lsDir = new Path(
        "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

    for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
        log.info("Identified linkshare catalog: " + f.getPath().toString());

    if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
        MultipleInputs.addInputPath(job, lsDir,
            FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
    }

  

 I can see from the logs that it sees only 1 copy of each of these files,
 and correctly identifies 167 files.

  

 I also have the following confirmation that it found the 167 files
 correctly:

  

 2012-12-06 04:56:41,213 INFO
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input
 paths to process : 167

  

 When I look through the syslogs I can see that the file in question was
 opened by two different map attempts:

  

 ./task-attempts/job_201212060351_0001/*
 attempt_201212060351_0001_m_05_0*/syslog:2012-12-06 03:56:05,265 INFO
 org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
 for reading

 ./task-attempts/job_201212060351_0001/*
 attempt_201212060351_0001_m_000173_0*/syslog:2012-12-06 03:53:18,765 INFO
 org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
 for reading

  

 This is only happening to these 3 files, all others seem to be fine. For
 the life of me I can’t see a reason why these files might be processed
 multiple times.

  

 Notably, map attempt 173 is more map attempts than should be possible.
 There are 167 input files (from S3, gzipped), thus there should be 167 map
 attempts. But I see a total of 176 map tasks.

  

 Any thoughts/ideas/guesses?

  





Re: Map tasks processing some files multiple times

2012-12-06 Thread Hemanth Yamijala
Glad it helps. Could you also explain the reason for using MultipleInputs ?


On Thu, Dec 6, 2012 at 2:59 PM, David Parks davidpark...@yahoo.com wrote:

 Figured it out, it is, as usual, with my code. I had wrapped
 TextInputFormat to replace the LongWritable key with a key representing the
 file name. It was a bit tricky to do because of changing the generics from
 <LongWritable, Text> to <Text, Text> and I goofed up and mis-directed a
 call to isSplittable, which was causing the issue.


 It now works fine. Thanks very much for the response, it gave me pause to
 think enough to work out what I had done.


 Dave


 *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
 *Sent:* Thursday, December 06, 2012 3:25 PM

 *To:* user@hadoop.apache.org
 *Subject:* Re: Map tasks processing some files multiple times


 David,


 You are using FileNameTextInputFormat. This is not in Hadoop source, as
 far as I can see. Can you please confirm where this is being used from ? It
 seems like the isSplittable method of this input format may need checking.
 


 Another thing, given you are adding the same input format for all files,
 do you need MultipleInputs ?


 Thanks

 Hemanth


 On Thu, Dec 6, 2012 at 1:06 PM, David Parks davidpark...@yahoo.com
 wrote:

 I believe I just tracked down the problem, maybe you can help confirm if
 you’re familiar with this.

  

 I see that FileInputFormat is specifying that gzip files (.gz extension)
 from s3n filesystem are being reported as *splittable*, and I see that
 it’s creating multiple input splits for these files. I’m mapping the files
 directly off S3:

  

    Path lsDir = new Path(
        "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

    MultipleInputs.addInputPath(job, lsDir,
        FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);

  

 I see in the map phase, based on my counters, that it’s actually
 processing the entire file (I set up a counter per file input). So the 2
 files which were processed twice had 2 splits (I now see that in some debug
 logs I created), and the 1 file that was processed 3 times had 3 splits
 (the rest were smaller and were only assigned one split by default anyway).
 

  

 Am I wrong in expecting all files on the s3n filesystem to come through as
 not-splittable? This seems to be a bug in hadoop code if I’m right.

  

 David

  

  

 *From:* Raj Vishwanathan [mailto:rajv...@yahoo.com]
 *Sent:* Thursday, December 06, 2012 1:45 PM
 *To:* user@hadoop.apache.org
 *Subject:* Re: Map tasks processing some files multiple times

  

 Could it be due to spec-ex? Does it make a difference in the end?

  

 Raj

  
 --

 *From:* David Parks davidpark...@yahoo.com
 *To:* user@hadoop.apache.org
 *Sent:* Wednesday, December 5, 2012 10:15 PM
 *Subject:* Map tasks processing some files multiple times

  

 I’ve got a job that reads in 167 files from S3, but 2 of the files are
 being mapped twice and 1 of the files is mapped 3 times.

  

 This is the code I use to set up the mapper:

  

    Path lsDir = new Path(
        "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");

    for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
        log.info("Identified linkshare catalog: " + f.getPath().toString());

    if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
        MultipleInputs.addInputPath(job, lsDir,
            FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
    }

  

 I can see from the logs that it sees only 1 copy of each of these files,
 and correctly identifies 167 files.

  

 I also have the following confirmation that it found the 167 files
 correctly:

  

 2012-12-06 04:56:41,213 INFO
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input
 paths to process : 167

  

 When I look through the syslogs I can see that the file in question was
 opened by two different map attempts:

  

 ./task-attempts/job_201212060351_0001/*
 attempt_201212060351_0001_m_05_0*/syslog:2012-12-06 03:56:05,265 INFO
 org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
 for reading

 ./task-attempts/job_201212060351_0001/*
 attempt_201212060351_0001_m_000173_0*/syslog:2012-12-06 03:53:18,765 INFO
 org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
 for reading

  

 This is only happening to these 3 files, all others seem to be fine. For
 the life of me I can’t see a reason why these files might be processed
 multiple

Re: Changing hadoop configuration without restarting service

2012-12-04 Thread Hemanth Yamijala
Generally true for the framework config files, but some of the
supplementary features can be refreshed without restart. For e.g. scheduler
configuration, host files (for included / excluded nodes) ...
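
For example (assuming Hadoop 1.x command names - please confirm with
"hadoop dfsadmin -help" and "hadoop mradmin -help" on your version):

  hadoop dfsadmin -refreshNodes   # re-reads the NameNode's include/exclude host files
  hadoop mradmin -refreshQueues   # re-reads queue configuration at the JobTracker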


On Tue, Dec 4, 2012 at 5:33 AM, Cristian Cira
cmc0...@tigermail.auburn.eduwrote:

 No. You will have to restart hadoop. Hot/online configuration is not
 supported in hadoop

 all the best,

 Cristian Cira
 Graduate Research Assistant
 Parallel Architecture and System Laboratory (PASL)
 Shelby Center 2105
 Auburn University, AL 36849

 
 From: Pankaj Gupta [pan...@brightroll.com]
 Sent: Monday, December 03, 2012 5:59 PM
 To: user@hadoop.apache.org
 Subject: Changing hadoop configuration without restarting service

 Hi,

 Is it possible to change hadoop configuration files such as core-site.xml
 and get the changes take effect without having to restart hadoop services?

 Thanks,
 Pankaj




Re: Failed to call hadoop API

2012-11-29 Thread Hemanth Yamijala
Hi,

Little confused about where JNI comes in here (you mentioned this in your
original email). Also, where do you want to get the information for the
hadoop job ? Is it in a program that is submitting a job, or some sort of
monitoring application that is monitoring jobs submitted to a cluster by
others ? I think some of this information will drive an answer.

FWIW, JobID.forName(job_id_as_string) would give you a handle to a JobID
tied to a job. Reference:
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/JobID.html
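
For example, a rough sketch using the old (org.apache.hadoop.mapred) client API
to list the id, name and owner of all jobs known to the JobTracker (method names
are worth double-checking against your Hadoop version):

  JobConf conf = new JobConf();
  JobClient client = new JobClient(conf);
  for (JobStatus status : client.getAllJobs()) {     // all jobs the JobTracker knows about
    RunningJob job = client.getJob(status.getJobID());
    System.out.println(status.getJobID() + " " + job.getJobName()
        + " " + status.getUsername());
  }

For per-task details, JobClient also exposes methods like getMapTaskReports(JobID).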

Thanks
hemanth


On Wed, Nov 28, 2012 at 6:58 AM, ugiwgh ugi...@gmail.com wrote:


 I want to code a program to get hadoop job information, such as jobid,
 jobname, owner, running nodes.

 **


 -- Original --
 *From: * Mahesh Balijabalijamahesh@gmail.com;
 *Date: * Tue, Nov 27, 2012 10:02 PM
 *To: * useruser@hadoop.apache.org; **
 *Subject: * Re: Failed to call hadoop API

 Hi Hui,
   The JobID constructor is not a public constructor; it has default visibility,
 so you have to create the instance within the same package.
  Usually you do not create a JobID; rather, you get one from the Job
 instance by invoking getJobID().
  If this does not work for you, please tell us what you are trying to do.
  Thanks,
 Mahesh Balija,
 Calsoft Labs.
 On Tue, Nov 27, 2012 at 5:37 PM, GHui ugi...@gmail.com wrote:


 I call the statement JobID id = new JobID() of the hadoop API with JNI. But
 when my program runs to this statement, it exits, and no errors are output. I
 can't make any sense of this.

 The hadoop is hadoop-core-1.0.3.jar.
 The Java sdk is jdk1.6.0-34.

 Any help will be appreciated.

 -GHui


 **



Re: problem using s3 instead of hdfs

2012-10-16 Thread Hemanth Yamijala
Hi,

I've not tried this on S3. However, the directory mentioned in the
exception is based on the value of this particular configuration
key: mapreduce.jobtracker.staging.root.dir. This defaults
to ${hadoop.tmp.dir}/mapred/staging. Can you please set this to an S3
location and try ?
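
For instance, something along these lines in mapred-site.xml (just a sketch that
reuses the placeholder bucket/credentials from your mail; the path under the
bucket is arbitrary):

  <property>
    <name>mapreduce.jobtracker.staging.root.dir</name>
    <value>s3n://ID:SECRET@BUCKET/tmp/mapred/staging</value>
  </property>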

Thanks
Hemanth

On Mon, Oct 15, 2012 at 10:43 PM, Parth Savani pa...@sensenetworks.comwrote:

 Hello,
   I am trying to run hadoop on s3 using distributed mode. However I am
 having issues running my job successfully on it. I get the following error
 I followed the instructions provided in this article -
 http://wiki.apache.org/hadoop/AmazonS3
 I replaced the fs.default.name value in my hdfs-site.xml to
 s3n://ID:SECRET@BUCKET
 And I am running my job using the following: hadoop jar
 /path/to/my/jar/abcd.jar /input /output
 Where */input* is the folder name inside the s3 bucket
 (s3n://ID:SECRET@BUCKET/input)
 and */output *folder should be created in my bucket (s3n://ID:SECRET@BUCKET
 /output)
 Below is the error i get. It is looking for job.jar on s3 and that path is
 on my server from where i am launching my job.

 java.io.FileNotFoundException: No such file or directory
 '/opt/data/hadoop/hadoop-mapred/mapred/staging/psavani/.staging/job_201207021606_1036/job.jar'
 at
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:412)
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:207)
 at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
  at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1371)
 at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1352)
  at
 org.apache.hadoop.mapred.JobLocalizer.localizeJobJarFile(JobLocalizer.java:273)
 at
 org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:381)
  at
 org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:371)
 at
 org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:222)
  at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1372)
 at java.security.AccessController.doPri







Re: problem using s3 instead of hdfs

2012-10-16 Thread Hemanth Yamijala
Parth,

I notice in the below stack trace that the LocalJobRunner, instead of the
JobTracker is being used. Are you sure this is a distributed cluster ?
Could you please check the value of mapred.job.tracker ?
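
mapred.job.tracker defaults to "local", which makes the client fall back to the
LocalJobRunner. On a distributed cluster it should point at the JobTracker, for
example (host and port below are placeholders):

  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>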

Thanks
Hemanth

On Tue, Oct 16, 2012 at 8:02 PM, Parth Savani pa...@sensenetworks.comwrote:

 Hello Hemanth,
 I set the hadoop staging directory to s3 location. However, it
 complains. Below is the error

 12/10/16 10:22:47 INFO jvm.JvmMetrics: Initializing JVM Metrics with
 processName=JobTracker, sessionId=
 Exception in thread main java.lang.IllegalArgumentException: Wrong FS:
 s3n://ABCD:ABCD@ABCD/tmp/mapred/staging/psavani1821193643/.staging,
 expected: file:///
 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:410)
  at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:322)
 at
 org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:79)
  at
 org.apache.hadoop.mapred.LocalJobRunner.getStagingAreaDir(LocalJobRunner.java:541)
 at
 org.apache.hadoop.mapred.JobClient.getStagingAreaDir(JobClient.java:1204)
  at
 org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:102)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:839)
  at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
 at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
  at
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
  at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:506)
 at
 com.sensenetworks.macrosensedata.ParseLogsMacrosense.run(ParseLogsMacrosense.java:54)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at
 com.sensenetworks.macrosensedata.ParseLogsMacrosense.main(ParseLogsMacrosense.java:121)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:197)


 On Tue, Oct 16, 2012 at 3:11 AM, Hemanth Yamijala 
 yhema...@thoughtworks.com wrote:

 Hi,

 I've not tried this on S3. However, the directory mentioned in the
 exception is based on the value of this particular configuration
 key: mapreduce.jobtracker.staging.root.dir. This defaults
 to ${hadoop.tmp.dir}/mapred/staging. Can you please set this to an S3
 location and try ?

 Thanks
 Hemanth


 On Mon, Oct 15, 2012 at 10:43 PM, Parth Savani 
 pa...@sensenetworks.comwrote:

 Hello,
   I am trying to run hadoop on s3 using distributed mode. However I
 am having issues running my job successfully on it. I get the following
 error
 I followed the instructions provided in this article -
 http://wiki.apache.org/hadoop/AmazonS3
 I replaced the fs.default.name value in my hdfs-site.xml to
 s3n://ID:SECRET@BUCKET
 And I am running my job using the following: hadoop jar
 /path/to/my/jar/abcd.jar /input /output
 Where */input* is the folder name inside the s3 bucket
 (s3n://ID:SECRET@BUCKET/input)
 and */output *folder should created in my bucket (s3n://ID:SECRET@BUCKET
 /output)
 Below is the error i get. It is looking for job.jar on s3 and that path
 is on my server from where i am launching my job.

 java.io.FileNotFoundException: No such file or directory
 '/opt/data/hadoop/hadoop-mapred/mapred/staging/psavani/.staging/job_201207021606_1036/job.jar'
 at
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:412)
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:207)
 at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
  at
 org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1371)
 at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1352)
  at
 org.apache.hadoop.mapred.JobLocalizer.localizeJobJarFile(JobLocalizer.java:273)
 at
 org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:381)
  at
 org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:371)
 at
 org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:222)
  at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1372)
 at java.security.AccessController.doPri









Re: Question about how to find which file takes the longest time to process and how to assign more mappers to process that particular file

2012-10-04 Thread Hemanth Yamijala
Hi,

Roughly, this information will be available under the 'Hadoop map task
list' page in the Mapreduce web ui (in Hadoop-1.0, which I am assuming is
what you are using). You can reach this page by selecting the running tasks
link from the job information page. The page has a table that lists all the
tasks and under the status column tells you which part of the input is
being processed. Please note that, depending on the input format chosen, a
task may be processing a *part* of a file, and not necessary a file itself.

Another good source of information to see why these particular tasks are
slow will be to look at the job's counters. Again these counters can be
accessed from the web ui of the task list page.

It would help more if you can provide more information - like what job
you're trying to run, the input format specified etc.

Thanks
hemanth

On Fri, Oct 5, 2012 at 3:33 AM, Huanchen Zhang iamzhan...@gmail.com wrote:

 Hello,

 I have a question about how to find which file takes the longest time to
 process and how to assign more mappers to process that particular file.

 Currently, about three mappers take about five times more time to
 complete. So, how can I detect which specific files those three mappers
 are processing? If the above is doable, how can I assign more mappers to
 process those specific files?

 Thank you !

 Best,
 Huanchen


Re: A small portion of map tasks slows down the job

2012-10-03 Thread Hemanth Yamijala
Hi,

Would reducing the output from the map tasks solve the problem ? i.e. are
reducers slowing down because a lot of data is being shuffled ?

If that's the case, you could see if the map output size will reduce by
using the framework's combiner or an in-mapper combining technique.
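
For instance, if your reduce logic is commutative and associative, reusing the
reducer as a combiner is a one-line change (a sketch - MyReducer stands in for
your own reducer class):

  job.setCombinerClass(MyReducer.class); // pre-aggregates map output before the shuffle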

Thanks
Hemanth

On Wed, Oct 3, 2012 at 6:34 AM, Huanchen Zhang iamzhan...@gmail.com wrote:

 Hello,

 I have a small portion of map tasks whose output is much larger than
 others (more spills). So the reducer is mainly waiting for these few map
 tasks. Is there a good solution for this problem ?

 Thank you.

 Best,
 Huanchen


Re: Can we write output directly to HDFS from Mapper

2012-09-27 Thread Hemanth Yamijala
Can certainly do that. Indeed, if you set the number of reducers to 0,
the map output will be directly written to HDFS by the framework
itself. You may also want to look at
http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Task+Side-Effect+Files
to see some things that need to be taken care of if you are writing
files on your own.
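
A minimal sketch of the map-only setup (the output path is a placeholder):

  job.setNumReduceTasks(0); // no reduce phase: map output is written straight to HDFS
  FileOutputFormat.setOutputPath(job, new Path("/user/anand/output"));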

Thanks
hemanth

On Fri, Sep 28, 2012 at 9:45 AM, Balaraman, Anand
anand_balara...@syntelinc.com wrote:
 Hi



 In Map-Reduce, is it appropriate to write the output directly to HDFS from
 Mapper (without using a reducer) ?

 Are there any adverse effects in doing so or are there any best practices to
 be followed in this aspect ?



 Comments are much appreciated at the moment :)



 Thanks and Regards

 Anand B



Re: Passing Command-line Parameters to the Job Submit Command

2012-09-25 Thread Hemanth Yamijala
By java environment variables, do you mean the ones passed as
-Dkey=value ? That's one way of passing them. I suppose another way is
to have a client side site configuration (like mapred-site.xml) that
is in the classpath of the client app.
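
For example (a sketch assuming the driver goes through
ToolRunner/GenericOptionsParser so that generic options are honoured; the class
name and custom key below are placeholders):

  hadoop jar myjob.jar com.example.MyDriver \
      -D mapred.reduce.tasks=2 -D my.custom.key=somevalue input output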

Thanks
Hemanth

On Tue, Sep 25, 2012 at 12:20 AM, Varad Meru meru.va...@gmail.com wrote:
 Thanks Hemanth,

 But in general, if we want to pass arguments to any job (not only
 PiEstimator from examples-jar) and submit the Job to the Job queue
 scheduler, by the looks of it, we might always need to use the java
 environment variables only.

 Is my above assumption correct?

 Thanks,
 Varad

 On Mon, Sep 24, 2012 at 9:48 AM, Hemanth Yamijala yhema...@gmail.comwrote:

 Varad,

 Looking at the code for the PiEstimator class which implements the
 'pi' example, the two arguments are mandatory and are used *before*
 the job is submitted for execution - i.e on the client side. In
 particular, one of them (nSamples) is used not by the MapReduce job,
 but by the client code (i.e. PiEstimator) to generate some input.

 Hence, I believe all of this additional work that is being done by the
 PiEstimator class will be bypassed if we directly use the job -submit
 command. In other words, I don't think these two ways of running the
 job:

 - using the hadoop jar examples pi
 - using hadoop job -submit

 are equivalent.

 As a general answer to your question though, if additional parameters
 are used by the Mappers or reducers, then they will generally be set
 as additional job specific configuration items. So, one way of using
 them with the job -submit command will be to find out the specific
 names of the configuration items (from code, or some other
 documentation), and include them in the job.xml used when submitting
 the job.

 Thanks
 Hemanth

 On Sun, Sep 23, 2012 at 1:24 PM, Varad Meru meru.va...@gmail.com wrote:
  Hi,
 
  I want to run the PiEstimator example from using the following command
 
  $hadoop job -submit pieestimatorconf.xml
 
  which contains all the info required by hadoop to run the job. E.g. the
  input file location, the output file location and other details.
 
 
  <property><name>mapred.jar</name><value>file:Users/varadmeru/Work/Hadoop/hadoop-examples-1.0.3.jar</value></property>
  <property><name>mapred.map.tasks</name><value>20</value></property>
  <property><name>mapred.reduce.tasks</name><value>2</value></property>
  ...
  <property><name>mapred.job.name</name><value>PiEstimator</value></property>

  <property><name>mapred.output.dir</name><value>file:Users/varadmeru/Work/out</value></property>
 
  Now, as we know, to run the PiEstimator, we can use the following command
 too
 
  $hadoop jar hadoop-examples.1.0.3 pi 5 10
 
  where 5 and 10 are the arguments to the main class of the PiEstimator.
 How
  can I pass the same arguments (5 and 10) using the job -submit command
  through conf. file or any other way, without changing the code of the
  examples to reflect the use of environment variables.
 
  Thanks in advance,
  Varad
 
  -
  Varad Meru
  Software Engineer,
  Business Intelligence and Analytics,
  Persistent Systems and Solutions Ltd.,
  Pune, India.



Re: Passing Command-line Parameters to the Job Submit Command

2012-09-23 Thread Hemanth Yamijala
Varad,

Looking at the code for the PiEstimator class which implements the
'pi' example, the two arguments are mandatory and are used *before*
the job is submitted for execution - i.e on the client side. In
particular, one of them (nSamples) is used not by the MapReduce job,
but by the client code (i.e. PiEstimator) to generate some input.

Hence, I believe all of this additional work that is being done by the
PiEstimator class will be bypassed if we directly use the job -submit
command. In other words, I don't think these two ways of running the
job:

- using the hadoop jar examples pi
- using hadoop job -submit

are equivalent.

As a general answer to your question though, if additional parameters
are used by the Mappers or reducers, then they will generally be set
as additional job specific configuration items. So, one way of using
them with the job -submit command will be to find out the specific
names of the configuration items (from code, or some other
documentation), and include them in the job.xml used when submitting
the job.
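
On the task side, such a configuration item can then be read back from the job
configuration, for example (the key name below is hypothetical):

  // inside setup(Context context) of a new-API Mapper or Reducer
  String samples = context.getConfiguration().get("pi.estimator.samples");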

Thanks
Hemanth

On Sun, Sep 23, 2012 at 1:24 PM, Varad Meru meru.va...@gmail.com wrote:
 Hi,

 I want to run the PiEstimator example from using the following command

 $hadoop job -submit pieestimatorconf.xml

 which contains all the info required by hadoop to run the job. E.g. the
 input file location, the output file location and other details.

  <property><name>mapred.jar</name><value>file:Users/varadmeru/Work/Hadoop/hadoop-examples-1.0.3.jar</value></property>
  <property><name>mapred.map.tasks</name><value>20</value></property>
  <property><name>mapred.reduce.tasks</name><value>2</value></property>
  ...
  <property><name>mapred.job.name</name><value>PiEstimator</value></property>
  <property><name>mapred.output.dir</name><value>file:Users/varadmeru/Work/out</value></property>

 Now, as we know, to run the PiEstimator, we can use the following command too

 $hadoop jar hadoop-examples.1.0.3 pi 5 10

 where 5 and 10 are the arguments to the main class of the PiEstimator. How
 can I pass the same arguments (5 and 10) using the job -submit command
 through conf. file or any other way, without changing the code of the
 examples to reflect the use of environment variables.

 Thanks in advance,
 Varad

 -
 Varad Meru
 Software Engineer,
 Business Intelligence and Analytics,
 Persistent Systems and Solutions Ltd.,
 Pune, India.


Re: Will all the intermediate output with the same key go to the same reducer?

2012-09-20 Thread Hemanth Yamijala
Hi,

Yes. By contract, all intermediate output with the same key goes to
the same reducer.

In your example, suppose of the two keys generated from the mapper,
one key goes to reducer 1 and the second goes to reducer 2; then reducer 3
will not have any records to process and will end without producing any
output.

If the intermediate key space is very large, 1 reducer would certainly
be a bottleneck, as you rightly note. Hence, configuring the right
number of reducers would be certainly important.
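
For reference, the default HashPartitioner computes essentially the following,
which is why equal keys always land in the same reduce partition:

  int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;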

Thanks
hemanth

On 9/20/12, Jason Yang lin.yang.ja...@gmail.com wrote:
 Hi, all

  I have a question about whether all the intermediate output with the same
  key goes to the same reducer or not.

  If it does, in the case where only two keys are generated from the mapper but
  there are 3 reducers running in the job, what would happen?

  If not, how could I do some processing over all the data, like counting? I
  think some would suggest setting the number of reducers to 1, but I thought
  this would make the reducer the bottleneck when there is a large
  volume of intermediate output, wouldn't it?

 --
 YANG, Lin



Re: About ant Hadoop

2012-09-19 Thread Hemanth Yamijala
Can you please look at the jobtracker and tasktracker logs on nodes
where the task has been launched ? Also see if the job logs are
picking up anything. They'll probably give you clues on what is
happening.

Also, is HDFS ok ? i.e. are you able to read files already loaded etc.

Thanks
hemanth

On 9/19/12, Li Shengmei lisheng...@ict.ac.cn wrote:
 Hi,all

  I revised the source code of hadoop-1.0.3 and used ant to recompile
 hadoop. It compiles successfully. Then I ran jar cvf hadoop-core-1.0.3.jar *
 and copied the new hadoop-core-1.0.3.jar to overwrite the $HADOOP_HOME/
 hadoop-core-1.0.3.jar on every node machine. Then I used hadoop to test the
 wordcount application. But the application halts at map 0% reduce 0%.

  Can anyone give suggestions?

 Thanks a lot.



 May










Re: What's the basic idea of pseudo-distributed Hadoop ?

2012-09-14 Thread Hemanth Yamijala
One thing to be careful about is paths of dependent libraries or
executables like streaming binaries. In pseudo-distributed mode, since all
processes are looking for files on the same machine, it is likely that they will
find paths that are really local only to the machine where the job is being
launched from. When you start to run them in a true distributed
environment, and if these files are not packaged and distributed to the
cluster in some way, they will start failing.
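
One way to carry such dependencies along with the job (assuming the driver uses
ToolRunner so that generic options work; the file name is a placeholder):

  hadoop jar myjob.jar com.example.MyDriver -files /local/path/lookup.dat input output

The file is then shipped via the distributed cache and is available in each
task's working directory on the cluster.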

Thanks
hemanth

On Fri, Sep 14, 2012 at 1:04 PM, Jason Yang lin.yang.ja...@gmail.comwrote:

 All right, I got it.

 Thanks for all of you.


 2012/9/14 Bertrand Dechoux decho...@gmail.com

 The only difference between pseudo-distributed and fully distributed
 would be scale. You could say that code that runs fine on the former, runs
 fine too on the latter. But it does not necessary mean that the performance
 will scale the same way (ie if you keep a list of elements in memory, at
 bigger scale you could receive OOME).

 Of course, like it has been implied in previous answers, you can't say
 the same with standalone. With this mode, you could use a global mutable
 static state thinking it's fine without caring about distribution between
 the nodes. In that case, the same code launched on pseudo-distributed will
 fail to replicate the same results.

 Regards

 Bertrand


 On Fri, Sep 14, 2012 at 9:24 AM, Harsh J ha...@cloudera.com wrote:

 Hi Jason,

 I think you're confusing the standalone mode with a pseudo-distributed
 mode. The former is a limited mode of MR where no daemons need to be
 deployed and the tasks run in a single JVM (via threads).

 A pseudo distributed cluster is a cluster where all daemons are
 running on one node itself. Hence, not distributed in the sense of
 multi-nodes (no use of an network gear) but works in the same way
 between nodes (RPC, etc.) as a fully-distributed one.

 If an MR program works fine in a pseudo-distributed mode, it should
 work (no guarantee) fine in a fully-distributed mode iff all nodes
 have the same arch/OS, same JVM, and job-specific configurations. This
 is because tasks execute on various nodes and may be affected by the
 node's behavior or setup that is different from others - and thats
 something you'd have to detect/know about if it exhibits failures more
 than others.

 On Fri, Sep 14, 2012 at 11:58 AM, Jason Yang lin.yang.ja...@gmail.com
 wrote:
  Hey, Kai
 
  Thanks for you reply.
 
  I was wondering what's difference btw the pseudo-distributed and
  fully-distributed hadoop, except the maximum number of map/reduce.
 
  And if a MR program works fine in pseudo-distributed cluster, will it
 work
  exactly fine in the fully-distributed cluster ?
 
 
  2012/9/14 Kai Voigt k...@123.org
 
  The default setting is that a tasktracker can run up to two map and
 reduce
  tasks in parallel (mapred.tasktracker.map.tasks.maximum and
  mapred.tasktracker.reduce.tasks.maximum), so you will actually see
 some
  concurrency on your one machine.
 
 
 
 
  --
  YANG, Lin
 



 --
 Harsh J




 --
 Bertrand Dechoux




 --
 YANG, Lin




Re: Ignore keys while scheduling reduce jobs

2012-09-14 Thread Hemanth Yamijala
Hi,

When do you know the keys to ignore ? You mentioned after the map stage
.. is this at the end of each map task, or at the end of all map tasks ?

Thanks
hemanth

On Fri, Sep 14, 2012 at 4:36 PM, Aseem Anand aseem.ii...@gmail.com wrote:

 Hi,
 Is there anyway I can ignore all keys except a certain key ( determined
 after the map stage) to start only 1 reduce job using a partitioner? If so
 could someone suggest such a method.

 Regards,
 Aseem




Re: Question about the task assignment strategy

2012-09-11 Thread Hemanth Yamijala
Hi,

Task assignment takes data locality into account first and not block
sequence. In hadoop, tasktrackers ask the jobtracker to be assigned tasks.
When such a request comes to the jobtracker, it will try to look for an
unassigned task which needs data that is close to the tasktracker and will
assign it.

Thanks
Hemanth

On Tue, Sep 11, 2012 at 6:31 PM, Hiroyuki Yamada mogwa...@gmail.com wrote:

 Hi,

 I want to make sure my understanding about task assignment in hadoop
 is correct or not.

 When scanning a file with multiple tasktrackers,
 I am wondering how a task is assigned to each tasktracker .
 Is it based on the block sequence or data locality ?

 Let me explain my question by example.
 There is a file which composed of 10 blocks (block1 to block10), and
 block1 is the beginning of the file and block10 is the tail of the file.
 When scanning the file with 3 tasktrackers (tt1 to tt3),
 I am wondering if
 task assignment is based on the block sequence like
 first tt1 takes block1 and tt2 takes block2 and tt3 takes block3 and
 tt1 takes block4 and so on
 or
 task assignment is based on the task(data) locality like
 first tt1 takes block2(because it's located in the local) and tt2
 takes block1 (because it's located in the local) and
 tt3 takes block 4(because it's located in the local) and so on.

 As far as I experienced and the definitive guide book says,
 I think that the first case is the task assignment strategy.
 (and if there are many replicas, closest one is picked.)

 Is this right ?

 If this is right, is there any way to do like the second case
 with the current implementation ?

 Thanks,

 Hiroyuki



Re: Error in : hadoop fsck /

2012-09-11 Thread Hemanth Yamijala
Could you please review your configuration to see if you are pointing to
the right namenode address ? (This will be in core-site.xml)
Please paste it here so we can look for clues.
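
For a single-node setup, core-site.xml would typically contain something like
this (host and port are placeholders - use whatever your namenode actually
listens on):

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>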

Thanks
hemanth

On Tue, Sep 11, 2012 at 9:25 PM, yogesh dhari yogeshdh...@live.com wrote:

  Hi all,

 I am running hadoop-0.20.2 on single node cluster,

 I run the command

 hadoop fsck /

 it shows error:

 Exception in thread main java.net.UnknownHostException: http
 at
 java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
 at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
 at java.net.Socket.connect(Socket.java:579)
 at java.net.Socket.connect(Socket.java:528)
 at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
 at sun.net.www.http.HttpClient.openServer(HttpClient.java:378)
 at sun.net.www.http.HttpClient.openServer(HttpClient.java:473)
 at sun.net.www.http.HttpClient.init(HttpClient.java:203)
 at sun.net.www.http.HttpClient.New(HttpClient.java:290)
 at sun.net.www.http.HttpClient.New(HttpClient.java:306)
 at
 sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:995)
 at
 sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:931)
 at
 sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:849)
 at
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1299)
 at org.apache.hadoop.hdfs.tools.DFSck.run(DFSck.java:123)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.hdfs.tools.DFSck.main(DFSck.java:159)




 Please suggest why this is so; it should show the health status:



Re: Question about the task assignment strategy

2012-09-11 Thread Hemanth Yamijala
Hi,

I tried a similar experiment as yours but couldn't replicate the issue.

I generated 64 MB files and added them to my DFS - one file from every
machine, with a replication factor of 1,  like you did. My block size was
64MB. I verified the blocks were located on the same machine as where I
added them from.

Then, I launched a wordcount (without the min split size config). As
expected, it created 8 maps, and I could verify that it ran all the tasks
as data local - i.e. every task read off its own datanode. From the launch
times of the tasks, I could roughly feel that this scheduling behaviour was
independent of the order in which the tasks were launched. This behaviour
was retained even with the min split size config.

Could you share the size of input you generated (i.e the size of
data01..data14) ? Also, what job are you running - specifically what is the
input format ?

BTW, this wiki entry:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces talks a little bit
about how the maps are created.

Thanks
Hemanth

On Wed, Sep 12, 2012 at 7:49 AM, Hiroyuki Yamada mogwa...@gmail.com wrote:

 I figured out the cause.
 HDFS block size is 128MB, but
 I specify mapred.min.split.size as 512MB,
 and data local I/O processing goes wrong for some reason.
 When I remove the mapred.min.split.size configuration,
 tasktrackers pick data-local tasks.
 Why does it happen ?

 It seems like a bug.
 Split is a logical container of blocks,
 so nothing is wrong logically.

 On Wed, Sep 12, 2012 at 1:20 AM, Hiroyuki Yamada mogwa...@gmail.com
 wrote:
  Hi, thank you for the comment.
 
  Task assignment takes data locality into account first and not block
 sequence.
 
  Does it work like that when replica factor is set to 1 ?
 
  I just had a experiment to check the behavior.
  There are 14 nodes (node01 to node14) and there are 14 datanodes and
  14 tasktrackers working.
  I first created a data to be processed in each node (say data01 to
 data14),
  and I put the each data to the hdfs from each node (at /data
  directory. /data/data01, ... /data/data14).
  Replica factor is set to 1, so according to the default block placement
 policy,
  each data is stored at local node. (data01 is stored at node01, data02
  is stored at node02 and so on)
  In that setting, I launched a job that processes the /data and
  what happened is that tasktrackers read from data01 to data14
 sequentially,
  which means tasktrackers first take all data from node01 and then
  node02 and then node03 and so on.
 
  If tasktracker takes data locality into account as you say,
  each tasktracker should take the local task(data). (tasktrackers at
  node02 should take data02 blocks if there is any)
  But, it didn't work like that.
   Why is this happening ?
 
  Is there any documents about this ?
  What part of the source code is doing that ?
 
  Regards,
  Hiroyuki
 
  On Tue, Sep 11, 2012 at 11:27 PM, Hemanth Yamijala
  yhema...@thoughtworks.com wrote:
  Hi,
 
  Task assignment takes data locality into account first and not block
  sequence. In hadoop, tasktrackers ask the jobtracker to be assigned
 tasks.
  When such a request comes to the jobtracker, it will try to look for an
  unassigned task which needs data that is close to the tasktracker and
 will
  assign it.
 
  Thanks
  Hemanth
 
 
  On Tue, Sep 11, 2012 at 6:31 PM, Hiroyuki Yamada mogwa...@gmail.com
 wrote:
 
  Hi,
 
  I want to make sure my understanding about task assignment in hadoop
  is correct or not.
 
  When scanning a file with multiple tasktrackers,
  I am wondering how a task is assigned to each tasktracker .
  Is it based on the block sequence or data locality ?
 
  Let me explain my question by example.
  There is a file which composed of 10 blocks (block1 to block10), and
  block1 is the beginning of the file and block10 is the tail of the
 file.
  When scanning the file with 3 tasktrackers (tt1 to tt3),
  I am wondering if
  task assignment is based on the block sequence like
  first tt1 takes block1 and tt2 takes block2 and tt3 takes block3 and
  tt1 takes block4 and so on
  or
  task assignment is based on the task(data) locality like
  first tt1 takes block2(because it's located in the local) and tt2
  takes block1 (because it's located in the local) and
  tt3 takes block 4(because it's located in the local) and so on.
 
  As far as I experienced and the definitive guide book says,
  I think that the first case is the task assignment strategy.
  (and if there are many replicas, closest one is picked.)
 
  Is this right ?
 
  If this is right, is there any way to do like the second case
  with the current implementation ?
 
  Thanks,
 
  Hiroyuki
 
 



Re: Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)

2012-09-10 Thread Hemanth Yamijala
Hi,

I am not sure if there's any way to restrict the tasks to specific
machines. However, I think there are some ways of restricting the
number of 'slots' that can be used by the job.

Also, not sure which version of Hadoop you are on. The
capacityscheduler
(http://hadoop.apache.org/common/docs/r2.0.1-alpha/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html)
has ways by which you can set up a queue with a hard capacity limit.
The capacity controls the number of slots that can be used by
jobs submitted to the queue. So, if you submit a job to the queue,
irrespective of the number of tasks it has, it should limit it to
those slots.  However, please note that this does not restrict the
tasks to specific machines.
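
As a sketch, with the YARN capacity scheduler linked above you could define a
dedicated queue (named "bench" here purely as an example; it would also need to
be listed under yarn.scheduler.capacity.root.queues) and cap it at a fixed share
of the cluster - please verify the exact property names against the linked docs:

  <property>
    <name>yarn.scheduler.capacity.root.bench.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.bench.maximum-capacity</name>
    <value>50</value>
  </property>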

Thanks
Hemanth

On Mon, Sep 10, 2012 at 2:36 PM, Safdar Kureishy
safdar.kurei...@gmail.com wrote:
 Hi,

 I need to run some benchmarking tests for a given mapreduce job on a *subset
 *of a 10-node Hadoop cluster. Not that it matters, but the current cluster
 settings allow for ~20 map slots and 10 reduce slots per node.

 Without loss of generalization, let's say I want a job with these
 constraints below:
 - to use only *5* out of the 10 nodes for running the mappers,
 - to use only *5* out of the 10 nodes for running the reducers.

 Is there any other way of achieving this through Hadoop property overrides
 during job-submission time? I understand that the Fair Scheduler can
 potentially be used to create pools of a proportionate # of mappers and
 reducers, to achieve a similar outcome, but the problem is that I still
 cannot tie such a pool to a fixed # of machines (right?). Essentially,
 regardless of the # of map/reduce tasks involved, I only want a *fixed # of
 machines* to handle the job.

 Any tips on how I can go about achieving this?

 Thanks,
 Safdar


Re: Reading from HDFS from inside the mapper

2012-09-10 Thread Hemanth Yamijala
Hi,

You could check DistributedCache (
http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
It would allow you to distribute data to the nodes where your tasks are run.
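
A rough sketch of the flow (paths are placeholders; the calls below are from
org.apache.hadoop.filecache.DistributedCache in Hadoop 1.x and worth
double-checking for your version):

  // in the driver, before submitting the job:
  DistributedCache.addCacheFile(new URI("/user/sigurd/datasetB/part-00000"),
      job.getConfiguration());

  // in the mapper's setup(): local copies of the cached files
  Path[] localCopies =
      DistributedCache.getLocalCacheFiles(context.getConfiguration());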

Thanks
Hemanth

On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann 
sigurd.spieckerm...@gmail.com wrote:

 Hi,

 I would like to perform a map-side join of two large datasets where
 dataset A consists of m*n elements and dataset B consists of n elements.
 For the join, every element in dataset B needs to be accessed m times. Each
 mapper would join one element from A with the corresponding element from B.
 Elements here are actually data blocks. Is there a performance problem (and
 difference compared to a slightly modified map-side join using the
 join-package) if I set dataset A as the map-reduce input and load the
 relevant element from dataset B directly from the HDFS inside the mapper? I
 could store the elements of B in a MapFile for faster random access. In the
 second case without the join-package I would not have to partition the
 datasets manually which would allow a bit more flexibility, but I'm
 wondering if HDFS access from inside a mapper is strictly bad. Also, does
 Hadoop have a cache for such situations by any chance?

 I appreciate any comments!

 Sigurd



Re: Understanding of the hadoop distribution system (tuning)

2012-09-10 Thread Hemanth Yamijala
Hi,

Responses inline to some points.

On Tue, Sep 11, 2012 at 7:26 AM, Elaine Gan elaine-...@gmo.jp wrote:

 Hi,

 I'm new to hadoop and i've just played around with map reduce.
 I would like to check if my understanding to hadoop is correct and i
 would appreciate if anyone could correct me if i'm wrong.

 I have a data of around 518MB, and i wrote a MR program to process it.
 Here are some of my settings in my mapred-site.xml.
 ---
 mapred.tasktracker.map.tasks.maximum = 20
 mapred.tasktracker.reduce.tasks.maximum = 20
 ---


These two configurations essentially tell the tasktrackers that they can
run 20 maps and 20 reduces in parallel on a machine. Is this what you
intended ? (Generally the sum of these two values should equal the number
of cores on your tasktracker node, or a little more).

Also, would help if you can tell us your cluster size - i.e. number of
slaves.


 My block size is default, 64MB
 With my data size = 518MB, i guess setting the maximum for MR task to 20
 is far more than enough (518/64 = 8) , did i get it correctly?


I suppose what you want is to run all the maps in parallel. For that, the
number of map slots in your cluster should be more than the number of maps
of your job (assuming there's a single job running). If the number of slots
is less than number of maps, the maps would be scheduled in multiple waves.
On your jobtracker main page, the Cluster Summary  Map Task Capacity gives
you the total slots available in your cluster.



 When i run the MR program, i could see in the Map/Reduce Administration
 page that the number of Maps Total = 8, so i assume that everything is
 going well here, once again if i'm wrong please correct me.
 (Sometimes it shows only Maps Total = 3)


This value tells us the number of maps that will run for the job.


 There's one thing which i'm uncertain about hadoop distribution.
 Does Maps Total = 8 mean that there are 8 map tasks split among all
 the data nodes (task trackers)?
 Is there any way I can check whether all the tasks are shared among
 datanodes (where the task trackers are working)?


There's no easy way to check this. The task page for every task shows the
attempts that ran for each task and where they ran under the 'Machine'
column.


 When i clicked on each link under that Task Id, i can see there's Input
 Split Locations stated under each task details, if the inputs are
 splitted between data nodes, does that means that everything is working
 well?


I think this is just the location of the splits, including the replicas.
What you could see is if enough data local maps ran - which means that the
tasks mostly got their inputs from datanodes running on the same machine as
themselves. This is given by the counter Data-local map tasks on the job
UI page.


 I need to make sure i got everything running well because my MR took
 around 6 hours to finish despite the input size is small.. (Well, i know
 hadoop is not meant for small data), I'm not sure whether it's my
 configuration that goes wrong or hadoop is just not suitable for my case.
 I'm actually running a mahout kmeans analysis.

 Thank you for your time.







Re: [Cosmos-dev] Out of memory in identity mapper?

2012-09-06 Thread Hemanth Yamijala
Harsh,

Could IsolationRunner be used here? I'd put up a patch for HADOOP-8765,
after applying which IsolationRunner works for me. Maybe we could use it to
re-run the map task that's failing and debug.

Thanks
hemanth

On Thu, Sep 6, 2012 at 9:42 PM, Harsh J ha...@cloudera.com wrote:

 Protobuf involvement makes me more suspicious that this is possibly a
 corruption or an issue with serialization as well. Perhaps if you can
 share some stack traces, people can help better. If it is reliably
 reproducible, then I'd also check for the count of records until after
 this occurs, and see if the stacktraces are always same.

 Serialization formats such as protobufs allocate objects based on read
 sizes (like for example, a string size may be read first before the
 string's bytes are read, and upon size read, such a length is
 pre-allocated for the bytes to be read into), and in cases of corrupt
 data or bugs in the deserialization code, it is quite easy for it to
 make a large alloc request due to a badly read value. Its one
 possibility.

 Is the input compressed too, btw? Can you seek out the input file the
 specific map fails on, and try to read it in an isolated manner to
 validate it? Or do all maps seem to fail?

 On Thu, Sep 6, 2012 at 9:01 PM, SEBASTIAN ORTEGA TORRES sort...@tid.es
 wrote:
  Input files are small fixed-size protobuf records and yes, it is
  reproducible (but it takes some time).
  In this case I cannot use combiners since I need to process all the
 elements
  with the same key altogether.
 
  Thanks for the prompt response
 
  --
  Sebastián Ortega Torres
  Product Development  Innovation / Telefónica Digital
  C/ Don Ramón de la Cruz 82-84
  Madrid 28006
 
 
 
 
 
 
  El 06/09/2012, a las 17:13, Harsh J escribió:
 
  I can imagine a huge record size possibly causing this. Is this
  reliably reproducible? Do you also have combiners enabled, which may
  run the reducer-logic on the map-side itself?
 
  On Thu, Sep 6, 2012 at 8:20 PM, JOAQUIN GUANTER GONZALBEZ x...@tid.es
  wrote:
 
  Hello hadoopers!
 
 
 
 
  In a reduce-only Hadoop job input files are handled by the identity
 mapper
 
  and sent to the reducers without modification. In one of my jobs I was
 
  surprised to see the job failing in the map phase with Out of memory
 error
 
  and GC overhead limit exceeded.
 
 
 
 
  In my understanding, a memory leak on the identity mapper is out of the
 
  question.
 
 
  What can be the cause of such error?
 
 
 
 
  Thanks,
 
 
  Ximo.
 
 
 
 
  P.S. The logs show no stack trace other than the messages I mentioned
 
  before.
 
 
 
  
 
 
 
 
 
  --
  Harsh J
 
  ___
  Cosmos-dev mailing list
  cosmos-...@tid.es
  https://listas.tid.es/mailman/listinfo/cosmos-dev
 
 
 
  
 



 --
 Harsh J



Re: Error using hadoop in non-distributed mode

2012-09-04 Thread Hemanth Yamijala
Hi,

The path 
/tmp/hadoop-pat/mapred/local/archive/-4686065962599733460_1587570556_150738331/snip
is a location used by the tasktracker process for the 'DistributedCache' -
a mechanism to distribute files to all tasks running in a map reduce job. (
http://hadoop.apache.org/common/docs/r1.0.3/mapred_tutorial.html#DistributedCache
).

You have mentioned Mahout, so I am assuming that the specific analysis job
you are running is using this feature to distribute the output of the file /
Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0 to the job that is
causing a failure.

Also, I find links stating the distributed cache does not work with in the
local (non-HDFS) mode. (
http://stackoverflow.com/questions/9148724/multiple-input-into-a-mapper-in-hadoop).
Look at the second answer.

Thanks
hemanth


On Tue, Sep 4, 2012 at 10:33 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 The job is creating several output and intermediate files all under the
 location Users/pat/Projects/big-data/b/ssvd/. Several output directories
 and files are created correctly, and the
 file Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0 is created and
 exists at the time of the error. We seem to be passing
 in Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0 as the input file.

 Under what circumstances would an input path passed in as
 Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0 be turned into
 pat/mapred/local/archive/6590995089539988730_1587570556_37122331/file/Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0

 ???


 On Sep 4, 2012, at 1:14 AM, Narasingu Ramesh ramesh.narasi...@gmail.com
 wrote:

 Hi Pat,
 Please specify correct input file location.
 Thanks  Regards,
 Ramesh.Narasingu

 On Mon, Sep 3, 2012 at 9:28 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Using hadoop with mahout in a local filesystem/non-hdfs config for
 debugging purposes inside Intellij IDEA. When I run one particular part of
 the analysis I get the error below. I didn't write the code but we are
 looking for some hint about what might cause it. This job completes without
 error in a single node pseudo-clustered config outside of IDEA.

 Several jobs in the pipeline complete without error, creating part files
 just fine in the local file system.

 The file
 /tmp/hadoop-pat/mapred/local/archive/6590995089539988730_1587570556_37122331/file/Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0

 which is the subject of the error - does not exist

 Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0

 does exist at the time of the error. So the code is looking for the data
 in the wrong place?

 ….
 12/09/02 14:56:29 INFO compress.CodecPool: Got brand-new decompressor
 12/09/02 14:56:29 INFO compress.CodecPool: Got brand-new decompressor
 12/09/02 14:56:29 INFO compress.CodecPool: Got brand-new decompressor
 12/09/02 14:56:29 WARN mapred.LocalJobRunner: job_local_0002
 java.io.FileNotFoundException: File
 /tmp/hadoop-pat/mapred/local/archive/-4686065962599733460_1587570556_150738331/file/Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0
 does not exist.
 at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
 at
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
 at
 org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:92)
 at
 org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.setup(BtJob.java:219)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
 at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
 Exception in thread main java.io.IOException: Bt job unsuccessful.
 at
 org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:609)
 at
 org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:397)
 at
 com.finderbots.analysis.AnalysisPipeline.SSVDTransformAndBack(AnalysisPipeline.java:257)
 at com.finderbots.analysis.AnalysisJob.run(AnalysisJob.java:20)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at com.finderbots.analysis.AnalysisJob.main(AnalysisJob.java:34)
 Disconnected from the target VM, address: '127.0.0.1:63483', transport:
 'socket'






Re: Exception while running a Hadoop example on a standalone install on Windows 7

2012-09-04 Thread Hemanth Yamijala
Though I agree with others that it would probably be easier to get Hadoop
up and running on Unix-based systems, I couldn't help noticing that this path:

 \tmp \hadoop-upendyal\mapred\staging\upendyal-1075683580\.staging

seems to have a space in the first component, i.e. '\tmp ' and not '\tmp'. Is
that a copy-paste issue, or is it really the case? Again, I'm not sure if it
could cause the specific error you're seeing, but you could try removing the
space if it does exist. I'm also assuming that you've set up Cygwin etc. if
you still want to try this out on Windows.

Thanks
hemanth

On Wed, Sep 5, 2012 at 12:12 AM, Marcos Ortiz mlor...@uci.cu wrote:


 On 09/04/2012 02:35 PM, Udayini Pendyala wrote:

   Hi Bejoy,

 Thanks for your response. I first started to install on Ubuntu Linux and
 ran into a bunch of problems. So, I wanted to back off a bit and try
 something simple first. Hence, my attempt to install on my Windows 7 Laptop.

 Well, if you tell us the problems that you ran into on Ubuntu, we can give
 you a hand.
 Michael Noll has great tutorials for this:

 Running Hadoop on Ubuntu Linux (Single node cluster)

 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

 Running Hadoop on Ubuntu Linux (Multi node cluster)

 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/


 I am doing the standalone mode - as per the documentation (link in my
 original email), I don't need ssh unless I am doing the distributed mode.
 Is that not correct?

 Yes, but I'll give you the same recommendation that Bejoy gave you: use a
 Unix-based platform for Hadoop; it's better tested and performs better
 there than on Windows.

 Best wishes


 Thanks again for responding
 Udayini


 --- On Tue, 9/4/12, Bejoy Ks bejoy.had...@gmail.com wrote:


 From: Bejoy Ks bejoy.had...@gmail.com
 Subject: Re: Exception while running a Hadoop example on a standalone
 install on Windows 7
 To: user@hadoop.apache.org
 Date: Tuesday, September 4, 2012, 11:11 AM

 Hi Udayani

 By default, Hadoop works well on Linux and Linux-based OSes. Since you are
 on Windows, you need to install and configure ssh using Cygwin before you
 start the Hadoop daemons.

  On Tue, Sep 4, 2012 at 6:16 PM, Udayini Pendyala 
 udayini_pendy...@yahoo.com
  wrote:

   Hi,


  Following is a description of what I am trying to do and the steps I
 followed.


  GOAL:

 a). Install Hadoop 1.0.3

 b). Hadoop in a standalone (or local) mode

 c). OS: Windows 7


  STEPS FOLLOWED:

 1. I followed instructions from:
 http://www.oreillynet.com/pub/a/other-programming/excerpts/hadoop-tdg/installing-apache-hadoop.html.
 Listing the steps I did -

 a.   I went to: http://hadoop.apache.org/core/releases.html.

 b.  I installed hadoop-1.0.3 by downloading “hadoop-1.0.3.tar.gz” and
 unzipping/untarring the file.

 c.   I installed JDK 1.6 and set up JAVA_HOME to point to it.

 d.  I set up HADOOP_INSTALL to point to my Hadoop install location. I
 updated my PATH variable to have $HADOOP_INSTALL/bin

 e.  After the above steps, I ran the command: “hadoop version” and
 got the following information:

 $ hadoop version

 Hadoop 1.0.3

 Subversion
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
 1335192

 Compiled by hortonfo on Tue May 8 20:31:25 UTC 2012

 From source with checksum e6b0c1e23dcf76907c5fecb4b832f3be



 2. The standalone was very easy to install as described above.
 Then, I tried to run a sample command as given in:

 http://hadoop.apache.org/common/docs/r0.17.2/quickstart.html#Local

 Specifically, the steps followed were:

 a.   cd $HADOOP_INSTALL

 b.  mkdir input

 c.   cp conf/*.xml input

 d.  bin/hadoop jar hadoop-examples-1.0.3.jar grep input output
 ‘dfs[a-z.]+’

 and got the following error:



 $ bin/hadoop jar hadoop-examples-1.0.3.jar grep input output 'dfs[a-z.]+'

 12/09/03 15:41:57 WARN util.NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable

 12/09/03 15:41:57 ERROR security.UserGroupInformation:
 PriviledgedActionException as:upendyal cause:java.io.IOException: Failed
 to set permissions of path: \tmp
 \hadoop-upendyal\mapred\staging\upendyal-1075683580\.staging to 0700

 java.io.IOException: Failed to set permissions of path:
 \tmp\hadoop-upendyal\mapred\staging\upendyal-1075683580\.staging to 0700

 at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
 at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
 at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
 at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
 at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
 at 

Re: Integrating hadoop with java UI application deployed on tomcat

2012-09-03 Thread Hemanth Yamijala
Hi,

If you are getting the LocalFileSystem, you could try putting
core-site.xml in a directory that is on the classpath of the
Tomcat app (or include such a path in the classpath, if that's
possible).
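
To illustrate (a sketch only; the config path and namenode URI below are
placeholders, not values from your setup), you can either point the
Configuration at the real core-site.xml or set the default filesystem
explicitly, and then check which FileSystem implementation you actually got:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  Configuration conf = new Configuration();
  conf.addResource(new Path("/path/to/hadoop/conf/core-site.xml"));
  // or equivalently: conf.set("fs.default.name", "hdfs://namenode-host:9000");
  FileSystem fs = FileSystem.get(conf);
  // Expect DistributedFileSystem here; LocalFileSystem means the
  // core-site.xml was not picked up from the webapp's classpath.
  System.out.println(fs.getClass().getName());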

Thanks
hemanth

On Mon, Sep 3, 2012 at 4:01 PM, Visioner Sadak visioner.sa...@gmail.com wrote:
 Thanks Steve, there's nothing in the logs and no exceptions either. I found
 that a file is created in my F:\user with the directory name, but it's not
 visible when I browse the filesystem directories in Hadoop. I also added the
 config using the method below:

 hadoopConf.addResource("F:/hadoop-0.22.0/conf/core-site.xml");

 When running through the WAR and printing out the filesystem I get
 org.apache.hadoop.fs.LocalFileSystem@9cd8db
 When running an independent jar within Hadoop I get
 DFS[DFSClient[clientName=DFSClient_296231340, ugi=dell]]
 When running an independent jar I am able to do uploads.

 I just wanted to know whether I have to add something to the Tomcat
 classpath, or whether there is some other core-site.xml configuration I am
 missing. Thanks for your help.



 On Sat, Sep 1, 2012 at 1:38 PM, Steve Loughran ste...@hortonworks.com
 wrote:


 well, it's worked for me in the past outside Hadoop itself:


 http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/utils/DfsUtils.java?revision=8882view=markup

 Turn logging up to DEBUG
 Make sure that the filesystem you've just loaded is what you expect, by
 logging its value. It may turn out to be file:///, because the normal Hadoop
 site-config.xml isn't being picked up




 On Fri, Aug 31, 2012 at 1:08 AM, Visioner Sadak
 visioner.sa...@gmail.com wrote:

 but the problem is that my  code gets executed with the warning but file
 is not copied to hdfs , actually i m trying to copy a file from local to
 hdfs

 Configuration hadoopConf = new Configuration();
 // get the default associated file system
 FileSystem fileSystem = FileSystem.get(hadoopConf);
 // HarFileSystem harFileSystem = new HarFileSystem(fileSystem);
 // copy from the local filesystem to HDFS
 fileSystem.copyFromLocalFile(new Path("E:/test/GANI.jpg"),
 new Path("/user/TestDir/"));






Yarn defaults for local directories

2012-09-03 Thread Hemanth Yamijala
Hi,

Is there a reason why Yarn's directory paths do not default to being
relative to hadoop.tmp.dir?

For e.g. yarn.nodemanager.local-dirs defaults to /tmp/nm-local-dir.
Could it be ${hadoop.tmp.dir}/nm-local-dir instead ? Similarly for the
log directories, I guess...

Thanks
hemanth


Re: knowing the nodes on which reduce tasks will run

2012-09-03 Thread Hemanth Yamijala
Hi,

You are right that a change to mapred.tasktracker.reduce.tasks.maximum will
require a restart of the tasktrackers. AFAIK, there is no way of modifying
this property without restarting.

On a different note, could you see if the amount of intermediate data can
be reduced using a combiner, or some other form of local aggregation?
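
For example, if the reduce function is a commutative and associative
aggregation (a sum, a count, a max), the reducer class can often double as
the combiner. A minimal sketch with illustrative class names:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;

  Configuration conf = new Configuration();
  Job job = new Job(conf, "aggregation-job");
  job.setMapperClass(MyMapper.class);
  // The combiner runs on the map side and shrinks the intermediate data
  // before it is written to local disk and shuffled to the reducers.
  job.setCombinerClass(SumReducer.class);
  job.setReducerClass(SumReducer.class);
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(IntWritable.class);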

Thanks
hemanth

On Mon, Sep 3, 2012 at 9:06 PM, Abhay Ratnaparkhi 
abhay.ratnapar...@gmail.com wrote:

 How can I set  'mapred.tasktracker.reduce.tasks.maximum'  to 0 in a
 running tasktracker?
 It seems that I need to restart the tasktracker, and in that case I'll lose
 the output of the map tasks run by that particular tasktracker.

 Can I change   'mapred.tasktracker.reduce.tasks.maximum'  to 0  without
 restarting tasktracker?

 ~Abhay


 On Mon, Sep 3, 2012 at 8:53 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

 HI Abhay

 The TaskTrackers on which the reduce tasks are triggered are chosen at
 random based on reduce slot availability. So if you don't want the
 reduce tasks to be scheduled on some particular nodes, you need to set
 'mapred.tasktracker.reduce.tasks.maximum' on those nodes to 0. The
 bottleneck here is that this property is not a job-level one; you need to
 set it at the cluster level.

 A cleaner approach will be to configure each of your nodes with the right
 number of map and reduce slots based on the resources available on each
 machine.


 On Mon, Sep 3, 2012 at 7:49 PM, Abhay Ratnaparkhi 
 abhay.ratnapar...@gmail.com wrote:

 Hello,

 How can one get to know the nodes on which reduce tasks will run?

 One of my jobs is running and it is completing all the map tasks.
 My map tasks write lots of intermediate data. The intermediate directory
 is getting full on all the nodes.
 If the reduce tasks run on any node from the cluster, they'll try to copy
 the data to the same disk and will eventually fail due to disk-space-related
 exceptions.

 I have added a few more tasktracker nodes to the cluster and now want to
 run the reducers on the new nodes only.
 Is it possible to choose a node on which a reducer will run? What's
 the algorithm Hadoop uses to pick a node to run a reducer on?

 Thanks in advance.

 Bye
 Abhay





