Counters across all jobs

2012-08-28 Thread Kasi Subrahmanyam
Hi,

I have around 4 jobs running in a controller.
How can I have a single, unique counter that is present in all the jobs and
incremented wherever it is used in a job?

For example: consider a counter ACount.
If job1 increments the counter by 2, job3 by 5, and job4 by 6,
can I have the counter output displayed in the JobTracker as:
job1: 2
job2: 2
job3: 7
job4: 13

Thanks,
Subbu
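
A minimal sketch of one way to carry a running total across jobs from a single Java controller (the enum, the "acount.carried" property, and the re-seeding idea are assumptions for illustration; Hadoop itself only aggregates counters per job):

import java.util.List;
import org.apache.hadoop.mapreduce.Job;

public class CounterChain {
    // Hypothetical counter enum shared by every job in the controller.
    public enum MyCounters { ACOUNT }

    // Inside a mapper/reducer you would increment it as usual:
    //   context.getCounter(MyCounters.ACOUNT).increment(2);

    // Runs the jobs one after another, carrying the running total forward.
    public static void runAll(List<Job> jobs) throws Exception {
        long runningTotal = 0;
        for (Job job : jobs) {
            // Pass the total so far to the job, in case tasks want to read it.
            job.getConfiguration().setLong("acount.carried", runningTotal);
            job.waitForCompletion(true);
            runningTotal += job.getCounters()
                               .findCounter(MyCounters.ACOUNT).getValue();
            System.out.println(job.getJobName() + ": " + runningTotal);
        }
    }
}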


Hadoop or HBase

2012-08-28 Thread Kushal Agrawal
Hi,

I want to use a DFS for a Content Management System (CMS); in
it I just want to store and retrieve files.

Please suggest what I should use:

Hadoop or HBase?

 

Thanks & Regards,

Kushal Agrawal

kushalagra...@teledna.com

One Earth. Your moment. Go green...


This message is for the designated recipient only and may contain
privileged, proprietary, or otherwise private information. If you have
received it in error, please notify the sender immediately and delete the
original. Any other use of the email by you is prohibited.

 



Re: Hadoop or HBase

2012-08-28 Thread Kai Voigt
Typically, CMSs require an RDBMS, which Hadoop and HBase are not.

Which CMS do you plan to use, and what's wrong with MySQL or other open source 
RDBMSs?

Kai

On 28.08.2012 at 08:21, Kushal Agrawal kushalagra...@teledna.com wrote:

 Hi,
 I wants to use DFS for Content-Management-System (CMS), in 
 that I just wants to store and retrieve files.
 Please suggest me what should I use:
 Hadoop or HBase
  
 Thanks  Regards,
 Kushal Agrawal
 kushalagra...@teledna.com
  
 One Earth. Your moment. Go green...
 This message is for the designated recipient only and may contain privileged, 
 proprietary, or otherwise private information. If you have received it in 
 error, please notify the sender immediately and delete the original. Any 
 other use of the email by you is prohibited.
  

-- 
Kai Voigt
k...@123.org






RE: Hadoop or HBase

2012-08-28 Thread Kushal Agrawal
As the data is quite large (tens of terabytes), it is difficult to take backups,
because each backup takes 1.5 days. If we use a distributed file system instead,
we need not do that.

Thanks & Regards,
Kushal Agrawal
kushalagra...@teledna.com
 
-Original Message-
From: Kai Voigt [mailto:k...@123.org] 
Sent: Tuesday, August 28, 2012 11:57 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop or HBase

Typically, CMSs require a RDBMS. Which Hadoop and HBase are not.

Which CMS do you plan to use, and what's wrong with MySQL or other open
source RDBMSs?

Kai

On 28.08.2012 at 08:21, Kushal Agrawal kushalagra...@teledna.com wrote:

 Hi,
 I wants to use DFS for Content-Management-System (CMS), in
that I just wants to store and retrieve files.
 Please suggest me what should I use:
 Hadoop or HBase
  
 Thanks  Regards,
 Kushal Agrawal
 kushalagra...@teledna.com
  
 One Earth. Your moment. Go green...
 This message is for the designated recipient only and may contain
privileged, proprietary, or otherwise private information. If you have
received it in error, please notify the sender immediately and delete the
original. Any other use of the email by you is prohibited.
  

-- 
Kai Voigt
k...@123.org







Re: Hadoop or HBase

2012-08-28 Thread Kai Voigt
Having a distributed filesystem doesn't save you from having backups. If 
someone deletes a file in HDFS, it's gone.

What backend storage is supported by your CMS?

Kai

On 28.08.2012 at 08:36, Kushal Agrawal kushalagra...@teledna.com wrote:

 As the data is too much in (10's of terabytes) it's difficult to take backup
 because it takes 1.5 days to take backup of data every time. Instead of that
 if we uses distributed file system we need not to do that.
 
 Thanks  Regards,
 Kushal Agrawal
 kushalagra...@teledna.com
  
 -Original Message-
 From: Kai Voigt [mailto:k...@123.org] 
 Sent: Tuesday, August 28, 2012 11:57 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Hadoop or HBase
 
 Typically, CMSs require a RDBMS. Which Hadoop and HBase are not.
 
 Which CMS do you plan to use, and what's wrong with MySQL or other open
 source RDBMSs?
 
 Kai
 
 On 28.08.2012 at 08:21, Kushal Agrawal kushalagra...@teledna.com wrote:
 
 Hi,
I wants to use DFS for Content-Management-System (CMS), in
 that I just wants to store and retrieve files.
 Please suggest me what should I use:
 Hadoop or HBase
 
 Thanks  Regards,
 Kushal Agrawal
 kushalagra...@teledna.com
 
 One Earth. Your moment. Go green...
 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise private information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the email by you is prohibited.
 
 
 -- 
 Kai Voigt
 k...@123.org
 
 
 
 
 
 

-- 
Kai Voigt
k...@123.org






heap size for the job tracker

2012-08-28 Thread Mike S
1. Can I change the heap size for the JobTracker only, if I am using
version 1.0.2?

2. If so, would you please say exactly which line I should put in
hadoop-env.sh, and where? Should I set the value as a plain number or use
the -Xmx notation?

I mean, which one is the correct way:

export HADOOP_HEAPSIZE=2000

or

export HADOOP_HEAPSIZE=-Xmx2000m

3. Do I need to restart the JobTracker node or call start-mapred.sh
to make the heap size change take effect? Is there anything else
I need to do for the change to be applied?


Re: heap size for the job tracker

2012-08-28 Thread Harsh J
Hi Mike,

On Wed, Aug 29, 2012 at 7:40 AM, Mike S mikesam...@gmail.com wrote:
 1 Can I change the heap size for the job tracker only if I am using
 version 1.0.2?

Yes.

 2  If so, would you please say what exact line I should put in the
 hadoop-evv.sh and where ? Should I set the value with a number or use
 the Xmx notion?

 I mean which one is the correct way

 export HADOOP_HEAPSIZE=2000

 or

 export HADOOP_HEAPSIZE=-Xmx2000m

The above (the first one is the right syntax) changes the heap across _all_
daemons, not just the JT specifically. So you don't want to do that.

You may instead find and change the below line in hadoop-env.sh to the
following:

export HADOOP_JOBTRACKER_OPTS="$HADOOP_JOBTRACKER_OPTS -Xmx2g"

 3 Do I need to restart the job tracker node or call start-mapred.sh
 to make the heap size change to take in effect? Is there anything else
 I need to do to make the change to be applied?

You will need to restart the JobTracker JVM for the new heap limit to
get used. You can run "hadoop-daemon.sh stop jobtracker" followed by
"hadoop-daemon.sh start jobtracker" to restart just the JobTracker
daemon (run the commands on the JT node).
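
A short sketch of the whole sequence, assuming a tarball-style layout where hadoop-env.sh lives under conf/ and hadoop-daemon.sh is on the PATH (adjust paths to your install):

# conf/hadoop-env.sh: raise only the JobTracker's heap (example: 2 GB)
export HADOOP_JOBTRACKER_OPTS="$HADOOP_JOBTRACKER_OPTS -Xmx2g"

# On the JobTracker node, restart just that daemon so the new limit is used
hadoop-daemon.sh stop jobtracker
hadoop-daemon.sh start jobtracker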

-- 
Harsh J


Suggestions/Info required regarding Hadoop Benchmarking

2012-08-28 Thread Gaurav Dasgupta
Hi Users,

I have a 12 node CDH3 cluster where I am planning to run some benchmark
tests. My main intention is to run the benchmarks first with the default
Hadoop configuration, then analyze the outcomes and tune the Hadoop
configuration accordingly to increase the performance of my cluster.

Can someone suggest which Hadoop metrics are the important ones to
observe during benchmarking?
Also, I have seen somewhere that the ratio of average map task to average
reduce task execution time is recorded for various benchmarks. How
significant is that information for judging cluster performance?
How will the ratios help me analyze and tune the Hadoop cluster
for better performance?

So far I have run the following benchmarks without tuning the cluster
(i.e., with the default Hadoop configuration):

   - Sort
   - WordCount
   - TeraSort
   - TestDFSIO

Please suggest which other benchmarks I should run, especially from
hadoop-test.jar in the $HADOOP_HOME directory, and what those jobs
are used for.
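
A hedged sketch of typical invocations of benchmarks shipped in hadoop-test.jar (jar names and flags differ slightly between Hadoop/CDH releases, so treat these as illustrative rather than exact):

# HDFS I/O throughput (TestDFSIO): write, then read, then clean up
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar TestDFSIO -clean

# Small-job latency (MRBench): run a tiny job several times and average
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 5

# NameNode load (NNBench): stress the NameNode with many small file operations
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar nnbench -operation create_write -maps 12 -numberOfFiles 1000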

Thanks,
Gaurav Dasgupta


example usage of s3 file system

2012-08-28 Thread Chris Collins



Hi, I am trying to use the Hadoop filesystem abstraction with S3, but in my 
tinkering I am not having a great deal of success.  I am particularly 
interested in the ability to mimic a directory structure (since S3 native 
doesn't do it).

Can anyone point me to some good example usage of the Hadoop FileSystem with S3?

I created a few directories using transit and the AWS S3 console for testing.  Doing a 
listStatus of the bucket returns a FileStatus object for the directory created, 
but if I try to do a listStatus of that path I get a 404:

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: 
Request Error. HEAD '/' on Host 

This is probably not the best list to ask, but any clues are appreciated.

C
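
A minimal sketch of driving the FileSystem API against the S3 native filesystem (the bucket name, keys, and paths are placeholders; fs.s3n.* are the property names used by the s3n:// scheme in Hadoop 1.x, where directories are mimicked via key prefixes):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3NativeListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");      // placeholder
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");  // placeholder

        FileSystem fs = FileSystem.get(URI.create("s3n://my-bucket/"), conf);

        // "Directories" on s3n are just key prefixes; mkdirs creates a marker.
        fs.mkdirs(new Path("s3n://my-bucket/dir/subdir"));

        // List the directory we just created rather than only the bucket root.
        for (FileStatus status : fs.listStatus(new Path("s3n://my-bucket/dir"))) {
            System.out.println(status.getPath() + (status.isDir() ? " (dir)" : ""));
        }
    }
}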



Re: error in shuffle in InMemoryMerger

2012-08-28 Thread Joshi, Rekha
Hi Abhay,

Ideally the error line - Caused by: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid 
local directory for output/map_128.out - suggests that you either do not have 
permissions on the output folder or the disk is full.

Also, 5 is not a big number for thread spawning (in fact, it is the default for 
parallel copies), so I would not recommend reducing it, though a lower value might 
work. The long-term indication is that your system needs node maintenance.

Thanks
Rekha

From: Abhay Ratnaparkhi abhay.ratnapar...@gmail.com
Reply-To: user@hadoop.apache.org
Date: Tue, 28 Aug 2012 14:52:27 +0530
To: user@hadoop.apache.org
Subject: error in shuffle in InMemoryMerger

Hello,

I am getting following error when reduce task is running.
mapreduce.reduce.shuffle.parallelcopies  property is set to 5.

org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in InMemoryMerger - Thread to merge in-memory shuffled map-outputs
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(AccessController.java:284)
    at javax.security.auth.Subject.doAs(Subject.java:573)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:773)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_128.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
    at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputF

org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in InMemoryMerger - Thread to merge in-memory shuffled map-outputs
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(AccessController.java:284)
    at javax.security.auth.Subject.doAs(Subject.java:573)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:773)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_119.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
    at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputF

Regards,
Abhay



hadoop 1.0.3 equivalent of MultipleTextOutputFormat

2012-08-28 Thread Tony Burton
Hi,

I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat is good 
for writing results into (for example) different directories created on the 
fly. However, now that I'm implementing a MapReduce job using Hadoop 1.0.3, I see 
that the new API no longer supports MultipleTextOutputFormat. Is there an 
equivalent that I can use, or will it be supported in a future release?

Thanks,

Tony


**
This email and any attachments are confidential, protected by copyright and may 
be legally privileged.  If you are not the intended recipient, then the 
dissemination or copying of this email is prohibited. If you have received this 
in error, please notify the sender by replying by email and then delete the 
email completely from your system.  Neither Sporting Index nor the sender 
accepts responsibility for any virus, or any other defect which might affect 
any computer or IT system into which the email is received and/or opened.  It 
is the responsibility of the recipient to scan the email and no responsibility 
is accepted for any loss or damage arising in any way from receipt or use of 
this email.  Sporting Index Ltd is a company registered in England and Wales 
with company number 2636842, whose registered office is at Gateway House, 
Milverton Street, London, SE11 4AP.  Sporting Index Ltd is authorised and 
regulated by the UK Financial Services Authority (reg. no. 150404) and Gambling 
Commission (reg. no. 000-027343-R-308898-001).  Any financial promotion 
contained herein has been issued 
and approved by Sporting Index Ltd.

Outbound email has been scanned for viruses and SPAM



Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

2012-08-28 Thread Harsh J
The Multiple*OutputFormat classes have been deprecated in favor of the generic
MultipleOutputs API. Would using that instead work for you?
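
A minimal sketch of the new-API MultipleOutputs usage (class names and output paths are placeholders; a base output path containing a '/' writes into sub-directories of the job's output directory, which roughly covers the MultipleTextOutputFormat use case):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CategoryReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Writes under <job output dir>/<key>/part-r-xxxxx
            mos.write(key, value, key.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}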

On Tue, Aug 28, 2012 at 6:05 PM, Tony Burton tbur...@sportingindex.com wrote:
 Hi,

 I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat is good 
 for writing results into (for example) different directories created on the 
 fly. However, now I'm implementing a MapReduce job using Hadoop 1.0.3, I see 
 that the new API no longer supports MultipleTextOutputFormat. Is there an 
 equivalent that I can use, or will it be supported in a future release?

 Thanks,

 Tony


 **
 This email and any attachments are confidential, protected by copyright and 
 may be legally privileged.  If you are not the intended recipient, then the 
 dissemination or copying of this email is prohibited. If you have received 
 this in error, please notify the sender by replying by email and then delete 
 the email completely from your system.  Neither Sporting Index nor the sender 
 accepts responsibility for any virus, or any other defect which might affect 
 any computer or IT system into which the email is received and/or opened.  It 
 is the responsibility of the recipient to scan the email and no 
 responsibility is accepted for any loss or damage arising in any way from 
 receipt or use of this email.  Sporting Index Ltd is a company registered in 
 England and Wales with company number 2636842, whose registered office is at 
 Gateway House, Milverton Street, London, SE11 4AP.  Sporting Index Ltd is 
 authorised and regulated by the UK Financial Services Authority (reg. no. 
 150404) and Gambling Commission (reg. no. 000-027343-R-308898-001).  Any 
 financial promotion contained herein has been issued
 and approved by Sporting Index Ltd.

 Outbound email has been scanned for viruses and SPAM




-- 
Harsh J


Re: error in shuffle in InMemoryMerger

2012-08-28 Thread Abhay Ratnaparkhi
I checked the mapred.tmp.local directory on the node which is running the
reducer attempt, and it seems there is about 1 GB of space available (though
that is on the low side).

On Tue, Aug 28, 2012 at 3:55 PM, Joshi, Rekha rekha_jo...@intuit.comwrote:

  Hi Abhay,

  Ideally the error line - Caused by:
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
 valid local directory for output/map_128.out suggests you either do not
 have permissions for output folder or disk is full.

  Also 5 is not a big number on thread spawning, (infact, default on
 parallelcopies) to recommend reducing it, but a lower value might work.only
 long-term indications are for your system to under-go node maintenance.

  Thanks
 Rekha

   From: Abhay Ratnaparkhi abhay.ratnapar...@gmail.com
 Reply-To: user@hadoop.apache.org
 Date: Tue, 28 Aug 2012 14:52:27 +0530
 To: user@hadoop.apache.org
 Subject: error in shuffle in InMemoryMerger

  Hello,

 I am getting following error when reduce task is running.
 mapreduce.reduce.shuffle.parallelcopies  property is set to 5.

 org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in InMemoryMerger - Thread to merge in-memory shuffled map-outputs
     at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
     at java.security.AccessController.doPrivileged(AccessController.java:284)
     at javax.security.auth.Subject.doAs(Subject.java:573)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:773)
     at org.apache.hadoop.mapred.Child.main(Child.java:211)
 Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_128.out
     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351)
     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
     at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputF

 org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in InMemoryMerger - Thread to merge in-memory shuffled map-outputs
     at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
     at java.security.AccessController.doPrivileged(AccessController.java:284)
     at javax.security.auth.Subject.doAs(Subject.java:573)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:773)
     at org.apache.hadoop.mapred.Child.main(Child.java:211)
 Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_119.out
     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351)
     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
     at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputF

 Regards,
 Abhay




one reducer is hanged in reduce- copy phase

2012-08-28 Thread Abhay Ratnaparkhi
Hello,

I have a MR job which has 4 reducers running.
One of the reduce attempts has been pending for a long time in the reduce-copy phase.

The job is not able to complete because of this.
I have seen that the child Java process on the tasktracker is running.

Is it possible to run the same attempt again? Does killing the child Java
process or the tasktracker on the node help? (since Hadoop may schedule a
reduce attempt on another node).

Can I copy the map intermediate output required for this single reducer
(which is hung) and rerun only the hung reducer?

Thank you in advance.
~Abhay


task_201208250623_0005_r_00 (http://dpep089.innovate.ibm.com:50030/taskdetails.jsp?tipid=task_201208250623_0005_r_00)
26.41%
reduce > copy (103 of 130 at 0.08 MB/s)
28-Aug-2012 03:09:34


Re: best way to join?

2012-08-28 Thread Ted Dunning
On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan dextermorga...@gmail.comwrote:


 I understand your solution ( i think) , didn't think of that, in that
 particular way.
 I think that lets say i have 1M data-points, and running knn , that the
 k=1M and n=10 (each point is a cluster that requires up to 10 points)
 is an overkill.


I am not sure I understand you.  n = number of points.  k = number of
clusters.  For searching 1 million points, I would recommend thousands of
clusters.


 How can i achieve the same result WITHOUT using mahout, just running on
 the dataset , i even think it'll be in the same complexity (o(n^2))


Running with a good knn package will give you roughly O(n log n)
complexity.


Re: Hadoop or HBase

2012-08-28 Thread Marcos Ortiz

Regards to all the list.
Well, you should ask the Tumblr folks: they use a combination 
of MySQL and HBase for their blogging platform. They talked about this 
topic at the last HBaseCon. Here is the link:

http://www.hbasecon.com/sessions/growing-your-inbox-hbase-at-tumblr/

Blake Matheny, Director of Platform Engineering at Tumblr, was the 
presenter of this topic.

Best wishes

On 28/08/2012 6:18, Kai Voigt wrote:

Having a distributed filesystem doesn't save you from having backups. If 
someone deletes a file in HDFS, it's gone.

What backend storage is supported by your CMS?

Kai

On 28.08.2012 at 08:36, Kushal Agrawal kushalagra...@teledna.com wrote:


As the data is too much in (10's of terabytes) it's difficult to take backup
because it takes 1.5 days to take backup of data every time. Instead of that
if we uses distributed file system we need not to do that.

Thanks  Regards,
Kushal Agrawal
kushalagra...@teledna.com
  
-Original Message-

From: Kai Voigt [mailto:k...@123.org]
Sent: Tuesday, August 28, 2012 11:57 AM
To: common-u...@hadoop.apache.org
Subject: Re: Hadoop or HBase

Typically, CMSs require a RDBMS. Which Hadoop and HBase are not.

Which CMS do you plan to use, and what's wrong with MySQL or other open
source RDBMSs?

Kai

On 28.08.2012 at 08:21, Kushal Agrawal kushalagra...@teledna.com wrote:


Hi,
I wants to use DFS for Content-Management-System (CMS), in

that I just wants to store and retrieve files.

Please suggest me what should I use:
Hadoop or HBase

Thanks  Regards,
Kushal Agrawal
kushalagra...@teledna.com

One Earth. Your moment. Go green...
This message is for the designated recipient only and may contain

privileged, proprietary, or otherwise private information. If you have
received it in error, please notify the sender immediately and delete the
original. Any other use of the email by you is prohibited.
--
Kai Voigt
k...@123.org












Re: distcp error.

2012-08-28 Thread Marcos Ortiz
Hi, Tao. Does this problem occur only with 2.0.1, or with both versions?
Have you tried using distcp from 1.0.3 to 1.0.3?

On 28/08/2012 11:36, Tao wrote:

 Hi, all

 I am using distcp to copy data from Hadoop 1.0.3 to Hadoop 2.0.1.

 When the file path (or file name) contains Chinese characters, an
 exception is thrown, like the one below. I need some help with this.

 Thanks.

 [hdfs@host ~]$ hadoop distcp -i -prbugp -m 14 -overwrite -log
 /tmp/distcp.log hftp://10.xx.xx.aa:50070/tmp/中文路径测试
 hdfs://10.xx.xx.bb:54310/tmp/distcp_test14

 12/08/28 23:32:31 INFO tools.DistCp: Input Options:
 DistCpOptions{atomicCommit=false, syncFolder=false,
 deleteMissing=false, ignoreFailures=true, maxMaps=14,
 sslConfigurationFile='null', copyStrategy='uniformsize',
 sourceFileListing=null, sourcePaths=[hftp://10.xx.xx.aa:50070/tmp/中文
 路径测试], targetPath=hdfs://10.xx.xx.bb:54310/tmp/distcp_test14}

 12/08/28 23:32:33 INFO tools.DistCp: DistCp job log path: /tmp/distcp.log

 12/08/28 23:32:34 WARN conf.Configuration: io.sort.mb is deprecated.
 Instead, use mapreduce.task.io.sort.mb

 12/08/28 23:32:34 WARN conf.Configuration: io.sort.factor is
 deprecated. Instead, use mapreduce.task.io.sort.factor

 12/08/28 23:32:34 WARN util.NativeCodeLoader: Unable to load
 native-hadoop library for your platform... using builtin-java classes
 where applicable

 12/08/28 23:32:36 INFO mapreduce.JobSubmitter: number of splits:1

 12/08/28 23:32:36 WARN conf.Configuration: mapred.jar is deprecated.
 Instead, use mapreduce.job.jar

 12/08/28 23:32:36 WARN conf.Configuration:
 mapred.map.tasks.speculative.execution is deprecated. Instead, use
 mapreduce.map.speculative

 12/08/28 23:32:36 WARN conf.Configuration: mapred.reduce.tasks is
 deprecated. Instead, use mapreduce.job.reduces

 12/08/28 23:32:36 WARN conf.Configuration:
 mapred.mapoutput.value.class is deprecated. Instead, use
 mapreduce.map.output.value.class

 12/08/28 23:32:36 WARN conf.Configuration: mapreduce.map.class is
 deprecated. Instead, use mapreduce.job.map.class

 12/08/28 23:32:36 WARN conf.Configuration: mapred.job.name is
 deprecated. Instead, use mapreduce.job.name

 12/08/28 23:32:36 WARN conf.Configuration: mapreduce.inputformat.class
 is deprecated. Instead, use mapreduce.job.inputformat.class

 12/08/28 23:32:36 WARN conf.Configuration: mapred.output.dir is
 deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir

 12/08/28 23:32:36 WARN conf.Configuration:
 mapreduce.outputformat.class is deprecated. Instead, use
 mapreduce.job.outputformat.class

 12/08/28 23:32:36 WARN conf.Configuration: mapred.map.tasks is
 deprecated. Instead, use mapreduce.job.maps

 12/08/28 23:32:36 WARN conf.Configuration: mapred.mapoutput.key.class
 is deprecated. Instead, use mapreduce.map.output.key.class

 12/08/28 23:32:36 WARN conf.Configuration: mapred.working.dir is
 deprecated. Instead, use mapreduce.job.working.dir

 12/08/28 23:32:37 INFO mapred.ResourceMgrDelegate: Submitted
 application application_1345831938927_0039 to ResourceManager at
 baby20/10.1.1.40:8040

 12/08/28 23:32:37 INFO mapreduce.Job: The url to track the job:
 http://baby20:8088/proxy/application_1345831938927_0039/

 12/08/28 23:32:37 INFO tools.DistCp: DistCp job-id: job_1345831938927_0039

 12/08/28 23:32:37 INFO mapreduce.Job: Running job: job_1345831938927_0039

 12/08/28 23:32:50 INFO mapreduce.Job: Job job_1345831938927_0039
 running in uber mode : false

 12/08/28 23:32:50 INFO mapreduce.Job: map 0% reduce 0%

 12/08/28 23:33:00 INFO mapreduce.Job: map 100% reduce 0%

 12/08/28 23:33:00 INFO mapreduce.Job: Task Id :
 attempt_1345831938927_0039_m_00_0, Status : FAILED

 Error: java.io.IOException: File copy failed: hftp://10.1.1.26:50070
 /tmp/中文路径测试/part-r-00017 --
 hdfs://10.1.1.40:54310/tmp/distcp_test14/part-r-00017

 at
 org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:262)

 at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:229)

 at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:45)

 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)

 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:725)

 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)

 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)

 at java.security.AccessController.doPrivileged(Native Method)

 at javax.security.auth.Subject.doAs(Subject.java:396)

 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)

 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)

 Caused by: java.io.IOException: Couldn't run retriable-command:
 Copying hftp://10.1.1.26:50070/tmp/中文路径测试/part-r-00017 to
 hdfs://10.1.1.40:54310/tmp/distcp_test14/part-r-00017

 at
 org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)

 at
 org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:258)

 ... 10 more

 Caused by:
 

Re: How to reduce total shuffle time

2012-08-28 Thread Minh Duc Nguyen
Without knowing your exact workload, using a Combiner (if possible) as
Tsuyoshi recommended should decrease your total shuffle time.  You can also
try compressing the map output so that there's less disk and network IO.
 Here's an example configuration using Snappy:

conf.set("mapred.compress.map.output", "true");
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");

HTH,
Minh
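
A driver-side sketch pulling the two suggestions together, using the library mapper/reducer classes shipped with Hadoop so the combiner is a simple commutative/associative sum (treat this as an illustration; input/output paths are omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ShuffleTuningDriver {
    public static Job configure(Configuration conf) throws Exception {
        // Compress intermediate map output to cut shuffle disk and network IO.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = new Job(conf, "word count with combiner");
        job.setJarByClass(ShuffleTuningDriver.class);
        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // mapper-side aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job;
    }
}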

On Tue, Aug 28, 2012 at 4:37 AM, Tsuyoshi OZAWA 
ozawa.tsuyo...@lab.ntt.co.jp wrote:

 It depends on the workload. Could you tell us more specifics about
 your job? In the general case where reducers are the bottleneck, there are
 some tuning techniques, as follows:
 1. Allocate more memory to reducers. It decreases the disk IO of reducers
 when merging and running reduce functions.
 2. Use a combine function, which enables mapper-side aggregation,
 if your MR job consists of operations that satisfy
 both the commutative and the associative laws.

 See also about combine functions:
 http://wiki.apache.org/hadoop/HadoopMapReduce

 Tsuyoshi

 On Tuesday, August 28, 2012, Gaurav Dasgupta wrote:
 
  Hi,
 
  I have run some large and small jobs and calculated the Total Shuffle
 Time for the jobs. I can see that the Total Shuffle Time is almost half the
 Total Time which was taken by the full job to complete.
 
  My question, here, is that how can we decrease the Total Shuffle Time?
 And doing so, what will be its effect on the Job?
 
  Thanks,
  Gaurav Dasgupta



datanode has no storageID

2012-08-28 Thread boazya

Hi,
I hope it's not a newbie question...
I installed several versions of Hadoop for testing
(0.20.203, 0.21.0, and 1.0.3)
on various machines.
Now I am using 1.0.3 on all the machines, and
I face a problem that on some of the machines the datanode gets no
storageID from the namenode.
Where it works, the datanode has the following lines in the log file
(and current/VERSION has a storageID=<some ID> line):
---
2012-08-28 19:04:31,415 INFO  
org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration =  
DatanodeRegistration(datanode-works.cs.tau.ac.il:50010,  
storageID=DS-996163017-machines-ip-50010-1342683478942,  
infoPort=50075, ipcPort=50020)
2012-08-28 19:04:31,418 INFO  
org.apache.hadoop.hdfs.server.datanode.DataNode: Starting asynchronous  
block report scan
2012-08-28 19:04:31,418 INFO  
org.apache.hadoop.hdfs.server.datanode.DataNode:  
DatanodeRegistration(machines-ip:50010,  
storageID=DS-996163017-machines-ip-50010-1342683478942,  
infoPort=50075, ipcPort=50020)In DataNode.run, data =  
FSDataset{dirpath='/var/cache/hdfs/hadoop-data-node/current'}
2012-08-28 19:04:31,419 INFO org.apache.hadoop.ipc.Server: IPC Server  
Responder: starting

---

Where it doesn't work, I have only the first line and it hangs
(and current/VERSION has an empty 'storageID=' line):
++
2012-08-28 18:42:01,297 INFO  
org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration =  
DatanodeRegistration(machinename.cs.tau.ac.il:50010, storageID=,  
infoPort=50075, ipcPort=50020)
2012-08-28 18:42:01,287 INFO org.apache.hadoop.ipc.Server: Starting  
SocketReader

++
1. Any ideas?
2. How/where does the namenode store the datanodes' storageIDs?
3. How can I get a new storageID for a datanode, or recover its old one?
4. Can I format/reset the namenode to enable the datanode to reconnect?

thanks!
-
Boaz Yarom
CS System Team
03-640-8961 / 7637




Hadoop and MainFrame integration

2012-08-28 Thread Siddharth Tiwari

Hi Users.

We have flat files on mainframes with around a billion records. We need to sort 
them and then use them with different jobs on the mainframe for report generation. 
I was wondering whether there is any way I could integrate the mainframe with Hadoop, 
do the sorting, and keep the file on the server itself (I do not want to FTP the 
file to a Hadoop cluster and then FTP the sorted file back to the mainframe, as that 
would waste MIPS and nullify the advantage). This way I could save on MIPS and 
ultimately improve profitability. 

Thank you in advance

**

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
Every duty is holy, and devotion to duty is the highest form of worship of 
God.” 

Maybe other people will try to limit me but I don't limit myself
  

Re: Hadoop and MainFrame integration

2012-08-28 Thread modemide
At some point in the work flow you're going to have to transfer the file
from the mainframe to the Hadoop cluster for processing, and then send it
back for storage on the mainframe.

You should be able to automate the process of sending the files back and
forth.

It's been my experience that it's often faster to process and sort large
files on a Hadoop cluster even while factoring in the cost to transfer
to/from the mainframe.

Hopefully that answers your question.  If not, are you looking to actually
use Hadoop to process files in place on the mainframe?  That concept
conflicts with my understanding of Hadoop.

On Tue, Aug 28, 2012 at 12:24 PM, Siddharth Tiwari 
siddharth.tiw...@live.com wrote:

  Hi Users.

 We have flat files on mainframes with around a billion records. We need to
 sort them and then use them with different jobs on mainframe for report
 generation. I was wondering was there any way I could integrate the
 mainframe with hadoop do the sorting and keep the file on the sever itself
 ( I do not want to ftp the file to a hadoop cluster and then ftp back the
 sorted file to Mainframe as it would waste MIPS and nullify the advantage
 ). This way I could save on MIPS and ultimately improve profitability.

 Thank you in advance


 ****
 *Cheers !!!*
 *Siddharth Tiwari*
 Have a refreshing day !!!
 *Every duty is holy, and devotion to duty is the highest form of worship
 of God.” *
 *Maybe other people will try to limit me but I don't limit myself*



Re: one reducer is hanged in reduce- copy phase

2012-08-28 Thread Bejoy KS
Hi Abhay

The map outputs are deleted only after the reducer runs to completion. 

Is it possible to run the same attempt again? Does killing the child java 
process or tasktracker on the node help? (since hadoop may schedule a reduce 
attempt on another node).

Yes, it is possible to re-attempt the task; for that you need to fail the 
current attempt. 

Can I copy the map intermediate output required for this single reducer (which 
is hanged) and rerun only the hang reducer?

It is not that easy to accomplish. Better to fail the task explicitly so that 
it is re-attempted.
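
For reference, a sketch of failing the stuck attempt from the command line (the attempt id is a placeholder; take the real one from the JobTracker UI; -fail-task counts against the job's allowed failures, while -kill-task does not):

# Fail the specific reduce attempt so the framework schedules a new one,
# possibly on another node:
hadoop job -fail-task attempt_201208250623_0005_r_000000_0

# Or kill it without counting it as a failed attempt:
hadoop job -kill-task attempt_201208250623_0005_r_000000_0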

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-Original Message-
From: Abhay Ratnaparkhi abhay.ratnapar...@gmail.com
Date: Tue, 28 Aug 2012 19:40:58 
To: user@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Subject: one reducer is hanged in reduce- copy phase

Hello,

I have a MR job which has 4 reducers running.
One of the reduce attempt is pending since long time in reduce-copy phase.

The job is not able to complete because of this.
I have seen that the child java process on tasktracker is running.

Is it possible to run the same attempt again? Does killing the child java
process or tasktracker on the node help? (since hadoop may schedule a
reduce attempt on another node).

Can I copy the map intermediate output required for this single reducer
(which is hanged) and rerun only the hang reducer?

Thank you in advance.
~Abhay


task_201208250623_0005_r_00 (http://dpep089.innovate.ibm.com:50030/taskdetails.jsp?tipid=task_201208250623_0005_r_00)
26.41%
reduce > copy (103 of 130 at 0.08 MB/s)
28-Aug-2012 03:09:34




Hadoop Streaming question

2012-08-28 Thread Periya.Data
Hi all,
   I am using Python on CDH3u3 for streaming. I do not know how to provide
command-line arguments. My Python mapper takes 3 arguments - 2 input
files and one placeholder for an output file. I am doing something like
this, but it fails. Where am I going wrong? What other options do I have? Any
best practices? I am using -cmdenv, but I do not know exactly how to use it.
I have seen this question on the net, but I have not found a working
answer.



HDFS_INPUT_1=/user/kk/book/eccfile.txt
HDFS_INPUT_2=/user/kk/book/calist.txt
LOCAL_INPUT_1=$KK_HOME/eccfile.txt
LOCAL_INPUT_2=$KK_HOME/calist.txt

HDFS_OUTPUT=/user/kk/book/eccoutput
LOCAL_OUTPUT=$KK_HOME/

hadoop jar
/usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
-D mapred.job.name=CM \
-D mapred.reduce.tasks=0 \
-files $LOCAL_INPUT_1, $LOCAL_INPUT_2 \
-input  $HDFS_INPUT_1 \
-output $HDFS_OUTPUT \
-file   $KK_HOME/ec_ca.py \
-cmdenv arg1=$LOCAL_INPUT_1 \
-cmdenv arg2=$LOCAL_INPUT_2 \
-cmdenv arg3=$LOCAL_OUTPUT \
-mapper $KK_HOME/ec_ca.py $arg1 $arg2 $arg3

==

Some more related questions:

   1. What is the option for sending a file to all the nodes (say, arg2)?
   This file is a reference input file that is needed for processing. Should
   I use the -files option, like DistributedCache?
   2. I really do not know what happens if I specify an output file (in a
   local dir). I understand that specifying an HDFS location for output will
   nicely place the output in that dir. My Python script writes its output
   into a local directory - which I tested and which worked fine locally. But what
   really happens when I try to run on Hadoop? This is my $arg3.


Thanks and appreciate your help,
PD.
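
A hedged sketch of one way this is commonly wired up (it mirrors the snippet above and assumes the mapper reads eccfile.txt records from stdin, writes results to stdout so they land in the -output HDFS directory, and opens the shipped reference file by its basename because -files places it in each task's working directory):

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
    -D mapred.job.name=CM \
    -D mapred.reduce.tasks=0 \
    -files "$KK_HOME/ec_ca.py,$KK_HOME/calist.txt" \
    -input  /user/kk/book/eccfile.txt \
    -output /user/kk/book/eccoutput \
    -mapper "python ec_ca.py calist.txt"

# Inside ec_ca.py: read input records from sys.stdin, open sys.argv[1]
# ("calist.txt") for the reference data, and print results to sys.stdout;
# do not write to a local output path.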


Re: Hadoop and MainFrame integration

2012-08-28 Thread Mathias Herberts
Build a custom transfer mechanism in Java and use a zAAP so you won't
consume MIPS.
On Aug 28, 2012 6:24 PM, Siddharth Tiwari siddharth.tiw...@live.com
wrote:

  Hi Users.

 We have flat files on mainframes with around a billion records. We need to
 sort them and then use them with different jobs on mainframe for report
 generation. I was wondering was there any way I could integrate the
 mainframe with hadoop do the sorting and keep the file on the sever itself
 ( I do not want to ftp the file to a hadoop cluster and then ftp back the
 sorted file to Mainframe as it would waste MIPS and nullify the advantage
 ). This way I could save on MIPS and ultimately improve profitability.

 Thank you in advance


 ****
 *Cheers !!!*
 *Siddharth Tiwari*
 Have a refreshing day !!!
 *Every duty is holy, and devotion to duty is the highest form of worship
 of God.” *
 *Maybe other people will try to limit me but I don't limit myself*



Re: Hadoop and MainFrame integration

2012-08-28 Thread Steve Loughran
On 28 August 2012 09:24, Siddharth Tiwari siddharth.tiw...@live.com wrote:

  Hi Users.

 We have flat files on mainframes with around a billion records. We need to
 sort them and then use them with different jobs on mainframe for report
 generation. I was wondering was there any way I could integrate the
 mainframe with hadoop do the sorting and keep the file on the sever itself
 ( I do not want to ftp the file to a hadoop cluster and then ftp back the
 sorted file to Mainframe as it would waste MIPS and nullify the advantage
 ). This way I could save on MIPS and ultimately improve profitability.


Can you NFS-mount the mainframe filesystem from the Hadoop cluster?
Otherwise, do you or your mainframe vendor have a custom Hadoop filesystem
binding for the mainframe?

If not, you should be able to use ftp:// URLs as the source of data for the
initial MR job; at the end of the sequence of MR jobs the result can go
back to the mainframe;
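
A hedged sketch of the ftp:// idea (host, credentials, and paths are placeholders; this assumes Hadoop's built-in FTP filesystem can talk to the mainframe's FTP server, which you would need to verify for record format and codepage handling):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MainframeFtpInput {
    public static void wirePaths(Job job) throws Exception {
        // Read the flat file straight off the mainframe's FTP server...
        FileInputFormat.addInputPath(job,
            new Path("ftp://user:password@mainframe-host/path/to/flatfile"));
        // ...and write the sorted result into HDFS; a final step (or another
        // ftp:// output path) can push it back to the mainframe.
        FileOutputFormat.setOutputPath(job, new Path("/tmp/sorted-output"));
    }
}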


RE: unsubscribe

2012-08-28 Thread Hennig, Ryan
Error: unsubscribe request failed.  Please retry again during a full moon.

From: Alberto Andreotti [mailto:albertoandreo...@gmail.com]
Sent: Thursday, August 23, 2012 9:00 AM
To: user@hadoop.apache.org
Subject: unsubscribe

unsubscribe

--
José Pablo Alberto Andreotti.
Tel: 54 351 4730292
Móvil: 54351156526363.
MSN: albertoandreo...@gmail.com
Skype: andreottialberto


Re: best way to join?

2012-08-28 Thread Ted Dunning
I don't mean that.

I mean that a k-means clustering with pretty large clusters is a useful
auxiliary data structure for finding nearest neighbors.  The basic outline
is that you find the nearest clusters and search those for near neighbors.
 The first riff is that you use a clever data structure for finding the
nearest clusters so that you can do that faster than linear search.  The
second riff is when you use another clever data structure to search each
cluster quickly.

There are fancier data structures available as well.

On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan dextermorga...@gmail.comwrote:

 Right, but if i understood your sugesstion, you look at the end goal ,
 which is:
 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]

 for example, and you say: here we see a cluster basically, that cluster is
 represented by the point:  [40.123,-50.432]
 which points does this cluster contains?  [[41.431,-
 43.32],[...,...],...,[...]]
 meaning: that for every point i have in the dataset, you create a cluster.
 If you don't mean that, but you do mean to create clusters based on some
 random-seed points or what not, that would mean
  that i'll have points (talking about the end goal) that won't have
 enough points in their list.

 one of the criterions for a clustering is that for any clusters: C_i and
 C_j (where i != j), C_i intersect C_j is empty

 and again, how can i accomplish my task with out running mahout / knn
 algo? just by calculating distance between points?
 join of a file with it self.

 Thanks

 On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning tdunn...@maprtech.comwrote:



 On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan 
 dextermorga...@gmail.comwrote:


 I understand your solution ( i think) , didn't think of that, in that
 particular way.
 I think that lets say i have 1M data-points, and running knn , that the
 k=1M and n=10 (each point is a cluster that requires up to 10 points)
 is an overkill.


 I am not sure I understand you.  n = number of points.  k = number of
 clusters.  For searching 1 million points, I would recommend thousands of
 clusters.


 How can i achieve the same result WITHOUT using mahout, just running on
 the dataset , i even think it'll be in the same complexity (o(n^2))


 Running with a good knn package will give you roughly O(n log n)
 complexity.





copy Configuration into another

2012-08-28 Thread Radim Kolar
Is it possible to copy an org.apache.hadoop.conf.Configuration into another 
Configuration object without creating a new instance? I am looking for 
something like new Configuration(Configuration), but without creating a new 
destination object (it is managed by Spring).
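
A minimal sketch of one way to do this, relying on Configuration being Iterable over its key/value pairs (the helper name is hypothetical; whether blindly overwriting every key in the destination is the semantics you want is up to you):

import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public final class ConfigurationCopier {
    private ConfigurationCopier() {}

    // Copies every property from src into dst without replacing dst itself.
    public static void copyInto(Configuration src, Configuration dst) {
        for (Map.Entry<String, String> entry : src) {
            dst.set(entry.getKey(), entry.getValue());
        }
    }
}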


Re: Hadoop and MainFrame integration

2012-08-28 Thread Artem Ervits
Can you read the data off backup tapes and dump it to flat files?


Artem Ervits
Data Analyst
New York Presbyterian Hospital

From: Marcos Ortiz [mailto:mlor...@uci.cu]
Sent: Tuesday, August 28, 2012 06:51 PM
To: user@hadoop.apache.org
Cc: Siddharth Tiwari siddharth.tiw...@live.com
Subject: Re: Hadoop and MainFrame integration

The problem with that is that Hadoop works on top of HDFS, which stores data in blocks 
of 64/128 MB (or whatever size you configure; 64 MB is the de facto default), and then 
runs the calculations there.
So, you need to move all your data to an HDFS cluster if you want to make the 
calculations with Hadoop in MapReduce jobs.
Best wishes

On 28/08/2012 12:24, Siddharth Tiwari wrote:
Hi Users.

We have flat files on mainframes with around a billion records. We need to sort 
them and then use them with different jobs on mainframe for report generation. 
I was wondering was there any way I could integrate the mainframe with hadoop 
do the sorting and keep the file on the sever itself ( I do not want to ftp the 
file to a hadoop cluster and then ftp back the sorted file to Mainframe as it 
would waste MIPS and nullify the advantage ). This way I could save on MIPS and 
ultimately improve profitability.

Thank you in advance


**
Cheers !!!
Siddharth Tiwari
Have a refreshing day !!!
Every duty is holy, and devotion to duty is the highest form of worship of 
God.”
Maybe other people will try to limit me but I don't limit myself


This electronic message is intended to be for the use only of the named 
recipient, and may contain information that is confidential or privileged. If 
you are not the intended recipient, you are hereby notified that any 
disclosure, copying, distribution or use of the contents of this message is 
strictly prohibited. If you have received this message in error or are not the 
named recipient, please notify us immediately by contacting the sender at the 
electronic mail address noted above, and delete and destroy all copies of this 
message. Thank you.









RE: unsubscribe

2012-08-28 Thread sathyavageeswaran
HADOOP policy has changed.

 

Any user wanting to unsubscribe needs to donate USD 100/- for Obama’s
campaign before the request is accepted.

 

 

 

From: Georgi Georgiev [mailto:g.d.georg...@gmail.com] 
Sent: 29 August 2012 03:31
To: user@hadoop.apache.org
Cc: Hennig, Ryan
Subject: Re: unsubscribe

 

I even got out-of-office emails from people just by sending the email below -
that's crazy!

 

g

On Wed, Aug 29, 2012 at 12:56 AM, Georgi Georgiev g.d.georg...@gmail.com
wrote:

Guys - what's going wrong with these requests - can't you just teach people to act
appropriately and send regular mails to unsubscribe - there's really a lot of spam
in my inbox.

 

cheers,

 

g

 

On Wed, Aug 29, 2012 at 12:08 AM, Fabio Pitzolu fabio.pitz...@gmail.com
wrote:

Epic Ryan!!!

Sent from my Windows Phone


Da: Hennig, Ryan
Inviato: 28/08/2012 21:14
A: user@hadoop.apache.org
Oggetto: RE: unsubscribe

Error: unsubscribe request failed.  Please retry again during a full moon.

 

From: Alberto Andreotti [mailto:albertoandreo...@gmail.com] 
Sent: Thursday, August 23, 2012 9:00 AM
To: user@hadoop.apache.org
Subject: unsubscribe

 

unsubscribe

-- 
José Pablo Alberto Andreotti.
Tel: 54 351 4730292
Móvil: 54351156526363.
MSN: albertoandreo...@gmail.com
Skype: andreottialberto

 

 





HBase and MapReduce data locality

2012-08-28 Thread Robert Dyer
I have been reading up on HBase and my understanding is that the
physical files on the HDFS are split first by region and then by
column families.

Thus each column family has its own physical file (on a per-region basis).

If I run a MapReduce task that uses HBase as input, wouldn't this
imply that if the task reads from more than one column family, the data
for that row might not be (entirely) local to the task?

Is there a way to tell the HDFS to keep blocks of each region's column
families together?


Re: MRBench Maps strange behaviour

2012-08-28 Thread Hemanth Yamijala
Hi,

The number of maps specified to any map reduce program (including
those part of MRBench) is generally only a hint, and the actual number
of maps will be influenced in typical cases by the amount of data
being processed. You can take a look at this wiki link to understand
more: http://wiki.apache.org/hadoop/HowManyMapsAndReduces

In the examples below, since the data you've generated is different,
the number of mappers is different. To be able to judge your
benchmark results, you'd need to benchmark against the same data (or
at least the same kind of data - i.e., the same size and type).

The number of maps printed at the end is straight from the input
specified and doesn't reflect what the job actually ran with. The
information from the counters is the right one.

Thanks
Hemanth

On Tue, Aug 28, 2012 at 4:02 PM, Gaurav Dasgupta gdsay...@gmail.com wrote:
 Hi All,

 I executed the MRBench program from hadoop-test.jar in my 12 node CDH3
 cluster. After executing, I had some strange observations regarding the
 number of Maps it ran.

 First I ran the command:
 hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 3 -maps 200
 -reduces 200 -inputLines 1024 -inputType random
 And I could see that the actual number of Maps it ran was 201 (for all the 3
 runs) instead of 200 (though the end report displays the number launched as
 200). Here is the console report:


 12/08/28 04:34:35 INFO mapred.JobClient: Job complete: job_201208230144_0035

 12/08/28 04:34:35 INFO mapred.JobClient: Counters: 28

 12/08/28 04:34:35 INFO mapred.JobClient:   Job Counters

 12/08/28 04:34:35 INFO mapred.JobClient: Launched reduce tasks=200

 12/08/28 04:34:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=617209

 12/08/28 04:34:35 INFO mapred.JobClient: Total time spent by all reduces
 waiting after reserving slots (ms)=0

 12/08/28 04:34:35 INFO mapred.JobClient: Total time spent by all maps
 waiting after reserving slots (ms)=0

 12/08/28 04:34:35 INFO mapred.JobClient: Rack-local map tasks=137

 12/08/28 04:34:35 INFO mapred.JobClient: Launched map tasks=201

 12/08/28 04:34:35 INFO mapred.JobClient: Data-local map tasks=64

 12/08/28 04:34:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=1756882



 Again, I ran the MRBench for just 10 Maps and 10 Reduces:

 hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -maps 10 -reduces 10



 This time the actual number of Maps was only 2, and again the end report
 displays Maps Launched as 10. The console output:



 12/08/28 05:05:35 INFO mapred.JobClient: Job complete: job_201208230144_0040
 12/08/28 05:05:35 INFO mapred.JobClient: Counters: 27
 12/08/28 05:05:35 INFO mapred.JobClient:   Job Counters
 12/08/28 05:05:35 INFO mapred.JobClient: Launched reduce tasks=20
 12/08/28 05:05:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6648
 12/08/28 05:05:35 INFO mapred.JobClient: Total time spent by all reduces
 waiting after reserving slots (ms)=0
 12/08/28 05:05:35 INFO mapred.JobClient: Total time spent by all maps
 waiting after reserving slots (ms)=0
 12/08/28 05:05:35 INFO mapred.JobClient: Launched map tasks=2
 12/08/28 05:05:35 INFO mapred.JobClient: Data-local map tasks=2
 12/08/28 05:05:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=163257
 12/08/28 05:05:35 INFO mapred.JobClient:   FileSystemCounters
 12/08/28 05:05:35 INFO mapred.JobClient: FILE_BYTES_READ=407
 12/08/28 05:05:35 INFO mapred.JobClient: HDFS_BYTES_READ=258
 12/08/28 05:05:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1072596
 12/08/28 05:05:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=3
 12/08/28 05:05:35 INFO mapred.JobClient:   Map-Reduce Framework
 12/08/28 05:05:35 INFO mapred.JobClient: Map input records=1
 12/08/28 05:05:35 INFO mapred.JobClient: Reduce shuffle bytes=647
 12/08/28 05:05:35 INFO mapred.JobClient: Spilled Records=2
 12/08/28 05:05:35 INFO mapred.JobClient: Map output bytes=5
 12/08/28 05:05:35 INFO mapred.JobClient: CPU time spent (ms)=17070
 12/08/28 05:05:35 INFO mapred.JobClient: Total committed heap usage
 (bytes)=6218842112
 12/08/28 05:05:35 INFO mapred.JobClient: Map input bytes=2
 12/08/28 05:05:35 INFO mapred.JobClient: Combine input records=0
 12/08/28 05:05:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=254
 12/08/28 05:05:35 INFO mapred.JobClient: Reduce input records=1
 12/08/28 05:05:35 INFO mapred.JobClient: Reduce input groups=1
 12/08/28 05:05:35 INFO mapred.JobClient: Combine output records=0
 12/08/28 05:05:35 INFO mapred.JobClient: Physical memory (bytes)
 snapshot=3348828160
 12/08/28 05:05:35 INFO mapred.JobClient: Reduce output records=1
 12/08/28 05:05:35 INFO mapred.JobClient: Virtual memory (bytes)
 snapshot=22955810816
 12/08/28 05:05:35 INFO mapred.JobClient: Map output records=1
 DataLines Maps Reduces AvgTime (milliseconds)
 120 20   17451

 Can someone please help me understand this behaviour of Hadoop in