Help with adjusting Hadoop configuration files

2011-06-21 Thread Avi Vaknin
Hi Everyone,
We are a start-up company that has been using the Hadoop cluster platform
(version 0.20.2) in the Amazon EC2 environment.
We tried to set up a cluster in two different forms:
Cluster 1: 1 master (namenode) + 5 datanodes - all of the machines
are small EC2 instances (1.6 GB RAM)
Cluster 2: 1 master (namenode) + 2 datanodes - the master is a
small EC2 instance and the other two datanodes are large EC2 instances (7.5
GB RAM)
We made changes to the configuration files (core-site, hdfs-site
and mapred-site XML files) and expected to see a significant improvement
in the performance of cluster 2;
unfortunately this has yet to happen.

Are there any special parameters in the configuration files that we need to
change in order to tune Hadoop for larger hardware?
Are there any best practices you recommend?

Thanks in advance.

Avi





Re: Setting up a Single Node Hadoop Cluster

2011-06-21 Thread madhu phatak
What is the log content? It's the best place to see what's going wrong. If
you share the logs, it's easier to point out the problem.

On Tue, Jun 21, 2011 at 9:06 AM, Kumar Kandasami 
kumaravel.kandas...@gmail.com wrote:

 Hi Ziyad:

  Do you see any errors on the log file ?

 I have installed CDH3 in the past on Ubuntu machines using the two links
 below:

 https://ccp.cloudera.com/display/CDHDOC/Before+You+Install+CDH3+on+a+Single+Node
 

 https://ccp.cloudera.com/display/CDHDOC/Installing+CDH3+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode

 Also the blog link below explains how to install using tarball files that
 works on my Ubuntu too (even though it is explained for Mac).


 http://knowledgedonor.blogspot.com/2011/05/installing-cloudera-hadoop-hadoop-0202.html


 Hope these links help you proceed further.

 Kumar_/|\_
 www.saisk.com
 ku...@saisk.com
 making a profound difference with knowledge and creativity...


 On Mon, Jun 20, 2011 at 6:22 PM, Ziyad Mir ziyad...@gmail.com wrote:

  Hi,
 
  I have been attempting to set up a single node Hadoop cluster (by
 following
 
 
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
  )
  on
  my personal computer (running Ubuntu 10.10), however, I have run into
 some
  roadblocks.
 
  Specifically, there appear to be issues starting the required Hadoop
  processes after running 'bin/hadoop/start-all.sh' (jps only returns
  itself).
  In addition, if I run 'bin/hadoop/stop-all.sh', I often see 'no namenode
 to
  stop, no jobtracker to stop'.
 
  I have attempted looking into the hadoop/log files, however, I'm not sure
  what specifically I am looking for.
 
  Any suggestions would be much appreciated.
 
  Thanks,
  Ziyad
 



Re: Help with adjusting Hadoop configuration files

2011-06-21 Thread madhu phatak
If you reduce the default DFS block size (which is set in the configuration
file) and use the default InputFormat, more mappers are created at a
time, which may help you use the RAM effectively. Another way is to run
as many parallel jobs as possible programmatically so that all
available RAM is used.
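
For reference, the property behind this is dfs.block.size. A minimal sketch
with an example value only (cluster-wide it normally lives in hdfs-site.xml,
and it only affects files written after the change):

import org.apache.hadoop.conf.Configuration;

// Sketch only: 0.20-era property name, example value. With the default
// InputFormat roughly one map task is created per HDFS block of the input,
// so a smaller block size means more (and smaller) map tasks at a time.
// (Fragment: this would sit in the job driver.)
Configuration conf = new Configuration();
conf.setLong("dfs.block.size", 64L * 1024 * 1024); // 64 MB for newly written files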

On Tue, Jun 21, 2011 at 3:17 PM, Avi Vaknin avivakni...@gmail.com wrote:

 Hi Madhu,
 First of all, thanks for the quick reply.
 I've searched the net about the properties of the configuration files and I
 specifically wanted to know if there is
 a property that is related to memory tuning (as you can see I have 7.5 RAM
 on each datanode and I really want to use it properly).
 Also, I've changed the mapred.tasktracker.reduce/map.tasks.maximum to 10
 (number of cores on the datanodes) and unfortunately I haven't seen any
 change on the performance or time duration of running jobs.

 Avi

 -Original Message-
 From: madhu phatak [mailto:phatak@gmail.com]
 Sent: Tuesday, June 21, 2011 12:33 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Help with adjusting Hadoop configuration files

 The utilization of cluster depends upon the no of jobs and no of mappers
 and
 reducers.The configuration files only help u set up the cluster by
 specifying info .u can also specify some of details like block size and
 replication in configuration files  which may help you in job
 management.You
 can read all the available configuration properties here
 http://hadoop.apache.org/common/docs/current/cluster_setup.html

 On Tue, Jun 21, 2011 at 2:13 PM, Avi Vaknin avivakni...@gmail.com wrote:

  Hi Everyone,
  We are a start-up company has been using the Hadoop Cluster platform
  (version 0.20.2) on Amazon EC2 environment.
  We tried to setup a cluster using two different forms:
  Cluster 1: includes 1 master (namenode) + 5 datanodes - all of the
 machines
  are small EC2 instances (1.6 GB RAM)
  Cluster 2: includes 1 master (namenode) + 2 datanodes - the master is a
  small EC2 instance and the other two datanodes are large EC2 instances
 (7.5
  GB RAM)
  We tried to make changes on the the configuration files (core-sit,
  hdfs-site
  and mapred-sit xml files) and we expected to see a significant
 improvement
  on the performance of the cluster 2,
  unfortunately this has yet to happen.
 
  Are there any special parameters on the configuration files that we need
 to
  change in order to adjust the Hadoop to a large hardware environment ?
  Are there any best practice you recommend?
 
  Thanks in advance.
 
  Avi
 
 
 
 





Re: Starting an HDFS node (standalone) programmatically by API

2011-06-21 Thread madhu phatak
HDFS has to be available to the datanodes in order to run jobs, and bin/hdfs
just goes through Hadoop to access HDFS on the datanodes. So if you want to
read a file from HDFS inside a job, you have to start the datanodes when the
cluster comes up.
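
If you only need an in-process HDFS (for example for tests), one option is
MiniDFSCluster from the Hadoop test jar; a minimal sketch, assuming the
0.20-era constructor (for a real deployment the shell scripts and daemons
remain the usual path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class EmbeddedHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Start a namenode plus one datanode inside this JVM (formats a temp dir).
    MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
    cluster.waitActive();

    FileSystem fs = cluster.getFileSystem();
    fs.mkdirs(new Path("/demo"));
    System.out.println("HDFS is up at " + fs.getUri());

    cluster.shutdown();
  }
}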

On Fri, Jun 17, 2011 at 4:12 PM, punisher punishe...@hotmail.it wrote:

 Hi all,

 hdfs nodes can be started using the sh scripts provided with hadoop.
 I read that it's all based on script files

 is it possible to start an HDFS (standalone) from a java application by
 API?

 Thanks

 --
 View this message in context:
 http://hadoop-common.472056.n3.nabble.com/Starting-an-HDFS-node-standalone-programmatically-by-API-tp3075693p3075693.html
 Sent from the Users mailing list archive at Nabble.com.



Re: HDFS File Appending

2011-06-21 Thread madhu phatak
HDFS does not support appending, I think. I'm not sure about Pig; if you are
using Hadoop directly you can zip the files and use the zip as the input to
the jobs.
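
If the goal is only to merge many small files into one HDFS file before the
analysis, FileUtil.copyMerge (mentioned in the reply quoted below) can also
do that; a minimal sketch with placeholder paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Concatenate every file under the source directory into a single target
    // file; paths are placeholders, and "\n" is appended after each source file.
    FileUtil.copyMerge(fs, new Path("/data/small-files"),
                       fs, new Path("/data/merged/all.txt"),
                       false /* keep the source files */, conf, "\n");
  }
}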

On Fri, Jun 17, 2011 at 6:56 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote:

 please refer to FileUtil.CopyMerge

 On Fri, Jun 17, 2011 at 8:33 AM, jagaran das jagaran_...@yahoo.co.in
 wrote:
  Hi,
 
  We have a requirement where
 
   There would be huge number of small files to be pushed to hdfs and then
 use pig
  to do analysis.
   To get around the classic Small File Issue we merge the files and push
 a
  bigger file in to HDFS.
   But we are loosing time in this merging process of our pipeline.
 
  But If we can directly append to an existing file in HDFS we can save
 this
  Merging Files time.
 
  Can you please suggest if there a newer stable version of Hadoop where
 can go
  for appending ?
 
  Thanks and Regards,
  Jagaran



Re: Heap Size is 27.25 MB/888.94 MB

2011-06-21 Thread madhu phatak
It's related to the amount of memory available to the Java Virtual Machine
that is created for the Hadoop jobs.

On Fri, Jun 17, 2011 at 1:18 AM, Harsh J ha...@cloudera.com wrote:

 The 'heap size' is a Java/program and memory (RAM) thing; unrelated to
 physical disk space that the HDFS may occupy (which can be seen in
 configured capacity).

 More reading on what a Java heap size is about:
 http://en.wikipedia.org/wiki/Java_Virtual_Machine#Heap

 On Fri, Jun 17, 2011 at 1:07 AM,  jeff.schm...@shell.com wrote:
 
  So its saying my heap size is  (Heap Size is 27.25 MB/888.94 MB)
 
 
  but my configured capacity is 971GB (4 nodes)
 
 
 
  Is heap size on the main page just for the namenode or do I need to
  increase it to include the datanodes
 
 
 
  Cheers -
 
 
 
  Jeffery Schmitz
  Projects and Technology
  3737 Bellaire Blvd Houston, Texas 77001
  Tel: +1-713-245-7326 Fax: +1 713 245 7678
  Email: jeff.schm...@shell.com mailto:jeff.schm...@shell.com
 
  TK-421, why aren't you at your post?
 
 
 
 
 
 



 --
 Harsh J



Re: ClassNotFoundException while running quick start guide on Windows.

2011-06-21 Thread madhu phatak
I think the jar has some issue where the main class cannot be read
from its manifest. Try unpacking the jar, check the Main-Class entry in
META-INF/MANIFEST.MF, and then run as follows:

 bin/hadoop jar hadoop-*-examples.jar <fully qualified main class> grep input
output 'dfs[a-z.]+'
On Thu, Jun 16, 2011 at 10:23 AM, Drew Gross drew.a.gr...@gmail.com wrote:

 Hello,

 I'm trying to run the example from the quick start guide on Windows and I
 get this error:

 $ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
 Exception in thread main java.lang.NoClassDefFoundError:
 Caused by: java.lang.ClassNotFoundException:
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 Could not find the main class: .  Program will exit.
 Exception in thread main java.lang.NoClassDefFoundError:
 Gross\Documents\Projects\discom\hadoop-0/21/0\logs
 Caused by: java.lang.ClassNotFoundException:
 Gross\Documents\Projects\discom\hadoop-0.21.0\logs
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 Could not find the main class:
 Gross\Documents\Projects\discom\hadoop-0.21.0\logs.  Program will exit.

 Does anyone know what I need to change?

 Thank you.

 From, Drew

 --
 Forget the environment. Print this e-mail immediately. Then burn it.



Re: Handling external jars in EMR

2011-06-21 Thread madhu phatak
It's better to merge the library with your code. Otherwise you have to copy the
library into Hadoop's lib folder on every node of the cluster. -libjars is not
working for me either. I used the Maven Shade plugin (from Eclipse) to get the
merged jar.

On Wed, Jun 15, 2011 at 12:20 AM, Mehmet Tepedelenlioglu 
mehmets...@gmail.com wrote:

 I am using the Guava library in my hadoop code through a jar file. With
 hadoop one has the -libjars option (although I could not get that working on
 0.2 for some reason). Are there any easy options with EMR short of using a
 utility like jarjar or bootstrapping magic. Or is that what I'll need to do?

 Thanks,

 Mehmet T.


Re: Hadoop Runner

2011-06-21 Thread madhu phatak
Define your own custom RecordReader; it's efficient.
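
For example, one way is to subclass TextInputFormat and wrap its RecordReader
so that each record is rewritten before map() sees it; a sketch against the
old (mapred) API, with the actual transformation left as a placeholder:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class TransformingTextInputFormat extends TextInputFormat {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final RecordReader<LongWritable, Text> inner =
        super.getRecordReader(split, job, reporter);
    // Delegate to the normal line reader, but rewrite each value in flight.
    return new RecordReader<LongWritable, Text>() {
      public boolean next(LongWritable key, Text value) throws IOException {
        boolean hasNext = inner.next(key, value);
        if (hasNext) {
          value.set(value.toString().toUpperCase()); // placeholder transformation
        }
        return hasNext;
      }
      public LongWritable createKey() { return inner.createKey(); }
      public Text createValue() { return inner.createValue(); }
      public long getPos() throws IOException { return inner.getPos(); }
      public float getProgress() throws IOException { return inner.getProgress(); }
      public void close() throws IOException { inner.close(); }
    };
  }
}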

On Sun, Jun 12, 2011 at 10:12 AM, Harsh J ha...@cloudera.com wrote:

 Mark,

 I may not have gotten your question exactly, but you can do further
 processing inside of your FileInputFormat derivative's RecordReader
 implementation (just before it loads the value for a next() form of
 call -- which the MapRunner would use to read).

 If you're looking to dig into Hadoop's source code to understand the
 flow yourself, MapTask.java is what you may be looking for (run*
 methods).

 On Sun, Jun 12, 2011 at 3:25 AM, Mark question markq2...@gmail.com
 wrote:
  Hi,
 
   1) Where can I find the main class of hadoop? The one that calls the
  InputFormat then the MapperRunner and ReducerRunner and others?
 
 This will help me understand what is in memory or still on disk ,
 exact
  flow of data between split and mappers .
 
  My problem is, assuming I have a TextInputFormat and would like to modify
  the input in memory before being read by RecordReader... where shall I do
  that?
 
 InputFormat was my first guess, but unfortunately, it only defines the
  logical splits ... So, the only way I can think of is use the
 recordReader
  to read all the records in split into another variable (with the format I
  want) then process that variable by map functions.
 
But is that efficient? So, to understand this,I hope someone can give
 an
  answer to Q(1)
 
  Thank you,
  Mark
 



 --
 Harsh J



Re: Append to Existing File

2011-06-21 Thread Eric Charles
When you say bugs pending, are you referring to HDFS-265 (which links
to HDFS-1060, HADOOP-6239 and HDFS-744)?

Are there other issues related to append besides the ones above?

Tks, Eric

https://issues.apache.org/jira/browse/HDFS-265


On 21/06/11 12:36, madhu phatak wrote:

Its not stable . There are some bugs pending . According one of the
disccusion till date the append is not ready for production.

On Tue, Jun 14, 2011 at 12:19 AM, jagaran dasjagaran_...@yahoo.co.inwrote:


I am using hadoop-0.20.203.0 version.
I have set

dfs.support.append to true and then using append method

It is working but need to know how stable it is to deploy and use in
production
clusters ?

Regards,
Jagaran




From: jagaran dasjagaran_...@yahoo.co.in
To: common-user@hadoop.apache.org
Sent: Mon, 13 June, 2011 11:07:57 AM
Subject: Append to Existing File

Hi All,

Is append to an existing file is now supported in Hadoop for production
clusters?
If yes, please let me know which version and how

Thanks
Jagaran





--
Eric


RE: Help with adjusting Hadoop configuration files

2011-06-21 Thread Avi Vaknin
Hi,
The block size is configured to 128 MB; I've read that it is recommended to
increase it in order to get better performance.
What value do you recommend setting it to?

Avi

-Original Message-
From: madhu phatak [mailto:phatak@gmail.com] 
Sent: Tuesday, June 21, 2011 12:54 PM
To: common-user@hadoop.apache.org
Subject: Re: Help with adjusting Hadoop configuration files

If u reduce the default block size of dfs(which is in the configuration
file) and if u use default inputformat it creates more no of mappers at a
time which may help you to effectively use the RAM.. Another way is create
as many parallel jobs as possible at pro grammatically so that uses all
available RAM.

On Tue, Jun 21, 2011 at 3:17 PM, Avi Vaknin avivakni...@gmail.com wrote:

 Hi Madhu,
 First of all, thanks for the quick reply.
 I've searched the net about the properties of the configuration files and
I
 specifically wanted to know if there is
 a property that is related to memory tuning (as you can see I have 7.5 RAM
 on each datanode and I really want to use it properly).
 Also, I've changed the mapred.tasktracker.reduce/map.tasks.maximum to 10
 (number of cores on the datanodes) and unfortunately I haven't seen any
 change on the performance or time duration of running jobs.

 Avi

 -Original Message-
 From: madhu phatak [mailto:phatak@gmail.com]
 Sent: Tuesday, June 21, 2011 12:33 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Help with adjusting Hadoop configuration files

 The utilization of cluster depends upon the no of jobs and no of mappers
 and
 reducers.The configuration files only help u set up the cluster by
 specifying info .u can also specify some of details like block size and
 replication in configuration files  which may help you in job
 management.You
 can read all the available configuration properties here
 http://hadoop.apache.org/common/docs/current/cluster_setup.html

 On Tue, Jun 21, 2011 at 2:13 PM, Avi Vaknin avivakni...@gmail.com wrote:

  Hi Everyone,
  We are a start-up company has been using the Hadoop Cluster platform
  (version 0.20.2) on Amazon EC2 environment.
  We tried to setup a cluster using two different forms:
  Cluster 1: includes 1 master (namenode) + 5 datanodes - all of the
 machines
  are small EC2 instances (1.6 GB RAM)
  Cluster 2: includes 1 master (namenode) + 2 datanodes - the master is a
  small EC2 instance and the other two datanodes are large EC2 instances
 (7.5
  GB RAM)
  We tried to make changes on the the configuration files (core-sit,
  hdfs-site
  and mapred-sit xml files) and we expected to see a significant
 improvement
  on the performance of the cluster 2,
  unfortunately this has yet to happen.
 
  Are there any special parameters on the configuration files that we need
 to
  change in order to adjust the Hadoop to a large hardware environment ?
  Are there any best practice you recommend?
 
  Thanks in advance.
 
  Avi
 
 
 
 







Re: Append to Existing File

2011-06-21 Thread madhu phatak
Please refer to this discussion
http://search-hadoop.com/m/rnG0h1zCZcL1/Re%253A+HDFS+File+Appending+URGENTsubj=Fw+HDFS+File+Appending+URGENT

On Tue, Jun 21, 2011 at 4:23 PM, Eric Charles eric.char...@u-mangate.comwrote:

 When you say bugs pending, are your refering to HDFS-265 (which links to
 HDFS-1060, HADOOP-6239 and HDFS-744?

 Are there other issues related to append than the one above?

 Tks, Eric

  https://issues.apache.org/jira/browse/HDFS-265



 On 21/06/11 12:36, madhu phatak wrote:

 Its not stable . There are some bugs pending . According one of the
 disccusion till date the append is not ready for production.

 On Tue, Jun 14, 2011 at 12:19 AM, jagaran dasjagaran_...@yahoo.co.in**
 wrote:

  I am using hadoop-0.20.203.0 version.
 I have set

 dfs.support.append to true and then using append method

 It is working but need to know how stable it is to deploy and use in
 production
 clusters ?

 Regards,
 Jagaran



 __**__
 From: jagaran dasjagaran_...@yahoo.co.in
 To: common-user@hadoop.apache.org
 Sent: Mon, 13 June, 2011 11:07:57 AM
 Subject: Append to Existing File

 Hi All,

 Is append to an existing file is now supported in Hadoop for production
 clusters?
 If yes, please let me know which version and how

 Thanks
 Jagaran



 --
 Eric



Re: Help with adjusting Hadoop configuration files

2011-06-21 Thread madhu phatak
Yes, it will increase performance by reducing the number of mappers and letting
a single mapper use more memory. So the value depends upon the application
and the RAM available. For your use case I think 512 MB to 1 GB would be a better
value.

On Tue, Jun 21, 2011 at 4:28 PM, Avi Vaknin avivakni...@gmail.com wrote:

 Hi,
 The block size is configured to 128MB, I've read that it is recommended to
 increase it in order to get better performance.
 What value do you recommend to set it ?

 Avi

 -Original Message-
 From: madhu phatak [mailto:phatak@gmail.com]
 Sent: Tuesday, June 21, 2011 12:54 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Help with adjusting Hadoop configuration files

 If u reduce the default block size of dfs(which is in the configuration
 file) and if u use default inputformat it creates more no of mappers at a
 time which may help you to effectively use the RAM.. Another way is create
 as many parallel jobs as possible at pro grammatically so that uses all
 available RAM.

 On Tue, Jun 21, 2011 at 3:17 PM, Avi Vaknin avivakni...@gmail.com wrote:

  Hi Madhu,
  First of all, thanks for the quick reply.
  I've searched the net about the properties of the configuration files and
 I
  specifically wanted to know if there is
  a property that is related to memory tuning (as you can see I have 7.5
 RAM
  on each datanode and I really want to use it properly).
  Also, I've changed the mapred.tasktracker.reduce/map.tasks.maximum to 10
  (number of cores on the datanodes) and unfortunately I haven't seen any
  change on the performance or time duration of running jobs.
 
  Avi
 
  -Original Message-
  From: madhu phatak [mailto:phatak@gmail.com]
  Sent: Tuesday, June 21, 2011 12:33 PM
  To: common-user@hadoop.apache.org
  Subject: Re: Help with adjusting Hadoop configuration files
 
  The utilization of cluster depends upon the no of jobs and no of mappers
  and
  reducers.The configuration files only help u set up the cluster by
  specifying info .u can also specify some of details like block size and
  replication in configuration files  which may help you in job
  management.You
  can read all the available configuration properties here
  http://hadoop.apache.org/common/docs/current/cluster_setup.html
 
  On Tue, Jun 21, 2011 at 2:13 PM, Avi Vaknin avivakni...@gmail.com
 wrote:
 
   Hi Everyone,
   We are a start-up company has been using the Hadoop Cluster platform
   (version 0.20.2) on Amazon EC2 environment.
   We tried to setup a cluster using two different forms:
   Cluster 1: includes 1 master (namenode) + 5 datanodes - all of the
  machines
   are small EC2 instances (1.6 GB RAM)
   Cluster 2: includes 1 master (namenode) + 2 datanodes - the master is a
   small EC2 instance and the other two datanodes are large EC2 instances
  (7.5
   GB RAM)
   We tried to make changes on the the configuration files (core-sit,
   hdfs-site
   and mapred-sit xml files) and we expected to see a significant
  improvement
   on the performance of the cluster 2,
   unfortunately this has yet to happen.
  
   Are there any special parameters on the configuration files that we
 need
  to
   change in order to adjust the Hadoop to a large hardware environment ?
   Are there any best practice you recommend?
  
   Thanks in advance.
  
   Avi
  
  
  
  
 
 
 





Re: one-to-many Map Side Join without reducer

2011-06-21 Thread madhu phatak
I think Hive is best suited for your use case: it gives you a SQL-based
interface to Hadoop for doing these kinds of things.
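
If you prefer to stay with plain MapReduce, the map-side variant you describe
(dataset 2 small enough to keep in memory) is usually done by loading dataset 2
in each mapper; a rough sketch, where the class name, file layout and
tab-separated format are all assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a replicated (in-memory) map-side join: dataset 2 (SubKey<TAB>Value)
// is shipped to each node via the DistributedCache by the job driver, dataset 1
// (MasterKey<TAB>SubKey1<TAB>SubKey2...) is the map input. No reducer is needed.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> subKeyToValue = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumes the driver added exactly one cache file holding dataset 2.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t");
      subKeyToValue.put(parts[0], parts[1]);
    }
    in.close();
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\t");
    StringBuilder values = new StringBuilder();
    for (int i = 1; i < parts.length; i++) {
      values.append(subKeyToValue.get(parts[i])); // missing sub-keys would need handling
      if (i < parts.length - 1) {
        values.append("\t");
      }
    }
    context.write(new Text(parts[0]), new Text(values.toString()));
  }
}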

On Fri, Jun 10, 2011 at 2:39 AM, Shi Yu sh...@uchicago.edu wrote:

 Hi,

 I have two datasets: dataset 1 has the format:

  MasterKey1   SubKey1   SubKey2   SubKey3
  MasterKey2   Subkey4   Subkey5   Subkey6
 


 dataset 2 has the format:

  SubKey1   Value1
  SubKey2   Value2
 ...

 I want to have one-to-many join based on the SubKey, and the final goal is
 to have an output like:

  MasterKey1   Value1   Value2   Value3
  MasterKey2   Value4   Value5   Value6
 ...


 After studying and experimenting some example code, I understand that it is
 doable if I transform the first data set as

  SubKey1   MasterKey1
  SubKey2   MasterKey1
  SubKey3   MasterKey1
  SubKey4   MasterKey2
  SubKey5   MasterKey2
  SubKey6   MasterKey2

 then using the inner join with the dataset 2 on SubKey. Then I probably
 need a reducer to perform secondary sort on MasterKey to get the result.
 However, the bottleneck is still on the reducer if each MasterKey has lots
 of SubKey.
 My question is, suppose that dataset2 contains all the Subkeys and never
 split, is it possible to join the key of dataset 2 with multiple values of
 dataset 1 at the Mapper Side? Any hint is highly appreciated.

 Shi





Re: Running Back to Back Map-reduce jobs

2011-06-21 Thread madhu phatak
You can use ControlledJob's addDependingJob to handle dependencies between
multiple jobs.
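
A minimal sketch of that, assuming job1 and job2 are already configured Job
instances (the new-API ControlledJob/JobControl classes live under
org.apache.hadoop.mapreduce.lib.jobcontrol in releases after 0.20; 0.20 has an
equivalent under org.apache.hadoop.mapred.jobcontrol):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class ChainedJobsRunner {
  // Run job2 only after job1 has finished successfully.
  public static void runChained(Job job1, Job job2) throws Exception {
    ControlledJob first = new ControlledJob(job1.getConfiguration());
    first.setJob(job1);
    ControlledJob second = new ControlledJob(job2.getConfiguration());
    second.setJob(job2);
    second.addDependingJob(first);        // the dependency between the two jobs

    JobControl control = new JobControl("chained-jobs");
    control.addJob(first);
    control.addJob(second);

    Thread runner = new Thread(control);  // JobControl implements Runnable
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();
  }
}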

On Tue, Jun 7, 2011 at 4:15 PM, Adarsh Sharma adarsh.sha...@orkash.comwrote:

 Harsh J wrote:

 Yes, I believe Oozie does have Pipes and Streaming action helpers as well.

 On Thu, Jun 2, 2011 at 5:05 PM, Adarsh Sharma adarsh.sha...@orkash.com
 wrote:


 Ok, Is it valid for running jobs through Hadoop Pipes too.

 Thanks

 Harsh J wrote:


 Oozie's workflow feature may exactly be what you're looking for. It
 can also do much more than just chain jobs.

 Check out additional features at: http://yahoo.github.com/oozie/

 On Thu, Jun 2, 2011 at 4:48 PM, Adarsh Sharma adarsh.sha...@orkash.com
 
 wrote:



 After following the below points, I am confused about the examples used
 in the documentation :

  http://yahoo.github.com/oozie/releases/3.0.0/WorkflowFunctionalSpec.html#a3.2.2.3_Pipes

 What I want to achieve is to terminate a job after my permission i.e if I
 want to run again a map-reduce job after the completion of one , it runs 
 then terminates after my code execution.
 I struggled to find a simple example that proves this concept. In the Oozie
 documentation, they r just setting parameters and use them.

 fore.g a simple Hadoop Pipes job is executed by :

  int main(int argc, char *argv[]) {
    return HadoopPipes::runTask(HadoopPipes::TemplateFactory<WordCountMap,
        WordCountReduce>());
  }

 Now if I want to run another job after this on the reduced data in HDFS,
 how this could be possible. Do i need to add some code.

 Thanks





  Dear all,

 I ran several map-reduce jobs in Hadoop Cluster of 4 nodes.

 Now this time I want a map-reduce job to be run again after one.

 Fore.g to clear my point, suppose a wordcount is run on gutenberg file
 in
 HDFS and after completion

 11/06/02 15:14:35 WARN mapred.JobClient: No job jar file set.  User
 classes
 may not be found. See JobConf(Class) or JobConf#setJar(String).
 11/06/02 15:14:35 INFO mapred.FileInputFormat: Total input paths to
 process
 : 3
 11/06/02 15:14:36 INFO mapred.JobClient: Running job:
 job_201106021143_0030
 11/06/02 15:14:37 INFO mapred.JobClient:  map 0% reduce 0%
 11/06/02 15:14:50 INFO mapred.JobClient:  map 33% reduce 0%
 11/06/02 15:14:59 INFO mapred.JobClient:  map 66% reduce 11%
 11/06/02 15:15:08 INFO mapred.JobClient:  map 100% reduce 22%
 11/06/02 15:15:17 INFO mapred.JobClient:  map 100% reduce 100%
 11/06/02 15:15:25 INFO mapred.JobClient: Job complete:
 job_201106021143_0030
 11/06/02 15:15:25 INFO mapred.JobClient: Counters: 18



 Again a map-reduce job is started on the output or original data say
 again

 1/06/02 15:14:36 INFO mapred.JobClient: Running job:
 job_201106021143_0030
 11/06/02 15:14:37 INFO mapred.JobClient:  map 0% reduce 0%
 11/06/02 15:14:50 INFO mapred.JobClient:  map 33% reduce 0%

 Is it possible or any parameters to achieve it.

 Please guide .

 Thanks




















Re: Append to Existing File

2011-06-21 Thread Eric Charles

Hi Madhu,

Tks for the pointer. Even after reading the section on 0.21/22/23 
written by Tsz-Wo, I still remain in the fog...


Will HDFS-265 (and the JIRAs it mentions) provide a solution for append
(whatever release it lands in)?

Another way of asking: are there today other JIRAs, besides the ones
mentioned on HDFS-265, to take into consideration to have a working Hadoop
append?


Tks, Eric


On 21/06/11 12:58, madhu phatak wrote:

Please refer to this discussion
http://search-hadoop.com/m/rnG0h1zCZcL1/Re%253A+HDFS+File+Appending+URGENTsubj=Fw+HDFS+File+Appending+URGENT

On Tue, Jun 21, 2011 at 4:23 PM, Eric Charleseric.char...@u-mangate.comwrote:


When you say bugs pending, are your refering to HDFS-265 (which links to
HDFS-1060, HADOOP-6239 and HDFS-744?

Are there other issues related to append than the one above?

Tks, Eric

https://issues.apache.org/jira/browse/HDFS-265



On 21/06/11 12:36, madhu phatak wrote:


Its not stable . There are some bugs pending . According one of the
disccusion till date the append is not ready for production.

On Tue, Jun 14, 2011 at 12:19 AM, jagaran dasjagaran_...@yahoo.co.in**
wrote:

  I am using hadoop-0.20.203.0 version.

I have set

dfs.support.append to true and then using append method

It is working but need to know how stable it is to deploy and use in
production
clusters ?

Regards,
Jagaran



__**__
From: jagaran dasjagaran_...@yahoo.co.in
To: common-user@hadoop.apache.org
Sent: Mon, 13 June, 2011 11:07:57 AM
Subject: Append to Existing File

Hi All,

Is append to an existing file is now supported in Hadoop for production
clusters?
If yes, please let me know which version and how

Thanks
Jagaran





--
Eric





--
Eric


RE: Help with adjusting Hadoop configuration files

2011-06-21 Thread Avi Vaknin
Thanks Madhu, I'll check it.

-Original Message-
From: madhu phatak [mailto:phatak@gmail.com] 
Sent: Tuesday, June 21, 2011 2:02 PM
To: common-user@hadoop.apache.org
Subject: Re: Help with adjusting Hadoop configuration files

Yeah it will increase performance by reducing number of mappers and making
single mapper to use more memory . So the value depends upon the application
and RAM available . For your use case i think 512MB- 1GB will be better
value.

On Tue, Jun 21, 2011 at 4:28 PM, Avi Vaknin avivakni...@gmail.com wrote:

 Hi,
 The block size is configured to 128MB, I've read that it is recommended to
 increase it in order to get better performance.
 What value do you recommend to set it ?

 Avi

 -Original Message-
 From: madhu phatak [mailto:phatak@gmail.com]
 Sent: Tuesday, June 21, 2011 12:54 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Help with adjusting Hadoop configuration files

 If u reduce the default block size of dfs(which is in the configuration
 file) and if u use default inputformat it creates more no of mappers at a
 time which may help you to effectively use the RAM.. Another way is create
 as many parallel jobs as possible at pro grammatically so that uses all
 available RAM.

 On Tue, Jun 21, 2011 at 3:17 PM, Avi Vaknin avivakni...@gmail.com wrote:

  Hi Madhu,
  First of all, thanks for the quick reply.
  I've searched the net about the properties of the configuration files
and
 I
  specifically wanted to know if there is
  a property that is related to memory tuning (as you can see I have 7.5
 RAM
  on each datanode and I really want to use it properly).
  Also, I've changed the mapred.tasktracker.reduce/map.tasks.maximum to 10
  (number of cores on the datanodes) and unfortunately I haven't seen any
  change on the performance or time duration of running jobs.
 
  Avi
 
  -Original Message-
  From: madhu phatak [mailto:phatak@gmail.com]
  Sent: Tuesday, June 21, 2011 12:33 PM
  To: common-user@hadoop.apache.org
  Subject: Re: Help with adjusting Hadoop configuration files
 
  The utilization of cluster depends upon the no of jobs and no of mappers
  and
  reducers.The configuration files only help u set up the cluster by
  specifying info .u can also specify some of details like block size and
  replication in configuration files  which may help you in job
  management.You
  can read all the available configuration properties here
  http://hadoop.apache.org/common/docs/current/cluster_setup.html
 
  On Tue, Jun 21, 2011 at 2:13 PM, Avi Vaknin avivakni...@gmail.com
 wrote:
 
   Hi Everyone,
   We are a start-up company has been using the Hadoop Cluster platform
   (version 0.20.2) on Amazon EC2 environment.
   We tried to setup a cluster using two different forms:
   Cluster 1: includes 1 master (namenode) + 5 datanodes - all of the
  machines
   are small EC2 instances (1.6 GB RAM)
   Cluster 2: includes 1 master (namenode) + 2 datanodes - the master is
a
   small EC2 instance and the other two datanodes are large EC2 instances
  (7.5
   GB RAM)
   We tried to make changes on the the configuration files (core-sit,
   hdfs-site
   and mapred-sit xml files) and we expected to see a significant
  improvement
   on the performance of the cluster 2,
   unfortunately this has yet to happen.
  
   Are there any special parameters on the configuration files that we
 need
  to
   change in order to adjust the Hadoop to a large hardware environment ?
   Are there any best practice you recommend?
  
   Thanks in advance.
  
   Avi
  
  
  
  
 
 
 







Poor scalability with map reduce application

2011-06-21 Thread Alberto Andreotti
Hello,

I'm working with an application to calculate the temperatures of a square
board. I divide the board into a mesh and represent the board as a list of
(key, value) pairs, with the key being the linear position of a cell within the
mesh and the value its temperature.
I distribute the data during the map and calculate the temperature for the next
step in the reduce. You can see a more detailed explanation here,

http://code.google.com/p/heat-transfer/source/browse/trunk/informe/Informe.pdf

but the basic idea is the one I have just mentioned.
The funny thing is that the more nodes I add, the slower it runs! With 7
nodes it takes 16 minutes, but with 4 nodes it takes only 8 minutes.
You can see the code in file HeatTransfer.java which is found here,

http://code.google.com/p/heat-transfer/source/browse/#svn%2Ftrunk%2Ffine%253Fstate%253Dclosed

thanks in advance!

Alberto.
-- 
José Pablo Alberto Andreotti.
Tel: 54 351 4730292
Móvil: 54351156526363.
MSN: albertoandreo...@gmail.com
Skype: andreottialberto


Re: Poor scalability with map reduce application

2011-06-21 Thread Alberto Andreotti
Hi Harsh,

thanks for your answer! The cluster is homogeneous: every node has the same
number of cores and memory and is equally reachable on the network. The data
is generated specifically for each run. I mean, I write the input data on 4
nodes for one run and on 7 nodes for another, so the input file will be
replicated on 4 nodes when running the MapReduce job with 4 nodes, and on 7
nodes when running it with 7.
I don't know if speculative maps are on; I'll check it. One thing I
observed is that reduces begin before all maps have finished. Let me also check
whether the difference is on the map side or in the reduce. I believe it's
balanced, both are slower when adding more nodes, but I'll confirm that.

I would appreciate any other comment,

thanks again


On 21 June 2011 13:33, Harsh J ha...@cloudera.com wrote:

 Alberto,

 Please add more practical-related info like if your cluster is
 homogenous, if the number of maps and reduces in both runs are
 consistent (i.e., same data and same amount of reducers on 4 vs. 7?),
 and if map speculatives are on. Also, do you notice difference of time
 for a single map task across the two runs? Or is the difference on the
 reduce task side?

 On Tue, Jun 21, 2011 at 8:33 PM, Alberto Andreotti
 albertoandreo...@gmail.com wrote:
  Hello,
 
  I'm working with an application to calculate the temperatures of a
 squared
  board. I divide the board in a mesh, and represent the board as a list of
  (key, value) pairs with a key being the linear position of a cell within
 the
  mesh, and the value its temperature.
  I distribute the data during the map and calculate the temperature for
 next
  step in the reduce. You can see a more detailed explanation here,
 
 
 http://code.google.com/p/heat-transfer/source/browse/trunk/informe/Informe.pdf
 
  but the basic idea is the one I have just mentioned.
  The funny thing is that the more nodes I add the slower it runs!. With 7
  nodes it takes 16 minutes, but with 4 nodes it takes only 8 minutes.
  You can see the code in file HeatTransfer.java which is found here,
 
 
 http://code.google.com/p/heat-transfer/source/browse/#svn%2Ftrunk%2Ffine%253Fstate%253Dclosed
 
  thanks in advance!
 
  Alberto.
  --
  José Pablo Alberto Andreotti.
  Tel: 54 351 4730292
  Móvil: 54351156526363.
  MSN: albertoandreo...@gmail.com
  Skype: andreottialberto
 



 --
 Harsh J




-- 
José Pablo Alberto Andreotti.
Tel: 54 351 4730292
Móvil: 54351156526363.
MSN: albertoandreo...@gmail.com
Skype: andreottialberto


RE: Poor scalability with map reduce application

2011-06-21 Thread GOEKE, MATTHEW (AG/1000)
Harsh,

Is it possible for mapred.reduce.slowstart.completed.maps to even play a 
significant role in this? The only benefit he would find in tweaking that for 
his problem would be to spread network traffic from the shuffle over a longer 
period of time at a cost of having the reducer using resources earlier. Either 
way he would see this effect across both sets of runs if he is using the 
default parameters. I guess it would all depend on what kind of network layout 
the cluster is on.

Matt

-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Tuesday, June 21, 2011 12:09 PM
To: common-user@hadoop.apache.org
Subject: Re: Poor scalability with map reduce application

Alberto,

On Tue, Jun 21, 2011 at 10:27 PM, Alberto Andreotti
albertoandreo...@gmail.com wrote:
 I don't know if speculatives maps are on, I'll check it. One thing I
 observed is that reduces begin before all maps have finished. Let me check
 also if the difference is on the map side or in the reduce. I believe it's
 balanced, both are slower when adding more nodes, but i'll confirm that.

Maps and reduces are speculative by default, so must've been ON. Could
you also post a general input vs. output record counts and statistics
like that between your job runs, to correlate?

The reducers get scheduled early but do not exactly reduce() until
all maps are done. They just keep fetching outputs. Their scheduling
can be controlled with some configurations (say, to start only after
X% of maps are done -- by default it starts up when 5% of maps are
done).

-- 
Harsh J



Re: Poor scalability with map reduce application

2011-06-21 Thread Alberto Andreotti
Thank you guys, I really appreciate your answers. I don't have access to the
cluster right now; I'll check the info you are asking for and come back in a
couple of hours.
BTW, I tried the app on two clusters with similar results. I'm using 0.21.0.

thanks again, Alberto.

On 21 June 2011 14:16, GOEKE, MATTHEW (AG/1000)
matthew.go...@monsanto.comwrote:

 Harsh,

 Is it possible for mapred.reduce.slowstart.completed.maps to even play a
 significant role in this? The only benefit he would find in tweaking that
 for his problem would be to spread network traffic from the shuffle over a
 longer period of time at a cost of having the reducer using resources
 earlier. Either way he would see this effect across both sets of runs if he
 is using the default parameters. I guess it would all depend on what kind of
 network layout the cluster is on.

 Matt

 -Original Message-
 From: Harsh J [mailto:ha...@cloudera.com]
 Sent: Tuesday, June 21, 2011 12:09 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Poor scalability with map reduce application

 Alberto,

 On Tue, Jun 21, 2011 at 10:27 PM, Alberto Andreotti
 albertoandreo...@gmail.com wrote:
  I don't know if speculatives maps are on, I'll check it. One thing I
  observed is that reduces begin before all maps have finished. Let me
 check
  also if the difference is on the map side or in the reduce. I believe
 it's
  balanced, both are slower when adding more nodes, but i'll confirm that.

 Maps and reduces are speculative by default, so must've been ON. Could
 you also post a general input vs. output record counts and statistics
 like that between your job runs, to correlate?

 The reducers get scheduled early but do not exactly reduce() until
 all maps are done. They just keep fetching outputs. Their scheduling
 can be controlled with some configurations (say, to start only after
 X% of maps are done -- by default it starts up when 5% of maps are
 done).

 --
 Harsh J




-- 
José Pablo Alberto Andreotti.
Tel: 54 351 4730292
Móvil: 54351156526363.
MSN: albertoandreo...@gmail.com
Skype: andreottialberto


Re: Poor scalability with map reduce application

2011-06-21 Thread Alberto Andreotti
I saw that the link I sent you may not be working; please take a look here
to see what it is all about:

https://docs.google.com/viewer?a=vpid=explorerchrome=truesrcid=0B5AOpwg8IzVANjJlODZhZDctNWUzMS00MmNhLWI3OWMtMWNhMTdjODQwNjVlhl=en_US


thanks again!

On 21 June 2011 14:22, Alberto Andreotti albertoandreo...@gmail.com wrote:

 Thank you guys, I really appreciate your answers. I don't have access to
 the cluster right now, I'll check the info you are asking and come back in a
 couple of hours.
 BTW, I tried the app on two clusters with similar results. I'm using
 0.21.0.

 thanks again, Alberto.


 On 21 June 2011 14:16, GOEKE, MATTHEW (AG/1000) 
 matthew.go...@monsanto.com wrote:

 Harsh,

 Is it possible for mapred.reduce.slowstart.completed.maps to even play a
 significant role in this? The only benefit he would find in tweaking that
 for his problem would be to spread network traffic from the shuffle over a
 longer period of time at a cost of having the reducer using resources
 earlier. Either way he would see this effect across both sets of runs if he
 is using the default parameters. I guess it would all depend on what kind of
 network layout the cluster is on.

 Matt

 -Original Message-
 From: Harsh J [mailto:ha...@cloudera.com]
 Sent: Tuesday, June 21, 2011 12:09 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Poor scalability with map reduce application

 Alberto,

 On Tue, Jun 21, 2011 at 10:27 PM, Alberto Andreotti
 albertoandreo...@gmail.com wrote:
  I don't know if speculatives maps are on, I'll check it. One thing I
  observed is that reduces begin before all maps have finished. Let me
 check
  also if the difference is on the map side or in the reduce. I believe
 it's
  balanced, both are slower when adding more nodes, but i'll confirm that.

 Maps and reduces are speculative by default, so must've been ON. Could
 you also post a general input vs. output record counts and statistics
 like that between your job runs, to correlate?

 The reducers get scheduled early but do not exactly reduce() until
 all maps are done. They just keep fetching outputs. Their scheduling
 can be controlled with some configurations (say, to start only after
 X% of maps are done -- by default it starts up when 5% of maps are
 done).

 --
 Harsh J




 --
 José Pablo Alberto Andreotti.
 Tel: 54 351 4730292
 Móvil: 54351156526363.
 MSN: albertoandreo...@gmail.com
 Skype: andreottialberto




-- 
José Pablo Alberto Andreotti.
Tel: 54 351 4730292
Móvil: 54351156526363.
MSN: albertoandreo...@gmail.com
Skype: andreottialberto


Re: Append to Existing File

2011-06-21 Thread jagaran das
Hi All,

Does CDH3 support appending to an existing file?

Regards,
Jagaran 




From: Eric Charles eric.char...@u-mangate.com
To: common-user@hadoop.apache.org
Sent: Tue, 21 June, 2011 3:53:33 AM
Subject: Re: Append to Existing File

When you say bugs pending, are your refering to HDFS-265 (which links 
to HDFS-1060, HADOOP-6239 and HDFS-744?

Are there other issues related to append than the one above?

Tks, Eric

https://issues.apache.org/jira/browse/HDFS-265


On 21/06/11 12:36, madhu phatak wrote:
 Its not stable . There are some bugs pending . According one of the
 disccusion till date the append is not ready for production.

 On Tue, Jun 14, 2011 at 12:19 AM, jagaran dasjagaran_...@yahoo.co.inwrote:

 I am using hadoop-0.20.203.0 version.
 I have set

 dfs.support.append to true and then using append method

 It is working but need to know how stable it is to deploy and use in
 production
 clusters ?

 Regards,
 Jagaran



 
 From: jagaran dasjagaran_...@yahoo.co.in
 To: common-user@hadoop.apache.org
 Sent: Mon, 13 June, 2011 11:07:57 AM
 Subject: Append to Existing File

 Hi All,

 Is append to an existing file is now supported in Hadoop for production
 clusters?
 If yes, please let me know which version and how

 Thanks
 Jagaran



-- 
Eric


Re: Poor scalability with map reduce application

2011-06-21 Thread Harsh J
Matt,

You're right that it (slowstart) does not / would not affect much. I
was merely explaining the reason behind his observation of reducers
getting scheduled early, not really recommending a tweak for
performance changes there.

On Tue, Jun 21, 2011 at 10:46 PM, GOEKE, MATTHEW (AG/1000)
matthew.go...@monsanto.com wrote:
 Harsh,

 Is it possible for mapred.reduce.slowstart.completed.maps to even play a 
 significant role in this? The only benefit he would find in tweaking that for 
 his problem would be to spread network traffic from the shuffle over a longer 
 period of time at a cost of having the reducer using resources earlier. 
 Either way he would see this effect across both sets of runs if he is using 
 the default parameters. I guess it would all depend on what kind of network 
 layout the cluster is on.

 Matt

 -Original Message-
 From: Harsh J [mailto:ha...@cloudera.com]
 Sent: Tuesday, June 21, 2011 12:09 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Poor scalability with map reduce application

 Alberto,

 On Tue, Jun 21, 2011 at 10:27 PM, Alberto Andreotti
 albertoandreo...@gmail.com wrote:
 I don't know if speculatives maps are on, I'll check it. One thing I
 observed is that reduces begin before all maps have finished. Let me check
 also if the difference is on the map side or in the reduce. I believe it's
 balanced, both are slower when adding more nodes, but i'll confirm that.

 Maps and reduces are speculative by default, so must've been ON. Could
 you also post a general input vs. output record counts and statistics
 like that between your job runs, to correlate?

 The reducers get scheduled early but do not exactly reduce() until
 all maps are done. They just keep fetching outputs. Their scheduling
 can be controlled with some configurations (say, to start only after
 X% of maps are done -- by default it starts up when 5% of maps are
 done).

 --
 Harsh J





-- 
Harsh J


Re: Append to Existing File

2011-06-21 Thread Joey Echeverria
Yes.

-Joey
On Jun 21, 2011 1:47 PM, jagaran das jagaran_...@yahoo.co.in wrote:
 Hi All,

 Does CDH3 support Existing File Append ?

 Regards,
 Jagaran



 
 From: Eric Charles eric.char...@u-mangate.com
 To: common-user@hadoop.apache.org
 Sent: Tue, 21 June, 2011 3:53:33 AM
 Subject: Re: Append to Existing File

 When you say bugs pending, are your refering to HDFS-265 (which links
 to HDFS-1060, HADOOP-6239 and HDFS-744?

 Are there other issues related to append than the one above?

 Tks, Eric

 https://issues.apache.org/jira/browse/HDFS-265


 On 21/06/11 12:36, madhu phatak wrote:
 Its not stable . There are some bugs pending . According one of the
 disccusion till date the append is not ready for production.

 On Tue, Jun 14, 2011 at 12:19 AM, jagaran dasjagaran_...@yahoo.co.in
wrote:

 I am using hadoop-0.20.203.0 version.
 I have set

 dfs.support.append to true and then using append method

 It is working but need to know how stable it is to deploy and use in
 production
 clusters ?

 Regards,
 Jagaran



 
 From: jagaran dasjagaran_...@yahoo.co.in
 To: common-user@hadoop.apache.org
 Sent: Mon, 13 June, 2011 11:07:57 AM
 Subject: Append to Existing File

 Hi All,

 Is append to an existing file is now supported in Hadoop for production
 clusters?
 If yes, please let me know which version and how

 Thanks
 Jagaran



 --
 Eric


Deserializing a MapWritable entry set.

2011-06-21 Thread Dhruv Kumar
I want to extract the key-value pairs from a MapWritable, cast them into
Integer (key) and Double (value) types, and add them to another collection.
I'm attempting the following but this code is incorrect.

// initialDistributionStripe is a MapWritable<IntWritable, DoubleWritable>
// initialProbabilities is of type Vector which can have (Integer, Double)
// entries in it

for (Map.Entry<Writable, Writable> entry : initialDistributionStripe.entrySet()) {
  initialProbabilities.set(entry.getKey(), entry.getValue());
}


Is there a convenient way to do this?


Re: Deserializing a MapWritable entry set.

2011-06-21 Thread Alberto Andreotti
Never worked with maps before, btw what are you trying to calculate?


alberto.

On 21 June 2011 17:14, Dhruv Kumar dku...@ecs.umass.edu wrote:

 I want to extract the key-value pairs from a MapWritable, cast them into
 Integer (key) and Double (value) types, and add them to another collection.
 I'm attempting the following but this code is incorrect.

 // initialDistributionStripe is a MapWritableIntWritable, DoubleWritable
 // initialProbabilities is of type  Vector which can have (Integer, Double)
 entries in it

 for (Map.EntryWritable, Writable entry : initialDistributionStripe.
 entrySet()) {
  initialProbabilities.set(entry.getKey(), entry.getValue());
}


 Is there a convenient way to do this?




-- 
José Pablo Alberto Andreotti.
Tel: 54 351 4730292
Móvil: 54351156526363.
MSN: albertoandreo...@gmail.com
Skype: andreottialberto


Re: Deserializing a MapWritable entry set.

2011-06-21 Thread Harsh J
Dhruv,

If the <Writable, Writable> entries are IntWritable and DoubleWritable
underneath, simply cast them to those types after
get{Key,Value}() and then use the appropriate method to get the
underlying value (a simple .get() in most cases).

Is this what you're looking for?

On Wed, Jun 22, 2011 at 1:44 AM, Dhruv Kumar dku...@ecs.umass.edu wrote:
 I want to extract the key-value pairs from a MapWritable, cast them into
 Integer (key) and Double (value) types, and add them to another collection.
 I'm attempting the following but this code is incorrect.

 // initialDistributionStripe is a MapWritableIntWritable, DoubleWritable
 // initialProbabilities is of type  Vector which can have (Integer, Double)
 entries in it

 for (Map.EntryWritable, Writable entry : initialDistributionStripe.
 entrySet()) {
      initialProbabilities.set(entry.getKey(), entry.getValue());
    }


 Is there a convenient way to do this?




-- 
Harsh J


Re: Deserializing a MapWritable entry set.

2011-06-21 Thread Dhruv Kumar
The exact problem I'm facing is the following:

entry.getKey() and entry.getValue() return Writable types. How do I extract
the buried int and double? If the return types were IntWritable and DoubleWritable,
I could have used entry.getKey().get() and entry.getValue().get() and
it would have been fine.

On Tue, Jun 21, 2011 at 4:18 PM, Alberto Andreotti 
albertoandreo...@gmail.com wrote:

 Never worked with maps before, btw what are you trying to calculate?


There is no calculation in this loop, it is just a conversion from one type
(MapWritable) produced by the reducer(s) to another type (Vector) which can
be consumed by some legacy code for actual processing.



 alberto.

 On 21 June 2011 17:14, Dhruv Kumar dku...@ecs.umass.edu wrote:

  I want to extract the key-value pairs from a MapWritable, cast them into
  Integer (key) and Double (value) types, and add them to another
 collection.
  I'm attempting the following but this code is incorrect.
 
  // initialDistributionStripe is a MapWritableIntWritable,
 DoubleWritable
  // initialProbabilities is of type  Vector which can have (Integer,
 Double)
  entries in it
 
  for (Map.EntryWritable, Writable entry : initialDistributionStripe.
  entrySet()) {
   initialProbabilities.set(entry.getKey(), entry.getValue());
 }
 
 
  Is there a convenient way to do this?
 



 --
 José Pablo Alberto Andreotti.
 Tel: 54 351 4730292
 Móvil: 54351156526363.
 MSN: albertoandreo...@gmail.com
 Skype: andreottialberto



Re: Deserializing a MapWritable entry set.

2011-06-21 Thread Harsh J
((IntWritable) entry.getKey()).get(); and similar.
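
Putting that into your loop (a sketch; it assumes initialProbabilities exposes
a set(int, double) method, e.g. a Mahout-style Vector):

for (Map.Entry<Writable, Writable> entry : initialDistributionStripe.entrySet()) {
  int state = ((IntWritable) entry.getKey()).get();
  double probability = ((DoubleWritable) entry.getValue()).get();
  initialProbabilities.set(state, probability);
}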

On Wed, Jun 22, 2011 at 2:00 AM, Dhruv Kumar dku...@ecs.umass.edu wrote:
 The exact problem I'm facing is the following:

 entry.getKey() and entry.getValue() return Writable types. How do I extract
 the buried int and double? In case of IntWritable and DoubleWritable return
 types, I could have used entry.getKey().get() and entry.getValue().get() and
 it would have been fine.

 On Tue, Jun 21, 2011 at 4:18 PM, Alberto Andreotti 
 albertoandreo...@gmail.com wrote:

 Never worked with maps before, btw what are you trying to calculate?


 There is no calculation in this loop, it is just a conversion from one type
 (MapWritable) produced by the reducer(s) to another type (Vector) which can
 be consumed by some legacy code for actual processing.



 alberto.

 On 21 June 2011 17:14, Dhruv Kumar dku...@ecs.umass.edu wrote:

  I want to extract the key-value pairs from a MapWritable, cast them into
  Integer (key) and Double (value) types, and add them to another
 collection.
  I'm attempting the following but this code is incorrect.
 
  // initialDistributionStripe is a MapWritable<IntWritable,
 DoubleWritable>
  // initialProbabilities is of type Vector which can have (Integer,
 Double)
  entries in it
 
  for (Map.Entry<Writable, Writable> entry : initialDistributionStripe.
  entrySet()) {
       initialProbabilities.set(entry.getKey(), entry.getValue());
     }
 
 
  Is there a convenient way to do this?
 



 --
 José Pablo Alberto Andreotti.
 Tel: 54 351 4730292
 Móvil: 54351156526363.
 MSN: albertoandreo...@gmail.com
 Skype: andreottialberto





-- 
Harsh J


Configuration settings

2011-06-21 Thread Mark
We have a small 4-node cluster whose nodes have 12 GB of RAM and quad-core
Xeon CPUs.


I'm assuming the defaults aren't that generous, so what configuration changes
should I make to take advantage of this hardware? Max map tasks? Max reduce
tasks? Anything else?


Thanks


Re: Configuration settings

2011-06-21 Thread Sonal Goyal
Hi Mark,

You can take a look at
http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
and
http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/
to configure your cluster. Along with the task slots, you can change the child
JVM heap size, the xceivers setting, etc. A good practice is to understand what
kind of map reduce programming you will be doing and whether your tasks are CPU
bound or memory bound, and change your base cluster settings accordingly.
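
As a rough starting point only (the right values depend on your jobs and on
whether they are CPU or memory bound), the slot and heap knobs live in
mapred-site.xml and could look something like this on a 12 GB quad-core node;
treat the numbers as placeholders to tune, not as recommendations:

  <!-- mapred-site.xml: illustrative values only -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value>    <!-- map slots per node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>    <!-- reduce slots per node -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>  <!-- per-task JVM heap; slots x heap must fit in node RAM -->
  </property>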

Best Regards,
Sonal
Hadoop ETL and Data Integration: https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Wed, Jun 22, 2011 at 6:16 AM, Mark static.void@gmail.com wrote:

 We have a small 4-node cluster whose nodes have 12 GB of RAM and quad-core
 Xeon CPUs.

 I'm assuming the defaults aren't that generous, so what configuration changes
 should I make to take advantage of this hardware? Max map tasks? Max reduce
 tasks? Anything else?

 Thanks



TableOutputFormat less efficient than direct HBase API calls?

2011-06-21 Thread edward choi
Hi,

I am writing a Hadoop application that uses HBase as both source and sink.

There is no reducer job in my application.

I am using TableOutputFormat as the OutputFormatClass.

I read on the Internet that it is experimentally faster to directly
instantiate an HTable and use HTable.batch() in the map task
than to use TableOutputFormat as the map's output format.

So I looked into the source code,
org.apache.hadoop.hbase.mapreduce.TableOutputFormat.
It looks like TableRecordWriter does not support batch updates, since
TableRecordWriter.write() calls HTable.put(new Put()) for each record.

Am I right on this matter? Or does TableOutputFormat automatically do batch
updates somehow?
Or is there a specific way to do batch updates with TableOutputFormat?
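
For context, the direct-API pattern I read about looks roughly like the sketch
below: open the HTable in setup(), buffer puts on the client side, and flush in
cleanup(). Table, family and qualifier names are placeholders, and this is only
my reading of the suggested pattern, not a claim about what TableOutputFormat
does internally:

  import java.io.IOException;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class DirectHBaseMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
      table = new HTable(HBaseConfiguration.create(context.getConfiguration()),
          "my_table");                              // placeholder table name
      table.setAutoFlush(false);                    // buffer puts client-side
      table.setWriteBufferSize(4 * 1024 * 1024);    // flush roughly every 4 MB
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException {
      Put put = new Put(Bytes.toBytes(key.get()));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"),
          Bytes.toBytes(value.toString()));
      table.put(put);                               // goes into the write buffer
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      table.flushCommits();                         // push any remaining buffered puts
      table.close();
    }
  }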

Any explanation is greatly appreciated.

Ed


Re: Append to Existing File

2011-06-21 Thread Todd Lipcon
On Tue, Jun 21, 2011 at 11:53 AM, Joey Echeverria j...@cloudera.com wrote:
 Yes.

Sort-of kind-of...

We support it only for the use case that HBase needs it for. Mostly, we
support sync(), which was implemented at the same time.

I know of several bugs in existing-file-append in CDH3 and 0.20-append.

-Todd
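
For anyone experimenting with it anyway, the basic shape of the API is roughly
the sketch below: set dfs.support.append to true in hdfs-site.xml and then call
FileSystem.append() on a file that already exists. This only illustrates the
call sequence on 0.20-append/CDH3-era APIs (the path is a placeholder); it is
not an endorsement of running append in production:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class AppendSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // The file must already exist; append() reopens it for writing at the end.
      FSDataOutputStream out = fs.append(new Path("/user/jagaran/events.log"));
      out.writeBytes("one more record\n");
      out.sync();   // durability call in 0.20-append/CDH3; later releases call this hflush()
      out.close();
    }
  }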


 -Joey
 On Jun 21, 2011 1:47 PM, jagaran das jagaran_...@yahoo.co.in wrote:
 Hi All,

 Does CDH3 support Existing File Append ?

 Regards,
 Jagaran



 
 From: Eric Charles eric.char...@u-mangate.com
 To: common-user@hadoop.apache.org
 Sent: Tue, 21 June, 2011 3:53:33 AM
 Subject: Re: Append to Existing File

 When you say bugs pending, are you referring to HDFS-265 (which links
 to HDFS-1060, HADOOP-6239 and HDFS-744)?

 Are there other issues related to append than the one above?

 Tks, Eric

 https://issues.apache.org/jira/browse/HDFS-265


 On 21/06/11 12:36, madhu phatak wrote:
 It's not stable. There are some bugs pending. According to one of the
 discussions, to date append is not ready for production.

 On Tue, Jun 14, 2011 at 12:19 AM, jagaran dasjagaran_...@yahoo.co.in
wrote:

 I am using version hadoop-0.20.203.0.
 I have set dfs.support.append to true and am then using the append method.

 It is working, but I need to know how stable it is to deploy and use in
 production clusters.

 Regards,
 Jagaran



 
 From: jagaran dasjagaran_...@yahoo.co.in
 To: common-user@hadoop.apache.org
 Sent: Mon, 13 June, 2011 11:07:57 AM
 Subject: Append to Existing File

 Hi All,

 Is append to an existing file now supported in Hadoop for production
 clusters?
 If yes, please let me know which version and how

 Thanks
 Jagaran



 --
 Eric




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: ClassNotFoundException while running quick start guide on Windows.

2011-06-21 Thread Drew Gross
Thanks Jeff, it was a problem with JAVA_HOME. I have another problem
now though, I have this:

$JAVA:  /cygdrive/c/Program Files/Java/jdk1.6.0_26/bin/java
$JAVA_HEAP_MAX:  -Xmx1000m
$HADOOP_OPTS:  -Dhadoop.log.dir=C:\Users\Drew
Gross\Documents\Projects\discom\hadoop-0.21.0\logs
-Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=C:\Users\Drew
Gross\Documents\Projects\discom\hadoop-0.21.0\ -Dhadoop.id.str=
-Dhadoop.root.logger=INFO,console
 -Djava.library.path=/cygdrive/c/Users/Drew
Gross/Documents/Projects/discom/hadoop-0.21.0/lib/native/
-Dhadoop.policy.file=hadoop-policy.xml
$CLASS:  org.apache.hadoop.util.RunJar
Exception in thread main java.lang.NoClassDefFoundError:
Gross\Documents\Projects\discom\hadoop-0/21/0\logs
Caused by: java.lang.ClassNotFoundException:
Gross\Documents\Projects\discom\hadoop-0.21.0\logs
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class:
Gross\Documents\Projects\discom\hadoop-0.21.0\logs.  Program will
exit.

(This is with some extra debugging info added by me in bin/hadoop)

It looks like the Windows-style file names are causing problems,
especially the spaces. Has anyone encountered this before and figured out
how to fix it? I tried escaping the spaces and surrounding the file paths
with quotes (not at the same time), but that didn't help.
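
One workaround I'm considering, based on the general advice to keep every path
Hadoop touches free of spaces, is to unpack the tarball under a space-free
directory and point hadoop-env.sh at space-free Cygwin paths, roughly like this
(the exact paths are just placeholders):

  # conf/hadoop-env.sh (illustrative, space-free paths only)
  export JAVA_HOME=/cygdrive/c/Java/jdk1.6.0_26    # JDK installed somewhere without spaces
  export HADOOP_LOG_DIR=/cygdrive/c/hadoop/logs    # keep logs out of "Users\Drew Gross\..." paths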

Drew


On Tue, Jun 21, 2011 at 6:24 AM, madhu phatak phatak@gmail.com wrote:

 I think the jar has some issues where it's not able to read the main class
 from the manifest. Try unjarring the jar, check which main class is listed in
 META-INF/MANIFEST.MF, and then run as follows:

  bin/hadoop jar hadoop-*-examples.jar <fully qualified main class> grep input
 output 'dfs[a-z.]+'
 On Thu, Jun 16, 2011 at 10:23 AM, Drew Gross drew.a.gr...@gmail.com wrote:

  Hello,
 
  I'm trying to run the example from the quick start guide on Windows and I
  get this error:
 
  $ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
  Exception in thread main java.lang.NoClassDefFoundError:
  Caused by: java.lang.ClassNotFoundException:
         at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
         at java.security.AccessController.doPrivileged(Native Method)
         at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
  Could not find the main class: .  Program will exit.
  Exception in thread main java.lang.NoClassDefFoundError:
  Gross\Documents\Projects\discom\hadoop-0/21/0\logs
  Caused by: java.lang.ClassNotFoundException:
  Gross\Documents\Projects\discom\hadoop-0.21.0\logs
         at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
         at java.security.AccessController.doPrivileged(Native Method)
         at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
  Could not find the main class:
  Gross\Documents\Projects\discom\hadoop-0.21.0\logs.  Program will exit.
 
  Does anyone know what I need to change?
 
  Thank you.
 
  From, Drew
 
  --
  Forget the environment. Print this e-mail immediately. Then burn it.
 



--
Forget the environment. Print this e-mail immediately. Then burn it.


Hadoop eclipse plugin stopped working after replacing hadoop-0.20.2 jar files with hadoop-0.20-append jar files

2011-06-21 Thread praveenesh kumar
Guys,
I was using the hadoop eclipse plugin on a hadoop 0.20.2 cluster.
It was working fine for me.
I was using Eclipse SDK Helios 3.6.2 with the plugin
hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar downloaded from JIRA
MAPREDUCE-1280

Now, for the HBase installation, I had to use hadoop-0.20-append compiled
jars, so I replaced the old jar files with the new 0.20-append compiled
jar files.
But now, after replacing them, my hadoop eclipse plugin is not working for
me any more.
Whenever I am trying to connect to my hadoop master node from that and try
to see DFS locations..
it is giving me the following error:
Error: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version
mismatch (client 41, server 43)

However, the hadoop cluster itself is working fine if I go directly to the
hadoop namenode and use hadoop commands:
I can add files to HDFS and run jobs from there, and the HDFS web console and
Map-Reduce web console are also working fine. I am just not able to use my
previous hadoop eclipse plugin.

Any suggestions or help for this issue ?

Thanks,
Praveenesh