Re: Hadoop and Ubuntu / Java

2012-04-19 Thread madhu phatak
As per Oracle, going forward OpenJDK will be the official Oracle JDK for Linux,
which means OpenJDK will be the same as the official one.

On Tue, Dec 20, 2011 at 9:12 PM, hadoopman hadoop...@gmail.com wrote:


 http://www.omgubuntu.co.uk/2011/12/java-to-be-removed-from-ubuntu-uninstalled-from-user-machines/

 I'm curious what this will mean for Hadoop on Ubuntu systems moving
 forward.  I tried OpenJDK with Hadoop nearly two years ago.  Needless to
 say, it was a real problem.

 Hopefully we can still download it from the Sun/Oracle web site and still
 use it.  Won't be the same though :/




-- 
https://github.com/zinnia-phatak-dev/Nectar


Re: How to rebuild NameNode from DataNode.

2012-04-19 Thread Michel Segel
Switch to MapR M5?
:-)

Just kidding. 
A simple way of solving this pre-CDH4:
NFS-mount a directory from your SN and add it to your list of checkpoint
directories.
You may lose some data, but you should be able to rebuild.

There is more to this, but that's the basic idea of how to keep a copy of your
metadata.
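For illustration only, a rough sketch of that idea in code (fs.checkpoint.dir is the Hadoop 1.x property for checkpoint directories; the paths are hypothetical, and in practice the value would normally be set in hdfs-site.xml rather than programmatically):

import org.apache.hadoop.conf.Configuration;

public class CheckpointDirsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Keep checkpoints on a local volume *and* on an NFS mount exported by
        // another machine, so the metadata survives loss of the NameNode host.
        conf.set("fs.checkpoint.dir",
                 "/data/1/dfs/namesecondary,/mnt/nfs/dfs/namesecondary");
        System.out.println("fs.checkpoint.dir = " + conf.get("fs.checkpoint.dir"));
    }
}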


Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 18, 2012, at 11:48 PM, Saburo Fujioka fuji...@do-it.co.jp wrote:

 Hello,
 
 I am currently drafting countermeasures for operational failures
 of a system.
 
 If the NameNode is lost, I would like to rebuild it from the
 remaining DataNodes, but at the moment I do not know how.
 
 As a check, I confirmed that the namespaceID in dfs/name/current/VERSION
 was consistent with that of the DataNodes, but the fsimage was not
 rebuilt and "hadoop dfs -ls" did not display anything.
 
 The risk of losing the NameNode is low because it is protected by
 Corosync + Pacemaker + DRBD.
 Even though it is a rare case, I need to clarify the means to
 reconstruct the NameNode from the DataNodes.
 
 Do you know how to do this?
 
 
 I am using hadoop 1.0.1.
 
 Thank you very much,
 
 


Algorithms used in fairscheduler 0.20.205

2012-04-19 Thread Merto Mertek
The closest documentation matching the current implementation of
the fair scheduler seems to be this document from Matei Zaharia et al.:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-55.html
Another document, on delay scheduling, can be found from 2010.

a) I am interested in whether there exists any newer documented version of the
implementation.
b) Besides delay scheduling, the copy-compute splitting algorithm and the
fair-share calculation algorithm, are there any other algorithms
that are important for cluster performance and fair sharing?
c) Is there any connection between copy-compute splitting and the
MapReduce phases (copy-sort-reduce)?

Thank you..


JobControl run

2012-04-19 Thread Juan Pino
Hi,

I wonder why, when I call run() on a JobControl object, it loops forever.
Namely, this code doesn't work:

JobControl jobControl = new JobControl("Name");
// some stuff here (add jobs and dependencies)
jobControl.run();

This code works but looks a bit ugly:

JobControl jobControl = new JobControl("Name");
// some stuff here (add jobs and dependencies)
Thread control = new Thread(jobControl);
control.start();
while (!jobControl.allFinished()) {
    try {
        Thread.sleep(5000);
    } catch (Exception e) {}
}

I wonder if the run method in the JobControl class could add the following
condition to break the while(true) loop:

if (jobsInProgress.isEmpty()) {
    break;
}
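For reference, a minimal sketch of that polling workaround wrapped in a helper, assuming the org.apache.hadoop.mapred.jobcontrol API and that jobs and dependencies have already been added; the final stop() asks the run() loop to exit:

import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class JobControlRunner {
    /** Runs a JobControl in a background thread and polls until all jobs finish. */
    public static void runAll(JobControl jobControl) throws InterruptedException {
        Thread control = new Thread(jobControl);
        control.setDaemon(true);   // a hung monitor thread should not keep the JVM alive
        control.start();
        while (!jobControl.allFinished()) {
            Thread.sleep(5000);    // poll every 5 seconds
        }
        jobControl.stop();         // signal the run() loop to terminate
    }
}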

Thanks very much,

Juan


Troubleshoot job failures after upgrade to 1.0.2

2012-04-19 Thread Filippo Diotalevi
Hi, 
I have an application on Cassandra 1.0.8 + Hadoop. It was previously running on the
Cloudera distribution hadoop-0.20.2-cdh3u1; today I tried to upgrade to Hadoop
1.0.2 but ran into consistent job failures.

The hadoop userlogs seem quite clear:

2012-04-19 18:21:09.837 java[57837:1903] Unable to load realm info from 
SCDynamicStore
Apr 19, 2012 6:21:10 PM org.apache.hadoop.util.NativeCodeLoader clinit
WARNING: Unable to load native-hadoop library for your platform... using 
builtin-java classes where applicable
Apr 19, 2012 6:21:10 PM org.apache.hadoop.mapred.Child main
SEVERE: Error running child : java.lang.NoClassDefFoundError: Could not 
initialize class org.apache.log4j.LogManager
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:334)
at org.apache.hadoop.mapred.Child.main(Child.java:229)


In particular, the "SEVERE: Error running child :
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.log4j.LogManager" seems to be the most likely cause.

Can you help me troubleshoot the cause of the failure?  

Thanks,
-- 
Filippo Diotalevi



Re: Pre-requisites for hadoop 0.23/CDH4

2012-04-19 Thread Arun C Murthy
You are better off trying hadoop-0.23.1 or even hadoop-0.23.2-rc0, since CDH4's
version of YARN is very incomplete and you might get nasty surprises there.

Settings:

# Run 1 NodeManager with yarn.nodemanager.resource.memory-mb set to 1024
# Use the CapacityScheduler (significantly better tested) by setting 
yarn.resourcemanager.scheduler.class to 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
# Set the minimum container size to 128M or 256M (up to you) by setting 
yarn.scheduler.capacity.minimum-allocation-mb
# Set the maximum container size to 1024M (the max given to the NM) by setting 
yarn.scheduler.capacity.maximum-allocation-mb
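A minimal sketch of those settings expressed on a Configuration object (they would normally go in yarn-site.xml and capacity-scheduler.xml; 256 MB is just one of the suggested minimums):

import org.apache.hadoop.conf.Configuration;

public class SmallVmYarnSettings {
    /** Applies the small-VM settings suggested above. */
    public static void apply(Configuration conf) {
        conf.setInt("yarn.nodemanager.resource.memory-mb", 1024);
        conf.set("yarn.resourcemanager.scheduler.class",
            "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
        conf.setInt("yarn.scheduler.capacity.minimum-allocation-mb", 256);
        conf.setInt("yarn.scheduler.capacity.maximum-allocation-mb", 1024);
    }
}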

Arun

On Apr 18, 2012, at 8:00 PM, praveenesh kumar wrote:

 Hi,
 
 Sweet. Can you please elaborate on how I can tweak my configs to make
 CDH4/hadoop-0.23 run in a 1.5GB RAM VM?
 
 Regards,
 Praveenesh
 
 On Wed, Apr 18, 2012 at 8:42 AM, Harsh J ha...@cloudera.com wrote:
 
 Praveenesh,
 
 Speaking minimally (and thereby requiring less tweaks on your end),
 1.5 GB would be a good value to use for RAM if available (1.0 will do
 too, if you make sure to tweak your configs to not use too much heap
 memory). Single processor should do fine for testing purposes.
 
 On Tue, Apr 17, 2012 at 8:51 PM, praveenesh kumar praveen...@gmail.com
 wrote:
 I am looking to test hadoop 0.23 or CDH4 beta on my local VM. I am
 looking
 to execute the sample example codes in new architecture, play around with
 the containers/resource managers.
 Is there any pre-requisite on default memory/CPU/core settings I need to
 keep in mind before setting up the VM.
 
 Regards,
 Praveenesh
 
 
 
 --
 Harsh J
 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: Pre-requisites for hadoop 0.23/CDH4

2012-04-19 Thread praveenesh kumar
Thanks Arun,
Will try out those settings. Is there any good documentation on
configuring/playing with hadoop 0.23 apart from the apache hadoop-0.23
page? I have already looked into that page. Just wondering whether there is
something more that I don't know about.

Regards,
Praveenesh

On Fri, Apr 20, 2012 at 12:45 AM, Arun C Murthy a...@hortonworks.com wrote:

 You are better of trying hadoop-0.23.1 or even hadoop-0.23.2-rc0 since
 CDH4's version of YARN is very incomplete and you might get nasty surprises
 there.

 Settings:

 # Run 1 NodeManager with yarn.nodemanager.resource.memory-mb - 1024
 # Use CapacityScheduler (significantly better tested) by setting
 yarn.resourcemanager.scheduler.class to
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
 # Set min container size to 128M or 256M (upto you) by setting
 yarn.scheduler.capacity.minimum-allocation-mb
 # Set max container size to 1024M (max given to NM) by setting
 yarn.scheduler.capacity.maximum-allocation-mb

 Arun

 On Apr 18, 2012, at 8:00 PM, praveenesh kumar wrote:

  Hi,
 
  Sweet.. Can you please elaborate how can I tweak my configs to make
  CDH4/hadoop-0.23 run in 1.5GB RAM VM.
 
  Regards,
  Praveenesh
 
  On Wed, Apr 18, 2012 at 8:42 AM, Harsh J ha...@cloudera.com wrote:
 
  Praveenesh,
 
  Speaking minimally (and thereby requiring less tweaks on your end),
  1.5 GB would be a good value to use for RAM if available (1.0 will do
  too, if you make sure to tweak your configs to not use too much heap
  memory). Single processor should do fine for testing purposes.
 
  On Tue, Apr 17, 2012 at 8:51 PM, praveenesh kumar praveen...@gmail.com
 
  wrote:
  I am looking to test hadoop 0.23 or CDH4 beta on my local VM. I am
  looking
  to execute the sample example codes in new architecture, play around
 with
  the containers/resource managers.
  Is there any pre-requisite on default memory/CPU/core settings I need
 to
  keep in mind before setting up the VM.
 
  Regards,
  Praveenesh
 
 
 
  --
  Harsh J
 

 --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/





Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

2012-04-19 Thread Robert Evans
From what I can see your implementation seems OK, especially from a 
performance perspective. Depending on what storage: is, it is likely to be your 
bottleneck, not the hadoop computations.

Because you are writing files directly instead of relying on Hadoop to do it 
for you, you may need to deal with error cases that Hadoop would normally hide 
from you, and you will not be able to turn on speculative execution.  Just be 
aware that a map or reduce task may have problems in the middle and be 
relaunched.  So when you are writing out your updated manifest, be careful not 
to replace the old one until the new one is completely ready and will not 
fail, or you may lose data.  You may also need to be careful in your reduce if 
you are writing directly to the file there too, but because that is just a 
write, not a read-modify-write, it is not as critical.
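One common way to follow that advice, assuming the storage: scheme is reachable through the Hadoop FileSystem API, is to write the new manifest to a temporary path and only swap it into place once the write has fully succeeded. A rough sketch (paths and error handling are illustrative only):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManifestWriter {
    /** Writes the manifest to a temp file first, then swaps it into place. */
    public static void replaceManifest(Configuration conf, Path manifest, byte[] contents)
            throws IOException {
        FileSystem fs = manifest.getFileSystem(conf);
        Path tmp = new Path(manifest.getParent(), manifest.getName() + ".tmp");
        FSDataOutputStream out = fs.create(tmp, true);  // overwrite any stale temp file
        try {
            out.write(contents);
        } finally {
            out.close();            // only now is the new copy completely written
        }
        fs.delete(manifest, false); // remove the old manifest...
        fs.rename(tmp, manifest);   // ...and move the fully written one into place
    }
}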

--Bobby Evans

On 4/18/12 4:56 PM, Sky USC sky...@hotmail.com wrote:




Please help me architect the design of my first significant MR task beyond 
word count. My program works well, but I am trying to optimize performance to 
maximize use of available computing resources. I have 3 questions at the bottom.

Project description in an abstract sense (written in Java):
* I have MM number of MANIFEST files available on storage:/root/1.manif.txt to 
4000.manif.txt
 * Each MANIFEST in turn contains a variable number EE of URLs to EBOOKS 
(the range could be 1 - 50,000 ebook URLs per MANIFEST) -- stored on 
storage:/root/1.manif/1223.folder/5443.Ebook.ebk
So we are talking about millions of ebooks.

My task is to:
1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: 
publisher, year, ebook-version).
2. Update each EBOOK entry record in the manifest with the 3 
attributes (e.g. ebook 1334 - publisher=aaa, year=bbb, ebook-version=2.01).
3. Create an output file named 
publisher_year_ebook-version that contains a list of all ebook URLs that 
meet that criteria.
example:
File storage:/root/summary/RANDOMHOUSE_1999_2.01.txt contains:
storage:/root/1.manif/1223.folder/2143.Ebook.ebk
storage:/root/2.manif/2133.folder/5449.Ebook.ebk
storage:/root/2.manif/2133.folder/5450.Ebook.ebk
etc..

and File storage:/root/summary/PENGUIN_2001_3.12.txt contains:
storage:/root/19.manif/2223.folder/4343.Ebook.ebk
storage:/root/13.manif/9733.folder/2149.Ebook.ebk
storage:/root/21.manif/3233.folder/1110.Ebook.ebk

etc

4. finally, I also want to output statistics such that:
publisher_year_ebook-version  COUNT_OF_URLs
PENGUIN_2001_3.12 250,111
RANDOMHOUSE_1999_2.01  11,322
etc

Here is how I implemented it:
* My launcher gets the list of MM manifests
* My Mapper gets one manifest.
 --- It reads the manifest within a WHILE loop,
--- fetches each EBOOK and obtains attributes from each ebook,
--- updates the manifest for that ebook,
--- context.write(new Text("RANDOMHOUSE_1999_2.01"), new 
Text("storage:/root/1.manif/1223.folder/2143.Ebook.ebk"))
 --- Once all ebooks in the manifest are read, it saves the updated manifest, 
and exits
* My Reducer gets "RANDOMHOUSE_1999_2.01" and a list of ebook URLs.
 --- It writes a new file storage:/root/summary/RANDOMHOUSE_1999_2.01.txt 
with all the storage URLs for the ebooks
 --- It also does a context.write(new Text("RANDOMHOUSE_1999_2.01"), new 
IntWritable(SUM_OF_ALL_EBOOK_URLS_FROM_THE_LIST))

As I mentioned, it's working. I launch it on 15 elastic instances. I have three 
questions:
1. Is this the best way to implement the MR logic?
2. I don't know whether each of the instances is getting one task or multiple tasks 
simultaneously for the MAP portion. If it is not getting multiple MAP tasks, 
should I go the route of multithreaded reading of ebooks from each 
manifest? It's not efficient to read just one ebook at a time per machine. Is 
Context.write() thread-safe?
3. I can see log4j logs for the main program, but have no visibility into logs for 
the Mapper or Reducer. Any idea?
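For what it's worth, a skeletal sketch of the Mapper/Reducer structure described above, just to make the shape of the job concrete (class names are made up, and manifest fetching, ebook fetching, and manifest re-saving are reduced to stubs):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EbookIndexJob {

    /** Input value: one line of the launcher file, e.g. "1 storage:/root/1.manif.txt". */
    public static class ManifestMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String manifestUrl = value.toString().trim().split("\\s+")[1];
            for (String ebookUrl : fetchManifest(manifestUrl)) {    // stub
                String[] a = fetchEbookAttributes(ebookUrl);        // stub: {publisher, year, version}
                context.write(new Text(a[0] + "_" + a[1] + "_" + a[2]), new Text(ebookUrl));
            }
            // ...update and atomically re-save the manifest here...
        }
        private Iterable<String> fetchManifest(String url) { return java.util.Collections.emptyList(); }
        private String[] fetchEbookAttributes(String url) { return new String[] {"", "", ""}; }
    }

    /** Emits the URL count per publisher_year_version key; the per-key summary
        file would be written directly to storage here as well. */
    public static class SummaryReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> urls, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (Text url : urls) {
                count++;
                // ...append url to storage:/root/summary/<key>.txt here...
            }
            context.write(key, new IntWritable(count));
        }
    }
}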






Re: Multiple data centre in Hadoop

2012-04-19 Thread Robert Evans
Where I work  we have done some things like this, but none of them are open 
source, and I have not really been directly involved with the details of it.  I 
can guess about what it would take, but that is all it would be at this point.

--Bobby


On 4/17/12 5:46 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote:

Thanks Bobby, I'm looking for something like this. Now the question is:
what is the best strategy for doing hot/hot or hot/warm?
I need to consider CPU and network bandwidth, and also need to decide from
which layer this replication should start.

Regards,
Abhishek

On Mon, Apr 16, 2012 at 7:08 AM, Robert Evans ev...@yahoo-inc.com wrote:

 Hi Abhishek,

 Manu is correct about High Availability within a single colo.  I realize
 that in some cases you have to have failover between colos.  I am not
 aware of any turnkey solution for things like that, but generally what you
 want to do is to run two clusters, one in each colo, either hot/hot or
 hot/warm, and I have seen both depending on how quickly you need to fail
 over.  In hot/hot the input data is replicated to both clusters and the
 same software is run on both.  In this case, though, you have to be fairly
 sure that your processing is deterministic, or the results could be
 slightly different (i.e. no generating of random ids).  In hot/warm the
 data is replicated from one colo to the other at defined checkpoints.  The
 data is only processed on one of the grids, but if that colo goes down the
 other one can take up the processing from wherever the last checkpoint
 was.

 I hope that helps.

 --Bobby

 On 4/12/12 5:07 AM, Manu S manupk...@gmail.com wrote:

 Hi Abhishek,

 1. Use multiple directories for *dfs.name.dir* & *dfs.data.dir* etc
 * Recommendation: write to *two local directories on different
 physical volumes*, and to an *NFS-mounted* directory
 - Data will be preserved even in the event of a total failure of the
 NameNode machines
 * Recommendation: *soft-mount the NFS* directory
 - If the NFS mount goes offline, this will not cause the NameNode
 to fail

 2. *Rack awareness*

 https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf

 On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh
 manu.i...@gmail.comwrote:

  Thanks Robert.
  Is there a best practice or design than can address the High Availability
  to certain extent?
 
  ~Abhishek
 
  On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans ev...@yahoo-inc.com
  wrote:
 
   No it does not. Sorry
  
  
   On 4/11/12 1:44 PM, Abhishek Pratap Singh manu.i...@gmail.com
 wrote:
  
   Hi All,
  
   Just wanted to know if hadoop supports more than one data centre. This is
   basically for DR purposes and High Availability, where if one centre goes
   down the other can be brought up.
  
  
   Regards,
   Abhishek
  
  
 



 --
 Thanks & Regards
 
 *Manu S*
 SI Engineer - OpenSource & HPC
 Wipro Infotech
 Mob: +91 8861302855    Skype: manuspkd
 www.opensourcetalk.co.in





Re: Multiple data centre in Hadoop

2012-04-19 Thread Michael Segel
I don't know of any open source solution for doing this...
And yeah, it's something one can't talk about ;-)


On Apr 19, 2012, at 4:28 PM, Robert Evans wrote:

 Where I work  we have done some things like this, but none of them are open 
 source, and I have not really been directly involved with the details of it.  
 I can guess about what it would take, but that is all it would be at this 
 point.
 
 --Bobby
 
 
 On 4/17/12 5:46 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote:
 
 Thanks bobby, I m looking for something like this. Now the question is
 what is the best strategy to do Hot/Hot or Hot/Warm.
 I need to consider the CPU and Network bandwidth, also needs to decide from
 which layer this replication should start.
 
 Regards,
 Abhishek
 
 On Mon, Apr 16, 2012 at 7:08 AM, Robert Evans ev...@yahoo-inc.com wrote:
 
 Hi Abhishek,
 
 Manu is correct about High Availability within a single colo.  I realize
 that in some cases you have to have fail over between colos.  I am not
 aware of any turn key solution for things like that, but generally what you
 want to do is to run two clusters, one in each colo, either hot/hot or
 hot/warm, and I have seen both depending on how quickly you need to fail
 over.  In hot/hot the input data is replicated to both clusters and the
 same software is run on both.  In this case though you have to be fairly
 sure that your processing is deterministic, or the results could be
 slightly different (i.e. No generating if random ids).  In hot/warm the
 data is replicated from one colo to the other at defined checkpoints.  The
 data is only processed on one of the grids, but if that colo goes down the
 other one can take up the processing from where ever the last checkpoint
 was.
 
 I hope that helps.
 
 --Bobby
 
 On 4/12/12 5:07 AM, Manu S manupk...@gmail.com wrote:
 
 Hi Abhishek,
 
 1. Use multiple directories for *dfs.name.dir*  *dfs.data.dir* etc
 * Recommendation: write to *two local directories on different
 physical volumes*, and to an *NFS-mounted* directory
 - Data will be preserved even in the event of a total failure of the
 NameNode machines
 * Recommendation: *soft-mount the NFS* directory
 - If the NFS mount goes offline, this will not cause the NameNode
 to fail
 
 2. *Rack awareness*
 
 https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf
 
 On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh
 manu.i...@gmail.comwrote:
 
 Thanks Robert.
 Is there a best practice or design than can address the High Availability
 to certain extent?
 
 ~Abhishek
 
 On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans ev...@yahoo-inc.com
 wrote:
 
 No it does not. Sorry
 
 
 On 4/11/12 1:44 PM, Abhishek Pratap Singh manu.i...@gmail.com
 wrote:
 
 Hi All,
 
 Just wanted if hadoop supports more than one data centre. This is
 basically
 for DR purposes and High Availability where one centre goes down other
 can
 bring up.
 
 
 Regards,
 Abhishek
 
 
 
 
 
 
 --
 Thanks  Regards
 
 *Manu S*
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in
 
 
 



Re: Multiple data centre in Hadoop

2012-04-19 Thread Robert Evans
If you want to start an open source project for this, I am sure that there are 
others with the same problem who might be very willing to help out. :)

--Bobby Evans

On 4/19/12 4:31 PM, Michael Segel michael_se...@hotmail.com wrote:

I don't know of any open source solution in doing this...
And yeah its something one can't talk about  ;-)


On Apr 19, 2012, at 4:28 PM, Robert Evans wrote:

 Where I work  we have done some things like this, but none of them are open 
 source, and I have not really been directly involved with the details of it.  
 I can guess about what it would take, but that is all it would be at this 
 point.

 --Bobby


 On 4/17/12 5:46 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote:

 Thanks bobby, I m looking for something like this. Now the question is
 what is the best strategy to do Hot/Hot or Hot/Warm.
 I need to consider the CPU and Network bandwidth, also needs to decide from
 which layer this replication should start.

 Regards,
 Abhishek

 On Mon, Apr 16, 2012 at 7:08 AM, Robert Evans ev...@yahoo-inc.com wrote:

 Hi Abhishek,

 Manu is correct about High Availability within a single colo.  I realize
 that in some cases you have to have fail over between colos.  I am not
 aware of any turn key solution for things like that, but generally what you
 want to do is to run two clusters, one in each colo, either hot/hot or
 hot/warm, and I have seen both depending on how quickly you need to fail
 over.  In hot/hot the input data is replicated to both clusters and the
 same software is run on both.  In this case though you have to be fairly
 sure that your processing is deterministic, or the results could be
 slightly different (i.e. No generating if random ids).  In hot/warm the
 data is replicated from one colo to the other at defined checkpoints.  The
 data is only processed on one of the grids, but if that colo goes down the
 other one can take up the processing from where ever the last checkpoint
 was.

 I hope that helps.

 --Bobby

 On 4/12/12 5:07 AM, Manu S manupk...@gmail.com wrote:

 Hi Abhishek,

 1. Use multiple directories for *dfs.name.dir*  *dfs.data.dir* etc
 * Recommendation: write to *two local directories on different
 physical volumes*, and to an *NFS-mounted* directory
 - Data will be preserved even in the event of a total failure of the
 NameNode machines
 * Recommendation: *soft-mount the NFS* directory
 - If the NFS mount goes offline, this will not cause the NameNode
 to fail

 2. *Rack awareness*

 https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf

 On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh
 manu.i...@gmail.comwrote:

 Thanks Robert.
 Is there a best practice or design than can address the High Availability
 to certain extent?

 ~Abhishek

 On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans ev...@yahoo-inc.com
 wrote:

 No it does not. Sorry


 On 4/11/12 1:44 PM, Abhishek Pratap Singh manu.i...@gmail.com
 wrote:

 Hi All,

 Just wanted if hadoop supports more than one data centre. This is
 basically
 for DR purposes and High Availability where one centre goes down other
 can
 bring up.


 Regards,
 Abhishek






 --
 Thanks  Regards
 
 *Manu S*
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in







Re: Multiple data centre in Hadoop

2012-04-19 Thread Edward Capriolo
Hive is beginning to implement Region support, where one metastore will
manage multiple filesystems and jobtrackers. When a query creates a
table, it will then be copied to one or more datacenters. In addition,
the query planner will intelligently attempt to run queries only in regions
where all the tables exist.

While waiting for these awesome features, I am doing a fair amount of
distcp work from groovy scripts.

Edward

On Thu, Apr 19, 2012 at 5:33 PM, Robert Evans ev...@yahoo-inc.com wrote:
 If you want to start an open source project for this I am sure that there are 
 others with the same problem that might be very wiling to help out. :)

 --Bobby Evans

 On 4/19/12 4:31 PM, Michael Segel michael_se...@hotmail.com wrote:

 I don't know of any open source solution in doing this...
 And yeah its something one can't talk about  ;-)


 On Apr 19, 2012, at 4:28 PM, Robert Evans wrote:

 Where I work  we have done some things like this, but none of them are open 
 source, and I have not really been directly involved with the details of it. 
  I can guess about what it would take, but that is all it would be at this 
 point.

 --Bobby


 On 4/17/12 5:46 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote:

 Thanks bobby, I m looking for something like this. Now the question is
 what is the best strategy to do Hot/Hot or Hot/Warm.
 I need to consider the CPU and Network bandwidth, also needs to decide from
 which layer this replication should start.

 Regards,
 Abhishek

 On Mon, Apr 16, 2012 at 7:08 AM, Robert Evans ev...@yahoo-inc.com wrote:

 Hi Abhishek,

 Manu is correct about High Availability within a single colo.  I realize
 that in some cases you have to have fail over between colos.  I am not
 aware of any turn key solution for things like that, but generally what you
 want to do is to run two clusters, one in each colo, either hot/hot or
 hot/warm, and I have seen both depending on how quickly you need to fail
 over.  In hot/hot the input data is replicated to both clusters and the
 same software is run on both.  In this case though you have to be fairly
 sure that your processing is deterministic, or the results could be
 slightly different (i.e. No generating if random ids).  In hot/warm the
 data is replicated from one colo to the other at defined checkpoints.  The
 data is only processed on one of the grids, but if that colo goes down the
 other one can take up the processing from where ever the last checkpoint
 was.

 I hope that helps.

 --Bobby

 On 4/12/12 5:07 AM, Manu S manupk...@gmail.com wrote:

 Hi Abhishek,

 1. Use multiple directories for *dfs.name.dir*  *dfs.data.dir* etc
 * Recommendation: write to *two local directories on different
 physical volumes*, and to an *NFS-mounted* directory
 - Data will be preserved even in the event of a total failure of the
 NameNode machines
 * Recommendation: *soft-mount the NFS* directory
 - If the NFS mount goes offline, this will not cause the NameNode
 to fail

 2. *Rack awareness*

 https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf

 On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh
 manu.i...@gmail.comwrote:

 Thanks Robert.
 Is there a best practice or design than can address the High Availability
 to certain extent?

 ~Abhishek

 On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans ev...@yahoo-inc.com
 wrote:

 No it does not. Sorry


 On 4/11/12 1:44 PM, Abhishek Pratap Singh manu.i...@gmail.com
 wrote:

 Hi All,

 Just wanted if hadoop supports more than one data centre. This is
 basically
 for DR purposes and High Availability where one centre goes down other
 can
 bring up.


 Regards,
 Abhishek






 --
 Thanks  Regards
 
 *Manu S*
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855                Skype: manuspkd
 www.opensourcetalk.co.in







hadoop datanode is dead but cannot stop it

2012-04-19 Thread Johnson Chengwu
I have encountered a problem: when there is a disk I/O error on a DataNode
machine, the DataNode dies, but on the dead node the datanode daemon is
still alive and I cannot stop it in order to restart the DataNode. When I check
the process, it seems that the Linux command "du -sk path/to/datadir" has
hung; this is what causes the DataNode to die, and I cannot stop the
DataNode or even use "kill -9" on the datanode process to kill it.
Is this a bug? Maybe we should set a timeout on the du
command for the case when it does not return to the DataNode.
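Purely as an illustration of that last suggestion (this is not how the DataNode actually invokes du), wrapping an external command with a timeout could look roughly like this:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedDu {
    /** Runs "du -sk dir" and gives up (killing the child process) after timeoutSeconds. */
    public static int duWithTimeout(String dir, long timeoutSeconds) throws Exception {
        final Process p = new ProcessBuilder("du", "-sk", dir).start();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Integer> exitCode = executor.submit(new Callable<Integer>() {
            public Integer call() throws Exception {
                return p.waitFor();        // blocks until du exits
            }
        });
        try {
            return exitCode.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            p.destroy();                   // du hung (e.g. on a bad disk): kill it
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}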


Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

2012-04-19 Thread Sky
Thanks for your reply.  After I sent my email, I found a fundamental 
defect in my understanding of how MR work is distributed. I discovered that 
even though I was firing off 15 cores, the map job - which is the most 
expensive part of my processing - was run on only 1 core.


To start my map job, I was creating a single file with the following data:
  1 storage:/root/1.manif.txt
  2 storage:/root/2.manif.txt
  3 storage:/root/3.manif.txt
  ...
  4000 storage:/root/4000.manif.txt

I thought that each of the available cores would be assigned a map task from 
the same file, top down, one at a time, and as soon as one core was done, 
it would get the next map task. However, it looks like I need to split the 
file into multiple input files. Now while that's clearly trivial to do, I am 
not sure how I can detect at runtime how many splits I need to do, or how 
to deal with adding new cores at runtime. Any suggestions?  (It doesn't have 
to be a file; it can be a list, etc.)


This would all be much easier to debug if I could somehow get my log4j logs 
for my mappers and reducers. I can see log4j output for my main launcher, but I am 
not sure how to enable it for mappers and reducers.


Thx
- Akash


-Original Message- 
From: Robert Evans

Sent: Thursday, April 19, 2012 2:08 PM
To: common-user@hadoop.apache.org
Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce 
implementation


From what I can see your implementation seems OK, especially from a 
performance perspective. Depending on what storage: is it is likely to be 
your bottlekneck, not the hadoop computations.


Because you are writing files directly instead of relying on Hadoop to do it 
for you, you may need to deal with error cases that Hadoop will normally 
hide from you, and you will not be able to turn on speculative execution. 
Just be aware that a map or reduce task may have problems in the middle, and 
be relaunched.  So when you are writing out your updated manifest be careful 
to not replace the old one until the new one is completely ready and will 
not fail, or you may lose data.  You may also need to be careful in your 
reduce if you are writing directly to the file there too, but because it is 
not a read modify write, but just a write it is not as critical.


--Bobby Evans

On 4/18/12 4:56 PM, Sky USC sky...@hotmail.com wrote:




Please help me architect the design of my first significant MR task beyond 
word count. My program works well. but I am trying to optimize performance 
to maximize use of available computing resources. I have 3 questions at the 
bottom.


Project description in an abstract sense (written in java):
* I have MM number of MANIFEST files available on storage:/root/1.manif.txt 
to 4000.manif.txt
* Each MANIFEST in turn contains varilable number EE of URLs to 
EBOOKS (range could be 1 - 50,000 EBOOKS urls per MANIFEST) -- stored on 
storage:/root/1.manif/1223.folder/5443.Ebook.ebk

So we are talking about millions of ebooks

My task is to:
1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: 
publisher, year, ebook-version).
2. Update each of the EBOOK entry record in the manifest - with the 3 
attributes (eg: ebook 1334 - publisher=aaa year=bbb, ebook-version=2.01)
3. Create a output file such that the named 
publisher_year_ebook-version  contains a list of all ebook urls 
that met that criteria.

example:
File storage:/root/summary/RANDOMHOUSE_1999_2.01.txt contains:
storage:/root/1.manif/1223.folder/2143.Ebook.ebk
storage:/root/2.manif/2133.folder/5449.Ebook.ebk
storage:/root/2.manif/2133.folder/5450.Ebook.ebk
etc..

and File storage:/root/summary/PENGUIN_2001_3.12.txt contains:
storage:/root/19.manif/2223.folder/4343.Ebook.ebk
storage:/root/13.manif/9733.folder/2149.Ebook.ebk
storage:/root/21.manif/3233.folder/1110.Ebook.ebk

etc

4. finally, I also want to output statistics such that:
publisher_year_ebook-version  COUNT_OF_URLs
PENGUIN_2001_3.12 250,111
RANDOMHOUSE_1999_2.01  11,322
etc

Here is how I implemented:
* My launcher gets list of MM manifests
* My Mapper gets one manifest.
--- It reads the manifest, within a WHILE loop,
   --- fetches each EBOOK,  and obtain attributes from each ebook,
   --- updates the manifest for that ebook
   --- context.write(new Text(RANDOMHOUSE_1999_2.01), new 
Text(storage:/root/1.manif/1223.folder/2143.Ebook.ebk))
--- Once all ebooks in the manifest are read, it saves the updated Manifest, 
and exits

* My Reducer gets the RANDOMHOUSE_1999_2.01 and a list of ebooks urls.
--- It writes a new file storage:/root/summary/RANDOMHOUSE_1999_2.01.txt 
with all the storage urls for the ebooks
--- It also does a context.write(new Text(RANDOMHOUSE_1999_2.01), new 
IntWritable(SUM_OF_ALL_EBOOK_URLS_FROM_THE_LIST))


As I mentioned, its working. I launch it on 15 elastic instances. I have 
three questions:

1. Is this the best way to implement the MR logic?
2. I dont know if each of the instances is getting one task or multiple 
tasks 

Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

2012-04-19 Thread Michael Segel
How 'large', or rather in this case how small, is your file? 

If you're on a default system, the block size is 64MB. So if your file is <= 
64MB, you end up with 1 block, and you will only have 1 mapper. 


On Apr 19, 2012, at 10:10 PM, Sky wrote:

 Thanks for your reply.  After I sent my email, I found a fundamental defect - 
 in my understanding of how MR is distributed. I discovered that even though I 
 was firing off 15 COREs, the map job - which is the most expensive part of my 
 processing was run only on 1 core.
 
 To start my map job, I was creating a single file with following data:
  1 storage:/root/1.manif.txt
  2 storage:/root/2.manif.txt
  3 storage:/root/3.manif.txt
  ...
  4000 storage:/root/4000.manif.txt
 
 I thought that each of the available COREs will be assigned a map job from 
 top down from the same file one at a time, and as soon as one CORE is done, 
 it would get the next map job. However, it looks like I need to split the 
 file into the number of times. Now while that’s clearly trivial to do, I am 
 not sure how I can detect at runtime how many splits I need to do, and also 
 to deal with adding new CORES at runtime. Any suggestions?  (it doesn't have 
 to be a file, it can be a list, etc).
 
 This all would be much easier to debug, if somehow I could get my log4j logs 
 for my mappers and reducers. I can see log4j for my main launcher, but not 
 sure how to enable it for mappers and reducers.
 
 Thx
 - Akash
 
 
 -Original Message- From: Robert Evans
 Sent: Thursday, April 19, 2012 2:08 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce 
 implementation
 
 From what I can see your implementation seems OK, especially from a 
 performance perspective. Depending on what storage: is it is likely to be 
 your bottlekneck, not the hadoop computations.
 
 Because you are writing files directly instead of relying on Hadoop to do it 
 for you, you may need to deal with error cases that Hadoop will normally hide 
 from you, and you will not be able to turn on speculative execution. Just be 
 aware that a map or reduce task may have problems in the middle, and be 
 relaunched.  So when you are writing out your updated manifest be careful to 
 not replace the old one until the new one is completely ready and will not 
 fail, or you may lose data.  You may also need to be careful in your reduce 
 if you are writing directly to the file there too, but because it is not a 
 read modify write, but just a write it is not as critical.
 
 --Bobby Evans
 
 On 4/18/12 4:56 PM, Sky USC sky...@hotmail.com wrote:
 
 
 
 
 Please help me architect the design of my first significant MR task beyond 
 word count. My program works well. but I am trying to optimize performance 
 to maximize use of available computing resources. I have 3 questions at the 
 bottom.
 
 Project description in an abstract sense (written in java):
 * I have MM number of MANIFEST files available on storage:/root/1.manif.txt 
 to 4000.manif.txt
* Each MANIFEST in turn contains varilable number EE of URLs to EBOOKS 
 (range could be 1 - 50,000 EBOOKS urls per MANIFEST) -- stored on 
 storage:/root/1.manif/1223.folder/5443.Ebook.ebk
 So we are talking about millions of ebooks
 
 My task is to:
 1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: 
 publisher, year, ebook-version).
 2. Update each of the EBOOK entry record in the manifest - with the 3 
 attributes (eg: ebook 1334 - publisher=aaa year=bbb, ebook-version=2.01)
 3. Create a output file such that the named 
 publisher_year_ebook-version  contains a list of all ebook urls 
 that met that criteria.
 example:
 File storage:/root/summary/RANDOMHOUSE_1999_2.01.txt contains:
 storage:/root/1.manif/1223.folder/2143.Ebook.ebk
 storage:/root/2.manif/2133.folder/5449.Ebook.ebk
 storage:/root/2.manif/2133.folder/5450.Ebook.ebk
 etc..
 
 and File storage:/root/summary/PENGUIN_2001_3.12.txt contains:
 storage:/root/19.manif/2223.folder/4343.Ebook.ebk
 storage:/root/13.manif/9733.folder/2149.Ebook.ebk
 storage:/root/21.manif/3233.folder/1110.Ebook.ebk
 
 etc
 
 4. finally, I also want to output statistics such that:
 publisher_year_ebook-version  COUNT_OF_URLs
 PENGUIN_2001_3.12 250,111
 RANDOMHOUSE_1999_2.01  11,322
 etc
 
 Here is how I implemented:
 * My launcher gets list of MM manifests
 * My Mapper gets one manifest.
 --- It reads the manifest, within a WHILE loop,
   --- fetches each EBOOK,  and obtain attributes from each ebook,
   --- updates the manifest for that ebook
   --- context.write(new Text(RANDOMHOUSE_1999_2.01), new 
 Text(storage:/root/1.manif/1223.folder/2143.Ebook.ebk))
 --- Once all ebooks in the manifest are read, it saves the updated Manifest, 
 and exits
 * My Reducer gets the RANDOMHOUSE_1999_2.01 and a list of ebooks urls.
 --- It writes a new file storage:/root/summary/RANDOMHOUSE_1999_2.01.txt 
 with all the storage urls for the ebooks
 --- It also does a 

Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

2012-04-19 Thread Sky
My input file for the mapper is very small, as all it has is URLs to a 
list of manifests. The task for the mappers is to fetch each manifest, then 
fetch files using URLs from the manifests, and then process them.  Besides 
passing around lists of files, I am not really accessing the disk. It should 
be RAM, network, and CPU (unzip, parse XML, extract attributes).


So is my only choice to break up the input file and submit multiple files (if I 
have 15 cores, should I split the file of URLs into 15 files? And how does 
that look in code?)? The two drawbacks are that some cores might finish early and 
stay idle, and I don't know how to deal with dynamically 
increasing/decreasing cores.


Thx
- Sky

-Original Message- 
From: Michael Segel

Sent: Thursday, April 19, 2012 8:49 PM
To: common-user@hadoop.apache.org
Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce 
implementation


How 'large' or rather in this case small is your file?

If you're on a default system, the block sizes are 64MB. So if your file ~= 
64MB, you end up with 1 block, and you will only have 1 mapper.



On Apr 19, 2012, at 10:10 PM, Sky wrote:

Thanks for your reply.  After I sent my email, I found a fundamental 
defect - in my understanding of how MR is distributed. I discovered that 
even though I was firing off 15 COREs, the map job - which is the most 
expensive part of my processing was run only on 1 core.


To start my map job, I was creating a single file with following data:
 1 storage:/root/1.manif.txt
 2 storage:/root/2.manif.txt
 3 storage:/root/3.manif.txt
 ...
 4000 storage:/root/4000.manif.txt

I thought that each of the available COREs will be assigned a map job from 
top down from the same file one at a time, and as soon as one CORE is 
done, it would get the next map job. However, it looks like I need to 
split the file into the number of times. Now while that’s clearly trivial 
to do, I am not sure how I can detect at runtime how many splits I need to 
do, and also to deal with adding new CORES at runtime. Any suggestions? 
(it doesn't have to be a file, it can be a list, etc).


This all would be much easier to debug, if somehow I could get my log4j 
logs for my mappers and reducers. I can see log4j for my main launcher, 
but not sure how to enable it for mappers and reducers.


Thx
- Akash


-Original Message- From: Robert Evans
Sent: Thursday, April 19, 2012 2:08 PM
To: common-user@hadoop.apache.org
Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce 
implementation


From what I can see your implementation seems OK, especially from a 
performance perspective. Depending on what storage: is it is likely to be 
your bottlekneck, not the hadoop computations.


Because you are writing files directly instead of relying on Hadoop to do 
it for you, you may need to deal with error cases that Hadoop will 
normally hide from you, and you will not be able to turn on speculative 
execution. Just be aware that a map or reduce task may have problems in 
the middle, and be relaunched.  So when you are writing out your updated 
manifest be careful to not replace the old one until the new one is 
completely ready and will not fail, or you may lose data.  You may also 
need to be careful in your reduce if you are writing directly to the file 
there too, but because it is not a read modify write, but just a write it 
is not as critical.


--Bobby Evans

On 4/18/12 4:56 PM, Sky USC sky...@hotmail.com wrote:




Please help me architect the design of my first significant MR task beyond 
word count. My program works well. but I am trying to optimize 
performance to maximize use of available computing resources. I have 3 
questions at the bottom.


Project description in an abstract sense (written in java):
* I have MM number of MANIFEST files available on 
storage:/root/1.manif.txt to 4000.manif.txt
   * Each MANIFEST in turn contains varilable number EE of URLs to 
EBOOKS (range could be 1 - 50,000 EBOOKS urls per MANIFEST) -- stored 
on storage:/root/1.manif/1223.folder/5443.Ebook.ebk

So we are talking about millions of ebooks

My task is to:
1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: 
publisher, year, ebook-version).
2. Update each of the EBOOK entry record in the manifest - with the 3 
attributes (eg: ebook 1334 - publisher=aaa year=bbb, ebook-version=2.01)
3. Create a output file such that the named 
publisher_year_ebook-version  contains a list of all ebook urls 
that met that criteria.

example:
File storage:/root/summary/RANDOMHOUSE_1999_2.01.txt contains:
storage:/root/1.manif/1223.folder/2143.Ebook.ebk
storage:/root/2.manif/2133.folder/5449.Ebook.ebk
storage:/root/2.manif/2133.folder/5450.Ebook.ebk
etc..

and File storage:/root/summary/PENGUIN_2001_3.12.txt contains:
storage:/root/19.manif/2223.folder/4343.Ebook.ebk
storage:/root/13.manif/9733.folder/2149.Ebook.ebk
storage:/root/21.manif/3233.folder/1110.Ebook.ebk

etc

4. finally, I 

Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

2012-04-19 Thread Michael Segel
If the file is small enough, you could read it into a Java object like a list 
and write your own input format that takes a list object as its input and then 
lets you specify the number of mappers.
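One ready-made alternative to a fully custom input format is Hadoop's NLineInputFormat, which hands each map task a fixed number of lines of the input file. A rough sketch using the older mapred API (the input path and the lines-per-split value are only examples):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class ManifestJobDriver {
    public static JobConf configure(JobConf conf) {
        conf.setInputFormat(NLineInputFormat.class);
        // With a 4000-line manifest list, 267 lines per split gives roughly 15 map
        // tasks, one per available core; tune this to the size of your cluster.
        conf.setInt("mapred.line.input.format.linespermap", 267);
        FileInputFormat.addInputPath(conf, new Path("storage:/root/manifest-list.txt"));
        // ...mapper, reducer and output settings go here...
        return conf;
    }
}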

On Apr 19, 2012, at 11:34 PM, Sky wrote:

 My file for the input to mapper is very small - as all it has is urls to list 
 of manifests. The task for mappers is to fetch each manifest, and then fetch 
 files using urls from the manifests and then process them.  Besides passing 
 around lists of files, I am not really accessing the disk. It should be RAM, 
 network, and CPU (unzip, parsexml,extract attributes).
 
 So is my only choice to break the input file and submit multiple files (if I 
 have 15 cores, I should split the file with urls to 15 files? also how does 
 it look in code?)? The two drawbacks are - some cores might finish early and 
 stay idle, and I don’t know how to deal with dynamically 
 increasing/decreasing cores.
 
 Thx
 - Sky
 
 -Original Message- From: Michael Segel
 Sent: Thursday, April 19, 2012 8:49 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce 
 implementation
 
 How 'large' or rather in this case small is your file?
 
 If you're on a default system, the block sizes are 64MB. So if your file ~= 
 64MB, you end up with 1 block, and you will only have 1 mapper.
 
 
 On Apr 19, 2012, at 10:10 PM, Sky wrote:
 
 Thanks for your reply.  After I sent my email, I found a fundamental defect 
 - in my understanding of how MR is distributed. I discovered that even 
 though I was firing off 15 COREs, the map job - which is the most expensive 
 part of my processing was run only on 1 core.
 
 To start my map job, I was creating a single file with following data:
 1 storage:/root/1.manif.txt
 2 storage:/root/2.manif.txt
 3 storage:/root/3.manif.txt
 ...
 4000 storage:/root/4000.manif.txt
 
 I thought that each of the available COREs will be assigned a map job from 
 top down from the same file one at a time, and as soon as one CORE is done, 
 it would get the next map job. However, it looks like I need to split the 
 file into the number of times. Now while that’s clearly trivial to do, I am 
 not sure how I can detect at runtime how many splits I need to do, and also 
 to deal with adding new CORES at runtime. Any suggestions? (it doesn't have 
 to be a file, it can be a list, etc).
 
 This all would be much easier to debug, if somehow I could get my log4j logs 
 for my mappers and reducers. I can see log4j for my main launcher, but not 
 sure how to enable it for mappers and reducers.
 
 Thx
 - Akash
 
 
 -Original Message- From: Robert Evans
 Sent: Thursday, April 19, 2012 2:08 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce 
 implementation
 
 From what I can see your implementation seems OK, especially from a 
 performance perspective. Depending on what storage: is it is likely to be 
 your bottlekneck, not the hadoop computations.
 
 Because you are writing files directly instead of relying on Hadoop to do it 
 for you, you may need to deal with error cases that Hadoop will normally 
 hide from you, and you will not be able to turn on speculative execution. 
 Just be aware that a map or reduce task may have problems in the middle, and 
 be relaunched.  So when you are writing out your updated manifest be careful 
 to not replace the old one until the new one is completely ready and will 
 not fail, or you may lose data.  You may also need to be careful in your 
 reduce if you are writing directly to the file there too, but because it is 
 not a read modify write, but just a write it is not as critical.
 
 --Bobby Evans
 
 On 4/18/12 4:56 PM, Sky USC sky...@hotmail.com wrote:
 
 
 
 
 Please help me architect the design of my first significant MR task beyond 
 word count. My program works well. but I am trying to optimize performance 
 to maximize use of available computing resources. I have 3 questions at the 
 bottom.
 
 Project description in an abstract sense (written in java):
 * I have MM number of MANIFEST files available on storage:/root/1.manif.txt 
 to 4000.manif.txt
   * Each MANIFEST in turn contains varilable number EE of URLs to EBOOKS 
 (range could be 1 - 50,000 EBOOKS urls per MANIFEST) -- stored on 
 storage:/root/1.manif/1223.folder/5443.Ebook.ebk
 So we are talking about millions of ebooks
 
 My task is to:
 1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: 
 publisher, year, ebook-version).
 2. Update each of the EBOOK entry record in the manifest - with the 3 
 attributes (eg: ebook 1334 - publisher=aaa year=bbb, ebook-version=2.01)
 3. Create a output file such that the named 
 publisher_year_ebook-version  contains a list of all ebook urls 
 that met that criteria.
 example:
 File storage:/root/summary/RANDOMHOUSE_1999_2.01.txt contains:
 storage:/root/1.manif/1223.folder/2143.Ebook.ebk
 

Re: hadoop datanode is dead but cannot stop it

2012-04-19 Thread Harsh J
What distro/version of Hadoop are you using? This was a bug fixed
quite a while ago.

On Fri, Apr 20, 2012 at 7:29 AM, Johnson Chengwu
johnsonchen...@gmail.com wrote:
 I have encountered when there is a disk IO error in a datanode machine, the
 datanode will be dead, but the in the dead datanode, the datanode daemon is
 still alive, and I cannot stop it to restart it the datanode. When I check
 the process , it seems that the linux command du -sk path/to/datadir is
 hangup, this problem cause the datanode dead, so that I cannot stop the
 datanode as well as cannot use the “kill -9 datanode-process” to kill the
 datanode process, is this a bug? may be we should set a timeout of linux
 command du , when there is no return to the datanode.



-- 
Harsh J