Re: Hadoop on physical machines compared to Amazon EC2 / virtual machines

2012-06-01 Thread Jane Wayne
Sandeep,

How are you guys moving 100 TB into the AWS cloud? Are you using S3 or
EBS? If you are using S3, it does not work like HDFS. Although data is
replicated in S3 (across facilities within a region, I believe), it is
not the same as HDFS replication. You lose Hadoop's data locality
optimization when you use S3, which runs counter to MapReduce's
send-the-code-to-the-data paradigm. Mind you, traffic in and out of S3
also incurs transfer costs (another consequence of losing data
locality).

I hear that to get PBs worth of data into AWS, it is not uncommon to
drive a truck with your data on some physical storage device (in fact,
Amazon's Import/Export service will help you do exactly this).
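
Once the data has landed in S3, one common middle ground (a sketch only; the
bucket and paths below are placeholders, and it assumes the S3 credentials are
set in core-site.xml) is to copy it onto the cluster with distcp so the jobs
get HDFS locality back:

hadoop distcp s3n://your-bucket/incoming hdfs:///data/incoming

That way S3 stays the durable drop-off point while the MapReduce jobs run
against local HDFS.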

Please keep us updated; this is an interesting problem.

Thanks,

On Thu, May 31, 2012 at 2:41 PM, Sandeep Reddy P
sandeepreddy.3...@gmail.com wrote:
 Hi,
 We are getting 100 TB of data; with a replication factor of 3, that comes to
 300 TB. We are planning to use Hadoop with 65 nodes. We want to know
 which option will be better in terms of hardware: physical machines, or
 deploying Hadoop on EC2. Is there any document that supports the use of
 physical machines?
 Hardware specs: 2 quad-core CPUs, 32 GB RAM, 12 x 1 TB hard drives, and 10 GbE
 switches; the cost is about $10k for each machine. Is it cheaper to use EC2?
 Will there be any performance issues?
 --
 Thanks,
 sandeep


DFS Client not found in sorted Leases

2012-06-01 Thread Miguel Costa

Hi,

We are seeing several errors in the HDFS master (NameNode) log:

[Lease.  Holder: DFSClient  not found in sortedLeases.

The HBase Master stays in the "Splitting regions" state, and the HDFS master
does not show any dead nodes.


We are using Cloudera Distribution CDH3.

Thanks,

Miguel




Re: Hadoop with Sharded MySQL

2012-06-01 Thread Srinivas Surasani
All,

I'm trying to get data into HDFS directly from the sharded databases and
expose it to our existing Hive infrastructure.

(We are currently doing it this way: mysql -> staging server -> hdfs put
commands -> hdfs, which takes a lot of time.)

If we had a way of running a single Sqoop job across all shards for a single
table, I believe it would make life easier in terms of monitoring and
exception handling.
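
(For concreteness, a single per-shard, per-table invocation looks roughly like
sqoop import --connect jdbc:mysql://shard01/mydb --table mytable --target-dir /data/mydb/mytable/shard01
where the host, database, table, and target directory are placeholders;
multiply that by every shard and every table and you get the job count
described in the original post below.)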

Thanks,
Srinivas

On Fri, Jun 1, 2012 at 1:27 AM, anil gupta anilgupt...@gmail.com wrote:

 Hi Sujith,

 Srinivas is asking how to import data into HDFS using Sqoop. I believe he
 must have thought it out well before designing the entire
 architecture/solution. He has not specified whether he would like to modify
 the data or not. Whether to use Hive or HBase is a different question
 altogether and depends on his use case.

 Thanks,
 Anil


 On Thu, May 31, 2012 at 9:52 PM, Sujit Dhamale sujitdhamal...@gmail.com
 wrote:

  Hi,
  Instead of pulling 70K tables from MySQL into HDFS,
  take a dump of all 30 databases and put it into HBase.

  If you pull 70K tables from MySQL into HDFS, you need to use Hive,
  but modification will not be possible in Hive :(

  @common-user: please correct me if I am wrong.
 
  Kind Regards
  Sujit Dhamale
  (+91 9970086652)
  On Fri, Jun 1, 2012 at 5:42 AM, Edward Capriolo edlinuxg...@gmail.com
  wrote:
 
   Maybe you can do some VIEWs, UNIONs, or MERGE tables on the MySQL
   side to avoid launching so many Sqoop jobs.
  
   On Thu, May 31, 2012 at 6:02 PM, Srinivas Surasani
   hivehadooplearn...@gmail.com wrote:
All,
   
We are trying to implement Sqoop in our environment, which has 30 sharded
MySQL servers; each server has around 30 databases with 150 tables in each
database, all of which are sharded (horizontally sharded, meaning the data
is divided across all the tables in MySQL).

The problem is that we have a total of around 70K tables which need
to be pulled from MySQL into HDFS.

So, my question is: is generating 70K Sqoop commands and running them
in parallel feasible or not?

Also, doing incremental updates is going to mean invoking another 70K
Sqoop jobs, which in turn kick off MapReduce jobs.

The main problem is monitoring and managing this huge number of jobs.

Can anyone suggest the best way of doing this, or is Sqoop a good
candidate for this type of scenario?

Currently the same process is done by generating TSV files on the MySQL
server, dumping them onto a staging server, and from there generating
hdfs put statements.

Appreciate your suggestions!


Thanks,
Srinivas Surasani
  
 



 --
 Thanks & Regards,
 Anil Gupta




-- 
Regards,
-- Srinivas
srini...@cloudwick.com


Re: Hadoop and Eclipse integration

2012-06-01 Thread Erik Paulson
Using this suggestion: (look down for the 'M2_REPO' bit)
http://lucene.472066.n3.nabble.com/Hadoop-on-Eclipse-td3392212.html

and this documentation:
http://maven.apache.org/guides/mini/guide-ide-eclipse.html

I ran this:

mvn -Declipse.workspace=/Users/epaulson/development/workspace/ eclipse:add-maven-repo
(Eclipse on MacOS wanted to put its workspace in /Users/epaulson/Documents,
but I don't like to keep anything there.)

and that resolved most of the problems that you're having below.
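
(For completeness, the rest of the workflow from that Maven guide, assuming the
stock maven-eclipse-plugin, is: run mvn eclipse:eclipse in the source tree,
after an mvn install -DskipTests if you haven't built yet, and then use
'Import... > Existing Projects into Workspace' in Eclipse to pick up the
generated .project and .classpath files.)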

-Erik

On Tue, May 29, 2012 at 1:30 AM, Nick Katsipoulakis popa...@gmail.com wrote:

 Hello everybody,
 I attempted to use the Eclipse IDE for Hadoop development, and I followed
 the instructions shown here:

 http://wiki.apache.org/hadoop/EclipseEnvironment

 Everything goes well until I start importing projects into Eclipse,
 particularly HDFS. When I follow the instructions for the HDFS import, I get
 the following error from Eclipse:

 Project 'hadoop-hdfs' is missing required library:
 '/home/nick/.m2/repository/org/aspectj/aspectjtools/1.6.5/aspectjtools-1.6.5.jar'

 I should mention that the hadoop-common directory into which I checked out
 Hadoop is located at:

 /home/nick/hadoop-common

 and I am using Ubuntu 10.04.

 Similar errors appear when I attempt to import the MapReduceTools:

 Project 'MapReduceTools' is missing required library: 'classes'
 Project 'MapReduceTools' is missing required library: 'lib/hadoop-core.jar'

 How can I resolve these issues? Once they are resolved, how can I execute a
 simple WordCount job from Eclipse? Thank you.



Re: Hadoop with Sharded MySQL

2012-06-01 Thread Michael Segel
Ok, just tossing out some ideas... take them with a grain of salt...

With Hive you can create external tables.

Write a custom Java app that creates one thread per server. Then iterate
through each table, selecting the rows you want. You can then easily write the
output directly to HDFS in each thread.
It's not MapReduce, but it should be fairly efficient.

You can even expand on this if you want.
Java and JDBC...

Sent from my iPhone

On Jun 1, 2012, at 11:30 AM, Srinivas Surasani hivehadooplearn...@gmail.com 
wrote:

 All,

 I'm trying to get data into HDFS directly from the sharded databases and
 expose it to our existing Hive infrastructure.

 (We are currently doing it this way: mysql -> staging server -> hdfs put
 commands -> hdfs, which takes a lot of time.)

 If we had a way of running a single Sqoop job across all shards for a single
 table, I believe it would make life easier in terms of monitoring and
 exception handling.

 Thanks,
 Srinivas
 