Re: Hadoop on physical machines compared to Amazon EC2 / virtual machines
Sandeep,

How are you guys moving 100 TB into the AWS cloud? Are you using S3 or EBS? If you are using S3, it does not work like HDFS. Although data is replicated in S3 (within an availability zone, I believe), it is not the same as HDFS replication: you lose Hadoop's data-locality optimization, which runs counter to the "send the code to the data" paradigm of MapReduce. Mind you, traffic in and out of S3 incurs costs as well (another consequence of losing data locality). I hear that to get petabytes of data into AWS, it is not uncommon to ship your data on a physical storage device (in fact, Amazon will help you do this).

Please keep us updated; this is an interesting problem.

Thanks,

On Thu, May 31, 2012 at 2:41 PM, Sandeep Reddy P sandeepreddy.3...@gmail.com wrote:

Hi,

We are getting 100 TB of data; with a replication factor of 3, this comes to 300 TB. We are planning to run Hadoop on 65 nodes, and we want to know which option is better in terms of hardware: physical machines, or deploying Hadoop on EC2. Is there any document that supports the use of physical machines?

Hardware specs: 2 quad-core CPUs, 32 GB RAM, 12 x 1 TB hard drives, and 10 Gb Ethernet switches; this costs $10k for each machine.

Would it be cheaper to use EC2? Will there be any performance issues?

--
Thanks,
sandeep
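A quick back-of-the-envelope check of that spec (the 25% non-DFS reservation below is an assumption for OS, logs, and intermediate map output, not a figure from Sandeep's mail):

65 nodes x 12 x 1 TB    = 780 TB raw disk
780 TB x 0.75           ~ 585 TB available to HDFS
585 TB / 3 replicas     ~ 195 TB usable capacity

So the 100 TB dataset (300 TB after 3x replication) fits on 65 such nodes with roughly half the HDFS space left as headroom.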
DFSClient not found in sortedLeases
Hi,

We are seeing repeated errors in the HDFS master (NameNode) log of the form "Lease.Holder: DFSClient not found in sortedLeases". The HBase Master stays in the "Splitting regions" state, and the HDFS master doesn't show any dead nodes. We are using Cloudera's Distribution CDH3.

Thanks,
Miguel
Re: Hadoop with Sharded MySQL
All,

I'm trying to get data into HDFS directly from the sharded databases and expose it to our existing Hive infrastructure. (We are currently doing it this way: mysql -> staging server -> hdfs put commands -> HDFS, which takes a lot of time.) If we had a way of running a single sqoop job across all shards for a single table, I believe it would make life easier in terms of monitoring and exception handling.

Thanks,
Srinivas

On Fri, Jun 1, 2012 at 1:27 AM, anil gupta anilgupt...@gmail.com wrote:

Hi Sujit,

Srinivas is asking how to import data into HDFS using Sqoop. I believe he must have thought it out well before designing the entire architecture/solution, and he has not specified whether he would like to modify the data or not. Whether to use Hive or HBase is a different question altogether and depends on his use case.

Thanks,
Anil

On Thu, May 31, 2012 at 9:52 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote:

Hi,

Instead of pulling 70K tables from MySQL into HDFS, take a dump of all 30 tables and put it into the HBase database. If you pull 70K tables from MySQL into HDFS, you need to use Hive, but modification will not be possible in Hive :(

@common-user: please correct me if I am wrong.

Kind Regards,
Sujit Dhamale (+91 9970086652)

On Fri, Jun 1, 2012 at 5:42 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

Maybe you can do some VIEWs or unions or merge tables on the MySQL side to overcome the aspect of launching so many sqoop jobs.

On Thu, May 31, 2012 at 6:02 PM, Srinivas Surasani hivehadooplearn...@gmail.com wrote:

All,

We are trying to implement Sqoop in our environment, which has 30 sharded MySQL servers; each has around 30 databases with 150 tables per database, all horizontally sharded (meaning the data is divided across all of these tables in MySQL). The problem is that we have a total of around 70K tables that need to be pulled from MySQL into HDFS.

So my question is whether generating 70K sqoop commands and running them in parallel is feasible or not. Also, doing incremental updates would mean invoking another 70K sqoop jobs, which in turn kick off map-reduce jobs. The main problem is monitoring and managing this huge number of jobs. Can anyone suggest the best way of doing this, or whether Sqoop is a good candidate for this type of scenario?

Currently the same process is done by generating TSV files on the MySQL servers, dumping them onto a staging server, and generating hdfs put statements from there.

Appreciate your suggestions!

Thanks,
Srinivas Surasani

--
Thanks Regards,
Anil Gupta

--
Regards,
-- Srinivas
srini...@cloudwick.com
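One way to keep that many per-table imports manageable (only a rough sketch, not something from this thread) is to drive the generated sqoop import commands from a single controlling program with a bounded thread pool, so there is one place to cap concurrency, collect exit codes, and retry failures. The shard hosts, database name, credentials, table list, mapper count, and HDFS layout below are hypothetical placeholders; only standard sqoop import flags are used.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShardedSqoopDriver {

  public static void main(String[] args) throws Exception {
    // Hypothetical shard endpoints and tables; in practice these would come
    // from a config file or the shard catalog.
    List<String> shardHosts = Arrays.asList("shard01", "shard02", "shard03");
    List<String> tables = Arrays.asList("orders", "order_items");

    // Cap how many sqoop jobs run at once so MySQL and the cluster are not
    // hit with tens of thousands of simultaneous imports.
    ExecutorService pool = Executors.newFixedThreadPool(8);

    for (final String host : shardHosts) {
      for (final String table : tables) {
        pool.submit(new Runnable() {
          public void run() {
            try {
              ProcessBuilder pb = new ProcessBuilder(
                  "sqoop", "import",
                  "--connect", "jdbc:mysql://" + host + "/appdb",
                  "--username", "etl", "--password", "secret",   // placeholders
                  "--table", table,
                  "--num-mappers", "4",
                  "--target-dir", "/user/etl/" + table + "/" + host);
              pb.redirectErrorStream(true);
              Process p = pb.start();
              BufferedReader out = new BufferedReader(
                  new InputStreamReader(p.getInputStream()));
              String line;
              while ((line = out.readLine()) != null) {
                // Forward (or persist) the sqoop/MapReduce output for monitoring.
                System.out.println(host + "." + table + ": " + line);
              }
              int rc = p.waitFor();
              if (rc != 0) {
                // One central place to record failures and queue retries.
                System.err.println("FAILED " + host + "." + table + " rc=" + rc);
              }
            } catch (Exception e) {
              System.err.println("FAILED " + host + "." + table + ": " + e);
            }
          }
        });
      }
    }
    pool.shutdown();
    pool.awaitTermination(7, TimeUnit.DAYS);
  }
}

The same loop is where incremental runs could append --incremental/--check-column arguments and where per-table state (last imported id, last run time) could be tracked in one place.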
Re: Hadoop and Eclipse integration
Using this suggestion (look down for the 'M2_REPO' bit): http://lucene.472066.n3.nabble.com/Hadoop-on-Eclipse-td3392212.html and this documentation: http://maven.apache.org/guides/mini/guide-ide-eclipse.html I ran this:

mvn -Declipse.workspace=/Users/epaulson/development/workspace/ eclipse:add-maven-repo

(Eclipse on MacOS wanted to put its workspace in /Users/epaulson/Documents, but I don't like to keep anything there), and that resolved most of the problems you're having below.

-Erik

On Tue, May 29, 2012 at 1:30 AM, Nick Katsipoulakis popa...@gmail.com wrote:

Hello everybody,

I attempted to use the Eclipse IDE for Hadoop development and followed the instructions shown here: http://wiki.apache.org/hadoop/EclipseEnvironment

Everything goes well until I start importing projects into Eclipse, particularly HDFS. When I follow the instructions for the HDFS import, I get the following error from Eclipse:

Project 'hadoop-hdfs' is missing required library: '/home/nick/.m2/repository/org/aspectj/aspectjtools/1.6.5/aspectjtools-1.6.5.jar'

I should mention that the hadoop-common directory where I checked out Hadoop is located at /home/nick/hadoop-common, and I am using Ubuntu 10.04. Similar errors appear when I attempt to import the MapReduceTools project:

Project 'MapReduceTools' is missing required library: 'classes'
Project 'MapReduceTools' is missing required library: 'lib/hadoop-core.jar'

How can I resolve these issues? Once they are resolved, how can I execute a simple WordCount job from Eclipse?

Thank you
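On the last question: once the projects build, a WordCount job can be launched from Eclipse by running a plain driver class as a Java application. A minimal sketch, assuming the Hadoop 0.20+ "mapreduce" API and the TokenCounterMapper/IntSumReducer classes that ship with Hadoop; the class name and input/output paths are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // With no extra configuration this uses the local job runner, which is
    // convenient for debugging inside Eclipse; point fs.default.name and the
    // job tracker at a real cluster to run it there instead.
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenCounterMapper.class);   // splits each line into tokens, emits (word, 1)
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);       // sums the counts per word
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input text file or directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with two program arguments (input path, output path) in the Eclipse run configuration.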
Re: Hadoop with Sharded MySQL
Ok, just tossing out some ideas... take them with a grain of salt.

With Hive you can create external tables. Write a custom Java app that creates one thread per server, then iterates through each table selecting the rows you want. You can then easily write the output directly to HDFS from each thread. It's not map-reduce, but it should be fairly efficient, and you can expand on this if you want. Java and JDBC...

Sent from my iPhone

On Jun 1, 2012, at 11:30 AM, Srinivas Surasani hivehadooplearn...@gmail.com wrote:

All,

I'm trying to get data into HDFS directly from the sharded databases and expose it to our existing Hive infrastructure. (We are currently doing it this way: mysql -> staging server -> hdfs put commands -> HDFS, which takes a lot of time.) If we had a way of running a single sqoop job across all shards for a single table, I believe it would make life easier in terms of monitoring and exception handling.

Thanks,
Srinivas

On Fri, Jun 1, 2012 at 1:27 AM, anil gupta anilgupt...@gmail.com wrote:

Hi Sujit,

Srinivas is asking how to import data into HDFS using Sqoop. I believe he must have thought it out well before designing the entire architecture/solution, and he has not specified whether he would like to modify the data or not. Whether to use Hive or HBase is a different question altogether and depends on his use case.

Thanks,
Anil

On Thu, May 31, 2012 at 9:52 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote:

Hi,

Instead of pulling 70K tables from MySQL into HDFS, take a dump of all 30 tables and put it into the HBase database. If you pull 70K tables from MySQL into HDFS, you need to use Hive, but modification will not be possible in Hive :(

@common-user: please correct me if I am wrong.

Kind Regards,
Sujit Dhamale (+91 9970086652)

On Fri, Jun 1, 2012 at 5:42 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

Maybe you can do some VIEWs or unions or merge tables on the MySQL side to overcome the aspect of launching so many sqoop jobs.

On Thu, May 31, 2012 at 6:02 PM, Srinivas Surasani hivehadooplearn...@gmail.com wrote:

All,

We are trying to implement Sqoop in our environment, which has 30 sharded MySQL servers; each has around 30 databases with 150 tables per database, all horizontally sharded (meaning the data is divided across all of these tables in MySQL). The problem is that we have a total of around 70K tables that need to be pulled from MySQL into HDFS.

So my question is whether generating 70K sqoop commands and running them in parallel is feasible or not. Also, doing incremental updates would mean invoking another 70K sqoop jobs, which in turn kick off map-reduce jobs. The main problem is monitoring and managing this huge number of jobs. Can anyone suggest the best way of doing this, or whether Sqoop is a good candidate for this type of scenario?

Currently the same process is done by generating TSV files on the MySQL servers, dumping them onto a staging server, and generating hdfs put statements from there.

Appreciate your suggestions!

Thanks,
Srinivas Surasani

--
Thanks Regards,
Anil Gupta

--
Regards,
-- Srinivas
srini...@cloudwick.com
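A minimal sketch of that thread-per-server JDBC extract writing straight into HDFS with the FileSystem API. The shard hosts, database, table, columns, credentials, and output paths are made-up placeholders, and the Integer.MIN_VALUE fetch-size trick for streaming result sets is specific to MySQL Connector/J.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShardExtractor {

  public static void main(String[] args) throws Exception {
    Class.forName("com.mysql.jdbc.Driver");  // MySQL Connector/J on the classpath
    // Hypothetical shard list and table; in practice read from a config file.
    final List<String> shards = Arrays.asList("shard01", "shard02", "shard03");
    final String table = "orders";

    List<Thread> threads = new ArrayList<Thread>();
    for (final String shard : shards) {
      Thread t = new Thread(new Runnable() {
        public void run() {
          try {
            extract(shard, table);
          } catch (Exception e) {
            // One place to see per-shard failures instead of thousands of job logs.
            System.err.println("Extract failed for " + shard + ": " + e);
          }
        }
      });
      t.start();
      threads.add(t);
    }
    for (Thread t : threads) {
      t.join();
    }
  }

  static void extract(String shard, String table) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/user/etl/" + table + "/" + shard + ".tsv");

    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://" + shard + "/appdb", "etl", "secret");  // placeholder credentials
    Statement stmt = conn.createStatement();
    stmt.setFetchSize(Integer.MIN_VALUE);            // stream rows instead of buffering them all
    ResultSet rs = stmt.executeQuery("SELECT id, customer_id, amount FROM " + table);

    BufferedWriter w = new BufferedWriter(new OutputStreamWriter(fs.create(out)));
    try {
      while (rs.next()) {
        // Tab-separated output, matching the TSV format the existing pipeline produces.
        w.write(rs.getLong(1) + "\t" + rs.getLong(2) + "\t" + rs.getBigDecimal(3));
        w.newLine();
      }
    } finally {
      w.close();
      rs.close();
      stmt.close();
      conn.close();
    }
  }
}

The per-shard files under /user/etl/orders/ could then back a single Hive external table partitioned or named by shard, which is roughly the "external tables" idea mentioned above.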