Re: Hadoop with Sharded MySql
Ok, just tossing out some ideas... take them with a grain of salt. With Hive you can create external tables. Write a custom Java app that creates one thread per server, then iterate through each table, selecting the rows you want. You can then easily write the output directly to HDFS in each thread. It's not map-reduce, but it should be fairly efficient. You can even expand on this if you want. Java and JDBC...

Sent from my iPhone
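The threaded pull described above could be sketched roughly like this. The shard URLs are invented, and the JDBC select and HDFS write are left as comments so the skeleton runs without a cluster; each task just reports a simulated row count. A sketch, not the poster's actual implementation:

```java
import java.util.*;
import java.util.concurrent.*;

public class ShardPull {
    // One task per shard server: open a JDBC connection, select the rows,
    // and stream them to HDFS. The extract step is simulated here so the
    // skeleton is runnable without a database or cluster.
    static Map<String, Long> pullAll(List<String> shardUrls) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shardUrls.size());
        Map<String, Future<Long>> pending = new LinkedHashMap<>();
        for (String url : shardUrls) {
            pending.put(url, pool.submit(() -> {
                // A real version would do roughly:
                //   Connection c = DriverManager.getConnection(url, user, pass);
                //   ResultSet rs = c.createStatement().executeQuery("SELECT ...");
                //   FSDataOutputStream out = fs.create(new Path("/staging/..."));
                // Simulated: pretend we copied 1000 rows from this shard.
                return 1000L;
            }));
        }
        Map<String, Long> rowsPerShard = new LinkedHashMap<>();
        for (Map.Entry<String, Future<Long>> e : pending.entrySet())
            rowsPerShard.put(e.getKey(), e.getValue().get()); // propagates failures
        pool.shutdown();
        return rowsPerShard;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pullAll(Arrays.asList(
            "jdbc:mysql://shard1/db", "jdbc:mysql://shard2/db")));
    }
}
```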
Re: Hadoop with Sharded MySql
All,

I'm trying to get data into HDFS directly from the sharded databases and expose it to the existing Hive infrastructure.

(We are currently doing it this way: mysql -> staging server -> hdfs put commands -> hdfs, which is taking a lot of time.)

If we had a way of running a single sqoop job across all shards for a single table, I believe it would make life easier in terms of monitoring and exception handling.

Thanks,
Srinivas

--
Regards,
-- Srinivas
srini...@cloudwick.com
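For what it's worth, the "generate 70K sqoop commands" part is at least easy to script. A rough sketch (host, database, and directory names here are invented) that emits one standard `sqoop import` command per shard/table pair — which also shows where the job-count explosion comes from:

```java
import java.util.*;

public class SqoopCommandGen {
    // Build one sqoop import command per (shard, table) pair. With 30 shard
    // servers and a couple of thousand tables each, this is where the ~70K
    // commands come from, so generation and tracking have to be scripted.
    static List<String> commands(List<String> shardHosts, List<String> tables,
                                 String db, String targetRoot) {
        List<String> cmds = new ArrayList<>();
        for (String host : shardHosts)
            for (String table : tables)
                cmds.add(String.join(" ",
                    "sqoop import",
                    "--connect jdbc:mysql://" + host + "/" + db,
                    "--table " + table,
                    "--target-dir " + targetRoot + "/" + table + "/" + host,
                    "-m 1"));
        return cmds;
    }

    public static void main(String[] args) {
        for (String cmd : commands(Arrays.asList("shard1", "shard2"),
                                   Arrays.asList("orders"), "appdb", "/staging"))
            System.out.println(cmd);
    }
}
```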
Re: Hadoop with Sharded MySql
Hi Sujit,

Srinivas is asking how to import data into HDFS using sqoop. I believe he must have thought it out well before designing the entire architecture/solution. He has not specified whether he would like to modify the data or not. Whether to use Hive or HBase is a different question altogether and depends on his use case.

Thanks,
Anil

--
Thanks & Regards,
Anil Gupta
Re: Hadoop with Sharded MySql
Hi,

Instead of pulling 70K tables from mysql into hdfs, take a dump of all 30 databases and put it into the HBase database.

If you pull 70K tables from mysql into hdfs, you need to use Hive, but modification will not be possible in Hive :(

*@ common-user:* please correct me if I am wrong.

Kind Regards,
Sujit Dhamale
(+91 9970086652)
Re: Hadoop with Sharded MySql
Maybe you can do some VIEWs or unions or merge tables on the mysql side to overcome the aspect of launching so many sqoop jobs.

On Thu, May 31, 2012 at 6:02 PM, Srinivas Surasani wrote:
> All,
>
> We are trying to implement sqoop in our environment, which has 30 mysql sharded database servers. Each server has around 30 databases with 150 tables in each database, all horizontally sharded (that is, the data is divided across the tables in mysql).
>
> The problem is that we have a total of around 70K tables which need to be pulled from mysql into hdfs.
>
> So, my question is: is generating 70K sqoop commands and running them in parallel feasible or not?
>
> Also, doing incremental updates is going to be like invoking another 70K sqoop jobs, which in turn kick off map-reduce jobs.
>
> The main problem is monitoring and managing this huge number of jobs.
>
> Can anyone suggest the best way of doing it, or is sqoop a good candidate for this type of scenario?
>
> Currently the same process is done by generating tsv files on the mysql servers, dumping them onto a staging server, and from there generating hdfs put statements.
>
> Appreciate your suggestions!
>
> Thanks,
> Srinivas Surasani
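A rough illustration of the view idea, assuming the shard tables being unioned live on the same mysql server (all names here are hypothetical). Generating one UNION ALL view per logical table collapses many per-shard-table imports into a single source the sqoop free-form `--query` import can read:

```java
import java.util.*;
import java.util.stream.*;

public class ShardView {
    // DDL for a single view that UNION ALLs one logical table across its
    // shard tables on the same mysql server, so one sqoop import can read
    // the whole logical table instead of one job per shard table.
    static String unionViewDdl(String view, List<String> shardTables) {
        String body = shardTables.stream()
            .map(t -> "SELECT * FROM " + t)
            .collect(Collectors.joining(" UNION ALL "));
        return "CREATE VIEW " + view + " AS " + body;
    }

    public static void main(String[] args) {
        System.out.println(unionViewDdl("orders_all",
            Arrays.asList("orders_s1", "orders_s2", "orders_s3")));
    }
}
```

This only collapses the shards that share a server; one sqoop job per server per logical table would still be needed, but that is far fewer than 70K.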