The list of tables is not large; the RDD is created over the table list to parallelise the work of fetching tables in multiple mappers at the same time. Since the time taken to fetch a single table is significant, the fetches can't run sequentially.
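Concretely, something like this (a minimal sketch; the table names are placeholders, and the second argument to parallelize pins one partition per table so each table is fetched by its own task):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    List<String> tables = Arrays.asList("dbname.tablename", "dbname.tablename2");
    // one partition per table => up to tables.size() concurrent fetch tasks
    JavaRDD<String> tablelistrdd = javasparkcontext.parallelize(tables, tables.size());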
The content of a table fetched by a map task is large, so one option is to dump the content to HDFS using the FileSystem API from inside the map function, every few rows, as the table is fetched. I cannot keep a complete table in memory and then dump it to HDFS with a function like the one below:

    JavaRDD<String> tablecontent = tablelistrdd.flatMap(
        new FlatMapFunction<String, String>() {
          public Iterable<String> call(String tablename) {
            // make a JDBC connection, fetch the table data,
            // populate a list, and return it
          }
        });
    tablecontent.saveAsTextFile("hdfspath");

Here I wanted to create a custom RDD whose partitions would live in memory on multiple executors, each containing part of a table's data, and I would have called saveAsTextFile on that custom RDD directly to save it to HDFS.

On Thu, Jul 2, 2015 at 12:59 AM, Feynman Liang <fli...@databricks.com> wrote:

> On Wed, Jul 1, 2015 at 7:19 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
>
>> JavaRDD<String> rdd = javasparkcontext.parallelize(tables);
>
> You are already creating an RDD in Java here ;)
>
> However, it's not clear to me why you'd want to make this an RDD. Is the
> list of tables so large that it doesn't fit on a single machine? If not,
> you may be better off spinning up one Spark job for dumping each table in
> tables using a JDBC datasource
> <https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases>.
>
> On Wed, Jul 1, 2015 at 12:00 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
>
>> Sure, you can create custom RDDs. Haven't done so in Java, but in Scala absolutely.
>>
>> From: Shushant Arora
>> Date: Wednesday, July 1, 2015 at 1:44 PM
>> To: Silvio Fiorito
>> Cc: user
>> Subject: Re: custom RDD in java
>>
>> OK, will evaluate these options, but is it possible to create a custom RDD in Java?
>>
>> On Wed, Jul 1, 2015 at 8:29 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
>>
>>> If all you're doing is just dumping tables from SQL Server to HDFS, have you looked at Sqoop?
>>>
>>> Otherwise, if you need to run this in Spark, could you just use the existing JdbcRDD?
>>>
>>> From: Shushant Arora
>>> Date: Wednesday, July 1, 2015 at 10:19 AM
>>> To: user
>>> Subject: custom RDD in java
>>>
>>> Hi,
>>>
>>> Is it possible to write a custom RDD in Java?
>>>
>>> The requirement is: I have a list of SQL Server tables that need to be dumped into HDFS.
>>>
>>> So I have a
>>> List<String> tables = Arrays.asList("dbname.tablename", "dbname.tablename2", ...);
>>>
>>> then
>>> JavaRDD<String> rdd = javasparkcontext.parallelize(tables);
>>>
>>> JavaRDD<String> tablecontent = rdd.flatMap(
>>>     new FlatMapFunction<String, String>() { /* fetch the table and return its rows as an iterable */ });
>>>
>>> tablecontent.saveAsTextFile("hdfs path");
>>>
>>> Inside that flatMap function I cannot keep the complete table content in memory, so I want to create my own RDD to handle it.
>>>
>>> Thanks
>>> Shushant
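A minimal sketch of the streaming approach described at the top of this thread: each task opens its own JDBC connection and an HDFS output stream, and writes rows out as they are fetched, so no table ever has to fit in executor memory. Here tablelistrdd is the JavaRDD<String> of table names from above; the JDBC URL, credentials, output path, and single-column row formatting are hypothetical placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Iterator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.api.java.function.VoidFunction;

    tablelistrdd.foreachPartition(new VoidFunction<Iterator<String>>() {
      public void call(Iterator<String> tablenames) throws Exception {
        // executors pick up the cluster's Hadoop configuration from the classpath
        FileSystem fs = FileSystem.get(new Configuration());
        while (tablenames.hasNext()) {
          String table = tablenames.next();
          // hypothetical connection details
          Connection conn = DriverManager.getConnection(
              "jdbc:sqlserver://dbhost;databaseName=mydb", "user", "password");
          Statement stmt = conn.createStatement();
          ResultSet rs = stmt.executeQuery("SELECT * FROM " + table);
          FSDataOutputStream out = fs.create(new Path("/dumps/" + table + ".txt"));
          try {
            while (rs.next()) {
              // row formatting is table-specific; a single column shown for brevity
              out.writeBytes(rs.getString(1) + "\n");
            }
          } finally {
            out.close();
            rs.close();
            stmt.close();
            conn.close();
          }
        }
      }
    });

With this pattern the RDD only carries table names; the row data streams straight from JDBC to HDFS inside each task instead of being materialized in RDD partitions, so no custom RDD is required.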
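And the existing JdbcRDD that Silvio mentions is usable from Java through its static create helper (a sketch, one such RDD per table; the query must contain exactly two '?' placeholders that Spark fills with per-partition key bounds, and the connection details, key range, and partition count here are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.rdd.JdbcRDD;

    JavaRDD<String> rows = JdbcRDD.create(
        javasparkcontext,
        new JdbcRDD.ConnectionFactory() {
          public Connection getConnection() throws Exception {
            // hypothetical connection details
            return DriverManager.getConnection(
                "jdbc:sqlserver://dbhost;databaseName=mydb", "user", "password");
          }
        },
        // Spark binds the two '?' to each partition's lower and upper key bound
        "SELECT * FROM dbname.tablename WHERE id >= ? AND id <= ?",
        1L, 1000000L, 10,  // hypothetical key range and partition count
        new Function<ResultSet, String>() {
          public String call(ResultSet rs) throws Exception {
            return rs.getString(1);  // a single column shown for brevity
          }
        });
    rows.saveAsTextFile("hdfspath");

Note that JdbcRDD partitions a single table by a numeric key, so it parallelises within one table rather than across the table list.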