Re: Spark SQL Teradata load is very slow
Asmath,

Why is upperBound set to 300? How many cores do you have? Check how the data is distributed over that column in the Teradata table:

SELECT itm_bloon_seq_no, COUNT(*) AS cc
FROM <your_table>
GROUP BY itm_bloon_seq_no
ORDER BY itm_bloon_seq_no DESC;

Is the column "itm_bloon_seq_no" already in the table, or is it derived on the Spark side?

Thanks,
Shyam

On Thu, May 2, 2019 at 11:30 PM KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> I have a Teradata table with more than 2.5 billion records, and the data
> size is around 600 GB. I am not able to pull it efficiently using Spark SQL;
> the job has been running for more than 11 hours. Here is my code:
>
>     val df2 = sparkSession.read.format("jdbc")
>       .option("url", "jdbc:teradata://PROD/DATABASE=101")
>       .option("user", "HDFS_TD")
>       .option("password", "C")
>       .option("dbtable", "")
>       .option("numPartitions", partitions)
>       .option("driver", "com.teradata.jdbc.TeraDriver")
>       .option("partitionColumn", "itm_bloon_seq_no")
>       .option("lowerBound", config.getInt("lowerBound"))
>       .option("upperBound", config.getInt("upperBound"))
>
> lowerBound is 0 and upperBound is 300. Spark is using multiple executors,
> but while most of them run fast, a few take much longer, maybe due to
> shuffling or something else.
>
> I also tried repartitioning on the column, but no luck. Is there a better
> way to load this faster?
>
> The object in Teradata is a view, not a table.
>
> Thanks,
> Asmath
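Following the suggestion above, the real bounds can be queried from Teradata and fed into the partitioned read instead of a hard-coded 0..300. This is only a sketch: the connection details mirror the original post, while the table name, password, and partition count are placeholders, and `sparkSession` is assumed to be in scope as in the original code.

```scala
// Step 1: fetch the true MIN/MAX of the partition column via a pushdown
// subquery, so the bounds actually span the data.
val bounds = sparkSession.read.format("jdbc")
  .option("url", "jdbc:teradata://PROD/DATABASE=101")
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .option("user", "HDFS_TD")
  .option("password", "***")
  .option("dbtable",
    "(SELECT MIN(itm_bloon_seq_no) AS lo, MAX(itm_bloon_seq_no) AS hi FROM your_table) t")
  .load()
  .first()

// Step 2: run the partitioned read with bounds that cover the column's
// real range. Depending on the Teradata column type, lo/hi may come back
// as a decimal; adjust the getAs accordingly.
val df2 = sparkSession.read.format("jdbc")
  .option("url", "jdbc:teradata://PROD/DATABASE=101")
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .option("user", "HDFS_TD")
  .option("password", "***")
  .option("dbtable", "your_table")
  .option("partitionColumn", "itm_bloon_seq_no")
  .option("lowerBound", bounds.getAs[Long]("lo"))
  .option("upperBound", bounds.getAs[Long]("hi"))
  .option("numPartitions", 64) // tune to a small multiple of total executor cores
  .load()
```

Note that lowerBound/upperBound only shape the partition predicates; they do not filter rows, so wrong bounds silently pile most rows into one partition.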
Spark SQL Teradata load is very slow
Hi,

I have a Teradata table with more than 2.5 billion records, and the data size is around 600 GB. I am not able to pull it efficiently using Spark SQL; the job has been running for more than 11 hours. Here is my code:

    val df2 = sparkSession.read.format("jdbc")
      .option("url", "jdbc:teradata://PROD/DATABASE=101")
      .option("user", "HDFS_TD")
      .option("password", "C")
      .option("dbtable", "")
      .option("numPartitions", partitions)
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("partitionColumn", "itm_bloon_seq_no")
      .option("lowerBound", config.getInt("lowerBound"))
      .option("upperBound", config.getInt("upperBound"))

lowerBound is 0 and upperBound is 300. Spark is using multiple executors, but while most of them run fast, a few take much longer, maybe due to shuffling or something else.

I also tried repartitioning on the column, but no luck. Is there a better way to load this faster?

The object in Teradata is a view, not a table.

Thanks,
Asmath
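The straggling executors described above are consistent with how Spark's JDBC source turns lowerBound/upperBound into per-partition WHERE clauses: rows outside the bounds all land in the first or last partition. The following is an illustrative sketch of that stride logic (mirroring, not quoting, Spark's internal partitioning; the function name is made up):

```scala
// Sketch of how a numeric partition column is split into WHERE predicates.
// With lowerBound=0 and upperBound=300, every row whose itm_bloon_seq_no
// is >= 300 falls into the LAST partition -- so if the column really
// ranges into the millions, one executor does nearly all the work.
def columnPartitions(column: String,
                     lower: Long,
                     upper: Long,
                     numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = lo + stride
    if (i == 0)                      s"$column < $hi OR $column IS NULL"
    else if (i == numPartitions - 1) s"$column >= $lo"
    else                             s"$column >= $lo AND $column < $hi"
  }
}

val preds = columnPartitions("itm_bloon_seq_no", 0L, 300L, 10)
// The last predicate, "itm_bloon_seq_no >= 270", is unbounded above:
// that is where the skew comes from when the bounds undershoot the data.
```

In other words, upperBound = 300 against a 2.5-billion-row table almost guarantees one giant partition unless the column itself tops out near 300.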
Re: Spark on teradata?
Depending on your use case: if it is to extract a small amount of data out of Teradata, you can use JdbcRDD, and soon a JDBC input source based on the new Spark SQL external data source API.

On Wed, Jan 7, 2015 at 7:14 AM, gen tang <gen.tan...@gmail.com> wrote:

> Hi,
>
> I have a stupid question: is it possible to use Spark on a Teradata data
> warehouse? I read some articles on the internet that say yes; however, I
> didn't find any examples of this.
>
> Thanks in advance.
> Cheers
> Gen
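For the small-extract case mentioned above, JdbcRDD can be sketched as follows. The connection string, credentials, table, column names, and bounds are all placeholders; this assumes the Teradata JDBC driver jar is on the classpath.

```scala
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("teradata-extract"))

val rdd = new JdbcRDD(
  sc,
  // A fresh connection is opened per partition, on the executor.
  () => DriverManager.getConnection(
    "jdbc:teradata://PROD/DATABASE=101", "user", "pass"),
  // The query MUST contain exactly two '?' placeholders; Spark binds them
  // to each partition's lower and upper id.
  "SELECT id, payload FROM some_table WHERE id >= ? AND id <= ?",
  lowerBound = 1L,
  upperBound = 1000000L,
  numPartitions = 10,
  mapRow = rs => (rs.getLong(1), rs.getString(2))
)
```

As with the DataFrame JDBC source, the bounds only shape the per-partition queries, so they should span the real id range of the table.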
Re: Spark on teradata?
Thanks a lot for your reply. In fact, I need to work on almost all of the data in Teradata (~100 TB), so I don't think JdbcRDD is a good choice.

Cheers
Gen

On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin <r...@databricks.com> wrote:

> Depending on your use case: if it is to extract a small amount of data out
> of Teradata, you can use JdbcRDD, and soon a JDBC input source based on the
> new Spark SQL external data source API.
>
> On Wed, Jan 7, 2015 at 7:14 AM, gen tang <gen.tan...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a stupid question: is it possible to use Spark on a Teradata data
>> warehouse? I read some articles on the internet that say yes; however, I
>> didn't find any examples of this.
>>
>> Thanks in advance.
>> Cheers
>> Gen
Re: Spark on teradata?
Have you taken a look at TeradataDBInputFormat? Spark is compatible with arbitrary Hadoop input formats, so this might work for you:
http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw

On Thu, Jan 8, 2015 at 10:53 AM, gen tang <gen.tan...@gmail.com> wrote:

> Thanks a lot for your reply. In fact, I need to work on almost all of the
> data in Teradata (~100 TB), so I don't think JdbcRDD is a good choice.
>
> Cheers
> Gen
>
> On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Depending on your use case: if it is to extract a small amount of data out
>> of Teradata, you can use JdbcRDD, and soon a JDBC input source based on the
>> new Spark SQL external data source API.
>>
>> On Wed, Jan 7, 2015 at 7:14 AM, gen tang <gen.tan...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a stupid question: is it possible to use Spark on a Teradata data
>>> warehouse? I read some articles on the internet that say yes; however, I
>>> didn't find any examples of this.
>>>
>>> Thanks in advance.
>>> Cheers
>>> Gen
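The Teradata connector linked above ships its own input format whose exact class names depend on the connector jar, so as a generic illustration of wiring any JDBC-backed Hadoop InputFormat into Spark, here is a sketch using Hadoop's stock DBInputFormat; the Teradata classes would be wired up analogously. Table, column, and credentials are placeholders.

```scala
import java.io.{DataInput, DataOutput}
import java.sql.{PreparedStatement, ResultSet}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Writable}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.db.{DBConfiguration, DBInputFormat, DBWritable}
import org.apache.spark.{SparkConf, SparkContext}

// Value class the input format hydrates from each JDBC row; must implement
// both DBWritable (JDBC side) and Writable (Hadoop serialization side).
class ItemRecord extends Writable with DBWritable {
  var seqNo: Long = 0L
  override def readFields(rs: ResultSet): Unit = { seqNo = rs.getLong(1) }
  override def write(ps: PreparedStatement): Unit = { ps.setLong(1, seqNo) }
  override def readFields(in: DataInput): Unit = { seqNo = in.readLong() }
  override def write(out: DataOutput): Unit = { out.writeLong(seqNo) }
}

val sc = new SparkContext(new SparkConf().setAppName("td-inputformat"))
val job = Job.getInstance(new Configuration())

DBConfiguration.configureDB(job.getConfiguration,
  "com.teradata.jdbc.TeraDriver", "jdbc:teradata://PROD/DATABASE=101",
  "user", "pass")
DBInputFormat.setInput(job, classOf[ItemRecord],
  "some_table", null, "itm_bloon_seq_no", "itm_bloon_seq_no")

// DBInputFormat keys records by row offset (LongWritable).
val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[DBInputFormat[ItemRecord]], classOf[LongWritable], classOf[ItemRecord])
```

A vendor connector like TeradataDBInputFormat typically improves on this by pushing splits into the database (e.g. via FastExport) rather than issuing offset-ranged queries.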
Spark on teradata?
Hi,

I have a stupid question: is it possible to use Spark on a Teradata data warehouse? I read some articles on the internet that say yes; however, I didn't find any examples of this.

Thanks in advance.
Cheers
Gen