Re: Spark SQL Teradata load is very slow

2019-05-03 Thread Shyam P
Asmath,
Why is upperBound set to 300? How many cores do you have?
Check how the data is distributed in the Teradata table:

SELECT itm_bloon_seq_no, COUNT(*) AS cc FROM TABLE
GROUP BY itm_bloon_seq_no ORDER BY itm_bloon_seq_no DESC;
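
If the real range of itm_bloon_seq_no goes far beyond 300, that alone can
explain the skew: lowerBound and upperBound do not filter rows, they only
decide the stride boundaries. Spark splits [lowerBound, upperBound] into
numPartitions equal strides and pushes one WHERE clause per partition, so
everything outside the range piles into the first and last partitions. A
rough sketch of the predicates it generates (illustrative only, assuming
numPartitions = 3 with your bounds of 0 and 300):

  // partition 0: WHERE itm_bloon_seq_no < 100 OR itm_bloon_seq_no IS NULL
  // partition 1: WHERE itm_bloon_seq_no >= 100 AND itm_bloon_seq_no < 200
  // partition 2: WHERE itm_bloon_seq_no >= 200  (every row above 300 lands here)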

Is the column "itm_bloon_seq_no" already in the table, or did you derive
it on the Spark side?
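
One pattern worth trying (a sketch only; YOUR_VIEW and the BIGINT cast are
placeholders and assumptions, not tested against your schema) is to fetch
the real MIN/MAX of the partition column first and feed those in as the
bounds:

  // One-row query to get the actual range of the partition column.
  val bounds = sparkSession.read.format("jdbc")
    .option("url", "jdbc:teradata://PROD/DATABASE=101")
    .option("user", "HDFS_TD")
    .option("password", "C")
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("dbtable",
      "(SELECT CAST(MIN(itm_bloon_seq_no) AS BIGINT) AS lo, " +
      "CAST(MAX(itm_bloon_seq_no) AS BIGINT) AS hi FROM YOUR_VIEW) AS t")
    .load()
    .first()

  // Partitioned read bounded by the real range, so the strides cover the data.
  val df2 = sparkSession.read.format("jdbc")
    .option("url", "jdbc:teradata://PROD/DATABASE=101")
    .option("user", "HDFS_TD")
    .option("password", "C")
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("dbtable", "YOUR_VIEW")
    .option("partitionColumn", "itm_bloon_seq_no")
    .option("lowerBound", bounds.getAs[Long]("lo"))
    .option("upperBound", bounds.getAs[Long]("hi"))
    .option("numPartitions", partitions)
    .load()

Even with correct bounds, equal strides only balance the load if the column
values are roughly uniform, which is what the count query above will show.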

Thanks,
Shyam


On Thu, May 2, 2019 at 11:30 PM KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> I have a Teradata table that has more than 2.5 billion records, and the
> data size is around 600 GB. I am not able to pull it efficiently using
> Spark SQL; it has been running for more than 11 hours. Here is my code:
>
>   val df2 = sparkSession.read.format("jdbc")
>     .option("url", "jdbc:teradata://PROD/DATABASE=101")
>     .option("user", "HDFS_TD")
>     .option("password", "C")
>     .option("dbtable", "")
>     .option("numPartitions", partitions)
>     .option("driver", "com.teradata.jdbc.TeraDriver")
>     .option("partitionColumn", "itm_bloon_seq_no")
>     .option("lowerBound", config.getInt("lowerBound"))
>     .option("upperBound", config.getInt("upperBound"))
>     .load()
>
> The lower bound is 0 and the upper bound is 300. Spark is using multiple
> executors, but while most of them finish quickly, a few take much longer,
> maybe due to shuffling or something else.
>
> I also tried repartitioning on the column, but no luck. Is there a better
> way to load this faster?
>
> The source in Teradata is actually a view, not a table.
>
> Thanks,
> Asmath
>
>


Spark SQL Teradata load is very slow

2019-05-02 Thread KhajaAsmath Mohammed
Hi,

I have a Teradata table that has more than 2.5 billion records, and the
data size is around 600 GB. I am not able to pull it efficiently using
Spark SQL; it has been running for more than 11 hours. Here is my code:

  val df2 = sparkSession.read.format("jdbc")
    .option("url", "jdbc:teradata://PROD/DATABASE=101")
    .option("user", "HDFS_TD")
    .option("password", "C")
    .option("dbtable", "")
    .option("numPartitions", partitions)
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("partitionColumn", "itm_bloon_seq_no")
    .option("lowerBound", config.getInt("lowerBound"))
    .option("upperBound", config.getInt("upperBound"))
    .load()

The lower bound is 0 and the upper bound is 300. Spark is using multiple
executors, but while most of them finish quickly, a few take much longer,
maybe due to shuffling or something else.

I also tried repartitioning on the column, but no luck. Is there a better
way to load this faster?

The source in Teradata is actually a view, not a table.

Thanks,
Asmath