Hi Rohit,

You can use an accumulator and increment it as each record is processed. Once the write action has finished you can read the accumulator's value on the driver, which will give you the count without running a separate count() over the JDBC source.
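For example, something along these lines (a rough, untested sketch against the Spark 2.x Java API; the map/RowEncoder plumbing and the name recordCount are my additions, and jdbcUrl/query are assumed to exist as in your snippet):

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.catalyst.encoders.RowEncoder;
    import org.apache.spark.util.LongAccumulator;
    import java.util.Properties;

    private long etlFunction(SparkSession spark) {
        spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "SNAPPY");
        Properties properties = new Properties();
        properties.put("fetchSize", "5000");

        // Accumulator is registered on the driver; executors only add to it.
        LongAccumulator recordCount = spark.sparkContext().longAccumulator("recordCount");

        Dataset<Row> dataset = spark.read().jdbc(jdbcUrl, query, properties);

        // Bump the accumulator as each row flows through, keeping the schema unchanged.
        Dataset<Row> counted = dataset.map(
                (MapFunction<Row, Row>) row -> { recordCount.add(1L); return row; },
                RowEncoder.apply(dataset.schema()));

        // The write is the only action, so the rows are pulled from the database once.
        counted.write().format("parquet").save("pdfs-path");

        // Safe to read on the driver after the action has completed.
        return recordCount.value();
    }

One caveat: accumulators updated inside a transformation (rather than an action) can over-count if tasks are retried or a stage is recomputed, so treat the value as indicative rather than exact if you need a guaranteed count.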
HTH
Deepak

On Nov 5, 2016 20:09, "Rohit Verma" <rohit.ve...@rokittech.com> wrote:
> I am using Spark to read from a database and write to HDFS as a parquet file.
> Here is the code snippet.
>
> private long etlFunction(SparkSession spark) {
>     spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "SNAPPY");
>     Properties properties = new Properties();
>     properties.put("driver", "oracle.jdbc.driver");
>     properties.put("fetchSize", "5000");
>     Dataset<Row> dataset = spark.read().jdbc(jdbcUrl, query, properties);
>     dataset.write().format("parquet").save("pdfs-path");
>     return dataset.count();
> }
>
> When I look at the Spark UI during the write, I can see stats on records written in the SQL tab under the query plan.
>
> The count itself is a heavy task.
>
> Can someone suggest the best way to get the count in the most optimized way?
>
> Thanks all..