Re: Performance Improvement: Collect in spark taking huge time

2021-05-05 Thread Chetan Khatri
Hi All, Do you think replacing the collect() (used to build a Scala collection for a for loop) with the code block below will have any benefit? cachedColumnsAddTableDF.select("reporting_table").distinct().foreach(r => { r.getAs("reporting_table").asInstanceOf[String] }) On Wed, May 5, 2021 at 10:15 PM
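A note on the snippet above: a foreach on a DataFrame runs on the executors, so the mapped values never reach the driver and cannot replace collect() for building a Scala collection. A hedged sketch of the difference, assuming a SparkSession and a DataFrame shaped like the one in the message (the table name is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = spark.table("cached_columns_add_table") // hypothetical source

// This runs on the executors; the extracted strings are discarded there,
// so no driver-side collection is ever built.
df.select("reporting_table").distinct()
  .foreach(r => { r.getAs[String]("reporting_table"); () })

// To materialize the column on the driver, collect() is still needed,
// but only after narrowing to the single distinct column:
val tables: Array[String] =
  df.select("reporting_table").distinct().as[String].collect()
```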

Performance Improvement: Collect in spark taking huge time

2021-05-05 Thread Chetan Khatri
Hi All, collect() in Spark is taking a huge amount of time. I want to get the values of one column into a Scala collection. How can I do this? val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF .select(col("reporting_table")).except(clientSchemaDF)
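A common way to keep collect() cheap is to shrink the data before collecting. Continuing the snippet above as a sketch (it assumes the two DataFrames named in the message already exist, and that the column is a string):

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._ // assumes an in-scope SparkSession named `spark`

// Narrow to the single column and remove already-known values first,
// so the driver receives the minimum possible amount of data.
val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF
  .select(col("reporting_table"))
  .except(clientSchemaDF)

// as[String] avoids Row boxing; collect() now returns Array[String].
val reportingTables: Array[String] =
  newDynamicFieldTablesDF.distinct().as[String].collect()
```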

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi Lalwani, But I need to augment the directory-specific data to every record of that directory. Once I have read the data, there is no link back to the directory in the data which I could use to attach the additional data. On Wed, May 5, 2021 at 10:41 PM Lalwani, Jayesh wrote: > You don’t have to
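One common way to recover the source directory per record, even after a single multi-directory read (not necessarily what this thread settled on), is Spark's built-in input_file_name() function. A sketch with hypothetical paths:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical paths; the real directories come from the job config.
val dirs = Seq("hdfs:///data/dirA", "hdfs:///data/dirB")

val df = spark.read.parquet(dirs: _*)
  .withColumn("source_file", input_file_name())
  // Extract the parent directory from the full file path.
  .withColumn("source_dir", regexp_extract($"source_file", "^(.*)/[^/]+$", 1))

// source_dir can now be joined against a small directory-metadata table
// to augment each record with its directory-specific data.
```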

Re: How to read multiple HDFS directories

2021-05-05 Thread Lalwani, Jayesh
You don’t have to union multiple RDDs. You can read files from multiple directories in a single read call. Spark will manage partitioning of the data across directories. From: Kapil Garg Date: Wednesday, May 5, 2021 at 10:45 AM To: spark users Subject: [EXTERNAL] How to read multiple HDFS
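The single-read-call approach Jayesh describes can be sketched as a varargs path list (Parquet is assumed here; the same pattern applies to json, text, etc.):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical directory list; Spark plans one job across all of them
// and handles partitioning internally, with no manual union required.
val dirs: Seq[String] = Seq("hdfs:///data/d1", "hdfs:///data/d2", "hdfs:///data/d3")

val all = spark.read.parquet(dirs: _*)

// RDD equivalent: textFile accepts a comma-separated list of paths.
val rdd = spark.sparkContext.textFile(dirs.mkString(","))
```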

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi Mich, The number of directories can be 1000+; doing 1000+ reduceByKey and union operations might be a costlier approach. On Wed, May 5, 2021 at 10:22 PM Mich Talebzadeh wrote: > This is my take > > >1. read the current snapshot (provide empty if it doesn't exist yet) >2. Loop over N
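On the cost concern: a single reduceByKey after the union is one shuffle regardless of how many directories feed it, and the union itself can be done in one call rather than as 1000+ chained unions, which keeps the lineage shallow. A sketch, assuming RDDs of key/value pairs:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch only: assumes `sc` is the active SparkContext and each of the
// 1000+ directories has already been read into an RDD of (key, value).
def combine(sc: SparkContext, rdds: Seq[RDD[(String, Long)]]): RDD[(String, Long)] = {
  // SparkContext.union builds one UnionRDD over all inputs at once,
  // instead of a 1000-deep chain of a.union(b).union(c)...
  sc.union(rdds).reduceByKey(_ + _) // a single shuffle for the whole job
}
```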

Re: Graceful shutdown SPARK Structured Streaming

2021-05-05 Thread Mich Talebzadeh
Hi, I believe I discussed this in this forum. I sent the following to the spark-dev forum as an add-on to Spark functionality. This is the gist of it: Spark Structured Streaming (SSS) is a very useful tool for dealing with Event-Driven Architecture. In an Event-Driven Architecture, there is
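The pattern discussed on the list polls for an external stop signal and stops the query between micro-batches rather than killing the application. A minimal sketch of that idea; the marker-file path, the poll interval, and the rate source are all assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.fs.{FileSystem, Path}

val spark = SparkSession.builder().getOrCreate()

// Hypothetical query; a real job would read from Kafka or similar.
val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val stopMarker = new Path("hdfs:///control/stop_streaming") // assumed path

// awaitTermination(timeoutMs) returns false while the query is still
// running; stop() then ends it cleanly rather than mid-batch via a kill.
while (!query.awaitTermination(10000)) {
  if (fs.exists(stopMarker)) query.stop()
}
```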

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
This is my take: 1. read the current snapshot (provide an empty one if it doesn't exist yet) 2. loop over the N directories: 1. read unprocessed new data from HDFS 2. union them and do a `reduceByKey` operation 3. output a new version of the snapshot HTH
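The steps above can be sketched as follows; the snapshot path, key/value types, and directory list are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.hadoop.fs.{FileSystem, Path}

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext
val fs = FileSystem.get(sc.hadoopConfiguration)

// 1. Read the current snapshot, or start from an empty RDD.
val snapPath = "hdfs:///snapshot/current" // assumed location
val snapshot: RDD[(String, Long)] =
  if (fs.exists(new Path(snapPath))) sc.objectFile[(String, Long)](snapPath)
  else sc.emptyRDD[(String, Long)]

// 2. Loop over the N directories, reading unprocessed data from each,
//    then union everything and reduce by key in one shuffle.
val newDirs = Seq("hdfs:///incoming/d1", "hdfs:///incoming/d2") // assumed
val newData = newDirs.map(d => sc.textFile(d).map(line => (line, 1L)))
val merged = sc.union(snapshot +: newData).reduceByKey(_ + _)

// 3. Write out a new version of the snapshot.
merged.saveAsObjectFile("hdfs:///snapshot/v2")
```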

Fwd: Graceful shutdown SPARK Structured Streaming

2021-05-05 Thread Gourav Sengupta
Hi, just thought of reaching out once again and seeking your kind help to find out the best way to stop Spark Streaming gracefully. Do we still use the method of creating a file, as in Spark 2.4.x, which is a several-years-old approach, or is there a better approach in Spark 3.1?

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Sorry but I didn't get the question. It is possible that 1 record is present in multiple directories. That's why we do a reduceByKey after the union step. On Wed, May 5, 2021 at 9:20 PM Mich Talebzadeh wrote: > When you are doing union on these RDDs, (each RDD has one to one > correspondence

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
When you are doing a union on these RDDs (each RDD has a one-to-one correspondence with an HDFS directory), do you have a common key across all of them?

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi Mich, I went through the thread and it doesn't relate to the problem statement I shared above. In my problem statement, there is a simple ETL job which doesn't use any external library (such as pandas). This is the flow: hdfsDirs := List() // contains N directories; rddList := List(); for
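The flow is truncated above; a hedged reconstruction of the loop based on the rest of the thread (read each directory, attach its metadata at read time, then union and reduce). The function and map names are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch only: `sc`, the directory list, and the metadata lookup are
// assumptions. Per the thread, records carry no link back to their
// directory, so the directory data must be attached at read time.
def runEtl(sc: SparkContext,
           hdfsDirs: Seq[String],
           dirMeta: Map[String, String]): RDD[(String, String)] = {
  val rddList: Seq[RDD[(String, String)]] = hdfsDirs.map { dir =>
    val meta = dirMeta(dir) // directory-specific data to augment
    sc.textFile(dir).map(record => (record, meta))
  }
  // One union over all RDDs, then one reduceByKey, since the same
  // record key can appear in multiple directories.
  sc.union(rddList).reduceByKey((a, b) => if (a == b) a else a + "," + b)
}
```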

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
Hi, Have a look at the thread called "Tasks are skewed to one executor" and see if it helps; we can take it from there. HTH

How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi, I am facing issues while reading multiple HDFS directories. Please read the problem statement and current approach below. *Problem Statement* There are N HDFS directories, each having K files. We want to read data from all directories such that when we read data from directory D, we map all the
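The preview cuts off mid-sentence, but from the rest of the thread the goal is to map directory-level data onto every record of that directory. One DataFrame-style sketch is to tag each directory's data at read time with a literal column; the paths and metadata values are assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().getOrCreate()

// Assumed inputs: N directories, each with its own directory-level metadata.
val dirMeta: Map[String, String] =
  Map("hdfs:///data/d1" -> "clientA", "hdfs:///data/d2" -> "clientB")

// Tag each directory's records with its metadata as they are read.
val perDir: Seq[DataFrame] = dirMeta.toSeq.map { case (dir, meta) =>
  spark.read.parquet(dir).withColumn("dir_meta", lit(meta))
}

// Reduce the N frames into one; unionByName keeps columns aligned.
val all = perDir.reduce(_ unionByName _)
```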

Does Pyspark script support Sonarqube

2021-05-05 Thread Priyanka Kakkar
Hi All, Hope all is well. I just need some info on whether PySpark scripts support SonarQube code coverage and quality gates. Awaiting your response. Thank you, Priyanka Choudhury