Re: Performance Improvement: Collect in spark taking huge time

2021-05-05 Thread Chetan Khatri
Hi All, Do you think replacing the collect() (used to build a Scala collection for a for loop) with the code block below will have any benefit? cachedColumnsAddTableDF.select("reporting_table").distinct().foreach(r => { r.getAs("reporting_table").asInstanceOf[String] }) On Wed, May 5, 2021 at 10:15 PM
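A note on the snippet above: a foreach on a DataFrame runs on the executors, so the mapped values never reach the driver and cannot replace collect() for building a Scala collection. A hedged sketch of the difference, assuming a SparkSession and a DataFrame shaped like the one in the message (the table name is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = spark.table("cached_columns_add_table") // hypothetical source

// This runs on the executors; the extracted strings are discarded there,
// so no driver-side collection is ever built.
df.select("reporting_table").distinct()
  .foreach(r => { r.getAs[String]("reporting_table"); () })

// To materialize the column on the driver, collect() is still needed,
// but only after narrowing to the single distinct column:
val tables: Array[String] =
  df.select("reporting_table").distinct().as[String].collect()
```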

Performance Improvement: Collect in spark taking huge time

2021-05-05 Thread Chetan Khatri
Hi All, collect() in Spark is taking a huge amount of time. I want to get the values of one column into a Scala collection. How can I do this? val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF .select(col("reporting_table")).except(clientSchemaDF)
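A common way to keep collect() cheap is to shrink the data before collecting. Continuing the snippet above as a sketch (it assumes the two DataFrames named in the message already exist, and that the column is a string):

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._ // assumes an in-scope SparkSession named `spark`

// Narrow to the single column and remove already-known values first,
// so the driver receives the minimum possible amount of data.
val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF
  .select(col("reporting_table"))
  .except(clientSchemaDF)

// as[String] avoids Row boxing; collect() now returns Array[String].
val reportingTables: Array[String] =
  newDynamicFieldTablesDF.distinct().as[String].collect()
```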

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi Lalwani, But I need to augment the directory-specific data to every record of that directory. Once I have read the data, there is no link back to the directory in the data which I could use to attach the additional data. On Wed, May 5, 2021 at 10:41 PM Lalwani, Jayesh wrote: > You don’t have to
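One common way to recover the source directory per record, even after a single multi-directory read (not necessarily what this thread settled on), is Spark's built-in input_file_name() function. A sketch with hypothetical paths:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical paths; the real directories come from the job config.
val dirs = Seq("hdfs:///data/dirA", "hdfs:///data/dirB")

val df = spark.read.parquet(dirs: _*)
  .withColumn("source_file", input_file_name())
  // Extract the parent directory from the full file path.
  .withColumn("source_dir", regexp_extract($"source_file", "^(.*)/[^/]+$", 1))

// source_dir can now be joined against a small directory-metadata table
// to augment each record with its directory-specific data.
```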

Re: How to read multiple HDFS directories

2021-05-05 Thread Lalwani, Jayesh
You don’t have to union multiple RDDs. You can read files from multiple directories in a single read call. Spark will manage partitioning of the data across directories. From: Kapil Garg Date: Wednesday, May 5, 2021 at 10:45 AM To: spark users Subject: [EXTERNAL] How to read multiple HDFS
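The single-read-call approach Jayesh describes can be sketched as a varargs path list (Parquet is assumed here; the same pattern applies to json, text, etc.):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical directory list; Spark plans one job across all of them
// and handles partitioning internally, with no manual union required.
val dirs: Seq[String] = Seq("hdfs:///data/d1", "hdfs:///data/d2", "hdfs:///data/d3")

val all = spark.read.parquet(dirs: _*)

// RDD equivalent: textFile accepts a comma-separated list of paths.
val rdd = spark.sparkContext.textFile(dirs.mkString(","))
```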

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi Mich, The number of directories can be 1000+; doing 1000+ reduceByKey and union operations might be a costlier approach. On Wed, May 5, 2021 at 10:22 PM Mich Talebzadeh wrote: > This is my take > > >1. read the current snapshot (provide empty if it doesn't exist yet) >2. Loop over N
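On the cost concern: a single reduceByKey after the union is one shuffle regardless of how many directories feed it, and the union itself can be done in one call rather than as 1000+ chained unions, which keeps the lineage shallow. A sketch, assuming RDDs of key/value pairs:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch only: assumes `sc` is the active SparkContext and each of the
// 1000+ directories has already been read into an RDD of (key, value).
def combine(sc: SparkContext, rdds: Seq[RDD[(String, Long)]]): RDD[(String, Long)] = {
  // SparkContext.union builds one UnionRDD over all inputs at once,
  // instead of a 1000-deep chain of a.union(b).union(c)...
  sc.union(rdds).reduceByKey(_ + _) // a single shuffle for the whole job
}
```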

Re: Graceful shutdown SPARK Structured Streaming

2021-05-05 Thread Mich Talebzadeh
Hi, I believe I discussed this in this forum. I sent the following to the spark-dev forum as an add-on to Spark functionality. This is the gist of it: Spark Structured Streaming (SSS) is a very useful tool for dealing with Event-Driven Architecture. In an Event-Driven Architecture, there is
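The pattern discussed on the list polls for an external stop signal and stops the query between micro-batches rather than killing the application. A minimal sketch of that idea; the marker-file path, the poll interval, and the rate source are all assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.fs.{FileSystem, Path}

val spark = SparkSession.builder().getOrCreate()

// Hypothetical query; a real job would read from Kafka or similar.
val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val stopMarker = new Path("hdfs:///control/stop_streaming") // assumed path

// awaitTermination(timeoutMs) returns false while the query is still
// running; stop() then ends it cleanly rather than mid-batch via a kill.
while (!query.awaitTermination(10000)) {
  if (fs.exists(stopMarker)) query.stop()
}
```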

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
This is my take: 1. read the current snapshot (provide an empty one if it doesn't exist yet) 2. loop over the N directories: 1. read unprocessed new data from HDFS 2. union them and do a `reduceByKey` operation 3. output a new version of the snapshot HTH
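The steps above can be sketched as follows; the snapshot path, key/value types, and directory list are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.hadoop.fs.{FileSystem, Path}

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext
val fs = FileSystem.get(sc.hadoopConfiguration)

// 1. Read the current snapshot, or start from an empty RDD.
val snapPath = "hdfs:///snapshot/current" // assumed location
val snapshot: RDD[(String, Long)] =
  if (fs.exists(new Path(snapPath))) sc.objectFile[(String, Long)](snapPath)
  else sc.emptyRDD[(String, Long)]

// 2. Loop over the N directories, reading unprocessed data from each,
//    then union everything and reduce by key in one shuffle.
val newDirs = Seq("hdfs:///incoming/d1", "hdfs:///incoming/d2") // assumed
val newData = newDirs.map(d => sc.textFile(d).map(line => (line, 1L)))
val merged = sc.union(snapshot +: newData).reduceByKey(_ + _)

// 3. Write out a new version of the snapshot.
merged.saveAsObjectFile("hdfs:///snapshot/v2")
```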

Fwd: Graceful shutdown SPARK Structured Streaming

2021-05-05 Thread Gourav Sengupta
Hi, just thought of reaching out once again and seeking your kind help to find out the best way to stop Spark Streaming gracefully. Do we still use the method of creating a file, as in Spark 2.4.x, which is a several-years-old approach, or is there a better approach in Spark 3.1?

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Sorry but I didn't get the question. It is possible that 1 record is present in multiple directories. That's why we do a reduceByKey after the union step. On Wed, May 5, 2021 at 9:20 PM Mich Talebzadeh wrote: > When you are doing union on these RDDs, (each RDD has one to one > correspondence

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
When you are doing a union on these RDDs (each RDD has a one-to-one correspondence with an HDFS directory), do you have a common key across all of them?

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi Mich, I went through the thread and it doesn't relate to the problem statement I shared above. In my problem statement, there is a simple ETL job which doesn't use any external library (such as pandas). This is the flow: hdfsDirs := List() // contains N directories; rddList := List(); for
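The flow is truncated above; a hedged reconstruction of the loop based on the rest of the thread (read each directory, attach its metadata at read time, then union and reduce). The function and map names are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch only: `sc`, the directory list, and the metadata lookup are
// assumptions. Per the thread, records carry no link back to their
// directory, so the directory data must be attached at read time.
def runEtl(sc: SparkContext,
           hdfsDirs: Seq[String],
           dirMeta: Map[String, String]): RDD[(String, String)] = {
  val rddList: Seq[RDD[(String, String)]] = hdfsDirs.map { dir =>
    val meta = dirMeta(dir) // directory-specific data to augment
    sc.textFile(dir).map(record => (record, meta))
  }
  // One union over all RDDs, then one reduceByKey, since the same
  // record key can appear in multiple directories.
  sc.union(rddList).reduceByKey((a, b) => if (a == b) a else a + "," + b)
}
```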

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
Hi, Have a look at the thread called "Tasks are skewed to one executor" and see if it helps; we can take it from there. HTH

How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi, I am facing issues while reading multiple HDFS directories. Please read the problem statement and current approach below. *Problem Statement* There are N HDFS directories, each having K files. We want to read data from all directories such that when we read data from directory D, we map all the
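The preview cuts off mid-sentence, but from the rest of the thread the goal is to map directory-level data onto every record of that directory. One DataFrame-style sketch is to tag each directory's data at read time with a literal column; the paths and metadata values are assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().getOrCreate()

// Assumed inputs: N directories, each with its own directory-level metadata.
val dirMeta: Map[String, String] =
  Map("hdfs:///data/d1" -> "clientA", "hdfs:///data/d2" -> "clientB")

// Tag each directory's records with its metadata as they are read.
val perDir: Seq[DataFrame] = dirMeta.toSeq.map { case (dir, meta) =>
  spark.read.parquet(dir).withColumn("dir_meta", lit(meta))
}

// Reduce the N frames into one; unionByName keeps columns aligned.
val all = perDir.reduce(_ unionByName _)
```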

Does Pyspark script support Sonarqube

2021-05-05 Thread Priyanka Kakkar
Hi All, Hope all is well. I just need some info on whether PySpark scripts support SonarQube code coverage and quality gates. Awaiting your response. Thank you, Priyanka Choudhury