Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
> ave to union multiple RDDs. You can read files from multiple directories in a single read call. Spark will manage partitioning of the data across directories.
>
> From: Kapil Garg
> Date: Wednesday, May 5, 2021 at 10:45 AM
> To: spark users
> Subject: [EXTER

Re: How to read multiple HDFS directories

2021-05-05 Thread Lalwani, Jayesh
You don’t have to union multiple RDDs. You can read files from multiple directories in a single read call. Spark will manage partitioning of the data across directories.

From: Kapil Garg
Date: Wednesday, May 5, 2021 at 10:45 AM
To: spark users
Subject: [EXTERNAL] How to read multiple HDFS
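A minimal sketch of the single-read-call suggestion above, assuming PySpark and an already-created `SparkContext` passed in as `sc`. `SparkContext.textFile` accepts a comma-separated list of paths; the helper names and the directory paths are hypothetical and only illustrate the idea.

```python
def to_path_arg(dirs):
    """Join directory paths into the comma-separated string that textFile accepts."""
    return ",".join(dirs)


def read_dirs(sc, dirs):
    """Read every file under all given directories as one RDD of lines.

    `sc` is assumed to be an existing SparkContext. Spark decides how the
    input files are split into partitions.
    """
    return sc.textFile(to_path_arg(dirs))


# Hypothetical directories, for illustration only.
hdfs_dirs = ["hdfs:///data/dir1", "hdfs:///data/dir2"]
print(to_path_arg(hdfs_dirs))
```

Note that a plain `textFile` call does not record which directory a line came from, so this sketch only fits cases where that information is not needed downstream.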

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi Mich, the number of directories can be 1000+. Doing 1000+ reduceByKey and union operations might be costly.

On Wed, May 5, 2021 at 10:22 PM Mich Talebzadeh wrote:
> This is my take
>
> 1. read the current snapshot (provide empty if it doesn't exist yet)
> 2. Loop over N

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
This is my take

1. read the current snapshot (provide empty if it doesn't exist yet)
2. Loop over N directories
   1. read unprocessed new data from HDFS
   2. union them and do a `reduceByKey` operation
3. output a new version of the snapshot

HTH
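The steps above could be sketched roughly as follows, assuming PySpark, RDDs of `(key, value)` pairs, and a merge function `combine` that is associative and commutative (as `reduceByKey` requires). This variant unions everything first and calls `reduceByKey` once, rather than once per directory; that is an assumption of this sketch, not something stated in the thread. `merge_local` is a hypothetical plain-Python model of the same logic, useful for checking `combine` without a cluster.

```python
def merge_snapshot(sc, snapshot_rdd, new_rdds, combine):
    """Union the current snapshot with all new per-directory RDDs, then reduce once.

    `sc` is assumed to be an existing SparkContext; every RDD holds (key, value) pairs.
    """
    unioned = sc.union([snapshot_rdd] + list(new_rdds))
    return unioned.reduceByKey(combine)


def merge_local(snapshot, batches, combine):
    """Plain-Python model: `snapshot` is a dict, `batches` is a list of (key, value) lists."""
    merged = dict(snapshot)
    for batch in batches:
        for key, value in batch:
            # Combine with the existing value if the key was already seen.
            merged[key] = combine(merged[key], value) if key in merged else value
    return merged


# Illustrative data only: the same key can appear in several directories.
result = merge_local({"a": 1}, [[("a", 2), ("b", 3)], [("b", 4)]], lambda x, y: x + y)
print(result)
```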

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Sorry, but I didn't get the question. It is possible that one record is present in multiple directories. That's why we do a reduceByKey after the union step.

On Wed, May 5, 2021 at 9:20 PM Mich Talebzadeh wrote:
> When you are doing union on these RDDs, (each RDD has one to one correspondence

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
When you are doing a union on these RDDs (each RDD has a one-to-one correspondence with an HDFS directory), do you have a common key across all of them?

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi Mich, I went through the thread and it doesn't relate to the problem statement I shared above. In my problem statement, there is a simple ETL job which doesn't use any external library (such as pandas).

This is the flow:

    hdfsDirs := List(); //contains N directories
    rddList := List();
    for

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
Hi, have a look at the thread called "Tasks are skewed to one executor" and see if it helps. We can take it from there. HTH

How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi, I am facing issues while reading multiple HDFS directories. Please read the problem statement and current approach below.

*Problem Statement*

There are N HDFS directories, each having K files. We want to read data from all directories such that when we read data from directory D, we map all the