RE: Reading parquet files in parallel on the cluster

2021-05-30 Thread Boris Litvak
…and launch a job per directory. What am I missing? Boris

Quoting Eric Beabes (Wednesday, 26 May 2021):
> Right... but the problem is still the same, no? Those N Jobs (aka…
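Boris's idea — iterate on the driver and launch one Spark job per directory — can be sketched with Scala Futures. This is a sketch, not the poster's actual code; it assumes an existing SparkSession named spark, an illustrative dirs list, and a stand-in per-directory action:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Illustrative directory list; in practice, enumerate the 1000+ directories.
val dirs: Seq[String] = Seq("/data/dir1", "/data/dir2")

// Each Future submits a separate Spark job from the driver thread pool; the
// jobs run concurrently on the cluster, sharing the one driver-side session.
val jobs = dirs.map { dir =>
  Future {
    spark.read.parquet(dir).count() // stand-in for real per-directory work
  }
}
val counts = Await.result(Future.sequence(jobs), Duration.Inf)
```

Because every Spark call here happens on the driver (only the tasks run on executors), this avoids the NullPointerException discussed in the thread; concurrency across jobs is governed by Spark's scheduler configuration.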

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
…Spark Executors and will scale better. But this throws the NullPointerException shown in the original email. Is there a better way to do this?

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Silvio Fiorito
Why not just read from Spark as normal? Do these files have different or incompatible schemas?

    val df = spark.read.option("mergeSchema", "true").load(listOfPaths)
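Silvio's one-liner can be expanded into a minimal self-contained sketch. The listOfPaths values are illustrative; note that DataFrameReader.load takes varargs, so a Scala Seq must be splatted with `: _*`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-parquet").getOrCreate()

// Illustrative paths; in practice this is the enumerated 1000+ directories.
val listOfPaths: Seq[String] = Seq("/data/2021/05/24", "/data/2021/05/25")

// load(paths: String*) reads all directories in one parallel job;
// mergeSchema reconciles compatible-but-different schemas across them.
val df = spark.read
  .format("parquet")
  .option("mergeSchema", "true")
  .load(listOfPaths: _*)
```

This keeps everything as a single Spark read, so the parallelism comes from Spark's own task scheduling rather than from user code.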

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
Right, you can't use Spark within Spark. Do you actually need to read Parquet like this vs spark.read.parquet? That's also parallel, of course. You'd otherwise be reading the files directly in your function with the Parquet APIs.
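Sean's last point — reading the files inside the distributed function with the Parquet APIs instead of a nested SparkSession — could look roughly like this. A sketch using the parquet-hadoop library, where paths is assumed to be a Dataset[String] of file paths and each task merely counts rows from the file footer:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

import spark.implicits._ // encoder for the Long result

// Runs on the executors: only Parquet/Hadoop APIs, no SparkSession needed.
val rowCounts = paths.map { p =>
  val reader = ParquetFileReader.open(
    HadoopInputFile.fromPath(new Path(p), new Configuration()))
  try reader.getFooter.getBlocks.asScala.map(_.getRowCount).sum
  finally reader.close()
}
```

The key property is that nothing captured by the closure references the driver-side session, which is what makes it safe to run on executors.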

Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
I have a use case in which I need to read Parquet files in parallel from 1000+ directories. I am doing something like this:

    val df = list.toList.toDF()
    df.foreach(c => {
      val config = getConfigs()
      doSomething(spark, config)
    })

In the doSomething method, when I try to…
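The NullPointerException here comes from capturing the driver-side spark session inside df.foreach, which executes on executors where no SparkSession exists. A sketch of the usual fix, reusing the names from the snippet above (list, getConfigs, and doSomething are the poster's, not standard APIs), is to loop on the driver instead:

```scala
// Driver-side loop: each iteration can still launch a distributed Spark job,
// but the `spark` reference itself is never serialized to executors.
list.foreach { c =>
  val config = getConfigs()   // helper from the original snippet
  doSomething(spark, config)  // may call spark.read.parquet(...) internally
}
```

The iterations can additionally be wrapped in Futures to run the per-directory jobs concurrently, as suggested later in the thread.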