Re: NullPointerException in SparkSession while reading Parquet files on S3

2021-05-25 Thread YEONWOO BAEK
unsubscribe On Wed, May 26, 2021 at 12:31 AM, Eric Beabes wrote: > I keep getting the following exception when I am trying to read a Parquet > file from a Path on S3 in Spark/Scala. Note: I am running this on EMR. > > java.lang.NullPointerException > at

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
Thanks for your time & advice. We will experiment & see which works best for us: EMR or ECS. On Tue, May 25, 2021 at 2:39 PM Sean Owen wrote: > No, the work is happening on the cluster; you just have (say) 100 parallel > jobs running at the same time. You apply spark.read.parquet to each dir

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
No, the work is happening on the cluster; you just have (say) 100 parallel jobs running at the same time. You apply spark.read.parquet to each dir -- from the driver, yes, but spark.read is distributed. At extremes, yes, that would challenge the driver to manage 1000s of jobs concurrently. You may

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
Right... but the problem is still the same, no? Those N jobs (aka Futures or threads) will all be running on the driver, each with its own SparkSession. Isn't that going to put a lot of burden on one machine? Is that really distributing the load across the cluster? Am I missing something? Would

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
What you could do is launch N Spark jobs in parallel from the driver. Each one would process a directory you supply with spark.read.parquet, for example. You would just have 10s or 100s of those jobs running at the same time. You have to write a bit of async code to do it, but it's pretty easy
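A minimal sketch of this async approach using Scala Futures; the directory list, thread-pool size, and per-directory work are assumptions, not code from the thread:

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parallel-reads").getOrCreate()

    // Hypothetical list of input directories (e.g. one per customer).
    val dirs: Seq[String] = Seq("s3://bucket/customer1/", "s3://bucket/customer2/")

    // Stand-in per-directory work; a compaction version is sketched further down.
    def processDirectory(spark: SparkSession, dir: String): Unit = {
      spark.read.parquet(dir).count() // any distributed action
      ()
    }

    // Bound concurrency so the driver isn't tracking 1000s of jobs at once.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

    // Each Future only schedules a job from the driver; the reads themselves
    // run distributed across the cluster.
    val jobs: Seq[Future[Unit]] = dirs.map(dir => Future(processDirectory(spark, dir)))

    Await.result(Future.sequence(jobs), Duration.Inf)

The fixed thread pool is what keeps the driver from juggling all 1000+ jobs at once, per the concern raised elsewhere in the thread.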

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
Here's the use case: We've a bunch of directories (over 1000), each containing tons of small files. Each directory is for a different customer, so they are independent in that respect. We need to merge all the small files in each directory into one (or a few) compacted file(s) by using a
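A sketch of the per-directory work for this use case, in the shape of the processDirectory placeholder above; the coalesce factor and output path convention are assumptions:

    import org.apache.spark.sql.SparkSession

    // Read all the small files under one customer directory and rewrite them
    // as a single compacted Parquet file (use repartition(n) for "a few").
    def compactDirectory(spark: SparkSession, dir: String): Unit =
      spark.read.parquet(dir)
        .coalesce(1) // one output file per directory
        .write
        .mode("overwrite")
        .parquet(dir.stripSuffix("/") + "_compacted") // hypothetical output path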

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Silvio Fiorito
Why not just read from Spark as normal? Do these files have different or incompatible schemas? val df = spark.read.option("mergeSchema", "true").load(listOfPaths) From: Eric Beabes Date: Tuesday, May 25, 2021 at 1:24 PM To: spark-user Subject: Reading parquet files in parallel on the cluster
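For reference, a self-contained sketch of this single-job approach; listOfPaths is a hypothetical Seq[String], and .parquet(paths: _*) is used instead of .load so the Parquet source and varargs overload are explicit:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Hypothetical per-customer input directories.
    val listOfPaths: Seq[String] = Seq("s3://bucket/customer1/", "s3://bucket/customer2/")

    // A single Spark job reads every directory in parallel; mergeSchema
    // reconciles Parquet schemas that differ across directories.
    val df = spark.read
      .option("mergeSchema", "true")
      .parquet(listOfPaths: _*)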

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
Right, you can't use Spark within Spark. Do you actually need to read Parquet like this vs spark.read.parquet? That's also parallel, of course. You'd otherwise be reading the files directly in your function with the Parquet APIs. On Tue, May 25, 2021 at 12:24 PM Eric Beabes wrote: > I've a use

Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
I've a use case in which I need to read Parquet files in parallel from over 1000 directories. I am doing something like this:

    val df = list.toList.toDF()
    df.foreach(c => {
      val config = getConfigs()
      doSomething(spark, config)
    })

In the doSomething method, when I try to
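For context on the exception reported in the thread below: foreach on a DataFrame runs its closure on the executors, where the captured SparkSession is not initialized. A minimal driver-side rewrite, with stubbed-in stand-ins for the post's getConfigs and doSomething helpers (their real signatures are unknown):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Stubs standing in for the original post's helpers (signatures are guesses).
    case class Config(path: String)
    def getConfigs(): Config = Config("s3://bucket/customer1/")
    def doSomething(spark: SparkSession, config: Config): Unit =
      spark.read.parquet(config.path).count()

    val list = Seq("customer1", "customer2")

    // Iterate the plain Scala collection on the driver instead of calling
    // foreach on a DataFrame; each doSomething call then runs on the driver,
    // where the SparkSession is valid, and its Spark actions are distributed.
    list.foreach { _ =>
      val config = getConfigs()
      doSomething(spark, config)
    }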

NullPointerException in SparkSession while reading Parquet files on S3

2021-05-25 Thread Eric Beabes
I keep getting the following exception when I am trying to read a Parquet file from a Path on S3 in Spark/Scala. Note: I am running this on EMR.

    java.lang.NullPointerException
        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
        at