I’m using Spark 2.3 with schema merging set to false. I don’t think Spark is
actually reading any files, but it does try to list them all one by one, and
that is very slow on S3!
Pointing to a single partition manually is not an option, as it requires me to
know the partitioning scheme in advance in order to build the path.
What version of Spark are you using?
You can search for "spark.sql.parquet.mergeSchema" on
https://spark.apache.org/docs/latest/sql-programming-guide.html
Starting from Spark 1.5, the default is already "false", which means Spark
shouldn't scan all the Parquet files to generate the schema.
You can point to the first partition folder directly and read it.
On Fri, 27 Apr 2018 at 9:42 pm, Walid LEZZAR wrote:
> Hi,
>
> I have a parquet on S3 partitioned by day. I have 2 years of data (->
> about 1000 partitions). With spark, when I just want to know the schema of
> this
Hi,
I have a Parquet dataset on S3 partitioned by day, with two years of data
(about 1000 partitions). With Spark, when I just want to know the schema of
this dataset without even asking for a single row of data, Spark tries to
list all the partitions and the nested partitions of the dataset. Which