Cc Spark user list since this information is generally useful.

On Thu, Dec 10, 2015 at 3:31 PM, Lionheart <87249...@qq.com> wrote:

> Dear Cheng,
>      I'm a user of Spark. Our current Spark version is 1.4.1.
>      In our project, I found a bottleneck when loading a huge number of
> parquet files. We tried to load more than 50000 parquet files into
> Spark. The total size of the data is about 150 GB. We found that Spark
> spent more than 30 minutes to do
>      sqlContext.read.option("mergeSchema", "false").parquet(filelist: _*)
>      During this time, the network, disk, and CPU are not busy, and based
> on the profile, all the time is spent in FileSystem.globStatus(). Then I
> found your commit for SPARK-8125, which speeds this up.
>      I then updated Spark to 1.5.1. Based on the test, the driver spent 13
> minutes doing the parquet reading. But I think there is still some room
> to improve this speed.
>       Based on the profile and on reading the code, I find that the
> DataFrameReader method parquet processes the paths serially. Do you think
> that if the parquet method were changed into a concurrent version, the
> performance would become much better, since there are many CPU cores in
> the driver node of Spark?
>

Usually there shouldn't be many distinct paths passed to
DataFrameReader.parquet(). For data files living under the same parent
directory, you can pass the path of their parent directory instead of the
paths of all the individual data files. Then this shouldn't be a huge
bottleneck.
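For example, a minimal sketch (the directory path is a made-up placeholder; `mergeSchema` is disabled as in your snippet):

```scala
// Instead of enumerating every part file, which makes the driver glob
// each of the 50000+ paths one by one:
//   sqlContext.read.option("mergeSchema", "false").parquet(filelist: _*)
//
// point Spark at the common parent directory and let it discover the
// files with a single directory listing:
val df = sqlContext.read
  .option("mergeSchema", "false")
  .parquet("hdfs:///path/to/parent/dir") // hypothetical parent directory
```

This only helps when the files really do share a common ancestor directory; if they are scattered across unrelated locations, the per-path globbing cost remains.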


>       By the way, when will the issue SPARK-8824 be solved? In my opinion,
> losing some precision with a warning message is better than throwing an
> exception and saying it is not supported.
>

This is a good question. For all four of those data types:

   - DATE: It's actually already supported; I just resolved that JIRA
   ticket.
   - INTERVAL: We can start working on this now that we've finally got
   CalendarIntervalType.
   - TIMESTAMP_MILLIS: We can start working on supporting this on the read
   path and convert extracted millisecond timestamps to microsecond ones. For
   the write path, maybe we can have an option to indicate whether
   TIMESTAMP_MILLIS or INT96 should be used to store timestamp values. If the
   former is chosen, the microsecond part of the timestamp will be truncated.
   - TIMESTAMP_MICROS: Unfortunately this one depends on parquet-format and
   parquet-mr, which haven't added TIMESTAMP_MICROS as an OriginalType yet.
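The read- and write-path conversions for TIMESTAMP_MILLIS above amount to plain arithmetic. A sketch in standalone Scala (the helper names are hypothetical; it assumes timestamps are held internally as microseconds since the epoch):

```scala
// Read path: widen a TIMESTAMP_MILLIS value to microsecond precision.
def millisToMicros(millis: Long): Long = millis * 1000L

// Write path: if TIMESTAMP_MILLIS is chosen as the output type, the
// sub-millisecond part is truncated. floorDiv (rather than /) keeps
// truncation consistent for pre-epoch (negative) timestamps.
def microsToMillis(micros: Long): Long = Math.floorDiv(micros, 1000L)

// Illustrative value: some instant with microsecond precision.
val micros = 1449705600123456L
val millis = microsToMillis(micros) // the trailing 456 microseconds are lost
assert(millis == 1449705600123L)
assert(millisToMicros(millis) == 1449705600123000L)
```

The round trip shows why a warning (or an explicit option) is worth having: writing as TIMESTAMP_MILLIS and reading back does not recover the original microsecond value.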



>
> Sincerely,
> Zhizhou Li
>
