The code is as simple as calling `data = spark.read.parquet(<s3 address>)`. I can't share the actual address I'm reading from for security reasons. Is there anything else I can provide? We're using standard EMR images with Hive and Spark installed.
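
Here's the full shape of the call in case it helps; the bucket and prefix below are placeholders, not our real address:

    from pyspark.sql import SparkSession

    # Standard EMR setup: Hive-enabled session, reading straight from S3.
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Placeholder path -- the real S3 address is omitted.
    data = spark.read.parquet("s3://example-bucket/path/to/table/")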

On 9/8/17 11:00 AM, Neil Jonkers wrote:
Can you provide a code sample please?

On Fri, Sep 8, 2017 at 5:44 PM, Matthew Anthony <statm...@gmail.com> wrote:

    Hi all -


    since upgrading to 2.2.0, we've noticed a significant slowdown in
    read.parquet(...) operations. The parquet files are being read from
    S3. When the command is entered at the interactive terminal (pyspark
    in this case), the terminal sits "idle" for several minutes (as many
    as 10) before returning:


    "17/09/08 15:34:37 WARN SharedInMemoryCache: Evicting cached table
    partition metadata from memory due to size constraints
    (spark.sql.hive.filesourcePartitionFileCacheSize = 2000000000
    bytes). This may impact query planning performance."
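
    (A guess on my end, not a confirmed fix: if the eviction above is the
    bottleneck, the cache limit named in the warning can be raised when
    the session is built, since it is a static config that must be set
    before the session starts. The 4 GB value below is illustrative:)

        from pyspark.sql import SparkSession

        # Illustrative value only: lift the partition-file metadata cache
        # above the 2000000000-byte limit reported in the warning.
        spark = (SparkSession.builder
                 .config("spark.sql.hive.filesourcePartitionFileCacheSize",
                         str(4 * 1000 * 1000 * 1000))
                 .enableHiveSupport()
                 .getOrCreate())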


    In the Spark UI, there are no jobs running during this idle
    period. Subsequently, a short one-task job lasting approximately 10
    seconds runs, followed by another idle stretch of roughly 2-3
    minutes before control returns to the terminal/CLI.


    Can someone explain what is happening here in the background? Is
    there a misconfiguration we should be looking for? We are using
    Hive metastore on the EMR cluster.
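
    (One diagnostic we could try, if it helps narrow things down: point
    read.parquet at a single partition directory instead of the table
    root, so Spark only lists that one prefix. The path below is a
    placeholder:)

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.enableHiveSupport().getOrCreate()

        # Hypothetical single-partition read -- if this returns quickly,
        # the idle time is likely spent listing and caching partition
        # metadata rather than reading the data itself.
        subset = spark.read.parquet(
            "s3://example-bucket/path/to/table/date=2017-09-08")
        subset.count()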




