Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

2017-09-11 Thread Gourav Sengupta
Hi Matthew,

I have read close to 3 TB of data in Parquet format without any issues in
EMR. A few questions:
1. What is the EMR version that you are using?
2. How many partitions do you have?
3. How many fields do you have in the table?  Are you reading all of them?
4. Is there a way that you can just use sparkSession.read.load("s3://")?
5. Hopefully you are using SparkSession? (A minimal sketch of points 4 and 5 follows below.)
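
For points 4 and 5, a minimal sketch (the S3 path, app name, and column names below are placeholders, not values from your job):

    from pyspark.sql import SparkSession

    # Build (or reuse) a SparkSession with Hive support, as on EMR.
    spark = (SparkSession.builder
             .appName("parquet-read-check")
             .enableHiveSupport()
             .getOrCreate())

    # read.load() defaults to parquet (spark.sql.sources.default), so this
    # is equivalent to spark.read.parquet(...).
    df = spark.read.load("s3://your-bucket/path/to/table/")

    # If you only need a few of the table's fields, select them explicitly
    # so the remaining parquet columns can be pruned.
    df.select("col_a", "col_b").show(5)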


Regards,
Gourav

On Mon, Sep 11, 2017 at 11:19 PM, Matthew Anthony wrote:

> any other feedback on this?
>
> On 9/8/17 11:00 AM, Neil Jonkers wrote:
>
> Can you provide a code sample please?
>
> On Fri, Sep 8, 2017 at 5:44 PM, Matthew Anthony wrote:
>
>> Hi all -
>>
>>
>> since upgrading to 2.2.0, we've noticed a significant increase in
>> read.parquet(...) ops. The parquet files are being read from S3. Upon entry
>> at the interactive terminal (pyspark in this case), the terminal will sit
>> "idle" for several minutes (as many as 10) before returning:
>>
>>
>> "17/09/08 15:34:37 WARN SharedInMemoryCache: Evicting cached table
>> partition metadata from memory due to size constraints
>> (spark.sql.hive.filesourcePartitionFileCacheSize = 20 bytes).
>> This may impact query planning performance."
>>
>>
>> In the spark UI, there are no jobs being run during this idle period.
>> Subsequently, a short 1-task job lasting approximately 10 seconds runs, and
>> then another idle time of roughly 2-3 minutes follows thereafter before
>> returning to the terminal/CLI.
>>
>>
>> Can someone explain what is happening here in the background? Is there a
>> misconfiguration we should be looking for? We are using Hive metastore on
>> the EMR cluster.
>>
>>
>>
>>
>
>


Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

2017-09-11 Thread Matthew Anthony

any other feedback on this?


On 9/8/17 11:00 AM, Neil Jonkers wrote:

Can you provide a code sample please?





Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

2017-09-08 Thread Matthew Anthony
The code is as simple as calling `data = spark.read.parquet(address)`. I can't give you the actual address I'm reading from for security reasons. Is there something else I can provide? We're using standard EMR images with Hive and Spark installed.
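
Roughly, the full pattern in the pyspark shell is just (the bucket path below is a placeholder, not the real one):

    # `spark` is the SparkSession that the pyspark shell creates on EMR,
    # with Hive support enabled. The bucket path is a placeholder.
    data = spark.read.parquet("s3://some-bucket/some/prefix/")
    # The multi-minute pause described earlier happens inside this call,
    # before control returns to the prompt.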



On 9/8/17 11:00 AM, Neil Jonkers wrote:

Can you provide a code sample please?




Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

2017-09-08 Thread Neil Jonkers
Can you provide a code sample please?

On Fri, Sep 8, 2017 at 5:44 PM, Matthew Anthony wrote:

> Hi all -
>
>
> since upgrading to 2.2.0, we've noticed a significant increase in
> read.parquet(...) ops. The parquet files are being read from S3. Upon entry
> at the interactive terminal (pyspark in this case), the terminal will sit
> "idle" for several minutes (as many as 10) before returning:
>
>
> "17/09/08 15:34:37 WARN SharedInMemoryCache: Evicting cached table
> partition metadata from memory due to size constraints
> (spark.sql.hive.filesourcePartitionFileCacheSize = 20 bytes).
> This may impact query planning performance."
>
>
> In the spark UI, there are no jobs being run during this idle period.
> Subsequently, a short 1-task job lasting approximately 10 seconds runs, and
> then another idle time of roughly 2-3 minutes follows thereafter before
> returning to the terminal/CLI.
>
>
> Can someone explain what is happening here in the background? Is there a
> misconfiguration we should be looking for? We are using Hive metastore on
> the EMR cluster.
>
>
>
>