Hi Matthew,
I have read close to 3 TB of data in Parquet format without any issues in
EMR. A few questions:
1. What is the EMR version that you are using?
2. How many partitions do you have?
3. How many fields do you have in the table? Are you reading all of them?
4. Is there a way that you can share any other feedback on this?
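
For reference, a minimal pyspark sketch of how those numbers could be collected; the S3 path and column names below are placeholders, not anything from Matthew's environment:

    from pyspark.sql import SparkSession

    # `spark` already exists in the pyspark shell; building it here just
    # makes the snippet self-contained.
    spark = SparkSession.builder.getOrCreate()

    # Placeholder path -- not the actual dataset.
    df = spark.read.parquet("s3://example-bucket/path/to/table/")

    print(spark.version)              # Spark version on the cluster (the EMR
                                      # release label itself is in the EMR console)
    print(df.rdd.getNumPartitions())  # partitions of the loaded DataFrame
    print(len(df.columns))            # how many fields the table exposes

    # If only a few fields are needed, selecting them explicitly avoids
    # touching the rest of the columnar Parquet data:
    subset = df.select("col_a", "col_b")  # placeholder column names
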
On 9/8/17 11:00 AM, Neil Jonkers wrote:
Can you provide a code sample please?
On Fri, Sep 8, 2017 at 5:44 PM, Matthew Anthony wrote:
Hi all -
since upgrading to 2.2.0, we've noticed a significant increase in
read.parquet(...) ops.

The code is as simple as calling `data = spark.read.parquet(address)`. I can't
give you the actual address I'm reading from for security reasons. Is there
something else I can provide? We're using standard EMR images with Hive and
Spark installed.
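
For concreteness, here is roughly what that looks like from the pyspark shell, with a placeholder S3 path and a timer around the call to show where the multi-minute pause lands; nothing here is the actual code or path:

    import time

    # `spark` is the SparkSession the pyspark shell provides.
    t0 = time.time()
    data = spark.read.parquet("s3://placeholder-bucket/some/prefix/")  # placeholder path
    print("read.parquet returned after %.1f s" % (time.time() - t0))
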
On 9/8/17 11:00 AM, Neil Jonkers wrote:
Can you provide a code sample please?
On Fri, Sep 8, 2017 at 5:44 PM, Matthew Anthony wrote:
Hi all -
since upgrading to 2.2.0, we've noticed a significant increase in
read.parquet(...) ops. The parquet files are being read from S3. Upon
entry at the interactive terminal (pyspark in this case), the terminal
will sit "idle" for several minutes (as many as 10) before returning: