Re: Corrupt parquet file

2018-02-12 Thread Steve Loughran
What failure mode is likely here? As the uploads are signed, the network payload is not corruptible from the moment it's written into the HTTPS request, which places it earlier * RAM corruption which ECC doesn't pick up. It'd be interesting to know what stats & health checks AWS run here, such a

Re: Corrupt parquet file

2018-02-12 Thread Dong Jiang
I got no error messages from EMR. We write directly from the dataframe to S3. There doesn’t appear to be an issue with the S3 file: we can still download the parquet file and read most of the columns, just one column is corrupted in parquet. I suspect we need to write to HDFS first, make sure we can read bac

Re: Corrupt parquet file

2018-02-12 Thread Steve Loughran
On 12 Feb 2018, at 19:35, Dong Jiang <dji...@dataxu.com> wrote: I got no error messages from EMR. We write directly from the dataframe to S3. There doesn’t appear to be an issue with the S3 file: we can still download the parquet file and read most of the columns, just one column is corrupted in p

Re: Drop the Hadoop 2.6 profile?

2018-02-12 Thread Steve Loughran
I'd advocate 2.7 over 2.6, primarily due to Kerberos and JVM versions. 2.6 is not even qualified for Java 7, let alone Java 8: you've got no guarantees that things work on the min Java version Spark requires. Kerberos is always the failure point here, as well as various libraries (jetty) which

Re: Corrupt parquet file

2018-02-12 Thread Ryan Blue
I wouldn't say we have a primary failure mode that we deal with. What we concluded was that all the schemes we came up with to avoid corruption couldn't cover all cases. For example, what about when memory holding a value is corrupted just before it is handed off to the writer? That's why we track

Regarding NimbusDS JOSE JWT jar 3.9 security vulnerability

2018-02-12 Thread sujith71955
Hi folks, I observed that Spark 2.2.x uses version 3.9 of the NimbusDS JOSE JWT jar, but a few vulnerabilities have been reported for this particular version. Please refer to the details below: https://nvd.nist.gov/vuln/detail/CVE-2017-12973, https://www.cvedetails.com/cve/CVE-2017-12972/

[VOTE] Spark 2.3.0 (RC3)

2018-02-12 Thread Sameer Agarwal
Now that all known blockers have once again been resolved, please vote on releasing the following candidate as Apache Spark version 2.3.0. The vote is open until Friday February 16, 2018 at 8:00:00 am UTC and passes if a majority of at least 3 PMC +1 votes are cast. [ ] +1 Release this package as

Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-12 Thread Sameer Agarwal
I'll start the vote with a +1. As of today, all known release blockers and QA tasks have been resolved, and the Jenkins builds are healthy: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ On 12 February 2018 at 22:30, Sameer Agarwal wrote: > Now that all known block