Re: Which parts of a parquet read happen on the driver vs the executor?

2019-04-11 Thread Sean Owen
Spark is a distributed compute framework of course, so things you do with Spark operations like map, filter, groupBy, etc. do not happen on the driver. The function is serialized and sent to the executors. The error here just indicates you are making some function that references things that can't be seriali
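The serialization point in the reply above can be illustrated without Spark at all. The sketch below is only an assumption-laden stand-in: `SerFunction`, `Helper`, and `canSerialize` are hypothetical names invented for this example, and plain Java serialization is used in place of Spark's own closure handling. It shows that a function which captures a non-serializable object cannot be serialized, while one that captures only a plain value can.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

class ClosureSerializationSketch {
    // Hypothetical stand-in for a task function: a Function that is also Serializable.
    interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    // Hypothetical helper that is NOT Serializable (think of a connection or reader).
    static class Helper {
        int offset = 1;
    }

    // Returns true if plain Java serialization of the object succeeds.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // The lambda captures the Helper instance, so serializing it must serialize Helper too.
    static SerFunction<Integer, Integer> capturingHelper() {
        Helper h = new Helper();
        return x -> x + h.offset;
    }

    // The lambda captures only an int copied out of the Helper, which serializes fine.
    static SerFunction<Integer, Integer> capturingValue() {
        int offset = new Helper().offset;
        return x -> x + offset;
    }

    public static void main(String[] args) {
        System.out.println("captures Helper: " + canSerialize(capturingHelper()));
        System.out.println("captures int:    " + canSerialize(capturingValue()));
    }
}
```

Copying the needed value into a local variable before building the function is one common way to avoid dragging a non-serializable object into the closure; whether that applies to the code in this thread is not something the snippet establishes.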

Which parts of a parquet read happen on the driver vs the executor?

2019-04-11 Thread Long, Andrew
Hey Friends, I’m working on a POC that involves reading and writing parquet files mid-DAG. Writes are working but I’m struggling with getting reads working due to serialization issues. I’ve got code that works in master=local but not in yarn. So here are my questions. 1. Is there an easy

Re: Dataset schema incompatibility bug when reading column partitioned data

2019-04-11 Thread Ryan Blue
I think the confusion is that the schema passed to spark.read is not a projection schema. I don’t think it is even used in this case because the Parquet dataset has its own schema. You’re getting the schema of the table. I think the correct behavior is to reject a user-specified schema in this case

Re: Dataset schema incompatibility bug when reading column partitioned data

2019-04-11 Thread Bruce Robbins
I see a Jira: https://issues.apache.org/jira/browse/SPARK-21021 On Thu, Apr 11, 2019 at 9:08 AM Dávid Szakállas wrote: > +dev for more visibility. Is this a known issue? Is there a plan for a fix? > > Thanks, > David > > Begin forwarded message: > > *From: *Dávid Szakállas > *Subject: **Datase

Re: [DISCUSS] Spark Columnar Processing

2019-04-11 Thread Reynold Xin
I just realized we had an earlier SPIP on a similar topic: https://issues.apache.org/jira/browse/SPARK-24579 Perhaps we should tie the two together. IIUC, you'd want to expose the existing ColumnBatch API, but also provide utilities to directly convert from/to Arrow. On Thu, Apr 11, 2019 at 7:1

Fwd: Dataset schema incompatibility bug when reading column partitioned data

2019-04-11 Thread Dávid Szakállas
+dev for more visibility. Is this a known issue? Is there a plan for a fix? Thanks, David > Begin forwarded message: > > From: Dávid Szakállas > Subject: Dataset schema incompatibility bug when reading column partitioned > data > Date: 2019. March 29. 14:15:27 CET > To: u...@spark.apache.org >

Re: [DISCUSS] Spark Columnar Processing

2019-04-11 Thread Bobby Evans
The SPIP has been up for almost 6 days now with really no discussion on it. I am hopeful that means it's okay and we are good to call a vote on it, but I want to give everyone one last chance to take a look and comment. If there are no comments by tomorrow I think we will start a vote for this. T

Raise Jenkins test timeout? with alternatives

2019-04-11 Thread Sean Owen
I have a big PR that keeps failing because it hits the 300-minute build timeout: https://github.com/apache/spark/pull/24314 https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4703/console It's because it touches so much code that all tests run including things like Kinesis. It l