Spark is a distributed compute framework, of course, so the work you do
with Spark operations like map, filter, groupBy, etc. does not happen on
the driver. The function you pass is serialized and sent to the executors.
The error here just indicates that your function references
things that can't be seriali
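
To illustrate the point about serialization, here is a minimal sketch in
plain Python rather than Spark. It uses the standard pickle module as a
stand-in for the serializer Spark applies to a function before sending it
to the executors. The names is_serializable, keep_allowed and lock are made
up for this example. A function that references only plain data serializes
fine, while one that references a lock does not.

```python
import functools
import operator
import pickle
import threading


def is_serializable(fn):
    """Return True if pickle can serialize fn together with what it references."""
    try:
        pickle.dumps(fn)
        return True
    except Exception:
        return False


# This predicate references only plain data (a frozenset), so it can be shipped.
allowed = frozenset({"a", "b"})
keep_allowed = functools.partial(operator.contains, allowed)

# A lock stands in for any driver-side resource (connection, file handle, ...).
# Its bound method drags the lock along, and a lock cannot be pickled.
lock = threading.Lock()

if __name__ == "__main__":
    print("keep_allowed serializable:", is_serializable(keep_allowed))
    print("lock.acquire serializable:", is_serializable(lock.acquire))
```

In Spark, a common way out is to create the non-serializable object inside
the function that runs on the executors (for example once per partition)
instead of referencing one that was created on the driver.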
Hey Friends,
I’m working on a POC that involves reading and writing Parquet files mid-DAG.
Writes are working, but I’m struggling to get reads working due to
serialization issues. I’ve got code that works with master=local but not on YARN.
So here are my questions.
1. Is there an easy
I think the confusion is that the schema passed to spark.read is not a
projection schema. I don’t think it is even used in this case because the
Parquet dataset has its own schema. You’re getting the schema of the table.
I think the correct behavior is to reject a user-specified schema in this
case
I see a Jira:
https://issues.apache.org/jira/browse/SPARK-21021
On Thu, Apr 11, 2019 at 9:08 AM Dávid Szakállas
wrote:
> +dev for more visibility. Is this a known issue? Is there a plan for a fix?
>
> Thanks,
> David
>
> Begin forwarded message:
>
> From: Dávid Szakállas
> Subject: Datase
I just realized we had an earlier SPIP on a similar topic:
https://issues.apache.org/jira/browse/SPARK-24579
Perhaps we should tie the two together. IIUC, you'd want to expose the existing
ColumnBatch API, but also provide utilities to directly convert from/to Arrow.
On Thu, Apr 11, 2019 at 7:1
+dev for more visibility. Is this a known issue? Is there a plan for a fix?
Thanks,
David
> Begin forwarded message:
>
> From: Dávid Szakállas
> Subject: Dataset schema incompatibility bug when reading column partitioned
> data
> Date: 2019. March 29. 14:15:27 CET
> To: u...@spark.apache.org
>
The SPIP has been up for almost 6 days now with really no discussion on
it. I am hopeful that means it's okay and we are good to call a vote on
it, but I want to give everyone one last chance to take a look and
comment. If there are no comments by tomorrow, I think we will start a vote
for this.
T
I have a big PR that keeps failing because it hits the 300-minute build timeout:
https://github.com/apache/spark/pull/24314
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4703/console
It's because it touches so much code that all tests run, including
things like Kinesis. It l