java.nio.BufferUnderflowException
Can you try reading the same data in Scala?
From: Liana Napalkova
Sent: Wednesday, January 10, 2018 12:04:00 PM
To: Timur Shenkao
Cc: user@spark.apache.org
Subject: Re: py4j.protocol.Py4JJavaError:
The DataFrame is not empty.
Indeed, it has nothing to do with serialization. I think that the issue is
related to this bug: https://issues.apache.org/jira/browse/SPARK-22769
In my question I have not posted the whole error stack trace, but one of the
error messages says `Could not find
Hi, All.
Vectorized ORC Reader is now supported in Apache Spark 2.3.
https://issues.apache.org/jira/browse/SPARK-16060
It has been a long journey. From now on, Spark can read ORC files faster
without sacrificing any features.
Thank you for all your support, especially Wenchen Fan.
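If I read SPARK-16060 right, the new native reader sits behind a couple of
SQL configs (worth double-checking against the 2.3 configuration docs for
the exact defaults in your build):

```
spark.sql.orc.impl=native
spark.sql.orc.enableVectorizedReader=true
```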
It's done by two
Caused by: org.apache.spark.SparkException: Task not serializable
That's the answer :)
What are you trying to save? Is it empty or None / null?
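For what it's worth, `Task not serializable` usually means the closure you
pass to a Spark operation captures an object that cannot be serialized
(Spark ships closures to executors with cloudpickle on the Python side).
A minimal sketch of the same failure without Spark, using plain `pickle`
and a hypothetical stand-in class:

```python
import pickle

class UnpicklableClient:
    """Stands in for a non-serializable resource, e.g. a live socket or DB connection."""
    def __getstate__(self):
        # Refuse to be pickled, like many connection/handle objects do.
        raise TypeError("cannot pickle a live connection")

client = UnpicklableClient()

def transform(row):
    # This closure captures `client`, so shipping it to an executor
    # would require serializing `client` along with it.
    return client, row

try:
    # Roughly what Spark does when it serializes the captured state.
    pickle.dumps(client)
except TypeError as e:
    print("serialization failed:", e)
```

The usual fix is to create the resource inside the function (per partition)
instead of capturing it from the driver's scope.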
On Wed, Jan 10, 2018 at 4:58 PM, Liana Napalkova <
liana.napalk...@eurecat.org> wrote:
> Hello,
>
>
> Has anybody faced the following problem in
Hi,
I have a job that runs a HiveQL query joining two tables (2.5 TB and 45 GB),
repartitions the result to 100 partitions, and then applies some other
transformations. This executed fine earlier.
Job stages:
Stage 0: Hive table 1 scan
Stage 1: Hive table 2 scan
Stage 2: Tungsten exchange for the join
Stage 3: Tungsten exchange for
Hello,
Has anybody faced the following problem in PySpark? (Python 2.7.12):
df.show() # works fine and shows the first 5 rows of DataFrame
df.write.parquet(outputPath + '/data.parquet', mode="overwrite") # throws
the error
The last line throws the following error:
I just followed Hien Luu's approach
val empExplode = empInfoStrDF.select(
  explode(from_json('emp_info_str, empInfoSchema)).as("emp_info_withexplode"))
empExplode.show(false)
+---+
|emp_info_withexplode |
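For readers following along without a cluster: `from_json` parses the string
column against the given schema, and `explode` turns the resulting array into
one output row per element. Roughly the same thing in plain Python, with the
stdlib `json` module and hypothetical sample data:

```python
import json

# Hypothetical sample: one input row whose emp_info_str column holds a JSON array.
rows = [{"emp_info_str": '[{"name": "foo", "sal": 100}, {"name": "bar", "sal": 200}]'}]

exploded = []
for row in rows:
    # from_json: parse the string column into structured values
    emp_info = json.loads(row["emp_info_str"])
    # explode: emit one output row per array element
    for emp in emp_info:
        exploded.append({"emp_info_withexplode": emp})

for out in exploded:
    print(out)
```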
I wrote Datasets, and I'll say I only use them when I really need to (i.e.
when it would be very cumbersome to express what I am trying to do
relationally). Dataset operations are almost always going to be slower
than their DataFrame equivalents since they usually require materializing
objects