Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

2018-01-10 Thread Felix Cheung
java.nio.BufferUnderflowException. Can you try reading the same data in Scala?

Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

2018-01-10 Thread Liana Napalkova
The DataFrame is not empty. Indeed, it has nothing to do with serialization. I think that the issue is related to this bug: https://issues.apache.org/jira/browse/SPARK-22769. In my question I have not posted the whole error stack trace, but one of the error messages says `Could not find

Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-10 Thread Dongjoon Hyun
Hi, All. The Vectorized ORC Reader is now supported in Apache Spark 2.3. https://issues.apache.org/jira/browse/SPARK-16060 It has been a long journey. From now on, Spark can read ORC files faster without any feature penalty. Thank you for all your support, especially Wenchen Fan. It's done by two

Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

2018-01-10 Thread Timur Shenkao
Caused by: org.apache.spark.SparkException: Task not serializable That's the answer :) What are you trying to save? Is it empty or None / null? On Wed, Jan 10, 2018 at 4:58 PM, Liana Napalkova <liana.napalk...@eurecat.org> wrote:

No Tasks have reported metrics yet

2018-01-10 Thread Joel D
Hi, I have a job that runs a HiveQL query joining 2 tables (2.5 TB and 45 GB), repartitions to 100, and then does some other transformations. This executed fine earlier. Job stages: Stage 0: Hive table 1 scan; Stage 1: Hive table 2 scan; Stage 2: Tungsten exchange for the join; Stage 3: Tungsten exchange for

py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

2018-01-10 Thread Liana Napalkova
Hello, Has anybody faced the following problem in PySpark? (Python 2.7.12)

df.show()  # works fine and shows the first 5 rows of the DataFrame
df.write.parquet(outputPath + '/data.parquet', mode="overwrite")  # throws the error

The last line throws the following error:

Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2018-01-10 Thread Selvam Raman
I just followed Hien Luu's approach:

val empExplode = empInfoStrDF.select(explode(from_json('emp_info_str, empInfoSchema)).as("emp_info_withexplode"))
empExplode.show(false)

+---+ |emp_info_withexplode |

Re: Dataset API inconsistencies

2018-01-10 Thread Michael Armbrust
I wrote Datasets, and I'll say I only use them when I really need to (i.e., when it would be very cumbersome to express what I am trying to do relationally). Dataset operations are almost always going to be slower than their DataFrame equivalents, since they usually require materializing objects