Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

2018-01-10 Thread Felix Cheung
java.nio.BufferUnderflowException

Can you try reading the same data in Scala?
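For reference, a minimal spark-shell sketch of that check; the source path, format, and output path are placeholders, not details from the thread:

// Read the same data in Scala and repeat the failing write, to rule out a
// PySpark-specific cause. Adjust the reader to the actual source format.
val df = spark.read.parquet("/path/to/source")
df.show(5)
df.write.mode("overwrite").parquet("/path/to/output/data.parquet")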



From: Liana Napalkova 
Sent: Wednesday, January 10, 2018 12:04:00 PM
To: Timur Shenkao
Cc: user@spark.apache.org
Subject: Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

The DataFrame is not empty.
Indeed, it has nothing to do with serialization. I think the issue is related to 
this bug: https://issues.apache.org/jira/browse/SPARK-22769
In my question I did not post the whole stack trace, but one of the error 
messages says `Could not find CoarseGrainedScheduler`, so it is probably 
something related to resources.


From: Timur Shenkao 
Sent: 10 January 2018 20:07:37
To: Liana Napalkova
Cc: user@spark.apache.org
Subject: Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet


Caused by: org.apache.spark.SparkException: Task not serializable


That's the answer :)

What are you trying to save? Is it empty or None / null?

On Wed, Jan 10, 2018 at 4:58 PM, Liana Napalkova wrote:

Hello,

Has anybody faced the following problem in PySpark (Python 2.7.12)?

df.show() # works fine and shows the first 5 rows of DataFrame

df.write.parquet(outputPath + '/data.parquet', mode="overwrite")  # throws the error

The last line throws the following error:


py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet.
: org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)

Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
  at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)

Caused by: org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:794)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

Caused by: java.lang.IllegalArgumentException
  at java.nio.Buffer.position(Buffer.java:244)
  at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153)
  at java.nio.ByteBuffer.get(ByteBuffer.java:715)

Caused by: java.nio.BufferUnderflowException
  at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
  at java.nio.ByteBuffer.get(ByteBuffer.java:715)
  at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405)
  at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414)
  at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484)
  at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)

Thanks.

L.



Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

2018-01-10 Thread Liana Napalkova
The DataFrame is not empty.
Indeed, it has nothing to do with serialization. I think the issue is related to 
this bug: https://issues.apache.org/jira/browse/SPARK-22769
In my question I did not post the whole stack trace, but one of the error 
messages says `Could not find CoarseGrainedScheduler`, so it is probably 
something related to resources.
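If resources are indeed the culprit, one session-level experiment (a sketch only, not a confirmed fix for SPARK-22769; it assumes an existing SparkSession named `spark`):

// The failing plan goes through BroadcastExchangeExec, so one test is to turn
// off automatic broadcast joins for the current session and retry the write.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Driver and executor memory cannot be changed from inside a running job; they
// are set at submission time (e.g. spark-submit --driver-memory / --executor-memory).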


From: Timur Shenkao 
Sent: 10 January 2018 20:07:37
To: Liana Napalkova
Cc: user@spark.apache.org
Subject: Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet


Caused by: org.apache.spark.SparkException: Task not serializable


That's the answer :)

What are you trying to save? Is it empty or None / null?

On Wed, Jan 10, 2018 at 4:58 PM, Liana Napalkova wrote:

Hello,

Has anybody faced the following problem in PySpark (Python 2.7.12)?

df.show() # works fine and shows the first 5 rows of DataFrame

df.write.parquet(outputPath + '/data.parquet', mode="overwrite")  # throws the error

The last line throws the following error:


py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet.
: org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)

Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
  at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)

Caused by: org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:794)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

Caused by: java.lang.IllegalArgumentException
  at java.nio.Buffer.position(Buffer.java:244)
  at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153)
  at java.nio.ByteBuffer.get(ByteBuffer.java:715)

Caused by: java.nio.BufferUnderflowException
  at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
  at java.nio.ByteBuffer.get(ByteBuffer.java:715)
  at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405)
  at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414)
  at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484)
  at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)

Thanks.

L.



Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-10 Thread Dongjoon Hyun
Hi, All.

Vectorized ORC Reader is now supported in Apache Spark 2.3.

https://issues.apache.org/jira/browse/SPARK-16060

It has been a long journey. From now on, Spark can read ORC files faster
without a feature penalty.

Thank you for all your support, especially Wenchen Fan.

It's done by two commits.

[SPARK-16060][SQL] Support Vectorized ORC Reader
https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766

[SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc
reader
https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1

Please check OrcReadBenchmark for the final speed-up from `Hive built-in
ORC` to `Native ORC Vectorized`.

https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
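As an illustration, a minimal sketch of reading ORC through the native reader in a Spark 2.3 session (the path is a placeholder):

// Select the Apache ORC 1.4 based "native" implementation and make sure the
// vectorized reader is enabled (it is the default), then read an ORC file.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
val df = spark.read.orc("/path/to/data.orc")
df.show(5)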

Thank you.

Bests,
Dongjoon.


Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

2018-01-10 Thread Timur Shenkao
Caused by: org.apache.spark.SparkException: Task not serializable

That's the answer :)

What are you trying to save? Is it empty or None / null?
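A quick Scala sketch for that check, assuming `df` is the DataFrame from the question:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Print the row count and the number of nulls per column before writing.
def summarize(df: DataFrame): Unit = {
  println(s"rows: ${df.count()}")
  df.columns.foreach { c =>
    println(s"$c: ${df.filter(col(c).isNull).count()} nulls")
  }
}

summarize(df)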


On Wed, Jan 10, 2018 at 4:58 PM, Liana Napalkova <liana.napalk...@eurecat.org> wrote:

> Hello,
>
>
> Has anybody faced the following problem in PySpark (Python 2.7.12)?
>
> df.show() # works fine and shows the first 5 rows of DataFrame
>
> df.write.parquet(outputPath + '/data.parquet', mode="overwrite")  # throws the error
>
> The last line throws the following error:
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet.
> : org.apache.spark.SparkException: Job aborted.
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
>
> Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>   at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
>   at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>
> Caused by: org.apache.spark.SparkException: Task not serializable
>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:794)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>
> Caused by: java.lang.IllegalArgumentException
>   at java.nio.Buffer.position(Buffer.java:244)
>   at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153)
>   at java.nio.ByteBuffer.get(ByteBuffer.java:715)
>
> Caused by: java.nio.BufferUnderflowException
>   at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
>   at java.nio.ByteBuffer.get(ByteBuffer.java:715)
>   at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405)
>   at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414)
>   at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484)
>   at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>
> Thanks.
>
> L.
>


No Tasks have reported metrics yet

2018-01-10 Thread Joel D
Hi,

I have a job that runs a HiveQL query joining two tables (2.5 TB and 45 GB),
repartitions to 100 partitions, and then does some other transformations. This
executed fine earlier.

Job stages:
Stage 0: Hive table 1 scan
Stage 1: Hive table 2 scan
Stage 2: Tungsten exchange for the join
Stage 3: Tungsten exchange for the repartition

Today the job is stuck in Stage 2. Of the 200 tasks that are supposed to
execute, none have started, but 290 have failed due to preempted executors.

Any inputs on how to resolve this issue? I'll try reducing the executor
memory to see if resource allocation is the issue.
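For what it's worth, a small sketch for confirming what the running job actually requested before changing anything (the config keys are standard; the values printed are whatever the job was submitted with):

val conf = spark.sparkContext.getConf
println(conf.getOption("spark.executor.memory").getOrElse("<default>"))
println(conf.getOption("spark.executor.instances").getOrElse("<default / dynamic allocation>"))
println(conf.getOption("spark.sql.shuffle.partitions").getOrElse("200 (default)"))
// Executor memory itself has to be changed at submission time
// (e.g. spark-submit --executor-memory ...), not from inside the running job.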

Thanks.


py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

2018-01-10 Thread Liana Napalkova
Hello,

Has anybody faced the following problem in PySpark (Python 2.7.12)?

df.show() # works fine and shows the first 5 rows of DataFrame

df.write.parquet(outputPath + '/data.parquet', mode="overwrite")  # throws the error

The last line throws the following error:


py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet.
: org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)

Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
  at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)

Caused by: org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:794)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

Caused by: java.lang.IllegalArgumentException
  at java.nio.Buffer.position(Buffer.java:244)
  at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153)
  at java.nio.ByteBuffer.get(ByteBuffer.java:715)

Caused by: java.nio.BufferUnderflowException
  at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
  at java.nio.ByteBuffer.get(ByteBuffer.java:715)
  at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405)
  at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414)
  at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484)
  at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)

Thanks.

L.







Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2018-01-10 Thread Selvam Raman
I just followed Hien Luu's approach:

val empExplode = empInfoStrDF.select(explode(from_json('emp_info_str,
  empInfoSchema)).as("emp_info_withexplode"))


empExplode.show(false)

+-------------------------------------------+
|emp_info_withexplode                       |
+-------------------------------------------+
|[foo,[CA,USA],WrappedArray([english,2016])]|
|[bar,[OH,USA],WrappedArray([math,2017])]   |
+-------------------------------------------+

empExplode.select($"emp_info_withexplode.name").show(false)


+----+
|name|
+----+
|foo |
|bar |
+----+

empExplode.select($"emp_info_withexplode.address.state").show(false)

+-----+
|state|
+-----+
|CA   |
|OH   |
+-----+

empExplode.select($"emp_info_withexplode.docs.subject").show(false)

+---------+
|subject  |
+---------+
|[english]|
|[math]   |
+---------+
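A small follow-on sketch (same `empExplode` DataFrame and implicits as above) that flattens the nested fields into one result in a single pass:

import org.apache.spark.sql.functions.explode

// Pull the scalar fields and explode the docs array in one select, then flatten.
empExplode
  .select(
    $"emp_info_withexplode.name".as("name"),
    $"emp_info_withexplode.address.state".as("state"),
    explode($"emp_info_withexplode.docs").as("doc"))
  .select($"name", $"state", $"doc.subject".as("subject"), $"doc.year".as("year"))
  .show(false)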


@Kant Kodali, is this helpful for you? If not, please let me know what
changes you are expecting.




On Sun, Jan 7, 2018 at 12:16 AM, Jules Damji  wrote:

> Here are a couple of tutorials that show how to extract structured nested
> data:
>
> https://databricks.com/blog/2017/06/27/4-sql-high-order-lambda-functions-examine-complex-structured-data-databricks.html
>
> https://databricks.com/blog/2017/06/13/five-spark-sql-utility-functions-extract-explore-complex-data-types.html
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Jan 6, 2018, at 11:42 AM, Hien Luu  wrote:
>
> Hi Kant,
>
> I am not sure whether you had come up with a solution yet, but the
> following
> works for me (in Scala)
>
> val emp_info = """
>  [
>{"name": "foo", "address": {"state": "CA", "country": "USA"},
> "docs":[{"subject": "english", "year": 2016}]},
>{"name": "bar", "address": {"state": "OH", "country": "USA"},
> "docs":[{"subject": "math", "year": 2017}]}
>  ]"""
>
> import org.apache.spark.sql.types._
>
> val addressSchema = new StructType().add("state",
> StringType).add("country",
> StringType)
> val docsSchema = ArrayType(new StructType().add("subject",
> StringType).add("year", IntegerType))
> val employeeSchema = new StructType().add("name",
> StringType).add("address",
> addressSchema).add("docs", docsSchema)
>
> val empInfoSchema = ArrayType(employeeSchema)
>
> empInfoSchema.json
>
> val empInfoStrDF = Seq((emp_info)).toDF("emp_info_str")
> empInfoStrDF.printSchema
> empInfoStrDF.show(false)
>
> val empInfoDF = empInfoStrDF.select(from_json('emp_info_str,
> empInfoSchema).as("emp_info"))
> empInfoDF.printSchema
>
> empInfoDF.select(struct("*")).show(false)
>
> empInfoDF.select("emp_info.name", "emp_info.address",
> "emp_info.docs").show(false)
>
> empInfoDF.select(explode('emp_info.getItem("name"))).show
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: Dataset API inconsistencies

2018-01-10 Thread Michael Armbrust
I wrote Datasets, and I'll say I only use them when I really need to (i.e.,
when it would be very cumbersome to express what I am trying to do
relationally). Dataset operations are almost always going to be slower
than their DataFrame equivalents, since they usually require materializing
objects (whereas DataFrame operations usually generate code that operates
directly on binary encoded data).
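To make the distinction concrete, a small spark-shell sketch (not from the original message): both filters below return the same rows, but the typed one runs a lambda over deserialized objects, while the untyped one stays a Catalyst expression over the encoded data.

import spark.implicits._

case class Person(name: String, age: Int)

val ds = spark.range(0, 1000000).map(i => Person(s"user$i", (i % 90).toInt))

val typed   = ds.filter(_.age > 21)    // lambda: each row is materialized as a Person
val untyped = ds.filter($"age" > 21)   // Column expression: optimizable, no object materialization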

We certainly could flesh out the API further (e.g. add orderBy that takes a
lambda function), but so far I have not seen a lot of demand for this, and
it would be strictly slower than the DataFrame version. I worry this
wouldn't actually be beneficial to users as it would give them a choice
that looks the same but has performance implications that are non-obvious.
If I'm in the minority though with this opinion, we should do it.

Regarding the concerns about type-safety, I haven't really found that to be
a major issue.  Even though you don't have type safety from the scala
compiler, the Spark SQL analyzer is checking your query before any
execution begins. This opinion is perhaps biased by the fact that I do a
lot of Spark SQL programming in notebooks where the difference between
"compile-time" and "runtime" is pretty minimal.

On Wed, Jan 10, 2018 at 1:45 AM, Alex Nastetsky wrote:

> I am finding the Dataset API to be very cumbersome to use, which is
> unfortunate, as I was looking forward to the type-safety after coming from
> a Dataframe codebase.
>
> This link summarizes my troubles: http://loicdescotte.github.io/posts/spark2-datasets-type-safety/
>
> The problem is having to continuously switch back and forth between typed
> and untyped semantics, which really kills productivity. In contrast, the
> RDD API is consistently typed and the Dataframe API is consistently
> untyped. I don't have to continuously stop and think about which one to use
> for each operation.
>
> I gave the Frameless framework (mentioned in the link) a shot, but
> eventually started running into oddities and lack of enough documentation
> and community support and did not want to sink too much time into it.
>
> At this point I'm considering just sticking with Dataframes, as I don't
> really consider Datasets to be usable. Has anyone had a similar experience
> or has had better luck?
>
> Alex.
>