[jira] [Updated] (SPARK-24376) compiling spark with scala-2.10 should use the -P parameter instead of -D

2018-05-23 Thread Yu Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Wang updated SPARK-24376:

Description: compiling spark with scala-2.10 should use the -P parameter 
instead of -D  (was: compiling spark with scala-2.10 should use the -p 
parameter instead of -d)

> compiling spark with scala-2.10 should use the -P parameter instead of -D
> -
>
> Key: SPARK-24376
> URL: https://issues.apache.org/jira/browse/SPARK-24376
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Yu Wang
>Priority: Minor
> Fix For: 2.2.0
>
>
> compiling spark with scala-2.10 should use the -P parameter instead of -D



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24376) compiling spark with scala-2.10 should use the -P parameter instead of -D

2018-05-23 Thread Yu Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Wang updated SPARK-24376:

Summary: compiling spark with scala-2.10 should use the -P parameter 
instead of -D  (was: compiling spark with scala-2.10 should use the -p 
parameter instead of -d)

> compiling spark with scala-2.10 should use the -P parameter instead of -D
> -
>
> Key: SPARK-24376
> URL: https://issues.apache.org/jira/browse/SPARK-24376
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Yu Wang
>Priority: Minor
> Fix For: 2.2.0
>
>
> compiling spark with scala-2.10 should use the -p parameter instead of -d



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24376) compiling spark with scala-2.10 should use the -p parameter instead of -d

2018-05-23 Thread Yu Wang (JIRA)
Yu Wang created SPARK-24376:
---

 Summary: compiling spark with scala-2.10 should use the -p 
parameter instead of -d
 Key: SPARK-24376
 URL: https://issues.apache.org/jira/browse/SPARK-24376
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.2.0
Reporter: Yu Wang
 Fix For: 2.2.0


compiling spark with scala-2.10 should use the -p parameter instead of -d



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24362) SUM function precision issue

2018-05-23 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488464#comment-16488464
 ] 

Yuming Wang commented on SPARK-24362:
-

*{{SortMergeJoin}}* vs *{{BroadcastHashJoin}}*:
{code}
test("SPARK-24362") {
  val df = spark.range(6).toDF("c1")
  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    df.join(df, "c1").selectExpr("sum(cast(9.99 as double))").show()
  }

  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "10") {
    df.join(df, "c1").selectExpr("sum(cast(9.99 as double))").show()
  }
}
{code}
Results:
{noformat}
+--+
|   smj|
+--+
|59.945|
+--+

+-+
|  bhj|
+-+
|59.94|
+-+
{noformat}
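Differences like this between physical plans generally come down to floating-point rounding: double addition is not associative, so accumulating the same values in a different order or grouping can round differently. A minimal Scala illustration of the underlying effect, independent of Spark and using textbook values rather than the ones above:
{code}
// Double addition is not associative: regrouping the same operands can change
// the rounded result, which is why different physical plans can produce
// slightly different SUM values over doubles.
val left  = (0.1 + 0.2) + 0.3   // 0.6000000000000001
val right = 0.1 + (0.2 + 0.3)   // 0.6
println(left == right)          // false
{code}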

> SUM function precision issue
> 
>
> Key: SPARK-24362
> URL: https://issues.apache.org/jira/browse/SPARK-24362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>Priority: Major
>
>  How to reproduce:
> {noformat}
> bin/spark-shell --conf spark.sql.autoBroadcastJoinThreshold=-1
> scala> val df = spark.range(6).toDF("c1")
> df: org.apache.spark.sql.DataFrame = [c1: bigint]
> scala> df.join(df, "c1").selectExpr("sum(cast(9.99 as double))").show()
> +-+
> |sum(CAST(9.99 AS DOUBLE))|
> +-+
> |       59.945|
> +-+{noformat}
>  
> More links:
> [https://stackoverflow.com/questions/42158844/about-a-loss-of-precision-when-calculating-an-aggregate-sum-with-data-frames]
> [https://stackoverflow.com/questions/44134497/spark-sql-sum-function-issues-on-double-value]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24364) Files deletion after globbing may fail StructuredStreaming jobs

2018-05-23 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24364.
--
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21408
[https://github.com/apache/spark/pull/21408]
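For reference, the failure mode is a race between globbing and a later FileSystem call. A minimal sketch of the kind of guard involved, as a hypothetical illustration using the Hadoop FileSystem API rather than the actual patch (the glob pattern is made up):
{code}
import java.io.FileNotFoundException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// Files matched by the glob can be deleted before the next FileSystem call.
val matched: Seq[FileStatus] =
  Option(fs.globStatus(new Path("/data/input/*.csv"))).map(_.toSeq).getOrElse(Nil)
val stillPresent = matched.flatMap { status =>
  try {
    // Any follow-up call (getFileStatus, open, listStatus, ...) may throw
    // FileNotFoundException if the file disappeared after globbing.
    Some(fs.getFileStatus(status.getPath))
  } catch {
    case _: FileNotFoundException => None // treat as deleted and skip it
  }
}
{code}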

> Files deletion after globbing may fail StructuredStreaming jobs
> ---
>
> Key: SPARK-24364
> URL: https://issues.apache.org/jira/browse/SPARK-24364
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> This is related to SPARK-17599. SPARK-17599 checked the directory only, but 
> it can actually fail on another FileSystem operation when the file is not 
> found. For example, see:
> {code}
> Error occurred while processing: File does not exist: 
> /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2029)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2000)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1913)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
> java.io.FileNotFoundException: File does not exist: 
> /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2029)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2000)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1913)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> 

[jira] [Assigned] (SPARK-24364) Files deletion after globbing may fail StructuredStreaming jobs

2018-05-23 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24364:


Assignee: Hyukjin Kwon

> Files deletion after globbing may fail StructuredStreaming jobs
> ---
>
> Key: SPARK-24364
> URL: https://issues.apache.org/jira/browse/SPARK-24364
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> This is related to SPARK-17599. SPARK-17599 checked the directory only, but 
> it can actually fail on another FileSystem operation when the file is not 
> found. For example, see:
> {code}
> Error occurred while processing: File does not exist: 
> /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2029)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2000)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1913)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
> java.io.FileNotFoundException: File does not exist: 
> /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2029)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2000)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1913)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at 
> 

[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-05-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488432#comment-16488432
 ] 

Xiangrui Meng commented on SPARK-24375:
---

[~jiangxb1987] Could you summarize the design sketch based on our offline 
discussion? Thanks!

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-05-23 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-24375:
-

 Summary: Design sketch: support barrier scheduling in Apache Spark
 Key: SPARK-24375
 URL: https://issues.apache.org/jira/browse/SPARK-24375
 Project: Spark
  Issue Type: Story
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Xiangrui Meng
Assignee: Jiang Xingbo


This task is to outline a design sketch for the barrier scheduling SPIP 
discussion. It doesn't need to be a complete design before the vote. But it 
should at least cover both Scala/Java and PySpark.
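As a purely hypothetical starting point for that discussion, here is a Scala sketch of what a barrier stage might look like to users; the names RDD.barrier and BarrierTaskContext are placeholders for whatever the design settles on, not a decided API:
{code}
// Hypothetical user-facing sketch: every task in the stage is launched at the
// same time, and each task can wait on a global barrier before proceeding.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
rdd.barrier().mapPartitions { iter =>
  val context = BarrierTaskContext.get()
  // ... set up MPI / all-reduce style communication with the other tasks ...
  context.barrier()  // wait until every task in the stage reaches this point
  iter
}.collect()
{code}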



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-05-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24374:
--
Shepherd: Reynold Xin

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-05-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24374:
--
Issue Type: Epic  (was: Story)

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-05-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24374:
--
Description: 
(See details in the linked/attached SPIP doc.)

{quote}

The proposal here is to add a new scheduling model to Apache Spark so users can 
properly embed distributed DL training as a Spark stage to simplify the 
distributed training workflow. For example, Horovod uses MPI to implement 
all-reduce to accelerate distributed TensorFlow training. The computation model 
is different from MapReduce used by Spark. In Spark, a task in a stage doesn’t 
depend on any other tasks in the same stage, and hence it can be scheduled 
independently. In MPI, all workers start at the same time and pass messages 
around. To embed this workload in Spark, we need to introduce a new scheduling 
model, tentatively named “barrier scheduling”, which launches tasks at the same 
time and provides users enough information and tooling to embed distributed DL 
training. Spark can also provide an extra layer of fault tolerance in case some 
tasks failed in the middle, where Spark would abort all tasks and restart the 
stage.

{quote}

  was:
(See details in the linked/attached SPIP doc.)

The proposal here is to add a new scheduling model to Apache Spark so users can 
properly embed distributed DL training as a Spark stage to simplify the 
distributed training workflow. For example, Horovod uses MPI to implement 
all-reduce to accelerate distributed TensorFlow training. The computation model 
is different from MapReduce used by Spark. In Spark, a task in a stage doesn’t 
depend on any other tasks in the same stage, and hence it can be scheduled 
independently. In MPI, all workers start at the same time and pass messages 
around. To embed this workload in Spark, we need to introduce a new scheduling 
model, tentatively named “barrier scheduling”, which launches tasks at the same 
time and provides users enough information and tooling to embed distributed DL 
training. Spark can also provide an extra layer of fault tolerance in case some 
tasks failed in the middle, where Spark would abort all tasks and restart the 
stage.


> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-05-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24374:
--
Attachment: SPIP_ Support Barrier Scheduling in Apache Spark.pdf

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked SPIP doc.)
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-05-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24374:
--
Description: 
(See details in the linked/attached SPIP doc.)

The proposal here is to add a new scheduling model to Apache Spark so users can 
properly embed distributed DL training as a Spark stage to simplify the 
distributed training workflow. For example, Horovod uses MPI to implement 
all-reduce to accelerate distributed TensorFlow training. The computation model 
is different from MapReduce used by Spark. In Spark, a task in a stage doesn’t 
depend on any other tasks in the same stage, and hence it can be scheduled 
independently. In MPI, all workers start at the same time and pass messages 
around. To embed this workload in Spark, we need to introduce a new scheduling 
model, tentatively named “barrier scheduling”, which launches tasks at the same 
time and provides users enough information and tooling to embed distributed DL 
training. Spark can also provide an extra layer of fault tolerance in case some 
tasks failed in the middle, where Spark would abort all tasks and restart the 
stage.

  was:
(See details in the linked SPIP doc.)

The proposal here is to add a new scheduling model to Apache Spark so users can 
properly embed distributed DL training as a Spark stage to simplify the 
distributed training workflow. For example, Horovod uses MPI to implement 
all-reduce to accelerate distributed TensorFlow training. The computation model 
is different from MapReduce used by Spark. In Spark, a task in a stage doesn’t 
depend on any other tasks in the same stage, and hence it can be scheduled 
independently. In MPI, all workers start at the same time and pass messages 
around. To embed this workload in Spark, we need to introduce a new scheduling 
model, tentatively named “barrier scheduling”, which launches tasks at the same 
time and provides users enough information and tooling to embed distributed DL 
training. Spark can also provide an extra layer of fault tolerance in case some 
tasks failed in the middle, where Spark would abort all tasks and restart the 
stage.


> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-05-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24374:
--
Labels: SPIP  (was: )

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: SPIP
>
> (See details in the linked SPIP doc.)
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-05-23 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-24374:
-

 Summary: SPIP: Support Barrier Scheduling in Apache Spark
 Key: SPARK-24374
 URL: https://issues.apache.org/jira/browse/SPARK-24374
 Project: Spark
  Issue Type: Story
  Components: ML, Spark Core
Affects Versions: 3.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


(See details in the linked SPIP doc.)

The proposal here is to add a new scheduling model to Apache Spark so users can 
properly embed distributed DL training as a Spark stage to simplify the 
distributed training workflow. For example, Horovod uses MPI to implement 
all-reduce to accelerate distributed TensorFlow training. The computation model 
is different from MapReduce used by Spark. In Spark, a task in a stage doesn’t 
depend on any other tasks in the same stage, and hence it can be scheduled 
independently. In MPI, all workers start at the same time and pass messages 
around. To embed this workload in Spark, we need to introduce a new scheduling 
model, tentatively named “barrier scheduling”, which launches tasks at the same 
time and provides users enough information and tooling to embed distributed DL 
training. Spark can also provide an extra layer of fault tolerance in case some 
tasks failed in the middle, where Spark would abort all tasks and restart the 
stage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24369) A bug when having multiple distinct aggregations

2018-05-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488411#comment-16488411
 ] 

Takeshi Yamamuro commented on SPARK-24369:
--

ok, I will look into this. Thanks!

> A bug when having multiple distinct aggregations
> 
>
> Key: SPARK-24369
> URL: https://issues.apache.org/jira/browse/SPARK-24369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
> (VALUES
>(1, 1),
>(2, 2),
>(2, 2)
> ) t(x, y)
> {code}
> It returns 
> {code}
> java.lang.RuntimeException
> You hit a query analyzer bug. Please report your query to Spark user mailing 
> list.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24369) A bug when having multiple distinct aggregations

2018-05-23 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488396#comment-16488396
 ] 

Xiao Li commented on SPARK-24369:
-

cc [~maropu] Are you interested in this?

> A bug when having multiple distinct aggregations
> 
>
> Key: SPARK-24369
> URL: https://issues.apache.org/jira/browse/SPARK-24369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
> (VALUES
>(1, 1),
>(2, 2),
>(2, 2)
> ) t(x, y)
> {code}
> It returns 
> {code}
> java.lang.RuntimeException
> You hit a query analyzer bug. Please report your query to Spark user mailing 
> list.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24322) Upgrade Apache ORC to 1.4.4

2018-05-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24322:
---

Assignee: Dongjoon Hyun

> Upgrade Apache ORC to 1.4.4
> ---
>
> Key: SPARK-24322
> URL: https://issues.apache.org/jira/browse/SPARK-24322
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: correctness
> Fix For: 2.3.1, 2.4.0
>
>
> ORC 1.4.4 includes [nine 
> fixes|https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4].
>  One of the issues is a `Timestamp` bug (ORC-306) that occurs when the 
> `native` ORC vectorized reader reads ORC column vector's sub-vector `times` 
> and `nanos`. ORC-306 fixes this according to the [original 
> definition|https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46]
>  and the linked PR includes the updated interpretation on ORC column vectors. 
> Note that the `hive` ORC reader and the ORC MR reader are not affected.
> {code}
> scala> spark.version
> res0: String = 2.3.0
> scala> spark.sql("set spark.sql.orc.impl=native")
> scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 
> 12:34:56.000789")).toDF().write.orc("/tmp/orc")
> scala> spark.read.orc("/tmp/orc").show(false)
> +--+
> |value |
> +--+
> |1900-05-05 12:34:55.000789|
> +--+
> {code}
> This issue aims to update Apache Spark to use it.
> *FULL LIST*
> || ID || TITLE ||
> | ORC-281 | Fix compiler warnings from clang 5.0 | 
> | ORC-301 | `extractFileTail` should open a file in `try` statement | 
> | ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api | 
> | ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp | 
> | ORC-324 | Add support for ARM and PPC arch | 
> | ORC-330 | Remove unnecessary Hive artifacts from root pom | 
> | ORC-332 | Add syntax version to orc_proto.proto | 
> | ORC-336 | Remove avro and parquet dependency management entries | 
> | ORC-360 | Implement error checking on subtype fields in Java | 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24322) Upgrade Apache ORC to 1.4.4

2018-05-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24322.
-
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21372
[https://github.com/apache/spark/pull/21372]

> Upgrade Apache ORC to 1.4.4
> ---
>
> Key: SPARK-24322
> URL: https://issues.apache.org/jira/browse/SPARK-24322
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.0, 2.3.1
>
>
> ORC 1.4.4 includes [nine 
> fixes|https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4].
>  One of the issues is a `Timestamp` bug (ORC-306) that occurs when the 
> `native` ORC vectorized reader reads ORC column vector's sub-vector `times` 
> and `nanos`. ORC-306 fixes this according to the [original 
> definition|https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46]
>  and the linked PR includes the updated interpretation on ORC column vectors. 
> Note that the `hive` ORC reader and the ORC MR reader are not affected.
> {code}
> scala> spark.version
> res0: String = 2.3.0
> scala> spark.sql("set spark.sql.orc.impl=native")
> scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 
> 12:34:56.000789")).toDF().write.orc("/tmp/orc")
> scala> spark.read.orc("/tmp/orc").show(false)
> +--+
> |value |
> +--+
> |1900-05-05 12:34:55.000789|
> +--+
> {code}
> This issue aims to update Apache Spark to use it.
> *FULL LIST*
> || ID || TITLE ||
> | ORC-281 | Fix compiler warnings from clang 5.0 | 
> | ORC-301 | `extractFileTail` should open a file in `try` statement | 
> | ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api | 
> | ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp | 
> | ORC-324 | Add support for ARM and PPC arch | 
> | ORC-330 | Remove unnecessary Hive artifacts from root pom | 
> | ORC-332 | Add syntax version to orc_proto.proto | 
> | ORC-336 | Remove avro and parquet dependency management entries | 
> | ORC-360 | Implement error checking on subtype fields in Java | 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24257) LongToUnsafeRowMap calculate the new size may be wrong

2018-05-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24257.
-
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.1.3
   2.4.0
   2.2.2
   2.0.3

Issue resolved by pull request 21311
[https://github.com/apache/spark/pull/21311]

> LongToUnsafeRowMap calculate the new size may be wrong
> --
>
> Key: SPARK-24257
> URL: https://issues.apache.org/jira/browse/SPARK-24257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.0.3, 2.2.2, 2.4.0, 2.1.3, 2.3.1
>
>
> LongToUnsafeRowMap calculates its new size simply by multiplying the current 
> size by 2. The newly allocated size may not be enough to store the data, so 
> some data is lost and the data read back is dirty.
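A minimal sketch of the problem, as simplified pseudologic rather than the actual LongToUnsafeRowMap code: doubling alone ignores how much space the pending append actually needs.
{code}
// Simplified illustration, not the real implementation.
def grow(currentCapacity: Long, bytesNeeded: Long): Long = {
  val doubled = currentCapacity * 2   // what the buggy logic effectively used
  // Doubling can still be too small for a large append, which is how data ends
  // up silently truncated; the new size must also cover what is required.
  math.max(doubled, currentCapacity + bytesNeeded)
}
{code}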



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24257) LongToUnsafeRowMap calculate the new size may be wrong

2018-05-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24257:
---

Assignee: dzcxzl

> LongToUnsafeRowMap calculate the new size may be wrong
> --
>
> Key: SPARK-24257
> URL: https://issues.apache.org/jira/browse/SPARK-24257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.0.3, 2.1.3, 2.2.2, 2.3.1, 2.4.0
>
>
> LongToUnsafeRowMap calculates its new size simply by multiplying the current 
> size by 2. The newly allocated size may not be enough to store the data, so 
> some data is lost and the data read back is dirty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23416) Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-05-23 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488355#comment-16488355
 ] 

Dongjoon Hyun commented on SPARK-23416:
---

Thank you, [~joseph.torres] and [~tdas] .

Actually, failures have recently been reported on `branch-2.3`. Can we have this fix 
on `branch-2.3`, please?

> Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for 
> failOnDataLoss=false
> 
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 3.0.0
>
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> 

[jira] [Commented] (SPARK-24370) spark checkpoint creates many 0 byte empty files(partitions) in checkpoint directory

2018-05-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488330#comment-16488330
 ] 

Hyukjin Kwon commented on SPARK-24370:
--

Sounds more like a question, though. Do you have a reproducer or steps to 
reproduce this? That would help others like me reproduce and debug the 
problem.
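For instance, a minimal hypothetical reproducer might look something like the following; the grouping step and the checkpoint directory are assumptions, the point being that any shuffle before checkpoint() materializes spark.sql.shuffle.partitions (200 by default) output files, most of them empty for small data:
{code}
import spark.implicits._

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
// A shuffle (groupBy) on a tiny dataset still produces the default
// spark.sql.shuffle.partitions = 200 partitions, most of them empty.
val df = spark.range(10).groupBy(($"id" % 3).as("k")).count()
val cp = df.checkpoint()             // materializes one file per partition
println(cp.rdd.getNumPartitions)     // 200 unless the default is changed
{code}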

> spark checkpoint creates many 0 byte empty files(partitions)  in checkpoint 
> directory
> -
>
> Key: SPARK-24370
> URL: https://issues.apache.org/jira/browse/SPARK-24370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.1
>Reporter: Jami Malikzade
>Priority: Major
> Attachments: partitions.PNG
>
>
> We are currently facing an issue where calling checkpoint on a DataFrame 
> creates partitions in the checkpoint dir, but some of them are empty, so we 
> get exceptions when reading the DataFrame back.
> Do you have any idea how to avoid it?
> It creates 200 partitions and some are empty. I used repartition(1) before 
> checkpoint, but that is not a good workaround. Is there any way to populate 
> all partitions with data, or to avoid empty files?
> Pasted snapshot.
> !image-2018-05-23-21-10-43-673.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24370) spark checkpoint creates many 0 byte empty files(partitions) in checkpoint directory

2018-05-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488329#comment-16488329
 ] 

Hyukjin Kwon commented on SPARK-24370:
--

(let's avoid setting critical usually reserved for a committer)

> spark checkpoint creates many 0 byte empty files(partitions)  in checkpoint 
> directory
> -
>
> Key: SPARK-24370
> URL: https://issues.apache.org/jira/browse/SPARK-24370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.1
>Reporter: Jami Malikzade
>Priority: Major
> Attachments: partitions.PNG
>
>
> We are currently facing an issue where calling checkpoint on a DataFrame 
> creates partitions in the checkpoint dir, but some of them are empty, so we 
> get exceptions when reading the DataFrame back.
> Do you have any idea how to avoid it?
> It creates 200 partitions and some are empty. I used repartition(1) before 
> checkpoint, but that is not a good workaround. Is there any way to populate 
> all partitions with data, or to avoid empty files?
> Pasted snapshot.
> !image-2018-05-23-21-10-43-673.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24370) spark checkpoint creates many 0 byte empty files(partitions) in checkpoint directory

2018-05-23 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24370:
-
Priority: Major  (was: Critical)

> spark checkpoint creates many 0 byte empty files(partitions)  in checkpoint 
> directory
> -
>
> Key: SPARK-24370
> URL: https://issues.apache.org/jira/browse/SPARK-24370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.1
>Reporter: Jami Malikzade
>Priority: Major
> Attachments: partitions.PNG
>
>
> We are currently facing an issue where calling checkpoint on a DataFrame 
> creates partitions in the checkpoint dir, but some of them are empty, so we 
> get exceptions when reading the DataFrame back.
> Do you have any idea how to avoid it?
> It creates 200 partitions and some are empty. I used repartition(1) before 
> checkpoint, but that is not a good workaround. Is there any way to populate 
> all partitions with data, or to avoid empty files?
> Pasted snapshot.
> !image-2018-05-23-21-10-43-673.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24363) spark context exit abnormally

2018-05-23 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24363.
--
Resolution: Invalid

This sounds more like a question. Please ask it on the dev mailing list; I believe you 
will get a better answer there. Let's leave this resolved for now until it's clear 
whether this is an actual issue.

> spark context exit abnormally
> -
>
> Key: SPARK-24363
> URL: https://issues.apache.org/jira/browse/SPARK-24363
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Submit, YARN
>Affects Versions: 2.1.2
> Environment: spark2.1.2   
> hadoop2.6.0-cdh5.4.7
>  
>  
>Reporter: nilone
>Priority: Major
> Attachments: QQ图片20180523162543.png, error_log.txt
>
>
> I create a SparkContext instance in yarn-client mode in my program (a web 
> application) so that users can submit their jobs. It works for a period of 
> time, but then it suddenly exits without any new job running. It looks 
> like the context has been killed, and I'm not sure whether the problem is on 
> the YARN side or the Spark side.
> You can see my screenshot and log file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488261#comment-16488261
 ] 

Hyukjin Kwon commented on SPARK-24358:
--

Yea, that's exactly what I thought. There are some differences between Python 2 
and 3, but as far as I know PySpark usually supports both consistently.

> createDataFrame in Python 3 should be able to infer bytes type as Binary type
> -
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>  Labels: Python3
>
> createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes 
> is just the immutable, hashable version of this same structure, it makes 
> sense for the same thing to apply there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23416) Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-05-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23416.
---
   Resolution: Fixed
Fix Version/s: (was: 2.3.0)
   3.0.0

Issue resolved by pull request 21384
[https://github.com/apache/spark/pull/21384]

> Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for 
> failOnDataLoss=false
> 
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 3.0.0
>
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> 

[jira] [Assigned] (SPARK-23416) Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-05-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-23416:
-

Assignee: Jose Torres

> Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for 
> failOnDataLoss=false
> 
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 2.3.0
>
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> 

[jira] [Updated] (SPARK-24322) Upgrade Apache ORC to 1.4.4

2018-05-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24322:
--
Description: 
ORC 1.4.4 includes [nine 
fixes|https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4].
 One of the issues is a `Timestamp` bug (ORC-306) which occurs when the `native` ORC vectorized reader reads the ORC column vector's sub-vectors `times` and `nanos`. ORC-306 fixes this according to the [original definition|https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46] and the linked PR includes the updated interpretation of ORC column vectors. Note that the `hive` ORC reader and the ORC MR reader are not affected.

{code}
scala> spark.version
res0: String = 2.3.0
scala> spark.sql("set spark.sql.orc.impl=native")
scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 
12:34:56.000789")).toDF().write.orc("/tmp/orc")
scala> spark.read.orc("/tmp/orc").show(false)
+--+
|value |
+--+
|1900-05-05 12:34:55.000789|
+--+
{code}

This issue aims to update Apache Spark to use it.
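
As a hedged aside (not part of the original report): since the `hive` reader is noted above as unaffected by ORC-306, switching the implementation is a possible stop-gap until the upgrade lands.

{code}
// Minimal sketch (assumption: the same /tmp/orc data written above). The hive reader
// path does not hit ORC-306, so the timestamp should come back unshifted
// (1900-05-05 12:34:56.000789).
spark.sql("set spark.sql.orc.impl=hive")
spark.read.orc("/tmp/orc").show(false)
{code}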

*FULL LIST*

|| ID || TITLE ||
| ORC-281 | Fix compiler warnings from clang 5.0 | 
| ORC-301 | `extractFileTail` should open a file in `try` statement | 
| ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api | 
| ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp | 
| ORC-324 | Add support for ARM and PPC arch | 
| ORC-330 | Remove unnecessary Hive artifacts from root pom | 
| ORC-332 | Add syntax version to orc_proto.proto | 
| ORC-336 | Remove avro and parquet dependency management entries | 
| ORC-360 | Implement error checking on subtype fields in Java | 

  was:
ORC 1.4.4 includes [nine 
fixes|https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4).
 One of the issues is about `Timestamp` bug (ORC-306) which occurs when 
`native` ORC vectorized reader reads ORC column vector's sub-vector `times` and 
`nanos`. ORC-306 fixes this according to the [original 
definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46)
 and the linked PR includes the updated interpretation on ORC column vectors. 
Note that `hive` ORC reader and ORC MR reader is not affected.

{code}
scala> spark.version
res0: String = 2.3.0
scala> spark.sql("set spark.sql.orc.impl=native")
scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 
12:34:56.000789")).toDF().write.orc("/tmp/orc")
scala> spark.read.orc("/tmp/orc").show(false)
+--+
|value |
+--+
|1900-05-05 12:34:55.000789|
+--+
{code}

This issue aims to update Apache Spark to use it.

*FULL LIST*

|| ID || TITLE ||
| ORC-281 | Fix compiler warnings from clang 5.0 | 
| ORC-301 | `extractFileTail` should open a file in `try` statement | 
| ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api | 
| ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp | 
| ORC-324 | Add support for ARM and PPC arch | 
| ORC-330 | Remove unnecessary Hive artifacts from root pom | 
| ORC-332 | Add syntax version to orc_proto.proto | 
| ORC-336 | Remove avro and parquet dependency management entries | 
| ORC-360 | Implement error checking on subtype fields in Java | 


> Upgrade Apache ORC to 1.4.4
> ---
>
> Key: SPARK-24322
> URL: https://issues.apache.org/jira/browse/SPARK-24322
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: correctness
>
> ORC 1.4.4 includes [nine 
> fixes|https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4].
>  One of the issues is about `Timestamp` bug (ORC-306) which occurs when 
> `native` ORC vectorized reader reads ORC column vector's sub-vector `times` 
> and `nanos`. ORC-306 fixes this according to the [original 
> definition|https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46]
>  and the linked PR includes the updated interpretation on ORC column vectors. 
> Note that `hive` ORC reader and ORC MR reader is not affected.
> {code}
> scala> spark.version
> res0: String = 2.3.0
> scala> spark.sql("set spark.sql.orc.impl=native")
> scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 
> 12:34:56.000789")).toDF().write.orc("/tmp/orc")
> scala> 

[jira] [Updated] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data

2018-05-23 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24373:
---
Summary: "df.cache() df.count()" no longer eagerly caches data  (was: Spark 
Dataset groupby.agg/count doesn't respect cache with UDF)

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> in Spark 2.2, you could see it prints "". 
> In the above example, when you do explain. You could see
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF

2018-05-23 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488109#comment-16488109
 ] 

Li Jin edited comment on SPARK-24373 at 5/23/18 11:18 PM:
--

We found that after upgrading to Spark 2.3, many of our production systems run 
slower. We traced this to "df.cache(); df.count()" no longer eagerly caching df 
correctly. I think this might be a regression from 2.2.

I think anyone who uses "df.cache() df.count()" to cache data eagerly will be 
affected.
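
A minimal sketch (mine, not from the report) for checking whether the eager cache was actually materialized after "df.cache(); df.count()", using the developer API {{SparkContext.getRDDStorageInfo}}:

{code:java}
// Hedged sketch: run right after df.cache(); df.count() on the affected DataFrame.
// If count() eagerly materialized the cache, at least one RDD should report cached
// partitions; an empty result here would indicate that nothing was materialized.
val cached = spark.sparkContext.getRDDStorageInfo.filter(_.numCachedPartitions > 0)
println(s"materialized cached RDDs: ${cached.length}")
{code}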


was (Author: icexelloss):
I think this might be a regression from 2.2

Any one uses  "df.cache() df.count()" to cache data eagerly will be affected.

> Spark Dataset groupby.agg/count doesn't respect cache with UDF
> --
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> in Spark 2.2, you could see it prints "". 
> In the above example, when you do explain. You could see
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF

2018-05-23 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24373:
---
Component/s: (was: Input/Output)
 SQL

> Spark Dataset groupby.agg/count doesn't respect cache with UDF
> --
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> in Spark 2.2, you could see it prints "". 
> In the above example, when you do explain. You could see
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF

2018-05-23 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488109#comment-16488109
 ] 

Li Jin commented on SPARK-24373:


I think this might be a regression from 2.2

Anyone who uses "df.cache() df.count()" to cache data eagerly will be affected.

> Spark Dataset groupby.agg/count doesn't respect cache with UDF
> --
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> in Spark 2.2, you could see it prints "". 
> In the above example, when you do explain. You could see
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-23 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488090#comment-16488090
 ] 

Joel Croteau commented on SPARK-24358:
--

This may be trickier than I first thought. In Python 2, bytes is an alias for 
str, so a bytes object resolves as a StringType. In Python 3, they are 
different types and not, in general, freely convertible, since a str in Python 3 
is unicode and an arbitrary byte string represented by a bytes may not be 
valid unicode. This means that a bytes in Python 2 will need to be resolved to 
a different schema than a bytes in Python 3. Not sure how significant that is.

> createDataFrame in Python 3 should be able to infer bytes type as Binary type
> -
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>  Labels: Python3
>
> createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes 
> is just the immutable, hashable version of this same structure, it makes 
> sense for the same thing to apply there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF

2018-05-23 Thread Wenbo Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbo Zhao updated SPARK-24373:
---
Description: 
Here is the code to reproduce in local mode
{code:java}
scala> val df = sc.range(1, 2).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> val myudf = udf({x: Long => println(""); x + 1})
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
UserDefinedFunction(,LongType,Some(List(LongType)))

scala> val df1 = df.withColumn("value1", myudf(col("value")))
df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]

scala> df1.cache
res0: df1.type = [value: bigint, value1: bigint]

scala> df1.count
res1: Long = 1 

scala> df1.count
res2: Long = 1

scala> df1.count
res3: Long = 1
{code}
 

In Spark 2.2, you could see it prints "". 

In the above example, when you run explain, you can see
{code:java}
scala> df1.explain(true)
== Parsed Logical Plan ==
'Project [value#2L, UDF('value) AS value1#5]
+- AnalysisBarrier
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
value: bigint, value1: bigint
Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]

== Physical Plan ==
*(1) InMemoryTableScan [value#2L, value1#5L]
+- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
memory, deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]

{code}
but the InMemoryTableScan is missing in the following explain()
{code:java}
scala> df1.groupBy().count().explain(true)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
output=[count#175L])
+- *(1) Project
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]
{code}
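
A small programmatic variant of the check above (a sketch, not part of the original report): instead of eyeballing the explain() output, look for InMemoryTableScan in the executed plan of the aggregation.

{code:java}
// Hedged sketch: df1 is the cached DataFrame defined above.
val aggPlan = df1.groupBy().count().queryExecution.executedPlan.toString
val usesCache = aggPlan.contains("InMemoryTableScan")
println(s"aggregation reads from cache: $usesCache")
{code}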
 

 

  was:
Here is the code to reproduce in local mode
{code:java}
scala> val df = sc.range(1, 2).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]
scala> val myudf = udf({x: Long => println(""); x + 1})
scala> val df1 = df.withColumn("value1", myudf(col("value")))
scala> df1.cache()
res0: df1.type = [value: bigint, value1: bigint] 
scala> df1.count
res1: Long = 1
scala> df1.count
res2: Long = 1 
scala> df1.count 
res3: Long = 1
{code}
 

in Spark 2.2, you could see it prints "". 

In the above example, when you do explain. You could see
{code:java}
scala> df1.explain(true)
== Parsed Logical Plan ==
'Project [value#2L, UDF('value) AS value1#5]
+- AnalysisBarrier
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
value: bigint, value1: bigint
Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]

== Physical Plan ==
*(1) InMemoryTableScan [value#2L, value1#5L]
+- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
memory, deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]

{code}
but the InMemoryTableScan is missing in the following explain()
{code:java}
scala> df1.groupBy().count().explain(true)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 

[jira] [Updated] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF

2018-05-23 Thread Wenbo Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbo Zhao updated SPARK-24373:
---
Description: 
Here is the code to reproduce in local mode
{code:java}
scala> val df = sc.range(1, 2).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]
scala> val myudf = udf({x: Long => println(""); x + 1})
scala> val df1 = df.withColumn("value1", myudf(col("value")))
scala> df1.cache()
res0: df1.type = [value: bigint, value1: bigint] 
scala> df1.count
res1: Long = 1
scala> df1.count
res2: Long = 1 
scala> df1.count 
res3: Long = 1
{code}
 

in Spark 2.2, you could see it prints "". 

In the above example, when you do explain. You could see
{code:java}
scala> df1.explain(true)
== Parsed Logical Plan ==
'Project [value#2L, UDF('value) AS value1#5]
+- AnalysisBarrier
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
value: bigint, value1: bigint
Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]

== Physical Plan ==
*(1) InMemoryTableScan [value#2L, value1#5L]
+- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
memory, deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]

{code}
but the InMemoryTableScan is missing in the following explain()
{code:java}
scala> df1.groupBy().count().explain(true)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
output=[count#175L])
+- *(1) Project
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]
{code}
 

 

  was:
Here is the code to reproduce in local mode
{code:java}
val df = sc.range(1, 2).toDF
val myudf = udf({x: Long => println(""); x + 1})
scala> df1.cache()
res0: df1.type = [value: bigint, value1: bigint] 
scala> df1.count
res1: Long = 1
scala> df1.count
res2: Long = 1 
scala> df1.count 
res3: Long = 1
{code}
 

in Spark 2.2, you could see it prints "". 

In the above example, when you do explain. You could see
{code:java}
scala> df1.explain(true)
== Parsed Logical Plan ==
'Project [value#2L, UDF('value) AS value1#5]
+- AnalysisBarrier
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
value: bigint, value1: bigint
Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]

== Physical Plan ==
*(1) InMemoryTableScan [value#2L, value1#5L]
+- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
memory, deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]

{code}
but the InMemoryTableScan is missing in the following explain()
{code:java}
scala> df1.groupBy().count().explain(true)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject 

[jira] [Updated] (SPARK-24322) Upgrade Apache ORC to 1.4.4

2018-05-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24322:
--
Description: 
ORC 1.4.4 includes [nine 
fixes|https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4).
 One of the issues is about `Timestamp` bug (ORC-306) which occurs when 
`native` ORC vectorized reader reads ORC column vector's sub-vector `times` and 
`nanos`. ORC-306 fixes this according to the [original 
definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46)
 and the linked PR includes the updated interpretation on ORC column vectors. 
Note that `hive` ORC reader and ORC MR reader is not affected.

{code}
scala> spark.version
res0: String = 2.3.0
scala> spark.sql("set spark.sql.orc.impl=native")
scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 
12:34:56.000789")).toDF().write.orc("/tmp/orc")
scala> spark.read.orc("/tmp/orc").show(false)
+--+
|value |
+--+
|1900-05-05 12:34:55.000789|
+--+
{code}

This issue aims to update Apache Spark to use it.

*FULL LIST*

|| ID || TITLE ||
| ORC-281 | Fix compiler warnings from clang 5.0 | 
| ORC-301 | `extractFileTail` should open a file in `try` statement | 
| ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api | 
| ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp | 
| ORC-324 | Add support for ARM and PPC arch | 
| ORC-330 | Remove unnecessary Hive artifacts from root pom | 
| ORC-332 | Add syntax version to orc_proto.proto | 
| ORC-336 | Remove avro and parquet dependency management entries | 
| ORC-360 | Implement error checking on subtype fields in Java | 

  was:
ORC 1.4.4 (released on May 14th) includes nine fixes. This issue aims to update 
Spark to use it.

https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4

For example, ORC-306 fixes the timestamp issue.
{code}
scala> spark.version
res0: String = 2.3.0
scala> spark.sql("set spark.sql.orc.impl=native")
scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 
12:34:56.000789")).toDF().write.orc("/tmp/orc")
scala> spark.read.orc("/tmp/orc").show(false)
+--+
|value |
+--+
|1900-05-05 12:34:55.000789|
+--+
{code}


> Upgrade Apache ORC to 1.4.4
> ---
>
> Key: SPARK-24322
> URL: https://issues.apache.org/jira/browse/SPARK-24322
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: correctness
>
> ORC 1.4.4 includes [nine 
> fixes|https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4).
>  One of the issues is about `Timestamp` bug (ORC-306) which occurs when 
> `native` ORC vectorized reader reads ORC column vector's sub-vector `times` 
> and `nanos`. ORC-306 fixes this according to the [original 
> definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46)
>  and the linked PR includes the updated interpretation on ORC column vectors. 
> Note that `hive` ORC reader and ORC MR reader is not affected.
> {code}
> scala> spark.version
> res0: String = 2.3.0
> scala> spark.sql("set spark.sql.orc.impl=native")
> scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 
> 12:34:56.000789")).toDF().write.orc("/tmp/orc")
> scala> spark.read.orc("/tmp/orc").show(false)
> +--+
> |value |
> +--+
> |1900-05-05 12:34:55.000789|
> +--+
> {code}
> This issue aims to update Apache Spark to use it.
> *FULL LIST*
> || ID || TITLE ||
> | ORC-281 | Fix compiler warnings from clang 5.0 | 
> | ORC-301 | `extractFileTail` should open a file in `try` statement | 
> | ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api | 
> | ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp | 
> | ORC-324 | Add support for ARM and PPC arch | 
> | ORC-330 | Remove unnecessary Hive artifacts from root pom | 
> | ORC-332 | Add syntax version to orc_proto.proto | 
> | ORC-336 | Remove avro and parquet dependency management entries | 
> | ORC-360 | Implement error checking on subtype fields in Java | 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Updated] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF

2018-05-23 Thread Wenbo Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbo Zhao updated SPARK-24373:
---
Summary: Spark Dataset groupby.agg/count doesn't respect cache with UDF  
(was: Spark Dataset groupby.agg/count doesn't respect cache)

> Spark Dataset groupby.agg/count doesn't respect cache with UDF
> --
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Priority: Major
>
> Here is the code to reproduce in local mode
> {code:java}
> val df = sc.range(1, 2).toDF
> val myudf = udf({x: Long => println(""); x + 1})
> scala> df1.cache()
> res0: df1.type = [value: bigint, value1: bigint] 
> scala> df1.count
> res1: Long = 1
> scala> df1.count
> res2: Long = 1 
> scala> df1.count 
> res3: Long = 1
> {code}
>  
> in Spark 2.2, you could see it prints "". 
> In the above example, when you do explain. You could see
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache

2018-05-23 Thread Wenbo Zhao (JIRA)
Wenbo Zhao created SPARK-24373:
--

 Summary: Spark Dataset groupby.agg/count doesn't respect cache
 Key: SPARK-24373
 URL: https://issues.apache.org/jira/browse/SPARK-24373
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.3.0
Reporter: Wenbo Zhao


Here is the code to reproduce in local mode
{code:java}
val df = sc.range(1, 2).toDF
val myudf = udf({x: Long => println(""); x + 1})
scala> df1.cache()
res0: df1.type = [value: bigint, value1: bigint] 
scala> df1.count
res1: Long = 1
scala> df1.count
res2: Long = 1 
scala> df1.count 
res3: Long = 1
{code}
 

in Spark 2.2, you could see it prints "". 

In the above example, when you do explain. You could see
{code:java}
scala> df1.explain(true)
== Parsed Logical Plan ==
'Project [value#2L, UDF('value) AS value1#5]
+- AnalysisBarrier
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
value: bigint, value1: bigint
Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]

== Physical Plan ==
*(1) InMemoryTableScan [value#2L, value1#5L]
+- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
memory, deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]

{code}
but the InMemoryTableScan is missing in the following explain()
{code:java}
scala> df1.groupBy().count().explain(true)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
+- ExternalRDD [obj#1L]

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
output=[count#175L])
+- *(1) Project
+- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
+- Scan ExternalRDDScan[obj#1L]
{code}
 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22054) Allow release managers to inject their keys

2018-05-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488024#comment-16488024
 ] 

Marcelo Vanzin commented on SPARK-22054:


I filed SPARK-24372 to create an easier way to prepare releases, which I think 
is better than futzing with jenkins jobs.

I'll downgrade this from blocker but leave it open, in case you guys still 
think this should be done.

> Allow release managers to inject their keys
> ---
>
> Key: SPARK-22054
> URL: https://issues.apache.org/jira/browse/SPARK-22054
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> Right now the current release process signs with Patrick's keys, let's update 
> the scripts to allow the release manager to sign the release as part of the 
> job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22054) Allow release managers to inject their keys

2018-05-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-22054:
---
Priority: Major  (was: Blocker)

> Allow release managers to inject their keys
> ---
>
> Key: SPARK-22054
> URL: https://issues.apache.org/jira/browse/SPARK-22054
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Major
>
> Right now the current release process signs with Patrick's keys, let's update 
> the scripts to allow the release manager to sign the release as part of the 
> job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22055) Port release scripts

2018-05-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-22055:
---
Priority: Major  (was: Blocker)

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Major
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these to enable changes like SPARK-22054 .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22055) Port release scripts

2018-05-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488023#comment-16488023
 ] 

Marcelo Vanzin commented on SPARK-22055:


I filed SPARK-24372 for the stuff I'm working on. I'm downgrading this from 
blocker since I'm not sure this is the correct approach, but feel free to keep 
it open if you want to do it.

I think those jenkins jobs are still used to publish "nightly" builds, if 
anyone actually uses those...

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these to enable changes like SPARK-22054 .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24372) Create script for preparing RCs

2018-05-23 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-24372:
--

 Summary: Create script for preparing RCs
 Key: SPARK-24372
 URL: https://issues.apache.org/jira/browse/SPARK-24372
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


Currently, when preparing RCs, the RM has to invoke many scripts manually, make 
sure that this is being done in the correct environment, and set all the correct 
environment variables, which differ from one script to another.

It would be much easier for RMs if all of that were automated as much as possible.

I'm working on something like this as part of releasing 2.3.1, and plan to send 
my scripts for review after the release is done (i.e. after I make sure the 
scripts are working properly).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24371) Added isinSet in DataFrame API for Scala and Java.

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24371:


Assignee: Apache Spark  (was: DB Tsai)

> Added isinSet in DataFrame API for Scala and Java.
> --
>
> Key: SPARK-24371
> URL: https://issues.apache.org/jira/browse/SPARK-24371
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> Implemented *{{isinSet}}* in DataFrame API for both Scala and Java, so users 
> can do
> {code}
>  val profileDF = Seq(
>  Some(1), Some(2), Some(3), Some(4),
>  Some(5), Some(6), Some(7), None
>  ).toDF("profileID")
> val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3")
> val result = profileDF.withColumn("isValid", $"profileID".isinSet(validUsers))
> result.show(10)
>  """
> +---------+-------+
> |profileID|isValid|
> +---------+-------+
> |        1|  false|
> |        2|  false|
> |        3|   true|
> |        4|  false|
> |        5|  false|
> |        6|   true|
> |        7|   true|
> |     null|   null|
> +---------+-------+
>  """.stripMargin
> {code}
> Two new rules in the logical plan optimizers are added.
> # When there is only one element in the *{{Set}}*, the physical plan will be 
> optimized to *{{EqualTo}}*, so predicate pushdown can be used.
> {code}
>  profileDF.filter( $"profileID".isinSet(Set(6))).explain(true)
>  """
> |== Physical Plan ==|
> |*(1) Project [profileID#0|#0]|
> |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))|
> |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,|
> |PartitionFilters: [],|
> |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],|
> |ReadSchema: struct
>  """.stripMargin
> {code}
> # When the *{{Set}}* is empty, and the input is nullable, the logical plan 
> will be simplified to
> {code}
>  profileDF.filter( $"profileID".isinSet(Set())).explain(true)
>  """
> |== Optimized Logical Plan ==|
> |Filter if (isnull(profileID#0)) null else false|
> |+- Relation[profileID#0|#0] parquet
>  """.stripMargin
> {code}
>  
> TODO:
>  # For multiple conditions with numbers less than certain thresholds, we 
> should still allow predicate pushdown.
>  # Optimize the `In` using tableswitch or lookupswitch when the numbers of 
> the categories are low, and they are `Int`, `Long`.
>  # The default immutable hash trees set is slow for query, and we should do 
> benchmark for using different set implementation for faster query.
>  # `filter(if (condition) null else false)` can be optimized to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24371) Added isinSet in DataFrame API for Scala and Java.

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24371:


Assignee: DB Tsai  (was: Apache Spark)

> Added isinSet in DataFrame API for Scala and Java.
> --
>
> Key: SPARK-24371
> URL: https://issues.apache.org/jira/browse/SPARK-24371
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Implemented *{{isinSet}}* in DataFrame API for both Scala and Java, so users 
> can do
> {code}
>  val profileDF = Seq(
>  Some(1), Some(2), Some(3), Some(4),
>  Some(5), Some(6), Some(7), None
>  ).toDF("profileID")
> val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3")
> val result = profileDF.withColumn("isValid", $"profileID".isinSet(validUsers))
> result.show(10)
>  """
> +---------+-------+
> |profileID|isValid|
> +---------+-------+
> |        1|  false|
> |        2|  false|
> |        3|   true|
> |        4|  false|
> |        5|  false|
> |        6|   true|
> |        7|   true|
> |     null|   null|
> +---------+-------+
>  """.stripMargin
> {code}
> Two new rules in the logical plan optimizers are added.
> # When there is only one element in the *{{Set}}*, the physical plan will be 
> optimized to *{{EqualTo}}*, so predicate pushdown can be used.
> {code}
>  profileDF.filter( $"profileID".isinSet(Set(6))).explain(true)
>  """
> |== Physical Plan ==|
> |*(1) Project [profileID#0|#0]|
> |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))|
> |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,|
> |PartitionFilters: [],|
> |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],|
> |ReadSchema: struct
>  """.stripMargin
> {code}
> # When the *{{Set}}* is empty, and the input is nullable, the logical plan 
> will be simplified to
> {code}
>  profileDF.filter( $"profileID".isinSet(Set())).explain(true)
>  """
> |== Optimized Logical Plan ==|
> |Filter if (isnull(profileID#0)) null else false|
> |+- Relation[profileID#0|#0] parquet
>  """.stripMargin
> {code}
>  
> TODO:
>  # For multiple conditions with numbers less than certain thresholds, we 
> should still allow predicate pushdown.
>  # Optimize the `In` using tableswitch or lookupswitch when the numbers of 
> the categories are low, and they are `Int`, `Long`.
>  # The default immutable hash trees set is slow for query, and we should do 
> benchmark for using different set implementation for faster query.
>  # `filter(if (condition) null else false)` can be optimized to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24371) Added isinSet in DataFrame API for Scala and Java.

2018-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488012#comment-16488012
 ] 

Apache Spark commented on SPARK-24371:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21416

> Added isinSet in DataFrame API for Scala and Java.
> --
>
> Key: SPARK-24371
> URL: https://issues.apache.org/jira/browse/SPARK-24371
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Implemented *{{isinSet}}* in DataFrame API for both Scala and Java, so users 
> can do
> {code}
>  val profileDF = Seq(
>  Some(1), Some(2), Some(3), Some(4),
>  Some(5), Some(6), Some(7), None
>  ).toDF("profileID")
> val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3")
> val result = profileDF.withColumn("isValid", $"profileID".isinSet(validUsers))
> result.show(10)
>  """
> +---------+-------+
> |profileID|isValid|
> +---------+-------+
> |        1|  false|
> |        2|  false|
> |        3|   true|
> |        4|  false|
> |        5|  false|
> |        6|   true|
> |        7|   true|
> |     null|   null|
> +---------+-------+
>  """.stripMargin
> {code}
> Two new rules in the logical plan optimizers are added.
> # When there is only one element in the *{{Set}}*, the physical plan will be 
> optimized to *{{EqualTo}}*, so predicate pushdown can be used.
> {code}
>  profileDF.filter( $"profileID".isinSet(Set(6))).explain(true)
>  """
> |== Physical Plan ==|
> |*(1) Project [profileID#0|#0]|
> |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))|
> |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,|
> |PartitionFilters: [],|
> |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],|
> |ReadSchema: struct
>  """.stripMargin
> {code}
> # When the *{{Set}}* is empty, and the input is nullable, the logical plan 
> will be simplified to
> {code}
>  profileDF.filter( $"profileID".isinSet(Set())).explain(true)
>  """
> |== Optimized Logical Plan ==|
> |Filter if (isnull(profileID#0)) null else false|
> |+- Relation[profileID#0|#0] parquet
>  """.stripMargin
> {code}
>  
> TODO:
>  # For multiple conditions with numbers less than certain thresholds, we 
> should still allow predicate pushdown.
>  # Optimize the `In` using tableswitch or lookupswitch when the numbers of 
> the categories are low, and they are `Int`, `Long`.
>  # The default immutable hash trees set is slow for query, and we should do 
> benchmark for using different set implementation for faster query.
>  # `filter(if (condition) null else false)` can be optimized to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24371) Added isinSet in DataFrame API for Scala and Java.

2018-05-23 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24371:
---

 Summary: Added isinSet in DataFrame API for Scala and Java.
 Key: SPARK-24371
 URL: https://issues.apache.org/jira/browse/SPARK-24371
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: DB Tsai
Assignee: DB Tsai
 Fix For: 2.4.0


Implemented *{{isinSet}}* in the DataFrame API for both Scala and Java, so users 
can do:
{code}
val profileDF = Seq(
  Some(1), Some(2), Some(3), Some(4),
  Some(5), Some(6), Some(7), None
).toDF("profileID")

val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3")

val result = profileDF.withColumn("isValid", $"profileID".isinSet(validUsers))

result.show(10)
"""
+---------+-------+
|profileID|isValid|
+---------+-------+
|        1|  false|
|        2|  false|
|        3|   true|
|        4|  false|
|        5|  false|
|        6|   true|
|        7|   true|
|     null|   null|
+---------+-------+
""".stripMargin
{code}
Two new rules are added to the logical plan optimizer.

# When there is only one element in the *{{Set}}*, the physical plan will be 
optimized to *{{EqualTo}}*, so predicate pushdown can be used.
{code}
profileDF.filter($"profileID".isinSet(Set(6))).explain(true)
"""
== Physical Plan ==
*(1) Project [profileID#0]
+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
   +- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,
      PartitionFilters: [],
      PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
      ReadSchema: struct<profileID:int>
""".stripMargin
{code}

# When the *{{Set}}* is empty and the input is nullable, the logical plan will 
be simplified to
{code}
profileDF.filter($"profileID".isinSet(Set())).explain(true)
"""
== Optimized Logical Plan ==
Filter if (isnull(profileID#0)) null else false
+- Relation[profileID#0] parquet
""".stripMargin
{code}

TODO:
 # For multiple conditions with a number of elements below a certain threshold, 
we should still allow predicate pushdown.
 # Optimize the `In` using tableswitch or lookupswitch when the number of 
categories is low and they are `Int` or `Long`.
 # The default immutable hash trie set is slow to query, and we should benchmark 
different set implementations for faster lookups.
 # `filter(if (condition) null else false)` can be optimized to false (a sketch 
of such a rule follows below).
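
A hedged sketch of what the rule in the last TODO item might look like in Catalyst 
(editor's illustration only; the object name is made up and this is not part of the 
actual PR):
{code}
import org.apache.spark.sql.catalyst.expressions.{If, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule name; illustrates TODO item 4 only.
object SimplifyNullAwareFalseFilter extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // `if (cond) null else false` never evaluates to true, and Filter drops
    // both null and false rows, so the whole predicate collapses to false.
    case Filter(If(_, Literal(null, _), Literal(false, _)), child) =>
      Filter(Literal(false), child)
  }
}
{code}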



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24322) Upgrade Apache ORC to 1.4.4

2018-05-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24322:

Labels: correctness  (was: )

> Upgrade Apache ORC to 1.4.4
> ---
>
> Key: SPARK-24322
> URL: https://issues.apache.org/jira/browse/SPARK-24322
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: correctness
>
> ORC 1.4.4 (released on May 14th) includes nine fixes. This issue aims to 
> update Spark to use it.
> https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4
> For example, ORC-306 fixes the timestamp issue.
> {code}
> scala> spark.version
> res0: String = 2.3.0
> scala> spark.sql("set spark.sql.orc.impl=native")
> scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 
> 12:34:56.000789")).toDF().write.orc("/tmp/orc")
> scala> spark.read.orc("/tmp/orc").show(false)
> +--------------------------+
> |value                     |
> +--------------------------+
> |1900-05-05 12:34:55.000789|
> +--------------------------+
> {code}
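
For illustration only, the upgrade amounts to pointing the build at the 1.4.4 
artifacts. The sbt-style coordinates below are an editor's sketch; Spark's real 
build manages the ORC version in its Maven poms.
{code}
// Editor's illustration of the artifacts involved, not Spark's actual build change.
libraryDependencies ++= Seq(
  "org.apache.orc" % "orc-core"      % "1.4.4",
  "org.apache.orc" % "orc-mapreduce" % "1.4.4"
)
{code}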



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24294) Throw SparkException when OOM in BroadcastExchangeExec

2018-05-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24294.
-
   Resolution: Fixed
 Assignee: jin xing
Fix Version/s: 2.4.0

> Throw SparkException when OOM in BroadcastExchangeExec
> --
>
> Key: SPARK-24294
> URL: https://issues.apache.org/jira/browse/SPARK-24294
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: scala-2.11.8
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
> Fix For: 2.4.0
>
>
> When an OutOfMemoryError is thrown from BroadcastExchangeExec, 
> scala.concurrent.Future will hit Scala bug 
> https://github.com/scala/bug/issues/9554 and hang until the future times out.
> We could wrap the OOM inside a SparkException to resolve this issue.
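
A minimal Scala sketch of the idea (editor's illustration, not the actual 
BroadcastExchangeExec change): rethrowing the fatal error as a SparkException 
lets the Future complete with a failure instead of hanging.
{code}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.SparkException

// buildBroadcast is a stand-in for the work BroadcastExchangeExec performs.
def relationFuture(buildBroadcast: () => AnyRef): Future[AnyRef] = Future {
  try {
    buildBroadcast()
  } catch {
    // An uncaught OutOfMemoryError is fatal, so the Future's promise is never
    // completed (https://github.com/scala/bug/issues/9554); wrapping it in a
    // non-fatal SparkException makes the failure propagate normally.
    case oom: OutOfMemoryError =>
      throw new SparkException("Not enough memory to build the broadcast relation", oom)
  }
}
{code}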



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24206) Improve DataSource benchmark code for read and pushdown

2018-05-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24206.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.4.0

> Improve DataSource benchmark code for read and pushdown
> ---
>
> Key: SPARK-24206
> URL: https://issues.apache.org/jira/browse/SPARK-24206
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.4.0
>
>
> I improved the DataSource code for reads and pushdown as part of the Parquet 
> v1.10.0 upgrade: [https://github.com/apache/spark/pull/21070]
> Based on that code, we need to brush up the benchmark code and results on 
> master.
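
As a rough illustration of the kind of read/pushdown measurement being discussed 
(editor's sketch, not the benchmark suite from the PR; the Parquet path and filter 
are made up, and an existing SparkSession `spark` is assumed):
{code}
// Time a block and print the elapsed milliseconds.
def time[T](label: String)(f: => T): T = {
  val start = System.nanoTime()
  val result = f
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

time("full scan + aggregate") {
  spark.read.parquet("/tmp/t").agg("id" -> "sum").collect()
}
time("filter pushdown") {
  spark.read.parquet("/tmp/t").filter("id = 0").collect()
}
{code}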



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24244) Parse only required columns of CSV file

2018-05-23 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk reopened SPARK-24244:


The previous PR was reverted due to the flaky UnivocityParserSuite.

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> The uniVocity parser allows specifying only the required column names or 
> indexes for parsing, for example:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in the other columns.
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> We need to modify *UnivocityParser* to extract only the needed columns from 
> requiredSchema.
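
A hedged Scala sketch of how the selection could be wired up from Spark's side 
(the helper below and its parameters are illustrative, not the actual 
UnivocityParser change):
{code}
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
import org.apache.spark.sql.types.StructType

// Build a parser that materializes only the columns in requiredSchema,
// letting uniVocity skip the values of every other column.
def parserFor(dataSchema: StructType, requiredSchema: StructType): CsvParser = {
  val settings = new CsvParserSettings()
  val indexes = requiredSchema.fieldNames.map(dataSchema.fieldIndex)
  settings.selectIndexes(indexes.map(Integer.valueOf): _*)
  new CsvParser(settings)
}
{code}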



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite

2018-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487923#comment-16487923
 ] 

Apache Spark commented on SPARK-24368:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21415

> Flaky tests: 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
> 
>
> Key: SPARK-24368
> URL: https://issues.apache.org/jira/browse/SPARK-24368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed 
> very often.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/
> {code}
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** 
> ABORTED *** (1 millisecond)
> [info]   java.lang.IllegalStateException: LiveListenerBus is stopped.
> [info]   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
> [info]   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
> [info]   at 
> org.apache.spark.sql.internal.SharedState.(SharedState.scala:93)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at scala.Option.getOrElse(Option.scala:121)
> [info]   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
> [info]   at 
> org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
> [info]   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
> [info]   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
> [info]   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at scala.Option.map(Option.scala:146)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
> [info]   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:84)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:40)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(UnivocityParserSuite.scala:30)
> [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> [info]   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> [info]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> [info]   at java.lang.Class.newInstance(Class.java:442)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24244) Parse only required columns of CSV file

2018-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487922#comment-16487922
 ] 

Apache Spark commented on SPARK-24244:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21415

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> uniVocity parser allows to specify only required column names or indexes for 
> parsing like:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> Need to modify *UnivocityParser* to extract only needed columns from 
> requiredSchema



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-05-23 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487901#comment-16487901
 ] 

Imran Rashid commented on SPARK-23206:
--

[~felixcheung] can you give an example of the network & IO stats that you think 
would go into this metric collection framework, rather than just being task 
metrics? I can't think of a good example / use case. While I don't want this to 
block on including those metrics, I'd like to at least have that in mind while 
we're designing this part.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and the 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data 
> across the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
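
For a concrete sense of the "JVM used memory" metric described above, here is an 
editor's hedged sketch of how such a value can be sampled (illustration only, not 
the proposal's implementation):
{code}
import java.lang.management.ManagementFactory

// Amount of heap currently used by this JVM (executor or driver), in bytes.
def jvmUsedHeapBytes(): Long =
  ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed

// Execution and storage memory would come from Spark's MemoryManager
// (e.g. executionMemoryUsed / storageMemoryUsed) and be attached to the
// executor heartbeat, as the design doc describes.
{code}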



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24370) spark checkpoint creates many 0 byte empty files(partitions) in checkpoint directory

2018-05-23 Thread Jami Malikzade (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jami Malikzade updated SPARK-24370:
---
Attachment: partitions.PNG

> spark checkpoint creates many 0 byte empty files(partitions)  in checkpoint 
> directory
> -
>
> Key: SPARK-24370
> URL: https://issues.apache.org/jira/browse/SPARK-24370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.1
>Reporter: Jami Malikzade
>Priority: Critical
> Attachments: partitions.PNG
>
>
> We are currently facing an issue where calling checkpoint on a dataframe 
> creates partitions in the checkpoint dir, but some of them are empty, so we 
> get exceptions when reading the dataframe back.
> Do you have any idea how to avoid it?
> It creates 200 partitions; some are empty. I used repartition(1) before 
> checkpoint, but that is not a good workaround. Is there any way to populate 
> all partitions with data, or to avoid empty files?
> Pasted snapshot.
> !image-2018-05-23-21-10-43-673.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24370) spark checkpoint creates many 0 byte empty files(partitions) in checkpoint directory

2018-05-23 Thread Jami Malikzade (JIRA)
Jami Malikzade created SPARK-24370:
--

 Summary: spark checkpoint creates many 0 byte empty 
files(partitions)  in checkpoint directory
 Key: SPARK-24370
 URL: https://issues.apache.org/jira/browse/SPARK-24370
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.1.1
Reporter: Jami Malikzade


We are currently facing an issue where calling checkpoint on a dataframe 
creates partitions in the checkpoint dir, but some of them are empty, so we get 
exceptions when reading the dataframe back.

Do you have any idea how to avoid it?

It creates 200 partitions; some are empty. I used repartition(1) before 
checkpoint, but that is not a good workaround. Is there any way to populate all 
partitions with data, or to avoid empty files?

Pasted snapshot.

!image-2018-05-23-21-10-43-673.png!
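
A hedged workaround sketch (editor's illustration, not an official fix), based on 
the repartition(1) idea above: lowering the shuffle-partition count, or coalescing 
before the checkpoint, keeps the checkpoint directory from filling up with empty 
part files. `df` stands for the dataframe being checkpointed.
{code}
// 200 is the default of spark.sql.shuffle.partitions, which is where the
// 200 (partly empty) checkpoint files come from.
spark.conf.set("spark.sql.shuffle.partitions", "8")

// Alternatively, coalesce to a small number of non-empty partitions instead
// of the single-partition bottleneck of repartition(1).
val checkpointed = df.coalesce(8).checkpoint()
{code}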



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24369) A bug when having multiple distinct aggregations

2018-05-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24369:

Summary: A bug when having multiple distinct aggregations  (was: A bug of 
multiple distinct aggregations)

> A bug when having multiple distinct aggregations
> 
>
> Key: SPARK-24369
> URL: https://issues.apache.org/jira/browse/SPARK-24369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
> (VALUES
>(1, 1),
>(2, 2),
>(2, 2)
> ) t(x, y)
> {code}
> It returns 
> {code}
> java.lang.RuntimeException
> You hit a query analyzer bug. Please report your query to Spark user mailing 
> list.
> {code}
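
For reference, the same reproduction can be driven from Scala (editor's sketch; 
`spark` is an existing SparkSession, and the query is the one quoted above):
{code}
val df = spark.sql("""
  SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
  (VALUES
     (1, 1),
     (2, 2),
     (2, 2)
  ) t(x, y)
""")
// Currently fails with "You hit a query analyzer bug", as described above.
df.show()
{code}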



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24369) A bug of multiple distinct aggregations

2018-05-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24369:

Summary: A bug of multiple distinct aggregations  (was: A bug for multiple 
distinct aggregations)

> A bug of multiple distinct aggregations
> ---
>
> Key: SPARK-24369
> URL: https://issues.apache.org/jira/browse/SPARK-24369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
> (VALUES
>(1, 1),
>(2, 2),
>(2, 2)
> ) t(x, y)
> {code}
> It returns 
> {code}
> java.lang.RuntimeException
> You hit a query analyzer bug. Please report your query to Spark user mailing 
> list.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24369) A bug for multiple distinct aggregations

2018-05-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24369:

Description: 
{code}
SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
(VALUES
   (1, 1),
   (2, 2),
   (2, 2)
) t(x, y)
{code}

It returns 

{code}
java.lang.RuntimeException
You hit a query analyzer bug. Please report your query to Spark user mailing 
list.
{code}

  was:
SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
(VALUES
   (1, 1),
   (2, 2),
   (2, 2)
) t(x, y)


> A bug for multiple distinct aggregations
> 
>
> Key: SPARK-24369
> URL: https://issues.apache.org/jira/browse/SPARK-24369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
> (VALUES
>(1, 1),
>(2, 2),
>(2, 2)
> ) t(x, y)
> {code}
> It returns 
> {code}
> java.lang.RuntimeException
> You hit a query analyzer bug. Please report your query to Spark user mailing 
> list.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24369) A bug for multiple distinct aggregations

2018-05-23 Thread Xiao Li (JIRA)
Xiao Li created SPARK-24369:
---

 Summary: A bug for multiple distinct aggregations
 Key: SPARK-24369
 URL: https://issues.apache.org/jira/browse/SPARK-24369
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
(VALUES
   (1, 1),
   (2, 2),
   (2, 2)
) t(x, y)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-23 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487815#comment-16487815
 ] 

Hossein Falaki commented on SPARK-24359:


Thanks [~shivaram] and [~zero323]. It seems CRAN release pain has left some 
scar tissue. We can host this new package in a separate repo and maintain it 
for a few release cycles to evaluate the CRAN release overhead. If the overhead 
is not too high, we can contribute it back to the main Spark repository. 
Alternatively, we can remove the requirement for co-releasing with Apache Spark 
and only release when there are API changes in the new package.

As for the duplicate API, I see the issue as well. I think there is room for 
both styles (formula-based for simple use cases and pipeline-based for more 
complex programs). Based on feedback from the community we can decide if 
deprecating the old API makes sense down the road. I have received many 
requests from SparkR users for the ability to build pipelines (the same way 
Python and Scala support it).

As for whether this new package will introduce Scala API changes, in my current 
prototype they are very minimal (and can be avoided). Almost all new Scala code 
is for the utility that generates R source code. The idea is that if a patch 
adds new API to MLlib, the contributor can simply execute a command-line tool 
and check in R wrappers for his/her new API. The goal of this work is to reduce 
the maintenance cost of the R API in Spark.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize the work needed to expose future MLlib API, but sparklyr's API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural 

[jira] [Assigned] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24368:


Assignee: (was: Apache Spark)

> Flaky tests: 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
> 
>
> Key: SPARK-24368
> URL: https://issues.apache.org/jira/browse/SPARK-24368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed 
> very often.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/
> {code}
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** 
> ABORTED *** (1 millisecond)
> [info]   java.lang.IllegalStateException: LiveListenerBus is stopped.
> [info]   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
> [info]   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
> [info]   at 
> org.apache.spark.sql.internal.SharedState.(SharedState.scala:93)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at scala.Option.getOrElse(Option.scala:121)
> [info]   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
> [info]   at 
> org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
> [info]   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
> [info]   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
> [info]   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at scala.Option.map(Option.scala:146)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
> [info]   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:84)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:40)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(UnivocityParserSuite.scala:30)
> [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> [info]   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> [info]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> [info]   at java.lang.Class.newInstance(Class.java:442)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite

2018-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487755#comment-16487755
 ] 

Apache Spark commented on SPARK-24368:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21414

> Flaky tests: 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
> 
>
> Key: SPARK-24368
> URL: https://issues.apache.org/jira/browse/SPARK-24368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed 
> very often.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/
> {code}
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** 
> ABORTED *** (1 millisecond)
> [info]   java.lang.IllegalStateException: LiveListenerBus is stopped.
> [info]   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
> [info]   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
> [info]   at 
> org.apache.spark.sql.internal.SharedState.(SharedState.scala:93)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at scala.Option.getOrElse(Option.scala:121)
> [info]   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
> [info]   at 
> org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
> [info]   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
> [info]   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
> [info]   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at scala.Option.map(Option.scala:146)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
> [info]   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:84)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:40)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(UnivocityParserSuite.scala:30)
> [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> [info]   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> [info]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> [info]   at java.lang.Class.newInstance(Class.java:442)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24368:


Assignee: Apache Spark

> Flaky tests: 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
> 
>
> Key: SPARK-24368
> URL: https://issues.apache.org/jira/browse/SPARK-24368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed 
> very often.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/
> {code}
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** 
> ABORTED *** (1 millisecond)
> [info]   java.lang.IllegalStateException: LiveListenerBus is stopped.
> [info]   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
> [info]   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
> [info]   at 
> org.apache.spark.sql.internal.SharedState.(SharedState.scala:93)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at scala.Option.getOrElse(Option.scala:121)
> [info]   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
> [info]   at 
> org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
> [info]   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
> [info]   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
> [info]   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at scala.Option.map(Option.scala:146)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
> [info]   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:84)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:40)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(UnivocityParserSuite.scala:30)
> [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> [info]   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> [info]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> [info]   at java.lang.Class.newInstance(Class.java:442)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487754#comment-16487754
 ] 

Shivaram Venkataraman commented on SPARK-24359:
---

I'd just like to echo the point on release and testing strategy raised by 
[~felixcheung] 
 * For a new CRAN package, tying it to the Spark release cycle can be especially 
challenging, as it takes a bunch of iterations to get things right.
 * This also leads to the question of how the SparkML package APIs are going to 
depend on Spark versions. Are we only going to have code that depends on older 
Spark releases, or are we going to have cases where we introduce the Java/Scala-side 
code at the same time as the R API?
 * One more idea could be to have a new repo in Apache that has its own release 
cycle (like the spark-website repo)

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize the work needed to expose future MLlib API, but sparklyr's API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> constructors are dot separated (e.g., spark.logistic.regression()) and all 
> setters and getters are snake case (e.g., set_max_iter()). If a constructor 
> gets arguments, they will be named arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
> When calls need to be chained, 

[jira] [Assigned] (SPARK-23161) Add missing APIs to Python GBTClassifier

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23161:


Assignee: Apache Spark

> Add missing APIs to Python GBTClassifier
> 
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> GBTClassifier is missing {{featureSubsetStrategy}}. This should be moved to 
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from 
> {{JavaClassificationModel}} instead of the prediction model, which will give 
> it this param.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23161) Add missing APIs to Python GBTClassifier

2018-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487739#comment-16487739
 ] 

Apache Spark commented on SPARK-23161:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21413

> Add missing APIs to Python GBTClassifier
> 
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Minor
>  Labels: starter
>
> GBTClassifier is missing {{featureSubsetStrategy}}. This should be moved to 
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from 
> {{JavaClassificationModel}} instead of the prediction model, which will give 
> it this param.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23161) Add missing APIs to Python GBTClassifier

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23161:


Assignee: (was: Apache Spark)

> Add missing APIs to Python GBTClassifier
> 
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Minor
>  Labels: starter
>
> GBTClassifier is missing {{featureSubsetStrategy}}. This should be moved to 
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from 
> {{JavaClassificationModel}} instead of the prediction model, which will give 
> it this param.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487707#comment-16487707
 ] 

Marcelo Vanzin commented on SPARK-21945:


{noformat}
$ cat test.py 
from pyspark import SparkContext, SparkConf
import funcs

sc = SparkContext()
funcs.test()
sc.stop()

$ cat lib/funcs.py 

def test():
  print "This is a test."

[systest@vanzin-c5-1 ~]$ spark-submit --master yarn --py-files lib/funcs.py 
test.py 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.378171/lib/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Traceback (most recent call last):
  File "/home/systest/test.py", line 2, in <module>
import funcs
ImportError: No module named funcs

$ pyspark --master yarn --py-files lib/funcs.py
[blah blah blah]
Using Python version 2.7.5 (default, Sep 15 2016 22:37:39)
SparkSession available as 'spark'.
>>> import funcs
>>> 
{noformat}

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip, but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around this by exporting PYTHONPATH.
> Looking at SparkSubmitCommandBuilder.buildPySparkShellCommand, I don't see 
> this supported at all. If that is the case, perhaps this should be moved to 
> an improvement.
> Note that it works via spark-submit in both client and cluster mode to run a 
> python script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487707#comment-16487707
 ] 

Marcelo Vanzin edited comment on SPARK-21945 at 5/23/18 5:27 PM:
-

{noformat}
$ cat test.py 
from pyspark import SparkContext, SparkConf
import funcs

sc = SparkContext()
funcs.test()
sc.stop()

$ cat lib/funcs.py 

def test():
  print "This is a test."

$ spark-submit --master yarn --py-files lib/funcs.py test.py 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.378171/lib/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Traceback (most recent call last):
  File "/home/systest/test.py", line 2, in <module>
import funcs
ImportError: No module named funcs

$ pyspark --master yarn --py-files lib/funcs.py
[blah blah blah]
Using Python version 2.7.5 (default, Sep 15 2016 22:37:39)
SparkSession available as 'spark'.
>>> import funcs
>>> 
{noformat}


was (Author: vanzin):
{noformat}
$ cat test.py 
from pyspark import SparkContext, SparkConf
import funcs

sc = SparkContext()
funcs.test()
sc.stop()

$ cat lib/funcs.py 

def test():
  print "This is a test."

[systest@vanzin-c5-1 ~]$ spark-submit --master yarn --py-files lib/funcs.py 
test.py 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.378171/lib/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Traceback (most recent call last):
  File "/home/systest/test.py", line 2, in <module>
import funcs
ImportError: No module named funcs

$ pyspark --master yarn --py-files lib/funcs.py
[blah blah blah]
Using Python version 2.7.5 (default, Sep 15 2016 22:37:39)
SparkSession available as 'spark'.
>>> import funcs
>>> 
{noformat}

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip, but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around this by exporting PYTHONPATH.
> Looking at SparkSubmitCommandBuilder.buildPySparkShellCommand, I don't see 
> this supported at all. If that is the case, perhaps this should be moved to 
> an improvement.
> Note that it works via spark-submit in both client and cluster mode to run a 
> python script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22055) Port release scripts

2018-05-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487647#comment-16487647
 ] 

Marcelo Vanzin commented on SPARK-22055:


No, my script just prepares the release (asks questions and runs the existing 
scripts). You'd need to write a second script to inject stuff into the docker 
image and then run my script.

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these to enable changes like SPARK-22054 .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite

2018-05-23 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487516#comment-16487516
 ] 

Xiao Li commented on SPARK-24368:
-

cc [~maxgekk]


> Flaky tests: 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
> 
>
> Key: SPARK-24368
> URL: https://issues.apache.org/jira/browse/SPARK-24368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed 
> very often.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/
> {code}
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** 
> ABORTED *** (1 millisecond)
> [info]   java.lang.IllegalStateException: LiveListenerBus is stopped.
> [info]   at 
> org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
> [info]   at 
> org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
> [info]   at 
> org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
> [info]   at scala.Option.getOrElse(Option.scala:121)
> [info]   at 
> org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
> [info]   at 
> org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
> [info]   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
> [info]   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
> [info]   at 
> org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
> [info]   at scala.Option.map(Option.scala:146)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
> [info]   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:84)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:40)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.<init>(UnivocityParserSuite.scala:30)
> [info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> [info]   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> [info]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> [info]   at java.lang.Class.newInstance(Class.java:442)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite

2018-05-23 Thread Xiao Li (JIRA)
Xiao Li created SPARK-24368:
---

 Summary: Flaky tests: 
org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
 Key: SPARK-24368
 URL: https://issues.apache.org/jira/browse/SPARK-24368
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.4.0
Reporter: Xiao Li


org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed very 
often.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/



{code}
org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** ABORTED 
*** (1 millisecond)
[info]   java.lang.IllegalStateException: LiveListenerBus is stopped.
[info]   at 
org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
[info]   at 
org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
[info]   at 
org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
[info]   at 
org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
[info]   at 
org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
[info]   at scala.Option.getOrElse(Option.scala:121)
[info]   at 
org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
[info]   at 
org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
[info]   at 
org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
[info]   at 
org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
[info]   at 
org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
[info]   at 
org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
[info]   at 
org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
[info]   at scala.Option.map(Option.scala:146)
[info]   at 
org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
[info]   at 
org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
[info]   at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125)
[info]   at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:84)
[info]   at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:40)
[info]   at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.<init>(UnivocityParserSuite.scala:30)
[info]   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
Method)
[info]   at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[info]   at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[info]   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
[info]   at java.lang.Class.newInstance(Class.java:442)
[info]   at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info]   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info]   at java.lang.Thread.run(Thread.java:745)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18805) InternalMapWithStateDStream make java.lang.StackOverflowError

2018-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487391#comment-16487391
 ] 

Apache Spark commented on SPARK-18805:
--

User 'chermenin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21412

> InternalMapWithStateDStream make java.lang.StackOverflowError 
> --
>
> Key: SPARK-18805
> URL: https://issues.apache.org/jira/browse/SPARK-18805
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.0.2
> Environment: mesos
>Reporter: etienne
>Priority: Major
>
> When an InternalMapWithStateDStream is loaded from a checkpoint, and 
> isValidTime returns true while there is no generatedRDD for the given time, 
> the stream enters an infinite loop:
> 1) compute is called on InternalMapWithStateDStream
> 2) InternalMapWithStateDStream tries to generate the previous RDD
> 3) the stream looks in generatedRDDs to check whether the RDD has already 
> been generated for the given time
> 4) it does not find the RDD, so it checks whether the time is valid
> 5) if the time is valid, compute is called on InternalMapWithStateDStream again
> 6) the cycle restarts from 1)
> Here is the exception that illustrates this error:
> {code}
> Exception in thread "streaming-start" java.lang.StackOverflowError
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> {code}
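To make the cycle above concrete, here is a tiny self-contained Scala sketch of 
the mutual recursion; it is not the actual Spark code, and all names are 
simplified for illustration.

{code:scala}
// Illustrative only -- mimics steps 1)-6) above: compute needs the previous
// RDD, getOrCompute does not find it in the cache, the time is considered
// valid, so compute is called again, and nothing ever breaks the cycle.
import scala.collection.mutable

class CheckpointedStream {
  // After restoring from a checkpoint this cache is empty and never refilled.
  private val generatedRDDs = mutable.Map.empty[Long, String]

  private def isValidTime(time: Long): Boolean = true  // always "valid" here

  def getOrCompute(time: Long): Option[String] =
    generatedRDDs.get(time).orElse {        // 3) RDD not found for this time
      if (isValidTime(time)) compute(time)  // 4)-5) valid, so compute again
      else None
    }

  def compute(time: Long): Option[String] =
    getOrCompute(time - 1).map(prev => s"$prev -> rdd@$time")  // 1)-2) needs the previous RDD
}

// new CheckpointedStream().getOrCompute(100L)  // recurses until StackOverflowError
{code}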



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18805) InternalMapWithStateDStream make java.lang.StackOverflowError

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18805:


Assignee: Apache Spark

> InternalMapWithStateDStream make java.lang.StackOverflowError 
> --
>
> Key: SPARK-18805
> URL: https://issues.apache.org/jira/browse/SPARK-18805
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.0.2
> Environment: mesos
>Reporter: etienne
>Assignee: Apache Spark
>Priority: Major
>
> When an InternalMapWithStateDStream is loaded from a checkpoint, and 
> isValidTime returns true while there is no generatedRDD for the given time, 
> the stream enters an infinite loop:
> 1) compute is called on InternalMapWithStateDStream
> 2) InternalMapWithStateDStream tries to generate the previous RDD
> 3) the stream looks in generatedRDDs to check whether the RDD has already 
> been generated for the given time
> 4) it does not find the RDD, so it checks whether the time is valid
> 5) if the time is valid, compute is called on InternalMapWithStateDStream again
> 6) the cycle restarts from 1)
> Here is the exception that illustrates this error:
> {code}
> Exception in thread "streaming-start" java.lang.StackOverflowError
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18805) InternalMapWithStateDStream make java.lang.StackOverflowError

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18805:


Assignee: (was: Apache Spark)

> InternalMapWithStateDStream make java.lang.StackOverflowError 
> --
>
> Key: SPARK-18805
> URL: https://issues.apache.org/jira/browse/SPARK-18805
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.0.2
> Environment: mesos
>Reporter: etienne
>Priority: Major
>
> When an InternalMapWithStateDStream is loaded from a checkpoint, and 
> isValidTime returns true while there is no generatedRDD for the given time, 
> the stream enters an infinite loop:
> 1) compute is called on InternalMapWithStateDStream
> 2) InternalMapWithStateDStream tries to generate the previous RDD
> 3) the stream looks in generatedRDDs to check whether the RDD has already 
> been generated for the given time
> 4) it does not find the RDD, so it checks whether the time is valid
> 5) if the time is valid, compute is called on InternalMapWithStateDStream again
> 6) the cycle restarts from 1)
> Here is the exception that illustrates this error:
> {code}
> Exception in thread "streaming-start" java.lang.StackOverflowError
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23711) Add fallback to interpreted execution logic

2018-05-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23711.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21106
[https://github.com/apache/spark/pull/21106]

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23711) Add fallback to interpreted execution logic

2018-05-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23711:
---

Assignee: Liang-Chi Hsieh

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24313) Collection functions interpreted execution doesn't work with complex types

2018-05-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24313:

Fix Version/s: 2.3.1

> Collection functions interpreted execution doesn't work with complex types
> --
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Critical
>  Labels: correctness
> Fix For: 2.3.1, 2.4.0
>
>
> Several functions working on collections return incorrect results for 
> complex data types in interpreted mode. In particular, we consider the 
> complex data types BINARY and ARRAY. The list of the affected functions is: 
> {{array_contains}}, {{array_position}}, {{element_at}} and {{GetMapValue}}.
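For illustration only, a small Scala sketch of the kind of call that is 
affected: {{array_contains}} over an array of BINARY values. The wrong results 
only surface when Spark falls back to interpreted (non-codegen) expression 
evaluation, which this snippet does not force; it merely shows the affected 
input shape.

{code:scala}
// Sketch: an ARRAY of BINARY values probed with a BINARY needle. Before the
// fix, interpreted evaluation of such expressions could disagree with codegen.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.array_contains

val spark = SparkSession.builder().appName("binary-array-contains").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, Array(Array[Byte](1, 2), Array[Byte](3, 4)))
).toDF("id", "arr")

df.select(array_contains($"arr", Array[Byte](1, 2)).as("contains_12")).show()
// Expected output: true for id = 1
{code}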



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24367) Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY

2018-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487302#comment-16487302
 ] 

Apache Spark commented on SPARK-24367:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21411

> Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
> 
>
> Key: SPARK-24367
> URL: https://issues.apache.org/jira/browse/SPARK-24367
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>
> In the current Parquet version, the conf ENABLE_JOB_SUMMARY is deprecated. 
> When writing to Parquet files, the warning message "WARN 
> org.apache.parquet.hadoop.ParquetOutputFormat: Setting 
> parquet.enable.summary-metadata is deprecated, please use 
> parquet.summary.metadata.level" keeps showing up.
> From 
> [https://github.com/apache/parquet-mr/blame/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L164]
>  we can see that we should use JOB_SUMMARY_LEVEL.
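As a hedged illustration (the config key names are taken from the warning 
quoted above; the level values and the output path are assumptions), switching 
a job from the deprecated boolean flag to the summary-level key could look like 
this:

{code:scala}
// Sketch only: set the newer summary-level key on the Hadoop configuration
// instead of the deprecated boolean flag that triggers the warning above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-summary-level").getOrCreate()

// Deprecated -- produces the ParquetOutputFormat warning:
// spark.sparkContext.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false)

// Preferred -- level values are assumed to be "ALL", "COMMON_ONLY" or "NONE":
spark.sparkContext.hadoopConfiguration.set("parquet.summary.metadata.level", "NONE")

// spark.range(10).write.parquet("/tmp/parquet-summary-level-demo")  // hypothetical path
{code}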



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24367) Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24367:


Assignee: Apache Spark

> Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
> 
>
> Key: SPARK-24367
> URL: https://issues.apache.org/jira/browse/SPARK-24367
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> In the current Parquet version, the conf ENABLE_JOB_SUMMARY is deprecated. 
> When writing to Parquet files, the warning message "WARN 
> org.apache.parquet.hadoop.ParquetOutputFormat: Setting 
> parquet.enable.summary-metadata is deprecated, please use 
> parquet.summary.metadata.level" keeps showing up.
> From 
> [https://github.com/apache/parquet-mr/blame/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L164]
>  we can see that we should use JOB_SUMMARY_LEVEL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24367) Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24367:


Assignee: (was: Apache Spark)

> Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
> 
>
> Key: SPARK-24367
> URL: https://issues.apache.org/jira/browse/SPARK-24367
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>
> In the current Parquet version, the conf ENABLE_JOB_SUMMARY is deprecated. 
> When writing to Parquet files, the warning message "WARN 
> org.apache.parquet.hadoop.ParquetOutputFormat: Setting 
> parquet.enable.summary-metadata is deprecated, please use 
> parquet.summary.metadata.level" keeps showing up.
> From 
> [https://github.com/apache/parquet-mr/blame/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L164]
>  we can see that we should use JOB_SUMMARY_LEVEL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24367) Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY

2018-05-23 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-24367:
--

 Summary: Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag 
ENABLE_JOB_SUMMARY
 Key: SPARK-24367
 URL: https://issues.apache.org/jira/browse/SPARK-24367
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Gengliang Wang


In the current Parquet version, the conf ENABLE_JOB_SUMMARY is deprecated. 

When writing to Parquet files, the warning message "WARN 
org.apache.parquet.hadoop.ParquetOutputFormat: Setting 
parquet.enable.summary-metadata is deprecated, please use 
parquet.summary.metadata.level" keeps showing up.

From 
[https://github.com/apache/parquet-mr/blame/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L164]
 we can see that we should use JOB_SUMMARY_LEVEL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-23 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487209#comment-16487209
 ] 

Maciej Szymkiewicz commented on SPARK-24359:


Just my two cents:
 * As proposed right now, wouldn't it duplicate a significant part of the 
existing ML API, and if it does, does that mean we should deprecate the current 
API? Having two different, yet compatible, APIs sounds like a recipe for 
confusion. Not to mention that duplicated tests can be a deal breaker, 
especially when the CI pipeline is already incredibly heavy.
 * To concur with [~felixcheung], maintaining a CRAN package is a significant 
maintenance burden. Adding another package, tightly bound to the main release 
cycle, might have a huge impact on the overall release process.
 * If the package is to be designed to be mostly independent of the current R 
API, why not create a separate package, not maintained by the Apache Foundation?
 * I am not sure if anything has changed lately, but based on my previous 
experience, there are not enough hands to keep the current API up to date. 
Unless there is enough support from the stakeholders, it might end up as mostly 
unmaintained deadweight.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr's API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> constructors are 

[jira] [Updated] (SPARK-24366) Improve error message for Catalyst type converters

2018-05-23 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24366:
---
Summary: Improve error message for Catalyst type converters  (was: Improve 
error message for type converting)

> Improve error message for Catalyst type converters
> --
>
> Key: SPARK-24366
> URL: https://issues.apache.org/jira/browse/SPARK-24366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Users have no way to drill down to understand which of the hundreds of 
> fields in the millions of records feeding into the job is causing the 
> problem. We should show where in the schema the error is happening.
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in 
> stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: 
> start (of class java.lang.String)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> 
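To make the pain point concrete, a minimal Scala sketch (the schema and field 
names are invented for illustration) of how a value that does not match the 
declared schema reaches CatalystTypeConverters and dies with a bare MatchError 
that never names the offending field:

{code:scala}
// Sketch only: the string "start" sits where the schema expects a struct, so
// conversion fails with scala.MatchError: start (of class java.lang.String),
// and nothing in the message points at the "event" field.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("catalyst-converter-error").getOrCreate()

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("event", StructType(Seq(StructField("ts", LongType))))
))

val badRows = Seq(Row(1, "start"))  // "start" should have been Row(123L)

// Uncommenting the line below fails with the MatchError shown in the trace above.
// spark.createDataFrame(spark.sparkContext.parallelize(badRows), schema).show()
{code}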

[jira] [Updated] (SPARK-24366) Improve error message for type converting

2018-05-23 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24366:
---
Summary: Improve error message for type converting  (was: Improve error 
message for type conversions)

> Improve error message for type converting
> -
>
> Key: SPARK-24366
> URL: https://issues.apache.org/jira/browse/SPARK-24366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Users have no way to drill down to understand which of the hundreds of 
> fields in the millions of records feeding into the job is causing the 
> problem. We should show where in the schema the error is happening.
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in 
> stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: 
> start (of class java.lang.String)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> 

[jira] [Commented] (SPARK-24366) Improve error message for type conversions

2018-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487171#comment-16487171
 ] 

Apache Spark commented on SPARK-24366:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21410

> Improve error message for type conversions
> --
>
> Key: SPARK-24366
> URL: https://issues.apache.org/jira/browse/SPARK-24366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Users have no way to drill down to understand which of the hundreds of 
> fields in the millions of records feeding into the job is causing the 
> problem. We should show where in the schema the error is happening.
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in 
> stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: 
> start (of class java.lang.String)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> 

[jira] [Assigned] (SPARK-24366) Improve error message for type conversions

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24366:


Assignee: (was: Apache Spark)

> Improve error message for type conversions
> --
>
> Key: SPARK-24366
> URL: https://issues.apache.org/jira/browse/SPARK-24366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Users have no way to drill down to understand which of the hundreds of 
> fields in the millions of records feeding into the job is causing the 
> problem. We should show where in the schema the error is happening.
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in 
> stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: 
> start (of class java.lang.String)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 

[jira] [Assigned] (SPARK-24366) Improve error message for type conversions

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24366:


Assignee: Apache Spark

> Improve error message for type conversions
> --
>
> Key: SPARK-24366
> URL: https://issues.apache.org/jira/browse/SPARK-24366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.3.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Users have no way to drill down to understand which of the hundreds of fields 
> in the millions of records feeding into the job is causing the problem. We 
> should show where in the schema the error is happening.
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in 
> stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: 
> start (of class java.lang.String)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> 

[jira] [Created] (SPARK-24366) Improve error message for type conversions

2018-05-23 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24366:
--

 Summary: Improve error message for type conversions
 Key: SPARK-24366
 URL: https://issues.apache.org/jira/browse/SPARK-24366
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0, 1.6.3
Reporter: Maxim Gekk


Users have no way to drill down to understand which of the hundreds of fields in 
the millions of records feeding into the job is causing the problem. We should 
show where in the schema the error is happening. (A user-side diagnostic sketch 
follows the stack trace below.)
{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in 
stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: start 
(of class java.lang.String)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
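
Until Spark's converter reports the offending field itself (the improvement this 
ticket asks for), a user-side stopgap is to walk the declared schema next to one 
failing record and print the path of the first type mismatch. The sketch below is 
an illustration only, with made-up field names and a made-up record; it is not the 
change proposed in this ticket.

{code:python}
# Hypothetical user-side diagnostic, not the proposed Spark change: walk a
# StructType together with one Python record and report the path of the first
# value whose type does not match the declared field type.
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

_PRIMITIVES = {StringType: str, IntegerType: int}

def first_mismatch(value, data_type, path="root"):
    """Return a description of the first value/type mismatch, or None."""
    if isinstance(data_type, StructType):
        for field, item in zip(data_type.fields, value):
            found = first_mismatch(item, field.dataType, path + "." + field.name)
            if found:
                return found
        return None
    if isinstance(data_type, ArrayType):
        for i, item in enumerate(value):
            found = first_mismatch(item, data_type.elementType, path + "[%d]" % i)
            if found:
                return found
        return None
    expected = _PRIMITIVES.get(type(data_type))
    if expected is not None and not isinstance(value, expected):
        return "%s: expected %s, got %r" % (path, data_type.simpleString(), value)
    return None

# Made-up schema and record: 'start' is a string where an int is declared,
# so the mismatch is reported together with its path in the schema.
schema = StructType([
    StructField("name", StringType()),
    StructField("events", ArrayType(StructType([StructField("ts", IntegerType())]))),
])
print(first_mismatch(("camus", [("start",)]), schema))
{code}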

[jira] [Commented] (SPARK-24341) Codegen compile error from predicate subquery

2018-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487120#comment-16487120
 ] 

Apache Spark commented on SPARK-24341:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21403

> Codegen compile error from predicate subquery
> -
>
> Key: SPARK-24341
> URL: https://issues.apache.org/jira/browse/SPARK-24341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> Ran on master:
> {code}
> drop table if exists juleka;
> drop table if exists julekb;
> create table juleka (a integer, b integer);
> create table julekb (na integer, nb integer);
> insert into juleka values (1,1);
> insert into julekb values (1,1);
> select * from juleka where (a, b) not in (select (na, nb) from julekb);
> {code}
> Results in:
> {code}
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: Cannot compare types "int" and 
> "org.apache.spark.sql.catalyst.InternalRow"
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>   at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>   at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:111)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> 
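
The subquery here projects {{(na, nb)}}, which the parser treats as a row 
constructor, i.e. a single struct column, so codegen ends up comparing an 
{{int}} against an {{InternalRow}}. The codegen failure itself still needs a fix, 
but as a hedged workaround sketch (assuming a SparkSession and the juleka/julekb 
tables created by the reproducer above), listing the columns without the extra 
parentheses keeps the NOT IN comparison column-wise:

{code:python}
# Hedged workaround sketch, not a fix for the codegen bug reported above.
# Assumes the juleka/julekb tables from the reproducer already exist.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark_24341_sketch").getOrCreate()

# `(na, nb)` in the subquery is a row constructor (one struct column);
# `na, nb` keeps two columns, so (a, b) NOT IN (...) compares column-wise.
working = "select * from juleka where (a, b) not in (select na, nb from julekb)"
spark.sql(working).show()  # empty result for the (1,1)/(1,1) rows inserted above
{code}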

[jira] [Assigned] (SPARK-24341) Codegen compile error from predicate subquery

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24341:


Assignee: (was: Apache Spark)

> Codegen compile error from predicate subquery
> -
>
> Key: SPARK-24341
> URL: https://issues.apache.org/jira/browse/SPARK-24341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> Ran on master:
> {code}
> drop table if exists juleka;
> drop table if exists julekb;
> create table juleka (a integer, b integer);
> create table julekb (na integer, nb integer);
> insert into juleka values (1,1);
> insert into julekb values (1,1);
> select * from juleka where (a, b) not in (select (na, nb) from julekb);
> {code}
> Results in:
> {code}
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: Cannot compare types "int" and 
> "org.apache.spark.sql.catalyst.InternalRow"
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>   at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>   at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:111)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 

[jira] [Assigned] (SPARK-24341) Codegen compile error from predicate subquery

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24341:


Assignee: Apache Spark

> Codegen compile error from predicate subquery
> -
>
> Key: SPARK-24341
> URL: https://issues.apache.org/jira/browse/SPARK-24341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>Priority: Minor
>
> Ran on master:
> {code}
> drop table if exists juleka;
> drop table if exists julekb;
> create table juleka (a integer, b integer);
> create table julekb (na integer, nb integer);
> insert into juleka values (1,1);
> insert into julekb values (1,1);
> select * from juleka where (a, b) not in (select (na, nb) from julekb);
> {code}
> Results in:
> {code}
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: Cannot compare types "int" and 
> "org.apache.spark.sql.catalyst.InternalRow"
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>   at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>   at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:111)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 

[jira] [Comment Edited] (SPARK-24324) UserDefinedFunction mixes column labels

2018-05-23 Thread Cristian Consonni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487110#comment-16487110
 ] 

Cristian Consonni edited comment on SPARK-24324 at 5/23/18 11:36 AM:
-

[~hyukjin.kwon] said:
> Ah, I meant shorter reproducer should make other guys easier to take a look.

Here, half the number of lines of code, same behavior.

{code:python}
#!/usr/bin/env python3
# coding: utf-8
from datetime import datetime
import findspark
findspark.init()

import pyspark
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

input_data = [('en1', 'a', 150),
  ('en2', 'b', 148),
  ('en3', 'c', 197),
  ('en4', 'd', 145),
  ('en5', 'e', 131),
  ('en6', 'f', 154),
  ('en7', 'g', 142)
  ]

sc = pyspark.SparkContext(appName="udf_example")
sqlctx = pyspark.SQLContext(sc)

schema = StructType([StructField("lang", StringType(), False),
 StructField("page", StringType(), False),
 StructField("views", IntegerType(), False)])

new_schema = StructType([StructField("lang", StringType(), False),
 StructField("page", StringType(), False),
 StructField("foo", StringType(), False)])

rdd = sc.parallelize(input_data)
df = sqlctx.createDataFrame(rdd, schema)
df.show()

@pandas_udf(new_schema, PandasUDFType.GROUPED_MAP)
def myudf(x):
    foo = 'foo'

    # this gives the expected results:
    # return pd.DataFrame({'foo': x.lang, 'lang': x.page, 'page': foo})

    # this mixes columns
    return pd.DataFrame({'lang': x.lang, 'page': x.page, 'foo': foo})

grp_df = df.select(['lang', 'page', 'views']).groupby(['lang','page'])
grp_df = grp_df.apply(myudf).dropDuplicates()
grp_df.show()
{code}

I am also wondering if this is a pyspark issue or a pandas issue.


was (Author: cristiancantoro):
[~hyukjin.kwon] said:
> Ah, I meant shorter reproducer should make other guys easier to take a look.

Here, half the number of lines of code, same behavior.

{code:python}
#!/usr/bin/env python3
# coding: utf-8
from datetime import datetime
import findspark
findspark.init()

import pyspark
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

input_data = [('en1', 'a', 150),
  ('en2', 'b', 148),
  ('en3', 'c', 197),
  ('en4', 'd', 145),
  ('en5', 'e', 131),
  ('en6', 'f', 154),
  ('en7', 'g', 142)
  ]

sc = pyspark.SparkContext(appName="udf_example")
sqlctx = pyspark.SQLContext(sc)

schema = StructType([StructField("lang", StringType(), False),
 StructField("page", StringType(), False),
 StructField("views", IntegerType(), False)])

new_schema = StructType([StructField("lang", StringType(), False),
 StructField("page", StringType(), False),
 StructField("foo", StringType(), False)])

rdd = sc.parallelize(input_data)
df = sqlctx.createDataFrame(rdd, schema)
df.show()

@pandas_udf(new_schema, PandasUDFType.GROUPED_MAP)
def myudf(x):
    foo = 'foo'

    # this gives the expected results:
    # return pd.DataFrame({'foo': x.lang, 'lang': x.page, 'page': foo})

    # this mixes columns
    return pd.DataFrame({'lang': x.lang, 'page': x.page, 'foo': foo})

grp_df = df.select(['lang', 'page', 'views']).groupby(['lang','page'])
grp_df = grp_df.apply(myudf).dropDuplicates()
grp_df.show()
{code}

I am aslo wondering if this is a pyspark issue or a pandas issue.

> UserDefinedFunction mixes column labels
> ---
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> 

[jira] [Commented] (SPARK-24324) UserDefinedFunction mixes column labels

2018-05-23 Thread Cristian Consonni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487110#comment-16487110
 ] 

Cristian Consonni commented on SPARK-24324:
---

[~hyukjin.kwon] said:
> Ah, I meant shorter reproducer should make other guys easier to take a look.

Here, half the number of lines of code, same behavior.

{code:python}
#!/usr/bin/env python3
# coding: utf-8
from datetime import datetime
import findspark
findspark.init()

import pyspark
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

input_data = [('en1', 'a', 150),
  ('en2', 'b', 148),
  ('en3', 'c', 197),
  ('en4', 'd', 145),
  ('en5', 'e', 131),
  ('en6', 'f', 154),
  ('en7', 'g', 142)
  ]

sc = pyspark.SparkContext(appName="udf_example")
sqlctx = pyspark.SQLContext(sc)

schema = StructType([StructField("lang", StringType(), False),
 StructField("page", StringType(), False),
 StructField("views", IntegerType(), False)])

new_schema = StructType([StructField("lang", StringType(), False),
 StructField("page", StringType(), False),
 StructField("foo", StringType(), False)])

rdd = sc.parallelize(input_data)
df = sqlctx.createDataFrame(rdd, schema)
df.show()

@pandas_udf(new_schema, PandasUDFType.GROUPED_MAP)
def myudf(x):
    foo = 'foo'

    # this gives the expected results:
    # return pd.DataFrame({'foo': x.lang, 'lang': x.page, 'page': foo})

    # this mixes columns
    return pd.DataFrame({'lang': x.lang, 'page': x.page, 'foo': foo})

grp_df = df.select(['lang', 'page', 'views']).groupby(['lang','page'])
grp_df = grp_df.apply(myudf).dropDuplicates()
grp_df.show()
{code}

I am also wondering if this is a pyspark issue or a pandas issue.
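
For what it is worth, the mixing is consistent with two positional effects 
stacking: pandas 0.22 sorts the columns alphabetically when a DataFrame is built 
from a plain dict, and the GROUPED_MAP path in Spark 2.3 appears to match the 
returned columns to the declared schema by position rather than by name. A hedged 
workaround sketch, reusing the {{new_schema}} and UDF shape from the reproducer 
above, is to pin the column order explicitly before returning:

{code:python}
# Workaround sketch only: build the frame however is convenient, then reorder
# the columns to the declared schema so positional matching cannot mix labels.
# Reuses the new_schema defined in the reproducer above.
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(new_schema, PandasUDFType.GROUPED_MAP)
def myudf(x):
    foo = 'foo'
    out = pd.DataFrame({'lang': x.lang, 'page': x.page, 'foo': foo})
    return out[[field.name for field in new_schema.fields]]
{code}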

> UserDefinedFunction mixes column labels
> ---
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform the 
> views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
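
The encoding described in the ticket above is easy to pin down with a tiny helper 
(illustration only, not the truncated {{concat_hours}} UDF): hour 0 maps to {{A}}, 
hour 1 to {{B}}, and the view count for that hour is appended, so 5 views at 00:00 
and 7 views at 01:00 give {{A5B7}}.

{code:python}
# Illustration of the encoding described in the ticket, not the user's UDF.
def encode_hours(hour_views):
    """Encode [(hour, views), ...] as a letter per hour followed by the count."""
    return "".join("%s%d" % (chr(ord("A") + hour), views)
                   for hour, views in sorted(hour_views))

assert encode_hours([(0, 5), (1, 7)]) == "A5B7"
{code}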

[jira] [Commented] (SPARK-24365) Add Parquet write benchmark

2018-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487050#comment-16487050
 ] 

Apache Spark commented on SPARK-24365:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21409

> Add Parquet write benchmark
> ---
>
> Key: SPARK-24365
> URL: https://issues.apache.org/jira/browse/SPARK-24365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>
> Add a Parquet write benchmark so that it is easier to measure the writer 
> performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
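
A dedicated benchmark in the suite gives controlled, repeatable numbers; for a 
rough, user-side feel of writer cost one can already time a write from PySpark. 
A minimal sketch, assuming a local SparkSession and a throwaway output directory 
(this is not the benchmark this ticket adds):

{code:python}
# Rough user-side timing sketch only; the ticket is about a proper benchmark
# in Spark's benchmark suite. Row count and output path are arbitrary.
import tempfile
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet_write_timing").getOrCreate()
df = spark.range(10 * 1000 * 1000).selectExpr("id", "id % 100 as bucket")

out = tempfile.mkdtemp(prefix="parquet_write_bench")
start = time.time()
df.write.mode("overwrite").parquet(out)
print("wrote parquet in %.1f s" % (time.time() - start))
{code}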



[jira] [Assigned] (SPARK-24365) Add Parquet write benchmark

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24365:


Assignee: (was: Apache Spark)

> Add Parquet write benchmark
> ---
>
> Key: SPARK-24365
> URL: https://issues.apache.org/jira/browse/SPARK-24365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>
> Add a Parquet write benchmark so that it is easier to measure the writer 
> performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24365) Add Parquet write benchmark

2018-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24365:


Assignee: Apache Spark

> Add Parquet write benchmark
> ---
>
> Key: SPARK-24365
> URL: https://issues.apache.org/jira/browse/SPARK-24365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Add a Parquet write benchmark so that it is easier to measure the writer 
> performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


