[jira] [Updated] (SPARK-24376) compiling spark with scala-2.10 should use the -P parameter instead of -D
[ https://issues.apache.org/jira/browse/SPARK-24376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Wang updated SPARK-24376:
----------------------------
    Description: compiling spark with scala-2.10 should use the -P parameter instead of -D  (was: compiling spark with scala-2.10 should use the -p parameter instead of -d)

> compiling spark with scala-2.10 should use the -P parameter instead of -D
> -------------------------------------------------------------------------
>
>                 Key: SPARK-24376
>                 URL: https://issues.apache.org/jira/browse/SPARK-24376
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 2.2.0
>            Reporter: Yu Wang
>            Priority: Minor
>             Fix For: 2.2.0
>
> compiling spark with scala-2.10 should use the -P parameter instead of -D

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24376) compiling spark with scala-2.10 should use the -P parameter instead of -D
[ https://issues.apache.org/jira/browse/SPARK-24376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu Wang updated SPARK-24376:
----------------------------
    Summary: compiling spark with scala-2.10 should use the -P parameter instead of -D  (was: compiling spark with scala-2.10 should use the -p parameter instead of -d)

> compiling spark with scala-2.10 should use the -P parameter instead of -D
> -------------------------------------------------------------------------
>
>                 Key: SPARK-24376
>                 URL: https://issues.apache.org/jira/browse/SPARK-24376
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 2.2.0
>            Reporter: Yu Wang
>            Priority: Minor
>             Fix For: 2.2.0
>
> compiling spark with scala-2.10 should use the -p parameter instead of -d

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24376) compiling spark with scala-2.10 should use the -p parameter instead of -d
Yu Wang created SPARK-24376:
-------------------------------

             Summary: compiling spark with scala-2.10 should use the -p parameter instead of -d
                 Key: SPARK-24376
                 URL: https://issues.apache.org/jira/browse/SPARK-24376
             Project: Spark
          Issue Type: Bug
          Components: Documentation
    Affects Versions: 2.2.0
            Reporter: Yu Wang
             Fix For: 2.2.0

compiling spark with scala-2.10 should use the -p parameter instead of -d

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
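For context on the doc bug above: in Maven, `-P` activates a build profile, while `-D` merely defines a system property, so writing `-Dscala-2.10` builds silently against the default Scala version. A sketch of the Scala 2.10 build steps as documented for Spark 2.2 (assuming a checkout of the Spark source tree; environment-dependent, not runnable standalone):

```shell
# Switch the POMs to Scala 2.10, then build with the scala-2.10 *profile* (-P).
./dev/change-scala-version.sh 2.10
./build/mvn -Pscala-2.10 -DskipTests clean package   # -P selects the profile

# Passing -Dscala-2.10 instead would only set a system property; the profile
# would never activate, which is the error this ticket corrects.
```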
[jira] [Commented] (SPARK-24362) SUM function precision issue
[ https://issues.apache.org/jira/browse/SPARK-24362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488464#comment-16488464 ]

Yuming Wang commented on SPARK-24362:
-------------------------------------
*{{SortMergeJoin}}* vs *{{BroadcastHashJoin}}*:
{code}
test("SPARK-24362") {
  val df = spark.range(6).toDF("c1")
  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    df.join(df, "c1").selectExpr("sum(cast(9.99 as double))").show()
  }
  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "10") {
    df.join(df, "c1").selectExpr("sum(cast(9.99 as double))").show()
  }
}
{code}
Results:
{noformat}
+------+
|   smj|
+------+
|59.945|
+------+

+-----+
|  bhj|
+-----+
|59.94|
+-----+
{noformat}

> SUM function precision issue
> ----------------------------
>
>                 Key: SPARK-24362
>                 URL: https://issues.apache.org/jira/browse/SPARK-24362
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> How to reproduce:
> {noformat}
> bin/spark-shell --conf spark.sql.autoBroadcastJoinThreshold=-1
> scala> val df = spark.range(6).toDF("c1")
> df: org.apache.spark.sql.DataFrame = [c1: bigint]
> scala> df.join(df, "c1").selectExpr("sum(cast(9.99 as double))").show()
> +-------------------------+
> |sum(CAST(9.99 AS DOUBLE))|
> +-------------------------+
> |                   59.945|
> +-------------------------+
> {noformat}
>
> More links:
> [https://stackoverflow.com/questions/42158844/about-a-loss-of-precision-when-calculating-an-aggregate-sum-with-data-frames]
> [https://stackoverflow.com/questions/44134497/spark-sql-sum-function-issues-on-double-value]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
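The discrepancy above is ordinary floating-point behavior rather than a Spark-specific defect: IEEE-754 double addition is not associative, so two physical plans that combine partial sums in different orders can round differently. A minimal JVM illustration (plain Java, no Spark involved):

```java
// Double addition is not associative: the grouping changes the rounding,
// which is why SUM over doubles can vary with the physical plan.
public class FpSumOrder {
    public static void main(String[] args) {
        double leftToRight = (0.1 + 0.2) + 0.3; // accumulates a rounding error
        double rightToLeft = 0.1 + (0.2 + 0.3); // happens to round exactly
        System.out.println(leftToRight == rightToLeft); // prints false
        System.out.println(rightToLeft);                // prints 0.6
    }
}
```

The usual remedies are to aggregate with DECIMAL instead of DOUBLE, or to accept order-dependent last-ulp differences as inherent to double arithmetic.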
[jira] [Resolved] (SPARK-24364) Files deletion after globbing may fail StructuredStreaming jobs
[ https://issues.apache.org/jira/browse/SPARK-24364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-24364.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.1
                   2.4.0

Issue resolved by pull request 21408
[https://github.com/apache/spark/pull/21408]

> Files deletion after globbing may fail StructuredStreaming jobs
> ---------------------------------------------------------------
>
>                 Key: SPARK-24364
>                 URL: https://issues.apache.org/jira/browse/SPARK-24364
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 2.4.0, 2.3.1
>
> This is related to SPARK-17599. SPARK-17599 checked the directory only, but
> the job can actually fail on another FileSystem operation when the file is not
> found. For example, see:
> {code}
> Error occurred while processing: File does not exist: /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2029)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2000)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1913)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
> java.io.FileNotFoundException: File does not exist: /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2029)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2000)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1913)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at
[jira] [Assigned] (SPARK-24364) Files deletion after globbing may fail StructuredStreaming jobs
[ https://issues.apache.org/jira/browse/SPARK-24364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-24364:
------------------------------------
    Assignee: Hyukjin Kwon

> Files deletion after globbing may fail StructuredStreaming jobs
> ---------------------------------------------------------------
>
>                 Key: SPARK-24364
>                 URL: https://issues.apache.org/jira/browse/SPARK-24364
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 2.3.1, 2.4.0
>
> This is related to SPARK-17599. SPARK-17599 checked the directory only, but
> the job can actually fail on another FileSystem operation when the file is not
> found. For example, see:
> {code}
> Error occurred while processing: File does not exist: /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2029)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2000)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1913)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
> java.io.FileNotFoundException: File does not exist: /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2029)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2000)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1913)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at
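The defensive pattern the fix implies can be sketched outside Spark with plain `java.nio`: files that a glob matched may be deleted before the follow-up FileSystem call, so that call should tolerate "file not found" instead of failing the whole job. The method name `sizesIgnoringMissing` is made up for illustration; Spark's actual change lives in its file-listing code (see PR 21408).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipDeleted {
    // Touch each globbed file, ignoring any deleted between the glob and the
    // follow-up FileSystem operation -- the race described in this ticket.
    static List<Long> sizesIgnoringMissing(List<Path> globbed) throws IOException {
        List<Long> sizes = new ArrayList<>();
        for (Path p : globbed) {
            try {
                sizes.add(Files.size(p)); // may throw if p vanished after globbing
            } catch (NoSuchFileException deletedAfterGlob) {
                // Skip the vanished file instead of failing the whole job.
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("glob-race");
        Path kept = Files.write(dir.resolve("a.csv"), new byte[] {1, 2, 3});
        Path gone = Files.write(dir.resolve("b.csv"), new byte[] {1});
        Files.delete(gone); // simulate deletion after the glob matched it
        System.out.println(sizesIgnoringMissing(Arrays.asList(kept, gone))); // prints [3]
    }
}
```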
[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488432#comment-16488432 ]

Xiangrui Meng commented on SPARK-24375:
---------------------------------------
[~jiangxb1987] Could you summarize the design sketch based on our offline discussion? Thanks!

> Design sketch: support barrier scheduling in Apache Spark
> ---------------------------------------------------------
>
>                 Key: SPARK-24375
>                 URL: https://issues.apache.org/jira/browse/SPARK-24375
>             Project: Spark
>          Issue Type: Story
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Jiang Xingbo
>            Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP
> discussion. It doesn't need to be a complete design before the vote. But it
> should at least cover both Scala/Java and PySpark.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
Xiangrui Meng created SPARK-24375:
-------------------------------------

             Summary: Design sketch: support barrier scheduling in Apache Spark
                 Key: SPARK-24375
                 URL: https://issues.apache.org/jira/browse/SPARK-24375
             Project: Spark
          Issue Type: Story
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Xiangrui Meng
            Assignee: Jiang Xingbo

This task is to outline a design sketch for the barrier scheduling SPIP discussion. It doesn't need to be a complete design before the vote. But it should at least cover both Scala/Java and PySpark.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24374: -- Shepherd: Reynold Xin > SPIP: Support Barrier Scheduling in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24374: -- Issue Type: Epic (was: Story) > SPIP: Support Barrier Scheduling in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24374: -- Description: (See details in the linked/attached SPIP doc.) {quote} The proposal here is to add a new scheduling model to Apache Spark so users can properly embed distributed DL training as a Spark stage to simplify the distributed training workflow. For example, Horovod uses MPI to implement all-reduce to accelerate distributed TensorFlow training. The computation model is different from MapReduce used by Spark. In Spark, a task in a stage doesn’t depend on any other tasks in the same stage, and hence it can be scheduled independently. In MPI, all workers start at the same time and pass messages around. To embed this workload in Spark, we need to introduce a new scheduling model, tentatively named “barrier scheduling”, which launches tasks at the same time and provides users enough information and tooling to embed distributed DL training. Spark can also provide an extra layer of fault tolerance in case some tasks failed in the middle, where Spark would abort all tasks and restart the stage. {quote} was: (See details in the linked/attached SPIP doc.) The proposal here is to add a new scheduling model to Apache Spark so users can properly embed distributed DL training as a Spark stage to simplify the distributed training workflow. For example, Horovod uses MPI to implement all-reduce to accelerate distributed TensorFlow training. The computation model is different from MapReduce used by Spark. In Spark, a task in a stage doesn’t depend on any other tasks in the same stage, and hence it can be scheduled independently. In MPI, all workers start at the same time and pass messages around. 
To embed this workload in Spark, we need to introduce a new scheduling model, tentatively named “barrier scheduling”, which launches tasks at the same time and provides users enough information and tooling to embed distributed DL training. Spark can also provide an extra layer of fault tolerance in case some tasks failed in the middle, where Spark would abort all tasks and restart the stage. > SPIP: Support Barrier Scheduling in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Story > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24374: -- Attachment: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > SPIP: Support Barrier Scheduling in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Story > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked SPIP doc.) > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24374: -- Description: (See details in the linked/attached SPIP doc.) The proposal here is to add a new scheduling model to Apache Spark so users can properly embed distributed DL training as a Spark stage to simplify the distributed training workflow. For example, Horovod uses MPI to implement all-reduce to accelerate distributed TensorFlow training. The computation model is different from MapReduce used by Spark. In Spark, a task in a stage doesn’t depend on any other tasks in the same stage, and hence it can be scheduled independently. In MPI, all workers start at the same time and pass messages around. To embed this workload in Spark, we need to introduce a new scheduling model, tentatively named “barrier scheduling”, which launches tasks at the same time and provides users enough information and tooling to embed distributed DL training. Spark can also provide an extra layer of fault tolerance in case some tasks failed in the middle, where Spark would abort all tasks and restart the stage. was: (See details in the linked SPIP doc.) The proposal here is to add a new scheduling model to Apache Spark so users can properly embed distributed DL training as a Spark stage to simplify the distributed training workflow. For example, Horovod uses MPI to implement all-reduce to accelerate distributed TensorFlow training. The computation model is different from MapReduce used by Spark. In Spark, a task in a stage doesn’t depend on any other tasks in the same stage, and hence it can be scheduled independently. In MPI, all workers start at the same time and pass messages around. To embed this workload in Spark, we need to introduce a new scheduling model, tentatively named “barrier scheduling”, which launches tasks at the same time and provides users enough information and tooling to embed distributed DL training. 
Spark can also provide an extra layer of fault tolerance in case some tasks failed in the middle, where Spark would abort all tasks and restart the stage. > SPIP: Support Barrier Scheduling in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Story > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24374: -- Labels: SPIP (was: ) > SPIP: Support Barrier Scheduling in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Story > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: SPIP > > (See details in the linked SPIP doc.) > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
Xiangrui Meng created SPARK-24374:
-------------------------------------

             Summary: SPIP: Support Barrier Scheduling in Apache Spark
                 Key: SPARK-24374
                 URL: https://issues.apache.org/jira/browse/SPARK-24374
             Project: Spark
          Issue Type: Story
          Components: ML, Spark Core
    Affects Versions: 3.0.0
            Reporter: Xiangrui Meng
            Assignee: Xiangrui Meng

(See details in the linked SPIP doc.)

The proposal here is to add a new scheduling model to Apache Spark so users can properly embed distributed DL training as a Spark stage to simplify the distributed training workflow. For example, Horovod uses MPI to implement all-reduce to accelerate distributed TensorFlow training. The computation model is different from the MapReduce model used by Spark: in Spark, a task in a stage doesn’t depend on any other tasks in the same stage, and hence it can be scheduled independently, whereas in MPI, all workers start at the same time and pass messages around. To embed this workload in Spark, we need to introduce a new scheduling model, tentatively named “barrier scheduling”, which launches tasks at the same time and provides users enough information and tooling to embed distributed DL training. Spark can also provide an extra layer of fault tolerance in case some tasks fail in the middle, where Spark would abort all tasks and restart the stage.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
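The MPI-style model described above (all tasks launched together, synchronizing at a barrier before exchanging messages) can be illustrated on a single JVM with `java.util.concurrent.CyclicBarrier`. This is only a conceptual sketch of the scheduling semantics, not the proposed Spark API, which the SPIP leaves to the design doc:

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class GangDemo {
    // Runs n "workers" that all synchronize at a barrier before the
    // message-passing phase; returns how many made it past the barrier.
    static int runWorkers(int n) throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(n);
        AtomicInteger passed = new AtomicInteger();
        Thread[] workers = new Thread[n];
        for (int i = 0; i < n; i++) {
            workers[i] = new Thread(() -> {
                try {
                    // ... per-worker local computation would run here ...
                    barrier.await();          // no worker proceeds until all arrive
                    passed.incrementAndGet(); // ... all-reduce / message passing ...
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        return passed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runWorkers(4)); // prints 4
    }
}
```

The key contrast with MapReduce-style stages is that `runWorkers` only makes progress if all n workers are running concurrently; launching them one at a time would deadlock at the barrier, which is exactly why such stages need gang ("barrier") scheduling.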
[jira] [Commented] (SPARK-24369) A bug when having multiple distinct aggregations
[ https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488411#comment-16488411 ]

Takeshi Yamamuro commented on SPARK-24369:
------------------------------------------
ok, I will look into this. Thanks!

> A bug when having multiple distinct aggregations
> ------------------------------------------------
>
>                 Key: SPARK-24369
>                 URL: https://issues.apache.org/jira/browse/SPARK-24369
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Xiao Li
>            Priority: Major
>
> {code}
> SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
> (VALUES
>    (1, 1),
>    (2, 2),
>    (2, 2)
> ) t(x, y)
> {code}
> It returns
> {code}
> java.lang.RuntimeException
> You hit a query analyzer bug. Please report your query to Spark user mailing list.
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24369) A bug when having multiple distinct aggregations
[ https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488396#comment-16488396 ] Xiao Li commented on SPARK-24369: - cc [~maropu] Are you interested in this? > A bug when having multiple distinct aggregations > > > Key: SPARK-24369 > URL: https://issues.apache.org/jira/browse/SPARK-24369 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > {code} > SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM > (VALUES >(1, 1), >(2, 2), >(2, 2) > ) t(x, y) > {code} > It returns > {code} > java.lang.RuntimeException > You hit a query analyzer bug. Please report your query to Spark user mailing > list. > {code}
[jira] [Assigned] (SPARK-24322) Upgrade Apache ORC to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24322: --- Assignee: Dongjoon Hyun > Upgrade Apache ORC to 1.4.4 > --- > > Key: SPARK-24322 > URL: https://issues.apache.org/jira/browse/SPARK-24322 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: correctness > Fix For: 2.3.1, 2.4.0 > > > ORC 1.4.4 includes [nine > fixes|https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4]. > One of the issues is about `Timestamp` bug (ORC-306) which occurs when > `native` ORC vectorized reader reads ORC column vector's sub-vector `times` > and `nanos`. ORC-306 fixes this according to the [original > definition|https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46] > and the linked PR includes the updated interpretation on ORC column vectors. > Note that `hive` ORC reader and ORC MR reader is not affected. > {code} > scala> spark.version > res0: String = 2.3.0 > scala> spark.sql("set spark.sql.orc.impl=native") > scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 > 12:34:56.000789")).toDF().write.orc("/tmp/orc") > scala> spark.read.orc("/tmp/orc").show(false) > +--+ > |value | > +--+ > |1900-05-05 12:34:55.000789| > +--+ > {code} > This issue aims to update Apache Spark to use it. 
> *FULL LIST*
> || ID || TITLE ||
> | ORC-281 | Fix compiler warnings from clang 5.0 |
> | ORC-301 | `extractFileTail` should open a file in `try` statement |
> | ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api |
> | ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp |
> | ORC-324 | Add support for ARM and PPC arch |
> | ORC-330 | Remove unnecessary Hive artifacts from root pom |
> | ORC-332 | Add syntax version to orc_proto.proto |
> | ORC-336 | Remove avro and parquet dependency management entries |
> | ORC-360 | Implement error checking on subtype fields in Java |
[jira] [Resolved] (SPARK-24322) Upgrade Apache ORC to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24322. - Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21372 [https://github.com/apache/spark/pull/21372] > Upgrade Apache ORC to 1.4.4 > --- > > Key: SPARK-24322 > URL: https://issues.apache.org/jira/browse/SPARK-24322 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: correctness > Fix For: 2.4.0, 2.3.1 > > > ORC 1.4.4 includes [nine > fixes|https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4]. > One of the issues is about `Timestamp` bug (ORC-306) which occurs when > `native` ORC vectorized reader reads ORC column vector's sub-vector `times` > and `nanos`. ORC-306 fixes this according to the [original > definition|https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46] > and the linked PR includes the updated interpretation on ORC column vectors. > Note that `hive` ORC reader and ORC MR reader is not affected. > {code} > scala> spark.version > res0: String = 2.3.0 > scala> spark.sql("set spark.sql.orc.impl=native") > scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 > 12:34:56.000789")).toDF().write.orc("/tmp/orc") > scala> spark.read.orc("/tmp/orc").show(false) > +--+ > |value | > +--+ > |1900-05-05 12:34:55.000789| > +--+ > {code} > This issue aims to update Apache Spark to use it. 
> *FULL LIST*
> || ID || TITLE ||
> | ORC-281 | Fix compiler warnings from clang 5.0 |
> | ORC-301 | `extractFileTail` should open a file in `try` statement |
> | ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api |
> | ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp |
> | ORC-324 | Add support for ARM and PPC arch |
> | ORC-330 | Remove unnecessary Hive artifacts from root pom |
> | ORC-332 | Add syntax version to orc_proto.proto |
> | ORC-336 | Remove avro and parquet dependency management entries |
> | ORC-360 | Implement error checking on subtype fields in Java |
[jira] [Resolved] (SPARK-24257) LongToUnsafeRowMap calculate the new size may be wrong
[ https://issues.apache.org/jira/browse/SPARK-24257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24257. - Resolution: Fixed Fix Version/s: 2.3.1 2.1.3 2.4.0 2.2.2 2.0.3 Issue resolved by pull request 21311 [https://github.com/apache/spark/pull/21311] > LongToUnsafeRowMap calculate the new size may be wrong > -- > > Key: SPARK-24257 > URL: https://issues.apache.org/jira/browse/SPARK-24257 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Blocker > Labels: correctness > Fix For: 2.0.3, 2.2.2, 2.4.0, 2.1.3, 2.3.1 > > > LongToUnsafeRowMap grows its backing array by simply doubling the current > size. The newly allocated size may still not be large enough to hold the > data being appended, so some data is silently lost and the values read back > are dirty.
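The growth bug described in SPARK-24257 is easy to illustrate outside Spark: if a buffer grows to exactly double its current size, rather than doubling until the requested size fits, a single large append can overflow silently. The following is a pure-Python sketch of the two strategies, not the actual LongToUnsafeRowMap code:

```python
def grow_buggy(capacity: int, needed: int) -> int:
    # Buggy strategy: always return 2x the current capacity, without
    # checking that the result can actually hold `needed` units.
    return capacity * 2

def grow_fixed(capacity: int, needed: int) -> int:
    # Safe strategy: keep doubling until the new capacity covers `needed`.
    while capacity < needed:
        capacity *= 2
    return capacity

# A request that is more than twice the current capacity exposes the bug.
cap, needed = 64, 300
assert grow_buggy(cap, needed) < needed   # 128 < 300: data would be truncated
assert grow_fixed(cap, needed) >= needed  # 512 >= 300: request is satisfied
```

Writes past the undersized buffer are what produce the "dirty" reads the reporter describes, which is also why the issue carries the correctness label.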
[jira] [Assigned] (SPARK-24257) LongToUnsafeRowMap calculate the new size may be wrong
[ https://issues.apache.org/jira/browse/SPARK-24257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24257: --- Assignee: dzcxzl > LongToUnsafeRowMap calculate the new size may be wrong > -- > > Key: SPARK-24257 > URL: https://issues.apache.org/jira/browse/SPARK-24257 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Blocker > Labels: correctness > Fix For: 2.0.3, 2.1.3, 2.2.2, 2.3.1, 2.4.0 > > > LongToUnsafeRowMap grows its backing array by simply doubling the current > size. The newly allocated size may still not be large enough to hold the > data being appended, so some data is silently lost and the values read back > are dirty.
[jira] [Commented] (SPARK-23416) Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488355#comment-16488355 ] Dongjoon Hyun commented on SPARK-23416: --- Thank you, [~joseph.torres] and [~tdas] . Actually, recently failures are reported on `branch-2.3`. Can we have this fix on `branch-2.3` please? > Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for > failOnDataLoss=false > > > Key: SPARK-23416 > URL: https://issues.apache.org/jira/browse/SPARK-23416 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Minor > Fix For: 3.0.0 > > > I suspect this is a race condition latent in the DataSourceV2 write path, or > at least the interaction of that write path with StreamTest. > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/] > h3. Error Message > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 16b2a2b1-acdd-44ec-902f-531169193169, runId = > 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job > aborted. > h3. Stacktrace > sbt.ForkMain$ForkError: > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 16b2a2b1-acdd-44ec-902f-531169193169, runId = > 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job > aborted. at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing > job aborted. 
at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at > org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at >
[jira] [Commented] (SPARK-24370) spark checkpoint creates many 0 byte empty files(partitions) in checkpoint directory
[ https://issues.apache.org/jira/browse/SPARK-24370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488330#comment-16488330 ] Hyukjin Kwon commented on SPARK-24370: -- Sounds more like a question though. Do you have any reproducer or steps to reproduce this? That would help others like me reproduce and debug the problem. > spark checkpoint creates many 0 byte empty files(partitions) in checkpoint > directory > - > > Key: SPARK-24370 > URL: https://issues.apache.org/jira/browse/SPARK-24370 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.1 >Reporter: Jami Malikzade >Priority: Major > Attachments: partitions.PNG > > > We are currently facing an issue where calling checkpoint on a dataframe > creates partitions in the checkpoint dir, but some of them are empty, so we > get exceptions reading the dataframe back. > Do you have any idea how to avoid it? > It creates 200 partitions, and some are empty. I used repartition(1) before > checkpoint, but that is not a good workaround. Is there any way to populate > all partitions with data, or to avoid the empty files? > Pasted snapshot. > !image-2018-05-23-21-10-43-673.png!
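The 200 files the reporter sees most likely come from Spark's default of 200 shuffle partitions: with only a handful of rows, most hash buckets receive nothing, so most partition files end up empty. The effect can be sketched in pure Python (this is an illustration of hash partitioning, not Spark code):

```python
NUM_PARTITIONS = 200  # mirrors Spark's default spark.sql.shuffle.partitions

rows = [("a", 1), ("b", 2), ("c", 3)]  # only three rows to distribute

# Assign each row to a bucket the way a hash partitioner would.
buckets = [[] for _ in range(NUM_PARTITIONS)]
for key, value in rows:
    buckets[hash(key) % NUM_PARTITIONS].append((key, value))

non_empty = sum(1 for b in buckets if b)
empty = NUM_PARTITIONS - non_empty
print(non_empty, empty)  # at most 3 buckets hold data; at least 197 are empty
```

Lowering spark.sql.shuffle.partitions, or calling coalesce(n) before the checkpoint, reduces the number of (empty) output files without the full shuffle that repartition(1) triggers.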
[jira] [Commented] (SPARK-24370) spark checkpoint creates many 0 byte empty files(partitions) in checkpoint directory
[ https://issues.apache.org/jira/browse/SPARK-24370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488329#comment-16488329 ] Hyukjin Kwon commented on SPARK-24370: -- (let's avoid setting Critical; it's usually reserved for a committer) > spark checkpoint creates many 0 byte empty files(partitions) in checkpoint > directory > - > > Key: SPARK-24370 > URL: https://issues.apache.org/jira/browse/SPARK-24370 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.1 >Reporter: Jami Malikzade >Priority: Major > Attachments: partitions.PNG > > > We are currently facing an issue where calling checkpoint on a dataframe > creates partitions in the checkpoint dir, but some of them are empty, so we > get exceptions reading the dataframe back. > Do you have any idea how to avoid it? > It creates 200 partitions, and some are empty. I used repartition(1) before > checkpoint, but that is not a good workaround. Is there any way to populate > all partitions with data, or to avoid the empty files? > Pasted snapshot. > !image-2018-05-23-21-10-43-673.png!
[jira] [Updated] (SPARK-24370) spark checkpoint creates many 0 byte empty files(partitions) in checkpoint directory
[ https://issues.apache.org/jira/browse/SPARK-24370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24370: - Priority: Major (was: Critical) > spark checkpoint creates many 0 byte empty files(partitions) in checkpoint > directory > - > > Key: SPARK-24370 > URL: https://issues.apache.org/jira/browse/SPARK-24370 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.1 >Reporter: Jami Malikzade >Priority: Major > Attachments: partitions.PNG > > > We are currently facing an issue where calling checkpoint on a dataframe > creates partitions in the checkpoint dir, but some of them are empty, so we > get exceptions reading the dataframe back. > Do you have any idea how to avoid it? > It creates 200 partitions, and some are empty. I used repartition(1) before > checkpoint, but that is not a good workaround. Is there any way to populate > all partitions with data, or to avoid the empty files? > Pasted snapshot. > !image-2018-05-23-21-10-43-673.png!
[jira] [Resolved] (SPARK-24363) spark context exit abnormally
[ https://issues.apache.org/jira/browse/SPARK-24363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24363. -- Resolution: Invalid Sounds more like a question. Please ask it on the dev mailing list; I believe you will get a better answer there. Let's leave this resolved for now until we are clear whether it's an issue. > spark context exit abnormally > - > > Key: SPARK-24363 > URL: https://issues.apache.org/jira/browse/SPARK-24363 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Submit, YARN >Affects Versions: 2.1.2 > Environment: spark2.1.2 > hadoop2.6.0-cdh5.4.7 > > >Reporter: nilone >Priority: Major > Attachments: QQ图片20180523162543.png, error_log.txt > > > I create a SparkContext instance in yarn-client mode in my program (a web > application) for users to submit their jobs. It works for a period of time, > but then it suddenly exits even though no new job is running. It looks like > the context has been killed, and I'm not sure whether the problem is on the > YARN side or the Spark side. > You can see my screenshot and log file. > >
[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type
[ https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488261#comment-16488261 ] Hyukjin Kwon commented on SPARK-24358: -- Yea, that's exactly what I thought. There are some differences between Python 2 and 3, but to my knowledge PySpark has so far tried to support both consistently. > createDataFrame in Python 3 should be able to infer bytes type as Binary type > - > > Key: SPARK-24358 > URL: https://issues.apache.org/jira/browse/SPARK-24358 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Joel Croteau >Priority: Minor > Labels: Python3 > > createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes > is just the immutable, hashable version of the same structure, it makes > sense for the same mapping to apply there.
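The requested behavior amounts to one extra entry in the type-inference mapping: today bytearray maps to a binary type, and the proposal is to map the immutable bytes the same way. The sketch below is illustrative only — a made-up mapping, not PySpark's actual inference code:

```python
def infer_type(value):
    """Map a Python value to a (made-up) SQL type name, treating
    bytes exactly like bytearray, as the issue proposes."""
    # bool must be checked before int, since bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, (bytearray, bytes)):  # the proposed addition: bytes
        return "binary"
    raise TypeError(f"cannot infer type for {type(value).__name__}")

assert infer_type(bytearray(b"\x00\x01")) == "binary"  # already supported
assert infer_type(b"\x00\x01") == "binary"             # proposed behavior
```

Since bytes and bytearray share the same byte-sequence structure, nothing downstream of inference should need to distinguish them once both map to binary.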
[jira] [Resolved] (SPARK-23416) Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23416. --- Resolution: Fixed Fix Version/s: (was: 2.3.0) 3.0.0 Issue resolved by pull request 21384 [https://github.com/apache/spark/pull/21384] > Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for > failOnDataLoss=false > > > Key: SPARK-23416 > URL: https://issues.apache.org/jira/browse/SPARK-23416 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Minor > Fix For: 3.0.0 > > > I suspect this is a race condition latent in the DataSourceV2 write path, or > at least the interaction of that write path with StreamTest. > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/] > h3. Error Message > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 16b2a2b1-acdd-44ec-902f-531169193169, runId = > 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job > aborted. > h3. Stacktrace > sbt.ForkMain$ForkError: > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 16b2a2b1-acdd-44ec-902f-531169193169, runId = > 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job > aborted. at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing > job aborted. 
at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at > org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at >
[jira] [Assigned] (SPARK-23416) Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-23416: - Assignee: Jose Torres > Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for > failOnDataLoss=false > > > Key: SPARK-23416 > URL: https://issues.apache.org/jira/browse/SPARK-23416 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Minor > Fix For: 2.3.0 > > > I suspect this is a race condition latent in the DataSourceV2 write path, or > at least the interaction of that write path with StreamTest. > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/] > h3. Error Message > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 16b2a2b1-acdd-44ec-902f-531169193169, runId = > 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job > aborted. > h3. Stacktrace > sbt.ForkMain$ForkError: > org.apache.spark.sql.streaming.StreamingQueryException: Query [id = > 16b2a2b1-acdd-44ec-902f-531169193169, runId = > 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job > aborted. at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing > job aborted. 
at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at > org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at >
[jira] [Updated] (SPARK-24322) Upgrade Apache ORC to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24322:
--
Description:
ORC 1.4.4 includes [nine fixes|https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4]. One of the issues is a `Timestamp` bug (ORC-306) that occurs when the `native` ORC vectorized reader reads the ORC column vector's sub-vectors `times` and `nanos`. ORC-306 fixes this according to the [original definition|https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46], and the linked PR includes the updated interpretation of ORC column vectors. Note that the `hive` ORC reader and the ORC MR reader are not affected.

{code}
scala> spark.version
res0: String = 2.3.0

scala> spark.sql("set spark.sql.orc.impl=native")

scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc")

scala> spark.read.orc("/tmp/orc").show(false)
+--+
|value |
+--+
|1900-05-05 12:34:55.000789|
+--+
{code}

This issue aims to update Apache Spark to use it.

*FULL LIST*
|| ID || TITLE ||
| ORC-281 | Fix compiler warnings from clang 5.0 |
| ORC-301 | `extractFileTail` should open a file in `try` statement |
| ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api |
| ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp |
| ORC-324 | Add support for ARM and PPC arch |
| ORC-330 | Remove unnecessary Hive artifacts from root pom |
| ORC-332 | Add syntax version to orc_proto.proto |
| ORC-336 | Remove avro and parquet dependency management entries |
| ORC-360 | Implement error checking on subtype fields in Java |

was:
ORC 1.4.4 includes [nine fixes|https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4). One of the issues is a `Timestamp` bug (ORC-306) that occurs when the `native` ORC vectorized reader reads the ORC column vector's sub-vectors `times` and `nanos`. ORC-306 fixes this according to the [original definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46) and the linked PR includes the updated interpretation of ORC column vectors. Note that the `hive` ORC reader and the ORC MR reader are not affected.

{code}
scala> spark.version
res0: String = 2.3.0

scala> spark.sql("set spark.sql.orc.impl=native")

scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc")

scala> spark.read.orc("/tmp/orc").show(false)
+--+
|value |
+--+
|1900-05-05 12:34:55.000789|
+--+
{code}

This issue aims to update Apache Spark to use it.

*FULL LIST*
|| ID || TITLE ||
| ORC-281 | Fix compiler warnings from clang 5.0 |
| ORC-301 | `extractFileTail` should open a file in `try` statement |
| ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api |
| ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp |
| ORC-324 | Add support for ARM and PPC arch |
| ORC-330 | Remove unnecessary Hive artifacts from root pom |
| ORC-332 | Add syntax version to orc_proto.proto |
| ORC-336 | Remove avro and parquet dependency management entries |
| ORC-360 | Implement error checking on subtype fields in Java |

> Upgrade Apache ORC to 1.4.4
> ---
>
> Key: SPARK-24322
> URL: https://issues.apache.org/jira/browse/SPARK-24322
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.4.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: correctness
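The one-second drift in the reproduction above (12:34:56 written, 12:34:55 read back) is characteristic of recombining a split (millis, nanos-of-second) timestamp representation with division that truncates toward zero. A plain-Python sketch of that failure mode — illustrative only: the function names are invented, this is not the actual ORC/Spark code, and the real ORC-306 fix concerns how the reader interprets the `times`/`nanos` sub-vectors:

```python
# Illustrative sketch, NOT the actual ORC/Spark source. TimestampColumnVector
# keeps `time` (epoch millis, as from Timestamp.getTime()) and `nanos`
# (nanos-of-second, as from Timestamp.getNanos()). Recombining them with
# division that truncates toward zero goes wrong for pre-epoch instants.

def to_epoch_micros_truncating(time_ms, nanos_of_second):
    # Hypothetical BUGGY variant: int() truncates toward zero, so negative
    # (pre-1970) millis land one second too late.
    seconds = int(time_ms / 1000)
    return seconds * 1_000_000 + nanos_of_second // 1000

def to_epoch_micros_floor(time_ms, nanos_of_second):
    # Correct variant: floor division also handles negative millis.
    seconds = time_ms // 1000
    return seconds * 1_000_000 + nanos_of_second // 1000

# A pre-epoch instant 1.4993 s before 1970-01-01, i.e. -2 s + 0.5007 s:
# Timestamp.getTime() -> -1500 ms, Timestamp.getNanos() -> 500_700_000.
time_ms, nanos = -1500, 500_700_000

correct = to_epoch_micros_floor(time_ms, nanos)
buggy = to_epoch_micros_truncating(time_ms, nanos)

assert correct == -1_499_300          # -2 s + 500_700 us
assert buggy - correct == 1_000_000   # off by exactly one second
```

Post-epoch values are unaffected by the choice of division, which is why such bugs only surface for timestamps before 1970, as in the 1900-05-05 example above.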
[jira] [Updated] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-24373:
---
Summary: "df.cache() df.count()" no longer eagerly caches data (was: Spark Dataset groupby.agg/count doesn't respect cache with UDF)

> "df.cache() df.count()" no longer eagerly caches data
> -
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Wenbo Zhao
> Priority: Major
>
> Here is the code to reproduce in local mode:
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
>
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(,LongType,Some(List(LongType)))
>
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
>
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
>
> scala> df1.count
> res1: Long = 1
>
> scala> df1.count
> res2: Long = 1
>
> scala> df1.count
> res3: Long = 1
> {code}
>
> In Spark 2.2, you could see it print "".
> When you run explain on the above example, you see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
>    +- SerializeFromObject [input[0, bigint, false] AS value#2L]
>       +- ExternalRDD [obj#1L]
>
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
>    +- ExternalRDD [obj#1L]
>
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
>    +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
>       +- Scan ExternalRDDScan[obj#1L]
>
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
>    +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
>       +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
>          +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain():
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
>    +- SerializeFromObject [input[0, bigint, false] AS value#2L]
>       +- ExternalRDD [obj#1L]
>
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
>    +- SerializeFromObject [input[0, bigint, false] AS value#2L]
>       +- ExternalRDD [obj#1L]
>
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
>    +- SerializeFromObject [input[0, bigint, false] AS value#2L]
>       +- ExternalRDD [obj#1L]
>
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
>    +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#175L])
>       +- *(1) Project
>          +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
>             +- Scan ExternalRDDScan[obj#1L]
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
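The "df.cache() df.count()" idiom the new title refers to relies on cache() being lazy and the first action both materializing the cache and being the last time the UDF runs. A toy model in plain Python of that expected contract — `LazyFrame` and `expensive_udf` are invented names, and this is a stand-in, not Spark's InMemoryRelation machinery:

```python
# Toy model of lazy caching, NOT Spark internals: cache() only marks the
# frame, the first action materializes it, and later actions reuse it.
calls = []

def expensive_udf(x):
    calls.append(x)            # stands in for the println side effect above
    return x + 1

class LazyFrame:
    def __init__(self, data, fn):
        self.data, self.fn = data, fn
        self._materialized = None

    def cache(self):
        # Lazy, like df.cache(): nothing is computed yet.
        return self

    def count(self):
        if self._materialized is None:
            # First action materializes the cached result exactly once.
            self._materialized = [self.fn(x) for x in self.data]
        return len(self._materialized)

df1 = LazyFrame([1], expensive_udf).cache()
assert calls == []             # cache() alone ran nothing
assert df1.count() == 1
assert df1.count() == 1
assert df1.count() == 1
assert calls == [1]            # the UDF ran exactly once; repeats hit the cache
```

The regression described in this issue is the analogue of a later action recomputing `self.fn` instead of reusing `_materialized`: results stay correct, but the expensive work silently reruns.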
[jira] [Comment Edited] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488109#comment-16488109 ] Li Jin edited comment on SPARK-24373 at 5/23/18 11:18 PM:
--
We found that after upgrading to Spark 2.3, many of our production systems run slower. This was found to be caused by "df.cache(); df.count()" no longer eagerly caching df correctly. I think this might be a regression from 2.2. Anyone who uses "df.cache() df.count()" to cache data eagerly will be affected.

was (Author: icexelloss): I think this might be a regression from 2.2. Anyone who uses "df.cache() df.count()" to cache data eagerly will be affected.
[jira] [Updated] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-24373:
---
Component/s: (was: Input/Output) SQL
[jira] [Commented] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488109#comment-16488109 ] Li Jin commented on SPARK-24373:
--
I think this might be a regression from 2.2. Anyone who uses "df.cache() df.count()" to cache data eagerly will be affected.
[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type
[ https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488090#comment-16488090 ] Joel Croteau commented on SPARK-24358:
--
This may be trickier than I first thought. In Python 2, bytes is an alias for str, so a bytes object resolves as a StringType. In Python 3, they are different types and are not, in general, freely convertible: a str in Python 3 is unicode, and an arbitrary byte string as represented by a bytes may not be valid unicode. This means that a bytes in Python 2 will need to be resolved to a different schema from a bytes in Python 3. Not sure how significant that is.

> createDataFrame in Python 3 should be able to infer bytes type as Binary type
> -
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Joel Croteau
> Priority: Minor
> Labels: Python3
>
> createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes is just the immutable, hashable version of this same structure, it makes sense for the same thing to apply there.
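The two claims in this comment — that Python 3 bytes cannot in general be converted to str, and that bytes is simply the immutable, hashable counterpart of bytearray — are easy to verify in plain Python 3. This is a standalone illustration, not PySpark's schema-inference code:

```python
# An arbitrary byte string need not be valid unicode, so Python 3 bytes
# cannot simply reuse the StringType path.
raw = bytes([0xC3, 0x28])              # an invalid UTF-8 sequence
try:
    raw.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
assert decodable is False              # not freely convertible to str

# bytes is the immutable, hashable counterpart of bytearray (which
# createDataFrame already infers as BinaryType): same data, same meaning.
assert bytes(bytearray(b"abc")) == b"abc"
assert isinstance(hash(b"abc"), int)   # bytes is hashable...
try:
    hash(bytearray(b"abc"))
    bytearray_hashable = True
except TypeError:
    bytearray_hashable = False
assert bytearray_hashable is False     # ...while bytearray is not
```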
[jira] [Updated] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbo Zhao updated SPARK-24373:
---
Description:
Here is the code to reproduce in local mode:
{code:java}
scala> val df = sc.range(1, 2).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> val myudf = udf({x: Long => println(""); x + 1})
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(,LongType,Some(List(LongType)))

scala> val df1 = df.withColumn("value1", myudf(col("value")))
df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]

scala> df1.cache
res0: df1.type = [value: bigint, value1: bigint]

scala> df1.count
res1: Long = 1

scala> df1.count
res2: Long = 1

scala> df1.count
res3: Long = 1
{code}
In Spark 2.2, you could see it print "".
When you run explain on the above example, you see:
{code:java}
scala> df1.explain(true)
== Parsed Logical Plan ==
'Project [value#2L, UDF('value) AS value1#5]
+- AnalysisBarrier
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
value: bigint, value1: bigint
Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
   +- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
   +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- Scan ExternalRDDScan[obj#1L]

== Physical Plan ==
*(1) InMemoryTableScan [value#2L, value1#5L]
+- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
      +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
         +- Scan ExternalRDDScan[obj#1L]
{code}
but the InMemoryTableScan is missing in the following explain():
{code:java}
scala> df1.groupBy().count().explain(true)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#175L])
      +- *(1) Project
         +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
            +- Scan ExternalRDDScan[obj#1L]
{code}

was:
Here is the code to reproduce in local mode:
{code:java}
scala> val df = sc.range(1, 2).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> val myudf = udf({x: Long => println(""); x + 1})

scala> val df1 = df.withColumn("value1", myudf(col("value")))

scala> df1.cache()
res0: df1.type = [value: bigint, value1: bigint]

scala> df1.count
res1: Long = 1

scala> df1.count
res2: Long = 1

scala> df1.count
res3: Long = 1
{code}
In Spark 2.2, you could see it print "".
When you run explain on the above example, you see:
{code:java}
scala> df1.explain(true)
== Parsed Logical Plan ==
'Project [value#2L, UDF('value) AS value1#5]
+- AnalysisBarrier
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
value: bigint, value1: bigint
Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
   +- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
   +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- Scan ExternalRDDScan[obj#1L]

== Physical Plan ==
*(1) InMemoryTableScan [value#2L, value1#5L]
+- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
      +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
         +- Scan ExternalRDDScan[obj#1L]
{code}
but the InMemoryTableScan is missing in the following explain():
{code:java}
scala> df1.groupBy().count().explain(true)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#175L])
      +- *(1) Project
         +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
            +- Scan ExternalRDDScan[obj#1L]
{code}
[jira] [Updated] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbo Zhao updated SPARK-24373:
---
Description:
Here is the code to reproduce in local mode:
{code:java}
scala> val df = sc.range(1, 2).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> val myudf = udf({x: Long => println(""); x + 1})

scala> val df1 = df.withColumn("value1", myudf(col("value")))

scala> df1.cache()
res0: df1.type = [value: bigint, value1: bigint]

scala> df1.count
res1: Long = 1

scala> df1.count
res2: Long = 1

scala> df1.count
res3: Long = 1
{code}
In Spark 2.2, you could see it print "".
When you run explain on the above example, you see:
{code:java}
scala> df1.explain(true)
== Parsed Logical Plan ==
'Project [value#2L, UDF('value) AS value1#5]
+- AnalysisBarrier
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
value: bigint, value1: bigint
Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
   +- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
   +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- Scan ExternalRDDScan[obj#1L]

== Physical Plan ==
*(1) InMemoryTableScan [value#2L, value1#5L]
+- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
      +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
         +- Scan ExternalRDDScan[obj#1L]
{code}
but the InMemoryTableScan is missing in the following explain():
{code:java}
scala> df1.groupBy().count().explain(true)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#175L])
      +- *(1) Project
         +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
            +- Scan ExternalRDDScan[obj#1L]
{code}

was:
Here is the code to reproduce in local mode:
{code:java}
val df = sc.range(1, 2).toDF
val myudf = udf({x: Long => println(""); x + 1})

scala> df1.cache()
res0: df1.type = [value: bigint, value1: bigint]

scala> df1.count
res1: Long = 1

scala> df1.count
res2: Long = 1

scala> df1.count
res3: Long = 1
{code}
In Spark 2.2, you could see it print "".
When you run explain on the above example, you see:
{code:java}
scala> df1.explain(true)
== Parsed Logical Plan ==
'Project [value#2L, UDF('value) AS value1#5]
+- AnalysisBarrier
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
value: bigint, value1: bigint
Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
+- SerializeFromObject [input[0, bigint, false] AS value#2L]
   +- ExternalRDD [obj#1L]

== Optimized Logical Plan ==
InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
   +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- Scan ExternalRDDScan[obj#1L]

== Physical Plan ==
*(1) InMemoryTableScan [value#2L, value1#5L]
+- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
      +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
         +- Scan ExternalRDDScan[obj#1L]
{code}
but the InMemoryTableScan is missing in the following explain():
{code:java}
scala> df1.groupBy().count().explain(true)
== Parsed Logical Plan ==
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]

== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#170L]
+- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
   +- SerializeFromObject [input[0, bigint, false] AS value#2L]
      +- ExternalRDD [obj#1L]
{code}
[jira] [Updated] (SPARK-24322) Upgrade Apache ORC to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24322:
--
Description:
ORC 1.4.4 includes [nine fixes|https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4). One of the issues is a `Timestamp` bug (ORC-306) that occurs when the `native` ORC vectorized reader reads the ORC column vector's sub-vectors `times` and `nanos`. ORC-306 fixes this according to the [original definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46) and the linked PR includes the updated interpretation of ORC column vectors. Note that the `hive` ORC reader and the ORC MR reader are not affected.

{code}
scala> spark.version
res0: String = 2.3.0

scala> spark.sql("set spark.sql.orc.impl=native")

scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc")

scala> spark.read.orc("/tmp/orc").show(false)
+--+
|value |
+--+
|1900-05-05 12:34:55.000789|
+--+
{code}

This issue aims to update Apache Spark to use it.

*FULL LIST*
|| ID || TITLE ||
| ORC-281 | Fix compiler warnings from clang 5.0 |
| ORC-301 | `extractFileTail` should open a file in `try` statement |
| ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api |
| ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp |
| ORC-324 | Add support for ARM and PPC arch |
| ORC-330 | Remove unnecessary Hive artifacts from root pom |
| ORC-332 | Add syntax version to orc_proto.proto |
| ORC-336 | Remove avro and parquet dependency management entries |
| ORC-360 | Implement error checking on subtype fields in Java |

was:
ORC 1.4.4 (released on May 14th) includes nine fixes. This issue aims to update Spark to use it.
https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4

For example, ORC-306 fixes the timestamp issue.
{code}
scala> spark.version
res0: String = 2.3.0

scala> spark.sql("set spark.sql.orc.impl=native")

scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc")

scala> spark.read.orc("/tmp/orc").show(false)
+--+
|value |
+--+
|1900-05-05 12:34:55.000789|
+--+
{code}
[jira] [Updated] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache with UDF
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbo Zhao updated SPARK-24373: --- Summary: Spark Dataset groupby.agg/count doesn't respect cache with UDF (was: Spark Dataset groupby.agg/count doesn't respect cache) > Spark Dataset groupby.agg/count doesn't respect cache with UDF > -- > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Priority: Major > > Here is the code to reproduce in local mode > {code:java} > val df = sc.range(1, 2).toDF > val myudf = udf({x: Long => println(""); x + 1}) > val df1 = df.withColumn("value1", myudf($"value")) > scala> df1.cache() > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > In Spark 2.2, you could see it prints "". > In the above example, when you do explain, you could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > but the InMemoryTableScan is missing in the following explain() > {code:java} > scala> df1.groupBy().count().explain(true) > == Parsed Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS > value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > count: bigint > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) > null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Physical Plan == > *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L]) > +- Exchange SinglePartition > +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#175L]) > +- *(1) Project > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24373) Spark Dataset groupby.agg/count doesn't respect cache
Wenbo Zhao created SPARK-24373: -- Summary: Spark Dataset groupby.agg/count doesn't respect cache Key: SPARK-24373 URL: https://issues.apache.org/jira/browse/SPARK-24373 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.3.0 Reporter: Wenbo Zhao Here is the code to reproduce in local mode {code:java} val df = sc.range(1, 2).toDF val myudf = udf({x: Long => println(""); x + 1}) val df1 = df.withColumn("value1", myudf($"value")) scala> df1.cache() res0: df1.type = [value: bigint, value1: bigint] scala> df1.count res1: Long = 1 scala> df1.count res2: Long = 1 scala> df1.count res3: Long = 1 {code} In Spark 2.2, you could see it prints "". In the above example, when you do explain, you could see {code:java} scala> df1.explain(true) == Parsed Logical Plan == 'Project [value#2L, UDF('value) AS value1#5] +- AnalysisBarrier +- SerializeFromObject [input[0, bigint, false] AS value#2L] +- ExternalRDD [obj#1L] == Analyzed Logical Plan == value: bigint, value1: bigint Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] +- SerializeFromObject [input[0, bigint, false] AS value#2L] +- ExternalRDD [obj#1L] == Optimized Logical Plan == InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] +- Scan ExternalRDDScan[obj#1L] == Physical Plan == *(1) InMemoryTableScan [value#2L, value1#5L] +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] +- Scan ExternalRDDScan[obj#1L] {code} but the InMemoryTableScan is missing in the following explain() {code:java} scala> df1.groupBy().count().explain(true) == Parsed Logical Plan == Aggregate [count(1) AS count#170L] +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] +- SerializeFromObject [input[0, bigint, false] AS value#2L] +- ExternalRDD [obj#1L] == Analyzed Logical Plan == count: bigint Aggregate [count(1) AS count#170L] +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] +- SerializeFromObject [input[0, bigint, false] AS value#2L] +- ExternalRDD [obj#1L] == Optimized Logical Plan == Aggregate [count(1) AS count#170L] +- Project +- SerializeFromObject [input[0, bigint, false] AS value#2L] +- ExternalRDD [obj#1L] == Physical Plan == *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L]) +- Exchange SinglePartition +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#175L]) +- *(1) Project +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] +- Scan ExternalRDDScan[obj#1L] {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
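A hedged way to confirm the regression from a spark-shell is to check whether the aggregate's executed plan still contains an InMemoryTableScan node. This sketch assumes the session from the reproduction above, where `df1` is the cached DataFrame with the UDF column:

```scala
// Sketch: inspect the executed plan of the aggregate query. When the cache
// is respected, the plan string contains "InMemoryTableScan"; on affected
// versions the aggregate recomputes from the ExternalRDD instead.
val plan = df1.groupBy().count().queryExecution.executedPlan.toString
println(s"aggregate reuses cache: ${plan.contains("InMemoryTableScan")}")
```

The same check against `df1.count()` prints `true`, which makes the discrepancy between the two plans easy to demonstrate.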
[jira] [Commented] (SPARK-22054) Allow release managers to inject their keys
[ https://issues.apache.org/jira/browse/SPARK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488024#comment-16488024 ] Marcelo Vanzin commented on SPARK-22054: I filed SPARK-24372 to create an easier way to prepare releases, which I think is better than futzing with jenkins jobs. I'll downgrade this from blocker but leave it open, in case you guys still think this should be done. > Allow release managers to inject their keys > --- > > Key: SPARK-22054 > URL: https://issues.apache.org/jira/browse/SPARK-22054 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Blocker > > Right now the current release process signs with Patrick's keys, let's update > the scripts to allow the release manager to sign the release as part of the > job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22054) Allow release managers to inject their keys
[ https://issues.apache.org/jira/browse/SPARK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-22054: --- Priority: Major (was: Blocker) > Allow release managers to inject their keys > --- > > Key: SPARK-22054 > URL: https://issues.apache.org/jira/browse/SPARK-22054 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Major > > Right now the current release process signs with Patrick's keys, let's update > the scripts to allow the release manager to sign the release as part of the > job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22055) Port release scripts
[ https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-22055: --- Priority: Major (was: Blocker) > Port release scripts > > > Key: SPARK-22055 > URL: https://issues.apache.org/jira/browse/SPARK-22055 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Major > > The current Jenkins jobs are generated from scripts in a private repo. We > should port these to enable changes like SPARK-22054 . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22055) Port release scripts
[ https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488023#comment-16488023 ] Marcelo Vanzin commented on SPARK-22055: I filed SPARK-24372 for the stuff I'm working on. I'm downgrading this from blocker since I'm not sure this is the correct approach, but feel free to keep it open if you want to do it. I think those jenkins jobs are still used to publish "nightly" builds, if anyone actually uses those... > Port release scripts > > > Key: SPARK-22055 > URL: https://issues.apache.org/jira/browse/SPARK-22055 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Blocker > > The current Jenkins jobs are generated from scripts in a private repo. We > should port these to enable changes like SPARK-22054 . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24372) Create script for preparing RCs
Marcelo Vanzin created SPARK-24372: -- Summary: Create script for preparing RCs Key: SPARK-24372 URL: https://issues.apache.org/jira/browse/SPARK-24372 Project: Spark Issue Type: New Feature Components: Project Infra Affects Versions: 2.4.0 Reporter: Marcelo Vanzin Currently, when preparing RCs, the RM has to invoke many scripts manually, make sure they are run in the correct environment, and set all the correct environment variables, which differ from one script to another. It would be much easier for RMs if all of that were automated as much as possible. I'm working on something like this as part of releasing 2.3.1, and plan to send my scripts for review after the release is done (i.e. after I make sure the scripts are working properly). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24371) Added isinSet in DataFrame API for Scala and Java.
[ https://issues.apache.org/jira/browse/SPARK-24371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24371: Assignee: Apache Spark (was: DB Tsai) > Added isinSet in DataFrame API for Scala and Java. > -- > > Key: SPARK-24371 > URL: https://issues.apache.org/jira/browse/SPARK-24371 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > Fix For: 2.4.0 > > > Implemented *{{isinSet}}* in DataFrame API for both Scala and Java, so users > can do > {code} > val profileDF = Seq( > Some(1), Some(2), Some(3), Some(4), > Some(5), Some(6), Some(7), None > ).toDF("profileID") > val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3") > val result = profileDF.withColumn("isValid", $"profileID".isinSet(validUsers)) > result.show(10) > """ > +--+--+ > |profileID|isValid| > +--+--+ > |1|false| > |2|false| > |3|true| > |4|false| > |5|false| > |6|true| > |7|true| > |null|null| > +--+--+ > """.stripMargin > {code} > Two new rules in the logical plan optimizers are added. > # When there is only one element in the *{{Set}}*, the physical plan will be > optimized to *{{EqualTo}}*, so predicate pushdown can be used. 
> {code} > profileDF.filter( $"profileID".isinSet(Set(6))).explain(true) > """ > |== Physical Plan ==| > |*(1) Project [profileID#0|#0]| > |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))| > |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,| > |PartitionFilters: [],| > |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],| > |ReadSchema: struct > """.stripMargin > {code} > # When the *{{Set}}* is empty, and the input is nullable, the logical plan > will be simplified to > {code} > profileDF.filter( $"profileID".isinSet(Set())).explain(true) > """ > |== Optimized Logical Plan ==| > |Filter if (isnull(profileID#0)) null else false| > |+- Relation[profileID#0|#0] parquet > """.stripMargin > {code} > > TODO: > # For multiple conditions with numbers less than certain thresholds, we > should still allow predicate pushdown. > # Optimize the `In` using tableswitch or lookupswitch when the numbers of > the categories are low, and they are `Int`, `Long`. > # The default immutable hash trees set is slow for query, and we should do > benchmark for using different set implementation for faster query. > # `filter(if (condition) null else false)` can be optimized to false. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24371) Added isinSet in DataFrame API for Scala and Java.
[ https://issues.apache.org/jira/browse/SPARK-24371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24371: Assignee: DB Tsai (was: Apache Spark) > Added isinSet in DataFrame API for Scala and Java. > -- > > Key: SPARK-24371 > URL: https://issues.apache.org/jira/browse/SPARK-24371 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.0 > > > Implemented *{{isinSet}}* in DataFrame API for both Scala and Java, so users > can do > {code} > val profileDF = Seq( > Some(1), Some(2), Some(3), Some(4), > Some(5), Some(6), Some(7), None > ).toDF("profileID") > val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3") > val result = profileDF.withColumn("isValid", $"profileID".isinSet(validUsers)) > result.show(10) > """ > +--+--+ > |profileID|isValid| > +--+--+ > |1|false| > |2|false| > |3|true| > |4|false| > |5|false| > |6|true| > |7|true| > |null|null| > +--+--+ > """.stripMargin > {code} > Two new rules in the logical plan optimizers are added. > # When there is only one element in the *{{Set}}*, the physical plan will be > optimized to *{{EqualTo}}*, so predicate pushdown can be used. 
> {code} > profileDF.filter( $"profileID".isinSet(Set(6))).explain(true) > """ > |== Physical Plan ==| > |*(1) Project [profileID#0|#0]| > |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))| > |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,| > |PartitionFilters: [],| > |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],| > |ReadSchema: struct > """.stripMargin > {code} > # When the *{{Set}}* is empty, and the input is nullable, the logical plan > will be simplified to > {code} > profileDF.filter( $"profileID".isinSet(Set())).explain(true) > """ > |== Optimized Logical Plan ==| > |Filter if (isnull(profileID#0)) null else false| > |+- Relation[profileID#0|#0] parquet > """.stripMargin > {code} > > TODO: > # For multiple conditions with numbers less than certain thresholds, we > should still allow predicate pushdown. > # Optimize the `In` using tableswitch or lookupswitch when the numbers of > the categories are low, and they are `Int`, `Long`. > # The default immutable hash trees set is slow for query, and we should do > benchmark for using different set implementation for faster query. > # `filter(if (condition) null else false)` can be optimized to false. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24371) Added isinSet in DataFrame API for Scala and Java.
[ https://issues.apache.org/jira/browse/SPARK-24371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488012#comment-16488012 ] Apache Spark commented on SPARK-24371: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/21416 > Added isinSet in DataFrame API for Scala and Java. > -- > > Key: SPARK-24371 > URL: https://issues.apache.org/jira/browse/SPARK-24371 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.0 > > > Implemented *{{isinSet}}* in DataFrame API for both Scala and Java, so users > can do > {code} > val profileDF = Seq( > Some(1), Some(2), Some(3), Some(4), > Some(5), Some(6), Some(7), None > ).toDF("profileID") > val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3") > val result = profileDF.withColumn("isValid", $"profileID".isinSet(validUsers)) > result.show(10) > """ > +--+--+ > |profileID|isValid| > +--+--+ > |1|false| > |2|false| > |3|true| > |4|false| > |5|false| > |6|true| > |7|true| > |null|null| > +--+--+ > """.stripMargin > {code} > Two new rules in the logical plan optimizers are added. > # When there is only one element in the *{{Set}}*, the physical plan will be > optimized to *{{EqualTo}}*, so predicate pushdown can be used. 
> {code} > profileDF.filter( $"profileID".isinSet(Set(6))).explain(true) > """ > |== Physical Plan ==| > |*(1) Project [profileID#0|#0]| > |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))| > |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,| > |PartitionFilters: [],| > |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],| > |ReadSchema: struct > """.stripMargin > {code} > # When the *{{Set}}* is empty, and the input is nullable, the logical plan > will be simplified to > {code} > profileDF.filter( $"profileID".isinSet(Set())).explain(true) > """ > |== Optimized Logical Plan ==| > |Filter if (isnull(profileID#0)) null else false| > |+- Relation[profileID#0|#0] parquet > """.stripMargin > {code} > > TODO: > # For multiple conditions with numbers less than certain thresholds, we > should still allow predicate pushdown. > # Optimize the `In` using tableswitch or lookupswitch when the numbers of > the categories are low, and they are `Int`, `Long`. > # The default immutable hash trees set is slow for query, and we should do > benchmark for using different set implementation for faster query. > # `filter(if (condition) null else false)` can be optimized to false. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24371) Added isinSet in DataFrame API for Scala and Java.
DB Tsai created SPARK-24371: --- Summary: Added isinSet in DataFrame API for Scala and Java. Key: SPARK-24371 URL: https://issues.apache.org/jira/browse/SPARK-24371 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.3.0 Reporter: DB Tsai Assignee: DB Tsai Fix For: 2.4.0 Implemented *{{isinSet}}* in DataFrame API for both Scala and Java, so users can do {code} val profileDF = Seq( Some(1), Some(2), Some(3), Some(4), Some(5), Some(6), Some(7), None ).toDF("profileID") val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3") val result = profileDF.withColumn("isValid", $"profileID".isinSet(validUsers)) result.show(10) """ +--+--+ |profileID|isValid| +--+--+ |1|false| |2|false| |3|true| |4|false| |5|false| |6|true| |7|true| |null|null| +--+--+ """.stripMargin {code} Two new rules in the logical plan optimizers are added. # When there is only one element in the *{{Set}}*, the physical plan will be optimized to *{{EqualTo}}*, so predicate pushdown can be used. {code} profileDF.filter( $"profileID".isinSet(Set(6))).explain(true) """ |== Physical Plan ==| |*(1) Project [profileID#0|#0]| |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))| |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,| |PartitionFilters: [],| |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],| |ReadSchema: struct """.stripMargin {code} # When the *{{Set}}* is empty, and the input is nullable, the logical plan will be simplified to {code} profileDF.filter( $"profileID".isinSet(Set())).explain(true) """ |== Optimized Logical Plan ==| |Filter if (isnull(profileID#0)) null else false| |+- Relation[profileID#0|#0] parquet """.stripMargin {code} TODO: # For multiple conditions with numbers less than certain thresholds, we should still allow predicate pushdown. # Optimize the `In` using tableswitch or lookupswitch when the numbers of the categories are low, and they are `Int`, `Long`. 
# The default immutable hash trees set is slow for query, and we should do benchmark for using different set implementation for faster query. # `filter(if (condition) null else false)` can be optimized to false. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24322) Upgrade Apache ORC to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24322: Labels: correctness (was: ) > Upgrade Apache ORC to 1.4.4 > --- > > Key: SPARK-24322 > URL: https://issues.apache.org/jira/browse/SPARK-24322 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: correctness > > ORC 1.4.4 (released on May 14th) includes nine fixes. This issue aims to > update Spark to use it. > https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4 > For example, ORC-306 fixes the timestamp issue. > {code} > scala> spark.version > res0: String = 2.3.0 > scala> spark.sql("set spark.sql.orc.impl=native") > scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 > 12:34:56.000789")).toDF().write.orc("/tmp/orc") > scala> spark.read.orc("/tmp/orc").show(false) > +--+ > |value | > +--+ > |1900-05-05 12:34:55.000789| > +--+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24294) Throw SparkException when OOM in BroadcastExchangeExec
[ https://issues.apache.org/jira/browse/SPARK-24294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24294. - Resolution: Fixed Assignee: jin xing Fix Version/s: 2.4.0 > Throw SparkException when OOM in BroadcastExchangeExec > -- > > Key: SPARK-24294 > URL: https://issues.apache.org/jira/browse/SPARK-24294 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: scala-2.11.8 >Reporter: jin xing >Assignee: jin xing >Priority: Major > Fix For: 2.4.0 > > > When an OutOfMemoryError is thrown from BroadcastExchangeExec, > scala.concurrent.Future will hit scala bug > https://github.com/scala/bug/issues/9554 and hang until the future times out. > We could wrap the OOM inside a SparkException to resolve this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
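The hang happens because scala.concurrent treats OutOfMemoryError as fatal, so the promise backing the Future is never completed (scala/bug#9554). Below is a minimal sketch of the idea behind the fix in plain Scala, wrapping the fatal error in an ordinary exception so the failure propagates immediately instead of stalling until the timeout; it stands in for the actual BroadcastExchangeExec change, which uses SparkException rather than RuntimeException:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Failure

// A bare `Future { throw new OutOfMemoryError(...) }` never completes its
// promise, so Await blocks until the timeout. Re-throwing the OOM wrapped
// in a non-fatal exception lets the future fail fast instead.
val f = Future {
  try {
    throw new OutOfMemoryError("simulated broadcast OOM")
  } catch {
    case oom: OutOfMemoryError =>
      throw new RuntimeException("Not enough memory to build the broadcast", oom)
  }
}
Await.ready(f, 10.seconds)
f.value match {
  case Some(Failure(e)) => println(s"failed fast: ${e.getMessage}")
  case other            => println(s"unexpected: $other")
}
```

Without the try/catch, the `Await.ready` call would block for the full 10 seconds before throwing a TimeoutException, which is the hang described in the report.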
[jira] [Resolved] (SPARK-24206) Improve DataSource benchmark code for read and pushdown
[ https://issues.apache.org/jira/browse/SPARK-24206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24206. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.4.0 > Improve DataSource benchmark code for read and pushdown > --- > > Key: SPARK-24206 > URL: https://issues.apache.org/jira/browse/SPARK-24206 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.4.0 > > > I improved the DataSource code for read and pushdown in the parquet v1.10.0 > upgrade activity: [https://github.com/apache/spark/pull/21070] > Based on the code, we need to brush up the benchmark code and results in the > master. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-24244) Parse only required columns of CSV file
[ https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk reopened SPARK-24244: The previous PR was reverted due to the flaky UnivocityParserSuite > Parse only required columns of CSV file > --- > > Key: SPARK-24244 > URL: https://issues.apache.org/jira/browse/SPARK-24244 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > The uniVocity parser allows specifying only the required column names or indexes > for parsing, like: > {code} > // Here we select only the columns by their indexes. > // The parser just skips the values in other columns > parserSettings.selectIndexes(4, 0, 1); > CsvParser parser = new CsvParser(parserSettings); > {code} > Need to modify *UnivocityParser* to extract only needed columns from > requiredSchema -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
[ https://issues.apache.org/jira/browse/SPARK-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487923#comment-16487923 ] Apache Spark commented on SPARK-24368: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21415 > Flaky tests: > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite > > > Key: SPARK-24368 > URL: https://issues.apache.org/jira/browse/SPARK-24368 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed > very often. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/ > {code} > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** > ABORTED *** (1 millisecond) > [info] java.lang.IllegalStateException: LiveListenerBus is stopped. 
> [info] at > org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97) > [info] at > org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80) > [info] at > org.apache.spark.sql.internal.SharedState.(SharedState.scala:93) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) > [info] at scala.Option.getOrElse(Option.scala:121) > [info] at > org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120) > [info] at > org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119) > [info] at > org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286) > [info] at > org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42) > [info] at > org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) > [info] at scala.Option.map(Option.scala:146) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94) > [info] at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:84) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:40) > [info] at > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(UnivocityParserSuite.scala:30) > [info] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > [info] at > 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > [info] at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > [info] at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > [info] at java.lang.Class.newInstance(Class.java:442) > [info] at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435) > [info] at sbt.ForkMain$Run$2.call(ForkMain.java:296) > [info] at sbt.ForkMain$Run$2.call(ForkMain.java:286) > [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) > [info] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [info] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [info] at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24244) Parse only required columns of CSV file
[ https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487922#comment-16487922 ] Apache Spark commented on SPARK-24244: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21415 > Parse only required columns of CSV file > --- > > Key: SPARK-24244 > URL: https://issues.apache.org/jira/browse/SPARK-24244 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > uniVocity parser allows to specify only required column names or indexes for > parsing like: > {code} > // Here we select only the columns by their indexes. > // The parser just skips the values in other columns > parserSettings.selectIndexes(4, 0, 1); > CsvParser parser = new CsvParser(parserSettings); > {code} > Need to modify *UnivocityParser* to extract only needed columns from > requiredSchema -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487901#comment-16487901 ] Imran Rashid commented on SPARK-23206: -- [~felixcheung] can you give an example of the network & IO stat which you think would go into this metric collection framework, rather than just a task metric? I can't think of a good example / use case. While I don't want this to block on including those metrics, I'd like to at least have that in mind while we're designing this part. > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, ExecutorsTab2.png, > MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). 
Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. > To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used for caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. 
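The sizing estimates in the description above follow from two formulas (used fraction, and unused memory per application). As a sketch, the helper below applies them; the function name and the figures plugged in are hypothetical, not taken from the ticket:

```python
def executor_memory_stats(executor_memory_gb, max_jvm_used_gb, num_executors):
    """Apply the two formulas from the ticket description:
    used fraction = max JVM used memory / spark.executor.memory
    unused memory = number of executors * (spark.executor.memory - max JVM used memory)
    """
    used_fraction = max_jvm_used_gb / executor_memory_gb
    unused_gb = num_executors * (executor_memory_gb - max_jvm_used_gb)
    return used_fraction, unused_gb

# Hypothetical application: 100 executors with 9 GB each, peaking at 3.15 GB
# of JVM used memory, giving roughly the 35% used / large-unused picture
# described in the ticket.
used_fraction, unused_gb = executor_memory_stats(9.0, 3.15, 100)
print(used_fraction, unused_gb)  # ≈ 0.35 used fraction, ≈ 585 GB unused
```

A peak-JVM-used metric per executor would feed `max_jvm_used_gb` directly, which is exactly the visibility the proposal argues for.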
[jira] [Updated] (SPARK-24370) spark checkpoint creates many 0 byte empty files(partitions) in checkpoint directory
[ https://issues.apache.org/jira/browse/SPARK-24370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jami Malikzade updated SPARK-24370: --- Attachment: partitions.PNG > spark checkpoint creates many 0 byte empty files(partitions) in checkpoint > directory > - > > Key: SPARK-24370 > URL: https://issues.apache.org/jira/browse/SPARK-24370 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.1 >Reporter: Jami Malikzade >Priority: Critical > Attachments: partitions.PNG > > > We are currently facing an issue: when we call checkpoint on a dataframe, it > creates partitions in the checkpoint dir, but some of them are empty, so we > get exceptions when reading the dataframe back. > Do you have any idea how to avoid it? > It creates 200 partitions; some are empty. I used repartition(1) before > checkpoint, but that is not a good workaround. Is there any way to populate > all partitions with data, or to avoid empty files? > Pasted snapshot. > !image-2018-05-23-21-10-43-673.png!
[jira] [Created] (SPARK-24370) spark checkpoint creates many 0 byte empty files(partitions) in checkpoint directory
Jami Malikzade created SPARK-24370: -- Summary: spark checkpoint creates many 0 byte empty files(partitions) in checkpoint directory Key: SPARK-24370 URL: https://issues.apache.org/jira/browse/SPARK-24370 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.1.1 Reporter: Jami Malikzade We are currently facing an issue: when we call checkpoint on a dataframe, it creates partitions in the checkpoint dir, but some of them are empty, so we get exceptions when reading the dataframe back. Do you have any idea how to avoid it? It creates 200 partitions; some are empty. I used repartition(1) before checkpoint, but that is not a good workaround. Is there any way to populate all partitions with data, or to avoid empty files? Pasted snapshot. !image-2018-05-23-21-10-43-673.png!
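One possible workaround (an assumption on my part; the ticket does not confirm Spark offers this) is to skip the zero-byte part files when reading the checkpoint directory back. A minimal plain-Python sketch of that filtering step:

```python
import os
import tempfile

def non_empty_parts(checkpoint_dir):
    """Return paths of part files that actually contain data, skipping
    the 0-byte partitions described in the ticket."""
    return sorted(
        os.path.join(checkpoint_dir, name)
        for name in os.listdir(checkpoint_dir)
        if name.startswith("part-")
        and os.path.getsize(os.path.join(checkpoint_dir, name)) > 0
    )

# Demo with one empty and one non-empty part file.
d = tempfile.mkdtemp()
open(os.path.join(d, "part-00000"), "w").close()      # 0 bytes
with open(os.path.join(d, "part-00001"), "w") as f:
    f.write("row1\n")
print(non_empty_parts(d))  # only part-00001 survives
```

The 200-partition count comes from the default `spark.sql.shuffle.partitions`; lowering it (or coalescing before checkpoint) reduces, but does not guarantee eliminating, empty files.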
[jira] [Updated] (SPARK-24369) A bug when having multiple distinct aggregations
[ https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24369: Summary: A bug when having multiple distinct aggregations (was: A bug of multiple distinct aggregations) > A bug when having multiple distinct aggregations > > > Key: SPARK-24369 > URL: https://issues.apache.org/jira/browse/SPARK-24369 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > {code} > SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM > (VALUES >(1, 1), >(2, 2), >(2, 2) > ) t(x, y) > {code} > It returns > {code} > java.lang.RuntimeException > You hit a query analyzer bug. Please report your query to Spark user mailing > list. > {code}
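For reference, the failing query has a well-defined answer: after DISTINCT the pairs reduce to (1, 1) and (2, 2), so both correlations are 1 and count(*) is 3. The plain-Python check below (not Spark code) verifies this by computing the Pearson correlation directly:

```python
from math import sqrt

def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rows = [(1, 1), (2, 2), (2, 2)]
distinct = sorted(set(rows))        # DISTINCT keeps [(1, 1), (2, 2)]
# corr(DISTINCT x, y) should be 1 (up to float rounding); count(*) = 3.
print(pearson(distinct), len(rows))
```

So the RuntimeException is purely a planning failure in handling two distinct aggregates with swapped argument order, not an undefined result.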
[jira] [Updated] (SPARK-24369) A bug of multiple distinct aggregations
[ https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24369: Summary: A bug of multiple distinct aggregations (was: A bug for multiple distinct aggregations) > A bug of multiple distinct aggregations > --- > > Key: SPARK-24369 > URL: https://issues.apache.org/jira/browse/SPARK-24369 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > {code} > SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM > (VALUES >(1, 1), >(2, 2), >(2, 2) > ) t(x, y) > {code} > It returns > {code} > java.lang.RuntimeException > You hit a query analyzer bug. Please report your query to Spark user mailing > list. > {code}
[jira] [Updated] (SPARK-24369) A bug for multiple distinct aggregations
[ https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24369: Description: {code} SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM (VALUES (1, 1), (2, 2), (2, 2) ) t(x, y) {code} It returns {code} java.lang.RuntimeException You hit a query analyzer bug. Please report your query to Spark user mailing list. {code} was: SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM (VALUES (1, 1), (2, 2), (2, 2) ) t(x, y) > A bug for multiple distinct aggregations > > > Key: SPARK-24369 > URL: https://issues.apache.org/jira/browse/SPARK-24369 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > {code} > SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM > (VALUES >(1, 1), >(2, 2), >(2, 2) > ) t(x, y) > {code} > It returns > {code} > java.lang.RuntimeException > You hit a query analyzer bug. Please report your query to Spark user mailing > list. > {code}
[jira] [Created] (SPARK-24369) A bug for multiple distinct aggregations
Xiao Li created SPARK-24369: --- Summary: A bug for multiple distinct aggregations Key: SPARK-24369 URL: https://issues.apache.org/jira/browse/SPARK-24369 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM (VALUES (1, 1), (2, 2), (2, 2) ) t(x, y)
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487815#comment-16487815 ] Hossein Falaki commented on SPARK-24359: Thanks [~shivaram] and [~zero323]. It seems CRAN release pain has left some scar tissue. We can host this new package in a separate repo and maintain it for a few release cycles to evaluate CRAN release overhead. If the overhead is not too high, we can contribute it back to the main Spark repository. Alternatively, we can remove the requirement for co-releasing with Apache Spark – only release when there are API changes in the new package. As for the duplicate API, I see the issue as well. I think there is room for both styles (formula-based for simple use cases and pipeline-based for more complex programs). Based on feedback from the community we can decide if deprecating the old API makes sense down the road. I have received many requests from SparkR users for the ability to build pipelines (the same way Python and Scala support it). As for whether this new package will introduce Scala API changes, in my current prototype it is very minimal (and can be avoided). Almost all new Scala code is for the utility that generates R source code. The idea is, if a patch adds new API to MLlib, the contributor can simply execute a command-line tool and check in R wrappers for his/her new API. The goal of this work is to reduce the maintenance cost of the R API in Spark. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. 
> Since Spark 1.5 the (new) SparkML API which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o] > has matured significantly. It allows users to build and maintain complicated > machine learning pipelines. A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of Apache Spark. This new package will be built on top of SparkR’s APIs > to expose SparkML’s pipeline APIs and functionality. > *Why not SparkR?* > SparkR package contains ~300 functions. Many of these shadow functions in > base and other popular CRAN packages. We think adding more functions to > SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. Also we propose a code-gen approach for this > package to minimize work needed to expose future MLlib API, but sparklyr’s API > is manually written. > h1. Target Personas > * Existing SparkR users who need more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independent from SparkR > * After setting up a Spark session R users can > * create a pipeline by chaining individual components and specifying their > parameters > * tune a pipeline in parallel, taking advantage of Spark > * inspect a pipeline’s parameters and evaluation metrics > * repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. 
Non-Goals > * Adding new algorithms to SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in API, we will choose based on the following > list of priorities. The API choice that addresses a higher priority goal will > be chosen. > # *Comprehensive coverage of MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity*: We attempt to minimize confusion with other packages. > Between conciseness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual > maintenance or make testing difficult should be avoided. > * *Interoperability with rest of Spark components:* We will keep the R API > as thin as possible and keep all functionality implementation in JVM/Scala. > * *Being natural
[jira] [Assigned] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
[ https://issues.apache.org/jira/browse/SPARK-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24368: Assignee: (was: Apache Spark) > Flaky tests: > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite > > > Key: SPARK-24368 > URL: https://issues.apache.org/jira/browse/SPARK-24368 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed > very often. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/ > {code} > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** > ABORTED *** (1 millisecond) > [info] java.lang.IllegalStateException: LiveListenerBus is stopped. > [info] at > org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97) > [info] at > org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80) > [info] at > org.apache.spark.sql.internal.SharedState.(SharedState.scala:93) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) > [info] at scala.Option.getOrElse(Option.scala:121) > [info] at > org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120) > [info] at > org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119) > [info] at > org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286) > [info] at > org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42) > [info] at > org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41) > [info] at > 
org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) > [info] at scala.Option.map(Option.scala:146) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94) > [info] at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:84) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:40) > [info] at > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(UnivocityParserSuite.scala:30) > [info] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > [info] at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > [info] at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > [info] at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > [info] at java.lang.Class.newInstance(Class.java:442) > [info] at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435) > [info] at sbt.ForkMain$Run$2.call(ForkMain.java:296) > [info] at sbt.ForkMain$Run$2.call(ForkMain.java:286) > [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) > [info] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [info] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [info] at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Commented] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
[ https://issues.apache.org/jira/browse/SPARK-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487755#comment-16487755 ] Apache Spark commented on SPARK-24368: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21414 > Flaky tests: > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite > > > Key: SPARK-24368 > URL: https://issues.apache.org/jira/browse/SPARK-24368 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed > very often. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/ > {code} > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** > ABORTED *** (1 millisecond) > [info] java.lang.IllegalStateException: LiveListenerBus is stopped. 
> [info] at > org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97) > [info] at > org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80) > [info] at > org.apache.spark.sql.internal.SharedState.(SharedState.scala:93) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) > [info] at scala.Option.getOrElse(Option.scala:121) > [info] at > org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120) > [info] at > org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119) > [info] at > org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286) > [info] at > org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42) > [info] at > org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) > [info] at scala.Option.map(Option.scala:146) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94) > [info] at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:84) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:40) > [info] at > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(UnivocityParserSuite.scala:30) > [info] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > [info] at > 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > [info] at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > [info] at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > [info] at java.lang.Class.newInstance(Class.java:442) > [info] at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435) > [info] at sbt.ForkMain$Run$2.call(ForkMain.java:296) > [info] at sbt.ForkMain$Run$2.call(ForkMain.java:286) > [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) > [info] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [info] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [info] at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Assigned] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
[ https://issues.apache.org/jira/browse/SPARK-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24368: Assignee: Apache Spark > Flaky tests: > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite > > > Key: SPARK-24368 > URL: https://issues.apache.org/jira/browse/SPARK-24368 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed > very often. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/ > {code} > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** > ABORTED *** (1 millisecond) > [info] java.lang.IllegalStateException: LiveListenerBus is stopped. > [info] at > org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97) > [info] at > org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80) > [info] at > org.apache.spark.sql.internal.SharedState.(SharedState.scala:93) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) > [info] at scala.Option.getOrElse(Option.scala:121) > [info] at > org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120) > [info] at > org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119) > [info] at > org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286) > [info] at > org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42) > [info] at > org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41) 
> [info] at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) > [info] at scala.Option.map(Option.scala:146) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94) > [info] at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:84) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:40) > [info] at > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(UnivocityParserSuite.scala:30) > [info] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > [info] at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > [info] at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > [info] at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > [info] at java.lang.Class.newInstance(Class.java:442) > [info] at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435) > [info] at sbt.ForkMain$Run$2.call(ForkMain.java:296) > [info] at sbt.ForkMain$Run$2.call(ForkMain.java:286) > [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) > [info] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [info] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [info] at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487754#comment-16487754 ] Shivaram Venkataraman commented on SPARK-24359: --- I'd just like to echo the point on release and testing strategy raised by [~felixcheung] * For a new CRAN package tying it to the Spark release cycle can be especially challenging as it takes a bunch of iterations to get things right. * This also leads to the question of how the SparkML package APIs are going to depend on Spark versions. Are we only going to have code that depends on older Spark releases or are we going to have cases where we introduce the Java/Scala side code at the same time as the R API ? * One more idea could be to have a new repo in Apache that has its own release cycle (like the spark-website repo) > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. > Since Spark 1.5 the (new) SparkML API which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o] > has matured significantly. It allows users to build and maintain complicated > machine learning pipelines. A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of Apache Spark. This new package will be built on top of SparkR’s APIs > to expose SparkML’s pipeline APIs and functionality. > *Why not SparkR?* > SparkR package contains ~300 functions. 
Many of these shadow functions in > base and other popular CRAN packages. We think adding more functions to > SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. Also we propose a code-gen approach for this > package to minimize work needed to expose future MLlib API, but sparklyr’s API > is manually written. > h1. Target Personas > * Existing SparkR users who need more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independent from SparkR > * After setting up a Spark session R users can > * create a pipeline by chaining individual components and specifying their > parameters > * tune a pipeline in parallel, taking advantage of Spark > * inspect a pipeline’s parameters and evaluation metrics > * repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. Non-Goals > * Adding new algorithms to SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in API, we will choose based on the following > list of priorities. The API choice that addresses a higher priority goal will > be chosen. > # *Comprehensive coverage of MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity*: We attempt to minimize confusion with other packages. > Between conciseness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual > maintenance or make testing difficult should be avoided. 
> * *Interoperability with rest of Spark components:* We will keep the R API > as thin as possible and keep all functionality implementation in JVM/Scala. > * *Being natural to R users:* Ultimate users of this package are R users and > they should find it easy and natural to use. > The API will follow familiar R function syntax, where the object is passed as > the first argument of the method: do_something(obj, arg1, arg2). All > constructors are dot separated (e.g., spark.logistic.regression()) and all > setters and getters are snake case (e.g., set_max_iter()). If a constructor > gets arguments, they will be named arguments. For example: > {code:java} > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1) > {code} > When calls need to be chained,
[jira] [Assigned] (SPARK-23161) Add missing APIs to Python GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23161: Assignee: Apache Spark > Add missing APIs to Python GBTClassifier > > > Key: SPARK-23161 > URL: https://issues.apache.org/jira/browse/SPARK-23161 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Minor > Labels: starter > > GBTClassifier is missing \{{featureSubsetStrategy}}. This should be moved to > {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs. > GBTClassificationModel is missing {{numClasses}}. It should inherit from > {{JavaClassificationModel}} instead of prediction model which will give it > this param. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
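The refactoring the issue describes — moving a shared param into a common base so every ensemble class inherits it, the same way numClasses would come from inheriting JavaClassificationModel — can be sketched as follows. This is a hedged illustration only; the class and method names mimic the Scala-side structure but are not the actual pyspark.ml classes.

```python
# Hedged sketch of the inheritance idea from the issue: once a shared param
# lives in a common "TreeEnsembleParams"-style base, every tree-ensemble
# class gets it without duplication. Names are illustrative, not pyspark.ml.

class TreeEnsembleParams:
    """Base holding params common to all tree ensembles."""
    def __init__(self):
        self._feature_subset_strategy = "auto"

    def set_feature_subset_strategy(self, value):
        self._feature_subset_strategy = value
        return self

    def get_feature_subset_strategy(self):
        return self._feature_subset_strategy


class RandomForestClassifier(TreeEnsembleParams):
    pass


class GBTClassifier(TreeEnsembleParams):
    # Inherits featureSubsetStrategy "for free" once it lives in the base,
    # instead of re-declaring it per class.
    pass
```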
[jira] [Commented] (SPARK-23161) Add missing APIs to Python GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487739#comment-16487739 ] Apache Spark commented on SPARK-23161: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/21413 > Add missing APIs to Python GBTClassifier > > > Key: SPARK-23161 > URL: https://issues.apache.org/jira/browse/SPARK-23161 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Priority: Minor > Labels: starter > > GBTClassifier is missing \{{featureSubsetStrategy}}. This should be moved to > {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs. > GBTClassificationModel is missing {{numClasses}}. It should inherit from > {{JavaClassificationModel}} instead of prediction model which will give it > this param. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23161) Add missing APIs to Python GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23161: Assignee: (was: Apache Spark) > Add missing APIs to Python GBTClassifier > > > Key: SPARK-23161 > URL: https://issues.apache.org/jira/browse/SPARK-23161 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Priority: Minor > Labels: starter > > GBTClassifier is missing \{{featureSubsetStrategy}}. This should be moved to > {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs. > GBTClassificationModel is missing {{numClasses}}. It should inherit from > {{JavaClassificationModel}} instead of prediction model which will give it > this param. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487707#comment-16487707 ] Marcelo Vanzin commented on SPARK-21945: {noformat} $ cat test.py from pyspark import SparkContext, SparkConf import funcs sc = SparkContext() funcs.test() sc.stop() $ cat lib/funcs.py def test(): print "This is a test." [systest@vanzin-c5-1 ~]$ spark-submit --master yarn --py-files lib/funcs.py test.py SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.378171/lib/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Traceback (most recent call last): File "/home/systest/test.py", line 2, in import funcs ImportError: No module named funcs $ pyspark --master yarn --py-files lib/funcs.py [blah blah blah] Using Python version 2.7.5 (default, Sep 15 2016 22:37:39) SparkSession available as 'spark'. >>> import funcs >>> {noformat} > pyspark --py-files doesn't work in yarn client mode > --- > > Key: SPARK-21945 > URL: https://issues.apache.org/jira/browse/SPARK-21945 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > I tried running pyspark with --py-files pythonfiles.zip but it doesn't > properly add the zip file to the PYTHONPATH. > I can work around by exporting PYTHONPATH. > Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand I don't see > this supported at all. If that is the case perhaps it should be moved to > improvement. 
> Note it works via spark-submit in both client and cluster mode to run a Python > script. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
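The PYTHONPATH workaround mentioned in the report can be sketched in-session as well: prepend the dependency's directory to sys.path before importing, which mirrors what exporting PYTHONPATH before launching the shell achieves. The "lib" path below is illustrative, matching the repro above.

```python
# Hedged sketch of the workaround: until the shell launcher propagates
# --py-files onto PYTHONPATH, make the dependency importable by hand.
import os
import sys

lib = os.path.abspath("lib")      # directory holding funcs.py in the repro
if lib not in sys.path:
    sys.path.insert(0, lib)       # same effect as exporting PYTHONPATH first
# "import funcs" now resolves once lib/funcs.py exists on disk
```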
[jira] [Comment Edited] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487707#comment-16487707 ] Marcelo Vanzin edited comment on SPARK-21945 at 5/23/18 5:27 PM: - {noformat} $ cat test.py from pyspark import SparkContext, SparkConf import funcs sc = SparkContext() funcs.test() sc.stop() $ cat lib/funcs.py def test(): print "This is a test." $ spark-submit --master yarn --py-files lib/funcs.py test.py SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.378171/lib/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Traceback (most recent call last): File "/home/systest/test.py", line 2, in import funcs ImportError: No module named funcs $ pyspark --master yarn --py-files lib/funcs.py [blah blah blah] Using Python version 2.7.5 (default, Sep 15 2016 22:37:39) SparkSession available as 'spark'. >>> import funcs >>> {noformat} was (Author: vanzin): {noformat} $ cat test.py from pyspark import SparkContext, SparkConf import funcs sc = SparkContext() funcs.test() sc.stop() $ cat lib/funcs.py def test(): print "This is a test." [systest@vanzin-c5-1 ~]$ spark-submit --master yarn --py-files lib/funcs.py test.py SLF4J: Class path contains multiple SLF4J bindings. 
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.378171/lib/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Traceback (most recent call last): File "/home/systest/test.py", line 2, in import funcs ImportError: No module named funcs $ pyspark --master yarn --py-files lib/funcs.py [blah blah blah] Using Python version 2.7.5 (default, Sep 15 2016 22:37:39) SparkSession available as 'spark'. >>> import funcs >>> {noformat} > pyspark --py-files doesn't work in yarn client mode > --- > > Key: SPARK-21945 > URL: https://issues.apache.org/jira/browse/SPARK-21945 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > I tried running pyspark with --py-files pythonfiles.zip but it doesn't > properly add the zip file to the PYTHONPATH. > I can work around by exporting PYTHONPATH. > Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand I don't see > this supported at all. If that is the case perhaps it should be moved to > improvement. > Note it works via spark-submit in both client and cluster mode to run python > script. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22055) Port release scripts
[ https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487647#comment-16487647 ] Marcelo Vanzin commented on SPARK-22055: No, my script just prepares the release (asks questions and runs the existing scripts). You'd need to write a second script to inject stuff into the docker image and then run my script. > Port release scripts > > > Key: SPARK-22055 > URL: https://issues.apache.org/jira/browse/SPARK-22055 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Blocker > > The current Jenkins jobs are generated from scripts in a private repo. We > should port these to enable changes like SPARK-22054 . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
[ https://issues.apache.org/jira/browse/SPARK-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487516#comment-16487516 ] Xiao Li commented on SPARK-24368: - cc [~maxgekk] > Flaky tests: > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite > > > Key: SPARK-24368 > URL: https://issues.apache.org/jira/browse/SPARK-24368 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed > very often. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/ > {code} > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** > ABORTED *** (1 millisecond) > [info] java.lang.IllegalStateException: LiveListenerBus is stopped. > [info] at > org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97) > [info] at > org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80) > [info] at > org.apache.spark.sql.internal.SharedState.(SharedState.scala:93) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) > [info] at scala.Option.getOrElse(Option.scala:121) > [info] at > org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120) > [info] at > org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119) > [info] at > org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286) > [info] at > org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42) > [info] at > org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41) > 
[info] at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) > [info] at scala.Option.map(Option.scala:146) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95) > [info] at > org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94) > [info] at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:84) > [info] at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:40) > [info] at > org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(UnivocityParserSuite.scala:30) > [info] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > [info] at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > [info] at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > [info] at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > [info] at java.lang.Class.newInstance(Class.java:442) > [info] at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435) > [info] at sbt.ForkMain$Run$2.call(ForkMain.java:296) > [info] at sbt.ForkMain$Run$2.call(ForkMain.java:286) > [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) > [info] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [info] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [info] at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24368) Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
Xiao Li created SPARK-24368: --- Summary: Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite Key: SPARK-24368 URL: https://issues.apache.org/jira/browse/SPARK-24368 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.4.0 Reporter: Xiao Li org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite failed very often. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91020/testReport/org.apache.spark.sql.execution.datasources.csv/UnivocityParserSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/history/ {code} org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite *** ABORTED *** (1 millisecond) [info] java.lang.IllegalStateException: LiveListenerBus is stopped. [info] at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97) [info] at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80) [info] at org.apache.spark.sql.internal.SharedState.(SharedState.scala:93) [info] at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) [info] at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) [info] at scala.Option.getOrElse(Option.scala:121) [info] at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120) [info] at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119) [info] at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286) [info] at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42) [info] at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41) [info] at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) [info] at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) [info] at scala.Option.map(Option.scala:146) [info] at 
org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95) [info] at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94) [info] at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:125) [info] at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:84) [info] at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:40) [info] at org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite.(UnivocityParserSuite.scala:30) [info] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) [info] at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) [info] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [info] at java.lang.reflect.Constructor.newInstance(Constructor.java:422) [info] at java.lang.Class.newInstance(Class.java:442) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:435) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:296) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:286) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [info] at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18805) InternalMapWithStateDStream make java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-18805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487391#comment-16487391 ] Apache Spark commented on SPARK-18805: -- User 'chermenin' has created a pull request for this issue: https://github.com/apache/spark/pull/21412 > InternalMapWithStateDStream make java.lang.StackOverflowError > -- > > Key: SPARK-18805 > URL: https://issues.apache.org/jira/browse/SPARK-18805 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.3, 2.0.2 > Environment: mesos >Reporter: etienne >Priority: Major > > When loading InternalMapWithStateDStream from a checkpoint, > if isValidTime is true and there is no generatedRDD at the given time, > there is an infinite loop: > 1) compute is called on InternalMapWithStateDStream > 2) InternalMapWithStateDStream tries to generate the previous RDD > 3) The stream looks in generatedRDD to see whether the RDD is already generated for the given > time > 4) It does not find the RDD, so it checks whether the time is valid. 
> 5) If the time is valid, compute is called on InternalMapWithStateDStream again > 6) Restart from 1) > Here is the exception that illustrates this error > {code} > Exception in thread "streaming-start" java.lang.StackOverflowError > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
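The six-step cycle in the report reduces to a minimal sketch: a cache-miss handler that calls straight back into the computation that triggered the lookup, with nothing memoized to break the cycle. The classes below are hypothetical illustrations of that pattern, not the actual DStream code; the fix sketch shows how memoizing (and depending on an earlier time) terminates the recursion.

```python
# Hedged sketch of the infinite loop described above. StreamNoMemo mirrors the
# bug: get_or_compute misses the cache, sees a valid time, and calls compute,
# which asks get_or_compute for the same time again -- unbounded recursion.

class StreamNoMemo:
    def __init__(self):
        self.generated = {}           # time -> value ("generatedRDD" cache)

    def is_valid_time(self, t):
        return t >= 0

    def get_or_compute(self, t):
        if t in self.generated:       # step 3: look in the cache
            return self.generated[t]
        if self.is_valid_time(t):     # step 4: cache miss, time is valid...
            return self.compute(t)    # step 5: ...so compute is called again
        return None

    def compute(self, t):
        # steps 1-2: computing time t asks for the *same* t -> infinite loop
        return self.get_or_compute(t)


class StreamMemo(StreamNoMemo):
    def compute(self, t):
        # Fix sketch: depend on an earlier time and memoize the result,
        # so the recursion both makes progress and terminates on cache hits.
        prev = self.get_or_compute(t - 1) if t > 0 else 0
        self.generated[t] = prev + 1
        return self.generated[t]
```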
[jira] [Assigned] (SPARK-18805) InternalMapWithStateDStream make java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-18805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18805: Assignee: Apache Spark > InternalMapWithStateDStream make java.lang.StackOverflowError > -- > > Key: SPARK-18805 > URL: https://issues.apache.org/jira/browse/SPARK-18805 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.3, 2.0.2 > Environment: mesos >Reporter: etienne >Assignee: Apache Spark >Priority: Major > > When load InternalMapWithStateDStream from a check point. > If isValidTime is true and if there is no generatedRDD at the given time > there is an infinite loop. > 1) compute is call on InternalMapWithStateDStream > 2) InternalMapWithStateDStream try to generate the previousRDD > 3) Stream look in generatedRDD if the RDD is already generated for the given > time > 4) It not fund the rdd so it check if the time is valid. > 5) if the time is valid call compute on InternalMapWithStateDStream > 6) restart from 1) > Here the exception that illustrate this error > {code} > Exception in thread "streaming-start" java.lang.StackOverflowError > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at 
scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18805) InternalMapWithStateDStream make java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-18805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18805: Assignee: (was: Apache Spark) > InternalMapWithStateDStream make java.lang.StackOverflowError > -- > > Key: SPARK-18805 > URL: https://issues.apache.org/jira/browse/SPARK-18805 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.3, 2.0.2 > Environment: mesos >Reporter: etienne >Priority: Major > > When load InternalMapWithStateDStream from a check point. > If isValidTime is true and if there is no generatedRDD at the given time > there is an infinite loop. > 1) compute is call on InternalMapWithStateDStream > 2) InternalMapWithStateDStream try to generate the previousRDD > 3) Stream look in generatedRDD if the RDD is already generated for the given > time > 4) It not fund the rdd so it check if the time is valid. > 5) if the time is valid call compute on InternalMapWithStateDStream > 6) restart from 1) > Here the exception that illustrate this error > {code} > Exception in thread "streaming-start" java.lang.StackOverflowError > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at 
scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23711) Add fallback to interpreted execution logic
[ https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23711. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21106 [https://github.com/apache/spark/pull/21106] > Add fallback to interpreted execution logic > --- > > Key: SPARK-23711 > URL: https://issues.apache.org/jira/browse/SPARK-23711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23711) Add fallback to interpreted execution logic
[ https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23711: --- Assignee: Liang-Chi Hsieh > Add fallback to interpreted execution logic > --- > > Key: SPARK-23711 > URL: https://issues.apache.org/jira/browse/SPARK-23711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24313) Collection functions interpreted execution doesn't work with complex types
[ https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24313: Fix Version/s: 2.3.1 > Collection functions interpreted execution doesn't work with complex types > -- > > Key: SPARK-24313 > URL: https://issues.apache.org/jira/browse/SPARK-24313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Critical > Labels: correctness > Fix For: 2.3.1, 2.4.0 > > > Several functions working on collections return incorrect results for complex > data types in interpreted mode. In particular, we consider complex data types > such as BINARY and ARRAY. The list of affected functions is: {{array_contains}}, > {{array_position}}, {{element_at}} and {{GetMapValue}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24367) Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
[ https://issues.apache.org/jira/browse/SPARK-24367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487302#comment-16487302 ] Apache Spark commented on SPARK-24367: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21411 > Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY > > > Key: SPARK-24367 > URL: https://issues.apache.org/jira/browse/SPARK-24367 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Gengliang Wang >Priority: Major > > In the current Parquet version, the conf ENABLE_JOB_SUMMARY is deprecated. > When writing to Parquet files, the warning message "WARN > org.apache.parquet.hadoop.ParquetOutputFormat: Setting > parquet.enable.summary-metadata is deprecated, please use > parquet.summary.metadata.level" keeps showing up. > From > [https://github.com/apache/parquet-mr/blame/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L164] > we can see that we should use JOB_SUMMARY_LEVEL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
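The deprecation shim can be pictured with a small resolution function: if the new level-valued key is present it wins, otherwise the legacy boolean is mapped onto a level. This is a sketch of the idea only — the key strings follow the warning message quoted above, but the precedence rule and the "ALL" default are assumptions for illustration, not the verbatim parquet-mr logic.

```python
# Sketch of mapping a deprecated boolean flag onto a newer level-valued conf.
# Key strings come from the warning message quoted in the issue; precedence
# and the "ALL" default are assumed for this illustration.

def summary_level(conf):
    """Resolve the effective job-summary level from a conf dict of strings."""
    level = conf.get("parquet.summary.metadata.level")
    if level is not None:
        return level.upper()                  # new conf wins when both are set
    legacy = conf.get("parquet.enable.summary-metadata")
    if legacy is not None:                    # deprecated boolean flag
        return "ALL" if legacy.lower() == "true" else "NONE"
    return "ALL"                              # assumed default
```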
[jira] [Assigned] (SPARK-24367) Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
[ https://issues.apache.org/jira/browse/SPARK-24367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24367: Assignee: Apache Spark > Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY > > > Key: SPARK-24367 > URL: https://issues.apache.org/jira/browse/SPARK-24367 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > In the current Parquet version, the conf ENABLE_JOB_SUMMARY is deprecated. > When writing to Parquet files, the warning message "WARN > org.apache.parquet.hadoop.ParquetOutputFormat: Setting > parquet.enable.summary-metadata is deprecated, please use > parquet.summary.metadata.level" keeps showing up. > From > [https://github.com/apache/parquet-mr/blame/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L164] > we can see that we should use JOB_SUMMARY_LEVEL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24367) Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
[ https://issues.apache.org/jira/browse/SPARK-24367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24367: Assignee: (was: Apache Spark) > Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY > > > Key: SPARK-24367 > URL: https://issues.apache.org/jira/browse/SPARK-24367 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Gengliang Wang >Priority: Major > > In the current Parquet version, the conf ENABLE_JOB_SUMMARY is deprecated. > When writing to Parquet files, the warning message "WARN > org.apache.parquet.hadoop.ParquetOutputFormat: Setting > parquet.enable.summary-metadata is deprecated, please use > parquet.summary.metadata.level" keeps showing up. > From > [https://github.com/apache/parquet-mr/blame/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L164] > we can see that we should use JOB_SUMMARY_LEVEL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24367) Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
Gengliang Wang created SPARK-24367: -- Summary: Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY Key: SPARK-24367 URL: https://issues.apache.org/jira/browse/SPARK-24367 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Gengliang Wang In the current Parquet version, the conf ENABLE_JOB_SUMMARY is deprecated. When writing to Parquet files, the warning message "WARN org.apache.parquet.hadoop.ParquetOutputFormat: Setting parquet.enable.summary-metadata is deprecated, please use parquet.summary.metadata.level" keeps showing up. From [https://github.com/apache/parquet-mr/blame/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L164] we can see that we should use JOB_SUMMARY_LEVEL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
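For reference, the replacement is not one-to-one: the deprecated boolean collapses into a three-valued level setting. A hedged sketch of the mapping follows; the helper itself is illustrative and not parquet-mr code, and only the two config keys and the ALL/COMMON_ONLY/NONE values come from the warning text and the linked ParquetOutputFormat source.

```java
// parquet.summary.metadata.level accepts ALL, COMMON_ONLY, or NONE.
// The deprecated boolean parquet.enable.summary-metadata maps
// true -> ALL and false -> NONE; the new key, when set, wins.
public class SummaryLevelSketch {
    static String effectiveLevel(String level, Boolean deprecatedEnable) {
        if (level != null) {
            return level;                       // explicit new-style setting
        }
        if (deprecatedEnable != null) {
            return deprecatedEnable ? "ALL" : "NONE";
        }
        return "ALL";                           // assumed parquet-mr default
    }

    public static void main(String[] args) {
        System.out.println(effectiveLevel(null, true));          // ALL
        System.out.println(effectiveLevel(null, false));         // NONE
        System.out.println(effectiveLevel("COMMON_ONLY", true)); // COMMON_ONLY
    }
}
```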
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487209#comment-16487209 ] Maciej Szymkiewicz commented on SPARK-24359: Just my two cents: * As proposed right now, wouldn't it duplicate a significant part of the existing ML API, and if it does, does it mean we should deprecate the current API? Having two different but compatible APIs sounds like a recipe for confusion. Not to mention that duplicated tests can be a deal breaker, especially when the CI pipeline is already incredibly heavy. * To concur with [~felixcheung], maintaining a CRAN package is a significant maintenance burden. Adding another package, tightly bound to the main release cycle, might have a huge impact on the overall release process. * If the package is to be designed to be mostly independent of the current R API, why not create a separate package, not maintained by the Apache Foundation? * I am not sure if anything changed lately, but based on my previous experience, there are not enough hands to keep the current API up-to-date. Unless there is enough support from the stakeholders, it might end up as mostly unmaintained deadweight. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. > Since Spark 1.5 the (new) SparkML API, which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], > has matured significantly. It allows users to build and maintain complicated > machine learning pipelines. 
A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of Apache Spark. This new package will be built on top of SparkR’s APIs > to expose SparkML’s pipeline APIs and functionality. > *Why not SparkR?* > The SparkR package contains ~300 functions. Many of these shadow functions in > base and other popular CRAN packages. We think adding more functions to > SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. Also, we propose a code-gen approach for this > package to minimize the work needed to expose future MLlib APIs, whereas sparklyr’s API > is manually written. > h1. Target Personas > * Existing SparkR users who need a more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independently of SparkR > * After setting up a Spark session, R users can > * create a pipeline by chaining individual components and specifying their > parameters > * tune a pipeline in parallel, taking advantage of Spark > * inspect a pipeline’s parameters and evaluation metrics > * repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. Non-Goals > * Adding new algorithms to the SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing the existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in the API, we will choose based on the following > list of priorities. The API choice that addresses a higher priority goal will > be chosen. 
> # *Comprehensive coverage of MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity*: We attempt to minimize confusion with other packages. > Between conciseness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual > maintenance or make testing difficult should be avoided. > * *Interoperability with the rest of Spark components:* We will keep the R API > as thin as possible and keep all functionality implemented in JVM/Scala. > * *Being natural to R users:* The ultimate users of this package are R users and > they should find it easy and natural to use. > The API will follow familiar R function syntax, where the object is passed as > the first argument of the method: do_something(obj, arg1, arg2). All > constructors are
[jira] [Updated] (SPARK-24366) Improve error message for Catalyst type converters
[ https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-24366: --- Summary: Improve error message for Catalyst type converters (was: Improve error message for type converting) > Improve error message for Catalyst type converters > -- > > Key: SPARK-24366 > URL: https://issues.apache.org/jira/browse/SPARK-24366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > Users have no way to drill down to understand which of the hundreds of fields > in millions of records feeding into the job are causing the problem. We should > show where in the schema the error is happening. > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: > start (of class java.lang.String) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at >
[jira] [Updated] (SPARK-24366) Improve error message for type converting
[ https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-24366: --- Summary: Improve error message for type converting (was: Improve error message for type conversions) > Improve error message for type converting > - > > Key: SPARK-24366 > URL: https://issues.apache.org/jira/browse/SPARK-24366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > Users have no way to drill down to understand which of the hundreds of fields > in millions of records feeding into the job are causing the problem. We should > show where in the schema the error is happening. > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: > start (of class java.lang.String) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at >
[jira] [Commented] (SPARK-24366) Improve error message for type conversions
[ https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487171#comment-16487171 ] Apache Spark commented on SPARK-24366: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21410 > Improve error message for type conversions > -- > > Key: SPARK-24366 > URL: https://issues.apache.org/jira/browse/SPARK-24366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > Users have no way to drill down to understand which of the hundreds of fields > in millions of records feeding into the job are causing the problem. We should > show where in the schema the error is happening. > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: > start (of class java.lang.String) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at >
[jira] [Assigned] (SPARK-24366) Improve error message for type conversions
[ https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24366: Assignee: (was: Apache Spark) > Improve error message for type conversions > -- > > Key: SPARK-24366 > URL: https://issues.apache.org/jira/browse/SPARK-24366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > Users have no way to drill down to understand which of the hundreds of fields > in millions of records feeding into the job are causing the problem. We should > show where in the schema the error is happening. > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: > start (of class java.lang.String) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at
[jira] [Assigned] (SPARK-24366) Improve error message for type conversions
[ https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24366: Assignee: Apache Spark > Improve error message for type conversions > -- > > Key: SPARK-24366 > URL: https://issues.apache.org/jira/browse/SPARK-24366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.3.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Users have no way to drill down to understand which of the hundreds of fields > in millions of records feeding into the job are causing the problem. We should > show where in the schema the error is happening. > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: > start (of class java.lang.String) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > at >
[jira] [Created] (SPARK-24366) Improve error message for type conversions
Maxim Gekk created SPARK-24366: -- Summary: Improve error message for type conversions Key: SPARK-24366 URL: https://issues.apache.org/jira/browse/SPARK-24366 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0, 1.6.3 Reporter: Maxim Gekk Users have no way to drill down to understand which of the hundreds of fields in millions of records feeding into the job are causing the problem. We should show where in the schema the error is happening. {code:java} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: start (of class java.lang.String) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161) at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) at
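[Editorial note] The stack trace above shows a bare scala.MatchError with no indication of which field failed. A minimal, hypothetical sketch (plain Python, not Spark's actual CatalystTypeConverters) of what the issue asks for — carrying the schema path down the recursion so the error names the offending field:

```python
# Illustrative only: a toy recursive converter that tracks the schema path
# ("root.a[1].ts") so a type mismatch names the field, instead of raising a
# bare MatchError. Schema encoding here is invented for the sketch:
# ("struct", [(name, field_schema), ...]), ("array", elem_schema), ("int",).

def convert(value, schema, path="root"):
    kind = schema[0]
    if kind == "struct":
        if not isinstance(value, dict):
            raise TypeError(f"{path}: expected struct, got {type(value).__name__}")
        return {name: convert(value.get(name), fs, f"{path}.{name}")
                for name, fs in schema[1]}
    if kind == "array":
        if not isinstance(value, list):
            raise TypeError(f"{path}: expected array, got {type(value).__name__}")
        return [convert(v, schema[1], f"{path}[{i}]") for i, v in enumerate(value)]
    if kind == "int":
        if not isinstance(value, int):
            raise TypeError(f"{path}: expected int, got {value!r}")
        return value
    raise ValueError(f"unknown schema kind {kind!r}")

schema = ("struct", [("a", ("array", ("struct", [("ts", ("int",))])))])
bad = {"a": [{"ts": 1}, {"ts": "start"}]}  # a string where an int is expected
try:
    convert(bad, schema)
except TypeError as e:
    print(e)  # → root.a[1].ts: expected int, got 'start'
```

The point is only that threading a path string through the recursion is cheap and turns "scala.MatchError: start" into an error a user can act on.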
[jira] [Commented] (SPARK-24341) Codegen compile error from predicate subquery
[ https://issues.apache.org/jira/browse/SPARK-24341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487120#comment-16487120 ] Apache Spark commented on SPARK-24341: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21403 > Codegen compile error from predicate subquery > - > > Key: SPARK-24341 > URL: https://issues.apache.org/jira/browse/SPARK-24341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Juliusz Sompolski >Priority: Minor > > Ran on master: > {code} > drop table if exists juleka; > drop table if exists julekb; > create table juleka (a integer, b integer); > create table julekb (na integer, nb integer); > insert into juleka values (1,1); > insert into julekb values (1,1); > select * from juleka where (a, b) not in (select (na, nb) from julekb); > {code} > Results in: > {code} > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 27, Column 29: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 27, Column 29: Cannot compare types "int" and > "org.apache.spark.sql.catalyst.InternalRow" > at > com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) > at > com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344) > at > com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278) > at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193) > at 
com.google.common.cache.LocalCache.get(LocalCache.java:3932) > at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936) > at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46) > at > org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203) > at > scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) > at > scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46) > at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389) > at > org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49) > at > org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126) > at > org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:111) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349) > at >
[jira] [Assigned] (SPARK-24341) Codegen compile error from predicate subquery
[ https://issues.apache.org/jira/browse/SPARK-24341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24341: Assignee: (was: Apache Spark) > Codegen compile error from predicate subquery > - > > Key: SPARK-24341 > URL: https://issues.apache.org/jira/browse/SPARK-24341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Juliusz Sompolski >Priority: Minor > > Ran on master: > {code} > drop table if exists juleka; > drop table if exists julekb; > create table juleka (a integer, b integer); > create table julekb (na integer, nb integer); > insert into juleka values (1,1); > insert into julekb values (1,1); > select * from juleka where (a, b) not in (select (na, nb) from julekb); > {code} > Results in: > {code} > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 27, Column 29: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 27, Column 29: Cannot compare types "int" and > "org.apache.spark.sql.catalyst.InternalRow" > at > com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) > at > com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344) > at > com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278) > at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193) > at com.google.common.cache.LocalCache.get(LocalCache.java:3932) > at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936) > at > 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46) > at > org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203) > at > scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) > at > scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46) > at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389) > at > 
org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49) > at > org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126) > at > org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:111) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at >
[jira] [Assigned] (SPARK-24341) Codegen compile error from predicate subquery
[ https://issues.apache.org/jira/browse/SPARK-24341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24341: Assignee: Apache Spark > Codegen compile error from predicate subquery > - > > Key: SPARK-24341 > URL: https://issues.apache.org/jira/browse/SPARK-24341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Juliusz Sompolski >Assignee: Apache Spark >Priority: Minor > > Ran on master: > {code} > drop table if exists juleka; > drop table if exists julekb; > create table juleka (a integer, b integer); > create table julekb (na integer, nb integer); > insert into juleka values (1,1); > insert into julekb values (1,1); > select * from juleka where (a, b) not in (select (na, nb) from julekb); > {code} > Results in: > {code} > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 27, Column 29: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 27, Column 29: Cannot compare types "int" and > "org.apache.spark.sql.catalyst.InternalRow" > at > com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) > at > com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344) > at > com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278) > at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193) > at com.google.common.cache.LocalCache.get(LocalCache.java:3932) > at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936) > at > 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46) > at > org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203) > at > scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) > at > scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46) > at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389) > at > 
org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49) > at > org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126) > at > org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:111) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at >
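[Editorial note] The compile error above ("Cannot compare types "int" and "org.apache.spark.sql.catalyst.InternalRow"") comes from codegen trying to compare a scalar column with a struct row in the multi-column NOT IN predicate. For reference, a sketch (plain Python, not Spark's generated code) of the null-aware three-valued semantics such a predicate must implement — None standing in for SQL NULL/UNKNOWN:

```python
# Illustrative sketch of SQL's three-valued logic for a multi-column NOT IN,
# i.e. the predicate the failing codegen is supposed to produce.

def eq3(x, y):
    """SQL equality: NULL on either side yields UNKNOWN (None)."""
    if x is None or y is None:
        return None
    return x == y

def tuple_in(row, subquery_rows):
    """(a, b) IN (subquery) under three-valued logic."""
    result = False
    for sub in subquery_rows:
        eqs = [eq3(x, y) for x, y in zip(row, sub)]
        if all(e is True for e in eqs):
            return True  # a definite match
        if any(e is None for e in eqs) and not any(e is False for e in eqs):
            result = None  # this row's comparison is UNKNOWN
    return result

def tuple_not_in(row, subquery_rows):
    inner = tuple_in(row, subquery_rows)
    return None if inner is None else not inner

print(tuple_not_in((1, 1), [(1, 1)]))    # → False: row matches, filtered out
print(tuple_not_in((1, 2), [(1, 1)]))    # → True: row survives
print(tuple_not_in((1, None), [(1, 1)])) # → None: UNKNOWN, also filtered out
```

This is why NOT IN with NULLs needs a null-aware anti-join rather than a plain equality comparison — which is the shape of predicate the generated code fails to compile here.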
[jira] [Comment Edited] (SPARK-24324) UserDefinedFunction mixes column labels
[ https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487110#comment-16487110 ] Cristian Consonni edited comment on SPARK-24324 at 5/23/18 11:36 AM: - [~hyukjin.kwon] said: > Ah, I meant shorter reproducer should make other guys easier to take a look. Here, half the number of lines of code, same behavior. {code:python} #!/usr/bin/env python3 # coding: utf-8 from datetime import datetime import findspark findspark.init() import pyspark from pyspark.sql.types import StructType, StructField, StringType, IntegerType from pyspark.sql.functions import pandas_udf, PandasUDFType import pandas as pd input_data = [('en1', 'a', 150), ('en2', 'b', 148), ('en3', 'c', 197), ('en4', 'd', 145), ('en5', 'e', 131), ('en6', 'f', 154), ('en7', 'g', 142) ] sc = pyspark.SparkContext(appName="udf_example") sqlctx = pyspark.SQLContext(sc) schema = StructType([StructField("lang", StringType(), False), StructField("page", StringType(), False), StructField("views", IntegerType(), False)]) new_schema = StructType([StructField("lang", StringType(), False), StructField("page", StringType(), False), StructField("foo", StringType(), False)]) rdd = sc.parallelize(input_data) df = sqlctx.createDataFrame(rdd, schema) df.show() @pandas_udf(new_schema, PandasUDFType.GROUPED_MAP) def myudf(x): foo = 'foo' # this gives the expect results: # return pd.DataFrame({'foo': x.lang, 'lang': x.page, 'page': foo}) # this mixes columns return pd.DataFrame({'lang': x.lang, 'page': x.page, 'foo': foo}) grp_df = df.select(['lang', 'page', 'views']).groupby(['lang','page']) grp_df = grp_df.apply(myudf).dropDuplicates() grp_df.show() {code} I am also wondering if this is a pyspark issue or a pandas issue. was (Author: cristiancantoro): [~hyukjin.kwon] said: > Ah, I meant shorter reproducer should make other guys easier to take a look. Here, half the number of lines of code, same behavior. 
{code:python} #!/usr/bin/env python3 # coding: utf-8 from datetime import datetime import findspark findspark.init() import pyspark from pyspark.sql.types import StructType, StructField, StringType, IntegerType from pyspark.sql.functions import pandas_udf, PandasUDFType import pandas as pd input_data = [('en1', 'a', 150), ('en2', 'b', 148), ('en3', 'c', 197), ('en4', 'd', 145), ('en5', 'e', 131), ('en6', 'f', 154), ('en7', 'g', 142) ] sc = pyspark.SparkContext(appName="udf_example") sqlctx = pyspark.SQLContext(sc) schema = StructType([StructField("lang", StringType(), False), StructField("page", StringType(), False), StructField("views", IntegerType(), False)]) new_schema = StructType([StructField("lang", StringType(), False), StructField("page", StringType(), False), StructField("foo", StringType(), False)]) rdd = sc.parallelize(input_data) df = sqlctx.createDataFrame(rdd, schema) df.show() @pandas_udf(new_schema, PandasUDFType.GROUPED_MAP) def myudf(x): foo = 'foo' # this gives the expect results: # return pd.DataFrame({'foo': x.lang, 'lang': x.page, 'page': foo}) # this mixes columns return pd.DataFrame({'lang': x.lang, 'page': x.page, 'foo': foo}) grp_df = df.select(['lang', 'page', 'views']).groupby(['lang','page']) grp_df = grp_df.apply(myudf).dropDuplicates() grp_df.show() {code} I am aslo wondering if this is a pyspark issue or a pandas issue. 
> UserDefinedFunction mixes column labels > --- > > Key: SPARK-24324 > URL: https://issues.apache.org/jira/browse/SPARK-24324 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Python (using virtualenv): > {noformat} > $ python --version > Python 3.6.5 > {noformat} > Modules installed: > {noformat} > arrow==0.12.1 > backcall==0.1.0 > bleach==2.1.3 > chardet==3.0.4 > decorator==4.3.0 > entrypoints==0.2.3 > findspark==1.2.0 > html5lib==1.0.1 > ipdb==0.11 > ipykernel==4.8.2 > ipython==6.3.1 > ipython-genutils==0.2.0 > ipywidgets==7.2.1 > jedi==0.12.0 > Jinja2==2.10 > jsonschema==2.6.0 > jupyter==1.0.0 > jupyter-client==5.2.3 > jupyter-console==5.2.0 > jupyter-core==4.4.0 > MarkupSafe==1.0 > mistune==0.8.3 > nbconvert==5.3.1 > nbformat==4.4.0 > notebook==5.5.0 > numpy==1.14.3 > pandas==0.22.0 > pandocfilters==1.4.2 > parso==0.2.0 > pbr==3.1.1 > pexpect==4.5.0 > pickleshare==0.7.4 > progressbar2==3.37.1 > prompt-toolkit==1.0.15 > ptyprocess==0.5.2 > pyarrow==0.9.0 >
[jira] [Commented] (SPARK-24324) UserDefinedFunction mixes column labels
[ https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487110#comment-16487110 ] Cristian Consonni commented on SPARK-24324: --- [~hyukjin.kwon] said: > Ah, I meant shorter reproducer should make other guys easier to take a look. Here, half the number of lines of code, same behavior. {code:python} #!/usr/bin/env python3 # coding: utf-8 from datetime import datetime import findspark findspark.init() import pyspark from pyspark.sql.types import StructType, StructField, StringType, IntegerType from pyspark.sql.functions import pandas_udf, PandasUDFType import pandas as pd input_data = [('en1', 'a', 150), ('en2', 'b', 148), ('en3', 'c', 197), ('en4', 'd', 145), ('en5', 'e', 131), ('en6', 'f', 154), ('en7', 'g', 142) ] sc = pyspark.SparkContext(appName="udf_example") sqlctx = pyspark.SQLContext(sc) schema = StructType([StructField("lang", StringType(), False), StructField("page", StringType(), False), StructField("views", IntegerType(), False)]) new_schema = StructType([StructField("lang", StringType(), False), StructField("page", StringType(), False), StructField("foo", StringType(), False)]) rdd = sc.parallelize(input_data) df = sqlctx.createDataFrame(rdd, schema) df.show() @pandas_udf(new_schema, PandasUDFType.GROUPED_MAP) def myudf(x): foo = 'foo' # this gives the expected results: # return pd.DataFrame({'foo': x.lang, 'lang': x.page, 'page': foo}) # this mixes columns return pd.DataFrame({'lang': x.lang, 'page': x.page, 'foo': foo}) grp_df = df.select(['lang', 'page', 'views']).groupby(['lang','page']) grp_df = grp_df.apply(myudf).dropDuplicates() grp_df.show() {code} I am also wondering if this is a pyspark issue or a pandas issue. 
> UserDefinedFunction mixes column labels > --- > > Key: SPARK-24324 > URL: https://issues.apache.org/jira/browse/SPARK-24324 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Python (using virtualenv): > {noformat} > $ python --version > Python 3.6.5 > {noformat} > Modules installed: > {noformat} > arrow==0.12.1 > backcall==0.1.0 > bleach==2.1.3 > chardet==3.0.4 > decorator==4.3.0 > entrypoints==0.2.3 > findspark==1.2.0 > html5lib==1.0.1 > ipdb==0.11 > ipykernel==4.8.2 > ipython==6.3.1 > ipython-genutils==0.2.0 > ipywidgets==7.2.1 > jedi==0.12.0 > Jinja2==2.10 > jsonschema==2.6.0 > jupyter==1.0.0 > jupyter-client==5.2.3 > jupyter-console==5.2.0 > jupyter-core==4.4.0 > MarkupSafe==1.0 > mistune==0.8.3 > nbconvert==5.3.1 > nbformat==4.4.0 > notebook==5.5.0 > numpy==1.14.3 > pandas==0.22.0 > pandocfilters==1.4.2 > parso==0.2.0 > pbr==3.1.1 > pexpect==4.5.0 > pickleshare==0.7.4 > progressbar2==3.37.1 > prompt-toolkit==1.0.15 > ptyprocess==0.5.2 > pyarrow==0.9.0 > Pygments==2.2.0 > python-dateutil==2.7.2 > python-utils==2.3.0 > pytz==2018.4 > pyzmq==17.0.0 > qtconsole==4.3.1 > Send2Trash==1.5.0 > simplegeneric==0.8.1 > six==1.11.0 > SQLAlchemy==1.2.7 > stevedore==1.28.0 > terminado==0.8.1 > testpath==0.3.1 > tornado==5.0.2 > traitlets==4.3.2 > virtualenv==15.1.0 > virtualenv-clone==0.2.6 > virtualenvwrapper==4.7.2 > wcwidth==0.1.7 > webencodings==0.5.1 > widgetsnbextension==3.2.1 > {noformat} > > Java: > {noformat} > $ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > System: > {noformat} > $ lsb_release -a > No LSB modules are available. 
> Distributor ID: Ubuntu > Description: Ubuntu 16.04.4 LTS > Release: 16.04 > Codename: xenial > {noformat} >Reporter: Cristian Consonni >Priority: Major > > I am working on Wikipedia page views (see [task T188041 on Wikimedia's > Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's > say that these are the data: > {noformat} > > {noformat} > For each combination of (lang, page, day(timestamp)) I need to transform the > views for each hour: > {noformat} > 00:00 -> A > 01:00 -> B > ... > {noformat} > and concatenate the number of views for that hour. So, if a page got 5 views > at 00:00 and 7 views at 01:00 it would become: > {noformat} > A5B7 > {noformat} > > I have written a UDF called {code:python}concat_hours{code} > However, the function is mixing the columns and I am not sure what is going > on. I wrote here a minimal complete example that reproduces the issue on my > system (the details of my environment are above). > {code:python} > #!/usr/bin/env python3 > # coding: utf-8 > input_data = b"""en Albert_Camus 20071210-00 150
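[Editorial note] The "mixed columns" behavior in the reproducer above is consistent with two facts combined: pandas 0.22 (the reporter's version) sorts dict keys alphabetically when building a DataFrame, and Spark 2.3 matches GROUPED_MAP output columns to the declared schema by position, not by name. That would put ('foo', 'lang', 'page') under the schema ('lang', 'page', 'foo') — and it matches the reporter's commented-out "works" variant, which pre-compensates for the alphabetical order. This is an inference, not a confirmed diagnosis; a pandas-only sketch of the workaround (reorder the returned frame to the schema's column order before returning):

```python
# Sketch of the likely mechanism and fix, using pandas only. SCHEMA_COLS and
# myudf_fixed are illustrative names; the schema order is the one declared in
# the reporter's new_schema.
import pandas as pd

SCHEMA_COLS = ["lang", "page", "foo"]

def myudf_fixed(x: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame({"lang": x.lang, "page": x.page, "foo": "foo"})
    # Force column order to match the declared schema, so positional matching
    # in Spark 2.3's GROUPED_MAP path lines up regardless of pandas' dict-key
    # ordering behavior.
    return out[SCHEMA_COLS]

batch = pd.DataFrame({"lang": ["en1"], "page": ["a"], "views": [150]})
print(list(myudf_fixed(batch).columns))  # → ['lang', 'page', 'foo']
```

If this is the cause, it would make the report a pyspark column-matching issue surfaced by a pandas ordering quirk, rather than a bug in either alone.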
[jira] [Commented] (SPARK-24365) Add Parquet write benchmark
[ https://issues.apache.org/jira/browse/SPARK-24365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487050#comment-16487050 ] Apache Spark commented on SPARK-24365: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21409 > Add Parquet write benchmark > --- > > Key: SPARK-24365 > URL: https://issues.apache.org/jira/browse/SPARK-24365 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Gengliang Wang >Priority: Major > > Add Parquet write benchmark. So that it would be easier to measure the writer > performance. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24365) Add Parquet write benchmark
[ https://issues.apache.org/jira/browse/SPARK-24365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24365: Assignee: (was: Apache Spark) > Add Parquet write benchmark > --- > > Key: SPARK-24365 > URL: https://issues.apache.org/jira/browse/SPARK-24365 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Gengliang Wang >Priority: Major > > Add Parquet write benchmark. So that it would be easier to measure the writer > performance.
[jira] [Assigned] (SPARK-24365) Add Parquet write benchmark
[ https://issues.apache.org/jira/browse/SPARK-24365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24365: Assignee: Apache Spark > Add Parquet write benchmark > --- > > Key: SPARK-24365 > URL: https://issues.apache.org/jira/browse/SPARK-24365 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Add Parquet write benchmark. So that it would be easier to measure the writer > performance.