[jira] [Commented] (SPARK-19811) sparksql 2.1 can not prune hive partition
[ https://issues.apache.org/jira/browse/SPARK-19811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948299#comment-16948299 ] Vasu Bajaj commented on SPARK-19811: It's a bit late, but the issue arises when you use the partition name in upper case. Try using the partition name in lower case; that solved the issue for me. > sparksql 2.1 can not prune hive partition > -- > > Key: SPARK-19811 > URL: https://issues.apache.org/jira/browse/SPARK-19811 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: sydt >Priority: Major > > When Spark SQL 2.1 executes SQL, it throws the error: > java.lang.RuntimeException: Expected only partition pruning predicates: > (isnotnull(DAY_ID#216) && (DAY_ID#216 = 20170212)), and the SQL statement is > select PROD_INST_ID from CRM_DB.ITG_PROD_INST WHERE DAY_ID='20170212' AND > PROV_ID ='842' limit 10; where DAY_ID and PROV_ID are partition columns in Hive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19811) sparksql 2.1 can not prune hive partition
[ https://issues.apache.org/jira/browse/SPARK-19811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948299#comment-16948299 ] Vasu Bajaj edited comment on SPARK-19811 at 10/10/19 8:00 AM: -- The issue arises when you use the partition name in upper case. Try using the partition name in lower case; that solved the issue for me. was (Author: vasubajaj): It's a bit late, but the issue arises when you use the partition name in upper case. Try using the partition name in lower case; that solved the issue for me. > sparksql 2.1 can not prune hive partition > -- > > Key: SPARK-19811 > URL: https://issues.apache.org/jira/browse/SPARK-19811 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: sydt >Priority: Major > > When Spark SQL 2.1 executes SQL, it throws the error: > java.lang.RuntimeException: Expected only partition pruning predicates: > (isnotnull(DAY_ID#216) && (DAY_ID#216 = 20170212)), and the SQL statement is > select PROD_INST_ID from CRM_DB.ITG_PROD_INST WHERE DAY_ID='20170212' AND > PROV_ID ='842' limit 10; where DAY_ID and PROV_ID are partition columns in Hive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
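For readers hitting the same error, here is a minimal sketch of the workaround described in the comment above, assuming a Hive table partitioned by day_id and prov_id; the lower-case table and column names are illustrative, not taken from the original report.

{code:scala}
// Workaround sketch: reference the partition columns in lower case so the
// predicates are recognized as partition pruning predicates.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical table crm_db.itg_prod_inst, partitioned by day_id and prov_id.
spark.sql(
  """SELECT prod_inst_id
    |FROM crm_db.itg_prod_inst
    |WHERE day_id = '20170212' AND prov_id = '842'
    |LIMIT 10""".stripMargin)
  .show()
{code}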
[jira] [Created] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
pin_zhang created SPARK-29423: - Summary: leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus Key: SPARK-29423 URL: https://issues.apache.org/jira/browse/SPARK-29423 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: pin_zhang 1. Start the server with start-thriftserver.sh. 2. A JDBC client repeatedly connects to and disconnects from HiveServer2: for (int i = 0; i < 1; i++) { Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:1", "test", ""); conn.close(); } 3. Instances of org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus keep increasing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
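The loop bound and port in the reported snippet appear truncated, so here is a hedged, self-contained version of the reproduction; the iteration count and the HiveServer2 port (10000) are placeholder assumptions, and the Hive JDBC driver must be on the classpath.

{code:scala}
// Reproduction sketch: repeatedly open and close JDBC connections to the
// Spark Thrift Server, then inspect a heap dump for accumulating
// StreamingQueryListenerBus instances.
import java.sql.DriverManager

object ListenerBusLeakRepro {
  def main(args: Array[String]): Unit = {
    val url = "jdbc:hive2://localhost:10000" // assumed default HiveServer2 port
    for (_ <- 1 to 10000) {                  // assumed iteration count
      val conn = DriverManager.getConnection(url, "test", "")
      conn.close()
    }
  }
}
{code}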
[jira] [Created] (SPARK-29424) Prevent Spark to committing stage of too much Task
angerszhu created SPARK-29424: - Summary: Prevent Spark to committing stage of too much Task Key: SPARK-29424 URL: https://issues.apache.org/jira/browse/SPARK-29424 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29424: -- Description: Our users often submit bad SQL through the query platform, such as: # writing a wrong join condition but submitting the SQL anyway # writing a wrong where condition # etc. Such queries make the Spark scheduler submit a huge number of tasks. This makes Spark run very slowly and impacts other users (Spark Thrift Server), and can even run out of memory because of the many objects generated by such a large number of tasks. So I added a constraint when submitting tasks. I wonder if the community would accept it. > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our users often submit bad SQL through the query platform, such as: > # writing a wrong join condition but submitting the SQL anyway > # writing a wrong where condition > # etc. > Such queries make the Spark scheduler submit a huge number of tasks. This makes > Spark run very slowly and impacts other users (Spark Thrift Server), and can even run > out of memory because of the many objects generated by such a large number of tasks. > So I added a constraint when submitting tasks. I wonder if the community would accept > it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29424: -- Description: Our users often submit bad SQL through the query platform, such as: # writing a wrong join condition but submitting the SQL anyway # writing a wrong where condition # etc. Such queries make the Spark scheduler submit a huge number of tasks. This makes Spark run very slowly and impacts other users (Spark Thrift Server), and can even run out of memory because of the many objects generated by such a large number of tasks. So I added a constraint when submitting tasks that aborts the stage early when the TaskSet size exceeds a configured limit. I wonder if the community would accept this approach. cc [~srowen] was: Our users often submit bad SQL through the query platform, such as: # writing a wrong join condition but submitting the SQL anyway # writing a wrong where condition # etc. Such queries make the Spark scheduler submit a huge number of tasks. This makes Spark run very slowly and impacts other users (Spark Thrift Server), and can even run out of memory because of the many objects generated by such a large number of tasks. So I added a constraint when submitting tasks. I wonder if the community would accept it. > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our users often submit bad SQL through the query platform, such as: > # writing a wrong join condition but submitting the SQL anyway > # writing a wrong where condition > # etc. > Such queries make the Spark scheduler submit a huge number of tasks. This makes > Spark run very slowly and impacts other users (Spark Thrift Server), and can even run > out of memory because of the many objects generated by such a large number of tasks. > So I added a constraint when submitting tasks that aborts the stage early when the TaskSet > size exceeds a configured limit. I wonder if the community would accept this approach. > cc [~srowen] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29424: -- Description: Our users often submit bad SQL through the query platform, such as: # writing a wrong join condition but submitting the SQL anyway # writing a wrong where condition # etc. Such queries make the Spark scheduler submit a huge number of tasks. This makes Spark run very slowly and impacts other users (Spark Thrift Server), and can even run out of memory because of the many objects generated by such a large number of tasks. So I added a constraint when submitting tasks that aborts the stage early when the TaskSet size exceeds a configured limit. I wonder if the community would accept this approach. cc [~srowen] [~dongjoon] [~yumwang] was: Our users often submit bad SQL through the query platform, such as: # writing a wrong join condition but submitting the SQL anyway # writing a wrong where condition # etc. Such queries make the Spark scheduler submit a huge number of tasks. This makes Spark run very slowly and impacts other users (Spark Thrift Server), and can even run out of memory because of the many objects generated by such a large number of tasks. So I added a constraint when submitting tasks that aborts the stage early when the TaskSet size exceeds a configured limit. I wonder if the community would accept this approach. cc [~srowen] > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our users often submit bad SQL through the query platform, such as: > # writing a wrong join condition but submitting the SQL anyway > # writing a wrong where condition > # etc. > Such queries make the Spark scheduler submit a huge number of tasks. This makes > Spark run very slowly and impacts other users (Spark Thrift Server), and can even run > out of memory because of the many objects generated by such a large number of tasks. > So I added a constraint when submitting tasks that aborts the stage early when the TaskSet > size exceeds a configured limit. I wonder if the community would accept this approach. > cc [~srowen] [~dongjoon] [~yumwang] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
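To make the failure mode concrete, here is a hedged example of the kind of bad query described above, assuming an existing SparkSession named spark: a typo that drops the partition filter forces a full scan of a very large partitioned table, so the resulting stage contains one task per input split. The table and column names are hypothetical.

{code:scala}
// Hypothetical example of a "wrong where condition": the filter lands on a
// non-partition column, so every partition of the table is scanned and the
// scan stage is scheduled with a task for each file split.
val logs = spark.table("warehouse.access_logs") // assumed partitioned by dt

// Intended: logs.filter(logs("dt") === "2019-10-10")
// Actually submitted, filtering on the wrong column:
val bad = logs.filter(logs("url") === "2019-10-10")
bad.count() // triggers a stage with an enormous TaskSet
{code}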
[jira] [Updated] (SPARK-29409) spark drop partition always throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ant_nebula updated SPARK-29409: --- Description: The table is: {code:java} CREATE TABLE `test_spark.test_drop_partition`( `platform` string, `product` string, `cnt` bigint) PARTITIONED BY (dt string) stored as orc;{code} hive 2.1.1: {code:java} spark-sql -e "alter table test_spark.test_drop_partition drop if exists partition(dt='2019-10-08')"{code} hive builtin: {code:java} spark-sql --conf spark.sql.hive.metastore.version=1.2.1 --conf spark.sql.hive.metastore.jars=builtin -e "alter table test_spark.test_drop_partition drop if exists partition(dt='2019-10-08')"{code} both would log Exception: {code:java} 19/10/09 18:21:27 INFO metastore: Opened a connection to metastore, current connections: 1 19/10/09 18:21:27 INFO metastore: Connected to metastore. 19/10/09 18:21:27 WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect. org.apache.thrift.transport.TTransportException: Cannot write to null outputStream at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142) at org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:178) at org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:106) at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:70) at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_partitions_ps_with_auth(ThriftHiveMetastore.java:2433) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_ps_with_auth(ThriftHiveMetastore.java:2420) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsWithAuthInfo(HiveMetaStoreClient.java:1199) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154) at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2265) at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2333) at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2359) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:560) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:555) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply$mcV$sp(HiveClientImpl.scala:555) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) at org.apache.spark.sql.hive.client.HiveClientImpl.dropPartitions(HiveClientImpl.scala:550) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropPartitions$1.apply$mcV$sp(HiveExternalCatalog.scala:972) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropPartitions$1.apply(HiveExternalCatalog.scala:970) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropPartitions$1.apply(HiveExternalCatalog.scala:970) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.
[jira] [Commented] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file
[ https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948361#comment-16948361 ] Jatin Puri commented on SPARK-10848: This issue still exists in `2.4.4`. Should a new issue be created? > Applied JSON Schema Works for json RDD but not when loading json file > - > > Key: SPARK-10848 > URL: https://issues.apache.org/jira/browse/SPARK-10848 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Miklos Christine >Priority: Minor > > Using a defined schema to load a json rdd works as expected. Loading the json > records from a file does not apply the supplied schema. Mainly the nullable > field isn't applied correctly. Loading from a file uses nullable=true on all > fields regardless of applied schema. > Code to reproduce: > {code} > import org.apache.spark.sql.types._ > val jsonRdd = sc.parallelize(List( > """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", > "ProductCode": "WQT648", "Qty": 5}""", > """{"OrderID": 2, "CustomerID":16 , "OrderDate": "2015-07-11", > "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, > "expressDelivery":true}""")) > val mySchema = StructType(Array( > StructField(name="OrderID" , dataType=LongType, nullable=false), > StructField("CustomerID", IntegerType, false), > StructField("OrderDate", DateType, false), > StructField("ProductCode", StringType, false), > StructField("Qty", IntegerType, false), > StructField("Discount", FloatType, true), > StructField("expressDelivery", BooleanType, true) > )) > val myDF = sqlContext.read.schema(mySchema).json(jsonRdd) > val schema1 = myDF.printSchema > val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json") > val schema2 = dfDFfromFile.printSchema > {code} > Orders.json > {code} > {"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": > "WQT648", "Qty": 5} > {"OrderID": 2, "CustomerID":16 , "OrderDate": "2015-07-11", "ProductCode": > "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true} > {code} > The behavior should be consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29288) Spark SQL add jar can't support HTTP path.
[ https://issues.apache.org/jira/browse/SPARK-29288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948364#comment-16948364 ] angerszhu commented on SPARK-29288: --- [~dongjoon] Sorry for the late reply; the Hive Jira is https://issues.apache.org/jira/browse/HIVE-9664 > Spark SQL add jar can't support HTTP path. > --- > > Key: SPARK-29288 > URL: https://issues.apache.org/jira/browse/SPARK-29288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Spark SQL > `ADD JAR` can't support URLs with the http or livy scheme; do we need to support it? > cc [~sro...@scient.com] > [~hyukjin.kwon][~dongjoon][~jerryshao][~juliuszsompolski] > Hive 2.3 supports it, do we need to support it? > I can work on this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
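For context, a hedged sketch of what the requested behaviour would look like, assuming an existing SparkSession named spark; the jar URL and UDF class are placeholders, and ADD JAR with an HTTP URL only works once this improvement is implemented, not in the affected versions.

{code:scala}
// Requested usage: register a jar from a remote HTTP location, then use a
// class from it. URL and class name are hypothetical.
spark.sql("ADD JAR http://repo.example.com/jars/my-udfs.jar")
spark.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf'")
{code}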
[jira] [Commented] (SPARK-29409) spark drop partition always throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948368#comment-16948368 ] angerszhu commented on SPARK-29409: --- Thanks, I will check this problem. > spark drop partition always throws Exception > > > Key: SPARK-29409 > URL: https://issues.apache.org/jira/browse/SPARK-29409 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: spark 2.4.0 on yarn 2.7.3 > spark-sql client mode > run hive version: 2.1.1 > hive builtin version 1.2.1 >Reporter: ant_nebula >Priority: Major > > The table is: > {code:java} > CREATE TABLE `test_spark.test_drop_partition`( > `platform` string, > `product` string, > `cnt` bigint) > PARTITIONED BY (dt string) > stored as orc;{code} > hive 2.1.1: > {code:java} > spark-sql -e "alter table test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > hive builtin: > {code:java} > spark-sql --conf spark.sql.hive.metastore.version=1.2.1 --conf > spark.sql.hive.metastore.jars=builtin -e "alter table > test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > both would log Exception: > {code:java} > 19/10/09 18:21:27 INFO metastore: Opened a connection to metastore, current > connections: 1 > 19/10/09 18:21:27 INFO metastore: Connected to metastore. > 19/10/09 18:21:27 WARN RetryingMetaStoreClient: MetaStoreClient lost > connection. Attempting to reconnect. > org.apache.thrift.transport.TTransportException: Cannot write to null > outputStream > at > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142) > at > org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:178) > at > org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:106) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:70) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_partitions_ps_with_auth(ThriftHiveMetastore.java:2433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_ps_with_auth(ThriftHiveMetastore.java:2420) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsWithAuthInfo(HiveMetaStoreClient.java:1199) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2265) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2333) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2359) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:560) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:555) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply$mcV$sp(HiveClientImpl.scala:555) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClien
[jira] [Comment Edited] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948236#comment-16948236 ] Yuming Wang edited comment on SPARK-29354 at 10/10/19 10:02 AM: Hi [~Elixir Kook] Do you mean creating the Spark binary distribution with {{-Phadoop-provided}}? {code:sh} ./dev/make-distribution.sh --tgz -Phadoop-provided {code} was (Author: q79969786): Hi [~Elixir Kook] Do you mean creating the Spark binary distribution with {{-Phadoop-provided}}? {code:sh} ./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided {code} > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has a direct dependency on jline, included in the root pom.xml, > but the binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29425) Alter database statement erases hive database's ownership
Kent Yao created SPARK-29425: Summary: Alter database statement erases hive database's ownership Key: SPARK-29425 URL: https://issues.apache.org/jira/browse/SPARK-29425 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4, 2.3.4 Reporter: Kent Yao ALTER DATABASE kyuubi SET DBPROPERTIES ('in'='out') will erase a hive database's owner -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29425) Alter database statement erases hive database's ownership
[ https://issues.apache.org/jira/browse/SPARK-29425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-29425: - Description: Commands like `ALTER DATABASE kyuubi SET DBPROPERTIES ('in'='out')` will erase a hive database's owner (was: ALTER DATABASE kyuubi SET DBPROPERTIES ('in'='out') will erase a hive database's owner) > Alter database statement erases hive database's ownership > - > > Key: SPARK-29425 > URL: https://issues.apache.org/jira/browse/SPARK-29425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4 >Reporter: Kent Yao >Priority: Major > > Commands like `ALTER DATABASE kyuubi SET DBPROPERTIES ('in'='out')` will > erase a hive database's owner -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
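A hedged way to observe the reported behaviour from the Spark side, assuming an existing SparkSession named spark and the database name from the example above; how much metadata (including the owner) is printed varies by version, so the owner can also be checked directly in the Hive metastore or from the Hive CLI.

{code:scala}
// Sketch: inspect the database before and after the ALTER DATABASE command.
// On affected versions the Hive database's owner is reportedly erased by the
// ALTER statement.
spark.sql("DESC DATABASE EXTENDED kyuubi").show(truncate = false)
spark.sql("ALTER DATABASE kyuubi SET DBPROPERTIES ('in'='out')")
spark.sql("DESC DATABASE EXTENDED kyuubi").show(truncate = false)
{code}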
[jira] [Created] (SPARK-29426) Watermark does not take effect
jingshanglu created SPARK-29426: --- Summary: Watermark does not take effect Key: SPARK-29426 URL: https://issues.apache.org/jira/browse/SPARK-29426 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.3 Environment: my kafka mes like this: {code:java} // code placeholder [kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0 {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-04 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {code} output like this: {code:java} // code placeholder Batch: 5 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|2| |[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|2| ++--++-+-+--- Batch: 6 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|3| |[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|3| ++--++-+-+--- Batch: 7 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-04 12:20...|select * from user|192.168.54.6|172.0.0.1|1| |[2019-03-04 12:15...|select * from user|192.168.54.6|172.0.0.1|1| ++--++-+-+ {code} the watermark behind the event time(2019-03-04 12:23:22), but this event {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} still be Aggregated Reporter: jingshanglu I use withWatermark and window to express windowed aggregations, but the Watermark does not take effect. my code: {code:java} // code placeholder Dataset clientSqlIpCount = mes.withWatermark("timestamp","1 minute") .groupBy( functions.window(mes.col("timestamp"),"10 minutes","5 minutes"), mes.col("sql"),mes.col("client"),mes.col("ip")) .count(); StreamingQuery query = clientSqlIpCount .writeStream() .outputMode("Update") .format("console") .start(); spark.streams().awaitAnyTermination(); {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29426) Watermark does not take effect
[ https://issues.apache.org/jira/browse/SPARK-29426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jingshanglu updated SPARK-29426: Environment: (was: my kafka mes like this: {code:java} // code placeholder [kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0 {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-04 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {code} output like this: {code:java} // code placeholder Batch: 5 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|2| |[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|2| ++--++-+-+--- Batch: 6 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|3| |[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|3| ++--++-+-+--- Batch: 7 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-04 12:20...|select * from user|192.168.54.6|172.0.0.1|1| |[2019-03-04 12:15...|select * from user|192.168.54.6|172.0.0.1|1| ++--++-+-+ {code} the watermark behind the event time(2019-03-04 12:23:22), but this event {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} still be Aggregated) my kafka mes like this: {code:java} // code placeholder [kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0 {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-04 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {code} output like this: {code:java} // code placeholder Batch: 5 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|2| |[2019-03-05 12:15...|select * from 
user|192.168.54.6|172.0.0.1|2| ++--++-+-+--- Batch: 6 --- ++--++-+-+ | window|
[jira] [Commented] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948393#comment-16948393 ] Sean R. Owen commented on SPARK-29424: -- I doubt we want to throw yet another limit/config at this. It's hard to guess or impose a _task_ limit in order to limit cluster usage. This is often what resource constraints on the resource manager are for, not something to duplicate in Spark as well. > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our users often submit bad SQL through the query platform, such as: > # writing a wrong join condition but submitting the SQL anyway > # writing a wrong where condition > # etc. > Such queries make the Spark scheduler submit a huge number of tasks. This makes > Spark run very slowly and impacts other users (Spark Thrift Server), and can even run > out of memory because of the many objects generated by such a large number of tasks. > So I added a constraint when submitting tasks that aborts the stage early when the TaskSet > size exceeds a configured limit. I wonder if the community would accept this approach. > cc [~srowen] [~dongjoon] [~yumwang] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29427) Create KeyValueGroupedDataset from RelationalGroupedDataset
Alexander Hagerf created SPARK-29427: Summary: Create KeyValueGroupedDataset from RelationalGroupedDataset Key: SPARK-29427 URL: https://issues.apache.org/jira/browse/SPARK-29427 Project: Spark Issue Type: New Feature Components: Spark Core, SQL Affects Versions: 2.4.4 Reporter: Alexander Hagerf The scenario I'm having is that I'm reading two huge bucketed tables and since a regular join is not performant enough for these cases I'm using groupByKey to generate two KeyValueGroupedDatasets and cogroup them to implement the logic I need. The issue with this approach is that I'm only grouping by the column that the tables are bucketed by but since I'm using groupByKey the bucketing is completely ignored and I still get a full shuffle. What I'm looking for is some functionality to tell Catalyst to group by a column in a relational way but then give the user a possibility to utilize the functions of the KeyValueGroupedDataset e.g. cogroup (which is not available for dataframes) At current spark (2.4.4) I see no way to do this efficiently. I think this is a valid use case which if solved would have huge performance benefits. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948399#comment-16948399 ] angerszhu commented on SPARK-29424: --- [~srowen] Even with resource limits in place, this bad behaviour makes the program run very slowly, and the user doesn't know why. Aborting early makes it easier for the user to recognize where the problem is, especially for Spark Thrift Server. > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our users often submit bad SQL through the query platform, such as: > # writing a wrong join condition but submitting the SQL anyway > # writing a wrong where condition > # etc. > Such queries make the Spark scheduler submit a huge number of tasks. This makes > Spark run very slowly and impacts other users (Spark Thrift Server), and can even run > out of memory because of the many objects generated by such a large number of tasks. > So I added a constraint when submitting tasks that aborts the stage early when the TaskSet > size exceeds a configured limit. I wonder if the community would accept this approach. > cc [~srowen] [~dongjoon] [~yumwang] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29427) Create KeyValueGroupedDataset from RelationalGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-29427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Hagerf updated SPARK-29427: - Description: The scenario I'm having is that I'm reading two huge bucketed tables and since a regular join is not performant enough for my cases, I'm using groupByKey to generate two KeyValueGroupedDatasets and cogroup them to implement the merging logic I need. The issue with this approach is that I'm only grouping by the column that the tables are bucketed by but since I'm using groupByKey the bucketing is completely ignored and I still get a full shuffle. What I'm looking for is some functionality to tell Catalyst to group by a column in a relational way but then give the user a possibility to utilize the functions of the KeyValueGroupedDataset e.g. cogroup (which is not available for dataframes) At current spark (2.4.4) I see no way to do this efficiently. I think this is a valid use case which if solved would have huge performance benefits. was: The scenario I'm having is that I'm reading two huge bucketed tables and since a regular join is not performant enough for these cases I'm using groupByKey to generate two KeyValueGroupedDatasets and cogroup them to implement the logic I need. The issue with this approach is that I'm only grouping by the column that the tables are bucketed by but since I'm using groupByKey the bucketing is completely ignored and I still get a full shuffle. What I'm looking for is some functionality to tell Catalyst to group by a column in a relational way but then give the user a possibility to utilize the functions of the KeyValueGroupedDataset e.g. cogroup (which is not available for dataframes) At current spark (2.4.4) I see no way to do this efficiently. I think this is a valid use case which if solved would have huge performance benefits. > Create KeyValueGroupedDataset from RelationalGroupedDataset > --- > > Key: SPARK-29427 > URL: https://issues.apache.org/jira/browse/SPARK-29427 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Alexander Hagerf >Priority: Major > > The scenario I'm having is that I'm reading two huge bucketed tables and > since a regular join is not performant enough for my cases, I'm using > groupByKey to generate two KeyValueGroupedDatasets and cogroup them to > implement the merging logic I need. > The issue with this approach is that I'm only grouping by the column that the > tables are bucketed by but since I'm using groupByKey the bucketing is > completely ignored and I still get a full shuffle. > What I'm looking for is some functionality to tell Catalyst to group by a > column in a relational way but then give the user a possibility to utilize > the functions of the KeyValueGroupedDataset e.g. cogroup (which is not > available for dataframes) > > At current spark (2.4.4) I see no way to do this efficiently. I think this is > a valid use case which if solved would have huge performance benefits. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
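For reference, a hedged sketch of the pattern described in this ticket: two bucketed tables grouped with groupByKey and combined with cogroup, which today ignores the bucketing and triggers a full shuffle. The table names, key column, and case classes are hypothetical.

{code:scala}
// Current approach in Spark 2.4: groupByKey + cogroup. Because groupByKey is
// key/value based, the bucketing of the underlying tables is not exploited
// and both sides are fully shuffled.
import org.apache.spark.sql.{Dataset, SparkSession}

case class LeftRow(id: Long, payload: String)
case class RightRow(id: Long, attr: Int)

def merge(spark: SparkSession): Dataset[String] = {
  import spark.implicits._

  val left  = spark.table("db.left_bucketed").as[LeftRow]   // bucketed by id
  val right = spark.table("db.right_bucketed").as[RightRow] // bucketed by id

  // cogroup is only available on KeyValueGroupedDataset, not on the
  // RelationalGroupedDataset produced by df.groupBy("id").
  left.groupByKey(_.id).cogroup(right.groupByKey(_.id)) { (key, ls, rs) =>
    // placeholder for the real merging logic
    Iterator(s"$key:${ls.size}:${rs.size}")
  }
}
{code}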
[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948410#comment-16948410 ] angerszhu commented on SPARK-29354: --- [~Elixir Kook] [~yumwang] Jline is brought in by the hive-beeline module; you can find the dependency in hive-beeline's pom file. If you build like below: {code:java} ./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided -Phive-provided {code} you won't get a jline jar in the dist/jar/ folder. > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has a direct dependency on jline, included in the root pom.xml, > but the binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29428) Can't persist/set None-valued param
Borys Biletskyy created SPARK-29428: --- Summary: Can't persist/set None-valued param Key: SPARK-29428 URL: https://issues.apache.org/jira/browse/SPARK-29428 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.3.2 Reporter: Borys Biletskyy {code:java} import pytest from pyspark import keyword_only from pyspark.ml import Model from pyspark.sql import DataFrame from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable from pyspark.ml.param.shared import HasInputCol from pyspark.sql.functions import * class NoneParamTester(Model, HasInputCol, DefaultParamsReadable, DefaultParamsWritable ): @keyword_only def __init__(self, inputCol: str = None): super(NoneParamTester, self).__init__() kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only def setParams(self, inputCol: str = None): kwargs = self._input_kwargs self._set(**kwargs) return self def _transform(self, data: DataFrame) -> DataFrame: return data class TestHasInputColParam(object): def test_persist_none(self, spark, temp_dir): path = temp_dir + '/test_model' model = NoneParamTester(inputCol=None) assert model.isDefined(model.inputCol) assert model.isSet(model.inputCol) assert model.getInputCol() is None model.write().overwrite().save(path) NoneParamTester.load(path) # TypeError: Could not convert to string type def test_set_none(self, spark): model = NoneParamTester(inputCol=None) assert model.isDefined(model.inputCol) assert model.isSet(model.inputCol) assert model.getInputCol() is None model.set(model.inputCol, None) # TypeError: Could not convert to string type {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29428) Can't persist/set None-valued param
[ https://issues.apache.org/jira/browse/SPARK-29428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Borys Biletskyy updated SPARK-29428: Description: {code:java} import pytest from pyspark import keyword_only from pyspark.ml import Model from pyspark.sql import DataFrame from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable from pyspark.ml.param.shared import HasInputCol from pyspark.sql.functions import * class NoneParamTester(Model, HasInputCol, DefaultParamsReadable, DefaultParamsWritable ): @keyword_only def __init__(self, inputCol: str = None): super(NoneParamTester, self).__init__() kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only def setParams(self, inputCol: str = None): kwargs = self._input_kwargs self._set(**kwargs) return self def _transform(self, data: DataFrame) -> DataFrame: return data class TestNoneParam(object): def test_persist_none(self, spark, temp_dir): path = temp_dir + '/test_model' model = NoneParamTester(inputCol=None) assert model.isDefined(model.inputCol) assert model.isSet(model.inputCol) assert model.getInputCol() is None model.write().overwrite().save(path) NoneParamTester.load(path) # TypeError: Could not convert to string type def test_set_none(self, spark): model = NoneParamTester(inputCol=None) assert model.isDefined(model.inputCol) assert model.isSet(model.inputCol) assert model.getInputCol() is None model.set(model.inputCol, None) # TypeError: Could not convert to string type {code} was: {code:java} import pytest from pyspark import keyword_only from pyspark.ml import Model from pyspark.sql import DataFrame from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable from pyspark.ml.param.shared import HasInputCol from pyspark.sql.functions import * class NoneParamTester(Model, HasInputCol, DefaultParamsReadable, DefaultParamsWritable ): @keyword_only def __init__(self, inputCol: str = None): super(NoneParamTester, self).__init__() kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only def setParams(self, inputCol: str = None): kwargs = self._input_kwargs self._set(**kwargs) return self def _transform(self, data: DataFrame) -> DataFrame: return data class TestHasInputColParam(object): def test_persist_none(self, spark, temp_dir): path = temp_dir + '/test_model' model = NoneParamTester(inputCol=None) assert model.isDefined(model.inputCol) assert model.isSet(model.inputCol) assert model.getInputCol() is None model.write().overwrite().save(path) NoneParamTester.load(path) # TypeError: Could not convert to string type def test_set_none(self, spark): model = NoneParamTester(inputCol=None) assert model.isDefined(model.inputCol) assert model.isSet(model.inputCol) assert model.getInputCol() is None model.set(model.inputCol, None) # TypeError: Could not convert to string type {code} > Can't persist/set None-valued param > > > Key: SPARK-29428 > URL: https://issues.apache.org/jira/browse/SPARK-29428 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.2 >Reporter: Borys Biletskyy >Priority: Major > > {code:java} > import pytest > from pyspark import keyword_only > from pyspark.ml import Model > from pyspark.sql import DataFrame > from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable > from pyspark.ml.param.shared import HasInputCol > from pyspark.sql.functions import * > class NoneParamTester(Model, > HasInputCol, > DefaultParamsReadable, > DefaultParamsWritable > ): > @keyword_only > def __init__(self, inputCol: str = None): > 
super(NoneParamTester, self).__init__() > kwargs = self._input_kwargs > self.setParams(**kwargs) > @keyword_only > def setParams(self, inputCol: str = None): > kwargs = self._input_kwargs > self._set(**kwargs) > return self > def _transform(self, data: DataFrame) -> DataFrame: > return data > class TestNoneParam(object): > def test_persist_none(self, spark, temp_dir): > path = temp_dir + '/test_model' > model = NoneParamTester(inputCol=None) > assert model.isDefined(model.inputCol) > assert model.isSet(m
[jira] [Commented] (SPARK-13346) Using DataFrames iteratively leads to slow query planning
[ https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948443#comment-16948443 ] Izek Greenfield commented on SPARK-13346: - [~davies] Why was this issue closed? > Using DataFrames iteratively leads to slow query planning > - > > Key: SPARK-13346 > URL: https://issues.apache.org/jira/browse/SPARK-13346 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed > > I have an iterative algorithm based on DataFrames, and the query plan grows > very quickly with each iteration. Caching the current DataFrame at the end > of an iteration does not fix the problem. However, converting the DataFrame > to an RDD and back at the end of each iteration does fix the problem. > Printing the query plans shows that the plan explodes quickly (10 lines, to > several hundred lines, to several thousand lines, ...) with successive > iterations. > The desired behavior is for the analyzer to recognize that a big chunk of the > query plan does not need to be computed since it is already cached. The > computation on each iteration should be the same. > If useful, I can push (complex) code to reproduce the issue. But it should > be simple to see if you create an iterative algorithm which produces a new > DataFrame from an old one on each iteration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
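For anyone hitting this, a hedged sketch of the workaround mentioned in the description: converting the DataFrame to an RDD and back at the end of each iteration, which truncates the growing logical plan. The column name and the per-iteration update are placeholders; Dataset.checkpoint (and, in newer releases, localCheckpoint) achieves a similar plan truncation.

{code:scala}
// Workaround sketch: rebuild the DataFrame from its RDD after each iteration
// so the logical plan does not keep growing across iterations.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

def iterate(spark: SparkSession, input: DataFrame, numIterations: Int): DataFrame = {
  var df = input
  for (_ <- 1 to numIterations) {
    // placeholder for the real iterative update
    df = df.withColumn("value", col("value") * 0.9)

    // round-trip through an RDD to cut the accumulated plan
    df = spark.createDataFrame(df.rdd, df.schema)
  }
  df
}
{code}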
[jira] [Comment Edited] (SPARK-29426) Watermark does not take effect
[ https://issues.apache.org/jira/browse/SPARK-29426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948379#comment-16948379 ] jingshanglu edited comment on SPARK-29426 at 10/10/19 11:45 AM: my kafka mes like this: {code:java} // code placeholder [kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0 {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-04 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {code} output like this: {code:java} // code placeholder Batch: 0 --- +--+---+--+---+-+ |window|sql|client| ip|count| +--+---+--+---+-+ +--+---+--+---+-+--- Batch: 1 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|1| |[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|1| ++--++-+-+--- Batch: 2 --- +--+---+--+---+-+ |window|sql|client| ip|count| +--+---+--+---+-+ +--+---+--+---+-+--- Batch: 3 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-05 12:25...|select * from user|192.168.54.6|172.0.0.1|1| |[2019-03-05 12:30...|select * from user|192.168.54.6|172.0.0.1|1| ++--++-+-+--- Batch: 4 --- +--+---+--+---+-+ |window|sql|client| ip|count| +--+---+--+---+-+ +--+---+--+---+-+--- Batch: 5 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|2| |[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|2| ++--++-+-+ {code} the watermark behind the event time(2019-03-04 12:23:22), but this event {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} still be Aggregated was (Author: jingshang): my kafka mes like this: {code:java} // code placeholder [kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0 {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-04 12:22:29","ip":"172.0.0.1","clie
[jira] [Comment Edited] (SPARK-29426) Watermark does not take effect
[ https://issues.apache.org/jira/browse/SPARK-29426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948379#comment-16948379 ] jingshanglu edited comment on SPARK-29426 at 10/10/19 11:46 AM: my kafka mes like this: {code:java} // code placeholder [kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0 {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {code} output like this: {code:java} // code placeholder Batch: 0 --- +--+---+--+---+-+ |window|sql|client| ip|count| +--+---+--+---+-+ +--+---+--+---+-+--- Batch: 1 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|1| |[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|1| ++--++-+-+--- Batch: 2 --- +--+---+--+---+-+ |window|sql|client| ip|count| +--+---+--+---+-+ +--+---+--+---+-+--- Batch: 3 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-05 12:25...|select * from user|192.168.54.6|172.0.0.1|1| |[2019-03-05 12:30...|select * from user|192.168.54.6|172.0.0.1|1| ++--++-+-+--- Batch: 4 --- +--+---+--+---+-+ |window|sql|client| ip|count| +--+---+--+---+-+ +--+---+--+---+-+--- Batch: 5 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|2| |[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|2| ++--++-+-+ {code} the watermark behind the event time(2019-03-04 12:23:22), but this event {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} still be Aggregated was (Author: jingshang): my kafka mes like this: {code:java} // code placeholder [kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0 {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {"sql":"select * from user","timestamp":"2019-03-04 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306} {code} output like this: {code:java} // code placeholder Batch: 0 --- +--+---+--+---+-+ |window|sql|client| ip|count| +--+---+--+---+-+ +--+---+--+---+-+--- Batch: 1 --- ++--++-+-+ | window| sql| client| ip|count| ++--++-+-+ |[2019-03-05
[jira] [Comment Edited] (SPARK-29421) Add an opportunity to change the file format of command CREATE TABLE LIKE
[ https://issues.apache.org/jira/browse/SPARK-29421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948506#comment-16948506 ] Lantao Jin edited comment on SPARK-29421 at 10/10/19 12:03 PM: --- [~cloud_fan] Yes, Hive supports a similar command with {{STORED AS}}: {code} hive> CREATE TABLE tbl(a int) STORED AS TEXTFILE; OK Time taken: 0.726 seconds hive> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET; OK Time taken: 0.294 seconds {code} was (Author: cltlfcjin): [~cloud_fan] Yes, Hive support the similar command with {{STORED AS}}: {code} hive> CREATE TABLE tbl(a int) STORED AS TEXTFILE; OK Time taken: 0.726 seconds hive> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET; OK Time taken: 0.294 seconds {code} > Add an opportunity to change the file format of command CREATE TABLE LIKE > - > > Key: SPARK-29421 > URL: https://issues.apache.org/jira/browse/SPARK-29421 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Use the CREATE TABLE tb1 LIKE tb2 command to create an empty table tb1 based on > the definition of table tb2. The most common use case is to create tb1 with the > same schema as tb2. But an inconvenient aspect is that this command also copies > the FileFormat from tb2; it cannot change the input/output format and serde. > Adding the ability to change the file format is useful for scenarios like > upgrading a table from a low-performance file format to a high-performance > one (parquet, orc). > Here are two options to enhance it. > Option 1: Add a configuration {{spark.sql.createTableLike.fileformat}} whose > default value is "none", which keeps the behaviour the same as today -- > copying the file format from the source table. After running SET > spark.sql.createTableLike.fileformat=parquet (or any other valid file format > defined in {{HiveSerDe}}), {{CREATE TABLE ... LIKE}} will use the new file > format type. > Option 2: Add the syntax {{USING fileformat}} after {{CREATE TABLE ... LIKE}}. For > example, > {code} > CREATE TABLE tb1 LIKE tb2 USING parquet; > {code} > If the USING keyword is omitted, the behaviour also stays the same as today -- > copying the file format from the source table. > Both options keep the current behaviour by default. > We use Option 1 with the parquet file format as an enhancement in our production > thriftserver because we need to upgrade the file format used by many existing SQL > scripts without modifying them. But for the community, Option 2 could be treated as a new feature > since it requires the user to write the additional USING part. > cc [~dongjoon] [~hyukjin.kwon] [~joshrosen] [~cloud_fan] [~yumwang] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29421) Add an opportunity to change the file format of command CREATE TABLE LIKE
[ https://issues.apache.org/jira/browse/SPARK-29421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948506#comment-16948506 ] Lantao Jin commented on SPARK-29421: [~cloud_fan] Yes, Hive supports a similar command with {{STORED AS}}: {code} hive> CREATE TABLE tbl(a int) STORED AS TEXTFILE; OK Time taken: 0.726 seconds hive> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET; OK Time taken: 0.294 seconds {code} > Add an opportunity to change the file format of command CREATE TABLE LIKE > - > > Key: SPARK-29421 > URL: https://issues.apache.org/jira/browse/SPARK-29421 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Use the CREATE TABLE tb1 LIKE tb2 command to create an empty table tb1 based on > the definition of table tb2. The most common use case is to create tb1 with the > same schema as tb2. But an inconvenient aspect is that this command also copies > the FileFormat from tb2; it cannot change the input/output format and serde. > Adding the ability to change the file format is useful for scenarios like > upgrading a table from a low-performance file format to a high-performance > one (parquet, orc). > Here are two options to enhance it. > Option 1: Add a configuration {{spark.sql.createTableLike.fileformat}} whose > default value is "none", which keeps the behaviour the same as today -- > copying the file format from the source table. After running SET > spark.sql.createTableLike.fileformat=parquet (or any other valid file format > defined in {{HiveSerDe}}), {{CREATE TABLE ... LIKE}} will use the new file > format type. > Option 2: Add the syntax {{USING fileformat}} after {{CREATE TABLE ... LIKE}}. For > example, > {code} > CREATE TABLE tb1 LIKE tb2 USING parquet; > {code} > If the USING keyword is omitted, the behaviour also stays the same as today -- > copying the file format from the source table. > Both options keep the current behaviour by default. > We use Option 1 with the parquet file format as an enhancement in our production > thriftserver because we need to upgrade the file format used by many existing SQL > scripts without modifying them. But for the community, Option 2 could be treated as a new feature > since it requires the user to write the additional USING part. > cc [~dongjoon] [~hyukjin.kwon] [~joshrosen] [~cloud_fan] [~yumwang] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29421) Add an opportunity to change the file format of command CREATE TABLE LIKE
[ https://issues.apache.org/jira/browse/SPARK-29421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948506#comment-16948506 ] Lantao Jin edited comment on SPARK-29421 at 10/10/19 12:07 PM: --- [~cloud_fan] Yes, Hive supports a similar command with {{STORED AS}}: {code} hive> CREATE TABLE tbl(a int) STORED AS TEXTFILE; OK Time taken: 0.726 seconds hive> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET; OK Time taken: 0.294 seconds {code} Here I use {{USING}} to be compatible with the Spark command {{CREATE TABLE ... USING}}. So which option do you think would be better? was (Author: cltlfcjin): [~cloud_fan] Yes, Hive support a similar command with {{STORED AS}}: {code} hive> CREATE TABLE tbl(a int) STORED AS TEXTFILE; OK Time taken: 0.726 seconds hive> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET; OK Time taken: 0.294 seconds {code} > Add an opportunity to change the file format of command CREATE TABLE LIKE > - > > Key: SPARK-29421 > URL: https://issues.apache.org/jira/browse/SPARK-29421 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Use the CREATE TABLE tb1 LIKE tb2 command to create an empty table tb1 based on > the definition of table tb2. The most common use case is to create tb1 with the > same schema as tb2. But an inconvenient aspect is that this command also copies > the FileFormat from tb2; it cannot change the input/output format and serde. > Adding the ability to change the file format is useful for scenarios like > upgrading a table from a low-performance file format to a high-performance > one (parquet, orc). > Here are two options to enhance it. > Option 1: Add a configuration {{spark.sql.createTableLike.fileformat}} whose > default value is "none", which keeps the behaviour the same as today -- > copying the file format from the source table. After running SET > spark.sql.createTableLike.fileformat=parquet (or any other valid file format > defined in {{HiveSerDe}}), {{CREATE TABLE ... LIKE}} will use the new file > format type. > Option 2: Add the syntax {{USING fileformat}} after {{CREATE TABLE ... LIKE}}. For > example, > {code} > CREATE TABLE tb1 LIKE tb2 USING parquet; > {code} > If the USING keyword is omitted, the behaviour also stays the same as today -- > copying the file format from the source table. > Both options keep the current behaviour by default. > We use Option 1 with the parquet file format as an enhancement in our production > thriftserver because we need to upgrade the file format used by many existing SQL > scripts without modifying them. But for the community, Option 2 could be treated as a new feature > since it requires the user to write the additional USING part. > cc [~dongjoon] [~hyukjin.kwon] [~joshrosen] [~cloud_fan] [~yumwang] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
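Neither of the two proposals in SPARK-29421 exists in Spark at the time of this thread; the sketch below only shows how each option would look from a user's point of view, driving SQL from Scala. The configuration key and the USING syntax come from the ticket itself and are proposals, not supported features.
{code:scala}
// Sketch only: both options are proposals from the ticket, not existing Spark behaviour.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Option 1: a session configuration that overrides the copied file format.
spark.sql("SET spark.sql.createTableLike.fileformat=parquet")  // proposed config, does not exist today
spark.sql("CREATE TABLE tb1 LIKE tb2")                         // tb1 would then be created as parquet

// Option 2: explicit syntax on the statement itself.
spark.sql("CREATE TABLE tb3 LIKE tb2 USING parquet")           // proposed syntax, not yet supported
{code}
Option 1 changes behaviour for every CREATE TABLE ... LIKE in the session, which is why it suits mass migrations of existing scripts, while Option 2 makes the format change explicit per statement.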
[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948575#comment-16948575 ] Yuming Wang commented on SPARK-29354: - Does {{bin/spark-shell}} need jline? > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has direct dependency on jline, included in the root pom.xml > but binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948610#comment-16948610 ] Sungpeo Kook commented on SPARK-29354: -- [~yumwang] I meant spark binary distributions which are listed on the [download page|https://spark.apache.org/downloads.html]. whose package type is `Pre-built with uesr-provided Apache Hadoop` and yes, `bin/spark-shell` need jilne. > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has direct dependency on jline, included in the root pom.xml > but binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948610#comment-16948610 ] Sungpeo Kook edited comment on SPARK-29354 at 10/10/19 2:12 PM: [~yumwang] I meant spark binary distributions which are listed on the [download page|https://spark.apache.org/downloads.html]. whose package type is `Pre-built with uesr-provided Apache Hadoop` and yes, `bin/spark-shell` needs a jilne. was (Author: elixir kook): [~yumwang] I meant spark binary distributions which are listed on the [download page|https://spark.apache.org/downloads.html]. whose package type is `Pre-built with uesr-provided Apache Hadoop` and yes, `bin/spark-shell` need jilne. > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has direct dependency on jline, included in the root pom.xml > but binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948610#comment-16948610 ] Sungpeo Kook edited comment on SPARK-29354 at 10/10/19 2:14 PM: [~yumwang] I meant spark binary distributions which are listed on the [download page|https://spark.apache.org/downloads.html]. whose package type is `Pre-built with uesr-provided Apache Hadoop` and yes, `bin/spark-shell` needs a jline. was (Author: elixir kook): [~yumwang] I meant spark binary distributions which are listed on the [download page|https://spark.apache.org/downloads.html]. whose package type is `Pre-built with uesr-provided Apache Hadoop` and yes, `bin/spark-shell` needs a jilne. > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has direct dependency on jline, included in the root pom.xml > but binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948610#comment-16948610 ] Sungpeo Kook edited comment on SPARK-29354 at 10/10/19 2:17 PM: [~yumwang] [~angerszhuuu] At first, thank you guys for the response. I meant spark binary distributions which are listed on the [download page|https://spark.apache.org/downloads.html]. whose package type is `Pre-built with uesr-provided Apache Hadoop` and yes, `bin/spark-shell` needs a jline. was (Author: elixir kook): [~yumwang] I meant spark binary distributions which are listed on the [download page|https://spark.apache.org/downloads.html]. whose package type is `Pre-built with uesr-provided Apache Hadoop` and yes, `bin/spark-shell` needs a jline. > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has direct dependency on jline, included in the root pom.xml > but binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948610#comment-16948610 ] Sungpeo Kook edited comment on SPARK-29354 at 10/10/19 2:17 PM: [~yumwang] [~angerszhuuu] First of all, thank you guys for the response. I meant the Spark binary distributions listed on the [download page|https://spark.apache.org/downloads.html], whose package type is `Pre-built with user-provided Apache Hadoop`. And yes, `bin/spark-shell` needs jline. was (Author: elixir kook): [~yumwang] [~angerszhuuu] At first, thank you guys for the response. I meant spark binary distributions which are listed on the [download page|https://spark.apache.org/downloads.html]. whose package type is `Pre-built with uesr-provided Apache Hadoop` and yes, `bin/spark-shell` needs a jline. > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has direct dependency on jline, included in the root pom.xml > but binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28859) Remove value check of MEMORY_OFFHEAP_SIZE in declaration section
[ https://issues.apache.org/jira/browse/SPARK-28859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948652#comment-16948652 ] Thomas Graves commented on SPARK-28859: --- I wouldn't expect users to specify the size when enabled is false. If they do specify false, I guess it's ok for the size to be 0, but I'm not sure we really need to special-case this. A default of 0 is fine; that is why I said that if the user specifies a value it should be > 0. But I haven't looked to see when the ConfigEntry does the validation on this. If it validates the default value then we can't change it, or the validator needs to change. That is what this Jira is meant to investigate. From a skim of the code, it looks like the validator only runs on the non-default value. > Remove value check of MEMORY_OFFHEAP_SIZE in declaration section > > > Key: SPARK-28859 > URL: https://issues.apache.org/jira/browse/SPARK-28859 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yang Jie >Assignee: yifan >Priority: Minor > > Now MEMORY_OFFHEAP_SIZE has a default value of 0, but it should be greater than 0 when > MEMORY_OFFHEAP_ENABLED is true; should we check this condition in code? > > SPARK-28577 added this check before requesting memory resources from Yarn > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
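For context, the declaration-site check being discussed lives in Spark's internal config declarations. The sketch below is an illustration of that pattern using the internal ConfigBuilder API (it only compiles inside Spark's own packages, since ConfigBuilder is private[spark]); it is meant to mirror how MEMORY_OFFHEAP_SIZE could be declared, not a proposed patch.
{code:scala}
// Illustration only: mirrors the style of declarations in org.apache.spark.internal.config.
import org.apache.spark.internal.config.ConfigBuilder
import org.apache.spark.network.util.ByteUnit

object OffHeapConfigSketch {
  val MEMORY_OFFHEAP_SIZE = ConfigBuilder("spark.memory.offHeap.size")
    .doc("The absolute amount of memory which can be used for off-heap allocation, in bytes.")
    .bytesConf(ByteUnit.BYTE)
    // checkValue is applied when a user-supplied value is parsed, so the default of 0 is not
    // rejected -- consistent with the observation above that the validator only runs on
    // non-default values.
    .checkValue(_ >= 0, "The off-heap memory size must not be negative")
    .createWithDefault(0)
}
{code}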
[jira] [Resolved] (SPARK-29400) Improve PrometheusResource to use labels
[ https://issues.apache.org/jira/browse/SPARK-29400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29400. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26060 [https://github.com/apache/spark/pull/26060] > Improve PrometheusResource to use labels > > > Key: SPARK-29400 > URL: https://issues.apache.org/jira/browse/SPARK-29400 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > > SPARK-29064 introduced `PrometheusResource` for native support. This issue > aims to improve it to use **labels**. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29400) Improve PrometheusResource to use labels
[ https://issues.apache.org/jira/browse/SPARK-29400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29400: - Assignee: Dongjoon Hyun > Improve PrometheusResource to use labels > > > Key: SPARK-29400 > URL: https://issues.apache.org/jira/browse/SPARK-29400 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > SPARK-29064 introduced `PrometheusResource` for native support. This issue > aims to improve it to use **labels**. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29400) Improve PrometheusResource to use labels
[ https://issues.apache.org/jira/browse/SPARK-29400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29400: -- Parent: SPARK-29429 Issue Type: Sub-task (was: Improvement) > Improve PrometheusResource to use labels > > > Key: SPARK-29400 > URL: https://issues.apache.org/jira/browse/SPARK-29400 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > > SPARK-29064 introduced `PrometheusResource` for native support. This issue > aims to improve it to use **labels**. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29429) Support Prometheus monitoring
Dongjoon Hyun created SPARK-29429: - Summary: Support Prometheus monitoring Key: SPARK-29429 URL: https://issues.apache.org/jira/browse/SPARK-29429 Project: Spark Issue Type: Umbrella Components: Spark Core Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29064) Add PrometheusResource to export Executor metrics
[ https://issues.apache.org/jira/browse/SPARK-29064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29064: -- Parent: SPARK-29429 Issue Type: Sub-task (was: Improvement) > Add PrometheusResource to export Executor metrics > - > > Key: SPARK-29064 > URL: https://issues.apache.org/jira/browse/SPARK-29064 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29032) Simplify Prometheus support by adding PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-29032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29032: -- Parent: SPARK-29429 Issue Type: Sub-task (was: Improvement) > Simplify Prometheus support by adding PrometheusServlet > --- > > Key: SPARK-29032 > URL: https://issues.apache.org/jira/browse/SPARK-29032 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This issue aims to simplify `Prometheus` support in Spark standalone > environment or K8s environment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29429) Support Prometheus monitoring
[ https://issues.apache.org/jira/browse/SPARK-29429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29429: -- Target Version/s: 3.0.0 > Support Prometheus monitoring > - > > Key: SPARK-29429 > URL: https://issues.apache.org/jira/browse/SPARK-29429 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29032) Simplify Prometheus support by adding PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-29032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29032: - Assignee: Dongjoon Hyun > Simplify Prometheus support by adding PrometheusServlet > --- > > Key: SPARK-29032 > URL: https://issues.apache.org/jira/browse/SPARK-29032 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This issue aims to simplify `Prometheus` support in Spark standalone > environment or K8s environment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29429) Support Prometheus monitoring natively
[ https://issues.apache.org/jira/browse/SPARK-29429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29429: -- Summary: Support Prometheus monitoring natively (was: Support Prometheus monitoring) > Support Prometheus monitoring natively > -- > > Key: SPARK-29429 > URL: https://issues.apache.org/jira/browse/SPARK-29429 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29064) Add PrometheusResource to export Executor metrics
[ https://issues.apache.org/jira/browse/SPARK-29064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29064: - Assignee: Dongjoon Hyun > Add PrometheusResource to export Executor metrics > - > > Key: SPARK-29064 > URL: https://issues.apache.org/jira/browse/SPARK-29064 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29430) Document new metric endpoints for Prometheus
Dongjoon Hyun created SPARK-29430: - Summary: Document new metric endpoints for Prometheus Key: SPARK-29430 URL: https://issues.apache.org/jira/browse/SPARK-29430 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29396) Extend Spark plugin interface to driver
[ https://issues.apache.org/jira/browse/SPARK-29396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948720#comment-16948720 ] Imran Rashid commented on SPARK-29396: -- My hack to get around this in the past was to create a "SparkListener" which just ignored all the events it got, as that lets you run arbitrary code in the driver after most initialization but before anything else runs. It's an ugly API for sure, so it would be nice to improve -- but I'm curious if there is a functional shortcoming you need to address as well? > Extend Spark plugin interface to driver > --- > > Key: SPARK-29396 > URL: https://issues.apache.org/jira/browse/SPARK-29396 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Masiero Vanzin >Priority: Major > > Spark provides an extension API for people to implement executor plugins, > added in SPARK-24918 and later extended in SPARK-28091. > That API does not offer any functionality for doing similar things on the > driver side, though. As a consequence of that, there is not a good way for > the executor plugins to get information or communicate in any way with the > Spark driver. > I've been playing with such an improved API for developing some new > functionality. I'll file a few child bugs for the work to get the changes in. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
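A minimal version of the listener trick described in the comment above might look like the sketch below; the plugin object and its initialize method are hypothetical placeholders for whatever driver-side setup is needed, not an existing API.
{code:scala}
import org.apache.spark.scheduler.SparkListener

// Hypothetical driver-side setup that an executor plugin might want a counterpart for.
object MyDriverSidePlugin {
  def initialize(): Unit = {
    // e.g. open an RPC endpoint or start a metrics reporter on the driver
    println("driver-side plugin initialized")
  }
}

// A listener that overrides nothing, so every event is ignored; its only purpose is that
// instantiating it runs code on the driver after most of SparkContext initialization but
// before any job runs.
class DriverInitListener extends SparkListener {
  MyDriverSidePlugin.initialize()
}
{code}
The class is registered with something like --conf spark.extraListeners=<fully qualified class name> (it needs a zero-argument constructor), which is why it works as an ugly but functional stand-in for a real driver-side plugin API.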
[jira] [Created] (SPARK-29431) Improve Web UI / Sql tab visualization with cached dataframes.
Pablo Langa Blanco created SPARK-29431: -- Summary: Improve Web UI / Sql tab visualization with cached dataframes. Key: SPARK-29431 URL: https://issues.apache.org/jira/browse/SPARK-29431 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco When the Spark plan contains a cached dataframe, the plan of the cached dataframe is not shown in the SQL tree. More info in the pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29430) Document new metric endpoints for Prometheus
[ https://issues.apache.org/jira/browse/SPARK-29430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29430: - Assignee: Dongjoon Hyun > Document new metric endpoints for Prometheus > > > Key: SPARK-29430 > URL: https://issues.apache.org/jira/browse/SPARK-29430 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948800#comment-16948800 ] Mukul Murthy commented on SPARK-29358: -- That would be a start toward not having to do #1, but #2 is still annoying. Serializing and deserializing the data just to merge schemas is clunky, and transforming each DataFrame's schema is annoying enough even when you don't have StructTypes and nested columns. When you add those into the picture, it gets even messier. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +----+----+ > | x| y| > +----+----+ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +----+----+ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
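A rough sketch of the kind of helper the current workaround implies is shown below. It handles only top-level columns -- the nested StructType case called out in the comment is exactly what it does not cover -- and the function name is made up for illustration.
{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Illustrative helper: fill each side's missing top-level columns with typed nulls, then unionByName.
def unionByNameWithNulls(left: DataFrame, right: DataFrame): DataFrame = {
  val leftCols = left.columns.toSet
  val rightCols = right.columns.toSet

  val leftFilled = (rightCols -- leftCols).foldLeft(left) { (df, c) =>
    df.withColumn(c, lit(null).cast(right.schema(c).dataType))
  }
  val rightFilled = (leftCols -- rightCols).foldLeft(right) { (df, c) =>
    df.withColumn(c, lit(null).cast(left.schema(c).dataType))
  }
  leftFilled.unionByName(rightFilled)
}
{code}
Used on the df1/df2 example from the description, unionByNameWithNulls(df1, df2) would produce the x/y table with nulls shown above; nested structs would still need per-field handling, which is the messiness the comment refers to.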
[jira] [Updated] (SPARK-20629) Copy shuffle data when nodes are being shut down
[ https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20629: -- Affects Version/s: (was: 2.3.0) (was: 2.2.0) 3.0.0 > Copy shuffle data when nodes are being shut down > > > Key: SPARK-20629 > URL: https://issues.apache.org/jira/browse/SPARK-20629 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Priority: Major > Labels: bulk-closed > > We decided not to do this for YARN, but for EC2/GCE and similar systems nodes > may be shut down entirely without the ability to keep an AuxiliaryService > around. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20629) Copy shuffle data when nodes are being shut down
[ https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-20629: --- > Copy shuffle data when nodes are being shut down > > > Key: SPARK-20629 > URL: https://issues.apache.org/jira/browse/SPARK-20629 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Holden Karau >Priority: Major > Labels: bulk-closed > > We decided not to do this for YARN, but for EC2/GCE and similar systems nodes > may be shut down entirely without the ability to keep an AuxiliaryService > around. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20624) Add better handling for node shutdown
[ https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-20624: --- > Add better handling for node shutdown > - > > Key: SPARK-20624 > URL: https://issues.apache.org/jira/browse/SPARK-20624 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Holden Karau >Priority: Minor > Labels: bulk-closed > > While we've done some good work with better handling when Spark is choosing > to decommission nodes (SPARK-7955), it might make sense in environments where > we get preempted without our own choice (e.g. YARN over-commit, EC2 spot > instances, GCE Preemptiable instances, etc.) to do something for the data on > the node (or at least not schedule any new tasks). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20624) Add better handling for node shutdown
[ https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20624: -- Labels: (was: bulk-closed) > Add better handling for node shutdown > - > > Key: SPARK-20624 > URL: https://issues.apache.org/jira/browse/SPARK-20624 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Priority: Minor > > While we've done some good work with better handling when Spark is choosing > to decommission nodes (SPARK-7955), it might make sense in environments where > we get preempted without our own choice (e.g. YARN over-commit, EC2 spot > instances, GCE Preemptiable instances, etc.) to do something for the data on > the node (or at least not schedule any new tasks). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20629) Copy shuffle data when nodes are being shut down
[ https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20629: -- Labels: (was: bulk-closed) > Copy shuffle data when nodes are being shut down > > > Key: SPARK-20629 > URL: https://issues.apache.org/jira/browse/SPARK-20629 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Priority: Major > > We decided not to do this for YARN, but for EC2/GCE and similar systems nodes > may be shut down entirely without the ability to keep an AuxiliaryService > around. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20624) Add better handling for node shutdown
[ https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20624: -- Affects Version/s: (was: 2.3.0) (was: 2.2.0) 3.0.0 > Add better handling for node shutdown > - > > Key: SPARK-20624 > URL: https://issues.apache.org/jira/browse/SPARK-20624 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Priority: Minor > Labels: bulk-closed > > While we've done some good work with better handling when Spark is choosing > to decommission nodes (SPARK-7955), it might make sense in environments where > we get preempted without our own choice (e.g. YARN over-commit, EC2 spot > instances, GCE Preemptiable instances, etc.) to do something for the data on > the node (or at least not schedule any new tasks). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20624) Add better handling for node shutdown
[ https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20624: -- Priority: Major (was: Minor) > Add better handling for node shutdown > - > > Key: SPARK-20624 > URL: https://issues.apache.org/jira/browse/SPARK-20624 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Priority: Major > > While we've done some good work with better handling when Spark is choosing > to decommission nodes (SPARK-7955), it might make sense in environments where > we get preempted without our own choice (e.g. YARN over-commit, EC2 spot > instances, GCE Preemptiable instances, etc.) to do something for the data on > the node (or at least not schedule any new tasks). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20732) Copy cache data when node is being shut down
[ https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20732: -- Labels: (was: bulk-closed) > Copy cache data when node is being shut down > > > Key: SPARK-20732 > URL: https://issues.apache.org/jira/browse/SPARK-20732 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20732) Copy cache data when node is being shut down
[ https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20732: -- Affects Version/s: (was: 2.3.0) (was: 2.2.0) 3.0.0 > Copy cache data when node is being shut down > > > Key: SPARK-20732 > URL: https://issues.apache.org/jira/browse/SPARK-20732 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Priority: Major > Labels: bulk-closed > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20732) Copy cache data when node is being shut down
[ https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-20732: --- > Copy cache data when node is being shut down > > > Key: SPARK-20732 > URL: https://issues.apache.org/jira/browse/SPARK-20732 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Holden Karau >Priority: Major > Labels: bulk-closed > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21040) On executor/worker decommission consider speculatively re-launching current tasks
[ https://issues.apache.org/jira/browse/SPARK-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-21040: --- > On executor/worker decommission consider speculatively re-launching current > tasks > - > > Key: SPARK-21040 > URL: https://issues.apache.org/jira/browse/SPARK-21040 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Holden Karau >Priority: Major > Labels: bulk-closed > > If speculative execution is enabled we may wish to consider decommissioning > of worker as a weight for speculative execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21040) On executor/worker decommission consider speculatively re-launching current tasks
[ https://issues.apache.org/jira/browse/SPARK-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21040: -- Labels: (was: bulk-closed) > On executor/worker decommission consider speculatively re-launching current > tasks > - > > Key: SPARK-21040 > URL: https://issues.apache.org/jira/browse/SPARK-21040 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Priority: Major > > If speculative execution is enabled we may wish to consider decommissioning > of worker as a weight for speculative execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21040) On executor/worker decommission consider speculatively re-launching current tasks
[ https://issues.apache.org/jira/browse/SPARK-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21040: -- Affects Version/s: (was: 2.3.0) (was: 2.2.0) 3.0.0 > On executor/worker decommission consider speculatively re-launching current > tasks > - > > Key: SPARK-21040 > URL: https://issues.apache.org/jira/browse/SPARK-21040 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Priority: Major > Labels: bulk-closed > > If speculative execution is enabled we may wish to consider decommissioning > of worker as a weight for speculative execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948832#comment-16948832 ] Nasir Ali commented on SPARK-28502: --- [~bryanc] I tested it and it works fine with master branch. Is there any expected release date for version 3? Or could this bug be fix be integrated in next release? > Error with struct conversion while using pandas_udf > --- > > Key: SPARK-28502 > URL: https://issues.apache.org/jira/browse/SPARK-28502 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: OS: Ubuntu > Python: 3.6 >Reporter: Nasir Ali >Priority: Minor > Fix For: 3.0.0 > > > What I am trying to do: Group data based on time intervals (e.g., 15 days > window) and perform some operations on dataframe using (pandas) UDFs. I don't > know if there is a better/cleaner way to do it. > Below is the sample code that I tried and error message I am getting. > > {code:java} > df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"), > (13.00, "2018-03-11T12:27:18+00:00"), > (25.00, "2018-03-12T11:27:18+00:00"), > (20.00, "2018-03-13T15:27:18+00:00"), > (17.00, "2018-03-14T12:27:18+00:00"), > (99.00, "2018-03-15T11:27:18+00:00"), > (156.00, "2018-03-22T11:27:18+00:00"), > (17.00, "2018-03-31T11:27:18+00:00"), > (25.00, "2018-03-15T11:27:18+00:00"), > (25.00, "2018-03-16T11:27:18+00:00") > ], >["id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > schema = StructType([ > StructField("id", IntegerType()), > StructField("ts", TimestampType()) > ]) > @pandas_udf(schema, PandasUDFType.GROUPED_MAP) > def some_udf(df): > # some computation > return df > df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show() > {code} > This throws following exception: > {code:java} > TypeError: Unsupported type in conversion from Arrow: struct timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]> > {code} > > However, if I use builtin agg method then it works all fine. For example, > {code:java} > df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False) > {code} > Output > {code:java} > +-+--+---+ > |id |window|avg(id)| > +-+--+---+ > |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0 | > |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0 | > |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0 | > |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0 | > |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0 | > |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0 | > |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0 | > +-+--+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948860#comment-16948860 ] Michael Albert commented on SPARK-28921: Is there a timeline for this fix being present in a release version? This bug is really affecting us severely, and I've been waiting for 2.4.5 to drop. Is there a timeline in which we can expect some Spark release version to include the fix? > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Paul Schweigert >Assignee: Andy Grove >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948896#comment-16948896 ] Paul Schweigert commented on SPARK-28921: - [~albertmichaelj] You can replace the `kubernetes-client-4.1.2.jar` (in the `jars/` directory) with the newer version (v 4.4.2). > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Paul Schweigert >Assignee: Andy Grove >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948897#comment-16948897 ] antonkulaga commented on SPARK-28547: - [~hyukjin.kwon] what is not clear to you? I think it is really clear that Spark performs miserably (freezing or taking many hours) whenever the data frame has 10-20K or more columns, and I gave the GTEX dataset as an example (though any gene or transcript expression dataset will do to demonstrate it). In many fields (like a big part of bioinformatics) wide data frames are common; right now Spark is totally useless there. > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per > node, 32 cores (tried different configurations of executors) >Reporter: antonkulaga >Priority: Critical > > Spark is super-slow for all wide data (when there are >15K columns and >15K > rows). Most genomics/transcriptomics data is wide because the number of > genes is usually >20K and the number of samples as well. The very popular GTEX > dataset is a good example (see for instance the RNA-Seq data at > https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is > just a .tsv file with two comment lines at the beginning). Everything done on wide > tables (even simple "describe" functions applied to all the gene columns) > either takes hours or gets frozen (because of lost executors), irrespective of > memory and number of cores, while the same operations work fast (minutes) > and well with pure pandas (without any Spark involved). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948897#comment-16948897 ] antonkulaga edited comment on SPARK-28547 at 10/10/19 7:24 PM: --- [~hyukjin.kwon] what is not clear to you? I think it is really clear that Spark performs miserably (freezing or taking hours/days to compute, even for the simplest operations like per-column statistics) whenever the data frame has 10-20K or more columns, and I gave the GTEX dataset as an example (though any gene or transcript expression dataset will do to demonstrate it). In many fields (like a big part of bioinformatics) wide data frames are common; right now Spark is totally useless there. was (Author: antonkulaga): [~hyukjin.kwon] what is not clear for you? I think it is really clear that Spark performs miserably (freezing or taking many hours) whenever the data frame has 10-20K and more columns and I gave GTEX dataset as an example (however any gene or transcript expression dataset will be ok to demonstrate it). In many fields (like big part of bioinformatics) wide data frames are common, right now Spark is totally useless there. > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per > node, 32 cores (tried different configurations of executors) >Reporter: antonkulaga >Priority: Critical > > Spark is super-slow for all wide data (when there are >15K columns and >15K > rows). Most genomics/transcriptomics data is wide because the number of > genes is usually >20K and the number of samples as well. The very popular GTEX > dataset is a good example (see for instance the RNA-Seq data at > https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is > just a .tsv file with two comment lines at the beginning). Everything done on wide > tables (even simple "describe" functions applied to all the gene columns) > either takes hours or gets frozen (because of lost executors), irrespective of > memory and number of cores, while the same operations work fast (minutes) > and well with pure pandas (without any Spark involved). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948907#comment-16948907 ] Sean R. Owen commented on SPARK-28547: -- I agree, this is too open-ended. It's not clear whether it's a general problem or specific to a usage pattern, a SQL query, a data type or distribution. Often I find that use cases for "1 columns" are use cases for "a big array-valued column". I bet there is room for improvement, but, ten thousand columns is just inherently slow given how metadata, query plans, etc are handled. You'd at least need to help narrow down where the slow down is and why, and even better if you can propose a class of fix. As it is I'd close this. > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per > node, 32 cores (tried different configurations of executors) >Reporter: antonkulaga >Priority: Critical > > Spark is super-slow for all wide data (when there are >15kb columns and >15kb > rows). Most of the genomics/transcriptomic data is wide because number of > genes is usually >20kb and number of samples ass well. Very popular GTEX > dataset is a good example ( see for instance RNA-Seq data at > https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is > just a .tsv file with two comments in the beginning). Everything done in wide > tables (even simple "describe" functions applied to all the genes-columns) > either takes hours or gets frozen (because of lost executors) irrespective of > memory and numbers of cores. While the same operations work fast (minutes) > and well with pure pandas (without any spark involved). > f -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
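To make the request above actionable, here is a minimal, self-contained sketch for reproducing and timing the wide-schema slowdown without downloading GTEx. It assumes a local SparkSession; the column and row counts are illustrative knobs for narrowing down where planning or execution time degrades, not benchmark claims.

{code:python}
import time

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.getOrCreate()

n_cols = 10_000  # illustrative; sweep this to see where analysis/planning time blows up
wide = spark.range(1_000).select(
    "id", *[rand().alias(f"g{i}") for i in range(n_cols)]
)

start = time.time()
wide.describe().collect()  # per-column summary statistics over a very wide schema
print(f"describe() over {n_cols} columns took {time.time() - start:.1f}s")
{code}

Comparing the timing of this synthetic case against the same data held in a single array-valued column would help separate "many columns" overhead (metadata, query plans, codegen) from the actual aggregation cost.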
[jira] [Created] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe
Prasanna Saraswathi Krishnan created SPARK-29432: Summary: nullable flag of new column changes when persisting a pyspark dataframe Key: SPARK-29432 URL: https://issues.apache.org/jira/browse/SPARK-29432 Project: Spark Issue Type: Question Components: SQL Affects Versions: 2.4.0 Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution) Python 3.7.3 Reporter: Prasanna Saraswathi Krishnan When I add a new column to a dataframe with {{withColumn}} function, by default, the column is added with {{nullable=false}}. But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this the expected behavior? why? {{>>> l = [('Alice', 1)]}} {{>>> df = spark.createDataFrame(l)}} {{>>> df.printSchema()}} {{root}} {{ |-- _1: string (nullable = true)}} {{ |-- _2: long (nullable = true)}} {{>>> from pyspark.sql.functions import lit}} {{>>> df = df.withColumn('newCol', lit('newVal'))}} {{>>> df.printSchema()}} {{root}} {{ |-- _1: string (nullable = true)}} {{ |-- _2: long (nullable = true)}} {{ |-- newCol: string (nullable = false)}} {{>>> df.write.saveAsTable('default.withcolTest', mode='overwrite')}} {{>>> spark.sql("select * from default.withcolTest").printSchema()}} {{root}} {{ |-- _1: string (nullable = true)}} {{ |-- _2: long (nullable = true)}} {{ |-- newCol: string (nullable = true)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe
[ https://issues.apache.org/jira/browse/SPARK-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasanna Saraswathi Krishnan updated SPARK-29432: - Description: When I add a new column to a dataframe with {{withColumn}} function, by default, the column is added with {{nullable=false}}. But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this the expected behavior? why? {{>>> l = [('Alice', 1)]}} {{>>> df = spark.createDataFrame(l)}} {{>>> df.printSchema()}} {{root}} |-- _1: string (nullable = true) |-- _2: long (nullable = true) {{>>> from pyspark.sql.functions import lit}} {{>>> df = df.withColumn('newCol', lit('newVal'))}} {{>>> df.printSchema()}} {{root}} \{{ |-- _1: string (nullable = true)}} \{{ |-- _2: long (nullable = true)}} \{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from default.withcolTest").printSchema()}} {{root}} \{{ |-- _1: string (nullable = true)}} \{{ |-- _2: long (nullable = true)}} \{{ |-- newCol: string (nullable = true)}} was: When I add a new column to a dataframe with {{withColumn}} function, by default, the column is added with {{nullable=false}}. But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this the expected behavior? why? {{>>> l = [('Alice', 1)]}} {{>>> df = spark.createDataFrame(l)}} {{>>> df.printSchema()}} {{root}} {{ |-- _1: string (nullable = true)}} {{ |-- _2: long (nullable = true)}}{{>>> from pyspark.sql.functions import lit}} {{>>> df = df.withColumn('newCol', lit('newVal'))}} {{>>> df.printSchema()}} {{root}} {{ |-- _1: string (nullable = true)}} {{ |-- _2: long (nullable = true)}} {{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from default.withcolTest").printSchema()}} {{root}} {{ |-- _1: string (nullable = true)}} {{ |-- _2: long (nullable = true)}} {{ |-- newCol: string (nullable = true)}} > nullable flag of new column changes when persisting a pyspark dataframe > --- > > Key: SPARK-29432 > URL: https://issues.apache.org/jira/browse/SPARK-29432 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution) > Python 3.7.3 >Reporter: Prasanna Saraswathi Krishnan >Priority: Minor > > When I add a new column to a dataframe with {{withColumn}} function, by > default, the column is added with {{nullable=false}}. > But, when I save the dataframe, the flag changes to {{nullable=true}}. Is > this the expected behavior? why? > > {{>>> l = [('Alice', 1)]}} > {{>>> df = spark.createDataFrame(l)}} > {{>>> df.printSchema()}} > {{root}} > |-- _1: string (nullable = true) > |-- _2: long (nullable = true) > {{>>> from pyspark.sql.functions import lit}} > {{>>> df = df.withColumn('newCol', lit('newVal'))}} > {{>>> df.printSchema()}} > {{root}} > \{{ |-- _1: string (nullable = true)}} > \{{ |-- _2: long (nullable = true)}} > \{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from > default.withcolTest").printSchema()}} > {{root}} > \{{ |-- _1: string (nullable = true)}} > \{{ |-- _2: long (nullable = true)}} > \{{ |-- newCol: string (nullable = true)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe
[ https://issues.apache.org/jira/browse/SPARK-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasanna Saraswathi Krishnan updated SPARK-29432: - Description: When I add a new column to a dataframe with {{withColumn}} function, by default, the column is added with {{nullable=false}}. But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this the expected behavior? why? {{>>> l = [('Alice', 1)]}} {{>>> df = spark.createDataFrame(l)}} {{>>> df.printSchema()}} {{root}} {{ |-- _1: string (nullable = true)}} {{ |-- _2: long (nullable = true)}}{{>>> from pyspark.sql.functions import lit}} {{>>> df = df.withColumn('newCol', lit('newVal'))}} {{>>> df.printSchema()}} {{root}} {{ |-- _1: string (nullable = true)}} {{ |-- _2: long (nullable = true)}} {{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from default.withcolTest").printSchema()}} {{root}} {{ |-- _1: string (nullable = true)}} {{ |-- _2: long (nullable = true)}} {{ |-- newCol: string (nullable = true)}} was: When I add a new column to a dataframe with {{withColumn}} function, by default, the column is added with {{nullable=false}}. But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this the expected behavior? why? {{>>> l = [('Alice', 1)]}} {{>>> df = spark.createDataFrame(l)}} {{>>> df.printSchema()}} {{root}} {{ |-- _1: string (nullable = true)}} {{ |-- _2: long (nullable = true)}} {{>>> from pyspark.sql.functions import lit}} {{>>> df = df.withColumn('newCol', lit('newVal'))}} {{>>> df.printSchema()}} {{root}} {{ |-- _1: string (nullable = true)}} {{ |-- _2: long (nullable = true)}} {{ |-- newCol: string (nullable = false)}} {{>>> df.write.saveAsTable('default.withcolTest', mode='overwrite')}} {{>>> spark.sql("select * from default.withcolTest").printSchema()}} {{root}} {{ |-- _1: string (nullable = true)}} {{ |-- _2: long (nullable = true)}} {{ |-- newCol: string (nullable = true)}} > nullable flag of new column changes when persisting a pyspark dataframe > --- > > Key: SPARK-29432 > URL: https://issues.apache.org/jira/browse/SPARK-29432 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution) > Python 3.7.3 >Reporter: Prasanna Saraswathi Krishnan >Priority: Minor > > When I add a new column to a dataframe with {{withColumn}} function, by > default, the column is added with {{nullable=false}}. > But, when I save the dataframe, the flag changes to {{nullable=true}}. Is > this the expected behavior? why? > > {{>>> l = [('Alice', 1)]}} > {{>>> df = spark.createDataFrame(l)}} > {{>>> df.printSchema()}} > {{root}} > {{ |-- _1: string (nullable = true)}} > {{ |-- _2: long (nullable = true)}}{{>>> from pyspark.sql.functions import > lit}} > {{>>> df = df.withColumn('newCol', lit('newVal'))}} > {{>>> df.printSchema()}} > {{root}} > {{ |-- _1: string (nullable = true)}} > {{ |-- _2: long (nullable = true)}} > {{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from > default.withcolTest").printSchema()}} > {{root}} > {{ |-- _1: string (nullable = true)}} > {{ |-- _2: long (nullable = true)}} > {{ |-- newCol: string (nullable = true)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe
[ https://issues.apache.org/jira/browse/SPARK-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasanna Saraswathi Krishnan updated SPARK-29432: - Description: When I add a new column to a dataframe with {{withColumn}} function, by default, the column is added with {{nullable=false}}. But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this the expected behavior? why? {{>>> l = [('Alice', 1)]}} {{>>> df = spark.createDataFrame(l)}} {{>>> df.printSchema()}} {{root}} \{{ |-- _1: string (nullable = true)}} \{{ |-- _2: long (nullable = true)}} {{>>> from pyspark.sql.functions import lit}} {{>>> df = df.withColumn('newCol', lit('newVal'))}} {{>>> df.printSchema()}} {{root}} \{{ |-- _1: string (nullable = true)}} \{{ |-- _2: long (nullable = true)}} \{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from default.withcolTest").printSchema()}} {{root}} \{{ |-- _1: string (nullable = true)}} \{{ |-- _2: long (nullable = true)}} \{{ |-- newCol: string (nullable = true)}} was: When I add a new column to a dataframe with {{withColumn}} function, by default, the column is added with {{nullable=false}}. But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this the expected behavior? why? {{>>> l = [('Alice', 1)]}} {{>>> df = spark.createDataFrame(l)}} {{>>> df.printSchema()}} {{root}} |-- _1: string (nullable = true) |-- _2: long (nullable = true) {{>>> from pyspark.sql.functions import lit}} {{>>> df = df.withColumn('newCol', lit('newVal'))}} {{>>> df.printSchema()}} {{root}} \{{ |-- _1: string (nullable = true)}} \{{ |-- _2: long (nullable = true)}} \{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from default.withcolTest").printSchema()}} {{root}} \{{ |-- _1: string (nullable = true)}} \{{ |-- _2: long (nullable = true)}} \{{ |-- newCol: string (nullable = true)}} > nullable flag of new column changes when persisting a pyspark dataframe > --- > > Key: SPARK-29432 > URL: https://issues.apache.org/jira/browse/SPARK-29432 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution) > Python 3.7.3 >Reporter: Prasanna Saraswathi Krishnan >Priority: Minor > > When I add a new column to a dataframe with {{withColumn}} function, by > default, the column is added with {{nullable=false}}. > But, when I save the dataframe, the flag changes to {{nullable=true}}. Is > this the expected behavior? why? > > {{>>> l = [('Alice', 1)]}} > {{>>> df = spark.createDataFrame(l)}} > {{>>> df.printSchema()}} > {{root}} > \{{ |-- _1: string (nullable = true)}} > \{{ |-- _2: long (nullable = true)}} > {{>>> from pyspark.sql.functions import lit}} > {{>>> df = df.withColumn('newCol', lit('newVal'))}} > {{>>> df.printSchema()}} > {{root}} > \{{ |-- _1: string (nullable = true)}} > \{{ |-- _2: long (nullable = true)}} > \{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from > default.withcolTest").printSchema()}} > {{root}} > \{{ |-- _1: string (nullable = true)}} > \{{ |-- _2: long (nullable = true)}} > \{{ |-- newCol: string (nullable = true)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
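As background for the question above: the nullable flag is schema metadata, and {{lit()}} produces a non-null constant, which is why the new column starts out as nullable=false. The sketch below is illustrative only (it does not claim what saveAsTable should report); it shows one common way to force a particular nullability by rebuilding the DataFrame with an edited schema.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import StructField, StructType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('Alice', 1)]).withColumn('newCol', lit('newVal'))
df.printSchema()  # 'newCol' is nullable=false because lit() is a non-null constant

# Rebuild the frame with every field marked nullable, mirroring what a scan of
# the persisted table reports in the description above.
nullable_schema = StructType(
    [StructField(f.name, f.dataType, nullable=True) for f in df.schema.fields]
)
df_nullable = spark.createDataFrame(df.rdd, nullable_schema)
df_nullable.printSchema()  # all columns now show nullable=true
{code}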
[jira] [Commented] (SPARK-29116) Refactor py classes related to DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-29116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948942#comment-16948942 ] Huaxin Gao commented on SPARK-29116: I will submit a PR after the DecisionTree refactor is in, so I don't need to rebase and merge later. [~podongfeng] > Refactor py classes related to DecisionTree > --- > > Key: SPARK-29116 > URL: https://issues.apache.org/jira/browse/SPARK-29116 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > 1, Like the scala side, move related classes to a separate file 'tree.py' > 2, add method 'predictLeaf' in 'DecisionTreeModel' & 'TreeEnsembleModel' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29433) Web UI Stages table tooltip correction
Pablo Langa Blanco created SPARK-29433: -- Summary: Web UI Stages table tooltip correction Key: SPARK-29433 URL: https://issues.apache.org/jira/browse/SPARK-29433 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco In the Web UI Stages table, the tooltips of the Input and Output columns are not correct. Current tooltip messages: * Bytes and records read from Hadoop or from Spark storage. * Bytes and records written to Hadoop. These columns only show bytes, not records. More information is in the pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948957#comment-16948957 ] Maxim Gekk commented on SPARK-26651: [~jiangxb] Could you consider including this in the list of major changes for Spark 3.0? > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and use java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps as well as in extracting > sub-components like years, days, etc. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology > |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations by using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might impact the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
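For readers of the release note above, the sketch below is a quick, hedged way to poke at the calendar-sensitive range. How the literal prints or round-trips depends on the Spark version in use, so treat the output as something to verify rather than a documented guarantee.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# October 5-14, 1582 do not exist in the hybrid Julian+Gregorian calendar but do
# exist in the proleptic Gregorian calendar, so literals in this range are the
# first place to look for behavior differences between Spark 2.4 and 3.0.
spark.sql("SELECT DATE '1582-10-10' AS calendar_sensitive_date").show()
{code}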
[jira] [Created] (SPARK-29434) Improve the MapStatuses serialization performance
DB Tsai created SPARK-29434: --- Summary: Improve the MapStatuses serialization performance Key: SPARK-29434 URL: https://issues.apache.org/jira/browse/SPARK-29434 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.4 Reporter: DB Tsai Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27665) Split fetch shuffle blocks protocol from OpenBlocks
[ https://issues.apache.org/jira/browse/SPARK-27665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948979#comment-16948979 ] koert kuipers commented on SPARK-27665: --- i tried using spark.shuffle.useOldFetchProtocol=true while using spark 3 (master) to launch job, with spark 2.4.1 shuffle service running in yarn. i cannot get it to work. for example on one cluster i saw: {code} Error occurred while fetching local blocks java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1570697024032_0058/blockmgr-d1d009b1-1c95-4e2a-9a71-0ff20078b9a8/38/shuffle_0_0_0.index {code} on another: {code} org.apache.spark.shuffle.FetchFailedException: /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:596) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:511) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:67) at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:850) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:850) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.nio.file.NoSuchFileException: /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) at java.nio.file.Files.newByteChannel(Files.java:361) at java.nio.file.Files.newByteChannel(Files.java:407) at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:204) at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:551) at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:349) at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:391) at org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:161) at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:60) at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:172) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) ... 11 more {code} > Split fetch shuffle blocks protocol from OpenBlocks > --- > > Key: SPARK-27665 > URL: https://issues.apache.org/jira/brows
[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948981#comment-16948981 ] Bryan Cutler commented on SPARK-28502: -- Thanks for testing it out [~nasirali]! It's unlikely that this would make it to the 2.4.x line, but the next Spark release will be 3.0.0 anyway, which will roughly be early 2020. > Error with struct conversion while using pandas_udf > --- > > Key: SPARK-28502 > URL: https://issues.apache.org/jira/browse/SPARK-28502 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: OS: Ubuntu > Python: 3.6 >Reporter: Nasir Ali >Priority: Minor > Fix For: 3.0.0 > > > What I am trying to do: Group data based on time intervals (e.g., 15 days > window) and perform some operations on dataframe using (pandas) UDFs. I don't > know if there is a better/cleaner way to do it. > Below is the sample code that I tried and error message I am getting. > > {code:java} > df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"), > (13.00, "2018-03-11T12:27:18+00:00"), > (25.00, "2018-03-12T11:27:18+00:00"), > (20.00, "2018-03-13T15:27:18+00:00"), > (17.00, "2018-03-14T12:27:18+00:00"), > (99.00, "2018-03-15T11:27:18+00:00"), > (156.00, "2018-03-22T11:27:18+00:00"), > (17.00, "2018-03-31T11:27:18+00:00"), > (25.00, "2018-03-15T11:27:18+00:00"), > (25.00, "2018-03-16T11:27:18+00:00") > ], >["id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > schema = StructType([ > StructField("id", IntegerType()), > StructField("ts", TimestampType()) > ]) > @pandas_udf(schema, PandasUDFType.GROUPED_MAP) > def some_udf(df): > # some computation > return df > df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show() > {code} > This throws following exception: > {code:java} > TypeError: Unsupported type in conversion from Arrow: struct timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]> > {code} > > However, if I use builtin agg method then it works all fine. For example, > {code:java} > df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False) > {code} > Output > {code:java} > +-+--+---+ > |id |window|avg(id)| > +-+--+---+ > |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0 | > |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0 | > |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0 | > |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0 | > |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0 | > |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0 | > |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0 | > +-+--+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29435) Spark 3 doesnt work with older shuffle service
koert kuipers created SPARK-29435: - Summary: Spark 3 doesnt work with older shuffle service Key: SPARK-29435 URL: https://issues.apache.org/jira/browse/SPARK-29435 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 3.0.0 Environment: Spark 3 from Sept 26, commit 8beb736a00b004f97de7fcdf9ff09388d80fc548 Spark 2.4.1 shuffle service in yarn Reporter: koert kuipers SPARK-27665 introduced a change to the shuffle protocol. It also introduced a setting spark.shuffle.useOldFetchProtocol which would allow spark 3 to run with old shuffle service. However i have not gotten that to work. I have been testing with Spark 3 master (from Sept 26) and shuffle service from Spark 2.4.1 in yarn. The errors i see are for example on EMR: {code} Error occurred while fetching local blocks java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1570697024032_0058/blockmgr-d1d009b1-1c95-4e2a-9a71-0ff20078b9a8/38/shuffle_0_0_0.index {code} And on CDH5: {code} org.apache.spark.shuffle.FetchFailedException: /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:596) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:511) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:67) at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:850) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:850) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.nio.file.NoSuchFileException: /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) at 
sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) at java.nio.file.Files.newByteChannel(Files.java:361) at java.nio.file.Files.newByteChannel(Files.java:407) at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:204) at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:551) at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:349) at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:391) at org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:161) at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:60) at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:172) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spar
[jira] [Commented] (SPARK-29435) Spark 3 doesnt work with older shuffle service
[ https://issues.apache.org/jira/browse/SPARK-29435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948987#comment-16948987 ] Marcelo Masiero Vanzin commented on SPARK-29435: I think you have to set {{spark.shuffle.useOldFetchProtocol=true}} with 3.0 for that to work. > Spark 3 doesnt work with older shuffle service > -- > > Key: SPARK-29435 > URL: https://issues.apache.org/jira/browse/SPARK-29435 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0 > Environment: Spark 3 from Sept 26, commit > 8beb736a00b004f97de7fcdf9ff09388d80fc548 > Spark 2.4.1 shuffle service in yarn >Reporter: koert kuipers >Priority: Major > > SPARK-27665 introduced a change to the shuffle protocol. It also introduced a > setting spark.shuffle.useOldFetchProtocol which would allow spark 3 to run > with old shuffle service. > However i have not gotten that to work. I have been testing with Spark 3 > master (from Sept 26) and shuffle service from Spark 2.4.1 in yarn. > The errors i see are for example on EMR: > {code} > Error occurred while fetching local blocks > java.nio.file.NoSuchFileException: > /mnt1/yarn/usercache/hadoop/appcache/application_1570697024032_0058/blockmgr-d1d009b1-1c95-4e2a-9a71-0ff20078b9a8/38/shuffle_0_0_0.index > {code} > And on CDH5: > {code} > org.apache.spark.shuffle.FetchFailedException: > /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:596) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:511) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:67) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:850) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:850) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.nio.file.NoSuchFileException: > 
/data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index > at > sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) > at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) > at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) > at > sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) > at java.nio.file.Files.newByteChannel(Files.java:361) > at java.nio.file.Files.newByteChannel(Files.java:407) > at > org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:204) > at > org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:551) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:349) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:391) > at > org.ap
[jira] [Commented] (SPARK-29435) Spark 3 doesnt work with older shuffle service
[ https://issues.apache.org/jira/browse/SPARK-29435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948989#comment-16948989 ] koert kuipers commented on SPARK-29435: --- [~vanzin] sorry i should have been more clear, i did set spark.shuffle.useOldFetchProtocol=true and i could not get it to work > Spark 3 doesnt work with older shuffle service > -- > > Key: SPARK-29435 > URL: https://issues.apache.org/jira/browse/SPARK-29435 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0 > Environment: Spark 3 from Sept 26, commit > 8beb736a00b004f97de7fcdf9ff09388d80fc548 > Spark 2.4.1 shuffle service in yarn >Reporter: koert kuipers >Priority: Major > > SPARK-27665 introduced a change to the shuffle protocol. It also introduced a > setting spark.shuffle.useOldFetchProtocol which would allow spark 3 to run > with old shuffle service. > However i have not gotten that to work. I have been testing with Spark 3 > master (from Sept 26) and shuffle service from Spark 2.4.1 in yarn. > The errors i see are for example on EMR: > {code} > Error occurred while fetching local blocks > java.nio.file.NoSuchFileException: > /mnt1/yarn/usercache/hadoop/appcache/application_1570697024032_0058/blockmgr-d1d009b1-1c95-4e2a-9a71-0ff20078b9a8/38/shuffle_0_0_0.index > {code} > And on CDH5: > {code} > org.apache.spark.shuffle.FetchFailedException: > /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:596) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:511) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:67) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:850) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:850) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: 
java.nio.file.NoSuchFileException: > /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index > at > sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) > at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) > at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) > at > sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) > at java.nio.file.Files.newByteChannel(Files.java:361) > at java.nio.file.Files.newByteChannel(Files.java:407) > at > org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:204) > at > org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:551) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:349) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:391) >
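For reference, here is a minimal sketch of how the flag discussed in this thread is typically supplied when launching an application. It only shows the mechanics of setting spark.shuffle.useOldFetchProtocol (which both commenters describe doing) and does not claim to resolve the failure reported above.

{code:python}
from pyspark.sql import SparkSession

# The same key can be passed on the command line instead:
#   spark-submit --conf spark.shuffle.useOldFetchProtocol=true ...
spark = (
    SparkSession.builder
    .appName("old-fetch-protocol-check")  # hypothetical application name
    .config("spark.shuffle.useOldFetchProtocol", "true")
    .getOrCreate()
)

# Confirm the setting actually reached the SparkConf of the running application.
print(spark.sparkContext.getConf().get("spark.shuffle.useOldFetchProtocol"))
{code}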
[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-26806: --- Description: Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will make "avg" become "NaN". And whatever gets merged with the result of "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will return "0" and the user will see the following incorrect report: {code:java} "eventTime" : { "avg" : "1970-01-01T00:00:00.000Z", "max" : "2019-01-31T12:57:00.000Z", "min" : "2019-01-30T18:44:04.000Z", "watermark" : "1970-01-01T00:00:00.000Z" } {code} was: Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will make "avg" become "NaN". And whatever gets merged with the result of "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will return "0" and the user will see the following incorrect report: {code} "eventTime" : { "avg" : "1970-01-01T00:00:00.000Z", "max" : "2019-01-31T12:57:00.000Z", "min" : "2019-01-30T18:44:04.000Z", "watermark" : "1970-01-01T00:00:00.000Z" } {code} This issue was reported by [~liancheng] > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: Cheng Lian >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.2.4, 2.3.3, 2.4.1, 3.0.0 > > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code:java} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-26806: --- Reporter: Cheng Lian (was: liancheng) > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: Cheng Lian >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.2.4, 2.3.3, 2.4.1, 3.0.0 > > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code} > This issue was reported by [~liancheng] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
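A small, hedged illustration of the arithmetic behind the report above, written in plain Python rather than the actual EventTimeStats code: on the JVM, 0.0 / 0 evaluates to NaN instead of raising, and NaN then survives every later weighted-average merge, which is why the broken "avg" sticks around.

{code:python}
import math

def merge_avg(avg1, count1, avg2, count2):
    """Weighted-average merge, emulating JVM semantics where 0.0 / 0 yields NaN."""
    total = count1 + count2
    if total == 0:
        return float("nan"), 0  # zero.merge(zero): the average becomes NaN
    return (avg1 * count1 + avg2 * count2) / total, total

avg, count = merge_avg(0.0, 0, 0.0, 0)        # merging two empty ("zero") stats
avg, count = merge_avg(avg, count, 100.0, 5)  # NaN * 0 is still NaN, so it propagates
print(avg, math.isnan(avg))                   # nan True
# On the JVM, NaN.toLong is 0, which is what produces the 1970-01-01 "avg" above.
{code}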
[jira] [Resolved] (SPARK-29367) pandas udf not working with latest pyarrow release (0.15.0)
[ https://issues.apache.org/jira/browse/SPARK-29367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29367. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26045 [https://github.com/apache/spark/pull/26045] > pandas udf not working with latest pyarrow release (0.15.0) > --- > > Key: SPARK-29367 > URL: https://issues.apache.org/jira/browse/SPARK-29367 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.4.0, 2.4.1, 2.4.3 >Reporter: Julien Peloton >Assignee: Bryan Cutler >Priority: Major > Fix For: 3.0.0 > > > Hi, > I recently upgraded pyarrow from 0.14 to 0.15 (released on Oct 5th), and my > pyspark jobs using pandas udf are failing with > java.lang.IllegalArgumentException (tested with Spark 2.4.0, 2.4.1, and > 2.4.3). Here is a full example to reproduce the failure with pyarrow 0.15: > {code:python} > from pyspark.sql import SparkSession > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql.types import BooleanType > import pandas as pd > @pandas_udf(BooleanType(), PandasUDFType.SCALAR) > def qualitycuts(nbad: int, rb: float, magdiff: float) -> pd.Series: > """ Apply simple quality cuts > Returns > -- > out: pandas.Series of booleans > Return a Pandas DataFrame with the appropriate flag: false for bad alert, > and true for good alert. > """ > mask = nbad.values == 0 > mask *= rb.values >= 0.55 > mask *= abs(magdiff.values) <= 0.1 > return pd.Series(mask) > spark = SparkSession.builder.getOrCreate() > # Create dummy DF > colnames = ["nbad", "rb", "magdiff"] > df = spark.sparkContext.parallelize( > zip( > [0, 1, 0, 0], > [0.01, 0.02, 0.6, 0.01], > [0.02, 0.05, 0.1, 0.01] > ) > ).toDF(colnames) > df.show() > # Apply cuts > df = df\ > .withColumn("toKeep", qualitycuts(*colnames))\ > .filter("toKeep == true")\ > .drop("toKeep") > # This will fail if latest pyarrow 0.15.0 is used > df.show() > {code} > and the log is: > {code} > Driver stacktrace: > 19/10/07 09:37:49 INFO DAGScheduler: Job 3 failed: showString at > NativeMethodAccessorImpl.java:0, took 0.660523 s > Traceback (most recent call last): > File > "/Users/julien/Documents/workspace/myrepos/fink-broker/test_pyarrow.py", line > 44, in > df.show() > File > "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", > line 378, in show > File > "/Users/julien/Documents/workspace/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File > "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > File > "/Users/julien/Documents/workspace/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o64.showString. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 > (TID 5, localhost, executor driver): java.lang.IllegalArgumentException > at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) > at > org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) > at > org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) > at > org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181) > at > org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172) > at > org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.(ArrowEvalPythonExec.scala:98) > at > org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:96) > at > org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:127) > at > org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:89) > at
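Until a release containing the fix above is available, a commonly referenced mitigation on the 2.4.x line is sketched below. The environment variable is the one mentioned in the Arrow/Spark compatibility notes for the 0.15 IPC format change; treat it as an assumption to verify for your deployment, and make sure it reaches both the driver and the executors' Python workers (for example via spark-env.sh or spark.executorEnv.*).

{code:python}
import os

# Option 1: ask pyarrow >= 0.15 to keep writing the pre-0.15 IPC stream format
# that Spark 2.4.x expects.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# Option 2: pin the dependency instead, e.g.
#   pip install 'pyarrow<0.15.0'
{code}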
[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949050#comment-16949050 ] angerszhu commented on SPARK-29354: --- [~Elixir Kook] I downloaded spark-2.4.4-bin-hadoop2.7 from your link and I can find jline-2.14.6.jar in the jars/ folder. > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has a direct dependency on jline, included in the root pom.xml, > but binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29284) df.distinct.count throws NoSuchElementException when adaptive execution is enabled
[ https://issues.apache.org/jira/browse/SPARK-29284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29284. -- Resolution: Cannot Reproduce > df.distinct.count throws NoSuchElementException when adaptive execution is enabled > - > > Key: SPARK-29284 > URL: https://issues.apache.org/jira/browse/SPARK-29284 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: yiming.xu >Priority: Minor > > This case: > spark.sql("SET spark.sql.adaptive.enabled=true") > spark.sql("SET spark.sql.shuffle.partitions=1") > val result = spark.range(0).distinct().count > or spark.table("empty table").distinct.count > throws java.util.NoSuchElementException: next on empty iterator > If current stage partition is > 0,org.apache.spark.sql.execution.exchange.ExchangeCoordinator#doEstimationIfNecessary->partitionStartIndices > is empty (https://issues.apache.org/jira/browse/SPARK-22144) > next stage rdd partition is empty. > To solve the problem I will change partitionStartIndices to Array(0) when the > parent rdd partition is 0; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
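Because the ticket above was closed as Cannot Reproduce, a PySpark translation of the reporter's steps is handy for re-checking on a newer build. This is a hedged repro sketch; the behavior is version-dependent and the settings simply mirror the quoted description.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "1")

# An empty input combined with adaptive execution is the pattern the reporter
# says triggered "next on empty iterator" on 2.4.4.
print(spark.range(0).distinct().count())  # expected to print 0 on builds with the fix
{code}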
[jira] [Updated] (SPARK-28636) Thriftserver can not support decimal type with negative scale
[ https://issues.apache.org/jira/browse/SPARK-28636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28636: Issue Type: Bug (was: Improvement) > Thriftserver can not support decimal type with negative scale > - > > Key: SPARK-28636 > URL: https://issues.apache.org/jira/browse/SPARK-28636 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > 0: jdbc:hive2://localhost:1> select 2.35E10 * 1.0; > Error: java.lang.IllegalArgumentException: Error: name expected at the > position 10 of 'decimal(6,-7)' but '-' is found. (state=,code=0) > {code} > {code:sql} > spark-sql> select 2.35E10 * 1.0; > 235 > {code} > ThriftServer log: > {noformat} > java.lang.RuntimeException: java.lang.IllegalArgumentException: Error: name > expected at the position 10 of 'decimal(6,-7)' but '-' is found. > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at > java.security.AccessController.doPrivileged(AccessController.java:770) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy31.getResultSetMetadata(Unknown Source) > at > org.apache.hive.service.cli.CLIService.getResultSetMetadata(CLIService.java:502) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.GetResultSetMetadata(ThriftCLIService.java:609) > at > org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetResultSetMetadata.getResult(TCLIService.java:1697) > at > org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetResultSetMetadata.getResult(TCLIService.java:1682) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:819) > Caused by: java.lang.IllegalArgumentException: Error: name expected at the > position 10 of 'decimal(6,-7)' but '-' is found. 
> at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:378) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseParams(TypeInfoUtils.java:403) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parsePrimitiveParts(TypeInfoUtils.java:542) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.parsePrimitiveParts(TypeInfoUtils.java:557) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory.createPrimitiveTypeInfo(TypeInfoFactory.java:136) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory.getPrimitiveTypeInfo(TypeInfoFactory.java:109) > at > org.apache.hive.service.cli.TypeDescriptor.(TypeDescriptor.java:58) > at org.apache.hive.service.cli.TableSchema.(TableSchema.java:54) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$.getTableSchema(SparkExecuteStatementOperation.scala:313) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.resultSchema$lzycompute(SparkExecuteStatementOperation.scala:69) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.resultSchema(SparkExecuteStatementOperation.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getResultSetSchema(SparkExecuteStatementOperation.scala:157) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationResultSetSchema(OperationManager.java:233) > at > org.apache.hive.service.cli.session.HiveSessionImpl.getResultSetMetadata(HiveSessionImpl.java:787) > at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at jav
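For completeness, the same expression can be probed through a plain SparkSession, which exercises Spark's own type handling rather than the Hive TypeInfo parser that fails in the stack trace above. This is a hedged probe, not an expected-output claim: whether the result is a double or a negative-scale decimal varies by Spark version and configuration.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

result = spark.sql("SELECT 2.35E10 * 1.0 AS product")
result.printSchema()  # inspect whether a negative-scale decimal type is produced
result.show()
{code}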
[jira] [Updated] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe
[ https://issues.apache.org/jira/browse/SPARK-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29432: - Description: When I add a new column to a dataframe with {{withColumn}} function, by default, the column is added with {{nullable=false}}. But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this the expected behavior? why? {code} >>> l = [('Alice', 1)] >>> df = spark.createDataFrame(l) >>> df.printSchema() root |-- _1: string (nullable = true) |-- _2: long (nullable = true) {code} {code} >>> from pyspark.sql.functions import lit >>> df = df.withColumn('newCol', lit('newVal')) >>> df.printSchema() root |-- _1: string (nullable = true) |-- _2: long (nullable = true) |-- newCol: string (nullable = false) >>> spark.sql("select * from default.withcolTest").printSchema() |-- _1: string (nullable = true) |-- _2: long (nullable = true) |-- newCol: string (nullable = true) {code} was: When I add a new column to a dataframe with {{withColumn}} function, by default, the column is added with {{nullable=false}}. But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this the expected behavior? why? {{>>> l = [('Alice', 1)]}} {{>>> df = spark.createDataFrame(l)}} {{>>> df.printSchema()}} {{root}} \{{ |-- _1: string (nullable = true)}} \{{ |-- _2: long (nullable = true)}} {{>>> from pyspark.sql.functions import lit}} {{>>> df = df.withColumn('newCol', lit('newVal'))}} {{>>> df.printSchema()}} {{root}} \{{ |-- _1: string (nullable = true)}} \{{ |-- _2: long (nullable = true)}} \{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from default.withcolTest").printSchema()}} {{root}} \{{ |-- _1: string (nullable = true)}} \{{ |-- _2: long (nullable = true)}} \{{ |-- newCol: string (nullable = true)}} > nullable flag of new column changes when persisting a pyspark dataframe > --- > > Key: SPARK-29432 > URL: https://issues.apache.org/jira/browse/SPARK-29432 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution) > Python 3.7.3 >Reporter: Prasanna Saraswathi Krishnan >Priority: Minor > > When I add a new column to a dataframe with {{withColumn}} function, by > default, the column is added with {{nullable=false}}. > But, when I save the dataframe, the flag changes to {{nullable=true}}. Is > this the expected behavior? why? > > {code} > >>> l = [('Alice', 1)] > >>> df = spark.createDataFrame(l) > >>> df.printSchema() > root > |-- _1: string (nullable = true) > |-- _2: long (nullable = true) > {code} > {code} > >>> from pyspark.sql.functions import lit > >>> df = df.withColumn('newCol', lit('newVal')) > >>> df.printSchema() > root > |-- _1: string (nullable = true) > |-- _2: long (nullable = true) > |-- newCol: string (nullable = false) > >>> spark.sql("select * from default.withcolTest").printSchema() > |-- _1: string (nullable = true) > |-- _2: long (nullable = true) > |-- newCol: string (nullable = true) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe
[ https://issues.apache.org/jira/browse/SPARK-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29432. -- Resolution: Cannot Reproduce Can't find the {{withcolTest}} table. Also, please ask questions on the mailing list; you could get a better answer there. > nullable flag of new column changes when persisting a pyspark dataframe > --- > > Key: SPARK-29432 > URL: https://issues.apache.org/jira/browse/SPARK-29432 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution) > Python 3.7.3 >Reporter: Prasanna Saraswathi Krishnan >Priority: Minor > > When I add a new column to a dataframe with {{withColumn}} function, by > default, the column is added with {{nullable=false}}. > But, when I save the dataframe, the flag changes to {{nullable=true}}. Is > this the expected behavior? why? > > {code} > >>> l = [('Alice', 1)] > >>> df = spark.createDataFrame(l) > >>> df.printSchema() > root > |-- _1: string (nullable = true) > |-- _2: long (nullable = true) > {code} > {code} > >>> from pyspark.sql.functions import lit > >>> df = df.withColumn('newCol', lit('newVal')) > >>> df.printSchema() > root > |-- _1: string (nullable = true) > |-- _2: long (nullable = true) > |-- newCol: string (nullable = false) > >>> spark.sql("select * from default.withcolTest").printSchema() > |-- _1: string (nullable = true) > |-- _2: long (nullable = true) > |-- newCol: string (nullable = true) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29423: - Component/s: (was: SQL) Structured Streaming > leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus > --- > > Key: SPARK-29423 > URL: https://issues.apache.org/jira/browse/SPARK-29423 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: pin_zhang >Priority: Major > > 1. start server with start-thriftserver.sh > 2. JDBC client connect and disconnect to hiveserver2 > for (int i = 0; i < 1; i++) { >Connection conn = > DriverManager.getConnection("jdbc:hive2://localhost:1", "test", ""); >conn.close(); > } > 3. instance of > org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus keep > increasing -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949060#comment-16949060 ] Hyukjin Kwon commented on SPARK-29423: -- 2.3.x is an EOL release line. Can you try a newer version? > leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus > --- > > Key: SPARK-29423 > URL: https://issues.apache.org/jira/browse/SPARK-29423 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: pin_zhang >Priority: Major > > 1. start server with start-thriftserver.sh > 2. JDBC client connect and disconnect to hiveserver2 > for (int i = 0; i < 1; i++) { >Connection conn = > DriverManager.getConnection("jdbc:hive2://localhost:1", "test", ""); >conn.close(); > } > 3. instance of > org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus keep > increasing -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
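A hedged way to confirm the growth described in the report is to take periodic class histograms of the Thrift server JVM while the connect/disconnect loop runs. The sketch below is illustrative only: it assumes the JDK's jmap tool is on the PATH and that you know the process id of the HiveThriftServer2 JVM; neither is given in the report.

{code}
# Diagnostic sketch, not from the report: count live StreamingQueryListenerBus
# instances in the Thrift server JVM via `jmap -histo` while the JDBC
# connect/disconnect loop runs. `pid` is the (assumed) HiveThriftServer2 pid.
import subprocess

def count_listener_bus_instances(pid):
    histo = subprocess.run(["jmap", "-histo", str(pid)],
                           capture_output=True, text=True).stdout
    for line in histo.splitlines():
        if "StreamingQueryListenerBus" in line:
            # jmap histogram columns: num, #instances, #bytes, class name
            return int(line.split()[1])
    return 0

# If this number keeps rising as JDBC sessions are opened and closed,
# the listener bus instances are not being released.
# print(count_listener_bus_instances(12345))
{code}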
[jira] [Commented] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe
[ https://issues.apache.org/jira/browse/SPARK-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949068#comment-16949068 ] Prasanna Saraswathi Krishnan commented on SPARK-29432: -- My bad. When I formatted the code, I deleted the saveAsTable statement by mistake. If you save the dataframe with df.write.saveAsTable('default.withcolTest', mode='overwrite') and then get the schema, you should be able to reproduce the error. Not sure how it can be resolved if you can't reproduce the issue. > nullable flag of new column changes when persisting a pyspark dataframe > --- > > Key: SPARK-29432 > URL: https://issues.apache.org/jira/browse/SPARK-29432 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution) > Python 3.7.3 >Reporter: Prasanna Saraswathi Krishnan >Priority: Minor > > When I add a new column to a dataframe with {{withColumn}} function, by > default, the column is added with {{nullable=false}}. > But, when I save the dataframe, the flag changes to {{nullable=true}}. Is > this the expected behavior? why? > > {code} > >>> l = [('Alice', 1)] > >>> df = spark.createDataFrame(l) > >>> df.printSchema() > root > |-- _1: string (nullable = true) > |-- _2: long (nullable = true) > {code} > {code} > >>> from pyspark.sql.functions import lit > >>> df = df.withColumn('newCol', lit('newVal')) > >>> df.printSchema() > root > |-- _1: string (nullable = true) > |-- _2: long (nullable = true) > |-- newCol: string (nullable = false) > >>> spark.sql("select * from default.withcolTest").printSchema() > |-- _1: string (nullable = true) > |-- _2: long (nullable = true) > |-- newCol: string (nullable = true) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
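Putting the original snippet together with the saveAsTable call from this comment gives the minimal reproduction sketch below. It assumes a Hive-enabled session; the trailing comments reflect the behavior as reported, not an independent verification. A likely explanation is that file-based tables (Parquet/Hive) are written with every column marked nullable for compatibility, which is why the flag flips after persisting.

{code}
# Minimal reproduction sketch, assuming a Hive-enabled SparkSession.
# Table name follows the report; comments show the reported schema flags.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.createDataFrame([('Alice', 1)]).withColumn('newCol', lit('newVal'))
print(df.schema['newCol'].nullable)   # False: lit() yields a non-nullable column in memory

df.write.saveAsTable('default.withcolTest', mode='overwrite')
print(spark.table('default.withcolTest').schema['newCol'].nullable)   # True, per the report
{code}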
[jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running
[ https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24266: - Affects Version/s: 3.0.0 > Spark client terminates while driver is still running > - > > Key: SPARK-24266 > URL: https://issues.apache.org/jira/browse/SPARK-24266 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0, 3.0.0 >Reporter: Chun Chen >Priority: Major > > {code} > Warning: Ignoring non-spark config property: Default=system properties > included when running spark-submit. > 18/05/11 14:50:12 WARN Config: Error reading service account token from: > [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring. > 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: > Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf) > 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. > Mounting Hadoop specific files > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: N/A >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: 2018-05-11T06:50:17Z >container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9 >phase: Pending >status: [ContainerStatus(containerID=null, > image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, > lastState=ContainerState(running=null, terminated=null, waiting=null, > additionalProperties={}), name=spark-kubernetes-driver, ready=false, > restartCount=0, state=ContainerState(running=null, 
terminated=null, > waiting=ContainerStateWaiting(message=null, reason=PodInitializing, > additionalProperties={}), additionalProperties={}), additionalProperties={})] > 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to > finish... > 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: 2018-05-11T06:50:17Z >
[jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running
[ https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24266: - Labels: (was: bulk-closed) > Spark client terminates while driver is still running > - > > Key: SPARK-24266 > URL: https://issues.apache.org/jira/browse/SPARK-24266 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Chun Chen >Priority: Major > > {code} > Warning: Ignoring non-spark config property: Default=system properties > included when running spark-submit. > 18/05/11 14:50:12 WARN Config: Error reading service account token from: > [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring. > 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: > Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf) > 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. > Mounting Hadoop specific files > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: N/A >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: 2018-05-11T06:50:17Z >container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9 >phase: Pending >status: [ContainerStatus(containerID=null, > image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, > lastState=ContainerState(running=null, terminated=null, waiting=null, > additionalProperties={}), name=spark-kubernetes-driver, ready=false, > restartCount=0, state=ContainerState(running=null, 
terminated=null, > waiting=ContainerStateWaiting(message=null, reason=PodInitializing, > additionalProperties={}), additionalProperties={}), additionalProperties={})] > 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to > finish... > 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: 2018-05-11T06:50:17Z >
[jira] [Reopened] (SPARK-24266) Spark client terminates while driver is still running
[ https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-24266: -- > Spark client terminates while driver is still running > - > > Key: SPARK-24266 > URL: https://issues.apache.org/jira/browse/SPARK-24266 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Chun Chen >Priority: Major > Labels: bulk-closed > > {code} > Warning: Ignoring non-spark config property: Default=system properties > included when running spark-submit. > 18/05/11 14:50:12 WARN Config: Error reading service account token from: > [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring. > 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: > Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf) > 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. > Mounting Hadoop specific files > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: N/A >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: 2018-05-11T06:50:17Z >container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9 >phase: Pending >status: [ContainerStatus(containerID=null, > image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, > lastState=ContainerState(running=null, terminated=null, waiting=null, > additionalProperties={}), name=spark-kubernetes-driver, ready=false, > restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, > additionalProperties={}), additionalProperties={}), additionalProperties={})] > 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to > finish... > 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: 2018-05-11T06:50:17Z >
[jira] [Resolved] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29337. -- Resolution: Invalid Please see [https://spark.apache.org/community.html] > How to Cache Table and Pin it in Memory and should not Spill to Disk on > Thrift Server > -- > > Key: SPARK-29337 > URL: https://issues.apache.org/jira/browse/SPARK-29337 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.0 >Reporter: Srini E >Priority: Major > Labels: Question, stack-overflow > Attachments: Cache+Image.png > > > Hi Team, > How can we pin a table in cache so that it does not swap out of memory? > Situation: We are using MicroStrategy BI reporting and have built a semantic layer. > We wanted to cache heavily used tables with Spark SQL CACHE TABLE ; we cached them > for the Spark context (Thrift server). Please see the > attached snapshot of the cached table: it went to disk over time. Initially it was all > in memory; now some of it is in memory and some on disk. That disk may be local disk, which is > relatively more expensive to read from than S3. Queries may take longer and show > inconsistent times from a user-experience perspective. When more queries run > against the cached tables, copies of the cached data are made and those copies do > not stay in memory, causing reports to run longer. So how can we pin the > table so it does not swap to disk? Spark memory management uses dynamic > allocation, so how can we keep those few tables pinned in memory? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
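For context on the question above: SQL/Dataset caching defaults to the MEMORY_AND_DISK storage level, so cached blocks that no longer fit in executor memory spill to local disk rather than being dropped. A hedged sketch of the usual workaround, not taken from this ticket, is to persist with an explicit MEMORY_ONLY level; the table name below is hypothetical.

{code}
# Hedged sketch, not from this ticket: pin a table with an explicit
# MEMORY_ONLY storage level instead of the default MEMORY_AND_DISK used by
# cache()/CACHE TABLE. Blocks that do not fit in memory are then dropped and
# recomputed on access instead of being spilled to local disk.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("some_db.some_hot_table")   # hypothetical table name
df.persist(StorageLevel.MEMORY_ONLY)
df.count()                                   # materialize the cache eagerly
{code}

Newer Spark releases also expose this in SQL as {{CACHE TABLE t OPTIONS ('storageLevel' 'MEMORY_ONLY')}}, which is easier to issue from a Thrift server session; whether it is available depends on the Spark version in use.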
[jira] [Resolved] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql
[ https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29335. -- Resolution: Invalid Please see [https://spark.apache.org/community.html] > Cost Based Optimizer stats are not used while evaluating query plans in Spark > Sql > - > > Key: SPARK-29335 > URL: https://issues.apache.org/jira/browse/SPARK-29335 > Project: Spark > Issue Type: Question > Components: Optimizer >Affects Versions: 2.3.0 > Environment: We tried to execute the same using Spark-sql and Thrify > server using SQLWorkbench but we are not able to use the stats. >Reporter: Srini E >Priority: Major > Labels: Question, stack-overflow > Attachments: explain_plan_cbo_spark.txt > > > We are trying to leverage CBO for getting better plan results for few > critical queries run thru spark-sql or thru thrift server using jdbc driver. > Following settings added to spark-defaults.conf > {code} > spark.sql.cbo.enabled true > spark.experimental.extrastrategies intervaljoin > spark.sql.cbo.joinreorder.enabled true > {code} > > The tables that we are using are not partitioned. > {code} > spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ; > analyze table arrow.t_fperiods_sundar compute statistics for columns eid, > year, ptype, absref, fpid , pid ; > analyze table arrow.t_fdata_sundar compute statistics ; > analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, > absref; > {code} > Analyze completed success fully. > Describe extended , does not show column level stats data and queries are not > leveraging table or column level stats . > we are using Oracle as our Hive Catalog store and not Glue . > *When we are using spark sql and running queries we are not able to see the > stats in use in the explain plan and we are not sure if cbo is put to use.* > *A quick response would be helpful.* > *Explain Plan:* > Following Explain command does not reference to any Statistics usage. > > {code} > spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref > from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = > a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 > and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;* > > 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: > 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: > isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = > 2017),(ptype#4546 = A),(eid#4542 = > 29940),isnull(PID#4527),isnotnull(fpid#4523) > 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... > 3 more fields> > 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: > IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID) > 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: > 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: > isnotnull(absref#4569),(absref#4569 = > Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940) > 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: > string ... 
3 more fields> > 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: > IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940) > == Parsed Logical Plan == > 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref] > +- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && > (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && > ('a12.eid = 29940)) && isnull('a12.PID))) > +- 'Join Inner > :- 'SubqueryAlias a12 > : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar` > +- 'SubqueryAlias a13 > +- 'UnresolvedRelation `arrow`.`t_fdata_sundar` > > == Analyzed Logical Plan == > imnem: string, fvalue: string, ptype: string, absref: string > Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569] > +- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = > cast(fpid#4567 as decimal(38,0 && ((absref#4569 = Y2017) && (year#4545 = > 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = > cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527))) > +- Join Inner > :- SubqueryAlias a12 > : +- SubqueryAlias t_fperiods_sundar > : +- > Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,
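Two things are worth checking against the report above: whether the ANALYZE commands actually recorded statistics in the catalog, and whether the configuration keys were spelled with the exact registered casing (the registered key is spark.sql.cbo.joinReorder.enabled, so the all-lowercase spelling in spark-defaults.conf may not take effect). A hedged sketch, with table and column names taken from the report:

{code}
# Hedged verification sketch, assuming a Hive-enabled session.
# Table/column names follow the report; adjust for your schema.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.cbo.enabled", "true")
         .config("spark.sql.cbo.joinReorder.enabled", "true")   # note the camelCase key
         .enableHiveSupport()
         .getOrCreate())

spark.sql("ANALYZE TABLE arrow.t_fperiods_sundar COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE arrow.t_fperiods_sundar COMPUTE STATISTICS FOR COLUMNS eid, year, ptype")

# Table-level stats (sizeInBytes, rowCount) appear in a 'Statistics' row.
spark.sql("DESCRIBE EXTENDED arrow.t_fperiods_sundar").show(100, truncate=False)

# Column-level stats (min, max, distinct_count, num_nulls) for one column.
spark.sql("DESCRIBE EXTENDED arrow.t_fperiods_sundar eid").show(truncate=False)
{code}

If DESCRIBE EXTENDED shows no Statistics row, the cost-based optimizer has nothing to work with regardless of the cbo flags.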
[jira] [Commented] (SPARK-29222) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence
[ https://issues.apache.org/jira/browse/SPARK-29222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949086#comment-16949086 ] Hyukjin Kwon commented on SPARK-29222: -- Shall we increase the time a bit more if it was verified that it helps the flakiness? > Flaky test: > pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence > --- > > Key: SPARK-29222 > URL: https://issues.apache.org/jira/browse/SPARK-29222 > Project: Spark > Issue Type: Test > Components: MLlib, Tests >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Minor > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111237/testReport/] > {code:java} > Error Message > 7 != 10 > StacktraceTraceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 429, in test_parameter_convergence > self._eventually(condition, catch_assertions=True) > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 74, in _eventually > raise lastValue > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 65, in _eventually > lastValue = condition() > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 425, in condition > self.assertEqual(len(model_weights), len(batches)) > AssertionError: 7 != 10 >{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
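The traceback above comes from a poll-until-deadline helper (_eventually) that keeps re-evaluating the assertion until a timeout expires, so "increasing the time" amounts to passing a larger deadline to that helper. An illustrative sketch of the pattern follows; the names and defaults are assumptions, not the actual pyspark test utility.

{code}
# Illustrative sketch of the poll-until-deadline pattern used by the test
# helper; parameter names and defaults are assumptions, not Spark's code.
import time

def eventually(condition, timeout=60.0, interval=0.5, catch_assertions=True):
    deadline = time.time() + timeout
    last_error = None
    while time.time() < deadline:
        try:
            result = condition()
            if result is True or result is None:   # condition passed
                return
        except AssertionError as exc:
            if not catch_assertions:
                raise
            last_error = exc                       # remember the latest failure
        time.sleep(interval)
    if last_error is not None:
        raise last_error                           # surface the last assertion error
    raise AssertionError("condition not met within %.1f seconds" % timeout)
{code}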