[jira] [Commented] (SPARK-19811) sparksql 2.1 can not prune hive partition

2019-10-10 Thread Vasu Bajaj (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948299#comment-16948299
 ] 

Vasu Bajaj commented on SPARK-19811:


It's a bit late, but the issue arises when the partition name is written in upper 
case. Try using the partition name in lower case; that solved the issue for me.
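
For example, a minimal sketch of the query from the issue below with lower-case 
partition column names (the table layout is assumed from the report; names are 
illustrative only):
{code:scala}
// Minimal sketch only: the table is assumed to exist with lower-case
// partition columns (day_id, prov_id), mirroring the query quoted below.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-pruning-example")
  .enableHiveSupport()
  .getOrCreate()

// Per the comment above, keeping the partition column names in lower case
// avoided the "Expected only partition pruning predicates" error.
spark.sql(
  """SELECT prod_inst_id
    |FROM crm_db.itg_prod_inst
    |WHERE day_id = '20170212' AND prov_id = '842'
    |LIMIT 10""".stripMargin).show()
{code}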

> sparksql 2.1 can not prune hive partition 
> --
>
> Key: SPARK-19811
> URL: https://issues.apache.org/jira/browse/SPARK-19811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: sydt
>Priority: Major
>
> When Spark SQL 2.1 executes a query, it fails with:
> java.lang.RuntimeException: Expected only partition pruning predicates: 
> (isnotnull(DAY_ID#216) && (DAY_ID#216 = 20170212))
> The SQL statement is: select PROD_INST_ID from CRM_DB.ITG_PROD_INST WHERE 
> DAY_ID='20170212' AND PROV_ID='842' limit 10; where DAY_ID and PROV_ID are 
> Hive partition columns.






[jira] [Comment Edited] (SPARK-19811) sparksql 2.1 can not prune hive partition

2019-10-10 Thread Vasu Bajaj (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948299#comment-16948299
 ] 

Vasu Bajaj edited comment on SPARK-19811 at 10/10/19 8:00 AM:
--

The issue arises when the partition name is written in upper case. Try using the 
partition name in lower case; that solved the issue for me.


was (Author: vasubajaj):
It's too late, but the issue arises when you use your partition name in upper 
case. try using partition in lower case, this solved the issue for me.

> sparksql 2.1 can not prune hive partition 
> --
>
> Key: SPARK-19811
> URL: https://issues.apache.org/jira/browse/SPARK-19811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: sydt
>Priority: Major
>
> When Spark SQL 2.1 executes a query, it fails with:
> java.lang.RuntimeException: Expected only partition pruning predicates: 
> (isnotnull(DAY_ID#216) && (DAY_ID#216 = 20170212))
> The SQL statement is: select PROD_INST_ID from CRM_DB.ITG_PROD_INST WHERE 
> DAY_ID='20170212' AND PROV_ID='842' limit 10; where DAY_ID and PROV_ID are 
> Hive partition columns.






[jira] [Created] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus

2019-10-10 Thread pin_zhang (Jira)
pin_zhang created SPARK-29423:
-

 Summary: leak on  
org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
 Key: SPARK-29423
 URL: https://issues.apache.org/jira/browse/SPARK-29423
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: pin_zhang


1. Start the server with start-thriftserver.sh.
2. Have a JDBC client connect to and disconnect from HiveServer2 repeatedly:
 for (int i = 0; i < 1; i++) {
   Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:1", "test", "");
   conn.close();
 }
3. Instances of org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus 
keep increasing.
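
A self-contained sketch of step 2 (the connection URL, credentials, and loop 
count are placeholders; the Hive JDBC driver is assumed to be on the classpath):
{code:scala}
// Sketch only: URL, credentials, and iteration count are placeholders.
import java.sql.DriverManager

object ListenerBusLeakRepro {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    for (_ <- 0 until 10000) {
      // The report is that each connect/disconnect cycle leaves another
      // StreamingQueryListenerBus instance behind on the server.
      val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "test", "")
      conn.close()
    }
  }
}
{code}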






[jira] [Created] (SPARK-29424) Prevent Spark to committing stage of too much Task

2019-10-10 Thread angerszhu (Jira)
angerszhu created SPARK-29424:
-

 Summary: Prevent Spark to committing stage of too much Task
 Key: SPARK-29424
 URL: https://issues.apache.org/jira/browse/SPARK-29424
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0, 3.0.0
Reporter: angerszhu









[jira] [Updated] (SPARK-29424) Prevent Spark to committing stage of too much Task

2019-10-10 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29424:
--
Description: 
Our users often submit bad SQL through our query platform, such as:
# writing a wrong join condition but submitting the SQL anyway
# writing a wrong where condition
# etc.

These cases make the Spark scheduler submit a huge number of tasks. Spark then 
runs very slowly and impacts other users (on the Spark Thrift Server), and can 
even run out of memory because of the many objects generated by such a large 
number of tasks.

So I added a constraint when submitting tasks; a rough sketch follows below. I 
wonder whether the community would accept it.
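
A rough sketch of the kind of guard described above (the config key and helper 
class are hypothetical, not the actual patch):
{code:scala}
// Hypothetical sketch only: "spark.scheduler.maxTasksPerStage" is an invented
// config key used for illustration; it is not an existing Spark setting.
import org.apache.spark.{SparkConf, SparkException}

class TaskSetSizeGuard(conf: SparkConf) {
  private val maxTasksPerStage =
    conf.getInt("spark.scheduler.maxTasksPerStage", Int.MaxValue)

  /** Fail fast instead of submitting an oversized TaskSet. */
  def check(stageId: Int, numTasks: Int): Unit = {
    if (numTasks > maxTasksPerStage) {
      throw new SparkException(
        s"Stage $stageId would submit $numTasks tasks, exceeding the " +
          s"configured limit of $maxTasksPerStage")
    }
  }
}
{code}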

> Prevent Spark to committing stage of too much Task
> --
>
> Key: SPARK-29424
> URL: https://issues.apache.org/jira/browse/SPARK-29424
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Our users often submit bad SQL through our query platform, such as:
> # writing a wrong join condition but submitting the SQL anyway
> # writing a wrong where condition
> # etc.
> These cases make the Spark scheduler submit a huge number of tasks. Spark then 
> runs very slowly and impacts other users (on the Spark Thrift Server), and can 
> even run out of memory because of the many objects generated by such a large 
> number of tasks.
> So I added a constraint when submitting tasks. I wonder whether the community 
> would accept it.






[jira] [Updated] (SPARK-29424) Prevent Spark to committing stage of too much Task

2019-10-10 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29424:
--
Description: 
Our users often submit bad SQL through our query platform, such as:
# writing a wrong join condition but submitting the SQL anyway
# writing a wrong where condition
# etc.

These cases make the Spark scheduler submit a huge number of tasks. Spark then 
runs very slowly and impacts other users (on the Spark Thrift Server), and can 
even run out of memory because of the many objects generated by such a large 
number of tasks.

So I added a constraint when submitting tasks that aborts the stage early when 
the TaskSet size is bigger than a configured limit. I wonder whether the 
community would accept this approach.
cc [~srowen]

  was:
Our user always submit bad SQL in query platform, Such as :
# write wrong join condition but submit that sql
# write wrong where condition
# etc..

 This case will make Spark scheduler to submit a lot of task. It will cause 
spark run very slow and impact other user(spark thrift server)  even run out of 
memory because of too many object generated by a big num of  tasks. 

So i add a constraint when submit tasks.I wonder if the community will accept it


> Prevent Spark to committing stage of too much Task
> --
>
> Key: SPARK-29424
> URL: https://issues.apache.org/jira/browse/SPARK-29424
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Our users often submit bad SQL through our query platform, such as:
> # writing a wrong join condition but submitting the SQL anyway
> # writing a wrong where condition
> # etc.
> These cases make the Spark scheduler submit a huge number of tasks. Spark then 
> runs very slowly and impacts other users (on the Spark Thrift Server), and can 
> even run out of memory because of the many objects generated by such a large 
> number of tasks.
> So I added a constraint when submitting tasks that aborts the stage early when 
> the TaskSet size is bigger than a configured limit. I wonder whether the 
> community would accept this approach.
> cc [~srowen]






[jira] [Updated] (SPARK-29424) Prevent Spark to committing stage of too much Task

2019-10-10 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29424:
--
Description: 
Our users often submit bad SQL through our query platform, such as:
# writing a wrong join condition but submitting the SQL anyway
# writing a wrong where condition
# etc.

These cases make the Spark scheduler submit a huge number of tasks. Spark then 
runs very slowly and impacts other users (on the Spark Thrift Server), and can 
even run out of memory because of the many objects generated by such a large 
number of tasks.

So I added a constraint when submitting tasks that aborts the stage early when 
the TaskSet size is bigger than a configured limit. I wonder whether the 
community would accept this approach.
cc [~srowen] [~dongjoon] [~yumwang]

  was:
Our user always submit bad SQL in query platform, Such as :
# write wrong join condition but submit that sql
# write wrong where condition
# etc..

 This case will make Spark scheduler to submit a lot of task. It will cause 
spark run very slow and impact other user(spark thrift server)  even run out of 
memory because of too many object generated by a big num of  tasks. 

So I add a constraint when submit tasks and abort stage early when TaskSet size 
num is bigger then set limit . I wonder if the community will accept this way.
cc [~srowen]


> Prevent Spark to committing stage of too much Task
> --
>
> Key: SPARK-29424
> URL: https://issues.apache.org/jira/browse/SPARK-29424
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Our users often submit bad SQL through our query platform, such as:
> # writing a wrong join condition but submitting the SQL anyway
> # writing a wrong where condition
> # etc.
> These cases make the Spark scheduler submit a huge number of tasks. Spark then 
> runs very slowly and impacts other users (on the Spark Thrift Server), and can 
> even run out of memory because of the many objects generated by such a large 
> number of tasks.
> So I added a constraint when submitting tasks that aborts the stage early when 
> the TaskSet size is bigger than a configured limit. I wonder whether the 
> community would accept this approach.
> cc [~srowen] [~dongjoon] [~yumwang]






[jira] [Updated] (SPARK-29409) spark drop partition always throws Exception

2019-10-10 Thread ant_nebula (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ant_nebula updated SPARK-29409:
---
Description: 
The table is:
{code:java}
CREATE TABLE `test_spark.test_drop_partition`(
 `platform` string,
 `product` string,
 `cnt` bigint)
PARTITIONED BY (dt string)
stored as orc;{code}
hive 2.1.1:
{code:java}
spark-sql -e "alter table test_spark.test_drop_partition drop if exists 
partition(dt='2019-10-08')"{code}
hive builtin:
{code:java}
spark-sql --conf spark.sql.hive.metastore.version=1.2.1 --conf 
spark.sql.hive.metastore.jars=builtin -e "alter table 
test_spark.test_drop_partition drop if exists partition(dt='2019-10-08')"{code}
Both commands log this exception:
{code:java}
19/10/09 18:21:27 INFO metastore: Opened a connection to metastore, current 
connections: 1
19/10/09 18:21:27 INFO metastore: Connected to metastore.
19/10/09 18:21:27 WARN RetryingMetaStoreClient: MetaStoreClient lost 
connection. Attempting to reconnect.
org.apache.thrift.transport.TTransportException: Cannot write to null 
outputStream
 at 
org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142)
 at 
org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:178)
 at 
org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:106)
 at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:70)
 at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62)
 at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_partitions_ps_with_auth(ThriftHiveMetastore.java:2433)
 at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_ps_with_auth(ThriftHiveMetastore.java:2420)
 at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsWithAuthInfo(HiveMetaStoreClient.java:1199)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154)
 at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2265)
 at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source)
 at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2333)
 at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2359)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:560)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:555)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
 at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply$mcV$sp(HiveClientImpl.scala:555)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.dropPartitions(HiveClientImpl.scala:550)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropPartitions$1.apply$mcV$sp(HiveExternalCatalog.scala:972)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropPartitions$1.apply(HiveExternalCatalog.scala:970)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropPartitions$1.apply(HiveExternalCatalog.scala:970)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
 at 
org.apache.

[jira] [Commented] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

2019-10-10 Thread Jatin Puri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948361#comment-16948361
 ] 

Jatin Puri commented on SPARK-10848:


This issue still exists in `2.4.4`. Should a new issue be created?

> Applied JSON Schema Works for json RDD but not when loading json file
> -
>
> Key: SPARK-10848
> URL: https://issues.apache.org/jira/browse/SPARK-10848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Miklos Christine
>Priority: Minor
>
> Using a defined schema to load a json rdd works as expected. Loading the json 
> records from a file does not apply the supplied schema. Mainly the nullable 
> field isn't applied correctly. Loading from a file uses nullable=true on all 
> fields regardless of applied schema. 
> Code to reproduce:
> {code}
> import  org.apache.spark.sql.types._
> val jsonRdd = sc.parallelize(List(
>   """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
> "ProductCode": "WQT648", "Qty": 5}""",
>   """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
> "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
> "expressDelivery":true}"""))
> val mySchema = StructType(Array(
>   StructField(name="OrderID"   , dataType=LongType, nullable=false),
>   StructField("CustomerID", IntegerType, false),
>   StructField("OrderDate", DateType, false),
>   StructField("ProductCode", StringType, false),
>   StructField("Qty", IntegerType, false),
>   StructField("Discount", FloatType, true),
>   StructField("expressDelivery", BooleanType, true)
> ))
> val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
> val schema1 = myDF.printSchema
> val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
> val schema2 = dfDFfromFile.printSchema
> {code}
> Orders.json
> {code}
> {"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": 
> "WQT648", "Qty": 5}
> {"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", "ProductCode": 
> "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
> {code}
> The behavior should be consistent. 






[jira] [Commented] (SPARK-29288) Spark SQL add jar can't support HTTP path.

2019-10-10 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948364#comment-16948364
 ] 

angerszhu commented on SPARK-29288:
---

[~dongjoon]

Sorry for the late reply; the Hive Jira is 
https://issues.apache.org/jira/browse/HIVE-9664
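
For reference, the statement in question looks like this in Spark SQL (a sketch; 
the jar URLs are placeholders):
{code:scala}
// Sketch only: jar URLs are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("add-jar-example")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("ADD JAR hdfs:///tmp/example-udf.jar")              // local/HDFS paths work today
spark.sql("ADD JAR http://repo.example.com/example-udf.jar")  // the HTTP case this issue asks about
{code}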

 

> Spark SQL add jar can't support HTTP path. 
> ---
>
> Key: SPARK-29288
> URL: https://issues.apache.org/jira/browse/SPARK-29288
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> In Spark SQL, `ADD JAR` can't support URLs with http or livy schemes; do we 
> need to support them?
> cc [~sro...@scient.com] 
> [~hyukjin.kwon][~dongjoon][~jerryshao][~juliuszsompolski]
> Hive 2.3 supports it; do we need to as well?
> I can work on this.






[jira] [Commented] (SPARK-29409) spark drop partition always throws Exception

2019-10-10 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948368#comment-16948368
 ] 

angerszhu commented on SPARK-29409:
---

Thanks, I will check this problem.

> spark drop partition always throws Exception
> 
>
> Key: SPARK-29409
> URL: https://issues.apache.org/jira/browse/SPARK-29409
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: spark 2.4.0 on yarn 2.7.3
> spark-sql client mode
> run hive version: 2.1.1
> hive builtin version 1.2.1
>Reporter: ant_nebula
>Priority: Major
>
> The table is:
> {code:java}
> CREATE TABLE `test_spark.test_drop_partition`(
>  `platform` string,
>  `product` string,
>  `cnt` bigint)
> PARTITIONED BY (dt string)
> stored as orc;{code}
> hive 2.1.1:
> {code:java}
> spark-sql -e "alter table test_spark.test_drop_partition drop if exists 
> partition(dt='2019-10-08')"{code}
> hive builtin:
> {code:java}
> spark-sql --conf spark.sql.hive.metastore.version=1.2.1 --conf 
> spark.sql.hive.metastore.jars=builtin -e "alter table 
> test_spark.test_drop_partition drop if exists 
> partition(dt='2019-10-08')"{code}
> Both commands log this exception:
> {code:java}
> 19/10/09 18:21:27 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 19/10/09 18:21:27 INFO metastore: Connected to metastore.
> 19/10/09 18:21:27 WARN RetryingMetaStoreClient: MetaStoreClient lost 
> connection. Attempting to reconnect.
> org.apache.thrift.transport.TTransportException: Cannot write to null 
> outputStream
>  at 
> org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:178)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:106)
>  at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:70)
>  at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_partitions_ps_with_auth(ThriftHiveMetastore.java:2433)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_ps_with_auth(ThriftHiveMetastore.java:2420)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsWithAuthInfo(HiveMetaStoreClient.java:1199)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154)
>  at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2265)
>  at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source)
>  at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2333)
>  at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2359)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:560)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:555)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply$mcV$sp(HiveClientImpl.scala:555)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClien

[jira] [Comment Edited] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.

2019-10-10 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948236#comment-16948236
 ] 

Yuming Wang edited comment on SPARK-29354 at 10/10/19 10:02 AM:


Hi [~Elixir Kook] Do you mean create the Spark binary distribution with 
{{-Phadoop-provided}}?
{code:sh}
./dev/make-distribution.sh --tgz -Phadoop-provided
{code}


was (Author: q79969786):
Hi [~Elixir Kook] Do you mean create the Spark binary distribution with 
{{-Phadoop-provided}}?
{code:sh}
./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver -Pyarn 
-Phadoop-provided
{code}

> Spark has direct dependency on jline, but  binaries for 'without hadoop' 
> don't have a jline jar file.
> -
>
> Key: SPARK-29354
> URL: https://issues.apache.org/jira/browse/SPARK-29354
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.4, 2.4.4
> Environment: From spark 2.3.x, spark 2.4.x
>Reporter: Sungpeo Kook
>Priority: Minor
>
> Spark has direct dependency on jline, included in the root pom.xml
> but binaries for 'without hadoop' don't have a jline jar file.
>  
> spark 2.2.x has the jline jar.






[jira] [Created] (SPARK-29425) Alter database statement erases hive database's ownership

2019-10-10 Thread Kent Yao (Jira)
Kent Yao created SPARK-29425:


 Summary: Alter database statement erases hive database's ownership
 Key: SPARK-29425
 URL: https://issues.apache.org/jira/browse/SPARK-29425
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 2.3.4
Reporter: Kent Yao


ALTER DATABASE kyuubi SET DBPROPERTIES ('in'='out') will erase a Hive 
database's owner.
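
A sketch of how the behavior can be observed (the database name and property are 
taken from the report; everything else is illustrative):
{code:scala}
// Sketch only: database name and property come from the report above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("alter-database-ownership")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS kyuubi")
// After this statement, the owner recorded for the database in the Hive
// metastore is reported to be erased (it can be checked from Hive with
// DESCRIBE DATABASE EXTENDED kyuubi).
spark.sql("ALTER DATABASE kyuubi SET DBPROPERTIES ('in'='out')")
{code}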






[jira] [Updated] (SPARK-29425) Alter database statement erases hive database's ownership

2019-10-10 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-29425:
-
Description: Commands like `ALTER DATABASE kyuubi SET DBPROPERTIES 
('in'='out')` will erase a hive database's owner  (was: ALTER DATABASE kyuubi 
SET DBPROPERTIES ('in'='out') will erase a hive database's owner)

> Alter database statement erases hive database's ownership
> -
>
> Key: SPARK-29425
> URL: https://issues.apache.org/jira/browse/SPARK-29425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Kent Yao
>Priority: Major
>
> Commands like `ALTER DATABASE kyuubi SET DBPROPERTIES ('in'='out')` will 
> erase a hive database's owner






[jira] [Created] (SPARK-29426) Watermark does not take effect

2019-10-10 Thread jingshanglu (Jira)
jingshanglu created SPARK-29426:
---

 Summary: Watermark does not take effect
 Key: SPARK-29426
 URL: https://issues.apache.org/jira/browse/SPARK-29426
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.3
 Environment: My Kafka messages look like this:
{code:java}
// code placeholder

[kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 
172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-04 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{code}
The output looks like this:
{code:java}
// code placeholder
Batch: 5
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|2|
|[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|2|
++--++-+-+---
Batch: 6
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|3|
|[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|3|
++--++-+-+---
Batch: 7
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-04 12:20...|select * from user|192.168.54.6|172.0.0.1|1|
|[2019-03-04 12:15...|select * from user|192.168.54.6|172.0.0.1|1|
++--++-+-+
{code}
The watermark is behind the event time (2019-03-04 12:23:22), but this event

{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}

is still aggregated.
Reporter: jingshanglu


I use withWatermark and window to express windowed aggregations, but the 
watermark does not take effect.

My code:
{code:java}
Dataset<Row> clientSqlIpCount = mes.withWatermark("timestamp", "1 minute")
    .groupBy(
        functions.window(mes.col("timestamp"), "10 minutes", "5 minutes"),
        mes.col("sql"), mes.col("client"), mes.col("ip"))
    .count();

StreamingQuery query = clientSqlIpCount
    .writeStream()
    .outputMode("Update")
    .format("console")
    .start();

spark.streams().awaitAnyTermination();
{code}






[jira] [Updated] (SPARK-29426) Watermark does not take effect

2019-10-10 Thread jingshanglu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingshanglu updated SPARK-29426:

Environment: (was: my kafka mes like this:
{code:java}
// code placeholder

[kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 
172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-04 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{code}
output like this:
{code:java}
// code placeholder
Batch: 5
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|2|
|[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|2|
++--++-+-+---
Batch: 6
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|3|
|[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|3|
++--++-+-+---
Batch: 7
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-04 12:20...|select * from user|192.168.54.6|172.0.0.1|1|
|[2019-03-04 12:15...|select * from user|192.168.54.6|172.0.0.1|1|
++--++-+-+
{code}
the watermark behind the event time(2019-03-04 12:23:22), but this event

{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}

still be Aggregated)

my kafka mes like this:
{code:java}
// code placeholder

[kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 
172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-04 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{code}
output like this:
{code:java}
// code placeholder
Batch: 5
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|2|
|[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|2|
++--++-+-+---
Batch: 6
---
++--++-+-+
|  window|

[jira] [Commented] (SPARK-29424) Prevent Spark to committing stage of too much Task

2019-10-10 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948393#comment-16948393
 ] 

Sean R. Owen commented on SPARK-29424:
--

I doubt we want to throw yet another limit/config at this. It's hard to guess 
at or impose a _task_ limit in order to limit cluster usage. This is often what 
resource constraints on the resource manager are for, and it shouldn't also be 
duplicated in Spark.

> Prevent Spark to committing stage of too much Task
> --
>
> Key: SPARK-29424
> URL: https://issues.apache.org/jira/browse/SPARK-29424
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Our users often submit bad SQL through our query platform, such as:
> # writing a wrong join condition but submitting the SQL anyway
> # writing a wrong where condition
> # etc.
> These cases make the Spark scheduler submit a huge number of tasks. Spark then 
> runs very slowly and impacts other users (on the Spark Thrift Server), and can 
> even run out of memory because of the many objects generated by such a large 
> number of tasks.
> So I added a constraint when submitting tasks that aborts the stage early when 
> the TaskSet size is bigger than a configured limit. I wonder whether the 
> community would accept this approach.
> cc [~srowen] [~dongjoon] [~yumwang]






[jira] [Created] (SPARK-29427) Create KeyValueGroupedDataset from RelationalGroupedDataset

2019-10-10 Thread Alexander Hagerf (Jira)
Alexander Hagerf created SPARK-29427:


 Summary: Create KeyValueGroupedDataset from 
RelationalGroupedDataset
 Key: SPARK-29427
 URL: https://issues.apache.org/jira/browse/SPARK-29427
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Affects Versions: 2.4.4
Reporter: Alexander Hagerf


The scenario I'm having is that I'm reading two huge bucketed tables and since 
a regular join is not performant enough for these cases I'm using groupByKey to 
generate two KeyValueGroupedDatasets and cogroup them to implement the logic I 
need.

The issue with this approach is that I'm only grouping by the column that the 
tables are bucketed by but since I'm using groupByKey the bucketing is 
completely ignored and I still get a full shuffle. 
What I'm looking for is some functionality to tell Catalyst to group by a 
column in a relational way but then give the user a possibility to utilize the 
functions of the KeyValueGroupedDataset e.g. cogroup (which is not available 
for dataframes)

 

In current Spark (2.4.4) I see no way to do this efficiently. I think this is a 
valid use case which, if solved, would have huge performance benefits.
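
A sketch of the pattern described above (table names, the key column, and the 
merge logic are illustrative only):
{code:scala}
// Sketch only: table names, key column, and merge logic are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cogroup-bucketed-tables")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val left  = spark.table("db.bucketed_left")    // both tables bucketed by "id"
val right = spark.table("db.bucketed_right")

// groupByKey ignores the bucketing, so this still triggers a full shuffle,
// which is the cost this issue would like to avoid.
val leftGrouped  = left.groupByKey(_.getAs[Long]("id"))
val rightGrouped = right.groupByKey(_.getAs[Long]("id"))

val merged = leftGrouped.cogroup(rightGrouped) { (id, leftRows, rightRows) =>
  // custom merge logic goes here; emit a per-key count as a placeholder
  Iterator((id, leftRows.size + rightRows.size))
}
merged.show()
{code}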






[jira] [Commented] (SPARK-29424) Prevent Spark to committing stage of too much Task

2019-10-10 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948399#comment-16948399
 ] 

angerszhu commented on SPARK-29424:
---

[~srowen]

Since resource limits are already in place, this kind of bad behavior just makes 
the program run very slowly, and the user doesn't know why.

Making it abort early is better for helping the user recognize where the problem 
is, especially on the Spark Thrift Server.

> Prevent Spark to committing stage of too much Task
> --
>
> Key: SPARK-29424
> URL: https://issues.apache.org/jira/browse/SPARK-29424
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Our users often submit bad SQL through our query platform, such as:
> # writing a wrong join condition but submitting the SQL anyway
> # writing a wrong where condition
> # etc.
> These cases make the Spark scheduler submit a huge number of tasks. Spark then 
> runs very slowly and impacts other users (on the Spark Thrift Server), and can 
> even run out of memory because of the many objects generated by such a large 
> number of tasks.
> So I added a constraint when submitting tasks that aborts the stage early when 
> the TaskSet size is bigger than a configured limit. I wonder whether the 
> community would accept this approach.
> cc [~srowen] [~dongjoon] [~yumwang]






[jira] [Updated] (SPARK-29427) Create KeyValueGroupedDataset from RelationalGroupedDataset

2019-10-10 Thread Alexander Hagerf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Hagerf updated SPARK-29427:
-
Description: 
The scenario I'm having is that I'm reading two huge bucketed tables and since 
a regular join is not performant enough for my cases, I'm using groupByKey to 
generate two KeyValueGroupedDatasets and cogroup them to implement the merging 
logic I need.

The issue with this approach is that I'm only grouping by the column that the 
tables are bucketed by but since I'm using groupByKey the bucketing is 
completely ignored and I still get a full shuffle. 
 What I'm looking for is some functionality to tell Catalyst to group by a 
column in a relational way but then give the user a possibility to utilize the 
functions of the KeyValueGroupedDataset e.g. cogroup (which is not available 
for dataframes)

 

At current spark (2.4.4) I see no way to do this efficiently. I think this is a 
valid use case which if solved would have huge performance benefits.

  was:
The scenario I'm having is that I'm reading two huge bucketed tables and since 
a regular join is not performant enough for these cases I'm using groupByKey to 
generate two KeyValueGroupedDatasets and cogroup them to implement the logic I 
need.

The issue with this approach is that I'm only grouping by the column that the 
tables are bucketed by but since I'm using groupByKey the bucketing is 
completely ignored and I still get a full shuffle. 
What I'm looking for is some functionality to tell Catalyst to group by a 
column in a relational way but then give the user a possibility to utilize the 
functions of the KeyValueGroupedDataset e.g. cogroup (which is not available 
for dataframes)

 

At current spark (2.4.4) I see no way to do this efficiently. I think this is a 
valid use case which if solved would have huge performance benefits.


> Create KeyValueGroupedDataset from RelationalGroupedDataset
> ---
>
> Key: SPARK-29427
> URL: https://issues.apache.org/jira/browse/SPARK-29427
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 2.4.4
>Reporter: Alexander Hagerf
>Priority: Major
>
> The scenario I'm having is that I'm reading two huge bucketed tables and 
> since a regular join is not performant enough for my cases, I'm using 
> groupByKey to generate two KeyValueGroupedDatasets and cogroup them to 
> implement the merging logic I need.
> The issue with this approach is that I'm only grouping by the column that the 
> tables are bucketed by but since I'm using groupByKey the bucketing is 
> completely ignored and I still get a full shuffle. 
>  What I'm looking for is some functionality to tell Catalyst to group by a 
> column in a relational way but then give the user a possibility to utilize 
> the functions of the KeyValueGroupedDataset e.g. cogroup (which is not 
> available for dataframes)
>  
> At current spark (2.4.4) I see no way to do this efficiently. I think this is 
> a valid use case which if solved would have huge performance benefits.






[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.

2019-10-10 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948410#comment-16948410
 ] 

angerszhu commented on SPARK-29354:
---

[~Elixir Kook] [~yumwang]

Jline is brought in by the hive-beeline module; you can find the dependency in 
hive-beeline's pom file.

If you build like below:
{code:sh}
./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver -Pyarn 
-Phadoop-provided -Phive-provided
{code}
you won't get a jline jar in the dist/jar/ folder.

> Spark has direct dependency on jline, but  binaries for 'without hadoop' 
> don't have a jline jar file.
> -
>
> Key: SPARK-29354
> URL: https://issues.apache.org/jira/browse/SPARK-29354
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.4, 2.4.4
> Environment: From spark 2.3.x, spark 2.4.x
>Reporter: Sungpeo Kook
>Priority: Minor
>
> Spark has direct dependency on jline, included in the root pom.xml
> but binaries for 'without hadoop' don't have a jline jar file.
>  
> spark 2.2.x has the jline jar.






[jira] [Created] (SPARK-29428) Can't persist/set None-valued param

2019-10-10 Thread Borys Biletskyy (Jira)
Borys Biletskyy created SPARK-29428:
---

 Summary: Can't persist/set None-valued param 
 Key: SPARK-29428
 URL: https://issues.apache.org/jira/browse/SPARK-29428
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.3.2
Reporter: Borys Biletskyy


{code:java}
import pytest
from pyspark import keyword_only
from pyspark.ml import Model
from pyspark.sql import DataFrame
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.param.shared import HasInputCol
from pyspark.sql.functions import *


class NoneParamTester(Model,
  HasInputCol,
  DefaultParamsReadable,
  DefaultParamsWritable
  ):
@keyword_only
def __init__(self, inputCol: str = None):
super(NoneParamTester, self).__init__()
kwargs = self._input_kwargs
self.setParams(**kwargs)

@keyword_only
def setParams(self, inputCol: str = None):
kwargs = self._input_kwargs
self._set(**kwargs)
return self

def _transform(self, data: DataFrame) -> DataFrame:
return data


class TestHasInputColParam(object):
def test_persist_none(self, spark, temp_dir):
path = temp_dir + '/test_model'
model = NoneParamTester(inputCol=None)
assert model.isDefined(model.inputCol)
assert model.isSet(model.inputCol)
assert model.getInputCol() is None
model.write().overwrite().save(path)
NoneParamTester.load(path)  # TypeError: Could not convert  to string type

def test_set_none(self, spark):
model = NoneParamTester(inputCol=None)
assert model.isDefined(model.inputCol)
assert model.isSet(model.inputCol)
assert model.getInputCol() is None
model.set(model.inputCol, None)  # TypeError: Could not convert  to string type
{code}






[jira] [Updated] (SPARK-29428) Can't persist/set None-valued param

2019-10-10 Thread Borys Biletskyy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Borys Biletskyy updated SPARK-29428:

Description: 
{code:java}
import pytest
from pyspark import keyword_only
from pyspark.ml import Model
from pyspark.sql import DataFrame
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.param.shared import HasInputCol
from pyspark.sql.functions import *


class NoneParamTester(Model,
  HasInputCol,
  DefaultParamsReadable,
  DefaultParamsWritable
  ):
@keyword_only
def __init__(self, inputCol: str = None):
super(NoneParamTester, self).__init__()
kwargs = self._input_kwargs
self.setParams(**kwargs)

@keyword_only
def setParams(self, inputCol: str = None):
kwargs = self._input_kwargs
self._set(**kwargs)
return self

def _transform(self, data: DataFrame) -> DataFrame:
return data


class TestNoneParam(object):
def test_persist_none(self, spark, temp_dir):
path = temp_dir + '/test_model'
model = NoneParamTester(inputCol=None)
assert model.isDefined(model.inputCol)
assert model.isSet(model.inputCol)
assert model.getInputCol() is None
model.write().overwrite().save(path)
NoneParamTester.load(path)  # TypeError: Could not convert  to string type

def test_set_none(self, spark):
model = NoneParamTester(inputCol=None)
assert model.isDefined(model.inputCol)
assert model.isSet(model.inputCol)
assert model.getInputCol() is None
model.set(model.inputCol, None)  # TypeError: Could not convert  to string type
{code}

  was:
{code:java}
import pytest
from pyspark import keyword_only
from pyspark.ml import Model
from pyspark.sql import DataFrame
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.param.shared import HasInputCol
from pyspark.sql.functions import *


class NoneParamTester(Model,
  HasInputCol,
  DefaultParamsReadable,
  DefaultParamsWritable
  ):
@keyword_only
def __init__(self, inputCol: str = None):
super(NoneParamTester, self).__init__()
kwargs = self._input_kwargs
self.setParams(**kwargs)

@keyword_only
def setParams(self, inputCol: str = None):
kwargs = self._input_kwargs
self._set(**kwargs)
return self

def _transform(self, data: DataFrame) -> DataFrame:
return data


class TestHasInputColParam(object):
def test_persist_none(self, spark, temp_dir):
path = temp_dir + '/test_model'
model = NoneParamTester(inputCol=None)
assert model.isDefined(model.inputCol)
assert model.isSet(model.inputCol)
assert model.getInputCol() is None
model.write().overwrite().save(path)
NoneParamTester.load(path)  # TypeError: Could not convert  to string type

def test_set_none(self, spark):
model = NoneParamTester(inputCol=None)
assert model.isDefined(model.inputCol)
assert model.isSet(model.inputCol)
assert model.getInputCol() is None
model.set(model.inputCol, None)  # TypeError: Could not convert  to string type
{code}


> Can't persist/set None-valued param 
> 
>
> Key: SPARK-29428
> URL: https://issues.apache.org/jira/browse/SPARK-29428
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.2
>Reporter: Borys Biletskyy
>Priority: Major
>
> {code:java}
> import pytest
> from pyspark import keyword_only
> from pyspark.ml import Model
> from pyspark.sql import DataFrame
> from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
> from pyspark.ml.param.shared import HasInputCol
> from pyspark.sql.functions import *
> class NoneParamTester(Model,
>   HasInputCol,
>   DefaultParamsReadable,
>   DefaultParamsWritable
>   ):
> @keyword_only
> def __init__(self, inputCol: str = None):
> super(NoneParamTester, self).__init__()
> kwargs = self._input_kwargs
> self.setParams(**kwargs)
> @keyword_only
> def setParams(self, inputCol: str = None):
> kwargs = self._input_kwargs
> self._set(**kwargs)
> return self
> def _transform(self, data: DataFrame) -> DataFrame:
> return data
> class TestNoneParam(object):
> def test_persist_none(self, spark, temp_dir):
> path = temp_dir + '/test_model'
> model = NoneParamTester(inputCol=None)
> assert model.isDefined(model.inputCol)
> assert model.isSet(m

[jira] [Commented] (SPARK-13346) Using DataFrames iteratively leads to slow query planning

2019-10-10 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948443#comment-16948443
 ] 

Izek Greenfield commented on SPARK-13346:
-

[~davies] why was this issue closed?

> Using DataFrames iteratively leads to slow query planning
> -
>
> Key: SPARK-13346
> URL: https://issues.apache.org/jira/browse/SPARK-13346
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Priority: Major
>  Labels: bulk-closed
>
> I have an iterative algorithm based on DataFrames, and the query plan grows 
> very quickly with each iteration.  Caching the current DataFrame at the end 
> of an iteration does not fix the problem.  However, converting the DataFrame 
> to an RDD and back at the end of each iteration does fix the problem.
> Printing the query plans shows that the plan explodes quickly (10 lines, to 
> several hundred lines, to several thousand lines, ...) with successive 
> iterations.
> The desired behavior is for the analyzer to recognize that a big chunk of the 
> query plan does not need to be computed since it is already cached.  The 
> computation on each iteration should be the same.
> If useful, I can push (complex) code to reproduce the issue.  But it should 
> be simple to see if you create an iterative algorithm which produces a new 
> DataFrame from an old one on each iteration.
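
The workaround mentioned in the quoted description, converting to an RDD and 
back each iteration so the accumulated plan is truncated, looks roughly like 
this (a sketch; the per-iteration step is a placeholder):
{code:scala}
// Sketch only: the per-iteration transformation is a placeholder.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("iterative-df").getOrCreate()

var df: DataFrame = spark.range(0, 1000).toDF("value")
for (_ <- 1 to 10) {
  val updated = df.withColumn("value", col("value") + 1)   // placeholder iteration step
  // Round-tripping through an RDD drops the logical plan built so far,
  // keeping planning time per iteration roughly constant.
  df = spark.createDataFrame(updated.rdd, updated.schema)
}
df.show()
{code}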






[jira] [Comment Edited] (SPARK-29426) Watermark does not take effect

2019-10-10 Thread jingshanglu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948379#comment-16948379
 ] 

jingshanglu edited comment on SPARK-29426 at 10/10/19 11:45 AM:


My Kafka messages look like this:
{code:java}
// code placeholder

[kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 
172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-04 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{code}
The output looks like this:
{code:java}
// code placeholder
Batch: 0
---
+--+---+--+---+-+
|window|sql|client| ip|count|
+--+---+--+---+-+
+--+---+--+---+-+---
Batch: 1
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|1|
|[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|1|
++--++-+-+---
Batch: 2
---
+--+---+--+---+-+
|window|sql|client| ip|count|
+--+---+--+---+-+
+--+---+--+---+-+---
Batch: 3
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-05 12:25...|select * from user|192.168.54.6|172.0.0.1|1|
|[2019-03-05 12:30...|select * from user|192.168.54.6|172.0.0.1|1|
++--++-+-+---
Batch: 4
---
+--+---+--+---+-+
|window|sql|client| ip|count|
+--+---+--+---+-+
+--+---+--+---+-+---
Batch: 5
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|2|
|[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|2|
++--++-+-+
{code}
The watermark is behind the event time (2019-03-04 12:23:22), but this event

{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}

is still aggregated.


was (Author: jingshang):
my kafka mes like this:
{code:java}
// code placeholder

[kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 
172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-04 
12:22:29","ip":"172.0.0.1","clie

[jira] [Comment Edited] (SPARK-29426) Watermark does not take effect

2019-10-10 Thread jingshanglu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948379#comment-16948379
 ] 

jingshanglu edited comment on SPARK-29426 at 10/10/19 11:46 AM:


My Kafka messages look like this:
{code:java}
// code placeholder

[kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 
172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}

{code}
the output looks like this:
{code:java}
// code placeholder
Batch: 0
---
+--+---+--+---+-+
|window|sql|client| ip|count|
+--+---+--+---+-+
+--+---+--+---+-+---
Batch: 1
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|1|
|[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|1|
++--++-+-+---
Batch: 2
---
+--+---+--+---+-+
|window|sql|client| ip|count|
+--+---+--+---+-+
+--+---+--+---+-+---
Batch: 3
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-05 12:25...|select * from user|192.168.54.6|172.0.0.1|1|
|[2019-03-05 12:30...|select * from user|192.168.54.6|172.0.0.1|1|
++--++-+-+---
Batch: 4
---
+--+---+--+---+-+
|window|sql|client| ip|count|
+--+---+--+---+-+
+--+---+--+---+-+---
Batch: 5
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-05 12:20...|select * from user|192.168.54.6|172.0.0.1|2|
|[2019-03-05 12:15...|select * from user|192.168.54.6|172.0.0.1|2|
++--++-+-+
{code}
the watermark is behind the event time (2019-03-04 12:23:22), but this event

{"sql":"select * from user","timestamp":"2019-03-05 12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}

is still aggregated


was (Author: jingshang):
my kafka mes like this:
{code:java}
// code placeholder

[kafka@HC-25-28-36 ~]$ kafka-console-producer.sh --broker-list 
172.25.28.38:9092,172.25.28.37:9092,172.25.28.36:9092 --topic test0
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:32:22","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-05 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{"sql":"select * from user","timestamp":"2019-03-04 
12:22:29","ip":"172.0.0.1","client":"192.168.54.6","port":3306}
{code}
output like this:
{code:java}
// code placeholder
Batch: 0
---
+--+---+--+---+-+
|window|sql|client| ip|count|
+--+---+--+---+-+
+--+---+--+---+-+---
Batch: 1
---
++--++-+-+
|  window|   sql|  client|   ip|count|
++--++-+-+
|[2019-03-05 

[jira] [Comment Edited] (SPARK-29421) Add an opportunity to change the file format of command CREATE TABLE LIKE

2019-10-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948506#comment-16948506
 ] 

Lantao Jin edited comment on SPARK-29421 at 10/10/19 12:03 PM:
---

[~cloud_fan] Yes, Hive supports a similar command with {{STORED AS}}:
{code}
hive> CREATE TABLE tbl(a int) STORED AS TEXTFILE;
OK
Time taken: 0.726 seconds
hive> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
OK
Time taken: 0.294 seconds
{code}


was (Author: cltlfcjin):
[~cloud_fan] Yes, Hive support the similar command with {{STORED AS}}:
{code}
hive> CREATE TABLE tbl(a int) STORED AS TEXTFILE;
OK
Time taken: 0.726 seconds
hive> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
OK
Time taken: 0.294 seconds
{code}

> Add an opportunity to change the file format of command CREATE TABLE LIKE
> -
>
> Key: SPARK-29421
> URL: https://issues.apache.org/jira/browse/SPARK-29421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Use CREATE TABLE tb1 LIKE tb2 command to create an empty table tb1 based on 
> the definition of table tb2. The most user case is to create tb1 with the 
> same schema of tb2. But an inconvenient case here is this command also copies 
> the FileFormat from tb2, it cannot change the input/output format and serde. 
> Add the ability of changing file format is useful for some scenarios like 
> upgrading a table from a low performance file format to a high performance 
> one (parquet, orc).
> Here gives two options to enhance it.
> Option1: Add a configuration {{spark.sql.createTableLike.fileformat}}, the 
> value by default is "none" which keeps the behaviour same with current -- 
> copying the file format from source table. After run command SET 
> spark.sql.createTableLike.fileformat=parquet or any other valid file format 
> defined in {{HiveSerDe}}, {{CREATE TABLE ... LIKE}} will use the new file 
> format type.
> Option2: Add syntax {{USING fileformat}} after {{CREATE TABLE ... LIKE}}. For 
> example,
> {code}
> CREATE TABLE tb1 LIKE tb2 USING parquet;
> {code}
> If USING keyword is ignored, it also keeps the behaviour same with current -- 
> copying the file format from source table.
> Both of them can keep its behaviour same with current.
> We use option1 with parquet file format as an enhancement in our production 
> thriftserver because we need change many existing SQL scripts without any 
> modification. But for community, Option2 could be treated as a new feature 
> since it needs user to write additional USING part.
> cc [~dongjoon] [~hyukjin.kwon] [~joshrosen] [~cloud_fan] [~yumwang]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29421) Add an opportunity to change the file format of command CREATE TABLE LIKE

2019-10-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948506#comment-16948506
 ] 

Lantao Jin commented on SPARK-29421:


[~cloud_fan] Yes, Hive supports a similar command with {{STORED AS}}:
{code}
hive> CREATE TABLE tbl(a int) STORED AS TEXTFILE;
OK
Time taken: 0.726 seconds
hive> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
OK
Time taken: 0.294 seconds
{code}

> Add an opportunity to change the file format of command CREATE TABLE LIKE
> -
>
> Key: SPARK-29421
> URL: https://issues.apache.org/jira/browse/SPARK-29421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Use CREATE TABLE tb1 LIKE tb2 command to create an empty table tb1 based on 
> the definition of table tb2. The most user case is to create tb1 with the 
> same schema of tb2. But an inconvenient case here is this command also copies 
> the FileFormat from tb2, it cannot change the input/output format and serde. 
> Add the ability of changing file format is useful for some scenarios like 
> upgrading a table from a low performance file format to a high performance 
> one (parquet, orc).
> Here gives two options to enhance it.
> Option1: Add a configuration {{spark.sql.createTableLike.fileformat}}, the 
> value by default is "none" which keeps the behaviour same with current -- 
> copying the file format from source table. After run command SET 
> spark.sql.createTableLike.fileformat=parquet or any other valid file format 
> defined in {{HiveSerDe}}, {{CREATE TABLE ... LIKE}} will use the new file 
> format type.
> Option2: Add syntax {{USING fileformat}} after {{CREATE TABLE ... LIKE}}. For 
> example,
> {code}
> CREATE TABLE tb1 LIKE tb2 USING parquet;
> {code}
> If USING keyword is ignored, it also keeps the behaviour same with current -- 
> copying the file format from source table.
> Both of them can keep its behaviour same with current.
> We use option1 with parquet file format as an enhancement in our production 
> thriftserver because we need change many existing SQL scripts without any 
> modification. But for community, Option2 could be treated as a new feature 
> since it needs user to write additional USING part.
> cc [~dongjoon] [~hyukjin.kwon] [~joshrosen] [~cloud_fan] [~yumwang]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29421) Add an opportunity to change the file format of command CREATE TABLE LIKE

2019-10-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948506#comment-16948506
 ] 

Lantao Jin edited comment on SPARK-29421 at 10/10/19 12:07 PM:
---

[~cloud_fan] Yes, Hive supports a similar command with {{STORED AS}}:
{code}
hive> CREATE TABLE tbl(a int) STORED AS TEXTFILE;
OK
Time taken: 0.726 seconds
hive> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
OK
Time taken: 0.294 seconds
{code}

Here I use {{USING}} to be consistent with the Spark command {{CREATE TABLE ... USING}}.
So which option do you think would be better?


was (Author: cltlfcjin):
[~cloud_fan] Yes, Hive support a similar command with {{STORED AS}}:
{code}
hive> CREATE TABLE tbl(a int) STORED AS TEXTFILE;
OK
Time taken: 0.726 seconds
hive> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
OK
Time taken: 0.294 seconds
{code}

> Add an opportunity to change the file format of command CREATE TABLE LIKE
> -
>
> Key: SPARK-29421
> URL: https://issues.apache.org/jira/browse/SPARK-29421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Use CREATE TABLE tb1 LIKE tb2 command to create an empty table tb1 based on 
> the definition of table tb2. The most user case is to create tb1 with the 
> same schema of tb2. But an inconvenient case here is this command also copies 
> the FileFormat from tb2, it cannot change the input/output format and serde. 
> Add the ability of changing file format is useful for some scenarios like 
> upgrading a table from a low performance file format to a high performance 
> one (parquet, orc).
> Here gives two options to enhance it.
> Option1: Add a configuration {{spark.sql.createTableLike.fileformat}}, the 
> value by default is "none" which keeps the behaviour same with current -- 
> copying the file format from source table. After run command SET 
> spark.sql.createTableLike.fileformat=parquet or any other valid file format 
> defined in {{HiveSerDe}}, {{CREATE TABLE ... LIKE}} will use the new file 
> format type.
> Option2: Add syntax {{USING fileformat}} after {{CREATE TABLE ... LIKE}}. For 
> example,
> {code}
> CREATE TABLE tb1 LIKE tb2 USING parquet;
> {code}
> If USING keyword is ignored, it also keeps the behaviour same with current -- 
> copying the file format from source table.
> Both of them can keep its behaviour same with current.
> We use option1 with parquet file format as an enhancement in our production 
> thriftserver because we need change many existing SQL scripts without any 
> modification. But for community, Option2 could be treated as a new feature 
> since it needs user to write additional USING part.
> cc [~dongjoon] [~hyukjin.kwon] [~joshrosen] [~cloud_fan] [~yumwang]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.

2019-10-10 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948575#comment-16948575
 ] 

Yuming Wang commented on SPARK-29354:
-

Does {{bin/spark-shell}} need jline?

> Spark has direct dependency on jline, but  binaries for 'without hadoop' 
> don't have a jline jar file.
> -
>
> Key: SPARK-29354
> URL: https://issues.apache.org/jira/browse/SPARK-29354
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.4, 2.4.4
> Environment: From spark 2.3.x, spark 2.4.x
>Reporter: Sungpeo Kook
>Priority: Minor
>
> Spark has direct dependency on jline, included in the root pom.xml
> but binaries for 'without hadoop' don't have a jline jar file.
>  
> spark 2.2.x has the jline jar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.

2019-10-10 Thread Sungpeo Kook (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948610#comment-16948610
 ] 

Sungpeo Kook commented on SPARK-29354:
--

[~yumwang]
I meant the Spark binary distributions listed on the [download page|https://spark.apache.org/downloads.html], whose package type is `Pre-built with user-provided Apache Hadoop`.

And yes, `bin/spark-shell` needs jline.

> Spark has direct dependency on jline, but  binaries for 'without hadoop' 
> don't have a jline jar file.
> -
>
> Key: SPARK-29354
> URL: https://issues.apache.org/jira/browse/SPARK-29354
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.4, 2.4.4
> Environment: From spark 2.3.x, spark 2.4.x
>Reporter: Sungpeo Kook
>Priority: Minor
>
> Spark has direct dependency on jline, included in the root pom.xml
> but binaries for 'without hadoop' don't have a jline jar file.
>  
> spark 2.2.x has the jline jar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.

2019-10-10 Thread Sungpeo Kook (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948610#comment-16948610
 ] 

Sungpeo Kook edited comment on SPARK-29354 at 10/10/19 2:12 PM:


[~yumwang]
I meant spark binary distributions which are listed on the [download 
page|https://spark.apache.org/downloads.html].
whose package type is `Pre-built with user-provided Apache Hadoop` 

and yes, `bin/spark-shell` needs jline.


was (Author: elixir kook):
[~yumwang]
I meant spark binary distributions which are listed on the [download 
page|https://spark.apache.org/downloads.html].
whose package type is `Pre-built with uesr-provided Apache Hadoop` 

and yes, `bin/spark-shell` need jilne.

> Spark has direct dependency on jline, but  binaries for 'without hadoop' 
> don't have a jline jar file.
> -
>
> Key: SPARK-29354
> URL: https://issues.apache.org/jira/browse/SPARK-29354
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.4, 2.4.4
> Environment: From spark 2.3.x, spark 2.4.x
>Reporter: Sungpeo Kook
>Priority: Minor
>
> Spark has direct dependency on jline, included in the root pom.xml
> but binaries for 'without hadoop' don't have a jline jar file.
>  
> spark 2.2.x has the jline jar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.

2019-10-10 Thread Sungpeo Kook (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948610#comment-16948610
 ] 

Sungpeo Kook edited comment on SPARK-29354 at 10/10/19 2:14 PM:


[~yumwang]
I meant spark binary distributions which are listed on the [download 
page|https://spark.apache.org/downloads.html].
whose package type is `Pre-built with user-provided Apache Hadoop` 

and yes, `bin/spark-shell` needs a jline.


was (Author: elixir kook):
[~yumwang]
I meant spark binary distributions which are listed on the [download 
page|https://spark.apache.org/downloads.html].
whose package type is `Pre-built with uesr-provided Apache Hadoop` 

and yes, `bin/spark-shell` needs a jilne.

> Spark has direct dependency on jline, but  binaries for 'without hadoop' 
> don't have a jline jar file.
> -
>
> Key: SPARK-29354
> URL: https://issues.apache.org/jira/browse/SPARK-29354
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.4, 2.4.4
> Environment: From spark 2.3.x, spark 2.4.x
>Reporter: Sungpeo Kook
>Priority: Minor
>
> Spark has direct dependency on jline, included in the root pom.xml
> but binaries for 'without hadoop' don't have a jline jar file.
>  
> spark 2.2.x has the jline jar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.

2019-10-10 Thread Sungpeo Kook (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948610#comment-16948610
 ] 

Sungpeo Kook edited comment on SPARK-29354 at 10/10/19 2:17 PM:


[~yumwang] [~angerszhuuu]
First of all, thank you guys for the response.

I meant spark binary distributions which are listed on the [download 
page|https://spark.apache.org/downloads.html].
whose package type is `Pre-built with user-provided Apache Hadoop` 

and yes, `bin/spark-shell` needs a jline.


was (Author: elixir kook):
[~yumwang]
I meant spark binary distributions which are listed on the [download 
page|https://spark.apache.org/downloads.html].
whose package type is `Pre-built with uesr-provided Apache Hadoop` 

and yes, `bin/spark-shell` needs a jline.

> Spark has direct dependency on jline, but  binaries for 'without hadoop' 
> don't have a jline jar file.
> -
>
> Key: SPARK-29354
> URL: https://issues.apache.org/jira/browse/SPARK-29354
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.4, 2.4.4
> Environment: From spark 2.3.x, spark 2.4.x
>Reporter: Sungpeo Kook
>Priority: Minor
>
> Spark has direct dependency on jline, included in the root pom.xml
> but binaries for 'without hadoop' don't have a jline jar file.
>  
> spark 2.2.x has the jline jar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.

2019-10-10 Thread Sungpeo Kook (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948610#comment-16948610
 ] 

Sungpeo Kook edited comment on SPARK-29354 at 10/10/19 2:17 PM:


[~yumwang] [~angerszhuuu]
First of all, thank you guys for the response.

I meant spark binary distributions which are listed on the [download 
page|https://spark.apache.org/downloads.html].
whose package type is `Pre-built with user-provided Apache Hadoop` 

and yes, `bin/spark-shell` needs a jline.


was (Author: elixir kook):
[~yumwang] [~angerszhuuu]
At first, thank you guys for the response.

I meant spark binary distributions which are listed on the [download 
page|https://spark.apache.org/downloads.html].
whose package type is `Pre-built with uesr-provided Apache Hadoop` 

and yes, `bin/spark-shell` needs a jline.

> Spark has direct dependency on jline, but  binaries for 'without hadoop' 
> don't have a jline jar file.
> -
>
> Key: SPARK-29354
> URL: https://issues.apache.org/jira/browse/SPARK-29354
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.4, 2.4.4
> Environment: From spark 2.3.x, spark 2.4.x
>Reporter: Sungpeo Kook
>Priority: Minor
>
> Spark has direct dependency on jline, included in the root pom.xml
> but binaries for 'without hadoop' don't have a jline jar file.
>  
> spark 2.2.x has the jline jar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28859) Remove value check of MEMORY_OFFHEAP_SIZE in declaration section

2019-10-10 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948652#comment-16948652
 ] 

Thomas Graves commented on SPARK-28859:
---

I wouldn't expect users to specify the size when enabled is false. If they do set enabled to false, I guess it's ok for the size to be 0, but I'm not sure we really need to special-case this.

A default of 0 is fine; that is why I said that if the user specifies a value it should be > 0, but I haven't looked to see when the ConfigEntry runs the validation on this. If it validates the default value then we can't change it, or the validator needs to change; that is what this Jira is meant to investigate. Taking a skim of the code, it looks like the validator only runs on non-default values.
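
As a rough illustration of the behaviour being discussed (plain Scala, not Spark's actual internal ConfigEntry API; all names and checks below are made up for the sketch): the per-value validator only fires when a user explicitly sets the size, so a default of 0 stays legal, and the enabled/size consistency check is a separate step:
{code:java}
// Illustrative only: mimics "validator runs on explicitly set values, not on
// the default", plus the separate enabled-implies-positive-size check.
object OffHeapConfigSketch {
  final case class Conf(settings: Map[String, String]) {
    def offHeapEnabled: Boolean =
      settings.get("spark.memory.offHeap.enabled").exists(_.toBoolean)

    def offHeapSize: Long = settings.get("spark.memory.offHeap.size") match {
      case Some(v) =>
        val size = v.toLong
        // Validator: only reached for explicitly provided values.
        require(size > 0, s"spark.memory.offHeap.size must be > 0 when set, got $size")
        size
      case None => 0L // default value, accepted without running the validator
    }

    // Cross-config check, analogous to what SPARK-28577 added before requesting
    // resources from YARN.
    def check(): Unit =
      if (offHeapEnabled) {
        require(offHeapSize > 0,
          "spark.memory.offHeap.size must be > 0 when spark.memory.offHeap.enabled is true")
      }
  }

  def main(args: Array[String]): Unit = {
    Conf(Map("spark.memory.offHeap.enabled" -> "true",
             "spark.memory.offHeap.size" -> "1073741824")).check() // passes
    Conf(Map.empty).check() // passes: off-heap disabled, default size 0
  }
}
{code}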

> Remove value check of MEMORY_OFFHEAP_SIZE in declaration section
> 
>
> Key: SPARK-28859
> URL: https://issues.apache.org/jira/browse/SPARK-28859
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Assignee: yifan
>Priority: Minor
>
> Now MEMORY_OFFHEAP_SIZE has default value 0, but It should be greater than 0 
> when 
> MEMORY_OFFHEAP_ENABLED is true,, should we check this condition in code?
>  
> SPARK-28577 add this check before request memory resource to Yarn 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29400) Improve PrometheusResource to use labels

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29400.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26060
[https://github.com/apache/spark/pull/26060]

> Improve PrometheusResource to use labels
> 
>
> Key: SPARK-29400
> URL: https://issues.apache.org/jira/browse/SPARK-29400
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> SPARK-29064 introduced `PrometheusResource` for native support. This issue 
> aims to improve it to use **labels**.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29400) Improve PrometheusResource to use labels

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29400:
-

Assignee: Dongjoon Hyun

> Improve PrometheusResource to use labels
> 
>
> Key: SPARK-29400
> URL: https://issues.apache.org/jira/browse/SPARK-29400
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> SPARK-29064 introduced `PrometheusResource` for native support. This issue 
> aims to improve it to use **labels**.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29400) Improve PrometheusResource to use labels

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29400:
--
Parent: SPARK-29429
Issue Type: Sub-task  (was: Improvement)

> Improve PrometheusResource to use labels
> 
>
> Key: SPARK-29400
> URL: https://issues.apache.org/jira/browse/SPARK-29400
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> SPARK-29064 introduced `PrometheusResource` for native support. This issue 
> aims to improve it to use **labels**.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29429) Support Prometheus monitoring

2019-10-10 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-29429:
-

 Summary: Support Prometheus monitoring
 Key: SPARK-29429
 URL: https://issues.apache.org/jira/browse/SPARK-29429
 Project: Spark
  Issue Type: Umbrella
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29064) Add PrometheusResource to export Executor metrics

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29064:
--
Parent: SPARK-29429
Issue Type: Sub-task  (was: Improvement)

> Add PrometheusResource to export Executor metrics
> -
>
> Key: SPARK-29064
> URL: https://issues.apache.org/jira/browse/SPARK-29064
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29032) Simplify Prometheus support by adding PrometheusServlet

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29032:
--
Parent: SPARK-29429
Issue Type: Sub-task  (was: Improvement)

> Simplify Prometheus support by adding PrometheusServlet
> ---
>
> Key: SPARK-29032
> URL: https://issues.apache.org/jira/browse/SPARK-29032
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue aims to simplify `Prometheus` support in Spark standalone 
> environment or K8s environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29429) Support Prometheus monitoring

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29429:
--
Target Version/s: 3.0.0

> Support Prometheus monitoring
> -
>
> Key: SPARK-29429
> URL: https://issues.apache.org/jira/browse/SPARK-29429
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29032) Simplify Prometheus support by adding PrometheusServlet

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29032:
-

Assignee: Dongjoon Hyun

> Simplify Prometheus support by adding PrometheusServlet
> ---
>
> Key: SPARK-29032
> URL: https://issues.apache.org/jira/browse/SPARK-29032
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue aims to simplify `Prometheus` support in Spark standalone 
> environment or K8s environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29429) Support Prometheus monitoring natively

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29429:
--
Summary: Support Prometheus monitoring natively  (was: Support Prometheus 
monitoring)

> Support Prometheus monitoring natively
> --
>
> Key: SPARK-29429
> URL: https://issues.apache.org/jira/browse/SPARK-29429
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29064) Add PrometheusResource to export Executor metrics

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29064:
-

Assignee: Dongjoon Hyun

> Add PrometheusResource to export Executor metrics
> -
>
> Key: SPARK-29064
> URL: https://issues.apache.org/jira/browse/SPARK-29064
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29430) Document new metric endpoints for Prometheus

2019-10-10 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-29430:
-

 Summary: Document new metric endpoints for Prometheus
 Key: SPARK-29430
 URL: https://issues.apache.org/jira/browse/SPARK-29430
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29396) Extend Spark plugin interface to driver

2019-10-10 Thread Imran Rashid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948720#comment-16948720
 ] 

Imran Rashid commented on SPARK-29396:
--

My hack to get around this in the past was to create a "SparkListener" which just ignored all the events it got, since that lets you instantiate arbitrary code in the driver after most initialization but before running anything else. It's an ugly API for sure, so it would be nice to improve, but I'm curious whether there is a functional shortcoming you need to address as well?
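
A rough sketch of that workaround, assuming a listener registered through {{spark.extraListeners}} (the class and package names below are made up for illustration; the constructor body is where the driver-side setup would go):
{code:java}
// Illustrative only: a listener that ignores every event and exists purely so
// its constructor runs custom code on the driver during SparkContext startup.
// Register with: --conf spark.extraListeners=com.example.DriverInitListener
package com.example

import org.apache.spark.SparkConf
import org.apache.spark.scheduler.SparkListener

class DriverInitListener(conf: SparkConf) extends SparkListener {
  // Runs on the driver while the SparkContext is being initialized,
  // before any jobs are submitted.
  println(s"Driver-side plugin init for app ${conf.get("spark.app.name", "unknown")}")
  // No event callbacks are overridden; all of them keep their default no-op bodies.
}
{code}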

> Extend Spark plugin interface to driver
> ---
>
> Key: SPARK-29396
> URL: https://issues.apache.org/jira/browse/SPARK-29396
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Priority: Major
>
> Spark provides an extension API for people to implement executor plugins, 
> added in SPARK-24918 and later extended in SPARK-28091.
> That API does not offer any functionality for doing similar things on the 
> driver side, though. As a consequence of that, there is not a good way for 
> the executor plugins to get information or communicate in any way with the 
> Spark driver.
> I've been playing with such an improved API for developing some new 
> functionality. I'll file a few child bugs for the work to get the changes in.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29431) Improve Web UI / Sql tab visualization with cached dataframes.

2019-10-10 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-29431:
--

 Summary: Improve Web UI / Sql tab visualization with cached 
dataframes.
 Key: SPARK-29431
 URL: https://issues.apache.org/jira/browse/SPARK-29431
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


When the Spark plan contains a cached dataframe, the plan of the cached dataframe is not shown in the SQL tree.

More details are in the pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29430) Document new metric endpoints for Prometheus

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29430:
-

Assignee: Dongjoon Hyun

> Document new metric endpoints for Prometheus
> 
>
> Key: SPARK-29430
> URL: https://issues.apache.org/jira/browse/SPARK-29430
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-10 Thread Mukul Murthy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948800#comment-16948800
 ] 

Mukul Murthy commented on SPARK-29358:
--

That would be a start to make us not have to do #1, but #2 is still annoying. Serializing and deserializing the data just to merge schemas is clunky, and transforming each DataFrame's schema is annoying enough even when you don't have StructTypes and nested columns; when you add those into the picture, it gets even messier.
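
For reference, a rough sketch of the manual workaround being discussed (an illustrative helper, not an existing Spark API). It only handles top-level columns, which is exactly why nested StructTypes make the by-hand approach so much messier:
{code:java}
// Illustrative helper only: pads each side with typed null columns for the
// names it is missing, then unions by name. Nested struct fields that differ
// inside a shared column are not handled, which is the pain point above.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

object UnionByNameWithNulls {
  def unionByNameFillNulls(left: DataFrame, right: DataFrame): DataFrame = {
    val leftCols = left.columns.toSet
    val rightCols = right.columns.toSet

    def addMissing(df: DataFrame, missing: Set[String], source: DataFrame): DataFrame =
      missing.foldLeft(df) { (acc, name) =>
        // Cast the null to the other side's type so the union stays well-typed.
        acc.withColumn(name, lit(null).cast(source.schema(name).dataType))
      }

    val paddedLeft = addMissing(left, rightCols -- leftCols, right)
    val paddedRight = addMissing(right, leftCols -- rightCols, left)
    paddedLeft.unionByName(paddedRight)
  }
}
{code}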

 

 

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20629) Copy shuffle data when nodes are being shut down

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20629:
--
Affects Version/s: (was: 2.3.0)
   (was: 2.2.0)
   3.0.0

> Copy shuffle data when nodes are being shut down
> 
>
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>  Labels: bulk-closed
>
> We decided not to do this for YARN, but for EC2/GCE and similar systems nodes 
> may be shut down entirely without the ability to keep an AuxiliaryService 
> around.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20629) Copy shuffle data when nodes are being shut down

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-20629:
---

> Copy shuffle data when nodes are being shut down
> 
>
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Holden Karau
>Priority: Major
>  Labels: bulk-closed
>
> We decided not to do this for YARN, but for EC2/GCE and similar systems nodes 
> may be shut down entirely without the ability to keep an AuxiliaryService 
> around.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20624) Add better handling for node shutdown

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-20624:
---

> Add better handling for node shutdown
> -
>
> Key: SPARK-20624
> URL: https://issues.apache.org/jira/browse/SPARK-20624
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Holden Karau
>Priority: Minor
>  Labels: bulk-closed
>
> While we've done some good work with better handling when Spark is choosing 
> to decommission nodes (SPARK-7955), it might make sense in environments where 
> we get preempted without our own choice (e.g. YARN over-commit, EC2 spot 
> instances, GCE Preemptiable instances, etc.) to do something for the data on 
> the node (or at least not schedule any new tasks).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20624) Add better handling for node shutdown

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20624:
--
Labels:   (was: bulk-closed)

> Add better handling for node shutdown
> -
>
> Key: SPARK-20624
> URL: https://issues.apache.org/jira/browse/SPARK-20624
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Minor
>
> While we've done some good work with better handling when Spark is choosing 
> to decommission nodes (SPARK-7955), it might make sense in environments where 
> we get preempted without our own choice (e.g. YARN over-commit, EC2 spot 
> instances, GCE Preemptiable instances, etc.) to do something for the data on 
> the node (or at least not schedule any new tasks).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20629) Copy shuffle data when nodes are being shut down

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20629:
--
Labels:   (was: bulk-closed)

> Copy shuffle data when nodes are being shut down
> 
>
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>
> We decided not to do this for YARN, but for EC2/GCE and similar systems nodes 
> may be shut down entirely without the ability to keep an AuxiliaryService 
> around.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20624) Add better handling for node shutdown

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20624:
--
Affects Version/s: (was: 2.3.0)
   (was: 2.2.0)
   3.0.0

> Add better handling for node shutdown
> -
>
> Key: SPARK-20624
> URL: https://issues.apache.org/jira/browse/SPARK-20624
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Minor
>  Labels: bulk-closed
>
> While we've done some good work with better handling when Spark is choosing 
> to decommission nodes (SPARK-7955), it might make sense in environments where 
> we get preempted without our own choice (e.g. YARN over-commit, EC2 spot 
> instances, GCE Preemptiable instances, etc.) to do something for the data on 
> the node (or at least not schedule any new tasks).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20624) Add better handling for node shutdown

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20624:
--
Priority: Major  (was: Minor)

> Add better handling for node shutdown
> -
>
> Key: SPARK-20624
> URL: https://issues.apache.org/jira/browse/SPARK-20624
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>
> While we've done some good work with better handling when Spark is choosing 
> to decommission nodes (SPARK-7955), it might make sense in environments where 
> we get preempted without our own choice (e.g. YARN over-commit, EC2 spot 
> instances, GCE Preemptiable instances, etc.) to do something for the data on 
> the node (or at least not schedule any new tasks).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20732) Copy cache data when node is being shut down

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20732:
--
Labels:   (was: bulk-closed)

> Copy cache data when node is being shut down
> 
>
> Key: SPARK-20732
> URL: https://issues.apache.org/jira/browse/SPARK-20732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20732) Copy cache data when node is being shut down

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20732:
--
Affects Version/s: (was: 2.3.0)
   (was: 2.2.0)
   3.0.0

> Copy cache data when node is being shut down
> 
>
> Key: SPARK-20732
> URL: https://issues.apache.org/jira/browse/SPARK-20732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>  Labels: bulk-closed
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20732) Copy cache data when node is being shut down

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-20732:
---

> Copy cache data when node is being shut down
> 
>
> Key: SPARK-20732
> URL: https://issues.apache.org/jira/browse/SPARK-20732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Holden Karau
>Priority: Major
>  Labels: bulk-closed
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21040) On executor/worker decommission consider speculatively re-launching current tasks

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-21040:
---

> On executor/worker decommission consider speculatively re-launching current 
> tasks
> -
>
> Key: SPARK-21040
> URL: https://issues.apache.org/jira/browse/SPARK-21040
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Holden Karau
>Priority: Major
>  Labels: bulk-closed
>
> If speculative execution is enabled we may wish to consider decommissioning 
> of worker as a weight for speculative execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21040) On executor/worker decommission consider speculatively re-launching current tasks

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21040:
--
Labels:   (was: bulk-closed)

> On executor/worker decommission consider speculatively re-launching current 
> tasks
> -
>
> Key: SPARK-21040
> URL: https://issues.apache.org/jira/browse/SPARK-21040
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>
> If speculative execution is enabled we may wish to consider decommissioning 
> of worker as a weight for speculative execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21040) On executor/worker decommission consider speculatively re-launching current tasks

2019-10-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21040:
--
Affects Version/s: (was: 2.3.0)
   (was: 2.2.0)
   3.0.0

> On executor/worker decommission consider speculatively re-launching current 
> tasks
> -
>
> Key: SPARK-21040
> URL: https://issues.apache.org/jira/browse/SPARK-21040
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>  Labels: bulk-closed
>
> If speculative execution is enabled we may wish to consider decommissioning 
> of worker as a weight for speculative execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf

2019-10-10 Thread Nasir Ali (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948832#comment-16948832
 ] 

Nasir Ali commented on SPARK-28502:
---

[~bryanc] I tested it and it works fine with the master branch. Is there an expected release date for version 3? Or could this bug fix be integrated into the next release?

> Error with struct conversion while using pandas_udf
> ---
>
> Key: SPARK-28502
> URL: https://issues.apache.org/jira/browse/SPARK-28502
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: OS: Ubuntu
> Python: 3.6
>Reporter: Nasir Ali
>Priority: Minor
> Fix For: 3.0.0
>
>
> What I am trying to do: Group data based on time intervals (e.g., 15 days 
> window) and perform some operations on dataframe using (pandas) UDFs. I don't 
> know if there is a better/cleaner way to do it.
> Below is the sample code that I tried and error message I am getting.
>  
> {code:java}
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
> (13.00, "2018-03-11T12:27:18+00:00"),
> (25.00, "2018-03-12T11:27:18+00:00"),
> (20.00, "2018-03-13T15:27:18+00:00"),
> (17.00, "2018-03-14T12:27:18+00:00"),
> (99.00, "2018-03-15T11:27:18+00:00"),
> (156.00, "2018-03-22T11:27:18+00:00"),
> (17.00, "2018-03-31T11:27:18+00:00"),
> (25.00, "2018-03-15T11:27:18+00:00"),
> (25.00, "2018-03-16T11:27:18+00:00")
> ],
>["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> schema = StructType([
> StructField("id", IntegerType()),
> StructField("ts", TimestampType())
> ])
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
> # some computation
> return df
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
>  
> However, if I use builtin agg method then it works all fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output
> {code:java}
> +-+--+---+
> |id   |window|avg(id)|
> +-+--+---+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-+--+---+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)

2019-10-10 Thread Michael Albert (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948860#comment-16948860
 ] 

Michael Albert commented on SPARK-28921:


Is there a timeline for when we can expect a Spark release to include this fix? The bug is affecting us severely, and I've been waiting for 2.4.5 to drop.

> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 
> 1.12.10, 1.11.10)
> ---
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: Paul Schweigert
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by fixes for a recent CVE : 
> CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809]
> Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669]
>  
> Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)

2019-10-10 Thread Paul Schweigert (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948896#comment-16948896
 ] 

Paul Schweigert commented on SPARK-28921:
-

[~albertmichaelj] You can replace the `kubernetes-client-4.1.2.jar` (in the 
`jars/` directory) with the newer version (v 4.4.2).

> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 
> 1.12.10, 1.11.10)
> ---
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: Paul Schweigert
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by fixes for a recent CVE : 
> CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809]
> Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669]
>  
> Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-10-10 Thread antonkulaga (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948897#comment-16948897
 ] 

antonkulaga commented on SPARK-28547:
-

[~hyukjin.kwon] What is not clear for you? I think it is really clear that 
Spark performs miserably (freezing or taking many hours) whenever the data 
frame has 10-20K or more columns. I gave the GTEX dataset as an example 
(though any gene or transcript expression dataset will demonstrate it). In 
many fields (like a big part of bioinformatics) wide data frames are common; 
right now Spark is totally useless there.

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15K columns and >15K 
> rows). Most genomics/transcriptomics data is wide because the number of genes 
> is usually >20K and the number of samples as well. The very popular GTEX 
> dataset is a good example (see for instance the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on 
> wide tables (even a simple "describe" applied to all the gene columns) either 
> takes hours or gets frozen (because of lost executors), irrespective of 
> memory and number of cores, while the same operations work fast (minutes) 
> and well with pure pandas (without any Spark involved).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-10-10 Thread antonkulaga (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948897#comment-16948897
 ] 

antonkulaga edited comment on SPARK-28547 at 10/10/19 7:24 PM:
---

[~hyukjin.kwon] What is not clear for you? I think it is really clear that 
Spark performs miserably (freezing or taking hours/days to compute, even for 
the simplest operations like per-column statistics) whenever the data frame 
has 10-20K or more columns. I gave the GTEX dataset as an example (though any 
gene or transcript expression dataset will demonstrate it). In many fields 
(like a big part of bioinformatics) wide data frames are common; right now 
Spark is totally useless there.


was (Author: antonkulaga):
[~hyukjin.kwon] what is not clear for you? I think it is really clear that 
Spark performs miserably (freezing or taking many hours) whenever the data 
frame has 10-20K and more columns and I gave GTEX dataset as an example 
(however any gene or transcript expression dataset will be ok to demonstrate 
it). In many fields (like big part of bioinformatics) wide data frames are 
common, right now Spark is totally useless there.

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15K columns and >15K 
> rows). Most genomics/transcriptomics data is wide because the number of genes 
> is usually >20K and the number of samples as well. The very popular GTEX 
> dataset is a good example (see for instance the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on 
> wide tables (even a simple "describe" applied to all the gene columns) either 
> takes hours or gets frozen (because of lost executors), irrespective of 
> memory and number of cores, while the same operations work fast (minutes) 
> and well with pure pandas (without any Spark involved).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-10-10 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948907#comment-16948907
 ] 

Sean R. Owen commented on SPARK-28547:
--

I agree, this is too open-ended. It's not clear whether it's a general problem 
or specific to a usage pattern, a SQL query, a data type or distribution. Often 
I find that use cases for "10,000 columns" are really use cases for "a big 
array-valued column".

I bet there is room for improvement, but ten thousand columns is just 
inherently slow given how metadata, query plans, etc. are handled.

You'd at least need to help narrow down where the slowdown is and why, and 
even better if you can propose a class of fix. As it is, I'd close this.
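
For illustration, a minimal PySpark sketch (a sketch only; the column names are 
made up, not from GTEX) of the "one big array-valued column" shape:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical wide frame: one id column plus many numeric gene columns.
wide = spark.createDataFrame(
    [("sample1", 1.0, 2.0, 3.0), ("sample2", 4.0, 5.0, 6.0)],
    ["sample_id", "gene_a", "gene_b", "gene_c"])

gene_cols = [c for c in wide.columns if c != "sample_id"]

# Pack the gene columns into a single array column so the schema and the query
# plan stay small no matter how many genes there are.
packed = wide.select("sample_id", F.array(*gene_cols).alias("expr"))

# Example aggregation over the array (mean expression per sample) using the
# built-in higher-order functions available since Spark 2.4.
packed.select(
    "sample_id",
    F.expr("aggregate(expr, 0D, (acc, x) -> acc + x) / size(expr)").alias("mean_expr")
).show()
{code}

Whether that reshaping is acceptable obviously depends on the workload, but it 
is the kind of narrowed-down comparison that would help here.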

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15K columns and >15K 
> rows). Most genomics/transcriptomics data is wide because the number of genes 
> is usually >20K and the number of samples as well. The very popular GTEX 
> dataset is a good example (see for instance the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on 
> wide tables (even a simple "describe" applied to all the gene columns) either 
> takes hours or gets frozen (because of lost executors), irrespective of 
> memory and number of cores, while the same operations work fast (minutes) 
> and well with pure pandas (without any Spark involved).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe

2019-10-10 Thread Prasanna Saraswathi Krishnan (Jira)
Prasanna Saraswathi Krishnan created SPARK-29432:


 Summary: nullable flag of new column changes when persisting a 
pyspark dataframe
 Key: SPARK-29432
 URL: https://issues.apache.org/jira/browse/SPARK-29432
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.4.0
 Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution)

Python 3.7.3
Reporter: Prasanna Saraswathi Krishnan


When I add a new column to a dataframe with {{withColumn}} function, by 
default, the column is added with {{nullable=false}}.

But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this 
the expected behavior? Why?

 

{{>>> l = [('Alice', 1)]}}
{{>>> df = spark.createDataFrame(l)}}
{{>>> df.printSchema()}}
{{root}}
{{ |-- _1: string (nullable = true)}}
{{ |-- _2: long (nullable = true)}}

{{>>> from pyspark.sql.functions import lit}}
{{>>> df = df.withColumn('newCol', lit('newVal'))}}
{{>>> df.printSchema()}}
{{root}}
{{ |-- _1: string (nullable = true)}}
{{ |-- _2: long (nullable = true)}}
{{ |-- newCol: string (nullable = false)}}

{{>>> df.write.saveAsTable('default.withcolTest', mode='overwrite')}}

{{>>> spark.sql("select * from default.withcolTest").printSchema()}}
{{root}}
{{ |-- _1: string (nullable = true)}}
{{ |-- _2: long (nullable = true)}}
{{ |-- newCol: string (nullable = true)}}
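
If preserving the flag matters, a hypothetical workaround sketch (not from this 
ticket) is to re-apply the intended schema after reading the table back, since 
the nullable flag is not guaranteed to round-trip through {{saveAsTable}}:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The intended schema; newCol is assumed non-nullable because it was built
# from a literal.
expected = StructType([
    StructField("_1", StringType(), True),
    StructField("_2", LongType(), True),
    StructField("newCol", StringType(), False),
])

# Re-apply the schema to the rows read back from the saved table.
restored = spark.createDataFrame(spark.table("default.withcolTest").rdd, expected)
restored.printSchema()
{code}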



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe

2019-10-10 Thread Prasanna Saraswathi Krishnan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanna Saraswathi Krishnan updated SPARK-29432:
-
Description: 
When I add a new column to a dataframe with {{withColumn}} function, by 
default, the column is added with {{nullable=false}}.

But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this 
the expected behavior? why?

 

{{>>> l = [('Alice', 1)]}}
 {{>>> df = spark.createDataFrame(l)}}
 {{>>> df.printSchema()}}
 {{root}}
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)

{{>>> from pyspark.sql.functions import lit}}
 {{>>> df = df.withColumn('newCol', lit('newVal'))}}
 {{>>> df.printSchema()}}
 {{root}}
 \{{ |-- _1: string (nullable = true)}}
 \{{ |-- _2: long (nullable = true)}}
 \{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from 
default.withcolTest").printSchema()}}
 {{root}}
 \{{ |-- _1: string (nullable = true)}}
 \{{ |-- _2: long (nullable = true)}}
 \{{ |-- newCol: string (nullable = true)}}

  was:
When I add a new column to a dataframe with {{withColumn}} function, by 
default, the column is added with {{nullable=false}}.

But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this 
the expected behavior? why?

 

{{>>> l = [('Alice', 1)]}}
{{>>> df = spark.createDataFrame(l)}}
{{>>> df.printSchema()}}
{{root}}
{{ |-- _1: string (nullable = true)}}
{{ |-- _2: long (nullable = true)}}{{>>> from pyspark.sql.functions import lit}}
{{>>> df = df.withColumn('newCol', lit('newVal'))}}
{{>>> df.printSchema()}}
{{root}}
{{ |-- _1: string (nullable = true)}}
{{ |-- _2: long (nullable = true)}}
{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from 
default.withcolTest").printSchema()}}
{{root}}
{{ |-- _1: string (nullable = true)}}
{{ |-- _2: long (nullable = true)}}
{{ |-- newCol: string (nullable = true)}}


> nullable flag of new column changes when persisting a pyspark dataframe
> ---
>
> Key: SPARK-29432
> URL: https://issues.apache.org/jira/browse/SPARK-29432
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution)
> Python 3.7.3
>Reporter: Prasanna Saraswathi Krishnan
>Priority: Minor
>
> When I add a new column to a dataframe with {{withColumn}} function, by 
> default, the column is added with {{nullable=false}}.
> But, when I save the dataframe, the flag changes to {{nullable=true}}. Is 
> this the expected behavior? why?
>  
> {{>>> l = [('Alice', 1)]}}
>  {{>>> df = spark.createDataFrame(l)}}
>  {{>>> df.printSchema()}}
>  {{root}}
> |-- _1: string (nullable = true)
> |-- _2: long (nullable = true)
> {{>>> from pyspark.sql.functions import lit}}
>  {{>>> df = df.withColumn('newCol', lit('newVal'))}}
>  {{>>> df.printSchema()}}
>  {{root}}
>  \{{ |-- _1: string (nullable = true)}}
>  \{{ |-- _2: long (nullable = true)}}
>  \{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from 
> default.withcolTest").printSchema()}}
>  {{root}}
>  \{{ |-- _1: string (nullable = true)}}
>  \{{ |-- _2: long (nullable = true)}}
>  \{{ |-- newCol: string (nullable = true)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe

2019-10-10 Thread Prasanna Saraswathi Krishnan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanna Saraswathi Krishnan updated SPARK-29432:
-
Description: 
When I add a new column to a dataframe with {{withColumn}} function, by 
default, the column is added with {{nullable=false}}.

But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this 
the expected behavior? why?

 

{{>>> l = [('Alice', 1)]}}
{{>>> df = spark.createDataFrame(l)}}
{{>>> df.printSchema()}}
{{root}}
{{ |-- _1: string (nullable = true)}}
{{ |-- _2: long (nullable = true)}}{{>>> from pyspark.sql.functions import lit}}
{{>>> df = df.withColumn('newCol', lit('newVal'))}}
{{>>> df.printSchema()}}
{{root}}
{{ |-- _1: string (nullable = true)}}
{{ |-- _2: long (nullable = true)}}
{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from 
default.withcolTest").printSchema()}}
{{root}}
{{ |-- _1: string (nullable = true)}}
{{ |-- _2: long (nullable = true)}}
{{ |-- newCol: string (nullable = true)}}

  was:
When I add a new column to a dataframe with {{withColumn}} function, by 
default, the column is added with {{nullable=false}}.

But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this 
the expected behavior? why?

 

{{>>> l = [('Alice', 1)]}}
{{>>> df = spark.createDataFrame(l)}}
{{>>> df.printSchema()}}
{{root}}
{{ |-- _1: string (nullable = true)}}
{{ |-- _2: long (nullable = true)}}

{{>>> from pyspark.sql.functions import lit}}
{{>>> df = df.withColumn('newCol', lit('newVal'))}}
{{>>> df.printSchema()}}
{{root}}
{{ |-- _1: string (nullable = true)}}
{{ |-- _2: long (nullable = true)}}
{{ |-- newCol: string (nullable = false)}}

{{>>> df.write.saveAsTable('default.withcolTest', mode='overwrite')}}

{{>>> spark.sql("select * from default.withcolTest").printSchema()}}
{{root}}
{{ |-- _1: string (nullable = true)}}
{{ |-- _2: long (nullable = true)}}
{{ |-- newCol: string (nullable = true)}}


> nullable flag of new column changes when persisting a pyspark dataframe
> ---
>
> Key: SPARK-29432
> URL: https://issues.apache.org/jira/browse/SPARK-29432
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution)
> Python 3.7.3
>Reporter: Prasanna Saraswathi Krishnan
>Priority: Minor
>
> When I add a new column to a dataframe with {{withColumn}} function, by 
> default, the column is added with {{nullable=false}}.
> But, when I save the dataframe, the flag changes to {{nullable=true}}. Is 
> this the expected behavior? why?
>  
> {{>>> l = [('Alice', 1)]}}
> {{>>> df = spark.createDataFrame(l)}}
> {{>>> df.printSchema()}}
> {{root}}
> {{ |-- _1: string (nullable = true)}}
> {{ |-- _2: long (nullable = true)}}{{>>> from pyspark.sql.functions import 
> lit}}
> {{>>> df = df.withColumn('newCol', lit('newVal'))}}
> {{>>> df.printSchema()}}
> {{root}}
> {{ |-- _1: string (nullable = true)}}
> {{ |-- _2: long (nullable = true)}}
> {{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from 
> default.withcolTest").printSchema()}}
> {{root}}
> {{ |-- _1: string (nullable = true)}}
> {{ |-- _2: long (nullable = true)}}
> {{ |-- newCol: string (nullable = true)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe

2019-10-10 Thread Prasanna Saraswathi Krishnan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanna Saraswathi Krishnan updated SPARK-29432:
-
Description: 
When I add a new column to a dataframe with {{withColumn}} function, by 
default, the column is added with {{nullable=false}}.

But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this 
the expected behavior? why?

 

{{>>> l = [('Alice', 1)]}}
 {{>>> df = spark.createDataFrame(l)}}
 {{>>> df.printSchema()}}
 {{root}}
 \{{ |-- _1: string (nullable = true)}}
 \{{ |-- _2: long (nullable = true)}}

{{>>> from pyspark.sql.functions import lit}}
 {{>>> df = df.withColumn('newCol', lit('newVal'))}}
 {{>>> df.printSchema()}}
 {{root}}
 \{{ |-- _1: string (nullable = true)}}
 \{{ |-- _2: long (nullable = true)}}
 \{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from 
default.withcolTest").printSchema()}}
 {{root}}
 \{{ |-- _1: string (nullable = true)}}
 \{{ |-- _2: long (nullable = true)}}
 \{{ |-- newCol: string (nullable = true)}}

  was:
When I add a new column to a dataframe with {{withColumn}} function, by 
default, the column is added with {{nullable=false}}.

But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this 
the expected behavior? why?

 

{{>>> l = [('Alice', 1)]}}
 {{>>> df = spark.createDataFrame(l)}}
 {{>>> df.printSchema()}}
 {{root}}
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)

{{>>> from pyspark.sql.functions import lit}}
 {{>>> df = df.withColumn('newCol', lit('newVal'))}}
 {{>>> df.printSchema()}}
 {{root}}
 \{{ |-- _1: string (nullable = true)}}
 \{{ |-- _2: long (nullable = true)}}
 \{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from 
default.withcolTest").printSchema()}}
 {{root}}
 \{{ |-- _1: string (nullable = true)}}
 \{{ |-- _2: long (nullable = true)}}
 \{{ |-- newCol: string (nullable = true)}}


> nullable flag of new column changes when persisting a pyspark dataframe
> ---
>
> Key: SPARK-29432
> URL: https://issues.apache.org/jira/browse/SPARK-29432
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution)
> Python 3.7.3
>Reporter: Prasanna Saraswathi Krishnan
>Priority: Minor
>
> When I add a new column to a dataframe with {{withColumn}} function, by 
> default, the column is added with {{nullable=false}}.
> But, when I save the dataframe, the flag changes to {{nullable=true}}. Is 
> this the expected behavior? why?
>  
> {{>>> l = [('Alice', 1)]}}
>  {{>>> df = spark.createDataFrame(l)}}
>  {{>>> df.printSchema()}}
>  {{root}}
>  \{{ |-- _1: string (nullable = true)}}
>  \{{ |-- _2: long (nullable = true)}}
> {{>>> from pyspark.sql.functions import lit}}
>  {{>>> df = df.withColumn('newCol', lit('newVal'))}}
>  {{>>> df.printSchema()}}
>  {{root}}
>  \{{ |-- _1: string (nullable = true)}}
>  \{{ |-- _2: long (nullable = true)}}
>  \{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from 
> default.withcolTest").printSchema()}}
>  {{root}}
>  \{{ |-- _1: string (nullable = true)}}
>  \{{ |-- _2: long (nullable = true)}}
>  \{{ |-- newCol: string (nullable = true)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29116) Refactor py classes related to DecisionTree

2019-10-10 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948942#comment-16948942
 ] 

Huaxin Gao commented on SPARK-29116:


I will submit a PR after the DecisionTree refactor is in, so I don't need to 
rebase and merge later. [~podongfeng]

> Refactor py classes related to DecisionTree
> ---
>
> Key: SPARK-29116
> URL: https://issues.apache.org/jira/browse/SPARK-29116
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> 1. Like the Scala side, move the related classes to a separate file 'tree.py'
> 2. Add a method 'predictLeaf' in 'DecisionTreeModel' & 'TreeEnsembleModel'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29433) Web UI Stages table tooltip correction

2019-10-10 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-29433:
--

 Summary: Web UI Stages table tooltip correction
 Key: SPARK-29433
 URL: https://issues.apache.org/jira/browse/SPARK-29433
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


In the Web UI Stages table, the tooltips of the Input and Output columns are 
not correct.

Actual tooltip messages: 
 * Bytes and records read from Hadoop or from Spark storage.
 * Bytes and records written to Hadoop.

In these columns we only show bytes, not records.

More information is in the pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26651) Use Proleptic Gregorian calendar

2019-10-10 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948957#comment-16948957
 ] 

Maxim Gekk commented on SPARK-26651:


[~jiangxb] Could you consider including this in the list of major changes for 
Spark 3.0?

> Use Proleptic Gregorian calendar
> 
>
> Key: SPARK-26651
> URL: https://issues.apache.org/jira/browse/SPARK-26651
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: ReleaseNote
>
> Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in 
> date/timestamp parsing, functions and expressions. This ticket aims to switch 
> Spark to the Proleptic Gregorian calendar, and to use the java.time classes 
> introduced in Java 8 for timestamp/date manipulations. One purpose of 
> switching to the Proleptic Gregorian calendar is to conform to the SQL 
> standard, which assumes that calendar.
> *Release note:*
> Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, 
> formatting, and converting dates and timestamps, as well as in extracting 
> sub-components like years and days. It uses Java 8 API classes from the 
> java.time packages that are based on [ISO chronology 
> |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html].
>  Previous versions of Spark performed those operations using [the hybrid 
> calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]
>  (Julian + Gregorian). The changes might impact the results for dates and 
> timestamps before October 15, 1582 (Gregorian).
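
For anyone checking whether existing jobs are affected, here is a minimal 
sketch (not part of the ticket) of the kind of query to re-run on 2.4 and on 
3.0 and compare; per the release note above, only dates and timestamps before 
October 15, 1582 can differ:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Run the same statement on Spark 2.4.x (hybrid Julian + Gregorian) and on
# Spark 3.0 (Proleptic Gregorian) and compare the output.
spark.sql("""
    SELECT CAST('1582-10-10' AS DATE) AS pre_1582_date,
           CAST('2019-10-10' AS DATE) AS modern_date
""").show()
{code}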



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29434) Improve the MapStatuses serialization performance

2019-10-10 Thread DB Tsai (Jira)
DB Tsai created SPARK-29434:
---

 Summary: Improve the MapStatuses serialization performance
 Key: SPARK-29434
 URL: https://issues.apache.org/jira/browse/SPARK-29434
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: DB Tsai
 Fix For: 3.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27665) Split fetch shuffle blocks protocol from OpenBlocks

2019-10-10 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948979#comment-16948979
 ] 

koert kuipers commented on SPARK-27665:
---

I tried using spark.shuffle.useOldFetchProtocol=true while using Spark 3 
(master) to launch the job, with the Spark 2.4.1 shuffle service running in 
YARN. I cannot get it to work.

For example, on one cluster I saw:
{code}
Error occurred while fetching local blocks
java.nio.file.NoSuchFileException: 
/mnt1/yarn/usercache/hadoop/appcache/application_1570697024032_0058/blockmgr-d1d009b1-1c95-4e2a-9a71-0ff20078b9a8/38/shuffle_0_0_0.index
{code}

on another:
{code}
org.apache.spark.shuffle.FetchFailedException: 
/data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:596)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:511)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:67)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:850)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:850)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.file.NoSuchFileException: 
/data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at 
sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:204)
at 
org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:551)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:349)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:391)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:161)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:60)
at 
org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:172)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
... 11 more
{code}
 
 

> Split fetch shuffle blocks protocol from OpenBlocks
> ---
>
> Key: SPARK-27665
> URL: https://issues.apache.org/jira/brows

[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf

2019-10-10 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948981#comment-16948981
 ] 

Bryan Cutler commented on SPARK-28502:
--

Thanks for testing it out [~nasirali]! It's unlikely that this would make it to 
the 2.4.x line, but the next Spark release will be 3.0.0 anyway, which will 
roughly be early 2020.

> Error with struct conversion while using pandas_udf
> ---
>
> Key: SPARK-28502
> URL: https://issues.apache.org/jira/browse/SPARK-28502
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: OS: Ubuntu
> Python: 3.6
>Reporter: Nasir Ali
>Priority: Minor
> Fix For: 3.0.0
>
>
> What I am trying to do: Group data based on time intervals (e.g., 15 days 
> window) and perform some operations on dataframe using (pandas) UDFs. I don't 
> know if there is a better/cleaner way to do it.
> Below is the sample code that I tried and error message I am getting.
>  
> {code:java}
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
> (13.00, "2018-03-11T12:27:18+00:00"),
> (25.00, "2018-03-12T11:27:18+00:00"),
> (20.00, "2018-03-13T15:27:18+00:00"),
> (17.00, "2018-03-14T12:27:18+00:00"),
> (99.00, "2018-03-15T11:27:18+00:00"),
> (156.00, "2018-03-22T11:27:18+00:00"),
> (17.00, "2018-03-31T11:27:18+00:00"),
> (25.00, "2018-03-15T11:27:18+00:00"),
> (25.00, "2018-03-16T11:27:18+00:00")
> ],
>["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> schema = StructType([
> StructField("id", IntegerType()),
> StructField("ts", TimestampType())
> ])
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
> # some computation
> return df
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
>  
> However, if I use builtin agg method then it works all fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output
> {code:java}
> +-----+------------------------------------------+-------+
> |id   |window                                    |avg(id)|
> +-----+------------------------------------------+-------+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-----+------------------------------------------+-------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29435) Spark 3 doesnt work with older shuffle service

2019-10-10 Thread koert kuipers (Jira)
koert kuipers created SPARK-29435:
-

 Summary: Spark 3 doesnt work with older shuffle service
 Key: SPARK-29435
 URL: https://issues.apache.org/jira/browse/SPARK-29435
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.0.0
 Environment: Spark 3 from Sept 26, commit 
8beb736a00b004f97de7fcdf9ff09388d80fc548
Spark 2.4.1 shuffle service in yarn 
Reporter: koert kuipers


SPARK-27665 introduced a change to the shuffle protocol. It also introduced a 
setting, spark.shuffle.useOldFetchProtocol, which should allow Spark 3 to run 
with the old shuffle service.

However, I have not gotten that to work. I have been testing with Spark 3 
master (from Sept 26) and the shuffle service from Spark 2.4.1 in YARN.

The errors I see are, for example, on EMR:
{code}
Error occurred while fetching local blocks
java.nio.file.NoSuchFileException: 
/mnt1/yarn/usercache/hadoop/appcache/application_1570697024032_0058/blockmgr-d1d009b1-1c95-4e2a-9a71-0ff20078b9a8/38/shuffle_0_0_0.index
{code}

And on CDH5:
{code}
org.apache.spark.shuffle.FetchFailedException: 
/data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:596)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:511)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:67)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:850)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:850)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.file.NoSuchFileException: 
/data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at 
sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:204)
at 
org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:551)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:349)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:391)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:161)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:60)
at 
org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:172)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
at 
org.apache.spar

[jira] [Commented] (SPARK-29435) Spark 3 doesnt work with older shuffle service

2019-10-10 Thread Marcelo Masiero Vanzin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948987#comment-16948987
 ] 

Marcelo Masiero Vanzin commented on SPARK-29435:


I think you have to set {{spark.shuffle.useOldFetchProtocol=true}} with 3.0 for 
that to work.
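
For reference, a minimal sketch of one way to set that flag from PySpark 
(passing it via {{--conf}} on spark-submit works equally well); whether the 
flag is actually sufficient here is what this ticket is about:

{code:python}
from pyspark.sql import SparkSession

# The config key comes from SPARK-27665; set it before the SparkContext is
# created so it takes effect for the application's shuffles.
spark = (SparkSession.builder
         .config("spark.shuffle.useOldFetchProtocol", "true")
         .getOrCreate())
print(spark.conf.get("spark.shuffle.useOldFetchProtocol"))
{code}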

> Spark 3 doesnt work with older shuffle service
> --
>
> Key: SPARK-29435
> URL: https://issues.apache.org/jira/browse/SPARK-29435
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
> Environment: Spark 3 from Sept 26, commit 
> 8beb736a00b004f97de7fcdf9ff09388d80fc548
> Spark 2.4.1 shuffle service in yarn 
>Reporter: koert kuipers
>Priority: Major
>
> SPARK-27665 introduced a change to the shuffle protocol. It also introduced a 
> setting, spark.shuffle.useOldFetchProtocol, which should allow Spark 3 to run 
> with the old shuffle service.
> However, I have not gotten that to work. I have been testing with Spark 3 
> master (from Sept 26) and the shuffle service from Spark 2.4.1 in YARN.
> The errors I see are, for example, on EMR:
> {code}
> Error occurred while fetching local blocks
> java.nio.file.NoSuchFileException: 
> /mnt1/yarn/usercache/hadoop/appcache/application_1570697024032_0058/blockmgr-d1d009b1-1c95-4e2a-9a71-0ff20078b9a8/38/shuffle_0_0_0.index
> {code}
> And on CDH5:
> {code}
> org.apache.spark.shuffle.FetchFailedException: 
> /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:596)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:511)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:67)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:850)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:850)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.nio.file.NoSuchFileException: 
> /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index
>   at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at 
> sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
>   at java.nio.file.Files.newByteChannel(Files.java:361)
>   at java.nio.file.Files.newByteChannel(Files.java:407)
>   at 
> org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:204)
>   at 
> org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:551)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:349)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:391)
>   at 
> org.ap

[jira] [Commented] (SPARK-29435) Spark 3 doesnt work with older shuffle service

2019-10-10 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948989#comment-16948989
 ] 

koert kuipers commented on SPARK-29435:
---

[~vanzin] Sorry, I should have been more clear: I did set 
spark.shuffle.useOldFetchProtocol=true and I could not get it to work.

> Spark 3 doesnt work with older shuffle service
> --
>
> Key: SPARK-29435
> URL: https://issues.apache.org/jira/browse/SPARK-29435
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
> Environment: Spark 3 from Sept 26, commit 
> 8beb736a00b004f97de7fcdf9ff09388d80fc548
> Spark 2.4.1 shuffle service in yarn 
>Reporter: koert kuipers
>Priority: Major
>
> SPARK-27665 introduced a change to the shuffle protocol. It also introduced a 
> setting, spark.shuffle.useOldFetchProtocol, which should allow Spark 3 to run 
> with the old shuffle service.
> However, I have not gotten that to work. I have been testing with Spark 3 
> master (from Sept 26) and the shuffle service from Spark 2.4.1 in YARN.
> The errors I see are, for example, on EMR:
> {code}
> Error occurred while fetching local blocks
> java.nio.file.NoSuchFileException: 
> /mnt1/yarn/usercache/hadoop/appcache/application_1570697024032_0058/blockmgr-d1d009b1-1c95-4e2a-9a71-0ff20078b9a8/38/shuffle_0_0_0.index
> {code}
> And on CDH5:
> {code}
> org.apache.spark.shuffle.FetchFailedException: 
> /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:596)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:511)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:67)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:850)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:850)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.nio.file.NoSuchFileException: 
> /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index
>   at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at 
> sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
>   at java.nio.file.Files.newByteChannel(Files.java:361)
>   at java.nio.file.Files.newByteChannel(Files.java:407)
>   at 
> org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:204)
>   at 
> org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:551)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:349)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:391)
>

[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-10-10 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-26806:
---
Description: 
Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
make "avg" become "NaN". And whatever gets merged with the result of 
"zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will 
return "0" and the user will see the following incorrect report:
{code:java}
"eventTime" : {
"avg" : "1970-01-01T00:00:00.000Z",
"max" : "2019-01-31T12:57:00.000Z",
"min" : "2019-01-30T18:44:04.000Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  }
{code}

  was:
Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
make "avg" become "NaN". And whatever gets merged with the result of 
"zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will 
return "0" and the user will see the following incorrect report:

{code}
"eventTime" : {
"avg" : "1970-01-01T00:00:00.000Z",
"max" : "2019-01-31T12:57:00.000Z",
"min" : "2019-01-30T18:44:04.000Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  }
{code}

This issue was reported by [~liancheng]


> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Cheng Lian
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.2.4, 2.3.3, 2.4.1, 3.0.0
>
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code:java}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}
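
As a side note, a tiny numeric sketch (plain Python/NumPy, not the Spark code 
itself) of why the zero merge turns into NaN and then sticks:

{code:python}
import numpy as np

def merge_avg(avg1, count1, avg2, count2):
    # Weighted-average merge, roughly what EventTimeStats.merge does for "avg".
    # With IEEE float division (as on the JVM), 0.0 / 0.0 yields NaN.
    total = count1 + count2
    return (avg1 * count1 + avg2 * count2) / total

zero_merged = merge_avg(np.float64(0.0), np.float64(0.0),
                        np.float64(0.0), np.float64(0.0))
print(zero_merged)                                   # nan
print(merge_avg(zero_merged, np.float64(1.0),
                np.float64(5.0), np.float64(3.0)))   # still nan: NaN propagates
# NaN.toLong is 0 on the JVM, which is why "avg" shows up as
# 1970-01-01T00:00:00.000Z in the report.
{code}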



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-10-10 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-26806:
---
Reporter: Cheng Lian  (was: liancheng)

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Cheng Lian
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.2.4, 2.3.3, 2.4.1, 3.0.0
>
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}
> This issue was reported by [~liancheng]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29367) pandas udf not working with latest pyarrow release (0.15.0)

2019-10-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29367.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26045
[https://github.com/apache/spark/pull/26045]

> pandas udf not working with latest pyarrow release (0.15.0)
> ---
>
> Key: SPARK-29367
> URL: https://issues.apache.org/jira/browse/SPARK-29367
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0, 2.4.1, 2.4.3
>Reporter: Julien Peloton
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>
> Hi,
> I recently upgraded pyarrow from 0.14 to 0.15 (released on Oct 5th), and my 
> pyspark jobs using pandas udf are failing with 
> java.lang.IllegalArgumentException (tested with Spark 2.4.0, 2.4.1, and 
> 2.4.3). Here is a full example to reproduce the failure with pyarrow 0.15:
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import BooleanType
> import pandas as pd
>
> @pandas_udf(BooleanType(), PandasUDFType.SCALAR)
> def qualitycuts(nbad: int, rb: float, magdiff: float) -> pd.Series:
>     """ Apply simple quality cuts
>
>     Returns
>     -------
>     out: pandas.Series of booleans
>         Return a Pandas DataFrame with the appropriate flag: false for bad alert,
>         and true for good alert.
>     """
>     mask = nbad.values == 0
>     mask *= rb.values >= 0.55
>     mask *= abs(magdiff.values) <= 0.1
>     return pd.Series(mask)
>
> spark = SparkSession.builder.getOrCreate()
>
> # Create dummy DF
> colnames = ["nbad", "rb", "magdiff"]
> df = spark.sparkContext.parallelize(
>     zip(
>         [0, 1, 0, 0],
>         [0.01, 0.02, 0.6, 0.01],
>         [0.02, 0.05, 0.1, 0.01]
>     )
> ).toDF(colnames)
> df.show()
>
> # Apply cuts
> df = df\
>     .withColumn("toKeep", qualitycuts(*colnames))\
>     .filter("toKeep == true")\
>     .drop("toKeep")
>
> # This will fail if latest pyarrow 0.15.0 is used
> df.show()
> {code}
> and the log is:
> {code}
> Driver stacktrace:
> 19/10/07 09:37:49 INFO DAGScheduler: Job 3 failed: showString at 
> NativeMethodAccessorImpl.java:0, took 0.660523 s
> Traceback (most recent call last):
>   File 
> "/Users/julien/Documents/workspace/myrepos/fink-broker/test_pyarrow.py", line 
> 44, in 
> df.show()
>   File 
> "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
>  line 378, in show
>   File 
> "/Users/julien/Documents/workspace/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/Users/julien/Documents/workspace/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o64.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
> (TID 5, localhost, executor driver): java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at 
> org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.(ArrowEvalPythonExec.scala:98)
>   at 
> org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:96)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:89)
>   at 

[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.

2019-10-10 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949050#comment-16949050
 ] 

angerszhu commented on SPARK-29354:
---

[~Elixir Kook] 

I downloaded spark-2.4.4-bin-hadoop2.7 from your link and I can find 
jline-2.14.6.jar in the jars/ folder.
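
A quick way to check any given distribution (a small sketch; it assumes 
SPARK_HOME points at the unpacked binary distribution):

{code:python}
import glob
import os

# List any jline jars shipped in the distribution's jars/ directory.
spark_home = os.environ.get("SPARK_HOME", ".")
print(glob.glob(os.path.join(spark_home, "jars", "jline-*.jar")))
{code}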

 

> Spark has direct dependency on jline, but  binaries for 'without hadoop' 
> don't have a jline jar file.
> -
>
> Key: SPARK-29354
> URL: https://issues.apache.org/jira/browse/SPARK-29354
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.4, 2.4.4
> Environment: From spark 2.3.x, spark 2.4.x
>Reporter: Sungpeo Kook
>Priority: Minor
>
> Spark has direct dependency on jline, included in the root pom.xml
> but binaries for 'without hadoop' don't have a jline jar file.
>  
> spark 2.2.x has the jline jar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29284) df.distinct.count throws NoSuchElementException when adaptive execution is enabled

2019-10-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29284.
--
Resolution: Cannot Reproduce

> df.distinct.count throws NoSuchElementException when adaptive execution is enabled 
> -
>
> Key: SPARK-29284
> URL: https://issues.apache.org/jira/browse/SPARK-29284
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: yiming.xu
>Priority: Minor
>
> This case:
>  spark.sql("SET spark.sql.adaptive.enabled=true")
>  spark.sql("SET spark.sql.shuffle.partitions=1")
>  val result = spark.range(0).distinct().count
> or spark.table("empty table").distinct.count
> throws java.util.NoSuchElementException: next on empty iterator.
> If the current stage has 0 partitions, 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator#doEstimationIfNecessary -> partitionStartIndices
> is empty (https://issues.apache.org/jira/browse/SPARK-22144) and the next 
> stage's RDD has no partitions.
> To solve the problem I will change partitionStartIndices to Array(0) when the 
> parent RDD has 0 partitions.
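
A minimal PySpark sketch of the reported scenario (a hedged reproduction under the settings quoted above; not verified against 2.4.4):

{code:python}
from pyspark.sql import SparkSession

# Local session; the two SET statements mirror the report.
spark = (SparkSession.builder
         .master("local[2]")
         .appName("spark-29284-sketch")
         .getOrCreate())

spark.sql("SET spark.sql.adaptive.enabled=true")
spark.sql("SET spark.sql.shuffle.partitions=1")

# distinct() forces a shuffle; with an empty input the exchange coordinator
# reportedly computes empty partitionStartIndices, per the description above.
print(spark.range(0).distinct().count())
{code}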



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28636) Thriftserver can not support decimal type with negative scale

2019-10-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28636:

Issue Type: Bug  (was: Improvement)

> Thriftserver can not support decimal type with negative scale
> -
>
> Key: SPARK-28636
> URL: https://issues.apache.org/jira/browse/SPARK-28636
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> 0: jdbc:hive2://localhost:1> select 2.35E10 * 1.0;
> Error: java.lang.IllegalArgumentException: Error: name expected at the 
> position 10 of 'decimal(6,-7)' but '-' is found. (state=,code=0)
> {code}
> {code:sql}
> spark-sql> select 2.35E10 * 1.0;
> 235
> {code}
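
For context, a hedged sketch of where 'decimal(6,-7)' comes from, assuming the exponent literal is parsed as a decimal (as in the affected 3.0.0 snapshot):

{code:python}
# Spark's multiply rule for decimal types (before capping at the maximum
# precision) is: precision = p1 + p2 + 1, scale = s1 + s2.
p1, s1 = 3, -8   # 2.35E10 == 235 * 10^8  -> decimal(3,-8)
p2, s2 = 2, 1    # 1.0                    -> decimal(2,1)
print("decimal(%d,%d)" % (p1 + p2 + 1, s1 + s2))   # decimal(6,-7)
{code}

Hive's type parser then rejects the negative scale, which is the IllegalArgumentException in the Thrift server log below.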
> ThriftServer log:
> {noformat}
> java.lang.RuntimeException: java.lang.IllegalArgumentException: Error: name 
> expected at the position 10 of 'decimal(6,-7)' but '-' is found.
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>   at 
> java.security.AccessController.doPrivileged(AccessController.java:770)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>   at com.sun.proxy.$Proxy31.getResultSetMetadata(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.getResultSetMetadata(CLIService.java:502)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.GetResultSetMetadata(ThriftCLIService.java:609)
>   at 
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetResultSetMetadata.getResult(TCLIService.java:1697)
>   at 
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetResultSetMetadata.getResult(TCLIService.java:1682)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:819)
> Caused by: java.lang.IllegalArgumentException: Error: name expected at the 
> position 10 of 'decimal(6,-7)' but '-' is found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:378)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseParams(TypeInfoUtils.java:403)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parsePrimitiveParts(TypeInfoUtils.java:542)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.parsePrimitiveParts(TypeInfoUtils.java:557)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory.createPrimitiveTypeInfo(TypeInfoFactory.java:136)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory.getPrimitiveTypeInfo(TypeInfoFactory.java:109)
>   at 
> org.apache.hive.service.cli.TypeDescriptor.<init>(TypeDescriptor.java:58)
>   at org.apache.hive.service.cli.TableSchema.<init>(TableSchema.java:54)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$.getTableSchema(SparkExecuteStatementOperation.scala:313)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.resultSchema$lzycompute(SparkExecuteStatementOperation.scala:69)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.resultSchema(SparkExecuteStatementOperation.scala:64)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getResultSetSchema(SparkExecuteStatementOperation.scala:157)
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationResultSetSchema(OperationManager.java:233)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.getResultSetMetadata(HiveSessionImpl.java:787)
>   at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at jav

[jira] [Updated] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe

2019-10-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29432:
-
Description: 
When I add a new column to a dataframe with {{withColumn}} function, by 
default, the column is added with {{nullable=false}}.

But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this 
the expected behavior? why?

 
{code}
>>> l = [('Alice', 1)]
>>> df = spark.createDataFrame(l)
>>> df.printSchema()
root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)
{code}

{code}
>>> from pyspark.sql.functions import lit
>>> df = df.withColumn('newCol', lit('newVal'))
>>> df.printSchema()
root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)
 |-- newCol: string (nullable = false)
>>> spark.sql("select * from default.withcolTest").printSchema()
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)
 |-- newCol: string (nullable = true)
{code}


  was:
When I add a new column to a dataframe with {{withColumn}} function, by 
default, the column is added with {{nullable=false}}.

But, when I save the dataframe, the flag changes to {{nullable=true}}. Is this 
the expected behavior? why?

 

{{>>> l = [('Alice', 1)]}}
 {{>>> df = spark.createDataFrame(l)}}
 {{>>> df.printSchema()}}
 {{root}}
 \{{ |-- _1: string (nullable = true)}}
 \{{ |-- _2: long (nullable = true)}}

{{>>> from pyspark.sql.functions import lit}}
 {{>>> df = df.withColumn('newCol', lit('newVal'))}}
 {{>>> df.printSchema()}}
 {{root}}
 \{{ |-- _1: string (nullable = true)}}
 \{{ |-- _2: long (nullable = true)}}
 \{{ |-- newCol: string (nullable = false)}}{{>>> spark.sql("select * from 
default.withcolTest").printSchema()}}
 {{root}}
 \{{ |-- _1: string (nullable = true)}}
 \{{ |-- _2: long (nullable = true)}}
 \{{ |-- newCol: string (nullable = true)}}


> nullable flag of new column changes when persisting a pyspark dataframe
> ---
>
> Key: SPARK-29432
> URL: https://issues.apache.org/jira/browse/SPARK-29432
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution)
> Python 3.7.3
>Reporter: Prasanna Saraswathi Krishnan
>Priority: Minor
>
> When I add a new column to a dataframe with {{withColumn}} function, by 
> default, the column is added with {{nullable=false}}.
> But, when I save the dataframe, the flag changes to {{nullable=true}}. Is 
> this the expected behavior? why?
>  
> {code}
> >>> l = [('Alice', 1)]
> >>> df = spark.createDataFrame(l)
> >>> df.printSchema()
> root
>  |-- _1: string (nullable = true)
>  |-- _2: long (nullable = true)
> {code}
> {code}
> >>> from pyspark.sql.functions import lit
> >>> df = df.withColumn('newCol', lit('newVal'))
> >>> df.printSchema()
> root
>  |-- _1: string (nullable = true)
>  |-- _2: long (nullable = true)
>  |-- newCol: string (nullable = false)
> >>> spark.sql("select * from default.withcolTest").printSchema()
>  |-- _1: string (nullable = true)
>  |-- _2: long (nullable = true)
>  |-- newCol: string (nullable = true)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe

2019-10-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29432.
--
Resolution: Cannot Reproduce

Can't find the {{withcolTest}} table. Also, please ask questions on the mailing 
list; you could get a better answer there.

> nullable flag of new column changes when persisting a pyspark dataframe
> ---
>
> Key: SPARK-29432
> URL: https://issues.apache.org/jira/browse/SPARK-29432
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution)
> Python 3.7.3
>Reporter: Prasanna Saraswathi Krishnan
>Priority: Minor
>
> When I add a new column to a dataframe with {{withColumn}} function, by 
> default, the column is added with {{nullable=false}}.
> But, when I save the dataframe, the flag changes to {{nullable=true}}. Is 
> this the expected behavior? why?
>  
> {code}
> >>> l = [('Alice', 1)]
> >>> df = spark.createDataFrame(l)
> >>> df.printSchema()
> root
>  |-- _1: string (nullable = true)
>  |-- _2: long (nullable = true)
> {code}
> {code}
> >>> from pyspark.sql.functions import lit
> >>> df = df.withColumn('newCol', lit('newVal'))
> >>> df.printSchema()
> root
>  |-- _1: string (nullable = true)
>  |-- _2: long (nullable = true)
>  |-- newCol: string (nullable = false)
> >>> spark.sql("select * from default.withcolTest").printSchema()
>  |-- _1: string (nullable = true)
>  |-- _2: long (nullable = true)
>  |-- newCol: string (nullable = true)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus

2019-10-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29423:
-
Component/s: (was: SQL)
 Structured Streaming

> leak on  org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
> ---
>
> Key: SPARK-29423
> URL: https://issues.apache.org/jira/browse/SPARK-29423
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: pin_zhang
>Priority: Major
>
> 1.  start server with start-thriftserver.sh
>  2.  JDBC client connect and disconnect to hiveserver2
>  for (int i = 0; i < 1; i++) {
>Connection conn = 
> DriverManager.getConnection("jdbc:hive2://localhost:1", "test", "");
>conn.close();
>  }
> 3.  instance of  
> org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus keep 
> increasing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus

2019-10-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949060#comment-16949060
 ] 

Hyukjin Kwon commented on SPARK-29423:
--

2.3.x is an EOL release. Can you try a higher version?

> leak on  org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
> ---
>
> Key: SPARK-29423
> URL: https://issues.apache.org/jira/browse/SPARK-29423
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: pin_zhang
>Priority: Major
>
> 1.  start server with start-thriftserver.sh
>  2.  JDBC client connect and disconnect to hiveserver2
>  for (int i = 0; i < 1; i++) {
>Connection conn = 
> DriverManager.getConnection("jdbc:hive2://localhost:1", "test", "");
>conn.close();
>  }
> 3.  instance of  
> org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus keep 
> increasing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29432) nullable flag of new column changes when persisting a pyspark dataframe

2019-10-10 Thread Prasanna Saraswathi Krishnan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949068#comment-16949068
 ] 

Prasanna Saraswathi Krishnan commented on SPARK-29432:
--

My bad. When I formatted the code, I deleted the saveAsTable statement by 
mistake.

If you save the dataframe like this:

df.write.saveAsTable('default.withcolTest', mode='overwrite')

and then read the schema back, you should be able to reproduce the error.

I am not sure how this was resolved if the issue could not be reproduced.
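
A fuller hedged sketch of the reported sequence (assumes a local session with a usable warehouse; the nullable flip is as described in the report, not independently verified):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (SparkSession.builder
         .master("local[2]")
         .appName("spark-29432-sketch")
         .getOrCreate())

# The column added via lit() is non-nullable in the in-memory schema.
df = spark.createDataFrame([('Alice', 1)]).withColumn('newCol', lit('newVal'))
df.printSchema()                                   # newCol: nullable = false

# Persist and read back; per the report, the flag flips to nullable = true.
df.write.saveAsTable('default.withcolTest', mode='overwrite')
spark.table('default.withcolTest').printSchema()   # newCol: nullable = true
{code}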

> nullable flag of new column changes when persisting a pyspark dataframe
> ---
>
> Key: SPARK-29432
> URL: https://issues.apache.org/jira/browse/SPARK-29432
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0-cdh6.1.1 (Cloudera distribution)
> Python 3.7.3
>Reporter: Prasanna Saraswathi Krishnan
>Priority: Minor
>
> When I add a new column to a dataframe with {{withColumn}} function, by 
> default, the column is added with {{nullable=false}}.
> But, when I save the dataframe, the flag changes to {{nullable=true}}. Is 
> this the expected behavior? why?
>  
> {code}
> >>> l = [('Alice', 1)]
> >>> df = spark.createDataFrame(l)
> >>> df.printSchema()
> root
>  |-- _1: string (nullable = true)
>  |-- _2: long (nullable = true)
> {code}
> {code}
> >>> from pyspark.sql.functions import lit
> >>> df = df.withColumn('newCol', lit('newVal'))
> >>> df.printSchema()
> root
>  |-- _1: string (nullable = true)
>  |-- _2: long (nullable = true)
>  |-- newCol: string (nullable = false)
> >>> spark.sql("select * from default.withcolTest").printSchema()
>  |-- _1: string (nullable = true)
>  |-- _2: long (nullable = true)
>  |-- newCol: string (nullable = true)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running

2019-10-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24266:
-
Affects Version/s: 3.0.0

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 3.0.0
>Reporter: Chun Chen
>Priority: Major
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>  

[jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running

2019-10-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24266:
-
Labels:   (was: bulk-closed)

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Chun Chen
>Priority: Major
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
> 

[jira] [Reopened] (SPARK-24266) Spark client terminates while driver is still running

2019-10-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-24266:
--

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Chun Chen
>Priority: Major
>  Labels: bulk-closed
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
> 

[jira] [Resolved] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server

2019-10-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29337.
--
Resolution: Invalid

Please see [https://spark.apache.org/community.html]

> How to Cache Table and Pin it in Memory and should not Spill to Disk on 
> Thrift Server 
> --
>
> Key: SPARK-29337
> URL: https://issues.apache.org/jira/browse/SPARK-29337
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Srini E
>Priority: Major
>  Labels: Question, stack-overflow
> Attachments: Cache+Image.png
>
>
> Hi Team,
> How do we pin a table in the cache so that it does not get swapped out of 
> memory?
> Situation: We are using MicroStrategy BI reporting and the semantic layer is 
> built. We wanted to cache highly used tables with the Spark SQL CACHE TABLE 
> command; we cached them in the Spark context (Thrift server). Please see the 
> attached snapshot: the cached table went to disk over time. Initially it was 
> all in memory; now some of it is in memory and some is on disk. That disk may 
> be local disk, which is relatively more expensive to read from than s3. 
> Queries may take longer and show inconsistent times from a user-experience 
> perspective. When more queries run against the cached tables, copies of the 
> cached table are made and those copies do not stay in memory, causing reports 
> to run longer. So how do we pin the table so it does not spill to disk? Spark 
> memory management uses dynamic allocation; how do we keep those few tables 
> pinned in memory?
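
A hedged sketch of one way to request a memory-only cache from PySpark (the table name is an assumption; note that MEMORY_ONLY evicts partitions under memory pressure and recomputes them on access, rather than spilling them to disk):

{code:python}
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("memory-only-cache-sketch").getOrCreate()

# Ask for a memory-only storage level instead of the default MEMORY_AND_DISK.
df = spark.table("some_db.some_hot_table")   # assumed table name
df.persist(StorageLevel.MEMORY_ONLY)
df.count()                                   # materialize the cache
{code}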



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql

2019-10-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29335.
--
Resolution: Invalid

Please see [https://spark.apache.org/community.html]

> Cost Based Optimizer stats are not used while evaluating query plans in Spark 
> Sql
> -
>
> Key: SPARK-29335
> URL: https://issues.apache.org/jira/browse/SPARK-29335
> Project: Spark
>  Issue Type: Question
>  Components: Optimizer
>Affects Versions: 2.3.0
> Environment: We tried to execute the same using spark-sql and the Thrift 
> server (via SQLWorkbench), but we are not able to use the stats.
>Reporter: Srini E
>Priority: Major
>  Labels: Question, stack-overflow
> Attachments: explain_plan_cbo_spark.txt
>
>
> We are trying to leverage CBO to get better plans for a few critical queries 
> run through spark-sql or through the Thrift server using the JDBC driver. 
> The following settings were added to spark-defaults.conf:
> {code}
> spark.sql.cbo.enabled true
> spark.experimental.extrastrategies intervaljoin
> spark.sql.cbo.joinreorder.enabled true
> {code}
>  
> The tables that we are using are not partitioned.
> {code}
> spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ;
> analyze table arrow.t_fperiods_sundar compute statistics for columns eid, 
> year, ptype, absref, fpid , pid ;
> analyze table arrow.t_fdata_sundar compute statistics ;
> analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, 
> absref;
> {code}
> Analyze completed successfully.
> Describe extended does not show column-level stats, and queries are not 
> leveraging table- or column-level stats.
> We are using Oracle as our Hive catalog store, not Glue.
> *When we run queries with Spark SQL we do not see the stats being used in the 
> explain plan, and we are not sure whether CBO is applied* (a sketch for 
> checking column-level stats follows after the quoted plan below).
> *A quick response would be helpful.*
> *Explain Plan:*
> The following explain output does not reference any statistics usage.
>  
> {code}
> spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref 
> from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = 
> a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 
> and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;*
>  
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
> isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = 
> 2017),(ptype#4546 = A),(eid#4542 = 
> 29940),isnull(PID#4527),isnotnull(fpid#4523)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... 
> 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
> IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
> isnotnull(absref#4569),(absref#4569 = 
> Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: 
> string ... 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
> IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940)
> == Parsed Logical Plan ==
> 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref]
> +- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && 
> (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && 
> ('a12.eid = 29940)) && isnull('a12.PID)))
>  +- 'Join Inner
>  :- 'SubqueryAlias a12
>  : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar`
>  +- 'SubqueryAlias a13
>  +- 'UnresolvedRelation `arrow`.`t_fdata_sundar`
>  
> == Analyzed Logical Plan ==
> imnem: string, fvalue: string, ptype: string, absref: string
> Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = 
> cast(fpid#4567 as decimal(38,0 && ((absref#4569 = Y2017) && (year#4545 = 
> 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = 
> cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527)))
>  +- Join Inner
>  :- SubqueryAlias a12
>  : +- SubqueryAlias t_fperiods_sundar
>  : +- 
> Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,
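
A hedged sketch for checking whether column-level statistics were actually recorded (assumes an existing SparkSession named spark; the table and column names are taken from the report and are assumptions here):

{code:python}
# After ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS, per-column stats
# (min, max, num_nulls, distinct_count) should appear via DESCRIBE EXTENDED.
spark.sql("SET spark.sql.cbo.enabled=true")
spark.sql("ANALYZE TABLE arrow.t_fperiods_sundar "
          "COMPUTE STATISTICS FOR COLUMNS eid, year, ptype")
spark.sql("DESCRIBE EXTENDED arrow.t_fperiods_sundar eid").show(truncate=False)
{code}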

[jira] [Commented] (SPARK-29222) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence

2019-10-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949086#comment-16949086
 ] 

Hyukjin Kwon commented on SPARK-29222:
--

Shall we increase the time a bit more, if it has been verified that this helps 
with the flakiness?

> Flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence
> ---
>
> Key: SPARK-29222
> URL: https://issues.apache.org/jira/browse/SPARK-29222
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111237/testReport/]
> {code:java}
> Error Message
> 7 != 10
> StacktraceTraceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 429, in test_parameter_convergence
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 425, in condition
> self.assertEqual(len(model_weights), len(batches))
> AssertionError: 7 != 10
>{code}
>  
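
For reference, a generic sketch of the poll-with-timeout pattern such a test helper follows (this is not the actual PySpark test code); "increasing the time" means raising the timeout argument:

{code:python}
import time

def eventually(condition, timeout=30.0, interval=0.5):
    """Poll `condition` until it returns truthy or `timeout` seconds pass."""
    deadline = time.time() + timeout
    last_assertion = None
    while time.time() < deadline:
        try:
            if condition():
                return
        except AssertionError as e:   # tolerate assertion-style conditions
            last_assertion = e
        time.sleep(interval)
    # Surface the most recent assertion, mirroring catch_assertions=True.
    if last_assertion is not None:
        raise last_assertion
    raise AssertionError("condition not met within %s seconds" % timeout)
{code}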



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


