[jira] [Assigned] (SPARK-28013) Upgrade to Kafka 2.2.1
[ https://issues.apache.org/jira/browse/SPARK-28013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28013: Assignee: (was: Apache Spark)
> Upgrade to Kafka 2.2.1
> --
>
> Key: SPARK-28013
> URL: https://issues.apache.org/jira/browse/SPARK-28013
> Project: Spark
> Issue Type: Improvement
> Components: Build, Structured Streaming
> Affects Versions: 3.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> This issue updates the Kafka dependency to 2.2.1 to bring in the following improvements and bug fixes.
> https://issues.apache.org/jira/projects/KAFKA/versions/12345010
[jira] [Assigned] (SPARK-28013) Upgrade to Kafka 2.2.1
[ https://issues.apache.org/jira/browse/SPARK-28013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28013: Assignee: Apache Spark
> Upgrade to Kafka 2.2.1
> --
>
> Key: SPARK-28013
> URL: https://issues.apache.org/jira/browse/SPARK-28013
> Project: Spark
> Issue Type: Improvement
> Components: Build, Structured Streaming
> Affects Versions: 3.0.0
> Reporter: Dongjoon Hyun
> Assignee: Apache Spark
> Priority: Major
>
> This issue updates the Kafka dependency to 2.2.1 to bring in the following improvements and bug fixes.
> https://issues.apache.org/jira/projects/KAFKA/versions/12345010
[jira] [Created] (SPARK-28013) Upgrade to Kafka 2.2.1
Dongjoon Hyun created SPARK-28013:
Summary: Upgrade to Kafka 2.2.1
Key: SPARK-28013
URL: https://issues.apache.org/jira/browse/SPARK-28013
Project: Spark
Issue Type: Improvement
Components: Build, Structured Streaming
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun

This issue updates the Kafka dependency to 2.2.1 to bring in the following improvements and bug fixes.
https://issues.apache.org/jira/projects/KAFKA/versions/12345010
[jira] [Commented] (SPARK-25520) The state of executors is KILLED on standalone
[ https://issues.apache.org/jira/browse/SPARK-25520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861760#comment-16861760 ] Dongjoon Hyun commented on SPARK-25520: --- In general, it looks like a `com.microsoft.sqlserver.jdbc.SQLServerDriver` issue, doesn't it? Could you try another database, like MySQL, with Apache Spark 2.4.3?
> The state of executors is KILLED on standalone
> ---
>
> Key: SPARK-25520
> URL: https://issues.apache.org/jira/browse/SPARK-25520
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.2.2, 2.3.1
> Environment: spark 2.3.1
> spark 2.2.2
> Java 1.8.0_131
> scala 2.11.8
> Reporter: AKSonic
> Priority: Major
> Attachments: Spark.zip
>
> I created a Spark standalone cluster (4 servers) using Spark 2.3.1. The job finishes and appears under Completed Drivers, and I can get the result from the driver log. But the status of all executors shows the KILLED state. The log shows the following:
> 2018-09-25 00:47:37 INFO CoarseGrainedExecutorBackend:54 - Driver commanded a shutdown
> 2018-09-25 00:47:37 ERROR CoarseGrainedExecutorBackend:43 - RECEIVED SIGNAL TERM
> I also tried Spark 2.2.2 and see the same issue in the UI: all executors end in KILLED status.
> Is this expected? What is the problem?
> Config
> *spark-env.sh:*
> export SPARK_PUBLIC_DNS=hostname1
> export SCALA_HOME=/opt/gpf/bigdata/scala-2.11.8
> export JAVA_HOME=/usr/java/jdk1.8.0_131
> export HADOOP_HOME=/opt/bigdata/hadoop-2.6.5
> export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
> export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=file:///spark/spark-event-dir -Dspark.history.ui.port=16066 -Dspark.history.retainedApplications=30 -Dspark.history.fs.cleaner.enabled=true -Dspark.history.fs.cleaner.interval=1d -Dspark.history.fs.cleaner.maxAge=7d"
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hostname1:2181,hostname2:2181,hostname3:2181 -Dspark.deploy.zookeeper.dir=/opt/bigdata/spark-2.3.1/zk-recovery-dir"
> SPARK_LOCAL_DIRS=/opt/bigdata/spark-2.3.1/local-dir
> SPARK_DRIVER_MEMORY=1G
> *spark-defaults.conf:*
> spark.eventLog.enabled true
> spark.eventLog.compress true
> spark.eventLog.dir file:///spark/spark-event-dir
> *slaves:*
> hostname1
> hostname2
> hostname3
> hostname4
> Testing
> *Testing program1 (Java):*
> public class JDBCApp {
>     private static final String DB_OLAP_UAT_URL = "jdbc:sqlserver://dbhost";
>     private static final String DB_DRIVER = "com.microsoft.sqlserver.jdbc.SQLServerDriver";
>     private static final String SQL_TEXT = "select top 10 * from table1";
>     private static final String DB_OLAP_UAT_USR = "";
>     private static final String DB_OLAP_UAT_PWD = "";
>     public static void main(String[] args) {
>         System.setProperty("spark.sql.warehouse.dir", "file:///bigdata/spark/spark-warehouse");
>         // Logger.getLogger("org.apache.spark").setLevel(Level.DEBUG);
>         SparkSession spark = SparkSession.builder().appName("JDBCApp").getOrCreate();
>         Dataset<Row> jdbcDF = spark.read().format("jdbc")
>             .option("driver", DB_DRIVER)
>             .option("url", DB_OLAP_UAT_URL)
>             .option("dbtable", "(" + SQL_TEXT + ") tmp")
>             .option("user", DB_OLAP_UAT_USR)
>             .option("password", DB_OLAP_UAT_PWD)
>             .load();
>         jdbcDF.show();
>     }
> }
> *Testing program2 (Java):*
> public class SimpleApp {
>     public static void main(String[] args) {
>         String filePath = args[0];
>         Logger logger = Logger.getLogger("org.apache.spark");
>         // logger.setLevel(Level.DEBUG);
>         SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
>         Dataset<String> logData = spark.read().textFile(filePath).cache();
>         long numAs = logData.filter((FilterFunction<String>) s -> s.contains("e")).count();
>         long numBs = logData.filter((FilterFunction<String>) s -> s.contains("r")).count();
>         logger.info("Lines with a: " + numAs + ", lines with b: " + numBs);
>         spark.stop();
>     }
> }
> You can run the above 2 testing programs and then see that the state of the executors is KILLED.
> -- CMD --
> ./spark-submit \
>   --driver-class-path /sharedata/mssql-jdbc-6.4.0.jre8.jar \
>   --jars /sharedata/mssql-jdbc-6.4.0.jre8.jar \
>   --class JDBCApp \
>   --master spark://hostname1:6066 \
>   --deploy-mode cluster \
>   --driver-memory 2G \
>   --executor-memory 2G \
>   --total-executor-cores 8 \
>   /sharedata/spark-demo-1.0-SNAPSHOT.jar
[jira] [Comment Edited] (SPARK-25246) When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
[ https://issues.apache.org/jira/browse/SPARK-25246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861755#comment-16861755 ] Dongjoon Hyun edited comment on SPARK-25246 at 6/12/19 5:10 AM:
Hi, [~shahid]. As [~devaraj.k] mentioned, you can use a different codec. The following is the `snappy` codec case, and I can see the entry in the Spark History UI with 2.3.3. The reported case is just the behavior of the LZ4 codec, although LZ4 is the default for Spark.
{code}
$ bin/spark-shell --conf spark.eventLog.compress=true --conf spark.io.compression.codec=snappy
2019-06-11 22:04:26 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1560315870593).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
{code}
was (Author: dongjoon):
Hi, [~shahid]. As [~devaraj.k] mentioned, you can use a different codec. The following is the `snappy` codec case, and I can see the entry in the Spark History UI with 2.3.3. This is the behavior of the LZ4 codec.
{code}
$ bin/spark-shell --conf spark.eventLog.compress=true --conf spark.io.compression.codec=snappy
2019-06-11 22:04:26 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1560315870593).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
{code}
> When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
>
> Key: SPARK-25246
> URL: https://issues.apache.org/jira/browse/SPARK-25246
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.1
> Reporter: shahid
> Priority: Major
>
> 1) bin/spark-shell --master yarn --conf "spark.eventLog.compress=true"
> 2) hdfs dfs -ls /spark-logs
> {code:java}
> -rwxrwx--- 1 root supergroup *0* 2018-08-27 03:26 /spark-logs/application_1535313809919_0005.lz4.inprogress
> {code}
[jira] [Resolved] (SPARK-25246) When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
[ https://issues.apache.org/jira/browse/SPARK-25246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25246. --- Resolution: Not A Problem
> When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
>
> Key: SPARK-25246
> URL: https://issues.apache.org/jira/browse/SPARK-25246
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.1
> Reporter: shahid
> Priority: Major
>
> 1) bin/spark-shell --master yarn --conf "spark.eventLog.compress=true"
> 2) hdfs dfs -ls /spark-logs
> {code:java}
> -rwxrwx--- 1 root supergroup *0* 2018-08-27 03:26 /spark-logs/application_1535313809919_0005.lz4.inprogress
> {code}
[jira] [Commented] (SPARK-25246) When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
[ https://issues.apache.org/jira/browse/SPARK-25246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861755#comment-16861755 ] Dongjoon Hyun commented on SPARK-25246:
---
Hi, [~shahid]. As [~devaraj.k] mentioned, you can use a different codec. The following is the `snappy` codec case, and I can see the entry in the Spark History UI with 2.3.3. This is the behavior of the LZ4 codec.
{code}
$ bin/spark-shell --conf spark.eventLog.compress=true --conf spark.io.compression.codec=snappy
2019-06-11 22:04:26 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1560315870593).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
{code}
> When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
>
> Key: SPARK-25246
> URL: https://issues.apache.org/jira/browse/SPARK-25246
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.1
> Reporter: shahid
> Priority: Major
>
> 1) bin/spark-shell --master yarn --conf "spark.eventLog.compress=true"
> 2) hdfs dfs -ls /spark-logs
> {code:java}
> -rwxrwx--- 1 root supergroup *0* 2018-08-27 03:26 /spark-logs/application_1535313809919_0005.lz4.inprogress
> {code}
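For reference, a sketch of the same codec switch applied programmatically rather than via --conf flags. This mirrors the reporter's event-log setup plus the workaround from the comment; the application name is made up:
{code:scala}
import org.apache.spark.sql.SparkSession

// Equivalent of: bin/spark-shell --conf spark.eventLog.compress=true \
//                                --conf spark.io.compression.codec=snappy
val spark = SparkSession.builder()
  .appName("SnappyEventLogCheck") // hypothetical name
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.compress", "true")
  .config("spark.io.compression.codec", "snappy")
  .getOrCreate()
{code}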
[jira] [Comment Edited] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861751#comment-16861751 ] hemshankar sahu edited comment on SPARK-27891 at 6/12/19 4:54 AM: --
Updating this to critical, as we use Spark Streaming, which is expected to run for more than 1 week.
was (Author: hemshankar_sahu): Updating this to critical, as we use Spark Streaming, which runs for more than 1 week.
> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
> Issue Type: Bug
> Components: Security
> Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
> Reporter: hemshankar sahu
> Priority: Critical
> Attachments: application_1559242207407_0001.log, spark_2.3.1_failure.log
>
> When a Spark job runs on a secured cluster for longer than the time mentioned in the dfs.namenode.delegation.token.renew-interval property of hdfs-site.xml, the job fails.
> The following command was used to submit the Spark job:
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py /tmp/ff1.txt
>
> Application logs attached.
[jira] [Updated] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemshankar sahu updated SPARK-27891: Priority: Critical (was: Major)
Updating this to critical, as we use Spark Streaming, which runs for more than 1 week.
> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
> Issue Type: Bug
> Components: Security
> Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
> Reporter: hemshankar sahu
> Priority: Critical
> Attachments: application_1559242207407_0001.log, spark_2.3.1_failure.log
>
> When a Spark job runs on a secured cluster for longer than the time mentioned in the dfs.namenode.delegation.token.renew-interval property of hdfs-site.xml, the job fails.
> The following command was used to submit the Spark job:
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py /tmp/ff1.txt
>
> Application logs attached.
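For context, the --principal/--keytab pair shown in the report is what lets Spark on YARN re-login and obtain fresh delegation tokens past the renew interval. A sketch of the equivalent configuration keys, using the Spark 2.x property names; the keytab path is hypothetical:
{code:scala}
import org.apache.spark.SparkConf

// Programmatic equivalent of --principal / --keytab on YARN (Spark 2.x naming).
val conf = new SparkConf()
  .set("spark.yarn.principal", "acekrbuser")
  .set("spark.yarn.keytab", "/home/acekrbuser/keytabs/acekrbuser.keytab") // hypothetical path
{code}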
[jira] [Assigned] (SPARK-28012) Hive UDF supports literal struct type
[ https://issues.apache.org/jira/browse/SPARK-28012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28012: Assignee: Apache Spark
> Hive UDF supports literal struct type
> -
>
> Key: SPARK-28012
> URL: https://issues.apache.org/jira/browse/SPARK-28012
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: dzcxzl
> Assignee: Apache Spark
> Priority: Trivial
>
> Currently, calling a Hive UDF with a literal struct-type parameter reports an error:
> No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't support the constant type [StructType(StructField(name,StringType,true), StructField(value,DecimalType(3,1),true))]
[jira] [Assigned] (SPARK-28012) Hive UDF supports literal struct type
[ https://issues.apache.org/jira/browse/SPARK-28012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28012: Assignee: (was: Apache Spark)
> Hive UDF supports literal struct type
> -
>
> Key: SPARK-28012
> URL: https://issues.apache.org/jira/browse/SPARK-28012
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: dzcxzl
> Priority: Trivial
>
> Currently, calling a Hive UDF with a literal struct-type parameter reports an error:
> No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't support the constant type [StructType(StructField(name,StringType,true), StructField(value,DecimalType(3,1),true))]
[jira] [Created] (SPARK-28012) Hive UDF supports literal struct type
dzcxzl created SPARK-28012:
--
Summary: Hive UDF supports literal struct type
Key: SPARK-28012
URL: https://issues.apache.org/jira/browse/SPARK-28012
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: dzcxzl

Currently, calling a Hive UDF with a literal struct-type parameter reports an error:
No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't support the constant type [StructType(StructField(name,StringType,true), StructField(value,DecimalType(3,1),true))]
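A sketch of how such a literal struct argument can arise, assuming Hive support is enabled and using 'xxxUDF' as a stand-in for the unnamed UDF class from the report (the registration class is hypothetical):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
// Hypothetical registration; the real UDF class is not named in the report.
spark.sql("CREATE TEMPORARY FUNCTION xxxUDF AS 'com.example.XxxUDF'")
// named_struct builds a literal struct<name:string, value:decimal(3,1)>,
// matching the constant type in the reported error message.
spark.sql(
  "SELECT xxxUDF(named_struct('name', 'a', 'value', CAST(1.5 AS DECIMAL(3,1))))"
).show()
{code}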
[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861749#comment-16861749 ] Hyukjin Kwon commented on SPARK-27463: -- BTW, I think we have cogroup on Dataset on the Scala side. How is this different from that?
> Support Dataframe Cogroup via Pandas UDFs
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Chris Martin
> Priority: Major
>
> Recent work on Pandas UDFs in Spark has allowed for improved interoperability between Pandas and Spark. This proposal aims to extend this by introducing a new Pandas UDF type which would allow for a cogroup operation to be applied to two PySpark DataFrames.
> Full details are in the google document linked below.
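For comparison, the existing Scala-side cogroup the comment refers to, as a minimal self-contained sketch (the record type and data are made up for illustration):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("CogroupSketch").getOrCreate()
import spark.implicits._

case class Rec(id: Int, v: Double) // hypothetical record type

val left  = Seq(Rec(1, 1.0), Rec(1, 2.0), Rec(2, 3.0)).toDS().groupByKey(_.id)
val right = Seq(Rec(1, 10.0), Rec(2, 20.0)).toDS().groupByKey(_.id)

// For each key, receive both groups' iterators and emit any number of rows.
val combined = left.cogroup(right) { (key, xs, ys) =>
  Iterator((key, xs.map(_.v).sum, ys.map(_.v).sum))
}
combined.show()
{code}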
[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861747#comment-16861747 ] Hyukjin Kwon commented on SPARK-27463: -- [~d80tb7] are you still working on this? If there are some API references in Pandas, I think we can just mimic them. If so, can you just open a PR? cc [~icexelloss] as well.
> Support Dataframe Cogroup via Pandas UDFs
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Chris Martin
> Priority: Major
>
> Recent work on Pandas UDFs in Spark has allowed for improved interoperability between Pandas and Spark. This proposal aims to extend this by introducing a new Pandas UDF type which would allow for a cogroup operation to be applied to two PySpark DataFrames.
> Full details are in the google document linked below.
[jira] [Created] (SPARK-28011) SQL parse error when there are too many aliases in the table
U Shaw created SPARK-28011:
--
Summary: SQL parse error when there are too many aliases in the table
Key: SPARK-28011
URL: https://issues.apache.org/jira/browse/SPARK-28011
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.1
Reporter: U Shaw

A SQL syntax error is reported when the following statement is executed:
..
FROM menu_item_categories_tmp t1
LEFT JOIN menu_item_categories_tmp t2 ON t1.icat_id = t2.icat_parent_icat_id AND t1.tenant_id = t2.tenant_id AND t2.icat_status != 'd'
LEFT JOIN menu_item_categories_tmp t3 ON t2.icat_id = t3.icat_parent_icat_id AND t2.tenant_id = t3.tenant_id AND t3.icat_status != 'd'
LEFT JOIN menu_item_categories_tmp t4 ON t3.icat_id = t4.icat_parent_icat_id AND t3.tenant_id = t4.tenant_id AND t4.icat_status != 'd'
LEFT JOIN menu_item_categories_tmp t5 ON t4.icat_id = t5.icat_parent_icat_id AND t4.tenant_id = t5.tenant_id AND t5.icat_status != 'd'
LEFT JOIN menu_item_categories_tmp t6 ON t5.icat_id = t6.icat_parent_icat_id AND t5.tenant_id = t6.tenant_id AND t6.icat_status != 'd'
WHERE t1.icat_parent_icat_id = '0' AND t1.icat_status != 'd'
)
SELECT DISTINCT tenant_id AS tenant_id, type AS type,
  CASE WHEN t2.num >= 1 THEN level0 ELSE NULL END AS level0,
  CASE WHEN t2.num >= 2 THEN level1 ELSE NULL END AS level1,
  CASE WHEN t2.num >= 3 THEN level2 ELSE NULL END AS level2,
  CASE ..
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861728#comment-16861728 ] Henry Yu commented on SPARK-27812: -- OK [~dongjoon]
> kubernetes client import non-daemon thread which block jvm exit.
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.3
> Reporter: Henry Yu
> Priority: Major
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit because of an OkHttp WebSocket non-daemon thread.
[jira] [Updated] (SPARK-27741) Transitivity on predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-27741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] U Shaw updated SPARK-27741: --- Fix Version/s: 2.4.0
> Transitivity on predicate pushdown
> ---
>
> Key: SPARK-27741
> URL: https://issues.apache.org/jira/browse/SPARK-27741
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: U Shaw
> Priority: Major
> Fix For: 2.4.0
>
> When using an inner join, WHERE conditions can be propagated into the join condition, but when using an outer join, even if the conditions are the same, the predicate is only pushed down to the left or right side.
> As follows:
> select * from t1 left join t2 on t1.id=t2.id where t1.id=1
> --> select * from t1 left join t2 on t1.id=t2.id and t2.id=1 where t1.id=1
> Can Catalyst support transitivity in predicate pushdown?
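One way to observe the current behavior is to inspect the optimized plan for the query from the description. A spark-shell sketch, with made-up table contents, assuming an existing SparkSession named `spark`:
{code:scala}
import spark.implicits._

Seq((1, "a"), (2, "b")).toDF("id", "v").createOrReplaceTempView("t1")
Seq((1, "x"), (3, "y")).toDF("id", "w").createOrReplaceTempView("t2")

// Check whether a `t2.id = 1` filter is inferred and pushed below the outer join.
spark.sql("SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t1.id = 1").explain(true)
{code}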
[jira] [Created] (SPARK-28010) Support ORDER BY ... USING syntax
Lantao Jin created SPARK-28010:
--
Summary: Support ORDER BY ... USING syntax
Key: SPARK-28010
URL: https://issues.apache.org/jira/browse/SPARK-28010
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Lantao Jin

Currently, Spark SQL's ORDER BY is:
{code:sql}
ORDER BY expression [ ASC | DESC ]

SELECT * FROM tab ORDER BY col ASC
{code}
Can Spark SQL support syntax like the following, as in PostgreSQL's ORDER BY:
{code:sql}
ORDER BY expression [ ASC | DESC | USING operator ]

SELECT * FROM tab ORDER BY col USING <
{code}
[jira] [Resolved] (SPARK-27701) Extend NestedColumnAliasing to more nested field cases
[ https://issues.apache.org/jira/browse/SPARK-27701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27701. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 3.0.0
This is resolved via https://github.com/apache/spark/pull/24599
> Extend NestedColumnAliasing to more nested field cases
> --
>
> Key: SPARK-27701
> URL: https://issues.apache.org/jira/browse/SPARK-27701
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Liang-Chi Hsieh
> Assignee: Liang-Chi Hsieh
> Priority: Major
> Fix For: 3.0.0
>
> {{NestedColumnAliasing}} rule covers {{GetStructField}} only, currently. It means that some nested field extraction expressions aren't pruned. For example, if only accessing a nested field in an array of struct, this column isn't pruned. This patch extends the rule to cover such cases.
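To illustrate the kind of query the extended rule covers, a small spark-shell sketch (the schema is made up; with a columnar source such as Parquet, only `items.name` would need to be read rather than the whole struct):
{code:scala}
import spark.implicits._ // assumes an existing SparkSession named `spark`

case class Item(name: String, price: Double) // hypothetical nested struct

val df = Seq((1, Seq(Item("apple", 1.0), Item("pear", 2.0)))).toDF("id", "items")

// Extracting a single field from an array of structs: the case that
// NestedColumnAliasing previously did not prune.
df.select($"items.name").show()
{code}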
[jira] [Commented] (SPARK-27768) Infinity, -Infinity, NaN should be recognized in a case insensitive manner
[ https://issues.apache.org/jira/browse/SPARK-27768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861711#comment-16861711 ] Dongjoon Hyun commented on SPARK-27768: --- For this issue, I changed my mind and now support this approach, because I agree with [~smilegator]'s big picture. Sorry for the delay on this issue.
> Infinity, -Infinity, NaN should be recognized in a case insensitive manner
> --
>
> Key: SPARK-27768
> URL: https://issues.apache.org/jira/browse/SPARK-27768
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Xiao Li
> Priority: Major
>
> When the inputs contain the constant 'infinity', Spark SQL does not generate the expected results.
> {code:java}
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES ('1'), (CAST('infinity' AS DOUBLE))) v(x);
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES ('infinity'), ('1')) v(x);
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES ('infinity'), ('infinity')) v(x);
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES ('-infinity'), ('infinity')) v(x);
> {code}
> The root cause: Spark SQL does not recognize the special constants in a case insensitive way. In PostgreSQL, they are recognized in a case insensitive way.
> Link: https://www.postgresql.org/docs/9.3/datatype-numeric.html
[jira] [Commented] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
[ https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861707#comment-16861707 ] Junjie Chen commented on SPARK-27499: --- Yes. In KubernetesExecutorBuilder.scala, the LocalDirsFeatureStep is built before MountVolumesFeatureStep, which means we cannot use any volumes mounted later. I think we should build the local-dirs feature last, so that we can check whether the directories in SPARK_LOCAL_DIRS are set to mounted volumes (hostPath, PV, or others we may support later) and use those as local storage. That way, we can utilize the specified media to improve local storage performance instead of relying on emptyDir, which is just an ephemeral directory on the node.
> Support mapping spark.local.dir to hostPath volume
> --
>
> Key: SPARK-27499
> URL: https://issues.apache.org/jira/browse/SPARK-27499
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 3.0.0
> Reporter: Junjie Chen
> Priority: Minor
>
> Currently, the k8s executor builder mounts spark.local.dir as emptyDir or memory. This satisfies small workloads, but under heavy workloads like TPC-DS both can have problems, such as pods being evicted due to disk pressure when using emptyDir, and OOM when using tmpfs.
> In particular, in a cloud environment, users may allocate a cluster with a minimum configuration and add cloud storage when running a workload. In this case, we could specify multiple elastic storage volumes as spark.local.dir to accelerate spilling.
[jira] [Resolved] (SPARK-21136) Misleading error message for typo in SQL
[ https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang resolved SPARK-21136. -- Resolution: Fixed
> Misleading error message for typo in SQL
>
> Key: SPARK-21136
> URL: https://issues.apache.org/jira/browse/SPARK-21136
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Daniel Darabos
> Assignee: Yesheng Ma
> Priority: Minor
>
> {code}
> scala> spark.sql("select * from a left joinn b on a.id = b.id").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'from' expecting {<EOF>, 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)
>
> == SQL ==
> select * from a left joinn b on a.id = b.id
> ---------^^^
> {code}
> The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in themselves, a misleading error like this can hinder debugging substantially.
> I tried to see if maybe I could fix this. Am I correct to deduce that the error message originates in ANTLR4, which parses the query based on the syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out how that syntax definition works, and why it misattributes the error.
[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
[ https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861646#comment-16861646 ] Hyukjin Kwon commented on SPARK-28006: -- The proposal itself seems to make sense to me from a cursory look. One concern, though, is that I don't think Spark has such a type of window function. cc [~hvanhovell] as well. I suspect the output is the same as our grouped map Pandas UDF, if I understood correctly? It might be helpful to show the output so that non-Python guys could understand how it works as well :-).
> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 2.4.3
> Reporter: Li Jin
> Priority: Major
>
> Currently, pandas_udf supports the "grouped aggregate" type, which can be used with bounded and unbounded windows. There is another set of use cases that can benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as an N -> N mapping over a group. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".
>
> Currently, in order to do this, the user needs to use "grouped apply", for example:
>
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def zscore(pdf):
>     v = pdf['v']
>     pdf['v'] = (v - v.mean()) / v.std()
>     return pdf
>
> df.groupby('id').apply(zscore)
> {code}
> This approach has a few downsides:
>
> * Specifying the full return schema is complicated for the user although the function only changes one column.
> * The column name 'v' is hard-coded inside the udf, which makes the udf less reusable.
> * The entire dataframe is serialized to pass to Python although only one column is needed.
> Here we propose a new type of pandas_udf to work with these types of use cases:
> {code:java}
> @pandas_udf('double', GROUPED_XFORM)
> def zscore(v):
>     return (v - v.mean()) / v.std()
>
> w = Window.partitionBy('id')
> df = df.withColumn('v_zscore', zscore(df['v']).over(w))
> {code}
> Which addresses the above downsides:
> * The user only needs to specify the output type of a single column.
> * The column being zscored is decoupled from the udf implementation.
> * We only need to send one column to the Python worker and concat the result with the original dataframe (this is what grouped aggregate is doing already).
[jira] [Created] (SPARK-28009) PipedRDD: Block not locked for reading failure
Douglas Colkitt created SPARK-28009:
---
Summary: PipedRDD: Block not locked for reading failure
Key: SPARK-28009
URL: https://issues.apache.org/jira/browse/SPARK-28009
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.0
Environment: Running in a Docker container with Spark 2.4.0 on Linux kernel 4.9.0
Reporter: Douglas Colkitt

A PipedRDD operation fails with the stack trace below. The failure primarily occurs when the STDOUT from the Unix process is small and the STDIN into the Unix process is comparatively much larger. Given the similarity to SPARK-18406, this seems to be due to a race condition in accessing the block's reader lock. The PipedRDD implementation spawns the STDIN writer in a separate thread, which corroborates the race-condition hypothesis.

java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:170)
	at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
	at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:842)
	at org.apache.spark.storage.BlockManager.releaseLockAndDispose(BlockManager.scala:1610)
	at org.apache.spark.storage.BlockManager$$anonfun$2.apply$mcV$sp(BlockManager.scala:621)
	at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:33)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at org.apache.spark.rdd.PipedRDD$$anon$3.run(PipedRDD.scala:145)
	Suppressed: java.lang.AssertionError: assertion failed
		at scala.Predef$.assert(Predef.scala:156)
		at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
		at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
		at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:363)
		at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
		at scala.Option.foreach(Option.scala:257)
		at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:362)
		at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:358)
		at scala.collection.Iterator$class.foreach(Iterator.scala:891)
		at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
		at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:358)
		at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:858)
		at org.apache.spark.executor.Executor$TaskRunner$$anonfun$1.apply$mcV$sp(Executor.scala:409)
		at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1369)
		at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
		at java.lang.Thread.run(Thread.java:748)
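A minimal sketch of the shape of job the description points at: a large STDIN feeding a process whose STDOUT is tiny because it exits early. The command and sizes are illustrative, not from the report; `sc` is the spark-shell SparkContext:
{code:scala}
// `head -n 1` consumes one line and exits, so the stdin-writer thread
// spawned by PipedRDD may still be writing when the task completes.
val out = sc.parallelize(1 to 1000000, 4).pipe("head -n 1")
out.collect().foreach(println)
{code}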
[jira] [Created] (SPARK-28008) Default values & column comments in AVRO schema converters
Mathew Wicks created SPARK-28008:
Summary: Default values & column comments in AVRO schema converters
Key: SPARK-28008
URL: https://issues.apache.org/jira/browse/SPARK-28008
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.3
Reporter: Mathew Wicks

Currently in both `toAvroType` and `toSqlType` [SchemaConverters.scala#L134|https://github.com/apache/spark/blob/branch-2.4/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L134] there are two behaviours which are unexpected.

h2. Nullable fields in Spark are converted to UNION[TYPE, NULL] and no default value is set:

*Current Behaviour:*
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = true)
val avroSchema = SchemaConverters.toAvroType(schema)
println(avroSchema.toString(true))

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : [ "string", "null" ]
  } ]
}
{code}
*Expected Behaviour:* (NOTE: The reversal of "null" & "string" in the union, needed for a default value of null)
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = true)
val avroSchema = SchemaConverters.toAvroType(schema)
println(avroSchema.toString(true))

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
{code}

h2. Field comments/metadata is not propagated:

*Current Behaviour:*
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable=false, comment="AAA")
val avroSchema = SchemaConverters.toAvroType(schema)
println(avroSchema.toString(true))

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : "string"
  } ]
}
{code}
*Expected Behaviour:*
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable=false, comment="AAA")
val avroSchema = SchemaConverters.toAvroType(schema)
println(avroSchema.toString(true))

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : "string",
    "doc" : "AAA"
  } ]
}
{code}

The behaviour should be similar (but the reverse) for `toSqlType`.
I think we should aim to get this in before 3.0, as it will probably be a breaking change for some usage of the AVRO API.
[jira] [Updated] (SPARK-28007) Caret operator (^) means bitwise XOR in Spark/Hive and exponentiation in Postgres/Redshift
[ https://issues.apache.org/jira/browse/SPARK-28007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-28007: --- Summary: Caret operator (^) means bitwise XOR in Spark/Hive and exponentiation in Postgres/Redshift (was: Caret operator (^) means bitwise XOR in Spark and exponentiation in Postgres/Redshift)
> Caret operator (^) means bitwise XOR in Spark/Hive and exponentiation in Postgres/Redshift
> --
>
> Key: SPARK-28007
> URL: https://issues.apache.org/jira/browse/SPARK-28007
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Josh Rosen
> Priority: Major
>
> The expression {{expr1 ^ expr2}} has different meanings in Spark and Postgres:
> * [In Postgres|https://www.postgresql.org/docs/11/functions-math.html] and [Redshift|https://docs.aws.amazon.com/redshift/latest/dg/r_OPERATOR_SYMBOLS.html], this returns {{expr1}} raised to the exponent {{expr2}} (additionally, the Postgres docs explicitly state that this operation is left-associative).
> * [In Spark|https://spark.apache.org/docs/2.4.3/api/sql/index.html#_14] and [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ArithmeticOperators], this returns the bitwise exclusive OR of {{expr1}} and {{expr2}}.
> I'm reporting this under the Postgres compatibility umbrella. If we have SQL dialect support (e.g. a Postgres compatibility dialect), maybe this behavior could be flagged there?
[jira] [Created] (SPARK-28007) Caret operator (^) means bitwise XOR in Spark and exponentiation in Postgres/Redshift
Josh Rosen created SPARK-28007:
--
Summary: Caret operator (^) means bitwise XOR in Spark and exponentiation in Postgres/Redshift
Key: SPARK-28007
URL: https://issues.apache.org/jira/browse/SPARK-28007
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen

The expression {{expr1 ^ expr2}} has different meanings in Spark and Postgres:
* [In Postgres|https://www.postgresql.org/docs/11/functions-math.html] and [Redshift|https://docs.aws.amazon.com/redshift/latest/dg/r_OPERATOR_SYMBOLS.html], this returns {{expr1}} raised to the exponent {{expr2}} (additionally, the Postgres docs explicitly state that this operation is left-associative).
* [In Spark|https://spark.apache.org/docs/2.4.3/api/sql/index.html#_14] and [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ArithmeticOperators], this returns the bitwise exclusive OR of {{expr1}} and {{expr2}}.
I'm reporting this under the Postgres compatibility umbrella. If we have SQL dialect support (e.g. a Postgres compatibility dialect), maybe this behavior could be flagged there?
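A quick illustration of the divergence, runnable in spark-shell (the values are chosen for contrast):
{code:scala}
// In Spark/Hive, ^ is bitwise XOR: 3 ^ 5 == 6 (0b011 XOR 0b101 = 0b110).
spark.sql("SELECT 3 ^ 5").show()
// The same text in Postgres/Redshift evaluates as exponentiation: 3 ^ 5 = 243.
{code}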
[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
[ https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861559#comment-16861559 ] Li Jin commented on SPARK-28006: cc [~hyukjin.kwon] [~LI,Xiao] [~ueshin] [~bryanc] I think code-wise this is pretty simple, but since this is adding a new pandas udf type I'd like to get some feedback on this.
> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 2.4.3
> Reporter: Li Jin
> Priority: Major
>
> Currently, pandas_udf supports the "grouped aggregate" type, which can be used with bounded and unbounded windows. There is another set of use cases that can benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as an N -> N mapping over a group. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".
>
> Currently, in order to do this, the user needs to use "grouped apply", for example:
>
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def zscore(pdf):
>     v = pdf['v']
>     pdf['v'] = (v - v.mean()) / v.std()
>     return pdf
>
> df.groupby('id').apply(zscore)
> {code}
> This approach has a few downsides:
>
> * Specifying the full return schema is complicated for the user although the function only changes one column.
> * The column name 'v' is hard-coded inside the udf, which makes the udf less reusable.
> * The entire dataframe is serialized to pass to Python although only one column is needed.
> Here we propose a new type of pandas_udf to work with these types of use cases:
> {code:java}
> @pandas_udf('double', GROUPED_XFORM)
> def zscore(v):
>     return (v - v.mean()) / v.std()
>
> w = Window.partitionBy('id')
> df = df.withColumn('v_zscore', zscore(df['v']).over(w))
> {code}
> Which addresses the above downsides:
> * The user only needs to specify the output type of a single column.
> * The column being zscored is decoupled from the udf implementation.
> * We only need to send one column to the Python worker and concat the result with the original dataframe (this is what grouped aggregate is doing already).
[jira] [Updated] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
[ https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-28006: ---
Description:
Currently, pandas_udf supports the "grouped aggregate" type, which can be used with bounded and unbounded windows. There is another set of use cases that can benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as an N -> N mapping over a group. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".

Currently, in order to do this, the user needs to use "grouped apply", for example:

{code:java}
@pandas_udf(schema, GROUPED_MAP)
def zscore(pdf):
    v = pdf['v']
    pdf['v'] = (v - v.mean()) / v.std()
    return pdf

df.groupby('id').apply(zscore)
{code}

This approach has a few downsides:
* Specifying the full return schema is complicated for the user although the function only changes one column.
* The column name 'v' is hard-coded inside the udf, which makes the udf less reusable.
* The entire dataframe is serialized to pass to Python although only one column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:

{code:java}
@pandas_udf('double', GROUPED_XFORM)
def zscore(v):
    return (v - v.mean()) / v.std()

w = Window.partitionBy('id')
df = df.withColumn('v_zscore', zscore(df['v']).over(w))
{code}

Which addresses the above downsides:
* The user only needs to specify the output type of a single column.
* The column being zscored is decoupled from the udf implementation.
* We only need to send one column to the Python worker and concat the result with the original dataframe (this is what grouped aggregate is doing already).

was:
Currently, pandas_udf supports the "grouped aggregate" type, which can be used with bounded and unbounded windows. There is another set of use cases that can benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as an N -> N mapping over a group. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".

Currently, in order to do this, the user needs to use "grouped apply", for example:

{code:java}
@pandas_udf(schema, GROUPED_MAP)
def zscore(pdf):
    v = pdf['v']
    pdf['v'] = (v - v.mean()) / v.std()
    return pdf

df.groupby('id').apply(zscore)
{code}

This approach has a few downsides:
* Specifying the full return schema is complicated for the user although the function only changes one column.
* The column name 'v' is hard-coded inside the udf, which makes the udf less reusable.
* The entire dataframe is serialized to pass to Python although only one column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:

{code:java}
@pandas_udf('double', GROUPED_XFORM)
def zscore(v):
    return (v - v.mean()) / v.std()

w = Window.partitionBy('id')
df = df.withColumn('v_zscore', zscore(df['v']).over(w))
{code}

Which addresses the above downsides.

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 2.4.3
> Reporter: Li Jin
> Priority: Major
>
> Currently, pandas_udf supports the "grouped aggregate" type, which can be used with bounded and unbounded windows. There is another set of use cases that can benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as an N -> N mapping over a group. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".
>
> Currently, in order to do this, the user needs to use "grouped apply", for example:
>
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def zscore(pdf):
>     v = pdf['v']
>     pdf['v'] = (v - v.mean()) / v.std()
>     return pdf
>
> df.groupby('id').apply(zscore)
> {code}
> This approach has a few downsides:
>
> * Specifying the full return schema is complicated for the user although the function only changes one column.
> * The column name 'v' is hard-coded inside the udf, which makes the udf less reusable.
> * The entire dataframe is serialized to pass to Python although only one column is needed.
> Here we propose a new type of pandas_udf to work with these types of use cases:
> {code:java}
> @pandas_udf('double', GROUPED_XFORM)
> def zscore(v):
>     return (v - v.mean()) / v.std()
>
> w = Window.partitionBy('id')
> df = df.withColumn('v_zscore', zscore(df['v']).over(w))
> {code}
> Which addresses the above downsides:
> * The user only needs to specify the output type of a single column.
> * The column being zscored is decoupled from the udf implementation.
[jira] [Created] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
Li Jin created SPARK-28006:
--
Summary: User-defined grouped transform pandas_udf for window operations
Key: SPARK-28006
URL: https://issues.apache.org/jira/browse/SPARK-28006
Project: Spark
Issue Type: New Feature
Components: PySpark
Affects Versions: 2.4.3
Reporter: Li Jin

Currently, pandas_udf supports the "grouped aggregate" type, which can be used with bounded and unbounded windows. There is another set of use cases that can benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as an N -> N mapping over a group. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".

Currently, in order to do this, the user needs to use "grouped apply", for example:

{code:java}
@pandas_udf(schema, GROUPED_MAP)
def zscore(pdf):
    v = pdf['v']
    pdf['v'] = (v - v.mean()) / v.std()
    return pdf

df.groupby('id').apply(zscore)
{code}

This approach has a few downsides:
* Specifying the full return schema is complicated for the user although the function only changes one column.
* The column name 'v' is hard-coded inside the udf, which makes the udf less reusable.
* The entire dataframe is serialized to pass to Python although only one column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:

{code:java}
@pandas_udf('double', GROUPED_XFORM)
def zscore(v):
    return (v - v.mean()) / v.std()

w = Window.partitionBy('id')
df = df.withColumn('v_zscore', zscore(df['v']).over(w))
{code}

Which addresses the above downsides.
[jira] [Assigned] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
[ https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28003: Assignee: (was: Apache Spark)
> spark.createDataFrame with Arrow doesn't work with pandas.NaT
> --
>
> Key: SPARK-28003
> URL: https://issues.apache.org/jira/browse/SPARK-28003
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.3.3, 2.4.3
> Reporter: Li Jin
> Priority: Major
>
> {code:java}
> import pandas as pd
>
> dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
> pdf1 = pd.DataFrame({'time': dt1})
> df1 = self.spark.createDataFrame(pdf1)
> {code}
> The example above doesn't work with Arrow enabled.
[jira] [Assigned] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
[ https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28003: Assignee: Apache Spark
> spark.createDataFrame with Arrow doesn't work with pandas.NaT
> --
>
> Key: SPARK-28003
> URL: https://issues.apache.org/jira/browse/SPARK-28003
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.3.3, 2.4.3
> Reporter: Li Jin
> Assignee: Apache Spark
> Priority: Major
>
> {code:java}
> import pandas as pd
>
> dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
> pdf1 = pd.DataFrame({'time': dt1})
> df1 = self.spark.createDataFrame(pdf1)
> {code}
> The example above doesn't work with Arrow enabled.
[jira] [Commented] (SPARK-28005) SparkRackResolver should not log for resolving empty list
[ https://issues.apache.org/jira/browse/SPARK-28005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861410#comment-16861410 ] Imran Rashid commented on SPARK-28005: -- cc [~cltlfcjin]
> SparkRackResolver should not log for resolving empty list
> -
>
> Key: SPARK-28005
> URL: https://issues.apache.org/jira/browse/SPARK-28005
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 3.0.0
> Reporter: Imran Rashid
> Priority: Major
>
> After SPARK-13704, {{SparkRackResolver}} generates an INFO message every time it is called with 0 arguments:
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/SparkRackResolver.scala#L73-L76
> That actually happens every 1s when there are no active executors, because of the repeated offers that happen as part of delay scheduling:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L134-L139
> While this is relatively benign, it's a pretty annoying thing to be logging at INFO level every 1 second.
> This is easy to reproduce -- in spark-shell, with dynamic allocation, set the log level to info and see the logs appear every 1 second. Then run something and see the msgs stop. After the executors time out, see the msgs reappear.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>
> sc.setLogLevel("info")
> Thread.sleep(5000)
> sc.parallelize(1 to 10).count()
>
> // Exiting paste mode, now interpreting.
>
> 19/06/11 12:43:40 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:41 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:42 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:43 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:44 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:45 INFO spark.SparkContext: Starting job: count at <console>:28
> 19/06/11 12:43:45 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:28) with 2 output partitions
> 19/06/11 12:43:45 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (count at <console>:28)
> ...
> 19/06/11 12:43:54 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
> 19/06/11 12:43:54 INFO scheduler.DAGScheduler: ResultStage 0 (count at <console>:28) finished in 9.548 s
> 19/06/11 12:43:54 INFO scheduler.DAGScheduler: Job 0 finished: count at <console>:28, took 9.613049 s
> res2: Long = 10
>
> scala>
> ...
> 19/06/11 12:44:56 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:44:57 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:44:58 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:44:59 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> ...
> {noformat}
[jira] [Created] (SPARK-28005) SparkRackResolver should not log for resolving empty list
Imran Rashid created SPARK-28005: Summary: SparkRackResolver should not log for resolving empty list Key: SPARK-28005 URL: https://issues.apache.org/jira/browse/SPARK-28005 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 3.0.0 Reporter: Imran Rashid After SPARK-13704, {{SparkRackResolver}} generates an INFO message every time it is called with an empty list of hosts: https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/SparkRackResolver.scala#L73-L76 That actually happens every 1s when there are no active executors, because of the repeated offers that happen as part of delay scheduling: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L134-L139 While this is relatively benign, it's a pretty annoying thing to be logging at INFO level every 1 second. This is easy to reproduce -- in spark-shell, with dynamic allocation, set the log level to info and see the logs appear every 1 second. Then run something and see the msgs stop. After the executors time out, see the msgs reappear.
{noformat}
scala> :paste
// Entering paste mode (ctrl-D to finish)
sc.setLogLevel("info")
Thread.sleep(5000)
sc.parallelize(1 to 10).count()
// Exiting paste mode, now interpreting.
19/06/11 12:43:40 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:43:41 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:43:42 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:43:43 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:43:44 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:43:45 INFO spark.SparkContext: Starting job: count at <console>:28
19/06/11 12:43:45 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:28) with 2 output partitions
19/06/11 12:43:45 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (count at <console>:28)
...
19/06/11 12:43:54 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/06/11 12:43:54 INFO scheduler.DAGScheduler: ResultStage 0 (count at <console>:28) finished in 9.548 s
19/06/11 12:43:54 INFO scheduler.DAGScheduler: Job 0 finished: count at <console>:28, took 9.613049 s
res2: Long = 10

scala>
...
19/06/11 12:44:56 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:44:57 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:44:58 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:44:59 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
...
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
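Until a fix lands, a plausible interim workaround, assuming the stock log4j setup that Spark ships with in conf/log4j.properties.template, is to raise the level for just this logger:
{noformat}
# Silence only the rack resolver's repeated INFO line (log4j 1.x properties syntax)
log4j.logger.org.apache.spark.deploy.yarn.SparkRackResolver=WARN
{noformat}
This hides the noisy INFO message without suppressing warnings or errors from the resolver.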
[jira] [Assigned] (SPARK-28004) Update jquery to 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-28004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28004: Assignee: Sean Owen (was: Apache Spark) > Update jquery to 3.4.1 > -- > > Key: SPARK-28004 > URL: https://issues.apache.org/jira/browse/SPARK-28004 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > > We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 > to keep up in general, but also to keep up with CVEs. In fact, we know of at > least one resolved in only 3.4.0+ > (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, > but, if the update isn't painful, maybe worthwhile in order to make future > 3.x updates easier. > jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to > maintain compatibility with 1.9+ > (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) > 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard > to evaluate each one, but the most likely area for problems is in ajax(). > However, our usage of jQuery (and plugins) is pretty simple. > I've tried updating and testing the UI, and can't see any warnings, errors, > or problematic functionality. This includes the Spark UI, master UI, worker > UI, and docs (well, I wasn't able to build R docs) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28004) Update jquery to 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-28004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28004: Assignee: Apache Spark (was: Sean Owen) > Update jquery to 3.4.1 > -- > > Key: SPARK-28004 > URL: https://issues.apache.org/jira/browse/SPARK-28004 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Major > > We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 > to keep up in general, but also to keep up with CVEs. In fact, we know of at > least one resolved in only 3.4.0+ > (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, > but, if the update isn't painful, maybe worthwhile in order to make future > 3.x updates easier. > jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to > maintain compatibility with 1.9+ > (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) > 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard > to evaluate each one, but the most likely area for problems is in ajax(). > However, our usage of jQuery (and plugins) is pretty simple. > I've tried updating and testing the UI, and can't see any warnings, errors, > or problematic functionality. This includes the Spark UI, master UI, worker > UI, and docs (well, I wasn't able to build R docs) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28004) Update jquery to 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-28004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-28004: -- Description: We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 to keep up in general, but also to keep up with CVEs. In fact, we know of at least one resolved in only 3.4.0+ (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, but, if the update isn't painful, maybe worthwhile in order to make future 3.x updates easier. jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to maintain compatibility with 1.9+ (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard to evaluate each one, but the most likely area for problems is in ajax(). However, our usage of jQuery (and plugins) is pretty simple. I've tried updating and testing the UI, and can't see any warnings, errors, or problematic functionality. This includes the Spark UI, master UI, worker UI, and docs (well, I wasn't able to build R docs) was: We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 to keep up in general, but also to keep up with CVEs. In fact, we know of at least one resolved in only 3.4.0+ (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, but, if the update isn't painful, maybe worthwhile in order to make future 3.x updates easier. jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to maintain compatibility with 1.9+ (https://nvd.nist.gov/vuln/detail/CVE-2019-11358) 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard to evaluate each one, but the most likely area for problems is in ajax(). However, our usage of jQuery (and plugins) is pretty simple. I've tried updating and testing the UI, and can't see any warnings, errors, or problematic functionality. This includes the Spark UI, master UI, worker UI, and docs (well, I wasn't able to build R docs) > Update jquery to 3.4.1 > -- > > Key: SPARK-28004 > URL: https://issues.apache.org/jira/browse/SPARK-28004 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > > We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 > to keep up in general, but also to keep up with CVEs. In fact, we know of at > least one resolved in only 3.4.0+ > (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, > but, if the update isn't painful, maybe worthwhile in order to make future > 3.x updates easier. > jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to > maintain compatibility with 1.9+ > (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) > 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard > to evaluate each one, but the most likely area for problems is in ajax(). > However, our usage of jQuery (and plugins) is pretty simple. > I've tried updating and testing the UI, and can't see any warnings, errors, > or problematic functionality. This includes the Spark UI, master UI, worker > UI, and docs (well, I wasn't able to build R docs) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28004) Update jquery to 3.4.1
Sean Owen created SPARK-28004: - Summary: Update jquery to 3.4.1 Key: SPARK-28004 URL: https://issues.apache.org/jira/browse/SPARK-28004 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Sean Owen Assignee: Sean Owen We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 to keep up in general, but also to keep up with CVEs. In fact, we know of at least one resolved in only 3.4.0+ (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, but, if the update isn't painful, maybe worthwhile in order to make future 3.x updates easier. jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to maintain compatibility with 1.9+ (https://nvd.nist.gov/vuln/detail/CVE-2019-11358) 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard to evaluate each one, but the most likely area for problems is in ajax(). However, our usage of jQuery (and plugins) is pretty simple. I've tried updating and testing the UI, and can't see any warnings, errors, or problematic functionality. This includes the Spark UI, master UI, worker UI, and docs (well, I wasn't able to build R docs) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861353#comment-16861353 ] Xiangrui Meng commented on SPARK-27360: --- Done. Thanks for taking the task! > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design and implement standalone manager support for GPU-aware scheduling: > 1. static conf to describe resources > 2. spark-submit to request resources > 3. auto discovery of GPUs > 4. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
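To make the four items in that list concrete, here is a sketch of how the pieces could fit together on a standalone cluster. The conf names are the ones this line of work eventually introduced for Spark 3.0, and the script path is hypothetical; treat all of it as assumptions rather than the settled design at the time of this thread:
{noformat}
# Worker side (static conf + auto discovery), e.g. in spark-defaults.conf on each Worker:
spark.worker.resource.gpu.amount           4
spark.worker.resource.gpu.discoveryScript  /opt/spark/bin/get_gpus.py

# Application side, at spark-submit time (requesting resources per executor/task):
./bin/spark-submit \
  --master spark://master:7077 \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.task.resource.gpu.amount=1 \
  my_app.py
{noformat}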
[jira] [Deleted] (SPARK-27999) setup resources when Standalone Worker starts up
[ https://issues.apache.org/jira/browse/SPARK-27999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng deleted SPARK-27999: -- > setup resources when Standalone Worker starts up > > > Key: SPARK-27999 > URL: https://issues.apache.org/jira/browse/SPARK-27999 > Project: Spark > Issue Type: Sub-task >Reporter: wuyi >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27369) Standalone worker can load resource conf and discover resources
[ https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-27369: -- Assignee: wuyi > Standalone worker can load resource conf and discover resources > --- > > Key: SPARK-27369 > URL: https://issues.apache.org/jira/browse/SPARK-27369 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: wuyi >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
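For the discovery half of this sub-task, a minimal sketch of a discovery script, assuming the JSON contract that resource discovery ended up using (a single object with the resource name and its addresses); the device indices here are made up:
{code:java}
#!/usr/bin/env python
# Hypothetical GPU discovery script: the Worker runs it at startup and
# parses its stdout as JSON to learn which resource addresses it owns.
import json

print(json.dumps({"name": "gpu", "addresses": ["0", "1", "2", "3"]}))
{code}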
[jira] [Commented] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861261#comment-16861261 ] Huaxin Gao commented on SPARK-26185: resolved in another PR > add weightCol in python MulticlassClassificationEvaluator > - > > Key: SPARK-26185 > URL: https://issues.apache.org/jira/browse/SPARK-26185 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > https://issues.apache.org/jira/browse/SPARK-24101 added weightCol in > MulticlassClassificationEvaluator.scala. This Jira will add weightCol in > python version of MulticlassClassificationEvaluator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
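For reference, once the parameter exists on the Python side, usage would look like the sketch below; the DataFrame {{predictions}} and its column names are assumptions:
{code:java}
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# 'weight' is a per-row weight column in the scored DataFrame 'predictions'
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction",
    metricName="accuracy", weightCol="weight")
print(evaluator.evaluate(predictions))
{code}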
[jira] [Updated] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
[ https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-28003: --- Affects Version/s: (was: 2.3.2) 2.3.3 > spark.createDataFrame with Arrow doesn't work with pandas.NaT > -- > > Key: SPARK-28003 > URL: https://issues.apache.org/jira/browse/SPARK-28003 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.3, 2.4.3 >Reporter: Li Jin >Priority: Major > > {code:java} > import pandas as pd > dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100 > pdf1 = pd.DataFrame({'time': dt1}) > df1 = spark.createDataFrame(pdf1) > {code} > The example above doesn't work with Arrow enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
Li Jin created SPARK-28003: -- Summary: spark.createDataFrame with Arrow doesn't work with pandas.NaT Key: SPARK-28003 URL: https://issues.apache.org/jira/browse/SPARK-28003 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Reporter: Li Jin
{code:java}
import pandas as pd

dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
pdf1 = pd.DataFrame({'time': dt1})
df1 = spark.createDataFrame(pdf1)
{code}
The example above doesn't work with Arrow enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
[ https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-28003: --- Affects Version/s: (was: 2.4.0) 2.3.2 2.4.3 > spark.createDataFrame with Arrow doesn't work with pandas.NaT > -- > > Key: SPARK-28003 > URL: https://issues.apache.org/jira/browse/SPARK-28003 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.2, 2.4.3 >Reporter: Li Jin >Priority: Major > > {code:java} > import pandas as pd > dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100 > pdf1 = pd.DataFrame({'time': dt1}) > df1 = spark.createDataFrame(pdf1) > {code} > The example above doesn't work with Arrow enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
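Since the failure is specific to the Arrow-enabled path, a plausible interim workaround is to fall back to the non-Arrow conversion for frames that contain NaT (conf name as in Spark 2.3/2.4):
{code:java}
# Sketch: temporarily disable Arrow for this conversion only
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df1 = spark.createDataFrame(pdf1)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
{code}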
[jira] [Assigned] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27258: Assignee: Apache Spark > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Assignee: Apache Spark >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27258: Assignee: (was: Apache Spark) > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
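The DNS-1035 rule quoted in the error is mechanical, so the needed normalization is easy to sketch. The helper below is illustrative only (it is not Spark's actual fix): it coerces an app-name-derived prefix into a valid label before resource names are built from it:
{code:java}
import re

def to_dns1035_label(name, max_len=63):
    # Lower-case alphanumerics and '-', starting with a letter
    # and ending with an alphanumeric character.
    label = re.sub(r'[^a-z0-9-]', '-', name.lower())
    label = re.sub(r'^[^a-z]+', '', label)  # drop leading digits/dashes
    return label[:max_len].rstrip('-') or 'spark'

print(to_dns1035_label('1min-machinereg-yf'))  # -> 'min-machinereg-yf'
{code}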
[jira] [Assigned] (SPARK-28002) Support WITH clause column aliases
[ https://issues.apache.org/jira/browse/SPARK-28002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28002: Assignee: Apache Spark > Support WITH clause column aliases > -- > > Key: SPARK-28002 > URL: https://issues.apache.org/jira/browse/SPARK-28002 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Assignee: Apache Spark >Priority: Major > > PostgreSQL supports column aliasing in a CTE, so this is a valid query: > {noformat} > WITH t(x) AS (SELECT 1) > SELECT * FROM t WHERE x = 1{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28002) Support WITH clause column aliases
[ https://issues.apache.org/jira/browse/SPARK-28002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28002: Assignee: (was: Apache Spark) > Support WITH clause column aliases > -- > > Key: SPARK-28002 > URL: https://issues.apache.org/jira/browse/SPARK-28002 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Priority: Major > > PostgreSQL supports column aliasing in a CTE, so this is a valid query: > {noformat} > WITH t(x) AS (SELECT 1) > SELECT * FROM t WHERE x = 1{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28002) Support WITH clause column aliases
Peter Toth created SPARK-28002: -- Summary: Support WITH clause column aliases Key: SPARK-28002 URL: https://issues.apache.org/jira/browse/SPARK-28002 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Peter Toth PostgreSQL supports column aliasing in a CTE, so this is a valid query:
{noformat}
WITH t(x) AS (SELECT 1)
SELECT * FROM t WHERE x = 1
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
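Until that syntax lands, the same query can be written by aliasing inside the CTE body, which Spark already accepts:
{code:java}
# Equivalent formulation without WITH-clause column aliases
spark.sql("""
    WITH t AS (SELECT 1 AS x)
    SELECT * FROM t WHERE x = 1
""").show()
{code}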
[jira] [Updated] (SPARK-27917) Semantic equals of CaseWhen is failing with case sensitivity of column Names
[ https://issues.apache.org/jira/browse/SPARK-27917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27917: -- Fix Version/s: 2.4.4 > Semantic equals of CaseWhen is failing with case sensitivity of column Names > > > Key: SPARK-27917 > URL: https://issues.apache.org/jira/browse/SPARK-27917 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.2, 2.4.3 >Reporter: Akash R Nilugal >Assignee: Sandeep Katta >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > Semantic equals of CaseWhen is failing with case sensitivity of column Names -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28001) Dataframe throws 'socket.timeout: timed out' exception
Marius Stanescu created SPARK-28001: --- Summary: Dataframe throws 'socket.timeout: timed out' exception Key: SPARK-28001 URL: https://issues.apache.org/jira/browse/SPARK-28001 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.3 Environment: Processor: Intel Core i7-7700 CPU @ 3.60GHz RAM: 16 GB OS: Windows 10 Enterprise 64-bit Python: 3.7.2 PySpark: 2.4.3 Cluster manager: Spark Standalone Reporter: Marius Stanescu I load data from Azure Table Storage, create a DataFrame and perform a couple of operations via two user-defined functions, then call show() to display the results. If I load a very small batch of items, like 5, everything works fine, but if I load a batch greater than 10 items from Azure Table Storage, then I get the 'socket.timeout: timed out' exception. Here is the code:
{code}
import time
import json
import requests
from requests.auth import HTTPBasicAuth
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import BooleanType


def main():
    batch_size = 25
    azure_table_account_name = '***'
    azure_table_account_key = '***'
    azure_table_name = '***'

    spark = SparkSession \
        .builder \
        .appName(agent_name) \
        .config("spark.sql.crossJoin.enabled", "true") \
        .getOrCreate()

    table_service = TableService(account_name=azure_table_account_name, account_key=azure_table_account_key)
    continuation_token = None

    while True:
        messages = table_service.query_entities(
            azure_table_name,
            select="RowKey, PartitionKey, messageId, ownerSmtp, Timestamp",
            num_results=batch_size,
            marker=continuation_token,
            timeout=60)
        continuation_token = messages.next_marker

        messages_list = list(messages)
        if not len(messages_list):
            time.sleep(5)
            pass

        messages_df = spark.createDataFrame(messages_list)

        register_records_df = messages_df \
            .withColumn('Registered', register_record('RowKey', 'PartitionKey', 'messageId', 'ownerSmtp', 'Timestamp'))

        only_registered_records_df = register_records_df \
            .filter(register_records_df.Registered == True) \
            .drop(register_records_df.Registered)

        update_message_status_df = only_registered_records_df \
            .withColumn('TableEntryDeleted', delete_table_entity('RowKey', 'PartitionKey'))

        results_df = update_message_status_df.select(
            update_message_status_df.RowKey,
            update_message_status_df.PartitionKey,
            update_message_status_df.TableEntryDeleted)

        #results_df.explain()
        results_df.show(n=batch_size, truncate=False)


@udf(returnType=BooleanType())
def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp):
    # call an API
    try:
        url = '{}/data/record/{}'.format('***', rowKey)
        headers = { 'Content-type': 'application/json' }
        response = requests.post(
            url,
            headers=headers,
            auth=HTTPBasicAuth('***', '***'),
            data=prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, timestamp))
        return bool(response)
    except:
        return False


def prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, timestamp):
    record_data = {
        "Title": messageId,
        "Type": '***',
        "Source": '***',
        "Creator": ownerSmtp,
        "Publisher": '***',
        "Date": timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')
    }
    return json.dumps(record_data)


@udf(returnType=BooleanType())
def delete_table_entity(row_key, partition_key):
    azure_table_account_name = '***'
    azure_table_account_key = '***'
    azure_table_name = '***'
    try:
        table_service = TableService(account_name=azure_table_account_name, account_key=azure_table_account_key)
        table_service.delete_entity(azure_table_name, partition_key, row_key)
        return True
    except:
        return False


if __name__ == "__main__":
    main()
{code}
Here is the console output:
{noformat}
== Physical Plan ==
*(2) Project [RowKey#54, PartitionKey#53, pythonUDF0#93 AS TableEntryDeleted#81]
+- BatchEvalPython [delete_table_entity(RowKey#54, PartitionKey#53)], [PartitionKey#53, RowKey#54, pythonUDF0#93]
   +- *(1) Project [PartitionKey#53, RowKey#54]
      +- *(1) Project [PartitionKey#53, RowKey#54, Timestamp#55, etag#56, messageId#57, ownerSmtp#58]
         +- *(1) Filter (pythonUDF0#92 = true)
            +- BatchEvalPython
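One pattern worth trying when row-level UDFs do network I/O like the above is to move the calls into mapPartitions, so each partition shares a single TableService instead of opening a connection per row. This is a sketch under assumptions (same '***' redactions as the report), not a confirmed fix for the reported timeout:
{code}
def delete_partition(rows):
    from azure.cosmosdb.table.tableservice import TableService
    # One client per partition instead of one per row inside a UDF
    table_service = TableService(account_name='***', account_key='***')
    for row in rows:
        try:
            table_service.delete_entity('***', row['PartitionKey'], row['RowKey'])
            yield (row['RowKey'], row['PartitionKey'], True)
        except Exception:
            yield (row['RowKey'], row['PartitionKey'], False)

results_df = only_registered_records_df.select('RowKey', 'PartitionKey') \
    .rdd.mapPartitions(delete_partition) \
    .toDF(['RowKey', 'PartitionKey', 'TableEntryDeleted'])
{code}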
[jira] [Commented] (SPARK-9983) Local physical operators for query execution
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860976#comment-16860976 ] Lai Zhou commented on SPARK-9983: - [~rxin], we now use Calcite to build a high-performance Hive SQL engine; it's released as open source now, see [https://github.com/51nb/marble]. It works fine for real-time ML scenarios in our financial business. But I think it's not the best solution. I think adding a single-node version of DataFrame to Spark may be the best solution, because Spark SQL has natural compatibility with Hive SQL, and people can enjoy the benefits of the excellent optimizer, vectorized execution, code gen, etc. > Local physical operators for query execution > > > Key: SPARK-9983 > URL: https://issues.apache.org/jira/browse/SPARK-9983 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > In distributed query execution, there are two kinds of operators: > (1) operators that exchange data between different executors or threads: > examples include broadcast, shuffle. > (2) operators that process data in a single thread: examples include project, > filter, group by, etc. > This ticket proposes clearly differentiating them and creating local > operators in Spark. This leads to a lot of benefits: easier to test, easier > to optimize data exchange, better design (single responsibility), and > potentially even having a hyper-optimized single-node version of DataFrame. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28000) Add comments.sql
Lantao Jin created SPARK-28000: -- Summary: Add comments.sql Key: SPARK-28000 URL: https://issues.apache.org/jira/browse/SPARK-28000 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin In this ticket, we plan to add the regression test cases of https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/comments.sql. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
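For a flavor of what that file exercises, the two SQL comment styles it covers look like this in Spark; whether the ported tests run exactly this way is an assumption:
{code:java}
# Both comment styles are parsed by Spark SQL
spark.sql("SELECT 1 AS x  -- trailing line comment").show()
spark.sql("SELECT /* block comment */ 2 AS y").show()
{code}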
[jira] [Comment Edited] (SPARK-9983) Local physical operators for query execution
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860976#comment-16860976 ] Lai Zhou edited comment on SPARK-9983 at 6/11/19 12:18 PM: --- [~rxin], we now use Calcite to build a high-performance Hive SQL engine; it's released as open source now, see [https://github.com/51nb/marble]. It works fine for real-time ML scenarios in our financial business. But I think it's not the best solution. Adding a single-node version of DataFrame to Spark may be the best solution, because Spark SQL has natural compatibility with Hive SQL, and people can enjoy the benefits of the excellent optimizer, vectorized execution, code gen, etc. was (Author: hhlai1990): [~rxin], we now use Calcite to build a high performance hive sql engine , it's released as opensource now. see [https://github.com/51nb/marble] It works fine for real-time ML scene in our financial business. But I think it's not the best solution. I think adding a single-node version of DataFrame to spark may be the best solution, because spark sql has natural compatibility with Hive sql, and people can enjoy the benefits of the excellent optimizer, vectorized execution, code gen...etc . > Local physical operators for query execution > > > Key: SPARK-9983 > URL: https://issues.apache.org/jira/browse/SPARK-9983 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > In distributed query execution, there are two kinds of operators: > (1) operators that exchange data between different executors or threads: > examples include broadcast, shuffle. > (2) operators that process data in a single thread: examples include project, > filter, group by, etc. > This ticket proposes clearly differentiating them and creating local > operators in Spark. This leads to a lot of benefits: easier to test, easier > to optimize data exchange, better design (single responsibility), and > potentially even having a hyper-optimized single-node version of DataFrame. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9983) Local physical operators for query execution
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860976#comment-16860976 ] Lai Zhou edited comment on SPARK-9983 at 6/11/19 12:15 PM: --- [~rxin], we now use Calcite to build a high performance hive sql engine , it's released as opensource now. see [https://github.com/51nb/marble] It works fine for real-time ML scene in our financial business. But I think it's not the best solution. I think adding a single-node version of DataFrame to spark may be the best solution, because spark sql has natural compatibility with Hive sql, and people can enjoy the benefits of the excellent optimizer, vectorized execution, code gen...etc . was (Author: hhlai1990): [~rxin], we now use Calcite to build a high performance hive sql engine , it's released as opensource now. see [https://github.com/51nb/marble] It works fine for real-time ML scene in our financial business. But I think it's not the best solution. I think adding a single-node version of DataFrame to spark may be the best solution, because spark sql has natural compatibility with Hive sql, and people can enjoy the benefits of the excellent optimizer, vectorized execution, code gen...etc . > Local physical operators for query execution > > > Key: SPARK-9983 > URL: https://issues.apache.org/jira/browse/SPARK-9983 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > In distributed query execution, there are two kinds of operators: > (1) operators that exchange data between different executors or threads: > examples include broadcast, shuffle. > (2) operators that process data in a single thread: examples include project, > filter, group by, etc. > This ticket proposes clearly differentiating them and creating local > operators in Spark. This leads to a lot of benefits: easier to test, easier > to optimize data exchange, better design (single responsibility), and > potentially even having a hyper-optimized single-node version of DataFrame. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27369) Standalone worker can load resource conf and discover resources
[ https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27369: Assignee: Apache Spark > Standalone worker can load resource conf and discover resources > --- > > Key: SPARK-27369 > URL: https://issues.apache.org/jira/browse/SPARK-27369 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27369) Standalone worker can load resource conf and discover resources
[ https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27369: Assignee: (was: Apache Spark) > Standalone worker can load resource conf and discover resources > --- > > Key: SPARK-27369 > URL: https://issues.apache.org/jira/browse/SPARK-27369 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9983) Local physical operators for query execution
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860959#comment-16860959 ] Lai Zhou edited comment on SPARK-9983 at 6/11/19 11:51 AM: --- [~rxin], regarding `a hyper-optimized single-node version of DataFrame`, do you have any roadmap for it? In the real world, we use Spark SQL to handle our ETL jobs on Hive. We may extract a lot of user variables via complex SQL queries, which become the input for machine-learning models. But when we want to migrate these jobs to a real-time system, we always need to reimplement the SQL queries in another programming language, which requires a lot of work. Since the local mode of Spark SQL is not a direct, high-performance execution mode, I think it would make great sense to have a hyper-optimized single-node version of DataFrame. was (Author: hhlai1990): [~rxin], `a hyper-optimized single-node version of DataFrame`, do you have any roadmap about it? In real world, we use spark sql to handle our ETL jobs on Hive. We may extract a lots of user's variables by complex sql queries, which will be the input for machine-learning models. But when we want to migrate the jobs to real-time system, we always need to interpret these sql queries by another programming language, which requires a lot of work. Now the local mode of spark sql is not a direct and high performance execution mode, I think it will make great sense to have a high hyper-optimized single-node. > Local physical operators for query execution > > > Key: SPARK-9983 > URL: https://issues.apache.org/jira/browse/SPARK-9983 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > In distributed query execution, there are two kinds of operators: > (1) operators that exchange data between different executors or threads: > examples include broadcast, shuffle. > (2) operators that process data in a single thread: examples include project, > filter, group by, etc. > This ticket proposes clearly differentiating them and creating local > operators in Spark. This leads to a lot of benefits: easier to test, easier > to optimize data exchange, better design (single responsibility), and > potentially even having a hyper-optimized single-node version of DataFrame. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9983) Local physical operators for query execution
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860959#comment-16860959 ] Lai Zhou commented on SPARK-9983: - [~rxin], regarding `a hyper-optimized single-node version of DataFrame`, do you have any roadmap for it? In the real world, we use Spark SQL to handle our ETL jobs on Hive. We may extract a lot of user variables via complex SQL queries, which become the input for machine-learning models. But when we want to migrate these jobs to a real-time system, we always need to reimplement the SQL queries in another programming language, which requires a lot of work. Since the local mode of Spark SQL is not a direct, high-performance execution mode, I think it would make great sense to have a hyper-optimized single-node version. > Local physical operators for query execution > > > Key: SPARK-9983 > URL: https://issues.apache.org/jira/browse/SPARK-9983 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > In distributed query execution, there are two kinds of operators: > (1) operators that exchange data between different executors or threads: > examples include broadcast, shuffle. > (2) operators that process data in a single thread: examples include project, > filter, group by, etc. > This ticket proposes clearly differentiating them and creating local > operators in Spark. This leads to a lot of benefits: easier to test, easier > to optimize data exchange, better design (single responsibility), and > potentially even having a hyper-optimized single-node version of DataFrame. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-27258: --- > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860951#comment-16860951 ] wuyi commented on SPARK-27360: -- Oops... I happened to find that sub-task 2 has changed to what I want after I created sub-task 4. Can you please help to delete sub-task 4? [~mengxr] > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design and implement standalone manager support for GPU-aware scheduling: > 1. static conf to describe resources > 2. spark-submit to request resources > 3. auto discovery of GPUs > 4. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27999) setup resources when Standalone Worker starts up
wuyi created SPARK-27999: Summary: setup resources when Standalone Worker starts up Key: SPARK-27999 URL: https://issues.apache.org/jira/browse/SPARK-27999 Project: Spark Issue Type: Sub-task Components: Deploy, Spark Core Affects Versions: 3.0.0 Reporter: wuyi -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860924#comment-16860924 ] Stavros Kontopoulos edited comment on SPARK-27258 at 6/11/19 11:03 AM: --- [~hehuiyuan] the ticket needs to be re-opened, [~srowen] are you ok with that (saw you resolved in the past)? Also next time pls stick to one PR if possible. was (Author: skonto): [~hehuiyuan] the ticket needs to be re-opened, unless someone else resolved it. Also next time pls stick to one PR if possible. > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860924#comment-16860924 ] Stavros Kontopoulos edited comment on SPARK-27258 at 6/11/19 11:03 AM: --- [~hehuiyuan] the ticket needs to be re-opened, [~srowen] are you ok with that (saw you resolved it in the past)? Also next time pls stick to one PR if possible. was (Author: skonto): [~hehuiyuan] the ticket needs to be re-opened, [~srowen] are you ok with that (saw you resolved in the past)? Also next time pls stick to one PR if possible. > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860924#comment-16860924 ] Stavros Kontopoulos edited comment on SPARK-27258 at 6/11/19 11:02 AM: --- [~hehuiyuan] the ticket needs to be re-opened, unless someone else resolved it. Also next time pls stick to one PR if possible. was (Author: skonto): [~hehuiyuan] the ticket needs to be re-opened. Also next time pls stick to one PR if possible. > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860924#comment-16860924 ] Stavros Kontopoulos edited comment on SPARK-27258 at 6/11/19 11:01 AM: --- [~hehuiyuan] the ticket needs to be re-opened. Also next time pls stick to one PR if possible. was (Author: skonto): [~hehuiyuan] the ticket needs to be re-opened. > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860924#comment-16860924 ] Stavros Kontopoulos commented on SPARK-27258: - [~hehuiyuan] the ticket needs to be re-opened. > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label
[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860922#comment-16860922 ] zhengruifeng edited comment on SPARK-24875 at 6/11/19 10:59 AM: The scored dataset is usually much smaller than the training dataset; if the scored data is too huge to perform a simple op like countByValue, how could you train/evaluate the model? I doubt whether it is worth applying an approximation. was (Author: podongfeng): The scored dataset is usually much smaller than the training dataset; if the scored data is too huge to perform a simple op like countByValue, how could you train the model? I doubt whether it is worth applying an approximation. > MulticlassMetrics should offer a more efficient way to compute count by label > - > > Key: SPARK-24875 > URL: https://issues.apache.org/jira/browse/SPARK-24875 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Antoine Galataud >Priority: Minor > > Currently _MulticlassMetrics_ calls _countByValue_() to get the count by > class/label: > {code:java} > private lazy val labelCountByClass: Map[Double, Long] = > predictionAndLabels.values.countByValue() > {code} > If the input _RDD[(Double, Double)]_ is huge (which can be the case with a large > test dataset), it will lead to poor execution performance. > One option could be to allow using _countByValueApprox_ (could require adding > an extra configuration param for MulticlassMetrics). > Note: since there is no equivalent of _MulticlassMetrics_ in the new ML library, > I don't know how this could be ported there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label
[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860922#comment-16860922 ] zhengruifeng commented on SPARK-24875: -- The scored dataset is usually much smaller than the training dataset; if the scored data is too huge to perform a simple op like countByValue, how could you train the model? I doubt whether it is worth applying an approximation. > MulticlassMetrics should offer a more efficient way to compute count by label > - > > Key: SPARK-24875 > URL: https://issues.apache.org/jira/browse/SPARK-24875 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Antoine Galataud >Priority: Minor > > Currently _MulticlassMetrics_ calls _countByValue_() to get the count by > class/label: > {code:java} > private lazy val labelCountByClass: Map[Double, Long] = > predictionAndLabels.values.countByValue() > {code} > If the input _RDD[(Double, Double)]_ is huge (which can be the case with a large > test dataset), it will lead to poor execution performance. > One option could be to allow using _countByValueApprox_ (could require adding > an extra configuration param for MulticlassMetrics). > Note: since there is no equivalent of _MulticlassMetrics_ in the new ML library, > I don't know how this could be ported there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
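For context, {{RDD.countByValueApprox(timeout, confidence)}} already exists in Spark Core, so the option the ticket discusses is mostly about exposing it. A rough sketch of how the approximate path might be used (the helper function and its default timeout/confidence values are illustrative, not a proposed API):
{code:scala}
import org.apache.spark.rdd.RDD

// Sketch: approximate count-by-label via the existing countByValueApprox.
// getFinalValue() blocks until the timeout elapses or the job finishes.
def approxLabelCounts(labels: RDD[Double],
                      timeoutMs: Long = 10000L,
                      confidence: Double = 0.95): Map[Double, Long] = {
  labels.countByValueApprox(timeoutMs, confidence)
    .getFinalValue()
    .map { case (label, bounded) => (label, bounded.mean.toLong) }
    .toMap
}
{code}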
[jira] [Commented] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860919#comment-16860919 ] zhengruifeng commented on SPARK-26185: -- Seems resolved? > add weightCol in python MulticlassClassificationEvaluator > - > > Key: SPARK-26185 > URL: https://issues.apache.org/jira/browse/SPARK-26185 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > https://issues.apache.org/jira/browse/SPARK-24101 added weightCol in > MulticlassClassificationEvaluator.scala. This Jira will add weightCol in the > Python version of MulticlassClassificationEvaluator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27759) Do not auto cast array to np.array in vectorized udf
[ https://issues.apache.org/jira/browse/SPARK-27759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] colin fang updated SPARK-27759: --- Description: {code:java} pd_df = pd.DataFrame({'x': np.random.rand(11, 3, 5).tolist()}) df = spark.createDataFrame(pd_df).cache() {code} Each element in x is a list of lists, as expected. {code:java} df.toPandas()['x'] # 0 [[0.08669612955959993, 0.32624430522634495, 0 # 1 [[0.29838166086156914, 0.008550172904516762, 0... # 2 [[0.641304534802928, 0.2392047548381877, 0.555... {code} {code:java} def my_udf(x): # Hack to see what's inside a udf raise Exception(x.values.shape, x.values[0].shape, x.values[0][0].shape, np.stack(x.values).shape) return pd.Series(x.values) my_udf = F.pandas_udf(my_udf, returnType=DoubleType()) df.coalesce(1).withColumn('y', my_udf('x')).show() # Exception: ((11,), (3,), (5,), (11, 3)){code} A batch (11) of `x` is converted to a pd.Series; however, each element in the pd.Series is now a numpy 1d array of numpy 1d arrays. It is inconvenient to work with nested 1d numpy arrays in practice in a udf. For example, I need an ndarray of shape (11, 3, 5) in the udf, so that I can make use of the numpy vectorized operations. If I were given the list of lists intact, I could simply do `np.stack(x.values)`. However, it doesn't work here as what I received is a nested numpy 1d array. was: {code:java} pd_df = pd.DataFrame({'x': np.random.rand(11, 3, 5).tolist()}) df = spark.createDataFrame(pd_df).cache() {code} Each element in x is a list of lists, as expected. {code:java} df.toPandas()['x'] # 0 [[0.08669612955959993, 0.32624430522634495, 0 # 1 [[0.29838166086156914, 0.008550172904516762, 0... # 2 [[0.641304534802928, 0.2392047548381877, 0.555... {code} {code:java} def my_udf(x): # Hack to see what's inside a udf raise Exception(x.values.shape, x.values[0].shape, x.values[0][0].shape, np.stack(x.values).shape) return pd.Series(x.values) my_udf = pandas_udf(dot_product, returnType=DoubleType()) df.withColumn('y', my_udf('x')).show() Exception: ((2,), (3,), (5,), (2, 3)) {code} A batch (2) of `x` is converted to a pd.Series; however, each element in the pd.Series is now a numpy 1d array of numpy 1d arrays. It is inconvenient to work with nested 1d numpy arrays in practice in a udf. For example, I need an ndarray of shape (2, 3, 5) in the udf, so that I can make use of the numpy vectorized operations. If I were given the list of lists intact, I could simply do `np.stack(x.values)`. However, it doesn't work here as what I received is a nested numpy 1d array. > Do not auto cast array to np.array in vectorized udf > > > Key: SPARK-27759 > URL: https://issues.apache.org/jira/browse/SPARK-27759 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: colin fang >Priority: Minor > > {code:java} > pd_df = pd.DataFrame({'x': np.random.rand(11, 3, 5).tolist()}) > df = spark.createDataFrame(pd_df).cache() > {code} > Each element in x is a list of lists, as expected. > {code:java} > df.toPandas()['x'] > # 0 [[0.08669612955959993, 0.32624430522634495, 0 > # 1 [[0.29838166086156914, 0.008550172904516762, 0... > # 2 [[0.641304534802928, 0.2392047548381877, 0.555... 
> {code} > > {code:java} > def my_udf(x): > # Hack to see what's inside a udf > raise Exception(x.values.shape, x.values[0].shape, x.values[0][0].shape, > np.stack(x.values).shape) > return pd.Series(x.values) > my_udf = F.pandas_udf(my_udf, returnType=DoubleType()) > df.coalesce(1).withColumn('y', my_udf('x')).show() > # Exception: ((11,), (3,), (5,), (11, 3)){code} > > A batch (11) of `x` is converted to a pd.Series; however, each element in the > pd.Series is now a numpy 1d array of numpy 1d arrays. It is inconvenient to > work with nested 1d numpy arrays in practice in a udf. > > For example, I need an ndarray of shape (11, 3, 5) in the udf, so that I can make > use of the numpy vectorized operations. If I were given the list of lists intact, > I could simply do `np.stack(x.values)`. However, it doesn't work here as what I > received is a nested numpy 1d array. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27998) Column alias should support double quote string
[ https://issues.apache.org/jira/browse/SPARK-27998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27998: Assignee: Apache Spark > Column alias should support double quote string > > > Key: SPARK-27998 > URL: https://issues.apache.org/jira/browse/SPARK-27998 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Assignee: Apache Spark >Priority: Major > > According to the ANSI SQL standard, a column alias can be a double-quoted string. > {code:sql} > SELECT au_fname AS "First name", > au_lname AS 'Last name', > city AS City, > state, > zip 'Postal code' FROM authors;{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27998) Column alias should support double quote string
[ https://issues.apache.org/jira/browse/SPARK-27998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27998: Assignee: (was: Apache Spark) > Column alias should support double quote string > > > Key: SPARK-27998 > URL: https://issues.apache.org/jira/browse/SPARK-27998 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > According to the ANSI SQL standard, a column alias can be a double-quoted string. > {code:sql} > SELECT au_fname AS "First name", > au_lname AS 'Last name', > city AS City, > state, > zip 'Postal code' FROM authors;{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25360) Parallelized RDDs of Ranges could have known partitioner
[ https://issues.apache.org/jira/browse/SPARK-25360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860759#comment-16860759 ] zhengruifeng commented on SPARK-25360: -- But I think it may be worth implementing a direct version of `sc.range` rather than building it on `parallelize().mapPartitions()`, to simplify the lineage. > Parallelized RDDs of Ranges could have known partitioner > > > Key: SPARK-25360 > URL: https://issues.apache.org/jira/browse/SPARK-25360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > > We already have the logic to split up the generator; we could expose the same > logic as a partitioner. This would be useful when joining a small > parallelized collection with a larger collection and other cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25360) Parallelized RDDs of Ranges could have known partitioner
[ https://issues.apache.org/jira/browse/SPARK-25360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860756#comment-16860756 ] zhengruifeng commented on SPARK-25360: -- [~holdenk] I am afraid it is not doable to add a partitioner to the {{RDD[Long]}} generated by {{sc.range}}, referring to the definition of Partitioner. {code:java} /** * An object that defines how the elements in a key-value pair RDD are partitioned by key. * Maps each key to a partition ID, from 0 to `numPartitions - 1`. * * Note that, partitioner must be deterministic, i.e. it must return the same partition id given * the same partition key. */{code} Since the returned RDD[Long] is not a {{PairRDD}}, the following ops (like join, sort) cannot utilize an upstream partitioner. An alternative is to add some method like `sc.tabulate[T](start, end, step, numSlices)(f: Long => T)`, so that the partitioner can be used in future ops. > Parallelized RDDs of Ranges could have known partitioner > > > Key: SPARK-25360 > URL: https://issues.apache.org/jira/browse/SPARK-25360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > > We already have the logic to split up the generator; we could expose the same > logic as a partitioner. This would be useful when joining a small > parallelized collection with a larger collection and other cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
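To make the alternative concrete, here is a sketch of a deterministic partitioner that mirrors the slicing {{sc.range}} performs, usable once the range is keyed; the class name and the slicing arithmetic are illustrative, not the actual ParallelCollectionRDD code:
{code:scala}
import org.apache.spark.Partitioner

// Sketch: assign each key from range(start, end, step) to the slice
// sc.range would have placed it in, so downstream ops can reuse it.
class RangeKeyPartitioner(start: Long, end: Long, step: Long,
                          numSlices: Int) extends Partitioner {
  private val numElements = (end - start + step - 1) / step
  override def numPartitions: Int = numSlices
  override def getPartition(key: Any): Int = key match {
    case k: Long =>
      val index = (k - start) / step            // position of k within the range
      ((index * numSlices) / numElements).toInt // proportional slicing
    case other => throw new IllegalArgumentException(s"Unexpected key: $other")
  }
}

// Partitioners only apply to pair RDDs, hence the need for a keyed shape:
// sc.range(0, 100).map(i => (i, i * i)).partitionBy(new RangeKeyPartitioner(0, 100, 1, 4))
{code}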
[jira] [Resolved] (SPARK-27995) Note the difference between str of Python 2 and 3 at Arrow optimized
[ https://issues.apache.org/jira/browse/SPARK-27995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27995. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24838 [https://github.com/apache/spark/pull/24838] > Note the difference between str of Python 2 and 3 at Arrow optimized > > > Key: SPARK-27995 > URL: https://issues.apache.org/jira/browse/SPARK-27995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > When Arrow optimization is enabled in Python 2.7, > {code} > import pandas > pdf = pandas.DataFrame(["test1", "test2"]) > df = spark.createDataFrame(pdf) > df.show() > {code} > I got the following output: > {code} > ++ > | 0| > ++ > |[74 65 73 74 31]| > |[74 65 73 74 32]| > ++ > {code} > This is because Python 2's {{str}} and {{bytes}} are the same; it does look right: > {code} > >>> str == bytes > True > >>> isinstance("a", bytes) > True > {code} > 1. Python 2 treats `str` as `bytes`. > 2. PySpark added some special code and hacks to recognize `str` as string > types. > 3. PyArrow / Pandas follow the Python 2 difference. > We might have to match the behaviour to PySpark's, but Python 2 is deprecated > anyway. I think it's better to just note it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27995) Note the difference between str of Python 2 and 3 at Arrow optimized
[ https://issues.apache.org/jira/browse/SPARK-27995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27995: Assignee: Hyukjin Kwon > Note the difference between str of Python 2 and 3 at Arrow optimized > > > Key: SPARK-27995 > URL: https://issues.apache.org/jira/browse/SPARK-27995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > When Arrow optimization is enabled in Python 2.7, > {code} > import pandas > pdf = pandas.DataFrame(["test1", "test2"]) > df = spark.createDataFrame(pdf) > df.show() > {code} > I got the following output: > {code} > ++ > | 0| > ++ > |[74 65 73 74 31]| > |[74 65 73 74 32]| > ++ > {code} > This is because Python 2's {{str}} and {{bytes}} are the same; it does look right: > {code} > >>> str == bytes > True > >>> isinstance("a", bytes) > True > {code} > 1. Python 2 treats `str` as `bytes`. > 2. PySpark added some special code and hacks to recognize `str` as string > types. > 3. PyArrow / Pandas follow the Python 2 difference. > We might have to match the behaviour to PySpark's, but Python 2 is deprecated > anyway. I think it's better to just note it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27995) Note the difference between str of Python 2 and 3 at Arrow optimized
[ https://issues.apache.org/jira/browse/SPARK-27995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27995: - Summary: Note the difference between str of Python 2 and 3 at Arrow optimized (was: Note the difference between str of Python 2 and 3 at Arrow optimized toPandas) > Note the difference between str of Python 2 and 3 at Arrow optimized > > > Key: SPARK-27995 > URL: https://issues.apache.org/jira/browse/SPARK-27995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > When Arrow optimization is enabled in Python 2.7, > {code} > import pandas > pdf = pandas.DataFrame(["test1", "test2"]) > df = spark.createDataFrame(pdf) > df.show() > {code} > I got the following output: > {code} > ++ > | 0| > ++ > |[74 65 73 74 31]| > |[74 65 73 74 32]| > ++ > {code} > This is because Python 2's {{str}} and {{bytes}} are the same; it does look right: > {code} > >>> str == bytes > True > >>> isinstance("a", bytes) > True > {code} > 1. Python 2 treats `str` as `bytes`. > 2. PySpark added some special code and hacks to recognize `str` as string > types. > 3. PyArrow / Pandas follow the Python 2 difference. > We might have to match the behaviour to PySpark's, but Python 2 is deprecated > anyway. I think it's better to just note it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark 2.x does not support reading data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860734#comment-16860734 ] HonglunChen commented on SPARK-18112: - [~hyukjin.kwon] Thanks, I did not remove the 1.2.1 jars; I put all the 2.3.3 jars in a separate directory and then set "--conf spark.sql.hive.metastore.jars=/path of dir/*". It works now! > Spark 2.x does not support reading data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have > also been out for a long time, but until now Spark only supports reading Hive > metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug > fixes and performance improvements, it is better and urgent to upgrade to > support Hive 2.x. > Failed to load data from a Hive 2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
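For anyone else hitting this, the working configuration described above amounts to pointing the metastore client at the standalone Hive jars; a sketch of the equivalent session setup (the jar directory path is a placeholder):
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: load the Hive 2.3.3 metastore client from a separate directory,
// leaving Spark's built-in 1.2.1 jars in place. The path is a placeholder.
val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "2.3.3")
  .config("spark.sql.hive.metastore.jars", "/opt/hive-2.3.3-jars/*")
  .enableHiveSupport()
  .getOrCreate()
{code}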
[jira] [Commented] (SPARK-18112) Spark 2.x does not support reading data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860711#comment-16860711 ] Hyukjin Kwon commented on SPARK-18112: -- You should place the 2.3.3 jars if you want to use 2.3.3. > Spark 2.x does not support reading data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have > also been out for a long time, but until now Spark only supports reading Hive > metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug > fixes and performance improvements, it is better and urgent to upgrade to > support Hive 2.x. > Failed to load data from a Hive 2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27998) Column alias should support double quote string
[ https://issues.apache.org/jira/browse/SPARK-27998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27998: Affects Version/s: (was: 3.1.0) 3.0.0 > Column alias should support double quote string > > > Key: SPARK-27998 > URL: https://issues.apache.org/jira/browse/SPARK-27998 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > According to the ANSI SQL standard, a column alias can be a double-quoted string. > {code:sql} > SELECT au_fname AS "First name", > au_lname AS 'Last name', > city AS City, > state, > zip 'Postal code' FROM authors;{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27998) Column alias should support double quote string
Zhu, Lipeng created SPARK-27998: --- Summary: Column alias should support double quote string Key: SPARK-27998 URL: https://issues.apache.org/jira/browse/SPARK-27998 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Zhu, Lipeng According to the ANSI SQL standard, a column alias can be a double-quoted string. {code:sql} SELECT au_fname AS "First name", au_lname AS 'Last name', city AS City, state, zip 'Postal code' FROM authors;{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
[ https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860648#comment-16860648 ] Dongjoon Hyun commented on SPARK-27499: --- Oh, do you mean `SPARK_LOCAL_DIRS` doesn't work on the PV, [~junjie]? > Support mapping spark.local.dir to hostPath volume > -- > > Key: SPARK-27499 > URL: https://issues.apache.org/jira/browse/SPARK-27499 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Junjie Chen >Priority: Minor > > Currently, the k8s executor builder mounts spark.local.dir as an emptyDir or > in memory. That should satisfy small workloads, while in heavy workloads like > TPCDS both of them can have problems, such as pods being evicted due to disk > pressure when using emptyDir, and OOM when using tmpfs. > In particular, in cloud environments users may allocate a cluster with a > minimal configuration and add cloud storage when running a workload. In this > case, we could specify multiple elastic storage volumes as spark.local.dir to > accelerate spilling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
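If this lands, a natural shape would reuse the existing spark.kubernetes.*.volumes.* configuration keys, with a volume name that marks the mount as local scratch space. A sketch under that assumption (the spark-local-dir-* naming convention and both paths are illustrative, not a committed design):
{code:scala}
import org.apache.spark.SparkConf

// Sketch: back Spark local (shuffle/spill) space with an executor hostPath
// volume instead of emptyDir or tmpfs. Paths are placeholders.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path",
       "/tmp/spark-local")
  .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path",
       "/mnt/fast-disk/spark-local")
{code}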
[jira] [Resolved] (SPARK-27934) Add case.sql
[ https://issues.apache.org/jira/browse/SPARK-27934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27934. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 > Add case.sql > > > Key: SPARK-27934 > URL: https://issues.apache.org/jira/browse/SPARK-27934 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/case.sql. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API
[ https://issues.apache.org/jira/browse/SPARK-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27899. - Resolution: Fixed Fix Version/s: 3.0.0 > Make HiveMetastoreClient.getTableObjectsByName available in > ExternalCatalog/SessionCatalog API > -- > > Key: SPARK-27899 > URL: https://issues.apache.org/jira/browse/SPARK-27899 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Juliusz Sompolski >Assignee: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > The new Spark ThriftServer SparkGetTablesOperation implemented in > https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata > request for every table. This can get very slow for large schemas (~50ms per > table with an external Hive metastore). > Hive ThriftServer GetTablesOperation uses > HiveMetastoreClient.getTableObjectsByName to get table information in bulk, > but we don't expose that through our APIs that go through Hive -> > HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> > SessionCatalog. > If we added and exposed getTableObjectsByName through our catalog APIs, we > could resolve that performance problem in SparkGetTablesOperation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
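A sketch of the bulk lookup this ticket asks for, as it might surface in the catalog layer (the method name and exact signature are illustrative; the underlying Hive call is the existing getTableObjectsByName):
{code:scala}
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Sketch: fetch metadata for N tables in one metastore round trip,
// instead of N separate getTableMetadata calls.
trait BulkTableLookup {
  def getTablesByName(db: String, tables: Seq[String]): Seq[CatalogTable]
}
{code}
With that in place, SparkGetTablesOperation could issue one call per schema rather than one per table.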
[jira] [Updated] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
[ https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27499: -- Fix Version/s: (was: 2.4.0) > Support mapping spark.local.dir to hostPath volume > -- > > Key: SPARK-27499 > URL: https://issues.apache.org/jira/browse/SPARK-27499 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Junjie Chen >Priority: Minor > > Currently, the k8s executor builder mounts spark.local.dir as an emptyDir or > in memory. That should satisfy small workloads, while in heavy workloads like > TPCDS both of them can have problems, such as pods being evicted due to disk > pressure when using emptyDir, and OOM when using tmpfs. > In particular, in cloud environments users may allocate a cluster with a > minimal configuration and add cloud storage when running a workload. In this > case, we could specify multiple elastic storage volumes as spark.local.dir to > accelerate spilling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
[ https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860598#comment-16860598 ] Dongjoon Hyun edited comment on SPARK-27499 at 6/11/19 7:42 AM: No, I think you didn't read the patch for that, [~junjie]. Could you read the patch? You can start from the following line. - https://github.com/apache/spark/pull/21238/files#diff-529fc5c06b9731c1fbda6f3db60b16aaR458 Also, please see the following `PVTestsSuite` test case. - https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/PVTestsSuite.scala was (Author: dongjoon): No, I think you didn't read the patch for that, [~junjie]. Could you read the patch? You can start from the following line. - https://github.com/apache/spark/pull/21238/files#diff-529fc5c06b9731c1fbda6f3db60b16aaR458 Also, please see the test case. > Support mapping spark.local.dir to hostPath volume > -- > > Key: SPARK-27499 > URL: https://issues.apache.org/jira/browse/SPARK-27499 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Junjie Chen >Priority: Minor > Fix For: 2.4.0 > > > Currently, the k8s executor builder mounts spark.local.dir as an emptyDir or > in memory. That should satisfy small workloads, while in heavy workloads like > TPCDS both of them can have problems, such as pods being evicted due to disk > pressure when using emptyDir, and OOM when using tmpfs. > In particular, in cloud environments users may allocate a cluster with a > minimal configuration and add cloud storage when running a workload. In this > case, we could specify multiple elastic storage volumes as spark.local.dir to > accelerate spilling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24175) improve the Spark 2.4 migration guide
[ https://issues.apache.org/jira/browse/SPARK-24175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24175: -- Component/s: Documentation > improve the Spark 2.4 migration guide > - > > Key: SPARK-24175 > URL: https://issues.apache.org/jira/browse/SPARK-24175 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > The current Spark 2.4 migration guide is not well phrased. We should > 1. State the before behavior > 2. State the after behavior > 3. Add a concrete example with code to illustrate. > For example: > Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after > promotes both sides to TIMESTAMP. To set `false` to > `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous > behavior. This option will be removed in Spark 3.0. > --> > In version 2.3 and earlier, Spark implicitly casts a timestamp column to date > type when comparing with a date column. In version 2.4 and later, Spark casts > the date column to timestamp type instead. As an example, "xxx" would result > in ".." in Spark 2.3, and in Spark 2.4, the result would be "..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24125) Add quoting rules to SQL guide
[ https://issues.apache.org/jira/browse/SPARK-24125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24125: -- Component/s: (was: SQL) Documentation > Add quoting rules to SQL guide > -- > > Key: SPARK-24125 > URL: https://issues.apache.org/jira/browse/SPARK-24125 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Henry Robinson >Priority: Minor > > As far as I can tell, Spark SQL's quoting rules are as follows: > * {{`foo bar`}} is an identifier > * {{'foo bar'}} is a string literal > * {{"foo bar"}} is a string literal > The last of these is non-standard (usually {{"foo bar"}} is an identifier), > and so it's probably worth mentioning these rules in the 'reference' section > of the [SQL > guide|http://spark.apache.org/docs/latest/sql-programming-guide.html#reference]. > I'm assuming there's not a lot of enthusiasm to change the quoting rules, > given it would be a breaking change, and that backticks work just fine as an > alternative. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24125) Add quoting rules to SQL guide
[ https://issues.apache.org/jira/browse/SPARK-24125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24125: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Add quoting rules to SQL guide > -- > > Key: SPARK-24125 > URL: https://issues.apache.org/jira/browse/SPARK-24125 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Henry Robinson >Priority: Minor > > As far as I can tell, Spark SQL's quoting rules are as follows: > * {{`foo bar`}} is an identifier > * {{'foo bar'}} is a string literal > * {{"foo bar"}} is a string literal > The last of these is non-standard (usually {{"foo bar"}} is an identifier), > and so it's probably worth mentioning these rules in the 'reference' section > of the [SQL > guide|http://spark.apache.org/docs/latest/sql-programming-guide.html#reference]. > I'm assuming there's not a lot of enthusiasm to change the quoting rules, > given it would be a breaking change, and that backticks work just fine as an > alternative. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
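A quick demonstration of the three quoting rules as Spark SQL behaves today (table {{t}} and its {{foo bar}} column are hypothetical):
{code:scala}
spark.sql("SELECT `foo bar` FROM t")  // backticks: identifier (a column named "foo bar")
spark.sql("SELECT 'foo bar' AS s")    // single quotes: string literal
spark.sql("SELECT \"foo bar\" AS s")  // double quotes: also a string literal (non-standard)
{code}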
[jira] [Updated] (SPARK-24584) More efficient storage of executor pod state
[ https://issues.apache.org/jira/browse/SPARK-24584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24584: -- Summary: More efficient storage of executor pod state (was: [K8s] More efficient storage of executor pod state) > More efficient storage of executor pod state > > > Key: SPARK-24584 > URL: https://issues.apache.org/jira/browse/SPARK-24584 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Priority: Major > > Currently we store buffers of snapshots in {{ExecutorPodsSnapshotStore}}, > where the snapshots are duplicated per subscriber. With hundreds or maybe > thousands of executors, this buffering may become untenable. Investigate > storing less state while still maintaining the same level-triggered semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24254) Eagerly evaluate some subqueries over LocalRelation
[ https://issues.apache.org/jira/browse/SPARK-24254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24254: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Eagerly evaluate some subqueries over LocalRelation > --- > > Key: SPARK-24254 > URL: https://issues.apache.org/jira/browse/SPARK-24254 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Henry Robinson >Priority: Major > > Some queries would benefit from evaluating subqueries over {{LocalRelations}} > eagerly. For example: > {code} > SELECT t1.part_col FROM t1 JOIN (SELECT max(part_col) m FROM t2) foo WHERE > t1.part_col = foo.m > {code} > If {{max(part_col)}} could be evaluated during planning, there's an > opportunity to prune all but at most one partitions from the scan of {{t1}}. > Similarly, a near-identical query with a non-scalar subquery in the {{WHERE}} > clause: > {code} > SELECT * FROM t1 WHERE part_col IN (SELECT part_col FROM t2) > {code} > could be partially evaluated to eliminate some partitions, and remove the > join from the plan. > Obviously all subqueries over local relations can't be evaluated during > planning, but certain whitelisted aggregates could be if the input > cardinality isn't too high. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24584) [K8s] More efficient storage of executor pod state
[ https://issues.apache.org/jira/browse/SPARK-24584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24584: -- Affects Version/s: (was: 2.4.0) 3.0.0 > [K8s] More efficient storage of executor pod state > -- > > Key: SPARK-24584 > URL: https://issues.apache.org/jira/browse/SPARK-24584 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Priority: Major > > Currently we store buffers of snapshots in {{ExecutorPodsSnapshotStore}}, > where the snapshots are duplicated per subscriber. With hundreds or maybe > thousands of executors, this buffering may become untenable. Investigate > storing less state while still maintaining the same level-triggered semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25050) Handle more than two types in avro union types when writing avro files
[ https://issues.apache.org/jira/browse/SPARK-25050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25050: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Handle more than two types in avro union types when writing avro files > -- > > Key: SPARK-25050 > URL: https://issues.apache.org/jira/browse/SPARK-25050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24831) Spark Executor class creates a new serializer each time
[ https://issues.apache.org/jira/browse/SPARK-24831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-24831. --- Resolution: Incomplete Sorry, but JIRA is not for questions. > Spark Executor class creates a new serializer each time > > > Key: SPARK-24831 > URL: https://issues.apache.org/jira/browse/SPARK-24831 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Major > > When I look in the code I find this line: > (org.apache.spark.executor.Executor: 319) > val resultSer = env.serializer.newInstance() > Why not hold this in a ThreadLocal and reuse it? > [question on > Stack Overflow|https://stackoverflow.com/questions/51298877/spark-executor-class-create-new-serializer-each-time] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
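For reference, the reuse the reporter suggests would look roughly like the following (a sketch, not Spark's actual code; whether the per-task allocation is even measurable is the open question):
{code:scala}
import org.apache.spark.serializer.{Serializer, SerializerInstance}

// Sketch: one SerializerInstance per task thread instead of a fresh
// env.serializer.newInstance() on every task run. SerializerInstance is
// not thread-safe, which is why the cache must be per-thread.
class CachedSerializer(serializer: Serializer) {
  private val cached = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance = serializer.newInstance()
  }
  def get(): SerializerInstance = cached.get()
}
{code}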
[jira] [Updated] (SPARK-25396) Read array of JSON objects via an Iterator
[ https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25396: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Read array of JSON objects via an Iterator > -- > > Key: SPARK-25396 > URL: https://issues.apache.org/jira/browse/SPARK-25396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > If a JSON file has a structure like below: > {code} > [ > { > "time":"2018-08-13T18:00:44.086Z", > "resourceId":"some-text", > "category":"A", > "level":2, > "operationName":"Error", > "properties":{...} > }, > { > "time":"2018-08-14T18:00:44.086Z", > "resourceId":"some-text2", > "category":"B", > "level":3, > "properties":{...} > }, > ... > ] > {code} > it should be read in the `multiLine` mode. In this mode, Spark reads the whole > array into memory in both cases, when the schema is `ArrayType` and when it is > `StructType`. This can lead to unnecessary memory consumption and even to OOM > for big JSON files. > In general, there is no need to materialize all parsed JSON records in memory > there: > https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95 > . So, JSON objects of an array can be read via an Iterator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
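For context, this is the read path in question; a file shaped like the example is only parseable with {{multiLine}} enabled, which is exactly where the whole array currently gets materialized (the path is a placeholder):
{code:scala}
// Sketch: reading a top-level JSON array of objects.
val df = spark.read
  .option("multiLine", true)
  .json("/data/events.json") // placeholder path
{code}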
[jira] [Updated] (SPARK-27013) Consider adding support for external encoders when resolving org.apache.spark.sql.catalyst.encoders.ExpressionEncoder's apply method
[ https://issues.apache.org/jira/browse/SPARK-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27013: -- Component/s: (was: Optimizer) SQL > Consider adding support for external encoders when resolving > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder's apply method > > > Key: SPARK-27013 > URL: https://issues.apache.org/jira/browse/SPARK-27013 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Ottey >Priority: Minor > > I recently discovered that, because most of the common implicit encoders > introduced by > {noformat}import spark.implicits._{noformat} > reduce to a call to {{ExpressionEncoder}}'s {{apply}} method, it's _very_ > difficult to generate and/or operate on {{Column}}s whose internal types > reduce to some Scala type that wraps an external type, even if an implicit > encoder for that external type is available or could be trivially generated. > See the example below: > {code:scala} > import com.example.MyBean > object Example { > implicit def BeanEncoder: Encoder[MyBean] = Encoders.bean(classOf[MyBean]) > > def main(args: Array[String]): Unit = { > val path = args(0) > val spark: SparkSession = ??? > import spark.implicits._ > // THE FOLLOWING DOES NOT WORK!!! > // implicit encoder for Seq[_] is found and used... > // Calls ExpressionEncoder's apply method > // Unwraps the inner type com.example.MyBean... > // ScalaReflection.serializeFor() cannot find encoder for our type > // Even though we can trivially create one above > // Fails at runtime with UnsupportedOperationException from > // ScalaReflection.serializeFor() > val ds = spark.read > .format("avro") > .option("compression", "snappy") > .load(path) > .select($"myColumn".as[Seq[MyBean]]) > } > {code} > What's particularly frustrating is that if we were using any user-defined > case class instead of the java bean type, this is not a problem, as the > structuring of the various implicit encoders in the related packages seems to > allow the {{ScalaReflection.serializeFor()}} method to work on arbitrary > {{scala.Product}} types... (There's an implicit encoder in > org.apache.spark.sql.Encoders that looks relevant) > I realize that there are workarounds, such as wrapping the types and then > using a simple {{.map()}}, or using kryo or java serialization, but my > understanding is that would mean giving up on potential Catalyst > optimizations... > It would be really nice if there were a simple way to tell > {{ScalaReflection.serializeFor()}} to look for/use other, potentially > user-defined encoders, especially if they could be generated from the factory > encoder methods supplied by Spark itself... > Alternatively, it would be exceptionally nice if calls to > {{ExpressionEncoder}}'s {{apply}} method would support expressions with types > that include {{java.util.List}} or arbitrary java bean types as well as > {{scala.Product}} types. > See > [here|https://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-dataset] > on Stackoverflow for other details... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
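To make the contrast in the description concrete: the same select works when the element type is a case class, because spark.implicits._ derives the encoder through ScalaReflection (the case class and column name below are illustrative):
{code:scala}
// Sketch: with a Product type instead of a Java bean, .as[Seq[...]] resolves
// via the implicit product-sequence encoder from spark.implicits._.
case class MyRecord(a: Int, b: String)

import spark.implicits._
val ds = spark.read
  .format("avro")
  .load(path)                           // `path` as in the example above
  .select($"myColumn".as[Seq[MyRecord]])
{code}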
[jira] [Updated] (SPARK-26948) Upgrade vertex and edge rowkeys to support multiple types?
[ https://issues.apache.org/jira/browse/SPARK-26948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26948: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Upgrade vertex and edge rowkeys to support multiple types? > -- > > Key: SPARK-26948 > URL: https://issues.apache.org/jira/browse/SPARK-26948 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 3.0.0 >Reporter: daile >Priority: Minor > > Currently only Long is supported, but most graph databases use a string as > the primary key. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27013) Consider adding support for external encoders when resolving org.apache.spark.sql.catalyst.encoders.ExpressionEncoder's apply method
[ https://issues.apache.org/jira/browse/SPARK-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27013: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Consider adding support for external encoders when resolving > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder's apply method > > > Key: SPARK-27013 > URL: https://issues.apache.org/jira/browse/SPARK-27013 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.0.0 >Reporter: Frank Ottey >Priority: Minor > > I recently discovered that, because most of the common implicit encoders > introduced by > {noformat}import spark.implicits._{noformat} > reduce to a call to {{ExpressionEncoder}}'s {{apply}} method, it's _very_ > difficult to generate and/or operate on {{Column}}s whose internal types > reduce to some Scala type that wraps an external type, even if an implicit > encoder for that external type is available or could be trivially generated. > See the example below: > {code:scala} > import com.example.MyBean > object Example { > implicit def BeanEncoder: Encoder[MyBean] = Encoders.bean(classOf[MyBean]) > > def main(args: Array[String]): Unit = { > val path = args(0) > val spark: SparkSession = ??? > import spark.implicits._ > // THE FOLLOWING DOES NOT WORK!!! > // implicit encoder for Seq[_] is found and used... > // Calls ExpressionEncoder's apply method > // Unwraps the inner type com.example.MyBean... > // ScalaReflection.serializeFor() cannot find encoder for our type > // Even though we can trivially create one above > // Fails at runtime with UnsupportedOperationException from > // ScalaReflection.serializeFor() > val ds = spark.read > .format("avro") > .option("compression", "snappy") > .load(path) > .select($"myColumn".as[Seq[MyBean]]) > } > {code} > What's particularly frustrating is that if we were using any user-defined > case class instead of the java bean type, this is not a problem, as the > structuring of the various implicit encoders in the related packages seems to > allow the {{ScalaReflection.serializeFor()}} method to work on arbitrary > {{scala.Product}} types... (There's an implicit encoder in > org.apache.spark.sql.Encoders that looks relevant) > I realize that there are workarounds, such as wrapping the types and then > using a simple {{.map()}}, or using kryo or java serialization, but my > understanding is that would mean giving up on potential Catalyst > optimizations... > It would be really nice if there were a simple way to tell > {{ScalaReflection.serializeFor()}} to look for/use other, potentially > user-defined encoders, especially if they could be generated from the factory > encoder methods supplied by Spark itself... > Alternatively, it would be exceptionally nice if calls to > {{ExpressionEncoder}}'s {{apply}} method would support expressions with types > that include {{java.util.List}} or arbitrary java bean types as well as > {{scala.Product}} types. > See > [here|https://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-dataset] > on Stackoverflow for other details... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org