[jira] [Assigned] (SPARK-28013) Upgrade to Kafka 2.2.1
[ https://issues.apache.org/jira/browse/SPARK-28013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28013: Assignee: (was: Apache Spark)
> Upgrade to Kafka 2.2.1
> --
>
> Key: SPARK-28013
> URL: https://issues.apache.org/jira/browse/SPARK-28013
> Project: Spark
> Issue Type: Improvement
> Components: Build, Structured Streaming
> Affects Versions: 3.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> This issue updates the Kafka dependency to 2.2.1 to bring in the following improvements and bug fixes.
> https://issues.apache.org/jira/projects/KAFKA/versions/12345010
[jira] [Assigned] (SPARK-28013) Upgrade to Kafka 2.2.1
[ https://issues.apache.org/jira/browse/SPARK-28013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28013: Assignee: Apache Spark
> Upgrade to Kafka 2.2.1
> --
>
> Key: SPARK-28013
> URL: https://issues.apache.org/jira/browse/SPARK-28013
> Project: Spark
> Issue Type: Improvement
> Components: Build, Structured Streaming
> Affects Versions: 3.0.0
> Reporter: Dongjoon Hyun
> Assignee: Apache Spark
> Priority: Major
>
> This issue updates the Kafka dependency to 2.2.1 to bring in the following improvements and bug fixes.
> https://issues.apache.org/jira/projects/KAFKA/versions/12345010
[jira] [Created] (SPARK-28013) Upgrade to Kafka 2.2.1
Dongjoon Hyun created SPARK-28013:
Summary: Upgrade to Kafka 2.2.1
Key: SPARK-28013
URL: https://issues.apache.org/jira/browse/SPARK-28013
Project: Spark
Issue Type: Improvement
Components: Build, Structured Streaming
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun

This issue updates the Kafka dependency to 2.2.1 to bring in the following improvements and bug fixes.
https://issues.apache.org/jira/projects/KAFKA/versions/12345010
[jira] [Commented] (SPARK-25520) The state of executors is KILLED on standalone
[ https://issues.apache.org/jira/browse/SPARK-25520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861760#comment-16861760 ] Dongjoon Hyun commented on SPARK-25520: --- In general, it looks like a `com.microsoft.sqlserver.jdbc.SQLServerDriver` issue, doesn't it? Could you try another database, like MySQL, with Apache Spark 2.4.3?
> The state of executors is KILLED on standalone
> ---
>
> Key: SPARK-25520
> URL: https://issues.apache.org/jira/browse/SPARK-25520
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.2.2, 2.3.1
> Environment: spark 2.3.1
> spark 2.2.2
> Java 1.8.0_131
> scala 2.11.8
> Reporter: AKSonic
> Priority: Major
> Attachments: Spark.zip
>
> I created a Spark standalone cluster (4 servers) using Spark 2.3.1. The job finishes and appears under Completed Drivers, and I can get the result from the driver log. But the status of all executors shows the KILLED state. The log shows the following:
> 2018-09-25 00:47:37 INFO CoarseGrainedExecutorBackend:54 - Driver commanded a shutdown
> 2018-09-25 00:47:37 ERROR CoarseGrainedExecutorBackend:43 - RECEIVED SIGNAL TERM
> I also tried Spark 2.2.2 and see the same issue in the UI: all executors end in KILLED status.
> Is this expected? What is the problem?
> Config
> *spark-env.sh:*
> export SPARK_PUBLIC_DNS=hostname1
> export SCALA_HOME=/opt/gpf/bigdata/scala-2.11.8
> export JAVA_HOME=/usr/java/jdk1.8.0_131
> export HADOOP_HOME=/opt/bigdata/hadoop-2.6.5
> export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
> export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=file:///spark/spark-event-dir -Dspark.history.ui.port=16066 -Dspark.history.retainedApplications=30 -Dspark.history.fs.cleaner.enabled=true -Dspark.history.fs.cleaner.interval=1d -Dspark.history.fs.cleaner.maxAge=7d"
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hostname1:2181,hostname2:2181,hostname3:2181 -Dspark.deploy.zookeeper.dir=/opt/bigdata/spark-2.3.1/zk-recovery-dir"
> SPARK_LOCAL_DIRS=/opt/bigdata/spark-2.3.1/local-dir
> SPARK_DRIVER_MEMORY=1G
> *spark-defaults.conf:*
> spark.eventLog.enabled true
> spark.eventLog.compress true
> spark.eventLog.dir file:///spark/spark-event-dir
> *slaves:*
> hostname1
> hostname2
> hostname3
> hostname4
> Testing
> *Testing program1 (Java):*
> public class JDBCApp {
>     private static final String DB_OLAP_UAT_URL = "jdbc:sqlserver://dbhost";
>     private static final String DB_DRIVER = "com.microsoft.sqlserver.jdbc.SQLServerDriver";
>     private static final String SQL_TEXT = "select top 10 * from table1";
>     private static final String DB_OLAP_UAT_USR = "";
>     private static final String DB_OLAP_UAT_PWD = "";
>     public static void main(String[] args) {
>         System.setProperty("spark.sql.warehouse.dir", "file:///bigdata/spark/spark-warehouse");
>         // Logger.getLogger("org.apache.spark").setLevel(Level.DEBUG);
>         SparkSession spark = SparkSession.builder().appName("JDBCApp").getOrCreate();
>         Dataset<Row> jdbcDF = spark.read().format("jdbc")
>             .option("driver", DB_DRIVER)
>             .option("url", DB_OLAP_UAT_URL)
>             .option("dbtable", "(" + SQL_TEXT + ") tmp")
>             .option("user", DB_OLAP_UAT_USR)
>             .option("password", DB_OLAP_UAT_PWD)
>             .load();
>         jdbcDF.show();
>     }
> }
> *Testing program2 (Java):*
> public class SimpleApp {
>     public static void main(String[] args) {
>         String filePath = args[0];
>         Logger logger = Logger.getLogger("org.apache.spark");
>         // logger.setLevel(Level.DEBUG);
>         SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
>         Dataset<String> logData = spark.read().textFile(filePath).cache();
>         long numAs = logData.filter((FilterFunction<String>) s -> s.contains("e")).count();
>         long numBs = logData.filter((FilterFunction<String>) s -> s.contains("r")).count();
>         logger.info("Lines with a: " + numAs + ", lines with b: " + numBs);
>         spark.stop();
>     }
> }
> You can run the above 2 testing programs and then see that the state of the executors is KILLED.
> -- CMD --
> ./spark-submit \
>   --driver-class-path /sharedata/mssql-jdbc-6.4.0.jre8.jar \
>   --jars /sharedata/mssql-jdbc-6.4.0.jre8.jar \
>   --class JDBCApp \
>   --master spark://hostname1:6066 \
>   --deploy-mode cluster \
>   --driver-memory 2G \
>   --executor-memory 2G \
>   --total-executor-cores 8 \
>   /sharedata/spark-demo-1.0-SNAPSHOT.jar
[jira] [Comment Edited] (SPARK-25246) When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
[ https://issues.apache.org/jira/browse/SPARK-25246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861755#comment-16861755 ] Dongjoon Hyun edited comment on SPARK-25246 at 6/12/19 5:10 AM:
Hi, [~shahid]. As [~devaraj.k] mentioned, you can use a different codec. The following is the `snappy` codec case, and I can see the entry in the Spark History UI with 2.3.3. The reported case is just the behavior of the LZ4 codec, although LZ4 is the default for Spark.
{code}
$ bin/spark-shell --conf spark.eventLog.compress=true --conf spark.io.compression.codec=snappy
2019-06-11 22:04:26 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1560315870593).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
{code}
was (Author: dongjoon):
Hi, [~shahid]. As [~devaraj.k] mentioned, you can use a different codec. The following is the `snappy` codec case, and I can see the entry in the Spark History UI with 2.3.3. This is the behavior of the LZ4 codec.
{code}
$ bin/spark-shell --conf spark.eventLog.compress=true --conf spark.io.compression.codec=snappy
2019-06-11 22:04:26 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1560315870593).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
{code}
> When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
>
> Key: SPARK-25246
> URL: https://issues.apache.org/jira/browse/SPARK-25246
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.1
> Reporter: shahid
> Priority: Major
>
> 1) bin/spark-shell --master yarn --conf "spark.eventLog.compress=true"
> 2) hdfs dfs -ls /spark-logs
> {code:java}
> -rwxrwx--- 1 root supergroup *0* 2018-08-27 03:26 /spark-logs/application_1535313809919_0005.lz4.inprogress
> {code}
[jira] [Resolved] (SPARK-25246) When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
[ https://issues.apache.org/jira/browse/SPARK-25246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25246. --- Resolution: Not A Problem
> When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
>
> Key: SPARK-25246
> URL: https://issues.apache.org/jira/browse/SPARK-25246
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.1
> Reporter: shahid
> Priority: Major
>
> 1) bin/spark-shell --master yarn --conf "spark.eventLog.compress=true"
> 2) hdfs dfs -ls /spark-logs
> {code:java}
> -rwxrwx--- 1 root supergroup *0* 2018-08-27 03:26 /spark-logs/application_1535313809919_0005.lz4.inprogress
> {code}
[jira] [Commented] (SPARK-25246) When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
[ https://issues.apache.org/jira/browse/SPARK-25246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861755#comment-16861755 ] Dongjoon Hyun commented on SPARK-25246:
---
Hi, [~shahid]. As [~devaraj.k] mentioned, you can use a different codec. The following is the `snappy` codec case, and I can see the entry in the Spark History UI with 2.3.3. This is the behavior of the LZ4 codec.
{code}
$ bin/spark-shell --conf spark.eventLog.compress=true --conf spark.io.compression.codec=snappy
2019-06-11 22:04:26 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1560315870593).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
{code}
> When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
>
> Key: SPARK-25246
> URL: https://issues.apache.org/jira/browse/SPARK-25246
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.1
> Reporter: shahid
> Priority: Major
>
> 1) bin/spark-shell --master yarn --conf "spark.eventLog.compress=true"
> 2) hdfs dfs -ls /spark-logs
> {code:java}
> -rwxrwx--- 1 root supergroup *0* 2018-08-27 03:26 /spark-logs/application_1535313809919_0005.lz4.inprogress
> {code}
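For reference, a sketch of the same codec switch applied programmatically rather than via --conf flags. This mirrors the reporter's event-log setup plus the workaround from the comment; the application name is made up:
{code:scala}
import org.apache.spark.sql.SparkSession

// Equivalent of: bin/spark-shell --conf spark.eventLog.compress=true \
//                                --conf spark.io.compression.codec=snappy
val spark = SparkSession.builder()
  .appName("SnappyEventLogCheck") // hypothetical name
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.compress", "true")
  .config("spark.io.compression.codec", "snappy")
  .getOrCreate()
{code}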
[jira] [Comment Edited] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861751#comment-16861751 ] hemshankar sahu edited comment on SPARK-27891 at 6/12/19 4:54 AM: --
Updating this to critical, as we use Spark Streaming, which is expected to run for more than 1 week.
was (Author: hemshankar_sahu): Updating this to critical, as we use Spark Streaming, which runs for more than 1 week.
> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
> Issue Type: Bug
> Components: Security
> Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
> Reporter: hemshankar sahu
> Priority: Critical
> Attachments: application_1559242207407_0001.log, spark_2.3.1_failure.log
>
> When a Spark job runs on a secured cluster for longer than the time mentioned in the dfs.namenode.delegation.token.renew-interval property of hdfs-site.xml, the job fails.
> The following command was used to submit the Spark job:
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py /tmp/ff1.txt
>
> Application logs attached.
[jira] [Updated] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemshankar sahu updated SPARK-27891: Priority: Critical (was: Major)
Updating this to critical, as we use Spark Streaming, which runs for more than 1 week.
> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
> Issue Type: Bug
> Components: Security
> Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
> Reporter: hemshankar sahu
> Priority: Critical
> Attachments: application_1559242207407_0001.log, spark_2.3.1_failure.log
>
> When a Spark job runs on a secured cluster for longer than the time mentioned in the dfs.namenode.delegation.token.renew-interval property of hdfs-site.xml, the job fails.
> The following command was used to submit the Spark job:
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py /tmp/ff1.txt
>
> Application logs attached.
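For context, the --principal/--keytab pair shown in the report is what lets Spark on YARN re-login and obtain fresh delegation tokens past the renew interval. A sketch of the equivalent configuration keys, using the Spark 2.x property names; the keytab path is hypothetical:
{code:scala}
import org.apache.spark.SparkConf

// Programmatic equivalent of --principal / --keytab on YARN (Spark 2.x naming).
val conf = new SparkConf()
  .set("spark.yarn.principal", "acekrbuser")
  .set("spark.yarn.keytab", "/home/acekrbuser/keytabs/acekrbuser.keytab") // hypothetical path
{code}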
[jira] [Assigned] (SPARK-28012) Hive UDF supports literal struct type
[ https://issues.apache.org/jira/browse/SPARK-28012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28012: Assignee: Apache Spark
> Hive UDF supports literal struct type
> -
>
> Key: SPARK-28012
> URL: https://issues.apache.org/jira/browse/SPARK-28012
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: dzcxzl
> Assignee: Apache Spark
> Priority: Trivial
>
> Currently, calling a Hive UDF with a literal struct-type parameter reports an error:
> No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't support the constant type [StructType(StructField(name,StringType,true), StructField(value,DecimalType(3,1),true))]
[jira] [Assigned] (SPARK-28012) Hive UDF supports literal struct type
[ https://issues.apache.org/jira/browse/SPARK-28012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28012: Assignee: (was: Apache Spark)
> Hive UDF supports literal struct type
> -
>
> Key: SPARK-28012
> URL: https://issues.apache.org/jira/browse/SPARK-28012
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: dzcxzl
> Priority: Trivial
>
> Currently, calling a Hive UDF with a literal struct-type parameter reports an error:
> No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't support the constant type [StructType(StructField(name,StringType,true), StructField(value,DecimalType(3,1),true))]
[jira] [Created] (SPARK-28012) Hive UDF supports literal struct type
dzcxzl created SPARK-28012:
--
Summary: Hive UDF supports literal struct type
Key: SPARK-28012
URL: https://issues.apache.org/jira/browse/SPARK-28012
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: dzcxzl

Currently, calling a Hive UDF with a literal struct-type parameter reports an error:
No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't support the constant type [StructType(StructField(name,StringType,true), StructField(value,DecimalType(3,1),true))]
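A sketch of how such a literal struct argument can arise, assuming Hive support is enabled and using 'xxxUDF' as a stand-in for the unnamed UDF class from the report (the registration class is hypothetical):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
// Hypothetical registration; the real UDF class is not named in the report.
spark.sql("CREATE TEMPORARY FUNCTION xxxUDF AS 'com.example.XxxUDF'")
// named_struct builds a literal struct<name:string, value:decimal(3,1)>,
// matching the constant type in the reported error message.
spark.sql(
  "SELECT xxxUDF(named_struct('name', 'a', 'value', CAST(1.5 AS DECIMAL(3,1))))"
).show()
{code}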
[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861749#comment-16861749 ] Hyukjin Kwon commented on SPARK-27463: -- BTW, I think we have cogroup on Dataset on the Scala side. How is this different from that?
> Support Dataframe Cogroup via Pandas UDFs
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Chris Martin
> Priority: Major
>
> Recent work on Pandas UDFs in Spark has allowed for improved interoperability between Pandas and Spark. This proposal aims to extend this by introducing a new Pandas UDF type which would allow for a cogroup operation to be applied to two PySpark DataFrames.
> Full details are in the google document linked below.
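For comparison, the existing Scala-side cogroup the comment refers to, as a minimal self-contained sketch (the record type and data are made up for illustration):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("CogroupSketch").getOrCreate()
import spark.implicits._

case class Rec(id: Int, v: Double) // hypothetical record type

val left  = Seq(Rec(1, 1.0), Rec(1, 2.0), Rec(2, 3.0)).toDS().groupByKey(_.id)
val right = Seq(Rec(1, 10.0), Rec(2, 20.0)).toDS().groupByKey(_.id)

// For each key, receive both groups' iterators and emit any number of rows.
val combined = left.cogroup(right) { (key, xs, ys) =>
  Iterator((key, xs.map(_.v).sum, ys.map(_.v).sum))
}
combined.show()
{code}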
[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861747#comment-16861747 ] Hyukjin Kwon commented on SPARK-27463: -- [~d80tb7] are you still working on this? If there are some API references in Pandas, I think we can just mimic them. If so, can you just open a PR? cc [~icexelloss] as well.
> Support Dataframe Cogroup via Pandas UDFs
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Chris Martin
> Priority: Major
>
> Recent work on Pandas UDFs in Spark has allowed for improved interoperability between Pandas and Spark. This proposal aims to extend this by introducing a new Pandas UDF type which would allow for a cogroup operation to be applied to two PySpark DataFrames.
> Full details are in the google document linked below.
[jira] [Created] (SPARK-28011) SQL parse error when there are too many aliases in the table
U Shaw created SPARK-28011:
--
Summary: SQL parse error when there are too many aliases in the table
Key: SPARK-28011
URL: https://issues.apache.org/jira/browse/SPARK-28011
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.1
Reporter: U Shaw

A SQL syntax error is reported when the following statement is executed:
..
FROM menu_item_categories_tmp t1
LEFT JOIN menu_item_categories_tmp t2 ON t1.icat_id = t2.icat_parent_icat_id AND t1.tenant_id = t2.tenant_id AND t2.icat_status != 'd'
LEFT JOIN menu_item_categories_tmp t3 ON t2.icat_id = t3.icat_parent_icat_id AND t2.tenant_id = t3.tenant_id AND t3.icat_status != 'd'
LEFT JOIN menu_item_categories_tmp t4 ON t3.icat_id = t4.icat_parent_icat_id AND t3.tenant_id = t4.tenant_id AND t4.icat_status != 'd'
LEFT JOIN menu_item_categories_tmp t5 ON t4.icat_id = t5.icat_parent_icat_id AND t4.tenant_id = t5.tenant_id AND t5.icat_status != 'd'
LEFT JOIN menu_item_categories_tmp t6 ON t5.icat_id = t6.icat_parent_icat_id AND t5.tenant_id = t6.tenant_id AND t6.icat_status != 'd'
WHERE t1.icat_parent_icat_id = '0' AND t1.icat_status != 'd'
)
SELECT DISTINCT tenant_id AS tenant_id, type AS type,
  CASE WHEN t2.num >= 1 THEN level0 ELSE NULL END AS level0,
  CASE WHEN t2.num >= 2 THEN level1 ELSE NULL END AS level1,
  CASE WHEN t2.num >= 3 THEN level2 ELSE NULL END AS level2,
  CASE ..
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861728#comment-16861728 ] Henry Yu commented on SPARK-27812: -- OK [~dongjoon]
> kubernetes client import non-daemon thread which block jvm exit.
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.3
> Reporter: Henry Yu
> Priority: Major
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit because of an OkHttp WebSocket non-daemon thread.
[jira] [Updated] (SPARK-27741) Transitivity on predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-27741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] U Shaw updated SPARK-27741: --- Fix Version/s: 2.4.0
> Transitivity on predicate pushdown
> ---
>
> Key: SPARK-27741
> URL: https://issues.apache.org/jira/browse/SPARK-27741
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: U Shaw
> Priority: Major
> Fix For: 2.4.0
>
> When using an inner join, WHERE conditions can be propagated into the join condition, but when using an outer join, even if the conditions are the same, the predicate is only pushed down to the left or right side.
> As follows:
> select * from t1 left join t2 on t1.id=t2.id where t1.id=1
> --> select * from t1 left join t2 on t1.id=t2.id and t2.id=1 where t1.id=1
> Can Catalyst support transitivity in predicate pushdown?
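One way to observe the current behavior is to inspect the optimized plan for the query from the description. A spark-shell sketch, with made-up table contents, assuming an existing SparkSession named `spark`:
{code:scala}
import spark.implicits._

Seq((1, "a"), (2, "b")).toDF("id", "v").createOrReplaceTempView("t1")
Seq((1, "x"), (3, "y")).toDF("id", "w").createOrReplaceTempView("t2")

// Check whether a `t2.id = 1` filter is inferred and pushed below the outer join.
spark.sql("SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t1.id = 1").explain(true)
{code}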
[jira] [Created] (SPARK-28010) Support ORDER BY ... USING syntax
Lantao Jin created SPARK-28010:
--
Summary: Support ORDER BY ... USING syntax
Key: SPARK-28010
URL: https://issues.apache.org/jira/browse/SPARK-28010
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Lantao Jin

Currently, Spark SQL's ORDER BY is:
{code:sql}
ORDER BY expression [ ASC | DESC ]

SELECT * FROM tab ORDER BY col ASC
{code}
Can Spark SQL support syntax like the following, as in PostgreSQL's ORDER BY:
{code:sql}
ORDER BY expression [ ASC | DESC | USING operator ]

SELECT * FROM tab ORDER BY col USING <
{code}
[jira] [Resolved] (SPARK-27701) Extend NestedColumnAliasing to more nested field cases
[ https://issues.apache.org/jira/browse/SPARK-27701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27701. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 3.0.0
This is resolved via https://github.com/apache/spark/pull/24599
> Extend NestedColumnAliasing to more nested field cases
> --
>
> Key: SPARK-27701
> URL: https://issues.apache.org/jira/browse/SPARK-27701
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Liang-Chi Hsieh
> Assignee: Liang-Chi Hsieh
> Priority: Major
> Fix For: 3.0.0
>
> {{NestedColumnAliasing}} rule covers {{GetStructField}} only, currently. It means that some nested field extraction expressions aren't pruned. For example, if only accessing a nested field in an array of struct, this column isn't pruned. This patch extends the rule to cover such cases.
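To illustrate the kind of query the extended rule covers, a small spark-shell sketch (the schema is made up; with a columnar source such as Parquet, only `items.name` would need to be read rather than the whole struct):
{code:scala}
import spark.implicits._ // assumes an existing SparkSession named `spark`

case class Item(name: String, price: Double) // hypothetical nested struct

val df = Seq((1, Seq(Item("apple", 1.0), Item("pear", 2.0)))).toDF("id", "items")

// Extracting a single field from an array of structs: the case that
// NestedColumnAliasing previously did not prune.
df.select($"items.name").show()
{code}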
[jira] [Commented] (SPARK-27768) Infinity, -Infinity, NaN should be recognized in a case insensitive manner
[ https://issues.apache.org/jira/browse/SPARK-27768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861711#comment-16861711 ] Dongjoon Hyun commented on SPARK-27768: --- For this issue, I changed my mind and now support this approach, because I agree with [~smilegator]'s big picture. Sorry for the delay on this issue.
> Infinity, -Infinity, NaN should be recognized in a case insensitive manner
> --
>
> Key: SPARK-27768
> URL: https://issues.apache.org/jira/browse/SPARK-27768
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Xiao Li
> Priority: Major
>
> When the inputs contain the constant 'infinity', Spark SQL does not generate the expected results.
> {code:java}
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES ('1'), (CAST('infinity' AS DOUBLE))) v(x);
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES ('infinity'), ('1')) v(x);
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES ('infinity'), ('infinity')) v(x);
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES ('-infinity'), ('infinity')) v(x);
> {code}
> The root cause: Spark SQL does not recognize the special constants in a case insensitive way. In PostgreSQL, they are recognized in a case insensitive way.
> Link: https://www.postgresql.org/docs/9.3/datatype-numeric.html
[jira] [Commented] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
[ https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861707#comment-16861707 ] Junjie Chen commented on SPARK-27499: --- Yes. In KubernetesExecutorBuilder.scala, the LocalDirsFeatureStep is built before MountVolumesFeatureStep, which means we cannot use any volumes mounted later. I think we should build the local-dirs feature last, so that we can check whether the directories in SPARK_LOCAL_DIRS are set to mounted volumes (hostPath, PV, or others we may support later) and use those as local storage. That way, we can utilize the specified media to improve local storage performance instead of relying on emptyDir, which is just an ephemeral directory on the node.
> Support mapping spark.local.dir to hostPath volume
> --
>
> Key: SPARK-27499
> URL: https://issues.apache.org/jira/browse/SPARK-27499
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 3.0.0
> Reporter: Junjie Chen
> Priority: Minor
>
> Currently, the k8s executor builder mounts spark.local.dir as emptyDir or memory. This satisfies small workloads, but under heavy workloads like TPC-DS both can have problems, such as pods being evicted due to disk pressure when using emptyDir, and OOM when using tmpfs.
> In particular, in a cloud environment, users may allocate a cluster with a minimum configuration and add cloud storage when running a workload. In this case, we could specify multiple elastic storage volumes as spark.local.dir to accelerate spilling.
[jira] [Resolved] (SPARK-21136) Misleading error message for typo in SQL
[ https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang resolved SPARK-21136. -- Resolution: Fixed
> Misleading error message for typo in SQL
>
> Key: SPARK-21136
> URL: https://issues.apache.org/jira/browse/SPARK-21136
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Daniel Darabos
> Assignee: Yesheng Ma
> Priority: Minor
>
> {code}
> scala> spark.sql("select * from a left joinn b on a.id = b.id").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'from' expecting {<EOF>, 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)
>
> == SQL ==
> select * from a left joinn b on a.id = b.id
> ---------^^^
> {code}
> The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in themselves, a misleading error like this can hinder debugging substantially.
> I tried to see if maybe I could fix this. Am I correct to deduce that the error message originates in ANTLR4, which parses the query based on the syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out how that syntax definition works, and why it misattributes the error.
[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
[ https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861646#comment-16861646 ] Hyukjin Kwon commented on SPARK-28006: -- The proposal itself seems to make sense to me from a cursory look. One concern, though, is that I don't think Spark has such a type of window function. cc [~hvanhovell] as well. I suspect the output is the same as our grouped map Pandas UDF, if I understood correctly? It might be helpful to show the output so that non-Python guys could understand how it works as well :-).
> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 2.4.3
> Reporter: Li Jin
> Priority: Major
>
> Currently, pandas_udf supports the "grouped aggregate" type, which can be used with bounded and unbounded windows. There is another set of use cases that can benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as an N -> N mapping over a group. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".
>
> Currently, in order to do this, the user needs to use "grouped apply", for example:
>
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def zscore(pdf):
>     v = pdf['v']
>     pdf['v'] = (v - v.mean()) / v.std()
>     return pdf
>
> df.groupby('id').apply(zscore)
> {code}
> This approach has a few downsides:
>
> * Specifying the full return schema is complicated for the user although the function only changes one column.
> * The column name 'v' is hard-coded inside the udf, which makes the udf less reusable.
> * The entire dataframe is serialized to pass to Python although only one column is needed.
> Here we propose a new type of pandas_udf to work with these types of use cases:
> {code:java}
> @pandas_udf('double', GROUPED_XFORM)
> def zscore(v):
>     return (v - v.mean()) / v.std()
>
> w = Window.partitionBy('id')
> df = df.withColumn('v_zscore', zscore(df['v']).over(w))
> {code}
> Which addresses the above downsides:
> * The user only needs to specify the output type of a single column.
> * The column being zscored is decoupled from the udf implementation.
> * We only need to send one column to the Python worker and concat the result with the original dataframe (this is what grouped aggregate is doing already).
[jira] [Created] (SPARK-28009) PipedRDD: Block not locked for reading failure
Douglas Colkitt created SPARK-28009:
---
Summary: PipedRDD: Block not locked for reading failure
Key: SPARK-28009
URL: https://issues.apache.org/jira/browse/SPARK-28009
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.0
Environment: Running in a Docker container with Spark 2.4.0 on Linux kernel 4.9.0
Reporter: Douglas Colkitt

A PipedRDD operation fails with the stack trace below. The failure primarily occurs when the STDOUT from the Unix process is small and the STDIN into the Unix process is comparatively much larger. Given the similarity to SPARK-18406, this seems to be due to a race condition in accessing the block's reader lock. The PipedRDD implementation spawns the STDIN writer in a separate thread, which corroborates the race-condition hypothesis.

java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:170)
	at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
	at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:842)
	at org.apache.spark.storage.BlockManager.releaseLockAndDispose(BlockManager.scala:1610)
	at org.apache.spark.storage.BlockManager$$anonfun$2.apply$mcV$sp(BlockManager.scala:621)
	at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:33)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at org.apache.spark.rdd.PipedRDD$$anon$3.run(PipedRDD.scala:145)
	Suppressed: java.lang.AssertionError: assertion failed
		at scala.Predef$.assert(Predef.scala:156)
		at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
		at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
		at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:363)
		at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
		at scala.Option.foreach(Option.scala:257)
		at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:362)
		at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:358)
		at scala.collection.Iterator$class.foreach(Iterator.scala:891)
		at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
		at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:358)
		at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:858)
		at org.apache.spark.executor.Executor$TaskRunner$$anonfun$1.apply$mcV$sp(Executor.scala:409)
		at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1369)
		at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
		at java.lang.Thread.run(Thread.java:748)
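A minimal sketch of the shape of job the description points at: a large STDIN feeding a process whose STDOUT is tiny because it exits early. The command and sizes are illustrative, not from the report; `sc` is the spark-shell SparkContext:
{code:scala}
// `head -n 1` consumes one line and exits, so the stdin-writer thread
// spawned by PipedRDD may still be writing when the task completes.
val out = sc.parallelize(1 to 1000000, 4).pipe("head -n 1")
out.collect().foreach(println)
{code}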
[jira] [Created] (SPARK-28008) Default values & column comments in AVRO schema converters
Mathew Wicks created SPARK-28008:
Summary: Default values & column comments in AVRO schema converters
Key: SPARK-28008
URL: https://issues.apache.org/jira/browse/SPARK-28008
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.3
Reporter: Mathew Wicks

Currently in both `toAvroType` and `toSqlType` [SchemaConverters.scala#L134|https://github.com/apache/spark/blob/branch-2.4/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L134] there are two behaviours which are unexpected.

h2. Nullable fields in Spark are converted to UNION[TYPE, NULL] and no default value is set:

*Current Behaviour:*
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = true)
val avroSchema = SchemaConverters.toAvroType(schema)
println(avroSchema.toString(true))

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : [ "string", "null" ]
  } ]
}
{code}
*Expected Behaviour:* (NOTE: The reversal of "null" & "string" in the union, needed for a default value of null)
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = true)
val avroSchema = SchemaConverters.toAvroType(schema)
println(avroSchema.toString(true))

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
{code}

h2. Field comments/metadata is not propagated:

*Current Behaviour:*
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable=false, comment="AAA")
val avroSchema = SchemaConverters.toAvroType(schema)
println(avroSchema.toString(true))

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : "string"
  } ]
}
{code}
*Expected Behaviour:*
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable=false, comment="AAA")
val avroSchema = SchemaConverters.toAvroType(schema)
println(avroSchema.toString(true))

{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : "string",
    "doc" : "AAA"
  } ]
}
{code}

The behaviour should be similar (but the reverse) for `toSqlType`.
I think we should aim to get this in before 3.0, as it will probably be a breaking change for some usage of the AVRO API.
[jira] [Updated] (SPARK-28007) Caret operator (^) means bitwise XOR in Spark/Hive and exponentiation in Postgres/Redshift
[ https://issues.apache.org/jira/browse/SPARK-28007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-28007: --- Summary: Caret operator (^) means bitwise XOR in Spark/Hive and exponentiation in Postgres/Redshift (was: Caret operator (^) means bitwise XOR in Spark and exponentiation in Postgres/Redshift)
> Caret operator (^) means bitwise XOR in Spark/Hive and exponentiation in Postgres/Redshift
> --
>
> Key: SPARK-28007
> URL: https://issues.apache.org/jira/browse/SPARK-28007
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Josh Rosen
> Priority: Major
>
> The expression {{expr1 ^ expr2}} has different meanings in Spark and Postgres:
> * [In Postgres|https://www.postgresql.org/docs/11/functions-math.html] and [Redshift|https://docs.aws.amazon.com/redshift/latest/dg/r_OPERATOR_SYMBOLS.html], this returns {{expr1}} raised to the exponent {{expr2}} (additionally, the Postgres docs explicitly state that this operation is left-associative).
> * [In Spark|https://spark.apache.org/docs/2.4.3/api/sql/index.html#_14] and [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ArithmeticOperators], this returns the bitwise exclusive OR of {{expr1}} and {{expr2}}.
> I'm reporting this under the Postgres compatibility umbrella. If we have SQL dialect support (e.g. a Postgres compatibility dialect), maybe this behavior could be flagged there?
[jira] [Created] (SPARK-28007) Caret operator (^) means bitwise XOR in Spark and exponentiation in Postgres/Redshift
Josh Rosen created SPARK-28007:
--
Summary: Caret operator (^) means bitwise XOR in Spark and exponentiation in Postgres/Redshift
Key: SPARK-28007
URL: https://issues.apache.org/jira/browse/SPARK-28007
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen

The expression {{expr1 ^ expr2}} has different meanings in Spark and Postgres:
* [In Postgres|https://www.postgresql.org/docs/11/functions-math.html] and [Redshift|https://docs.aws.amazon.com/redshift/latest/dg/r_OPERATOR_SYMBOLS.html], this returns {{expr1}} raised to the exponent {{expr2}} (additionally, the Postgres docs explicitly state that this operation is left-associative).
* [In Spark|https://spark.apache.org/docs/2.4.3/api/sql/index.html#_14] and [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ArithmeticOperators], this returns the bitwise exclusive OR of {{expr1}} and {{expr2}}.
I'm reporting this under the Postgres compatibility umbrella. If we have SQL dialect support (e.g. a Postgres compatibility dialect), maybe this behavior could be flagged there?
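A quick illustration of the divergence, runnable in spark-shell (the values are chosen for contrast):
{code:scala}
// In Spark/Hive, ^ is bitwise XOR: 3 ^ 5 == 6 (0b011 XOR 0b101 = 0b110).
spark.sql("SELECT 3 ^ 5").show()
// The same text in Postgres/Redshift evaluates as exponentiation: 3 ^ 5 = 243.
{code}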
[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
[ https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861559#comment-16861559 ] Li Jin commented on SPARK-28006: cc [~hyukjin.kwon] [~LI,Xiao] [~ueshin] [~bryanc] I think code-wise this is pretty simple, but since this is adding a new pandas udf type I'd like to get some feedback on this.
> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 2.4.3
> Reporter: Li Jin
> Priority: Major
>
> Currently, pandas_udf supports the "grouped aggregate" type, which can be used with bounded and unbounded windows. There is another set of use cases that can benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as an N -> N mapping over a group. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".
>
> Currently, in order to do this, the user needs to use "grouped apply", for example:
>
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def zscore(pdf):
>     v = pdf['v']
>     pdf['v'] = (v - v.mean()) / v.std()
>     return pdf
>
> df.groupby('id').apply(zscore)
> {code}
> This approach has a few downsides:
>
> * Specifying the full return schema is complicated for the user although the function only changes one column.
> * The column name 'v' is hard-coded inside the udf, which makes the udf less reusable.
> * The entire dataframe is serialized to pass to Python although only one column is needed.
> Here we propose a new type of pandas_udf to work with these types of use cases:
> {code:java}
> @pandas_udf('double', GROUPED_XFORM)
> def zscore(v):
>     return (v - v.mean()) / v.std()
>
> w = Window.partitionBy('id')
> df = df.withColumn('v_zscore', zscore(df['v']).over(w))
> {code}
> Which addresses the above downsides:
> * The user only needs to specify the output type of a single column.
> * The column being zscored is decoupled from the udf implementation.
> * We only need to send one column to the Python worker and concat the result with the original dataframe (this is what grouped aggregate is doing already).
[jira] [Updated] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
[ https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-28006: ---
Description:
Currently, pandas_udf supports the "grouped aggregate" type, which can be used with bounded and unbounded windows. There is another set of use cases that can benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as an N -> N mapping over a group. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".

Currently, in order to do this, the user needs to use "grouped apply", for example:

{code:java}
@pandas_udf(schema, GROUPED_MAP)
def zscore(pdf):
    v = pdf['v']
    pdf['v'] = (v - v.mean()) / v.std()
    return pdf

df.groupby('id').apply(zscore)
{code}

This approach has a few downsides:
* Specifying the full return schema is complicated for the user although the function only changes one column.
* The column name 'v' is hard-coded inside the udf, which makes the udf less reusable.
* The entire dataframe is serialized to pass to Python although only one column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:

{code:java}
@pandas_udf('double', GROUPED_XFORM)
def zscore(v):
    return (v - v.mean()) / v.std()

w = Window.partitionBy('id')
df = df.withColumn('v_zscore', zscore(df['v']).over(w))
{code}

Which addresses the above downsides:
* The user only needs to specify the output type of a single column.
* The column being zscored is decoupled from the udf implementation.
* We only need to send one column to the Python worker and concat the result with the original dataframe (this is what grouped aggregate is doing already).

was:
Currently, pandas_udf supports the "grouped aggregate" type, which can be used with bounded and unbounded windows. There is another set of use cases that can benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as an N -> N mapping over a group. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".

Currently, in order to do this, the user needs to use "grouped apply", for example:

{code:java}
@pandas_udf(schema, GROUPED_MAP)
def zscore(pdf):
    v = pdf['v']
    pdf['v'] = (v - v.mean()) / v.std()
    return pdf

df.groupby('id').apply(zscore)
{code}

This approach has a few downsides:
* Specifying the full return schema is complicated for the user although the function only changes one column.
* The column name 'v' is hard-coded inside the udf, which makes the udf less reusable.
* The entire dataframe is serialized to pass to Python although only one column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:

{code:java}
@pandas_udf('double', GROUPED_XFORM)
def zscore(v):
    return (v - v.mean()) / v.std()

w = Window.partitionBy('id')
df = df.withColumn('v_zscore', zscore(df['v']).over(w))
{code}

Which addresses the above downsides.

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 2.4.3
> Reporter: Li Jin
> Priority: Major
>
> Currently, pandas_udf supports the "grouped aggregate" type, which can be used with bounded and unbounded windows. There is another set of use cases that can benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as an N -> N mapping over a group. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".
>
> Currently, in order to do this, the user needs to use "grouped apply", for example:
>
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def zscore(pdf):
>     v = pdf['v']
>     pdf['v'] = (v - v.mean()) / v.std()
>     return pdf
>
> df.groupby('id').apply(zscore)
> {code}
> This approach has a few downsides:
>
> * Specifying the full return schema is complicated for the user although the function only changes one column.
> * The column name 'v' is hard-coded inside the udf, which makes the udf less reusable.
> * The entire dataframe is serialized to pass to Python although only one column is needed.
> Here we propose a new type of pandas_udf to work with these types of use cases:
> {code:java}
> @pandas_udf('double', GROUPED_XFORM)
> def zscore(v):
>     return (v - v.mean()) / v.std()
>
> w = Window.partitionBy('id')
> df = df.withColumn('v_zscore', zscore(df['v']).over(w))
> {code}
> Which addresses the above downsides:
> * The user only needs to specify the output type of a single column.
> * The column being zscored is decoupled from the udf implementation.
[jira] [Created] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
Li Jin created SPARK-28006:
--
Summary: User-defined grouped transform pandas_udf for window operations
Key: SPARK-28006
URL: https://issues.apache.org/jira/browse/SPARK-28006
Project: Spark
Issue Type: New Feature
Components: PySpark
Affects Versions: 2.4.3
Reporter: Li Jin

Currently, pandas_udf supports the "grouped aggregate" type, which can be used with bounded and unbounded windows. There is another set of use cases that can benefit from a "grouped transform" type pandas_udf.

Grouped transform is defined as an N -> N mapping over a group. For example, "compute zscore for values in the group using the grouped mean and grouped stdev", or "rank the values in the group".

Currently, in order to do this, the user needs to use "grouped apply", for example:

{code:java}
@pandas_udf(schema, GROUPED_MAP)
def zscore(pdf):
    v = pdf['v']
    pdf['v'] = (v - v.mean()) / v.std()
    return pdf

df.groupby('id').apply(zscore)
{code}

This approach has a few downsides:
* Specifying the full return schema is complicated for the user although the function only changes one column.
* The column name 'v' is hard-coded inside the udf, which makes the udf less reusable.
* The entire dataframe is serialized to pass to Python although only one column is needed.

Here we propose a new type of pandas_udf to work with these types of use cases:

{code:java}
@pandas_udf('double', GROUPED_XFORM)
def zscore(v):
    return (v - v.mean()) / v.std()

w = Window.partitionBy('id')
df = df.withColumn('v_zscore', zscore(df['v']).over(w))
{code}

Which addresses the above downsides.
[jira] [Assigned] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
[ https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28003: Assignee: (was: Apache Spark)
> spark.createDataFrame with Arrow doesn't work with pandas.NaT
> --
>
> Key: SPARK-28003
> URL: https://issues.apache.org/jira/browse/SPARK-28003
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.3.3, 2.4.3
> Reporter: Li Jin
> Priority: Major
>
> {code:java}
> import pandas as pd
>
> dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
> pdf1 = pd.DataFrame({'time': dt1})
> df1 = self.spark.createDataFrame(pdf1)
> {code}
> The example above doesn't work with Arrow enabled.
[jira] [Assigned] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
[ https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28003: Assignee: Apache Spark
> spark.createDataFrame with Arrow doesn't work with pandas.NaT
> --
>
> Key: SPARK-28003
> URL: https://issues.apache.org/jira/browse/SPARK-28003
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.3.3, 2.4.3
> Reporter: Li Jin
> Assignee: Apache Spark
> Priority: Major
>
> {code:java}
> import pandas as pd
>
> dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
> pdf1 = pd.DataFrame({'time': dt1})
> df1 = self.spark.createDataFrame(pdf1)
> {code}
> The example above doesn't work with Arrow enabled.
[jira] [Commented] (SPARK-28005) SparkRackResolver should not log for resolving empty list
[ https://issues.apache.org/jira/browse/SPARK-28005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861410#comment-16861410 ] Imran Rashid commented on SPARK-28005: -- cc [~cltlfcjin]
> SparkRackResolver should not log for resolving empty list
> -
>
> Key: SPARK-28005
> URL: https://issues.apache.org/jira/browse/SPARK-28005
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 3.0.0
> Reporter: Imran Rashid
> Priority: Major
>
> After SPARK-13704, {{SparkRackResolver}} generates an INFO message every time it is called with 0 arguments:
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/SparkRackResolver.scala#L73-L76
> That actually happens every 1s when there are no active executors, because of the repeated offers that happen as part of delay scheduling:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L134-L139
> While this is relatively benign, it's a pretty annoying thing to be logging at INFO level every 1 second.
> This is easy to reproduce -- in spark-shell, with dynamic allocation, set the log level to info and see the logs appear every 1 second. Then run something and see the msgs stop. After the executors time out, see the msgs reappear.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>
> sc.setLogLevel("info")
> Thread.sleep(5000)
> sc.parallelize(1 to 10).count()
>
> // Exiting paste mode, now interpreting.
>
> 19/06/11 12:43:40 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:41 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:42 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:43 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:44 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:45 INFO spark.SparkContext: Starting job: count at <console>:28
> 19/06/11 12:43:45 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:28) with 2 output partitions
> 19/06/11 12:43:45 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (count at <console>:28)
> ...
> 19/06/11 12:43:54 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
> 19/06/11 12:43:54 INFO scheduler.DAGScheduler: ResultStage 0 (count at <console>:28) finished in 9.548 s
> 19/06/11 12:43:54 INFO scheduler.DAGScheduler: Job 0 finished: count at <console>:28, took 9.613049 s
> res2: Long = 10
>
> scala>
> ...
> 19/06/11 12:44:56 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:44:57 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:44:58 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> 19/06/11 12:44:59 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
> ...
> {noformat}
[jira] [Created] (SPARK-28005) SparkRackResolver should not log for resolving empty list
Imran Rashid created SPARK-28005: Summary: SparkRackResolver should not log for resolving empty list Key: SPARK-28005 URL: https://issues.apache.org/jira/browse/SPARK-28005 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 3.0.0 Reporter: Imran Rashid After SPARK-13704, {{SparkRackResolver}} generates an INFO message every time it is called with an empty list of hosts: https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/SparkRackResolver.scala#L73-L76 That actually happens every 1s when there are no active executors, because of the repeated offers that happen as part of delay scheduling: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L134-L139 While this is relatively benign, it's a pretty annoying thing to be logging at INFO level every 1 second. This is easy to reproduce -- in spark-shell, with dynamic allocation, set the log level to info and see the logs appear every 1 second. Then run something and see the msgs stop. After the executors time out, see the msgs reappear.
{noformat}
scala> :paste
// Entering paste mode (ctrl-D to finish)
sc.setLogLevel("info")
Thread.sleep(5000)
sc.parallelize(1 to 10).count()
// Exiting paste mode, now interpreting.
19/06/11 12:43:40 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:43:41 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:43:42 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:43:43 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:43:44 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:43:45 INFO spark.SparkContext: Starting job: count at <console>:28
19/06/11 12:43:45 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:28) with 2 output partitions
19/06/11 12:43:45 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (count at <console>:28)
...
19/06/11 12:43:54 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/06/11 12:43:54 INFO scheduler.DAGScheduler: ResultStage 0 (count at <console>:28) finished in 9.548 s
19/06/11 12:43:54 INFO scheduler.DAGScheduler: Job 0 finished: count at <console>:28, took 9.613049 s
res2: Long = 10

scala>
...
19/06/11 12:44:56 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:44:57 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:44:58 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
19/06/11 12:44:59 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
...
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
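Until a fix lands, a plausible interim workaround, assuming the stock log4j setup that Spark ships with in conf/log4j.properties.template, is to raise the level for just this logger:
{noformat}
# Silence only the rack resolver's repeated INFO line (log4j 1.x properties syntax)
log4j.logger.org.apache.spark.deploy.yarn.SparkRackResolver=WARN
{noformat}
This hides the noisy INFO message without suppressing warnings or errors from the resolver.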
[jira] [Assigned] (SPARK-28004) Update jquery to 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-28004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28004: Assignee: Sean Owen (was: Apache Spark) > Update jquery to 3.4.1 > -- > > Key: SPARK-28004 > URL: https://issues.apache.org/jira/browse/SPARK-28004 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > > We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 > to keep up in general, but also to keep up with CVEs. In fact, we know of at > least one resolved in only 3.4.0+ > (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, > but, if the update isn't painful, maybe worthwhile in order to make future > 3.x updates easier. > jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to > maintain compatibility with 1.9+ > (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) > 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard > to evaluate each one, but the most likely area for problems is in ajax(). > However, our usage of jQuery (and plugins) is pretty simple. > I've tried updating and testing the UI, and can't see any warnings, errors, > or problematic functionality. This includes the Spark UI, master UI, worker > UI, and docs (well, I wasn't able to build R docs) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28004) Update jquery to 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-28004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28004: Assignee: Apache Spark (was: Sean Owen) > Update jquery to 3.4.1 > -- > > Key: SPARK-28004 > URL: https://issues.apache.org/jira/browse/SPARK-28004 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Major > > We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 > to keep up in general, but also to keep up with CVEs. In fact, we know of at > least one resolved in only 3.4.0+ > (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, > but, if the update isn't painful, maybe worthwhile in order to make future > 3.x updates easier. > jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to > maintain compatibility with 1.9+ > (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) > 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard > to evaluate each one, but the most likely area for problems is in ajax(). > However, our usage of jQuery (and plugins) is pretty simple. > I've tried updating and testing the UI, and can't see any warnings, errors, > or problematic functionality. This includes the Spark UI, master UI, worker > UI, and docs (well, I wasn't able to build R docs) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28004) Update jquery to 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-28004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-28004: -- Description: We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 to keep up in general, but also to keep up with CVEs. In fact, we know of at least one resolved in only 3.4.0+ (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, but, if the update isn't painful, maybe worthwhile in order to make future 3.x updates easier. jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to maintain compatibility with 1.9+ (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard to evaluate each one, but the most likely area for problems is in ajax(). However, our usage of jQuery (and plugins) is pretty simple. I've tried updating and testing the UI, and can't see any warnings, errors, or problematic functionality. This includes the Spark UI, master UI, worker UI, and docs (well, I wasn't able to build R docs) was: We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 to keep up in general, but also to keep up with CVEs. In fact, we know of at least one resolved in only 3.4.0+ (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, but, if the update isn't painful, maybe worthwhile in order to make future 3.x updates easier. jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to maintain compatibility with 1.9+ (https://nvd.nist.gov/vuln/detail/CVE-2019-11358) 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard to evaluate each one, but the most likely area for problems is in ajax(). However, our usage of jQuery (and plugins) is pretty simple. I've tried updating and testing the UI, and can't see any warnings, errors, or problematic functionality. This includes the Spark UI, master UI, worker UI, and docs (well, I wasn't able to build R docs) > Update jquery to 3.4.1 > -- > > Key: SPARK-28004 > URL: https://issues.apache.org/jira/browse/SPARK-28004 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > > We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 > to keep up in general, but also to keep up with CVEs. In fact, we know of at > least one resolved in only 3.4.0+ > (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, > but, if the update isn't painful, maybe worthwhile in order to make future > 3.x updates easier. > jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to > maintain compatibility with 1.9+ > (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) > 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard > to evaluate each one, but the most likely area for problems is in ajax(). > However, our usage of jQuery (and plugins) is pretty simple. > I've tried updating and testing the UI, and can't see any warnings, errors, > or problematic functionality. This includes the Spark UI, master UI, worker > UI, and docs (well, I wasn't able to build R docs) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28004) Update jquery to 3.4.1
Sean Owen created SPARK-28004: - Summary: Update jquery to 3.4.1 Key: SPARK-28004 URL: https://issues.apache.org/jira/browse/SPARK-28004 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Sean Owen Assignee: Sean Owen We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 to keep up in general, but also to keep up with CVEs. In fact, we know of at least one resolved in only 3.4.0+ (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, but, if the update isn't painful, maybe worthwhile in order to make future 3.x updates easier. jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to maintain compatibility with 1.9+ (https://nvd.nist.gov/vuln/detail/CVE-2019-11358) 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard to evaluate each one, but the most likely area for problems is in ajax(). However, our usage of jQuery (and plugins) is pretty simple. I've tried updating and testing the UI, and can't see any warnings, errors, or problematic functionality. This includes the Spark UI, master UI, worker UI, and docs (well, I wasn't able to build R docs) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861353#comment-16861353 ] Xiangrui Meng commented on SPARK-27360: --- Done. Thanks for taking the task! > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design and implement standalone manager support for GPU-aware scheduling: > 1. static conf to describe resources > 2. spark-submit to request resources > 3. auto discovery of GPUs > 4. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
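To make the four items in that list concrete, here is a sketch of how the pieces could fit together on a standalone cluster. The conf names are the ones this line of work eventually introduced for Spark 3.0, and the script path is hypothetical; treat all of it as assumptions rather than the settled design at the time of this thread:
{noformat}
# Worker side (static conf + auto discovery), e.g. in spark-defaults.conf on each Worker:
spark.worker.resource.gpu.amount           4
spark.worker.resource.gpu.discoveryScript  /opt/spark/bin/get_gpus.py

# Application side, at spark-submit time (requesting resources per executor/task):
./bin/spark-submit \
  --master spark://master:7077 \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.task.resource.gpu.amount=1 \
  my_app.py
{noformat}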
[jira] [Deleted] (SPARK-27999) setup resources when Standalone Worker starts up
[ https://issues.apache.org/jira/browse/SPARK-27999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng deleted SPARK-27999: -- > setup resources when Standalone Worker starts up > > > Key: SPARK-27999 > URL: https://issues.apache.org/jira/browse/SPARK-27999 > Project: Spark > Issue Type: Sub-task >Reporter: wuyi >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27369) Standalone worker can load resource conf and discover resources
[ https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-27369: -- Assignee: wuyi > Standalone worker can load resource conf and discover resources > --- > > Key: SPARK-27369 > URL: https://issues.apache.org/jira/browse/SPARK-27369 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: wuyi >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
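For the discovery half of this sub-task, a minimal sketch of a discovery script, assuming the JSON contract that resource discovery ended up using (a single object with the resource name and its addresses); the device indices here are made up:
{code:java}
#!/usr/bin/env python
# Hypothetical GPU discovery script: the Worker runs it at startup and
# parses its stdout as JSON to learn which resource addresses it owns.
import json

print(json.dumps({"name": "gpu", "addresses": ["0", "1", "2", "3"]}))
{code}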
[jira] [Commented] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861261#comment-16861261 ] Huaxin Gao commented on SPARK-26185: resolved in another PR > add weightCol in python MulticlassClassificationEvaluator > - > > Key: SPARK-26185 > URL: https://issues.apache.org/jira/browse/SPARK-26185 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > https://issues.apache.org/jira/browse/SPARK-24101 added weightCol in > MulticlassClassificationEvaluator.scala. This Jira will add weightCol in > python version of MulticlassClassificationEvaluator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
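For reference, once the parameter exists on the Python side, usage would look like the sketch below; the DataFrame {{predictions}} and its column names are assumptions:
{code:java}
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# 'weight' is a per-row weight column in the scored DataFrame 'predictions'
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction",
    metricName="accuracy", weightCol="weight")
print(evaluator.evaluate(predictions))
{code}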
[jira] [Updated] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
[ https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-28003: --- Affects Version/s: (was: 2.3.2) 2.3.3 > spark.createDataFrame with Arrow doesn't work with pandas.NaT > -- > > Key: SPARK-28003 > URL: https://issues.apache.org/jira/browse/SPARK-28003 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.3, 2.4.3 >Reporter: Li Jin >Priority: Major > > {code:java} > import pandas as pd > dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100 > pdf1 = pd.DataFrame({'time': dt1}) > df1 = spark.createDataFrame(pdf1) > {code} > The example above doesn't work with Arrow enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
Li Jin created SPARK-28003: -- Summary: spark.createDataFrame with Arrow doesn't work with pandas.NaT Key: SPARK-28003 URL: https://issues.apache.org/jira/browse/SPARK-28003 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Reporter: Li Jin
{code:java}
import pandas as pd

dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
pdf1 = pd.DataFrame({'time': dt1})
df1 = spark.createDataFrame(pdf1)
{code}
The example above doesn't work with Arrow enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
[ https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-28003: --- Affects Version/s: (was: 2.4.0) 2.3.2 2.4.3 > spark.createDataFrame with Arrow doesn't work with pandas.NaT > -- > > Key: SPARK-28003 > URL: https://issues.apache.org/jira/browse/SPARK-28003 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.2, 2.4.3 >Reporter: Li Jin >Priority: Major > > {code:java} > import pandas as pd > dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100 > pdf1 = pd.DataFrame({'time': dt1}) > df1 = spark.createDataFrame(pdf1) > {code} > The example above doesn't work with Arrow enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
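Since the failure is specific to the Arrow-enabled path, a plausible interim workaround is to fall back to the non-Arrow conversion for frames that contain NaT (conf name as in Spark 2.3/2.4):
{code:java}
# Sketch: temporarily disable Arrow for this conversion only
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df1 = spark.createDataFrame(pdf1)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
{code}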
[jira] [Assigned] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27258: Assignee: Apache Spark > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Assignee: Apache Spark >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27258: Assignee: (was: Apache Spark) > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
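The DNS-1035 rule quoted in the error is mechanical, so the needed normalization is easy to sketch. The helper below is illustrative only (it is not Spark's actual fix): it coerces an app-name-derived prefix into a valid label before resource names are built from it:
{code:java}
import re

def to_dns1035_label(name, max_len=63):
    # Lower-case alphanumerics and '-', starting with a letter
    # and ending with an alphanumeric character.
    label = re.sub(r'[^a-z0-9-]', '-', name.lower())
    label = re.sub(r'^[^a-z]+', '', label)  # drop leading digits/dashes
    return label[:max_len].rstrip('-') or 'spark'

print(to_dns1035_label('1min-machinereg-yf'))  # -> 'min-machinereg-yf'
{code}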
[jira] [Assigned] (SPARK-28002) Support WITH clause column aliases
[ https://issues.apache.org/jira/browse/SPARK-28002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28002: Assignee: Apache Spark > Support WITH clause column aliases > -- > > Key: SPARK-28002 > URL: https://issues.apache.org/jira/browse/SPARK-28002 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Assignee: Apache Spark >Priority: Major > > PostgreSQL supports column aliasing in a CTE, so this is a valid query: > {noformat} > WITH t(x) AS (SELECT 1) > SELECT * FROM t WHERE x = 1{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28002) Support WITH clause column aliases
[ https://issues.apache.org/jira/browse/SPARK-28002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28002: Assignee: (was: Apache Spark) > Support WITH clause column aliases > -- > > Key: SPARK-28002 > URL: https://issues.apache.org/jira/browse/SPARK-28002 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Priority: Major > > PostgreSQL supports column aliasing in a CTE, so this is a valid query: > {noformat} > WITH t(x) AS (SELECT 1) > SELECT * FROM t WHERE x = 1{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28002) Support WITH clause column aliases
Peter Toth created SPARK-28002: -- Summary: Support WITH clause column aliases Key: SPARK-28002 URL: https://issues.apache.org/jira/browse/SPARK-28002 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Peter Toth PostgreSQL supports column aliasing in a CTE, so this is a valid query:
{noformat}
WITH t(x) AS (SELECT 1)
SELECT * FROM t WHERE x = 1
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
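Until that syntax lands, the same query can be written by aliasing inside the CTE body, which Spark already accepts:
{code:java}
# Equivalent formulation without WITH-clause column aliases
spark.sql("""
    WITH t AS (SELECT 1 AS x)
    SELECT * FROM t WHERE x = 1
""").show()
{code}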
[jira] [Updated] (SPARK-27917) Semantic equals of CaseWhen is failing with case sensitivity of column Names
[ https://issues.apache.org/jira/browse/SPARK-27917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27917: -- Fix Version/s: 2.4.4 > Semantic equals of CaseWhen is failing with case sensitivity of column Names > > > Key: SPARK-27917 > URL: https://issues.apache.org/jira/browse/SPARK-27917 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.2, 2.4.3 >Reporter: Akash R Nilugal >Assignee: Sandeep Katta >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > Semantic equals of CaseWhen is failing with case sensitivity of column Names -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28001) Dataframe throws 'socket.timeout: timed out' exception
Marius Stanescu created SPARK-28001: --- Summary: Dataframe throws 'socket.timeout: timed out' exception Key: SPARK-28001 URL: https://issues.apache.org/jira/browse/SPARK-28001 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.3 Environment: Processor: Intel Core i7-7700 CPU @ 3.60GHz RAM: 16 GB OS: Windows 10 Enterprise 64-bit Python: 3.7.2 PySpark: 2.4.3 Cluster manager: Spark Standalone Reporter: Marius Stanescu I load data from Azure Table Storage, create a DataFrame and perform a couple of operations via two user-defined functions, then call show() to display the results. If I load a very small batch of items, like 5, everything works fine, but if I load a batch greater than 10 items from Azure Table Storage, then I get the 'socket.timeout: timed out' exception. Here is the code:
{code}
import time
import json
import requests
from requests.auth import HTTPBasicAuth
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import BooleanType


def main():
    batch_size = 25
    azure_table_account_name = '***'
    azure_table_account_key = '***'
    azure_table_name = '***'

    spark = SparkSession \
        .builder \
        .appName(agent_name) \
        .config("spark.sql.crossJoin.enabled", "true") \
        .getOrCreate()

    table_service = TableService(account_name=azure_table_account_name, account_key=azure_table_account_key)
    continuation_token = None

    while True:
        messages = table_service.query_entities(
            azure_table_name,
            select="RowKey, PartitionKey, messageId, ownerSmtp, Timestamp",
            num_results=batch_size,
            marker=continuation_token,
            timeout=60)
        continuation_token = messages.next_marker

        messages_list = list(messages)
        if not len(messages_list):
            time.sleep(5)
            pass

        messages_df = spark.createDataFrame(messages_list)

        register_records_df = messages_df \
            .withColumn('Registered', register_record('RowKey', 'PartitionKey', 'messageId', 'ownerSmtp', 'Timestamp'))

        only_registered_records_df = register_records_df \
            .filter(register_records_df.Registered == True) \
            .drop(register_records_df.Registered)

        update_message_status_df = only_registered_records_df \
            .withColumn('TableEntryDeleted', delete_table_entity('RowKey', 'PartitionKey'))

        results_df = update_message_status_df.select(
            update_message_status_df.RowKey,
            update_message_status_df.PartitionKey,
            update_message_status_df.TableEntryDeleted)

        #results_df.explain()
        results_df.show(n=batch_size, truncate=False)


@udf(returnType=BooleanType())
def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp):
    # call an API
    try:
        url = '{}/data/record/{}'.format('***', rowKey)
        headers = { 'Content-type': 'application/json' }
        response = requests.post(
            url,
            headers=headers,
            auth=HTTPBasicAuth('***', '***'),
            data=prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, timestamp))
        return bool(response)
    except:
        return False


def prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, timestamp):
    record_data = {
        "Title": messageId,
        "Type": '***',
        "Source": '***',
        "Creator": ownerSmtp,
        "Publisher": '***',
        "Date": timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')
    }
    return json.dumps(record_data)


@udf(returnType=BooleanType())
def delete_table_entity(row_key, partition_key):
    azure_table_account_name = '***'
    azure_table_account_key = '***'
    azure_table_name = '***'
    try:
        table_service = TableService(account_name=azure_table_account_name, account_key=azure_table_account_key)
        table_service.delete_entity(azure_table_name, partition_key, row_key)
        return True
    except:
        return False


if __name__ == "__main__":
    main()
{code}
Here is the console output:
{noformat}
== Physical Plan ==
*(2) Project [RowKey#54, PartitionKey#53, pythonUDF0#93 AS TableEntryDeleted#81]
+- BatchEvalPython [delete_table_entity(RowKey#54, PartitionKey#53)], [PartitionKey#53, RowKey#54, pythonUDF0#93]
   +- *(1) Project [PartitionKey#53, RowKey#54]
      +- *(1) Project [PartitionKey#53, RowKey#54, Timestamp#55, etag#56, messageId#57, ownerSmtp#58]
         +- *(1) Filter (pythonUDF0#92 = true)
            +- BatchEvalPython
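One pattern worth trying when row-level UDFs do network I/O like the above is to move the calls into mapPartitions, so each partition shares a single TableService instead of opening a connection per row. This is a sketch under assumptions (same '***' redactions as the report), not a confirmed fix for the reported timeout:
{code}
def delete_partition(rows):
    from azure.cosmosdb.table.tableservice import TableService
    # One client per partition instead of one per row inside a UDF
    table_service = TableService(account_name='***', account_key='***')
    for row in rows:
        try:
            table_service.delete_entity('***', row['PartitionKey'], row['RowKey'])
            yield (row['RowKey'], row['PartitionKey'], True)
        except Exception:
            yield (row['RowKey'], row['PartitionKey'], False)

results_df = only_registered_records_df.select('RowKey', 'PartitionKey') \
    .rdd.mapPartitions(delete_partition) \
    .toDF(['RowKey', 'PartitionKey', 'TableEntryDeleted'])
{code}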
[jira] [Commented] (SPARK-9983) Local physical operators for query execution
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860976#comment-16860976 ] Lai Zhou commented on SPARK-9983: - [~rxin], we now use Calcite to build a high-performance Hive SQL engine; it's released as open source now, see [https://github.com/51nb/marble]. It works fine for real-time ML scenarios in our financial business. But I think it's not the best solution. I think adding a single-node version of DataFrame to Spark may be the best solution, because Spark SQL has natural compatibility with Hive SQL, and people can enjoy the benefits of the excellent optimizer, vectorized execution, code gen, etc. > Local physical operators for query execution > > > Key: SPARK-9983 > URL: https://issues.apache.org/jira/browse/SPARK-9983 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > In distributed query execution, there are two kinds of operators: > (1) operators that exchange data between different executors or threads: > examples include broadcast, shuffle. > (2) operators that process data in a single thread: examples include project, > filter, group by, etc. > This ticket proposes clearly differentiating them and creating local > operators in Spark. This leads to a lot of benefits: easier to test, easier > to optimize data exchange, better design (single responsibility), and > potentially even having a hyper-optimized single-node version of DataFrame. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28000) Add comments.sql
Lantao Jin created SPARK-28000: -- Summary: Add comments.sql Key: SPARK-28000 URL: https://issues.apache.org/jira/browse/SPARK-28000 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin In this ticket, we plan to add the regression test cases of https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/comments.sql. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
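For a flavor of what that file exercises, the two SQL comment styles it covers look like this in Spark; whether the ported tests run exactly this way is an assumption:
{code:java}
# Both comment styles are parsed by Spark SQL
spark.sql("SELECT 1 AS x  -- trailing line comment").show()
spark.sql("SELECT /* block comment */ 2 AS y").show()
{code}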
[jira] [Comment Edited] (SPARK-9983) Local physical operators for query execution
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860976#comment-16860976 ] Lai Zhou edited comment on SPARK-9983 at 6/11/19 12:18 PM: --- [~rxin], we now use Calcite to build a high-performance Hive SQL engine; it's released as open source now, see [https://github.com/51nb/marble]. It works fine for real-time ML scenarios in our financial business. But I think it's not the best solution. Adding a single-node version of DataFrame to Spark may be the best solution, because Spark SQL has natural compatibility with Hive SQL, and people can enjoy the benefits of the excellent optimizer, vectorized execution, code gen, etc. was (Author: hhlai1990): [~rxin], we now use Calcite to build a high performance hive sql engine , it's released as opensource now. see [https://github.com/51nb/marble] It works fine for real-time ML scene in our financial business. But I think it's not the best solution. I think adding a single-node version of DataFrame to spark may be the best solution, because spark sql has natural compatibility with Hive sql, and people can enjoy the benefits of the excellent optimizer, vectorized execution, code gen...etc . > Local physical operators for query execution > > > Key: SPARK-9983 > URL: https://issues.apache.org/jira/browse/SPARK-9983 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > In distributed query execution, there are two kinds of operators: > (1) operators that exchange data between different executors or threads: > examples include broadcast, shuffle. > (2) operators that process data in a single thread: examples include project, > filter, group by, etc. > This ticket proposes clearly differentiating them and creating local > operators in Spark. This leads to a lot of benefits: easier to test, easier > to optimize data exchange, better design (single responsibility), and > potentially even having a hyper-optimized single-node version of DataFrame. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9983) Local physical operators for query execution
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860976#comment-16860976 ] Lai Zhou edited comment on SPARK-9983 at 6/11/19 12:15 PM: --- [~rxin], we now use Calcite to build a high performance hive sql engine , it's released as opensource now. see [https://github.com/51nb/marble] It works fine for real-time ML scene in our financial business. But I think it's not the best solution. I think adding a single-node version of DataFrame to spark may be the best solution, because spark sql has natural compatibility with Hive sql, and people can enjoy the benefits of the excellent optimizer, vectorized execution, code gen...etc . was (Author: hhlai1990): [~rxin], we now use Calcite to build a high performance hive sql engine , it's released as opensource now. see [https://github.com/51nb/marble] It works fine for real-time ML scene in our financial business. But I think it's not the best solution. I think adding a single-node version of DataFrame to spark may be the best solution, because spark sql has natural compatibility with Hive sql, and people can enjoy the benefits of the excellent optimizer, vectorized execution, code gen...etc . > Local physical operators for query execution > > > Key: SPARK-9983 > URL: https://issues.apache.org/jira/browse/SPARK-9983 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > In distributed query execution, there are two kinds of operators: > (1) operators that exchange data between different executors or threads: > examples include broadcast, shuffle. > (2) operators that process data in a single thread: examples include project, > filter, group by, etc. > This ticket proposes clearly differentiating them and creating local > operators in Spark. This leads to a lot of benefits: easier to test, easier > to optimize data exchange, better design (single responsibility), and > potentially even having a hyper-optimized single-node version of DataFrame. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27369) Standalone worker can load resource conf and discover resources
[ https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27369: Assignee: Apache Spark > Standalone worker can load resource conf and discover resources > --- > > Key: SPARK-27369 > URL: https://issues.apache.org/jira/browse/SPARK-27369 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27369) Standalone worker can load resource conf and discover resources
[ https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27369: Assignee: (was: Apache Spark) > Standalone worker can load resource conf and discover resources > --- > > Key: SPARK-27369 > URL: https://issues.apache.org/jira/browse/SPARK-27369 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9983) Local physical operators for query execution
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860959#comment-16860959 ] Lai Zhou edited comment on SPARK-9983 at 6/11/19 11:51 AM: --- [~rxin], regarding `a hyper-optimized single-node version of DataFrame`, do you have any roadmap for it? In the real world, we use Spark SQL to handle our ETL jobs on Hive. We may extract a lot of user variables via complex SQL queries, which become the input for machine-learning models. But when we want to migrate these jobs to a real-time system, we always need to reimplement the SQL queries in another programming language, which requires a lot of work. Since the local mode of Spark SQL is not a direct, high-performance execution mode, I think it would make great sense to have a hyper-optimized single-node version of DataFrame. was (Author: hhlai1990): [~rxin], `a hyper-optimized single-node version of DataFrame`, do you have any roadmap about it? In real world, we use spark sql to handle our ETL jobs on Hive. We may extract a lots of user's variables by complex sql queries, which will be the input for machine-learning models. But when we want to migrate the jobs to real-time system, we always need to interpret these sql queries by another programming language, which requires a lot of work. Now the local mode of spark sql is not a direct and high performance execution mode, I think it will make great sense to have a high hyper-optimized single-node. > Local physical operators for query execution > > > Key: SPARK-9983 > URL: https://issues.apache.org/jira/browse/SPARK-9983 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > In distributed query execution, there are two kinds of operators: > (1) operators that exchange data between different executors or threads: > examples include broadcast, shuffle. > (2) operators that process data in a single thread: examples include project, > filter, group by, etc. > This ticket proposes clearly differentiating them and creating local > operators in Spark. This leads to a lot of benefits: easier to test, easier > to optimize data exchange, better design (single responsibility), and > potentially even having a hyper-optimized single-node version of DataFrame. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9983) Local physical operators for query execution
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860959#comment-16860959 ] Lai Zhou commented on SPARK-9983: - [~rxin], regarding `a hyper-optimized single-node version of DataFrame`, do you have any roadmap for it? In the real world, we use Spark SQL to handle our ETL jobs on Hive. We may extract a lot of user variables via complex SQL queries, which become the input for machine-learning models. But when we want to migrate these jobs to a real-time system, we always need to reimplement the SQL queries in another programming language, which requires a lot of work. Since the local mode of Spark SQL is not a direct, high-performance execution mode, I think it would make great sense to have a hyper-optimized single-node version. > Local physical operators for query execution > > > Key: SPARK-9983 > URL: https://issues.apache.org/jira/browse/SPARK-9983 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > In distributed query execution, there are two kinds of operators: > (1) operators that exchange data between different executors or threads: > examples include broadcast, shuffle. > (2) operators that process data in a single thread: examples include project, > filter, group by, etc. > This ticket proposes clearly differentiating them and creating local > operators in Spark. This leads to a lot of benefits: easier to test, easier > to optimize data exchange, better design (single responsibility), and > potentially even having a hyper-optimized single-node version of DataFrame. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-27258: --- > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860951#comment-16860951 ] wuyi commented on SPARK-27360: -- Oops... I happened to find that sub-task 2 has changed to what I want after I created sub-task 4. Can you please help to delete sub-task 4? [~mengxr] > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design and implement standalone manager support for GPU-aware scheduling: > 1. static conf to describe resources > 2. spark-submit to request resources > 3. auto discovery of GPUs > 4. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27999) setup resources when Standalone Worker starts up
wuyi created SPARK-27999: Summary: setup resources when Standalone Worker starts up Key: SPARK-27999 URL: https://issues.apache.org/jira/browse/SPARK-27999 Project: Spark Issue Type: Sub-task Components: Deploy, Spark Core Affects Versions: 3.0.0 Reporter: wuyi -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860924#comment-16860924 ] Stavros Kontopoulos edited comment on SPARK-27258 at 6/11/19 11:03 AM: --- [~hehuiyuan] the ticket needs to be re-opened, [~srowen] are you ok with that (saw you resolved in the past)? Also next time pls stick to one PR if possible. was (Author: skonto): [~hehuiyuan] the ticket needs to be re-opened, unless someone else resolved it. Also next time pls stick to one PR if possible. > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860924#comment-16860924 ] Stavros Kontopoulos edited comment on SPARK-27258 at 6/11/19 11:03 AM: --- [~hehuiyuan] the ticket needs to be re-opened, [~srowen] are you ok with that (saw you resolved it in the past)? Also next time pls stick to one PR if possible. was (Author: skonto): [~hehuiyuan] the ticket needs to be re-opened, [~srowen] are you ok with that (saw you resolved in the past)? Also next time pls stick to one PR if possible. > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860924#comment-16860924 ] Stavros Kontopoulos edited comment on SPARK-27258 at 6/11/19 11:02 AM: --- [~hehuiyuan] the ticket needs to be re-opened, unless someone else resolved it. Also next time pls stick to one PR if possible. was (Author: skonto): [~hehuiyuan] the ticket needs to be re-opened. Also next time pls stick to one PR if possible. > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860924#comment-16860924 ] Stavros Kontopoulos edited comment on SPARK-27258 at 6/11/19 11:01 AM: --- [~hehuiyuan] the ticket needs to be re-opened. Also next time pls stick to one PR if possible. was (Author: skonto): [~hehuiyuan] the ticket needs to be re-opened. > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression
[ https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860924#comment-16860924 ] Stavros Kontopoulos commented on SPARK-27258: - [~hehuiyuan] the ticket needs to be re-opened. > The value of "spark.app.name" or "--name" starts with number , which causes > resourceName does not match regular expression > -- > > Key: SPARK-27258 > URL: https://issues.apache.org/jira/browse/SPARK-27258 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: hehuiyuan >Priority: Minor > > {code:java} > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service > "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: > Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 > label must consist of lower case alphanumeric characters or '-', start with > an alphabetic character, and end with an alphanumeric character (e.g. > 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, > code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, > message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a > DNS-1035 label must consist of lower case alphanumeric characters or '-', > start with an alphabetic character, and end with an alphanumeric character > (e.g. 'my-name', or 'abc-123', regex used for validation is > '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Service, > name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, > uid=null, additionalProperties={}). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label
[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860922#comment-16860922 ] zhengruifeng edited comment on SPARK-24875 at 6/11/19 10:59 AM: The scored dataset is usually much smaller than the training dataset; if the scored data is too huge to perform a simple op like countByValue, how could you train/evaluate the model? I doubt whether it is worth applying an approximation. was (Author: podongfeng): The scored dataset is usually much smaller than the training dataset; if the scored data is too huge to perform a simple op like countByValue, how could you train the model? I doubt whether it is worth applying an approximation. > MulticlassMetrics should offer a more efficient way to compute count by label > - > > Key: SPARK-24875 > URL: https://issues.apache.org/jira/browse/SPARK-24875 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Antoine Galataud >Priority: Minor > > Currently _MulticlassMetrics_ calls _countByValue_() to get the count by > class/label: > {code:java} > private lazy val labelCountByClass: Map[Double, Long] = > predictionAndLabels.values.countByValue() > {code} > If the input _RDD[(Double, Double)]_ is huge (which can be the case with a large > test dataset), it will lead to poor execution performance. > One option could be to allow using _countByValueApprox_ (could require adding > an extra configuration param for MulticlassMetrics). > Note: since there is no equivalent of _MulticlassMetrics_ in the new ML library, > I don't know how this could be ported there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label
[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860922#comment-16860922 ] zhengruifeng commented on SPARK-24875: -- The scored dataset is usually much smaller than the training dataset; if the scored data is too huge to perform a simple op like countByValue, how could you train the model? I doubt whether it is worth applying an approximation. > MulticlassMetrics should offer a more efficient way to compute count by label > - > > Key: SPARK-24875 > URL: https://issues.apache.org/jira/browse/SPARK-24875 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Antoine Galataud >Priority: Minor > > Currently _MulticlassMetrics_ calls _countByValue_() to get the count by > class/label: > {code:java} > private lazy val labelCountByClass: Map[Double, Long] = > predictionAndLabels.values.countByValue() > {code} > If the input _RDD[(Double, Double)]_ is huge (which can be the case with a large > test dataset), it will lead to poor execution performance. > One option could be to allow using _countByValueApprox_ (could require adding > an extra configuration param for MulticlassMetrics). > Note: since there is no equivalent of _MulticlassMetrics_ in the new ML library, > I don't know how this could be ported there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
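For context, {{RDD.countByValueApprox(timeout, confidence)}} already exists in Spark Core, so the option the ticket discusses is mostly about exposing it. A rough sketch of how the approximate path might be used (the helper function and its default timeout/confidence values are illustrative, not a proposed API):
{code:scala}
import org.apache.spark.rdd.RDD

// Sketch: approximate count-by-label via the existing countByValueApprox.
// getFinalValue() blocks until the timeout elapses or the job finishes.
def approxLabelCounts(labels: RDD[Double],
                      timeoutMs: Long = 10000L,
                      confidence: Double = 0.95): Map[Double, Long] = {
  labels.countByValueApprox(timeoutMs, confidence)
    .getFinalValue()
    .map { case (label, bounded) => (label, bounded.mean.toLong) }
    .toMap
}
{code}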
[jira] [Commented] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860919#comment-16860919 ] zhengruifeng commented on SPARK-26185: -- Seems resolved? > add weightCol in python MulticlassClassificationEvaluator > - > > Key: SPARK-26185 > URL: https://issues.apache.org/jira/browse/SPARK-26185 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > https://issues.apache.org/jira/browse/SPARK-24101 added weightCol in > MulticlassClassificationEvaluator.scala. This Jira will add weightCol in the > Python version of MulticlassClassificationEvaluator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27759) Do not auto cast array to np.array in vectorized udf
[ https://issues.apache.org/jira/browse/SPARK-27759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] colin fang updated SPARK-27759: --- Description: {code:java} pd_df = pd.DataFrame({'x': np.random.rand(11, 3, 5).tolist()}) df = spark.createDataFrame(pd_df).cache() {code} Each element in x is a list of lists, as expected. {code:java} df.toPandas()['x'] # 0 [[0.08669612955959993, 0.32624430522634495, 0 # 1 [[0.29838166086156914, 0.008550172904516762, 0... # 2 [[0.641304534802928, 0.2392047548381877, 0.555... {code} {code:java} def my_udf(x): # Hack to see what's inside a udf raise Exception(x.values.shape, x.values[0].shape, x.values[0][0].shape, np.stack(x.values).shape) return pd.Series(x.values) my_udf = F.pandas_udf(my_udf, returnType=DoubleType()) df.coalesce(1).withColumn('y', my_udf('x')).show() # Exception: ((11,), (3,), (5,), (11, 3)){code} A batch (11) of `x` is converted to a pd.Series; however, each element in the pd.Series is now a numpy 1d array of numpy 1d arrays. It is inconvenient to work with nested 1d numpy arrays in practice in a udf. For example, I need an ndarray of shape (11, 3, 5) in the udf, so that I can make use of the numpy vectorized operations. If I were given the list of lists intact, I could simply do `np.stack(x.values)`. However, it doesn't work here as what I received is a nested numpy 1d array. was: {code:java} pd_df = pd.DataFrame({'x': np.random.rand(11, 3, 5).tolist()}) df = spark.createDataFrame(pd_df).cache() {code} Each element in x is a list of lists, as expected. {code:java} df.toPandas()['x'] # 0 [[0.08669612955959993, 0.32624430522634495, 0 # 1 [[0.29838166086156914, 0.008550172904516762, 0... # 2 [[0.641304534802928, 0.2392047548381877, 0.555... {code} {code:java} def my_udf(x): # Hack to see what's inside a udf raise Exception(x.values.shape, x.values[0].shape, x.values[0][0].shape, np.stack(x.values).shape) return pd.Series(x.values) my_udf = pandas_udf(dot_product, returnType=DoubleType()) df.withColumn('y', my_udf('x')).show() Exception: ((2,), (3,), (5,), (2, 3)) {code} A batch (2) of `x` is converted to a pd.Series; however, each element in the pd.Series is now a numpy 1d array of numpy 1d arrays. It is inconvenient to work with nested 1d numpy arrays in practice in a udf. For example, I need an ndarray of shape (2, 3, 5) in the udf, so that I can make use of the numpy vectorized operations. If I were given the list of lists intact, I could simply do `np.stack(x.values)`. However, it doesn't work here as what I received is a nested numpy 1d array. > Do not auto cast array to np.array in vectorized udf > > > Key: SPARK-27759 > URL: https://issues.apache.org/jira/browse/SPARK-27759 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: colin fang >Priority: Minor > > {code:java} > pd_df = pd.DataFrame({'x': np.random.rand(11, 3, 5).tolist()}) > df = spark.createDataFrame(pd_df).cache() > {code} > Each element in x is a list of lists, as expected. > {code:java} > df.toPandas()['x'] > # 0 [[0.08669612955959993, 0.32624430522634495, 0 > # 1 [[0.29838166086156914, 0.008550172904516762, 0... > # 2 [[0.641304534802928, 0.2392047548381877, 0.555... 
> {code} > > {code:java} > def my_udf(x): > # Hack to see what's inside a udf > raise Exception(x.values.shape, x.values[0].shape, x.values[0][0].shape, > np.stack(x.values).shape) > return pd.Series(x.values) > my_udf = F.pandas_udf(my_udf, returnType=DoubleType()) > df.coalesce(1).withColumn('y', my_udf('x')).show() > # Exception: ((11,), (3,), (5,), (11, 3)){code} > > A batch (11) of `x` is converted to a pd.Series; however, each element in the > pd.Series is now a numpy 1d array of numpy 1d arrays. It is inconvenient to > work with nested 1d numpy arrays in practice in a udf. > > For example, I need an ndarray of shape (11, 3, 5) in the udf, so that I can make > use of the numpy vectorized operations. If I were given the list of lists intact, > I could simply do `np.stack(x.values)`. However, it doesn't work here as what I > received is a nested numpy 1d array. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27998) Column alias should support double quote string
[ https://issues.apache.org/jira/browse/SPARK-27998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27998: Assignee: Apache Spark > Column alias should support double quote string > > > Key: SPARK-27998 > URL: https://issues.apache.org/jira/browse/SPARK-27998 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Assignee: Apache Spark >Priority: Major > > According to the ANSI SQL standard, a column alias can be a double-quoted string. > {code:sql} > SELECT au_fname AS "First name", > au_lname AS 'Last name', > city AS City, > state, > zip 'Postal code' FROM authors;{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27998) Column alias should support double quote string
[ https://issues.apache.org/jira/browse/SPARK-27998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27998: Assignee: (was: Apache Spark) > Column alias should support double quote string > > > Key: SPARK-27998 > URL: https://issues.apache.org/jira/browse/SPARK-27998 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > According to the ANSI SQL standard, a column alias can be a double-quoted string. > {code:sql} > SELECT au_fname AS "First name", > au_lname AS 'Last name', > city AS City, > state, > zip 'Postal code' FROM authors;{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25360) Parallelized RDDs of Ranges could have known partitioner
[ https://issues.apache.org/jira/browse/SPARK-25360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860759#comment-16860759 ] zhengruifeng commented on SPARK-25360: -- But I think it may be worth implementing a direct version of `sc.range` rather than building it on `parallelize().mapPartitions()`, to simplify the lineage. > Parallelized RDDs of Ranges could have known partitioner > > > Key: SPARK-25360 > URL: https://issues.apache.org/jira/browse/SPARK-25360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > > We already have the logic to split up the generator; we could expose the same > logic as a partitioner. This would be useful when joining a small > parallelized collection with a larger collection and other cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25360) Parallelized RDDs of Ranges could have known partitioner
[ https://issues.apache.org/jira/browse/SPARK-25360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860756#comment-16860756 ] zhengruifeng commented on SPARK-25360: -- [~holdenk] I am afraid it is not doable to add a partitioner to the {{RDD[Long]}} generated by {{sc.range}}, referring to the definition of Partitioner. {code:java} /** * An object that defines how the elements in a key-value pair RDD are partitioned by key. * Maps each key to a partition ID, from 0 to `numPartitions - 1`. * * Note that, partitioner must be deterministic, i.e. it must return the same partition id given * the same partition key. */{code} Since the returned RDD[Long] is not a {{PairRDD}}, the following ops (like join, sort) cannot utilize an upstream partitioner. An alternative is to add some method like `sc.tabulate[T](start, end, step, numSlices)(f: Long => T)`, so that the partitioner can be used in future ops. > Parallelized RDDs of Ranges could have known partitioner > > > Key: SPARK-25360 > URL: https://issues.apache.org/jira/browse/SPARK-25360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > > We already have the logic to split up the generator; we could expose the same > logic as a partitioner. This would be useful when joining a small > parallelized collection with a larger collection and other cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
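To make the alternative concrete, here is a sketch of a deterministic partitioner that mirrors the slicing {{sc.range}} performs, usable once the range is keyed; the class name and the slicing arithmetic are illustrative, not the actual ParallelCollectionRDD code:
{code:scala}
import org.apache.spark.Partitioner

// Sketch: assign each key from range(start, end, step) to the slice
// sc.range would have placed it in, so downstream ops can reuse it.
class RangeKeyPartitioner(start: Long, end: Long, step: Long,
                          numSlices: Int) extends Partitioner {
  private val numElements = (end - start + step - 1) / step
  override def numPartitions: Int = numSlices
  override def getPartition(key: Any): Int = key match {
    case k: Long =>
      val index = (k - start) / step            // position of k within the range
      ((index * numSlices) / numElements).toInt // proportional slicing
    case other => throw new IllegalArgumentException(s"Unexpected key: $other")
  }
}

// Partitioners only apply to pair RDDs, hence the need for a keyed shape:
// sc.range(0, 100).map(i => (i, i * i)).partitionBy(new RangeKeyPartitioner(0, 100, 1, 4))
{code}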
[jira] [Resolved] (SPARK-27995) Note the difference between str of Python 2 and 3 at Arrow optimized
[ https://issues.apache.org/jira/browse/SPARK-27995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27995. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24838 [https://github.com/apache/spark/pull/24838] > Note the difference between str of Python 2 and 3 at Arrow optimized > > > Key: SPARK-27995 > URL: https://issues.apache.org/jira/browse/SPARK-27995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > When Arrow optimization is enabled in Python 2.7, > {code} > import pandas > pdf = pandas.DataFrame(["test1", "test2"]) > df = spark.createDataFrame(pdf) > df.show() > {code} > I got the following output: > {code} > ++ > | 0| > ++ > |[74 65 73 74 31]| > |[74 65 73 74 32]| > ++ > {code} > This is because Python 2's {{str}} and {{bytes}} are the same; it does look right: > {code} > >>> str == bytes > True > >>> isinstance("a", bytes) > True > {code} > 1. Python 2 treats `str` as `bytes`. > 2. PySpark added some special code and hacks to recognize `str` as string > types. > 3. PyArrow / Pandas follow the Python 2 difference. > We might have to match the behaviour to PySpark's, but Python 2 is deprecated > anyway. I think it's better to just note it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27995) Note the difference between str of Python 2 and 3 at Arrow optimized
[ https://issues.apache.org/jira/browse/SPARK-27995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27995: Assignee: Hyukjin Kwon > Note the difference between str of Python 2 and 3 at Arrow optimized > > > Key: SPARK-27995 > URL: https://issues.apache.org/jira/browse/SPARK-27995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > When Arrow optimization is enabled in Python 2.7, > {code} > import pandas > pdf = pandas.DataFrame(["test1", "test2"]) > df = spark.createDataFrame(pdf) > df.show() > {code} > I got the following output: > {code} > ++ > | 0| > ++ > |[74 65 73 74 31]| > |[74 65 73 74 32]| > ++ > {code} > This is because Python 2's {{str}} and {{bytes}} are the same; it does look right: > {code} > >>> str == bytes > True > >>> isinstance("a", bytes) > True > {code} > 1. Python 2 treats `str` as `bytes`. > 2. PySpark added some special code and hacks to recognize `str` as string > types. > 3. PyArrow / Pandas follow the Python 2 difference. > We might have to match the behaviour to PySpark's, but Python 2 is deprecated > anyway. I think it's better to just note it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27995) Note the difference between str of Python 2 and 3 at Arrow optimized
[ https://issues.apache.org/jira/browse/SPARK-27995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27995: - Summary: Note the difference between str of Python 2 and 3 at Arrow optimized (was: Note the difference between str of Python 2 and 3 at Arrow optimized toPandas) > Note the difference between str of Python 2 and 3 at Arrow optimized > > > Key: SPARK-27995 > URL: https://issues.apache.org/jira/browse/SPARK-27995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > When Arrow optimization is enabled in Python 2.7, > {code} > import pandas > pdf = pandas.DataFrame(["test1", "test2"]) > df = spark.createDataFrame(pdf) > df.show() > {code} > I got the following output: > {code} > ++ > | 0| > ++ > |[74 65 73 74 31]| > |[74 65 73 74 32]| > ++ > {code} > This is because Python 2's {{str}} and {{bytes}} are the same; it does look right: > {code} > >>> str == bytes > True > >>> isinstance("a", bytes) > True > {code} > 1. Python 2 treats `str` as `bytes`. > 2. PySpark added some special code and hacks to recognize `str` as string > types. > 3. PyArrow / Pandas follow the Python 2 difference. > We might have to match the behaviour to PySpark's, but Python 2 is deprecated > anyway. I think it's better to just note it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark 2.x does not support reading data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860734#comment-16860734 ] HonglunChen commented on SPARK-18112: - [~hyukjin.kwon] Thanks, I did not remove the 1.2.1 jars; I put all the 2.3.3 jars in a separate directory and then set "--conf spark.sql.hive.metastore.jars=/path of dir/*". It works now! > Spark 2.x does not support reading data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have > also been out for a long time, but until now Spark only supports reading Hive > metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug > fixes and performance improvements, it is better and urgent to upgrade to > support Hive 2.x. > Failed to load data from a Hive 2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
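For anyone else hitting this, the working configuration described above amounts to pointing the metastore client at the standalone Hive jars; a sketch of the equivalent session setup (the jar directory path is a placeholder):
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: load the Hive 2.3.3 metastore client from a separate directory,
// leaving Spark's built-in 1.2.1 jars in place. The path is a placeholder.
val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "2.3.3")
  .config("spark.sql.hive.metastore.jars", "/opt/hive-2.3.3-jars/*")
  .enableHiveSupport()
  .getOrCreate()
{code}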
[jira] [Commented] (SPARK-18112) Spark 2.x does not support reading data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860711#comment-16860711 ] Hyukjin Kwon commented on SPARK-18112: -- You should place the 2.3.3 jars if you want to use 2.3.3. > Spark 2.x does not support reading data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have > also been out for a long time, but until now Spark only supports reading Hive > metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug > fixes and performance improvements, it is better and urgent to upgrade to > support Hive 2.x. > Failed to load data from a Hive 2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27998) Column alias should support double quote string
[ https://issues.apache.org/jira/browse/SPARK-27998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27998: Affects Version/s: (was: 3.1.0) 3.0.0 > Column alias should support double quote string > > > Key: SPARK-27998 > URL: https://issues.apache.org/jira/browse/SPARK-27998 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > According to the ANSI SQL standard, a column alias can be a double-quoted string. > {code:sql} > SELECT au_fname AS "First name", > au_lname AS 'Last name', > city AS City, > state, > zip 'Postal code' FROM authors;{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27998) Column alias should support double quote string
Zhu, Lipeng created SPARK-27998: --- Summary: Column alias should support double quote string Key: SPARK-27998 URL: https://issues.apache.org/jira/browse/SPARK-27998 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Zhu, Lipeng According to the ANSI SQL standard, a column alias can be a double-quoted string. {code:sql} SELECT au_fname AS "First name", au_lname AS 'Last name', city AS City, state, zip 'Postal code' FROM authors;{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
[ https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860648#comment-16860648 ] Dongjoon Hyun commented on SPARK-27499: --- Oh, do you mean `SPARK_LOCAL_DIRS` doesn't work on the PV, [~junjie]? > Support mapping spark.local.dir to hostPath volume > -- > > Key: SPARK-27499 > URL: https://issues.apache.org/jira/browse/SPARK-27499 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Junjie Chen >Priority: Minor > > Currently, the k8s executor builder mounts spark.local.dir as an emptyDir or > in memory. That should satisfy small workloads, while in heavy workloads like > TPCDS both of them can have problems, such as pods being evicted due to disk > pressure when using emptyDir, and OOM when using tmpfs. > In particular, in cloud environments users may allocate a cluster with a > minimal configuration and add cloud storage when running a workload. In this > case, we could specify multiple elastic storage volumes as spark.local.dir to > accelerate spilling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
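If this lands, a natural shape would reuse the existing spark.kubernetes.*.volumes.* configuration keys, with a volume name that marks the mount as local scratch space. A sketch under that assumption (the spark-local-dir-* naming convention and both paths are illustrative, not a committed design):
{code:scala}
import org.apache.spark.SparkConf

// Sketch: back Spark local (shuffle/spill) space with an executor hostPath
// volume instead of emptyDir or tmpfs. Paths are placeholders.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path",
       "/tmp/spark-local")
  .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path",
       "/mnt/fast-disk/spark-local")
{code}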
[jira] [Resolved] (SPARK-27934) Add case.sql
[ https://issues.apache.org/jira/browse/SPARK-27934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27934. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 > Add case.sql > > > Key: SPARK-27934 > URL: https://issues.apache.org/jira/browse/SPARK-27934 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/case.sql. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API
[ https://issues.apache.org/jira/browse/SPARK-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27899. - Resolution: Fixed Fix Version/s: 3.0.0 > Make HiveMetastoreClient.getTableObjectsByName available in > ExternalCatalog/SessionCatalog API > -- > > Key: SPARK-27899 > URL: https://issues.apache.org/jira/browse/SPARK-27899 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Juliusz Sompolski >Assignee: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > The new Spark ThriftServer SparkGetTablesOperation implemented in > https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata > request for every table. This can get very slow for large schemas (~50ms per > table with an external Hive metastore). > Hive ThriftServer GetTablesOperation uses > HiveMetastoreClient.getTableObjectsByName to get table information in bulk, > but we don't expose that through our APIs that go through Hive -> > HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> > SessionCatalog. > If we added and exposed getTableObjectsByName through our catalog APIs, we > could resolve that performance problem in SparkGetTablesOperation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
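A sketch of the bulk lookup this ticket asks for, as it might surface in the catalog layer (the method name and exact signature are illustrative; the underlying Hive call is the existing getTableObjectsByName):
{code:scala}
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Sketch: fetch metadata for N tables in one metastore round trip,
// instead of N separate getTableMetadata calls.
trait BulkTableLookup {
  def getTablesByName(db: String, tables: Seq[String]): Seq[CatalogTable]
}
{code}
With that in place, SparkGetTablesOperation could issue one call per schema rather than one per table.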
[jira] [Updated] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
[ https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27499: -- Fix Version/s: (was: 2.4.0) > Support mapping spark.local.dir to hostPath volume > -- > > Key: SPARK-27499 > URL: https://issues.apache.org/jira/browse/SPARK-27499 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Junjie Chen >Priority: Minor > > Currently, the k8s executor builder mounts spark.local.dir as an emptyDir or > in memory. That should satisfy small workloads, while in heavy workloads like > TPCDS both of them can have problems, such as pods being evicted due to disk > pressure when using emptyDir, and OOM when using tmpfs. > In particular, in cloud environments users may allocate a cluster with a > minimal configuration and add cloud storage when running a workload. In this > case, we could specify multiple elastic storage volumes as spark.local.dir to > accelerate spilling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
[ https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860598#comment-16860598 ] Dongjoon Hyun edited comment on SPARK-27499 at 6/11/19 7:42 AM: No, I think you didn't read the patch for that, [~junjie]. Could you read the patch? You can start from the following line. - https://github.com/apache/spark/pull/21238/files#diff-529fc5c06b9731c1fbda6f3db60b16aaR458 Also, please see the following `PVTestsSuite` test case. - https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/PVTestsSuite.scala was (Author: dongjoon): No, I think you didn't read the patch for that, [~junjie]. Could you read the patch? You can start from the following line. - https://github.com/apache/spark/pull/21238/files#diff-529fc5c06b9731c1fbda6f3db60b16aaR458 Also, please see the test case. > Support mapping spark.local.dir to hostPath volume > -- > > Key: SPARK-27499 > URL: https://issues.apache.org/jira/browse/SPARK-27499 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Junjie Chen >Priority: Minor > Fix For: 2.4.0 > > > Currently, the k8s executor builder mounts spark.local.dir as an emptyDir or > in memory. That should satisfy small workloads, while in heavy workloads like > TPCDS both of them can have problems, such as pods being evicted due to disk > pressure when using emptyDir, and OOM when using tmpfs. > In particular, in cloud environments users may allocate a cluster with a > minimal configuration and add cloud storage when running a workload. In this > case, we could specify multiple elastic storage volumes as spark.local.dir to > accelerate spilling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24175) improve the Spark 2.4 migration guide
[ https://issues.apache.org/jira/browse/SPARK-24175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24175: -- Component/s: Documentation > improve the Spark 2.4 migration guide > - > > Key: SPARK-24175 > URL: https://issues.apache.org/jira/browse/SPARK-24175 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > The current Spark 2.4 migration guide is not well phrased. We should > 1. State the before behavior > 2. State the after behavior > 3. Add a concrete example with code to illustrate. > For example: > Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after > promotes both sides to TIMESTAMP. To set `false` to > `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous > behavior. This option will be removed in Spark 3.0. > --> > In version 2.3 and earlier, Spark implicitly casts a timestamp column to date > type when comparing with a date column. In version 2.4 and later, Spark casts > the date column to timestamp type instead. As an example, "xxx" would result > in ".." in Spark 2.3, and in Spark 2.4, the result would be "..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24125) Add quoting rules to SQL guide
[ https://issues.apache.org/jira/browse/SPARK-24125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24125: -- Component/s: (was: SQL) Documentation > Add quoting rules to SQL guide > -- > > Key: SPARK-24125 > URL: https://issues.apache.org/jira/browse/SPARK-24125 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Henry Robinson >Priority: Minor > > As far as I can tell, Spark SQL's quoting rules are as follows: > * {{`foo bar`}} is an identifier > * {{'foo bar'}} is a string literal > * {{"foo bar"}} is a string literal > The last of these is non-standard (usually {{"foo bar"}} is an identifier), > and so it's probably worth mentioning these rules in the 'reference' section > of the [SQL > guide|http://spark.apache.org/docs/latest/sql-programming-guide.html#reference]. > I'm assuming there's not a lot of enthusiasm to change the quoting rules, > given it would be a breaking change, and that backticks work just fine as an > alternative. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24125) Add quoting rules to SQL guide
[ https://issues.apache.org/jira/browse/SPARK-24125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24125: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Add quoting rules to SQL guide > -- > > Key: SPARK-24125 > URL: https://issues.apache.org/jira/browse/SPARK-24125 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Henry Robinson >Priority: Minor > > As far as I can tell, Spark SQL's quoting rules are as follows: > * {{`foo bar`}} is an identifier > * {{'foo bar'}} is a string literal > * {{"foo bar"}} is a string literal > The last of these is non-standard (usually {{"foo bar"}} is an identifier), > and so it's probably worth mentioning these rules in the 'reference' section > of the [SQL > guide|http://spark.apache.org/docs/latest/sql-programming-guide.html#reference]. > I'm assuming there's not a lot of enthusiasm to change the quoting rules, > given it would be a breaking change, and that backticks work just fine as an > alternative. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
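A quick demonstration of the three quoting rules as Spark SQL behaves today (table {{t}} and its {{foo bar}} column are hypothetical):
{code:scala}
spark.sql("SELECT `foo bar` FROM t")  // backticks: identifier (a column named "foo bar")
spark.sql("SELECT 'foo bar' AS s")    // single quotes: string literal
spark.sql("SELECT \"foo bar\" AS s")  // double quotes: also a string literal (non-standard)
{code}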
[jira] [Updated] (SPARK-24584) More efficient storage of executor pod state
[ https://issues.apache.org/jira/browse/SPARK-24584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24584: -- Summary: More efficient storage of executor pod state (was: [K8s] More efficient storage of executor pod state) > More efficient storage of executor pod state > > > Key: SPARK-24584 > URL: https://issues.apache.org/jira/browse/SPARK-24584 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Priority: Major > > Currently we store buffers of snapshots in {{ExecutorPodsSnapshotStore}}, > where the snapshots are duplicated per subscriber. With hundreds or maybe > thousands of executors, this buffering may become untenable. Investigate > storing less state while still maintaining the same level-triggered semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24254) Eagerly evaluate some subqueries over LocalRelation
[ https://issues.apache.org/jira/browse/SPARK-24254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24254: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Eagerly evaluate some subqueries over LocalRelation > --- > > Key: SPARK-24254 > URL: https://issues.apache.org/jira/browse/SPARK-24254 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Henry Robinson >Priority: Major > > Some queries would benefit from evaluating subqueries over {{LocalRelations}} > eagerly. For example: > {code} > SELECT t1.part_col FROM t1 JOIN (SELECT max(part_col) m FROM t2) foo WHERE > t1.part_col = foo.m > {code} > If {{max(part_col)}} could be evaluated during planning, there's an > opportunity to prune all but at most one partitions from the scan of {{t1}}. > Similarly, a near-identical query with a non-scalar subquery in the {{WHERE}} > clause: > {code} > SELECT * FROM t1 WHERE part_col IN (SELECT part_col FROM t2) > {code} > could be partially evaluated to eliminate some partitions, and remove the > join from the plan. > Obviously all subqueries over local relations can't be evaluated during > planning, but certain whitelisted aggregates could be if the input > cardinality isn't too high. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24584) [K8s] More efficient storage of executor pod state
[ https://issues.apache.org/jira/browse/SPARK-24584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24584: -- Affects Version/s: (was: 2.4.0) 3.0.0 > [K8s] More efficient storage of executor pod state > -- > > Key: SPARK-24584 > URL: https://issues.apache.org/jira/browse/SPARK-24584 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Priority: Major > > Currently we store buffers of snapshots in {{ExecutorPodsSnapshotStore}}, > where the snapshots are duplicated per subscriber. With hundreds or maybe > thousands of executors, this buffering may become untenable. Investigate > storing less state while still maintaining the same level-triggered semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25050) Handle more than two types in avro union types when writing avro files
[ https://issues.apache.org/jira/browse/SPARK-25050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25050: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Handle more than two types in avro union types when writing avro files > -- > > Key: SPARK-25050 > URL: https://issues.apache.org/jira/browse/SPARK-25050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24831) Spark Executor class creates a new serializer each time
[ https://issues.apache.org/jira/browse/SPARK-24831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-24831. --- Resolution: Incomplete Sorry, but JIRA is not for questions. > Spark Executor class creates a new serializer each time > > > Key: SPARK-24831 > URL: https://issues.apache.org/jira/browse/SPARK-24831 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Major > > When I look in the code I find this line: > (org.apache.spark.executor.Executor: 319) > val resultSer = env.serializer.newInstance() > Why not hold this in a ThreadLocal and reuse it? > [question on > Stack Overflow|https://stackoverflow.com/questions/51298877/spark-executor-class-create-new-serializer-each-time] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
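For reference, the reuse the reporter suggests would look roughly like the following (a sketch, not Spark's actual code; whether the per-task allocation is even measurable is the open question):
{code:scala}
import org.apache.spark.serializer.{Serializer, SerializerInstance}

// Sketch: one SerializerInstance per task thread instead of a fresh
// env.serializer.newInstance() on every task run. SerializerInstance is
// not thread-safe, which is why the cache must be per-thread.
class CachedSerializer(serializer: Serializer) {
  private val cached = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance = serializer.newInstance()
  }
  def get(): SerializerInstance = cached.get()
}
{code}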
[jira] [Updated] (SPARK-25396) Read array of JSON objects via an Iterator
[ https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25396: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Read array of JSON objects via an Iterator > -- > > Key: SPARK-25396 > URL: https://issues.apache.org/jira/browse/SPARK-25396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > If a JSON file has a structure like below: > {code} > [ > { > "time":"2018-08-13T18:00:44.086Z", > "resourceId":"some-text", > "category":"A", > "level":2, > "operationName":"Error", > "properties":{...} > }, > { > "time":"2018-08-14T18:00:44.086Z", > "resourceId":"some-text2", > "category":"B", > "level":3, > "properties":{...} > }, > ... > ] > {code} > it should be read in the `multiLine` mode. In this mode, Spark reads the whole > array into memory in both cases, when the schema is `ArrayType` and when it is > `StructType`. This can lead to unnecessary memory consumption and even to OOM > for big JSON files. > In general, there is no need to materialize all parsed JSON records in memory > there: > https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95 > . So, JSON objects of an array can be read via an Iterator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
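For context, this is the read path in question; a file shaped like the example is only parseable with {{multiLine}} enabled, which is exactly where the whole array currently gets materialized (the path is a placeholder):
{code:scala}
// Sketch: reading a top-level JSON array of objects.
val df = spark.read
  .option("multiLine", true)
  .json("/data/events.json") // placeholder path
{code}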
[jira] [Updated] (SPARK-27013) Consider adding support for external encoders when resolving org.apache.spark.sql.catalyst.encoders.ExpressionEncoder's apply method
[ https://issues.apache.org/jira/browse/SPARK-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27013: -- Component/s: (was: Optimizer) SQL > Consider adding support for external encoders when resolving > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder's apply method > > > Key: SPARK-27013 > URL: https://issues.apache.org/jira/browse/SPARK-27013 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Ottey >Priority: Minor > > I recently discovered that, because most of the common implicit encoders > introduced by > {noformat}import spark.implicits._{noformat} > reduce to a call to {{ExpressionEncoder}}'s {{apply}} method, it's _very_ > difficult to generate and/or operate on {{Column}}s whose internal types > reduce to some Scala type that wraps an external type, even if an implicit > encoder for that external type is available or could be trivially generated. > See the example below: > {code:scala} > import com.example.MyBean > object Example { > implicit def BeanEncoder: Encoder[MyBean] = Encoders.bean(classOf[MyBean]) > > def main(args: Array[String]): Unit = { > val path = args(0) > val spark: SparkSession = ??? > import spark.implicits._ > // THE FOLLOWING DOES NOT WORK!!! > // implicit encoder for Seq[_] is found and used... > // Calls ExpressionEncoder's apply method > // Unwraps the inner type com.example.MyBean... > // ScalaReflection.serializeFor() cannot find encoder for our type > // Even though we can trivially create one above > // Fails at runtime with UnsupportedOperationException from > // ScalaReflection.serializeFor() > val ds = spark.read > .format("avro") > .option("compression", "snappy") > .load(path) > .select($"myColumn".as[Seq[MyBean]]) > } > {code} > What's particularly frustrating is that if we were using any user-defined > case class instead of the java bean type, this is not a problem, as the > structuring of the various implicit encoders in the related packages seems to > allow the {{ScalaReflection.serializeFor()}} method to work on arbitrary > {{scala.Product}} types... (There's an implicit encoder in > org.apache.spark.sql.Encoders that looks relevant) > I realize that there are workarounds, such as wrapping the types and then > using a simple {{.map()}}, or using kryo or java serialization, but my > understanding is that would mean giving up on potential Catalyst > optimizations... > It would be really nice if there were a simple way to tell > {{ScalaReflection.serializeFor()}} to look for/use other, potentially > user-defined encoders, especially if they could be generated from the factory > encoder methods supplied by Spark itself... > Alternatively, it would be exceptionally nice if calls to > {{ExpressionEncoder}}'s {{apply}} method would support expressions with types > that include {{java.util.List}} or arbitrary java bean types as well as > {{scala.Product}} types. > See > [here|https://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-dataset] > on Stackoverflow for other details... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
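To make the contrast in the description concrete: the same select works when the element type is a case class, because spark.implicits._ derives the encoder through ScalaReflection (the case class and column name below are illustrative):
{code:scala}
// Sketch: with a Product type instead of a Java bean, .as[Seq[...]] resolves
// via the implicit product-sequence encoder from spark.implicits._.
case class MyRecord(a: Int, b: String)

import spark.implicits._
val ds = spark.read
  .format("avro")
  .load(path)                           // `path` as in the example above
  .select($"myColumn".as[Seq[MyRecord]])
{code}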
[jira] [Updated] (SPARK-26948) Upgrade vertex and edge rowkeys to support multiple types?
[ https://issues.apache.org/jira/browse/SPARK-26948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26948: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Upgrade vertex and edge rowkeys to support multiple types? > -- > > Key: SPARK-26948 > URL: https://issues.apache.org/jira/browse/SPARK-26948 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 3.0.0 >Reporter: daile >Priority: Minor > > Currently only Long is supported, but most graph databases use a string as > the primary key. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27013) Consider adding support for external encoders when resolving org.apache.spark.sql.catalyst.encoders.ExpressionEncoder's apply method
[ https://issues.apache.org/jira/browse/SPARK-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27013: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Consider adding support for external encoders when resolving > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder's apply method > > > Key: SPARK-27013 > URL: https://issues.apache.org/jira/browse/SPARK-27013 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.0.0 >Reporter: Frank Ottey >Priority: Minor > > I recently discovered that, because most of the common implicit encoders > introduced by > {noformat}import spark.implicits._{noformat} > reduce to a call to {{ExpressionEncoder}}'s {{apply}} method, it's _very_ > difficult to generate and/or operate on {{Column}}s whose internal types > reduce to some Scala type that wraps an external type, even if an implicit > encoder for that external type is available or could be trivially generated. > See the example below: > {code:scala} > import com.example.MyBean > object Example { > implicit def BeanEncoder: Encoder[MyBean] = Encoders.bean(classOf[MyBean]) > > def main(args: Array[String]): Unit = { > val path = args(0) > val spark: SparkSession = ??? > import spark.implicits._ > // THE FOLLOWING DOES NOT WORK!!! > // implicit encoder for Seq[_] is found and used... > // Calls ExpressionEncoder's apply method > // Unwraps the inner type com.example.MyBean... > // ScalaReflection.serializeFor() cannot find encoder for our type > // Even though we can trivially create one above > // Fails at runtime with UnsupportedOperationException from > // ScalaReflection.serializeFor() > val ds = spark.read > .format("avro") > .option("compression", "snappy") > .load(path) > .select($"myColumn".as[Seq[MyBean]]) > } > {code} > What's particularly frustrating is that if we were using any user-defined > case class instead of the java bean type, this is not a problem, as the > structuring of the various implicit encoders in the related packages seems to > allow the {{ScalaReflection.serializeFor()}} method to work on arbitrary > {{scala.Product}} types... (There's an implicit encoder in > org.apache.spark.sql.Encoders that looks relevant) > I realize that there are workarounds, such as wrapping the types and then > using a simple {{.map()}}, or using kryo or java serialization, but my > understanding is that would mean giving up on potential Catalyst > optimizations... > It would be really nice if there were a simple way to tell > {{ScalaReflection.serializeFor()}} to look for/use other, potentially > user-defined encoders, especially if they could be generated from the factory > encoder methods supplied by Spark itself... > Alternatively, it would be exceptionally nice if calls to > {{ExpressionEncoder}}'s {{apply}} method would support expressions with types > that include {{java.util.List}} or arbitrary java bean types as well as > {{scala.Product}} types. > See > [here|https://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-dataset] > on Stackoverflow for other details... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org