[jira] [Resolved] (SPARK-25817) Dataset encoder should support combination of map and product type

2018-10-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25817.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22812
[https://github.com/apache/spark/pull/22812]

> Dataset encoder should support combination of map and product type
> --
>
> Key: SPARK-25817
> URL: https://issues.apache.org/jira/browse/SPARK-25817
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
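
For readers unfamiliar with the limitation, here is a minimal, hypothetical illustration of the schema shape the summary refers to: a product type (a case class) used as the value type of a map field inside another case class. The class names and session setup are illustrative only and are not taken from the JIRA or from pull request 22812.

{code:scala}
// Illustrative sketch only: a "combination of map and product type".
// Before the fix, deriving a Dataset encoder for this kind of nesting could fail.
import org.apache.spark.sql.SparkSession

case class Item(price: Double)
case class Order(id: Long, items: Map[String, Item])

object MapProductEncoderExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-product-encoder")
      .getOrCreate()
    import spark.implicits._

    // A case class containing a Map whose values are another case class.
    val ds = Seq(Order(1L, Map("apple" -> Item(1.5)))).toDS()
    ds.printSchema()
    ds.show()

    spark.stop()
  }
}
{code}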




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25833) Views without column names created by Hive are not readable by Spark

2018-10-27 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1291#comment-1291
 ] 

Chenxiao Mao commented on SPARK-25833:
--

[~dkbiswal] Thanks for your comments. I think you are right that this is a 
duplicate.

Does it make sense to describe this compatibility issue explicitly in the user 
guide, to help users troubleshoot it?
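
For reference, a hedged sketch of the alias-based workaround discussed under SPARK-24864 (the view and column names below are illustrative): supplying explicit column names in the view definition keeps Hive from generating _c0-style names that Spark cannot resolve.

{code:sql}
-- Hypothetical workaround sketch: declare the view's column names explicitly ...
hive> CREATE VIEW v2 (col1) AS SELECT * FROM (SELECT 1) t1;
-- ... or alias every expression in the defining SELECT:
hive> CREATE VIEW v3 AS SELECT 1 AS col1;
{code}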

> Views without column names created by Hive are not readable by Spark
> 
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> A simple example to reproduce this issue.
>  create a view via Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> query that view via Spark
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> _c0 in above view definition is automatically generated by Hive, which is not 
> recognizable by Spark.
>  see [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to tmpDir from $PWD to HDFS

2018-10-27 Thread Greg Senia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Senia updated SPARK-25778:
---
Attachment: SPARK-25778.patch

> WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to 
> tmpDir from $PWD to HDFS
> -
>
> Key: SPARK-25778
> URL: https://issues.apache.org/jira/browse/SPARK-25778
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, YARN
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.1, 
> 2.3.2
>Reporter: Greg Senia
>Priority: Major
> Attachments: SPARK-25778.patch
>
>
> WriteAheadLogBackedBlockRDD in YARN cluster mode fails due to lack of access to 
> an HDFS path: the temporary directory it picks is derived from the $PWD folder 
> of the YARN AM container, and a folder of the same name exists on HDFS.
> While using Spark Streaming with WriteAheadLogs, I noticed the following errors 
> after the driver attempted to recover the already-read data that had been 
> written to the HDFS checkpoint folder. After spending many hours on the error 
> below, the cause turned out to be that the parent folder /hadoop exists in our 
> HDFS filesystem. I wonder if it is possible to add a configurable option to 
> choose an alternate bogus directory that will never be used (a sketch of such 
> an option follows the stack trace below).
> hadoop fs -ls /
> drwx--   - dsadmdsadm   0 2017-06-20 13:20 /hadoop
> hadoop fs -ls /hadoop/apps
> drwx--   - dsadm dsadm  0 2017-06-20 13:20 /hadoop/apps
> streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
>   val nonExistentDirectory = new File(
>   System.getProperty("java.io.tmpdir"), 
> UUID.randomUUID().toString).getAbsolutePath
> writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
>   SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
> dataRead = writeAheadLog.read(partition.walRecordHandle)
> 18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: Launching 
> task 72 on executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
> 18/10/19 00:03:03 DEBUG BlockManager: Getting local block broadcast_4_piece0 
> as bytes
> 18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 is 
> StorageLevel(disk, memory, 1 replicas)
> 18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory 
> on ha20t5002dn.tech.hdp.example.com:32768 (size: 33.7 KB, free: 912.2 MB)
> 18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, 
> ha20t5002dn.tech.hdp.example.com, executor 1): 
> org.apache.spark.SparkException: Could not read data from write ahead log 
> record 
> FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission 
> denied: user=hdpdevspark, access=EXECUTE, 
> inode="/hadoop/diskc/hadoop/yarn/local/usercache/hdpdevspark/appcache/application_1539554105597_0338/container_e322_1539554105597_0338_01_02/tmp/170f36b8-9202-4556-89a4-64587c7136b6":dsadm:dsadm:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
>   at 
> 
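
Not the actual patch, only a hedged sketch of the kind of configurable option the report asks for. The configuration key name is hypothetical, and the snippet assumes the surrounding WriteAheadLogBackedBlockRDD context quoted above (writeAheadLog, hadoopConf, partition, SparkEnv).

{code:scala}
// Hedged sketch only; "spark.streaming.driver.writeAheadLog.tmpDir" is a
// hypothetical key. Fall back to java.io.tmpdir when it is unset, so the
// default behaviour is unchanged.
import java.io.File
import java.util.UUID

val conf = SparkEnv.get.conf
val baseDir = conf.get("spark.streaming.driver.writeAheadLog.tmpDir",
  System.getProperty("java.io.tmpdir"))
val nonExistentDirectory =
  new File(baseDir, UUID.randomUUID().toString).getAbsolutePath

writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
  conf, nonExistentDirectory, hadoopConf)
dataRead = writeAheadLog.read(partition.walRecordHandle)
{code}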

[jira] [Commented] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to tmpDir from $PWD to HDFS

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1285#comment-1285
 ] 

Apache Spark commented on SPARK-25778:
--

User 'gss2002' has created a pull request for this issue:
https://github.com/apache/spark/pull/22867

> WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to 
> tmpDir from $PWD to HDFS
> -
>
> Key: SPARK-25778
> URL: https://issues.apache.org/jira/browse/SPARK-25778
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, YARN
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.1, 
> 2.3.2
>Reporter: Greg Senia
>Priority: Major
>
> WriteAheadLogBackedBlockRDD in YARN cluster mode fails due to lack of access to 
> an HDFS path: the temporary directory it picks is derived from the $PWD folder 
> of the YARN AM container, and a folder of the same name exists on HDFS.
> While using Spark Streaming with WriteAheadLogs, I noticed the following errors 
> after the driver attempted to recover the already-read data that had been 
> written to the HDFS checkpoint folder. After spending many hours on the error 
> below, the cause turned out to be that the parent folder /hadoop exists in our 
> HDFS filesystem. I wonder if it is possible to add a configurable option to 
> choose an alternate bogus directory that will never be used.
> hadoop fs -ls /
> drwx--   - dsadmdsadm   0 2017-06-20 13:20 /hadoop
> hadoop fs -ls /hadoop/apps
> drwx--   - dsadm dsadm  0 2017-06-20 13:20 /hadoop/apps
> streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
>   val nonExistentDirectory = new File(
>   System.getProperty("java.io.tmpdir"), 
> UUID.randomUUID().toString).getAbsolutePath
> writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
>   SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
> dataRead = writeAheadLog.read(partition.walRecordHandle)
> 18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: Launching 
> task 72 on executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
> 18/10/19 00:03:03 DEBUG BlockManager: Getting local block broadcast_4_piece0 
> as bytes
> 18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 is 
> StorageLevel(disk, memory, 1 replicas)
> 18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory 
> on ha20t5002dn.tech.hdp.example.com:32768 (size: 33.7 KB, free: 912.2 MB)
> 18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, 
> ha20t5002dn.tech.hdp.example.com, executor 1): 
> org.apache.spark.SparkException: Could not read data from write ahead log 
> record 
> FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission 
> denied: user=hdpdevspark, access=EXECUTE, 
> inode="/hadoop/diskc/hadoop/yarn/local/usercache/hdpdevspark/appcache/application_1539554105597_0338/container_e322_1539554105597_0338_01_02/tmp/170f36b8-9202-4556-89a4-64587c7136b6":dsadm:dsadm:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
>   at 
> 

[jira] [Commented] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to tmpDir from $PWD to HDFS

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1284#comment-1284
 ] 

Apache Spark commented on SPARK-25778:
--

User 'gss2002' has created a pull request for this issue:
https://github.com/apache/spark/pull/22867

> WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to 
> tmpDir from $PWD to HDFS
> -
>
> Key: SPARK-25778
> URL: https://issues.apache.org/jira/browse/SPARK-25778
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, YARN
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.1, 
> 2.3.2
>Reporter: Greg Senia
>Priority: Major
>
> WriteAheadLogBackedBlockRDD in YARN cluster mode fails due to lack of access to 
> an HDFS path: the temporary directory it picks is derived from the $PWD folder 
> of the YARN AM container, and a folder of the same name exists on HDFS.
> While using Spark Streaming with WriteAheadLogs, I noticed the following errors 
> after the driver attempted to recover the already-read data that had been 
> written to the HDFS checkpoint folder. After spending many hours on the error 
> below, the cause turned out to be that the parent folder /hadoop exists in our 
> HDFS filesystem. I wonder if it is possible to add a configurable option to 
> choose an alternate bogus directory that will never be used.
> hadoop fs -ls /
> drwx--   - dsadmdsadm   0 2017-06-20 13:20 /hadoop
> hadoop fs -ls /hadoop/apps
> drwx--   - dsadm dsadm  0 2017-06-20 13:20 /hadoop/apps
> streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
>   val nonExistentDirectory = new File(
>   System.getProperty("java.io.tmpdir"), 
> UUID.randomUUID().toString).getAbsolutePath
> writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
>   SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
> dataRead = writeAheadLog.read(partition.walRecordHandle)
> 18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: Launching 
> task 72 on executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
> 18/10/19 00:03:03 DEBUG BlockManager: Getting local block broadcast_4_piece0 
> as bytes
> 18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 is 
> StorageLevel(disk, memory, 1 replicas)
> 18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory 
> on ha20t5002dn.tech.hdp.example.com:32768 (size: 33.7 KB, free: 912.2 MB)
> 18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, 
> ha20t5002dn.tech.hdp.example.com, executor 1): 
> org.apache.spark.SparkException: Could not read data from write ahead log 
> record 
> FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission 
> denied: user=hdpdevspark, access=EXECUTE, 
> inode="/hadoop/diskc/hadoop/yarn/local/usercache/hdpdevspark/appcache/application_1539554105597_0338/container_e322_1539554105597_0338_01_02/tmp/170f36b8-9202-4556-89a4-64587c7136b6":dsadm:dsadm:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
>   at 
> 

[jira] [Assigned] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to tmpDir from $PWD to HDFS

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25778:


Assignee: Apache Spark

> WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to 
> tmpDir from $PWD to HDFS
> -
>
> Key: SPARK-25778
> URL: https://issues.apache.org/jira/browse/SPARK-25778
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, YARN
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.1, 
> 2.3.2
>Reporter: Greg Senia
>Assignee: Apache Spark
>Priority: Major
>
> WriteAheadLogBackedBlockRDD in YARN cluster mode fails due to lack of access to 
> an HDFS path: the temporary directory it picks is derived from the $PWD folder 
> of the YARN AM container, and a folder of the same name exists on HDFS.
> While using Spark Streaming with WriteAheadLogs, I noticed the following errors 
> after the driver attempted to recover the already-read data that had been 
> written to the HDFS checkpoint folder. After spending many hours on the error 
> below, the cause turned out to be that the parent folder /hadoop exists in our 
> HDFS filesystem. I wonder if it is possible to add a configurable option to 
> choose an alternate bogus directory that will never be used.
> hadoop fs -ls /
> drwx--   - dsadmdsadm   0 2017-06-20 13:20 /hadoop
> hadoop fs -ls /hadoop/apps
> drwx--   - dsadm dsadm  0 2017-06-20 13:20 /hadoop/apps
> streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
>   val nonExistentDirectory = new File(
>   System.getProperty("java.io.tmpdir"), 
> UUID.randomUUID().toString).getAbsolutePath
> writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
>   SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
> dataRead = writeAheadLog.read(partition.walRecordHandle)
> 18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: Launching 
> task 72 on executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
> 18/10/19 00:03:03 DEBUG BlockManager: Getting local block broadcast_4_piece0 
> as bytes
> 18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 is 
> StorageLevel(disk, memory, 1 replicas)
> 18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory 
> on ha20t5002dn.tech.hdp.example.com:32768 (size: 33.7 KB, free: 912.2 MB)
> 18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, 
> ha20t5002dn.tech.hdp.example.com, executor 1): 
> org.apache.spark.SparkException: Could not read data from write ahead log 
> record 
> FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission 
> denied: user=hdpdevspark, access=EXECUTE, 
> inode="/hadoop/diskc/hadoop/yarn/local/usercache/hdpdevspark/appcache/application_1539554105597_0338/container_e322_1539554105597_0338_01_02/tmp/170f36b8-9202-4556-89a4-64587c7136b6":dsadm:dsadm:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
>   at 
> 

[jira] [Assigned] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to tmpDir from $PWD to HDFS

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25778:


Assignee: (was: Apache Spark)

> WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to 
> tmpDir from $PWD to HDFS
> -
>
> Key: SPARK-25778
> URL: https://issues.apache.org/jira/browse/SPARK-25778
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, YARN
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.1, 
> 2.3.2
>Reporter: Greg Senia
>Priority: Major
>
> WriteAheadLogBackedBlockRDD in YARN cluster mode fails due to lack of access to 
> an HDFS path: the temporary directory it picks is derived from the $PWD folder 
> of the YARN AM container, and a folder of the same name exists on HDFS.
> While using Spark Streaming with WriteAheadLogs, I noticed the following errors 
> after the driver attempted to recover the already-read data that had been 
> written to the HDFS checkpoint folder. After spending many hours on the error 
> below, the cause turned out to be that the parent folder /hadoop exists in our 
> HDFS filesystem. I wonder if it is possible to add a configurable option to 
> choose an alternate bogus directory that will never be used.
> hadoop fs -ls /
> drwx--   - dsadmdsadm   0 2017-06-20 13:20 /hadoop
> hadoop fs -ls /hadoop/apps
> drwx--   - dsadm dsadm  0 2017-06-20 13:20 /hadoop/apps
> streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
>   val nonExistentDirectory = new File(
>   System.getProperty("java.io.tmpdir"), 
> UUID.randomUUID().toString).getAbsolutePath
> writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
>   SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
> dataRead = writeAheadLog.read(partition.walRecordHandle)
> 18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: Launching 
> task 72 on executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
> 18/10/19 00:03:03 DEBUG BlockManager: Getting local block broadcast_4_piece0 
> as bytes
> 18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 is 
> StorageLevel(disk, memory, 1 replicas)
> 18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory 
> on ha20t5002dn.tech.hdp.example.com:32768 (size: 33.7 KB, free: 912.2 MB)
> 18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, 
> ha20t5002dn.tech.hdp.example.com, executor 1): 
> org.apache.spark.SparkException: Could not read data from write ahead log 
> record 
> FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission 
> denied: user=hdpdevspark, access=EXECUTE, 
> inode="/hadoop/diskc/hadoop/yarn/local/usercache/hdpdevspark/appcache/application_1539554105597_0338/container_e322_1539554105597_0338_01_02/tmp/170f36b8-9202-4556-89a4-64587c7136b6":dsadm:dsadm:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
>   at 
> 

[jira] [Updated] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to tmpDir from $PWD to HDFS

2018-10-27 Thread Greg Senia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Senia updated SPARK-25778:
---
Summary: WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of 
access to tmpDir from $PWD to HDFS  (was: WriteAheadLogBackedBlockRDD in YARN 
Cluster Mode Fails due lack of access)

> WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to 
> tmpDir from $PWD to HDFS
> -
>
> Key: SPARK-25778
> URL: https://issues.apache.org/jira/browse/SPARK-25778
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, YARN
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.1, 
> 2.3.2
>Reporter: Greg Senia
>Priority: Major
>
> WriteAheadLogBackedBlockRDD in YARN cluster mode fails due to lack of access to 
> an HDFS path: the temporary directory it picks is derived from the $PWD folder 
> of the YARN AM container, and a folder of the same name exists on HDFS.
> While using Spark Streaming with WriteAheadLogs, I noticed the following errors 
> after the driver attempted to recover the already-read data that had been 
> written to the HDFS checkpoint folder. After spending many hours on the error 
> below, the cause turned out to be that the parent folder /hadoop exists in our 
> HDFS filesystem. I wonder if it is possible to add a configurable option to 
> choose an alternate bogus directory that will never be used.
> hadoop fs -ls /
> drwx--   - dsadmdsadm   0 2017-06-20 13:20 /hadoop
> hadoop fs -ls /hadoop/apps
> drwx--   - dsadm dsadm  0 2017-06-20 13:20 /hadoop/apps
> streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
>   val nonExistentDirectory = new File(
>   System.getProperty("java.io.tmpdir"), 
> UUID.randomUUID().toString).getAbsolutePath
> writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
>   SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
> dataRead = writeAheadLog.read(partition.walRecordHandle)
> 18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: Launching 
> task 72 on executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
> 18/10/19 00:03:03 DEBUG BlockManager: Getting local block broadcast_4_piece0 
> as bytes
> 18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 is 
> StorageLevel(disk, memory, 1 replicas)
> 18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory 
> on ha20t5002dn.tech.hdp.example.com:32768 (size: 33.7 KB, free: 912.2 MB)
> 18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, 
> ha20t5002dn.tech.hdp.example.com, executor 1): 
> org.apache.spark.SparkException: Could not read data from write ahead log 
> record 
> FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission 
> denied: user=hdpdevspark, access=EXECUTE, 
> inode="/hadoop/diskc/hadoop/yarn/local/usercache/hdpdevspark/appcache/application_1539554105597_0338/container_e322_1539554105597_0338_01_02/tmp/170f36b8-9202-4556-89a4-64587c7136b6":dsadm:dsadm:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
>   at 
> 

[jira] [Assigned] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2018-10-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19851:
---

Assignee: Dilip Biswal

> Add support for EVERY and ANY (SOME) aggregates
> ---
>
> Key: SPARK-19851
> URL: https://issues.apache.org/jira/browse/SPARK-19851
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>
> Add support for EVERY and ANY (SOME) aggregates.
> - EVERY returns true if all input values are true.
> - ANY returns true if at least one input value is true.
> - SOME is equivalent to ANY.
> Both aggregates are part of the SQL standard.
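
A minimal usage sketch of how these aggregates read once supported; the table and column names are made up.

{code:sql}
-- Hypothetical data: a boolean column "passed" per row, grouped by "dept".
SELECT dept,
       EVERY(passed) AS all_passed,  -- true only if every row in the group is true
       ANY(passed)   AS any_passed   -- true if at least one row in the group is true
FROM results
GROUP BY dept;
-- SOME(passed) returns the same result as ANY(passed).
{code}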



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2018-10-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19851.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22809
[https://github.com/apache/spark/pull/22809]

> Add support for EVERY and ANY (SOME) aggregates
> ---
>
> Key: SPARK-19851
> URL: https://issues.apache.org/jira/browse/SPARK-19851
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>Priority: Major
> Fix For: 3.0.0
>
>
> Add support for EVERY and ANY (SOME) aggregates.
> - EVERY returns true if all input values are true.
> - ANY returns true if at least one input value is true.
> - SOME is equivalent to ANY.
> Both aggregates are part of the SQL standard.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1252#comment-1252
 ] 

Apache Spark commented on SPARK-12172:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/22866

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Task
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1251#comment-1251
 ] 

Apache Spark commented on SPARK-12172:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/22866

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Task
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12172) Consider removing SparkR internal RDD APIs

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12172:


Assignee: (was: Apache Spark)

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Task
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12172) Consider removing SparkR internal RDD APIs

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12172:


Assignee: Apache Spark

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Task
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25859) add scala/java/python example and doc for PrefixSpan

2018-10-27 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-25859.
--
  Resolution: Fixed
Assignee: Huaxin Gao
   Fix Version/s: 2.4.0
Target Version/s: 2.4.0

> add scala/java/python example and doc for PrefixSpan
> 
>
> Key: SPARK-25859
> URL: https://issues.apache.org/jira/browse/SPARK-25859
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala/Java/Python examples and docs for PrefixSpan were added for 3.0 in 
> https://issues.apache.org/jira/browse/SPARK-24207. This JIRA is to add the same 
> examples and docs to 2.4.
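
As a rough sketch of the kind of example being ported (not necessarily the exact snippet that lands in the docs), the spark.ml PrefixSpan API in 2.4 is used along these lines; the input sequences are made up, and a SparkSession named spark with implicits imported is assumed.

{code:scala}
// Sketch only; assumes `spark: SparkSession` and `import spark.implicits._`.
import org.apache.spark.ml.fpm.PrefixSpan

val sequences = Seq(
  Seq(Seq(1, 2), Seq(3)),
  Seq(Seq(1), Seq(3, 2), Seq(1, 2)),
  Seq(Seq(1, 2), Seq(5)),
  Seq(Seq(6))
).toDF("sequence")

val patterns = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .setSequenceCol("sequence")
  .findFrequentSequentialPatterns(sequences)

patterns.show(truncate = false)
{code}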



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16693) Remove R deprecated methods

2018-10-27 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-16693.
--
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 3.0.0

> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
> Fix For: 3.0.0
>
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0 -> 3.0.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25823) map_filter can generate incorrect data

2018-10-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25823:
--
Priority: Critical  (was: Blocker)

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: correctness
>
> This is not a regression because it occurs only in new higher-order functions 
> like `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` 
> allows duplicate keys. If we want to allow this difference in the new 
> higher-order functions, we should at least add a warning about it to these 
> functions after the RC4 vote passes. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}
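
For context on the `CreateMap` remark, a hedged probe query; no output is shown, because how the duplicated key is handled is exactly the behaviour under discussion.

{code:sql}
-- map(...) is backed by CreateMap; the key 1 appears twice and, per the
-- description above, the constructor does not reject the duplication, which is
-- how downstream functions such as map_filter end up seeing duplicate keys.
spark-sql> SELECT map(1, 2, 1, 3);
{code}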



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25823) map_filter can generate incorrect data

2018-10-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25823:
--
Affects Version/s: 3.0.0  (was: 2.4.0)

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because it occurs only in new higher-order functions 
> like `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` 
> allows duplicate keys. If we want to allow this difference in the new 
> higher-order functions, we should at least add a warning about it to these 
> functions after the RC4 vote passes. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25833) Views without column names created by Hive are not readable by Spark

2018-10-27 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1178#comment-1178
 ] 

Dilip Biswal edited comment on SPARK-25833 at 10/27/18 8:39 PM:


This looks like a duplicate of 
https://issues.apache.org/jira/browse/SPARK-24864. Please see the discussion 
there. Basically, Hive and Spark are two different systems and follow different 
schemes for computing auto-generated column names. We should use aliases in the 
view definition to make it readable from Spark.

cc [~smilegator] [~srowen]
Thank you.


was (Author: dkbiswal):
This looks like a duplicate of 
https://issues.apache.org/jira/browse/SPARK-24864. Please see the discussion 
there. Basically, Hive and Spark are two different systems and follow different 
schemes for computing auto-generated column names. We should use aliases in the 
view definition to make it readable from Spark.

Thank you.

> Views without column names created by Hive are not readable by Spark
> 
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> A simple example to reproduce this issue.
>  create a view via Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> query that view via Spark
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> _c0 in above view definition is automatically generated by Hive, which is not 
> recognizable by Spark.
>  see [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25858) Passing Field Metadata to Parquet

2018-10-27 Thread Xinli Shang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved SPARK-25858.
-
Resolution: Later

It is a little early to open this issue. I will re-open it after the design of 
the dependency issues is settled.

> Passing Field Metadata to Parquet
> -
>
> Key: SPARK-25858
> URL: https://issues.apache.org/jira/browse/SPARK-25858
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output
>Affects Versions: 2.3.2
>Reporter: Xinli Shang
>Priority: Major
>
> h1. Problem Statement
> The Spark WriteSupport class for Parquet is hardcoded to use 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport, which 
> is not configurable. Currently, this class doesn’t carry over the field 
> metadata in StructType to MessageType. However, Parquet column encryption 
> (Parquet-1396, Parquet-1178) requires the field metadata inside MessageType 
> of Parquet, so that the metadata can be used to control column encryption.
> h1. Technical Solution
>  # Extend SparkToParquetSchemaConverter class and override convert() method 
> to add the functionality of carrying over the field metadata
>  # Extend ParquetWriteSupport and use the extended converter in #1. The 
> extension avoids changing the built-in WriteSupport to mitigate the risk.
>  # Change Spark code to make the WriteSupport class configurable to let the 
> user configure to use the extended WriteSupport in #2.  The default 
> WriteSupport is still 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.
> h1. Technical Details
> Note: the formatting of the code below was lost when pasting into JIRA; the 
> repository linked under Verification shows it properly formatted.
> h2. Extend SparkToParquetSchemaConverter class
> class SparkToParquetMetadataSchemaConverter extends SparkToParquetSchemaConverter {
> 
>   override def convert(catalystSchema: StructType): MessageType = {
>     Types
>       .buildMessage()
>       .addFields(catalystSchema.map(convertFieldWithMetadata): _*)
>       .named(ParquetSchemaConverter.SPARK_PARQUET_SCHEMA_NAME)
>   }
> 
>   private def convertFieldWithMetadata(field: StructField): Type = {
>     val extField = new ExtType[Any](convertField(field))
>     val metaBuilder = new MetadataBuilder().withMetadata(field.metadata)
>     val metaData = metaBuilder.getMap
>     extField.setMetadata(metaData)
>     return extField
>   }
> }
> h2. Extend ParquetWriteSupport
> class CryptoParquetWriteSupport extends ParquetWriteSupport {
>   override def init(configuration: Configuration): WriteContext = {
>     val converter = new SparkToParquetMetadataSchemaConverter(configuration)
>     createContext(configuration, converter)
>   }
> }
> h2. Make WriteSupport configurable
> class ParquetFileFormat {
>   override def prepareWrite(...) {
>     ...
>     if (conf.get(ParquetOutputFormat.WRITE_SUPPORT_CLASS) == null) {
>       ParquetOutputFormat.setWriteSupportClass(job, classOf[ParquetWriteSupport])
>     }
>     ...
>   }
> }
> h1. Verification
> The 
> [ParquetHelloWorld.java|https://github.com/shangxinli/parquet-writesupport-extensions/blob/master/src/main/java/com/uber/ParquetHelloWorld.java]
>  in the github repository 
> [parquet-writesupport-extensions|https://github.com/shangxinli/parquet-writesupport-extensions]
>  has a sample verification of passing down the field metadata and perform 
> column encryption.
> h1. Dependency
>  * Parquet-1178
>  * Parquet-1396
>  * Parquet-1397



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25859) add scala/java/python example and doc for PrefixSpan

2018-10-27 Thread Huaxin Gao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1211#comment-1211
 ] 

Huaxin Gao commented on SPARK-25859:


PowerIterationClustering is not in the doc either. Do I need to add it too?

> add scala/java/python example and doc for PrefixSpan
> 
>
> Key: SPARK-25859
> URL: https://issues.apache.org/jira/browse/SPARK-25859
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Scala/Java/Python examples and docs for PrefixSpan were added for 3.0 in 
> https://issues.apache.org/jira/browse/SPARK-24207. This JIRA is to add the same 
> examples and docs to 2.4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25861) Remove unused refreshInterval parameter from the headerSparkPage method.

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25861:


Assignee: (was: Apache Spark)

> Remove unused refreshInterval parameter from the headerSparkPage method.
> 
>
> Key: SPARK-25861
> URL: https://issues.apache.org/jira/browse/SPARK-25861
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: shahid
>Priority: Minor
>
> https://github.com/apache/spark/blob/d5573c578a1eea9ee04886d9df37c7178e67bb30/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L221
>  
> refreshInterval is not used anywhere in the headerSparkPage method. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25861) Remove unused refreshInterval parameter from the headerSparkPage method.

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25861:


Assignee: Apache Spark

> Remove unused refreshInterval parameter from the headerSparkPage method.
> 
>
> Key: SPARK-25861
> URL: https://issues.apache.org/jira/browse/SPARK-25861
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: shahid
>Assignee: Apache Spark
>Priority: Minor
>
> https://github.com/apache/spark/blob/d5573c578a1eea9ee04886d9df37c7178e67bb30/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L221
>  
> refreshInterval is not used anywhere in the headerSparkPage method. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25861) Remove unused refreshInterval parameter from the headerSparkPage method.

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1196#comment-1196
 ] 

Apache Spark commented on SPARK-25861:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/22864

> Remove unused refreshInterval parameter from the headerSparkPage method.
> 
>
> Key: SPARK-25861
> URL: https://issues.apache.org/jira/browse/SPARK-25861
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: shahid
>Priority: Minor
>
> https://github.com/apache/spark/blob/d5573c578a1eea9ee04886d9df37c7178e67bb30/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L221
>  
> refreshInterval is not used anywhere in the headerSparkPage method. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25861) Remove unused refreshInterval parameter from the headerSparkPage method.

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1197#comment-1197
 ] 

Apache Spark commented on SPARK-25861:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/22864

> Remove unused refreshInterval parameter from the headerSparkPage method.
> 
>
> Key: SPARK-25861
> URL: https://issues.apache.org/jira/browse/SPARK-25861
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: shahid
>Priority: Minor
>
> https://github.com/apache/spark/blob/d5573c578a1eea9ee04886d9df37c7178e67bb30/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L221
>  
> refreshInterval is not used anywhere in the headerSparkPage method. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25861) Remove unused refreshInterval parameter from the headerSparkPage method.

2018-10-27 Thread shahid (JIRA)
shahid created SPARK-25861:
--

 Summary: Remove unused refreshInterval parameter from the 
headerSparkPage method.
 Key: SPARK-25861
 URL: https://issues.apache.org/jira/browse/SPARK-25861
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.2
Reporter: shahid


https://github.com/apache/spark/blob/d5573c578a1eea9ee04886d9df37c7178e67bb30/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L221
 
refreshInterval is not used anywhere in the headerSparkPage method. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25859) add scala/java/python example and doc for PrefixSpan

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25859:


Assignee: Apache Spark

> add scala/java/python example and doc for PrefixSpan
> 
>
> Key: SPARK-25859
> URL: https://issues.apache.org/jira/browse/SPARK-25859
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> Scala/Java/Python examples and docs for PrefixSpan were added for 3.0 in 
> https://issues.apache.org/jira/browse/SPARK-24207. This JIRA is to add the same 
> examples and docs to 2.4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25859) add scala/java/python example and doc for PrefixSpan

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25859:


Assignee: (was: Apache Spark)

> add scala/java/python example and doc for PrefixSpan
> 
>
> Key: SPARK-25859
> URL: https://issues.apache.org/jira/browse/SPARK-25859
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Scala/Java/Python examples and docs for PrefixSpan were added for 3.0 in 
> https://issues.apache.org/jira/browse/SPARK-24207. This JIRA is to add the same 
> examples and docs to 2.4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25859) add scala/java/python example and doc for PrefixSpan

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1189#comment-1189
 ] 

Apache Spark commented on SPARK-25859:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22863

> add scala/java/python example and doc for PrefixSpan
> 
>
> Key: SPARK-25859
> URL: https://issues.apache.org/jira/browse/SPARK-25859
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> scala/java/python examples and doc for PrefixSpan are added in 3.0 in 
> https://issues.apache.org/jira/browse/SPARK-24207. This jira is to add the 
> examples and doc in 2.4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25859) add scala/java/python example and doc for PrefixSpan

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1188#comment-1188
 ] 

Apache Spark commented on SPARK-25859:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22863

> add scala/java/python example and doc for PrefixSpan
> 
>
> Key: SPARK-25859
> URL: https://issues.apache.org/jira/browse/SPARK-25859
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> scala/java/python examples and doc for PrefixSpan are added in 3.0 in 
> https://issues.apache.org/jira/browse/SPARK-24207. This jira is to add the 
> examples and doc in 2.4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25833) Views without column names created by Hive are not readable by Spark

2018-10-27 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1178#comment-1178
 ] 

Dilip Biswal commented on SPARK-25833:
--

This looks like a duplicate of 
https://issues.apache.org/jira/browse/SPARK-24864. Please see the discussion 
there. Basically, Hive and Spark are two different systems and follow different 
schemes for computing auto-generated column names. We should use explicit 
aliases in the view definition to make it readable from Spark.
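
For example, a minimal workaround sketch (reusing the {{v1}} example from the 
description below; the alias name {{c1}} is just illustrative):
{code:sql}
-- Recreate the view with an explicit alias so Hive does not need to
-- auto-generate a column name such as _c0.
hive> DROP VIEW IF EXISTS v1;
hive> CREATE VIEW v1 AS SELECT t1.c1 FROM (SELECT 1 AS c1) t1;

-- The aliased view is then readable from Spark as well.
spark-sql> SELECT * FROM v1;
{code}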

Thank you.

> Views without column names created by Hive are not readable by Spark
> 
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> A simple example to reproduce this issue.
>  create a view via Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> query that view via Spark
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> _c0 in above view definition is automatically generated by Hive, which is not 
> recognizable by Spark.
>  see [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25661) Refactor AvroWriteBenchmark to use main method

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1147#comment-1147
 ] 

Apache Spark commented on SPARK-25661:
--

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22861

> Refactor AvroWriteBenchmark to use main method
> --
>
> Key: SPARK-25661
> URL: https://issues.apache.org/jira/browse/SPARK-25661
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25663) Refactor BuiltInDataSourceWriteBenchmark and DataSourceWriteBenchmark to use main method

2018-10-27 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1148#comment-1148
 ] 

yucai commented on SPARK-25663:
---

[~Gengliang.Wang] I made an improvement on this; could you help review?

https://github.com/apache/spark/pull/22861

> Refactor BuiltInDataSourceWriteBenchmark and DataSourceWriteBenchmark to use 
> main method
> 
>
> Key: SPARK-25663
> URL: https://issues.apache.org/jira/browse/SPARK-25663
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25663) Refactor BuiltInDataSourceWriteBenchmark and DataSourceWriteBenchmark to use main method

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25663:


Assignee: (was: Apache Spark)

> Refactor BuiltInDataSourceWriteBenchmark and DataSourceWriteBenchmark to use 
> main method
> 
>
> Key: SPARK-25663
> URL: https://issues.apache.org/jira/browse/SPARK-25663
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25661) Refactor AvroWriteBenchmark to use main method

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25661:


Assignee: (was: Apache Spark)

> Refactor AvroWriteBenchmark to use main method
> --
>
> Key: SPARK-25661
> URL: https://issues.apache.org/jira/browse/SPARK-25661
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25663) Refactor BuiltInDataSourceWriteBenchmark and DataSourceWriteBenchmark to use main method

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1144#comment-1144
 ] 

Apache Spark commented on SPARK-25663:
--

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22861

> Refactor BuiltInDataSourceWriteBenchmark and DataSourceWriteBenchmark to use 
> main method
> 
>
> Key: SPARK-25663
> URL: https://issues.apache.org/jira/browse/SPARK-25663
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25661) Refactor AvroWriteBenchmark to use main method

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25661:


Assignee: Apache Spark

> Refactor AvroWriteBenchmark to use main method
> --
>
> Key: SPARK-25661
> URL: https://issues.apache.org/jira/browse/SPARK-25661
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25663) Refactor BuiltInDataSourceWriteBenchmark and DataSourceWriteBenchmark to use main method

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25663:


Assignee: Apache Spark

> Refactor BuiltInDataSourceWriteBenchmark and DataSourceWriteBenchmark to use 
> main method
> 
>
> Key: SPARK-25663
> URL: https://issues.apache.org/jira/browse/SPARK-25663
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25661) Refactor AvroWriteBenchmark to use main method

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1146#comment-1146
 ] 

Apache Spark commented on SPARK-25661:
--

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22861

> Refactor AvroWriteBenchmark to use main method
> --
>
> Key: SPARK-25661
> URL: https://issues.apache.org/jira/browse/SPARK-25661
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23367) Include python document style checking

2018-10-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23367.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22425
[https://github.com/apache/spark/pull/22425]

> Include python document style checking
> --
>
> Key: SPARK-23367
> URL: https://issues.apache.org/jira/browse/SPARK-23367
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rekha Joshi
>Assignee: Rekha Joshi
>Priority: Minor
> Fix For: 3.0.0
>
>
> As per the discussion in [PR#20378|https://github.com/apache/spark/pull/20378], 
> this jira is to include Python doc style checking in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23367) Include python document style checking

2018-10-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23367:
-

Assignee: Rekha Joshi

> Include python document style checking
> --
>
> Key: SPARK-23367
> URL: https://issues.apache.org/jira/browse/SPARK-23367
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rekha Joshi
>Assignee: Rekha Joshi
>Priority: Minor
> Fix For: 3.0.0
>
>
> As per the discussion in [PR#20378|https://github.com/apache/spark/pull/20378], 
> this jira is to include Python doc style checking in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25816) Functions does not resolve Columns correctly

2018-10-27 Thread Peter Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1058#comment-1058
 ] 

Peter Toth commented on SPARK-25816:


Thanks [~bzhang]. It seems both are regressions from 2.2 to 2.3 for the same 
reason. My submitted PR fixes them.

> Functions does not resolve Columns correctly
> 
>
> Key: SPARK-25816
> URL: https://issues.apache.org/jira/browse/SPARK-25816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Brian Zhang
>Priority: Critical
> Attachments: final_allDatatypes_Spark.avro, source.snappy.parquet
>
>
> When there is a duplicate column name between the current DataFrame and the 
> original DataFrame it was selected from, Spark 2.3.0 and 2.3.1 do not resolve 
> the column correctly when it is used in an expression, causing a casting issue. 
> The same code works in Spark 2.2.1.
> Please see the code below to reproduce the issue:
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.catalyst.expressions._
> import org.apache.spark.sql.Column
> val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet")
> val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*)
> val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2"))
> val v5_2 = $"2"
> v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0)))))
> // v00's 3rd column is binary and its 16th is a map
> Error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to 
> data type mismatch: argument 1 requires map type, however, '`2`' is of binary 
> type.;
>  
>  'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < 
> {color:#FF}2#1593{color}[map_keys({color:#FF}2#1561{color})[0]]) +- 
> Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- 
> Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 
> 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 
> 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 
> 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, 
> c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 
> 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- 
> Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542]
>  parquet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24709) Inferring schema from JSON string literal

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1033#comment-1033
 ] 

Apache Spark commented on SPARK-24709:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22858

> Inferring schema from JSON string literal
> -
>
> Key: SPARK-24709
> URL: https://issues.apache.org/jira/browse/SPARK-24709
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> Need to add a new function - *schema_of_json()*. The function should infer the 
> schema of a JSON string literal. The result of the function is a schema in DDL 
> format.
> One of the use cases is passing the output of _schema_of_json()_ to 
> *from_json()*. Currently, the _from_json()_ function requires a schema as a 
> mandatory argument, so a user has to pass a schema as a string literal in SQL. 
> The new function should allow inferring the schema from an example. Let's say 
> json_col is a column containing JSON strings that all share the same schema. It 
> should then be possible to pass one such JSON string to _schema_of_json()_, 
> which infers the schema from that particular example.
> {code:sql}
> select from_json(json_col, schema_of_json('{"f1": 0, "f2": [0], "f2": "a"}'))
> from json_table;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25860) Replace Literal(null, _) with FalseLiteral whenever possible

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1032#comment-1032
 ] 

Apache Spark commented on SPARK-25860:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/22857

> Replace Literal(null, _) with FalseLiteral whenever possible
> 
>
> Key: SPARK-25860
> URL: https://issues.apache.org/jira/browse/SPARK-25860
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> We should have a new optimization rule that replaces {{Literal(null, _)}} 
> with {{FalseLiteral}} in conditions in {{Join}} and {{Filter}}, predicates in 
> {{If}}, conditions in {{CaseWhen}}.
> The underlying idea is that those expressions evaluate to {{false}} if the 
> underlying expression is {{null}} (as an example see 
> {{GeneratePredicate$create}} or {{doGenCode}} and {{eval}} methods in {{If}} 
> and {{CaseWhen}}). Therefore, we can replace {{Literal(null, _)}} with 
> {{FalseLiteral}}, which can lead to more optimizations later on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25860) Replace Literal(null, _) with FalseLiteral whenever possible

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25860:


Assignee: (was: Apache Spark)

> Replace Literal(null, _) with FalseLiteral whenever possible
> 
>
> Key: SPARK-25860
> URL: https://issues.apache.org/jira/browse/SPARK-25860
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> We should have a new optimization rule that replaces {{Literal(null, _)}} 
> with {{FalseLiteral}} in conditions in {{Join}} and {{Filter}}, predicates in 
> {{If}}, conditions in {{CaseWhen}}.
> The underlying idea is that those expressions evaluate to {{false}} if the 
> underlying expression is {{null}} (as an example see 
> {{GeneratePredicate$create}} or {{doGenCode}} and {{eval}} methods in {{If}} 
> and {{CaseWhen}}). Therefore, we can replace {{Literal(null, _)}} with 
> {{FalseLiteral}}, which can lead to more optimizations later on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25860) Replace Literal(null, _) with FalseLiteral whenever possible

2018-10-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25860:


Assignee: Apache Spark

> Replace Literal(null, _) with FalseLiteral whenever possible
> 
>
> Key: SPARK-25860
> URL: https://issues.apache.org/jira/browse/SPARK-25860
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Assignee: Apache Spark
>Priority: Major
>
> We should have a new optimization rule that replaces {{Literal(null, _)}} 
> with {{FalseLiteral}} in conditions in {{Join}} and {{Filter}}, predicates in 
> {{If}}, conditions in {{CaseWhen}}.
> The underlying idea is that those expressions evaluate to {{false}} if the 
> underlying expression is {{null}} (as an example see 
> {{GeneratePredicate$create}} or {{doGenCode}} and {{eval}} methods in {{If}} 
> and {{CaseWhen}}). Therefore, we can replace {{Literal(null, _)}} with 
> {{FalseLiteral}}, which can lead to more optimizations later on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25860) Replace Literal(null, _) with FalseLiteral whenever possible

2018-10-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1031#comment-1031
 ] 

Apache Spark commented on SPARK-25860:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/22857

> Replace Literal(null, _) with FalseLiteral whenever possible
> 
>
> Key: SPARK-25860
> URL: https://issues.apache.org/jira/browse/SPARK-25860
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> We should have a new optimization rule that replaces {{Literal(null, _)}} 
> with {{FalseLiteral}} in conditions in {{Join}} and {{Filter}}, predicates in 
> {{If}}, conditions in {{CaseWhen}}.
> The underlying idea is that those expressions evaluate to {{false}} if the 
> underlying expression is {{null}} (as an example see 
> {{GeneratePredicate$create}} or {{doGenCode}} and {{eval}} methods in {{If}} 
> and {{CaseWhen}}). Therefore, we can replace {{Literal(null, _)}} with 
> {{FalseLiteral}}, which can lead to more optimizations later on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25860) Replace Literal(null, _) with FalseLiteral whenever possible

2018-10-27 Thread Anton Okolnychyi (JIRA)
Anton Okolnychyi created SPARK-25860:


 Summary: Replace Literal(null, _) with FalseLiteral whenever 
possible
 Key: SPARK-25860
 URL: https://issues.apache.org/jira/browse/SPARK-25860
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer, SQL
Affects Versions: 3.0.0
Reporter: Anton Okolnychyi


We should have a new optimization rule that replaces {{Literal(null, _)}} with 
{{FalseLiteral}} in conditions in {{Join}} and {{Filter}}, predicates in 
{{If}}, conditions in {{CaseWhen}}.

The underlying idea is that those expressions evaluate to {{false}} if the 
underlying expression is {{null}} (as an example see 
{{GeneratePredicate$create}} or {{doGenCode}} and {{eval}} methods in {{If}} 
and {{CaseWhen}}). Therefore, we can replace {{Literal(null, _)}} with 
{{FalseLiteral}}, which can lead to more optimizations later on.
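
As a purely illustrative example (the table {{t}} and column {{p}} below are 
hypothetical, not from this ticket), the following condition can only evaluate to 
null or false, so once the null literal is rewritten to {{FalseLiteral}} the whole 
filter collapses to a constant-false predicate and the query can short-circuit to 
an empty result:
{code:sql}
-- IF(p > 0, null, false) is either NULL or FALSE for every row, and a Filter
-- treats NULL the same way as FALSE, so replacing the null branch with false
-- lets the optimizer reduce the whole WHERE clause to FALSE.
SELECT * FROM t WHERE IF(p > 0, null, false);
{code}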



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25259) Left/Right join support push down during-join predicates

2018-10-27 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25259:

Description: 
For example:
{code:sql}
create temporary view EMPLOYEE as select * from values
  ("10", "HAAS", "A00"),
  ("10", "THOMPSON", "B01"),
  ("30", "KWAN", "C01"),
  ("000110", "LUCCHESSI", "A00"),
  ("000120", "O'CONNELL", "A))"),
  ("000130", "QUINTANA", "C01")
  as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);

create temporary view DEPARTMENT as select * from values
  ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
  ("B01", "PLANNING", "20"),
  ("C01", "INFORMATION CENTER", "30"),
  ("D01", "DEVELOPMENT CENTER", null)
  as DEPARTMENT(DEPTNO, DEPTNAME, MGRNO);

create temporary view PROJECT as select * from values
  ("AD3100", "ADMIN SERVICES", "D01"),
  ("IF1000", "QUERY SERVICES", "C01"),
  ("IF2000", "USER EDUCATION", "E01"),
  ("MA2100", "WELD LINE AUDOMATION", "D01"),
  ("PL2100", "WELD LINE PLANNING", "01")
  as PROJECT(PROJNO, PROJNAME, DEPTNO);
{code}
the SQL below:
{code:sql}
SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
ON P.DEPTNO = D.DEPTNO
AND P.DEPTNO='E01';
{code}
can be optimized to:
{code:sql}
SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
ON P.DEPTNO = D.DEPTNO
AND P.DEPTNO='E01';
{code}

  was:
For example:
{code:sql}
create temporary view EMPLOYEE as select * from values
  ("10", "HAAS", "A00"),
  ("10", "THOMPSON", "B01"),
  ("30", "KWAN", "C01"),
  ("000110", "LUCCHESSI", "A00"),
  ("000120", "O'CONNELL", "A))"),
  ("000130", "QUINTANA", "C01")
  as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);

create temporary view DEPARTMENT as select * from values
  ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
  ("B01", "PLANNING", "20"),
  ("C01", "INFORMATION CENTER", "30"),
  ("D01", "DEVELOPMENT CENTER", null)
  as EMPLOYEE(DEPTNO, DEPTNAME, MGRNO);

create temporary view PROJECT as select * from values
  ("AD3100", "ADMIN SERVICES", "D01"),
  ("IF1000", "QUERY SERVICES", "C01"),
  ("IF2000", "USER EDUCATION", "E01"),
  ("MA2100", "WELD LINE AUDOMATION", "D01"),
  ("PL2100", "WELD LINE PLANNING", "01")
  as EMPLOYEE(PROJNO, PROJNAME, DEPTNO);
{code}

the SQL below:
{code:sql}
SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
ON P.DEPTNO = D.DEPTNO
AND P.DEPTNO='E01';
{code}

can be optimized to:
{code:sql}
SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
ON P.DEPTNO = D.DEPTNO
AND P.DEPTNO='E01';
{code}


> Left/Right join support push down during-join predicates
> 
>
> Key: SPARK-25259
> URL: https://issues.apache.org/jira/browse/SPARK-25259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:sql}
> create temporary view EMPLOYEE as select * from values
>   ("10", "HAAS", "A00"),
>   ("10", "THOMPSON", "B01"),
>   ("30", "KWAN", "C01"),
>   ("000110", "LUCCHESSI", "A00"),
>   ("000120", "O'CONNELL", "A))"),
>   ("000130", "QUINTANA", "C01")
>   as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);
> create temporary view DEPARTMENT as select * from values
>   ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
>   ("B01", "PLANNING", "20"),
>   ("C01", "INFORMATION CENTER", "30"),
>   ("D01", "DEVELOPMENT CENTER", null)
>   as DEPARTMENT(DEPTNO, DEPTNAME, MGRNO);
> create temporary view PROJECT as select * from values
>   ("AD3100", "ADMIN SERVICES", "D01"),
>   ("IF1000", "QUERY SERVICES", "C01"),
>   ("IF2000", "USER EDUCATION", "E01"),
>   ("MA2100", "WELD LINE AUDOMATION", "D01"),
>   ("PL2100", "WELD LINE PLANNING", "01")
>   as PROJECT(PROJNO, PROJNAME, DEPTNO);
> {code}
> the SQL below:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}
> can be optimized to:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25859) add scala/java/python example and doc for PrefixSpan

2018-10-27 Thread Huaxin Gao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-25859:
---
Description: scala/java/python examples and doc for PrefixSpan are added in 
3.0 in https://issues.apache.org/jira/browse/SPARK-24207. This jira is to add 
the examples and doc in 2.4.  (was: scala/java/python examples and doc for 
PrefixSpan are added 3.0 in https://issues.apache.org/jira/browse/SPARK-24207. 
This jira is to add the examples and doc in 2.4.)

> add scala/java/python example and doc for PrefixSpan
> 
>
> Key: SPARK-25859
> URL: https://issues.apache.org/jira/browse/SPARK-25859
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> scala/java/python examples and doc for PrefixSpan are added in 3.0 in 
> https://issues.apache.org/jira/browse/SPARK-24207. This jira is to add the 
> examples and doc in 2.4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25859) add scala/java/python example and doc for PrefixSpan

2018-10-27 Thread Huaxin Gao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16665973#comment-16665973
 ] 

Huaxin Gao commented on SPARK-25859:


[~felixcheung] I had some problems submitting a PR for v2.4.0-rc5. Will try 
again tomorrow. 

> add scala/java/python example and doc for PrefixSpan
> 
>
> Key: SPARK-25859
> URL: https://issues.apache.org/jira/browse/SPARK-25859
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> scala/java/python examples and doc for PrefixSpan are added 3.0 in 
> https://issues.apache.org/jira/browse/SPARK-24207. This jira is to add the 
> examples and doc in 2.4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25859) add scala/java/python example and doc for PrefixSpan

2018-10-27 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-25859:
--

 Summary: add scala/java/python example and doc for PrefixSpan
 Key: SPARK-25859
 URL: https://issues.apache.org/jira/browse/SPARK-25859
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.0
Reporter: Huaxin Gao


scala/java/python examples and doc for PrefixSpan are added 3.0 in 
https://issues.apache.org/jira/browse/SPARK-24207. This jira is to add the 
examples and doc in 2.4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org