[jira] [Created] (SPARK-30400) Test failure in SQL module on ppc64le

2019-12-31 Thread AK97 (Jira)
AK97 created SPARK-30400:


 Summary: Test failure in SQL module on ppc64le
 Key: SPARK-30400
 URL: https://issues.apache.org/jira/browse/SPARK-30400
 Project: Spark
  Issue Type: Test
  Components: Build
Affects Versions: 2.4.0
 Environment: 
os: rhel 7.6
arch: ppc64le
Reporter: AK97


I have been trying to build Apache Spark on RHEL 7.6/ppc64le; however, the test cases in the SQL module are failing with the following error:

- CREATE TABLE USING AS SELECT based on the file without write permission *** FAILED ***
  Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown (CreateTableAsSelectSuite.scala:92)

- create a table, drop it and create another one with the same name *** FAILED ***
  org.apache.spark.sql.AnalysisException: Table default.jsonTable already exists. You need to drop it first.;
  at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:159)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)


I would like some help understanding the cause of these failures. I am running the build on a high-end VM with good connectivity.






[jira] [Resolved] (SPARK-23626) DAGScheduler blocked due to JobSubmitted event

2019-12-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-23626.
--
Resolution: Won't Fix

>  DAGScheduler blocked due to JobSubmitted event
> ---
>
> Key: SPARK-23626
> URL: https://issues.apache.org/jira/browse/SPARK-23626
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.1, 2.3.3, 2.4.3, 3.0.0
>Reporter: Ajith S
>Priority: Major
>
> DAGScheduler becomes a bottleneck in the cluster when multiple JobSubmitted events have to be processed, because DAGSchedulerEventProcessLoop is single-threaded and blocks other events in the queue, such as TaskCompletion.
> Handling a JobSubmitted event is time-consuming depending on the nature of the job (for example, calculating parent stage dependencies, shuffle dependencies, and partitions), and thus it blocks all subsequent events from being processed.
>  
> I see multiple JIRAs referring to this behavior:
> https://issues.apache.org/jira/browse/SPARK-2647
> https://issues.apache.org/jira/browse/SPARK-4961
>  
> Similarly, in my cluster the partition calculation for some jobs is time-consuming (similar to the stack trace in SPARK-2647), which slows down the DAGSchedulerEventProcessLoop. This in turn slows down user jobs, even when their tasks finish within seconds, because TaskCompletion events are processed at a slower rate due to the blockage.
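For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the head-of-line blocking described above. It is illustrative only and uses none of Spark's actual scheduler classes; the event and object names are made up.

{code:java}
import java.util.concurrent.LinkedBlockingQueue

sealed trait Event
case class JobSubmitted(jobId: Int) extends Event
case class TaskCompletion(taskId: Int) extends Event

object SingleThreadedLoopDemo extends App {
  val queue = new LinkedBlockingQueue[Event]()
  queue.put(JobSubmitted(1))                          // expensive event arrives first
  (1 to 3).foreach(i => queue.put(TaskCompletion(i)))

  // One thread drains the queue, just like a single-threaded event loop.
  val loop = new Thread(new Runnable {
    def run(): Unit = while (!queue.isEmpty) {
      queue.take() match {
        case JobSubmitted(id) =>
          Thread.sleep(2000)                          // stands in for slow stage/partition computation
          println(s"handled JobSubmitted($id)")
        case TaskCompletion(id) =>
          println(s"handled TaskCompletion($id)")     // only runs after the slow handler returns
      }
    }
  })
  loop.start()
  loop.join()
}
{code}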






[jira] [Updated] (SPARK-30399) Bucketing is not compatible with partitioning in practice

2019-12-31 Thread Shay Elbaz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shay Elbaz updated SPARK-30399:
---
Description: 
When using a Spark bucketed table, Spark uses as many partitions as the number of buckets for the map-side join (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" tables, but is quite disastrous for _time-partitioned_ tables. In our use case, 100GB of data is added to a daily-partitioned key-value table every day, so after 100 days there are 10TB of data we want to join with. For this scenario we need thousands of buckets if we want every task to successfully *read and sort* all of its data in a map-side join. But in that case, every daily increment would emit thousands of small files, leading to other big issues.

In practice, and with a hope for some hidden optimization, we set the number of buckets to 1000 and backfilled such a table with 10TB. When trying to join with the smallest input, every executor was killed by Yarn due to over-allocating memory in the sorting phase. Even without such failures, it would take every executor an unreasonable amount of time to locally sort all its data.

A question on SO remained unanswered for a while, so I thought I would ask here: is it by design that buckets cannot be used with time-partitioned tables, or am I doing something wrong?

  was:
When using Spark Bucketed table, Spark would use as many partitions as the 
number of buckets for the map-side join 
(_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" 
tables, but quite disastrous for _time-partitioned_ tables. In our use case, a 
daily partitioned key-value table is added 100GB of data every day. So in 100 
days there are 10TB of data we want to join with - aiming to this scenario, we 
need thousands of buckets if we want every task to successfully *read and sort* 
all of it's data in a map-side join. But in such case, every daily increment 
would emit thousands of small files, leading to other big issues. 

In practice, and with a hope for some hidden optimization, we set the number of 
buckets to 1000 and backfilled such a table with 10TB. When trying to join with 
the smallest input, every executor was killed by Yarn due to over allocating 
memory in the sorting phase. Even without such failures, it would take every 
executor unreasonably amount of time to locally sort all its data.

A question on SO remained unanswered for a while, so I thought asking here - is 
it by design that buckets cannot be used in time-partitioned table, or am I 
doing something wrong?


> Bucketing is not compatible with partitioning in practice
> ---
>
> Key: SPARK-30399
> URL: https://issues.apache.org/jira/browse/SPARK-30399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: HDP 2.7
>Reporter: Shay Elbaz
>Priority: Minor
>
> When using a Spark bucketed table, Spark uses as many partitions as the number of buckets for the map-side join (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" tables, but is quite disastrous for _time-partitioned_ tables. In our use case, 100GB of data is added to a daily-partitioned key-value table every day, so after 100 days there are 10TB of data we want to join with. For this scenario we need thousands of buckets if we want every task to successfully *read and sort* all of its data in a map-side join. But in that case, every daily increment would emit thousands of small files, leading to other big issues.
> In practice, and with a hope for some hidden optimization, we set the number of buckets to 1000 and backfilled such a table with 10TB. When trying to join with the smallest input, every executor was killed by Yarn due to over-allocating memory in the sorting phase. Even without such failures, it would take every executor an unreasonable amount of time to locally sort all its data.
> A question on SO remained unanswered for a while, so I thought I would ask here: is it by design that buckets cannot be used with time-partitioned tables, or am I doing something wrong?
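To make the setup concrete, here is a small sketch of writing such a daily-partitioned, bucketed table. The table and column names are hypothetical, not taken from the reporter's job; the point is that each daily write produces files per bucket, while a bucketed join still reads only as many partitions as there are buckets.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketing-sketch")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Toy stand-in for one day's worth of key-value data.
val daily = Seq((1L, "a", "2019-12-31"), (2L, "b", "2019-12-31"))
  .toDF("key", "value", "dt")

daily.write
  .mode("append")
  .partitionBy("dt")        // daily time partition
  .bucketBy(1000, "key")    // join key; 1000 buckets as in the description
  .sortBy("key")
  .saveAsTable("kv_history")
{code}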






[jira] [Created] (SPARK-30399) Bucketing is not compatible with partitioning in practice

2019-12-31 Thread Shay Elbaz (Jira)
Shay Elbaz created SPARK-30399:
--

 Summary: Bucketing is not compatible with partitioning in practice
 Key: SPARK-30399
 URL: https://issues.apache.org/jira/browse/SPARK-30399
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
 Environment: HDP 2.7
Reporter: Shay Elbaz


When using a Spark bucketed table, Spark uses as many partitions as the number of buckets for the map-side join (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" tables, but is quite disastrous for _time-partitioned_ tables. In our use case, 100GB of data is added to a daily-partitioned key-value table every day, so after 100 days there are 10TB of data we want to join with. For this scenario we need thousands of buckets if we want every task to successfully *read and sort* all of its data in a map-side join. But in that case, every daily increment would emit thousands of small files, leading to other big issues.

In practice, and with a hope for some hidden optimization, we set the number of buckets to 1000 and backfilled such a table with 10TB. When trying to join with the smallest input, every executor was killed by Yarn due to over-allocating memory in the sorting phase. Even without such failures, it would take every executor an unreasonable amount of time to locally sort all its data.

A question on SO remained unanswered for a while, so I thought I would ask here: is it by design that buckets cannot be used with time-partitioned tables, or am I doing something wrong?






[jira] [Commented] (SPARK-24202) Separate SQLContext dependency from SparkSession.implicits

2019-12-31 Thread Sam hendley (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006214#comment-17006214
 ] 

Sam hendley commented on SPARK-24202:
-

I agree that this would be a very valuable change. Was there a reason this was closed without comment?

> Separate SQLContext dependency from SparkSession.implicits
> --
>
> Key: SPARK-24202
> URL: https://issues.apache.org/jira/browse/SPARK-24202
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Gerard Maas
>Priority: Major
>  Labels: bulk-closed
>
> The current implementation of the implicits in SparkSession passes the current active SQLContext to the SQLImplicits class. This implies that all usage of these (extremely helpful) implicits requires the prior creation of a SparkSession instance.
> Usage is typically done as follows:
>  
> {code:java}
> val sparkSession = SparkSession.builder()
>   .getOrCreate()
> import sparkSession.implicits._
> {code}
>  
> This is OK in user code, but it burdens the creation of library code that uses Spark, where static imports for _Encoder_ support are required.
> A simple example would be:
>  
> {code:java}
> class SparkTransformation[In: Encoder, Out: Encoder] {
>     def transform(ds: Dataset[In]): Dataset[Out]
> }
> {code}
>  
> Attempting to compile such code would result in the following exception:
> {code:java}
> Unable to find encoder for type stored in a Dataset.  Primitive types (Int, 
> String, etc) and Product types (case classes) are supported by importing 
> spark.implicits._  Support for serializing other types will be added in 
> future releases.{code}
> The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two 
> utilities to transform _RDD_ and local collections into a _Dataset_.
> These are 2 methods of the 46 implicit conversions offered by this class.
> The request is to separate the two implicit methods that depend on the 
> SQLContext instance creation into a separate class:
> {code:java}
> // SQLImplicits, lines 214-229
> /**
>  * Creates a [[Dataset]] from an RDD.
>  *
>  * @since 1.6.0
>  */
> implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
>   DatasetHolder(_sqlContext.createDataset(rdd))
> }
>
> /**
>  * Creates a [[Dataset]] from a local Seq.
>  * @since 1.6.0
>  */
> implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = {
>   DatasetHolder(_sqlContext.createDataset(s))
> }
> {code}
> By separating the static methods from these two methods that depend on _sqlContext_ into separate classes, we could provide static imports for all the other functionality and only require the instance-bound implicits for the RDD and collection support (which is an uncommon use case these days).
> As this potentially breaks the current interface, it might be a candidate for Spark 3.0, although there's nothing stopping us from creating a separate hierarchy for the static encoders already.
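As a rough illustration of the requested split, a library could already provide SQLContext-free encoder imports on top of the public Encoders factory. This is only a sketch under that assumption (the object name is hypothetical), not Spark's SQLImplicits:

{code:java}
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.{Encoder, Encoders}

// Encoder implicits that need no SQLContext/SparkSession instance,
// so library code can simply `import StaticEncoders._`.
object StaticEncoders {
  implicit def productEncoder[T <: Product : TypeTag]: Encoder[T] = Encoders.product[T]
  implicit val intEncoder: Encoder[Int]       = Encoders.scalaInt
  implicit val longEncoder: Encoder[Long]     = Encoders.scalaLong
  implicit val stringEncoder: Encoder[String] = Encoders.STRING
}
{code}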






[jira] [Commented] (SPARK-26734) StackOverflowError on WAL serialization caused by large receivedBlockQueue

2019-12-31 Thread Stephen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006211#comment-17006211
 ] 

Stephen commented on SPARK-26734:
-

Is this bug fixed? I still see the same error in Spark 2.4.3.

> StackOverflowError on WAL serialization caused by large receivedBlockQueue
> --
>
> Key: SPARK-26734
> URL: https://issues.apache.org/jira/browse/SPARK-26734
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, DStreams
>Affects Versions: 2.3.1, 2.3.2, 2.4.0
> Environment: spark 2.4.0 streaming job
> java 1.8
> scala 2.11.12
>Reporter: Ross M. Lodge
>Assignee: Ross M. Lodge
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> We encountered an intermittent StackOverflowError with a stack trace similar 
> to:
>  
> {noformat}
> Exception in thread "JobGenerator" java.lang.StackOverflowError
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509){noformat}
> The name of the thread has been seen to be either "JobGenerator" or 
> "streaming-start", depending on when in the lifecycle of the job the problem 
> occurs.  It appears to only occur in streaming jobs with checkpointing and 
> WAL enabled; this has prevented us from upgrading to v2.4.0.
>  
> Via debugging, we tracked this down to allocateBlocksToBatch in 
> ReceivedBlockTracker:
> {code:java}
> /**
>  * Allocate all unallocated blocks to the given batch.
>  * This event will get written to the write ahead log (if enabled).
>  */
> def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
>   if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
>     val streamIdToBlocks = streamIds.map { streamId =>
>       (streamId, getReceivedBlockQueue(streamId).clone())
>     }.toMap
>     val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
>     if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
>       streamIds.foreach(getReceivedBlockQueue(_).clear())
>       timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
>       lastAllocatedBatchTime = batchTime
>     } else {
>       logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
>     }
>   } else {
>     // This situation occurs when:
>     // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
>     // possibly processed batch job or half-processed batch job need to be processed again,
>     // so the batchTime will be equal to lastAllocatedBatchTime.
>     // 2. Slow checkpointing makes recovered batch time older than WAL recovered
>     // lastAllocatedBatchTime.
>     // This situation will only occurs in recovery time.
>     logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
>   }
> }
> {code}
> Prior to 2.3.1, this code did
> {code:java}
> getReceivedBlockQueue(streamId).dequeueAll(x => true){code}
> but it was changed as part of SPARK-23991 to
> {code:java}
> getReceivedBlockQueue(streamId).clone(){code}
> We have not been able to reproduce this in a test of the actual method above, but we have been able to write a test that reproduces it by putting a lot of values into the queue:
>  
> {code:java}
> class SerializationFailureTest extends FunSpec {
>   private val logger = LoggerFactory.getLogger(getClass)
>   private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]
>   describe("Queue") {
> it("should be serializable") {
>   runTest(1062)
> }
> it("should not be serializable") {
>   runTest(1063)
> }
> it("should DEFINITELY not be serializable") {
>   runTest(199952)
> }
>   }
>   private def runTest(mx: Int): Array[Byte] = {
> try {
>   val random = new scala.util.Random()
>   val queue = new ReceivedBlockQueue()
>   for (_ <- 0 until mx) {
> queue += ReceivedBlockInfo(
>   streamId = 0,
>   numRecords = Some(random.nextInt(5)),
>   metadataOption = None,
>   blockStoreResult = WriteAheadLogBasedStoreResult(
> blockId = StreamBlockId(0, random.nextInt()),
> numRecords = Some(random.nextInt(5)),
> walRecordHandle = FileBasedWriteAheadLogSegment(
>   path = 
> s"""hdfs://foo.bar.com:8080/spark/streaming/BAZ/7/receivedData/0/log-${random.nextInt()}-${random.nextInt()}""",
>   offset = random.nextLong(),
>   length 

[jira] [Updated] (SPARK-30339) Avoid failing twice in function lookup

2019-12-31 Thread Zhenhua Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-30339:
-
Fix Version/s: 2.4.5

> Avoid failing twice in function lookup
> --
>
> Key: SPARK-30339
> URL: https://issues.apache.org/jira/browse/SPARK-30339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> Currently, if a function lookup fails, Spark gives it a second chance by casting decimal types to double types. But in cases where no decimal type is involved, it is meaningless to look up again, and it causes extra cost such as unnecessary metastore access. We should throw the exception directly in these cases.
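A generic sketch of the fail-fast behavior being proposed is shown below. This is not Spark's actual analyzer code; it only illustrates skipping the second lookup when no decimal argument is present, so the retry is attempted only when it could change the outcome.

{code:java}
import org.apache.spark.sql.types.{DataType, DecimalType, DoubleType}

// Retry the lookup with decimals widened to double only when that retry
// can actually change the outcome; otherwise fail immediately.
def lookup[T](args: Seq[DataType])(resolve: Seq[DataType] => Option[T]): Option[T] =
  resolve(args).orElse {
    if (args.exists(_.isInstanceOf[DecimalType]))
      resolve(args.map { case _: DecimalType => DoubleType; case other => other })
    else
      None // no decimal argument: a second, identical lookup cannot succeed
  }
{code}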






[jira] [Resolved] (SPARK-30363) Add Documentation for Refresh Resources

2019-12-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30363.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27023
[https://github.com/apache/spark/pull/27023]

> Add Documentation for Refresh Resources
> ---
>
> Key: SPARK-30363
> URL: https://issues.apache.org/jira/browse/SPARK-30363
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Assignee: Rakesh Raushan
>Priority: Minor
> Fix For: 3.0.0
>
>
> Refresh Resources is currently not documented.






[jira] [Assigned] (SPARK-30363) Add Documentation for Refresh Resources

2019-12-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30363:


Assignee: Rakesh Raushan

> Add Documentation for Refresh Resources
> ---
>
> Key: SPARK-30363
> URL: https://issues.apache.org/jira/browse/SPARK-30363
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Assignee: Rakesh Raushan
>Priority: Minor
>
> Refresh Resources is currently not documented.






[jira] [Resolved] (SPARK-30389) Add jar should not allow adding file formats other than jar

2019-12-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30389.
--
Resolution: Not A Problem

> Add jar should not allow adding file formats other than jar
> 
>
> Key: SPARK-30389
> URL: https://issues.apache.org/jira/browse/SPARK-30389
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> spark-sql> add jar /opt/abhi/udf/test1jar/12.txt;
> ADD JAR /opt/abhi/udf/test1jar/12.txt
> Added [/opt/abhi/udf/test1jar/12.txt] to class path
> spark-sql> list jar;
> spark://vm1:45169/jars/12.txt






[jira] [Resolved] (SPARK-30392) Documentation for the date_trunc function is incorrect

2019-12-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30392.
--
Resolution: Not A Problem

Yes, this is being reported against 2.3.0, apparently.

> Documentation for the date_trunc function is incorrect
> --
>
> Key: SPARK-30392
> URL: https://issues.apache.org/jira/browse/SPARK-30392
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ashley
>Priority: Minor
> Attachments: Date_trunc.PNG
>
>
> The documentation for the date_trunc function includes a few sample SELECT 
> statements to show how the function works: 
> [https://spark.apache.org/docs/2.3.0/api/sql/#date_trunc]
> In the sample SELECT statements, the inputs to the function are swapped:
> *The docs show:* SELECT date_trunc('2015-03-05T09:32:05.359', 'YEAR'); 
>  *The docs _should_ show:* SELECT date_trunc('YEAR', 
> '2015-03-05T09:32:05.359');
>  
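For anyone who wants to verify the corrected argument order, a one-line check from a Spark shell (where `spark` is predefined) looks like this; the expected result follows the truncation semantics described in the docs.

{code:java}
// date_trunc takes the truncation unit first and the timestamp second.
spark.sql("SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359')").show()
// expected value: 2015-01-01 00:00:00
{code}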






[jira] [Resolved] (SPARK-30336) Move Kafka consumer related classes to its own package

2019-12-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30336.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26991
[https://github.com/apache/spark/pull/26991]

> Move Kafka consumer related classes to its own package
> --
>
> Key: SPARK-30336
> URL: https://issues.apache.org/jira/browse/SPARK-30336
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 3.0.0
>
>
> There are too many classes placed in the package "org.apache.spark.sql.kafka010"; these classes should be grouped by purpose.
> As part of the change in SPARK-21869, we moved producer-related classes out to "org.apache.spark.sql.kafka010.producer" and exposed only the necessary classes/methods outside the package. We can apply the same approach to consumer-related classes as well.






[jira] [Assigned] (SPARK-30336) Move Kafka consumer related classes to its own package

2019-12-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30336:


Assignee: Jungtaek Lim

> Move Kafka consumer related classes to its own package
> --
>
> Key: SPARK-30336
> URL: https://issues.apache.org/jira/browse/SPARK-30336
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
>
> There are too many classes placed in the package "org.apache.spark.sql.kafka010"; these classes should be grouped by purpose.
> As part of the change in SPARK-21869, we moved producer-related classes out to "org.apache.spark.sql.kafka010.producer" and exposed only the necessary classes/methods outside the package. We can apply the same approach to consumer-related classes as well.






[jira] [Updated] (SPARK-30339) Avoid failing twice in function lookup

2019-12-31 Thread Zhenhua Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-30339:
-
Description: Currently if function lookup fails, spark will give it a 
second chance by casting decimal type to double type. But for cases where 
decimal type doesn't exist, it's meaningless to lookup again and cause extra 
cost like unnecessary metastore access. We should throw exceptions directly in 
these cases.  (was: Currently if function lookup fails, spark will give it a 
second change by casting decimal type to double type. But for cases where 
decimal type doesn't exist, it's meaningless to lookup again and causes extra 
cost like unnecessary metastore access. We should throw exceptions directly in 
these cases.)

> Avoid failing twice in function lookup
> --
>
> Key: SPARK-30339
> URL: https://issues.apache.org/jira/browse/SPARK-30339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, if a function lookup fails, Spark gives it a second chance by casting decimal types to double types. But in cases where no decimal type is involved, it is meaningless to look up again, and it causes extra cost such as unnecessary metastore access. We should throw the exception directly in these cases.






[jira] [Commented] (SPARK-27692) Optimize evaluation of udf that is deterministic and has literal inputs

2019-12-31 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006127#comment-17006127
 ] 

Maciej Szymkiewicz commented on SPARK-27692:


Could you explain what the value of this proposal is over just passing a literal?

> Optimize evaluation of udf that is deterministic and has literal inputs
> ---
>
> Key: SPARK-27692
> URL: https://issues.apache.org/jira/browse/SPARK-27692
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sunitha Kambhampati
>Priority: Major
>
> A deterministic UDF is a UDF for which the following is true: given a specific input, the output of the UDF will be the same no matter how many times you execute it.
> When the inputs to the UDF are all literals and the UDF is deterministic, we can optimize this by evaluating the UDF once and using the output, instead of evaluating the UDF for every row in the query.
> This is valid only if the UDF is deterministic and the inputs are literals. Otherwise we should not and cannot apply this optimization.
> *Testing:*
> We have used this internally and have seen significant performance improvements for some very expensive UDFs (as expected).
> In the PR, I have added unit tests.
> *Credits:*
> Thanks to Guy Khazma ([https://github.com/guykhazma]) from the IBM Haifa Research Team for the idea and the original implementation.
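To make the intent concrete, here is a hand-rolled sketch of the optimization under the assumption that the UDF is deterministic and its only input is a literal. The PR's actual rule rewrites the plan automatically; this just shows the before/after by hand, with made-up names.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

val spark = SparkSession.builder().appName("literal-udf-sketch").getOrCreate()

// Stand-in for an expensive deterministic UDF.
val expensive = (s: String) => { Thread.sleep(100); s.toUpperCase }
val expensiveUdf = udf(expensive)

val df = spark.range(1000000L).toDF("id")

// Without the optimization: the UDF runs for every row, even though its input is a literal.
val slow = df.withColumn("region", expensiveUdf(lit("us-east-1")))

// With the optimization: evaluate once on the driver and substitute the result as a literal.
val fast = df.withColumn("region", lit(expensive("us-east-1")))
{code}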






[jira] [Updated] (SPARK-30393) Too many ProvisionedThroughputExceededException errors while recovering from checkpoint

2019-12-31 Thread Stephen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen updated SPARK-30393:

Environment: I am using EMR 5.25.0, Spark 2.4.3, 
spark-streaming-kinesis-asl 2.4.3 I have 6 r5.4xLarge in my cluster, plenty of 
memory. 6 kinesis shards, I even increased to 12 shards but still see the 
kinesis error  (was: I am using EMR 5.23.0, Spark 2.4.3, 
spark-streaming-kinesis-asl 2.4.3 I have 6 r5.4xLarge in my cluster, plenty of 
memory. 6 kinesis shards, I even increased to 12 shards but still see the 
kinesis error)

> Too many ProvisionedThroughputExceededException errors while recovering from checkpoint
> -
>
> Key: SPARK-30393
> URL: https://issues.apache.org/jira/browse/SPARK-30393
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.3
> Environment: I am using EMR 5.25.0, Spark 2.4.3, and spark-streaming-kinesis-asl 2.4.3. I have 6 r5.4xlarge nodes in my cluster, with plenty of memory, and 6 Kinesis shards; I even increased to 12 shards but still see the Kinesis error.
>Reporter: Stephen
>Priority: Major
> Attachments: kinesisexceedreadlimit.png, 
> kinesisusagewhilecheckpointrecoveryerror.png, 
> sparkuiwhilecheckpointrecoveryerror.png
>
>
> I have a Spark application that consumes from Kinesis with 6 shards. Data is produced to Kinesis at a rate of at most 2000 records/second; at non-peak times data comes in at only 200 records/second. Each record is 0.5 KB, so 6 shards are enough to handle that.
> I use reduceByKeyAndWindow and mapWithState in the program, and the sliding window is one hour long.
> Recently I have been trying to checkpoint the application to S3. I am testing this at non-peak time, so the incoming data rate is very low, around 200 records/sec. When I run the Spark application by creating a new context, the checkpoint is created in S3, but when I kill the app and restart it, it fails to recover from the checkpoint. The error message is shown below, my Spark UI shows all the batches are stuck, and the checkpoint recovery takes a long time to complete, from 15 minutes to over an hour.
> I found lots of error messages in the log related to Kinesis exceeding the read limit:
> {quote}19/12/24 00:15:21 WARN TaskSetManager: Lost task 571.0 in stage 33.0 
> (TID 4452, ip-172-17-32-11.ec2.internal, executor 9): 
> org.apache.spark.SparkException: Gave up after 3 retries while getting shard 
> iterator from sequence number 
> 49601654074184110438492229476281538439036626028298502210, last exception:
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288)
> bq. at scala.Option.getOrElse(Option.scala:121)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getKinesisIterator(KinesisBackedBlockRDD.scala:246)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:206)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133)
> bq. at 
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> bq. at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> bq. at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
> bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> bq. at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
> bq. at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> bq. at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> bq. at org.apache.spark.scheduler.Task.run(Task.scala:121)
> bq. at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> bq. at 
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> bq. at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> bq. at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> bq. at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> bq. at java.lang.Thread.run(Thread.java:748)
> bq.

[jira] [Created] (SPARK-30398) PCA/RegressionMetrics/RowMatrix avoid unnecessary computation

2019-12-31 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30398:


 Summary: PCA/RegressionMetrics/RowMatrix avoid unnecessary 
computation
 Key: SPARK-30398
 URL: https://issues.apache.org/jira/browse/SPARK-30398
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 3.0.0
Reporter: zhengruifeng


Use Summarizer instead of MultivariateOnlineSummarizer in PCA, RegressionMetrics, and RowMatrix to avoid computing unused metrics.
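For context, a small sketch of the pattern the ticket refers to, with a made-up DataFrame: Summarizer computes only the metrics you ask for, whereas MultivariateOnlineSummarizer accumulates every statistic whether it is used or not.

{code:java}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("summarizer-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (Vectors.dense(1.0, 2.0), 1.0),
  (Vectors.dense(3.0, 4.0), 1.0)
).toDF("features", "weight")

// Only mean and variance are computed for the vector column.
val stats = df.select(
  Summarizer.metrics("mean", "variance").summary($"features", $"weight").as("stats"))
stats.show(false)
{code}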






[jira] [Created] (SPARK-30397) [pyspark] Writer applied to custom model changes the type of dict keys from int to str

2019-12-31 Thread Jean-Marc Montanier (Jira)
Jean-Marc Montanier created SPARK-30397:
---

 Summary: [pyspark] Writer applied to custom model changes the type of dict keys from int to str
 Key: SPARK-30397
 URL: https://issues.apache.org/jira/browse/SPARK-30397
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.4
Reporter: Jean-Marc Montanier


Hello,

I have a custom model that I'm trying to persist. Within this custom model there is a Python dict mapping from int to int. When the model is saved (with write().save('path')), the keys of the dict are changed from int to str.

You can find below code that reproduces the issue:
{code:python}
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@author: Jean-Marc Montanier
@date: 2019/12/31
"""

from pyspark.sql import SparkSession

from pyspark import keyword_only
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml import Estimator, Model
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.param import Param, Params
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf


spark = SparkSession \
    .builder \
    .appName("ImputeNormal") \
    .getOrCreate()


class CustomFit(Estimator,
                HasInputCol,
                HasOutputCol,
                DefaultParamsReadable,
                DefaultParamsWritable,
                ):
    @keyword_only
    def __init__(self, inputCol="inputCol", outputCol="outputCol"):
        super(CustomFit, self).__init__()

        self._setDefault(inputCol="inputCol", outputCol="outputCol")
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol="inputCol", outputCol="outputCol"):
        """
        setParams(self, inputCol="inputCol", outputCol="outputCol")
        """
        kwargs = self._input_kwargs
        self._set(**kwargs)
        return self

    def _fit(self, data):
        inputCol = self.getInputCol()
        outputCol = self.getOutputCol()

        categories = data.where(data[inputCol].isNotNull()) \
            .groupby(inputCol) \
            .count() \
            .orderBy("count", ascending=False) \
            .limit(2)
        categories = dict(categories.toPandas().set_index(inputCol)["count"])
        for cat in categories:
            categories[cat] = int(categories[cat])

        return CustomModel(categories=categories,
                           input_col=inputCol,
                           output_col=outputCol)


class CustomModel(Model,
                  DefaultParamsReadable,
                  DefaultParamsWritable):

    input_col = Param(Params._dummy(), "input_col", "Name of the input column")
    output_col = Param(Params._dummy(), "output_col", "Name of the output column")
    categories = Param(Params._dummy(), "categories", "Top categories")

    def __init__(self, categories: dict = None, input_col="input_col", output_col="output_col"):
        super(CustomModel, self).__init__()

        self._set(categories=categories, input_col=input_col, output_col=output_col)

    def get_output_col(self) -> str:
        """
        output_col getter
        :return:
        """
        return self.getOrDefault(self.output_col)

    def get_input_col(self) -> str:
        """
        input_col getter
        :return:
        """
        return self.getOrDefault(self.input_col)

    def get_categories(self):
        """
        categories getter
        :return:
        """
        return self.getOrDefault(self.categories)

    def _transform(self, data):
        input_col = self.get_input_col()
        output_col = self.get_output_col()
        categories = self.get_categories()

        def get_cat(val):
            if val is None:
                return -1
            if val not in categories:
                return -1
            return int(categories[val])

        get_cat_udf = udf(get_cat, IntegerType())

        df = data.withColumn(output_col,
                             get_cat_udf(input_col))

        return df


def test_without_write():
    fit_df = spark.createDataFrame([[10]] * 5 + [[11]] * 4 + [[12]] * 3 + [[None]] * 2, ['input'])
    custom_fit = CustomFit(inputCol='input', outputCol='output')
    pipeline = Pipeline(stages=[custom_fit])
    pipeline_model = pipeline.fit(fit_df)

    print("Categories: {}".format(pipeline_model.stages[0].get_categories()))

    transform_df = spark.createDataFrame([[10]] * 2 + [[11]] * 2 + [[12]] * 2 + [[None]] * 2, ['input'])
    test = pipeline_model.transform(transform_df)
    test.show()  # This output is the expected output


def test_with_write():
    fit_df = spark.createDataFrame([[10]] * 5 + [[11]] * 4 + [[12]] * 3 + [[None]] * 2, ['input'])
    custom_fit = CustomFit(inputCol='input', outputCol='output')
    pi

[jira] [Updated] (SPARK-30396) When there are multiple DISTINCT aggregate expressions acting on different fields, any DISTINCT aggregate expression allows the use of the FILTER clause

2019-12-31 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30396:
---
Description: 
This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986

This ticket will support:

When there are multiple DISTINCT aggregate expressions acting on different 
fields, any DISTINCT aggregate expression allows the use of the FILTER clause

 such as:
{code:java}
select class_id, count(distinct sex), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select class_id, count(distinct sex) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id;
{code}

  was:
This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986

This ticket will support:

When there are multiple DISTINCT aggregate expressions acting on different 
fields, any DISTINCT aggregate expression allows the use of the FILTER clause

 


> When there are multiple DISTINCT aggregate expressions acting on different 
> fields, any DISTINCT aggregate expression allows the use of the FILTER clause
> 
>
> Key: SPARK-30396
> URL: https://issues.apache.org/jira/browse/SPARK-30396
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986
> This ticket will support:
> When there are multiple DISTINCT aggregate expressions acting on different 
> fields, any DISTINCT aggregate expression allows the use of the FILTER clause
>  such as:
> {code:java}
> select class_id, count(distinct sex), sum(distinct id) filter (where sex = 'man') from student group by class_id;
> select class_id, count(distinct sex) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id;
> {code}






[jira] [Updated] (SPARK-30395) When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

2019-12-31 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30395:
---
Description: 
This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986

This ticket will support:

When one or more DISTINCT aggregate expressions operate on the same field, the 
DISTINCT aggregate expression allows the use of the FILTER clause

such as:

 
{code:java}
select sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id) filter (where sex = 'man') from student group 
by class_id;
select count(id) filter (where class_id = 1), sum(distinct id) filter (where 
sex = 'man') from student;
select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter 
(where sex = 'man') from student group by class_id;
select sum(distinct id), sum(distinct id) filter (where sex = 'man') from 
student;
select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') 
from student group by class_id;
select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct 
id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
{code}
 

 but not support:

 
{code:java}
select class_id, count(distinct sex), sum(distinct id) filter (where sex = 
'man') from student group by class_id;
select class_id, count(distinct sex) filter (where class_id = 1), sum(distinct 
id) filter (where sex = 'man') from student group by class_id;{code}
The unsupported cases above are left to https://issues.apache.org/jira/browse/SPARK-30396.

  was:
This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986

This ticket will support:

When one or more DISTINCT aggregate expressions operate on the same field, the 
DISTINCT aggregate expression allows the use of the FILTER clause

such as:

 
{code:java}
select sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id) filter (where sex = 'man') from student group 
by class_id;
select count(id) filter (where class_id = 1), sum(distinct id) filter (where 
sex = 'man') from student;
select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter 
(where sex = 'man') from student group by class_id;
select sum(distinct id), sum(distinct id) filter (where sex = 'man') from 
student;
select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') 
from student group by class_id;
select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct 
id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
{code}
 

 but not support:

 


> When one or more DISTINCT aggregate expressions operate on the same field, 
> the DISTINCT aggregate expression allows the use of the FILTER clause
> 
>
> Key: SPARK-30395
> URL: https://issues.apache.org/jira/browse/SPARK-30395
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986
> This ticket will support:
> When one or more DISTINCT aggregate expressions operate on the same field, 
> the DISTINCT aggregate expression allows the use of the FILTER clause
> such as:
>  
> {code:java}
> select sum(distinct id) filter (where sex = 'man') from student;
> select class_id, sum(distinct id) filter (where sex = 'man') from student 
> group by class_id;
> select count(id) filter (where class_id = 1), sum(distinct id) filter (where 
> sex = 'man') from student;
> select class_id, count(id) filter (where class_id = 1), sum(distinct id) 
> filter (where sex = 'man') from student group by class_id;
> select sum(distinct id), sum(distinct id) filter (where sex = 'man') from 
> student;
> select class_id, sum(distinct id), sum(distinct id) filter (where sex = 
> 'man') from student group by class_id;
> select class_id, count(id), count(id) filter (where class_id = 1), 
> sum(distinct id), sum(distinct id) filter (where sex = 'man') from student 
> group by class_id;
> {code}
>  
>  but not support:
>  
> {code:java}
> select class_id, count(distinct sex), sum(distinct id) filter (where sex = 
> 'man') from student group by class_id;
> select class_id, count(distinct sex) filter (where class_id = 1), 
> sum(distinct id) filter (where sex = 'man') from student group by 
> class_id;{code}
> The unsupported cases above are left to https://issues.apache.org/jira/browse/SPARK-30396.
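For anyone wanting to try the supported statements above, a throwaway `student` view with the referenced columns can be created as follows. The schema is assumed from the column names in the examples (id, class_id, sex), not taken from the ticket, and the FILTER clause on aggregates needs a build that includes this support (3.0.0).

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("distinct-filter-sketch").getOrCreate()
import spark.implicits._

// Columns assumed from the example queries: id, class_id, sex.
Seq((1, 1, "man"), (2, 1, "woman"), (3, 2, "man"), (4, 2, "man"))
  .toDF("id", "class_id", "sex")
  .createOrReplaceTempView("student")

spark.sql(
  "select class_id, sum(distinct id) filter (where sex = 'man') " +
  "from student group by class_id").show()
{code}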






[jira] [Updated] (SPARK-30395) When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

2019-12-31 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30395:
---
Description: 
This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986

This ticket will support:

When one or more DISTINCT aggregate expressions operate on the same field, the 
DISTINCT aggregate expression allows the use of the FILTER clause

such as:

 
{code:java}
select sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id) filter (where sex = 'man') from student group 
by class_id;
select count(id) filter (where class_id = 1), sum(distinct id) filter (where 
sex = 'man') from student;
select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter 
(where sex = 'man') from student group by class_id;
select sum(distinct id), sum(distinct id) filter (where sex = 'man') from 
student;
select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') 
from student group by class_id;
select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct 
id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
{code}
 

 but not support:

 

  was:
This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986

This ticket will support:

When one or more DISTINCT aggregate expressions operate on the same field, the 
DISTINCT aggregate expression allows the use of the FILTER clause

 


> When one or more DISTINCT aggregate expressions operate on the same field, 
> the DISTINCT aggregate expression allows the use of the FILTER clause
> 
>
> Key: SPARK-30395
> URL: https://issues.apache.org/jira/browse/SPARK-30395
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986
> This ticket will support:
> When one or more DISTINCT aggregate expressions operate on the same field, 
> the DISTINCT aggregate expression allows the use of the FILTER clause
> such as:
>  
> {code:java}
> select sum(distinct id) filter (where sex = 'man') from student;
> select class_id, sum(distinct id) filter (where sex = 'man') from student 
> group by class_id;
> select count(id) filter (where class_id = 1), sum(distinct id) filter (where 
> sex = 'man') from student;
> select class_id, count(id) filter (where class_id = 1), sum(distinct id) 
> filter (where sex = 'man') from student group by class_id;
> select sum(distinct id), sum(distinct id) filter (where sex = 'man') from 
> student;
> select class_id, sum(distinct id), sum(distinct id) filter (where sex = 
> 'man') from student group by class_id;
> select class_id, count(id), count(id) filter (where class_id = 1), 
> sum(distinct id), sum(distinct id) filter (where sex = 'man') from student 
> group by class_id;
> {code}
>  
>  but not support:
>  






[jira] [Created] (SPARK-30396) When there are multiple DISTINCT aggregate expressions acting on different fields, any DISTINCT aggregate expression allows the use of the FILTER clause

2019-12-31 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30396:
--

 Summary: When there are multiple DISTINCT aggregate expressions 
acting on different fields, any DISTINCT aggregate expression allows the use of 
the FILTER clause
 Key: SPARK-30396
 URL: https://issues.apache.org/jira/browse/SPARK-30396
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng


This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986

This ticket will support:

When there are multiple DISTINCT aggregate expressions acting on different 
fields, any DISTINCT aggregate expression allows the use of the FILTER clause

 






[jira] [Created] (SPARK-30395) When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

2019-12-31 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30395:
--

 Summary: When one or more DISTINCT aggregate expressions operate 
on the same field, the DISTINCT aggregate expression allows the use of the 
FILTER clause
 Key: SPARK-30395
 URL: https://issues.apache.org/jira/browse/SPARK-30395
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng


This ticket is related to https://issues.apache.org/jira/browse/SPARK-27986

This ticket will support:

When one or more DISTINCT aggregate expressions operate on the same field, the 
DISTINCT aggregate expression allows the use of the FILTER clause

 


