[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled

2019-10-03 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944194#comment-16944194
 ] 

L. C. Hsieh commented on SPARK-29302:
-

If this is not an issue, we should close it.

> dynamic partition overwrite with speculation enabled
> 
>
> Key: SPARK-29302
> URL: https://issues.apache.org/jira/browse/SPARK-29302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
>
> Currently, for a dynamic partition overwrite operation, the filename of a 
> task's output is deterministic.
> So, if speculation is enabled, could a task conflict with its corresponding 
> speculative task?
> Could the two tasks concurrently write to the same file?
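
A minimal sketch of the scenario in question, using real Spark configuration 
keys (the DataFrame and table names are illustrative):

{code:scala}
// Dynamic partition overwrite only replaces the partitions present in the
// incoming data. The question above is whether a task's output file name is
// deterministic enough that a speculative attempt of the same task
// (spark.speculation=true) could try to write the very same file concurrently.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.write.mode("overwrite").insertInto("partitioned_table")
{code}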






[jira] [Assigned] (SPARK-29351) Avoid full synchronization in ShuffleMapStage

2019-10-03 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-29351:
---

Assignee: DB Tsai

> Avoid full synchronization in ShuffleMapStage
> -
>
> Key: SPARK-29351
> URL: https://issues.apache.org/jira/browse/SPARK-29351
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: # 
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> In one of our production streaming jobs with more than 1k executors, each 
> with 20 cores, Spark spends a significant portion of time (30s) sending out 
> the `ShuffleStatus`. We found two issues.
> # In the driver's message loop, `serializedMapStatus` is called inside a 
> synchronized block. When the job scales really big, this causes contention.
> # When the job is big, the `MapStatus` is huge as well, so serialization and 
> compression are slow.
> This work aims to address the first problem.






[jira] [Resolved] (SPARK-29351) Avoid full synchronization in ShuffleMapStage

2019-10-03 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-29351.
-
Resolution: Resolved

> Avoid full synchronization in ShuffleMapStage
> -
>
> Key: SPARK-29351
> URL: https://issues.apache.org/jira/browse/SPARK-29351
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: # 
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> In one of our production streaming jobs with more than 1k executors, each 
> with 20 cores, Spark spends a significant portion of time (30s) sending out 
> the `ShuffleStatus`. We found two issues.
> # In the driver's message loop, `serializedMapStatus` is called inside a 
> synchronized block. When the job scales really big, this causes contention.
> # When the job is big, the `MapStatus` is huge as well, so serialization and 
> compression are slow.
> This work aims to address the first problem.






[jira] [Created] (SPARK-29353) AlterTableAlterColumnStatement should fall back to v1 AlterTableChangeColumnCommand

2019-10-03 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-29353:
---

 Summary: AlterTableAlterColumnStatement should fall back to v1 
AlterTableChangeColumnCommand
 Key: SPARK-29353
 URL: https://issues.apache.org/jira/browse/SPARK-29353
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan









[jira] [Commented] (SPARK-29337) How to Cache Table and Pin it in Memory so it Does Not Spill to Disk on Thrift Server

2019-10-03 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944168#comment-16944168
 ] 

Yuming Wang commented on SPARK-29337:
-

Could you try caching the table with options?
{code:sql}
CACHE TABLE tableName OPTIONS('storageLevel' 'MEMORY_ONLY');
{code}
https://github.com/apache/spark/pull/22263
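
For reference, a rough DataFrame-API equivalent (the table name is a 
placeholder): with MEMORY_ONLY nothing is spilled to disk, and partitions that 
do not fit in memory are dropped and recomputed from source when accessed.

{code:scala}
import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY keeps cached partitions strictly in memory, instead of the
// default MEMORY_AND_DISK, which spills to local disk under memory pressure.
spark.table("tableName").persist(StorageLevel.MEMORY_ONLY)
{code}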

> How to Cache Table and Pin it in Memory so it Does Not Spill to Disk on 
> Thrift Server 
> --
>
> Key: SPARK-29337
> URL: https://issues.apache.org/jira/browse/SPARK-29337
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Srini E
>Priority: Major
> Attachments: Cache+Image.png
>
>
> Hi Team,
> How can we pin a table in cache so it does not swap out of memory?
> Situation: We are using MicroStrategy BI reporting, and a semantic layer is 
> built. We wanted to cache highly used tables with Spark SQL CACHE TABLE 
> ; we did the caching on the Spark context (Thrift server). Please see 
> the snapshot below of a cached table that went to disk over time. Initially 
> it was all in cache; now some is in cache and some on disk. That disk may be 
> local disk, which is relatively more expensive to read than S3. Queries may 
> take longer, with inconsistent times from a user-experience perspective. 
> When more queries run against cached tables, copies of the cached table are 
> made, and those copies do not stay in memory, causing reports to run longer. 
> So how can we pin the table so it does not swap to disk? Spark memory 
> management is dynamic allocation, so how can we pin those few tables in 
> memory?






[jira] [Updated] (SPARK-29323) Add tooltip for The Executors Tab's column names in the Spark history server Page

2019-10-03 Thread liucht-inspur (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liucht-inspur updated SPARK-29323:
--
Attachment: image-2019-10-04-09-42-14-174.png

> Add tooltip for The Executors Tab's column names in the Spark history server 
> Page
> -
>
> Key: SPARK-29323
> URL: https://issues.apache.org/jira/browse/SPARK-29323
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: liucht-inspur
>Priority: Major
> Fix For: 2.4.4
>
> Attachments: image-2019-10-04-09-42-14-174.png
>
>
> On the Executors tab of the Spark History Server page, the Summary section 
> shows a row of column titles, but the formatting is irregular.
> Some column names have a tooltip, such as Storage Memory, Task Time (GC 
> Time), Input, Shuffle Read, Shuffle Write and Blacklisted, but some column 
> names still have no tooltip: RDD Blocks, Disk Used, Cores, Active Tasks, 
> Failed Tasks, Complete Tasks and Total Tasks. Oddly, in the Executors 
> section below, all of these column names do have tooltips.






[jira] [Created] (SPARK-29352) Move active streaming query state to the SharedState

2019-10-03 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-29352:
---

 Summary: Move active streaming query state to the SharedState
 Key: SPARK-29352
 URL: https://issues.apache.org/jira/browse/SPARK-29352
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.4, 3.0.0
Reporter: Burak Yavuz


We have checks to prevent restarting the same stream on the same Spark 
session, but we can make that better in multi-tenant environments by putting 
that state in the SharedState instead of the SessionState. This would allow a 
more comprehensive check for multi-tenant clusters.
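
A hedged sketch of the idea (the class body and names are illustrative, not 
the actual Spark internals):

{code:scala}
import java.util.UUID
import java.util.concurrent.ConcurrentHashMap

// SessionState is per SparkSession, so a registry kept there only detects
// restarts within one session. SharedState is shared across all sessions of
// a SparkContext, so a registry kept there sees every active query.
class SharedStateSketch {
  // queryId -> a description of the run that holds it (illustrative)
  val activeStreamingQueries = new ConcurrentHashMap[UUID, String]()
}
{code}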






[jira] [Resolved] (SPARK-29339) Support Arrow 0.14 in vectorized dapply and gapply (test it in AppVeyor build)

2019-10-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29339.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25993
[https://github.com/apache/spark/pull/25993]

> Support Arrow 0.14 in vectorized dapply and gapply (test it in AppVeyor build)
> -
>
> Key: SPARK-29339
> URL: https://issues.apache.org/jira/browse/SPARK-29339
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> dapply and gapply with Arrow optimization and Arrow 0.14 seem to be failing:
> {code}
> > collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, 
> > structType("gear double")))
> Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
>   invalid 'n' argument
> {code}
> We should fix it and also test it in AppVeyor.






[jira] [Assigned] (SPARK-29350) Fix BroadcastExchange reuse in Dynamic Partition Pruning

2019-10-03 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-29350:
---

Assignee: Wei Xue

> Fix BroadcastExchange reuse in Dynamic Partition Pruning
> 
>
> Key: SPARK-29350
> URL: https://issues.apache.org/jira/browse/SPARK-29350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
>
> Dynamic partition pruning filters are added as an in-subquery containing a 
> {{BroadcastExchange}} in a broadcast hash join. To ensure this new 
> {{BroadcastExchange}} can be reused, we need to make the {{ReuseExchange}} 
> rule visit in-subquery nodes.






[jira] [Resolved] (SPARK-29350) Fix BroadcastExchange reuse in Dynamic Partition Pruning

2019-10-03 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-29350.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Fix BroadcastExchange reuse in Dynamic Partition Pruning
> 
>
> Key: SPARK-29350
> URL: https://issues.apache.org/jira/browse/SPARK-29350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>
> Dynamic partition pruning filters are added as an in-subquery containing a 
> {{BroadcastExchange}} in a broadcast hash join. To ensure this new 
> {{BroadcastExchange}} can be reused, we need to make the {{ReuseExchange}} 
> rule visit in-subquery nodes.






[jira] [Assigned] (SPARK-28583) Subqueries should not call `onUpdatePlan` in Adaptive Query Execution

2019-10-03 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang reassigned SPARK-28583:


Assignee: Wei Xue  (was: Xingbo Jiang)

> Subqueries should not call `onUpdatePlan` in Adaptive Query Execution
> -
>
> Key: SPARK-28583
> URL: https://issues.apache.org/jira/browse/SPARK-28583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>
> Subqueries do not have their own execution id, so calling 
> {{AdaptiveSparkPlanExec.onUpdatePlan}} actually gets the 
> {{QueryExecution}} instance of the main query, which is wasteful and 
> problematic. It could cause issues like stack overflows or deadlocks in some 
> circumstances.






[jira] [Assigned] (SPARK-28583) Subqueries should not call `onUpdatePlan` in Adaptive Query Execution

2019-10-03 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang reassigned SPARK-28583:


Assignee: Xingbo Jiang  (was: Wei Xue)

> Subqueries should not call `onUpdatePlan` in Adaptive Query Execution
> -
>
> Key: SPARK-28583
> URL: https://issues.apache.org/jira/browse/SPARK-28583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 3.0.0
>
>
> Subqueries do not have their own execution id, so calling 
> {{AdaptiveSparkPlanExec.onUpdatePlan}} actually gets the 
> {{QueryExecution}} instance of the main query, which is wasteful and 
> problematic. It could cause issues like stack overflows or deadlocks in some 
> circumstances.






[jira] [Commented] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected

2019-10-03 Thread Guilherme Souza (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944046#comment-16944046
 ] 

Guilherme Souza commented on SPARK-29336:
-

I've added a new test case that reproduces the problem here: 
https://github.com/sitegui/spark/commit/fa123cf289c47ceeb6f84278ae028e5a46a85bf0

The problem is triggered especially when the merged summaries have seen an 
uneven number of samples.

I've managed to reproduce it for exact splits as well; however, that requires 
a larger number of samples.

I'm currently working on a forked branch and will try to create a PR that 
fixes the issue in the coming days.

> The implementation of QuantileSummaries.merge  does not guarantee that the 
> relativeError will be respected 
> ---
>
> Key: SPARK-29336
> URL: https://issues.apache.org/jira/browse/SPARK-29336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Guilherme Souza
>Priority: Minor
>
> Hello Spark maintainers,
> I was experimenting with my own implementation of the [space-efficient 
> quantile 
> algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf]
>  in another language, using Spark's as a reference.
> In my analysis, I believe I have found an issue with the {{merge()}} logic. 
> Here is some simple Scala code that reproduces the issue:
>  
> {code:java}
> var values = (1 to 100).toArray
> val all_quantiles = values.indices.map(i => (i+1).toDouble / 
> values.length).toArray
> for (n <- 0 until 5) {
>   var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
>   val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
>   val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
>   val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) 
> => Math.abs(expected - answer) }).toArray
>   val max_error = error.max
>   print(max_error + "\n")
> }
> {code}
> I query for all possible quantiles in a 100-element array with a desired 10% 
> max error. In this scenario, one would expect to observe a maximum error of 
> 10 ranks or less (10% of 100). However, the output I observe is:
>  
> {noformat}
> 16
> 12
> 10
> 11
> 17{noformat}
> The variance is probably due to non-deterministic operations behind the 
> scenes, but that is irrelevant to the core cause (and sorry for my Scala, 
> I'm not used to it).
> Interestingly enough, if I change from five partitions to one, the code 
> works as expected and gives 10 every time. This seems to point to some 
> problem in the [merge 
> logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171]
> The original authors ([~clockfly] and [~cloud_fan], from what I could dig 
> out of the history) suggest the published paper is not clear on how that 
> should be done and, honestly, I was not confident in the current approach 
> either.
> I've found SPARK-21184, which reports the same problem, but it was 
> unfortunately closed with no fix applied.
> In my external implementation I believe I have found a sound way to 
> implement the merge method. [Here is my take in Rust, if 
> relevant|https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218]
> I'd be really glad to add unit tests and contribute my implementation 
> adapted to Scala.
> I'd love to hear your opinion on the matter.
> Best regards






[jira] [Commented] (SPARK-27903) Improve parser error message for mismatched parentheses in expressions

2019-10-03 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944041#comment-16944041
 ] 

Jeff Evans commented on SPARK-27903:


I've been spending some time playing around with the grammar, and I'm not sure 
this is possible in the general case.  It should be easy enough to handle the 
case outlined in this Jira (I have a working change for that), but an "extra" 
right parenthesis is much more challenging due to the way ANTLR works, and the 
way the grammar is written.

> Improve parser error message for mismatched parentheses in expressions
> --
>
> Key: SPARK-27903
> URL: https://issues.apache.org/jira/browse/SPARK-27903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>
> When parentheses are mismatched in expressions in queries, the error message 
> is confusing. This is especially true for large queries, where mismatched 
> parens are tedious for a human to figure out. 
> For example, the error message for 
> {code:sql} 
> SELECT ((x + y) * z FROM t; 
> {code} 
> is 
> {code:java} 
> mismatched input 'FROM' expecting ','(line 1, pos 20) 
> {code} 
> One possible fix is to explicitly capture this kind of mismatched 
> parens in a grammar rule and print a user-friendly error message such as 
> {code:java} 
> mismatched parentheses for expression 'SELECT ((x + y) * z FROM t;'(line 1, 
> pos 20) 
> {code} 






[jira] [Created] (SPARK-29351) Avoid full synchronization in ShuffleMapStage

2019-10-03 Thread DB Tsai (Jira)
DB Tsai created SPARK-29351:
---

 Summary: Avoid full synchronization in ShuffleMapStage
 Key: SPARK-29351
 URL: https://issues.apache.org/jira/browse/SPARK-29351
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.4
 Environment: # 
Reporter: DB Tsai
 Fix For: 3.0.0


In one of our production streaming jobs with more than 1k executors, each with 
20 cores, Spark spends a significant portion of time (30s) sending out the 
`ShuffleStatus`. We found two issues.

# In the driver's message loop, `serializedMapStatus` is called inside a 
synchronized block. When the job scales really big, this causes contention.
# When the job is big, the `MapStatus` is huge as well, so serialization and 
compression are slow.

This work aims to address the first problem.
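
A hedged sketch of the contention pattern behind the first issue (the class 
and method bodies are illustrative, not the actual MapOutputTracker code):

{code:scala}
// Every executor asks the driver for the serialized map statuses. When the
// answer is produced inside a single synchronized block, thousands of
// concurrent requests queue up behind one lock while the slow
// serialization/compression work runs.
class StatusBroadcaster {
  private var cached: Array[Byte] = _

  def serializedMapStatus(): Array[Byte] = synchronized {
    if (cached == null) {
      cached = serializeAndCompress() // slow work done while holding the lock
    }
    cached
  }

  private def serializeAndCompress(): Array[Byte] = Array.emptyByteArray // stub
}
{code}

Avoiding the full synchronization means restructuring this path so that 
readers of an already-computed result do not have to take the lock at all.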






[jira] [Commented] (SPARK-29329) maven incremental builds not working

2019-10-03 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943839#comment-16943839
 ] 

Thomas Graves commented on SPARK-29329:
---

Filed an issue with scala-maven-plugin; we will see what they say:

https://github.com/davidB/scala-maven-plugin/issues/364

> maven incremental builds not working
> 
>
> Key: SPARK-29329
> URL: https://issues.apache.org/jira/browse/SPARK-29329
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> It looks like since we upgraded scala-maven-plugin to 4.2.0 
> (https://issues.apache.org/jira/browse/SPARK-28759), Spark incremental 
> builds stopped working. Every time you build, it rebuilds all files, which 
> takes forever.
> It would be nice to fix this.
> To reproduce, just build Spark once (I happened to be using the command 
> below):
> build/mvn -Phadoop-3.2 -Phive-thriftserver -Phive -Pyarn -Pkinesis-asl 
> -Pkubernetes -Pmesos -Phadoop-cloud -Pspark-ganglia-lgpl package -DskipTests
> Then build it again and you will see that it compiles all the files and 
> takes 15-30 minutes. With incremental builds it skips all unnecessary files 
> and takes closer to 5 minutes.






[jira] [Resolved] (SPARK-29054) Invalidate Kafka consumer when new delegation token available

2019-10-03 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-29054.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25760
[https://github.com/apache/spark/pull/25760]

> Invalidate Kafka consumer when new delegation token available
> -
>
> Key: SPARK-29054
> URL: https://issues.apache.org/jira/browse/SPARK-29054
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Kafka consumers are cached. If a delegation token is used and the token has 
> expired, an exception is thrown. In such a case a new consumer is created in 
> a task retry with the latest delegation token. This can be enhanced by 
> detecting the existence of a new delegation token.
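
A hedged sketch of the proposed enhancement (the types and helper below are 
illustrative, not the actual connector code):

{code:scala}
// Instead of waiting for an authentication failure on an expired token and
// recovering via a task retry, compare the token the cached consumer was
// built with against the newest delegation token and invalidate proactively.
case class CachedConsumer(tokenId: String)

def getOrRefresh(cached: CachedConsumer, latestTokenId: String): CachedConsumer =
  if (cached.tokenId != latestTokenId) {
    // close the stale consumer here, then build one with the fresh token
    CachedConsumer(latestTokenId)
  } else {
    cached
  }
{code}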






[jira] [Assigned] (SPARK-29054) Invalidate Kafka consumer when new delegation token available

2019-10-03 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-29054:
--

Assignee: Gabor Somogyi

> Invalidate Kafka consumer when new delegation token available
> -
>
> Key: SPARK-29054
> URL: https://issues.apache.org/jira/browse/SPARK-29054
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>
> Kafka consumers are cached. If a delegation token is used and the token has 
> expired, an exception is thrown. In such a case a new consumer is created in 
> a task retry with the latest delegation token. This can be enhanced by 
> detecting the existence of a new delegation token.






[jira] [Created] (SPARK-29350) Fix BroadcastExchange reuse in Dynamic Partition Pruning

2019-10-03 Thread Wei Xue (Jira)
Wei Xue created SPARK-29350:
---

 Summary: Fix BroadcastExchange reuse in Dynamic Partition Pruning
 Key: SPARK-29350
 URL: https://issues.apache.org/jira/browse/SPARK-29350
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wei Xue


Dynamic partition pruning filters are added as an in-subquery containing a 
{{BroadcastExchange}} in a broadcast hash join. To ensure this new 
{{BroadcastExchange}} can be reused, we need to make the {{ReuseExchange}} rule 
visit in-subquery nodes.
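
A hedged sketch of the fix direction ({{applyReuse}} is an illustrative helper 
standing in for the rule's logic, not Spark's actual code):

{code:scala}
// A reuse rule that walks only the main plan tree never visits the
// BroadcastExchange nested inside the dynamic-pruning in-subquery, so it
// cannot deduplicate it against the identical exchange in the join itself.
// Making the rule descend into subquery expressions fixes that:
plan.transformAllExpressions {
  case subquery: ExecSubqueryExpression =>
    subquery.withNewPlan(applyReuse(subquery.plan))
}
{code}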






[jira] [Assigned] (SPARK-29320) Compare `sql/core` module in JDK8/11 (Part 1)

2019-10-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29320:
-

Assignee: Dongjoon Hyun

> Compare `sql/core` module in JDK8/11 (Part 1)
> -
>
> Key: SPARK-29320
> URL: https://issues.apache.org/jira/browse/SPARK-29320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-29320) Compare `sql/core` module in JDK8/11 (Part 1)

2019-10-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29320.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26003
[https://github.com/apache/spark/pull/26003]

> Compare `sql/core` module in JDK8/11 (Part 1)
> -
>
> Key: SPARK-29320
> URL: https://issues.apache.org/jira/browse/SPARK-29320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Created] (SPARK-29349) Support FETCH_PRIOR in Thriftserver query results fetching

2019-10-03 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-29349:
-

 Summary: Support FETCH_PRIOR in Thriftserver query results fetching
 Key: SPARK-29349
 URL: https://issues.apache.org/jira/browse/SPARK-29349
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


Support FETCH_PRIOR fetching in Thriftserver






[jira] [Resolved] (SPARK-29296) Use scala-parallel-collections library in 2.13

2019-10-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29296.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25980
[https://github.com/apache/spark/pull/25980]

> Use scala-parallel-collections library in 2.13
> --
>
> Key: SPARK-29296
> URL: https://issues.apache.org/jira/browse/SPARK-29296
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> Classes like ForkJoinTaskSupport and .par moved to scala-parallel-collections 
> in 2.13. This needs to be included as a dependency only in 2.13, via a 
> profile. However, we'll also have to rewrite uses of .par to get this to 
> work in 2.12 and 2.13 simultaneously:
> https://github.com/scala/scala-parallel-collections/issues/22
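
A minimal sketch of the kind of rewrite this implies (the helper name is 
illustrative): replacing a .par call with explicit Future-based parallelism 
compiles on both 2.12 and 2.13 without the extra module.

{code:scala}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

def expensiveWork(i: Int): Int = i * 2

// Before (needs scala-parallel-collections on 2.13):
//   val results = (1 to 10).par.map(expensiveWork)
// After (standard library only, works on 2.12 and 2.13):
val results = Await.result(
  Future.sequence((1 to 10).map(i => Future(expensiveWork(i)))),
  Duration.Inf)
{code}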






[jira] [Assigned] (SPARK-29296) Use scala-parallel-collections library in 2.13

2019-10-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29296:


Assignee: Sean R. Owen

> Use scala-parallel-collections library in 2.13
> --
>
> Key: SPARK-29296
> URL: https://issues.apache.org/jira/browse/SPARK-29296
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Classes like ForkJoinTaskSupport and .par moved to scala-parallel-collections 
> in 2.13. This needs to be included as a dependency only in 2.13, via a 
> profile. However, we'll also have to rewrite uses of .par to get this to 
> work in 2.12 and 2.13 simultaneously:
> https://github.com/scala/scala-parallel-collections/issues/22






[jira] [Created] (SPARK-29348) Add observable metrics

2019-10-03 Thread Herman van Hövell (Jira)
Herman van Hövell created SPARK-29348:
-

 Summary: Add observable metrics
 Key: SPARK-29348
 URL: https://issues.apache.org/jira/browse/SPARK-29348
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell









[jira] [Created] (SPARK-29347) External Row should be JSON serializable

2019-10-03 Thread Herman van Hövell (Jira)
Herman van Hövell created SPARK-29347:
-

 Summary: External Row should be JSON serializable
 Key: SPARK-29347
 URL: https://issues.apache.org/jira/browse/SPARK-29347
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell


External Row should be exportable to JSON. This is needed for observable 
metrics because we want to include these metrics in the streaming query 
progress (which is JSON serializable). 






[jira] [Updated] (SPARK-29345) Add an API that allows a user to define and observe arbitrary metrics on streaming queries

2019-10-03 Thread Herman van Hövell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell updated SPARK-29345:
--
Summary: Add an API that allows a user to define and observe arbitrary 
metrics on streaming queries  (was: Add an API that allows a user to define and 
obser arbitrary metrics on streaming queries)

> Add an API that allows a user to define and observe arbitrary metrics on 
> streaming queries
> --
>
> Key: SPARK-29345
> URL: https://issues.apache.org/jira/browse/SPARK-29345
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>







[jira] [Created] (SPARK-29346) Create Aggregating Accumulator

2019-10-03 Thread Herman van Hövell (Jira)
Herman van Hövell created SPARK-29346:
-

 Summary: Create Aggregating Accumulator
 Key: SPARK-29346
 URL: https://issues.apache.org/jira/browse/SPARK-29346
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell


Create an accumulator that can compute a global aggregate over an arbitrary 
number of expressions. We will use this to implement observable metrics.
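
A hedged sketch of the shape such an accumulator takes, using the public 
AccumulatorV2 API (the real one will aggregate Catalyst expressions rather 
than a single long):

{code:scala}
import org.apache.spark.util.AccumulatorV2

// A global sum merged across tasks on the driver: the same mechanism an
// aggregating accumulator builds on, just with expressions instead of +=.
class SumAccumulator extends AccumulatorV2[Long, Long] {
  private var sum = 0L
  override def isZero: Boolean = sum == 0L
  override def copy(): SumAccumulator = { val acc = new SumAccumulator; acc.sum = sum; acc }
  override def reset(): Unit = sum = 0L
  override def add(v: Long): Unit = sum += v
  override def merge(other: AccumulatorV2[Long, Long]): Unit = sum += other.value
  override def value: Long = sum
}
{code}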






[jira] [Created] (SPARK-29345) Add an API that allows a user to define and obser arbitrary metrics on streaming queries

2019-10-03 Thread Herman van Hövell (Jira)
Herman van Hövell created SPARK-29345:
-

 Summary: Add an API that allows a user to define and obser 
arbitrary metrics on streaming queries
 Key: SPARK-29345
 URL: https://issues.apache.org/jira/browse/SPARK-29345
 Project: Spark
  Issue Type: Epic
  Components: SQL
Affects Versions: 3.0.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell









[jira] [Updated] (SPARK-29344) Spark application hang

2019-10-03 Thread Kitti (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kitti updated SPARK-29344:
--
Attachment: stderr

> Spark application hang
> --
>
> Key: SPARK-29344
> URL: https://issues.apache.org/jira/browse/SPARK-29344
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: Kitti
>Priority: Major
> Attachments: stderr
>
>
> We found an issue where the Spark application sometimes hangs and stops 
> working, without any log output in the Spark driver, until we kill the 
> application.
> {noformat}
> 19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 117
> 19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 80
> 19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 105
> 19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 88
> 19/10/03 10:36:59 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM{noformat}






[jira] [Created] (SPARK-29344) Spark application hang

2019-10-03 Thread Kitti (Jira)
Kitti created SPARK-29344:
-

 Summary: Spark application hang
 Key: SPARK-29344
 URL: https://issues.apache.org/jira/browse/SPARK-29344
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.1
Reporter: Kitti


We found an issue where the Spark application sometimes hangs and stops 
working, without any log output in the Spark driver, until we kill the 
application.

{noformat}
19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 117
19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 80
19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 105
19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 88
19/10/03 10:36:59 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
{noformat}






[jira] [Resolved] (SPARK-29341) Upgrade cloudpickle to 1.0.0

2019-10-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29341.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26009
[https://github.com/apache/spark/pull/26009]

> Upgrade cloudpickle to 1.0.0
> 
>
> Key: SPARK-29341
> URL: https://issues.apache.org/jira/browse/SPARK-29341
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> Cloudpickle 1.0.0 includes two bug fixes. It would be good to upgrade to 
> include them.






[jira] [Assigned] (SPARK-29341) Upgrade cloudpickle to 1.0.0

2019-10-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-29341:


Assignee: L. C. Hsieh

> Upgrade cloudpickle to 1.0.0
> 
>
> Key: SPARK-29341
> URL: https://issues.apache.org/jira/browse/SPARK-29341
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Cloudpickle 1.0.0 includes two bug fixes. It would be good to upgrade to 
> include them.






[jira] [Assigned] (SPARK-29142) Pyspark clustering models support column setters/getters/predict

2019-10-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29142:
-

Assignee: Huaxin Gao

> Pyspark clustering models support column setters/getters/predict
> 
>
> Key: SPARK-29142
> URL: https://issues.apache.org/jira/browse/SPARK-29142
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> Unlike the regression/classification models, clustering models do not share 
> a common base class, so we need to add them one by one.






[jira] [Created] (SPARK-29343) Eliminate sorts without limit in the subquery of Join/Aggregation

2019-10-03 Thread EdisonWang (Jira)
EdisonWang created SPARK-29343:
--

 Summary: Eliminate sorts without limit in the subquery of 
Join/Aggregation
 Key: SPARK-29343
 URL: https://issues.apache.org/jira/browse/SPARK-29343
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: EdisonWang


A {{Sort}} without a {{Limit}} in a {{Join/GroupBy}} subquery is useless.

For example, {{select count(1) from (select a from test1 order by a)}} is equal 
to {{select count(1) from (select a from test1)}}, and {{select * from (select 
a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b}} is 
equal to {{select * from (select a from test1) t1 join (select b from test2) t2 
on t1.a = t2.b}}.

Removing the useless {{Sort}} operator can improve performance; a sketch of 
the rule shape follows below.
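
A hedged sketch of what such an optimizer rule could look like (simplified: 
the real rule must also handle the Join case and verify that nothing above 
the Sort depends on ordering):

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Aggregation does not depend on the ordering of its input, so a Sort feeding
// directly into an Aggregate (with no Limit attached) can be dropped.
object EliminateSortBelowAggregate extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case agg @ Aggregate(_, _, Sort(_, _, child)) => agg.copy(child = child)
  }
}
{code}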






[jira] [Resolved] (SPARK-28084) LOAD DATA command resolving the partition column name considering case sensitive manner

2019-10-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28084.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 24903
[https://github.com/apache/spark/pull/24903]

> LOAD DATA command resolving the partition column name considering case 
> sensitive manner 
> ---
>
> Key: SPARK-28084
> URL: https://issues.apache.org/jira/browse/SPARK-28084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Sujith Chacko
>Assignee: Sujith Chacko
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: parition_casesensitive.PNG
>
>
> The LOAD DATA command resolves the partition column name in a case-sensitive 
> manner, whereas the INSERT command resolves it case-insensitively.
> Refer to the snapshot for more details.
> !image-2019-06-18-00-04-22-475.png!






[jira] [Assigned] (SPARK-28084) LOAD DATA command resolving the partition column name considering case sensitive manner

2019-10-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28084:
-

Assignee: Sujith Chacko

> LOAD DATA command resolving the partition column name considering case 
> sensitive manner 
> ---
>
> Key: SPARK-28084
> URL: https://issues.apache.org/jira/browse/SPARK-28084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Sujith Chacko
>Assignee: Sujith Chacko
>Priority: Major
> Attachments: parition_casesensitive.PNG
>
>
> The LOAD DATA command resolves the partition column name in a case-sensitive 
> manner, whereas the INSERT command resolves it case-insensitively.
> Refer to the snapshot for more details.
> !image-2019-06-18-00-04-22-475.png!






[jira] [Created] (SPARK-29342) Make casting strings to intervals case insensitive

2019-10-03 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29342:
--

 Summary: Make casting strings to intervals case insensitive 
 Key: SPARK-29342
 URL: https://issues.apache.org/jira/browse/SPARK-29342
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


PostgreSQL is not sensitive to the case of interval string values:
{code}
maxim=# select cast('10 Days' as INTERVAL);
 interval 
--
 10 days
(1 row)
{code}
but Spark is case-sensitive:
{code}
spark-sql> select cast('INTERVAL 10 DAYS' as INTERVAL);
NULL
spark-sql> select cast('interval 10 days' as INTERVAL);
interval 1 weeks 3 days
{code}
 






[jira] [Created] (SPARK-29341) Upgrade cloudpickle to 1.0.0

2019-10-03 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-29341:
---

 Summary: Upgrade cloudpickle to 1.0.0
 Key: SPARK-29341
 URL: https://issues.apache.org/jira/browse/SPARK-29341
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: L. C. Hsieh


Cloudpickle 1.0.0 includes two bug fixes. It would be good to upgrade to 
include them.






[jira] [Resolved] (SPARK-29317) Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan

2019-10-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29317.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25989
[https://github.com/apache/spark/pull/25989]

> Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan
> ---
>
> Key: SPARK-29317
> URL: https://issues.apache.org/jira/browse/SPARK-29317
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> In SPARK-27463, some refactoring was done, and two common abstract base 
> classes were introduced:
> 1. {{BaseArrowPythonRunner}}
> Before:
> {code}
> └── BasePythonRunner
> ├── ArrowPythonRunner
> ├── CoGroupedArrowPythonRunner
> ├── PythonRunner
> └── PythonUDFRunner
> {code}
> After:
> {code}
> BasePythonRunner
> ├── BaseArrowPythonRunner
> │   ├── ArrowPythonRunner
> │   └── CoGroupedArrowPythonRunner
> ├── PythonRunner
> └── PythonUDFRunner
> {code}
> The problem is that the R code path mirrors the Python side:
> {code}
> └── BaseRRunner
> ├── ArrowRRunner
> └── RRunner
> {code}
> I would like to match the hierarchy and decouple other stuff for now. Ideally 
> we should deduplicate both code paths. The internal implementations are also 
> intentionally similar.
> 2. {{BasePandasGroupExec}}
> Before:
> {code}
> ├── FlatMapGroupsInPandasExec
> └── FlatMapCoGroupsInPandasExec
> {code}
> After:
> {code}
> └── BasePandasGroupExec
> ├── FlatMapGroupsInPandasExec
> └── FlatMapCoGroupsInPandasExec
> {code}
> The problem is that R (with Arrow optimization, in particular) has some code 
> duplicated with Pandas UDFs: 
> {{FlatMapGroupsInRWithArrowExec}} <> {{FlatMapGroupsInPandasExec}}
> {{MapPartitionsInRWithArrowExec}} <> {{ArrowEvalPythonExec}}
> In order to prepare for deduplication here as well, it might be better to 
> avoid changing the hierarchy on the Python side alone and rather just 
> decouple it.






[jira] [Created] (SPARK-29340) Spark Sql executions do not use thread local jobgroup

2019-10-03 Thread Navdeep Poonia (Jira)
Navdeep Poonia created SPARK-29340:
--

 Summary: Spark Sql executions do not use thread local jobgroup
 Key: SPARK-29340
 URL: https://issues.apache.org/jira/browse/SPARK-29340
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Navdeep Poonia


{code:scala}
val sparkThreadLocal: SparkSession = DataCurator.spark.newSession()

sparkThreadLocal.sparkContext.setJobGroup("", "")

// OR

sparkThreadLocal.sparkContext.setLocalProperty("spark.job.description", "")
sparkThreadLocal.sparkContext.setLocalProperty("spark.jobGroup.id", "")
{code}

The job group property works fine for Spark jobs/stages created by DataFrame 
operations, but with Spark SQL the job group is randomly assigned to stages, 
or is sometimes null.






[jira] [Comment Edited] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory

2019-10-03 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943127#comment-16943127
 ] 

Peter Toth edited comment on SPARK-29078 at 10/3/19 7:04 AM:
-

I don't think there should be other databases under {{/apps/hive/warehouse}} 
directory if that is a concern. The {{default}} database points to 
{{/apps/hive/warehouse}}, and new databases are created under that directory as 
well by default, but you have the option to create a new database pointing to a 
very different directory. I mean that way we could avoid this issue.

Anyways, this doesn't seem to me a Spark related issue.


was (Author: petertoth):
I don't think there should be other databases under {{/apps/hive/warehouse}} 
directory if that is a concern. The {{default}} database points to 
{{/apps/hive/warehouse}}, and new databases are created under that directory as 
well by default, but you have the option to create a new database pointing to a 
very different directory. I mean that way we could avoid this issue.

> Spark shell fails if read permission is not granted to hive warehouse 
> directory
> ---
>
> Key: SPARK-29078
> URL: https://issues.apache.org/jira/browse/SPARK-29078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mihaly Toth
>Priority: Major
>
> Similarly to SPARK-20256, in {{SharedSessionState}} when 
> {{GlobalTempViewManager}} is created, it is checked that no database exists 
> with the same name as the global temp database (the name is configurable 
> with {{spark.sql.globalTempDatabase}}), because that is a special database 
> which should not exist in the metastore. For this, read permission on the 
> warehouse directory is currently required, which on the other hand would 
> allow listing all the databases of all users.
> When such read access is not granted for security reasons, an access 
> violation exception should be ignored during this initial validation.






[jira] [Comment Edited] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory

2019-10-03 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943127#comment-16943127
 ] 

Peter Toth edited comment on SPARK-29078 at 10/3/19 7:01 AM:
-

I don't think there should be other databases under {{/apps/hive/warehouse}} 
directory if that is a concern. The {{default}} database points to 
{{/apps/hive/warehouse}}, and new databases are created under that directory as 
well by default, but you have the option to create a new database pointing to a 
very different directory. I mean that way we could avoid this issue.


was (Author: petertoth):
I don't think there should be other databases under {{/apps/hive/warehouse}} 
directory if that is a concern. The {{default}} database points to 
{{/apps/hive/warehouse}}, but you have the option to create a new database 
pointing to a very different directory. I mean that way we could avoid this 
issue.

> Spark shell fails if read permission is not granted to hive warehouse 
> directory
> ---
>
> Key: SPARK-29078
> URL: https://issues.apache.org/jira/browse/SPARK-29078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mihaly Toth
>Priority: Major
>
> Similarly to SPARK-20256, in {{SharedSessionState}} when 
> {{GlobalTempViewManager}} is created, it is checked that no database exists 
> with the same name as the global temp database (the name is configurable 
> with {{spark.sql.globalTempDatabase}}), because that is a special database 
> which should not exist in the metastore. For this, read permission on the 
> warehouse directory is currently required, which on the other hand would 
> allow listing all the databases of all users.
> When such read access is not granted for security reasons, an access 
> violation exception should be ignored during this initial validation.






[jira] [Comment Edited] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory

2019-10-03 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943127#comment-16943127
 ] 

Peter Toth edited comment on SPARK-29078 at 10/3/19 7:00 AM:
-

I don't think there should be other databases under {{/apps/hive/warehouse}} 
directory if that is a concern. The {{default}} database points to 
{{/apps/hive/warehouse}}, but you have the option to create a new database 
pointing to a very different directory. I mean that way we could avoid this 
issue.


was (Author: petertoth):
I don't think there should be other databases under {{/apps/hive/warehouse}} 
directory if the {{default}} database points to {{/apps/hive/warehouse}}. I 
mean that way we could avoid this issue.

> Spark shell fails if read permission is not granted to hive warehouse 
> directory
> ---
>
> Key: SPARK-29078
> URL: https://issues.apache.org/jira/browse/SPARK-29078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mihaly Toth
>Priority: Major
>
> Similarly to SPARK-20256, in {{SharedSessionState}} when 
> {{GlobalTempViewManager}} is created, it is checked that no database exists 
> with the same name as the global temp database (the name is configurable 
> with {{spark.sql.globalTempDatabase}}), because that is a special database 
> which should not exist in the metastore. For this, read permission on the 
> warehouse directory is currently required, which on the other hand would 
> allow listing all the databases of all users.
> When such read access is not granted for security reasons, an access 
> violation exception should be ignored during this initial validation.






[jira] [Updated] (SPARK-29328) Incorrect calculation mean seconds per month

2019-10-03 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-29328:
---
Labels: correctness  (was: )

> Incorrect calculation mean seconds per month
> 
>
> Key: SPARK-29328
> URL: https://issues.apache.org/jira/browse/SPARK-29328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.4
>Reporter: Maxim Gekk
>Priority: Minor
>  Labels: correctness
>
> The existing implementation assumes 31 days per month, or 372 days per year, 
> which is far from the correct number. Spark uses the proleptic Gregorian 
> calendar by default (SPARK-26651), in which the average year is 365.2425 days 
> long: https://en.wikipedia.org/wiki/Gregorian_calendar . The calculation 
> needs to be fixed in at least 3 places:
> - GroupStateImpl.scala:167:val millisPerMonth = 
> TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
> - EventTimeWatermark.scala:32:val millisPerMonth = 
> TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
> - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31)
> *BEFORE*
> {code}
> spark-sql> select months_between('2019-09-15', '1970-01-01');
> 596.4516129
> {code}
> *AFTER*
> {code}
> spark-sql> select months_between('2019-09-15', '1970-01-01');
> 596.45996838
> {code}
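
For reference, a quick check of the arithmetic behind the fix (the constant 
names are illustrative):

{code:scala}
// The mean Gregorian month is 365.2425 / 12 = 30.436875 days, not 31.
val secondsPerMonthOld = 31L * 24 * 60 * 60                     // 2678400
val secondsPerMonthNew = (365.2425 / 12 * 24 * 60 * 60).toLong  // 2629746
{code}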






[jira] [Resolved] (SPARK-29305) Update LICENSE and NOTICE for hadoop 3.2

2019-10-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29305.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25978
[https://github.com/apache/spark/pull/25978]

> Update LICENSE and NOTICE for hadoop 3.2
> 
>
> Key: SPARK-29305
> URL: https://issues.apache.org/jira/browse/SPARK-29305
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.0.0
>
>
> {code}
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:2.9.5  
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:2.9.5 
> com.fasterxml.woodstox:woodstox-core:5.0.3
> com.github.stephenc.jcip:jcip-annotations:1.0-1   
> com.google.re2j:re2j:1.1  
> com.microsoft.sqlserver:mssql-jdbc:6.2.1.jre7 
> com.nimbusds:nimbus-jose-jwt:4.41.1   
> dnsjava:dnsjava:2.1.7 
> net.minidev:accessors-smart:1.2   
> net.minidev:json-smart:2.3
> org.apache.commons:commons-configuration2:2.1.1   
> org.apache.geronimo.specs:geronimo-jcache_1.0_spec:1.0-alpha-1
> org.apache.hadoop:hadoop-hdfs-client:3.2.0 
> org.apache.kerby:kerb-admin:1.0.1 
> org.apache.kerby:kerb-client:1.0.1
> org.apache.kerby:kerb-common:1.0.1
> org.apache.kerby:kerb-core:1.0.1  
> org.apache.kerby:kerb-crypto:1.0.1
> org.apache.kerby:kerb-identity:1.0.1  
> org.apache.kerby:kerb-server:1.0.1
> org.apache.kerby:kerb-simplekdc:1.0.1 
> org.apache.kerby:kerb-util:1.0.1  
> org.apache.kerby:kerby-asn1:1.0.1 
> org.apache.kerby:kerby-config:1.0.1   
> org.apache.kerby:kerby-pkix:1.0.1 
> org.apache.kerby:kerby-util:1.0.1 
> org.apache.kerby:kerby-xdr:1.0.1  
> org.apache.kerby:token-provider:1.0.1 
> org.codehaus.woodstox:stax2-api:3.1.4 
> org.ehcache:ehcache:3.3.1 
> {code}






[jira] [Updated] (SPARK-29305) Update LICENSE and NOTICE for hadoop 3.2

2019-10-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29305:
-
Priority: Minor  (was: Major)

> Update LICENSE and NOTICE for hadoop 3.2
> 
>
> Key: SPARK-29305
> URL: https://issues.apache.org/jira/browse/SPARK-29305
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
>
> {code}
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:2.9.5  
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:2.9.5 
> com.fasterxml.woodstox:woodstox-core:5.0.3
> com.github.stephenc.jcip:jcip-annotations:1.0-1   
> com.google.re2j:re2j:1.1  
> com.microsoft.sqlserver:mssql-jdbc:6.2.1.jre7 
> com.nimbusds:nimbus-jose-jwt:4.41.1   
> dnsjava:dnsjava:2.1.7 
> net.minidev:accessors-smart:1.2   
> net.minidev:json-smart:2.3
> org.apache.commons:commons-configuration2:2.1.1   
> org.apache.geronimo.specs:geronimo-jcache_1.0_spec:1.0-alpha-1
> org.apache.hadoop:hadoop-hdfs-client:3.2.0 
> org.apache.kerby:kerb-admin:1.0.1 
> org.apache.kerby:kerb-client:1.0.1
> org.apache.kerby:kerb-common:1.0.1
> org.apache.kerby:kerb-core:1.0.1  
> org.apache.kerby:kerb-crypto:1.0.1
> org.apache.kerby:kerb-identity:1.0.1  
> org.apache.kerby:kerb-server:1.0.1
> org.apache.kerby:kerb-simplekdc:1.0.1 
> org.apache.kerby:kerb-util:1.0.1  
> org.apache.kerby:kerby-asn1:1.0.1 
> org.apache.kerby:kerby-config:1.0.1   
> org.apache.kerby:kerby-pkix:1.0.1 
> org.apache.kerby:kerby-util:1.0.1 
> org.apache.kerby:kerby-xdr:1.0.1  
> org.apache.kerby:token-provider:1.0.1 
> org.codehaus.woodstox:stax2-api:3.1.4 
> org.ehcache:ehcache:3.3.1 
> {code}






[jira] [Assigned] (SPARK-29305) Update LICENSE and NOTICE for hadoop 3.2

2019-10-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29305:


Assignee: angerszhu

> Update LICENSE and NOTICE for hadoop 3.2
> 
>
> Key: SPARK-29305
> URL: https://issues.apache.org/jira/browse/SPARK-29305
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> {code}
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:2.9.5  
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:2.9.5 
> com.fasterxml.woodstox:woodstox-core:5.0.3
> com.github.stephenc.jcip:jcip-annotations:1.0-1   
> com.google.re2j:re2j:1.1  
> com.microsoft.sqlserver:mssql-jdbc:6.2.1.jre7 
> com.nimbusds:nimbus-jose-jwt:4.41.1   
> dnsjava:dnsjava:2.1.7 
> net.minidev:accessors-smart:1.2   
> net.minidev:json-smart:2.3
> org.apache.commons:commons-configuration2:2.1.1   
> org.apache.geronimo.specs:geronimo-jcache_1.0_spec:1.0-alpha-1
> org.apache.hadoop:hadoop-hdfs-client:3.2.0 
> org.apache.kerby:kerb-admin:1.0.1 
> org.apache.kerby:kerb-client:1.0.1
> org.apache.kerby:kerb-common:1.0.1
> org.apache.kerby:kerb-core:1.0.1  
> org.apache.kerby:kerb-crypto:1.0.1
> org.apache.kerby:kerb-identity:1.0.1  
> org.apache.kerby:kerb-server:1.0.1
> org.apache.kerby:kerb-simplekdc:1.0.1 
> org.apache.kerby:kerb-util:1.0.1  
> org.apache.kerby:kerby-asn1:1.0.1 
> org.apache.kerby:kerby-config:1.0.1   
> org.apache.kerby:kerby-pkix:1.0.1 
> org.apache.kerby:kerby-util:1.0.1 
> org.apache.kerby:kerby-xdr:1.0.1  
> org.apache.kerby:token-provider:1.0.1 
> org.codehaus.woodstox:stax2-api:3.1.4 
> org.ehcache:ehcache:3.3.1 
> {code}


