date:20200226

[jira] [Created] (SPARK-30968) Upgrade aws-java-sdk-sts to 1.11.655

2020-02-26 Thread Dongjoon Hyun (Jira)

Dongjoon Hyun created SPARK-30968:
-

 Summary: Upgrade aws-java-sdk-sts to 1.11.655
 Key: SPARK-30968
 URL: https://issues.apache.org/jira/browse/SPARK-30968
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2020-02-26 Thread Jorge Machado (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046265#comment-17046265
 ] 

Jorge Machado commented on SPARK-26412:
---

Hi, one question.

When using "a tuple of pd.Series if UDF is called with more than one Spark DF 
columns", how can I get the Series into variables? Something like 

a, b, c = iterator ? map does not seem to be a Python tuple ... 
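A minimal sketch of one way to read the quoted sentence, assuming the iterator yields one tuple per batch (as the `for features, label in batch_iter` example in the issue description suggests). The generator `fake_batch_iter` and the function `add_columns` are hypothetical stand-ins, and plain lists are used in place of pd.Series so the snippet runs without Spark. The point is that the unpacking happens per batch inside the loop, not on the iterator itself:

```python
# Hypothetical stand-in for the iterator handed to an iterator-style UDF that
# is called with three columns: each element is one batch, given as a tuple
# of three equally long columns (plain lists here instead of pd.Series).
def fake_batch_iter():
    yield ([1, 2], [10, 20], [100, 200])
    yield ([3], [30], [300])


def add_columns(batch_iter):
    # Unpack each batch inside the loop; `a, b, c = batch_iter` would instead
    # try to unpack the iterator into exactly three batches.
    for a, b, c in batch_iter:
        yield [x + y + z for x, y, z in zip(a, b, c)]


result = list(add_columns(fake_batch_iter()))
print(result)
```

With real pd.Series the loop header would presumably stay the same; only the body would use vectorized operations.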

> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, user needs to repeatedly load 
> the same model for every batch in the same python worker process, which is 
> inefficient.
> We can provide users the iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
>     yield model.predict(batch)
> {code}
> The type of each batch is:
> * a pd.Series if UDF is called with a single non-struct-type column
> * a tuple of pd.Series if UDF is called with more than one Spark DF columns
> * a pd.DataFrame if UDF is called with a single StructType column
> Examples:
> {code}
> @pandas_udf(...)
> def evaluate(batch_iter):
>   model = ... # load model
>   for features, label in batch_iter:
>     pred = model.predict(features)
>     yield (pred - label).abs()
> df.select(evaluate(col("features"), col("label")).alias("err"))
> {code}
> {code}
> @pandas_udf(...)
> def evaluate(pdf_iter):
>   model = ... # load model
>   for pdf in pdf_iter:
>     pred = model.predict(pdf['features'])
>     yield (pred - pdf['label']).abs()
> df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
> {code}
> If the UDF doesn't return the same number of records for the entire 
> partition, user should see an error. We don't restrict that every yield 
> should match the input batch size.
> Another benefit is with iterator interface and asyncio from Python, it is 
> flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
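The motivation in the description (loading the model once per iterator instead of once per batch) can be sketched without Spark. This is only an illustration under stated assumptions: `CountingLoader`, `predict_per_batch` and `predict_iter` are hypothetical names, the "model" is a toy function, and batches are plain lists rather than pandas objects.

```python
class CountingLoader:
    """Hypothetical stand-in for an expensive model load; counts its calls."""

    def __init__(self):
        self.loads = 0

    def load(self):
        self.loads += 1
        return lambda batch: [2 * x for x in batch]  # toy "model"


def predict_per_batch(loader, batches):
    # Per-batch execution: the model is loaded again for every batch.
    return [loader.load()(batch) for batch in batches]


def predict_iter(loader, batch_iter):
    # Iterator execution: load once, then stream every batch through it.
    model = loader.load()
    for batch in batch_iter:
        yield model(batch)


batches = [[1, 2], [3], [4, 5, 6]]

per_batch_loader = CountingLoader()
per_batch_out = predict_per_batch(per_batch_loader, batches)

iter_loader = CountingLoader()
iter_out = list(predict_iter(iter_loader, iter(batches)))

print(per_batch_loader.loads, iter_loader.loads)
```

Both variants produce the same predictions; only the number of loads differs, which is the saving the proposal is after.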




[jira] [Resolved] (SPARK-30765) Refine base class abstraction code style

2020-02-26 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30765.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27511
[https://github.com/apache/spark/pull/27511]

> Refine base class abstraction code style
> 
>
> Key: SPARK-30765
> URL: https://issues.apache.org/jira/browse/SPARK-30765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xin Wu
>Assignee: Xin Wu
>Priority: Major
> Fix For: 3.1.0
>
>
> When doing the base operator abstraction work, I found that some code 
> snippets are still inconsistent with the style of the other abstraction code.
>  
> Case 1: the override keyword is missing for some fields in derived classes. The 
> compiler will not catch it if we rename those fields in the future.
> [https://github.com/apache/spark/pull/27368#discussion_r376694045]
>  
>  
> Case 2, inconsistent abstract class definition. The updated style will 
> simplify derived class definition.
> [https://github.com/apache/spark/pull/27368#discussion_r375061952]
>  




[jira] [Assigned] (SPARK-30765) Refine base class abstraction code style

2020-02-26 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30765:


Assignee: Xin Wu

> Refine base class abstraction code style
> 
>
> Key: SPARK-30765
> URL: https://issues.apache.org/jira/browse/SPARK-30765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xin Wu
>Assignee: Xin Wu
>Priority: Major
>
> When doing the base operator abstraction work, I found that some code 
> snippets are still inconsistent with the style of the other abstraction code.
>  
> Case 1: the override keyword is missing for some fields in derived classes. The 
> compiler will not catch it if we rename those fields in the future.
> [https://github.com/apache/spark/pull/27368#discussion_r376694045]
>  
>  
> Case 2, inconsistent abstract class definition. The updated style will 
> simplify derived class definition.
> [https://github.com/apache/spark/pull/27368#discussion_r375061952]
>  




[jira] [Comment Edited] (SPARK-27630) Stage retry causes totalRunningTasks calculation to be negative

2020-02-26 Thread Xiao Li (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-27630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046245#comment-17046245
 ] 

Xiao Li edited comment on SPARK-27630 at 2/27/20 7:11 AM:
--

This change breaks the API. Need a release note.
{code:java}
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.apply"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.copy"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.this"),
ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted$"),{code}


was (Author: smilegator):
This change breaks the API
{code:java}
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.apply"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.copy"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.this"),
ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted$"),{code}

> Stage retry causes totalRunningTasks calculation to be negative
> ---
>
> Key: SPARK-27630
> URL: https://issues.apache.org/jira/browse/SPARK-27630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Track tasks separately for each stage attempt (instead of tracking by stage), 
> and do NOT reset the numRunningTasks to 0 on StageCompleted.
> In the case of stage retry, the {{taskEnd}} event from the zombie stage 
> sometimes makes the number of {{totalRunningTasks}} negative, which will 
> cause the job to get stuck.
>  A similar problem also exists with {{stageIdToTaskIndices}} & 
> {{stageIdToSpeculativeTaskIndices}}.
>  If it is a failed {{taskEnd}} event of the zombie stage, this will cause 
> {{stageIdToTaskIndices}} or {{stageIdToSpeculativeTaskIndices}} to remove the 
> task index of the active stage, and the number of {{totalPendingTasks}} will 
> increase unexpectedly.
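The failure mode described above can be replayed with a small, simplified model. This is a hedged sketch, not Spark's listener code: the class names, method names and the event list are hypothetical and only mirror the description (one counter per stage that is reset on StageCompleted, versus one counter per stage attempt that is never reset).

```python
from collections import defaultdict

# Simplified, hypothetical model of the bookkeeping; the class and method
# names are illustrative and are not Spark's actual listener code.


class PerStageTracker:
    """Old behavior: one counter per stage id, reset on StageCompleted."""

    def __init__(self):
        self.running = defaultdict(int)

    def task_start(self, stage_id, attempt_id):
        self.running[stage_id] += 1

    def task_end(self, stage_id, attempt_id):
        self.running[stage_id] -= 1

    def stage_completed(self, stage_id, attempt_id):
        self.running[stage_id] = 0  # wipes tasks that are still running

    def total_running(self):
        return sum(self.running.values())


class PerAttemptTracker(PerStageTracker):
    """Proposed behavior: one counter per (stage id, attempt id), no reset."""

    def task_start(self, stage_id, attempt_id):
        self.running[(stage_id, attempt_id)] += 1

    def task_end(self, stage_id, attempt_id):
        self.running[(stage_id, attempt_id)] -= 1

    def stage_completed(self, stage_id, attempt_id):
        pass  # running tasks of a zombie attempt keep being counted


EVENTS = [
    ("task_start", 0, 0),       # attempt 0 launches a task
    ("stage_completed", 0, 0),  # attempt 0 fails and becomes a zombie
    ("task_start", 0, 1),       # the retry (attempt 1) launches a task
    ("task_end", 0, 0),         # late taskEnd from the zombie attempt
    ("task_end", 0, 1),         # the retry's task ends
]


def replay(tracker, events):
    # Feed every event to the tracker and report the final running total.
    for name, stage_id, attempt_id in events:
        getattr(tracker, name)(stage_id, attempt_id)
    return tracker.total_running()


print(replay(PerStageTracker(), EVENTS))    # -1
print(replay(PerAttemptTracker(), EVENTS))  # 0
```

In this toy replay the per-stage counter ends below zero because the zombie's late taskEnd is charged against the retry, while the per-attempt counter returns to zero.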




[jira] [Commented] (SPARK-27630) Stage retry causes totalRunningTasks calculation to be negative

2020-02-26 Thread Xiao Li (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-27630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046245#comment-17046245
 ] 

Xiao Li commented on SPARK-27630:
-

This change breaks the API
{code:java}
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.apply"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.copy"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.this"),
ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted$"),{code}

> Stage retry causes totalRunningTasks calculation to be negative
> ---
>
> Key: SPARK-27630
> URL: https://issues.apache.org/jira/browse/SPARK-27630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Track tasks separately for each stage attempt (instead of tracking by stage), 
> and do NOT reset the numRunningTasks to 0 on StageCompleted.
> In the case of stage retry, the {{taskEnd}} event from the zombie stage 
> sometimes makes the number of {{totalRunningTasks}} negative, which will 
> cause the job to get stuck.
>  A similar problem also exists with {{stageIdToTaskIndices}} & 
> {{stageIdToSpeculativeTaskIndices}}.
>  If it is a failed {{taskEnd}} event of the zombie stage, this will cause 
> {{stageIdToTaskIndices}} or {{stageIdToSpeculativeTaskIndices}} to remove the 
> task index of the active stage, and the number of {{totalPendingTasks}} will 
> increase unexpectedly.




[jira] [Updated] (SPARK-27630) Stage retry causes totalRunningTasks calculation to be negative

2020-02-26 Thread Xiao Li (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-27630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27630:

Labels: release-notes  (was: )

> Stage retry causes totalRunningTasks calculation to be negative
> ---
>
> Key: SPARK-27630
> URL: https://issues.apache.org/jira/browse/SPARK-27630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Track tasks separately for each stage attempt (instead of tracking by stage), 
> and do NOT reset the numRunningTasks to 0 on StageCompleted.
> In the case of stage retry, the {{taskEnd}} event from the zombie stage 
> sometimes makes the number of {{totalRunningTasks}} negative, which will 
> cause the job to get stuck.
>  A similar problem also exists with {{stageIdToTaskIndices}} & 
> {{stageIdToSpeculativeTaskIndices}}.
>  If it is a failed {{taskEnd}} event of the zombie stage, this will cause 
> {{stageIdToTaskIndices}} or {{stageIdToSpeculativeTaskIndices}} to remove the 
> task index of the active stage, and the number of {{totalPendingTasks}} will 
> increase unexpectedly.




[jira] [Assigned] (SPARK-30590) can't use more than five type-safe user-defined aggregation in select statement

2020-02-26 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30590:
---

Assignee: L. C. Hsieh

> can't use more than five type-safe user-defined aggregation in select 
> statement
> ---
>
> Key: SPARK-30590
> URL: https://issues.apache.org/jira/browse/SPARK-30590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Daniel Mantovani
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
>  How to reproduce:
> {code:scala}
> val df = Seq((1,2,3,4,5,6)).toDF("a","b","c","d","e","f")
> import org.apache.spark.sql.expressions.Aggregator
> import org.apache.spark.sql.Encoder
> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Row
> case class FooAgg(s:Int) extends Aggregator[Row, Int, Int] {
>   def zero:Int = s
>   def reduce(b: Int, r: Row): Int = b + r.getAs[Int](0)
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(b: Int): Int = b
>   def bufferEncoder: Encoder[Int] = Encoders.scalaInt
>   def outputEncoder: Encoder[Int] = Encoders.scalaInt
> }
> val fooAgg = (i:Int) => FooAgg(i).toColumn.name(s"foo_agg_$i")
> scala> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5)).show
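> +---------+---------+---------+---------+---------+
> |foo_agg_1|foo_agg_2|foo_agg_3|foo_agg_4|foo_agg_5|
> +---------+---------+---------+---------+---------+
> |        3|        5|        7|        9|       11|
> +---------+---------+---------+---------+---------+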
> {code}
> With 6 arguments we get an error:
> {code:scala}
> scala> 
> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show
> org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
> [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, 
> assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, 
> IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, 
> None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 
> as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) 
> AS foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS 
> value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS 
> value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, 
> fooagg(FooAgg(4), None, None, None, input[0, int, false] AS value#129, 
> assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, 
> IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, 
> None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 
> as int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) 
> AS foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS 
> value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS 
> value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];;
> 'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS 
> value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS 
> value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, 
> fooagg(FooAgg(2), None, None, None, input[0, int, false] AS value#119, 
> assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, 
> IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, 
> None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 
> as int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) 
> AS foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS 
> value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS 
> value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, 
> fooagg(FooAgg(5), None, None, None, input[0, int, false] AS value#134, 
> assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, 
> IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, 
> None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 
> as int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) 
> AS foo_agg_6#141]
> +- Project [_1#6 AS a#13, _2#7 AS b#14, _3#8 AS c#15, _4#9 AS d#16, _5#10 AS 
> e#17, _6#11 AS F#18]
>  +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11]
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:431)
>  at 
>

[jira] [Created] (SPARK-30967) Achieve LAST_ACCESS_TIME column update in TBLS table of hive metastore on table access

2020-02-26 Thread NITISH SHARMA (Jira)

NITISH SHARMA created SPARK-30967:
-

 Summary: Achieve LAST_ACCESS_TIME column update in TBLS table of 
hive metastore on table access 
 Key: SPARK-30967
 URL: https://issues.apache.org/jira/browse/SPARK-30967
 Project: Spark
  Issue Type: Question
  Components: Spark Shell
Affects Versions: 2.4.5
Reporter: NITISH SHARMA


I have a requirement to update LAST_ACCESS_TIME in the TBLS table of the 
Hive metastore whenever a table is accessed through Spark. I set the property 
below in hive-site.xml; Hive honors it and updates LAST_ACCESS_TIME 
every time the table is accessed. 

<property>
    <name>hive.exec.pre.hooks</name>
    <value>org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec</value>
</property>

However, I want to achieve the same thing using pyspark/spark-shell, but it does 
not honor this Hive hooks property. Is there an alternative approach to 
achieve this: 'Update of LAST_ACCESS_TIME in hive metastore on access using 
spark'? 

I passed the property like this: 

spark-sql -e 'set 
spark.hadoop.hive.exec.post.hooks=org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;select
 * from db.table;'

I also put the same property in /etc/spark/conf/hive-site.xml. 

 




[jira] [Commented] (SPARK-30932) ML 3.0 QA: API: Java compatibility, docs

2020-02-26 Thread zhengruifeng (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-30932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046228#comment-17046228
 ] 

zhengruifeng commented on SPARK-30932:
--

I checked the added classes from {{added_ml_class}}:
 * FMClassifier and FMRegressor have related Java examples and docs;
 * RobustScaler has a related Java example and doc;
 * MultilabelClassificationEvaluator and RankingEvaluator do not have related 
Java examples; however, the other evaluators do not have examples either. *We may 
need to add some basic description in doc/ml-tuning.*
 * org.apache.spark.ml.functions has no related doc and is only used in 
{{FunctionsSuite}}; *I am not sure we should make it public;*
 * org.apache.spark.ml.{FitStart, FitEnd, LoadInstanceStart, LoadInstanceEnd, 
SaveInstanceStart, SaveInstanceEnd, TransformStart, TransformEnd} are marked 
{{Unstable}} and have no related doc;

 

> ML 3.0 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-30932
> URL: https://issues.apache.org/jira/browse/SPARK-30932
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: 1_process_script.sh, added_ml_class, common_ml_class, 
> signature.diff
>
>
> Check Java compatibility for this release:
>  * APIs in {{spark.ml}}
>  * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
>  * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
>  ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
>  *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
>  ** Check Scala objects (especially with nesting!) carefully. These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
>  ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc. (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
>  * Check for differences in generated Scala vs Java docs. E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
>  * Remember that we should not break APIs from previous releases. If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
>  * If needed for complex issues, create small Java unit tests which execute 
> each method. (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
>  * There are not great tools. In the past, this task has been done by:
>  ** Generating API docs
>  ** Building JAR and outputting the Java class signatures for MLlib
>  ** Manually inspecting and searching the docs and class signatures for issues
>  * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!




[jira] [Updated] (SPARK-30967) Achieve LAST_ACCESS_TIME column update in TBLS table of hive metastore on hive table access through pyspark

2020-02-26 Thread NITISH SHARMA (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

NITISH SHARMA updated SPARK-30967:
--
Summary: Achieve LAST_ACCESS_TIME column update in TBLS table of hive 
metastore on hive table access through pyspark  (was: Achieve LAST_ACCESS_TIME 
column update in TBLS table of hive metastore on table access )

> Achieve LAST_ACCESS_TIME column update in TBLS table of hive metastore on 
> hive table access through pyspark
> ---
>
> Key: SPARK-30967
> URL: https://issues.apache.org/jira/browse/SPARK-30967
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell
>Affects Versions: 2.4.5
>Reporter: NITISH SHARMA
>Priority: Critical
>
> I have a requirement to update LAST_ACCESS_TIME in the TBLS table of the 
> Hive metastore whenever a table is accessed through Spark. I set the property 
> below in hive-site.xml; Hive honors it and updates LAST_ACCESS_TIME 
> every time the table is accessed. 
> <property>
>     <name>hive.exec.pre.hooks</name>
>     <value>org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec</value>
> </property>
> However, I want to achieve the same thing using pyspark/spark-shell, but it 
> does not honor this Hive hooks property. Is there an alternative approach to 
> achieve this: 'Update of LAST_ACCESS_TIME in hive metastore on access 
> using spark'? 
> I passed the property like this: 
> spark-sql -e 'set 
> spark.hadoop.hive.exec.post.hooks=org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;select
>  * from db.table;'
> I also put the same property in /etc/spark/conf/hive-site.xml. 
>  




[jira] [Resolved] (SPARK-30590) can't use more than five type-safe user-defined aggregation in select statement

2020-02-26 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30590.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27499
[https://github.com/apache/spark/pull/27499]

> can't use more than five type-safe user-defined aggregation in select 
> statement
> ---
>
> Key: SPARK-30590
> URL: https://issues.apache.org/jira/browse/SPARK-30590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Daniel Mantovani
>Priority: Major
> Fix For: 3.0.0
>
>
>  How to reproduce:
> {code:scala}
> val df = Seq((1,2,3,4,5,6)).toDF("a","b","c","d","e","f")
> import org.apache.spark.sql.expressions.Aggregator
> import org.apache.spark.sql.Encoder
> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Row
> case class FooAgg(s:Int) extends Aggregator[Row, Int, Int] {
>   def zero:Int = s
>   def reduce(b: Int, r: Row): Int = b + r.getAs[Int](0)
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(b: Int): Int = b
>   def bufferEncoder: Encoder[Int] = Encoders.scalaInt
>   def outputEncoder: Encoder[Int] = Encoders.scalaInt
> }
> val fooAgg = (i:Int) => FooAgg(i).toColumn.name(s"foo_agg_$i")
> scala> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5)).show
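> +---------+---------+---------+---------+---------+
> |foo_agg_1|foo_agg_2|foo_agg_3|foo_agg_4|foo_agg_5|
> +---------+---------+---------+---------+---------+
> |        3|        5|        7|        9|       11|
> +---------+---------+---------+---------+---------+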
> {code}
> With 6 arguments we get an error:
> {code:scala}
> scala> 
> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show
> org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
> [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, 
> assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, 
> IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, 
> None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 
> as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) 
> AS foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS 
> value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS 
> value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, 
> fooagg(FooAgg(4), None, None, None, input[0, int, false] AS value#129, 
> assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, 
> IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, 
> None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 
> as int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) 
> AS foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS 
> value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS 
> value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];;
> 'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS 
> value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS 
> value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, 
> fooagg(FooAgg(2), None, None, None, input[0, int, false] AS value#119, 
> assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, 
> IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, 
> None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 
> as int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) 
> AS foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS 
> value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS 
> value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, 
> fooagg(FooAgg(5), None, None, None, input[0, int, false] AS value#134, 
> assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, 
> IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, 
> None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 
> as int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) 
> AS foo_agg_6#141]
> +- Project [_1#6 AS a#13, _2#7 AS b#14, _3#8 AS c#15, _4#9 AS d#16, _5#10 AS 
> e#17, _6#11 AS F#18]
>  +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11]
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
>  at 
>

[jira] [Created] (SPARK-30966) spark.createDataFrame fails with pandas DataFrame including pandas.NA

2020-02-26 Thread Aki Ariga (Jira)

Aki Ariga created SPARK-30966:
-

 Summary: spark.createDataFrame fails with pandas DataFrame 
including pandas.NA 
 Key: SPARK-30966
 URL: https://issues.apache.org/jira/browse/SPARK-30966
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.5
Reporter: Aki Ariga


As of pandas 1.0.0, pandas.NA was introduced, which breaks the createDataFrame 
function as follows:


{code:python}
In [5]: from pyspark.sql import SparkSession

In [6]: spark = SparkSession.builder.getOrCreate()

In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", "true")

In [8]: import numpy as np
   ...: import pandas as pd

In [12]: pdf = pd.DataFrame(data=[{'a':1,'b':2}, {'a':3,'b':4,'c':5}], 
dtype=pd.Int64Dtype())

In [16]: pdf
Out[16]:
   a  b     c
0  1  2  <NA>
1  3  4     5

In [13]: sdf = spark.createDataFrame(pdf)
/Users/ariga/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/session.py:714:
 UserWarning: createDataFrame attempted Arrow optimization because 
'spark.sql.execution.arrow.enabled' is set to true; however, failed by the 
reason below:
  Did not pass numpy.dtype object
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is 
set to true.
  warnings.warn(msg)
---
TypeError Traceback (most recent call last)
 in 
> 1 sdf = spark.createDataFrame(df2)

~/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/session.py in 
createDataFrame(self, data, schema, samplingRatio, verifySchema)
746 rdd, schema = self._createFromRDD(data.map(prepare), 
schema, samplingRatio)
747 else:
--> 748 rdd, schema = self._createFromLocal(map(prepare, data), 
schema)
749 jrdd = 
self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
750 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), 
schema.json())

~/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/session.py in 
_createFromLocal(self, data, schema)
414
415 if schema is None or isinstance(schema, (list, tuple)):
--> 416 struct = self._inferSchemaFromList(data, names=schema)
417 converter = _create_converter(struct)
418 data = map(converter, data)

~/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/session.py in 
_inferSchemaFromList(self, data, names)
346 warnings.warn("inferring schema from dict is deprecated,"
347   "please use pyspark.sql.Row instead")
--> 348 schema = reduce(_merge_type, (_infer_schema(row, names) for row 
in data))
349 if _has_nulltype(schema):
350 raise ValueError("Some of types cannot be determined after 
inferring")

~/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/types.py in 
_merge_type(a, b, name)
   1099 fields = [StructField(f.name, _merge_type(f.dataType, 
nfs.get(f.name, NullType()),
   1100   
name=new_name(f.name)))
-> 1101   for f in a.fields]
   1102 names = set([f.name for f in fields])
   1103 for n in nfs:

~/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/types.py in 
(.0)
   1099 fields = [StructField(f.name, _merge_type(f.dataType, 
nfs.get(f.name, NullType()),
   1100   
name=new_name(f.name)))
-> 1101   for f in a.fields]
   1102 names = set([f.name for f in fields])
   1103 for n in nfs:

~/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/types.py in 
_merge_type(a, b, name)
   1092 elif type(a) is not type(b):
   1093 # TODO: type cast (such as int -> long)
-> 1094 raise TypeError(new_msg("Can not merge type %s and %s" % 
(type(a), type(b
   1095
   1096 # same type

TypeError: field c: Can not merge type  
and 

In [15]: pyspark.__version__
Out[15]: '2.4.5'
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-30932) ML 3.0 QA: API: Java compatibility, docs

2020-02-26 Thread zhengruifeng (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-30932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045168#comment-17045168
 ] 

zhengruifeng edited comment on SPARK-30932 at 2/27/20 4:05 AM:
---

I checked the output and did not find any outstanding issues, -except:-

-public method {{GeneralMLWriter#context}} was removed in SPARK-25908 without 
deprecation.-

-{{GeneralMLWriter#context}}- was already deprecated in the doc, so this is not 
a problem.


was (Author: podongfeng):
I check the output result and do not find outstanding issue, except:

public method \{{GeneralMLWriter#context}} was removed in Spark-25908 without 
deprecation.

> ML 3.0 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-30932
> URL: https://issues.apache.org/jira/browse/SPARK-30932
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: 1_process_script.sh, added_ml_class, common_ml_class, 
> signature.diff
>
>
> Check Java compatibility for this release:
>  * APIs in {{spark.ml}}
>  * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
>  * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
>  ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
>  *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
>  ** Check Scala objects (especially with nesting!) carefully. These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
>  ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc. (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
>  * Check for differences in generated Scala vs Java docs. E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
>  * Remember that we should not break APIs from previous releases. If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
>  * If needed for complex issues, create small Java unit tests which execute 
> each method. (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
>  * There are not great tools. In the past, this task has been done by:
>  ** Generating API docs
>  ** Building JAR and outputting the Java class signatures for MLlib
>  ** Manually inspecting and searching the docs and class signatures for issues
>  * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!




[jira] [Resolved] (SPARK-30963) Add GitHub Action job for document generation

2020-02-26 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30963.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27715
[https://github.com/apache/spark/pull/27715]

> Add GitHub Action job for document generation
> -
>
> Key: SPARK-30963
> URL: https://issues.apache.org/jira/browse/SPARK-30963
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>





[jira] [Created] (SPARK-30965) Support C++ library to load Spark MLlib model

2020-02-26 Thread Mr.Nineteen (Jira)

Mr.Nineteen created SPARK-30965:
---

 Summary: Support C++ library to load Spark MLlib model
 Key: SPARK-30965
 URL: https://issues.apache.org/jira/browse/SPARK-30965
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 2.3.4, 2.3.3, 
2.3.2, 2.3.1, 2.3.0
Reporter: Mr.Nineteen







[jira] [Assigned] (SPARK-30888) Add version information to the configuration of Network

2020-02-26 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30888:


Assignee: jiaan.geng

> Add version information to the configuration of Network
> ---
>
> Key: SPARK-30888
> URL: https://issues.apache.org/jira/browse/SPARK-30888
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> spark/core/src/main/scala/org/apache/spark/internal/config/Network.scala




[jira] [Resolved] (SPARK-30888) Add version information to the configuration of Network

2020-02-26 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30888.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27674
[https://github.com/apache/spark/pull/27674]

> Add version information to the configuration of Network
> ---
>
> Key: SPARK-30888
> URL: https://issues.apache.org/jira/browse/SPARK-30888
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>
> spark/core/src/main/scala/org/apache/spark/internal/config/Network.scala




[jira] [Resolved] (SPARK-30841) Add version information to the configuration of SQL

2020-02-26 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30841.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27691
[https://github.com/apache/spark/pull/27691]

> Add version information to the configuration of SQL
> ---
>
> Key: SPARK-30841
> URL: https://issues.apache.org/jira/browse/SPARK-30841
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>





[jira] [Assigned] (SPARK-30841) Add version information to the configuration of SQL

2020-02-26 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30841:


Assignee: jiaan.geng

> Add version information to the configuration of SQL
> ---
>
> Key: SPARK-30841
> URL: https://issues.apache.org/jira/browse/SPARK-30841
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>





[jira] [Assigned] (SPARK-30909) Add version information to the configuration of Python

2020-02-26 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30909:


Assignee: jiaan.geng

> Add version information to the configuration of Python
> --
>
> Key: SPARK-30909
> URL: https://issues.apache.org/jira/browse/SPARK-30909
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> core/src/main/scala/org/apache/spark/internal/config/Python.scala




[jira] [Resolved] (SPARK-30909) Add version information to the configuration of Python

2020-02-26 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30909.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27704
[https://github.com/apache/spark/pull/27704]

> Add version information to the configuration of Python
> --
>
> Key: SPARK-30909
> URL: https://issues.apache.org/jira/browse/SPARK-30909
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>
> core/src/main/scala/org/apache/spark/internal/config/Python.scala




[jira] [Resolved] (SPARK-30910) Add version information to the configuration of R

2020-02-26 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30910.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27708
[https://github.com/apache/spark/pull/27708]

> Add version information to the configuration of R
> -
>
> Key: SPARK-30910
> URL: https://issues.apache.org/jira/browse/SPARK-30910
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>
> core/src/main/scala/org/apache/spark/internal/config/R.scala




[jira] [Assigned] (SPARK-30910) Add version information to the configuration of R

2020-02-26 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30910:


Assignee: jiaan.geng

> Add version information to the configuration of R
> -
>
> Key: SPARK-30910
> URL: https://issues.apache.org/jira/browse/SPARK-30910
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> core/src/main/scala/org/apache/spark/internal/config/R.scala




[jira] [Resolved] (SPARK-30928) ML, GraphX 3.0 QA: API: Binary incompatible changes

2020-02-26 Thread Sean R. Owen (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30928.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27696
[https://github.com/apache/spark/pull/27696]

> ML, GraphX 3.0 QA: API: Binary incompatible changes
> ---
>
> Key: SPARK-30928
> URL: https://issues.apache.org/jira/browse/SPARK-30928
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.




[jira] [Assigned] (SPARK-30928) ML, GraphX 3.0 QA: API: Binary incompatible changes

2020-02-26 Thread Sean R. Owen (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30928:


Assignee: Huaxin Gao

> ML, GraphX 3.0 QA: API: Binary incompatible changes
> ---
>
> Key: SPARK-30928
> URL: https://issues.apache.org/jira/browse/SPARK-30928
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.




[jira] [Created] (SPARK-30964) Accelerate InMemoryStore with a new index

2020-02-26 Thread Gengliang Wang (Jira)

Gengliang Wang created SPARK-30964:
--

 Summary: Accelerate InMemoryStore with a new index
 Key: SPARK-30964
 URL: https://issues.apache.org/jira/browse/SPARK-30964
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 3.1.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Spark uses the class `InMemoryStore` as the KV storage for the live UI and the 
history server (by default, if no LevelDB file path is provided).
In `InMemoryStore`, all the task data of one application is stored in a 
hashmap whose key is the task ID and whose value is the task data. This is fine 
for getting or deleting an entry with a provided task ID.
However, the Spark stage UI always shows all the task data of one stage, and 
the current implementation looks through all the values in the hashmap to find 
them. The time complexity is O(numOfTasks). 
Also, when there are too many stages (>spark.ui.retainedStages), Spark 
linearly scans all the task data to find the tasks of the stages to be deleted.

This can be very slow for a large application with many stages and tasks. We 
can improve it by allowing the natural key of an entity to have a real parent 
index, so that on each lookup with a parent node provided, Spark can first look 
up all the natural keys (in our case, the task IDs) and then find the data for 
those natural keys in the hashmap.
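
To make the proposal concrete, here is a minimal sketch of a parent index in 
plain Python. It is not the actual `InMemoryStore` code (which is Java); the 
class name `IndexedTaskStore` and all of its methods are hypothetical and only 
illustrate keeping a second map from the parent key (stage ID) to the natural 
keys (task IDs) next to the main hashmap.

```python
from collections import defaultdict


class IndexedTaskStore:
    """Toy KV store with a parent index (stage ID -> task IDs). Hypothetical names."""

    def __init__(self):
        self._tasks = {}                    # natural key (task ID) -> (stage ID, data)
        self._by_stage = defaultdict(set)   # parent key (stage ID) -> set of task IDs

    def put(self, task_id, stage_id, data):
        # If the task was stored under another stage, drop the stale index entry.
        old = self._tasks.get(task_id)
        if old is not None and old[0] != stage_id:
            self._by_stage[old[0]].discard(task_id)
        self._tasks[task_id] = (stage_id, data)
        self._by_stage[stage_id].add(task_id)

    def get(self, task_id):
        return self._tasks[task_id][1]

    def tasks_of_stage(self, stage_id):
        # Cost is proportional to the tasks of this stage, not to all tasks.
        ids = self._by_stage.get(stage_id, ())
        return {tid: self._tasks[tid][1] for tid in ids}

    def delete_stage(self, stage_id):
        # Remove the index entry first, then the task data it points to.
        for tid in self._by_stage.pop(stage_id, ()):
            del self._tasks[tid]


store = IndexedTaskStore()
store.put(1, "stage-0", "task 1 data")
store.put(2, "stage-0", "task 2 data")
store.put(3, "stage-1", "task 3 data")
stage0 = store.tasks_of_stage("stage-0")
store.delete_stage("stage-0")
```

The trade-off is extra memory and a little bookkeeping on every write in 
exchange for stage-level reads and deletes that no longer scan every task.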






[jira] [Commented] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-02-26 Thread Bryan Cutler (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045961#comment-17045961
 ] 

Bryan Cutler commented on SPARK-30961:
--

[~nicornk] there were a number of fixes related to Arrow that went into the 
master branch for 3.0.0 and not branch-2.4, notably SPARK-26887 and SPARK-26566 
for the date issue. The latter was an upgrade of Arrow, and it is not the usual 
policy to backport upgrades. I would recommend using an older version of 
pyarrow with Spark. Version 0.8.0 would be best, but you might be able to use 
0.11.1 without issues.

> Arrow enabled: to_pandas with date column fails
> ---
>
> Key: SPARK-30961
> URL: https://issues.apache.org/jira/browse/SPARK-30961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5
>Reporter: Nicolas Renkamp
>Priority: Major
>  Labels: ready-to-commit
>
> Hi,
> there seems to be a bug in the arrow enabled to_pandas conversion from spark 
> dataframe to pandas dataframe when the dataframe has a column of type 
> DateType. Here is a minimal example to reproduce the issue:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
>
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
>     [['2019-12-06']], 'created_at: string') \
>     .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas()
> {code}
> A fix would be to modify the _check_series_convert_date function in 
> pyspark.sql.types to:
> {code:python}
> def _check_series_convert_date(series, data_type):
>     """
>     Cast the series to datetime.date if it's a date type, otherwise returns
>     the original series.
>
>     :param series: pandas.Series
>     :param data_type: a Spark data type for the series
>     """
>     from pyspark.sql.utils import require_minimum_pandas_version
>     require_minimum_pandas_version()
>
>     from pandas import to_datetime
>     if type(data_type) == DateType:
>         return to_datetime(series).dt.date
>     else:
>         return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch.
> I have not tested the behavior on master branch.
>  
> Thanks,
> Nicolas
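
For readers without a Spark setup, the pandas side of the report can be 
reproduced on its own. The sketch below only assumes pandas is installed; 
`convert_date_series` is a hypothetical helper name, not a PySpark function. 
It shows that an object-dtype series of dates has no `.dt` accessor, and that 
going through `to_datetime` first (as the proposed fix does) makes the 
conversion work.

```python
import datetime

import pandas as pd


def convert_date_series(series):
    """Coerce to datetime64 first, then take the date part (hypothetical helper)."""
    return pd.to_datetime(series).dt.date


# An object-dtype series holding datetime.date values, similar to what the
# report describes for the Arrow path.
raw = pd.Series([datetime.date(2019, 12, 6)], dtype=object)

try:
    raw.dt              # pandas rejects .dt on a plain object-dtype series
    has_dt = True
except AttributeError:
    has_dt = False

converted = convert_date_series(raw)
```

Whether this is the right fix for branch-2.4 is a separate question; the sketch 
only demonstrates the pandas behaviour that the proposed change relies on.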




[jira] [Updated] (SPARK-30759) The cache in StringRegexExpression is not initialized for foldable patterns

2020-02-26 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30759:
--
Fix Version/s: (was: 3.1.0)
   2.4.6
   3.0.0

> The cache in StringRegexExpression is not initialized for foldable patterns
> ---
>
> Key: SPARK-30759
> URL: https://issues.apache.org/jira/browse/SPARK-30759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0, 2.4.6
>
> Attachments: Screen Shot 2020-02-08 at 22.45.50.png
>
>
> In the case of foldable patterns, the cache in StringRegexExpression should 
> be evaluated once but in fact it is compiled every time. Here is the example:
> {code:sql}
> SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*';
> {code}
> the code 
> https://github.com/apache/spark/blob/8aebc80e0e67bcb1aa300b8c8b1a209159237632/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L45-L48:
> {code:scala}
>   // try cache the pattern for Literal
>   private lazy val cache: Pattern = pattern match {
>     case Literal(value: String, StringType) => compile(value)
>     case _ => null
>   }
> {code}
> The attached screenshot shows that foldable expression doesn't fall to the 
> first case.
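
The effect described above can be illustrated outside of Spark. The following 
is a toy model in Python, not the Catalyst code: `Literal`, `Concat`, `Column` 
and `RegexMatch` are hypothetical stand-ins for expression nodes. It caches the 
compiled pattern whenever the pattern expression is foldable (constant), 
instead of only when it is a plain literal, and counts how often a pattern is 
compiled.

```python
import re


class Literal:
    foldable = True

    def __init__(self, value):
        self.value = value

    def eval(self, row=None):
        return self.value


class Concat:
    """Constant only if every child is constant."""

    def __init__(self, *children):
        self.children = children

    @property
    def foldable(self):
        return all(c.foldable for c in self.children)

    def eval(self, row=None):
        return "".join(c.eval(row) for c in self.children)


class Column:
    foldable = False

    def __init__(self, name):
        self.name = name

    def eval(self, row=None):
        return row[self.name]


class RegexMatch:
    def __init__(self, subject, pattern):
        self.subject = subject
        self.pattern = pattern
        self.compile_count = 0
        # Cache for any foldable pattern, not just for a bare Literal.
        self._cache = self._compile(pattern.eval()) if pattern.foldable else None

    def _compile(self, text):
        self.compile_count += 1
        return re.compile(text)

    def eval(self, row):
        compiled = self._cache
        if compiled is None:
            # Non-constant pattern: it has to be compiled for every row.
            compiled = self._compile(self.pattern.eval(row))
        return compiled.search(self.subject.eval(row)) is not None


rows = [{"s": "aab", "p": "a+b"}, {"s": "xyz", "p": "a+b"}, {"s": "ab", "p": "a+b"}]

constant = RegexMatch(Column("s"), Concat(Literal("a+"), Literal("b")))
constant_results = [constant.eval(r) for r in rows]

per_row = RegexMatch(Column("s"), Column("p"))
per_row_results = [per_row.eval(r) for r in rows]
```

In this model a pattern built from constants is compiled once, while a pattern 
read from a column is compiled once per row.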




[jira] [Updated] (SPARK-30759) The cache in StringRegexExpression is not initialized for foldable patterns

2020-02-26 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30759:
--
Affects Version/s: (was: 3.1.0)
   1.6.3
   2.0.2
   2.1.3
   2.2.3
   2.3.4
   2.4.5

> The cache in StringRegexExpression is not initialized for foldable patterns
> ---
>
> Key: SPARK-30759
> URL: https://issues.apache.org/jira/browse/SPARK-30759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0, 2.4.6
>
> Attachments: Screen Shot 2020-02-08 at 22.45.50.png
>
>
> In the case of foldable patterns, the cache in StringRegexExpression should 
> be evaluated once but in fact it is compiled every time. Here is the example:
> {code:sql}
> SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*';
> {code}
> the code 
> https://github.com/apache/spark/blob/8aebc80e0e67bcb1aa300b8c8b1a209159237632/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L45-L48:
> {code:scala}
>   // try cache the pattern for Literal
>   private lazy val cache: Pattern = pattern match {
>     case Literal(value: String, StringType) => compile(value)
>     case _ => null
>   }
> {code}
> The attached screenshot shows that foldable expression doesn't fall to the 
> first case.




[jira] [Updated] (SPARK-30759) The cache in StringRegexExpression is not initialized for foldable patterns

2020-02-26 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30759:
--
Issue Type: Bug  (was: Improvement)

> The cache in StringRegexExpression is not initialized for foldable patterns
> ---
>
> Key: SPARK-30759
> URL: https://issues.apache.org/jira/browse/SPARK-30759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: Screen Shot 2020-02-08 at 22.45.50.png
>
>
> In the case of foldable patterns, the cache in StringRegexExpression should 
> be evaluated once but in fact it is compiled every time. Here is the example:
> {code:sql}
> SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*';
> {code}
> the code 
> https://github.com/apache/spark/blob/8aebc80e0e67bcb1aa300b8c8b1a209159237632/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L45-L48:
> {code:scala}
>   // try cache the pattern for Literal
>   private lazy val cache: Pattern = pattern match {
>     case Literal(value: String, StringType) => compile(value)
>     case _ => null
>   }
> {code}
> The attached screenshot shows that foldable expression doesn't fall to the 
> first case.




[jira] [Assigned] (SPARK-30963) Add GitHub Action job for document generation

2020-02-26 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30963:
-

Assignee: Dongjoon Hyun

> Add GitHub Action job for document generation
> -
>
> Key: SPARK-30963
> URL: https://issues.apache.org/jira/browse/SPARK-30963
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>





[jira] [Commented] (SPARK-30931) ML 3.0 QA: API: Python API coverage

2020-02-26 Thread Huaxin Gao (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045942#comment-17045942
 ] 

Huaxin Gao commented on SPARK-30931:


I didn't see any parity issues in the code, but some of the Python docs are not 
exactly the same as the Scala docs. I will open a Jira for the docs problems. 

 

> ML 3.0 QA: API: Python API coverage
> ---
>
> Key: SPARK-30931
> URL: https://issues.apache.org/jira/browse/SPARK-30931
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
>  * *GOAL*: Audit and create JIRAs to fix in the next release.
>  * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
>  * Inconsistency: Do class/method/parameter names match?
>  * Docs: Is the Python doc missing or just a stub? We want the Python doc to 
> be as complete as the Scala doc.
>  * API breaking changes: These should be very rare but are occasionally 
> either necessary (intentional) or accidental. These must be recorded and 
> added in the Migration Guide for this release.
>  ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
>  * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle. 
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*




[jira] [Updated] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2020-02-26 Thread Xiao Li (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26311:

Labels: release-notes  (was: )

> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>  Labels: release-notes
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log 
> service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-26329) ExecutorMetrics should poll faster than heartbeats

2020-02-26 Thread Xiao Li (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045938#comment-17045938
 ] 

Xiao Li commented on SPARK-26329:
-

This breaks the developer API

 
{code:java}
ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.scheduler.SparkListenerTaskEnd$"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerTaskEnd.apply"),
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.scheduler.SparkListenerTaskEnd.copy$default$6"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerTaskEnd.copy"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerTaskEnd.this"),{code}

> ExecutorMetrics should poll faster than heartbeats
> --
>
> Key: SPARK-26329
> URL: https://issues.apache.org/jira/browse/SPARK-26329
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Wing Yew Poon
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> We should allow faster polling of the executor memory metrics (SPARK-23429 / 
> SPARK-23206) without requiring a faster heartbeat rate.  We've seen the 
> memory usage of executors spike over 1 GB in less than a second, but 
> heartbeats are only every 10 seconds (by default).  Spark needs to enable 
> fast polling to capture these peaks, without causing too much strain on the 
> system.
> In the current implementation, the metrics are polled along with the 
> heartbeat, but this leads to a slow rate of polling metrics by default.  If 
> users were to increase the rate of the heartbeat, they risk overloading the 
> driver on a large cluster, with too many messages and too much work to 
> aggregate the metrics.  But, the executor could poll the metrics more 
> frequently, and still only send the *max* since the last heartbeat for each 
> metric.  This keeps the load on the driver the same, and only introduces a 
> small overhead on the executor to grab the metrics and keep the max.
> The downside of this approach is that we still need to wait for the next 
> heartbeat for the driver to be aware of the new peak.   If the executor dies 
> or is killed before then, then we won't find out.  A potential future 
> enhancement would be to send an update *anytime* there is an increase by some 
> percentage, but we'll leave that out for now.
> Another possibility would be to change the metrics themselves to track peaks 
> for us, so we don't have to fine-tune the polling rate.  For example, some 
> jvm metrics provide a usage threshold, and notification: 
> https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#UsageThreshold
> But, that is not available on all metrics.  This proposal gives us a generic 
> way to get a more accurate peak memory usage for *all* metrics.
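
The polling scheme described above (poll often on the executor, report only the per-metric maximum at each heartbeat) can be sketched in a few lines. This is a minimal illustration in plain Python, not Spark's implementation; the class name PeakTracker, its methods, and the sample values are assumptions made for the example.

```python
import threading


class PeakTracker:
    """Hypothetical helper: keeps the per-metric maximum seen since the last heartbeat."""

    def __init__(self):
        self._lock = threading.Lock()
        self._peaks = {}

    def poll(self, sample):
        """Record one fast poll; `sample` maps a metric name to its current value."""
        with self._lock:
            for name, value in sample.items():
                if value > self._peaks.get(name, float("-inf")):
                    self._peaks[name] = value

    def heartbeat(self):
        """Return the peaks gathered since the previous heartbeat and reset them."""
        with self._lock:
            peaks, self._peaks = self._peaks, {}
        return peaks


tracker = PeakTracker()

# Three fast polls between two heartbeats (metric names and values are made up).
for sample in (
    {"JVMHeapMemory": 400, "JVMOffHeapMemory": 50},
    {"JVMHeapMemory": 1500, "JVMOffHeapMemory": 40},
    {"JVMHeapMemory": 300, "JVMOffHeapMemory": 60},
):
    tracker.poll(sample)

first = tracker.heartbeat()   # carries the short-lived spike of 1500
second = tracker.heartbeat()  # nothing was polled since the last heartbeat
```

The driver still receives one message per heartbeat, so its load is unchanged; only the executor does the extra work of polling and comparing. A peak reached just before the executor dies is lost, as noted above.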



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-26329) ExecutorMetrics should poll faster than heartbeats

2020-02-26 Thread Xiao Li (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-26329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26329:

Labels: release-notes  (was: )

> ExecutorMetrics should poll faster than heartbeats
> --
>
> Key: SPARK-26329
> URL: https://issues.apache.org/jira/browse/SPARK-26329
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Wing Yew Poon
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> We should allow faster polling of the executor memory metrics (SPARK-23429 / 
> SPARK-23206) without requiring a faster heartbeat rate.  We've seen the 
> memory usage of executors spike over 1 GB in less than a second, but 
> heartbeats are only every 10 seconds (by default).  Spark needs to enable 
> fast polling to capture these peaks, without causing too much strain on the 
> system.
> In the current implementation, the metrics are polled along with the 
> heartbeat, but this leads to a slow rate of polling metrics by default.  If 
> users were to increase the rate of the heartbeat, they risk overloading the 
> driver on a large cluster, with too many messages and too much work to 
> aggregate the metrics.  But, the executor could poll the metrics more 
> frequently, and still only send the *max* since the last heartbeat for each 
> metric.  This keeps the load on the driver the same, and only introduces a 
> small overhead on the executor to grab the metrics and keep the max.
> The downside of this approach is that we still need to wait for the next 
> heartbeat for the driver to be aware of the new peak.   If the executor dies 
> or is killed before then, then we won't find out.  A potential future 
> enhancement would be to send an update *anytime* there is an increase by some 
> percentage, but we'll leave that out for now.
> Another possibility would be to change the metrics themselves to track peaks 
> for us, so we don't have to fine-tune the polling rate.  For example, some 
> jvm metrics provide a usage threshold, and notification: 
> https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#UsageThreshold
> But, that is not available on all metrics.  This proposal gives us a generic 
> way to get a more accurate peak memory usage for *all* metrics.




[jira] [Updated] (SPARK-30931) ML 3.0 QA: API: Python API coverage

2020-02-26 Thread Huaxin Gao (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-30931:
---
Attachment: (was: image-2020-02-26-10-45-59-175.png)

> ML 3.0 QA: API: Python API coverage
> ---
>
> Key: SPARK-30931
> URL: https://issues.apache.org/jira/browse/SPARK-30931
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
>  * *GOAL*: Audit and create JIRAs to fix in the next release.
>  * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
>  * Inconsistency: Do class/method/parameter names match?
>  * Docs: Is the Python doc missing or just a stub? We want the Python doc to 
> be as complete as the Scala doc.
>  * API breaking changes: These should be very rare but are occasionally 
> either necessary (intentional) or accidental. These must be recorded and 
> added in the Migration Guide for this release.
>  ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
>  * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle. 
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*




[jira] [Commented] (SPARK-27006) SPIP: .NET bindings for Apache Spark

2020-02-26 Thread Kyle Van Saders (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-27006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045867#comment-17045867
 ] 

Kyle Van Saders commented on SPARK-27006:
-

It makes perfect sense for .NET developers who are already familiar with LINQ 
to be able to query Spark via DataSets or DataFrames.

> SPIP: .NET bindings for Apache Spark
> 
>
> Key: SPARK-27006
> URL: https://issues.apache.org/jira/browse/SPARK-27006
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>   Original Estimate: 4,032h
>  Remaining Estimate: 4,032h
>
> h4. Background and Motivation: 
> Apache Spark provides programming language support for Scala/Java (native), 
> and extensions for Python and R. While a variety of other language extensions 
> are possible to include in Apache Spark, .NET would bring one of the largest 
> developer communities to the table. Presently, no good Big Data solution exists 
> for .NET developers in open source.  This SPIP aims at discussing how we can 
> bring Apache Spark goodness to the .NET development platform.  
> .NET is a free, cross-platform, open source developer platform for building 
> many different types of applications. With .NET, you can use multiple 
> languages, editors, and libraries to build for web, mobile, desktop, gaming, 
> and IoT types of applications. Even with .NET serving millions of developers, 
> there is no good Big Data solution that exists today, which this SPIP aims to 
> address.  
> The .NET developer community is one of the largest programming language 
> communities in the world. Its flagship programming language C# is listed as 
> one of the most popular programming languages in a variety of articles and 
> statistics: 
>  * Most popular Technologies on Stack Overflow: 
> [https://insights.stackoverflow.com/survey/2018/#most-popular-technologies|https://insights.stackoverflow.com/survey/2018/]
>   
>  * Most popular languages on GitHub 2018: 
> [https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10#2-java-9|https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10]
>  
>  * 1M+ new developers last 1 year  
>  * Second most demanded technology on LinkedIn 
>  * Top 30 High velocity OSS projects on GitHub 
> Including a C# language extension in Apache Spark will enable millions of 
> .NET developers to author Big Data applications in their preferred 
> programming language, developer environment, and tooling support. We aim to 
> promote the .NET bindings for Spark through engagements with the Spark 
> community (e.g., we are scheduled to present an early prototype at the SF 
> Spark Summit 2019) and the .NET developer community (e.g., similar 
> presentations will be held at .NET developer conferences this year).  As 
> such, we believe that our efforts will help grow the Spark community by 
> making it accessible to the millions of .NET developers. 
> Furthermore, our early discussions with some large .NET development teams got 
> an enthusiastic reception. 
> We recognize that earlier attempts at this goal (specifically Mobius 
> [https://github.com/Microsoft/Mobius]) were unsuccessful primarily due to the 
> lack of communication with the Spark community. Therefore, another goal of 
> this proposal is to not only develop .NET bindings for Spark in open source, 
> but also continuously seek feedback from the Spark community via posted 
> Jiras (like this one) and the Spark developer mailing list. Our hope is that 
> through these engagements, we can build a community of developers that are 
> eager to contribute to this effort or want to leverage the resulting .NET 
> bindings for Spark in their respective Big Data applications. 
> h4. Target Personas: 
> .NET developers looking to build big data solutions.  
> h4. Goals: 
> Our primary goal is to help grow Apache Spark by making it accessible to the 
> large .NET developer base and ecosystem. We will also look for opportunities 
> to generalize the interop layers for Spark for adding other language 
> extensions in the future. 
> [SPARK-26257](https://issues.apache.org/jira/browse/SPARK-26257) proposes such a 
> generalized interop layer, which we hope to address over the course of this 
> project.  
> Another important goal for us is to not only enable Spark as an application 
> solution for .NET developers, but also to open the door for .NET developers 
> to make contributions to Apache Spark itself.   
> Lastly, we aim to develop a .NET extension in the open, while continually 
> engaging with the Spark community for feedback on designs and code. We will 
> welcome PRs from the Spark community throughout this project and aim to grow 
> a

[jira] [Created] (SPARK-30963) Add GitHub Action job for document generation

2020-02-26 Thread Dongjoon Hyun (Jira)

Dongjoon Hyun created SPARK-30963:
-

 Summary: Add GitHub Action job for document generation
 Key: SPARK-30963
 URL: https://issues.apache.org/jira/browse/SPARK-30963
 Project: Spark
  Issue Type: Test
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun







[jira] [Updated] (SPARK-30931) ML 3.0 QA: API: Python API coverage

2020-02-26 Thread Huaxin Gao (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-30931:
---
Attachment: image-2020-02-26-10-45-59-175.png

> ML 3.0 QA: API: Python API coverage
> ---
>
> Key: SPARK-30931
> URL: https://issues.apache.org/jira/browse/SPARK-30931
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: image-2020-02-26-10-45-59-175.png
>
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
>  * *GOAL*: Audit and create JIRAs to fix in the next release.
>  * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
>  * Inconsistency: Do class/method/parameter names match?
>  * Docs: Is the Python doc missing or just a stub? We want the Python doc to 
> be as complete as the Scala doc.
>  * API breaking changes: These should be very rare but are occasionally 
> either necessary (intentional) or accidental. These must be recorded and 
> added in the Migration Guide for this release.
>  ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
>  * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle. 
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*




[jira] [Commented] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]

2020-02-26 Thread Xiao Li (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045771#comment-17045771
 ] 

Xiao Li commented on SPARK-30962:
-

Also, please give some examples to show how we can alter the table's comments 
and the columns' comments.

> Document ALTER TABLE statement in SQL Reference [Phase 2]
> -
>
> Key: SPARK-30962
> URL: https://issues.apache.org/jira/browse/SPARK-30962
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of 
> ALTER TABLE statements. See the doc in preview-2 
> [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html]
>  
> We should add all the supported ALTER TABLE syntax. See 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198]




[jira] [Updated] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]

2020-02-26 Thread Xiao Li (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30962:

Target Version/s: 3.0.0

> Document ALTER TABLE statement in SQL Reference [Phase 2]
> -
>
> Key: SPARK-30962
> URL: https://issues.apache.org/jira/browse/SPARK-30962
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of 
> ALTER TABLE statements. See the doc in preview-2 
> [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html]
>  
> We should add all the supported ALTER TABLE syntax. See 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198]




[jira] [Commented] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]

2020-02-26 Thread Xiao Li (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045770#comment-17045770
 ] 

Xiao Li commented on SPARK-30962:
-

cc [~huaxing] [~dkbiswal]

> Document ALTER TABLE statement in SQL Reference [Phase 2]
> -
>
> Key: SPARK-30962
> URL: https://issues.apache.org/jira/browse/SPARK-30962
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of 
> ALTER TABLE statements. See the doc in preview-2 
> [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html]
>  
> We should add all the supported ALTER TABLE syntax. See 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198]




[jira] [Updated] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]

2020-02-26 Thread Xiao Li (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30962:

Description: 
https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of ALTER 
TABLE statements. See the doc in preview-2 
[https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html]
 

We should add all the supported ALTER TABLE syntax. See 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198]

  was:
https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of ALTER 
TABLE statements. See the doc in preview-2 
[https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html]

 

 

We should add all the supported ALTER TABLE syntax. See 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198]


> Document ALTER TABLE statement in SQL Reference [Phase 2]
> -
>
> Key: SPARK-30962
> URL: https://issues.apache.org/jira/browse/SPARK-30962
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of 
> ALTER TABLE statements. See the doc in preview-2 
> [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html]
>  
> We should add all the supported ALTER TABLE syntax. See 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198]




[jira] [Updated] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]

2020-02-26 Thread Xiao Li (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30962:

Description: 
https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of ALTER 
TABLE statements. See the doc in preview-2 
[https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html]

 

 

We should add all the supported ALTER TABLE syntax. See 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198]

> Document ALTER TABLE statement in SQL Reference [Phase 2]
> -
>
> Key: SPARK-30962
> URL: https://issues.apache.org/jira/browse/SPARK-30962
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of 
> ALTER TABLE statements. See the doc in preview-2 
> [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html]
>  
>  
> We should add all the supported ALTER TABLE syntax. See 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198]




[jira] [Resolved] (SPARK-27619) MapType should be prohibited in hash expressions

2020-02-26 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27619.
-
Fix Version/s: 3.0.0
 Assignee: Rakesh Raushan
   Resolution: Fixed

> MapType should be prohibited in hash expressions
> 
>
> Key: SPARK-27619
> URL: https://issues.apache.org/jira/browse/SPARK-27619
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Josh Rosen
>Assignee: Rakesh Raushan
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Spark currently allows MapType expressions to be used as input to hash 
> expressions, but I think that this should be prohibited because Spark SQL 
> does not support map equality.
> Currently, Spark SQL's map hashcodes are sensitive to the insertion order of 
> map elements:
> {code:java}
> val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
> val b = spark.createDataset(Map(2->2, 1->1) :: Nil)
> // Demonstration of how Scala Map equality is unaffected by insertion order:
> assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
> assert(Map(1->1, 2->2) == Map(2->2, 1->1))
> assert(a.first() == b.first())
> // In contrast, this will print two different hashcodes:
> println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code}
> This behavior might be surprising to Scala developers.
> I think there's precedent for banning the use of MapType here because we 
> already prohibit MapType in aggregation / joins / equality comparisons 
> (SPARK-9415) and set operations (SPARK-19893).
> If we decide that we want this to be an error then it might also be a good 
> idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the 
> old and buggy behavior (in case applications were relying on it in cases 
> where it just so happens to be safe-by-accident (e.g. maps which only have 
> one entry)).
> Alternatively, we could support hashing here if we implemented support for 
> comparable map types (SPARK-18134).
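
Why insertion order can leak into a map's hash is easy to see with a toy example. The sketch below is plain Python and is not Spark's hash implementation; entry_hash, order_sensitive_hash, and order_insensitive_hash are hypothetical names, and the mixing constants are arbitrary. Folding the entries in sequence gives an order-dependent result, while a commutative combination (here, addition) does not.

```python
MASK = 0xFFFFFFFF  # keep results in 32 bits


def entry_hash(key, value):
    """Toy hash of one (key, value) pair of ints; the constant is arbitrary."""
    return ((key * 1000003) ^ value) & MASK


def order_sensitive_hash(entries, seed=42):
    """Fold entries one after another: the result depends on their order."""
    h = seed
    for key, value in entries:
        h = (h * 31 + entry_hash(key, value)) & MASK
    return h


def order_insensitive_hash(entries, seed=42):
    """Combine entries with a commutative operation: order does not matter."""
    h = seed
    for key, value in entries:
        h = (h + entry_hash(key, value)) & MASK
    return h


# The same logical map, stored in two different insertion orders.
a = [(1, 1), (2, 2)]
b = [(2, 2), (1, 1)]

sensitive_differs = order_sensitive_hash(a) != order_sensitive_hash(b)
insensitive_matches = order_insensitive_hash(a) == order_insensitive_hash(b)
```

A commutative combination is only one possible design and weakens the mixing between entries; sorting entries by key before hashing is another option, but it requires the keys to be orderable.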




[jira] [Created] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]

2020-02-26 Thread Xiao Li (Jira)

Xiao Li created SPARK-30962:
---

 Summary: Document ALTER TABLE statement in SQL Reference [Phase 2]
 Key: SPARK-30962
 URL: https://issues.apache.org/jira/browse/SPARK-30962
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Xiao Li







[jira] [Updated] (SPARK-28791) Document ALTER TABLE statement in SQL Reference [Phase 1]

2020-02-26 Thread Xiao Li (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-28791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28791:

Summary: Document ALTER TABLE statement in SQL Reference [Phase 1]  (was: 
Document ALTER TABLE statement in SQL Reference.)

> Document ALTER TABLE statement in SQL Reference [Phase 1]
> -
>
> Key: SPARK-28791
> URL: https://issues.apache.org/jira/browse/SPARK-28791
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: pavithra ramachandran
>Priority: Minor
> Fix For: 3.0.0
>
>





[jira] [Resolved] (SPARK-30782) Column resolution doesn't respect current catalog/namespace for v2 tables.

2020-02-26 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30782.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27532
[https://github.com/apache/spark/pull/27532]

> Column resolution doesn't respect current catalog/namespace for v2 tables.
> --
>
> Key: SPARK-30782
> URL: https://issues.apache.org/jira/browse/SPARK-30782
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> For v1 tables, you can perform the following:
> {code:sql}
> SELECT default.t.id FROM t;
> {code}
> For v2 tables, the following fails:
> {code:sql}
> USE testcat.ns1.ns2;
> SELECT testcat.ns1.ns2.t.id FROM t;
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '`testcat.ns1.ns2.t.id`' given input columns: [t.id, t.point]; line 1 pos 7;
> {code}
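
The expected behaviour can be modelled roughly as follows. This is a plain-Python sketch, not Spark's analyzer: resolve_column and its parameters are hypothetical, the set of accepted qualifiers (none, the bare table name, or the full catalog.namespace.table name) is an assumption, and splitting on "." ignores backtick-quoted identifiers.

```python
def resolve_column(reference, current_catalog, current_namespace, table, columns):
    """Return the column name if `reference` resolves against `table`, else None.

    Hypothetical rule: the qualifier in front of the column must be empty,
    the bare table name, or the fully qualified catalog.namespace.table name.
    """
    parts = reference.split(".")
    qualifier, column = parts[:-1], parts[-1]
    if column not in columns:
        return None
    full_name = [current_catalog, *current_namespace, table]
    accepted = ([], [table], full_name)
    return column if qualifier in accepted else None


# Mirrors the failing query: USE testcat.ns1.ns2; SELECT testcat.ns1.ns2.t.id FROM t;
catalog, namespace = "testcat", ["ns1", "ns2"]
columns = ["id", "point"]

fully_qualified = resolve_column("testcat.ns1.ns2.t.id", catalog, namespace, "t", columns)
table_qualified = resolve_column("t.id", catalog, namespace, "t", columns)
wrong_namespace = resolve_column("testcat.other.t.id", catalog, namespace, "t", columns)
```

The point of the sketch is that the current catalog and namespace have to take part in building the qualifier that a column reference is compared against; with only the table name available, the fully qualified form cannot match.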




[jira] [Assigned] (SPARK-30782) Column resolution doesn't respect current catalog/namespace for v2 tables.

2020-02-26 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30782:
---

Assignee: Terry Kim

> Column resolution doesn't respect current catalog/namespace for v2 tables.
> --
>
> Key: SPARK-30782
> URL: https://issues.apache.org/jira/browse/SPARK-30782
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> For v1 tables, you can perform the following:
> {code:sql}
> SELECT default.t.id FROM t;
> {code}
> For v2 tables, the following fails:
> {code:sql}
> USE testcat.ns1.ns2;
> SELECT testcat.ns1.ns2.t.id FROM t;
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '`testcat.ns1.ns2.t.id`' given input columns: [t.id, t.point]; line 1 pos 7;
> {code}




[jira] [Updated] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-02-26 Thread Nicolas Renkamp (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Renkamp updated SPARK-30961:

Description: 
Hi,

there seems to be a bug in the arrow enabled to_pandas conversion from spark 
dataframe to pandas dataframe when the dataframe has a column of type DateType. 
Here is a minimal example to reproduce the issue:
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
print("Arrow optimization is enabled: " + is_arrow_enabled)
spark_df = spark.createDataFrame(
    [['2019-12-06']], 'created_at: string') \
    .withColumn('created_at', F.to_date('created_at'))

# works
spark_df.toPandas()

spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
print("Arrow optimization is enabled: " + is_arrow_enabled)
# raises AttributeError: Can only use .dt accessor with datetimelike values
# series is still of type object, .dt does not exist
spark_df.toPandas(){code}
A fix would be to modify the _check_series_convert_date function in 
pyspark.sql.types to:
{code:java}
def _check_series_convert_date(series, data_type):
    """
    Cast the series to datetime.date if it's a date type, otherwise returns the
    original series.

    :param series: pandas.Series
    :param data_type: a Spark data type for the series
    """
    from pyspark.sql.utils import require_minimum_pandas_version
    require_minimum_pandas_version()

    from pandas import to_datetime
    if type(data_type) == DateType:
        return to_datetime(series).dt.date
    else:
        return series
{code}
Let me know if I should prepare a Pull Request for the 2.4.5 branch.

I have not tested the behavior on master branch.

 

Thanks,

Nicolas

  was:
Hi,

there seems to be a bug in the arrow enabled to_pandas conversion from spark 
dataframe to pandas dataframe when the dataframe has a column of type DateType. 
Here is a minimal example to reproduce the issue:
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
print("Arrow optimization is enabled: " + is_arrow_enabled)
spark_df = spark.createDataFrame(
    [['2019-12-06']], 'created_at: string') \
    .withColumn('created_at', F.to_date('created_at'))

# works
spark_df.toPandas()

spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
print("Arrow optimization is enabled: " + is_arrow_enabled)
# raises AttributeError
spark_df.toPandas(){code}
A fix would be to modify the _check_series_convert_date function in 
pyspark.sql.types to:
{code:java}
def _check_series_convert_date(series, data_type):
    """
    Cast the series to datetime.date if it's a date type, otherwise returns the
    original series.

    :param series: pandas.Series
    :param data_type: a Spark data type for the series
    """
    from pyspark.sql.utils import require_minimum_pandas_version
    require_minimum_pandas_version()

    from pandas import to_datetime
    if type(data_type) == DateType:
        return to_datetime(series).dt.date
    else:
        return series
{code}
Let me know if I should prepare a Pull Request for the 2.4.5 branch.

I have not tested the behavior on master branch.

 

Thanks,

Nicolas


> Arrow enabled: to_pandas with date column fails
> ---
>
> Key: SPARK-30961
> URL: https://issues.apache.org/jira/browse/SPARK-30961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5
>Reporter: Nicolas Renkamp
>Priority: Major
>  Labels: ready-to-commit
>
> Hi,
> there seems to be a bug in the arrow enabled to_pandas conversion from spark 
> dataframe to pandas dataframe when the dataframe has a column of type 
> DateType. Here is a minimal example to reproduce the issue:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
>     [['2019-12-06']], 'created_at: string') \
>     .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas(){code}
> A fix would be to modify the _check_series_convert_date function in 
> pyspark.sql.types to:
> {code:java}
> def

[jira] [Created] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-02-26 Thread Nicolas Renkamp (Jira)

Nicolas Renkamp created SPARK-30961:
---

 Summary: Arrow enabled: to_pandas with date column fails
 Key: SPARK-30961
 URL: https://issues.apache.org/jira/browse/SPARK-30961
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.5
 Environment: Apache Spark 2.4.5
Reporter: Nicolas Renkamp


Hi,

there seems to be a bug in the arrow enabled to_pandas conversion from spark 
dataframe to pandas dataframe when the dataframe has a column of type DateType. 
Here is a minimal example to reproduce the issue:
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
print("Arrow optimization is enabled: " + is_arrow_enabled)
spark_df = spark.createDataFrame(
    [['2019-12-06']], 'created_at: string') \
    .withColumn('created_at', F.to_date('created_at'))

# works
spark_df.toPandas()

spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
print("Arrow optimization is enabled: " + is_arrow_enabled)
# raises AttributeError
spark_df.toPandas(){code}
A fix would be to modify the _check_series_convert_date function in 
pyspark.sql.types to:
{code:java}
def _check_series_convert_date(series, data_type):
    """
    Cast the series to datetime.date if it's a date type, otherwise returns the
    original series.

    :param series: pandas.Series
    :param data_type: a Spark data type for the series
    """
    from pyspark.sql.utils import require_minimum_pandas_version
    require_minimum_pandas_version()

    from pandas import to_datetime
    if type(data_type) == DateType:
        return to_datetime(series).dt.date
    else:
        return series
{code}
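The pandas behaviour behind the proposed fix can be reproduced without Spark. The following is only a sketch under assumptions: the object-dtype Series of datetime.date values is a hand-built stand-in for the column that the Arrow path hands over, and the helper names has_dt_accessor and to_date_series are hypothetical, not part of PySpark.

```python
import datetime

import pandas as pd

# Hand-built stand-in for the date column: an object-dtype Series that holds
# datetime.date values. It is NOT produced by Spark here.
dates = pd.Series([datetime.date(2019, 12, 6)], dtype=object)


def has_dt_accessor(series):
    """Return True if pandas exposes the .dt accessor for this Series."""
    try:
        series.dt
        return True
    except AttributeError:
        # pandas refuses .dt on values that are not datetimelike.
        return False


def to_date_series(series):
    """Convert to datetime64 first, then extract datetime.date objects."""
    return pd.to_datetime(series).dt.date


converted = to_date_series(dates)
print(has_dt_accessor(dates))                  # object dtype: no .dt accessor
print(has_dt_accessor(pd.to_datetime(dates)))  # datetime64 dtype: .dt works
print(list(converted))
```

Calling pd.to_datetime before .dt mirrors the change suggested for _check_series_convert_date; whether that is the right fix for the 2.4.5 branch still has to be confirmed against Spark itself.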
Let me know if I should prepare a Pull Request for the 2.4.5 branch.

I have not tested the behavior on master branch.

 

Thanks,

Nicolas



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2020-02-26 Thread Somnath (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045607#comment-17045607
 ] 

Somnath edited comment on SPARK-10795 at 2/26/20 4:05 PM:
--

 

I am trying to submit the test.py Spark app below to a YARN cluster with the 
following command:
{noformat}
PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf 
spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn 
--deploy-mode cluster --archives venv#venv test.py{noformat}
Note: I am not using local mode; I am trying to use the python3.7 site-packages 
under the virtualenv that was used for building the code in PyCharm.

This is how the Python project structure looks, along with the contents of the 
venv directory:
{noformat}
-rw-r--r-- 1 schakrabarti nobody 225908565 Feb 26 13:07 venv.tar.gz
-rw-r--r-- 1 schakrabarti nobody      1313 Feb 26 13:07 test.py
drwxr-xr-x 6 schakrabarti nobody      4096 Feb 26 13:07 venv
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/bin
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/share
-rw-r--r-- 1 schakrabarti nobody    75 Feb 26 13:07 venv/pyvenv.cfg
drwxr-xr-x 2 schakrabarti nobody  4096 Feb 26 13:07 venv/include
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/lib
{noformat}
I get the same "File does not exist" error for pyspark.zip (shown below):
{noformat}
java.io.FileNotFoundException: File does not exist: 
hdfs://hostname-nn1.cluster.domain.com:8020/user/schakrabarti/.sparkStaging/application_1571868585150_999337/pyspark.zip{noformat}
 
{noformat}
#test.py
import json
from pyspark.sql import SparkSession

if __name__ == "__main__":
  spark = SparkSession.builder \
   .appName("Test_App") \
   .master("spark://hostname-nn1.cluster.domain.com:41767") \
   .config("spark.ui.port", "4057") \
   .config("spark.executor.memory", "4g") \
   .getOrCreate()

  print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4))

  spark.stop(){noformat}


was (Author: somchakr):
 

Trying to submit the below test.py Spark app on a YARN cluster with the below 
command
{noformat}
PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf 
spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn 
--deploy-mode cluster --archives venv#venv test.py{noformat}
Note: I am not using local mode, but trying to use the python3.7 site-packages 
under the virtualenv used for building the code in PyCharm 

This is how the Python project structure looks along with the contents of venv 
directory
{noformat}
-rw-r--r-- 1 schakrabarti nobody 225908565 Feb 26 13:07 venv.tar.gz
-rw-r--r-- 1 schakrabarti nobody      1313 Feb 26 13:07 test.py
drwxr-xr-x 6 schakrabarti nobody      4096 Feb 26 13:07 venv
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/bin
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/share
-rw-r--r-- 1 schakrabarti nobody    75 Feb 26 13:07 venv/pyvenv.cfg
drwxr-xr-x 2 schakrabarti nobody  4096 Feb 26 13:07 venv/include
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/lib
{noformat}
 Getting the same error of File does not exist - pyspark.zip (as shown below)
{noformat}
java.io.FileNotFoundException: File does not exist: 
hdfs://hostname-nn1.cluster.domain.com:8020/user/schakrabarti/.sparkStaging/application_1571868585150_999337/pyspark.zip{noformat}
 
{noformat}
#test.py
import json
from pyspark.sql import SparkSession

if __name__ == "__main__":
  spark = SparkSession.builder \
   .appName("Test_App") \
   .master("spark://gwrd352n36.red.ygrid.yahoo.com:41767") \
   .config("spark.ui.port", "4057") \
   .config("spark.executor.memory", "4g") \
   .getOrCreate()

  print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4))

  spark.stop(){noformat}

> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>Priority: Major
>  Labels: bulk-closed
>
> I am trying to run a simple Spark job using PySpark. It works in standalone mode, 
> but when I deploy it on a cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> The upload of the resource file above is successful, and I manually checked that the 
> file is present at the path specified above, but after a while I get the following error:
> Diagnostics: File does not exist: 
>

[jira] [Commented] (SPARK-7101) Spark SQL should support java.sql.Time

2020-02-26 Thread YoungGyu Chun (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045628#comment-17045628
 ] 

YoungGyu Chun commented on SPARK-7101:
--

I will work on this

> Spark SQL should support java.sql.Time
> --
>
> Key: SPARK-7101
> URL: https://issues.apache.org/jira/browse/SPARK-7101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: All
>Reporter: Peter Hagelund
>Priority: Minor
>
> Several RDBMSes support the TIME data type; for more exact mapping between 
> those and Spark SQL, support for java.sql.Time with an associated 
> DataType.TimeType would be helpful.




[jira] [Created] (SPARK-30960) add back the legacy date/timestamp format support in CSV/JSON parser

2020-02-26 Thread Wenchen Fan (Jira)

Wenchen Fan created SPARK-30960:
---

 Summary: add back the legacy date/timestamp format support in 
CSV/JSON parser
 Key: SPARK-30960
 URL: https://issues.apache.org/jira/browse/SPARK-30960
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan







[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2020-02-26 Thread Somnath (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045607#comment-17045607
 ] 

Somnath edited comment on SPARK-10795 at 2/26/20 3:04 PM:
--

 

Trying to submit the below test.py Spark app on a YARN cluster with the below 
command
{noformat}
PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf 
spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn 
--deploy-mode cluster --archives venv#venv test.py{noformat}
Note: I am not using local mode, but trying to use the python3.7 site-packages 
under the virtualenv used for building the code in PyCharm 

This is how the Python project structure looks along with the contents of venv 
directory
{noformat}
-rw-r--r-- 1 schakrabarti nobody 225908565 Feb 26 13:07 venv.tar.gz
-rw-r--r-- 1 schakrabarti nobody      1313 Feb 26 13:07 test.py
drwxr-xr-x 6 schakrabarti nobody      4096 Feb 26 13:07 venv
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/bin
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/share
-rw-r--r-- 1 schakrabarti nobody    75 Feb 26 13:07 venv/pyvenv.cfg
drwxr-xr-x 2 schakrabarti nobody  4096 Feb 26 13:07 venv/include
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/lib
{noformat}
 Getting the same error of File does not exist - pyspark.zip (as shown below)
{noformat}
java.io.FileNotFoundException: File does not exist: 
hdfs://hostname-nn1.cluster.domain.com:8020/user/schakrabarti/.sparkStaging/application_1571868585150_999337/pyspark.zip{noformat}
 
{noformat}
#test.py
import json
from pyspark.sql import SparkSession

if __name__ == "__main__":
  spark = SparkSession.builder \
   .appName("Test_App") \
   .master("spark://gwrd352n36.red.ygrid.yahoo.com:41767") \
   .config("spark.ui.port", "4057") \
   .config("spark.executor.memory", "4g") \
   .getOrCreate()

  print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4))

  spark.stop(){noformat}


was (Author: somchakr):
 

Trying to submit the below test.py Spark app on a YARN cluster with the below 
command
{noformat}
PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf 
spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn 
--deploy-mode cluster --archives venv#venv test.py{noformat}
Note: I am not using local mode, but trying to use the python3.7 site-packages 
under the virtualenv used for building the code in PyCharm 

This is how the Python project structure looks, along with the contents of the venv directory:

 

 
{noformat}
-rw-r--r-- 1 schakrabarti nobody 225908565 Feb 26 13:07 venv.tar.gz
-rw-r--r-- 1 schakrabarti nobody      1313 Feb 26 13:07 test.py
drwxr-xr-x 6 schakrabarti nobody      4096 Feb 26 13:07 venv
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/bin
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/share
-rw-r--r-- 1 schakrabarti nobody    75 Feb 26 13:07 venv/pyvenv.cfg
drwxr-xr-x 2 schakrabarti nobody  4096 Feb 26 13:07 venv/include
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/lib
{noformat}
 

 

Getting the same error of File does not exist - pyspark.zip (as shown below)
{noformat}
java.io.FileNotFoundException: File does not exist: 
hdfs://hostname-nn1.cluster.domain.com:8020/user/schakrabarti/.sparkStaging/application_1571868585150_999337/pyspark.zip{noformat}
 
{noformat}
#test.py
import json
from pyspark.sql import SparkSession

if __name__ == "__main__":
  spark = SparkSession.builder \
   .appName("Test_App") \
   .master("spark://gwrd352n36.red.ygrid.yahoo.com:41767") \
   .config("spark.ui.port", "4057") \
   .config("spark.executor.memory", "4g") \
   .getOrCreate()

  print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4))

  spark.stop(){noformat}
 

 

 

 

 

 

> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>Priority: Major
>  Labels: bulk-closed
>
> I am trying to run a simple Spark job using PySpark. It works in standalone mode, 
> but when I deploy it on a cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> The upload of the resource file above is successful, and I manually checked that the 
> file is present at the path specified above, but after a while I get the following error:
> Diagnostics: File does not exist: 
>

[jira] [Commented] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2020-02-26 Thread Somnath (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045607#comment-17045607
 ] 

Somnath commented on SPARK-10795:
-

 

Trying to submit the below test.py Spark app on a YARN cluster with the below 
command
{noformat}
PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf 
spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn 
--deploy-mode cluster --archives venv#venv test.py{noformat}
Note: I am not using local mode, but trying to use the python3.7 site-packages 
under the virtualenv used for building the code in PyCharm 

This is how the Python project structure looks, along with the contents of the venv directory:

 

 
{noformat}
-rw-r--r-- 1 schakrabarti nobody 225908565 Feb 26 13:07 venv.tar.gz
-rw-r--r-- 1 schakrabarti nobody      1313 Feb 26 13:07 test.py
drwxr-xr-x 6 schakrabarti nobody      4096 Feb 26 13:07 venv
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/bin
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/share
-rw-r--r-- 1 schakrabarti nobody    75 Feb 26 13:07 venv/pyvenv.cfg
drwxr-xr-x 2 schakrabarti nobody  4096 Feb 26 13:07 venv/include
drwxr-xr-x 3 schakrabarti nobody  4096 Feb 26 13:07 venv/lib
{noformat}
 

 

Getting the same error of File does not exist - pyspark.zip (as shown below)
{noformat}
java.io.FileNotFoundException: File does not exist: 
hdfs://hostname-nn1.cluster.domain.com:8020/user/schakrabarti/.sparkStaging/application_1571868585150_999337/pyspark.zip{noformat}
 
{noformat}
#test.py
import json
from pyspark.sql import SparkSession

if __name__ == "__main__":
  spark = SparkSession.builder \
   .appName("Test_App") \
   .master("spark://gwrd352n36.red.ygrid.yahoo.com:41767") \
   .config("spark.ui.port", "4057") \
   .config("spark.executor.memory", "4g") \
   .getOrCreate()

  print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4))

  spark.stop(){noformat}
 

 

 

 

 

 

> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>Priority: Major
>  Labels: bulk-closed
>
> I am trying to run a simple Spark job using PySpark. It works in standalone mode, 
> but when I deploy it on a cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> The upload of the resource file above is successful, and I manually checked that the 
> file is present at the path specified above, but after a while I get the following error:
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip




[jira] [Updated] (SPARK-30910) Add version information to the configuration of R

2020-02-26 Thread jiaan.geng (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30910:
---
Summary: Add version information to the configuration of R  (was: Arrange 
version info of R)

> Add version information to the configuration of R
> -
>
> Key: SPARK-30910
> URL: https://issues.apache.org/jira/browse/SPARK-30910
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> core/src/main/scala/org/apache/spark/internal/config/R.scala




[jira] [Updated] (SPARK-30947) Log better message when accelerate resource is empty

2020-02-26 Thread wuyi (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-30947:
-
Summary: Log better message when accelerate resource is empty  (was: Don't 
log accelerate resources when it's empty)

> Log better message when accelerate resource is empty
> 
>
> Key: SPARK-30947
> URL: https://issues.apache.org/jira/browse/SPARK-30947
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> It's confusing to see cpu/memory resources right after the log reports that the resources are empty:
> {code:java}
> 20/02/25 21:47:55 INFO ResourceUtils: 
> ==
> 20/02/25 21:47:55 INFO ResourceUtils: Resources for spark.driver:
> 20/02/25 21:47:55 INFO ResourceUtils: 
> ==
> 20/02/25 21:47:55 INFO SparkContext: Submitted application: Spark shell
> 20/02/25 21:47:55 INFO ResourceProfile: Default ResourceProfile created, 
> executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , 
> memory -> name: memory, amount: 1024, script: , vendor: ), task resources: 
> Map(cpus -> name: cpus, amount: 1.0)
> {code}




[jira] [Updated] (SPARK-30959) How to write using JDBC driver to SQL Server / Azure DWH to column of BINARY type?

2020-02-26 Thread Mikhail Kumachev (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Kumachev updated SPARK-30959:
-
Description: 
My initial goal is to save UUID values to a column of type BINARY(16) in 
SQL Server/Azure DWH.

For example, I have a demo table:
{code:java}
CREATE TABLE [Events] ([EventId] [binary](16) NOT NULL){code}
 

I want to write data to it using Spark like this:
{code:java}
import java.nio.{ByteBuffer, ByteOrder}
import java.util.UUID

import org.apache.spark.sql.types.{BinaryType, StructField, StructType}
import spark.implicits._  // needed for toDF

val uuid = UUID.randomUUID()

// Pack the UUID into 16 bytes, most significant bits first.
val uuidBytes = Array.ofDim[Byte](16)
ByteBuffer.wrap(uuidBytes)
    .order(ByteOrder.BIG_ENDIAN)
    .putLong(uuid.getMostSignificantBits())
    .putLong(uuid.getLeastSignificantBits())

val schema = StructType(
    List(
        StructField("EventId", BinaryType, false)
    )
)

val data = Seq((uuidBytes)).toDF("EventId").rdd
val df = spark.createDataFrame(data, schema)

df.write
    .format("jdbc")
    .option("url", "")
    .option("dbTable", "Events")
    .mode(org.apache.spark.sql.SaveMode.Append)
    .save()
{code}
 

This code returns an error: 
{noformat}
java.sql.BatchUpdateException: Conversion from variable or parameter type 
VARBINARY to target column type BINARY is not supported.{noformat}
 

My question is: how can I handle this situation and insert a UUID value into a 
BINARY(16) column?

 

My investigation:

Spark uses the concept of JdbcDialects and has a mapping from each Catalyst type 
to a database type and vice versa. For example, here is 
[MsSqlServerDialect|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala],
 which is used when working against SQL Server or Azure DWH. In the method 
`getJDBCType` you can see the mapping:
{code:java}
case BinaryType => Some(JdbcType("VARBINARY(MAX)", 
java.sql.Types.VARBINARY)){code}
and I think this is the root of my problem.

 

So I decided to implement my own JdbcDialect to override this behavior:
{code:java}
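import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{BinaryType, DataType}

class SqlServerDialect extends JdbcDialect {
    override def canHandle(url: String): Boolean =
        url.startsWith("jdbc:sqlserver")

    // Map Catalyst BinaryType to a fixed-width BINARY(16) column.
    override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        case BinaryType => Option(JdbcType("BINARY(16)", java.sql.Types.BINARY))
        case _ => None
    }
}

val dialect = new SqlServerDialect
JdbcDialects.registerDialect(dialect)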
{code}
With this modification I still get exactly the same error. It looks like 
Spark does not use the mapping from my custom dialect, even though I checked that 
the dialect is registered. So this is a strange situation.
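
As a side note on the 16-byte layout built by the ByteBuffer snippet in the write example, the same big-endian packing can be modelled in plain Python. This is a sketch only: the helper names uuid_to_bytes and bytes_to_uuid and the sample UUID are hypothetical, and nothing here touches Spark or the JDBC driver.

```python
import struct
import uuid


def uuid_to_bytes(value: uuid.UUID) -> bytes:
    """Pack a UUID as two big-endian 64-bit words, most significant first."""
    msb = value.int >> 64
    lsb = value.int & 0xFFFFFFFFFFFFFFFF
    return struct.pack(">QQ", msb, lsb)


def bytes_to_uuid(raw: bytes) -> uuid.UUID:
    """Inverse of uuid_to_bytes: rebuild the UUID from its 16 bytes."""
    msb, lsb = struct.unpack(">QQ", raw)
    return uuid.UUID(int=(msb << 64) | lsb)


# Sample value chosen for illustration only.
example = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
packed = uuid_to_bytes(example)

print(len(packed))                       # number of bytes in the payload
print(packed.hex())                      # hex form of the payload
print(bytes_to_uuid(packed) == example)  # round trip
```

The payload is always exactly 16 bytes, which matches the width of the BINARY(16) column; the open question in this ticket is the VARBINARY-to-BINARY conversion, not the length of the data.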


[jira] [Updated] (SPARK-30959) How to write using JDBC driver to SQL Server / Azure DWH to column of BINARY type?

2020-02-26 Thread Mikhail Kumachev (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Kumachev updated SPARK-30959:
-
Description: 
My initial goal is to save UUId values to SQL Server/Azure DWH to column of 
BINARY(16) type.

For example, I have demo table:
{code:java}
CREATE TABLE [Events] ([EventId] [binary](16) NOT NULL){code}
 

I want to write data to it using Spark like this:
{code:java}
import java.nio.{ByteBuffer, ByteOrder}
import java.util.UUID
val uuid = UUID.randomUUID()
 val uuidBytes = Array.ofDim[Byte](16)
 ByteBuffer.wrap(uuidBytes)
    .order(ByteOrder.BIG_ENDIAN)
    .putLong(uuid.getMostSignificantBits())
    .putLong(uuid.getLeastSignificantBits())
val schema = StructType(
    List(
        StructField("EventId", BinaryType, false)
    )
 )
 val data = Seq((uuidBytes)).toDF("EventId").rdd;
 val df = spark.createDataFrame(data, schema);
df.write
    .format("jdbc")
    .option("url", "")
    .option("dbTable", "Events")
    .mode(org.apache.spark.sql.SaveMode.Append)
    .save()
{code}
 

This code returns an error:

 
{noformat}
java.sql.BatchUpdateException: Conversion from variable or parameter type 
VARBINARY to target column type BINARY is not supported.{noformat}
 

 

My question is how to cope with this situation and insert UUId value to 
BINARY(16) column?

 

My investigation:

Spark uses conception of JdbcDialects and has a mapping for each Catalyst type 
to database type and vice versa. For example here is 
[MsSqlServerDialect|[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala]]
 which is used when we work against SQL Server or Azure DWH. In the method 
`getJDBCType` you can see the mapping:
{code:java}
case BinaryType => Some(JdbcType("VARBINARY(MAX)", 
java.sql.Types.VARBINARY)){code}
and this is the root of my problem as I think.

 

So, I decide to implement my own JdbcDialect to override this behavior:
{code:java}
class SqlServerDialect extends JdbcDialect {
    override def canHandle(url: String) : Boolean = 
url.startsWith("jdbc:sqlserver")

    override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        case BinaryType => Option(JdbcType("BINARY(16)", java.sql.Types.BINARY))
        case _ => None
    }
 }
val dialect = new SqlServerDialect
JdbcDialects.registerDialect(dialect)
{code}
With this modification I still catch exactly the same error. It looks like that 
Spark do not use mapping from my custom dialect. But I checked that the dialect 
is registered. So it is strange situation.


[jira] [Created] (SPARK-30959) How to write using JDBC driver to SQL Server / Azure DWH to column of BINARY type?

2020-02-26 Thread Mikhail Kumachev (Jira)

Mikhail Kumachev created SPARK-30959:


 Summary: How to write using JDBC driver to SQL Server / Azure DWH 
to column of BINARY type?
 Key: SPARK-30959
 URL: https://issues.apache.org/jira/browse/SPARK-30959
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Mikhail Kumachev


My initial goal is to save UUId values to SQL Server/Azure DWH to column of 
BINARY(16) type.

For example, I have demo table:
{code:java}
CREATE TABLE [Events] ([EventId] [binary](16) NOT NULL){code}
 

I want to write data to it using Spark like this:
{code:java}
import java.nio.{ByteBuffer, ByteOrder}
import java.util.UUID
val uuid = UUID.randomUUID()
 val uuidBytes = Array.ofDim[Byte](16)
 ByteBuffer.wrap(uuidBytes)
    .order(ByteOrder.BIG_ENDIAN)
    .putLong(uuid.getMostSignificantBits())
    .putLong(uuid.getLeastSignificantBits())
val schema = StructType(
    List(
        StructField("EventId", BinaryType, false)
    )
 )
 val data = Seq((uuidBytes)).toDF("EventId").rdd;
 val df = spark.createDataFrame(data, schema);
df.write
    .format("jdbc")
    .option("url", "")
    .option("dbTable", "Events")
    .mode(org.apache.spark.sql.SaveMode.Append)
    .save()
{code}
 

This code returns an error:

 
{noformat}
java.sql.BatchUpdateException: Conversion from variable or parameter type 
VARBINARY to target column type BINARY is not supported.{noformat}
 

 

My question is how to cope with this situation and insert UUId value to 
BINARY(16) column?

 

My investigation:

Spark uses conception of JdbcDialects and has a mapping for each Catalyst type 
to database type and vice versa. For example here is 
[MsSqlServerDialect|[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala]]
 which is used when we work against SQL Server or Azure DWH. In the method 
`getJDBCType` you can see the mapping:
{code:java}
case BinaryType => Some(JdbcType("VARBINARY(MAX)", 
java.sql.Types.VARBINARY)){code}
and this is the root of my problem as I think.

 

So, I decide to implement my own JdbcDialect to override this behavior:
{code:java}
class SqlServerDialect extends JdbcDialect {
    override def canHandle(url: String) : Boolean = 
url.startsWith("jdbc:sqlserver")
    override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        case BinaryType => Option(JdbcType("BINARY(16)", java.sql.Types.BINARY))
        case _ => None
    }
 }
val dialect = new SqlServerDialect
JdbcDialects.registerDialect(dialect)
{code}

With this modification I still catch exactly the same error. It looks like that 
Spark do not use mapping from my custom dialect. But I checked that the dialect 
is registered. So it is strange situation.




[jira] [Created] (SPARK-30958) do not set default era for DateTimeFormatter

2020-02-26 Thread Wenchen Fan (Jira)

Wenchen Fan created SPARK-30958:
---

 Summary: do not set default era for DateTimeFormatter
 Key: SPARK-30958
 URL: https://issues.apache.org/jira/browse/SPARK-30958
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan







[jira] [Created] (SPARK-30957) Null-safe variant of Dataset.join(Dataset[_], Seq[String])

2020-02-26 Thread Enrico Minack (Jira)

Enrico Minack created SPARK-30957:
-

 Summary: Null-safe variant of Dataset.join(Dataset[_], Seq[String])
 Key: SPARK-30957
 URL: https://issues.apache.org/jira/browse/SPARK-30957
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Enrico Minack


The {{Dataset.join(Dataset, Seq[String])}} method provides extra convenience 
over {{Dataset.join(Dataset, joinExprs: Column)}} as it does not duplicate the 
join columns {{Seq[String]}} in the result {{DataFrame}}. Those columns are 
compared with {{===}}. When those join columns need to be compared null-safe 
with {{<=>}}, the join condition becomes very verbose and requires extra 
{{drop}} operations:
{code:java}
df1.join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b"))
  .drop(df2("a"))
  .drop(df2("b"))
  .show()
{code}
A more elegant alternative would be the following null-safe join operation:
{code:java}
df1.joinNullSafe(df2, joinColumns)
{code}
Possible names:
 - {{Dataset.joinNullSafe(Dataset[_], Seq[String])}}
 - {{Dataset.joinWithNulls(Dataset[_], Seq[String])}}
 - {{Dataset.join(Dataset[_], Seq[String], <=>)}}
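
To illustrate the difference between {{===}} and {{<=>}} that motivates this request, here is a minimal sketch in plain Scala with no Spark dependency. It models SQL NULL as {{None}}; the names {{NullSafeJoinSketch}}, {{Row}}, {{sqlEq}}, {{nullSafeEq}} and {{join}} are hypothetical and are not part of the Spark API.

```scala
object NullSafeJoinSketch {
  // A row maps column names to nullable values; None stands for SQL NULL.
  type Row = Map[String, Option[Any]]

  // SQL-style equality (===): comparing with NULL yields NULL (None here).
  def sqlEq(a: Option[Any], b: Option[Any]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // Null-safe equality (<=>): NULL equals NULL and differs from any value.
  def nullSafeEq(a: Option[Any], b: Option[Any]): Boolean = a == b

  // Inner join on `cols`; the join columns appear only once in the result.
  def join(left: Seq[Row], right: Seq[Row], cols: Seq[String],
           nullSafe: Boolean): Seq[Row] =
    for {
      l <- left
      r <- right
      if cols.forall { c =>
        if (nullSafe) nullSafeEq(l(c), r(c))
        else sqlEq(l(c), r(c)).contains(true)
      }
    } yield l ++ (r -- cols)
}
```

In this model, rows whose join column is NULL on both sides match only when {{nullSafe}} is true, which is the behaviour the proposed method would provide without the extra {{drop}} calls.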

*I am happy to provide a PR if this Dataset API extension is appreciated.*

This request has been sent to the Apache Spark user and 
[dev|http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-dataframe-null-safe-joins-given-a-list-of-columns-tt28842.html]
 mailing lists by Marcelo Valle.




[jira] [Updated] (SPARK-30319) Adds a stricter version of as[T]

2020-02-26 Thread Enrico Minack (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30319:
--
Description: 
The behaviour of as[T] is not intuitive when you read code like 
df.as[T].write.csv("data.csv"). The result depends on the actual schema of df, 
where def as[T](): Dataset[T] should be agnostic to the schema of df. The 
expected behaviour is not provided elsewhere:
 * Extra columns that are not part of the type {{T}} are not dropped.
 * Order of columns is not aligned with schema of {{T}}.

A method that enforces the schema of T on a given Dataset would be very 
convenient and would allow users to articulate and guarantee the above 
assumptions about their data with the native Spark Dataset API. This method 
plays a more explicit and enforcing role than as[T] with respect to columns, 
column order and column type.

Possible naming of a stricter version of {{as[T]}}:
 * {{as[T](strict = true)}}
 * {{toDS[T]}} (as in {{toDF}})
 * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}})

The naming {{toDS[T]}} is chosen in the linked PR.
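
As a rough illustration of the "merely selecting the columns of schema {{T}}" idea, the following plain-Scala sketch models a row as an ordered list of named values. The names {{StrictAsSketch}}, {{Row}} and {{selectAs}} are hypothetical; this is not the implementation in the linked PR and does not use the Spark API.

```scala
object StrictAsSketch {
  // One row as an ordered sequence of (column name, value) pairs.
  type Row = Seq[(String, Any)]

  // Keeps exactly the columns named in `schema`, in schema order, and drops
  // every other column. Fails if a required column is missing.
  def selectAs(row: Row, schema: Seq[String]): Row =
    schema.map { name =>
      row.find(_._1 == name).getOrElse(
        throw new IllegalArgumentException(s"Missing column: $name"))
    }
}
```

For example, a row with columns {{extra}}, {{b}}, {{a}} and a schema of {{a}}, {{b}} comes back with only {{a}} and {{b}}, in that order.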

  was:
The behaviour of as[T] is not intuitive when you read code like 
df.as[T].write.csv("data.csv"). The result depends on the actual schema of df, 
where def as[T](): Dataset[T] should be agnostic to the schema of df. The 
expected behaviour is not provided elsewhere:
 * Extra columns that are not part of the type {{T}} are not dropped.
 * Order of columns is not aligned with schema of {{T}}.
 * Columns are not cast to the types of {{T}}'s fields. They have to be cast 
explicitly.

A method that enforces the schema of T on a given Dataset would be very 
convenient and would allow users to articulate and guarantee the above 
assumptions about their data with the native Spark Dataset API. This method 
plays a more explicit and enforcing role than as[T] with respect to columns, 
column order and column type.

Possible naming of a stricter version of {{as[T]}}:
 * {{as[T](strict = true)}}
 * {{toDS[T]}} (as in {{toDF}})
 * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}})

The naming {{toDS[T]}} is chosen here.


> Adds a stricter version of as[T]
> 
>
> Key: SPARK-30319
> URL: https://issues.apache.org/jira/browse/SPARK-30319
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Major
>
> The behaviour of as[T] is not intuitive when you read code like 
> df.as[T].write.csv("data.csv"). The result depends on the actual schema of 
> df, where def as[T](): Dataset[T] should be agnostic to the schema of df. The 
> expected behaviour is not provided elsewhere:
>  * Extra columns that are not part of the type {{T}} are not dropped.
>  * Order of columns is not aligned with schema of {{T}}.
> A method that enforces the schema of T on a given Dataset would be very 
> convenient and would allow users to articulate and guarantee the above 
> assumptions about their data with the native Spark Dataset API. This method 
> plays a more explicit and enforcing role than as[T] with respect to columns, 
> column order and column type.
> Possible naming of a stricter version of {{as[T]}}:
>  * {{as[T](strict = true)}}
>  * {{toDS[T]}} (as in {{toDF}})
>  * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}})
> The naming {{toDS[T]}} is chosen in the linked PR.




[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators

2020-02-26 Thread Enrico Minack (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30666:
--
Description: 
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code produces identical 
accumulator increments on success. Rerunning partitions for any reason should 
always produce the same increments per partition on success.

With this pragmatic approach, increments from individual partitions / tasks are 
merged into the accumulator on the driver side only the first time for each 
partition. This is useful for accumulators registered with 
{{countFailedValues == false}}. Hence, the accumulator aggregates each 
successful partition only once.

The implementations require extra memory that scales with the number of 
partitions.
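
The merge rule described above can be sketched in plain Scala as follows. This is a minimal illustration under stated assumptions, not the proposed Spark implementation: the names {{FirstIncrementSketch}}, {{FirstPerPartitionSum}} and {{merge}} are hypothetical, and partitions are identified by a plain {{Int}}.

```scala
import scala.collection.mutable

object FirstIncrementSketch {
  // Hypothetical driver-side accumulator: merges only the first increment
  // reported for each partition and ignores increments from later reruns.
  final class FirstPerPartitionSum {
    // Tracks partitions already merged; grows with the number of partitions.
    private val seen = mutable.Set.empty[Int]
    private var total = 0L

    // Returns true if the increment was merged, false if it was a repeat.
    def merge(partitionId: Int, increment: Long): Boolean =
      if (seen.add(partitionId)) { total += increment; true } else false

    def value: Long = total
  }
}
```

The set of seen partition ids is the extra memory mentioned above: it holds one entry per partition that has reported an increment.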

  was:
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code produces identical 
accumulator increments on success. Rerunning partitions for any reason should 
always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy of how a new 
increment updates over an earlier increment from the same partition, different 
semantics of accumulators (here called accumulator modes) can be implemented:
 - {{ALL}} sums over all increments of each partition: this represents the 
current implementation of accumulators.
 - {{FIRST}} increment: retrieves only the first accumulator value for each 
partition. This is useful for accumulators registered with 
{{countFailedValues == false}}.
 - {{LARGEST}} over all increments of each partition: accumulators aggregate 
multiple increments while a partition is processed. A successful task provides 
the most accumulated values, which always have a larger cardinality than any 
accumulated value of failed tasks, so it supersedes any failed task's value. 
This produces reliable accumulator values. This does not require 
{{countFailedValues == false}}. This should only be used in a single stage. The 
name may be confused with {{MAX}}.

The implementations for {{LARGEST}} and {{FIRST}} require extra memory that 
scales with the number of partitions. The current {{ALL}} implementation does 
not require extra memory.


> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code produces identical 
> accumulator increments on success. Rerunning partitions for any reason should 
> always produce the same increments per partition on success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are merged into the accumulator on the driver side only the first time for 
> each partition. This is useful for accumulators registered with 
> {{countFailedValues == false}}. Hence, the accumulator aggregates each 
> successful partition only once.
> The implementations require extra memory that scales with the number of 
> partitions.




[jira] [Created] (SPARK-30956) Use intercept instead of try-catch to assert failures in IntervalUtilsSuite

2020-02-26 Thread Kent Yao (Jira)

Kent Yao created SPARK-30956:


 Summary: Use intercept instead of try-catch to assert failures in 
IntervalUtilsSuite
 Key: SPARK-30956
 URL: https://issues.apache.org/jira/browse/SPARK-30956
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao


Addressed the comment from 
https://github.com/apache/spark/pull/27672#discussion_r383719562 to use 
`intercept` instead of a `try-catch` block to assert failures in 
IntervalUtilsSuite.
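
For readers unfamiliar with the pattern, here is a self-contained sketch of what an `intercept`-style assertion does. The `intercept` helper below is a simplified, hypothetical stand-in for the ScalaTest method of the same name, and `parseDays` is an invented function standing in for the code under test; neither is taken from IntervalUtilsSuite.

```scala
import scala.reflect.ClassTag

object InterceptSketch {
  // Simplified stand-in for ScalaTest's `intercept`: runs `body`, returns the
  // expected exception, and fails if nothing or another exception is thrown.
  def intercept[T <: Throwable](body: => Any)(implicit tag: ClassTag[T]): T = {
    val expected = tag.runtimeClass.getName
    val caught = try { body; None } catch { case e: Throwable => Some(e) }
    caught match {
      case Some(e) if tag.runtimeClass.isInstance(e) => e.asInstanceOf[T]
      case Some(e) => throw new AssertionError(s"Expected $expected but got $e")
      case None => throw new AssertionError(s"Expected $expected, nothing thrown")
    }
  }

  // Hypothetical function under test: rejects input that is not a number.
  def parseDays(s: String): Int =
    try s.trim.toInt
    catch {
      case _: NumberFormatException =>
        throw new IllegalArgumentException(s"Invalid interval: $s")
    }
}
```

A test then reads as a single expression, for example `intercept[IllegalArgumentException] { parseDays("abc") }`, and the returned exception can be inspected for its message. Unlike a hand-written `try-catch`, the assertion cannot silently pass when no exception is thrown.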




[jira] [Commented] (SPARK-30955) Exclude Generate output when aliasing in nested column pruning

2020-02-26 Thread L. C. Hsieh (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-30955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045231#comment-17045231
 ] 

L. C. Hsieh commented on SPARK-30955:
-

Thanks for pointing it out. Yes, I re-checked and that code isn't in branch-2.4.

> Exclude Generate output when aliasing in nested column pruning
> --
>
> Key: SPARK-30955
> URL: https://issues.apache.org/jira/browse/SPARK-30955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> When aliasing in nested column pruning on Project on top of Generate, we 
> should exclude Generate outputs.




[jira] [Updated] (SPARK-30955) Exclude Generate output when aliasing in nested column pruning

2020-02-26 Thread L. C. Hsieh (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-30955:

Affects Version/s: (was: 2.4.5)
   3.0.0

> Exclude Generate output when aliasing in nested column pruning
> --
>
> Key: SPARK-30955
> URL: https://issues.apache.org/jira/browse/SPARK-30955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> When aliasing in nested column pruning on Project on top of Generate, we 
> should exclude Generate outputs.




[jira] [Updated] (SPARK-30909) Add version information to the configuration of Python

2020-02-26 Thread jiaan.geng (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30909:
---
Summary: Add version information to the configuration of Python  (was: 
Arrange version info of Python)

> Add version information to the configuration of Python
> --
>
> Key: SPARK-30909
> URL: https://issues.apache.org/jira/browse/SPARK-30909
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> core/src/main/scala/org/apache/spark/internal/config/Python.scala



