[jira] [Created] (SPARK-30968) Upgrade aws-java-sdk-sts to 1.11.655
Dongjoon Hyun created SPARK-30968: - Summary: Upgrade aws-java-sdk-sts to 1.11.655 Key: SPARK-30968 URL: https://issues.apache.org/jira/browse/SPARK-30968 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046265#comment-17046265 ] Jorge Machado commented on SPARK-26412: --- Hi, one question. when using "a tuple of pd.Series if UDF is called with more than one Spark DF columns" how can I get the Series into a variables. like a, b, c = iterator ? map seems not to be a python tuple ... > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if UDF is called with more than one Spark DF columns > * a pd.DataFrame if UDF is called with a single StructType column > Examples: > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ... # load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > If the UDF doesn't return the same number of records for the entire > partition, user should see an error. We don't restrict that every yield > should match the input batch size. > Another benefit is with iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30765) Refine baes class abstraction code style
[ https://issues.apache.org/jira/browse/SPARK-30765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30765. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27511 [https://github.com/apache/spark/pull/27511] > Refine baes class abstraction code style > > > Key: SPARK-30765 > URL: https://issues.apache.org/jira/browse/SPARK-30765 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xin Wu >Assignee: Xin Wu >Priority: Major > Fix For: 3.1.0 > > > When doing base operator abstraction work, I found there are still some code > snippet is inconsistent with other abstraction code style. > > Case 1, override keyword missed for some fields in derived classes. The > compiler will not capture it if we rename some fields in the future. > [https://github.com/apache/spark/pull/27368#discussion_r376694045] > > > Case 2, inconsistent abstract class definition. The updated style will > simplify derived class definition. > [https://github.com/apache/spark/pull/27368#discussion_r375061952] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30765) Refine baes class abstraction code style
[ https://issues.apache.org/jira/browse/SPARK-30765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30765: Assignee: Xin Wu > Refine baes class abstraction code style > > > Key: SPARK-30765 > URL: https://issues.apache.org/jira/browse/SPARK-30765 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xin Wu >Assignee: Xin Wu >Priority: Major > > When doing base operator abstraction work, I found there are still some code > snippet is inconsistent with other abstraction code style. > > Case 1, override keyword missed for some fields in derived classes. The > compiler will not capture it if we rename some fields in the future. > [https://github.com/apache/spark/pull/27368#discussion_r376694045] > > > Case 2, inconsistent abstract class definition. The updated style will > simplify derived class definition. > [https://github.com/apache/spark/pull/27368#discussion_r375061952] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27630) Stage retry causes totalRunningTasks calculation to be negative
[ https://issues.apache.org/jira/browse/SPARK-27630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046245#comment-17046245 ] Xiao Li edited comment on SPARK-27630 at 2/27/20 7:11 AM: -- This change breaks the API. Need a release note. {code:java} ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.apply"), ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.copy"), ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.this"), ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted$"),{code} was (Author: smilegator): This change breaks the API {code:java} ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.apply"), ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.copy"), ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.this"), ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted$"),{code} > Stage retry causes totalRunningTasks calculation to be negative > --- > > Key: SPARK-27630 > URL: https://issues.apache.org/jira/browse/SPARK-27630 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > Track tasks separately for each stage attempt (instead of tracking by stage), > and do NOT reset the numRunningTasks to 0 on StageCompleted. > In the case of stage retry, the {{taskEnd}} event from the zombie stage > sometimes makes the number of {{totalRunningTasks}} negative, which will > causes the job to get stuck. > Similar problem also exists with {{stageIdToTaskIndices}} & > {{stageIdToSpeculativeTaskIndices}}. > If it is a failed {{taskEnd}} event of the zombie stage, this will cause > {{stageIdToTaskIndices}} or {{stageIdToSpeculativeTaskIndices}} to remove the > task index of the active stage, and the number of {{totalPendingTasks}} will > increase unexpectedly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27630) Stage retry causes totalRunningTasks calculation to be negative
[ https://issues.apache.org/jira/browse/SPARK-27630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046245#comment-17046245 ] Xiao Li commented on SPARK-27630: - This change breaks the API {code:java} ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.apply"), ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.copy"), ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted.this"), ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.scheduler.SparkListenerSpeculativeTaskSubmitted$"),{code} > Stage retry causes totalRunningTasks calculation to be negative > --- > > Key: SPARK-27630 > URL: https://issues.apache.org/jira/browse/SPARK-27630 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > Track tasks separately for each stage attempt (instead of tracking by stage), > and do NOT reset the numRunningTasks to 0 on StageCompleted. > In the case of stage retry, the {{taskEnd}} event from the zombie stage > sometimes makes the number of {{totalRunningTasks}} negative, which will > causes the job to get stuck. > Similar problem also exists with {{stageIdToTaskIndices}} & > {{stageIdToSpeculativeTaskIndices}}. > If it is a failed {{taskEnd}} event of the zombie stage, this will cause > {{stageIdToTaskIndices}} or {{stageIdToSpeculativeTaskIndices}} to remove the > task index of the active stage, and the number of {{totalPendingTasks}} will > increase unexpectedly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27630) Stage retry causes totalRunningTasks calculation to be negative
[ https://issues.apache.org/jira/browse/SPARK-27630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27630: Labels: release-notes (was: ) > Stage retry causes totalRunningTasks calculation to be negative > --- > > Key: SPARK-27630 > URL: https://issues.apache.org/jira/browse/SPARK-27630 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > Track tasks separately for each stage attempt (instead of tracking by stage), > and do NOT reset the numRunningTasks to 0 on StageCompleted. > In the case of stage retry, the {{taskEnd}} event from the zombie stage > sometimes makes the number of {{totalRunningTasks}} negative, which will > causes the job to get stuck. > Similar problem also exists with {{stageIdToTaskIndices}} & > {{stageIdToSpeculativeTaskIndices}}. > If it is a failed {{taskEnd}} event of the zombie stage, this will cause > {{stageIdToTaskIndices}} or {{stageIdToSpeculativeTaskIndices}} to remove the > task index of the active stage, and the number of {{totalPendingTasks}} will > increase unexpectedly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30590) can't use more than five type-safe user-defined aggregation in select statement
[ https://issues.apache.org/jira/browse/SPARK-30590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30590: --- Assignee: L. C. Hsieh > can't use more than five type-safe user-defined aggregation in select > statement > --- > > Key: SPARK-30590 > URL: https://issues.apache.org/jira/browse/SPARK-30590 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Daniel Mantovani >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > How to reproduce: > {code:scala} > val df = Seq((1,2,3,4,5,6)).toDF("a","b","c","d","e","f") > import org.apache.spark.sql.expressions.Aggregator > import org.apache.spark.sql.Encoder > import org.apache.spark.sql.Encoders > import org.apache.spark.sql.Row > case class FooAgg(s:Int) extends Aggregator[Row, Int, Int] { > def zero:Int = s > def reduce(b: Int, r: Row): Int = b + r.getAs[Int](0) > def merge(b1: Int, b2: Int): Int = b1 + b2 > def finish(b: Int): Int = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val fooAgg = (i:Int) => FooAgg(i).toColumn.name(s"foo_agg_$i") > scala> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5)).show > +-+-+-+-+-+ > |foo_agg_1|foo_agg_2|foo_agg_3|foo_agg_4|foo_agg_5| > +-+-+-+-+-+ > |3|5|7|9| 11| > +-+-+-+-+-+ > {code} > With 6 arguments we have error: > {code:scala} > scala> > df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show > org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate > [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, > assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, > IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, > None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 > as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) > AS foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS > value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS > value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, > fooagg(FooAgg(4), None, None, None, input[0, int, false] AS value#129, > assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, > IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, > None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 > as int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) > AS foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS > value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS > value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];; > 'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS > value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS > value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, > fooagg(FooAgg(2), None, None, None, input[0, int, false] AS value#119, > assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, > IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, > None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 > as int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) > AS foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS > value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS > value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, > fooagg(FooAgg(5), None, None, None, input[0, int, false] AS value#134, > assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, > IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, > None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 > as int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) > AS foo_agg_6#141] > +- Project [_1#6 AS a#13, _2#7 AS b#14, _3#8 AS c#15, _4#9 AS d#16, _5#10 AS > e#17, _6#11 AS F#18] > +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:431) > at >
[jira] [Created] (SPARK-30967) Achieve LAST_ACCESS_TIME column update in TBLS table of hive metastore on table access
NITISH SHARMA created SPARK-30967: - Summary: Achieve LAST_ACCESS_TIME column update in TBLS table of hive metastore on table access Key: SPARK-30967 URL: https://issues.apache.org/jira/browse/SPARK-30967 Project: Spark Issue Type: Question Components: Spark Shell Affects Versions: 2.4.5 Reporter: NITISH SHARMA I have a requirement where i am looking to update LAST_ACCESS_TIME in TBLS of Hive metastore whenever any table is accessed through spark. I set this below property in hive-site.xml and hive honors it and updates the LAST_ACCESS_TIME everytime it is accessed. hive.exec.pre.hooks org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec However, the same thing i want to achieve using pyspark/spark-shell but its not honoring this property of hive hooks. Is there an alternate approach of achieving this - 'Update of LAST_ACCESS_TIME in hive metastore on access using spark'. I passed the property like this - spark-sql -e 'set spark.hadoop.hive.exec.post.hooks=org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;select * from db.table;' as well as i put the same property in /etc/spark/conf/hive-site.xml location. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30932) ML 3.0 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-30932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046228#comment-17046228 ] zhengruifeng commented on SPARK-30932: -- I checked added classes from {{added_ml_class:}} * FMClassifier, FMRegressor has related Java example and doc; * RobustScaler has related Java example and doc; * MultilabelClassificationEvaluator,RankingEvaluator do not have related Java examples; However, other evaluators do not have examples, either; *We may need to add some basic description in doc/ml-tuning.* * org.apache.spark.ml.functions has no related doc, is only used in \{{FunctionsSuite}}; *I am not sure we should make it public;* * org.apache.spark.ml.\{FitStart, FitEnd, LoadInstanceStart, LoadInstanceEnd, SaveInstanceStart, SaveInstanceEnd, TransformStart, TransformEnd} are marked \{{Unstable}} and has no related doc; > ML 3.0 QA: API: Java compatibility, docs > > > Key: SPARK-30932 > URL: https://issues.apache.org/jira/browse/SPARK-30932 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > Attachments: 1_process_script.sh, added_ml_class, common_ml_class, > signature.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30967) Achieve LAST_ACCESS_TIME column update in TBLS table of hive metastore on hive table access through pyspark
[ https://issues.apache.org/jira/browse/SPARK-30967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] NITISH SHARMA updated SPARK-30967: -- Summary: Achieve LAST_ACCESS_TIME column update in TBLS table of hive metastore on hive table access through pyspark (was: Achieve LAST_ACCESS_TIME column update in TBLS table of hive metastore on table access ) > Achieve LAST_ACCESS_TIME column update in TBLS table of hive metastore on > hive table access through pyspark > --- > > Key: SPARK-30967 > URL: https://issues.apache.org/jira/browse/SPARK-30967 > Project: Spark > Issue Type: Question > Components: Spark Shell >Affects Versions: 2.4.5 >Reporter: NITISH SHARMA >Priority: Critical > > I have a requirement where i am looking to update LAST_ACCESS_TIME in TBLS of > Hive metastore whenever any table is accessed through spark. I set this below > property in hive-site.xml and hive honors it and updates the LAST_ACCESS_TIME > everytime it is accessed. > > hive.exec.pre.hooks > > org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec > > However, the same thing i want to achieve using pyspark/spark-shell but its > not honoring this property of hive hooks. Is there an alternate approach of > achieving this - 'Update of LAST_ACCESS_TIME in hive metastore on access > using spark'. > I passed the property like this - > spark-sql -e 'set > spark.hadoop.hive.exec.post.hooks=org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;select > * from db.table;' > as well as i put the same property in /etc/spark/conf/hive-site.xml location. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30590) can't use more than five type-safe user-defined aggregation in select statement
[ https://issues.apache.org/jira/browse/SPARK-30590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30590. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27499 [https://github.com/apache/spark/pull/27499] > can't use more than five type-safe user-defined aggregation in select > statement > --- > > Key: SPARK-30590 > URL: https://issues.apache.org/jira/browse/SPARK-30590 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Daniel Mantovani >Priority: Major > Fix For: 3.0.0 > > > How to reproduce: > {code:scala} > val df = Seq((1,2,3,4,5,6)).toDF("a","b","c","d","e","f") > import org.apache.spark.sql.expressions.Aggregator > import org.apache.spark.sql.Encoder > import org.apache.spark.sql.Encoders > import org.apache.spark.sql.Row > case class FooAgg(s:Int) extends Aggregator[Row, Int, Int] { > def zero:Int = s > def reduce(b: Int, r: Row): Int = b + r.getAs[Int](0) > def merge(b1: Int, b2: Int): Int = b1 + b2 > def finish(b: Int): Int = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val fooAgg = (i:Int) => FooAgg(i).toColumn.name(s"foo_agg_$i") > scala> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5)).show > +-+-+-+-+-+ > |foo_agg_1|foo_agg_2|foo_agg_3|foo_agg_4|foo_agg_5| > +-+-+-+-+-+ > |3|5|7|9| 11| > +-+-+-+-+-+ > {code} > With 6 arguments we have error: > {code:scala} > scala> > df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show > org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate > [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, > assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, > IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, > None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 > as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) > AS foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS > value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS > value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, > fooagg(FooAgg(4), None, None, None, input[0, int, false] AS value#129, > assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, > IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, > None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 > as int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) > AS foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS > value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS > value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];; > 'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS > value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS > value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, > fooagg(FooAgg(2), None, None, None, input[0, int, false] AS value#119, > assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, > IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, > None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 > as int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) > AS foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS > value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS > value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, > fooagg(FooAgg(5), None, None, None, input[0, int, false] AS value#134, > assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, > IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, > None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 > as int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) > AS foo_agg_6#141] > +- Project [_1#6 AS a#13, _2#7 AS b#14, _3#8 AS c#15, _4#9 AS d#16, _5#10 AS > e#17, _6#11 AS F#18] > +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95) > at >
[jira] [Created] (SPARK-30966) spark.createDataFrame fails with pandas DataFrame including pandas.NA
Aki Ariga created SPARK-30966: - Summary: spark.createDataFrame fails with pandas DataFrame including pandas.NA Key: SPARK-30966 URL: https://issues.apache.org/jira/browse/SPARK-30966 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.5 Reporter: Aki Ariga As of pandas 1.0.0, pandas.NA was introduced, and that breaks createDataFrame function as the following: {code:python} In [5]: from pyspark.sql import SparkSession In [6]: spark = SparkSession.builder.getOrCreate() In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", "true") In [8]: import numpy as np ...: import pandas as pd In [12]: pdf = pd.DataFrame(data=[{'a':1,'b':2}, {'a':3,'b':4,'c':5}], dtype=pd.Int64Dtype()) In [16]: pdf Out[16]: a b c 0 1 2 1 3 4 5 In [13]: sdf = spark.createDataFrame(pdf) /Users/ariga/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below: Did not pass numpy.dtype object Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true. warnings.warn(msg) --- TypeError Traceback (most recent call last) in > 1 sdf = spark.createDataFrame(df2) ~/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema) 746 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio) 747 else: --> 748 rdd, schema = self._createFromLocal(map(prepare, data), schema) 749 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) 750 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json()) ~/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/session.py in _createFromLocal(self, data, schema) 414 415 if schema is None or isinstance(schema, (list, tuple)): --> 416 struct = self._inferSchemaFromList(data, names=schema) 417 converter = _create_converter(struct) 418 data = map(converter, data) ~/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/session.py in _inferSchemaFromList(self, data, names) 346 warnings.warn("inferring schema from dict is deprecated," 347 "please use pyspark.sql.Row instead") --> 348 schema = reduce(_merge_type, (_infer_schema(row, names) for row in data)) 349 if _has_nulltype(schema): 350 raise ValueError("Some of types cannot be determined after inferring") ~/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/types.py in _merge_type(a, b, name) 1099 fields = [StructField(f.name, _merge_type(f.dataType, nfs.get(f.name, NullType()), 1100 name=new_name(f.name))) -> 1101 for f in a.fields] 1102 names = set([f.name for f in fields]) 1103 for n in nfs: ~/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/types.py in (.0) 1099 fields = [StructField(f.name, _merge_type(f.dataType, nfs.get(f.name, NullType()), 1100 name=new_name(f.name))) -> 1101 for f in a.fields] 1102 names = set([f.name for f in fields]) 1103 for n in nfs: ~/src/pytd/.venv/lib/python3.6/site-packages/pyspark/sql/types.py in _merge_type(a, b, name) 1092 elif type(a) is not type(b): 1093 # TODO: type cast (such as int -> long) -> 1094 raise TypeError(new_msg("Can not merge type %s and %s" % (type(a), type(b 1095 1096 # same type TypeError: field c: Can not merge type and In [15]: pyspark.__version__ Out[15]: '2.4.5' {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30932) ML 3.0 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-30932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045168#comment-17045168 ] zhengruifeng edited comment on SPARK-30932 at 2/27/20 4:05 AM: --- I check the output result and do not find outstanding issue, -except:- -public method {{GeneralMLWriter#context}} was removed in Spark-25908 without deprecation.- -{{GeneralMLWriter#context}}- was already deprecated in the doc, not a problem. was (Author: podongfeng): I check the output result and do not find outstanding issue, except: public method \{{GeneralMLWriter#context}} was removed in Spark-25908 without deprecation. > ML 3.0 QA: API: Java compatibility, docs > > > Key: SPARK-30932 > URL: https://issues.apache.org/jira/browse/SPARK-30932 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > Attachments: 1_process_script.sh, added_ml_class, common_ml_class, > signature.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30963) Add GitHub Action job for document generation
[ https://issues.apache.org/jira/browse/SPARK-30963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30963. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27715 [https://github.com/apache/spark/pull/27715] > Add GitHub Action job for document generation > - > > Key: SPARK-30963 > URL: https://issues.apache.org/jira/browse/SPARK-30963 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30965) Support C ++ library to load Spark MLlib model
Mr.Nineteen created SPARK-30965: --- Summary: Support C ++ library to load Spark MLlib model Key: SPARK-30965 URL: https://issues.apache.org/jira/browse/SPARK-30965 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 2.3.4, 2.3.3, 2.3.2, 2.3.1, 2.3.0 Reporter: Mr.Nineteen -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30888) Add version information to the configuration of Network
[ https://issues.apache.org/jira/browse/SPARK-30888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30888: Assignee: jiaan.geng > Add version information to the configuration of Network > --- > > Key: SPARK-30888 > URL: https://issues.apache.org/jira/browse/SPARK-30888 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > spark/core/src/main/scala/org/apache/spark/internal/config/Network.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30888) Add version information to the configuration of Network
[ https://issues.apache.org/jira/browse/SPARK-30888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30888. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27674 [https://github.com/apache/spark/pull/27674] > Add version information to the configuration of Network > --- > > Key: SPARK-30888 > URL: https://issues.apache.org/jira/browse/SPARK-30888 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > spark/core/src/main/scala/org/apache/spark/internal/config/Network.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30841) Add version information to the configuration of SQL
[ https://issues.apache.org/jira/browse/SPARK-30841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30841. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27691 [https://github.com/apache/spark/pull/27691] > Add version information to the configuration of SQL > --- > > Key: SPARK-30841 > URL: https://issues.apache.org/jira/browse/SPARK-30841 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30841) Add version information to the configuration of SQL
[ https://issues.apache.org/jira/browse/SPARK-30841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30841: Assignee: jiaan.geng > Add version information to the configuration of SQL > --- > > Key: SPARK-30841 > URL: https://issues.apache.org/jira/browse/SPARK-30841 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30909) Add version information to the configuration of Python
[ https://issues.apache.org/jira/browse/SPARK-30909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30909: Assignee: jiaan.geng > Add version information to the configuration of Python > -- > > Key: SPARK-30909 > URL: https://issues.apache.org/jira/browse/SPARK-30909 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > core/src/main/scala/org/apache/spark/internal/config/Python.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30909) Add version information to the configuration of Python
[ https://issues.apache.org/jira/browse/SPARK-30909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30909. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27704 [https://github.com/apache/spark/pull/27704] > Add version information to the configuration of Python > -- > > Key: SPARK-30909 > URL: https://issues.apache.org/jira/browse/SPARK-30909 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > core/src/main/scala/org/apache/spark/internal/config/Python.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30910) Add version information to the configuration of R
[ https://issues.apache.org/jira/browse/SPARK-30910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30910. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27708 [https://github.com/apache/spark/pull/27708] > Add version information to the configuration of R > - > > Key: SPARK-30910 > URL: https://issues.apache.org/jira/browse/SPARK-30910 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > core/src/main/scala/org/apache/spark/internal/config/R.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30910) Add version information to the configuration of R
[ https://issues.apache.org/jira/browse/SPARK-30910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30910: Assignee: jiaan.geng > Add version information to the configuration of R > - > > Key: SPARK-30910 > URL: https://issues.apache.org/jira/browse/SPARK-30910 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > core/src/main/scala/org/apache/spark/internal/config/R.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30928) ML, GraphX 3.0 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-30928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30928. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27696 [https://github.com/apache/spark/pull/27696] > ML, GraphX 3.0 QA: API: Binary incompatible changes > --- > > Key: SPARK-30928 > URL: https://issues.apache.org/jira/browse/SPARK-30928 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Blocker > Fix For: 3.0.0 > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30928) ML, GraphX 3.0 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-30928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30928: Assignee: Huaxin Gao > ML, GraphX 3.0 QA: API: Binary incompatible changes > --- > > Key: SPARK-30928 > URL: https://issues.apache.org/jira/browse/SPARK-30928 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30964) Accelerate InMemoryStore with a new index
Gengliang Wang created SPARK-30964: -- Summary: Accelerate InMemoryStore with a new index Key: SPARK-30964 URL: https://issues.apache.org/jira/browse/SPARK-30964 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Affects Versions: 3.1.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Spark uses the class `InMemoryStore` as the KV storage for live UI and history server(by default if no LevelDB file path is provided). In `InMemoryStore`, all the task data in one application is stored in a hashmap, which key is the task ID and the value is the task data. This fine for getting or deleting with a provided task ID. However, Spark stage UI always shows all the task data in one stage and the current implementation is to look up all the values in the hashmap. The time complexity is O(numOfTasks). Also, when there are too many stages (>spark.ui.retainedStages), Spark will linearly try to look up all the task data of the stages to be deleted as well. This can be very bad for a large application with many stages and tasks. We can improve it by allowing the natural key of an entity to have a real parent index. So that on each lookup with parent node provided, Spark can look up all the natural keys(in our case, the task IDs) first, and then find the data with the natural keys in the hashmap. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30961) Arrow enabled: to_pandas with date column fails
[ https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045961#comment-17045961 ] Bryan Cutler commented on SPARK-30961: -- [~nicornk] there were a number of fixes related to Arrow that went into the master branch for 3.0.0 and not branch-2.4, notably SPARK-26887 and SPARK-26566 for the date issue. The latter was an upgrade of Arrow, and it is not the usual policy to backport upgrades. I would recommend using an older version of pyarrow with Spark, version 0.8.0 would be best, but you might be able to use 0.11.1 without issues. > Arrow enabled: to_pandas with date column fails > --- > > Key: SPARK-30961 > URL: https://issues.apache.org/jira/browse/SPARK-30961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 >Reporter: Nicolas Renkamp >Priority: Major > Labels: ready-to-commit > > Hi, > there seems to be a bug in the arrow enabled to_pandas conversion from spark > dataframe to pandas dataframe when the dataframe has a column of type > DateType. Here is a minimal example to reproduce the issue: > {code:java} > spark = SparkSession.builder.getOrCreate() > is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") > print("Arrow optimization is enabled: " + is_arrow_enabled) > spark_df = spark.createDataFrame( > [['2019-12-06']], 'created_at: string') \ > .withColumn('created_at', F.to_date('created_at')) > # works > spark_df.toPandas() > spark.conf.set("spark.sql.execution.arrow.enabled", 'true') > is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") > print("Arrow optimization is enabled: " + is_arrow_enabled) > # raises AttributeError: Can only use .dt accessor with datetimelike values > # series is still of type object, .dt does not exist > spark_df.toPandas(){code} > A fix would be to modify the _check_series_convert_date function in > pyspark.sql.types to: > {code:java} > def _check_series_convert_date(series, data_type): > """ > Cast the series to datetime.date if it's a date type, otherwise returns > the original series.:param series: pandas.Series > :param data_type: a Spark data type for the series > """ > from pyspark.sql.utils import require_minimum_pandas_version > require_minimum_pandas_version()from pandas import to_datetime > if type(data_type) == DateType: > return to_datetime(series).dt.date > else: > return series > {code} > Let me know if I should prepare a Pull Request for the 2.4.5 branch. > I have not tested the behavior on master branch. > > Thanks, > Nicolas -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30759) The cache in StringRegexExpression is not initialized for foldable patterns
[ https://issues.apache.org/jira/browse/SPARK-30759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30759: -- Fix Version/s: (was: 3.1.0) 2.4.6 3.0.0 > The cache in StringRegexExpression is not initialized for foldable patterns > --- > > Key: SPARK-30759 > URL: https://issues.apache.org/jira/browse/SPARK-30759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0, 2.4.6 > > Attachments: Screen Shot 2020-02-08 at 22.45.50.png > > > In the case of foldable patterns, the cache in StringRegexExpression should > be evaluated once but in fact it is compiled every time. Here is the example: > {code:sql} > SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*'; > {code} > the code > https://github.com/apache/spark/blob/8aebc80e0e67bcb1aa300b8c8b1a209159237632/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L45-L48: > {code:scala} > // try cache the pattern for Literal > private lazy val cache: Pattern = pattern match { > case Literal(value: String, StringType) => compile(value) > case _ => null > } > {code} > The attached screenshot shows that foldable expression doesn't fall to the > first case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30759) The cache in StringRegexExpression is not initialized for foldable patterns
[ https://issues.apache.org/jira/browse/SPARK-30759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30759: -- Affects Version/s: (was: 3.1.0) 1.6.3 2.0.2 2.1.3 2.2.3 2.3.4 2.4.5 > The cache in StringRegexExpression is not initialized for foldable patterns > --- > > Key: SPARK-30759 > URL: https://issues.apache.org/jira/browse/SPARK-30759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0, 2.4.6 > > Attachments: Screen Shot 2020-02-08 at 22.45.50.png > > > In the case of foldable patterns, the cache in StringRegexExpression should > be evaluated once but in fact it is compiled every time. Here is the example: > {code:sql} > SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*'; > {code} > the code > https://github.com/apache/spark/blob/8aebc80e0e67bcb1aa300b8c8b1a209159237632/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L45-L48: > {code:scala} > // try cache the pattern for Literal > private lazy val cache: Pattern = pattern match { > case Literal(value: String, StringType) => compile(value) > case _ => null > } > {code} > The attached screenshot shows that foldable expression doesn't fall to the > first case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30759) The cache in StringRegexExpression is not initialized for foldable patterns
[ https://issues.apache.org/jira/browse/SPARK-30759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30759: -- Issue Type: Bug (was: Improvement) > The cache in StringRegexExpression is not initialized for foldable patterns > --- > > Key: SPARK-30759 > URL: https://issues.apache.org/jira/browse/SPARK-30759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > Attachments: Screen Shot 2020-02-08 at 22.45.50.png > > > In the case of foldable patterns, the cache in StringRegexExpression should > be evaluated once but in fact it is compiled every time. Here is the example: > {code:sql} > SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*'; > {code} > the code > https://github.com/apache/spark/blob/8aebc80e0e67bcb1aa300b8c8b1a209159237632/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L45-L48: > {code:scala} > // try cache the pattern for Literal > private lazy val cache: Pattern = pattern match { > case Literal(value: String, StringType) => compile(value) > case _ => null > } > {code} > The attached screenshot shows that foldable expression doesn't fall to the > first case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30963) Add GitHub Action job for document generation
[ https://issues.apache.org/jira/browse/SPARK-30963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30963: - Assignee: Dongjoon Hyun > Add GitHub Action job for document generation > - > > Key: SPARK-30963 > URL: https://issues.apache.org/jira/browse/SPARK-30963 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30931) ML 3.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045942#comment-17045942 ] Huaxin Gao commented on SPARK-30931: I didn't see any parity issues in code, but some of the Python docs are not exactly the same as Scala docs. I will open Jira to for the docs problems. > ML 3.0 QA: API: Python API coverage > --- > > Key: SPARK-30931 > URL: https://issues.apache.org/jira/browse/SPARK-30931 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally > either necessary (intentional) or accidental. These must be recorded and > added in the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26311: Labels: release-notes (was: ) > [YARN] New feature: custom log URL for stdout/stderr > > > Key: SPARK-26311 > URL: https://issues.apache.org/jira/browse/SPARK-26311 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Labels: release-notes > > Spark has been setting static log URLs for YARN application, which points to > NodeManager webapp. Normally it would work for both running apps and finished > apps, but there're also other approaches on maintaining application logs, > like having external log service which enables to avoid application log url > to be a deadlink when NodeManager is not accessible. (Node decommissioned, > elastic nodes, etc.) > Spark can provide a new configuration for custom log url on YARN mode, which > end users can set it properly to point application log to external log > service. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26329) ExecutorMetrics should poll faster than heartbeats
[ https://issues.apache.org/jira/browse/SPARK-26329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045938#comment-17045938 ] Xiao Li commented on SPARK-26329: - This breaks the developer API {code:java} ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.scheduler.SparkListenerTaskEnd$"), ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerTaskEnd.apply"), ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.scheduler.SparkListenerTaskEnd.copy$default$6"), ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerTaskEnd.copy"), ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.SparkListenerTaskEnd.this"),{code} > ExecutorMetrics should poll faster than heartbeats > -- > > Key: SPARK-26329 > URL: https://issues.apache.org/jira/browse/SPARK-26329 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Wing Yew Poon >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > We should allow faster polling of the executor memory metrics (SPARK-23429 / > SPARK-23206) without requiring a faster heartbeat rate. We've seen the > memory usage of executors pike over 1 GB in less than a second, but > heartbeats are only every 10 seconds (by default). Spark needs to enable > fast polling to capture these peaks, without causing too much strain on the > system. > In the current implementation, the metrics are polled along with the > heartbeat, but this leads to a slow rate of polling metrics by default. If > users were to increase the rate of the heartbeat, they risk overloading the > driver on a large cluster, with too many messages and too much work to > aggregate the metrics. But, the executor could poll the metrics more > frequently, and still only send the *max* since the last heartbeat for each > metric. This keeps the load on the driver the same, and only introduces a > small overhead on the executor to grab the metrics and keep the max. > The downside of this approach is that we still need to wait for the next > heartbeat for the driver to be aware of the new peak. If the executor dies > or is killed before then, then we won't find out. A potential future > enhancement would be to send an update *anytime* there is an increase by some > percentage, but we'll leave that out for now. > Another possibility would be to change the metrics themselves to track peaks > for us, so we don't have to fine-tune the polling rate. For example, some > jvm metrics provide a usage threshold, and notification: > https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#UsageThreshold > But, that is not available on all metrics. This proposal gives us a generic > way to get a more accurate peak memory usage for *all* metrics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26329) ExecutorMetrics should poll faster than heartbeats
[ https://issues.apache.org/jira/browse/SPARK-26329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26329: Labels: release-notes (was: ) > ExecutorMetrics should poll faster than heartbeats > -- > > Key: SPARK-26329 > URL: https://issues.apache.org/jira/browse/SPARK-26329 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Wing Yew Poon >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > We should allow faster polling of the executor memory metrics (SPARK-23429 / > SPARK-23206) without requiring a faster heartbeat rate. We've seen the > memory usage of executors pike over 1 GB in less than a second, but > heartbeats are only every 10 seconds (by default). Spark needs to enable > fast polling to capture these peaks, without causing too much strain on the > system. > In the current implementation, the metrics are polled along with the > heartbeat, but this leads to a slow rate of polling metrics by default. If > users were to increase the rate of the heartbeat, they risk overloading the > driver on a large cluster, with too many messages and too much work to > aggregate the metrics. But, the executor could poll the metrics more > frequently, and still only send the *max* since the last heartbeat for each > metric. This keeps the load on the driver the same, and only introduces a > small overhead on the executor to grab the metrics and keep the max. > The downside of this approach is that we still need to wait for the next > heartbeat for the driver to be aware of the new peak. If the executor dies > or is killed before then, then we won't find out. A potential future > enhancement would be to send an update *anytime* there is an increase by some > percentage, but we'll leave that out for now. > Another possibility would be to change the metrics themselves to track peaks > for us, so we don't have to fine-tune the polling rate. For example, some > jvm metrics provide a usage threshold, and notification: > https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#UsageThreshold > But, that is not available on all metrics. This proposal gives us a generic > way to get a more accurate peak memory usage for *all* metrics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30931) ML 3.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-30931: --- Attachment: (was: image-2020-02-26-10-45-59-175.png) > ML 3.0 QA: API: Python API coverage > --- > > Key: SPARK-30931 > URL: https://issues.apache.org/jira/browse/SPARK-30931 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally > either necessary (intentional) or accidental. These must be recorded and > added in the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27006) SPIP: .NET bindings for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-27006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045867#comment-17045867 ] Kyle Van Saders commented on SPARK-27006: - It makes perfect sense for .NET developers who are already familiar with LINQ to be able to query Spark via DataSets or DataFrames. > SPIP: .NET bindings for Apache Spark > > > Key: SPARK-27006 > URL: https://issues.apache.org/jira/browse/SPARK-27006 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > Original Estimate: 4,032h > Remaining Estimate: 4,032h > > h4. Background and Motivation: > Apache Spark provides programming language support for Scala/Java (native), > and extensions for Python and R. While a variety of other language extensions > are possible to include in Apache Spark, .NET would bring one of the largest > developer community to the table. Presently, no good Big Data solution exists > for .NET developers in open source. This SPIP aims at discussing how we can > bring Apache Spark goodness to the .NET development platform. > .NET is a free, cross-platform, open source developer platform for building > many different types of applications. With .NET, you can use multiple > languages, editors, and libraries to build for web, mobile, desktop, gaming, > and IoT types of applications. Even with .NET serving millions of developers, > there is no good Big Data solution that exists today, which this SPIP aims to > address. > The .NET developer community is one of the largest programming language > communities in the world. Its flagship programming language C# is listed as > one of the most popular programming languages in a variety of articles and > statistics: > * Most popular Technologies on Stack Overflow: > [https://insights.stackoverflow.com/survey/2018/#most-popular-technologies|https://insights.stackoverflow.com/survey/2018/] > > * Most popular languages on GitHub 2018: > [https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10#2-java-9|https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10] > > * 1M+ new developers last 1 year > * Second most demanded technology on LinkedIn > * Top 30 High velocity OSS projects on GitHub > Including a C# language extension in Apache Spark will enable millions of > .NET developers to author Big Data applications in their preferred > programming language, developer environment, and tooling support. We aim to > promote the .NET bindings for Spark through engagements with the Spark > community (e.g., we are scheduled to present an early prototype at the SF > Spark Summit 2019) and the .NET developer community (e.g., similar > presentations will be held at .NET developer conferences this year). As > such, we believe that our efforts will help grow the Spark community by > making it accessible to the millions of .NET developers. > Furthermore, our early discussions with some large .NET development teams got > an enthusiastic reception. > We recognize that earlier attempts at this goal (specifically Mobius > [https://github.com/Microsoft/Mobius]) were unsuccessful primarily due to the > lack of communication with the Spark community. Therefore, another goal of > this proposal is to not only develop .NET bindings for Spark in open source, > but also continuously seek feedback from the Spark community via posted > Jira’s (like this one) and the Spark developer mailing list. Our hope is that > through these engagements, we can build a community of developers that are > eager to contribute to this effort or want to leverage the resulting .NET > bindings for Spark in their respective Big Data applications. > h4. Target Personas: > .NET developers looking to build big data solutions. > h4. Goals: > Our primary goal is to help grow Apache Spark by making it accessible to the > large .NET developer base and ecosystem. We will also look for opportunities > to generalize the interop layers for Spark for adding other language > extensions in the future. [SPARK-26257]( > https://issues.apache.org/jira/browse/SPARK-26257) proposes such a > generalized interop layer, which we hope to address over the course of this > project. > Another important goal for us is to not only enable Spark as an application > solution for .NET developers, but also opening the door for .NET developers > to make contributions to Apache Spark itself. > Lastly, we aim to develop a .NET extension in the open, while continually > engaging with the Spark community for feedback on designs and code. We will > welcome PRs from the Spark community throughout this project and aim to grow > a
[jira] [Created] (SPARK-30963) Add GitHub Action job for document generation
Dongjoon Hyun created SPARK-30963: - Summary: Add GitHub Action job for document generation Key: SPARK-30963 URL: https://issues.apache.org/jira/browse/SPARK-30963 Project: Spark Issue Type: Test Components: Project Infra Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30931) ML 3.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-30931: --- Attachment: image-2020-02-26-10-45-59-175.png > ML 3.0 QA: API: Python API coverage > --- > > Key: SPARK-30931 > URL: https://issues.apache.org/jira/browse/SPARK-30931 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > Attachments: image-2020-02-26-10-45-59-175.png > > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally > either necessary (intentional) or accidental. These must be recorded and > added in the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]
[ https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045771#comment-17045771 ] Xiao Li commented on SPARK-30962: - Also, please give some examples to show how we can alter the table's comments and the columns' comments. > Document ALTER TABLE statement in SQL Reference [Phase 2] > - > > Key: SPARK-30962 > URL: https://issues.apache.org/jira/browse/SPARK-30962 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of > ALTER TABLE statements. See the doc in preview-2 > [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html] > > We should add all the supported ALTER TABLE syntax. See > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]
[ https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-30962: Target Version/s: 3.0.0 > Document ALTER TABLE statement in SQL Reference [Phase 2] > - > > Key: SPARK-30962 > URL: https://issues.apache.org/jira/browse/SPARK-30962 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of > ALTER TABLE statements. See the doc in preview-2 > [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html] > > We should add all the supported ALTER TABLE syntax. See > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]
[ https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045770#comment-17045770 ] Xiao Li commented on SPARK-30962: - cc [~huaxing] [~dkbiswal] > Document ALTER TABLE statement in SQL Reference [Phase 2] > - > > Key: SPARK-30962 > URL: https://issues.apache.org/jira/browse/SPARK-30962 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of > ALTER TABLE statements. See the doc in preview-2 > [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html] > > We should add all the supported ALTER TABLE syntax. See > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]
[ https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-30962: Description: https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of ALTER TABLE statements. See the doc in preview-2 [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html] We should add all the supported ALTER TABLE syntax. See [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198] was: https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of ALTER TABLE statements. See the doc in preview-2 [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html] We should add all the supported ALTER TABLE syntax. See [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198] > Document ALTER TABLE statement in SQL Reference [Phase 2] > - > > Key: SPARK-30962 > URL: https://issues.apache.org/jira/browse/SPARK-30962 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of > ALTER TABLE statements. See the doc in preview-2 > [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html] > > We should add all the supported ALTER TABLE syntax. See > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]
[ https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-30962: Description: https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of ALTER TABLE statements. See the doc in preview-2 [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html] We should add all the supported ALTER TABLE syntax. See [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198] > Document ALTER TABLE statement in SQL Reference [Phase 2] > - > > Key: SPARK-30962 > URL: https://issues.apache.org/jira/browse/SPARK-30962 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of > ALTER TABLE statements. See the doc in preview-2 > [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html] > > > We should add all the supported ALTER TABLE syntax. See > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27619) MapType should be prohibited in hash expressions
[ https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27619. - Fix Version/s: 3.0.0 Assignee: Rakesh Raushan Resolution: Fixed > MapType should be prohibited in hash expressions > > > Key: SPARK-27619 > URL: https://issues.apache.org/jira/browse/SPARK-27619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0 >Reporter: Josh Rosen >Assignee: Rakesh Raushan >Priority: Blocker > Labels: correctness > Fix For: 3.0.0 > > > Spark currently allows MapType expressions to be used as input to hash > expressions, but I think that this should be prohibited because Spark SQL > does not support map equality. > Currently, Spark SQL's map hashcodes are sensitive to the insertion order of > map elements: > {code:java} > val a = spark.createDataset(Map(1->1, 2->2) :: Nil) > val b = spark.createDataset(Map(2->2, 1->1) :: Nil) > // Demonstration of how Scala Map equality is unaffected by insertion order: > assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) > assert(Map(1->1, 2->2) == Map(2->2, 1->1)) > assert(a.first() == b.first()) > // In contrast, this will print two different hashcodes: > println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code} > This behavior might be surprising to Scala developers. > I think there's precedence for banning the use of MapType here because we > already prohibit MapType in aggregation / joins / equality comparisons > (SPARK-9415) and set operations (SPARK-19893). > If we decide that we want this to be an error then it might also be a good > idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the > old and buggy behavior (in case applications were relying on it in cases > where it just so happens to be safe-by-accident (e.g. maps which only have > one entry)). > Alternatively, we could support hashing here if we implemented support for > comparable map types (SPARK-18134). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]
Xiao Li created SPARK-30962: --- Summary: Document ALTER TABLE statement in SQL Reference [Phase 2] Key: SPARK-30962 URL: https://issues.apache.org/jira/browse/SPARK-30962 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Xiao Li -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28791) Document ALTER TABLE statement in SQL Reference [Phase 1]
[ https://issues.apache.org/jira/browse/SPARK-28791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28791: Summary: Document ALTER TABLE statement in SQL Reference [Phase 1] (was: Document ALTER TABLE statement in SQL Reference.) > Document ALTER TABLE statement in SQL Reference [Phase 1] > - > > Key: SPARK-28791 > URL: https://issues.apache.org/jira/browse/SPARK-28791 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Assignee: pavithra ramachandran >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30782) Column resolution doesn't respect current catalog/namespace for v2 tables.
[ https://issues.apache.org/jira/browse/SPARK-30782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30782. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27532 [https://github.com/apache/spark/pull/27532] > Column resolution doesn't respect current catalog/namespace for v2 tables. > -- > > Key: SPARK-30782 > URL: https://issues.apache.org/jira/browse/SPARK-30782 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > For v1 tables, you can perform the following: > {code:sql} > SELECT default.t.id FROM t; > {code} > For v2 tables, the following fails: > {code:sql} > USE testcat.ns1.ns2; > SELECT testcat.ns1.ns2.t.id FROM t; > org.apache.spark.sql.AnalysisException: cannot resolve > '`testcat.ns1.ns2.t.id`' given input columns: [t.id, t.point]; line 1 pos 7; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30782) Column resolution doesn't respect current catalog/namespace for v2 tables.
[ https://issues.apache.org/jira/browse/SPARK-30782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30782: --- Assignee: Terry Kim > Column resolution doesn't respect current catalog/namespace for v2 tables. > -- > > Key: SPARK-30782 > URL: https://issues.apache.org/jira/browse/SPARK-30782 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > For v1 tables, you can perform the following: > {code:sql} > SELECT default.t.id FROM t; > {code} > For v2 tables, the following fails: > {code:sql} > USE testcat.ns1.ns2; > SELECT testcat.ns1.ns2.t.id FROM t; > org.apache.spark.sql.AnalysisException: cannot resolve > '`testcat.ns1.ns2.t.id`' given input columns: [t.id, t.point]; line 1 pos 7; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30961) Arrow enabled: to_pandas with date column fails
[ https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Renkamp updated SPARK-30961: Description: Hi, there seems to be a bug in the arrow enabled to_pandas conversion from spark dataframe to pandas dataframe when the dataframe has a column of type DateType. Here is a minimal example to reproduce the issue: {code:java} spark = SparkSession.builder.getOrCreate() is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") print("Arrow optimization is enabled: " + is_arrow_enabled) spark_df = spark.createDataFrame( [['2019-12-06']], 'created_at: string') \ .withColumn('created_at', F.to_date('created_at')) # works spark_df.toPandas() spark.conf.set("spark.sql.execution.arrow.enabled", 'true') is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") print("Arrow optimization is enabled: " + is_arrow_enabled) # raises AttributeError: Can only use .dt accessor with datetimelike values # series is still of type object, .dt does not exist spark_df.toPandas(){code} A fix would be to modify the _check_series_convert_date function in pyspark.sql.types to: {code:java} def _check_series_convert_date(series, data_type): """ Cast the series to datetime.date if it's a date type, otherwise returns the original series.:param series: pandas.Series :param data_type: a Spark data type for the series """ from pyspark.sql.utils import require_minimum_pandas_version require_minimum_pandas_version()from pandas import to_datetime if type(data_type) == DateType: return to_datetime(series).dt.date else: return series {code} Let me know if I should prepare a Pull Request for the 2.4.5 branch. I have not tested the behavior on master branch. Thanks, Nicolas was: Hi, there seems to be a bug in the arrow enabled to_pandas conversion from spark dataframe to pandas dataframe when the dataframe has a column of type DateType. Here is a minimal example to reproduce the issue: {code:java} spark = SparkSession.builder.getOrCreate() is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") print("Arrow optimization is enabled: " + is_arrow_enabled) spark_df = spark.createDataFrame( [['2019-12-06']], 'created_at: string') \ .withColumn('created_at', F.to_date('created_at')) # works spark_df.toPandas() spark.conf.set("spark.sql.execution.arrow.enabled", 'true') is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") print("Arrow optimization is enabled: " + is_arrow_enabled) # raises AttributeError spark_df.toPandas(){code} A fix would be to modify the _check_series_convert_date function in pyspark.sql.types to: {code:java} def _check_series_convert_date(series, data_type): """ Cast the series to datetime.date if it's a date type, otherwise returns the original series.:param series: pandas.Series :param data_type: a Spark data type for the series """ from pyspark.sql.utils import require_minimum_pandas_version require_minimum_pandas_version()from pandas import to_datetime if type(data_type) == DateType: return to_datetime(series).dt.date else: return series {code} Let me know if I should prepare a Pull Request for the 2.4.5 branch. I have not tested the behavior on master branch. Thanks, Nicolas > Arrow enabled: to_pandas with date column fails > --- > > Key: SPARK-30961 > URL: https://issues.apache.org/jira/browse/SPARK-30961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 >Reporter: Nicolas Renkamp >Priority: Major > Labels: ready-to-commit > > Hi, > there seems to be a bug in the arrow enabled to_pandas conversion from spark > dataframe to pandas dataframe when the dataframe has a column of type > DateType. Here is a minimal example to reproduce the issue: > {code:java} > spark = SparkSession.builder.getOrCreate() > is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") > print("Arrow optimization is enabled: " + is_arrow_enabled) > spark_df = spark.createDataFrame( > [['2019-12-06']], 'created_at: string') \ > .withColumn('created_at', F.to_date('created_at')) > # works > spark_df.toPandas() > spark.conf.set("spark.sql.execution.arrow.enabled", 'true') > is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") > print("Arrow optimization is enabled: " + is_arrow_enabled) > # raises AttributeError: Can only use .dt accessor with datetimelike values > # series is still of type object, .dt does not exist > spark_df.toPandas(){code} > A fix would be to modify the _check_series_convert_date function in > pyspark.sql.types to: > {code:java} > def
[jira] [Created] (SPARK-30961) Arrow enabled: to_pandas with date column fails
Nicolas Renkamp created SPARK-30961: --- Summary: Arrow enabled: to_pandas with date column fails Key: SPARK-30961 URL: https://issues.apache.org/jira/browse/SPARK-30961 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.5 Environment: Apache Spark 2.4.5 Reporter: Nicolas Renkamp Hi, there seems to be a bug in the arrow enabled to_pandas conversion from spark dataframe to pandas dataframe when the dataframe has a column of type DateType. Here is a minimal example to reproduce the issue: {code:java} spark = SparkSession.builder.getOrCreate() is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") print("Arrow optimization is enabled: " + is_arrow_enabled) spark_df = spark.createDataFrame( [['2019-12-06']], 'created_at: string') \ .withColumn('created_at', F.to_date('created_at')) # works spark_df.toPandas() spark.conf.set("spark.sql.execution.arrow.enabled", 'true') is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") print("Arrow optimization is enabled: " + is_arrow_enabled) # raises AttributeError spark_df.toPandas(){code} A fix would be to modify the _check_series_convert_date function in pyspark.sql.types to: {code:java} def _check_series_convert_date(series, data_type): """ Cast the series to datetime.date if it's a date type, otherwise returns the original series.:param series: pandas.Series :param data_type: a Spark data type for the series """ from pyspark.sql.utils import require_minimum_pandas_version require_minimum_pandas_version()from pandas import to_datetime if type(data_type) == DateType: return to_datetime(series).dt.date else: return series {code} Let me know if I should prepare a Pull Request for the 2.4.5 branch. I have not tested the behavior on master branch. Thanks, Nicolas -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045607#comment-17045607 ] Somnath edited comment on SPARK-10795 at 2/26/20 4:05 PM: -- Trying to submit the below test.py Spark app on a YARN cluster with the below command {noformat} PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn --deploy-mode cluster --archives venv#venv test.py{noformat} Note: I am not using local mode, but trying to use the python3.7 site-packages under the virtualenv used for building the code in PyCharm This is how the Python project structure looks along with the contents of venv directory {noformat} -rw-r--r-- 1 schakrabarti nobody 225908565 Feb 26 13:07 venv.tar.gz -rw-r--r-- 1 schakrabarti nobody 1313 Feb 26 13:07 test.py drwxr-xr-x 6 schakrabarti nobody 4096 Feb 26 13:07 venv drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/bin drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/share -rw-r--r-- 1 schakrabarti nobody 75 Feb 26 13:07 venv/pyvenv.cfg drwxr-xr-x 2 schakrabarti nobody 4096 Feb 26 13:07 venv/include drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/lib {noformat} Getting the same error of File does not exist - pyspark.zip (as shown below) {noformat} java.io.FileNotFoundException: File does not exist: hdfs://hostname-nn1.cluster.domain.com:8020/user/schakrabarti/.sparkStaging/application_1571868585150_999337/pyspark.zip{noformat} {noformat} #test.py import json from pyspark.sql import SparkSession if __name__ == "__main__": spark = SparkSession.builder \ .appName("Test_App") \ .master("spark://hostname-nn1.cluster.domain.com:41767") \ .config("spark.ui.port", "4057") \ .config("spark.executor.memory", "4g") \ .getOrCreate() print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4)) spark.stop(){noformat} was (Author: somchakr): Trying to submit the below test.py Spark app on a YARN cluster with the below command {noformat} PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn --deploy-mode cluster --archives venv#venv test.py{noformat} Note: I am not using local mode, but trying to use the python3.7 site-packages under the virtualenv used for building the code in PyCharm This is how the Python project structure looks along with the contents of venv directory {noformat} -rw-r--r-- 1 schakrabarti nobody 225908565 Feb 26 13:07 venv.tar.gz -rw-r--r-- 1 schakrabarti nobody 1313 Feb 26 13:07 test.py drwxr-xr-x 6 schakrabarti nobody 4096 Feb 26 13:07 venv drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/bin drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/share -rw-r--r-- 1 schakrabarti nobody 75 Feb 26 13:07 venv/pyvenv.cfg drwxr-xr-x 2 schakrabarti nobody 4096 Feb 26 13:07 venv/include drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/lib {noformat} Getting the same error of File does not exist - pyspark.zip (as shown below) {noformat} java.io.FileNotFoundException: File does not exist: hdfs://hostname-nn1.cluster.domain.com:8020/user/schakrabarti/.sparkStaging/application_1571868585150_999337/pyspark.zip{noformat} {noformat} #test.py import json from pyspark.sql import SparkSession if __name__ == "__main__": spark = SparkSession.builder \ .appName("Test_App") \ .master("spark://gwrd352n36.red.ygrid.yahoo.com:41767") \ .config("spark.ui.port", "4057") \ .config("spark.executor.memory", "4g") \ .getOrCreate() print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4)) spark.stop(){noformat} > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit >Priority: Major > Labels: bulk-closed > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: >
[jira] [Commented] (SPARK-7101) Spark SQL should support java.sql.Time
[ https://issues.apache.org/jira/browse/SPARK-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045628#comment-17045628 ] YoungGyu Chun commented on SPARK-7101: -- I will work on this > Spark SQL should support java.sql.Time > -- > > Key: SPARK-7101 > URL: https://issues.apache.org/jira/browse/SPARK-7101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.1 > Environment: All >Reporter: Peter Hagelund >Priority: Minor > > Several RDBMSes support the TIME data type; for more exact mapping between > those and Spark SQL, support for java.sql.Time with an associated > DataType.TimeType would be helpful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30960) add back the legacy date/timestamp format support in CSV/JSON parser
Wenchen Fan created SPARK-30960: --- Summary: add back the legacy date/timestamp format support in CSV/JSON parser Key: SPARK-30960 URL: https://issues.apache.org/jira/browse/SPARK-30960 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045607#comment-17045607 ] Somnath edited comment on SPARK-10795 at 2/26/20 3:04 PM: -- Trying to submit the below test.py Spark app on a YARN cluster with the below command {noformat} PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn --deploy-mode cluster --archives venv#venv test.py{noformat} Note: I am not using local mode, but trying to use the python3.7 site-packages under the virtualenv used for building the code in PyCharm This is how the Python project structure looks along with the contents of venv directory {noformat} -rw-r--r-- 1 schakrabarti nobody 225908565 Feb 26 13:07 venv.tar.gz -rw-r--r-- 1 schakrabarti nobody 1313 Feb 26 13:07 test.py drwxr-xr-x 6 schakrabarti nobody 4096 Feb 26 13:07 venv drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/bin drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/share -rw-r--r-- 1 schakrabarti nobody 75 Feb 26 13:07 venv/pyvenv.cfg drwxr-xr-x 2 schakrabarti nobody 4096 Feb 26 13:07 venv/include drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/lib {noformat} Getting the same error of File does not exist - pyspark.zip (as shown below) {noformat} java.io.FileNotFoundException: File does not exist: hdfs://hostname-nn1.cluster.domain.com:8020/user/schakrabarti/.sparkStaging/application_1571868585150_999337/pyspark.zip{noformat} {noformat} #test.py import json from pyspark.sql import SparkSession if __name__ == "__main__": spark = SparkSession.builder \ .appName("Test_App") \ .master("spark://gwrd352n36.red.ygrid.yahoo.com:41767") \ .config("spark.ui.port", "4057") \ .config("spark.executor.memory", "4g") \ .getOrCreate() print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4)) spark.stop(){noformat} was (Author: somchakr): Trying to submit the below test.py Spark app on a YARN cluster with the below command {noformat} PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn --deploy-mode cluster --archives venv#venv test.py{noformat} Note: I am not using local mode, but trying to use the python3.7 site-packages under the virtualenv used for building the code in PyCharm This is how the Python project structure looks. with the {noformat} -rw-r--r-- 1 schakrabarti nobody 225908565 Feb 26 13:07 venv.tar.gz -rw-r--r-- 1 schakrabarti nobody 1313 Feb 26 13:07 test.py drwxr-xr-x 6 schakrabarti nobody 4096 Feb 26 13:07 venv drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/bin drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/share -rw-r--r-- 1 schakrabarti nobody 75 Feb 26 13:07 venv/pyvenv.cfg drwxr-xr-x 2 schakrabarti nobody 4096 Feb 26 13:07 venv/include drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/lib {noformat} Getting the same error of File does not exist - pyspark.zip (as shown below) {noformat} java.io.FileNotFoundException: File does not exist: hdfs://hostname-nn1.cluster.domain.com:8020/user/schakrabarti/.sparkStaging/application_1571868585150_999337/pyspark.zip{noformat} {noformat} #test.py import json from pyspark.sql import SparkSession if __name__ == "__main__": spark = SparkSession.builder \ .appName("Test_App") \ .master("spark://gwrd352n36.red.ygrid.yahoo.com:41767") \ .config("spark.ui.port", "4057") \ .config("spark.executor.memory", "4g") \ .getOrCreate() print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4)) spark.stop(){noformat} > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit >Priority: Major > Labels: bulk-closed > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: >
[jira] [Commented] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045607#comment-17045607 ] Somnath commented on SPARK-10795: - Trying to submit the below test.py Spark app on a YARN cluster with the below command {noformat} PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn --deploy-mode cluster --archives venv#venv test.py{noformat} Note: I am not using local mode, but trying to use the python3.7 site-packages under the virtualenv used for building the code in PyCharm This is how the Python project structure looks. with the {noformat} -rw-r--r-- 1 schakrabarti nobody 225908565 Feb 26 13:07 venv.tar.gz -rw-r--r-- 1 schakrabarti nobody 1313 Feb 26 13:07 test.py drwxr-xr-x 6 schakrabarti nobody 4096 Feb 26 13:07 venv drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/bin drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/share -rw-r--r-- 1 schakrabarti nobody 75 Feb 26 13:07 venv/pyvenv.cfg drwxr-xr-x 2 schakrabarti nobody 4096 Feb 26 13:07 venv/include drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/lib {noformat} Getting the same error of File does not exist - pyspark.zip (as shown below) {noformat} java.io.FileNotFoundException: File does not exist: hdfs://hostname-nn1.cluster.domain.com:8020/user/schakrabarti/.sparkStaging/application_1571868585150_999337/pyspark.zip{noformat} {noformat} #test.py import json from pyspark.sql import SparkSession if __name__ == "__main__": spark = SparkSession.builder \ .appName("Test_App") \ .master("spark://gwrd352n36.red.ygrid.yahoo.com:41767") \ .config("spark.ui.port", "4057") \ .config("spark.executor.memory", "4g") \ .getOrCreate() print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4)) spark.stop(){noformat} > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit >Priority: Major > Labels: bulk-closed > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: > hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > java.io.FileNotFoundException: File does not exist: > hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30910) Add version information to the configuration of R
[ https://issues.apache.org/jira/browse/SPARK-30910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-30910: --- Summary: Add version information to the configuration of R (was: Arrange version info of R) > Add version information to the configuration of R > - > > Key: SPARK-30910 > URL: https://issues.apache.org/jira/browse/SPARK-30910 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > core/src/main/scala/org/apache/spark/internal/config/R.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30947) Log better message when accelerate resource is empty
[ https://issues.apache.org/jira/browse/SPARK-30947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-30947: - Summary: Log better message when accelerate resource is empty (was: Don't log accelerate resources when it's empty) > Log better message when accelerate resource is empty > > > Key: SPARK-30947 > URL: https://issues.apache.org/jira/browse/SPARK-30947 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > It's weird to see cpu/memory resources after logging resource is empty: > {code:java} > 20/02/25 21:47:55 INFO ResourceUtils: > == > 20/02/25 21:47:55 INFO ResourceUtils: Resources for spark.driver: > 20/02/25 21:47:55 INFO ResourceUtils: > == > 20/02/25 21:47:55 INFO SparkContext: Submitted application: Spark shell > 20/02/25 21:47:55 INFO ResourceProfile: Default ResourceProfile created, > executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , > memory -> name: memory, amount: 1024, script: , vendor: ), task resources: > Map(cpus -> name: cpus, amount: 1.0) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30959) How to write using JDBC driver to SQL Server / Azure DWH to column of BINARY type?
[ https://issues.apache.org/jira/browse/SPARK-30959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Kumachev updated SPARK-30959: - Description: My initial goal is to save UUId values to SQL Server/Azure DWH to column of BINARY(16) type. For example, I have demo table: {code:java} CREATE TABLE [Events] ([EventId] [binary](16) NOT NULL){code} I want to write data to it using Spark like this: {code:java} import java.util.UUID val uuid = UUID.randomUUID() val uuidBytes = Array.ofDim[Byte](16) ByteBuffer.wrap(uuidBytes) .order(ByteOrder.BIG_ENDIAN) .putLong(uuid.getMostSignificantBits()) .putLong(uuid.getLeastSignificantBits() val schema = StructType( List( StructField("EventId", BinaryType, false) ) ) val data = Seq((uuidBytes)).toDF("EventId").rdd; val df = spark.createDataFrame(data, schema); df.write .format("jdbc") .option("url", "") .option("dbTable", "Events") .mode(org.apache.spark.sql.SaveMode.Append) .save() {code} This code returns an error: {noformat} java.sql.BatchUpdateException: Conversion from variable or parameter type VARBINARY to target column type BINARY is not supported.{noformat} My question is how to cope with this situation and insert UUId value to BINARY(16) column? My investigation: Spark uses conception of JdbcDialects and has a mapping for each Catalyst type to database type and vice versa. For example here is [MsSqlServerDialect|[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala]] which is used when we work against SQL Server or Azure DWH. In the method `getJDBCType` you can see the mapping: {code:java} case BinaryType => Some(JdbcType("VARBINARY(MAX)", java.sql.Types.VARBINARY)){code} and this is the root of my problem as I think. So, I decide to implement my own JdbcDialect to override this behavior: {code:java} class SqlServerDialect extends JdbcDialect { override def canHandle(url: String) : Boolean = url.startsWith("jdbc:sqlserver") override def getJDBCType(dt: DataType): Option[JdbcType] = dt match { case BinaryType => Option(JdbcType("BINARY(16)", java.sql.Types.BINARY)) case _ => None } } val dialect = new SqlServerDialect JdbcDialects.registerDialect(dialect) {code} With this modification I still catch exactly the same error. It looks like that Spark do not use mapping from my custom dialect. But I checked that the dialect is registered. So it is strange situation. was: My initial goal is to save UUId values to SQL Server/Azure DWH to column of BINARY(16) type. For example, I have demo table: {code:java} CREATE TABLE [Events] ([EventId] [binary](16) NOT NULL){code} I want to write data to it using Spark like this: {code:java} import java.util.UUID val uuid = UUID.randomUUID() val uuidBytes = Array.ofDim[Byte](16) ByteBuffer.wrap(uuidBytes) .order(ByteOrder.BIG_ENDIAN) .putLong(uuid.getMostSignificantBits()) .putLong(uuid.getLeastSignificantBits() val schema = StructType( List( StructField("EventId", BinaryType, false) ) ) val data = Seq((uuidBytes)).toDF("EventId").rdd; val df = spark.createDataFrame(data, schema); df.write .format("jdbc") .option("url", "") .option("dbTable", "Events") .mode(org.apache.spark.sql.SaveMode.Append) .save() {code} This code returns an error: {noformat} java.sql.BatchUpdateException: Conversion from variable or parameter type VARBINARY to target column type BINARY is not supported.{noformat} My question is how to cope with this situation and insert UUId value to BINARY(16) column? My investigation: Spark uses conception of JdbcDialects and has a mapping for each Catalyst type to database type and vice versa. For example here is [MsSqlServerDialect|[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala]] which is used when we work against SQL Server or Azure DWH. In the method `getJDBCType` you can see the mapping: {code:java} case BinaryType => Some(JdbcType("VARBINARY(MAX)", java.sql.Types.VARBINARY)){code} and this is the root of my problem as I think. So, I decide to implement my own JdbcDialect to override this behavior: {code:java} class SqlServerDialect extends JdbcDialect { override def canHandle(url: String) : Boolean = url.startsWith("jdbc:sqlserver") override def getJDBCType(dt: DataType): Option[JdbcType] = dt match { case BinaryType => Option(JdbcType("BINARY(16)", java.sql.Types.BINARY)) case _ => None } } val dialect = new SqlServerDialect JdbcDialects.registerDialect(dialect) {code} With this modification I still catch exactly the same error. It looks like that Spark do not use mapping from my custom dialect. But I checked that the dialect is registered. So it is strange situation. >
[jira] [Updated] (SPARK-30959) How to write using JDBC driver to SQL Server / Azure DWH to column of BINARY type?
[ https://issues.apache.org/jira/browse/SPARK-30959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Kumachev updated SPARK-30959: - Description: My initial goal is to save UUId values to SQL Server/Azure DWH to column of BINARY(16) type. For example, I have demo table: {code:java} CREATE TABLE [Events] ([EventId] [binary](16) NOT NULL){code} I want to write data to it using Spark like this: {code:java} import java.util.UUID val uuid = UUID.randomUUID() val uuidBytes = Array.ofDim[Byte](16) ByteBuffer.wrap(uuidBytes) .order(ByteOrder.BIG_ENDIAN) .putLong(uuid.getMostSignificantBits()) .putLong(uuid.getLeastSignificantBits() val schema = StructType( List( StructField("EventId", BinaryType, false) ) ) val data = Seq((uuidBytes)).toDF("EventId").rdd; val df = spark.createDataFrame(data, schema); df.write .format("jdbc") .option("url", "") .option("dbTable", "Events") .mode(org.apache.spark.sql.SaveMode.Append) .save() {code} This code returns an error: {noformat} java.sql.BatchUpdateException: Conversion from variable or parameter type VARBINARY to target column type BINARY is not supported.{noformat} My question is how to cope with this situation and insert UUId value to BINARY(16) column? My investigation: Spark uses conception of JdbcDialects and has a mapping for each Catalyst type to database type and vice versa. For example here is [MsSqlServerDialect|[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala]] which is used when we work against SQL Server or Azure DWH. In the method `getJDBCType` you can see the mapping: {code:java} case BinaryType => Some(JdbcType("VARBINARY(MAX)", java.sql.Types.VARBINARY)){code} and this is the root of my problem as I think. So, I decide to implement my own JdbcDialect to override this behavior: {code:java} class SqlServerDialect extends JdbcDialect { override def canHandle(url: String) : Boolean = url.startsWith("jdbc:sqlserver") override def getJDBCType(dt: DataType): Option[JdbcType] = dt match { case BinaryType => Option(JdbcType("BINARY(16)", java.sql.Types.BINARY)) case _ => None } } val dialect = new SqlServerDialect JdbcDialects.registerDialect(dialect) {code} With this modification I still catch exactly the same error. It looks like that Spark do not use mapping from my custom dialect. But I checked that the dialect is registered. So it is strange situation. was: My initial goal is to save UUId values to SQL Server/Azure DWH to column of BINARY(16) type. For example, I have demo table: {code:java} CREATE TABLE [Events] ([EventId] [binary](16) NOT NULL){code} I want to write data to it using Spark like this: {code:java} import java.util.UUID val uuid = UUID.randomUUID() val uuidBytes = Array.ofDim[Byte](16) ByteBuffer.wrap(uuidBytes) .order(ByteOrder.BIG_ENDIAN) .putLong(uuid.getMostSignificantBits()) .putLong(uuid.getLeastSignificantBits() val schema = StructType( List( StructField("EventId", BinaryType, false) ) ) val data = Seq((uuidBytes)).toDF("EventId").rdd; val df = spark.createDataFrame(data, schema); df.write .format("jdbc") .option("url", "") .option("dbTable", "Events") .mode(org.apache.spark.sql.SaveMode.Append) .save() {code} This code returns an error: {noformat} java.sql.BatchUpdateException: Conversion from variable or parameter type VARBINARY to target column type BINARY is not supported.{noformat} My question is how to cope with this situation and insert UUId value to BINARY(16) column? My investigation: Spark uses conception of JdbcDialects and has a mapping for each Catalyst type to database type and vice versa. For example here is [MsSqlServerDialect|[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala]] which is used when we work against SQL Server or Azure DWH. In the method `getJDBCType` you can see the mapping: {code:java} case BinaryType => Some(JdbcType("VARBINARY(MAX)", java.sql.Types.VARBINARY)){code} and this is the root of my problem as I think. So, I decide to implement my own JdbcDialect to override this behavior: {code:java} class SqlServerDialect extends JdbcDialect { override def canHandle(url: String) : Boolean = url.startsWith("jdbc:sqlserver") override def getJDBCType(dt: DataType): Option[JdbcType] = dt match { case BinaryType => Option(JdbcType("BINARY(16)", java.sql.Types.BINARY)) case _ => None } } val dialect = new SqlServerDialect JdbcDialects.registerDialect(dialect) {code} With this modification I still catch exactly the same error. It looks like that Spark do not use mapping from my custom dialect. But I checked that the dialect is registered. So it is strange situation.
[jira] [Created] (SPARK-30959) How to write using JDBC driver to SQL Server / Azure DWH to column of BINARY type?
Mikhail Kumachev created SPARK-30959: Summary: How to write using JDBC driver to SQL Server / Azure DWH to column of BINARY type? Key: SPARK-30959 URL: https://issues.apache.org/jira/browse/SPARK-30959 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.4.4 Reporter: Mikhail Kumachev My initial goal is to save UUId values to SQL Server/Azure DWH to column of BINARY(16) type. For example, I have demo table: {code:java} CREATE TABLE [Events] ([EventId] [binary](16) NOT NULL){code} I want to write data to it using Spark like this: {code:java} import java.util.UUID val uuid = UUID.randomUUID() val uuidBytes = Array.ofDim[Byte](16) ByteBuffer.wrap(uuidBytes) .order(ByteOrder.BIG_ENDIAN) .putLong(uuid.getMostSignificantBits()) .putLong(uuid.getLeastSignificantBits() val schema = StructType( List( StructField("EventId", BinaryType, false) ) ) val data = Seq((uuidBytes)).toDF("EventId").rdd; val df = spark.createDataFrame(data, schema); df.write .format("jdbc") .option("url", "") .option("dbTable", "Events") .mode(org.apache.spark.sql.SaveMode.Append) .save() {code} This code returns an error: {noformat} java.sql.BatchUpdateException: Conversion from variable or parameter type VARBINARY to target column type BINARY is not supported.{noformat} My question is how to cope with this situation and insert UUId value to BINARY(16) column? My investigation: Spark uses conception of JdbcDialects and has a mapping for each Catalyst type to database type and vice versa. For example here is [MsSqlServerDialect|[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala]] which is used when we work against SQL Server or Azure DWH. In the method `getJDBCType` you can see the mapping: {code:java} case BinaryType => Some(JdbcType("VARBINARY(MAX)", java.sql.Types.VARBINARY)){code} and this is the root of my problem as I think. So, I decide to implement my own JdbcDialect to override this behavior: {code:java} class SqlServerDialect extends JdbcDialect { override def canHandle(url: String) : Boolean = url.startsWith("jdbc:sqlserver") override def getJDBCType(dt: DataType): Option[JdbcType] = dt match { case BinaryType => Option(JdbcType("BINARY(16)", java.sql.Types.BINARY)) case _ => None } } val dialect = new SqlServerDialect JdbcDialects.registerDialect(dialect) {code} With this modification I still catch exactly the same error. It looks like that Spark do not use mapping from my custom dialect. But I checked that the dialect is registered. So it is strange situation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30958) do not set default era for DateTimeFormatter
Wenchen Fan created SPARK-30958: --- Summary: do not set default era for DateTimeFormatter Key: SPARK-30958 URL: https://issues.apache.org/jira/browse/SPARK-30958 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30957) Null-safe variant of Dataset.join(Dataset[_], Seq[String])
Enrico Minack created SPARK-30957: - Summary: Null-safe variant of Dataset.join(Dataset[_], Seq[String]) Key: SPARK-30957 URL: https://issues.apache.org/jira/browse/SPARK-30957 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Enrico Minack The {{Dataset.join(Dataset, Seq[String])}} method provides extra convenience over {{Dataset.join(Dataset, joinExprs: Column)}} as it does not duplicate the join columns {{Seq[String]}} in the result {{DataFrame}}. Those columns are compared with {{===}}. When those join columns need to be compared null-safe with {{<=>}}, the join condition becomes very verbose and requires extra {{drop}} operations: {code:java} df1.join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b")).drop(df2("a")).drop(df2("b")).show() {code} Elegant would be the following null-safe join operation: {code:java} df1.joinNullSafe(df2, joinColumns) {code} Possible namings: - {{Dataset.joinNullSafe(Dataset[_], Seq[String])}} - {{Dataset.joinWithNulls(Dataset[_], Seq[String])}} - {{Dataset.join(Dataset[_], Seq[String], <=>)}} *I am happy to provide a PR if this Dataset API extension is appreciated.* This request has been sent to the Apache Spark user and [dev|http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-dataframe-null-safe-joins-given-a-list-of-columns-tt28842.html] mailing list by Marcelo Valle. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30319) Adds a stricter version of as[T]
[ https://issues.apache.org/jira/browse/SPARK-30319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-30319: -- Description: The behaviour of as[T] is not intuitive when you read code like df.as[T].write.csv("data.csv"). The result depends on the actual schema of df, where def as[T](): Dataset[T] should be agnostic to the schema of df. The expected behaviour is not provided elsewhere: * Extra columns that are not part of the type {{T}} are not dropped. * Order of columns is not aligned with schema of {{T}}. A method that enforces schema of T on a given Dataset would be very convenient and allows to articulate and guarantee above assumptions about your data with the native Spark Dataset API. This method plays a more explicit and enforcing role than as[T] with respect to columns, column order and column type. Possible naming of a stricter version of {{as[T]}}: * {{as[T](strict = true)}} * {{toDS[T]}} (as in {{toDF}}) * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}}) The naming {{toDS[T]}} is chosen in the linked PR. was: The behaviour of as[T] is not intuitive when you read code like df.as[T].write.csv("data.csv"). The result depends on the actual schema of df, where def as[T](): Dataset[T] should be agnostic to the schema of df. The expected behaviour is not provided elsewhere: * Extra columns that are not part of the type {{T}} are not dropped. * Order of columns is not aligned with schema of {{T}}. * Columns are not cast to the types of {{T}}'s fields. They have to be cast explicitly. A method that enforces schema of T on a given Dataset would be very convenient and allows to articulate and guarantee above assumptions about your data with the native Spark Dataset API. This method plays a more explicit and enforcing role than as[T] with respect to columns, column order and column type. Possible naming of a stricter version of {{as[T]}}: * {{as[T](strict = true)}} * {{toDS[T]}} (as in {{toDF}}) * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}}) The naming {{toDS[T]}} is chosen here. > Adds a stricter version of as[T] > > > Key: SPARK-30319 > URL: https://issues.apache.org/jira/browse/SPARK-30319 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Enrico Minack >Priority: Major > > The behaviour of as[T] is not intuitive when you read code like > df.as[T].write.csv("data.csv"). The result depends on the actual schema of > df, where def as[T](): Dataset[T] should be agnostic to the schema of df. The > expected behaviour is not provided elsewhere: > * Extra columns that are not part of the type {{T}} are not dropped. > * Order of columns is not aligned with schema of {{T}}. > A method that enforces schema of T on a given Dataset would be very > convenient and allows to articulate and guarantee above assumptions about > your data with the native Spark Dataset API. This method plays a more > explicit and enforcing role than as[T] with respect to columns, column order > and column type. > Possible naming of a stricter version of {{as[T]}}: > * {{as[T](strict = true)}} > * {{toDS[T]}} (as in {{toDF}}) > * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}}) > The naming {{toDS[T]}} is chosen in the linked PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators
[ https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-30666: -- Description: This proposes a pragmatic improvement to allow for reliable single-stage accumulators. Under the assumption that a given stage / partition / rdd produces identical results, non-deterministic code produces identical accumulator increments on success. Rerunning partitions for any reason should always produce the same increments per partition on success. With this pragmatic approach, increments from individual partitions / tasks are only merged into the accumulator on driver side for the first time per partition. This is useful for accumulators registered with {{countFailedValues == false}}. Hence, the accumulator aggregates all successful partitions only once. The implementations require extra memory that scales with the number of partitions. was: This proposes a pragmatic improvement to allow for reliable single-stage accumulators. Under the assumption that a given stage / partition / rdd produces identical results, non-deterministic code produces identical accumulator increments on success. Rerunning partitions for any reason should always produce the same increments on success. With this pragmatic approach, increments from individual partitions / tasks are compared to earlier increments. Depending on the strategy of how a new increment updates over an earlier increment from the same partition, different semantics of accumulators (here called accumulator modes) can be implemented: - {{ALL}} sums over all increments of each partition: this represents the current implementation of accumulators - {{FIRST}} increment: allows to retrieve the first accumulator value for each partition only. This is useful for accumulators registered with {{countFailedValues == false}}. - {{LARGEST}} over all increments of each partition: accumulators aggregate multiple increments while a partition is processed, a successful task provides the most accumulated values that has always the largest cardinality than any accumulated value of failed tasks, hence it paramounts any failed task's value. This produces reliable accumulator values. This does not require {{countFailedValues == false}}. This should only be used in a single stage. The naming may be confused with {{MAX}}. The implementations for {{LARGEST}} and {{FIRST}} require extra memory that scales with the number of partitions. The current {{ALL}} implementation does not require extra memory. > Reliable single-stage accumulators > -- > > Key: SPARK-30666 > URL: https://issues.apache.org/jira/browse/SPARK-30666 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Enrico Minack >Priority: Major > > This proposes a pragmatic improvement to allow for reliable single-stage > accumulators. Under the assumption that a given stage / partition / rdd > produces identical results, non-deterministic code produces identical > accumulator increments on success. Rerunning partitions for any reason should > always produce the same increments per partition on success. > With this pragmatic approach, increments from individual partitions / tasks > are only merged into the accumulator on driver side for the first time per > partition. This is useful for accumulators registered with > {{countFailedValues == false}}. Hence, the accumulator aggregates all > successful partitions only once. > The implementations require extra memory that scales with the number of > partitions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30956) Use intercept instead of try-catch to assert failures in IntervalUtilsSuite
Kent Yao created SPARK-30956: Summary: Use intercept instead of try-catch to assert failures in IntervalUtilsSuite Key: SPARK-30956 URL: https://issues.apache.org/jira/browse/SPARK-30956 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao Addressed the comment from https://github.com/apache/spark/pull/27672#discussion_r383719562 to use `intercept` instead of `try-catch` block to assert failures in the IntervalUtilsSuite -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30955) Exclude Generate output when aliasing in nested column pruning
[ https://issues.apache.org/jira/browse/SPARK-30955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045231#comment-17045231 ] L. C. Hsieh commented on SPARK-30955: - Thanks for pointing it. Yea, I re-checked and that code isn't in branch-2.4. > Exclude Generate output when aliasing in nested column pruning > -- > > Key: SPARK-30955 > URL: https://issues.apache.org/jira/browse/SPARK-30955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > When aliasing in nested column pruning on Project on top of Generate, we > should exclude Generate outputs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30955) Exclude Generate output when aliasing in nested column pruning
[ https://issues.apache.org/jira/browse/SPARK-30955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-30955: Affects Version/s: (was: 2.4.5) 3.0.0 > Exclude Generate output when aliasing in nested column pruning > -- > > Key: SPARK-30955 > URL: https://issues.apache.org/jira/browse/SPARK-30955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > When aliasing in nested column pruning on Project on top of Generate, we > should exclude Generate outputs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30909) Add version information to the configuration of Python
[ https://issues.apache.org/jira/browse/SPARK-30909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-30909: --- Summary: Add version information to the configuration of Python (was: Arrange version info of Python) > Add version information to the configuration of Python > -- > > Key: SPARK-30909 > URL: https://issues.apache.org/jira/browse/SPARK-30909 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > core/src/main/scala/org/apache/spark/internal/config/Python.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org