[jira] [Created] (SPARK-31570) gapply and dapply docs should be more aligned, possibly combined

2020-04-26 Thread Michael Chirico (Jira)
Michael Chirico created SPARK-31570:
---

 Summary: gapply and dapply docs should be more aligned, possibly 
combined
 Key: SPARK-31570
 URL: https://issues.apache.org/jira/browse/SPARK-31570
 Project: Spark
  Issue Type: Documentation
  Components: R
Affects Versions: 2.4.5
Reporter: Michael Chirico


This is a follow-up to https://issues.apache.org/jira/browse/SPARK-31568

There, we combined gapply and gapplyCollect to make it easier to sync arguments 
between those Rds.

dapply and dapplyCollect are also sufficiently similar that they could be on 
the same Rd.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31569) Add links to subsections in SQL Reference main page

2020-04-26 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31569:
--

 Summary: Add links to subsections in SQL Reference main page
 Key: SPARK-31569
 URL: https://issues.apache.org/jira/browse/SPARK-31569
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Add links to subsections in SQL Reference main page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31527) date add/subtract interval only allow those day precision in ansi mode

2020-04-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31527:
---

Assignee: Kent Yao

> date add/subtract interval only allow those day precision in ansi mode
> --
>
> Key: SPARK-31527
> URL: https://issues.apache.org/jira/browse/SPARK-31527
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> Under ANSI mode, we should not allow adding/subtracting an interval with hours, 
> minutes, ... microseconds to/from a date.
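For illustration only (not from the ticket), a minimal spark-shell sketch of the intended behavior, assuming a Spark 3.0 build with this change and ANSI mode enabled via spark.sql.ansi.enabled:

{code:scala}
// Assumption: Spark 3.0 spark-shell with this change applied.
spark.conf.set("spark.sql.ansi.enabled", "true")

// Day-precision interval: still allowed, yields 2020-01-02.
spark.sql("SELECT DATE'2020-01-01' + INTERVAL 1 DAY").show()

// Interval with a sub-day part: expected to be rejected under ANSI mode per this ticket.
spark.sql("SELECT DATE'2020-01-01' + INTERVAL 1 HOUR").show()
{code}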



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31527) date add/subtract interval only allow those day precision in ansi mode

2020-04-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31527.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28310
[https://github.com/apache/spark/pull/28310]

> date add/subtract interval only allow those day precision in ansi mode
> --
>
> Key: SPARK-31527
> URL: https://issues.apache.org/jira/browse/SPARK-31527
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Under ANSI mode, we should not allow adding/subtracting an interval with hours, 
> minutes, ... microseconds to/from a date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31557) Legacy parser incorrectly interprets pre-Gregorian dates

2020-04-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31557.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28345
[https://github.com/apache/spark/pull/28345]

> Legacy parser incorrectly interprets pre-Gregorian dates
> 
>
> Key: SPARK-31557
> URL: https://issues.apache.org/jira/browse/SPARK-31557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.0.0
>
>
> With CSV:
> {noformat}
> scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val seq = Seq("0002-01-01", "1000-01-01", "1500-01-01", 
> "1800-01-01").map(x => s"$x,$x")
> seq: Seq[String] = List(0002-01-01,0002-01-01, 1000-01-01,1000-01-01, 
> 1500-01-01,1500-01-01, 1800-01-01,1800-01-01)
> scala> val ds = seq.toDF("value").as[String]
> ds: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> spark.read.schema("expected STRING, actual DATE").csv(ds).show
> +----------+----------+
> |  expected|    actual|
> +----------+----------+
> |0002-01-01|0001-12-30|
> |1000-01-01|1000-01-06|
> |1500-01-01|1500-01-10|
> |1800-01-01|1800-01-01|
> +----------+----------+
> scala> 
> {noformat}
> Similarly, with JSON:
> {noformat}
> scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val seq = Seq("0002-01-01", "1000-01-01", "1500-01-01", 
> "1800-01-01").map { x =>
>   s"""{"expected": "$x", "actual": "$x"}"""
> }
>  |  | seq: Seq[String] = List({"expected": "0002-01-01", "actual": 
> "0002-01-01"}, {"expected": "1000-01-01", "actual": "1000-01-01"}, 
> {"expected": "1500-01-01", "actual": "1500-01-01"}, {"expected": 
> "1800-01-01", "actual": "1800-01-01"})
> scala> 
> scala> val ds = seq.toDF("value").as[String]
> ds: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> spark.read.schema("expected STRING, actual DATE").json(ds).show
> +----------+----------+
> |  expected|    actual|
> +----------+----------+
> |0002-01-01|0001-12-30|
> |1000-01-01|1000-01-06|
> |1500-01-01|1500-01-10|
> |1800-01-01|1800-01-01|
> +----------+----------+
> scala> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31557) Legacy parser incorrectly interprets pre-Gregorian dates

2020-04-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31557:
---

Assignee: Bruce Robbins

> Legacy parser incorrectly interprets pre-Gregorian dates
> 
>
> Key: SPARK-31557
> URL: https://issues.apache.org/jira/browse/SPARK-31557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>
> With CSV:
> {noformat}
> scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val seq = Seq("0002-01-01", "1000-01-01", "1500-01-01", 
> "1800-01-01").map(x => s"$x,$x")
> seq: Seq[String] = List(0002-01-01,0002-01-01, 1000-01-01,1000-01-01, 
> 1500-01-01,1500-01-01, 1800-01-01,1800-01-01)
> scala> val ds = seq.toDF("value").as[String]
> ds: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> spark.read.schema("expected STRING, actual DATE").csv(ds).show
> +----------+----------+
> |  expected|    actual|
> +----------+----------+
> |0002-01-01|0001-12-30|
> |1000-01-01|1000-01-06|
> |1500-01-01|1500-01-10|
> |1800-01-01|1800-01-01|
> +----------+----------+
> scala> 
> {noformat}
> Similarly, with JSON:
> {noformat}
> scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val seq = Seq("0002-01-01", "1000-01-01", "1500-01-01", 
> "1800-01-01").map { x =>
>   s"""{"expected": "$x", "actual": "$x"}"""
> }
>  |  | seq: Seq[String] = List({"expected": "0002-01-01", "actual": 
> "0002-01-01"}, {"expected": "1000-01-01", "actual": "1000-01-01"}, 
> {"expected": "1500-01-01", "actual": "1500-01-01"}, {"expected": 
> "1800-01-01", "actual": "1800-01-01"})
> scala> 
> scala> val ds = seq.toDF("value").as[String]
> ds: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> spark.read.schema("expected STRING, actual DATE").json(ds).show
> +----------+----------+
> |  expected|    actual|
> +----------+----------+
> |0002-01-01|0001-12-30|
> |1000-01-01|1000-01-06|
> |1500-01-01|1500-01-10|
> |1800-01-01|1800-01-01|
> +----------+----------+
> scala> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31554) Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite

2020-04-26 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092999#comment-17092999
 ] 

Wenchen Fan commented on SPARK-31554:
-

I've merged https://github.com/apache/spark/pull/28156

Let's see if the test is still flaky and whether we still need 
https://github.com/apache/spark/pull/28055

> Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite
> 
>
> Key: SPARK-31554
> URL: https://issues.apache.org/jira/browse/SPARK-31554
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The test org.apache.spark.sql.hive.thriftserver.CliSuite fails very often, 
> for example:
> * https://github.com/apache/spark/pull/28328#issuecomment-618992335
> The error message:
> {code}
> org.apache.spark.sql.hive.thriftserver.CliSuite.SPARK-11188 Analysis error 
> reporting
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Failed with 
> error line 'Exception in thread "main" 
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
> Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$4(CliSuite.scala:138)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.captureOutput$1(CliSuite.scala:135)
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6(CliSuite.scala:152)
>   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6$adapted(CliSuite.scala:152)
>   at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:188)
>   at 
> scala.sys.process.BasicIO$.$anonfun$processFully$1$adapted(BasicIO.scala:192)
>   at 
> org.apache.spark.sql.test.ProcessTestUtils$ProcessOutputCapturer.run(ProcessTestUtils.scala:30)
> {code}
> * https://github.com/apache/spark/pull/28261#issuecomment-618950225
> * https://github.com/apache/spark/pull/27617#issuecomment-614318644



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31388) org.apache.spark.sql.hive.thriftserver.CliSuite result matching is flaky

2020-04-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31388:
---

Assignee: Juliusz Sompolski

> org.apache.spark.sql.hive.thriftserver.CliSuite result matching is flaky
> 
>
> Key: SPARK-31388
> URL: https://issues.apache.org/jira/browse/SPARK-31388
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
>
> CliSuite.runCliWithin result matching has issues. Will describe in PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31388) org.apache.spark.sql.hive.thriftserver.CliSuite result matching is flaky

2020-04-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31388.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28156
[https://github.com/apache/spark/pull/28156]

> org.apache.spark.sql.hive.thriftserver.CliSuite result matching is flaky
> 
>
> Key: SPARK-31388
> URL: https://issues.apache.org/jira/browse/SPARK-31388
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> CliSuite.runCliWithin result matching has issues. Will describe in PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation

2020-04-26 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092986#comment-17092986
 ] 

Wenchen Fan commented on SPARK-31449:
-

Can we fix Spark 2.4 to calculate time zone offsets with the same code that JDK's 
GregorianCalendar and Spark 3.0 use?

> Investigate the difference between JDK and Spark's time zone offset 
> calculation
> ---
>
> Key: SPARK-31449
> URL: https://issues.apache.org/jira/browse/SPARK-31449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Maxim Gekk
>Priority: Major
>
> Spark 2.4 calculates time zone offsets from wall clock timestamp using 
> `DateTimeUtils.getOffsetFromLocalMillis()` (see 
> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118):
> {code:scala}
>   private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): 
> Long = {
> var guess = tz.getRawOffset
> // the actual offset should be calculated based on milliseconds in UTC
> val offset = tz.getOffset(millisLocal - guess)
> if (offset != guess) {
>   guess = tz.getOffset(millisLocal - offset)
>   if (guess != offset) {
> // fallback to do the reverse lookup using java.sql.Timestamp
> // this should only happen near the start or end of DST
> val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
> val year = getYear(days)
> val month = getMonth(days)
> val day = getDayOfMonth(days)
> var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt
> if (millisOfDay < 0) {
>   millisOfDay += MILLIS_PER_DAY.toInt
> }
> val seconds = (millisOfDay / 1000L).toInt
> val hh = seconds / 3600
> val mm = seconds / 60 % 60
> val ss = seconds % 60
> val ms = millisOfDay % 1000
> val calendar = Calendar.getInstance(tz)
> calendar.set(year, month - 1, day, hh, mm, ss)
> calendar.set(Calendar.MILLISECOND, ms)
> guess = (millisLocal - calendar.getTimeInMillis()).toInt
>   }
> }
> guess
>   }
> {code}
> Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see 
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801:
> {code:java}
> if (zone instanceof ZoneInfo) {
> ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets);
> } else {
> int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ?
> internalGet(ZONE_OFFSET) : 
> zone.getRawOffset();
> zone.getOffsets(millis - gmtOffset, zoneOffsets);
> }
> {code}
> We need to investigate whether there are any differences in results between the two approaches.
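For illustration only (not Spark code), a small standalone Scala harness that could start this investigation: it compares Spark 2.4's two-step guess with a reverse lookup through GregorianCalendar for an ambiguous wall-clock time in a DST overlap; the zone and timestamp are arbitrary examples:

{code:scala}
import java.util.{Calendar, TimeZone}

object OffsetComparison {
  def main(args: Array[String]): Unit = {
    val tz = TimeZone.getTimeZone("America/Los_Angeles")

    // Wall clock 2019-11-03 01:30:00, which occurs twice in this zone (DST overlap).
    // Build "millis since epoch as if the wall clock were UTC", like Spark's millisLocal.
    val utc = Calendar.getInstance(TimeZone.getTimeZone("UTC"))
    utc.clear()
    utc.set(2019, Calendar.NOVEMBER, 3, 1, 30, 0)
    val millisLocal = utc.getTimeInMillis

    // Spark 2.4's two-step guess (without the Calendar fallback branch).
    var guess = tz.getRawOffset
    val offset = tz.getOffset(millisLocal - guess)
    if (offset != guess) {
      guess = tz.getOffset(millisLocal - offset)
    }

    // Reverse lookup through GregorianCalendar (the JDK-style path).
    val cal = Calendar.getInstance(tz)
    cal.clear()
    cal.set(2019, Calendar.NOVEMBER, 3, 1, 30, 0)
    val calendarOffset = millisLocal - cal.getTimeInMillis

    println(s"two-step guess offset: $guess ms, calendar-based offset: $calendarOffset ms")
  }
}
{code}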



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model

2020-04-26 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-31497.
---
Resolution: Fixed

Issue resolved by pull request 28279
[https://github.com/apache/spark/pull/28279]

> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model
> --
>
> Key: SPARK-31497
> URL: https://issues.apache.org/jira/browse/SPARK-31497
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model.
> Reproduce code run in pyspark shell:
> 1) Train model and save model in pyspark:
> {code:python}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.feature import HashingTF, Tokenizer
> from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
> ParamGridBuilder
> training = spark.createDataFrame([
> (0, "a b c d e spark", 1.0),
> (1, "b d", 0.0),
> (2, "spark f g h", 1.0),
> (3, "hadoop mapreduce", 0.0),
> (4, "b spark who", 1.0),
> (5, "g d a y", 0.0),
> (6, "spark fly", 1.0),
> (7, "was mapreduce", 0.0),
> (8, "e spark program", 1.0),
> (9, "a e c l", 0.0),
> (10, "spark compile", 1.0),
> (11, "hadoop software", 0.0)
> ], ["id", "text", "label"])
> # Configure an ML pipeline, which consists of three stages: tokenizer, 
> hashingTF, and lr.
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
> lr = LogisticRegression(maxIter=10)
> pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
> paramGrid = ParamGridBuilder() \
> .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
> .addGrid(lr.regParam, [0.1, 0.01]) \
> .build()
> crossval = CrossValidator(estimator=pipeline,
>   estimatorParamMaps=paramGrid,
>   evaluator=BinaryClassificationEvaluator(),
>   numFolds=2)  # use 3+ folds in practice
> # Run cross-validation, and choose the best set of parameters.
> cvModel = crossval.fit(training)
> cvModel.save('/tmp/cv_model001') # saving the model fails and raises an error
> {code}
> 2) Train a cross-validation model in Scala with code similar to the above, save it 
> to '/tmp/model_cv_scala001', and run the following code in pyspark:
> {code:python}
> from pyspark.ml.tuning import CrossValidatorModel
> CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model

2020-04-26 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-31497:
--
Fix Version/s: 3.0.0

> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model
> --
>
> Key: SPARK-31497
> URL: https://issues.apache.org/jira/browse/SPARK-31497
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model.
> Reproduce code run in pyspark shell:
> 1) Train model and save model in pyspark:
> {code:python}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.feature import HashingTF, Tokenizer
> from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
> ParamGridBuilder
> training = spark.createDataFrame([
> (0, "a b c d e spark", 1.0),
> (1, "b d", 0.0),
> (2, "spark f g h", 1.0),
> (3, "hadoop mapreduce", 0.0),
> (4, "b spark who", 1.0),
> (5, "g d a y", 0.0),
> (6, "spark fly", 1.0),
> (7, "was mapreduce", 0.0),
> (8, "e spark program", 1.0),
> (9, "a e c l", 0.0),
> (10, "spark compile", 1.0),
> (11, "hadoop software", 0.0)
> ], ["id", "text", "label"])
> # Configure an ML pipeline, which consists of three stages: tokenizer, 
> hashingTF, and lr.
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
> lr = LogisticRegression(maxIter=10)
> pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
> paramGrid = ParamGridBuilder() \
> .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
> .addGrid(lr.regParam, [0.1, 0.01]) \
> .build()
> crossval = CrossValidator(estimator=pipeline,
>   estimatorParamMaps=paramGrid,
>   evaluator=BinaryClassificationEvaluator(),
>   numFolds=2)  # use 3+ folds in practice
> # Run cross-validation, and choose the best set of parameters.
> cvModel = crossval.fit(training)
> cvModel.save('/tmp/cv_model001') # saving the model fails and raises an error
> {code}
> 2) Train a cross-validation model in Scala with code similar to the above, save it 
> to '/tmp/model_cv_scala001', and run the following code in pyspark:
> {code:python}
> from pyspark.ml.tuning import CrossValidatorModel
> CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model

2020-04-26 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-31497:
--
Target Version/s: 3.0.0

> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model
> --
>
> Key: SPARK-31497
> URL: https://issues.apache.org/jira/browse/SPARK-31497
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model.
> Reproduce code run in pyspark shell:
> 1) Train model and save model in pyspark:
> {code:python}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.feature import HashingTF, Tokenizer
> from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
> ParamGridBuilder
> training = spark.createDataFrame([
> (0, "a b c d e spark", 1.0),
> (1, "b d", 0.0),
> (2, "spark f g h", 1.0),
> (3, "hadoop mapreduce", 0.0),
> (4, "b spark who", 1.0),
> (5, "g d a y", 0.0),
> (6, "spark fly", 1.0),
> (7, "was mapreduce", 0.0),
> (8, "e spark program", 1.0),
> (9, "a e c l", 0.0),
> (10, "spark compile", 1.0),
> (11, "hadoop software", 0.0)
> ], ["id", "text", "label"])
> # Configure an ML pipeline, which consists of three stages: tokenizer, 
> hashingTF, and lr.
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
> lr = LogisticRegression(maxIter=10)
> pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
> paramGrid = ParamGridBuilder() \
> .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
> .addGrid(lr.regParam, [0.1, 0.01]) \
> .build()
> crossval = CrossValidator(estimator=pipeline,
>   estimatorParamMaps=paramGrid,
>   evaluator=BinaryClassificationEvaluator(),
>   numFolds=2)  # use 3+ folds in practice
> # Run cross-validation, and choose the best set of parameters.
> cvModel = crossval.fit(training)
> cvModel.save('/tmp/cv_model001') # saving the model fails and raises an error
> {code}
> 2) Train a cross-validation model in Scala with code similar to the above, save it 
> to '/tmp/model_cv_scala001', and run the following code in pyspark:
> {code:python}
> from pyspark.ml.tuning import CrossValidatorModel
> CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31566) Add SQL Rest API Documentation

2020-04-26 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-31566:
---
Description: 
SQL Rest API exposes query execution metrics as Public API. Its documentation 
will be useful for end-users. 
{code:java}
/applications/[app-id]/sql

1- A list of all queries for a given application.
2- ?details=[true|false (default)] lists metric details in addition to queries 
details.
3- ?offset=[offset]&length=[len] lists queries in the given range.{code}
{code:java}
/applications/[app-id]/sql/[execution-id]

1- Details for the given query.
2- ?details=[true|false (default)] lists metric details in addition to given 
query details.{code}

  was:
SQL Rest API exposes query execution metrics as Public API. Its documentation 
will be useful for end-users. Also, this is a follow-up jira with 
https://issues.apache.org/jira/browse/SPARK-31440
{code:java}
/applications/[app-id]/sql

1- A list of all queries for a given application.
2- ?details=[true|false (default)] lists metric details in addition to queries 
details.
3- ?details=true&planDescription=[true (default)|false] enables/disables 
Physical planDescription on demand when Physical Plan size is high.
4- ?offset=[offset]&length=[len] lists queries in the given range.{code}
{code:java}
/applications/[app-id]/sql/[execution-id]

1- ?details=[true|false (default)] lists metric details in addition to given 
query details.
2- ?details=true&planDescription=[true (default)|false] enables/disables 
Physical planDescription on demand for the given query when Physical Plan size 
is high.{code}


> Add SQL Rest API Documentation
> --
>
> Key: SPARK-31566
> URL: https://issues.apache.org/jira/browse/SPARK-31566
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> SQL Rest API exposes query execution metrics as Public API. Its documentation 
> will be useful for end-users. 
> {code:java}
> /applications/[app-id]/sql
> 1- A list of all queries for a given application.
> 2- ?details=[true|false (default)] lists metric details in addition to 
> queries details.
> 3- ?offset=[offset]&length=[len] lists queries in the given range.{code}
> {code:java}
> /applications/[app-id]/sql/[execution-id]
> 1- Details for the given query.
> 2- ?details=[true|false (default)] lists metric details in addition to given 
> query details.{code}
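As a usage illustration only (the endpoint shape follows the description above; host, port, and application id are placeholders), the endpoints can be queried like the other Spark REST API endpoints under /api/v1:

{code:scala}
import scala.io.Source

// Placeholders: 4040 is the default UI port of a running application; adjust host/port/app id.
val appId = "app-20200426120000-0000"
val url   = s"http://localhost:4040/api/v1/applications/$appId/sql?details=true"

// Plain HTTP GET; the endpoint returns JSON with the query list and metric details.
val json = Source.fromURL(url).mkString
println(json.take(500))
{code}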



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25595) Ignore corrupt Avro file if flag IGNORE_CORRUPT_FILES enabled

2020-04-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25595:
-
Fix Version/s: 2.4.6

> Ignore corrupt Avro file if flag IGNORE_CORRUPT_FILES enabled
> -
>
> Key: SPARK-25595
> URL: https://issues.apache.org/jira/browse/SPARK-25595
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> With flag IGNORE_CORRUPT_FILES enabled, schema inference should ignore 
> corrupt Avro files, which is consistent with Parquet and Orc data source.
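For illustration only (not from the ticket), a minimal sketch of the behavior, assuming the spark-avro package is on the classpath and /path/to/avro-dir is a placeholder directory containing a mix of valid and corrupt .avro files:

{code:scala}
// SQL flag behind IGNORE_CORRUPT_FILES: with it enabled, corrupt Avro files are
// skipped during schema inference and reading instead of failing the query.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

val df = spark.read.format("avro").load("/path/to/avro-dir")  // placeholder path
df.printSchema()
df.count()
{code}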



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31546) Backport SPARK-25595 Ignore corrupt Avro file if flag IGNORE_CORRUPT_FILES enabled

2020-04-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31546.
--
Resolution: Done

> Backport SPARK-25595   Ignore corrupt Avro file if flag 
> IGNORE_CORRUPT_FILES enabled
> 
>
> Key: SPARK-31546
> URL: https://issues.apache.org/jira/browse/SPARK-31546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25595       Ignore corrupt Avro file if flag 
> IGNORE_CORRUPT_FILES enabled
> cc [~Gengliang.Wang]& [~hyukjin.kwon] for comments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31546) Backport SPARK-25595 Ignore corrupt Avro file if flag IGNORE_CORRUPT_FILES enabled

2020-04-26 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092978#comment-17092978
 ] 

Hyukjin Kwon commented on SPARK-31546:
--

Thanks [~Gengliang.Wang]!

> Backport SPARK-25595   Ignore corrupt Avro file if flag 
> IGNORE_CORRUPT_FILES enabled
> 
>
> Key: SPARK-31546
> URL: https://issues.apache.org/jira/browse/SPARK-31546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25595       Ignore corrupt Avro file if flag 
> IGNORE_CORRUPT_FILES enabled
> cc [~Gengliang.Wang]& [~hyukjin.kwon] for comments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31568) R: gapply documentation could be clearer about what the func argument is

2020-04-26 Thread Michael Chirico (Jira)
Michael Chirico created SPARK-31568:
---

 Summary: R: gapply documentation could be clearer about what the 
func argument is
 Key: SPARK-31568
 URL: https://issues.apache.org/jira/browse/SPARK-31568
 Project: Spark
  Issue Type: Documentation
  Components: R
Affects Versions: 2.4.5
Reporter: Michael Chirico


copied from pre-existing GH PR:

https://github.com/apache/spark/pull/28350

Spent a long time this weekend trying to figure out just what exactly key is in 
gapply's func. I had assumed it would be a named list, but apparently not -- 
the examples are working because schema is applying the name and the names of 
the output data.frame don't matter.

As near as I can tell the description I've added is correct, namely, that key 
is an unnamed list.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31567) Update AppVeyor R version to 4.0.0

2020-04-26 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31567:
-

 Summary: Update AppVeyor R version to 4.0.0
 Key: SPARK-31567
 URL: https://issues.apache.org/jira/browse/SPARK-31567
 Project: Spark
  Issue Type: Improvement
  Components: SparkR, Tests
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31559) AM starts with initial fetched tokens in any attempt

2020-04-26 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092928#comment-17092928
 ] 

Jungtaek Lim commented on SPARK-31559:
--

PR submitted: https://github.com/apache/spark/pull/28336

> AM starts with initial fetched tokens in any attempt
> 
>
> Key: SPARK-31559
> URL: https://issues.apache.org/jira/browse/SPARK-31559
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> The issue only occurs in yarn-cluster mode.
> The submitter obtains delegation tokens for yarn-cluster mode and adds these 
> credentials to the launch context. The AM is launched with these credentials, 
> and the AM and driver can leverage these tokens.
> In YARN cluster mode, the driver is launched in the AM, which in turn initializes 
> the token manager (while initializing SparkContext) and obtains delegation tokens 
> (and schedules their renewal) if both principal and keytab are available.
> That said, even if we provide a principal and keytab to run the application in 
> yarn-cluster mode, the AM always starts with the initial tokens from the launch 
> context until the token manager runs and obtains delegation tokens.
> So there's a "gap": if user code (the driver) accesses an external system that 
> requires delegation tokens (e.g. HDFS) before initializing SparkContext, it cannot 
> leverage the tokens the token manager will obtain. This makes the application 
> fail if the AM is killed and relaunched "after" the initial tokens have expired.
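To make the "gap" concrete, a hypothetical yarn-cluster driver main (class name, path, and app name are made up for illustration): any HDFS access before the SparkContext/SparkSession is created can only rely on the tokens shipped in the launch context, because the token manager has not started yet:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object TokenGapExample {
  def main(args: Array[String]): Unit = {
    // Runs before the SparkContext exists -> only the initial tokens are available.
    val fs = FileSystem.get(new Configuration())
    val exists = fs.exists(new Path("/user/placeholder/input"))

    // Only from here on does the driver-side token manager obtain and renew tokens.
    val spark = SparkSession.builder().appName("token-gap-example").getOrCreate()
    println(s"input exists: $exists, default parallelism: ${spark.sparkContext.defaultParallelism}")
    spark.stop()
  }
}
{code}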



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31566) Add SQL Rest API Documentation

2020-04-26 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-31566:
---
Description: 
SQL Rest API exposes query execution metrics as Public API. Its documentation 
will be useful for end-users. Also, this is a follow-up jira with 
https://issues.apache.org/jira/browse/SPARK-31440
{code:java}
/applications/[app-id]/sql

1- A list of all queries for a given application.
2- ?details=[true|false (default)] lists metric details in addition to queries 
details.
3- ?details=true&planDescription=[true (default)|false] enables/disables 
Physical planDescription on demand when Physical Plan size is high.
4- ?offset=[offset]&length=[len] lists queries in the given range.{code}
{code:java}
/applications/[app-id]/sql/[execution-id]

1- ?details=[true|false (default)] lists metric details in addition to given 
query details.
2- ?details=true&planDescription=[true (default)|false] enables/disables 
Physical planDescription on demand for the given query when Physical Plan size 
is high.{code}

  was:
SQL Rest API exposes query execution metrics as Public API. Its documentation 
will be useful for end-users. Also, this is a follow-up jira with 
https://issues.apache.org/jira/browse/SPARK-31440
{code:java}
/applications/[app-id]/sql

1- A list of all queries for a given application.
2- ?details=[true|false (default)] lists metric details in addition to queries 
details.
3- ?details=true&planDescription=[true (default)|false] enables/disables 
Physical planDescription on demand when Physical Plan size is high.
4- ?offset=[offset]&length=[len] lists queries in the given range.{code}
 
{code:java}
/applications/[app-id]/sql/[execution-id]

1- ?details=[true|false (default)] lists metric details in addition to given 
query details.
2- ?details=true&planDescription=[true (default)|false] enables/disables 
Physical planDescription on demand for the given query when Physical Plan size 
is high.{code}


> Add SQL Rest API Documentation
> --
>
> Key: SPARK-31566
> URL: https://issues.apache.org/jira/browse/SPARK-31566
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> SQL Rest API exposes query execution metrics as Public API. Its documentation 
> will be useful for end-users. Also, this is a follow-up jira with 
> https://issues.apache.org/jira/browse/SPARK-31440
> {code:java}
> /applications/[app-id]/sql
> 1- A list of all queries for a given application.
> 2- ?details=[true|false (default)] lists metric details in addition to 
> queries details.
> 3- ?details=true&planDescription=[true (default)|false] 
> enables/disables Physical planDescription on demand when Physical Plan size 
> is high.
> 4- ?offset=[offset]&length=[len] lists queries in the given range.{code}
> {code:java}
> /applications/[app-id]/sql/[execution-id]
> 1- ?details=[true|false (default)] lists metric details in addition to given 
> query details.
> 2- ?details=true&planDescription=[true (default)|false] enables/disables 
> Physical planDescription on demand for the given query when Physical Plan 
> size is high.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31566) Add SQL Rest API Documentation

2020-04-26 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-31566:
---
Description: 
SQL Rest API exposes query execution metrics as Public API. Its documentation 
will be useful for end-users. Also, this is a follow-up jira with 
https://issues.apache.org/jira/browse/SPARK-31440
{code:java}
/applications/[app-id]/sql

1- A list of all queries for a given application.
2- ?details=[true|false (default)] lists metric details in addition to queries 
details.
3- ?details=true&planDescription=[true (default)|false] enables/disables 
Physical planDescription on demand when Physical Plan size is high.
4- ?offset=[offset]&length=[len] lists queries in the given range.{code}
 
{code:java}
/applications/[app-id]/sql/[execution-id]

1- ?details=[true|false (default)] lists metric details in addition to given 
query details.
2- ?details=true&planDescription=[true (default)|false] enables/disables 
Physical planDescription on demand for the given query when Physical Plan size 
is high.{code}

  was:
SQL Rest API exposes query execution metrics as Public API. Its documentation 
will be useful for end-users. Also, this is a follow-up jira with 
https://issues.apache.org/jira/browse/SPARK-31440

 
{code:java}
/applications/[app-id]/sql

1- A list of all queries for a given application.
2- ?details=[true|false (default)] lists metric details in addition to queries 
details.
3- ?details=true&planDescription=[true (default)|false] enables/disables 
Physical planDescription on demand when Physical Plan size is high.
4- ?offset=[offset]&length=[len] lists queries in the given range.{code}
 
{code:java}
/applications/[app-id]/sql/[execution-id]

1- ?details=[true|false (default)] lists metric details in addition to given 
query details.
2- ?details=true&planDescription=[true (default)|false] enables/disables 
Physical planDescription on demand for the given query when Physical Plan size 
is high.{code}


> Add SQL Rest API Documentation
> --
>
> Key: SPARK-31566
> URL: https://issues.apache.org/jira/browse/SPARK-31566
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> SQL Rest API exposes query execution metrics as Public API. Its documentation 
> will be useful for end-users. Also, this is a follow-up jira with 
> https://issues.apache.org/jira/browse/SPARK-31440
> {code:java}
> /applications/[app-id]/sql
> 1- A list of all queries for a given application.
> 2- ?details=[true|false (default)] lists metric details in addition to 
> queries details.
> 3- ?details=true&planDescription=[true (default)|false] 
> enables/disables Physical planDescription on demand when Physical Plan size 
> is high.
> 4- ?offset=[offset]&length=[len] lists queries in the given range.{code}
>  
> {code:java}
> /applications/[app-id]/sql/[execution-id]
> 1- ?details=[true|false (default)] lists metric details in addition to given 
> query details.
> 2- ?details=true&planDescription=[true (default)|false] enables/disables 
> Physical planDescription on demand for the given query when Physical Plan 
> size is high.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31566) Add SQL Rest API Documentation

2020-04-26 Thread Eren Avsarogullari (Jira)
Eren Avsarogullari created SPARK-31566:
--

 Summary: Add SQL Rest API Documentation
 Key: SPARK-31566
 URL: https://issues.apache.org/jira/browse/SPARK-31566
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, SQL
Affects Versions: 3.1.0
Reporter: Eren Avsarogullari


SQL Rest API exposes query execution metrics as Public API. Its documentation 
will be useful for end-users. Also, this is a follow-up jira with 
https://issues.apache.org/jira/browse/SPARK-31440

 
{code:java}
/applications/[app-id]/sql

1- A list of all queries for a given application.
2- ?details=[true|false (default)] lists metric details in addition to queries 
details.
3- ?details=true&planDescription=[true (default)|false] enables/disables 
Physical planDescription on demand when Physical Plan size is high.
4- ?offset=[offset]&length=[len] lists queries in the given range.{code}
 
{code:java}
/applications/[app-id]/sql/[execution-id]

1- ?details=[true|false (default)] lists metric details in addition to given 
query details.
2- ?details=true&planDescription=[true (default)|false] enables/disables 
Physical planDescription on demand for the given query when Physical Plan size 
is high.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31565) Unify the font color of label among all DAG-viz.

2020-04-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31565.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28352
[https://github.com/apache/spark/pull/28352]

> Unify the font color of label among all DAG-viz.
> 
>
> Key: SPARK-31565
> URL: https://issues.apache.org/jira/browse/SPARK-31565
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.0
>
>
> There are three types of DAG-viz in the WebUI.
> One is for stages, another one is for RDDs and the last one is for query 
> plans.
> But the font color of the labels is slightly different among them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-04-26 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092884#comment-17092884
 ] 

Xiao Li edited comment on SPARK-12312 at 4/26/20, 11:03 PM:


How does the current solution handle Kerberos renewals? Any design doc about 
the whole support?


was (Author: smilegator):
How the current solution handle Kerberos renewals?

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.
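For context only, a sketch of the failing pattern (URL, table name, and options are placeholders; the SQL Server JDBC driver's Kerberos options are used as an example): the driver may hold a ticket, but the executors that open the per-partition connections do not:

{code:scala}
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://db.example.com;integratedSecurity=true;authenticationScheme=JavaKerberos")
  .option("dbtable", "dbo.some_table")
  .load()

// The schema query above runs on the driver; this action opens connections on the
// executors, which fail without a Kerberos ticket in yarn client/cluster modes.
df.count()
{code}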



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-04-26 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092884#comment-17092884
 ] 

Xiao Li commented on SPARK-12312:
-

How the current solution handle Kerberos renewals?

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31489) Failure on pushing down filters with java.time.LocalDate values in ORC

2020-04-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31489:
-

Assignee: Maxim Gekk

> Failure on pushing down filters with java.time.LocalDate values in ORC
> --
>
> Key: SPARK-31489
> URL: https://issues.apache.org/jira/browse/SPARK-31489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> When spark.sql.datetime.java8API.enabled is set to true, filters pushed down 
> with java.time.LocalDate values to ORC datasource fails with the exception:
> {code}
> Wrong value class java.time.LocalDate for DATE.EQUALS leaf
> java.lang.IllegalArgumentException: Wrong value class java.time.LocalDate for 
> DATE.EQUALS leaf
>   at 
> org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192)
>   at 
> org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75)
>   at 
> org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$BuilderImpl.equals(SearchArgumentImpl.java:352)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:229)
> {code}
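A minimal reproduction sketch (not taken from the ticket; the /tmp path is a placeholder), assuming a Spark 3.0 spark-shell:

{code:scala}
import java.time.LocalDate
import org.apache.spark.sql.functions.col
import spark.implicits._

spark.conf.set("spark.sql.datetime.java8API.enabled", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")  // set explicitly for clarity

// Write a small ORC table with a DATE column.
Seq(java.sql.Date.valueOf("2020-01-01")).toDF("d").write.mode("overwrite").orc("/tmp/spark31489")

// The pushed-down equality filter carries a java.time.LocalDate literal,
// which the ORC SearchArgument builder rejects with the exception above.
spark.read.orc("/tmp/spark31489").filter(col("d") === LocalDate.of(2020, 1, 1)).show()
{code}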



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31489) Failure on pushing down filters with java.time.LocalDate values in ORC

2020-04-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31489.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28261
[https://github.com/apache/spark/pull/28261]

> Failure on pushing down filters with java.time.LocalDate values in ORC
> --
>
> Key: SPARK-31489
> URL: https://issues.apache.org/jira/browse/SPARK-31489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> When spark.sql.datetime.java8API.enabled is set to true, filters pushed down 
> with java.time.LocalDate values to ORC datasource fails with the exception:
> {code}
> Wrong value class java.time.LocalDate for DATE.EQUALS leaf
> java.lang.IllegalArgumentException: Wrong value class java.time.LocalDate for 
> DATE.EQUALS leaf
>   at 
> org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192)
>   at 
> org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75)
>   at 
> org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$BuilderImpl.equals(SearchArgumentImpl.java:352)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:229)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31535) Fix nested CTE substitution

2020-04-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31535.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28318
[https://github.com/apache/spark/pull/28318]

> Fix nested CTE substitution
> ---
>
> Key: SPARK-31535
> URL: https://issues.apache.org/jira/browse/SPARK-31535
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> The following nested CTE should return an empty result instead of {{1}}
> {noformat}
> WITH t(c) AS (SELECT 1)
> SELECT * FROM t
> WHERE c IN (
>   WITH t(c) AS (SELECT 2)
>   SELECT * FROM t
> )
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31535) Fix nested CTE substitution

2020-04-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31535:
-

Assignee: Peter Toth

> Fix nested CTE substitution
> ---
>
> Key: SPARK-31535
> URL: https://issues.apache.org/jira/browse/SPARK-31535
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Blocker
>  Labels: correctness
>
> The following nested CTE should return an empty result instead of {{1}}
> {noformat}
> WITH t(c) AS (SELECT 1)
> SELECT * FROM t
> WHERE c IN (
>   WITH t(c) AS (SELECT 2)
>   SELECT * FROM t
> )
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31565) Unify the font color of label among all DAG-viz.

2020-04-26 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-31565:
--

 Summary: Unify the font color of label among all DAG-viz.
 Key: SPARK-31565
 URL: https://issues.apache.org/jira/browse/SPARK-31565
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0, 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


There are three types of DAG-viz in the WebUI.
One is for stages, another one is for RDDs and the last one is for query plans.
But the font color of the labels is slightly different among them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31534) Text for tooltip should be escaped

2020-04-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31534.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/28317

> Text for tooltip should be escaped
> --
>
> Key: SPARK-31534
> URL: https://issues.apache.org/jira/browse/SPARK-31534
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> The Timeline View for applications and jobs, and the DAG Viz for jobs, show 
> tooltips, but their text is not escaped for HTML so they cannot be shown properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31534) Text for tooltip should be escaped

2020-04-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31534:
--
Affects Version/s: 3.0.0

> Text for tooltip should be escaped
> --
>
> Key: SPARK-31534
> URL: https://issues.apache.org/jira/browse/SPARK-31534
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> The Timeline View for applications and jobs, and the DAG Viz for jobs, show 
> tooltips, but their text is not escaped for HTML so they cannot be shown properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31432) Make SPARK_JARS_DIR configurable

2020-04-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31432:
--
Summary: Make SPARK_JARS_DIR configurable  (was: bin/sbin scripts should 
allow to customize jars dir)

> Make SPARK_JARS_DIR configurable
> 
>
> Key: SPARK-31432
> URL: https://issues.apache.org/jira/browse/SPARK-31432
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 3.1.0
>Reporter: Shingo Furuyama
>Priority: Minor
>
> In the scripts under bin/sbin, it would be better if we could specify 
> SPARK_JARS_DIR in the same way as SPARK_CONF_DIR.
> Our use case:
>  We are trying to use Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an 
> incompatible conflict on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we 
> tweak the jars with the Maven Shade Plugin.
>  The jars differ slightly from the jars in Spark 2.4.5, and we put them in a 
> directory different from the default. So it would be useful for us if we could 
> set SPARK_JARS_DIR so that the bin/sbin scripts point to that directory.
>  We can do that without this change by deploying one Spark home per set of 
> jars, but that is somewhat redundant.
> Common use case:
>  I believe there are similar use cases, for example deploying Spark built for 
> Scala 2.11 and 2.12 on one machine and switching the jars location by setting 
> SPARK_JARS_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31399) Closure cleaner broken in Scala 2.12

2020-04-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092836#comment-17092836
 ] 

Dongjoon Hyun commented on SPARK-31399:
---

Hi, [~smilegator] and [~rednaxelafx]. Is there any update for this Blocker 
issue? Thank you for any update in advance!

> Closure cleaner broken in Scala 2.12
> 
>
> Key: SPARK-31399
> URL: https://issues.apache.org/jira/browse/SPARK-31399
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Kris Mok
>Priority: Blocker
>
> The `ClosureCleaner` only support Scala functions and it uses the following 
> check to catch closures
> {code}
>   // Check whether a class represents a Scala closure
>   private def isClosure(cls: Class[_]): Boolean = {
> cls.getName.contains("$anonfun$")
>   }
> {code}
> This no longer works in 3.0, because we upgraded to Scala 2.12 and most Scala 
> functions are now compiled to Java lambdas.
> As an example, the following code works well in Spark 2.4 Spark Shell:
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql.functions.lit
> defined class Foo
> col: org.apache.spark.sql.Column = 123
> df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at <console>:20
> {code}
> But fails in 3.0
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2371)
>   at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:421)
>   ... 39 elided
> Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
> Serialization stack:
>   - object not serializable (class: org.apache.spark.sql.Column, value: 
> 123)
>   - field (class: $iw, name: col, type: class org.apache.spark.sql.Column)
>   - object (class $iw, $iw@2d87ac2b)
>   - element of array (index: 0)
>   - array (class [Ljava.lang.Object;, size 1)
>   - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
> type: class [Ljava.lang.Object;)
>   - object (class java.lang.invoke.SerializedLambda, 
> SerializedLambda[capturingClass=class $iw, 
> functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, 
> instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1])
>   - writeReplace data (class: java.lang.invoke.SerializedLambda)
>   - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43)
>   at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393)
>   ... 47 more
> {code}
> **Apache Spark 2.4.5 with Scala 2.12**
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
>       /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
>   at 
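
A minimal sketch of a name-based check that would also recognize lambda classes (an assumption for illustration only, not necessarily the fix adopted in Spark):

{code:scala}
object ClosureCheckSketch {
  // Scala 2.11 anonymous functions compile to classes whose names contain
  // "$anonfun$"; under Scala 2.12 most functions become LambdaMetafactory-
  // generated classes whose names contain "$Lambda$".
  def looksLikeClosure(cls: Class[_]): Boolean =
    cls.getName.contains("$anonfun$") || cls.getName.contains("$Lambda$")
}
{code}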

[jira] [Commented] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation

2020-04-26 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092824#comment-17092824
 ] 

Maxim Gekk commented on SPARK-31449:


[~cloud_fan] [~hyukjin.kwon] I compared the results of those two functions for 
all time zones with a step of one day, and found many differences in the results:
{code:scala}
  test("Investigate the difference between JDK and Spark's time zone offset calculation") {
    import java.util.{Calendar, TimeZone}
    import sun.util.calendar.ZoneInfo
    def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): Long = {
      var guess = tz.getRawOffset
      // the actual offset should be calculated based on milliseconds in UTC
      val offset = tz.getOffset(millisLocal - guess)
      if (offset != guess) {
        guess = tz.getOffset(millisLocal - offset)
        if (guess != offset) {
          // fallback to do the reverse lookup using java.sql.Timestamp
          // this should only happen near the start or end of DST
          val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
          val year = getYear(days)
          val month = getMonth(days)
          val day = getDayOfMonth(days)

          var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt
          if (millisOfDay < 0) {
            millisOfDay += MILLIS_PER_DAY.toInt
          }
          val seconds = (millisOfDay / 1000L).toInt
          val hh = seconds / 3600
          val mm = seconds / 60 % 60
          val ss = seconds % 60
          val ms = millisOfDay % 1000
          val calendar = Calendar.getInstance(tz)
          calendar.set(year, month - 1, day, hh, mm, ss)
          calendar.set(Calendar.MILLISECOND, ms)
          guess = (millisLocal - calendar.getTimeInMillis()).toInt
        }
      }
      guess
    }
    def getOffsetFromLocalMillis2(millisLocal: Long, tz: TimeZone): Long = {
      tz match {
        case zoneInfo: ZoneInfo => zoneInfo.getOffsetsByWall(millisLocal, null)
        case timeZone: TimeZone => timeZone.getOffset(millisLocal - timeZone.getRawOffset)
      }
    }

    ALL_TIMEZONES
      .sortBy(_.getId)
      .foreach { zid =>
        withDefaultTimeZone(zid) {
          val start = microsToMillis(instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0)
            .atZone(zid)
            .toInstant))
          val end = microsToMillis(instantToMicros(LocalDateTime.of(2037, 1, 1, 0, 0, 0)
            .atZone(zid)
            .toInstant))

          var millis = start
          var step: Long = MILLIS_PER_DAY
          while (millis < end) {
            val offset1 = getOffsetFromLocalMillis(millis, TimeZone.getTimeZone(zid))
            val offset2 = getOffsetFromLocalMillis2(millis, TimeZone.getTimeZone(zid))
            if (offset1 != offset2) {
              println(s"${zid.getId} ${new Timestamp(millis)} $offset1 $offset2")
            }
            millis += step
          }
        }
      }
  }
{code}
{code}
Africa/Algiers 1916-10-01 23:47:48.0 360 0
Africa/Algiers 1917-10-07 23:47:48.0 360 0
Africa/Algiers 1918-10-06 23:47:48.0 360 0
Africa/Algiers 1919-10-05 23:47:48.0 360 0
Africa/Algiers 1920-10-23 23:47:48.0 360 0
Africa/Algiers 1921-06-21 23:47:48.0 360 0
Africa/Algiers 1946-10-06 23:47:48.0 360 0
Africa/Algiers 1963-04-13 23:47:48.0 360 0
Africa/Algiers 1971-09-26 23:47:48.0 360 0
Africa/Algiers 1979-10-25 23:47:48.0 360 0
Africa/Ceuta 1900-01-01 00:00:00.0 360 -1276000
Africa/Ceuta 1924-10-05 00:21:16.0 360 0
Africa/Ceuta 1926-10-03 00:21:16.0 360 0
Africa/Ceuta 1927-10-02 00:21:16.0 360 0
Africa/Ceuta 1928-10-07 00:21:16.0 360 0
Africa/Sao_Tome 1899-12-31 23:33:04.0 0 -2205000
Africa/Tripoli 1952-01-01 00:07:16.0 720 360
Africa/Tripoli 1954-01-01 00:07:16.0 720 360
Africa/Tripoli 1956-01-01 00:07:16.0 720 360
Africa/Tripoli 1982-01-01 00:07:16.0 720 360
Africa/Tripoli 1982-10-01 00:07:16.0 720 360
Africa/Tripoli 1983-10-01 00:07:16.0 720 360
Africa/Tripoli 1984-10-01 00:07:16.0 720 360
Africa/Tripoli 1985-10-01 00:07:16.0 720 360
Africa/Tripoli 1986-10-03 00:07:16.0 720 360
Africa/Tripoli 1987-10-01 00:07:16.0 720 360
Africa/Tripoli 1988-10-01 00:07:16.0 720 360
Africa/Tripoli 1989-10-01 00:07:16.0 720 360
Africa/Tripoli 1996-09-30 00:07:16.0 720 360
America/Inuvik 1965-10-30 18:00:00.0 -2160 -2880
America/Iqaluit 1999-10-30 20:00:00.0 -1440 -2160
America/Pangnirtung 1999-10-30 20:00:00.0 -1440 -2160
Antarctica/Casey 1900-01-01 00:00:00.0 2880 0
Antarctica/Davis 1900-01-01 00:00:00.0 2520 0
Antarctica/Davis 2009-10-18 05:00:00.0 2520 1800
Antarctica/Davis 2011-10-28 05:00:00.0 2520 1800
Antarctica/DumontDUrville 1900-01-01 00:00:00.0 3600 0
Antarctica/Mawson 1900-01-01 00:00:00.0 1800 0
Antarctica/Syowa 1900-01-01 00:00:00.0 1080 0

[jira] [Resolved] (SPARK-31562) Update ExpressionDescription for substring, current_date, and current_timestamp

2020-04-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31562.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28342
[https://github.com/apache/spark/pull/28342]

> Update ExpressionDescription for substring, current_date, and 
> current_timestamp
> ---
>
> Key: SPARK-31562
> URL: https://issues.apache.org/jira/browse/SPARK-31562
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>
> This JIRA intends to add entries for substring, current_date, and 
> current_timestamp to the SQL built-in function documents. Specifically, the 
> entries are as follows:
> SELECT current_date;
> SELECT current_timestamp;
> SELECT substring('abcd' FROM 1);
> SELECT substring('abcd' FROM 1 FOR 2);
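
For reference, the substring forms can be verified quickly from spark-shell (a small illustration, not part of the proposed doc entries):

{code:scala}
// The current_date/current_timestamp results depend on when the query runs;
// the substring results are deterministic: 'abcd' and 'ab' respectively.
spark.sql("SELECT substring('abcd' FROM 1)").show()
spark.sql("SELECT substring('abcd' FROM 1 FOR 2)").show()
{code}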



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31562) Update ExpressionDescription for substring, current_date, and current_timestamp

2020-04-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31562:
-

Assignee: Takeshi Yamamuro

> Update ExpressionDescription for substring, current_date, and 
> current_timestamp
> ---
>
> Key: SPARK-31562
> URL: https://issues.apache.org/jira/browse/SPARK-31562
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> This JIRA intends to add entries for substring, current_date, and 
> current_timestamp to the SQL built-in function documents. Specifically, the 
> entries are as follows:
> SELECT current_date;
> SELECT current_timestamp;
> SELECT substring('abcd' FROM 1);
> SELECT substring('abcd' FROM 1 FOR 2);



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-26 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092818#comment-17092818
 ] 

Pablo Langa Blanco commented on SPARK-31500:


[https://github.com/apache/spark/pull/28351]

> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType, this is not the case.
>  
> Example:
> {code:scala}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.expressions.Window
> 
> case class R(id: String, value: String, bytes: Array[Byte])
> def makeR(id: String, value: String) = R(id, value, value.getBytes)
> val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF()
> 
> // In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected).
> df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "byteSet").show(truncate=false)
> 
> // The same problem is displayed when using window functions.
> val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> val result = df.select(
>     collect_set('value).over(win) as "stringSet",
>     collect_set('bytes).over(win) as "bytesSet"
>   )
>   .select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize")
>   .show()
> {code}
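
The duplicates are consistent with Array[Byte] comparing by reference rather than by content, so a hash-set based aggregation would not deduplicate equal byte arrays. A plain-Scala illustration of that behaviour (not Spark code):

{code:scala}
// Two byte arrays with equal contents are distinct objects: == is reference
// equality for arrays, so both survive insertion into a Set.
val a: Array[Byte] = "cat".getBytes("UTF-8")
val b: Array[Byte] = "cat".getBytes("UTF-8")
println(a == b)            // false: arrays compare by reference
println(a.sameElements(b)) // true: element-wise comparison
println(Set(a, b).size)    // 2: the set keeps both arrays
{code}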



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31400) The catalogString doesn't distinguish Vectors in ml and mllib

2020-04-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31400.
--
Fix Version/s: 3.1.0
 Assignee: Junpei Zhou
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28347

> The catalogString doesn't distinguish Vectors in ml and mllib
> -
>
> Key: SPARK-31400
> URL: https://issues.apache.org/jira/browse/SPARK-31400
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.4.5
> Environment: Ubuntu 16.04
>Reporter: Junpei Zhou
>Assignee: Junpei Zhou
>Priority: Minor
> Fix For: 3.1.0
>
>
> h2. Bug Description
> The `catalogString` is not detailed enough to distinguish 
> pyspark.ml.linalg.Vectors from pyspark.mllib.linalg.Vectors.
> h2. How to reproduce the bug
> [Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an 
> example from the official documentation (Python code). If I keep all other 
> lines untouched and only modify the Vectors import line, i.e.:
> {code:java}
> # from pyspark.ml.linalg import Vectors
> from pyspark.mllib.linalg import Vectors
> {code}
> Or you can directly execute the following code snippet:
> {code:java}
> from pyspark.ml.feature import MinMaxScaler
> # from pyspark.ml.linalg import Vectors
> from pyspark.mllib.linalg import Vectors
> dataFrame = spark.createDataFrame([
> (0, Vectors.dense([1.0, 0.1, -1.0]),),
> (1, Vectors.dense([2.0, 1.1, 1.0]),),
> (2, Vectors.dense([3.0, 10.1, 3.0]),)
> ], ["id", "features"])
> scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
> scalerModel = scaler.fit(dataFrame)
> {code}
> It will raise an error:
> {code:java}
> IllegalArgumentException: 'requirement failed: Column features must be of 
> type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> 
> but was actually 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'
> {code}
> However, the actual struct and the desired struct are rendered as exactly the 
> same string, which gives the programmer no useful information. I would 
> suggest making the catalogString distinguish pyspark.ml.linalg.Vectors from 
> pyspark.mllib.linalg.Vectors.
> Thanks!
>  
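
As a quick spark-shell check of the underlying problem (an illustration added here, not from the original report), the two Vector types report identical catalogStrings:

{code:scala}
import spark.implicits._
import org.apache.spark.ml.{linalg => ml}
import org.apache.spark.mllib.{linalg => mllib}

val mlDF    = Seq(Tuple1(ml.Vectors.dense(1.0, 2.0))).toDF("features")
val mllibDF = Seq(Tuple1(mllib.Vectors.dense(1.0, 2.0))).toDF("features")

// Both lines print the same struct<...> string, which is why the error
// message above appears to compare two identical types.
println(mlDF.schema("features").dataType.catalogString)
println(mllibDF.schema("features").dataType.catalogString)
{code}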



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31400) The catalogString doesn't distinguish Vectors in ml and mllib

2020-04-26 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-31400:
-
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)

> The catalogString doesn't distinguish Vectors in ml and mllib
> -
>
> Key: SPARK-31400
> URL: https://issues.apache.org/jira/browse/SPARK-31400
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.4.5
> Environment: Ubuntu 16.04
>Reporter: Junpei Zhou
>Priority: Minor
>
> h2. Bug Description
> The `catalogString` is not detailed enough to distinguish 
> pyspark.ml.linalg.Vectors from pyspark.mllib.linalg.Vectors.
> h2. How to reproduce the bug
> [Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an 
> example from the official documentation (Python code). If I keep all other 
> lines untouched and only modify the Vectors import line, i.e.:
> {code:java}
> # from pyspark.ml.linalg import Vectors
> from pyspark.mllib.linalg import Vectors
> {code}
> Or you can directly execute the following code snippet:
> {code:java}
> from pyspark.ml.feature import MinMaxScaler
> # from pyspark.ml.linalg import Vectors
> from pyspark.mllib.linalg import Vectors
> dataFrame = spark.createDataFrame([
> (0, Vectors.dense([1.0, 0.1, -1.0]),),
> (1, Vectors.dense([2.0, 1.1, 1.0]),),
> (2, Vectors.dense([3.0, 10.1, 3.0]),)
> ], ["id", "features"])
> scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
> scalerModel = scaler.fit(dataFrame)
> {code}
> It will raise an error:
> {code:java}
> IllegalArgumentException: 'requirement failed: Column features must be of 
> type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> 
> but was actually 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'
> {code}
> However, the actual struct and the desired struct are rendered as exactly the 
> same string, which gives the programmer no useful information. I would 
> suggest making the catalogString distinguish pyspark.ml.linalg.Vectors from 
> pyspark.mllib.linalg.Vectors.
> Thanks!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31552) Fix potential ClassCastException in ScalaReflection arrayClassFor

2020-04-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31552:
--
Fix Version/s: 2.4.6

> Fix potential ClassCastException in ScalaReflection arrayClassFor
> -
>
> Key: SPARK-31552
> URL: https://issues.apache.org/jira/browse/SPARK-31552
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> arrayClassFor and dataTypeFor in ScalaReflection call each other recursively, 
> but the cases handled in dataTypeFor are not all handled in arrayClassFor.
> For example:
> {code:java}
> scala> import scala.reflect.runtime.universe.TypeTag
> scala> import org.apache.spark.sql._
> scala> import org.apache.spark.sql.catalyst.encoders._
> scala> import org.apache.spark.sql.types._
> scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = 
> ExpressionEncoder()
> newArrayEncoder: [T <: Array[_]](implicit evidence$1: 
> reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]
> scala> val decOne = Decimal(1, 38, 18)
> decOne: org.apache.spark.sql.types.Decimal = 1E-18
> scala> val decTwo = Decimal(2, 38, 18)
> decTwo: org.apache.spark.sql.types.Decimal = 2E-18
> scala> val decSpark = Array(decOne, decTwo)
> decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)
> scala> Seq(decSpark).toDF()
> java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot 
> be cast to org.apache.spark.sql.types.ObjectType
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
>   at newArrayEncoder(<console>:57)
>   ... 53 elided
> scala>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org