[jira] [Closed] (SPARK-38346) Add cache in MLlib BinaryClassificationMetrics

2022-08-30 Thread Mingchao Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingchao Wu closed SPARK-38346.
---

Discussed with community developers; confirmed that a cache is not needed here.
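For context, a minimal pure-Python sketch (a hypothetical stand-in, not Spark code) of what caching a twice-used lazy dataset avoids: each action on an uncached RDD recomputes its lineage, while cache() materializes it once.

```python
# Hypothetical stand-in for an RDD: each action recomputes the lineage
# unless the result has been cached (like RDD.cache()/persist()).
class LazyDataset:
    def __init__(self, compute):
        self._compute = compute
        self._cached = None

    def cache(self):
        # Materialize once and keep the result in memory.
        if self._cached is None:
            self._cached = self._compute()
        return self

    def collect(self):
        # An "action": uses the cache if present, otherwise recomputes.
        return self._cached if self._cached is not None else self._compute()

calls = {"n": 0}

def expensive_lineage():
    calls["n"] += 1
    return [1, 2, 3]

ds = LazyDataset(expensive_lineage)
ds.collect()
ds.collect()                  # two actions, two recomputations
assert calls["n"] == 2
ds.cache()                    # materializes once more
ds.collect()
ds.collect()                  # served from cache, no recomputation
assert calls["n"] == 3
```

Whether caching pays off depends on how expensive the lineage is relative to keeping the data materialized, which is why it was declined here.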

> Add cache in MLlib BinaryClassificationMetrics
> --
>
> Key: SPARK-38346
> URL: https://issues.apache.org/jira/browse/SPARK-38346
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.3
> Environment: Windows10/macOS12.2; spark_2.11-2.2.3; 
> mmlspark_2.11-0.18.0; lightgbmlib-2.2.350
>Reporter: Mingchao Wu
>Priority: Minor
>
> We ran some example code using BinaryClassificationEvaluator in MLlib and found 
> that ShuffledRDD[28] at BinaryClassificationMetrics.scala:155 and 
> UnionRDD[36] at BinaryClassificationMetrics.scala:90 were used more than once 
> but were not cached.
> We use Spark 2.2.3 and found that the code on the master branch still has no 
> cache, so we hope to improve it.
> The example code is as follows:
> {code:java}
> import com.microsoft.ml.spark.lightgbm.LightGBMRegressor
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.feature.VectorAssembler
> import org.apache.spark.sql.types.{DoubleType, IntegerType}
> import org.apache.spark.sql.{DataFrame, SparkSession}
>
> object LightGBMRegressorTest {
>   def main(args: Array[String]): Unit = {
>     val spark: SparkSession = SparkSession.builder()
>       .appName("LightGBMRegressorTest")
>       .master("local[*]")
>       .getOrCreate()
>     val startTime = System.currentTimeMillis()
>
>     var originalData: DataFrame = spark.read.option("header", "true")
>       .option("inferSchema", "true")
>       .csv("data/hour.csv")
>
>     val labelCol = "workingday"
>     val cateCols = Array("season", "yr", "mnth", "hr")
>     val conCols: Array[String] = Array("temp", "atemp", "hum", "casual", "cnt")
>     val vecCols = conCols ++ cateCols
>
>     import spark.implicits._
>     vecCols.foreach(col => {
>       originalData = originalData.withColumn(col, $"$col".cast(DoubleType))
>     })
>     originalData = originalData.withColumn(labelCol, $"$labelCol".cast(IntegerType))
>
>     val assembler = new VectorAssembler().setInputCols(vecCols).setOutputCol("features")
>     val classifier: LightGBMRegressor = new LightGBMRegressor()
>       .setNumIterations(100).setNumLeaves(31)
>       .setBoostFromAverage(false).setFeatureFraction(1.0).setMaxDepth(-1).setMaxBin(255)
>       .setLearningRate(0.1).setMinSumHessianInLeaf(0.001).setLambdaL1(0.0).setLambdaL2(0.0)
>       .setBaggingFraction(0.5).setBaggingFreq(1).setBaggingSeed(1).setObjective("binary")
>       .setLabelCol(labelCol).setCategoricalSlotNames(cateCols).setFeaturesCol("features")
>       .setBoostingType("gbdt")
>
>     val pipeline: Pipeline = new Pipeline().setStages(Array(assembler, classifier))
>     val Array(tr, te) = originalData.randomSplit(Array(0.7, 0.3), 666)
>     val model = pipeline.fit(tr)
>     val modelDF = model.transform(te)
>     val evaluator = new BinaryClassificationEvaluator()
>       .setLabelCol(labelCol).setRawPredictionCol("prediction")
>     println(evaluator.evaluate(modelDF))
>     println(s"time: ${System.currentTimeMillis() - startTime}")
>     System.in.read()
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-30 Thread Prashant Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Singh updated SPARK-40266:
---
Affects Version/s: 3.3.0

> Corrected  console output in quick-start -  Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.3.0
> Environment: spark 3.3.0
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h3. What changes were proposed in this pull request?
> Corrected the datatype in the quick-start console output from Long to Int.
> h3. Why are the changes needed?
> The documented console output shows an incorrect datatype.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes in documentation for console output.
> [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
> h3. How was this patch tested?
> Manually checked the changes by previewing the markdown output. I also tested 
> the output by installing Spark 3.3.0 locally and running the commands in the 
> quick-start docs.
>  






[jira] [Updated] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-30 Thread Prashant Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Singh updated SPARK-40266:
---
Environment: 
spark 3.3.0

Windows 10 (OS Build 19044.1889)

  was:
spark 3.1.2 

Windows 10 (OS Build 19044.1889)


> Corrected  console output in quick-start -  Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: spark 3.3.0
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h3. What changes were proposed in this pull request?
> Corrected the datatype in the quick-start console output from Long to Int.
> h3. Why are the changes needed?
> The documented console output shows an incorrect datatype.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes in documentation for console output.
> [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
> h3. How was this patch tested?
> Manually checked the changes by previewing the markdown output. I also tested 
> the output by installing Spark 3.3.0 locally and running the commands in the 
> quick-start docs.
>  






[jira] [Updated] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-30 Thread Prashant Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Singh updated SPARK-40266:
---
Description: 
h3. What changes were proposed in this pull request?

Corrected the datatype in the quick-start console output from Long to Int.
h3. Why are the changes needed?

The documented console output shows an incorrect datatype.
h3. Does this PR introduce _any_ user-facing change?

Yes. It proposes changes in documentation for console output.
[!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
h3. How was this patch tested?

Manually checked the changes by previewing the markdown output. I also tested the 
output by installing Spark 3.3.0 locally and running the commands in the quick-start docs.

 

  was:
h3. What changes were proposed in this pull request?

Corrected datatype output of command from Long to Int
h3. Why are the changes needed?

It shows incorrect datatype
h3. Does this PR introduce _any_ user-facing change?

Yes. It proposes changes in documentation for console output.
[!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
h3. How was this patch tested?

Manually checked the changes by previewing markdown output. I tested output by 
installing spark 3.1.2 locally and running commands present in quick start docs

 


> Corrected  console output in quick-start -  Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h3. What changes were proposed in this pull request?
> Corrected the datatype in the quick-start console output from Long to Int.
> h3. Why are the changes needed?
> The documented console output shows an incorrect datatype.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes in documentation for console output.
> [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
> h3. How was this patch tested?
> Manually checked the changes by previewing the markdown output. I also tested 
> the output by installing Spark 3.3.0 locally and running the commands in the 
> quick-start docs.
>  






[jira] [Assigned] (SPARK-40272) Support service port custom with range

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40272:


Assignee: Apache Spark

> Support service port custom with range
> --
>
> Key: SPARK-40272
> URL: https://issues.apache.org/jira/browse/SPARK-40272
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0, 3.2.2
>Reporter: XiaoLong Wu
>Assignee: Apache Spark
>Priority: Minor
>
> In practice, we often encounter firewall restrictions that limit ports to a 
> certain range, so Spark should support restricting all of its service ports 
> to a custom range.






[jira] [Commented] (SPARK-40272) Support service port custom with range

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597589#comment-17597589
 ] 

Apache Spark commented on SPARK-40272:
--

User 'chong0929' has created a pull request for this issue:
https://github.com/apache/spark/pull/37721

> Support service port custom with range
> --
>
> Key: SPARK-40272
> URL: https://issues.apache.org/jira/browse/SPARK-40272
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0, 3.2.2
>Reporter: XiaoLong Wu
>Priority: Minor
>
> In practice, we often encounter firewall restrictions that limit ports to a 
> certain range, so Spark should support restricting all of its service ports 
> to a custom range.






[jira] [Assigned] (SPARK-40272) Support service port custom with range

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40272:


Assignee: (was: Apache Spark)

> Support service port custom with range
> --
>
> Key: SPARK-40272
> URL: https://issues.apache.org/jira/browse/SPARK-40272
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0, 3.2.2
>Reporter: XiaoLong Wu
>Priority: Minor
>
> In practice, we often encounter firewall restrictions that limit ports to a 
> certain range, so Spark should support restricting all of its service ports 
> to a custom range.






[jira] [Created] (SPARK-40272) Support service port custom with range

2022-08-30 Thread XiaoLong Wu (Jira)
XiaoLong Wu created SPARK-40272:
---

 Summary: Support service port custom with range
 Key: SPARK-40272
 URL: https://issues.apache.org/jira/browse/SPARK-40272
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.2, 3.0.0, 2.4.0
Reporter: XiaoLong Wu


In practice, we often encounter firewall restrictions that limit ports to a 
certain range, so Spark should support restricting all of its service ports 
to a custom range.
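The idea can be sketched in a few lines of pure Python: probe each port in the allowed range and keep the first one that binds. This is a hypothetical illustration of range-restricted port selection, not Spark's implementation; Spark's existing mechanism retries upward from a configured base port (see spark.port.maxRetries) rather than confining itself to a range.

```python
import socket

def bind_in_range(start, end):
    """Return (socket, port) for the first free port in [start, end)."""
    for port in range(start, end):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
            return s, port
        except OSError:
            s.close()  # port busy or not permitted; try the next one
    raise OSError(f"no free port in [{start}, {end})")

# Probe within the IANA dynamic/private range as an example.
sock, port = bind_in_range(49152, 49200)
assert 49152 <= port < 49200
sock.close()
```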






[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

2022-08-30 Thread comet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597573#comment-17597573
 ] 

comet commented on SPARK-38330:
---

Any update on this ticket? Has anyone tested this on the latest version of Hadoop? 
I tested it but still get the same error.
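One commonly reported trigger for this error (not confirmed in this thread) is a bucket name containing dots: virtual-hosted-style requests then resolve to a host such as my.bucket.s3.amazonaws.com, which the wildcard certificate *.s3.amazonaws.com cannot match. A hedged workaround is to enable S3A path-style access, e.g. in spark-defaults.conf:

```
# Assumption: the S3A connector (hadoop-aws) is in use; verify this
# property against the documentation for your Hadoop version.
spark.hadoop.fs.s3a.path.style.access  true
```

Path-style access has its own limitations on the AWS side, so treat this as a diagnostic workaround rather than a definitive fix.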

> Certificate doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
> --
>
> Key: SPARK-38330
> URL: https://issues.apache.org/jira/browse/SPARK-38330
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 3.2.1
> Environment: Spark 3.2.1 built with `hadoop-cloud` flag.
> Direct access to s3 using default file committer.
> JDK8.
>  
>Reporter: André F.
>Priority: Major
>
> Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1 
> leads to the following exception while reading files on S3:
> {code:java}
> org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on 
> s3a:///.parquet: com.amazonaws.SdkClientException: Unable to 
> execute HTTP request: Certificate for  doesn't match 
> any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: 
> Unable to execute HTTP request: Certificate for  doesn't match any of 
> the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) 
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) 
> at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code}
>  
> {code:java}
> Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for 
>  doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
>   at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
>   at 
> com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1333)
>   at 
> 

[jira] [Commented] (SPARK-40271) Support list type for spark.sql.functions.lit

2022-08-30 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597571#comment-17597571
 ] 

Haejoon Lee commented on SPARK-40271:
-

I'm working on it

> Support list type for spark.sql.functions.lit
> -
>
> Key: SPARK-40271
> URL: https://issues.apache.org/jira/browse/SPARK-40271
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, 
> as shown below:
> {code:python}
> >>> df = spark.range(3).withColumn("c", lit([1,2,3]))
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] 
> The feature is not supported: Literal for '[1, 2, 3]' of class 
> java.util.ArrayList.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
>   at org.apache.spark.sql.functions$.lit(functions.scala:125)
>   at org.apache.spark.sql.functions.lit(functions.scala)
>   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> We should support it.






[jira] [Created] (SPARK-40271) Support list type for spark.sql.functions.lit

2022-08-30 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40271:
---

 Summary: Support list type for spark.sql.functions.lit
 Key: SPARK-40271
 URL: https://issues.apache.org/jira/browse/SPARK-40271
 Project: Spark
  Issue Type: Test
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, as 
shown below:


{code:python}
>>> df = spark.range(3).withColumn("c", lit([1,2,3]))
Traceback (most recent call last):
...
: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] 
The feature is not supported: Literal for '[1, 2, 3]' of class 
java.util.ArrayList.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
at 
org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
at org.apache.spark.sql.functions$.lit(functions.scala:125)
at org.apache.spark.sql.functions.lit(functions.scala)
at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
at java.base/java.lang.reflect.Method.invoke(Method.java:577)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:833)
{code}

We should support it.
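Until then, the usual workaround is to build the array element-wise, e.g. F.array(*[F.lit(x) for x in xs]). The proposed behavior can be sketched in pure Python (a hypothetical illustration, not the PySpark implementation):

```python
# Sketch of lit() extended to lists: a Python list maps to an array literal
# built element-wise, mirroring the F.array(*[F.lit(x) for x in xs]) workaround.
# The tuple-based "expression" representation here is made up for illustration.
def lit(value):
    if isinstance(value, (list, tuple)):
        return ("array", [lit(v) for v in value])  # recurse per element
    return ("literal", value)

assert lit(1) == ("literal", 1)
assert lit([1, 2, 3]) == ("array", [("literal", 1), ("literal", 2), ("literal", 3)])
```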






[jira] [Assigned] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40260:


Assignee: Max Gekk  (was: Apache Spark)

> Use error classes in the compilation errors of GROUP BY a position
> --
>
> Key: SPARK-40260
> URL: https://issues.apache.org/jira/browse/SPARK-40260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * groupByPositionRefersToAggregateFunctionError
> * groupByPositionRangeError
> to use error classes. Throw an implementation of SparkThrowable, and write a 
> test for every error in QueryCompilationErrorsSuite.
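The error-class pattern being migrated to can be sketched in a few lines: an exception carries a stable error-class identifier plus structured message parameters instead of a free-form string. This is a hypothetical illustration; the class name and parameters below are invented, not Spark's actual error classes.

```python
# Minimal sketch of the error-class pattern (names are illustrative only).
class SparkThrowableSketch(Exception):
    def __init__(self, error_class: str, params: dict):
        self.error_class = error_class     # stable, testable identifier
        self.params = params               # structured message parameters
        message = f"[{error_class}] " + ", ".join(
            f"{k}={v}" for k, v in params.items()
        )
        super().__init__(message)

err = SparkThrowableSketch("GROUP_BY_POS_OUT_OF_RANGE", {"index": 5, "size": 3})
assert err.error_class == "GROUP_BY_POS_OUT_OF_RANGE"
assert "index=5" in str(err)
```

Tests can then assert on the error class and parameters rather than matching brittle message strings.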






[jira] [Updated] (SPARK-40265) Fix the inconsistent behavior for Index.intersection.

2022-08-30 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40265:

Description: 
There is inconsistent behavior in `Index.intersection` when `other` is a list of 
tuples in the pandas API on Spark, as shown below:


{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
MultiIndex([], )
>>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.

  was:
There is inconsistent behavior on Index.intersection for pandas API on Spark as 
below:


{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
MultiIndex([], )
>>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.


> Fix the inconsistent behavior for Index.intersection.
> -
>
> Key: SPARK-40265
> URL: https://issues.apache.org/jira/browse/SPARK-40265
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There is inconsistent behavior in `Index.intersection` when `other` is a list 
> of tuples in the pandas API on Spark, as shown below:
> {code:python}
> >>> pidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
> MultiIndex([], )
> >>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
> Traceback (most recent call last):
> ...
> ValueError: Names should be list-like for a MultiIndex
> {code}
> We should fix it to follow pandas.
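The pandas-compatible behavior can be sketched as a simple pre-check: intersecting a single-level index with tuple elements should raise, as pandas does, rather than silently returning an empty MultiIndex. This is a hypothetical illustration, not the pandas-on-Spark implementation.

```python
# Sketch of the check: tuple elements against a flat (single-level) index
# should raise the same ValueError that pandas raises.
def check_intersection_other(index_nlevels, other):
    if index_nlevels == 1 and any(isinstance(x, tuple) for x in other):
        raise ValueError("Names should be list-like for a MultiIndex")
    return other

assert check_intersection_other(1, [1, 2]) == [1, 2]
try:
    check_intersection_other(1, [(1, 2), (3, 4)])
    raised = False
except ValueError:
    raised = True
assert raised
```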






[jira] [Commented] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597570#comment-17597570
 ] 

Apache Spark commented on SPARK-40260:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37712

> Use error classes in the compilation errors of GROUP BY a position
> --
>
> Key: SPARK-40260
> URL: https://issues.apache.org/jira/browse/SPARK-40260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * groupByPositionRefersToAggregateFunctionError
> * groupByPositionRangeError
> to use error classes. Throw an implementation of SparkThrowable, and write a 
> test for every error in QueryCompilationErrorsSuite.






[jira] [Assigned] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position

2022-08-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40260:


Assignee: Apache Spark  (was: Max Gekk)

> Use error classes in the compilation errors of GROUP BY a position
> --
>
> Key: SPARK-40260
> URL: https://issues.apache.org/jira/browse/SPARK-40260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * groupByPositionRefersToAggregateFunctionError
> * groupByPositionRangeError
> to use error classes. Throw an implementation of SparkThrowable, and write a 
> test for every error in QueryCompilationErrorsSuite.






[jira] [Updated] (SPARK-40265) Fix the inconsistent behavior for Index.intersection.

2022-08-30 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40265:

Description: 
There is inconsistent behavior in Index.intersection in the pandas API on Spark, 
as shown below:


{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
MultiIndex([], )
>>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.

  was:
There is inconsistent behavior on Index.intersection for pandas API on Spark as 
below:


{code:python}
>>> other = [(1, 2), (3, 4)]
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection(other).sort_values()
MultiIndex([], )
>>> pidx.intersection(other).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.


> Fix the inconsistent behavior for Index.intersection.
> -
>
> Key: SPARK-40265
> URL: https://issues.apache.org/jira/browse/SPARK-40265
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There is inconsistent behavior in Index.intersection in the pandas API on 
> Spark, as shown below:
> {code:python}
> >>> pidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
> MultiIndex([], )
> >>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
> Traceback (most recent call last):
> ...
> ValueError: Names should be list-like for a MultiIndex
> {code}
> We should fix it to follow pandas.






[jira] [Updated] (SPARK-40265) Fix the inconsistent behavior for Index.intersection.

2022-08-30 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40265:

Description: 
There is inconsistent behavior in Index.intersection in the pandas API on Spark, 
as shown below:


{code:python}
>>> other = [(1, 2), (3, 4)]
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection(other).sort_values()
MultiIndex([], )
>>> pidx.intersection(other).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.

  was:
There is inconsistent behavior on Index.intersection for pandas API on Spark as 
below:


{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection(other).sort_values()
MultiIndex([], )
>>> pidx.intersection(other).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.


> Fix the inconsistent behavior for Index.intersection.
> -
>
> Key: SPARK-40265
> URL: https://issues.apache.org/jira/browse/SPARK-40265
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Index.intersection in the pandas API on Spark behaves inconsistently with 
> pandas, as shown below:
> {code:python}
> >>> other = [(1, 2), (3, 4)]
> >>> pidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx.intersection(other).sort_values()
> MultiIndex([], )
> >>> pidx.intersection(other).sort_values()
> Traceback (most recent call last):
> ...
> ValueError: Names should be list-like for a MultiIndex
> {code}
> We should fix it to follow pandas.






[jira] [Commented] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597564#comment-17597564
 ] 

Apache Spark commented on SPARK-40266:
--

User 'pacificlion' has created a pull request for this issue:
https://github.com/apache/spark/pull/37719

> Corrected console output in quick-start - Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h3. What changes were proposed in this pull request?
> Corrected the datatype in the documented console output from Long to Int.
> h3. Why are the changes needed?
> The quick-start documentation shows an incorrect datatype for the command's 
> output.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes in documentation for console output.
> [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
> h3. How was this patch tested?
> Manually checked the changes by previewing the markdown output. I also 
> verified the output by installing Spark 3.1.2 locally and running the 
> commands from the quick-start docs.
>  
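The corrected point is easy to check in a plain Scala REPL. Below is a hedged sketch of the quick-start computation using a local Seq in place of an RDD (the line contents are invented stand-ins for README.md): `split(" ").size` returns an Int, so the reduced result is typed Int, not Long.

```scala
// Stand-in for the RDD of text lines in the quick-start (hypothetical content).
val lines = Seq("# Apache Spark", "Spark is a unified analytics engine")

// Same shape as the documented computation; split(" ").size is an Int, so the
// reduced maximum is an Int -- hence the console shows "res: Int", not "res: Long".
val maxWords: Int = lines.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)

println(maxWords)  // prints 6
```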








[jira] [Updated] (SPARK-39971) ANALYZE TABLE makes some queries run forever

2022-08-30 Thread Felipe (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe updated SPARK-39971:
---
Attachment: explainMode-cost.zip

> ANALYZE TABLE makes some queries run forever
> 
>
> Key: SPARK-39971
> URL: https://issues.apache.org/jira/browse/SPARK-39971
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.2.2
>Reporter: Felipe
>Priority: Major
> Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 
> 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-enabled.txt, 
> 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 
> 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, 
> explainMode-cost.zip
>
>
> I'm using TPC-DS to run benchmarks, and after running ANALYZE TABLE (without 
> FOR ALL COLUMNS) some queries became extremely slow. For example, query24 
> ([https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql]) takes 
> between 10 and 15 minutes before ANALYZE TABLE is run.
> After running ANALYZE TABLE, I waited 24 hours before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or 
> spark.sql.cbo.enabled, it becomes fast again.
> It seems join reordering does not work well when table stats are present 
> but column stats are not.
> Row counts:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600
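The difference between the two statistics levels, and the reported workaround, can be sketched in Spark SQL (table name taken from the TPC-DS schema above; in practice each fact and dimension table would be analyzed the same way):

```sql
-- Table-level statistics only (row count and size in bytes). This is the
-- state in which join reordering appears to pick a pathological plan.
ANALYZE TABLE store_sales COMPUTE STATISTICS;

-- Column-level statistics as well (distinct counts, min/max, null counts),
-- which give the cost-based optimizer what it needs for join reordering.
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR ALL COLUMNS;

-- Workarounds reported above: turn off join reordering, or CBO entirely.
SET spark.sql.cbo.joinReorder.enabled=false;
SET spark.sql.cbo.enabled=false;
```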






[jira] [Updated] (SPARK-39971) ANALYZE TABLE makes some queries run forever

2022-08-30 Thread Felipe (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe updated SPARK-39971:
---
Attachment: (was: explainMode-cost.zip)

> ANALYZE TABLE makes some queries run forever
> 
>
> Key: SPARK-39971
> URL: https://issues.apache.org/jira/browse/SPARK-39971
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.2.2
>Reporter: Felipe
>Priority: Major
> Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 
> 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-enabled.txt, 
> 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 
> 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, 
> explainMode-cost.zip
>
>
> I'm using TPC-DS to run benchmarks, and after running ANALYZE TABLE (without 
> FOR ALL COLUMNS) some queries became extremely slow. For example, query24 
> ([https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql]) takes 
> between 10 and 15 minutes before ANALYZE TABLE is run.
> After running ANALYZE TABLE, I waited 24 hours before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or 
> spark.sql.cbo.enabled, it becomes fast again.
> It seems join reordering does not work well when table stats are present 
> but column stats are not.
> Row counts:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600






[jira] [Assigned] (SPARK-40135) Support ps.Index in DataFrame creation

2022-08-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40135:
-

Assignee: Ruifeng Zheng

> Support ps.Index in DataFrame creation
> --
>
> Key: SPARK-40135
> URL: https://issues.apache.org/jira/browse/SPARK-40135
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-40135) Support ps.Index in DataFrame creation

2022-08-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40135.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37564
[https://github.com/apache/spark/pull/37564]

> Support ps.Index in DataFrame creation
> --
>
> Key: SPARK-40135
> URL: https://issues.apache.org/jira/browse/SPARK-40135
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>






