[jira] [Closed] (SPARK-38346) Add cache in MLlib BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-38346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingchao Wu closed SPARK-38346.
-------------------------------
Discussed with community developers; they confirmed that a cache is not needed here.

> Add cache in MLlib BinaryClassificationMetrics
> ----------------------------------------------
>
>                 Key: SPARK-38346
>                 URL: https://issues.apache.org/jira/browse/SPARK-38346
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.2.3
>         Environment: Windows10/macOS12.2; spark_2.11-2.2.3; mmlspark_2.11-0.18.0; lightgbmlib-2.2.350
>            Reporter: Mingchao Wu
>            Priority: Minor
>
> We ran some example code that uses BinaryClassificationEvaluator in MLlib and found that ShuffledRDD[28] at BinaryClassificationMetrics.scala:155 and UnionRDD[36] at BinaryClassificationMetrics.scala:90 are used more than once but not cached.
> We use spark-2.2.3 and found that the code on the master branch is still not cached, so we hope to improve it.
> The example code is as follows:
> {code:java}
> import com.microsoft.ml.spark.lightgbm.LightGBMRegressor
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.feature.VectorAssembler
> import org.apache.spark.sql.types.{DoubleType, IntegerType}
> import org.apache.spark.sql.{DataFrame, SparkSession}
>
> object LightGBMRegressorTest {
>   def main(args: Array[String]): Unit = {
>     val spark: SparkSession = SparkSession.builder()
>       .appName("LightGBMRegressorTest")
>       .master("local[*]")
>       .getOrCreate()
>
>     val startTime = System.currentTimeMillis()
>
>     var originalData: DataFrame = spark.read.option("header", "true")
>       .option("inferSchema", "true")
>       .csv("data/hour.csv")
>
>     val labelCol = "workingday"
>     val cateCols = Array("season", "yr", "mnth", "hr")
>     val conCols: Array[String] = Array("temp", "atemp", "hum", "casual", "cnt")
>     val vecCols = conCols ++ cateCols
>
>     import spark.implicits._
>     vecCols.foreach(col => {
>       originalData = originalData.withColumn(col, $"$col".cast(DoubleType))
>     })
>     originalData = originalData.withColumn(labelCol, $"$labelCol".cast(IntegerType))
>
>     val assembler = new VectorAssembler().setInputCols(vecCols).setOutputCol("features")
>
>     val classifier: LightGBMRegressor = new LightGBMRegressor().setNumIterations(100).setNumLeaves(31)
>       .setBoostFromAverage(false).setFeatureFraction(1.0).setMaxDepth(-1).setMaxBin(255)
>       .setLearningRate(0.1).setMinSumHessianInLeaf(0.001).setLambdaL1(0.0).setLambdaL2(0.0)
>       .setBaggingFraction(0.5).setBaggingFreq(1).setBaggingSeed(1).setObjective("binary")
>       .setLabelCol(labelCol).setCategoricalSlotNames(cateCols).setFeaturesCol("features")
>       .setBoostingType("gbdt")
>
>     val pipeline: Pipeline = new Pipeline().setStages(Array(assembler, classifier))
>     val Array(tr, te) = originalData.randomSplit(Array(0.7, 0.3), 666)
>     val model = pipeline.fit(tr)
>     val modelDF = model.transform(te)
>     val evaluator = new BinaryClassificationEvaluator().setLabelCol(labelCol).setRawPredictionCol("prediction")
>     println(evaluator.evaluate(modelDF))
>     println(s"time: ${System.currentTimeMillis() - startTime}")
>     System.in.read()
>   }
> }
> {code}
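For context, the change the reporter proposed is the standard persist-before-reuse pattern: an RDD consumed by more than one action is otherwise recomputed for each action. A minimal PySpark sketch of that general idea, with illustrative data and names rather than the actual BinaryClassificationMetrics internals:

{code:python}
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# An RDD consumed by more than one action is recomputed each time unless it
# is persisted first. This mirrors the pattern the ticket points at (a
# shuffle output reused by several actions); the data here is illustrative.
scores = sc.parallelize([(0.9, 1.0), (0.2, 0.0), (0.7, 1.0)])
by_score = scores.combineByKey(
    lambda v: 1,             # createCombiner: start a count at 1
    lambda acc, _: acc + 1,  # mergeValue: bump the count
    lambda a, b: a + b,      # mergeCombiners: merge partition counts
)  # combineByKey shuffles, so this RDD is expensive to recompute
by_score.persist(StorageLevel.MEMORY_AND_DISK)

print(by_score.count())    # first action: computes and materializes the cache
print(by_score.collect())  # second action: served from the cache
by_score.unpersist()
{code}

As the resolution above notes, the reuse inside the Metrics implementation itself was judged not to warrant a cache.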
[jira] [Updated] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long
[ https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Singh updated SPARK-40266:
-----------------------------------
    Affects Version/s: 3.3.0

> Corrected console output in quick-start - Datatype Integer instead of Long
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-40266
>                 URL: https://issues.apache.org/jira/browse/SPARK-40266
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 3.1.2, 3.3.0
>         Environment: spark 3.3.0
>                      Windows 10 (OS Build 19044.1889)
>            Reporter: Prashant Singh
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h3. What changes were proposed in this pull request?
> Corrected the datatype in the documented console output from Long to Int.
> h3. Why are the changes needed?
> The documentation shows an incorrect datatype.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes to the console output shown in the documentation.
> [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
> h3. How was this patch tested?
> Manually checked the changes by previewing the markdown output. I verified the output by installing Spark 3.3.0 locally and running the commands in the quick-start docs.
[jira] [Updated] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long
[ https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Singh updated SPARK-40266:
-----------------------------------
    Environment: 
spark 3.3.0
Windows 10 (OS Build 19044.1889)

  was:
spark 3.1.2
Windows 10 (OS Build 19044.1889)

> Corrected console output in quick-start - Datatype Integer instead of Long
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-40266
>                 URL: https://issues.apache.org/jira/browse/SPARK-40266
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 3.1.2
>         Environment: spark 3.3.0
>                      Windows 10 (OS Build 19044.1889)
>            Reporter: Prashant Singh
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h3. What changes were proposed in this pull request?
> Corrected the datatype in the documented console output from Long to Int.
> h3. Why are the changes needed?
> The documentation shows an incorrect datatype.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes to the console output shown in the documentation.
> [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
> h3. How was this patch tested?
> Manually checked the changes by previewing the markdown output. I verified the output by installing Spark 3.3.0 locally and running the commands in the quick-start docs.
[jira] [Updated] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long
[ https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Singh updated SPARK-40266:
-----------------------------------
    Description: 
h3. What changes were proposed in this pull request?
Corrected the datatype in the documented console output from Long to Int.
h3. Why are the changes needed?
The documentation shows an incorrect datatype.
h3. Does this PR introduce _any_ user-facing change?
Yes. It proposes changes to the console output shown in the documentation.
[!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
h3. How was this patch tested?
Manually checked the changes by previewing the markdown output. I verified the output by installing Spark 3.3.0 locally and running the commands in the quick-start docs.

  was:
h3. What changes were proposed in this pull request?
Corrected the datatype in the documented console output from Long to Int.
h3. Why are the changes needed?
The documentation shows an incorrect datatype.
h3. Does this PR introduce _any_ user-facing change?
Yes. It proposes changes to the console output shown in the documentation.
[!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
h3. How was this patch tested?
Manually checked the changes by previewing the markdown output. I verified the output by installing Spark 3.1.2 locally and running the commands in the quick-start docs.

> Corrected console output in quick-start - Datatype Integer instead of Long
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-40266
>                 URL: https://issues.apache.org/jira/browse/SPARK-40266
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 3.1.2
>         Environment: spark 3.1.2
>                      Windows 10 (OS Build 19044.1889)
>            Reporter: Prashant Singh
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h3. What changes were proposed in this pull request?
> Corrected the datatype in the documented console output from Long to Int.
> h3. Why are the changes needed?
> The documentation shows an incorrect datatype.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes to the console output shown in the documentation.
> [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
> h3. How was this patch tested?
> Manually checked the changes by previewing the markdown output. I verified the output by installing Spark 3.3.0 locally and running the commands in the quick-start docs.
[jira] [Assigned] (SPARK-40272) Support service port custom with range
[ https://issues.apache.org/jira/browse/SPARK-40272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40272:
------------------------------------
    Assignee: Apache Spark

> Support service port custom with range
> ---------------------------------------
>
>                 Key: SPARK-40272
>                 URL: https://issues.apache.org/jira/browse/SPARK-40272
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0, 3.0.0, 3.2.2
>            Reporter: XiaoLong Wu
>            Assignee: Apache Spark
>            Priority: Minor
>
> In practice, we often encounter firewall restrictions that limit ports to a certain range, so Spark needs to support restricting all of its service ports to a configured range.
[jira] [Commented] (SPARK-40272) Support service port custom with range
[ https://issues.apache.org/jira/browse/SPARK-40272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597589#comment-17597589 ]

Apache Spark commented on SPARK-40272:
--------------------------------------
User 'chong0929' has created a pull request for this issue:
https://github.com/apache/spark/pull/37721

> Support service port custom with range
> ---------------------------------------
>
>                 Key: SPARK-40272
>                 URL: https://issues.apache.org/jira/browse/SPARK-40272
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0, 3.0.0, 3.2.2
>            Reporter: XiaoLong Wu
>            Priority: Minor
>
> In practice, we often encounter firewall restrictions that limit ports to a certain range, so Spark needs to support restricting all of its service ports to a configured range.
[jira] [Assigned] (SPARK-40272) Support service port custom with range
[ https://issues.apache.org/jira/browse/SPARK-40272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40272:
------------------------------------
    Assignee: (was: Apache Spark)

> Support service port custom with range
> ---------------------------------------
>
>                 Key: SPARK-40272
>                 URL: https://issues.apache.org/jira/browse/SPARK-40272
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0, 3.0.0, 3.2.2
>            Reporter: XiaoLong Wu
>            Priority: Minor
>
> In practice, we often encounter firewall restrictions that limit ports to a certain range, so Spark needs to support restricting all of its service ports to a configured range.
[jira] [Created] (SPARK-40272) Support service port custom with range
XiaoLong Wu created SPARK-40272:
-----------------------------------

             Summary: Support service port custom with range
                 Key: SPARK-40272
                 URL: https://issues.apache.org/jira/browse/SPARK-40272
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.2.2, 3.0.0, 2.4.0
            Reporter: XiaoLong Wu

In practice, we often encounter firewall restrictions that limit ports to a certain range, so Spark needs to support restricting all of its service ports to a configured range.
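For background on the request: Spark's current settings give each service a fixed starting port plus a sequential retry budget (spark.port.maxRetries), not a true range. A hedged sketch of how a firewall-friendly window can be approximated with the existing knobs; every port number below is illustrative, and the range syntax this ticket proposes lives in the pull request referenced above, not here:

{code:python}
from pyspark.sql import SparkSession

# With today's configuration, each service starts at a fixed port and Spark
# retries upward on conflicts, so ports land roughly in
# [base, base + spark.port.maxRetries]. The ticket asks for an explicit
# range instead. All values below are illustrative.
spark = (
    SparkSession.builder
    .appName("port-window-sketch")
    .config("spark.driver.port", "40000")
    .config("spark.blockManager.port", "40100")
    .config("spark.ui.port", "40200")
    .config("spark.port.maxRetries", "32")
    .getOrCreate()
)
{code}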
[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
[ https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597573#comment-17597573 ]

comet commented on SPARK-38330:
-------------------------------
Any update on this ticket? Has anyone tested this on the latest version of Hadoop? I tested but still get the same error.

> Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38330
>                 URL: https://issues.apache.org/jira/browse/SPARK-38330
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2
>    Affects Versions: 3.2.1
>         Environment: Spark 3.2.1 built with the `hadoop-cloud` flag.
>                      Direct access to S3 using the default file committer.
>                      JDK8.
>            Reporter: André F.
>            Priority: Major
>
> Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1 leads to the following exception while reading files on S3:
> {code:java}
> org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a:///.parquet: com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
> 	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208)
> 	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277)
> 	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
> 	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
> 	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
> 	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
> 	at scala.Option.getOrElse(Option.scala:189)
> 	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
> 	at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596)
> {code}
> {code:java}
> Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
> 	at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507)
> 	at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437)
> 	at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
> 	at com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
> 	at com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
> 	at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
> 	at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source)
> 	at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
> 	at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
> 	at com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
> 	at com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
> 	at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
> 	at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
> 	at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
> 	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1333)
> 	at ...
> {code}
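A likely trigger, judging from the error, is a bucket name containing dots: under virtual-hosted-style addressing such a name cannot match the `*.s3.amazonaws.com` wildcard certificate. A commonly suggested S3A mitigation is to force path-style access; this is an assumption, not a fix confirmed in the ticket:

{code:python}
from pyspark.sql import SparkSession

# Sketch of the usual workaround for dotted bucket names: path-style access
# keeps the TLS hostname as s3.amazonaws.com rather than
# <bucket>.s3.amazonaws.com. Whether this resolves SPARK-38330 is not
# confirmed here; the bucket and path below are hypothetical.
spark = (
    SparkSession.builder
    .appName("s3a-path-style-sketch")
    # The "spark.hadoop." prefix forwards settings into the Hadoop configuration.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my.dotted.bucket/table.parquet")
{code}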
[jira] [Commented] (SPARK-40271) Support list type for spark.sql.functions.lit
[ https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597571#comment-17597571 ]

Haejoon Lee commented on SPARK-40271:
-------------------------------------
I'm working on it.

> Support list type for spark.sql.functions.lit
> ----------------------------------------------
>
>                 Key: SPARK-40271
>                 URL: https://issues.apache.org/jira/browse/SPARK-40271
>             Project: Spark
>          Issue Type: Test
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Haejoon Lee
>            Priority: Major
>
> Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, as shown below:
> {code:python}
> >>> df = spark.range(3).withColumn("c", lit([1,2,3]))
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] The feature is not supported: Literal for '[1, 2, 3]' of class java.util.ArrayList.
> 	at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
> 	at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
> 	at org.apache.spark.sql.functions$.lit(functions.scala:125)
> 	at org.apache.spark.sql.functions.lit(functions.scala)
> 	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
> 	at java.base/java.lang.reflect.Method.invoke(Method.java:577)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
> 	at py4j.Gateway.invoke(Gateway.java:282)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
> 	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
> 	at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> We should support it.
[jira] [Created] (SPARK-40271) Support list type for spark.sql.functions.lit
Haejoon Lee created SPARK-40271:
-----------------------------------

             Summary: Support list type for spark.sql.functions.lit
                 Key: SPARK-40271
                 URL: https://issues.apache.org/jira/browse/SPARK-40271
             Project: Spark
          Issue Type: Test
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Haejoon Lee

Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, as shown below:

{code:python}
>>> df = spark.range(3).withColumn("c", lit([1,2,3]))
Traceback (most recent call last):
...
: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] The feature is not supported: Literal for '[1, 2, 3]' of class java.util.ArrayList.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
	at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
	at org.apache.spark.sql.functions$.lit(functions.scala:125)
	at org.apache.spark.sql.functions.lit(functions.scala)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:577)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)
{code}

We should support it.
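Until `lit` accepts lists directly, the usual workaround is to build an array column from per-element literals with `array`. This is ordinary PySpark; nothing here is specific to the fix this ticket tracks:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, lit

spark = SparkSession.builder.getOrCreate()

# lit([1, 2, 3]) raises UNSUPPORTED_FEATURE.LITERAL_TYPE as shown above;
# wrapping each element in lit() and combining with array() works today.
df = spark.range(3).withColumn("c", array(*[lit(x) for x in [1, 2, 3]]))
df.show()
# Each of the three rows carries the same [1, 2, 3] array in column "c".
{code}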
[jira] [Assigned] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position
[ https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40260:
------------------------------------
    Assignee: Max Gekk  (was: Apache Spark)

> Use error classes in the compilation errors of GROUP BY a position
> -------------------------------------------------------------------
>
>                 Key: SPARK-40260
>                 URL: https://issues.apache.org/jira/browse/SPARK-40260
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>
> Migrate the following errors in QueryCompilationErrors to error classes:
> * groupByPositionRefersToAggregateFunctionError
> * groupByPositionRangeError
> Throw an implementation of SparkThrowable, and write a test for every error in QueryCompilationErrorsSuite.
[jira] [Updated] (SPARK-40265) Fix the inconsistent behavior for Index.intersection.
[ https://issues.apache.org/jira/browse/SPARK-40265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee updated SPARK-40265:
--------------------------------
    Description: 
There is inconsistent behavior on `Index.intersection` when `other` is a list of tuples for the pandas API on Spark, as shown below:

{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
MultiIndex([], )
>>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.

  was:
There is inconsistent behavior on Index.intersection for the pandas API on Spark, as shown below:

{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
MultiIndex([], )
>>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.

> Fix the inconsistent behavior for Index.intersection.
> -------------------------------------------------------
>
>                 Key: SPARK-40265
>                 URL: https://issues.apache.org/jira/browse/SPARK-40265
>             Project: Spark
>          Issue Type: Test
>          Components: Pandas API on Spark
>    Affects Versions: 3.4.0
>            Reporter: Haejoon Lee
>            Priority: Major
>
> There is inconsistent behavior on `Index.intersection` when `other` is a list of tuples for the pandas API on Spark, as shown below:
> {code:python}
> >>> pidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
> MultiIndex([], )
> >>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
> Traceback (most recent call last):
> ...
> ValueError: Names should be list-like for a MultiIndex
> {code}
> We should fix it to follow pandas.
[jira] [Commented] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position
[ https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597570#comment-17597570 ]

Apache Spark commented on SPARK-40260:
--------------------------------------
User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37712

> Use error classes in the compilation errors of GROUP BY a position
> -------------------------------------------------------------------
>
>                 Key: SPARK-40260
>                 URL: https://issues.apache.org/jira/browse/SPARK-40260
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>
> Migrate the following errors in QueryCompilationErrors to error classes:
> * groupByPositionRefersToAggregateFunctionError
> * groupByPositionRangeError
> Throw an implementation of SparkThrowable, and write a test for every error in QueryCompilationErrorsSuite.
[jira] [Assigned] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position
[ https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40260:
------------------------------------
    Assignee: Apache Spark  (was: Max Gekk)

> Use error classes in the compilation errors of GROUP BY a position
> -------------------------------------------------------------------
>
>                 Key: SPARK-40260
>                 URL: https://issues.apache.org/jira/browse/SPARK-40260
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Assignee: Apache Spark
>            Priority: Major
>
> Migrate the following errors in QueryCompilationErrors to error classes:
> * groupByPositionRefersToAggregateFunctionError
> * groupByPositionRangeError
> Throw an implementation of SparkThrowable, and write a test for every error in QueryCompilationErrorsSuite.
[jira] [Updated] (SPARK-40265) Fix the inconsistent behavior for Index.intersection.
[ https://issues.apache.org/jira/browse/SPARK-40265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee updated SPARK-40265:
--------------------------------
    Description: 
There is inconsistent behavior on Index.intersection for the pandas API on Spark, as shown below:

{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
MultiIndex([], )
>>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.

  was:
There is inconsistent behavior on Index.intersection for the pandas API on Spark, as shown below:

{code:python}
>>> other = [(1, 2), (3, 4)]
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection(other).sort_values()
MultiIndex([], )
>>> pidx.intersection(other).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.

> Fix the inconsistent behavior for Index.intersection.
> -------------------------------------------------------
>
>                 Key: SPARK-40265
>                 URL: https://issues.apache.org/jira/browse/SPARK-40265
>             Project: Spark
>          Issue Type: Test
>          Components: Pandas API on Spark
>    Affects Versions: 3.4.0
>            Reporter: Haejoon Lee
>            Priority: Major
>
> There is inconsistent behavior on Index.intersection for the pandas API on Spark, as shown below:
> {code:python}
> >>> pidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
> MultiIndex([], )
> >>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
> Traceback (most recent call last):
> ...
> ValueError: Names should be list-like for a MultiIndex
> {code}
> We should fix it to follow pandas.
[jira] [Updated] (SPARK-40265) Fix the inconsistent behavior for Index.intersection.
[ https://issues.apache.org/jira/browse/SPARK-40265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee updated SPARK-40265:
--------------------------------
    Description: 
There is inconsistent behavior on Index.intersection for the pandas API on Spark, as shown below:

{code:python}
>>> other = [(1, 2), (3, 4)]
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection(other).sort_values()
MultiIndex([], )
>>> pidx.intersection(other).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.

  was:
There is inconsistent behavior on Index.intersection for the pandas API on Spark, as shown below:

{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection(other).sort_values()
MultiIndex([], )
>>> pidx.intersection(other).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.

> Fix the inconsistent behavior for Index.intersection.
> -------------------------------------------------------
>
>                 Key: SPARK-40265
>                 URL: https://issues.apache.org/jira/browse/SPARK-40265
>             Project: Spark
>          Issue Type: Test
>          Components: Pandas API on Spark
>    Affects Versions: 3.4.0
>            Reporter: Haejoon Lee
>            Priority: Major
>
> There is inconsistent behavior on Index.intersection for the pandas API on Spark, as shown below:
> {code:python}
> >>> other = [(1, 2), (3, 4)]
> >>> pidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx.intersection(other).sort_values()
> MultiIndex([], )
> >>> pidx.intersection(other).sort_values()
> Traceback (most recent call last):
> ...
> ValueError: Names should be list-like for a MultiIndex
> {code}
> We should fix it to follow pandas.
[jira] [Commented] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long
[ https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597564#comment-17597564 ]

Apache Spark commented on SPARK-40266:
--------------------------------------
User 'pacificlion' has created a pull request for this issue:
https://github.com/apache/spark/pull/37719

> Corrected console output in quick-start - Datatype Integer instead of Long
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-40266
>                 URL: https://issues.apache.org/jira/browse/SPARK-40266
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 3.1.2
>         Environment: spark 3.1.2
>                      Windows 10 (OS Build 19044.1889)
>            Reporter: Prashant Singh
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h3. What changes were proposed in this pull request?
> Corrected the datatype in the documented console output from Long to Int.
> h3. Why are the changes needed?
> The documentation shows an incorrect datatype.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes to the console output shown in the documentation.
> [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
> h3. How was this patch tested?
> Manually checked the changes by previewing the markdown output. I verified the output by installing Spark 3.1.2 locally and running the commands in the quick-start docs.
[jira] [Updated] (SPARK-39971) ANALYZE TABLE makes some queries run forever
[ https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felipe updated SPARK-39971:
---------------------------
    Attachment: explainMode-cost.zip

> ANALYZE TABLE makes some queries run forever
> --------------------------------------------
>
>                 Key: SPARK-39971
>                 URL: https://issues.apache.org/jira/browse/SPARK-39971
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 3.2.2
>            Reporter: Felipe
>            Priority: Major
>         Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT ForAllColumns-joinreorder-enabled.txt, 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, explainMode-cost.zip
>
> I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without FOR ALL COLUMNS) some queries became really slow. For example, query24 - [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] - takes between 10 and 15 minutes before running ANALYZE TABLE.
> After running ANALYZE TABLE, I waited 24h before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or spark.sql.cbo.enabled, it becomes fast again.
> It seems something in join reordering is not working well when we have table stats but not column stats.
> Row counts:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600
[jira] [Updated] (SPARK-39971) ANALYZE TABLE makes some queries run forever
[ https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felipe updated SPARK-39971:
---------------------------
    Attachment: (was: explainMode-cost.zip)

> ANALYZE TABLE makes some queries run forever
> --------------------------------------------
>
>                 Key: SPARK-39971
>                 URL: https://issues.apache.org/jira/browse/SPARK-39971
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 3.2.2
>            Reporter: Felipe
>            Priority: Major
>         Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT ForAllColumns-joinreorder-enabled.txt, 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, explainMode-cost.zip
>
> I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without FOR ALL COLUMNS) some queries became really slow. For example, query24 - [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] - takes between 10 and 15 minutes before running ANALYZE TABLE.
> After running ANALYZE TABLE, I waited 24h before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or spark.sql.cbo.enabled, it becomes fast again.
> It seems something in join reordering is not working well when we have table stats but not column stats.
> Row counts:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600
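Two mitigations follow directly from the report. A hedged sketch using the ticket's table names; the statements and flags are standard Spark SQL, though whether FOR ALL COLUMNS fully fixes query24 is only implied by the attachment names:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: compute column-level statistics too. Table-level stats alone
# (plain ANALYZE TABLE) are exactly the state in which join reordering
# misbehaves according to this report.
for table in ["store_sales", "store_returns", "store",
              "item", "customer", "customer_address"]:
    spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR ALL COLUMNS")

# Option 2: keep the table stats but disable join reordering, which the
# reporter confirmed restores the original runtime.
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "false")
{code}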
[jira] [Assigned] (SPARK-40135) Support ps.Index in DataFrame creation
[ https://issues.apache.org/jira/browse/SPARK-40135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-40135:
-------------------------------------
    Assignee: Ruifeng Zheng

> Support ps.Index in DataFrame creation
> ---------------------------------------
>
>                 Key: SPARK-40135
>                 URL: https://issues.apache.org/jira/browse/SPARK-40135
>             Project: Spark
>          Issue Type: Improvement
>          Components: ps
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
[jira] [Resolved] (SPARK-40135) Support ps.Index in DataFrame creation
[ https://issues.apache.org/jira/browse/SPARK-40135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-40135.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37564
[https://github.com/apache/spark/pull/37564]

> Support ps.Index in DataFrame creation
> ---------------------------------------
>
>                 Key: SPARK-40135
>                 URL: https://issues.apache.org/jira/browse/SPARK-40135
>             Project: Spark
>          Issue Type: Improvement
>          Components: ps
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>             Fix For: 3.4.0
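The resolved behavior, sketched under the assumption that the fix lets a pandas-on-Spark Index flow straight into the DataFrame constructor; the ticket body itself gives no example, so the call shape below is inferred from the linked pull request:

{code:python}
import pyspark.pandas as ps

# Assumed usage of the resolved feature: constructing a DataFrame with a
# ps.Index instead of a pandas Index or a plain list. The exact supported
# call shapes come from apache/spark#37564, not from this ticket.
idx = ps.Index([10, 20, 30], name="id")
psdf = ps.DataFrame({"value": [1, 2, 3]}, index=idx)
print(psdf)
{code}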