[jira] [Created] (SPARK-40272) Support service port custom with range

2022-08-29 Thread XiaoLong Wu (Jira)
XiaoLong Wu created SPARK-40272:
---

 Summary: Support service port custom with range
 Key: SPARK-40272
 URL: https://issues.apache.org/jira/browse/SPARK-40272
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.2, 3.0.0, 2.4.0
Reporter: XiaoLong Wu


In practice, we often encounter firewall restrictions that limit ports to a 
certain range, so Spark needs a way to restrict all of its service ports to a 
user-specified range.
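
As context for the request (not part of the original report): the closest thing Spark offers today is pinning each service's base port and bounding the retry window with spark.port.maxRetries, since Spark probes base+1, base+2, ... on bind conflicts. A minimal sketch of that partial workaround, with port numbers chosen purely for illustration:

{code}
# Existing knobs only; a real range option would replace this per-service setup.
spark.driver.port         41000
spark.blockManager.port   41100
spark.ui.port             4040
# Bounds how far above each base port Spark may probe on bind conflicts,
# so the effective range per service is [base, base + 16].
spark.port.maxRetries     16
{code}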



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

2022-08-29 Thread comet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597573#comment-17597573
 ] 

comet commented on SPARK-38330:
---

Any update on this ticket? Has anyone tested this on the latest version of Hadoop? 
I tested it but still get the same error.
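
For anyone landing here while the ticket is open: the wildcard certificate *.s3.amazonaws.com cannot match a virtual-hosted-style hostname when the bucket name itself contains dots, which is the usual trigger for this exception. A commonly suggested workaround, not verified in this thread, is to switch S3A to path-style access:

{code}
# Hedged workaround sketch: send requests to s3.amazonaws.com/<bucket>/...
# instead of <bucket>.s3.amazonaws.com, avoiding the SAN mismatch.
spark.hadoop.fs.s3a.path.style.access  true
{code}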

> Certificate doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
> --
>
> Key: SPARK-38330
> URL: https://issues.apache.org/jira/browse/SPARK-38330
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 3.2.1
> Environment: Spark 3.2.1 built with `hadoop-cloud` flag.
> Direct access to s3 using default file committer.
> JDK8.
>  
>Reporter: André F.
>Priority: Major
>
> Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1 
> led us to the following exception while reading files on s3:
> {code:java}
> org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on 
> s3a:///.parquet: com.amazonaws.SdkClientException: Unable to 
> execute HTTP request: Certificate for  doesn't match 
> any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: 
> Unable to execute HTTP request: Certificate for  doesn't match any of 
> the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
>  at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) 
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) 
> at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code}
>  
> {code:java}
> Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for 
>  doesn't match any of the subject alternative names: 
> [*.s3.amazonaws.com, s3.amazonaws.com]
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437)
>   at 
> com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
>   at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
>   at 
> com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1333)
>   at 
> com.amazonaws.http.Amaz

[jira] [Commented] (SPARK-40271) Support list type for spark.sql.functions.lit

2022-08-29 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597571#comment-17597571
 ] 

Haejoon Lee commented on SPARK-40271:
-

I'm working on it

> Support list type for spark.sql.functions.lit
> -
>
> Key: SPARK-40271
> URL: https://issues.apache.org/jira/browse/SPARK-40271
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, 
> as shown below:
> {code:python}
> >>> df = spark.range(3).withColumn("c", lit([1,2,3]))
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] 
> The feature is not supported: Literal for '[1, 2, 3]' of class 
> java.util.ArrayList.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
>   at org.apache.spark.sql.functions$.lit(functions.scala:125)
>   at org.apache.spark.sql.functions.lit(functions.scala)
>   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> We should support it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40271) Support list type for spark.sql.functions.lit

2022-08-29 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40271:
---

 Summary: Support list type for spark.sql.functions.lit
 Key: SPARK-40271
 URL: https://issues.apache.org/jira/browse/SPARK-40271
 Project: Spark
  Issue Type: Test
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, as 
shown below:


{code:python}
>>> df = spark.range(3).withColumn("c", lit([1,2,3]))
Traceback (most recent call last):
...
: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] 
The feature is not supported: Literal for '[1, 2, 3]' of class 
java.util.ArrayList.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
at 
org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
at org.apache.spark.sql.functions$.lit(functions.scala:125)
at org.apache.spark.sql.functions.lit(functions.scala)
at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
at java.base/java.lang.reflect.Method.invoke(Method.java:577)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:833)
{code}

We should support it.
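
Until that lands, a minimal workaround sketch using the existing array and lit functions (standard PySpark API, not the proposed change itself):

{code:python}
from pyspark.sql import functions as F

# Build the array literal element by element, since lit() does not yet
# accept a Python list directly.
df = spark.range(3).withColumn("c", F.array(*[F.lit(x) for x in [1, 2, 3]]))
df.show()
{code}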



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40260:


Assignee: Max Gekk  (was: Apache Spark)

> Use error classes in the compilation errors of GROUP BY a position
> --
>
> Key: SPARK-40260
> URL: https://issues.apache.org/jira/browse/SPARK-40260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * groupByPositionRefersToAggregateFunctionError
> * groupByPositionRangeError
> onto error classes. Throw an implementation of SparkThrowable, and write 
> a test for every error in QueryCompilationErrorsSuite.
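
For readers unfamiliar with the pattern, a hedged sketch of the target shape; the error-class name and parameter keys below are illustrative and not taken from this ticket:

{code}
// Sketch only: replace the free-form message with an error-class-based
// AnalysisException (an implementation of SparkThrowable).
def groupByPositionRangeError(index: Int, size: Int): Throwable = {
  new AnalysisException(
    errorClass = "GROUP_BY_POS_OUT_OF_RANGE",   // illustrative name
    messageParameters = Map(
      "index" -> index.toString,
      "size" -> size.toString))
}
{code}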



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40265) Fix the inconsistent behavior for Index.intersection.

2022-08-29 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40265:

Description: 
There is inconsistent behavior on `Index.intersection` when `other` is a list of 
tuples for pandas API on Spark, as shown below:


{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
MultiIndex([], )
>>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.

  was:
There is inconsistent behavior on Index.intersection for pandas API on Spark as 
below:


{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
MultiIndex([], )
>>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.


> Fix the inconsistent behavior for Index.intersection.
> -
>
> Key: SPARK-40265
> URL: https://issues.apache.org/jira/browse/SPARK-40265
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There is inconsistent behavior on `Index.intersection` when `other` is a list 
> of tuples for pandas API on Spark, as shown below:
> {code:python}
> >>> pidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
> MultiIndex([], )
> >>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
> Traceback (most recent call last):
> ...
> ValueError: Names should be list-like for a MultiIndex
> {code}
> We should fix it to follow pandas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position

2022-08-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597570#comment-17597570
 ] 

Apache Spark commented on SPARK-40260:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37712

> Use error classes in the compilation errors of GROUP BY a position
> --
>
> Key: SPARK-40260
> URL: https://issues.apache.org/jira/browse/SPARK-40260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * groupByPositionRefersToAggregateFunctionError
> * groupByPositionRangeError
> onto error classes. Throw an implementation of SparkThrowable, and write 
> a test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40260:


Assignee: Apache Spark  (was: Max Gekk)

> Use error classes in the compilation errors of GROUP BY a position
> --
>
> Key: SPARK-40260
> URL: https://issues.apache.org/jira/browse/SPARK-40260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * groupByPositionRefersToAggregateFunctionError
> * groupByPositionRangeError
> onto error classes. Throw an implementation of SparkThrowable, and write 
> a test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40265) Fix the inconsistent behavior for Index.intersection.

2022-08-29 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40265:

Description: 
There is inconsistent behavior on Index.intersection for pandas API on Spark as 
below:


{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
MultiIndex([], )
>>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.

  was:
There is inconsistent behavior on Index.intersection for pandas API on Spark as 
below:


{code:python}
>>> other = [(1, 2), (3, 4)]
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection(other).sort_values()
MultiIndex([], )
>>> pidx.intersection(other).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.


> Fix the inconsistent behavior for Index.intersection.
> -
>
> Key: SPARK-40265
> URL: https://issues.apache.org/jira/browse/SPARK-40265
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There is inconsistent behavior on Index.intersection for pandas API on Spark 
> as below:
> {code:python}
> >>> pidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
> MultiIndex([], )
> >>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
> Traceback (most recent call last):
> ...
> ValueError: Names should be list-like for a MultiIndex
> {code}
> We should fix it to follow pandas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40265) Fix the inconsistent behavior for Index.intersection.

2022-08-29 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40265:

Description: 
There is inconsistent behavior on Index.intersection for pandas API on Spark as 
below:


{code:python}
>>> other = [(1, 2), (3, 4)]
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection(other).sort_values()
MultiIndex([], )
>>> pidx.intersection(other).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.

  was:
There is inconsistent behavior on Index.intersection for pandas API on Spark as 
below:


{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection(other).sort_values()
MultiIndex([], )
>>> pidx.intersection(other).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.


> Fix the inconsistent behavior for Index.intersection.
> -
>
> Key: SPARK-40265
> URL: https://issues.apache.org/jira/browse/SPARK-40265
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There is inconsistent behavior on Index.intersection for pandas API on Spark 
> as below:
> {code:python}
> >>> other = [(1, 2), (3, 4)]
> >>> pidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx.intersection(other).sort_values()
> MultiIndex([], )
> >>> pidx.intersection(other).sort_values()
> Traceback (most recent call last):
> ...
> ValueError: Names should be list-like for a MultiIndex
> {code}
> We should fix it to follow pandas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597564#comment-17597564
 ] 

Apache Spark commented on SPARK-40266:
--

User 'pacificlion' has created a pull request for this issue:
https://github.com/apache/spark/pull/37719

> Corrected  console output in quick-start -  Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h3. What changes were proposed in this pull request?
> Corrected the datatype shown in the console output from Long to Int
> h3. Why are the changes needed?
> The documentation currently shows an incorrect datatype
> h3. Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes in documentation for console output.
> [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
> h3. How was this patch tested?
> Manually checked the changes by previewing markdown output. I tested output 
> by installing spark 3.1.2 locally and running commands present in quick start 
> docs
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597563#comment-17597563
 ] 

Apache Spark commented on SPARK-40266:
--

User 'pacificlion' has created a pull request for this issue:
https://github.com/apache/spark/pull/37719

> Corrected  console output in quick-start -  Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h3. What changes were proposed in this pull request?
> Corrected the datatype shown in the console output from Long to Int
> h3. Why are the changes needed?
> The documentation currently shows an incorrect datatype
> h3. Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes in documentation for console output.
> [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
> h3. How was this patch tested?
> Manually checked the changes by previewing markdown output. I tested output 
> by installing spark 3.1.2 locally and running commands present in quick start 
> docs
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39971) ANALYZE TABLE makes some queries run forever

2022-08-29 Thread Felipe (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe updated SPARK-39971:
---
Attachment: explainMode-cost.zip

> ANALYZE TABLE makes some queries run forever
> 
>
> Key: SPARK-39971
> URL: https://issues.apache.org/jira/browse/SPARK-39971
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.2.2
>Reporter: Felipe
>Priority: Major
> Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 
> 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-enabled.txt, 
> 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 
> 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, 
> explainMode-cost.zip
>
>
> I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without 
> FOR ALL COLUMNS) some queries became really slow. For example, query24 - 
> [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] - takes 
> between 10 and 15 minutes before ANALYZE TABLE is run.
> After running ANALYZE TABLE, I waited 24h before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or 
> spark.sql.cbo.enabled it becomes fast again.
> It seems something in join reordering is not working well when we have table 
> stats, but not column stats.
> Rows Count:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600
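
Spelling out the two mitigations the description already points at (standard Spark SQL syntax; the table name is taken from the row counts above):

{code:sql}
-- Either give the CBO column-level statistics to reorder with ...
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR ALL COLUMNS;

-- ... or disable join reordering until column stats are available.
SET spark.sql.cbo.joinReorder.enabled=false;
{code}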



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39971) ANALYZE TABLE makes some queries run forever

2022-08-29 Thread Felipe (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe updated SPARK-39971:
---
Attachment: (was: explainMode-cost.zip)

> ANALYZE TABLE makes some queries run forever
> 
>
> Key: SPARK-39971
> URL: https://issues.apache.org/jira/browse/SPARK-39971
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.2.2
>Reporter: Felipe
>Priority: Major
> Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 
> 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-enabled.txt, 
> 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 
> 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, 
> explainMode-cost.zip
>
>
> I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without 
> FOR ALL COLUMNS) some queries became really slow. For example, query24 - 
> [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] - takes 
> between 10 and 15 minutes before ANALYZE TABLE is run.
> After running ANALYZE TABLE, I waited 24h before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or 
> spark.sql.cbo.enabled it becomes fast again.
> It seems something in join reordering is not working well when we have table 
> stats, but not column stats.
> Rows Count:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40135) Support ps.Index in DataFrame creation

2022-08-29 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40135:
-

Assignee: Ruifeng Zheng

> Support ps.Index in DataFrame creation
> --
>
> Key: SPARK-40135
> URL: https://issues.apache.org/jira/browse/SPARK-40135
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40135) Support ps.Index in DataFrame creation

2022-08-29 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40135.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37564
[https://github.com/apache/spark/pull/37564]

> Support ps.Index in DataFrame creation
> --
>
> Key: SPARK-40135
> URL: https://issues.apache.org/jira/browse/SPARK-40135
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40270) Make compute.max_rows as None working in DataFrame.style

2022-08-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597548#comment-17597548
 ] 

Apache Spark commented on SPARK-40270:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37718

> Make compute.max_rows as None working in DataFrame.style
> 
>
> Key: SPARK-40270
> URL: https://issues.apache.org/jira/browse/SPARK-40270
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3
>
>
> {code}
> import pyspark.pandas as ps
> ps.set_option("compute.max_rows", None)
> ps.get_option("compute.max_rows")
> ps.range(1).style
> {code}
> fails as below:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
> pdf = self.head(max_results + 1)._to_internal_pandas()
> TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40270) Make compute.max_rows as None working in DataFrame.style

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40270:


Assignee: Apache Spark

> Make compute.max_rows as None working in DataFrame.style
> 
>
> Key: SPARK-40270
> URL: https://issues.apache.org/jira/browse/SPARK-40270
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3
>
>
> {code}
> import pyspark.pandas as ps
> ps.set_option("compute.max_rows", None)
> ps.get_option("compute.max_rows")
> ps.range(1).style
> {code}
> fails as below:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
> pdf = self.head(max_results + 1)._to_internal_pandas()
> TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40270) Make compute.max_rows as None working in DataFrame.style

2022-08-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597546#comment-17597546
 ] 

Apache Spark commented on SPARK-40270:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37718

> Make compute.max_rows as None working in DataFrame.style
> 
>
> Key: SPARK-40270
> URL: https://issues.apache.org/jira/browse/SPARK-40270
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3
>
>
> {code}
> import pyspark.pandas as ps
> ps.set_option("compute.max_rows", None)
> ps.get_option("compute.max_rows")
> ps.range(1).style
> {code}
> fails as below:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
> pdf = self.head(max_results + 1)._to_internal_pandas()
> TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40270) Make compute.max_rows as None working in DataFrame.style

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40270:


Assignee: (was: Apache Spark)

> Make compute.max_rows as None working in DataFrame.style
> 
>
> Key: SPARK-40270
> URL: https://issues.apache.org/jira/browse/SPARK-40270
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3
>
>
> {code}
> import pyspark.pandas as ps
> ps.set_option("compute.max_rows", None)
> ps.get_option("compute.max_rows")
> ps.range(1).style
> {code}
> fails as below:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
> pdf = self.head(max_results + 1)._to_internal_pandas()
> TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39616) Upgrade Breeze to 2.0

2022-08-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597543#comment-17597543
 ] 

Apache Spark commented on SPARK-39616:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37717

> Upgrade Breeze to 2.0
> -
>
> Key: SPARK-39616
> URL: https://issues.apache.org/jira/browse/SPARK-39616
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, ML
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39616) Upgrade Breeze to 2.0

2022-08-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597544#comment-17597544
 ] 

Apache Spark commented on SPARK-39616:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37717

> Upgrade Breeze to 2.0
> -
>
> Key: SPARK-39616
> URL: https://issues.apache.org/jira/browse/SPARK-39616
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, ML
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-29 Thread Pankaj Nagla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597539#comment-17597539
 ] 

Pankaj Nagla edited comment on SPARK-22588 at 8/30/22 5:39 AM:
---

Thank you for sharing the information. [Vlocity Salesforce 
Certification|https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/]
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.





was (Author: JIRAUSER295111):
Thank you for sharing the information. Vlocity Salesforce Certification 
enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.

[Salesforce marketing cloud administrator 
certification](https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/)


> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load this data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I use only ClientNum and Value_1 it works and 
> the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!
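
The NullPointerException comes from calling .toString on null row values inside the map. A minimal null-safe variant of the quoted attribute-building code, offered as a sketch rather than a tested fix (names reused from the quote):

{code}
// Only set attributes for non-null values so nullable columns
// (Value_2 .. Value_4) no longer trigger a NullPointerException.
def putString(map: java.util.HashMap[String, AttributeValue], name: String, v: Any): Unit = {
  if (v != null) {
    val attr = new AttributeValue()
    attr.setS(v.toString)
    map.put(name, attr)
  }
}

val ddbInsertFormattedRDD = df_rdd.map(a => {
  val ddbMap = new java.util.HashMap[String, AttributeValue]()
  val clientNum = new AttributeValue()
  clientNum.setN(a.get(0).toString)          // key column, assumed non-null
  ddbMap.put("ClientNum", clientNum)
  putString(ddbMap, "Value_1", a.get(1))
  putString(ddbMap, "Value_2", a.get(2))
  putString(ddbMap, "Value_3", a.get(3))
  putString(ddbMap, "Value_4", a.get(4))
  val item = new DynamoDBItemWritable()
  item.setItem(ddbMap)
  (new Text(""), item)
})
{code}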



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-29 Thread Pankaj Nagla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597539#comment-17597539
 ] 

Pankaj Nagla edited comment on SPARK-22588 at 8/30/22 5:38 AM:
---

Thank you for sharing the information. Vlocity Salesforce Certification 
enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.

[Salesforce marketing cloud administrator 
certification](https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/)



was (Author: JIRAUSER295111):
Thank you for sharing the information. [Vlocity Salesforce 
Certification|[https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/]]
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load this data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I use only ClientNum and Value_1 it works and 
> the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Assigned] (SPARK-40269) Randomize the orders of peer in BlockManagerDecommissioner

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40269:


Assignee: (was: Apache Spark)

> Randomize the orders of peer in BlockManagerDecommissioner
> --
>
> Key: SPARK-40269
> URL: https://issues.apache.org/jira/browse/SPARK-40269
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Randomize the orders of peer in BlockManagerDecommissioner to avoid migrating 
> data to the same set of nodes
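
The change itself is small; a hedged sketch of the idea, where the method call and variable names are illustrative rather than taken from the actual BlockManagerDecommissioner code:

{code}
import scala.util.Random

// Shuffle candidate peers before choosing migration targets, so
// concurrently decommissioning executors spread their blocks around
// instead of all migrating to the same first few peers.
val peers = blockManager.getPeers(false)      // illustrative call
val shuffledPeers = Random.shuffle(peers)
{code}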



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40269) Randomize the orders of peer in BlockManagerDecommissioner

2022-08-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597542#comment-17597542
 ] 

Apache Spark commented on SPARK-40269:
--

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/37716

> Randomize the orders of peer in BlockManagerDecommissioner
> --
>
> Key: SPARK-40269
> URL: https://issues.apache.org/jira/browse/SPARK-40269
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Randomize the orders of peer in BlockManagerDecommissioner to avoid migrating 
> data to the same set of nodes



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40269) Randomize the orders of peer in BlockManagerDecommissioner

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40269:


Assignee: Apache Spark

> Randomize the orders of peer in BlockManagerDecommissioner
> --
>
> Key: SPARK-40269
> URL: https://issues.apache.org/jira/browse/SPARK-40269
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> Randomize the orders of peer in BlockManagerDecommissioner to avoid migrating 
> data to the same set of nodes



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40270) Support compute.max_rows as None in DataFrame.style

2022-08-29 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-40270:


 Summary: Support compute.max_rows as None in DataFrame.style
 Key: SPARK-40270
 URL: https://issues.apache.org/jira/browse/SPARK-40270
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


{code}
import pyspark.pandas as ps
ps.set_option("compute.max_rows", None)
ps.get_option("compute.max_rows")
ps.range(1).style
{code}

fails as below:

{code}
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
pdf = self.head(max_results + 1)._to_internal_pandas()
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
{code}
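
The traceback shows max_results + 1 being evaluated even when the option is None. A hedged sketch of the kind of guard the fix needs in frame.py's style property (not the actual patch):

{code:python}
# Sketch only, not the actual patch:
max_results = get_option("compute.max_rows")
if max_results is not None:
    pdf = self.head(max_results + 1)._to_internal_pandas()
else:
    # compute.max_rows = None means "no limit": style the whole frame.
    pdf = self._to_internal_pandas()
{code}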



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40270) Make compute.max_rows as None working in DataFrame.style

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40270:
-
Fix Version/s: 3.1.4
   3.4.0
   3.3.1
   3.2.3

> Make compute.max_rows as None working in DataFrame.style
> 
>
> Key: SPARK-40270
> URL: https://issues.apache.org/jira/browse/SPARK-40270
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3
>
>
> {code}
> import pyspark.pandas as ps
> ps.set_option("compute.max_rows", None)
> ps.get_option("compute.max_rows")
> ps.range(1).style
> {code}
> fails as below:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
> pdf = self.head(max_results + 1)._to_internal_pandas()
> TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-29 Thread Pankaj Nagla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597539#comment-17597539
 ] 

Pankaj Nagla edited comment on SPARK-22588 at 8/30/22 5:34 AM:
---

Thank you for sharing the information. [Vlocity Salesforce 
Certification|[https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/]]
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.


was (Author: JIRAUSER295111):
Thank you for sharing the information.[Vlocity Salesforce 
Certification]([https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/])
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load this data into a DynamoDB table with ClientNum as the key, following:
> Analyze Your Data on Amazon DynamoDB with Apache Spark
> Using Spark SQL for ETL
> Here is the code I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I use only ClientNum and Value_1 it works and 
> the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark

[jira] [Updated] (SPARK-40270) Make compute.max_rows as None working in DataFrame.style

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40270:
-
Priority: Minor  (was: Major)

> Make compute.max_rows as None working in DataFrame.style
> 
>
> Key: SPARK-40270
> URL: https://issues.apache.org/jira/browse/SPARK-40270
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> import pyspark.pandas as ps
> ps.set_option("compute.max_rows", None)
> ps.get_option("compute.max_rows")
> ps.range(1).style
> {code}
> fails as below:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
> pdf = self.head(max_results + 1)._to_internal_pandas()
> TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40270) Make compute.max_rows as None working in DataFrame.style

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40270:
-
Summary: Make compute.max_rows as None working in DataFrame.style  (was: 
Support compute.max_rows as None in DataFrame.style)

> Make compute.max_rows as None working in DataFrame.style
> 
>
> Key: SPARK-40270
> URL: https://issues.apache.org/jira/browse/SPARK-40270
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> import pyspark.pandas as ps
> ps.set_option("compute.max_rows", None)
> ps.get_option("compute.max_rows")
> ps.range(1).style
> {code}
> fails as below:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
> pdf = self.head(max_results + 1)._to_internal_pandas()
> TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-29 Thread Pankaj Nagla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597539#comment-17597539
 ] 

Pankaj Nagla edited comment on SPARK-22588 at 8/30/22 5:32 AM:
---

Thank you for sharing the information.[Vlocity Salesforce 
Certification]([https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/])
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.


was (Author: JIRAUSER295111):
Thank you for sharing the information.[ [Vlocity Salesforce 
Certification]([https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/)|https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/]
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load the data into a DynamoDB table with ClientNum as the key, 
> following "Analyze Your Data on Amazon DynamoDB with Apache Spark" and 
> "Using Spark SQL for ETL".
> Here is the code I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1 it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-29 Thread Pankaj Nagla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597539#comment-17597539
 ] 

Pankaj Nagla edited comment on SPARK-22588 at 8/30/22 5:32 AM:
---

Thank you for sharing the information.[ [Vlocity Salesforce 
Certification]([https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/)|https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/]
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.


was (Author: JIRAUSER295111):
Thank you for sharing the information.[ [Vlocity Salesforce 
Certification][|#Vlocity Salesforce Certification] 
[https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/]]
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load the data into a DynamoDB table with ClientNum as the key, 
> following "Analyze Your Data on Amazon DynamoDB with Apache Spark" and 
> "Using Spark SQL for ETL".
> Here is the code I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1 it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-

[jira] [Created] (SPARK-40269) Randomize the orders of peer in BlockManagerDecommissioner

2022-08-29 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-40269:


 Summary: Randomize the orders of peer in BlockManagerDecommissioner
 Key: SPARK-40269
 URL: https://issues.apache.org/jira/browse/SPARK-40269
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 3.3.0
Reporter: Zhongwei Zhu


Randomize the order of peers in BlockManagerDecommissioner to avoid migrating 
data to the same set of nodes.
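
A minimal sketch of the idea; the names below are illustrative rather than the 
actual BlockManagerDecommissioner internals:

{code:scala}
import scala.util.Random

import org.apache.spark.storage.BlockManagerId

// Shuffle the peer list before picking migration targets so decommissioned
// blocks are spread across nodes instead of piling onto the same few peers.
def migrationOrder(peers: Seq[BlockManagerId]): Seq[BlockManagerId] =
  Random.shuffle(peers)
{code}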



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-29 Thread Pankaj Nagla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597539#comment-17597539
 ] 

Pankaj Nagla edited comment on SPARK-22588 at 8/30/22 5:31 AM:
---

Thank you for sharing the information.[ [Vlocity Salesforce 
Certification][|#Vlocity Salesforce Certification] 
[https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/]]
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.


was (Author: JIRAUSER295111):
Thank you for sharing the information.[# Vlocity Salesforce 
Certification]([https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/])
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load the data into a DynamoDB table with ClientNum as the key, 
> following "Analyze Your Data on Amazon DynamoDB with Apache Spark" and 
> "Using Spark SQL for ETL".
> Here is the code I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1 it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-29 Thread Pankaj Nagla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597539#comment-17597539
 ] 

Pankaj Nagla edited comment on SPARK-22588 at 8/30/22 5:29 AM:
---

Thank you for sharing the information.[# Vlocity Salesforce 
Certification]([https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/])
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.


was (Author: JIRAUSER295111):
Thank you for sharing the information. [Vlocity Salesforce 
Certification|[https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/]]
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load the data into a DynamoDB table with ClientNum as the key, 
> following "Analyze Your Data on Amazon DynamoDB with Apache Spark" and 
> "Using Spark SQL for ETL".
> Here is the code I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1 it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-29 Thread Pankaj Nagla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597539#comment-17597539
 ] 

Pankaj Nagla commented on SPARK-22588:
--

Thank you for sharing the information. [Vlocity Salesforce 
Certification|[https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/]]
 enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer 
assisting many tops and arising companies obtain their wanted progress 
utilizing its Omnichannel procedures.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load the data into a DynamoDB table with ClientNum as the key, 
> following "Analyze Your Data on Amazon DynamoDB with Apache Spark" and 
> "Using Spark SQL for ETL".
> Here is the code I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   #Import Data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   #Convert the dataframe to rdd
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   #Print first rdd
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1 it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

2022-08-29 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597537#comment-17597537
 ] 

BingKun Pan commented on SPARK-40056:
-

I will fix it.

> Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
> -
>
> Key: SPARK-40056
> URL: https://issues.apache.org/jira/browse/SPARK-40056
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39971) ANALYZE TABLE makes some queries run forever

2022-08-29 Thread Felipe (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe updated SPARK-39971:
---
Attachment: explainMode-cost.zip

> ANALYZE TABLE makes some queries run forever
> 
>
> Key: SPARK-39971
> URL: https://issues.apache.org/jira/browse/SPARK-39971
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.2.2
>Reporter: Felipe
>Priority: Major
> Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 
> 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-enabled.txt, 
> 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 
> 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, 
> explainMode-cost.zip
>
>
> I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without 
> the FOR ALL COLUMNS) some queries became really slow. For example query24 - 
> [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] takes 
> between 10~15min before running the ANALYZE TABLE.
> After running ANALYZE TABLE I waited 24h before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or 
> spark.sql.cbo.enabled it becomes fast again.
> It seems something in join reordering is not working well when we have table 
> stats, but not column stats.
> Rows Count:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39971) ANALYZE TABLE makes some queries run forever

2022-08-29 Thread Felipe (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597528#comment-17597528
 ] 

Felipe commented on SPARK-39971:


[~yumwang] I uploaded the plans

> ANALYZE TABLE makes some queries run forever
> 
>
> Key: SPARK-39971
> URL: https://issues.apache.org/jira/browse/SPARK-39971
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.2.2
>Reporter: Felipe
>Priority: Major
> Attachments: 1.1.BeforeAnalyzeTable-joinreorder-disabled.txt, 
> 1.2.BeforeAnalyzeTable-joinreorder-enabled.txt, 2.1.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-disabled.txt, 2.2.AfterAnalyzeTable WITHOUT 
> ForAllColumns-joinreorder-enabled.txt, 
> 3.1.AfterAnalyzeTableForAllColumns-joinreorder-disabled.txt, 
> 3.2.AfterAnalyzeTableForAllColumns-joinreorder-enabled.txt, 
> explainMode-cost.zip
>
>
> I'm using TPCDS to run benchmarks, and after running ANALYZE TABLE (without 
> the FOR ALL COLUMNS) some queries became really slow. For example query24 - 
> [https://raw.githubusercontent.com/Agirish/tpcds/master/query24.sql] takes 
> between 10~15min before running the ANALYZE TABLE.
> After running ANALYZE TABLE I waited 24h before cancelling the execution.
> If I disable spark.sql.cbo.joinReorder.enabled or 
> spark.sql.cbo.enabled it becomes fast again.
> It seems something in join reordering is not working well when we have table 
> stats, but not column stats.
> Rows Count:
> store_sales - 2879966589
> store_returns - 288009578
> store - 1002
> item - 30
> customer - 1200
> customer_address - 600



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34777) [UI] StagePage input size/records not show when records greater than zero

2022-08-29 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-34777.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35498
[https://github.com/apache/spark/pull/35498]

> [UI] StagePage input size/records not show when records greater than zero
> -
>
> Key: SPARK-34777
> URL: https://issues.apache.org/jira/browse/SPARK-34777
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.1
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: No input size records.png
>
>
> !No input size records.png|width=547,height=212!
> The `Input Size / Records` metric should be shown in the summary metrics table 
> and in the task columns when the input record count is greater than zero even 
> though the input bytes are zero. One example is a Spark Streaming job reading 
> from Kafka.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34777) [UI] StagePage input size/records not show when records greater than zero

2022-08-29 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-34777:
---

Assignee: Zhongwei Zhu

> [UI] StagePage input size/records not show when records greater than zero
> -
>
> Key: SPARK-34777
> URL: https://issues.apache.org/jira/browse/SPARK-34777
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.1
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Attachments: No input size records.png
>
>
> !No input size records.png|width=547,height=212!
> The `Input Size / Records` metric should be shown in the summary metrics table 
> and in the task columns when the input record count is greater than zero even 
> though the input bytes are zero. One example is a Spark Streaming job reading 
> from Kafka.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40056:


Assignee: (was: Apache Spark)

> Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
> -
>
> Key: SPARK-40056
> URL: https://issues.apache.org/jira/browse/SPARK-40056
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40056:


Assignee: Apache Spark

> Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
> -
>
> Key: SPARK-40056
> URL: https://issues.apache.org/jira/browse/SPARK-40056
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40056:
-
Fix Version/s: (was: 3.4.0)

> Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
> -
>
> Key: SPARK-40056
> URL: https://issues.apache.org/jira/browse/SPARK-40056
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-40056:
--
  Assignee: (was: BingKun Pan)

> Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
> -
>
> Key: SPARK-40056
> URL: https://issues.apache.org/jira/browse/SPARK-40056
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40056) Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

2022-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597512#comment-17597512
 ] 

Hyukjin Kwon commented on SPARK-40056:
--

Reverted in 
https://github.com/apache/spark/commit/80e65514a2a12c085ec982a5509a8417c3f8b42b

> Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
> -
>
> Key: SPARK-40056
> URL: https://issues.apache.org/jira/browse/SPARK-40056
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Trivial
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40221) Not able to format using scalafmt

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40221.
--
Fix Version/s: 3.4.0
 Assignee: Hyukjin Kwon
   Resolution: Fixed

> Not able to format using scalafmt
> -
>
> Key: SPARK-40221
> URL: https://issues.apache.org/jira/browse/SPARK-40221
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> I'm following the guidance in [https://spark.apache.org/developer-tools.html] 
> using 
> {code:java}
> ./dev/scalafmt{code}
> to format the code, but getting this error:
> {code:java}
> [ERROR] Failed to execute goal 
> org.antipathy:mvn-scalafmt_2.12:1.1.1640084764.9f463a9:format (default-cli) 
> on project spark-parent_2.12: Error formatting Scala files: missing setting 
> 'version'. To fix this problem, add the following line to .scalafmt.conf: 
> 'version=3.2.1'. -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40221) Not able to format using scalafmt

2022-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597511#comment-17597511
 ] 

Hyukjin Kwon commented on SPARK-40221:
--

Fixed in 
https://github.com/apache/spark/commit/80e65514a2a12c085ec982a5509a8417c3f8b42b

> Not able to format using scalafmt
> -
>
> Key: SPARK-40221
> URL: https://issues.apache.org/jira/browse/SPARK-40221
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Priority: Major
>
> I'm following the guidance in [https://spark.apache.org/developer-tools.html] 
> using 
> {code:java}
> ./dev/scalafmt{code}
> to format the code, but getting this error:
> {code:java}
> [ERROR] Failed to execute goal 
> org.antipathy:mvn-scalafmt_2.12:1.1.1640084764.9f463a9:format (default-cli) 
> on project spark-parent_2.12: Error formatting Scala files: missing setting 
> 'version'. To fix this problem, add the following line to .scalafmt.conf: 
> 'version=3.2.1'. -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40221) Not able to format using scalafmt

2022-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597509#comment-17597509
 ] 

Hyukjin Kwon commented on SPARK-40221:
--

Seems like SPARK-40056 caused this. I am reverting that patch for now.

> Not able to format using scalafmt
> -
>
> Key: SPARK-40221
> URL: https://issues.apache.org/jira/browse/SPARK-40221
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Priority: Major
>
> I'm following the guidance in [https://spark.apache.org/developer-tools.html] 
> using 
> {code:java}
> ./dev/scalafmt{code}
> to format the code, but getting this error:
> {code:java}
> [ERROR] Failed to execute goal 
> org.antipathy:mvn-scalafmt_2.12:1.1.1640084764.9f463a9:format (default-cli) 
> on project spark-parent_2.12: Error formatting Scala files: missing setting 
> 'version'. To fix this problem, add the following line to .scalafmt.conf: 
> 'version=3.2.1'. -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40268) Test decimal128 in UDF

2022-08-29 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-40268:
--

 Summary: Test decimal128 in UDF
 Key: SPARK-40268
 URL: https://issues.apache.org/jira/browse/SPARK-40268
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng


Write tests for decimal128 in UDF as input parameters and results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40253) Data read exception in orc format

2022-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597500#comment-17597500
 ] 

Hyukjin Kwon commented on SPARK-40253:
--

Is this still an issue in Spark 3.1+? Spark 2.4.x is EOL

>  Data read exception in orc format
> --
>
> Key: SPARK-40253
> URL: https://issues.apache.org/jira/browse/SPARK-40253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: os centos7
> spark 2.4.3
> hive 1.2.1
> hadoop 2.7.2
>Reporter: yihangqiao
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Caused by: java.io.EOFException: Read past end of RLE integer from compressed 
> stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 
> offset: 0 limit: 0
> When running batches with spark-sql using the CREATE TABLE xxx AS SELECT 
> syntax, the SELECT part uses a static value as a default column value (0.00 as 
> column_name) without specifying its data type. Because the data type is not 
> explicitly specified, the metadata for that field is missing from the written 
> ORC file (the write itself succeeds), but on read, as soon as the query 
> includes this column, the ORC file cannot be parsed and the following error 
> occurs:
>  
> {code:java}
> create table testgg as select 0.00 as gg;select * from testgg;Caused by: 
> java.io.IOException: Error reading file: 
> viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000
>        at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291)    
>    at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)
>        at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)
>        at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
>  Source)       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)       at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>        at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)       at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288)       at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)       at 
> org.apache.spark.scheduler.Task.run(Task.scala:121)       at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)   
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)  
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>        at java.lang.Thread.run(Thread.java:748)Caused by: 
> java.io.EOFException: Read past end of RLE integer from compressed stream 
> Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 
> limit: 0       at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398)
>        at 
> org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1205)
>        at 
> org.apac
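
A minimal sketch of a workaround suggested by the description: give the literal 
an explicit type so the ORC writer records complete column metadata. This is an 
inference from the report rather than a confirmed fix, and assumes a 
SparkSession named spark:

{code:scala}
// Casting the constant gives the column an explicit DECIMAL type in the ORC schema.
spark.sql("CREATE TABLE testgg AS SELECT CAST(0.00 AS DECIMAL(10, 2)) AS gg")
spark.sql("SELECT * FROM testgg").show()
{code}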

[jira] [Resolved] (SPARK-40258) Spark does not support pagination

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40258.
--
Resolution: Duplicate

> Spark does not support pagination
> -
>
> Key: SPARK-40258
> URL: https://issues.apache.org/jira/browse/SPARK-40258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: jacky
>Priority: Major
> Attachments: image-2022-08-29-20-15-51-309.png
>
>
> Spark does not support pagination when I use the SQL: select * from test11 
> limit 2,5
> !image-2022-08-29-20-15-51-309.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40258) Spark does not support pagination

2022-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597498#comment-17597498
 ] 

Hyukjin Kwon commented on SPARK-40258:
--

We support it now, see SPARK-28330
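
For reference, SPARK-28330 adds an ANSI OFFSET clause (available from Spark 3.4 
onward), so pagination can be written without the MySQL-style LIMIT 2,5. A 
small sketch using the table name from the report:

{code:scala}
// Skip 2 rows and return the next 5; add an ORDER BY in practice so that the
// pages are deterministic.
spark.sql("SELECT * FROM test11 LIMIT 5 OFFSET 2").show()
{code}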

> Spark does not support pagination
> -
>
> Key: SPARK-40258
> URL: https://issues.apache.org/jira/browse/SPARK-40258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: jacky
>Priority: Major
> Attachments: image-2022-08-29-20-15-51-309.png
>
>
> Spark does not support pagination when I use the SQL: select * from test11 
> limit 2,5
> !image-2022-08-29-20-15-51-309.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40267) Add description for ExecutorAllocationManager metrics

2022-08-29 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-40267:


 Summary: Add description for ExecutorAllocationManager metrics
 Key: SPARK-40267
 URL: https://issues.apache.org/jira/browse/SPARK-40267
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.3.0
Reporter: Zhongwei Zhu


For some ExecutorAllocationManager metrics it is hard to tell what they stand 
for from the metric name alone.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40257) Remove since usage in streaming/query.py and window.py

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40257:


Assignee: Hyukjin Kwon

> Remove since usage in streaming/query.py and window.py
> --
>
> Key: SPARK-40257
> URL: https://issues.apache.org/jira/browse/SPARK-40257
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> NumPy documentation style doesn't play well with our pyspark.since decorator 
> (it just adds the versionadded directive at the end, whereas the NumPy style 
> expects it in the corresponding section).
> We should remove these usages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40257) Remove since usage in streaming/query.py and window.py

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40257.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37707
[https://github.com/apache/spark/pull/37707]

> Remove since usage in streaming/query.py and window.py
> --
>
> Key: SPARK-40257
> URL: https://issues.apache.org/jira/browse/SPARK-40257
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>
> NumPy documentation style doesn't play well with our pyspark.since decorator 
> (it just adds the versionadded directive at the end, whereas the NumPy style 
> expects it in the corresponding section).
> We should remove these usages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40252) Replace `Stream.collect(Collectors.joining(delimiter))` with `StringJoiner` Api

2022-08-29 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40252:


Assignee: Yang Jie

> Replace `Stream.collect(Collectors.joining(delimiter))` with `StringJoiner` 
> Api
> ---
>
> Key: SPARK-40252
> URL: https://issues.apache.org/jira/browse/SPARK-40252
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Stream.collect(Collectors.joining(delimiter)) is slower than using the 
> StringJoiner API directly.
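
A small illustration of the replacement, written in Scala against the Java 
APIs; both variants produce the same string:

{code:scala}
import java.util.{Arrays, StringJoiner}
import java.util.stream.Collectors

val parts = Arrays.asList("a", "b", "c")

// Before: stream the list and collect with Collectors.joining.
val viaCollector = parts.stream().collect(Collectors.joining(", "))

// After: append directly to a StringJoiner, without the Stream/Collector machinery.
val joiner = new StringJoiner(", ")
parts.forEach(p => joiner.add(p))
val viaJoiner = joiner.toString

// viaCollector == viaJoiner == "a, b, c"
{code}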



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40252) Replace `Stream.collect(Collectors.joining(delimiter))` with `StringJoiner` Api

2022-08-29 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40252.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37701
[https://github.com/apache/spark/pull/37701]

> Replace `Stream.collect(Collectors.joining(delimiter))` with `StringJoiner` 
> Api
> ---
>
> Key: SPARK-40252
> URL: https://issues.apache.org/jira/browse/SPARK-40252
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> Stream.collect(Collectors.joining(delimiter)) is slower than using the 
> StringJoiner API directly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-29 Thread Prashant Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Singh updated SPARK-40266:
---
Description: 
h3. What changes were proposed in this pull request?

Corrected datatype output of command from Long to Int
h3. Why are the changes needed?

It shows incorrect datatype
h3. Does this PR introduce _any_ user-facing change?

Yes. It proposes changes in documentation for console output.
[!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
h3. How was this patch tested?

Manually checked the changes by previewing markdown output. I tested output by 
installing spark 3.1.2 locally and running commands present in quick start docs

 

  was:


### What changes were proposed in this pull request?
Corrected datatype output of command from Long to Int



### Why are the changes needed?
It shows incorrect datatype



### Does this PR introduce _any_ user-facing change?
Yes. It proposes changes in documentation for console output.
![image](https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png)




### How was this patch tested?
Manually checked the changes by previewing markdown output. I tested output by 
installing spark 3.1.2 locally and running commands present in quick start docs



> Corrected  console output in quick-start -  Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h3. What changes were proposed in this pull request?
> Corrected datatype output of command from Long to Int
> h3. Why are the changes needed?
> It shows incorrect datatype
> h3. Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes in documentation for console output.
> [!https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png!|https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png]
> h3. How was this patch tested?
> Manually checked the changes by previewing markdown output. I tested output 
> by installing spark 3.1.2 locally and running commands present in quick start 
> docs
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-29 Thread Prashant Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Singh updated SPARK-40266:
---
Description: 


### What changes were proposed in this pull request?
Corrected datatype output of command from Long to Int



### Why are the changes needed?
It shows incorrect datatype



### Does this PR introduce _any_ user-facing change?
Yes. It proposes changes in documentation for console output.
![image](https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png)




### How was this patch tested?
Manually checked the changes by previewing markdown output. I tested output by 
installing spark 3.1.2 locally and running commands present in quick start docs


> Corrected  console output in quick-start -  Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> 
> ### What changes were proposed in this pull request?
> Corrected datatype output of command from Long to Int
> 
> ### Why are the changes needed?
> It shows incorrect datatype
> 
> ### Does this PR introduce _any_ user-facing change?
> Yes. It proposes changes in documentation for console output.
> ![image](https://user-images.githubusercontent.com/12110063/187332894-af2b6b43-7ff3-4062-8370-de4b477f178b.png)
> 
> ### How was this patch tested?
> Manually checked the changes by previewing markdown output. I tested output 
> by installing spark 3.1.2 locally and running commands present in quick start 
> docs
> 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40266:


Assignee: Apache Spark

> Corrected  console output in quick-start -  Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40266:


Assignee: (was: Apache Spark)

> Corrected  console output in quick-start -  Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597469#comment-17597469
 ] 

Apache Spark commented on SPARK-40266:
--

User 'pacificlion' has created a pull request for this issue:
https://github.com/apache/spark/pull/37715

> Corrected  console output in quick-start -  Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-29 Thread Prashant Singh (Jira)
Prashant Singh created SPARK-40266:
--

 Summary: Corrected  console output in quick-start -  Datatype 
Integer instead of Long
 Key: SPARK-40266
 URL: https://issues.apache.org/jira/browse/SPARK-40266
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.1.2
 Environment: spark 3.1.2 

Windows 10 (OS Build 19044.1889)
Reporter: Prashant Singh






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40266) Corrected console output in quick-start - Datatype Integer instead of Long

2022-08-29 Thread Prashant Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Singh updated SPARK-40266:
---
Priority: Minor  (was: Major)

> Corrected  console output in quick-start -  Datatype Integer instead of Long
> 
>
> Key: SPARK-40266
> URL: https://issues.apache.org/jira/browse/SPARK-40266
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: spark 3.1.2 
> Windows 10 (OS Build 19044.1889)
>Reporter: Prashant Singh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40160) Make pyspark.broadcast examples self-contained

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40160.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37629
[https://github.com/apache/spark/pull/37629]

> Make pyspark.broadcast examples self-contained
> --
>
> Key: SPARK-40160
> URL: https://issues.apache.org/jira/browse/SPARK-40160
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40160) Make pyspark.broadcast examples self-contained

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40160:


Assignee: Qian Sun

> Make pyspark.broadcast examples self-contained
> --
>
> Key: SPARK-40160
> URL: https://issues.apache.org/jira/browse/SPARK-40160
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40012.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37702
[https://github.com/apache/spark/pull/37702]

> Make pyspark.sql.dataframe examples self-contained
> --
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: William Zijie Zhang
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40012) Make pyspark.sql.dataframe examples self-contained

2022-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40012:


Assignee: Hyukjin Kwon

> Make pyspark.sql.dataframe examples self-contained
> --
>
> Key: SPARK-40012
> URL: https://issues.apache.org/jira/browse/SPARK-40012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: William Zijie Zhang
>Assignee: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40265) Fix the inconsistent behavior for Index.intersection.

2022-08-29 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597439#comment-17597439
 ] 

Haejoon Lee commented on SPARK-40265:
-

I'm working on it

> Fix the inconsistent behavior for Index.intersection.
> -
>
> Key: SPARK-40265
> URL: https://issues.apache.org/jira/browse/SPARK-40265
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There is inconsistent behavior on Index.intersection for pandas API on Spark 
> as below:
> {code:python}
> >>> pidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx
> Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
> >>> psidx.intersection(other).sort_values()
> MultiIndex([], )
> >>> pidx.intersection(other).sort_values()
> Traceback (most recent call last):
> ...
> ValueError: Names should be list-like for a MultiIndex
> {code}
> We should fix it to follow pandas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40265) Fix the inconsistent behavior for Index.intersection.

2022-08-29 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40265:
---

 Summary: Fix the inconsistent behavior for Index.intersection.
 Key: SPARK-40265
 URL: https://issues.apache.org/jira/browse/SPARK-40265
 Project: Spark
  Issue Type: Test
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


There is inconsistent behavior in Index.intersection for the pandas API on Spark, 
as shown below:


{code:python}
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection(other).sort_values()
MultiIndex([], )
>>> pidx.intersection(other).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
{code}

We should fix it to follow pandas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40262) Expensive UDF evaluation pushed down past a join leads to performance issues

2022-08-29 Thread Shardul Mahadik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597436#comment-17597436
 ] 

Shardul Mahadik commented on SPARK-40262:
-

cc: [~cloud_fan] [~xkrogen] [~mridulm80] 

> Expensive UDF evaluation pushed down past a join leads to performance issues 
> -
>
> Key: SPARK-40262
> URL: https://issues.apache.org/jira/browse/SPARK-40262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> Consider a Spark job with an expensive UDF which looks like follows:
> {code:scala}
> val expensive_udf = spark.udf.register("expensive_udf", (i: Int) => Option(i))
> spark.range(10).write.format("orc").save("/tmp/orc")
> val df = spark.read.format("orc").load("/tmp/orc").as("a")
> .join(spark.range(10).as("b"), "id")
> .withColumn("udf_op", expensive_udf($"a.id"))
> .join(spark.range(10).as("c"), $"udf_op" === $"c.id")
> {code}
> This creates a physical plan as follows:
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- BroadcastHashJoin [cast(udf_op#338 as bigint)], [id#344L], Inner, 
> BuildRight, false
>:- Project [id#330L, if (isnull(cast(id#330L as int))) null else 
> expensive_udf(knownnotnull(cast(id#330L as int))) AS udf_op#338]
>:  +- BroadcastHashJoin [id#330L], [id#332L], Inner, BuildRight, false
>: :- Filter ((isnotnull(id#330L) AND isnotnull(cast(id#330L as int))) 
> AND isnotnull(expensive_udf(knownnotnull(cast(id#330L as int)
>: :  +- FileScan orc [id#330L] Batched: true, DataFilters: 
> [isnotnull(id#330L), isnotnull(cast(id#330L as int)), 
> isnotnull(expensive_udf(knownnotnull(cast(i..., Format: ORC, Location: 
> InMemoryFileIndex(1 paths)[file:/tmp/orc], PartitionFilters: [], 
> PushedFilters: [IsNotNull(id)], ReadSchema: struct
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [plan_id=416]
>:+- Range (0, 10, step=1, splits=16)
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]),false), [plan_id=420]
>   +- Range (0, 10, step=1, splits=16)
> {code}
> In this case, the expensive UDF call is duplicated thrice. Since the UDF 
> output is used in a future join, `InferFiltersFromConstraints` adds an `IS 
> NOT NULL` filter on the UDF output. But the pushdown rules duplicate this UDF 
> call and push the UDF past a previous join. The duplication behaviour [is 
> documented|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L196]
>  and in itself is not a huge issue. But given a highly restrictive join, the 
> UDF gets evaluated on many orders of magnitude more rows than it should have 
> slowing down the job.
> Can we avoid this duplication of UDF calls? In SPARK-37392, we made a 
> [similar change|https://github.com/apache/spark/pull/34823/files] where we 
> decided to only add inferred filters if the input is an attribute. Should we 
> use a similar strategy for `InferFiltersFromConstraints`?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40261) DirectTaskResult meta should not be counted into result size

2022-08-29 Thread Ziqi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziqi Liu updated SPARK-40261:
-
Summary: DirectTaskResult meta should not be counted into result size  
(was: TaskResult meta should not be counted into result size)

> DirectTaskResult meta should not be counted into result size
> 
>
> Key: SPARK-40261
> URL: https://issues.apache.org/jira/browse/SPARK-40261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Priority: Major
>
> This issue exists for a long time (since 
> [https://github.com/liuzqt/spark/commit/c33e55008239f417764d589c1366371d18331686)]
> when calculating whether driver fetching result exceed 
> `spark.driver.maxResultSize` limit, the whole serialized result task size is 
> taken into account, including task metadata overhead 
> size([accumUpdates|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L41])
>  as well. However, the metadata should not be counted because they will be 
> discarded by the driver immediately after being processed.
> This will lead to exception when running jobs with tons of task but actually 
> return small results.
> Therefore we should only count 
> `[valueBytes|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L40]`
>  when calculating result size limit.
> cc [~joshrosen] 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40235) Use interruptible lock instead of synchronized in Executor.updateDependencies()

2022-08-29 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-40235.

Fix Version/s: 3.4.0
   Resolution: Fixed

Fixed by [https://github.com/apache/spark/pull/37681]

> Use interruptible lock instead of synchronized in 
> Executor.updateDependencies()
> ---
>
> Key: SPARK-40235
> URL: https://issues.apache.org/jira/browse/SPARK-40235
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Fix For: 3.4.0
>
>
> This patch modifies the synchronization in {{Executor.updateDependencies()}} 
> in order to allow tasks to be interrupted while they are blocked and waiting 
> on other tasks to finish downloading dependencies.
> This synchronization was added years ago in 
> [mesos/spark@{{{}7b9e96c{}}}|https://github.com/mesos/spark/commit/7b9e96c99206c0679d9925e0161fde738a5c7c3a]
>  in order to prevent concurrently-launching tasks from performing concurrent 
> dependency updates. If one task is downloading dependencies, all other 
> newly-launched tasks will block until the original dependency download is 
> complete.
> Let's say that a Spark task launches, becomes blocked on an 
> {{updateDependencies()}} call, then is cancelled while it is blocked. 
> Although Spark will send a Thread.interrupt() to the cancelled task, the task 
> will continue waiting because threads blocked in a {{synchronized}} block won't 
> throw an InterruptedException in response to the interrupt. As a result, the 
> blocked thread will continue to wait until the other thread exits the 
> synchronized block. 
> In the wild, we saw a case where this happened and the thread remained 
> blocked for over 1 minute, causing the TaskReaper to kick in and 
> self-destruct the executor.
> This PR aims to fix this problem by replacing the {{synchronized}} with a 
> ReentrantLock, which has a {{lockInterruptibly}} method.
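
For reference, a minimal sketch of the pattern described above, with made-up names and simplified logic rather than the actual Executor code: a lock acquired via {{lockInterruptibly}} responds to Thread.interrupt(), whereas a thread waiting to enter a {{synchronized}} block does not.

{code:scala}
import java.util.concurrent.locks.ReentrantLock

// Simplified illustration only; the names below are invented for the sketch.
val updateLock = new ReentrantLock()

def updateDependenciesInterruptibly(doUpdate: () => Unit): Unit = {
  // Unlike entering a synchronized block, lockInterruptibly() throws
  // InterruptedException if the waiting thread is interrupted (for example
  // when the task is cancelled), so the blocked thread can exit promptly.
  updateLock.lockInterruptibly()
  try {
    doUpdate()
  } finally {
    updateLock.unlock()
  }
}
{code}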



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40264) Add helper function for DL model inference in pyspark.ml.functions

2022-08-29 Thread Lee Yang (Jira)
Lee Yang created SPARK-40264:


 Summary: Add helper function for DL model inference in 
pyspark.ml.functions
 Key: SPARK-40264
 URL: https://issues.apache.org/jira/browse/SPARK-40264
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 3.2.2
Reporter: Lee Yang


Add a helper function to create a pandas_udf for inference on a given DL model, 
where the user provides a predict function that is responsible for loading the 
model and inferring on a batch of numpy inputs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40263) Use interruptible lock instead of synchronized in TransportClientFactory.createClient()

2022-08-29 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-40263:
--

 Summary: Use interruptible lock instead of synchronized in 
TransportClientFactory.createClient()
 Key: SPARK-40263
 URL: https://issues.apache.org/jira/browse/SPARK-40263
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Josh Rosen


Followup to SPARK-40235: we should apply a similar fix in 
TransportClientFactory.createClient



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40262) Expensive UDF evaluation pushed down past a join leads to performance issues

2022-08-29 Thread Shardul Mahadik (Jira)
Shardul Mahadik created SPARK-40262:
---

 Summary: Expensive UDF evaluation pushed down past a join leads to 
performance issues 
 Key: SPARK-40262
 URL: https://issues.apache.org/jira/browse/SPARK-40262
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Shardul Mahadik


Consider a Spark job with an expensive UDF that looks as follows:
{code:scala}
val expensive_udf = spark.udf.register("expensive_udf", (i: Int) => Option(i))

spark.range(10).write.format("orc").save("/tmp/orc")

val df = spark.read.format("orc").load("/tmp/orc").as("a")
.join(spark.range(10).as("b"), "id")
.withColumn("udf_op", expensive_udf($"a.id"))
.join(spark.range(10).as("c"), $"udf_op" === $"c.id")
{code}
This creates a physical plan as follows:
{code:java}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [cast(udf_op#338 as bigint)], [id#344L], Inner, 
BuildRight, false
   :- Project [id#330L, if (isnull(cast(id#330L as int))) null else 
expensive_udf(knownnotnull(cast(id#330L as int))) AS udf_op#338]
   :  +- BroadcastHashJoin [id#330L], [id#332L], Inner, BuildRight, false
   : :- Filter ((isnotnull(id#330L) AND isnotnull(cast(id#330L as int))) 
AND isnotnull(expensive_udf(knownnotnull(cast(id#330L as int)
   : :  +- FileScan orc [id#330L] Batched: true, DataFilters: 
[isnotnull(id#330L), isnotnull(cast(id#330L as int)), 
isnotnull(expensive_udf(knownnotnull(cast(i..., Format: ORC, Location: 
InMemoryFileIndex(1 paths)[file:/tmp/orc], PartitionFilters: [], PushedFilters: 
[IsNotNull(id)], ReadSchema: struct
   : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
false]),false), [plan_id=416]
   :+- Range (0, 10, step=1, splits=16)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
false]),false), [plan_id=420]
  +- Range (0, 10, step=1, splits=16)
{code}
In this case, the expensive UDF call is duplicated thrice. Since the UDF output 
is used in a future join, `InferFiltersFromConstraints` adds an `IS NOT NULL` 
filter on the UDF output. But the pushdown rules duplicate this UDF call and 
push the UDF past a previous join. The duplication behaviour [is 
documented|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L196]
 and in itself is not a huge issue. But given a highly restrictive join, the 
UDF gets evaluated on many orders of magnitude more rows than it should, 
slowing down the job.

Can we avoid this duplication of UDF calls? In SPARK-37392, we made a [similar 
change|https://github.com/apache/spark/pull/34823/files] where we decided to 
only add inferred filters if the input is an attribute. Should we use a similar 
strategy for `InferFiltersFromConstraints`?
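
As a possible stop-gap for affected jobs (an assumption on my part, not something proposed in this ticket), marking the UDF as non-deterministic should keep the optimizer from duplicating it and pushing it past the join, at the cost of losing other optimizations around that column:

{code:scala}
// Workaround sketch, assuming the same job as above; this does not fix the
// optimizer rule itself. asNondeterministic() is expected to stop Spark from
// re-ordering or duplicating the UDF call, so it runs only on the joined rows.
import org.apache.spark.sql.functions.udf
import spark.implicits._

val expensiveUdf = udf((i: Int) => Option(i)).asNondeterministic()

val df = spark.read.format("orc").load("/tmp/orc").as("a")
  .join(spark.range(10).as("b"), "id")
  .withColumn("udf_op", expensiveUdf($"a.id"))
  .join(spark.range(10).as("c"), $"udf_op" === $"c.id")
{code}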



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40261) TaskResult meta should not be counted into result size

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40261:


Assignee: (was: Apache Spark)

> TaskResult meta should not be counted into result size
> --
>
> Key: SPARK-40261
> URL: https://issues.apache.org/jira/browse/SPARK-40261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Priority: Major
>
> This issue exists for a long time (since 
> [https://github.com/liuzqt/spark/commit/c33e55008239f417764d589c1366371d18331686)]
> when calculating whether driver fetching result exceed 
> `spark.driver.maxResultSize` limit, the whole serialized result task size is 
> taken into account, including task metadata overhead 
> size([accumUpdates|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L41])
>  as well. However, the metadata should not be counted because they will be 
> discarded by the driver immediately after being processed.
> This will lead to exception when running jobs with tons of task but actually 
> return small results.
> Therefore we should only count 
> `[valueBytes|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L40]`
>  when calculating result size limit.
> cc [~joshrosen] 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40261) TaskResult meta should not be counted into result size

2022-08-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597399#comment-17597399
 ] 

Apache Spark commented on SPARK-40261:
--

User 'liuzqt' has created a pull request for this issue:
https://github.com/apache/spark/pull/37713

> TaskResult meta should not be counted into result size
> --
>
> Key: SPARK-40261
> URL: https://issues.apache.org/jira/browse/SPARK-40261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Priority: Major
>
> This issue exists for a long time (since 
> [https://github.com/liuzqt/spark/commit/c33e55008239f417764d589c1366371d18331686)]
> when calculating whether driver fetching result exceed 
> `spark.driver.maxResultSize` limit, the whole serialized result task size is 
> taken into account, including task metadata overhead 
> size([accumUpdates|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L41])
>  as well. However, the metadata should not be counted because they will be 
> discarded by the driver immediately after being processed.
> This will lead to exception when running jobs with tons of task but actually 
> return small results.
> Therefore we should only count 
> `[valueBytes|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L40]`
>  when calculating result size limit.
> cc [~joshrosen] 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40261) TaskResult meta should not be counted into result size

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40261:


Assignee: Apache Spark

> TaskResult meta should not be counted into result size
> --
>
> Key: SPARK-40261
> URL: https://issues.apache.org/jira/browse/SPARK-40261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Assignee: Apache Spark
>Priority: Major
>
> This issue exists for a long time (since 
> [https://github.com/liuzqt/spark/commit/c33e55008239f417764d589c1366371d18331686)]
> when calculating whether driver fetching result exceed 
> `spark.driver.maxResultSize` limit, the whole serialized result task size is 
> taken into account, including task metadata overhead 
> size([accumUpdates|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L41])
>  as well. However, the metadata should not be counted because they will be 
> discarded by the driver immediately after being processed.
> This will lead to exception when running jobs with tons of task but actually 
> return small results.
> Therefore we should only count 
> `[valueBytes|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L40]`
>  when calculating result size limit.
> cc [~joshrosen] 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38648) SPIP: Simplified API for DL Inferencing

2022-08-29 Thread Lee Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lee Yang resolved SPARK-38648.
--
Resolution: Won't Fix

> SPIP: Simplified API for DL Inferencing
> ---
>
> Key: SPARK-38648
> URL: https://issues.apache.org/jira/browse/SPARK-38648
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Lee Yang
>Priority: Minor
>
> h1. Background and Motivation
> The deployment of deep learning (DL) models to Spark clusters can be a point 
> of friction today.  DL practitioners often aren't well-versed with Spark, and 
> Spark experts often aren't well-versed with the fast-changing DL frameworks.  
> Currently, the deployment of trained DL models is done in a fairly ad-hoc 
> manner, with each model integration usually requiring significant effort.
> To simplify this process, we propose adding an integration layer for each 
> major DL framework that can introspect their respective saved models to 
> more-easily integrate these models into Spark applications.  You can find a 
> detailed proposal here: 
> [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0]
> h1. Goals
>  - Simplify the deployment of pre-trained single-node DL models to Spark 
> inference applications.
>  - Follow pandas_udf for simple inference use-cases.
>  - Follow Spark ML Pipelines APIs for transfer-learning use-cases.
>  - Enable integrations with popular third-party DL frameworks like 
> TensorFlow, PyTorch, and Huggingface.
>  - Focus on PySpark, since most of the DL frameworks use Python.
>  - Take advantage of built-in Spark features like GPU scheduling and Arrow 
> integration.
>  - Enable inference on both CPU and GPU.
> h1. Non-goals
>  - DL model training.
>  - Inference w/ distributed models, i.e. "model parallel" inference.
> h1. Target Personas
>  - Data scientists who need to deploy DL models on Spark.
>  - Developers who need to deploy DL models on Spark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38648) SPIP: Simplified API for DL Inferencing

2022-08-29 Thread Lee Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597397#comment-17597397
 ] 

Lee Yang commented on SPARK-38648:
--

Agreed, closing.

> SPIP: Simplified API for DL Inferencing
> ---
>
> Key: SPARK-38648
> URL: https://issues.apache.org/jira/browse/SPARK-38648
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Lee Yang
>Priority: Minor
>
> h1. Background and Motivation
> The deployment of deep learning (DL) models to Spark clusters can be a point 
> of friction today.  DL practitioners often aren't well-versed with Spark, and 
> Spark experts often aren't well-versed with the fast-changing DL frameworks.  
> Currently, the deployment of trained DL models is done in a fairly ad-hoc 
> manner, with each model integration usually requiring significant effort.
> To simplify this process, we propose adding an integration layer for each 
> major DL framework that can introspect their respective saved models to 
> more-easily integrate these models into Spark applications.  You can find a 
> detailed proposal here: 
> [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0]
> h1. Goals
>  - Simplify the deployment of pre-trained single-node DL models to Spark 
> inference applications.
>  - Follow pandas_udf for simple inference use-cases.
>  - Follow Spark ML Pipelines APIs for transfer-learning use-cases.
>  - Enable integrations with popular third-party DL frameworks like 
> TensorFlow, PyTorch, and Huggingface.
>  - Focus on PySpark, since most of the DL frameworks use Python.
>  - Take advantage of built-in Spark features like GPU scheduling and Arrow 
> integration.
>  - Enable inference on both CPU and GPU.
> h1. Non-goals
>  - DL model training.
>  - Inference w/ distributed models, i.e. "model parallel" inference.
> h1. Target Personas
>  - Data scientists who need to deploy DL models on Spark.
>  - Developers who need to deploy DL models on Spark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40261) TaskResult meta should not be counted into result size

2022-08-29 Thread Ziqi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziqi Liu updated SPARK-40261:
-
Description: 
This issue has existed for a long time (since 
[https://github.com/liuzqt/spark/commit/c33e55008239f417764d589c1366371d18331686]).

When calculating whether the result fetched by the driver exceeds the 
`spark.driver.maxResultSize` limit, the whole serialized task result size is 
taken into account, including the task metadata overhead size 
([accumUpdates|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L41])
 as well. However, the metadata should not be counted, because it is 
discarded by the driver immediately after being processed.

This can lead to an exception when running jobs with a huge number of tasks 
that actually return small results.

Therefore we should only count 
`[valueBytes|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L40]`
 when checking against the result size limit.

cc [~joshrosen] 

 

 

  was:
This issue exists for a long time (since 
[https://github.com/liuzqt/spark/commit/c33e55008239f417764d589c1366371d18331686)]

when calculating whether driver fetching result exceed 
`spark.driver.maxResultSize` limit, the whole serialized result task size is 
taken into account, including task 
metadata([accumUpdates|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L41])
 as well. However, the metadata should not be counted because they will be 
discarded by the driver immediately after being processed.

This will lead to exception when running jobs with tons of task but actually 
return small results.

Therefore we should only count 
`[valueBytes|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L40]`
 when calculating result size limit.

 

 


> TaskResult meta should not be counted into result size
> --
>
> Key: SPARK-40261
> URL: https://issues.apache.org/jira/browse/SPARK-40261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Priority: Major
>
> This issue exists for a long time (since 
> [https://github.com/liuzqt/spark/commit/c33e55008239f417764d589c1366371d18331686)]
> when calculating whether driver fetching result exceed 
> `spark.driver.maxResultSize` limit, the whole serialized result task size is 
> taken into account, including task metadata overhead 
> size([accumUpdates|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L41])
>  as well. However, the metadata should not be counted because they will be 
> discarded by the driver immediately after being processed.
> This will lead to exception when running jobs with tons of task but actually 
> return small results.
> Therefore we should only count 
> `[valueBytes|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L40]`
>  when calculating result size limit.
> cc [~joshrosen] 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40261) TaskResult meta should not be counted into result size

2022-08-29 Thread Ziqi Liu (Jira)
Ziqi Liu created SPARK-40261:


 Summary: TaskResult meta should not be counted into result size
 Key: SPARK-40261
 URL: https://issues.apache.org/jira/browse/SPARK-40261
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Ziqi Liu


This issue has existed for a long time (since 
[https://github.com/liuzqt/spark/commit/c33e55008239f417764d589c1366371d18331686]).

When calculating whether the result fetched by the driver exceeds the 
`spark.driver.maxResultSize` limit, the whole serialized task result size is 
taken into account, including the task metadata 
([accumUpdates|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L41])
 as well. However, the metadata should not be counted, because it is 
discarded by the driver immediately after being processed.

This can lead to an exception when running jobs with a huge number of tasks 
that actually return small results.

Therefore we should only count 
`[valueBytes|https://github.com/apache/spark/blob/c95ed826e23fdec6e1a779cfebde7b3364594fb5/core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala#L40]`
 when checking against the result size limit.
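
To make the failure mode concrete, here is a hypothetical repro sketch (the limit, task count, and app name are made-up values): many tasks returning tiny results, where the per-task metadata counted against the limit can dominate the actual data.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical repro sketch; numbers are chosen only for illustration.
val spark = SparkSession.builder()
  .appName("max-result-size-overhead-demo")
  .config("spark.driver.maxResultSize", "1m")  // deliberately small limit
  .getOrCreate()

// Many tasks, each returning a tiny result. Because the per-task accumulator
// metadata is currently counted against the limit, a job like this can fail
// the maxResultSize check even though the data actually returned is small.
val result = spark.sparkContext
  .parallelize(1 to 100000, numSlices = 20000)
  .map(_ => 1)
  .collect()
{code}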

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40254) Upgrade netty from 4.1.77 to 4.1.80

2022-08-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40254:
-

Assignee: Yang Jie

> Upgrade netty from 4.1.77 to 4.1.80
> ---
>
> Key: SPARK-40254
> URL: https://issues.apache.org/jira/browse/SPARK-40254
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> * [https://netty.io/news/2022/06/14/4-1-78-Final.html]
>  * [https://netty.io/news/2022/07/11/4-1-79-Final.html]
>  * https://netty.io/news/2022/08/26/4-1-80-Final.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40254) Upgrade netty from 4.1.77 to 4.1.80

2022-08-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40254.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37703
[https://github.com/apache/spark/pull/37703]

> Upgrade netty from 4.1.77 to 4.1.80
> ---
>
> Key: SPARK-40254
> URL: https://issues.apache.org/jira/browse/SPARK-40254
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> * [https://netty.io/news/2022/06/14/4-1-78-Final.html]
>  * [https://netty.io/news/2022/07/11/4-1-79-Final.html]
>  * https://netty.io/news/2022/08/26/4-1-80-Final.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position

2022-08-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40260:
-
Description: 
Migrate the following errors in QueryCompilationErrors:
* groupByPositionRefersToAggregateFunctionError
* groupByPositionRangeError

to use error classes. Throw an implementation of SparkThrowable. Also write a 
test for every error in QueryCompilationErrorsSuite.

  was:
Migrate the following errors in QueryCompilationErrors:
* groupingIDMismatchError
* groupingColInvalidError
* groupingSizeTooLargeError
* groupingMustWithGroupingSetsOrCubeOrRollupError

onto use error classes. Throw an implementation of SparkThrowable. Also write a 
test per every error in QueryCompilationErrorsSuite.


> Use error classes in the compilation errors of GROUP BY a position
> --
>
> Key: SPARK-40260
> URL: https://issues.apache.org/jira/browse/SPARK-40260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * groupByPositionRefersToAggregateFunctionError
> * groupByPositionRangeError
> onto use error classes. Throw an implementation of SparkThrowable. Also write 
> a test per every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position

2022-08-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40260:


Assignee: Max Gekk  (was: Apache Spark)

> Use error classes in the compilation errors of GROUP BY a position
> --
>
> Key: SPARK-40260
> URL: https://issues.apache.org/jira/browse/SPARK-40260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> Migrate the following errors in QueryCompilationErrors:
> * groupingIDMismatchError
> * groupingColInvalidError
> * groupingSizeTooLargeError
> * groupingMustWithGroupingSetsOrCubeOrRollupError
> onto use error classes. Throw an implementation of SparkThrowable. Also write 
> a test per every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position

2022-08-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40260:
-
Fix Version/s: (was: 3.3.0)

> Use error classes in the compilation errors of GROUP BY a position
> --
>
> Key: SPARK-40260
> URL: https://issues.apache.org/jira/browse/SPARK-40260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * groupingIDMismatchError
> * groupingColInvalidError
> * groupingSizeTooLargeError
> * groupingMustWithGroupingSetsOrCubeOrRollupError
> onto use error classes. Throw an implementation of SparkThrowable. Also write 
> a test per every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position

2022-08-29 Thread Max Gekk (Jira)
Max Gekk created SPARK-40260:


 Summary: Use error classes in the compilation errors of GROUP BY a 
position
 Key: SPARK-40260
 URL: https://issues.apache.org/jira/browse/SPARK-40260
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk
Assignee: Apache Spark
 Fix For: 3.3.0


Migrate the following errors in QueryCompilationErrors:
* groupingIDMismatchError
* groupingColInvalidError
* groupingSizeTooLargeError
* groupingMustWithGroupingSetsOrCubeOrRollupError

to use error classes. Throw an implementation of SparkThrowable. Also write a 
test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40260) Use error classes in the compilation errors of GROUP BY a position

2022-08-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40260:
-
Affects Version/s: 3.4.0
   (was: 3.3.0)

> Use error classes in the compilation errors of GROUP BY a position
> --
>
> Key: SPARK-40260
> URL: https://issues.apache.org/jira/browse/SPARK-40260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * groupingIDMismatchError
> * groupingColInvalidError
> * groupingSizeTooLargeError
> * groupingMustWithGroupingSetsOrCubeOrRollupError
> onto use error classes. Throw an implementation of SparkThrowable. Also write 
> a test per every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40259) Support Parquet DSv2 in subquery plan merge

2022-08-29 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-40259:
---
Description: We could improve SPARK-34079 with DSv2 support.  (was: We 
could improve SPARK-34079 to support DSv2.)

> Support Parquet DSv2 in subquery plan merge
> ---
>
> Key: SPARK-40259
> URL: https://issues.apache.org/jira/browse/SPARK-40259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>
> We could improve SPARK-34079 with DSv2 support.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40259) Support Parquet DSv2 in subquery plan merge

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40259:


Assignee: (was: Apache Spark)

> Support Parquet DSv2 in subquery plan merge
> ---
>
> Key: SPARK-40259
> URL: https://issues.apache.org/jira/browse/SPARK-40259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>
> We could improve SPARK-34079 to support DSv2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40259) Support Parquet DSv2 in subquery plan merge

2022-08-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597296#comment-17597296
 ] 

Apache Spark commented on SPARK-40259:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/37711

> Support Parquet DSv2 in subquery plan merge
> ---
>
> Key: SPARK-40259
> URL: https://issues.apache.org/jira/browse/SPARK-40259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>
> We could improve SPARK-34079 to support DSv2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40259) Support Parquet DSv2 in subquery plan merge

2022-08-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40259:


Assignee: Apache Spark

> Support Parquet DSv2 in subquery plan merge
> ---
>
> Key: SPARK-40259
> URL: https://issues.apache.org/jira/browse/SPARK-40259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Major
>
> We could improve SPARK-34079 to support DSv2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40259) Support Parquet DSv2 in subquery plan merge

2022-08-29 Thread Peter Toth (Jira)
Peter Toth created SPARK-40259:
--

 Summary: Support Parquet DSv2 in subquery plan merge
 Key: SPARK-40259
 URL: https://issues.apache.org/jira/browse/SPARK-40259
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth


We could improve SPARK-34079 to support DSv2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40245) Fix FileScan equality check when partition or data filter columns are not read

2022-08-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40245.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37693
[https://github.com/apache/spark/pull/37693]

> Fix FileScan equality check when partition or data filter columns are not read
> --
>
> Key: SPARK-40245
> URL: https://issues.apache.org/jira/browse/SPARK-40245
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40245) Fix FileScan equality check when partition or data filter columns are not read

2022-08-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40245:
---

Assignee: Peter Toth

> Fix FileScan equality check when partition or data filter columns are not read
> --
>
> Key: SPARK-40245
> URL: https://issues.apache.org/jira/browse/SPARK-40245
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


