[jira] [Created] (SPARK-37978) Remove the useless ChunkFetchFailureException class

2022-01-20 Thread weixiuli (Jira)
weixiuli created SPARK-37978:


 Summary: Remove the useless ChunkFetchFailureException class
 Key: SPARK-37978
 URL: https://issues.apache.org/jira/browse/SPARK-37978
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 3.2.0, 3.1.1, 3.0.3, 3.0.1, 3.0.0
Reporter: weixiuli


Remove the useless ChunkFetchFailureException class






[jira] [Assigned] (SPARK-27442) ParquetFileFormat fails to read column named with invalid characters

2022-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27442:
---

Assignee: angerszhu

> ParquetFileFormat fails to read column named with invalid characters
> 
>
> Key: SPARK-27442
> URL: https://issues.apache.org/jira/browse/SPARK-27442
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0, 2.4.1
>Reporter: Jan Vršovský
>Assignee: angerszhu
>Priority: Minor
>
> When reading a Parquet file with a column name containing invalid characters, 
> the reader fails with the exception:
> Name: org.apache.spark.sql.AnalysisException
> Message: Attribute name "..." contains invalid character(s) among " 
> ,;{}()\n\t=". Please use alias to rename it.
> Spark should not write such files, but it should be able to read them (and 
> allow the user to correct the column names). However, possible workarounds 
> (such as using an alias to rename the column, or forcing another schema) do 
> not work, since the check is done on the input.
> (Possible fix: remove the superfluous 
> {{ParquetWriteSupport.setSchema(requiredSchema, hadoopConf)}} call from 
> {{buildReaderWithPartitionValues}} ?)
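A minimal repro sketch of the failed workarounds described above, assuming a file {{/tmp/bad.parquet}} written by another tool with a column literally named {{my col}} (the space is among the rejected characters):
{code:java}
import org.apache.spark.sql.functions.col

val df = spark.read.parquet("/tmp/bad.parquet")

// Both attempts still fail with the AnalysisException above, because the
// invalid-character check runs on the input schema before the alias or the
// forced schema can take effect.
df.select(col("`my col`").alias("my_col")).show()
spark.read.schema("`my col` INT").parquet("/tmp/bad.parquet").show()
{code}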






[jira] [Resolved] (SPARK-27442) ParquetFileFormat fails to read column named with invalid characters

2022-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27442.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35229
[https://github.com/apache/spark/pull/35229]

> ParquetFileFormat fails to read column named with invalid characters
> 
>
> Key: SPARK-27442
> URL: https://issues.apache.org/jira/browse/SPARK-27442
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0, 2.4.1
>Reporter: Jan Vršovský
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.3.0
>
>
> When reading a Parquet file with a column name containing invalid characters, 
> the reader fails with the exception:
> Name: org.apache.spark.sql.AnalysisException
> Message: Attribute name "..." contains invalid character(s) among " 
> ,;{}()\n\t=". Please use alias to rename it.
> Spark should not write such files, but it should be able to read them (and 
> allow the user to correct the column names). However, possible workarounds 
> (such as using an alias to rename the column, or forcing another schema) do 
> not work, since the check is done on the input.
> (Possible fix: remove the superfluous 
> {{ParquetWriteSupport.setSchema(requiredSchema, hadoopConf)}} call from 
> {{buildReaderWithPartitionValues}} ?)






[jira] [Assigned] (SPARK-37929) Support cascade mode for `dropNamespace` API

2022-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37929:
---

Assignee: dch nguyen

> Support cascade mode for `dropNamespace` API 
> -
>
> Key: SPARK-37929
> URL: https://issues.apache.org/jira/browse/SPARK-37929
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
>







[jira] [Resolved] (SPARK-37929) Support cascade mode for `dropNamespace` API

2022-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37929.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35246
[https://github.com/apache/spark/pull/35246]

> Support cascade mode for `dropNamespace` API 
> -
>
> Key: SPARK-37929
> URL: https://issues.apache.org/jira/browse/SPARK-37929
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Reopened] (SPARK-28516) Data Type Formatting Functions: `to_char`

2022-01-20 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng reopened SPARK-28516:


> Data Type Formatting Functions: `to_char`
> -
>
> Key: SPARK-28516
> URL: https://issues.apache.org/jira/browse/SPARK-28516
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
>
> Currently, Spark does not have support for `to_char`. PgSQL, however, 
> [does|https://www.postgresql.org/docs/12/functions-formatting.html]:
> Query example: 
> {code:sql}
> SELECT to_char(SUM(n) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 
> FOLLOWING),'9D9')
> {code}
> ||Function||Return Type||Description||Example||
> |{{to_char(}}{{timestamp}}{{, }}{{text}}{{)}}|{{text}}|convert time stamp to 
> string|{{to_char(current_timestamp, 'HH12:MI:SS')}}|
> |{{to_char(}}{{interval}}{{, }}{{text}}{{)}}|{{text}}|convert interval to 
> string|{{to_char(interval '15h 2m 12s', 'HH24:MI:SS')}}|
> |{{to_char(}}{{int}}{{, }}{{text}}{{)}}|{{text}}|convert integer to 
> string|{{to_char(125, '999')}}|
> |{{to_char}}{{(}}{{double precision}}{{, }}{{text}}{{)}}|{{text}}|convert 
> real/double precision to string|{{to_char(125.8::real, '999D9')}}|
> |{{to_char(}}{{numeric}}{{, }}{{text}}{{)}}|{{text}}|convert numeric to 
> string|{{to_char(-125.8, '999D99S')}}|
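Spark does not provide {{to_char}} today, but a couple of the rows in the table above can already be approximated with existing functions ({{date_format}} for timestamps, {{format_number}} for numerics). A minimal sketch of that approximation, not the proposed {{to_char}} API:
{code:java}
import org.apache.spark.sql.functions._

spark.range(1).select(
  // roughly to_char(current_timestamp, 'HH24:MI:SS')
  date_format(current_timestamp(), "HH:mm:ss").as("ts_text"),
  // roughly to_char(125.8, '999D9'): one digit after the decimal separator
  format_number(lit(125.8), 1).as("num_text")
).show()
{code}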






[jira] [Assigned] (SPARK-37967) ConstantFolding/ Literal.create support ObjectType

2022-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37967:
---

Assignee: angerszhu

> ConstantFolding/ Literal.create support ObjectType
> --
>
> Key: SPARK-37967
> URL: https://issues.apache.org/jira/browse/SPARK-37967
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>







[jira] [Resolved] (SPARK-37967) ConstantFolding/ Literal.create support ObjectType

2022-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37967.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35255
[https://github.com/apache/spark/pull/35255]

> ConstantFolding/ Literal.create support ObjectType
> --
>
> Key: SPARK-37967
> URL: https://issues.apache.org/jira/browse/SPARK-37967
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Assigned] (SPARK-37977) Upgrade ORC to 1.6.13

2022-01-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37977:


Assignee: Dongjoon Hyun

> Upgrade ORC to 1.6.13
> -
>
> Key: SPARK-37977
> URL: https://issues.apache.org/jira/browse/SPARK-37977
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-37977) Upgrade ORC to 1.6.13

2022-01-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37977.
--
Fix Version/s: 3.2.2
   Resolution: Fixed

Issue resolved by pull request 35266
[https://github.com/apache/spark/pull/35266]

> Upgrade ORC to 1.6.13
> -
>
> Key: SPARK-37977
> URL: https://issues.apache.org/jira/browse/SPARK-37977
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.2
>
>







[jira] [Commented] (SPARK-37971) Apply and evaluate expressions row-wise in a Spark DataFrame

2022-01-20 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479820#comment-17479820
 ] 

Hyukjin Kwon commented on SPARK-37971:
--

[~carlosft] you can manually transpose and perform row-wise operations, e.g. 
see 
https://spark.apache.org/docs/latest/api/python//_modules/pyspark/pandas/frame.html#DataFrame.transpose

> Apply and evaluate expressions row-wise in a Spark DataFrame
> 
>
> Key: SPARK-37971
> URL: https://issues.apache.org/jira/browse/SPARK-37971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Carlos Gameiro
>Priority: Major
>
> This functionality would serve very specific use cases.
> Consider a DataFrame with a column of SQL expressions encoded as strings. 
> Individually, it's possible to evaluate each string and obtain the 
> corresponding result with PySpark's expr function. However, it is not 
> possible to apply the expr function row-wise to a DataFrame (via a UDF or 
> map) and evaluate all expressions efficiently.
> {code:java}
> id  |  sql_expression
> --
> 1   |  abs(-1) + 12
> 2   |  decode(1,2,3,4) - 1
> 3   |  30 * 20 - 5
> df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}
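Until something like {{expr_row}} exists, one possible workaround (a sketch, not an API Spark provides) is to evaluate each distinct expression string once with {{expr}} and then pick the matching result per row; this only scales to a modest number of distinct expressions:
{code:java}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, "abs(-1) + 12"), (2, "30 * 20 - 5")).toDF("id", "sql_expression")

// Evaluate each distinct expression string once, as its own temporary column...
val exprs = df.select("sql_expression").distinct.as[String].collect()
val evaluated = exprs.zipWithIndex.foldLeft(df) { case (acc, (e, i)) =>
  acc.withColumn(s"__expr_$i", expr(e))
}

// ...then pick, per row, the temporary column matching that row's expression string.
val result = evaluated.withColumn("sql_eval",
  exprs.zipWithIndex.map { case (e, i) =>
    when(col("sql_expression") === e, col(s"__expr_$i"))
  }.reduce(coalesce(_, _))
).drop(exprs.indices.map(i => s"__expr_$i"): _*)
{code}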






[jira] [Updated] (SPARK-37971) Apply and evaluate expressions row-wise in a Spark DataFrame

2022-01-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37971:
-
Priority: Major  (was: Critical)

> Apply and evaluate expressions row-wise in a Spark DataFrame
> 
>
> Key: SPARK-37971
> URL: https://issues.apache.org/jira/browse/SPARK-37971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Carlos Gameiro
>Priority: Major
>
> This functionality would serve very specific use cases.
> Consider a DataFrame with a column of SQL expressions encoded as strings. 
> Individually, it's possible to evaluate each string and obtain the 
> corresponding result with PySpark's expr function. However, it is not 
> possible to apply the expr function row-wise to a DataFrame (via a UDF or 
> map) and evaluate all expressions efficiently.
> {code:java}
> id  |  sql_expression
> --
> 1   |  abs(-1) + 12
> 2   |  decode(1,2,3,4) - 1
> 3   |  30 * 20 - 5
> df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}






[jira] [Commented] (SPARK-37976) All tasks finish but spark job does not conclude. Forever waits for [DONE]

2022-01-20 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479819#comment-17479819
 ] 

Hyukjin Kwon commented on SPARK-37976:
--

[~nitinsiwach] are you able to provide the reproducer?

> All tasks finish but spark job does not conclude. Forever waits for [DONE]
> --
>
> Key: SPARK-37976
> URL: https://issues.apache.org/jira/browse/SPARK-37976
> Project: Spark
>  Issue Type: Question
>  Components: PySpark, YARN
>Affects Versions: 3.1.0
>Reporter: Nitin Siwach
>Priority: Major
>
> I am using the following command to submit a spark job:  ```spark-submit 
> --master yarn --conf 
> "spark.jars.packages=com.microsoft.azure:synapseml_2.12:0.9.4" --conf 
> "spark.jars.repositories=https://mmlspark.azureedge.net/maven"; --conf 
> "spark.hadoop.fs.gs.implicit.dir.rep
> air.enable=false" 
> --py-files=gs://monsoon-credittech.appspot.com/monsoon_spark/src.zip 
> gs://monsoon-credittech.appspot.com/monsoon_spark/custom_estimator.py 
> train_evaluate_cv --data-path 
> gs://monsoon-credittech.appspot.com/mar19/training_data.csv
>  --index hash_CR_ACCOUNT_NBR --label flag__6_months --save-results --nrows 
> 10 --evaluate --run-id run_ss_02```
> Everything in the code finishes. I have {{print('save'); print(scores)}} 
> as the ultimate last line of my code and it executes as well. All the 
> activity on all nodes goes to 0. Yet the job does not conclude. My shell 
> prints ```{{{}22/01/13 19:29:15 INFO 
> org.sparkproject.jetty.server.AbstractConnector: Stopped 
> Spark@a69cfdd\{HTTP/1.1, (http/1.1)}{0.0.0.0:0}```{}}} and that's it. The job 
> constantly shows as running and I have to manually cancel it.
> Providing the output of ```jstack```, hoping it helps:
>  
>  
> {{Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed 
> mode):"DestroyJavaVM" #657 prio=5 os_prio=0 tid=0x7f9bd8013800 nid=0x3a83 
> waiting on condition [0x]   java.lang.Thread.State: 
> RUNNABLE"pool-42-thread-1" #360 prio=5 os_prio=0 tid=0x7f9b90582000 
> nid=0x3eb2 waiting on condition [0x7f9b6b24b000]   
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x94d52ac0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)"yarn-scheduler-endpoint" 
> #261 daemon prio=5 os_prio=0 tid=0x7f9bd9d61000 nid=0x3dab waiting on 
> condition [0x7f9b6f185000]   java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x88bf5dd0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)"client DomainSocketWatcher" 
> #54 daemon prio=5 os_prio=0 tid=0x7f9b919d1800 nid=0x3ae3 runnable 
> [0x7f9b84ee1000]   java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native 
> Method)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(DomainSocketWatcher.java:52)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:503)
> at java.lang.Thread.run(Thread.java:748)"Service Thread" #7 daemon 
> prio=9 os_prio=0 tid=0x7f9bd80cb800 nid=0x3a8c runnable 
> [0x]   java.lang.Thread.State: RUNNABLE"C1 CompilerThread1" 
> #6 daemon prio=9 os_prio=0 tid=0x7f9bd80c7000 nid=0x3a8b waiting on 
> condition [0x]   java.lang.Thread.State: RUNNABLE"C

[jira] [Updated] (SPARK-37976) All tasks finish but spark job does not conclude. Forever waits for [DONE]

2022-01-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37976:
-
Component/s: YARN

> All tasks finish but spark job does not conclude. Forever waits for [DONE]
> --
>
> Key: SPARK-37976
> URL: https://issues.apache.org/jira/browse/SPARK-37976
> Project: Spark
>  Issue Type: Question
>  Components: PySpark, YARN
>Affects Versions: 3.1.0
>Reporter: Nitin Siwach
>Priority: Major
>
> I am using the following command to submit a spark job:  ```spark-submit 
> --master yarn --conf 
> "spark.jars.packages=com.microsoft.azure:synapseml_2.12:0.9.4" --conf 
> "spark.jars.repositories=https://mmlspark.azureedge.net/maven"; --conf 
> "spark.hadoop.fs.gs.implicit.dir.rep
> air.enable=false" 
> --py-files=gs://monsoon-credittech.appspot.com/monsoon_spark/src.zip 
> gs://monsoon-credittech.appspot.com/monsoon_spark/custom_estimator.py 
> train_evaluate_cv --data-path 
> gs://monsoon-credittech.appspot.com/mar19/training_data.csv
>  --index hash_CR_ACCOUNT_NBR --label flag__6_months --save-results --nrows 
> 10 --evaluate --run-id run_ss_02```
> Everything in the code finishes. I have {{print('save'); print(scores)}} 
> as the ultimate last line of my code and it executes as well. All the 
> activity on all nodes goes to 0. Yet the job does not conclude. My shell 
> prints ```{{{}22/01/13 19:29:15 INFO 
> org.sparkproject.jetty.server.AbstractConnector: Stopped 
> Spark@a69cfdd\{HTTP/1.1, (http/1.1)}{0.0.0.0:0}```{}}} and that's it. The job 
> constantly shows as running and I have to manually cancel it.
> Providing the output of ```jstack```, hoping it helps:
>  
>  
> {{Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed 
> mode):"DestroyJavaVM" #657 prio=5 os_prio=0 tid=0x7f9bd8013800 nid=0x3a83 
> waiting on condition [0x]   java.lang.Thread.State: 
> RUNNABLE"pool-42-thread-1" #360 prio=5 os_prio=0 tid=0x7f9b90582000 
> nid=0x3eb2 waiting on condition [0x7f9b6b24b000]   
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x94d52ac0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)"yarn-scheduler-endpoint" 
> #261 daemon prio=5 os_prio=0 tid=0x7f9bd9d61000 nid=0x3dab waiting on 
> condition [0x7f9b6f185000]   java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x88bf5dd0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)"client DomainSocketWatcher" 
> #54 daemon prio=5 os_prio=0 tid=0x7f9b919d1800 nid=0x3ae3 runnable 
> [0x7f9b84ee1000]   java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native 
> Method)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(DomainSocketWatcher.java:52)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:503)
> at java.lang.Thread.run(Thread.java:748)"Service Thread" #7 daemon 
> prio=9 os_prio=0 tid=0x7f9bd80cb800 nid=0x3a8c runnable 
> [0x]   java.lang.Thread.State: RUNNABLE"C1 CompilerThread1" 
> #6 daemon prio=9 os_prio=0 tid=0x7f9bd80c7000 nid=0x3a8b waiting on 
> condition [0x]   java.lang.Thread.State: RUNNABLE"C2 
> CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x7f9bd80c4000 nid=0x3a8a 
> waiti

[jira] [Updated] (SPARK-37976) All tasks finish but spark job does not conclude. Forever waits for [DONE]

2022-01-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37976:
-
Priority: Major  (was: Blocker)

> All tasks finish but spark job does not conclude. Forever waits for [DONE]
> --
>
> Key: SPARK-37976
> URL: https://issues.apache.org/jira/browse/SPARK-37976
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Nitin Siwach
>Priority: Major
>
> I am using the following command to submit a spark job:  ```spark-submit 
> --master yarn --conf 
> "spark.jars.packages=com.microsoft.azure:synapseml_2.12:0.9.4" --conf 
> "spark.jars.repositories=https://mmlspark.azureedge.net/maven"; --conf 
> "spark.hadoop.fs.gs.implicit.dir.rep
> air.enable=false" 
> --py-files=gs://monsoon-credittech.appspot.com/monsoon_spark/src.zip 
> gs://monsoon-credittech.appspot.com/monsoon_spark/custom_estimator.py 
> train_evaluate_cv --data-path 
> gs://monsoon-credittech.appspot.com/mar19/training_data.csv
>  --index hash_CR_ACCOUNT_NBR --label flag__6_months --save-results --nrows 
> 10 --evaluate --run-id run_ss_02```
> Everything in the code finishes. I have {{print('save'); print(scores)}} 
> as the ultimate last line of my code and it executes as well. All the 
> activity on all nodes goes to 0. Yet the job does not conclude. My shell 
> prints ```{{{}22/01/13 19:29:15 INFO 
> org.sparkproject.jetty.server.AbstractConnector: Stopped 
> Spark@a69cfdd\{HTTP/1.1, (http/1.1)}{0.0.0.0:0}```{}}} and that's it. The job 
> constantly shows as running and I have to manually cancel it.
> Providing the output of ```jstack```, hoping it helps:
>  
>  
> {{Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed 
> mode):"DestroyJavaVM" #657 prio=5 os_prio=0 tid=0x7f9bd8013800 nid=0x3a83 
> waiting on condition [0x]   java.lang.Thread.State: 
> RUNNABLE"pool-42-thread-1" #360 prio=5 os_prio=0 tid=0x7f9b90582000 
> nid=0x3eb2 waiting on condition [0x7f9b6b24b000]   
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x94d52ac0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)"yarn-scheduler-endpoint" 
> #261 daemon prio=5 os_prio=0 tid=0x7f9bd9d61000 nid=0x3dab waiting on 
> condition [0x7f9b6f185000]   java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x88bf5dd0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)"client DomainSocketWatcher" 
> #54 daemon prio=5 os_prio=0 tid=0x7f9b919d1800 nid=0x3ae3 runnable 
> [0x7f9b84ee1000]   java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native 
> Method)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(DomainSocketWatcher.java:52)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:503)
> at java.lang.Thread.run(Thread.java:748)"Service Thread" #7 daemon 
> prio=9 os_prio=0 tid=0x7f9bd80cb800 nid=0x3a8c runnable 
> [0x]   java.lang.Thread.State: RUNNABLE"C1 CompilerThread1" 
> #6 daemon prio=9 os_prio=0 tid=0x7f9bd80c7000 nid=0x3a8b waiting on 
> condition [0x]   java.lang.Thread.State: RUNNABLE"C2 
> CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x7f9bd80c4000 nid=0x3a8a 

[jira] [Assigned] (SPARK-37977) Upgrade ORC to 1.6.13

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37977:


Assignee: (was: Apache Spark)

> Upgrade ORC to 1.6.13
> -
>
> Key: SPARK-37977
> URL: https://issues.apache.org/jira/browse/SPARK-37977
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.2
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-37977) Upgrade ORC to 1.6.13

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479818#comment-17479818
 ] 

Apache Spark commented on SPARK-37977:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35266

> Upgrade ORC to 1.6.13
> -
>
> Key: SPARK-37977
> URL: https://issues.apache.org/jira/browse/SPARK-37977
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.2
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-37977) Upgrade ORC to 1.6.13

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37977:


Assignee: Apache Spark

> Upgrade ORC to 1.6.13
> -
>
> Key: SPARK-37977
> URL: https://issues.apache.org/jira/browse/SPARK-37977
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-37888) Unify v1 and v2 DESCRIBE TABLE tests

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37888:


Assignee: (was: Apache Spark)

> Unify v1 and v2 DESCRIBE TABLE tests
> 
>
> Key: SPARK-37888
> URL: https://issues.apache.org/jira/browse/SPARK-37888
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-37888) Unify v1 and v2 DESCRIBE TABLE tests

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37888:


Assignee: Apache Spark

> Unify v1 and v2 DESCRIBE TABLE tests
> 
>
> Key: SPARK-37888
> URL: https://issues.apache.org/jira/browse/SPARK-37888
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-37888) Unify v1 and v2 DESCRIBE TABLE tests

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479817#comment-17479817
 ] 

Apache Spark commented on SPARK-37888:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/35265

> Unify v1 and v2 DESCRIBE TABLE tests
> 
>
> Key: SPARK-37888
> URL: https://issues.apache.org/jira/browse/SPARK-37888
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Created] (SPARK-37977) Upgrade ORC to 1.6.13

2022-01-20 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-37977:
-

 Summary: Upgrade ORC to 1.6.13
 Key: SPARK-37977
 URL: https://issues.apache.org/jira/browse/SPARK-37977
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.2.2
Reporter: Dongjoon Hyun









[jira] [Commented] (SPARK-37963) Need to update Partition URI after renaming table in InMemoryCatalog

2022-01-20 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479813#comment-17479813
 ] 

Huaxin Gao commented on SPARK-37963:


Changed the fix version to 3.2.2 for now. Will change back if RC2 fails.

> Need to update Partition URI after renaming table in InMemoryCatalog
> 
>
> Key: SPARK-37963
> URL: https://issues.apache.org/jira/browse/SPARK-37963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> After renaming a partitioned table, selecting from the renamed table backed 
> by InMemoryCatalog returns an empty result.
> The following checkAnswer will fail because the result is empty.
> {code:java}
> sql(s"create table foo(i int, j int) using PARQUET partitioned by (j)")
> sql("insert into table foo partition(j=2) values (1)")
> sql(s"alter table foo rename to bar")
> checkAnswer(spark.table("bar"), Row(1, 2)) {code}
> To fix the bug, we need to update the partition URIs after renaming a table 
> in InMemoryCatalog.
>  






[jira] [Updated] (SPARK-37963) Need to update Partition URI after renaming table in InMemoryCatalog

2022-01-20 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-37963:
---
Fix Version/s: 3.2.2
   (was: 3.2.1)

> Need to update Partition URI after renaming table in InMemoryCatalog
> 
>
> Key: SPARK-37963
> URL: https://issues.apache.org/jira/browse/SPARK-37963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> After renaming a partitioned table, selecting from the renamed table backed 
> by InMemoryCatalog returns an empty result.
> The following checkAnswer will fail because the result is empty.
> {code:java}
> sql(s"create table foo(i int, j int) using PARQUET partitioned by (j)")
> sql("insert into table foo partition(j=2) values (1)")
> sql(s"alter table foo rename to bar")
> checkAnswer(spark.table("bar"), Row(1, 2)) {code}
> To fix the bug, we need to update the partition URIs after renaming a table 
> in InMemoryCatalog.
>  






[jira] [Comment Edited] (SPARK-30661) KMeans blockify input vectors

2022-01-20 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479793#comment-17479793
 ] 

zhengruifeng edited comment on SPARK-30661 at 1/21/22, 2:54 AM:


According to https://issues.apache.org/jira/browse/SPARK-31454, dense kmeans 
with MKL can be 3.5X faster than the existing impl.

 

I re-tested my blocked impl on a relatively low-dimensional dataset. Native BLAS 
(openblas) was linked.

Total training durations are attached.

 

On dense datasets, it could be 30% ~ 3x faster than the existing impl.

On sparse datasets, it is usually slower than the existing impl. (According to my 
experience, it may be 10X slower than the existing impl on some high-dimensional 
(dim>100,000) datasets.)

 

test code
{code:java}
import scala.util.Random
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.clustering._
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

// val df = spark.read.option("numFeatures", 
"28").format("libsvm").load("/d1/Datasets/higgs/HIGGS")
val df = spark.read.parquet("/d1/Datasets/higgs/HIGGS.parquet").repartition(24)
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count
df.count

def getSparseUDF(dim: Int) = {
    val rng = new Random(123)
    val newIndices = rng.shuffle(Seq.range(0, dim)).take(28).toArray.sorted
    udf { vec: Vector =>
        Vectors.sparse(dim, newIndices, vec.toArray).compressed
    }
}

// sc.setLogLevel("INFO")

// blocked impl
val km = new KMeans().setInitMode("random").setMaxIter(5).setTol(0)

for (dim <- Seq(28, 280, 2800); k <- Seq(2, 8, 32, 128); size <- Seq(1, 4, 16)) 
{
    Thread.sleep(1000)
    val ds = if (dim == 28) { df } else { val sparseUDF = getSparseUDF(dim); 
df.withColumn("features", sparseUDF(col("features"))) }
    val start = System.currentTimeMillis
    val model = km.setK(k).setMaxBlockSizeInMB(size).fit(ds)
    val end = System.currentTimeMillis
    println((model.uid, dim, k, size, end - start, model.summary.trainingCost, 
model.summary.numIter, model.centerMatrix.toString.take(20)))
} 


// existing impl
val km = new KMeans().setInitMode("random").setMaxIter(5).setTol(0)

for (dim <- Seq(28, 280, 2800); k <- Seq(2, 8, 32, 128)) {
    Thread.sleep(1000)
    val ds = if (dim == 28) { df } else { val sparseUDF = getSparseUDF(dim); 
df.withColumn("features", sparseUDF(col("features"))) }
    val start = System.currentTimeMillis
    val model = km.setK(k).fit(ds)
    val end = System.currentTimeMillis
    println((model.uid, dim, k, end - start, model.summary.trainingCost, 
model.summary.numIter, model.clusterCenters.head.toString.take(20)))
}{code}


was (Author: podongfeng):
According to https://issues.apache.org/jira/browse/SPARK-31454, dense kmeans 
with MKL can be 3.5X faster than the existing impl.

 

I re-tested my blocked impl on a relatively low-dimensional dataset. Native BLAS 
(openblas) was linked.

Total training duration:

!image-2022-01-21-10-42-41-771.png!

 

On dense datasets, it could be 30% ~ 3x faster than the existing impl.

On sparse datasets, it is usually slower than the existing impl. (According to my 
experience, it may be 10X slower than the existing impl on some high-dimensional 
(dim>100,000) datasets.)

 

test code
{code:java}
import scala.util.Random
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.clustering._
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

// val df = spark.read.option("numFeatures", 
"28").format("libsvm").load("/d1/Datasets/higgs/HIGGS")
val df = spark.read.parquet("/d1/Datasets/higgs/HIGGS.parquet").repartition(24)
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count
df.count

def getSparseUDF(dim: Int) = {
    val rng = new Random(123)
    val newIndices = rng.shuffle(Seq.range(0, dim)).take(28).toArray.sorted
    udf { vec: Vector =>
        Vectors.sparse(dim, newIndices, vec.toArray).compressed
    }
}

// sc.setLogLevel("INFO")

// blocked impl
val km = new KMeans().setInitMode("random").setMaxIter(5).setTol(0)

for (dim <- Seq(28, 280, 2800); k <- Seq(2, 8, 32, 128); size <- Seq(1, 4, 16)) 
{
    Thread.sleep(1000)
    val ds = if (dim == 28) { df } else { val sparseUDF = getSparseUDF(dim); 
df.withColumn("features", sparseUDF(col("features"))) }
    val start = System.currentTimeMillis
    val model = km.setK(k).setMaxBlockSizeInMB(size).fit(ds)
    val end = System.currentTimeMillis
    println((model.uid, dim, k, size, end - start, model.summary.trainingCost, 
model.summary.numIter, model.centerMatrix.toString.take(20)))
} 


// existing impl
val km = new KMeans().setInitMode("random").setMaxIter(5).setTol(0)

for (dim <- Seq(28, 280, 2800); k <- Seq(2, 8, 32, 128)) {
    Thread.sleep(1000)
    val ds = if (dim == 28) { df } else { val sparseUDF = getSparseUDF(dim); 
df.withColumn("features", sparseUDF(col("features"))) }
    val start = System.currentTimeMillis
    val model = km.se

[jira] [Updated] (SPARK-30661) KMeans blockify input vectors

2022-01-20 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-30661:
-
Attachment: blockify_kmeans.png

> KMeans blockify input vectors
> -
>
> Key: SPARK-30661
> URL: https://issues.apache.org/jira/browse/SPARK-30661
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Attachments: blockify_kmeans.png
>
>







[jira] [Commented] (SPARK-30661) KMeans blockify input vectors

2022-01-20 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479793#comment-17479793
 ] 

zhengruifeng commented on SPARK-30661:
--

According to https://issues.apache.org/jira/browse/SPARK-31454, dense kmeans 
with MKL can be 3.5X faster than the existing impl.

 

I re-tested my blocked impl on a relatively low-dimensional dataset. Native BLAS 
(openblas) was linked.

Total training duration:

!image-2022-01-21-10-42-41-771.png!

 

On dense datasets, it could be 30% ~ 3x faster than the existing impl.

On sparse datasets, it is usually slower than the existing impl. (According to my 
experience, it may be 10X slower than the existing impl on some high-dimensional 
(dim>100,000) datasets.)

 

test code
{code:java}
import scala.util.Random
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.clustering._
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

// val df = spark.read.option("numFeatures", 
"28").format("libsvm").load("/d1/Datasets/higgs/HIGGS")
val df = spark.read.parquet("/d1/Datasets/higgs/HIGGS.parquet").repartition(24)
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count
df.count

def getSparseUDF(dim: Int) = {
    val rng = new Random(123)
    val newIndices = rng.shuffle(Seq.range(0, dim)).take(28).toArray.sorted
    udf { vec: Vector =>
        Vectors.sparse(dim, newIndices, vec.toArray).compressed
    }
}

// sc.setLogLevel("INFO")

// blocked impl
val km = new KMeans().setInitMode("random").setMaxIter(5).setTol(0)

for (dim <- Seq(28, 280, 2800); k <- Seq(2, 8, 32, 128); size <- Seq(1, 4, 16)) 
{
    Thread.sleep(1000)
    val ds = if (dim == 28) { df } else { val sparseUDF = getSparseUDF(dim); 
df.withColumn("features", sparseUDF(col("features"))) }
    val start = System.currentTimeMillis
    val model = km.setK(k).setMaxBlockSizeInMB(size).fit(ds)
    val end = System.currentTimeMillis
    println((model.uid, dim, k, size, end - start, model.summary.trainingCost, 
model.summary.numIter, model.centerMatrix.toString.take(20)))
} 


// existing impl
val km = new KMeans().setInitMode("random").setMaxIter(5).setTol(0)

for (dim <- Seq(28, 280, 2800); k <- Seq(2, 8, 32, 128)) {
    Thread.sleep(1000)
    val ds = if (dim == 28) { df } else { val sparseUDF = getSparseUDF(dim); 
df.withColumn("features", sparseUDF(col("features"))) }
    val start = System.currentTimeMillis
    val model = km.setK(k).fit(ds)
    val end = System.currentTimeMillis
    println((model.uid, dim, k, end - start, model.summary.trainingCost, 
model.summary.numIter, model.clusterCenters.head.toString.take(20)))
}{code}

> KMeans blockify input vectors
> -
>
> Key: SPARK-30661
> URL: https://issues.apache.org/jira/browse/SPARK-30661
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>







[jira] [Commented] (SPARK-37037) Improve byte array sort by unify compareTo function of UTF8String and ByteArray

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479778#comment-17479778
 ] 

Apache Spark commented on SPARK-37037:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/35264

> Improve byte array sort by unify compareTo function of UTF8String and 
> ByteArray 
> 
>
> Key: SPARK-37037
> URL: https://issues.apache.org/jira/browse/SPARK-37037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
> Fix For: 3.3.0
>
>
> BinaryType uses `TypeUtils.compareBinary` to compare two byte arrays; however, 
> it's slow since it compares the arrays byte by byte using unsigned int 
> comparison.
> We can compare them using `Platform.getLong` with unsigned long comparison if 
> they have more than 8 bytes. Here is some history about this `TODO`: 
> [https://github.com/apache/spark/pull/6755/files#r32197461]
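A sketch of the idea, not Spark's actual implementation (which reads whole words via {{Platform.getLong}}): compare eight bytes at a time as unsigned longs and fall back to byte-wise unsigned comparison for the tail:
{code:java}
// Builds each 8-byte word big-endian so that unsigned-long order matches
// lexicographic byte order; the tail bytes and the lengths break ties.
def compareBinary(a: Array[Byte], b: Array[Byte]): Int = {
  val len = math.min(a.length, b.length)
  var i = 0
  while (i + 8 <= len) {
    var wa = 0L
    var wb = 0L
    var j = 0
    while (j < 8) {
      wa = (wa << 8) | (a(i + j) & 0xffL)
      wb = (wb << 8) | (b(i + j) & 0xffL)
      j += 1
    }
    if (wa != wb) return java.lang.Long.compareUnsigned(wa, wb)
    i += 8
  }
  while (i < len) {
    val cmp = (a(i) & 0xff) - (b(i) & 0xff)
    if (cmp != 0) return cmp
    i += 1
  }
  a.length - b.length
}
{code}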






[jira] [Commented] (SPARK-37037) Improve byte array sort by unify compareTo function of UTF8String and ByteArray

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479777#comment-17479777
 ] 

Apache Spark commented on SPARK-37037:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/35264

> Improve byte array sort by unify compareTo function of UTF8String and 
> ByteArray 
> 
>
> Key: SPARK-37037
> URL: https://issues.apache.org/jira/browse/SPARK-37037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
> Fix For: 3.3.0
>
>
> BinaryType uses `TypeUtils.compareBinary` to compare two byte arrays; however, 
> it's slow since it compares the arrays byte by byte using unsigned int 
> comparison.
> We can compare them using `Platform.getLong` with unsigned long comparison if 
> they have more than 8 bytes. Here is some history about this `TODO`: 
> [https://github.com/apache/spark/pull/6755/files#r32197461]






[jira] [Commented] (SPARK-37897) Filter with subexpression elimination may cause query failed

2022-01-20 Thread hujiahua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479764#comment-17479764
 ] 

hujiahua commented on SPARK-37897:
--

[~viirya] Thank you for your reply. But I'm curious what the correct way to 
use `plusOne` is. Like this?
{code:java}
select t.* from  (select * from table1 where c1 >= 0) t where plusOne(t.c1) > 1 
and plusOne(t.c1) < 3 {code}
By the way, are there any implementation constraints on this (filter predicate 
order) in the ANSI SQL standard?
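One defensive pattern, a sketch rather than an authoritative answer to the question above, is to put the guard and the UDF call in one conditional expression, since Spark's SQL programming guide advises using IF/CASE WHEN rather than relying on the evaluation order of AND-ed predicates:
{code:java}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(-1, 1, 2).toDF("c1")
val plusOne = udf { n: Int =>
  if (n >= 0) n + 1 else throw new RuntimeException("Must be positive number.")
}

// The CASE WHEN wrapper only evaluates plusOne when the guard holds, so the
// filter no longer depends on the order in which the predicates are checked.
val guarded = when(col("c1") >= 0, plusOne(col("c1")))
val result = df.filter(col("c1") >= 0 && guarded > 1 && guarded < 3).collect()
{code}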

> Filter with subexpression elimination may cause query failed
> 
>
> Key: SPARK-37897
> URL: https://issues.apache.org/jira/browse/SPARK-37897
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: hujiahua
>Priority: Major
> Attachments: image-2022-01-13-20-22-09-055.png
>
>
>  
> The following test will fail; the root cause is that the execution order of 
> filter predicates changes after subexpression elimination. So I think we 
> should keep the predicate execution order after subexpression elimination.
> {code:java}
> test("filter with subexpression elimination may cause query failed.") {
>   withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "false")) {
> val df = Seq(-1, 1, 2).toDF("c1")
> //register `plusOne` udf, and the function will failed if input was not a 
> positive number.
> spark.sqlContext.udf.register("plusOne",
>   (n: Int) => { if (n >= 0) n + 1 else throw new SparkException("Must be 
> positive number.") })
> val result = df.filter("c1 >= 0 and plusOne(c1) > 1 and plusOne(c1) < 
> 3").collect()
> assert(result.size === 1)
>   }
> } 
> Caused by: org.apache.spark.SparkException: Must be positive number.
>     at 
> org.apache.spark.sql.DataFrameSuite.$anonfun$new$3(DataFrameSuite.scala:67)
>     at 
> scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23)
>     ... 20 more{code}
>  
> https://github.com/apache/spark/blob/0e186e8a19926f91810f3eaf174611b71e598de6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratePredicate.scala#L63
> !image-2022-01-13-20-22-09-055.png!
>  
>  






[jira] [Commented] (SPARK-23897) Guava version

2022-01-20 Thread phoebe chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479744#comment-17479744
 ] 

phoebe chen commented on SPARK-23897:
-

The currently in-use Guava version 14.0.1 has the following vulnerabilities:
 * CVE-2018-10237

 * CVE-2020-8908

FYI.

> Guava version
> -
>
> Key: SPARK-23897
> URL: https://issues.apache.org/jira/browse/SPARK-23897
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> The Guava dependency version 14 is pretty old and needs to be updated to at 
> least 16. The Google Cloud Storage connector uses a newer one, which leads to 
> the fairly common Guava error "java.lang.NoSuchMethodError: 
> com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;"
>  and crashes the app.






[jira] [Commented] (SPARK-37806) Support minimum number of tasks per executor before being rolling

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479724#comment-17479724
 ] 

Apache Spark commented on SPARK-37806:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35263

> Support minimum number of tasks per executor before being rolling
> -
>
> Key: SPARK-37806
> URL: https://issues.apache.org/jira/browse/SPARK-37806
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Created] (SPARK-37976) All tasks finish but spark job does not conclude. Forever waits for [DONE]

2022-01-20 Thread Nitin Siwach (Jira)
Nitin Siwach created SPARK-37976:


 Summary: All tasks finish but spark job does not conclude. Forever 
waits for [DONE]
 Key: SPARK-37976
 URL: https://issues.apache.org/jira/browse/SPARK-37976
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Nitin Siwach


I am using the following command to submit a spark job:  ```spark-submit 
--master yarn --conf 
"spark.jars.packages=com.microsoft.azure:synapseml_2.12:0.9.4" --conf 
"spark.jars.repositories=https://mmlspark.azureedge.net/maven"; --conf 
"spark.hadoop.fs.gs.implicit.dir.rep
air.enable=false" 
--py-files=gs://monsoon-credittech.appspot.com/monsoon_spark/src.zip 
gs://monsoon-credittech.appspot.com/monsoon_spark/custom_estimator.py 
train_evaluate_cv --data-path 
gs://monsoon-credittech.appspot.com/mar19/training_data.csv
 --index hash_CR_ACCOUNT_NBR --label flag__6_months --save-results --nrows 
10 --evaluate --run-id run_ss_02```

Everything in the code finishes. I have {{print('save'); print(scores)}} as 
the ultimate last line of my code and it executes as well. All the activity on 
all nodes goes to 0. Yet the job does not conclude. My shell prints 
```{{{}22/01/13 19:29:15 INFO org.sparkproject.jetty.server.AbstractConnector: 
Stopped Spark@a69cfdd\{HTTP/1.1, (http/1.1)}{0.0.0.0:0}```{}}} and that's it. 
The job constantly shows as running and I have to manually cancel it.

Providing the output of ```jstack```, hoping it helps:

 
 
{{Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed 
mode):"DestroyJavaVM" #657 prio=5 os_prio=0 tid=0x7f9bd8013800 nid=0x3a83 
waiting on condition [0x]   java.lang.Thread.State: 
RUNNABLE"pool-42-thread-1" #360 prio=5 os_prio=0 tid=0x7f9b90582000 
nid=0x3eb2 waiting on condition [0x7f9b6b24b000]   java.lang.Thread.State: 
WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x94d52ac0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)"yarn-scheduler-endpoint" #261 
daemon prio=5 os_prio=0 tid=0x7f9bd9d61000 nid=0x3dab waiting on condition 
[0x7f9b6f185000]   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x88bf5dd0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)"client DomainSocketWatcher" 
#54 daemon prio=5 os_prio=0 tid=0x7f9b919d1800 nid=0x3ae3 runnable 
[0x7f9b84ee1000]   java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(DomainSocketWatcher.java:52)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:503)
at java.lang.Thread.run(Thread.java:748)"Service Thread" #7 daemon 
prio=9 os_prio=0 tid=0x7f9bd80cb800 nid=0x3a8c runnable 
[0x]   java.lang.Thread.State: RUNNABLE"C1 CompilerThread1" #6 
daemon prio=9 os_prio=0 tid=0x7f9bd80c7000 nid=0x3a8b waiting on condition 
[0x]   java.lang.Thread.State: RUNNABLE"C2 CompilerThread0" #5 
daemon prio=9 os_prio=0 tid=0x7f9bd80c4000 nid=0x3a8a waiting on condition 
[0x]   java.lang.Thread.State: RUNNABLE"Signal Dispatcher" #4 
daemon prio=9 os_prio=0 tid=0x7f9bd80c1000 nid=0x3a89 waiting on condition 
[0x]   java.lang.Thread.State: RUNNABLE"Finalizer" #3 daemon 
prio=8 os_prio=0 tid=0x7f9bd808b800 nid=0x3a88 in Object.wait() 
[0x7f9bc5e87000]   java.lang.Thread.State: WAITING (on object monitor)
 

[jira] [Resolved] (SPARK-37083) Inline type hints for python/pyspark/accumulators.py

2022-01-20 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37083.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34363
[https://github.com/apache/spark/pull/34363]

> Inline type hints for python/pyspark/accumulators.py
> 
>
> Key: SPARK-37083
> URL: https://issues.apache.org/jira/browse/SPARK-37083
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37083) Inline type hints for python/pyspark/accumulators.py

2022-01-20 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37083:
--

Assignee: dch nguyen

> Inline type hints for python/pyspark/accumulators.py
> 
>
> Key: SPARK-37083
> URL: https://issues.apache.org/jira/browse/SPARK-37083
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479642#comment-17479642
 ] 

Apache Spark commented on SPARK-37974:
--

User 'parthchandra' has created a pull request for this issue:
https://github.com/apache/spark/pull/35262

> Implement vectorized  DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings 
> for Parquet V2 support
> ---
>
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Parth Chandra
>Priority: Major
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding which is for integer 
> values, but does not implement the DELTA_BYTE_ARRAY encoding which is for 
> string values. DELTA_BYTE_ARRAY encoding also requires the 
> DELTA_LENGTH_BYTE_ARRAY encoding. Both these encodings need vectorized 
> versions as the current implementation simply calls the non-vectorized 
> Parquet library methods.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37974:


Assignee: Apache Spark

> Implement vectorized  DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings 
> for Parquet V2 support
> ---
>
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Parth Chandra
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding which is for integer 
> values, but does not implement the DELTA_BYTE_ARRAY encoding which is for 
> string values. DELTA_BYTE_ARRAY encoding also requires the 
> DELTA_LENGTH_BYTE_ARRAY encoding. Both these encodings need vectorized 
> versions as the current implementation simply calls the non-vectorized 
> Parquet library methods.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479641#comment-17479641
 ] 

Apache Spark commented on SPARK-37974:
--

User 'parthchandra' has created a pull request for this issue:
https://github.com/apache/spark/pull/35262

> Implement vectorized  DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings 
> for Parquet V2 support
> ---
>
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Parth Chandra
>Priority: Major
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding which is for integer 
> values, but does not implement the DELTA_BYTE_ARRAY encoding which is for 
> string values. DELTA_BYTE_ARRAY encoding also requires the 
> DELTA_LENGTH_BYTE_ARRAY encoding. Both these encodings need vectorized 
> versions as the current implementation simply calls the non-vectorized 
> Parquet library methods.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37974:


Assignee: (was: Apache Spark)

> Implement vectorized  DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings 
> for Parquet V2 support
> ---
>
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Parth Chandra
>Priority: Major
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding which is for integer 
> values, but does not implement the DELTA_BYTE_ARRAY encoding which is for 
> string values. DELTA_BYTE_ARRAY encoding also requires the 
> DELTA_LENGTH_BYTE_ARRAY encoding. Both these encodings need vectorized 
> versions as the current implementation simply calls the non-vectorized 
> Parquet library methods.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37975) Implement vectorized BYTE_STREAM_SPLIT encoding for Parquet V2 support

2022-01-20 Thread Parth Chandra (Jira)
Parth Chandra created SPARK-37975:
-

 Summary: Implement vectorized BYTE_STREAM_SPLIT encoding for 
Parquet V2 support
 Key: SPARK-37975
 URL: https://issues.apache.org/jira/browse/SPARK-37975
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Parth Chandra


Parquet V2 has a BYTE_STREAM_SPLIT encoding which is not currently directly 
usable by Spark (as there is no way to enable writing of this encoding). 
However, other engines may write a file with this encoding and the vectorized 
reader should be able to consume it.
A vectorized version of this encoding should be implemented.
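Until that lands, one possible (untested) way to consume such a file written by another engine is to fall back to the row-based Parquet reader, which delegates decoding to parquet-mr; this is a hedged sketch only, and the file path below is hypothetical:
{code:java}
// Sketch, not a recommendation: disable the vectorized reader so decoding goes
// through parquet-mr, which already understands BYTE_STREAM_SPLIT.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Hypothetical file written by another engine with BYTE_STREAM_SPLIT-encoded floats.
val floats = spark.read.parquet("/data/external/byte_stream_split_floats.parquet")
floats.show(5)
{code}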



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-01-20 Thread Parth Chandra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Chandra updated SPARK-37974:
--
External issue ID:   (was: SPARK-36879)

> Implement vectorized  DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings 
> for Parquet V2 support
> ---
>
> Key: SPARK-37974
> URL: https://issues.apache.org/jira/browse/SPARK-37974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Parth Chandra
>Priority: Major
>
> SPARK-36879 implements the DELTA_BINARY_PACKED encoding which is for integer 
> values, but does not implement the DELTA_BYTE_ARRAY encoding which is for 
> string values. DELTA_BYTE_ARRAY encoding also requires the 
> DELTA_LENGTH_BYTE_ARRAY encoding. Both these encodings need vectorized 
> versions as the current implementation simply calls the non-vectorized 
> Parquet library methods.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37974) Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support

2022-01-20 Thread Parth Chandra (Jira)
Parth Chandra created SPARK-37974:
-

 Summary: Implement vectorized  DELTA_BYTE_ARRAY and 
DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
 Key: SPARK-37974
 URL: https://issues.apache.org/jira/browse/SPARK-37974
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Parth Chandra


SPARK-36879 implements the DELTA_BINARY_PACKED encoding which is for integer 
values, but does not implement the DELTA_BYTE_ARRAY encoding which is for 
string values. DELTA_BYTE_ARRAY encoding also requires the 
DELTA_LENGTH_BYTE_ARRAY encoding. Both these encodings need vectorized versions 
as the current implementation simply calls the non-vectorized Parquet library 
methods.
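To make the gap concrete, here is a minimal, hedged sketch (not taken from this issue or its PR) of producing such a file and hitting the missing code path; {{parquet.writer.version}} and {{parquet.enable.dictionary}} are assumed parquet-mr writer properties, not Spark SQL confs:
{code:java}
// Sketch only: ask parquet-mr for the V2 writer and disable dictionary encoding so
// string columns are written with DELTA_BYTE_ARRAY / DELTA_LENGTH_BYTE_ARRAY
// (assumption about parquet-mr's encoding selection).
spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "PARQUET_2_0")
spark.sparkContext.hadoopConfiguration.set("parquet.enable.dictionary", "false")

spark.range(0, 1000)
  .selectExpr("cast(id AS string) AS s")
  .write
  .mode("overwrite")
  .parquet("/tmp/parquet_v2_strings")

// Until these encodings are vectorized, reading the file back with the vectorized
// reader enabled is expected to fail with
//   java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
spark.read.parquet("/tmp/parquet_v2_strings").show()
{code}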



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37973) Directly call super.getDefaultReadLimit when scala issue 12523 is fixed

2022-01-20 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-37973:
--
Description: 
In regards to [https://github.com/apache/spark/pull/35238] and more 
specifically these lines:

 

[https://github.com/apache/spark/pull/35238/files#diff-c9def1b07e12775929ebc58e107291495a48461f516db75acc4af4b6ce8b4dc7R106]

 

[https://github.com/apache/spark/pull/35238/files#diff-1ab03f750a3cbd95b7a84bb0c6bb6c6600c79dcb20ea2b43dd825d9aedab9656R140]

 

 
{code:java}
maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(super.getDefaultReadLimit)
 {code}
 

 

needed to be changed to

 
{code:java}
maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(ReadLimit.allAvailable()) 
{code}
 

Because of a bug in the scala compiler documented here:

 

[https://github.com/scala/bug/issues/12523]

 

After this bug is fixed we can revert this change, i.e. back to using 
`super.getDefaultReadLimit`

  was:
In regards to [https://github.com/apache/spark/pull/35238] and more 
specifically these lines:

 

[https://github.com/apache/spark/pull/35238/files#diff-c9def1b07e12775929ebc58e107291495a48461f516db75acc4af4b6ce8b4dc7R106]

 

[https://github.com/apache/spark/pull/35238/files#diff-1ab03f750a3cbd95b7a84bb0c6bb6c6600c79dcb20ea2b43dd825d9aedab9656R140]

 

```scala

maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(super.getDefaultReadLimit)

```

 

needed to be changed to



```scala

maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse({*}ReadLimit.allAvailable(){*})

```

 

Because of a bug in the scala compiler documented here:

 

[https://github.com/scala/bug/issues/12523]

 

After this bug is fixed we can revert this change, i.e. back to using 
`super.getDefaultReadLimit`


> Directly call super.getDefaultReadLimit when scala issue 12523 is fixed
> ---
>
> Key: SPARK-37973
> URL: https://issues.apache.org/jira/browse/SPARK-37973
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> In regards to [https://github.com/apache/spark/pull/35238] and more 
> specifically these lines:
>  
> [https://github.com/apache/spark/pull/35238/files#diff-c9def1b07e12775929ebc58e107291495a48461f516db75acc4af4b6ce8b4dc7R106]
>  
> [https://github.com/apache/spark/pull/35238/files#diff-1ab03f750a3cbd95b7a84bb0c6bb6c6600c79dcb20ea2b43dd825d9aedab9656R140]
>  
>  
> {code:java}
> maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(super.getDefaultReadLimit)
>  {code}
>  
>  
> needed to be changed to
>  
> {code:java}
> maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(ReadLimit.allAvailable())
>  {code}
>  
> Because of a bug in the scala compiler documented here:
>  
> [https://github.com/scala/bug/issues/12523]
>  
> After this bug is fixed we can revert this change, i.e. back to using 
> `super.getDefaultReadLimit`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37973) Directly call super.getDefaultReadLimit when scala issue 12523 is fixed

2022-01-20 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-37973:
-

 Summary: Directly call super.getDefaultReadLimit when scala issue 
12523 is fixed
 Key: SPARK-37973
 URL: https://issues.apache.org/jira/browse/SPARK-37973
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Boyang Jerry Peng


In regards to [https://github.com/apache/spark/pull/35238] and more 
specifically these lines:

 

[https://github.com/apache/spark/pull/35238/files#diff-c9def1b07e12775929ebc58e107291495a48461f516db75acc4af4b6ce8b4dc7R106]

 

[https://github.com/apache/spark/pull/35238/files#diff-1ab03f750a3cbd95b7a84bb0c6bb6c6600c79dcb20ea2b43dd825d9aedab9656R140]

 

```scala

maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse(super.getDefaultReadLimit)

```

 

needed to be changed to



```scala

maxOffsetsPerTrigger.map(ReadLimit.maxRows).getOrElse({*}ReadLimit.allAvailable(){*})

```

 

Because of a bug in the scala compiler documented here:

 

[https://github.com/scala/bug/issues/12523]

 

After this bug is fixed we can revert this change, i.e. back to using 
`super.getDefaultReadLimit`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37968) Upgrade commons-collections3 to commons-collections4

2022-01-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37968.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35257
[https://github.com/apache/spark/pull/35257]

> Upgrade commons-collections3 to commons-collections4
> 
>
> Key: SPARK-37968
> URL: https://issues.apache.org/jira/browse/SPARK-37968
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> Apache commons-collections 3.x is a Java 1.3 compatible version, and it does 
> not use Java 5 generics.  
> Apache commons-collections4 4.4 is an upgraded version and it is built with Java 8.
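
For illustration, a small hedged sketch of what the migration means for callers: collections4 lives in a new package and its utilities are generic, so element types are checked at compile time (the Maven coordinate shown is an assumption based on the 4.4 release):
{code:java}
// Dependency (assumed coordinate): org.apache.commons:commons-collections4:4.4
// The 3.x import was org.apache.commons.collections.CollectionUtils (raw types).
import org.apache.commons.collections4.CollectionUtils
import java.util.Arrays

val left  = Arrays.asList("shuffle", "core", "sql")
val right = Arrays.asList("sql", "mllib", "core")

// Generic signature in 4.x: a Collection[String] comes back without casts.
val common: java.util.Collection[String] = CollectionUtils.intersection(left, right)
println(common)  // e.g. [core, sql]
{code}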



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37968) Upgrade commons-collections3 to commons-collections4

2022-01-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37968:
-

Assignee: Yang Jie

> Upgrade commons-collections3 to commons-collections4
> 
>
> Key: SPARK-37968
> URL: https://issues.apache.org/jira/browse/SPARK-37968
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Apache commons-collections 3.x is a Java 1.3 compatible version, and it does 
> not use Java 5 generics.  
> Apache commons-collections4 4.4 is an upgraded version and it is built with Java 8.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479602#comment-17479602
 ] 

Apache Spark commented on SPARK-36879:
--

User 'parthchandra' has created a pull request for this issue:
https://github.com/apache/spark/pull/35262

> Support Parquet v2 data page encodings for the vectorized path
> --
>
> Key: SPARK-36879
> URL: https://issues.apache.org/jira/browse/SPARK-36879
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Parth Chandra
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently Spark only support Parquet V1 encodings (i.e., 
> PLAIN/DICTIONARY/RLE) in the vectorized path, and throws exception otherwise:
> {code}
> java.lang.UnsupportedOperationException: Unsupported encoding: 
> DELTA_BYTE_ARRAY
> {code}
> It will be good to support v2 encodings too, including DELTA_BINARY_PACKED, 
> DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT as 
> listed in https://github.com/apache/parquet-format/blob/master/Encodings.md



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37972) Typing incompatibilities with numpy==1.22.x

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37972:


Assignee: Apache Spark

> Typing incompatibilities with numpy==1.22.x
> ---
>
> Key: SPARK-37972
> URL: https://issues.apache.org/jira/browse/SPARK-37972
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> When type checked against {{numpy==1.22}} mypy detects following issues:
> {code:python}
> python/pyspark/mllib/linalg/__init__.py:412: error: Argument 2 to "norm" has 
> incompatible type "Union[float, str]"; expected "Union[None, float, 
> Literal['fro'], Literal['nuc']]"  [arg-type]
> python/pyspark/mllib/linalg/__init__.py:457: error: No overload variant of 
> "dot" matches argument types "ndarray[Any, Any]", "Iterable[float]"  
> [call-overload]
> python/pyspark/mllib/linalg/__init__.py:457: note: Possible overload variant:
> python/pyspark/mllib/linalg/__init__.py:457: note: def dot(a: 
> Union[_SupportsArray[dtype[Any]], 
> _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
> bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], b: 
> Union[_SupportsArray[dtype[Any]], 
> _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
> bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], out: 
> None = ...) -> Any
> python/pyspark/mllib/linalg/__init__.py:457: note: <1 more non-matching 
> overload not shown>
> python/pyspark/mllib/linalg/__init__.py:707: error: Argument 2 to "norm" has 
> incompatible type "Union[float, str]"; expected "Union[None, float, 
> Literal['fro'], Literal['nuc']]"  [arg-type]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37972) Typing incompatibilities with numpy==1.22.x

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37972:


Assignee: (was: Apache Spark)

> Typing incompatibilities with numpy==1.22.x
> ---
>
> Key: SPARK-37972
> URL: https://issues.apache.org/jira/browse/SPARK-37972
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> When type checked against {{numpy==1.22}} mypy detects following issues:
> {code:python}
> python/pyspark/mllib/linalg/__init__.py:412: error: Argument 2 to "norm" has 
> incompatible type "Union[float, str]"; expected "Union[None, float, 
> Literal['fro'], Literal['nuc']]"  [arg-type]
> python/pyspark/mllib/linalg/__init__.py:457: error: No overload variant of 
> "dot" matches argument types "ndarray[Any, Any]", "Iterable[float]"  
> [call-overload]
> python/pyspark/mllib/linalg/__init__.py:457: note: Possible overload variant:
> python/pyspark/mllib/linalg/__init__.py:457: note: def dot(a: 
> Union[_SupportsArray[dtype[Any]], 
> _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
> bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], b: 
> Union[_SupportsArray[dtype[Any]], 
> _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
> bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], out: 
> None = ...) -> Any
> python/pyspark/mllib/linalg/__init__.py:457: note: <1 more non-matching 
> overload not shown>
> python/pyspark/mllib/linalg/__init__.py:707: error: Argument 2 to "norm" has 
> incompatible type "Union[float, str]"; expected "Union[None, float, 
> Literal['fro'], Literal['nuc']]"  [arg-type]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37972) Typing incompatibilities with numpy==1.22.x

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479448#comment-17479448
 ] 

Apache Spark commented on SPARK-37972:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35261

> Typing incompatibilities with numpy==1.22.x
> ---
>
> Key: SPARK-37972
> URL: https://issues.apache.org/jira/browse/SPARK-37972
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> When type checked against {{numpy==1.22}} mypy detects following issues:
> {code:python}
> python/pyspark/mllib/linalg/__init__.py:412: error: Argument 2 to "norm" has 
> incompatible type "Union[float, str]"; expected "Union[None, float, 
> Literal['fro'], Literal['nuc']]"  [arg-type]
> python/pyspark/mllib/linalg/__init__.py:457: error: No overload variant of 
> "dot" matches argument types "ndarray[Any, Any]", "Iterable[float]"  
> [call-overload]
> python/pyspark/mllib/linalg/__init__.py:457: note: Possible overload variant:
> python/pyspark/mllib/linalg/__init__.py:457: note: def dot(a: 
> Union[_SupportsArray[dtype[Any]], 
> _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
> bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], b: 
> Union[_SupportsArray[dtype[Any]], 
> _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
> bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], out: 
> None = ...) -> Any
> python/pyspark/mllib/linalg/__init__.py:457: note: <1 more non-matching 
> overload not shown>
> python/pyspark/mllib/linalg/__init__.py:707: error: Argument 2 to "norm" has 
> incompatible type "Union[float, str]"; expected "Union[None, float, 
> Literal['fro'], Literal['nuc']]"  [arg-type]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37972) Typing incompatibilities with numpy==1.22.x

2022-01-20 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-37972:
---
Description: 
When type checked against {{numpy==1.22}} mypy detects following issues:


{code:python}
python/pyspark/mllib/linalg/__init__.py:412: error: Argument 2 to "norm" has 
incompatible type "Union[float, str]"; expected "Union[None, float, 
Literal['fro'], Literal['nuc']]"  [arg-type]
python/pyspark/mllib/linalg/__init__.py:457: error: No overload variant of 
"dot" matches argument types "ndarray[Any, Any]", "Iterable[float]"  
[call-overload]
python/pyspark/mllib/linalg/__init__.py:457: note: Possible overload variant:
python/pyspark/mllib/linalg/__init__.py:457: note: def dot(a: 
Union[_SupportsArray[dtype[Any]], _NestedSequence[_SupportsArray[dtype[Any]]], 
bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, 
complex, str, bytes]]], b: Union[_SupportsArray[dtype[Any]], 
_NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], out: 
None = ...) -> Any
python/pyspark/mllib/linalg/__init__.py:457: note: <1 more non-matching 
overload not shown>
python/pyspark/mllib/linalg/__init__.py:707: error: Argument 2 to "norm" has 
incompatible type "Union[float, str]"; expected "Union[None, float, 
Literal['fro'], Literal['nuc']]"  [arg-type]
{code}

  was:
When type checked against {{numpy==1.22}} mypy detects following issues:


{code:python}
python/pyspark/mllib/linalg/__init__.py:101: error: Argument 1 to "DenseVector" 
has incompatible type "Union[Any, _SupportsArray[dtype[Any]], 
_NestedSequence[_SupportsArray[dtype[Any]]], complex, str, bytes, 
_NestedSequence[Union[bool, int, float, complex, str, bytes]], range]"; 
expected "Union[bytes, ndarray[Any, Any], Iterable[float]]"  [arg-type]
python/pyspark/mllib/linalg/__init__.py:137: error: Argument 1 to "len" has 
incompatible type "Union[Any, _SupportsArray[dtype[Any]], 
_NestedSequence[_SupportsArray[dtype[Any]]], complex, str, bytes, 
_NestedSequence[Union[bool, int, float, complex, str, bytes]], range]"; 
expected "Sized"  [arg-type]
python/pyspark/mllib/linalg/__init__.py:412: error: Argument 2 to "norm" has 
incompatible type "Union[float, str]"; expected "Union[None, float, 
Literal['fro'], Literal['nuc']]"  [arg-type]
python/pyspark/mllib/linalg/__init__.py:457: error: No overload variant of 
"dot" matches argument types "ndarray[Any, Any]", "Iterable[float]"  
[call-overload]
python/pyspark/mllib/linalg/__init__.py:457: note: Possible overload variant:
python/pyspark/mllib/linalg/__init__.py:457: note: def dot(a: 
Union[_SupportsArray[dtype[Any]], _NestedSequence[_SupportsArray[dtype[Any]]], 
bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, 
complex, str, bytes]]], b: Union[_SupportsArray[dtype[Any]], 
_NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], out: 
None = ...) -> Any
python/pyspark/mllib/linalg/__init__.py:457: note: <1 more non-matching 
overload not shown>
python/pyspark/mllib/linalg/__init__.py:707: error: Argument 2 to "norm" has 
incompatible type "Union[float, str]"; expected "Union[None, float, 
Literal['fro'], Literal['nuc']]"  [arg-type]
{code}


> Typing incompatibilities with numpy==1.22.x
> ---
>
> Key: SPARK-37972
> URL: https://issues.apache.org/jira/browse/SPARK-37972
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> When type checked against {{numpy==1.22}} mypy detects following issues:
> {code:python}
> python/pyspark/mllib/linalg/__init__.py:412: error: Argument 2 to "norm" has 
> incompatible type "Union[float, str]"; expected "Union[None, float, 
> Literal['fro'], Literal['nuc']]"  [arg-type]
> python/pyspark/mllib/linalg/__init__.py:457: error: No overload variant of 
> "dot" matches argument types "ndarray[Any, Any]", "Iterable[float]"  
> [call-overload]
> python/pyspark/mllib/linalg/__init__.py:457: note: Possible overload variant:
> python/pyspark/mllib/linalg/__init__.py:457: note: def dot(a: 
> Union[_SupportsArray[dtype[Any]], 
> _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
> bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], b: 
> Union[_SupportsArray[dtype[Any]], 
> _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
> bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], out: 
> None = ...) -> Any
> python/pyspark/mllib/linalg/__init__.py:457: note: <1 more non-matching 
> overload not shown>
> python/pyspark/mllib/l

[jira] [Updated] (SPARK-37972) Typing incompatibilities with numpy==1.22.x

2022-01-20 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-37972:
---
Summary: Typing incompatibilities with numpy==1.22.x  (was: Typing 
incompatibilities with numpy~=1.22)

> Typing incompatibilities with numpy==1.22.x
> ---
>
> Key: SPARK-37972
> URL: https://issues.apache.org/jira/browse/SPARK-37972
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> When type checked against {{numpy==1.22}} mypy detects following issues:
> {code:python}
> python/pyspark/mllib/linalg/__init__.py:101: error: Argument 1 to 
> "DenseVector" has incompatible type "Union[Any, _SupportsArray[dtype[Any]], 
> _NestedSequence[_SupportsArray[dtype[Any]]], complex, str, bytes, 
> _NestedSequence[Union[bool, int, float, complex, str, bytes]], range]"; 
> expected "Union[bytes, ndarray[Any, Any], Iterable[float]]"  [arg-type]
> python/pyspark/mllib/linalg/__init__.py:137: error: Argument 1 to "len" has 
> incompatible type "Union[Any, _SupportsArray[dtype[Any]], 
> _NestedSequence[_SupportsArray[dtype[Any]]], complex, str, bytes, 
> _NestedSequence[Union[bool, int, float, complex, str, bytes]], range]"; 
> expected "Sized"  [arg-type]
> python/pyspark/mllib/linalg/__init__.py:412: error: Argument 2 to "norm" has 
> incompatible type "Union[float, str]"; expected "Union[None, float, 
> Literal['fro'], Literal['nuc']]"  [arg-type]
> python/pyspark/mllib/linalg/__init__.py:457: error: No overload variant of 
> "dot" matches argument types "ndarray[Any, Any]", "Iterable[float]"  
> [call-overload]
> python/pyspark/mllib/linalg/__init__.py:457: note: Possible overload variant:
> python/pyspark/mllib/linalg/__init__.py:457: note: def dot(a: 
> Union[_SupportsArray[dtype[Any]], 
> _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
> bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], b: 
> Union[_SupportsArray[dtype[Any]], 
> _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
> bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], out: 
> None = ...) -> Any
> python/pyspark/mllib/linalg/__init__.py:457: note: <1 more non-matching 
> overload not shown>
> python/pyspark/mllib/linalg/__init__.py:707: error: Argument 2 to "norm" has 
> incompatible type "Union[float, str]"; expected "Union[None, float, 
> Literal['fro'], Literal['nuc']]"  [arg-type]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37972) Typing incompatibilities with numpy~=1.22

2022-01-20 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-37972:
--

 Summary: Typing incompatibilities with numpy~=1.22
 Key: SPARK-37972
 URL: https://issues.apache.org/jira/browse/SPARK-37972
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 3.3.0
Reporter: Maciej Szymkiewicz


When type checked against {{numpy==1.22}} mypy detects following issues:


{code:python}
python/pyspark/mllib/linalg/__init__.py:101: error: Argument 1 to "DenseVector" 
has incompatible type "Union[Any, _SupportsArray[dtype[Any]], 
_NestedSequence[_SupportsArray[dtype[Any]]], complex, str, bytes, 
_NestedSequence[Union[bool, int, float, complex, str, bytes]], range]"; 
expected "Union[bytes, ndarray[Any, Any], Iterable[float]]"  [arg-type]
python/pyspark/mllib/linalg/__init__.py:137: error: Argument 1 to "len" has 
incompatible type "Union[Any, _SupportsArray[dtype[Any]], 
_NestedSequence[_SupportsArray[dtype[Any]]], complex, str, bytes, 
_NestedSequence[Union[bool, int, float, complex, str, bytes]], range]"; 
expected "Sized"  [arg-type]
python/pyspark/mllib/linalg/__init__.py:412: error: Argument 2 to "norm" has 
incompatible type "Union[float, str]"; expected "Union[None, float, 
Literal['fro'], Literal['nuc']]"  [arg-type]
python/pyspark/mllib/linalg/__init__.py:457: error: No overload variant of 
"dot" matches argument types "ndarray[Any, Any]", "Iterable[float]"  
[call-overload]
python/pyspark/mllib/linalg/__init__.py:457: note: Possible overload variant:
python/pyspark/mllib/linalg/__init__.py:457: note: def dot(a: 
Union[_SupportsArray[dtype[Any]], _NestedSequence[_SupportsArray[dtype[Any]]], 
bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, 
complex, str, bytes]]], b: Union[_SupportsArray[dtype[Any]], 
_NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, 
bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], out: 
None = ...) -> Any
python/pyspark/mllib/linalg/__init__.py:457: note: <1 more non-matching 
overload not shown>
python/pyspark/mllib/linalg/__init__.py:707: error: Argument 2 to "norm" has 
incompatible type "Union[float, str]"; expected "Union[None, float, 
Literal['fro'], Literal['nuc']]"  [arg-type]
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37971) Apply and evaluate expressions row-wise in a Spark DataFrame

2022-01-20 Thread Carlos Gameiro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Gameiro updated SPARK-37971:
---
Description: 
This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result with PySpark's expr function. However it is not possible to apply the 
expr function row-wise (UDF or map), and evaluate all expressions efficiently.
{code:java}
id  |  sql_expression
--
1   |  abs(-1) + 12
2   |  decode(1,2,3,4) - 1
3   |  30 * 20 - 5

df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}

  was:
This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result. However it is not possible to apply the expr function row-wise (UDF or 
map), and evaluate all expressions efficiently.
{code:java}
id  |  sql_expression
--
1   |  abs(-1) + 12
2   |  decode(1,2,3,4) - 1
3   |  30 * 20 - 5

df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}


> Apply and evaluate expressions row-wise in a Spark DataFrame
> 
>
> Key: SPARK-37971
> URL: https://issues.apache.org/jira/browse/SPARK-37971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Carlos Gameiro
>Priority: Critical
>
> This functionality would serve very specific use cases.
> Consider a DataFrame with a column of SQL expressions encoded as strings. 
> Individually it's possible to evaluate each string and obtain the 
> corresponding result with PySpark's expr function. However it is not possible 
> to apply the expr function row-wise (UDF or map), and evaluate all 
> expressions efficiently.
> {code:java}
> id  |  sql_expression
> --
> 1   |  abs(-1) + 12
> 2   |  decode(1,2,3,4) - 1
> 3   |  30 * 20 - 5
> df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37971) Apply and evaluate expressions row-wise in a Spark DataFrame

2022-01-20 Thread Carlos Gameiro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Gameiro updated SPARK-37971:
---
Description: 
This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result with PySpark's expr function. However it is not possible to apply the 
expr function row-wise to a dataframe (UDF or map), and evaluate all 
expressions efficiently.
{code:java}
id  |  sql_expression
--
1   |  abs(-1) + 12
2   |  decode(1,2,3,4) - 1
3   |  30 * 20 - 5

df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}

  was:
This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result with PySpark's expr function. However it is not possible to apply the 
expr function row-wise (UDF or map), and evaluate all expressions efficiently.
{code:java}
id  |  sql_expression
--
1   |  abs(-1) + 12
2   |  decode(1,2,3,4) - 1
3   |  30 * 20 - 5

df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}


> Apply and evaluate expressions row-wise in a Spark DataFrame
> 
>
> Key: SPARK-37971
> URL: https://issues.apache.org/jira/browse/SPARK-37971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Carlos Gameiro
>Priority: Critical
>
> This functionality would serve very specific use cases.
> Consider a DataFrame with a column of SQL expressions encoded as strings. 
> Individually it's possible to evaluate each string and obtain the 
> corresponding result with PySpark's expr function. However it is not possible 
> to apply the expr function row-wise to a dataframe (UDF or map), and evaluate 
> all expressions efficiently.
> {code:java}
> id  |  sql_expression
> --
> 1   |  abs(-1) + 12
> 2   |  decode(1,2,3,4) - 1
> 3   |  30 * 20 - 5
> df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37971) Apply and evaluate expressions row-wise in a Spark DataFrame

2022-01-20 Thread Carlos Gameiro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Gameiro updated SPARK-37971:
---
Summary: Apply and evaluate expressions row-wise in a Spark DataFrame  
(was: Apply and evaluate expressiosn row-wise in a Spark DataFrame)

> Apply and evaluate expressions row-wise in a Spark DataFrame
> 
>
> Key: SPARK-37971
> URL: https://issues.apache.org/jira/browse/SPARK-37971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Carlos Gameiro
>Priority: Critical
>
> This functionality would serve very specific use cases.
> Consider a DataFrame with a column of SQL expressions encoded as strings. 
> Individually it's possible to evaluate each string and obtain the 
> corresponding result. However it is not possible to apply the expr function 
> row-wise (UDF or map), and evaluate all expressions efficiently.
> {code:java}
> id  |  sql_expression
> --
> 1   |  abs(-1) + 12
> 2   |  decode(1,2,3,4) - 1
> 3   |  30 * 20 - 5
> df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37971) Apply and evaluate expressiosn row-wise in a Spark DataFrame

2022-01-20 Thread Carlos Gameiro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Gameiro updated SPARK-37971:
---
Description: 
This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result. However it is not possible to apply the expr function row-wise (UDF or 
map), and evaluate all expressions efficiently.
{code:java}
id  |  sql_expression
--
1   |  abs(-1) + 12
2   |  decode(1,2,3,4) - 1
3   |  30 * 20 - 5

df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}

  was:
This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result. However it is not possible to apply the expr function row-wise (UDF or 
map), and evaluate all expression efficiently.
{code:java}
id  |  sql_expression
--
1   |  abs(-1) + 12
2   |  decode(1,2,3,4) - 1
3   |  30 * 20 - 5

df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}


> Apply and evaluate expressiosn row-wise in a Spark DataFrame
> 
>
> Key: SPARK-37971
> URL: https://issues.apache.org/jira/browse/SPARK-37971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Carlos Gameiro
>Priority: Critical
>
> This functionality would serve very specific use cases.
> Consider a DataFrame with a column of SQL expressions encoded as strings. 
> Individually it's possible to evaluate each string and obtain the 
> corresponding result. However it is not possible to apply the expr function 
> row-wise (UDF or map), and evaluate all expressions efficiently.
> {code:java}
> id  |  sql_expression
> --
> 1   |  abs(-1) + 12
> 2   |  decode(1,2,3,4) - 1
> 3   |  30 * 20 - 5
> df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37971) Apply and evaluate expressiosn row-wise in a Spark DataFrame

2022-01-20 Thread Carlos Gameiro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Gameiro updated SPARK-37971:
---
Summary: Apply and evaluate expressiosn row-wise in a Spark DataFrame  
(was: Apply and evaluate expressiosn row-wise in a DataFrame)

> Apply and evaluate expressiosn row-wise in a Spark DataFrame
> 
>
> Key: SPARK-37971
> URL: https://issues.apache.org/jira/browse/SPARK-37971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Carlos Gameiro
>Priority: Critical
>
> This functionality would serve very specific use cases.
> Consider a DataFrame with a column of SQL expressions encoded as strings. 
> Individually it's possible to evaluate each string and obtain the 
> corresponding result. However it is not possible to apply the expr function 
> row-wise (UDF or map), and evaluate all expression efficiently.
> {code:java}
> id  |  sql_expression
> --
> 1   |  abs(-1) + 12
> 2   |  decode(1,2,3,4) - 1
> 3   |  30 * 20 - 5
> df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37971) Apply and evaluate expressiosn row-wise in a DataFrame

2022-01-20 Thread Carlos Gameiro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Gameiro updated SPARK-37971:
---
Description: 
This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result. However it is not possible to apply the expr function row-wise (UDF or 
map), and evaluate all expression efficiently.
{code:java}
id  |  sql_expression
--
1   |  abs(-1) + 12
2   |  decode(1,2,3,4) - 1
3   |  30 * 20 - 5

df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}

  was:
This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result. However it is not possible to apply the expr function row-wise (UDF or 
map), and evaluate all expression efficiently.

 
{code:java}
id  |  sql_expression
--
1   |  abs(-1) + 12
2   |  decode(1,2,3,4) - 1
3   |  30 * 20 - 5

df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}


> Apply and evaluate expressiosn row-wise in a DataFrame
> --
>
> Key: SPARK-37971
> URL: https://issues.apache.org/jira/browse/SPARK-37971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Carlos Gameiro
>Priority: Critical
>
> This functionality would serve very specific use cases.
> Consider a DataFrame with a column of SQL expressions encoded as strings. 
> Individually it's possible to evaluate each string and obtain the 
> corresponding result. However it is not possible to apply the expr function 
> row-wise (UDF or map), and evaluate all expression efficiently.
> {code:java}
> id  |  sql_expression
> --
> 1   |  abs(-1) + 12
> 2   |  decode(1,2,3,4) - 1
> 3   |  30 * 20 - 5
> df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37971) Apply and evaluate expressiosn row-wise in a DataFrame

2022-01-20 Thread Carlos Gameiro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Gameiro updated SPARK-37971:
---
Description: 
This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result. However it is not possible to apply the expr function row-wise (UDF or 
map), and evaluate all expression efficiently.

 
{code:java}
id  |  sql_expression
--
1   |  abs(-1) + 12
2   |  decode(1,2,3,4) - 1
3   |  30 * 20 - 5

df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}

  was:
This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result. However it is not possible to apply the expr function row-wise (UDF or 
map), and evaluate all expression efficiently.

id  |  sql_expression

1   |  abs(-1) + 12

2   |  decode(1,2,3,4) - 1

3   |  30 * 20 - 5

df = df.withColumn('sql_eval', f.expr_row('sql_expression'))


> Apply and evaluate expressiosn row-wise in a DataFrame
> --
>
> Key: SPARK-37971
> URL: https://issues.apache.org/jira/browse/SPARK-37971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Carlos Gameiro
>Priority: Critical
>
> This functionality would serve very specific use cases.
> Consider a DataFrame with a column of SQL expressions encoded as strings. 
> Individually it's possible to evaluate each string and obtain the 
> corresponding result. However it is not possible to apply the expr function 
> row-wise (UDF or map), and evaluate all expression efficiently.
>  
> {code:java}
> id  |  sql_expression
> --
> 1   |  abs(-1) + 12
> 2   |  decode(1,2,3,4) - 1
> 3   |  30 * 20 - 5
> df = df.withColumn('sql_eval', f.expr_row('sql_expression')) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37971) Apply and evaluate expressiosn row-wise in a DataFrame

2022-01-20 Thread Carlos Gameiro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Gameiro updated SPARK-37971:
---
Description: 
This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result. However it is not possible to apply the expr function row-wise (UDF or 
map), and evaluate all expression efficiently.

id  |  sql_expression

1   |  abs(-1) + 12

2   |  decode(1,2,3,4) - 1

3   |  30 * 20 - 5

df = df.withColumn('sql_eval', f.expr_row('sql_expression'))

  was:
This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result. However it is not possible to apply the expr function row-wise (UDF or 
map), and evaluate all expression efficiently.


> Apply and evaluate expressiosn row-wise in a DataFrame
> --
>
> Key: SPARK-37971
> URL: https://issues.apache.org/jira/browse/SPARK-37971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Carlos Gameiro
>Priority: Critical
>
> This functionality would serve very specific use cases.
> Consider a DataFrame with a column of SQL expressions encoded as strings. 
> Individually it's possible to evaluate each string and obtain the 
> corresponding result. However it is not possible to apply the expr function 
> row-wise (UDF or map), and evaluate all expression efficiently.
> id  |  sql_expression
> 1   |  abs(-1) + 12
> 2   |  decode(1,2,3,4) - 1
> 3   |  30 * 20 - 5
> df = df.withColumn('sql_eval', f.expr_row('sql_expression'))



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37971) Apply and evaluate expressiosn row-wise in a DataFrame

2022-01-20 Thread Carlos Gameiro (Jira)
Carlos Gameiro created SPARK-37971:
--

 Summary: Apply and evaluate expressiosn row-wise in a DataFrame
 Key: SPARK-37971
 URL: https://issues.apache.org/jira/browse/SPARK-37971
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Carlos Gameiro


This functionality would serve very specific use cases.

Consider a DataFrame with a column of SQL expressions encoded as strings. 
Individually it's possible to evaluate each string and obtain the corresponding 
result. However it is not possible to apply the expr function row-wise (UDF or 
map), and evaluate all expression efficiently.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37533) New SQL function: try_element_at

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479380#comment-17479380
 ] 

Apache Spark commented on SPARK-37533:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35260

> New SQL function: try_element_at
> 
>
> Key: SPARK-37533
> URL: https://issues.apache.org/jira/browse/SPARK-37533
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Add a new SQL function `try_element_at`, which is identical to `element_at`
> except that it returns null if an error occurs
>  
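
A quick hedged illustration of the intended semantics (the literals are made up; exact behavior of `element_at` depends on ANSI settings):
{code:java}
// element_at can raise an error for an out-of-range array index (notably under
// ANSI mode), while try_element_at is meant to return NULL instead.
spark.sql("SELECT try_element_at(array(1, 2, 3), 5)").show()         // NULL instead of an error
spark.sql("SELECT try_element_at(map('a', 1, 'b', 2), 'z')").show()  // NULL for a missing key
{code}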



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37533) New SQL function: try_element_at

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479381#comment-17479381
 ] 

Apache Spark commented on SPARK-37533:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35260

> New SQL function: try_element_at
> 
>
> Key: SPARK-37533
> URL: https://issues.apache.org/jira/browse/SPARK-37533
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Add a new SQL function `try_element_at`, which is identical to `element_at`
> except that it returns null if an error occurs
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35028) ANSI mode: disallow group by aliases

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479378#comment-17479378
 ] 

Apache Spark commented on SPARK-35028:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35260

> ANSI mode: disallow group by aliases
> 
>
> Key: SPARK-35028
> URL: https://issues.apache.org/jira/browse/SPARK-35028
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> As per the ANSI SQL standard section 7.12:
> bq. Each <grouping column reference> shall unambiguously reference a column
> of the table resulting from the <from clause>. A column referenced in a
> <group by clause> is a grouping column.
> By forbidding it, we can avoid ambiguous SQL queries like:
> SELECT col + 1 as col FROM t GROUP BY col
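
A small hedged sketch of the behavior difference (table and column names are made up, and exact error text varies by version):
{code:java}
// With ANSI mode on, GROUP BY resolves only against input columns, so the
// ambiguous alias form is rejected and the intent must be spelled out.
spark.conf.set("spark.sql.ansi.enabled", "true")

// Ambiguous: does "col" mean the input column or the "col + 1" alias?
// Rejected under ANSI mode.
// spark.sql("SELECT col + 1 AS col FROM t GROUP BY col")

// Unambiguous alternatives:
spark.sql("SELECT col + 1 AS col1 FROM t GROUP BY col")      // group by the input column
spark.sql("SELECT col + 1 AS col1 FROM t GROUP BY col + 1")  // group by the full expression
{code}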



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37963) Need to update Partition URI after renaming table in InMemoryCatalog

2022-01-20 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-37963.

Fix Version/s: 3.3.0
   3.2.1
   Resolution: Fixed

Issue resolved by pull request 35251
[https://github.com/apache/spark/pull/35251]

> Need to update Partition URI after renaming table in InMemoryCatalog
> 
>
> Key: SPARK-37963
> URL: https://issues.apache.org/jira/browse/SPARK-37963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0, 3.2.1
>
>
> After renaming a partitioned table, selecting from the new table with
> InMemoryCatalog returns an empty result.
> The following checkAnswer fails because the result is empty.
> {code:java}
> sql(s"create table foo(i int, j int) using PARQUET partitioned by (j)")
> sql("insert into table foo partition(j=2) values (1)")
> sql(s"alter table foo rename to bar")
> checkAnswer(spark.table("bar"), Row(1, 2)) {code}
> To fix the bug, we need to update the partition URIs after renaming a table
> in InMemoryCatalog.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37970) Introduce a new interface on streaming data source to notify the latest seen offset

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479335#comment-17479335
 ] 

Apache Spark commented on SPARK-37970:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/35259

> Introduce a new interface on streaming data source to notify the latest seen 
> offset
> ---
>
> Key: SPARK-37970
> URL: https://issues.apache.org/jira/browse/SPARK-37970
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We have found cases where it would be handy and useful for a streaming data
> source to know the latest seen offset when the query is restarted, in order
> to implement certain features. One useful case is enabling the data source to
> track the offset by itself, for the case where the external storage behind
> the data source does not expose any API to provide the latest available offset.
> We propose a new interface for streaming data sources through which Spark
> passes the latest seen offset whenever the query is restarted. For the first
> start of the query, the initial offset of the data source should still be
> retrieved by calling initialOffset.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37970) Introduce a new interface on streaming data source to notify the latest seen offset

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37970:


Assignee: Apache Spark

> Introduce a new interface on streaming data source to notify the latest seen 
> offset
> ---
>
> Key: SPARK-37970
> URL: https://issues.apache.org/jira/browse/SPARK-37970
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> We have found cases where it would be handy and useful for a streaming data
> source to know the latest seen offset when the query is restarted, in order
> to implement certain features. One useful case is enabling the data source to
> track the offset by itself, for the case where the external storage behind
> the data source does not expose any API to provide the latest available offset.
> We propose a new interface for streaming data sources through which Spark
> passes the latest seen offset whenever the query is restarted. For the first
> start of the query, the initial offset of the data source should still be
> retrieved by calling initialOffset.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37970) Introduce a new interface on streaming data source to notify the latest seen offset

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37970:


Assignee: (was: Apache Spark)

> Introduce a new interface on streaming data source to notify the latest seen 
> offset
> ---
>
> Key: SPARK-37970
> URL: https://issues.apache.org/jira/browse/SPARK-37970
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We have found cases where it would be handy and useful for a streaming data
> source to know the latest seen offset when the query is restarted, in order
> to implement certain features. One useful case is enabling the data source to
> track the offset by itself, for the case where the external storage behind
> the data source does not expose any API to provide the latest available offset.
> We propose a new interface for streaming data sources through which Spark
> passes the latest seen offset whenever the query is restarted. For the first
> start of the query, the initial offset of the data source should still be
> retrieved by calling initialOffset.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37970) Introduce a new interface on streaming data source to notify the latest seen offset

2022-01-20 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-37970:


 Summary: Introduce a new interface on streaming data source to 
notify the latest seen offset
 Key: SPARK-37970
 URL: https://issues.apache.org/jira/browse/SPARK-37970
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Jungtaek Lim


We have found cases where it would be handy and useful for a streaming data
source to know the latest seen offset when the query is restarted, in order to
implement certain features. One useful case is enabling the data source to
track the offset by itself, for the case where the external storage behind the
data source does not expose any API to provide the latest available offset.

We propose a new interface for streaming data sources through which Spark
passes the latest seen offset whenever the query is restarted. For the first
start of the query, the initial offset of the data source should still be
retrieved by calling initialOffset.
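
A hypothetical sketch of what such an interface could look like. The trait and method names are illustrative assumptions, not the actual Spark API, and the Offset trait here is a stand-in for the source's own offset representation:

{code:scala}
// Sketch only: names are hypothetical and do not reflect the real Spark interface.
trait Offset {
  def json(): String // stand-in for the source's serialized offset
}

// A streaming source mixing in this trait would be told, on query restart,
// the latest offset Spark has already seen for it. On the very first start
// nothing has been seen yet, and the source still answers initialOffset.
trait LatestSeenOffsetAware {
  def setLatestSeenOffset(offset: Offset): Unit
}
{code}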



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37970) Introduce a new interface on streaming data source to notify the latest seen offset

2022-01-20 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479268#comment-17479268
 ] 

Jungtaek Lim commented on SPARK-37970:
--

Will submit a PR soon.

> Introduce a new interface on streaming data source to notify the latest seen 
> offset
> ---
>
> Key: SPARK-37970
> URL: https://issues.apache.org/jira/browse/SPARK-37970
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We have found cases where it would be handy and useful for a streaming data
> source to know the latest seen offset when the query is restarted, in order
> to implement certain features. One useful case is enabling the data source to
> track the offset by itself, for the case where the external storage behind
> the data source does not expose any API to provide the latest available offset.
> We propose a new interface for streaming data sources through which Spark
> passes the latest seen offset whenever the query is restarted. For the first
> start of the query, the initial offset of the data source should still be
> retrieved by calling initialOffset.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37969) Hive Serde insert should check schema before execution

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479242#comment-17479242
 ] 

Apache Spark commented on SPARK-37969:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35258

> Hive Serde insert should check schema before execution
> --
>
> Key: SPARK-37969
> URL: https://issues.apache.org/jira/browse/SPARK-37969
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> [info]   Cause: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 0.0 (TID 0) (10.12.188.15 executor driver): 
> java.lang.IllegalArgumentException: Error: : expected at the position 19 of 
> 'struct' but '(' is found.
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:384)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:507)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:814)
> [info]at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:112)
> [info]at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:122)
> [info]at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:105)
> [info]at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
> [info]at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146)
> [info]at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:313)
> [info]at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:252)
> [info]at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:136)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
> [info]at 
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1475)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info]
> [info]   Cause: java.lang.IllegalArgumentException: field ended by ';': 
> expected ';' but got 'IF' at line 2:   optional int32 (IF
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.check(MessageTypeParser.java:239)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:208)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:113)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:101)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:94)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:84)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.getSchema(DataWritableWriteSupport.java:43)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.init(DataWritableWriteSupport.java:48)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:476)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:430)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:425)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.(ParquetRecordWriterWrapper.java:70)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:137)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:126)
> [info

[jira] [Commented] (SPARK-37969) Hive Serde insert should check schema before execution

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479241#comment-17479241
 ] 

Apache Spark commented on SPARK-37969:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35258

> Hive Serde insert should check schema before execution
> --
>
> Key: SPARK-37969
> URL: https://issues.apache.org/jira/browse/SPARK-37969
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> [info]   Cause: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 0.0 (TID 0) (10.12.188.15 executor driver): 
> java.lang.IllegalArgumentException: Error: : expected at the position 19 of 
> 'struct' but '(' is found.
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:384)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:507)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:814)
> [info]at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:112)
> [info]at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:122)
> [info]at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:105)
> [info]at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
> [info]at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146)
> [info]at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:313)
> [info]at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:252)
> [info]at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:136)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
> [info]at 
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1475)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info]
> [info]   Cause: java.lang.IllegalArgumentException: field ended by ';': 
> expected ';' but got 'IF' at line 2:   optional int32 (IF
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.check(MessageTypeParser.java:239)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:208)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:113)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:101)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:94)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:84)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.getSchema(DataWritableWriteSupport.java:43)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.init(DataWritableWriteSupport.java:48)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:476)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:430)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:425)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.(ParquetRecordWriterWrapper.java:70)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:137)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:126)
> [info

[jira] [Assigned] (SPARK-37969) Hive Serde insert should check schema before execution

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37969:


Assignee: (was: Apache Spark)

> Hive Serde insert should check schema before execution
> --
>
> Key: SPARK-37969
> URL: https://issues.apache.org/jira/browse/SPARK-37969
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> [info]   Cause: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 0.0 (TID 0) (10.12.188.15 executor driver): 
> java.lang.IllegalArgumentException: Error: : expected at the position 19 of 
> 'struct' but '(' is found.
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:384)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:507)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:814)
> [info]at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:112)
> [info]at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:122)
> [info]at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:105)
> [info]at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
> [info]at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146)
> [info]at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:313)
> [info]at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:252)
> [info]at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:136)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
> [info]at 
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1475)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info]
> [info]   Cause: java.lang.IllegalArgumentException: field ended by ';': 
> expected ';' but got 'IF' at line 2:   optional int32 (IF
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.check(MessageTypeParser.java:239)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:208)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:113)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:101)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:94)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:84)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.getSchema(DataWritableWriteSupport.java:43)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.init(DataWritableWriteSupport.java:48)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:476)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:430)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:425)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.(ParquetRecordWriterWrapper.java:70)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:137)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:126)
> [info]   at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:286)
> [info]   at 

[jira] [Assigned] (SPARK-37969) Hive Serde insert should check schema before execution

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37969:


Assignee: Apache Spark

> Hive Serde insert should check schema before execution
> --
>
> Key: SPARK-37969
> URL: https://issues.apache.org/jira/browse/SPARK-37969
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> [info]   Cause: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 0.0 (TID 0) (10.12.188.15 executor driver): 
> java.lang.IllegalArgumentException: Error: : expected at the position 19 of 
> 'struct' but '(' is found.
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:384)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:507)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)
> [info]at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:814)
> [info]at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:112)
> [info]at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:122)
> [info]at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:105)
> [info]at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
> [info]at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146)
> [info]at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:313)
> [info]at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:252)
> [info]at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:136)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
> [info]at 
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1475)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info]
> [info]   Cause: java.lang.IllegalArgumentException: field ended by ';': 
> expected ';' but got 'IF' at line 2:   optional int32 (IF
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.check(MessageTypeParser.java:239)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:208)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:113)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:101)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:94)
> [info]   at 
> org.apache.parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:84)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.getSchema(DataWritableWriteSupport.java:43)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.init(DataWritableWriteSupport.java:48)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:476)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:430)
> [info]   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:425)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.(ParquetRecordWriterWrapper.java:70)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:137)
> [info]   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:126)
> [info]   at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils

[jira] [Created] (SPARK-37969) Hive Serde insert should check schema before execution

2022-01-20 Thread angerszhu (Jira)
angerszhu created SPARK-37969:
-

 Summary: Hive Serde insert should check schema before execution
 Key: SPARK-37969
 URL: https://issues.apache.org/jira/browse/SPARK-37969
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu



{code:java}
[info]   Cause: org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 
in stage 0.0 (TID 0) (10.12.188.15 executor driver): 
java.lang.IllegalArgumentException: Error: : expected at the position 19 of 
'struct' but '(' is found.
[info]  at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:384)
[info]  at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355)
[info]  at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:507)
[info]  at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)
[info]  at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:814)
[info]  at 
org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:112)
[info]  at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:122)
[info]  at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:105)
[info]  at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
[info]  at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146)
[info]  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:313)
[info]  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:252)
[info]  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[info]  at org.apache.spark.scheduler.Task.run(Task.scala:136)
[info]  at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
[info]  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1475)
[info]  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
[info]  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]  at java.lang.Thread.run(Thread.java:748)
[info]







[info]   Cause: java.lang.IllegalArgumentException: field ended by ';': 
expected ';' but got 'IF' at line 2:   optional int32 (IF
[info]   at 
org.apache.parquet.schema.MessageTypeParser.check(MessageTypeParser.java:239)
[info]   at 
org.apache.parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:208)
[info]   at 
org.apache.parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:113)
[info]   at 
org.apache.parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:101)
[info]   at 
org.apache.parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:94)
[info]   at 
org.apache.parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:84)
[info]   at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.getSchema(DataWritableWriteSupport.java:43)
[info]   at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.init(DataWritableWriteSupport.java:48)
[info]   at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:476)
[info]   at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:430)
[info]   at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:425)
[info]   at 
org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.(ParquetRecordWriterWrapper.java:70)
[info]   at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:137)
[info]   at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:126)
[info]   at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:286)
[info]   at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:271)
[info]   at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:132)
[info]   at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:105)
[info]   at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
[info]   at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileForm

[jira] [Updated] (SPARK-36057) Support volcano/alternative schedulers

2022-01-20 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-36057:

 Shepherd: Holden Karau
Affects Version/s: 3.3.0
   (was: 3.2.0)

> Support volcano/alternative schedulers
> --
>
> Key: SPARK-36057
> URL: https://issues.apache.org/jira/browse/SPARK-36057
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Holden Karau
>Priority: Major
>
> This is an umbrella issue for tracking the work to support Volcano and
> Yunikorn on Kubernetes. These schedulers provide more YARN-like features
> (such as queues and minimum resources before scheduling jobs) that many folks
> want on Kubernetes.
>  
> Yunikorn is an ASF project & Volcano is a CNCF project (sig-batch).
>  
> They've taken slightly different approaches to solving the same problem, but 
> from Spark's point of view we should be able to share much of the code.
>  
> See the initial brainstorming discussion in SPARK-35623.
>  
> Google DOC:
> [https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37708) pyspark adding third-party Dependencies on k8s

2022-01-20 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong resolved SPARK-37708.
-
Resolution: Not A Problem

> pyspark adding third-party Dependencies on k8s
> --
>
> Key: SPARK-37708
> URL: https://issues.apache.org/jira/browse/SPARK-37708
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.2.0
> Environment: pyspark3.2
>Reporter: jingxiong zhong
>Priority: Major
>
> I have a question about how to add my Python dependencies to a Spark job, as
> follows:
> {code:sh}
> spark-submit \
> --archives s3a://path/python3.6.9.tgz#python3.6.9 \
> --conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
> --conf "spark.pyspark.python=python3.6.9/bin/python3" \
> --name "piroottest" \
> ./examples/src/main/python/pi.py 10
> {code}
> This can't run my job successfully; it throws the following error:
> {code:sh}
> Traceback (most recent call last):
>   File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 
> 
> from pyspark.sql import SparkSession
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 
> 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, 
> in 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 
> 
> async def _ag():
>   File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", 
> line 7, in 
> from _ctypes import Union, Structure, Array
> ImportError: libffi.so.6: cannot open shared object file: No such file or 
> directory
> {code}
> Or is there another way to add Python dependencies?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28137) Data Type Formatting Functions: `to_number`

2022-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28137.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35060
[https://github.com/apache/spark/pull/35060]

> Data Type Formatting Functions: `to_number`
> ---
>
> Key: SPARK-28137
> URL: https://issues.apache.org/jira/browse/SPARK-28137
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> ||Function||Return Type||Description||Example||
> |{{to_number(}}{{text}}{{, }}{{text}}{{)}}|{{numeric}}|convert string to 
> numeric|{{to_number('12,454.8-', '99G999D9S')}}|
> https://www.postgresql.org/docs/12/functions-formatting.html
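
A minimal usage sketch, assuming the Spark implementation accepts the PostgreSQL-style format elements shown above (9 for a digit, G for the grouping separator, D for the decimal point); the literal values are illustrative only:

{code:scala}
// Sketch only: converts a formatted string into a decimal value.
// Assumes a running SparkSession `spark` on a version that ships to_number.
spark.sql("SELECT to_number('12,454.8', '99G999D9')").show()
{code}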



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28137) Data Type Formatting Functions: `to_number`

2022-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28137:
---

Assignee: jiaan.geng

> Data Type Formatting Functions: `to_number`
> ---
>
> Key: SPARK-28137
> URL: https://issues.apache.org/jira/browse/SPARK-28137
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: jiaan.geng
>Priority: Major
>
> ||Function||Return Type||Description||Example||
> |{{to_number(}}{{text}}{{, }}{{text}}{{)}}|{{numeric}}|convert string to 
> numeric|{{to_number('12,454.8-', '99G999D9S')}}|
> https://www.postgresql.org/docs/12/functions-formatting.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37922) Combine to one cast if we can safely up-cast two casts

2022-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37922:
---

Assignee: Yuming Wang

> Combine to one cast if we can safely up-cast two casts
> --
>
> Key: SPARK-37922
> URL: https://issues.apache.org/jira/browse/SPARK-37922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37922) Combine to one cast if we can safely up-cast two casts

2022-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37922.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35220
[https://github.com/apache/spark/pull/35220]

> Combine to one cast if we can safely up-cast two casts
> --
>
> Key: SPARK-37922
> URL: https://issues.apache.org/jira/browse/SPARK-37922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37968) Upgrade commons-collections3 to commons-collections4

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479198#comment-17479198
 ] 

Apache Spark commented on SPARK-37968:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35257

> Upgrade commons-collections3 to commons-collections4
> 
>
> Key: SPARK-37968
> URL: https://issues.apache.org/jira/browse/SPARK-37968
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Apache commons-collections 3.x is a Java 1.3-compatible version and does not
> use Java 5 generics.
> Apache commons-collections4 4.4 is an upgraded version that is built with Java 8.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37968) Upgrade commons-collections3 to commons-collections4

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37968:


Assignee: Apache Spark

> Upgrade commons-collections3 to commons-collections4
> 
>
> Key: SPARK-37968
> URL: https://issues.apache.org/jira/browse/SPARK-37968
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Apache commons-collections 3.x is a Java 1.3-compatible version and does not
> use Java 5 generics.
> Apache commons-collections4 4.4 is an upgraded version that is built with Java 8.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37968) Upgrade commons-collections3 to commons-collections4

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37968:


Assignee: (was: Apache Spark)

> Upgrade commons-collections3 to commons-collections4
> 
>
> Key: SPARK-37968
> URL: https://issues.apache.org/jira/browse/SPARK-37968
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> Apache commons-collections 3.x is a Java 1.3-compatible version and does not
> use Java 5 generics.
> Apache commons-collections4 4.4 is an upgraded version that is built with Java 8.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37968) Upgrade commons-collections3 to commons-collections4

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479197#comment-17479197
 ] 

Apache Spark commented on SPARK-37968:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35257

> Upgrade commons-collections3 to commons-collections4
> 
>
> Key: SPARK-37968
> URL: https://issues.apache.org/jira/browse/SPARK-37968
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> Apache commons-collections 3.x is a Java 1.3-compatible version and does not
> use Java 5 generics.
> Apache commons-collections4 4.4 is an upgraded version that is built with Java 8.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37968) Upgrade commons-collections3 to commons-collections4

2022-01-20 Thread Yang Jie (Jira)
Yang Jie created SPARK-37968:


 Summary: Upgrade commons-collections3 to commons-collections4
 Key: SPARK-37968
 URL: https://issues.apache.org/jira/browse/SPARK-37968
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: Yang Jie


Apache commons-collections 3.x is a Java 1.3-compatible version and does not
use Java 5 generics.

Apache commons-collections4 4.4 is an upgraded version that is built with Java 8.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37931) Quote the column name if needed

2022-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37931:
---

Assignee: PengLei

> Quote the column name if needed
> ---
>
> Key: SPARK-37931
> URL: https://issues.apache.org/jira/browse/SPARK-37931
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: PengLei
>Priority: Major
>
> Quote the column name only if needed, instead of always quoting it.
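
A minimal sketch of the idea (not Spark's internal helper): quote an identifier only when it contains characters that require quoting, and escape embedded backticks.

{code:scala}
// Sketch only: illustrative helper, not the actual Spark implementation.
def quoteIfNeeded(name: String): String = {
  // A plain identifier made of letters, digits and underscores needs no quoting.
  if (name.matches("[a-zA-Z_][a-zA-Z0-9_]*")) name
  else "`" + name.replace("`", "``") + "`"
}

quoteIfNeeded("price")      // price
quoteIfNeeded("unit price") // `unit price`
{code}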



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37931) Quote the column name if needed

2022-01-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37931.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35227
[https://github.com/apache/spark/pull/35227]

> Quote the column name if needed
> ---
>
> Key: SPARK-37931
> URL: https://issues.apache.org/jira/browse/SPARK-37931
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>
> Quote the column name only if needed, instead of always quoting it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37933) Limit push down for parquet datasource v2

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479173#comment-17479173
 ] 

Apache Spark commented on SPARK-37933:
--

User 'stczwd' has created a pull request for this issue:
https://github.com/apache/spark/pull/35256

> Limit push down for parquet datasource v2
> -
>
> Key: SPARK-37933
> URL: https://issues.apache.org/jira/browse/SPARK-37933
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jackey Lee
>Assignee: Jackey Lee
>Priority: Major
> Fix For: 3.3.0
>
>
> Based on SPARK-37020, we can support limit push-down in the Parquet
> DataSource V2 reader. This lets the Parquet scan stop early, reducing network
> and disk IO.
> Current plans for a limit query on Parquet (before this change):
> {code:java}
> == Parsed Logical Plan ==
> GlobalLimit 10
> +- LocalLimit 10
>    +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
> == Analyzed Logical Plan ==
> a: int, b: int
> GlobalLimit 10
> +- LocalLimit 10
>    +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
> == Optimized Logical Plan ==
> GlobalLimit 10
> +- LocalLimit 10
>    +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
> == Physical Plan ==
> CollectLimit 10
> +- *(1) ColumnarToRow
>    +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, 
> Location: InMemoryFileIndex(1 
> paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], 
> PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: 
> struct, PushedFilters: [], PushedAggregation: [], PushedGroupBy: 
> [] RuntimeFilters: [] {code}
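
For reference, a minimal sketch of the kind of query this targets; the path is illustrative, and the snippet assumes Parquet is read through DataSource V2 (for example, with parquet removed from spark.sql.sources.useV1SourceList) so the limit can be pushed into the scan:

{code:scala}
// Sketch only: path and session are illustrative.
val df = spark.read.parquet("/datasources.db/test_push_down")
df.limit(10).show() // with limit push-down, the Parquet scan can stop early
{code}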



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37933) Limit push down for parquet datasource v2

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37933:


Assignee: Jackey Lee  (was: Apache Spark)

> Limit push down for parquet datasource v2
> -
>
> Key: SPARK-37933
> URL: https://issues.apache.org/jira/browse/SPARK-37933
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jackey Lee
>Assignee: Jackey Lee
>Priority: Major
> Fix For: 3.3.0
>
>
> Based on SPARK-37020, we can support limit push-down in the Parquet
> DataSource V2 reader. This lets the Parquet scan stop early, reducing network
> and disk IO.
> Current plans for a limit query on Parquet (before this change):
> {code:java}
> == Parsed Logical Plan ==
> GlobalLimit 10
> +- LocalLimit 10
>    +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
> == Analyzed Logical Plan ==
> a: int, b: int
> GlobalLimit 10
> +- LocalLimit 10
>    +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
> == Optimized Logical Plan ==
> GlobalLimit 10
> +- LocalLimit 10
>    +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
> == Physical Plan ==
> CollectLimit 10
> +- *(1) ColumnarToRow
>    +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, 
> Location: InMemoryFileIndex(1 
> paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], 
> PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: 
> struct, PushedFilters: [], PushedAggregation: [], PushedGroupBy: 
> [] RuntimeFilters: [] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37933) Limit push down for parquet datasource v2

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37933:


Assignee: Apache Spark  (was: Jackey Lee)

> Limit push down for parquet datasource v2
> -
>
> Key: SPARK-37933
> URL: https://issues.apache.org/jira/browse/SPARK-37933
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jackey Lee
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>
> Based on SPARK-37020, we can support limit push-down in the Parquet
> DataSource V2 reader. This lets the Parquet scan stop early, reducing network
> and disk IO.
> Current plans for a limit query on Parquet (before this change):
> {code:java}
> == Parsed Logical Plan ==
> GlobalLimit 10
> +- LocalLimit 10
>    +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
> == Analyzed Logical Plan ==
> a: int, b: int
> GlobalLimit 10
> +- LocalLimit 10
>    +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
> == Optimized Logical Plan ==
> GlobalLimit 10
> +- LocalLimit 10
>    +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
> == Physical Plan ==
> CollectLimit 10
> +- *(1) ColumnarToRow
>    +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, 
> Location: InMemoryFileIndex(1 
> paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], 
> PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: 
> struct, PushedFilters: [], PushedAggregation: [], PushedGroupBy: 
> [] RuntimeFilters: [] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37960) Support aggregate push down with CASE ... WHEN ... ELSE ... END

2022-01-20 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37960:
---
Summary: Support aggregate push down with CASE ... WHEN ... ELSE ... END  
(was: Support aggregate push down SUM(CASE ... WHEN ... ELSE ... END))

> Support aggregate push down with CASE ... WHEN ... ELSE ... END
> ---
>
> Key: SPARK-37960
> URL: https://issues.apache.org/jira/browse/SPARK-37960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports pushing down aggregates such as SUM(column) into a
> JDBC data source.
> Supporting SUM(CASE ... WHEN ... ELSE ... END) would be very useful for users.
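
An illustrative example of the shape of query this targets; jdbc_orders is a hypothetical JDBC-backed table, and with the proposed support the whole aggregate would be pushed to the JDBC source instead of being computed in Spark:

{code:scala}
// Sketch only: `jdbc_orders` is a hypothetical JDBC-backed table.
spark.sql("""
  SELECT SUM(CASE WHEN status = 'shipped' THEN amount ELSE 0 END) AS shipped_total
  FROM jdbc_orders
""").show()
{code}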



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37831) Add task partition id in metrics

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479162#comment-17479162
 ] 

Apache Spark commented on SPARK-37831:
--

User 'stczwd' has created a pull request for this issue:
https://github.com/apache/spark/pull/35256

> Add task partition id in metrics
> 
>
> Key: SPARK-37831
> URL: https://issues.apache.org/jira/browse/SPARK-37831
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Jackey Lee
>Priority: Major
>
> There is no partition id in the current metrics, which makes it difficult to
> trace stage metrics, such as stage shuffle read, especially when there are
> stage retries. It also makes it impossible to compare task metrics between
> different applications.
> {code:java}
> class TaskData private[spark](
> val taskId: Long,
> val index: Int,
> val attempt: Int,
> val launchTime: Date,
> val resultFetchStart: Option[Date],
> @JsonDeserialize(contentAs = classOf[JLong])
> val duration: Option[Long],
> val executorId: String,
> val host: String,
> val status: String,
> val taskLocality: String,
> val speculative: Boolean,
> val accumulatorUpdates: Seq[AccumulableInfo],
> val errorMessage: Option[String] = None,
> val taskMetrics: Option[TaskMetrics] = None,
> val executorLogs: Map[String, String],
> val schedulerDelay: Long,
> val gettingResultTime: Long) {code}
> Adding partitionId to TaskData not only makes it easier to trace task
> metrics, but also makes it possible to collect metrics for the actual stage
> outputs, especially when a stage is retried.
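
A sketch of the shape of the change, using a trimmed-down stand-in for TaskData; the real class has many more fields and the exact position of the new field is an assumption:

{code:scala}
// Sketch only: a reduced stand-in for TaskData showing where a partitionId
// field would slot in; not the actual class definition.
case class TaskDataSketch(
    taskId: Long,
    index: Int,
    attempt: Int,
    partitionId: Int, // new: the RDD partition this task computed
    executorId: String,
    host: String,
    status: String)
{code}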



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37967) ConstantFolding/ Literal.create support ObjectType

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37967:


Assignee: (was: Apache Spark)

> ConstantFolding/ Literal.create support ObjectType
> --
>
> Key: SPARK-37967
> URL: https://issues.apache.org/jira/browse/SPARK-37967
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37967) ConstantFolding/ Literal.create support ObjectType

2022-01-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479158#comment-17479158
 ] 

Apache Spark commented on SPARK-37967:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35255

> ConstantFolding/ Literal.create support ObjectType
> --
>
> Key: SPARK-37967
> URL: https://issues.apache.org/jira/browse/SPARK-37967
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37967) ConstantFolding/ Literal.create support ObjectType

2022-01-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37967:


Assignee: Apache Spark

> ConstantFolding/ Literal.create support ObjectType
> --
>
> Key: SPARK-37967
> URL: https://issues.apache.org/jira/browse/SPARK-37967
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org