[jira] [Resolved] (SPARK-31301) flatten the result dataframe of tests in stat

2020-04-13 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-31301.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28176
[https://github.com/apache/spark/pull/28176]

> flatten the result dataframe of tests in stat
> -
>
> Key: SPARK-31301
> URL: https://issues.apache.org/jira/browse/SPARK-31301
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.1.0
>
>
> {code:java}
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
>
> scala> import org.apache.spark.ml.stat.ChiSquareTest
> import org.apache.spark.ml.stat.ChiSquareTest
>
> scala> val data = Seq(
>      |   (0.0, Vectors.dense(0.5, 10.0)),
>      |   (0.0, Vectors.dense(1.5, 20.0)),
>      |   (1.0, Vectors.dense(1.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 40.0)),
>      |   (1.0, Vectors.dense(3.5, 40.0))
>      | )
> data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))
>
> scala> val df = data.toDF("label", "features")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
>
> scala> val chi = ChiSquareTest.test(df, "features", "label")
> chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: array<int> ... 1 more field]
>
> scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+
> {code}
>  
> Current implementations of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}} and 
> {{Correlation}} all return a dataframe containing only one row.
> I think this is quite hard to use: suppose we have a dataset with dim=1000; the 
> only way to deal with the test result is to collect it by {{head()}} or 
> {{first()}} and then use it on the driver, while what I really want to do is 
> filter the dataframe, e.g. {{pValue > 0.1}} or {{corr < 0.5}}. *So I suggest 
> flattening the output dataframe in those tests.*
>  
> Note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but 
> {{ChiSquareTest}} and {{Correlation}} have been here for a long time.
>  
>  
>  
>  
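
A minimal illustrative sketch (not the API change made by this ticket; the column names featureIndex/pValue are illustrative): with the current one-row layout, the per-feature filtering described above can be approximated by flattening manually with vector_to_array and posexplode (both available since Spark 3.0), starting from the {{chi}} dataframe in the snippet above.

{code:scala}
// Sketch only: flatten the one-row ChiSquareTest result by hand so that
// per-feature filtering (e.g. pValue > 0.1) becomes a plain dataframe operation.
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.{col, posexplode}

val flat = chi
  .select(posexplode(vector_to_array(col("pValues"))))  // generator output columns: pos, col
  .toDF("featureIndex", "pValue")                        // illustrative column names

flat.filter(col("pValue") > 0.1).show()
{code}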



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31301) flatten the result dataframe of tests in stat

2020-04-13 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-31301:


Assignee: zhengruifeng

> flatten the result dataframe of tests in stat
> -
>
> Key: SPARK-31301
> URL: https://issues.apache.org/jira/browse/SPARK-31301
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> {code:java}
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
>
> scala> import org.apache.spark.ml.stat.ChiSquareTest
> import org.apache.spark.ml.stat.ChiSquareTest
>
> scala> val data = Seq(
>      |   (0.0, Vectors.dense(0.5, 10.0)),
>      |   (0.0, Vectors.dense(1.5, 20.0)),
>      |   (1.0, Vectors.dense(1.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 40.0)),
>      |   (1.0, Vectors.dense(3.5, 40.0))
>      | )
> data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))
>
> scala> val df = data.toDF("label", "features")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
>
> scala> val chi = ChiSquareTest.test(df, "features", "label")
> chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: array<int> ... 1 more field]
>
> scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+
> {code}
>  
> Current implementations of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}} and 
> {{Correlation}} all return a dataframe containing only one row.
> I think this is quite hard to use: suppose we have a dataset with dim=1000; the 
> only way to deal with the test result is to collect it by {{head()}} or 
> {{first()}} and then use it on the driver, while what I really want to do is 
> filter the dataframe, e.g. {{pValue > 0.1}} or {{corr < 0.5}}. *So I suggest 
> flattening the output dataframe in those tests.*
>  
> Note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but 
> {{ChiSquareTest}} and {{Correlation}} have been here for a long time.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache

2020-04-13 Thread Zhou Jiashuai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082896#comment-17082896
 ] 

Zhou Jiashuai commented on SPARK-26385:
---

I enabled logging with -Dsun.security.krb5.debug=true and 
-Dsun.security.spnego.debug=true and got the following logs. It seems to have 
logged out after running for 24 or 25 hours.
{quote}[UnixLoginModule]: succeeded importing info: 
 uid = 3107
 gid = 3107
 supp gid = 3107
 Debug is true storeKey false useTicketCache true useKeyTab false doNotPrompt 
true ticketCache is null isInitiator true KeyTab is null refreshKrb5Config is 
false principal is null tryFirstPass is false useFirstPass is false storePass 
is false clearPass is false
 Acquire TGT from Cache
 Principal is null
 null credentials from Ticket Cache
 [Krb5LoginModule] authentication failed 
 Unable to obtain Principal Name for authentication 
 [UnixLoginModule]: added UnixPrincipal,
 UnixNumericUserPrincipal,
 UnixNumericGroupPrincipal(s),
 to Subject
 Debug is true storeKey true useTicketCache false useKeyTab true doNotPrompt 
true ticketCache is null isInitiator true KeyTab is 
username.keytab-a0d905e9-3926-422f-8068-ffec9ace4cc2 refreshKrb5Config is true 
principal is usern...@bdp.com tryFirstPass is false useFirstPass is false 
storePass is false clearPass is false
 Refreshing Kerberos configuration
 principal is usern...@bdp.com
 Will use keytab
 Commit Succeeded

Debug is true storeKey true useTicketCache false useKeyTab true doNotPrompt 
true ticketCache is null isInitiator true KeyTab is 
username.keytab-a0d905e9-3926-422f-8068-ffec9ace4cc2 refreshKrb5Config is true 
principal is usern...@bdp.com tryFirstPass is false useFirstPass is false 
storePass is false clearPass is false
 Refreshing Kerberos configuration
 principal is usern...@bdp.com
 Will use keytab
 Commit Succeeded

[Krb5LoginModule]: Entering logout
 [Krb5LoginModule]: logged out Subject
 Debug is true storeKey true useTicketCache false useKeyTab true doNotPrompt 
true ticketCache is null isInitiator true KeyTab is 
username.keytab-a0d905e9-3926-422f-8068-ffec9ace4cc2 refreshKrb5Config is true 
principal is usern...@bdp.com tryFirstPass is false useFirstPass is false 
storePass is false clearPass is false
 Refreshing Kerberos configuration
 principal is usern...@bdp.com
 Will use keytab
 Commit Succeeded
{quote}
 

> YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in 
> cache
> ---
>
> Key: SPARK-26385
> URL: https://issues.apache.org/jira/browse/SPARK-26385
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Hadoop 2.6.0, Spark 2.4.0
>Reporter: T M
>Priority: Major
>
>  
> Hello,
>  
> I have a Spark Structured Streaming job which is running on YARN (Hadoop 2.6.0, 
> Spark 2.4.0). After 25-26 hours, my job stops working with the following error:
> {code:java}
> 2018-12-16 22:35:17 ERROR 
> org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query 
> TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = 
> a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, 
> realUser=, issueDate=1544903057122, maxDate=1545507857122, 
> sequenceNumber=10314, masterKeyId=344) can't be found in cache at 
> org.apache.hadoop.ipc.Client.call(Client.java:1470) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1401) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>  at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
>  at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at 
> org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
> org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at 
> org.apache.hado

[jira] [Updated] (SPARK-31432) bin/sbin scripts should allow to customize jars dir

2020-04-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31432:
--
Affects Version/s: (was: 2.4.5)
   (was: 3.0.0)
   3.1.0

> bin/sbin scripts should allow to customize jars dir
> ---
>
> Key: SPARK-31432
> URL: https://issues.apache.org/jira/browse/SPARK-31432
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 3.1.0
>Reporter: Shingo Furuyama
>Priority: Minor
>
> In the scripts under bin/ and sbin/, it would be better if we could specify 
> SPARK_JARS_DIR in the same way as SPARK_CONF_DIR.
> Our use case:
>  We are trying to use Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an 
> incompatible conflict on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we 
> tweak the jars with the Maven Shade Plugin.
>  The resulting jars differ slightly from the jars shipped with Spark 2.4.5, and 
> we place them in a directory different from the default. So it would be useful 
> for us if we could set SPARK_JARS_DIR so that the bin/sbin scripts point at that 
> directory.
>  We can achieve this without any modification by deploying one Spark home per 
> set of jars, but that is somewhat redundant.
> Common use case:
>  I believe there are similar use cases, for example deploying Spark builds for 
> Scala 2.11 and 2.12 on one machine and switching the jars location by setting 
> SPARK_JARS_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30953) InsertAdaptiveSparkPlan should apply AQE on child plan of write commands

2020-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30953:
---

Assignee: wuyi

> InsertAdaptiveSparkPlan should apply AQE on child plan of write commands
> 
>
> Key: SPARK-30953
> URL: https://issues.apache.org/jira/browse/SPARK-30953
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> Applying AQE to write commands with a child plan will expose {{LogicalQueryStage}} 
> to the {{Analyzer}}, while it should be hidden under {{AdaptiveSparkPlanExec}} only, 
> to avoid unexpected breakage.
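
A minimal sketch of the scenario, assuming an active {{spark}} session; the output path, row count and partition count are illustrative. With this change, the shuffle under the write command is planned adaptively rather than statically.

{code:scala}
// Sketch only: the child plan of a write command now goes through adaptive query execution.
spark.conf.set("spark.sql.adaptive.enabled", "true")

spark.range(0, 1000000L)
  .toDF("id")
  .repartition(200)                    // shuffle that AQE can optimize (e.g. coalesce partitions)
  .write
  .mode("overwrite")
  .parquet("/tmp/spark-30953-demo")    // write command wrapping the adaptive child plan
{code}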



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30953) InsertAdaptiveSparkPlan should apply AQE on child plan of write commands

2020-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30953.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27701
[https://github.com/apache/spark/pull/27701]

> InsertAdaptiveSparkPlan should apply AQE on child plan of write commands
> 
>
> Key: SPARK-30953
> URL: https://issues.apache.org/jira/browse/SPARK-30953
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Applying AQE to write commands with a child plan will expose {{LogicalQueryStage}} 
> to the {{Analyzer}}, while it should be hidden under {{AdaptiveSparkPlanExec}} only, 
> to avoid unexpected breakage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31441) Support duplicated column names for toPandas with Arrow execution.

2020-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31441:


Assignee: Takuya Ueshin

> Support duplicated column names for toPandas with Arrow execution.
> --
>
> Key: SPARK-31441
> URL: https://issues.apache.org/jira/browse/SPARK-31441
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>
> When we execute {{toPandas()}} with Arrow execution, it fails if the column 
> names have duplicates.
> {code:python}
> >>> spark.sql("select 1 v, 1 v").toPandas()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 
> 2132, in toPandas
> pdf = table.to_pandas()
>   File "pyarrow/array.pxi", line 441, in 
> pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
>   File 
> "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.7/lib/python3.7/site-packages/pyarrow/pandas_compat.py",
>  line 653, in table_to_blockmanager
> columns = _deserialize_column_index(table, all_columns, column_indexes)
>   File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 
> 704, in _deserialize_column_index
> columns = _flatten_single_level_multiindex(columns)
>   File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 
> 937, in _flatten_single_level_multiindex
> raise ValueError('Found non-unique column index')
> ValueError: Found non-unique column index
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31441) Support duplicated column names for toPandas with Arrow execution.

2020-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31441.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28210
[https://github.com/apache/spark/pull/28210]

> Support duplicated column names for toPandas with Arrow execution.
> --
>
> Key: SPARK-31441
> URL: https://issues.apache.org/jira/browse/SPARK-31441
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> When we execute {{toPandas()}} with Arrow execution, it fails if the column 
> names have duplicates.
> {code:python}
> >>> spark.sql("select 1 v, 1 v").toPandas()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 
> 2132, in toPandas
> pdf = table.to_pandas()
>   File "pyarrow/array.pxi", line 441, in 
> pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
>   File 
> "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.7/lib/python3.7/site-packages/pyarrow/pandas_compat.py",
>  line 653, in table_to_blockmanager
> columns = _deserialize_column_index(table, all_columns, column_indexes)
>   File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 
> 704, in _deserialize_column_index
> columns = _flatten_single_level_multiindex(columns)
>   File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 
> 937, in _flatten_single_level_multiindex
> raise ValueError('Found non-unique column index')
> ValueError: Found non-unique column index
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31392) Support CalendarInterval to be reflect to CalendarIntervalType

2020-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31392:

Fix Version/s: (was: 3.1.0)
   3.0.0

> Support CalendarInterval to be reflect to CalendarIntervalType
> --
>
> Key: SPARK-31392
> URL: https://issues.apache.org/jira/browse/SPARK-31392
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Since Spark 3.0.0, CalendarInterval has been public, so it is better for it to 
> be inferred as CalendarIntervalType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31426) Regression in loading/saving timestamps from/to ORC files

2020-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31426:
---

Assignee: Maxim Gekk

> Regression in loading/saving timestamps from/to ORC files
> -
>
> Key: SPARK-31426
> URL: https://issues.apache.org/jira/browse/SPARK-31426
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Here are results of DateTimeRebaseBenchmark on the current master branch:
> {code}
> Save timestamps to ORC:     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -----------------------------------------------------------------------------------------------------
> after 1582                          59877         59877          0        1.7        598.8      0.0X
> before 1582                         61361         61361          0        1.6        613.6      0.0X
>
> Load timestamps from ORC:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -----------------------------------------------------------------------------------------------------
> after 1582, vec off                 48197         48288        118        2.1        482.0      1.0X
> after 1582, vec on                  38247         38351        128        2.6        382.5      1.3X
> before 1582, vec off                53179         53359        249        1.9        531.8      0.9X
> before 1582, vec on                 44076         44268        269        2.3        440.8      1.1X
> {code}
> The results of the same benchmark on Spark 2.4.6-SNAPSHOT:
> {code}
> Save timestamps to ORC:     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -----------------------------------------------------------------------------------------------------
> after 1582                          18858         18858          0        5.3        188.6      1.0X
> before 1582                         18508         18508          0        5.4        185.1      1.0X
>
> Load timestamps from ORC:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -----------------------------------------------------------------------------------------------------
> after 1582, vec off                 14063         14177        143        7.1        140.6      1.0X
> after 1582, vec on                   5955          6029        100       16.8         59.5      2.4X
> before 1582, vec off                14119         14126          7        7.1        141.2      1.0X
> before 1582, vec on                  5991          6007         25       16.7         59.9      2.3X
> {code}
>  Here is the PR with DateTimeRebaseBenchmark backported to 2.4: 
> https://github.com/MaxGekk/spark/pull/27



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31426) Regression in loading/saving timestamps from/to ORC files

2020-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31426.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28189
[https://github.com/apache/spark/pull/28189]

> Regression in loading/saving timestamps from/to ORC files
> -
>
> Key: SPARK-31426
> URL: https://issues.apache.org/jira/browse/SPARK-31426
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Here are results of DateTimeRebaseBenchmark on the current master branch:
> {code}
> Save timestamps to ORC:     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -----------------------------------------------------------------------------------------------------
> after 1582                          59877         59877          0        1.7        598.8      0.0X
> before 1582                         61361         61361          0        1.6        613.6      0.0X
>
> Load timestamps from ORC:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -----------------------------------------------------------------------------------------------------
> after 1582, vec off                 48197         48288        118        2.1        482.0      1.0X
> after 1582, vec on                  38247         38351        128        2.6        382.5      1.3X
> before 1582, vec off                53179         53359        249        1.9        531.8      0.9X
> before 1582, vec on                 44076         44268        269        2.3        440.8      1.1X
> {code}
> The results of the same benchmark on Spark 2.4.6-SNAPSHOT:
> {code}
> Save timestamps to ORC:     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -----------------------------------------------------------------------------------------------------
> after 1582                          18858         18858          0        5.3        188.6      1.0X
> before 1582                         18508         18508          0        5.4        185.1      1.0X
>
> Load timestamps from ORC:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -----------------------------------------------------------------------------------------------------
> after 1582, vec off                 14063         14177        143        7.1        140.6      1.0X
> after 1582, vec on                   5955          6029        100       16.8         59.5      2.4X
> before 1582, vec off                14119         14126          7        7.1        141.2      1.0X
> before 1582, vec on                  5991          6007         25       16.7         59.9      2.3X
> {code}
>  Here is the PR with DateTimeRebaseBenchmark backported to 2.4: 
> https://github.com/MaxGekk/spark/pull/27



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31441) Support duplicated column names for toPandas with Arrow execution.

2020-04-13 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-31441:
--
Summary: Support duplicated column names for toPandas with Arrow execution. 
 (was: Support duplicated column names for toPandas with arrow execution.)

> Support duplicated column names for toPandas with Arrow execution.
> --
>
> Key: SPARK-31441
> URL: https://issues.apache.org/jira/browse/SPARK-31441
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> When we execute {{toPandas()}} with Arrow execution, it fails if the column 
> names have duplicates.
> {code:python}
> >>> spark.sql("select 1 v, 1 v").toPandas()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 
> 2132, in toPandas
> pdf = table.to_pandas()
>   File "pyarrow/array.pxi", line 441, in 
> pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
>   File 
> "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.7/lib/python3.7/site-packages/pyarrow/pandas_compat.py",
>  line 653, in table_to_blockmanager
> columns = _deserialize_column_index(table, all_columns, column_indexes)
>   File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 
> 704, in _deserialize_column_index
> columns = _flatten_single_level_multiindex(columns)
>   File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 
> 937, in _flatten_single_level_multiindex
> raise ValueError('Found non-unique column index')
> ValueError: Found non-unique column index
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31441) Support duplicated column names for toPandas with arrow execution.

2020-04-13 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-31441:
-

 Summary: Support duplicated column names for toPandas with arrow 
execution.
 Key: SPARK-31441
 URL: https://issues.apache.org/jira/browse/SPARK-31441
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.5, 3.0.0
Reporter: Takuya Ueshin


When we execute {{toPandas()}} with Arrow execution, it fails if the column 
names have duplicates.

{code:python}
>>> spark.sql("select 1 v, 1 v").toPandas()
Traceback (most recent call last):
  File "", line 1, in 
  File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 
2132, in toPandas
pdf = table.to_pandas()
  File "pyarrow/array.pxi", line 441, in 
pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
  File 
"/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.7/lib/python3.7/site-packages/pyarrow/pandas_compat.py",
 line 653, in table_to_blockmanager
columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 
704, in _deserialize_column_index
columns = _flatten_single_level_multiindex(columns)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 
937, in _flatten_single_level_multiindex
raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31434) Drop builtin function pages from SQL references

2020-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31434:


Assignee: Takeshi Yamamuro

> Drop builtin function pages from SQL references
> ---
>
> Key: SPARK-31434
> URL: https://issues.apache.org/jira/browse/SPARK-31434
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> This ticket intends to drop the built-in function pages from SQL references. 
> We already have a complete list of built-in functions in the API documents.
> See related discussions for more details: 
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31434) Drop builtin function pages from SQL references

2020-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31434.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28203
[https://github.com/apache/spark/pull/28203]

> Drop builtin function pages from SQL references
> ---
>
> Key: SPARK-31434
> URL: https://issues.apache.org/jira/browse/SPARK-31434
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>
> This ticket intends to drop the built-in function pages from SQL references. 
> We already have a complete list of built-in functions in the API documents.
> See related discussions for more details: 
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31411) Show submitted time and duration in job details page

2020-04-13 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-31411.

Fix Version/s: 3.1.0
   Resolution: Fixed

The issue is resolved in https://github.com/apache/spark/pull/28179

> Show submitted time and duration in job details page
> 
>
> Key: SPARK-31411
> URL: https://issues.apache.org/jira/browse/SPARK-31411
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, a job's submitted time and duration are not shown on its job 
> details UI page. 
> We should show them on the job details page.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31420) Infinite timeline redraw in job details page

2020-04-13 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082743#comment-17082743
 ] 

Dongjoon Hyun commented on SPARK-31420:
---

Thank you for confirming, [~sarutak]!

> Infinite timeline redraw in job details page
> 
>
> Key: SPARK-31420
> URL: https://issues.apache.org/jira/browse/SPARK-31420
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Kousuke Saruta
>Priority: Major
> Attachments: timeline.mov
>
>
> In the job page, the timeline section keeps changing the position style and 
> shaking. We can see that there is a warning "infinite loop in redraw" from 
> the console, which can be related to 
> https://github.com/visjs/vis-timeline/issues/17
> I am using the history server with the events under 
> "core/src/test/resources/spark-events" to reproduce.
> I have also uploaded a screen recording.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

2020-04-13 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082736#comment-17082736
 ] 

Erik Krogen commented on SPARK-22148:
-

For future folks: the JIRA created for the issue is SPARK-31418 and discussion 
is continuing there.

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current 
> executors are blacklisted but dynamic allocation is enabled
> -
>
> Key: SPARK-22148
> URL: https://issues.apache.org/jira/browse/SPARK-22148
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Juan Rodríguez Hortalá
>Assignee: Dhruve Ashar
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
> Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and 
> the whole Spark job with `task X (partition Y) cannot run anywhere due to 
> node and executor blacklist. Blacklisting behavior can be configured via 
> spark.blacklist.*.` when all the available executors are blacklisted for a 
> pending Task or TaskSet. This makes sense for static allocation, where the 
> set of executors is fixed for the duration of the application, but this might 
> lead to unnecessary job failures when dynamic allocation is enabled. For 
> example, in a Spark application with a single job at a time, when a node 
> fails at the end of a stage attempt, all other executors will complete their 
> tasks, but the tasks running in the executors of the failing node will be 
> pending. Spark will keep waiting for those tasks for 2 minutes by default 
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it 
> will blacklist those executors for that stage. At that point in time, other 
> executors would have been released after being idle for 1 minute by default 
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't 
> started yet and so there are no more tasks available (assuming the default of 
> spark.speculation = false). So Spark will fail because the only executors 
> available are blacklisted for that stage. 
> An alternative is requesting more executors to the cluster manager in this 
> situation. This could be retried a configurable number of times after a 
> configurable wait time between request attempts, so if the cluster manager 
> fails to provide a suitable executor then the job is aborted like in the 
> previous case. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation

2020-04-13 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-31418:
--
Issue Type: Improvement  (was: Bug)

> Blacklisting feature aborts Spark job without retrying for max num retries in 
> case of Dynamic allocation
> 
>
> Key: SPARK-31418
> URL: https://issues.apache.org/jira/browse/SPARK-31418
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.5
>Reporter: Venkata krishnan Sowrirajan
>Priority: Major
>
> With Spark blacklisting, if a task fails on an executor, the executor gets 
> blacklisted for the task. In order to retry the task, Spark checks whether there 
> is an idle blacklisted executor which can be killed and replaced; if not, it 
> aborts the job without doing the maximum number of retries.
> In the context of dynamic allocation this could be handled better: instead of 
> killing an idle blacklisted executor (it is possible there are no idle 
> blacklisted executors), request an additional executor and retry the task.
> This can be easily reproduced with a simple job like the one below (the example 
> should fail eventually; it just shows that the task is not retried 
> spark.task.maxFailures times): 
> {code:java}
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect 
> {code}
> with dynamic allocation enabled and min executors set to 1. But there are 
> various other cases where this can fail as well.
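
A sketch of the setup the description implies, assuming a cluster manager that supports dynamic allocation and an external shuffle service; the configuration keys are the Spark 2.x/3.0 names and the session-building code is illustrative.

{code:scala}
// Sketch only: blacklisting plus dynamic allocation with minExecutors=1, followed by the
// failing job from the description. The point is that the task should be retried up to
// spark.task.maxFailures times instead of the job being aborted early.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.blacklist.enabled", "true")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .getOrCreate()
val sc = spark.sparkContext

def test(a: Int) = { a.asInstanceOf[String] }    // always throws ClassCastException at runtime
sc.parallelize(1 to 10, 10).map(x => test(x)).collect()
{code}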



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation

2020-04-13 Thread Venkata krishnan Sowrirajan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082731#comment-17082731
 ] 

Venkata krishnan Sowrirajan commented on SPARK-31418:
-

[~tgraves] Currently, I'm thinking we can check whether dynamic allocation is 
enabled and, if so, request one more executor using 
ExecutorAllocationClient#requestExecutors and start the abort timer. But I 
re-read your 
[comment|https://issues.apache.org/jira/browse/SPARK-22148?focusedCommentId=17078278&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17078278]
 again, and it seems like you tried to pass the information to 
ExecutorAllocationManager and request the executor through 
ExecutorAllocationManager. Is that right?

Regarding the idea of killing other non-idle blacklisted executors, I don't think 
that would be better, as we might kill tasks from other stages, as mentioned in 
other comments on the PR. Let me know if you have any other thoughts on this 
problem. We are facing this issue quite frequently; retrying the whole job does 
pass, but it keeps happening.

> Blacklisting feature aborts Spark job without retrying for max num retries in 
> case of Dynamic allocation
> 
>
> Key: SPARK-31418
> URL: https://issues.apache.org/jira/browse/SPARK-31418
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.5
>Reporter: Venkata krishnan Sowrirajan
>Priority: Major
>
> With Spark blacklisting, if a task fails on an executor, the executor gets 
> blacklisted for the task. In order to retry the task, Spark checks whether there 
> is an idle blacklisted executor which can be killed and replaced; if not, it 
> aborts the job without doing the maximum number of retries.
> In the context of dynamic allocation this could be handled better: instead of 
> killing an idle blacklisted executor (it is possible there are no idle 
> blacklisted executors), request an additional executor and retry the task.
> This can be easily reproduced with a simple job like the one below (the example 
> should fail eventually; it just shows that the task is not retried 
> spark.task.maxFailures times): 
> {code:java}
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect 
> {code}
> with dynamic allocation enabled and min executors set to 1. But there are 
> various other cases where this can fail as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0

2020-04-13 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-31306:


Assignee: Ben

> rand() function documentation suggests an inclusive upper bound of 1.0
> --
>
> Key: SPARK-31306
> URL: https://issues.apache.org/jira/browse/SPARK-31306
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, R, Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Ben
>Assignee: Ben
>Priority: Major
>
>  The rand() function in PySpark, Spark, and R is documented as drawing from 
> U[0.0, 1.0]. This suggests an inclusive upper bound and can be confusing 
> (i.e., for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing 
> `U[0.0, 1.0]` suggests the value returned could include 1.0). The function 
> itself uses Rand(), which is [documented |#L71] as having a result in the range 
> [0, 1).
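
A small sketch of the documented behavior, assuming an active {{spark}} session; the seed is illustrative. The generated values fall in [0.0, 1.0), so 1.0 itself is never returned.

{code:scala}
// Sketch only: rand(seed) draws uniformly from [0.0, 1.0); the upper bound is exclusive.
import org.apache.spark.sql.functions.rand

val sample = spark.range(5).select(rand(42).as("u"))
sample.show()   // every value satisfies 0.0 <= u < 1.0
{code}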



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31440) Improve SQL Rest API

2020-04-13 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-31440:
---
Description: 
SQL Rest API exposes query execution metrics as a public API. This Jira aims to 
apply the following improvements to the SQL Rest API, aligning it with the Spark UI.

*Proposed Improvements:*
 1- Support physical operations and group metrics per operation, aligning with the Spark UI.
 2- *nodeId* can be useful for grouping metrics, for sorting, and to differentiate identical operators and their metrics.
 3- Filter *blank* metrics, aligning with the Spark UI - SQL Tab.
 4- Remove *\n* from *metricValue(s)*.
 5- *planDescription* can be an optional HTTP parameter to avoid network cost (especially for complex jobs creating big plans).
 6- The *metrics* attribute needs to be exposed at the bottom, as *metricDetails*. This order matches the Spark UI by highlighting execution order.

*Attachments:*
 Please find both *current* and *improved* versions of the results attached for the following SQL Rest endpoint:
{code:java}
curl -X GET 
http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code}
 
 

  was:
SQL Rest API exposes query execution metrics as Public API. This Jira aims to 
apply following improvements on SQL Rest API by aligning Spark-UI.

*Proposed Improvements:*
 1- Support Physical Operations and group metrics per operation by aligning 
Spark UI.
 2- *nodeId* can be useful for grouping metrics as well as for sorting and to 
differentiate same operators and their metrics.
 3- Filter *blank* metrics by aligning with Spark UI - SQL Tab
 4- Remove *\n* from *metricValue(s)*
 5- *planDescription* can be optional Http parameter to avoid network cost 
(specially for complex jobs creating big-plans).
 6- *metrics* attribute needs to be exposed at the bottom order as 
*metricDetails*. This order matches with Spark UI by highlighting with 
execution order.

*Attachments:*
 Please find both *current* and *improved* versions of results as attached.
 


> Improve SQL Rest API
> 
>
> Key: SPARK-31440
> URL: https://issues.apache.org/jira/browse/SPARK-31440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: current_version.json, improved_version.json
>
>
> SQL Rest API exposes query execution metrics as a public API. This Jira aims to 
> apply the following improvements to the SQL Rest API, aligning it with the Spark UI.
> *Proposed Improvements:*
>  1- Support physical operations and group metrics per operation, aligning with the Spark UI.
>  2- *nodeId* can be useful for grouping metrics, for sorting, and to differentiate identical operators and their metrics.
>  3- Filter *blank* metrics, aligning with the Spark UI - SQL Tab.
>  4- Remove *\n* from *metricValue(s)*.
>  5- *planDescription* can be an optional HTTP parameter to avoid network cost (especially for complex jobs creating big plans).
>  6- The *metrics* attribute needs to be exposed at the bottom, as *metricDetails*. This order matches the Spark UI by highlighting execution order.
> *Attachments:*
>  Please find both *current* and *improved* versions of the results attached for the following SQL Rest endpoint:
> {code:java}
> curl -X GET 
> http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31440) Improve SQL Rest API

2020-04-13 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-31440:
---
Attachment: current_version.json

> Improve SQL Rest API
> 
>
> Key: SPARK-31440
> URL: https://issues.apache.org/jira/browse/SPARK-31440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: current_version.json, improved_version.json
>
>
> SQL Rest API exposes query execution metrics as Public API. This Jira aims to 
> apply following improvements on SQL Rest API by aligning Spark-UI.
> *Proposed Improvements:*
>  1- Support Physical Operations and group metrics per operation by aligning 
> Spark UI.
>  2- *nodeId* can be useful for grouping metrics as well as for sorting and to 
> differentiate same operators and their metrics.
>  3- Filter *blank* metrics by aligning with Spark UI - SQL Tab
>  4- Remove *\n* from *metricValue(s)*
>  5- *planDescription* can be optional Http parameter to avoid network cost 
> (specially for complex jobs creating big-plans).
>  6- *metrics* attribute needs to be exposed at the bottom order as 
> *metricDetails*. This order matches with Spark UI by highlighting with 
> execution order.
> *Attachments:*
>  Please find both *current* and *improved* versions of results as attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31440) Improve SQL Rest API

2020-04-13 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-31440:
---
Attachment: improved_version.json

> Improve SQL Rest API
> 
>
> Key: SPARK-31440
> URL: https://issues.apache.org/jira/browse/SPARK-31440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: current_version.json, improved_version.json
>
>
> SQL Rest API exposes query execution metrics as Public API. This Jira aims to 
> apply following improvements on SQL Rest API by aligning Spark-UI.
> *Proposed Improvements:*
>  1- Support Physical Operations and group metrics per operation by aligning 
> Spark UI.
>  2- *nodeId* can be useful for grouping metrics as well as for sorting and to 
> differentiate same operators and their metrics.
>  3- Filter *blank* metrics by aligning with Spark UI - SQL Tab
>  4- Remove *\n* from *metricValue(s)*
>  5- *planDescription* can be optional Http parameter to avoid network cost 
> (specially for complex jobs creating big-plans).
>  6- *metrics* attribute needs to be exposed at the bottom order as 
> *metricDetails*. This order matches with Spark UI by highlighting with 
> execution order.
> *Attachments:*
>  Please find both *current* and *improved* versions of results as attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31440) Improve SQL Rest API

2020-04-13 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-31440:
---
Description: 
SQL Rest API exposes query execution metrics as Public API. This Jira aims to 
apply following improvements on SQL Rest API by aligning Spark-UI.

*Proposed Improvements:*
 1- Support Physical Operations and group metrics per operation by aligning 
Spark UI.
 2- *nodeId* can be useful for grouping metrics as well as for sorting and to 
differentiate same operators and their metrics.
 3- Filter *blank* metrics by aligning with Spark UI - SQL Tab
 4- Remove *\n* from *metricValue(s)*
 5- *planDescription* can be optional Http parameter to avoid network cost 
(specially for complex jobs creating big-plans).
 6- *metrics* attribute needs to be exposed at the bottom order as 
*metricDetails*. This order matches with Spark UI by highlighting with 
execution order.

*Attachments:*
 Please find both *current* and *improved* versions of results as attached.
 

  was:
SQL Rest API exposes query execution metrics as Public API. This Jira aims to 
apply following improvements on SQL Rest API by aligning Spark-UI.

*Proposed Improvements:*
1- Support Physical Operations and group metrics per operation by aligning 
Spark UI.
2- `nodeId` can be useful for grouping metrics as well as for sorting and to 
differentiate same operators and their metrics.
3- Filter `blank` metrics by aligning with Spark UI - SQL Tab
4- Remove `\n` from `metricValue`
5- `planDescription` can be optional Http parameter to avoid network cost 
(specially for complex jobs creating big-plans).
6- `metrics` attribute needs to be exposed at the bottom order as 
`metricDetails`. This order matches with Spark UI by highlighting with 
execution order.

*Attachments:*
Please find both *current* and *improved* versions of results as attached.


> Improve SQL Rest API
> 
>
> Key: SPARK-31440
> URL: https://issues.apache.org/jira/browse/SPARK-31440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> SQL Rest API exposes query execution metrics as Public API. This Jira aims to 
> apply following improvements on SQL Rest API by aligning Spark-UI.
> *Proposed Improvements:*
>  1- Support Physical Operations and group metrics per operation by aligning 
> Spark UI.
>  2- *nodeId* can be useful for grouping metrics as well as for sorting and to 
> differentiate same operators and their metrics.
>  3- Filter *blank* metrics by aligning with Spark UI - SQL Tab
>  4- Remove *\n* from *metricValue(s)*
>  5- *planDescription* can be optional Http parameter to avoid network cost 
> (specially for complex jobs creating big-plans).
>  6- *metrics* attribute needs to be exposed at the bottom order as 
> *metricDetails*. This order matches with Spark UI by highlighting with 
> execution order.
> *Attachments:*
>  Please find both *current* and *improved* versions of results as attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31440) Improve SQL Rest API

2020-04-13 Thread Eren Avsarogullari (Jira)
Eren Avsarogullari created SPARK-31440:
--

 Summary: Improve SQL Rest API
 Key: SPARK-31440
 URL: https://issues.apache.org/jira/browse/SPARK-31440
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Eren Avsarogullari


SQL Rest API exposes query execution metrics as Public API. This Jira aims to 
apply following improvements on SQL Rest API by aligning Spark-UI.

*Proposed Improvements:*
1- Support Physical Operations and group metrics per operation by aligning 
Spark UI.
2- `nodeId` can be useful for grouping metrics as well as for sorting and to 
differentiate same operators and their metrics.
3- Filter `blank` metrics by aligning with Spark UI - SQL Tab
4- Remove `\n` from `metricValue`
5- `planDescription` can be optional Http parameter to avoid network cost 
(specially for complex jobs creating big-plans).
6- `metrics` attribute needs to be exposed at the bottom order as 
`metricDetails`. This order matches with Spark UI by highlighting with 
execution order.

*Attachments:*
Please find both *current* and *improved* versions of results as attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18299) Allow more aggregations on KeyValueGroupedDataset

2020-04-13 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-18299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-18299:
---

Assignee: nooberfsh

> Allow more aggregations on KeyValueGroupedDataset
> -
>
> Key: SPARK-18299
> URL: https://issues.apache.org/jira/browse/SPARK-18299
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Matthias Niehoff
>Assignee: nooberfsh
>Priority: Minor
> Fix For: 3.0.0
>
>
> The number of possible aggregations on a KeyValueGroupedDataset created by 
> groupByKey is limited to 4, as there are only methods with a maximum of 4 
> parameters.
> This value should be increased or, even better, made completely unlimited.
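
A small sketch of the ceiling, assuming a spark-shell session with {{spark.implicits._}} in scope; the dataset and aggregations are illustrative. Four typed aggregations compile, while a fifth column passed to {{agg}} does not compile before this change.

{code:scala}
// Sketch only: KeyValueGroupedDataset.agg offered overloads for at most 4 typed columns.
import org.apache.spark.sql.functions._
import spark.implicits._

val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

val agg4 = ds.groupByKey(_._1).agg(
  count($"_2").as[Long],
  sum($"_2").as[Long],
  avg($"_2").as[Double],
  max($"_2").as[Int])   // a 5th TypedColumn here did not compile before this change
{code}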



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31439) Perf regression of fromJavaDate

2020-04-13 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31439:
--

 Summary: Perf regression of fromJavaDate
 Key: SPARK-31439
 URL: https://issues.apache.org/jira/browse/SPARK-31439
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


DateTimeBenchmark shows the regression

Spark 2.4.6-SNAPSHOT at the PR https://github.com/MaxGekk/spark/pull/27
{code}

Conversion from/to external types
=================================

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
To/from java.sql.Timestamp:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------
From java.sql.Date                      614           655         43        8.1        122.8      1.0X
{code}

Current master:
{code}

Conversion from/to external types
=================================

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
To/from java.sql.Timestamp:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------
From java.sql.Date                     1154          1206         46        4.3        230.9      1.0X
{code}

The regression is ~2x.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31426) Regression in loading/saving timestamps from/to ORC files

2020-04-13 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-31426:
---
Parent: SPARK-31404
Issue Type: Sub-task  (was: Bug)

> Regression in loading/saving timestamps from/to ORC files
> -
>
> Key: SPARK-31426
> URL: https://issues.apache.org/jira/browse/SPARK-31426
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Here are results of DateTimeRebaseBenchmark on the current master branch:
> {code}
> Save timestamps to ORC:     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -----------------------------------------------------------------------------------------------------
> after 1582                          59877         59877          0        1.7        598.8      0.0X
> before 1582                         61361         61361          0        1.6        613.6      0.0X
>
> Load timestamps from ORC:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -----------------------------------------------------------------------------------------------------
> after 1582, vec off                 48197         48288        118        2.1        482.0      1.0X
> after 1582, vec on                  38247         38351        128        2.6        382.5      1.3X
> before 1582, vec off                53179         53359        249        1.9        531.8      0.9X
> before 1582, vec on                 44076         44268        269        2.3        440.8      1.1X
> {code}
> The results of the same benchmark on Spark 2.4.6-SNAPSHOT:
> {code}
> Save timestamps to ORC:     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -----------------------------------------------------------------------------------------------------
> after 1582                          18858         18858          0        5.3        188.6      1.0X
> before 1582                         18508         18508          0        5.4        185.1      1.0X
>
> Load timestamps from ORC:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -----------------------------------------------------------------------------------------------------
> after 1582, vec off                 14063         14177        143        7.1        140.6      1.0X
> after 1582, vec on                   5955          6029        100       16.8         59.5      2.4X
> before 1582, vec off                14119         14126          7        7.1        141.2      1.0X
> before 1582, vec on                  5991          6007         25       16.7         59.9      2.3X
> {code}
>  Here is the PR with DateTimeRebaseBenchmark backported to 2.4: 
> https://github.com/MaxGekk/spark/pull/27
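
For anyone who wants a quick, rough reproduction without the benchmark harness, here is a sketch that only approximates what DateTimeRebaseBenchmark measures (the row count and path below are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

object OrcTimestampRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("orc-ts").getOrCreate()
    import spark.implicits._

    val path = "/tmp/orc-ts-bench"
    // long seconds-since-epoch cast to timestamp; 10M rows as a placeholder size
    val df = spark.range(10L * 1000 * 1000).select($"id".cast("timestamp").as("ts"))

    def timed[T](label: String)(f: => T): T = {
      val t0 = System.nanoTime(); val r = f
      println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms"); r
    }

    timed("save timestamps to ORC") { df.write.mode("overwrite").orc(path) }
    timed("load timestamps from ORC") { spark.read.orc(path).selectExpr("max(ts)").collect() }

    spark.stop()
  }
}
{code}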



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-13 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082498#comment-17082498
 ] 

Bruce Robbins commented on SPARK-31423:
---

[~cloud_fan] 
{quote}FYI this is the behavior of Spark 2.4
{quote}
Yes, I noted that in my description. What I mean is that in Spark 3.x (without
touching any legacy config), only ORC demonstrates this behavior; CAST and the
Parquet and Avro file formats do not.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.
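
As a side note on the calendar gap itself, here is a tiny JDK-only sketch of what I understand to be the root cause (an illustration, not a statement about what ORC should do): the legacy java.sql.Date path uses the hybrid Julian/Gregorian calendar, in which these 10 days do not exist, while java.time uses the proleptic Gregorian calendar that Spark 3.x follows.

{code:scala}
import java.sql.Date
import java.time.LocalDate

// hybrid Julian/Gregorian calendar (legacy java.sql / java.util classes):
// a date inside the gap is normalized 10 days forward
println(Date.valueOf("1582-10-14"))     // expected to print 1582-10-24

// proleptic Gregorian calendar (java.time, used by Spark 3.x):
// the date exists exactly as written
println(LocalDate.parse("1582-10-14"))  // prints 1582-10-14
{code}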



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31438) Support JobCleaned Status in SparkListener

2020-04-13 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-31438:
---
Description: 
In Spark, we need to run some hooks after a job is cleaned, such as cleaning Hive
external temporary paths. This has already been discussed in SPARK-31346 and [GitHub
Pull Request #28129.|https://github.com/apache/spark/pull/28129]
 The JobEnd status is not suitable for this. JobEnd marks the job as finished: once
all results have been generated, the job is done. After that, the scheduler leaves
the still-running tasks as zombie tasks and deletes abnormal tasks asynchronously.
 Thus, we add a JobCleaned status to let users run hooks after all tasks of a job
are cleaned. The JobCleaned status can be derived from the TaskSetManagers, each of
which is related to a stage; once all stages of the job have been cleaned, the job
is cleaned.

  was:
In Spark, we need do some hook, such as cleaning hive external temporary paths, 
after job cleaned, which is discussed in SPARK-31346 and [GitHub Pull Request 
#28129|https://github.com/apache/spark/pull/28129].
 The JobEnd Status is not suitable for this. As JobEnd is responsible for Job 
finished, once all result has generated, it should be finished. After finish, 
Scheduler will leave the still running tasks to be zombie tasks and delete 
abnormal tasks asynchronously.
 Thus, we add JobCleaned Status to enable user to do some hook after all tasks 
cleaned in Job. The JobCleaned Status can get from TaskSetManagers, which is 
related to a stage, and once all stages of the job has been cleaned, then the 
job is cleaned.


> Support JobCleaned Status in SparkListener
> --
>
> Key: SPARK-31438
> URL: https://issues.apache.org/jira/browse/SPARK-31438
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Jackey Lee
>Priority: Major
>
> In Spark, we need to run some hooks after a job is cleaned, such as cleaning Hive
> external temporary paths. This has already been discussed in SPARK-31346 and
> [GitHub Pull Request #28129.|https://github.com/apache/spark/pull/28129]
>  The JobEnd status is not suitable for this. JobEnd marks the job as finished:
> once all results have been generated, the job is done. After that, the scheduler
> leaves the still-running tasks as zombie tasks and deletes abnormal tasks
> asynchronously.
>  Thus, we add a JobCleaned status to let users run hooks after all tasks of a job
> are cleaned. The JobCleaned status can be derived from the TaskSetManagers, each
> of which is related to a stage; once all stages of the job have been cleaned, the
> job is cleaned.
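
To make the proposal concrete, here is a purely hypothetical sketch of how such an event could surface to listeners. SparkListenerJobCleaned does not exist in Spark today; the event and its fields below are illustrative only.

{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// hypothetical event: would be posted once every TaskSetManager of every stage of
// the job has been cleaned up, not merely when the job result is ready (JobEnd)
case class SparkListenerJobCleaned(jobId: Int, time: Long) extends SparkListenerEvent

class HiveTempPathCleaner extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case SparkListenerJobCleaned(jobId, _) =>
      // safe point to delete Hive external temporary paths for this job
      println(s"job $jobId fully cleaned")
    case _ => // ignore other events
  }
}
{code}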



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31438) Support JobCleaned Status in SparkListener

2020-04-13 Thread Jackey Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-31438:
---
Description: 
In Spark, we need to run some hooks, such as cleaning Hive external temporary
paths, after a job is cleaned; this is discussed in SPARK-31346 and [GitHub Pull
Request #28129|https://github.com/apache/spark/pull/28129].
 The JobEnd status is not suitable for this. JobEnd marks the job as finished: once
all results have been generated, the job is done. After that, the scheduler leaves
the still-running tasks as zombie tasks and deletes abnormal tasks asynchronously.
 Thus, we add a JobCleaned status to let users run hooks after all tasks of a job
are cleaned. The JobCleaned status can be derived from the TaskSetManagers, each of
which is related to a stage; once all stages of the job have been cleaned, the job
is cleaned.

  was:
In Spark, we need do some hook, such as hive external temporary paths cleaning, 
after job cleaned, which is discussed in SPARK-31346 and [GitHub Pull Request 
#28129|https://github.com/apache/spark/pull/28129].
 The JobEnd Status is not suitable for this. As JobEnd is responsible for Job 
finished, once all result has generated, it should be finished. After finish, 
Scheduler will leave the still running tasks to be zombie tasks and delete 
abnormal tasks asynchronously.
 Thus, we add JobCleaned Status to enable user to do some hook after all tasks 
cleaned in Job. The JobCleaned Status can get from TaskSetManagers, which is 
related to a stage, and once all stages of the job has been cleaned, then the 
job is cleaned.


> Support JobCleaned Status in SparkListener
> --
>
> Key: SPARK-31438
> URL: https://issues.apache.org/jira/browse/SPARK-31438
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Jackey Lee
>Priority: Major
>
> In Spark, we need to run some hooks, such as cleaning Hive external temporary
> paths, after a job is cleaned; this is discussed in SPARK-31346 and [GitHub Pull
> Request #28129|https://github.com/apache/spark/pull/28129].
>  The JobEnd status is not suitable for this. JobEnd marks the job as finished:
> once all results have been generated, the job is done. After that, the scheduler
> leaves the still-running tasks as zombie tasks and deletes abnormal tasks
> asynchronously.
>  Thus, we add a JobCleaned status to let users run hooks after all tasks of a job
> are cleaned. The JobCleaned status can be derived from the TaskSetManagers, each
> of which is related to a stage; once all stages of the job have been cleaned, the
> job is cleaned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31438) Support JobCleaned Status in SparkListener

2020-04-13 Thread Jackey Lee (Jira)
Jackey Lee created SPARK-31438:
--

 Summary: Support JobCleaned Status in SparkListener
 Key: SPARK-31438
 URL: https://issues.apache.org/jira/browse/SPARK-31438
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Jackey Lee


In Spark, we need to run some hooks, such as cleaning Hive external temporary
paths, after a job is cleaned; this is discussed in SPARK-31346 and [GitHub Pull
Request #28129|https://github.com/apache/spark/pull/28129].
 The JobEnd status is not suitable for this. JobEnd marks the job as finished: once
all results have been generated, the job is done. After that, the scheduler leaves
the still-running tasks as zombie tasks and deletes abnormal tasks asynchronously.
 Thus, we add a JobCleaned status to let users run hooks after all tasks of a job
are cleaned. The JobCleaned status can be derived from the TaskSetManagers, each of
which is related to a stage; once all stages of the job have been cleaned, the job
is cleaned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31391) Add AdaptiveTestUtils to ease the test of AQE

2020-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31391.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28162
[https://github.com/apache/spark/pull/28162]

> Add AdaptiveTestUtils to ease the test of AQE
> -
>
> Key: SPARK-31391
> URL: https://issues.apache.org/jira/browse/SPARK-31391
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Tests related to AQE now have a lot of duplicated code; we can use some utility
> functions to make the tests simpler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31391) Add AdaptiveTestUtils to ease the test of AQE

2020-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31391:
---

Assignee: wuyi

> Add AdaptiveTestUtils to ease the test of AQE
> -
>
> Key: SPARK-31391
> URL: https://issues.apache.org/jira/browse/SPARK-31391
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> Tests related to AQE now have a lot of duplicated code; we can use some utility
> functions to make the tests simpler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31409) Fix failed tests due to result order changing when we enable AQE

2020-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31409.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28178
[https://github.com/apache/spark/pull/28178]

> Fix failed tests due to result order changing when we enable AQE
> 
>
> Key: SPARK-31409
> URL: https://issues.apache.org/jira/browse/SPARK-31409
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> query #147 in SQLQueryTestSuite#"udf/postgreSQL/udf-join.sql - Scala UDF" and
> test sql/SQLQuerySuite#"check outputs of expression examples" will fail when
> AQE is enabled, due to the result order changing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31409) Fix failed tests due to result order changing when we enable AQE

2020-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31409:
---

Assignee: wuyi

> Fix failed tests due to result order changing when we enable AQE
> 
>
> Key: SPARK-31409
> URL: https://issues.apache.org/jira/browse/SPARK-31409
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> query #147 in SQLQueryTestSuite#"udf/postgreSQL/udf-join.sql - Scala UDF" and
> test sql/SQLQuerySuite#"check outputs of expression examples" will fail when
> AQE is enabled, due to the result order changing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31435) Add SPARK_JARS_DIR enviroment variable (new) to Spark configuration documentation

2020-04-13 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31435.
--
Resolution: Duplicate

> Add SPARK_JARS_DIR enviroment variable (new) to Spark configuration 
> documentation
> -
>
> Key: SPARK-31435
> URL: https://issues.apache.org/jira/browse/SPARK-31435
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> Related to SPARK-31432.
> That issue introduces a new environment variable, which is documented in this
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied

2020-04-13 Thread Hongze Zhang (Jira)
Hongze Zhang created SPARK-31437:


 Summary: Try assigning tasks to existing executors by which 
required resources in ResourceProfile are satisfied
 Key: SPARK-31437
 URL: https://issues.apache.org/jira/browse/SPARK-31437
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 3.0.0
Reporter: Hongze Zhang


With the change in the [PR|https://github.com/apache/spark/pull/27773] of
SPARK-29154, submitted tasks are scheduled onto executors only if their resource
profile IDs strictly match. As a result, Spark always starts new executors for
customized ResourceProfiles.

This limitation makes working with process-local jobs unfriendly. E.g., if task
cores are increased from 1 to 4 in a new stage and an existing executor has 8
slots, one would expect 2 such tasks to run on that executor, but Spark starts new
executors for the new ResourceProfile. This behavior is unnecessary.
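
A sketch of the scenario described above, using the stage-level scheduling API (assumptions: Spark 3.1+, a cluster manager that supports ResourceProfiles, and dynamic allocation enabled; the numbers mirror the example):

{code:scala}
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}
import org.apache.spark.sql.SparkSession

object ResourceProfileScenario {
  def main(args: Array[String]): Unit = {
    // assumes submission to a cluster manager that supports stage-level scheduling
    val spark = SparkSession.builder().appName("rp-scenario").getOrCreate()
    val sc = spark.sparkContext

    // a later stage wants 4 cpus per task on 8-core executors
    val profile = new ResourceProfileBuilder()
      .require(new TaskResourceRequests().cpus(4))
      .require(new ExecutorResourceRequests().cores(8))
      .build()

    val rdd = sc.parallelize(1 to 1000, 8)
      .map(_ * 2)                 // runs with the default profile
      .repartition(2)
      .withResources(profile)     // next stage requires the customized profile

    // with strict profile-ID matching, this stage waits for brand-new executors
    // even if an existing 8-core executor has enough free slots
    rdd.map(_ + 1).collect()

    spark.stop()
  }
}
{code}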



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31436) MinHash keyDistance optimization

2020-04-13 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-31436:


 Summary: MinHash keyDistance optimization
 Key: SPARK-31436
 URL: https://issues.apache.org/jira/browse/SPARK-31436
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.1.0
Reporter: zhengruifeng


The current implementation is based on set operations, which is inefficient.
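
For illustration, here is a sketch of the kind of change implied: compute the Jaccard key distance directly on the sorted index arrays of two sparse binary vectors instead of materializing Scala sets. This is a standalone helper written under that assumption, not the actual MinHashLSHModel.keyDistance code.

{code:scala}
// distance = 1 - |A ∩ B| / |A ∪ B|, computed by merging two sorted index arrays
def jaccardDistance(aIdx: Array[Int], bIdx: Array[Int]): Double = {
  var i = 0
  var j = 0
  var intersection = 0
  while (i < aIdx.length && j < bIdx.length) {
    if (aIdx(i) == bIdx(j)) { intersection += 1; i += 1; j += 1 }
    else if (aIdx(i) < bIdx(j)) i += 1
    else j += 1
  }
  val union = aIdx.length + bIdx.length - intersection
  if (union == 0) 0.0 else 1.0 - intersection.toDouble / union
}

// usage with ml SparseVector indices (Spark keeps the indices sorted):
// jaccardDistance(v1.toSparse.indices, v2.toSparse.indices)
{code}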



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31435) Add SPARK_JARS_DIR enviroment variable (new) to Spark configuration documentation

2020-04-13 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-31435:
--

 Summary: Add SPARK_JARS_DIR enviroment variable (new) to Spark 
configuration documentation
 Key: SPARK-31435
 URL: https://issues.apache.org/jira/browse/SPARK-31435
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.1.0
Reporter: Pablo Langa Blanco


Related to SPARK-31432.

That issue introduces a new environment variable, which is documented in this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31435) Add SPARK_JARS_DIR enviroment variable (new) to Spark configuration documentation

2020-04-13 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082233#comment-17082233
 ] 

Pablo Langa Blanco commented on SPARK-31435:


I'm working on this

> Add SPARK_JARS_DIR enviroment variable (new) to Spark configuration 
> documentation
> -
>
> Key: SPARK-31435
> URL: https://issues.apache.org/jira/browse/SPARK-31435
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> Related to SPARK-31432.
> That issue introduces a new environment variable, which is documented in this
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-13 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082228#comment-17082228
 ] 

Wenchen Fan edited comment on SPARK-31423 at 4/13/20, 10:54 AM:


FYI this is the behavior of Spark 2.4:
{code}
scala> val df = sql("select cast('1582-10-14' as DATE) dt")
df: org.apache.spark.sql.DataFrame = [dt: date]

scala> df.show
+--+
|dt|
+--+
|1582-10-24|
+--+


scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")

scala> spark.read.orc("/tmp/funny_orc_date").show
+--+
|dt|
+--+
|1582-10-24|
+--+
{code}

The result is wrong at the very beginning.


was (Author: cloud_fan):
FYI this is the behavior of Spark 2.4:
```
scala> val df = sql("select cast('1582-10-14' as DATE) dt")
df: org.apache.spark.sql.DataFrame = [dt: date]

scala> df.show
+--+
|dt|
+--+
|1582-10-24|
+--+


scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")

scala> spark.read.orc("/tmp/funny_orc_date").show
+--+
|dt|
+--+
|1582-10-24|
+--+
```

The result is wrong at the very beginning.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-

[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-13 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082228#comment-17082228
 ] 

Wenchen Fan commented on SPARK-31423:
-

FYI this is the behavior of Spark 2.4:
```
scala> val df = sql("select cast('1582-10-14' as DATE) dt")
df: org.apache.spark.sql.DataFrame = [dt: date]

scala> df.show
+--+
|dt|
+--+
|1582-10-24|
+--+


scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")

scala> spark.read.orc("/tmp/funny_orc_date").show
+--+
|dt|
+--+
|1582-10-24|
+--+
```

The result is wrong at the very beginning.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31407) Fix hive/SQLQuerySuite.derived from Hive query file: drop_database_removes_partition_dirs.q

2020-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31407.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28177

> Fix hive/SQLQuerySuite.derived from Hive query file: 
> drop_database_removes_partition_dirs.q
> ---
>
> Key: SPARK-31407
> URL: https://issues.apache.org/jira/browse/SPARK-31407
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Test "derived from Hive query file: drop_database_removes_partition_dirs.q" 
> can fail if we run it separately but can success running with the whole 
> hive/SQLQuerySuite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31429:
-
Parent: (was: SPARK-28588)
Issue Type: Bug  (was: Sub-task)

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular
> categories in the function documentation. For example, we want to group window
> functions into finer categories such as ranking functions and analytic
> functions.
> See Hyukjin's comment below for more details:
> https://github.com/apache/spark/pull/28170#issuecomment-611917191
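
A purely illustrative sketch of the idea (this is not Spark's ExpressionDescription annotation; the extra "group" field is an assumption about what a finer-grained category could look like):

{code:scala}
import scala.annotation.StaticAnnotation

// toy annotation carrying a finer-grained documentation category next to the usage text
class ExpressionDoc(
    usage: String,
    group: String = "",   // e.g. "agg_funcs", "window_funcs", or finer: "ranking_funcs"
    since: String = "") extends StaticAnnotation

@ExpressionDoc(
  usage = "_FUNC_() - Returns the rank of rows within a window partition.",
  group = "ranking_funcs",
  since = "2.0.0")
class RankExample
{code}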



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082196#comment-17082196
 ] 

Hyukjin Kwon commented on SPARK-31429:
--

Actually, let me retarget this as Spark 3.1. It would be good to do for Spark
3.0, but I guess it's okay to miss it too. I will try anyway.

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular
> categories in the function documentation. For example, we want to group window
> functions into finer categories such as ranking functions and analytic
> functions.
> See Hyukjin's comment below for more details:
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31429:
-
Target Version/s:   (was: 3.0.0)

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular
> categories in the function documentation. For example, we want to group window
> functions into finer categories such as ranking functions and analytic
> functions.
> See Hyukjin's comment below for more details:
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082190#comment-17082190
 ] 

Hyukjin Kwon commented on SPARK-31429:
--

[~huaxingao], [~nchammas], [~kevinyu98], [~dkbiswal], [~maropu], would anyone
be interested in this, please? I would like to get this done for Spark 3.0; if
you are busy, I will try to take a look, probably next week or so.

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular
> categories in the function documentation. For example, we want to group window
> functions into finer categories such as ranking functions and analytic
> functions.
> See Hyukjin's comment below for more details:
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31429:
-
Parent: SPARK-28588
Issue Type: Sub-task  (was: Improvement)

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular
> categories in the function documentation. For example, we want to group window
> functions into finer categories such as ranking functions and analytic
> functions.
> See Hyukjin's comment below for more details:
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31429:
-
Affects Version/s: (was: 3.1.0)
   3.0.0

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular
> categories in the function documentation. For example, we want to group window
> functions into finer categories such as ranking functions and analytic
> functions.
> See Hyukjin's comment below for more details:
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31429:
-
Target Version/s: 3.0.0

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular
> categories in the function documentation. For example, we want to group window
> functions into finer categories such as ranking functions and analytic
> functions.
> See Hyukjin's comment below for more details:
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31434) Drop builtin function pages from SQL references

2020-04-13 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-31434:


 Summary: Drop builtin function pages from SQL references
 Key: SPARK-31434
 URL: https://issues.apache.org/jira/browse/SPARK-31434
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro


This ticket intends to drop the built-in function pages from the SQL references.
We already have a complete list of built-in functions in the API documents.

See related discussions for more details: 
https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31427) Spark Structure streaming read data twice per every micro-batch.

2020-04-13 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082163#comment-17082163
 ] 

Nick Hryhoriev commented on SPARK-31427:


[~kabhwan] I will try to do it, but do not expect to get the info any time soon.
But I can confirm that 2.4.5 has the same behavior.

> Spark Structure streaming read data twice per every micro-batch.
> 
>
> Key: SPARK-31427
> URL: https://issues.apache.org/jira/browse/SPARK-31427
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Nick Hryhoriev
>Priority: Major
>
> I have a very strange issue with Spark Structured Streaming: it creates two
> Spark jobs for every micro-batch and, as a result, reads the data from Kafka
> twice. Here is a simple code snippet.
>  
> {code:java}
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.streaming.Trigger
> import scala.concurrent.duration._  // needed for the `10.seconds` syntax below
> object CheckHowSparkReadFromKafka {
>   def main(args: Array[String]): Unit = {
> val session = SparkSession.builder()
>   .config(new SparkConf()
> .setAppName(s"simple read from kafka with repartition")
> .setMaster("local[*]")
> .set("spark.driver.host", "localhost"))
>   .getOrCreate()
> val testPath = "/tmp/spark-test"
> FileSystem.get(session.sparkContext.hadoopConfiguration).delete(new 
> Path(testPath), true)
> import session.implicits._
> val stream = session
>   .readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers","kafka-20002-prod:9092")
>   .option("subscribe", "topic")
>   .option("maxOffsetsPerTrigger", 1000)
>   .option("failOnDataLoss", false)
>   .option("startingOffsets", "latest")
>   .load()
>   .repartitionByRange( $"offset")
>   .writeStream
>   .option("path", testPath + "/data")
>   .option("checkpointLocation", testPath + "/checkpoint")
>   .format("parquet")
>   .trigger(Trigger.ProcessingTime(10.seconds))
>   .start()
> stream.processAllAvailable()
>   }
> }
> {code}
> This happens because of {{.repartitionByRange($"offset")}}; if I remove this
> line, all is good. But with it, Spark creates two jobs: one with 1 stage that
> just reads from Kafka, and a second with 3 stages (read -> shuffle -> write).
> So the result of the first job is never used.
> This has a significant impact on performance. Some of my Kafka topics have
> 1550 partitions, so reading them twice is a big deal. If I add a cache, things
> get better, but that is not an option for me. In local mode, the first job in
> a batch takes less than 0.1 ms, except the batch with index 0. But on a YARN
> cluster and on Mesos both jobs fully execute and on my topics take nearly
> 1.2 min.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31433) Summarizer supports string arguments

2020-04-13 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-31433:


 Summary: Summarizer supports string arguments
 Key: SPARK-31433
 URL: https://issues.apache.org/jira/browse/SPARK-31433
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.1.0
Reporter: zhengruifeng


It will be convenient for Summarizer to support string arguments, like other SQL
functions.
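
For context, here is a small sketch of today's Column-based usage and, commented out, the kind of string-argument overload being proposed; the string overload is hypothetical and does not exist yet.

{code:scala}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SummarizerStringArgs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("summarizer").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, Vectors.dense(1.0, 2.0)),
      (2, Vectors.dense(3.0, 4.0))
    ).toDF("id", "features")

    // today: Column arguments are required
    df.select(Summarizer.metrics("mean", "variance").summary(col("features"))).show(false)

    // proposed (hypothetical): accept the column name directly, like many sql functions
    // df.select(Summarizer.metrics("mean", "variance").summary("features")).show(false)

    spark.stop()
  }
}
{code}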



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31408) Build Spark’s own datetime pattern definition

2020-04-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31408:

Summary: Build Spark’s own datetime pattern definition  (was: Build Spark’s 
own Datetime patterns)

> Build Spark’s own datetime pattern definition
> -
>
> Key: SPARK-31408
> URL: https://issues.apache.org/jira/browse/SPARK-31408
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> This is an umbrella ticket for building Spark's own datetime patterns and
> related work.
> In Spark version 2.4 and earlier, datetime parsing and formatting are
> performed by the old Java 7 `SimpleDateFormat` API. Since Spark 3.0, we have
> switched to the new Java 8 `DateTimeFormatter` to use the proleptic Gregorian
> calendar, which is required by the ISO and SQL standards.
> However, some datetime patterns are not compatible between the Java 8 and
> Java 7 APIs, and it is fragile to rely on the JDK API to define Spark's
> behavior. We should build our own datetime patterns, which are compatible with
> Spark 2.4 (the old Java 7 `SimpleDateFormat` API).
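
As a concrete example of one known Java 7 / Java 8 incompatibility (the pattern letter 'u' means day-of-week in SimpleDateFormat but year in DateTimeFormatter), a small JDK-only sketch:

{code:scala}
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Date

// Java 7 API: 'u' is the day number of the week (1 = Monday, ..., 7 = Sunday)
println(new SimpleDateFormat("u").format(new Date()))             // e.g. "1"

// Java 8 API: 'u' is the (proleptic) year
println(LocalDate.now().format(DateTimeFormatter.ofPattern("u"))) // e.g. "2020"
{code}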



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31432) bin/sbin scripts should allow to customize jars dir

2020-04-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31432:
-
Target Version/s:   (was: 2.4.6, 3.0.1)

> bin/sbin scripts should allow to customize jars dir
> ---
>
> Key: SPARK-31432
> URL: https://issues.apache.org/jira/browse/SPARK-31432
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Shingo Furuyama
>Priority: Minor
>
> In the scripts under bin/sbin, it would be better if we could specify
> SPARK_JARS_DIR in the same way as SPARK_CONF_DIR.
> Our use case:
>  We are trying to use Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an
> incompatibility on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we tweak
> the jars with the Maven Shade Plugin.
>  The resulting jars differ slightly from the jars shipped with Spark 2.4.5, and
> we place them in a directory different from the default. So it would be useful
> for us if we could set SPARK_JARS_DIR for the bin/sbin scripts to point to that
> directory.
>  We could do that without this modification by deploying one Spark home per set
> of jars, but that is somewhat redundant.
> Common use case:
>  I believe there are similar use cases. For example, deploying Spark built for
> Scala 2.11 and Scala 2.12 on one machine and switching the jars location by
> setting SPARK_JARS_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31432) bin/sbin scripts should allow to customize jars dir

2020-04-13 Thread Shingo Furuyama (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082129#comment-17082129
 ] 

Shingo Furuyama commented on SPARK-31432:
-

I will soon send a PR to the master branch. If the PR is merged, I will send it to
branch-2.4.

> bin/sbin scripts should allow to customize jars dir
> ---
>
> Key: SPARK-31432
> URL: https://issues.apache.org/jira/browse/SPARK-31432
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Shingo Furuyama
>Priority: Minor
>
> In the scripts under bin/sbin, it would be better if we could specify
> SPARK_JARS_DIR in the same way as SPARK_CONF_DIR.
> Our use case:
>  We are trying to use Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an
> incompatibility on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we tweak
> the jars with the Maven Shade Plugin.
>  The resulting jars differ slightly from the jars shipped with Spark 2.4.5, and
> we place them in a directory different from the default. So it would be useful
> for us if we could set SPARK_JARS_DIR for the bin/sbin scripts to point to that
> directory.
>  We could do that without this modification by deploying one Spark home per set
> of jars, but that is somewhat redundant.
> Common use case:
>  I believe there are similar use cases. For example, deploying Spark built for
> Scala 2.11 and Scala 2.12 on one machine and switching the jars location by
> setting SPARK_JARS_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31432) bin/sbin scripts should allow to customize jars dir

2020-04-13 Thread Shingo Furuyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shingo Furuyama updated SPARK-31432:

Environment: (was: In the script under bin/sbin, it is better that we 
can specify SPARK_JARS_DIR as same as SPARK_CONF_DIR.

Our usecase:
We are trying to employ spark 2.4.5 with YARN in HDP2.6.4. Since there is an 
incompatible conflict on commons-lang3 between spark 2.4.5 and HDP2.6.4, we 
tweak the jars by Maven Shade Plugin.
The jars slightly differ from jars in spark 2.4.5, and we locate it in a 
directory different from the default. So it is useful for us if we can set 
SPARK_JARS_DIR for bin/sbin scripts to point the direcotry.
We can do that without the modification by deploying spark home as many as set 
of jars, but it is somehow redundant.

Common usecase:
I believe there is a similer usecase. For example, deploying spark built for 
scala 2.11 and 2.12 in a machine and switch jars location by setting 
SPARK_JARS_DIR.)

> bin/sbin scripts should allow to customize jars dir
> ---
>
> Key: SPARK-31432
> URL: https://issues.apache.org/jira/browse/SPARK-31432
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Shingo Furuyama
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31432) bin/sbin scripts should allow to customize jars dir

2020-04-13 Thread Shingo Furuyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shingo Furuyama updated SPARK-31432:

Description: 
In the scripts under bin/sbin, it would be better if we could specify
SPARK_JARS_DIR in the same way as SPARK_CONF_DIR.

Our use case:
 We are trying to use Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an
incompatibility on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we tweak the
jars with the Maven Shade Plugin.
 The resulting jars differ slightly from the jars shipped with Spark 2.4.5, and we
place them in a directory different from the default. So it would be useful for us
if we could set SPARK_JARS_DIR for the bin/sbin scripts to point to that directory.
 We could do that without this modification by deploying one Spark home per set of
jars, but that is somewhat redundant.

Common use case:
 I believe there are similar use cases. For example, deploying Spark built for
Scala 2.11 and Scala 2.12 on one machine and switching the jars location by setting
SPARK_JARS_DIR.

> bin/sbin scripts should allow to customize jars dir
> ---
>
> Key: SPARK-31432
> URL: https://issues.apache.org/jira/browse/SPARK-31432
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Shingo Furuyama
>Priority: Minor
>
> In the scripts under bin/sbin, it would be better if we could specify
> SPARK_JARS_DIR in the same way as SPARK_CONF_DIR.
> Our use case:
>  We are trying to use Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an
> incompatibility on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we tweak
> the jars with the Maven Shade Plugin.
>  The resulting jars differ slightly from the jars shipped with Spark 2.4.5, and
> we place them in a directory different from the default. So it would be useful
> for us if we could set SPARK_JARS_DIR for the bin/sbin scripts to point to that
> directory.
>  We could do that without this modification by deploying one Spark home per set
> of jars, but that is somewhat redundant.
> Common use case:
>  I believe there are similar use cases. For example, deploying Spark built for
> Scala 2.11 and Scala 2.12 on one machine and switching the jars location by
> setting SPARK_JARS_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31432) bin/sbin scripts should allow to customize jars dir

2020-04-13 Thread Shingo Furuyama (Jira)
Shingo Furuyama created SPARK-31432:
---

 Summary: bin/sbin scripts should allow to customize jars dir
 Key: SPARK-31432
 URL: https://issues.apache.org/jira/browse/SPARK-31432
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.4.5, 3.0.0
 Environment: In the scripts under bin/sbin, it would be better if we could
specify SPARK_JARS_DIR in the same way as SPARK_CONF_DIR.

Our use case:
We are trying to use Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an
incompatibility on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we tweak the
jars with the Maven Shade Plugin.
The resulting jars differ slightly from the jars shipped with Spark 2.4.5, and we
place them in a directory different from the default. So it would be useful for us
if we could set SPARK_JARS_DIR for the bin/sbin scripts to point to that directory.
We could do that without this modification by deploying one Spark home per set of
jars, but that is somewhat redundant.

Common use case:
I believe there are similar use cases. For example, deploying Spark built for
Scala 2.11 and Scala 2.12 on one machine and switching the jars location by setting
SPARK_JARS_DIR.
Reporter: Shingo Furuyama






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org