[jira] [Resolved] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng resolved SPARK-31301.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 28176
[https://github.com/apache/spark/pull/28176]

> flatten the result dataframe of tests in stat
> ---------------------------------------------
>
>                 Key: SPARK-31301
>                 URL: https://issues.apache.org/jira/browse/SPARK-31301
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>             Fix For: 3.1.0
>
> {code:java}
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
>
> scala> import org.apache.spark.ml.stat.ChiSquareTest
> import org.apache.spark.ml.stat.ChiSquareTest
>
> scala> val data = Seq(
>      |   (0.0, Vectors.dense(0.5, 10.0)),
>      |   (0.0, Vectors.dense(1.5, 20.0)),
>      |   (1.0, Vectors.dense(1.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 40.0)),
>      |   (1.0, Vectors.dense(3.5, 40.0))
>      | )
> data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))
>
> scala> val df = data.toDF("label", "features")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
>
> scala> val chi = ChiSquareTest.test(df, "features", "label")
> chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: array ... 1 more field]
>
> scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+
> {code}
>
> Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, and {{Correlation}} all return a df containing only one row.
> I think this is quite hard to use: suppose we have a dataset with dim=1000, the only thing we can do with the test result is collect it via {{head()}} or {{first()}} and then use it in the driver. What I really want is to filter the df with conditions like {{pValue > 0.1}} or {{corr < 0.5}}, *so I suggest flattening the output df in those tests.*
>
> note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but {{ChiSquareTest}} and {{Correlation}} have been here for a long time.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
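The flattening the ticket proposes can be sketched in plain Python, independent of Spark: turn the single row of parallel vectors into one record per feature, so that per-feature filtering such as `pValue > 0.1` becomes an ordinary filter. The numbers and the field names below (`featureIndex`, `pValue`, etc.) are made up for illustration; they are not necessarily the columns the final patch uses.

```python
# Hedged sketch of the proposed flattening: one record per feature instead
# of a single row of parallel vectors. Values and field names are invented
# for illustration only.
row = {
    "pValues": [0.687, 0.05],
    "degreesOfFreedom": [2, 3],
    "statistics": [0.75, 1.5],
}

flat = [
    {"featureIndex": i, "pValue": p, "degreesOfFreedom": d, "statistic": s}
    for i, (p, d, s) in enumerate(
        zip(row["pValues"], row["degreesOfFreedom"], row["statistics"])
    )
]

# With flat rows, the filter the reporter wants is trivial
# (df.filter("pValue > 0.1") in Spark terms):
kept = [r for r in flat if r["pValue"] > 0.1]
```

The same shape change is what makes the result usable for dim=1000 without collecting everything to the driver.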
[jira] [Assigned] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng reassigned SPARK-31301:
------------------------------------
    Assignee: zhengruifeng

> flatten the result dataframe of tests in stat
> ---------------------------------------------
>
>                 Key: SPARK-31301
>                 URL: https://issues.apache.org/jira/browse/SPARK-31301
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> {code:java}
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
>
> scala> import org.apache.spark.ml.stat.ChiSquareTest
> import org.apache.spark.ml.stat.ChiSquareTest
>
> scala> val data = Seq(
>      |   (0.0, Vectors.dense(0.5, 10.0)),
>      |   (0.0, Vectors.dense(1.5, 20.0)),
>      |   (1.0, Vectors.dense(1.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 40.0)),
>      |   (1.0, Vectors.dense(3.5, 40.0))
>      | )
> data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))
>
> scala> val df = data.toDF("label", "features")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
>
> scala> val chi = ChiSquareTest.test(df, "features", "label")
> chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: array ... 1 more field]
>
> scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+
> {code}
>
> Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, and {{Correlation}} all return a df containing only one row.
> I think this is quite hard to use: suppose we have a dataset with dim=1000, the only thing we can do with the test result is collect it via {{head()}} or {{first()}} and then use it in the driver. What I really want is to filter the df with conditions like {{pValue > 0.1}} or {{corr < 0.5}}, *so I suggest flattening the output df in those tests.*
>
> note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but {{ChiSquareTest}} and {{Correlation}} have been here for a long time.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache
[ https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082896#comment-17082896 ]

Zhou Jiashuai commented on SPARK-26385:
---------------------------------------

I enabled logging with -Dsun.security.krb5.debug=true and -Dsun.security.spnego.debug=true and got the following output. It seems to have logged out after running for 24 or 25 hours.

{quote}
[UnixLoginModule]: succeeded importing info:
uid = 3107 gid = 3107 supp gid = 3107
Debug is true storeKey false useTicketCache true useKeyTab false doNotPrompt true ticketCache is null isInitiator true KeyTab is null refreshKrb5Config is false principal is null tryFirstPass is false useFirstPass is false storePass is false clearPass is false
Acquire TGT from Cache
Principal is null
null credentials from Ticket Cache
[Krb5LoginModule] authentication failed
Unable to obtain Principal Name for authentication
[UnixLoginModule]: added UnixPrincipal, UnixNumericUserPrincipal, UnixNumericGroupPrincipal(s), to Subject
Debug is true storeKey true useTicketCache false useKeyTab true doNotPrompt true ticketCache is null isInitiator true KeyTab is username.keytab-a0d905e9-3926-422f-8068-ffec9ace4cc2 refreshKrb5Config is true principal is usern...@bdp.com tryFirstPass is false useFirstPass is false storePass is false clearPass is false
Refreshing Kerberos configuration
principal is usern...@bdp.com
Will use keytab
Commit Succeeded
Debug is true storeKey true useTicketCache false useKeyTab true doNotPrompt true ticketCache is null isInitiator true KeyTab is username.keytab-a0d905e9-3926-422f-8068-ffec9ace4cc2 refreshKrb5Config is true principal is usern...@bdp.com tryFirstPass is false useFirstPass is false storePass is false clearPass is false
Refreshing Kerberos configuration
principal is usern...@bdp.com
Will use keytab
Commit Succeeded
[Krb5LoginModule]: Entering logout
[Krb5LoginModule]: logged out Subject
Debug is true storeKey true useTicketCache false useKeyTab true doNotPrompt true
ticketCache is null isInitiator true KeyTab is username.keytab-a0d905e9-3926-422f-8068-ffec9ace4cc2 refreshKrb5Config is true principal is usern...@bdp.com tryFirstPass is false useFirstPass is false storePass is false clearPass is false
Refreshing Kerberos configuration
principal is usern...@bdp.com
Will use keytab
Commit Succeeded
{quote}

> YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-26385
>                 URL: https://issues.apache.org/jira/browse/SPARK-26385
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>         Environment: Hadoop 2.6.0, Spark 2.4.0
>            Reporter: T M
>            Priority: Major
>
> Hello,
>
> I have a Spark Structured Streaming job which is running on YARN (Hadoop 2.6.0, Spark 2.4.0). After 25-26 hours, my job stops working with the following error:
> {code:java}
> 2018-12-16 22:35:17 ERROR org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, realUser=, issueDate=1544903057122, maxDate=1545507857122, sequenceNumber=10314, masterKeyId=344) can't be found in cache
>   at org.apache.hadoop.ipc.Client.call(Client.java:1470)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1401)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
>   at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at
java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977)
>   at org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133)
>   at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120)
>   at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116)
>   at org.apache.hado
[jira] [Updated] (SPARK-31432) bin/sbin scripts should allow to customize jars dir
[ https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31432:
----------------------------------
    Affects Version/s:     (was: 2.4.5)
                           (was: 3.0.0)
                       3.1.0

> bin/sbin scripts should allow to customize jars dir
> ---------------------------------------------------
>
>                 Key: SPARK-31432
>                 URL: https://issues.apache.org/jira/browse/SPARK-31432
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>    Affects Versions: 3.1.0
>            Reporter: Shingo Furuyama
>            Priority: Minor
>
> In the scripts under bin/sbin, it would be better if we could specify SPARK_JARS_DIR in the same way as SPARK_CONF_DIR.
>
> Our use case:
> We are trying to employ Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an incompatible conflict on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we tweak the jars with the Maven Shade Plugin.
> The resulting jars differ slightly from the jars in Spark 2.4.5, and we place them in a directory different from the default. So it would be useful for us if we could set SPARK_JARS_DIR to make the bin/sbin scripts point to that directory.
> We could achieve this without the modification by deploying one Spark home per set of jars, but that is somewhat redundant.
>
> Common use case:
> I believe there are similar use cases. For example, deploying Spark built for Scala 2.11 and Scala 2.12 on one machine and switching the jars location by setting SPARK_JARS_DIR.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
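A minimal sketch of what the requested knob could look like inside a launcher script, assuming a variable named SPARK_JARS_DIR that mirrors the existing SPARK_CONF_DIR convention. The fallback logic is my guess at an implementation of the reporter's proposal, not actual Spark code.

```shell
# Hypothetical launcher-script fragment: honor SPARK_JARS_DIR if the user
# set it, otherwise fall back to the stock layout under SPARK_HOME/jars.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
SPARK_JARS_DIR="${SPARK_JARS_DIR:-"${SPARK_HOME}/jars"}"
echo "Using jars from: ${SPARK_JARS_DIR}"
```

With this shape, the two deployments (e.g. Scala 2.11 vs 2.12 jars) could share one Spark home and switch by exporting SPARK_JARS_DIR before launch.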
[jira] [Assigned] (SPARK-30953) InsertAdaptiveSparkPlan should apply AQE on child plan of write commands
[ https://issues.apache.org/jira/browse/SPARK-30953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-30953:
-----------------------------------
    Assignee: wuyi

> InsertAdaptiveSparkPlan should apply AQE on child plan of write commands
> ------------------------------------------------------------------------
>
>                 Key: SPARK-30953
>                 URL: https://issues.apache.org/jira/browse/SPARK-30953
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: wuyi
>            Assignee: wuyi
>            Priority: Major
>
> Applying AQE to a write command together with its child plan would expose {{LogicalQueryStage}} to the {{Analyzer}}, while it should stay hidden under {{AdaptiveSparkPlanExec}} only, to avoid unexpected breakage.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30953) InsertAdaptiveSparkPlan should apply AQE on child plan of write commands
[ https://issues.apache.org/jira/browse/SPARK-30953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-30953.
---------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 27701
[https://github.com/apache/spark/pull/27701]

> InsertAdaptiveSparkPlan should apply AQE on child plan of write commands
> ------------------------------------------------------------------------
>
>                 Key: SPARK-30953
>                 URL: https://issues.apache.org/jira/browse/SPARK-30953
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: wuyi
>            Assignee: wuyi
>            Priority: Major
>             Fix For: 3.0.0
>
> Applying AQE to a write command together with its child plan would expose {{LogicalQueryStage}} to the {{Analyzer}}, while it should stay hidden under {{AdaptiveSparkPlanExec}} only, to avoid unexpected breakage.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31441) Support duplicated column names for toPandas with Arrow execution.
[ https://issues.apache.org/jira/browse/SPARK-31441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31441: Assignee: Takuya Ueshin > Support duplicated column names for toPandas with Arrow execution. > -- > > Key: SPARK-31441 > URL: https://issues.apache.org/jira/browse/SPARK-31441 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > When we execute {{toPandas()}} with Arrow execution, it fails if the column > names have duplicates. > {code:python} > >>> spark.sql("select 1 v, 1 v").toPandas() > Traceback (most recent call last): > File "", line 1, in > File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line > 2132, in toPandas > pdf = table.to_pandas() > File "pyarrow/array.pxi", line 441, in > pyarrow.lib._PandasConvertible.to_pandas > File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas > File > "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.7/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 653, in table_to_blockmanager > columns = _deserialize_column_index(table, all_columns, column_indexes) > File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line > 704, in _deserialize_column_index > columns = _flatten_single_level_multiindex(columns) > File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line > 937, in _flatten_single_level_multiindex > raise ValueError('Found non-unique column index') > ValueError: Found non-unique column index > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31441) Support duplicated column names for toPandas with Arrow execution.
[ https://issues.apache.org/jira/browse/SPARK-31441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31441. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28210 [https://github.com/apache/spark/pull/28210] > Support duplicated column names for toPandas with Arrow execution. > -- > > Key: SPARK-31441 > URL: https://issues.apache.org/jira/browse/SPARK-31441 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.0.0 > > > When we execute {{toPandas()}} with Arrow execution, it fails if the column > names have duplicates. > {code:python} > >>> spark.sql("select 1 v, 1 v").toPandas() > Traceback (most recent call last): > File "", line 1, in > File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line > 2132, in toPandas > pdf = table.to_pandas() > File "pyarrow/array.pxi", line 441, in > pyarrow.lib._PandasConvertible.to_pandas > File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas > File > "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.7/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 653, in table_to_blockmanager > columns = _deserialize_column_index(table, all_columns, column_indexes) > File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line > 704, in _deserialize_column_index > columns = _flatten_single_level_multiindex(columns) > File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line > 937, in _flatten_single_level_multiindex > raise ValueError('Found non-unique column index') > ValueError: Found non-unique column index > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31392) Support CalendarInterval to be reflect to CalendarIntervalType
[ https://issues.apache.org/jira/browse/SPARK-31392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-31392:
--------------------------------
    Fix Version/s:     (was: 3.1.0)
                   3.0.0

> Support CalendarInterval to be reflect to CalendarIntervalType
> --------------------------------------------------------------
>
>                 Key: SPARK-31392
>                 URL: https://issues.apache.org/jira/browse/SPARK-31392
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Kent Yao
>            Assignee: Kent Yao
>            Priority: Major
>             Fix For: 3.0.0
>
> Since Spark 3.0.0 makes {{CalendarInterval}} public, it is better for it to be inferred as {{CalendarIntervalType}}.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31426) Regression in loading/saving timestamps from/to ORC files
[ https://issues.apache.org/jira/browse/SPARK-31426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-31426:
-----------------------------------
    Assignee: Maxim Gekk

> Regression in loading/saving timestamps from/to ORC files
> ---------------------------------------------------------
>
>                 Key: SPARK-31426
>                 URL: https://issues.apache.org/jira/browse/SPARK-31426
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>
> Here are results of DateTimeRebaseBenchmark on the current master branch:
> {code}
> Save timestamps to ORC:    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> ---------------------------------------------------------------------------------------------------
> after 1582                         59877         59877          0        1.7        598.8       0.0X
> before 1582                        61361         61361          0        1.6        613.6       0.0X
>
> Load timestamps from ORC:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> ---------------------------------------------------------------------------------------------------
> after 1582, vec off                48197         48288        118        2.1        482.0       1.0X
> after 1582, vec on                 38247         38351        128        2.6        382.5       1.3X
> before 1582, vec off               53179         53359        249        1.9        531.8       0.9X
> before 1582, vec on                44076         44268        269        2.3        440.8       1.1X
> {code}
> The results of the same benchmark on Spark 2.4.6-SNAPSHOT:
> {code}
> Save timestamps to ORC:    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> ---------------------------------------------------------------------------------------------------
> after 1582                         18858         18858          0        5.3        188.6       1.0X
> before 1582                        18508         18508          0        5.4        185.1       1.0X
>
> Load timestamps from ORC:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> ---------------------------------------------------------------------------------------------------
> after 1582, vec off                14063         14177        143        7.1        140.6       1.0X
> after 1582, vec on                  5955          6029        100       16.8         59.5       2.4X
> before 1582, vec off               14119         14126          7        7.1        141.2       1.0X
> before 1582, vec on                 5991          6007         25       16.7         59.9       2.3X
> {code}
> Here is the PR with DateTimeRebaseBenchmark backported to 2.4:
> https://github.com/MaxGekk/spark/pull/27

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31426) Regression in loading/saving timestamps from/to ORC files
[ https://issues.apache.org/jira/browse/SPARK-31426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-31426.
---------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 28189
[https://github.com/apache/spark/pull/28189]

> Regression in loading/saving timestamps from/to ORC files
> ---------------------------------------------------------
>
>                 Key: SPARK-31426
>                 URL: https://issues.apache.org/jira/browse/SPARK-31426
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 3.0.0
>
> Here are results of DateTimeRebaseBenchmark on the current master branch:
> {code}
> Save timestamps to ORC:    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> ---------------------------------------------------------------------------------------------------
> after 1582                         59877         59877          0        1.7        598.8       0.0X
> before 1582                        61361         61361          0        1.6        613.6       0.0X
>
> Load timestamps from ORC:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> ---------------------------------------------------------------------------------------------------
> after 1582, vec off                48197         48288        118        2.1        482.0       1.0X
> after 1582, vec on                 38247         38351        128        2.6        382.5       1.3X
> before 1582, vec off               53179         53359        249        1.9        531.8       0.9X
> before 1582, vec on                44076         44268        269        2.3        440.8       1.1X
> {code}
> The results of the same benchmark on Spark 2.4.6-SNAPSHOT:
> {code}
> Save timestamps to ORC:    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> ---------------------------------------------------------------------------------------------------
> after 1582                         18858         18858          0        5.3        188.6       1.0X
> before 1582                        18508         18508          0        5.4        185.1       1.0X
>
> Load timestamps from ORC:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> ---------------------------------------------------------------------------------------------------
> after 1582, vec off                14063         14177        143        7.1        140.6       1.0X
> after 1582, vec on                  5955          6029        100       16.8         59.5       2.4X
> before 1582, vec off               14119         14126          7        7.1        141.2       1.0X
> before 1582, vec on                 5991          6007         25       16.7         59.9       2.3X
> {code}
> Here is the PR with DateTimeRebaseBenchmark backported to 2.4:
> https://github.com/MaxGekk/spark/pull/27

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
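For anyone wanting to reproduce numbers like the above: Spark's benchmark classes carry run instructions in their own header comments, and in the Spark 3.x source tree the usual pattern is an sbt `runMain` invocation. The command below is quoted from memory as an illustration, so verify the exact task name against the header of DateTimeRebaseBenchmark itself before relying on it.

```shell
# Illustrative invocation of a Spark micro-benchmark from a source checkout;
# check the benchmark file's header comment for the authoritative command.
cd /path/to/spark
build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```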
[jira] [Updated] (SPARK-31441) Support duplicated column names for toPandas with Arrow execution.
[ https://issues.apache.org/jira/browse/SPARK-31441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-31441: -- Summary: Support duplicated column names for toPandas with Arrow execution. (was: Support duplicated column names for toPandas with arrow execution.) > Support duplicated column names for toPandas with Arrow execution. > -- > > Key: SPARK-31441 > URL: https://issues.apache.org/jira/browse/SPARK-31441 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > When we execute {{toPandas()}} with Arrow execution, it fails if the column > names have duplicates. > {code:python} > >>> spark.sql("select 1 v, 1 v").toPandas() > Traceback (most recent call last): > File "", line 1, in > File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line > 2132, in toPandas > pdf = table.to_pandas() > File "pyarrow/array.pxi", line 441, in > pyarrow.lib._PandasConvertible.to_pandas > File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas > File > "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.7/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 653, in table_to_blockmanager > columns = _deserialize_column_index(table, all_columns, column_indexes) > File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line > 704, in _deserialize_column_index > columns = _flatten_single_level_multiindex(columns) > File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line > 937, in _flatten_single_level_multiindex > raise ValueError('Found non-unique column index') > ValueError: Found non-unique column index > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31441) Support duplicated column names for toPandas with arrow execution.
Takuya Ueshin created SPARK-31441:
-------------------------------------

             Summary: Support duplicated column names for toPandas with arrow execution.
                 Key: SPARK-31441
                 URL: https://issues.apache.org/jira/browse/SPARK-31441
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.5, 3.0.0
            Reporter: Takuya Ueshin

When we execute {{toPandas()}} with Arrow execution, it fails if the column names have duplicates.

{code:python}
>>> spark.sql("select 1 v, 1 v").toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas
    pdf = table.to_pandas()
  File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.7/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index
    columns = _flatten_single_level_multiindex(columns)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex
    raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
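The failure comes from pyarrow refusing a non-unique pandas column index. One way a fix for this class of problem can work is to key the columns by position during conversion (so every intermediate key is unique) and reattach the duplicated user-visible names afterwards. The plain-Python sketch below only illustrates that positional-naming idea; the function name and data shapes are invented, and this is not the actual Spark/pyarrow code path.

```python
# Hedged sketch: key column data by position so the intermediate mapping has
# unique keys, then reattach the (possibly duplicated) names. Illustrative
# only; not the code path Spark or pyarrow actually use.
def convert_with_duplicates(names, columns):
    # unique positional keys sidestep any "non-unique column index" check
    positional = {f"_col_{i}": col for i, col in enumerate(columns)}
    # ...a conversion requiring unique keys would operate on `positional`...
    return list(zip(names, positional.values()))

# mirrors `select 1 v, 1 v`: two columns that share the name "v"
result = convert_with_duplicates(["v", "v"], [[1], [1]])
```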
[jira] [Assigned] (SPARK-31434) Drop builtin function pages from SQL references
[ https://issues.apache.org/jira/browse/SPARK-31434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31434: Assignee: Takeshi Yamamuro > Drop builtin function pages from SQL references > --- > > Key: SPARK-31434 > URL: https://issues.apache.org/jira/browse/SPARK-31434 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > > This ticket intends to drop the built-in function pages from SQL references. > We've already had a complete list of built-in functions in the API documents. > See related discussions for more details: > https://github.com/apache/spark/pull/28170#issuecomment-611917191 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31434) Drop builtin function pages from SQL references
[ https://issues.apache.org/jira/browse/SPARK-31434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31434. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28203 [https://github.com/apache/spark/pull/28203] > Drop builtin function pages from SQL references > --- > > Key: SPARK-31434 > URL: https://issues.apache.org/jira/browse/SPARK-31434 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > > This ticket intends to drop the built-in function pages from SQL references. > We've already had a complete list of built-in functions in the API documents. > See related discussions for more details: > https://github.com/apache/spark/pull/28170#issuecomment-611917191 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31411) Show submitted time and duration in job details page
[ https://issues.apache.org/jira/browse/SPARK-31411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang resolved SPARK-31411.
------------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

The issue is resolved in https://github.com/apache/spark/pull/28179

> Show submitted time and duration in job details page
> ----------------------------------------------------
>
>                 Key: SPARK-31411
>                 URL: https://issues.apache.org/jira/browse/SPARK-31411
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>    Affects Versions: 3.1.0
>            Reporter: Gengliang Wang
>            Assignee: Gengliang Wang
>            Priority: Major
>             Fix For: 3.1.0
>
> Currently, the job details UI page shows neither the submitted time nor the duration of a job.
> We should show both on the job details page.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31420) Infinite timeline redraw in job details page
[ https://issues.apache.org/jira/browse/SPARK-31420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082743#comment-17082743 ] Dongjoon Hyun commented on SPARK-31420: --- Thank you for confirming, [~sarutak]! > Infinite timeline redraw in job details page > > > Key: SPARK-31420 > URL: https://issues.apache.org/jira/browse/SPARK-31420 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0, 3.1.0 >Reporter: Gengliang Wang >Assignee: Kousuke Saruta >Priority: Major > Attachments: timeline.mov > > > In the job page, the timeline section keeps changing the position style and > shaking. We can see that there is a warning "infinite loop in redraw" from > the console, which can be related to > https://github.com/visjs/vis-timeline/issues/17 > I am using the history server with the events under > "core/src/test/resources/spark-events" to reproduce. > I have also uploaded a screen recording. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082736#comment-17082736 ] Erik Krogen commented on SPARK-22148: - For future folks: the JIRA created for the issue is SPARK-31418 and discussion is continuing there. > TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current > executors are blacklisted but dynamic allocation is enabled > - > > Key: SPARK-22148 > URL: https://issues.apache.org/jira/browse/SPARK-22148 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.2.0 >Reporter: Juan Rodríguez Hortalá >Assignee: Dhruve Ashar >Priority: Major > Fix For: 2.4.1, 3.0.0 > > Attachments: SPARK-22148_WIP.diff > > > Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and > the whole Spark job with `task X (partition Y) cannot run anywhere due to > node and executor blacklist. Blacklisting behavior can be configured via > spark.blacklist.*.` when all the available executors are blacklisted for a > pending Task or TaskSet. This makes sense for static allocation, where the > set of executors is fixed for the duration of the application, but this might > lead to unnecessary job failures when dynamic allocation is enabled. For > example, in a Spark application with a single job at a time, when a node > fails at the end of a stage attempt, all other executors will complete their > tasks, but the tasks running in the executors of the failing node will be > pending. Spark will keep waiting for those tasks for 2 minutes by default > (spark.network.timeout) until the heartbeat timeout is triggered, and then it > will blacklist those executors for that stage. 
At that point in time, other > executors would have been released after being idle for 1 minute by default > (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't > started yet and so there are no more tasks available (assuming the default of > spark.speculation = false). So Spark will fail because the only executors > available are blacklisted for that stage. > An alternative is requesting more executors from the cluster manager in this > situation. This could be retried a configurable number of times, with a > configurable wait time between request attempts, so that if the cluster manager > fails to provide a suitable executor the job is aborted as in the > previous case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
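The race described above is driven by ordinary configuration; the keys referenced in this scenario, with the default values the description cites, are:

```properties
spark.network.timeout                        120s  # heartbeat timeout; ~2 min before the failed node's executors are blacklisted for the stage
spark.dynamicAllocation.executorIdleTimeout  60s   # idle executors are released after ~1 min
spark.blacklist.*                                  # blacklisting behavior, e.g. spark.blacklist.enabled
```

Because the idle timeout (60s) fires before the heartbeat timeout (120s), the healthy executors are already gone by the time the bad node's executors get blacklisted.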
[jira] [Updated] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-31418: -- Issue Type: Improvement (was: Bug) > Blacklisting feature aborts Spark job without retrying for max num retries in > case of Dynamic allocation > > > Key: SPARK-31418 > URL: https://issues.apache.org/jira/browse/SPARK-31418 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.4.5 >Reporter: Venkata krishnan Sowrirajan >Priority: Major > > With Spark blacklisting, if a task fails on an executor, the executor gets > blacklisted for the task. In order to retry the task, it checks if there are > idle blacklisted executor which can be killed and replaced to retry the task > if not it aborts the job without doing max retries. > In the context of dynamic allocation this can be better, instead of killing > the blacklisted idle executor (its possible there are no idle blacklisted > executor), request an additional executor and retry the task. > This can be easily reproduced with a simple job like below, although this > example should fail eventually just to show that its not retried > spark.task.maxFailures times: > {code:java} > def test(a: Int) = { a.asInstanceOf[String] } > sc.parallelize(1 to 10, 10).map(x => test(x)).collect > {code} > with dynamic allocation enabled and min executors set to 1. But there are > various other cases where this can fail as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082731#comment-17082731 ] Venkata krishnan Sowrirajan commented on SPARK-31418: - [~tgraves] Currently, I'm thinking we can check whether dynamic allocation is enabled; if so, we can request one more executor using ExecutorAllocationClient#requestExecutors and start the abort timer. But I re-read your [comment|https://issues.apache.org/jira/browse/SPARK-22148?focusedCommentId=17078278&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17078278] again, and it seems you tried to pass the information to ExecutorAllocationManager and request the executor through ExecutorAllocationManager. Is that right? Regarding the idea of killing other non-idle blacklisted executors, I don't think that would be better, as we might kill tasks from other stages, as mentioned in other comments on the PR. Let me know if you have any other thoughts on this problem. We are facing this issue frequently; retrying the whole job usually succeeds, but the failure happens often. > Blacklisting feature aborts Spark job without retrying for max num retries in > case of Dynamic allocation > > > Key: SPARK-31418 > URL: https://issues.apache.org/jira/browse/SPARK-31418 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.5 >Reporter: Venkata krishnan Sowrirajan >Priority: Major > > With Spark blacklisting, if a task fails on an executor, the executor gets > blacklisted for the task. In order to retry the task, it checks if there are > idle blacklisted executor which can be killed and replaced to retry the task > if not it aborts the job without doing max retries. > In the context of dynamic allocation this can be better, instead of killing > the blacklisted idle executor (its possible there are no idle blacklisted > executor), request an additional executor and retry the task. 
> This can be easily reproduced with a simple job like below, although this > example should fail eventually just to show that its not retried > spark.task.maxFailures times: > {code:java} > def test(a: Int) = { a.asInstanceOf[String] } > sc.parallelize(1 to 10, 10).map(x => test(x)).collect > {code} > with dynamic allocation enabled and min executors set to 1. But there are > various other cases where this can fail as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
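The repro's map function fails without any Spark machinery: the cast itself throws, so every task attempt fails and the executor gets blacklisted for the task. A plain-Scala sketch of why:

```scala
// Casting a boxed Int to String compiles but throws ClassCastException at
// runtime; this is what makes every task attempt in the repro above fail.
def test(a: Int): String = a.asInstanceOf[String]

val attempt = scala.util.Try(test(1))
val failedWithCce =
  attempt.failed.toOption.exists(_.isInstanceOf[ClassCastException])
```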
[jira] [Assigned] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0
[ https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-31306: Assignee: Ben > rand() function documentation suggests an inclusive upper bound of 1.0 > -- > > Key: SPARK-31306 > URL: https://issues.apache.org/jira/browse/SPARK-31306 > Project: Spark > Issue Type: Documentation > Components: PySpark, R, Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Ben >Assignee: Ben >Priority: Major > > The rand() function in PySpark, Spark, and R is documented as drawing from > U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing > (i.e. for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing > `U[0.0, 1.0]` suggests the value returned could include 1.0). The function > itself uses Rand(), which is [documented |#L71] as having a result in the > range [0, 1). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
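The half-open [0, 1) contract at issue here is the same one `scala.util.Random.nextDouble` documents; a quick standalone sketch of the distinction (using plain Scala rather than Spark's rand()):

```scala
import scala.util.Random

// nextDouble returns values in [0.0, 1.0): 0.0 is possible, 1.0 is not.
// Spark SQL's underlying Rand() makes the same half-open guarantee, which is
// why the docs should write [0.0, 1.0) rather than the inclusive U[0.0, 1.0].
val rng = new Random(42)
val samples = Seq.fill(100000)(rng.nextDouble())
val allInHalfOpenRange = samples.forall(x => x >= 0.0 && x < 1.0)
```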
[jira] [Updated] (SPARK-31440) Improve SQL Rest API
[ https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-31440: --- Description: SQL Rest API exposes query execution metrics as Public API. This Jira aims to apply following improvements on SQL Rest API by aligning Spark-UI. *Proposed Improvements:* 1- Support Physical Operations and group metrics per operation by aligning Spark UI. 2- *nodeId* can be useful for grouping metrics as well as for sorting and to differentiate same operators and their metrics. 3- Filter *blank* metrics by aligning with Spark UI - SQL Tab 4- Remove *\n* from *metricValue(s)* 5- *planDescription* can be optional Http parameter to avoid network cost (specially for complex jobs creating big-plans). 6- *metrics* attribute needs to be exposed at the bottom order as *metricDetails*. This order matches with Spark UI by highlighting with execution order. *Attachments:* Please find both *current* and *improved* versions of the results as attached for following SQL Rest Endpoint: {code:java} curl -X GET http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code} was: SQL Rest API exposes query execution metrics as Public API. This Jira aims to apply following improvements on SQL Rest API by aligning Spark-UI. *Proposed Improvements:* 1- Support Physical Operations and group metrics per operation by aligning Spark UI. 2- *nodeId* can be useful for grouping metrics as well as for sorting and to differentiate same operators and their metrics. 3- Filter *blank* metrics by aligning with Spark UI - SQL Tab 4- Remove *\n* from *metricValue(s)* 5- *planDescription* can be optional Http parameter to avoid network cost (specially for complex jobs creating big-plans). 6- *metrics* attribute needs to be exposed at the bottom order as *metricDetails*. This order matches with Spark UI by highlighting with execution order. 
*Attachments:* Please find both *current* and *improved* versions of results as attached. > Improve SQL Rest API > > > Key: SPARK-31440 > URL: https://issues.apache.org/jira/browse/SPARK-31440 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Eren Avsarogullari >Priority: Major > Attachments: current_version.json, improved_version.json > > > SQL Rest API exposes query execution metrics as Public API. This Jira aims to > apply following improvements on SQL Rest API by aligning Spark-UI. > *Proposed Improvements:* > 1- Support Physical Operations and group metrics per operation by aligning > Spark UI. > 2- *nodeId* can be useful for grouping metrics as well as for sorting and to > differentiate same operators and their metrics. > 3- Filter *blank* metrics by aligning with Spark UI - SQL Tab > 4- Remove *\n* from *metricValue(s)* > 5- *planDescription* can be optional Http parameter to avoid network cost > (specially for complex jobs creating big-plans). > 6- *metrics* attribute needs to be exposed at the bottom order as > *metricDetails*. This order matches with Spark UI by highlighting with > execution order. > *Attachments:* > Please find both *current* and *improved* versions of the results as > attached for following SQL Rest Endpoint: > {code:java} > curl -X GET > http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31440) Improve SQL Rest API
[ https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-31440: --- Attachment: current_version.json > Improve SQL Rest API > > > Key: SPARK-31440 > URL: https://issues.apache.org/jira/browse/SPARK-31440 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Eren Avsarogullari >Priority: Major > Attachments: current_version.json, improved_version.json > > > SQL Rest API exposes query execution metrics as Public API. This Jira aims to > apply following improvements on SQL Rest API by aligning Spark-UI. > *Proposed Improvements:* > 1- Support Physical Operations and group metrics per operation by aligning > Spark UI. > 2- *nodeId* can be useful for grouping metrics as well as for sorting and to > differentiate same operators and their metrics. > 3- Filter *blank* metrics by aligning with Spark UI - SQL Tab > 4- Remove *\n* from *metricValue(s)* > 5- *planDescription* can be optional Http parameter to avoid network cost > (specially for complex jobs creating big-plans). > 6- *metrics* attribute needs to be exposed at the bottom order as > *metricDetails*. This order matches with Spark UI by highlighting with > execution order. > *Attachments:* > Please find both *current* and *improved* versions of results as attached. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31440) Improve SQL Rest API
[ https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-31440: --- Attachment: improved_version.json > Improve SQL Rest API > > > Key: SPARK-31440 > URL: https://issues.apache.org/jira/browse/SPARK-31440 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Eren Avsarogullari >Priority: Major > Attachments: current_version.json, improved_version.json > > > SQL Rest API exposes query execution metrics as Public API. This Jira aims to > apply following improvements on SQL Rest API by aligning Spark-UI. > *Proposed Improvements:* > 1- Support Physical Operations and group metrics per operation by aligning > Spark UI. > 2- *nodeId* can be useful for grouping metrics as well as for sorting and to > differentiate same operators and their metrics. > 3- Filter *blank* metrics by aligning with Spark UI - SQL Tab > 4- Remove *\n* from *metricValue(s)* > 5- *planDescription* can be optional Http parameter to avoid network cost > (specially for complex jobs creating big-plans). > 6- *metrics* attribute needs to be exposed at the bottom order as > *metricDetails*. This order matches with Spark UI by highlighting with > execution order. > *Attachments:* > Please find both *current* and *improved* versions of results as attached. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31440) Improve SQL Rest API
[ https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-31440: --- Description: SQL Rest API exposes query execution metrics as Public API. This Jira aims to apply following improvements on SQL Rest API by aligning Spark-UI. *Proposed Improvements:* 1- Support Physical Operations and group metrics per operation by aligning Spark UI. 2- *nodeId* can be useful for grouping metrics as well as for sorting and to differentiate same operators and their metrics. 3- Filter *blank* metrics by aligning with Spark UI - SQL Tab 4- Remove *\n* from *metricValue(s)* 5- *planDescription* can be optional Http parameter to avoid network cost (specially for complex jobs creating big-plans). 6- *metrics* attribute needs to be exposed at the bottom order as *metricDetails*. This order matches with Spark UI by highlighting with execution order. *Attachments:* Please find both *current* and *improved* versions of results as attached. was: SQL Rest API exposes query execution metrics as Public API. This Jira aims to apply following improvements on SQL Rest API by aligning Spark-UI. *Proposed Improvements:* 1- Support Physical Operations and group metrics per operation by aligning Spark UI. 2- `nodeId` can be useful for grouping metrics as well as for sorting and to differentiate same operators and their metrics. 3- Filter `blank` metrics by aligning with Spark UI - SQL Tab 4- Remove `\n` from `metricValue` 5- `planDescription` can be optional Http parameter to avoid network cost (specially for complex jobs creating big-plans). 6- `metrics` attribute needs to be exposed at the bottom order as `metricDetails`. This order matches with Spark UI by highlighting with execution order. *Attachments:* Please find both *current* and *improved* versions of results as attached. 
> Improve SQL Rest API > > > Key: SPARK-31440 > URL: https://issues.apache.org/jira/browse/SPARK-31440 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Eren Avsarogullari >Priority: Major > > SQL Rest API exposes query execution metrics as Public API. This Jira aims to > apply following improvements on SQL Rest API by aligning Spark-UI. > *Proposed Improvements:* > 1- Support Physical Operations and group metrics per operation by aligning > Spark UI. > 2- *nodeId* can be useful for grouping metrics as well as for sorting and to > differentiate same operators and their metrics. > 3- Filter *blank* metrics by aligning with Spark UI - SQL Tab > 4- Remove *\n* from *metricValue(s)* > 5- *planDescription* can be optional Http parameter to avoid network cost > (specially for complex jobs creating big-plans). > 6- *metrics* attribute needs to be exposed at the bottom order as > *metricDetails*. This order matches with Spark UI by highlighting with > execution order. > *Attachments:* > Please find both *current* and *improved* versions of results as attached. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31440) Improve SQL Rest API
Eren Avsarogullari created SPARK-31440: -- Summary: Improve SQL Rest API Key: SPARK-31440 URL: https://issues.apache.org/jira/browse/SPARK-31440 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Eren Avsarogullari SQL Rest API exposes query execution metrics as Public API. This Jira aims to apply following improvements on SQL Rest API by aligning Spark-UI. *Proposed Improvements:* 1- Support Physical Operations and group metrics per operation by aligning Spark UI. 2- `nodeId` can be useful for grouping metrics as well as for sorting and to differentiate same operators and their metrics. 3- Filter `blank` metrics by aligning with Spark UI - SQL Tab 4- Remove `\n` from `metricValue` 5- `planDescription` can be optional Http parameter to avoid network cost (specially for complex jobs creating big-plans). 6- `metrics` attribute needs to be exposed at the bottom order as `metricDetails`. This order matches with Spark UI by highlighting with execution order. *Attachments:* Please find both *current* and *improved* versions of results as attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
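Improvements 2-4 (sorting by `nodeId`, dropping blank metrics, stripping `\n` from metric values) can be sketched in plain Scala. The `Metric`/`Node` case classes below are hypothetical stand-ins for the REST API's JSON shapes, not Spark types:

```scala
// Hypothetical models mirroring the proposed per-operation grouping.
case class Metric(name: String, value: String)
case class Node(nodeId: Int, nodeName: String, metrics: Seq[Metric])

// Drop metrics whose value is blank, strip newlines from the rest
// (improvements 3 and 4), then sort nodes by nodeId (improvement 2).
def clean(nodes: Seq[Node]): Seq[Node] =
  nodes
    .map(n => n.copy(metrics = n.metrics
      .filter(_.value.trim.nonEmpty)
      .map(m => m.copy(value = m.value.replace("\n", " ").trim))))
    .sortBy(_.nodeId)
```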
[jira] [Assigned] (SPARK-18299) Allow more aggregations on KeyValueGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-18299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-18299: --- Assignee: nooberfsh > Allow more aggregations on KeyValueGroupedDataset > - > > Key: SPARK-18299 > URL: https://issues.apache.org/jira/browse/SPARK-18299 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Matthias Niehoff >Assignee: nooberfsh >Priority: Minor > Fix For: 3.0.0 > > > The number of possible aggregations on a KeyValueGroupedDataset created by > groupByKey is limited to 4, as there are only methods with a maximum of 4 > parameters. > This value should be increased or - even better - made be completely > unlimited. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31439) Perf regression of fromJavaDate
Maxim Gekk created SPARK-31439: -- Summary: Perf regression of fromJavaDate Key: SPARK-31439 URL: https://issues.apache.org/jira/browse/SPARK-31439 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk DateTimeBenchmark shows the regression. Spark 2.4.6-SNAPSHOT at the PR https://github.com/MaxGekk/spark/pull/27:
{code}
Conversion from/to external types
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
To/from java.sql.Timestamp:    Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
From java.sql.Date                       614            655          43         8.1         122.8       1.0X
{code}
Current master:
{code}
Conversion from/to external types
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
To/from java.sql.Timestamp:    Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
From java.sql.Date                      1154           1206          46         4.3         230.9       1.0X
{code}
The regression is ~x2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
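For reference, the conversion fromJavaDate performs — an external java.sql.Date into Catalyst's internal day count since the epoch — can be expressed like this. This is an illustrative re-implementation, not Spark's actual code path:

```scala
// java.sql.Date -> days since 1970-01-01, via the proleptic Gregorian
// java.time API. This one-liner only illustrates the external-to-internal
// mapping that the benchmark above exercises.
def toEpochDays(d: java.sql.Date): Int = d.toLocalDate.toEpochDay.toInt
```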
[jira] [Updated] (SPARK-31426) Regression in loading/saving timestamps from/to ORC files
[ https://issues.apache.org/jira/browse/SPARK-31426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31426: --- Parent: SPARK-31404 Issue Type: Sub-task (was: Bug) > Regression in loading/saving timestamps from/to ORC files > - > > Key: SPARK-31426 > URL: https://issues.apache.org/jira/browse/SPARK-31426 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Here are results of DateTimeRebaseBenchmark on the current master branch: > {code} > Save timestamps to ORC: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > after 158259877 59877 >0 1.7 598.8 0.0X > before 1582 61361 61361 >0 1.6 613.6 0.0X > Load timestamps from ORC: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > after 1582, vec off 48197 48288 > 118 2.1 482.0 1.0X > after 1582, vec on38247 38351 > 128 2.6 382.5 1.3X > before 1582, vec off 53179 53359 > 249 1.9 531.8 0.9X > before 1582, vec on 44076 44268 > 269 2.3 440.8 1.1X > {code} > The results of the same benchmark on Spark 2.4.6-SNAPSHOT: > {code} > Save timestamps to ORC: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > after 158218858 18858 >0 5.3 188.6 1.0X > before 1582 18508 18508 >0 5.4 185.1 1.0X > Load timestamps from ORC: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > after 1582, vec off 14063 14177 > 143 7.1 140.6 1.0X > after 1582, vec on 5955 6029 > 100 16.8 59.5 2.4X > before 1582, vec off 14119 14126 >7 7.1 141.2 1.0X > before 1582, vec on5991 6007 > 25 16.7 59.9 2.3X > {code} > Here is the PR with DateTimeRebaseBenchmark backported to 2.4: > https://github.com/MaxGekk/spark/pull/27 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082498#comment-17082498 ] Bruce Robbins commented on SPARK-31423: --- [~cloud_fan] {quote}FYI this is the behavior of Spark 2.4 {quote} Yes, I noted that in my description. What I mean is that in Spark 3.x (and without any legacy config touched), only ORC demonstrates this behavior. CAST, and the Parquet and Avro file formats do not demonstrate this behavior. > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. > For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change. 
> {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and > probably based on spark.sql.legacy.timeParserPolicy (or some other config) > rather than file format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
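The 10-day shift matches the behavior of Java's hybrid Julian/Gregorian calendar, on which Spark 2.4's date handling was based. A standalone demonstration, no Spark required:

```scala
import java.text.SimpleDateFormat
import java.util.{Calendar, GregorianCalendar}

// java.util.GregorianCalendar uses the hybrid calendar with its default
// cutover: 1582-10-04 is followed by 1582-10-15, so 1582-10-05..14 do not
// exist. In (default) lenient mode such dates normalize 10 days forward,
// exactly the shift seen in the ORC round-trip above.
val cal = new GregorianCalendar(1582, Calendar.OCTOBER, 14)
val hybrid = new SimpleDateFormat("yyyy-MM-dd").format(cal.getTime)

// java.time uses the proleptic Gregorian calendar, where the day exists.
val proleptic = java.time.LocalDate.parse("1582-10-14").toString
```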
[jira] [Updated] (SPARK-31438) Support JobCleaned Status in SparkListener
[ https://issues.apache.org/jira/browse/SPARK-31438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jackey Lee updated SPARK-31438: --- Description: In Spark, we need do some hook after job cleaned, such as cleaning hive external temporary paths. This has already discussed in SPARK-31346 and [GitHub Pull Request #28129.|https://github.com/apache/spark/pull/28129] The JobEnd Status is not suitable for this. As JobEnd is responsible for Job finished, once all result has generated, it should be finished. After finish, Scheduler will leave the still running tasks to be zombie tasks and delete abnormal tasks asynchronously. Thus, we add JobCleaned Status to enable user to do some hook after all tasks cleaned in Job. The JobCleaned Status can get from TaskSetManagers, which is related to a stage, and once all stages of the job has been cleaned, then the job is cleaned. was: In Spark, we need do some hook, such as cleaning hive external temporary paths, after job cleaned, which is discussed in SPARK-31346 and [GitHub Pull Request #28129|https://github.com/apache/spark/pull/28129]. The JobEnd Status is not suitable for this. As JobEnd is responsible for Job finished, once all result has generated, it should be finished. After finish, Scheduler will leave the still running tasks to be zombie tasks and delete abnormal tasks asynchronously. Thus, we add JobCleaned Status to enable user to do some hook after all tasks cleaned in Job. The JobCleaned Status can get from TaskSetManagers, which is related to a stage, and once all stages of the job has been cleaned, then the job is cleaned. > Support JobCleaned Status in SparkListener > -- > > Key: SPARK-31438 > URL: https://issues.apache.org/jira/browse/SPARK-31438 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Jackey Lee >Priority: Major > > In Spark, we need do some hook after job cleaned, such as cleaning hive > external temporary paths. 
This has already discussed in SPARK-31346 and > [GitHub Pull Request #28129.|https://github.com/apache/spark/pull/28129] > The JobEnd Status is not suitable for this. As JobEnd is responsible for Job > finished, once all result has generated, it should be finished. After finish, > Scheduler will leave the still running tasks to be zombie tasks and delete > abnormal tasks asynchronously. > Thus, we add JobCleaned Status to enable user to do some hook after all > tasks cleaned in Job. The JobCleaned Status can get from TaskSetManagers, > which is related to a stage, and once all stages of the job has been cleaned, > then the job is cleaned. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31438) Support JobCleaned Status in SparkListener
[ https://issues.apache.org/jira/browse/SPARK-31438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jackey Lee updated SPARK-31438: --- Description: In Spark, we need do some hook, such as cleaning hive external temporary paths, after job cleaned, which is discussed in SPARK-31346 and [GitHub Pull Request #28129|https://github.com/apache/spark/pull/28129]. The JobEnd Status is not suitable for this. As JobEnd is responsible for Job finished, once all result has generated, it should be finished. After finish, Scheduler will leave the still running tasks to be zombie tasks and delete abnormal tasks asynchronously. Thus, we add JobCleaned Status to enable user to do some hook after all tasks cleaned in Job. The JobCleaned Status can get from TaskSetManagers, which is related to a stage, and once all stages of the job has been cleaned, then the job is cleaned. was: In Spark, we need do some hook, such as hive external temporary paths cleaning, after job cleaned, which is discussed in SPARK-31346 and [GitHub Pull Request #28129|https://github.com/apache/spark/pull/28129]. The JobEnd Status is not suitable for this. As JobEnd is responsible for Job finished, once all result has generated, it should be finished. After finish, Scheduler will leave the still running tasks to be zombie tasks and delete abnormal tasks asynchronously. Thus, we add JobCleaned Status to enable user to do some hook after all tasks cleaned in Job. The JobCleaned Status can get from TaskSetManagers, which is related to a stage, and once all stages of the job has been cleaned, then the job is cleaned. 
> Support JobCleaned Status in SparkListener > -- > > Key: SPARK-31438 > URL: https://issues.apache.org/jira/browse/SPARK-31438 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Jackey Lee >Priority: Major > > In Spark, we need do some hook, such as cleaning hive external temporary > paths, after job cleaned, which is discussed in SPARK-31346 and [GitHub Pull > Request #28129|https://github.com/apache/spark/pull/28129]. > The JobEnd Status is not suitable for this. As JobEnd is responsible for Job > finished, once all result has generated, it should be finished. After finish, > Scheduler will leave the still running tasks to be zombie tasks and delete > abnormal tasks asynchronously. > Thus, we add JobCleaned Status to enable user to do some hook after all > tasks cleaned in Job. The JobCleaned Status can get from TaskSetManagers, > which is related to a stage, and once all stages of the job has been cleaned, > then the job is cleaned. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31438) Support JobCleaned Status in SparkListener
Jackey Lee created SPARK-31438: -- Summary: Support JobCleaned Status in SparkListener Key: SPARK-31438 URL: https://issues.apache.org/jira/browse/SPARK-31438 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: Jackey Lee In Spark, we need to run hooks, such as cleaning Hive external temporary paths, after a job is cleaned, which is discussed in SPARK-31346 and [GitHub Pull Request #28129|https://github.com/apache/spark/pull/28129]. The JobEnd status is not suitable for this. JobEnd marks job completion: once all results have been generated, the job is finished. After that, the Scheduler leaves the still-running tasks as zombie tasks and deletes abnormal tasks asynchronously. Thus, we add a JobCleaned status to enable users to run hooks after all of a job's tasks are cleaned. The JobCleaned status can be derived from the TaskSetManagers, each of which is tied to a stage; once all stages of the job have been cleaned, the job is cleaned. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
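The bookkeeping described above (a job counts as cleaned once every one of its stages' TaskSetManagers has been cleaned) can be sketched independently of Spark's scheduler. Everything below, including the `JobCleanupTracker` class and its method names, is a hypothetical illustration and not the API from the PR:

```scala
import scala.collection.mutable

// Minimal sketch of the proposed JobCleaned bookkeeping: a job is
// considered cleaned once all of its stages have been cleaned.
class JobCleanupTracker(onJobCleaned: Int => Unit) {
  // jobId -> stage ids that still need cleaning
  private val pendingStages = mutable.Map.empty[Int, mutable.Set[Int]]

  def jobSubmitted(jobId: Int, stageIds: Seq[Int]): Unit =
    pendingStages(jobId) = mutable.Set(stageIds: _*)

  // Called when a stage's TaskSetManager has finished cleaning up
  // (zombie and abnormal tasks removed). Fires the hook exactly once,
  // when the last stage of the job is cleaned.
  def stageCleaned(jobId: Int, stageId: Int): Unit =
    pendingStages.get(jobId).foreach { stages =>
      stages -= stageId
      if (stages.isEmpty) {
        pendingStages -= jobId
        onJobCleaned(jobId)
      }
    }
}
```

A listener implementation would drive `stageCleaned` from the scheduler's cleanup path rather than from JobEnd, which is the distinction the ticket draws.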
[jira] [Resolved] (SPARK-31391) Add AdaptiveTestUtils to ease the test of AQE
[ https://issues.apache.org/jira/browse/SPARK-31391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31391. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28162 [https://github.com/apache/spark/pull/28162] > Add AdaptiveTestUtils to ease the test of AQE > - > > Key: SPARK-31391 > URL: https://issues.apache.org/jira/browse/SPARK-31391 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > Tests related to AQE now contain a lot of duplicated code; we can use some > utility functions to make the tests simpler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31391) Add AdaptiveTestUtils to ease the test of AQE
[ https://issues.apache.org/jira/browse/SPARK-31391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31391: --- Assignee: wuyi > Add AdaptiveTestUtils to ease the test of AQE > - > > Key: SPARK-31391 > URL: https://issues.apache.org/jira/browse/SPARK-31391 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > Tests related to AQE now contain a lot of duplicated code; we can use some > utility functions to make the tests simpler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31409) Fix failed tests due to result order changing when we enable AQE
[ https://issues.apache.org/jira/browse/SPARK-31409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31409. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28178 [https://github.com/apache/spark/pull/28178] > Fix failed tests due to result order changing when we enable AQE > > > Key: SPARK-31409 > URL: https://issues.apache.org/jira/browse/SPARK-31409 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > query #147 in SQLQueryTestSuite#"udf/postgreSQL/udf-join.sql - Scala UDF" and > test sql/SQLQuerySuite#"check outputs of expression examples" will fail when > AQE is enabled, due to the result order changing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31409) Fix failed tests due to result order changing when we enable AQE
[ https://issues.apache.org/jira/browse/SPARK-31409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31409: --- Assignee: wuyi > Fix failed tests due to result order changing when we enable AQE > > > Key: SPARK-31409 > URL: https://issues.apache.org/jira/browse/SPARK-31409 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > query #147 in SQLQueryTestSuite#"udf/postgreSQL/udf-join.sql - Scala UDF" and > test sql/SQLQuerySuite#"check outputs of expression examples" will fail when > AQE is enabled, due to the result order changing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31435) Add SPARK_JARS_DIR environment variable (new) to Spark configuration documentation
[ https://issues.apache.org/jira/browse/SPARK-31435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31435. -- Resolution: Duplicate > Add SPARK_JARS_DIR environment variable (new) to Spark configuration > documentation > - > > Key: SPARK-31435 > URL: https://issues.apache.org/jira/browse/SPARK-31435 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Pablo Langa Blanco >Priority: Minor > > Related to SPARK-31432 > That issue introduces a new environment variable, which is documented in this > issue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied
Hongze Zhang created SPARK-31437: Summary: Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied Key: SPARK-31437 URL: https://issues.apache.org/jira/browse/SPARK-31437 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 3.0.0 Reporter: Hongze Zhang By the change in the [PR|https://github.com/apache/spark/pull/27773] of SPARK-29154, submitted tasks are scheduled onto executors only if resource profile IDs strictly match. As a result, Spark always starts new executors for customized ResourceProfiles. This limitation makes working with process-local jobs unfriendly. E.g. task cores have been increased from 1 to 4 in a new stage while the executor has 8 slots; it is expected that 2 of the new tasks can run on the existing executor, but Spark starts new executors for the new ResourceProfile. This behavior is unnecessary. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
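The example in the description reduces to slot arithmetic: whether an existing executor can host tasks of a new ResourceProfile is (for cores, at least) a capacity question rather than a profile-ID equality question. The following is a hypothetical sketch of that check, not Spark's actual scheduler code:

```scala
// Hypothetical sketch: treat "can this executor host a task of the
// new ResourceProfile?" as a capacity check rather than a strict
// profile-id equality check. Names below are illustrative only.
final case class ExecutorResources(cores: Int)
final case class TaskRequirement(cores: Int)

def availableSlots(executor: ExecutorResources, task: TaskRequirement): Int =
  if (task.cores <= 0) 0 else executor.cores / task.cores

// The scenario from the description: task cores grow from 1 to 4 on an
// 8-slot executor, which can therefore still run 2 of the new tasks.
println(availableSlots(ExecutorResources(8), TaskRequirement(4))) // 2
```

A real implementation would also have to compare custom resources (GPUs, FPGAs) per requirement, but the cores case already shows why starting a fresh executor is not strictly needed.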
[jira] [Created] (SPARK-31436) MinHash keyDistance optimization
zhengruifeng created SPARK-31436: Summary: MinHash keyDistance optimization Key: SPARK-31436 URL: https://issues.apache.org/jira/browse/SPARK-31436 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng The current implementation is based on set operations, which is inefficient. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
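For context, MinHashLSH's `keyDistance` is the Jaccard distance over the sets of active indices of two vectors. Below is a sketch of the set-based form the ticket refers to, next to one plausible optimization (a single merge pass over sorted index arrays that avoids materializing any sets). The function names are hypothetical and this is not the actual patch:

```scala
// keyDistance for MinHashLSH is the Jaccard distance
//   1 - |A intersect B| / |A union B|
// over the active-index sets A and B of two vectors.

// Set-based form: builds two Sets plus their intersection and union.
def jaccardDistanceSets(a: Array[Int], b: Array[Int]): Double = {
  val sa = a.toSet
  val sb = b.toSet
  1.0 - sa.intersect(sb).size.toDouble / sa.union(sb).size
}

// A plausible optimization (hypothetical, not the actual patch): one
// merge pass over sorted, deduplicated index arrays, no Sets allocated.
def jaccardDistanceMerge(a: Array[Int], b: Array[Int]): Double = {
  var i = 0; var j = 0; var inter = 0
  while (i < a.length && j < b.length) {
    if (a(i) == b(j)) { inter += 1; i += 1; j += 1 }
    else if (a(i) < b(j)) i += 1
    else j += 1
  }
  1.0 - inter.toDouble / (a.length + b.length - inter)
}

println(jaccardDistanceMerge(Array(1, 2, 3, 5), Array(2, 3, 4))) // 0.6
```

Both functions agree on their result; the merge version does the same computation in O(|a| + |b|) time with no intermediate collections.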
[jira] [Created] (SPARK-31435) Add SPARK_JARS_DIR environment variable (new) to Spark configuration documentation
Pablo Langa Blanco created SPARK-31435: -- Summary: Add SPARK_JARS_DIR environment variable (new) to Spark configuration documentation Key: SPARK-31435 URL: https://issues.apache.org/jira/browse/SPARK-31435 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.1.0 Reporter: Pablo Langa Blanco Related to SPARK-31432. That issue introduces a new environment variable, which is documented in this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31435) Add SPARK_JARS_DIR environment variable (new) to Spark configuration documentation
[ https://issues.apache.org/jira/browse/SPARK-31435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082233#comment-17082233 ] Pablo Langa Blanco commented on SPARK-31435: I'm working on this > Add SPARK_JARS_DIR environment variable (new) to Spark configuration > documentation > - > > Key: SPARK-31435 > URL: https://issues.apache.org/jira/browse/SPARK-31435 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Pablo Langa Blanco >Priority: Minor > > Related to SPARK-31432 > That issue introduces a new environment variable, which is documented in this > issue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082228#comment-17082228 ] Wenchen Fan edited comment on SPARK-31423 at 4/13/20, 10:54 AM: FYI this is the behavior of Spark 2.4: {code} scala> val df = sql("select cast('1582-10-14' as DATE) dt") df: org.apache.spark.sql.DataFrame = [dt: date] scala> df.show +--+ |dt| +--+ |1582-10-24| +--+ scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") scala> spark.read.orc("/tmp/funny_orc_date").show +--+ |dt| +--+ |1582-10-24| +--+ {code} The result is wrong at the very beginning. was (Author: cloud_fan): FYI this is the behavior of Spark 2.4: ``` scala> val df = sql("select cast('1582-10-14' as DATE) dt") df: org.apache.spark.sql.DataFrame = [dt: date] scala> df.show +--+ |dt| +--+ |1582-10-24| +--+ scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") scala> spark.read.orc("/tmp/funny_orc_date").show +--+ |dt| +--+ |1582-10-24| +--+ ``` The result is wrong at the very beginning. > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. 
> For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change. > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). 
In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and > probably based on spark.sql.legacy.timeParserPolicy (or some other config) > rather than file format. -- This message was sent by Atlassian Jira (v8.3.4#803005) -
[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082228#comment-17082228 ] Wenchen Fan commented on SPARK-31423: - FYI this is the behavior of Spark 2.4: ``` scala> val df = sql("select cast('1582-10-14' as DATE) dt") df: org.apache.spark.sql.DataFrame = [dt: date] scala> df.show +--+ |dt| +--+ |1582-10-24| +--+ scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") scala> spark.read.orc("/tmp/funny_orc_date").show +--+ |dt| +--+ |1582-10-24| +--+ ``` The result is wrong at the very beginning. > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. 
> For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change. > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). 
In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and > probably based on spark.sql.legacy.timeParserPolicy (or some other config) > rather than file format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
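The 10-day shift can be reproduced outside Spark and ORC with the legacy hybrid Julian/Gregorian calendar (`java.util.GregorianCalendar`) that Spark 2.4's date handling was built on. A minimal sketch:

```scala
import java.text.SimpleDateFormat
import java.util.{Calendar, GregorianCalendar, TimeZone}

val utc = TimeZone.getTimeZone("UTC")

// 1582-10-14 falls inside the 1582-10-05..1582-10-14 gap of the hybrid
// calendar, so the lenient GregorianCalendar resolves it as a Julian
// date, which lands 10 days later on the Gregorian side of the cutover.
val cal = new GregorianCalendar(utc)
cal.clear()
cal.set(1582, Calendar.OCTOBER, 14)

val fmt = new SimpleDateFormat("yyyy-MM-dd")
fmt.setTimeZone(utc)
val shifted = fmt.format(cal.getTime)
println(shifted) // 1582-10-24
```

This matches the Spark 2.4 cast shown above: the hybrid calendar simply has no 1582-10-14, while the proleptic Gregorian calendar used by Spark 3.0 (and by Parquet/Avro reads here) does.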
[jira] [Resolved] (SPARK-31407) Fix hive/SQLQuerySuite.derived from Hive query file: drop_database_removes_partition_dirs.q
[ https://issues.apache.org/jira/browse/SPARK-31407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31407. -- Fix Version/s: 3.0.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/28177 > Fix hive/SQLQuerySuite.derived from Hive query file: > drop_database_removes_partition_dirs.q > --- > > Key: SPARK-31407 > URL: https://issues.apache.org/jira/browse/SPARK-31407 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: wuyi >Priority: Major > Fix For: 3.0.0 > > > Test "derived from Hive query file: drop_database_removes_partition_dirs.q" > can fail if we run it separately, but succeeds when run with the whole > hive/SQLQuerySuite. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation
[ https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31429: - Parent: (was: SPARK-28588) Issue Type: Bug (was: Sub-task) > Add additional fields in ExpressionDescription for more granular category in > documentation > -- > > Key: SPARK-31429 > URL: https://issues.apache.org/jira/browse/SPARK-31429 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > Add additional fields in ExpressionDescription so we can have more granular > category in function documentation. For example, we want to group window > function into finer categories such as ranking functions and analytic > functions. > See Hyukjin's comment below for more details; > https://github.com/apache/spark/pull/28170#issuecomment-611917191 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation
[ https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082196#comment-17082196 ] Hyukjin Kwon commented on SPARK-31429: -- Actually, let me retarget this as Spark 3.1. It would be good to do for Spark 3.0, but I guess it's okay to miss it too. I will try anyway. > Add additional fields in ExpressionDescription for more granular category in > documentation > -- > > Key: SPARK-31429 > URL: https://issues.apache.org/jira/browse/SPARK-31429 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > Add additional fields in ExpressionDescription so we can have more granular > category in function documentation. For example, we want to group window > function into finer categories such as ranking functions and analytic > functions. > See Hyukjin's comment below for more details; > https://github.com/apache/spark/pull/28170#issuecomment-611917191 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation
[ https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31429: - Target Version/s: (was: 3.0.0) > Add additional fields in ExpressionDescription for more granular category in > documentation > -- > > Key: SPARK-31429 > URL: https://issues.apache.org/jira/browse/SPARK-31429 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > Add additional fields in ExpressionDescription so we can have more granular > category in function documentation. For example, we want to group window > function into finer categories such as ranking functions and analytic > functions. > See Hyukjin's comment below for more details; > https://github.com/apache/spark/pull/28170#issuecomment-611917191 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation
[ https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082190#comment-17082190 ] Hyukjin Kwon commented on SPARK-31429: -- [~huaxingao], [~nchammas], [~kevinyu98], [~dkbiswal], [~maropu], would anyone be interested in this please? I would like to get this done for Spark 3.0 ... if you guys are busy, I will try to take a look .. probably next week or around there .. > Add additional fields in ExpressionDescription for more granular category in > documentation > -- > > Key: SPARK-31429 > URL: https://issues.apache.org/jira/browse/SPARK-31429 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > Add additional fields in ExpressionDescription so we can have more granular > category in function documentation. For example, we want to group window > function into finer categories such as ranking functions and analytic > functions. > See Hyukjin's comment below for more details; > https://github.com/apache/spark/pull/28170#issuecomment-611917191 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation
[ https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31429: - Parent: SPARK-28588 Issue Type: Sub-task (was: Improvement) > Add additional fields in ExpressionDescription for more granular category in > documentation > -- > > Key: SPARK-31429 > URL: https://issues.apache.org/jira/browse/SPARK-31429 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Add additional fields in ExpressionDescription so we can have more granular > category in function documentation. For example, we want to group window > function into finer categories such as ranking functions and analytic > functions. > See Hyukjin's comment below for more details; > https://github.com/apache/spark/pull/28170#issuecomment-611917191 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation
[ https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31429: - Affects Version/s: (was: 3.1.0) 3.0.0 > Add additional fields in ExpressionDescription for more granular category in > documentation > -- > > Key: SPARK-31429 > URL: https://issues.apache.org/jira/browse/SPARK-31429 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > Add additional fields in ExpressionDescription so we can have more granular > category in function documentation. For example, we want to group window > function into finer categories such as ranking functions and analytic > functions. > See Hyukjin's comment below for more details; > https://github.com/apache/spark/pull/28170#issuecomment-611917191 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation
[ https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31429: - Target Version/s: 3.0.0 > Add additional fields in ExpressionDescription for more granular category in > documentation > -- > > Key: SPARK-31429 > URL: https://issues.apache.org/jira/browse/SPARK-31429 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Add additional fields in ExpressionDescription so we can have more granular > category in function documentation. For example, we want to group window > function into finer categories such as ranking functions and analytic > functions. > See Hyukjin's comment below for more details; > https://github.com/apache/spark/pull/28170#issuecomment-611917191 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31434) Drop builtin function pages from SQL references
Takeshi Yamamuro created SPARK-31434: Summary: Drop builtin function pages from SQL references Key: SPARK-31434 URL: https://issues.apache.org/jira/browse/SPARK-31434 Project: Spark Issue Type: Documentation Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro This ticket intends to drop the built-in function pages from the SQL references. We already have a complete list of built-in functions in the API documents. See related discussions for more details: https://github.com/apache/spark/pull/28170#issuecomment-611917191 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31427) Spark Structured Streaming reads data twice per micro-batch.
[ https://issues.apache.org/jira/browse/SPARK-31427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082163#comment-17082163 ] Nick Hryhoriev commented on SPARK-31427: [~kabhwan] I will try to do it, but do not expect to get the info soon. But I can confirm that 2.4.5 has the same behavior. > Spark Structured Streaming reads data twice per micro-batch. > > > Key: SPARK-31427 > URL: https://issues.apache.org/jira/browse/SPARK-31427 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Nick Hryhoriev >Priority: Major > > I have a very strange issue with Spark Structured Streaming. It creates two > Spark jobs for every micro-batch and, as a result, reads data from Kafka > twice. Here is a simple code snippet. > > {code:java} > import org.apache.hadoop.fs.{FileSystem, Path} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.streaming.Trigger > import scala.concurrent.duration._ > object CheckHowSparkReadFromKafka { > def main(args: Array[String]): Unit = { > val session = SparkSession.builder() > .config(new SparkConf() > .setAppName(s"simple read from kafka with repartition") > .setMaster("local[*]") > .set("spark.driver.host", "localhost")) > .getOrCreate() > val testPath = "/tmp/spark-test" > FileSystem.get(session.sparkContext.hadoopConfiguration).delete(new > Path(testPath), true) > import session.implicits._ > val stream = session > .readStream > .format("kafka") > .option("kafka.bootstrap.servers","kafka-20002-prod:9092") > .option("subscribe", "topic") > .option("maxOffsetsPerTrigger", 1000) > .option("failOnDataLoss", false) > .option("startingOffsets", "latest") > .load() > .repartitionByRange( $"offset") > .writeStream > .option("path", testPath + "/data") > .option("checkpointLocation", testPath + "/checkpoint") > .format("parquet") > .trigger(Trigger.ProcessingTime(10.seconds)) > .start() > 
stream.processAllAvailable() > {code} > This happens because of {{.repartitionByRange($"offset")}}; if I remove this > line, all is good. But with it, Spark creates two jobs: one with 1 stage that > just reads from Kafka, and a second with 3 stages (read -> shuffle -> write). > So the result of the first job is never used. > This has a significant impact on performance. Some of my Kafka topics have > 1550 partitions, so reading them twice is a big deal. If I add a cache, > things get better, but that is not an option for me. In local mode, the first > job in a batch takes less than 0.1 ms, except the batch with index 0. But on a > YARN cluster and on Mesos, both jobs fully execute and on my topics take > nearly 1.2 min. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31433) Summarizer supports string arguments
zhengruifeng created SPARK-31433: Summary: Summarizer supports string arguments Key: SPARK-31433 URL: https://issues.apache.org/jira/browse/SPARK-31433 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng It will be convenient for Summarizer to support string arguments, like other SQL functions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31408) Build Spark’s own datetime pattern definition
[ https://issues.apache.org/jira/browse/SPARK-31408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31408: Summary: Build Spark’s own datetime pattern definition (was: Build Spark’s own Datetime patterns) > Build Spark’s own datetime pattern definition > - > > Key: SPARK-31408 > URL: https://issues.apache.org/jira/browse/SPARK-31408 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > > This is an umbrella ticket for building Spark's own datetime patterns and > related work. > In Spark version 2.4 and earlier, datetime parsing and formatting are > performed by the old Java 7 `SimpleDateFormat` API. Since Spark 3.0, we > switch to the new Java 8 `DateTimeFormatter` to use the proleptic Gregorian > calendar, which is required by the ISO and SQL standards. > However, some datetime patterns are not compatible between the Java 8 and > Java 7 APIs, and it is fragile to rely on the JDK API to define Spark's > behavior. We should build our own datetime patterns, which are compatible with > Spark 2.4 (the old Java 7 `SimpleDateFormat` API). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
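One concrete instance of the Java 7/Java 8 pattern incompatibility mentioned above, reproducible without Spark: the pattern letter `u` means "day number of week" in the legacy `SimpleDateFormat` but "year" in Java 8's `DateTimeFormatter`:

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.{Calendar, GregorianCalendar}

// 2020-01-01 was a Wednesday.
// Legacy Java 7 SimpleDateFormat: 'u' = day number of week (1 = Monday).
val legacy = new SimpleDateFormat("u")
  .format(new GregorianCalendar(2020, Calendar.JANUARY, 1).getTime)

// Java 8 DateTimeFormatter: 'u' = year.
val modern = DateTimeFormatter.ofPattern("u")
  .format(LocalDate.of(2020, 1, 1))

println(s"SimpleDateFormat: $legacy, DateTimeFormatter: $modern")
```

The same pattern string silently producing different fields in the two APIs is exactly why pinning down a Spark-owned pattern definition, instead of deferring to whichever JDK formatter is in use, matters.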
[jira] [Updated] (SPARK-31432) bin/sbin scripts should allow to customize jars dir
[ https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31432: - Target Version/s: (was: 2.4.6, 3.0.1) > bin/sbin scripts should allow to customize jars dir > --- > > Key: SPARK-31432 > URL: https://issues.apache.org/jira/browse/SPARK-31432 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.5, 3.0.0 >Reporter: Shingo Furuyama >Priority: Minor > > In the scripts under bin/sbin, it would be better if we could specify SPARK_JARS_DIR > in the same way as SPARK_CONF_DIR. > Our use case: > We are trying to deploy Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an > incompatible conflict on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we > tweak the jars with the Maven Shade Plugin. > The jars differ slightly from the stock Spark 2.4.5 jars, and we place them in a > directory other than the default. So it would be useful for us if we could set > SPARK_JARS_DIR so the bin/sbin scripts point at that directory. > We could do that without any modification by deploying one Spark home per set of > jars, but that is somewhat redundant. > Common use case: > I believe there are similar use cases. For example, deploying Spark built for > Scala 2.11 and 2.12 on one machine and switching the jars location by setting > SPARK_JARS_DIR.
[jira] [Commented] (SPARK-31432) bin/sbin scripts should allow to customize jars dir
[ https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082129#comment-17082129 ] Shingo Furuyama commented on SPARK-31432: - I will soon send a PR to the master branch. If the PR is merged, I will send one to branch-2.4. > bin/sbin scripts should allow to customize jars dir > --- > > Key: SPARK-31432 > URL: https://issues.apache.org/jira/browse/SPARK-31432 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.5, 3.0.0 >Reporter: Shingo Furuyama >Priority: Minor > > In the scripts under bin/sbin, it would be better if we could specify SPARK_JARS_DIR > in the same way as SPARK_CONF_DIR. > Our use case: > We are trying to deploy Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an > incompatible conflict on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we > tweak the jars with the Maven Shade Plugin. > The jars differ slightly from the stock Spark 2.4.5 jars, and we place them in a > directory other than the default. So it would be useful for us if we could set > SPARK_JARS_DIR so the bin/sbin scripts point at that directory. > We could do that without any modification by deploying one Spark home per set of > jars, but that is somewhat redundant. > Common use case: > I believe there are similar use cases. For example, deploying Spark built for > Scala 2.11 and 2.12 on one machine and switching the jars location by setting > SPARK_JARS_DIR.
[jira] [Updated] (SPARK-31432) bin/sbin scripts should allow to customize jars dir
[ https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shingo Furuyama updated SPARK-31432: Environment: (was: In the scripts under bin/sbin, it would be better if we could specify SPARK_JARS_DIR in the same way as SPARK_CONF_DIR. Our use case: We are trying to deploy Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an incompatible conflict on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we tweak the jars with the Maven Shade Plugin. The jars differ slightly from the stock Spark 2.4.5 jars, and we place them in a directory other than the default. So it would be useful for us if we could set SPARK_JARS_DIR so the bin/sbin scripts point at that directory. We could do that without any modification by deploying one Spark home per set of jars, but that is somewhat redundant. Common use case: I believe there are similar use cases. For example, deploying Spark built for Scala 2.11 and 2.12 on one machine and switching the jars location by setting SPARK_JARS_DIR.) > bin/sbin scripts should allow to customize jars dir > --- > > Key: SPARK-31432 > URL: https://issues.apache.org/jira/browse/SPARK-31432 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.5, 3.0.0 >Reporter: Shingo Furuyama >Priority: Minor >
[jira] [Updated] (SPARK-31432) bin/sbin scripts should allow to customize jars dir
[ https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shingo Furuyama updated SPARK-31432: Description: In the scripts under bin/sbin, it would be better if we could specify SPARK_JARS_DIR in the same way as SPARK_CONF_DIR. Our use case: We are trying to deploy Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an incompatible conflict on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we tweak the jars with the Maven Shade Plugin. The jars differ slightly from the stock Spark 2.4.5 jars, and we place them in a directory other than the default. So it would be useful for us if we could set SPARK_JARS_DIR so the bin/sbin scripts point at that directory. We could do that without any modification by deploying one Spark home per set of jars, but that is somewhat redundant. Common use case: I believe there are similar use cases. For example, deploying Spark built for Scala 2.11 and 2.12 on one machine and switching the jars location by setting SPARK_JARS_DIR. > bin/sbin scripts should allow to customize jars dir > --- > > Key: SPARK-31432 > URL: https://issues.apache.org/jira/browse/SPARK-31432 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.5, 3.0.0 >Reporter: Shingo Furuyama >Priority: Minor > > In the scripts under bin/sbin, it would be better if we could specify SPARK_JARS_DIR > in the same way as SPARK_CONF_DIR. > Our use case: > We are trying to deploy Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an > incompatible conflict on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we > tweak the jars with the Maven Shade Plugin. > The jars differ slightly from the stock Spark 2.4.5 jars, and we place them in a > directory other than the default. So it would be useful for us if we could set > SPARK_JARS_DIR so the bin/sbin scripts point at that directory. > We could do that without any modification by deploying one Spark home per set of > jars, but that is somewhat redundant. > Common use case: > I believe there are similar use cases. For example, deploying Spark built for > Scala 2.11 and 2.12 on one machine and switching the jars location by setting > SPARK_JARS_DIR.
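A hypothetical sketch of the proposed lookup, mirroring how SPARK_CONF_DIR already behaves in the launch scripts: fall back to $SPARK_HOME/jars only when the user has not set SPARK_JARS_DIR (the paths below are illustrative, not from the actual PR):

```shell
SPARK_HOME=/opt/spark

# Default case: SPARK_JARS_DIR unset, fall back to $SPARK_HOME/jars.
unset SPARK_JARS_DIR
echo "${SPARK_JARS_DIR:-$SPARK_HOME/jars}"    # -> /opt/spark/jars

# User override, e.g. pointing at a per-Scala-version set of jars.
SPARK_JARS_DIR=/opt/spark-jars-scala212
echo "${SPARK_JARS_DIR:-$SPARK_HOME/jars}"    # -> /opt/spark-jars-scala212
```

The `${VAR:-default}` expansion keeps the scripts fully backward compatible: existing deployments that never set the variable see no change in behavior.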
[jira] [Created] (SPARK-31432) bin/sbin scripts should allow to customize jars dir
Shingo Furuyama created SPARK-31432: --- Summary: bin/sbin scripts should allow to customize jars dir Key: SPARK-31432 URL: https://issues.apache.org/jira/browse/SPARK-31432 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 2.4.5, 3.0.0 Environment: In the scripts under bin/sbin, it would be better if we could specify SPARK_JARS_DIR in the same way as SPARK_CONF_DIR. Our use case: We are trying to deploy Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an incompatible conflict on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we tweak the jars with the Maven Shade Plugin. The jars differ slightly from the stock Spark 2.4.5 jars, and we place them in a directory other than the default. So it would be useful for us if we could set SPARK_JARS_DIR so the bin/sbin scripts point at that directory. We could do that without any modification by deploying one Spark home per set of jars, but that is somewhat redundant. Common use case: I believe there are similar use cases. For example, deploying Spark built for Scala 2.11 and 2.12 on one machine and switching the jars location by setting SPARK_JARS_DIR. Reporter: Shingo Furuyama