[GitHub] [spark] olaky commented on pull request #36762: [SPARK-39259][SQL][TEST][FOLLOWUP] Fix Scala 2.13 `ClassCastException` in `ComputeCurrentTimeSuite`

2022-06-06 Thread GitBox


olaky commented on PR #36762:
URL: https://github.com/apache/spark/pull/36762#issuecomment-1148259858

   @dongjoon-hyun thanks a lot for picking this up for me!





[GitHub] [spark] olaky commented on pull request #36753: [SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries

2022-06-06 Thread GitBox


olaky commented on PR #36753:
URL: https://github.com/apache/spark/pull/36753#issuecomment-1148258363

   I cherry-picked 583a9c75bbb35387169d4f0cf763ef566d899954





[GitHub] [spark] zhengruifeng commented on pull request #35250: [SPARK-37961][SQL] Override maxRows/maxRowsPerPartition for some logical operators

2022-06-06 Thread GitBox


zhengruifeng commented on PR #35250:
URL: https://github.com/apache/spark/pull/35250#issuecomment-1148255084

   @cloud-fan Sure, let me update this PR.





[GitHub] [spark] LuciferYang commented on pull request #36732: [SPARK-39345][CORE][SQL][DSTREAM][ML][MESOS][SS] Replace `filter(!condition)` with `filterNot(condition)`

2022-06-06 Thread GitBox


LuciferYang commented on PR #36732:
URL: https://github.com/apache/spark/pull/36732#issuecomment-1148250861

   > No, we wouldn't backport this, that's more change. Does this offer any 
benefit? I'm not sure it's more readable even.
   
   If the readability is not improved, let me close this PR.
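   
   For reference, a minimal illustration of the rewrite this PR applies (the
collection and predicate below are made up for the example):
   
   ```
   val xs = Seq(1, 2, 3, 4)
   // before: negate the predicate inside filter
   val before = xs.filter(x => !(x % 2 == 0))
   // after: filterNot drops the explicit negation
   val after = xs.filterNot(x => x % 2 == 0)
   assert(before == after) // both yield Seq(1, 3)
   ```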
   
   
   





[GitHub] [spark] shiyuhang0 commented on pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

2022-06-06 Thread GitBox


shiyuhang0 commented on PR #21990:
URL: https://github.com/apache/spark/pull/21990#issuecomment-1148248826

   Why not port it to Spark < 3?





[GitHub] [spark] beliefer commented on a diff in pull request #36776: [SPARK-38997][SPARK-39037][SQL][FOLLOWUP] `PushableColumnWithoutNestedColumn` need be translated to predicate too

2022-06-06 Thread GitBox


beliefer commented on code in PR #36776:
URL: https://github.com/apache/spark/pull/36776#discussion_r890808786


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/util/V2ExpressionBuilder.scala:
##
@@ -55,8 +55,13 @@ class V2ExpressionBuilder(
   } else {
 Some(FieldReference(name))
   }
-case pushableColumn(name) if !nestedPredicatePushdownEnabled =>
-  Some(FieldReference.column(name))
+case col @ pushableColumn(name) if !nestedPredicatePushdownEnabled =>

Review Comment:
   OK






[GitHub] [spark] AngersZhuuuu commented on pull request #35612: [SPARK-38289][SQL] Refactor SQL CLI exit code to make it more clear

2022-06-06 Thread GitBox


AngersZh commented on PR #35612:
URL: https://github.com/apache/spark/pull/35612#issuecomment-1148234463

   Looks like the latest failed test is not related to this PR.





[GitHub] [spark] huaxingao commented on a diff in pull request #36776: [SPARK-38997][SPARK-39037][SQL][FOLLOWUP] `PushableColumnWithoutNestedColumn` need be translated to predicate too

2022-06-06 Thread GitBox


huaxingao commented on code in PR #36776:
URL: https://github.com/apache/spark/pull/36776#discussion_r890797844


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/util/V2ExpressionBuilder.scala:
##
@@ -55,8 +55,13 @@ class V2ExpressionBuilder(
   } else {
 Some(FieldReference(name))
   }
-case pushableColumn(name) if !nestedPredicatePushdownEnabled =>
-  Some(FieldReference.column(name))
+case col @ pushableColumn(name) if !nestedPredicatePushdownEnabled =>

Review Comment:
   +1






[GitHub] [spark] cloud-fan commented on a diff in pull request #36776: [SPARK-38997][SPARK-39037][SQL][FOLLOWUP] `PushableColumnWithoutNestedColumn` need be translated to predicate too

2022-06-06 Thread GitBox


cloud-fan commented on code in PR #36776:
URL: https://github.com/apache/spark/pull/36776#discussion_r890793608


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/util/V2ExpressionBuilder.scala:
##
@@ -55,8 +55,13 @@ class V2ExpressionBuilder(
   } else {
 Some(FieldReference(name))
   }
-case pushableColumn(name) if !nestedPredicatePushdownEnabled =>
-  Some(FieldReference.column(name))
+case col @ pushableColumn(name) if !nestedPredicatePushdownEnabled =>

Review Comment:
   can we merge the code a bit more?
   ```
   case col @ pushableColumn(name) =>
     val ref = if (nestedPredicatePushdownEnabled) ... else ...
     if (predicate) ... else ...
   ```
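   
   A self-contained toy sketch of the shape being suggested (all types below are
stand-ins rather than Spark's actual connector classes, and `predicate` is
assumed to be a flag the builder already carries):
   
   ```
   object MergedCaseSketch {
     sealed trait V2Expr
     case class FieldRef(parts: Seq[String]) extends V2Expr
     case class AsPredicate(ref: FieldRef) extends V2Expr

     def build(
         name: String,
         nestedPredicatePushdownEnabled: Boolean,
         predicate: Boolean): Option[V2Expr] = {
       // one branch point for how the column name becomes a reference
       val ref =
         if (nestedPredicatePushdownEnabled) FieldRef(name.split('.').toSeq)
         else FieldRef(Seq(name))
       // one branch point for predicate vs. plain reference
       if (predicate) Some(AsPredicate(ref)) else Some(ref)
     }
   }
   ```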






[GitHub] [spark] xiuzhu9527 commented on pull request #36784: [SPARK-39396][SQL] Fix LDAP login exception 'error code 49 - invalid credentials'

2022-06-06 Thread GitBox


xiuzhu9527 commented on PR #36784:
URL: https://github.com/apache/spark/pull/36784#issuecomment-1148226308

   @HyukjinKwon please take a look, thanks!





[GitHub] [spark] xiuzhu9527 opened a new pull request, #36784: [SPARK-39396][SQL] Fix LDAP login exception 'error code 49 - invalid credentials'

2022-06-06 Thread GitBox


xiuzhu9527 opened a new pull request, #36784:
URL: https://github.com/apache/spark/pull/36784

   
   
   ### What changes were proposed in this pull request?
   
   This PR fixes the problem that LDAP login fails when the DN is
(cn=user,ou=people, dc=example, dc=com).
   
   ### Why are the changes needed?
   
   
   The DN hard-coded in
org.apache.hive.service.auth.LdapAuthenticationProviderImpl#Authenticate() is
(uid=user,ou=people, dc=example, dc=com), which causes LDAP authentication to
fail.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No
   ### How was this patch tested?
   
   Existing test.





[GitHub] [spark] xiuzhu9527 closed pull request #36783: [SPARK-39396][SQL] Fix LDAP login exception 'error code 49 - invalid …

2022-06-06 Thread GitBox


xiuzhu9527 closed pull request #36783: [SPARK-39396][SQL] Fix LDAP login 
exception 'error code 49 - invalid …
URL: https://github.com/apache/spark/pull/36783





[GitHub] [spark] xiuzhu9527 opened a new pull request, #36783: [SPARK-39396][SQL] Fix LDAP login exception 'error code 49 - invalid …

2022-06-06 Thread GitBox


xiuzhu9527 opened a new pull request, #36783:
URL: https://github.com/apache/spark/pull/36783

   
   
   
   ### What changes were proposed in this pull request?
   
   This PR fixes the problem that LDAP login fails when the DN is
(cn=user,ou=people, dc=example, dc=com).
   
   ### Why are the changes needed?
   
   
   The DN hard-coded in
org.apache.hive.service.auth.LdapAuthenticationProviderImpl#Authenticate() is
(uid=user,ou=people, dc=example, dc=com), which causes LDAP authentication to
fail.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No
   ### How was this patch tested?
   
   Existing test.





[GitHub] [spark] HyukjinKwon closed pull request #36782: [SPARK-39394][DOCS][SS] Improve PySpark Structured Streaming page more readable

2022-06-06 Thread GitBox


HyukjinKwon closed pull request #36782: [SPARK-39394][DOCS][SS] Improve PySpark 
Structured Streaming page more readable
URL: https://github.com/apache/spark/pull/36782





[GitHub] [spark] HyukjinKwon commented on pull request #36782: [SPARK-39394][DOCS][SS] Improve PySpark Structured Streaming page more readable

2022-06-06 Thread GitBox


HyukjinKwon commented on PR #36782:
URL: https://github.com/apache/spark/pull/36782#issuecomment-1148212500

   Merged to master.





[GitHub] [spark] mridulm commented on a diff in pull request #36775: [SPARK-39389]Filesystem closed should not be considered as corrupt files

2022-06-06 Thread GitBox


mridulm commented on code in PR #36775:
URL: https://github.com/apache/spark/pull/36775#discussion_r890765675


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##
@@ -253,6 +253,9 @@ class FileScanRDD(
   // Throw FileNotFoundException even if `ignoreCorruptFiles` 
is true
   case e: FileNotFoundException if !ignoreMissingFiles => 
throw e
   case e @ (_: RuntimeException | _: IOException) if 
ignoreCorruptFiles =>
+if (e.getMessage.contains("Filesystem closed")) {

Review Comment:
   +1 to @JoshRosen's proposal here.
   Given that Hadoop is throwing a generic exception here, and given the lack
of principled alternatives available, walking the stack allows us to
reasonably detect whether the cause is the Hadoop filesystem being closed.









[GitHub] [spark] huaxingao commented on pull request #36777: [SPARK-39390][CORE] Hide and optimize `viewAcls`/`viewAclsGroups`/`modifyAcls`/`modifyAclsGroups` from INFO log

2022-06-06 Thread GitBox


huaxingao commented on PR #36777:
URL: https://github.com/apache/spark/pull/36777#issuecomment-1148178978

   Merged to master. Thanks @dcoliversun 





[GitHub] [spark] huaxingao closed pull request #36777: [SPARK-39390][CORE] Hide and optimize `viewAcls`/`viewAclsGroups`/`modifyAcls`/`modifyAclsGroups` from INFO log

2022-06-06 Thread GitBox


huaxingao closed pull request #36777: [SPARK-39390][CORE] Hide and optimize 
`viewAcls`/`viewAclsGroups`/`modifyAcls`/`modifyAclsGroups` from INFO log
URL: https://github.com/apache/spark/pull/36777





[GitHub] [spark] otterc commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-06-06 Thread GitBox


otterc commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r890584847


##
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:
##
@@ -230,11 +241,14 @@ protected void serviceInit(Configuration externalConf) 
throws Exception {
   // when it comes back
   if (_recoveryPath != null) {
 registeredExecutorFile = initRecoveryDb(RECOVERY_FILE_NAME);
+mergeManagerFile = 
initRecoveryDb(SPARK_SHUFFLE_MERGE_RECOVERY_FILE_NAME);
   }
 
-  TransportConf transportConf = new TransportConf("shuffle", new 
HadoopConfigProvider(_conf));
-  MergedShuffleFileManager shuffleMergeManager = 
newMergedShuffleFileManagerInstance(
-transportConf);
+  TransportConf transportConf = new TransportConf("shuffle",new 
HadoopConfigProvider(_conf));

Review Comment:
   why this change?



##
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:
##
@@ -230,11 +241,14 @@ protected void serviceInit(Configuration externalConf) 
throws Exception {
   // when it comes back
   if (_recoveryPath != null) {
 registeredExecutorFile = initRecoveryDb(RECOVERY_FILE_NAME);
+mergeManagerFile = 
initRecoveryDb(SPARK_SHUFFLE_MERGE_RECOVERY_FILE_NAME);
   }
 
-  TransportConf transportConf = new TransportConf("shuffle", new 
HadoopConfigProvider(_conf));
-  MergedShuffleFileManager shuffleMergeManager = 
newMergedShuffleFileManagerInstance(
-transportConf);
+  TransportConf transportConf = new TransportConf("shuffle",new 
HadoopConfigProvider(_conf));
+  if (shuffleMergeManager == null) {

Review Comment:
   why did we add this condition here?
   We don't check for the `blockHandler` below.



##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -350,15 +416,27 @@ void closeAndDeletePartitionFilesIfNeeded(
* up older shuffleMergeId partitions. The cleanup will be executed in a 
separate thread.
*/
   @VisibleForTesting
-  void closeAndDeletePartitionFiles(Map 
partitions) {
+  void closeAndDeleteOutdatedPartitions(Map 
partitions) {
 partitions
   .forEach((partitionId, partitionInfo) -> {
 synchronized (partitionInfo) {
   partitionInfo.closeAllFilesAndDeleteIfNeeded(true);
+  
removeAppShufflePartitionInfoFromInDB(partitionInfo.appAttemptShuffleMergeId);
 }
   });
   }
 
+  void removeAppShufflePartitionInfoFromInDB(AppAttemptShuffleMergeId 
appAttemptShuffleMergeId) {

Review Comment:
   Nit: `removeAppShufflePartitionInfoFromInDB` -> 
`removeAppShufflePartitionInfoFromDB`



##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -655,6 +743,197 @@ public void registerExecutor(String appId, 
ExecutorShuffleInfo executorInfo) {
 }
   }
 
+
+  @Override
+  public void close() {
+if (db != null) {
+  try {
+db.close();
+  } catch (IOException e) {
+logger.error("Exception closing leveldb with registered app paths info 
and "
++ "shuffle partition info", e);
+  }
+}
+  }
+
+  private void writeAppPathsInfoToDb(String appId, int attemptId, AppPathsInfo 
appPathsInfo) {
+if (db != null) {
+  try {
+byte[] key = getDbAppAttemptPathsKey(new AppAttemptId(appId, 
attemptId));
+String valueStr = mapper.writeValueAsString(appPathsInfo);
+byte[] value = valueStr.getBytes(StandardCharsets.UTF_8);
+db.put(key, value);
+  } catch (Exception e) {
+logger.error("Error saving registered app paths info", e);
+  }
+}
+  }
+
+  private void writeAppAttemptShuffleMergeInfoToDB(
+  String appId,
+  int appAttemptId,
+  int shuffleId,
+  int shuffleMergeId) {
+if (db != null) {
+  // Write AppAttemptShuffleMergeId into LevelDB for finalized shuffles
+  try{
+byte[] dbKey = getDbAppAttemptShufflePartitionKey(
+new AppAttemptShuffleMergeId(appId, appAttemptId, shuffleId, 
shuffleMergeId));
+db.put(dbKey, new byte[0]);
+  } catch (Exception e) {
+logger.error("Error saving active app shuffle partition", e);
+  }
+}
+
+  }
+
+  private  T parseDbKey(String key, String prefix, Class valueType) {
+try {
+  String json = key.substring(prefix.length() + 1);
+  return mapper.readValue(json, valueType);
+} catch (Exception exception) {
+  logger.error("Exception while parsing the DB key {}", key);
+  return null;
+}
+  }
+
+  private AppPathsInfo parseDBAppAttemptPathsValue(byte[] value, AppAttemptId 
appAttemptId) {
+try {
+  return mapper.readValue(value, AppPathsInfo.class);
+} catch (Exception exception) {
+  logger.error("Exception while parsing the DB value for {}", 
appAttemptId);
+  return null;
+}
+  }
+
+  priv

[GitHub] [spark] itholic opened a new pull request, #36782: [SPARK-39394][DOCS] Improve PySpark Structured Streaming page more readable

2022-06-06 Thread GitBox


itholic opened a new pull request, #36782:
URL: https://github.com/apache/spark/pull/36782

   ### What changes were proposed in this pull request?
   
   This PR proposes to improve the PySpark Structured Streaming API reference
page to make it more readable.
   
   So far, the PySpark Structured Streaming API reference page has not been well
organized, so it is a bit uncomfortable to read, as shown below:
   
   ![Screen Shot 2022-06-07 at 12 29 33 
PM](https://user-images.githubusercontent.com/44108233/172289683-0c130b6a-7716-40a3-b22b-42e38febe8c7.png)
   
   
   
   ### Why are the changes needed?
   
   Improving the documentation's readability will also improve the usability of
PySpark Structured Streaming.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, the documentation is now categorized more clearly by class and purpose,
as shown below:
   
   ![Screen Shot 2022-06-07 at 12 30 01 
PM](https://user-images.githubusercontent.com/44108233/172289737-bd6ebf0e-601c-4a80-a16a-cf885302e7b6.png)
   
   
   ### How was this patch tested?
   
   The existing doc build in CI should cover this.
   





[GitHub] [spark] JoshRosen commented on a diff in pull request #36775: [SPARK-39389]Filesystem closed should not be considered as corrupt files

2022-06-06 Thread GitBox


JoshRosen commented on code in PR #36775:
URL: https://github.com/apache/spark/pull/36775#discussion_r890727456


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##
@@ -253,6 +253,9 @@ class FileScanRDD(
   // Throw FileNotFoundException even if `ignoreCorruptFiles` 
is true
   case e: FileNotFoundException if !ignoreMissingFiles => 
throw e
   case e @ (_: RuntimeException | _: IOException) if 
ignoreCorruptFiles =>
+if (e.getMessage.contains("Filesystem closed")) {

Review Comment:
   I agree that checking exception messages isn't the most robust solution.
   
   Thinking about the risks of exception text matching, I see two potential 
classes of problems:
   
   1. _Genuine_ corruption might just-so-happen to produce an exception that 
also contains "Filesystem closed", in which case we'd break a user's job.
   2. The third-party FileSystem implementation could change its error 
messages, which would cause this PR's fix to stop working.
   
   Given the fact that the current code treats essentially _all_ exceptions as 
corruption, a potentially-brittle solution which ignores issue (2) and solves 
issue (1) _might_ not be a terrible trade-off. 
   
   For example, let's say that we're only trying to put a bandaid on the 
`DFSClient` closed filesystem issue and aren't trying to solve the broader 
class of issues I mentioned in my other comment. In that case, we might be able 
to write some sort of helper function which would walk through the exception's 
stack frames and look for a frame that calls `DFSClient.checkOpen`. Given that 
Hadoop hasn't changed that method in 11+ years, it seems pretty unlikely that 
they would break it in a future release.
   
   Then, we could do something like
   
   ```
   case e: IOException if isDFSClientClosedException(e) => throw e
   ```
   
   before the current corruption ignoring case.
   
   This would have no chance of mis-identifying corruption as non-corruption. 
It's potentially brittle and doesn't necessarily solve all cases, but it could 
greatly reduce the likelihood of data loss (even though it doesn't completely 
eliminate it). Depending on the context of use, maybe that's an okay short-term 
band-aid?
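   
   A minimal sketch of such a helper, assuming we only look for a
`DFSClient.checkOpen` frame in the stack trace (the name
`isDFSClientClosedException` is just the placeholder used above, not an
existing Spark API):
   
   ```
   import java.io.IOException

   def isDFSClientClosedException(e: IOException): Boolean =
     e.getStackTrace.exists { frame =>
       frame.getClassName == "org.apache.hadoop.hdfs.DFSClient" &&
         frame.getMethodName == "checkOpen"
     }
   ```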
   
   






[GitHub] [spark] HyukjinKwon commented on pull request #35484: [SPARK-38181][SS][DOCS] Update comments in KafkaDataConsumer.scala

2022-06-06 Thread GitBox


HyukjinKwon commented on PR #35484:
URL: https://github.com/apache/spark/pull/35484#issuecomment-1148140363

   Oh, also please enable GitHub Actions in your forked repository 
(https://github.com/ArvinZheng/spark). Apache Spark repository leverages PR 
author's GitHub Actions resources.





[GitHub] [spark] HyukjinKwon commented on pull request #35484: [SPARK-38181][SS][DOCS] Update comments in KafkaDataConsumer.scala

2022-06-06 Thread GitBox


HyukjinKwon commented on PR #35484:
URL: https://github.com/apache/spark/pull/35484#issuecomment-1148140028

   @ArvinZheng mind rebasing this PR? Seems like something went wrong with
finding GitHub Actions in your fork.





[GitHub] [spark] anishshri-db commented on pull request #35484: [SPARK-38181][SS][DOCS] Update comments in KafkaDataConsumer.scala

2022-06-06 Thread GitBox


anishshri-db commented on PR #35484:
URL: https://github.com/apache/spark/pull/35484#issuecomment-1148138847

   Seems like the tests are failing though? Maybe try merging back and
re-triggering?





[GitHub] [spark] anishshri-db commented on a diff in pull request #35484: [SPARK-38181][SS][DOCS] Update comments in KafkaDataConsumer.scala

2022-06-06 Thread GitBox


anishshri-db commented on code in PR #35484:
URL: https://github.com/apache/spark/pull/35484#discussion_r890722033


##
connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala:
##
@@ -298,9 +296,10 @@ private[kafka010] class KafkaDataConsumer(
   s"requested $offset")
 
 // The following loop is basically for `failOnDataLoss = false`. When 
`failOnDataLoss` is
-// `false`, first, we will try to fetch the record at `offset`. If no such 
record exists, then
-// we will move to the next available offset within `[offset, 
untilOffset)` and retry.
-// If `failOnDataLoss` is `true`, the loop body will be executed only once.
+// `false`, we will try to fetch the record at `offset`, if the record 
does not exist, we will
+// try to fetch next available record within [offset, untilOffset).
+// If `failOnDataLoss` is `true`, the loop body will be executed only 
once, either return the
+// record at `offset` or throw an exception when the record does not exist

Review Comment:
   Nit: add a period at the end.






[GitHub] [spark] JoshRosen commented on pull request #36775: [SPARK-39389]Filesystem closed should not be considered as corrupt files

2022-06-06 Thread GitBox


JoshRosen commented on PR #36775:
URL: https://github.com/apache/spark/pull/36775#issuecomment-1148136217

   Does checking for filesystem closed exceptions completely fix this issue or 
are we vulnerable to race conditions? Skimming through the [Hadoop DFSClient 
code](https://github.com/apache/hadoop/blame/2dfa928a201560bc739ff5e4fe29f8bb19188927/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSClient.java),
 it looks like it only throws `Filesystem closed` IOExceptions from a 
`checkOpen()` call, but what happens if the FS is closed while a thread has 
proceeded past `checkOpen()` and is in the middle of some other operation? In 
that case we might get a different IOException but would presumably want to 
treat it the same way (because it is a side-effect of another task being 
interrupted rather than a side-effect of a corrupt file).
   
   It seems like the core problem is that we have a really overly-broad `catch` 
block which catches corrupt files but also catches many other potential things. 
For example, let's say that we get an IOException caused by a transient network 
connectivity issue to external storage: this doesn't meet the intuitive 
definition of "corrupt file" but would still get caught in the dragnet of the 
current exception handler. 
   
   The current code seems biased towards identifying "false positive" instances 
of corruption (which can lead to correctness issues). If instead we wanted to 
bias towards false negatives (i.e. mis-identifying true corruption as a 
transient crash, therefore failing a user's job) then maybe we could have the 
code in `readFunc` wrap and re-throw exceptions in some sort of 
`CorruptFileException` wrapper and then modify the `catch` here to only ignore 
that new exception. This would require changes in a number of data sources, 
though. The FileScanRDD code might simply lack the necessary information to be 
able to identify true corruption cases, so pushing part of that decision one 
layer lower might make sense.
   
   I think that could be a breaking change for certain users, though, so we'd
need to treat it as a user-facing change and document it appropriately (and
might need to add escape-hatch flags in case users need to revert back to the
old (arguably buggy) behavior).
   
   
   
   I'm not sure what's the right course of action here. I just wanted to flag 
that there's a potentially broader issue here.
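   
   A rough sketch of that wrapper idea, with `CorruptFileException` as a
hypothetical class (it does not exist in Spark today) and the reader/catch
shapes simplified:
   
   ```
   import java.io.IOException

   // hypothetical marker a data source would throw once it has positively
   // identified a corrupt file
   class CorruptFileException(cause: Throwable) extends IOException(cause)

   def readOneFile(readFunc: () => Unit, ignoreCorruptFiles: Boolean): Unit = {
     try {
       readFunc()
     } catch {
       // only exceptions explicitly tagged as corruption are swallowed
       case _: CorruptFileException if ignoreCorruptFiles =>
         () // log and skip the file
       case e: IOException =>
         throw e // transient/network/closed-FS errors still fail the task
     }
   }
   ```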





[GitHub] [spark] beliefer commented on pull request #36776: [SPARK-38997][SPARK-39037][SQL][FOLLOWUP] `PushableColumnWithoutNestedColumn` need be translated to predicate too

2022-06-06 Thread GitBox


beliefer commented on PR #36776:
URL: https://github.com/apache/spark/pull/36776#issuecomment-1148132608

   ping @huaxingao cc @cloud-fan 





[GitHub] [spark] beliefer commented on a diff in pull request #36773: [SPARK-39385][SQL] Translate linear regression aggregate functions for pushdown

2022-06-06 Thread GitBox


beliefer commented on code in PR #36773:
URL: https://github.com/apache/spark/pull/36773#discussion_r890715857


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala:
##
@@ -750,6 +750,22 @@ object DataSourceStrategy
 PushableColumnWithoutNestedColumn(right), _) =>
   Some(new GeneralAggregateFunc("CORR", agg.isDistinct,
 Array(FieldReference.column(left), FieldReference.column(right
+case aggregate.RegrIntercept(PushableColumnWithoutNestedColumn(left),

Review Comment:
   Two indents?






[GitHub] [spark] HeartSaVioR commented on pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

2022-06-06 Thread GitBox


HeartSaVioR commented on PR #36737:
URL: https://github.com/apache/spark/pull/36737#issuecomment-1148131018

   Sorry, I'll find time for this soon. I'll also try to find someone able to
review this beforehand.





[GitHub] [spark] beliefer commented on a diff in pull request #36773: [SPARK-39385][SQL] Translate linear regression aggregate functions for pushdown

2022-06-06 Thread GitBox


beliefer commented on code in PR #36773:
URL: https://github.com/apache/spark/pull/36773#discussion_r890714758


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala:
##
@@ -72,6 +72,26 @@ private[sql] object H2Dialect extends JdbcDialect {
   assert(f.children().length == 2)
   val distinct = if (f.isDistinct) "DISTINCT " else ""
   Some(s"CORR($distinct${f.children().head}, ${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_INTERCEPT" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_INTERCEPT($distinct${f.children().head}, 
${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_R2" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_R2($distinct${f.children().head}, ${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_SLOPE" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_SLOPE($distinct${f.children().head}, 
${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_SXX" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_SXX($distinct${f.children().head}, 
${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_SXY" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""

Review Comment:
   OK






[GitHub] [spark] AngersZhuuuu commented on pull request #35612: [SPARK-38289][SQL] Refactor SQL CLI exit code to make it more clear

2022-06-06 Thread GitBox


AngersZh commented on PR #35612:
URL: https://github.com/apache/spark/pull/35612#issuecomment-1148126351

   > Tests are running in https://github.com/AngersZh/spark/runs/6765635006
   > 
   > Seems like Scala 2.13 build fails as below:
   > 
   > ```
   > [error] 
/home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:263:1:
  error: cannot find symbol
   > [error]   System.exit(SparkExitCode.EXIT_SUCCESS);
   > [error]^  symbol:   variable EXIT_SUCCESS
   > [error]   location: class SparkExitCode
   > [error] Note: Some input files use or override a deprecated API.
   > [error] Note: Recompile with -Xlint:deprecation for details.
   > [error] 1 error
   > ```
   
   Updated





[GitHub] [spark] beliefer commented on a diff in pull request #36773: [SPARK-39385][SQL] Translate linear regression aggregate functions for pushdown

2022-06-06 Thread GitBox


beliefer commented on code in PR #36773:
URL: https://github.com/apache/spark/pull/36773#discussion_r890696454


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -,6 +,28 @@ class JDBCV2Suite extends QueryTest with 
SharedSparkSession with ExplainSuiteHel
 checkAnswer(df, Seq(Row(1d), Row(1d), Row(null)))
   }
 
+  test("scan with aggregate push-down: linear regression functions with filter 
and group by") {
+val df = sql(
+  """
+|SELECT
+|  REGR_INTERCEPT(bonus, bonus),
+|  REGR_R2(bonus, bonus),
+|  REGR_SLOPE(bonus, bonus),
+|  REGR_SXY(bonus, bonus)
+|FROM h2.test.employee where dept > 0 group by DePt""".stripMargin)

Review Comment:
   OK






[GitHub] [spark] beliefer commented on a diff in pull request #36773: [SPARK-39385][SQL] Translate linear regression aggregate functions for pushdown

2022-06-06 Thread GitBox


beliefer commented on code in PR #36773:
URL: https://github.com/apache/spark/pull/36773#discussion_r890694554


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala:
##
@@ -72,6 +72,26 @@ private[sql] object H2Dialect extends JdbcDialect {
   assert(f.children().length == 2)
   val distinct = if (f.isDistinct) "DISTINCT " else ""
   Some(s"CORR($distinct${f.children().head}, ${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_INTERCEPT" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_INTERCEPT($distinct${f.children().head}, 
${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_R2" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_R2($distinct${f.children().head}, ${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_SLOPE" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_SLOPE($distinct${f.children().head}, 
${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_SXX" =>

Review Comment:
   Thank you. I forgot to remove it.






[GitHub] [spark] beliefer commented on a diff in pull request #36662: [SPARK-39286][DOC] Update documentation for the decode function

2022-06-06 Thread GitBox


beliefer commented on code in PR #36662:
URL: https://github.com/apache/spark/pull/36662#discussion_r890692975


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -2504,9 +2504,10 @@ object Decode {
   usage = """
 _FUNC_(bin, charset) - Decodes the first argument using the second 
argument character set.
 
-_FUNC_(expr, search, result [, search, result ] ... [, default]) - Decode 
compares expr
-  to each search value one by one. If expr is equal to a search, returns 
the corresponding result.
-  If no match is found, then Oracle returns default. If default is 
omitted, returns null.
+_FUNC_(expr, search, result [, search, result ] ... [, default]) - 
Compares expr
+  to each search value in order. If expr is equal to a search value, 
_FUNC_ returns
+  the corresponding result. If no match is found, then it returns default. 
If default

Review Comment:
   I just wanted to simplify the comment. The existing comment is OK too.
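   
   For reference, the behaviour being documented, assuming a running
`SparkSession` named `spark` (the values are made up for the example):
   
   ```
   spark.sql("SELECT decode(2, 1, 'one', 2, 'two', 'other')").show() // two
   spark.sql("SELECT decode(9, 1, 'one', 2, 'two', 'other')").show() // other
   spark.sql("SELECT decode(9, 1, 'one', 2, 'two')").show()          // NULL (default omitted)
   ```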






[GitHub] [spark] beliefer commented on a diff in pull request #36662: [SPARK-39286][DOC] Update documentation for the decode function

2022-06-06 Thread GitBox


beliefer commented on code in PR #36662:
URL: https://github.com/apache/spark/pull/36662#discussion_r890691547


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -2504,9 +2504,10 @@ object Decode {
   usage = """
 _FUNC_(bin, charset) - Decodes the first argument using the second 
argument character set.
 
-_FUNC_(expr, search, result [, search, result ] ... [, default]) - Decode 
compares expr
-  to each search value one by one. If expr is equal to a search, returns 
the corresponding result.
-  If no match is found, then Oracle returns default. If default is 
omitted, returns null.
+_FUNC_(expr, search, result [, search, result ] ... [, default]) - 
Compares expr
+  to each search value in order. If expr is equal to a search value, 
_FUNC_ returns

Review Comment:
   I got it.






[GitHub] [spark] dcoliversun commented on a diff in pull request #36777: [SPARK-39390][CORE] Hide and optimize `viewAcls`/`viewAclsGroups`/`modifyAcls`/`modifyAclsGroups` from INFO log

2022-06-06 Thread GitBox


dcoliversun commented on code in PR #36777:
URL: https://github.com/apache/spark/pull/36777#discussion_r890683738


##
core/src/main/scala/org/apache/spark/SecurityManager.scala:
##
@@ -87,10 +87,14 @@ private[spark] class SecurityManager(
   private var secretKey: String = _
   logInfo("SecurityManager: authentication " + (if (authOn) "enabled" else 
"disabled") +
 "; ui acls " + (if (aclsOn) "enabled" else "disabled") +
-"; users  with view permissions: " + viewAcls.toString() +
-"; groups with view permissions: " + viewAclsGroups.toString() +
-"; users  with modify permissions: " + modifyAcls.toString() +
-"; groups with modify permissions: " + modifyAclsGroups.toString())
+"; users  with view permissions: " +
+(if (viewAcls.nonEmpty) viewAcls.mkString(",") else "EMPTY") +

Review Comment:
   OK :)






[GitHub] [spark] dcoliversun commented on a diff in pull request #36777: [SPARK-39390][CORE] Hide and optimize `viewAcls`/`viewAclsGroups`/`modifyAcls`/`modifyAclsGroups` from INFO log

2022-06-06 Thread GitBox


dcoliversun commented on code in PR #36777:
URL: https://github.com/apache/spark/pull/36777#discussion_r890683347


##
core/src/main/scala/org/apache/spark/SecurityManager.scala:
##
@@ -87,10 +87,14 @@ private[spark] class SecurityManager(
   private var secretKey: String = _
   logInfo("SecurityManager: authentication " + (if (authOn) "enabled" else 
"disabled") +
 "; ui acls " + (if (aclsOn) "enabled" else "disabled") +
-"; users  with view permissions: " + viewAcls.toString() +
-"; groups with view permissions: " + viewAclsGroups.toString() +
-"; users  with modify permissions: " + modifyAcls.toString() +
-"; groups with modify permissions: " + modifyAclsGroups.toString())
+"; users  with view permissions: " +

Review Comment:
   Fine. I will delete the extra spaces after `users`.






[GitHub] [spark] HyukjinKwon commented on pull request #35612: [SPARK-38289][SQL] Refactor SQL CLI exit code to make it more clear

2022-06-06 Thread GitBox


HyukjinKwon commented on PR #35612:
URL: https://github.com/apache/spark/pull/35612#issuecomment-1148079558

   Tests are running in https://github.com/AngersZh/spark/runs/6765635006
   
   Seems like Scala 2.13 build fails as below:
   
   ```
   [error] 
/home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:263:1:
  error: cannot find symbol
   [error]   System.exit(SparkExitCode.EXIT_SUCCESS);
   [error]^  symbol:   variable EXIT_SUCCESS
   [error]   location: class SparkExitCode
   [error] Note: Some input files use or override a deprecated API.
   [error] Note: Recompile with -Xlint:deprecation for details.
   [error] 1 error
   ```
   





[GitHub] [spark] huaxingao commented on a diff in pull request #36777: [SPARK-39390][CORE] Hide and optimize `viewAcls`/`viewAclsGroups`/`modifyAcls`/`modifyAclsGroups` from INFO log

2022-06-06 Thread GitBox


huaxingao commented on code in PR #36777:
URL: https://github.com/apache/spark/pull/36777#discussion_r890673570


##
core/src/main/scala/org/apache/spark/SecurityManager.scala:
##
@@ -87,10 +87,14 @@ private[spark] class SecurityManager(
   private var secretKey: String = _
   logInfo("SecurityManager: authentication " + (if (authOn) "enabled" else 
"disabled") +
 "; ui acls " + (if (aclsOn) "enabled" else "disabled") +
-"; users  with view permissions: " + viewAcls.toString() +
-"; groups with view permissions: " + viewAclsGroups.toString() +
-"; users  with modify permissions: " + modifyAcls.toString() +
-"; groups with modify permissions: " + modifyAclsGroups.toString())
+"; users  with view permissions: " +

Review Comment:
   nit: there is an extra space after `users`



##
core/src/main/scala/org/apache/spark/SecurityManager.scala:
##
@@ -87,10 +87,14 @@ private[spark] class SecurityManager(
   private var secretKey: String = _
   logInfo("SecurityManager: authentication " + (if (authOn) "enabled" else 
"disabled") +
 "; ui acls " + (if (aclsOn) "enabled" else "disabled") +
-"; users  with view permissions: " + viewAcls.toString() +
-"; groups with view permissions: " + viewAclsGroups.toString() +
-"; users  with modify permissions: " + modifyAcls.toString() +
-"; groups with modify permissions: " + modifyAclsGroups.toString())
+"; users  with view permissions: " +
+(if (viewAcls.nonEmpty) viewAcls.mkString(",") else "EMPTY") +

Review Comment:
   nit: in `mkString(",")`, add a space after the `,`?






[GitHub] [spark] github-actions[bot] closed pull request #32397: [WIP][SPARK-35084][CORE] Spark 3: supporting "--packages" in k8s cluster mode

2022-06-06 Thread GitBox


github-actions[bot] closed pull request #32397: [WIP][SPARK-35084][CORE] Spark 
3: supporting "--packages" in  k8s cluster mode
URL: https://github.com/apache/spark/pull/32397





[GitHub] [spark] github-actions[bot] closed pull request #35417: [SPARK-38102][CORE] Support custom commitProtocolClass in saveAsNewAPIHadoopDataset

2022-06-06 Thread GitBox


github-actions[bot] closed pull request #35417: [SPARK-38102][CORE] Support 
custom commitProtocolClass in saveAsNewAPIHadoopDataset
URL: https://github.com/apache/spark/pull/35417





[GitHub] [spark] github-actions[bot] closed pull request #35363: [SPARK-38066][SQL] evaluateEquality should ignore attribute without min/max ColumnStat

2022-06-06 Thread GitBox


github-actions[bot] closed pull request #35363: [SPARK-38066][SQL] 
evaluateEquality should ignore attribute without min/max ColumnStat
URL: https://github.com/apache/spark/pull/35363





[GitHub] [spark] huaxingao commented on a diff in pull request #36773: [SPARK-39385][SQL] Translate linear regression aggregate functions for pushdown

2022-06-06 Thread GitBox


huaxingao commented on code in PR #36773:
URL: https://github.com/apache/spark/pull/36773#discussion_r890658489


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala:
##
@@ -72,6 +72,26 @@ private[sql] object H2Dialect extends JdbcDialect {
   assert(f.children().length == 2)
   val distinct = if (f.isDistinct) "DISTINCT " else ""
   Some(s"CORR($distinct${f.children().head}, ${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_INTERCEPT" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_INTERCEPT($distinct${f.children().head}, 
${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_R2" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_R2($distinct${f.children().head}, ${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_SLOPE" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_SLOPE($distinct${f.children().head}, 
${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_SXX" =>

Review Comment:
   Do we ever hit this path? `translateAggregate` doesn't handle `REGR_SXX`, and the 
PR description says ".. REGR_SXX and REGR_SXY are replaced to other expression in 
runtime"? Actually, `REGR_SXY` is not converted to another expression at runtime, 
right?



##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala:
##
@@ -72,6 +72,26 @@ private[sql] object H2Dialect extends JdbcDialect {
   assert(f.children().length == 2)
   val distinct = if (f.isDistinct) "DISTINCT " else ""
   Some(s"CORR($distinct${f.children().head}, ${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_INTERCEPT" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_INTERCEPT($distinct${f.children().head}, 
${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_R2" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_R2($distinct${f.children().head}, ${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_SLOPE" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_SLOPE($distinct${f.children().head}, 
${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_SXX" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""
+  Some(s"REGR_SXX($distinct${f.children().head}, 
${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "REGR_SXY" =>
+  assert(f.children().length == 2)
+  val distinct = if (f.isDistinct) "DISTINCT " else ""

Review Comment:
   Can we test `DISTINCT` too?



##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -,6 +,28 @@ class JDBCV2Suite extends QueryTest with 
SharedSparkSession with ExplainSuiteHel
 checkAnswer(df, Seq(Row(1d), Row(1d), Row(null)))
   }
 
+  test("scan with aggregate push-down: linear regression functions with filter 
and group by") {
+val df = sql(
+  """
+|SELECT
+|  REGR_INTERCEPT(bonus, bonus),
+|  REGR_R2(bonus, bonus),
+|  REGR_SLOPE(bonus, bonus),
+|  REGR_SXY(bonus, bonus)
+|FROM h2.test.employee where dept > 0 group by DePt""".stripMargin)

Review Comment:
   nit: capitalize the SQL keywords `where` and `group by`?
   
   I just noticed that not all the SQL keywords in this test suite are 
capitalized. Probably open a separate PR to fix them all?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] vli-databricks commented on a diff in pull request #36780: [SPARK-39392][SQL] Refine ANSI error messages for try_* function hints

2022-06-06 Thread GitBox


vli-databricks commented on code in PR #36780:
URL: https://github.com/apache/spark/pull/36780#discussion_r890657167


##
core/src/main/resources/error/error-classes.json:
##
@@ -195,7 +195,7 @@
   },
   "INVALID_ARRAY_INDEX_IN_ELEMENT_AT" : {
 "message" : [
-  "The index  is out of bounds. The array has  
elements. To return NULL instead, use `try_element_at`. If necessary set 
 to \"false\" to bypass this error."
+  "The index  is out of bounds. The array has  
elements. Use `try_element_at` to tolerate accessing element at non-existing 
index and return NULL instead. If necessary set  to \"false\" to bypass 
this error."

Review Comment:
   Yes, that is better. Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on pull request #36781: [SPARK-39393][SQL] Parquet data source only supports push-down predicate filters for non-repeated primitive types

2022-06-06 Thread GitBox


huaxingao commented on PR #36781:
URL: https://github.com/apache/spark/pull/36781#issuecomment-1148022608

   @Borjianamin98 Could you please add a test?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #36734: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set to true

2022-06-06 Thread GitBox


mridulm commented on PR #36734:
URL: https://github.com/apache/spark/pull/36734#issuecomment-1147982565

   Merged to master, thanks for working on this @akpatnam25 !
   Thanks for the review @otterc  :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm closed pull request #36734: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set to true

2022-06-06 Thread GitBox


mridulm closed pull request #36734: [SPARK-38987][SHUFFLE] Throw 
FetchFailedException when merged shuffle blocks are corrupted and 
spark.shuffle.detectCorrupt is set to true
URL: https://github.com/apache/spark/pull/36734


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #36775: [SPARK-39389]Filesystem closed should not be considered as corrupt files

2022-06-06 Thread GitBox


dongjoon-hyun commented on code in PR #36775:
URL: https://github.com/apache/spark/pull/36775#discussion_r890600656


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##
@@ -253,6 +253,9 @@ class FileScanRDD(
   // Throw FileNotFoundException even if `ignoreCorruptFiles` 
is true
   case e: FileNotFoundException if !ignoreMissingFiles => 
throw e
   case e @ (_: RuntimeException | _: IOException) if 
ignoreCorruptFiles =>
+if (e.getMessage.contains("Filesystem closed")) {

Review Comment:
   +1 for @HyukjinKwon 's comment. We need a more robust way to detect, 
@boneanxs .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

2022-06-06 Thread GitBox


dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r890579762


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala:
##
@@ -41,21 +41,29 @@ import org.apache.spark.sql.types.{MetadataBuilder, 
StructField, StructType}
  *
  * We can remove this rule once we implement all the catalog functionality in 
`V2SessionCatalog`.
  */
-class ResolveSessionCatalog(val catalogManager: CatalogManager)
+class ResolveSessionCatalog(val analyzer: Analyzer)
   extends Rule[LogicalPlan] with LookupCatalog {
   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
   import org.apache.spark.sql.connector.catalog.CatalogV2Util._
   import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._
 
+  override val catalogManager = analyzer.catalogManager
+
   override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp 
{
-case AddColumns(ResolvedV1TableIdentifier(ident), cols) =>
+case AddColumns(
+  ResolvedTable(catalog, ident, v1Table: V1Table, _), cols)
+if isSessionCatalog(catalog) =>
   cols.foreach { c =>
 assertTopLevelColumn(c.name, "AlterTableAddColumnsCommand")
 if (!c.nullable) {
   throw 
QueryCompilationErrors.addColumnWithV1TableCannotSpecifyNotNullError
 }
   }
-  AlterTableAddColumnsCommand(ident.asTableIdentifier, 
cols.map(convertToStructField))
+  val prevSchema = StructType(cols.map(convertToStructField))
+  val newSchema: StructType =
+DefaultCols.constantFoldCurrentDefaultsToExistDefaults(

Review Comment:
   Thanks for pointing this out, we don't need to duplicate it again. Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

2022-06-06 Thread GitBox


dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r890579364


##
sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala:
##
@@ -186,7 +186,7 @@ abstract class BaseSessionStateBuilder(
 new ResolveSQLOnFile(session) +:
 new FallBackFileSourceV2(session) +:
 ResolveEncodersInScalaAgg +:
-new ResolveSessionCatalog(catalogManager) +:
+new ResolveSessionCatalog(this) +:

Review Comment:
   Good point, this is not needed, we can revert this part.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #36772: [SPARK-39387][BUILD] Upgrade hive-storage-api to 2.7.3

2022-06-06 Thread GitBox


dongjoon-hyun commented on PR #36772:
URL: https://github.com/apache/spark/pull/36772#issuecomment-1147900459

   The Apache ORC community uses 2.8.1 based on `Panagiotis Garefalakis`'s comment 
(which I shared here), because he is a Hive committer and ORC PMC member. What the 
Apache Spark community actually needs is test coverage. I'm fine with either version 
if both of them work in Spark.
   > Should I raise another PR to upgrade to 2.8.1?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Borjianamin98 opened a new pull request, #36781: [SPARK-39393][SQL] Parquet data source only supports push-down predicate filters for non-repeated primitive types

2022-06-06 Thread GitBox


Borjianamin98 opened a new pull request, #36781:
URL: https://github.com/apache/spark/pull/36781

   ### What changes were proposed in this pull request?
   
   In Spark version 3.1.0 and newer, Spark creates extra filter predicate 
conditions for repeated parquet columns.
   These fields cannot have a filter predicate, according to the 
[PARQUET-34](https://issues.apache.org/jira/browse/PARQUET-34) issue in 
the parquet library.
   
   This PR solves this problem until the appropriate functionality is provided 
by the parquet library.
   
   Before this PR:
   
   Assume the following Protocol Buffer schema:
   
   ```
   message Model {
   string name = 1;
   repeated string keywords = 2;
   }
   ```
   
   Suppose a parquet file is created from a set of records in the above format 
with the help of the parquet-protobuf library.
   Using Spark version 3.1.0 or newer, we get the following exception when running 
the following query in spark-shell:
   
   ```
   val data = spark.read.parquet("/path/to/parquet")
   data.registerTempTable("models")
   spark.sql("select * from models where array_contains(keywords, 
'X')").show(false)
   ```
   
   ```
   Caused by: java.lang.IllegalArgumentException: FilterPredicates do not 
currently support repeated columns. Column keywords is repeated.
 at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:176)
 at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:149)
 at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:89)
 at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56)
 at 
org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:192)
 at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:61)
 at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:95)
 at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:45)
 at 
org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:149)
 at 
org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:72)
 at 
org.apache.parquet.hadoop.ParquetFileReader.filterRowGroups(ParquetFileReader.java:870)
 at 
org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:789)
 at 
org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657)
 at 
org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
 at 
org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:373)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   ...
   ```
   
   The problem is caused by a change in the data filtering conditions:
   
   ```
   spark.sql("select * from log where array_contains(keywords, 
'X')").explain(true);
   
   // Spark 3.0.2 and older
   == Physical Plan ==
   ... 
   +- FileScan parquet [link#0,keywords#1]
 DataFilters: [array_contains(keywords#1, Google)]
 PushedFilters: []
 ...
   
   // Spark 3.1.0 and newer
   == Physical Plan == ... 
   +- FileScan parquet [link#0,keywords#1]
 DataFilters: [isnotnull(keywords#1),  array_contains(keywords#1, Google)]
 PushedFilters: [IsNotNull(keywords)]
 ...
   ```
   
   Pushing filters down for repeated Parquet columns is not necessary because it is 
not supported by the parquet library for now, so we can exclude them from the pushed 
predicate filters and solve the issue.
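   
   As a rough illustration only (not the actual patch), here is a minimal sketch of 
the idea: when collecting pushdown candidates from the Parquet file schema, repeated 
fields are skipped entirely. The helper name and the direct use of parquet-mr's 
schema API are assumptions made for this example.
   
   ```scala
   import org.apache.parquet.schema.{MessageType, Type}
   import scala.collection.JavaConverters._
   
   // Hypothetical helper: keep only non-repeated primitive fields as pushdown
   // candidates, since parquet-mr rejects FilterPredicates on repeated columns.
   def pushableFields(fileSchema: MessageType): Seq[Type] = {
     fileSchema.getFields.asScala.filter { field =>
       field.isPrimitive && field.getRepetition != Type.Repetition.REPEATED
     }.toSeq
   }
   ```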
   
   ### Why are the changes needed?
   
   Predicate filters that are pushed down to parquet should not be created on 
repeated-type fields.
   
   ### Does this PR introduce any user-facing change?
   
   No. It only fixes a bug; before this, due to the limitations of the 
parquet library, no more work was possible.
   
   ### How was this patch tested?
   
   No new tests are needed; this was checked by running the existing code base tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] gengliangwang commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

2022-06-06 Thread GitBox


gengliangwang commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r890519223


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala:
##
@@ -41,21 +41,29 @@ import org.apache.spark.sql.types.{MetadataBuilder, 
StructField, StructType}
  *
  * We can remove this rule once we implement all the catalog functionality in 
`V2SessionCatalog`.
  */
-class ResolveSessionCatalog(val catalogManager: CatalogManager)
+class ResolveSessionCatalog(val analyzer: Analyzer)
   extends Rule[LogicalPlan] with LookupCatalog {
   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
   import org.apache.spark.sql.connector.catalog.CatalogV2Util._
   import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._
 
+  override val catalogManager = analyzer.catalogManager
+
   override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp 
{
-case AddColumns(ResolvedV1TableIdentifier(ident), cols) =>
+case AddColumns(
+  ResolvedTable(catalog, ident, v1Table: V1Table, _), cols)
+if isSessionCatalog(catalog) =>
   cols.foreach { c =>
 assertTopLevelColumn(c.name, "AlterTableAddColumnsCommand")
 if (!c.nullable) {
   throw 
QueryCompilationErrors.addColumnWithV1TableCannotSpecifyNotNullError
 }
   }
-  AlterTableAddColumnsCommand(ident.asTableIdentifier, 
cols.map(convertToStructField))
+  val prevSchema = StructType(cols.map(convertToStructField))
+  val newSchema: StructType =
+DefaultCols.constantFoldCurrentDefaultsToExistDefaults(

Review Comment:
   `constantFoldCurrentDefaultsToExistDefaults` is called in 
`AlterTableAddColumnsCommand` too. Seems duplicated here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

2022-06-06 Thread GitBox


gengliangwang commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r890517016


##
sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala:
##
@@ -186,7 +186,7 @@ abstract class BaseSessionStateBuilder(
 new ResolveSQLOnFile(session) +:
 new FallBackFileSourceV2(session) +:
 ResolveEncodersInScalaAgg +:
-new ResolveSessionCatalog(catalogManager) +:
+new ResolveSessionCatalog(this) +:

Review Comment:
   I have seen this in previous PRs. It is odd that the analyzer relies on 
ResolveSessionCatalog, while ResolveSessionCatalog relies on the analyzer too..



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LucaCanali commented on a diff in pull request #36662: [SPARK-39286][DOC] Update documentation for the decode function

2022-06-06 Thread GitBox


LucaCanali commented on code in PR #36662:
URL: https://github.com/apache/spark/pull/36662#discussion_r890504556


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -2504,9 +2504,10 @@ object Decode {
   usage = """
 _FUNC_(bin, charset) - Decodes the first argument using the second 
argument character set.
 
-_FUNC_(expr, search, result [, search, result ] ... [, default]) - Decode 
compares expr
-  to each search value one by one. If expr is equal to a search, returns 
the corresponding result.
-  If no match is found, then Oracle returns default. If default is 
omitted, returns null.
+_FUNC_(expr, search, result [, search, result ] ... [, default]) - 
Compares expr
+  to each search value in order. If expr is equal to a search value, 
_FUNC_ returns
+  the corresponding result. If no match is found, then it returns default. 
If default

Review Comment:
   I believe "returns" needs a subject to complete the sentence. One option is 
to have "it" as proposed; other options are possible, for example repeating 
_FUNC_. 
   The proposed change of just using "returns default" is quite 
understandable, although strictly speaking it is missing the subject.
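   
   For reference, a quick illustration of the documented behavior (the literal 
values here are invented and a running `spark` session is assumed):
   
   ```scala
   // expr = 2 matches the second search value, so the second result is returned;
   // 4 matches nothing and falls back to the default, or NULL if the default is omitted.
   spark.sql("SELECT decode(2, 1, 'one', 2, 'two', 'other')").show()  // two
   spark.sql("SELECT decode(4, 1, 'one', 2, 'two', 'other')").show()  // other
   spark.sql("SELECT decode(4, 1, 'one', 2, 'two')").show()           // NULL
   ```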



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #36779: [SPARK-39391][CORE] Reuse Partitioner classes

2022-06-06 Thread GitBox


hvanhovell closed pull request #36779: [SPARK-39391][CORE] Reuse Partitioner 
classes
URL: https://github.com/apache/spark/pull/36779


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LucaCanali commented on a diff in pull request #36662: [SPARK-39286][DOC] Update documentation for the decode function

2022-06-06 Thread GitBox


LucaCanali commented on code in PR #36662:
URL: https://github.com/apache/spark/pull/36662#discussion_r890493367


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -2504,9 +2504,10 @@ object Decode {
   usage = """
 _FUNC_(bin, charset) - Decodes the first argument using the second 
argument character set.
 
-_FUNC_(expr, search, result [, search, result ] ... [, default]) - Decode 
compares expr
-  to each search value one by one. If expr is equal to a search, returns 
the corresponding result.
-  If no match is found, then Oracle returns default. If default is 
omitted, returns null.
+_FUNC_(expr, search, result [, search, result ] ... [, default]) - 
Compares expr
+  to each search value in order. If expr is equal to a search value, 
_FUNC_ returns

Review Comment:
   I believe it should be fine as @cloud-fan commented.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cxzl25 commented on pull request #36772: [SPARK-39387][BUILD] Upgrade hive-storage-api to 2.7.3

2022-06-06 Thread GitBox


cxzl25 commented on PR #36772:
URL: https://github.com/apache/spark/pull/36772#issuecomment-1147811164

   > can we have a test case for this
   
   I see that the UT of HIVE-25190 has adjusted the xmx to 3g.
   The current xmx of Spark's unit tests is 4g.
   I'm not sure whether this scenario can be tested using the ORC writer API; I 
can continue to investigate.
   
   
   
   > upgraded the storage api to 2.8.1 at 
[ORC-867](https://issues.apache.org/jira/browse/ORC-867) at Apache ORC 1.7.0+
   
   I chose 2.7.3 because I see 
[HIVE-25190](https://issues.apache.org/jira/browse/HIVE-25190) resolves the 
overflow of newLength and 
[HIVE-25400](https://issues.apache.org/jira/browse/HIVE-25400) adjusts some of 
the behavior changed by HIVE-25190, and both of these are in the 2.7.3 release.
   Then I saw that the ORC and hive-storage versions were not aligned, so I 
used this minor version, 2.7.3.
   
   From the description of ORC-867, 2.8.1 was used because 2.8.0 accidentally 
introduced Guava dependencies, while 2.7.3 does not introduce Guava 
dependencies.
   
   Should I raise another PR to upgrade to 2.8.1?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #36780: [SPARK-39392] Refine ANSI error messages.

2022-06-06 Thread GitBox


gengliangwang commented on PR #36780:
URL: https://github.com/apache/spark/pull/36780#issuecomment-1147808366

   cc @srielau as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36780: [SPARK-39392] Refine ANSI error messages.

2022-06-06 Thread GitBox


gengliangwang commented on code in PR #36780:
URL: https://github.com/apache/spark/pull/36780#discussion_r890466926


##
core/src/main/resources/error/error-classes.json:
##
@@ -195,7 +195,7 @@
   },
   "INVALID_ARRAY_INDEX_IN_ELEMENT_AT" : {
 "message" : [
-  "The index  is out of bounds. The array has  
elements. To return NULL instead, use `try_element_at`. If necessary set 
 to \"false\" to bypass this error."
+  "The index  is out of bounds. The array has  
elements. Use `try_element_at` to tolerate accessing element at non-existing 
index and return NULL instead. If necessary set  to \"false\" to bypass 
this error."

Review Comment:
   `non-existing index` => `invalid index` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] vli-databricks commented on pull request #36780: [SPARK-39392] Refine ANSI error messages.

2022-06-06 Thread GitBox


vli-databricks commented on PR #36780:
URL: https://github.com/apache/spark/pull/36780#issuecomment-1147802194

   @gengliangwang please take a look


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] vli-databricks opened a new pull request, #36780: [SPARK-39392] Refine ANSI error messages.

2022-06-06 Thread GitBox


vli-databricks opened a new pull request, #36780:
URL: https://github.com/apache/spark/pull/36780

   
   
   ### What changes were proposed in this pull request?
   
   Refine ANSI error messages and remove 'To return NULL instead'
   
   ### Why are the changes needed?
   
   Improve error messaging for ANSI mode, since the user may not even be aware that 
the query was returning NULLs.
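   
   As a hedged illustration of the behavior these hints point to (the array values 
are invented and a running `spark` session with the `try_element_at` function is 
assumed):
   
   ```scala
   // With ANSI mode on, an out-of-bounds index raises the
   // INVALID_ARRAY_INDEX_IN_ELEMENT_AT error, while try_element_at tolerates the
   // invalid index and returns NULL instead of failing the query.
   spark.conf.set("spark.sql.ansi.enabled", "true")
   spark.sql("SELECT element_at(array(1, 2, 3), 5)").show()      // throws the ANSI error
   spark.sql("SELECT try_element_at(array(1, 2, 3), 5)").show()  // prints a NULL row
   ```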
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Unit tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


gengliangwang commented on code in PR #36693:
URL: https://github.com/apache/spark/pull/36693#discussion_r890447000


##
core/src/test/scala/org/apache/spark/SparkFunSuite.scala:
##
@@ -264,6 +264,81 @@ abstract class SparkFunSuite
 }
   }
 
+  /**
+   * Checks an exception with an error class against expected results.
+   * @param exception The exception to check
+   * @param errorClass The expected error class identifying the error
+   * @param errorSubClass Optional the expected subclass, None if not given
+   * @param sqlState  Optional the expected SQLSTATE, not verified if not 
supplied
+   * @param parameters A map of parameter names and values. The names are 
as defined
+   *  in the error-classes file.
+   */
+  protected def checkError(exception: Exception with SparkThrowable,
+   errorClass: String,
+   errorSubClass: Option[String],
+   sqlState: Option[String],
+   parameters: Map[String, String],
+   matchPVals: Boolean = false): Unit = {
+assert(exception.getErrorClass === errorClass)
+if (exception.getErrorSubClass != null) { assert(errorSubClass.isDefined) }
+errorSubClass.foreach(subClass => assert(exception.getErrorSubClass === 
subClass))
+sqlState.foreach(state => assert(exception.getSqlState === state))
+val expectedParameters = (exception.getParameterNames zip 
exception.getMessageParameters).toMap
+if (matchPVals == true) {
+  assert(expectedParameters.size === parameters.size)
+  expectedParameters.foreach(
+exp => {
+  val parm = parameters.getOrElse(exp._1,
+throw new IllegalArgumentException("Missing parameter" + exp._1))
+  if (!exp._2.matches(parm)) {
+throw new IllegalArgumentException("(" + exp._1 + ", " + exp._2 +
+  ") does not match: " + parm)
+  }
+}
+  )
+} else {
+  assert(expectedParameters === parameters)
+}
+  }
+
+  protected def checkError(exception: Exception with SparkThrowable,
+   errorClass: String,
+   errorSubClass: String,
+   sqlState: String,
+   parameters: Map[String, String]): Unit =
+checkError(exception, errorClass, Some(errorSubClass), Some(sqlState), 
parameters)
+
+  protected def checkError(exception: Exception with SparkThrowable,
+   errorClass: String,
+   sqlState: String,
+   parameters: Map[String, String]): Unit =
+checkError(exception, errorClass, None, Some(sqlState), parameters)
+
+  protected def checkError(exception: Exception with SparkThrowable,
+   errorClass: String,
+   parameters: Map[String, String]): Unit =
+checkError(exception, errorClass, None, None, parameters)
+
+  /**
+   * Checks an exception with an error class against expected results.
+   * @param exception The exception to check
+   * @param errorClass The expected error class identifying the error
+   * @param sqlState  Optional the expected SQLSTATE, not verified if not 
supplied
+   * @param parameters An array of values. This does not verify the right 
name association.
+   */
+  protected def checkError(exception: Exception with SparkThrowable,
+   errorClass: String,
+   sqlState: String,
+   parameters: Array[String]): Unit =
+checkError(exception, errorClass, None, Some(sqlState),
+  (exception.getParameterNames zip parameters).toMap)
+
+  protected def checkError(exception: Exception with SparkThrowable,
+   errorClass: String,
+   parameters: Array[String]): Unit =

Review Comment:
   ```suggestion
 protected def checkError(
 exception: Exception with SparkThrowable,
 errorClass: String,
 errorSubClass: String,
 sqlState: String,
 parameters: Map[String, String]): Unit =
   checkError(exception, errorClass, Some(errorSubClass), Some(sqlState), 
parameters)
   
 protected def checkError(
 exception: Exception with SparkThrowable,
 errorClass: String,
 sqlState: String,
 parameters: Map[String, String]): Unit =
   checkError(exception, errorClass, None, Some(sqlState), parameters)
   
 protected def checkError(
 exception: Exception with SparkThrowable,
 errorClass: String,
 parameters: Map[String, String]): Unit =
   checkError(exception, errorClass, None, None, parameters)
   
 /**
  * Checks an exception with an error class against expected results.
  * @param exception The exception to check
  * @param errorClass The expected error class identifying the error

[GitHub] [spark] gengliangwang commented on a diff in pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


gengliangwang commented on code in PR #36693:
URL: https://github.com/apache/spark/pull/36693#discussion_r890443211


##
core/src/test/scala/org/apache/spark/SparkFunSuite.scala:
##
@@ -264,6 +264,81 @@ abstract class SparkFunSuite
 }
   }
 
+  /**
+   * Checks an exception with an error class against expected results.
+   * @param exception The exception to check
+   * @param errorClass The expected error class identifying the error
+   * @param errorSubClass Optional the expected subclass, None if not given
+   * @param sqlState  Optional the expected SQLSTATE, not verified if not 
supplied
+   * @param parameters A map of parameter names and values. The names are 
as defined
+   *  in the error-classes file.
+   */
+  protected def checkError(exception: Exception with SparkThrowable,
+   errorClass: String,
+   errorSubClass: Option[String],
+   sqlState: Option[String],
+   parameters: Map[String, String],
+   matchPVals: Boolean = false): Unit = {

Review Comment:
   ```suggestion
 protected def checkError(
 exception: Exception with SparkThrowable,
 errorClass: String,
 errorSubClass: Option[String],
 sqlState: Option[String],
 parameters: Map[String, String],
 matchPVals: Boolean = false): Unit = {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


gengliangwang commented on code in PR #36693:
URL: https://github.com/apache/spark/pull/36693#discussion_r890433533


##
core/src/main/java/org/apache/spark/SparkThrowable.java:
##
@@ -35,6 +35,9 @@ public interface SparkThrowable {
   // Succinct, human-readable, unique, and consistent representation of the 
error category
   // If null, error class is not set
   String getErrorClass();
+  default String getErrorSubClass() {
+return null;

Review Comment:
   Oh I am fine with either "null" or empty string. I was just wondering which 
is better.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


gengliangwang commented on code in PR #36693:
URL: https://github.com/apache/spark/pull/36693#discussion_r890432585


##
core/src/main/java/org/apache/spark/memory/SparkOutOfMemoryError.java:
##
@@ -28,6 +28,7 @@
 @Private
 public final class SparkOutOfMemoryError extends OutOfMemoryError implements 
SparkThrowable {
 String errorClass;
+String errorSubClass;

Review Comment:
   I mean we can skip adding `errorSubClass` and implementing 
`getErrorSubClass` in this PR. We can add it when there are subclasses.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau commented on a diff in pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


srielau commented on code in PR #36693:
URL: https://github.com/apache/spark/pull/36693#discussion_r890414735


##
core/src/main/java/org/apache/spark/SparkThrowable.java:
##
@@ -46,4 +49,13 @@ default String getSqlState() {
   default boolean isInternalError() {
 return SparkThrowableHelper.isInternalError(this.getErrorClass());
   }
+
+  default String[] getMessageParameters() {
+return new String[]{};
+  }
+
+  // True if this error is an internal error.

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau commented on a diff in pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


srielau commented on code in PR #36693:
URL: https://github.com/apache/spark/pull/36693#discussion_r890413958


##
core/src/main/java/org/apache/spark/SparkThrowable.java:
##
@@ -46,4 +49,13 @@ default String getSqlState() {
   default boolean isInternalError() {
 return SparkThrowableHelper.isInternalError(this.getErrorClass());
   }
+
+  default String[] getMessageParameters() {
+return new String[]{};
+  }
+
+  // True if this error is an internal error.

Review Comment:
   ```suggestion
// Returns a string array of all parameters that need to be passed to this 
error message
   ```



##
core/src/main/java/org/apache/spark/SparkThrowable.java:
##
@@ -46,4 +49,13 @@ default String getSqlState() {
   default boolean isInternalError() {
 return SparkThrowableHelper.isInternalError(this.getErrorClass());
   }
+
+  default String[] getMessageParameters() {
+return new String[]{};
+  }
+
+  // True if this error is an internal error.

Review Comment:
   ```suggestion
 // Returns a string array of all parameters that need to be passed to this 
error message.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau commented on a diff in pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


srielau commented on code in PR #36693:
URL: https://github.com/apache/spark/pull/36693#discussion_r890413958


##
core/src/main/java/org/apache/spark/SparkThrowable.java:
##
@@ -46,4 +49,13 @@ default String getSqlState() {
   default boolean isInternalError() {
 return SparkThrowableHelper.isInternalError(this.getErrorClass());
   }
+
+  default String[] getMessageParameters() {
+return new String[]{};
+  }
+
+  // True if this error is an internal error.

Review Comment:
   ```suggestion
   // Returns a string array of all parameters that need to be passed to this 
error message
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau commented on a diff in pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


srielau commented on code in PR #36693:
URL: https://github.com/apache/spark/pull/36693#discussion_r890411559


##
core/src/main/java/org/apache/spark/SparkThrowable.java:
##
@@ -35,6 +35,9 @@ public interface SparkThrowable {
   // Succinct, human-readable, unique, and consistent representation of the 
error category
   // If null, error class is not set
   String getErrorClass();
+  default String getErrorSubClass() {
+return null;

Review Comment:
   I think this is mapped to an Option in scala... But you tell me... I'm all 
thumbs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau commented on a diff in pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


srielau commented on code in PR #36693:
URL: https://github.com/apache/spark/pull/36693#discussion_r890410763


##
core/src/main/java/org/apache/spark/memory/SparkOutOfMemoryError.java:
##
@@ -28,6 +28,7 @@
 @Private
 public final class SparkOutOfMemoryError extends OutOfMemoryError implements 
SparkThrowable {
 String errorClass;
+String errorSubClass;

Review Comment:
   I think that's correct. At this point Out Of Memory has no subclass.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #36772: [SPARK-39387][BUILD] Upgrade hive-storage-api to 2.7.3

2022-06-06 Thread GitBox


dongjoon-hyun commented on PR #36772:
URL: https://github.com/apache/spark/pull/36772#issuecomment-1147728250

   BTW, Apache ORC community upgraded the storage api to 2.8.1 at 
[ORC-867](https://issues.apache.org/jira/browse/ORC-867) at Apache ORC 1.7.0+


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


gengliangwang commented on code in PR #36693:
URL: https://github.com/apache/spark/pull/36693#discussion_r890392259


##
core/src/main/scala/org/apache/spark/SparkException.scala:
##
@@ -28,23 +28,47 @@ class SparkException(
 message: String,
 cause: Throwable,
 errorClass: Option[String],
+errorSubClass: Option[String],
 messageParameters: Array[String])
   extends Exception(message, cause) with SparkThrowable {
 
+   def this(message: String,
+cause: Throwable,
+errorClass: Option[String],
+messageParameters: Array[String]) =
+ this(message = message,
+  cause = cause,
+  errorClass = errorClass,
+  errorSubClass = None,
+  messageParameters = messageParameters)
+
   def this(message: String, cause: Throwable) =
-this(message = message, cause = cause, errorClass = None, 
messageParameters = Array.empty)
+this(message = message, cause = cause, errorClass = None, errorSubClass = 
None,
+  messageParameters = Array.empty)
 
   def this(message: String) =
 this(message = message, cause = null)
 
   def this(errorClass: String, messageParameters: Array[String], cause: 
Throwable) =
 this(
-  message = SparkThrowableHelper.getMessage(errorClass, messageParameters),
+  message = SparkThrowableHelper.getMessage(errorClass, None, 
messageParameters),
+  cause = cause,
+  errorClass = Some(errorClass),
+  errorSubClass = None,
+  messageParameters = messageParameters)
+
+  def this(errorClass: String, errorSubClass: String,
+   messageParameters: Array[String], cause: Throwable) =

Review Comment:
   ```suggestion
 def this(
 errorClass: String,
 errorSubClass: String,
 messageParameters: Array[String],
 cause: Throwable) =
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


gengliangwang commented on code in PR #36693:
URL: https://github.com/apache/spark/pull/36693#discussion_r890391580


##
core/src/main/scala/org/apache/spark/SparkException.scala:
##
@@ -28,23 +28,47 @@ class SparkException(
 message: String,
 cause: Throwable,
 errorClass: Option[String],
+errorSubClass: Option[String],
 messageParameters: Array[String])
   extends Exception(message, cause) with SparkThrowable {
 
+   def this(message: String,
+cause: Throwable,
+errorClass: Option[String],
+messageParameters: Array[String]) =

Review Comment:
   ```suggestion
  def this(
  message: String,
  cause: Throwable,
  errorClass: Option[String],
  messageParameters: Array[String]) =
   ```



##
core/src/main/scala/org/apache/spark/SparkException.scala:
##
@@ -28,23 +28,47 @@ class SparkException(
 message: String,
 cause: Throwable,
 errorClass: Option[String],
+errorSubClass: Option[String],
 messageParameters: Array[String])
   extends Exception(message, cause) with SparkThrowable {
 
+   def this(message: String,
+cause: Throwable,
+errorClass: Option[String],
+messageParameters: Array[String]) =

Review Comment:
   nit comment for scala style



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


gengliangwang commented on code in PR #36693:
URL: https://github.com/apache/spark/pull/36693#discussion_r890373656


##
core/src/main/java/org/apache/spark/memory/SparkOutOfMemoryError.java:
##
@@ -28,6 +28,7 @@
 @Private
 public final class SparkOutOfMemoryError extends OutOfMemoryError implements 
SparkThrowable {
 String errorClass;
+String errorSubClass;

Review Comment:
   It seems that the `errorSubClass` here is always null if there is no 
constructor method setting it.



##
core/src/main/java/org/apache/spark/SparkThrowable.java:
##
@@ -35,6 +35,9 @@ public interface SparkThrowable {
   // Succinct, human-readable, unique, and consistent representation of the 
error category
   // If null, error class is not set
   String getErrorClass();
+  default String getErrorSubClass() {

Review Comment:
   ```suggestion
   
   default String getErrorSubClass() {
   ```



##
core/src/main/java/org/apache/spark/SparkThrowable.java:
##
@@ -46,4 +49,13 @@ default String getSqlState() {
   default boolean isInternalError() {
 return SparkThrowableHelper.isInternalError(this.getErrorClass());
   }
+
+  default String[] getMessageParameters() {
+return new String[]{};
+  }
+
+  // True if this error is an internal error.

Review Comment:
   We need to update the comment since it returns an array



##
core/src/main/java/org/apache/spark/SparkThrowable.java:
##
@@ -35,6 +35,9 @@ public interface SparkThrowable {
   // Succinct, human-readable, unique, and consistent representation of the 
error category
   // If null, error class is not set
   String getErrorClass();
+  default String getErrorSubClass() {

Review Comment:
   nit: Usually there needs to be a blank line between methods.



##
core/src/main/java/org/apache/spark/SparkThrowable.java:
##
@@ -35,6 +35,9 @@ public interface SparkThrowable {
   // Succinct, human-readable, unique, and consistent representation of the 
error category
   // If null, error class is not set
   String getErrorClass();
+  default String getErrorSubClass() {
+return null;

Review Comment:
   Should we consider making the default value an empty string to avoid a 
NullPointerException?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] JoshRosen commented on pull request #36753: [SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries

2022-06-06 Thread GitBox


JoshRosen commented on PR #36753:
URL: https://github.com/apache/spark/pull/36753#issuecomment-1147680899

   When updating this PR, let's also pull in my changes from 
https://github.com/apache/spark/pull/36765 . When merging this, we should 
probably pick it all the way back to 3.0 (since it looks like a correctness 
issue that would also impact those versions).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell opened a new pull request, #36779: [SPARK-39391][CORE] Reuse Partitioner classes

2022-06-06 Thread GitBox


hvanhovell opened a new pull request, #36779:
URL: https://github.com/apache/spark/pull/36779

   ### What changes were proposed in this pull request?
   This PR creates two new `Partitioner` classes:
   - `ConstantPartitioner`: This moves all tuples in an RDD into a single 
partition. This replaces two anonymous partitioners in `RDD` and 
`ShuffleExchangeExec`.
   - `PartitionIdPassthrough`: This is a dummy partitioner that passes keys 
through when the partition ids have already been computed. This is actually not 
a new class; it was moved from `ShuffleRowRDD.scala` to core. It replaces two 
anonymous partitioners in `BlockMatrix` and `RDD` (a minimal sketch of both 
classes follows below).
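   
   A rough sketch of what the two classes could look like (the signatures are 
assumed from the description above, not copied from the patch):
   
   ```scala
   import org.apache.spark.Partitioner
   
   // Collapses every record into a single partition (partition 0).
   class ConstantPartitioner extends Partitioner {
     override def numPartitions: Int = 1
     override def getPartition(key: Any): Int = 0
   }
   
   // Assumes the key already is the target partition id and simply passes it through.
   class PartitionIdPassthrough(override val numPartitions: Int) extends Partitioner {
     override def getPartition(key: Any): Int = key.asInstanceOf[Int]
   }
   ```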
   
   ### Why are the changes needed?
   Less code.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Existing tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #36775: [SPARK-39389]Filesystem closed should not be considered as corrupt files

2022-06-06 Thread GitBox


AmplabJenkins commented on PR #36775:
URL: https://github.com/apache/spark/pull/36775#issuecomment-1147644959

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor opened a new pull request, #36778: [SPARK-39383][SQL] Support DEFAULT columns in ALTER TABLE ALTER COLUMNS to V2 data sources

2022-06-06 Thread GitBox


dtenedor opened a new pull request, #36778:
URL: https://github.com/apache/spark/pull/36778

   ### What changes were proposed in this pull request?
   
   Extend DEFAULT column support in ALTER TABLE ALTER COLUMNS commands to 
include V2 data sources.
   
   (Note: this depends on https://github.com/apache/spark/pull/36771.)
   
   Example:
   
   ```
   > create or replace table t (a string default 'abc', b string) using 
$v2Source
   > insert into t values (default)
   > alter table t alter column b set default 'def'
   > insert into t values ("ghi")
   > Select * from t
   "abc", null,
   "ghi", "def"
   ```
   
   ### Why are the changes needed?
   
   This makes V2 data sources easier to use and extend.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   
   ### How was this patch tested?
   
   This PR includes new test coverage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

2022-06-06 Thread GitBox


dtenedor commented on PR #36745:
URL: https://github.com/apache/spark/pull/36745#issuecomment-1147641785

   @gengliangwang @HyukjinKwon @cloud-fan this is ready to merge at your 
convenience (or leave review comment(s) for further iteration if desired)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #36774: [WIP][SPARK-39388][SQL] Reuse `orcScheam` when push down Orc predicates

2022-06-06 Thread GitBox


LuciferYang commented on code in PR #36774:
URL: https://github.com/apache/spark/pull/36774#discussion_r890306149


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala:
##
@@ -153,10 +152,9 @@ class OrcFileFormat
   } else {
 // ORC predicate pushdown
 if (orcFilterPushDown && filters.nonEmpty) {
-  OrcUtils.readCatalystSchema(filePath, conf, 
ignoreCorruptFiles).foreach { fileSchema =>

Review Comment:
   And should we change line 147 in `OrcFileFormat` to use `OrcUtils.readSchema(filePath, conf, ignoreCorruptFiles)`? Then it seems the `FileFormatException` would be handled inside `OrcFileFormat`; otherwise, the `FileFormatException` will be thrown and handled by `FileScanRDD`. Am I right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #36774: [WIP][SPARK-39388][SQL] Reuse `orcScheam` when push down Orc predicates

2022-06-06 Thread GitBox


LuciferYang commented on code in PR #36774:
URL: https://github.com/apache/spark/pull/36774#discussion_r890291383


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala:
##
@@ -153,10 +152,9 @@ class OrcFileFormat
   } else {
 // ORC predicate pushdown
 if (orcFilterPushDown && filters.nonEmpty) {
-  OrcUtils.readCatalystSchema(filePath, conf, 
ignoreCorruptFiles).foreach { fileSchema =>

Review Comment:
   Or, in what scenario does line 147 in `OrcFileFormat` (`Utils.tryWithResource(OrcFile.createReader(filePath, readerOptions))(_.getSchema)`) not throw a `FileFormatException`, while line 77 in `OrcUtils` (`Utils.tryWithResource(OrcFile.createReader(file, readerOptions)) { reader => reader.getSchema }`) does?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #36774: [WIP][SPARK-39388][SQL] Reuse `orcScheam` when push down Orc predicates

2022-06-06 Thread GitBox


LuciferYang commented on code in PR #36774:
URL: https://github.com/apache/spark/pull/36774#discussion_r890285150


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala:
##
@@ -153,10 +152,9 @@ class OrcFileFormat
   } else {
 // ORC predicate pushdown
 if (orcFilterPushDown && filters.nonEmpty) {
-  OrcUtils.readCatalystSchema(filePath, conf, 
ignoreCorruptFiles).foreach { fileSchema =>

Review Comment:
   Thanks for your review @beliefer, you are right: `OrcUtils.readCatalystSchema` catches the `FileFormatException`. But if `OrcUtils.readCatalystSchema` catches a `FileFormatException`, it should already have been thrown on line 147.
   
   
https://github.com/apache/spark/blob/18ca369f01905b421a658144e23b5a4e60702655/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L146-L161
   
   
   
https://github.com/apache/spark/blob/18ca369f01905b421a658144e23b5a4e60702655/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L72-L94



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #36774: [WIP][SPARK-39388][SQL] Reuse `orcScheam` when push down Orc predicates

2022-06-06 Thread GitBox


LuciferYang commented on code in PR #36774:
URL: https://github.com/apache/spark/pull/36774#discussion_r890291383


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala:
##
@@ -153,10 +152,9 @@ class OrcFileFormat
   } else {
 // ORC predicate pushdown
 if (orcFilterPushDown && filters.nonEmpty) {
-  OrcUtils.readCatalystSchema(filePath, conf, 
ignoreCorruptFiles).foreach { fileSchema =>

Review Comment:
   In what scenario does line 147 in `OrcFileFormat` (`Utils.tryWithResource(OrcFile.createReader(filePath, readerOptions))(_.getSchema)`) not throw a `FileFormatException`, while line 77 in `OrcUtils` (`Utils.tryWithResource(OrcFile.createReader(file, readerOptions)) { reader => reader.getSchema }`) does?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #36774: [WIP][SPARK-39388][SQL] Reuse `orcScheam` when push down Orc predicates

2022-06-06 Thread GitBox


LuciferYang commented on code in PR #36774:
URL: https://github.com/apache/spark/pull/36774#discussion_r890285150


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala:
##
@@ -153,10 +152,9 @@ class OrcFileFormat
   } else {
 // ORC predicate pushdown
 if (orcFilterPushDown && filters.nonEmpty) {
-  OrcUtils.readCatalystSchema(filePath, conf, 
ignoreCorruptFiles).foreach { fileSchema =>

Review Comment:
   Thanks for your review @beliefer, you are right: `OrcUtils.readCatalystSchema` catches the `FileFormatException`. But if `OrcUtils.readCatalystSchema` catches a `FileFormatException`, it should already have been thrown on line 146.
   
   
https://github.com/apache/spark/blob/18ca369f01905b421a658144e23b5a4e60702655/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L146-L161
   
   
   
https://github.com/apache/spark/blob/18ca369f01905b421a658144e23b5a4e60702655/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L72-L94



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #36774: [WIP][SPARK-39388][SQL] Reuse `orcScheam` when push down Orc predicates

2022-06-06 Thread GitBox


LuciferYang commented on code in PR #36774:
URL: https://github.com/apache/spark/pull/36774#discussion_r890285150


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala:
##
@@ -153,10 +152,9 @@ class OrcFileFormat
   } else {
 // ORC predicate pushdown
 if (orcFilterPushDown && filters.nonEmpty) {
-  OrcUtils.readCatalystSchema(filePath, conf, 
ignoreCorruptFiles).foreach { fileSchema =>

Review Comment:
   Thanks for your review @beliefer, you are right: `OrcUtils.readCatalystSchema` catches the `FileFormatException`. But if `OrcUtils.readCatalystSchema` catches a `FileFormatException`, it should already have been thrown on line 146.
   
   
https://github.com/apache/spark/blob/18ca369f01905b421a658144e23b5a4e60702655/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L146-L161
   
   
   
https://github.com/apache/spark/blob/18ca369f01905b421a658144e23b5a4e60702655/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L72-L94



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] chenzhx commented on a diff in pull request #36663: [SPARK-38899][SQL]DS V2 supports push down datetime functions

2022-06-06 Thread GitBox


chenzhx commented on code in PR #36663:
URL: https://github.com/apache/spark/pull/36663#discussion_r890253947


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/util/V2ExpressionBuilder.scala:
##
@@ -259,6 +259,55 @@ class V2ExpressionBuilder(
   } else {
 None
   }
+case date: DateAdd =>
+  val childrenExpressions = date.children.flatMap(generateExpression(_))
+  if (childrenExpressions.length == date.children.length) {
+Some(new GeneralScalarExpression("DATE_ADD", 
childrenExpressions.toArray[V2Expression]))
+  } else {
+None
+  }
+case date: DateDiff =>
+  val childrenExpressions = date.children.flatMap(generateExpression(_))
+  if (childrenExpressions.length == date.children.length) {
+Some(new GeneralScalarExpression("DATE_DIFF", 
childrenExpressions.toArray[V2Expression]))
+  } else {
+None
+  }
+case date: TruncDate =>
+  val childrenExpressions = date.children.flatMap(generateExpression(_))
+  if (childrenExpressions.length == date.children.length) {
+Some(new GeneralScalarExpression("TRUNC", 
childrenExpressions.toArray[V2Expression]))
+  } else {
+None
+  }
+case Second(child, _) => generateExpression(child)
+  .map(v => new GeneralScalarExpression("SECOND", Array[V2Expression](v)))
+case Minute(child, _) => generateExpression(child)
+  .map(v => new GeneralScalarExpression("MINUTE", Array[V2Expression](v)))
+case Hour(child, _) => generateExpression(child)
+  .map(v => new GeneralScalarExpression("HOUR", Array[V2Expression](v)))
+case Month(child) => generateExpression(child)
+  .map(v => new GeneralScalarExpression("MONTH", Array[V2Expression](v)))
+case Quarter(child) => generateExpression(child)
+  .map(v => new GeneralScalarExpression("QUARTER", Array[V2Expression](v)))
+case Year(child) => generateExpression(child)
+  .map(v => new GeneralScalarExpression("YEAR", Array[V2Expression](v)))
+// The DAY_OF_WEEK function in Spark returns the day of the week for 
date/timestamp.
+// Database dialects do not need to follow ISO semantics when handling 
DAY_OF_WEEK.

Review Comment:
   In ISO semantics, 1 represents Monday, but in Spark's `DayOfWeek` function, 1 represents Sunday.
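   
   A small illustration of the difference (an assumption added for clarity, not part of the PR; it presumes an existing `SparkSession` named `spark`, and 2022-06-05 is a Sunday):
   
   ```scala
   // Spark's dayofweek numbers Sunday as 1, while ISO day-of-week numbering starts the week on Monday.
   val dow = spark.sql("SELECT dayofweek(DATE'2022-06-05')").head().getInt(0)
   // dow == 1 under Spark's semantics; an ISO-style function would return 7 for a Sunday
   ```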



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on a diff in pull request #36773: [SPARK-39385][SQL] Translate linear regression aggregate functions for pushdown

2022-06-06 Thread GitBox


srowen commented on code in PR #36773:
URL: https://github.com/apache/spark/pull/36773#discussion_r890245878


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala:
##
@@ -750,6 +750,22 @@ object DataSourceStrategy
 PushableColumnWithoutNestedColumn(right), _) =>
   Some(new GeneralAggregateFunc("CORR", agg.isDistinct,
 Array(FieldReference.column(left), FieldReference.column(right
+case aggregate.RegrIntercept(PushableColumnWithoutNestedColumn(left),

Review Comment:
   Not required, but it'd be nice to find a way to fix the indentation here - the continuation lines should not line up with 'case'



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #36777: [SPARK-39390][CORE] Hide and optimize `viewAcls`/`viewAclsGroups`/`modifyAcls`/`modifyAclsGroups` from INFO log

2022-06-06 Thread GitBox


AmplabJenkins commented on PR #36777:
URL: https://github.com/apache/spark/pull/36777#issuecomment-1147544750

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau commented on pull request #36693: [SPARK-39349] Add a centralized CheckError method for QA of error path

2022-06-06 Thread GitBox


srielau commented on PR #36693:
URL: https://github.com/apache/spark/pull/36693#issuecomment-1147543765

   @gengliangwang @cloud-fan Hoping for a review.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #35612: [SPARK-38289][SQL] Refactor SQL CLI exit code to make it more clear

2022-06-06 Thread GitBox


cloud-fan commented on code in PR #35612:
URL: https://github.com/apache/spark/pull/35612#discussion_r890193035


##
sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:
##
@@ -259,7 +260,7 @@ static class HelpOptionExecutor implements 
ServerOptionsExecutor {
 @Override
 public void execute() {
   new HelpFormatter().printHelp(serverName, options);
-  System.exit(0);
+  System.exit(SparkExitCode.EXIT_SUCCESS());

Review Comment:
   ```suggestion
 System.exit(SparkExitCode.EXIT_SUCCESS);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36662: [SPARK-39286][DOC] Update documentation for the decode function

2022-06-06 Thread GitBox


cloud-fan commented on code in PR #36662:
URL: https://github.com/apache/spark/pull/36662#discussion_r890191027


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -2504,9 +2504,10 @@ object Decode {
   usage = """
 _FUNC_(bin, charset) - Decodes the first argument using the second 
argument character set.
 
-_FUNC_(expr, search, result [, search, result ] ... [, default]) - Decode 
compares expr
-  to each search value one by one. If expr is equal to a search, returns 
the corresponding result.
-  If no match is found, then Oracle returns default. If default is 
omitted, returns null.
+_FUNC_(expr, search, result [, search, result ] ... [, default]) - 
Compares expr
+  to each search value in order. If expr is equal to a search value, 
_FUNC_ returns

Review Comment:
   `_FUNC_` is a placeholder and will be replaced by the real function name in 
the doc page. I think this is fine?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36663: [SPARK-38899][SQL]DS V2 supports push down datetime functions

2022-06-06 Thread GitBox


cloud-fan commented on code in PR #36663:
URL: https://github.com/apache/spark/pull/36663#discussion_r890188771


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/util/V2ExpressionBuilder.scala:
##
@@ -259,6 +259,55 @@ class V2ExpressionBuilder(
   } else {
 None
   }
+case date: DateAdd =>
+  val childrenExpressions = date.children.flatMap(generateExpression(_))
+  if (childrenExpressions.length == date.children.length) {
+Some(new GeneralScalarExpression("DATE_ADD", 
childrenExpressions.toArray[V2Expression]))
+  } else {
+None
+  }
+case date: DateDiff =>
+  val childrenExpressions = date.children.flatMap(generateExpression(_))
+  if (childrenExpressions.length == date.children.length) {
+Some(new GeneralScalarExpression("DATE_DIFF", 
childrenExpressions.toArray[V2Expression]))
+  } else {
+None
+  }
+case date: TruncDate =>
+  val childrenExpressions = date.children.flatMap(generateExpression(_))
+  if (childrenExpressions.length == date.children.length) {
+Some(new GeneralScalarExpression("TRUNC", 
childrenExpressions.toArray[V2Expression]))
+  } else {
+None
+  }
+case Second(child, _) => generateExpression(child)
+  .map(v => new GeneralScalarExpression("SECOND", Array[V2Expression](v)))
+case Minute(child, _) => generateExpression(child)
+  .map(v => new GeneralScalarExpression("MINUTE", Array[V2Expression](v)))
+case Hour(child, _) => generateExpression(child)
+  .map(v => new GeneralScalarExpression("HOUR", Array[V2Expression](v)))
+case Month(child) => generateExpression(child)
+  .map(v => new GeneralScalarExpression("MONTH", Array[V2Expression](v)))
+case Quarter(child) => generateExpression(child)
+  .map(v => new GeneralScalarExpression("QUARTER", Array[V2Expression](v)))
+case Year(child) => generateExpression(child)
+  .map(v => new GeneralScalarExpression("YEAR", Array[V2Expression](v)))
+// The DAY_OF_WEEK function in Spark returns the day of the week for 
date/timestamp.
+// Database dialects do not need to follow ISO semantics when handling 
DAY_OF_WEEK.

Review Comment:
   Hmm, do you mean the Spark behavior is non-standard for the DAY_OF_WEEK function?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Eugene-Mark commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

2022-06-06 Thread GitBox


Eugene-Mark commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1147448291

   Agreed that it's better not to modify the Oracle-related part; I just removed it from the commit.
   Yes, I suggest we use scale = 18.
   As for precision: when `Number(*)` or `Number` is used in Teradata, the precision returned from JDBC is `40`, which is larger than Spark's max precision, so I also handle this case explicitly.
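   
   To make that concrete, a hedged sketch of how a JDBC dialect could clamp such metadata (illustrative only, with assumed names; this is not the actual patch):
   
   ```scala
   import java.sql.Types
   import org.apache.spark.sql.jdbc.JdbcDialect
   import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}
   
   // Sketch: map Teradata NUMBER / NUMBER(*), which JDBC reports with precision 40,
   // to Spark's maximum decimal precision with the suggested scale of 18.
   object TeradataDecimalSketch extends JdbcDialect {
     override def canHandle(url: String): Boolean = url.startsWith("jdbc:teradata")
   
     override def getCatalystType(
         sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
       if (sqlType == Types.NUMERIC && size > DecimalType.MAX_PRECISION) {
         Some(DecimalType(DecimalType.MAX_PRECISION, 18))
       } else {
         None // defer to the default mapping
       }
     }
   }
   ```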


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36698: [SPARK-39316][SQL] Merge PromotePrecision and CheckOverflow into decimal binary arithmetic

2022-06-06 Thread GitBox


cloud-fan commented on code in PR #36698:
URL: https://github.com/apache/spark/pull/36698#discussion_r890146130


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/decimalExpressions.scala:
##
@@ -232,3 +216,33 @@ case class CheckOverflowInSum(
   override protected def withNewChildInternal(newChild: Expression): 
CheckOverflowInSum =
 copy(child = newChild)
 }
+
+/**
+ * An add expression which is only used internally by Sum/Avg.
+ *
+ * Note that this expression does not check overflow, which is different from `Add`.
+ */
+case class DecimalAddNoOverflowCheck(
+left: Expression,
+right: Expression,
+override val dataType: DataType,
+failOnError: Boolean = SQLConf.get.ansiEnabled) extends BinaryArithmetic {
+  require(dataType.isInstanceOf[DecimalType])
+
+  override def nullable: Boolean = super[BinaryArithmetic].nullable

Review Comment:
   If `failOnError` is true, this expression can only return null when the input is null, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #35250: [SPARK-37961][SQL] Override maxRows/maxRowsPerPartition for some logical operators

2022-06-06 Thread GitBox


cloud-fan commented on PR #35250:
URL: https://github.com/apache/spark/pull/35250#issuecomment-1147439706

   Somehow we lost track of this PR. @zhengruifeng do you want to reopen it and get it in?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #36732: [SPARK-39345][CORE][SQL][DSTREAM][ML][MESOS][SS] Replace `filter(!condition)` with `filterNot(condition)`

2022-06-06 Thread GitBox


srowen commented on PR #36732:
URL: https://github.com/apache/spark/pull/36732#issuecomment-1147422759

   No, we wouldn't backport this; that would be even more change.
   Does this offer any benefit? I'm not even sure it's more readable.
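   
   For readers of the thread, the pattern being swapped, in miniature (illustrative only, not taken from the PR):
   
   ```scala
   val xs = Seq(1, 2, 3, 4)
   val before = xs.filter(x => !(x % 2 == 0))   // filter(!condition)
   val after  = xs.filterNot(x => x % 2 == 0)   // filterNot(condition)
   // before == after == Seq(1, 3)
   ```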


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #36499: [SPARK-38846][SQL] Add explicit data mapping between Teradata Numeric Type and Spark DecimalType

2022-06-06 Thread GitBox


srowen commented on PR #36499:
URL: https://github.com/apache/spark/pull/36499#issuecomment-1147421904

   You're saying, basically, assume scale=18? That seems reasonable.
   Or are you saying there needs to be an arbitrary precision type? I don't see 
how a DB would support that.
   I'm hesitant to modify Oracle without knowing why it is how it is, and why 
it should change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #36763: [SPARK-39376][SQL] Hide duplicated columns in star expansion of subquery alias from NATURAL/USING JOIN

2022-06-06 Thread GitBox


cloud-fan closed pull request #36763: [SPARK-39376][SQL] Hide duplicated 
columns in star expansion of subquery alias from NATURAL/USING JOIN
URL: https://github.com/apache/spark/pull/36763


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


