[GitHub] [spark] LuciferYang commented on pull request #38075: [WIP][SPARK-40633][BUILD] Upgrade janino to 3.1.8

2022-11-20 Thread GitBox


LuciferYang commented on PR #38075:
URL: https://github.com/apache/spark/pull/38075#issuecomment-1321605889

   will check 3.1.9





[GitHub] [spark] cloud-fan commented on a diff in pull request #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY

2022-11-20 Thread GitBox


cloud-fan commented on code in PR #38722:
URL: https://github.com/apache/spark/pull/38722#discussion_r1027662257


##
core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java:
##
@@ -812,9 +812,7 @@ public boolean append(Object kbase, long koff, int klen, Object vbase, long voff
 
 // If the map has reached its growth threshold, try to grow it.
 if (numKeys >= growthThreshold) {
-  // We use two array entries per key, so the array size is twice the capacity.
-  // We should compare the current capacity of the array, instead of its size.
-  if (longArray.size() / 2 < MAX_CAPACITY) {
+  if (longArray.size() < MAX_CAPACITY) {

Review Comment:
   I think we need to read the code of this entire class to understand the capacity. According to the code comment
   ```
  /**
   * The maximum number of keys that BytesToBytesMap supports. The hash table has to be
   * power-of-2-sized and its backing Java array can contain at most (1 << 30) elements,
   * since that's the largest power-of-2 that's less than Integer.MAX_VALUE. We need two long array
   * entries per key, giving us a maximum capacity of (1 << 29).
   */
  public static final int MAX_CAPACITY = (1 << 29);
   ```
   
   This is to restrict the number of keys, and `longArray.size() / 2` is the number of keys. The code here should be correct.
   
   I also saw code like this
   ```
  private void allocate(int capacity) {
    assert (capacity >= 0);
    capacity = Math.max((int) Math.min(MAX_CAPACITY, ByteArrayMethods.nextPowerOf2(capacity)), 64);
    assert (capacity <= MAX_CAPACITY);
    longArray = allocateArray(capacity * 2L);
    longArray.zeroOut();

    this.growthThreshold = (int) (capacity * loadFactor);
    this.mask = capacity - 1;
  }
   ```
   
   `longArray` is 2x capacity.
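   A minimal, standalone sketch (not Spark code) of the arithmetic described above, assuming `MAX_CAPACITY = 1 << 29` as in `BytesToBytesMap`: the backing long array holds two entries per key, so `size / 2` is the current key capacity, and growth should stop once that capacity reaches the maximum.
   ```
object CapacitySketch {
  val MaxCapacity: Int = 1 << 29

  // growth is allowed only while the key capacity (size / 2) is still below the maximum
  def canGrow(longArraySize: Long): Boolean = longArraySize / 2 < MaxCapacity

  def main(args: Array[String]): Unit = {
    println(canGrow(64 * 2L))                 // true: a small map can still grow
    println(canGrow(MaxCapacity.toLong * 2L)) // false: already at the key limit
  }
}
   ```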






[GitHub] [spark] mridulm commented on pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-20 Thread GitBox


mridulm commented on PR #38711:
URL: https://github.com/apache/spark/pull/38711#issuecomment-1321576092

   Thanks for the PR to fix this bug!
   I will need to think more about this issue, but I am leaning towards a variant of the proposed solution. Namely:
   
   * `stageAttemptToNumSpeculativeTasks` -> `stageAttemptToUnsubmittedSpeculativeTasks`
     * These are task indices that have been marked speculatable but have not yet been launched.
   * When a speculative task starts, remove it from `stageAttemptToUnsubmittedSpeculativeTasks`.
   * When a non-speculative task finishes, remove it from `stageAttemptToUnsubmittedSpeculativeTasks`.
   
   Thoughts @Ngone51, @toujours33, @LuciferYang?
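   A rough, hypothetical sketch of the bookkeeping proposed above (simplified types and method names, not the actual ExecutorAllocationManager code):
   ```
import scala.collection.mutable

case class StageAttempt(stageId: Int, attemptId: Int)

class UnsubmittedSpeculativeTaskTracker {
  // task indices that have been marked speculatable but not yet launched
  private val stageAttemptToUnsubmittedSpeculativeTasks =
    mutable.HashMap.empty[StageAttempt, mutable.HashSet[Int]]

  def onTaskMarkedSpeculatable(sa: StageAttempt, taskIndex: Int): Unit =
    stageAttemptToUnsubmittedSpeculativeTasks
      .getOrElseUpdate(sa, mutable.HashSet.empty[Int]) += taskIndex

  // a speculative copy has actually started, so it is no longer "unsubmitted"
  def onSpeculativeTaskStart(sa: StageAttempt, taskIndex: Int): Unit =
    stageAttemptToUnsubmittedSpeculativeTasks.get(sa).foreach(_ -= taskIndex)

  // the original task finished, so a still-pending speculative copy is obsolete
  def onNonSpeculativeTaskEnd(sa: StageAttempt, taskIndex: Int): Unit =
    stageAttemptToUnsubmittedSpeculativeTasks.get(sa).foreach(_ -= taskIndex)

  // what the executor allocation target should count: still-pending speculative tasks
  def numUnsubmitted(sa: StageAttempt): Int =
    stageAttemptToUnsubmittedSpeculativeTasks.get(sa).map(_.size).getOrElse(0)
}
   ```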
   
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #38495: [SPARK-35531][SQL] Update hive table stats without unnecessary convert

2022-11-20 Thread GitBox


cloud-fan commented on code in PR #38495:
URL: https://github.com/apache/spark/pull/38495#discussion_r1027656667


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala:
##
@@ -894,12 +895,14 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
 
   sql(insertString.toLowerCase(Locale.ROOT))
   sql(insertString.toUpperCase(Locale.ROOT))
+  spark.sessionState.catalog.alterTableStats(TableIdentifier("test1"), None)

Review Comment:
   Does it test anything? It just invokes `alterTableStats` but does no 
verification.






[GitHub] [spark] cloud-fan commented on a diff in pull request #38703: [SPARK-41191] [SQL] Cache Table is not working while nested caches exist

2022-11-20 Thread GitBox


cloud-fan commented on code in PR #38703:
URL: https://github.com/apache/spark/pull/38703#discussion_r1027651034


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala:
##
@@ -355,7 +355,7 @@ case class ListQuery(
   plan.canonicalized,
   outerAttrs.map(_.canonicalized),
   ExprId(0),
-  childOutputs.map(_.canonicalized.asInstanceOf[Attribute]),
+  plan.canonicalized.output,

Review Comment:
   This looks correct, but I don't know how this is related to the cache problem. Can you elaborate?






[GitHub] [spark] cloud-fan commented on pull request #38687: [SPARK-41154][SQL] Incorrect relation caching for queries with time travel spec

2022-11-20 Thread GitBox


cloud-fan commented on PR #38687:
URL: https://github.com/apache/spark/pull/38687#issuecomment-1321552064

   Oh, wait a minute. Due to the Spark catalog extension (set via `spark.sql.catalog.spark_catalog`), we can have tables supporting time travel in the v1 catalog as well. I think `SessionCatalog.tableRelationCache` also needs a fix.
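   For context, a minimal illustration of the setup the comment refers to; `spark.sql.catalog.spark_catalog` and the `VERSION AS OF` syntax are real, but `com.example.MyTimeTravelCatalog` is a placeholder class name:
   ```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.catalog.spark_catalog", "com.example.MyTimeTravelCatalog") // placeholder
  .getOrCreate()

// With such an extension installed, a v1 table name can resolve to a relation that
// supports time travel, so a relation cache keyed only by the table name could
// wrongly reuse one snapshot for a query that asks for another.
spark.sql("SELECT * FROM t VERSION AS OF 1")
spark.sql("SELECT * FROM t VERSION AS OF 2")
   ```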





[GitHub] [spark] mridulm commented on pull request #38560: [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service

2022-11-20 Thread GitBox


mridulm commented on PR #38560:
URL: https://github.com/apache/spark/pull/38560#issuecomment-1321551358

   @yabola, there is quite a lot of nontrivial overlap between this PR and what @wankunde's PR is trying to do.
   It would be great if you both could coordinate on this - I would love to get this functionality merged before we start getting close to the code freeze for 3.4.





[GitHub] [spark] cloud-fan commented on pull request #38687: [SPARK-41154][SQL] Incorrect relation caching for queries with time travel spec

2022-11-20 Thread GitBox


cloud-fan commented on PR #38687:
URL: https://github.com/apache/spark/pull/38687#issuecomment-1321550782

   thanks, merging to master/3.3!





[GitHub] [spark] mridulm commented on pull request #38560: [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service

2022-11-20 Thread GitBox


mridulm commented on PR #38560:
URL: https://github.com/apache/spark/pull/38560#issuecomment-1321548579

   > One thing that I know needs to be addressed:
   > Some merge data infos are not saved on the driver because they are too small (controlled by spark.shuffle.push.minShuffleSizeToWait),
   > please see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2295
   
   
   In this case, we should fire a remove immediately - we are not going to use it for this app anyway ...
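   A purely hypothetical sketch of that idea (the helper names below are placeholders, not this PR's actual API; only the config name is real):
   ```
object MergedShuffleCleanupSketch {
  // illustrative value; the real default of spark.shuffle.push.minShuffleSizeToWait may differ
  val minShuffleSizeToWait: Long = 500L * 1024 * 1024

  def sendRemoveShuffleMerge(shuffleId: Int): Unit =
    println(s"remove merged data for shuffle $shuffleId now")

  def trackForCleanupAtAppEnd(shuffleId: Int): Unit =
    println(s"track shuffle $shuffleId for cleanup at application end")

  // if the merged output is too small for the driver to ever use it,
  // fire the remove request right away instead of tracking the state
  def onShuffleMergeFinalized(shuffleId: Int, mergedOutputBytes: Long): Unit =
    if (mergedOutputBytes < minShuffleSizeToWait) sendRemoveShuffleMerge(shuffleId)
    else trackForCleanupAtAppEnd(shuffleId)
}
   ```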





[GitHub] [spark] gengliangwang commented on a diff in pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI

2022-11-20 Thread GitBox


gengliangwang commented on code in PR #38567:
URL: https://github.com/apache/spark/pull/38567#discussion_r1027640551


##
core/src/main/scala/org/apache/spark/status/AppStatusStore.scala:
##
@@ -769,7 +772,14 @@ private[spark] object AppStatusStore {
   def createLiveStore(
   conf: SparkConf,
   appStatusSource: Option[AppStatusSource] = None): AppStatusStore = {
-val store = new ElementTrackingStore(new InMemoryStore(), conf)
+val storePath = conf.get(LIVE_UI_LOCAL_STORE_DIR).map(new File(_))
+// For the disk-based KV store of live UI, let's simply make it ROCKSDB only for now,
+// instead of supporting both LevelDB and RocksDB. RocksDB is built based on LevelDB with
+// improvements on writes and reads. Furthermore, we can reuse the RocksDBFileManager in
+// streaming for replicating the local RocksDB file to DFS. The replication in DFS can be
+// used for Spark history server.

Review Comment:
   Thanks, I have removed the comment about replication.






[GitHub] [spark] mridulm commented on a diff in pull request #37922: [SPARK-40480][SHUFFLE] Remove push-based shuffle data after query finished

2022-11-20 Thread GitBox


mridulm commented on code in PR #37922:
URL: https://github.com/apache/spark/pull/37922#discussion_r1027637860


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -393,6 +394,35 @@ public void applicationRemoved(String appId, boolean cleanupLocalDirs) {
 }
   }
 
+  @Override
+  public void removeShuffleMerge(RemoveShuffleMerge msg) {
+String appId = msg.appId;
+int appAttemptId = msg.appAttemptId;
+int shuffleId = msg.shuffleId;
+AppShuffleInfo appShuffleInfo = validateAndGetAppShuffleInfo(appId);
+if (appShuffleInfo.attemptId != appAttemptId) {
+  throw new IllegalArgumentException(
+  String.format("The attempt id %s in this RemoveShuffleMerge message does not match "
+  + "with the current attempt id %s stored in shuffle service for application %s",
+  appAttemptId, appShuffleInfo.attemptId, appId));
+}
+
+appShuffleInfo.shuffles.compute(shuffleId, (shuffleIdKey, partitionsInfo) -> {
+  if (null != partitionsInfo) {

Review Comment:
   The validation related to `shuffleMergeId`, from `finalizeShuffleMerge`, is 
applicable here depending on `shuffleMergeId`.
   I have given the details 
[here](https://github.com/apache/spark/pull/37922#discussion_r990753031) - 
please refer to it.



##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/NoOpMergedShuffleFileManager.java:
##
@@ -84,4 +85,9 @@ public MergedBlockMeta getMergedBlockMeta(
   public String[] getMergedBlockDirs(String appId) {
 throw new UnsupportedOperationException("Cannot handle shuffle block merge");
   }
+
+  @Override
+  public void removeShuffleMerge(RemoveShuffleMerge removeShuffleMerge) {
+throw new UnsupportedOperationException("Cannot handle shuffle block merge");

Review Comment:
   nit: 
   ```suggestion
   throw new UnsupportedOperationException("Cannot handle merged shuffle remove");
   ```



##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/MergedShuffleFileManager.java:
##
@@ -121,6 +122,13 @@ MergedBlockMeta getMergedBlockMeta(
*/
   String[] getMergedBlockDirs(String appId);
 
+  /**
+   * Remove shuffle merge data files.
+   *
+   * @param removeShuffleMerge Remove shuffle merge RPC

Review Comment:
   ```suggestion
  * @param removeShuffleMerge contains shuffle details (appId, shuffleId, etc) to uniquely identify a shuffle to be removed
   ```






[GitHub] [spark] LuciferYang opened a new pull request, #38737: [SPARK-41174][SQL] Propagate an error class to users for invalid `format` of `to_binary()`

2022-11-20 Thread GitBox


LuciferYang opened a new pull request, #38737:
URL: https://github.com/apache/spark/pull/38737

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] erenavsarogullari commented on pull request #38736: [WIP][SPARK-41214][SQL] - SubPlan metrics are missed when AQE is enabled under InMemoryRelation

2022-11-20 Thread GitBox


erenavsarogullari commented on PR #38736:
URL: https://github.com/apache/spark/pull/38736#issuecomment-1321545696

   cc @cloud-fan @Ngone51 





[GitHub] [spark] erenavsarogullari opened a new pull request, #38736: [WIP][SPARK-41214][SQL] - SubPlan metrics under InMemoryRelation are missed when …

2022-11-20 Thread GitBox


erenavsarogullari opened a new pull request, #38736:
URL: https://github.com/apache/spark/pull/38736

   
   
   ### What changes were proposed in this pull request?
   
   `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning` enables AQE optimizations under `InMemoryRelation` (IMR) nodes by creating separate sub-SparkPlans for the AQE optimizations. However, when `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning = true`, the Spark UI does not show the correct DAG because the final sub-plans (under IMR) are not submitted to the UI.
   
   **DAG before fix:**
   https://user-images.githubusercontent.com/1437738/202984370-c179707a-c091-4133-adb6-d5009c98875a.png
   
   **DAG after fix:**
   https://user-images.githubusercontent.com/1437738/202984481-a63ba5e2-fc66-4dc7-98a0-f2233a93e5c8.png
   
   
   ### Why are the changes needed?
   
   `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning` enables AQE optimizations under `InMemoryRelation` (IMR) nodes. The sample query described in the PR has an IMR node on both `BroadcastHashJoin` legs. However, when `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning = true`, the corresponding metrics are missed because the final sub-plans (under IMR) are not submitted to the UI.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Currently, the Spark UI does not show the final DAG when `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning = true`, which causes some physical operator metrics to be missed. This PR aims to fix this problem.
   
   
   ### How was this patch tested?
   
   Specific UT Coverage will be added (In Progress).
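   Illustrative only - this is not the PR's actual sample query, just a hypothetical shape of the setup described above: enable the config and join a cached DataFrame with itself so both join legs read from an `InMemoryRelation`.
   ```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")

val cached = spark.range(0, 1000).toDF("id").cache()
cached.count()                       // materialize the InMemoryRelation
cached.join(cached, "id").collect()  // sub-plans under IMR are planned with AQE
   ```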
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #38713: [SPARK-41195][SQL] Support PIVOT/UNPIVOT with join children

2022-11-20 Thread GitBox


cloud-fan commented on code in PR #38713:
URL: https://github.com/apache/spark/pull/38713#discussion_r1027630891


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##
@@ -1263,60 +1263,71 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit
 * }}}
 */
   override def visitRelation(ctx: RelationContext): LogicalPlan = withOrigin(ctx) {
-withJoinRelations(plan(ctx.relationPrimary), ctx)
+withRelationExtensions(ctx, plan(ctx.relationPrimary))
+  }
+
+  private def withRelationExtensions(ctx: RelationContext, query: LogicalPlan): LogicalPlan = {
+ctx.relationExtension().asScala.foldLeft(query) { (left, extension) =>
+  if (extension.joinRelation() != null) {
+withJoinRelation(extension.joinRelation(), left)
+  } else if (extension.pivotClause() != null) {
+withPivot(extension.pivotClause(), left)
+  } else {
+assert(extension.unpivotClause() != null)
+withUnpivot(extension.unpivotClause(), left)
+  }
+}
   }
 
   /**
-   * Join one more [[LogicalPlan]]s to the current logical plan.
+   * Join one more [[LogicalPlan]] to the current logical plan.
 */
-  private def withJoinRelations(base: LogicalPlan, ctx: RelationContext): LogicalPlan = {
-ctx.joinRelation.asScala.foldLeft(base) { (left, join) =>

Review Comment:
   The actual code change is very small: just remove this loop and rename a few variables.









[GitHub] [spark] dongjoon-hyun commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-20 Thread GitBox


dongjoon-hyun commented on PR #38333:
URL: https://github.com/apache/spark/pull/38333#issuecomment-1321531047

   Thank you, @mridulm !





[GitHub] [spark] mridulm commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-20 Thread GitBox


mridulm commented on PR #38333:
URL: https://github.com/apache/spark/pull/38333#issuecomment-1321529913

   I was in two minds about whether to fix this in 3.3 as well ...
   Yes, 3.3 is affected by it.





[GitHub] [spark] LuciferYang commented on a diff in pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI

2022-11-20 Thread GitBox


LuciferYang commented on code in PR #38567:
URL: https://github.com/apache/spark/pull/38567#discussion_r1027625306


##
core/src/main/scala/org/apache/spark/status/KVUtils.scala:
##
@@ -80,6 +89,44 @@ private[spark] object KVUtils extends Logging {
 db
   }
 
+  def createKVStore(
+  storePath: Option[File],
+  diskBackend: HybridStoreDiskBackend.Value,
+  conf: SparkConf): KVStore = {
+storePath.map { path =>
+  val dir = diskBackend match {
+case LEVELDB => "listing.ldb"
+case ROCKSDB => "listing.rdb"
+  }
+
+  val dbPath = Files.createDirectories(new File(path, dir).toPath()).toFile()
+  Utils.chmod700(dbPath)
+
+
+  val metadata = FsHistoryProviderMetadata(
+FsHistoryProvider.CURRENT_LISTING_VERSION,
+AppStatusStore.CURRENT_VERSION,
+conf.get(History.HISTORY_LOG_DIR))
+
+  try {
+open(dbPath, metadata, conf, Some(diskBackend))
+  } catch {

Review Comment:
   If unexpected data cleaning doesn't occur in the live UI scenario, I think 
it is OK






[GitHub] [spark] mridulm commented on a diff in pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI

2022-11-20 Thread GitBox


mridulm commented on code in PR #38567:
URL: https://github.com/apache/spark/pull/38567#discussion_r1027618122


##
core/src/main/scala/org/apache/spark/status/AppStatusStore.scala:
##
@@ -769,7 +772,14 @@ private[spark] object AppStatusStore {
   def createLiveStore(
   conf: SparkConf,
   appStatusSource: Option[AppStatusSource] = None): AppStatusStore = {
-val store = new ElementTrackingStore(new InMemoryStore(), conf)
+val storePath = conf.get(LIVE_UI_LOCAL_STORE_DIR).map(new File(_))
+// For the disk-based KV store of live UI, let's simply make it ROCKSDB only for now,
+// instead of supporting both LevelDB and RocksDB. RocksDB is built based on LevelDB with
+// improvements on writes and reads. Furthermore, we can reuse the RocksDBFileManager in
+// streaming for replicating the local RocksDB file to DFS. The replication in DFS can be
+// used for Spark history server.

Review Comment:
   nit: Can we remove the part about replication for now ? When we add support 
for it, we can suitably modify the comment.






[GitHub] [spark] dongjoon-hyun commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-20 Thread GitBox


dongjoon-hyun commented on PR #38333:
URL: https://github.com/apache/spark/pull/38333#issuecomment-1321516325

   Thank you, @gaoyajun02 , @mridulm , @otterc .
   - Do we need to backport this to branch-3.3?
   - According to the previous failure description, what happens in branch-3.3 
in case of failure?





[GitHub] [spark] liuzqt commented on a diff in pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollec

2022-11-20 Thread GitBox


liuzqt commented on code in PR #38704:
URL: https://github.com/apache/spark/pull/38704#discussion_r1027614714


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##
@@ -2251,7 +2251,11 @@ class DatasetLargeResultCollectingSuite extends QueryTest
   with SharedSparkSession {
 
   override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
-  test("collect data with single partition larger than 2GB bytes array limit") {
+  // SPARK-41193: Ignore this suite because it cannot run successfully with Spark
+  // default Java Options, if user need do local test, please make the following changes:
+  // - Maven test: change `-Xmx4g` of `scalatest-maven-plugin` in `sql/core/pom.xml` to `-Xmx10g`
+  // - SBT test: change `-Xmx4g` of `Test / javaOptions` in `SparkBuild.scala` to `-Xmx10g`
+  ignore("collect data with single partition larger than 2GB bytes array limit") {

Review Comment:
   I think we can leave it as ignore for now with the comments about using 
larger mem to make it work. I'm not sure if we're able to configure the build 
args for a specific test suite.
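   For what it's worth, a sketch of one way sbt can give a single suite its own JVM heap, using test grouping with forked JVMs; this is only to illustrate the idea, not an actual change to `SparkBuild.scala` (which sets `Test / javaOptions` globally), and forking one JVM per suite would slow the build down considerably.
   ```
// build.sbt sketch: run DatasetLargeResultCollectingSuite in its own JVM with a larger heap
Test / testGrouping := (Test / definedTests).value.map { test =>
  val heap =
    if (test.name.endsWith("DatasetLargeResultCollectingSuite")) "-Xmx10g" else "-Xmx4g"
  Tests.Group(
    name = test.name,
    tests = Seq(test),
    runPolicy = Tests.SubProcess(ForkOptions().withRunJVMOptions(Vector(heap))))
}
   ```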






[GitHub] [spark] mridulm commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-20 Thread GitBox


mridulm commented on PR #38333:
URL: https://github.com/apache/spark/pull/38333#issuecomment-1321511612

   Merged to master.
   Thanks for fixing this @gaoyajun02 !
   Thanks for the review @otterc :-)





[GitHub] [spark] asfgit closed pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-20 Thread GitBox


asfgit closed pull request #38333: [SPARK-40872] Fallback to original shuffle 
block when a push-merged shuffle chunk is zero-size
URL: https://github.com/apache/spark/pull/38333





[GitHub] [spark] mridulm commented on pull request #38467: [SPARK-40987][CORE] Avoid creating a directory when deleting a block, causing DAGScheduler to not work

2022-11-20 Thread GitBox


mridulm commented on PR #38467:
URL: https://github.com/apache/spark/pull/38467#issuecomment-1321508245

   Agree with @Ngone51, there are two issues here.
   
   a) When we have locked a block for read/write, we expect it to be unlocked and exceptions to be handled gracefully.
   In this case, `removeBlockInternal` should ensure the lock is released gracefully.
   Instead of catching `Exception`, I would suggest moving `removeBlock` into a `finally` and everything above it in `removeBlockInternal` into a `try` block.
   
   A quick look indicated the other uses of `lockForWriting` should be fine - but perhaps something we should audit in the future @Ngone51!
   
   
   b) Ensure we do not recreate a directory when exiting (this is not limited to `removeBlockInternal` in this PR).
   
   In addition to (a), I do believe we should do what is in this PR @Ngone51.
   Thoughts?
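   A simplified, self-contained sketch of the restructuring suggested in (a) - not the actual BlockManager code, and the names below are illustrative only:
   ```
class BlockRemovalSketch(infoManager: collection.mutable.Map[String, AnyRef]) {

  private def dropFromMemoryAndDisk(blockId: String): Unit = {
    // ... may throw, e.g. while touching local dirs on an executor that is shutting down
  }

  def removeBlockInternal(blockId: String): Unit = {
    try {
      dropFromMemoryAndDisk(blockId) // everything that previously ran before removeBlock
    } finally {
      infoManager.remove(blockId)    // stands in for removeBlock: always release the lock entry
    }
  }
}
   ```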





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


dongjoon-hyun commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027605394


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
+"between analyzed and executed", schema) { (df, _, _) =>

Review Comment:
   Ah, you are right. My bad.






[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


HeartSaVioR commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027594911


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
+"between analyzed and executed", schema) { (df, _, _) =>

Review Comment:
   I was also incorrect. This is actually very clear:
   
   https://github.com/databricks/scala-style-guide#indent
   
   > For method and class constructor invocations, use 2 space indentation for 
its parameters and put each in each line when the parameters don't fit in two 
lines.
   
   Technically, this is method "invocation", not method "definition".






[GitHub] [spark] cloud-fan closed pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching

2022-11-20 Thread GitBox


cloud-fan closed pull request #38692: [SPARK-41183][SQL] Add an extension API 
to do plan normalization for caching
URL: https://github.com/apache/spark/pull/38692





[GitHub] [spark] cloud-fan commented on pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching

2022-11-20 Thread GitBox


cloud-fan commented on PR #38692:
URL: https://github.com/apache/spark/pull/38692#issuecomment-1321487076

   thanks for the review, merging to master!





[GitHub] [spark] sadikovi commented on pull request #38731: [SPARK-41209][PYSPARK] Improve PySpark type inference in _merge_type method

2022-11-20 Thread GitBox


sadikovi commented on PR #38731:
URL: https://github.com/apache/spark/pull/38731#issuecomment-1321483028

   @HyukjinKwon I noticed that NullType in PySpark is on the list of atomic types, even though it is not an atomic type - in fact, this is mentioned in the type's docstring. I tried to remove it but encountered test failures in `pyspark.pandas.tests.indexes.test_base`. I am going to keep it as is, but could you comment on why NullType was added to the `_atomic_types` list? Thanks.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`

2022-11-20 Thread GitBox


zhengruifeng commented on code in PR #38735:
URL: https://github.com/apache/spark/pull/38735#discussion_r1027583933


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -115,6 +115,9 @@ def __init__(
 self._cache: Dict[str, Any] = {}
 self._session: "RemoteSparkSession" = session
 
+def __repr__(self) -> str:
+return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes))

Review Comment:
   this follows the default behavior of 
https://github.com/apache/spark/blob/40a9a6ef5b89f0c3d19db4a43b8a73decaa173c3/python/pyspark/sql/dataframe.py#L860-L869






[GitHub] [spark] zhengruifeng opened a new pull request, #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`

2022-11-20 Thread GitBox


zhengruifeng opened a new pull request, #38735:
URL: https://github.com/apache/spark/pull/38735

   ### What changes were proposed in this pull request?
   Implement `DataFrame.__repr__` and `DataFrame.dtypes`
   
   
   ### Why are the changes needed?
   For api coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes
   
   
   ### How was this patch tested?
   added UT





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


dongjoon-hyun commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027573603


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
+"between analyzed and executed", schema) { (df, _, _) =>

Review Comment:
   That's true. We have been unclear about this. 
   
   However, given that the general principle is to distinguish different 
sections for readability and we already use `2-space indentation` for **the 
method body**, I believe what we need is to extend the existing rule by 
removing `when the parameters don't fit in two lines`.
   
   Mixing part of the test case name with the method body doesn't give us much readability. Worse, it's not extensible, because it eventually forces us to use two-space indentation and four-space indentation separately in Case 1 and Case 2.
   **Case 1**
   ```
   metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
 "between analyzed and executed", schema) { (df, _, _) =>
   ```
   **Case 2**
   ```
   metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
   "long long long long long long long long long long long long long long " 
+
   "between analyzed and executed", schema) { (df, _, _) =>
   ```
   
   






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


dongjoon-hyun commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027573603


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
+"between analyzed and executed", schema) { (df, _, _) =>

Review Comment:
   That's true. We have been unclear about this. 
   
   However, given that the general principle is to distinguish different 
sections for readability and we already use `2-space indentation` for **the 
method body**, I believe what we need is to extend the existing rule by 
removing `when the parameters don't fit in two lines`.
   
   Mixing part of the test case name with the method body doesn't give us much readability. Worse, it's not extensible, because it eventually leads us to use two-space indentation and four-space indentation in Case 1 and Case 2.
   **Case 1**
   ```
   metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
 "between analyzed and executed", schema) { (df, _, _) =>
   ```
   **Case 2**
   ```
   metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
   "long long long long long long long long long long long long long long " 
+
   "between analyzed and executed", schema) { (df, _, _) =>
   ```
   
   






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


dongjoon-hyun commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027573603


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
+"between analyzed and executed", schema) { (df, _, _) =>

Review Comment:
   That's true. We have been unclear about this. Given that the general 
principle is to distinguish different sections for readability and we already 
use `2-space indentation` for *the method body*, I believe what we need is to 
extend the existing rule by removing `when the parameters don't fit in two 
lines`.
   
   Mixing part of the test case name with the method body doesn't give us much readability. Worse, it's not extensible, because it eventually leads us to use two-space indentation and four-space indentation in Case 1 and Case 2.
   **Case 1**
   ```
   metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
 "between analyzed and executed", schema) { (df, _, _) =>
   ```
   **Case 2**
   ```
   metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
   "long long long long long long long long long long long long long long " 
+
   "between analyzed and executed", schema) { (df, _, _) =>
   ```
   
   






[GitHub] [spark] pan3793 commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2022-11-20 Thread GitBox


pan3793 commented on code in PR #38732:
URL: https://github.com/apache/spark/pull/38732#discussion_r1027559033


##
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala:
##
@@ -119,6 +126,12 @@ class ExecutorPodsAllocator(
   }
 }
 snapshotsStore.addSubscriber(podAllocationDelay) {
+  maxNumExecutorFailuresOpt.foreach { maxNumExecutorFailures =>
+if (failureTracker.numFailedExecutors > maxNumExecutorFailures) {
+  logError(s"Max number of executor failures ($maxNumExecutorFailures) 
reached")
+  SparkContext.getActive.foreach(_.stop(EXIT_MAX_EXECUTOR_FAILURES))

Review Comment:
   It's ugly, but `AbstractPodsAllocator` is a developer API, and passing `sc` would change the constructor signature. Suggestions are welcome.






[GitHub] [spark] zhengruifeng opened a new pull request, #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

2022-11-20 Thread GitBox


zhengruifeng opened a new pull request, #38734:
URL: https://github.com/apache/spark/pull/38734

   ### What changes were proposed in this pull request?
   Implement `DataFrame.isEmpty`
   
   
   ### Why are the changes needed?
   API Coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, new api
   
   
   ### How was this patch tested?
   added UT





[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


HeartSaVioR commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027545869


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
+"between analyzed and executed", schema) { (df, _, _) =>

Review Comment:
   https://github.com/databricks/scala-style-guide#indent
   
   For method declarations, use 4 space indentation for their parameters and 
put each in each line **when the parameters don't fit in two lines**. Return 
types can be either on the same line as the last parameter, or start a new line 
with 2 space indent.
   
   So the rule of 4-space indentation is for method definitions that span 3+ lines. We don't specifically mention two-line method definitions. It's not super clear since we don't have a strict guideline, but there is a general spacing rule, `Use 2-space indentation in general.`, which I think applies to cases that don't fall under the exceptions.






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


dongjoon-hyun commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027524453


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
+"between analyzed and executed", schema) { (df, _, _) =>

Review Comment:
   I'm curious why it's not a parameter? For me, the second line is a parameter because it is **a part of the first parameter**. And Apache Spark usually splits parameter definition sections from method definition sections, doesn't it?
   > The second line is not a parameter but a continuation of test name string.
   
   BTW, one thing I agree with @HeartSaVioR on is that we respect the nearest style in the code in general. So, I want to ask @Yaohua628 and @HeartSaVioR explicitly: do you want to make this an official Apache Spark coding style? What I'm asking about is `the indentation on test case name splitting`.






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


dongjoon-hyun commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027524453


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
+"between analyzed and executed", schema) { (df, _, _) =>

Review Comment:
   I'm curious why it's not a parameter? For me, the second line is a parameter because it is **a part of the first parameter**. And Apache Spark usually splits parameter definition sections from method definition sections, doesn't it?
   > The second line is not a parameter but a continuation of test name string.
   
   BTW, one thing I agree with @HeartSaVioR on is that we respect the nearest style in the code in general. So, I want to ask @Yaohua628 and @HeartSaVioR explicitly: do you want to make this an official Apache Spark coding style? What I'm asking about is `the indentation on test case name splitting`.






[GitHub] [spark] pan3793 commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2022-11-20 Thread GitBox


pan3793 commented on code in PR #38732:
URL: https://github.com/apache/spark/pull/38732#discussion_r1027519725


##
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala:
##
@@ -723,6 +723,25 @@ private[spark] object Config extends Logging {
   .checkValue(value => value > 0, "Maximum number of pending pods should be a positive integer")
   .createWithDefault(Int.MaxValue)
 
+  val KUBERNETES_MAX_EXECUTOR_FAILURES =
+ConfigBuilder("spark.kubernetes.executor.maxNumFailures")

Review Comment:
   The equivalent YARN configuration is `spark.yarn.max.executor.failures`; naming suggestions are welcome.






[GitHub] [spark] pan3793 commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2022-11-20 Thread GitBox


pan3793 commented on code in PR #38732:
URL: https://github.com/apache/spark/pull/38732#discussion_r1027519725


##
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala:
##
@@ -723,6 +723,25 @@ private[spark] object Config extends Logging {
   .checkValue(value => value > 0, "Maximum number of pending pods should be a positive integer")
   .createWithDefault(Int.MaxValue)
 
+  val KUBERNETES_MAX_EXECUTOR_FAILURES =
+ConfigBuilder("spark.kubernetes.executor.maxNumFailures")

Review Comment:
   The equivalent yarn configuration is `spark.yarn.max.executor.failures`






[GitHub] [spark] yaooqinn opened a new pull request, #38733: [SPARK-41211][Core] Upgrade ZooKeeper from 3.6.2 to 3.6.3

2022-11-20 Thread GitBox


yaooqinn opened a new pull request, #38733:
URL: https://github.com/apache/spark/pull/38733

   
   
   ### What changes were proposed in this pull request?
   
   Upgrade ZooKeeper to 3.6.3
   
   ### Why are the changes needed?
   
   
   
   
   ZooKeeper 3.6.3 contains many bugfixes, such as a thread leak issue 
described in ZOOKEEPER-3706.
   
   FYI, https://zookeeper.apache.org/doc/r3.6.3/releasenotes.html
   
   - Why 3.6.3?
 - https://github.com/apache/spark/pull/37507
 - https://github.com/apache/spark/pull/32572
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   no
   
   ### How was this patch tested?
   
   
   existing tests and dependency check


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

2022-11-20 Thread GitBox


zhengruifeng commented on PR #38686:
URL: https://github.com/apache/spark/pull/38686#issuecomment-1321403447

   I will update this PR after the arrow-based collect is fixed in https://github.com/apache/spark/pull/38706; otherwise, some e2e tests will fail.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


HeartSaVioR commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027509435


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -600,7 +600,7 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   val df2 = spark.read.format("json")
 .load(dir.getCanonicalPath + "/target/new-streaming-data-join")
   // Verify self-join results
-  assert(streamQuery2.lastProgress.numInputRows == 4L)
+  assert(streamQuery2.lastProgress.numInputRows == 2L)

Review Comment:
   Off-topic: this is very interesting. Looks like fixing this "enables" 
ReusedExchange, which somehow makes ProgressReporter pick up the metric from 
the single leaf node instead of two.
   
   > Before the fix
   
   ```
   == Parsed Logical Plan ==
   WriteToMicroBatchDataSourceV1 
FileSink[/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/new-streaming-data-join],
 77baa2ac-cc0b-4e01-94ff-ec20c98eb29b, 
[checkpointLocation=/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/checkpoint_join,
 
path=/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/new-streaming-data-join],
 Append, 0
   +- Project [name#2339, age#2340, info#2341, _metadata#2345]
  +- Join Inner, ((((name#2339 = name#2504) AND (age#2340 = age#2505)) AND (info#2341 = info#2506)) AND (_metadata#2345 = _metadata#2507))
 :- Project [name#2339, age#2340, info#2341, _metadata#2345]
 :  +- Project [_metadata#2345, name#2339, age#2340, info#2341, 
_metadata#2345]
 : +- Project [name#2517 AS name#2339, age#2518 AS age#2340, 
info#2519 AS info#2341, _metadata#2529 AS _metadata#2345]
 :+- Relation [name#2517,age#2518,info#2519,_metadata#2529] json
 +- Project [name#2504, age#2505, info#2506, _metadata#2507]
+- Project [_metadata#2507, name#2504, age#2505, info#2506, 
_metadata#2507]
   +- Project [name#2523 AS name#2504, age#2524 AS age#2505, 
info#2525 AS info#2506, _metadata#2530 AS _metadata#2507]
  +- Relation [name#2523,age#2524,info#2525,_metadata#2530] json
   
   == Analyzed Logical Plan ==
   name: string, age: int, info: struct, 
_metadata: 
struct
   WriteToMicroBatchDataSourceV1 
FileSink[/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/new-streaming-data-join],
 77baa2ac-cc0b-4e01-94ff-ec20c98eb29b, 
[checkpointLocation=/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/checkpoint_join,
 
path=/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/new-streaming-data-join],
 Append, 0
   +- Project [name#2339, age#2340, info#2341, _metadata#2345]
  +- Join Inner, ((((name#2339 = name#2504) AND (age#2340 = age#2505)) AND (info#2341 = info#2506)) AND (_metadata#2345 = _metadata#2507))
 :- Project [name#2339, age#2340, info#2341, _metadata#2345]
 :  +- Project [_metadata#2345, name#2339, age#2340, info#2341, 
_metadata#2345]
 : +- Project [name#2517 AS name#2339, age#2518 AS age#2340, 
info#2519 AS info#2341, _metadata#2529 AS _metadata#2345]
 :+- Relation [name#2517,age#2518,info#2519,_metadata#2529] json
 +- Project [name#2504, age#2505, info#2506, _metadata#2507]
+- Project [_metadata#2507, name#2504, age#2505, info#2506, 
_metadata#2507]
   +- Project [name#2523 AS name#2504, age#2524 AS age#2505, 
info#2525 AS info#2506, _metadata#2530 AS _metadata#2507]
  +- Relation [name#2523,age#2524,info#2525,_metadata#2530] json
   
   == Optimized Logical Plan ==
   Project [name#2517, age#2518, info#2519, _metadata#2529]
   +- Join Inner, ((((name#2517 = name#2523) AND (age#2518 = age#2524)) AND (info#2519 = info#2525)) AND (_metadata#2529 = _metadata#2530))
  :- Filter (((isnotnull(name#2517) AND isnotnull(age#2518)) AND 
isnotnull(info#2519)) AND isnotnull(_metadata#2529))
  :  +- Relation [name#2517,age#2518,info#2519,_metadata#2529] json
  +- Filter (((isnotnull(name#2523) AND isnotnull(age#2524)) AND 
isnotnull(info#2525)) AND isnotnull(_metadata#2530))
 +- Relation [name#2523,age#2524,info#2525,_metadata#2530] json
   
   == Physical Plan ==
   *(3) Project [name#2517, age#2518, info#2519, _metadata#2529]
   +- StreamingSymmetricHashJoin [name#2517, age#2518, info#2519, 
_metadata#2529], [name#2523, age#2524, info#2525, _metadata#2530], Inner, 
condition = [ leftOnly = null, rightOnly = null, both = null, full = null ], 
state info [ checkpoint = 
file:/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/checkpoint_join/state,
 runId 

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

2022-11-20 Thread GitBox


zhengruifeng commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1027508681


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -203,6 +204,19 @@ message Sort {
   }
 }
 
+
+// Drop specified columns.
+message Drop {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) columns to drop.
+  //
+  // Should contain at least 1 item.
+  repeated Expression cols = 2;

Review Comment:
   This follows the naming in `def drop(col: Column, cols: Column*): DataFrame`.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pan3793 opened a new pull request, #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2022-11-20 Thread GitBox


pan3793 opened a new pull request, #38732:
URL: https://github.com/apache/spark/pull/38732

   
   
   ### What changes were proposed in this pull request?
   
   Fail the Spark application when executor failures reach a threshold.
   
   ### Why are the changes needed?
   
   
   Sometimes executors cannot launch successfully because of a wrong configuration, but on K8s the driver does not know that and just keeps requesting new executors.
   
   This adds functionality similar to YARN[1][2] to K8s.
   
   [1] [SPARK-7451](https://issues.apache.org/jira/browse/SPARK-7451)
   [2] [SPARK-6735](https://issues.apache.org/jira/browse/SPARK-6735)
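   
   For illustration only, a minimal sketch of how an application might opt in once this lands. The config name is taken from the diff in this PR and is not part of any released Spark version; the threshold value is arbitrary:
   
   ```python
   # Hypothetical usage sketch; the config key below is only a proposal in this PR
   # and may still be renamed during review.
   from pyspark.sql import SparkSession

   spark = (
       SparkSession.builder
       .appName("executor-failure-tracking-demo")
       # Fail the application once executor failures reach this threshold.
       .config("spark.kubernetes.executor.maxNumFailures", "10")
       .getOrCreate()
   )
   ```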
   
   ### Does this PR introduce _any_ user-facing change?
   
   New feature.
   
   ### How was this patch tested?
   
   UT will be added soon


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

2022-11-20 Thread GitBox


zhengruifeng commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1027507821


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -523,6 +524,19 @@ class SparkConnectPlanner(session: SparkSession) {
   sameOrderExpressions = Seq.empty)
   }
 
+  private def transformDrop(rel: proto.Drop): LogicalPlan = {
+assert(rel.getColsCount > 0, s"cols must contains at least 1 item!")
+
+val cols = rel.getColsList.asScala.toArray.map { expr =>
+  Column(transformExpression(expr))

Review Comment:
   Do you mean verification for the arrow-based collect? Since we will remove the JSON code path, it always fails if there are unsupported types.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

2022-11-20 Thread GitBox


zhengruifeng commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1027506855


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -255,10 +255,21 @@ def distinct(self) -> "DataFrame":
 )
 
 def drop(self, *cols: "ColumnOrString") -> "DataFrame":

Review Comment:
   IIUC, there will be two RPCs if we implement it on the client side:
   1. `all_cols = self.columns` to fetch the schema;
   2. build the plan.
   
   With a dedicated proto message, we only need one RPC.
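   
   To make the trade-off concrete, here is a rough, hypothetical sketch of the client-side variant; the method name and structure are illustrative only, not the implementation in this PR. It relies on the existing `columns` property and `select` of the Python client:
   
   ```python
   # Hypothetical client-side drop, shown only to illustrate the extra round trip.
   def drop_client_side(self, *cols: str) -> "DataFrame":
       to_drop = set(cols)
       # Round trip 1: `self.columns` has to ask the server for the schema.
       remaining = [c for c in self.columns if c not in to_drop]
       # The query itself is then sent as a plain Select, i.e. a second round trip
       # when the resulting DataFrame is evaluated.
       return self.select(*remaining)
   ```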



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

2022-11-20 Thread GitBox


zhengruifeng commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1027505888


##
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##
@@ -148,6 +148,23 @@ class SparkConnectProtoSuite extends PlanTest with 
SparkConnectPlanTest {
 comparePlans(connectPlan2, sparkPlan2)
   }
 
+  test("SPARK-41169: Test drop") {
+// single column
+val connectPlan = connectTestRelation.drop("id")
+val sparkPlan = sparkTestRelation.drop("id")
+comparePlans(connectPlan, sparkPlan)
+
+// all columns
+val connectPlan2 = connectTestRelation.drop("id", "name")
+val sparkPlan2 = sparkTestRelation.drop("id", "name")
+comparePlans(connectPlan2, sparkPlan2)
+
+// non-existing column

Review Comment:
   will add



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

2022-11-20 Thread GitBox


zhengruifeng commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1027505120


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -203,6 +204,19 @@ message Sort {
   }
 }
 
+
+// Drop specified columns.
+message Drop {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) columns to drop.
+  //
+  // Should contain at least 1 item.
+  repeated Expression cols = 2;

Review Comment:
   Dataset.drop takes arbitrary expressions into account 
https://github.com/apache/spark/blob/3b4faaf89ecd1b70d3d9b3c6096e1b70275670fb/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2952-L2957



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


HeartSaVioR commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027504882


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
+"between analyzed and executed", schema) { (df, _, _) =>

Review Comment:
   Looks like the indentation is consistent, at least in this test suite. Please check the other test cases using metadataColumnsTest - they are all 2 spaces. The second line is not a parameter but a continuation of the test name string.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wankunde commented on pull request #37922: [SPARK-40480][SHUFFLE] Remove push-based shuffle data after query finished

2022-11-20 Thread GitBox


wankunde commented on PR #37922:
URL: https://github.com/apache/spark/pull/37922#issuecomment-1321392369

   Hi, @mridulm , I've been working on some other issues recently. If @yabola  
can do all or part of this task in https://github.com/apache/spark/pull/38560, 
please go ahead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


dongjoon-hyun commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027504573


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##
@@ -275,8 +275,13 @@ object FileSourceStrategy extends Strategy with 
PredicateHelper with Logging {
 .get.withName(FileFormat.ROW_INDEX)
   }
 }
+// SPARK-41151: metadata column is not nullable for file sources,
+// [[CreateNamedStruct]] is also not nullable.

Review Comment:
   One more thing: This is misleading because this is not true in general.
   > [[CreateNamedStruct]] is also not nullable.
   
   To be clear in this context, it would be better to mention 
`CreateStruct(structColumns)` explicitly instead of saying 
`[[CreateNamedStruct]] is also not nullable`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


dongjoon-hyun commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027503203


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala:
##
@@ -464,11 +464,13 @@ object FileSourceMetadataAttribute {
 
   val FILE_SOURCE_METADATA_COL_ATTR_KEY = "__file_source_metadata_col"
 
-  def apply(name: String, dataType: DataType, nullable: Boolean = true): 
AttributeReference =
-AttributeReference(name, dataType, nullable,
+  def apply(name: String, dataType: DataType): AttributeReference = {
+// Metadata column for file sources is always not nullable.
+AttributeReference(name, dataType, nullable = false,
   new MetadataBuilder()
 .putBoolean(METADATA_COL_ATTR_KEY, value = true)
 .putBoolean(FILE_SOURCE_METADATA_COL_ATTR_KEY, value = true).build())()
+  }

Review Comment:
   Do we need to add this `{}` additionally?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent

2022-11-20 Thread GitBox


dongjoon-hyun commented on code in PR #38683:
URL: https://github.com/apache/spark/pull/38683#discussion_r1027500899


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  metadataColumnsTest("SPARK-41151: consistent _metadata nullability " +
+"between analyzed and executed", schema) { (df, _, _) =>

Review Comment:
   Indentation? We need two more spaces.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on pull request #38731: [SPARK-41209] Improve PySpark type inference in _merge_type method

2022-11-20 Thread GitBox


sadikovi commented on PR #38731:
URL: https://github.com/apache/spark/pull/38731#issuecomment-1321374183

   @HyukjinKwon @xinrong-meng Could you review this PR? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi opened a new pull request, #38731: [SPARK-41209] Improve PySpark type inference in _merge_type method

2022-11-20 Thread GitBox


sadikovi opened a new pull request, #38731:
URL: https://github.com/apache/spark/pull/38731

   
   
   ### What changes were proposed in this pull request?
   
   
   This PR updates the `_merge_type` method to allow upcasting from any `AtomicType` to `StringType`, similar to Cast.scala (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L297).
   
   This allows us to avoid errors in cases where it is okay to infer the type, for example:
   ```python
   self.spark.createDataFrame([[1.33, 1], ["2.1", 1]])
   ```
   
   However, there is an interesting side-effect that I was debating while 
working on this PR: 
   ```python
   spark.createDataFrame([({1: True}, 1), ({"2": False}, 3)])
   
   # Result:
   # {"1": true}  1
   # {"2": false}  3
   ```
   
   It seems to be okay to merge map keys with different types but I would like 
to call it out explicitly.
   
   I also removed `NullType` from the list of atomic types because it is neither an atomic nor a complex type.
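   
   As a rough illustration of the relaxed rule, here is a simplified standalone sketch; it is not the actual patch to `pyspark.sql.types._merge_type`, which also has to handle structs, arrays and maps:
   
   ```python
   # Simplified sketch of the proposed merge rule for two conflicting types.
   from pyspark.sql.types import AtomicType, DataType, NullType, StringType

   def merge_atomic(a: DataType, b: DataType) -> DataType:
       if isinstance(a, NullType):
           return b
       if isinstance(b, NullType):
           return a
       if type(a) == type(b):
           return a
       # Any two distinct atomic types can be reconciled as strings,
       # mirroring the upcast allowed by Cast.scala.
       if isinstance(a, AtomicType) and isinstance(b, AtomicType):
           return StringType()
       raise TypeError(f"Cannot merge type {a} and {b}")
   ```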
   
   ### Why are the changes needed?
   
   
   This makes the behaviour between PySpark and Arrow execution more consistent. For example, Arrow can handle type upcasts while PySpark cannot.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes, users may notice that examples which previously failed during inference now work, which constitutes a change in behaviour between Spark 3.3 and Spark 3.4.
   
   ### How was this patch tested?
   
   
   I updated the existing unit tests and added a couple of new ones to check 
that we can upcast to StringType.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun opened a new pull request, #38730: [SPARK-41181][SQL] Migrate the map options errors onto error classes

2022-11-20 Thread GitBox


panbingkun opened a new pull request, #38730:
URL: https://github.com/apache/spark/pull/38730

   ### What changes were proposed in this pull request?
   The PR aims to migrate the map options errors onto error classes.
   
   ### Why are the changes needed?
   The changes improve the error framework.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Pass GA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mcdull-zhang commented on pull request #38703: [SPARK-41191] [SQL] Cache Table is not working while nested caches exist

2022-11-20 Thread GitBox


mcdull-zhang commented on PR #38703:
URL: https://github.com/apache/spark/pull/38703#issuecomment-1321371779

   ping @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng opened a new pull request, #38729: [CONNECT][INFRA] Update protobuf versions in CI

2022-11-20 Thread GitBox


zhengruifeng opened a new pull request, #38729:
URL: https://github.com/apache/spark/pull/38729

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultC

2022-11-20 Thread GitBox


LuciferYang commented on code in PR #38704:
URL: https://github.com/apache/spark/pull/38704#discussion_r1027465227


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##
@@ -2251,7 +2251,11 @@ class DatasetLargeResultCollectingSuite extends QueryTest
   with SharedSparkSession {
 
   override protected def sparkConf: SparkConf = 
super.sparkConf.set(MAX_RESULT_SIZE.key, "4g")
-  test("collect data with single partition larger than 2GB bytes array limit") 
{
+  // SPARK-41193: Ignore this suite because it cannot run successfully with 
Spark
+  // default Java Options, if user need do local test, please make the 
following changes:
+  // - Maven test: change `-Xmx4g` of `scalatest-maven-plugin` in 
`sql/core/pom.xml` to `-Xmx10g`
+  // - SBT test: change `-Xmx4g` of `Test / javaOptions` in `SparkBuild.scala` 
to `-Xmx10g`
+  ignore("collect data with single partition larger than 2GB bytes array 
limit") {

Review Comment:
   So how do we move forward? This is a blocker for developers.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python

2022-11-20 Thread GitBox


zhengruifeng commented on PR #38718:
URL: https://github.com/apache/spark/pull/38718#issuecomment-1321369830

   it seems that the versions in `build_and_test` are not updated?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`

2022-11-20 Thread GitBox


itholic commented on code in PR #38644:
URL: https://github.com/apache/spark/pull/38644#discussion_r1027410164


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala:
##
@@ -244,7 +244,7 @@ class CastWithAnsiOnSuite extends CastSuiteBase with 
QueryErrorsBase {
   Decimal("12345678901234567890123456789012345678"))
 checkExceptionInExpression[ArithmeticException](
   cast("123456789012345678901234567890123456789", DecimalType(38, 0)),
-  "Out of decimal type range")
+  "NUMERIC_OUT_OF_SUPPORTED_RANGE")

Review Comment:
   Yes, just created JIRA here: 
https://issues.apache.org/jira/browse/SPARK-41208



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #38728: [SPARK-41204] [CONNECT] Migrate custom exceptions to use Spark exceptions

2022-11-20 Thread GitBox


itholic commented on code in PR #38728:
URL: https://github.com/apache/spark/pull/38728#discussion_r1027408511


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -45,12 +46,18 @@ import org.apache.spark.util.Utils
 final case class InvalidPlanInput(
 private val message: String = "",
 private val cause: Throwable = None.orNull)
-extends Exception(message, cause)
+extends SparkException(
+  errorClass = "CONNECT.INVALID_PLAN_INPUT",
+  messageParameters = Map("msg" -> message),
+  cause = cause)
 
 final case class InvalidCommandInput(
 private val message: String = "",
 private val cause: Throwable = null)
-extends Exception(message, cause)
+extends SparkException(

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #38728: [SPARK-41204] [CONNECT] Migrate custom exceptions to use Spark exceptions

2022-11-20 Thread GitBox


itholic commented on code in PR #38728:
URL: https://github.com/apache/spark/pull/38728#discussion_r1027408367


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -45,12 +46,18 @@ import org.apache.spark.util.Utils
 final case class InvalidPlanInput(
 private val message: String = "",
 private val cause: Throwable = None.orNull)
-extends Exception(message, cause)
+extends SparkException(

Review Comment:
   Can we have a test for the exception?
   
   If one already exists, let's convert the existing test to use `checkError` to check that the error class and parameters are generated properly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python

2022-11-20 Thread GitBox


zhengruifeng commented on PR #38718:
URL: https://github.com/apache/spark/pull/38718#issuecomment-1321307266

   late lgtm


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on a diff in pull request #38725: [SPARK-41182][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1102

2022-11-20 Thread GitBox


panbingkun commented on code in PR #38725:
URL: https://github.com/apache/spark/pull/38725#discussion_r1027386994


##
core/src/main/resources/error/error-classes.json:
##
@@ -1326,6 +1326,11 @@
   "grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup"
 ]
   },
+  "UNSUPPORTED_LITERAL_FOR_SOURCE_TYPE" : {
+"message" : [
+  "Literals of the type  are currently not supported for the 
 type."

Review Comment:
   OK, let me do it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38728: [SPARK-41204] [CONNECT] Migrate custom exceptions to use Spark exceptions

2022-11-20 Thread GitBox


HyukjinKwon commented on code in PR #38728:
URL: https://github.com/apache/spark/pull/38728#discussion_r1027385407


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -45,12 +46,18 @@ import org.apache.spark.util.Utils
 final case class InvalidPlanInput(
 private val message: String = "",
 private val cause: Throwable = None.orNull)
-extends Exception(message, cause)
+extends SparkException(

Review Comment:
   cc @MaxGekk @itholic HELP!!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #38726: [SPARK-41203] [CONNECT] Support Dataframe.tansform in Python client.

2022-11-20 Thread GitBox


HyukjinKwon closed pull request #38726: [SPARK-41203] [CONNECT] Support 
Dataframe.tansform in Python client.
URL: https://github.com/apache/spark/pull/38726


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #38726: [SPARK-41203] [CONNECT] Support Dataframe.tansform in Python client.

2022-11-20 Thread GitBox


HyukjinKwon commented on PR #38726:
URL: https://github.com/apache/spark/pull/38726#issuecomment-1321296180

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #36070: [SPARK-31675][CORE] Fix rename and delete files with different filesystem

2022-11-20 Thread GitBox


github-actions[bot] closed pull request #36070: [SPARK-31675][CORE] Fix rename 
and delete files with different filesystem
URL: https://github.com/apache/spark/pull/36070


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #36443: [POC][WIP][SPARK-39088][CORE] Add a "live" driver link to the UI for history server when serving in-progress applications.

2022-11-20 Thread GitBox


github-actions[bot] commented on PR #36443:
URL: https://github.com/apache/spark/pull/36443#issuecomment-1321290211

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #36908: [SPARK-39510][SQL][WIP] Leverage the natural partitioning and ordering of MonotonicallyIncreasingID

2022-11-20 Thread GitBox


github-actions[bot] closed pull request #36908: [SPARK-39510][SQL][WIP] 
Leverage the natural partitioning and ordering of MonotonicallyIncreasingID
URL: https://github.com/apache/spark/pull/36908


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on pull request #38687: [SPARK-41154][SQL] Incorrect relation caching for queries with time travel spec

2022-11-20 Thread GitBox


huaxingao commented on PR #38687:
URL: https://github.com/apache/spark/pull/38687#issuecomment-1321287120

   LGTM. Thanks for fixing this! @ulysses-you 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HuwCampbell commented on a diff in pull request #36441: [SPARK-39091][SQL] Updating specific SQL Expression traits that don't compose when multiple are extended due to nodePatterns be

2022-11-20 Thread GitBox


HuwCampbell commented on code in PR #36441:
URL: https://github.com/apache/spark/pull/36441#discussion_r973814236


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala:
##
@@ -839,6 +841,52 @@ class HigherOrderFunctionsSuite extends SparkFunSuite with 
ExpressionEvalHelper
   Seq(1d, 2d, Double.NaN, null))
   }
 
+  test("SPARK-39081: compatibility of combo of HigherOrderFunctions" +
+" with other Expression subclasses") {
+// Dummy example given in JIRA, this is to test a compile time issue only, 
has no real usage
+case class MyExploder(
+   arrays: Expression,// Array[AnyDataType]
+   asOfDate: Expression,  // 
LambdaFunction[AnyDataType -> TimestampType]
+   extractor: Expression) // TimestampType
+  extends HigherOrderFunction with Generator with TimeZoneAwareExpression {
+
+  override def children: Seq[Expression] = Seq[Expression](arrays, 
asOfDate)
+
+  override def withNewChildrenInternal(newChildren: 
IndexedSeq[Expression]): Expression = null
+
+  override def eval(input: InternalRow): TraversableOnce[InternalRow] = 
null
+
+  override def nodePatternsInternal(): Seq[TreePattern] = Seq()
+
+  override def elementSchema: StructType = null
+
+  override def arguments: Seq[Expression] =
+Seq(arrays, asOfDate)
+
+  override def argumentTypes: Seq[AbstractDataType] =
+Seq(ArrayType, TimestampType)
+
+  override def doGenCode(ctx: CodegenContext, exp: ExprCode): ExprCode = 
null
+
+  override def functions: Seq[Expression] =
+Seq(extractor)
+
+  override def functionTypes: Seq[TimestampType] =
+Seq(TimestampType)
+
+  override def bind(f: (Expression,
+Seq[(DataType, Boolean)]) => LambdaFunction): HigherOrderFunction = 
null
+
+  override def timeZoneId: Option[String] = None
+
+  override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression = 
null
+}
+
+/* Should not get the error - value nodePatterns
+ ... cannot override final member, or conflict in nodePatterns between two 
types

Review Comment:
   Yeah, I think you're pretty much dead on here: the correctness of the optimiser will be up to the user to enforce, as they will need to ensure that all node patterns are accounted for.
   
   But, and it's a big one, it makes it possible to compose these interfaces, even with the extra challenges... which is just _not_ possible as of Spark 3.2.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #38714: [WIP][SPARK-41141]. avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it

2022-11-20 Thread GitBox


AmplabJenkins commented on PR #38714:
URL: https://github.com/apache/spark/pull/38714#issuecomment-1321252021

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR closed pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source

2022-11-20 Thread GitBox


HeartSaVioR closed pull request #38717: [SPARK-41198][SS] Fix metrics in 
streaming query having CTE and DSv1 streaming source
URL: https://github.com/apache/spark/pull/38717


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source

2022-11-20 Thread GitBox


HeartSaVioR commented on PR #38717:
URL: https://github.com/apache/spark/pull/38717#issuecomment-1321250486

   Thanks! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source

2022-11-20 Thread GitBox


HeartSaVioR commented on PR #38717:
URL: https://github.com/apache/spark/pull/38717#issuecomment-1321250199

   Thanks for understanding. Let's go with the no-risk fix for now, and take more time to think about the holistic fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau commented on pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`

2022-11-20 Thread GitBox


srielau commented on PR #38685:
URL: https://github.com/apache/spark/pull/38685#issuecomment-1321215985

   > > But is COLUMN_ALREADY_EXISTS the best choice for CREATE TABLE or WITH 
cte(c1, c1) AS?
   > > How about AS T(c1, c1)
   > 
   > @srielau I assumed that we will provide a query context which should point out the problematic part.
   
   Sure, but will every tool look at it? By that token we don't need most of 
the payload for various errors.
   Either way, that is not the main point. The main question is whether we should have a distinct error message for duplicate identifiers in "constructors". We apparently allow duplicate attribute names in structs(?).
   
   Should this say: MAP_KEY_ALREADY_EXISTS
   spark-sql> select map('a', 5, 'a', 6);
   Duplicate map key a was found
   
   I just checked CTE and table alias. Neither enforces unique names, which is curious.
   
   So I suppose the question boils down to CREATE TABLE and CREATE VIEW.
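   
   For reference, a concrete shape of the case in question, assuming an existing `spark` session; which error class this should raise is exactly what is being discussed here, so no specific message is implied:
   
   ```python
   # Duplicate name inside a brand-new column list, as opposed to ALTER TABLE
   # adding a column that already exists.
   spark.sql("CREATE TABLE t1 (c1 INT, c1 INT) USING parquet")
   ```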
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`

2022-11-20 Thread GitBox


MaxGekk commented on PR #38685:
URL: https://github.com/apache/spark/pull/38685#issuecomment-1321196320

   > But is COLUMN_ALREADY_EXISTS the best choice for CREATE TABLE or WITH 
cte(c1, c1) AS?
   > How about AS T(c1, c1)
   
   @srielau I assumed that we will provide a query context which should point out the problematic part.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau commented on pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`

2022-11-20 Thread GitBox


srielau commented on PR #38685:
URL: https://github.com/apache/spark/pull/38685#issuecomment-1321188144

   I've some doubts about COLUMN_ALREADY_EXISTS when a column is duplicated 
within a new list.
   I.e. it makes a lot of sense for ALTER TABLE ADD COLUMN.
   But is COLUMN_ALREADY_EXISTS the best choice for CREATE TABLE or WITH  
cte(c1, c1) AS?
   How about AS T(c1, c1)
   
   I think we would like DUPLICATE_COLUMN_NAME for these cases, and also include the table name.
 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY

2022-11-20 Thread GitBox


AmplabJenkins commented on PR #38722:
URL: https://github.com/apache/spark/pull/38722#issuecomment-1321183990

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client

2022-11-20 Thread GitBox


AmplabJenkins commented on PR #38723:
URL: https://github.com/apache/spark/pull/38723#issuecomment-1321183964

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #38725: [SPARK-41182][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1102

2022-11-20 Thread GitBox


AmplabJenkins commented on PR #38725:
URL: https://github.com/apache/spark/pull/38725#issuecomment-1321183950

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #38726: [SPARK-41203] [CONNECT] Support Dataframe.tansform in Python client.

2022-11-20 Thread GitBox


AmplabJenkins commented on PR #38726:
URL: https://github.com/apache/spark/pull/38726#issuecomment-1321183941

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #38725: [SPARK-41182][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1102

2022-11-20 Thread GitBox


MaxGekk commented on code in PR #38725:
URL: https://github.com/apache/spark/pull/38725#discussion_r1027319785


##
core/src/main/resources/error/error-classes.json:
##
@@ -1326,6 +1326,11 @@
   "grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup"
 ]
   },
+  "UNSUPPORTED_LITERAL_FOR_SOURCE_TYPE" : {
+"message" : [
+  "Literals of the type  are currently not supported for the 
 type."

Review Comment:
   Looking at where the error comes from:
   1. 
https://github.com/apache/spark/blob/51e4c2cc55aa01f07b28b1cd807b553f8729075d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/intervalExpressions.scala#L159
   2. 
https://github.com/apache/spark/blob/b14da8b1b65d9f00f49fab87f738715089bc43e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2795
   
   , the error message and the error class name are slightly confusing. The error is mostly about the extract field. How about the error class name:
   `INVALID_EXTRACT_FIELD`
   Also cc @srielau



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`

2022-11-20 Thread GitBox


MaxGekk commented on PR #38685:
URL: https://github.com/apache/spark/pull/38685#issuecomment-1321178694

   @srielau @LuciferYang @panbingkun @itholic @cloud-fan Could you review this 
PR, please.





[GitHub] [spark] AmplabJenkins commented on pull request #38728: [SPARK-41204] [CONNECT] Migrate custom exceptions to use Spark exceptions

2022-11-20 Thread GitBox


AmplabJenkins commented on PR #38728:
URL: https://github.com/apache/spark/pull/38728#issuecomment-1321155240

   Can one of the admins verify this patch?





[GitHub] [spark] grundprinzip opened a new pull request, #38728: [SPARK-41204] [CONNECT] Migrate custom exceptions to use Spark exceptions

2022-11-20 Thread GitBox


grundprinzip opened a new pull request, #38728:
URL: https://github.com/apache/spark/pull/38728

   ### What changes were proposed in this pull request?
   Migrate existing custom exceptions in Spark Connect to use the proper Spark 
exceptions.
   
   ### Why are the changes needed?
   Consistency
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   existing UT
   





[GitHub] [spark] grundprinzip commented on a diff in pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client

2022-11-20 Thread GitBox


grundprinzip commented on code in PR #38723:
URL: https://github.com/apache/spark/pull/38723#discussion_r1027288685


##
python/pyspark/sql/connect/column.py:
##
@@ -263,6 +263,22 @@ def __str__(self) -> str:
 return f"Column({self._unparsed_identifier})"
 
 
+class SQLExpression(Expression):
+"""Returns Expression which contains a string which is a SQL expression
+and server side will parse it by Catalyst
+"""
+
+def __init__(self, expr: str) -> None:
+super().__init__()

Review Comment:
   should we assert here that `expr` is a string and not another expression?
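   A minimal sketch of the suggested guard, assuming the constructor shape shown in 
the diff; the `Expression` stub and the assertion message are hypothetical 
stand-ins, not the actual implementation.

   ```
   class Expression:  # stand-in for pyspark.sql.connect.column.Expression
       pass


   class SQLExpression(Expression):
       def __init__(self, expr: str) -> None:
           # Fail fast if a Column/Expression object is passed instead of a SQL string.
           assert isinstance(expr, str), f"expr must be str, got {type(expr).__name__}"
           super().__init__()
           self._expr = expr
   ```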






[GitHub] [spark] grundprinzip commented on a diff in pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client

2022-11-20 Thread GitBox


grundprinzip commented on code in PR #38723:
URL: https://github.com/apache/spark/pull/38723#discussion_r1027288544


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -124,6 +125,29 @@ def withPlan(cls, plan: plan.LogicalPlan, session: 
"RemoteSparkSession") -> "Dat
 def select(self, *cols: "ExpressionOrString") -> "DataFrame":
 return DataFrame.withPlan(plan.Project(self._plan, *cols), 
session=self._session)
 
+def selectExpr(self, *expr: Union[str, List[str]]) -> "DataFrame":
+"""Projects a set of SQL expressions and returns a new 
:class:`DataFrame`.
+
+This is a variant of :func:`select` that accepts SQL expressions.
+
+.. versionadded:: 3.4.0
+
+Returns
+---
+:class:`DataFrame`
+A DataFrame with new/old columns transformed by expressions.
+"""
+sql_expr = []
+if len(expr) == 1 and isinstance(expr[0], list):
+expr = expr[0]  # type: ignore[assignment]
+for element in expr:
+if isinstance(element, str):
+sql_expr.append(SQLExpression(element))
+else:
+sql_expr.extend([SQLExpression(e) for e in element])
+
+return DataFrame.withPlan(plan.Project(self._plan, *sql_expr), 
session=self._session)

Review Comment:
   This was not quite the question I had, but I had misremembered the type 
interface of `Project`:
   
   ```
   def __init__(self, child: Optional["LogicalPlan"], *columns: "ExpressionOrString") -> None:
   ```
   
   In this case `SQLExpression` is an expression, so it just works.
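   For readers following the thread, a short usage sketch of the `selectExpr` 
variant under review; the DataFrame `df` and its columns (`age`, `name`) are 
assumptions for illustration.

   ```
   # Each string is wrapped in a SQLExpression and parsed server-side by Catalyst.
   projected = df.selectExpr("age * 2 AS double_age", "upper(name) AS name_upper")

   # A single list argument is also accepted by the implementation above.
   projected2 = df.selectExpr(["age * 2 AS double_age", "upper(name)"])
   ```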






[GitHub] [spark] grundprinzip commented on a diff in pull request #38726: [SPARK-41203] [CONNECT] Support Dataframe.transform in Python client.

2022-11-20 Thread GitBox


grundprinzip commented on code in PR #38726:
URL: https://github.com/apache/spark/pull/38726#discussion_r1027276398


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -756,6 +757,62 @@ def schema(self) -> StructType:
 else:
 return self._schema
 
+def transform(self, func: Callable[..., "DataFrame"], *args: Any, 
**kwargs: Any) -> "DataFrame":
+"""Returns a new :class:`DataFrame`. Concise syntax for chaining 
custom transformations.
+
+.. versionadded:: 3.0.0

Review Comment:
   fixed






[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source

2022-11-20 Thread GitBox


HeartSaVioR commented on code in PR #38717:
URL: https://github.com/apache/spark/pull/38717#discussion_r1027251716


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala:
##
@@ -341,7 +355,13 @@ trait ProgressReporter extends Logging {
   val logicalPlanLeafToSource = newData.flatMap { case (source, 
logicalPlan) =>
 logicalPlan.collectLeaves().map { leaf => leaf -> source }
   }
-  val allLogicalPlanLeaves = lastExecution.logical.collectLeaves() // 
includes non-streaming

Review Comment:
   I admit I'm not 100% sure of the historical reason for picking the logical plan 
rather than the optimized plan. My guess is that it lets us match the "logical 
node in newData" against the "leaf node in the logical plan", since the optimizer 
may change leaf nodes.

   That said, this approach is best effort and will never be a perfect one. Say, if 
the optimizer rewrites a self-union into an aggregation, the optimized plan will 
have one less leaf node, which breaks the mechanism. If the optimizer swaps 
subtrees, which ends up swapping leaf nodes, it could lead to incorrect metrics.

   Ideally we should either 1) move all streaming sources to DSv2 or 2) have a 
dedicated logical and physical node for streaming DSv1 sources, but neither seems 
easy to achieve.

   Another possible idea might be assigning a UUID in the node tag for association 
and retaining the tag even when the optimizer applies rules. If the node tag could 
be propagated to the physical plan, even better. (If that were feasible, we could 
simply collect the nodes carrying the tag from the executed plan and match them 
with their sources.)

   I could probably explore the idea, but it would take time, and I'm not sure 
whether it is feasible. Do you think the idea makes sense, or does it go against 
how Spark's optimization/physical rules work?






[GitHub] [spark] HyukjinKwon closed pull request #38708: [SPARK-41194][PROTOBUF][TESTS] Add `log4j2.properties` configuration file for `protobuf` module testing

2022-11-20 Thread GitBox


HyukjinKwon closed pull request #38708: [SPARK-41194][PROTOBUF][TESTS] Add 
`log4j2.properties` configuration file for `protobuf` module testing
URL: https://github.com/apache/spark/pull/38708





[GitHub] [spark] HyukjinKwon commented on pull request #38708: [SPARK-41194][PROTOBUF][TESTS] Add `log4j2.properties` configuration file for `protobuf` module testing

2022-11-20 Thread GitBox


HyukjinKwon commented on PR #38708:
URL: https://github.com/apache/spark/pull/38708#issuecomment-1321075055

   Thanks @LuciferYang 





[GitHub] [spark] HyukjinKwon commented on pull request #38708: [SPARK-41194][PROTOBUF][TESTS] Add `log4j2.properties` configuration file for `protobuf` module testing

2022-11-20 Thread GitBox


HyukjinKwon commented on PR #38708:
URL: https://github.com/apache/spark/pull/38708#issuecomment-1321075019

   Merged to master.





[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source

2022-11-20 Thread GitBox


HeartSaVioR commented on code in PR #38717:
URL: https://github.com/apache/spark/pull/38717#discussion_r1027251960


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala:
##
@@ -341,7 +355,13 @@ trait ProgressReporter extends Logging {
   val logicalPlanLeafToSource = newData.flatMap { case (source, 
logicalPlan) =>
 logicalPlan.collectLeaves().map { leaf => leaf -> source }
   }
-  val allLogicalPlanLeaves = lastExecution.logical.collectLeaves() // 
includes non-streaming

Review Comment:
   For me, chasing the metrics issue in general is "beyond the scope" as of now, 
although it is something we should deal with eventually, sooner rather than later.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38726: [SPARK-41203] [CONNECT] Support Dataframe.transform in Python client.

2022-11-20 Thread GitBox


HyukjinKwon commented on code in PR #38726:
URL: https://github.com/apache/spark/pull/38726#discussion_r1027251817


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -756,6 +757,62 @@ def schema(self) -> StructType:
 else:
 return self._schema
 
+def transform(self, func: Callable[..., "DataFrame"], *args: Any, 
**kwargs: Any) -> "DataFrame":
+"""Returns a new :class:`DataFrame`. Concise syntax for chaining 
custom transformations.
+
+.. versionadded:: 3.0.0
+
+Parameters
+--
+func : function
+a function that takes and returns a :class:`DataFrame`.
+*args
+Positional arguments to pass to func.
+
+.. versionadded:: 3.3.0

Review Comment:
   ```suggestion
   ```



##
python/pyspark/sql/connect/dataframe.py:
##
@@ -756,6 +757,62 @@ def schema(self) -> StructType:
 else:
 return self._schema
 
+def transform(self, func: Callable[..., "DataFrame"], *args: Any, 
**kwargs: Any) -> "DataFrame":
+"""Returns a new :class:`DataFrame`. Concise syntax for chaining 
custom transformations.
+
+.. versionadded:: 3.0.0
+
+Parameters
+--
+func : function
+a function that takes and returns a :class:`DataFrame`.
+*args
+Positional arguments to pass to func.
+
+.. versionadded:: 3.3.0
+**kwargs
+Keyword arguments to pass to func.
+
+.. versionadded:: 3.3.0

Review Comment:
   ```suggestion
   ```



##
python/pyspark/sql/connect/dataframe.py:
##
@@ -756,6 +757,62 @@ def schema(self) -> StructType:
 else:
 return self._schema
 
+def transform(self, func: Callable[..., "DataFrame"], *args: Any, 
**kwargs: Any) -> "DataFrame":
+"""Returns a new :class:`DataFrame`. Concise syntax for chaining 
custom transformations.
+
+.. versionadded:: 3.0.0

Review Comment:
   ```suggestion
   
   .. versionadded:: 3.3.0
   ```
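   For context, a short usage sketch of the chaining style the `transform` docstring 
above describes, written against the regular PySpark DataFrame API; the DataFrame 
`df`, its `price` column, and the helper functions are assumptions for illustration.

   ```
   # Hedged sketch of chaining custom transformations with DataFrame.transform.
   def cast_price(df):
       return df.withColumn("price", df.price.cast("double"))

   def apply_discount(df, rate):
       # Extra positional/keyword arguments are forwarded by transform().
       return df.withColumn("discounted", df.price * (1 - rate))

   result = df.transform(cast_price).transform(apply_discount, 0.1)
   ```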






[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source

2022-11-20 Thread GitBox


HeartSaVioR commented on code in PR #38717:
URL: https://github.com/apache/spark/pull/38717#discussion_r1027251716


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala:
##
@@ -341,7 +355,13 @@ trait ProgressReporter extends Logging {
   val logicalPlanLeafToSource = newData.flatMap { case (source, 
logicalPlan) =>
 logicalPlan.collectLeaves().map { leaf => leaf -> source }
   }
-  val allLogicalPlanLeaves = lastExecution.logical.collectLeaves() // 
includes non-streaming

Review Comment:
   I admit I'm not 100% sure of the historical reason for picking the logical plan 
rather than the optimized plan. My guess is that it lets us match the "logical 
node in newData" against the "leaf node in the logical plan", since the optimizer 
may change leaf nodes.

   That said, this approach is best effort and will never be a perfect one. Say, if 
the optimizer rewrites a self-union into an aggregation, the optimized plan will 
have one less leaf node, which breaks the mechanism.

   Ideally we should either 1) move all streaming sources to DSv2 or 2) have a 
dedicated logical and physical node for streaming DSv1 sources, but neither seems 
easy to achieve.

   Another possible idea might be assigning a UUID in the node tag for association 
and retaining the tag even when the optimizer applies rules. If the node tag could 
be propagated to the physical plan, even better. (If that were feasible, we could 
simply collect the nodes carrying the tag from the executed plan and match them 
with their sources.)

   I could probably explore the idea, but it would take time, and I'm not sure 
whether it is feasible. Do you think the idea makes sense, or does it go against 
how Spark's optimization/physical rules work?





