[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22573 **[Test build #97805 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97805/testReport)** for PR 22573 at commit [`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22573 **[Test build #97765 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97765/testReport)** for PR 22573 at commit [`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22573 **[Test build #97748 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97748/testReport)** for PR 22573 at commit [`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22573 **[Test build #97765 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97765/testReport)** for PR 22573 at commit [`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525).
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22573 **[Test build #97748 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97748/testReport)** for PR 22573 at commit [`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525).
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4226/ Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Merged build finished. Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/22573

@dongjoon-hyun, Iceberg schema evolution is based on field IDs, not on names. The current table schema's names are the runtime names for columns in that table, and all reads happen by first translating those names to IDs and then projecting the IDs from the data files. That way, renames can never cause you to get incorrect data.

You're mostly right that Spark has a problem with schema evolution for HadoopFS tables. That wouldn't affect my suggestion here, though. If you're filtering or projecting field `m.n`, then Spark currently handles that by matching columns by name. If you're matching by name, then `m.n` can't change across versions, or at least you can always project `m.n` from the data (in the case of Avro).
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22573 Thank you, @rdblue. BTW, in general, indexing might be unsafe in Apache Spark when the Metastore schema differs from the file schema. Does this approach assume the schema evolution feature in Iceberg?
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/22573

The approach we've taken in Iceberg is to allow `.` in names by using an index in the top-level schema. The full path of every leaf in the schema is produced and added to a map from the full field name to the field's ID. The reason why we do this is to avoid problem areas:

* Parsing the name using `.` as a delimiter
* Traversing the schema structure

For example, the schema `0: a struct<2: x int, 3: y int>, 1: a.z int` produces this index: `Map("a" -> 0, "a.x" -> 2, "a.y" -> 3, "a.z" -> 1)`.

Binding filters like `a.x > 3` or `a.z < 5` is done using the index instead of parsing the field name and traversing, so you get the right result without needing to decide whether "a.x" is nested or if it is the actual name. So the lookup is quick and correctly produces `id(2) > 3` and `id(1) < 5`. This is also used for projection because users want to be able to select nested columns by name using dotted field names.

The only drawback to this approach is that you can't have duplicates in the index: each full field name must be unique. In the example above, the top-level `a.z` field could not be named `a.x` or else it would collide with `x` nested in `a`.
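The index described above can be sketched in a few lines of Scala. This is only an illustration under stated assumptions: `Field`, `Struct`, `IntType`, and `NameIndex` are hypothetical names invented here, not Iceberg's actual classes, and the sketch indexes struct fields as well as leaves so that it reproduces the map shown in the comment.

```scala
// Hypothetical, simplified schema model for illustration; not Iceberg's actual classes.
sealed trait FieldType
final case class Field(id: Int, name: String, fieldType: FieldType)
case object IntType extends FieldType
final case class Struct(fields: List[Field]) extends FieldType

object NameIndex {
  /** Maps every full field name (parent names joined with '.') to its field ID. */
  def build(fields: List[Field], prefix: String = ""): Map[String, Int] =
    fields.foldLeft(Map.empty[String, Int]) { (index, field) =>
      val fullName = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
      // Recurse into structs so nested fields are indexed under their full path.
      val nested = field.fieldType match {
        case Struct(children) => build(children, fullName)
        case _                => Map.empty[String, Int]
      }
      // Merge into the running index, rejecting duplicate full names.
      (nested + (fullName -> field.id)).foldLeft(index) { case (acc, (name, id)) =>
        require(!acc.contains(name), s"Duplicate full field name: $name")
        acc + (name -> id)
      }
    }
}

object NameIndexExample {
  // The schema from the comment: 0: a struct<2: x int, 3: y int>, 1: a.z int
  val fields: List[Field] = List(
    Field(0, "a", Struct(List(Field(2, "x", IntType), Field(3, "y", IntType)))),
    Field(1, "a.z", IntType)
  )
  val index: Map[String, Int] = NameIndex.build(fields)

  def main(args: Array[String]): Unit = {
    // Binding `a.x > 3` and `a.z < 5` is a plain map lookup; no name parsing is involved.
    println(s"a.x binds to id(${index("a.x")})")
    println(s"a.z binds to id(${index("a.z")})")
  }
}
```

Renaming the top-level `a.z` field to `a.x` in this sketch makes `build` fail with an `IllegalArgumentException`, which mirrors the uniqueness constraint described in the comment.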
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22573 That's great!
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22573 Updating `Filter` APIs sounds reasonable to me. This should be part of our data source API v2. cc @cloud-fan @rxin @rdblue
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22573 Can we update the public `Filter` API in Spark 3.0.0, @cloud-fan and @gatorsmile?
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22573

I was thinking of changing the APIs in `Filter` so we can represent nested fields more easily, but then realized that it's a stable public interface. Without changing the interface of `Filter`, we have the following two options:

1. Use backticks to wrap column names and struct field names that contain dots. For example,
   ```scala
   `column.1`.`attribute.b`
   ```
   This is also easier for people to understand when they read the pushdown plans in text format.
2. Alternatively, use an ASCII control character as the delimiter to avoid delimiter collision. For example, the unit separator (decimal 31, `\u001F`) is commonly used between fields of a record, or members of a row. This simplifies parsing significantly, but the downside is that it's not readable, so when we print the plan, we need to add the backticks for visualization.

What do you think?
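To make the two options concrete, here is a minimal sketch of both encodings. Everything in it is an assumption made for illustration: `NestedName` and its methods are hypothetical and not part of Spark, and escaping an embedded backtick by doubling it is one possible rule that the proposal itself does not specify.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper for illustration; not part of Spark's API.
object NestedName {
  /** Quotes one name part, doubling embedded backticks (assumed escaping rule). */
  def quote(part: String): String = "`" + part.replace("`", "``") + "`"

  /** Option 1: backtick-quoted parts joined with dots. */
  def encode(parts: Seq[String]): String = parts.map(quote).mkString(".")

  /** Parses a string produced by `encode` back into its name parts. */
  def decode(encoded: String): Seq[String] = {
    val parts = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < encoded.length) {
      val c = encoded.charAt(i)
      if (inQuotes) {
        if (c == '`') {
          if (i + 1 < encoded.length && encoded.charAt(i + 1) == '`') {
            current.append('`') // a doubled backtick is an escaped literal backtick
            i += 1
          } else {
            inQuotes = false    // closing backtick ends the current part
            parts += current.toString
            current.clear()
          }
        } else current.append(c)
      } else if (c == '`') inQuotes = true
      else require(c == '.', s"Unexpected character '$c' at position $i")
      i += 1
    }
    require(!inQuotes, "Unterminated backtick")
    parts.toSeq
  }

  /** Option 2: parts joined with the ASCII unit separator (decimal 31). */
  val UnitSeparator: String = "\u001F"
  def encodeDelimited(parts: Seq[String]): String = parts.mkString(UnitSeparator)
  def decodeDelimited(encoded: String): Seq[String] =
    encoded.split(UnitSeparator, -1).toSeq
}
```

With these definitions, `NestedName.encode(Seq("column.1", "attribute.b"))` yields the backtick form shown in option 1, and `decode` recovers the two parts. The delimited form needs no quoting logic at all, which is the parsing simplification the second option refers to, at the cost of a non-printable character in the encoded name.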
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22573

I think the problem is that the current public `Filter` API uses a string as the attribute type, which makes nested fields hard to represent. Ideally we should extend the API and create a new interface for columns and nested columns instead of using a string. But `Filter` is a public API, so this is hard to do.

This PR proposes to encode nested columns as strings. This works, but we should think carefully about how to encode them, so that column names containing dots are still supported.
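For comparison, a structured column reference might look like the following sketch. All of these types (`ColumnRef`, `ColumnFilter`, `GreaterThanRef`, `LessThanRef`) are hypothetical and exist only for illustration; none of them is part of Spark's `Filter` API, and the thread does not propose these exact shapes.

```scala
// Hypothetical types for illustration only; none of these exist in Spark's `Filter` API.
final case class ColumnRef(path: Seq[String]) {
  require(path.nonEmpty, "A column reference needs at least one name part")
  // A reference with more than one part points at a field nested inside a struct.
  def isNested: Boolean = path.length > 1
}

sealed trait ColumnFilter { def column: ColumnRef }
final case class GreaterThanRef(column: ColumnRef, value: Any) extends ColumnFilter
final case class LessThanRef(column: ColumnRef, value: Any) extends ColumnFilter

object ColumnRefExample {
  // Field `x` nested in struct `a`: two name parts.
  val nested: ColumnFilter = GreaterThanRef(ColumnRef(Seq("a", "x")), 3)
  // Top-level column whose name happens to contain a dot: one name part.
  val dotted: ColumnFilter = LessThanRef(ColumnRef(Seq("a.z")), 5)
}
```

Because the path is a sequence rather than a single string, `ColumnRef(Seq("a", "x"))` and `ColumnRef(Seq("a.x"))` are distinct values, so no quoting or delimiter convention is needed to tell a nested field from a dotted name.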
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Merged build finished. Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96711/ Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22573 **[Test build #96711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96711/testReport)** for PR 22573 at commit [`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Merged build finished. Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96710/ Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22573 **[Test build #96710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96710/testReport)** for PR 22573 at commit [`53165b8`](https://github.com/apache/spark/commit/53165b8c4096fb9df4009fcd14532eef94fb6a8e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Merged build finished. Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96709/ Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22573 **[Test build #96709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96709/testReport)** for PR 22573 at commit [`2f21842`](https://github.com/apache/spark/commit/2f21842d4676993d0d28abb6297796c672186f53). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22573 **[Test build #96711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96711/testReport)** for PR 22573 at commit [`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525).
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Merged build finished. Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3548/ Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22573 **[Test build #96710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96710/testReport)** for PR 22573 at commit [`53165b8`](https://github.com/apache/spark/commit/53165b8c4096fb9df4009fcd14532eef94fb6a8e).
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Merged build finished. Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3547/ Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22573 **[Test build #96709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96709/testReport)** for PR 22573 at commit [`2f21842`](https://github.com/apache/spark/commit/2f21842d4676993d0d28abb6297796c672186f53).
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Merged build finished. Test PASSed.
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22573 @gatorsmile @cloud-fan @dongjoon-hyun @viirya
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22573 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3546/ Test PASSed.