[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22573
  
**[Test build #97805 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97805/testReport)**
 for PR 22573 at commit 
[`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22573
  
**[Test build #97765 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97765/testReport)**
 for PR 22573 at commit 
[`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22573
  
**[Test build #97748 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97748/testReport)**
 for PR 22573 at commit 
[`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22573
  
**[Test build #97765 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97765/testReport)**
 for PR 22573 at commit 
[`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525).


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22573
  
**[Test build #97748 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97748/testReport)**
 for PR 22573 at commit 
[`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525).


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-10-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4226/
Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-10-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-10-01 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/22573
  
@dongjoon-hyun, Iceberg schema evolution is based on the field IDs, not on 
names. The current table schema's names are the runtime names for columns in 
that table, and all reads happen by first translating those names to IDs and 
projecting the IDs from the data files. That way, renames can never cause you 
to get incorrect data.

You're mostly right that Spark has a problem with schema evolution for 
HadoopFS tables. That wouldn't affect my suggestion here, though. If you're 
filtering or projecting field `m.n`, then Spark currently handles that by 
matching columns by name. If you're matching by name, then `m.n` can't change 
across versions, or at least you can always project `m.n` from the data (in the 
case of Avro).
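
A minimal sketch of the ID-based read path described above, assuming a toy in-memory model. `IdProjectionSketch`, `DataRow`, and `project` are hypothetical names for illustration, not Iceberg APIs. The table schema maps the current runtime name to a field ID, and data is stored by ID, so a rename only changes the map:

```scala
object IdProjectionSketch {
  // Hypothetical model: a data file row stores values keyed by field ID.
  type DataRow = Map[Int, Any]

  /** Translates the runtime name to a field ID, then projects that ID from the row. */
  def project(tableSchema: Map[String, Int], row: DataRow, name: String): Option[Any] =
    tableSchema.get(name).flatMap(row.get)

  // The same field (ID 4) before and after a rename; the values are made up.
  val schemaV1: Map[String, Int] = Map("m.n" -> 4)
  val schemaV2: Map[String, Int] = Map("m.renamed" -> 4)
  val row: DataRow = Map(4 -> 42)
}
```

With this model, `project(schemaV1, row, "m.n")` and `project(schemaV2, row, "m.renamed")` read the same stored value, while the old name no longer resolves under `schemaV2`.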


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-10-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22573
  
Thank you, @rdblue. BTW, in general, indexing might be unsafe in Apache 
Spark when the Metastore schema is different from the file schema. Does it assume 
the schema evolution feature in `Iceberg`?


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-10-01 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/22573
  
The approach we've taken in Iceberg is to allow `.` in names by using an 
index in the top-level schema. The full path of every leaf in the schema is 
produced and added to a map from the full field name to the field's ID.

The reason why we do this is to avoid problem areas:

* Parsing the name using `.` as a delimiter
* Traversing the schema structure

For example, the schema `0: a struct<2: x int, 3: y int>, 1: a.z int` 
produces this index: `Map("a" -> 0, "a.x" -> 2, "a.y" -> 3, "a.z" -> 1)`.

Binding filters like `a.x > 3` or `a.z < 5` is done using the index instead 
of parsing the field name and traversing, so you get the right result without 
needing to decide whether "a.x" is nested or if it is the actual name. So the 
lookup is quick and correctly produces `id(2) > 3` and `id(1) < 5`. This is 
also used for projection because users want to be able to select nested columns 
by name using dotted field names.

The only drawback to this approach is that you can't have duplicates in the 
index: each full field name must be unique. In the example above, the top-level 
`a.z` field could not be named `a.x` or else it would collide with `x` nested 
in `a`.
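
A minimal sketch of how such an index could be built, assuming a simplified schema model. The `Field`, `StructType`, and `NameIndexSketch` names are hypothetical and are not Iceberg's actual classes. It walks the schema once, records every full dotted name, and rejects duplicates:

```scala
object NameIndexSketch {
  // Hypothetical, simplified schema model for illustration.
  sealed trait FieldType
  case object IntType extends FieldType
  final case class StructType(fields: Seq[Field]) extends FieldType
  final case class Field(id: Int, name: String, fieldType: FieldType)

  /** Maps each full field name to its field ID; fails if two full names collide. */
  def build(fields: Seq[Field]): Map[String, Int] = {
    def visit(prefix: String, fs: Seq[Field]): Seq[(String, Int)] =
      fs.flatMap { f =>
        val fullName = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
        val nested = f.fieldType match {
          case StructType(children) => visit(fullName, children)
          case _                    => Seq.empty
        }
        (fullName -> f.id) +: nested
      }

    val entries = visit("", fields)
    val duplicates = entries.groupBy(_._1).collect {
      case (name, hits) if hits.size > 1 => name
    }
    require(duplicates.isEmpty, s"Duplicate full field names: ${duplicates.mkString(", ")}")
    entries.toMap
  }

  // The example schema from above: 0: a struct<2: x int, 3: y int>, 1: a.z int
  val schema: Seq[Field] = Seq(
    Field(0, "a", StructType(Seq(Field(2, "x", IntType), Field(3, "y", IntType)))),
    Field(1, "a.z", IntType)
  )
  val index: Map[String, Int] = build(schema)
}
```

Binding a filter such as `a.x > 3` then reduces to a single lookup, `index("a.x")`, with no parsing of the name. Renaming the top-level `a.z` field to `a.x` makes `build` fail, which is the collision described above.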


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22573
  
That's great!


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-30 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22573
  
Updating `Filter` APIs sounds reasonable to me. This should be part of our 
data source API v2. cc @cloud-fan @rxin @rdblue 


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22573
  
Can we update the public `Filter` API in Spark 3.0.0, @cloud-fan and 
@gatorsmile?


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-28 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22573
  
I was thinking of changing the APIs in `Filter` so we can represent nested 
fields more easily, but realized that it's a stable public interface.

Without changing the interface of `Filter`, we have the following two 
options:

1. Use backticks to wrap column names and struct names that 
contain dots. For example, 
  ```scala
  `column.1`.`attribute.b`
  ```
  It's also easier for people to understand when they are reading the 
pushdown plans in text format.

2. Alternatively, we can use ASCII delimited text to avoid delimiter 
collision. For example, the unit separator (decimal 31, `\u001F`) is commonly 
used between fields of a record, or members of a row. This simplifies parsing 
significantly, but the downside is that it's not readable, so when we print 
the plan, we need to add the backticks for visualization.
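
A rough sketch of both encodings, to make the trade-off concrete. The `NestedNameEncoding` object and its methods are hypothetical helpers, not existing Spark APIs. Doubling an embedded backtick to escape it is an assumption here, and name parts are assumed to be non-empty:

```scala
object NestedNameEncoding {
  /** Option 1: wrap each name part in backticks (embedded backticks doubled), joined by dots. */
  def quote(parts: Seq[String]): String =
    parts.map(p => "`" + p.replace("`", "``") + "`").mkString(".")

  /** Parses the output of `quote` back into its name parts. */
  def unquote(encoded: String): Seq[String] = {
    val parts = Seq.newBuilder[String]
    val current = new StringBuilder
    var inQuote = false
    var i = 0
    while (i < encoded.length) {
      val c = encoded.charAt(i)
      if (!inQuote) {
        // Outside a quoted part: a backtick opens one, separator dots are skipped.
        if (c == '`') inQuote = true
      } else if (c == '`') {
        if (i + 1 < encoded.length && encoded.charAt(i + 1) == '`') {
          current += '`' // doubled backtick is an escaped literal backtick
          i += 1
        } else {
          parts += current.toString // closing backtick ends the part
          current.clear()
          inQuote = false
        }
      } else {
        current += c
      }
      i += 1
    }
    parts.result()
  }

  /** Option 2: join the parts with the ASCII unit separator (decimal 31). */
  val UnitSeparator: Char = '\u001F'
  def join(parts: Seq[String]): String = parts.mkString(UnitSeparator.toString)
  def split(encoded: String): Seq[String] = encoded.split(UnitSeparator).toSeq
}
```

Option 1 needs a small state machine to parse, but the encoded string is what gets printed in the plan. Option 2 parses with a single `split`, but the plan printer would have to call something like `quote(split(name))` to stay readable.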

What do you think? 


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-28 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22573
  
I think the problem is that the current public `Filter` API uses a string as the 
attribute type, which makes it hard to represent nested fields.

Ideally we should extend the API and create a new interface for columns and 
nested columns, instead of using strings. But `Filter` is a public API, so this is hard 
to do.

This PR proposes to encode nested columns as strings. This works, but we 
should think carefully about how to encode them, so that column names with dots are 
still supported.
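
To illustrate the "ideal" direction only, here is a sketch of a structured column reference. `ColumnRef`, `GreaterThan`, and the nested-map row model are hypothetical and are not part of Spark's `Filter` API. Because the name parts are kept as a sequence, a nested field and a top-level column whose name contains a dot cannot be confused:

```scala
object ColumnRefSketch {
  // Hypothetical structured reference: one entry per nesting level.
  final case class ColumnRef(parts: Seq[String])
  final case class GreaterThan(column: ColumnRef, value: Int)

  /** Looks up a column in a row modeled as nested maps; None if any part is missing. */
  def resolve(row: Map[String, Any], ref: ColumnRef): Option[Any] =
    ref.parts.foldLeft(Option[Any](row)) {
      case (Some(m: Map[_, _]), part) => m.asInstanceOf[Map[String, Any]].get(part)
      case _                          => None
    }

  /** Evaluates the filter; a missing or non-integer value does not match. */
  def eval(row: Map[String, Any], filter: GreaterThan): Boolean =
    resolve(row, filter.column) match {
      case Some(v: Int) => v > filter.value
      case _            => false
    }

  // A row with a struct `a` containing `x`, plus a top-level column literally named "a.x".
  val row: Map[String, Any] = Map("a" -> Map("x" -> 5), "a.x" -> 1)
}
```

Here `ColumnRef(Seq("a", "x"))` resolves to the nested value and `ColumnRef(Seq("a.x"))` to the top-level one, with no string encoding to design.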


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96711/
Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22573
  
**[Test build #96711 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96711/testReport)**
 for PR 22573 at commit 
[`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96710/
Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22573
  
**[Test build #96710 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96710/testReport)**
 for PR 22573 at commit 
[`53165b8`](https://github.com/apache/spark/commit/53165b8c4096fb9df4009fcd14532eef94fb6a8e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96709/
Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22573
  
**[Test build #96709 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96709/testReport)**
 for PR 22573 at commit 
[`2f21842`](https://github.com/apache/spark/commit/2f21842d4676993d0d28abb6297796c672186f53).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22573
  
**[Test build #96711 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96711/testReport)**
 for PR 22573 at commit 
[`d59cb55`](https://github.com/apache/spark/commit/d59cb557ca247a74c101f90394f11916b28f2525).


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3548/
Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22573
  
**[Test build #96710 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96710/testReport)**
 for PR 22573 at commit 
[`53165b8`](https://github.com/apache/spark/commit/53165b8c4096fb9df4009fcd14532eef94fb6a8e).


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3547/
Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22573
  
**[Test build #96709 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96709/testReport)**
 for PR 22573 at commit 
[`2f21842`](https://github.com/apache/spark/commit/2f21842d4676993d0d28abb6297796c672186f53).


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22573
  
@gatorsmile @cloud-fan @dongjoon-hyun @viirya 


---




[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22573
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3546/
Test PASSed.


---
