[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-9814:
    Component/s: (was: Input/Output)
                 SQL

> EqualNotNull not passing to data sources
>
>                 Key: SPARK-9814
>                 URL: https://issues.apache.org/jira/browse/SPARK-9814
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> When data sources (such as Parquet) try to filter data when reading from
> HDFS (not in memory), the physical planning phase passes the filter objects in
> {{org.apache.spark.sql.sources}}, which are appropriately built and picked up
> by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
> On the other hand, it does not pass the {{EqualNullSafe}} filter in
> {{org.apache.spark.sql.catalyst.expressions}}, even though it seems possible
> to pass it to other data sources such as Parquet and JSON. In more detail, it
> does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in
> {{PrunedFilteredScan}} and {{PrunedScan}},
> {code}
> def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
> {code}
> even though the binary capability issue is solved
> (https://issues.apache.org/jira/browse/SPARK-8747).
> I understand that {{CatalystScan}} can take all the raw expressions by accessing
> the query planner. However, it is experimental and needs different
> interfaces (and it is unstable for reasons such as binary capability).
> In general, the problem below can happen.
> 1.
> {code:sql}
> SELECT * FROM table WHERE field = 1;
> {code}
>
> 2.
> {code:sql}
> SELECT * FROM table WHERE field <=> 1;
> {code}
> The second query can be much slower even though the functionality is almost
> identical, because the unfiltered data from the source RDD can cause heavy
> network traffic (among other costs).
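The difference between the two queries in the report comes down to how `=` and `<=>` treat NULL. The following is a minimal Scala sketch of those semantics, not Spark code: `Option[Int]` stands in for a nullable column, and the names `NullSafeEquality`, `sqlEquals`, `nullSafeEquals`, and `keepsRow` are hypothetical, introduced only for illustration.

```scala
// Illustrative model only; none of these names exist in Spark.
object NullSafeEquality {
  // `=`: the result is unknown (None) when either side is NULL.
  def sqlEquals(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // `<=>`: never unknown; NULL <=> NULL is true, NULL <=> value is false.
  def nullSafeEquals(a: Option[Int], b: Option[Int]): Boolean = a == b

  // A WHERE clause keeps a row only when the predicate is definitely true.
  def keepsRow(predicate: Option[Boolean]): Boolean = predicate.contains(true)

  def main(args: Array[String]): Unit = {
    val field: Seq[Option[Int]] = Seq(Option(1), Option(2), Option.empty[Int])
    val literal: Option[Int] = Option(1)

    // WHERE field = 1
    println(field.filter(v => keepsRow(sqlEquals(v, literal))))
    // WHERE field <=> 1
    println(field.filter(v => nullSafeEquals(v, literal)))
  }
}
```

For a non-NULL literal such as 1, both predicates keep the same rows, which matches the report's remark that the functionality is almost identical. They differ only when NULL is compared with NULL.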
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-9814:
    Environment: (was: Centos 6.6)

> EqualNotNull not passing to data sources
>
>                 Key: SPARK-9814
>                 URL: https://issues.apache.org/jira/browse/SPARK-9814
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>            Reporter: Hyukjin Kwon
>            Priority: Minor
[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-9814:
    Description:
When data sources (such as Parquet) try to filter data when reading from HDFS (not in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
On the other hand, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though it seems possible to pass it to other data sources such as Parquet and JSON. In more detail, it does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}},
{code}
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
{code}
even though the binary capability issue is solved (https://issues.apache.org/jira/browse/SPARK-8747).
I understand that {{CatalystScan}} can take all the raw expressions by accessing the query planner. However, it is experimental and needs different interfaces (and it is unstable for reasons such as binary capability).
In general, the problem below can happen.
1.
{code:sql}
SELECT * FROM table WHERE field = 1;
{code}
2.
{code:sql}
SELECT * FROM table WHERE field <=> 1;
{code}
The second query can be much slower even though the functionality is almost identical, because the unfiltered data from the source RDD can cause heavy network traffic (among other costs).

  was:
When data sources (such as Parquet) try to filter data when reading from HDFS (not in memory), the physical planning phase passes the filter objects in org.apache.spark.sql.sources, which are appropriately built and picked up by selectFilters() in org.apache.spark.sql.sources.DataSourceStrategy.
On the other hand, it does not pass the EqualNullSafe filter in org.apache.spark.sql.catalyst.expressions, even though it seems possible to pass it to other data sources such as Parquet and JSON. In more detail, it does not pass EqualNullSafe to buildScan in PrunedFilteredScan and PrunedScan, even though the binary capability issue is solved (https://issues.apache.org/jira/browse/SPARK-8747).
I understand that CatalystScan can take all the raw expressions by accessing the query planner. However, it is experimental and needs different interfaces (and it is unstable for reasons such as binary capability).
In general, the problem below can happen.
1. SELECT * FROM table WHERE field = 1;
2. SELECT * FROM table WHERE field <=> 1;
The second query can be much slower even though the functionality is almost identical, because the unfiltered data from the source RDD can cause heavy network traffic (among other costs).

> EqualNotNull not passing to data sources
>
>                 Key: SPARK-9814
>                 URL: https://issues.apache.org/jira/browse/SPARK-9814
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>         Environment: Centos 6.6
>            Reporter: Hyukjin Kwon
>            Priority: Minor
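The pushdown gap described above can be sketched in self-contained Scala. This is a simplified model under stated assumptions, not Spark's implementation: `Predicate`, `SourceFilter`, `Row`, `selectFiltersReported`, and `selectFiltersProposed` are hypothetical stand-ins, the `EqualTo` and `EqualNullSafe` case classes and `buildScan` here only borrow names from the report, and rows are plain maps with `Option[Int]` modelling a nullable column.

```scala
// Simplified stand-ins for illustration; these are NOT Spark's classes.
object FilterPushdownSketch {
  // Predicates as the planner might see them (column compared with a literal).
  sealed trait Predicate
  final case class Eq(column: String, value: Option[Int]) extends Predicate
  final case class NullSafeEq(column: String, value: Option[Int]) extends Predicate

  // Filters that a data source can receive in its scan method.
  sealed trait SourceFilter
  final case class EqualTo(attribute: String, value: Option[Int]) extends SourceFilter
  final case class EqualNullSafe(attribute: String, value: Option[Int]) extends SourceFilter

  type Row = Map[String, Option[Int]] // None models SQL NULL

  // Behaviour described in the report: `=` is translated, `<=>` is dropped.
  def selectFiltersReported(predicates: Seq[Predicate]): Seq[SourceFilter] =
    predicates.collect { case Eq(c, v) => EqualTo(c, v) }

  // Behaviour the report asks for: `<=>` is translated as well.
  def selectFiltersProposed(predicates: Seq[Predicate]): Seq[SourceFilter] =
    predicates.map {
      case Eq(c, v)         => EqualTo(c, v)
      case NullSafeEq(c, v) => EqualNullSafe(c, v)
    }

  private def matches(row: Row, filter: SourceFilter): Boolean = filter match {
    // `=` never matches when the literal or the column value is NULL.
    case EqualTo(a, v)       => v.isDefined && row.getOrElse(a, None) == v
    // `<=>` treats NULL as an ordinary comparable value.
    case EqualNullSafe(a, v) => row.getOrElse(a, None) == v
  }

  // The source returns only rows passing every pushed-down filter;
  // with no filters, every row leaves the source.
  def buildScan(rows: Seq[Row], filters: Seq[SourceFilter]): Seq[Row] =
    rows.filter(r => filters.forall(matches(r, _)))

  def main(args: Array[String]): Unit = {
    val table: Seq[Row] = Seq(
      Map("field" -> Option(1)),
      Map("field" -> Option(2)),
      Map("field" -> Option.empty[Int]))
    val query2 = Seq(NullSafeEq("field", Some(1))) // WHERE field <=> 1

    println(buildScan(table, selectFiltersReported(query2)).size) // 3
    println(buildScan(table, selectFiltersProposed(query2)).size) // 1
  }
}
```

In this model the query result is the same either way once the engine re-applies the predicate; what changes is how many rows leave the source, which is the cost the report attributes to the second query.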
[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-9814:
    Description:
When data sources (such as Parquet) try to filter data when reading from HDFS (not in memory), the physical planning phase passes the filter objects in org.apache.spark.sql.sources, which are appropriately built and picked up by selectFilters() in org.apache.spark.sql.sources.DataSourceStrategy.
On the other hand, it does not pass the EqualNullSafe filter in org.apache.spark.sql.catalyst.expressions, even though it seems possible to pass it to other data sources such as Parquet and JSON. In more detail, it does not pass EqualNullSafe to buildScan in PrunedFilteredScan and PrunedScan, even though the binary capability issue is solved (https://issues.apache.org/jira/browse/SPARK-8747).
I understand that CatalystScan can take all the raw expressions by accessing the query planner. However, it is experimental and needs different interfaces (and it is unstable for reasons such as binary capability).
In general, the problem below can happen.
1. SELECT * FROM table WHERE field = 1;
2. SELECT * FROM table WHERE field <=> 1;
The second query can be much slower even though the functionality is almost identical, because the unfiltered data from the source RDD can cause heavy network traffic (among other costs).

  was:
When data sources (such as Parquet) try to filter data when reading from HDFS (not in memory), the physical planning phase passes the filter objects in `org.apache.spark.sql.sources`, which are appropriately built and picked up by `selectFilters()` in `org.apache.spark.sql.sources.DataSourceStrategy`.
On the other hand, it does not pass the `EqualNullSafe` filter in `org.apache.spark.sql.catalyst.expressions`, even though it seems possible to pass it to other data sources such as Parquet and JSON. In more detail, it does not pass it to `buildScan` (below) in `PrunedFilteredScan` and `PrunedScan`,
```
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
```
even though the binary capability issue is solved (https://issues.apache.org/jira/browse/SPARK-8747).
I understand that `CatalystScan` can take all the raw expressions by accessing the query planner. However, it is experimental and needs different interfaces (and it is unstable for reasons such as binary capability).
In general, the problem below can happen.
1.
```
SELECT * FROM table WHERE field = 1;
```
2.
```
SELECT * FROM table WHERE field <=> 1;
```
The second query can be much slower even though the functionality is almost identical, because the unfiltered data from the source RDD can cause heavy network traffic (among other costs).

> EqualNotNull not passing to data sources
>
>                 Key: SPARK-9814
>                 URL: https://issues.apache.org/jira/browse/SPARK-9814
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>         Environment: Centos 6.6
>            Reporter: Hyukjin Kwon
>            Priority: Minor