[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources

2015-08-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-9814:

Description: 
When data sources (such as Parquet) try to filter data while reading from HDFS 
(not in memory), the physical planning phase passes the filter objects in 
org.apache.spark.sql.sources, which are appropriately built and picked up by 
selectFilters() in org.apache.spark.sql.sources.DataSourceStrategy.

On the other hand, it does not pass the EqualNullSafe filter in 
org.apache.spark.sql.catalyst.expressions, even though it seems possible to 
pass it to data sources such as Parquet and JSON. In more detail, it does 
not pass EqualNullSafe to buildScan in PrunedFilteredScan and PrunedScan, even 
though the binary capability issue is solved 
(https://issues.apache.org/jira/browse/SPARK-8747).

I understand that CatalystScan can take all the raw expressions by accessing 
the query planner. However, it is experimental and requires a different 
interface (and it is also unstable, for reasons such as binary capability).
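
To illustrate what receiving EqualNullSafe in buildScan could look like on the data source side, here is a minimal, self-contained sketch. It does not use Spark: Filter, EqualTo, EqualNullSafe, Row, PushdownSketch and matches are simplified, hypothetical stand-ins defined only for this example, and this buildScan takes the data as an extra argument and returns a Seq instead of an RDD.

```scala
// Hypothetical, simplified stand-ins for illustration; NOT Spark's actual classes.
sealed trait Filter
final case class EqualTo(attribute: String, value: Any) extends Filter
final case class EqualNullSafe(attribute: String, value: Any) extends Filter

object PushdownSketch {
  // A row maps column names to values; a null value (or missing key) models SQL NULL.
  type Row = Map[String, Any]

  // Evaluate one pushed-down filter against a single row.
  def matches(row: Row, filter: Filter): Boolean = filter match {
    case EqualTo(attr, v) =>
      // Plain equality: a NULL on either side never matches.
      val cell = row.getOrElse(attr, null)
      cell != null && v != null && cell == v
    case EqualNullSafe(attr, v) =>
      // Null-safe equality: NULL matches NULL, otherwise ordinary equality.
      row.getOrElse(attr, null) == v
  }

  // Simplified analogue of buildScan: filter rows at the source, then prune columns.
  def buildScan(
      data: Seq[Row],
      requiredColumns: Array[String],
      filters: Array[Filter]): Seq[Row] =
    data
      .filter(row => filters.forall(f => matches(row, f)))
      .map(row => row.filter { case (k, _) => requiredColumns.contains(k) })
}
```

With a filter like this available at the source, rows that do not match could be dropped before they are shipped to Spark, which is the saving this issue is about.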

In general, the problem below can happen.

1. SELECT * FROM table WHERE field = 1;

2. SELECT * FROM table WHERE field <=> 1;

The second query can be hugely slow, although the functionality is almost 
identical, because of the possibly large network traffic (etc.) caused by 
unfiltered data from the source RDD.
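
EqualNullSafe corresponds to the `<=>` operator in Spark SQL, which differs from `=` only in how NULL is handled. The following sketch models that difference in plain Scala, assuming a nullable integer column represented as Option[Int] (None stands for NULL); NullSafeEquality, sqlEq and sqlNullSafeEq are hypothetical names used only here.

```scala
object NullSafeEquality {
  // SQL `=`: the result is NULL (None) when either operand is NULL.
  def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // SQL `<=>`: never NULL; NULL <=> NULL is true, NULL <=> 1 is false.
  def sqlNullSafeEq(a: Option[Int], b: Option[Int]): Boolean = a == b

  def main(args: Array[String]): Unit = {
    val field: Seq[Option[Int]] = Seq(Some(1), Some(2), None)

    // WHERE keeps a row only when the predicate is true (NULL counts as not true).
    val withEq = field.filter(v => sqlEq(v, Some(1)).contains(true))
    val withNullSafeEq = field.filter(v => sqlNullSafeEq(v, Some(1)))

    // Against a non-NULL literal, both predicates select the same rows.
    println(withEq == withNullSafeEq) // prints "true"
  }
}
```

Compared against a non-NULL literal such as 1, both predicates keep the same rows, so the difference between the two queries would be in where the filtering happens rather than in the result.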


  was:
When data sources (such as Parquet) try to filter data while reading from HDFS 
(not in memory), the physical planning phase passes the filter objects in 
`org.apache.spark.sql.sources`, which are appropriately built and picked up by 
`selectFilters()` in `org.apache.spark.sql.sources.DataSourceStrategy`.

On the other hand, it does not pass the `EqualNullSafe` filter in 
`org.apache.spark.sql.catalyst.expressions`, even though it seems possible to 
pass it to data sources such as Parquet and JSON. In more detail, it does 
not pass it to `buildScan` (below) in `PrunedFilteredScan` and `PrunedScan`,

```
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
```
even though the binary capability issue is solved 
(https://issues.apache.org/jira/browse/SPARK-8747).

I understand that `CatalystScan` can take all the raw expressions by accessing 
the query planner. However, it is experimental and requires a different 
interface (and it is also unstable, for reasons such as binary capability).


In general, the problem below can happen.

1. 
```
SELECT * 
FROM table
WHERE field = 1;
```

2. 
```
SELECT * 
FROM table
WHERE field <=> 1;
```

The second query can be hugely slow, although the functionality is almost 
identical, because of the possibly large network traffic (etc.) caused by 
unfiltered data from the source RDD.





 EqualNotNull not passing to data sources
 

 Key: SPARK-9814
 URL: https://issues.apache.org/jira/browse/SPARK-9814
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
 Environment: Centos 6.6
Reporter: Hyukjin Kwon
Priority: Minor

 When data sources (such as Parquet) try to filter data while reading from 
 HDFS (not in memory), the physical planning phase passes the filter objects in 
 org.apache.spark.sql.sources, which are appropriately built and picked up by 
 selectFilters() in org.apache.spark.sql.sources.DataSourceStrategy.
 On the other hand, it does not pass the EqualNullSafe filter in 
 org.apache.spark.sql.catalyst.expressions, even though it seems possible to 
 pass it to data sources such as Parquet and JSON. In more detail, it does 
 not pass EqualNullSafe to buildScan in PrunedFilteredScan and PrunedScan, 
 even though the binary capability issue is solved 
 (https://issues.apache.org/jira/browse/SPARK-8747).
 I understand that CatalystScan can take all the raw expressions by accessing 
 the query planner. However, it is experimental and requires a different 
 interface (and it is also unstable, for reasons such as binary capability).
 In general, the problem below can happen.
 1. SELECT * FROM table WHERE field = 1;
 2. SELECT * FROM table WHERE field <=> 1;
 The second query can be hugely slow, although the functionality is almost 
 identical, because of the possibly large network traffic (etc.) caused by 
 unfiltered data from the source RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources

2015-08-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-9814:

Description: 
When data sources (such as Parquet) try to filter data while reading from HDFS 
(not in memory), the physical planning phase passes the filter objects in 
{{org.apache.spark.sql.sources}}, which are appropriately built and picked up 
by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.

On the other hand, it does not pass the {{EqualNullSafe}} filter in 
{{org.apache.spark.sql.catalyst.expressions}}, even though it seems possible 
to pass it to data sources such as Parquet and JSON. In more detail, it does 
not pass {{EqualNullSafe}} to {{buildScan()}} (below) in {{PrunedFilteredScan}} 
and {{PrunedScan}},

{code}
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
{code}

even though the binary capability issue is solved 
(https://issues.apache.org/jira/browse/SPARK-8747).

I understand that {{CatalystScan}} can take all the raw expressions by 
accessing the query planner. However, it is experimental and requires a 
different interface (and it is also unstable, for reasons such as binary 
capability).


In general, the problem below can happen.

1.
{code:sql}
SELECT * FROM table WHERE field = 1;
{code}
 
2. 
{code:sql}
SELECT * FROM table WHERE field <=> 1;
{code}

The second query can be hugely slow, although the functionality is almost 
identical, because of the possibly large network traffic (etc.) caused by 
unfiltered data from the source RDD.


  was:
When data sources (such as Parquet) try to filter data while reading from HDFS 
(not in memory), the physical planning phase passes the filter objects in 
org.apache.spark.sql.sources, which are appropriately built and picked up by 
selectFilters() in org.apache.spark.sql.sources.DataSourceStrategy.

On the other hand, it does not pass the EqualNullSafe filter in 
org.apache.spark.sql.catalyst.expressions, even though it seems possible to 
pass it to data sources such as Parquet and JSON. In more detail, it does 
not pass EqualNullSafe to buildScan in PrunedFilteredScan and PrunedScan, even 
though the binary capability issue is solved 
(https://issues.apache.org/jira/browse/SPARK-8747).

I understand that CatalystScan can take all the raw expressions by accessing 
the query planner. However, it is experimental and requires a different 
interface (and it is also unstable, for reasons such as binary capability).

In general, the problem below can happen.

1. SELECT * FROM table WHERE field = 1;

2. SELECT * FROM table WHERE field <=> 1;

The second query can be hugely slow, although the functionality is almost 
identical, because of the possibly large network traffic (etc.) caused by 
unfiltered data from the source RDD.



 EqualNotNull not passing to data sources
 

 Key: SPARK-9814
 URL: https://issues.apache.org/jira/browse/SPARK-9814
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
 Environment: Centos 6.6
Reporter: Hyukjin Kwon
Priority: Minor

 When data sources (such as Parquet) try to filter data while reading from 
 HDFS (not in memory), the physical planning phase passes the filter objects in 
 {{org.apache.spark.sql.sources}}, which are appropriately built and picked up 
 by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
 On the other hand, it does not pass the {{EqualNullSafe}} filter in 
 {{org.apache.spark.sql.catalyst.expressions}}, even though it seems possible 
 to pass it to data sources such as Parquet and JSON. In more detail, it 
 does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in 
 {{PrunedFilteredScan}} and {{PrunedScan}},
 {code}
 def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
 {code}
 even though the binary capability issue is solved 
 (https://issues.apache.org/jira/browse/SPARK-8747).
 I understand that {{CatalystScan}} can take all the raw expressions by 
 accessing the query planner. However, it is experimental and requires a 
 different interface (and it is also unstable, for reasons such as binary 
 capability).
 In general, the problem below can happen.
 1.
 {code:sql}
 SELECT * FROM table WHERE field = 1;
 {code}
  
 2. 
 {code:sql}
 SELECT * FROM table WHERE field <=> 1;
 {code}
 The second query can be hugely slow, although the functionality is almost 
 identical, because of the possibly large network traffic (etc.) caused by 
 unfiltered data from the source RDD.






[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources

2015-08-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-9814:

Environment: (was: Centos 6.6)

 EqualNotNull not passing to data sources
 

 Key: SPARK-9814
 URL: https://issues.apache.org/jira/browse/SPARK-9814
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Reporter: Hyukjin Kwon
Priority: Minor







[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources

2015-08-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-9814:

Component/s: (was: Input/Output)
 SQL

 EqualNotNull not passing to data sources
 

 Key: SPARK-9814
 URL: https://issues.apache.org/jira/browse/SPARK-9814
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Priority: Minor



