[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Cheng Lian (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381862#comment-14381862 ]

Cheng Lian commented on SPARK-6554:
---

Parquet filter push-down isn't enabled by default in 1.3.0 because the most 
recent Parquet version available as of the Spark 1.3.0 release (1.6.0rc3) 
suffers from two bugs (PARQUET-136 & PARQUET-173), so it's generally not 
recommended for production use yet. These two bugs have been fixed in Parquet 
master, and the official 1.6.0 release should be out pretty soon. We will 
probably upgrade to Parquet 1.6.0 in Spark 1.4.0.
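
For anyone who wants to experiment with it anyway, here is a minimal sketch of 
toggling the flag on an existing SQLContext (the {{sqlContext}} name is an 
assumption; the property key is the one discussed later in this thread):

{code}
// Sketch only: Parquet filter push-down is off by default in 1.3.0; this
// opts in explicitly. Not recommended for production until the fixes for
// PARQUET-136 / PARQUET-173 land in an official Parquet release.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true");
{code}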

> Cannot use partition columns in where clause when Parquet filter push-down is 
> enabled
> -
>
>                 Key: SPARK-6554
>                 URL: https://issues.apache.org/jira/browse/SPARK-6554
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Jon Chase
>            Assignee: Cheng Lian
>            Priority: Critical
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column, and the 
> directory structure looks like this:
> {noformat}
> /mydata
>     /probeTypeId=1
>         ...files...
>     /probeTypeId=2
>         ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and call 
> df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.Sp

[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Cheng Lian (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381850#comment-14381850 ]

Cheng Lian commented on SPARK-6554:
---

Marked this as critical rather than blocker mostly because Parquet filter 
push-down is not enabled by default in 1.3.0.


[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Jon Chase (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381831#comment-14381831 ]

Jon Chase commented on SPARK-6554:
--

"spark.sql.parquet.filterPushdown" was the problem.  Leaving it set to false 
works around the problem for now.  

Thanks for jumping on this.
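
For reference, a minimal sketch of the workaround (assuming an existing 
SQLContext named {{sqlContext}}):

{code}
// Sketch only: explicitly disable Parquet filter push-down (the 1.3.0
// default), so predicates on partition columns are no longer handed to the
// Parquet reader.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false");
sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
{code}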


[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381814#comment-14381814 ]

Apache Spark commented on SPARK-6554:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/5210


[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Cheng Lian (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381788#comment-14381788 ]

Cheng Lian commented on SPARK-6554:
---

Hi [~jonchase], did you happen to turn on Parquet filter push-down by setting 
"spark.sql.parquet.filterPushdown" to true? The reason behind this is that, in 
your case, the partition column doesn't exist in the Parquet data files, so the 
Parquet filter push-down logic sees it as an invalid column. We should remove 
all predicates that touch partition columns that don't exist in the Parquet 
data files before doing the push-down optimization.
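
A rough sketch of that pruning idea (illustrative only, not Spark's actual 
implementation; the {{Predicate}} interface is a hypothetical stand-in for 
whatever filter representation is used):

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class PartitionPredicatePruning {
    // Hypothetical stand-in: for this step, a predicate only needs to
    // report which columns it references.
    interface Predicate {
        Set<String> referencedColumns();
    }

    // Keep only predicates over columns physically present in the Parquet
    // files. Partition columns exist solely in the directory layout, not in
    // the Parquet schema, so predicates on them must be handled by partition
    // pruning rather than Parquet row-group filtering.
    static List<Predicate> prune(List<Predicate> filters,
                                 Set<String> partitionColumns) {
        List<Predicate> pushable = new ArrayList<Predicate>();
        for (Predicate p : filters) {
            boolean touchesPartitionColumn = false;
            for (String col : p.referencedColumns()) {
                if (partitionColumns.contains(col)) {
                    touchesPartitionColumn = true;
                    break;
                }
            }
            if (!touchesPartitionColumn) {
                pushable.add(p);
            }
        }
        return pushable;
    }
}
{code}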


[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Jon Chase (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381720#comment-14381720 ]

Jon Chase commented on SPARK-6554:
--

Here's a test case to reproduce the issue:

{code}
@Test
public void testSpark_6554() {
    // given: a single-row DataFrame with one int column, to be saved into a
    // partition-style directory (col2=2)
    DataFrame saveDF = sql.jsonRDD(
            sc.parallelize(Lists.newArrayList("{\"col1\": 1}")),
            DataTypes.createStructType(Lists.newArrayList(
                    DataTypes.createStructField("col1", DataTypes.IntegerType, false))));

    // when:
    saveDF.saveAsParquetFile(tmp.getRoot().getAbsolutePath() + "/col2=2");

    // then: loading the parent directory discovers col2 as a partition column
    DataFrame loadedDF = sql.load(tmp.getRoot().getAbsolutePath());
    assertEquals(1, loadedDF.count());

    assertEquals(2, loadedDF.schema().fieldNames().length);
    assertEquals("col1", loadedDF.schema().fieldNames()[0]);
    assertEquals("col2", loadedDF.schema().fieldNames()[1]);

    loadedDF.registerTempTable("df");

    // this query works
    Row[] results = sql.sql("select col1, col2 from df").collect();
    assertEquals(1, results.length);
    assertEquals(2, results[0].size());

    // this query is broken when spark.sql.parquet.filterPushdown is enabled:
    // filtering on the partition column fails with "Column ... was not found
    // in schema!"
    results = sql.sql("select col1, col2 from df where col2 > 0").collect();
    assertEquals(1, results.length);
    assertEquals(2, results[0].size());
}
{code}
