[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381850#comment-14381850
 ] 

Cheng Lian commented on SPARK-6554:
---

Marked this as critical rather than blocker, mostly because Parquet filter 
push-down is not enabled by default in 1.3.0.
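
For context, push-down is gated behind a single SQL conf, so only users who 
have opted in are affected. A minimal sketch of the opt-in that triggers the 
bug (using the standard {{SQLContext.setConf}} API):

{code}
// Parquet filter push-down is off by default in Spark 1.3.0. Enabling it
// makes Spark push WHERE-clause predicates down to the Parquet reader,
// which is what exposes this bug for filters on partition columns.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
{code}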

 Cannot use partition columns in where clause when Parquet filter push-down is 
 enabled
 -

 Key: SPARK-6554
 URL: https://issues.apache.org/jira/browse/SPARK-6554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase
Assignee: Cheng Lian
Priority: Critical

 I'm having trouble referencing partition columns in my queries with Parquet.  
 In the following example, 'probeTypeId' is a partition column, and the 
 directory structure looks like this:
 {noformat}
 /mydata
   /probeTypeId=1
     ...files...
   /probeTypeId=2
     ...files...
 {noformat}
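 I load and register the DataFrame roughly like this (a minimal sketch, 
 assuming Spark 1.3's {{parquetFile}} API; the exact loading code doesn't 
 matter for the repro):
 {code}
 // Read the partitioned Parquet data; partition discovery picks up
 // probeTypeId from the directory names and adds it to the schema.
 val df = sqlContext.parquetFile("/mydata")
 df.registerTempTable("df")  // lets the SQL below refer to the data as "df"
 {code}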
 The partition column shows up when I call df.printSchema():
 {noformat}
  |-- probeTypeId: integer (nullable = true)
 {noformat}
 Parquet is also aware of the column:
 {noformat}
  optional int32 probeTypeId;
 {noformat}
 And this works fine:
 {code}
 sqlContext.sql("select probeTypeId from df limit 1");
 {code}
 ...as does {{df.show()}} - it shows the correct values for the partition 
 column.
 However, when I try to use a partition column in a where clause, I get an 
 exception stating that the column was not found in the schema:
 {noformat}
 sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
 ...
 ...
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
 was not found in schema!
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
   at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
   at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
   at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
   at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
   at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
   at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
   at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
   at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 ...
 ...
 {noformat}
 Here's the full stack trace:
 {noformat}
 using local[*] for master
 06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
 06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO
 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
 06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
 06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current configuration as safe fallback point
 INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
 WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 INFO  org.apache.spark.SecurityManager Changing view acls to: jon
 INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
 INFO  org.apache.spark.SecurityManager SecurityManager: authentication disabled; ui acls disabled;
 ...
 {noformat}

[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381862#comment-14381862
 ] 

Cheng Lian commented on SPARK-6554:
---

Parquet filter push-down isn't enabled by default in 1.3.0 because the most 
recent Parquet version available as of the Spark 1.3.0 release (1.6.0rc3) 
suffers from two bugs (PARQUET-136 and PARQUET-173), so it's generally not 
recommended for production use yet. These two bugs have been fixed in Parquet 
master, and the official 1.6.0 release should be out pretty soon. We will 
probably upgrade to Parquet 1.6.0 in Spark 1.4.0.

[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Jon Chase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381831#comment-14381831
 ] 

Jon Chase commented on SPARK-6554:
--

{{spark.sql.parquet.filterPushdown}} was the problem.  Leaving it set to false 
works around it for now.

Thanks for jumping on this.
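
For anyone else who hits this before a fix lands, the workaround is a 
one-liner (a sketch; false is already the 1.3.0 default, so this only matters 
if push-down was enabled somewhere in your setup):

{code}
// Work around SPARK-6554: keep Parquet filter push-down disabled so
// predicates on partition columns aren't handed to the Parquet reader.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
{code}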

[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381814#comment-14381814
 ] 

Apache Spark commented on SPARK-6554:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/5210
