[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381850#comment-14381850 ] Cheng Lian commented on SPARK-6554: --- Marked this as critical rather than blocker mostly because Parquet filter push-down is not enabled by default in 1.3.0. Cannot use partition columns in where clause when Parquet filter push-down is enabled - Key: SPARK-6554 URL: https://issues.apache.org/jira/browse/SPARK-6554 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jon Chase Assignee: Cheng Lian Priority: Critical I'm having trouble referencing partition columns in my queries with Parquet. In the following example, 'probeTypeId' is a partition column. For example, the directory structure looks like this: {noformat} /mydata /probeTypeId=1 ...files... /probeTypeId=2 ...files... {noformat} I see the column when I reference load a DF using the /mydata directory and call df.printSchema(): {noformat} |-- probeTypeId: integer (nullable = true) {noformat} Parquet is also aware of the column: {noformat} optional int32 probeTypeId; {noformat} And this works fine: {code} sqlContext.sql(select probeTypeId from df limit 1); {code} ...as does {{df.show()}} - it shows the correct values for the partition column. However, when I try to use a partition column in a where clause, I get an exception stating that the column was not found in the schema: {noformat} sqlContext.sql(select probeTypeId from df where probeTypeId = 1 limit 1); ... ... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema! at parquet.Preconditions.checkArgument(Preconditions.java:47) at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) ... ... {noformat} Here's the full stack trace: {noformat} using local[*] for master 06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender] 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT] 06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT] 06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration. 06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current configuration as safe fallback point INFO org.apache.spark.SparkContext Running Spark version 1.3.0 WARN o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable INFO org.apache.spark.SecurityManager Changing view acls to: jon INFO org.apache.spark.SecurityManager Changing modify acls to: jon INFO org.apache.spark.SecurityManager SecurityManager: authentication disabled; ui
[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381862#comment-14381862 ] Cheng Lian commented on SPARK-6554: --- Parquet filter push-down isn't enabled by default in 1.3.0 because the most recent Parquet version (1.6.0rc3) up until Spark 1.3.0 release suffers from two bugs (PARQUET-136 PARQUET-173). So it's generally not recommended to be used in production yet. These two bugs have been fixed in Parquet master, and the official 1.6.0 release should be out pretty soon. We probably will upgrade to Parquet 1.6.0 in Spark 1.4.0. Cannot use partition columns in where clause when Parquet filter push-down is enabled - Key: SPARK-6554 URL: https://issues.apache.org/jira/browse/SPARK-6554 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jon Chase Assignee: Cheng Lian Priority: Critical I'm having trouble referencing partition columns in my queries with Parquet. In the following example, 'probeTypeId' is a partition column. For example, the directory structure looks like this: {noformat} /mydata /probeTypeId=1 ...files... /probeTypeId=2 ...files... {noformat} I see the column when I reference load a DF using the /mydata directory and call df.printSchema(): {noformat} |-- probeTypeId: integer (nullable = true) {noformat} Parquet is also aware of the column: {noformat} optional int32 probeTypeId; {noformat} And this works fine: {code} sqlContext.sql(select probeTypeId from df limit 1); {code} ...as does {{df.show()}} - it shows the correct values for the partition column. However, when I try to use a partition column in a where clause, I get an exception stating that the column was not found in the schema: {noformat} sqlContext.sql(select probeTypeId from df where probeTypeId = 1 limit 1); ... ... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema! at parquet.Preconditions.checkArgument(Preconditions.java:47) at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) ... ... {noformat} Here's the full stack trace: {noformat} using local[*] for master 06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender] 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT] 06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT] 06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration. 06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current configuration as safe fallback point INFO org.apache.spark.SparkContext Running Spark version 1.3.0 WARN o.a.hadoop.util.NativeCodeLoader Unable to load
[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381831#comment-14381831 ] Jon Chase commented on SPARK-6554: -- spark.sql.parquet.filterPushdown was the problem. Leaving it set to false works around the problem for now. Thanks for jumping on this. Cannot use partition columns in where clause when Parquet filter push-down is enabled - Key: SPARK-6554 URL: https://issues.apache.org/jira/browse/SPARK-6554 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jon Chase Assignee: Cheng Lian Priority: Critical I'm having trouble referencing partition columns in my queries with Parquet. In the following example, 'probeTypeId' is a partition column. For example, the directory structure looks like this: {noformat} /mydata /probeTypeId=1 ...files... /probeTypeId=2 ...files... {noformat} I see the column when I reference load a DF using the /mydata directory and call df.printSchema(): {noformat} |-- probeTypeId: integer (nullable = true) {noformat} Parquet is also aware of the column: {noformat} optional int32 probeTypeId; {noformat} And this works fine: {code} sqlContext.sql(select probeTypeId from df limit 1); {code} ...as does {{df.show()}} - it shows the correct values for the partition column. However, when I try to use a partition column in a where clause, I get an exception stating that the column was not found in the schema: {noformat} sqlContext.sql(select probeTypeId from df where probeTypeId = 1 limit 1); ... ... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema! at parquet.Preconditions.checkArgument(Preconditions.java:47) at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) ... ... {noformat} Here's the full stack trace: {noformat} using local[*] for master 06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender] 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT] 06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT] 06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration. 06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current configuration as safe fallback point INFO org.apache.spark.SparkContext Running Spark version 1.3.0 WARN o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable INFO org.apache.spark.SecurityManager Changing view acls to: jon INFO org.apache.spark.SecurityManager Changing modify acls to: jon INFO org.apache.spark.SecurityManager SecurityManager:
[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381814#comment-14381814 ] Apache Spark commented on SPARK-6554: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/5210 Cannot use partition columns in where clause when Parquet filter push-down is enabled - Key: SPARK-6554 URL: https://issues.apache.org/jira/browse/SPARK-6554 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jon Chase Assignee: Cheng Lian Priority: Critical I'm having trouble referencing partition columns in my queries with Parquet. In the following example, 'probeTypeId' is a partition column. For example, the directory structure looks like this: {noformat} /mydata /probeTypeId=1 ...files... /probeTypeId=2 ...files... {noformat} I see the column when I reference load a DF using the /mydata directory and call df.printSchema(): {noformat} |-- probeTypeId: integer (nullable = true) {noformat} Parquet is also aware of the column: {noformat} optional int32 probeTypeId; {noformat} And this works fine: {code} sqlContext.sql(select probeTypeId from df limit 1); {code} ...as does {{df.show()}} - it shows the correct values for the partition column. However, when I try to use a partition column in a where clause, I get an exception stating that the column was not found in the schema: {noformat} sqlContext.sql(select probeTypeId from df where probeTypeId = 1 limit 1); ... ... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema! at parquet.Preconditions.checkArgument(Preconditions.java:47) at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) ... ... {noformat} Here's the full stack trace: {noformat} using local[*] for master 06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender] 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT] 06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT] 06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration. 06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current configuration as safe fallback point INFO org.apache.spark.SparkContext Running Spark version 1.3.0 WARN o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable INFO org.apache.spark.SecurityManager Changing view acls to: jon INFO org.apache.spark.SecurityManager Changing modify acls to: jon INFO org.apache.spark.SecurityManager SecurityManager: authentication disabled; ui acls disabled;