[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-6554:
-----------------------------------

    Assignee: Apache Spark  (was: Cheng Lian)

Cannot use partition columns in where clause when Parquet filter push-down is enabled
-------------------------------------------------------------------------------------

                 Key: SPARK-6554
                 URL: https://issues.apache.org/jira/browse/SPARK-6554
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.0
            Reporter: Jon Chase
            Assignee: Apache Spark
            Priority: Critical

I'm having trouble referencing partition columns in my queries with Parquet. In the following example, 'probeTypeId' is a partition column, i.e. the directory structure looks like this:

{noformat}
/mydata
    /probeTypeId=1
        ...files...
    /probeTypeId=2
        ...files...
{noformat}

I see the column when I load a DataFrame from the /mydata directory and call df.printSchema():

{noformat}
 |-- probeTypeId: integer (nullable = true)
{noformat}

Parquet is also aware of the column:

{noformat}
optional int32 probeTypeId;
{noformat}

And this works fine:

{code}
sqlContext.sql("select probeTypeId from df limit 1");
{code}

...as does {{df.show()}} - it shows the correct values for the partition column.

However, when I try to use a partition column in a where clause, I get an exception stating that the column was not found in the schema:

{noformat}
sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
...
...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
	at parquet.Preconditions.checkArgument(Preconditions.java:47)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
	...
	...
{noformat}
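For reference, a minimal, self-contained reproduction of the steps above - a sketch only, assuming the /mydata layout shown earlier; the class name is illustrative, and spark.sql.parquet.filterPushdown (off by default in 1.3.0) is switched on explicitly since, per the issue title, the failure only appears with push-down enabled:

{code}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PartitionPushdownRepro {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local[*]").setAppName("SPARK-6554"));
        SQLContext sqlContext = new SQLContext(sc);

        // Push-down is disabled by default in 1.3.0; the failure only
        // shows up when it is turned on.
        sqlContext.setConf("spark.sql.parquet.filterPushdown", "true");

        // /mydata is partitioned by probeTypeId (hypothetical path from above).
        DataFrame df = sqlContext.parquetFile("/mydata");
        df.registerTempTable("df");
        df.printSchema();   // probeTypeId shows up as integer (nullable = true)

        // Works: no predicate on the partition column.
        sqlContext.sql("select probeTypeId from df limit 1").show();

        // Fails with IllegalArgumentException:
        // Column [probeTypeId] was not found in schema!
        sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1").show();

        sc.stop();
    }
}
{code}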
Here's the full stack trace:

{noformat}
using local[*] for master
06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO
06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current configuration as safe fallback point
INFO org.apache.spark.SparkContext Running Spark version 1.3.0
WARN o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO org.apache.spark.SecurityManager Changing view acls to: jon
INFO org.apache.spark.SecurityManager Changing modify acls to: jon
INFO org.apache.spark.SecurityManager SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jon); users with modify permissions: Set(jon)
INFO akka.event.slf4j.Slf4jLogger Slf4jLogger started
INFO Remoting Starting remoting
INFO Remoting Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.134:62493]
INFO org.apache.spark.util.Utils Successfully started service 'sparkDriver' on port 62493.
INFO org.apache.spark.SparkEnv Registering MapOutputTracker
INFO org.apache.spark.SparkEnv Registering BlockManagerMaster
INFO o.a.spark.storage.DiskBlockManager Created local directory at /var/folders/x7/9hdp8kw9569864088tsl4jmm0000gn/T/spark-150e23b2-ff19-4a51-8cfc-25fb8e1b3f2b/blockmgr-6eea286c-7473-4bda-8886-7250156b68f4
INFO org.apache.spark.storage.MemoryStore MemoryStore started with capacity 1966.1 MB
INFO org.apache.spark.HttpFileServer HTTP File server directory is /var/folders/x7/9hdp8kw9569864088tsl4jmm0000gn/T/spark-cf4687bd-1563-4ddf-b697-21c96fd95561/httpd-6343b9c9-bb66-43ac-ac43-6da80c7a1f95
INFO org.apache.spark.HttpServer Starting HTTP Server
INFO o.spark-project.jetty.server.Server jetty-8.y.z-SNAPSHOT
INFO o.s.jetty.server.AbstractConnector Started SocketConnector@0.0.0.0:62494
INFO org.apache.spark.util.Utils Successfully started service 'HTTP file server' on port 62494.
INFO org.apache.spark.SparkEnv Registering OutputCommitCoordinator
INFO o.spark-project.jetty.server.Server jetty-8.y.z-SNAPSHOT
INFO o.s.jetty.server.AbstractConnector Started SelectChannelConnector@0.0.0.0:4040
INFO org.apache.spark.util.Utils Successfully started service 'SparkUI' on port 4040.
INFO org.apache.spark.ui.SparkUI Started SparkUI at http://192.168.1.134:4040
INFO org.apache.spark.executor.Executor Starting executor ID <driver> on host localhost
INFO org.apache.spark.util.AkkaUtils Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.1.134:62493/user/HeartbeatReceiver
INFO o.a.s.n.n.NettyBlockTransferService Server created on 62495
INFO o.a.spark.storage.BlockManagerMaster Trying to register BlockManager
INFO o.a.s.s.BlockManagerMasterActor Registering block manager localhost:62495 with 1966.1 MB RAM, BlockManagerId(<driver>, localhost, 62495)
INFO o.a.spark.storage.BlockManagerMaster Registered BlockManager
INFO o.a.h.conf.Configuration.deprecation mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
INFO o.a.h.conf.Configuration.deprecation mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
INFO o.a.h.conf.Configuration.deprecation mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
INFO o.a.h.conf.Configuration.deprecation mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
INFO o.a.h.conf.Configuration.deprecation mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
INFO o.a.h.conf.Configuration.deprecation mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
INFO o.a.h.conf.Configuration.deprecation mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
INFO o.a.h.conf.Configuration.deprecation mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
INFO o.a.h.hive.metastore.HiveMetaStore 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
INFO o.a.h.hive.metastore.ObjectStore ObjectStore, initialize called
INFO DataNucleus.Persistence Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
INFO DataNucleus.Persistence Property datanucleus.cache.level2 unknown - will be ignored
INFO o.a.h.hive.metastore.ObjectStore Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
INFO o.a.h.h.metastore.MetaStoreDirectSql MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5.  Encountered: "@" (64), after : "".
INFO DataNucleus.Datastore The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
INFO DataNucleus.Datastore The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
INFO DataNucleus.Datastore The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
INFO DataNucleus.Datastore The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
INFO DataNucleus.Query Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
INFO o.a.h.hive.metastore.ObjectStore Initialized ObjectStore
INFO o.a.h.hive.metastore.HiveMetaStore Added admin role in metastore
INFO o.a.h.hive.metastore.HiveMetaStore Added public role in metastore
INFO o.a.h.hive.metastore.HiveMetaStore No user is added in admin role, since config is empty
INFO o.a.h.hive.ql.session.SessionState No Tez session required at this point. hive.execution.engine=mr.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
root
 |-- clientMarketId: integer (nullable = true)
 |-- clientCountryId: integer (nullable = true)
 |-- clientRegionId: integer (nullable = true)
 |-- clientStateId: integer (nullable = true)
 |-- clientAsnId: integer (nullable = true)
 |-- reporterZoneId: integer (nullable = true)
 |-- reporterCustomerId: integer (nullable = true)
 |-- responseCode: integer (nullable = true)
 |-- measurementValue: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- providerOwnerZoneId: integer (nullable = true)
 |-- providerOwnerCustomerId: integer (nullable = true)
 |-- providerId: integer (nullable = true)
 |-- probeTypeId: integer (nullable = true)
======================================================
INFO hive.ql.parse.ParseDriver Parsing command: select probeTypeId from df where probeTypeId = 1 limit 1
INFO hive.ql.parse.ParseDriver Parse Completed
==== results for select probeTypeId from df where probeTypeId = 1 limit 1
======================================================
INFO o.a.s.sql.parquet.ParquetRelation2 Reading 33.33333333333333% of partitions
INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(191336) called with curMem=0, maxMem=2061647216
INFO org.apache.spark.storage.MemoryStore Block broadcast_0 stored as values in memory (estimated size 186.9 KB, free 1966.0 MB)
INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(27530) called with curMem=191336, maxMem=2061647216
INFO org.apache.spark.storage.MemoryStore Block broadcast_0_piece0 stored as bytes in memory (estimated size 26.9 KB, free 1965.9 MB)
INFO o.a.spark.storage.BlockManagerInfo Added broadcast_0_piece0 in memory on localhost:62495 (size: 26.9 KB, free: 1966.1 MB)
INFO o.a.spark.storage.BlockManagerMaster Updated info of block broadcast_0_piece0
INFO org.apache.spark.SparkContext Created broadcast 0 from NewHadoopRDD at newParquet.scala:447
INFO o.a.h.conf.Configuration.deprecation mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
INFO o.a.h.conf.Configuration.deprecation mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
INFO o.a.s.s.p.ParquetRelation2$$anon$1$$anon$2 Using Task Side Metadata Split Strategy
INFO org.apache.spark.SparkContext Starting job: runJob at SparkPlan.scala:121
INFO o.a.spark.scheduler.DAGScheduler Got job 0 (runJob at SparkPlan.scala:121) with 1 output partitions (allowLocal=false)
INFO o.a.spark.scheduler.DAGScheduler Final stage: Stage 0(runJob at SparkPlan.scala:121)
INFO o.a.spark.scheduler.DAGScheduler Parents of final stage: List()
INFO o.a.spark.scheduler.DAGScheduler Missing parents: List()
INFO o.a.spark.scheduler.DAGScheduler Submitting Stage 0 (MapPartitionsRDD[3] at map at SparkPlan.scala:96), which has no missing parents
INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(5512) called with curMem=218866, maxMem=2061647216
INFO org.apache.spark.storage.MemoryStore Block broadcast_1 stored as values in memory (estimated size 5.4 KB, free 1965.9 MB)
INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(3754) called with curMem=224378, maxMem=2061647216
INFO org.apache.spark.storage.MemoryStore Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.7 KB, free 1965.9 MB)
INFO o.a.spark.storage.BlockManagerInfo Added broadcast_1_piece0 in memory on localhost:62495 (size: 3.7 KB, free: 1966.1 MB)
INFO o.a.spark.storage.BlockManagerMaster Updated info of block broadcast_1_piece0
INFO org.apache.spark.SparkContext Created broadcast 1 from broadcast at DAGScheduler.scala:839
INFO o.a.spark.scheduler.DAGScheduler Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[3] at map at SparkPlan.scala:96)
INFO o.a.s.scheduler.TaskSchedulerImpl Adding task set 0.0 with 1 tasks
INFO o.a.spark.scheduler.TaskSetManager Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1687 bytes)
INFO org.apache.spark.executor.Executor Running task 0.0 in stage 0.0 (TID 0)
INFO o.a.s.s.p.ParquetRelation2$$anon$1 Input split: ParquetInputSplit{part: file:/Users/jon/Downloads/sparksql/1partitionsminusgeo/year=2015/month=1/day=14/providerOwnerZoneId=0/providerOwnerCustomerId=0/providerId=287/probeTypeId=1/part-r-00001.parquet start: 0 end: 8851183 length: 8851183 hosts: [] requestedSchema: message root {
  optional int32 probeTypeId;
}
 readSupportMetadata: {org.apache.spark.sql.parquet.row.requested_schema={"type":"struct","fields":[{"name":"probeTypeId","type":"integer","nullable":true,"metadata":{}}]},
org.apache.spark.sql.parquet.row.metadata={"type":"struct","fields":[{"name":"clientMarketId","type":"integer","nullable":true,"metadata":{}},{"name":"clientCountryId","type":"integer","nullable":true,"metadata":{}},{"name":"clientRegionId","type":"integer","nullable":true,"metadata":{}},{"name":"clientStateId","type":"integer","nullable":true,"metadata":{}},{"name":"clientAsnId","type":"integer","nullable":true,"metadata":{}},{"name":"reporterZoneId","type":"integer","nullable":true,"metadata":{}},{"name":"reporterCustomerId","type":"integer","nullable":true,"metadata":{}},{"name":"responseCode","type":"integer","nullable":true,"metadata":{}},{"name":"measurementValue","type":"integer","nullable":true,"metadata":{}},{"name":"year","type":"integer","nullable":true,"metadata":{}},{"name":"month","type":"integer","nullable":true,"metadata":{}},{"name":"day","type":"integer","nullable":true,"metadata":{}},{"name":"providerOwnerZoneId","type":"integer","nullable":true,"metadata":{}},{"name":"providerOwnerCustomerId","type":"integer","nullable":true,"metadata":{}},{"name":"providerId","type":"integer","nullable":true,"metadata":{}},{"name":"probeTypeId","type":"integer","nullable":true,"metadata":{}}]}}}
ERROR org.apache.spark.executor.Executor Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
	at parquet.Preconditions.checkArgument(Preconditions.java:47) ~[parquet-common-1.6.0rc3.jar:na]
	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) ~[parquet-column-1.6.0rc3.jar:na]
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) ~[parquet-column-1.6.0rc3.jar:na]
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) ~[parquet-column-1.6.0rc3.jar:na]
	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) ~[parquet-column-1.6.0rc3.jar:na]
	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) ~[parquet-column-1.6.0rc3.jar:na]
	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) ~[parquet-column-1.6.0rc3.jar:na]
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46) ~[parquet-column-1.6.0rc3.jar:na]
	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41) ~[parquet-hadoop-1.6.0rc3.jar:na]
	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) ~[parquet-hadoop-1.6.0rc3.jar:na]
	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) ~[parquet-column-1.6.0rc3.jar:na]
	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) ~[parquet-hadoop-1.6.0rc3.jar:na]
	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) ~[parquet-hadoop-1.6.0rc3.jar:na]
	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) ~[parquet-hadoop-1.6.0rc3.jar:na]
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.scheduler.Task.run(Task.scala:64) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) ~[spark-core_2.10-1.3.0.jar:1.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_31]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_31]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
WARN o.a.spark.scheduler.TaskSetManager Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
	at parquet.Preconditions.checkArgument(Preconditions.java:47)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
ERROR o.a.spark.scheduler.TaskSetManager Task 0 in stage 0.0 failed 1 times; aborting job
INFO o.a.s.scheduler.TaskSchedulerImpl Removed TaskSet 0.0, whose tasks have all completed, from pool
INFO o.a.s.scheduler.TaskSchedulerImpl Cancelling stage 0
INFO o.a.spark.scheduler.DAGScheduler Job 0 failed: runJob at SparkPlan.scala:121, took 0.132538 s
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/static,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/executors/json,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/executors,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/environment/json,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/environment,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/storage/json,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/storage,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/json,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
INFO o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/jobs,null}
INFO org.apache.spark.ui.SparkUI Stopped Spark web UI at http://192.168.1.134:4040
INFO o.a.spark.scheduler.DAGScheduler Stopping DAGScheduler
INFO o.a.s.MapOutputTrackerMasterActor MapOutputTrackerActor stopped!
INFO org.apache.spark.storage.MemoryStore MemoryStore cleared
INFO o.apache.spark.storage.BlockManager BlockManager stopped
INFO o.a.spark.storage.BlockManagerMaster BlockManagerMaster stopped
INFO o.a.s.s.OutputCommitCoordinator$OutputCommitCoordinatorActor OutputCommitCoordinator stopped!
INFO org.apache.spark.SparkContext Successfully stopped SparkContext
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
	at parquet.Preconditions.checkArgument(Preconditions.java:47)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Mar 26, 2015 6:06:02 AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: eq(probeTypeId, 1)
Mar 26, 2015 6:06:02 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Mar 26, 2015 6:06:02 AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: eq(probeTypeId, 1)

Process finished with exit code 255
{noformat}
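The input split logged above hints at the mechanics: the predicate eq(probeTypeId, 1) is pushed down and validated against the Parquet file's own schema, but partition columns such as probeTypeId live only in the directory names (probeTypeId=1/), not inside the files, so SchemaCompatibilityValidator cannot find the column. Until a fix lands (presumably by not pushing predicates on partition columns down to the Parquet reader), a possible workaround - a sketch only, not verified in every configuration - is to turn push-down back off so Spark evaluates the predicate itself:

{code}
// Workaround sketch: with push-down disabled, the partition-column predicate
// is applied by Spark (which derives partition values from the directory
// layout) rather than by the Parquet record reader.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false");
sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1").show();
{code}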