Spark 1.3.0, Parquet

I'm having trouble referencing partition columns in my queries.

In the following example, 'probeTypeId' is a partition column, and the
directory structure looks like this:

/mydata
    /probeTypeId=1
        ...files...
    /probeTypeId=2
        ...files...

I see the column when I load a DataFrame from the /mydata directory and
call df.printSchema():

...
 |-- probeTypeId: integer (nullable = true)
...
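
For context, here's roughly how the DataFrame is loaded and registered.
This is a minimal sketch with the path simplified to /mydata; the real
code is equivalent:

import org.apache.spark.sql.DataFrame;

// Spark 1.3: load the partitioned Parquet directory. Partition discovery
// derives probeTypeId (and any other path-encoded columns) from the
// directory names and adds them to the schema.
DataFrame df = sqlContext.load("/mydata", "parquet");
df.registerTempTable("df");  // lets sqlContext.sql(...) refer to it as "df"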

Parquet is also aware of the column:
     optional int32 probeTypeId;

And this works fine:

sqlContext.sql("select probeTypeId from df limit 1");

...as does df.show() - it shows the correct values for the partition column.


However, when I try to use a partition column in a where clause, I get an
exception stating that the column was not found in the schema:

sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");



...
...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
at parquet.Preconditions.checkArgument(Preconditions.java:47)
at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
...
...

What am I doing wrong?
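
For reference, here's the same filter expressed through the DataFrame API.
I'm assuming it produces the same pushed-down predicate (the log below
shows eq(probeTypeId, 1)), but I haven't confirmed whether it fails the
same way:

// Equivalent of: select probeTypeId from df where probeTypeId = 1 limit 1
df.filter(df.col("probeTypeId").equalTo(1))
  .select("probeTypeId")
  .limit(1)
  .show();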
Here's the full stack trace:

using local[*] for master
06:05:55,675 |-INFO in
ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute
not set
06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction -
About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction -
Naming appender as [STDOUT]
06:05:55,721 |-INFO in
ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default
type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder]
property
06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction
- Setting level of ROOT logger to INFO
06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction -
Attaching appender named [STDOUT] to Logger[ROOT]
06:05:55,769 |-INFO in
ch.qos.logback.classic.joran.action.ConfigurationAction - End of
configuration.
06:05:55,770 |-INFO in
ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering
current configuration as safe fallback point

INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
INFO  org.apache.spark.SecurityManager Changing view acls to: jon
INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
INFO  org.apache.spark.SecurityManager SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(jon); users
with modify permissions: Set(jon)
INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
INFO  Remoting Starting remoting
INFO  Remoting Remoting started; listening on addresses :[akka.tcp://
sparkDriver@192.168.1.134:62493]
INFO  org.apache.spark.util.Utils Successfully started service
'sparkDriver' on port 62493.
INFO  org.apache.spark.SparkEnv Registering MapOutputTracker
INFO  org.apache.spark.SparkEnv Registering BlockManagerMaster
INFO  o.a.spark.storage.DiskBlockManager Created local directory at
/var/folders/x7/9hdp8kw9569864088tsl4jmm0000gn/T/spark-150e23b2-ff19-4a51-8cfc-25fb8e1b3f2b/blockmgr-6eea286c-7473-4bda-8886-7250156b68f4
INFO  org.apache.spark.storage.MemoryStore MemoryStore started with
capacity 1966.1 MB
INFO  org.apache.spark.HttpFileServer HTTP File server directory is
/var/folders/x7/9hdp8kw9569864088tsl4jmm0000gn/T/spark-cf4687bd-1563-4ddf-b697-21c96fd95561/httpd-6343b9c9-bb66-43ac-ac43-6da80c7a1f95
INFO  org.apache.spark.HttpServer Starting HTTP Server
INFO  o.spark-project.jetty.server.Server jetty-8.y.z-SNAPSHOT
INFO  o.s.jetty.server.AbstractConnector Started
SocketConnector@0.0.0.0:62494
INFO  org.apache.spark.util.Utils Successfully started service 'HTTP file
server' on port 62494.
INFO  org.apache.spark.SparkEnv Registering OutputCommitCoordinator
INFO  o.spark-project.jetty.server.Server jetty-8.y.z-SNAPSHOT
INFO  o.s.jetty.server.AbstractConnector Started
SelectChannelConnector@0.0.0.0:4040
INFO  org.apache.spark.util.Utils Successfully started service 'SparkUI' on
port 4040.
INFO  org.apache.spark.ui.SparkUI Started SparkUI at
http://192.168.1.134:4040
INFO  org.apache.spark.executor.Executor Starting executor ID <driver> on
host localhost
INFO  org.apache.spark.util.AkkaUtils Connecting to HeartbeatReceiver:
akka.tcp://sparkDriver@192.168.1.134:62493/user/HeartbeatReceiver
INFO  o.a.s.n.n.NettyBlockTransferService Server created on 62495
INFO  o.a.spark.storage.BlockManagerMaster Trying to register BlockManager
INFO  o.a.s.s.BlockManagerMasterActor Registering block manager
localhost:62495 with 1966.1 MB RAM, BlockManagerId(<driver>, localhost,
62495)
INFO  o.a.spark.storage.BlockManagerMaster Registered BlockManager
INFO  o.a.h.conf.Configuration.deprecation mapred.max.split.size is
deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
INFO  o.a.h.conf.Configuration.deprecation
mapred.reduce.tasks.speculative.execution is deprecated. Instead, use
mapreduce.reduce.speculative
INFO  o.a.h.conf.Configuration.deprecation
mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use
mapreduce.job.committer.setup.cleanup.needed
INFO  o.a.h.conf.Configuration.deprecation mapred.min.split.size.per.rack
is deprecated. Instead, use
mapreduce.input.fileinputformat.split.minsize.per.rack
INFO  o.a.h.conf.Configuration.deprecation mapred.min.split.size is
deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
INFO  o.a.h.conf.Configuration.deprecation mapred.min.split.size.per.node
is deprecated. Instead, use
mapreduce.input.fileinputformat.split.minsize.per.node
INFO  o.a.h.conf.Configuration.deprecation mapred.reduce.tasks is
deprecated. Instead, use mapreduce.job.reduces
INFO  o.a.h.conf.Configuration.deprecation mapred.input.dir.recursive is
deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
INFO  o.a.h.hive.metastore.HiveMetaStore 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
INFO  o.a.h.hive.metastore.ObjectStore ObjectStore, initialize called
INFO  DataNucleus.Persistence Property hive.metastore.integral.jdo.pushdown
unknown - will be ignored
INFO  DataNucleus.Persistence Property datanucleus.cache.level2 unknown -
will be ignored
INFO  o.a.h.hive.metastore.ObjectStore Setting MetaStore object pin classes
with
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
INFO  o.a.h.h.metastore.MetaStoreDirectSql MySQL check failed, assuming we
are not on mysql: Lexical error at line 1, column 5.  Encountered: "@"
(64), after : "".
INFO  DataNucleus.Datastore The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
INFO  DataNucleus.Datastore The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
"embedded-only" so does not have its own datastore table.
INFO  DataNucleus.Datastore The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
INFO  DataNucleus.Datastore The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
"embedded-only" so does not have its own datastore table.
INFO  DataNucleus.Query Reading in results for query
"org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is
closing
INFO  o.a.h.hive.metastore.ObjectStore Initialized ObjectStore
INFO  o.a.h.hive.metastore.HiveMetaStore Added admin role in metastore
INFO  o.a.h.hive.metastore.HiveMetaStore Added public role in metastore
INFO  o.a.h.hive.metastore.HiveMetaStore No user is added in admin role,
since config is empty
INFO  o.a.h.hive.ql.session.SessionState No Tez session required at this
point. hive.execution.engine=mr.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
details.
root
 |-- clientMarketId: integer (nullable = true)
 |-- clientCountryId: integer (nullable = true)
 |-- clientRegionId: integer (nullable = true)
 |-- clientStateId: integer (nullable = true)
 |-- clientAsnId: integer (nullable = true)
 |-- reporterZoneId: integer (nullable = true)
 |-- reporterCustomerId: integer (nullable = true)
 |-- responseCode: integer (nullable = true)
 |-- measurementValue: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- providerOwnerZoneId: integer (nullable = true)
 |-- providerOwnerCustomerId: integer (nullable = true)
 |-- providerId: integer (nullable = true)
 |-- probeTypeId: integer (nullable = true)

======================================================
INFO  hive.ql.parse.ParseDriver Parsing command: select probeTypeId from df
where probeTypeId = 1 limit 1
INFO  hive.ql.parse.ParseDriver Parse Completed
==== results for select probeTypeId from df where probeTypeId = 1 limit 1
======================================================
INFO  o.a.s.sql.parquet.ParquetRelation2 Reading 33.33333333333333% of
partitions
INFO  org.apache.spark.storage.MemoryStore ensureFreeSpace(191336) called
with curMem=0, maxMem=2061647216
INFO  org.apache.spark.storage.MemoryStore Block broadcast_0 stored as
values in memory (estimated size 186.9 KB, free 1966.0 MB)
INFO  org.apache.spark.storage.MemoryStore ensureFreeSpace(27530) called
with curMem=191336, maxMem=2061647216
INFO  org.apache.spark.storage.MemoryStore Block broadcast_0_piece0 stored
as bytes in memory (estimated size 26.9 KB, free 1965.9 MB)
INFO  o.a.spark.storage.BlockManagerInfo Added broadcast_0_piece0 in memory
on localhost:62495 (size: 26.9 KB, free: 1966.1 MB)
INFO  o.a.spark.storage.BlockManagerMaster Updated info of block
broadcast_0_piece0
INFO  org.apache.spark.SparkContext Created broadcast 0 from NewHadoopRDD
at newParquet.scala:447
INFO  o.a.h.conf.Configuration.deprecation mapred.max.split.size is
deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
INFO  o.a.h.conf.Configuration.deprecation mapred.min.split.size is
deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
INFO  o.a.s.s.p.ParquetRelation2$$anon$1$$anon$2 Using Task Side Metadata
Split Strategy
INFO  org.apache.spark.SparkContext Starting job: runJob at
SparkPlan.scala:121
INFO  o.a.spark.scheduler.DAGScheduler Got job 0 (runJob at
SparkPlan.scala:121) with 1 output partitions (allowLocal=false)
INFO  o.a.spark.scheduler.DAGScheduler Final stage: Stage 0(runJob at
SparkPlan.scala:121)
INFO  o.a.spark.scheduler.DAGScheduler Parents of final stage: List()
INFO  o.a.spark.scheduler.DAGScheduler Missing parents: List()
INFO  o.a.spark.scheduler.DAGScheduler Submitting Stage 0
(MapPartitionsRDD[3] at map at SparkPlan.scala:96), which has no missing
parents
INFO  org.apache.spark.storage.MemoryStore ensureFreeSpace(5512) called
with curMem=218866, maxMem=2061647216
INFO  org.apache.spark.storage.MemoryStore Block broadcast_1 stored as
values in memory (estimated size 5.4 KB, free 1965.9 MB)
INFO  org.apache.spark.storage.MemoryStore ensureFreeSpace(3754) called
with curMem=224378, maxMem=2061647216
INFO  org.apache.spark.storage.MemoryStore Block broadcast_1_piece0 stored
as bytes in memory (estimated size 3.7 KB, free 1965.9 MB)
INFO  o.a.spark.storage.BlockManagerInfo Added broadcast_1_piece0 in memory
on localhost:62495 (size: 3.7 KB, free: 1966.1 MB)
INFO  o.a.spark.storage.BlockManagerMaster Updated info of block
broadcast_1_piece0
INFO  org.apache.spark.SparkContext Created broadcast 1 from broadcast at
DAGScheduler.scala:839
INFO  o.a.spark.scheduler.DAGScheduler Submitting 1 missing tasks from
Stage 0 (MapPartitionsRDD[3] at map at SparkPlan.scala:96)
INFO  o.a.s.scheduler.TaskSchedulerImpl Adding task set 0.0 with 1 tasks
INFO  o.a.spark.scheduler.TaskSetManager Starting task 0.0 in stage 0.0
(TID 0, localhost, PROCESS_LOCAL, 1687 bytes)
INFO  org.apache.spark.executor.Executor Running task 0.0 in stage 0.0 (TID
0)
INFO  o.a.s.s.p.ParquetRelation2$$anon$1 Input split:
ParquetInputSplit{part:
file:/Users/jon/Downloads/sparksql/1partitionsminusgeo/year=2015/month=1/day=14/providerOwnerZoneId=0/providerOwnerCustomerId=0/providerId=287/probeTypeId=1/part-r-00001.parquet
start: 0 end: 8851183 length: 8851183 hosts: [] requestedSchema: message
root {
  optional int32 probeTypeId;
}
 readSupportMetadata:
{org.apache.spark.sql.parquet.row.requested_schema={"type":"struct","fields":[{"name":"probeTypeId","type":"integer","nullable":true,"metadata":{}}]},
org.apache.spark.sql.parquet.row.metadata={"type":"struct","fields":[{"name":"clientMarketId","type":"integer","nullable":true,"metadata":{}},{"name":"clientCountryId","type":"integer","nullable":true,"metadata":{}},{"name":"clientRegionId","type":"integer","nullable":true,"metadata":{}},{"name":"clientStateId","type":"integer","nullable":true,"metadata":{}},{"name":"clientAsnId","type":"integer","nullable":true,"metadata":{}},{"name":"reporterZoneId","type":"integer","nullable":true,"metadata":{}},{"name":"reporterCustomerId","type":"integer","nullable":true,"metadata":{}},{"name":"responseCode","type":"integer","nullable":true,"metadata":{}},{"name":"measurementValue","type":"integer","nullable":true,"metadata":{}},{"name":"year","type":"integer","nullable":true,"metadata":{}},{"name":"month","type":"integer","nullable":true,"metadata":{}},{"name":"day","type":"integer","nullable":true,"metadata":{}},{"name":"providerOwnerZoneId","type":"integer","nullable":true,"metadata":{}},{"name":"providerOwnerCustomerId","type":"integer","nullable":true,"metadata":{}},{"name":"providerId","type":"integer","nullable":true,"metadata":{}},{"name":"probeTypeId","type":"integer","nullable":true,"metadata":{}}]}}}
ERROR org.apache.spark.executor.Executor Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
at parquet.Preconditions.checkArgument(Preconditions.java:47) ~[parquet-common-1.6.0rc3.jar:na]
at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) ~[parquet-column-1.6.0rc3.jar:na]
at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) ~[parquet-column-1.6.0rc3.jar:na]
at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) ~[parquet-column-1.6.0rc3.jar:na]
at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) ~[parquet-column-1.6.0rc3.jar:na]
at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) ~[parquet-column-1.6.0rc3.jar:na]
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) ~[parquet-column-1.6.0rc3.jar:na]
at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46) ~[parquet-column-1.6.0rc3.jar:na]
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41) ~[parquet-hadoop-1.6.0rc3.jar:na]
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) ~[parquet-hadoop-1.6.0rc3.jar:na]
at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) ~[parquet-column-1.6.0rc3.jar:na]
at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) ~[parquet-hadoop-1.6.0rc3.jar:na]
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) ~[parquet-hadoop-1.6.0rc3.jar:na]
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) ~[parquet-hadoop-1.6.0rc3.jar:na]
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.scheduler.Task.run(Task.scala:64) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) ~[spark-core_2.10-1.3.0.jar:1.3.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_31]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_31]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
WARN  o.a.spark.scheduler.TaskSetManager Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
at parquet.Preconditions.checkArgument(Preconditions.java:47)
at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR o.a.spark.scheduler.TaskSetManager Task 0 in stage 0.0 failed 1
times; aborting job
INFO  o.a.s.scheduler.TaskSchedulerImpl Removed TaskSet 0.0, whose tasks
have all completed, from pool
INFO  o.a.s.scheduler.TaskSchedulerImpl Cancelling stage 0
INFO  o.a.spark.scheduler.DAGScheduler Job 0 failed: runJob at
SparkPlan.scala:121, took 0.132538 s
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/metrics/json,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/static,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/executors/threadDump,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/executors/json,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/executors,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/environment/json,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/environment,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/storage/rdd,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/storage/json,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/storage,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/stages/pool/json,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/stages/pool,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/stages/stage/json,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/stages/stage,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/stages/json,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/stages,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/jobs/job/json,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/jobs/job,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/jobs/json,null}
INFO  o.s.j.server.handler.ContextHandler stopped
o.s.j.s.ServletContextHandler{/jobs,null}
INFO  org.apache.spark.ui.SparkUI Stopped Spark web UI at
http://192.168.1.134:4040
INFO  o.a.spark.scheduler.DAGScheduler Stopping DAGScheduler
INFO  o.a.s.MapOutputTrackerMasterActor MapOutputTrackerActor stopped!
INFO  org.apache.spark.storage.MemoryStore MemoryStore cleared
INFO  o.apache.spark.storage.BlockManager BlockManager stopped
INFO  o.a.spark.storage.BlockManagerMaster BlockManagerMaster stopped
INFO  o.a.s.s.OutputCommitCoordinator$OutputCommitCoordinatorActor
OutputCommitCoordinator stopped!
INFO  org.apache.spark.SparkContext Successfully stopped SparkContext

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
at parquet.Preconditions.checkArgument(Preconditions.java:47)
at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Mar 26, 2015 6:06:02 AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: eq(probeTypeId, 1)
Mar 26, 2015 6:06:02 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Mar 26, 2015 6:06:02 AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: eq(probeTypeId, 1)

Process finished with exit code 255
