[ https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582543#comment-14582543 ]
Brad Willard commented on SPARK-8128:
-------------------------------------

I have more logging from the job before it dies:

{quote}
15/06/11 21:27:11 INFO DataSourceStrategy: Selected 7 partitions out of 7, pruned 0.0% partitions.
15/06/11 21:27:11 INFO MemoryStore: ensureFreeSpace(129584) called with curMem=971296, maxMem=27783541555
15/06/11 21:27:11 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 126.5 KB, free 25.9 GB)
15/06/11 21:27:11 INFO MemoryStore: ensureFreeSpace(8973) called with curMem=1100880, maxMem=27783541555
15/06/11 21:27:11 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 8.8 KB, free 25.9 GB)
15/06/11 21:27:11 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 10.61.187.209:55426 (size: 8.8 KB, free: 25.9 GB)
15/06/11 21:27:11 INFO SparkContext: Created broadcast 15 from count at NativeMethodAccessorImpl.java:-2
15/06/11 21:27:11 INFO Exchange: Using SparkSqlSerializer2.
15/06/11 21:27:11 INFO SparkContext: Starting job: count at NativeMethodAccessorImpl.java:-2
15/06/11 21:27:11 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1$$anon$2: Using Task Side Metadata Split Strategy
15/06/11 21:27:11 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1$$anon$2: Using Task Side Metadata Split Strategy
15/06/11 21:27:11 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1$$anon$2: Using Task Side Metadata Split Strategy
15/06/11 21:27:11 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1$$anon$2: Using Task Side Metadata Split Strategy
15/06/11 21:27:11 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1$$anon$2: Using Task Side Metadata Split Strategy
15/06/11 21:27:12 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1$$anon$2: Using Task Side Metadata Split Strategy
15/06/11 21:27:12 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1$$anon$2: Using Task Side Metadata Split Strategy
15/06/11 21:27:12 INFO DAGScheduler: Registering RDD 159 (count at NativeMethodAccessorImpl.java:-2)
15/06/11 21:27:12 INFO DAGScheduler: Got job 6 (count at NativeMethodAccessorImpl.java:-2) with 1 output partitions (allowLocal=false)
15/06/11 21:27:12 INFO DAGScheduler: Final stage: ResultStage 12(count at NativeMethodAccessorImpl.java:-2)
15/06/11 21:27:12 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 11)
15/06/11 21:27:12 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 11)
15/06/11 21:27:12 INFO DAGScheduler: Submitting ShuffleMapStage 11 (MapPartitionsRDD[159] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
15/06/11 21:27:12 INFO MemoryStore: ensureFreeSpace(12608) called with curMem=1109853, maxMem=27783541555
15/06/11 21:27:12 INFO MemoryStore: Block broadcast_16 stored as values in memory (estimated size 12.3 KB, free 25.9 GB)
15/06/11 21:27:12 INFO MemoryStore: ensureFreeSpace(5655) called with curMem=1122461, maxMem=27783541555
15/06/11 21:27:12 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes in memory (estimated size 5.5 KB, free 25.9 GB)
15/06/11 21:27:12 INFO BlockManagerInfo: Added broadcast_16_piece0 in memory on 10.61.187.209:55426 (size: 5.5 KB, free: 25.9 GB)
15/06/11 21:27:12 INFO SparkContext: Created broadcast 16 from broadcast at DAGScheduler.scala:874
15/06/11 21:27:12 INFO DAGScheduler: Submitting 699 missing tasks from ShuffleMapStage 11 (MapPartitionsRDD[159] at count at NativeMethodAccessorImpl.java:-2)
15/06/11 21:27:12 INFO TaskSchedulerImpl: Adding task set 11.0 with 699 tasks
15/06/11 21:27:12 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 4022, 10.236.232.187, ANY, 2251 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 1.0 in stage 11.0 (TID 4023, 10.16.200.220, ANY, 2253 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 2.0 in stage 11.0 (TID 4024, 10.236.232.187, ANY, 2250 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 3.0 in stage 11.0 (TID 4025, 10.16.200.220, ANY, 2254 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 4.0 in stage 11.0 (TID 4026, 10.236.232.187, ANY, 2252 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 5.0 in stage 11.0 (TID 4027, 10.16.200.220, ANY, 2255 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 6.0 in stage 11.0 (TID 4028, 10.236.232.187, ANY, 2255 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 7.0 in stage 11.0 (TID 4029, 10.16.200.220, ANY, 2253 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 8.0 in stage 11.0 (TID 4030, 10.236.232.187, ANY, 2254 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 9.0 in stage 11.0 (TID 4031, 10.16.200.220, ANY, 2252 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 10.0 in stage 11.0 (TID 4032, 10.236.232.187, ANY, 2253 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 11.0 in stage 11.0 (TID 4033, 10.16.200.220, ANY, 2251 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 12.0 in stage 11.0 (TID 4034, 10.236.232.187, ANY, 2253 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 13.0 in stage 11.0 (TID 4035, 10.16.200.220, ANY, 2255 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 14.0 in stage 11.0 (TID 4036, 10.236.232.187, ANY, 2250 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 15.0 in stage 11.0 (TID 4037, 10.16.200.220, ANY, 2253 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 16.0 in stage 11.0 (TID 4038, 10.236.232.187, ANY, 2252 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 17.0 in stage 11.0 (TID 4039, 10.16.200.220, ANY, 2251 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 18.0 in stage 11.0 (TID 4040, 10.236.232.187, ANY, 2252 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 19.0 in stage 11.0 (TID 4041, 10.16.200.220, ANY, 2252 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 20.0 in stage 11.0 (TID 4042, 10.236.232.187, ANY, 2253 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 21.0 in stage 11.0 (TID 4043, 10.16.200.220, ANY, 2252 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 22.0 in stage 11.0 (TID 4044, 10.236.232.187, ANY, 2254 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 23.0 in stage 11.0 (TID 4045, 10.16.200.220, ANY, 2255 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 24.0 in stage 11.0 (TID 4046, 10.236.232.187, ANY, 2253 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 25.0 in stage 11.0 (TID 4047, 10.16.200.220, ANY, 2254 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 26.0 in stage 11.0 (TID 4048, 10.236.232.187, ANY, 2253 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 27.0 in stage 11.0 (TID 4049, 10.16.200.220, ANY, 2252 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 28.0 in stage 11.0 (TID 4050, 10.236.232.187, ANY, 2252 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 29.0 in stage 11.0 (TID 4051, 10.16.200.220, ANY, 2253 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 30.0 in stage 11.0 (TID 4052, 10.236.232.187, ANY, 2254 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 31.0 in stage 11.0 (TID 4053, 10.16.200.220, ANY, 2253 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 32.0 in stage 11.0 (TID 4054, 10.236.232.187, ANY, 2255 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 33.0 in stage 11.0 (TID 4055, 10.16.200.220, ANY, 2252 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 34.0 in stage 11.0 (TID 4056, 10.236.232.187, ANY, 2253 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 35.0 in stage 11.0 (TID 4057, 10.16.200.220, ANY, 2253 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 36.0 in stage 11.0 (TID 4058, 10.236.232.187, ANY, 2254 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 37.0 in stage 11.0 (TID 4059, 10.16.200.220, ANY, 2252 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 38.0 in stage 11.0 (TID 4060, 10.236.232.187, ANY, 2250 bytes)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 39.0 in stage 11.0 (TID 4061, 10.16.200.220, ANY, 2251 bytes)
15/06/11 21:27:12 INFO BlockManagerInfo: Added broadcast_16_piece0 in memory on 10.236.232.187:43792 (size: 5.5 KB, free: 20.7 GB)
15/06/11 21:27:12 INFO BlockManagerInfo: Added broadcast_16_piece0 in memory on 10.16.200.220:50231 (size: 5.5 KB, free: 20.7 GB)
15/06/11 21:27:12 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 10.16.200.220:50231 (size: 8.8 KB, free: 20.7 GB)
15/06/11 21:27:12 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 10.236.232.187:43792 (size: 8.8 KB, free: 20.7 GB)
15/06/11 21:27:12 INFO TaskSetManager: Starting task 40.0 in stage 11.0 (TID 4062, 10.16.200.220, ANY, 2252 bytes)
15/06/11 21:27:12 WARN TaskSetManager: Lost task 35.0 in stage 11.0 (TID 4057, 10.16.200.220): java.lang.IllegalArgumentException: Column [unitsPerRow_i] was not found in schema!
	at parquet.Preconditions.checkArgument(Preconditions.java:47)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
	at parquet.filter2.predicate.Operators$Gt.accept(Operators.java:217)
	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
	at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:153)
	at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
{quote}

> Dataframe Fails to Recognize Column in Schema
> ---------------------------------------------
>
>                 Key: SPARK-8128
>                 URL: https://issues.apache.org/jira/browse/SPARK-8128
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0
>            Reporter: Brad Willard
>
> I'm loading a folder of about 600 parquet files into one dataframe, so schema merging is involved. There is some bug in the schema merging: when you print the schema, it shows the attribute, but when you run a query that filters on that attribute, it errors saying the attribute is not in the schema.
> I think this bug could be related to an attribute name being reused in a nested object. "mediaProcessingState" appears twice in the schema and is the problem.
>
> sdf = sql_context.parquet('/parquet/big_data_folder')
> sdf.printSchema()
> root
> |-- _id: string (nullable = true)
> |-- addedOn: string (nullable = true)
> |-- attachment: string (nullable = true)
> .......
> |-- items: array (nullable = true)
> |    |-- element: struct (containsNull = true)
> |    |    |-- _id: string (nullable = true)
> |    |    |-- addedOn: string (nullable = true)
> |    |    |-- authorId: string (nullable = true)
> |    |    |-- mediaProcessingState: long (nullable = true)
> |-- mediaProcessingState: long (nullable = true)
> |-- title: string (nullable = true)
> |-- key: string (nullable = true)
>
> sdf.filter(sdf.mediaProcessingState == 3).count()
>
> causes this exception:
>
> Py4JJavaError: An error occurred while calling o67.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1106 in stage 4.0 failed 30 times, most recent failure: Lost task 1106.29 in stage 4.0 (TID 70565, XXXXXXXXXXXXXXX): java.lang.IllegalArgumentException: Column [mediaProcessingState] was not found in schema!
> at parquet.Preconditions.checkArgument(Preconditions.java:47)
> at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
> at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
> at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
> at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
> at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
> at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
> at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
> at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
> at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
> at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
> at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> You also get the same error if you register it as a temp table and try to execute the same SQL query.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
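The per-file validation visible in the stack traces (parquet's SchemaCompatibilityValidator rejecting the pushed-down predicate) suggests one plausible mechanism: with schema merging, the unified schema reported by printSchema() contains every column, but the filter is validated against each individual part-file's schema, so part-files written without the column fail. The following is an illustrative pure-Python sketch of that mismatch, not Spark or Parquet code; the per-file schemas and the validation helper are hypothetical.

```python
# Illustrative sketch only -- models how a filter pushed down to individual
# Parquet files can fail even though the merged schema contains the column.

# Hypothetical per-file schemas: newer files carry the column, older ones don't.
file_schemas = [
    {"_id", "addedOn", "mediaProcessingState"},  # newer part-file
    {"_id", "addedOn"},                          # older part-file, column missing
]

# Schema merging unions the per-file schemas, so printSchema() shows the column.
merged_schema = set().union(*file_schemas)
assert "mediaProcessingState" in merged_schema

def validate_pushed_filter(column, file_schema):
    """Mimics the validator's check: the predicate column must exist in the
    schema of the specific file being scanned, not just the merged schema."""
    if column not in file_schema:
        raise ValueError("Column [%s] was not found in schema!" % column)

# Scanning the newer file succeeds; the older file raises, which would surface
# as the IllegalArgumentException seen in the task logs.
validate_pushed_filter("mediaProcessingState", file_schemas[0])
try:
    validate_pushed_filter("mediaProcessingState", file_schemas[1])
except ValueError as e:
    print(e)  # Column [mediaProcessingState] was not found in schema!
```

If that is what is happening here, setting spark.sql.parquet.filterPushdown to false (so Spark evaluates the predicate after reading rather than handing it to Parquet's row-group filter) may be a workaround worth trying; I have not verified it against this dataset.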