luochenghui created SPARK-6659:
----------------------------------

             Summary: Spark SQL 1.3 cannot read a JSON file that contains only a single record
                 Key: SPARK-6659
                 URL: https://issues.apache.org/jira/browse/SPARK-6659
             Project: Spark
          Issue Type: Bug
            Reporter: luochenghui
Dear friends,

Spark SQL 1.3 cannot read a JSON file that contains only a single record. Here is my JSON file's content:

{"name":"milo","age",24}

When I run Spark SQL in local mode, it throws an exception:

org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns _corrupt_record;

What I did:

1. ./spark-shell

2. scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5f3be6c8

scala> val df = sqlContext.jsonFile("/home/milo/person.json")
15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975
15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB)
15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=280248975
15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB)
15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35842 (size: 22.2 KB, free: 267.2 MB)
15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at JSONRelation.scala:98
15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1
15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51
15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) with 1 output partitions (allowLocal=false)
15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at JsonRDD.scala:51)
15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List()
15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List()
15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51), which has no missing parents
15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with curMem=186397, maxMem=280248975
15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 267.1 MB)
15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with curMem=189581, maxMem=280248975
15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 267.1 MB)
15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:35842 (size: 2.2 KB, free: 267.2 MB)
15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51)
15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1291 bytes)
15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/03/19 22:11:48 INFO HadoopRDD: Input split: file:/home/milo/person.json:0+26
15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 bytes result sent to driver
15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1209 ms on localhost (1/1)
15/03/19 22:11:49 INFO DAGScheduler: Stage 0 (reduce at JsonRDD.scala:51) finished in 1.308 s
15/03/19 22:11:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/03/19 22:11:49 INFO DAGScheduler: Job 0 finished: reduce at JsonRDD.scala:51, took 2.002429 s
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
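Note the inferred schema: the whole line ended up in the special _corrupt_record column rather than in name/age fields. As a quick check in the same shell session (a minimal sketch; as far as I know, the JSON reader keeps lines it cannot parse in this column, whose name is controlled by spark.sql.columnNameOfCorruptRecord):

scala> df.printSchema()                       // inspect what jsonFile inferred
root
 |-- _corrupt_record: string (nullable = true)

scala> df.select("_corrupt_record").show()    // the raw text of the line that failed to parse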
3. scala> df.select("name").show()
15/03/19 22:12:44 INFO BlockManager: Removing broadcast 1
15/03/19 22:12:44 INFO BlockManager: Removing block broadcast_1_piece0
15/03/19 22:12:44 INFO MemoryStore: Block broadcast_1_piece0 of size 2251 dropped from memory (free 280059394)
15/03/19 22:12:44 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:35842 in memory (size: 2.2 KB, free: 267.2 MB)
15/03/19 22:12:44 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/03/19 22:12:44 INFO BlockManager: Removing block broadcast_1
15/03/19 22:12:44 INFO MemoryStore: Block broadcast_1 of size 3184 dropped from memory (free 280062578)
15/03/19 22:12:45 INFO ContextCleaner: Cleaned broadcast 1
15/03/19 22:12:45 INFO BlockManager: Removing broadcast 0
15/03/19 22:12:45 INFO BlockManager: Removing block broadcast_0
15/03/19 22:12:45 INFO MemoryStore: Block broadcast_0 of size 163705 dropped from memory (free 280226283)
15/03/19 22:12:45 INFO BlockManager: Removing block broadcast_0_piece0
15/03/19 22:12:45 INFO MemoryStore: Block broadcast_0_piece0 of size 22692 dropped from memory (free 280248975)
15/03/19 22:12:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:35842 in memory (size: 22.2 KB, free: 267.3 MB)
15/03/19 22:12:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/19 22:12:45 INFO ContextCleaner: Cleaned broadcast 0
org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns _corrupt_record;
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:121)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:45)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:43)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:88)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.apply(CheckAnalysis.scala:43)
  at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1069)
  at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
  at org.apache.spark.sql.DataFrame.logicalPlanToDataFrame(DataFrame.scala:157)
  at org.apache.spark.sql.DataFrame.select(DataFrame.scala:465)
  at org.apache.spark.sql.DataFrame.select(DataFrame.scala:480)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
  at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
  at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
  at $iwC$$iwC$$iwC.<init>(<console>:39)
  at $iwC$$iwC.<init>(<console>:41)
  at $iwC.<init>(<console>:43)
  at <init>(<console>:45)
  at .<init>(<console>:49)
  at .<clinit>(<console>)
  at .<init>(<console>:7)
  at .<clinit>(<console>)
  at $print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
  at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
  at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
  at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
  at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
  at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
  at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
  at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
  at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
  at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
  at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
  at org.apache.spark.repl.Main$.main(Main.scala:31)
  at org.apache.spark.repl.Main.main(Main.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
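The select fails during analysis because the only column the JSON reader inferred is _corrupt_record, so there is no 'name' column to resolve. To rule out the file-reading path, one can feed a single record through sqlContext.jsonRDD instead of jsonFile. A minimal sketch (note that I write the record with a colon after "age" here, i.e. as strictly valid JSON; whether that difference in syntax is what matters is only my guess):

scala> val oneRecord = sc.parallelize("""{"name":"milo","age":24}""" :: Nil)  // one valid JSON object, no file involved
scala> val df2 = sqlContext.jsonRDD(oneRecord)                                // same schema-inference path as jsonFile
scala> df2.printSchema()
scala> df2.select("name").show()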
But when I invoke df.show(), it works:

scala> df.show()
15/03/19 22:13:32 INFO MemoryStore: ensureFreeSpace(81443) called with curMem=0, maxMem=280248975
15/03/19 22:13:32 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 79.5 KB, free 267.2 MB)
15/03/19 22:13:32 INFO MemoryStore: ensureFreeSpace(31262) called with curMem=81443, maxMem=280248975
15/03/19 22:13:32 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 30.5 KB, free 267.2 MB)
15/03/19 22:13:32 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:35842 (size: 30.5 KB, free: 267.2 MB)
15/03/19 22:13:32 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/03/19 22:13:32 INFO SparkContext: Created broadcast 2 from textFile at JSONRelation.scala:98
15/03/19 22:13:32 INFO FileInputFormat: Total input paths to process : 1
15/03/19 22:13:32 INFO SparkContext: Starting job: runJob at SparkPlan.scala:121
15/03/19 22:13:32 INFO DAGScheduler: Got job 1 (runJob at SparkPlan.scala:121) with 1 output partitions (allowLocal=false)
15/03/19 22:13:32 INFO DAGScheduler: Final stage: Stage 1(runJob at SparkPlan.scala:121)
15/03/19 22:13:32 INFO DAGScheduler: Parents of final stage: List()
15/03/19 22:13:32 INFO DAGScheduler: Missing parents: List()
15/03/19 22:13:32 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[8] at map at SparkPlan.scala:96), which has no missing parents
15/03/19 22:13:32 INFO MemoryStore: ensureFreeSpace(3968) called with curMem=112705, maxMem=280248975
15/03/19 22:13:32 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.9 KB, free 267.2 MB)
15/03/19 22:13:32 INFO MemoryStore: ensureFreeSpace(2724) called with curMem=116673, maxMem=280248975
15/03/19 22:13:32 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.7 KB, free 267.2 MB)
15/03/19 22:13:32 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:35842 (size: 2.7 KB, free: 267.2 MB)
15/03/19 22:13:32 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/03/19 22:13:32 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:839
15/03/19 22:13:32 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[8] at map at SparkPlan.scala:96)
15/03/19 22:13:32 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/03/19 22:13:32 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1291 bytes)
15/03/19 22:13:32 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/03/19 22:13:32 INFO HadoopRDD: Input split: file:/home/milo/person.json:0+26
15/03/19 22:13:33 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1968 bytes result sent to driver
15/03/19 22:13:33 INFO DAGScheduler: Stage 1 (runJob at SparkPlan.scala:121) finished in 0.249 s
15/03/19 22:13:33 INFO DAGScheduler: Job 1 finished: runJob at SparkPlan.scala:121, took 0.381798 s
15/03/19 22:13:33 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 242 ms on localhost (1/1)
15/03/19 22:13:33 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
_corrupt_record
{"name":"milo","a...

And I tested another case with a JSON file containing more than one record; it ran successfully.
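For reference, as far as I know jsonFile expects one complete JSON object per line of the input file. So a multi-record file of the following shape (a hypothetical /home/milo/people.json; the records and path are only for illustration) reads cleanly:

{"name":"milo","age":24}
{"name":"anne","age":23}

scala> val people = sqlContext.jsonFile("/home/milo/people.json")  // hypothetical path
scala> people.select("name").show()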