On 18 Oct 2016, at 08:43, Chetan Khatri <ckhatriman...@gmail.com> wrote:
> Hello Community members, I am getting an error while reading a large JSON file in Spark.

The underlying read code can't handle more than 2^31 bytes in a single line:

    if (bytesConsumed > Integer.MAX_VALUE) {
      throw new IOException("Too many bytes before newline: " + bytesConsumed);
    }

That's because it's trying to split work by line, and of course, there aren't lines. You need to move over to reading the JSON by other means, I'm afraid. At a guess, something involving SparkContext.binaryFiles() streaming the data straight into a JSON parser.

> Code:
>
> val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")

Unrelated, but use s3a if you can. It's better.

> Error:
>
> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
> java.io.IOException: Too many bytes before newline: 2147483648
>         at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>         at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
>         at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>
> What would be the resolution for the same?
>
> Thanks in advance!
> --
> Yours Aye,
> Chetan Khatri.
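To make the failure concrete: Hadoop's LineReader aborts once a single "line" exceeds Integer.MAX_VALUE (2^31 - 1) bytes, which is why the trace shows exactly 2147483648. The sketch below is a minimal local stand-in, not Spark code: it checks that boundary, then mimics what SparkContext.binaryFiles() does per file (read the whole file as bytes, ignoring newlines entirely); the temp file and the "visitor" field are hypothetical, and the streaming JSON parser you would attach is not shown.

```scala
import java.nio.file.Files

object WholeFileReadSketch {
  def main(args: Array[String]): Unit = {
    // The check that fails in Hadoop's LineReader, paraphrased: 2^31 bytes
    // in one "line" is one byte past Integer.MAX_VALUE (2^31 - 1).
    val bytesConsumed: Long = 2147483648L      // value from the stack trace
    assert(bytesConsumed > Integer.MAX_VALUE)  // so the IOException fires

    // Local stand-in for what binaryFiles() does per file: read the whole
    // file as one byte array, with no dependence on line boundaries, then
    // hand the bytes to a streaming JSON parser of your choice (not shown).
    val path = Files.createTempFile("lvisitor", ".json")
    Files.write(path, """{"visitor":"a"}""".getBytes("UTF-8"))
    val wholeFile = Files.readAllBytes(path)
    println(wholeFile.length)                  // byte count, newline-free
    Files.delete(path)
  }
}
```

In a real job the byte stream for each file (the second element of each binaryFiles() pair, via its open() method) would be fed to an incremental parser such as Jackson's streaming API, so no single record ever has to fit the line-reader's limit.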