Once I removed the CR/LF line breaks from the file, it worked OK.
Eran

On Mon, 21 Dec 2015 at 06:29 Yin Huai <yh...@databricks.com> wrote:
> Hi Eran,
>
> Can you try 1.6? With the change in
> https://github.com/apache/spark/pull/10288, the JSON data source will not
> throw a runtime exception if there is any record that it cannot parse.
> Instead, it will put the entire record into the "_corrupt_record" column.
>
> Thanks,
>
> Yin
>
> On Sun, Dec 20, 2015 at 9:37 AM, Eran Witkon <eranwit...@gmail.com> wrote:
>
>> Thanks for this!
>> This was the problem...
>>
>> On Sun, 20 Dec 2015 at 18:49 Chris Fregly <ch...@fregly.com> wrote:
>>
>>> Hey Eran, I run into this all the time with JSON.
>>>
>>> The problem is likely that your JSON is "too pretty" and extends beyond
>>> a single line, which trips up the JSON reader.
>>>
>>> My solution is usually to de-pretty the JSON - either manually or
>>> through an ETL step - by stripping all whitespace before pointing my
>>> DataFrame/JSON reader at the file.
>>>
>>> This tool is handy for one-off scenarios: http://jsonviewer.stack.hu
>>>
>>> For streaming use cases, you'll want a light de-pretty ETL step, either
>>> within the Spark Streaming job after ingestion, or upstream using
>>> something like a Flume interceptor, a NiFi processor (I love NiFi), or a
>>> Kafka transformation, assuming those exist by now.
>>>
>>> A similar problem exists for XML, by the way. There are lots of wonky
>>> workarounds for this that use mapPartitions and all kinds of craziness.
>>> The best option, in my opinion, is to just ETL/flatten the data to make
>>> the DataFrame reader happy.
>>>
>>> On Dec 19, 2015, at 4:55 PM, Eran Witkon <eranwit...@gmail.com> wrote:
>>>
>>> Hi,
>>> I tried the following code in spark-shell on Spark 1.5.2:
>>>
>>> val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/sample2.json")
>>> df.count()
>>>
>>> 15/12/19 23:49:40 ERROR Executor: Managed memory leak detected; size =
>>> 67108864 bytes, TID = 3
>>> 15/12/19 23:49:40 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 3)
>>> java.lang.RuntimeException: Failed to parse a value for data type
>>> StructType() (current token: VALUE_STRING).
>>>   at scala.sys.package$.error(package.scala:27)
>>>   at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:172)
>>>   at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:251)
>>>   at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:246)
>>>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>   at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:365)
>>>   at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>>>   at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
>>>
>>> Am I doing something wrong?
>>> Eran
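
For reference, a minimal sketch of the 1.6 behaviour Yin describes: instead of
failing the job, records the parser cannot handle keep their raw text in the
"_corrupt_record" column. The filtering step below is illustrative, not taken
from the thread.

    // Sketch only, assuming Spark 1.6+ (spark-shell, Scala).
    // Unparseable records are kept as raw text in "_corrupt_record"
    // rather than triggering a runtime exception.
    val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/sample2.json")

    // Inspect the records that could not be parsed (the column only
    // appears in the inferred schema if such records exist).
    if (df.columns.contains("_corrupt_record")) {
      df.filter(df("_corrupt_record").isNotNull).show()
    }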
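
And a rough sketch of the "de-pretty" step Chris suggests (which Eran's CR/LF
removal effectively did by hand), for Spark 1.x where the JSON reader expects
one record per line. The use of wholeTextFiles and the directory path are
assumptions: this approach only works when each input file holds a single
multi-line JSON object.

    // Sketch only: collapse each pretty-printed file onto a single line,
    // then hand the result to the JSON reader as an RDD[String]
    // (one element = one JSON record).
    val raw = sc.wholeTextFiles("/home/eranw/Workspace/JSON/sample")  // (path, content) pairs
    val oneLine = raw.map { case (_, content) => content.replaceAll("\\s*\\n\\s*", " ").trim }
    val flatDf = sqlContext.read.json(oneLine)
    flatDf.count()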