I think the problem here is that you are passing in parsed JSON that is stored
as a dictionary (which is converted to a HashMap when it crosses into the JVM).
You should instead pass in the path to the JSON file (formatted as
Akhil suggests) so that Spark can do the parsing in parallel.  The other
option would be to construct an RDD of JSON strings and pass that to the
json method.
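As a minimal sketch of that second option: the idea is to re-serialize each record back into its own self-contained JSON string before handing it to Spark, since read.json parses one document per RDD element (or per line of a file). The `response_text` value below is a hypothetical stand-in for the API response; the Spark calls are shown as comments because they need a running SparkContext.

```python
import json

# Hypothetical API response: a JSON array of records.
response_text = '[{"name": "metric.a", "value": 1}, {"name": "metric.b", "value": 2}]'
records = json.loads(response_text)

# read.json expects each RDD element (or file line) to be one complete,
# self-contained JSON document, so serialize each record individually.
json_lines = [json.dumps(record) for record in records]

# In PySpark, these strings can then be parsed in parallel:
#   rdd = sc.parallelize(json_lines)
#   dataframe = sqlContext.read.json(rdd)
#   dataframe.show()
```

Writing `json_lines` to a file (one string per line) and passing that path to read.json would work the same way.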

On Wed, Sep 30, 2015 at 2:28 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Each JSON doc should be on a single line, I guess.
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> Note that the file that is offered as *a json file* is not a typical JSON
> file. Each line must contain a separate, self-contained valid JSON object.
> As a consequence, a regular multi-line JSON file will most often fail.
>
> Thanks
> Best Regards
>
> On Tue, Sep 29, 2015 at 11:07 AM, Fernando Paladini <fnpalad...@gmail.com>
> wrote:
>
>> Hello guys,
>>
>> I'm very new to Spark and I'm having some trouble reading a JSON into a
>> DataFrame on PySpark.
>>
>> I'm getting a JSON object from an API response and I would like to store
>> it in Spark as a DataFrame (I've read that DataFrame is better than RDD;
>> is that accurate?). From what I've read
>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext>
>> in the documentation, I just need to call the method sqlContext.read.json
>> in order to do what I want.
>>
>> *Following is the code from my test application:*
>> json_object = json.loads(response.text)
>> sc = SparkContext("local", appName="JSON to RDD")
>> sqlContext = SQLContext(sc)
>> dataframe = sqlContext.read.json(json_object)
>> dataframe.show()
>>
>> *The problem is that when I run **"spark-submit myExample.py" I get the
>> following error:*
>> 15/09/29 01:18:54 INFO BlockManagerMasterEndpoint: Registering block
>> manager localhost:48634 with 530.0 MB RAM, BlockManagerId(driver,
>> localhost, 48634)
>> 15/09/29 01:18:54 INFO BlockManagerMaster: Registered BlockManager
>> Traceback (most recent call last):
>>   File "/home/paladini/ufxc/lisha/learning/spark-api-kairos/test1.py",
>> line 35, in <module>
>>     dataframe = sqlContext.read.json(json_object)
>>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>> line 144, in json
>>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>> line 538, in __call__
>>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36,
>> in deco
>>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>> line 304, in get_return_value
>> py4j.protocol.Py4JError: An error occurred while calling o21.json. Trace:
>> py4j.Py4JException: Method json([class java.util.HashMap]) does not exist
>>     at
>> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>>     at
>> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
>>     at py4j.Gateway.invoke(Gateway.java:252)
>>     at
>> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> *What am I doing wrong?*
>> Check out this gist
>> <https://gist.github.com/paladini/2e2ea913d545a407b842> to see the JSON
>> I'm trying to load.
>>
>> Thanks!
>> Fernando Paladini
>>
>
>