Each JSON document should be on a single line, I believe. From the Spark SQL guide:
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

Note that the file that is offered as *a json file* is not a typical JSON
file. Each line must contain a separate, self-contained valid JSON object.
As a consequence, a regular multi-line JSON file will most often fail.
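
The traceback in the quoted message reports "Method json([class java.util.HashMap]) does not exist", which suggests the parsed Python dict itself was passed to the reader rather than a path or line-delimited JSON text. Below is a minimal sketch, using only the standard library, of turning a parsed response into one-JSON-object-per-line strings. The helper name to_json_lines and the sample payload are made up for illustration and are not from the original code.

```python
import json


def to_json_lines(parsed):
    """Serialize parsed JSON into single-line JSON strings, one per record.

    A top-level list yields one line per element; any other value yields
    exactly one line.
    """
    records = parsed if isinstance(parsed, list) else [parsed]
    # json.dumps without `indent` emits no raw newlines; newlines inside
    # string values are escaped as \n.
    return [json.dumps(record) for record in records]


# Hypothetical stand-in for response.text (pretty-printed, multi-line JSON).
response_text = """
{
  "name": "sensor-1",
  "values": [1, 2, 3]
}
"""

lines = to_json_lines(json.loads(response_text))
print(lines[0])

# Assumption, not verified against this Spark install: with a live
# SparkContext `sc`, these strings could be written to a file and loaded
# with sqlContext.read.json(path), or passed as an RDD of strings, for
# example sqlContext.jsonRDD(sc.parallelize(lines)) on Spark 1.x.
# Check the documentation for the Spark version in use.
```

Whether the reader accepts an RDD directly or only a path depends on the Spark release, so the hand-off is left as a comment rather than executed here.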

Thanks
Best Regards

On Tue, Sep 29, 2015 at 11:07 AM, Fernando Paladini <fnpalad...@gmail.com>
wrote:

> Hello guys,
>
> I'm very new to Spark and I'm having some trouble reading JSON into a
> DataFrame in PySpark.
>
> I'm getting a JSON object from an API response and I would like to store
> it in Spark as a DataFrame (I've read that DataFrame is better than RDD;
> is that accurate?). From what I've read
> <http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext>
> in the documentation, I just need to call the method sqlContext.read.json in
> order to do what I want.
>
> *Following is the code from my test application:*
> json_object = json.loads(response.text)
> sc = SparkContext("local", appName="JSON to RDD")
> sqlContext = SQLContext(sc)
> dataframe = sqlContext.read.json(json_object)
> dataframe.show()
>
> *The problem is that when I run "spark-submit myExample.py" I get the
> following error:*
> 15/09/29 01:18:54 INFO BlockManagerMasterEndpoint: Registering block
> manager localhost:48634 with 530.0 MB RAM, BlockManagerId(driver,
> localhost, 48634)
> 15/09/29 01:18:54 INFO BlockManagerMaster: Registered BlockManager
> Traceback (most recent call last):
>   File "/home/paladini/ufxc/lisha/learning/spark-api-kairos/test1.py",
> line 35, in <module>
>     dataframe = sqlContext.read.json(json_object)
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line
> 144, in json
>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36,
> in deco
>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
> 304, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling o21.json. Trace:
> py4j.Py4JException: Method json([class java.util.HashMap]) does not exist
>     at
> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>     at
> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
>     at py4j.Gateway.invoke(Gateway.java:252)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>     at java.lang.Thread.run(Thread.java:745)
>
> *What am I doing wrong?*
> Check out this gist
> <https://gist.github.com/paladini/2e2ea913d545a407b842> to see the JSON
> I'm trying to load.
>
> Thanks!
> Fernando Paladini
>
