Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-10-05 Thread Fernando Paladini
Thank you for the replies, and sorry about the delay; my e-mail client sent
this conversation to Spam (??).

I'll take a look at your tips and come back later to post my questions /
progress. Again, thank you so much!

-- 
Fernando Paladini


Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-10-05 Thread Fernando Paladini
Update:

I've updated my code and now I have the following JSON:
https://gist.github.com/paladini/27bb5636d91dec79bd56
At the same link you can check the output of "spark-submit
myPythonScript.py", where I call "myDataframe.show()". The following is
printed by Spark (among other useless debug information):

[inline screenshot of the myDataframe.show() output]

Is that correct for the given JSON input (gist link above)?
How can I test whether Spark understands this DataFrame and can perform
complex manipulations on it?

Thank you! Hope you can help me soon :3
Fernando Paladini.

-- 
Fernando Paladini


Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-10-05 Thread Michael Armbrust
Looks correct to me. Try, for example:

from pyspark.sql.functions import *
df.withColumn("value", explode(df['values'])).show()
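
For further sanity checks, a minimal sketch assuming the Spark 1.x API; the
column name "name" below is hypothetical, so adjust it to your actual schema:

df.printSchema()  # inspect the schema Spark inferred from the JSON

# Register the DataFrame as a temporary table and query it with SQL
# (registerTempTable is the Spark 1.x API).
df.registerTempTable("readings")
sqlContext.sql("SELECT name, COUNT(*) AS n FROM readings GROUP BY name").show()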


Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-09-30 Thread Akhil Das
Each JSON doc should be on a single line, I guess.
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

Note that the file that is offered as *a json file* is not a typical JSON
file. Each line must contain a separate, self-contained valid JSON object.
As a consequence, a regular multi-line JSON file will most often fail.
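
For instance, a minimal sketch of converting a regular JSON array file into
that one-object-per-line format (the file names are just placeholders):

import json

with open("input.json") as f:
    records = json.load(f)  # a list of JSON objects

with open("one_per_line.json", "w") as out:
    for record in records:
        out.write(json.dumps(record) + "\n")  # one self-contained object per line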

Thanks
Best Regards



Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-09-30 Thread Michael Armbrust
I think the problem here is that you are passing in parsed JSON that is
stored as a dictionary (which is converted to a HashMap when going into the
JVM). You should instead be passing in the path to the JSON file (formatted
as Akhil suggests) so that Spark can do the parsing in parallel. The other
option would be to construct an RDD of JSON strings and pass that to the
json method.
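
A minimal sketch of both options against the Spark 1.x PySpark API (the file
path and the response_text variable are placeholders):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", appName="JSON to DataFrame")
sqlContext = SQLContext(sc)

# Option 1: hand Spark the path so it parses the file itself, in parallel.
# The file needs one self-contained JSON object per line, as Akhil suggests.
df = sqlContext.read.json("/path/to/data.json")

# Option 2: build an RDD of raw JSON strings and let Spark parse those
# (jsonRDD is the Spark 1.x API for this).
rdd = sc.parallelize([response_text])  # response_text: the raw API response body
df = sqlContext.jsonRDD(rdd)

df.show()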



Fwd: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-09-28 Thread Fernando Paladini
Hello guys,

I'm very new to Spark and I'm having some trouble reading JSON into a
DataFrame on PySpark.

I'm getting a JSON object from an API response and I would like to store it
in Spark as a DataFrame (I've read that DataFrame is better than RDD; is
that accurate?). From what I've read in the documentation, I just need to
call the method sqlContext.read.json in order to do what I want.

*Following is the code from my test application:*
json_object = json.loads(response.text)
sc = SparkContext("local", appName="JSON to RDD")
sqlContext = SQLContext(sc)
dataframe = sqlContext.read.json(json_object)
dataframe.show()

*The problem is that when I run "spark-submit myExample.py" I get the
following error:*
15/09/29 01:18:54 INFO BlockManagerMasterEndpoint: Registering block
manager localhost:48634 with 530.0 MB RAM, BlockManagerId(driver,
localhost, 48634)
15/09/29 01:18:54 INFO BlockManagerMaster: Registered BlockManager
Traceback (most recent call last):
  File "/home/paladini/ufxc/lisha/learning/spark-api-kairos/test1.py", line
35, in 
dataframe = sqlContext.read.json(json_object)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line
144, in json
  File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36,
in deco
  File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
304, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o21.json. Trace:
py4j.Py4JException: Method json([class java.util.HashMap]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

*What am I doing wrong?*
Check out this gist to see the JSON I'm trying to load.

Thanks!
Fernando Paladini