Re: Difference in R and Spark Output

2017-01-01 Thread Saroj C
Dear Felix,
 Thanks. Please find the differences below:


Cluster | Spark size | R size
--------+------------+-------
      0 |         69 |    114
      1 |         79 |    141
      2 |         77 |     93
      3 |         90 |     44
      4 |        130 |     53



Spark - Centers (5 clusters x 22 dimensions; one row per cluster, in the order listed above):

Cluster 0:  0.807554406, 0.123759, -0.58642, -0.17803, 0.624278, -0.06752, 0.033517, -0.01504, -0.02794, 0.016699, 0.20841, -0.00149, -0.05598, 0.039746, 0.030756, -0.19788, -0.07906, -0.14881, 0.0056, 0.01479, 0.066883, 0.002491
Cluster 1: -0.428583581, -0.81975, 0.347356, -0.18664, 0.047582, 0.058692, -0.0721, -0.13873, -0.08666, 0.085334, 0.054398, -0.0228, 0.008369, 0.073103, 0.022246, -0.15439, -0.06016, -0.15073, -0.03734, 0.004299, 0.089258, -0.00694
Cluster 2:  0.692744675, 0.148123, 0.087253, 0.851781, -0.2179, 0.003407, -0.12357, -0.01795, 0.016427, 0.088004, 0.021502, -0.04616, -0.00847, 0.023397, 0.057656, -0.12036, -0.03947, -0.13338, -0.02975, 0.012217, 0.090547, -0.00232
Cluster 3: -0.677692276, 0.581091, 0.446125, -0.13087, 0.037225, 0.018936, 0.055286, 0.01146, -0.08648, 0.053719, 0.072753, -0.00873, -0.04448, 0.042067, 0.089221, -0.1977, -0.07368, -0.14674, -0.00641, 0.020815, 0.058425, 0.016745
Cluster 4:  1.03518389, 0.228964, 0.539982, -0.3581, -0.13488, -0.00525, -0.1267, -0.04439, -0.01923, 0.111272, -0.05181, -0.05508, -0.04143, 0.046479, 0.059224, -0.16148, -0.07541, -0.12046, -0.03535, 0.003049, 0.070862, 0.010083
R - Centers (5 clusters x 22 dimensions; one row per cluster, in the order listed above):

Cluster 0:  0.7710882, 0.86271, 0.249609, 0.074961, 0.251188, -0.05293, -0.11106, -0.08063, 0.01516, 0.054043, 0.056937, -0.0287, -0.03291, 0.056607, 0.045214, -0.15237, -0.05442, -0.14038, -0.02326, 0.013882, 0.078523, -0.0087
Cluster 1: -0.644077, 0.022256, 0.368266, -0.06912, 0.123979, 0.009181, -0.04506, -0.04179, -0.0255, 0.041568, 0.04081, -0.02936, -0.04849, 0.049712, 0.062894, -0.16736, -0.06679, -0.12705, -0.007, 0.018079, 0.062337, 0.00349
Cluster 2:  0.9772678, -0.57499, 0.523792, -0.27319, 0.163677, 0.053579, -0.07616, 0.074556, 0.00662, 0.087303, 0.088835, -0.01923, -0.04938, 0.07299, 0.059872, -0.19137, -0.04737, -0.1536, 0.002926, 0.049441, 0.079147, 0.02771
Cluster 3:  0.5172924, 0.167666, -0.16523, -0.82951, -0.77577, -0.00981, 0.018531, -0.09629, -0.1654, 0.273644, -0.05433, -0.03593, 0.115834, 0.021465, -0.00981, -0.15112, -0.16178, -0.04783, -0.19962, -0.12418, 0.07286, 0.03266
Cluster 4:  0.717927, -0.34229, -0.33544, 0.817617, -0.21383, 0.02735, 0.01675, -0.10814, -0.1747, 0.033743, 0.038308, -0.0495, -0.05961, -0.01977, 0.092247, -0.16017, -0.04787, -0.20766, 0.040038, 0.024614, 0.090587, -0.0236


Please let me know if any additional info would help to track down these
anomalies.

Thanks & Regards
Saroj




From:   Felix Cheung 
To: User , Saroj C 
Date:   12/31/2016 10:36 AM
Subject: Re: Difference in R and Spark Output



Could you elaborate more on the huge difference you are seeing?



From: Saroj C 
Sent: Friday, December 30, 2016 5:12:04 AM
To: User
Subject: Difference in R and Spark Output 
 
Dear All, 
 For the attached input file, there is a huge difference between the
clusters produced by R and Spark (ML). Any idea what could cause the difference?

Note: we wanted to create five (5) clusters.

Please find the snippets in Spark and R 

Spark 

// Load the data file

// Create the k-means estimator
KMeans kmeans = new KMeans().setK(5).setMaxIter(500)
        .setFeaturesCol("features")
        .setPredictionCol("prediction");


In R 

# Load the data file into df

# Create the k-means clustering
model <- kmeans(df, 5)
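
Note that k-means is sensitive to initialization and to the algorithm variant:
Spark ML defaults to k-means|| initialization with Lloyd's algorithm, while R's
kmeans() defaults to Hartigan-Wong with a single random start, so some
difference between the two is expected. A minimal PySpark sketch (Spark 2.x
assumed; the file name and column handling below are illustrative assumptions,
not the original code) of settings that usually make runs more repeatable and
comparable:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

# Hypothetical load of the attached data; adjust path and columns as needed.
df = spark.read.csv("input.csv", header=True, inferSchema=True)
assembled = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)

kmeans = KMeans(k=5, maxIter=500, tol=1e-6, seed=1,   # fixed seed => repeatable runs
                featuresCol="features", predictionCol="prediction")
model = kmeans.fit(assembled)
for center in model.clusterCenters():
    print(center)

# On the R side, kmeans(df, 5, nstart = 25, iter.max = 500) reduces the
# dependence on the single random start used by default.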



Thanks & Regards
Saroj 


24/7 Spark Streaming on YARN in Production

2017-01-01 Thread Bernhard Schäfer
Two weeks ago I published a blog post about our experiences running 24/7
Spark Streaming applications on YARN in production:
https://www.inovex.de/blog/247-spark-streaming-on-yarn-in-production/

Among other things, it contains a reference spark-submit command covering
subtleties such as YARN, backpressure, and spark.locality.wait configuration.
Maybe this helps some people avoid getting stuck on the same issues we faced.
Although the post does not target Structured Streaming, the majority of the
configurations still apply.
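
The same kinds of settings can also be set programmatically; a minimal,
illustrative Python sketch (these are just the configuration keys named above,
with example values, not the blog's reference command):

from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.streaming.backpressure.enabled", "true")  # throttle ingestion when batches fall behind
        .set("spark.locality.wait", "0s"))                    # example value; how long to wait for a data-local executor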
Feedback is appreciated.




Re: What is missing here to use sql in spark?

2017-01-01 Thread Michal Šenkýř

Happy new year, Raymond!

Not sure whether I understand your problem correctly, but it seems to me
that you are just not processing your result.

sqlContext.sql(...) returns a DataFrame which you have to call an action on.

Therefore, to get the result you are expecting, you just have to call:
sqlContext.sql(...).show()

You can also assign it to a variable or register it as a new table 
(view) to work with it further:

df2 = sqlContext.sql(...)
or:
sqlContext.sql(...).createOrReplaceTempView("flight201601_carriers")
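
For example, with the table registered as in your original message (a minimal
sketch of the same idea):

carriers_df = sqlContext.sql("select distinct CARRIER from flight201601")
carriers_df.show()   # triggers execution and prints the distinct carriers

# or bring them back to the driver as a plain Python list:
carrier_list = [row.CARRIER for row in carriers_df.collect()]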

Regards,

Michal Šenkýř


On 2.1.2017 05:22, Raymond Xie wrote:

Happy new year!

Below is my script:

pyspark --packages com.databricks:spark-csv_2.10:1.4.0
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('file:///root/Downloads/data/flight201601short2.csv')

df.show(5)
df.registerTempTable("flight201601")
sqlContext.sql("select distinct CARRIER from flight201601")

The output of df.show(5) is below:

+----+-------+-----+------------+-----------+----------+--------------+----------+-------+--------+------+
|YEAR|QUARTER|MONTH|DAY_OF_MONTH|DAY_OF_WEEK|   FL_DATE|UNIQUE_CARRIER|AIRLINE_ID|CARRIER|TAIL_NUM|FL_NUM|
+----+-------+-----+------------+-----------+----------+--------------+----------+-------+--------+------+
|2016|      1|    1|           6|          3|2016-01-06|            AA|     19805|     AA|  N4YBAA|    43|
|2016|      1|    1|           7|          4|2016-01-07|            AA|     19805|     AA|  N434AA|    43|
|2016|      1|    1|           8|          5|2016-01-08|            AA|     19805|     AA|  N541AA|    43|
|2016|      1|    1|           9|          6|2016-01-09|            AA|     19805|     AA|  N489AA|    43|
|2016|      1|    1|          10|          7|2016-01-10|            AA|     19805|     AA|  N439AA|    43|
+----+-------+-----+------------+-----------+----------+--------------+----------+-------+--------+------+

The final result is NOT what I am expecting; it currently shows the
following:


>>> sqlContext.sql("select distinct CARRIER from flight201601")
DataFrame[CARRIER: string]

I am expecting the distinct CARRIER values to be listed:

AA
BB
CC
...

flight201601short2.csv is attached here for your reference.


Thank you very much.



Sincerely yours,

Raymond







Re: Error when loading json to spark

2017-01-01 Thread Raymond Xie
Thank you very much, Marco. Is your code in Scala? Do you have a Python
example? Can anyone give me a Python example to handle JSON data in Spark?
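
A rough PySpark equivalent of the Scala snippet quoted below might look like
this (a sketch only, assuming a Spark 1.6-style SQLContext and the same
placeholder field names as in Marco's example):

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StringType

sqlContext = SQLContext(sc)   # assumes an existing SparkContext `sc`

schema = (StructType()
          .add("hour", StringType()).add("month", StringType())
          .add("second", StringType()).add("year", StringType())
          .add("timezone", StringType()).add("day", StringType())
          .add("minute", StringType()))

df = sqlContext.read.json("file:///c:/tmp/1973-01-11.json", schema=schema)
df.printSchema()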


**
*Sincerely yours,*


*Raymond*

On Sun, Jan 1, 2017 at 12:29 PM, Marco Mistroni  wrote:

> Hi
>you will need to pass the schema, like in the snippet below (even
> though the code might have been superseded in Spark 2.0)
>
> import sqlContext.implicits._
> val jsonRdd = sc.textFile("file:///c:/tmp/1973-01-11.json")
> val schema = (new StructType).add("hour", StringType).add("month",
> StringType)
>   .add("second", StringType).add("year", StringType)
>   .add("timezone", StringType).add("day", StringType)
>   .add("minute", StringType)
> val jsonContentWithSchema = sqlContext.jsonRDD(jsonRdd, schema)
>
> But somehow I seem to remember that there was a way, in Spark 2.0, so
> that Spark will infer the schema for you...
>
> hth
> marco
>
>
>
>
>
> On Sun, Jan 1, 2017 at 12:40 PM, Raymond Xie  wrote:
>
>> I found the cause:
>>
>> I need to "put" the json file onto hdfs first before it can be used, here
>> is what I did:
>>
>> hdfs dfs -put  /root/Downloads/data/json/world_bank.json
>> hdfs://localhost:9000/json
>> df = sqlContext.read.json("/json/")
>> df.show(10)
>>
>> .
>>
>>> > However, there is a new problem here, the json data needs to be sort of
>>> > tweaked before it can be really used, simply using df =
>>> > sqlContext.read.json("/json/") just makes the df messy, I need the df to know
>>> > the fields in the json file.
>>
>> How?
>>
>> Thank you.
>>
>>
>>
>>
>> **
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>> On Sat, Dec 31, 2016 at 11:52 PM, Miguel Morales > > wrote:
>>
>>> Looks like it's trying to treat that path as a folder, try omitting
>>> the file name and just use the folder path.
>>>
>>> On Sat, Dec 31, 2016 at 7:58 PM, Raymond Xie 
>>> wrote:
>>> > Happy new year!!!
>>> >
>>> > I am trying to load a json file into spark, the json file is attached
>>> here.
>>> >
>>> > I received the following error, can anyone help me to fix it? Thank
>>> you very
>>> > much. I am using Spark 1.6.2 and python 2.7.5
>>> >
>>>  from pyspark.sql import SQLContext
>>>  sqlContext = SQLContext(sc)
>>>  df = sqlContext.read.json("/root/Downloads/data/json/world_bank.j
>>> son")
>>> > 16/12/31 22:54:53 INFO json.JSONRelation: Listing
>>> > hdfs://localhost:9000/root/Downloads/data/json/world_bank.json on
>>> driver
>>> > 16/12/31 22:54:54 INFO storage.MemoryStore: Block broadcast_0 stored as
>>> > values in memory (estimated size 212.4 KB, free 212.4 KB)
>>> > 16/12/31 22:54:54 INFO storage.MemoryStore: Block broadcast_0_piece0
>>> stored
>>> > as bytes in memory (estimated size 19.6 KB, free 232.0 KB)
>>> > 16/12/31 22:54:54 INFO storage.BlockManagerInfo: Added
>>> broadcast_0_piece0 in
>>> > memory on localhost:39844 (size: 19.6 KB, free: 511.1 MB)
>>> > 16/12/31 22:54:54 INFO spark.SparkContext: Created broadcast 0 from
>>> json at
>>> > NativeMethodAccessorImpl.java:-2
>>> > Traceback (most recent call last):
>>> >   File "", line 1, in 
>>> >   File "/opt/spark/python/pyspark/sql/readwriter.py", line 176, in
>>> json
>>> > return self._df(self._jreader.json(path))
>>> >   File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>>> line
>>> > 813, in __call__
>>> >   File "/opt/spark/python/pyspark/sql/utils.py", line 45, in deco
>>> > return f(*a, **kw)
>>> >   File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py",
>>> line 308,
>>> > in get_return_value
>>> > py4j.protocol.Py4JJavaError: An error occurred while calling o19.json.
>>> > : java.io.IOException: No input paths specified in job
>>> > at
>>> > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInpu
>>> tFormat.java:201)
>>> > at
>>> > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInput
>>> Format.java:313)
>>> > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>>> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>> > at scala.Option.getOrElse(Option.scala:120)
>>> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>> > at
>>> > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapParti
>>> tionsRDD.scala:35)
>>> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>> > at scala.Option.getOrElse(Option.scala:120)
>>> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>> > at
>>> > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapParti
>>> tionsRDD.scala:35)
>>> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>> > at 

Re: Error when loading json to spark

2017-01-01 Thread Marco Mistroni
Hi
   you will need to pass the schema, like in the snippet below (even though
the code might have been superseded in Spark 2.0)

import sqlContext.implicits._
val jsonRdd = sc.textFile("file:///c:/tmp/1973-01-11.json")
val schema = (new StructType).add("hour", StringType).add("month",
StringType)
  .add("second", StringType).add("year", StringType)
  .add("timezone", StringType).add("day", StringType)
  .add("minute", StringType)
val jsonContentWithSchema = sqlContext.jsonRDD(jsonRdd, schema)

But somehow I seem to remember that there was a way, in Spark 2.0, so that
Spark will infer the schema for you...
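
For what it's worth, read.json will also sample the input and infer a schema
when none is passed; a two-line illustration (in Spark 2.0 the entry point
would be the SparkSession, here assumed to be `spark`):

df = spark.read.json("file:///c:/tmp/1973-01-11.json")   # schema inferred from the data
df.printSchema()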

hth
marco





On Sun, Jan 1, 2017 at 12:40 PM, Raymond Xie  wrote:

> I found the cause:
>
> I need to "put" the json file onto hdfs first before it can be used, here
> is what I did:
>
> hdfs dfs -put  /root/Downloads/data/json/world_bank.json
> hdfs://localhost:9000/json
> df = sqlContext.read.json("/json/")
> df.show(10)
>
> .
>
> However, there is a new problem here, the json data needs to be sort of
> tweaked before it can be really used, simply using df =
> sqlContext.read.json("/json/") just makes the df messy, I need the df to know
> the fields in the json file.
>
> How?
>
> Thank you.
>
>
>
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>
> On Sat, Dec 31, 2016 at 11:52 PM, Miguel Morales 
> wrote:
>
>> Looks like it's trying to treat that path as a folder, try omitting
>> the file name and just use the folder path.
>>
>> On Sat, Dec 31, 2016 at 7:58 PM, Raymond Xie 
>> wrote:
>> > Happy new year!!!
>> >
>> > I am trying to load a json file into spark, the json file is attached
>> here.
>> >
>> > I received the following error, can anyone help me to fix it? Thank you
>> very
>> > much. I am using Spark 1.6.2 and python 2.7.5
>> >
>>  from pyspark.sql import SQLContext
>>  sqlContext = SQLContext(sc)
>>  df = sqlContext.read.json("/root/Downloads/data/json/world_bank.
>> json")
>> > 16/12/31 22:54:53 INFO json.JSONRelation: Listing
>> > hdfs://localhost:9000/root/Downloads/data/json/world_bank.json on
>> driver
>> > 16/12/31 22:54:54 INFO storage.MemoryStore: Block broadcast_0 stored as
>> > values in memory (estimated size 212.4 KB, free 212.4 KB)
>> > 16/12/31 22:54:54 INFO storage.MemoryStore: Block broadcast_0_piece0
>> stored
>> > as bytes in memory (estimated size 19.6 KB, free 232.0 KB)
>> > 16/12/31 22:54:54 INFO storage.BlockManagerInfo: Added
>> broadcast_0_piece0 in
>> > memory on localhost:39844 (size: 19.6 KB, free: 511.1 MB)
>> > 16/12/31 22:54:54 INFO spark.SparkContext: Created broadcast 0 from
>> json at
>> > NativeMethodAccessorImpl.java:-2
>> > Traceback (most recent call last):
>> >   File "", line 1, in 
>> >   File "/opt/spark/python/pyspark/sql/readwriter.py", line 176, in json
>> > return self._df(self._jreader.json(path))
>> >   File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>> line
>> > 813, in __call__
>> >   File "/opt/spark/python/pyspark/sql/utils.py", line 45, in deco
>> > return f(*a, **kw)
>> >   File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line
>> 308,
>> > in get_return_value
>> > py4j.protocol.Py4JJavaError: An error occurred while calling o19.json.
>> > : java.io.IOException: No input paths specified in job
>> > at
>> > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInpu
>> tFormat.java:201)
>> > at
>> > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInput
>> Format.java:313)
>> > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>> > at scala.Option.getOrElse(Option.scala:120)
>> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>> > at
>> > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapParti
>> tionsRDD.scala:35)
>> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>> > at scala.Option.getOrElse(Option.scala:120)
>> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>> > at
>> > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapParti
>> tionsRDD.scala:35)
>> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>> > at scala.Option.getOrElse(Option.scala:120)
>> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>> > at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.
>> scala:1129)
>> > at
>> > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperati
>> onScope.scala:150)
>> > at
>> > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperati
>> onScope.scala:111)
>> > at 

RE: [PySpark - 1.6] - Avoid object serialization

2017-01-01 Thread Sidney Feiner
Thanks everybody, but I've found another way of doing it.
Because I didn't actually need an instance of my class, I created a "static"
class: all variables are initialized as class variables and all methods are
class methods.
Thanks a lot anyway; I hope my answer will also help someone one day ☺
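
A minimal sketch of that pattern (the names here are placeholders mirroring the
earlier thread, not the real analyzer):

class ShortTextAnalyzer(object):
    # All state lives on the class itself, initialized once per worker process.
    root_dir = None
    keywords = None

    @classmethod
    def init(cls, root_dir):
        cls.root_dir = root_dir
        cls.keywords = {"error", "warning"}   # stand-in for whatever is loaded from root_dir

    @classmethod
    def analyze_short_text_event(cls, text):
        # Only class methods, so no instance is ever created or pickled per record.
        return [w for w in text.split() if w in cls.keywords]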

Sidney Feiner   /  SW Developer
M: +972.528197720  /  Skype: sidney.feiner.startapp


From: Holden Karau [mailto:hol...@pigscanfly.ca]
Sent: Thursday, December 29, 2016 8:54 PM
To: Chawla,Sumit ; Eike von Seggern 

Cc: Sidney Feiner ; user@spark.apache.org
Subject: Re: [PySpark - 1.6] - Avoid object serialization

Alternatively, using the broadcast functionality can also help with this.
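
A minimal sketch of that approach, combined with the foreachPartition idea from
the thread below (variable names match the earlier snippets and are otherwise
placeholders):

root_dir_bc = sc.broadcast(root_dir)   # broadcast only the small, picklable config

def partition_func(records):
    analyzer = ShortTextAnalyzer(root_dir_bc.value)   # built on the executor, per partition
    for record in records:
        analyzer.analyze_short_text_event(record[1])

def process_rdd(rdd):
    rdd.foreachPartition(partition_func)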

On Thu, Dec 29, 2016 at 3:05 AM Eike von Seggern 
> wrote:
2016-12-28 20:17 GMT+01:00 Chawla,Sumit 
>:
Would this work for you?

def processRDD(rdd):
    analyzer = ShortTextAnalyzer(root_dir)
    rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1]))

ssc.union(*streams).filter(lambda x: x[1] != None) \
    .foreachRDD(lambda rdd: processRDD(rdd))

I think this will still send the analyzer to all executors where the RDD
partitions are stored.

Maybe you can work around this with `RDD.foreachPartition()`:

def processRDD(rdd):
    def partition_func(records):
        analyzer = ShortTextAnalyzer(root_dir)
        for record in records:
            analyzer.analyze_short_text_event(record[1])
    rdd.foreachPartition(partition_func)

This will create one analyzer per partition and RDD.

Best

Eike


Re: Error when loading json to spark

2017-01-01 Thread Raymond Xie
I found the cause:

I need to "put" the json file onto hdfs first before it can be used, here
is what I did:

hdfs dfs -put  /root/Downloads/data/json/world_bank.json
hdfs://localhost:9000/json
df = sqlContext.read.json("/json/")
df.show(10)

.

However, there is a new problem: the JSON data needs to be sort of
tweaked before it can really be used. Simply using df =
sqlContext.read.json("/json/") just makes the df messy; I need the df to know
the fields in the JSON file.

How?

Thank you.




**
*Sincerely yours,*


*Raymond*

On Sat, Dec 31, 2016 at 11:52 PM, Miguel Morales 
wrote:

> Looks like it's trying to treat that path as a folder, try omitting
> the file name and just use the folder path.
>
> On Sat, Dec 31, 2016 at 7:58 PM, Raymond Xie  wrote:
> > Happy new year!!!
> >
> > I am trying to load a json file into spark, the json file is attached
> here.
> >
> > I received the following error, can anyone help me to fix it? Thank you
> very
> > much. I am using Spark 1.6.2 and python 2.7.5
> >
>  from pyspark.sql import SQLContext
>  sqlContext = SQLContext(sc)
>  df = sqlContext.read.json("/root/Downloads/data/json/world_
> bank.json")
> > 16/12/31 22:54:53 INFO json.JSONRelation: Listing
> > hdfs://localhost:9000/root/Downloads/data/json/world_bank.json on driver
> > 16/12/31 22:54:54 INFO storage.MemoryStore: Block broadcast_0 stored as
> > values in memory (estimated size 212.4 KB, free 212.4 KB)
> > 16/12/31 22:54:54 INFO storage.MemoryStore: Block broadcast_0_piece0
> stored
> > as bytes in memory (estimated size 19.6 KB, free 232.0 KB)
> > 16/12/31 22:54:54 INFO storage.BlockManagerInfo: Added
> broadcast_0_piece0 in
> > memory on localhost:39844 (size: 19.6 KB, free: 511.1 MB)
> > 16/12/31 22:54:54 INFO spark.SparkContext: Created broadcast 0 from json
> at
> > NativeMethodAccessorImpl.java:-2
> > Traceback (most recent call last):
> >   File "", line 1, in 
> >   File "/opt/spark/python/pyspark/sql/readwriter.py", line 176, in json
> > return self._df(self._jreader.json(path))
> >   File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
> line
> > 813, in __call__
> >   File "/opt/spark/python/pyspark/sql/utils.py", line 45, in deco
> > return f(*a, **kw)
> >   File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line
> 308,
> > in get_return_value
> > py4j.protocol.Py4JJavaError: An error occurred while calling o19.json.
> > : java.io.IOException: No input paths specified in job
> > at
> > org.apache.hadoop.mapred.FileInputFormat.listStatus(
> FileInputFormat.java:201)
> > at
> > org.apache.hadoop.mapred.FileInputFormat.getSplits(
> FileInputFormat.java:313)
> > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> > at
> > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
> MapPartitionsRDD.scala:35)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> > at
> > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
> MapPartitionsRDD.scala:35)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> > at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(
> RDD.scala:1129)
> > at
> > org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:150)
> > at
> > org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:111)
> > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> > at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1127)
> > at
> > org.apache.spark.sql.execution.datasources.json.InferSchema$.infer(
> InferSchema.scala:65)
> > at
> > org.apache.spark.sql.execution.datasources.json.
> JSONRelation$$anonfun$4.apply(JSONRelation.scala:114)
> > at
> > org.apache.spark.sql.execution.datasources.json.
> JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
> > at scala.Option.getOrElse(Option.scala:120)
> > at
> > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$
> lzycompute(JSONRelation.scala:109)
> > at
> > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(
> JSONRelation.scala:108)
> > at
> > org.apache.spark.sql.sources.HadoopFsRelation.schema$
> lzycompute(interfaces.scala:636)
> > at
> > org.apache.spark.sql.sources.HadoopFsRelation.schema(
> interfaces.scala:635)
> > at
> > 

Re: Error when loading json to spark

2017-01-01 Thread Raymond Xie
Thank you Miguel, here is the output:

>>>df = sqlContext.read.json("/root/Downloads/data/json")
17/01/01 07:28:19 INFO json.JSONRelation: Listing
hdfs://localhost:9000/root/Downloads/data/json on driver
17/01/01 07:28:19 INFO storage.MemoryStore: Block broadcast_2 stored as
values in memory (estimated size 212.4 KB, free 444.5 KB)
17/01/01 07:28:19 INFO storage.MemoryStore: Block broadcast_2_piece0 stored
as bytes in memory (estimated size 19.6 KB, free 464.1 KB)
17/01/01 07:28:19 INFO storage.BlockManagerInfo: Added broadcast_2_piece0
in memory on localhost:39844 (size: 19.6 KB, free: 511.1 MB)
17/01/01 07:28:19 INFO spark.SparkContext: Created broadcast 2 from json at
NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/spark/python/pyspark/sql/readwriter.py", line 176, in json
return self._df(self._jreader.json(path))
  File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
813, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308,
in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o115.json.
: java.io.IOException: No input paths specified in job
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1129)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1127)
at
org.apache.spark.sql.execution.datasources.json.InferSchema$.infer(InferSchema.scala:65)
at
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:114)
at
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:109)
at
org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:108)
at
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636)
at
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
at
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

>>>



**
*Sincerely yours,*


*Raymond*

On Sat, Dec 31, 2016 at 11:52 PM, Miguel Morales 
wrote:

> Looks like it's trying to treat that path as a folder, try omitting
> the file name and just use the folder path.
>
> On Sat, Dec 31, 2016 at 7:58 PM, Raymond Xie  wrote:
> > Happy new year!!!
> >
> > I am trying to load a json file into spark, the json file is attached
> here.
> >
> > I received the following error, can