Re: Convert DStream[Long] to Long

2015-04-25 Thread Akhil Das
Like this?

messages.foreachRDD(rdd => {

  if (rdd.count() > 0) {
    // Do whatever you want.
  }

})


Thanks
Best Regards

On Fri, Apr 24, 2015 at 11:20 PM, Sergio Jiménez Barrio 
drarse.a...@gmail.com wrote:

 Hi,

 I need to compare the count of received messages to see whether it is 0 or not, but
 messages.count() returns a DStream[Long]. I tried this solution:

 val cuenta = messages.count().foreachRDD { rdd =>
   rdd.first()
 }

 But this returns Unit, not Long.


 Any suggestion? Thanks!



Re: Convert DStream[Long] to Long

2015-04-25 Thread Sergio Jiménez Barrio
It is solved. Thank you! It is more efficient:

messages.foreachRDD(rdd => {

  if (!rdd.isEmpty()) {
    // Do whatever you want.
  }

})
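
For reference, a minimal sketch combining both approaches, assuming messages is a
DStream (e.g. from KafkaUtils.createStream); names here are illustrative, not from the
thread:

messages.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {                 // cheap: stops at the first element found
    val batchCount: Long = rdd.count()  // full count only for non-empty batches
    println("processing " + batchCount + " messages")
    // ... process the batch ...
  }
}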


2015-04-25 19:21 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com:

 Like this?

 messages.foreachRDD(rdd => {

   if (rdd.count() > 0) {
     // Do whatever you want.
   }

 })


 Thanks
 Best Regards

 On Fri, Apr 24, 2015 at 11:20 PM, Sergio Jiménez Barrio 
 drarse.a...@gmail.com wrote:

 Hi,

 I need to compare the count of received messages to see whether it is 0 or not, but
 messages.count() returns a DStream[Long]. I tried this solution:

 val cuenta = messages.count().foreachRDD { rdd =>
   rdd.first()
 }

 But this returns Unit, not Long.


 Any suggestion? Thanks!





Re: StreamingContext.textFileStream issue

2015-04-25 Thread Yang Lei
I have no problem running the socket text stream sample in the same 
environment. 

Thanks

Yang

Sent from my iPhone

 On Apr 25, 2015, at 1:30 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 Make sure you have >= 2 cores for your streaming application.
 
 Thanks
 Best Regards
 
 On Sat, Apr 25, 2015 at 3:02 AM, Yang Lei genia...@gmail.com wrote:
 I hit the same issue: it behaves as if the directory has no files at all when running
 the sample examples/src/main/python/streaming/hdfs_wordcount.py with a
 local directory and adding files into that directory. I would appreciate comments
 on how to resolve this.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/StreamingContext-textFileStream-issue-tp22501p22650.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 
 


Re: DAG

2015-04-25 Thread Akhil Das
May be this will give you a good start
https://github.com/apache/spark/pull/2077

Thanks
Best Regards

On Sat, Apr 25, 2015 at 1:29 AM, Giovanni Paolo Gibilisco gibb...@gmail.com
 wrote:

 Hi,
 I would like to know if it is possible to build the DAG before actually
 executing the application. My guess is that in the scheduler the DAG is
 built dynamically at runtime since it might depend on the data, but I was
 wondering if there is a way (and maybe a tool already) to analyze the code
 and build the DAG.

 Thank you!



Re: directory loader in windows

2015-04-25 Thread ayan guha
This code is in Python. Also, I tried with a forward slash at the end, with the same
result.
 On 26 Apr 2015 01:36, Jeetendra Gangele gangele...@gmail.com wrote:

 Also, if this code is in Scala, why is there no val for newsY? Is it defined above?
 loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds"
 newsY = sc.textFile(loc)
 print newsY.count()

 On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote:

 Hi

 I am facing this weird issue.

 I am on Windows, and I am trying to load all files within a folder. Here
 is my code -

 loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds"
 newsY = sc.textFile(loc)
 print newsY.count()

 Even this simple code fails. I have tried with giving exact file names,
 everything works.

 Am I missing something stupid here? Anyone facing this (anyone still use
 windows?:))

 Here is error trace:

 D:\Project\Spark\code\news\jsonfeeds

 Traceback (most recent call last):
   File D:/Project/Spark/code/newsfeeder.py, line 28, in module
 print newsY.count()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 932, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 923, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 739, in reduce
 vals = self.mapPartitions(func).collect()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 713, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in
 __call__
 self.target_id, self.name)
   File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in
 get_return_value
 format(target_id, '.', name), value)
 Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : java.lang.NullPointerException

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)

 at org.apache.hadoop.util.Shell.run(Shell.java:455)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557)

 at
 org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)

 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)

 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

 at org.apache.spark.rdd.RDD.collect(RDD.scala:813)

 at
 org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374)

 at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

 at java.lang.reflect.Method.invoke(Unknown Source)

 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

 at 

Re: StreamingContext.textFileStream issue

2015-04-25 Thread Akhil Das
Make sure you have >= 2 cores for your streaming application.

Thanks
Best Regards
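
As a minimal, hedged illustration (not part of the original exchange): a local
textFileStream setup with at least 2 threads. Note that textFileStream only picks up
files that are newly created or moved into the directory after the context starts,
which is a common reason it appears to see no files. The path and batch interval below
are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("TextFileStreamCheck")
val ssc = new StreamingContext(conf, Seconds(10))

// Only files that appear in this directory after start() are picked up.
val lines = ssc.textFileStream("/tmp/streaming-input")
lines.count().print()

ssc.start()
ssc.awaitTermination()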

On Sat, Apr 25, 2015 at 3:02 AM, Yang Lei genia...@gmail.com wrote:

 I hit the same issue: it behaves as if the directory has no files at all when running
 the sample examples/src/main/python/streaming/hdfs_wordcount.py with a
 local directory and adding files into that directory. I would appreciate comments
 on how to resolve this.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/StreamingContext-textFileStream-issue-tp22501p22650.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread doovsaid
Even when grouping by only name, the issue (ClassCastException) is still there.

----- Original Message -----
From: ayan guha guha.a...@gmail.com
To: doovs...@sina.com
Cc: user user@spark.apache.org
Subject: Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown
Date: April 25, 2015, 22:33

Sorry if I am looking at the wrong issue, but your query is wrong. You should group by
only name.
On Sat, Apr 25, 2015 at 11:59 PM,  doovs...@sina.com wrote:
Hi all,

When I query Postgresql based on Spark SQL like this:

      dataFrame.registerTempTable("Employees")

      val emps = sqlContext.sql("select name, sum(salary) from Employees group by name, salary")

      monitor {

        emps.take(10)

          .map(row => (row.getString(0), row.getDecimal(1)))

          .foreach(println)

      }



The type of salary column in data table is numeric(10, 2).



It throws the following exception:

Exception in thread main org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: 
java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal



Does anyone know about this issue and how to solve it? Thanks.



Regards,

Yi






-- 
Best Regards,
Ayan Guha



Re: spark1.3.1 using mysql error!

2015-04-25 Thread Anand Mohan
Yes, you would need to add the MySQL driver jar to both the Spark driver and
executor classpath.
Either use the deprecated SPARK_CLASSPATH environment variable (which the
latest docs still recommend anyway, although it's deprecated), like so:
export SPARK_CLASSPATH=/usr/share/java/mysql-connector.jar
and then run spark-shell or spark-sql.

The other un-deprecated way is to set the below variables in
spark-defaults.conf
spark.driver.extraClassPath  /usr/share/java/mysql-connector.jar
spark.executor.extraClassPath  /usr/share/java/mysql-connector.jar

and then either run spark-shell or spark-sql or start-thriftserver and
beeline

Things will be better once SPARK-6966 is merged into 1.4.0, when you can either
1. use the --jars parameter for spark-shell, spark-sql, etc., or
2. use sc.addJar to add the driver after starting spark-shell.

Good Luck,
Anand Mohan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark1-3-1-using-mysql-error-tp22643p22658.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: DAG

2015-04-25 Thread Corey Nolet
Giovanni,

The DAG can be walked by calling the dependencies() function on any RDD.
It returns a  Seq containing the parent RDDs. If you start at the leaves
and walk through the parents until dependencies() returns an empty Seq, you
ultimately have your DAG.
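
A minimal sketch of that walk (illustrative, not from the thread), starting from any
already-defined RDD; compare also with rdd.toDebugString, which prints a similar
lineage view built into Spark:

import org.apache.spark.rdd.RDD

// Recursively walk an RDD's lineage via dependencies(); indentation reflects depth.
def printLineage(rdd: RDD[_], depth: Int = 0): Unit = {
  println(("  " * depth) + rdd.toString)
  rdd.dependencies.foreach(dep => printLineage(dep.rdd, depth + 1))
}

// Usage: printLineage(someRdd)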

On Sat, Apr 25, 2015 at 1:28 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 May be this will give you a good start
 https://github.com/apache/spark/pull/2077

 Thanks
 Best Regards

 On Sat, Apr 25, 2015 at 1:29 AM, Giovanni Paolo Gibilisco 
 gibb...@gmail.com wrote:

 Hi,
 I would like to know if it is possible to build the DAG before actually
 executing the application. My guess is that in the scheduler the DAG is
 built dynamically at runtime since it might depend on the data, but I was
 wondering if there is a way (and maybe a tool already) to analyze the code
 and build the DAG.

 Thank you!





Re: How can I retrieve item-pair after calculating similarity using RowMatrix

2015-04-25 Thread Joseph Bradley
It looks like your code is making 1 Row per item, which means that
columnSimilarities will compute similarities between users.  If you
transpose the matrix (or construct it as the transpose), then
columnSimilarities should do what you want, and it will return meaningful
indices.
Joseph
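
A minimal sketch of that transposed construction (illustrative, not from the thread),
assuming itemAmount is the total number of items and ratings is the RDD[Rating] from
the original post:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// One row per *user*, one column per *item*, so columnSimilarities() compares items.
val rowsByUser = ratings
  .map(r => (r.user, (r.product, r.rating)))
  .groupByKey()
  .map { case (_, itemRatings) =>
    Vectors.sparse(itemAmount, itemRatings.map { case (item, rating) => (item - 1, rating) }.toSeq)
  }

val mat = new RowMatrix(rowsByUser)
val itemSims = mat.columnSimilarities(0.5)

// Each MatrixEntry(i, j, sim) now refers to 0-based item indices, i.e. the original
// item ids i + 1 and j + 1 given the shift applied above.
itemSims.entries.take(5).foreach(println)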

On Fri, Apr 24, 2015 at 11:20 PM, amghost zhengweita...@outlook.com wrote:

 I have encountered the all-pairs similarity problem in my recommendation
 system. Thanks to this databricks blog, it seems RowMatrix may come to
 help.

 However, RowMatrix is a matrix type without meaningful row indices, so
 I don't know how to retrieve the similarity result after invoking
 columnSimilarities(threshold) for specific items i and j.

 Below are some details about what I am doing:

 1) My data file comes from Movielens with format like this:

 user::item::rating
 2) I build up a RowMatrix in which each sparse vector i represents the
 ratings of all users to this item i

 val dataPath = ...
 val ratings: RDD[Rating] = sc.textFile(dataPath).map(_.split("::") match {
   case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
 })
 val rows = ratings.map(rating => (rating.product, (rating.user, rating.rating)))
   .groupByKey()
   .map(p => Vectors.sparse(userAmount, p._2.map(r => (r._1 - 1, r._2)).toSeq))

 val mat = new RowMatrix(rows)

 val similarities = mat.columnSimilarities(0.5)
 Now I get a CoordinateMatrix, similarities. How can I get the similarity of
 specific items i and j? Although it can be used to retrieve an
 RDD[MatrixEntry], I am not sure whether row i and column j correspond to
 items i and j.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-retrieve-item-pair-after-calculating-similarity-using-RowMatrix-tp22654.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: KMeans takeSample jobs and RDD cached

2015-04-25 Thread Joseph Bradley
Yes, the count() should be the first task, and the sampling + collecting
should be the second task.  The first one is probably slow because the RDD
being sampled is not yet cached/materialized.
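
As a minimal illustration (not from the thread), caching the input before training lets
that first sampling pass reuse the materialized data instead of recomputing the upstream
lineage; the path and parameters below are placeholders:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse a text file of space-separated features and cache it, so the count() and
// sampling passes inside KMeans reuse the materialized RDD.
val data = sc.textFile("hdfs:///path/to/points")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(data, 10, 20, 1, KMeans.RANDOM)  // k, maxIterations, runs, init mode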

K-Means creates some RDDs internally while learning, and since they aren't
needed after learning, they are unpersisted (uncached) at the end.

Joseph

On Sat, Apr 25, 2015 at 6:36 AM, podioss grega...@hotmail.com wrote:

 Hi,
 I am running the k-means algorithm with initialization mode set to random and
 various dataset sizes and values for clusters, and I have a question
 regarding the takeSample job of the algorithm.
 More specifically, I notice that in every application there are two sampling
 jobs. The first one consumes the most time compared to all others, while
 the second one is much quicker, and that sparked my interest to investigate
 what is actually happening.
 In order to explain it, I checked the source code of the takeSample
 operation and saw that there is a count action involved and then the
 computation of a PartitionwiseSampledRDD with a PoissonSampler.
 So my question is whether that count action corresponds to the first takeSample
 job and whether the second takeSample job is the one doing the actual sampling.

 I also have a question about the RDDs that are created for k-means. In the
 middle of the execution, under the Storage tab of the web UI, I can see 3 RDDs
 with their partitions cached in memory across all nodes, which is very
 helpful for monitoring reasons. The problem is that after completion I
 can only see one of them and the portion of the cache memory it used, and I
 would like to ask why the web UI doesn't display all the RDDs involved in
 the computation.

 Thank you



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





How can I retrieve item-pair after calculating similarity using RowMatrix

2015-04-25 Thread amghost
I have encountered the all-pairs similarity problem in my recommendation
system. Thanks to this databricks blog, it seems RowMatrix may come to help.

However, RowMatrix is a matrix type without meaningful row indices, so
I don't know how to retrieve the similarity result after invoking
columnSimilarities(threshold) for specific items i and j.

Below are some details about what I am doing:

1) My data file comes from Movielens with format like this:

user::item::rating
2) I build up a RowMatrix in which each sparse vector i represents the
ratings of all users to this item i

val dataPath = ...
val ratings: RDD[Rating] = sc.textFile(dataPath).map(_.split("::") match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
val rows = ratings.map(rating => (rating.product, (rating.user, rating.rating)))
  .groupByKey()
  .map(p => Vectors.sparse(userAmount, p._2.map(r => (r._1 - 1, r._2)).toSeq))

val mat = new RowMatrix(rows)

val similarities = mat.columnSimilarities(0.5)
Now I get a CoordinateMatrix, similarities. How can I get the similarity of
specific items i and j? Although it can be used to retrieve an
RDD[MatrixEntry], I am not sure whether row i and column j correspond to
items i and j.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-retrieve-item-pair-after-calculating-similarity-using-RowMatrix-tp22654.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: what is the best way to transfer data from RDBMS to spark?

2015-04-25 Thread ayan guha
Actually, Spark SQL provides a data source. Here it is from the documentation:

JDBC To Other Databases

Spark SQL also includes a data source that can read data from other
databases using JDBC. This functionality should be preferred over using
JdbcRDD
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.rdd.JdbcRDD.
This is because the results are returned as a DataFrame and they can easily
be processed in Spark SQL or joined with other data sources. The JDBC data
source is also easier to use from Java or Python as it does not require the
user to provide a ClassTag. (Note that this is different than the Spark SQL
JDBC server, which allows other applications to run queries using Spark
SQL).
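
For example, a minimal sketch of that data source in Spark 1.3.x (the JDBC URL and
table name below are placeholders, and the JDBC driver jar must be on the classpath):

val jdbcDF = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",
  "dbtable" -> "public.employees"
))

jdbcDF.registerTempTable("employees")
sqlContext.sql("select count(*) from employees").show()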

On Fri, Apr 24, 2015 at 6:27 PM, ayan guha guha.a...@gmail.com wrote:

 What is the specific use case? I can think of a couple of ways (write to HDFS
 and then read from Spark, or stream data to Spark). Also, I have seen people
 using MySQL jars to bring data in. Essentially, you want to simulate the
 creation of an RDD.
 On 24 Apr 2015 18:15, sequoiadb mailing-list-r...@sequoiadb.com wrote:

 If I run Spark in standalone mode (not YARN mode), is there any tool
 like Sqoop that is able to transfer data from an RDBMS to Spark storage?

 Thanks





-- 
Best Regards,
Ayan Guha


Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread doovsaid
Yeah, same issue. I noticed this issue is not solved yet. 

----- Original Message -----
From: Ted Yu yuzhih...@gmail.com
To: doovs...@sina.com
Cc: user user@spark.apache.org
Subject: Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown
Date: April 25, 2015, 22:04

Looks like this is related:
https://issues.apache.org/jira/browse/SPARK-5456

On Sat, Apr 25, 2015 at 6:59 AM,  doovs...@sina.com wrote:
Hi all,

When I query Postgresql based on Spark SQL like this:

      dataFrame.registerTempTable("Employees")

      val emps = sqlContext.sql("select name, sum(salary) from Employees group by name, salary")

      monitor {

        emps.take(10)

          .map(row => (row.getString(0), row.getDecimal(1)))

          .foreach(println)

      }



The type of salary column in data table is numeric(10, 2).



It throws the following exception:

Exception in thread main org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: 
java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal



Does anyone know about this issue and how to solve it? Thanks.



Regards,

Yi


Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread ayan guha
Sorry if I am looking at the wrong issue, but your query is wrong. You should group by
only name.

On Sat, Apr 25, 2015 at 11:59 PM, doovs...@sina.com wrote:

 Hi all,
 When I query Postgresql based on Spark SQL like this:
   dataFrame.registerTempTable("Employees")
   val emps = sqlContext.sql("select name, sum(salary) from Employees group by name, salary")
   monitor {
     emps.take(10)
       .map(row => (row.getString(0), row.getDecimal(1)))
       .foreach(println)
   }

 The type of salary column in data table is numeric(10, 2).

 It throws the following exception:
 Exception in thread main org.apache.spark.SparkException: Job aborted
 due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent
 failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):
 java.lang.ClassCastException: java.math.BigDecimal cannot be cast to
 org.apache.spark.sql.types.Decimal

 Does anyone know about this issue and how to solve it? Thanks.

 Regards,
 Yi




-- 
Best Regards,
Ayan Guha


directory loader in windows

2015-04-25 Thread ayan guha
Hi

I am facing this weird issue.

I am on Windows, and I am trying to load all files within a folder. Here is
my code -

loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds"
newsY = sc.textFile(loc)
print newsY.count()

Even this simple code fails. I have tried with giving exact file names,
everything works.

Am I missing something stupid here? Anyone facing this (anyone still use
windows?:))

Here is error trace:

D:\Project\Spark\code\news\jsonfeeds

Traceback (most recent call last):
  File D:/Project/Spark/code/newsfeeder.py, line 28, in module
print newsY.count()
  File
D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
line 932, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File
D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
line 923, in sum
return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File
D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
line 739, in reduce
vals = self.mapPartitions(func).collect()
  File
D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
line 713, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in
__call__
self.target_id, self.name)
  File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in
get_return_value
format(target_id, '.', name), value)
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NullPointerException

at java.lang.ProcessBuilder.start(Unknown Source)

at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)

at org.apache.hadoop.util.Shell.run(Shell.java:455)

at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)

at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)

at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

at
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)

at
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557)

at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)

at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699)

at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)

at
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)

at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

at org.apache.spark.rdd.RDD.collect(RDD.scala:813)

at
org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374)

at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

at java.lang.reflect.Method.invoke(Unknown Source)

at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

at py4j.Gateway.invoke(Gateway.java:259)

at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

at py4j.commands.CallCommand.execute(CallCommand.java:79)

at py4j.GatewayConnection.run(GatewayConnection.java:207)

at java.lang.Thread.run(Unknown Source)

-- 
Best Regards,
Ayan Guha
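
The NullPointerException at ProcessBuilder.start inside
RawLocalFileSystem.loadPermissionInfo is the usual symptom of Hadoop on Windows not
finding winutils.exe. A hedged sketch only (the paths below are placeholders, and for
PySpark the equivalent is setting the HADOOP_HOME environment variable before starting
the driver):

// Point Hadoop at a directory that contains bin\winutils.exe before the first
// filesystem access; "C:\\hadoop" below is a placeholder.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val newsY = sc.textFile("D:\\Project\\Spark\\code\\news\\jsonfeeds")
println(newsY.count())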


Re: directory loader in windows

2015-04-25 Thread Jeetendra Gangele
loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds\\"

On 25 April 2015 at 20:49, Jeetendra Gangele gangele...@gmail.com wrote:

 Hi Ayan, can you try the below line:

 loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds"

 On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote:

 Hi

 I am facing this weird issue.

 I am on Windows, and I am trying to load all files within a folder. Here
 is my code -

 loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds"
 newsY = sc.textFile(loc)
 print newsY.count()

 Even this simple code fails. I have tried with giving exact file names,
 everything works.

 Am I missing something stupid here? Anyone facing this (anyone still use
 windows?:))

 Here is error trace:

 D:\Project\Spark\code\news\jsonfeeds

 Traceback (most recent call last):
   File D:/Project/Spark/code/newsfeeder.py, line 28, in module
 print newsY.count()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 932, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 923, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 739, in reduce
 vals = self.mapPartitions(func).collect()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 713, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in
 __call__
 self.target_id, self.name)
   File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in
 get_return_value
 format(target_id, '.', name), value)
 Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : java.lang.NullPointerException

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)

 at org.apache.hadoop.util.Shell.run(Shell.java:455)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557)

 at
 org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)

 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)

 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

 at org.apache.spark.rdd.RDD.collect(RDD.scala:813)

 at
 org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374)

 at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

 at java.lang.reflect.Method.invoke(Unknown Source)

 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

 at py4j.Gateway.invoke(Gateway.java:259)

 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

 at 

Re: directory loader in windows

2015-04-25 Thread Jeetendra Gangele
Hi Ayan, can you try the below line:

loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds"

On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote:

 Hi

 I am facing this weird issue.

 I am on Windows, and I am trying to load all files within a folder. Here
 is my code -

 loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds"
 newsY = sc.textFile(loc)
 print newsY.count()

 Even this simple code fails. I have tried with giving exact file names,
 everything works.

 Am I missing something stupid here? Anyone facing this (anyone still use
 windows?:))

 Here is error trace:

 D:\Project\Spark\code\news\jsonfeeds

 Traceback (most recent call last):
   File D:/Project/Spark/code/newsfeeder.py, line 28, in module
 print newsY.count()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 932, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 923, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 739, in reduce
 vals = self.mapPartitions(func).collect()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 713, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in
 __call__
 self.target_id, self.name)
   File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in
 get_return_value
 format(target_id, '.', name), value)
 Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : java.lang.NullPointerException

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)

 at org.apache.hadoop.util.Shell.run(Shell.java:455)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557)

 at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)

 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)

 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

 at org.apache.spark.rdd.RDD.collect(RDD.scala:813)

 at
 org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374)

 at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

 at java.lang.reflect.Method.invoke(Unknown Source)

 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

 at py4j.Gateway.invoke(Gateway.java:259)

 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

 at py4j.commands.CallCommand.execute(CallCommand.java:79)

 at py4j.GatewayConnection.run(GatewayConnection.java:207)

 at 

KMeans takeSample jobs and RDD cached

2015-04-25 Thread podioss
Hi,
I am running the k-means algorithm with initialization mode set to random and
various dataset sizes and values for clusters, and I have a question
regarding the takeSample job of the algorithm.
More specifically, I notice that in every application there are two sampling
jobs. The first one consumes the most time compared to all others, while
the second one is much quicker, and that sparked my interest to investigate
what is actually happening.
In order to explain it, I checked the source code of the takeSample
operation and saw that there is a count action involved and then the
computation of a PartitionwiseSampledRDD with a PoissonSampler.
So my question is whether that count action corresponds to the first takeSample
job and whether the second takeSample job is the one doing the actual sampling.

I also have a question about the RDDs that are created for k-means. In the
middle of the execution, under the Storage tab of the web UI, I can see 3 RDDs
with their partitions cached in memory across all nodes, which is very
helpful for monitoring reasons. The problem is that after completion I
can only see one of them and the portion of the cache memory it used, and I
would like to ask why the web UI doesn't display all the RDDs involved in
the computation.

Thank you



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread doovsaid
Hi all,
When I query Postgresql based on Spark SQL like this:
  dataFrame.registerTempTable("Employees")
  val emps = sqlContext.sql("select name, sum(salary) from Employees group by name, salary")
  monitor {
    emps.take(10)
      .map(row => (row.getString(0), row.getDecimal(1)))
      .foreach(println)
  }

The type of salary column in data table is numeric(10, 2).

It throws the following exception:
Exception in thread main org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: 
java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal

Does anyone know about this issue and how to solve it? Thanks.

Regards,
Yi



Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread Ted Yu
Looks like this is related:
https://issues.apache.org/jira/browse/SPARK-5456
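
One possible workaround while that JIRA is open (a hedged, untested sketch against this
exact setup) is to avoid returning a Decimal from the aggregate by casting on the SQL
side:

// Sketch: cast the numeric column to double so the aggregate comes back as a
// plain Double instead of a Decimal.
val emps = sqlContext.sql(
  "select name, sum(cast(salary as double)) as total from Employees group by name")

emps.take(10)
  .map(row => (row.getString(0), row.getDouble(1)))
  .foreach(println)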

On Sat, Apr 25, 2015 at 6:59 AM, doovs...@sina.com wrote:

 Hi all,
 When I query Postgresql based on Spark SQL like this:
   dataFrame.registerTempTable("Employees")
   val emps = sqlContext.sql("select name, sum(salary) from Employees group by name, salary")
   monitor {
     emps.take(10)
       .map(row => (row.getString(0), row.getDecimal(1)))
       .foreach(println)
   }

 The type of salary column in data table is numeric(10, 2).

 It throws the following exception:
 Exception in thread main org.apache.spark.SparkException: Job aborted
 due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent
 failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):
 java.lang.ClassCastException: java.math.BigDecimal cannot be cast to
 org.apache.spark.sql.types.Decimal

 Does anyone know about this issue and how to solve it? Thanks.

 Regards,
 Yi




Re: directory loader in windows

2015-04-25 Thread Jeetendra Gangele
Extra forward slash at the end. Sometimes I have seen this kind of issue.

On 25 April 2015 at 20:50, Jeetendra Gangele gangele...@gmail.com wrote:

 loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds\\"

 On 25 April 2015 at 20:49, Jeetendra Gangele gangele...@gmail.com wrote:

 Hi Ayan, can you try the below line:

 loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds"

 On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote:

 Hi

 I am facing this weird issue.

 I am on Windows, and I am trying to load all files within a folder. Here
 is my code -

 loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds"
 newsY = sc.textFile(loc)
 print newsY.count()

 Even this simple code fails. I have tried with giving exact file names,
 everything works.

 Am I missing something stupid here? Anyone facing this (anyone still use
 windows?:))

 Here is error trace:

 D:\Project\Spark\code\news\jsonfeeds

 Traceback (most recent call last):
   File D:/Project/Spark/code/newsfeeder.py, line 28, in module
 print newsY.count()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 932, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 923, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 739, in reduce
 vals = self.mapPartitions(func).collect()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 713, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537,
 in __call__
 self.target_id, self.name)
   File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in
 get_return_value
 format(target_id, '.', name), value)
 Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : java.lang.NullPointerException

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)

 at org.apache.hadoop.util.Shell.run(Shell.java:455)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557)

 at
 org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)

 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)

 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

 at org.apache.spark.rdd.RDD.collect(RDD.scala:813)

 at
 org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374)

 at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

 at java.lang.reflect.Method.invoke(Unknown Source)

 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

 at 

Re: directory loader in windows

2015-04-25 Thread Jeetendra Gangele
Also, if this code is in Scala, why is there no val for newsY? Is it defined above?
loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds"
newsY = sc.textFile(loc)
print newsY.count()

On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote:

 Hi

 I am facing this weird issue.

 I am on Windows, and I am trying to load all files within a folder. Here
 is my code -

 loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds"
 newsY = sc.textFile(loc)
 print newsY.count()

 Even this simple code fails. I have tried with giving exact file names,
 everything works.

 Am I missing something stupid here? Anyone facing this (anyone still use
 windows?:))

 Here is error trace:

 D:\Project\Spark\code\news\jsonfeeds

 Traceback (most recent call last):
   File D:/Project/Spark/code/newsfeeder.py, line 28, in module
 print newsY.count()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 932, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 923, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 739, in reduce
 vals = self.mapPartitions(func).collect()
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py,
 line 713, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in
 __call__
 self.target_id, self.name)
   File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in
 get_return_value
 format(target_id, '.', name), value)
 Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : java.lang.NullPointerException

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)

 at org.apache.hadoop.util.Shell.run(Shell.java:455)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557)

 at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699)

 at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)

 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)

 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

 at org.apache.spark.rdd.RDD.collect(RDD.scala:813)

 at
 org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374)

 at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

 at java.lang.reflect.Method.invoke(Unknown Source)

 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

 at py4j.Gateway.invoke(Gateway.java:259)

 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

 at 

Re: what is the best way to transfer data from RDBMS to spark?

2015-04-25 Thread Sujeevan
If your use case is more about querying the RDBMS and then bringing the
results into Spark to do some analysis, then the Spark SQL JDBC data source API
http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/
is the best option. If your use case is to bring the entire data set into Spark, then
you'll have to explore other options, which depend on the data type, e.g. the Spark
Redshift integration
http://spark-packages.org/package/databricks/spark-redshift

Best Regards,

Sujeevan. N

On Sat, Apr 25, 2015 at 4:24 PM, ayan guha guha.a...@gmail.com wrote:

 Actually, Spark SQL provides a data source. Here is from documentation -

 JDBC To Other Databases

 Spark SQL also includes a data source that can read data from other
 databases using JDBC. This functionality should be preferred over using
 JdbcRDD
 https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.rdd.JdbcRDD.
 This is because the results are returned as a DataFrame and they can easily
 be processed in Spark SQL or joined with other data sources. The JDBC data
 source is also easier to use from Java or Python as it does not require the
 user to provide a ClassTag. (Note that this is different than the Spark SQL
 JDBC server, which allows other applications to run queries using Spark
 SQL).

 On Fri, Apr 24, 2015 at 6:27 PM, ayan guha guha.a...@gmail.com wrote:

 What is the specific use case? I can think of a couple of ways (write to
 HDFS and then read from Spark, or stream data to Spark). Also, I have seen
 people using MySQL jars to bring data in. Essentially, you want to simulate the
 creation of an RDD.
 On 24 Apr 2015 18:15, sequoiadb mailing-list-r...@sequoiadb.com
 wrote:

 If I run Spark in standalone mode (not YARN mode), is there any tool
 like Sqoop that is able to transfer data from an RDBMS to Spark storage?

 Thanks





 --
 Best Regards,
 Ayan Guha