Re: Convert DStream[Long] to Long
Like this? messages.foreachRDD(rdd => { if (rdd.count() > 0) // Do whatever you want. }) Thanks Best Regards On Fri, Apr 24, 2015 at 11:20 PM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: Hi, I need to check whether the count of received messages is 0 or not, but messages.count() returns a DStream[Long]. I tried this solution: val cuenta = messages.count().foreachRDD { rdd => rdd.first() } But this returns Unit, not Long. Any suggestion? Thanks!
Re: Convert DStream[Long] to Long
It is solved. Thank you! It is more efficient: messages.foreachRDD(rdd => { if (!rdd.isEmpty) // Do whatever you want. }) 2015-04-25 19:21 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com: Like this? messages.foreachRDD(rdd => { if (rdd.count() > 0) // Do whatever you want. }) Thanks Best Regards On Fri, Apr 24, 2015 at 11:20 PM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: Hi, I need to check whether the count of received messages is 0 or not, but messages.count() returns a DStream[Long]. I tried this solution: val cuenta = messages.count().foreachRDD { rdd => rdd.first() } But this returns Unit, not Long. Any suggestion? Thanks!
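For reference, a rough sketch of the pattern discussed in this thread, assuming messages is an existing DStream produced elsewhere: rdd.isEmpty avoids counting the whole batch just to test for emptiness, while count() inside foreachRDD gives the per-batch total as a plain Long.

import org.apache.spark.streaming.dstream.DStream

// Sketch only: `messages` is assumed to be a DStream built elsewhere.
def processNonEmpty[T](messages: DStream[T]): Unit = {
  messages.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      val n = rdd.count()              // per-batch count as a plain Long
      println(s"Batch received $n messages")
      // ... do whatever you want with this batch's RDD
    }
  }
}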
Re: StreamingContext.textFileStream issue
I have no problem running the socket text stream sample in the same environment. Thanks Yang Sent from my iPhone On Apr 25, 2015, at 1:30 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Make sure you have >= 2 cores for your streaming application. Thanks Best Regards On Sat, Apr 25, 2015 at 3:02 AM, Yang Lei genia...@gmail.com wrote: I hit the same issue, as if the directory has no files at all, when running the sample examples/src/main/python/streaming/hdfs_wordcount.py with a local directory and adding files into that directory. Appreciate comments on how to resolve this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/StreamingContext-textFileStream-issue-tp22501p22650.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: DAG
Maybe this will give you a good start: https://github.com/apache/spark/pull/2077 Thanks Best Regards On Sat, Apr 25, 2015 at 1:29 AM, Giovanni Paolo Gibilisco gibb...@gmail.com wrote: Hi, I would like to know if it is possible to build the DAG before actually executing the application. My guess is that in the scheduler the DAG is built dynamically at runtime since it might depend on the data, but I was wondering if there is a way (and maybe a tool already) to analyze the code and build the DAG. Thank you!
Re: directory loader in windows
This code is in python. Also I tried with fwd slash at the end with same result On 26 Apr 2015 01:36, Jeetendra Gangele gangele...@gmail.com wrote: also if this code is in scala why not val in newsY? is this define above? loc = D:\\Project\\Spark\\code\\news\\jsonfeeds newsY = sc.textFile(loc) print newsY.count() On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote: Hi I am facing this weird issue. I am on Windows, and I am trying to load all files within a folder. Here is my code - loc = D:\\Project\\Spark\\code\\news\\jsonfeeds newsY = sc.textFile(loc) print newsY.count() Even this simple code fails. I have tried with giving exact file names, everything works. Am I missing something stupid here? Anyone facing this (anyone still use windows?:)) Here is error trace: D:\Project\Spark\code\news\jsonfeeds Traceback (most recent call last): File D:/Project/Spark/code/newsfeeder.py, line 28, in module print newsY.count() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 932, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 923, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 739, in reduce vals = self.mapPartitions(func).collect() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 713, in collect port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in __call__ self.target_id, self.name) File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in get_return_value format(target_id, '.', name), value) Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: java.lang.NullPointerException at java.lang.ProcessBuilder.start(Unknown Source) at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) at org.apache.hadoop.util.Shell.execCommand(Shell.java:808) at org.apache.hadoop.util.Shell.execCommand(Shell.java:791) at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557) at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512) at org.apache.spark.rdd.RDD.collect(RDD.scala:813) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at
Re: StreamingContext.textFileStream issue
Make sure you have >= 2 cores for your streaming application. Thanks Best Regards On Sat, Apr 25, 2015 at 3:02 AM, Yang Lei genia...@gmail.com wrote: I hit the same issue, as if the directory has no files at all, when running the sample examples/src/main/python/streaming/hdfs_wordcount.py with a local directory and adding files into that directory. Appreciate comments on how to resolve this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/StreamingContext-textFileStream-issue-tp22501p22650.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
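For reference, a minimal sketch of a file-stream application along the lines being discussed. The directory path is a placeholder, local[2] follows the >= 2 cores suggestion above, and note that textFileStream only picks up files created or moved into the directory after the streaming context has started, which is a common reason a stream appears to see no files.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FileStreamWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical directory: only files added after start() are picked up.
    val lines = ssc.textFileStream("file:///tmp/streaming-input")
    val words = lines.flatMap(_.split(" "))
    words.count().print()   // number of words in each batch

    ssc.start()
    ssc.awaitTermination()
  }
}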
Reply: Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown
Even when grouping only by name, the issue (ClassCastException) is still there. - Original Message - From: ayan guha guha.a...@gmail.com To: doovs...@sina.com Cc: user user@spark.apache.org Subject: Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown Date: April 25, 2015, 22:33 Sorry if I am looking at the wrong issue, but your query is wrong. You should group by only on name. On Sat, Apr 25, 2015 at 11:59 PM, doovs...@sina.com wrote: Hi all, When I query PostgreSQL via Spark SQL like this: dataFrame.registerTempTable("Employees") val emps = sqlContext.sql("select name, sum(salary) from Employees group by name, salary") monitor { emps.take(10) .map(row => (row.getString(0), row.getDecimal(1))) .foreach(println) } The type of the salary column in the data table is numeric(10, 2). It throws the following exception: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal Who knows about this issue and how to solve it? Thanks. Regards, Yi - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Best Regards, Ayan Guha - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark1.3.1 using mysql error!
Yes, you would need to add the MySQL driver jar to the Spark driver and executor classpaths. Either use the deprecated SPARK_CLASSPATH environment variable (which the latest docs still recommend anyway, even though it's deprecated), like so: export SPARK_CLASSPATH=/usr/share/java/mysql-connector.jar spark-shell or spark-sql The other, un-deprecated way is to set the variables below in spark-defaults.conf spark.driver.extraClassPath /usr/share/java/mysql-connector.jar spark.executor.extraClassPath /usr/share/java/mysql-connector.jar and then either run spark-shell or spark-sql, or start-thriftserver and beeline. Things will be better once SPARK-6966 is merged into 1.4.0, when you can either 1. use the --jars parameter for spark-shell, spark-sql, etc., or 2. use sc.addJar to add the driver after starting spark-shell. Good Luck, Anand Mohan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-3-1-using-mysql-error-tp22643p22658.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
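For reference, a small sketch of what becomes possible once the connector jar is on both classpaths. The host, database, and table names are made up, and it uses the Spark 1.3 load() API.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`

// Hypothetical MySQL connection details and table name.
val employees = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:mysql://dbhost:3306/mydb?user=spark&password=secret",
  "dbtable" -> "employees"))

employees.registerTempTable("employees")
sqlContext.sql("SELECT COUNT(*) FROM employees").show()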
Re: DAG
Giovanni, The DAG can be walked by calling the dependencies() function on any RDD. It returns a Seq containing the parent RDDs. If you start at the leaves and walk through the parents until dependencies() returns an empty Seq, you ultimately have your DAG. On Sat, Apr 25, 2015 at 1:28 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Maybe this will give you a good start: https://github.com/apache/spark/pull/2077 Thanks Best Regards On Sat, Apr 25, 2015 at 1:29 AM, Giovanni Paolo Gibilisco gibb...@gmail.com wrote: Hi, I would like to know if it is possible to build the DAG before actually executing the application. My guess is that in the scheduler the DAG is built dynamically at runtime since it might depend on the data, but I was wondering if there is a way (and maybe a tool already) to analyze the code and build the DAG. Thank you!
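For reference, a small sketch of walking the lineage the way described above: start from a final RDD and recursively follow dependencies() until it comes back empty (rdd.toDebugString prints something similar).

import org.apache.spark.rdd.RDD

// Print an RDD and all of its ancestors, indenting one level per hop.
def printLineage(rdd: RDD[_], indent: Int = 0): Unit = {
  println(" " * indent + rdd)
  rdd.dependencies.foreach(dep => printLineage(dep.rdd, indent + 2))
}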
Re: How can I retrieve item-pair after calculating similarity using RowMatrix
It looks like your code is making one Row per item, which means that columnSimilarities will compute similarities between users. If you transpose the matrix (or construct it as the transpose), then columnSimilarities should do what you want, and it will return meaningful indices. Joseph On Fri, Apr 24, 2015 at 11:20 PM, amghost zhengweita...@outlook.com wrote: I have encountered the all-pairs similarity problem in my recommendation system. Thanks to this databricks blog, it seems RowMatrix may come to help. However, RowMatrix is a matrix type without meaningful row indices, so I don't know how to retrieve the similarity result for specific items i and j after invoking columnSimilarities(threshold). Below are some details about what I am doing: 1) My data file comes from Movielens with a format like this: user::item::rating 2) I build up a RowMatrix in which each sparse vector i represents the ratings of all users for item i val dataPath = ... val ratings: RDD[Rating] = sc.textFile(dataPath).map(_.split("::") match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) }) val rows = ratings.map(rating => (rating.product, (rating.user, rating.rating))) .groupByKey() .map(p => Vectors.sparse(userAmount, p._2.map(r => (r._1 - 1, r._2)).toSeq)) val mat = new RowMatrix(rows) val similarities = mat.columnSimilarities(0.5) Now I get a CoordinateMatrix similarities. How can I get the similarity of specific items i and j? Although it can be used to retrieve an RDD[MatrixEntry], I am not sure whether row i and column j correspond to items i and j. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-retrieve-item-pair-after-calculating-similarity-using-RowMatrix-tp22654.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
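For reference, a rough sketch of the transposed construction suggested above, reusing the ratings RDD from the original post and assuming an itemAmount count is available: one sparse vector per user, indexed by item, so columnSimilarities() compares items and the MatrixEntry indices are the (0-based, shifted) item ids.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

// One sparse vector per user; vector indices are item ids shifted to 0-based.
val rows = ratings
  .map(r => (r.user, (r.product, r.rating)))
  .groupByKey()
  .map { case (_, itemRatings) =>
    Vectors.sparse(itemAmount, itemRatings.map { case (item, rate) => (item - 1, rate) }.toSeq)
  }

val itemSimilarities = new RowMatrix(rows).columnSimilarities(0.5)

// Similarity of items i and j (0-based): DIMSUM returns only the upper
// triangle and only pairs above the threshold, hence the min/max and Option.
def similarity(i: Long, j: Long): Option[Double] =
  itemSimilarities.entries
    .filter { case MatrixEntry(r, c, _) => r == math.min(i, j) && c == math.max(i, j) }
    .map(_.value)
    .collect()
    .headOption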
Re: KMeans takeSample jobs and RDD cached
Yes, the count() should be the first job, and the sampling + collecting should be the second job. The first one is probably slow because the RDD being sampled is not yet cached/materialized. K-Means creates some RDDs internally while learning, and since they aren't needed after learning, they are unpersisted (uncached) at the end. Joseph On Sat, Apr 25, 2015 at 6:36 AM, podioss grega...@hotmail.com wrote: Hi, I am running the k-means algorithm with initialization mode set to random and various dataset sizes and numbers of clusters, and I have a question regarding the takeSample job of the algorithm. More specifically, I notice that in every application there are two sampling jobs. The first one consumes the most time compared to all others, while the second one is much quicker, and that sparked my interest to investigate what is actually happening. To explain it, I checked the source code of the takeSample operation and I saw that there is a count action involved and then the computation of a PartitionwiseSampledRDD with a PoissonSampler. So my question is: does that count action correspond to the first takeSample job, and is the second takeSample job the one doing the actual sampling? I also have a question about the RDDs that are created for k-means. In the middle of the execution, under the storage tab of the web UI, I can see 3 RDDs with their partitions cached in memory across all nodes, which is very helpful for monitoring reasons. The problem is that after completion I can only see one of them and the portion of the cache memory it used, and I would like to ask why the web UI doesn't display all the RDDs involved in the computation. Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
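For reference, a small sketch (with a hypothetical input path) of persisting the training data before calling KMeans, so the count and sampling passes inside takeSample hit a materialized RDD instead of re-parsing the input on the first job.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`; the path is a placeholder.
val data = sc.textFile("hdfs:///data/kmeans_points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .persist(StorageLevel.MEMORY_AND_DISK)

// k = 10, 20 iterations, 1 run, random initialization as in the question.
val model = KMeans.train(data, 10, 20, 1, KMeans.RANDOM)
data.unpersist()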
How can I retrieve item-pair after calculating similarity using RowMatrix
I have encountered the all-pairs similarity problem in my recommendation system. Thanks to this databricks blog, it seems RowMatrix may come to help. However, RowMatrix is a matrix type without meaningful row indices, so I don't know how to retrieve the similarity result for specific items i and j after invoking columnSimilarities(threshold). Below are some details about what I am doing: 1) My data file comes from Movielens with a format like this: user::item::rating 2) I build up a RowMatrix in which each sparse vector i represents the ratings of all users for item i val dataPath = ... val ratings: RDD[Rating] = sc.textFile(dataPath).map(_.split("::") match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) }) val rows = ratings.map(rating => (rating.product, (rating.user, rating.rating))) .groupByKey() .map(p => Vectors.sparse(userAmount, p._2.map(r => (r._1 - 1, r._2)).toSeq)) val mat = new RowMatrix(rows) val similarities = mat.columnSimilarities(0.5) Now I get a CoordinateMatrix similarities. How can I get the similarity of specific items i and j? Although it can be used to retrieve an RDD[MatrixEntry], I am not sure whether row i and column j correspond to items i and j. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-retrieve-item-pair-after-calculating-similarity-using-RowMatrix-tp22654.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: what is the best way to transfer data from RDBMS to spark?
Actually, Spark SQL provides a data source. Here is what the documentation says - JDBC To Other Databases Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.rdd.JdbcRDD. This is because the results are returned as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python as it does not require the user to provide a ClassTag. (Note that this is different than the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL). On Fri, Apr 24, 2015 at 6:27 PM, ayan guha guha.a...@gmail.com wrote: What is the specific use case? I can think of a couple of ways (write to HDFS and then read from Spark, or stream data to Spark). Also, I have seen people using MySQL jars to bring data in. Essentially you want to simulate the creation of an RDD. On 24 Apr 2015 18:15, sequoiadb mailing-list-r...@sequoiadb.com wrote: If I run Spark in stand-alone mode (not YARN mode), is there any tool like Sqoop that is able to transfer data from an RDBMS to Spark storage? Thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Best Regards, Ayan Guha
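For reference, a rough sketch (with made-up connection details) of pulling a whole table through the JDBC data source described above. The partitionColumn, lowerBound, upperBound and numPartitions options split the read into several parallel JDBC queries, which matters when transferring a large table.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`

// Hypothetical PostgreSQL connection; the read is split into 8 partitions
// over the numeric column order_id.
val orders = sqlContext.load("jdbc", Map(
  "url"             -> "jdbc:postgresql://dbhost:5432/shop?user=spark&password=secret",
  "dbtable"         -> "orders",
  "partitionColumn" -> "order_id",
  "lowerBound"      -> "1",
  "upperBound"      -> "1000000",
  "numPartitions"   -> "8"))

orders.registerTempTable("orders")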
Reply: Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown
Yeah, same issue. I noticed this issue has not been solved yet. - Original Message - From: Ted Yu yuzhih...@gmail.com To: doovs...@sina.com Cc: user user@spark.apache.org Subject: Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown Date: April 25, 2015, 22:04 Looks like this is related: https://issues.apache.org/jira/browse/SPARK-5456 On Sat, Apr 25, 2015 at 6:59 AM, doovs...@sina.com wrote: Hi all, When I query PostgreSQL via Spark SQL like this: dataFrame.registerTempTable("Employees") val emps = sqlContext.sql("select name, sum(salary) from Employees group by name, salary") monitor { emps.take(10) .map(row => (row.getString(0), row.getDecimal(1))) .foreach(println) } The type of the salary column in the data table is numeric(10, 2). It throws the following exception: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal Who knows about this issue and how to solve it? Thanks. Regards, Yi - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown
Sorry if I am looking at the wrong issue, but your query is wrong. You should group by only on name. On Sat, Apr 25, 2015 at 11:59 PM, doovs...@sina.com wrote: Hi all, When I query PostgreSQL via Spark SQL like this: dataFrame.registerTempTable("Employees") val emps = sqlContext.sql("select name, sum(salary) from Employees group by name, salary") monitor { emps.take(10) .map(row => (row.getString(0), row.getDecimal(1))) .foreach(println) } The type of the salary column in the data table is numeric(10, 2). It throws the following exception: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal Who knows about this issue and how to solve it? Thanks. Regards, Yi - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Best Regards, Ayan Guha
directory loader in windows
Hi I am facing this weird issue. I am on Windows, and I am trying to load all files within a folder. Here is my code - loc = D:\\Project\\Spark\\code\\news\\jsonfeeds newsY = sc.textFile(loc) print newsY.count() Even this simple code fails. I have tried with giving exact file names, everything works. Am I missing something stupid here? Anyone facing this (anyone still use windows?:)) Here is error trace: D:\Project\Spark\code\news\jsonfeeds Traceback (most recent call last): File D:/Project/Spark/code/newsfeeder.py, line 28, in module print newsY.count() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 932, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 923, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 739, in reduce vals = self.mapPartitions(func).collect() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 713, in collect port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in __call__ self.target_id, self.name) File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in get_return_value format(target_id, '.', name), value) Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.lang.NullPointerException at java.lang.ProcessBuilder.start(Unknown Source) at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) at org.apache.hadoop.util.Shell.execCommand(Shell.java:808) at org.apache.hadoop.util.Shell.execCommand(Shell.java:791) at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557) at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512) at org.apache.spark.rdd.RDD.collect(RDD.scala:813) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Unknown Source) -- Best Regards, Ayan Guha
Re: directory loader in windows
loc = D:\\Project\\Spark\\code\\news\\jsonfeeds\\ On 25 April 2015 at 20:49, Jeetendra Gangele gangele...@gmail.com wrote: Hi Ayan can you try below line loc = D:\\Project\\Spark\\code\\news\\jsonfeeds On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote: Hi I am facing this weird issue. I am on Windows, and I am trying to load all files within a folder. Here is my code - loc = D:\\Project\\Spark\\code\\news\\jsonfeeds newsY = sc.textFile(loc) print newsY.count() Even this simple code fails. I have tried with giving exact file names, everything works. Am I missing something stupid here? Anyone facing this (anyone still use windows?:)) Here is error trace: D:\Project\Spark\code\news\jsonfeeds Traceback (most recent call last): File D:/Project/Spark/code/newsfeeder.py, line 28, in module print newsY.count() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 932, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 923, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 739, in reduce vals = self.mapPartitions(func).collect() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 713, in collect port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in __call__ self.target_id, self.name) File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in get_return_value format(target_id, '.', name), value) Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: java.lang.NullPointerException at java.lang.ProcessBuilder.start(Unknown Source) at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) at org.apache.hadoop.util.Shell.execCommand(Shell.java:808) at org.apache.hadoop.util.Shell.execCommand(Shell.java:791) at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557) at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512) at org.apache.spark.rdd.RDD.collect(RDD.scala:813) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at
Re: directory loader in windows
Hi Ayan can you try below line loc = D:\\Project\\Spark\\code\\news\\jsonfeeds On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote: Hi I am facing this weird issue. I am on Windows, and I am trying to load all files within a folder. Here is my code - loc = D:\\Project\\Spark\\code\\news\\jsonfeeds newsY = sc.textFile(loc) print newsY.count() Even this simple code fails. I have tried with giving exact file names, everything works. Am I missing something stupid here? Anyone facing this (anyone still use windows?:)) Here is error trace: D:\Project\Spark\code\news\jsonfeeds Traceback (most recent call last): File D:/Project/Spark/code/newsfeeder.py, line 28, in module print newsY.count() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 932, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 923, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 739, in reduce vals = self.mapPartitions(func).collect() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 713, in collect port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in __call__ self.target_id, self.name) File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in get_return_value format(target_id, '.', name), value) Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: java.lang.NullPointerException at java.lang.ProcessBuilder.start(Unknown Source) at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) at org.apache.hadoop.util.Shell.execCommand(Shell.java:808) at org.apache.hadoop.util.Shell.execCommand(Shell.java:791) at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557) at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512) at org.apache.spark.rdd.RDD.collect(RDD.scala:813) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at
KMeans takeSample jobs and RDD cached
Hi, I am running the k-means algorithm with initialization mode set to random and various dataset sizes and numbers of clusters, and I have a question regarding the takeSample job of the algorithm. More specifically, I notice that in every application there are two sampling jobs. The first one consumes the most time compared to all others, while the second one is much quicker, and that sparked my interest to investigate what is actually happening. To explain it, I checked the source code of the takeSample operation and I saw that there is a count action involved and then the computation of a PartitionwiseSampledRDD with a PoissonSampler. So my question is: does that count action correspond to the first takeSample job, and is the second takeSample job the one doing the actual sampling? I also have a question about the RDDs that are created for k-means. In the middle of the execution, under the storage tab of the web UI, I can see 3 RDDs with their partitions cached in memory across all nodes, which is very helpful for monitoring reasons. The problem is that after completion I can only see one of them and the portion of the cache memory it used, and I would like to ask why the web UI doesn't display all the RDDs involved in the computation. Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark SQL 1.3.1: java.lang.ClassCastException is thrown
Hi all, When I query PostgreSQL via Spark SQL like this: dataFrame.registerTempTable("Employees") val emps = sqlContext.sql("select name, sum(salary) from Employees group by name, salary") monitor { emps.take(10) .map(row => (row.getString(0), row.getDecimal(1))) .foreach(println) } The type of the salary column in the data table is numeric(10, 2). It throws the following exception: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal Who knows about this issue and how to solve it? Thanks. Regards, Yi - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown
Looks like this is related: https://issues.apache.org/jira/browse/SPARK-5456 On Sat, Apr 25, 2015 at 6:59 AM, doovs...@sina.com wrote: Hi all, When I query PostgreSQL via Spark SQL like this: dataFrame.registerTempTable("Employees") val emps = sqlContext.sql("select name, sum(salary) from Employees group by name, salary") monitor { emps.take(10) .map(row => (row.getString(0), row.getDecimal(1))) .foreach(println) } The type of the salary column in the data table is numeric(10, 2). It throws the following exception: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal Who knows about this issue and how to solve it? Thanks. Regards, Yi - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
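For reference, one possible workaround sketch while SPARK-5456 is open: push the cast down to PostgreSQL through a subquery in the dbtable option, so the numeric(10, 2) column arrives as double precision and Spark never has to handle a java.math.BigDecimal. The connection details are hypothetical and this is untested against the reporter's setup.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`

// The cast happens inside PostgreSQL, before the rows reach Spark.
val employees = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:postgresql://dbhost:5432/mydb?user=spark&password=secret",
  "dbtable" -> "(select name, salary::double precision as salary from employees) as emp"))

employees.registerTempTable("Employees")
sqlContext.sql("select name, sum(salary) from Employees group by name")
  .take(10)
  .map(row => (row.getString(0), row.getDouble(1)))
  .foreach(println)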
Re: directory loader in windows
extra forward slash at the end. sometime I have seen this kind of issues On 25 April 2015 at 20:50, Jeetendra Gangele gangele...@gmail.com wrote: loc = D:\\Project\\Spark\\code\\news\\jsonfeeds\\ On 25 April 2015 at 20:49, Jeetendra Gangele gangele...@gmail.com wrote: Hi Ayan can you try below line loc = D:\\Project\\Spark\\code\\news\\jsonfeeds On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote: Hi I am facing this weird issue. I am on Windows, and I am trying to load all files within a folder. Here is my code - loc = D:\\Project\\Spark\\code\\news\\jsonfeeds newsY = sc.textFile(loc) print newsY.count() Even this simple code fails. I have tried with giving exact file names, everything works. Am I missing something stupid here? Anyone facing this (anyone still use windows?:)) Here is error trace: D:\Project\Spark\code\news\jsonfeeds Traceback (most recent call last): File D:/Project/Spark/code/newsfeeder.py, line 28, in module print newsY.count() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 932, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 923, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 739, in reduce vals = self.mapPartitions(func).collect() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 713, in collect port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in __call__ self.target_id, self.name) File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in get_return_value format(target_id, '.', name), value) Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: java.lang.NullPointerException at java.lang.ProcessBuilder.start(Unknown Source) at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) at org.apache.hadoop.util.Shell.execCommand(Shell.java:808) at org.apache.hadoop.util.Shell.execCommand(Shell.java:791) at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557) at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512) at org.apache.spark.rdd.RDD.collect(RDD.scala:813) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at
Re: directory loader in windows
also if this code is in scala why not val in newsY? is this define above? loc = D:\\Project\\Spark\\code\\news\\jsonfeeds newsY = sc.textFile(loc) print newsY.count() On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote: Hi I am facing this weird issue. I am on Windows, and I am trying to load all files within a folder. Here is my code - loc = D:\\Project\\Spark\\code\\news\\jsonfeeds newsY = sc.textFile(loc) print newsY.count() Even this simple code fails. I have tried with giving exact file names, everything works. Am I missing something stupid here? Anyone facing this (anyone still use windows?:)) Here is error trace: D:\Project\Spark\code\news\jsonfeeds Traceback (most recent call last): File D:/Project/Spark/code/newsfeeder.py, line 28, in module print newsY.count() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 932, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 923, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 739, in reduce vals = self.mapPartitions(func).collect() File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\rdd.py, line 713, in collect port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) File C:\Python27\lib\site-packages\py4j\java_gateway.py, line 537, in __call__ self.target_id, self.name) File C:\Python27\lib\site-packages\py4j\protocol.py, line 300, in get_return_value format(target_id, '.', name), value) Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: java.lang.NullPointerException at java.lang.ProcessBuilder.start(Unknown Source) at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) at org.apache.hadoop.util.Shell.execCommand(Shell.java:808) at org.apache.hadoop.util.Shell.execCommand(Shell.java:791) at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557) at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512) at org.apache.spark.rdd.RDD.collect(RDD.scala:813) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at
Re: what is the best way to transfer data from RDBMS to spark?
If your use case is more about querying an RDBMS and then bringing the results into Spark for some analysis, then the Spark SQL JDBC data source API http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/ is the best option. If your use case is to bring the entire dataset into Spark, then you'll have to explore other options, which depend on the type of data store, e.g. the Spark Redshift integration http://spark-packages.org/package/databricks/spark-redshift Best Regards, Sujeevan. N On Sat, Apr 25, 2015 at 4:24 PM, ayan guha guha.a...@gmail.com wrote: Actually, Spark SQL provides a data source. Here is what the documentation says - JDBC To Other Databases Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.rdd.JdbcRDD. This is because the results are returned as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python as it does not require the user to provide a ClassTag. (Note that this is different than the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL). On Fri, Apr 24, 2015 at 6:27 PM, ayan guha guha.a...@gmail.com wrote: What is the specific use case? I can think of a couple of ways (write to HDFS and then read from Spark, or stream data to Spark). Also, I have seen people using MySQL jars to bring data in. Essentially you want to simulate the creation of an RDD. On 24 Apr 2015 18:15, sequoiadb mailing-list-r...@sequoiadb.com wrote: If I run Spark in stand-alone mode (not YARN mode), is there any tool like Sqoop that is able to transfer data from an RDBMS to Spark storage? Thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Best Regards, Ayan Guha