Re: sbt assembly fails
Hi Sean, yes, I am seeing errors across all repos, and you're right that this is mainly a connectivity issue. How do I set up the proxy? I did set it up as Mayur suggested: export JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=yourserver -Dhttp.proxyPort=8080 -Dhttp.proxyUser=username -Dhttp.proxyPassword=password". How do I rectify this error? :(

On Mon, Mar 17, 2014 at 6:07 PM, Sean Owen so...@cloudera.com wrote: It's in the main Maven repo: http://central.maven.org/maven2/io/netty/netty-all/ I assume you're seeing errors accessing all repos? The last few you quote are not where the artifact is intended to be; you're just seeing the resolution fail through all of them. I think it remains a connectivity problem from your environment to the repos, possibly because of a proxy? -- Sean Owen | Director, Data Science | London

On Mon, Mar 17, 2014 at 8:39 PM, Chengi Liu chengi.liu...@gmail.com wrote: I have set it up, but it still fails. Question: 4.0.13 is not under https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-all/ -- instead 4.0.18 is there. Is this a bug?

On Mon, Mar 17, 2014 at 11:01 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3ccaaqhkj48japuzqc476es67c+rrfime87uprambdoofhcl0k...@mail.gmail.com%3E You also have to specify a git proxy, as code may be fetched over git as well. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

On Mon, Mar 17, 2014 at 1:25 PM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I am trying to compile the Spark project using sbt/sbt assembly, and I see this error:
[info] Resolving io.netty#netty-all;4.0.13.Final ...
[error] Server access Error: Connection timed out url=https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
[error] Server access Error: Connection timed out url=https://oss.sonatype.org/service/local/staging/deploy/maven2/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
[error] Server access Error: Connection timed out url=https://repository.cloudera.com/artifactory/cloudera-repos/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
I followed the URL and it seems 4.0.13 is not present. Am I missing something? Also, I am behind a proxy; can that be an issue? How do I resolve this? Thanks
Re: sbt assembly fails
Yes, http_proxy is set up, and so is https_proxy. Basically my Maven projects, git pulls, etc. all work fine -- everything except this.

Here is another question which might help me bypass this issue: if I create a jar using Eclipse, how do I run that jar? In Hadoop, I create a jar and then run it with hadoop jar jar_name. Earlier what I was trying to do was write code inside the spark examples directory, do a new sbt build to create new jars, and use the run_examples script to run my code. But since sbt assembly is having connection issues, maybe someone can help me with how to build jars and deploy code on Spark (not using the Spark shell). Thanks

On Mon, Mar 17, 2014 at 11:27 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Is it translating to sbt? Are you also setting the command-line proxy (HTTP_PROXY)? The easiest check is to build a small project on the command line and test the proxy that way. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi
Re: sbt assembly fails
You need to assemble the code to get Spark working (unless you are using Hadoop 1.0.4). To run your own code you can follow any of the standalone guides here: https://spark.apache.org/docs/0.9.0/quick-start.html#a-standalone-app-in-scala -- you would still need sbt, though. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi
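For reference, a minimal standalone app along the lines of the quick-start guide linked above might look like the sketch below (Scala, Spark 0.9 API). The master URL, Spark home, jar name, and input path are placeholders, not values from this thread; once sbt package has built the jar, the object can be launched directly instead of going through the example runner.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    // Sketch of a standalone Spark 0.9 app. All paths and the master URL below
    // are assumptions -- adjust them for the actual cluster.
    object SimpleApp {
      def main(args: Array[String]) {
        val sc = new SparkContext(
          "spark://master:7077",                              // or "local[2]" for a quick local test
          "Simple App",
          "/opt/spark",                                       // Spark home on the workers
          Seq("target/scala-2.10/simple-app_2.10-1.0.jar"))   // the jar produced by sbt package
        val lines = sc.textFile("hdfs:///data/README.md")     // any input file
        println("Lines mentioning Spark: " + lines.filter(_.contains("Spark")).count())
        sc.stop()
      }
    }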
Feed KMeans algorithm with a row major matrix
Dear All, I'm trying to cluster data from native library code with Spark KMeans||. In my native library the data are represented as a matrix (rows = number of data points, columns = dimension). For efficiency reasons they are copied into a one-dimensional Scala Array, row-major, so after the computation I have an RDD[Array[Double]], but each array represents a set of data points instead of a single point. I need to transform these arrays into Array[Array[Double]] before running the KMeans|| algorithm. How can I do this efficiently? Best regards,
Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1
On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia wrote: Is there a reason for Spark using the older akka?

On Sun, Mar 2, 2014 at 1:53 PM, 1esha alexey.r...@gmail.com wrote: The problem is in akka-remote. It contains files compiled with protobuf 2.4.*. When you run it with 2.5.* on the classpath it fails like above. Looks like moving to akka 2.3 will solve this issue. Check this ticket: https://www.assembla.com/spaces/akka/tickets/3154-use-protobuf-version-2-5-0#/activity/ticket: -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158p2217.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Is the solution to exclude the 2.4.* dependency on protobuf, or will this produce more complications?
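For anyone who wants to experiment with the exclusion route in their own build, the sbt mechanics look roughly like the sketch below. The coordinates and versions are assumptions, and whether pinning protobuf to 2.5.0 this way actually avoids the akka-remote breakage (rather than producing the complications asked about) is exactly the open question in this thread.

    // build.sbt sketch: drop the transitive protobuf-java that conflicts and pin
    // a single version explicitly. Artifact coordinates/versions are assumptions.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating" exclude("com.google.protobuf", "protobuf-java")
    libraryDependencies += "com.google.protobuf" % "protobuf-java" % "2.5.0"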
Connect Exception Error in spark interactive shell...
Hi ALL !! In the interactive spark shell I get the following error. I just followed the steps of the video "First steps with Spark - Spark screencast #1" by Andy Konwinski... Any thoughts?

scala> val textfile = sc.textFile("README.md")
textfile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> textfile.count
java.lang.RuntimeException: java.net.ConnectException: Call to master/192.168.1.11:9000 failed on connection exception: java.net.ConnectException: Connection refused
 at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:546)
 at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:318)
 at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:291)
 at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:439)
 at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:439)
 at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:112)
 at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:112)
 at scala.Option.map(Option.scala:133)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:112)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:134)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199)
 at scala.Option.getOrElse(Option.scala:108)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:199)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:26)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199)
 at scala.Option.getOrElse(Option.scala:108)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:199)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:886)
 at org.apache.spark.rdd.RDD.count(RDD.scala:698)
 at <init>(<console>:15)
 at <init>(<console>:20)
 at <init>(<console>:22)
 at <init>(<console>:24)
 at <init>(<console>:26)
 at .<init>(<console>:30)
 at .<clinit>(<console>)
 at .<init>(<console>:11)
 at .<clinit>(<console>)
 at $export(<console>)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:629)
 at org.apache.spark.repl.SparkIMain$Request$$anonfun$10.apply(SparkIMain.scala:897)
 at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
 at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
 at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Call to master/192.168.1.11:9000 failed on connection exception: java.net.ConnectException: Connection refused
 at org.apache.hadoop.ipc.Client.wrapException(Client.java:1099)
 at org.apache.hadoop.ipc.Client.call(Client.java:1075)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
 at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
 at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
 at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
 at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
 at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:542)
 ... 39 more
Caused by: java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
 at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
 at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
 at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
 at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
 at org.apache.hadoop.ipc.Client.getConnection(Client.java:1206)
 at org.apache.hadoop.ipc.Client.call(Client.java:1050)
 ... 53 more

-- Sai Prasanna. AN, II M.Tech (CS), SSSIHL
"Entire water in the ocean can never sink a ship, unless it gets inside. All the pressures of life can never hurt you, unless you let them in."
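For what it's worth, the trace shows the shell resolving README.md against an HDFS namenode at master:9000 that refuses connections, so either HDFS needs to be running at that address or the path should point at the local filesystem explicitly. A minimal check along those lines (the path is an assumption) would be:

    // Sketch: bypass HDFS by giving sc.textFile an explicit local path.
    // The path below is an assumption -- use wherever README.md actually lives.
    val textfile = sc.textFile("file:///home/user/spark-0.9.0/README.md")
    textfile.count()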
Re: Apache Spark 0.9.0 Build Error
I tried that command on Fedora and I got a lot of random downloads (around 250 of them), and it appeared that something was trying to start BitTorrent. The command ./sbt/sbt assembly doesn't work on Windows. I installed sbt separately. Is there a way to determine whether I'm using the sbt that's included with Spark or the standalone version?

On Tue, Mar 18, 2014 at 12:16 AM, Mark Hamstra [via Apache Spark User List] ml-node+s1001560n2795...@n3.nabble.com wrote: Try ./sbt/sbt assembly

On Mon, Mar 17, 2014 at 9:06 PM, wapisani [hidden email] wrote: Good morning! I'm attempting to build Apache Spark 0.9.0 on Windows 8. I've installed all prerequisites (except Hadoop) and run sbt/sbt assembly while in the root directory. I'm getting an error after the line "Set current project to root in build file:C:/.../spark-0.9.0-incubating/". The error is:
[error] Not a valid command: /
[error] /sbt
[error] ^
Do you know why I'm getting this error? Thank you very much, Will

-- Will Pisani, Fourth-Year Chemical Engineering Student, Research Scholar Honors Institute, Michigan Technological University
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-0-9-0-Build-Error-tp2794p2806.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1
On 3/18/14, 4:49 AM, dmpou...@gmail.com wrote: Is the solution to exclude the 2.4.* dependency on protobuf, or will this produce more complications?

I am not sure I remember what the context was around this, but I run 0.9.0 with Hadoop 2.2.0 just fine. Ognen
KryoSerializer return null when deserialize Task obj in Executor
Hi all, I changed spark.closure.serializer to kryo. When I try a count action in the Spark shell, the Task object deserialized in the Executor comes back null. The relevant source is:

override def run() {
  ...
  task = ser.deserialize[Task[Any]](...)
  ...
}

where task is null. Can anyone help me? Thank you!
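For context, the setup being described presumably looks something like the sketch below (Spark 0.9 system-property style configuration, set before the SparkContext or shell starts). This only reproduces the reported configuration, not a recommendation; the more commonly changed property is spark.serializer.

    // Sketch of the configuration under discussion (assumed, not quoted from the report):
    System.setProperty("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")
    // The usual Kryo setting applies to data, not closures:
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new org.apache.spark.SparkContext("local[2]", "kryo-closure-test")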
Re: example of non-line oriented input data?
Well, if anyone is still following this, I've gotten the following code working, which in theory should allow me to parse whole XML files. (The problem was that I can't return the tree iterator directly; I have to call iter(). Why?)

import xml.etree.ElementTree as ET

# two source files, format: <data><country name="...">...</country></data>
mydata = sc.textFile("file:/home/training/countries*.xml")

def parsefile(iterator):
    s = ''
    for i in iterator:
        s = s + str(i)
    tree = ET.fromstring(s)
    treeiterator = tree.getiterator("country")
    # why do I have to convert an iterator to an iterator? not sure, but required
    return iter(treeiterator)

mydata.mapPartitions(lambda x: parsefile(x)).map(lambda element: element.attrib).collect()

The output is what I expect: [{'name': 'Liechtenstein'}, {'name': 'Singapore'}, {'name': 'Panama'}]

BUT I'm a bit concerned about the construction of the string s. How big can my file be before converting it to a string becomes problematic?

On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll dcarr...@cloudera.com wrote: Thanks, Matei. In the context of this discussion, it would seem mapPartitions is essential, because it's the only way I'm going to be able to process each file as a whole, in our example of a large number of small XML files which need to be parsed as a whole file because records are not required to be on a single line. The theory makes sense, but I'm still utterly lost as to how to implement it. Unfortunately there's only a single example of the use of mapPartitions in any of the Python example programs, which is the logistic regression example, which I can't run because it requires Python 2.7 and I'm on Python 2.6. (Aside: I couldn't find any statement that Python 2.6 is unsupported... is it?) I'd really love to see a real-life example of a Python use of mapPartitions. I do appreciate the very simple examples you provided, but (perhaps because of my novice status with Python) I can't figure out how to translate those to a real-world situation in which I'm building RDDs from files, not inline collections like [(1,2),(2,3)]. Also, you say that the function called in mapPartitions can return a collection OR an iterator. I tried returning an iterator by calling the ElementTree getiterator function, but still got the error telling me my object was not an iterator. If anyone has a real-life example of mapPartitions returning a Python iterator, that would be fabulous. Diana

On Mon, Mar 17, 2014 at 6:17 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Oh, I see, the problem is that the function you pass to mapPartitions must itself return an iterator or a collection. This is used so that you can return multiple output records for each input record. You can implement most of the existing map-like operations in Spark, such as map, filter, flatMap, etc., with mapPartitions, as well as new ones that might do a sliding window over each partition, for example, or accumulate data across elements (e.g. to compute a sum). For example, if you have data = sc.parallelize([1, 2, 3, 4], 2), this will work:

>>> data.mapPartitions(lambda x: x).collect()
[1, 2, 3, 4]      # Just return the same iterator, doing nothing
>>> data.mapPartitions(lambda x: [list(x)]).collect()
[[1, 2], [3, 4]]  # Group together the elements of each partition in a single list (like glom)
>>> data.mapPartitions(lambda x: [sum(x)]).collect()
[3, 7]            # Sum each partition separately

However, something like data.mapPartitions(lambda x: sum(x)).collect() will *not* work, because sum returns a number, not an iterator. That's why I put sum(x) inside a list above.

In practice mapPartitions is most useful if you want to share some data or work across the elements. For example, maybe you want to load a lookup table once from an external file and then check each element against it, or sum up a bunch of elements without allocating a lot of vector objects. Matei

On Mar 17, 2014, at 11:25 AM, Diana Carroll dcarr...@cloudera.com wrote: "There's also mapPartitions, which gives you an iterator for each partition instead of an array. You can then return an iterator or list of objects to produce from that." I confess, I was hoping for an example of just that, because I've not yet been able to figure out how to use mapPartitions. No doubt this is because I'm a rank newcomer to Python, and haven't fully wrapped my head around iterators. All I get so far in my attempts to use mapPartitions is the darned "such-and-such is not an iterator" error.

def myfunction(iterator):
    return [1, 2, 3]

mydata.mapPartitions(lambda x: myfunction(x)).take(2)

On Mon, Mar 17, 2014 at 1:57 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Here's an example of getting together all lines in a file as one string:

$ cat dir/a.txt
Hello world!
$ cat dir/b.txt
What's up??
$ bin/pyspark
>>> files = sc.textFile("dir")
>>> files.collect()
[u'Hello', u'world!', u"What's",
Re: Apache Spark 0.9.0 Build Error
Hi, if you run that under Windows, you should use \ in place of /; sbt/sbt means the sbt script inside the sbt folder.
Re: Apache Spark 0.9.0 Build Error
Hi Chen, I tried sbt\sbt assembly and I got the error: 'sbt\sbt' is not recognized as an internal or external command, operable program or batch file.

-- Will Pisani, Fourth-Year Chemical Engineering Student, Research Scholar Honors Institute, Michigan Technological University
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-0-9-0-Build-Error-tp2794p2812.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
[spark] New article on spark scalaz-stream ( a bit of ML)
Hi, I wrote this new article after studying more deeply how to adapt scalaz-stream to Spark DStreams. In it I re-explain a few Spark (and scalaz-stream) concepts in my own words, and I went further using the new scalaz-stream NIO API, which is quite interesting IMHO. The result is a long blog triptych starting here: http://mandubian.com/2014/03/08/zpark-ml-nio-1/ Regards Pascal
Re: inexplicable exceptions in Spark 0.7.3
Hi Andrew, Thanks for your interest. This is a standalone job.

On Mon, Mar 17, 2014 at 4:30 PM, Andrew Ash and...@andrewash.com wrote: Are you running from the Spark shell or from a standalone job?

On Mon, Mar 17, 2014 at 4:17 PM, Walrus theCat walrusthe...@gmail.com wrote: Hi, I'm getting this stack trace, using Spark 0.7.3. No references to anything in my code; I've never experienced anything like this before. Any ideas what is going on?
java.lang.ClassCastException: spark.SparkContext$$anonfun$9 cannot be cast to scala.Function2
 at spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:43)
 at spark.scheduler.ResultTask.readExternal(ResultTask.scala:106)
 at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at spark.JavaDeserializationStream.readObject(JavaSerializer.scala:23)
 at spark.JavaSerializerInstance.deserialize(JavaSerializer.scala:45)
 at spark.executor.Executor$TaskRunner.run(Executor.scala:96)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
Re: possible bug in Spark's ALS implementation...
Sorry, the link was wrong. Should be https://github.com/apache/spark/pull/131 -Xiangrui On Tue, Mar 18, 2014 at 10:20 AM, Michael Allman m...@allman.ms wrote: Hi Xiangrui, I don't see how https://github.com/apache/spark/pull/161 relates to ALS. Can you explain? Also, thanks for addressing the issue with factor matrix persistence in PR 165. I was probably not going to get to that for a while. I will try to test your changes today for speed improvements. Cheers, Michael -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tp2567p2817.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Feed KMeans algorithm with a row major matrix
Hi Jaonary, With the current implementation, you need to call Array.slice to make each row an Array[Double] and cache the resulting RDD. There is a plan to support block-wise input data, and I will keep you informed. Best, Xiangrui
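A sketch of that reshaping, assuming each Array[Double] holds a whole number of dim-dimensional points laid out row-major (Spark 0.9 MLlib, where KMeans.train takes an RDD[Array[Double]]; dim, k, and the iteration count are placeholders to fill in):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.clustering.KMeans

    // flat: the RDD[Array[Double]] described above, each array a row-major block of points.
    // dim is the dimension of a single point (an assumption to fill in).
    def toRows(flat: RDD[Array[Double]], dim: Int): RDD[Array[Double]] =
      flat.flatMap(block => block.grouped(dim))   // slice each block into dim-sized rows

    val points = toRows(flat, dim = 64)
    points.cache()                                // cache before the iterative algorithm
    val model = KMeans.train(points, 10, 20)      // k = 10 clusters, 20 iterations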
Re: spark-shell fails
Although sbt assembly reports success, I re-ran that step and see errors like: Error extracting zip entry 'scala/tools/nsc/transform/UnCurry$UnCurryTransformer$$anonfun$14$$anonfun$apply (omitting rest of super-long path) (File name too long). Is this a problem with the 'zip' tool on my system? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-fails-tp2778p2821.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: spark-shell fails
OK, the problem was that the directory where I had installed Spark is encrypted. The particular encryption system appears to limit the length of file names. I re-installed on a vanilla partition, and spark-shell runs fine. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-fails-tp2778p2822.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Maven repo for Spark pre-built with CDH4?
Hi all, The Maven central repo contains an artifact for Spark 0.9.0 built with unmodified Hadoop, and the Cloudera repo contains an artifact for Spark 0.9.0 built with CDH 5 beta. Is there a repo that contains spark-core built against a non-beta version of CDH (such as 4.4.0)? Punya
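In the absence of such an artifact, one common approach is to depend on the generic spark-core artifact and pin the Hadoop client to the CDH4 version; a build.sbt sketch is below. The CDH version string and repo URL reflect Cloudera's public repository, but treat the exact coordinates as assumptions to verify.

    // build.sbt sketch: generic spark-core plus a CDH4 hadoop-client.
    resolvers += "Cloudera repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.9.0-incubating",
      "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.4.0"   // CDH 4.4.0 (MR1) client
    )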
Re: possible bug in Spark's ALS implementation...
I just ran a runtime performance comparison between 0.9.0-incubating and your als branch. I saw a 1.5x improvement in performance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tp2567p2823.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: possible bug in Spark's ALS implementation...
Glad to hear about the speed-up. I hope we can improve the implementation further in the future. -Xiangrui

On Tue, Mar 18, 2014 at 1:55 PM, Michael Allman m...@allman.ms wrote: I just ran a runtime performance comparison between 0.9.0-incubating and your als branch. I saw a 1.5x improvement in performance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tp2567p2823.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Regarding Successive operation on elements and recursively
Hi, I am new to the Spark/Scala environment. Currently I am working on discrete wavelet transform algorithms on time-series data. I have to perform recursive additions on successive elements in RDDs. For example, with a list of elements (an RDD) a1 a2 a3 a4:
Level 1 transformation: a1+a2, a3+a4, a1-a2, a3-a4
Level 2: (a1+a2)+(a3+a4), (a1+a2)-(a3+a4)
Is there a way to provide indexing to elements in a distributed environment across nodes, so that I know I am referring to a2 after a1? I want to perform successive addition of only two elements at a time, and in a recursive manner. Could you please help me with this? I would be really thankful. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Regarding-Successive-operation-on-elements-and-recursively-tp2826.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
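One way to express a single transformation level, assuming the elements already carry a position index (Spark 0.9 has no RDD.zipWithIndex, so the index is assumed to be attached when the RDD is built), is sketched below: pairs are formed by keying on index / 2, and the sums and differences of each pair become the input and the detail coefficients for the next level.

    import org.apache.spark.SparkContext._   // pair RDD functions
    import org.apache.spark.rdd.RDD

    // One Haar-style level (sketch). xs holds (position, value) with positions 0,1,2,...
    // and an even number of elements; the position index itself is an assumption.
    def dwtLevel(xs: RDD[(Long, Double)]): (RDD[(Long, Double)], RDD[(Long, Double)]) = {
      val pairs = xs.map { case (i, v) => (i / 2, (i, v)) }
                    .groupByKey()
                    .mapValues(vs => vs.sortBy(_._1).map(_._2))   // order the two elements of each pair
      val sums  = pairs.mapValues { case Seq(a, b) => a + b }     // input to the next level
      val diffs = pairs.mapValues { case Seq(a, b) => a - b }     // detail coefficients for this level
      (sums, diffs)
    }

Because the sums come back keyed by i / 2, they are already re-indexed as positions 0,1,2,..., so each subsequent level is just dwtLevel applied again to the previous level's sums.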
Re: Incrementally add/remove vertices in GraphX
I just meant that you call union() before creating the RDDs that you pass to new Graph(). If you call it after it will produce other RDDs. The Graph() constructor actually shuffles and “indexes” the data to make graph operations efficient, so it’s not too easy to add elements after. You could access graph.vertices and graph.edges to build new RDDs, and then call Graph() again to make a new graph. I’ve CCed Joey and Ankur to see if they have further ideas on how to optimize this. It would be cool to support more efficient union and subtracting of graphs once they’ve been partitioned by GraphX. Matei On Mar 14, 2014, at 8:32 AM, alelulli alessandro.lu...@gmail.com wrote: Hi Matei, Could you please clarify why i must call union before creating the graph? What's the behavior if i call union / subtract after the creation? Is the added /removed vertexes been processed? For example if i'm implementing an iterative algorithm and at the 5th step i need to add some vertex / edge, can i call union / subtract on the VertexRDD, EdgeRDD and Triplets? Thanks Alessandro -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Incrementally-add-remove-vertices-in-GraphX-tp2227p2695.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
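A small sketch of the rebuild-the-graph approach described above (GraphX in Spark 0.9; graph is the existing graph, and newVertices, newEdges, and removedId are hypothetical inputs holding the additions and the vertex to drop):

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // Sketch: build a new graph from the old one's RDDs plus the additions.
    val grownGraph = Graph(
      graph.vertices.union(newVertices),   // newVertices: RDD[(VertexId, VD)], hypothetical
      graph.edges.union(newEdges))         // newEdges: RDD[Edge[ED]], hypothetical

    // Removal works the same way, filtering before rebuilding:
    val shrunkGraph = Graph(
      graph.vertices.filter { case (id, _) => id != removedId },
      graph.edges.filter(e => e.srcId != removedId && e.dstId != removedId))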
Access original filename in a map function
Hi Spark folk, I have a directory full of files that I want to process using PySpark. There is some necessary metadata in the filename that I would love to attach to each record in that file. Using Java MapReduce, I would access ((FileSplit) context.getInputSplit()).getPath().getName() in the setup() method of the mapper. Using Hadoop Streaming, I can access the environment variable map_input_file to get the filename. Is there something I can do in PySpark to get the filename? Surely one solution would be to get the list of files first, load each one as a separate RDD, and then union them together. But listing the files in HDFS is a bit annoying through Python, so I was wondering if the filename is somehow attached to a partition. Thanks! Uri -- Uri Laserson, PhD Data Scientist, Cloudera Twitter/GitHub: @laserson +1 617 910 0447 laser...@cloudera.com
Re: There is an error in Graphx
This problem occurs because graph.triplets generates an iterator that reuses the same EdgeTriplet object for every triplet in the partition. The workaround is to force a copy using graph.triplets.map(_.copy()). The solution in the AMPCamp tutorial is mistaken -- I'm not sure if that ever worked. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/There-is-an-error-in-Graphx-tp1575p2836.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: There is an error in Graphx
"The workaround is to force a copy using graph.triplets.map(_.copy())." Sorry, this actually won't copy the entire triplet, only the attributes defined in Edge. The right workaround is to copy the EdgeTriplet explicitly:

graph.triplets.map { et =>
  val et2 = new EdgeTriplet[VD, ED]   // replace VD and ED with the correct types
  et2.srcId = et.srcId
  et2.dstId = et.dstId
  et2.attr = et.attr
  et2.srcAttr = et.srcAttr
  et2.dstAttr = et.dstAttr
  et2
}

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/There-is-an-error-in-Graphx-tp1575p2837.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: sample data for pagerank?
The examples in graphx/data are meant to show the input data format, but if you want to play around with larger and more interesting datasets, we've been using the following ones, among others: - SNAP's web-Google dataset (5M edges): https://snap.stanford.edu/data/web-Google.html - SNAP's soc-LiveJournal1 dataset (69M edges): https://snap.stanford.edu/data/soc-LiveJournal1.html These come in edge list format and, after decompression, can directly be loaded using GraphLoader. Ankur -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sample-data-for-pagerank-tp2655p2839.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
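For example, after decompressing one of those files, loading it and running PageRank looks roughly like this (the HDFS path is a placeholder):

    import org.apache.spark.SparkContext._
    import org.apache.spark.graphx.GraphLoader

    // Sketch: load a SNAP edge-list file and run PageRank on it.
    // The path is an assumption -- point it at wherever the decompressed file lives.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/web-Google.txt")
    val ranks = graph.pageRank(0.0001).vertices   // (vertexId, rank) pairs
    ranks.map { case (id, rank) => (rank, id) }
         .sortByKey(ascending = false)
         .take(5)
         .foreach(println)                        // five highest-ranked pages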