Re: sbt assembly fails

2014-03-18 Thread Chengi Liu
Hi Sean,
  Yeah.. I am seeing errors across all repos and yep.. this error is mainly
because of a connectivity issue...
How do I set up the proxy? I did set it up as suggested by Mayur:

export JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=yourserver \
  -Dhttp.proxyPort=8080 -Dhttp.proxyUser=username \
  -Dhttp.proxyPassword=password"


How do I rectify this error. :(



On Mon, Mar 17, 2014 at 6:07 PM, Sean Owen so...@cloudera.com wrote:

 It's in the main Maven repo:
 http://central.maven.org/maven2/io/netty/netty-all/

 I assume you're seeing errors accessing all repos? the last few you
 quote are not where they are intended to be, you're just seeing it
 fail through all of them. I think it remains a connectivity problem
 from your env to the repos, possibly because of a proxy?
 --
 Sean Owen | Director, Data Science | London


 On Mon, Mar 17, 2014 at 8:39 PM, Chengi Liu chengi.liu...@gmail.com
 wrote:
  I have set it up.. still it fails.. Question:
 
 https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-all/
 
  4.0.13 is not there? Instead 4.0.18 is there?? Is this a bug?
 
 
  On Mon, Mar 17, 2014 at 11:01 AM, Mayur Rustagi mayur.rust...@gmail.com
 
  wrote:
 
 
 
 http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3ccaaqhkj48japuzqc476es67c+rrfime87uprambdoofhcl0k...@mail.gmail.com%3E
 
  You also have to specify a git proxy, as code may be pulled from git as well.
 
  Mayur Rustagi
  Ph: +1 (760) 203 3257
  http://www.sigmoidanalytics.com
  @mayur_rustagi
 
 
 
  On Mon, Mar 17, 2014 at 1:25 PM, Chengi Liu chengi.liu...@gmail.com
  wrote:
 
  Hi,
I am trying to compile the spark project using sbt/sbt assembly..
  And i see this error:
  [info] Resolving io.netty#netty-all;4.0.13.Final ...
  [error] Server access Error: Connection timed out
  url=
 https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
  [error] Server access Error: Connection timed out
  url=
 https://oss.sonatype.org/service/local/staging/deploy/maven2/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
  [error] Server access Error: Connection timed out
  url=
 https://repository.cloudera.com/artifactory/cloudera-repos/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
 
 
 
  I followed the url and it seems like 4.0.13 is not present.. Am I missing
  something? Also, I am behind a proxy.. can that be an issue?
  How do I resolve this?
  Thanks
 
 
 



Re: sbt assembly fails

2014-03-18 Thread Chengi Liu
Yeah.. The http_proxy is set up.. and so is https_proxy..
Basically, my Maven projects, git pulls, etc. - everything is working fine..
except this.

Here is another question which might help me to bypass this issue:
If I create a jar using Eclipse... how do I run that jar? Like in
Hadoop, I create a jar and then run it with hadoop jar jar_name.
Earlier what I was trying to do was basically: write code inside the
spark_examples directory.. do a new sbt build to create new jars.. and use
the run_examples script to run my code.. But since sbt assembly is having
connection issues, maybe someone can help me with how to build jars and
deploy code on Spark (not using the spark shell).
Thanks


On Mon, Mar 17, 2014 at 11:27 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 Is it translating to sbt?
 Are you also setting the command line proxy HTTP_PROXY?
 Easiest is to build a small piece of code & just test it out by building on the command
 line..

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Tue, Mar 18, 2014 at 2:15 AM, Chengi Liu chengi.liu...@gmail.com wrote:

 Hi Sean,
   Yeah.. I am seeing errors across all repos and yep.. this error is
  mainly because of a connectivity issue...
  How do I set up the proxy? I did set it up as suggested by Mayur:

 export JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=yourserver \
   -Dhttp.proxyPort=8080 -Dhttp.proxyUser=username \
   -Dhttp.proxyPassword=password"


 How do I rectify this error. :(



 On Mon, Mar 17, 2014 at 6:07 PM, Sean Owen so...@cloudera.com wrote:

 It's in the main Maven repo:
 http://central.maven.org/maven2/io/netty/netty-all/

 I assume you're seeing errors accessing all repos? the last few you
 quote are not where they are intended to be, you're just seeing it
 fail through all of them. I think it remains a connectivity problem
 from your env to the repos, possibly because of a proxy?
 --
 Sean Owen | Director, Data Science | London


 On Mon, Mar 17, 2014 at 8:39 PM, Chengi Liu chengi.liu...@gmail.com
 wrote:
  I have set it up.. still it fails.. Question:
 
 https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-all/
 
  4.0.13 is not there? Instead 4.0.18 is there?? Is this a bug?
 
 
  On Mon, Mar 17, 2014 at 11:01 AM, Mayur Rustagi 
 mayur.rust...@gmail.com
  wrote:
 
 
 
 http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3ccaaqhkj48japuzqc476es67c+rrfime87uprambdoofhcl0k...@mail.gmail.com%3E
 
   You also have to specify a git proxy, as code may be pulled from git as well.
 
  Mayur Rustagi
  Ph: +1 (760) 203 3257
  http://www.sigmoidanalytics.com
  @mayur_rustagi
 
 
 
  On Mon, Mar 17, 2014 at 1:25 PM, Chengi Liu chengi.liu...@gmail.com
  wrote:
 
  Hi,
I am trying to compile the spark project using sbt/sbt assembly..
  And i see this error:
  [info] Resolving io.netty#netty-all;4.0.13.Final ...
  [error] Server access Error: Connection timed out
  url=
 https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
  [error] Server access Error: Connection timed out
  url=
 https://oss.sonatype.org/service/local/staging/deploy/maven2/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
  [error] Server access Error: Connection timed out
  url=
 https://repository.cloudera.com/artifactory/cloudera-repos/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
 
 
 
   I followed the url and it seems like 4.0.13 is not present.. Am I missing
   something? Also, I am behind a proxy.. can that be an issue?
   How do I resolve this?
  Thanks
 
 
 






Re: sbt assembly fails

2014-03-18 Thread Mayur Rustagi
You need to assemble the code to get Spark working (unless you are using
hadoop 1.0.4).

To run the code you can follow any of the standalone guides here:
https://spark.apache.org/docs/0.9.0/quick-start.html#a-standalone-app-in-scala
You would still need sbt though.
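
For reference, the standalone app in that guide boils down to something like the
sketch below (the master URL, Spark home, jar name, and input path are placeholders,
not values from this thread); it is compiled and packaged with sbt, so it avoids the
run_examples script entirely:

/* SimpleApp.scala -- minimal standalone sketch, adapted loosely from the 0.9.0 quick start */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    // master, sparkHome and the jar list are placeholders; point them at your setup
    val sc = new SparkContext("local", "Simple App", "/path/to/spark",
      List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val data = sc.textFile("/path/to/some/file.txt", 2).cache()
    val numAs = data.filter(line => line.contains("a")).count()
    println("Lines with a: " + numAs)
    sc.stop()
  }
}

Package it with sbt package (or sbt assembly for a fat jar), then run it with sbt run
or submit the jar to your cluster as the quick start describes.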



Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Tue, Mar 18, 2014 at 2:32 AM, Chengi Liu chengi.liu...@gmail.com wrote:

 Yeah.. The http_proxy is set up.. and so is https_proxy..
 Basically, my Maven projects, git pulls, etc. - everything is working fine..
 except this.

 Here is another question which might help me to bypass this issue:
 If I create a jar using Eclipse... how do I run that jar? Like in
 Hadoop, I create a jar and then run it with hadoop jar jar_name.
 Earlier what I was trying to do was basically: write code inside the
 spark_examples directory.. do a new sbt build to create new jars.. and use
 the run_examples script to run my code.. But since sbt assembly is having
 connection issues, maybe someone can help me with how to build jars and
 deploy code on Spark (not using the spark shell).
 Thanks


 On Mon, Mar 17, 2014 at 11:27 PM, Mayur Rustagi 
  mayur.rust...@gmail.com wrote:

  Is it translating to sbt?
  Are you also setting the command line proxy HTTP_PROXY?
  Easiest is to build a small piece of code & just test it out by building on the
  command line..

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



  On Tue, Mar 18, 2014 at 2:15 AM, Chengi Liu chengi.liu...@gmail.com wrote:

 Hi Sean,
   Yeah.. I am seeing errors across all repos and yep.. this error is
  mainly because of a connectivity issue...
  How do I set up the proxy? I did set it up as suggested by Mayur:

  export JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=yourserver \
    -Dhttp.proxyPort=8080 -Dhttp.proxyUser=username \
    -Dhttp.proxyPassword=password"


 How do I rectify this error. :(



 On Mon, Mar 17, 2014 at 6:07 PM, Sean Owen so...@cloudera.com wrote:

 It's in the main Maven repo:
 http://central.maven.org/maven2/io/netty/netty-all/

 I assume you're seeing errors accessing all repos? the last few you
 quote are not where they are intended to be, you're just seeing it
 fail through all of them. I think it remains a connectivity problem
 from your env to the repos, possibly because of a proxy?
 --
 Sean Owen | Director, Data Science | London


 On Mon, Mar 17, 2014 at 8:39 PM, Chengi Liu chengi.liu...@gmail.com
 wrote:
  I have set it up.. still it fails.. Question:
 
 https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-all/
 
  4.0.13 is not there? Instead 4.0.18 is there?? Is this a bug?
 
 
  On Mon, Mar 17, 2014 at 11:01 AM, Mayur Rustagi 
 mayur.rust...@gmail.com
  wrote:
 
 
 
 http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3ccaaqhkj48japuzqc476es67c+rrfime87uprambdoofhcl0k...@mail.gmail.com%3E
 
   You also have to specify a git proxy, as code may be pulled from git
  as well.
 
  Mayur Rustagi
  Ph: +1 (760) 203 3257
  http://www.sigmoidanalytics.com
  @mayur_rustagi
 
 
 
  On Mon, Mar 17, 2014 at 1:25 PM, Chengi Liu chengi.liu...@gmail.com
 
  wrote:
 
  Hi,
I am trying to compile the spark project using sbt/sbt assembly..
  And i see this error:
  [info] Resolving io.netty#netty-all;4.0.13.Final ...
  [error] Server access Error: Connection timed out
  url=
 https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
  [error] Server access Error: Connection timed out
  url=
 https://oss.sonatype.org/service/local/staging/deploy/maven2/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
  [error] Server access Error: Connection timed out
  url=
 https://repository.cloudera.com/artifactory/cloudera-repos/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
 
 
 
   I followed the url and it seems like 4.0.13 is not present.. Am I missing
   something? Also, I am behind a proxy.. can that be an issue?
   How do I resolve this?
  Thanks
 
 
 







Feed KMeans algorithm with a row major matrix

2014-03-18 Thread Jaonary Rabarisoa
Dear All,

I'm trying to cluster data from native library code with Spark KMeans||. In
my native library the data are represented as a matrix (rows = number of
data points, cols = dimension). For efficiency reasons, they are copied into
a one-dimensional Scala Array, row-major wise, so after the computation I
have an RDD[Array[Double]], but each array holds a whole block of data points
rather than a single one. I need to transform these arrays into
Array[Array[Double]] before running the KMeans|| algorithm.

How can I do this efficiently?



Best regards,


Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-18 Thread dmpour23
On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia  wrote:
 Is there a reason for spark using the older akka?
 
 
 
 
 On Sun, Mar 2, 2014 at 1:53 PM, 1esha alexey.r...@gmail.com wrote:
 
 The problem is in akka remote. It contains files compiled with 2.4.*. When
 
 you run it with 2.5.* in classpath it fails like above.
 
 
 
 Looks like moving to akka 2.3 will solve this issue. Check this issue -
 
 https://www.assembla.com/spaces/akka/tickets/3154-use-protobuf-version-2-5-0#/activity/ticket:
 
 
 
 
 
 
 
 
 --
 
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158p2217.html
 
 
 
 
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

Is the solution to exclude the 2.4.* dependency on protobuf, or will this
produce more complications?
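
Purely as an illustration of what such an exclusion would look like (this is not taken
from Spark's own build, and whether it actually resolves the protobuf clash is exactly
the open question above), in an sbt build one might write:

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating" exclude("com.google.protobuf", "protobuf-java")

and then declare the protobuf version the application actually needs as an explicit dependency.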

Connect Exception Error in spark interactive shell...

2014-03-18 Thread Sai Prasanna
Hi ALL !!

In the interactive spark shell I get the following error.
I just followed the steps of the video "First steps with spark - spark
screen cast #1" by Andy Konwinski...

Any thoughts ???

scala> val textfile = sc.textFile("README.md")
textfile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
<console>:12

scala> textfile.count
java.lang.RuntimeException: java.net.ConnectException: Call to master/
192.168.1.11:9000 failed on connection exception:
java.net.ConnectException: Connection refused
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:546)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:318)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:291)
at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:439)
at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:439)
at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:112)
at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:112)
at scala.Option.map(Option.scala:133)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:112)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:134)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199)
at scala.Option.getOrElse(Option.scala:108)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:199)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:26)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199)
at scala.Option.getOrElse(Option.scala:108)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:199)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:886)
at org.apache.spark.rdd.RDD.count(RDD.scala:698)
at <init>(<console>:15)
at <init>(<console>:20)
at <init>(<console>:22)
at <init>(<console>:24)
at <init>(<console>:26)
at .<init>(<console>:30)
at .<clinit>(<console>)
at .<init>(<console>:11)
at .<clinit>(<console>)
at $export(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:629)
at
org.apache.spark.repl.SparkIMain$Request$$anonfun$10.apply(SparkIMain.scala:897)
at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Call to
master/192.168.1.11:9000 failed on connection exception:
java.net.ConnectException: Connection refused
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1099)
at org.apache.hadoop.ipc.Client.call(Client.java:1075)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
at
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:542)
... 39 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1206)
at org.apache.hadoop.ipc.Client.call(Client.java:1050)
... 53 more


-- 
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*


*Entire water in the ocean can never sink a ship, Unless it gets inside. All
the pressures of life can never hurt you, Unless you let them in.*


Re: Apache Spark 0.9.0 Build Error

2014-03-18 Thread wapisani
I tried that command on Fedora and I got a lot of random downloads (around
250 downloads), and it appeared that something was trying to get BitTorrent
to start. That command, ./sbt/sbt assembly, doesn't work on Windows.

I installed sbt separately. Is there a way to determine if I'm using the
sbt that's included with Spark or the standalone version?


On Tue, Mar 18, 2014 at 12:16 AM, Mark Hamstra [via Apache Spark User List]
ml-node+s1001560n2795...@n3.nabble.com wrote:

 Try ./sbt/sbt assembly


  On Mon, Mar 17, 2014 at 9:06 PM, wapisani [hidden email] wrote:

 Good morning! I'm attempting to build Apache Spark 0.9.0 on Windows 8.
 I've
  installed all prerequisites (except Hadoop) and run "sbt/sbt assembly" while
  in the root directory. I'm getting an error after the line "Set current
  project to root in build file:C:/.../spark-0.9.0-incubating/". The error is:
 [error] Not a valid command: /
 [error] /sbt
 [error]  ^

 Do you know why I'm getting this error?

 Thank you very much,
 Will



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-0-9-0-Build-Error-tp2794.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.








-- 
Will Pisani
Fourth-Year Chemical Engineering Student
Research Scholar
Honors Institute
Michigan Technological University




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-0-9-0-Build-Error-tp2794p2806.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-18 Thread Ognen Duzlevski


On 3/18/14, 4:49 AM, dmpou...@gmail.com wrote:

On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia  wrote:

Is there a reason for spark using the older akka?




On Sun, Mar 2, 2014 at 1:53 PM, 1esha alexey.r...@gmail.com wrote:

The problem is in akka remote. It contains files compiled with 2.4.*. When

you run it with 2.5.* in classpath it fails like above.



Looks like moving to akka 2.3 will solve this issue. Check this issue -

https://www.assembla.com/spaces/akka/tickets/3154-use-protobuf-version-2-5-0#/activity/ticket:


Is the solution to exclude the 2.4.* dependency on protobuf, or will this
produce more complications?
I am not sure I remember what the context was around this but I run 
0.9.0 with hadoop 2.2.0 just fine.

Ognen


KryoSerializer return null when deserialize Task obj in Executor

2014-03-18 Thread 林武康
Hi all, I changed spark.closure.serializer to kryo. When I try a count action in
the spark shell, the Task obj deserialized in the Executor is null. The source line is:
override def run() {
  ..
  task = ser.deserialize[Task[Any]](...)
  ..
}
where task is null.
Can any one help me? Thank you!

Re: example of non-line oriented input data?

2014-03-18 Thread Diana Carroll
Well, if anyone is still following this, I've gotten the following code
working which in theory should allow me to parse whole XML files: (the
problem was that I can't return the tree iterator directly.  I have to call
iter().  Why?)

import xml.etree.ElementTree as ET

# two source files, format <data><country name=...>...</country></data>
mydata = sc.textFile("file:/home/training/countries*.xml")

def parsefile(iterator):
    s = ''
    for i in iterator: s = s + str(i)
    tree = ET.fromstring(s)
    treeiterator = tree.getiterator("country")
    # why do I have to convert an iterator to an iterator?  not sure but required
    return iter(treeiterator)

mydata.mapPartitions(lambda x: parsefile(x)).map(lambda element:
element.attrib).collect()

The output is what I expect:
[{'name': 'Liechtenstein'}, {'name': 'Singapore'}, {'name': 'Panama'}]

BUT I'm a bit concerned about the construction of the string s.  How big
can my file be before converting it to a string becomes problematic?



On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll dcarr...@cloudera.com wrote:

 Thanks, Matei.

 In the context of this discussion, it would seem mapParitions is
 essential, because it's the only way I'm going to be able to process each
 file as a whole, in our example of a large number of small XML files which
 need to be parsed as a whole file because records are not required to be on
 a single line.

 The theory makes sense but I'm still utterly lost as to how to implement
 it.  Unfortunately there's only a single example of the use of
 mapPartitions in any of the Python example programs, which is the log
 regression example, which I can't run because it requires Python 2.7 and
 I'm on Python 2.6.  (aside: I couldn't find any statement that Python 2.6
 is unsupported...is it?)

 I'd really really love to see a real life example of a Python use of
 mapPartitions.  I do appreciate the very simple examples you provided, but
 (perhaps because of my novice status on Python) I can't figure out how to
 translate those to a real world situation in which I'm building RDDs from
 files, not inline collections like [(1,2),(2,3)].

 Also, you say that the function called in mapPartitions can return a
 collection OR an iterator.  I tried returning an iterator by calling
 ElementTree getiterator function, but still got the error telling me my
 object was not an iterator.

 If anyone has a real life example of mapPartitions returning a Python
 iterator, that would be fabulous.

 Diana


  On Mon, Mar 17, 2014 at 6:17 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Oh, I see, the problem is that the function you pass to mapPartitions
 must itself return an iterator or a collection. This is used so that you
 can return multiple output records for each input record. You can implement
 most of the existing map-like operations in Spark, such as map, filter,
 flatMap, etc, with mapPartitions, as well as new ones that might do a
 sliding window over each partition for example, or accumulate data across
 elements (e.g. to compute a sum).

 For example, if you have data = sc.parallelize([1, 2, 3, 4], 2), this
 will work:

  >>> data.mapPartitions(lambda x: x).collect()
 [1, 2, 3, 4]   # Just return the same iterator, doing nothing

  >>> data.mapPartitions(lambda x: [list(x)]).collect()
 [[1, 2], [3, 4]]   # Group together the elements of each partition in a
 single list (like glom)

  >>> data.mapPartitions(lambda x: [sum(x)]).collect()
 [3, 7]   # Sum each partition separately

 However something like data.mapPartitions(lambda x: sum(x)).collect()
 will *not* work because sum returns a number, not an iterator. That's why I
 put sum(x) inside a list above.

 In practice mapPartitions is most useful if you want to share some data
 or work across the elements. For example maybe you want to load a lookup
 table once from an external file and then check each element in it, or sum
 up a bunch of elements without allocating a lot of vector objects.

 Matei


 On Mar 17, 2014, at 11:25 AM, Diana Carroll dcarr...@cloudera.com
 wrote:

  There's also mapPartitions, which gives you an iterator for each
 partition instead of an array. You can then return an iterator or list of
 objects to produce from that.
 
  I confess, I was hoping for an example of just that, because i've not
 yet been able to figure out how to use mapPartitions.  No doubt this is
 because i'm a rank newcomer to Python, and haven't fully wrapped my head
 around iterators.  All I get so far in my attempts to use mapPartitions is
 the darned suchnsuch is not an iterator error.
 
  def myfunction(iterator): return [1,2,3]
  mydata.mapPartitions(lambda x: myfunction(x)).take(2)
 
 
 
 
 
  On Mon, Mar 17, 2014 at 1:57 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Here's an example of getting together all lines in a file as one string:
 
  $ cat dir/a.txt
  Hello
  world!
 
  $ cat dir/b.txt
  What's
  up??
 
  $ bin/pyspark
   >>> files = sc.textFile("dir")

   >>> files.collect()
  [u'Hello', u'world!', u"What's", 

Re: Apache Spark 0.9.0 Build Error

2014-03-18 Thread Robin Cjc
Hi, if you run that under Windows, you should use \ to replace /.
sbt/sbt means the sbt file under the sbt folder.
On Mar 18, 2014 8:42 PM, wapisani wapis...@mtu.edu wrote:

 I tried that command on Fedora and I got a lot of random downloads (around
 250 downloads) and it appeared that something was trying to get BitTorrent
 start. That command ./sbt/sbt assembly doesn't work on Windows.

 I installed sbt separately. Is there a way to determine if I'm using the
 sbt that's included with Spark or the standalone version?


 On Tue, Mar 18, 2014 at 12:16 AM, Mark Hamstra [via Apache Spark User
  List] [hidden email] wrote:

 Try ./sbt/sbt assembly


  On Mon, Mar 17, 2014 at 9:06 PM, wapisani [hidden email] wrote:

 Good morning! I'm attempting to build Apache Spark 0.9.0 on Windows 8.
 I've
 installed all prerequisites (except Hadoop) and run sbt/sbt assembly
 while
 in the root directory. I'm getting an error after the line Set current
 project to root in build file:C:/.../spark-0.9.0-incubating/. The
 error
 is:
 [error] Not a valid command: /
 [error] /sbt
 [error]  ^

 Do you know why I'm getting this error?

 Thank you very much,
 Will



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-0-9-0-Build-Error-tp2794.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.








 --
 Will Pisani
 Fourth-Year Chemical Engineering Student
 Research Scholar
 Honors Institute
 Michigan Technological University

 --
  View this message in context: Re: Apache Spark 0.9.0 Build Error
  http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-0-9-0-Build-Error-tp2794p2806.html
  Sent from the Apache Spark User List mailing list archive
  http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



Re: Apache Spark 0.9.0 Build Error

2014-03-18 Thread wapisani
Hi Chen,

I tried sbt\sbt assembly and I got an error of  'sbt\sbt' is not
recognized as an internal or external command, operable program or batch
file.



On Tue, Mar 18, 2014 at 11:18 AM, Chen Jingci [via Apache Spark User List] 
ml-node+s1001560n2811...@n3.nabble.com wrote:

 hi, if you run that under windows, you should use \ to replace /.
 sbt/sbt means the sbt file under the sbt folder.
  On Mar 18, 2014 8:42 PM, wapisani [hidden email] wrote:

 I tried that command on Fedora and I got a lot of random downloads
 (around 250 downloads) and it appeared that something was trying to get
 BitTorrent start. That command ./sbt/sbt assembly doesn't work on
 Windows.

 I installed sbt separately. Is there a way to determine if I'm using the
 sbt that's included with Spark or the standalone version?


 On Tue, Mar 18, 2014 at 12:16 AM, Mark Hamstra [via Apache Spark User
  List] [hidden email] wrote:
 Try ./sbt/sbt assembly


  On Mon, Mar 17, 2014 at 9:06 PM, wapisani [hidden email] wrote:
 Good morning! I'm attempting to build Apache Spark 0.9.0 on Windows 8.
 I've
 installed all prerequisites (except Hadoop) and run sbt/sbt assembly
 while
 in the root directory. I'm getting an error after the line Set current
 project to root in build file:C:/.../spark-0.9.0-incubating/. The error
 is:
 [error] Not a valid command: /
 [error] /sbt
 [error]  ^

 Do you know why I'm getting this error?

 Thank you very much,
 Will



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-0-9-0-Build-Error-tp2794.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







 --
 Will Pisani
 Fourth-Year Chemical Engineering Student
 Research Scholar
 Honors Institute
 Michigan Technological University

 --
  View this message in context: Re: Apache Spark 0.9.0 Build Error
  http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-0-9-0-Build-Error-tp2794p2806.html

  Sent from the Apache Spark User List mailing list archive
  http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.






-- 
Will Pisani
Fourth-Year Chemical Engineering Student
Research Scholar
Honors Institute
Michigan Technological University




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-0-9-0-Build-Error-tp2794p2812.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

[spark] New article on spark scalaz-stream ( a bit of ML)

2014-03-18 Thread Pascal Voitot Dev
Hi,
I wrote this new article after studying more deeply how to adapt scalaz-stream
to Spark DStreams.
I re-explain a few Spark (& scalaz-stream) concepts (in my own words) in
it, and I went further using the new scalaz-stream NIO API, which is quite
interesting IMHO.

The result is a long blog triptych starting here:
http://mandubian.com/2014/03/08/zpark-ml-nio-1/

Regards
Pascal


Re: inexplicable exceptions in Spark 0.7.3

2014-03-18 Thread Walrus theCat
Hi Andrew,

Thanks for your interest.  This is a standalone job.


On Mon, Mar 17, 2014 at 4:30 PM, Andrew Ash and...@andrewash.com wrote:

 Are you running from the spark shell or from a standalone job?


  On Mon, Mar 17, 2014 at 4:17 PM, Walrus theCat walrusthe...@gmail.com wrote:

 Hi,

 I'm getting this stack trace, using Spark 0.7.3.  No references to
 anything in my code, never experienced anything like this before.  Any
 ideas what is going on?

 java.lang.ClassCastException: spark.SparkContext$$anonfun$9 cannot be
 cast to scala.Function2
 at spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:43)
 at spark.scheduler.ResultTask.readExternal(ResultTask.scala:106)
 at
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at spark.JavaDeserializationStream.readObject(JavaSerializer.scala:23)
 at spark.JavaSerializerInstance.deserialize(JavaSerializer.scala:45)
 at spark.executor.Executor$TaskRunner.run(Executor.scala:96)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)





Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Xiangrui Meng
Sorry, the link was wrong. Should be
https://github.com/apache/spark/pull/131 -Xiangrui


On Tue, Mar 18, 2014 at 10:20 AM, Michael Allman m...@allman.ms wrote:
 Hi Xiangrui,

 I don't see how https://github.com/apache/spark/pull/161 relates to ALS. Can
 you explain?

 Also, thanks for addressing the issue with factor matrix persistence in PR
 165. I was probably not going to get to that for a while.

 I will try to test your changes today for speed improvements.

 Cheers,

 Michael



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tp2567p2817.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Feed KMeans algorithm with a row major matrix

2014-03-18 Thread Xiangrui Meng
Hi Jaonary,

With the current implementation, you need to call Array.slice to make
each row an Array[Double] and cache the result RDD. There is a plan to
support block-wise input data and I will keep you informed.

Best,
Xiangrui
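
A minimal sketch of that slicing, assuming flatData is the row-major
RDD[Array[Double]] coming out of the native code and dim is the known number of
columns (both names are placeholders):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.clustering.KMeans

val flatData: RDD[Array[Double]] = ???   // the row-major blocks from the native code
val dim = 100                            // assumed number of columns per row

val rows: RDD[Array[Double]] = flatData.flatMap { block =>
  // each block holds block.length / dim rows laid out contiguously
  (0 until block.length / dim).map(i => block.slice(i * dim, (i + 1) * dim))
}.cache()

val model = KMeans.train(rows, 5, 20)    // k and maxIterations chosen arbitrarily here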

On Tue, Mar 18, 2014 at 2:46 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
 Dear All,

 I'm trying to cluster data from native library code with Spark KMeans||. In
 my native library the data are represented as a matrix (rows = number of
 data points, cols = dimension). For efficiency reasons, they are copied into
 a one-dimensional Scala Array, row-major wise, so after the computation I
 have an RDD[Array[Double]], but each array holds a whole block of data points
 rather than a single one. I need to transform these arrays into
 Array[Array[Double]] before running the KMeans|| algorithm.

 How can I do this efficiently?



 Best regards,


Re: spark-shell fails

2014-03-18 Thread psteckler
Although sbt assembly reports success, I re-ran that step, and see errors
like:

  Error extracting zip entry
'scala/tools/nsc/transformUnCurry$UnCurryTransformer$$anonfun$14$$anonfun$apply
  (omitting rest of super-long path) (File name  too long)

Is this a problem with the 'zip' tool on my system?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-fails-tp2778p2821.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: spark-shell fails

2014-03-18 Thread psteckler
OK, the problem was that the directory where I had installed Spark is
encrypted. The particular encryption system appears to limit the length of
file names.

I re-installed on a vanilla partition, and spark-shell runs fine.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-fails-tp2778p2822.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Maven repo for Spark pre-built with CDH4?

2014-03-18 Thread Punya Biswal
Hi all,

The Maven central repo contains an artifact for spark 0.9.0 built with
unmodified Hadoop, and the Cloudera repo contains an artifact for spark
0.9.0 built with CDH 5 beta. Is there a repo that contains spark-core built
against a non-beta version of CDH (such as 4.4.0)?

Punya







Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Michael Allman
I just ran a runtime performance comparison between 0.9.0-incubating and your
als branch. I saw a 1.5x improvement in performance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tp2567p2823.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Xiangrui Meng
Glad to hear the speed-up. Wish we can improve the implementation
further in the future. -Xiangrui

On Tue, Mar 18, 2014 at 1:55 PM, Michael Allman m...@allman.ms wrote:
 I just ran a runtime performance comparison between 0.9.0-incubating and your
 als branch. I saw a 1.5x improvement in performance.



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tp2567p2823.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Regarding Successive operation on elements and recursively

2014-03-18 Thread yh18190
 Hi,
I am new to the Spark/Scala environment. Currently I am working on Discrete
wavelet transformation algos on time series data.
 I have to perform recursive additions on successive elements in RDDs.
 For example:
 List of elements (RDDs) -- a1 a2 a3 a4
 Level 1 transformation -- a1+a2  a3+a4  a1-a2  a3-a4
 Level 2 -- (a1+a2)+(a3+a4) (a1+a2)-(a3+a4)

Is there a way to provide indexing to elements in a distributed environment
across nodes, so that I know that I am referring to a2 after a1? I want to
perform successive addition of only two elements, in a recursive manner.

Could you please help me in this aspect.. I would be really thankful to you..
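
Not a complete answer, but a minimal sketch of one such level, assuming the
elements already carry their position as a key (series: RDD[(Long, Double)], with
the index attached when the data is loaded) so that element 2i can be paired with
element 2i+1 even across partitions:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

val series: RDD[(Long, Double)] = ???            // (position, value), position attached at load time

// pair up neighbours: elements 2i and 2i+1 share the key i
val paired = series.map { case (i, v) => (i / 2, (i, v)) }.groupByKey()
val level1 = paired.map { case (pairIdx, elems) =>
  val sorted = elems.toSeq.sortBy(_._1)          // element 2i first, element 2i+1 second
  val (a, b) = (sorted(0)._2, sorted(1)._2)
  (pairIdx, (a + b, a - b))                      // (sum, difference) for this pair
}

Re-keying level1's output the same way gives the next level, which matches the
recursive structure described above.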



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Regarding-Successive-operation-on-elements-and-recursively-tp2826.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Incrementally add/remove vertices in GraphX

2014-03-18 Thread Matei Zaharia
I just meant that you call union() before creating the RDDs that you pass to 
new Graph(). If you call it after it will produce other RDDs.

The Graph() constructor actually shuffles and “indexes” the data to make graph 
operations efficient, so it’s not too easy to add elements after. You could 
access graph.vertices and graph.edges to build new RDDs, and then call Graph() 
again to make a new graph. I’ve CCed Joey and Ankur to see if they have further 
ideas on how to optimize this. It would be cool to support more efficient union 
and subtracting of graphs once they’ve been partitioned by GraphX.

Matei
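
A minimal sketch of what is described above, with graph, newVerts and newEdges as
assumed placeholders (concrete attribute types chosen only for illustration):

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val graph: Graph[String, Int] = ???             // the existing graph
val newVerts: RDD[(VertexId, String)] = ???     // vertices to add
val newEdges: RDD[Edge[Int]] = ???              // edges to add

// union the underlying RDDs, then build (and re-index) a new graph from them
val updatedGraph: Graph[String, Int] =
  Graph(graph.vertices.union(newVerts), graph.edges.union(newEdges))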

On Mar 14, 2014, at 8:32 AM, alelulli alessandro.lu...@gmail.com wrote:

 Hi Matei,
 
 Could you please clarify why i must call union before creating the graph?
 
  What's the behavior if I call union / subtract after the creation?
  Are the added/removed vertexes processed?
 
 For example if i'm implementing an iterative algorithm and at the 5th step i
 need to add some vertex / edge, can i call union / subtract on the
 VertexRDD, EdgeRDD and Triplets?
 
 Thanks
 Alessandro
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Incrementally-add-remove-vertices-in-GraphX-tp2227p2695.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Access original filename in a map function

2014-03-18 Thread Uri Laserson
Hi spark-folk,

I have a directory full of files that I want to process using PySpark.
 There is some necessary metadata in the filename that I would love to
attach to each record in that file.  Using Java MapReduce, I would access

((FileSplit) context.getInputSplit()).getPath().getName()

in the setup() method of the mapper.

Using Hadoop Streaming, I can access the environment variable
map_input_file to get the filename.

Is there something I can do in PySpark to get the filename?  Surely, one
solution would be to get the list of files first, load each one as an RDD
separately, and then union them together.  But listing the files in HDFS is
a bit annoying through Python, so I was wondering if the filename is
somehow attached to a partition.

Thanks!

Uri

-- 
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
laser...@cloudera.com


Re: There is an error in Graphx

2014-03-18 Thread ankurdave
This problem occurs because graph.triplets generates an iterator that reuses
the same EdgeTriplet object for every triplet in the partition. The
workaround is to force a copy using graph.triplets.map(_.copy()).

The solution in the AMPCamp tutorial is mistaken -- I'm not sure if that
ever worked.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/There-is-an-error-in-Graphx-tp1575p2836.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: There is an error in Graphx

2014-03-18 Thread ankurdave
 The workaround is to force a copy using graph.triplets.map(_.copy()).

Sorry, this actually won't copy the entire triplet, only the attributes
defined in Edge. The right workaround is to copy the EdgeTriplet explicitly:

graph.triplets.map { et =>
  val et2 = new EdgeTriplet[VD, ED]   // Replace VD and ED with the correct
types
  et2.srcId = et.srcId
  et2.dstId = et.dstId
  et2.attr = et.attr
  et2.srcAttr = et.srcAttr
  et2.dstAttr = et.dstAttr
  et2
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/There-is-an-error-in-Graphx-tp1575p2837.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: sample data for pagerank?

2014-03-18 Thread ankurdave
The examples in graphx/data are meant to show the input data format, but if
you want to play around with larger and more interesting datasets, we've
been using the following ones, among others:

- SNAP's web-Google dataset (5M edges):
https://snap.stanford.edu/data/web-Google.html
- SNAP's soc-LiveJournal1 dataset (69M edges):
https://snap.stanford.edu/data/soc-LiveJournal1.html

These come in edge list format and, after decompression, can directly be
loaded using GraphLoader.
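
For example, loading one of them is a one-liner (the path below is a placeholder
for wherever the decompressed edge list lives):

import org.apache.spark.graphx.GraphLoader
// assumes an existing SparkContext `sc`; GraphLoader skips the '#' comment lines in the SNAP files
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/web-Google.txt")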

Ankur



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sample-data-for-pagerank-tp2655p2839.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.