Re: Spark R guidelines for non-spark functions and coxph (Cox Regression for Time-Dependent Covariates)
I think the answer to this depends on what granularity you want to run the algorithm on. If it's on the entire Spark DataFrame and you expect the data frame to be very large, then it isn't easy to use the existing R function. However, if you want to run the algorithm on smaller subsets of the data, you can look at the support for UDFs we have in SparkR at http://spark.apache.org/docs/latest/sparkr.html#applying-user-defined-function

Thanks
Shivaram

On Tue, Nov 15, 2016 at 3:56 AM, pietrop wrote:
> Hi all,
> I'm writing here after some intensive usage of pyspark and SparkSQL.
> I would like to use a well known function in the R world: coxph() from the
> survival package.
> From what I understood, I can't parallelize a function like coxph() because
> it isn't provided with the SparkR package. In other words, I should
> implement a SparkR compatible algorithm instead of using coxph().
> I have no chance to make coxph() parallelizable, right?
> More generally, I think this is true for any non-spark function which only
> accepts the data.frame format as its data input.
>
> Do you plan to implement the coxph() counterpart in Spark? The most useful
> version of this model is the Cox Regression Model for Time-Dependent
> Covariates, which is missing from ANY ML framework as far as I know.
>
> Thank you
> Pietro
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-R-guidelines-for-non-spark-functions-and-coxph-Cox-Regression-for-Time-Dependent-Covariates-tp28077.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
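The UDF route in that doc link can be sketched concretely. Below is a minimal sketch, not a tested recipe: it assumes Spark 2.0+ where SparkR exposes gapply(), the survival package installed on every worker, and made-up column names (group, time, status, age, sex) on a hypothetical Spark DataFrame `df`. Each group's rows must fit in a single worker's memory, since coxph() itself runs on a plain data.frame.

```r
library(SparkR)

# Output schema for the per-group results returned by the UDF.
schema <- structType(structField("group", "string"),
                     structField("term", "string"),
                     structField("coef", "double"))

# Fit one Cox model per group of the (hypothetical) Spark DataFrame `df`.
models <- gapply(df, "group", function(key, pdf) {
  fit <- survival::coxph(survival::Surv(time, status) ~ age + sex, data = pdf)
  data.frame(group = key[[1]],
             term = names(coef(fit)),
             coef = unname(coef(fit)),
             stringsAsFactors = FALSE)
}, schema)

head(collect(models))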
Re: spark-ec2 scripts with spark-2.0.0-preview
Can you open an issue on https://github.com/amplab/spark-ec2 ? I think we should be able to escape the version string and pass 2.0.0-preview through the scripts.

Shivaram

On Tue, Jun 14, 2016 at 12:07 PM, Sunil Kumar wrote:
> Hi,
>
> The spark-ec2 scripts are missing from spark-2.0.0-preview. Is there a
> workaround available? I tried to change the ec2 scripts to accommodate
> spark-2.0.0... If I call the release spark-2.0.0-preview, then it barfs
> because the command line argument --spark-version=spark-2.0.0-preview
> gets translated to spark-2.0.0-preiew (-v is taken as a switch)... If I call
> the release spark-2.0.0, then it can't find it in AWS, since it looks for
> http://s3.amazonaws.com/spark-related-packages/spark-2.0.0-bin-hadoop2.4.tgz
> instead of
> http://s3.amazonaws.com/spark-related-packages/spark-2.0.0-preview-bin-hadoop2.4.tgz
>
> Any ideas on how to make this work? How can I tweak/hack the code to look
> for spark-2.0.0-preview in spark-related-packages?
>
> thanks
Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0
Overall this sounds good to me. One question I have is that in addition to the ML algorithms we have a number of linear algebra (various distributed matrices) and statistical methods in the spark.mllib package. Is the plan to port or move these to the spark.ml namespace in the 2.x series?

Thanks
Shivaram

On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen wrote:
> FWIW, all of that sounds like a good plan to me. Developing one API is
> certainly better than two.
>
> On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng wrote:
>> Hi all,
>>
>> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
>> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
>> been developed under the spark.ml package, while the old RDD-based API has
>> been developed in parallel under the spark.mllib package. While it was
>> easier to implement and experiment with new APIs under a new package, it
>> became harder and harder to maintain as both packages grew bigger and
>> bigger. And new users are often confused by having two sets of APIs with
>> overlapping functions.
>>
>> We started to recommend the DataFrame-based API over the RDD-based API in
>> Spark 1.5 for its versatility and flexibility, and we saw the development
>> and the usage gradually shifting to the DataFrame-based API. Just counting
>> the lines of Scala code, from 1.5 to the current master we added ~1
>> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
>> gather more resources on the development of the DataFrame-based API and to
>> help users migrate over sooner, I want to propose switching RDD-based MLlib
>> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>
>> * We do not accept new features in the RDD-based spark.mllib package, unless
>> they block implementing new features in the DataFrame-based spark.ml
>> package.
>> * We still accept bug fixes in the RDD-based API.
>> * We will add more features to the DataFrame-based API in the 2.x series to
>> reach feature parity with the RDD-based API.
>> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
>> the RDD-based API.
>> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>>
>> Though the RDD-based API is already in de facto maintenance mode, this
>> announcement will make it clear and hence important to both MLlib developers
>> and users. So we’d greatly appreciate your feedback!
>>
>> (As a side note, people sometimes use “Spark ML” to refer to the
>> DataFrame-based API or even the entire MLlib component. This also causes
>> confusion. To be clear, “Spark ML” is not an official name and there are no
>> plans to rename MLlib to “Spark ML” at this time.)
>>
>> Best,
>> Xiangrui
Re: [SparkR] Any reason why saveDF's mode is append by default ?
I think it's just a bug -- I think we originally followed the Python API (in the original PR [1]) but the Python API seems to have been changed to match Scala / Java in https://issues.apache.org/jira/browse/SPARK-6366 Feel free to open a JIRA / PR for this.

Thanks
Shivaram

[1] https://github.com/amplab-extras/SparkR-pkg/pull/199/files

On Sun, Dec 13, 2015 at 11:58 PM, Jeff Zhang wrote:
> It is inconsistent with the Scala API, which is "error" by default. Any
> reason for that? Thanks
>
> --
> Best Regards
>
> Jeff Zhang
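Until the defaults are consistent across languages, callers can sidestep the question by passing the mode explicitly. A small sketch against the 1.x SparkR API (hypothetical DataFrame `df` and output path):

```r
library(SparkR)

# Passing mode explicitly makes behavior version-independent.
# "error" refuses to write over existing data, matching the Scala/Java default;
# other modes are "append", "overwrite" and "ignore".
saveDF(df, path = "/tmp/people.parquet", source = "parquet", mode = "error")
```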
Re: anyone using netlib-java with sparkR on yarn spark1.6?
Nothing more -- The only two things I can think of are: (a) is there something else on the classpath that comes before this lgpl JAR ? I've seen cases where two versions of netlib-java on the classpath can mess things up. (b) There is something about the way SparkR is using reflection to invoke the ML Pipelines code that is breaking the BLAS library discovery. I don't know of a good way to debug this yet though. Thanks Shivaram On Wed, Nov 11, 2015 at 5:55 AM, Tom Graves <tgraves...@yahoo.com> wrote: > Is there anything other then the spark assembly that needs to be in the > classpath? I verified the assembly was built right and its in the classpath > (else nothing would work). > > Thanks, > Tom > > > > On Tuesday, November 10, 2015 8:29 PM, Shivaram Venkataraman > <shiva...@eecs.berkeley.edu> wrote: > > > I think this is happening in the driver. Could you check the classpath > of the JVM that gets started ? If you use spark-submit on yarn the > classpath is setup before R gets launched, so it should match the > behavior of Scala / Python. > > Thanks > Shivaram > > On Fri, Nov 6, 2015 at 1:39 PM, Tom Graves <tgraves...@yahoo.com.invalid> > wrote: >> I'm trying to use the netlib-java stuff with mllib and sparkR on yarn. >> I've >> compiled with -Pnetlib-lgpl, see the necessary things in the spark >> assembly >> jar. The nodes have /usr/lib64/liblapack.so.3, /usr/lib64/libblas.so.3, >> and /usr/lib/libgfortran.so.3. 
>> Running:
>> data <- read.df(sqlContext, 'data.csv', 'com.databricks.spark.csv')
>> mdl = glm(C2~., data, family="gaussian")
>>
>> But I get the error:
>> 15/11/06 21:17:27 WARN LAPACK: Failed to load implementation from:
>> com.github.fommil.netlib.NativeSystemLAPACK
>> 15/11/06 21:17:27 WARN LAPACK: Failed to load implementation from:
>> com.github.fommil.netlib.NativeRefLAPACK
>> 15/11/06 21:17:27 ERROR RBackendHandler: fitRModelFormula on
>> org.apache.spark.ml.api.r.SparkRWrappers failed
>> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>>   java.lang.AssertionError: assertion failed: lapack.dpotrs returned 18.
>>     at scala.Predef$.assert(Predef.scala:179)
>>     at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40)
>>     at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:114)
>>     at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:166)
>>     at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:65)
>>     at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>>     at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>>     at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:138)
>>     at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>>
>> Anyone have this working?
>>
>> Thanks,
>> Tom
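One hedged way to answer the "which classpath / which copy of netlib wins" question from this thread is to turn on JVM class-loading tracing, which prints the jar each class is loaded from. A sketch using standard Spark confs and the stock -verbose:class JVM flag; the R script name is a placeholder:

```shell
# Print every class (and its source jar) as the driver and executor JVMs
# load it, then filter for netlib to see which implementation is picked up.
spark-submit --master yarn \
  --conf "spark.driver.extraJavaOptions=-verbose:class" \
  --conf "spark.executor.extraJavaOptions=-verbose:class" \
  analysis.R 2>&1 | grep -i netlib
```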
Re: Spark EC2 script on Large clusters
It is a known limitation that spark-ec2 is very slow for large clusters, and as you mention most of this is due to the use of rsync to transfer things from the master to all the slaves. Nick (cc'd) has been working on an alternative approach at https://github.com/nchammas/flintrock that is more scalable.

Thanks
Shivaram

On Thu, Nov 5, 2015 at 8:12 AM, Christian wrote:
> For starters, thanks for the awesome product!
>
> When creating ec2-clusters of 20-40 nodes, things work great. When we create
> a cluster with the provided spark-ec2 script, it takes hours. Creating a 200
> node cluster takes 2 1/2 hours, and a 500 node cluster takes over 5 hours.
> One other problem we are having is that some nodes don't come up when the
> other ones do; the process seems to just move on, skipping the rsync and any
> installs on those ones.
>
> My guess as to why it takes so long to set up a large cluster is the use of
> rsync. What if instead of using rsync, you synced to S3 and then did a pdsh
> to pull it down on all of the machines? This is a big deal for us and if we
> can come up with a good plan, we might be able to help out with the required
> changes.
>
> Are there any suggestions on how to deal with some of the nodes not being
> ready when the process starts?
>
> Thanks for your time,
> Christian
Re: how to run RStudio or RStudio Server on ec2 cluster?
RStudio should already be set up if you launch an EC2 cluster using spark-ec2. See http://blog.godatadriven.com/sparkr-just-got-better.html for details.

Shivaram

On Wed, Nov 4, 2015 at 5:11 PM, Andy Davidson wrote:
> Hi
>
> I just set up a Spark cluster on AWS EC2. In the past I have done a lot of
> work using RStudio on my local machine. bin/sparkR looks interesting, however
> it looks like you just get an R command line interpreter. Does anyone have
> any experience using something like RStudio or RStudio Server with SparkR?
>
> In an ideal world I would like to use some sort of IDE for R running on my
> local machine connected to my Spark cluster.
>
> In Spark 1.3.1 it was easy to get a similar environment working for Python.
> I would use pyspark to start an IPython notebook server on my cluster
> master, then set up an SSH tunnel on my local machine to connect a browser
> on my local machine to the notebook server.
>
> Any comments or suggestions would be greatly appreciated
>
> Andy
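For the local-IDE route the question asks about, a minimal sketch of pointing an R/RStudio session at a SparkR installation. Assumptions are flagged inline: /root/spark is the default install path used by the spark-ec2 scripts, and the master URL is a placeholder you would replace with your cluster's hostname.

```r
# Assumption: spark-ec2's default install location; adjust for your setup.
Sys.setenv(SPARK_HOME = "/root/spark")

# Make the SparkR package bundled with the Spark install visible to R.
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)

# Placeholder master URL; 7077 is the standalone master's default port.
sc <- sparkR.init(master = "spark://<master-hostname>:7077")
sqlContext <- sparkRSQL.init(sc)
```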
Re: SparkR -Graphx
+Xiangrui

I am not sure exposing the entire GraphX API would make sense as it contains a lot of low level functions. However we could expose some high level functions like PageRank etc. Xiangrui, who has been working on similar techniques to expose MLlib functions like GLM, might have more to add.

Thanks
Shivaram

On Thu, Aug 6, 2015 at 6:21 AM, smagadi sudhindramag...@fico.com wrote:
> Wanted to use GraphX from SparkR. Is there a way to do it? I think as of now
> it is not possible. I was thinking one could write a wrapper in R that can
> call the Scala GraphX libraries. Any thoughts on this please?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-Graphx-tp24152.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Broadcast variables in R
There shouldn't be anything Mac OS specific about this feature. One point of warning though -- as mentioned previously in this thread, the APIs were made private because we aren't sure we will be supporting them in the future. If you are using these APIs it would be good to chime in on the JIRA with your use case.

Thanks
Shivaram

On Tue, Jul 21, 2015 at 2:34 AM, Serge Franchois serge.franch...@altran.com wrote:
> I might add to this that I've done the same exercise on Linux (CentOS 6) and
> there, broadcast variables ARE working. Is this functionality perhaps not
> exposed on Mac OS X? Or does it have to do with the fact that there are no
> native Hadoop libs for Mac?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Broadcast-variables-in-R-tp23914p23927.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
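For reference, a sketch of the private broadcast API roughly as it existed in Spark 1.4.x. This is an untested illustration: the ::: prefix is needed precisely because these functions are no longer exported, and their names and signatures may differ between releases, which is exactly the caveat raised above.

```r
library(SparkR)

# Assumes an existing SparkContext `sc` from sparkR.init().
randomMat <- matrix(rnorm(100), 10, 10)

# Private API: ship the matrix once to every worker.
bc <- SparkR:::broadcast(sc, randomMat)

rdd <- SparkR:::parallelize(sc, 1:2, 2L)
res <- SparkR:::collect(SparkR:::lapply(rdd, function(x) {
  # Read the broadcast value inside the worker closure.
  sum(SparkR:::value(bc) * x)
}))
```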
Re: What else is need to setup native support of BLAS/LAPACK with Spark?
FWIW I've run into similar BLAS related problems before and wrote up a document on how to do this for Spark EC2 clusters at https://github.com/amplab/ml-matrix/blob/master/EC2.md -- Note that this works with a vanilla Spark build (you only need to link to netlib-lgpl in your App) but requires the app jar to be present on all the machines.

Thanks
Shivaram

On Tue, Jul 21, 2015 at 7:37 AM, Arun Ahuja aahuj...@gmail.com wrote:
Yes, I imagine it's the driver's classpath - I'm pulling those screenshots straight from the Spark UI environment page. Is there somewhere else to grab the executor classpath? Also, the warning is only printing once, so it's also not clear whether the warning is from the driver or executor, would you know?

Thanks,
Arun

On Tue, Jul 21, 2015 at 7:52 AM, Sean Owen so...@cloudera.com wrote:
Great, and that file exists on HDFS and is world readable? Just double-checking. What classpath is this -- your driver or executor? This is the driver, no? I assume so just because it looks like it references the assembly you built locally and from which you're launching the driver. I think we're concerned with the executors and what they have on the classpath. I suspect there is still a problem somewhere in there.

On Mon, Jul 20, 2015 at 4:59 PM, Arun Ahuja aahuj...@gmail.com wrote:
Cool, I tried that as well, and it doesn't seem different: spark.yarn.jar seems set. This actually doesn't change the classpath, not sure if it should. But same netlib warning.

Thanks for the help!
- Arun

On Fri, Jul 17, 2015 at 3:18 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Can you try setting the spark.yarn.jar property to make sure it points to the jar you're thinking of?

-Sandy

On Fri, Jul 17, 2015 at 11:32 AM, Arun Ahuja aahuj...@gmail.com wrote:
Yes, it's a YARN cluster and I'm using spark-submit to run. I have SPARK_HOME set to the directory above and use the spark-submit script from there.
bin/spark-submit --master yarn-client --executor-memory 10g --driver-memory 8g \
  --num-executors 400 --executor-cores 1 --class org.hammerlab.guacamole.Guacamole \
  --conf spark.default.parallelism=4000 --conf spark.storage.memoryFraction=0.15

libgfortran.so.3 is also there:

ls /usr/lib64/libgfortran.so.3
/usr/lib64/libgfortran.so.3

These are the jniloader files in the jar:

jar tf /hpc/users/ahujaa01/src/spark/assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar | grep jniloader
META-INF/maven/com.github.fommil/jniloader/
META-INF/maven/com.github.fommil/jniloader/pom.xml
META-INF/maven/com.github.fommil/jniloader/pom.properties

Thanks,
Arun

On Fri, Jul 17, 2015 at 1:30 PM, Sean Owen so...@cloudera.com wrote:
Make sure /usr/lib64 contains libgfortran.so.3; that's really the issue. I'm pretty sure the answer is 'yes', but make sure the assembly has jniloader too. I don't see why it wouldn't, but that's needed. What is your env like -- local, standalone, YARN? How are you running? Just want to make sure you are using this assembly across your cluster.

On Fri, Jul 17, 2015 at 6:26 PM, Arun Ahuja aahuj...@gmail.com wrote:
Hi Sean,

Thanks for the reply! I did double-check that the jar is the one I think I am running:

jar tf /hpc/users/ahujaa01/src/spark/assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar | grep netlib | grep Native
com/github/fommil/netlib/NativeRefARPACK.class
com/github/fommil/netlib/NativeRefBLAS.class
com/github/fommil/netlib/NativeRefLAPACK.class
com/github/fommil/netlib/NativeSystemARPACK.class
com/github/fommil/netlib/NativeSystemBLAS.class
com/github/fommil/netlib/NativeSystemLAPACK.class

Also, I checked the gfortran version on the cluster nodes and it is available and is 5.1:

$ gfortran --version
GNU Fortran (GCC) 5.1.0
Copyright (C) 2015 Free Software Foundation, Inc.
and still see:

15/07/17 13:20:53 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
15/07/17 13:20:53 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
15/07/17 13:20:53 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
15/07/17 13:20:53 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK

Does anything need to be adjusted in my application POM?

Thanks,
Arun

On Thu, Jul 16, 2015 at 5:26 PM, Sean Owen so...@cloudera.com wrote:
Yes, that's most of the work, just getting the native libs into the assembly. netlib can find them from there even if you don't have BLAS libs on your OS, since it includes a reference implementation as a fallback. One common reason it won't load is not having libgfortran installed on your OSes though. It has to be 4.6+ too. That can't be shipped even in netlib and has to exist on your hosts. The other
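On the POM question raised above, one commonly suggested option from the netlib-java project's own documentation is to depend on its `all` artifact, which pulls in the native system and reference backends. This is a sketch; treat the version as an example and check it against the netlib-java release used by your Spark version:

```xml
<!-- Pulls in netlib-java's native system/reference BLAS and LAPACK backends
     so they end up on the application classpath. Coordinates are from the
     netlib-java project docs; verify the version for your build. -->
<dependency>
  <groupId>com.github.fommil.netlib</groupId>
  <artifactId>all</artifactId>
  <version>1.1.2</version>
  <type>pom</type>
</dependency>
```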
Re: Including additional scala libraries in sparkR
There was a fix for `--jars` that went into 1.4.1: https://github.com/apache/spark/commit/2579948bf5d89ac2d822ace605a6a4afce5258d6

Shivaram

On Tue, Jul 14, 2015 at 4:18 AM, Sun, Rui rui@intel.com wrote:
Could you give more details about the misbehavior of --jars for SparkR? Maybe it's a bug.

From: Michal Haris [michal.ha...@visualdna.com]
Sent: Tuesday, July 14, 2015 5:31 PM
To: Sun, Rui
Cc: Michal Haris; user@spark.apache.org
Subject: Re: Including additional scala libraries in sparkR

Ok thanks. It seems that --jars is not behaving as expected - getting class not found for even the most simple object from my lib. But anyways, I have to do at least a filter transformation before collecting the HBaseRDD into R, so I will have to go the route of using the Scala spark shell to transform, collect and save into the local filesystem and then visualise the file with R, until custom RDD transformations are exposed in the SparkR API.

On 13 July 2015 at 10:27, Sun, Rui rui@intel.com wrote:
Hi, Michal,

SparkR comes with a JVM backend that supports Java object instantiation and calling Java instance and static methods from the R side. As defined in https://github.com/apache/spark/blob/master/R/pkg/R/backend.R, newJObject() creates an instance of a Java class; callJMethod() calls an instance method of a Java object; callJStatic() calls a static method of a Java class.

If the thing is as simple as data visualization, you can use the above low-level functions to create an instance of your HBase RDD on the JVM side, collect the data to the R side, and visualize it. However, if you want to do HBase RDD transformations and HBase table updates, things are quite complex now. SparkR supports the majority of the RDD API (though not exposed publicly in the 1.4 release), allowing transformation functions in R code, but currently it only supports RDD sources from text files and SparkR DataFrames, so your HBase RDDs can't be used by the SparkR RDD API for further processing.
You can use --jars to include your scala library to be accessed by the JVM backend.

From: Michal Haris [michal.ha...@visualdna.com]
Sent: Sunday, July 12, 2015 6:39 PM
To: user@spark.apache.org
Subject: Including additional scala libraries in sparkR

I have a spark program with a custom optimised RDD for HBase scans and updates. I have a small library of objects in Scala to support efficient serialisation, partitioning etc. I would like to use R as an analysis and visualisation front-end. I have tried to use rJava (i.e. not using sparkR) and I got as far as initialising the spark context, but I encountered problems with HBase dependencies (HBaseConfiguration: Unsupported major.minor version 51.0), so I tried sparkR. But I can't figure out how to make my custom Scala classes available to sparkR other than re-implementing them in R. Is there a way to include and invoke additional Scala objects and RDDs within a sparkR shell/job? Something similar to additional jars and an init script in a normal spark submit/shell.

--
Michal Haris
Technical Architect
direct line: +44 (0) 207 749 0229
www.visualdna.com | t: +44 (0) 207 734 7033
31 Old Nichol Street
London E2 7HR
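A runnable sketch of the JVM backend calls described in this thread, using a JDK class so it works without any extra jar. In practice you would substitute a class from a library shipped via --jars; the ArrayList here is purely illustrative, and these are private (:::) functions whose behavior may change between releases.

```r
library(SparkR)

# Instantiate a Java object on the JVM side (any class on the classpath).
lst <- SparkR:::newJObject("java.util.ArrayList")

# Call instance methods on the remote object.
SparkR:::callJMethod(lst, "add", "hello")
n <- SparkR:::callJMethod(lst, "size")

# Call a static method on a Java class.
m <- SparkR:::callJStatic("java.lang.Math", "max", 1L, 5L)
```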
Re: sparkr-submit additional R files
You can just use `--files` and I think it should work. Let us know on https://issues.apache.org/jira/browse/SPARK-6833 if it doesn't work as expected. Thanks Shivaram On Tue, Jul 7, 2015 at 5:13 AM, Michał Zieliński zielinski.mich...@gmail.com wrote: Hi all, *spark-submit* for Python and Java/Scala has *--py-files* and *--jars* options for submitting additional files on top of the main application. Is there any such option for *sparkr-submit*? I know that there is *includePackage() *R function to add library dependencies, but can you add other sources that are not R libraries (e.g. additional code repositories?). I really appreciate your help. Thanks, Michael
Re: JVM is not ready after 10 seconds
When I've seen this error before it has been due to the spark-submit file (i.e. `C:\spark-1.4.0\bin/bin/spark-submit.cmd`) not having execute permissions. You can try to set execute permission and see if it fixes things. Also we have a PR open to fix a related problem at https://github.com/apache/spark/pull/7025. If you can test the PR that will also be very helpful.

Thanks
Shivaram

On Mon, Jul 6, 2015 at 6:11 PM, ashishdutt ashish.du...@gmail.com wrote:
> Hi,
> I am trying to connect a worker to the master. The Spark master is on
> Cloudera Manager and I know the master IP address and port number. I
> downloaded the Spark binary for CDH4 on the worker machine, and when I
> invoke the command sc <- sparkR.init(master="ip address:port number") I get
> the following error:
>
> sc <- sparkR.init(master="spark://10.229.200.250:7377")
> Launching java with spark-submit command
> C:\spark-1.4.0\bin/bin/spark-submit.cmd sparkr-shell
> C:\Users\ASHISH~1\AppData\Local\Temp\Rtmp82kCxH\backend_port4281739d85
> Error in sparkR.init(master = "spark://10.229.200.250:7377") :
>   JVM is not ready after 10 seconds
> In addition: Warning message:
> running command 'C:\spark-1.4.0\bin/bin/spark-submit.cmd sparkr-shell
> C:\Users\ASHISH~1\AppData\Local\Temp\Rtmp82kCxH\backend_port4281739d85'
> had status 127
>
> I am using Windows 7 as the OS on the worker machine and I am invoking
> sparkR.init() from RStudio. Any help in this reference will be appreciated.
>
> Thank you,
> Ashish Dutt
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/JVM-is-not-ready-after-10-seconds-tp23658.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: build spark 1.4 source code for sparkR with maven
You need to add -Psparkr to build the SparkR code.

Shivaram

On Fri, Jul 3, 2015 at 2:14 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
> Did you try:
>
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>
> Thanks
> Best Regards
>
> On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com wrote:
>> Hi all,
>> Has anyone built the Spark 1.4 source code for SparkR with Maven/sbt?
>> What's the command? To use SparkR, must it be built from the 1.4 source?
>> thank you
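Putting the two replies together, the full build invocation would presumably be Akhil's command with the -Psparkr profile added (flags as used for Spark 1.4-era builds; adjust the Hadoop profile/version for your environment):

```shell
# Build Spark with YARN and Hadoop 2.4 support, plus the SparkR package
# (-Psparkr is what triggers the R package build).
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Psparkr -DskipTests clean package
```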
Re: sparkR could not find function textFile
You can check my comment below the answer at http://stackoverflow.com/a/30959388/4577954. BTW we added a new option to sparkR.init to pass in packages and that should be a part of 1.5.

Shivaram

On Wed, Jul 1, 2015 at 10:03 AM, Sourav Mazumder sourav.mazumde...@gmail.com wrote:
Hi,

Piggybacking on this discussion. I'm trying to achieve the same, reading a csv file, from RStudio. Where I'm stuck is how to supply some additional package from RStudio to sparkR.init(), as sparkR.init() does not provide an option to specify an additional package.

I tried the following code from RStudio. It is giving me the error "Error in callJMethod(sqlContext, "load", source, options) : Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed."

Sys.setenv(SPARK_HOME="C:\\spark-1.4.0-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.stop()
sc <- sparkR.init(master="local[2]",
                  sparkEnvir=list(spark.executor.memory="1G"),
                  sparkJars="C:\\spark-1.4.0-bin-hadoop2.6\\lib\\spark-csv_2.11-1.1.0.jar")
# I have downloaded this spark-csv jar and kept it in the lib folder of Spark
sqlContext <- sparkRSQL.init(sc)
plutoMN <- read.df(sqlContext, "C:\\Users\\Sourav\\Work\\SparkDataScience\\PlutoMN.csv",
                   source = "com.databricks.spark.csv")

However, I also tried launching from the shell as 'sparkR --packages com.databricks:spark-csv_2.11:1.1.0'. This time I used the following code and it works all fine:

sqlContext <- sparkRSQL.init(sc)
plutoMN <- read.df(sqlContext, "C:\\Users\\Sourav\\Work\\SparkDataScience\\PlutoMN.csv",
                   source = "com.databricks.spark.csv")

Any idea how to achieve the same from RStudio?

Regards,

On Thu, Jun 25, 2015 at 2:38 PM, Wei Zhou zhweisop...@gmail.com wrote:
I tried out the solution using the spark-csv package, and it worked fine now :) Thanks. Yes, I'm playing with a file with all columns as String, but the real data I want to process are all doubles.
I'm just exploring what sparkR can do versus regular Scala Spark, as I am at heart an R person.

2015-06-25 14:26 GMT-07:00 Eskilson, Aleksander alek.eskil...@cerner.com:
Sure, I had a similar question that Shivaram was able to answer quickly for me; the solution is implemented using a separate Databricks library. Check out this thread from the email archives [1], and the read.df() command [2]. CSV files can be a bit tricky, especially with inferring their schemas. Are you using just strings as your column types right now?

Alek

[1] -- http://apache-spark-developers-list.1001551.n3.nabble.com/CSV-Support-in-SparkR-td12559.html
[2] -- https://spark.apache.org/docs/latest/api/R/read.df.html

From: Wei Zhou zhweisop...@gmail.com
Date: Thursday, June 25, 2015 at 4:15 PM
To: shiva...@eecs.berkeley.edu
Cc: Aleksander Eskilson alek.eskil...@cerner.com, user@spark.apache.org
Subject: Re: sparkR could not find function textFile

Thanks to both Shivaram and Alek. Then if I want to create a DataFrame from comma separated flat files, what would you recommend me to do? One way I can think of is first reading the data as you would do in R, using read.table(), and then creating a Spark DataFrame out of that R dataframe, but it is obviously not scalable.

2015-06-25 13:59 GMT-07:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu:
The `head` function is not supported for the RRDD that is returned by `textFile`. You can run `take(lines, 5L)`. I should add a warning here that the RDD API in SparkR is private because we might not support it in the upcoming releases. So if you can use the DataFrame API for your application you should try that out.

Thanks
Shivaram

On Thu, Jun 25, 2015 at 1:49 PM, Wei Zhou zhweisop...@gmail.com wrote:
Hi Alek,

Just a follow up question.
This is what I did in the sparkR shell:
lines <- SparkR:::textFile(sc, "./README.md")
head(lines)
And I am getting the error: Error in x[seq_len(n)] : object of type 'S4' is not subsettable I'm wondering what I did wrong. Thanks in advance. Wei 2015-06-25 13:44 GMT-07:00 Wei Zhou zhweisop...@gmail.com: Hi Alek, Thanks for the explanation, it is very helpful. Cheers, Wei 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander alek.eskil...@cerner.com: Hi there, The tutorial you’re reading there was written before the merge of SparkR for Spark 1.4.0. For the merge, the RDD API (which includes the textFile() function) was made private, as the devs felt many of its functions were too low level. They focused instead on finishing the DataFrame API, which supports local, HDFS, and Hive/HBase file reads. In the meantime, the devs are trying to determine which functions of the RDD API, if any, should be made public again. You can see the rationale behind this decision on the issue’s JIRA [1]. You can still make use of those now-private RDD functions by prepending the function call with the SparkR private
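Putting the pieces of this thread together, a minimal sketch of reading a CSV into a SparkR DataFrame from RStudio could look like the following. It assumes Spark 1.5+, where sparkR.init() gained an option (assumed here to be named sparkPackages) for passing Maven coordinates, as Shivaram mentions above; on 1.4 the --packages flag to the sparkR shell, or the sparkJars route Sourav used, is needed instead. The paths and package version are examples only.

```r
# Sketch, not a definitive recipe: load SparkR from the Spark installation
# and pull in the Databricks spark-csv package at init time.
Sys.setenv(SPARK_HOME = "/opt/spark")  # example install path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

# sparkPackages is assumed to be the Spark >= 1.5 argument name;
# on 1.4, launch via `sparkR --packages com.databricks:spark-csv_2.11:1.1.0`
sc <- sparkR.init(master = "local[2]",
                  sparkPackages = "com.databricks:spark-csv_2.11:1.1.0")
sqlContext <- sparkRSQL.init(sc)

# Read through the external data source; spark-csv takes options as strings
df <- read.df(sqlContext, "data/plutoMN.csv",
              source = "com.databricks.spark.csv", header = "true")
printSchema(df)
```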
Re: Calling MLLib from SparkR
The 1.4 release does not support calling MLLib from SparkR. We are working on it as a part of https://issues.apache.org/jira/browse/SPARK-6805 On Wed, Jul 1, 2015 at 4:23 PM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi, Does Spark 1.4 support calling MLLib directly from SparkR ? If not, is there any work around, any example available somewhere ? Regards, Sourav
Re: Spark 1.4.0 - Using SparkR on EC2 Instance
The API exported in the 1.4 release is different from the one used in the 2014 demo. Please see the latest documentation at http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html or Chris's demo from Spark Summit at https://spark-summit.org/2015/events/a-data-frame-abstraction-layer-for-sparkr/ Thanks Shivaram On Tue, Jun 30, 2015 at 7:40 AM, Nicholas Sharkey nicholasshar...@gmail.com wrote: Good morning Shivaram, I believe I have our setup close but I'm getting an error on the last step of the word count example from the Spark Summit https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf . Off the top of your head can you think of where this error (below, and attached) is coming from? I can get into the details of how I set up this machine if needed, but wanted to keep the initial question short. Thanks.
Begin Code
library(SparkR)
# sc <- sparkR.init("local[2]")
sc <- sparkR.init("http://ec2-54-171-173-195.eu-west-1.compute.amazonaws.com:[2]")
lines <- textFile(sc, "mytextfile.txt") # hi hi all all all one one one one
words <- flatMap(lines, function(line){ strsplit(line, " ")[[1]] })
wordcount <- lapply(words, function(word){ list(word, 1) })
counts <- reduceByKey(wordcount, "+", numPartitions=2)
# Error in (function (classes, fdef, mtable) :
# unable to find an inherited method for function ‘reduceByKey’ for signature ‘PipelinedRDD, character, numeric’
End Code
On Fri, Jun 26, 2015 at 7:04 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: My workflow was to install RStudio on a cluster launched using the Spark EC2 scripts. However I did a bunch of tweaking after that (like copying the spark installation over etc.). When I get some time I'll try to write the steps down in the JIRA. Thanks Shivaram On Fri, Jun 26, 2015 at 10:21 AM, m...@redoakstrategic.com wrote: So you created an EC2 instance with RStudio installed first, then installed Spark under that same username?
That makes sense, I just want to verify your work flow. Thank you again for your willingness to help! On Fri, Jun 26, 2015 at 10:13 AM -0700, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I was using RStudio on the master node of the same cluster in the demo. However I had installed Spark under the user `rstudio` (i.e. /home/rstudio) and that will make the permissions work correctly. You will need to copy the config files from /root/spark/conf after installing Spark though and it might need some more manual tweaks. Thanks Shivaram On Fri, Jun 26, 2015 at 9:59 AM, Mark Stephenson m...@redoakstrategic.com wrote: Thanks! In your demo video, were you using RStudio to hit a separate EC2 Spark cluster? I noticed that it appeared your browser that you were using EC2 at that time, so I was just curious. It appears that might be one of the possible workarounds - fire up a separate EC2 instance with RStudio Server that initializes the spark context against a separate Spark cluster. On Jun 26, 2015, at 11:46 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: We don't have a documented way to use RStudio on EC2 right now. We have a ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss work-arounds and potential solutions for this. Thanks Shivaram On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark m...@redoakstrategic.com wrote: Good morning, I am having a bit of trouble finalizing the installation and usage of the newest Spark version 1.4.0, deploying to an Amazon EC2 instance and using RStudio to run on top of it. Using these instructions ( http://spark.apache.org/docs/latest/ec2-scripts.html http://spark.apache.org/docs/latest/ec2-scripts.html ) we can fire up an EC2 instance (which we have been successful doing - we have gotten the cluster to launch from the command line without an issue). Then, I installed RStudio Server on the same EC2 instance (the master) and successfully logged into it (using the test/test user) through the web browser. 
This is where I get stuck - within RStudio, when I try to reference/find the folder that SparkR was installed, to load the SparkR library and initialize a SparkContext, I get permissions errors on the folders, or the library cannot be found because I cannot find the folder in which the library is sitting. Has anyone successfully launched and utilized SparkR 1.4.0 in this way, with RStudio Server running on top of the master instance? Are we on the right track, or should we manually launch a cluster and attempt to connect to it from another instance running R? Thank you in advance! Mark -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-0-Using-SparkR-on-EC2-Instance-tp23506.html Sent from
Re: [SparkR] Missing Spark APIs in R
The Spark JIRA and the user, dev mailing lists are the best place to follow the progress. Shivaram On Tue, Jun 30, 2015 at 9:52 AM, Pradeep Bashyal prad...@bashyal.com wrote: Thanks Shivaram. I watched your talk and the plan to use ML APIs with R flavor looks exciting. Is there a different venue where I would be able to follow the SparkR API progress? Thanks Pradeep On Mon, Jun 29, 2015 at 1:12 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: The RDD API is pretty complex and we are not yet sure we want to export all those methods in the SparkR API. We are working towards exposing a more limited API in upcoming versions. You can find some more details in the recent Spark Summit talk at https://spark-summit.org/2015/events/sparkr-the-past-the-present-and-the-future/ Thanks Shivaram On Mon, Jun 29, 2015 at 9:40 AM, Pradeep Bashyal prad...@bashyal.com wrote: Hello, I noticed that some of the spark-core APIs are not available with version 1.4.0 release of SparkR. For example textFile(), flatMap() etc. The code seems to be there but is not exported in NAMESPACE. They were all available as part of the AmpLab Extras previously. I wasn't able to find any explanations of why they were not included with the release. Can anyone shed some light on it? Thanks Pradeep
Re: Spark 1.4.0 - Using SparkR on EC2 Instance
Are you using the SparkR from the latest Spark 1.4 release? The function was not available in the older AMPLab version. Shivaram On Tue, Jun 30, 2015 at 1:43 PM, Nicholas Sharkey nicholasshar...@gmail.com wrote: Any idea why I can't get the sparkRSQL.init function to work? The other parts of SparkR seem like they're working fine. And yes, the SparkR library is loaded. Thanks.
sc <- sparkR.init(master="http://ec2-52-18-1-4.eu-west-1.compute.amazonaws.com")
...
sqlContext <- sparkRSQL.init(sc)
Error: could not find function "sparkRSQL.init"
On Tue, Jun 30, 2015 at 10:56 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: The API exported in the 1.4 release is different from the one used in the 2014 demo. Please see the latest documentation at http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html or Chris's demo from Spark Summit at https://spark-summit.org/2015/events/a-data-frame-abstraction-layer-for-sparkr/ Thanks Shivaram On Tue, Jun 30, 2015 at 7:40 AM, Nicholas Sharkey nicholasshar...@gmail.com wrote: Good morning Shivaram, I believe I have our setup close but I'm getting an error on the last step of the word count example from the Spark Summit https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf . Off the top of your head can you think of where this error (below, and attached) is coming from? I can get into the details of how I set up this machine if needed, but wanted to keep the initial question short. Thanks.
Begin Code
library(SparkR)
# sc <- sparkR.init("local[2]")
sc <- sparkR.init("http://ec2-54-171-173-195.eu-west-1.compute.amazonaws.com:[2]")
lines <- textFile(sc, "mytextfile.txt") # hi hi all all all one one one one
words <- flatMap(lines, function(line){ strsplit(line, " ")[[1]] })
wordcount <- lapply(words, function(word){ list(word, 1) })
counts <- reduceByKey(wordcount, "+", numPartitions=2)
# Error in (function (classes, fdef, mtable) :
# unable to find an inherited method for function ‘reduceByKey’ for signature ‘PipelinedRDD, character, numeric’
End Code
On Fri, Jun 26, 2015 at 7:04 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: My workflow was to install RStudio on a cluster launched using the Spark EC2 scripts. However I did a bunch of tweaking after that (like copying the spark installation over etc.). When I get some time I'll try to write the steps down in the JIRA. Thanks Shivaram On Fri, Jun 26, 2015 at 10:21 AM, m...@redoakstrategic.com wrote: So you created an EC2 instance with RStudio installed first, then installed Spark under that same username? That makes sense, I just want to verify your workflow. Thank you again for your willingness to help! On Fri, Jun 26, 2015 at 10:13 AM -0700, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I was using RStudio on the master node of the same cluster in the demo. However I had installed Spark under the user `rstudio` (i.e. /home/rstudio) and that will make the permissions work correctly. You will need to copy the config files from /root/spark/conf after installing Spark though and it might need some more manual tweaks. Thanks Shivaram On Fri, Jun 26, 2015 at 9:59 AM, Mark Stephenson m...@redoakstrategic.com wrote: Thanks! In your demo video, were you using RStudio to hit a separate EC2 Spark cluster? I noticed from your browser that you appeared to be using EC2 at the time, so I was just curious.
It appears that might be one of the possible workarounds - fire up a separate EC2 instance with RStudio Server that initializes the spark context against a separate Spark cluster. On Jun 26, 2015, at 11:46 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: We don't have a documented way to use RStudio on EC2 right now. We have a ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss work-arounds and potential solutions for this. Thanks Shivaram On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark m...@redoakstrategic.com wrote: Good morning, I am having a bit of trouble finalizing the installation and usage of the newest Spark version 1.4.0, deploying to an Amazon EC2 instance and using RStudio to run on top of it. Using these instructions ( http://spark.apache.org/docs/latest/ec2-scripts.html http://spark.apache.org/docs/latest/ec2-scripts.html ) we can fire up an EC2 instance (which we have been successful doing - we have gotten the cluster to launch from the command line without an issue). Then, I installed RStudio Server on the same EC2 instance (the master) and successfully logged into it (using the test/test user) through the web browser. This is where I get stuck - within RStudio, when I try to reference/find the folder
Re: [SparkR] Missing Spark APIs in R
The RDD API is pretty complex and we are not yet sure we want to export all those methods in the SparkR API. We are working towards exposing a more limited API in upcoming versions. You can find some more details in the recent Spark Summit talk at https://spark-summit.org/2015/events/sparkr-the-past-the-present-and-the-future/ Thanks Shivaram On Mon, Jun 29, 2015 at 9:40 AM, Pradeep Bashyal prad...@bashyal.com wrote: Hello, I noticed that some of the spark-core APIs are not available with version 1.4.0 release of SparkR. For example textFile(), flatMap() etc. The code seems to be there but is not exported in NAMESPACE. They were all available as part of the AmpLab Extras previously. I wasn't able to find any explanations of why they were not included with the release. Can anyone shed some light on it? Thanks Pradeep
Re: Spark 1.4.0 - Using SparkR on EC2 Instance
Thanks Mark for the update. For those interested, Vincent Warmerdam also has some details on making the /root/spark installation work at https://issues.apache.org/jira/browse/SPARK-8596?focusedCommentId=14604328&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14604328 Shivaram On Sat, Jun 27, 2015 at 12:23 PM, RedOakMark m...@redoakstrategic.com wrote: For anyone monitoring the thread, I was able to successfully install and run a small Spark cluster and model using this method: First, make sure that the username being used to log in to RStudio Server is the one that was used to install Spark on the EC2 instance. Thanks to Shivaram for his help here. Login to RStudio and ensure that these references are used - set the library location to the folder where Spark is installed. In my case, ~/home/rstudio/spark.
# This line loads SparkR (the R package) from the installed directory
library(SparkR, lib.loc="./spark/R/lib")
The edits to this line were important, so that Spark knew where the install folder was located when initializing the cluster.
# Initialize the Spark local cluster in R, as ‘sc’
sc <- sparkR.init("local[2]", "SparkR", "./spark")
From here, we ran a basic model using Spark, from RStudio, which ran successfully. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-0-Using-SparkR-on-EC2-Instance-tp23506p23514.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark 1.4.0 - Using SparkR on EC2 Instance
We don't have a documented way to use RStudio on EC2 right now. We have a ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss work-arounds and potential solutions for this. Thanks Shivaram On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark m...@redoakstrategic.com wrote: Good morning, I am having a bit of trouble finalizing the installation and usage of the newest Spark version 1.4.0, deploying to an Amazon EC2 instance and using RStudio to run on top of it. Using these instructions ( http://spark.apache.org/docs/latest/ec2-scripts.html http://spark.apache.org/docs/latest/ec2-scripts.html ) we can fire up an EC2 instance (which we have been successful doing - we have gotten the cluster to launch from the command line without an issue). Then, I installed RStudio Server on the same EC2 instance (the master) and successfully logged into it (using the test/test user) through the web browser. This is where I get stuck - within RStudio, when I try to reference/find the folder that SparkR was installed, to load the SparkR library and initialize a SparkContext, I get permissions errors on the folders, or the library cannot be found because I cannot find the folder in which the library is sitting. Has anyone successfully launched and utilized SparkR 1.4.0 in this way, with RStudio Server running on top of the master instance? Are we on the right track, or should we manually launch a cluster and attempt to connect to it from another instance running R? Thank you in advance! Mark -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-0-Using-SparkR-on-EC2-Instance-tp23506.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: sparkR could not find function textFile
The `head` function is not supported for the RRDD that is returned by `textFile`. You can run `take(lines, 5L)`. I should add a warning here that the RDD API in SparkR is private because we might not support it in the upcoming releases. So if you can use the DataFrame API for your application you should try that out. Thanks Shivaram On Thu, Jun 25, 2015 at 1:49 PM, Wei Zhou zhweisop...@gmail.com wrote: Hi Alek, Just a follow up question. This is what I did in the sparkR shell:
lines <- SparkR:::textFile(sc, "./README.md")
head(lines)
And I am getting the error: Error in x[seq_len(n)] : object of type 'S4' is not subsettable I'm wondering what I did wrong. Thanks in advance. Wei 2015-06-25 13:44 GMT-07:00 Wei Zhou zhweisop...@gmail.com: Hi Alek, Thanks for the explanation, it is very helpful. Cheers, Wei 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander alek.eskil...@cerner.com: Hi there, The tutorial you’re reading there was written before the merge of SparkR for Spark 1.4.0. For the merge, the RDD API (which includes the textFile() function) was made private, as the devs felt many of its functions were too low level. They focused instead on finishing the DataFrame API, which supports local, HDFS, and Hive/HBase file reads. In the meantime, the devs are trying to determine which functions of the RDD API, if any, should be made public again. You can see the rationale behind this decision on the issue’s JIRA [1]. You can still make use of those now-private RDD functions by prepending the function call with the SparkR private namespace, for example, you’d use SparkR:::textFile(…).
Hope that helps, Alek [1] -- https://issues.apache.org/jira/browse/SPARK-7230 From: Wei Zhou zhweisop...@gmail.com Date: Thursday, June 25, 2015 at 3:33 PM To: user@spark.apache.org user@spark.apache.org Subject: sparkR could not find function textFile Hi all, I am exploring sparkR by activating the shell and following the tutorial here https://amplab-extras.github.io/SparkR-pkg/ And when I tried to read in a local file with textFile(sc, file_location), it gives an error: could not find function "textFile". By reading through the SparkR doc for 1.4, it seems that we need a sqlContext to import data, for example:
people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json")
And we need to specify the file type. My question is: does sparkR stop supporting general-type file importing? If not, would appreciate any help on how to do this. PS, I am trying to recreate the word count example in sparkR, and want to import the README.md file, or just any file, into sparkR. Thanks in advance. Best, Wei CONFIDENTIALITY NOTICE This message and any included attachments are from Cerner Corporation and are intended only for the addressee. The information contained in this message is confidential and may constitute inside or non-public information under international, federal, or state securities laws. Unauthorized forwarding, printing, copying, distribution, or use of such information is strictly prohibited and may be unlawful. If you are not the addressee, please promptly delete this message and notify the sender of the delivery error by e-mail or you may call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024.
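Combining Shivaram's `take` suggestion with the `:::` workaround described above, a word-count sketch against the private 1.4 RDD API might look like this. It is illustrative only: it assumes an existing SparkR context `sc`, and the unexported functions are reached through the private namespace, so they may change or disappear in later releases. Note the integer literal `2L` for numPartitions: the `reduceByKey` dispatch error quoted elsewhere in this archive ('PipelinedRDD, character, numeric') suggests a plain numeric there will not match the method signature.

```r
# Sketch, assuming a SparkR 1.4 context `sc` already exists.
# All RDD functions below are private, hence the ::: prefix.
lines  <- SparkR:::textFile(sc, "./README.md")
words  <- SparkR:::flatMap(lines, function(line) strsplit(line, " ")[[1]])
pairs  <- SparkR:::lapply(words, function(word) list(word, 1L))
counts <- SparkR:::reduceByKey(pairs, "+", 2L)  # 2L, not 2: an integer is expected
SparkR:::take(counts, 5L)  # head() is not supported on RRDDs; take() is
```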
Re: sparkR could not find function textFile
You can use the Spark CSV reader to read in flat CSV files to a data frame. See https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85 for an example Shivaram On Thu, Jun 25, 2015 at 2:15 PM, Wei Zhou zhweisop...@gmail.com wrote: Thanks to both Shivaram and Alek. Then if I want to create a DataFrame from comma-separated flat files, what would you recommend me to do? One way I can think of is first reading the data as you would do in R, using read.table(), and then creating a Spark DataFrame out of that R data frame, but it is obviously not scalable. 2015-06-25 13:59 GMT-07:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu: The `head` function is not supported for the RRDD that is returned by `textFile`. You can run `take(lines, 5L)`. I should add a warning here that the RDD API in SparkR is private because we might not support it in the upcoming releases. So if you can use the DataFrame API for your application you should try that out. Thanks Shivaram On Thu, Jun 25, 2015 at 1:49 PM, Wei Zhou zhweisop...@gmail.com wrote: Hi Alek, Just a follow up question. This is what I did in the sparkR shell:
lines <- SparkR:::textFile(sc, "./README.md")
head(lines)
And I am getting the error: Error in x[seq_len(n)] : object of type 'S4' is not subsettable I'm wondering what I did wrong. Thanks in advance. Wei 2015-06-25 13:44 GMT-07:00 Wei Zhou zhweisop...@gmail.com: Hi Alek, Thanks for the explanation, it is very helpful. Cheers, Wei 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander alek.eskil...@cerner.com: Hi there, The tutorial you’re reading there was written before the merge of SparkR for Spark 1.4.0. For the merge, the RDD API (which includes the textFile() function) was made private, as the devs felt many of its functions were too low level. They focused instead on finishing the DataFrame API, which supports local, HDFS, and Hive/HBase file reads. In the meantime, the devs are trying to determine which functions of the RDD API, if any, should be made public again.
You can see the rationale behind this decision on the issue’s JIRA [1]. You can still make use of those now-private RDD functions by prepending the function call with the SparkR private namespace, for example, you’d use SparkR:::textFile(…). Hope that helps, Alek [1] -- https://issues.apache.org/jira/browse/SPARK-7230 From: Wei Zhou zhweisop...@gmail.com Date: Thursday, June 25, 2015 at 3:33 PM To: user@spark.apache.org user@spark.apache.org Subject: sparkR could not find function textFile Hi all, I am exploring sparkR by activating the shell and following the tutorial here https://amplab-extras.github.io/SparkR-pkg/ And when I tried to read in a local file with textFile(sc, file_location), it gives an error: could not find function "textFile". By reading through the SparkR doc for 1.4, it seems that we need a sqlContext to import data, for example:
people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json")
And we need to specify the file type. My question is: does sparkR stop supporting general-type file importing? If not, would appreciate any help on how to do this. PS, I am trying to recreate the word count example in sparkR, and want to import the README.md file, or just any file, into sparkR. Thanks in advance. Best, Wei
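For reference, the spark-csv route Shivaram points to in his gist boils down to launching the shell with the package on the classpath and then calling read.df with the full data source name. A minimal sketch follows; the file path, package version, and options are illustrative examples, not taken from the gist.

```r
# Launch the shell with the package first, e.g.:
#   ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
# Then, inside SparkR (assumes `sc` was created by the shell):
sqlContext <- sparkRSQL.init(sc)

# spark-csv reads the flat file in parallel on the cluster, unlike read.table()
flights <- read.df(sqlContext, "./flights.csv",
                   source = "com.databricks.spark.csv",
                   header = "true")
head(flights)
```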
Re: mllib from sparkR
Not yet - We are working on it as a part of https://issues.apache.org/jira/browse/SPARK-6805 and you can follow the JIRA for more information On Wed, Jun 24, 2015 at 2:30 AM, escardovi escard...@bitbang.com wrote: Hi, I was wondering if it is possible to use MLlib functions inside SparkR, as outlined at the Spark Summit East 2015 Warmup meetup: http://www.meetup.com/Spark-NYC/events/220850389/ Are there available examples? Thank you! Elena -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mllib-from-sparkR-tp23466.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark1.4 sparkR usage
The Apache Spark API docs for SparkR https://spark.apache.org/docs/1.4.0/api/R/index.html represent what has been released with Spark 1.4. The AMPLab version is no longer under active development and I'd recommend users to use the version in the Apache project. Thanks Shivaram On Thu, Jun 25, 2015 at 2:16 AM, Jean-Charles RISCH risch.jeanchar...@gmail.com wrote: Thank you. But it's a bit scary because when I compare the official API (https://spark.apache.org/docs/1.4.0/api/R/index.html) and the AMPLab API (https://amplab-extras.github.io/SparkR-pkg/rdocs/1.2/index.html), they look very different. JC 2015-06-25 11:10 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com: It won't change too much, it will get you started. Further details you can read from the official website itself https://spark.apache.org/docs/latest/sparkr.html Thanks Best Regards On Thu, Jun 25, 2015 at 2:38 PM, Jean-Charles RISCH risch.jeanchar...@gmail.com wrote: Hello, Is this the official R package? It is written: *NOTE: The API from the upcoming Spark release (1.4) will not have the same API as described here.* Thanks, JC 2015-06-25 10:55 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com: Here you go https://amplab-extras.github.io/SparkR-pkg/ Thanks Best Regards On Thu, Jun 25, 2015 at 12:39 PM, 1106944...@qq.com 1106944...@qq.com wrote: Hi all, I have installed Spark 1.4 and want to use SparkR. Assume the Spark master ip = node1; how do I start sparkR and submit a job to the Spark cluster? Can anyone help me, or point me to a blog/doc? Thank you very much -- 1106944...@qq.com
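For the original question of starting SparkR against an existing cluster, a sketch under two assumptions: Spark 1.4 is installed at $SPARK_HOME, and a standalone master runs on node1 at the default port 7077 (both are assumptions, not stated in the thread).

```shell
# Start the interactive SparkR shell against the standalone master
$SPARK_HOME/bin/sparkR --master spark://node1:7077

# Or, from an existing R session instead of the shell:
#   sc <- sparkR.init(master = "spark://node1:7077")
```

In interactive use there is no separate submission step: SparkR operations run in that session are executed on the cluster the context was initialized against.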
Re: How to Map and Reduce in sparkR
In addition to Aleksander's point, please let us know what use case would use an RDD-like API in https://issues.apache.org/jira/browse/SPARK-7264 -- We are hoping to have a version of this API in upcoming releases. Thanks Shivaram On Thu, Jun 25, 2015 at 6:02 AM, Eskilson,Aleksander alek.eskil...@cerner.com wrote: The simple answer is that SparkR does support map/reduce operations over RDD’s through the RDD API, but since Spark v 1.4.0, those functions were made private in SparkR. They can still be accessed by prepending the function with the namespace, like SparkR:::lapply(rdd, func). It was thought though that many of the functions in the RDD API were too low level to expose, with much more of the focus going into the DataFrame API. The original rationale for this decision can be found in its JIRA [1]. The devs are still deciding which functions of the RDD API, if any, should be made public for future releases. If you feel some use cases are most easily handled in SparkR through RDD functions, go ahead and let the dev email list know. Alek [1] -- https://issues.apache.org/jira/browse/SPARK-7230 From: Wei Zhou zhweisop...@gmail.com Date: Wednesday, June 24, 2015 at 4:59 PM To: user@spark.apache.org user@spark.apache.org Subject: How to Map and Reduce in sparkR Does anyone know whether sparkR supports map and reduce operations as the RDD transformations? Thanks in advance. Best, Wei
Re: SparkR 1.4.0: read.df() function fails
The error you are running into is that the input file does not exist -- You can see it from the following line: Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json Thanks Shivaram On Tue, Jun 16, 2015 at 1:55 AM, esten erik.stens...@dnvgl.com wrote: Hi, In the SparkR shell, I invoke:
mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", header="false")
I have tried various filetypes (csv, txt), all fail. RESPONSE: ERROR RBackendHandler: load on 1 failed BELOW THE WHOLE RESPONSE: 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with curMem=0, maxMem=278302556 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 173.4 KB, free 265.2 MB) 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with curMem=177600, maxMem=278302556 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.2 KB, free 265.2 MB) 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at NativeMethodAccessorImpl.java:-2 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
    at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
    at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
Re: How to read avro in SparkR
Yep - Burak's answer should work. FWIW the error message from the stack trace that shows this is the line

Failed to load class for data source: avro

Thanks
Shivaram

On Sat, Jun 13, 2015 at 6:13 PM, Burak Yavuz brk...@gmail.com wrote:
Hi, Not sure if this is it, but could you please try "com.databricks.spark.avro" instead of just "avro"? Thanks, Burak

On Jun 13, 2015 9:55 AM, Shing Hing Man mat...@yahoo.com.invalid wrote:
Hi, I am trying to read an Avro file in SparkR (in Spark 1.4.0). I started R using the following:

matmsh@gauss:~$ sparkR --packages com.databricks:spark-avro_2.10:1.0.0

Inside the R shell, when I issue the following,

read.df(sqlContext, "file:///home/matmsh/myfile.avro", "avro")

I get the following exception:

Caused by: java.lang.RuntimeException: Failed to load class for data source: avro

Below is the stack trace.

matmsh@gauss:~$ sparkR --packages com.databricks:spark-avro_2.10:1.0.0

R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-suse-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
Launching java with spark-submit command /home/matmsh/installed/spark/bin/spark-submit --packages com.databricks:spark-avro_2.10:1.0.0 sparkr-shell /tmp/RtmpoT7FrF/backend_port464e1e2fb16a
Ivy Default Cache set to: /home/matmsh/.ivy2/cache
The jars for the packages stored in: /home/matmsh/.ivy2/jars
:: loading settings :: url = jar:file:/home/matmsh/installed/sparks/spark-1.4.0-bin-hadoop2.3/lib/spark-assembly-1.4.0-hadoop2.3.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-avro_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.databricks#spark-avro_2.10;1.0.0 in list
    found org.apache.avro#avro;1.7.6 in local-m2-cache
    found org.codehaus.jackson#jackson-core-asl;1.9.13 in list
    found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in list
    found com.thoughtworks.paranamer#paranamer;2.3 in list
    found org.xerial.snappy#snappy-java;1.0.5 in list
    found org.apache.commons#commons-compress;1.4.1 in list
    found org.tukaani#xz;1.0 in list
    found org.slf4j#slf4j-api;1.6.4 in list
:: resolution report :: resolve 421ms :: artifacts dl 16ms
    :: modules in use:
    com.databricks#spark-avro_2.10;1.0.0 from list in [default]
    com.thoughtworks.paranamer#paranamer;2.3 from list in [default]
    org.apache.avro#avro;1.7.6 from local-m2-cache in [default]
    org.apache.commons#commons-compress;1.4.1 from list in [default]
    org.codehaus.jackson#jackson-core-asl;1.9.13 from list in [default]
    org.codehaus.jackson#jackson-mapper-asl;1.9.13 from list in [default]
    org.slf4j#slf4j-api;1.6.4 from list in [default]
    org.tukaani#xz;1.0 from list in [default]
    org.xerial.snappy#snappy-java;1.0.5 from list in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   9   |   0   |   0   |   0   ||   9   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 9 already retrieved (0kB/9ms)
15/06/13 17:37:42 INFO spark.SparkContext: Running Spark version 1.4.0
15/06/13 17:37:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/13 17:37:42 WARN util.Utils: Your hostname, gauss resolves to a loopback address: 127.0.0.1; using 192.168.0.10 instead (on interface enp3s0)
15/06/13 17:37:42 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/06/13 17:37:42 INFO spark.SecurityManager: Changing view acls to: matmsh
15/06/13 17:37:42 INFO spark.SecurityManager: Changing modify acls to: matmsh
15/06/13 17:37:42 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(matmsh); users with modify permissions: Set(matmsh)
15/06/13 17:37:43 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/06/13 17:37:43 INFO Remoting: Starting remoting
15/06/13 17:37:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.0.10:46219]
15/06/13 17:37:43 INFO util.Utils: Successfully started service 'sparkDriver' on
Re: includePackage() deprecated?
Yeah - we don't have support for running UDFs on DataFrames yet. There is an open issue to track this: https://issues.apache.org/jira/browse/SPARK-6817

Thanks
Shivaram

On Thu, Jun 4, 2015 at 3:10 AM, Daniel Emaasit daniel.emaa...@gmail.com wrote:
Hello Shivaram, Was the includePackage() function deprecated in SparkR 1.4.0? I don't see it in the documentation. If it was, does that mean that we can use R packages on Spark DataFrames the usual way we do for local R data frames?

Daniel
--
Daniel Emaasit
Ph.D. Research Assistant
Transportation Research Center (TRC)
University of Nevada, Las Vegas
Las Vegas, NV 89154-4015
Cell: 615-649-2489
www.danielemaasit.com
[ANNOUNCE] YARN support in Spark EC2
Hi all,

We recently merged support for launching YARN clusters using the Spark EC2 scripts as a part of https://issues.apache.org/jira/browse/SPARK-3674. To use this you can pass in --hadoop-major-version=yarn to the spark-ec2 script, and this will set up Hadoop 2.4 HDFS, YARN, and Spark built for YARN on the EC2 cluster. Developers who work on features related to YARN might find this useful for testing / benchmarking Spark with YARN. If anyone has questions or feedback please let me know.

Thanks
Shivaram
Re: SparkR Jobs Hanging in collectPartitions
For jobs with R UDFs (i.e. when we use the RDD API from SparkR) we use R on both the driver side and on the worker side. So in this case when the `flatMap` operation is run, the data is sent from the JVM to an R process on the worker, which in turn executes the `gsub` function. Could you turn on INFO logging and send a pointer to the log file? It's pretty clear that the problem is happening in the call to `subtract`, which in turn is doing a shuffle operation, but I am not sure why this should happen.

Thanks
Shivaram

On Fri, May 29, 2015 at 7:56 AM, Eskilson,Aleksander alek.eskil...@cerner.com wrote:
Sure. Looking more closely at the code, I thought I might have had an error in the flow of data structures in the R code; the line that extracts the words from the corpus is now:

words <- distinct(SparkR:::flatMap(corpus, function(line) {
  strsplit(gsub("^\\s+|[[:punct:]]", "", tolower(line)), "\\s")[[1]]
}))

(just removes leading whitespace and all punctuation after having made the whole line lowercase, then splits to a vector of words, ultimately flattening the whole collection.) count() works on the resultant words list, returning the value expected, so the hang most likely occurs during the subtract. I should mention, the size of the corpus is very small, just kB in size. The dictionary I subtract against is also quite modest by Spark standards, just 4.8 MB, and I've got 2G memory for the worker, which ought to be sufficient for such a small job. The Scala analog runs quite fast, even with the subtract. If we look at the DAG for the SparkR job and compare that against the event timeline for Stage 3, it seems the job is stuck in Scheduler Delay (in 0/2 tasks completed) and never begins the rest of the stage. Unfortunately, the executor log hangs up as well, and doesn't give much info. Could you describe in a little more detail at what points data is actually held in R's internal process memory?
I was under the impression that SparkR:::textFile created an RDD object that would only be realized when a DAG requiring it was executed, and would therefore be part of the memory managed by Spark, and that memory would only be moved to R as an R object following a collect(), take(), etc.

Thanks,
Alek Eskilson

From: Shivaram Venkataraman shiva...@eecs.berkeley.edu
Reply-To: shiva...@eecs.berkeley.edu
Date: Wednesday, May 27, 2015 at 8:26 PM
To: Aleksander Eskilson alek.eskil...@cerner.com
Cc: user@spark.apache.org
Subject: Re: SparkR Jobs Hanging in collectPartitions

Could you try to see which phase is causing the hang? i.e. If you do a count() after flatMap, does that work correctly? My guess is that the hang is somehow related to data not fitting in the R process memory, but it's hard to say without more diagnostic information.

Thanks
Shivaram

On Tue, May 26, 2015 at 7:28 AM, Eskilson,Aleksander alek.eskil...@cerner.com wrote:
I've been attempting to run a SparkR translation of a similar Scala job that identifies words from a corpus not existing in a newline-delimited dictionary. The R code is:

dict <- SparkR:::textFile(sc, src1)
corpus <- SparkR:::textFile(sc, src2)
words <- distinct(SparkR:::flatMap(corpus, function(line) {
  gsub("[[:punct:]]", "", tolower(strsplit(line, " |,|-")[[1]]))
}))
found <- subtract(words, dict)

(where src1, src2 are locations on HDFS.) Then attempting something like take(found, 10) or saveAsTextFile(found, dest) should realize the collection, but that stage of the DAG hangs in Scheduler Delay during the collectPartitions phase. The synonymous Scala code, however,

val corpus = sc.textFile(src1).flatMap(_.split(" |,|-"))
val dict = sc.textFile(src2)
val words = corpus.map(word => word.filter(Character.isLetter(_))).distinct()
val found = words.subtract(dict)

performs as expected. Any thoughts?
Thanks, Alek Eskilson CONFIDENTIALITY NOTICE This message and any included attachments are from Cerner Corporation and are intended only for the addressee. The information contained in this message is confidential and may constitute inside or non-public information under international, federal, or state securities laws. Unauthorized forwarding, printing, copying, distribution, or use of such information is strictly prohibited and may be unlawful. If you are not the addressee, please promptly delete this message and notify the sender of the delivery error by e-mail or you may call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
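As an aside for readers following this thread, the clean/flatten/distinct/subtract dataflow under discussion is easy to sanity-check outside Spark. Below is a minimal pure-Python sketch of the same logic; the function names are illustrative only (not SparkR or Spark APIs), and the regex is a simplified stand-in for the gsub patterns quoted above:

```python
import re

def clean_words(line):
    # Lowercase, strip punctuation, split on whitespace -- mirrors the
    # tolower/gsub/strsplit pipeline in the R UDF.
    cleaned = re.sub(r"[^\w\s]", "", line.lower())
    return [w for w in cleaned.split() if w]

def unknown_words(corpus_lines, dictionary):
    # distinct(flatMap(corpus, ...)) followed by subtract(words, dict).
    words = {w for line in corpus_lines for w in clean_words(line)}
    return words - set(dictionary)

corpus = ["Hello, world!", "hello Spark"]
dictionary = ["hello", "world"]
print(sorted(unknown_words(corpus, dictionary)))  # -> ['spark']
```

Verifying the regexes locally like this separates "my cleaning logic is wrong" from "the shuffle in subtract is hanging", which is the distinction being debugged in this thread.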
Re: SparkR Jobs Hanging in collectPartitions
Could you try to see which phase is causing the hang? i.e. If you do a count() after flatMap, does that work correctly? My guess is that the hang is somehow related to data not fitting in the R process memory, but it's hard to say without more diagnostic information.

Thanks
Shivaram

On Tue, May 26, 2015 at 7:28 AM, Eskilson,Aleksander alek.eskil...@cerner.com wrote:
I've been attempting to run a SparkR translation of a similar Scala job that identifies words from a corpus not existing in a newline-delimited dictionary. The R code is:

dict <- SparkR:::textFile(sc, src1)
corpus <- SparkR:::textFile(sc, src2)
words <- distinct(SparkR:::flatMap(corpus, function(line) {
  gsub("[[:punct:]]", "", tolower(strsplit(line, " |,|-")[[1]]))
}))
found <- subtract(words, dict)

(where src1, src2 are locations on HDFS.) Then attempting something like take(found, 10) or saveAsTextFile(found, dest) should realize the collection, but that stage of the DAG hangs in Scheduler Delay during the collectPartitions phase. The synonymous Scala code, however,

val corpus = sc.textFile(src1).flatMap(_.split(" |,|-"))
val dict = sc.textFile(src2)
val words = corpus.map(word => word.filter(Character.isLetter(_))).distinct()
val found = words.subtract(dict)

performs as expected. Any thoughts? Thanks, Alek Eskilson
Re: Some questions after playing a little with the new ml.Pipeline.
My guess is that the `createDataFrame` call is failing here. Can you check if the schema being passed to it includes the column name and type for the newly zipped `features` column? Joseph probably knows this better, but AFAIK the DenseVector here will need to be marked as a VectorUDT while creating a DataFrame column.

Thanks
Shivaram

On Tue, Mar 31, 2015 at 12:50 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Following your suggestion, I ended up with the following implementation:

override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
  val schema = transformSchema(dataSet.schema, paramMap, logging = true)
  val map = this.paramMap ++ paramMap
  val features = dataSet.select(map(inputCol)).mapPartitions { rows =>
    Caffe.set_mode(Caffe.CPU)
    val net = CaffeUtils.floatTestNetwork(SparkFiles.get(topology), SparkFiles.get(weight))
    val inputBlobs: FloatBlobVector = net.input_blobs()
    val N: Int = 1
    val K: Int = inputBlobs.get(0).channels()
    val H: Int = inputBlobs.get(0).height()
    val W: Int = inputBlobs.get(0).width()
    inputBlobs.get(0).Reshape(N, K, H, W)
    val dataBlob = new FloatPointer(N*K*W*H)
    val inputCPUData = inputBlobs.get(0).mutable_cpu_data()
    val feat = rows.map { case Row(a: Iterable[Float]) =>
      dataBlob.put(a.toArray, 0, a.size)
      caffe_copy_float(N*K*W*H, dataBlob, inputCPUData)
      val resultBlobs: FloatBlobVector = net.ForwardPrefilled()
      val resultDim = resultBlobs.get(0).channels()
      logInfo(s"Output dimension $resultDim")
      val resultBlobData = resultBlobs.get(0).cpu_data()
      val output = new Array[Float](resultDim)
      resultBlobData.get(output)
      Vectors.dense(output.map(_.toDouble))
    }
    // net.deallocate()
    feat
  }
  val newRowData = dataSet.rdd.zip(features).map { case (old, feat) =>
    val oldSeq = old.toSeq
    Row.fromSeq(oldSeq :+ feat)
  }
  dataSet.sqlContext.createDataFrame(newRowData, schema)
}

The idea is to mapPartitions over the underlying RDD of the DataFrame and create a new DataFrame by zipping the results.
It seems to work, but when I try to save the RDD I get the following error:

org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row

On Mon, Mar 30, 2015 at 6:40 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
One workaround could be to convert a DataFrame into a RDD inside the transform function and then use mapPartitions/broadcast to work with the JNI calls and then convert back to RDD.

Thanks
Shivaram

On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Dear all, I'm still struggling to make a pre-trained Caffe model transformer for DataFrames work. The main problem is that creating a Caffe model inside the UDF is very slow and consumes memory. Some of you suggested broadcasting the model. The problem with broadcasting is that I use a JNI interface to Caffe C++ with javacpp-presets and it is not serializable. Another possible approach is to use a UDF that can handle a whole partition instead of just a row in order to minimize the Caffe model instantiation. Are there any ideas to solve one of these two issues? Best, Jao

On Tue, Mar 3, 2015 at 10:04 PM, Joseph Bradley jos...@databricks.com wrote:
I see. I think your best bet is to create the cnnModel on the master and then serialize it to send to the workers. If it's big (1M or so), then you can broadcast it and use the broadcast variable in the UDF. There is not a great way to do something equivalent to mapPartitions with UDFs right now.

On Tue, Mar 3, 2015 at 4:36 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Here is my current implementation with the current master version of Spark:

class DeepCNNFeature extends Transformer with HasInputCol with HasOutputCol ... {

  override def transformSchema(...) { ... }

  override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
    transformSchema(dataSet.schema, paramMap, logging = true)
    val map = this.paramMap ++ paramMap
    val deepCNNFeature = udf((v: Vector) => {
      val cnnModel = new CaffeModel
      cnnModel.transform(v)
    } : Vector)
    dataSet.withColumn(map(outputCol), deepCNNFeature(col(map(inputCol))))
  }
}

where CaffeModel is a Java API to the Caffe C++ model. The problem here is that for every row it will create a new instance of CaffeModel, which is inefficient since creating a new model means loading a large model file, and it will transform only a single row at a time, whereas a Caffe network can process a batch of rows efficiently. In other words, is it possible to create a UDF that can operate on a partition in order to minimize the creation of a CaffeModel and to take advantage of the Caffe network batch processing?
Re: Some questions after playing a little with the new ml.Pipeline.
One workaround could be to convert a DataFrame into a RDD inside the transform function and then use mapPartitions/broadcast to work with the JNI calls and then convert back to RDD.

Thanks
Shivaram

On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Dear all, I'm still struggling to make a pre-trained Caffe model transformer for DataFrames work. The main problem is that creating a Caffe model inside the UDF is very slow and consumes memory. Some of you suggested broadcasting the model. The problem with broadcasting is that I use a JNI interface to Caffe C++ with javacpp-presets and it is not serializable. Another possible approach is to use a UDF that can handle a whole partition instead of just a row in order to minimize the Caffe model instantiation. Are there any ideas to solve one of these two issues? Best, Jao

On Tue, Mar 3, 2015 at 10:04 PM, Joseph Bradley jos...@databricks.com wrote:
I see. I think your best bet is to create the cnnModel on the master and then serialize it to send to the workers. If it's big (1M or so), then you can broadcast it and use the broadcast variable in the UDF. There is not a great way to do something equivalent to mapPartitions with UDFs right now.

On Tue, Mar 3, 2015 at 4:36 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Here is my current implementation with the current master version of Spark:

class DeepCNNFeature extends Transformer with HasInputCol with HasOutputCol ... {

  override def transformSchema(...) { ... }

  override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
    transformSchema(dataSet.schema, paramMap, logging = true)
    val map = this.paramMap ++ paramMap
    val deepCNNFeature = udf((v: Vector) => {
      val cnnModel = new CaffeModel
      cnnModel.transform(v)
    } : Vector)
    dataSet.withColumn(map(outputCol), deepCNNFeature(col(map(inputCol))))
  }
}

where CaffeModel is a Java API to the Caffe C++ model.
The problem here is that for every row it will create a new instance of CaffeModel, which is inefficient since creating a new model means loading a large model file, and it will transform only a single row at a time, whereas a Caffe network can process a batch of rows efficiently. In other words, is it possible to create a UDF that can operate on a partition in order to minimize the creation of a CaffeModel and to take advantage of the Caffe network batch processing?

On Tue, Mar 3, 2015 at 7:26 AM, Joseph Bradley jos...@databricks.com wrote:
I see, thanks for clarifying! I'd recommend following existing implementations in spark.ml transformers. You'll need to define a UDF which operates on a single Row to compute the value for the new column. You can then use the DataFrame DSL to create the new column; the DSL provides a nice syntax for what would otherwise be a SQL statement like "select ... from". I'm recommending looking at the existing implementation (rather than stating it here) because it changes between Spark 1.2 and 1.3. In 1.3, the DSL is much improved and makes it easier to create a new column. Joseph

On Sun, Mar 1, 2015 at 1:26 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

class DeepCNNFeature extends Transformer ... {
  override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = {
    // How can I do a map partition on the underlying RDD and then add the column ?
  }
}

On Sun, Mar 1, 2015 at 10:23 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi Joseph, Thank you for the tips. I understand what I should do when my data are represented as a RDD. The thing that I can't figure out is how to do the same thing when the data is viewed as a DataFrame and I need to add the result of my pretrained model as a new column in the DataFrame. Precisely, I want to implement the following transformer:

class DeepCNNFeature extends Transformer ... {
}

On Sun, Mar 1, 2015 at 1:32 AM, Joseph Bradley jos...@databricks.com wrote:
Hi Jao, You can use external tools and libraries if they can be called from your Spark program or script (with appropriate conversion of data types, etc.). The best way to apply a pre-trained model to a dataset would be to call the model from within a closure, e.g.:

myRDD.map { myDatum => preTrainedModel.predict(myDatum) }

If your data is distributed in an RDD (myRDD), then the above call will distribute the computation of prediction using the pre-trained model. It will require that all of your Spark workers be able to run the preTrainedModel; that may mean installing Caffe and dependencies on all nodes in the compute cluster. For the second question, I would modify the above call as follows:

myRDD.mapPartitions { myDataOnPartition =>
  val myModel = //
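The mapPartitions idiom discussed above — pay the model-loading cost once per partition instead of once per row — is independent of Spark and Caffe. Here is a minimal pure-Python sketch of the pattern; `ExpensiveModel` is a hypothetical stand-in for the Caffe model, not a real API:

```python
class ExpensiveModel:
    # Stand-in for a model whose construction is costly
    # (e.g. loading a large Caffe model file from disk).
    load_count = 0

    def __init__(self):
        ExpensiveModel.load_count += 1

    def transform(self, row):
        return row * 2

def transform_partition(rows):
    # Instantiate the model once for the whole partition, then stream
    # every row through it -- the per-row UDF would instead construct
    # one model per row.
    model = ExpensiveModel()
    for row in rows:
        yield model.transform(row)

partitions = [[1, 2, 3], [4, 5]]
out = [list(transform_partition(p)) for p in partitions]
print(out, ExpensiveModel.load_count)  # five rows, only two model loads
```

In Spark the same shape appears as `rdd.mapPartitions(transform_partition)`; the amortization argument is identical.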
Re: Using TF-IDF from MLlib
FWIW the JIRA I was thinking about is https://issues.apache.org/jira/browse/SPARK-3098

On Mon, Mar 16, 2015 at 6:10 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
I vaguely remember that JIRA and AFAIK Matei's point was that the order is not guaranteed *after* a shuffle. If you only use operations like map which preserve partitioning, ordering should be guaranteed from what I know.

On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen so...@cloudera.com wrote:
Dang, I can't seem to find the JIRA now, but I am sure we had a discussion with Matei about this and the conclusion was that RDD order is not guaranteed unless a sort is involved.

On Mar 17, 2015 12:14 AM, Joseph Bradley jos...@databricks.com wrote:
This was brought up again in https://issues.apache.org/jira/browse/SPARK-6340 so I'll answer one item which was asked about: the reliability of zipping RDDs. Basically, it should be reliable, and if it is not, then it should be reported as a bug. This general approach should work (with explicit types to make it clear):

val data: RDD[LabeledPoint] = ...
val labels: RDD[Double] = data.map(_.label)
val features1: RDD[Vector] = data.map(_.features)
val features2: RDD[Vector] = new HashingTF(numFeatures=100).transform(features1)
val features3: RDD[Vector] = idfModel.transform(features2)
val finalData: RDD[LabeledPoint] = labels.zip(features3).map((label, features) => LabeledPoint(label, features))

If you run into problems with zipping like this, please report them! Thanks, Joseph

On Mon, Dec 29, 2014 at 4:06 PM, Xiangrui Meng men...@gmail.com wrote:
Hopefully the new pipeline API addresses this problem. We have a code example here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala -Xiangrui

On Mon, Dec 29, 2014 at 5:22 AM, andy petrella andy.petre...@gmail.com wrote:
Here is what I did for this case: https://github.com/andypetrella/tf-idf

On Mon, Dec 29, 2014 at 11:31 AM, Sean Owen so...@cloudera.com wrote:
Given (label, terms) you can just transform the values to a TF vector, then a TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can make a LabeledPoint from (label, vector) pairs. Is that what you're looking for?

On Mon, Dec 29, 2014 at 3:37 AM, Yao y...@ford.com wrote:
I found the TF-IDF feature extraction and all the MLlib code that works with pure Vector RDDs very difficult to work with due to the lack of ability to associate a vector back to the original data. Why can't Spark MLlib support LabeledPoint?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
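Joseph's label/feature re-zipping idiom relies only on map-like transforms preserving element order, so its dataflow can be mirrored in plain Python. The sketch below uses a toy bucketed-counts function in place of MLlib's HashingTF (illustrative only, not the MLlib implementation):

```python
from collections import Counter

NUM_FEATURES = 16

def hashing_tf(terms):
    # Toy analogue of a hashing term-frequency transform: bucket term
    # counts by hash modulo a fixed vector width. Note: Python salts
    # str hashes per process, so bucket positions vary between runs --
    # fine for a demo, not for persisted features.
    vec = [0.0] * NUM_FEATURES
    for term, count in Counter(terms).items():
        vec[hash(term) % NUM_FEATURES] += count
    return vec

data = [(1.0, ["spark", "rdd", "rdd"]), (0.0, ["hadoop"])]
labels = [label for label, _ in data]        # data.map(_.label)
features = [hashing_tf(t) for _, t in data]  # data.map(_.features) then transform
final = list(zip(labels, features))          # labels.zip(features)
assert [lbl for lbl, _ in final] == labels   # order preserved by map-like steps
```

The assertion is the whole point of the thread: as long as each step is a per-element map (no shuffle or sort), the i-th label still lines up with the i-th feature vector after zipping.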
Re: Using TF-IDF from MLlib
I vaguely remember that JIRA and AFAIK Matei's point was that the order is not guaranteed *after* a shuffle. If you only use operations like map which preserve partitioning, ordering should be guaranteed from what I know.

On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen so...@cloudera.com wrote:
Dang, I can't seem to find the JIRA now, but I am sure we had a discussion with Matei about this and the conclusion was that RDD order is not guaranteed unless a sort is involved.

On Mar 17, 2015 12:14 AM, Joseph Bradley jos...@databricks.com wrote:
This was brought up again in https://issues.apache.org/jira/browse/SPARK-6340 so I'll answer one item which was asked about: the reliability of zipping RDDs. Basically, it should be reliable, and if it is not, then it should be reported as a bug. This general approach should work (with explicit types to make it clear):

val data: RDD[LabeledPoint] = ...
val labels: RDD[Double] = data.map(_.label)
val features1: RDD[Vector] = data.map(_.features)
val features2: RDD[Vector] = new HashingTF(numFeatures=100).transform(features1)
val features3: RDD[Vector] = idfModel.transform(features2)
val finalData: RDD[LabeledPoint] = labels.zip(features3).map((label, features) => LabeledPoint(label, features))

If you run into problems with zipping like this, please report them! Thanks, Joseph

On Mon, Dec 29, 2014 at 4:06 PM, Xiangrui Meng men...@gmail.com wrote:
Hopefully the new pipeline API addresses this problem. We have a code example here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala -Xiangrui

On Mon, Dec 29, 2014 at 5:22 AM, andy petrella andy.petre...@gmail.com wrote:
Here is what I did for this case: https://github.com/andypetrella/tf-idf

On Mon, Dec 29, 2014 at 11:31 AM, Sean Owen so...@cloudera.com wrote:
Given (label, terms) you can just transform the values to a TF vector, then a TF-IDF vector, with HashingTF and IDF / IDFModel.
Then you can make a LabeledPoint from (label, vector) pairs. Is that what you're looking for?

On Mon, Dec 29, 2014 at 3:37 AM, Yao y...@ford.com wrote:
I found the TF-IDF feature extraction and all the MLlib code that works with pure Vector RDDs very difficult to work with due to the lack of ability to associate a vector back to the original data. Why can't Spark MLlib support LabeledPoint?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?
Do you have a small test case that can reproduce the out-of-memory error? I have also seen some errors on large-scale experiments but haven't managed to narrow it down.

Thanks
Shivaram

On Fri, Mar 13, 2015 at 6:20 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
It runs faster, but there are some drawbacks. It seems to consume more memory: I get a java.lang.OutOfMemoryError: Java heap space error if I don't have a sufficient number of partitions for a fixed amount of memory. With the older (ampcamp) implementation for the same data size I didn't get it.

On Thu, Mar 12, 2015 at 11:36 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
On Thu, Mar 12, 2015 at 3:05 PM, Jaonary Rabarisoa jaon...@gmail.com wrote:
In fact, by activating netlib with native libraries it goes faster.
Glad you got it to work! Better performance was one of the reasons we made the switch. Thanks

On Tue, Mar 10, 2015 at 7:03 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
There are a couple of differences between the ml-matrix implementation and the one used in AMPCamp:
- I think the AMPCamp one uses JBLAS, which tends to ship native BLAS libraries along with it. In ml-matrix we switched to using Breeze + Netlib BLAS, which is faster but needs some setup [1] to pick up native libraries. If native libraries are not found it falls back to a JVM implementation, so that might explain the slowdown.
- The other difference, if you are comparing the whole image pipeline, is that I think the AMPCamp version used NormalEquations, which is around 2-3x faster (just in terms of number of flops) compared to TSQR.

[1] https://github.com/fommil/netlib-java#machine-optimised-system-libraries

Thanks
Shivaram

On Tue, Mar 10, 2015 at 9:57 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
I'm trying to play with the implementation of least square solver (Ax = b) in mlmatrix.TSQR, where A is a 5*1024 matrix and b a 5*10 matrix.
It works, but I notice that it's 8 times slower than the implementation given in the latest AMPCamp: http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html. As far as I know these two implementations come from the same basis. What is the difference between these two codes?

On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
There are a couple of solvers that I've written that are part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLlib yet, though, and if you are interested in porting them I'd be happy to review it.

Thanks
Shivaram

[1] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
[2] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Dear all, Is there a least square solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of Spark? It seems that the only least square solver available in Spark is private to the recommender package.

Cheers,
Jao
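For context, the objective in the subject line, min ||Ax - b||^2 + lambda * n * ||x||^2, has the closed-form normal-equations solution x = (A^T A + lambda * n * I)^{-1} A^T b, which is essentially what a NormalEquations-style solver computes. A tiny pure-Python sketch of the one-dimensional instance (A a single column), with illustrative function names:

```python
def ridge_solve_1d(a, b, lam):
    # One-dimensional instance of min ||a*x - b||^2 + lam * n * x^2.
    # The normal equations collapse to a scalar:
    #   x = (a . b) / (a . a + lam * n), where n = len(a).
    n = len(a)
    return sum(ai * bi for ai, bi in zip(a, b)) / (sum(ai * ai for ai in a) + lam * n)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]               # b = 2 * a exactly
print(ridge_solve_1d(a, b, 0.0))  # -> 2.0 (unregularized least squares recovers the slope)
```

With lam > 0 the denominator grows, shrinking the solution toward zero; the multivariate version replaces the scalar division with solving the d x d system (A^T A + lam * n * I) x = A^T b.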
Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?
There are a couple of differences between the ml-matrix implementation and the one used in AMPCamp:
- I think the AMPCamp one uses JBLAS, which tends to ship native BLAS libraries along with it. In ml-matrix we switched to using Breeze + Netlib BLAS, which is faster but needs some setup [1] to pick up native libraries. If native libraries are not found it falls back to a JVM implementation, so that might explain the slowdown.
- The other difference, if you are comparing the whole image pipeline, is that I think the AMPCamp version used NormalEquations, which is around 2-3x faster (just in terms of number of flops) compared to TSQR.

[1] https://github.com/fommil/netlib-java#machine-optimised-system-libraries

Thanks
Shivaram

On Tue, Mar 10, 2015 at 9:57 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

I'm trying to play with the implementation of the least squares solver (Ax = b) in mlmatrix.TSQR, where A is a 5*1024 matrix and b a 5*10 matrix. It works but I notice that it's 8 times slower than the implementation given in the latest ampcamp: http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html. As far as I know these two implementations come from the same basis. What is the difference between these two codes?

On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

There are a couple of solvers that I've written that are part of the AMPLab ml-matrix repo [1,2].
These aren't part of MLlib yet though, and if you are interested in porting them I'd be happy to review it.

Thanks
Shivaram

[1] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
[2] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

Dear all,

Is there a least squares solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of Spark? It seems that the only least squares solver available in Spark is private to the recommender package.

Cheers,
Jao
Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?
On Thu, Mar 12, 2015 at 3:05 PM, Jaonary Rabarisoa jaon...@gmail.com wrote:

In fact, by activating netlib with native libraries it goes faster.

Glad you got it to work! Better performance was one of the reasons we made the switch.

Thanks

On Tue, Mar 10, 2015 at 7:03 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

There are a couple of differences between the ml-matrix implementation and the one used in AMPCamp:
- I think the AMPCamp one uses JBLAS, which tends to ship native BLAS libraries along with it. In ml-matrix we switched to using Breeze + Netlib BLAS, which is faster but needs some setup [1] to pick up native libraries. If native libraries are not found it falls back to a JVM implementation, so that might explain the slowdown.
- The other difference, if you are comparing the whole image pipeline, is that I think the AMPCamp version used NormalEquations, which is around 2-3x faster (just in terms of number of flops) compared to TSQR.

[1] https://github.com/fommil/netlib-java#machine-optimised-system-libraries

Thanks
Shivaram

On Tue, Mar 10, 2015 at 9:57 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

I'm trying to play with the implementation of the least squares solver (Ax = b) in mlmatrix.TSQR, where A is a 5*1024 matrix and b a 5*10 matrix. It works but I notice that it's 8 times slower than the implementation given in the latest ampcamp: http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html. As far as I know these two implementations come from the same basis. What is the difference between these two codes?

On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

There are a couple of solvers that I've written that are part of the AMPLab ml-matrix repo [1,2].
These aren't part of MLlib yet though, and if you are interested in porting them I'd be happy to review it.

Thanks
Shivaram

[1] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
[2] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

Dear all,

Is there a least squares solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of Spark? It seems that the only least squares solver available in Spark is private to the recommender package.

Cheers,
Jao
Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?
Sections 3, 4, and 5 in http://www.netlib.org/lapack/lawnspdf/lawn204.pdf are a good reference.

Shivaram

On Mar 6, 2015 9:17 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

Do you have a reference paper for the algorithm implemented in TSQR.scala?

On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

There are a couple of solvers that I've written that are part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLlib yet though, and if you are interested in porting them I'd be happy to review it.

Thanks
Shivaram

[1] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
[2] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

Dear all,

Is there a least squares solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of Spark? It seems that the only least squares solver available in Spark is private to the recommender package.

Cheers,
Jao
Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?
There are a couple of solvers that I've written that are part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLlib yet though, and if you are interested in porting them I'd be happy to review it.

Thanks
Shivaram

[1] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
[2] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

Dear all,

Is there a least squares solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of Spark? It seems that the only least squares solver available in Spark is private to the recommender package.

Cheers,
Jao
Re: What does (### skipped) mean in the Spark UI?
+Josh, who added the Job UI page. I've seen this as well and was a bit confused about what it meant. Josh, is there a specific scenario that creates these skipped stages in the Job UI?

Thanks
Shivaram

On Wed, Jan 7, 2015 at 12:32 PM, Corey Nolet cjno...@gmail.com wrote:

Sorry - replace ### with an actual number. What does a skipped stage mean? I'm running a series of jobs and it seems like after a certain point, the number of skipped stages is larger than the number of actual completed stages.

On Wed, Jan 7, 2015 at 3:28 PM, Ted Yu yuzhih...@gmail.com wrote:

Looks like the number of skipped stages couldn't be formatted.

Cheers

On Wed, Jan 7, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote:

We just upgraded to Spark 1.2.0 and we're seeing this in the UI.
Re: What does (### skipped) mean in the Spark UI?
Ah, I see - so they're more like 're-used stages', which is not necessarily a bug in the program or anything like that. Thanks for the pointer to the comment.

Thanks
Shivaram

On Wed, Jan 7, 2015 at 2:00 PM, Mark Hamstra m...@clearstorydata.com wrote:

That's what you want to see. The computation of a stage is skipped if the results for that stage are still available from the evaluation of a prior job run: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L163

On Wed, Jan 7, 2015 at 12:32 PM, Corey Nolet cjno...@gmail.com wrote:

Sorry - replace ### with an actual number. What does a skipped stage mean? I'm running a series of jobs and it seems like after a certain point, the number of skipped stages is larger than the number of actual completed stages.

On Wed, Jan 7, 2015 at 3:28 PM, Ted Yu yuzhih...@gmail.com wrote:

Looks like the number of skipped stages couldn't be formatted.

Cheers

On Wed, Jan 7, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote:

We just upgraded to Spark 1.2.0 and we're seeing this in the UI.
Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary
Just to clarify, are you running the application using spark-submit after packaging with sbt package? One thing that might help is to mark the Spark dependency as 'provided', as then you shouldn't have the Spark classes in your jar.

Thanks
Shivaram

On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen so...@cloudera.com wrote:

You should use the same binaries everywhere. The problem here is that anonymous functions (potentially) get compiled to different names when you build differently, so you actually have one function being called when another function is meant.

On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:

Hi, I encountered a weird bytecode incompatibility issue between the spark-core jar from the mvn repo and the official Spark prebuilt binary. Steps to reproduce:

1. Download the official pre-built Spark binary 1.1.1 at http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz
2. Launch the Spark cluster in pseudo cluster mode
3. Run a small Scala app which calls RDD.saveAsObjectFile():

scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.1"
)

val sc = new SparkContext(args(0), "test") // args(0) is the Spark master URI
val rdd = sc.parallelize(List(1, 2, 3))
rdd.saveAsObjectFile("/tmp/mysaoftmp")
sc.stop

This throws an exception as follows:

[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, ray-desktop.sh.intel.com): java.lang.ClassCastException: scala.Tuple2 cannot be cast to scala.collection.Iterator
[error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
[error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
[error] org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
[error] org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
[error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
[error] org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
[error]
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
[error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
[error] org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
[error] org.apache.spark.scheduler.Task.run(Task.scala:54)
[error] org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
[error] java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
[error] java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[error] java.lang.Thread.run(Thread.java:701)

After investigation, I found that this is caused by a bytecode incompatibility between RDD.class in spark-core_2.10-1.1.1.jar and the pre-built Spark assembly. This issue also happens with Spark 1.1.0. Is there anything wrong in my usage of Spark? Or anything wrong in the process of deploying Spark module jars to the maven repo?

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
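For reference, a minimal build.sbt sketch of the 'provided' scoping suggested above; the coordinates and versions are copied from the report and are illustrative, not a verified build:

```scala
// build.sbt -- mark Spark as "provided": spark-submit supplies the Spark
// classes at runtime, so they are not bundled into your application jar and
// cannot clash with the cluster's own (differently compiled) Spark classes.
scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1" % "provided"
```

With this scope, `sbt package` still compiles against spark-core, but the resulting jar contains only your own classes.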
Re: Problem creating EC2 cluster using spark-ec2
+Andrew

Actually I think this is because we haven't uploaded the Spark binaries to cloudfront / pushed the change to mesos/spark-ec2. Andrew, can you take care of this?

On Tue, Dec 2, 2014 at 5:11 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Interesting. Do you have any problems when launching in us-east-1? What is the full output of spark-ec2 when launching a cluster? (Post it to a gist if it’s too big for email.)

On Mon, Dec 1, 2014 at 10:34 AM, Dave Challis dave.chal...@aistemos.com wrote:

I've been trying to create a Spark cluster on EC2 using the documentation at https://spark.apache.org/docs/latest/ec2-scripts.html (with Spark 1.1.1). Running the script successfully creates some EC2 instances, HDFS etc., but it appears to fail to copy the actual files needed to run Spark across. I ran the following commands:

$ cd ~/src/spark-1.1.1/ec2
$ ./spark-ec2 --key-pair=* --identity-file=* --slaves=1 --region=eu-west-1 --zone=eu-west-1a --instance-type=m3.medium --no-ganglia launch foocluster

I see the following in the script's output:

(instance and HDFS set up happens here)
...
Persistent HDFS installed, won't start by default...
~/spark-ec2 ~/spark-ec2
Setting up spark-standalone
RSYNC'ing /root/spark/conf to slaves...
*.eu-west-1.compute.amazonaws.com
RSYNC'ing /root/spark-ec2 to slaves...
*.eu-west-1.compute.amazonaws.com
./spark-standalone/setup.sh: line 22: /root/spark/sbin/stop-all.sh: No such file or directory
./spark-standalone/setup.sh: line 27: /root/spark/sbin/start-master.sh: No such file or directory
./spark-standalone/setup.sh: line 33: /root/spark/sbin/start-slaves.sh: No such file or directory
Setting up tachyon
RSYNC'ing /root/tachyon to slaves...
...
(Tachyon setup happens here without any problem)

I can ssh to the master (using ./spark-ec2 login), and looking in /root/, it contains:

$ ls /root
ephemeral-hdfs hadoop-native mapreduce persistent-hdfs scala shark spark spark-ec2 tachyon

If I look in /root/spark (where the sbin directory should be found), it only contains a single 'conf' directory:

$ ls /root/spark
conf

Any idea why spark-ec2 might have failed to copy these files across?

Thanks,
Dave

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Mllib native netlib-java/OpenBLAS
Can you clarify what Spark master URL you are using? Is it 'local' or is it a cluster? If it is 'local' then rebuilding Spark wouldn't help, as Spark is getting pulled in from Maven and that'll just pick up the released artifacts.

Shivaram

On Mon, Nov 24, 2014 at 1:08 PM, agg212 alexander_galaka...@brown.edu wrote:

I tried building Spark from the source, by downloading it and running:

mvn -Pnetlib-lgpl -DskipTests clean package

I then installed OpenBLAS by doing the following:
- Download and unpack the .tar from http://www.openblas.net/
- Run `make`

I then linked /usr/lib/libblas.so.3 to /usr/lib/libopenblas.so (which links to /usr/lib/libopenblas_sandybridgep-r0.2.12.so). I am still getting the following error when running a job after installing Spark from the source with the -Pnetlib-lgpl flag:

WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

Any thoughts on what else I need to do to get the native libraries recognized by Spark?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-native-netlib-java-OpenBLAS-tp19662p19681.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: org/apache/commons/math3/random/RandomGenerator issue
I ran into this problem too, and I know of a workaround but don't exactly know what is happening. The workaround is to explicitly add either the commons-math jar or your application jar (shaded with commons-math) to spark.executor.extraClassPath. My hunch is that this is related to the class loader problem described in [1], where Spark loads Breeze at the beginning and then having commons-math in the user's jar somehow doesn't get picked up.

Thanks
Shivaram

[1] http://apache-spark-user-list.1001560.n3.nabble.com/Native-library-can-not-be-loaded-when-using-Mllib-PCA-td7042.html#a8307

On Sat, Nov 8, 2014 at 1:21 PM, Sean Owen so...@cloudera.com wrote:

This means you haven't actually included commons-math3 in your application. Check the contents of your final app jar and then go check your build file again.

On Sat, Nov 8, 2014 at 12:20 PM, lev kat...@gmail.com wrote:

Hi, I'm using breeze.stats.distributions.Binomial with Spark 1.1.0 and having the same error. I tried to add the dependency on math3 with versions 3.11, 3.2 and 3.3, and it didn't help. Any ideas what might be the problem?

Thanks,
Lev.

anny9699 wrote: I use breeze.stats.distributions.Bernoulli in my code, however I met this problem: java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-tp15748p18406.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
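A sketch of the workaround as a spark-defaults.conf entry; the jar path is a placeholder you would replace with your shaded application jar or a commons-math3 jar available on every executor:

```
# conf/spark-defaults.conf -- put the jar on the executor classpath directly,
# so the executor's primary class loader (not the user-jar child loader)
# can see commons-math3.
spark.executor.extraClassPath  /path/to/your-shaded-app.jar
```

The same property can be set programmatically on a SparkConf before creating the context.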
Re: Matrix multiplication in spark
We are working on PRs to add block-partitioned matrix formats and dense matrix multiply methods. These should be out in the next few weeks or so. The sparse methods still need some research on partitioning schemes etc., and we will do that after the dense methods are in place.

Thanks
Shivaram

On Wed, Nov 5, 2014 at 2:00 AM, Duy Huynh duy.huynh@gmail.com wrote:

ok great. when will this be ready?

On Wed, Nov 5, 2014 at 4:27 AM, Xiangrui Meng men...@gmail.com wrote:

We are working on distributed block matrices. The main JIRA is at: https://issues.apache.org/jira/browse/SPARK-3434 The goal is to support basic distributed linear algebra (dense first and then sparse). -Xiangrui

On Wed, Nov 5, 2014 at 12:23 AM, ll duy.huynh@gmail.com wrote:

@sowen.. i am looking for distributed operations, especially very large sparse matrix x sparse matrix multiplication. what is the best way to implement this in spark?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-multiplication-in-spark-tp12562p18164.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
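Since block-partitioned multiply is the technique under discussion, here is a hedged, purely local Python sketch of the idea: each output block C[i][j] is the sum over k of the block products A[i][k] * B[k][j]. In the distributed setting, each block product is one task and blocks with matching k are co-partitioned; none of this code is from Spark itself.

```python
# Local model of block-partitioned dense matrix multiply.
# A block grid is a list of lists of small dense blocks (each a list of rows).

def matmul(X, Y):
    """Dense multiply of two small blocks."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def block_matmul(A, B):
    """A is a p x q grid of blocks, B is q x r; returns the p x r result grid."""
    p, q, r = len(A), len(A[0]), len(B[0])
    C = []
    for i in range(p):
        row = []
        for j in range(r):
            acc = matmul(A[i][0], B[0][j])
            for k in range(1, q):          # sum the per-k block products
                acc = matadd(acc, matmul(A[i][k], B[k][j]))
            row.append(acc)
        C.append(row)
    return C

# A 4x4 identity split into 2x2 blocks, multiplied by itself, stays the identity.
I2 = [[1.0, 0.0], [0.0, 1.0]]
Z2 = [[0.0, 0.0], [0.0, 0.0]]
A = [[I2, Z2], [Z2, I2]]
C = block_matmul(A, A)
print(C[0][0])
```

For sparse inputs the open question mentioned above is exactly the k-loop: a good partitioning scheme avoids shipping blocks that are entirely zero.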
Re: Spark Build
Yeah, looks like https://github.com/apache/spark/pull/2744 broke the build. We will fix it soon.

On Fri, Oct 31, 2014 at 12:21 PM, Terry Siu terry@smartfocus.com wrote:

I am synced up to the Spark master branch as of commit 23468e7e96. I have Maven 3.0.5, Scala 2.10.3, and SBT 0.13.1. I’ve built the master branch successfully previously and am trying to rebuild again to take advantage of the new Hive 0.13.1 profile. I execute the following command:

$ mvn -DskipTests -Phive-0.13-1 -Phadoop-2.4 -Pyarn clean package

The build fails at the following stage:

[INFO] Using incremental compilation
[INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
[INFO] Compiling 5 Scala sources to /home/terrys/Applications/spark/yarn/stable/target/scala-2.10/test-classes...
[ERROR] /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:20: object MemLimitLogger is not a member of package org.apache.spark.deploy.yarn
[ERROR] import org.apache.spark.deploy.yarn.MemLimitLogger._
[ERROR] ^
[ERROR] /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:29: not found: value memLimitExceededLogMessage
[ERROR] val vmemMsg = memLimitExceededLogMessage(diagnostics, VMEM_EXCEEDED_PATTERN)
[ERROR] ^
[ERROR] /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:30: not found: value memLimitExceededLogMessage
[ERROR] val pmemMsg = memLimitExceededLogMessage(diagnostics, PMEM_EXCEEDED_PATTERN)
[ERROR] ^
[ERROR] three errors found
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......... SUCCESS [2.758s]
[INFO] Spark Project Common Network Code . SUCCESS [6.716s]
[INFO] Spark Project Core ................ SUCCESS [2:46.610s]
[INFO] Spark Project Bagel ............... SUCCESS [16.776s]
[INFO] Spark Project GraphX .............. SUCCESS [52.159s]
[INFO] Spark Project Streaming ...........
SUCCESS [1:09.883s]
[INFO] Spark Project ML Library .......... SUCCESS [1:18.932s]
[INFO] Spark Project Tools ............... SUCCESS [10.210s]
[INFO] Spark Project Catalyst ............ SUCCESS [1:12.499s]
[INFO] Spark Project SQL ................. SUCCESS [1:10.561s]
[INFO] Spark Project Hive ................ SUCCESS [1:08.571s]
[INFO] Spark Project REPL ................ SUCCESS [32.377s]
[INFO] Spark Project YARN Parent POM ..... SUCCESS [1.317s]
[INFO] Spark Project YARN Stable API ..... FAILURE [25.918s]
[INFO] Spark Project Assembly ............ SKIPPED
[INFO] Spark Project External Twitter .... SKIPPED
[INFO] Spark Project External Kafka ...... SKIPPED
[INFO] Spark Project External Flume Sink . SKIPPED
[INFO] Spark Project External Flume ...... SKIPPED
[INFO] Spark Project External ZeroMQ ..... SKIPPED
[INFO] Spark Project External MQTT ....... SKIPPED
[INFO] Spark Project Examples ............ SKIPPED
[INFO]
[INFO] BUILD FAILURE
[INFO]
[INFO] Total time: 11:15.889s
[INFO] Finished at: Fri Oct 31 12:08:55 PDT 2014
[INFO] Final Memory: 73M/829M
[INFO]
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first) on project spark-yarn_2.10: Execution scala-test-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile failed. CompileFailed - [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :spark-yarn_2.10

I could not find MemLimitLogger anywhere in the Spark code. Has anybody else seen/encountered this?

Thanks,
-Terry

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
New SparkR mailing list, JIRA
Hi

I'd like to announce a couple of updates to the SparkR project. In order to facilitate better collaboration for new features and development, we have a new mailing list and issue tracker for SparkR.

- The new JIRA is hosted at https://sparkr.atlassian.net/browse/SPARKR/ and we have migrated all existing Github issues to the JIRA. Please submit any bugs / improvements to this JIRA going forward.

- There is a new mailing list, sparkr-...@googlegroups.com, that will be used for design discussions for new features and development-related issues. We will still be answering user issues on the Apache Spark mailing lists.

Please let me know if you have any questions.

Thanks
Shivaram
Re: SparkR: split, apply, combine strategy for dataframes?
Could you try increasing the number of slices with the large data set? SparkR assumes that each slice (or partition, in Spark terminology) can fit in the memory of a single machine. Also, is the error happening when you run the map function, or when you combine the results?

Thanks
Shivaram

On Thu, Aug 14, 2014 at 3:53 PM, Carlos J. Gil Bellosta gilbello...@gmail.com wrote:

Hello,

I am having problems trying to apply the split-apply-combine strategy for dataframes using SparkR. I have a largish dataframe and I would like to achieve something similar to what ddply(df, .(id), foo) would do, only using SparkR as the computing engine. My df has a few million records and I would like to split it by id and operate on the pieces. These pieces are quite small in size: just a few hundred records. I do something along the following lines:

1) Use split to transform df into a list of dfs.
2) parallelize the resulting list as an RDD (using a few thousand slices).
3) map my function over the pieces using Spark.
4) recombine the results (do.call, rbind, etc.)

My cluster works and I can perform medium-sized batch jobs. However, it fails with my full df: I get a heap space out of memory error. It is funny, as the slices are very small in size. Should I send smaller batches to my cluster? Is there any recommended general approach to this kind of split-apply-combine problem?

Best,
Carlos J. Gil Bellosta
http://www.datanalytics.com

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Lost executors
If the JVM heap size is close to the memory limit, the OS sometimes kills the process under memory pressure. I've usually found that lowering the executor memory size helps.

Shivaram

On Wed, Aug 13, 2014 at 11:01 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

What is your Spark executor memory set to? (You can see it in Spark's web UI at http://driver:4040 under the executors tab). One thing to be aware of is that the JVM never really releases memory back to the OS, so it will keep filling up to the maximum heap size you set. Maybe 4 executors with that much heap are taking a lot of the memory. Persist as DISK_ONLY should indeed stream data from disk, so I don't think that will be a problem.

Matei

On August 13, 2014 at 6:49:11 AM, rpandya (r...@iecommerce.com) wrote:

After a lot of grovelling through logs, I found out that the Nagios monitor process detected that the machine was almost out of memory, and killed the SNAP executor process. So why is the machine running out of memory? Each node has 128GB of RAM, 4 executors, about 40GB of data. It did run out of memory if I tried to cache() the RDD, but I would hope that persist() is implemented so that it would stream to disk without trying to materialize too much data in RAM.

Ravi

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722p12032.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How can I implement eigenvalue decomposition in Spark?
If you just want to find the top eigenvalue / eigenvector, you can do something like the Lanczos method. There is a description of a MapReduce-based algorithm in Section 4.2 of [1].

[1] http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf

On Thu, Aug 7, 2014 at 10:54 AM, Li Pu l...@twitter.com.invalid wrote:

@Miles, the latest SVD implementation in mllib is partially distributed. Matrix-vector multiplication is computed among all workers, but the right singular vectors are all stored in the driver. If your symmetric matrix is n x n and you want the first k eigenvalues, you will need to fit n x k doubles in the driver's memory. Behind the scenes, it calls ARPACK to compute the eigen-decomposition of A^T A. You can look into the source code for the details.

@Sean, the SVD++ implementation in graphx is not the canonical definition of SVD. It doesn't have the orthogonality that SVD holds. But we might want to use graphx as the underlying matrix representation for mllib.SVD to address the problem of skewed entry distribution.

On Thu, Aug 7, 2014 at 10:51 AM, Evan R. Sparks evan.spa...@gmail.com wrote:

Reza Zadeh has contributed the distributed implementation of (Tall/Skinny) SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html), which is in MLlib (Spark 1.0), and a distributed sparse SVD is coming in Spark 1.1 (https://issues.apache.org/jira/browse/SPARK-1782). If your data is sparse (which it often is in social networks), you may have better luck with this. I haven't tried the GraphX implementation, but those algorithms are often well-suited for power-law distributed graphs as you might see in social networks. FWIW, I believe you need to square the elements of the sigma matrix from the SVD to get the eigenvalues.

On Thu, Aug 7, 2014 at 10:20 AM, Sean Owen so...@cloudera.com wrote:

(-incubator, +user)

If your matrix is symmetric (and real, I presume), and if my linear algebra isn't too rusty, then its SVD is its eigendecomposition.
The SingularValueDecomposition object you get back has U and V, both of which have columns that are the eigenvectors.

There are a few SVDs in the Spark code. The one in mllib is not distributed (right?) and is probably not an efficient means of computing eigenvectors if you really just want a decomposition of a symmetric matrix. The one I see in graphx is distributed? I haven't used it though. Maybe it could be part of a solution.

On Thu, Aug 7, 2014 at 2:21 PM, yaochunnan yaochun...@gmail.com wrote:

Our lab needs to do some simulation on online social networks. We need to handle a 5000*5000 adjacency matrix, namely, to get its largest eigenvalue and corresponding eigenvector. Matlab can be used but it is time-consuming. Is Spark effective in linear algebra calculations and transformations? Later we would have 500*500 matrices to process. It seems urgent that we find some distributed computation platform. I see SVD has been implemented and I can get eigenvalues of a matrix through this API. But when I want to get both eigenvalues and eigenvectors, or at least the biggest eigenvalue and the corresponding eigenvector, it seems that current Spark doesn't have such an API. Is it possible that I write eigenvalue decomposition from scratch? What should I do? Thanks a lot!

Miles Yao

View this message in context: How can I implement eigenvalue decomposition in Spark? Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

-- Li @vrilleup
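To make the Lanczos suggestion above concrete, here is a hedged local Python sketch of its simplest relative, power iteration, for the largest eigenpair of a small symmetric matrix. In the MapReduce/Spark formulation only the matrix-vector product is the distributed step; the normalization and the Rayleigh quotient are O(n) work on the driver. This is an illustration, not code from any of the libraries discussed.

```python
# Power iteration for the dominant eigenpair of a symmetric matrix.

def matvec(A, v):
    """Dense matrix-vector product -- the step a cluster would distribute."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, iters=100):
    v = [1.0] * len(A)
    for _ in range(iters):
        w = matvec(A, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]          # keep the iterate unit-length
    lam = sum(x * y for x, y in zip(v, matvec(A, v)))   # Rayleigh quotient
    return lam, v

A = [[2.0, 1.0], [1.0, 2.0]]   # symmetric; eigenvalues are 3 and 1
lam, v = power_iteration(A)
print(lam)
```

Lanczos refines this same matvec loop by keeping an orthogonal basis of the iterates, which yields several extremal eigenvalues at once instead of only the largest.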
Re: SparkR : lapplyPartition transforms the data in vertical format
I tried this out, and what is happening here is that, as the input file is small, only 1 partition is created. lapplyPartition runs the given function on the partition and computes sumx as 55 and sumy as 55. Now the return value from lapplyPartition is treated as a list by SparkR, and collect concatenates all the lists from all partitions. Thus the output in this case is just a list with two values, and trying to access element[2] in the for loop gives NA. If you just use cat(as.character(element), "\n"), you should see 55 and 55.

Thanks
Shivaram

On Thu, Aug 7, 2014 at 3:21 PM, Pranay Dave pranay.da...@gmail.com wrote:

Hello Zongheng

In fact the problem is in lapplyPartition. lapply gives output as
1,1
2,2
3,3
...
10,10
However lapplyPartition gives output as
55, NA
55, NA
Why is the lapply output horizontal and the lapplyPartition output vertical? Here is my code:

library(SparkR)
sc <- sparkR.init("local")
lines <- textFile(sc, "/sparkdev/datafiles/covariance.txt")
totals <- lapplyPartition(lines, function(lines) {
  sumx <- 0
  sumy <- 0
  totaln <- 0
  for (i in 1:length(lines)) {
    dataxy <- unlist(strsplit(lines[i], ","))
    sumx <- sumx + as.numeric(dataxy[1])
    sumy <- sumy + as.numeric(dataxy[2])
  }
  ## list(as.numeric(sumx), as.numeric(sumy), as.numeric(sumxy), as.numeric(totaln))
  ## list does the same as below
  c(sumx, sumy)
})
output <- collect(totals)
for (element in output) {
  cat(as.character(element[1]), as.character(element[2]), "\n")
}

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-lapplyPartition-transforms-the-data-in-vertical-format-tp11540p11726.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
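A hedged Python model of the semantics Shivaram describes (the function names are invented for illustration and are not the SparkR API): the partition-level function returns one list per partition, and collect concatenates those per-partition lists, so a single partition yields the flat list [55, 55] rather than pairs.

```python
# Toy model of lapply vs lapplyPartition vs collect.

def lapply(partitions, f):
    """Apply f to each row; one result list per partition."""
    return [[f(row) for row in part] for part in partitions]

def lapply_partition(partitions, f):
    """Apply f to each whole partition; f's return value is a list."""
    return [f(part) for part in partitions]

def collect(mapped):
    """Concatenate the per-partition result lists, as SparkR's collect does."""
    out = []
    for part_result in mapped:
        out.extend(part_result)
    return out

lines = ["%d,%d" % (i, i) for i in range(1, 11)]
partitions = [lines]                      # a small file => a single partition

def sums(part):
    xs = [tuple(int(v) for v in line.split(",")) for line in part]
    return [sum(x for x, _ in xs), sum(y for _, y in xs)]

result = collect(lapply_partition(partitions, sums))
print(result)   # [55, 55] -- a flat two-element list, so element[2] is absent
```

With two partitions the same program would return [sumx1, sumy1, sumx2, sumy2], which is why per-partition results usually need a combine step.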
Re: SparkR : lapplyPartition transforms the data in vertical format
The output of lapply and lapplyPartition should be the same by design -- the only difference is that in lapply the user-defined function returns a row, while it returns a list in lapplyPartition. Could you give an example of a small input and the output that you expect to see for the above program?

Shivaram

On Wed, Aug 6, 2014 at 5:47 AM, Pranay Dave pranay.da...@gmail.com wrote:
Hello
As per the documentation, lapply works on single records and lapplyPartition works on partitions. However the format of the output does not change. When I use lapplyPartition, the data is converted to vertical format. Here is my code:

library(SparkR)
sc <- sparkR.init("local")
lines <- textFile(sc, "/sparkdev/datafiles/covariance.txt")
totals <- lapply(lines, function(lines) {
  sumx <- 0
  sumy <- 0
  totaln <- 0
  for (i in 1:length(lines)) {
    dataxy <- unlist(strsplit(lines[i], ","))
    sumx <- sumx + as.numeric(dataxy[1])
    sumy <- sumy + as.numeric(dataxy[2])
  }
  ## list(as.numeric(sumx), as.numeric(sumy), as.numeric(sumxy), as.numeric(totaln))
  ## list does same as below
  c(sumx, sumy)
})
output <- collect(totals)
for (element in output) {
  cat(as.character(element[1]), as.character(element[2]), "\n")
}

I am expecting output as
55, 55
However it is giving
55, NA
55, NA
Where am I going wrong?
Thanks
Pranay
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-lapplyPartition-transforms-the-data-in-vertical-format-tp11540.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2
This fails for me too. I have no idea why it happens, as I can wget the pom from Maven Central. To work around this I just copied the ivy xmls and jars from this github repo https://github.com/peterklipfel/scala_koans/tree/master/ivyrepo/cache/org.scala-lang/scala-library and put them in /root/.ivy2/cache/org.scala-lang/scala-library

Thanks
Shivaram

On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote: Currently scala 2.10.2 can't be pulled in from Maven Central, it seems; however, if you have it in your ivy cache it should work. On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote: Me 3 On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote: I also ran into the same issue. What is the solution? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Compiling-Spark-master-284771ef-with-sbt-sbt-assembly-fails-on-EC2-tp11155p11189.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Cell : 425-233-8271
Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2
Thanks Patrick -- it does look like some Maven misconfiguration, as wget http://repo1.maven.org/maven2/org/scala-lang/scala-library/2.10.2/scala-library-2.10.2.pom works for me.

Shivaram

On Fri, Aug 1, 2014 at 3:27 PM, Patrick Wendell pwend...@gmail.com wrote:
This is a Scala bug - I filed something upstream, hopefully they can fix it soon and/or we can provide a work around: https://issues.scala-lang.org/browse/SI-8772
- Patrick
Re: sbt directory missed
I think the 1.0 AMI only contains the prebuilt packages (i.e. just the binaries) of Spark and not the source code. If you want to build Spark on EC2, you can clone the github repo and then use sbt.

Thanks
Shivaram

On Mon, Jul 28, 2014 at 8:49 AM, redocpot julien19890...@gmail.com wrote:
update: Just checked the python launch script. When retrieving Spark, it refers to this script: https://github.com/mesos/spark-ec2/blob/v3/spark/init.sh where each version number is mapped to a tar file:

0.9.2)
  if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
    wget http://s3.amazonaws.com/spark-related-packages/spark-0.9.2-bin-hadoop1.tgz
  else
    wget http://s3.amazonaws.com/spark-related-packages/spark-0.9.2-bin-cdh4.tgz
  fi
  ;;
1.0.0)
  if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.0-bin-hadoop1.tgz
  else
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.0-bin-cdh4.tgz
  fi
  ;;
1.0.1)
  if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.1-bin-hadoop1.tgz
  else
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.1-bin-cdh4.tgz
  fi
  ;;

I just checked the three last tar files. I find the /sbt directory and many other directories like bagel, mllib, etc. in the 0.9.2 tar file. However, they are not in the 1.0.0 and 1.0.1 tar files. I am not sure that the 1.0.x versions are mapped to the correct tar files.
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sbt-directory-missed-tp10783p10784.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
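The mapping done by the init.sh case statement above can be summarized as a small function. A Python sketch (the function name is hypothetical; the URLs are the ones quoted in the message):

```python
# Sketch of the version-to-tarball mapping performed by the init.sh
# shown above: given a Spark version string and the Hadoop major
# version, build the spark-related-packages URL that the script wgets.
def package_url(spark_version, hadoop_major):
    base = "http://s3.amazonaws.com/spark-related-packages"
    suffix = "hadoop1" if hadoop_major == 1 else "cdh4"
    return "%s/spark-%s-bin-%s.tgz" % (base, spark_version, suffix)

print(package_url("1.0.1", 1))
# http://s3.amazonaws.com/spark-related-packages/spark-1.0.1-bin-hadoop1.tgz
```

This also makes the reported symptom concrete: the 1.0.x URLs point at binary-only distributions, which is why /sbt and the source directories are absent.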
Re: BUG in spark-ec2 script (--ebs-vol-size) and workaround...
Thanks a lot for reporting this. I think we just missed installing xfsprogs on the AMI. I have a fix for this at https://github.com/mesos/spark-ec2/pull/59. After the pull request is merged, any new clusters launched should have mkfs.xfs Thanks Shivaram On Fri, Jul 18, 2014 at 4:56 PM, Ben Horner ben.hor...@atigeo.com wrote: Hello all, There is a bug in the spark-ec2 script (perhaps due to a change in the Amazon AMI). The --ebs-vol-size option directs the spark-ec2 script to add an EBS volume of the specified size, and mount it at /vol for a persistent HDFS. To do this, it uses mkfs.xfs which is not available (though mkfs is). To work around this, I was able to run yum install xfsprogs on the master and each slave, and then use the --resume option with the script, and the persistent HDFS actually worked! This has been a frustrating experience, but I've used the spark-ec2 script for several months now, and it's incredibly helpful. I hope this post helps towards fixing the problem! Thanks, -Ben P.S. This is the full initial command I used, in case this is isolated to particular instance types or anything: ec2/spark-ec2 -k ... -i ... -z us-east-1d -s 4 -t m3.2xlarge --ebs-vol-size=250 -m r3.2xlarge launch ... P.P.S. Ganglia is still broken, and has been for a while... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/BUG-in-spark-ec2-script-ebs-vol-size-and-workaround-tp10217.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: SparkR failed to connect to the master
You'll need to build SparkR to match the Spark version deployed on the cluster. You can do that by changing the Spark version in SparkR's build.sbt [1]. If you are using the Maven build you'll need to edit pom.xml Thanks Shivaram [1] https://github.com/amplab-extras/SparkR-pkg/blob/master/pkg/src/build.sbt#L20 On Mon, Jul 14, 2014 at 6:19 PM, cjwang c...@cjwang.us wrote: I tried installing the latest Spark 1.0.1 and SparkR couldn't find the master either. I restarted with Spark 0.9.1 and SparkR was able to find the master. So, there seemed to be something that changed after Spark 1.0.0. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-failed-to-connect-to-the-master-tp9359p9680.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: sparkR - is it possible to run sparkR on yarn?
We don't have any documentation on running SparkR on YARN, and I think there might be some issues that need to be fixed (the recent PySpark-on-YARN PRs are an example). SparkR has only been tested to work with Spark standalone mode so far.

Thanks
Shivaram

On Tue, Apr 29, 2014 at 7:56 PM, phoenix bai mingzhi...@gmail.com wrote: Hi all, I searched around but failed to find anything about running SparkR on YARN. So, is it possible to run SparkR with YARN, either in yarn-standalone or yarn-client mode? If so, is there any document that could guide me through the build and setup processes? I am desperate for some answers, so please help!
Re: Build times for Spark
Are you by any chance building this on NFS? As far as I know the build is severely bottlenecked by filesystem calls during assembly (each class file in each dependency gets an fstat call or something like that). That is partly why building from, say, a local ext4 filesystem or an SSD is much faster irrespective of memory / CPU.

Thanks
Shivaram

On Fri, Apr 25, 2014 at 2:09 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can always increase the sbt memory by setting export JAVA_OPTS=-Xmx10g Thanks Best Regards On Sat, Apr 26, 2014 at 2:17 AM, Williams, Ken ken.willi...@windlogics.com wrote: No, I haven't done any config for SBT. Is there somewhere you might be able to point me toward for how to do that? -Ken *From:* Josh Rosen [mailto:rosenvi...@gmail.com] *Sent:* Friday, April 25, 2014 3:27 PM *To:* user@spark.apache.org *Subject:* Re: Build times for Spark Did you configure SBT to use the extra memory? On Fri, Apr 25, 2014 at 12:53 PM, Williams, Ken ken.willi...@windlogics.com wrote: I've cloned the github repo and I'm building Spark on a pretty beefy machine (24 CPUs, 78GB of RAM) and it takes a pretty long time. For instance, today I did a 'git pull' for the first time in a week or two, and then doing 'sbt/sbt assembly' took 43 minutes of wallclock time (88 minutes of CPU time). After that, I did 'SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly' and that took 25 minutes wallclock, 73 minutes CPU. Is that typical? Or does that indicate some setup problem in my environment? -- Ken Williams, Senior Research Scientist *WindLogics* http://windlogics.com -- CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution of any kind is strictly prohibited. If you are not the intended recipient, please contact the sender via reply e-mail and destroy all copies of the original message. Thank you.
Re: Build times for Spark
AFAIK the resolver does pick up things from your local ~/.m2 -- note that as ~/.m2 is on NFS, that adds to the amount of filesystem traffic.

Shivaram

On Fri, Apr 25, 2014 at 2:57 PM, Williams, Ken ken.willi...@windlogics.com wrote: I am indeed, but it's a pretty fast NFS. I don't have any SSD I can use, but I could try to use local disk to see what happens. For me, a large portion of the time seems to be spent on lines like Resolving org.fusesource.jansi#jansi;1.4 ... or similar. Is this going out to find Maven resources? Any way to tell it to just use my local ~/.m2 repository instead when the resource already exists there? Sometimes I even get sporadic errors like this: [info] Resolving org.apache.hadoop#hadoop-yarn;2.2.0 ... [error] SERVER ERROR: Bad Gateway url= http://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-yarn-server/2.2.0/hadoop-yarn-server-2.2.0.jar -Ken *From:* Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] *Sent:* Friday, April 25, 2014 4:31 PM *To:* user@spark.apache.org *Subject:* Re: Build times for Spark Are you by any chance building this on NFS? As far as I know the build is severely bottlenecked by filesystem calls during assembly (each class file in each dependency gets an fstat call or something like that). That is partly why building from, say, a local ext4 filesystem or an SSD is much faster irrespective of memory / CPU. Thanks Shivaram On Fri, Apr 25, 2014 at 2:09 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can always increase the sbt memory by setting export JAVA_OPTS=-Xmx10g Thanks Best Regards On Sat, Apr 26, 2014 at 2:17 AM, Williams, Ken ken.willi...@windlogics.com wrote: No, I haven't done any config for SBT. Is there somewhere you might be able to point me toward for how to do that? -Ken *From:* Josh Rosen [mailto:rosenvi...@gmail.com] *Sent:* Friday, April 25, 2014 3:27 PM *To:* user@spark.apache.org *Subject:* Re: Build times for Spark Did you configure SBT to use the extra memory?
Re: Help with error initializing SparkR.
I just updated the github issue -- in case anybody is curious, this was a problem with R resolving the right Java version installed in the VM.

Thanks
Shivaram

On Sat, Apr 19, 2014 at 7:12 PM, tongzzz tongzhang...@gmail.com wrote: I can't initialize the sc context after a successful install on the Cloudera QuickStart VM. This is the error message:

library(SparkR)
Loading required package: rJava
[SparkR] Initializing with classpath /usr/lib64/R/library/SparkR/sparkr-assembly-0.1.jar
sc <- sparkR.init("local")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
  java.lang.NoClassDefFoundError: scala.collection.immutable.Vector

I also created an issue on github, which contains more details: https://github.com/amplab-extras/SparkR-pkg/issues/46 Thanks for any help.
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-error-initializing-SparkR-tp4495.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Creating a SparkR standalone job
Thanks for attaching the code. If I get your use case right, you want to call the sentiment analysis code from Spark Streaming, right? For that I think you can just use jvmr if that works, and I don't think you need SparkR. SparkR is mainly intended as an API for large scale jobs which are written in R. For this use case, where the job is written in Scala (or Java), you can create your SparkContext in Scala and then just call jvmr from, say, within a map function. The only other thing might be to figure out what the thread-safety model is for jvmr -- AFAIK R is single threaded, but we run tasks in multiple threads in Spark.

Thanks
Shivaram

On Sat, Apr 12, 2014 at 12:16 PM, pawan kumar pkv...@gmail.com wrote: Hi Shivaram, I was able to get R integrated into Spark using jvmr. Now I call R from Scala and pass values to the R function using Scala (Spark Streaming values). Attached the application. I can also call SparkR, but am not sure where to pass the Spark context in regards to the application attached. Any help would be greatly appreciated. You can run the application from the Scala IDE. Let me know if you have any difficulties running this application. Dependencies: having R installed on the machine. Thanks, Pawan Kumar Venugopal On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: You can create standalone jobs in SparkR as just R files that are run using the sparkR script. These commands will be sent to a Spark cluster and the examples on the SparkR repository ( https://github.com/amplab-extras/SparkR-pkg#examples-unit-tests ) are in fact standalone jobs. However I don't think that will completely solve your use case of using Streaming + R. We don't yet have a way to call R functions from Spark's Java or Scala API. So right now one thing you can try is to save data from Spark Streaming to HDFS and then run a SparkR job which reads in the file.
Regarding the other idea of calling R from Scala -- it might be possible to do that in your code if the classpath etc. is setup correctly. I haven't tried it out though, but do let us know if you get it to work. Thanks Shivaram On Mon, Apr 7, 2014 at 2:21 PM, pawan kumar pkv...@gmail.com wrote: Hi, Is it possible to create a standalone job in scala using sparkR? If possible can you provide me with the information of the setup process. (Like the dependencies in SBT and where to include the JAR files) This is my use-case: 1. I have a Spark Streaming standalone Job running in local machine which streams twitter data. 2. I have an R script which performs Sentiment Analysis. I am looking for an optimal way where I could combine these two operations into a single job and run using SBT Run command. I came across this document which talks about embedding R into scala ( http://dahl.byu.edu/software/jvmr/dahl-payne-uppalapati-2013.pdf) but was not sure if that would work well within the spark context. Thanks, Pawan Venugopal
Re: AWS Spark-ec2 script with different user
The AMI should automatically switch between PVM and HVM based on the instance type you specify on the command line. For reference (note you don't need to specify this on the command line), the PVM ami id is ami-5bb18832 in us-east-1. FWIW we maintain the list of AMI Ids (across regions and pvm, hvm) at https://github.com/mesos/spark-ec2/tree/v2/ami-list Thanks Shivaram On Wed, Apr 9, 2014 at 9:12 AM, Marco Costantini silvio.costant...@granatads.com wrote: Ah, tried that. I believe this is an HVM AMI? We are exploring paravirtual AMIs. On Wed, Apr 9, 2014 at 11:17 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: And for the record, that AMI is ami-35b1885c. Again, you don't need to specify it explicitly; spark-ec2 will default to it. On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Marco, If you call spark-ec2 launch without specifying an AMI, it will default to the Spark-provided AMI. Nick On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini silvio.costant...@granatads.com wrote: Hi there, To answer your question; no there is no reason NOT to use an AMI that Spark has prepared. The reason we haven't is that we were not aware such AMIs existed. Would you kindly point us to the documentation where we can read about this further? Many many thanks, Shivaram. Marco. On Tue, Apr 8, 2014 at 4:42 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Is there any reason why you want to start with a vanilla amazon AMI rather than the ones we build and provide as a part of Spark EC2 scripts ? The AMIs we provide are close to the vanilla AMI but have the root account setup properly and install packages like java that are used by Spark. If you wish to customize the AMI, you could always start with our AMI and add more packages you like -- I have definitely done this recently and it works with HVM and PVM as far as I can tell. 
Shivaram

On Tue, Apr 8, 2014 at 8:50 AM, Marco Costantini silvio.costant...@granatads.com wrote: I was able to keep the workaround ...around... by overwriting the generated '/root/.ssh/authorized_keys' file with a known good one, in the '/etc/rc.local' file. On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini silvio.costant...@granatads.com wrote: Another thing I didn't mention. The AMI and user used: naturally I've created several of my own AMIs with the following characteristics. None of them worked.

1) Enabling ssh as root as per this guide ( http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/ ). When doing this, I do not specify a user for the spark-ec2 script. What happens is that, it works! But only while it's alive. If I stop the instance, create an AMI, and launch a new instance based on the new AMI, the change I made in the '/root/.ssh/authorized_keys' file is overwritten.

2) Adding the 'ec2-user' to the 'root' group. This means that the ec2-user does not have to use sudo to perform any operations needing root privileges. When doing this, I specify the user 'ec2-user' for the spark-ec2 script. An error occurs: rsync fails with exit code 23.

I believe HVMs still work. But it would be valuable to the community to know that the root user work-around does/doesn't work any more for paravirtual instances. Thanks, Marco. On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini silvio.costant...@granatads.com wrote: As requested, here is the script I am running. It is a simple shell script which calls the spark-ec2 wrapper script. I execute it from the 'ec2' directory of Spark, as usual. The AMI used is the raw one from the AWS Quick Start section. It is the first option (an Amazon Linux paravirtual image). Any ideas or confirmation would be GREATLY appreciated. Please and thank you.
#!/bin/sh
export AWS_ACCESS_KEY_ID=MyCensoredKey
export AWS_SECRET_ACCESS_KEY=MyCensoredKey
AMI_ID=ami-2f726546
./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s 10 -v 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge launch marcotest

On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman shivaram.venkatara...@gmail.com wrote: Hmm -- that is strange. Can you paste the command you are using to launch the instances? The typical workflow is to use the spark-ec2 wrapper script, following the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html Shivaram On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini silvio.costant...@granatads.com wrote: Hi Shivaram, OK so let's assume the script CANNOT take a different user and that it must be 'root'. The typical workaround is, as you said, to allow ssh with the root user. Now, don't laugh, but, this worked last Friday, but today (Monday) it no longer works. :D Why? ... ...It seems that NOW, when you launch a 'paravirtual' ami, the root user's 'authorized_keys' file is always overwritten. This means the workaround doesn't
Re: AWS Spark-ec2 script with different user
Hmm -- that is strange. Can you paste the command you are using to launch the instances? The typical workflow is to use the spark-ec2 wrapper script, following the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html

Shivaram

On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini silvio.costant...@granatads.com wrote: Hi Shivaram, OK so let's assume the script CANNOT take a different user and that it must be 'root'. The typical workaround is, as you said, to allow ssh with the root user. Now, don't laugh, but, this worked last Friday, but today (Monday) it no longer works. :D Why? ... ...It seems that NOW, when you launch a 'paravirtual' ami, the root user's 'authorized_keys' file is always overwritten. This means the workaround doesn't work anymore! I would LOVE for someone to verify this. Just to point out, I am trying to make this work with a paravirtual instance and not an HVM instance. Please and thanks, Marco. On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman shivaram.venkatara...@gmail.com wrote: Right now the spark-ec2 scripts assume that you have root access, and a lot of internal scripts assume the user's home directory is hard coded as /root. However all the Spark AMIs we build should have root ssh access -- do you find this not to be the case? You can also enable root ssh access in a vanilla AMI by editing /etc/ssh/sshd_config and setting PermitRootLogin to yes. Thanks Shivaram On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini silvio.costant...@granatads.com wrote: Hi all, On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh. Also, it is the default user for the spark-ec2 script. Currently, the Amazon Linux images have an 'ec2-user' set up for ssh instead of 'root'. I can see that the spark-ec2 script allows you to specify which user to log in with, but even when I change this, the script fails for various reasons. And from the output it SEEMS that the script still assumes the specified user's home directory is '/root'.
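The sshd_config change mentioned above amounts to a single directive; a minimal sketch (the stock file may carry an existing PermitRootLogin line that needs changing rather than adding, and sshd must be restarted for it to take effect):

```
# /etc/ssh/sshd_config -- allow direct root logins, then restart sshd
PermitRootLogin yes
```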
Am I using this script wrong? Has anyone had success with this 'ec2-user' user? Any ideas? Please and thank you, Marco.
Re: Spark-ec2 setup is getting slower and slower
That is a good idea, though I am not sure how much it will help, as the time to rsync also depends on the data size being copied. The other problem is that sometimes we have dependencies across packages, so the first needs to be running before the second can start, etc. However I agree that it takes too long to launch, say, a 100 node cluster right now. If you want to take a shot at trying out some changes, you can fork the spark-ec2 repo at https://github.com/mesos/spark-ec2/tree/v2 and modify the number of rsync calls (each call to /root/spark-ec2/copy-dir launches an rsync now).

Thanks
Shivaram

On Sun, Mar 30, 2014 at 3:12 PM, Aureliano Buendia buendia...@gmail.com wrote: Hi, Spark-ec2 uses rsync to deploy many applications. It seems that over time more and more applications have been added to the script, which has significantly slowed down the setup time. Perhaps the script could be restructured this way: instead of rsyncing N times per application, we could have 1 rsync which deploys N applications. This should remarkably speed up the setup part, especially for clusters with many nodes.
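The proposed restructuring can be sketched as command construction. A Python sketch (hypothetical helper names and an illustrative directory list; the real spark-ec2 scripts are shell, and nothing is executed here -- rsync's --relative flag lets one invocation carry several source paths):

```python
# Contrast: one rsync subprocess per application directory per slave,
# versus a single rsync invocation that copies all directories at once.
# Only the command lists are built; no processes are launched.
def rsync_per_app(slave, dirs):
    return [["rsync", "-az", d, "root@%s:%s" % (slave, d)] for d in dirs]

def rsync_batched(slave, dirs):
    # --relative preserves each source path on the remote side,
    # so all directories can go in one invocation.
    return [["rsync", "-az", "--relative"] + dirs + ["root@%s:/" % slave]]

dirs = ["/root/spark", "/root/ephemeral-hdfs", "/root/ganglia"]
print(len(rsync_per_app("slave1", dirs)))  # 3 commands, 3 ssh round-trips
print(len(rsync_batched("slave1", dirs)))  # 1 command, 1 ssh round-trip
```

The saving comes from amortizing the per-invocation ssh connection setup, which dominates when each application's payload is small.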
Re: How many partitions is my RDD split into?
There is no direct way to get this in pyspark, but you can get it from the underlying Java RDD. For example:

a = sc.parallelize([1, 2, 3, 4], 2)
a._jrdd.splits().size()

On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Mark, This appears to be a Scala-only feature. :( Patrick, Are we planning to add this to PySpark? Nick On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra m...@clearstorydata.com wrote: It's much simpler: rdd.partitions.size On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Hey there fellow Dukes of Data, How can I tell how many partitions my RDD is split into? I'm interested in knowing because, from what I gather, having a good number of partitions is good for performance. If I'm looking to understand how my pipeline is performing, say for a parallelized write out to HDFS, knowing how many partitions an RDD has would be a good thing to check. Is that correct? I could not find an obvious method or property to see how my RDD is partitioned. Instead, I devised the following thingy:

def f(idx, itr): yield idx
rdd = sc.parallelize([1, 2, 3, 4], 4)
rdd.mapPartitionsWithIndex(f).count()

Frankly, I'm not sure what I'm doing here, but this seems to give me the answer I'm looking for. Derp. :) So in summary, should I care about how finely my RDDs are partitioned? And how would I check on that? Nick
--
View this message in context: How many partitions is my RDD split into? http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-tp3072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
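The counting trick from the thread can be simulated without a cluster. A plain-Python sketch of the semantics (the helper name is hypothetical; this is not the PySpark implementation):

```python
# Simulates mapPartitionsWithIndex: the user function receives each
# partition's index and an iterator over its elements. Yielding the
# index once per partition and counting the results therefore yields
# the number of partitions -- the trick used in the thread above.
def map_partitions_with_index(partitions, f):
    out = []
    for idx, part in enumerate(partitions):
        out.extend(f(idx, iter(part)))
    return out

partitions = [[1], [2], [3], [4]]  # 4 partitions of one element each

def f(idx, itr):
    yield idx  # one value emitted per partition

print(len(map_partitions_with_index(partitions, f)))  # 4
```

Counting the emitted indices recovers the partition count, which is exactly what rdd.mapPartitionsWithIndex(f).count() does on a real RDD.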
Re: Problem with SparkR
Hi

Thanks for reporting this. It'll be great if you can check a couple of things:
1. Are you trying to use this with Hadoop 2 by any chance? There was an incompatible ASM version bug that we fixed for Hadoop 2 ( https://github.com/amplab-extras/SparkR-pkg/issues/17 ) and we verified it, but I just want to check if the same error is cropping up again.
2. Is there a stack trace that follows the IncompatibleClassChangeError? If so, could you attach it? The error message indicates there is some incompatibility between class versions, and having a more detailed stack trace might help us track this down.

Thanks
Shivaram

On Sun, Mar 23, 2014 at 4:48 PM, Jacques Basaldúa jacq...@dybot.com wrote: I am really interested in using Spark from R and have tried to use SparkR, but always get the same error. This is how I installed:
- I successfully installed Spark version 0.9.0 with Scala 2.10.3 (OpenJDK 64-Bit Server VM, Java 1.7.0_45). I can run examples from spark-shell and Python.
- I installed the R package devtools and installed SparkR using:
  library(devtools)
  install_github("amplab-extras/SparkR-pkg", subdir="pkg")
This compiled the package successfully. When I try to run the package, e.g.:

library(SparkR)
sc <- sparkR.init(master="local")  ## so far the program runs fine
rdd <- parallelize(sc, 1:10)       ## this returns the following error
Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") :
  java.lang.IncompatibleClassChangeError: org/apache/spark/util/InnerClosureFinder

No matter how I try to use the sc (I have tried all the examples) I always get an error. Any ideas? Jacques.