Re: Spark R guidelines for non-spark functions and coxph (Cox Regression for Time-Dependent Covariates)

2016-11-15 Thread Shivaram Venkataraman
I think the answer to this depends on what granularity you want to run the algorithm on. If it's on the entire Spark DataFrame and you expect the data frame to be very large, then it isn't easy to use the existing R function. However if you want to run the algorithm on smaller subsets of the data
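The "smaller subsets" approach described here is plain split-apply-combine; a minimal single-machine sketch in Python, with a placeholder `fit_group` standing in for a real routine like R's coxph (the function and data names are illustrative, not from the original mail):

```python
# Split-apply-combine at the granularity discussed above: run an
# existing single-machine routine independently on each key's subset.
# fit_group is a stand-in for a real model-fitting call such as coxph.
from collections import defaultdict

def fit_group(rows):
    # Placeholder "model": just the mean of the observed values.
    values = [v for _, v in rows]
    return sum(values) / len(values)

def split_apply(rows):
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append((key, value))
    # In SparkR each group would be shipped to one R worker;
    # here we simply apply the function serially per group.
    return {k: fit_group(v) for k, v in groups.items()}

data = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
print(split_apply(data))  # {'a': 2.0, 'b': 10.0}
```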

Re: spark-ec2 scripts with spark-2.0.0-preview

2016-06-14 Thread Shivaram Venkataraman
Can you open an issue on https://github.com/amplab/spark-ec2 ? I think we should be able to escape the version string and pass the 2.0.0-preview through the scripts Shivaram On Tue, Jun 14, 2016 at 12:07 PM, Sunil Kumar wrote: > Hi, > > The spark-ec2 scripts are

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Shivaram Venkataraman
Overall this sounds good to me. One question I have is that in addition to the ML algorithms we have a number of linear algebra (various distributed matrices) and statistical methods in the spark.mllib package. Is the plan to port or move these to the spark.ml namespace in the 2.x series ? Thanks

Re: [SparkR] Any reason why saveDF's mode is append by default ?

2015-12-14 Thread Shivaram Venkataraman
I think it's just a bug -- I think we originally followed the Python API (in the original PR [1]) but the Python API seems to have been changed to match Scala / Java in https://issues.apache.org/jira/browse/SPARK-6366 Feel free to open a JIRA / PR for this. Thanks Shivaram [1]

Re: anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-11 Thread Shivaram Venkataraman
classpath? I verified the assembly was built right and its in the classpath > (else nothing would work). > > Thanks, > Tom > > > > On Tuesday, November 10, 2015 8:29 PM, Shivaram Venkataraman > <shiva...@eecs.berkeley.edu> wrote: > > > I think this is

Re: anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-10 Thread Shivaram Venkataraman
I think this is happening in the driver. Could you check the classpath of the JVM that gets started ? If you use spark-submit on yarn the classpath is set up before R gets launched, so it should match the behavior of Scala / Python. Thanks Shivaram On Fri, Nov 6, 2015 at 1:39 PM, Tom Graves

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Shivaram Venkataraman
It is a known limitation that spark-ec2 is very slow for large clusters and as you mention most of this is due to the use of rsync to transfer things from the master to all the slaves. Nick cc'd has been working on an alternative approach at https://github.com/nchammas/flintrock that is more

Re: how to run RStudio or RStudio Server on ec2 cluster?

2015-11-04 Thread Shivaram Venkataraman
RStudio should already be setup if you launch an EC2 cluster using spark-ec2. See http://blog.godatadriven.com/sparkr-just-got-better.html for details. Shivaram On Wed, Nov 4, 2015 at 5:11 PM, Andy Davidson wrote: > Hi > > I just set up a spark cluster on AWS ec2

Re: SparkR -Graphx

2015-08-06 Thread Shivaram Venkataraman
+Xiangrui I am not sure exposing the entire GraphX API would make sense as it contains a lot of low level functions. However we could expose some high level functions like PageRank etc. Xiangrui, who has been working on similar techniques to expose MLLib functions like GLM might have more to add.

Re: Broadcast variables in R

2015-07-21 Thread Shivaram Venkataraman
There shouldn't be anything Mac OS specific about this feature. One point of warning though -- As mentioned previously in this thread the APIs were made private because we aren't sure we will be supporting them in the future. If you are using these APIs it would be good to chime in on the JIRA

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-21 Thread Shivaram Venkataraman
FWIW I've run into similar BLAS related problems before and wrote up a document on how to do this for Spark EC2 clusters at https://github.com/amplab/ml-matrix/blob/master/EC2.md -- Note that this works with a vanilla Spark build (you only need to link to netlib-lgpl in your App) but requires the

Re: Including additional scala libraries in sparkR

2015-07-14 Thread Shivaram Venkataraman
There was a fix for `--jars` that went into 1.4.1 https://github.com/apache/spark/commit/2579948bf5d89ac2d822ace605a6a4afce5258d6 Shivaram On Tue, Jul 14, 2015 at 4:18 AM, Sun, Rui rui@intel.com wrote: Could you give more details about the mis-behavior of --jars for SparkR? maybe it's a

Re: sparkr-submit additional R files

2015-07-07 Thread Shivaram Venkataraman
You can just use `--files` and I think it should work. Let us know on https://issues.apache.org/jira/browse/SPARK-6833 if it doesn't work as expected. Thanks Shivaram On Tue, Jul 7, 2015 at 5:13 AM, Michał Zieliński zielinski.mich...@gmail.com wrote: Hi all, *spark-submit* for Python and

Re: JVM is not ready after 10 seconds

2015-07-06 Thread Shivaram Venkataraman
When I've seen this error before it has been due to the spark-submit file (i.e. `C:\spark-1.4.0\bin/bin/spark-submit.cmd`) not having execute permissions. You can try to set execute permission and see if it fixes things. Also we have a PR open to fix a related problem at

Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Shivaram Venkataraman
You need to add `-Psparkr` to build SparkR code Shivaram On Fri, Jul 3, 2015 at 2:14 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you try: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package Thanks Best Regards On Fri, Jul 3, 2015 at 2:27 PM,

Re: sparkR could not find function textFile

2015-07-01 Thread Shivaram Venkataraman
not scalable. 2015-06-25 13:59 GMT-07:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu: The `head` function is not supported for the RRDD that is returned by `textFile`. You can run `take(lines, 5L)`. I should add a warning here that the RDD API in SparkR is private because we might not support

Re: Calling MLLib from SparkR

2015-07-01 Thread Shivaram Venkataraman
The 1.4 release does not support calling MLLib from SparkR. We are working on it as a part of https://issues.apache.org/jira/browse/SPARK-6805 On Wed, Jul 1, 2015 at 4:23 PM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi, Does Spark 1.4 support calling MLLib directly from SparkR ?

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-30 Thread Shivaram Venkataraman
to find an inherited method for function ‘reduceByKey’ for signature ‘PipelinedRDD, character, numeric’* *End Code * On Fri, Jun 26, 2015 at 7:04 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: My workflow as to install RStudio on a cluster launched using Spark EC2 scripts

Re: [SparkR] Missing Spark APIs in R

2015-06-30 Thread Shivaram Venkataraman
venue where I would be able to follow the SparkR API progress? Thanks Pradeep On Mon, Jun 29, 2015 at 1:12 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: The RDD API is pretty complex and we are not yet sure we want to export all those methods in the SparkR API. We are working

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-30 Thread Shivaram Venkataraman
, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: The API exported in the 1.4 release is different from the one used in the 2014 demo. Please see the latest documentation at http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html or Chris's demo from Spark Summit at https

Re: [SparkR] Missing Spark APIs in R

2015-06-29 Thread Shivaram Venkataraman
The RDD API is pretty complex and we are not yet sure we want to export all those methods in the SparkR API. We are working towards exposing a more limited API in upcoming versions. You can find some more details in the recent Spark Summit talk at

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-27 Thread Shivaram Venkataraman
Thanks Mark for the update. For those interested Vincent Warmerdam also has some details on making the /root/spark installation work at https://issues.apache.org/jira/browse/SPARK-8596?focusedCommentId=14604328page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14604328

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread Shivaram Venkataraman
We don't have a documented way to use RStudio on EC2 right now. We have a ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss work-arounds and potential solutions for this. Thanks Shivaram On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark m...@redoakstrategic.com wrote: Good

Re: sparkR could not find function textFile

2015-06-25 Thread Shivaram Venkataraman
The `head` function is not supported for the RRDD that is returned by `textFile`. You can run `take(lines, 5L)`. I should add a warning here that the RDD API in SparkR is private because we might not support it in the upcoming releases. So if you can use the DataFrame API for your application you

Re: sparkR could not find function textFile

2015-06-25 Thread Shivaram Venkataraman
DataFrame from comma separated flat files, what would you recommend me to do? One way I can think of is first reading the data as you would do in r, using read.table(), and then create spark DataFrame out of that R dataframe, but it is obviously not scalable. 2015-06-25 13:59 GMT-07:00 Shivaram

Re: mllib from sparkR

2015-06-25 Thread Shivaram Venkataraman
Not yet - We are working on it as a part of https://issues.apache.org/jira/browse/SPARK-6805 and you can follow the JIRA for more information On Wed, Jun 24, 2015 at 2:30 AM, escardovi escard...@bitbang.com wrote: Hi, I was wondering if it is possible to use MLlib function inside SparkR, as

Re: spark1.4 sparkR usage

2015-06-25 Thread Shivaram Venkataraman
The Apache Spark API docs for SparkR https://spark.apache.org/docs/1.4.0/api/R/index.html represent what has been released with Spark 1.4. The AMPLab version is no longer under active development and I'd recommend users use the version in the Apache project. Thanks Shivaram On Thu, Jun 25,

Re: How to Map and Reduce in sparkR

2015-06-25 Thread Shivaram Venkataraman
In addition to Aleksander's point please let us know what use case would use RDD-like API in https://issues.apache.org/jira/browse/SPARK-7264 -- We are hoping to have a version of this API in upcoming releases. Thanks Shivaram On Thu, Jun 25, 2015 at 6:02 AM, Eskilson,Aleksander

Re: SparkR 1.4.0: read.df() function fails

2015-06-16 Thread Shivaram Venkataraman
The error you are running into is that the input file does not exist -- You can see it from the following line Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json Thanks Shivaram On Tue, Jun 16, 2015 at 1:55 AM, esten erik.stens...@dnvgl.com wrote: Hi, In SparkR

Re: How to read avro in SparkR

2015-06-13 Thread Shivaram Venkataraman
Yep - Burak's answer should work. FWIW the error message from the stack trace that shows this is the line Failed to load class for data source: avro Thanks Shivaram On Sat, Jun 13, 2015 at 6:13 PM, Burak Yavuz brk...@gmail.com wrote: Hi, Not sure if this is it, but could you please try

Re: includePackage() deprecated?

2015-06-04 Thread Shivaram Venkataraman
Yeah - We don't have support for running UDFs on DataFrames yet. There is an open issue to track this https://issues.apache.org/jira/browse/SPARK-6817 Thanks Shivaram On Thu, Jun 4, 2015 at 3:10 AM, Daniel Emaasit daniel.emaa...@gmail.com wrote: Hello Shivaram, Was the includePackage()

[ANNOUNCE] YARN support in Spark EC2

2015-06-03 Thread Shivaram Venkataraman
Hi all We recently merged support for launching YARN clusters using Spark EC2 scripts as a part of https://issues.apache.org/jira/browse/SPARK-3674. To use this you can pass `--hadoop-major-version=yarn` to the spark-ec2 script and this will set up Hadoop 2.4 HDFS, YARN and Spark built for YARN
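A hypothetical launch invocation following this announcement might look like the following (the key pair, identity file, and cluster name are placeholders, not from the original mail):

```shell
# Illustrative only: launch a YARN-enabled cluster with spark-ec2.
# my-key, my-key.pem, and yarn-cluster are placeholder names.
./spark-ec2 --key-pair=my-key \
            --identity-file=my-key.pem \
            --hadoop-major-version=yarn \
            launch yarn-cluster
```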

Re: SparkR Jobs Hanging in collectPartitions

2015-05-29 Thread Shivaram Venkataraman
, and would therefore be part of the memory managed by Spark, and that memory would only be moved to R as an R object following a collect(), take(), etc. Thanks, Alek Eskilson From: Shivaram Venkataraman shiva...@eecs.berkeley.edu Reply-To: shiva...@eecs.berkeley.edu shiva...@eecs.berkeley.edu

Re: SparkR Jobs Hanging in collectPartitions

2015-05-27 Thread Shivaram Venkataraman
Could you try to see which phase is causing the hang ? i.e. If you do a count() after flatMap does that work correctly ? My guess is that the hang is somehow related to data not fitting in the R process memory but it's hard to say without more diagnostic information. Thanks Shivaram On Tue, May

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Shivaram Venkataraman
and create a new DataFrame by zipping the results. It seems to work but when I try to save the RDD I got the following error : org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row On Mon, Mar 30, 2015 at 6:40 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-30 Thread Shivaram Venkataraman
One workaround could be to convert a DataFrame into an RDD inside the transform function, use mapPartitions/broadcast to work with the JNI calls, and then convert back to a DataFrame. Thanks Shivaram On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, I'm

Re: Using TF-IDF from MLlib

2015-03-16 Thread Shivaram Venkataraman
FWIW the JIRA I was thinking about is https://issues.apache.org/jira/browse/SPARK-3098 On Mon, Mar 16, 2015 at 6:10 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I vaguely remember that JIRA and AFAIK Matei's point was that the order is not guaranteed *after* a shuffle. If you

Re: Using TF-IDF from MLlib

2015-03-16 Thread Shivaram Venkataraman
I vaguely remember that JIRA and AFAIK Matei's point was that the order is not guaranteed *after* a shuffle. If you only use operations like map which preserve partitioning, ordering should be guaranteed from what I know. On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen so...@cloudera.com wrote: Dang

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-13 Thread Shivaram Venkataraman
On Tue, Mar 10, 2015 at 7:03 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: There are a couple of differences between the ml-matrix implementation and the one used in AMPCamp - I think the AMPCamp one uses JBLAS which tends to ship native BLAS libraries along with it. In ml-matrix

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-12 Thread Shivaram Venkataraman
from the same basis. What is the difference between these two codes ? On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: There are couple of solvers that I've written that is part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLLib yet

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-12 Thread Shivaram Venkataraman
On Thu, Mar 12, 2015 at 3:05 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: In fact, by activating netlib with native libraries it goes faster. Glad you got it to work! Better performance was one of the reasons we made the switch. Thanks On Tue, Mar 10, 2015 at 7:03 PM, Shivaram Venkataraman

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Shivaram Venkataraman
Section 3, 4, 5 in http://www.netlib.org/lapack/lawnspdf/lawn204.pdf is a good reference Shivaram On Mar 6, 2015 9:17 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Do you have a reference paper to the implemented algorithm in TSQR.scala ? On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-03 Thread Shivaram Venkataraman
There are a couple of solvers that I've written that are part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are interested in porting them I'd be happy to review it Thanks Shivaram [1]
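For reference, the objective in the subject line has a closed-form solution via the normal equations, (A^T A + lambda * n * I) x = A^T b. A minimal pure-Python sketch for tiny dense matrices follows; it is illustrative only, not the ml-matrix implementation:

```python
# Regularized least squares: min ||A x - b||^2 + lambda * n * ||x||^2
# solved in closed form via the normal equations
#   (A^T A + lambda * n * I) x = A^T b
# Pure Python, for tiny dense matrices only; a distributed solver
# is needed at the scale discussed in the thread.

def matmul_T(A, B):
    """Compute A^T B for A (n x d) and B (n x k)."""
    n, d, k = len(A), len(A[0]), len(B[0])
    return [[sum(A[i][r] * B[i][c] for i in range(n)) for c in range(k)]
            for r in range(d)]

def solve(M, y):
    """Solve M x = y by Gauss-Jordan elimination (M square, well-conditioned)."""
    d = len(M)
    aug = [row[:] + [y[i]] for i, row in enumerate(M)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(d):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][d] / aug[r][r] for r in range(d)]

def ridge(A, b, lam):
    n, d = len(A), len(A[0])
    AtA = matmul_T(A, A)
    Atb = [row[0] for row in matmul_T(A, [[v] for v in b])]
    for i in range(d):
        AtA[i][i] += lam * n  # add the lambda * n * I regularizer
    return solve(AtA, Atb)

# With lam = 0 and A square and invertible this reduces to A x = b.
A = [[1.0, 0.0], [0.0, 2.0]]
b = [3.0, 4.0]
print(ridge(A, b, 0.0))  # [3.0, 2.0]
```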

Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Shivaram Venkataraman
+Josh, who added the Job UI page. I've seen this as well and was a bit confused about what it meant. Josh, is there a specific scenario that creates these skipped stages in the Job UI ? Thanks Shivaram On Wed, Jan 7, 2015 at 12:32 PM, Corey Nolet cjno...@gmail.com wrote: Sorry- replace ###

Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Shivaram Venkataraman
Ah I see - So it's more like 're-used stages', which is not necessarily a bug in the program or something like that. Thanks for the pointer to the comment Thanks Shivaram On Wed, Jan 7, 2015 at 2:00 PM, Mark Hamstra m...@clearstorydata.com wrote: That's what you want to see. The computation of

Re: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Shivaram Venkataraman
Just to clarify, are you running the application using spark-submit after packaging with sbt package ? One thing that might help is to mark the Spark dependency as 'provided', since then you shouldn't have the Spark classes in your jar. Thanks Shivaram On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen
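The 'provided' scoping suggested here is a one-line change in an sbt build definition; a sketch, with an illustrative version number:

```scala
// build.sbt -- mark Spark as 'provided' so spark-core classes are not
// bundled into the application jar; spark-submit supplies them at
// runtime. The version number here is illustrative.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
```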

Re: Problem creating EC2 cluster using spark-ec2

2014-12-02 Thread Shivaram Venkataraman
+Andrew Actually I think this is because we haven't uploaded the Spark binaries to cloudfront / pushed the change to mesos/spark-ec2. Andrew, can you take care of this ? On Tue, Dec 2, 2014 at 5:11 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Interesting. Do you have any problems

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Shivaram Venkataraman
Can you clarify what is the Spark master URL you are using ? Is it 'local' or is it a cluster ? If it is 'local' then rebuilding Spark wouldn't help as Spark is getting pulled in from Maven and that'll just pick up the released artifacts. Shivaram On Mon, Nov 24, 2014 at 1:08 PM, agg212

Re: org/apache/commons/math3/random/RandomGenerator issue

2014-11-08 Thread Shivaram Venkataraman
I ran into this problem too and I know of a workaround but don't exactly know what is happening. The work around is to explicitly add either the commons math jar or your application jar (shaded with commons math) to spark.executor.extraClassPath. My hunch is that this is related to the class

Re: Matrix multiplication in spark

2014-11-05 Thread Shivaram Venkataraman
We are working on PRs to add block-partitioned matrix formats and dense matrix multiply methods. This should be out in the next few weeks or so. The sparse methods still need some research on partitioning schemes etc. and we will do that after the dense methods are in place. Thanks Shivaram On

Re: Spark Build

2014-10-31 Thread Shivaram Venkataraman
Yeah looks like https://github.com/apache/spark/pull/2744 broke the build. We will fix it soon On Fri, Oct 31, 2014 at 12:21 PM, Terry Siu terry@smartfocus.com wrote: I am synced up to the Spark master branch as of commit 23468e7e96. I have Maven 3.0.5, Scala 2.10.3, and SBT 0.13.1. I’ve

New SparkR mailing list, JIRA

2014-08-28 Thread Shivaram Venkataraman
Hi I'd like to announce a couple of updates to the SparkR project. In order to facilitate better collaboration for new features and development we have a new mailing list and issue tracker for SparkR. - The new JIRA is hosted at https://sparkr.atlassian.net/browse/SPARKR/ and we have migrated all

Re: SparkR: split, apply, combine strategy for dataframes?

2014-08-14 Thread Shivaram Venkataraman
Could you try increasing the number of slices with the large data set ? SparkR assumes that each slice (or partition in Spark terminology) can fit in memory of a single machine. Also is the error happening when you do the map function or does it happen when you combine the results ? Thanks

Re: Lost executors

2014-08-13 Thread Shivaram Venkataraman
If the JVM heap size is close to the memory limit the OS sometimes kills the process under memory pressure. I've usually found that lowering the executor memory size helps. Shivaram On Wed, Aug 13, 2014 at 11:01 AM, Matei Zaharia matei.zaha...@gmail.com wrote: What is your Spark executor

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Shivaram Venkataraman
If you just want to find the top eigenvalue / eigenvector you can do something like the Lanczos method. There is a description of a MapReduce based algorithm in Section 4.2 of [1] [1] http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf On Thu, Aug 7, 2014 at 10:54 AM, Li Pu
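As a single-machine illustration of the idea, power iteration (a simpler relative of the Lanczos method mentioned here) finds the top eigenpair using only matrix-vector products, which is exactly the operation that parallelizes well in MapReduce/Spark. A pure-Python sketch:

```python
# Power iteration: repeatedly apply the matrix and normalize.
# Converges to the dominant eigenpair of a symmetric matrix when the
# top eigenvalue is well separated. Lanczos is a more refined
# Krylov-subspace method built on the same matrix-vector core.
import math

def power_iteration(M, iters=200):
    d = len(M)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit starting vector
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient v^T M v gives the eigenvalue estimate.
    Mv = [sum(M[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigval = sum(v[i] * Mv[i] for i in range(d))
    return eigval, v

M = [[2.0, 0.0], [0.0, 1.0]]
val, vec = power_iteration(M)
print(round(val, 6))  # 2.0
```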

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-07 Thread Shivaram Venkataraman
I tried this out and what is happening here is that as the input file is small only 1 partition is created. lapplyPartition runs the given function on the partition and computes sumx as 55 and sumy as 55. Now the return value from lapplyPartition is treated as a list by SparkR and collect

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-06 Thread Shivaram Venkataraman
The output of lapply and lapplyPartition should be the same by design -- The only difference is that in lapply the user-defined function returns a row, while it returns a list in lapplyPartition. Could you give an example of a small input and the output that you expect to see for the above program ?

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Shivaram Venkataraman
This fails for me too. I have no idea why it happens as I can wget the pom from maven central. To work around this I just copied the ivy xmls and jars from this github repo https://github.com/peterklipfel/scala_koans/tree/master/ivyrepo/cache/org.scala-lang/scala-library and put it in

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Shivaram Venkataraman
Thanks Patrick -- It does look like some maven misconfiguration as wget http://repo1.maven.org/maven2/org/scala-lang/scala-library/2.10.2/scala-library-2.10.2.pom works for me. Shivaram On Fri, Aug 1, 2014 at 3:27 PM, Patrick Wendell pwend...@gmail.com wrote: This is a Scala bug - I filed

Re: sbt directory missed

2014-07-28 Thread Shivaram Venkataraman
I think the 1.0 AMI only contains the prebuilt packages (i.e just the binaries) of Spark and not the source code. If you want to build Spark on EC2, you can clone the github repo and then use sbt. Thanks Shivaram On Mon, Jul 28, 2014 at 8:49 AM, redocpot julien19890...@gmail.com wrote:

Re: BUG in spark-ec2 script (--ebs-vol-size) and workaround...

2014-07-18 Thread Shivaram Venkataraman
Thanks a lot for reporting this. I think we just missed installing xfsprogs on the AMI. I have a fix for this at https://github.com/mesos/spark-ec2/pull/59. After the pull request is merged, any new clusters launched should have mkfs.xfs Thanks Shivaram On Fri, Jul 18, 2014 at 4:56 PM, Ben

Re: SparkR failed to connect to the master

2014-07-15 Thread Shivaram Venkataraman
You'll need to build SparkR to match the Spark version deployed on the cluster. You can do that by changing the Spark version in SparkR's build.sbt [1]. If you are using the Maven build you'll need to edit pom.xml Thanks Shivaram [1]

Re: sparkR - is it possible to run sparkR on yarn?

2014-04-30 Thread Shivaram Venkataraman
We don't have any documentation on running SparkR on YARN and I think there might be some issues that need to be fixed (The recent PySpark on YARN PRs are an example). SparkR has only been tested to work with Spark standalone mode so far. Thanks Shivaram On Tue, Apr 29, 2014 at 7:56 PM,

Re: Build times for Spark

2014-04-25 Thread Shivaram Venkataraman
Are you by any chance building this on NFS ? As far as I know the build is severely bottlenecked by filesystem calls during assembly (each class file in each dependency gets an fstat call or something like that). That is partly why building from, say, a local ext4 filesystem or an SSD is much faster

Re: Build times for Spark

2014-04-25 Thread Shivaram Venkataraman
-server-2.2.0.jar -Ken *From:* Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] *Sent:* Friday, April 25, 2014 4:31 PM *To:* user@spark.apache.org *Subject:* Re: Build times for Spark Are you by any chance building this on NFS ? As far as I know the build is severely

Re: Help with error initializing SparkR.

2014-04-20 Thread Shivaram Venkataraman
I just updated the github issue -- In case anybody is curious, this was a problem with R resolving the right java version installed in the VM. Thanks Shivaram On Sat, Apr 19, 2014 at 7:12 PM, tongzzz tongzhang...@gmail.com wrote: I can't initialize sc context after a successful install on

Re: Creating a SparkR standalone job

2014-04-13 Thread Shivaram Venkataraman
R installed in the machine. Thank Pawan Kumar Venugopal On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: You can create standalone jobs in SparkR as just R files that are run using the sparkR script. These commands will be sent to a Spark cluster

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Shivaram Venkataraman
is that we were not aware such AMIs existed. Would you kindly point us to the documentation where we can read about this further? Many many thanks, Shivaram. Marco. On Tue, Apr 8, 2014 at 4:42 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Is there any reason why you want to start

Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Shivaram Venkataraman
instance and not an HVM instance. Please and thanks, Marco. On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman shivaram.venkatara...@gmail.com wrote: Right now the spark-ec2 scripts assume that you have root access and a lot of internal scripts assume have the user's home directory hard

Re: Spark-ec2 setup is getting slower and slower

2014-03-30 Thread Shivaram Venkataraman
That is a good idea, though I am not sure how much it will help as the time to rsync is also dependent just on the data size being copied. The other problem is that sometimes we have dependencies across packages, so the first needs to be running before the second can start etc. However I agree that it

Re: How many partitions is my RDD split into?

2014-03-24 Thread Shivaram Venkataraman
There is no direct way to get this in pyspark, but you can get it from the underlying Java RDD. For example: `a = sc.parallelize([1,2,3,4], 2)` followed by `a._jrdd.splits().size()`. On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Mark, This appears to be a Scala-only

Re: Problem with SparkR

2014-03-23 Thread Shivaram Venkataraman
Hi Thanks for reporting this. It'll be great if you can check a couple of things: 1. Are you trying to use this with Hadoop2 by any chance ? There was an incompatible ASM version bug that we fixed for Hadoop2 https://github.com/amplab-extras/SparkR-pkg/issues/17 and we verified it, but I just