Re: Spark R guidelines for non-spark functions and coxph (Cox Regression for Time-Dependent Covariates)

2016-11-15 Thread Shivaram Venkataraman
I think the answer to this depends on what granularity you want to run
the algorithm at. If it's on the entire Spark DataFrame and you expect
the data frame to be very large, then it isn't easy to reuse the
existing R function. However, if you want to run the algorithm on
smaller subsets of the data, you can look at the support for
user-defined functions (UDFs) we have in SparkR at
http://spark.apache.org/docs/latest/sparkr.html#applying-user-defined-function
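
For example, a rough sketch of that UDF route with Spark 2.0's gapply(),
fitting one Cox model per group with the survival package (the local data
frame surv_local and the columns study_site, time, status, age and sex are
hypothetical placeholders, and survival must be installed on every worker):

library(SparkR)

df <- createDataFrame(surv_local)   # surv_local: a local R data.frame

# Schema of the data.frame each UDF invocation returns
schema <- structType(structField("study_site", "string"),
                     structField("age_coef", "double"))

# One coxph() fit per study_site; each group has to fit in a single
# worker's memory, so this parallelizes across groups, not within a fit.
fits <- gapply(df, "study_site", function(key, x) {
  library(survival)               # loaded on the worker
  m <- coxph(Surv(time, status) ~ age + sex, data = x)
  data.frame(key, coef(m)[["age"]], stringsAsFactors = FALSE)
}, schema)

head(fits)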

Thanks
Shivaram

On Tue, Nov 15, 2016 at 3:56 AM, pietrop  wrote:
> Hi all,
> I'm writing here after some intensive usage on pyspark and SparkSQL.
> I would like to use a well known function in the R world: coxph() from the
> survival package.
> From what I understood, I can't parallelize a function like coxph() because
> it isn't provided with the SparkR package. In other words, I would have to
> implement a SparkR-compatible algorithm instead of using coxph().
> I have no way to make coxph() parallelizable, right?
> More generally, I think this is true for any non-Spark function that only
> accepts a data.frame as its input.
>
> Do you plan to implement the coxph() counterpart in Spark? The most useful
> version of this model is the Cox Regression Model for Time-Dependent
> Covariates, which is missing from ANY ML framework as far as I know.
>
> Thank you
>  Pietro
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-R-guidelines-for-non-spark-functions-and-coxph-Cox-Regression-for-Time-Dependent-Covariates-tp28077.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>




Re: spark-ec2 scripts with spark-2.0.0-preview

2016-06-14 Thread Shivaram Venkataraman
Can you open an issue on https://github.com/amplab/spark-ec2? I think
we should be able to escape the version string and pass
2.0.0-preview through the scripts.

Shivaram

On Tue, Jun 14, 2016 at 12:07 PM, Sunil Kumar
 wrote:
> Hi,
>
> The spark-ec2 scripts are missing from spark-2.0.0-preview. Is there a
> workaround available? I tried to change the ec2 scripts to accommodate
> spark-2.0.0... If I call the release spark-2.0.0-preview, then it barfs
> because the command-line argument --spark-version=spark-2.0.0-preview
> gets translated to spark-2.0.0-preiew (-v is taken as a switch)... If I call
> the release spark-2.0.0, then it can't find it in AWS, since it looks for
> http://s3.amazonaws.com/spark-related-packages/spark-2.0.0-bin-hadoop2.4.tgz
> instead of
> http://s3.amazonaws.com/spark-related-packages/spark-2.0.0-preview-bin-hadoop2.4.tgz
>
> Any ideas on how to make this work? How can I tweak/hack the code to look
> for spark-2.0.0-preview in spark-related-packages?
>
> thanks
>




Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Shivaram Venkataraman
Overall this sounds good to me. One question I have: in addition to
the ML algorithms, we have a number of linear algebra (various
distributed matrices) and statistical methods in the spark.mllib
package. Is the plan to port or move these to the spark.ml
namespace in the 2.x series?

Thanks
Shivaram

On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
> FWIW, all of that sounds like a good plan to me. Developing one API is
> certainly better than two.
>
> On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  wrote:
>> Hi all,
>>
>> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
>> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
>> been developed under the spark.ml package, while the old RDD-based API has
>> been developed in parallel under the spark.mllib package. While it was
>> easier to implement and experiment with new APIs under a new package, it
>> became harder and harder to maintain as both packages grew bigger and
>> bigger. And new users are often confused by having two sets of APIs with
>> overlapped functions.
>>
>> We started to recommend the DataFrame-based API over the RDD-based API in
>> Spark 1.5 for its versatility and flexibility, and we saw the development
>> and the usage gradually shifting to the DataFrame-based API. Just counting
>> the lines of Scala code, from 1.5 to the current master we added ~1
>> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
>> gather more resources on the development of the DataFrame-based API and to
>> help users migrate over sooner, I want to propose switching RDD-based MLlib
>> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>
>> * We do not accept new features in the RDD-based spark.mllib package, unless
>> they block implementing new features in the DataFrame-based spark.ml
>> package.
>> * We still accept bug fixes in the RDD-based API.
>> * We will add more features to the DataFrame-based API in the 2.x series to
>> reach feature parity with the RDD-based API.
>> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
>> the RDD-based API.
>> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>>
>> Though the RDD-based API is already in de facto maintenance mode, this
>> announcement will make it clear and hence important to both MLlib developers
>> and users. So we’d greatly appreciate your feedback!
>>
>> (As a side note, people sometimes use “Spark ML” to refer to the
>> DataFrame-based API or even the entire MLlib component. This also causes
>> confusion. To be clear, “Spark ML” is not an official name and there are no
>> plans to rename MLlib to “Spark ML” at this time.)
>>
>> Best,
>> Xiangrui
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>




Re: [SparkR] Any reason why saveDF's mode is append by default ?

2015-12-14 Thread Shivaram Venkataraman
I think it's just a bug -- we originally followed the Python
API (in the original PR [1]), but the Python API seems to have been
changed to match Scala / Java in
https://issues.apache.org/jira/browse/SPARK-6366

Feel free to open a JIRA / PR for this.
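
In the meantime, a workaround is to pass the mode explicitly instead of
relying on the default -- a minimal sketch, assuming an existing SQLContext
and using a placeholder path:

# df is a SparkR DataFrame, e.g. df <- createDataFrame(sqlContext, faithful)
# "error" matches the Scala/Java default and refuses to overwrite the path;
# "append" and "overwrite" are the other common modes.
write.df(df, path = "/tmp/faithful.parquet", source = "parquet", mode = "error")

# saveDF is the same operation under another name
saveDF(df, path = "/tmp/faithful.parquet", source = "parquet", mode = "overwrite")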

Thanks
Shivaram

[1] https://github.com/amplab-extras/SparkR-pkg/pull/199/files

On Sun, Dec 13, 2015 at 11:58 PM, Jeff Zhang  wrote:
> It is inconsistent with the Scala API, which uses "error" by default. Any
> reason for that? Thanks
>
>
>
> --
> Best Regards
>
> Jeff Zhang




Re: anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-11 Thread Shivaram Venkataraman
Nothing more -- The only two things I can think of are: (a) is there
something else on the classpath that comes before this lgpl JAR ? I've
seen cases where two versions of netlib-java on the classpath can mess
things up. (b) There is something about the way SparkR is using
reflection to invoke the ML Pipelines code that is breaking the BLAS
library discovery. I don't know of a good way to debug this yet
though.

Thanks
Shivaram

On Wed, Nov 11, 2015 at 5:55 AM, Tom Graves <tgraves...@yahoo.com> wrote:
> Is there anything other than the Spark assembly that needs to be in the
> classpath?  I verified the assembly was built right and it's in the classpath
> (else nothing would work).
>
> Thanks,
> Tom
>
>
>
> On Tuesday, November 10, 2015 8:29 PM, Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>
>
> I think this is happening in the driver. Could you check the classpath
> of the JVM that gets started ? If you use spark-submit on yarn the
> classpath is setup before R gets launched, so it should match the
> behavior of Scala / Python.
>
> Thanks
> Shivaram
>
> On Fri, Nov 6, 2015 at 1:39 PM, Tom Graves <tgraves...@yahoo.com.invalid>
> wrote:
>> I'm trying to use the netlib-java stuff with mllib and sparkR on yarn.
>> I've
>> compiled with -Pnetlib-lgpl, see the necessary things in the spark
>> assembly
>> jar.  The nodes have  /usr/lib64/liblapack.so.3, /usr/lib64/libblas.so.3,
>> and /usr/lib/libgfortran.so.3.
>>
>>
>> Running:
>> data <- read.df(sqlContext, 'data.csv', 'com.databricks.spark.csv')
>> mdl = glm(C2~., data, family="gaussian")
>>
>> But I get the error:
>> 15/11/06 21:17:27 WARN LAPACK: Failed to load implementation from:
>> com.github.fommil.netlib.NativeSystemLAPACK
>> 15/11/06 21:17:27 WARN LAPACK: Failed to load implementation from:
>> com.github.fommil.netlib.NativeRefLAPACK
>> 15/11/06 21:17:27 ERROR RBackendHandler: fitRModelFormula on
>> org.apache.spark.ml.api.r.SparkRWrappers failed
>> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>>  java.lang.AssertionError: assertion failed: lapack.dpotrs returned 18.
>>at scala.Predef$.assert(Predef.scala:179)
>>at
>>
>> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40)
>>at
>>
>> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:114)
>>at
>>
>> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:166)
>>at
>>
>> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:65)
>>at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>>at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>>at
>> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:138)
>>at
>> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>>
>> Anyone have this working?
>>
>> Thanks,
>> Tom
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>




Re: anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-10 Thread Shivaram Venkataraman
I think this is happening in the driver. Could you check the classpath
of the JVM that gets started ? If you use spark-submit on yarn the
classpath is setup before R gets launched, so it should match the
behavior of Scala / Python.

Thanks
Shivaram

On Fri, Nov 6, 2015 at 1:39 PM, Tom Graves  wrote:
> I'm trying to use the netlib-java stuff with mllib and sparkR on yarn. I've
> compiled with -Pnetlib-lgpl, see the necessary things in the spark assembly
> jar.  The nodes have  /usr/lib64/liblapack.so.3, /usr/lib64/libblas.so.3,
> and /usr/lib/libgfortran.so.3.
>
>
> Running:
> data <- read.df(sqlContext, 'data.csv', 'com.databricks.spark.csv')
> mdl = glm(C2~., data, family="gaussian")
>
> But I get the error:
> 15/11/06 21:17:27 WARN LAPACK: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemLAPACK
> 15/11/06 21:17:27 WARN LAPACK: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefLAPACK
> 15/11/06 21:17:27 ERROR RBackendHandler: fitRModelFormula on
> org.apache.spark.ml.api.r.SparkRWrappers failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.AssertionError: assertion failed: lapack.dpotrs returned 18.
>at scala.Predef$.assert(Predef.scala:179)
> at
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40)
> at
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:114)
> at
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:166)
> at
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:65)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
> at
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:138)
> at
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>
> Anyone have this working?
>
> Thanks,
> Tom




Re: Spark EC2 script on Large clusters

2015-11-05 Thread Shivaram Venkataraman
It is a known limitation that spark-ec2 is very slow for large
clusters and as you mention most of this is due to the use of rsync to
transfer things from the master to all the slaves.

Nick cc'd has been working on an alternative approach at
https://github.com/nchammas/flintrock that is more scalable.

Thanks
Shivaram

On Thu, Nov 5, 2015 at 8:12 AM, Christian  wrote:
> For starters, thanks for the awesome product!
>
> When creating EC2 clusters of 20-40 nodes, things work great. When we create
> a cluster with the provided spark-ec2 script, it takes hours: creating
> a 200-node cluster takes 2 1/2 hours, and a 500-node cluster takes
> over 5 hours. One other problem we are having is that some nodes don't come
> up when the other ones do; the process seems to just move on, skipping the
> rsync and any installs on those nodes.
>
> My guess as to why it takes so long to set up a large cluster is the use of
> rsync. What if, instead of using rsync, you synced to S3 and then
> did a pdsh to pull it down on all of the machines? This is a big deal for us,
> and if we can come up with a good plan, we might be able to help out with the
> required changes.
>
> Are there any suggestions on how to deal with some of the nodes not being
> ready when the process starts?
>
> Thanks for your time,
> Christian
>




Re: how to run RStudio or RStudio Server on ec2 cluster?

2015-11-04 Thread Shivaram Venkataraman
RStudio should already be setup if you launch an EC2 cluster using
spark-ec2. See http://blog.godatadriven.com/sparkr-just-got-better.html
for details.

Shivaram

On Wed, Nov 4, 2015 at 5:11 PM, Andy Davidson
 wrote:
> Hi
>
> I just set up a Spark cluster on AWS EC2.  In the past I have done a
> lot of work using RStudio on my local machine. bin/sparkR looks interesting;
> however, it looks like you just get an R command-line interpreter. Does
> anyone have any experience using something like RStudio or RStudio Server
> with SparkR?
>
>
> In an ideal world I would like to use some sort of IDE for R running on my
> local machine connected to my spark cluster.
>
> In Spark 1.3.1 it was easy to get a similar environment working for Python:
> I would use pyspark to start an IPython notebook server on my cluster
> master, then set up an SSH tunnel on my local machine to connect a browser
> on my local machine to the notebook server.
>
> Any comments or suggestions would be greatly appreciated
>
> Andy
>




Re: SparkR -Graphx

2015-08-06 Thread Shivaram Venkataraman
+Xiangrui

I am not sure exposing the entire GraphX API would make sense, as it
contains a lot of low-level functions. However, we could expose some
high-level functions like PageRank. Xiangrui, who has been working
on similar techniques to expose MLlib functions like GLM, might have
more to add.

Thanks
Shivaram

On Thu, Aug 6, 2015 at 6:21 AM, smagadi sudhindramag...@fico.com wrote:
 Wanted to use GraphX from SparkR; is there a way to do it? I think as
 of now it is not possible. I was thinking one could write a wrapper in R
 that calls the Scala GraphX libraries.
 Any thoughts on this, please.



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-Graphx-tp24152.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Broadcast variables in R

2015-07-21 Thread Shivaram Venkataraman
There shouldn't be anything Mac OS specific about this feature. One point
of warning though -- As mentioned previously in this thread the APIs were
made private because we aren't sure we will be supporting them in the
future. If you are using these APIs it would be good to chime in on the
JIRA with your use-case
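
For reference, a minimal sketch of what this looks like through the private
API today, assuming the broadcast()/value() helpers carried over from the
AMPLab package are still present under the SparkR::: namespace in your build
(they may change or disappear in later releases):

library(SparkR)
sc <- sparkR.init(master = "local[2]")

# Create a broadcast variable from a local R object and read it back on the
# driver; both helpers are private, hence the ::: prefix.
lookup <- list(a = 1, b = 2, c = 3)
bc <- SparkR:::broadcast(sc, lookup)
SparkR:::value(bc)$b   # 2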

Thanks
Shivaram

On Tue, Jul 21, 2015 at 2:34 AM, Serge Franchois serge.franch...@altran.com
 wrote:

 I might add to this that I've done the same exercise on Linux (CentOS 6),
 and there broadcast variables ARE working. Is this functionality perhaps not
 exposed on Mac OS X? Or does it have to do with the fact that there are no
 native Hadoop libs for Mac?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Broadcast-variables-in-R-tp23914p23927.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-21 Thread Shivaram Venkataraman
FWIW I've run into similar BLAS related problems before and wrote up a
document on how to do this for Spark EC2 clusters at
https://github.com/amplab/ml-matrix/blob/master/EC2.md -- Note that this
works with a vanilla Spark build (you only need to link to netlib-lgpl in
your App) but requires the app jar to be present on all the machines.

Thanks
Shivaram

On Tue, Jul 21, 2015 at 7:37 AM, Arun Ahuja aahuj...@gmail.com wrote:

 Yes, I imagine it's the driver's classpath - I'm pulling those
 screenshots straight from the Spark UI environment page.  Is there
 somewhere else to grab the executor classpath?

 Also, the warning is only printed once, so it's also not clear whether
 the warning is from the driver or the executor - would you know?

 Thanks,
 Arun

 On Tue, Jul 21, 2015 at 7:52 AM, Sean Owen so...@cloudera.com wrote:

 Great, and that file exists on HDFS and is world readable? just
 double-checking.

 What classpath is this -- your driver or executor? this is the driver,
 no? I assume so just because it looks like it references the assembly you
 built locally and from which you're launching the driver.

 I think we're concerned with the executors and what they have on the
 classpath. I suspect there is still a problem somewhere in there.

 On Mon, Jul 20, 2015 at 4:59 PM, Arun Ahuja aahuj...@gmail.com wrote:

 Cool, I tried that as well, and doesn't seem different:

 spark.yarn.jar seems set

 [image: Inline image 1]

 This actually doesn't change the classpath, not sure if it should:

 [image: Inline image 3]

 But same netlib warning.

 Thanks for the help!
 - Arun

 On Fri, Jul 17, 2015 at 3:18 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Can you try setting the spark.yarn.jar property to make sure it points
 to the jar you're thinking of?

 -Sandy

 On Fri, Jul 17, 2015 at 11:32 AM, Arun Ahuja aahuj...@gmail.com
 wrote:

 Yes, it's a YARN cluster and using spark-submit to run.  I have
 SPARK_HOME set to the directory above and using the spark-submit script
 from there.

 bin/spark-submit --master yarn-client --executor-memory 10g 
 --driver-memory 8g --num-executors 400 --executor-cores 1 --class 
 org.hammerlab.guacamole.Guacamole --conf spark.default.parallelism=4000 
 --conf spark.storage.memoryFraction=0.15

 ​

 libgfortran.so.3 is also there

 ls  /usr/lib64/libgfortran.so.3
 /usr/lib64/libgfortran.so.3

 These are jniloader files in the jar

 jar tf 
 /hpc/users/ahujaa01/src/spark/assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar
  | grep jniloader
 META-INF/maven/com.github.fommil/jniloader/
 META-INF/maven/com.github.fommil/jniloader/pom.xml
 META-INF/maven/com.github.fommil/jniloader/pom.properties

 ​

 Thanks,
 Arun

 On Fri, Jul 17, 2015 at 1:30 PM, Sean Owen so...@cloudera.com wrote:

 Make sure /usr/lib64 contains libgfortran.so.3; that's really the
 issue.

 I'm pretty sure the answer is 'yes', but, make sure the assembly has
 jniloader too. I don't see why it wouldn't, but, that's needed.

 What is your env like -- local, standalone, YARN? how are you running?
 Just want to make sure you are using this assembly across your
 cluster.

 On Fri, Jul 17, 2015 at 6:26 PM, Arun Ahuja aahuj...@gmail.com
 wrote:

 Hi Sean,

 Thanks for the reply! I did double-check that the jar is one I think
 I am running:

 [image: Inline image 2]

 jar tf 
 /hpc/users/ahujaa01/src/spark/assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar
  | grep netlib | grep Native
 com/github/fommil/netlib/NativeRefARPACK.class
 com/github/fommil/netlib/NativeRefBLAS.class
 com/github/fommil/netlib/NativeRefLAPACK.class
 com/github/fommil/netlib/NativeSystemARPACK.class
 com/github/fommil/netlib/NativeSystemBLAS.class
 com/github/fommil/netlib/NativeSystemLAPACK.class

 Also, I checked the gfortran version on the cluster nodes and it is
 available and is 5.1

 $ gfortran --version
 GNU Fortran (GCC) 5.1.0
 Copyright (C) 2015 Free Software Foundation, Inc.

 and still see:

 15/07/17 13:20:53 WARN BLAS: Failed to load implementation from: 
 com.github.fommil.netlib.NativeSystemBLAS
 15/07/17 13:20:53 WARN BLAS: Failed to load implementation from: 
 com.github.fommil.netlib.NativeRefBLAS
 15/07/17 13:20:53 WARN LAPACK: Failed to load implementation from: 
 com.github.fommil.netlib.NativeSystemLAPACK
 15/07/17 13:20:53 WARN LAPACK: Failed to load implementation from: 
 com.github.fommil.netlib.NativeRefLAPACK

 ​

 Does anything need to be adjusted in my application POM?

 Thanks,
 Arun

 On Thu, Jul 16, 2015 at 5:26 PM, Sean Owen so...@cloudera.com
 wrote:

 Yes, that's most of the work, just getting the native libs into the
 assembly. netlib can find them from there even if you don't have
 BLAS
 libs on your OS, since it includes a reference implementation as a
 fallback.

 One common reason it won't load is not having libgfortran installed
 on
 your OSes though. It has to be 4.6+ too. That can't be shipped even
 in
 netlib and has to exist on your hosts.

 The other 

Re: Including additional scala libraries in sparkR

2015-07-14 Thread Shivaram Venkataraman
There was a fix for `--jars` that went into 1.4.1
https://github.com/apache/spark/commit/2579948bf5d89ac2d822ace605a6a4afce5258d6

Shivaram

On Tue, Jul 14, 2015 at 4:18 AM, Sun, Rui rui@intel.com wrote:

 Could you give more details about the mis-behavior of --jars for SparkR?
 maybe it's a bug.
 
 From: Michal Haris [michal.ha...@visualdna.com]
 Sent: Tuesday, July 14, 2015 5:31 PM
 To: Sun, Rui
 Cc: Michal Haris; user@spark.apache.org
 Subject: Re: Including additional scala libraries in sparkR

 OK, thanks. It seems that --jars is not behaving as expected - I'm getting
 "class not found" for even the most simple object from my lib. But anyway, I
 have to do at least a filter transformation before collecting the HBaseRDD
 into R, so I will have to go the route of using the Scala spark-shell to
 transform, collect and save into the local filesystem, and then visualise the
 file with R until custom RDD transformations are exposed in the SparkR API.

 On 13 July 2015 at 10:27, Sun, Rui rui@intel.com wrote:
 Hi, Michal,

 SparkR comes with a JVM backend that supports Java object instantiation and
 calling Java instance and static methods from the R side. As defined in
 https://github.com/apache/spark/blob/master/R/pkg/R/backend.R,
 newJObject() is to create an instance of a Java class;
 callJMethod() is to call an instance method of a Java object;
 callJStatic() is to call a static method of a Java class.

 If the thing is as simple as data visualization, you can use the above
 low-level functions to create an instance of your HBase RDD on the JVM side,
 collect the data to the R side, and visualize it.
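
For illustration, calling these on plain JDK classes looks roughly like the
following (StringBuilder is just a stand-in for whatever classes your own
library exposes, and these helpers are private, unexported APIs):

library(SparkR)
sc <- sparkR.init(master = "local[2]")   # starts the backend JVM

# callJStatic: invoke a static method of a Java class
now <- SparkR:::callJStatic("java.lang.System", "currentTimeMillis")

# newJObject / callJMethod: construct a Java object, then call instance methods
sb <- SparkR:::newJObject("java.lang.StringBuilder", "hello ")
SparkR:::callJMethod(sb, "append", "world")
SparkR:::callJMethod(sb, "toString")   # "hello world"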

 However, if you want to do HBase RDD transformation and HBase table
 update, things are quite complex now. SparkR supports the majority of the RDD
 API (though it is not exposed publicly in the 1.4 release), allowing
 transformation functions written in R code, but currently it only supports
 RDD sources from text files and SparkR DataFrames, so your HBase RDDs can't
 be used by the SparkR RDD API for further processing.

 You can use --jars to include your scala library to be accessed by the JVM
 backend.

 
 From: Michal Haris [michal.ha...@visualdna.com]
 Sent: Sunday, July 12, 2015 6:39 PM
 To: user@spark.apache.org
 Subject: Including additional scala libraries in sparkR

 I have a Spark program with a custom optimised RDD for HBase scans and
 updates. I have a small library of objects in Scala to support efficient
 serialisation, partitioning etc. I would like to use R as an analysis and
 visualisation front-end. I have tried to use rJava (i.e. not using SparkR)
 and I got as far as initialising the Spark context, but I encountered
 problems with HBase dependencies (HBaseConfiguration: Unsupported
 major.minor version 51.0), so I tried SparkR, but I can't figure out how to
 make my custom Scala classes available to SparkR other than re-implementing
 them in R. Is there a way to include and invoke additional Scala objects
 and RDDs within a SparkR shell/job? Something similar to additional jars and
 an init script in normal spark-submit/spark-shell..

 --
 Michal Haris
 Technical Architect
 direct line: +44 (0) 207 749 0229
 www.visualdna.com |
 t: +44 (0) 207 734 7033
 31 Old Nichol Street
 London
 E2 7HR



 --
 Michal Haris
 Technical Architect
 direct line: +44 (0) 207 749 0229
 www.visualdna.com | t: +44 (0) 207 734 7033
 31 Old Nichol Street
 London
 E2 7HR

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: sparkr-submit additional R files

2015-07-07 Thread Shivaram Venkataraman
You can just use `--files` and I think it should work. Let us know on
https://issues.apache.org/jira/browse/SPARK-6833 if it doesn't work as
expected.

Thanks
Shivaram

On Tue, Jul 7, 2015 at 5:13 AM, Michał Zieliński 
zielinski.mich...@gmail.com wrote:

 Hi all,

 spark-submit for Python and Java/Scala has --py-files and --jars
 options for submitting additional files on top of the main application. Is
 there any such option for sparkr-submit? I know that there is an
 includePackage() R function to add library dependencies, but can you add
 other sources that are not R libraries (e.g. additional code repositories)?

 I really appreciate your help.

 Thanks,
 Michael



Re: JVM is not ready after 10 seconds

2015-07-06 Thread Shivaram Venkataraman
When I've seen this error before it has been due to the spark-submit file
(i.e. `C:\spark-1.4.0\bin/bin/spark-submit.cmd`) not having execute
permissions. You can try to set execute permission and see if it fixes
things.

Also we have a PR open to fix a related problem at
https://github.com/apache/spark/pull/7025. If you can test the PR that will
also be very helpful

Thanks
Shivaram

On Mon, Jul 6, 2015 at 6:11 PM, ashishdutt ashish.du...@gmail.com wrote:

 Hi,

 I am trying to connect a worker to the master. The spark master is on
 cloudera manager and I know the master IP address and port number.
 I downloaded the spark binary for CDH4 on the worker machine and then when
 I
 try to invoke the command
  sc = sparkR.init(master="ip address:port number") I get the following
  error.

  sc=sparkR.init(master="spark://10.229.200.250:7377")
 Launching java with spark-submit command
 C:\spark-1.4.0\bin/bin/spark-submit.cmd  sparkr-shell
 C:\Users\ASHISH~1\AppData\Local\Temp\Rtmp82kCxH\backend_port4281739d85
 Error in sparkR.init(master = "spark://10.229.200.250:7377") :
   JVM is not ready after 10 seconds
 In addition: Warning message:
 running command 'C:\spark-1.4.0\bin/bin/spark-submit.cmd  sparkr-shell
 C:\Users\ASHISH~1\AppData\Local\Temp\Rtmp82kCxH\backend_port4281739d85' had
 status 127


 I am using windows 7 as the OS on the worker machine and I am invoking the
 sparkR.init() from RStudio

 Any help in this reference will be appreciated

 Thank you,
 Ashish Dutt



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/JVM-is-not-ready-after-10-seconds-tp23658.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Shivaram Venkataraman
You need to add -Psparkr to build SparkR code

Shivaram

On Fri, Jul 3, 2015 at 2:14 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Did you try:

 build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package



 Thanks
 Best Regards

 On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com 1106944...@qq.com
 wrote:

 Hi all,
    Has anyone built the Spark 1.4 source code for SparkR with Maven/SBT? What
 is the command? To use SparkR, one must build it from the 1.4 source code.
 Thank you

 --
 1106944...@qq.com





Re: sparkR could not find function textFile

2015-07-01 Thread Shivaram Venkataraman
You can check my comment below the answer at
http://stackoverflow.com/a/30959388/4577954. BTW, we added a new option to
sparkR.init to pass in packages, and that should be part of 1.5.
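
For reference, in 1.5+ this should look roughly like the following (a sketch
assuming the new argument is named sparkPackages, as in the linked answer;
the package coordinates and file name are just examples):

library(SparkR)

# Third-party packages can be passed straight to sparkR.init(), so RStudio
# users no longer have to launch through the sparkR shell script.
sc <- sparkR.init(master = "local[2]",
                  sparkPackages = "com.databricks:spark-csv_2.11:1.1.0")
sqlContext <- sparkRSQL.init(sc)

cars <- read.df(sqlContext, "cars.csv",
                source = "com.databricks.spark.csv", header = "true")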

Shivaram

On Wed, Jul 1, 2015 at 10:03 AM, Sourav Mazumder 
sourav.mazumde...@gmail.com wrote:

 Hi,

 Piggybacking on this discussion.

 I'm trying to achieve the same - reading a CSV file - from RStudio. Where
 I'm stuck is how to supply an additional package from RStudio to
 sparkR.init(), as sparkR.init() does not provide an option to specify
 additional packages.

 I tried the following code from RStudio. It gives me the error: Error in
 callJMethod(sqlContext, "load", source, options) :
   Invalid jobj 1. If SparkR was restarted, Spark operations need to be
 re-executed.

 --
 Sys.setenv(SPARK_HOME="C:\\spark-1.4.0-bin-hadoop2.6")
 .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
 library(SparkR)

 sparkR.stop()

 sc <- sparkR.init(master="local[2]", sparkEnvir =
 list(spark.executor.memory="1G"),
 sparkJars="C:\\spark-1.4.0-bin-hadoop2.6\\lib\\spark-csv_2.11-1.1.0.jar")
 /* I have downloaded this spark-csv jar and kept it in lib folder of Spark
 */

 sqlContext <- sparkRSQL.init(sc)

 plutoMN <- read.df(sqlContext,
 "C:\\Users\\Sourav\\Work\\SparkDataScience\\PlutoMN.csv", source =
 "com.databricks.spark.csv")

 --

 However, I also tried this from the shell as 'sparkR --package
 com.databricks:spark-csv_2.11:1.1.0'. This time I used the following code
 and it all works fine.

 sqlContext <- sparkRSQL.init(sc)

 plutoMN <- read.df(sqlContext,
 "C:\\Users\\Sourav\\Work\\SparkDataScience\\PlutoMN.csv", source =
 "com.databricks.spark.csv")

 Any idea how to achieve the same from RStudio ?

 Regards,




 On Thu, Jun 25, 2015 at 2:38 PM, Wei Zhou zhweisop...@gmail.com wrote:

 I tried out the solution using spark-csv package, and it worked fine now
 :) Thanks. Yes, I'm playing with a file with all columns as String, but the
 real data I want to process are all doubles. I'm just exploring what sparkR
 can do versus regular scala spark, as I am by heart a R person.

 2015-06-25 14:26 GMT-07:00 Eskilson,Aleksander alek.eskil...@cerner.com
 :

  Sure, I had a similar question that Shivaram was able to answer quickly for
 me; the solution is implemented using a separate Databricks library. Check out
 this thread from the email archives [1], and the read.df() command [2]. CSV
 files can be a bit tricky, especially with inferring their schemas. Are you
 using just strings as your column types right now?

  Alek

  [1] --
 http://apache-spark-developers-list.1001551.n3.nabble.com/CSV-Support-in-SparkR-td12559.html
 [2] -- https://spark.apache.org/docs/latest/api/R/read.df.html

   From: Wei Zhou zhweisop...@gmail.com
 Date: Thursday, June 25, 2015 at 4:15 PM
 To: shiva...@eecs.berkeley.edu shiva...@eecs.berkeley.edu
 Cc: Aleksander Eskilson alek.eskil...@cerner.com, 
 user@spark.apache.org user@spark.apache.org
 Subject: Re: sparkR could not find function textFile

   Thanks to both Shivaram and Alek. Then if I want to create DataFrame
 from comma separated flat files, what would you recommend me to do? One way
 I can think of is first reading the data as you would do in r, using
 read.table(), and then create spark DataFrame out of that R dataframe, but
 it is obviously not scalable.


 2015-06-25 13:59 GMT-07:00 Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu:

 The `head` function is not supported for the RRDD that is returned by
 `textFile`. You can run `take(lines, 5L)`. I should add a warning here that
 the RDD API in SparkR is private because we might not support it in the
 upcoming releases. So if you can use the DataFrame API for your application
 you should try that out.

  Thanks
  Shivaram

 On Thu, Jun 25, 2015 at 1:49 PM, Wei Zhou zhweisop...@gmail.com
 wrote:

 Hi Alek,

  Just a follow up question. This is what I did in sparkR shell:

   lines <- SparkR:::textFile(sc, "./README.md")
  head(lines)

  And I am getting error:

 Error in x[seq_len(n)] : object of type 'S4' is not subsettable

 I'm wondering what did I do wrong. Thanks in advance.

 Wei

 2015-06-25 13:44 GMT-07:00 Wei Zhou zhweisop...@gmail.com:

 Hi Alek,

  Thanks for the explanation, it is very helpful.

  Cheers,
 Wei

 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander 
 alek.eskil...@cerner.com:

  Hi there,

  The tutorial you’re reading there was written before the merge of
 SparkR for Spark 1.4.0
 For the merge, the RDD API (which includes the textFile() function)
 was made private, as the devs felt many of its functions were too low
 level. They focused instead on finishing the DataFrame API which 
 supports
 local, HDFS, and Hive/HBase file reads. In the meantime, the devs are
 trying to determine which functions of the RDD API, if any, should be 
 made
 public again. You can see the rationale behind this decision on the 
 issue’s
 JIRA [1].

  You can still make use of those now private RDD functions by
 prepending the function call with the SparkR private

Re: Calling MLLib from SparkR

2015-07-01 Thread Shivaram Venkataraman
The 1.4 release does not support calling MLLib from SparkR. We are working
on it as a part of https://issues.apache.org/jira/browse/SPARK-6805

On Wed, Jul 1, 2015 at 4:23 PM, Sourav Mazumder sourav.mazumde...@gmail.com
 wrote:

 Hi,

 Does Spark 1.4 support calling MLLib directly from SparkR ?

 If not, is there any work around, any example available somewhere ?

 Regards,
 Sourav



Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-30 Thread Shivaram Venkataraman
The API exported in the 1.4 release is different from the one used in the
2014 demo. Please see the latest documentation at
http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html or
Chris's demo from Spark Summit at
https://spark-summit.org/2015/events/a-data-frame-abstraction-layer-for-sparkr/

Thanks
Shivaram

On Tue, Jun 30, 2015 at 7:40 AM, Nicholas Sharkey nicholasshar...@gmail.com
 wrote:

 Good morning Sivaram,

 I believe I have our setup close but I'm getting an error on the last step
 of the word count example from the Spark Summit
 https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
 .

 Off the top of your head can you think of where this error (below, and
 attached) is coming from? I can get into the details of how I setup this
 machine if needed, but wanted to keep the initial question short.

 Thanks.

  Begin Code

  library(SparkR)

  # sc <- sparkR.init("local[2]")
  sc <-
  sparkR.init("http://ec2-54-171-173-195.eu-west-1.compute.amazonaws.com:[2]")

  lines <- textFile(sc, "mytextfile.txt") # hi hi all all all one one one one

  words <- flatMap(lines,
                   function(line){
                     strsplit(line, " ")[[1]]
                   })

  wordcount <- lapply(words,
                      function(word){
                        list(word, 1)
                      })

  counts <- reduceByKey(wordcount, "+", numPartitions=2)

  # Error in (function (classes, fdef, mtable)  :
  #   unable to find an inherited method for function
  #   ‘reduceByKey’ for signature ‘PipelinedRDD, character, numeric’


  End Code

 On Fri, Jun 26, 2015 at 7:04 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

  My workflow was to install RStudio on a cluster launched using the Spark EC2
  scripts. However, I did a bunch of tweaking after that (like copying the
 spark installation over etc.). When I get some time I'll try to write the
 steps down in the JIRA.

 Thanks
 Shivaram


 On Fri, Jun 26, 2015 at 10:21 AM, m...@redoakstrategic.com wrote:

 So you created an EC2 instance with RStudio installed first, then
 installed Spark under that same username?  That makes sense, I just want to
 verify your work flow.

 Thank you again for your willingness to help!



 On Fri, Jun 26, 2015 at 10:13 AM -0700, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

  I was using RStudio on the master node of the same cluster in the
 demo. However I had installed Spark under the user `rstudio` (i.e.
 /home/rstudio) and that will make the permissions work correctly. You will
 need to copy the config files from /root/spark/conf after installing Spark
 though and it might need some more manual tweaks.

 Thanks
 Shivaram

 On Fri, Jun 26, 2015 at 9:59 AM, Mark Stephenson 
 m...@redoakstrategic.com wrote:

 Thanks!

 In your demo video, were you using RStudio to hit a separate EC2 Spark
 cluster?  I noticed that it appeared from your browser that you were using EC2
 at that time, so I was just curious.  It appears that might be one of the
 possible workarounds - fire up a separate EC2 instance with RStudio Server
 that initializes the spark context against a separate Spark cluster.

 On Jun 26, 2015, at 11:46 AM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 We don't have a documented way to use RStudio on EC2 right now. We
 have a ticket open at https://issues.apache.org/jira/browse/SPARK-8596
 to discuss work-arounds and potential solutions for this.

 Thanks
 Shivaram

 On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark m...@redoakstrategic.com
 wrote:

 Good morning,

 I am having a bit of trouble finalizing the installation and usage of
 the
 newest Spark version 1.4.0, deploying to an Amazon EC2 instance and
 using
 RStudio to run on top of it.

 Using these instructions (
  http://spark.apache.org/docs/latest/ec2-scripts.html ) we can
 fire up an
 EC2 instance (which we have been successful doing - we have gotten the
 cluster to launch from the command line without an issue).  Then, I
 installed RStudio Server on the same EC2 instance (the master) and
 successfully logged into it (using the test/test user) through the web
 browser.

 This is where I get stuck - within RStudio, when I try to
 reference/find the
 folder that SparkR was installed, to load the SparkR library and
 initialize
 a SparkContext, I get permissions errors on the folders, or the
 library
 cannot be found because I cannot find the folder in which the library
 is
 sitting.

 Has anyone successfully launched and utilized SparkR 1.4.0 in this
 way, with
 RStudio Server running on top of the master instance?  Are we on the
 right
 track, or should we manually launch a cluster and attempt to connect
 to it
 from another instance running R?

 Thank you in advance!

 Mark



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-0-Using-SparkR-on-EC2-Instance-tp23506.html
 Sent from

Re: [SparkR] Missing Spark APIs in R

2015-06-30 Thread Shivaram Venkataraman
The Spark JIRA and the user, dev mailing lists are the best place to follow
the progress.

Shivaram

On Tue, Jun 30, 2015 at 9:52 AM, Pradeep Bashyal prad...@bashyal.com
wrote:

 Thanks Shivaram. I watched your talk and the plan to use ML APIs with R
 flavor looks exciting.
 Is there a different venue where I would be able to follow the SparkR API
 progress?

 Thanks
 Pradeep

 On Mon, Jun 29, 2015 at 1:12 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 The RDD API is pretty complex and we are not yet sure we want to export
 all those methods in the SparkR API. We are working towards exposing a more
 limited API in upcoming versions. You can find some more details in the
 recent Spark Summit talk at
 https://spark-summit.org/2015/events/sparkr-the-past-the-present-and-the-future/

 Thanks
 Shivaram

 On Mon, Jun 29, 2015 at 9:40 AM, Pradeep Bashyal prad...@bashyal.com
 wrote:

 Hello,

 I noticed that some of the spark-core APIs are not available with
 version 1.4.0 release of SparkR. For example textFile(), flatMap() etc. The
 code seems to be there but is not exported in NAMESPACE. They were all
 available as part of the AmpLab Extras previously. I wasn't able to find
 any explanations of why they were not included with the release.

 Can anyone shed some light on it?

 Thanks
 Pradeep






Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-30 Thread Shivaram Venkataraman
Are you using the SparkR from the latest Spark 1.4 release ? The function
was not available in the older AMPLab version

Shivaram

On Tue, Jun 30, 2015 at 1:43 PM, Nicholas Sharkey nicholasshar...@gmail.com
 wrote:

 Any idea why I can't get the sparkRSQL.init function to work? The other
 parts of SparkR seems like it's working fine. And yes, the SparkR library
 is loaded.

 Thanks.

   sc <- sparkR.init(master=
  "http://ec2-52-18-1-4.eu-west-1.compute.amazonaws.com")
  ...

   sqlContext <- sparkRSQL.init(sc)

  Error: could not find function "sparkRSQL.init"


 On Tue, Jun 30, 2015 at 10:56 AM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 The API exported in the 1.4 release is different from the one used in the
 2014 demo. Please see the latest documentation at
 http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html or
 Chris's demo from Spark Summit at
 https://spark-summit.org/2015/events/a-data-frame-abstraction-layer-for-sparkr/

 Thanks
 Shivaram

 On Tue, Jun 30, 2015 at 7:40 AM, Nicholas Sharkey 
 nicholasshar...@gmail.com wrote:

 Good morning Sivaram,

 I believe I have our setup close but I'm getting an error on the last
 step of the word count example from the Spark Summit
 https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
 .

 Off the top of your head can you think of where this error (below, and
 attached) is coming from? I can get into the details of how I setup this
 machine if needed, but wanted to keep the initial question short.

 Thanks.

  Begin Code

  library(SparkR)

  # sc <- sparkR.init("local[2]")
  sc <-
  sparkR.init("http://ec2-54-171-173-195.eu-west-1.compute.amazonaws.com:[2]")

  lines <- textFile(sc, "mytextfile.txt") # hi hi all all all one one one one

  words <- flatMap(lines,
                   function(line){
                     strsplit(line, " ")[[1]]
                   })

  wordcount <- lapply(words,
                      function(word){
                        list(word, 1)
                      })

  counts <- reduceByKey(wordcount, "+", numPartitions=2)

  # Error in (function (classes, fdef, mtable)  :
  #   unable to find an inherited method for function
  #   ‘reduceByKey’ for signature ‘PipelinedRDD, character, numeric’


  End Code

 On Fri, Jun 26, 2015 at 7:04 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

  My workflow was to install RStudio on a cluster launched using the Spark EC2
  scripts. However, I did a bunch of tweaking after that (like copying the
 spark installation over etc.). When I get some time I'll try to write the
 steps down in the JIRA.

 Thanks
 Shivaram


 On Fri, Jun 26, 2015 at 10:21 AM, m...@redoakstrategic.com wrote:

 So you created an EC2 instance with RStudio installed first, then
 installed Spark under that same username?  That makes sense, I just want 
 to
 verify your work flow.

 Thank you again for your willingness to help!



 On Fri, Jun 26, 2015 at 10:13 AM -0700, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

  I was using RStudio on the master node of the same cluster in the
 demo. However I had installed Spark under the user `rstudio` (i.e.
 /home/rstudio) and that will make the permissions work correctly. You 
 will
 need to copy the config files from /root/spark/conf after installing 
 Spark
 though and it might need some more manual tweaks.

 Thanks
 Shivaram

 On Fri, Jun 26, 2015 at 9:59 AM, Mark Stephenson 
 m...@redoakstrategic.com wrote:

 Thanks!

 In your demo video, were you using RStudio to hit a separate EC2
  Spark cluster?  I noticed that it appeared from your browser that you were
 using
 EC2 at that time, so I was just curious.  It appears that might be one 
 of
 the possible workarounds - fire up a separate EC2 instance with RStudio
 Server that initializes the spark context against a separate Spark 
 cluster.

 On Jun 26, 2015, at 11:46 AM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 We don't have a documented way to use RStudio on EC2 right now. We
 have a ticket open at
 https://issues.apache.org/jira/browse/SPARK-8596 to discuss
 work-arounds and potential solutions for this.

 Thanks
 Shivaram

 On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark 
 m...@redoakstrategic.com wrote:

 Good morning,

 I am having a bit of trouble finalizing the installation and usage
 of the
 newest Spark version 1.4.0, deploying to an Amazon EC2 instance and
 using
 RStudio to run on top of it.

 Using these instructions (
  http://spark.apache.org/docs/latest/ec2-scripts.html ) we can
 fire up an
 EC2 instance (which we have been successful doing - we have gotten
 the
 cluster to launch from the command line without an issue).  Then, I
 installed RStudio Server on the same EC2 instance (the master) and
 successfully logged into it (using the test/test user) through the
 web
 browser.

 This is where I get stuck - within RStudio, when I try to
 reference/find the
 folder

Re: [SparkR] Missing Spark APIs in R

2015-06-29 Thread Shivaram Venkataraman
The RDD API is pretty complex and we are not yet sure we want to export all
those methods in the SparkR API. We are working towards exposing a more
limited API in upcoming versions. You can find some more details in the
recent Spark Summit talk at
https://spark-summit.org/2015/events/sparkr-the-past-the-present-and-the-future/

Thanks
Shivaram

On Mon, Jun 29, 2015 at 9:40 AM, Pradeep Bashyal prad...@bashyal.com
wrote:

 Hello,

 I noticed that some of the spark-core APIs are not available with version
 1.4.0 release of SparkR. For example textFile(), flatMap() etc. The code
 seems to be there but is not exported in NAMESPACE. They were all available
 as part of the AmpLab Extras previously. I wasn't able to find any
 explanations of why they were not included with the release.

 Can anyone shed some light on it?

 Thanks
 Pradeep



Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-27 Thread Shivaram Venkataraman
Thanks Mark for the update. For those interested Vincent Warmerdam also has
some details on making the /root/spark installation work at
https://issues.apache.org/jira/browse/SPARK-8596?focusedCommentId=14604328page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14604328

Shivaram

On Sat, Jun 27, 2015 at 12:23 PM, RedOakMark m...@redoakstrategic.com
wrote:

 For anyone monitoring the thread, I was able to successfully install and
 run
 a small Spark cluster and model using this method:

 First, make sure that the username being used to login to RStudio Server is
 the one that was used to install Spark on the EC2 instance.  Thanks to
 Shivaram for his help here.

 Login to RStudio and ensure that these references are used - set the
 library
 location to the folder where spark is installed.  In my case,
 ~/home/rstudio/spark.

 # # This line loads SparkR (the R package) from the installed directory
  library(SparkR, lib.loc="./spark/R/lib")

 The edits to this line were important, so that Spark knew where the install
 folder was located when initializing the cluster.

 # Initialize the Spark local cluster in R, as ‘sc’
  sc <- sparkR.init("local[2]", "SparkR", "./spark")

 From here, we ran a basic model using Spark, from RStudio, which ran
 successfully.




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-0-Using-SparkR-on-EC2-Instance-tp23506p23514.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread Shivaram Venkataraman
We don't have a documented way to use RStudio on EC2 right now. We have a
ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss
work-arounds and potential solutions for this.

Thanks
Shivaram

On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark m...@redoakstrategic.com
wrote:

 Good morning,

 I am having a bit of trouble finalizing the installation and usage of the
 newest Spark version 1.4.0, deploying to an Amazon EC2 instance and using
 RStudio to run on top of it.

 Using these instructions (
 http://spark.apache.org/docs/latest/ec2-scripts.html
 http://spark.apache.org/docs/latest/ec2-scripts.html  ) we can fire up
 an
 EC2 instance (which we have been successful doing - we have gotten the
 cluster to launch from the command line without an issue).  Then, I
 installed RStudio Server on the same EC2 instance (the master) and
 successfully logged into it (using the test/test user) through the web
 browser.

 This is where I get stuck - within RStudio, when I try to reference/find
 the
 folder that SparkR was installed, to load the SparkR library and initialize
 a SparkContext, I get permissions errors on the folders, or the library
 cannot be found because I cannot find the folder in which the library is
 sitting.

 Has anyone successfully launched and utilized SparkR 1.4.0 in this way,
 with
 RStudio Server running on top of the master instance?  Are we on the right
 track, or should we manually launch a cluster and attempt to connect to it
 from another instance running R?

 Thank you in advance!

 Mark



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-0-Using-SparkR-on-EC2-Instance-tp23506.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: sparkR could not find function textFile

2015-06-25 Thread Shivaram Venkataraman
The `head` function is not supported for the RRDD that is returned by
`textFile`. You can run `take(lines, 5L)`. I should add a warning here that
the RDD API in SparkR is private because we might not support it in the
upcoming releases. So if you can use the DataFrame API for your application
you should try that out.

Thanks
Shivaram

On Thu, Jun 25, 2015 at 1:49 PM, Wei Zhou zhweisop...@gmail.com wrote:

 Hi Alek,

 Just a follow up question. This is what I did in sparkR shell:

  lines <- SparkR:::textFile(sc, "./README.md")
 head(lines)

 And I am getting error:

 Error in x[seq_len(n)] : object of type 'S4' is not subsettable

 I'm wondering what did I do wrong. Thanks in advance.

 Wei

 2015-06-25 13:44 GMT-07:00 Wei Zhou zhweisop...@gmail.com:

 Hi Alek,

 Thanks for the explanation, it is very helpful.

 Cheers,
 Wei

 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander alek.eskil...@cerner.com
 :

  Hi there,

  The tutorial you’re reading there was written before the merge of
 SparkR for Spark 1.4.0
 For the merge, the RDD API (which includes the textFile() function) was
 made private, as the devs felt many of its functions were too low level.
 They focused instead on finishing the DataFrame API which supports local,
 HDFS, and Hive/HBase file reads. In the meantime, the devs are trying to
 determine which functions of the RDD API, if any, should be made public
 again. You can see the rationale behind this decision on the issue’s JIRA
 [1].

  You can still make use of those now private RDD functions by
 prepending the function call with the SparkR private namespace, for
 example, you’d use
 SparkR:::textFile(…).

  Hope that helps,
 Alek

  [1] -- https://issues.apache.org/jira/browse/SPARK-7230

   From: Wei Zhou zhweisop...@gmail.com
 Date: Thursday, June 25, 2015 at 3:33 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: sparkR could not find function textFile

   Hi all,

  I am exploring sparkR by activating the shell and following the
 tutorial here https://amplab-extras.github.io/SparkR-pkg/

  And when I tried to read in a local file with textFile(sc,
  "file_location"), it gives an error: could not find function "textFile".

  By reading through sparkR doc for 1.4, it seems that we need
 sqlContext to import data, for example.

  people <- read.df(sqlContext, "./examples/src/main/resources/people.json",
  "json")

 )
 And we need to specify the file type.

  My question is does sparkR stop supporting general type file
 importing? If not, would appreciate any help on how to do this.

  PS, I am trying to recreate the word count example in sparkR, and want
 to import README.md file, or just any file into sparkR.

  Thanks in advance.

  Best,
 Wei







Re: sparkR could not find function textFile

2015-06-25 Thread Shivaram Venkataraman
You can use the Spark CSV reader to read flat CSV files into a data
frame. See https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85 for an
example.
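
In short, it comes down to something like this (a sketch; the file name is a
placeholder, and the spark-csv package has to be on the classpath, e.g. by
starting the shell with sparkR --packages com.databricks:spark-csv_2.10:1.0.3):

# sc is the SparkContext created by the sparkR shell
sqlContext <- sparkRSQL.init(sc)

# header = "true" tells spark-csv to use the first row as column names;
# without schema inference every column comes back as a string.
flights <- read.df(sqlContext, "./flights.csv",
                   source = "com.databricks.spark.csv", header = "true")
head(flights)
printSchema(flights)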

Shivaram

On Thu, Jun 25, 2015 at 2:15 PM, Wei Zhou zhweisop...@gmail.com wrote:

 Thanks to both Shivaram and Alek. Then if I want to create DataFrame from
 comma separated flat files, what would you recommend me to do? One way I
 can think of is first reading the data as you would do in r, using
 read.table(), and then create spark DataFrame out of that R dataframe, but
 it is obviously not scalable.


 2015-06-25 13:59 GMT-07:00 Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu:

 The `head` function is not supported for the RRDD that is returned by
 `textFile`. You can run `take(lines, 5L)`. I should add a warning here that
 the RDD API in SparkR is private because we might not support it in the
 upcoming releases. So if you can use the DataFrame API for your application
 you should try that out.

 Thanks
 Shivaram

 On Thu, Jun 25, 2015 at 1:49 PM, Wei Zhou zhweisop...@gmail.com wrote:

 Hi Alek,

 Just a follow up question. This is what I did in sparkR shell:

  lines <- SparkR:::textFile(sc, "./README.md")
 head(lines)

 And I am getting error:

 Error in x[seq_len(n)] : object of type 'S4' is not subsettable

 I'm wondering what did I do wrong. Thanks in advance.

 Wei

 2015-06-25 13:44 GMT-07:00 Wei Zhou zhweisop...@gmail.com:

 Hi Alek,

 Thanks for the explanation, it is very helpful.

 Cheers,
 Wei

 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander 
 alek.eskil...@cerner.com:

  Hi there,

  The tutorial you’re reading there was written before the merge of
 SparkR for Spark 1.4.0
 For the merge, the RDD API (which includes the textFile() function)
 was made private, as the devs felt many of its functions were too low
 level. They focused instead on finishing the DataFrame API which supports
 local, HDFS, and Hive/HBase file reads. In the meantime, the devs are
 trying to determine which functions of the RDD API, if any, should be made
 public again. You can see the rationale behind this decision on the 
 issue’s
 JIRA [1].

  You can still make use of those now private RDD functions by
 prepending the function call with the SparkR private namespace, for
 example, you’d use
 SparkR:::textFile(…).

  Hope that helps,
 Alek

  [1] -- https://issues.apache.org/jira/browse/SPARK-7230

   From: Wei Zhou zhweisop...@gmail.com
 Date: Thursday, June 25, 2015 at 3:33 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: sparkR could not find function textFile

   Hi all,

  I am exploring sparkR by activating the shell and following the
 tutorial here https://amplab-extras.github.io/SparkR-pkg/

  And when I tried to read in a local file with textFile(sc,
 file_location), it gives an error could not find function textFile.

  By reading through sparkR doc for 1.4, it seems that we need
 sqlContext to import data, for example.

 people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json")
 And we need to specify the file type.

  My question is: does SparkR no longer support importing general file
 types? If not, I would appreciate any help on how to do this.

  PS, I am trying to recreate the word count example in sparkR, and
 want to import README.md file, or just any file into sparkR.

  Thanks in advance.

  Best,
 Wei









Re: mllib from sparkR

2015-06-25 Thread Shivaram Venkataraman
Not yet - We are working on it as a part of
https://issues.apache.org/jira/browse/SPARK-6805 and you can follow the
JIRA for more information

On Wed, Jun 24, 2015 at 2:30 AM, escardovi escard...@bitbang.com wrote:

 Hi,
 I was wondering if it is possible to use MLlib functions inside SparkR, as
 outlined at the Spark Summit East 2015 Warmup meetup:
 http://www.meetup.com/Spark-NYC/events/220850389/
 Are there available examples?

 Thank you!
 Elena




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/mllib-from-sparkR-tp23466.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark1.4 sparkR usage

2015-06-25 Thread Shivaram Venkataraman
The Apache Spark API docs for SparkR
https://spark.apache.org/docs/1.4.0/api/R/index.html represent what has
been released with Spark 1.4. The AMPLab version is no longer under active
development and I'd recommend users to use the version in the Apache
project.

Thanks
Shivaram
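For the cluster question quoted at the bottom of this thread, a minimal sketch,
assuming a standalone master running on node1 at the default port 7077 (the
app name and data are only illustrative):

  # launch the SparkR shell against the cluster:
  #   ./bin/sparkR --master spark://node1:7077
  # or, from an existing R session with Spark 1.4:
  library(SparkR)
  sc <- sparkR.init(master = "spark://node1:7077", appName = "MyApp")
  sqlContext <- sparkRSQL.init(sc)
  df <- createDataFrame(sqlContext, faithful)
  head(df)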

On Thu, Jun 25, 2015 at 2:16 AM, Jean-Charles RISCH 
risch.jeanchar...@gmail.com wrote:

 Thank you.

 But it's a bit scary because when I compare official API (
 https://spark.apache.org/docs/1.4.0/api/R/index.html) and amplab API (
 https://amplab-extras.github.io/SparkR-pkg/rdocs/1.2/index.html), they
 look very different.

 JC

 2015-06-25 11:10 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com:

 It won't change too much, it will get you started. Further details you
 can read from the official website itself
 https://spark.apache.org/docs/latest/sparkr.html

 Thanks
 Best Regards

 On Thu, Jun 25, 2015 at 2:38 PM, Jean-Charles RISCH 
 risch.jeanchar...@gmail.com wrote:

 Hello,

 Is this the official R Package?

 It is written : *NOTE: The API from the upcoming Spark release (1.4)
 will not have the same API as described here. *

 Thanks,

 JC

 2015-06-25 10:55 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com:

 Here you go https://amplab-extras.github.io/SparkR-pkg/

 Thanks
 Best Regards

 On Thu, Jun 25, 2015 at 12:39 PM, 1106944...@qq.com 1106944...@qq.com
 wrote:

 Hi all
    I have installed Spark 1.4 and want to use SparkR. Assuming the Spark
 master is node1, how do I start SparkR and submit a job to the Spark
 cluster? Can anyone help me, or point me to a blog/doc?
   Thank you very much

 --
 1106944...@qq.com








Re: How to Map and Reduce in sparkR

2015-06-25 Thread Shivaram Venkataraman
In addition to Aleksander's point please let us know what use case would
use RDD-like API in https://issues.apache.org/jira/browse/SPARK-7264 -- We
are hoping to have a version of this API in upcoming releases.

Thanks
Shivaram

On Thu, Jun 25, 2015 at 6:02 AM, Eskilson,Aleksander 
alek.eskil...@cerner.com wrote:

  The  simple answer is that SparkR does support map/reduce operations
 over RDD’s through the RDD API, but since Spark v 1.4.0, those functions
 were made private in SparkR. They can still be accessed by prepending the
 function with the namespace, like SparkR:::lapply(rdd, func). It was
 thought though that many of the functions in the RDD API were too low level
 to expose, with much more of the focus going into the DataFrame API. The
 original rationale for this decision can be found in its JIRA [1]. The devs
 are still deciding which functions of the RDD API, if any, should be made
 public for future releases. If you feel some use cases are most easily
 handled in SparkR through RDD functions, go ahead and let the dev email
 list know.

  Alek
 [1] -- https://issues.apache.org/jira/browse/SPARK-7230
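As a concrete illustration of the ::: pattern described above, a word-count
sketch against the (private) RDD API in a SparkR 1.4 shell (sc already
created); the input path and the split pattern are placeholders:

  lines <- SparkR:::textFile(sc, "README.md")
  words <- SparkR:::flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
  pairs <- SparkR:::lapply(words, function(word) { list(word, 1L) })
  counts <- SparkR:::reduceByKey(pairs, "+", 2L)
  SparkR:::take(counts, 10L)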

   From: Wei Zhou zhweisop...@gmail.com
 Date: Wednesday, June 24, 2015 at 4:59 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: How to Map and Reduce in sparkR

   Anyone knows whether sparkR supports map and reduce operations as the
 RDD transformations? Thanks in advance.

  Best,
 Wei



Re: SparkR 1.4.0: read.df() function fails

2015-06-16 Thread Shivaram Venkataraman
The error you are running into is that the input file does not exist -- you
can see it from the following line:
Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json

Thanks
Shivaram
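A minimal sketch of the two usual fixes (the paths below are just the ones
from the report and are only illustrative):

  # 1) copy the file into HDFS first, e.g.
  #      hadoop fs -put /home/esten/ami/usaf.json /user/esten/
  #    then point read.df at the HDFS path:
  mydf <- read.df(sqlContext, "/user/esten/usaf.json", source = "json")
  # 2) or read it from the local filesystem with an explicit scheme
  #    (the file must be visible to the executors, e.g. in local mode):
  mydf <- read.df(sqlContext, "file:///home/esten/ami/usaf.json", source = "json")
  head(mydf)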

On Tue, Jun 16, 2015 at 1:55 AM, esten erik.stens...@dnvgl.com wrote:

 Hi,
 In SparkR shell, I invoke:
  mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json",
  header="false")
 I have tried various filetypes (csv, txt), all fail.

 RESPONSE: ERROR RBackendHandler: load on 1 failed
 BELOW THE WHOLE RESPONSE:
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with
 curMem=0, maxMem=278302556
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in
 memory (estimated size 173.4 KB, free 265.2 MB)
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with
 curMem=177600, maxMem=278302556
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as
 bytes
 in memory (estimated size 16.2 KB, free 265.2 MB)
 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
 on localhost:37142 (size: 16.2 KB, free: 265.4 MB)
 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at
 NativeMethodAccessorImpl.java:-2
 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads
 feature cannot be used because libhadoop cannot be loaded.
 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed
 java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at

 org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
 at

 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
 at

 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
 at

 io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
 at

 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 at

 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 at

 io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
 at

 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 at

 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 at

 io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
 at

 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 at

 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 at

 io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
 at

 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
 at
 io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
 at

 io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
 at

 io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
 at

 io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 at

 io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does
 not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json
 at

 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
 at

 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
 at

 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
 at
 org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at

 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at 

Re: How to read avro in SparkR

2015-06-13 Thread Shivaram Venkataraman
Yep - Burak's answer should work. FWIW the error message from the stack
trace that shows this is the line
Failed to load class for data source: avro

Thanks
Shivaram
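A minimal sketch, assuming the shell was launched as in the report
(sparkR --packages com.databricks:spark-avro_2.10:1.0.0) and using the
file path from the report:

  df <- read.df(sqlContext, "file:///home/matmsh/myfile.avro",
                source = "com.databricks.spark.avro")
  head(df)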

On Sat, Jun 13, 2015 at 6:13 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi,
 Not sure if this is it, but could you please try
 com.databricks.spark.avro instead of just avro.

 Thanks,
 Burak
 On Jun 13, 2015 9:55 AM, Shing Hing Man mat...@yahoo.com.invalid
 wrote:

 Hi,
   I am trying to read an Avro file in SparkR (in Spark 1.4.0).

 I started R using the following.
 matmsh@gauss:~$ sparkR --packages com.databricks:spark-avro_2.10:1.0.0

 Inside the R shell, when I issue the following,

  read.df(sqlContext, "file:///home/matmsh/myfile.avro", "avro")

 I get the following exception.
 Caused by: java.lang.RuntimeException: Failed to load class for data
 source: avro

 Below is the stack trace.


 matmsh@gauss:~$ sparkR --packages com.databricks:spark-avro_2.10:1.0.0

 R version 3.2.0 (2015-04-16) -- Full of Ingredients
 Copyright (C) 2015 The R Foundation for Statistical Computing
 Platform: x86_64-suse-linux-gnu (64-bit)

 R is free software and comes with ABSOLUTELY NO WARRANTY.
 You are welcome to redistribute it under certain conditions.
 Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

 R is a collaborative project with many contributors.
 Type 'contributors()' for more information and
 'citation()' on how to cite R or R packages in publications.

 Type 'demo()' for some demos, 'help()' for on-line help, or
 'help.start()' for an HTML browser interface to help.
 Type 'q()' to quit R.

 Launching java with spark-submit command
 /home/matmsh/installed/spark/bin/spark-submit --packages
 com.databricks:spark-avro_2.10:1.0.0 sparkr-shell
 /tmp/RtmpoT7FrF/backend_port464e1e2fb16a
 Ivy Default Cache set to: /home/matmsh/.ivy2/cache
 The jars for the packages stored in: /home/matmsh/.ivy2/jars
 :: loading settings :: url =
 jar:file:/home/matmsh/installed/sparks/spark-1.4.0-bin-hadoop2.3/lib/spark-assembly-1.4.0-hadoop2.3.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
 com.databricks#spark-avro_2.10 added as a dependency
 :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
  confs: [default]
  found com.databricks#spark-avro_2.10;1.0.0 in list
  found org.apache.avro#avro;1.7.6 in local-m2-cache
  found org.codehaus.jackson#jackson-core-asl;1.9.13 in list
  found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in list
  found com.thoughtworks.paranamer#paranamer;2.3 in list
  found org.xerial.snappy#snappy-java;1.0.5 in list
  found org.apache.commons#commons-compress;1.4.1 in list
  found org.tukaani#xz;1.0 in list
  found org.slf4j#slf4j-api;1.6.4 in list
 :: resolution report :: resolve 421ms :: artifacts dl 16ms
  :: modules in use:
  com.databricks#spark-avro_2.10;1.0.0 from list in [default]
  com.thoughtworks.paranamer#paranamer;2.3 from list in [default]
  org.apache.avro#avro;1.7.6 from local-m2-cache in [default]
  org.apache.commons#commons-compress;1.4.1 from list in [default]
  org.codehaus.jackson#jackson-core-asl;1.9.13 from list in [default]
  org.codehaus.jackson#jackson-mapper-asl;1.9.13 from list in [default]
  org.slf4j#slf4j-api;1.6.4 from list in [default]
  org.tukaani#xz;1.0 from list in [default]
  org.xerial.snappy#snappy-java;1.0.5 from list in [default]
  -
  | | modules || artifacts |
  | conf | number| search|dwnlded|evicted|| number|dwnlded|
  -
  | default | 9 | 0 | 0 | 0 || 9 | 0 |
  -
 :: retrieving :: org.apache.spark#spark-submit-parent
  confs: [default]
  0 artifacts copied, 9 already retrieved (0kB/9ms)
 15/06/13 17:37:42 INFO spark.SparkContext: Running Spark version 1.4.0
 15/06/13 17:37:42 WARN util.NativeCodeLoader: Unable to load
 native-hadoop library for your platform... using builtin-java classes where
 applicable
 15/06/13 17:37:42 WARN util.Utils: Your hostname, gauss resolves to a
 loopback address: 127.0.0.1; using 192.168.0.10 instead (on interface
 enp3s0)
 15/06/13 17:37:42 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind
 to another address
 15/06/13 17:37:42 INFO spark.SecurityManager: Changing view acls to:
 matmsh
 15/06/13 17:37:42 INFO spark.SecurityManager: Changing modify acls to:
 matmsh
 15/06/13 17:37:42 INFO spark.SecurityManager: SecurityManager:
 authentication disabled; ui acls disabled; users with view permissions:
 Set(matmsh); users with modify permissions: Set(matmsh)
 15/06/13 17:37:43 INFO slf4j.Slf4jLogger: Slf4jLogger started
 15/06/13 17:37:43 INFO Remoting: Starting remoting
 15/06/13 17:37:43 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@192.168.0.10:46219]
 15/06/13 17:37:43 INFO util.Utils: Successfully started service
 'sparkDriver' on 

Re: inlcudePackage() deprecated?

2015-06-04 Thread Shivaram Venkataraman
Yeah - We don't have support for running UDFs on DataFrames yet. There is
an open issue to track this https://issues.apache.org/jira/browse/SPARK-6817

Thanks
Shivaram

On Thu, Jun 4, 2015 at 3:10 AM, Daniel Emaasit daniel.emaa...@gmail.com
wrote:

 Hello Shivaram,
 Was the includePackage() function deprecated in SparkR 1.4.0?
 I don't see it in the documentation? If it was, does that mean that we can
 use R packages on Spark DataFrames the usual way we do for local R
 dataframes?

 Daniel

 --
 Daniel Emaasit
 Ph.D. Research Assistant
 Transportation Research Center (TRC)
 University of Nevada, Las Vegas
 Las Vegas, NV 89154-4015
 Cell: 615-649-2489
 www.danielemaasit.com  http://www.danielemaasit.com/






[ANNOUNCE] YARN support in Spark EC2

2015-06-03 Thread Shivaram Venkataraman
Hi all

We recently merged support for launching YARN clusters using Spark EC2
scripts as a part of
https://issues.apache.org/jira/browse/SPARK-3674. To use this you can pass
in hadoop-major-version as yarn to the spark-ec2 script and this will
setup Hadoop 2.4 HDFS, YARN and Spark built for YARN on the EC2 cluster.

Developers who work on features related to YARN might find this useful for
testing / benchmarking Spark with YARN. If anyone has questions or feedback
please let me know.
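For example, an invocation along these lines (the key pair, identity file and
cluster name are placeholders) should bring up a YARN-based cluster:

  ./spark-ec2 --key-pair=my-key --identity-file=my-key.pem \
    --hadoop-major-version=yarn launch my-yarn-cluster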

Thanks
Shivaram


Re: SparkR Jobs Hanging in collectPartitions

2015-05-29 Thread Shivaram Venkataraman
For jobs with R UDFs (i.e. when we use the RDD API from SparkR) we use R on
both the driver side and on the worker side. So in this case when the
`flatMap` operation is run, the data is sent from the JVM to an R process
on the worker which in turn executes the `gsub` function.

Could you turn on INFO logging and send a pointer to the log file ? Its
pretty clear that the problem is happening in the call to `subtract`, which
in turn is doing a shuffle operation, but I am not sure why this should
happen.

Thanks
Shivaram

On Fri, May 29, 2015 at 7:56 AM, Eskilson,Aleksander 
alek.eskil...@cerner.com wrote:

  Sure. Looking more closely at the code, I thought I might have had an
 error in the flow of data structures in the R code. The line that extracts
 the words from the corpus is now:
 words <- distinct(SparkR:::flatMap(corpus, function(line) {
   strsplit(
     gsub("^\\s+|[[:punct:]]", "", tolower(line)),
     "\\s")[[1]]
 }))
 (just removes leading whitespace and all punctuation after having made the
 whole line lowercase, then splits to a vector of words, ultimately
 flattening the whole collection)

  Counts works on the resultant words list, returning the value expected,
 so the hang most likely occurs during the subtract. I should mention, the
 size of the corpus is very small, just kb in size. The dictionary I
 subtract against is also quite modest by Spark standards, just 4.8MB, and
 I’ve got 2G memory for the Worker, which ought to be sufficient for such a
 small job.

  The Scala analog runs quite fast, even with the subtract. If we look at
 the DAG for the SparkR job and compare that against the event timeline for
 Stage 3, it seems the job is stuck in Scheduler Delay (in 0/2 tasks
 completed) and never begins the rest of the stage. Unfortunately, the
 executor log hangs up as well, and doesn’t give much info.

  Could you describe in a little more detail at what points data is
 actually held in R’s internal process memory? I was under the impression
 that SparkR:::textFile created an RDD object that would only be realized
 when a DAG requiring it was executed, and would therefore be part of the
 memory managed by Spark, and that memory would only be moved to R as an R
 object following a collect(), take(), etc.

  Thanks,
 Alek Eskilson
  From: Shivaram Venkataraman shiva...@eecs.berkeley.edu
 Reply-To: shiva...@eecs.berkeley.edu shiva...@eecs.berkeley.edu
 Date: Wednesday, May 27, 2015 at 8:26 PM
 To: Aleksander Eskilson alek.eskil...@cerner.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: SparkR Jobs Hanging in collectPartitions

   Could you try to see which phase is causing the hang ? i.e. If you do a
 count() after flatMap does that work correctly ? My guess is that the hang
 is somehow related to data not fitting in the R process memory but its hard
 to say without more diagnostic information.

  Thanks
 Shivaram

 On Tue, May 26, 2015 at 7:28 AM, Eskilson,Aleksander 
 alek.eskil...@cerner.com wrote:

  I’ve been attempting to run a SparkR translation of a similar Scala job
 that identifies words from a corpus not existing in a newline delimited
 dictionary. The R code is:

  dict <- SparkR:::textFile(sc, src1)
 corpus <- SparkR:::textFile(sc, src2)
 words <- distinct(SparkR:::flatMap(corpus, function(line) {
   gsub("[[:punct:]]", "", tolower(strsplit(line, " |,|-")[[1]]))}))
 found <- subtract(words, dict)

  (where src1, src2 are locations on HDFS)

  Then attempting something like take(found, 10) or saveAsTextFile(found,
 dest) should realize the collection, but that stage of the DAG hangs in
 Scheduler Delay during the collectPartitions phase.

  Synonymous Scala code however,
 val corpus = sc.textFile(src1).flatMap(_.split(" |,|-"))
 val dict = sc.textFile(src2)
 val words = corpus.map(word =>
   word.filter(Character.isLetter(_))).distinct()
 val found = words.subtract(dict)

  performs as expected. Any thoughts?

  Thanks,
 Alek Eskilson




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: SparkR Jobs Hanging in collectPartitions

2015-05-27 Thread Shivaram Venkataraman
Could you try to see which phase is causing the hang? i.e. if you do a
count() after flatMap, does that work correctly? My guess is that the hang
is somehow related to data not fitting in the R process memory, but it's hard
to say without more diagnostic information.

Thanks
Shivaram
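A minimal way to act on that suggestion, reusing the RDD names from the
quoted snippet below (this is only a diagnostic sketch, not a fix):

  SparkR:::count(words)          # if this returns, the flatMap/distinct stages complete
  found <- SparkR:::subtract(words, dict)
  SparkR:::take(found, 10L)      # if this hangs, the shuffle behind subtract is the culprit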

On Tue, May 26, 2015 at 7:28 AM, Eskilson,Aleksander 
alek.eskil...@cerner.com wrote:

  I’ve been attempting to run a SparkR translation of a similar Scala job
 that identifies words from a corpus not existing in a newline delimited
 dictionary. The R code is:

  dict <- SparkR:::textFile(sc, src1)
 corpus <- SparkR:::textFile(sc, src2)
 words <- distinct(SparkR:::flatMap(corpus, function(line) {
   gsub("[[:punct:]]", "", tolower(strsplit(line, " |,|-")[[1]]))}))
 found <- subtract(words, dict)

  (where src1, src2 are locations on HDFS)

  Then attempting something like take(found, 10) or saveAsTextFile(found,
 dest) should realize the collection, but that stage of the DAG hangs in
 Scheduler Delay during the collectPartitions phase.

  Synonymous Scala code however,
 val corpus = sc.textFile(src1).flatMap(_.split(" |,|-"))
 val dict = sc.textFile(src2)
 val words = corpus.map(word =>
   word.filter(Character.isLetter(_))).distinct()
 val found = words.subtract(dict)

  performs as expected. Any thoughts?

  Thanks,
 Alek Eskilson



Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Shivaram Venkataraman
My guess is that the `createDataFrame` call is failing here. Can you check
if the schema being passed to it includes the column name and type for the
newly zipped `features` column?

Joseph probably knows this better, but AFAIK the DenseVector here will need
to be marked as a VectorUDT while creating a DataFrame column

Thanks
Shivaram

On Tue, Mar 31, 2015 at 12:50 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 Following your suggestion, I end up with the following implementation :







 override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
   val schema = transformSchema(dataSet.schema, paramMap, logging = true)
   val map = this.paramMap ++ paramMap

   val features = dataSet.select(map(inputCol)).mapPartitions { rows =>
     Caffe.set_mode(Caffe.CPU)
     val net = CaffeUtils.floatTestNetwork(SparkFiles.get(topology), SparkFiles.get(weight))
     val inputBlobs: FloatBlobVector = net.input_blobs()
     val N: Int = 1
     val K: Int = inputBlobs.get(0).channels()
     val H: Int = inputBlobs.get(0).height()
     val W: Int = inputBlobs.get(0).width()
     inputBlobs.get(0).Reshape(N, K, H, W)
     val dataBlob = new FloatPointer(N*K*W*H)
     val inputCPUData = inputBlobs.get(0).mutable_cpu_data()

     val feat = rows.map { case Row(a: Iterable[Float]) =>
       dataBlob.put(a.toArray, 0, a.size)
       caffe_copy_float(N*K*W*H, dataBlob, inputCPUData)
       val resultBlobs: FloatBlobVector = net.ForwardPrefilled()
       val resultDim = resultBlobs.get(0).channels()
       logInfo(s"Output dimension $resultDim")
       val resultBlobData = resultBlobs.get(0).cpu_data()
       val output = new Array[Float](resultDim)
       resultBlobData.get(output)
       Vectors.dense(output.map(_.toDouble))
     }
     // net.deallocate()
     feat
   }
   val newRowData = dataSet.rdd.zip(features).map { case (old, feat) =>
     val oldSeq = old.toSeq
     Row.fromSeq(oldSeq :+ feat)
   }
   dataSet.sqlContext.createDataFrame(newRowData, schema)
 }


 The idea is to mapPartitions over the underlying RDD of the DataFrame and
 create a new DataFrame by zipping the results. It seems to work, but when I
 try to save the RDD I get the following error:

 org.apache.spark.mllib.linalg.DenseVector cannot be cast to
 org.apache.spark.sql.Row


 On Mon, Mar 30, 2015 at 6:40 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 One workaround could be to convert a DataFrame into a RDD inside the
 transform function and then use mapPartitions/broadcast to work with the
 JNI calls and then convert back to RDD.

 Thanks
 Shivaram

 On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

 I'm still struggling to make a pre-trained Caffe model transformer for
 DataFrames work. The main problem is that creating a Caffe model inside the
 UDF is very slow and consumes a lot of memory.

 Some of you suggest to broadcast the model. The problem with
 broadcasting is that I use a JNI interface to caffe C++ with javacpp-preset
  and it is not serializable.

 Another possible approach is to use a UDF that can handle a whole
 partition instead of just a row, in order to minimize the Caffe model
 instantiation.

 Are there any ideas to solve either of these two issues?



 Best,

 Jao

 On Tue, Mar 3, 2015 at 10:04 PM, Joseph Bradley jos...@databricks.com
 wrote:

 I see.  I think your best bet is to create the cnnModel on the master
 and then serialize it to send to the workers.  If it's big (1M or so), then
 you can broadcast it and use the broadcast variable in the UDF.  There is
 not a great way to do something equivalent to mapPartitions with UDFs right
 now.

 On Tue, Mar 3, 2015 at 4:36 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Here is my current implementation with current master version of spark




 class DeepCNNFeature extends Transformer with HasInputCol with HasOutputCol ... {

   override def transformSchema(...) { ... }

   override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
     transformSchema(dataSet.schema, paramMap, logging = true)
     val map = this.paramMap ++ paramMap
     val deepCNNFeature = udf((v: Vector) => {
       val cnnModel = new CaffeModel
       cnnModel.transform(v)
     } : Vector )
     dataSet.withColumn(map(outputCol), deepCNNFeature(col(map(inputCol))))
   }
 }

 where CaffeModel is a java api to Caffe C++ model.

  The problem here is that for every row it will create a new instance
 of CaffeModel, which is inefficient since creating a new model
 means loading a large model file. And it will transform only a single
 row at a time, whereas a Caffe network can process a batch of rows efficiently.
 In other words, is it possible to create a UDF that operates on a
 partition, in order to minimize the creation of a CaffeModel and
 to take advantage of the Caffe network batch processing?

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-30 Thread Shivaram Venkataraman
One workaround could be to convert a DataFrame into a RDD inside the
transform function and then use mapPartitions/broadcast to work with the
JNI calls and then convert back to RDD.

Thanks
Shivaram

On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 Dear all,

 I'm still struggling to make a pre-trained Caffe model transformer for
 DataFrames work. The main problem is that creating a Caffe model inside the
 UDF is very slow and consumes a lot of memory.

 Some of you suggest to broadcast the model. The problem with broadcasting
 is that I use a JNI interface to caffe C++ with javacpp-preset  and it is
 not serializable.

 Another possible approach is to use a UDF that can handle a whole
 partition instead of just a row, in order to minimize the Caffe model
 instantiation.

 Are there any ideas to solve either of these two issues?



 Best,

 Jao

 On Tue, Mar 3, 2015 at 10:04 PM, Joseph Bradley jos...@databricks.com
 wrote:

 I see.  I think your best bet is to create the cnnModel on the master and
 then serialize it to send to the workers.  If it's big (1M or so), then you
 can broadcast it and use the broadcast variable in the UDF.  There is not a
 great way to do something equivalent to mapPartitions with UDFs right now.

 On Tue, Mar 3, 2015 at 4:36 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Here is my current implementation with current master version of spark




 class DeepCNNFeature extends Transformer with HasInputCol with HasOutputCol ... {

   override def transformSchema(...) { ... }

   override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
     transformSchema(dataSet.schema, paramMap, logging = true)
     val map = this.paramMap ++ paramMap
     val deepCNNFeature = udf((v: Vector) => {
       val cnnModel = new CaffeModel
       cnnModel.transform(v)
     } : Vector )
     dataSet.withColumn(map(outputCol), deepCNNFeature(col(map(inputCol))))
   }
 }

 where CaffeModel is a java api to Caffe C++ model.

  The problem here is that for every row it will create a new instance of
 CaffeModel, which is inefficient since creating a new model
 means loading a large model file. And it will transform only a single
 row at a time, whereas a Caffe network can process a batch of rows efficiently.
 In other words, is it possible to create a UDF that operates on a
 partition, in order to minimize the creation of a CaffeModel and
 to take advantage of the Caffe network batch processing?



 On Tue, Mar 3, 2015 at 7:26 AM, Joseph Bradley jos...@databricks.com
 wrote:

 I see, thanks for clarifying!

 I'd recommend following existing implementations in spark.ml
 transformers.  You'll need to define a UDF which operates on a single Row
 to compute the value for the new column.  You can then use the DataFrame
 DSL to create the new column; the DSL provides a nice syntax for what would
 otherwise be a SQL statement like select ... from   I'm recommending
 looking at the existing implementation (rather than stating it here)
 because it changes between Spark 1.2 and 1.3.  In 1.3, the DSL is much
 improved and makes it easier to create a new column.

 Joseph

 On Sun, Mar 1, 2015 at 1:26 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 class DeepCNNFeature extends Transformer ... {

 override def transform(data: DataFrame, paramMap: ParamMap):
 DataFrame = {


  // How can I do a map partition on the underlying RDD
 and then add the column ?

  }
 }

 On Sun, Mar 1, 2015 at 10:23 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Hi Joseph,

 Thank you for the tips. I understand what I should do when my data
 is represented as an RDD. The thing that I can't figure out is how to do
 the same thing when the data is viewed as a DataFrame and I need to add the
 result of my pretrained model as a new column in the DataFrame. Precisely,
 I want to implement the following transformer:

 class DeepCNNFeature extends Transformer ... {

 }

 On Sun, Mar 1, 2015 at 1:32 AM, Joseph Bradley jos...@databricks.com
  wrote:

 Hi Jao,

 You can use external tools and libraries if they can be called from
 your Spark program or script (with appropriate conversion of data types,
 etc.).  The best way to apply a pre-trained model to a dataset would be 
 to
 call the model from within a closure, e.g.:

 myRDD.map { myDatum = preTrainedModel.predict(myDatum) }

 If your data is distributed in an RDD (myRDD), then the above call
 will distribute the computation of prediction using the pre-trained 
 model.
 It will require that all of your Spark workers be able to run the
 preTrainedModel; that may mean installing Caffe and dependencies on all
 nodes in the compute cluster.

 For the second question, I would modify the above call as follows:

 myRDD.mapPartitions { myDataOnPartition =
   val myModel = // 

Re: Using TF-IDF from MLlib

2015-03-16 Thread Shivaram Venkataraman
FWIW the JIRA I was thinking about is
https://issues.apache.org/jira/browse/SPARK-3098

On Mon, Mar 16, 2015 at 6:10 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 I vaguely remember that JIRA and AFAIK Matei's point was that the order is
 not guaranteed *after* a shuffle. If you only use operations like map which
 preserve partitioning, ordering should be guaranteed from what I know.

 On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen so...@cloudera.com wrote:

 Dang I can't seem to find the JIRA now but I am sure we had a discussion
 with Matei about this and the conclusion was that RDD order is not
 guaranteed unless a sort is involved.
 On Mar 17, 2015 12:14 AM, Joseph Bradley jos...@databricks.com wrote:

 This was brought up again in
 https://issues.apache.org/jira/browse/SPARK-6340  so I'll answer one
 item which was asked about the reliability of zipping RDDs.  Basically, it
 should be reliable, and if it is not, then it should be reported as a bug.
 This general approach should work (with explicit types to make it clear):

 val data: RDD[LabeledPoint] = ...
 val labels: RDD[Double] = data.map(_.label)
 val features1: RDD[Vector] = data.map(_.features)
 val features2: RDD[Vector] = new
 HashingTF(numFeatures=100).transform(features1)
 val features3: RDD[Vector] = idfModel.transform(features2)
 val finalData: RDD[LabeledPoint] = labels.zip(features3).map((label,
 features) = LabeledPoint(label, features))

 If you run into problems with zipping like this, please report them!

 Thanks,
 Joseph

 On Mon, Dec 29, 2014 at 4:06 PM, Xiangrui Meng men...@gmail.com wrote:

 Hopefully the new pipeline API addresses this problem. We have a code
 example here:


 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala

 -Xiangrui

 On Mon, Dec 29, 2014 at 5:22 AM, andy petrella andy.petre...@gmail.com
 wrote:
  Here is what I did for this case :
 https://github.com/andypetrella/tf-idf
 
 
  Le lun 29 déc. 2014 11:31, Sean Owen so...@cloudera.com a écrit :
 
  Given (label, terms) you can just transform the values to a TF
 vector,
  then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
  make a LabeledPoint from (label, vector) pairs. Is that what you're
  looking for?
 
  On Mon, Dec 29, 2014 at 3:37 AM, Yao y...@ford.com wrote:
   I found the TF-IDF feature extraction and all the MLlib code that
 work
   with
   pure Vector RDD very difficult to work with due to the lack of
 ability
   to
   associate vector back to the original data. Why can't Spark MLlib
   support
   LabeledPoint?
  
  
  
   --
   View this message in context:
  
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
   Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
  
  
 -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Using TF-IDF from MLlib

2015-03-16 Thread Shivaram Venkataraman
I vaguely remember that JIRA and AFAIK Matei's point was that the order is
not guaranteed *after* a shuffle. If you only use operations like map which
preserve partitioning, ordering should be guaranteed from what I know.

On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen so...@cloudera.com wrote:

 Dang I can't seem to find the JIRA now but I am sure we had a discussion
 with Matei about this and the conclusion was that RDD order is not
 guaranteed unless a sort is involved.
 On Mar 17, 2015 12:14 AM, Joseph Bradley jos...@databricks.com wrote:

 This was brought up again in
 https://issues.apache.org/jira/browse/SPARK-6340  so I'll answer one
 item which was asked about the reliability of zipping RDDs.  Basically, it
 should be reliable, and if it is not, then it should be reported as a bug.
 This general approach should work (with explicit types to make it clear):

 val data: RDD[LabeledPoint] = ...
 val labels: RDD[Double] = data.map(_.label)
 val features1: RDD[Vector] = data.map(_.features)
 val features2: RDD[Vector] = new
 HashingTF(numFeatures=100).transform(features1)
 val features3: RDD[Vector] = idfModel.transform(features2)
 val finalData: RDD[LabeledPoint] = labels.zip(features3).map((label,
 features) = LabeledPoint(label, features))

 If you run into problems with zipping like this, please report them!

 Thanks,
 Joseph

 On Mon, Dec 29, 2014 at 4:06 PM, Xiangrui Meng men...@gmail.com wrote:

 Hopefully the new pipeline API addresses this problem. We have a code
 example here:


 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala

 -Xiangrui

 On Mon, Dec 29, 2014 at 5:22 AM, andy petrella andy.petre...@gmail.com
 wrote:
  Here is what I did for this case :
 https://github.com/andypetrella/tf-idf
 
 
  Le lun 29 déc. 2014 11:31, Sean Owen so...@cloudera.com a écrit :
 
  Given (label, terms) you can just transform the values to a TF vector,
  then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
  make a LabeledPoint from (label, vector) pairs. Is that what you're
  looking for?
 
  On Mon, Dec 29, 2014 at 3:37 AM, Yao y...@ford.com wrote:
   I found the TF-IDF feature extraction and all the MLlib code that
 work
   with
   pure Vector RDD very difficult to work with due to the lack of
 ability
   to
   associate vector back to the original data. Why can't Spark MLlib
   support
   LabeledPoint?
  
  
  
   --
   View this message in context:
  
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
   Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
  
  
 -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-13 Thread Shivaram Venkataraman
Do you have a small test case that can reproduce the out of memory error ?
I have also seen some errors on large scale experiments but haven't managed
to narrow it down.

Thanks
Shivaram

On Fri, Mar 13, 2015 at 6:20 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 It runs faster but there are some drawbacks. It seems to consume more
 memory. I get a java.lang.OutOfMemoryError: Java heap space error if I don't
 have sufficient partitions for a fixed amount of memory. With the older
 (ampcamp) implementation I didn't get it for the same data size.

 On Thu, Mar 12, 2015 at 11:36 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:


 On Thu, Mar 12, 2015 at 3:05 PM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 In fact, by activating netlib with native libraries it goes faster.

 Glad you got it to work! Better performance was one of the reasons we made
 the switch.

 Thanks

 On Tue, Mar 10, 2015 at 7:03 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 There are a couple of differences between the ml-matrix implementation
 and the one used in AMPCamp

 - I think the AMPCamp one uses JBLAS which tends to ship native BLAS
 libraries along with it. In ml-matrix we switched to using Breeze + Netlib
 BLAS which is faster but needs some setup [1] to pick up native libraries.
 If native libraries are not found it falls back to a JVM implementation, so
 that might explain the slow down.

 - The other difference if you are comparing the whole image pipeline is
 that I think the AMPCamp version used NormalEquations which is around 2-3x
 faster (just in terms of number of flops) compared to TSQR.

 [1]
 https://github.com/fommil/netlib-java#machine-optimised-system-libraries

 Thanks
 Shivaram

 On Tue, Mar 10, 2015 at 9:57 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 I'm trying to play with the implementation of least square solver (Ax
 = b) in mlmatrix.TSQR where A is  a 5*1024 matrix  and b a 5*10
 matrix. It works but I notice
 that it's 8 times slower than the implementation given in the latest
 ampcamp :
 http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html
 . As far as I know these two implementations come from the same basis.
 What is the difference between these two codes ?





 On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 There are a couple of solvers that I've written that are part of the
 AMPLab ml-matrix repo [1,2]. These aren't part of MLlib yet, though, and if
 you are interested in porting them I'd be happy to review it

 Thanks
 Shivaram


 [1]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
 [2]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

 On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

 Is there a least square solver based on DistributedMatrix that we
 can use out of the box in the current (or the master) version of spark ?
 It seems that the only least square solver available in spark is
 private to recommender package.


 Cheers,

 Jao










Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-12 Thread Shivaram Venkataraman
There are a couple of differences between the ml-matrix implementation and
the one used in AMPCamp

- I think the AMPCamp one uses JBLAS which tends to ship native BLAS
libraries along with it. In ml-matrix we switched to using Breeze + Netlib
BLAS which is faster but needs some setup [1] to pick up native libraries.
If native libraries are not found it falls back to a JVM implementation, so
that might explain the slow down.

- The other difference if you are comparing the whole image pipeline is
that I think the AMPCamp version used NormalEquations which is around 2-3x
faster (just in terms of number of flops) compared to TSQR.

[1] https://github.com/fommil/netlib-java#machine-optimised-system-libraries

Thanks
Shivaram

On Tue, Mar 10, 2015 at 9:57 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 I'm trying to play with the implementation of least square solver (Ax = b)
 in mlmatrix.TSQR where A is  a 5*1024 matrix  and b a 5*10 matrix.
 It works but I notice
 that it's 8 times slower than the implementation given in the latest
 ampcamp :
 http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html
 . As far as I know these two implementations come from the same basis.
 What is the difference between these two codes ?





 On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 There are a couple of solvers that I've written that are part of the AMPLab
 ml-matrix repo [1,2]. These aren't part of MLlib yet, though, and if you are
 interested in porting them I'd be happy to review it

 Thanks
 Shivaram


 [1]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
 [2]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

 On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

 Is there a least square solver based on DistributedMatrix that we can
 use out of the box in the current (or the master) version of spark ?
 It seems that the only least square solver available in spark is private
 to recommender package.


 Cheers,

 Jao






Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-12 Thread Shivaram Venkataraman
On Thu, Mar 12, 2015 at 3:05 PM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 In fact, by activating netlib with native libraries it goes faster.

 Glad you got it to work! Better performance was one of the reasons we made
the switch.

 Thanks

 On Tue, Mar 10, 2015 at 7:03 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 There are a couple of differences between the ml-matrix implementation
 and the one used in AMPCamp

 - I think the AMPCamp one uses JBLAS which tends to ship native BLAS
 libraries along with it. In ml-matrix we switched to using Breeze + Netlib
 BLAS which is faster but needs some setup [1] to pick up native libraries.
 If native libraries are not found it falls back to a JVM implementation, so
 that might explain the slow down.

 - The other difference if you are comparing the whole image pipeline is
 that I think the AMPCamp version used NormalEquations which is around 2-3x
 faster (just in terms of number of flops) compared to TSQR.

 [1]
 https://github.com/fommil/netlib-java#machine-optimised-system-libraries

 Thanks
 Shivaram

 On Tue, Mar 10, 2015 at 9:57 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 I'm trying to play with the implementation of least square solver (Ax =
 b) in mlmatrix.TSQR where A is  a 5*1024 matrix  and b a 5*10
 matrix. It works but I notice
 that it's 8 times slower than the implementation given in the latest
 ampcamp :
 http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html
 . As far as I know these two implementations come from the same basis.
 What is the difference between these two codes ?





 On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 There are a couple of solvers that I've written that are part of the
 AMPLab ml-matrix repo [1,2]. These aren't part of MLlib yet, though, and if
 you are interested in porting them I'd be happy to review it

 Thanks
 Shivaram


 [1]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
 [2]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

 On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

 Is there a least square solver based on DistributedMatrix that we can
 use out of the box in the current (or the master) version of spark ?
 It seems that the only least square solver available in spark is
 private to recommender package.


 Cheers,

 Jao








Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Shivaram Venkataraman
Sections 3, 4, and 5 in http://www.netlib.org/lapack/lawnspdf/lawn204.pdf are a
good reference

Shivaram
On Mar 6, 2015 9:17 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

 Do you have a reference paper for the algorithm implemented in TSQR.scala?

 On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 There are a couple of solvers that I've written that are part of the AMPLab
 ml-matrix repo [1,2]. These aren't part of MLlib yet, though, and if you are
 interested in porting them I'd be happy to review it

 Thanks
 Shivaram


 [1]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
 [2]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

 On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

 Is there a least square solver based on DistributedMatrix that we can
 use out of the box in the current (or the master) version of spark ?
 It seems that the only least square solver available in spark is private
 to recommender package.


 Cheers,

 Jao






Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-03 Thread Shivaram Venkataraman
There are a couple of solvers that I've written that are part of the AMPLab
ml-matrix repo [1,2]. These aren't part of MLlib yet, though, and if you are
interested in porting them I'd be happy to review it

Thanks
Shivaram


[1]
https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
[2]
https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

 Dear all,

 Is there a least square solver based on DistributedMatrix that we can use
 out of the box in the current (or the master) version of spark ?
 It seems that the only least square solver available in spark is private
 to recommender package.


 Cheers,

 Jao



Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Shivaram Venkataraman
+Josh, who added the Job UI page.

I've seen this as well and was a bit confused about what it meant. Josh, is
there a specific scenario that creates these skipped stages in the Job UI ?

Thanks
Shivaram

On Wed, Jan 7, 2015 at 12:32 PM, Corey Nolet cjno...@gmail.com wrote:

 Sorry- replace ### with an actual number. What does a skipped stage
 mean? I'm running a series of jobs and it seems like after a certain point,
 the number of skipped stages is larger than the number of actual completed
 stages.

 On Wed, Jan 7, 2015 at 3:28 PM, Ted Yu yuzhih...@gmail.com wrote:

 Looks like the number of skipped stages couldn't be formatted.

 Cheers

 On Wed, Jan 7, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote:

 We just upgraded to Spark 1.2.0 and we're seeing this in the UI.






Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Shivaram Venkataraman
Ah I see - so it's more like 're-used stages', which is not necessarily a bug
in the program or something like that.
Thanks for the pointer to the comment

Thanks
Shivaram

On Wed, Jan 7, 2015 at 2:00 PM, Mark Hamstra m...@clearstorydata.com
wrote:

 That's what you want to see.  The computation of a stage is skipped if the
 results for that stage are still available from the evaluation of a prior
 job run:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L163

 On Wed, Jan 7, 2015 at 12:32 PM, Corey Nolet cjno...@gmail.com wrote:

 Sorry- replace ### with an actual number. What does a skipped stage
 mean? I'm running a series of jobs and it seems like after a certain point,
 the number of skipped stages is larger than the number of actual completed
 stages.

 On Wed, Jan 7, 2015 at 3:28 PM, Ted Yu yuzhih...@gmail.com wrote:

 Looks like the number of skipped stages couldn't be formatted.

 Cheers

 On Wed, Jan 7, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote:

 We just upgraded to Spark 1.2.0 and we're seeing this in the UI.







Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Shivaram Venkataraman
Just to clarify, are you running the application using spark-submit after
packaging with sbt package? One thing that might help is to mark the Spark
dependency as 'provided', since then you shouldn't have the Spark classes in
your jar.

Thanks
Shivaram

On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen so...@cloudera.com wrote:

 You should use the same binaries everywhere. The problem here is that
 anonymous functions get compiled to (potentially) different names when you
 build differently, so you actually have one function being called
 when another function is meant.

 On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui rui@intel.com wrote:
  Hi,
 
 
 
  I encountered a weird bytecode incompatibility issue between the spark-core
 jar from the mvn repo and the official Spark prebuilt binary.
 
 
 
  Steps to reproduce:
 
  1. Download the official pre-built Spark binary 1.1.1 at
  http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz
 
  2. Launch the Spark cluster in pseudo cluster mode
 
  3. A small scala APP which calls RDD.saveAsObjectFile()
 
  scalaVersion := "2.10.4"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.1.1"
  )

  val sc = new SparkContext(args(0), "test") // args(0) is the Spark master URI
  val rdd = sc.parallelize(List(1, 2, 3))
  rdd.saveAsObjectFile("/tmp/mysaoftmp")
  sc.stop
 
 
 
  throws an exception as follows:
 
  [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to
  stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure:
 Lost
  task 1.3 in stage 0.0 (TID 6, ray-desktop.sh.intel.com):
  java.lang.ClassCastException: scala.Tuple2 cannot be cast to
  scala.collection.Iterator
 
  [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
  [error] org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
  [error]
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 
  [error]
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 
  [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
  [error]
  org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 
  [error]
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 
  [error] org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
  [error]
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 
  [error] org.apache.spark.scheduler.Task.run(Task.scala:54)
 
  [error]
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 
  [error]
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
 
  [error]
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 
  [error] java.lang.Thread.run(Thread.java:701)
 
 
 
  After investigation, I found that this is caused by a bytecode
 incompatibility issue between RDD.class in spark-core_2.10-1.1.1.jar and the
 one in the pre-built Spark assembly.
 
 
 
  This issue also happens with spark 1.1.0.
 
 
 
  Is there anything wrong in my usage of Spark? Or anything wrong in the
  process of deploying Spark module jars to maven repo?
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Problem creating EC2 cluster using spark-ec2

2014-12-02 Thread Shivaram Venkataraman
+Andrew

Actually I think this is because we haven't uploaded the Spark binaries to
cloudfront / pushed the change to mesos/spark-ec2.

Andrew, can you take care of this ?



On Tue, Dec 2, 2014 at 5:11 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Interesting. Do you have any problems when launching in us-east-1? What
 is the full output of spark-ec2 when launching a cluster? (Post it to a
 gist if it’s too big for email.)

 On Mon, Dec 1, 2014 at 10:34 AM, Dave Challis dave.chal...@aistemos.com
 wrote:

 I've been trying to create a Spark cluster on EC2 using the
 documentation at https://spark.apache.org/docs/latest/ec2-scripts.html
 (with Spark 1.1.1).

 Running the script successfully creates some EC2 instances, HDFS etc.,
 but appears to fail to copy the actual files needed to run Spark
 across.

 I ran the following commands:

 $ cd ~/src/spark-1.1.1/ec2
 $ ./spark-ec2 --key-pair=* --identity-file=* --slaves=1
 --region=eu-west-1 --zone=eu-west-1a --instance-type=m3.medium
 --no-ganglia launch foocluster

 I see the following in the script's output:

 (instance and HDFS set up happens here)
 ...
 Persistent HDFS installed, won't start by default...
 ~/spark-ec2 ~/spark-ec2
 Setting up spark-standalone
 RSYNC'ing /root/spark/conf to slaves...
 *.eu-west-1.compute.amazonaws.com
 RSYNC'ing /root/spark-ec2 to slaves...
 *.eu-west-1.compute.amazonaws.com
 ./spark-standalone/setup.sh: line 22: /root/spark/sbin/stop-all.sh: No
 such file or directory
 ./spark-standalone/setup.sh: line 27:
 /root/spark/sbin/start-master.sh: No such file or directory
 ./spark-standalone/setup.sh: line 33:
 /root/spark/sbin/start-slaves.sh: No such file or directory
 Setting up tachyon
 RSYNC'ing /root/tachyon to slaves...
 ...
 (Tachyon setup happens here without any problem)

 I can ssh to the master (using the ./spark-ec2 login), and looking in
 /root/, it contains:

 $ ls /root
 ephemeral-hdfs  hadoop-native  mapreduce  persistent-hdfs  scala
 shark  spark  spark-ec2  tachyon

 If I look in /root/spark (where the sbin directory should be found),
 it only contains a single 'conf' directory:

 $ ls /root/spark
 conf

 Any idea why spark-ec2 might have failed to copy these files across?

 Thanks,
 Dave

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Shivaram Venkataraman
Can you clarify what is the Spark master URL you are using ? Is it 'local'
or is it a cluster ? If it is 'local' then rebuilding Spark wouldn't help
as Spark is getting pulled in from Maven and that'll just pick up the
released artifacts.
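In that case, a sketch of an alternative is to add the netlib-java native-system artifact to the application's own build instead of rebuilding Spark (the coordinates below are my assumption of the usual netlib-java packaging, so adjust to what is actually published):

  // build.sbt sketch
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-mllib" % "1.1.0",
    "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
  )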

Shivaram

On Mon, Nov 24, 2014 at 1:08 PM, agg212 alexander_galaka...@brown.edu
wrote:

 I tried building Spark from the source, by downloading it and running:

 mvn -Pnetlib-lgpl -DskipTests clean package

 I then installed OpenBLAS by doing the following:

 - Download and unpack .tar from http://www.openblas.net/
 - Run `make`

 I then linked /usr/lib/libblas.so.3 to /usr/lib/libopenblas.so (which links
 to /usr/lib/libopenblas_sandybridgep-r0.2.12.so)

 I am still getting the following error when running a job after installing
 spark from the source with the -Pnetlib-lgpl flag:

 WARN BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeSystemBLAS
 WARN BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeRefBLAS

 Any thoughts on what else I need to do to get the native libraries
 recognized by Spark?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-native-netlib-java-OpenBLAS-tp19662p19681.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: org/apache/commons/math3/random/RandomGenerator issue

2014-11-08 Thread Shivaram Venkataraman
I ran into this problem too and I know of a workaround but don't exactly
know what is happening. The work around is to explicitly add either the
commons math jar or your application jar (shaded with commons math)
to spark.executor.extraClassPath.
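For reference, a sketch of that setting (the jar path is a placeholder; point it at wherever commons-math3 actually lives on the executors):

  # conf/spark-defaults.conf (or pass the same key via --conf to spark-submit)
  spark.executor.extraClassPath  /path/to/commons-math3-3.3.jar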

My hunch is that this is related to the class loader problem described in
[1] where Spark loads breeze at the beginning and then having commons math
in the user's jar somehow doesn't get picked up.

Thanks
Shivaram
[1]
http://apache-spark-user-list.1001560.n3.nabble.com/Native-library-can-not-be-loaded-when-using-Mllib-PCA-td7042.html#a8307

On Sat, Nov 8, 2014 at 1:21 PM, Sean Owen so...@cloudera.com wrote:

 This means you haven't actually included commons-math3 in your
 application. Check the contents of your final app jar and then go
 check your build file again.

 On Sat, Nov 8, 2014 at 12:20 PM, lev kat...@gmail.com wrote:
  Hi,
  I'm using breeze.stats.distributions.Binomial with spark 1.1.0 and having
  the same error.
  I tried to add the dependency to math3 with versions 3.11, 3.2, 3.3 and
 it
  didn't help.
 
  Any ideas what might be the problem?
 
  Thanks,
  Lev.
 
 
  anny9699 wrote
  I use the breeze.stats.distributions.Bernoulli in my code, however met
  this problem
  java.lang.NoClassDefFoundError:
  org/apache/commons/math3/random/RandomGenerator
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-tp15748p18406.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Matrix multiplication in spark

2014-11-05 Thread Shivaram Venkataraman
We are working on PRs to add block-partitioned matrix formats and dense
matrix multiply methods. This should be out in the next few weeks or so.
The sparse methods still need some research on partitioning schemes etc.
and we will do that after the dense methods are in place.
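In the meantime, a minimal sketch of what is already possible with mllib's RowMatrix (multiplying a distributed row matrix by a small local matrix, assuming a spark-shell context where sc is available):

  import org.apache.spark.mllib.linalg.{Matrices, Vectors}
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
  val A = new RowMatrix(rows)                              // distributed m x 2 matrix
  val B = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))  // small local 2 x 2 matrix
  val C = A.multiply(B)                                    // result is again a distributed RowMatrix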

Thanks
Shivaram

On Wed, Nov 5, 2014 at 2:00 AM, Duy Huynh duy.huynh@gmail.com wrote:

 ok great.  when will this be ready?

 On Wed, Nov 5, 2014 at 4:27 AM, Xiangrui Meng men...@gmail.com wrote:

 We are working on distributed block matrices. The main JIRA is at:

 https://issues.apache.org/jira/browse/SPARK-3434

 The goal is to support basic distributed linear algebra, (dense first
 and then sparse).

 -Xiangrui

 On Wed, Nov 5, 2014 at 12:23 AM, ll duy.huynh@gmail.com wrote:
  @sowen.. i am looking for distributed operations, especially very large
  sparse matrix x sparse matrix multiplication.  what is the best way to
  implement this in spark?
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-multiplication-in-spark-tp12562p18164.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 





Re: Spark Build

2014-10-31 Thread Shivaram Venkataraman
Yeah looks like https://github.com/apache/spark/pull/2744 broke the
build. We will fix it soon

On Fri, Oct 31, 2014 at 12:21 PM, Terry Siu terry@smartfocus.com wrote:
 I am synced up to the Spark master branch as of commit 23468e7e96. I have
 Maven 3.0.5, Scala 2.10.3, and SBT 0.13.1. I’ve built the master branch
 successfully previously and am trying to rebuild again to take advantage of
 the new Hive 0.13.1 profile. I execute the following command:

 $ mvn -DskipTests -Phive-0.13-1 -Phadoop-2.4 -Pyarn clean package

 The build fails at the following stage:

 INFO] Using incremental compilation

 [INFO] compiler plugin:
 BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)

 [INFO] Compiling 5 Scala sources to
 /home/terrys/Applications/spark/yarn/stable/target/scala-2.10/test-classes...

 [ERROR]
 /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:20:
 object MemLimitLogger is not a member of package
 org.apache.spark.deploy.yarn

 [ERROR] import org.apache.spark.deploy.yarn.MemLimitLogger._

 [ERROR] ^

 [ERROR]
 /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:29:
 not found: value memLimitExceededLogMessage

 [ERROR] val vmemMsg = memLimitExceededLogMessage(diagnostics,
 VMEM_EXCEEDED_PATTERN)

 [ERROR]   ^

 [ERROR]
 /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:30:
 not found: value memLimitExceededLogMessage

 [ERROR] val pmemMsg = memLimitExceededLogMessage(diagnostics,
 PMEM_EXCEEDED_PATTERN)

 [ERROR]   ^

 [ERROR] three errors found

 [INFO]
 

 [INFO] Reactor Summary:

 [INFO]

 [INFO] Spark Project Parent POM .. SUCCESS [2.758s]

 [INFO] Spark Project Common Network Code . SUCCESS [6.716s]

 [INFO] Spark Project Core  SUCCESS
 [2:46.610s]

 [INFO] Spark Project Bagel ... SUCCESS [16.776s]

 [INFO] Spark Project GraphX .. SUCCESS [52.159s]

 [INFO] Spark Project Streaming ... SUCCESS
 [1:09.883s]

 [INFO] Spark Project ML Library .. SUCCESS
 [1:18.932s]

 [INFO] Spark Project Tools ... SUCCESS [10.210s]

 [INFO] Spark Project Catalyst  SUCCESS
 [1:12.499s]

 [INFO] Spark Project SQL . SUCCESS
 [1:10.561s]

 [INFO] Spark Project Hive  SUCCESS
 [1:08.571s]

 [INFO] Spark Project REPL  SUCCESS [32.377s]

 [INFO] Spark Project YARN Parent POM . SUCCESS [1.317s]

 [INFO] Spark Project YARN Stable API . FAILURE [25.918s]

 [INFO] Spark Project Assembly  SKIPPED

 [INFO] Spark Project External Twitter  SKIPPED

 [INFO] Spark Project External Kafka .. SKIPPED

 [INFO] Spark Project External Flume Sink . SKIPPED

 [INFO] Spark Project External Flume .. SKIPPED

 [INFO] Spark Project External ZeroMQ . SKIPPED

 [INFO] Spark Project External MQTT ... SKIPPED

 [INFO] Spark Project Examples  SKIPPED

 [INFO]
 

 [INFO] BUILD FAILURE

 [INFO]
 

 [INFO] Total time: 11:15.889s

 [INFO] Finished at: Fri Oct 31 12:08:55 PDT 2014

 [INFO] Final Memory: 73M/829M

 [INFO]
 

 [ERROR] Failed to execute goal
 net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile
 (scala-test-compile-first) on project spark-yarn_2.10: Execution
 scala-test-compile-first of goal
 net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile failed.
 CompileFailed - [Help 1]

 [ERROR]

 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
 switch.

 [ERROR] Re-run Maven using the -X switch to enable full debug logging.

 [ERROR]

 [ERROR] For more information about the errors and possible solutions, please
 read the following articles:

 [ERROR] [Help 1]
 http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException

 [ERROR]

 [ERROR] After correcting the problems, you can resume the build with the
 command

 [ERROR]   mvn goals -rf :spark-yarn_2.10


 I could not find MemLimitLogger anywhere in the Spark code. Has anybody else
 seen/encountered this?


 Thanks,

 -Terry




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

New SparkR mailing list, JIRA

2014-08-28 Thread Shivaram Venkataraman
Hi

I'd like to announce a couple of updates to the SparkR project. In order to
facilitate better collaboration for new features and development we have a
new mailing list, issue tracker for SparkR.

- The new JIRA is hosted at https://sparkr.atlassian.net/browse/SPARKR/ and
we have migrated all existing Github issues to the JIRA. Please submit any
bugs / improvements to this JIRA going forward.

- There is a new mailing list sparkr-...@googlegroups.com that will be used
for design discussions for new features and development related issues. We
will still be answering to user issues on Apache Spark mailing lists.

Please let me know if you have any questions.

Thanks
Shivaram


Re: SparkR: split, apply, combine strategy for dataframes?

2014-08-14 Thread Shivaram Venkataraman
Could you try increasing the number of slices with the large data set ?
SparkR assumes that each slice (or partition in Spark terminology) can fit
in the memory of a single machine. Also, is the error happening when you do the
map function or does it happen when you combine the results ?
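For example, a sketch of the flow described in the original post with an explicit (purely illustrative) slice count, assuming foo returns a data.frame per group:

  pieces <- split(df, df$id)
  rdd <- parallelize(sc, pieces, 5000)      # many slices, only a few groups per slice
  results <- lapply(rdd, foo)               # run foo on each group
  combined <- do.call(rbind, collect(results))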

Thanks
Shivaram


On Thu, Aug 14, 2014 at 3:53 PM, Carlos J. Gil Bellosta 
gilbello...@gmail.com wrote:

 Hello,

 I am having problems trying to apply the split-apply-combine strategy
 for dataframes using SparkR.

 I have a largish dataframe and I would like to achieve something similar
 to what

 ddply(df, .(id), foo)

 would do, only that using SparkR as computing engine. My df has a few
 million records and I would like to split it by id and operate on
 the pieces. These pieces are quite small in size: just a few hundred
 records.

 I do something along the following lines:

 1) Use split to transform df into a list of dfs.
 2) parallelize the resulting list as a RDD (using a few thousand slices)
 3) map my function on the pieces using Spark.
 4) recombine the results (do.call, rbind, etc.)

 My cluster works and I can perform medium sized batch jobs.

 However, it fails with my full df: I get a heap space out of memory
 error. It is funny as the slices are very small in size.

 Should I send smaller batches to my cluster? Is there any recommended
 general approach to these kind of split-apply-combine problems?

 Best,

 Carlos J. Gil Bellosta
 http://www.datanalytics.com

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Lost executors

2014-08-13 Thread Shivaram Venkataraman
If the JVM heap size is close to the memory limit the OS sometimes kills
the process under memory pressure. I've usually found that lowering the
executor memory size helps.
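A sketch of that change (the value is illustrative; pick something that leaves the node headroom for the OS and page cache):

  # conf/spark-defaults.conf, or set spark.executor.memory on the SparkConf
  spark.executor.memory   20g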

Shivaram


On Wed, Aug 13, 2014 at 11:01 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 What is your Spark executor memory set to? (You can see it in Spark's web
 UI at http://driver:4040 under the executors tab). One thing to be
 aware of is that the JVM never really releases memory back to the OS, so it
 will keep filling up to the maximum heap size you set. Maybe 4 executors
 with that much heap are taking a lot of the memory.

 Persist as DISK_ONLY should indeed stream data from disk, so I don't think
 that will be a problem.

 Matei

 On August 13, 2014 at 6:49:11 AM, rpandya (r...@iecommerce.com) wrote:

 After a lot of grovelling through logs, I found out that the Nagios
 monitor
 process detected that the machine was almost out of memory, and killed the
 SNAP executor process.

 So why is the machine running out of memory? Each node has 128GB of RAM, 4
 executors, about 40GB of data. It did run out of memory if I tried to
 cache() the RDD, but I would hope that persist() is implemented so that it
 would stream to disk without trying to materialize too much data in RAM.

 Ravi



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722p12032.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Shivaram Venkataraman
If you just want to find the top eigenvalue / eigenvector you can do
something like the Lanczos method. There is a description of a MapReduce
based algorithm in Section 4.2 of [1]

[1] http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf
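For a symmetric positive semidefinite matrix the top singular vector coincides with the top eigenvector, so as a rough sketch you can also lean on mllib's SVD (the input path and line format below are assumptions):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // each line of the input is assumed to be one comma-separated matrix row
  val rows = sc.textFile("hdfs:///adjacency.txt")
    .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
  val mat = new RowMatrix(rows)
  val svd = mat.computeSVD(1, computeU = false)
  val topValue  = svd.s(0)                                  // top eigenvalue for the PSD case
  val topVector = svd.V.toArray.take(mat.numCols().toInt)   // first column of V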


On Thu, Aug 7, 2014 at 10:54 AM, Li Pu l...@twitter.com.invalid wrote:

 @Miles, the latest SVD implementation in mllib is partially distributed.
 Matrix-vector multiplication is computed among all workers, but the right
 singular vectors are all stored in the driver. If your symmetric matrix is
 n x n and you want the first k eigenvalues, you will need to fit n x k
 doubles in driver's memory. Behind the scene, it calls ARPACK to compute
 eigen-decomposition of A^T A. You can look into the source code for the
 details.

 @Sean, the SVD++ implementation in graphx is not the canonical definition
 of SVD. It doesn't have the orthogonality that SVD holds. But we might want
 to use graphx as the underlying matrix representation for mllib.SVD to
 address the problem of skewed entry distribution.


 On Thu, Aug 7, 2014 at 10:51 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Reza Zadeh has contributed the distributed implementation of
 (Tall/Skinny) SVD (
 http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html),
 which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in Spark
 1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your data is
 sparse (which it often is in social networks), you may have better luck
 with this.

 I haven't tried the GraphX implementation, but those algorithms are often
 well-suited for power-law distributed graphs as you might see in social
 networks.

 FWIW, I believe you need to square elements of the sigma matrix from the
 SVD to get the eigenvalues.




 On Thu, Aug 7, 2014 at 10:20 AM, Sean Owen so...@cloudera.com wrote:

 (-incubator, +user)

 If your matrix is symmetric (and real I presume), and if my linear
 algebra isn't too rusty, then its SVD is its eigendecomposition. The
 SingularValueDecomposition object you get back has U and V, both of
 which have columns that are the eigenvectors.

 There are a few SVDs in the Spark code. The one in mllib is not
 distributed (right?) and is probably not an efficient means of
 computing eigenvectors if you really just want a decomposition of a
 symmetric matrix.

 The one I see in graphx is distributed? I haven't used it though.
 Maybe it could be part of a solution.



 On Thu, Aug 7, 2014 at 2:21 PM, yaochunnan yaochun...@gmail.com wrote:
  Our lab need to do some simulation on online social networks. We need
 to
  handle a 5000*5000 adjacency matrix, namely, to get its largest
 eigenvalue
  and corresponding eigenvector. Matlab can be used but it is
 time-consuming.
  Is Spark effective in linear algebra calculations and transformations?
 Later
  we would have 500*500 matrix processed. It seems emergent that
 we
  should find some distributed computation platform.
 
  I see SVD has been implemented and I can get eigenvalues of a matrix
 through
  this API.  But when I want to get both eigenvalues and eigenvectors or
 at
  least the biggest eigenvalue and the corresponding eigenvector, it
 seems
  that current Spark doesn't have such API. Is it possible that I write
  eigenvalue decomposition from scratch? What should I do? Thanks a lot!
 
 
  Miles Yao
 
  
  View this message in context: How can I implement eigenvalue
 decomposition
  in Spark?
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





 --
 Li
 @vrilleup



Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-07 Thread Shivaram Venkataraman
I tried this out and what is happening here is that as the input file is
small only 1 partition is created. lapplyPartition runs the given function
on the partition and computes sumx as 55 and sumy as 55. Now the return
value from lapplyPartition is treated as a list by SparkR and collect
concatenates all the lists from all partitions.

Thus output in this case is just a list with two values and trying to
access element[2] in the for loop gives NA. If you just use
cat(as.character(element), "\n"), you should see 55 and 55.
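If you want one (sumx, sumy) pair back per partition, a minimal sketch is to wrap the pair in a list so collect() returns it as a single element (assuming each input line is "x,y"):

  totals <- lapplyPartition(lines, function(part) {
    xy <- do.call(rbind, strsplit(unlist(part), ","))
    list(c(sum(as.numeric(xy[, 1])), sum(as.numeric(xy[, 2]))))
  })
  for (element in collect(totals)) {
    cat(element[1], element[2], "\n")
  }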

Thanks
Shivaram


On Thu, Aug 7, 2014 at 3:21 PM, Pranay Dave pranay.da...@gmail.com wrote:

 Hello Zongheng
 In fact, the problem is in lapplyPartition.
 lapply gives output as
 1,1
 2,2
 3,3
 ...
 10,10

 However lapplyPartition gives output as
 55, NA
 55, NA

 Why is the lapply output horizontal while the lapplyPartition output is vertical?

 Here is my code
 library(SparkR)


 sc <- sparkR.init("local")
 lines <- textFile(sc, "/sparkdev/datafiles/covariance.txt")

 totals <- lapplyPartition(lines, function(lines)
 {
   sumx <- 0
   sumy <- 0
   totaln <- 0
   for (i in 1:length(lines)) {
     dataxy <- unlist(strsplit(lines[i], ","))
     sumx <- sumx + as.numeric(dataxy[1])
     sumy <- sumy + as.numeric(dataxy[2])
   }

   ## list(as.numeric(sumx), as.numeric(sumy), as.numeric(sumxy), as.numeric(totaln))
   ## list does same as below
   c(sumx, sumy)
 }
 )

 output <- collect(totals)
 for (element in output) {
   cat(as.character(element[1]), as.character(element[2]), "\n")
 }




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-lapplyPartition-transforms-the-data-in-vertical-format-tp11540p11726.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-06 Thread Shivaram Venkataraman
The output of lapply and lapplyPartition should be the same by design -- the
only difference is that in lapply the user-defined function returns a row,
while it returns a list in lapplyPartition.

Could you give an example of a small input and output that you expect to
see for the above program ?

Shivaram


On Wed, Aug 6, 2014 at 5:47 AM, Pranay Dave pranay.da...@gmail.com wrote:

 Hello
 As per the documentation, lapply works on single records and lapplyPartition
 works on a partition.
 However, the format of the output does not change.

 When I use lapplyPartition, the data is converted to a vertical format.

 Here is my code
 library(SparkR)


 sc <- sparkR.init("local")
 lines <- textFile(sc, "/sparkdev/datafiles/covariance.txt")

 totals <- lapply(lines, function(lines)
 {
   sumx <- 0
   sumy <- 0
   totaln <- 0
   for (i in 1:length(lines)) {
     dataxy <- unlist(strsplit(lines[i], ","))
     sumx <- sumx + as.numeric(dataxy[1])
     sumy <- sumy + as.numeric(dataxy[2])
   }

   ## list(as.numeric(sumx), as.numeric(sumy), as.numeric(sumxy), as.numeric(totaln))
   ## list does same as below
   c(sumx, sumy)
 }
 )

 output <- collect(totals)
 for (element in output) {
   cat(as.character(element[1]), as.character(element[2]), "\n")
 }

 I am expecting output as 55, 55
 However it is giving
 55,NA
 55,NA

 Where am I going wrong ?
 Thanks
 Pranay



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-lapplyPartition-transforms-the-data-in-vertical-format-tp11540.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Shivaram Venkataraman
This fails for me too. I have no idea why it happens as I can wget the pom
from maven central. To work around this I just copied the ivy xmls and jars
from this github repo
https://github.com/peterklipfel/scala_koans/tree/master/ivyrepo/cache/org.scala-lang/scala-library
and put it in /root/.ivy2/cache/org.scala-lang/scala-library
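In shell form, roughly (the repo layout at that link is an assumption):

  git clone https://github.com/peterklipfel/scala_koans
  mkdir -p /root/.ivy2/cache/org.scala-lang/scala-library
  cp -r scala_koans/ivyrepo/cache/org.scala-lang/scala-library/. \
        /root/.ivy2/cache/org.scala-lang/scala-library/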

Thanks
Shivaram




On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote:

 Currently scala 2.10.2 can't be pulled in from maven central it seems,
 however if you have it in your ivy cache it should work.


 On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote:

 Me 3


 On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote:

 I also ran into same issue. What is the solution?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Compiling-Spark-master-284771ef-with-sbt-sbt-assembly-fails-on-EC2-tp11155p11189.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




 --
 Cell : 425-233-8271




 --
 Cell : 425-233-8271



Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Shivaram Venkataraman
Thanks Patrick -- It does look like some maven misconfiguration as

wget
http://repo1.maven.org/maven2/org/scala-lang/scala-library/2.10.2/scala-library-2.10.2.pom

works for me.

Shivaram



On Fri, Aug 1, 2014 at 3:27 PM, Patrick Wendell pwend...@gmail.com wrote:

 This is a Scala bug - I filed something upstream, hopefully they can fix
 it soon and/or we can provide a work around:

 https://issues.scala-lang.org/browse/SI-8772

 - Patrick


 On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote:

 Currently scala 2.10.2 can't be pulled in from maven central it seems,
 however if you have it in your ivy cache it should work.


 On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca
 wrote:

 Me 3


 On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote:

 I also ran into same issue. What is the solution?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Compiling-Spark-master-284771ef-with-sbt-sbt-assembly-fails-on-EC2-tp11155p11189.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




 --
 Cell : 425-233-8271




 --
 Cell : 425-233-8271





Re: sbt directory missed

2014-07-28 Thread Shivaram Venkataraman
I think the 1.0 AMI only contains the prebuilt packages (i.e. just the
binaries) of Spark and not the source code. If you want to build Spark on
EC2, you can clone the github repo and then use sbt.
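For example, a minimal sketch on the master node (check out whichever branch or tag you need):

  git clone https://github.com/apache/spark.git
  cd spark
  sbt/sbt assembly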

Thanks
Shivaram


On Mon, Jul 28, 2014 at 8:49 AM, redocpot julien19890...@gmail.com wrote:

 update:

 Just checked the python launch script, when retrieving spark, it will refer
 to this script:
 https://github.com/mesos/spark-ec2/blob/v3/spark/init.sh

 where each version number is mapped to a tar file,

 0.9.2)
   if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
 wget
 http://s3.amazonaws.com/spark-related-packages/spark-0.9.2-bin-hadoop1.tgz
   else
 wget
 http://s3.amazonaws.com/spark-related-packages/spark-0.9.2-bin-cdh4.tgz
   fi
   ;;
 1.0.0)
   if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
 wget
 http://s3.amazonaws.com/spark-related-packages/spark-1.0.0-bin-hadoop1.tgz
   else
 wget
 http://s3.amazonaws.com/spark-related-packages/spark-1.0.0-bin-cdh4.tgz
   fi
   ;;
 1.0.1)
   if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
 wget
 http://s3.amazonaws.com/spark-related-packages/spark-1.0.1-bin-hadoop1.tgz
   else
 wget
 http://s3.amazonaws.com/spark-related-packages/spark-1.0.1-bin-cdh4.tgz
   fi
   ;;

 I just checked the last three tar files. I found the /sbt directory and many
 other directories like bagel, mllib, etc. in the 0.9.2 tar file. However, they are
 not in the 1.0.0 and 1.0.1 tar files.

 I am not sure that the 1.0.X versions are mapped to the correct tar files.




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/sbt-directory-missed-tp10783p10784.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: BUG in spark-ec2 script (--ebs-vol-size) and workaround...

2014-07-18 Thread Shivaram Venkataraman
Thanks a lot for reporting this. I think we just missed installing xfsprogs
on the AMI. I have a fix for this at
https://github.com/mesos/spark-ec2/pull/59.

After the pull request is merged, any new clusters launched should have
mkfs.xfs
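Until then, a sketch of the manual workaround Ben describes below (key, identity file and cluster name are placeholders):

  # on the master and on every slave
  yum install -y xfsprogs
  # then re-run the launch with --resume
  ./spark-ec2 -k <key-pair> -i <identity-file> --ebs-vol-size=250 --resume launch <cluster-name>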

Thanks
Shivaram


On Fri, Jul 18, 2014 at 4:56 PM, Ben Horner ben.hor...@atigeo.com wrote:

 Hello all,

 There is a bug in the spark-ec2 script (perhaps due to a change in the
 Amazon AMI).

 The --ebs-vol-size option directs the spark-ec2 script to add an EBS volume
 of the specified size, and mount it at /vol for a persistent HDFS.  To do
 this, it uses mkfs.xfs which is not available (though mkfs is).

 To work around this, I was able to run yum install xfsprogs on the master
 and each slave, and then use the --resume option with the script, and the
 persistent HDFS actually worked!

 This has been a frustrating experience, but I've used the spark-ec2 script
 for several months now, and it's incredibly helpful.  I hope this post
 helps
 towards fixing the problem!

 Thanks,
 -Ben

 P.S.  This is the full initial command I used, in case this is isolated to
 particular instance types or anything:
 ec2/spark-ec2 -k ... -i ... -z us-east-1d -s 4 -t m3.2xlarge
 --ebs-vol-size=250 -m r3.2xlarge launch ...

 P.P.S.  Ganglia is still broken, and has been for a while...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/BUG-in-spark-ec2-script-ebs-vol-size-and-workaround-tp10217.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: SparkR failed to connect to the master

2014-07-15 Thread Shivaram Venkataraman
You'll need to build SparkR to match the Spark version deployed on the
cluster. You can do that by changing the Spark version in SparkR's
build.sbt [1]. If you are using the Maven build you'll need to edit pom.xml
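For example, a sketch of the sbt route (the exact line in pkg/src/build.sbt may look slightly different):

  // change the Spark dependency version to match the cluster, e.g.
  "org.apache.spark" %% "spark-core" % "1.0.1"
  // then rebuild the package, e.g. with ./install-dev.sh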

Thanks
Shivaram

[1]
https://github.com/amplab-extras/SparkR-pkg/blob/master/pkg/src/build.sbt#L20


On Mon, Jul 14, 2014 at 6:19 PM, cjwang c...@cjwang.us wrote:

 I tried installing the latest Spark 1.0.1 and SparkR couldn't find the
 master
 either.  I restarted with Spark 0.9.1 and SparkR was able to find the
 master.  So, there seemed to be something that changed after Spark 1.0.0.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-failed-to-connect-to-the-master-tp9359p9680.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: sparkR - is it possible to run sparkR on yarn?

2014-04-30 Thread Shivaram Venkataraman
We don't have any documentation on running SparkR on YARN and I think there
might be some issues that need to be fixed (The recent PySpark on YARN PRs
are an example).
SparkR has only been tested to work with Spark standalone mode so far.

Thanks
Shivaram



On Tue, Apr 29, 2014 at 7:56 PM, phoenix bai mingzhi...@gmail.com wrote:

 Hi all,

 I searched around, but failed to find anything about running
 SparkR on YARN.

 So, is it possible to run SparkR with YARN? Either with yarn-standalone
 or yarn-client mode.
 If so, is there any document that could guide me through the build and setup
 processes?

 I am desperate for some answers, so please help!



Re: Build times for Spark

2014-04-25 Thread Shivaram Venkataraman
Are you by any chance building this on NFS ? As far as I know the build is
severely bottlenecked by filesystem calls during assembly (each class file
in each dependency gets a fstat call or something like that).  That is
partly why building from say a local ext4 filesystem or a SSD is much
faster irrespective of memory / CPU.

Thanks
Shivaram


On Fri, Apr 25, 2014 at 2:09 PM, Akhil Das ak...@sigmoidanalytics.comwrote:

 You can always increase the sbt memory by setting

 export JAVA_OPTS=-Xmx10g




 Thanks
 Best Regards


 On Sat, Apr 26, 2014 at 2:17 AM, Williams, Ken 
 ken.willi...@windlogics.com wrote:

  No, I haven't done any config for SBT.  Is there somewhere you might be
 able to point me toward for how to do that?



 -Ken



 *From:* Josh Rosen [mailto:rosenvi...@gmail.com]
 *Sent:* Friday, April 25, 2014 3:27 PM
 *To:* user@spark.apache.org
 *Subject:* Re: Build times for Spark



 Did you configure SBT to use the extra memory?



 On Fri, Apr 25, 2014 at 12:53 PM, Williams, Ken 
 ken.willi...@windlogics.com wrote:

 I've cloned the github repo and I'm building Spark on a pretty beefy
 machine (24 CPUs, 78GB of RAM) and it takes a pretty long time.



 For instance, today I did a 'git pull' for the first time in a week or
 two, and then doing 'sbt/sbt assembly' took 43 minutes of wallclock time
 (88 minutes of CPU time).  After that, I did 'SPARK_HADOOP_VERSION=2.2.0
 SPARK_YARN=true sbt/sbt assembly' and that took 25 minutes wallclock, 73
 minutes CPU.



 Is that typical?  Or does that indicate some setup problem in my
 environment?



 --

 Ken Williams, Senior Research Scientist

 *WindLogics*

 http://windlogics.com




  --


 CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the
 intended recipient(s) and may contain confidential and privileged
 information. Any unauthorized review, use, disclosure or distribution of
 any kind is strictly prohibited. If you are not the intended recipient,
 please contact the sender via reply e-mail and destroy all copies of the
 original message. Thank you.







Re: Build times for Spark

2014-04-25 Thread Shivaram Venkataraman
AFAIK the resolver does pick up things from your local ~/.m2 -- Note that
as ~/.m2 is on NFS that adds to the amount of filesystem traffic.

Shivaram


On Fri, Apr 25, 2014 at 2:57 PM, Williams, Ken
ken.willi...@windlogics.comwrote:

  I am indeed, but it's a pretty fast NFS.  I don't have any SSD I can
 use, but I could try to use local disk to see what happens.



 For me, a large portion of the time seems to be spent on lines like
 Resolving org.fusesource.jansi#jansi;1.4 ... or similar .  Is this going
 out to find Maven resources?  Any way to tell it to just use my local ~/.m2
 repository instead when the resource already exists there?  Sometimes I
 even get sporadic errors like this:



   [info] Resolving org.apache.hadoop#hadoop-yarn;2.2.0 ...

   [error] SERVER ERROR: Bad Gateway url=
 http://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-yarn-server/2.2.0/hadoop-yarn-server-2.2.0.jar





 -Ken



 *From:* Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
 *Sent:* Friday, April 25, 2014 4:31 PM

 *To:* user@spark.apache.org
 *Subject:* Re: Build times for Spark



 Are you by any chance building this on NFS ? As far as I know the build is
 severely bottlenecked by filesystem calls during assembly (each class file
 in each dependency gets a fstat call or something like that).  That is
 partly why building from say a local ext4 filesystem or a SSD is much
 faster irrespective of memory / CPU.



 Thanks

 Shivaram



 On Fri, Apr 25, 2014 at 2:09 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 You can always increase the sbt memory by setting

 export JAVA_OPTS=-Xmx10g




   Thanks

 Best Regards



 On Sat, Apr 26, 2014 at 2:17 AM, Williams, Ken 
 ken.willi...@windlogics.com wrote:

 No, I haven't done any config for SBT.  Is there somewhere you might be
 able to point me toward for how to do that?



 -Ken



 *From:* Josh Rosen [mailto:rosenvi...@gmail.com]
 *Sent:* Friday, April 25, 2014 3:27 PM
 *To:* user@spark.apache.org
 *Subject:* Re: Build times for Spark



 Did you configure SBT to use the extra memory?



 On Fri, Apr 25, 2014 at 12:53 PM, Williams, Ken 
 ken.willi...@windlogics.com wrote:

 I've cloned the github repo and I'm building Spark on a pretty beefy
 machine (24 CPUs, 78GB of RAM) and it takes a pretty long time.



 For instance, today I did a 'git pull' for the first time in a week or
 two, and then doing 'sbt/sbt assembly' took 43 minutes of wallclock time
 (88 minutes of CPU time).  After that, I did 'SPARK_HADOOP_VERSION=2.2.0
 SPARK_YARN=true sbt/sbt assembly' and that took 25 minutes wallclock, 73
 minutes CPU.



 Is that typical?  Or does that indicate some setup problem in my
 environment?



 --

 Ken Williams, Senior Research Scientist

 *WindLogics*

 http://windlogics.com




  --


 CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the
 intended recipient(s) and may contain confidential and privileged
 information. Any unauthorized review, use, disclosure or distribution of
 any kind is strictly prohibited. If you are not the intended recipient,
 please contact the sender via reply e-mail and destroy all copies of the
 original message. Thank you.









Re: Help with error initializing SparkR.

2014-04-20 Thread Shivaram Venkataraman
I just updated the github issue -- In case anybody is curious, this was a
problem with R resolving the right java version installed in the VM.
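For anyone hitting something similar, a rough sketch of the usual fix (the JAVA_HOME path below is an assumption for that VM):

  export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
  sudo JAVA_HOME=$JAVA_HOME R CMD javareconf
  # then reinstall/reload rJava and SparkR in a fresh R session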

Thanks
Shivaram


On Sat, Apr 19, 2014 at 7:12 PM, tongzzz tongzhang...@gmail.com wrote:

 I can't initialize sc context after a successful install on Cloudera
 quickstart VM.
 This is the error message:

  library(SparkR)
 Loading required package: rJava
 [SparkR] Initializing with classpath
 /usr/lib64/R/library/SparkR/sparkr-assembly-0.1.jar

  sc <- sparkR.init("local")
 Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  :
   java.lang.NoClassDefFoundError: scala.collection.immutable.Vector


 I also created an issue request on the github, which contains more details
 about this issue

 https://github.com/amplab-extras/SparkR-pkg/issues/46


 Thanks for any help.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-error-initializing-SparkR-tp4495.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Creating a SparkR standalone job

2014-04-13 Thread Shivaram Venkataraman
Thanks for attaching code. If I get your use case right you want to call
the sentiment analysis code from Spark Streaming right ? For that I think
you can just use jvmr if that works and I don't think you need SparkR.
 SparkR is mainly intended as an API for large scale jobs which are written
in R.

For this use case where the job is written in Scala (or Java) you can
create your SparkContext in Scala and then just call jvmr from say within a
map function.  The only other thing might be to figure out what the
thread-safety model is for jvmr -- AFAIK R is single threaded, but we run
tasks in multiple threads in Spark.

Thanks
Shivaram


On Sat, Apr 12, 2014 at 12:16 PM, pawan kumar pkv...@gmail.com wrote:

 Hi Shivaram,

 I was able to get R integrated into Spark using jvmr. Now I call R from
 Scala and pass values to the R function using Scala (Spark Streaming
 values). I attached the application. I can also call SparkR, but I am not sure
 where to pass the Spark context in regards to the attached application. Any
 help would be greatly appreciated.
 You can run the application from the Scala IDE. Let me know if you have any
 difficulties running this application.


 Dependencies :
 Having R installed in the machine.

 Thank
 Pawan Kumar Venugopal



 On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 You can create standalone jobs in SparkR as just R files that are run
 using the sparkR script. These commands will be sent to a Spark cluster and
 the examples on the SparkR repository (
 https://github.com/amplab-extras/SparkR-pkg#examples-unit-tests) are in
 fact standalone jobs.

 However I don't think that will completely solve your use case of using
 Streaming + R. We don't yet have a way to call R functions from Spark's
 Java or Scala API. So right now one thing you can try is to save data from
 SparkStreaming to HDFS and then run a SparkR job which reads in the file.

 Regarding the other idea of calling R from Scala -- it might be possible
 to do that in your code if the classpath etc. is setup correctly. I haven't
 tried it out though, but do let us know if you get it to work.

 Thanks
 Shivaram


 On Mon, Apr 7, 2014 at 2:21 PM, pawan kumar pkv...@gmail.com wrote:

 Hi,

 Is it possible to create a standalone job in scala using sparkR? If
 possible can you provide me with the information of the setup process.
 (Like the dependencies in SBT and where to include the JAR files)

 This is my use-case:

 1. I have a Spark Streaming standalone Job running in local machine
 which streams twitter data.
 2. I have an R script which performs Sentiment Analysis.

 I am looking for an optimal way where I could combine these two
 operations into a single job and run using SBT Run command.

 I came across this document which talks about embedding R into scala (
 http://dahl.byu.edu/software/jvmr/dahl-payne-uppalapati-2013.pdf) but
 was not sure if that would work well within the spark context.

 Thanks,
 Pawan Venugopal






Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Shivaram Venkataraman
The AMI should automatically switch between PVM and HVM based on the
instance type you specify on the command line. For reference (note you
don't need to specify this on the command line), the PVM ami id
is ami-5bb18832 in us-east-1.

FWIW we maintain the list of AMI Ids (across regions and pvm, hvm) at
https://github.com/mesos/spark-ec2/tree/v2/ami-list

Thanks
Shivaram


On Wed, Apr 9, 2014 at 9:12 AM, Marco Costantini 
silvio.costant...@granatads.com wrote:

 Ah, tried that. I believe this is an HVM AMI? We are exploring paravirtual
 AMIs.


 On Wed, Apr 9, 2014 at 11:17 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 And for the record, that AMI is ami-35b1885c. Again, you don't need to
 specify it explicitly; spark-ec2 will default to it.


 On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Marco,

 If you call spark-ec2 launch without specifying an AMI, it will default
 to the Spark-provided AMI.

 Nick


 On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi there,
 To answer your question; no there is no reason NOT to use an AMI that
 Spark has prepared. The reason we haven't is that we were not aware such
 AMIs existed. Would you kindly point us to the documentation where we can
 read about this further?

 Many many thanks, Shivaram.
 Marco.


 On Tue, Apr 8, 2014 at 4:42 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 Is there any reason why you want to start with a vanilla amazon AMI
 rather than the ones we build and provide as a part of Spark EC2 scripts ?
 The AMIs we provide are close to the vanilla AMI but have the root account
 setup properly and install packages like java that are used by Spark.

 If you wish to customize the AMI, you could always start with our AMI
 and add more packages you like -- I have definitely done this recently and
 it works with HVM and PVM as far as I can tell.

 Shivaram


 On Tue, Apr 8, 2014 at 8:50 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 I was able to keep the workaround ...around... by overwriting the
 generated '/root/.ssh/authorized_keys' file with a known good one, in the
 '/etc/rc.local' file


 On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Another thing I didn't mention. The AMI and user used: naturally
 I've created several of my own AMIs with the following characteristics.
 None of which worked.

 1) Enabling ssh as root as per this guide (
 http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
 When doing this, I do not specify a user for the spark-ec2 script. What
 happens is that, it works! But only while it's alive. If I stop the
 instance, create an AMI, and launch a new instance based from the new 
 AMI,
 the change I made in the '/root/.ssh/authorized_keys' file is 
 overwritten

 2) Adding the 'ec2-user' to the 'root' group. This means that the
 ec2-user does not have to use sudo to perform any operations needing root
 privileges. When doing this, I specify the user 'ec2-user' for the
 spark-ec2 script. An error occurs: rsync fails with exit code 23.
 spark-ec2 script. An error occurs: rsync fails with exit code 23.

 I believe HVMs still work. But it would be valuable to the community
 to know that the root user work-around does/doesn't work any more for
 paravirtual instances.

 Thanks,
 Marco.


 On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 As requested, here is the script I am running. It is a simple shell
 script which calls spark-ec2 wrapper script. I execute it from the 
 'ec2'
 directory of spark, as usual. The AMI used is the raw one from the AWS
 Quick Start section. It is the first option (an Amazon Linux 
 paravirtual
 image). Any ideas or confirmation would be GREATLY appreciated. Please 
 and
 thank you.


 #!/bin/sh

 export AWS_ACCESS_KEY_ID=MyCensoredKey
 export AWS_SECRET_ACCESS_KEY=MyCensoredKey

 AMI_ID=ami-2f726546

 ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s
 10 -v 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t 
 m3.2xlarge
 launch marcotest



 On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Hmm -- That is strange. Can you paste the command you are using to
 launch the instances ? The typical workflow is to use the spark-ec2 
 wrapper
 script using the guidelines at
 http://spark.apache.org/docs/latest/ec2-scripts.html

 Shivaram


 On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and
 that it must be 'root'. The typical workaround is as you said, allow 
 the
 ssh with the root user. Now, don't laugh, but, this worked last 
 Friday, but
 today (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the
 root user's 'authorized_keys' file is always overwritten. This means 
 the
 workaround doesn't

Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Shivaram Venkataraman
Hmm -- That is strange. Can you paste the command you are using to launch
the instances ? The typical workflow is to use the spark-ec2 wrapper script
using the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html

Shivaram


On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and that it
 must be 'root'. The typical workaround is as you said, allow the ssh with
 the root user. Now, don't laugh, but, this worked last Friday, but today
 (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the root user's
 'authorized_keys' file is always overwritten. This means the workaround
 doesn't work anymore! I would LOVE for someone to verify this.

 Just to point out, I am trying to make this work with a paravirtual
 instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

  Right now the spark-ec2 scripts assume that you have root access, and a
  lot of the internal scripts have the user's home directory hard-coded as
  /root.   However all the Spark AMIs we build should have root ssh access --
  Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes
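  That is, roughly (the restart command is an assumption for Amazon Linux):

    # /etc/ssh/sshd_config
    PermitRootLogin yes
    # then
    service sshd restart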

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
 Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user to
 log in with, but even when I change this, the script fails for various
  reasons. And from the output it SEEMS that the script is still based on the
  specified user's home directory being '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.






Re: Spark-ec2 setup is getting slower and slower

2014-03-30 Thread Shivaram Venkataraman
That is a good idea, though I am not sure how much it will help, as the time to
rsync also depends on the amount of data being copied. The other problem
is that sometimes we have dependencies across packages, so the first needs
to be running before the second can start, etc.
However I agree that it takes too long to launch say a 100 node cluster
right now. If you want to take a shot at trying out some changes, you can
fork the spark-ec2 repo at https://github.com/mesos/spark-ec2/tree/v2  and
modify the number of rsync calls (each call to /root/spark-ec2/copy-dir
launches an rsync now).

Thanks
Shivaram


On Sun, Mar 30, 2014 at 3:12 PM, Aureliano Buendia buendia...@gmail.comwrote:

 Hi,

 Spark-ec2 uses rsync to deploy many applications. It seems over time more
 and more applications have been added to the script, which has
 significantly slowed down the setup time.

 Perhaps the script could be restructured this way: Instead of
 rsyncing N times per application, we could have 1 rsync which deploys N
 applications.

 This should remarkably speed up the setup part, specially for clusters
 with many nodes.



Re: How many partitions is my RDD split into?

2014-03-24 Thread Shivaram Venkataraman
There is no direct way to get this in pyspark, but you can get it from the
underlying java rdd. For example

a = sc.parallelize([1,2,3,4], 2)
a._jrdd.splits().size()


On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Mark,

 This appears to be a Scala-only feature. :(

 Patrick,

 Are we planning to add this to PySpark?

 Nick


 On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra m...@clearstorydata.comwrote:

 It's much simpler: rdd.partitions.size


 On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Hey there fellow Dukes of Data,

 How can I tell how many partitions my RDD is split into?

 I'm interested in knowing because, from what I gather, having a good
 number of partitions is good for performance. If I'm looking to understand
 how my pipeline is performing, say for a parallelized write out to HDFS,
 knowing how many partitions an RDD has would be a good thing to check.

 Is that correct?

 I could not find an obvious method or property to see how my RDD is
 partitioned. Instead, I devised the following thingy:

 def f(idx, itr): yield idx

 rdd = sc.parallelize([1, 2, 3, 4], 4)
 rdd.mapPartitionsWithIndex(f).count()

 Frankly, I'm not sure what I'm doing here, but this seems to give me the
 answer I'm looking for. Derp. :)

 So in summary, should I care about how finely my RDDs are partitioned?
 And how would I check on that?

 Nick


 --
 View this message in context: How many partitions is my RDD split 
 into?http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-tp3072.html
 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at Nabble.com.






Re: Problem with SparkR

2014-03-23 Thread Shivaram Venkataraman
Hi

Thanks for reporting this. It'll be great if you can check a couple of
things:

1. Are you trying to use this with Hadoop2 by any chance ? There was an
incompatible ASM version bug that we fixed for Hadoop2
https://github.com/amplab-extras/SparkR-pkg/issues/17 and we verified it,
but I just want to check if the same error is cropping up again.

2. Is there a stack trace that follows the IncompatibleClassChangeError ?
If so could you attach that ? The error message indicates there is some
incompatibility between class versions and having a more detailed stack
trace might help us track this down.

Thanks
Shivaram





On Sun, Mar 23, 2014 at 4:48 PM, Jacques Basaldúa jacq...@dybot.com wrote:

  I am really interested in using Spark from R and have tried to use
 SparkR, but always get the same error.



 This is how I installed:



  - I successfully installed Spark version  0.9.0 with Scala  2.10.3
 (OpenJDK 64-Bit Server VM, Java 1.7.0_45)

I can run examples from spark-shell and Python



  - I installed the R package devtools and installed SparkR using:



  - library(devtools)

  - install_github("amplab-extras/SparkR-pkg", subdir="pkg")



   This compiled the package successfully.



 When I try to run the package



 E.g.,

   library(SparkR)

   sc <- sparkR.init(master="local")   // so far the program runs fine

   rdd <- parallelize(sc, 1:10)  // This returns the following error

   Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") :

   java.lang.IncompatibleClassChangeError:
 org/apache/spark/util/InnerClosureFinder



 No matter how I try to use the sc (I have tried all the examples) I always
 get an error.



 Any ideas?



 Jacques.