I think the answer to this depends on what granularity you want to run
the algorithm on. If it's on the entire Spark DataFrame and you
expect the data frame to be very large, then it isn't easy to use the
existing R function. However if you want to run the algorithm on
smaller subsets of the data
Can you open an issue on https://github.com/amplab/spark-ec2 ? I
think we should be able to escape the version string and pass the
2.0.0-preview through the scripts
Shivaram
On Tue, Jun 14, 2016 at 12:07 PM, Sunil Kumar
wrote:
> Hi,
>
> The spark-ec2 scripts are
Overall this sounds good to me. One question I have is that in
addition to the ML algorithms we have a number of linear algebra
(various distributed matrices) and statistical methods in the
spark.mllib package. Is the plan to port or move these to the spark.ml
namespace in the 2.x series ?
Thanks
I think it's just a bug -- I think we originally followed the Python
API (in the original PR [1]) but the Python API seems to have been
changed to match Scala / Java in
https://issues.apache.org/jira/browse/SPARK-6366
Feel free to open a JIRA / PR for this.
Thanks
Shivaram
[1]
lasspath? I verified the assembly was built right and its in the classpath
> (else nothing would work).
>
> Thanks,
> Tom
>
>
>
> On Tuesday, November 10, 2015 8:29 PM, Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>
>
> I think this is
I think this is happening in the driver. Could you check the classpath
of the JVM that gets started ? If you use spark-submit on YARN the
classpath is set up before R gets launched, so it should match the
behavior of Scala / Python.
Thanks
Shivaram
On Fri, Nov 6, 2015 at 1:39 PM, Tom Graves
It is a known limitation that spark-ec2 is very slow for large
clusters and as you mention most of this is due to the use of rsync to
transfer things from the master to all the slaves.
Nick cc'd has been working on an alternative approach at
https://github.com/nchammas/flintrock that is more
RStudio should already be setup if you launch an EC2 cluster using
spark-ec2. See http://blog.godatadriven.com/sparkr-just-got-better.html
for details.
Shivaram
On Wed, Nov 4, 2015 at 5:11 PM, Andy Davidson
wrote:
> Hi
>
> I just set up a spark cluster on AWS ec2
+Xiangrui
I am not sure exposing the entire GraphX API would make sense as it
contains a lot of low level functions. However we could expose some
high level functions like PageRank etc. Xiangrui, who has been working
on similar techniques to expose MLLib functions like GLM might have
more to add.
There shouldn't be anything Mac OS specific about this feature. One point
of warning though -- As mentioned previously in this thread the APIs were
made private because we aren't sure we will be supporting them in the
future. If you are using these APIs it would be good to chime in on the
JIRA
FWIW I've run into similar BLAS related problems before and wrote up a
document on how to do this for Spark EC2 clusters at
https://github.com/amplab/ml-matrix/blob/master/EC2.md -- Note that this
works with a vanilla Spark build (you only need to link to netlib-lgpl in
your App) but requires the
There was a fix for `--jars` that went into 1.4.1
https://github.com/apache/spark/commit/2579948bf5d89ac2d822ace605a6a4afce5258d6
Shivaram
On Tue, Jul 14, 2015 at 4:18 AM, Sun, Rui rui@intel.com wrote:
Could you give more details about the mis-behavior of --jars for SparkR?
maybe it's a
You can just use `--files` and I think it should work. Let us know on
https://issues.apache.org/jira/browse/SPARK-6833 if it doesn't work as
expected.
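For instance, an invocation along these lines would ship an R dependency to the executors (the file and script names here are made up for illustration):

```
# Sketch: ship helper.R with the job so each executor can read it
# from its working directory. File names are illustrative.
spark-submit --files helper.R my_job.R
```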
Thanks
Shivaram
On Tue, Jul 7, 2015 at 5:13 AM, Michał Zieliński
zielinski.mich...@gmail.com wrote:
Hi all,
*spark-submit* for Python and
When I've seen this error before it has been due to the spark-submit file
(i.e. `C:\spark-1.4.0\bin/bin/spark-submit.cmd`) not having execute
permissions. You can try to set execute permission and see if it fixes
things.
Also we have a PR open to fix a related problem at
You need to add -Psparkr to build SparkR code
Shivaram
On Fri, Jul 3, 2015 at 2:14 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Did you try:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
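Putting the two suggestions together, a build that also compiles the SparkR code would look roughly like this (same flags as above, with the SparkR profile added):

```
build/mvn -Pyarn -Psparkr -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
```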
Thanks
Best Regards
On Fri, Jul 3, 2015 at 2:27 PM,
not scalable.
2015-06-25 13:59 GMT-07:00 Shivaram Venkataraman
shiva...@eecs.berkeley.edu:
The `head` function is not supported for the RRDD that is returned by
`textFile`. You can run `take(lines, 5L)`. I should add a warning here that
the RDD API in SparkR is private because we might not support
The 1.4 release does not support calling MLLib from SparkR. We are working
on it as a part of https://issues.apache.org/jira/browse/SPARK-6805
On Wed, Jul 1, 2015 at 4:23 PM, Sourav Mazumder sourav.mazumde...@gmail.com
wrote:
Hi,
Does Spark 1.4 support calling MLLib directly from SparkR ?
to find an inherited method for function
‘reduceByKey’ for signature ‘PipelinedRDD, character, numeric’*
*End Code *
On Fri, Jun 26, 2015 at 7:04 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
My workflow was to install RStudio on a cluster launched using Spark EC2
scripts
venue where I would be able to follow the SparkR API
progress?
Thanks
Pradeep
On Mon, Jun 29, 2015 at 1:12 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
The RDD API is pretty complex and we are not yet sure we want to export
all those methods in the SparkR API. We are working
, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
The API exported in the 1.4 release is different from the one used in the
2014 demo. Please see the latest documentation at
http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html or
Chris's demo from Spark Summit at
https
The RDD API is pretty complex and we are not yet sure we want to export all
those methods in the SparkR API. We are working towards exposing a more
limited API in upcoming versions. You can find some more details in the
recent Spark Summit talk at
Thanks Mark for the update. For those interested Vincent Warmerdam also has
some details on making the /root/spark installation work at
https://issues.apache.org/jira/browse/SPARK-8596?focusedCommentId=14604328&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14604328
We don't have a documented way to use RStudio on EC2 right now. We have a
ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss
work-arounds and potential solutions for this.
Thanks
Shivaram
On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark m...@redoakstrategic.com
wrote:
Good
The `head` function is not supported for the RRDD that is returned by
`textFile`. You can run `take(lines, 5L)`. I should add a warning here that
the RDD API in SparkR is private because we might not support it in the
upcoming releases. So if you can use the DataFrame API for your application
you
DataFrame from
comma separated flat files, what would you recommend me to do? One way I
can think of is first reading the data as you would do in r, using
read.table(), and then create spark DataFrame out of that R dataframe, but
it is obviously not scalable.
2015-06-25 13:59 GMT-07:00 Shivaram
Not yet - We are working on it as a part of
https://issues.apache.org/jira/browse/SPARK-6805 and you can follow the
JIRA for more information
On Wed, Jun 24, 2015 at 2:30 AM, escardovi escard...@bitbang.com wrote:
Hi,
I was wondering if it is possible to use MLlib function inside SparkR, as
The Apache Spark API docs for SparkR
https://spark.apache.org/docs/1.4.0/api/R/index.html represent what has
been released with Spark 1.4. The AMPLab version is no longer under active
development and I'd recommend users to use the version in the Apache
project.
Thanks
Shivaram
On Thu, Jun 25,
In addition to Aleksander's point please let us know what use case would
use RDD-like API in https://issues.apache.org/jira/browse/SPARK-7264 -- We
are hoping to have a version of this API in upcoming releases.
Thanks
Shivaram
On Thu, Jun 25, 2015 at 6:02 AM, Eskilson,Aleksander
The error you are running into is that the input file does not exist -- You
can see it from the following line
Input path does not exist: hdfs://smalldata13.hdp:8020/
home/esten/ami/usaf.json
Thanks
Shivaram
On Tue, Jun 16, 2015 at 1:55 AM, esten erik.stens...@dnvgl.com wrote:
Hi,
In SparkR
Yep - Burak's answer should work. FWIW the error message from the stack
trace that shows this is the line
Failed to load class for data source: avro
Thanks
Shivaram
On Sat, Jun 13, 2015 at 6:13 PM, Burak Yavuz brk...@gmail.com wrote:
Hi,
Not sure if this is it, but could you please try
Yeah - We don't have support for running UDFs on DataFrames yet. There is
an open issue to track this https://issues.apache.org/jira/browse/SPARK-6817
Thanks
Shivaram
On Thu, Jun 4, 2015 at 3:10 AM, Daniel Emaasit daniel.emaa...@gmail.com
wrote:
Hello Shivaram,
Was the includePackage()
Hi all
We recently merged support for launching YARN clusters using Spark EC2
scripts as a part of
https://issues.apache.org/jira/browse/SPARK-3674. To use this you can pass
--hadoop-major-version=yarn to the spark-ec2 script and this will
set up Hadoop 2.4 HDFS, YARN and Spark built for YARN
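A launch command along these lines would bring up such a cluster (the key pair and cluster name are made up):

```
# Illustrative spark-ec2 launch using the YARN Hadoop version flag
./spark-ec2 -k my-key -i my-key.pem --hadoop-major-version=yarn launch my-yarn-cluster
```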
, and would therefore be part of the
memory managed by Spark, and that memory would only be moved to R as an R
object following a collect(), take(), etc.
Thanks,
Alek Eskilson
From: Shivaram Venkataraman shiva...@eecs.berkeley.edu
Reply-To: shiva...@eecs.berkeley.edu shiva...@eecs.berkeley.edu
Could you try to see which phase is causing the hang ? i.e. If you do a
count() after flatMap does that work correctly ? My guess is that the hang
is somehow related to data not fitting in the R process memory but its hard
to say without more diagnostic information.
Thanks
Shivaram
On Tue, May
and
create a new DataFrame by zipping the results. It seems to work but when I
try to save the RDD I got the following error :
org.apache.spark.mllib.linalg.DenseVector cannot be cast to
org.apache.spark.sql.Row
On Mon, Mar 30, 2015 at 6:40 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote
One workaround could be to convert a DataFrame into a RDD inside the
transform function and then use mapPartitions/broadcast to work with the
JNI calls and then convert back to RDD.
Thanks
Shivaram
On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Dear all,
I'm
FWIW the JIRA I was thinking about is
https://issues.apache.org/jira/browse/SPARK-3098
On Mon, Mar 16, 2015 at 6:10 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
I vaguely remember that JIRA and AFAIK Matei's point was that the order is
not guaranteed *after* a shuffle. If you
I vaguely remember that JIRA and AFAIK Matei's point was that the order is
not guaranteed *after* a shuffle. If you only use operations like map which
preserve partitioning, ordering should be guaranteed from what I know.
On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen so...@cloudera.com wrote:
Dang
On Tue, Mar 10, 2015 at 7:03 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
There are a couple of differences between the ml-matrix implementation
and the one used in AMPCamp
- I think the AMPCamp one uses JBLAS which tends to ship native BLAS
libraries along with it. In ml-matrix
from the same basis.
What is the difference between these two codes ?
On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
There are a couple of solvers that I've written that are part of the AMPLab
ml-matrix repo [1,2]. These aren't part of MLLib yet
On Thu, Mar 12, 2015 at 3:05 PM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
In fact, by activating netlib with native libraries it goes faster.
Glad you got it work ! Better performance was one of the reasons we made
the switch.
Thanks
On Tue, Mar 10, 2015 at 7:03 PM, Shivaram Venkataraman
Sections 3, 4, and 5 in http://www.netlib.org/lapack/lawnspdf/lawn204.pdf are
a good reference
Shivaram
On Mar 6, 2015 9:17 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Do you have a reference paper to the implemented algorithm in TSQR.scala ?
On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman
There are a couple of solvers that I've written that are part of the AMPLab
ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are
interested in porting them I'd be happy to review it
Thanks
Shivaram
[1]
+Josh, who added the Job UI page.
I've seen this as well and was a bit confused about what it meant. Josh, is
there a specific scenario that creates these skipped stages in the Job UI ?
Thanks
Shivaram
On Wed, Jan 7, 2015 at 12:32 PM, Corey Nolet cjno...@gmail.com wrote:
Sorry- replace ###
Ah I see - So it's more like 're-used stages', which is not necessarily a bug
in the program or something like that.
Thanks for the pointer to the comment
Thanks
Shivaram
On Wed, Jan 7, 2015 at 2:00 PM, Mark Hamstra m...@clearstorydata.com
wrote:
That's what you want to see. The computation of
Just to clarify, are you running the application using spark-submit after
packaging with sbt package ? One thing that might help is to mark the Spark
dependency as 'provided' as then you shouldn't have the Spark classes in
your jar.
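In an sbt build this looks roughly like the following (the version number is illustrative):

```scala
// build.sbt -- marking Spark as "provided" keeps Spark's classes out of
// your application jar; spark-submit supplies them at runtime instead
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1" % "provided"
```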
Thanks
Shivaram
On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen
+Andrew
Actually I think this is because we haven't uploaded the Spark binaries to
cloudfront / pushed the change to mesos/spark-ec2.
Andrew, can you take care of this ?
On Tue, Dec 2, 2014 at 5:11 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Interesting. Do you have any problems
Can you clarify what is the Spark master URL you are using ? Is it 'local'
or is it a cluster ? If it is 'local' then rebuilding Spark wouldn't help
as Spark is getting pulled in from Maven and that'll just pick up the
released artifacts.
Shivaram
On Mon, Nov 24, 2014 at 1:08 PM, agg212
I ran into this problem too and I know of a workaround but don't exactly
know what is happening. The work around is to explicitly add either the
commons math jar or your application jar (shaded with commons math)
to spark.executor.extraClassPath.
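Concretely, that looks something like this in spark-defaults.conf (the jar path is made up):

```
# conf/spark-defaults.conf -- sketch; use the real path to your jar
spark.executor.extraClassPath /path/to/commons-math3-3.3.jar
```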
My hunch is that this is related to the class
We are working on a PRs to add block partitioned matrix formats and dense
matrix multiply methods. This should be out in the next few weeks or so.
The sparse methods still need some research on partitioning schemes etc.
and we will do that after the dense methods are in place.
Thanks
Shivaram
On
Yeah looks like https://github.com/apache/spark/pull/2744 broke the
build. We will fix it soon
On Fri, Oct 31, 2014 at 12:21 PM, Terry Siu terry@smartfocus.com wrote:
I am synced up to the Spark master branch as of commit 23468e7e96. I have
Maven 3.0.5, Scala 2.10.3, and SBT 0.13.1. I’ve
Hi
I'd like to announce a couple of updates to the SparkR project. In order to
facilitate better collaboration for new features and development we have a
new mailing list, issue tracker for SparkR.
- The new JIRA is hosted at https://sparkr.atlassian.net/browse/SPARKR/ and
we have migrated all
Could you try increasing the number of slices with the large data set ?
SparkR assumes that each slice (or partition in Spark terminology) can fit
in memory of a single machine. Also is the error happening when you do the
map function or does it happen when you combine the results ?
Thanks
If the JVM heap size is close to the memory limit the OS sometimes kills
the process under memory pressure. I've usually found that lowering the
executor memory size helps.
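For example, in spark-defaults.conf (the value is illustrative; choose it based on your machines):

```
# conf/spark-defaults.conf -- leave headroom for the OS and other processes
spark.executor.memory 6g
```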
Shivaram
On Wed, Aug 13, 2014 at 11:01 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
What is your Spark executor
If you just want to find the top eigenvalue / eigenvector you can do
something like the Lanczos method. There is a description of a MapReduce
based algorithm in Section 4.2 of [1]
[1] http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf
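For intuition, here is a pure-Python sketch of the power method, a simpler relative of Lanczos that also finds the top eigenvalue/eigenvector (the matrix and iteration count below are made up for illustration; Lanczos builds a Krylov basis and converges faster, but the core matrix-vector product is the same):

```python
import math

def power_iteration(A, num_iters=200):
    """Estimate the top eigenvalue/eigenvector of a symmetric matrix A
    (given as a list of rows) by repeated matrix-vector multiplication."""
    n = len(A)
    v = [1.0 / math.sqrt(n)] * n  # arbitrary unit starting vector
    for _ in range(num_iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]  # w = A v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # renormalize
    # Rayleigh quotient v^T A v gives the eigenvalue estimate
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigval = sum(v[i] * Av[i] for i in range(n))
    return eigval, v

# The top eigenvalue of [[2, 1], [1, 2]] is 3
val, vec = power_iteration([[2.0, 1.0], [1.0, 2.0]])
```

In the MapReduce setting of [1], the matrix-vector product inside the loop is the distributed step.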
On Thu, Aug 7, 2014 at 10:54 AM, Li Pu
I tried this out and what is happening here is that as the input file is
small, only 1 partition is created. lapplyPartition runs the given function
on the partition and computes sumx as 55 and sumy as 55. Now the return
value from lapplyPartition is treated as a list by SparkR and collect
The output of lapply and lapplyPartition should be the same by design -- The
only difference is that in lapply the user-defined function returns a row,
while it returns a list in lapplyPartition.
Could you give an example of a small input and output that you expect to
see for the above program ?
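A plain-Python sketch of the distinction (standing in for SparkR's distributed execution; the data below is made up):

```python
def lapply(rdd_partitions, fn):
    # lapply-style: fn is called once per element (row)
    return [fn(x) for part in rdd_partitions for x in part]

def lapply_partition(rdd_partitions, fn):
    # lapplyPartition-style: fn is called once per partition, with the
    # whole partition passed in as a list
    return [fn(part) for part in rdd_partitions]

# Two partitions standing in for a small distributed dataset
parts = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

print(lapply(parts, lambda x: x * x))   # one result per element
print(lapply_partition(parts, sum))     # one result per partition: [15, 40]
```

With a single partition holding 1..10, lapply_partition would return one value (55) rather than ten, matching the behavior described above.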
This fails for me too. I have no idea why it happens as I can wget the pom
from maven central. To work around this I just copied the ivy xmls and jars
from this github repo
https://github.com/peterklipfel/scala_koans/tree/master/ivyrepo/cache/org.scala-lang/scala-library
and put it in
Thanks Patrick -- It does look like some maven misconfiguration as
wget
http://repo1.maven.org/maven2/org/scala-lang/scala-library/2.10.2/scala-library-2.10.2.pom
works for me.
Shivaram
On Fri, Aug 1, 2014 at 3:27 PM, Patrick Wendell pwend...@gmail.com wrote:
This is a Scala bug - I filed
I think the 1.0 AMI only contains the prebuilt packages (i.e. just the
binaries) of Spark and not the source code. If you want to build Spark on
EC2, you can clone the github repo and then use sbt.
Thanks
Shivaram
On Mon, Jul 28, 2014 at 8:49 AM, redocpot julien19890...@gmail.com wrote:
Thanks a lot for reporting this. I think we just missed installing xfsprogs
on the AMI. I have a fix for this at
https://github.com/mesos/spark-ec2/pull/59.
After the pull request is merged, any new clusters launched should have
mkfs.xfs
Thanks
Shivaram
On Fri, Jul 18, 2014 at 4:56 PM, Ben
You'll need to build SparkR to match the Spark version deployed on the
cluster. You can do that by changing the Spark version in SparkR's
build.sbt [1]. If you are using the Maven build you'll need to edit pom.xml
Thanks
Shivaram
[1]
We don't have any documentation on running SparkR on YARN and I think there
might be some issues that need to be fixed (The recent PySpark on YARN PRs
are an example).
SparkR has only been tested to work with Spark standalone mode so far.
Thanks
Shivaram
On Tue, Apr 29, 2014 at 7:56 PM,
Are you by any chance building this on NFS ? As far as I know the build is
severely bottlenecked by filesystem calls during assembly (each class file
in each dependency gets an fstat call or something like that). That is
partly why building from say a local ext4 filesystem or a SSD is much
faster
-server-2.2.0.jar
-Ken
*From:* Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
*Sent:* Friday, April 25, 2014 4:31 PM
*To:* user@spark.apache.org
*Subject:* Re: Build times for Spark
Are you by any chance building this on NFS ? As far as I know the build is
severely
I just updated the github issue -- In case anybody is curious, this was a
problem with R resolving the right java version installed in the VM.
Thanks
Shivaram
On Sat, Apr 19, 2014 at 7:12 PM, tongzzz tongzhang...@gmail.com wrote:
I can't initialize sc context after a successful install on
R installed in the machine.
Thank
Pawan Kumar Venugopal
On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
You can create standalone jobs in SparkR as just R files that are run
using the sparkR script. These commands will be sent to a Spark cluster
is that we were not aware such
AMIs existed. Would you kindly point us to the documentation where we can
read about this further?
Many many thanks, Shivaram.
Marco.
On Tue, Apr 8, 2014 at 4:42 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
Is there any reason why you want to start
instance and not an HVM instance.
Please and thanks,
Marco.
On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman
shivaram.venkatara...@gmail.com wrote:
Right now the spark-ec2 scripts assume that you have root access and a
lot of internal scripts assume the user's home directory is hard
That is a good idea, though I am not sure how much it will help as the time
to rsync also depends on the amount of data being copied. The other problem
is that sometimes we have dependencies across packages, so the first needs
to be running before the second can start etc.
However I agree that it
There is no direct way to get this in pyspark, but you can get it from the
underlying java rdd. For example
a = sc.parallelize([1,2,3,4], 2)
a._jrdd.splits().size()
On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Mark,
This appears to be a Scala-only
Hi
Thanks for reporting this. It'll be great if you can check a couple of
things:
1. Are you trying to use this with Hadoop2 by any chance ? There was an
incompatible ASM version bug that we fixed for Hadoop2
https://github.com/amplab-extras/SparkR-pkg/issues/17 and we verified it,
but I just