Can you provide an example?
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I am seeing a small standalone cluster (master, slave) hang when it reaches a
certain memory threshold, but I cannot determine how to configure memory to
avoid this.
I added memory by configuring SPARK_DAEMON_MEMORY=2G and I can see this
allocated, but it does not help.
The reduce is by key to get
Dear all,
We have a Spark 0.8.1 cluster on Mesos 0.15. Some of my colleagues are
familiar with Python, but some features are developed in Java. I am
looking for a way to integrate Java and Python on Spark.
I notice that the initialization of pyspark does not include a field to
distribute
Hi,
Any thoughts on this? Thanks.
-Suren
On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
Hi,
I know if we call persist with the right options, we can have Spark
persist an RDD's data on disk.
I am wondering what happens in intermediate operations
Hi Shark,
Should I assume that Shark users should not use the Shark APIs, since there
is no documentation for them? If there is documentation, can you point it
out?
Best Regards,
Jerry
On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote:
Hello everyone,
I have
Hi,
I was going through Matei's Advanced Spark presentation at
https://www.youtube.com/watch?v=w0Tisli7zn4, and had a few questions.
The presentation of this video is at
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
The PageRank example
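The PageRank iteration covered in those slides can be pictured without Spark at all. Here is a plain-Python sketch on a made-up three-page toy graph (the graph and all names are hypothetical, chosen only to illustrate the rank update with a 0.85 damping factor):

```python
# Toy link graph: page -> pages it links to (made-up data).
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
ranks = {page: 1.0 for page in links}

for _ in range(20):
    contribs = {page: 0.0 for page in links}
    for page, neighbors in links.items():
        for n in neighbors:
            # Each page splits its current rank evenly among its out-links.
            contribs[n] += ranks[page] / len(neighbors)
    # Standard damped update: rank = 0.15 + 0.85 * incoming contributions.
    ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}

print(ranks)
```

Since every page here has out-links, the total rank stays equal to the number of pages, and "c" (linked from both "a" and "b") ends up ranked highest.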
Hi All,
I wanted to get Spark on YARN up and running.
I did *SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt assembly*
Then I ran
*SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar
For issue #2 I was concerned that the build packaging had to be
internal, so I am using the already packaged make-distribution.sh
(modified to use a Maven build) to create a tarball, which I then package
using an RPM spec file.
Hi Rahul, so the issue for downstream operating system
Hi,
We have a situation where a Pyspark script works fine as a local process
(local url) on the Master and the Worker nodes, which would indicate that
all python dependencies are set up properly on each machine.
But when we try to run the script at the cluster level (using the master's
url), if
It might help if I clarify my questions. :-)
1. Is persist() applied during the transformation right before the
persist() call in the graph? Or is it applied after the transform's
processing is complete? In the case of things like GroupBy, is the Seq
backed by disk as it is being created? We're
Hi TD
Could you explain this code part to me?
.reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public
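The point of passing two functions to reduceByKeyAndWindow is incremental window maintenance: the first function adds values entering the window, and the second (inverse) function subtracts values leaving it, instead of re-reducing the whole window on every slide. A plain-Python sketch of that update (made-up per-key counts, no Spark involved):

```python
from collections import Counter

def slide(window_counts, entering_batch, leaving_batch):
    """One window slide: add new per-key counts, subtract expired ones."""
    for key, n in entering_batch.items():
        window_counts[key] += n      # the add function: i1 + i2
    for key, n in leaving_batch.items():
        window_counts[key] -= n      # the inverse function: i1 - i2
    return +window_counts            # unary + drops keys that fell to zero

window = Counter({"spark": 3, "mesos": 1})
window = slide(window, Counter({"spark": 2}), Counter({"mesos": 1}))
print(window)   # Counter({'spark': 5})
```

This is why the inverse function must be an exact algebraic inverse of the add function: each slide only touches the entering and leaving batches, never the full window.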
Hi all,
On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
Also, it is the default user for the Spark-EC2 script.
Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
instead of 'root'.
I can see that the Spark-EC2 script allows you to specify which user to
Hi,
I'm trying to use SparkContext.addFile() to propagate a file to worker
nodes, in a standalone cluster (2 nodes, 1 master, 1 worker connected to the
master). I don't have HDFS or any distributed file system. Just playing with
basic stuff.
Here's the code in my driver (actually spark-shell
That work is under submission at an academic conference and will be made
available if/when the paper is published.
In terms of algorithms for hyperparameter tuning, we consider Grid Search,
Random Search, a couple of older derivative-free optimization methods, and
a few newer methods - TPE (aka
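For readers unfamiliar with the two simplest methods mentioned above, here is a plain-Python sketch of grid search versus random search over the same trial budget (the objective function, grid values, and parameter names are all made up for illustration):

```python
import itertools
import random

def objective(lr, reg):
    # Stand-in for a real validation score (higher is better).
    return -((lr - 0.1) ** 2 + (reg - 0.01) ** 2)

lr_grid = [0.001, 0.01, 0.1, 1.0]
reg_grid = [0.0, 0.01, 0.1]

# Grid search: evaluate every combination, keep the best.
best_grid = max(itertools.product(lr_grid, reg_grid),
                key=lambda p: objective(*p))
print(best_grid)   # (0.1, 0.01)

# Random search with the same budget of 12 trials: sample the
# ranges uniformly instead of walking a fixed lattice.
random.seed(0)
trials = [(random.uniform(0.001, 1.0), random.uniform(0.0, 0.1))
          for _ in range(12)]
best_random = max(trials, key=lambda p: objective(*p))
```

Random search often fares better when only a few hyperparameters matter, since it does not waste trials repeating the same value of an unimportant dimension.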
I might be wrong here but I don't believe it's discouraged. Maybe part
of the reason there's not a lot of examples is that sql2rdd returns an
RDD (TableRDD that is
https://github.com/amplab/shark/blob/master/src/main/scala/shark/SharkContext.scala).
I haven't done anything too complicated yet but
Hi Shivaram,
OK so let's assume the script CANNOT take a different user and that it must
be 'root'. The typical workaround is as you said, allow the ssh with the
root user. Now, don't laugh, but, this worked last Friday, but today
(Monday) it no longer works. :D Why? ...
...It seems that NOW,
Hi Guys,
I would like to understand why the driver's RAM goes down. Does the
processing occur only on the workers?
Thanks
# Start Tests
computer1 (Worker/Source Stream)
23:57:18 up 12:03, 1 user, load average: 0.03, 0.31, 0.44
total used free shared
Hmm -- that is strange. Can you paste the command you are using to launch
the instances? The typical workflow is to use the spark-ec2 wrapper script
using the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html
Shivaram
On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini
Hello, Spark community! My name is Paul. I am a Spark newbie, evaluating
version 0.9.0 without any Hadoop at all, and need some help. I run into the
following error with the StatefulNetworkWordCount example (and similarly in my
prototype app, when I use the updateStateByKey operation). I get
Thanks Shivaram! Will give it a try and let you know.
Regards,
Pawan Venugopal
On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
You can create standalone jobs in SparkR as just R files that are run
using the sparkR script. These commands will be sent
Hi,
I am looking for users of Spark to join my teams here at Amazon. If you are
reading this, you probably qualify.
I am looking for developers of any level with an interest in Spark. My
teams are leveraging Spark to solve real business scenarios.
If you are interested, just shoot me a note
Few things that would be helpful.
1. Environment settings - you can find them on the environment tab in the
Spark application UI
2. Are you setting the HDFS configuration correctly in your Spark program?
For example, can you write an HDFS file from a Spark program (say
spark-shell) to your HDFS
Any reason why RDDInfo suddenly became private in SPARK-1132?
We are using it to show users the status of RDDs.
OK, yeah, we are using StageInfo and TaskInfo too...
On Mon, Apr 7, 2014 at 8:51 PM, Andrew Or and...@databricks.com wrote:
Hi Koert,
Other users have expressed interest for us to expose similar classes too
(i.e. StageInfo, TaskInfo). In the newest release, they will be available
as part of
1.: I will paste the full content of the environment page of the example
application running against the cluster at the end of this message.
2. and 3.: Following #2 I was able to see that the count was incorrectly 0
when running against the cluster, and following #3 I was able to get the
Great!!!
When I built it on another disk whose filesystem is ext4, it now works.
hadoop@ubuntu-1:~$ df -Th
Filesystem  Type      Size  Used  Avail  Use%  Mounted on
/dev/sdb6   ext4      135G  8.6G   119G    7%  /
udev        devtmpfs  7.7G  4.0K   7.7G    1%  /dev
tmpfs
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.eduwrote:
I am running the latest version of PySpark branch-0.9 and having some
trouble with join.
One RDD is about 100G (25GB compressed and serialized in memory) with
130K records, the other RDD is about 10G (2.5G
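For intuition, the per-key join being attempted above can be pictured as a hash join in plain Python: build a lookup table from the smaller side, then probe it with the larger side. (The data below is made up; Spark's actual join shuffles and cogroups both RDDs by key rather than broadcasting one side.)

```python
from collections import defaultdict

big   = [("a", 1), ("b", 2), ("a", 3)]    # stands in for the ~100G RDD
small = [("a", "x"), ("c", "y")]          # stands in for the ~10G RDD

# Build phase: index the smaller side by key.
table = defaultdict(list)
for key, value in small:
    table[key].append(value)

# Probe phase: emit one pair per matching (big, small) value.
joined = [(key, (bv, sv))
          for key, bv in big
          for sv in table.get(key, [])]
print(joined)   # [('a', (1, 'x')), ('a', (3, 'x'))]
```

Note that keys with many values on both sides multiply out in the result, which is one common way a join blows up memory on skewed data.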
Hi all,
Here I am sharing a blog post for beginners about creating a standalone
Spark Streaming application and bundling the app as a single runnable jar.
Take a look and drop your comments on the blog page.
http://prabstechblog.blogspot.in/2014/04/a-standalone-spark-application-in-scala.html
Hi Everyone,
I saved a 2GB PDF file into MongoDB using GridFS. Now I want to process that
GridFS collection data using Java Spark MapReduce. Previously I
successfully processed normal MongoDB collections (not GridFS) with Apache
Spark using the Mongo-Hadoop connector. Now I'm unable to handle input