collect() means to bring all the data back to the master node, and there might
just be too much of it for that. How big is your file? If you can’t bring it
back to the master node, try saveAsTextFile to write it out to a filesystem (in
parallel).
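For example, a sketch of a frequency count whose top 10 is small enough to
bring back, while the full counts go to a filesystem (shown in Scala; PySpark
offers analogous operations, and the paths here are hypothetical):

    val entries = sc.textFile("hdfs:///data/entries.txt")
    val counts = entries.map(e => (e, 1)).reduceByKey(_ + _)

    // Only 10 elements come back to the driver, so no giant collect():
    val top10 = counts.map { case (e, c) => (c, e) }.top(10)

    // The full result is written out in parallel instead of collected:
    counts.saveAsTextFile("hdfs:///data/entry-counts")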
Matei
On Feb 24, 2014, at 1:08 PM, Chengi Liu
...@gmail.com wrote:
It’s around 10 GB.
All I want is to do a frequency count, and then get the top 10 entries by
count.
How do I do this (again, in PySpark)?
Thanks
On Mon, Feb 24, 2014 at 1:19 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
collect() means to bring all the data back
Very cool, thanks for writing this. I’ll link it from our website.
Matei
On Feb 18, 2014, at 12:44 PM, Sampo Niskanen sampo.niska...@wellmo.com wrote:
Hi,
Since getting Spark + MongoDB to work together was not very obvious (at least
to me) I wrote a tutorial about it in my blog with an
The Vector class is defined to work on doubles right now. You’d have to write
your own version for floats.
Matei
On Feb 17, 2014, at 11:58 AM, agg agalaka...@gmail.com wrote:
Hi,
I would like to run the spark example with floats instead of doubles. When
I change this:
def
This probably means that there’s not enough free memory for the “scratch” space
used for computations, so we OOM before the Spark cache decides that it’s full
and starts to spill stuff. Try reducing spark.storage.memoryFraction (default
is 0.66, try 0.5).
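For example, with the old system-property style of configuration, set it
before creating the SparkContext (master URL and app name here are
hypothetical):

    System.setProperty("spark.storage.memoryFraction", "0.5")  // leave more scratch space
    val sc = new SparkContext("spark://master:7077", "MyApp")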
Matei
On Feb 5, 2014, at 10:29 PM,
It’s fairly easy to take your existing Mapper and Reducer objects and call them
within Spark. First, you can use SparkContext.hadoopRDD to read a file with any
Hadoop InputFormat (you can even pass it the JobConf you would’ve created in
Hadoop). Then use mapPartitions to iterate through each
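A rough sketch of that pattern, where MyMapper and its mapRecord method are
hypothetical stand-ins for your existing Mapper logic:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{JobConf, TextInputFormat}

    val conf = new JobConf()                      // the JobConf you would've created for Hadoop
    conf.set("mapred.input.dir", "hdfs:///input") // hypothetical input path
    val records = sc.hadoopRDD(conf, classOf[TextInputFormat],
                               classOf[LongWritable], classOf[Text])

    val mapped = records.mapPartitions { iter =>
      val mapper = new MyMapper()                 // reuse the existing Mapper object
      iter.flatMap { case (key, value) => mapper.mapRecord(key, value) }
    }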
You can set the spark.cores.max property in your application to limit the
maximum number of cores it will take. Check out
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling.
It’s also possible to control scheduling in more detail within a Spark
application,
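For example, before creating the SparkContext:

    System.setProperty("spark.cores.max", "8")  // this app will take at most 8 cores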
Hey Imran,
You probably have to create a subclass of HadoopRDD to do this, or some RDD
that wraps around the HadoopRDD. It would be a cool feature but HDFS itself has
no information about partitioning, so your application needs to track it.
Matei
On Jan 27, 2014, at 11:57 PM, Imran Rashid
Hi Dana,
I think the problem is that your simple.sbt does not add a dependency on
hadoop-client for CDH4, so you get a different version of the Hadoop library on
your driver application compared to the cluster. Try adding a dependency on
hadoop-client version 2.0.0-mr1-cdh4.X.X for your
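In simple.sbt that would look something like this (fill in the CDH4 minor
version; the Cloudera repository resolver is my assumption about where the
artifact is hosted):

    resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.X.X"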
Jeremy, do you happen to have a small test case that reproduces it? Is it with
the kmeans example that comes with PySpark?
Matei
On Jan 22, 2014, at 3:03 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Thanks for the thoughts Matei! I poked at this some more. I ran top on each
of the
Hi Ken,
This is unfortunately a limitation of spark-shell and the way it works on the
standalone mode. spark-shell sets an environment variable, SPARK_HOME, which
tells Spark where to find its code installed on the cluster. This means that
the path on your laptop must be the same as on the
, 2014 at 8:33 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
I’d be happy to see this added to the core API.
Matei
On Jan 23, 2014, at 5:39 PM, Andrew Ash and...@andrewash.com wrote:
Ah right of course -- perils of typing code without running it!
It feels like this is a pretty core
Try doing a sbt clean before rebuilding.
Matei
On Jan 22, 2014, at 10:22 AM, Manoj Samel manojsamelt...@gmail.com wrote:
See thread below. Reposted as compilation error thread
-- Forwarded message --
From: Manoj Samel manojsamelt...@gmail.com
Date: Wed, Jan 22, 2014 at
install - still same error.
On Wed, Jan 22, 2014 at 10:46 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Try doing a sbt clean before rebuilding.
Matei
On Jan 22, 2014, at 10:22 AM, Manoj Samel manojsamelt...@gmail.com wrote:
See thread below. Reposted as compilation error thread
see the
value; just the println does not seem to be working
On Wed, Jan 22, 2014 at 12:39 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi Manoj,
You’d have to make the files available at the same path on each machine
through something like NFS. You don’t need to copy them, though
If you don’t cache the RDD, the computation will happen over and over each time
we scan through it. This is done to save memory in that case and because Spark
can’t know at the beginning whether you plan to access a dataset multiple
times. If you’d like to prevent this, use cache(), or maybe
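For example, a minimal sketch (hypothetical path):

    val data = sc.textFile("hdfs:///data/input")
    data.cache()   // mark it for caching; materialized on the first pass
    data.count()   // first scan computes the data and caches it
    data.count()   // second scan reads from the cache instead of recomputing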
Hi Ognen,
It’s true that the documentation is partly targeting Hadoop users, and that’s
something we need to fix. Perhaps the best solution would be some kind of
tutorial on “here’s how to set up Spark by hand on EC2”. However it also sounds
like you ran into some issues with S3 that it would
It’s being voted on right now on the dev list. Check out
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-0-9-0-incubating-rc2-td225.html.
Matei
On Jan 18, 2014, at 11:03 PM, Manoj Samel manojsamelt...@gmail.com wrote:
Any time frame and list of enhancements
Hi Jeremy,
If you look at the stdout and stderr files on that worker, do you see any
earlier errors? I wonder if one of the Python workers crashed earlier.
It would also be good to run “top” and see if more memory is used during the
computation. I guess the cached RDD itself fits in less than
Hey Majd,
I believe Shark sets up data to spill to disk, even though the default storage
level in Spark is memory-only. In terms of those executors, it looks like data
distribution was unbalanced across them, possibly due to data locality in HDFS
(some of the executors may have had more data).
Typically you want 2-3 partitions per CPU core to get good load balancing. How
big is the data you’re transferring in this case? And have you looked at the
machines to see whether they’re spending lots of time on IO, CPU, etc? (Use top
or dstat on each machine for this). For large datasets with
It just uses the Hadoop FileSystem API; I don’t think there’s any extra
buffering. That API itself may do buffering in the HDFS case, though newer
versions of HDFS fix that.
Matei
On Jan 9, 2014, at 2:54 PM, hussam_jar...@dell.com wrote:
Can someone provide me details on the spark java
Have you looked at the cluster UI, and do you see any workers registered there,
and your application under running applications? Maybe you typed in the wrong
master URL or something like that.
Matei
On Jan 8, 2014, at 7:07 PM, Aureliano Buendia buendia...@gmail.com wrote:
The strange thing
...@gmail.com wrote:
On Thu, Jan 9, 2014 at 3:59 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
Have you looked at the cluster UI, and do you see any workers registered
there, and your application under running applications? Maybe you typed in
the wrong master URL or something like
, which will distribute it. You can launch your
application with “scala”, “java”, or whatever tool you’d prefer.
Matei
On Jan 8, 2014, at 8:26 PM, Aureliano Buendia buendia...@gmail.com wrote:
On Thu, Jan 9, 2014 at 4:11 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
Oh, you shouldn’t use
Sorry, you actually can’t call predict() on the cluster because the model
contains some RDDs. There was a recent patch that added a parallel predict
method, here: https://github.com/apache/incubator-spark/pull/328/files. You can
grab the code from that method there (which does a join) and call
Yeah, unfortunately sequenceFile() reuses the Writable object across records.
If you plan to use each record repeatedly (e.g. cache it), you should clone
them using a map function. It was originally designed assuming you only look at
each record once, but it’s poorly documented.
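A sketch of the cloning, assuming Text keys and IntWritable values
(hypothetical path):

    import org.apache.hadoop.io.{IntWritable, Text}

    val raw = sc.sequenceFile("hdfs:///data/seq", classOf[Text], classOf[IntWritable])
    // Copy each record before caching, since the same Writable objects get reused:
    val cloned = raw.map { case (k, v) => (new Text(k), new IntWritable(v.get)) }.cache()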
Matei
On Jan
, Matei Zaharia matei.zaha...@gmail.com wrote:
Yeah, unfortunately sequenceFile() reuses the Writable object across records.
If you plan to use each record repeatedly (e.g. cache it), you should clone
them using a map function. It was originally designed assuming you only look
at each record
(Replying on new Spark mailing list since the old one closed).
Are you sure Spark is finding your build of Mesos instead of the Apache one
from Maven Central? Unfortunately, code compiled with different protobuf
versions is not compatible, because the generated code by the protoc compiler
I agree that it would be good to do it only once, if you can find a nice way of
doing so.
Matei
On Jan 3, 2014, at 1:33 AM, Andrew Ash and...@andrewash.com wrote:
In my spark-env.sh I append to the SPARK_CLASSPATH variable rather than
overriding it, because I want to support both adding a
If you’re trying to measure the performance assuming that a dataset is already
in memory, then doing cache() and count() would work. However if you want to
measure an end-to-end workflow, it might be good to leave the operations and
the data loading to happen together, as Spark does by default.
Does that machine maybe have a full disk drive, or no space in /tmp (where
Spark stores local files by default)?
On Dec 25, 2013, at 7:50 AM, leosand...@gmail.com wrote:
No, just standalone cluster
leosand...@gmail.com
From: Azuryy Yu
Date: 2013-12-25 19:21
To:
I’m surprised by this, but one way that will definitely work is to assemble
your application into a single JAR. If passing them to the constructor doesn’t
work, that’s probably a bug.
Matei
On Dec 23, 2013, at 12:03 PM, Karavany, Ido ido.karav...@intel.com wrote:
Hi All,
For our
On Thu, Dec 19, 2013 at 2:23 PM, Matei Zaharia matei.zaha...@gmail.com
mailto:matei.zaha...@gmail.com wrote:
It might also mean you don’t have Python installed on the worker.
On Dec 19, 2013, at 1:17 PM, Jey Kottalam j...@cs.berkeley.edu
mailto:j...@cs.berkeley.edu wrote
Hi Guillaume,
I haven’t looked at the serialization of DoubleMatrix but I believe it just
creates one big Array[Double] instead of many ones, and stores all the rows
contiguously in that. I don’t think that would be slower to serialize. However,
because the object is bigger overall, it might
Yup, this will still be supported.
On Dec 18, 2013, at 12:40 PM, Gary Malouf malouf.g...@gmail.com wrote:
In 0.7.3, the way of installing spark on mesos was to unpack it into the same
directory across the cluster (I assume this includes the driver program). We
automated this process in our
Rosen, Henry Saputra, Jerry Shao, Mingfei
Shi, Andre Schumacher, Karthik Tunga, Patrick Wendell, Neal Wiggins,
Andrew Xia, Reynold Xin, Matei Zaharia, and Wu Zeming
- Patrick
It takes a while to download all the dependencies from Maven the first time you
build. Just let it run, it won’t need to do that next time. Or see if you can
build it on a machine with better Internet access and copy the binaries (you
can even get an EC2 machine for a few cents if you want).
I’m not sure if a method called repartition() ever existed in an official
release, since we don’t remove methods, but there is a method called coalesce()
that does what you want. You just tell it the desired new number of partitions.
You can also have it shuffle the data across the cluster to
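For example:

    val fewer = rdd.coalesce(100)                       // merge down to 100 partitions, no shuffle
    val rebalanced = rdd.coalesce(100, shuffle = true)  // shuffle to rebalance the data evenly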
the latest status is alpha
Its license terms (and code integrity) may not pass our legal department
Its robustness and efficiency are dubious.
Anyway, I'm looking at some other alternatives (e.g. JNBridge).
Thanks.
-Ken
On Mon, Dec 16, 2013 at 12:04 PM, Matei Zaharia matei.zaha...@gmail.com
Yup, this should be in Spark 0.9 and 0.8.1.
Matei
On Dec 13, 2013, at 9:41 AM, Koert Kuipers ko...@tresata.com wrote:
that's great. Didn't realize this was in master already.
On Thu, Dec 12, 2013 at 8:10 PM, Shao, Saisai saisai.s...@intel.com wrote:
Hi Koert,
Spark with
Yeah, I’m curious which APIs you found missing in Python. I know we have a lot
on the Scala side that aren’t yet in there, but I’m not sure how to prioritize
them. If you do want to call Python from Scala, you can also use the RDD.pipe()
operation to pass data through an external process. However
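For reference, a minimal sketch of pipe(), with a hypothetical script that
reads records on stdin and writes results to stdout:

    val results = rdd.pipe("python process.py")  // one external process per partition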
How long did they run for? The JVM takes a few seconds to start up and compile
code, not to mention that Spark takes some time to initialize too, so you won’t
see a major difference unless the application is taking longer. One other
problem in this job is that it might use Math.random(), which
The hadoopFile method reuses the Writable object between records that it reads
by default, so you get back the same object. You should clone them if you need
to cache them. This is kind of an unintuitive behavior that we’ll probably need
to turn off by default; it’s helpful when you don’t need
Hi Matt,
The behavior for sequenceFile is there because we reuse the same Writable
object when reading elements from the file. This is definitely unintuitive, but
if you pass through each data item only once instead of caching it, it can be
more efficient (probably should be off by default
Hey Matt,
This setting shouldn’t really affect groupBy operations, because they don’t go
through Akka. The frame size setting is for messages from the master to workers
(specifically, sending out tasks), and for results that go directly from
workers to the application (e.g. collect()). So it
I’m not sure you can have a star inside that quoted classpath argument (the
double quotes may cancel the *). Try using the JAR through its full name, or
link to Spark through Maven
(http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-java).
Matei
On Dec 6, 2013,
Hi Kenneth,
1. Is Spark suited for online learning algorithms? From what I’ve read
so far (mainly from this slide), it seems not but I could be wrong.
You can probably use Spark Streaming
(http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html)
to implement
to know the maximum value for spark.akka.frameSize too, and I am
wondering if it will affect the performance of reduceByKey().
Thanks!
2013/12/8 Matei Zaharia matei.zaha...@gmail.com
Hey Matt,
This setting shouldn’t really affect groupBy operations, because they don’t
go through Akka
within the
com/typesafe/akka subtree.
On Sun, Dec 8, 2013 at 5:01 PM, Azuryy Yu azury...@gmail.com wrote:
I built 0.8.1, and Maven tries to download akka-actor-2.0.1, which is used by
scala-core-io.
On 2013-12-09 8:40 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
Which version of Spark
Hi Philip,
There are a few things you can do:
- If you want to avoid the data copy with a CREATE TABLE statement, you can use
CREATE EXTERNAL TABLE, which points to an existing file or directory.
- If you always reuse the same table, you could CREATE TABLE only once and then
simply place
Yeah, in general, make sure you use exactly the same “cluster URL” string shown
on the master’s web UI. There’s currently a limitation in Akka where different
ways of specifying the hostname won’t work.
Matei
On Dec 6, 2013, at 10:54 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com
wrote:
Yeah, unfortunately the reason it pops up more in 0.8.0 is because our package
names got longer! But if you just do the build in /tmp it will work.
On Dec 6, 2013, at 11:35 AM, Josh Rosen rosenvi...@gmail.com wrote:
This isn't a Spark 0.8.0-specific problem. I googled for sbt error filen
On Thu, Dec 5, 2013 at 2:43 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hi,
When you launch the worker, try using spark://ADRIBONA-DEV-1:7077 as the URL
(uppercase instead of lowercase). Unfortunately Akka is very specific about
seeing hostnames written in the same way on each
Hi Matt,
Try using take() instead, which will only begin computing from the start of the
RDD (first partition) if the number of elements you ask for is small.
Note that if you’re doing any shuffle operations, like groupBy or sort, then
the stages before that do have to be computed fully.
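For example:

    val sample = rdd.take(10)  // computes only as many partitions as needed for 10 elements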
, December 5, 2013 7:49 AM
To: user@spark.incubator.apache.org
Subject: RE: Pre-build Spark for Windows 8.1
Excellent! Thank you, Matei.
From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Wednesday, December 4, 2013 4:26 PM
To: user@spark.incubator.apache.org
Subject: Re: Pre-build
– we want as much data to be computed as
possible.
It's only for benchmarking purposes, of course.
-Matt Cheah
From: Matei Zaharia matei.zaha...@gmail.com
Reply-To: user@spark.incubator.apache.org user@spark.incubator.apache.org
Date: Thursday, December 5, 2013 10:31 AM
To: user
Yes, check out the Shark paper for example:
https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/
The numbers on that benchmark are for Shark.
Matei
On Dec 3, 2013, at 3:50 PM, Matt Cheah mch...@palantir.com wrote:
Hi everyone,
I notice the benchmark page for
these up.
-Matt Cheah
From: Matei Zaharia matei.zaha...@gmail.com
Reply-To: user@spark.incubator.apache.org user@spark.incubator.apache.org
Date: Wednesday, December 4, 2013 10:53 AM
To: user@spark.incubator.apache.org user@spark.incubator.apache.org
Cc: Mingyu Kim m...@palantir.com
Subject
Hey Adrian,
Ideally you shouldn’t use Cygwin to run on Windows — use the .cmd scripts we
provide instead. Cygwin might be made to work but we haven’t tried to do this
so far so it’s not supported. If you can fix it, that would of course be
welcome.
Also, the deploy scripts don’t work on
Hey Roman,
It looks like that pull request was never migrated to the Apache GitHub, but I
like the idea. If you migrate it over, we can merge in something like this. In
terms of the API, I’d just add a unpersist() method on each Broadcast object.
Matei
On Dec 3, 2013, at 6:00 AM, Roman
Ah, interesting, thanks for reporting that. Do you mind opening a JIRA issue
for it? I think the right way would be to wait at least X seconds after start
before deciding that some blocks don’t have preferred locations available.
Matei
On Dec 1, 2013, at 9:08 AM, Erik Freed
I think this might be an issue with the tutorial — try asking the Mesosphere
folks who created it.
Matei
On Nov 28, 2013, at 9:23 PM, om prakash pandey pande...@gmail.com wrote:
Dear Sir/Madam,
I have been trying to run Apache Spark over Mesos and have been following
the below tutorial.
Sorry, what’s the full context for this? Do you have a stack trace? My guess is
that Spark isn’t on your classpath, or maybe you only have an old version of it
on there.
Matei
On Nov 27, 2013, at 6:04 PM, Walrus theCat walrusthe...@gmail.com wrote:
To clarify, I just undid that var...
Yup, it’s also important to have low latency between the drivers and the
workers. If you plan to expose this to the outside (e.g. offer a shell
interface), it would be better to write something on top.
Matei
On Nov 24, 2013, at 3:17 PM, Patrick Wendell pwend...@gmail.com wrote:
Or more
Interesting idea — in Scala you can also use the Dynamic type
(http://hacking-scala.org/post/49051516694/introduction-to-type-dynamic) to
allow dynamic properties. It has the same potential pitfalls as string names,
but with nicer syntax.
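A small sketch of what that looks like (hypothetical Record class):

    import scala.language.dynamics

    class Record(fields: Map[String, Any]) extends Dynamic {
      def selectDynamic(name: String): Any = fields(name)  // r.age resolves here at runtime
    }

    val r = new Record(Map("age" -> 42))
    r.age  // 42, but a misspelled field name only fails at runtime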
Matei
On Nov 18, 2013, at 3:45 PM, andy petrella
Hey folks, just a quick announcement -- in case you’re interested in learning
more about Spark in the Boston area, I’m going to speak at the Boston Hadoop
Meetup next Thursday: http://www.meetup.com/bostonhadoop/events/150875522/.
This is a good chance to meet local users and learn more about
. This timeout can of course be
configurable.
Thoughts ?
On Sat, Nov 2, 2013 at 3:29 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hey Imran,
Good to know that Akka 2.1 handles this — that at least will give us a start.
In the old code, executors certainly did get flagged as “down
Hi folks,
The Apache Spark PPMC is happy to welcome two new PPMC members and committers:
Tom Graves and Prashant Sharma.
Tom has been maintaining and expanding the YARN support in Spark over the past
few months, including adding big features such as support for YARN security,
and recently
Hi Meisam,
Each block manager removes data from the cache in a least-recently-used fashion
as space fills up. If you’d like to remove an RDD manually before that, you can
call rdd.unpersist().
Matei
On Nov 13, 2013, at 8:15 PM, Meisam Fathi meisam.fa...@gmail.com wrote:
Hi Community,
Union just puts the data in two RDDs together, so you get an RDD containing the
elements of both, and with the partitions that would’ve been in both. It’s not
a unique set union (that would be union() then distinct()). Here you’ve unioned
four RDDs of 32 partitions each to get 128. If you want
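For example, a sketch with four hypothetical 32-partition RDDs a through d:

    val combined = a.union(b).union(c).union(d)  // 4 x 32 partitions = 128 partitions
    val asSet = combined.distinct()              // unique set union (shuffles)
    val fewer = combined.coalesce(32)            // or just merge down the partition count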
It’s hard to tell, but maybe you’ve run out of space in your working directory?
The assembly command will try to write stuff in assembly/target.
Matei
On Nov 11, 2013, at 2:54 PM, Umar Javed umarj.ja...@gmail.com wrote:
I keep getting these io.Exception Permission denied errors when building
Actually it doesn’t matter a lot from what I’ve seen. Only do it if you see a
lot of communication going to the master (these threads do the serialization of
tasks). I’ve never put more than 8 or so.
Matei
On Nov 11, 2013, at 12:13 PM, Walrus theCat walrusthe...@gmail.com wrote:
Hi,
The
.
2013/11/7 Matei Zaharia matei.zaha...@gmail.com
Hi everyone,
We're glad to announce the agenda of the Spark Summit, which will happen on
December 2nd and 3rd in San Francisco. We have 5 keynotes and 24 talks lined
up, from 18 different companies. Check out the agenda here:
http://spark
Hi Pranay,
I don’t think anyone’s working on this right now, but contributions would be
welcome if this is a thing we could plug into MLlib.
Matei
On Nov 6, 2013, at 8:44 PM, Pranay Tonpay pranay.ton...@impetus.co.in wrote:
Hi,
Wanted to know if PMML support in Spark is there in the roadmap
Hi everyone,
We're glad to announce the agenda of the Spark Summit, which will happen on
December 2nd and 3rd in San Francisco. We have 5 keynotes and 24 talks lined
up, from 18 different companies. Check out the agenda here:
http://spark-summit.org/agenda/.
This will be the biggest Spark
import statements
On 11/7/2013 4:05 PM, Matei Zaharia wrote:
Yeah, this is confusing and unfortunately as far as I know it’s API
specific. Maybe we should add this to the documentation page for RDD.
The reason for these conversions is to only allow some operations based
In general, you shouldn’t be mutating data in RDDs. That will make it
impossible to recover from faults.
In this particular case, you got 1 and 2 because the RDD isn’t cached. You just
get the same list you called parallelize() with each time you iterate through
it. But caching it and
if I am
wrong.
On Fri, Nov 1, 2013 at 10:08 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
It’s true that Akka’s delivery guarantees are in general at-most-once, but if
you look at the text there it says that they differ by transport. In the
previous version, I’m quite sure
never bothered looking into it more.
I will keep digging ...
On Thu, Oct 31, 2013 at 4:36 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
BTW the problem might be the Akka failure detector settings that seem new in
2.2: http://doc.akka.io/docs/akka/2.2.3/scala/remoting.html
Looking at https://github.com/sbt/sbt-assembly, it seems you can add the
following into extraAssemblySettings:
assemblyOption in assembly ~= { _.copy(includeScala = false) }
Matei
On Oct 30, 2013, at 9:58 AM, Mingyu Kim m...@palantir.com wrote:
Hi,
In order to work around the library
The error is from a worker node -- did you check that /data2 is set up properly
on the worker nodes too? In general that should be the only directory used.
Matei
On Oct 28, 2013, at 6:52 PM, Shangyu Luo lsy...@gmail.com wrote:
Hello,
I have some questions about the files that Spark will
) to
ConverterUtils.convertFromYarn(containerToken, cmAddress).
Not 100% sure that my changes are correct.
Hope that helps,
Viren
On Sun, Sep 29, 2013 at 8:59 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi Terence,
YARN's API changed in an incompatible way in Hadoop 2.1.0, so I'd suggest
FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY
caching is the input to each reduce task. Those currently don't spill to disk.
The solution if datasets are large is to add more reduce tasks, whereas Hadoop
would run along with a small number of tasks that do lots
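Increasing the number of reduce tasks is just an extra argument to the shuffle
operation, e.g. (pairs is a hypothetical key-value RDD):

    val counts = pairs.reduceByKey(_ + _, 500)  // 500 reduce tasks, so each input is smaller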
of course we
develop features and optimizations as we see demand for them, but if there's a
lot of demand for this, we can do it.
Matei
On Oct 28, 2013, at 5:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY
Hi Ufuk,
Yes, we still write out data after these tasks in Spark 0.8, and it needs to be
written out before any stage that reads it can start. The main reason is
simplicity when there are faults, as well as more flexible scheduling (you
don't have to decide where each reduce task is in
Yup, unfortunately YARN changed its API upon releasing 2.2, which puts us in an
awkward position because all the major current users are on the old YARN API
(from 0.23.x and 2.0.x) but new users will try this one. We'll probably change
the default version in Spark 0.8.1 or 0.8.2. If you look on
Yes, take a look at
http://spark.incubator.apache.org/docs/latest/ec2-scripts.html#accessing-data-in-s3
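In short, with your AWS credentials set in the Hadoop configuration (or
embedded in the URL), a sketch with a hypothetical bucket and path:

    val logs = sc.textFile("s3n://my-bucket/path/to/data")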
Matei
On Oct 23, 2013, at 6:17 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
Hi, all
Is there any solution running Spark with Amazon S3?
Best,
Nan
, Ayush Mishra ay...@knoldus.com wrote:
You can check
http://blog.knoldus.com/2013/09/09/running-standalone-scala-job-on-amazon-ec2-spark-cluster/.
On Thu, Oct 24, 2013 at 6:54 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
Great!!!
On Wed, Oct 23, 2013 at 9:21 PM, Matei Zaharia matei.zaha
of data
etc.
I was wondering if you could write up a little white paper or some guidelines
on how to set memory values, and what to look at when something goes
wrong? E.g. I would never have guessed that countByValue happens on a single
machine etc.
On Oct 21, 2013 6:18 PM, Matei
if the goal is to
keep size down and you don't want to confuse new adopters who aren't using
Kafka as part of their tech stack.
-Ryan
On Sat, Oct 12, 2013 at 10:52 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi Ryan,
Spark Streaming ships with a special version of the Kafka
Hi Ryan,
If you're only going to run in local mode, there's no need to package the app
with sbt and pass a JAR. You can just run it straight out of your IDE.
Matei
On Oct 13, 2013, at 9:17 PM, Ryan Chan ryanchan...@gmail.com wrote:
Hi,
Are there any guide on teaching how to get started
Hey, this seems to be a problem in the docs about how to set the executor URI.
It looks like the SPARK_EXECUTOR_URI variable is not actually used. Instead,
set the spark.executor.uri Java system property using
System.setProperty("spark.executor.uri", "<your URI>") before you create a
SparkContext.
Hi Ryan,
Spark Streaming ships with a special version of the Kafka 0.7.2 client that we
ported to Scala 2.9, and you need to add that as a JAR explicitly in your
project. The JAR is in
streaming/lib/org/apache/kafka/kafka/0.7.2-spark/kafka-0.7.2-spark.jar under
Spark. The streaming/lib
Hi Alex,
Unfortunately there seems to be something wrong with how the generics on that
method get seen by Java. You can work around it by calling this with:
plans.saveAsHadoopFiles("hdfs://localhost:8020/user/hue/output/completed",
"csv", String.class, String.class, (Class)
Hey, sorry, for this question, there's a similar answer to the previous one.
You'll have to move the files from the output directories into a common
directory by hand, possibly renaming them. The Hadoop InputFormat and
OutputFormat APIs that we use are just designed to work at the level of
Yeah, Christopher answered this before I could, but you can list the directory
in the driver nodes, find out all the filenames, and then use
SparkContext.parallelize() on an array of filenames to split the set of
filenames among tasks. After that, run a foreach() on the parallelized RDD and
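A sketch of that approach (processFile is a hypothetical per-file handler, and
the directory path is made up):

    val fileNames = new java.io.File("/data/input").listFiles.map(_.getAbsolutePath)
    sc.parallelize(fileNames, fileNames.length).foreach { path =>
      processFile(path)  // runs in parallel, one or more files per task
    }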
Take a look at the org.apache.spark.scheduler.SparkListener class. You can
register your own SparkListener with SparkContext that listens for job-start
and job-end events.
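A minimal sketch, assuming the listener is registered via
SparkContext.addSparkListener:

    import org.apache.spark.scheduler._

    sc.addSparkListener(new SparkListener {
      var start = 0L
      override def onJobStart(jobStart: SparkListenerJobStart) { start = System.currentTimeMillis }
      override def onJobEnd(jobEnd: SparkListenerJobEnd) {
        println("Job took " + (System.currentTimeMillis - start) + " ms")
      }
    })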
Matei
On Oct 10, 2013, at 9:04 PM, prabeesh k prabsma...@gmail.com wrote:
Is there any way to get execution time in the
Hi Mingyu,
The latest version of Spark works with Scala 2.9.3, which is the latest
Scala-2.9 version. There's also a branch called branch-2.10 on GitHub that uses
2.10.3. What specific libraries are you having trouble with?
I see other open source projects private-namespacing the dependencies
Hi Paul,
Just FYI, I'm not sure Akka was designed to pass ActorSystems across closures
the way you're doing. Also, there's a bit of a misunderstanding about closures
on RDDs. Consider this change you made to ActorWordCount:
lines.flatMap(_.split("\\s+")).map(x => (x, 1)).reduceByKey(_ +