The problem is that Java objects can take more space than the underlying data,
but there are options in Spark to store data in serialized form to get around
this. Take a look at https://spark.incubator.apache.org/docs/latest/tuning.html.
Matei
On Feb 25, 2014, at 12:01 PM, Suraj Satishkumar
In Spark 0.9 and master, you can pass the -i argument to spark-shell to load a
script containing commands before opening the prompt. This is also a feature of
the Scala shell as a whole (try scala -help for details).
Also, once you’re in the shell, you can use :load file.scala to execute the
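For illustration, a rough sketch of such a preload script (the file name and contents are made up for this example; sc is provided by spark-shell):

// init.scala, launched with: bin/spark-shell -i init.scala
// or run from inside the shell with: :load init.scala
val logs = sc.textFile("data/logs.txt")
val errorCount = logs.filter(_.contains("ERROR")).count()
println("Found " + errorCount + " error lines")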
Is it an error, or just a warning? In any case, you need to get those libraries
from a build of Hadoop for your platform. Then add them to the
SPARK_LIBRARY_PATH environment variable in conf/spark-env.sh, or to your
-Djava.library.path if launching an application separately.
These libraries
Hi Dana,
It’s hard to tell exactly what is consuming time, but I’d suggest starting by
profiling the single application first. Three things to look at there:
1) How many stages and how many tasks per stage is Spark launching (in the
application web UI at http://driver:4040)? If you have
Since it’s from Scala, it might mean you’re running with a different version of
Scala than you compiled Spark with. Spark 0.8 and earlier use Scala 2.9, while
Spark 0.9 uses Scala 2.10.
Matei
On Mar 11, 2014, at 8:19 AM, Jeyaraj, Arockia R (Arockia)
arockia.r.jeya...@verizon.com wrote:
Hi,
Thanks, added you.
On Mar 11, 2014, at 2:47 AM, Christoph Böhm listenbru...@gmx.net wrote:
Dear Spark team,
thanks for the great work and congrats on becoming an Apache top-level
project!
You could add us to your Powered By page, because we are using Spark (and
Shark) to perform
I agree that we can’t keep adding these to the core API, partly because it will
get unwieldy to maintain and partly just because each storage system will bring
in lots of dependencies. We can simply have helper classes in different modules
for each storage system. There’s some discussion on
On Mar 14, 2014, at 5:52 PM, Michael Allman m...@allman.ms wrote:
I also found that the product and user RDDs were being rebuilt many times
over in my tests, even for tiny data sets. By persisting the RDD returned
from updateFeatures() I was able to avoid a raft of duplicate computations.
Is
If it’s a driver on the cluster, please open a JIRA issue about this — this
kill command is indeed intended to work.
Matei
On Mar 16, 2014, at 2:35 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Are you embedding your driver inside the cluster?
If not then that command will not kill the
Thanks, I’ve added you:
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. Let me know
if you want to change any wording.
Matei
On Mar 16, 2014, at 6:48 AM, Egor Pahomov pahomov.e...@gmail.com wrote:
Hi, page https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Hi Diana,
Non-text input formats are only supported in Java and Scala right now, where
you can use sparkContext.hadoopFile or .hadoopDataset to load data with any
InputFormat that Hadoop MapReduce supports. In Python, you unfortunately only
have textFile, which gives you one record per line.
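As a rough sketch of the Scala route mentioned above (the path and the key/value Writable types are illustrative assumptions; this loads a SequenceFile through the old mapred API):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.SequenceFileInputFormat

val records = sc.hadoopFile("hdfs:///data/events.seq",
  classOf[SequenceFileInputFormat[IntWritable, Text]],
  classOf[IntWritable], classOf[Text])
// Convert Writables to plain Scala types before caching or collecting
val pairs = records.map { case (k, v) => (k.get, v.toString) }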
to me how to do that as I
probably should be.
Thanks,
Diana
On Mon, Mar 17, 2014 at 1:02 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi Diana,
Non-text input formats are only supported in Java and Scala right now, where
you can use sparkContext.hadoopFile or .hadoopDataset
Yup, it only returns each value once.
Matei
On Mar 17, 2014, at 1:14 PM, Adrian Mocanu amoc...@verticalscope.com wrote:
Hi
Quick question here,
I know that .foreach is not idempotent. I am wondering if collect() is
idempotent? Meaning that once I’ve collect()-ed, if a Spark node crashes, I
)
On Mon, Mar 17, 2014 at 1:57 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Here’s an example of getting together all lines in a file as one string:
$ cat dir/a.txt
Hello
world!
$ cat dir/b.txt
What's
up??
$ bin/pyspark
files = sc.textFile("dir")
files.collect()
[u'Hello
Thanks for reporting this, looking into it.
On Mar 17, 2014, at 2:44 PM, Walrus theCat walrusthe...@gmail.com wrote:
ping
On Thu, Mar 13, 2014 at 11:05 AM, Aaron Davidson ilike...@gmail.com wrote:
Looks like everything from 0.8.0 and before errors similarly (though Spark
0.3 for Scala
I just meant that you call union() before creating the RDDs that you pass to
new Graph(). If you call it after it will produce other RDDs.
The Graph() constructor actually shuffles and “indexes” the data to make graph
operations efficient, so it’s not too easy to add elements after. You could
Try checking spark-env.sh on the workers as well. Maybe code there is somehow
overriding the spark.executor.memory setting.
Matei
On Mar 18, 2014, at 6:17 PM, Jim Blomo jim.bl...@gmail.com wrote:
Hello, I'm using the Github snapshot of PySpark and having trouble setting
the worker memory
Yes, Spark automatically removes old RDDs from the cache when you make new
ones. Unpersist forces it to remove them right away. In both cases though, note
that Java doesn’t garbage-collect the objects released until later.
Matei
On Mar 19, 2014, at 7:22 PM, Nicholas Chammas
-Dspark.executor.memory in SPARK_JAVA_OPTS *on the master*. I'm
not sure how this varies from 0.9.0 release, but it seems to work on
SNAPSHOT.
On Tue, Mar 18, 2014 at 11:52 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Try checking spark-env.sh on the workers as well. Maybe code there is
somehow
Hi Adrian,
On every timestep of execution, we receive new data, then report updated word
counts for that new data plus the past 30 seconds. The latency here is about
how quickly you get these updated counts once the new batch of data comes in.
It’s true that the count reflects some data from
Try passing the shuffle=true parameter to coalesce, then it will do the map in
parallel but still pass all the data through one reduce node for writing it
out. That’s probably the fastest it will get. No need to cache if you do that.
Matei
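A minimal sketch of that pattern (the RDD, the processing function, and the output path are placeholders):

// Map in parallel, then funnel all the data through one partition for writing
val result = myRdd.map(process)
result.coalesce(1, shuffle = true)
      .saveAsTextFile("hdfs:///output/single-file")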
On Mar 21, 2014, at 4:04 PM, Aureliano Buendia
, at 5:01 PM, Aureliano Buendia buendia...@gmail.com wrote:
Good to know it's as simple as that! I wonder why shuffle=true is not the
default for coalesce().
On Fri, Mar 21, 2014 at 11:37 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Try passing the shuffle=true parameter to coalesce
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your
SparkContext? It tries to serialize that many objects together at a time, which
might be too much. By default the batchSize is 1024.
Matei
On Mar 23, 2014, at 10:11 AM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hi
Congrats Michael & co. for putting this together — this is probably the neatest
piece of technology added to Spark in the past few months, and it will greatly
change what users can do as more data sources are added.
Matei
On Mar 26, 2014, at 3:22 PM, Ognen Duzlevski og...@plainvanillagames.com
wrote:
Much thanks, I suspected this would be difficult. I was hoping to
generate some 4 degrees of separation-like statistics. Looks like
I'll just have to work with a subset of my graph.
On Wed, Mar 26, 2014 at 5:20 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
All-pairs distances
exceptions, but I think they all stem from the above,
eg. org.apache.spark.SparkException: Error sending message to
BlockManagerMaster
Let me know if there are other settings I should try, or if I should
try a newer snapshot.
Thanks again!
On Mon, Mar 24, 2014 at 9:35 AM, Matei Zaharia
Weird, how exactly are you pulling out the sample? Do you have a small program
that reproduces this?
Matei
On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
I forgot to mention that I don't really use all of my data. Instead I use a
sample extracted with randomSample.
Hi Manoj,
At the current time, for drop-in replacement of Hive, it will be best to stick
with Shark. Over time, Shark will use the Spark SQL backend, but should remain
deployable the way it is today (including launching the SharkServer, using the
Hive CLI, etc). Spark SQL is better for
You could probably port it back, but it required some changes on the Java side
as well (a new PythonMLUtils class). It might be easier to fix the Mesos issues
with 0.9.
Matei
On Apr 1, 2014, at 8:53 AM, Ian Ferreira ianferre...@hotmail.com wrote:
Hi there,
For some reason the
Hey Bhaskar, this is still the plan, though QAing might take longer than 15
days. Right now since we’ve passed April 1st, the only features considered for
a merge are those that had pull requests in review before. (Some big ones are
things like annotating the public APIs and simplifying
, 2014 at 3:58 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hey Steve,
This configuration sounds pretty good. The one thing I would consider is
having more disks, for two reasons — Spark uses the disks for large shuffles
and out-of-core operations, and often it’s better to run HDFS or your
Exceptions should be sent back to the driver program and logged there (with a
SparkException thrown if a task fails more than 4 times), but there were some
bugs before where this did not happen for non-Serializable exceptions. We
changed it to pass back the stack traces only (as text), which
, Mar 18, 2014 at 8:14 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
BTW one other thing — in your experience, Diana, which non-text
InputFormats would be most useful to support in Python first? Would it be
Parquet or Avro, simple SequenceFiles with the Hadoop Writable types, or
something
This can’t be done through the script right now, but you can do it manually as
long as the cluster is stopped. If the cluster is stopped, just go into the AWS
Console, right click a slave and choose “launch more of these” to add more. Or
select multiple slaves and delete them. When you run
As long as the filesystem is mounted at the same path on every node, you should
be able to just run Spark and use a file:// URL for your files.
The only downside with running it this way is that Lustre won’t expose data
locality info to Spark, the way HDFS does. That may not matter if it’s a
, Chen Chao,
Christian Lundgren, Diana Carroll, Emtiaz Ahmed, Frank Dai,
Henry Saputra, jianghan, Josh Rosen, Jyotiska NK, Kay Ousterhout,
Kousuke Saruta, Mark Grover, Matei Zaharia, Nan Zhu, Nick Lanham,
Patrick Wendell, Prabin Banka, Prashant Sharma, Qiuzhuang,
Raymond Liu, Reynold Xin, Sandy
)
at
org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)
On Thu, Apr 3, 2014 at 8:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Cool, thanks for the update. Have you tried running a branch with this fix
(e.g. branch-0.9, or the 0.9.1 release candidate?) Also, what memory leak
issue are you
I haven’t seen this but it may be a bug in Typesafe Config, since this is
serializing a Config object. We don’t actually use Typesafe Config ourselves.
Do you have any nulls in the data itself by any chance? And do you know how
that Config object is getting there?
Matei
On Apr 9, 2014, at
To add onto the discussion about memory working space, 0.9 introduced the
ability to spill data within a task to disk, and in 1.0 we’re also changing the
interface to allow spilling data within the same *group* to disk (e.g. when you
do groupBy and get a key with lots of values). The main
Kind of strange because we haven’t updated CloudPickle AFAIK. Is this a package
you added on the PYTHONPATH? How did you set the path, was it in
conf/spark-env.sh?
Matei
On Apr 10, 2014, at 7:39 AM, aazout albert.az...@velos.io wrote:
I am getting a python ImportError on Spark standalone
, Surendranauth Hiraman suren.hira...@velos.io
wrote:
Matei,
Where is the functionality in 0.9 to spill data within a task (separately
from persist)? My apologies if this is something obvious but I don't see it
in the api docs.
-Suren
On Thu, Apr 10, 2014 at 3:59 PM, Matei Zaharia
You can use mapPartitionsWithIndex and look at the partition index (0 will be
the first partition) to decide whether to skip the first line.
Matei
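A rough sketch of that approach, assuming the header only appears in the first partition (which holds for a single input file; the path is illustrative):

// Drop the first line of the file (the header) and keep everything else
val noHeader = sc.textFile("data.csv").mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}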
On Apr 14, 2014, at 8:50 AM, Ethan Jewett esjew...@gmail.com wrote:
We have similar needs but IIRC, I came to the conclusion that this would only
Spark can actually launch multiple executors on the same node if you configure
it that way, but if you haven’t done that, this might mean that some tasks are
reading data from the cache, and some from HDFS. (In the HDFS case Spark will
only report it as NODE_LOCAL since HDFS isn’t tied to a
Kryo won’t make a major impact on PySpark because it just stores data as byte[]
objects, which are fast to serialize even with Java. But it may be worth a try
— you would just set spark.serializer and not try to register any classes. What
might make more impact is storing data as
Yup, one reason it’s 2 actually is to give people a similar experience to
working with large files, in case their code doesn’t deal well with the file
being partitioned.
Matei
On Apr 15, 2014, at 9:53 AM, Aaron Davidson ilike...@gmail.com wrote:
Take a look at the minSplits argument for
Yes, both things can happen. Take a look at
http://spark.apache.org/docs/latest/job-scheduling.html, which includes
scheduling concurrent jobs within the same driver.
Matei
On Apr 15, 2014, at 4:08 PM, Ian Ferreira ianferre...@hotmail.com wrote:
What is the support for multi-tenancy in
Hi Bertrand,
We should probably add a SparkContext.pickleFile and RDD.saveAsPickleFile that
will allow saving pickled objects. Unfortunately this is not in yet, but there
is an issue up to track it: https://issues.apache.org/jira/browse/SPARK-1161.
In 1.0, one feature we do have now is the
The problem is that groupByKey means “bring all the points with this same key
to the same JVM”. Your input is a Seq[Point], so you have to have all the
points there. This means that a) all points will be sent across the network in
a cluster, which is slow (and Spark goes through this sending
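One common workaround is to pre-aggregate values as they are shuffled rather than grouping whole sequences; a sketch, assuming the goal is per-key sums of point coordinates (the Point type, field names, and sample data are made up):

case class Point(x: Double, y: Double)
val pointsByKey = sc.parallelize(Seq(("a", Point(1.0, 2.0)),
                                     ("a", Point(3.0, 4.0)),
                                     ("b", Point(5.0, 6.0))))
// Combine partial (sumX, sumY, count) triples instead of collecting all points per key
val sums = pointsByKey
  .mapValues(p => (p.x, p.y, 1L))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))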
There was a patch posted a few weeks ago
(https://github.com/apache/spark/pull/223), but it needs a few changes in
packaging because it uses a license that isn’t fully compatible with Apache.
I’d like to get this merged when the changes are made though — it would be a
good input source to
See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more recent
build of the docs; if you spot any problems in those, let us know.
Matei
On Apr 23, 2014, at 9:49 AM, Xiangrui Meng men...@gmail.com wrote:
The doc is for 0.9.1. You are running a later snapshot, which added
sparse
It’s currently in the master branch, on https://github.com/apache/spark. You
can check that out from git, build it with sbt/sbt assembly, and then try it
out. We’re also going to post some release candidates soon that will be
pre-built.
Matei
On Apr 23, 2014, at 1:30 PM, diplomatic Guru
Did you launch this using our EC2 scripts
(http://spark.apache.org/docs/latest/ec2-scripts.html) or did you manually set
up the daemons? My guess is that their hostnames are not being resolved
properly on all nodes, so executor processes can’t connect back to your driver
app. This error
The problem is that SparkPi uses Math.random(), which is a synchronized method,
so it can’t scale to multiple cores. In fact it will be slower on multiple
cores due to lock contention. Try another example and you’ll see better
scaling. I think we’ll have to update SparkPi to create a new Random
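A sketch of the kind of change described, not the actual SparkPi code: give each task its own generator instead of calling the synchronized Math.random():

import java.util.Random

val n = 1000000 * sc.defaultParallelism
val count = sc.parallelize(1 to n, sc.defaultParallelism).mapPartitions { iter =>
  val rand = new Random()   // one generator per task, no shared lock
  iter.map { _ =>
    val x = rand.nextDouble() * 2 - 1
    val y = rand.nextDouble() * 2 - 1
    if (x * x + y * y < 1) 1 else 0
  }
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)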
Hey Jim, this is unfortunately harder than I’d like right now, but here’s how
to do it. Look at the stderr file of the executor on that machine, and you’ll
see lines like this:
14/04/24 19:17:24 INFO HadoopRDD: Input split:
file:/Users/matei/workspace/apache-spark/README.md:0+2000
This says
Try setting the serializer to org.apache.spark.serializer.KryoSerializer (see
http://spark.apache.org/docs/0.9.1/tuning.html), it should be considerably
faster.
Matei
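For example, a sketch of setting this in application code (the app name and the optional registrator class are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optionally register classes for better performance:
  // .set("spark.kryo.registrator", "com.example.MyRegistrator")
val sc = new SparkContext(conf)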
On Apr 24, 2014, at 8:01 PM, Earthson Lu earthson...@gmail.com wrote:
From my point of view, both are supported equally. The YARN support is newer
and that’s why there’s been a lot more action there in recent months.
Matei
On Apr 27, 2014, at 12:08 PM, Andrew Ash and...@andrewash.com wrote:
That thread was mostly about benchmarking YARN vs standalone, and the
Hi Roger,
You should be able to use the --jars argument of spark-shell to add JARs onto
the classpath and then work with those classes in the shell. (A recent patch,
https://github.com/apache/spark/pull/542, made spark-shell use the same
command-line arguments as spark-submit). But this is a
Not sure if this is always ideal for Naive Bayes, but you could also hash the
features into a lower-dimensional space (e.g. reduce it to 50,000 features).
For each feature, simply take MurmurHash3(featureID) % 50000, for example.
Matei
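A rough sketch of that hashing trick in Scala (the bucket count matches the 50,000 above; representing feature IDs as strings is an assumption for illustration):

import scala.util.hashing.MurmurHash3

val numBuckets = 50000
def bucket(featureId: String): Int = {
  val h = MurmurHash3.stringHash(featureId)
  ((h % numBuckets) + numBuckets) % numBuckets  // map possibly negative hash into [0, numBuckets)
}
// e.g. accumulate counts into an Array of size numBuckets indexed by bucket(featureId)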
On Apr 27, 2014, at 11:24 PM, DB Tsai dbt...@stanford.edu
Try turning on the Kryo serializer as described at
http://spark.apache.org/docs/latest/tuning.html. Also, are there any exceptions
in the driver program’s log before this happens?
Matei
On Apr 28, 2014, at 9:19 AM, Buttler, David buttl...@llnl.gov wrote:
Hi,
I am trying to run the K-means
Actually wildcards work too, e.g. s3n://bucket/file1*, and I believe so do
comma-separated lists (e.g. s3n://file1,s3n://file2). These are all inherited
from FileInputFormat in Hadoop.
Matei
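For example (the buckets and paths are illustrative):

// Glob within a directory
val a = sc.textFile("s3n://bucket/logs/2014-04-*")
// Comma-separated list of paths, possibly across buckets
val b = sc.textFile("s3n://bucket1/file1,s3n://bucket2/file2")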
On Apr 28, 2014, at 6:05 PM, Andrew Ash and...@andrewash.com wrote:
This is already possible with the
This will be possible in 1.0 after this pull request:
https://github.com/apache/spark/pull/30
Matei
On Apr 29, 2014, at 9:51 AM, Guanhua Yan gh...@lanl.gov wrote:
Hi all:
Is it possible to develop Spark programs in Python and run them on YARN? From
the Python SparkContext class, it
Hi Diana,
Apart from these reasons, in a multi-stage job, Spark saves the map output
files from map stages to the filesystem, so it only needs to rerun the last
reduce stage. This is why you only saw one stage executing. These files are
saved for fault recovery but they speed up subsequent
-uses that?
On Sat, May 3, 2014 at 8:29 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hi Diana,
Apart from these reasons, in a multi-stage job, Spark saves the map output
files from map stages to the filesystem, so it only needs to rerun the last
reduce stage. This is why you only saw
Very cool! Have you thought about sending this as a pull request? We’d be happy
to maintain it inside Spark, though it might be interesting to find a single
Python package that can manage clusters across both EC2 and GCE.
Matei
On May 5, 2014, at 7:18 AM, Akhil Das ak...@sigmoidanalytics.com
Add export SPARK_JAVA_OPTS="-Xss16m" to conf/spark-env.sh. Then it should apply
to the executor.
Matei
On May 5, 2014, at 2:20 PM, Andrea Esposito and1...@gmail.com wrote:
Hi there,
I'm doing an iterative algorithm and sometimes I ended up with a
StackOverflowError; it doesn't matter if I do
Java 8 support is a feature in Spark, but vendors need to decide for themselves
when they’d like to support Java 8 commercially. You can still run Spark on Java 7
or 6 without taking advantage of the new features (indeed our builds are always
against Java 6).
Matei
On May 6, 2014, at 8:59 AM,
Yes, Spark goes through the standard HDFS client and will automatically benefit
from this.
Matei
On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi chan...@gmail.com wrote:
Hi all,
Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
sc.textFile() and other HDFS-related APIs?
You can just pass it around as a parameter.
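A minimal sketch of what that looks like (the object, method, and data here are made up for illustration):

import org.apache.spark.SparkContext

// Pass the SparkContext created in your main program into whatever class or
// function needs to turn a local collection into an RDD.
object Helpers {
  def toRdd(sc: SparkContext, data: Seq[Int]) = sc.parallelize(data)
}

// in main: val rdd = Helpers.toRdd(sc, List(1, 2, 3))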
On May 12, 2014, at 12:37 PM, yh18190 yh18...@gmail.com wrote:
Hi,
Could anyone suggest how we can create a SparkContext object in other
classes or functions where we need to convert a Scala collection to an RDD
using the sc object, like
at ~54GB. stats() returns (count:
56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min:
343)
On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Okay, thanks. Do you have any info on how large your records and data
file are? I'd like to reproduce and fix
400 for the textFile()s, 1500 for the join()s.
On Mon, May 12, 2014 at 7:58 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hey Jim, unfortunately external spilling is not implemented in Python right
now. While it would be possible to update combineByKey to do smarter stuff
here, one
, May 19, 2014 at 1:31 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
What version is this with? We used to build each partition first before
writing it out, but this was fixed a while back (0.9.1, but it may also be in
0.9.0).
Matei
On May 19, 2014, at 12:41 AM, Sai Prasanna
If you’d like to work on just this code for your own changes, it might be best
to copy it to a separate project. Look at
http://spark.apache.org/docs/latest/quick-start.html for how to set up a
standalone job.
Matei
On May 19, 2014, at 4:53 AM, Hao Wang wh.s...@gmail.com wrote:
Hi,
I am
Which version is this with? I haven’t seen standalone masters lose workers. Is
there other stuff on the machines that’s killing them, or what errors do you
see?
Matei
On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote:
Hey folks,
I'm wondering what strategies other folks
They’re tied to the SparkContext (application) that launched them.
Matei
On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:
From looking at the source code I see executors run in their own JVM
subprocesses.
How long do they live for? As long as the worker/slave? Or are
restarting the workers usually
resolves this, but we have often seen workers disappear after a failed or killed
job.
If we see this occur again, I'll try and provide some logs.
On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Which version is this with? I
Unfortunately this is not yet possible. There’s a patch in progress posted here
though: https://github.com/apache/spark/pull/455 — it would be great to get
your feedback on it.
Matei
On May 20, 2014, at 4:21 PM, twizansk twiza...@gmail.com wrote:
Hello,
This seems like a basic question
It sounds like you made a typo in the code — perhaps you’re trying to call
self._jvm.PythonRDDnewAPIHadoopFile instead of
self._jvm.PythonRDD.newAPIHadoopFile? There should be a dot before the new.
Matei
On May 28, 2014, at 5:25 PM, twizansk twiza...@gmail.com wrote:
Hi Nick,
I finally
You can remove cached RDDs by calling unpersist() on them.
You can also use SparkContext.getRDDStorageInfo to get info on cache usage,
though this is a developer API so it may change in future versions. We will add
a standard API eventually but this is just very closely tied to framework
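For example, a sketch of both calls (the path is illustrative; since getRDDStorageInfo is a developer API, its exact shape may change):

val cached = sc.textFile("data.txt").cache()
cached.count()                        // materializes the cache

// Inspect what is currently cached (developer API)
sc.getRDDStorageInfo.foreach(info => println(info))

// Drop it from the cache when it is no longer needed
cached.unpersist()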
Hi Anand,
This is probably already handled by the RDD.pipe() operation. It will spawn a
process and let you feed data to it through its stdin and read data through
stdout.
Matei
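For example, a sketch using a simple external command (the command and path are illustrative; each RDD element is written to the process's stdin as one line, and each stdout line becomes an element of the result):

// One external process is spawned per partition
val upper = sc.textFile("data.txt").pipe("tr a-z A-Z")
upper.take(5).foreach(println)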
On May 29, 2014, at 9:39 AM, ansriniv ansri...@gmail.com wrote:
I have a requirement where for every Spark
That hash map is just a list of where each task ran, it’s not the actual data.
How many map and reduce tasks do you have? Maybe you need to give the driver a
bit more memory, or use fewer tasks (e.g. do reduceByKey(_ + _, 100) to use
only 100 tasks).
Matei
On May 29, 2014, at 2:03 AM, haitao
Quite a few people ask this question and the answer is pretty simple. When we
started Spark, we had two goals — we wanted to work with the Hadoop ecosystem,
which is JVM-based, and we wanted a concise programming interface similar to
Microsoft’s DryadLINQ (the first language-integrated big data
It can be set in an individual application.
Consolidation had some issues on ext3 as mentioned there, though we might
enable it by default in the future because other optimizations now made it
perform on par with the non-consolidation version. It also had some bugs in
0.9.0 so I’d suggest at
What instance types did you launch on?
Sometimes you also get a bad individual machine from EC2. It might help to
remove the node it’s complaining about from the conf/slaves file.
Matei
On May 30, 2014, at 11:18 AM, PJ$ p...@chickenandwaffl.es wrote:
Hey Folks,
I'm really having quite a
More specifically with the -a flag, you *can* set your own AMI, but you’ll need
to base it off ours. This is because spark-ec2 assumes that some packages (e.g.
java, Python 2.6) are already available on the AMI.
Matei
On Jun 1, 2014, at 11:01 AM, Patrick Wendell pwend...@gmail.com wrote:
Hey
1, 2014, at 3:11 PM, PJ$ p...@chickenandwaffl.es wrote:
Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't
gotten any further. No clue what's wrong. I'd really appreciate any guidance
y'all could offer.
Best,
PJ$
On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia
FYI, I opened https://issues.apache.org/jira/browse/SPARK-1990 to track this.
Matei
On Jun 1, 2014, at 6:14 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote:
Sort of.. there were two separate issues, but both related to AWS..
I've sorted the confusion about the Master/Worker AMI ... use
You can just use the Maven build for now, even for Spark 1.0.0.
Matei
On Jun 2, 2014, at 5:30 PM, Mohit Nayak wiza...@gmail.com wrote:
Hey,
Yup that fixed it. Thanks so much!
Is this the only solution, or could this be resolved in future versions of
Spark ?
On Mon, Jun 2, 2014 at
Yeah unfortunately Hadoop 2 requires these binaries on Windows. Hadoop 1 runs
just fine without them.
Matei
On Jun 3, 2014, at 10:33 AM, Sean Owen so...@cloudera.com wrote:
I'd try the internet / SO first -- these are actually generic
Hadoop-related issues. Here I think you don't have
You can use RDD.setName to give it a name. There’s also a creationSite field
that is private[spark] — we may want to add a public setter for that later. If
the name isn’t enough and you’d like this, please open a JIRA issue for it.
Matei
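For example (the path and name are illustrative):

// The name shows up in the storage section of the web UI
val events = sc.textFile("hdfs:///logs/events").setName("events").cache()
events.count()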
On Jun 3, 2014, at 5:22 PM, John Salvatier
What Java version do you have, and how did you get Spark (did you build it
yourself by any chance or download a pre-built one)? If you build Spark
yourself you need to do it with Java 6 — it’s a known issue because of the way
Java 6 and 7 package JAR files. But I haven’t seen it result in this
Ghost, it's the dream language
we've theorized about for years! I hadn't realized!
Indeed, glad you’re enjoying it.
Matei
On Mon, Jun 2, 2014 at 12:05 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
FYI, I opened https://issues.apache.org/jira/browse/SPARK-1990 to track this.
Matei
You can copy your configuration from the old one. I’d suggest just downloading
it to a different location on each node first for testing, then you can delete
the old one if things work.
On Jun 3, 2014, at 12:38 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
Hi ,
I am currently using
If this isn’t the problem, it would be great if you can post the code for the
program.
Matei
On Jun 4, 2014, at 12:58 PM, Xu (Simon) Chen xche...@gmail.com wrote:
Maybe your two workers have different assembly jar files?
I just ran into a similar problem that my spark-shell is using a
Yes, you can write some glue in Spark to call these. Some functions to look at:
- SparkContext.hadoopRDD lets you create an input RDD from an existing JobConf
configured by Hadoop (including InputFormat, paths, etc)
- RDD.mapPartitions lets you operate on all the values in one partition (block)
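A rough sketch of both, assuming an existing mapred JobConf (the input path and the per-block computation are illustrative):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

val jobConf = new JobConf()
FileInputFormat.setInputPaths(jobConf, "hdfs:///data/input")

// One RDD element per record produced by the InputFormat
val records = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])

// Operate on a whole partition (block) at a time rather than line by line
val perBlock = records.mapPartitions { iter =>
  val lines = iter.map(_._2.toString).toList
  Iterator(lines.size)                // e.g. one count per block
}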
than just one line? (Of course you would have to click to expand it.)
On Wed, Jun 4, 2014 at 2:38 AM, John Salvatier jsalvat...@gmail.com wrote:
Ok, I will probably open a Jira.
On Tue, Jun 3, 2014 at 5:29 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
You can use RDD.setName to give
In PySpark, the data processed by each reduce task needs to fit in memory
within the Python process, so you should use more tasks to process this
dataset. Data is spilled to disk across tasks.
I’ve created https://issues.apache.org/jira/browse/SPARK-2021 to track this —
it’s something we’ve
All of these are disposed of automatically if you stop the context or exit the
program.
Matei
On Jun 4, 2014, at 2:22 PM, Daniel Siegmann daniel.siegm...@velos.io wrote:
Will the broadcast variables be disposed automatically if the context is
stopped, or do I still need to unpersist()?
On Wed, Jun 4, 2014 at 1:42 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
In PySpark, the data processed by each reduce task needs to fit in memory
within the Python process, so you should use more tasks to process this
dataset. Data is spilled to disk across tasks.
I’ve created https
to include Python APIs in Spark Streaming?
Anytime frame on this?
Thanks!
John
On Thu, May 29, 2014 at 4:19 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Quite a few people ask this question and the answer is pretty simple. When we
started Spark, we had two goals — we wanted to work