Thanks for flagging this. I reverted the relevant YARN fix in Spark
1.2 release. We can try to debug this in master.
On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang wrote:
> I created a ticket for this:
>
> https://issues.apache.org/jira/browse/SPARK-4757
>
>
> Jianshi
>
> On Fri, Dec 5, 2014 at
The second choice is better. Once you call collect() you are pulling
all of the data onto a single node; you want to do most of the
processing in parallel on the cluster, which is what map() will do.
Ideally you'd try to summarize the data or reduce it before calling
collect().
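
A minimal sketch of that suggestion (the input path and the summary
computed here are just placeholders to make the shapes concrete):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("summarize-before-collect"))
val lines = sc.textFile("hdfs:///data/input")   // hypothetical input

// map() runs in parallel across the cluster; reduce() shrinks the result
// so only a small summary value comes back to the driver.
val totalChars = lines.map(_.length.toLong).reduce(_ + _)
println("Total characters: " + totalChars)

// By contrast, lines.collect() would pull every record onto this one node.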
On Fri, Dec 5, 201
Yeah the main way to do this would be to have your own static cache of
connections. These could be using an object in Scala or just a static
variable in Java (for instance a set of connections that you can
borrow from).
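
A rough Scala sketch of that idea (the JDBC URL and pool size are made up
for illustration; this isn't any particular project's API):

import java.sql.{Connection, DriverManager}
import java.util.concurrent.LinkedBlockingQueue

// One instance of this object exists per executor JVM, so the pool is
// created once and shared by all tasks running in that executor.
object ConnectionCache {
  private val url = "jdbc:postgresql://db-host:5432/mydb"   // hypothetical
  private lazy val pool = {
    val q = new LinkedBlockingQueue[Connection]()
    (1 to 4).foreach(_ => q.put(DriverManager.getConnection(url)))
    q
  }
  def borrow(): Connection = pool.take()
  def release(c: Connection): Unit = pool.put(c)
}

// Inside a task:
// val conn = ConnectionCache.borrow()
// try { /* use conn */ } finally { ConnectionCache.release(conn) }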
- Patrick
On Thu, Dec 4, 2014 at 5:26 PM, Tobias Pfeiffer wrote:
> Hi,
>
> O
Hey Manoj,
One proposal potentially of interest is the Spark Kernel project from
IBM - it's worth looking up. The interface in that project is more
of a "remote REPL" interface, i.e. you submit commands (as strings)
and get back results (as strings), but you don't have direct
programmatic acce
rd.
>
> On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell wrote:
>>
>> The second choice is better. Once you call collect() you are pulling
>> all of the data onto a single node; you want to do most of the
>> processing in parallel on the cluster, which is what map() w
I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
the third release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 172 developers and more
than 1,000 commits!
This release brings operational and performance improvements in Spark
core
2.0 and v1.2.0-rc2 are pointed to different commits in
>> https://github.com/apache/spark/releases
>>
>> Best Regards,
>>
>> Shixiong Zhu
>>
>> 2014-12-19 16:52 GMT+08:00 Patrick Wendell :
>>>
>>> I'm happy to announce the availability of S
Xiangrui asked me to report that it's back and running :)
On Mon, Dec 22, 2014 at 3:21 PM, peng wrote:
> Me 2 :)
>
>
> On 12/22/2014 06:14 PM, Andrew Ash wrote:
>
> Hi Xiangrui,
>
> That link is currently returning a 503 Over Quota error message. Would you
> mind pinging back out when the page i
Hey Nick,
I think Hitesh was just trying to be helpful and point out the policy
- not necessarily saying there was an issue. We've taken a close look
at this and I think we're in good shape here vis-a-vis this policy.
- Patrick
On Mon, Dec 22, 2014 at 5:29 PM, Nicholas Chammas
wrote:
> Hitesh,
>
Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false?
http://spark.apache.org/docs/latest/configuration.html
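
For example (app name and paths are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("overwrite-output")
  .set("spark.hadoop.validateOutputSpecs", "false")  // skip the "path already exists" check
val sc = new SparkContext(conf)
sc.parallelize(1 to 100).saveAsTextFile("hdfs:///tmp/existing-output")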
- Patrick
On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai wrote:
> Hi,
>
>
>
> We have such requirements to save RDD output to HDFS with saveAsTextFile
> like API, but need
ble as any alternatives. This is already pretty easy IMO.
- Patrick
On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao wrote:
> I am wondering if we can provide more friendly API, other than configuration
> for this purpose. What do you think Patrick?
>
> Cheng Hao
>
> -Original
What do you mean when you say "the overhead of spark shuffles start to
accumulate"? Could you elaborate more?
In newer versions of Spark shuffle data is cleaned up automatically
when an RDD goes out of scope. It is safe to remove shuffle data at
this point because the RDD can no longer be referenc
Hey Eric,
I'm just curious - which specific features in 1.2 do you find most
help with usability? This is a theme we're focusing on for 1.3 as
well, so it's helpful to hear what makes a difference.
- Patrick
On Sun, Dec 28, 2014 at 1:36 AM, Eric Friedman
wrote:
> Hi Josh,
>
> Thanks for the inf
It should appear in the page for any stage in which accumulators are updated.
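For example, assuming a SparkContext `sc` (the input path is a
placeholder), the name passed to sc.accumulator() is what shows up on the
stage page:

val errorCount = sc.accumulator(0, "Error Count")
sc.textFile("hdfs:///logs/app.log").foreach { line =>
  if (line.contains("ERROR")) errorCount += 1
}
// The accumulator appears on the page of the stage that ran the foreach.
println(errorCount.value)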
On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip wrote:
> Hello,
>
> From accumulator documentation, it says that if the accumulator is named, it
> will be displayed in the WebUI. However, I cannot find it anywhere.
>
> Do I
Akhil,
Those are handled by ASF infrastructure, not anyone in the Spark
project. So this list is not the appropriate place to ask for help.
- Patrick
On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das wrote:
> My mails to the mailing list are getting rejected, have opened a Jira issue,
> can someone t
Yep, currently it only supports running at least 1 slave.
On Sat, Mar 1, 2014 at 4:47 PM, nicholas.chammas
wrote:
> I successfully launched a Spark EC2 "cluster" with 0 slaves using spark-ec2.
> When trying to login to the master node with spark-ec2 login, I get the
> following:
>
> Searching for
Hey All,
We have a fix for this but it didn't get merged yet. I'll put it as a
blocker for Spark 0.9.1.
https://github.com/pwendell/incubator-spark/commit/66594e88e5be50fca073a7ef38fa62db4082b3c8
https://spark-project.atlassian.net/browse/SPARK-1190
Sergey if you could try compiling Spark with
e
> org.slf4j.impl.StaticLoggerBinder. Adding slf4j-log4j12 with compile scope
> helps, and I confirm the logging is redirected to slf4j/Logback correctly
> now with the patched module. I'm not sure however if using compile scope for
> slf4j-log4j12 is a good idea.
>
> --
The difference between your two jobs is that take() is optimized and
only runs on the machine where you are using the shell, whereas
sortByKey requires using many machines. It seems like maybe python
didn't get upgraded correctly on one of the slaves. I would look in
the /root/spark/work/ folder (f
Hey There,
This is interesting... thanks for sharing this. If you are storing in
MEMORY_ONLY then you are just directly storing Java objects in the
JVM. So they can't be compressed because they aren't really stored in
a known format; it's just left up to the JVM.
To answer your other question, it's
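
A small sketch of the distinction (input path is a placeholder):
MEMORY_ONLY keeps deserialized objects, while MEMORY_ONLY_SER keeps
serialized bytes, which Spark can optionally compress via spark.rdd.compress:

import org.apache.spark.storage.StorageLevel

val raw = sc.textFile("hdfs:///data/input")
raw.persist(StorageLevel.MEMORY_ONLY)       // plain Java objects, no compression

val ser = sc.textFile("hdfs:///data/input")
ser.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized bytes; compressible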
Hey Sen,
Is your code in the driver code or inside one of the tasks?
If it's in the tasks, the place you would expect these to be is in
stdout file under //work/[stdout/stderr]. Are you seeing
at least stderr logs in that folder? If not then the tasks might not
be running on the workers machines.
n more System.out.println (E)
>>
>> } // main
>>
>> } //class
>>
>> I get the console outputs on the master for (B) and (E). I do not see any
>> stdout in the worker node. I find the stdout and stderr in the
>> /work//0/. I see output
>> in stde
Hey Matt,
The best way is definitely just to increase the ulimit if possible;
this is sort of an assumption we make in Spark that clusters will be
able to move it around.
You might be able to hack around this by decreasing the number of
reducers but this could have some performance implications f
A block is an internal construct that isn't directly exposed to users.
Internally though, each partition of an RDD is mapped to one block.
- Patrick
On Mon, Mar 10, 2014 at 11:06 PM, David Thomas wrote:
> What is the concept of Block and BlockManager in Spark? How is a Block
> related to a Parti
Dianna I'm forwarding this to the dev list since it might be useful
there as well.
On Wed, Mar 12, 2014 at 11:39 AM, Diana Carroll wrote:
> Hi all. I needed to build the Spark docs. The basic instructions to do
> this are in spark/docs/README.md but it took me quite a bit of playing
> around to
Hey Pierre,
Currently modifying the "slaves" file is the best way to do this
because in general we expect that users will want to launch workers on
any slave.
I think you could hack something together pretty easily to allow this.
For instance if you modify the line in slaves.sh from this:
for
In Spark 1.0 we've added better randomization to the scheduling of
tasks so they are distributed more evenly by default.
https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e
However having specific policies like that isn't really supported
unless you subclass the RDD it
Hey Nicholas,
The best way to do this is to do rdd.mapPartitions() and pass a
function that will open a JDBC connection to your database and write
the range in each partition.
On the input path there is something called JDBC-RDD that is relevant:
http://spark.incubator.apache.org/docs/latest/api/
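A sketch of the output path (table, URL and credentials are made up;
foreachPartition is used here since this is a pure write, but the same
one-connection-per-partition pattern works with mapPartitions()):

import java.sql.DriverManager

val rows = sc.parallelize(Seq((1L, "alpha"), (2L, "beta")))
rows.foreachPartition { iter =>
  // One connection per partition, opened on the executor that runs the task.
  val conn = DriverManager.getConnection("jdbc:mysql://db-host/mydb", "user", "pass")
  val stmt = conn.prepareStatement("INSERT INTO events (id, value) VALUES (?, ?)")
  try {
    iter.foreach { case (id, value) =>
      stmt.setLong(1, id)
      stmt.setString(2, value)
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}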
Hey All,
The Hadoop Summit uses community choice voting to decide which talks
to feature. It would be great if the community could help vote for
Spark talks so that Spark has a good showing at this event. You can
make three votes on each track. Below I've listed Spark talks in each
of the tracks -
Sean - was this merged into the 0.9 branch as well? (It seems so based
on the message from rxin.) If so, it might make sense to try out the
head of branch-0.9 as well. Unless there are *also* other changes
relevant to this in master.
- Patrick
On Sun, Mar 16, 2014 at 12:24 PM, Sean Owen wrote:
> Y
This is not released yet but we're planning to cut a 0.9.1 release
very soon (e.g. most likely this week). In the meantime you'll have to
check out branch-0.9 of Spark and publish it locally, then depend on the
snapshot version. Or just wait it out...
On Fri, Mar 14, 2014 at 2:01 PM, Adrian Mocanu
wr
Hey Roman,
Ya, definitely check out pull request 42 - one cool thing is this patch
now includes information about in-memory storage in the listener
interface, so you can see directly which blocks are cached/on-disk
etc.
- Patrick
On Mon, Mar 17, 2014 at 5:34 PM, Matei Zaharia wrote:
> Take a look
ard
>>>
>>>
>>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers wrote:
>>>>
>>>> not that long ago there was a nice example on here about how to combine
>>>> multiple operations on a single RDD. so basically if you want to do a
>>>> count() and something else, how to roll them into a single job. i think
>>>> patrick wendell gave the examples.
>>>>
>>>> i cant find them anymore patrick can you please repost? thanks!
>>>
>>>
>>
>
As Mark said you can actually access this easily. The main issue I've
seen from a performance perspective is people having a bunch of really
small partitions. This will still work but the performance will
improve if you coalesce the partitions using rdd.coalesce().
This can happen for example if y
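
To make that concrete (the input path and the target of 100 partitions are
just examples):

val manySmall = sc.textFile("hdfs:///data/many-small-files")
println(manySmall.partitions.length)     // e.g. thousands of tiny partitions
val fewer = manySmall.coalesce(100)      // merge into fewer, larger partitions
println(fewer.partitions.length)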
Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely counter
intuitive.
On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
wrote:
> I would love to work
Ah we should just add this directly in pyspark - it's as simple as the
code Shivaram just wrote.
- Patrick
On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
wrote:
> There is no direct way to get this in pyspark, but you can get it from the
> underlying java rdd. For example
>
> a = sc.para
Starting with Spark 0.9 the protobuf dependency we use is shaded and
cannot interfere with other protobuf libraries, including those in
Hadoop. Not sure what's going on in this case. Would someone who is
having this problem post exactly how they are building Spark?
- Patrick
On Fri, Mar 21, 2014 at
I'm not sure exactly how your cluster is configured. But as far as I can
tell Cloudera's MR1 CDH5 dependencies are against Hadoop 2.3. I'd just find
the exact CDH version you have and link against the `mr1` version of their
published dependencies in that version.
So I think you want "2.3.0-mr1-cd
Hey Rohit,
I think external tables based on Cassandra or other datastores will work
out of the box if you build Catalyst with Hive support.
Michael may have feelings about this but I'd guess the longer term design
for having schema support for Cassandra/HBase etc likely wouldn't rely on
hive exte
If you call repartition() on the original stream you can set the level of
parallelism after it's ingested from Kafka. I'm not sure how it maps Kafka
topic partitions to tasks for the ingest, though.
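
A rough sketch of where the repartition() call goes, using a socket stream
as a stand-in for the Kafka source (host, port, batch interval and the
partition count of 32 are placeholders; the same call applies to a Kafka
DStream):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))
val ingested = ssc.socketTextStream("localhost", 9999)
val widened = ingested.repartition(32)   // spread downstream work over 32 tasks
widened.foreachRDD(rdd => println(rdd.count()))
ssc.start()
ssc.awaitTermination()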
On Thu, Mar 27, 2014 at 11:09 AM, Scott Clasen wrote:
> I have a simple streaming job that create
This will be a feature in Spark 1.0 but is not yet released. In 1.0 Spark
applications can persist their state so that the UI can be reloaded after
they have completed.
- Patrick
On Sun, Mar 30, 2014 at 10:30 AM, David Thomas wrote:
> Is there a way to see 'Application Detail UI' page (at mast
Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be
getting pulled in unless you are directly using akka yourself. Are you?
Does your project have other dependencies that might be indirectly pulling
in protobuf 2.4.1? It would be helpful if you could list all of your
depende
Ya this is a good way to do it.
On Sun, Mar 30, 2014 at 10:11 PM, Vipul Pandey wrote:
> Hi,
>
> I need to batch the values in my final RDD before writing out to hdfs. The
> idea is to batch multiple "rows" in a protobuf and write those batches out
> - mostly to save some space as a lot of metad
dependency but it still failed whenever I use the
> jar with ScalaBuf dependency.
> Spark version is 0.9.0
>
>
> ~Vipul
>
> On Mar 31, 2014, at 4:51 PM, Patrick Wendell wrote:
>
> Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be
> getting pull
Do you get the same problem if you build with maven?
On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey wrote:
> SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly
>
> That's all I do.
>
> On Apr 1, 2014, at 11:41 AM, Patrick Wendell wrote:
>
> Vidal - could you show e
(default-cli) on project spark-0.9.0-incubating: Error reading assemblies:
> No assembly descriptors found. -> [Help 1]
> upon runnning
> mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean assembly:assembly
>
>
> On Apr 1, 2014, at 4:13 PM, Patrick Wendell wrote:
>
> D
For textFile I believe we overload it and let you set a codec directly:
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
For saveAsSequenceFile yep, I think Mark is right, you need an option.
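For reference, a sketch of the codec overload exercised by that test
(output paths are placeholders):

import org.apache.hadoop.io.compress.GzipCodec

val data = sc.parallelize(Seq("a", "b", "c"))
data.saveAsTextFile("hdfs:///tmp/output-plain")                     // uncompressed
data.saveAsTextFile("hdfs:///tmp/output-gz", classOf[GzipCodec])    // gzip-compressed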
On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra wrote
The driver stores the metadata associated with the partition, but the
re-computation will occur on an executor. So if several partitions are
lost, e.g. due to a few machines failing, the re-computation can be striped
across the cluster, making it fast.
On Wed, Apr 2, 2014 at 11:27 AM, David Thoma
Hey Phillip,
Right now there is no mechanism for this. You have to go in through the low
level listener interface.
We could consider exposing the JobProgressListener directly - I think it's
been factored nicely so it's fairly decoupled from the UI. The concern is
this is a semi-internal piece of
Btw - after that initial thread I proposed a slightly more detailed set of
dates:
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
- Patrick
On Thu, Apr 3, 2014 at 11:28 AM, Matei Zaharia wrote:
> Hey Bhaskar, this is still the plan, though QAing might take longer than
> 15 days.
Hey Parviz,
There was a similar thread a while ago... I think that many companies like
to be discreet about the size of large clusters. But of course it would be
great if people wanted to share openly :)
For my part - I can say that Spark has been benchmarked on
hundreds-of-nodes clusters before
We might be able to incorporate the maven rpm plugin into our build. If
that can be done in an elegant way it would be nice to have that
distribution target for people who wanted to try this with arbitrary Spark
versions...
Personally I have no familiarity with that plug-in, so curious if anyone i
If you look in the Spark UI, do you see any garbage collection happening?
My best guess is that some of the executors are going into GC and they are
timing out. You can manually increase the timeout by setting the Spark conf:
spark.storage.blockManagerSlaveTimeoutMs
to a higher value. In your cas
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller wrote:
> I am running the latest version of PySpark branch-0.9 and having some
> trouble with join.
>
> One RDD is about 100G (25GB compressed and serialized in memory) with
> 130K records, the other RDD is about 10G (2.5G compressed and
> serialized in
Okay so I think the issue here is just a conflict between your application
code and the Hadoop code.
Hadoop 2.0.0 depends on protobuf 2.4.0a:
https://svn.apache.org/repos/asf/hadoop/common/tags/release-2.0.0-alpha/hadoop-project/pom.xml
Your code is depending on protobuf 2.5.X
The protobuf libra
This job might still be faster... in MapReduce there will be other
overheads in addition to the fact that doing sequential reads from HBase is
slow. But it's possible the bottleneck is the HBase scan performance.
- Patrick
On Wed, Apr 9, 2014 at 10:10 AM, Jerry Lam wrote:
> Hi Dave,
>
> This i
couldn't
>> find one in the docs. I also couldn't find an environment variable with the
>> version. After futzing around a bit I realized it was printed out (quite
>> conspicuously) in the shell startup banner.
>>
>>
>> On Sat, Feb 22, 2014 at 7:15 PM
nalytics | Brussels Office
> www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com
>
> FR +32 485 91 87 31 | Skype: pierre.borckmans
>
>
>
>
>
> On 10 Apr 2014, at 23:05, Patrick Wendell wrote:
>
> I think this was solved in a recent merge:
>
>
&g
To reiterate what Tom was saying - the code that runs inside of Spark on
YARN is exactly the same code that runs in any deployment mode. There
shouldn't be any performance difference once your application starts
(assuming you are comparing apples-to-apples in terms of hardware).
The differences ar
Unfortunately - I think a lot of this is due to generally increased latency
on ec2 itself. I've noticed that it's way more common than it used to be
for instances to come online past the "wait" timeout in the ec2 script.
On Fri, Apr 18, 2014 at 9:11 PM, FRANK AUSTIN NOTHAFT wrote:
> Aureliano,
I put some notes in this doc:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
On Sun, Apr 20, 2014 at 8:58 PM, Arun Ramakrishnan <
sinchronized.a...@gmail.com> wrote:
> I would like to run some of the tests selectively. I am in branch-1.0
>
> Tried the following two comm
For a HadoopRDD, first the spark scheduler calculates the number of tasks
based on input splits. Usually people use this with HDFS data so in that
case it's based on HDFS blocks. If the HDFS datanodes are co-located with
the Spark cluster then it will try to run the tasks on the data node that
cont
Try running sbt/sbt clean and re-compiling. Any luck?
On Thu, Apr 24, 2014 at 5:33 PM, martin.ou wrote:
>
>
> An exception occurred when compiling Spark 0.9.1 using sbt (env: Hadoop 2.3)
>
> 1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
>
>
>
> 2.found Exception:
>
> found : org.apache
Hey Todd,
This approach violates the normal semantics of RDD transformations, as you
point out. You've already noted some of the issues, and there are
others. For instance, say you cache originalRDD and some of the partitions
end up in memory and others end up on disk. The ones that end up in me
Hey Jim,
This IOException thing is a general issue that we need to fix and your
observation is spot-on. There is actually a JIRA for it, which I created a
few days ago:
https://issues.apache.org/jira/browse/SPARK-1579
Aaron is assigned to that one but not actively working on it, so we'd
welcome a P
What about if you run ./bin/spark-shell
--driver-class-path=/path/to/your/jar.jar
I think either this or the --jars flag should work, but it's possible there
is a bug with the --jars flag when calling the REPL.
On Mon, Apr 28, 2014 at 4:30 PM, Roger Hoover wrote:
> A couple of issues:
> 1) the
Yes, you can set SPARK_LIBRARY_PATH in 0.9.X and in 1.0 you can set
spark.executor.extraLibraryPath.
On Mon, Apr 28, 2014 at 9:16 AM, Shubham Chopra wrote:
> I am trying to use Spark/MLLib on Yarn and do not have libgfortran
> installed on my cluster. Is there any way I can set LD_LIBRARY_PATH s
This can only be a local filesystem path, though; it can't refer to an HDFS
location. This is because it gets passed directly to the JVM.
On Mon, Apr 28, 2014 at 9:55 PM, Patrick Wendell wrote:
> Yes, you can set SPARK_LIBRARY_PATH in 0.9.X and in 1.0 you can set
> spark.executor.extra
This was fixed in master. I think this happens if you don't set
HADOOP_CONF_DIR to the location where your hadoop configs are (e.g.
yarn-site.xml).
On Sun, Apr 27, 2014 at 7:40 PM, martin.ou wrote:
> 1.my hadoop 2.3.0
> 2.SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
> 3.SPARK_YARN
In general, as Andrew points out, it's possible to submit jobs from
multiple threads and many Spark applications do this.
One thing to check out is the job server from Ooyala, this is an
application on top of Spark that has an automated submission API:
https://github.com/ooyala/spark-jobserver
Yo
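
A minimal sketch of the multi-threaded pattern mentioned above (thread
count and the computation are placeholders); a single SparkContext is
shared, and each thread's action becomes a separate job:

val data = sc.parallelize(1 to 1000000).cache()
val threads = (1 to 3).map { i =>
  new Thread(new Runnable {
    def run(): Unit = {
      // Each action submits its own job; the scheduler runs them concurrently.
      val result = data.map(_ * i).reduce(_ + _)
      println("Job " + i + " result: " + result)
    }
  })
}
threads.foreach(_.start())
threads.foreach(_.join())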
Could you explain more what your job is doing and what data types you are
using? These numbers alone don't necessarily indicate something is wrong.
The relationship between the in-memory and on-disk shuffle amount is
definitely a bit strange; the data gets compressed when written to disk,
but unles
Is this the serialization throughput per task or the serialization
throughput for all the tasks?
On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond wrote:
> Hi
>
> I am running a WordCount program which count words from HDFS, and I
> noticed that the serializer part of code takes a lot of CPU
The signature of this function was changed in Spark 1.0... is there
any chance that somehow you are actually running against a newer
version of Spark?
On Tue, Apr 29, 2014 at 8:58 PM, wxhsdp wrote:
> i met with the same question when update to spark 0.9.1
> (svn checkout https://github.com/apache
ut I'm no
expert.
On Tue, Apr 29, 2014 at 10:14 PM, Liu, Raymond wrote:
> For all the tasks, say 32 task on total
>
> Best Regards,
> Raymond Liu
>
>
> -Original Message-
> From: Patrick Wendell [mailto:pwend...@gmail.com]
>
> Is this the serialization throug
This class was made to be "Java friendly" so that we wouldn't have to
use two versions. The class itself is simple. But I agree adding Java
setters would be nice.
On Tue, Apr 29, 2014 at 8:32 PM, Soren Macbeth wrote:
> There is a JavaSparkContext, but no JavaSparkConf object. I know SparkConf
> i
partition order. RDD order is also what allows me to get the
> top k out of RDD by doing RDD.sort().take().
>
> Am I misunderstanding it? Or, is it just when RDD is written to disk that
> the order is not well preserved? Thanks in advance!
>
> Mingyu
>
>
>
>
> On 1/22
ions returned by RDD.getPartitions()
> and the row orders within the partitions determine the row order, I’m not
> sure why union doesn’t respect the order because union operation simply
> concatenates the two lists of partitions from the two RDDs.
>
> Mingyu
>
>
>
>
> On 4/
2, 3”, “4, 5, 6”], then
> rdd1.union(rdd2).saveAsTextFile(…) should’ve resulted in a file with three
> lines “a, b, c”, “1, 2, 3”, and “4, 5, 6” because the partitions from the
> two RDDs are concatenated.
>
> Mingyu
>
>
>
>
> On 4/29/14, 10:55 PM, "Patrick W
This is a consequence of the way the Hadoop files API works. However,
you can (fairly easily) add code to just rename the file because it
will always produce the same filename.
(roughly, in Scala)
val dir = "/some/dir"
rdd.coalesce(1).saveAsTextFile(dir)
// with one partition, Spark writes a single data file named "part-00000"
val f = new java.io.File(dir + "/part-00000")
f.renameTo(new java.io.File(dir + "/output.txt"))
tPrefix)
>>> s3Client.deleteObjects(new
>>> DeleteObjectsRequest("upload").withKeys(objectListing.getObjectSummaries.asScala.map(s
>>> => new KeyVersion(s.getKey)).asJava))
>>>
>>> Using a 3GB object I achieved about 33MB/s betwee
Spark will only work with Scala 2.10... are you trying to do a minor
version upgrade or upgrade to a totally different version?
You could do this as follows if you want:
1. Fork the spark-ec2 repository and change the file here:
https://github.com/mesos/spark-ec2/blob/v2/scala/init.sh
2. Modify yo
Broadcast variables need to fit entirely in memory - so that's a
pretty good litmus test for whether or not to broadcast a smaller
dataset or turn it into an RDD.
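A small example of the broadcast side of that trade-off (the lookup table
here is a placeholder; it has to fit in memory on the driver and on every
executor):

val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b", 3 -> "c"))
val labeled = sc.parallelize(Seq(1, 2, 3, 4)).map { k =>
  (k, lookup.value.getOrElse(k, "unknown"))
}
labeled.collect().foreach(println)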
On Fri, May 2, 2014 at 7:50 AM, Prashant Sharma wrote:
> I had like to be corrected on this but I am just trying to say small enough
>
Hey Jeremy,
This is actually a big problem - thanks for reporting it, I'm going to
revert this change until we can make sure it is backwards compatible.
- Patrick
On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman wrote:
> Hi all,
>
> A heads up in case others hit this and are confused... This nic
PM, Patrick Wendell wrote:
> Hey Jeremy,
>
> This is actually a big problem - thanks for reporting it, I'm going to
> revert this change until we can make sure it is backwards compatible.
>
> - Patrick
>
> On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman
> wrote
Hey Adrian,
If you are including log4j-over-slf4j.jar in your application, you'll
still need to manually exclude slf4j-log4j12.jar from Spark. However,
it should work once you do that. Before 0.9.1 you couldn't make it
work, even if you added an exclude.
- Patrick
On Thu, May 8, 2014 at 1:52 PM,
Hey Brian,
We've had a fairly stable 1.0 branch for a while now. I started a vote
on the dev list last night... voting can take some time, but it
usually wraps up anywhere from a few days to a few weeks.
However, you can get started right now with the release candidates.
These are likely to be almost
Just wondering - how are you launching your application? If you want
to set values like this, the right way is to add them to the SparkConf
when you create a SparkContext:

val conf = new SparkConf()
  .set("spark.akka.frameSize", "1")
  .setAppName(...)
  .setMaster(...)
val sc = new SparkContext(conf)
I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
is a milestone release as the first in the 1.0 line of releases,
providing API stability for Spark's core interfaces.
Spark 1.0.0 is Spark's largest release ever, with contributions from
117 developers. I'd like to thank everyon
;>
>>> --
>>> Christopher T. Nguyen
>>> Co-founder & CEO, Adatao <http://adatao.com>
>>> linkedin.com/in/ctnguyen <http://linkedin.com/in/ctnguyen>
>>>
>>>
>>>
>>>
>>> On Fri, May 30, 2014 at 3:12 AM, Patrick
Hi Jeremy,
That's interesting; I don't think anyone has ever reported an issue running
these scripts due to Python incompatibility, but they may require Python
2.7+. I regularly run them from the AWS Ubuntu 12.04 AMI... that might be a
good place to start. But if there is a straightforward way to
The main change here was refactoring the SparkListener interface, which
is where we expose internal state about a Spark job to other
applications. We've cleaned up these APIs a bunch and also added a
way to log all data as JSON for post-hoc analysis:
https://github.com/apache/spark/blob/master/cor
I've removed my docs from my site to avoid confusion... somehow that
link propagated all over the place!
On Sat, May 31, 2014 at 1:58 AM, Xiangrui Meng wrote:
> The documentation you looked at is not official, though it is from
> @pwendell's website. It was for the Spark SQL release. Please find
Can you look at the logs from the executor or in the UI? They should
give an exception with the reason for the task failure. Also in the
future, for this type of e-mail please only e-mail the "user@" list
and not both lists.
- Patrick
On Sat, May 31, 2014 at 3:22 AM, prabeesh k wrote:
> Hi,
>
>
Hey There,
You can remove an accumulator by just letting it go out of scope and
it will be garbage collected. For broadcast variables we actually
store extra information for them, so we provide hooks for users to
remove the associated state. There is no such need for accumulators,
though.
- Patrick
Currently, an executor is always run in its own JVM, so it should be
possible to just use some static initialization to e.g. launch a
sub-process and set up a bridge with which to communicate.
This would be a fairly advanced use case, however.
- Patrick
On Thu, May 29, 2014 at 8:39 PM, ans
> 1. ctx is an instance of JavaSQLContext but the textFile method is called as
> a member of ctx.
> According to the API JavaSQLContext does not have such a member, so im
> guessing this should be sc instead.
Yeah, I think you are correct.
> 2. In that same code example the object sqlCtx is refer
> 1) Is there a guarantee that a partition will only be processed on a node
> which is in the "getPreferredLocations" set of nodes returned by the RDD ?
No, there isn't; by default Spark may schedule in a "non-preferred"
location after `spark.locality.wait` has expired.
http://spark.apache.org/doc
I think there are a few ways to do this... the simplest one might be to
manually build a set of comma-separated paths that excludes the bad file,
and pass that to textFile().
When you call textFile() under the hood it is going to pass your filename
string to hadoopFile() which calls setInputPaths(
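
A sketch of the comma-separated approach (file names are hypothetical):

val goodPaths = Seq(
  "hdfs:///data/part-00000",
  "hdfs:///data/part-00001",
  "hdfs:///data/part-00003"    // part-00002, the bad file, is left out
).mkString(",")
val rdd = sc.textFile(goodPaths)   // setInputPaths splits on the commas
println(rdd.count())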
I would agree with your guess; it looks like the YARN library isn't
correctly finding your yarn-site.xml file. If you look in
yarn-site.xml, do you definitely see the resource manager
address/addresses?
Also, you can try running this command with
SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath
Hey just to clarify this - my understanding is that the poster
(Jeremy) was using a custom AMI to *launch* spark-ec2. I normally
launch spark-ec2 from my laptop. And he was looking for an AMI that
had a high enough version of python.
Spark-ec2 itself has a flag "-a" that allows you to give a spec