Can you run N jobs depending on the same RDD in parallel on the
driver? Certainly. The context / scheduling is thread-safe and the RDD
is immutable. I've done this to, for example, build and evaluate a
bunch of models simultaneously on a big cluster.
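For example, a rough sketch of the pattern (names like "data" are made up
here), using Scala Futures to submit several jobs against one shared
SparkContext:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val data = sc.textFile("hdfs:///some/input").cache()

// Each Future submits an independent job to the same SparkContext;
// the scheduler runs them concurrently if execution slots are free.
val jobs = (1 to 4).map { i =>
  Future { data.map(_.length * i).sum() }
}
val results = jobs.map(Await.result(_, Duration.Inf))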
On Fri, Jan 15, 2016 at 7:10 PM, Jakob Odersky
It makes sense if you're parallelizing jobs that have relatively few
tasks and you have a lot of execution slots available; then it's
reasonable to turn them loose all at once and use the parallelism available.
There are downsides, eventually: for example, N jobs accessing one
cached RDD may
I personally support this. I'd suggest drawing the line at Hadoop
2.6, but that's minor. More info:
Hadoop 2.7: April 2015
Hadoop 2.6: Nov 2014
Hadoop 2.5: Aug 2014
Hadoop 2.4: April 2014
Hadoop 2.3: Feb 2014
Hadoop 2.2: Oct 2013
CDH 5.0/5.1 = Hadoop 2.3 + backports
CDH 5.2/5.3 = Hadoop 2.5 +
It does look like the state of the DateTime object isn't being
recreated properly on deserialization, given where the NPE
occurs (look at the Joda source code). However the object is
java.io.Serializable. Are you sure the Kryo serialization is correct?
It doesn't quite explain why
(kryo: Kryo) {
> kryo.register(classOf[org.joda.time.DateTime])
> }
> }
>
> Is it because the groupBy sees a different class type? Maybe Array[DateTime]?
> I don’t want to find the answer by trial and error though.
>
> Alex
>
> -Original Message-
As I said in your JIRA, the collect() in question is bringing results
back to the driver to return them. The assumption is that there aren't
a vast number of frequent items. If they are, they aren't 'frequent'
and your min support is too low.
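For reference, min support is set directly on FPGrowth; a higher value means
fewer itemsets come back in that collect(). A minimal sketch (the threshold
and the "transactions" RDD are illustrative):

import org.apache.spark.mllib.fpm.FPGrowth

val model = new FPGrowth()
  .setMinSupport(0.01)   // illustrative; raise it if too many itemsets count as "frequent"
  .setNumPartitions(10)
  .run(transactions)     // transactions: RDD[Array[String]], assumed to exist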
On Wed, Jan 13, 2016 at 12:43 AM, Ritu Raj Tiwari
Given that the items in my transactions are drawn from a set of about
> 25,000 (I previously thought 17,000), what would be a rational way to
> determine the (peak) memory needs of my driver node?
>
> -Raj
>
>
> On Wednesday, January 13, 2016 1:18 AM, Sean Owen <so...@cloude
No, this is on purpose. Have a look at the build POM. A few Guava classes
were used in the public API for Java and have had to stay unshaded. In 2.x
/ master this is already changed such that no unshaded Guava classes should
be included.
On Tue, Jan 12, 2016, 07:28 Jake Yoon
Chiming in late, but my take on this line of argument is: these
companies are welcome to keep using Spark 1.x. If anything the
argument here is about how long to maintain 1.x, and indeed, it's
going to go dormant quite soon.
But using RHEL 6 (or any older version of any platform) and not
wanting
I suspect this is another instance of case classes not working as
expected between the driver and executor when used with spark-shell.
Search JIRA for some back story.
On Tue, Jan 5, 2016 at 12:42 AM, Arun Luthra wrote:
> Spark 1.5.0
>
> data:
>
>
+juliet for an additional opinion, but FWIW I think it's safe to say
that future CDH will have a more consistent Python story and that
story will support 2.7 rather than 2.6.
On Tue, Jan 5, 2016 at 7:17 AM, Reynold Xin wrote:
> Does anybody here care about us dropping
You forgot a return statement in the 'else' clause, which is what the
compiler is telling you. There's nothing more to it here. Your
function can be written much more simply, however, as
Function<String, Boolean> checkHeaders2 =
    x -> x.startsWith("npi") || x.startsWith("CPT");
On Thu, Dec 24, 2015 at 1:13 AM,
You can safely ignore it. Native libs aren't set with HADOOP_HOME. See
Hadoop docs on how to configure this if you're curious, but you really
don't need to.
On Thu, Dec 24, 2015 at 12:19 PM, Bilinmek Istemiyor
wrote:
> Hello,
>
> I have apache spark 1.5.1 installed with the
It does look like it's not actually used. It may simply be there for
completeness, to match cancelStage and cancelJobGroup, which are used.
I also don't know of a good reason there's no way to kill a whole job.
On Wed, Dec 16, 2015 at 1:15 PM, Jacek Laskowski wrote:
> Hi,
>
>
> Hmmm.
> This would go against the grain.
>
> I have to ask how you came to that conclusion…
>
> There are a lot of factors… e.g. Yarn vs Mesos?
>
> What you’re suggesting would mean a loss of parallelism.
>
>
>> On Dec 16, 2015, at 12:22 AM, Sean Owen <s
1 per machine is the right number. If you are running very large heaps
(>64GB) you may consider multiple per machine just to make sure each
executor's GC pauses aren't excessive, but even this might be better mitigated
with GC tuning.
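Illustrative sketch only, with made-up sizes: either split one huge heap
into a couple of executors, or keep one and tune GC, e.g.:

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "40g")   // rather than one >64GB heap
  .set("spark.executor.cores", "8")
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")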
On Tue, Dec 15, 2015 at 9:07 PM, Veljko Skarich
la
> 2.10)?
>
>
> https://repository.apache.org/content/repositories/releases/org/apache/spark/spark-core_2.10/1.5.1/
>
>
>
> Xiaoyong
>
>
>
> *From:* Sean Owen [mailto:so...@cloudera.com]
> *Sent:* Saturday, December 12, 2015 12:45 AM
> *To:* Xiaoy
That's exactly what the various artifacts in the Maven repo are for. The
API classes for core are in the core artifact and so on. You don't need an
assembly.
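For example, in sbt (version chosen to match the one discussed here), you
depend on the individual artifacts and mark them provided, since the
cluster supplies them at runtime:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "1.5.1" % "provided"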
On Sat, Dec 12, 2015 at 12:32 AM, Xiaoyong Zhu
wrote:
> Yes, so our scenario is to treat the spark assembly as an
Hm, are you referencing a local file from your remote workers? That
won't work as the file only exists in one machine (I presume).
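If the data really lives on one machine, a sketch of the usual workarounds
(paths are made up): put it somewhere every node can read, e.g.

val fromHdfs  = sc.textFile("hdfs:///shared/data.txt")
// a file:// path only works if it exists at the same location on every node
val fromLocal = sc.textFile("file:///mnt/shared/data.txt")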
On Fri, Dec 11, 2015 at 5:19 PM, Lin, Hao wrote:
> Hi,
>
>
>
> I have problem accessing local file, with such example:
>
>
>
>
version of the code that does a .count for debug and
> .take(1) instead of the .isEmpty the count of one epinions RDD take 8 minutes
> and the .take(1) uses 3 minutes.
>
> Other users have seen total runtime on 13G dataset of 700 minutes with the
> execution time mostly spent
ty calc time, right? The RDD creation might be slightly
> slower due to using an accumulator.
>
>
>
>
>
> On Dec 9, 2015, at 9:29 AM, Sean Owen <so...@cloudera.com> wrote:
>
> Are you sure it's isEmpty? and not an upstream stage? isEmpty is
> definitely the action here
It should at best collect 1 item to the driver. This means evaluating
at least 1 element of 1 partition. I can imagine pathological cases
where that's slow, but, do you have any more info? how slow is slow
and what is slow?
On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel
On Wed, Dec 9, 2015 at 7:49 PM, Pat Ferrel wrote:
> The “Any” is required by the code it is being passed to, which is the
> Elasticsearch Spark index writing code. The values are actually RDD[(String,
> Map[String, String])]
(Is it frequently a big big map by any chance?)
CC Sandy as his https://github.com/cloudera/spark-timeseries might be
of use here.
On Wed, Dec 9, 2015 at 4:54 PM, Arun Verma wrote:
> Hi all,
>
> We have RDD(main) of sorted time-series data. We want to split it into
> different RDDs according to window size and then
You can't broadcast an RDD to begin with, and can't use RDDs inside
RDDs. They are really driver-side concepts.
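The usual pattern is to collect the small data set and broadcast the plain
data structure instead; a sketch (names made up):

val lookup = sc.broadcast(small.collectAsMap())   // small: a modest RDD[(String, String)]
val joined = big.map { case (k, v) => (k, v, lookup.value.get(k)) }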
Yes that's how you'd use a broadcast of anything else though, though
you need to reference ".value" on the broadcast. The 'if' is redundant
in that example, and if it's a map- or
I'm not sure if this is available in Python but from 1.3 on you should
be able to call ALS.setFinalRDDStorageLevel with level "none" to ask
it to unpersist when it is done.
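In Scala it looks roughly like this (whether the setter is exposed in
Python at your version is the open question above):

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.storage.StorageLevel

val als = new ALS()
  .setRank(10)
  .setIterations(10)
  .setFinalRDDStorageLevel(StorageLevel.NONE)   // don't keep the factor RDDs cached
val model = als.run(ratings)                    // ratings: RDD[Rating], assumed to exist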
On Mon, Dec 7, 2015 at 1:42 PM, Ewan Higgs wrote:
> Jonathan,
> Did you ever get to the bottom of
There is no way to upgrade a running cluster here. You can stop a
cluster, and simply start a new cluster in the same way you started
the original cluster. That ought to be simple; the only issue I
suppose is that you have down-time since you have to shut the whole
thing down, but maybe that's
Yes, in the sense that you can create and trigger an action on as many
RDDs created from the batch's RDD that you like.
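A sketch of what that looks like (the stream, filters and paths are all
made up):

stream.foreachRDD { rdd =>
  // derive as many RDDs from the batch as you like, each with its own action
  val errors = rdd.filter(_.contains("ERROR"))
  val warns  = rdd.filter(_.contains("WARN"))
  errors.saveAsTextFile("hdfs:///out/errors-" + System.currentTimeMillis())
  println("warnings in batch: " + warns.count())
}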
On Thu, Dec 3, 2015 at 8:04 PM, Wang Yangjun wrote:
> Hi,
>
> In storm we could do thing like:
>
> TopologyBuilder builder = new TopologyBuilder();
>
>
Same, not sure if anyone handles this particularly but I'll do it.
This should go to dev@; I think we just put a note on that wiki.
On Wed, Dec 2, 2015 at 10:53 AM, Adrien Mogenet
wrote:
> Hi folks,
>
> You're probably busy, but any update on this? :)
>
>
> On
Your reasoning is correct; you need probabilities (or at least some
score) out of the model and not just a 0/1 label in order for a ROC /
PR curve to have meaning.
But you just need to call clearThreshold() on the model to make it
return a probability.
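Roughly, assuming an MLlib logistic regression model and a test set of
LabeledPoints:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

model.clearThreshold()   // predict() now returns a score in [0,1] instead of 0/1
val scoreAndLabels = testData.map(p => (model.predict(p.features), p.label))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(metrics.areaUnderROC())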
On Tue, Nov 24, 2015 at 5:19 PM, jmvllt
Not sure who generally handles that, but I just made the edit.
On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote:
> Sorry to be a nag, I realize folks with edit rights on the Powered by Spark
> page are very busy people, but its been 10 days since my original request,
>
I think because they're not a library? They're example code, not
something you build an app on.
On Mon, Nov 16, 2015 at 9:27 AM, Jeff Zhang wrote:
> I don't find spark examples jar in maven repository after 1.1.1. Any reason
> for that ?
>
>
You included a very old version of the Twitter jar - 1.0.0. Did you mean 1.5.1?
On Mon, Nov 9, 2015 at 7:36 AM, fanooos wrote:
> This is my first Spark Stream application. The setup is as following
>
> 3 nodes running a spark cluster. One master node and two slaves.
>
>
I'm pretty sure that attribute is required. I am not sure what PMML
version the code has been written for but would assume 4.2.1. Feel
free to open a PR to add this version to all the output.
On Wed, Nov 4, 2015 at 11:42 AM, Fazlan Nazeem wrote:
> [adding dev]
>
> On Wed, Nov
I might be steering this a bit off topic: does this need the simplex
method? This is just an instance of nonnegative least squares. I don't
think it relates to LDA either.
Spark doesn't have any particular support for NNLS (right?) or simplex though.
On Mon, Nov 2, 2015 at 6:03 PM, Debasish Das
When you build the docs, you are not running Spark, though it may look
like you are. These commands are just executed in the shell.
On Tue, Oct 27, 2015 at 2:09 PM, Alex Luya wrote:
> followed this
>
> https://github.com/apache/spark/blob/master/docs/README.md
>
> to build
> problem. The configuration files are all directly copied from Spark 1.5.1.
> I feel it is a bug on Spark 1.5.1.
>
> Thanks a lot for your response.
>
> On Mon, Oct 26, 2015 at 7:21 PM Sean Owen <so...@cloudera.com> wrote:
>
>> Yeah, are these stats actually
-dev +user
How are you measuring network traffic?
It's not in general true that there will be zero network traffic, since not
all executors are local to all data. That can be the situation in many
cases but not always.
On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li wrote:
> Hi,
I,
> data is evenly distributed among 18 machines.
>
>
> On Mon, Oct 26, 2015 at 5:18 PM Sean Owen <so...@cloudera.com> wrote:
>
>> Have a look at your HDFS replication, and where the blocks are for these
>> files. For example, if you had only 2 HDFS data nodes
Yeah, are these stats actually reflecting data read locally, like through
the loopback interface? I'm also no expert on the internals here but this
may be measuring effectively local reads. Or are you sure it's not?
On Mon, Oct 26, 2015 at 11:14 AM, Steve Loughran
wrote:
c/net/dev and then take the difference of received bytes before
> and after the job. I also see a long-time peak (nearly 600Mb/s) in nload
> interface. We have 18 machines and each machine receives 4.7G bytes.
>
> On Mon, Oct 26, 2015 at 5:00 PM Sean Owen <so...@cloudera.com> wr
Did you switch the build to Scala 2.11 by running the script in dev/? It
won't work otherwise, but does work if you do. @Ted 2.11 was supported in
1.4, not just 1.5.
On Mon, Oct 26, 2015 at 2:13 PM, Bryan Jeffrey
wrote:
> All,
>
> The error resolved to a bad version of
I don't think the page suggests that gives you any of the tarballs on the
downloads page, and -Phive does not by itself do so either.
On Mon, Oct 26, 2015 at 4:58 PM, Ted Yu wrote:
> I logged SPARK-11318 with a PR.
>
> I verified that by adding -Phive the datanucleus jars
Hm, why do you say it doesn't support 2.11? It does.
It is not even this difficult; you just need a source distribution,
and then run "./dev/change-scala-version.sh 2.11" as you say. Then
build as normal
On Sun, Oct 25, 2015 at 4:00 PM, Todd Nist wrote:
> Hi Bilnmek,
>
>
No, 2.11 artifacts are in fact published:
http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-parent_2.11%22
On Sun, Oct 25, 2015 at 7:37 PM, Todd Nist wrote:
> Sorry Sean you are absolutely right, it supports 2.11; all I meant is there is
> no release available as a
This doesn't show the actual error output from Maven. I have a strong
guess that you haven't set MAVEN_OPTS to increase the memory Maven can
use.
On Fri, Oct 23, 2015 at 6:14 AM, Kayode Odeyemi wrote:
> Hi,
>
> I can't seem to get a successful maven build. Please see command
Spark asked YARN to let an executor use 7GB of memory, but it used
more, so it was killed. In each case you see that the executor memory plus
overhead equals the YARN allocation requested. What's the issue with
that?
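The arithmetic, with made-up numbers: if spark.executor.memory is 6g and
the overhead is 1g, YARN is asked for a 7g container and kills anything
that exceeds it. Either raise the overhead or lower the executor memory:

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "6g")
  .set("spark.yarn.executor.memoryOverhead", "1024")   // in MB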
On Fri, Oct 23, 2015 at 6:46 AM, JoneZhang wrote:
> Here is
Maven, in general, does some local caching to avoid hitting the repo
every time. It's possible this is why you're not seeing 1.5.1. On the
command line you can for example add "mvn -U ..." Not sure of the
equivalent in IntelliJ, but it will be updating the same repo IJ sees.
Try that. The repo
coalesce without a shuffle? It shouldn't be an action. It just treats many
partitions as one.
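For example:

val fewer      = rdd.coalesce(2)                  // no shuffle: narrow dependency, partitions just merged
val rebalanced = rdd.coalesce(2, shuffle = true)  // forces a shuffle if you want the data redistributed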
On Tue, Oct 20, 2015 at 1:00 PM, t3l wrote:
>
> I have dataset consisting of 5 binary files (each between 500kb and
> 2MB). They are stored in HDFS on a Hadoop cluster. The
Nabble is not related to the mailing list, so I think that's the problem.
spark.apache.org -> Community -> Mailing Lists:
http://spark.apache.org/community.html
These instructions from the project website itself are correct.
On Tue, Oct 20, 2015 at 5:22 PM, Jeff Sadowski
I believe it will be most efficient to let top(n) do the work, rather than
sort the whole RDD and then take the first n. The reason is that top and
takeOrdered know they need at most n elements from each partition, and then
just need to merge those. It's never required to sort the whole thing.
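For example, assuming an Ordering on the element type:

val best   = rdd.top(10)                                      // merges at most 10 per partition
val slower = rdd.sortBy(identity, ascending = false).take(10) // sorts the whole RDD first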
I
If you mean, getResult is called on the result of foo for each record, then
that already happens. If you mean getResults is called only after foo has
been called on all records, then you have to collect to a list, yes.
Why does it help with foo being slow in either case though?
You can try to
A few small-scale code style tips which might improve readability:
You can in some cases write something like _.split("...") instead of x
=> x.split("...")
A series of if-else conditions on x(15) can be replaced with "x(15)
match { case "B" => ... }"
.isEmpty may be more readable than == ""
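Put together, an illustrative before/after (the data and the meaning of
column 15 are assumptions):

val parts  = lines.map(_.split(","))   // instead of lines.map(x => x.split(","))
val labels = parts.map { x =>
  x(15) match {                        // instead of a chain of if-else on x(15)
    case "B" => "benign"
    case "M" => "malignant"
    case _   => "unknown"
  }
}
val labelled = parts.filter(x => x(15).nonEmpty)   // instead of x(15) != ""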
With
IIRC we have seen transient problems like this from Maven Central,
where files were visible to some but not others or reappeared later. I
can't get that file now either, but, try again later first. If it's
really persistent we may have to ask what's going on with Maven
Central. It's not a Spark or
you ever read the spark source code? Which step can lead to data
> dislocation?
>
>> On Oct 9, 2015, at 17:37, Sean Owen <so...@cloudera.com> wrote:
>>
>> Another guess, since you say the key is String (offline): you are not
>> cloning the value of TagsWritable. Hadoop reuses
, Sean Owen <so...@cloudera.com> wrote:
> First guess: your key class does not implement hashCode/equals
>
> On Fri, Oct 9, 2015 at 10:05 AM, Devin Huang <hos...@163.com> wrote:
>> Hi everyone,
>>
>> I got a trouble these days,and I don't know whether it is a
First guess: your key class does not implement hashCode/equals
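That is, something along these lines for a custom key class (sketch):

class MyKey(val id: String) extends Serializable {
  // without these, groupByKey can't tell that two equal keys belong together
  override def hashCode: Int = id.hashCode
  override def equals(other: Any): Boolean = other match {
    case k: MyKey => k.id == id
    case _        => false
  }
}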
On Fri, Oct 9, 2015 at 10:05 AM, Devin Huang wrote:
> Hi everyone,
>
> I got a trouble these days, and I don't know whether it is a bug of
> spark. When I use GroupByKey for our sequenceFile Data, I find that
These are true, but it's not because Spark is written in Scala; it's
because it executes in the JVM. So, Scala/Java-based apps have an
advantage in that they don't have to serialize data back and forth to
a Python process, which also brings a new set of things that can go
wrong. Python is also
I think Java's immutable collections are fine with respect to kryo --
that's not the same as Guava.
On Wed, Oct 7, 2015 at 11:56 AM, Jakub Dubovsky
wrote:
> I did not realize that Scala's and Java's immutable collections use
> different APIs, which causes this.
-dev
Is r.getInt(ind) very large in some cases? I think there's not quite
enough info here.
On Wed, Oct 7, 2015 at 6:23 PM, wrote:
> When running stand-alone cluster mode job, the process hangs up randomly
> during a DataFrame flatMap or explode operation, in
OK, next question then is: if this is wall-clock time for the whole
process, then, I wonder if you are just measuring the time taken by the
longest single task. I'd expect the time taken by the longest straggler
task to follow a distribution like this. That is, how balanced are the
partitions?
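One quick way to check the balance (sketch):

val sizes = rdd.mapPartitions(it => Iterator(it.size)).collect()
println(sizes.sorted.mkString(", "))   // a long tail here means straggler tasks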
For what it's worth, I also use this class in an app, but it happens
to be from Java code where it acts as if it's public. So no problem
for my use case, but I suppose, another small vote for the usefulness
of this class to the caller. I end up using getLatestLeaderOffsets to
figure out how to
Yes, see https://issues.apache.org/jira/browse/SPARK-4170 The reason
was kind of complicated, and the 'fix' was just to warn you against
subclassing App. Yes, use a main() method.
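In other words, a sketch of the safe shape:

object MyApp {   // rather than "object MyApp extends App"
  def main(args: Array[String]): Unit = {
    val sc = new org.apache.spark.SparkContext(
      new org.apache.spark.SparkConf().setAppName("MyApp"))
    // ... job logic here ...
    sc.stop()
  }
}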
On Tue, Oct 6, 2015 at 3:15 PM, Nick Peterson wrote:
> This might seem silly, but...
>
>
I think it's unused as the JIRA says, but removing it from the
constructors would change the API, so that's why it stays in the
signature. Removing the internal field and one usage of it seems OK,
though I don't think it would help much of anything.
On Sun, Oct 4, 2015 at 4:36 AM, Jacek Laskowski
It builds for me. That message usually really means you can't resolve
or download from a repo. It's just the last thing that happens to
fail.
On Sun, Oct 4, 2015 at 7:06 AM, Stephen Boesch wrote:
>
> For a week or two the trunk has not been building for the examples module
>
Your exception handling is occurring on the driver, where you
'configure' the job. I don't think it's what you mean to do. You
probably mean to do this within a function you are executing on data
within the cluster, like mapPartitions etc.
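That is, something like this sketch, where process() stands in for your
real per-record logic:

val results = input.mapPartitions { it =>
  it.map { record =>
    try {
      Right(process(record))                               // hypothetical helper
    } catch {
      case e: Exception => Left(record + ": " + e.getMessage)
    }
  }
}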
On Fri, Sep 25, 2015 at 5:20 AM, Samya
You can ignore all of these. Various libraries can take advantage of
native acceleration if libs are available but it's no problem if they
don't.
On Thu, Sep 24, 2015 at 3:25 AM, r7raul1...@163.com wrote:
> 1 WARN netlib.BLAS: Failed to load implementation from:
>
Might be better for another list, but I suspect it's more that HBase
is simply much more integrated with YARN, and that it's run with
other services that are as well.
On Wed, Sep 23, 2015 at 12:02 AM, Jacek Laskowski wrote:
> That sentence caught my attention. Could you
The point is that this only works if you already knew the file was
presented in order within and across partitions, which was the
original problem anyway. I don't think it is in general, but in
practice, I do imagine it's already in the expected order from
textFile. Maybe under the hood this ends
I don't know of a way to do this, out of the box, without maybe
digging into custom InputFormats. The RDD from textFile doesn't have
an ordering. I can't imagine a world in which partitions weren't
iterated in line order, of course, but there's also no real guarantee
about ordering among
Sbt asked for a bigger initial heap than the host had space for. It is a
JVM error you can and should search for first. You will need more memory.
On Mon, Sep 21, 2015, 2:11 AM Aaroncq4 <475715...@qq.com> wrote:
> When I used "sbt/sbt assembly" to compile spark code of spark-1.5.0, I got a
>
t; in my use
> case.
>
> On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I think foldByKey is much more what you want, as it has more a notion
>> of building up some result per key by encountering values serially.
>> You would take t
y is not?
>
> On Mon, Sep 21, 2015 at 8:41 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> The zero value here is None. Combining None with any row should yield
>> Some(row). After that, combining is a no-op for other rows.
>>
>> On Tue, Sep 22, 2015 at 4:2
I think foldByKey is much more what you want, as it has more a notion
of building up some result per key by encountering values serially.
You would take the first and ignore the rest. Note that "first"
depends on your RDD having an ordering to begin with, or else you rely
on however it happens to
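A sketch of that fold, wrapping values in Option so the zero value can be
None (the String value type is just an assumption):

val firstPerKey = rdd.mapValues(Option(_))
  .foldByKey(None: Option[String])((acc, v) => acc.orElse(v))
  .mapValues(_.get)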
This is a question for the CDH list. CDH 5.4 has Spark 1.3, and 5.5
has 1.5. The best thing is to update CDH as a whole if you can.
However it's pretty simple to just run a newer Spark assembly as a
YARN app. Don't remove anything in the CDH installation. Try
downloading the assembly and
I don't think this is Spark-specific. Mostly you need to escape /
quote user-supplied values as with any SQL engine.
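One way to avoid building SQL strings out of user input at all is to push
the value through the DataFrame API as data rather than SQL text; a sketch
(userInput is hypothetical):

import org.apache.spark.sql.functions.{col, lit}

// instead of: sqlContext.sql("SELECT * FROM users WHERE name = '" + userInput + "'")
val result = sqlContext.table("users").filter(col("name") === lit(userInput))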
On Thu, Sep 10, 2015 at 7:32 AM, V Dineshkumar
wrote:
> Hi,
>
> What is the preferred way of avoiding SQL Injection while using Spark SQL?
> In our
rg/apache/spark/mllib/classification/NaiveBayes.scala
>
> // Adamantios
>
>
>
> On Thu, Sep 10, 2015 at 7:03 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> The log probabilities are unlikely to be very large, though the
>> probabilities may be very
i + brzTheta * testData.toBreeze is a big number
> too), how can I get back the non-log-probabilities which - apparently - are
> bounded between 0.0 and 1.0?
>
>
> // Adamantios
>
>
>
> On Tue, Sep 1, 2015 at 12:57 PM, Sean Owen <so...@cloudera.com> wrote
-Dtest=none ?
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests
On Thu, Sep 10, 2015 at 10:39 PM, Stephen Boesch wrote:
>
> I have invoked mvn test with the -DwildcardSuites option to specify a single
>
It shows you there that Maven is out of memory. Give it more heap. I use 3gb.
On Tue, Sep 8, 2015 at 1:53 PM, Benjamin Zaitlen wrote:
> Hi All,
>
> I'm trying to build a distribution off of the latest in master and I keep
> getting errors on MQTT and the build fails. I'm
Ah, right. Should've caught that.
>
> The docs seem to recommend 2gb. Should that be increased as well?
>
> --Ben
>
> On Tue, Sep 8, 2015 at 9:33 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> It shows you there that Maven is out of memory. Give it more heap. I use
Why would Scala vs Java performance be different Ted? Relatively
speaking there is almost no runtime difference; it's the same APIs or
calls via a thin wrapper. Scala/Java vs Python is a different story.
Java libraries can be used in Scala. Vice-versa too, though calling
Scala-generated classes
run Maven with the
>>>> -e switch.
>>>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>>>> [ERROR]
>>>> [ERROR] For more information about the errors and possible solutions,
>>>> please read the following articles:
I think somewhere along the line you've not specified your label
column -- it's defaulting to "label" and it does not recognize it, or
at least not as a binary or nominal attribute.
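For example, in the spark.ml API (column names are illustrative):

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setLabelCol("isFraud")        // defaults to "label" if you don't set it
  .setFeaturesCol("features")
val lrModel = lr.fit(trainingDF) // trainingDF assumed to contain those columns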
On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole wrote:
> Hi, Experts,
>
> I followed the guide
iwC$$iwC.(:62)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:66)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:68)
> at $iwC$$iwC$$iwC$$iwC.(:70)
> at $iwC$$iwC$$iwC.(:72)
> at $iwC$$iwC.(:74)
> at
That means the save has not finished yet. Are you sure it did? It
writes to _temporary while it's in progress.
On Fri, Sep 4, 2015 at 10:10 AM, Chirag Dewan wrote:
> Hi,
>
>
>
> I have a 2 node Spark cluster and I am trying to read data from a Cassandra
> cluster and
nstances, boostingStrategy)
>
> Thanks,
> Peter Rudenko
>
> On 2015-08-14 00:33, Sean Owen wrote:
>
> Not that I have any answer at this point, but I was discussing this
> exact same problem with Johannes today. An input size of ~20K records
> was growing each iteration
I think the story is that the new spark.ml "pipelines" API is the
future. Most (all?) of the spark.mllib functionality has been ported
over and/or translated. I don't know that spark.mllib will actually be
deprecated soon -- not until spark.ml is fully blessed as 'stable' I'd
imagine, at least.
(pedantic: it's the log-probabilities)
On Tue, Sep 1, 2015 at 10:48 AM, Yanbo Liang wrote:
> Actually
> brzPi + brzTheta * testData.toBreeze
> is the probabilities of the input Vector on each class, however it's a
> Breeze Vector.
> Pay attention the index of this Vector need
ouble)))
> sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>
> -Ashish
>
>
> On Sun, Aug 30, 2015 at 2:52 PM Sean Owen <so...@cloudera.com> wrote:
>>
>> I'm not sure how to reproduce it? this code does not produce an error in
>> master.
>>
>> On
I don't think it's an 'issue'; the build does in fact require Maven
3.3.3 and Java 7 on purpose. If you don't have Maven 3.3.3, run
'build/mvn' or ('build/mvn --force' if you have an older Maven
locally) to get 3.3.3.
On Mon, Aug 31, 2015 at 6:27 AM, Kevin Jung wrote:
> I
; to do with Spark per se. Or, have a look at others related to the
>> closure and shell and you may find this is related to other known
>> behavior.
>>
>>
>> On Sun, Aug 30, 2015 at 8:08 PM, Ashish Shrowty
>> <ashish.shro...@gmail.com> wrote:
>> &
That can't cause any error, since there is no action in your first
snippet. Even calling count on the result doesn't cause an error. You
must be executing something different.
On Sun, Aug 30, 2015 at 4:21 AM, ashrowty ashish.shro...@gmail.com wrote:
I am running the Spark shell (1.2.1) in local
did.
Thanks
On Sun, Aug 30, 2015 at 12:24 AM, Sean Owen so...@cloudera.com
wrote:
That can't cause any error, since there is no action in your first
snippet. Even calling count on the result doesn't cause an error. You
must be executing something different.
On Sun, Aug 30, 2015 at 4:21
Yes, for example val sensorRDD = rdd.map(Sensor.parseSensor) is a
line of code executed on the driver; it's part of the function you
supplied to foreachRDD. However that line defines an operation on an
RDD, and the map function you supplied (parseSensor) will ultimately
be carried out on the cluster.
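In other words, roughly (the stream and output path are made up;
Sensor.parseSensor is from the original thread):

dstream.foreachRDD { rdd =>
  val sensorRDD = rdd.map(Sensor.parseSensor)     // runs on the driver: only builds the lineage
  sensorRDD.saveAsTextFile(                       // the action; parseSensor now runs on executors
    "hdfs:///sensors/out-" + System.currentTimeMillis())
}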
Hm, if anything that would be related to my change at
https://issues.apache.org/jira/browse/SPARK-9613. Let me investigate.
There's some chance there is some Scala 2.11-specific problem here.
On Thu, Aug 27, 2015 at 8:51 AM, Jacek Laskowski ja...@japila.pl wrote:
Hi,
Is this a known issue of
I don't see that you invoke any action in this code. It won't do
anything unless you tell it to perform an action that requires the
transformations.
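In RDD terms, a sketch of the difference:

val counts = pairs.reduceByKey(_ + _)         // a transformation: nothing runs yet
counts.saveAsTextFile("hdfs:///out/counts")   // an action like this (or count/collect) triggers it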
On Wed, Aug 26, 2015 at 7:05 AM, Deepesh Maheshwari
deepesh.maheshwar...@gmail.com wrote:
Hi,
I have applied mapToPair and then a reduceByKey on a
Yes, you're right that it's quite on purpose to leave this API to
Breeze, in the main. As you can see the Spark objects have already
sprouted a few basic operations anyway; there's a slippery slope
problem here. Why not addition, why not dot products, why not
determinants, etc.
What about
Yes I get all that too and I think there's a legit question about
whether moving a little further down the slippery slope is worth it
and if so how far. The other catch here is: either you completely
mimic another API (in which case why not just use it directly, which
has its own problems) or you
:
Thanks Sean.
So how PySpark is supported. I thought PySpark needs jdk 1.6.
Chen
On Fri, Aug 21, 2015 at 11:16 AM, Sean Owen so...@cloudera.com wrote:
Spark 1.4 requires Java 7.
On Fri, Aug 21, 2015, 3:12 PM Chen Song chen.song...@gmail.com wrote:
I tried to build Spark