Sean Owen so...@cloudera.com wrote:
Spark 1.4 requires Java 7.
On Fri, Aug 21, 2015, 3:12 PM Chen Song chen.song...@gmail.com wrote:
I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support
PySpark, I used JDK 1.6.
I got the following error,
[INFO] --- scala-maven-plugin:3.2.0
The property scala-2.11 triggers the profile scala-2.11 -- and
additionally disables the scala-2.10 profile, so that's the way to do
it. But yes, you also need to run the script beforehand to set up the
build for Scala 2.11 as well.
On Mon, Aug 24, 2015 at 8:48 PM, Lanny Ripple
No. The third line creates a third RDD whose reference simply replaces
the reference to the first RDD in your local driver program. The first
RDD still exists.
On Thu, Aug 20, 2015 at 2:15 PM, Bahubali Jain bahub...@gmail.com wrote:
Hi,
How would the DAG look like for the below code
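The poster's code isn't reproduced here, but a rough sketch of the point (names and values below are made up, not from the thread): reassigning the local variable only changes which RDD the driver refers to; the earlier RDDs still exist as parents in the new RDD's lineage.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LineageSketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("lineage-sketch").setMaster("local[2]"));

    JavaRDD<Integer> rdd = jsc.parallelize(Arrays.asList(1, 2, 3, 4));  // RDD #1
    rdd = rdd.map(x -> x + 1);            // RDD #2; the variable now points at it
    rdd = rdd.filter(x -> x % 2 == 0);    // RDD #3; #1 and #2 still exist as its parents

    // The DAG is #1 -> #2 -> #3; only the driver-side reference was reassigned.
    System.out.println(rdd.toDebugString());
    jsc.stop();
  }
}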
Yes, it should Just Work. Lambdas can be used for any method that
takes an instance of an interface with one method, and that describes
Function, PairFunction, etc.
On Tue, Aug 18, 2015 at 3:23 PM, Kristoffer Sjögren sto...@gmail.com wrote:
Hi
Is there a way to execute spark jobs with Java 8
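A minimal sketch of what that looks like in practice (data and names are illustrative): a lambda can stand in for Function, PairFunction, and the other one-method interfaces of the Java API.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LambdaSketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("lambda-sketch").setMaster("local[2]"));
    JavaRDD<String> lines = jsc.parallelize(Arrays.asList("a b", "b c", "c"));

    // Lambda in place of an anonymous Function<String, Integer>
    JavaRDD<Integer> lengths = lines.map(s -> s.length());

    // Lambda in place of an anonymous PairFunction<String, String, Integer>
    JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2<>(s, s.length()));

    System.out.println(lengths.collect() + " " + pairs.collect());
    jsc.stop();
  }
}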
Not that I have any answer at this point, but I was discussing this
exact same problem with Johannes today. An input size of ~20K records
was growing each iteration by ~15M records. I could not see why on a
first look.
@jkbradley I know it's not much info but does that ring any bells? I
think
The difference is really that Java and Scala work differently. In
Java, your anonymous subclass of Ops defined in (a method of)
AbstractTest captures a reference to it. That much is 'correct' in
that it's how Java is supposed to work, and AbstractTest is indeed not
serializable since you didn't
You can ignore it entirely. It just means you haven't installed and
configured native libraries for things like accelerated compression,
but it has no negative impact otherwise.
On Tue, Aug 4, 2015 at 8:11 AM, Deepesh Maheshwari
deepesh.maheshwar...@gmail.com wrote:
Hi,
When I run the spark
deepesh.maheshwar...@gmail.com wrote:
Can you elaborate on what this native library covers?
One thing you mentioned is accelerated compression.
It would be very helpful if you could give a useful link to read more
about it.
On Tue, Aug 4, 2015 at 12:56 PM, Sean Owen so...@cloudera.com wrote:
Think it may be needed on Windows, certainly if you start trying to work with
local files.
On 4 Aug 2015, at 00:34, Sean Owen so...@cloudera.com wrote:
It won't affect you if you're not actually running Hadoop. But it's
mainly things like Snappy/LZO compression which are implemented
If you've set the checkpoint dir, it seems like indeed the intent is
to use a default checkpoint interval in DStream:
private[streaming] def initialize(time: Time) {
...
// Set the checkpoint interval to be slideDuration or 10 seconds,
which ever is larger
if (mustCheckpoint
Yes, I think this was asked because you didn't say what flags you set
before, and it's worth verifying they're the correct ones.
Although I'd be kind of surprised if 512m isn't enough, did you try more?
You could also try -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled
Also verify
From: Sean Owen so...@cloudera.com
To: Proust GZ Feng/China/IBM@IBMCN
Cc: user user@spark.apache.org
Date: 07/28/2015 02:20 PM
Subject: Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0
It wasn't removed, but rewritten. Cygwin is just a distribution of
POSIX-related utilities so you should be able to use the normal .sh
scripts. In any event, you didn't say what the problem is?
On Tue, Jul 28, 2015 at 5:19 AM, Proust GZ Feng pf...@cn.ibm.com wrote:
Hi, Spark Users
Looks like
That's for the Windows interpreter rather than bash-running Cygwin. I
don't know that it's worth doing a lot of legwork for Cygwin, but, if it's
really just a few lines of classpath translation in one script, it seems
reasonable.
On Tue, Jul 28, 2015 at 9:13 PM, Steve Loughran ste...@hortonworks.com
/maven/com.github.fommil/jniloader/pom.properties
Thanks,
Arun
On Fri, Jul 17, 2015 at 1:30 PM, Sean Owen so...@cloudera.com wrote:
Make sure /usr/lib64 contains libgfortran.so.3; that's really the issue.
I'm pretty sure the answer is 'yes', but, make sure the assembly has
jniloader
Yes, just have a look at the method in the source code. It calls new
ALS().run(). It's a convenience wrapper only.
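A sketch of the two forms side by side (the rank/iterations/lambda values are placeholders, not from the thread):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class AlsFormsSketch {
  static void trainBothWays(JavaRDD<Rating> ratings) {
    // The static convenience method...
    MatrixFactorizationModel viaTrain = ALS.train(ratings.rdd(), 10, 10, 0.01);

    // ...is roughly a wrapper around configuring and running an ALS instance:
    MatrixFactorizationModel viaRun = new ALS()
        .setRank(10)
        .setIterations(10)
        .setLambda(0.01)
        .run(ratings.rdd());
  }
}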
On Fri, Jul 17, 2015 at 4:59 PM, Carol McDonald cmcdon...@maprtech.com wrote:
the new ALS()...run() form is underneath both of the first two.
I am not sure what you mean by
com.github.fommil.netlib.NativeSystemLAPACK
15/07/17 13:20:53 WARN LAPACK: Failed to load implementation from:
com.github.fommil.netlib.NativeRefLAPACK
Does anything need to be adjusted in my application POM?
Thanks,
Arun
On Thu, Jul 16, 2015 at 5:26 PM, Sean Owen so...@cloudera.com wrote:
Yes, that's
See also https://issues.apache.org/jira/browse/SPARK-8385
(apologies if someone already mentioned that -- just saw this thread)
On Thu, Jul 16, 2015 at 7:19 PM, Jerrick Hoang jerrickho...@gmail.com wrote:
So, this has to do with the fact that 1.4 has a new way to interact with
HiveMetastore,
Yes, that's most of the work, just getting the native libs into the
assembly. netlib can find them from there even if you don't have BLAS
libs on your OS, since it includes a reference implementation as a
fallback.
One common reason it won't load is not having libgfortran installed on
your OSes
The first two examples are from the .mllib API. Really, the new
ALS()...run() form is underneath both of the first two. In the second
case, you're calling a convenience method that calls something similar
to the first example.
The second example is from the new .ml pipelines API. Similar ideas,
Is the data set synthetic, or does it have very few items? Or is it indeed
very sparse? Those could be reasons. However, usually this kind of thing
happens with very small data sets. I could be wrong about what's going
on, but it's a decent guess at the immediate cause given the error
messages.
On Mon, Jul
I interpret this to mean that the input to the Cholesky decomposition
wasn't positive definite. I think this can happen if the input matrix
is singular or very near singular -- maybe, very little data? Ben, that
might at least address why this is happening; different input may work
fine.
Xiangrui
Yeah, it won't technically be supported, and you shouldn't go
modifying the actual installation, but if you just make your own build
of 1.4 for CDH 5.4 and use that build to launch YARN-based apps, I
imagine it will Just Work for most any use case.
On Sun, Jul 12, 2015 at 7:34 PM, Ruslan
In general, R2 < 0 means the line that was fit is a very poor fit -- the
mean would give a smaller squared error. But it can also mean you are
applying R2 where it doesn't apply. Here, you're not performing a
linear regression; why are you using R2?
On Sun, Jul 12, 2015 at 4:22 PM, afarahat
These are quite different operations. One operates on RDDs in DStream and
one operates on partitions of an RDD. They are not alternatives.
On Wed, Jul 8, 2015, 2:43 PM dgoldenberg dgoldenberg...@gmail.com wrote:
Is there a set of best practices for when to use foreachPartition vs.
foreachRDD?
into a socket. Let's say I have one
socket per client of my streaming app and I get a host:port of that socket
as part of the message and want to send the response via that socket. Is
foreachPartition still a better choice?
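A rough sketch of how the two compose in the Java streaming API (the host/port handling is hypothetical; a real app might take them from the message itself, as described above):

import java.io.PrintWriter;
import java.net.Socket;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;

public class ForeachSketch {
  static void writeOut(JavaDStream<String> lines, String host, int port) {
    // foreachRDD runs once per micro-batch; foreachPartition once per partition of that batch's RDD.
    lines.foreachRDD((JavaRDD<String> rdd) -> {
      rdd.foreachPartition(records -> {
        // One connection per partition, reused for all of its records.
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
          while (records.hasNext()) {
            out.println(records.next());
          }
        }
      });
      return null; // the Spark 1.x Java API expects a Function<JavaRDD<T>, Void> here
    });
  }
}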
On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen so...@cloudera.com wrote
Usually this message means that the test was starting some process
like a Spark master and it didn't ever start. The eventual error is a
timeout. You have to try to dig into the test and logs to catch the
real reason.
On Sun, Jul 5, 2015 at 9:23 PM, SamRoberts samueli.robe...@yahoo.com wrote:
Yes, Spark Core depends on Hadoop libs, and there is this unfortunate
twist on Windows. You'll still need HADOOP_HOME set appropriately
since Hadoop needs some special binaries to work on Windows.
On Fri, Jun 26, 2015 at 11:06 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You just need to set
#2 is not a bug. Have a search through JIRA. It is merely unformalized. I
think that is how (one of?) the original PageRank papers does it.
On Thu, Jun 25, 2015, 7:39 AM Kelly, Terence P (HP Labs Researcher)
terence.p.ke...@hp.com wrote:
Hi,
Colleagues and I have found that the PageRank
with the
command
[ERROR] mvn <goals> -rf :spark-sql_2.10
Ahh..ok, so it's Hive 1.1 and Spark 1.4. Even using the standard Hive 0.13
version, I still get the above error. Granted (it's CDH's Hadoop JARs, and
Apache's Hive).
On Wed, Jun 24, 2015 at 9:30 PM, Sean Owen so...@cloudera.com wrote
-dev +user
That all sounds fine except are you packaging Spark classes with your
app? that's the bit I'm wondering about. You would mark it as a
'provided' dependency in Maven.
On Thu, Jun 25, 2015 at 5:12 AM, jimfcarroll jimfcarr...@gmail.com wrote:
Hi Sean,
I'm running a Mesos cluster. My
No, or at least, it depends on how the source of the partitions was implemented.
On Thu, Jun 25, 2015 at 12:16 PM, Shushant Arora
shushantaror...@gmail.com wrote:
Does mapPartitions keep complete partitions in the memory of the executor
as an iterable?
JavaRDD<String> rdd = jsc.textFile(path);
On Wed, Jun 24, 2015 at 12:02 PM, Nick Pentreath
nick.pentre...@gmail.com wrote:
Oryx does almost the same but Oryx1 kept all user and item vectors in memory
(though I am not sure about whether Oryx2 still stores all user and item
vectors in memory or partitions in some way).
(Yes, this is a
You didn't provide any error?
You're compiling vs Hive 1.1 here and that is the problem. It is nothing to
do with CDH.
On Wed, Jun 24, 2015, 10:15 PM Aaron aarongm...@gmail.com wrote:
I was curious if any one was able to get CDH 5.4.1 or 5.4.2 compiling with
the v1.4.0 tag out of git?
Yes, and typically needs are < 100ms. Now imagine even 10 concurrent
requests. My experience has been that this approach won't nearly
scale. The best you could probably do is async mini-batch
near-real-time scoring, pushing results to some store for retrieval,
which could be entirely suitable for
Out of curiosity why netty?
What model are you serving?
Velox doesn't look like it is optimized for cases like ALS recs, if that's
what you mean. I think scoring ALS at scale in real time takes a fairly
different approach.
The servlet engine probably doesn't matter at all in comparison.
On Sat,
I see the same thing in an app that uses Jackson 2.5. Downgrading to
2.4 made it work. I meant to go back and figure out if there's
something that can be done to work around this in Spark or elsewhere,
but for now, harmonize your Jackson version at 2.4.x if you can.
On Fri, Jun 12, 2015 at 4:20
You don't add dependencies to your app -- you mark Spark as 'provided'
in the build and you rely on the deployed Spark environment to provide
it.
On Fri, Jun 12, 2015 at 7:14 PM, Elkhan Dadashov elkhan8...@gmail.com wrote:
Hi all,
We want to integrate Spark in our Java application using the
Guess: it has something to do with the Text object being reused by Hadoop?
You can't in general keep around refs to them since they change. So you may
have a bunch of copies of one object at the end that become just one in
each partition.
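A sketch of the usual workaround (path and key/value types are illustrative): convert or copy the reused Writable immediately instead of keeping a reference to it.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TextReuseSketch {
  static JavaRDD<String> readKeys(JavaSparkContext jsc, String path) {
    // Hadoop may hand back the same Text/IntWritable instances for every record,
    // so collecting the objects themselves can leave many references to one value.
    JavaPairRDD<Text, IntWritable> raw = jsc.sequenceFile(path, Text.class, IntWritable.class);
    return raw.map(kv -> kv._1().toString());   // toString() copies the bytes out immediately
  }
}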
On Thu, Jun 11, 2015, 8:36 PM Crystal Xing
No, but you can write a couple lines of code that do this. It's not
optimized of course. This is actually a long and interesting side
discussion, but I'm not sure how much it could be given that the
computation is pull rather than push; there is no concept of one
pass over the data resulting in
In the sense here, Spark actually does have operations that make multiple
RDDs like randomSplit. However there is not an equivalent of the partition
operation which gives the elements that matched and did not match at once.
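A sketch of doing it by hand with two filters over a cached RDD (the predicate is illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SplitSketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("split-sketch").setMaster("local[2]"));
    JavaRDD<Integer> nums = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6)).cache();

    // Two passes, one per predicate; there is no single operation that returns both halves at once.
    JavaRDD<Integer> matched    = nums.filter(x -> x % 2 == 0);
    JavaRDD<Integer> notMatched = nums.filter(x -> x % 2 != 0);

    System.out.println(matched.collect() + " " + notMatched.collect());
    jsc.stop();
  }
}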
On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang zjf...@gmail.com wrote:
As far
Yes, I think you're right. Since this is a change to the ASF hosted
site, I can make this change to the .md / .html directly rather than
go through the usual PR.
On Wed, Jun 3, 2015 at 6:23 PM, linkstar350 . tweicomepan...@gmail.com wrote:
Hi, I'm Taira.
I notice that this example page may be
; does that seem correct?
Thanks!
On Wed, May 20, 2015 at 1:52 PM Sean Owen so...@cloudera.com wrote:
I don't think any of those problems are related to Hadoop. Have you
looked at userClassPathFirst settings?
On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson ejsa...@gmail.com
wrote:
Hi Sean
).
Then I need to get a small random sample of Document objects (e.g. 10,000
documents). How can I do this quickly? The rdd.sample() method does not help
because it needs to read the entire RDD of 7 million Documents from disk, which
takes a very long time.
Ningjun
From: Sean Owen [mailto:so
If sampling whole partitions is sufficient (or a part of a partition),
sure you could mapPartitionsWithIndex and decide whether to process a
partition at all based on its # and skip the rest. That's much faster.
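A sketch of that whole-partition sampling (the element type is illustrative; in practice you'd choose the set of partition indices to keep at random up front):

import java.util.Collections;
import java.util.Iterator;
import java.util.Set;
import org.apache.spark.api.java.JavaRDD;

public class PartitionSampleSketch {
  // Keep only the partitions whose index is in 'keep'; skip the rest entirely.
  // Pass a serializable Set (e.g. a HashSet) so it ships cleanly with the closure.
  static JavaRDD<String> samplePartitions(JavaRDD<String> docs, Set<Integer> keep) {
    return docs.mapPartitionsWithIndex(
        (Integer index, Iterator<String> it) ->
            keep.contains(index) ? it : Collections.<String>emptyIterator(),
        false);
  }
}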
On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com
I don't think that's quite the difference. Any SQL engine has a query
planner and an execution engine. Both of these use Spark for execution. HoS
uses Hive for query planning. Although it's not optimized for execution on
Spark per se, it's got a lot of language support and is stable/mature.
Spark
Yes, the published artifacts can only refer to one version of anything
(OK, modulo publishing a large number of variants under classifiers).
You aren't intended to rely on Spark's transitive dependencies for
anything. Compiling against the Spark API has no relation to what
version of Hadoop it
.
More anon,
Cheers,
Edward
Original Message
Subject: Re: spark 1.3.1 jars in repo1.maven.org Date: 2015-05-20 00:38
From: Sean Owen so...@cloudera.com To: Edward Sargisson
esa...@pobox.com Cc: user user@spark.apache.org
Yes, the published artifacts can only refer
I don't think you should rely on a shutdown hook. Ideally you try to
stop it in the main exit path of your program, even in case of an
exception.
On Tue, May 19, 2015 at 7:59 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
You mean to say within
The way these files are accessed is inherently sequential-access. There
isn't, in general, a way to know where record N is in a file like this and
jump to it. So they must be read to be sampled.
On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Hi
(I made you a Contributor in JIRA -- your yahoo-related account of the
two -- so maybe that will let you do so.)
On Fri, May 15, 2015 at 4:19 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote:
Hi, two questions
1. Can regular JIRA users reopen bugs -- I can open a new issue but it does
not
This change will be merged shortly for Spark 1.4, and has a minor
implication for those creating their own Spark builds:
https://issues.apache.org/jira/browse/SPARK-7249
https://github.com/apache/spark/pull/5786
The default Hadoop dependency has actually been Hadoop 2.2 for some
time, but the
\
affected_hosts.py
Now we're seeing data from the stream. Thanks again!
On Mon, May 11, 2015 at 2:43 PM Sean Owen so...@cloudera.com wrote:
Ah yes, the Kafka + streaming code isn't in the assembly, is it? you'd
have to provide it and all its dependencies with your app. You could
also build
2015 com/yammer/metrics/core/Gauge.class
On Tue, May 12, 2015 at 8:05 AM, Sean Owen so...@cloudera.com wrote:
It doesn't depend directly on yammer metrics; Kafka does. It wouldn't
be correct to declare that it does; it is already in the assembly
anyway.
On Tue, May 12, 2015 at 3:50 PM, Ted
, 2015, 1:11 AM Sean Owen so...@cloudera.com wrote:
The question is really whether all the third-party integrations should
be built into Spark's main assembly. I think reasonable people could
disagree, but I think the current state (not built in) is reasonable.
It means you have to bring
executor with a
thread pool of N threads doing the same task.
The performance I'm seeing of running the Kafka-Spark Streaming job is 7
times slower than that of the utility. What's pulling Spark back?
Thanks.
On Mon, May 11, 2015 at 4:55 PM, Sean Owen so...@cloudera.com wrote:
You have one
That is mostly the YARN overhead. You're starting up a container for the AM
and executors, at least. That still sounds pretty slow, but the defaults
aren't tuned for fast startup.
On May 11, 2015 7:00 PM, Su She suhsheka...@gmail.com wrote:
Got it to work on the cluster by changing the master to
You have one worker with one executor with 32 execution slots.
On Mon, May 11, 2015 at 9:52 PM, dgoldenberg dgoldenberg...@gmail.com wrote:
Hi,
Is there anything special one must do, running locally and submitting a job
like so:
spark-submit \
--class com.myco.Driver \
Ah yes, the Kafka + streaming code isn't in the assembly, is it? you'd
have to provide it and all its dependencies with your app. You could
also build this into your own app jar. Tools like Maven will add in
the transitive dependencies.
On Mon, May 11, 2015 at 10:04 PM, Lee McFadden
Yes, at this point I believe you'll find jblas used for historical reasons,
to not change some APIs. I don't believe it's used for much if any
computation in 1.4.
On May 8, 2015 5:04 PM, John Niekrasz john.niekr...@gmail.com wrote:
Newbie question...
Can I use any of the main ML capabilities
You're referring to a comment in the generic utility method, not the
specific calls to it. The comment just says that the generic method
doesn't mark the directory for deletion. Individual uses of it might
need to.
One or more of these might be delete-able on exit, but in any event
it's just a
See https://issues.apache.org/jira/browse/SPARK-5492 but I think
you'll need to share the stack trace as I'm not sure how this can
happen since the NoSuchMethodError (not NoSuchMethodException)
indicates a call in the bytecode failed to link but there is only a
call by reflection.
On Fri, May 1,
Yes, there is now such a profile, though it is essentially redundant and
doesn't configure things differently from 2.4, besides the Hadoop version of
course. Which is why it hadn't existed before, since the 2.4 profile is for 2.4+.
People just kept filing bugs to add it, but the docs are correct: you don't
You fundamentally want (half of) the Cartesian product so I don't think it
gets a lot faster to form this. You could implement this on cogroup
directly and maybe avoid forming the tuples you will filter out.
I'd think more about whether you really need to do this thing, or whether
there is
Please use user@, not dev@
This message does not appear to be from your driver. It also doesn't say
you ran out of memory. It says you didn't tell YARN to let it use the
memory you want. Look at the memory overhead param and please search first
for related discussions.
On Apr 29, 2015 11:43 AM,
be related to this
https://issues.apache.org/jira/browse/SPARK-5967 defect that was
resolved in Spark 1.2.2 and 1.3.0.
It also was a HashMap causing the issue.
-Conor
On Wed, Apr 29, 2015 at 12:01 PM, Sean Owen so...@cloudera.com wrote:
Please use user@, not dev@
This message does not appear
Works fine for me. Make sure you're not downloading the HTML
redirector page and thinking it's the archive.
On Mon, Apr 27, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I downloaded 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple mirrors
and direct link. Each time i untar i
Yes, I think this is a known issue, that sortByKey actually runs a job
to assess the distribution of the data.
https://issues.apache.org/jira/browse/SPARK-1021 I think further eyes
on it would be welcome as it's not desirable.
On Fri, Apr 24, 2015 at 9:57 AM, Spico Florin spicoflo...@gmail.com
The sum? You just need to use an accumulator to sum the counts or something.
On Fri, Apr 24, 2015 at 2:14 PM, Sergio Jiménez Barrio
drarse.a...@gmail.com wrote:
Sorry for my explanation; my English is bad. I just need to obtain the Long
contained in the DStream created by messages.count().
No, it prints each Long in that stream, forever. Have a look at the DStream API.
On Fri, Apr 24, 2015 at 2:24 PM, Sergio Jiménez Barrio
drarse.a...@gmail.com wrote:
But if I use messages.count().print(), this shows a single number :/
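A sketch of pulling the value out of that stream (an accumulator as suggested above would also work; this version keeps a driver-side running total, since the body of foreachRDD runs in the driver):

import java.util.concurrent.atomic.AtomicLong;
import org.apache.spark.streaming.api.java.JavaDStream;

public class CountSketch {
  static void trackCounts(JavaDStream<String> messages) {
    final AtomicLong total = new AtomicLong(0);
    // messages.count() yields a DStream with a single Long per batch, forever.
    messages.count().foreachRDD(rdd -> {
      long batchCount = rdd.first();   // the one element of this batch's RDD
      System.out.println("batch=" + batchCount + " running total=" + total.addAndGet(batchCount));
      return null;                     // the Spark 1.x Java API expects Function<JavaRDD<Long>, Void>
    });
  }
}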
The order of elements in an RDD is in general not guaranteed unless
you sort. You shouldn't expect to encounter the partitions of an RDD
in any particular order.
In practice, you probably find the partitions come up in the order
Hadoop presents them in this case. And within a partition, in this
The standard incantation -- which is a little different from standard
Maven practice -- is:
mvn -DskipTests [your options] clean package
mvn [your options] test
Some tests require the assembly, so you have to do it this way.
I don't know what the test failures were, you didn't post them, but
foreachRDD is an action and doesn't return anything. It seems like you
want one final count, but that's not possible with a stream, since
there is conceptually no end to a stream of data. You can get a stream
of counts, which is what you have already. You can sum those counts in
another data
Where are the file splits? Meaning, is it possible they were also
(only) available on one node, and that was also your driver?
On Thu, Apr 23, 2015 at 1:21 PM, Pat Ferrel p...@occamsmachete.com wrote:
Sure
var columns = mc.textFile(source).map { line => line.split(delimiter) }
Here “source”
Following several discussions about how to improve the contribution
process in Spark, I've overhauled the guide to contributing. Anyone
who is going to contribute needs to read it, as it has more formal
guidance about the process:
Not that I've tried it, but why couldn't you use one ZK server? I
don't see a reason.
On Wed, Apr 22, 2015 at 7:40 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
It isn't mentioned anywhere in the doc, but you will probably need separate
ZK for each of your HA cluster.
Thanks
Best Regards
I think maybe you need more partitions in your input, which might make
for smaller tasks?
On Tue, Apr 21, 2015 at 2:56 AM, Christian S. Perone
christian.per...@gmail.com wrote:
I keep seeing these warnings when using trainImplicit:
WARN TaskSetManager: Stage 246 contains a task of very large
What machines are HDFS data nodes -- just your master? That would
explain it. Otherwise, is it actually the write that's slow, or is
something else you're doing much faster on the master for other
reasons maybe? Like you're actually shipping data via the master first
in some local computation? so
You need to access the underlying RDD with .rdd() and cast that. That
works for me.
On Mon, Apr 20, 2015 at 4:41 AM, RimBerry
truonghoanglinhk55b...@gmail.com wrote:
Hi everyone,
i am trying to use the direct approach in streaming-kafka-integration
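For the Kafka direct stream, that looks roughly like the following (assumes spark-streaming-kafka is on the classpath; the stream variable is illustrative):

import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

public class OffsetSketch {
  static void logOffsets(JavaPairInputDStream<String, String> directStream) {
    directStream.foreachRDD(rdd -> {
      // The Java wrapper hides the underlying KafkaRDD; unwrap with rdd() and cast.
      OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
      for (OffsetRange r : ranges) {
        System.out.println(r.topic() + "-" + r.partition() + ": "
            + r.fromOffset() + " -> " + r.untilOffset());
      }
      return null; // Spark 1.x foreachRDD expects a Function returning Void
    });
  }
}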
Brahma, since you can see the continuous integration builds are
passing, it's got to be something specific to your environment, right?
This is not even an error from Spark, but from Maven plugins.
On Mon, Apr 20, 2015 at 4:42 AM, Ted Yu yuzhih...@gmail.com wrote:
bq. -Dhadoop.version=V100R001C00
Do these datetime objects implement the notion of equality you'd
expect? (This may be a dumb question; I'm thinking of the equivalent
of equals() / hashCode() from the Java world.)
On Sat, Apr 18, 2015 at 4:17 PM, SecondDatke
lovejay-lovemu...@outlook.com wrote:
I'm trying to solve a
Doesn't this reduce to "Scala isn't compatible with itself across
maintenance releases"? Meaning, if this were fixed then Scala
2.11.{x < 6} would have similar failures. It's not not-ready; it's
just not the Scala 2.11.6 REPL. Still, sure, I'd favor breaking the
unofficial support to at least make the
This is the fraction available for caching, which is 60% * 90% * total
by default.
On Fri, Apr 17, 2015 at 11:30 AM, podioss grega...@hotmail.com wrote:
Hi,
I am a bit confused with the executor-memory option. I am running
applications with the Standalone cluster manager with 8 workers with 4gb
Spark against 2.11.2
and still saw the problems with the REPL. I've created a bug report:
https://issues.apache.org/jira/browse/SPARK-6989
I hope this helps.
Cheers,
Michael
On Apr 17, 2015, at 1:41 AM, Sean Owen so...@cloudera.com wrote:
Doesn't this reduce to Scala isn't compatible
This would be much, much faster if your set of IDs was simply a Set,
and you passed that to a filter() call that just filtered in the docs
that matched an ID in the set.
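A sketch of that approach (the Document class and its id field are hypothetical stand-ins for the poster's types):

import java.io.Serializable;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.api.java.JavaRDD;

public class FilterByIdSketch {
  // Hypothetical stand-in for the poster's Document class.
  public static class Document implements Serializable {
    public String id;
  }

  static JavaRDD<Document> lookup(JavaRDD<Document> docs, Set<String> wantedIds) {
    final Set<String> ids = new HashSet<>(wantedIds);  // small, serializable, shipped with the closure
    return docs.filter(doc -> ids.contains(doc.id));   // one pass; no sample or join needed
  }
}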
On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Does anybody have a solution for
wrote:
Thanks Sean. I want to load each batch into Redshift. What's the best/most
efficient way to do that?
Vadim
On Apr 16, 2015, at 1:35 PM, Sean Owen so...@cloudera.com wrote:
You can't, since that's how it's designed to work. Batches are saved
in different files, which are really
I don't think there's anything specific to CDH that you need to know,
other than it ought to set things up sanely for you.
Sandy did a couple posts about tuning:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
You can't, since that's how it's designed to work. Batches are saved
in different files, which are really directories containing
partitions, as is common in Hadoop. You can move them later, or just
read them where they are.
On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy
Yes, look at what it was before -- it would also reject a minimum of 0.
That's the case you are hitting. 0 is a fine minimum.
On Thu, Apr 16, 2015 at 8:09 PM, Michael Stone mst...@mathom.us wrote:
On Thu, Apr 16, 2015 at 07:47:51PM +0100, Sean Owen wrote:
IIRC that was fixed already in 1.3
https
Looks like that message would be triggered if
spark.dynamicAllocation.initialExecutors was not set, or 0, if I read
this right. Yeah, that might have to be positive. This requires you
set initial executors to 1 if you want 0 min executors. Hm, maybe that
shouldn't be an error condition in the args
(Indeed, though the OP said it was a requirement that the pairs are
drawn from the same partition.)
On Thu, Apr 16, 2015 at 11:14 PM, Guillaume Pitel
guillaume.pi...@exensa.com wrote:
Hi Aurelien,
Sean's solution is nice, but maybe not completely order-free, since pairs
will come from the
Yeah, this really shouldn't be recursive. It can't be optimized since
it's not a final/private method. I think you're welcome to try a PR
to un-recursivize it.
On Thu, Apr 16, 2015 at 7:31 PM, Jeff Nadler jnad...@srcginc.com wrote:
I've got a Kafka topic on which lots of data has built up, and
Use mapPartitions, and then take two random samples of the elements in
the partition, and return an iterator over all pairs of them? Should
be pretty simple assuming your sample size n is smallish since you're
returning ~n^2 pairs.
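A rough sketch of that per-partition pair sampling (the double[] element type and sample-size handling are illustrative; note that the Spark 1.x Java API expects an Iterable back from mapPartitions):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class PairSampleSketch {
  // For each partition: materialize it, draw two small random samples, emit all cross pairs.
  static JavaRDD<Tuple2<double[], double[]>> samplePairs(JavaRDD<double[]> points, int n) {
    return points.mapPartitions(iter -> {
      List<double[]> all = new ArrayList<>();
      while (iter.hasNext()) {
        all.add(iter.next());
      }
      Collections.shuffle(all);
      List<double[]> left = new ArrayList<>(all.subList(0, Math.min(n, all.size())));
      Collections.shuffle(all);
      List<double[]> right = new ArrayList<>(all.subList(0, Math.min(n, all.size())));

      List<Tuple2<double[], double[]>> pairs = new ArrayList<>();
      for (double[] a : left) {
        for (double[] b : right) {
          pairs.add(new Tuple2<>(a, b));
        }
      }
      return pairs;  // ~n^2 pairs per partition; Spark 1.x FlatMapFunction returns an Iterable
    });
  }
}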
On Thu, Apr 16, 2015 at 7:00 PM, abellet
IIRC that was fixed already in 1.3
https://github.com/apache/spark/commit/b2047b55c5fc85de6b63276d8ab9610d2496e08b
On Thu, Apr 16, 2015 at 7:41 PM, Michael Stone mst...@mathom.us wrote:
The default for spark.dynamicAllocation.minExecutors is 0, but that value
causes a runtime error and a
What do you mean by batch RDD? They're just RDDs, though they store their
data in different ways and come from different sources. You can union
an RDD from an HDFS file with one from a DStream.
It sounds like you want streaming data to live longer than its batch
interval, but that's not something you
batch RDD from file within spark
steraming context) - lets leave that since we are not getting anywhere
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, April 15, 2015 8:30 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: adding new elements
to the newly instantiated/loaded batch
RDD - is that what you mean by reloading batch RDD from file
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, April 15, 2015 7:43 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: adding new elements to batch
batch RDDs from file for e.g. a
second time moreover after specific period of time
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, April 15, 2015 8:14 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: adding new elements to batch RDD from