Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Sean Owen
, Sean Owen so...@cloudera.com wrote: Spark 1.4 requires Java 7. On Fri, Aug 21, 2015, 3:12 PM Chen Song chen.song...@gmail.com wrote: I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support PySpark, I used JDK 1.6. I got the following error, [INFO] --- scala-maven-plugin:3.2.0

Re: spark and scala-2.11

2015-08-24 Thread Sean Owen
The property scala-2.11 triggers the profile scala-2.11 -- and additionally disables the scala-2.10 profile, so that's the way to do it. But yes, you also need to run the script before-hand to set up the build for Scala 2.11 as well. On Mon, Aug 24, 2015 at 8:48 PM, Lanny Ripple

Re: build spark 1.4.1 with JDK 1.6

2015-08-21 Thread Sean Owen
Spark 1.4 requires Java 7. On Fri, Aug 21, 2015, 3:12 PM Chen Song chen.song...@gmail.com wrote: I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support PySpark, I used JDK 1.6. I got the following error, [INFO] --- scala-maven-plugin:3.2.0:testCompile

Re: DAG related query

2015-08-20 Thread Sean Owen
No. The third line creates a third RDD whose reference simply replaces the reference to the first RDD in your local driver program. The first RDD still exists. On Thu, Aug 20, 2015 at 2:15 PM, Bahubali Jain bahub...@gmail.com wrote: Hi, How would the DAG look like for the below code
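
A minimal Scala sketch of the situation described (names are illustrative; sc is the usual SparkContext):

    var rdd = sc.textFile("input")          // RDD #1
    val mapped = rdd.map(_.toUpperCase)     // RDD #2, child of #1
    rdd = rdd.filter(_.nonEmpty)            // RDD #3: only the local variable is rebound;
                                            // RDD #1 still exists and is still mapped's parent

Reassigning the driver-side reference changes nothing about the lineage the DAG records.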

Re: Java 8 lambdas

2015-08-18 Thread Sean Owen
Yes, it should Just Work. lambdas can be used for any method that takes an instance of an interface with one method, and that describes Function, PairFunction, etc. On Tue, Aug 18, 2015 at 3:23 PM, Kristoffer Sjögren sto...@gmail.com wrote: Hi Is there a way to execute spark jobs with Java 8

Re: Input size increasing every iteration of gradient boosted trees [1.4]

2015-08-13 Thread Sean Owen
Not that I have any answer at this point, but I was discussing this exact same problem with Johannes today. An input size of ~20K records was growing each iteration by ~15M records. I could not see why on a first look. @jkbradley I know it's not much info but does that ring any bells? I think

Re: ClosureCleaner does not work for java code

2015-08-10 Thread Sean Owen
The difference is really that Java and Scala work differently. In Java, your anonymous subclass of Ops defined in (a method of) AbstractTest captures a reference to it. That much is 'correct' in that it's how Java is supposed to work, and AbstractTest is indeed not serializable since you didn't

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Sean Owen
You can ignore it entirely. It just means you haven't installed and configured native libraries for things like accelerated compression, but it has no negative impact otherwise. On Tue, Aug 4, 2015 at 8:11 AM, Deepesh Maheshwari deepesh.maheshwar...@gmail.com wrote: Hi, When i run the spark

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Sean Owen
deepesh.maheshwar...@gmail.com wrote: Can you elaborate on what this native library covers? One thing you mentioned is accelerated compression. It would be very helpful if you can give any useful link to read more about it. On Tue, Aug 4, 2015 at 12:56 PM, Sean Owen so...@cloudera.com wrote

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Sean Owen
wrote: Think it may be needed on Windows, certainly if you start trying to work with local files. On 4 Aug 2015, at 00:34, Sean Owen so...@cloudera.com wrote: It won't affect you if you're not actually running Hadoop. But it's mainly things like Snappy/LZO compression which are implemented

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Sean Owen
If you've set the checkpoint dir, it seems like indeed the intent is to use a default checkpoint interval in DStream: private[streaming] def initialize(time: Time) { ... // Set the checkpoint interval to be slideDuration or 10 seconds, which ever is larger if (mustCheckpoint

Re: PermGen Space Error

2015-07-29 Thread Sean Owen
Yes, I think this was asked because you didn't say what flags you set before, and it's worth verifying they're the correct ones. Although I'd be kind of surprised if 512m isn't enough, did you try more? You could also try -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled Also verify

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Sean Owen
:Sean Owen so...@cloudera.com To:Proust GZ Feng/China/IBM@IBMCN Cc:user user@spark.apache.org Date:07/28/2015 02:20 PM Subject:Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0 It wasn't removed, but rewritten

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Sean Owen
It wasn't removed, but rewritten. Cygwin is just a distribution of POSIX-related utilities so you should be able to use the normal .sh scripts. In any event, you didn't say what the problem is? On Tue, Jul 28, 2015 at 5:19 AM, Proust GZ Feng pf...@cn.ibm.com wrote: Hi, Spark Users Looks like

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Sean Owen
That's for the Windows interpreter rather than bash-running Cygwin. I don't know it's worth doing a lot of legwork for Cygwin, but, if it's really just a few lines of classpath translation in one script, seems reasonable. On Tue, Jul 28, 2015 at 9:13 PM, Steve Loughran ste...@hortonworks.com

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-21 Thread Sean Owen
/maven/com.github.fommil/jniloader/pom.properties Thanks, Arun On Fri, Jul 17, 2015 at 1:30 PM, Sean Owen so...@cloudera.com wrote: Make sure /usr/lib64 contains libgfortran.so.3; that's really the issue. I'm pretty sure the answer is 'yes', but, make sure the assembly has jniloader

Re: ALS run method versus ALS train versus ALS fit and transform

2015-07-17 Thread Sean Owen
Yes, just have a look at the method in the source code. It calls new ALS()...run(). It's a convenience wrapper only. On Fri, Jul 17, 2015 at 4:59 PM, Carol McDonald cmcdon...@maprtech.com wrote: the new ALS()...run() form is underneath both of the first two. I am not sure what you mean by

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-17 Thread Sean Owen
: com.github.fommil.netlib.NativeSystemLAPACK 15/07/17 13:20:53 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK Does anything need to be adjusted in my application POM? Thanks, Arun On Thu, Jul 16, 2015 at 5:26 PM, Sean Owen so...@cloudera.com wrote: Yes, that's

Re: Getting not implemented by the TFS FileSystem implementation

2015-07-16 Thread Sean Owen
See also https://issues.apache.org/jira/browse/SPARK-8385 (apologies if someone already mentioned that -- just saw this thread) On Thu, Jul 16, 2015 at 7:19 PM, Jerrick Hoang jerrickho...@gmail.com wrote: So, this has to do with the fact that 1.4 has a new way to interact with HiveMetastore,

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-16 Thread Sean Owen
Yes, that's most of the work, just getting the native libs into the assembly. netlib can find them from there even if you don't have BLAS libs on your OS, since it includes a reference implementation as a fallback. One common reason it won't load is not having libgfortran installed on your OSes

Re: ALS run method versus ALS train versus ALS fit and transform

2015-07-15 Thread Sean Owen
The first two examples are from the .mllib API. Really, the new ALS()...run() form is underneath both of the first two. In the second case, you're calling a convenience method that calls something similar to the first example. The second example is from the new .ml pipelines API. Similar ideas,
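
A hedged Scala sketch contrasting the two .mllib entry points, assuming ratings: RDD[Rating] already exists and using illustrative parameter values:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // convenience method
    val model1 = ALS.train(ratings, 10, 10, 0.01)

    // the builder form it delegates to underneath
    val model2 = new ALS().setRank(10).setIterations(10).setLambda(0.01).run(ratings)

The spark.ml pipeline API wraps the same algorithm in an Estimator, so there you call fit() on a DataFrame and transform() with the resulting model instead.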

Re: MovieALS Implicit Error

2015-07-13 Thread Sean Owen
Is the data set synthetic, or has very few items? or is indeed very sparse? those could be reasons. However usually this kind of thing happens with very small data sets. I could be wrong about what's going on, but it's a decent guess at the immediate cause given the error messages. On Mon, Jul

Re: MovieALS Implicit Error

2015-07-13 Thread Sean Owen
I interpret this to mean that the input to the Cholesky decomposition wasn't positive definite. I think this can happen if the input matrix is singular or very near singular -- maybe, very little data? Ben that might at least address why this is happening; different input may work fine. Xiangrui

Re: How to upgrade Spark version in CDH 5.4

2015-07-12 Thread Sean Owen
Yeah, it won't technically be supported, and you shouldn't go modifying the actual installation, but if you just make your own build of 1.4 for CDH 5.4 and use that build to launch YARN-based apps, I imagine it will Just Work for most any use case. On Sun, Jul 12, 2015 at 7:34 PM, Ruslan

Re: How can the RegressionMetrics produce negative R2 and explained variance?

2015-07-12 Thread Sean Owen
In general, a negative R2 means the line that was fit is a very poor fit -- the mean would give a smaller squared error. But it can also mean you are applying R2 where it doesn't apply. Here, you're not performing a linear regression; why are you using R2? On Sun, Jul 12, 2015 at 4:22 PM, afarahat

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Sean Owen
These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives. On Wed, Jul 8, 2015, 2:43 PM dgoldenberg dgoldenberg...@gmail.com wrote: Is there a set of best practices for when to use foreachPartition vs. foreachRDD?
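
A small Scala sketch of how the two compose rather than compete (createConnection and send are illustrative helpers):

    dstream.foreachRDD { rdd =>              // once per batch, on the driver
      rdd.foreachPartition { partition =>    // once per partition, on the executors
        val conn = createConnection()        // e.g. one connection per partition
        partition.foreach(record => conn.send(record))
        conn.close()
      }
    }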

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Sean Owen
into a socket. Let's say I have one socket per a client of my streaming app and I get a host:port of that socket as part of the message and want to send the response via that socket. Is foreachPartition still a better choice? On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen so...@cloudera.com wrote

Re: Futures timed out after 10000 milliseconds

2015-07-05 Thread Sean Owen
Usually this message means that the test was starting some process like a Spark master and it didn't ever start. The eventual error is timeout. You have to try to dig in to the test and logs to catch the real reason. On Sun, Jul 5, 2015 at 9:23 PM, SamRoberts samueli.robe...@yahoo.com wrote:

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Sean Owen
Yes, Spark Core depends on Hadoop libs, and there is this unfortunate twist on Windows. You'll still need HADOOP_HOME set appropriately since Hadoop needs some special binaries to work on Windows. On Fri, Jun 26, 2015 at 11:06 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You just need to set

Re: bugs in Spark PageRank implementation

2015-06-25 Thread Sean Owen
#2 is not a bug. Have a search through JIRA. It is merely unnormalized. I think that is how (one of?) the original PageRank papers does it. On Thu, Jun 25, 2015, 7:39 AM Kelly, Terence P (HP Labs Researcher) terence.p.ke...@hp.com wrote: Hi, Colleagues and I have found that the PageRank

Re: Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-25 Thread Sean Owen
with the command [ERROR] mvn goals -rf :spark-sql_2.10 Ahh..ok, so it's Hive 1.1 and Spark 1.4. Even using standard Hive .13 version, I still the the above error. Granted (it's CDH's Hadoop JARs, and Apache's Hive). On Wed, Jun 24, 2015 at 9:30 PM, Sean Owen so...@cloudera.com wrote

Re: Problem with version compatibility

2015-06-25 Thread Sean Owen
-dev +user That all sounds fine except are you packaging Spark classes with your app? that's the bit I'm wondering about. You would mark it as a 'provided' dependency in Maven. On Thu, Jun 25, 2015 at 5:12 AM, jimfcarroll jimfcarr...@gmail.com wrote: Hi Sean, I'm running a Mesos cluster. My

Re: map vs mapPartitions

2015-06-25 Thread Sean Owen
No, or at least, it depends on how the source of the partitions was implemented. On Thu, Jun 25, 2015 at 12:16 PM, Shushant Arora shushantaror...@gmail.com wrote: Does mapPartitions keep complete partitions in memory of executor as iterable. JavaRDDString rdd = jsc.textFile(path);
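
A minimal Scala sketch; whether a whole partition is ever held in memory depends on how the iterator is consumed:

    val rdd = sc.textFile(path)
    val lengths = rdd.mapPartitions { iter =>
      iter.map(_.length)        // lazy: records stream through one at a time
    }
    // by contrast, calling iter.toList inside the closure would pull the whole partition into memory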

Re: Velox Model Server

2015-06-24 Thread Sean Owen
On Wed, Jun 24, 2015 at 12:02 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Oryx does almost the same but Oryx1 kept all user and item vectors in memory (though I am not sure about whether Oryx2 still stores all user and item vectors in memory or partitions in some way). (Yes, this is a

Re: Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-24 Thread Sean Owen
You didn't provide any error? You're compiling vs Hive 1.1 here and that is the problem. It is nothing to do with CDH. On Wed, Jun 24, 2015, 10:15 PM Aaron aarongm...@gmail.com wrote: I was curious if any one was able to get CDH 5.4.1 or 5.4.2 compiling with the v1.4.0 tag out of git?

Re: Velox Model Server

2015-06-23 Thread Sean Owen
Yes, and typically needs are < 100ms. Now imagine even 10 concurrent requests. My experience has been that this approach won't nearly scale. The best you could probably do is async mini-batch near-real-time scoring, pushing results to some store for retrieval, which could be entirely suitable for

Re: Velox Model Server

2015-06-21 Thread Sean Owen
Out of curiosity why netty? What model are you serving? Velox doesn't look like it is optimized for cases like ALS recs, if that's what you mean. I think scoring ALS at scale in real time takes a fairly different approach. The servlet engine probably doesn't matter at all in comparison. On Sat,

Re: [Spark-1.4.0]jackson-databind conflict?

2015-06-12 Thread Sean Owen
I see the same thing in an app that uses Jackson 2.5. Downgrading to 2.4 made it work. I meant to go back and figure out if there's something that can be done to work around this in Spark or elsewhere, but for now, harmonize your Jackson version at 2.4.x if you can. On Fri, Jun 12, 2015 at 4:20

Re: Spark Java API and minimum set of 3rd party dependencies

2015-06-12 Thread Sean Owen
You don't add dependencies to your app -- you mark Spark as 'provided' in the build and you rely on the deployed Spark environment to provide it. On Fri, Jun 12, 2015 at 7:14 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: Hi all, We want to integrate Spark in our Java application using the

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Sean Owen
Guess: it has something to do with the Text object being reused by Hadoop? You can't in general keep around refs to them since they change. So you may have a bunch of copies of one object at the end that become just one in each partition. On Thu, Jun 11, 2015, 8:36 PM Crystal Xing
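
A hedged Scala sketch of the usual workaround -- copy each Hadoop Writable into an immutable value before calling distinct() (the path is illustrative):

    import org.apache.hadoop.io.Text

    val distinctDocs = sc.sequenceFile("/data/seq", classOf[Text], classOf[Text])
      .map { case (k, v) => (k.toString, v.toString) }   // materialize copies of the reused objects
      .distinct()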

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Sean Owen
at 6:44 PM, Sean Owen so...@cloudera.com wrote: Guess: it has something to do with the Text object being reused by Hadoop? You can't in general keep around refs to them since they change. So you may have a bunch of copies of one object at the end that become just one in each partition. On Thu

Re: Split RDD based on criteria

2015-06-10 Thread Sean Owen
No, but you can write a couple lines of code that do this. It's not optimized of course. This is actually a long and interesting side discussion, but I'm not sure how much it could be given that the computation is pull rather than push; there is no concept of one pass over the data resulting in
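
The couple of lines in question look roughly like this in Scala (predicate is illustrative):

    rdd.cache()                                       // avoid recomputing the source for each pass
    val matches    = rdd.filter(x => predicate(x))
    val nonMatches = rdd.filter(x => !predicate(x))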

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Sean Owen
In the sense here, Spark actually does have operations that make multiple RDDs like randomSplit. However there is not an equivalent of the partition operation which gives the elements that matched and did not match at once. On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang zjf...@gmail.com wrote: As far
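
For the random case this is built in; a brief Scala sketch:

    val Array(train, test) = rdd.randomSplit(Array(0.8, 0.2), seed = 42L)

For an arbitrary predicate there is no single-pass equivalent of Scala's partition, so you filter twice as sketched under "Split RDD based on criteria" above.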

Re: Example Page Java Function2

2015-06-03 Thread Sean Owen
Yes, I think you're right. Since this is a change to the ASF hosted site, I can make this change to the .md / .html directly rather than go through the usual PR. On Wed, Jun 3, 2015 at 6:23 PM, linkstar350 . tweicomepan...@gmail.com wrote: Hi, I'm Taira. I notice that this example page may be

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Sean Owen
; does that seem correct? Thanks! On Wed, May 20, 2015 at 1:52 PM Sean Owen so...@cloudera.com wrote: I don't think any of those problems are related to Hadoop. Have you looked at userClassPathFirst settings? On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson ejsa...@gmail.com wrote: Hi Sean

Re: rdd.sample() methods very slow

2015-05-21 Thread Sean Owen
). Then I need to get a small random sample of Document objects (e.g. 10,000 document). How can I do this quickly? The rdd.sample() methods does not help because it need to read the entire RDD of 7 million Document from disk which take very long time. Ningjun From: Sean Owen [mailto:so

Re: rdd.sample() methods very slow

2015-05-21 Thread Sean Owen
If sampling whole partitions is sufficient (or a part of a partition), sure you could mapPartitionsWithIndex and decide whether to process a partition at all based on its # and skip the rest. That's much faster. On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com
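
A hedged Scala sketch of sampling whole partitions instead of individual records (the fraction is illustrative):

    import scala.util.Random

    val sampled = rdd.mapPartitionsWithIndex { (idx, iter) =>
      // keep roughly 1% of partitions; unselected partitions return immediately
      // without their records being iterated
      if (new Random(idx).nextDouble() < 0.01) iter else Iterator.empty
    }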

Re: Hive on Spark VS Spark SQL

2015-05-20 Thread Sean Owen
I don't think that's quite the difference. Any SQL engine has a query planner and an execution engine. Both of these use Spark for execution. HoS uses Hive for query planning. Although it's not optimized for execution on Spark per se, it's got a lot of language support and is stable/mature. Spark

Re: spark 1.3.1 jars in repo1.maven.org

2015-05-20 Thread Sean Owen
Yes, the published artifacts can only refer to one version of anything (OK, modulo publishing a large number of variants under classifiers). You aren't intended to rely on Spark's transitive dependencies for anything. Compiling against the Spark API has no relation to what version of Hadoop it

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-05-20 Thread Sean Owen
. More anon, Cheers, Edward Original Message Subject: Re: spark 1.3.1 jars in repo1.maven.org Date: 2015-05-20 00:38 From: Sean Owen so...@cloudera.com To: Edward Sargisson esa...@pobox.com Cc: user user@spark.apache.org Yes, the published artifacts can only refer

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Sean Owen
I don't think you should rely on a shutdown hook. Ideally you try to stop it in the main exit path of your program, even in case of an exception. On Tue, May 19, 2015 at 7:59 AM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: You mean to say within

Re: rdd.sample() methods very slow

2015-05-19 Thread Sean Owen
The way these files are accessed is inherently sequential-access. There isn't a way to in general know where record N is in a file like this and jump to it. So they must be read to be sampled. On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Hi

Re: SPARK-4412 regressed?

2015-05-15 Thread Sean Owen
(I made you a Contributor in JIRA -- your yahoo-related account of the two -- so maybe that will let you do so.) On Fri, May 15, 2015 at 4:19 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi, two questions 1. Can regular JIRA users reopen bugs -- I can open a new issue but it does not

Build change PSA: Hadoop 2.2 default; -Phadoop-x.y profile recommended for builds

2015-05-14 Thread Sean Owen
This change will be merged shortly for Spark 1.4, and has a minor implication for those creating their own Spark builds: https://issues.apache.org/jira/browse/SPARK-7249 https://github.com/apache/spark/pull/5786 The default Hadoop dependency has actually been Hadoop 2.2 for some time, but the

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Sean Owen
\ affected_hosts.py Now we're seeing data from the stream. Thanks again! On Mon, May 11, 2015 at 2:43 PM Sean Owen so...@cloudera.com wrote: Ah yes, the Kafka + streaming code isn't in the assembly, is it? you'd have to provide it and all its dependencies with your app. You could also build

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Sean Owen
2015 com/yammer/metrics/core/Gauge.class On Tue, May 12, 2015 at 8:05 AM, Sean Owen so...@cloudera.com wrote: It doesn't depend directly on yammer metrics; Kafka does. It wouldn't be correct to declare that it does; it is already in the assembly anyway. On Tue, May 12, 2015 at 3:50 PM, Ted

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Sean Owen
, 2015, 1:11 AM Sean Owen so...@cloudera.com wrote: The question is really whether all the third-party integrations should be built into Spark's main assembly. I think reasonable people could disagree, but I think the current state (not built in) is reasonable. It means you have to bring

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Sean Owen
executor with a thread pool of N threads doing the same task. The performance I'm seeing of running the Kafka-Spark Streaming job is 7 times slower than that of the utility. What's pulling Spark back? Thanks. On Mon, May 11, 2015 at 4:55 PM, Sean Owen so...@cloudera.com wrote: You have one

Re: Getting error running MLlib example with new cluster

2015-05-11 Thread Sean Owen
That is mostly the YARN overhead. You're starting up a container for the AM and executors, at least. That still sounds pretty slow, but the defaults aren't tuned for fast startup. On May 11, 2015 7:00 PM, Su She suhsheka...@gmail.com wrote: Got it to work on the cluster by changing the master to

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Sean Owen
You have one worker with one executor with 32 execution slots. On Mon, May 11, 2015 at 9:52 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, Is there anything special one must do, running locally and submitting a job like so: spark-submit \ --class com.myco.Driver \

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-11 Thread Sean Owen
Ah yes, the Kafka + streaming code isn't in the assembly, is it? you'd have to provide it and all its dependencies with your app. You could also build this into your own app jar. Tools like Maven will add in the transitive dependencies. On Mon, May 11, 2015 at 10:04 PM, Lee McFadden

Re: dependencies on java-netlib and jblas

2015-05-08 Thread Sean Owen
Yes, at this point I believe you'll find jblas used for historical reasons, to not change some APIs. I don't believe it's used for much if any computation in 1.4. On May 8, 2015 5:04 PM, John Niekrasz john.niekr...@gmail.com wrote: Newbie question... Can I use any of the main ML capabilities

Re: Spark does not delete temporary directories

2015-05-07 Thread Sean Owen
You're referring to a comment in the generic utility method, not the specific calls to it. The comment just says that the generic method doesn't mark the directory for deletion. Individual uses of it might need to. One or more of these might be delete-able on exit, but in any event it's just a

Re: Selecting download for 'hadoop 2.4 and later

2015-05-03 Thread Sean Owen
See https://issues.apache.org/jira/browse/SPARK-5492 but I think you'll need to share the stack trace as I'm not sure how this can happen since the NoSuchMethodError (not NoSuchMethodException) indicates a call in the bytecode failed to link but there is only a call by reflection. On Fri, May 1,

Re: Spark pre-built for Hadoop 2.6

2015-04-30 Thread Sean Owen
Yes there is now such a profile, though it is essentially redundant and doesn't configure things differently from 2.4, besides the Hadoop version of course -- which is why it hadn't existed before, since the 2.4 profile covers 2.4+. People just kept filing bugs to add it, but the docs are correct: you don't

Re: JavaRDDListTuple2 flatMap Lexicographical Permutations - Java Heap Error

2015-04-30 Thread Sean Owen
You fundamentally want (half of) the Cartesian product so I don't think it gets a lot faster to form this. You could implement this on cogroup directly and maybe avoid forming the tuples you will filter out. I'd think more about whether you really need to do this thing, or whether there is

Re: Driver memory leak?

2015-04-29 Thread Sean Owen
Please use user@, not dev@ This message does not appear to be from your driver. It also doesn't say you ran out of memory. It says you didn't tell YARN to let it use the memory you want. Look at the memory overhead param and please search first for related discussions. On Apr 29, 2015 11:43 AM,

Re: Driver memory leak?

2015-04-29 Thread Sean Owen
be related to this https://issues.apache.org/jira/browse/SPARK-5967 defect that was resolved in Spark 1.2.2 and 1.3.0. It also was a HashMap causing the issue. -Conor On Wed, Apr 29, 2015 at 12:01 PM, Sean Owen so...@cloudera.com wrote: Please use user@, not dev@ This message does not appear

Re: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread Sean Owen
Works fine for me. Make sure you're not downloading the HTML redirector page and thinking it's the archive. On Mon, Apr 27, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I downloaded 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple mirrors and direct link. Each time i untar i

Re: Spark RDD sortByKey triggering a new job

2015-04-24 Thread Sean Owen
Yes, I think this is a known issue, that sortByKey actually runs a job to assess the distribution of the data. https://issues.apache.org/jira/browse/SPARK-1021 I think further eyes on it would be welcome as it's not desirable. On Fri, Apr 24, 2015 at 9:57 AM, Spico Florin spicoflo...@gmail.com

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sean Owen
The sum? you just need to use an accumulator to sum the counts or something. On Fri, Apr 24, 2015 at 2:14 PM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: Sorry for my explanation, my English is bad. I just need obtain the Long containing of the DStream created by messages.count().
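
A minimal Scala sketch using the Spark 1.x accumulator API, assuming ssc is the StreamingContext and messages is the DStream in question:

    val total = ssc.sparkContext.accumulator(0L, "total messages")

    messages.count().foreachRDD { rdd =>
      rdd.foreach(c => total += c)   // each micro-batch adds its single count value
    }
    // total.value is readable on the driver, e.g. after the stream is stopped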

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sean Owen
No, it prints each Long in that stream, forever. Have a look at the DStream API. On Fri, Apr 24, 2015 at 2:24 PM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: But if a use messages.count().print this show a single number :/

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Sean Owen
The order of elements in an RDD is in general not guaranteed unless you sort. You shouldn't expect to encounter the partitions of an RDD in any particular order. In practice, you probably find the partitions come up in the order Hadoop presents them in this case. And within a partition, in this

Re: contributing code - how to test

2015-04-24 Thread Sean Owen
The standard incantation -- which is a little different from standard Maven practice -- is: mvn -DskipTests [your options] clean package mvn [your options] test Some tests require the assembly, so you have to do it this way. I don't know what the test failures were, you didn't post them, but

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sean Owen
foreachRDD is an action and doesn't return anything. It seems like you want one final count, but that's not possible with a stream, since there is conceptually no end to a stream of data. You can get a stream of counts, which is what you have already. You can sum those counts in another data

Re: Tasks run only on one machine

2015-04-23 Thread Sean Owen
Where are the file splits? meaning is it possible they were also (only) available on one node and that was also your driver? On Thu, Apr 23, 2015 at 1:21 PM, Pat Ferrel p...@occamsmachete.com wrote: Sure var columns = mc.textFile(source).map { line = line.split(delimiter) } Here “source”

Contributors, read me! Updated Contributing to Spark wiki

2015-04-23 Thread Sean Owen
Following several discussions about how to improve the contribution process in Spark, I've overhauled the guide to contributing. Anyone who is going to contribute needs to read it, as it has more formal guidance about the process:

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Sean Owen
Not that i've tried it, but, why couldn't you use one ZK server? I don't see a reason. On Wed, Apr 22, 2015 at 7:40 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It isn't mentioned anywhere in the doc, but you will probably need separate ZK for each of your HA cluster. Thanks Best Regards

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-21 Thread Sean Owen
I think maybe you need more partitions in your input, which might make for smaller tasks? On Tue, Apr 21, 2015 at 2:56 AM, Christian S. Perone christian.per...@gmail.com wrote: I keep seeing these warnings when using trainImplicit: WARN TaskSetManager: Stage 246 contains a task of very large

Re: writing to hdfs on master node much faster

2015-04-20 Thread Sean Owen
What machines are HDFS data nodes -- just your master? that would explain it. Otherwise, is it actually the write that's slow or is something else you're doing much faster on the master for other reasons maybe? like you're actually shipping data via the master first in some local computation? so

Re: [STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges

2015-04-19 Thread Sean Owen
You need to access the underlying RDD with .rdd() and cast that. That works for me. On Mon, Apr 20, 2015 at 4:41 AM, RimBerry truonghoanglinhk55b...@gmail.com wrote: Hi everyone, i am trying to use the direct approach in streaming-kafka-integration
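
In Scala the cast looks roughly like this (stream construction omitted; Spark 1.3+ Kafka direct API):

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    directStream.foreachRDD { rdd =>
      val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(r => println(s"${r.topic} ${r.partition} ${r.fromOffset} -> ${r.untilOffset}"))
    }

From the Java API the JavaPairRDD wrapper itself is not a HasOffsetRanges, hence the need to call .rdd() first.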

Re: compliation error

2015-04-19 Thread Sean Owen
Brahma since you can see the continuous integration builds are passing, it's got to be something specific to your environment, right? this is not even an error from Spark, but from Maven plugins. On Mon, Apr 20, 2015 at 4:42 AM, Ted Yu yuzhih...@gmail.com wrote: bq. -Dhadoop.version=V100R001C00

Re: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread Sean Owen
Do these datetime objects implement the notion of equality you'd expect? (This may be a dumb question; I'm thinking of the equivalent of equals() / hashCode() from the Java world.) On Sat, Apr 18, 2015 at 4:17 PM, SecondDatke lovejay-lovemu...@outlook.com wrote: I'm trying to solve a

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Sean Owen
Doesn't this reduce to 'Scala isn't compatible with itself across maintenance releases'? Meaning, if this were fixed then Scala 2.11.{x < 6} would have similar failures. It's not not-ready; it's just not the Scala 2.11.6 REPL. Still, sure I'd favor breaking the unofficial support to at least make the

Re: Executor memory in web UI

2015-04-17 Thread Sean Owen
This is the fraction available for caching, which is 60% * 90% * total by default. On Fri, Apr 17, 2015 at 11:30 AM, podioss grega...@hotmail.com wrote: Hi, i am a bit confused with the executor-memory option. I am running applications with Standalone cluster manager with 8 workers with 4gb

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Sean Owen
Spark against 2.11.2 and still saw the problems with the REPL. I've created a bug report: https://issues.apache.org/jira/browse/SPARK-6989 I hope this helps. Cheers, Michael On Apr 17, 2015, at 1:41 AM, Sean Owen so...@cloudera.com wrote: Doesn't this reduce to Scala isn't compatible

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Sean Owen
This would be much, much faster if your set of IDs was simply a Set, and you passed that to a filter() call that just filtered in the docs that matched an ID in the set. On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Does anybody have a solution for
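
A short Scala sketch of the Set-plus-filter approach (names are illustrative; broadcasting the set is cheap insurance when it is reused):

    val idsBc = sc.broadcast(wantedIds.toSet)

    val matched = docs.filter { case (id, doc) => idsBc.value.contains(id) }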

Re: saveAsTextFile

2015-04-16 Thread Sean Owen
wrote: Thanks Sean. I want to load each batch into Redshift. What's the best/most efficient way to do that? Vadim On Apr 16, 2015, at 1:35 PM, Sean Owen so...@cloudera.com wrote: You can't, since that's how it's designed to work. Batches are saved in different files, which are really

Re: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Sean Owen
I don't think there's anything specific to CDH that you need to know, other than it ought to set things up sanely for you. Sandy did a couple posts about tuning: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/

Re: saveAsTextFile

2015-04-16 Thread Sean Owen
You can't, since that's how it's designed to work. Batches are saved in different files, which are really directories containing partitions, as is common in Hadoop. You can move them later, or just read them where they are. On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Sean Owen
Yes, look what it was before -- would also reject a minimum of 0. That's the case you are hitting. 0 is a fine minimum. On Thu, Apr 16, 2015 at 8:09 PM, Michael Stone mst...@mathom.us wrote: On Thu, Apr 16, 2015 at 07:47:51PM +0100, Sean Owen wrote: IIRC that was fixed already in 1.3 https

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Sean Owen
Looks like that message would be triggered if spark.dynamicAllocation.initialExecutors was not set, or 0, if I read this right. Yeah, that might have to be positive. This requires you set initial executors to 1 if you want 0 min executors. Hm, maybe that shouldn't be an error condition in the args

Re: Random pairs / RDD order

2015-04-16 Thread Sean Owen
(Indeed, though the OP said it was a requirement that the pairs are drawn from the same partition.) On Thu, Apr 16, 2015 at 11:14 PM, Guillaume Pitel guillaume.pi...@exensa.com wrote: Hi Aurelien, Sean's solution is nice, but maybe not completely order-free, since pairs will come from the

Re: StackOverflowError from KafkaReceiver when rate limiting used

2015-04-16 Thread Sean Owen
Yeah, this really shouldn't be recursive. It can't be optimized since it's not a final/private method. I think you're welcome to try a PR to un-recursivize it. On Thu, Apr 16, 2015 at 7:31 PM, Jeff Nadler jnad...@srcginc.com wrote: I've got a Kafka topic on which lots of data has built up, and

Re: Random pairs / RDD order

2015-04-16 Thread Sean Owen
Use mapPartitions, and then take two random samples of the elements in the partition, and return an iterator over all pairs of them? Should be pretty simple assuming your sample size n is smallish since you're returning ~n^2 pairs. On Thu, Apr 16, 2015 at 7:00 PM, abellet

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Sean Owen
IIRC that was fixed already in 1.3 https://github.com/apache/spark/commit/b2047b55c5fc85de6b63276d8ab9610d2496e08b On Thu, Apr 16, 2015 at 7:41 PM, Michael Stone mst...@mathom.us wrote: The default for spark.dynamicAllocation.minExecutors is 0, but that value causes a runtime error and a

Re: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Sean Owen
What do you mean by batch RDD? They're just RDDs, though they store their data in different ways and come from different sources. You can union an RDD from an HDFS file with one from a DStream. It sounds like you want streaming data to live longer than its batch interval, but that's not something you
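
A hedged Scala sketch of unioning a batch RDD into each micro-batch (names and paths are illustrative):

    val batchRdd = sc.textFile("hdfs:///data/history")   // ordinary RDD from an HDFS file

    val combined = dstream.transform { rdd =>
      rdd.union(batchRdd)     // each interval: that batch's RDD plus the static RDD
    }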

Re: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Sean Owen
batch RDD from file within spark steraming context) - lets leave that since we are not getting anywhere -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, April 15, 2015 8:30 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: adding new elements

Re: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Sean Owen
to the newly instantiated/loaded batch RDD - is that what you mean by reloading batch RDD from file -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, April 15, 2015 7:43 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: adding new elements to batch

Re: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Sean Owen
batch RDDs from file for e.g. a second time moreover after specific period of time -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, April 15, 2015 8:14 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: adding new elements to batch RDD from
