Re: simultaneous actions

2016-01-15 Thread Sean Owen
Can you run N jobs depending on the same RDD in parallel on the driver? Certainly. The context / scheduling is thread-safe and the RDD is immutable. I've done this to, for example, build and evaluate a bunch of models simultaneously on a big cluster. On Fri, Jan 15, 2016 at 7:10 PM, Jakob Odersky
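
A minimal sketch of the pattern, assuming an existing SparkContext sc (the input path is hypothetical):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // One shared, immutable RDD; SparkContext's scheduler is thread-safe
    val data = sc.textFile("hdfs:///data/input").cache()

    // Submit four independent jobs concurrently from the driver
    val jobs = (1 to 4).map { i =>
      Future { data.filter(_.contains(i.toString)).count() }
    }
    val counts = jobs.map(f => Await.result(f, Duration.Inf))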

Re: simultaneous actions

2016-01-15 Thread Sean Owen
It makes sense if you're parallelizing jobs that have relatively few tasks, and have a lot of execution slots available. It makes sense to turn them loose all at once and try to use the parallelism available. There are downsides, eventually: for example, N jobs accessing one cached RDD may

Re: [discuss] dropping Hadoop 2.2 and 2.3 support in Spark 2.0?

2016-01-14 Thread Sean Owen
I personally support this. I had suggested drawing the line at Hadoop 2.6, but that's minor. More info: Hadoop 2.7: April 2015 Hadoop 2.6: Nov 2014 Hadoop 2.5: Aug 2014 Hadoop 2.4: April 2014 Hadoop 2.3: Feb 2014 Hadoop 2.2: Oct 2013 CDH 5.0/5.1 = Hadoop 2.3 + backports CDH 5.2/5.3 = Hadoop 2.5 +

Re: NPE when using Joda DateTime

2016-01-14 Thread Sean Owen
It does look like the state of the DateTime object isn't being recreated properly on deserialization, given where the NPE occurs (look at the Joda source code). However the object is java.io.Serializable. Are you sure the Kryo serialization is correct? It doesn't quite explain why

Re: NPE when using Joda DateTime

2016-01-14 Thread Sean Owen
(kryo: Kryo) { > kryo.register(classOf[org.joda.time.DateTime]) > } > } > > Is it because the groupBy sees a different class type? Maybe Array[DateTime]? > I don’t want to find the answer by trial and error though. > > Alex > > -Original Message-
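
The registrator quoted above registers the class but leaves Kryo to its default field serializer, which can drop DateTime's transient internal state. A sketch of a registrator that installs a dedicated serializer instead, assuming the third-party de.javakaffee:kryo-serializers library is on the classpath:

    import com.esotericsoftware.kryo.Kryo
    import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer
    import org.apache.spark.serializer.KryoRegistrator

    class JodaRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // A serializer that rebuilds DateTime's state on deserialization
        kryo.register(classOf[org.joda.time.DateTime], new JodaDateTimeSerializer())
      }
    }

Point spark.kryo.registrator at this class in the SparkConf.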

Re: FPGrowth does not handle large result sets

2016-01-13 Thread Sean Owen
As I said in your JIRA, the collect() in question is bringing results back to the driver to return them. The assumption is that there aren't a vast number of frequent items. If they are, they aren't 'frequent' and your min support is too low. On Wed, Jan 13, 2016 at 12:43 AM, Ritu Raj Tiwari

Re: FPGrowth does not handle large result sets

2016-01-13 Thread Sean Owen
Given that the items in my transactions are drawn from a set of about > 25,000 (I previously thought 17,000), what would be a rational way to > determine the (peak) memory needs of my driver node? > > -Raj > > > On Wednesday, January 13, 2016 1:18 AM, Sean Owen <so...@cloude
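
For reference, a minimal sketch of the knob in question in the 1.x API (the transactions RDD is hypothetical):

    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.rdd.RDD

    def mine(transactions: RDD[Array[String]]): Unit = {
      val model = new FPGrowth()
        .setMinSupport(0.1)    // raising this shrinks the result set collect()ed to the driver
        .setNumPartitions(10)
        .run(transactions)
      model.freqItemsets.take(20).foreach(println)
    }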

Re: Unshaded google guava classes in spark-network-common jar

2016-01-12 Thread Sean Owen
No, this is on purpose. Have a look at the build POM. A few Guava classes were used in the public API for Java and have had to stay unshaded. In 2.x / master this is already changed such that no unshaded Guava classes should be included. On Tue, Jan 12, 2016, 07:28 Jake Yoon

Re: [discuss] dropping Python 2.6 support

2016-01-09 Thread Sean Owen
Chiming in late, but my take on this line of argument is: these companies are welcome to keep using Spark 1.x. If anything the argument here is about how long to maintain 1.x, and indeed, it's going to go dormant quite soon. But using RHEL 6 (or any older version of any platform) and not wanting

Re: groupByKey does not work?

2016-01-05 Thread Sean Owen
I suspect this is another instance of case classes not working as expected between the driver and executor when used with spark-shell. Search JIRA for some back story. On Tue, Jan 5, 2016 at 12:42 AM, Arun Luthra wrote: > Spark 1.5.0 > > data: > >

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Sean Owen
+juliet for an additional opinion, but FWIW I think it's safe to say that future CDH will have a more consistent Python story and that story will support 2.7 rather than 2.6. On Tue, Jan 5, 2016 at 7:17 AM, Reynold Xin wrote: > Does anybody here care about us dropping

Re: Using Java Function API with Java 8

2015-12-24 Thread Sean Owen
You forgot a return statement in the 'else' clause, which is what the compiler is telling you. There's nothing more to it here. Your function is much simpler however as Function checkHeaders2 = (x -> x.startsWith("npi")||x.startsWith("CPT")); On Thu, Dec 24, 2015 at 1:13 AM,

Re: Newbie Help for spark's not finding native hadoop warning

2015-12-24 Thread Sean Owen
You can safely ignore it. Native libs aren't set with HADOOP_HOME. See Hadoop docs on how to configure this if you're curious, but you really don't need to. On Thu, Dec 24, 2015 at 12:19 PM, Bilinmek Istemiyor wrote: > Hello, > > I have apache spark 1.5.1 installed with the

Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-16 Thread Sean Owen
It does look like it's not actually used. It may simply be there for completeness, to match cancelStage and cancelJobGroup, which are used. I also don't know of a good reason there's no way to kill a whole job. On Wed, Dec 16, 2015 at 1:15 PM, Jacek Laskowski wrote: > Hi, > >
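
The job-group route is the practical way to kill related work; a minimal sketch (paths and ids are hypothetical):

    // Tag every job submitted from this thread with a group id...
    sc.setJobGroup("nightly-etl", "nightly ETL jobs", interruptOnCancel = true)
    val n = sc.textFile("hdfs:///input").count()

    // ...so another thread (a UI handler, say) can cancel the whole group:
    sc.cancelJobGroup("nightly-etl")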

Re: ideal number of executors per machine

2015-12-16 Thread Sean Owen
> Hmmm. > This would go against the grain. > > I have to ask how you came to that conclusion… > > There are a lot of factors… e.g. Yarn vs Mesos? > > What you’re suggesting would mean a loss of parallelism. > > >> On Dec 16, 2015, at 12:22 AM, Sean Owen <s

Re: ideal number of executors per machine

2015-12-15 Thread Sean Owen
1 per machine is the right number. If you are running very large heaps (>64GB) you may consider multiple per machine just to make sure each's GC pauses aren't excessive, but even this might be better mitigated with GC tuning. On Tue, Dec 15, 2015 at 9:07 PM, Veljko Skarich

Re: Re: Spark assembly in Maven repo?

2015-12-14 Thread Sean Owen
la > 2.10)? > > > https://repository.apache.org/content/repositories/releases/org/apache/spark/spark-core_2.10/1.5.1/ > > > > Xiaoyong > > > > *From:* Sean Owen [mailto:so...@cloudera.com] > *Sent:* Saturday, December 12, 2015 12:45 AM > *To:* Xiaoy

Re: Re: Spark assembly in Maven repo?

2015-12-12 Thread Sean Owen
That's exactly what the various artifacts in the Maven repo are for. The API classes for core are in the core artifact and so on. You don't need an assembly. On Sat, Dec 12, 2015 at 12:32 AM, Xiaoyong Zhu wrote: > Yes, so our scenario is to treat the spark assembly as an
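
In sbt terms, that means depending on the published modules directly; a sketch (versions illustrative):

    // The cluster supplies Spark at runtime, hence "provided"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.1" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.5.1" % "provided"
    )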

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Sean Owen
Hm, are you referencing a local file from your remote workers? That won't work, as the file only exists on one machine (I presume). On Fri, Dec 11, 2015 at 5:19 PM, Lin, Hao wrote: > Hi, > > > > I have problem accessing local file, with such example: > > > >

Re: RDD.isEmpty

2015-12-09 Thread Sean Owen
version of the code that does a .count for debug and > .take(1) instead of the .isEmpty the count of one epinions RDD take 8 minutes > and the .take(1) uses 3 minutes. > > Other users have seen total runtime on 13G dataset of 700 minutes with the > execution time mostly spent

Re: RDD.isEmpty

2015-12-09 Thread Sean Owen
ty calc time, right? The RDD creation might be slightly > slower due to using an accumulator. > > > > > > On Dec 9, 2015, at 9:29 AM, Sean Owen <so...@cloudera.com> wrote: > > Are you sure it's isEmpty? and not an upstream stage? isEmpty is > definitely the action here

Re: RDD.isEmpty

2015-12-09 Thread Sean Owen
It should at best collect 1 item to the driver. This means evaluating at least 1 element of 1 partition. I can imagine pathological cases where that's slow, but, do you have any more info? how slow is slow and what is slow? On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel
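
Roughly, isEmpty behaves like this sketch, which is why a slow isEmpty usually points at expensive upstream work rather than the check itself:

    import org.apache.spark.rdd.RDD

    // Try to take a single element; an empty result means an empty RDD
    def isEmptySketch[T](rdd: RDD[T]): Boolean = rdd.take(1).isEmpty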

Re: RDD.isEmpty

2015-12-09 Thread Sean Owen
On Wed, Dec 9, 2015 at 7:49 PM, Pat Ferrel wrote: > The “Any” is required by the code it is being passed to, which is the > Elasticsearch Spark index writing code. The values are actually RDD[(String, > Map[String, String])] (Is it frequently a big big map by any chance?)

Re: Content based window operation on Time-series data

2015-12-09 Thread Sean Owen
CC Sandy as his https://github.com/cloudera/spark-timeseries might be of use here. On Wed, Dec 9, 2015 at 4:54 PM, Arun Verma wrote: > Hi all, > > We have RDD(main) of sorted time-series data. We want to split it into > different RDDs according to window size and then

Re: How to access a RDD (that has been broadcasted) inside the filter method of another RDD?

2015-12-07 Thread Sean Owen
You can't broadcast an RDD to begin with, and can't use RDDs inside RDDs. They are really driver-side concepts. Yes that's how you'd use a broadcast of anything else though, though you need to reference ".value" on the broadcast. The 'if' is redundant in that example, and if it's a map- or
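
A minimal sketch of the intended pattern (smallRdd and bigRdd are hypothetical):

    // Broadcast a plain driver-side value, not an RDD
    val lookup: Set[String] = smallRdd.collect().toSet
    val bc = sc.broadcast(lookup)

    // Read it via .value on the executors; no 'if' wrapper needed
    val kept = bigRdd.filter(id => bc.value.contains(id))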

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-07 Thread Sean Owen
I'm not sure if this is available in Python but from 1.3 on you should be able to call ALS.setFinalRDDStorageLevel with level "none" to ask it to unpersist when it is done. On Mon, Dec 7, 2015 at 1:42 PM, Ewan Higgs wrote: > Jonathan, > Did you ever get to the bottom of
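
A sketch of the Scala call, assuming ratings is an RDD[Rating] (the other parameter values are illustrative):

    import org.apache.spark.mllib.recommendation.ALS
    import org.apache.spark.storage.StorageLevel

    val als = new ALS()
      .setRank(10)
      .setIterations(10)
      .setFinalRDDStorageLevel(StorageLevel.NONE)  // don't leave factor RDDs cached
    val model = als.run(ratings)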

Re: newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-04 Thread Sean Owen
There is no way to upgrade a running cluster here. You can stop a cluster, and simply start a new cluster in the same way you started the original cluster. That ought to be simple; the only issue I suppose is that you have down-time since you have to shut the whole thing down, but maybe that's

Re: Does Spark streaming support iterative operator?

2015-12-03 Thread Sean Owen
Yes, in the sense that you can create and trigger an action on as many RDDs created from the batch's RDD that you like. On Thu, Dec 3, 2015 at 8:04 PM, Wang Yangjun wrote: > Hi, > > In storm we could do thing like: > > TopologyBuilder builder = new TopologyBuilder(); > >
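
A sketch of triggering several jobs per batch inside foreachRDD (stream is a hypothetical DStream[String]; parseRecord is hypothetical too):

    stream.foreachRDD { rdd =>
      val parsed = rdd.map(parseRecord).cache()   // base RDD for this batch
      val errors = parsed.filter(_.contains("ERROR")).count()  // job 1
      val total  = parsed.count()                               // job 2
      println(s"errors: $errors / $total")
      parsed.unpersist()
    }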

Re: [POWERED BY] Please add our organization

2015-12-02 Thread Sean Owen
Same, not sure if anyone handles this particularly but I'll do it. This should go to dev@; I think we just put a note on that wiki. On Wed, Dec 2, 2015 at 10:53 AM, Adrien Mogenet wrote: > Hi folks, > > You're probably busy, but any update on this? :) > > > On

Re: Is it relevant to use BinaryClassificationMetrics.aucROC / aucPR with LogisticRegressionModel ?

2015-11-24 Thread Sean Owen
Your reasoning is correct; you need probabilities (or at least some score) out of the model and not just a 0/1 label in order for a ROC / PR curve to have meaning. But you just need to call clearThreshold() on the model to make it return a probability. On Tue, Nov 24, 2015 at 5:19 PM, jmvllt
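
A minimal sketch, assuming model is a trained LogisticRegressionModel and testData an RDD[LabeledPoint]:

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    model.clearThreshold()  // predict() now returns scores, not 0/1 labels
    val scoreAndLabels = testData.map(p => (model.predict(p.features), p.label))
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println(s"AUC ROC = ${metrics.areaUnderROC()}  AUC PR = ${metrics.areaUnderPR()}")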

Re: Please add us to the Powered by Spark page

2015-11-24 Thread Sean Owen
Not sure who generally handles that, but I just made the edit. On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote: > Sorry to be a nag, I realize folks with edit rights on the Powered by Spark > page are very busy people, but its been 10 days since my original request, >

Re: No spark examples jar in maven repository after 1.1.1 ?

2015-11-16 Thread Sean Owen
I think because they're not a library? They're example code, not something you build an app on. On Mon, Nov 16, 2015 at 9:27 AM, Jeff Zhang wrote: > I don't find spark examples jar in maven repository after 1.1.1. Any reason > for that ? > >

Re: java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-08 Thread Sean Owen
You included a very old version of the Twitter jar - 1.0.0. Did you mean 1.5.1? On Mon, Nov 9, 2015 at 7:36 AM, fanooos wrote: > This is my first Spark Stream application. The setup is as following > > 3 nodes running a spark cluster. One master node and two slaves. > >

Re: PMML version in MLLib

2015-11-04 Thread Sean Owen
I'm pretty sure that attribute is required. I am not sure what PMML version the code has been written for but would assume 4.2.1. Feel free to open a PR to add this version to all the output. On Wed, Nov 4, 2015 at 11:42 AM, Fazlan Nazeem wrote: > [adding dev] > > On Wed, Nov

Re: apply simplex method to fix linear programming in spark

2015-11-02 Thread Sean Owen
I might be steering this a bit off topic: does this need the simplex method? this is just an instance of nonnegative least squares. I don't think it relates to LDA either. Spark doesn't have any particular support for NNLS (right?) or simplex though. On Mon, Nov 2, 2015 at 6:03 PM, Debasish Das

Re: doc building process hangs on Failed to load class “org.slf4j.impl.StaticLoggerBinder”

2015-10-27 Thread Sean Owen
When you build docs, you are not running Spark. It looks like you are though. These commands are just executed in the shell. On Tue, Oct 27, 2015 at 2:09 PM, Alex Luya wrote: > followed this > > https://github.com/apache/spark/blob/master/docs/README.md > > to build

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
> problem. The configuration files are all directly copied from Spark 1.5.1. > I feel it is a bug on Spark 1.5.1. > > Thanks a lot for your response. > > On Mon, Oct 26, 2015 at 7:21 PM Sean Owen <so...@cloudera.com> wrote: > >> Yeah, are these stats actually

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
-dev +user How are you measuring network traffic? It's not in general true that there will be zero network traffic, since not all executors are local to all data. That can be the situation in many cases but not always. On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li wrote: > Hi,

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
I, > data is evenly distributed among 18 machines. > > > On Mon, Oct 26, 2015 at 5:18 PM Sean Owen <so...@cloudera.com> wrote: > >> Have a look at your HDFS replication, and where the blocks are for these >> files. For example, if you had only 2 HDFS data nodes

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
Yeah, are these stats actually reflecting data read locally, like through the loopback interface? I'm also no expert on the internals here but this may be measuring effectively local reads. Or are you sure it's not? On Mon, Oct 26, 2015 at 11:14 AM, Steve Loughran wrote:

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
c/net/dev and then take the difference of received bytes before > and after the job. I also see a long-time peak (nearly 600Mb/s) in nload > interface. We have 18 machines and each machine receives 4.7G bytes. > > On Mon, Oct 26, 2015 at 5:00 PM Sean Owen <so...@cloudera.com> wr

Re: Error Compiling Spark 1.4.1 w/ Scala 2.11 & Hive Support

2015-10-26 Thread Sean Owen
Did you switch the build to Scala 2.11 by running the script in dev/? It won't work otherwise, but does work if you do. @Ted 2.11 was supported in 1.4, not just 1.5. On Mon, Oct 26, 2015 at 2:13 PM, Bryan Jeffrey wrote: > All, > > The error resolved to a bad version of

Re: Problem with make-distribution.sh

2015-10-26 Thread Sean Owen
I don't think the page suggests that gives you any of the tarballs on the downloads page, and -Phive does not by itself do so either. On Mon, Oct 26, 2015 at 4:58 PM, Ted Yu wrote: > I logged SPARK-11318 with a PR. > > I verified that by adding -Phive the datanucleus jars

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Sean Owen
Hm, why do you say it doesn't support 2.11? It does. It is not even this difficult; you just need a source distribution, and then run "./dev/change-scala-version.sh 2.11" as you say. Then build as normal. On Sun, Oct 25, 2015 at 4:00 PM, Todd Nist wrote: > Hi Bilnmek, > >

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Sean Owen
No, 2.11 artifacts are in fact published: http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-parent_2.11%22 On Sun, Oct 25, 2015 at 7:37 PM, Todd Nist wrote: > Sorry Sean you are absolutely right it supports 2.11 all o meant is there is > no release available as a

Re: Maven build failed (Spark master)

2015-10-23 Thread Sean Owen
This doesn't show the actual error output from Maven. I have a strong guess that you haven't set MAVEN_OPTS to increase the memory Maven can use. On Fri, Oct 23, 2015 at 6:14 AM, Kayode Odeyemi wrote: > Hi, > > I can't seem to get a successful maven build. Please see command

Re: I don't understand what this sentence means."7.1 GB of 7 GB physical memory used"

2015-10-23 Thread Sean Owen
Spark asked YARN to let an executor use 7GB of memory, but it used more, so it was killed. In each case you see that the executor memory plus overhead equals the YARN allocation requested. What's the issue with that? On Fri, Oct 23, 2015 at 6:46 AM, JoneZhang wrote: > Here is
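
If this keeps happening, the usual remedy is to raise the executor memory and/or the overhead; a sketch with illustrative values (Spark 1.x property names):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "6g")
      .set("spark.yarn.executor.memoryOverhead", "1024")  // in MB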

Re: Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread Sean Owen
Maven, in general, does some local caching to avoid hitting the repo every time. It's possible this is why you're not seeing 1.5.1. On the command line you can for example add "mvn -U ..." Not sure of the equivalent in IntelliJ, but it will be updating the same repo IJ sees. Try that. The repo

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Sean Owen
Coalesce without a shuffle? It shouldn't be an action. It just treats many partitions as one. On Tue, Oct 20, 2015 at 1:00 PM, t3l wrote: > > I have a dataset consisting of 5 binary files (each between 500kb and > 2MB). They are stored in HDFS on a Hadoop cluster. The
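
A sketch of the distinction (the path is hypothetical):

    // One partition per small file can mean tens of thousands of partitions
    val raw = sc.binaryFiles("hdfs:///data/smallfiles")

    // coalesce is a lazy transformation: it maps many parent partitions
    // onto fewer, with no shuffle and no job triggered
    val fewer = raw.coalesce(64)

    // repartition(64) would reach the same count but with a full shuffle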

Re: mailing list subscription

2015-10-20 Thread Sean Owen
Nabble is not related to the mailing list, so I think that's the problem. spark.apache.org -> Community -> Mailing Lists: http://spark.apache.org/community.html These instructions from the project website itself are correct. On Tue, Oct 20, 2015 at 5:22 PM, Jeff Sadowski

Re: Top 10 count

2015-10-20 Thread Sean Owen
I believe it will be most efficient to let top(n) do the work, rather than sort the whole RDD and then take the first n. The reason is that top and takeOrdered know they need at most n elements from each partition, and then just need to merge those. It's never required to sort the whole thing. I
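
A sketch of the two approaches, assuming counts is an RDD[(String, Int)] ranked by the count:

    // Efficient: each partition keeps only its 10 largest, the driver merges those
    val top10 = counts.top(10)(Ordering.by[(String, Int), Int](_._2))

    // Same result, but sorts the entire RDD first
    val top10Slow = counts.sortBy(_._2, ascending = false).take(10)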

Re: Ensuring eager evaluation inside mapPartitions

2015-10-16 Thread Sean Owen
If you mean getResult is called on the result of foo for each record, then that already happens. If you mean getResults is called only after foo has been called on all records, then you have to collect to a list, yes. Why does it help with foo being slow in either case though? You can try to

Re: Is there any better way of writing this code

2015-10-12 Thread Sean Owen
A few small-scale code style tips which might improve readability: You can in some cases write something like _.split("...") instead of x => x.split("...") A series of if-else conditions on x(15) can be replaced with "x(15) match { case "B" => ... }" .isEmpty may be more readable than == "" With
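
Those tips in one hypothetical snippet (the fields and column layout are made up):

    // x => x.split(",") becomes placeholder syntax
    val fields = lines.map(_.split(","))

    val labeled = fields.map { x =>
      // an if-else chain on x(15) reads more clearly as a match
      val kind = x(15) match {
        case "B"            => "business"
        case "P"            => "personal"
        case s if s.isEmpty => "missing"   // .isEmpty rather than == ""
        case _              => "other"
      }
      (kind, x)
    }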

Re: Cannot get spark-streaming_2.10-1.5.0.pom from the maven repository

2015-10-12 Thread Sean Owen
IIRC we have seen transient problems like this from Maven Central, where files were visible to some but not others or reappeared later. I can't get that file now either, but, try again later first. If it's really persistent we may have to ask what's going on with Maven Central. It's not a Spark or

Re: Different partition number of GroupByKey leads different result

2015-10-09 Thread Sean Owen
you ever read the spark source code?Which step can lead to data > dislocation? > >> 在 2015年10月9日,17:37,Sean Owen <so...@cloudera.com> 写道: >> >> Another guess, since you say the key is String (offline): you are not >> cloning the value of TagsWritable. Hadoop reuses

Re: Different partition number of GroupByKey leads different result

2015-10-09 Thread Sean Owen
, Sean Owen <so...@cloudera.com> wrote: > First guess: your key class does not implement hashCode/equals > > On Fri, Oct 9, 2015 at 10:05 AM, Devin Huang <hos...@163.com> wrote: >> Hi everyone, >> >> I got a trouble these days,and I don't know whether it is a

Re: Different partition number of GroupByKey leads different result

2015-10-09 Thread Sean Owen
First guess: your key class does not implement hashCode/equals On Fri, Oct 9, 2015 at 10:05 AM, Devin Huang wrote: > Hi everyone, > > I got a trouble these days,and I don't know whether it is a bug of > spark.When I use GroupByKey for our sequenceFile Data,I find that
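
Why that matters, in a sketch: groupByKey routes records by key.hashCode, so a key class with reference equality scatters "equal" keys across partitions.

    class BrokenKey(val id: String)  // inherits Object's hashCode/equals: broken as a key

    case class GoodKey(id: String)   // case classes get value-based hashCode/equals

    class ManualKey(val id: String) {  // or implement both explicitly
      override def hashCode: Int = id.hashCode
      override def equals(other: Any): Boolean = other match {
        case k: ManualKey => k.id == id
        case _            => false
      }
    }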

Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread Sean Owen
These are true, but it's not because Spark is written in Scala; it's because it executes in the JVM. So, Scala/Java-based apps have an advantage in that they don't have to serialize data back and forth to a Python process, which also brings a new set of things that can go wrong. Python is also

Re: RDD of ImmutableList

2015-10-07 Thread Sean Owen
I think Java's immutable collections are fine with respect to Kryo -- that's not the same as Guava. On Wed, Oct 7, 2015 at 11:56 AM, Jakub Dubovsky wrote: > I did not realize that scala's and java's immutable collections use > different apis which causes this.

Re: Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Sean Owen
-dev Is r.getInt(ind) very large in some cases? I think there's not quite enough info here. On Wed, Oct 7, 2015 at 6:23 PM, wrote: > When running stand-alone cluster mode job, the process hangs up randomly > during a DataFrame flatMap or explode operation, in

Re: spark performance non-linear response

2015-10-07 Thread Sean Owen
OK, next question then is: if this is wall-clock time for the whole process, then, I wonder if you are just measuring the time taken by the longest single task. I'd expect the time taken by the longest straggler task to follow a distribution like this. That is, how balanced are the partitions?

Re: does KafkaCluster can be public ?

2015-10-06 Thread Sean Owen
For what it's worth, I also use this class in an app, but it happens to be from Java code where it acts as if it's public. So no problem for my use case, but I suppose, another small vote for the usefulness of this class to the caller. I end up using getLatestLeaderOffsets to figure out how to

Re: Broadcast var is null

2015-10-06 Thread Sean Owen
Yes, see https://issues.apache.org/jira/browse/SPARK-4170 The reason was kind of complicated, and the 'fix' was just to warn you against subclassing App! yes, use a main() method. On Tue, Oct 6, 2015 at 3:15 PM, Nick Peterson wrote: > This might seem silly, but... > >
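
A sketch of the recommended shape: fields of an object extending App are initialized via delayedInit, which is what bites here, so declare an explicit main():

    import org.apache.spark.{SparkConf, SparkContext}

    object MyJob {  // not "extends App"
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MyJob"))
        val bc = sc.broadcast(Map("a" -> 1))
        val total = sc.parallelize(1 to 10).map(i => i + bc.value("a")).reduce(_ + _)
        println(total)
        sc.stop()
      }
    }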

Re: preferredNodeLocationData, SPARK-8949, and SparkContext - a leftover?

2015-10-04 Thread Sean Owen
I think it's unused as the JIRA says, but removing it from the constructors would change the API, so that's why it stays in the signature. Removing the internal field and one usage of it seems OK, though I don't think it would help much of anything. On Sun, Oct 4, 2015 at 4:36 AM, Jacek Laskowski

Re: Examples module not building in intellij

2015-10-04 Thread Sean Owen
It builds for me. That message usually really means you can't resolve or download from a repo. It's just the last thing that happens to fail. On Sun, Oct 4, 2015 at 7:06 AM, Stephen Boesch wrote: > > For a week or two the trunk has not been building for the examples module >

Re: Stop a Dstream computation

2015-09-25 Thread Sean Owen
Your exception handling is occurring on the driver, where you 'configure' the job. I don't think it's what you mean to do. You probably mean to do this within a function you are executing on data within the cluster, like mapPartitions etc. On Fri, Sep 25, 2015 at 5:20 AM, Samya

Re: How to fix some WARN when submit job on spark 1.5 YARN

2015-09-24 Thread Sean Owen
You can ignore all of these. Various libraries can take advantage of native acceleration if libs are available but it's no problem if they don't. On Thu, Sep 24, 2015 at 3:25 AM, r7raul1...@163.com wrote: > 1 WARN netlib.BLAS: Failed to load implementation from: >

Re: Spark as standalone or with Hadoop stack.

2015-09-23 Thread Sean Owen
Might be better for another list, but, I suspect it's more that HBase is simply much more integrated with YARN, and because it's run with other services that are as well. On Wed, Sep 23, 2015 at 12:02 AM, Jacek Laskowski wrote: > That sentence caught my attention. Could you

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Sean Owen
The point is that this only works if you already knew the file was presented in order within and across partitions, which was the original problem anyway. I don't think it is in general, but in practice, I do imagine it's already in the expected order from textFile. Maybe under the hood this ends

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Sean Owen
I don't know of a way to do this, out of the box, without maybe digging into custom InputFormats. The RDD from textFile doesn't have an ordering. I can't imagine a world in which partitions weren't iterated in line order, of course, but there's also no real guarantee about ordering among

Re: Problem at sbt/sbt assembly

2015-09-21 Thread Sean Owen
Sbt asked for a bigger initial heap than the host had space for. It is a JVM error you can and should search for first. You will need more memory. On Mon, Sep 21, 2015, 2:11 AM Aaroncq4 <475715...@qq.com> wrote: > When I used “sbt/sbt assembly" to compile spark code of spark-1.5.0,I got a >

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
t; in my use > case. > > On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen <so...@cloudera.com> wrote: >> >> I think foldByKey is much more what you want, as it has more a notion >> of building up some result per key by encountering values serially. >> You would take t

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
y is not? > > On Mon, Sep 21, 2015 at 8:41 PM, Sean Owen <so...@cloudera.com> wrote: >> >> The zero value here is None. Combining None with any row should yield >> Some(row). After that, combining is a no-op for other rows. >> >> On Tue, Sep 22, 2015 at 4:2

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
I think foldByKey is much more what you want, as it has more of a notion of building up some result per key by encountering values serially. You would take the first and ignore the rest. Note that "first" depends on your RDD having an ordering to begin with, or else you rely on however it happens to
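
A sketch of the keep-first fold, lifting values into Option so the zero value is distinguishable (pairs is a hypothetical RDD[(String, String)]; within a partition values fold in iteration order, but as noted there is no hard ordering guarantee):

    val firstPerKey = pairs
      .mapValues(Option(_))
      .foldByKey(None: Option[String])((acc, v) => acc.orElse(v))
      .mapValues(_.get)  // every surviving key saw at least one value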

Re: change the spark version

2015-09-12 Thread Sean Owen
This is a question for the CDH list. CDH 5.4 has Spark 1.3, and 5.5 has 1.5. The best thing is to update CDH as a whole if you can. However it's pretty simple to just run a newer Spark assembly as a YARN app. Don't remove anything in the CDH installation. Try downloading the assembly and

Re: Avoiding SQL Injection in Spark SQL

2015-09-10 Thread Sean Owen
I don't think this is Spark-specific. Mostly you need to escape / quote user-supplied values as with any SQL engine. On Thu, Sep 10, 2015 at 7:32 AM, V Dineshkumar wrote: > Hi, > > What is the preferred way of avoiding SQL Injection while using Spark SQL? > In our
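
One general pattern is to keep user input out of the SQL string entirely; a sketch (sqlContext, the users table, and userInput are hypothetical):

    // Risky: splicing raw user input into SQL text
    val bad = sqlContext.sql(s"SELECT * FROM users WHERE name = '$userInput'")

    // Safer: pass the value through the DataFrame API instead
    import org.apache.spark.sql.functions.col
    val good = sqlContext.table("users").filter(col("name") === userInput)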

Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Sean Owen
rg/apache/spark/mllib/classification/NaiveBayes.scala > > // Adamantios > > > > On Thu, Sep 10, 2015 at 7:03 PM, Sean Owen <so...@cloudera.com> wrote: >> >> The log probabilities are unlikely to be very large, though the >> probabilities may be very

Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Sean Owen
i + brzTheta * testData.toBreeze is a big number > too), how can I get back the non-log-probabilities which - apparently - are > bounded between 0.0 and 1.0? > > > // Adamantios > > > > On Tue, Sep 1, 2015 at 12:57 PM, Sean Owen <so...@cloudera.com> wrote

Re: How to restrict java unit tests from the maven command line

2015-09-10 Thread Sean Owen
-Dtest=none ? https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests On Thu, Sep 10, 2015 at 10:39 PM, Stephen Boesch wrote: > > I have invoked mvn test with the -DwildcardSuites option to specify a single >

Re: 1.5 Build Errors

2015-09-08 Thread Sean Owen
It shows you there that Maven is out of memory. Give it more heap. I use 3gb. On Tue, Sep 8, 2015 at 1:53 PM, Benjamin Zaitlen wrote: > Hi All, > > I'm trying to build a distribution off of the latest in master and I keep > getting errors on MQTT and the build fails. I'm

Re: 1.5 Build Errors

2015-09-08 Thread Sean Owen
Ah, right. Should've caught that. > > The docs seem to recommend 2gb. Should that be increased as well? > > --Ben > > On Tue, Sep 8, 2015 at 9:33 AM, Sean Owen <so...@cloudera.com> wrote: >> >> It shows you there that Maven is out of memory. Give it more heap. I use

Re: Java vs. Scala for Spark

2015-09-08 Thread Sean Owen
Why would Scala vs Java performance be different Ted? Relatively speaking there is almost no runtime difference; it's the same APIs or calls via a thin wrapper. Scala/Java vs Python is a different story. Java libraries can be used in Scala. Vice-versa too, though calling Scala-generated classes

Re: 1.5 Build Errors

2015-09-08 Thread Sean Owen
run Maven with the >>>> -e switch. >>>> [ERROR] Re-run Maven using the -X switch to enable full debug logging. >>>> [ERROR] >>>> [ERROR] For more information about the errors and possible solutions, >>>> please read the following articles:

Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-06 Thread Sean Owen
I think somewhere along the line you've not specified your label column -- it's defaulting to "label" and it does not recognize it, or at least not as a binary or nominal attribute. On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole wrote: > Hi, Experts, > > I followed the guide
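
A sketch of wiring the label column explicitly in the spark.ml API (the column names are hypothetical):

    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.StringIndexer

    // StringIndexer attaches the nominal metadata the tree expects
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("indexedLabel")
    val dt = new DecisionTreeClassifier()
      .setLabelCol("indexedLabel")  // the default is "label"
      .setFeaturesCol("features")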

Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-06 Thread Sean Owen
iwC$$iwC.(:62) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:66) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:68) > at $iwC$$iwC$$iwC$$iwC.(:70) > at $iwC$$iwC$$iwC.(:72) > at $iwC$$iwC.(:74) > at

Re: Output files of saveAsText are getting stuck in temporary directory

2015-09-04 Thread Sean Owen
That means the save has not finished yet. Are you sure it did? it writes in _temporary while it's in progress On Fri, Sep 4, 2015 at 10:10 AM, Chirag Dewan wrote: > Hi, > > > > I have a 2 node Spark cluster and I am trying to read data from a Cassandra > cluster and

Re: Input size increasing every iteration of gradient boosted trees [1.4]

2015-09-03 Thread Sean Owen
nstances, boostingStrategy) > > Thanks, > Peter Rudenko > > On 2015-08-14 00:33, Sean Owen wrote: > > Not that I have any answer at this point, but I was discussing this > exact same problem with Johannes today. An input size of ~20K records > was growing each iteration

Re: What is the current status of ML ?

2015-09-01 Thread Sean Owen
I think the story is that the new spark.ml "pipelines" API is the future. Most (all?) of the spark.mllib functionality has been ported over and/or translated. I don't know that spark.mllib will actually be deprecated soon -- not until spark.ml is fully blessed as 'stable' I'd imagine, at least.

Re: How to compute the probability of each class in Naive Bayes

2015-09-01 Thread Sean Owen
(pedantic: it's the log-probabilities) On Tue, Sep 1, 2015 at 10:48 AM, Yanbo Liang wrote: > Actually > brzPi + brzTheta * testData.toBreeze > is the probabilities of the input Vector on each class, however it's a > Breeze Vector. > Pay attention the index of this Vector need
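
If actual probabilities are wanted, the usual conversion from per-class log-probabilities is normalization with the max-subtraction trick; a sketch:

    def fromLogProbs(logProbs: Array[Double]): Array[Double] = {
      val max = logProbs.max  // subtract the max to avoid exp underflow/overflow
      val unnormalized = logProbs.map(lp => math.exp(lp - max))
      val total = unnormalized.sum
      unnormalized.map(_ / total)
    }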

Re: Spark shell and StackOverFlowError

2015-08-31 Thread Sean Owen
ouble))) > sc.makeRDD(lst).map(i=> if(a==10) 1 else 0) > > -Ashish > > > On Sun, Aug 30, 2015 at 2:52 PM Sean Owen <so...@cloudera.com> wrote: >> >> I'm not sure how to reproduce it? this code does not produce an error in >> master. >> >> On

Re: Unable to build Spark 1.5, is build broken or can anyone successfully build?

2015-08-31 Thread Sean Owen
I don't think it's an 'issue'; the build does in fact require Maven 3.3.3 and Java 7 on purpose. If you don't have Maven 3.3.3, run 'build/mvn' or ('build/mvn --force' if you have an older Maven locally) to get 3.3.3. On Mon, Aug 31, 2015 at 6:27 AM, Kevin Jung wrote: > I

Re: Spark shell and StackOverFlowError

2015-08-31 Thread Sean Owen
; to do with Spark per se. Or, have a look at others related to the >> closure and shell and you may find this is related to other known >> behavior. >> >> >> On Sun, Aug 30, 2015 at 8:08 PM, Ashish Shrowty >> <ashish.shro...@gmail.com> wrote: >> &

Re: Spark shell and StackOverFlowError

2015-08-30 Thread Sean Owen
That can't cause any error, since there is no action in your first snippet. Even calling count on the result doesn't cause an error. You must be executing something different. On Sun, Aug 30, 2015 at 4:21 AM, ashrowty ashish.shro...@gmail.com wrote: I am running the Spark shell (1.2.1) in local
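
The lazy-evaluation point in a sketch (riskyComputation is hypothetical):

    val rdd = sc.makeRDD(1 to 100)
    // A transformation only builds a plan; nothing executes, so nothing can fail yet
    val mapped = rdd.map(i => riskyComputation(i))
    // Only an action runs the job and surfaces any error:
    mapped.count()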

Re: Spark shell and StackOverFlowError

2015-08-30 Thread Sean Owen
did. Thanks On Sun, Aug 30, 2015 at 12:24 AM, Sean Owen so...@cloudera.com wrote: That can't cause any error, since there is no action in your first snippet. Even calling count on the result doesn't cause an error. You must be executing something different. On Sun, Aug 30, 2015 at 4:21

Re: correct use of DStream foreachRDD

2015-08-28 Thread Sean Owen
Yes, for example val sensorRDD = rdd.map(Sensor.parseSensor) is a line of code executed on the driver; it's part of the function you supplied to foreachRDD. However that line defines an operation on an RDD, and the map function you supplied (parseSensor) will ultimately be carried out on the cluster.
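
Annotated, the pattern in question looks like this (the output path is hypothetical):

    dStream.foreachRDD { rdd =>
      // this block runs on the driver, once per batch...
      val sensorRDD = rdd.map(Sensor.parseSensor)  // ...but this only *defines* the map;
                                                   // parseSensor itself runs on executors
      sensorRDD.saveAsTextFile(s"/out/batch-${System.currentTimeMillis}")  // action: triggers the job
    }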

Re: [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile

2015-08-27 Thread Sean Owen
Hm, if anything that would be related to my change at https://issues.apache.org/jira/browse/SPARK-9613. Let me investigate. There's some chance there is some Scala 2.11-specific problem here. On Thu, Aug 27, 2015 at 8:51 AM, Jacek Laskowski ja...@japila.pl wrote: Hi, Is this a known issue of

Re: reduceByKey not working on JavaPairDStream

2015-08-26 Thread Sean Owen
I don't see that you invoke any action in this code. It won't do anything unless you tell it to perform an action that requires the transformations. On Wed, Aug 26, 2015 at 7:05 AM, Deepesh Maheshwari deepesh.maheshwar...@gmail.com wrote: Hi, I have applied mapToPair and then a reduceByKey on a

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Sean Owen
Yes, you're right that it's quite on purpose to leave this API to Breeze, in the main. As you can see the Spark objects have already sprouted a few basic operations anyway; there's a slippery slope problem here. Why not addition, why not dot products, why not determinants, etc. What about

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Sean Owen
Yes I get all that too and I think there's a legit question about whether moving a little further down the slippery slope is worth it and if so how far. The other catch here is: either you completely mimic another API (in which case why not just use it directly, which has its own problems) or you
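
The pragmatic workaround is a round-trip through Breeze; a sketch (copies the data both ways):

    import breeze.linalg.{DenseVector => BDV}
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    def add(a: Vector, b: Vector): Vector =
      Vectors.dense((BDV(a.toArray) + BDV(b.toArray)).toArray)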

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Sean Owen
: Thanks Sean. So how PySpark is supported. I thought PySpark needs jdk 1.6. Chen On Fri, Aug 21, 2015 at 11:16 AM, Sean Owen so...@cloudera.com wrote: Spark 1.4 requires Java 7. On Fri, Aug 21, 2015, 3:12 PM Chen Song chen.song...@gmail.com wrote: I tried to build Spark
