Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Sean Owen
number of http calls, I can make. I > can't boost more number of http calls in single executors, I mean - I can't > go beyond the threshold of number of executors. > > On Thu, May 14, 2020 at 6:26 PM Sean Owen wrote: >> >> Default is not 200, but the number of ex

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Sean Owen
Does it have a > relationship with the number of cores? 8 cores - 4 workers. is not it like I > can do only 8 * 4 = 32 http calls. Because in Spark number of partitions = > number of cores is untrue. > > Thanks > > On Thu, May 14, 2020 at 6:11 PM Sean Owen wrote: > >> Yes any code that you

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Sean Owen
Yes any code that you write in code that you apply with Spark runs in the executors. You would be running as many HTTP clients as you have partitions. On Thu, May 14, 2020 at 4:31 PM Jerry Vinokurov wrote: > > I believe that if you do this within the context of an operation that is > already
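The point about one HTTP client per partition can be sketched outside Spark with plain Python, treating each "partition" as a unit of work handled by one worker thread — the URLs and `fetch` function here are hypothetical stand-ins, not anything from the thread itself:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Hypothetical stand-in for the HTTP call made inside a Spark task.
    return f"response:{url}"

def process_partition(urls):
    # Inside a real mapPartitions function, one HTTP client would serve
    # every record of this partition.
    return [fetch(u) for u in urls]

urls = [f"http://example.com/{i}" for i in range(8)]
num_partitions = 4  # analogous to rdd.repartition(4)
partitions = [urls[i::num_partitions] for i in range(num_partitions)]

# As many concurrent clients as partitions -- the point made above.
with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    results = [r for part in pool.map(process_partition, partitions)
               for r in part]

print(len(results))  # 8
```

Raising the number of partitions (up to what the executors can schedule) is what raises the number of concurrent calls, which is the answer to the original question.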

Re: Watch "Airbus makes more of the sky with Spark - Jesse Anderson & Hassene Ben Salem" on YouTube

2020-05-03 Thread Sean Owen
, 2020 at 11:45 AM Fuo Bol wrote: > > @Sean Owen > > Why did you remove email zahidr1...@gmail.com following this query. ? > > The two responses were then sent to a removed email account. > > > > > > -- Forwarded message - > > From: >

Re: Good idea to do multi-threading in spark job?

2020-05-03 Thread Sean Owen
Spark will by default assume each task needs 1 CPU. On an executor with 16 cores and 16 slots, you'd schedule 16 tasks. If each is using 4 cores, then 64 threads are trying to run. If you're CPU-bound, that could slow things down. But to the extent some of the tasks take some time blocking on I/O, it
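As a rough sketch of the arithmetic above — `spark.task.cpus` is the actual Spark setting for declaring per-task CPU needs, and the division below just mirrors how the scheduler derives slot counts (the numbers are the ones from the message):

```python
def concurrent_tasks(executor_cores, task_cpus=1):
    # Spark schedules floor(executor cores / spark.task.cpus) tasks at once.
    return executor_cores // task_cpus

# Default: 1 CPU per task -> 16 slots on a 16-core executor.
print(concurrent_tasks(16))               # 16
# If each task spawns 4 threads but still declares 1 CPU,
# 16 tasks x 4 threads = 64 threads compete for 16 cores.
print(concurrent_tasks(16) * 4)           # 64
# Declaring the real need avoids oversubscription:
print(concurrent_tasks(16, task_cpus=4))  # 4
```

Setting `spark.task.cpus=4` for a job whose tasks genuinely use 4 threads keeps the thread count matched to the cores, at the cost of fewer concurrent tasks.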

The new sock-puppet account sending the last few emails has been banned

2020-05-01 Thread Sean Owen
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Lockdown since 5th August 2019 10,000,000 Kashmiri by 900,000 Indian Soldiers

2020-04-30 Thread Sean Owen
Before anyone asks: yes this is banned immediately.

On spam messages

2020-04-29 Thread Sean Owen
I am subscribed to this list to watch for a certain person's new accounts, which are posting obviously off-topic and inappropriate messages. It goes without saying that this is unacceptable and a CoC violation, and anyone posting that will be immediately removed and blocked. In the meantime,

Re: Copyright Infringment

2020-04-25 Thread Sean Owen
You'll want to ask the authors directly; the book is not produced by the project itself, so it can't be answered here. On Sat, Apr 25, 2020, 8:42 AM Som Lima wrote: > At the risk of being removed from the emailing I would like a > clarification because I do not want to commit an unlawful act. > Can

Re: [Meta] Moderation request diversion?

2020-04-24 Thread Sean Owen
The mailing lists are operated by the ASF. I've asked whether it's possible here: https://issues.apache.org/jira/browse/INFRA-20186 On Fri, Apr 24, 2020 at 12:39 PM Jeff Evans wrote: > > Still noticing this problem quite a bit, both on the user and dev lists. I > notice that it appears to be

Re: Spark stuck at removing broadcast variable

2020-04-18 Thread Sean Owen
I don't think that means it's stuck on removing something; it was removed. Not sure what it is waiting on - more data perhaps? On Sat, Apr 18, 2020 at 2:22 PM Alchemist wrote: > > I am running a simple Spark structured streaming application that is pulling > data from a Kafka Topic. I have a

Re: Spark-3.0.0 GA

2020-04-17 Thread Sean Owen
The second release candidate will come soon. I would guess it all completes by the end of May, myself, but no guarantees. On Fri, Apr 17, 2020 at 6:30 AM Marshall Markham wrote: > > Hi, > > > > I realize this was probably not responded to because either the date is > unclear or explicitly

Re: Going it alone.

2020-04-16 Thread Sean Owen
Absolutely unacceptable even if this were the only one. I'm contacting INFRA right now. On Thu, Apr 16, 2020 at 11:57 AM Holden Karau wrote: > I want to be clear I believe the language in janethrope1s email is > unacceptable for the mailing list and possibly a violation of the Apache > code of

Re: wot no toggle ?

2020-04-16 Thread Sean Owen
Yes, this kind of message is not welcome on this list. At best the wording is ... odd, but the tone is combative. It's not even clear what the question is. This user has posted several other messages with the same type of issue, uncannily like those of "Zahid Raman" last month. This list has

Re: OFF TOPIC LIST CRITERIA

2020-03-27 Thread Sean Owen
the lists. Sean On Fri, Mar 27, 2020 at 9:46 PM Zahid Rahman wrote: > > > Sean Owen says the criteria of these two emailing list is not help to support > some body > who is new but for people who have been using the software for a long time. > > He is implying I think th

Re: what a plava !

2020-03-27 Thread Sean Owen
Spark standalone is a resource manager like YARN and Mesos. It is specific to Spark, and is therefore simpler, as it assumes it can take over whole machines. YARN and Mesos are for mediating resource usage across applications on a cluster, which may be running more than Spark apps. On Fri, Mar

Re: what a plava !

2020-03-27 Thread Sean Owen
- dev@, which is more for project devs to communicate. Cross-posting is discouraged too. The book isn't from the Spark OSS project, so not really the place to give feedback here. I don't quite understand the context of your other questions, but would elaborate them in individual, clear emails

Re: Is it feasible to build and run Spark on Windows?

2019-12-05 Thread Sean Owen
la:180) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) > at org.apache.spar

Re: Is it feasible to build and run Spark on Windows?

2019-12-05 Thread Sean Owen
What was the build error? you didn't say. Are you sure it succeeded? Try running from the Spark home dir, not bin. I know we do run Windows tests and it appears to pass tests, etc. On Thu, Dec 5, 2019 at 3:28 PM Ping Liu wrote: > > Hello, > > I understand Spark is preferably built on Linux. But

Re: Urgent : Changes required in the archive

2019-09-26 Thread Sean Owen
The message in question has already been public, and copied to mirrors the ASF does not control, for a year and a half. There is a process for requesting modification to ASF archives, but this case does not qualify: https://www.apache.org/foundation/public-archives.html On Thu, Sep 26, 2019 at

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Sean Owen
Seems fine to me if there are enough valuable fixes to justify another release. If there are any other important fixes imminent, it's fine to wait for those. On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun wrote: > > Hi, All. > > Spark 2.4.3 was released three months ago (8th May). > As of today

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Sean Owen
big for that high GC. > > But what's troubling me is this issue doesn't occur in Spark 2.2 at all. What > could be the reason behind such a behaviour? > > Regards, > Dhrub > > On Mon, Jul 29, 2019 at 6:45 PM Sean Owen wrote: >> >> -dev@ >> >> Yep, hig

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Sean Owen
-dev@ Yep, high GC activity means '(almost) out of memory'. I don't see that you've checked heap usage - is it nearly full? The answer isn't tuning but more heap. (Sometimes with really big heaps the problem is big pauses, but that's not the case here.) On Mon, Jul 29, 2019 at 1:26 AM

Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-09 Thread Sean Owen
We will certainly want a 2.4.4 release eventually. In fact I'd expect 2.4.x gets maintained for longer than the usual 18 months, as it's the last 2.x branch. It doesn't need to happen before 3.0, but could. Usually maintenance releases happen 3-4 months apart and the last one was 2 months ago. If

Re: Should python-2 be supported in Spark 3.0?

2019-05-29 Thread Sean Owen
Deprecated -- certainly and sooner than later. I don't have a good sense of the overhead of continuing to support Python 2; is it large enough to consider dropping it in Spark 3.0? On Wed, May 29, 2019 at 11:47 PM Xiangrui Meng wrote: > > Hi all, > > I want to revive this old thread since no

Re: Access to live data of cached dataFrame

2019-05-17 Thread Sean Owen
A cached DataFrame isn't supposed to change, by definition. You can re-read each time or consider setting up a streaming source on the table which provides a result that updates as new data comes in. On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos wrote: > > Hello, > > I have a cached dataframe:

CVE-2018-17190: Unsecured Apache Spark standalone executes user code

2018-11-18 Thread Sean Owen
Severity: Low Vendor: The Apache Software Foundation Versions Affected: All versions of Apache Spark Description: Spark's standalone resource manager accepts code to execute on a 'master' host, that then runs that code on 'worker' hosts. The master itself does not, by design, execute user code.

CVE-2018-11804: Apache Spark build/mvn runs zinc, and can expose information from build machines

2018-10-24 Thread Sean Owen
Severity: Low Vendor: The Apache Software Foundation Versions Affected: 1.3.x release branch and later, including master Description: Spark's Apache Maven-based build includes a convenience script, 'build/mvn', that downloads and runs a zinc server to speed up compilation. This server will

CVE-2018-11770: Apache Spark standalone master, Mesos REST APIs not controlled by authentication

2018-08-13 Thread Sean Owen
Severity: Medium Vendor: The Apache Software Foundation Versions Affected: Spark versions from 1.3.0, running standalone master with REST API enabled, or running Mesos master with cluster mode enabled Description: From version 1.3.0 onward, Spark's standalone master exposes a REST API for job

CVE-2018-8024 Apache Spark XSS vulnerability in UI

2018-07-11 Thread Sean Owen
Severity: Medium Vendor: The Apache Software Foundation Versions Affected: Spark versions through 2.1.2 Spark 2.2.0 through 2.2.1 Spark 2.3.0 Description: In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, it's possible for a malicious user to construct a URL pointing to a

CVE-2018-1334 Apache Spark local privilege escalation vulnerability

2018-07-11 Thread Sean Owen
Severity: High Vendor: The Apache Software Foundation Versions affected: Spark versions through 2.1.2 Spark 2.2.0 to 2.2.1 Spark 2.3.0 Description: In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, when using PySpark or SparkR, it's possible for a different local user to

Re: Spark scala development in Sbt vs Maven

2018-03-05 Thread Sean Owen
Spark uses Maven as the primary build, but SBT works as well. It reads the Maven build to some extent. Zinc incremental compilation works with Maven (with the Scala plugin for Maven). Myself, I prefer Maven, for some of the reasons it is the main build in Spark: declarative builds end up being a

Re: Palantir release under org.apache.spark?

2018-01-09 Thread Sean Owen
Just to follow up -- those are actually in a Palantir repo, not Central. Deploying to Central would be uncourteous, but this approach is legitimate and how it has to work for vendors to release distros of Spark etc. On Tue, Jan 9, 2018 at 11:43 AM Nan Zhu wrote: > Hi,

Re: Anyone knows how to build and spark on jdk9?

2017-10-27 Thread Sean Owen
Certainly, Scala 2.12 support precedes Java 9 support. A lot of the work is in place already, and the last issue is dealing with how Scala closures are now implemented quite differently with lambdas / invokedynamic. This affects the ClosureCleaner. For the interested, this is as far as I know the

Re: Should Flume integration be behind a profile?

2017-10-02 Thread Sean Owen
> On Sun, Oct 1, 2017 at 3:50 PM, Reynold Xin <r...@databricks.com> wrote: >> > Probably should do 1, and then it is an easier transition in 3.0. >> > >> > On Sun, Oct 1, 2017 at 1:28 AM Sean Owen <so...@cloudera.com> wrote: >> >> >> >> I tried an

CVE-2017-12612 Unsafe deserialization in Apache Spark launcher API

2017-09-08 Thread Sean Owen
Severity: Medium Vendor: The Apache Software Foundation Versions Affected: Versions of Apache Spark from 1.6.0 until 2.1.1 Description: In Apache Spark 1.6.0 until 2.1.1, the launcher API performs unsafe deserialization of data received by its socket. This makes applications launched

CVE-2017-7678 Apache Spark XSS web UI MHTML vulnerability

2017-07-12 Thread Sean Owen
Severity: Low Vendor: The Apache Software Foundation Versions Affected: Versions of Apache Spark before 2.2.0 Description: It is possible for an attacker to take advantage of a user's trust in the server to trick them into visiting a link that points to a shared Spark cluster and submits data

Re: Question on Spark code

2017-06-25 Thread Sean Owen
rg/slf4j/simple/SimpleLogger.java#L599 > > Please correct me if I am wrong. > > > > > On Sun, Jun 25, 2017 at 3:04 AM, Sean Owen <so...@cloudera.com> wrote: > >> Maybe you are looking for declarations like this. "=> String" means the >> arg i

Re: Could you please add a book info on Spark website?

2017-06-25 Thread Sean Owen
Please get Packt to fix their existing PR. It's been open for months https://github.com/apache/spark-website/pull/35 On Sun, Jun 25, 2017 at 12:33 PM Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > Hi Sean, > > Last time, you helped me add a book info (in the books section) on this

Re: Question on Spark code

2017-06-25 Thread Sean Owen
Maybe you are looking for declarations like this. "=> String" means the arg isn't evaluated until it's used, which is just what you want with log statements. The message isn't constructed unless it will be logged. protected def logInfo(msg: => String) { On Sun, Jun 25, 2017 at 10:28 AM kant
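Python has no by-name parameters, but the same deferral can be imitated by passing a zero-argument callable — a rough analog of the Scala signature quoted above, not Spark's actual API:

```python
calls = 0

def expensive_message():
    # Stands in for building a large log string (hypothetical).
    global calls
    calls += 1
    return "big state dump..."

def log_info(msg, enabled):
    # msg is a zero-arg callable: like Scala's "msg: => String",
    # it is only evaluated if the message will actually be logged.
    if enabled:
        print(msg())

log_info(expensive_message, enabled=False)  # message never constructed
log_info(expensive_message, enabled=True)   # constructed and printed once
```

The payoff is the same as in Spark's `logInfo`: when the log level is disabled, the cost of building the message is never paid.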

Re: the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread Sean Owen
Yes. Imagine an RDD that results from a union of other RDDs. On Thu, Jun 15, 2017, 09:11 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > The RDD code keeps a member as below: > dependencies_ : seq[Dependency[_]] > > It is a seq, that means it can keep more than one dependency. > > I have an issue

Re: Which streaming platform is best? Kafka or Spark Streaming?

2017-03-10 Thread Sean Owen
Kafka and Spark Streaming don't do the same thing. Kafka stores and transports data, Spark Streaming runs computations on a stream of data. Neither is itself a streaming platform in its entirety. It's kind of like asking whether you should build a website using just MySQL, or nginx. > On 9 Mar

Re: Apparent memory leak involving count

2017-03-09 Thread Sean Owen
The driver keeps metrics on everything that has executed. This is how it can display the history in the UI. It's normal for the bookkeeping to keep growing because it's recording every job. You can configure it to keep records about fewer jobs. But thousands of entries isn't exactly big. On Thu,

Re: Wrong runtime type when using newAPIHadoopFile in Java

2017-03-06 Thread Sean Owen
I think this is the same thing we already discussed extensively on your JIRA. The type of the key/value class argument to newAPIHadoopFile are not the type of your custom class, but of the Writable describing encoding of keys and values in the file. I think that's the start of part of the

Re: LinearRegressionModel - Negative Predicted Value

2017-03-06 Thread Sean Owen
There's nothing unusual about negative values from a linear regression. If, generally, your predicted values are far from your actual values, then your model hasn't fit well. You may have a bug somewhere in your pipeline or you may have data without much linear relationship. Most of this isn't a

Re: Spark Streaming - java.lang.ClassNotFoundException Scala anonymous function

2017-03-01 Thread Sean Owen
What is the --jars you are submitting? You may have conflicting copies of Spark classes that interfere. On Wed, Mar 1, 2017, 14:20 Dominik Safaric wrote: > I've been trying to submit a Spark Streaming application using > spark-submit to a cluster of mine consisting of

Re: DataFrame from in memory datasets in multiple JVMs

2017-02-28 Thread Sean Owen
Broadcasts let you send one copy of read-only data to each executor. That's not the same as a DataFrame, and its nature means it doesn't make sense to think of a broadcast as distributed data. But consider things like broadcast hash joins which may be what you are looking for if you really mean to join

Re: is dataframe thread safe?

2017-02-12 Thread Sean Owen
No this use case is perfectly sensible. Yes it is thread safe. On Sun, Feb 12, 2017, 10:30 Jörn Franke wrote: > I think you should have a look at the spark documentation. It has > something called scheduler who does exactly this. In more sophisticated > environments yarn

Re: Remove dependence on HDFS

2017-02-12 Thread Sean Owen
Data has to live somewhere -- how do you not add storage but store more data? Alluxio is not persistent storage, and S3 isn't on your premises. On Sun, Feb 12, 2017 at 4:29 AM Benjamin Kim wrote: > Has anyone got some advice on how to remove the reliance on HDFS for >

Re: Scala Developers

2017-01-25 Thread Sean Owen
Yes, job postings are strongly discouraged on ASF lists, if not outright disallowed. You will see, sometimes posts prefixed with [JOBS] that are tolerated, but here I would assume they are not. This particular project and list is so big that there is no job posting I can imagine that is relevant

Re: Non-linear (curved?) regression line

2017-01-20 Thread Sean Owen
I don't think this is a Spark question. This isn't a problem you solve by throwing all combinations of options at it. Your target is not a linear function of input, or its square, and it's not a question of GLM link function. You may need to look at the log-log plot because this looks like a

Re: "Unable to load native-hadoop library for your platform" while running Spark jobs

2017-01-19 Thread Sean Owen
It's a message from Hadoop libs, not Spark. It can be safely ignored. It's just saying you haven't installed the additional (non-Apache-licensed) native libs that can accelerate some operations. This is something you can easily have read more about online. On Thu, Jan 19, 2017 at 10:57 AM Md.

Re: Accumulators and Datasets

2017-01-18 Thread Sean Owen
Accumulators aren't related directly to RDDs or Datasets. They're a separate construct. You can imagine updating accumulators in any distributed operation that you see documented for RDDs or Datasets. On Wed, Jan 18, 2017 at 2:16 PM Hanna Mäki wrote: > Hi, > > The

Re: Middleware-wrappers for Spark

2017-01-17 Thread Sean Owen
On Tue, Jan 17, 2017 at 4:49 PM Rick Moritz wrote: > * Oryx2 - This was more focused on a particular issue, and looked to be a > very nice framework for deploying real-time analytics --- but again, no > real traction. In fact, I've heard of PoCs being done by/for Cloudera, to >

Re: What can mesos or yarn do that spark standalone cannot do?

2017-01-15 Thread Sean Owen
The biggest thing that any resource manager besides Spark's standalone resource manager can do is manage other application resources. In a cluster where you are running other workloads, you can't use Spark standalone to arbitrate resource requirements across apps. On Sun, Jan 15, 2017 at 1:55 PM

Re: Unable to build spark documentation

2017-01-11 Thread Sean Owen
Are you using Java 8? Hyukjin fixed up all the errors due to the much stricter javadoc 8, but it's possible some creep back in because there is no Java 8 test now. On Wed, Jan 11, 2017 at 6:22 PM Krishna Kalyan wrote: > Hello, > I have been trying to build spark

Re: Time-Series Analysis with Spark

2017-01-11 Thread Sean Owen
https://github.com/sryza/spark-timeseries ? On Wed, Jan 11, 2017 at 10:11 AM Rishabh Bhardwaj wrote: > Hi All, > > I am exploring time-series forecasting with Spark. > I have some questions regarding this: > > 1. Is there any library/package out there in community of

Re: spark-shell running out of memory even with 6GB ?

2017-01-10 Thread Sean Owen
Maybe ... here are a bunch of things I'd check: Are you running out of memory, or just see a lot of mem usage? JVMs will happily use all the memory you allow them even if some of it could be reclaimed. Did the driver run out of mem? did you give 6G to the driver or executor? OOM errors do show

Re: Spark 2.0.2, KyroSerializer, double[] is not registered.

2017-01-08 Thread Sean Owen
does it break in spark? > > > > On Sun, Jan 8, 2017 at 6:03 PM, Sean Owen <so...@cloudera.com> wrote: > > Double[] is not of the same class as double[]. Kryo should already know > how to serialize double[], but I doubt Double[] is registered. > > The error does

Re: Spark 2.0.2, KyroSerializer, double[] is not registered.

2017-01-08 Thread Sean Owen
Double[] is not of the same class as double[]. Kryo should already know how to serialize double[], but I doubt Double[] is registered. The error does seem to clearly indicate double[] though. That surprises me. Can you try manually registering it to see if that fixes it? But then I'm not sure

Re: spark sql in Cloudera package

2017-01-04 Thread Sean Owen
(You can post this on the CDH lists BTW as it's more about that distribution.) The whole thrift server isn't supported / enabled in CDH, so I think that's why the script isn't turned on either. I don't think it's as much about using Impala as not wanting to do all the grunt work to make it

Re: TallSkinnyQR

2016-12-30 Thread Sean Owen
duplication. > > I have fixed the problem by what I mentioned above. Now, multiply, > computeSVD, and tallSkinnyQR are giving the correct results for > indexedRowMatrix when using multiple executors or workers. Let me know if > I should do a pull request for this. > > Best, > Hua

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
divide the current > cosine output by the norm of this vector. And this vector we can get by > doing model.transform('science') if I am right? > > Lastly, I would be very happy to update the docs if it is editable for all > the things I encounter as not mentioned or not very clear. > >

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
I am very open to editing the docs on things I find not properly > documented or wrong, but I need to know if that is allowed (is it like a > Wiki)? > > On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen <so...@cloudera.com> wrote: > > It should be the cosine similarity,

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
It should be the cosine similarity, yes. I think this is what was fixed in https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was really just outputting the 'unnormalized' similarity (dot / norm(a) only) but the docs said cosine similarity. Now it's cosine similarity in Spark 2. The
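The before/after behaviour described here can be reproduced with two small functions (plain Python illustrating the formulas, not the Word2Vec code itself):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def unnormalized_sim(a, b):
    # Pre-SPARK-7617 behaviour: dot / norm(a) only -- can exceed 1.
    return dot(a, b) / norm(a)

def cosine_sim(a, b):
    # Spark 2 behaviour: true cosine similarity, always in [-1, 1].
    return dot(a, b) / (norm(a) * norm(b))

a, b = [3.0, 4.0], [6.0, 8.0]
print(unnormalized_sim(a, b))  # 10.0 -- a "similarity" greater than 1
print(cosine_sim(a, b))        # 1.0
```

Dividing the old output by `norm(b)` recovers the true cosine similarity, which is what the questioner was advised to do on older versions.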

Re: Invert large matrix

2016-12-29 Thread Sean Owen
I think the best advice is: don't do that. If you're trying to solve a linear system, solve the linear system without explicitly constructing a matrix inverse. Is that what you mean? On Thu, Dec 29, 2016 at 2:22 AM Yanwei Wayne Zhang < actuary_zh...@hotmail.com> wrote: > Hi all, > > > I have a

Re: [ On the use of Spark as 'storage system']

2016-12-21 Thread Sean Owen
Spark isn't a storage system -- it's a batch processing system at heart. To "serve" something means to run a distributed computation scanning partitions for an element and collect it to a driver and return it. Although that could be fast-enough for some definition of fast, it's going to be orders

Re: Gradle dependency problem with spark

2016-12-16 Thread Sean Owen
Yes, that's the problem. Guava isn't generally mutually compatible across more than a couple major releases. You may have to hunt for a version that happens to have the functionality that both dependencies want, and hope that exists. Spark should shade Guava at this point, but that doesn't mean that you

Re: Negative values of predictions in ALS.tranform

2016-12-16 Thread Sean Owen
measure. That is because in explicit we are not using > the confidence matrix and preference matrix concept and use the actual > rating data. So any output from Spark ALS for explicit data would be a > rating prediction. > > On Thu, Dec 15, 2016 at 3:46 PM, Sean Owen &

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Sean Owen
and 0/1 matrix to find > the user and item factors. > > On Thu, Dec 15, 2016 at 3:38 PM, Sean Owen <so...@cloudera.com> wrote: > > No, you can't interpret the output as probabilities at all. In particular > they may be negative. It is not predicting rating but intera

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Sean Owen
Please note, for implicit feedback ALS, we don't feed 0/1 matrix. We > feed the count matrix (discrete count values) and am assuming spark > internally converts it into a preference matrix (1/0) and a confidence > matrix = 1 + alpha*count_matrix > > > > > > On Thu, Dec 15

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Sean Owen
No, ALS is not modeling probabilities. The outputs are reconstructions of a 0/1 matrix. Most values will be in [0,1], but, it's possible to get values outside that range. On Thu, Dec 15, 2016 at 10:21 PM Manish Tripathi wrote: > Hi > > ran the ALS model for implicit
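A tiny numeric illustration of why predictions can leave [0,1]: an ALS prediction is just the dot product of the learned user and item factor vectors, and nothing constrains that to a range (the factor values below are invented):

```python
def predict(user_factors, item_factors):
    # ALS prediction: dot product of user and item factor vectors.
    return sum(u * i for u, i in zip(user_factors, item_factors))

u = [0.9, -0.4]          # hypothetical learned user factors
good_item = [1.1, 0.2]   # hypothetical learned item factors
odd_item = [-0.5, 0.8]

print(predict(u, good_item))  # 0.91 -- inside [0, 1]
print(predict(u, odd_item))   # -0.77 -- negative, and that's expected
```

The reconstruction aims at a 0/1 matrix, so values cluster in [0,1], but individual dot products can land anywhere.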

Re: WARN util.NativeCodeLoader

2016-12-07 Thread Sean Owen
You can ignore it. You can also install the native libs in question but it's just a minor accelerator. On Thu, Dec 8, 2016 at 2:36 PM baipeng wrote: > Hi ALL > > I’m new to Spark.When I execute spark-shell, the first line is as follows > WARN util.NativeCodeLoader: Unable to

Re: [GraphX] Extreme scheduler delay

2016-12-06 Thread Sean Owen
(For what it is worth, I happened to look into this with Anton earlier and am also pretty convinced it's related to GraphX rather than the app. It's somewhat difficult to debug what gets sent in the closure AFAICT.) On Tue, Dec 6, 2016 at 7:49 PM AntonIpp wrote: > Hi

Re: How to compute the recall and F1-score in Linear Regression based model

2016-12-06 Thread Sean Owen
> *Md. Rezaul Karim* BSc, MSc > PhD Researcher, INSIGHT Centre for Data Analytics > National University of Ireland, Galway > IDA Business Park, Dangan, Galway, Ireland > Web: http://www.reza-analytics.eu/index.html > <http://139.59.184.114/index.html> > > On 6 December 2016 at

Re: How to compute the recall and F1-score in Linear Regression based model

2016-12-06 Thread Sean Owen
Precision, recall and F1 are metrics for binary classifiers, not regression models. Can you clarify what you intend to do? On Tue, Dec 6, 2016, 19:14 Md. Rezaul Karim wrote: > Hi Folks, > > I have the following code snippet in Java that can calculate the
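For reference, the three metrics are defined over a binary confusion matrix, not over regression residuals — a minimal sketch:

```python
def binary_metrics(y_true, y_pred):
    # Counts over a binary confusion matrix (labels are 0 or 1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = binary_metrics([1, 1, 0, 0], [1, 0, 1, 0])
print(p, r, f1)  # 0.5 0.5 0.5
```

A regression model has to be turned into a classifier (e.g. by thresholding its output) before these metrics mean anything.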

Re: Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Sean Owen
1 > ls.add(srcFactor, (c1 + 1.0) / c1, c1) > } > } else { > ls.add(srcFactor, rating) > numExplicits += 1 > } > {code} > > Regards, > > Jerry > > > On Mon, Dec 5, 2016 at 3:27 PM, Sean Owen <s

Re: Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Sean Owen
paper? > > Best Regards, > > Jerry > > > On Mon, Dec 5, 2016 at 2:43 PM, Sean Owen <so...@cloudera.com> wrote: > > What are you referring to in what paper? implicit input would never > materialize 0s for missing values. > > On Tue, Dec 6, 2016 at 3:42 AM Jerry

Re: Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Sean Owen
What are you referring to in what paper? implicit input would never materialize 0s for missing values. On Tue, Dec 6, 2016 at 3:42 AM Jerry Lam wrote: > Hello spark users and developers, > > I read the paper from Yahoo about CF with implicit feedback and other > papers

Re: TallSkinnyQR

2016-12-02 Thread Sean Owen
y available (R) > > On Fri, Nov 11, 2016 at 3:56 AM Sean Owen <so...@cloudera.com> wrote: > > @Xiangrui / @Joseph, do you think it would be reasonable to have > CoordinateMatrix sort the rows it creates to make an IndexedRowMatrix? in > order to make the ultimate output of t

Re: Is there a processing speed difference between DataFrames and Datasets?

2016-11-22 Thread Sean Owen
DataFrames are a narrower, more specific type of abstraction, for tabular data. Where your data is tabular, it makes more sense to use, especially because this knowledge means a lot more can be optimized under the hood for you, whereas the framework can do nothing with an RDD of arbitrary objects.

Re: Reading LZO files with Spark

2016-11-19 Thread Sean Owen
Are you missing the hadoop-lzo package? it's not part of Hadoop/Spark. On Sat, Nov 19, 2016 at 4:20 AM learning_spark < dibyendu.chakraba...@gmail.com> wrote: > Hi Users, I am not sure about the latest status of this issue: > https://issues.apache.org/jira/browse/SPARK-2394 However, I have seen

Re: does column order matter in dataframe.repartition?

2016-11-17 Thread Sean Owen
It's not in general true that 100 different partitions keys go to 100 partitions -- it depends on the partitioner, but wouldn't be true in the case of a default HashPartitioner. But, yeah you'd expect a reasonably even distribution. What happens in all cases depends on the partitioner. I haven't

Re: toDebugString is clipped

2016-11-13 Thread Sean Owen
I believe it's the shell (Scala shell) that's cropping the output. See http://blog.ssanj.net/posts/2016-10-16-output-in-scala-repl-is-truncated.html On Sun, Nov 13, 2016 at 1:56 AM Anirudh Perugu < anirudh.per...@stonybrook.edu> wrote: > Hello all, > > I am trying to understanding how graphx

Re: spark-shell not starting ( in a Kali linux 2 OS)

2016-11-13 Thread Sean Owen
You set SCALA_HOME twice and didn't set SPARK_HOME. On Sun, Nov 13, 2016, 04:50 Kelum Perera wrote: > Dear Users, > > I'm a newbie, trying to get spark-shell using kali linux OS, but getting > error - "spark-shell: command not found" > > I'm running on Kali Linux 2 (64bit)

Re: SparkDriver memory calculation mismatch

2016-11-12 Thread Sean Owen
12, 2016 at 9:14 AM Elkhan Dadashov <elkhan8...@gmail.com> wrote: > @Sean Owen, > > Thanks for your reply. > > I put the wrong link to the blog post. Here is the correct link > <https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-4-memory-s

Re: SparkDriver memory calculation mismatch

2016-11-12 Thread Sean Owen
If you're pointing at the 336MB, then it's not really related any of the items you cite here. This is the memory managed internally by MemoryStore. The blog post refers to the legacy memory manager. You can see a bit of how it works in the code, but this is the sum of the on-heap and off-heap

Re: TallSkinnyQR

2016-11-11 Thread Sean Owen
@Xiangrui / @Joseph, do you think it would be reasonable to have CoordinateMatrix sort the rows it creates to make an IndexedRowMatrix? in order to make the ultimate output of toRowMatrix less surprising when it's not ordered? On Tue, Nov 8, 2016 at 3:29 PM Sean Owen <so...@cloudera.com>

Re: TallSkinnyQR

2016-11-08 Thread Sean Owen
666 -0.11165321782745863 > R: -1.0712142642814275 -0.8347536340918976 -1.227672225670157 > 0.0 0.7662808691141717 0.7553315911660984 > 0.0 0.0 0.7785210939368136 > > When running this in matlab the numbers are the same but row 1 is

Re: TallSkinnyQR

2016-11-07 Thread Sean Owen
Rather than post a large section of code, please post a small example of the input matrix and its decomposition, to illustrate what you're saying is out of order. On Tue, Nov 8, 2016 at 3:50 AM im281 wrote: > I am getting the correct rows but they are out of order. Is

Re: How sensitive is Spark to Swap?

2016-11-07 Thread Sean Owen
Swapping is pretty bad here, especially because a JVM-based process won't even feel the memory pressure and try to GC or shrink the heap when the OS faces memory pressure. It's probably relatively worse than in M/R because Spark uses memory more. Enough grinding in swap will cause tasks to fail due to

Re: Spark ML - Is it rule of thumb that all Estimators should only be Fit on Training data

2016-11-02 Thread Sean Owen
I would also only fit these on training data. There are probably some corner cases where letting these ancillary transforms see test data results in a target leak. Though I can't really think of a good example. More to the point, you're probably fitting these as part of a pipeline and that

Re: Load whole ALS MatrixFactorizationModel into memory

2016-11-02 Thread Sean Owen
You can cause the underlying RDDs in the model to be cached in memory. That would be necessary but not sufficient to make it go fast; it should at least get rid of a lot of I/O. I think making recommendations one at a time is never going to scale to moderate load this way; one request means one

Re: Running Google Dataflow on Spark

2016-11-02 Thread Sean Owen
This is a Dataflow / Beam question, not a Spark question per se. On Wed, Nov 2, 2016 at 11:48 AM Ashutosh Kumar wrote: > I am trying to run Google Dataflow code on Spark. It works fine as google > dataflow on google cloud platform. But while running on Spark I am

Re: Spark ML - CrossValidation - How to get Evaluation metrics of best model

2016-11-01 Thread Sean Owen
CrossValidator splits the data into k sets, and then trains k times, holding out one subset for cross-validation each time. You are correct that you should actually withhold an additional test set, before you use CrossValidator, in order to get an unbiased estimate of the best model's performance.
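The scheme described here — withhold a test set first, then rotate validation folds within what remains — can be sketched with plain index bookkeeping (no Spark, hypothetical row counts):

```python
def kfold_indices(n, k):
    # Rotate each of k roughly equal folds out as the validation set.
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in range(k):
        val = folds[held_out]
        train = [idx for f in range(k) if f != held_out for idx in folds[f]]
        yield train, val

test_rows = list(range(8, 10))  # withheld BEFORE cross-validation begins
cv_rows = list(range(8))        # the only rows CrossValidator ever sees

print(test_rows)  # [8, 9] -- used once, for the final unbiased estimate
for train, val in kfold_indices(len(cv_rows), k=4):
    print(len(train), len(val))  # 6 2, printed four times
```

The final model is then evaluated exactly once on `test_rows`, giving the unbiased estimate the answer refers to.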

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread Sean Owen
Possibly https://issues.apache.org/jira/browse/SPARK-17396 ? On Tue, Nov 1, 2016 at 2:11 AM kant kodali wrote: > Hi Ryan, > > I think you are right. This may not be related to the Receiver. I have > attached jstack dump here. I do a simple MapToPair and reduceByKey and I >

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Sean Owen
> pipe size(512 bytes, -p) 8 > > POSIX message queues (bytes, -q) 819200 > > real-time priority (-r) 0 > > stack size (kbytes, -s) 8192 > > cpu time (seconds, -t) unlimited > > max user processes (-

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Sean Owen
ps -L [pid] is what shows threads. I am not sure this is counting what you think it does. My shell process has about a hundred threads, and I can't imagine why one would have thousands unless your app spawned them. On Mon, Oct 31, 2016 at 10:20 AM kant kodali wrote: > when I

Re: [SPARK 2.0.0] Specifying remote repository when submitting jobs

2016-10-28 Thread Sean Owen
https://issues.apache.org/jira/browse/SPARK-17898 On Fri, Oct 28, 2016 at 11:56 AM Aseem Bansal wrote: > Hi > > We are trying to use some of our artifacts as dependencies while > submitting spark jobs. To specify the remote artifactory URL we are using > the following

Re: Executor shutdown hook and initialization

2016-10-28 Thread Sean Owen
Have a look at this ancient JIRA for a lot more discussion about this: https://issues.apache.org/jira/browse/SPARK-650 You have exactly the same issue described by another user. For your context, your approach is sound. You can set a shutdown hook using the normal Java Runtime API. You may not
