number of http calls I can make. I
> can't issue more http calls in a single executor, I mean - I can't
> go beyond the threshold of the number of executors.
>
> On Thu, May 14, 2020 at 6:26 PM Sean Owen wrote:
>>
>> Default is not 200, but the number of ex
Does it have
> a relationship with the number of cores? 8 cores - 4 workers. Isn't it like I
> can only do 8 * 4 = 32 http calls? Because in Spark "number of partitions =
> number of cores" is untrue.
>
> Thanks
>
> On Thu, May 14, 2020 at 6:11 PM Sean Owen wrote:
>
>> Yes any code that you
Yes, any code that you write and apply with Spark runs in
the executors. You would be running as many HTTP clients as you have
partitions.
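
A rough sketch of the pattern being described; the RDD `urls` and the use of
plain HttpURLConnection are placeholders, not anything from the original thread.
The body of mapPartitions runs on the executors, once per partition, so the
concurrent HTTP calls are bounded by how many partitions run at once:

    import java.net.{HttpURLConnection, URL}

    // `urls` stands in for an RDD[String] of request URLs.
    val statuses = urls.mapPartitions { iter =>
      // Everything in this block executes on an executor, per partition.
      iter.map { u =>
        val conn = new URL(u).openConnection().asInstanceOf[HttpURLConnection]
        try conn.getResponseCode finally conn.disconnect()
      }
    }
    // statuses.count()  // an action is needed to actually trigger the calls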
On Thu, May 14, 2020 at 4:31 PM Jerry Vinokurov wrote:
>
> I believe that if you do this within the context of an operation that is
> already
, 2020 at 11:45 AM Fuo Bol wrote:
>
> @Sean Owen
>
> Why did you remove the email zahidr1...@gmail.com following this query?
>
> The two responses were then sent to a removed email account.
>
>
>
>
> > -- Forwarded message -
> > From:
Spark will by default assume each task needs 1 CPU. On an executor
with 16 cores and 16 slots, you'd schedule 16 tasks. If each is using
4 cores, then 64 threads are trying to run. If you're CPU-bound, that
could slow things down. But to the extent some of the tasks take some time
blocking on I/O, it
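
For reference, the two settings involved, with values that are only
illustrative and match the 16-core example above:

    import org.apache.spark.SparkConf

    // With 16 cores per executor and the default spark.task.cpus=1, each
    // executor runs 16 tasks at once; setting spark.task.cpus=4 would reserve
    // 4 cores per task and drop that to 4 concurrent tasks per executor.
    val conf = new SparkConf()
      .set("spark.executor.cores", "16")
      .set("spark.task.cpus", "1")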
Before anyone asks: yes this is banned immediately.
I am subscribed to this list to watch for a certain person's new
accounts, which are posting obviously off-topic and inappropriate
messages. It goes without saying that this is unacceptable and a CoC
violation, and anyone posting that will be immediately removed and
blocked.
In the meantime,
You'll want to ask the authors directly; the book is not produced by the
project itself, so we can't answer here.
On Sat, Apr 25, 2020, 8:42 AM Som Lima wrote:
> At the risk of being removed from the mailing list I would like a
> clarification because I do not want to commit an unlawful act.
> Can
The mailing lists are operated by the ASF. I've asked whether it's
possible here: https://issues.apache.org/jira/browse/INFRA-20186
On Fri, Apr 24, 2020 at 12:39 PM Jeff Evans
wrote:
>
> Still noticing this problem quite a bit, both on the user and dev lists. I
> notice that it appears to be
I don't think that means it's stuck on removing something; it was
removed. Not sure what it is waiting on - more data perhaps?
On Sat, Apr 18, 2020 at 2:22 PM Alchemist wrote:
>
> I am running a simple Spark structured streaming application that is pulling
> data from a Kafka Topic. I have a
The second release candidate will come soon. I would guess it all
completes by the end of May, myself, but no guarantees.
On Fri, Apr 17, 2020 at 6:30 AM Marshall Markham
wrote:
>
> Hi,
>
>
>
> I realize this was probably not responded to because either the date is
> unclear or explicitly
Absolutely unacceptable even if this were the only one. I'm contacting
INFRA right now.
On Thu, Apr 16, 2020 at 11:57 AM Holden Karau wrote:
> I want to be clear I believe the language in janethrope1's email is
> unacceptable for the mailing list and possibly a violation of the Apache
> code of
Yes, this kind of message is not welcome on this list. At best the
wording is ... odd, but the tone is combative. It's not even clear
what the question is.
This user has posted several other messages with the same type of
issue, uncannily like those of "Zahid Raman" last month.
This list has
the lists.
Sean
On Fri, Mar 27, 2020 at 9:46 PM Zahid Rahman wrote:
>
>
> Sean Owen says the criteria of these two mailing lists is not to help and
> support somebody
> who is new, but people who have been using the software for a long time.
>
> He is implying I think th
Spark standalone is a resource manager like YARN and Mesos. It is
specific to Spark, and is therefore simpler, as it assumes it can take
over whole machines.
YARN and Mesos are for mediating resource usage across applications on
a cluster, which may be running more than Spark apps.
On Fri, Mar
- dev@, which is more for project devs to communicate. Cross-posting
is discouraged too.
The book isn't from the Spark OSS project, so not really the place to
give feedback here.
I don't quite understand the context of your other questions, but
would elaborate them in individual, clear emails
la:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
> at org.apache.spar
What was the build error? You didn't say. Are you sure it succeeded?
Try running from the Spark home dir, not bin.
I know we do run Windows tests and it appears to pass tests, etc.
On Thu, Dec 5, 2019 at 3:28 PM Ping Liu wrote:
>
> Hello,
>
> I understand Spark is preferably built on Linux. But
The message in question has already been public, and copied to mirrors the
ASF does not control, for a year and a half.
There is a process for requesting modification to ASF archives, but this
case does not qualify:
https://www.apache.org/foundation/public-archives.html
On Thu, Sep 26, 2019 at
Seems fine to me if there are enough valuable fixes to justify another
release. If there are any other important fixes imminent, it's fine to
wait for those.
On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun wrote:
>
> Hi, All.
>
> Spark 2.4.3 was released three months ago (8th May).
> As of today
big for that high GC.
>
> But what's troubling me is this issue doesn't occur in Spark 2.2 at all. What
> could be the reason behind such a behaviour?
>
> Regards,
> Dhrub
>
> On Mon, Jul 29, 2019 at 6:45 PM Sean Owen wrote:
>>
>> -dev@
>>
>> Yep, hig
-dev@
Yep, high GC activity means '(almost) out of memory'. I don't see that
you've checked heap usage - is it nearly full?
The answer isn't tuning but more heap.
(Sometimes with really big heaps the problem is big pauses, but that's
not the case here.)
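
If the answer is indeed more heap, the knobs are the executor and driver
memory settings; the sizes below are only examples:

    import org.apache.spark.SparkConf

    // Raise the executor (and, if needed, driver) heap rather than tuning GC.
    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")
      .set("spark.driver.memory", "4g")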
On Mon, Jul 29, 2019 at 1:26 AM
We will certainly want a 2.4.4 release eventually. In fact I'd expect
2.4.x gets maintained for longer than the usual 18 months, as it's the
last 2.x branch.
It doesn't need to happen before 3.0, but could. Usually maintenance
releases happen 3-4 months apart and the last one was 2 months ago. If
Deprecated -- certainly and sooner than later.
I don't have a good sense of the overhead of continuing to support
Python 2; is it large enough to consider dropping it in Spark 3.0?
On Wed, May 29, 2019 at 11:47 PM Xiangrui Meng wrote:
>
> Hi all,
>
> I want to revive this old thread since no
A cached DataFrame isn't supposed to change, by definition.
You can re-read each time or consider setting up a streaming source on
the table which provides a result that updates as new data comes in.
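
A sketch of the re-read option, assuming an existing SparkSession `spark`, a
cached DataFrame `df`, and a table whose name here is purely a placeholder:

    // Drop the stale cached copy and read the table again when fresh data is needed.
    df.unpersist()
    val fresh = spark.table("events").cache()   // "events" is a hypothetical table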
On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos wrote:
>
> Hello,
>
> I have a cached dataframe:
Severity: Low
Vendor: The Apache Software Foundation
Versions Affected:
All versions of Apache Spark
Description:
Spark's standalone resource manager accepts code to execute on a 'master' host,
that then runs that code on 'worker' hosts. The master itself does not, by
design, execute user code.
Severity: Low
Vendor: The Apache Software Foundation
Versions Affected:
1.3.x release branch and later, including master
Description:
Spark's Apache Maven-based build includes a convenience script, 'build/mvn',
that downloads and runs a zinc server to speed up compilation. This server
will
Severity: Medium
Vendor: The Apache Software Foundation
Versions Affected:
Spark versions from 1.3.0, running standalone master with REST API enabled,
or running Mesos master with cluster mode enabled
Description:
From version 1.3.0 onward, Spark's standalone master exposes a REST API for
job
Severity: Medium
Vendor: The Apache Software Foundation
Versions Affected:
Spark versions through 2.1.2
Spark 2.2.0 through 2.2.1
Spark 2.3.0
Description:
In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, it's
possible for a malicious user to construct a URL pointing to a
Severity: High
Vendor: The Apache Software Foundation
Versions affected:
Spark versions through 2.1.2
Spark 2.2.0 to 2.2.1
Spark 2.3.0
Description:
In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, when
using PySpark or SparkR, it's possible for a different local user to
Spark uses Maven as the primary build, but SBT works as well. It reads the
Maven build to some extent.
Zinc incremental compilation works with Maven (with the Scala plugin for
Maven).
Myself, I prefer Maven, for some of the reasons it is the main build in
Spark: declarative builds end up being a
Just to follow up -- those are actually in a Palantir repo, not Central.
Deploying to Central would be uncourteous, but this approach is legitimate
and how it has to work for vendors to release distros of Spark etc.
On Tue, Jan 9, 2018 at 11:43 AM Nan Zhu wrote:
> Hi,
Certainly, Scala 2.12 support precedes Java 9 support. A lot of the work is
in place already, and the last issue is dealing with how Scala closures are
now implemented quite differently with lambdas / invokedynamic. This affects
the ClosureCleaner. For the interested, this is as far as I know the
>> On Sun, Oct 1, 2017 at 3:50 PM, Reynold Xin <r...@databricks.com> wrote:
>> > Probably should do 1, and then it is an easier transition in 3.0.
>> >
>> > On Sun, Oct 1, 2017 at 1:28 AM Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> I tried an
Severity: Medium
Vendor: The Apache Software Foundation
Versions Affected:
Versions of Apache Spark from 1.6.0 until 2.1.1
Description:
In Apache Spark 1.6.0 until 2.1.1, the launcher API performs unsafe
deserialization of data received by its socket. This makes applications
launched
Severity: Low
Vendor: The Apache Software Foundation
Versions Affected:
Versions of Apache Spark before 2.2.0
Description:
It is possible for an attacker to take advantage of a user's trust in the
server to trick them into visiting a link that points to a shared Spark
cluster and submits data
rg/slf4j/simple/SimpleLogger.java#L599
>
> Please correct me if I am wrong.
>
>
>
>
> On Sun, Jun 25, 2017 at 3:04 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> Maybe you are looking for declarations like this. "=> String" means the
>> arg i
Please get Packt to fix their existing PR. It's been open for months
https://github.com/apache/spark-website/pull/35
On Sun, Jun 25, 2017 at 12:33 PM Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:
> Hi Sean,
>
> Last time, you helped me add a book info (in the books section) on this
Maybe you are looking for declarations like this. "=> String" means the arg
isn't evaluated until it's used, which is just what you want with log
statements. The message isn't constructed unless it will be logged.
protected def logInfo(msg: => String) {
On Sun, Jun 25, 2017 at 10:28 AM kant
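
A small self-contained illustration of the by-name argument; this is not
Spark's actual Logging trait, just the same idea:

    object LazyLogDemo {
      var infoEnabled = false

      // msg is by-name: the string is only built if the body actually uses it.
      def logInfo(msg: => String): Unit = {
        if (infoEnabled) println(msg)
      }

      def main(args: Array[String]): Unit = {
        // The expensive interpolation below never runs, because infoEnabled is false.
        logInfo(s"state dump: ${(1 to 1000000).sum}")
      }
    }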
Yes. Imagine an RDD that results from a union of other RDDs.
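
A quick way to see this, assuming an existing SparkContext `sc`:

    val a = sc.parallelize(1 to 5)
    val b = sc.parallelize(6 to 10)
    val u = a.union(b)
    // The union has one dependency per parent, so the Seq holds two entries here.
    println(u.dependencies.size)   // 2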
On Thu, Jun 15, 2017, 09:11 萝卜丝炒饭 <1427357...@qq.com> wrote:
> Hi all,
>
> The RDD code keeps a member as below:
> dependencies_ : Seq[Dependency[_]]
>
> It is a seq, that means it can keep more than one dependency.
>
> I have an issue
Kafka and Spark Streaming don't do the same thing. Kafka stores and
transports data, Spark Streaming runs computations on a stream of data.
Neither is itself a streaming platform in its entirety.
It's kind of like asking whether you should build a website using just
MySQL, or nginx.
> On 9 Mar
The driver keeps metrics on everything that has executed. This is how it
can display the history in the UI. It's normal for the bookkeeping to keep
growing because it's recording every job. You can configure it to keep
records about fewer jobs. But thousands of entries isn't exactly big.
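
The relevant settings, if you do want to cap the bookkeeping; the values here
are only examples:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.ui.retainedJobs", "200")     // keep records for fewer jobs
      .set("spark.ui.retainedStages", "200")   // and fewer stages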
On Thu,
I think this is the same thing we already discussed extensively on your
JIRA.
The type of the key/value class argument to newAPIHadoopFile are not the
type of your custom class, but of the Writable describing encoding of keys
and values in the file. I think that's the start of part of the
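
A sketch of the call, assuming a SequenceFile of LongWritable keys and Text
values; the path and Writable types are placeholders, and the point is that
the key/value class arguments describe the file's encoding, not your own
domain class:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

    val pairs = sc.newAPIHadoopFile(
      "/data/input.seq",
      classOf[SequenceFileInputFormat[LongWritable, Text]],
      classOf[LongWritable],
      classOf[Text])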
There's nothing unusual about negative values from a linear regression. If,
generally, your predicted values are far from your actual values, then your
model hasn't fit well. You may have a bug somewhere in your pipeline or you
may have data without much linear relationship. Most of this isn't a
What is the --jars you are submitting? You may have conflicting copies of
Spark classes that interfere.
On Wed, Mar 1, 2017, 14:20 Dominik Safaric wrote:
> I've been trying to submit a Spark Streaming application using
> spark-submit to a cluster of mine consisting of
Broadcasts let you send one copy of read only data to each executor. That's
not the same as a DataFrame, and its nature means it doesn't make sense
to think of them as not distributed. But consider things like broadcast
hash joins which may be what you are looking for if you really mean to join
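
A broadcast hash join, for reference; the DataFrame names and the join key are
placeholders:

    import org.apache.spark.sql.functions.broadcast

    // Ship the small side to every executor instead of shuffling both sides.
    val joined = bigDF.join(broadcast(smallDF), Seq("id"))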
No this use case is perfectly sensible. Yes it is thread safe.
On Sun, Feb 12, 2017, 10:30 Jörn Franke wrote:
> I think you should have a look at the spark documentation. It has
> something called scheduler who does exactly this. In more sophisticated
> environments yarn
Data has to live somewhere -- how do you not add storage but store more
data? Alluxio is not persistent storage, and S3 isn't on your premises.
On Sun, Feb 12, 2017 at 4:29 AM Benjamin Kim wrote:
> Has anyone got some advice on how to remove the reliance on HDFS for
>
Yes, job postings are strongly discouraged on ASF lists, if not outright
disallowed. You will sometimes see posts prefixed with [JOBS] that are
tolerated, but here I would assume they are not. This particular project
and list is so big that there is no job posting I can imagine that is
relevant
I don't think this is a Spark question. This isn't a problem you solve by
throwing all combinations of options at it. Your target is not a linear
function of input, or its square, and it's not a question of GLM link
function. You may need to look at the log-log plot because this looks like
a
It's a message from Hadoop libs, not Spark. It can be safely ignored. It's
just saying you haven't installed the additional (non-Apache-licensed)
native libs that can accelerate some operations. This is something you can
easily read more about online.
On Thu, Jan 19, 2017 at 10:57 AM Md.
Accumulators aren't related directly to RDDs or Datasets. They're a
separate construct. You can imagine updating accumulators in any
distributed operation that you see documented for RDDs or Datasets.
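
A minimal sketch, assuming an existing SparkContext `sc`; the input path is a
placeholder:

    // Count blank lines seen across all tasks while doing other work.
    val blanks = sc.longAccumulator("blankLines")
    sc.textFile("/data/input.txt").foreach { line =>
      if (line.trim.isEmpty) blanks.add(1)
    }
    println(blanks.value)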
On Wed, Jan 18, 2017 at 2:16 PM Hanna Mäki wrote:
> Hi,
>
> The
On Tue, Jan 17, 2017 at 4:49 PM Rick Moritz wrote:
> * Oryx2 - This was more focused on a particular issue, and looked to be a
> very nice framework for deploying real-time analytics --- but again, no
> real traction. In fact, I've heard of PoCs being done by/for Cloudera, to
>
The biggest thing that any resource manager besides Spark's standalone
resource manager can do is manage other application resources. In a cluster
where you are running other workloads, you can't use Spark standalone to
arbitrate resource requirements across apps.
On Sun, Jan 15, 2017 at 1:55 PM
Are you using Java 8? Hyukjin fixed up all the errors due to the much
stricter javadoc 8, but it's possible some creep back in because there is
no Java 8 test now.
On Wed, Jan 11, 2017 at 6:22 PM Krishna Kalyan
wrote:
> Hello,
> I have been trying to build spark
https://github.com/sryza/spark-timeseries ?
On Wed, Jan 11, 2017 at 10:11 AM Rishabh Bhardwaj
wrote:
> Hi All,
>
> I am exploring time-series forecasting with Spark.
> I have some questions regarding this:
>
> 1. Is there any library/package out there in community of
Maybe ... here are a bunch of things I'd check:
Are you running out of memory, or just seeing a lot of memory usage? JVMs will
happily use all the memory you allow them even if some of it could be
reclaimed.
Did the driver run out of memory? Did you give 6G to the driver or the executor?
OOM errors do show
does it break in spark?
>
>
>
> On Sun, Jan 8, 2017 at 6:03 PM, Sean Owen <so...@cloudera.com> wrote:
>
> Double[] is not of the same class as double[]. Kryo should already know
> how to serialize double[], but I doubt Double[] is registered.
>
> The error does
Double[] is not of the same class as double[]. Kryo should already know how
to serialize double[], but I doubt Double[] is registered.
The error does seem to clearly indicate double[] though. That surprises
me. Can you try manually registering it to see if that fixes it?
But then I'm not sure
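
A sketch of registering both array classes manually, as suggested; whether
this resolves the reported error is unverified:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // classOf[Array[Double]] is double[]; classOf[Array[java.lang.Double]] is Double[]
      .registerKryoClasses(Array(
        classOf[Array[Double]],
        classOf[Array[java.lang.Double]]))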
(You can post this on the CDH lists BTW as it's more about that
distribution.) The whole thrift server isn't supported / enabled in CDH, so
I think that's why the script isn't turned on either. I don't think it's as
much about using Impala as not wanting to do all the grunt work to make it
duplication.
>
> I have fixed the problem by what I mentioned above. Now, multiply,
> computeSVD, and tallSkinnyQR are giving the correct results for
> indexedRowMatrix when using multiple executors or workers. Let me know if
> I should do a pull request for this.
>
> Best,
> Hua
> divide the current
> cosine output by the norm of this vector. And this vector we can get by
> doing model.transform('science') if I am right?
>
> Lastly, I would be very happy to update to docs if it is editable for all
> the things I encounter as not mentioned or not very clear.
>
>
> I am very open to editing the docs on things I find not properly
> documented or wrong, but I need to know if that is allowed (is it like a
> Wiki)?.
>
> On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen <so...@cloudera.com> wrote:
>
> It should be the cosine similarity,
It should be the cosine similarity, yes. I think this is what was fixed in
https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was really
just outputting the 'unnormalized' similarity (dot / norm(a) only) but the
docs said cosine similarity. Now it's cosine similarity in Spark 2. The
I think the best advice is: don't do that. If you're trying to solve a
linear system, solve the linear system without explicitly constructing a
matrix inverse. Is that what you mean?
On Thu, Dec 29, 2016 at 2:22 AM Yanwei Wayne Zhang <
actuary_zh...@hotmail.com> wrote:
> Hi all,
>
>
> I have a
Spark isn't a storage system -- it's a batch processing system at heart. To
"serve" something means to run a distributed computation scanning
partitions for an element, collecting it to the driver, and returning it.
Although that could be fast-enough for some definition of fast, it's going
to be orders
Yes, that's the problem. Guava isn't generally mutually compatible across
more than a couple major releases. You may have to hunt for a version that
happens to have the functionality that both dependencies want, and hope
that exists. Spark should shade Guava at this point but doesn't mean that
you
measure. That is because in explicit we are not using
> the confidence matrix and preference matrix concept and use the actual
> rating data. So any output from Spark ALS for explicit data would be a
> rating prediction.
>
> On Thu, Dec 15, 2016 at 3:46 PM, Sean Owen &
nd 0/1 matrix to find
> the user and item factors.
>
> On Thu, Dec 15, 2016 at 3:38 PM, Sean Owen <so...@cloudera.com> wrote:
>
> No, you can't interpret the output as probabilities at all. In particular
> they may be negative. It is not predicting rating but intera
> Please note, for implicit feedback ALS, we don't feed a 0/1 matrix. We
> feed the count matrix (discrete count values) and am assuming spark
> internally converts it into a preference matrix (1/0) and a confidence
> matrix =1+alpha*count_matrix
>
>
>
>
>
> On Thu, Dec 15
No, ALS is not modeling probabilities. The outputs are reconstructions of a
0/1 matrix. Most values will be in [0,1], but, it's possible to get values
outside that range.
On Thu, Dec 15, 2016 at 10:21 PM Manish Tripathi
wrote:
> Hi
>
> ran the ALS model for implicit
You can ignore it. You can also install the native libs in question but
it's just a minor accelerator.
On Thu, Dec 8, 2016 at 2:36 PM baipeng wrote:
> Hi ALL
>
> I’m new to Spark. When I execute spark-shell, the first line is as follows
> WARN util.NativeCodeLoader: Unable to
(For what it is worth, I happened to look into this with Anton earlier and
am also pretty convinced it's related to GraphX rather than the app. It's
somewhat difficult to debug what gets sent in the closure AFAICT.)
On Tue, Dec 6, 2016 at 7:49 PM AntonIpp wrote:
> Hi
> *Md. Rezaul Karim* BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> <http://139.59.184.114/index.html>
>
> On 6 December 2016 at
Precision, recall and F1 are metrics for binary classifiers, not regression
models. Can you clarify what you intend to do?
On Tue, Dec 6, 2016, 19:14 Md. Rezaul Karim
wrote:
> Hi Folks,
>
> I have the following code snippet in Java that can calculate the
1
> ls.add(srcFactor, (c1 + 1.0) / c1, c1)
> }
> } else {
> ls.add(srcFactor, rating)
> numExplicits += 1
> }
> {code}
>
> Regards,
>
> Jerry
>
>
> On Mon, Dec 5, 2016 at 3:27 PM, Sean Owen <s
paper?
>
> Best Regards,
>
> Jerry
>
>
> On Mon, Dec 5, 2016 at 2:43 PM, Sean Owen <so...@cloudera.com> wrote:
>
> What are you referring to in what paper? implicit input would never
> materialize 0s for missing values.
>
> On Tue, Dec 6, 2016 at 3:42 AM Jerry
What are you referring to in what paper? implicit input would never
materialize 0s for missing values.
On Tue, Dec 6, 2016 at 3:42 AM Jerry Lam wrote:
> Hello spark users and developers,
>
> I read the paper from Yahoo about CF with implicit feedback and other
> papers
y available (R)
>
> On Fri, Nov 11, 2016 at 3:56 AM Sean Owen <so...@cloudera.com> wrote:
>
> @Xiangrui / @Joseph, do you think it would be reasonable to have
> CoordinateMatrix sort the rows it creates to make an IndexedRowMatrix? in
> order to make the ultimate output of t
DataFrames are a narrower, more specific type of abstraction, for tabular
data. Where your data is tabular, it makes more sense to use, especially
because this knowledge means a lot more can be optimized under the hood for
you, whereas the framework can do nothing with an RDD of arbitrary objects.
Are you missing the hadoop-lzo package? it's not part of Hadoop/Spark.
On Sat, Nov 19, 2016 at 4:20 AM learning_spark <
dibyendu.chakraba...@gmail.com> wrote:
> Hi Users, I am not sure about the latest status of this issue:
> https://issues.apache.org/jira/browse/SPARK-2394 However, I have seen
It's not in general true that 100 different partition keys go to 100
partitions -- it depends on the partitioner, but wouldn't be true in the
case of a default HashPartitioner. But, yeah you'd expect a reasonably even
distribution.
What happens in all cases depends on the partitioner. I haven't
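
A small illustration of why distinct keys need not land in distinct partitions
under the default hash partitioning (key strings are arbitrary):

    import org.apache.spark.HashPartitioner

    val p = new HashPartitioner(100)
    // Two distinct keys can hash to the same partition; nothing guarantees
    // that 100 distinct keys spread across exactly 100 partitions.
    println(p.getPartition("key-1"))
    println(p.getPartition("key-2"))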
I believe it's the shell (Scala shell) that's cropping the output. See
http://blog.ssanj.net/posts/2016-10-16-output-in-scala-repl-is-truncated.html
On Sun, Nov 13, 2016 at 1:56 AM Anirudh Perugu <
anirudh.per...@stonybrook.edu> wrote:
> Hello all,
>
> I am trying to understanding how graphx
You set SCALA_HOME twice and didn't set SPARK_HOME.
On Sun, Nov 13, 2016, 04:50 Kelum Perera wrote:
> Dear Users,
>
> I'm a newbie, trying to get spark-shell using kali linux OS, but getting
> error - "spark-shell: command not found"
>
> I'm running on Kali Linux 2 (64bit)
12, 2016 at 9:14 AM Elkhan Dadashov <elkhan8...@gmail.com>
wrote:
> @Sean Owen,
>
> Thanks for your reply.
>
> I put the wrong link to the blog post. Here is the correct link
> <https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-4-memory-s
If you're pointing at the 336MB, then it's not really related to any of the
items you cite here. This is the memory managed internally by MemoryStore.
The blog post refers to the legacy memory manager. You can see a bit of how
it works in the code, but this is the sum of the on-heap and off-heap
@Xiangrui / @Joseph, do you think it would be reasonable to have
CoordinateMatrix sort the rows it creates to make an IndexedRowMatrix? in
order to make the ultimate output of toRowMatrix less surprising when it's
not ordered?
On Tue, Nov 8, 2016 at 3:29 PM Sean Owen <so...@cloudera.com>
666 -0.11165321782745863
> R: -1.0712142642814275 -0.8347536340918976 -1.227672225670157
> 0.0 0.7662808691141717 0.7553315911660984
> 0.0 0.0 0.7785210939368136
>
> When running this in matlab the numbers are the same but row 1 is
Rather than post a large section of code, please post a small example of
the input matrix and its decomposition, to illustrate what you're saying is
out of order.
On Tue, Nov 8, 2016 at 3:50 AM im281 wrote:
> I am getting the correct rows but they are out of order. Is
Swapping is pretty bad here, especially because a JVM-based process won't even feel
the memory pressure and try to GC or shrink the heap when the OS faces
memory pressure. It's probably relatively worse than in M/R because Spark
uses memory more. Enough grinding in swap will cause tasks to fail due to
I would also only fit these on training data. There are probably some
corner cases where letting these ancillary transforms see test data results
in a target leak. Though I can't really think of a good example.
More to the point, you're probably fitting these as part of a pipeline and
that
You can cause the underlying RDDs in the model to be cached in memory. That
would be necessary but not sufficient to make it go fast; it should at
least get rid of a lot of I/O. I think making recommendations one at a time
is never going to scale to moderate load this way; one request means one
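
A sketch of the caching step, assuming an MLlib MatrixFactorizationModel named
`model`:

    // Pin the factor RDDs in memory so each recommendation avoids re-reading them.
    model.userFeatures.cache()
    model.productFeatures.cache()
    model.userFeatures.count()      // force materialization once, up front
    model.productFeatures.count()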
This is a Dataflow / Beam question, not a Spark question per se.
On Wed, Nov 2, 2016 at 11:48 AM Ashutosh Kumar
wrote:
> I am trying to run Google Dataflow code on Spark. It works fine as google
> dataflow on google cloud platform. But while running on Spark I am
CrossValidator splits the data into k sets, and then trains k times,
holding out one subset for cross-validation each time. You are correct that
you should actually withhold an additional test set, before you use
CrossValidator, in order to get an unbiased estimate of the best model's
performance.
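
A sketch of the setup described above; the estimator, grid, and data set are
placeholders, and the point is the separate held-out test set:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val lr = new LogisticRegression()
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()

    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)   // k = 3: train 3 times, each time holding out one third

    // val Array(train, test) = data.randomSplit(Array(0.8, 0.2))  // `data` is hypothetical
    // val model = cv.fit(train)  // evaluate on `test`, which CrossValidator never sees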
Possibly https://issues.apache.org/jira/browse/SPARK-17396 ?
On Tue, Nov 1, 2016 at 2:11 AM kant kodali wrote:
> Hi Ryan,
>
> I think you are right. This may not be related to the Receiver. I have
> attached jstack dump here. I do a simple MapToPair and reduceByKey and I
>
> pipe size(512 bytes, -p) 8
> > POSIX message queues (bytes, -q) 819200
> > real-time priority (-r) 0
> > stack size (kbytes, -s) 8192
> > cpu time (seconds, -t) unlimited
> > max user processes (-
ps -L [pid] is what shows threads. I am not sure this is counting what you
think it does. My shell process has about a hundred threads, and I can't
imagine why one would have thousands unless your app spawned them.
On Mon, Oct 31, 2016 at 10:20 AM kant kodali wrote:
> when I
https://issues.apache.org/jira/browse/SPARK-17898
On Fri, Oct 28, 2016 at 11:56 AM Aseem Bansal wrote:
> Hi
>
> We are trying to use some of our artifacts as dependencies while
> submitting spark jobs. To specify the remote artifactory URL we are using
> the following
Have a look at this ancient JIRA for a lot more discussion about this:
https://issues.apache.org/jira/browse/SPARK-650 You have exactly the same
issue described by another user. For your context, your approach is sound.
You can set a shutdown hook using the normal Java Runtime API. You may not
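
A minimal sketch of the standard JVM mechanism mentioned above:

    // Runs when the JVM shuts down normally (e.g. on SIGTERM), letting you
    // close resources or stop a StreamingContext gracefully.
    Runtime.getRuntime.addShutdownHook(new Thread {
      override def run(): Unit = {
        // e.g. streamingContext.stop(stopSparkContext = true, stopGracefully = true)
      }
    })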