Do you have dynamic resource allocation enabled?
On Wednesday, July 17, 2019, zenglong chen wrote:
> Hi all,
> My standalone-mode cluster has two slaves. When I submit my job, the
> localhost slave works well, but the second slave keeps adding and
> removing executors. The logs are below:
>
Users can also request random rows in those columns. So a user can request a
subset of the matrix (N rows and N columns), which would change the value of
the correlation coefficient.
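As a stdlib-only sketch of the point above (synthetic data, illustrative names): the Pearson coefficient over a user-selected row subset differs from the full-column value, which is why per-request subsets defeat naive precomputation.

```python
# Sketch: correlation over a row subset vs. the full column.
# Data and names are made up for illustration.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

col_a = [1.0, 2.0, 3.0, 4.0, 10.0]
col_b = [1.0, 2.0, 3.0, 4.0, -10.0]

full = pearson(col_a, col_b)            # all rows, includes the outlier
subset = pearson(col_a[:4], col_b[:4])  # a user-selected subset of rows

# The subset is perfectly correlated; the full column is not.
assert abs(subset - 1.0) < 1e-9
assert full < 0
```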
From: Jerry Vinokurov [mailto:grapesmo...@gmail.com]
Sent: Wednesday, July 17, 2019 1:27 PM
Maybe I'm not understanding something about this use case, but why is
precomputation not an option? Is it because the matrices themselves change?
Because if the matrices are constant, then I think precomputation would
work for you even if the users request random correlations. You can just
store
As I said in my initial message, precomputing is not an option.
Retrieving only the top/bottom N most correlated is an option – would that
speed up the results?
Our SLAs are soft – slight variations (±15 seconds) will not cause issues.
--gautham
From: Patrick McCarthy
Do you really need the results of all 3MM computations, or only the top-
and bottom-most correlation coefficients? Could correlations be computed on
a sample and from that estimate a distribution of coefficients? Would it
make sense to precompute offline and instead focus on fast key-value
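The sampling idea above can be sketched in a few lines of stdlib Python (synthetic data, illustrative names): instead of computing all pairwise correlations, correlate a random sample of column pairs and use those coefficients to estimate the distribution.

```python
# Sketch: estimate the distribution of correlation coefficients from a
# random sample of column pairs rather than all pairs.
import random
from math import sqrt

random.seed(0)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# 200 synthetic columns of 50 random values each.
cols = [[random.random() for _ in range(50)] for _ in range(200)]

# Correlate only 100 of the ~20,000 possible distinct pairs.
pairs = [random.sample(range(len(cols)), 2) for _ in range(100)]
sample_r = [pearson(cols[i], cols[j]) for i, j in pairs]

# The sampled coefficients sketch the overall distribution.
assert len(sample_r) == 100
assert all(-1.0 <= r <= 1.0 for r in sample_r)
```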
Thanks for the reply, Bobby.
I’ve received notice that we can probably tolerate response times of up to 30
seconds. Would this be more manageable? 5 seconds was an initial ask, but 20-30
seconds is also a reasonable response time for our use case.
With the new SLA, do you think that we can
Hi,
With the current design, event logs are not ideal for long-running streaming
applications, so it is better to disable them. There was a proposal to split
event logs by size/job/query for long-running applications; I'm not sure
about the follow-up on that.
Regards,
Hi.
There is a workaround for that.
You can disable event logs for Spark Streaming applications.
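The relevant setting is `spark.eventLog.enabled`, which can be set in `spark-defaults.conf` or passed per job with `--conf` on `spark-submit`:

```
# In spark-defaults.conf, or via --conf on spark-submit:
spark.eventLog.enabled  false
```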
On Tue, Jul 16, 2019 at 1:08 PM raman gugnani
wrote:
> Hi,
>
> I have long running spark streaming jobs.
> Event log directories are getting filled with .inprogress files.
> Is there a fix or work
Let's do a few quick rules of thumb to get an idea of what kind of
processing power you will need in general to do what you want.
You need 3,000,000 ints by 50,000 rows. Each int is 4 bytes, so that ends
up being about 560 GB that you need to fully process in 5 seconds.
If you are reading this
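The arithmetic above can be checked with a quick back-of-envelope script:

```python
# Back-of-envelope check of the data volume and required throughput.
n_cols = 3_000_000
n_rows = 50_000
bytes_total = n_cols * n_rows * 4      # 4 bytes per int
gib = bytes_total / 2**30              # ~559 GiB, i.e. "about 560 GB"
gib_per_sec = gib / 5                  # ~112 GiB/s to scan it in 5 seconds
assert bytes_total == 600_000_000_000
assert round(gib) == 559
```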
Hello all,
TLDR; Can the number of cores used by a task vary or is it always one core
per task? Is there a UI, metrics or logs I can check to see the number of
cores used by the task?
I have an ETL-pipeline where I do some transformations. In one of the
stages which ought to be quite CPU-heavy
Regular Python UDFs don't use PyArrow under the hood.
Yes, they could potentially benefit, but the lack can easily be worked around
by using Pandas UDFs.
For instance, both below are virtually identical.
@udf(...)
def func(col):
    return col

@pandas_udf(...)
def pandas_func(col):
    return col
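The performance difference between the two can be sketched without Spark (stdlib-only; the names are illustrative, not Spark APIs): a regular UDF is invoked once per row, while a Pandas UDF is invoked once per column batch, so Python call and serialization overhead is amortized.

```python
# Sketch: per-row vs. per-batch invocation counts for identical results.
calls = {"scalar": 0, "batch": 0}

def scalar_udf(x):          # stands in for a regular @udf
    calls["scalar"] += 1
    return x

def batch_udf(batch):       # stands in for a @pandas_udf; gets a whole batch
    calls["batch"] += 1
    return batch

col = list(range(1000))
scalar_out = [scalar_udf(x) for x in col]   # 1000 Python calls
batch_out = batch_udf(col)                  # 1 Python call

assert scalar_out == batch_out
assert calls == {"scalar": 1000, "batch": 1}
```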
Hi all,
My standalone-mode cluster has two slaves. When I submit my job, the
localhost slave works well, but the second slave keeps adding and
removing executors. The logs are below:
2019-07-17 10:51:38,889 INFO
client.StandaloneAppClient$ClientEndpoint: Executor updated:
I'm trying to parse an RDD[Seq[String]] into a DataFrame.
Although it's a Seq of Strings, the values could have a more specific type
such as Int, Boolean, Double, String and so on.
For example, a line could be:
"hello", "1", "bye", "1.1"
"hello1", "11", "bye1", "2.1"
...
First column is going to be always a
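The question is about Scala, but the per-column inference step can be sketched in stdlib Python (illustrative function names, using the example rows above): try progressively wider casts per string value before falling back to String.

```python
# Sketch: infer a concrete type (int, float, bool, str) for string values,
# as a first step toward building a typed schema from rows of strings.
def infer(value: str):
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    return value

rows = [["hello", "1", "bye", "1.1"],
        ["hello1", "11", "bye1", "2.1"]]
typed = [[infer(v) for v in row] for row in rows]

assert typed[0] == ["hello", 1, "bye", 1.1]
assert typed[1] == ["hello1", 11, "bye1", 2.1]
```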