Re: spark standalone mode problem: executors added and removed again and again!

2019-07-17 Thread Amit Sharma
Do you have dynamic resource allocation enabled? On Wednesday, July 17, 2019, zenglong chen wrote: > Hi all, > My standalone setup has two slaves. When I submit my job, the > localhost slave works well, but the second slave keeps adding and removing executors! The logs are below: >
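For context, a quick way to rule dynamic allocation in or out is to pin the setting and read back what the running application actually uses. A minimal PySpark sketch, assuming a standard session entry point (the config keys are Spark's standard ones; everything else is illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.dynamicAllocation.enabled", "false")
             .getOrCreate())

    # Read back the effective settings; note that dynamic allocation
    # also requires the external shuffle service in standalone mode.
    print(spark.conf.get("spark.dynamicAllocation.enabled", "false"))
    print(spark.conf.get("spark.shuffle.service.enabled", "false"))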

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
Users can also request random rows in those columns. So a user can request a subset of the matrix (N rows and N columns), which would change the value of the correlation coefficient. From: Jerry Vinokurov [mailto:grapesmo...@gmail.com] Sent: Wednesday, July 17, 2019 1:27 PM To:

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Jerry Vinokurov
Maybe I'm not understanding something about this use case, but why is precomputation not an option? Is it because the matrices themselves change? Because if the matrices are constant, then I think precomputation would work for you even if the users request random correlations. You can just store

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
As I said in my initial message, precomputing is not an option. Retrieving only the top/bottom N most correlated is an option – would that speed up the results? Our SLAs are soft – slight variations (+- 15 seconds) will not cause issues. --gautham From: Patrick McCarthy

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Patrick McCarthy
Do you really need the results of all 3MM computations, or only the top- and bottom-most correlation coefficients? Could correlations be computed on a sample and from that estimate a distribution of coefficients? Would it make sense to precompute offline and instead focus on fast key-value
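Spark can already compute pairwise coefficients on a sample, which is one way to try Patrick's suggestion. A minimal sketch using pyspark.ml.stat.Correlation; the DataFrame df, its numeric columns, and the sample fraction are assumptions:

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.stat import Correlation

    # Correlate a 1% sample instead of the full matrix (df is an
    # existing DataFrame of numeric columns -- an assumption here).
    sample = df.sample(fraction=0.01, seed=42)
    assembled = (VectorAssembler(inputCols=df.columns, outputCol="features")
                 .transform(sample))

    # Returns a local matrix of pairwise Pearson coefficients.
    corr_matrix = Correlation.corr(assembled, "features", "pearson").head()[0]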

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
Thanks for the reply, Bobby. I’ve received notice that we can probably tolerate response times of up to 30 seconds. Would this be more manageable? 5 seconds was an initial ask, but 20-30 seconds is also a reasonable response time for our use case. With the new SLA, do you think that we can

Re: event log directory(spark-history) filled by large .inprogress files for spark streaming applications

2019-07-17 Thread Shahid K. I.
Hi, With the current design, event logs are not ideal for long-running streaming applications, so it is better to disable them. There was a proposal to split the event logs by size/job/query for long-running applications; I'm not sure about the follow-up on that. Regards,

Re: event log directory(spark-history) filled by large .inprogress files for spark streaming applications

2019-07-17 Thread Artur Sukhenko
Hi. There is a workaround for that: you can disable event logs for Spark Streaming applications. On Tue, Jul 16, 2019 at 1:08 PM raman gugnani wrote: > Hi, > > I have long-running Spark Streaming jobs. > Event log directories are getting filled with .inprogress files. > Is there a fix or work
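As a concrete version of that workaround, here is a minimal sketch (spark.eventLog.enabled is the standard Spark config key; the app name is illustrative):

    from pyspark.sql import SparkSession

    # Disable event logging so the long-running streaming app writes
    # no .inprogress file under the history/event-log directory.
    spark = (SparkSession.builder
             .appName("long-running-streaming-job")
             .config("spark.eventLog.enabled", "false")
             .getOrCreate())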

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Bobby Evans
Let's do a few quick rules of thumb to get an idea of what kind of processing power you will need in general to do what you want. You have 3,000,000 ints by 50,000 rows. Each int is 4 bytes, so that ends up being about 560 GB that you need to fully process in 5 seconds. If you are reading this
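The 560 GB figure checks out; as a sanity check (pure arithmetic, no Spark involved):

    # 3,000,000 ints per row x 50,000 rows x 4 bytes per int
    total_bytes = 3_000_000 * 50_000 * 4   # = 600,000,000,000 bytes
    print(total_bytes / 2**30)             # ~558.8 GiB, i.e. "about 560 GB"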

CPU:s per task

2019-07-17 Thread Magnus Nilsson
Hello all, TL;DR: Can the number of cores used by a task vary, or is it always one core per task? Is there a UI, metric, or log I can check to see the number of cores used by a task? I have an ETL pipeline where I do some transformations. In one of the stages, which ought to be quite CPU-heavy
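To answer the TL;DR directly: the scheduler reserves spark.task.cpus cores per task (default 1), and in Spark of this vintage that value is fixed per application rather than per task or per stage. A minimal sketch of pinning it at session build time (the app name and values are illustrative):

    from pyspark.sql import SparkSession

    # With executor.cores=8 and task.cpus=2, each executor runs
    # 8 / 2 = 4 tasks concurrently, each reserving 2 cores.
    spark = (SparkSession.builder
             .appName("etl-pipeline")
             .config("spark.executor.cores", "8")
             .config("spark.task.cpus", "2")
             .getOrCreate())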

Re: Usage of PyArrow in Spark

2019-07-17 Thread Hyukjin Kwon
Regular Python UDFs don't use PyArrow under the hood. Yes, they could potentially benefit, but this can be easily worked around via Pandas UDFs. For instance, both below are virtually identical. @udf(...) def func(col): return col @pandas_udf(...) def pandas_func(col): return
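The code in the snippet is cut off; a minimal reconstruction of the identity-function pair it appears to show (the return type is illustrative):

    from pyspark.sql.functions import udf, pandas_udf
    from pyspark.sql.types import StringType

    @udf(StringType())
    def func(col):
        # Plain Python UDF: invoked once per row, no Arrow involved.
        return col

    @pandas_udf(StringType())
    def pandas_func(col):
        # Scalar Pandas UDF: receives a whole pandas.Series batch,
        # shipped to the Python worker via Arrow.
        return col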

spark standalone mode problem: executors added and removed again and again!

2019-07-17 Thread zenglong chen
Hi all, My standalone setup has two slaves. When I submit my job, the localhost slave works well, but the second slave keeps adding and removing executors! The logs are below: 2019-07-17 10:51:38,889 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated:

Parse RDD[Seq[String]] to DataFrame with types.

2019-07-17 Thread Guillermo Ortiz Fernández
I'm trying to parse a RDD[Seq[String]] to Dataframe. ALthough it's a Seq of Strings they could have a more specific type as Int, Boolean, Double, String an so on. For example, a line could be: "hello", "1", "bye", "1.1" "hello1", "11", "bye1", "2.1" ... First column is going to be always a