Re: Testing ETL with Spark using Pytest

2021-02-09 Thread Jerry Vinokurov
> sys.exit(1) > Thanks again. > Mich > LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Testing ETL with Spark using Pytest

2021-02-09 Thread Jerry Vinokurov
Hi Mich, I'm a bit confused by what you mean when you say that you cannot call a fixture in another fixture. The fixtures resolve dependencies among themselves by means of their named parameters. So that means that if I have a fixture @pytest.fixture def fixture1(): return SomeObj() and
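A minimal pytest sketch of that pattern (the class and fixture names are invented for illustration): one fixture declares another as a parameter, and pytest resolves the dependency itself, so you never call a fixture function directly.

```python
import pytest

class SomeObj:
    """Stand-in for whatever object the tests need (hypothetical)."""
    def __init__(self):
        self.value = 42

@pytest.fixture
def fixture1():
    # pytest invokes this once per test (default "function" scope).
    return SomeObj()

@pytest.fixture
def fixture2(fixture1):
    # Naming fixture1 as a parameter makes pytest resolve it first;
    # this is how fixtures depend on other fixtures.
    return fixture1.value + 1

def test_fixture_chain(fixture2):
    assert fixture2 == 43
```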

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Jerry Vinokurov
employee_name > now the http GET call has to be made for each employee_id and the DataFrame is dynamic for each Spark job run. > Does it make sense? > Thanks > On Thu, May 14, 2020 at 5:12 PM Jerry Vinokurov wrote: >> Hi Chetan, >> You

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Jerry Vinokurov
Hi Chetan, You can pretty much use any client to do this. When I was using Spark at a previous job, we used OkHttp, but I'm sure there are plenty of others. In our case, we had a startup phase in which we gathered metadata via a REST API and then broadcast it to the workers. I think if you need
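The thread used OkHttp on the JVM; as a rough PySpark analogue of the same pattern (fetch metadata once on the driver, broadcast it to workers), here is a sketch using the requests library, with a placeholder URL:

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rest-metadata").getOrCreate()
sc = spark.sparkContext

# Startup phase on the driver: fetch metadata once over HTTP.
# The URL is a placeholder, not a real endpoint.
metadata = requests.get("https://example.com/api/metadata").json()

# Broadcast the metadata so each executor gets one read-only copy
# instead of re-issuing the HTTP call per task.
meta_bc = sc.broadcast(metadata)

def enrich(row):
    # Executors read the broadcast value locally; no network call here.
    return (row["id"], meta_bc.value.get(str(row["id"]), "unknown"))

df = spark.createDataFrame([{"id": 1}, {"id": 2}])
print(df.rdd.map(enrich).collect())
```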

Re: Any way to make catalyst optimise away join

2019-11-29 Thread Jerry Vinokurov
This seems like a suboptimal situation for a join. How can Spark know in advance that all the fields are present and the tables have the same number of rows? I suppose you could just sort the two frames by id and concatenate them, but I'm not sure what join optimization is available here. On Fri,
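A small sketch of the situation being described (column and table names invented): Catalyst plans this as a real join because nothing in the metadata tells it each id appears exactly once on each side, so it cannot rewrite the join into a simple zip of the two frames.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "x"])
right = spark.createDataFrame([(1, 10), (2, 20)], ["id", "y"])

# Even though these frames happen to share the same set of ids,
# Spark has no guarantee of that, so it plans a full join
# (broadcast or sort-merge) rather than optimizing it away.
joined = left.join(right, "id")
joined.explain()
```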

Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
your tolerance for error you could also use percentile_approx(). >> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov wrote: >>> Do you mean that you are trying to compute the percent rank of some data? You can use the SparkSQL percent_r

Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
Do you mean that you are trying to compute the percent rank of some data? You can use the SparkSQL percent_rank function for that, but I don't think that's going to give you any improvement over calling the percentRank function on the data frame. Are you currently using a user-defined function for
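A quick sketch of the two functions mentioned in this thread, on a toy table: percent_rank is a window function giving each row's relative rank, while percentile_approx is an aggregate returning an approximate percentile, trading bounded error for speed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("percentiles").getOrCreate()
spark.createDataFrame([(v,) for v in [1, 2, 3, 4, 5]], ["val"]) \
     .createOrReplaceTempView("t")

# percent_rank: the relative rank of each row within the window.
spark.sql("""
    SELECT val, percent_rank() OVER (ORDER BY val) AS pct_rank
    FROM t
""").show()

# percentile_approx: an approximate percentile of the whole column.
spark.sql("""
    SELECT percentile_approx(val, 0.5) AS approx_median
    FROM t
""").show()
```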

Re: intermittent Kryo serialization failures in Spark

2019-09-25 Thread Jerry Vinokurov
> then run with: 'spark.kryo.referenceTracking': 'false', 'spark.kryo.registrationRequired': 'false', 'spark.kryo.registrator': 'com.datadog.spark.MyKryoRegistrator', 'spark.kryo.unsafe': 'false', 'spark.kryoserializer.buffer.max'

Re: intermittent Kryo serialization failures in Spark

2019-09-18 Thread Jerry Vinokurov
> 'spark.kryo.referenceTracking': 'false', > 'spark.kryo.registrationRequired': 'false', > 'spark.kryo.registrator': 'com.datadog.spark.MyKryoRegistrator', > 'spark.kryo.unsafe': 'false', > 'spark.kryoserializer.buffer.max': '256m', > On Tue, Sep 17, 2019 at 10:38 AM Jerry Vin
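A minimal sketch of feeding those settings to a PySpark session. The registrator class name is the one quoted in the thread; whether it is on your classpath is environment-specific, and the spark.serializer line is an assumption added here because the Kryo settings only take effect when Kryo is the serializer.

```python
from pyspark.sql import SparkSession

# Values below come from the thread; the registrator class must be
# present on both driver and executor classpaths for this to work.
spark = (
    SparkSession.builder
    .appName("kryo-config")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.referenceTracking", "false")
    .config("spark.kryo.registrationRequired", "false")
    .config("spark.kryo.registrator", "com.datadog.spark.MyKryoRegistrator")
    .config("spark.kryo.unsafe", "false")
    .config("spark.kryoserializer.buffer.max", "256m")
    .getOrCreate()
)
```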

Re: intermittent Kryo serialization failures in Spark

2019-09-17 Thread Jerry Vinokurov
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scal

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Jerry Vinokurov
Maybe I'm not understanding something about this use case, but why is precomputation not an option? Is it because the matrices themselves change? Because if the matrices are constant, then I think precomputation would work for you even if the users request random correlations. You can just store
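A sketch of the precompute-then-serve idea with Spark's built-in Correlation helper (the data and output path are placeholders): compute the full pairwise correlation matrix once offline, persist it, and answer any user-requested pair with a lookup instead of a Spark job.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("precompute-corr").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 10.0)],
    ["a", "b", "c"],
)

# Pack the columns into a single vector column, as Correlation.corr expects.
vec = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)

# Compute the full pairwise correlation matrix once, offline.
matrix = Correlation.corr(vec, "features").head()[0]

# Persist it however you like (the path is a placeholder); serving any
# requested pair later is then a constant-time lookup.
rows = matrix.toArray().tolist()
spark.createDataFrame(rows, df.columns).write.mode("overwrite").parquet("/tmp/corr_matrix")
```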

Re: Spark Newbie question

2019-07-11 Thread Jerry Vinokurov
Hi Ajay, When a Spark SQL statement references a table, that table has to be "registered" first. Usually the way this is done is by reading in a DataFrame, then calling the createOrReplaceTempView (or one of a few other functions) on that data frame, with the argument being the name under which
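A minimal sketch of that registration step (table and column names invented): the SQL statement can only reference "people" after createOrReplaceTempView has been called with that name.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tempview-demo").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Register the DataFrame under a name; only after this can Spark SQL
# statements refer to it as a table.
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE id = 2").show()
```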

intermittent Kryo serialization failures in Spark

2019-07-10 Thread Jerry Vinokurov
Hi all, I am experiencing a strange intermittent failure of my Spark job that results from serialization issues in Kryo. Here is the stack trace: Caused by: java.lang.ClassNotFoundException: com.mycompany.models.MyModel > at java.net.URLClassLoader.findClass(URLClassLoader.java:382)