Re: Testing ETL with Spark using Pytest

2021-02-09 Thread Jerry Vinokurov
> sys.exit(1) > Thanks again. > Mich > LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Testing ETL with Spark using Pytest

2021-02-09 Thread Jerry Vinokurov
Hi Mich, I'm a bit confused by what you mean when you say that you cannot call a fixture in another fixture. The fixtures resolve dependencies among themselves by means of their named parameters. So that means that if I have a fixture @pytest.fixture def fixture1(): return SomeObj() and
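A minimal pytest sketch of that pattern (the class and fixture names are invented for illustration): one fixture declares another as a parameter, and pytest resolves the dependency itself, so you never call a fixture function directly.

```python
import pytest

class SomeObj:
    """Stand-in for whatever object the tests need (hypothetical)."""
    def __init__(self):
        self.value = 42

@pytest.fixture
def fixture1():
    # pytest invokes this once per test (default "function" scope).
    return SomeObj()

@pytest.fixture
def fixture2(fixture1):
    # Naming fixture1 as a parameter makes pytest resolve it first;
    # this is how fixtures depend on other fixtures.
    return fixture1.value + 1

def test_fixture_chain(fixture2):
    assert fixture2 == 43
```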

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Jerry Vinokurov
employee_name > now the http GET call has to be made for each employee_id and the DataFrame is dynamic for each Spark job run. > Does it make sense? > Thanks > On Thu, May 14, 2020 at 5:12 PM Jerry Vinokurov wrote: >> Hi Chetan, >> You

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Jerry Vinokurov
Hi Chetan, You can pretty much use any client to do this. When I was using Spark at a previous job, we used OkHttp, but I'm sure there are plenty of others. In our case, we had a startup phase in which we gathered metadata via a REST API and then broadcast it to the workers. I think if you need
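The thread used OkHttp on the JVM; as a rough PySpark analogue of the same pattern (fetch metadata once on the driver, broadcast it to workers), here is a sketch using the requests library, with a placeholder URL:

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rest-metadata").getOrCreate()
sc = spark.sparkContext

# Startup phase on the driver: fetch metadata once over HTTP.
# The URL is a placeholder, not a real endpoint.
metadata = requests.get("https://example.com/api/metadata").json()

# Broadcast the metadata so each executor gets one read-only copy
# instead of re-issuing the HTTP call per task.
meta_bc = sc.broadcast(metadata)

def enrich(row):
    # Executors read the broadcast value locally; no network call here.
    return (row["id"], meta_bc.value.get(str(row["id"]), "unknown"))

df = spark.createDataFrame([{"id": 1}, {"id": 2}])
print(df.rdd.map(enrich).collect())
```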

Re: Any way to make catalyst optimise away join

2019-11-29 Thread Jerry Vinokurov
This seems like a suboptimal situation for a join. How can Spark know in advance that all the fields are present and the tables have the same number of rows? I suppose you could just sort the two frames by id and concatenate them, but I'm not sure what join optimization is available here. On Fri,
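A small sketch of the situation being described (column and table names invented): Catalyst plans this as a real join because nothing in the metadata tells it each id appears exactly once on each side, so it cannot rewrite the join into a simple zip of the two frames.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "x"])
right = spark.createDataFrame([(1, 10), (2, 20)], ["id", "y"])

# Even though these frames happen to share the same set of ids,
# Spark has no guarantee of that, so it plans a full join
# (broadcast or sort-merge) rather than optimizing it away.
joined = left.join(right, "id")
joined.explain()
```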

Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
your tolerance for error you could also use percentile_approx(). >> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov wrote: >>> Do you mean that you are trying to compute the percent rank of some data? You can use the SparkSQL percent_r

Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
Do you mean that you are trying to compute the percent rank of some data? You can use the SparkSQL percent_rank function for that, but I don't think that's going to give you any improvement over calling the percentRank function on the data frame. Are you currently using a user-defined function for
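A quick sketch of the two functions mentioned in this thread, on a toy table: percent_rank is a window function giving each row's relative rank, while percentile_approx is an aggregate returning an approximate percentile, trading bounded error for speed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("percentiles").getOrCreate()
spark.createDataFrame([(v,) for v in [1, 2, 3, 4, 5]], ["val"]) \
     .createOrReplaceTempView("t")

# percent_rank: the relative rank of each row within the window.
spark.sql("""
    SELECT val, percent_rank() OVER (ORDER BY val) AS pct_rank
    FROM t
""").show()

# percentile_approx: an approximate percentile of the whole column.
spark.sql("""
    SELECT percentile_approx(val, 0.5) AS approx_median
    FROM t
""").show()
```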

Re: intermittent Kryo serialization failures in Spark

2019-09-25 Thread Jerry Vinokurov
> then run with: 'spark.kryo.referenceTracking': 'false', 'spark.kryo.registrationRequired': 'false', 'spark.kryo.registrator': 'com.datadog.spark.MyKryoRegistrator', 'spark.kryo.unsafe': 'false', 'spark.kryoserializer.buffer.max'

Re: intermittent Kryo serialization failures in Spark

2019-09-18 Thread Jerry Vinokurov
> 'spark.kryo.referenceTracking': 'false', > 'spark.kryo.registrationRequired': 'false', > 'spark.kryo.registrator': 'com.datadog.spark.MyKryoRegistrator', > 'spark.kryo.unsafe': 'false', > 'spark.kryoserializer.buffer.max': '256m', > On Tue, Sep 17, 2019 at 10:38 AM Jerry Vin
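A minimal sketch of feeding those settings to a PySpark session. The registrator class name is the one quoted in the thread; whether it is on your classpath is environment-specific, and the spark.serializer line is an assumption added here because the Kryo settings only take effect when Kryo is the serializer.

```python
from pyspark.sql import SparkSession

# Values below come from the thread; the registrator class must be
# present on both driver and executor classpaths for this to work.
spark = (
    SparkSession.builder
    .appName("kryo-config")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.referenceTracking", "false")
    .config("spark.kryo.registrationRequired", "false")
    .config("spark.kryo.registrator", "com.datadog.spark.MyKryoRegistrator")
    .config("spark.kryo.unsafe", "false")
    .config("spark.kryoserializer.buffer.max", "256m")
    .getOrCreate()
)
```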

Re: intermittent Kryo serialization failures in Spark

2019-09-17 Thread Jerry Vinokurov
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scal

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Jerry Vinokurov
Maybe I'm not understanding something about this use case, but why is precomputation not an option? Is it because the matrices themselves change? Because if the matrices are constant, then I think precomputation would work for you even if the users request random correlations. You can just store
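A sketch of the precompute-then-serve idea with Spark's built-in Correlation helper (the data and output path are placeholders): compute the full pairwise correlation matrix once offline, persist it, and answer any user-requested pair with a lookup instead of a Spark job.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("precompute-corr").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 10.0)],
    ["a", "b", "c"],
)

# Pack the columns into a single vector column, as Correlation.corr expects.
vec = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)

# Compute the full pairwise correlation matrix once, offline.
matrix = Correlation.corr(vec, "features").head()[0]

# Persist it however you like (the path is a placeholder); serving any
# requested pair later is then a constant-time lookup.
rows = matrix.toArray().tolist()
spark.createDataFrame(rows, df.columns).write.mode("overwrite").parquet("/tmp/corr_matrix")
```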

Re: Spark Newbie question

2019-07-11 Thread Jerry Vinokurov
Hi Ajay, When a Spark SQL statement references a table, that table has to be "registered" first. Usually the way this is done is by reading in a DataFrame, then calling the createOrReplaceTempView (or one of a few other functions) on that data frame, with the argument being the name under which
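A minimal sketch of that registration step (table and column names invented): the SQL statement can only reference "people" after createOrReplaceTempView has been called with that name.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tempview-demo").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Register the DataFrame under a name; only after this can Spark SQL
# statements refer to it as a table.
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE id = 2").show()
```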

intermittent Kryo serialization failures in Spark

2019-07-10 Thread Jerry Vinokurov
Hi all, I am experiencing a strange intermittent failure of my Spark job that results from serialization issues in Kryo. Here is the stack trace: Caused by: java.lang.ClassNotFoundException: com.mycompany.models.MyModel > at java.net.URLClassLoader.findClass(URLClassLoader.java:382)