Re: Integration testing Framework Spark SQL Scala

2020-11-02 Thread Lars Albertsson
, Lars Albertsson Data engineering entrepreneur www.scling.com, www.mapflat.com https://twitter.com/lalleal +46 70 7687109 On Tue, Feb 25, 2020 at 7:46 PM Ruijing Li wrote: > > Just wanted to follow up on this. If anyone has any advice, I’d be interested > in learning more! > > On Th

Re: Testing Apache Spark applications

2018-11-15 Thread Lars Albertsson
on the subject. Slides and video are linked on this page: http://www.mapflat.com/presentations/ You can find more material in this list of resources: http://www.mapflat.com/lands/resources/reading-list Happy testing! Regards, Lars Albertsson Data engineering entrepreneur www.mimeria.com

Re: testing frameworks

2018-06-12 Thread Lars Albertsson
in this list of resources: http://www.mapflat.com/lands/resources/reading-list Happy testing! Regards, Lars Albertsson Data engineering consultant www.mapflat.com https://twitter.com/lalleal +46 70 7687109 Calendar: http://www.mapflat.com/calendar On Mon, May 21, 2018 at 2:24 PM, Steve

Re: TDD in Spark

2017-01-17 Thread Lars Albertsson
. Validate selected fields instead. For a longer answer, please search for my previous posts to the user list, or watch this presentation: https://vimeo.com/192429554 Slides at http://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458 Regards, Lars Albertsson Data

Re: Dependency Injection and Microservice development with Spark

2016-12-28 Thread Lars Albertsson
. Or do you want to use DI for other reasons? Lars Albertsson Data engineering consultant www.mapflat.com https://twitter.com/lalleal +46 70 7687109 Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com On Fri, Dec 23, 2016 at 11:56 AM, Chetan Khatri <chetan.opensou...@gmail.

Re: unit testing in spark

2016-12-08 Thread Lars Albertsson
on the subject. There is a video recording at https://vimeo.com/192429554 and slides at http://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458 You can find more material on test strategies at http://www.mapflat.com/lands/resources/reading-list/index.html Lars Albertsson

Re: Fail a batch in Spark Streaming forcefully based on business rules

2016-07-31 Thread Lars Albertsson
be addressed; if you induce failures, system failures would become part of normal operations, and real failures risk passing unnoticed. Regards, Lars Albertsson Data engineering consultant www.mapflat.com https://twitter.com/lalleal +46 70 7687109 Calendar: https://goo.gl/6FBtlS On Thu, Jul 28

Re: Integration tests for Spark Streaming

2016-07-22 Thread Lars Albertsson
You can find useful discussions in the list archives. I wrote this, which might help you: https://www.mail-archive.com/user%40spark.apache.org/msg48032.html Regards, Lars Albertsson Data engineering consultant www.mapflat.com +46 70 7687109 Calendar: https://goo.gl/tV2hWF On Jun 29, 2016 07:02

Re: Spark Job trigger in production

2016-07-21 Thread Lars Albertsson
ading-list/ Regards, Lars Albertsson Data engineering consultant www.mapflat.com https://twitter.com/lalleal +46 70 7687109 Calendar: https://goo.gl/6FBtlS On Wed, Jul 20, 2016 at 3:47 PM, Sathish Kumaran Vairavelu <vsathishkuma...@gmail.com> wrote: > If you are using Mesos, then u can use C

Re: Best practices to restart Spark jobs programatically from driver itself

2016-07-21 Thread Lars Albertsson
You can use a workflow manager, which gives you tools to handle transient failures in data pipelines. I suggest either Luigi or Airflow. They provide DSLs embedded in Python, so if the primitives provided are insufficient, it is easy to customise Spark tasks with restart logic. Regards, Lars
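The restart logic mentioned in this message could look roughly like the sketch below. It is plain Scala and not the Luigi or Airflow API; `Retry`, `withRetries`, `maxAttempts` and `backoffMillis` are hypothetical names chosen for illustration, and the linear backoff is an assumption rather than something the message prescribes.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of any workflow manager.
object Retry {
  /** Runs `task` up to `maxAttempts` times, sleeping `backoffMillis * attempt` between tries. */
  def withRetries[A](maxAttempts: Int, backoffMillis: Long = 0L)(task: () => A): Try[A] = {
    require(maxAttempts >= 1, "maxAttempts must be at least 1")

    @tailrec
    def attempt(n: Int): Try[A] = Try(task()) match {
      case success @ Success(_) => success
      case Failure(_) if n < maxAttempts =>
        // Transient failure assumed: wait, then try again.
        Thread.sleep(backoffMillis * n)
        attempt(n + 1)
      case failure => failure // attempts exhausted: surface the last error
    }

    attempt(1)
  }
}
```

A workflow manager adds scheduling, dependency tracking and alerting on top of this kind of retry, which is why the message recommends one over hand-rolled logic in the driver.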

Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

2016-07-10 Thread Lars Albertsson
that runs smoothly from Gradle/Maven/SBT and also from IntelliJ. I hope things are clearer. Let me know if you have further questions. Regards, Lars Albertsson Data engineering consultant www.mapflat.com +46 70 7687109 Calendar: https://goo.gl/6FBtlS On Thu, Jul 7, 2016 at 3:14 AM,

Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

2016-07-04 Thread Lars Albertsson
with Docker Compose. If you are emitting database entries, your test oracle will need to frequently poll the database for the expected records, with a timeout in order not to hang on failing tests. I hope this is comprehensible. Let me know if you have followup questions. Regards, Lars Alberts
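The polling test oracle described in this message can be sketched as follows. This is a minimal illustration in plain Scala; `PollingOracle`, `awaitRecords`, `fetch` and `expected` are hypothetical names, and `fetch` stands in for whatever database query the test would actually run.

```scala
import scala.annotation.tailrec
import scala.concurrent.duration._

// Hypothetical helper for integration tests; not from any test framework.
object PollingOracle {
  /** Polls `fetch` until `expected` holds or `timeout` elapses. */
  def awaitRecords[A](
      fetch: () => Seq[A],
      expected: Seq[A] => Boolean,
      timeout: FiniteDuration = 10.seconds,
      interval: FiniteDuration = 100.millis): Either[String, Seq[A]] = {
    val deadline = timeout.fromNow

    @tailrec
    def loop(): Either[String, Seq[A]] = {
      val records = fetch()
      if (expected(records)) Right(records)
      else if (deadline.isOverdue())
        // Give up so that a failing test does not hang forever.
        Left(s"Timed out after $timeout; last saw ${records.size} record(s)")
      else {
        Thread.sleep(interval.toMillis)
        loop()
      }
    }

    loop()
  }
}
```

In a test, a `Left` result would be turned into a test failure, so a missing record fails the test after the timeout instead of blocking the build.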

Re: Unit testing framework for Spark Jobs?

2016-05-21 Thread Lars Albertsson
> On Wed, Mar 30, 2016 at 2:41 AM, Lars Albertsson <la...@mapflat.com> > wrote: > >> Thanks! >> >> It is on my backlog to write a couple of blog posts on the topic, and >> eventually some example code, but I am currently busy with clients.

Re: Scala: Perform Unit Testing in spark

2016-04-06 Thread Lars Albertsson
Hi, I wrote a longish mail on Spark testing strategy last month, which you may find useful: http://mail-archives.apache.org/mod_mbox/spark-user/201603.mbox/browser Let me know if you have follow up questions or want assistance. Regards, Lars Albertsson Data engineering consultant

Re: Unit testing framework for Spark Jobs?

2016-03-30 Thread Lars Albertsson
Thanks! It is on my backlog to write a couple of blog posts on the topic, and eventually some example code, but I am currently busy with clients. Thanks for the pointer to Eventually - I was unaware. Fast exit on exception would be a useful addition, indeed. Lars Albertsson Data engineering

Re: sliding Top N window

2016-03-27 Thread Lars Albertsson
with some expiration strategy. :-) Regards, Lars Albertsson Data engineering consultant www.mapflat.com +46 70 7687109 On Fri, Mar 25, 2016 at 7:48 AM, Jatin Kumar <jku...@rocketfuelinc.com> wrote: > Hello Lars, > > Thanks for your email. I tried exactly what you said and it d

Re: sliding Top N window

2016-03-24 Thread Lars Albertsson
case, the data structures ended up being small, on the order of tens or hundreds of megabytes. It varies with use case, but it is probably a path worth investigating if approximate results are acceptable. Regards, Lars Albertsson Data engineering consultant www.mapflat.com +46 70 7687109 On Wed

Re: sliding Top N window

2016-03-22 Thread Lars Albertsson
with different combinations of time windows by pushing out CMSs and heavy hitters to e.g. Kafka, and have different stream processors that aggregate different time windows and push results to Kafka or to lookup tables. Lars Albertsson Data engineering consultant www.mapflat.com +46 70 7687109 On Tue, Mar

Re: sliding Top N window

2016-03-21 Thread Lars Albertsson
presentation by Ted Dunning and Mikio Braun, who have held good presentations on the subject. There are AFAIK two open source implementations of Count-Min Sketch, one of them in Algebird. Let me know if anything is unclear. Good luck, and let us know how it goes. Regards, Lars Albertss
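For readers unfamiliar with Count-Min Sketch, the idea can be illustrated with the minimal sketch below. It is a toy written for this listing, not Algebird's API; the class name, the `width` and `depth` parameters, and the use of seeded MurmurHash3 as the per-row hash family are all assumptions made for illustration.

```scala
import scala.util.hashing.MurmurHash3

/** Toy Count-Min Sketch: `depth` rows of `width` counters each. */
final class CountMinSketch(width: Int, depth: Int) {
  require(width > 0 && depth > 0, "width and depth must be positive")

  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, obtained by seeding MurmurHash3 with the row index.
  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row)
    ((h % width) + width) % width // map to a non-negative index
  }

  /** Adds `count` occurrences of `item`; counts are assumed non-negative. */
  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += count

  /** With non-negative counts the estimate never undercounts; collisions can only inflate it. */
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}
```

Memory use is fixed at `width * depth` counters regardless of how many distinct items are seen, which is what makes the structure attractive for approximate top-N over a stream.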

Re: Unit testing framework for Spark Jobs?

2016-03-19 Thread Lars Albertsson
clarifications or assistance. Regards, Lars Albertsson Data engineering consultant www.mapflat.com +46 70 7687109 On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote: > Hi, > > What is a good unit testing framework for Spark batch/streaming jobs? I have > cor

Re: Spark Pattern and Anti-Pattern

2016-02-02 Thread Lars Albertsson
. Let me know if you have follow-up questions, or want assistance. Regards, Lars Albertsson Data engineering consultant www.mapflat.com +46 70 7687109 On Tue, Jan 26, 2016 at 10:25 PM, Daniel Schulz <danielschulz2...@hotmail.com> wrote: > Hi, > > We are currently working on a soluti

Re: [SPARK STREAMING] polling based operation instead of event based operation

2015-10-23 Thread Lars Albertsson
the results with the list! Regards, Lars Albertsson On Thu, Oct 22, 2015 at 10:48 PM, Nipun Arora <nipunarora2...@gmail.com> wrote: > Hi, > In general in spark stream one can do transformations ( filter, map etc.) or > output operations (collect, forEach) etc. in an event-driven

Re: Spark job workflow engine recommendations

2015-08-09 Thread Lars Albertsson
? That will require additional components. This became a bit of a brain dump on the topic. I hope that it is useful. Don't hesitate to get back if I can help. Regards, Lars Albertsson On Fri, Aug 7, 2015 at 5:43 PM, Vikram Kone vikramk...@gmail.com wrote: Hi, I'm looking for open source workflow tools

Re: Select all columns except some

2015-07-16 Thread Lars Albertsson
The snippet at the end worked for me. We run Spark 1.3.x, so DataFrame.drop is not available to us. As pointed out by Yana, DataFrame operations typically return a new DataFrame, so use as such: import com.foo.sparkstuff.DataFrameOps._ ... val df = ... val prunedDf = df.dropColumns(one_col,