Hi Lucas,

Thanks for the detailed feedback, that's really useful!

I did suggest GitHub, but my colleague asked for an email.

You raise a good point about the grammar; sure, I will rephrase it. I am more than happy to merge the PR if you send it.

That said, I know you can write BDD tests using any framework, but I am a lazy developer and would rather use the framework or library defaults to make it easier for other devs to pick up.

You are correct that the number of rows is only a start. We can add more tests to check the transformed version, but I was going to point that out in a future part of the series, since this one is mainly about raw extracts.

Thank you very much for the feedback; I will be sure to add it once I have more feedback. Maybe we can create a gist of all this, or even a tiny book on best practices, if people find it useful.

Looking forward to the PR!

Regards
Sam

On Sat, 29 Apr 2017 at 06:36, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:

> Awesome, thanks.
>
> Just reading your post
>
> A few observations:
> 1) You're giving out Marius's email: "I have been lucky enough to
> build this pipeline with the amazing Marius Feteanu". A LinkedIn or
> GitHub link might be more helpful.
>
> 2) "If you are in Pyspark world sadly Holden’s test base wont work so
> I suggest you check out Pytest and pytest-bdd." doesn't read well to
> me; on first read I was wondering if Spark-Test-Base wasn't available
> in Python... It took me about 20 seconds to figure out that you
> probably meant it doesn't allow for direct BDD semantics. My 2nd
> observation here is that BDD semantics can be aped in any given
> testing framework. You just need to be flexible :)
>
> 3) You're doing a transformation (i.e. JSON input against a JSON
> schema). You are testing for # of rows, which is a good start. But I
> don't think that really exercises a test against your JSON schema. I
> tend to view schema as the things that need the most rigorous testing
> (it's code after all).
> I.e. I would want to confirm that the output
> matches the expected shape and values after being loaded against the
> schema.
>
> I saw a few minor spelling and grammatical issues as well. I put a PR
> into your blog for them. I won't be offended if you squish it :)
>
> I should be getting into our testing 'how-to' stuff this week. I'll
> scrape our org specific stuff and put it up to GitHub this week as
> well. It'll be in Python so maybe we'll get both use cases covered
> with examples :)
>
> G
>
> On 27 April 2017 at 03:46, Sam Elamin <hussam.ela...@gmail.com> wrote:
> > Hi
> >
> > @Lucas I certainly would love to write an integration testing library for
> > workflows. I have a few ideas I would love to share with others, and they are
> > focused around Airflow since that is what we use.
> >
> > As promised, here is the first blog post in a series of posts I hope to write
> > on how we build data pipelines.
> >
> > Please feel free to retweet my original tweet and share, because the more
> > ideas we have the better!
> >
> > Feedback is always welcome!
> >
> > Regards
> > Sam
> >
> > On Tue, Apr 25, 2017 at 10:32 PM, lucas.g...@gmail.com
> > <lucas.g...@gmail.com> wrote:
> >>
> >> Hi all, whoever (Sam I think) was going to do some work on doing a
> >> template testing pipeline. I'd love to be involved. I have a current task
> >> in my day job (data engineer) to flesh out our testing how-to / best
> >> practices for Spark jobs and I think I'll be doing something very similar
> >> for the next week or 2.
> >>
> >> I'll scrape out what I have now in the next day or so and put it up in a
> >> gist that I can share too.
> >>
> >> G
> >>
> >> On 25 April 2017 at 13:04, Holden Karau <hol...@pigscanfly.ca> wrote:
> >>>
> >>> Urgh hangouts did something frustrating, updated link
> >>> https://hangouts.google.com/hangouts/_/ha6kusycp5fvzei2trhay4uhhqe
> >>>
> >>> On Mon, Apr 24, 2017 at 12:13 AM, Holden Karau <hol...@pigscanfly.ca>
> >>> wrote:
> >>>>
> >>>> The (tentative) link for those interested is
> >>>> https://hangouts.google.com/hangouts/_/oyjvcnffejcjhi6qazf3lysypue .
> >>>>
> >>>> On Mon, Apr 24, 2017 at 12:02 AM, Holden Karau <hol...@pigscanfly.ca>
> >>>> wrote:
> >>>>>
> >>>>> So 14 people have said they are available on Tuesday the 25th at 1PM
> >>>>> pacific so we will do this meeting then (
> >>>>> https://doodle.com/poll/69y6yab4pyf7u8bn ).
> >>>>>
> >>>>> Since hangouts tends to work ok on the Linux distro I'm running my
> >>>>> default is to host this as a "hangouts-on-air" unless there are alternative
> >>>>> ideas.
> >>>>>
> >>>>> I'll record the hangout and if it isn't terrible I'll post it for those
> >>>>> who weren't able to make it (and for next time I'll include more European
> >>>>> friendly time options - Doodle wouldn't let me update it once posted).
> >>>>>
> >>>>> On Fri, Apr 14, 2017 at 11:17 AM, Holden Karau <hol...@pigscanfly.ca>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hi Spark Users (+ Some Spark Testing Devs on BCC),
> >>>>>>
> >>>>>> Awhile back on one of the many threads about testing in Spark there
> >>>>>> was some interest in having a chat about the state of Spark testing and what
> >>>>>> people want/need.
> >>>>>>
> >>>>>> So if you are interested in joining an online (with maybe an IRL
> >>>>>> component if enough people are SF based) chat about Spark testing please
> >>>>>> fill out this doodle - https://doodle.com/poll/69y6yab4pyf7u8bn
> >>>>>>
> >>>>>> I think reasonable topics of discussion could be:
> >>>>>>
> >>>>>> 1) What is the state of the different Spark testing libraries in the
> >>>>>> different core (Scala, Python, R, Java) and extended languages (C#,
> >>>>>> Javascript, etc.)?
> >>>>>> 2) How do we make these more easily discovered by users?
> >>>>>> 3) What are people looking for in their testing libraries that we are
> >>>>>> missing? (can be functionality, documentation, etc.)
> >>>>>> 4) Are there any examples of well tested open source Spark projects
> >>>>>> and where are they?
> >>>>>>
> >>>>>> If you have other topics that's awesome.
> >>>>>>
> >>>>>> To clarify, this is about libraries and best practices for people testing
> >>>>>> their Spark applications, and less about testing Spark's internals (although
> >>>>>> as illustrated by some of the libraries there is some strong overlap in what
> >>>>>> is required to make that work).
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Holden :)
> >>>>>>
> >>>>>> --
> >>>>>> Cell : 425-233-8271
> >>>>>> Twitter: https://twitter.com/holdenkarau