Hi Everett,

HiveContext is initialized only once, as a lazy val, so if you mean
launching a separate JVM for each test (or group of tests), then in
that case the context obviously will not be shared.

But specs2, by default, runs the specs inside a test class in parallel
threads, and in that case the context is shared.

To sum up: test classes are launched sequentially, but the specs inside
them run in parallel. We have nothing specific in our .sbt file
regarding parallel test execution, and the HiveContext is initialized
only once.

In my opinion (correct me if I'm wrong), if you already run more than
one spec per test class, the CPU is already saturated, so running the
test classes themselves in parallel as well would not give additional
gains.

Regards
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



On Sun, Aug 21, 2016, at 18:30, Everett Anderson wrote:
>
>
> On Sun, Aug 21, 2016 at 3:08 AM, Bedrytski Aliaksandr
> <sp...@bedryt.ski> wrote:
>> Hi,
>>
>> we share the same Spark/Hive context between tests (executed in
>> parallel), so the main problem is that temporary tables are
>> overwritten each time they are created. This can create race
>> conditions, as these temp tables are effectively global mutable
>> shared state.
>>
>> So each time we create a temporary table, we append a unique,
>> incremented, thread-safe id (from an AtomicInteger) to its name, so
>> that each test uses only its own, non-shared temporary tables.
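>>
>> For example, a minimal sketch of the idea (the helper below is
>> hypothetical, just to illustrate; registerTempTable is the Spark 1.x
>> API):
>>
>>     import java.util.concurrent.atomic.AtomicInteger
>>     import org.apache.spark.sql.DataFrame
>>
>>     object TempTables {
>>       private val nextId = new AtomicInteger(0)
>>
>>       // Registers df under a name that no parallel spec can collide
>>       // with and returns that name for use in SQL queries.
>>       def registerUnique(df: DataFrame, prefix: String): String = {
>>         val name = s"${prefix}_${nextId.incrementAndGet()}"
>>         df.registerTempTable(name)
>>         name
>>       }
>>     }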
>
> Makes sense.
>
> But when you say you're sharing the same spark/hive context between
> tests, I'm assuming that's between tests within one test class, and
> that you're not sharing across test classes (which a build tool like
> Maven or Gradle might execute in separate JVMs).
>
> Is that right?
>
>
>
>>
>>
>> --
>>   Bedrytski Aliaksandr
>>   sp...@bedryt.ski
>>
>>
>>
>>> On Sat, Aug 20, 2016, at 01:25, Everett Anderson wrote:
>>> Hi!
>>>
>>> Just following up on this --
>>>
>>> When people talk about a shared session/context for testing like
>>> this, I assume it's still within one test class. So it's still the
>>> case that if you have a lot of test classes that test Spark-related
>>> things, you must configure your build system to not run them in
>>> parallel. You'll get the benefit of not creating and tearing down a
>>> Spark session/context between test cases within a test class,
>>> though.
>>>
>>> Is that right?
>>>
>>> Or have people figured out a way to have sbt (or Maven/Gradle/etc)
>>> share Spark sessions/contexts across integration tests in a
>>> safe way?
>>>
>>>
>>> On Mon, Aug 1, 2016 at 3:23 PM, Holden Karau
>>> <hol...@pigscanfly.ca> wrote:
>>> That's a good point - there is an open issue for spark-testing-base
>>> to support this shared SparkSession approach - but I haven't had the
>>> time ( https://github.com/holdenk/spark-testing-base/issues/123 ).
>>> I'll try and include this in the next release :)
>>>
>>> On Mon, Aug 1, 2016 at 9:22 AM, Koert Kuipers
>>> <ko...@tresata.com> wrote:
>>> we share a single SparkSession across tests, and they can run in
>>> parallel. It's pretty fast.
>>>
>>> On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson
>>> <ever...@nuna.com.invalid> wrote:
>>> Hi,
>>>
>>> Right now, if any code uses DataFrame/Dataset, I need a test setup
>>> that brings up a local master as in this article[1].
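>>>
>>> Something like this (a rough sketch of the usual pattern, assuming
>>> Spark 1.x; the conf values are just examples):
>>>
>>>     import org.apache.spark.{SparkConf, SparkContext}
>>>     import org.apache.spark.sql.SQLContext
>>>
>>>     // Brought up in a beforeAll/setup, torn down in afterAll.
>>>     val sc = new SparkContext(
>>>       new SparkConf().setMaster("local[2]").setAppName("unit-tests"))
>>>     val sqlContext = new SQLContext(sc)
>>>     // ... exercise DataFrame/Dataset code, then sc.stop()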
>>>
>>>
>>> That's a lot of overhead for unit testing and the tests can't run
>>> in parallel, so testing is slow -- this is more like what I'd call
>>> an integration test.
>>>
>>> Do people have any tricks to get around this? Maybe using spy mocks
>>> on fake DataFrame/Datasets?
>>>
>>> Anyone know if there are plans to make more traditional unit
>>> testing possible with Spark SQL, perhaps with a stripped down in-
>>> memory implementation? (I admit this does seem quite hard since
>>> there's so much functionality in these classes!)
>>>
>>> Thanks!
>>>
>>>
>>> - Everett
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
