Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-22 Thread Bedrytski Aliaksandr
Hi Everett,

The HiveContext is initialized only once, as a lazy val, so if you mean
launching a separate JVM for each test (or group of tests), then the
context will obviously not be shared across those JVMs.

But specs2 by default runs the specs inside a test class in parallel
threads, and in that case the context is shared.

To sum up: test classes are launched sequentially, but the specs inside
each class run in parallel. We don't have anything specific in our .sbt
file regarding parallel test execution, and the HiveContext is
initialized only once.

In my opinion (correct me if I'm wrong), if you already have more than
one spec per test class, the CPU will already be saturated, so running
the test classes themselves in parallel would not give additional gains.
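
For illustration, a minimal sketch of that setup, assuming Spark 1.x's
HiveContext and a specs2 mutable Specification (all names here are
illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.specs2.mutable.Specification

    // Initialized at most once per JVM; every spec that touches it reuses
    // the same context, even when specs2 runs specs in parallel threads.
    object SharedContexts {
      lazy val sc: SparkContext =
        new SparkContext(new SparkConf().setMaster("local[*]").setAppName("tests"))
      lazy val hiveContext: HiveContext = new HiveContext(sc)
    }

    class ExampleSpec extends Specification {
      import SharedContexts._

      "the shared HiveContext" should {
        "answer a trivial query" in {
          hiveContext.sql("SELECT 1 AS one").collect().head.getInt(0) === 1
        }
      }
    }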

Regards
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski





Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-21 Thread Everett Anderson

Makes sense.

But when you say you're sharing the same spark/hive context between tests,
I'm assuming that's between tests within one test class, and that you're
not sharing across test classes (which a build tool like Maven or Gradle
might execute in separate JVMs).

Is that right?






Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-21 Thread Bedrytski Aliaksandr
Hi,

We share the same spark/hive context between tests (executed in
parallel), so the main problem is that temporary tables are overwritten
each time they are created; this can cause race conditions, since these
temp tables are effectively global mutable shared state.

So each time we create a temporary table, we append a unique,
incremented, thread-safe id (from an AtomicInteger) to its name, so that
each test uses only its own, non-shared temporary tables.
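
A sketch of that naming scheme, assuming Spark 1.x's registerTempTable
(the helper object and its name are made up for illustration):

    import java.util.concurrent.atomic.AtomicInteger

    // One counter per JVM; incrementAndGet() is thread-safe, so parallel
    // tests can never mint the same table name twice.
    object TempTables {
      private val counter = new AtomicInteger(0)
      def uniqueName(prefix: String): String =
        s"${prefix}_${counter.incrementAndGet()}"
    }

    // In a test, instead of df.registerTempTable("users"):
    //   val table = TempTables.uniqueName("users")
    //   df.registerTempTable(table)
    //   val result = hiveContext.sql(s"SELECT COUNT(*) FROM $table")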

--
  Bedrytski Aliaksandr
  sp...@bedryt.ski





Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-19 Thread Everett Anderson
Hi!

Just following up on this --

When people talk about a shared session/context for testing like this, I
assume it's still within one test class. So it's still the case that if you
have a lot of test classes that test Spark-related things, you must
configure your build system not to run them in parallel. You do get the
benefit of not creating and tearing down a Spark session/context between
test cases within a test class, though.

Is that right?

Or have people figured out a way to have sbt (or Maven/Gradle/etc) share
Spark sessions/contexts across integration tests in a safe way?
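
(The build-system configuration I mean would be something like this, in
sbt 0.13-style syntax; Maven and Gradle have rough equivalents:)

    // build.sbt: run test classes sequentially, in a single forked JVM,
    // so a shared SparkContext/SparkSession is safe across classes.
    parallelExecution in Test := false
    fork in Test := true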




Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Holden Karau
That's a good point - there is an open issue for spark-testing-base to
support this shared SparkSession approach - but I haven't had the time (
https://github.com/holdenk/spark-testing-base/issues/123 ). I'll try and
include this in the next release :)
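
For illustration, a sketch of the shared-session idea in Spark 2.x terms
(the names are illustrative, not the actual spark-testing-base API):

    import org.apache.spark.sql.SparkSession

    // Built lazily, once per JVM; every suite that mixes this in shares
    // the same session instead of paying startup cost per test class.
    object SharedSession {
      lazy val spark: SparkSession = SparkSession
        .builder()
        .master("local[2]")
        .appName("shared-test-session")
        .getOrCreate()
    }

    trait SharedSparkSession {
      def spark: SparkSession = SharedSession.spark
    }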



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Koert Kuipers
We share a single SparkSession across tests, and they can run in
parallel. It's pretty fast.



Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Everett Anderson
Hi,

Right now, if any code uses DataFrame/Dataset, I need a test setup that
brings up a local master, as in this article.
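
Roughly this kind of per-class setup, sketched here with ScalaTest and
Spark 2.x's SparkSession (the linked article may differ in the details):

    import org.apache.spark.sql.SparkSession
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    // Bring up a local-master session before the tests in this class
    // and tear it down afterwards.
    class MyDataFrameTest extends FunSuite with BeforeAndAfterAll {

      private var spark: SparkSession = _

      override def beforeAll(): Unit = {
        spark = SparkSession.builder()
          .master("local[2]")
          .appName("MyDataFrameTest")
          .getOrCreate()
      }

      override def afterAll(): Unit = {
        if (spark != null) spark.stop()
      }

      test("a trivial DataFrame operation") {
        val df = spark.range(1, 4).toDF("n")
        assert(df.count() === 3L)
      }
    }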

That's a lot of overhead for unit testing and the tests can't run in
parallel, so testing is slow -- this is more like what I'd call an
integration test.

Do people have any tricks to get around this? Maybe using spy mocks on fake
DataFrame/Datasets?

Anyone know if there are plans to make more traditional unit testing
possible with Spark SQL, perhaps with a stripped-down in-memory
implementation? (I admit this does seem quite hard since there's so much
functionality in these classes!)

Thanks!

- Everett