Re: Configuration for unit testing and sql.shuffle.partitions

2017-09-18 Thread Vadim Semenov
You can create a superclass, say "FunSuiteWithSparkContext", that creates the
SparkSession, SparkContext, and SQLContext with all the desired properties.
Then you extend it in all the relevant test suites, and that's pretty much it.
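A minimal sketch of such a base suite (names and config values are illustrative; assumes Spark 2.x and ScalaTest on the test classpath):

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Hypothetical base class: every suite that extends it shares the same
// small, test-friendly SparkSession.
abstract class FunSuiteWithSparkContext extends FunSuite with BeforeAndAfterAll {

  @transient var spark: SparkSession = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    spark = SparkSession.builder()
      .master("local[2]")
      .appName("unit-tests")
      // 200 shuffle partitions is overkill for tiny test DataFrames
      .config("spark.sql.shuffle.partitions", "4")
      // the web UI only slows tests down
      .config("spark.ui.enabled", "false")
      .getOrCreate()
  }

  override def afterAll(): Unit = {
    if (spark != null) spark.stop()
    super.afterAll()
  }
}
```

A concrete suite then just does `class MySuite extends FunSuiteWithSparkContext` and uses `spark` (and `spark.sparkContext` / `spark.sqlContext`) directly.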

The other option is to pass the settings as JVM parameters, e.g.
`-Dspark.driver.memory=2g -Xmx3G -Dspark.master=local[3]`
(Spark picks up system properties whose names start with `spark.`).

For example, if you run your tests with sbt:

```
SBT_OPTS="-Xmx3G -Dspark.driver.memory=1536m" sbt test
```
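The same flags can also be set per project in `build.sbt` instead of on the command line (a sketch; the test JVM must be forked for the `-D` options to reach it):

```scala
// build.sbt (sketch)
fork in Test := true
javaOptions in Test ++= Seq(
  "-Xmx3G",
  "-Dspark.master=local[3]",
  "-Dspark.driver.memory=1536m",
  "-Dspark.sql.shuffle.partitions=4"
)
```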



Re: Configuration for unit testing and sql.shuffle.partitions

2017-09-16 Thread Femi Anthony
How are you specifying it, as an option to spark-submit?


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Re: Configuration for unit testing and sql.shuffle.partitions

2017-09-16 Thread Akhil Das
spark.sql.shuffle.partitions is still used, I believe. I can see it in the
code and on the configuration documentation page.
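For example, it can be set when building the session, or changed on a live session (a sketch, assuming Spark 2.x):

```scala
import org.apache.spark.sql.SparkSession

// Set it up front for a small local test session
val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.shuffle.partitions", "8")
  .getOrCreate()

// Or adjust it at runtime; subsequent shuffles pick up the new value
spark.conf.set("spark.sql.shuffle.partitions", "4")
```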



-- 
Cheers!


Configuration for unit testing and sql.shuffle.partitions

2017-09-12 Thread peay
Hello,

I am running unit tests with Spark DataFrames, and I am looking for 
configuration tweaks that would make tests faster. Usually, I use a local[2] or 
local[4] master.

Something that has been bothering me is that most of my stages end up using 200
partitions, regardless of whether I repartition the input. This seems overkill
for small unit tests that barely have 200 rows per DataFrame.
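A minimal sketch of what I am seeing (assuming a `spark` session with default settings): any shuffle, such as a groupBy, falls back to 200 partitions even after an explicit repartition of the input.

```scala
// The repartition to 4 is ignored downstream: the groupBy shuffle
// uses spark.sql.shuffle.partitions (200 by default)
val df = spark.range(100).toDF("id")
val counts = df.repartition(4).groupBy("id").count()
println(counts.rdd.getNumPartitions) // 200 with the default configuration
```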

spark.sql.shuffle.partitions used to control this I believe, but it seems to be 
gone and I could not find any information on what mechanism/setting replaces it 
or the corresponding JIRA.

Does anyone have experience to share on how best to tune Spark for very small
local runs like this?

Thanks!