Re: Coalesce behaviour

2018-10-08 Thread Koert Kuipers
Although I personally would describe this as a bug, the answer will be that
this is the intended behavior. The coalesce "infects" the shuffle before
it, making coalesce useless for reducing output files after a shuffle
with many partitions, by design.

Your only remaining option is a repartition, for which you pay the price of
another expensive shuffle.

Interestingly, if you do a coalesce on a map-only job, it knows how to reduce
the partitions and output files without introducing a shuffle, so clearly
it is possible; I just don't know how to get this behavior after a shuffle in
an existing job.
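
A minimal spark-shell sketch of that map-only case (exact RDD ids will
differ); note that the lineage contains no ShuffledRDD:

scala> sc.makeRDD(1 to 100, 20).map(_ * 2).coalesce(5).toDebugString
res0: String =
(5) CoalescedRDD[2] at coalesce at <console>:25 []
 |  MapPartitionsRDD[1] at map at <console>:25 []
 |  ParallelCollectionRDD[0] at makeRDD at <console>:25 []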

On Fri, Oct 5, 2018 at 6:34 PM Sergey Zhemzhitsky wrote:

> Hello guys,
>
> Currently I'm a little confused by coalesce behaviour.
>
> Consider the following use case - I'd like to join two pretty big RDDs.
> To make the join more stable and to prevent it from failing with OOM, the
> RDDs are usually repartitioned to redistribute data more evenly and to
> prevent every partition from hitting the 2GB limit, so the join runs with a
> lot of partitions.
>
> Then, after a successful join, I'd like to save the resulting dataset.
> But I don't need as huge a number of files as there were
> partitions/tasks during the join. Actually I'm fine with as many
> files as the total number of executor cores allocated to the job. So
> I've considered using a coalesce.
>
> The problem is that coalesce with shuffling disabled prevents the join
> from using the specified number of partitions and instead forces the join
> to use the number of partitions provided to coalesce:
>
> scala> sc.makeRDD(1 to 100, 20).repartition(100).coalesce(5,
> false).toDebugString
> res5: String =
> (5) CoalescedRDD[15] at coalesce at <console>:25 []
>  |  MapPartitionsRDD[14] at repartition at <console>:25 []
>  |  CoalescedRDD[13] at repartition at <console>:25 []
>  |  ShuffledRDD[12] at repartition at <console>:25 []
>  +-(20) MapPartitionsRDD[11] at repartition at <console>:25 []
> |   ParallelCollectionRDD[10] at makeRDD at <console>:25 []
>
> With shuffling enabled everything is ok, e.g.
>
> scala> sc.makeRDD(1 to 100, 20).repartition(100).coalesce(5,
> true).toDebugString
> res6: String =
> (5) MapPartitionsRDD[24] at coalesce at <console>:25 []
>  |  CoalescedRDD[23] at coalesce at <console>:25 []
>  |  ShuffledRDD[22] at coalesce at <console>:25 []
>  +-(100) MapPartitionsRDD[21] at coalesce at <console>:25 []
>  |   MapPartitionsRDD[20] at repartition at <console>:25 []
>  |   CoalescedRDD[19] at repartition at <console>:25 []
>  |   ShuffledRDD[18] at repartition at <console>:25 []
>  +-(20) MapPartitionsRDD[17] at repartition at <console>:25 []
> |   ParallelCollectionRDD[16] at makeRDD at <console>:25 []
>
> In that case the problem is that, for pretty huge datasets, the additional
> reshuffling can take hours, or at least an amount of time comparable to
> the join itself.
>
> So I'd like to understand whether this is a bug or just expected
> behaviour.
> In case it is expected, is there any way to insert an additional
> ShuffleMapStage into the appropriate position of the DAG, but without
> the reshuffling itself?
>


Re: Random sampling in tests

2018-10-08 Thread Dongjoon Hyun
Sean's approach looks much better to me
(https://github.com/apache/spark/pull/22672).

It achieves both contradictory goals simultaneously: keeping all test
coverage and reducing the time from 2:31 to 0:24.

Since we can remove test coverage anytime, can we proceed with Sean's
non-intrusive approach first, before removing anything?

Bests,
Dongjoon.


On Mon, Oct 8, 2018 at 8:57 AM Xiao Li  wrote:

> Yes. Testing all the timezones is not needed.
>
> Xiao
>
> On Mon, Oct 8, 2018 at 8:36 AM Maxim Gekk 
> wrote:
>
>> Hi All,
>>
>> I believe we should also take into account what we test. For example, I
>> don't think it makes sense to check all timezones for JSON/CSV
>> functions/datasources, because those timezones are just passed to external
>> libraries. So the same code is exercised for each of the 650
>> timezones. We basically just spend time and resources on testing the
>> external libraries.
>>
>> I mean the PRs: https://github.com/apache/spark/pull/22657 and
>> https://github.com/apache/spark/pull/22379#discussion_r223039662
>>
>> Maxim Gekk
>>
>> Technical Solutions Lead
>>
>> Databricks B. V.  
>>
>>
>> On Mon, Oct 8, 2018 at 4:49 PM Sean Owen  wrote:
>>
>>> If the problem is simply reducing the wall-clock time of tests, then
>>> even before we get to this question, I'm advocating:
>>>
>>> 1) try simple parallelization of tests within the suite. In this
>>> instance there's no reason not to test these in parallel and get an 8x
>>> or 16x speedup from cores. This assumes, I believe correctly, that the
>>> machines aren't generally near 100% utilization.
>>> 2) explicitly choose a smaller, more directed set of cases to test
>>>
>>> Randomly choosing test cases with a fixed seed is basically 2), but without
>>> choosing the test cases for any particular reason. You can vary the seed, but
>>> as a rule the same random subset of tests is always chosen. That could be
>>> fine if there's no reason at all to prefer some cases over others. But
>>> I am guessing that even a wild guess at the most important subset of cases
>>> to test is better than random.
>>>
>>> I'm trying 1) right now in these several cases.
>>> On Mon, Oct 8, 2018 at 9:24 AM Xiao Li  wrote:
>>> >
>>> > For this specific case, I do not think we should test all the
>>> timezones. If this were fast, I would be fine with leaving it unchanged.
>>> However, it is very slow. Thus, I would even prefer reducing the tested
>>> timezones to a smaller number, or just hardcoding some specific time zones.
>>> >
>>> > In general, I like Reynold’s idea of including the seed value and adding
>>> the seed to the test case name. This can help us reproduce failures.
>>> >
>>> > Xiao
>>> >
>>> > On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin 
>>> wrote:
>>> >>
>>> >> I'm personally not a big fan of doing it that way in the PR. It is
>>> perfectly fine to employ randomized tests, and in this case it might even
>>> be fine to just pick a couple of different timezones, the way it happened in
>>> the PR, but we should:
>>> >>
>>> >> 1. Document in a code comment why we did it that way.
>>> >>
>>> >> 2. Use a seed and log the seed, so any test failures can be
>>> reproduced deterministically. For this one, it'd be better to pick the seed
>>> from an environment variable. If the variable is not set, use
>>> a random seed.
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>>> >>>
>>> >>> Recently, I've seen 3 pull requests that try to speed up a test suite
>>> >>> that tests a bunch of cases by randomly choosing different subsets of
>>> >>> cases to test on each Jenkins run.
>>> >>>
>>> There's disagreement about whether this is a good approach to improving
>>> >>> test runtime. Here's a discussion on one that was committed:
>>> >>> https://github.com/apache/spark/pull/22631/files#r223190476
>>> >>>
>>> >>> I'm flagging it for more input.
>>> >>>
>>> >>> -
>>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-08 Thread Matt Cheah
Relying on kubectl exec may not be the best solution, because clusters with
locked-down security will not grant users permission to execute arbitrary code
in pods. I can’t think of a great alternative right now, but I wanted to bring
this to our attention for the time being.

 

-Matt Cheah

 

From: Rob Vesse 
Date: Monday, October 8, 2018 at 10:09 AM
To: dev 
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes

 

Well yes.  However, the submission client is already able to monitor the driver
pod status, so it can see when the pod is up and running.  And couldn’t we
potentially modify the K8S entry points, e.g. KubernetesClientApplication, that
run inside the driver pods to wait for dependencies to be uploaded?

 

I guess at this stage I am just throwing ideas out there and trying to figure 
out what’s practical/reasonable

 

Rob

 

From: Yinan Li 
Date: Monday, 8 October 2018 at 17:36
To: Rob Vesse 
Cc: dev 
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes

 

However, the pod must be up and running for this to work. So if you want to use 
this to upload dependencies to the driver pod, the driver pod must already be 
up and running. So you may not even have a chance to upload the dependencies at 
this point.





Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-08 Thread Rob Vesse
Well yes.  However, the submission client is already able to monitor the driver
pod status, so it can see when the pod is up and running.  And couldn’t we
potentially modify the K8S entry points, e.g. KubernetesClientApplication, that
run inside the driver pods to wait for dependencies to be uploaded?

 

I guess at this stage I am just throwing ideas out there and trying to figure 
out what’s practical/reasonable

 

Rob

 

From: Yinan Li 
Date: Monday, 8 October 2018 at 17:36
To: Rob Vesse 
Cc: dev 
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes

 

However, the pod must be up and running for this to work. So if you want to use 
this to upload dependencies to the driver pod, the driver pod must already be 
up and running. So you may not even have a chance to upload the dependencies at 
this point.



Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-08 Thread Yinan Li
> You can do this manually yourself via kubectl cp, so it should be possible
> to do this programmatically, since it looks like this is just a tar piped
> into a kubectl exec. This would keep the relevant logic in the
> Kubernetes-specific client, which may or may not be desirable depending on
> whether we’re looking to fix this just for K8S or more generally. Of course
> there is probably a fair bit of complexity in making this work, but does
> that sound like something worth exploring?

Yes, kubectl cp is able to copy files from your local machine into a
container in a pod. However, the pod must be up and running for this to
work. So if you want to use this to upload dependencies to the driver pod,
the driver pod must already be up and running, and you may not even have a
chance to upload the dependencies by that point.

On Mon, Oct 8, 2018 at 6:36 AM Rob Vesse  wrote:

> Folks, thanks for all the great input. Responding to various points raised:
>
>
>
> Marcelo/Yinan/Felix –
>
>
>
> Yes, client mode will work.  The main JAR will be automatically
> distributed, and --jars/--files specified dependencies are also distributed,
> though for --files user code needs to use the appropriate Spark API to
> resolve the actual path, i.e. SparkFiles.get().
>
>
>
> However, client mode can be awkward if you want to mix spark-submit
> distribution with mounting dependencies via volumes, since you may need to
> ensure that dependencies appear at the same path both on the local
> submission client and when mounted into the executors.  This mainly applies
> to the case where user code does not use SparkFiles.get() and simply tries
> to access the path directly.
>
>
>
> Marcelo/Stavros –
>
>
>
> Yes, I did give the other resource managers too much credit.  From my past
> experience with Mesos and Standalone I had thought this wasn’t an issue, but
> going back and looking at what we did for both of those, it appears we were
> entirely reliant on the shared file system (whether HDFS, NFS or other
> POSIX-compliant filesystems, e.g. Lustre).
>
>
>
> Since connectivity back to the client is a potential stumbling block for
> cluster mode, I wonder if it would be better to think in reverse, i.e. rather
> than having the driver pull from the client, have the client push to the
> driver pod?
>
>
>
> You can do this manually yourself via kubectl cp, so it should be possible
> to do this programmatically, since it looks like this is just a tar piped
> into a kubectl exec. This would keep the relevant logic in the
> Kubernetes-specific client, which may or may not be desirable depending on
> whether we’re looking to fix this just for K8S or more generally. Of course
> there is probably a fair bit of complexity in making this work, but does
> that sound like something worth exploring?
>
>
>
> I hadn’t really considered the HA aspect; a first step would be to get the
> basics working and then look at HA.  Although if the above
> theoretical approach is practical, that could simply be part of restarting
> the driver.
>
>
>
> Rob
>
>
>
>
>
> *From: *Felix Cheung 
> *Date: *Sunday, 7 October 2018 at 23:00
> *To: *Yinan Li , Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com>
> *Cc: *Rob Vesse , dev 
> *Subject: *Re: [DISCUSS][K8S] Local dependencies with Kubernetes
>
>
>
> Jars and libraries only accessible locally at the driver are fairly
> limited? Don’t you want the same on all executors?
>
>
>
>
>
>
> --
>
> *From:* Yinan Li 
> *Sent:* Friday, October 5, 2018 11:25 AM
> *To:* Stavros Kontopoulos
> *Cc:* rve...@dotnetrdf.org; dev
> *Subject:* Re: [DISCUSS][K8S] Local dependencies with Kubernetes
>
>
>
> > Just to be clear: in client mode things work right? (Although I'm not
> really familiar with how client mode works in k8s - never tried it.)
>
>
>
> If the driver runs on the submission client machine, yes, it should just
> work. If the driver runs in a pod, however, it faces the same problem as in
> cluster mode.
>
>
>
> Yinan
>
>
>
> On Fri, Oct 5, 2018 at 11:06 AM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
> @Marcelo is correct. Mesos does not have something similar. Only Yarn does
> due to the distributed cache thing.
>
> I have described most of the above in the jira; there are also some
> other options.
>
>
>
> Best,
>
> Stavros
>
>
>
> On Fri, Oct 5, 2018 at 8:28 PM, Marcelo Vanzin <
> van...@cloudera.com.invalid> wrote:
>
> On Fri, Oct 5, 2018 at 7:54 AM Rob Vesse  wrote:
> > Ideally this would all just be handled automatically for users in the
> way that all other resource managers do
>
> I think you're giving other resource managers too much credit. In
> cluster mode, only YARN really distributes local dependencies, because
> YARN has that feature (its distributed cache) and Spark just uses it.
>
> Standalone doesn't do it (see SPARK-4160) and I don't remember seeing
> anything similar on the Mesos side.
>
> There are things that could be done; e.g. if you have HDFS you could
> do a restricted version of what YARN does (upload files to HDFS, and
> change the "spark.jars" and "spark.files" URLs to point to HDFS instead).

Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-08 Thread Marcelo Vanzin
On Mon, Oct 8, 2018 at 6:36 AM Rob Vesse  wrote:
> Since connectivity back to the client is a potential stumbling block for
> cluster mode, I wonder if it would be better to think in reverse, i.e. rather
> than having the driver pull from the client, have the client push to the
> driver pod?
>
> You can do this manually yourself via kubectl cp, so it should be possible
> to do this programmatically, since it looks like this is just a tar piped
> into a kubectl exec. This would keep the relevant logic in the
> Kubernetes-specific client, which may or may not be desirable depending on
> whether we’re looking to fix this just for K8S or more generally. Of course
> there is probably a fair bit of complexity in making this work, but does
> that sound like something worth exploring?

That sounds like a good solution, especially if there's a programmatic
API for it, instead of having to fork a sub-process to upload the
files.

>  I hadn’t really considered the HA aspect

When you say HA here, what do you mean exactly? I don't really expect
two drivers for the same app running at the same time, so my first
guess is you mean "reattempts", just like YARN supports: re-running
the driver if the first one fails?

That can be tricky without some shared storage mechanism, because in
cluster mode the submission client doesn't need to stay alive after
the application starts. Or at least it doesn't with other cluster
managers.


-- 
Marcelo




Re: Random sampling in tests

2018-10-08 Thread Xiao Li
Yes. Testing all the timezones is not needed.

Xiao

On Mon, Oct 8, 2018 at 8:36 AM Maxim Gekk  wrote:

> Hi All,
>
> I believe we should also take into account what we test. For example, I
> don't think it makes sense to check all timezones for JSON/CSV
> functions/datasources, because those timezones are just passed to external
> libraries. So the same code is exercised for each of the 650
> timezones. We basically just spend time and resources on testing the
> external libraries.
>
> I mean the PRs: https://github.com/apache/spark/pull/22657 and
> https://github.com/apache/spark/pull/22379#discussion_r223039662
>
> Maxim Gekk
>
> Technical Solutions Lead
>
> Databricks B. V.  
>
>
> On Mon, Oct 8, 2018 at 4:49 PM Sean Owen  wrote:
>
>> If the problem is simply reducing the wall-clock time of tests, then
>> even before we get to this question, I'm advocating:
>>
>> 1) try simple parallelization of tests within the suite. In this
>> instance there's no reason not to test these in parallel and get an 8x
>> or 16x speedup from cores. This assumes, I believe correctly, that the
>> machines aren't generally near 100% utilization.
>> 2) explicitly choose a smaller, more directed set of cases to test
>>
>> Randomly choosing test cases with a fixed seed is basically 2), but without
>> choosing the test cases for any particular reason. You can vary the seed, but
>> as a rule the same random subset of tests is always chosen. That could be
>> fine if there's no reason at all to prefer some cases over others. But
>> I am guessing that even a wild guess at the most important subset of cases
>> to test is better than random.
>>
>> I'm trying 1) right now in these several cases.
>> On Mon, Oct 8, 2018 at 9:24 AM Xiao Li  wrote:
>> >
>> > For this specific case, I do not think we should test all the timezones.
>> If this were fast, I would be fine with leaving it unchanged. However, it
>> is very slow. Thus, I would even prefer reducing the tested timezones to a
>> smaller number, or just hardcoding some specific time zones.
>> >
>> > In general, I like Reynold’s idea of including the seed value and adding
>> the seed to the test case name. This can help us reproduce failures.
>> >
>> > Xiao
>> >
>> > On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>> >>
>> >> I'm personally not a big fan of doing it that way in the PR. It is
>> perfectly fine to employ randomized tests, and in this case it might even
>> be fine to just pick a couple of different timezones, the way it happened in
>> the PR, but we should:
>> >>
>> >> 1. Document in a code comment why we did it that way.
>> >>
>> >> 2. Use a seed and log the seed, so any test failures can be reproduced
>> deterministically. For this one, it'd be better to pick the seed from an
>> environment variable. If the variable is not set, use a
>> random seed.
>> >>
>> >>
>> >>
>> >> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>> >>>
>> >>> Recently, I've seen 3 pull requests that try to speed up a test suite
>> >>> that tests a bunch of cases by randomly choosing different subsets of
>> >>> cases to test on each Jenkins run.
>> >>>
>> >>> There's disagreement about whether this is a good approach to improving
>> >>> test runtime. Here's a discussion on one that was committed:
>> >>> https://github.com/apache/spark/pull/22631/files#r223190476
>> >>>
>> >>> I'm flagging it for more input.
>> >>>


DataSourceV2 documentation & tutorial

2018-10-08 Thread assaf.mendelson
Hi all,

I have been working on a legacy datasource integration with data source V2
for the last couple of weeks, including upgrading it to the Spark 2.4.0 RC.

During this process I wrote a tutorial explaining how to create a
new datasource (it can be found at
https://github.com/assafmendelson/DataSourceV2).
It is still a work in progress (there are still a lot of TODOs in it);
however, I figured others might find it useful.

I was wondering if there is some place in the Spark documentation where we
could put something like this, so it would be continually updated with the
ongoing changes to the API.

Of course, if I have mistakes in it (which I probably do), I would be happy
to learn of them…

Thanks, 
Assaf








Re: Random sampling in tests

2018-10-08 Thread Maxim Gekk
Hi All,

I believe we should also take into account what we test. For example, I
don't think it makes sense to check all timezones for JSON/CSV
functions/datasources, because those timezones are just passed to external
libraries. So the same code is exercised for each of the 650
timezones. We basically just spend time and resources on testing the
external libraries.
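
As a rough sense of the scale involved (a sketch, not code from the PRs
below), the full list the tests currently iterate over comes from the JDK:

scala> java.util.TimeZone.getAvailableIDs.length  // roughly 600-650 IDs, depending on the JDK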

I mean the PRs: https://github.com/apache/spark/pull/22657 and
https://github.com/apache/spark/pull/22379#discussion_r223039662

Maxim Gekk

Technical Solutions Lead

Databricks B. V.  


On Mon, Oct 8, 2018 at 4:49 PM Sean Owen  wrote:

> If the problem is simply reducing the wall-clock time of tests, then
> even before we get to this question, I'm advocating:
>
> 1) try simple parallelization of tests within the suite. In this
> instance there's no reason not to test these in parallel and get an 8x
> or 16x speedup from cores. This assumes, I believe correctly, that the
> machines aren't generally near 100% utilization.
> 2) explicitly choose a smaller, more directed set of cases to test
>
> Randomly choosing test cases with a fixed seed is basically 2), but without
> choosing the test cases for any particular reason. You can vary the seed, but
> as a rule the same random subset of tests is always chosen. That could be
> fine if there's no reason at all to prefer some cases over others. But
> I am guessing that even a wild guess at the most important subset of cases
> to test is better than random.
>
> I'm trying 1) right now in these several cases.
> On Mon, Oct 8, 2018 at 9:24 AM Xiao Li  wrote:
> >
> > For this specific case, I do not think we should test all the timezones.
> If this were fast, I would be fine with leaving it unchanged. However, it
> is very slow. Thus, I would even prefer reducing the tested timezones to a
> smaller number, or just hardcoding some specific time zones.
> >
> > In general, I like Reynold’s idea of including the seed value and adding
> the seed to the test case name. This can help us reproduce failures.
> >
> > Xiao
> >
> > On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
> >>
> >> I'm personally not a big fan of doing it that way in the PR. It is
> perfectly fine to employ randomized tests, and in this case it might even
> be fine to just pick a couple of different timezones, the way it happened in
> the PR, but we should:
> >>
> >> 1. Document in a code comment why we did it that way.
> >>
> >> 2. Use a seed and log the seed, so any test failures can be reproduced
> deterministically. For this one, it'd be better to pick the seed from an
> environment variable. If the variable is not set, use a
> random seed.
> >>
> >>
> >>
> >> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
> >>>
> >>> Recently, I've seen 3 pull requests that try to speed up a test suite
> >>> that tests a bunch of cases by randomly choosing different subsets of
> >>> cases to test on each Jenkins run.
> >>>
> >>> There's disagreement about whether this is a good approach to improving
> >>> test runtime. Here's a discussion on one that was committed:
> >>> https://github.com/apache/spark/pull/22631/files#r223190476
> >>>
> >>> I'm flagging it for more input.
> >>>


Re: Random sampling in tests

2018-10-08 Thread Sean Owen
If the problem is simply reducing the wall-clock time of tests, then
even before we get to this question, I'm advocating:

1) try simple parallelization of tests within the suite. In this
instance there's no reason not to test these in parallel and get an 8x
or 16x speedup from cores. This assumes, I believe correctly, that the
machines aren't generally near 100% utilization.
2) explicitly choose a smaller, more directed set of cases to test

Randomly choosing test cases with a fixed seed is basically 2), but without
choosing the test cases for any particular reason. You can vary the seed, but
as a rule the same random subset of tests is always chosen. That could be
fine if there's no reason at all to prefer some cases over others. But
I am guessing that even a wild guess at the most important subset of cases
to test is better than random.

I'm trying 1) right now in these several cases.
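
As an illustration of option 1), a minimal sketch (not necessarily what the
actual PRs do) that fans the per-timezone cases out across cores with Scala
parallel collections; checkDateFormat stands in for a hypothetical
per-timezone assertion:

import java.util.TimeZone

// Run the same assertion body for every timezone, across all cores
// instead of one zone at a time.
TimeZone.getAvailableIDs.par.foreach { tzId =>
  checkDateFormat(TimeZone.getTimeZone(tzId))  // hypothetical assertion
}
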
On Mon, Oct 8, 2018 at 9:24 AM Xiao Li  wrote:
>
> For this specific case, I do not think we should test all the timezones. If
> this were fast, I would be fine with leaving it unchanged. However, it is very
> slow. Thus, I would even prefer reducing the tested timezones to a smaller
> number, or just hardcoding some specific time zones.
>
> In general, I like Reynold’s idea of including the seed value and adding the
> seed to the test case name. This can help us reproduce failures.
>
> Xiao
>
> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>>
>> I'm personally not a big fan of doing it that way in the PR. It is perfectly 
>> fine to employ randomized tests, and in this case it might even be fine to 
>> just pick a couple of different timezones, the way it happened in the PR, but
>> we should:
>>
>> 1. Document in a code comment why we did it that way.
>>
>> 2. Use a seed and log the seed, so any test failures can be reproduced
>> deterministically. For this one, it'd be better to pick the seed from an
>> environment variable. If the variable is not set, use a random seed.
>>
>>
>>
>> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>>>
>>> Recently, I've seen 3 pull requests that try to speed up a test suite
>>> that tests a bunch of cases by randomly choosing different subsets of
>>> cases to test on each Jenkins run.
>>>
>>> There's disagreement about whether this is a good approach to improving
>>> test runtime. Here's a discussion on one that was committed:
>>> https://github.com/apache/spark/pull/22631/files#r223190476
>>>
>>> I'm flagging it for more input.
>>>



Re: Random sampling in tests

2018-10-08 Thread Marco Gaido
Yes, I see. It makes sense.
Thanks.

On Mon, Oct 8, 2018 at 16:35 Reynold Xin
wrote:

> Marco - the issue is reproducibility. It is much more annoying for somebody
> else, who might not have touched this test case, to be able to reproduce the
> error given just a timezone. It is much easier to just follow some
> documentation saying "please run TEST_SEED=5 build/sbt ~ ".
>
>
> On Mon, Oct 8, 2018 at 4:33 PM Marco Gaido  wrote:
>
>> Hi all,
>>
>> thanks for bringing up the topic, Sean. I too agree with Reynold's idea,
>> but in this specific case, if there is an error, the timezone is part of the
>> error message.
>> So we know exactly which timezone caused the failure. Hence I thought
>> that logging the seed is not necessary, as we can directly use the failing
>> timezone.
>>
>> Thanks,
>> Marco
>>
On Mon, Oct 8, 2018 at 16:24 Xiao Li
wrote:
>>
>>> For this specific case, I do not think we should test all the timezones.
>>> If this were fast, I would be fine with leaving it unchanged. However, it
>>> is very slow. Thus, I would even prefer reducing the tested timezones to a
>>> smaller number, or just hardcoding some specific time zones.
>>>
>>> In general, I like Reynold’s idea of including the seed value and adding
>>> the seed to the test case name. This can help us reproduce failures.
>>>
>>> Xiao
>>>
>>> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>>>
 I'm personally not a big fan of doing it that way in the PR. It is
 perfectly fine to employ randomized tests, and in this case it might even
 be fine to just pick a couple of different timezones, the way it happened in
 the PR, but we should:

 1. Document in a code comment why we did it that way.

 2. Use a seed and log the seed, so any test failures can be reproduced
 deterministically. For this one, it'd be better to pick the seed from an
 environment variable. If the variable is not set, use a
 random seed.



 On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:

> Recently, I've seen 3 pull requests that try to speed up a test suite
> that tests a bunch of cases by randomly choosing different subsets of
> cases to test on each Jenkins run.
>
> There's disagreement about whether this is a good approach to improving
> test runtime. Here's a discussion on one that was committed:
> https://github.com/apache/spark/pull/22631/files#r223190476
>
> I'm flagging it for more input.
>


Re: Random sampling in tests

2018-10-08 Thread Reynold Xin
Marco - the issue is reproducibility. It is much more annoying for somebody
else, who might not have touched this test case, to be able to reproduce the
error given just a timezone. It is much easier to just follow some
documentation saying "please run TEST_SEED=5 build/sbt ~ ".


On Mon, Oct 8, 2018 at 4:33 PM Marco Gaido  wrote:

> Hi all,
>
> thanks for bringing up the topic, Sean. I too agree with Reynold's idea,
> but in this specific case, if there is an error, the timezone is part of the
> error message.
> So we know exactly which timezone caused the failure. Hence I thought that
> logging the seed is not necessary, as we can directly use the failing
> timezone.
>
> Thanks,
> Marco
>
> On Mon, Oct 8, 2018 at 16:24 Xiao Li
> wrote:
>
>> For this specific case, I do not think we should test all the timezones.
>> If this were fast, I would be fine with leaving it unchanged. However, it
>> is very slow. Thus, I would even prefer reducing the tested timezones to a
>> smaller number, or just hardcoding some specific time zones.
>>
>> In general, I like Reynold’s idea of including the seed value and adding
>> the seed to the test case name. This can help us reproduce failures.
>>
>> Xiao
>>
>> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>>
>>> I'm personally not a big fan of doing it that way in the PR. It is
>>> perfectly fine to employ randomized tests, and in this case it might even
>>> be fine to just pick a couple of different timezones, the way it happened in
>>> the PR, but we should:
>>>
>>> 1. Document in a code comment why we did it that way.
>>>
>>> 2. Use a seed and log the seed, so any test failures can be reproduced
>>> deterministically. For this one, it'd be better to pick the seed from an
>>> environment variable. If the variable is not set, use a
>>> random seed.
>>>
>>>
>>>
>>> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>>>
 Recently, I've seen 3 pull requests that try to speed up a test suite
 that tests a bunch of cases by randomly choosing different subsets of
 cases to test on each Jenkins run.

 There's disagreement about whether this is a good approach to improving
 test runtime. Here's a discussion on one that was committed:
 https://github.com/apache/spark/pull/22631/files#r223190476

 I'm flagging it for more input.





Re: Random sampling in tests

2018-10-08 Thread Marco Gaido
Hi all,

thanks for bringing up the topic, Sean. I too agree with Reynold's idea, but
in this specific case, if there is an error, the timezone is part of the
error message.
So we know exactly which timezone caused the failure. Hence I thought that
logging the seed is not necessary, as we can directly use the failing
timezone.

Thanks,
Marco

On Mon, Oct 8, 2018 at 16:24 Xiao Li
wrote:

> For this specific case, I do not think we should test all the timezones. If
> this were fast, I would be fine with leaving it unchanged. However, it is
> very slow. Thus, I would even prefer reducing the tested timezones to a
> smaller number, or just hardcoding some specific time zones.
>
> In general, I like Reynold’s idea of including the seed value and adding
> the seed to the test case name. This can help us reproduce failures.
>
> Xiao
>
> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>
>> I'm personally not a big fan of doing it that way in the PR. It is
>> perfectly fine to employ randomized tests, and in this case it might even
>> be fine to just pick a couple of different timezones, the way it happened in
>> the PR, but we should:
>>
>> 1. Document in a code comment why we did it that way.
>>
>> 2. Use a seed and log the seed, so any test failures can be reproduced
>> deterministically. For this one, it'd be better to pick the seed from an
>> environment variable. If the variable is not set, use a
>> random seed.
>>
>>
>>
>> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>>
>>> Recently, I've seen 3 pull requests that try to speed up a test suite
>>> that tests a bunch of cases by randomly choosing different subsets of
>>> cases to test on each Jenkins run.
>>>
>>> There's disagreement about whether this is a good approach to improving
>>> test runtime. Here's a discussion on one that was committed:
>>> https://github.com/apache/spark/pull/22631/files#r223190476
>>>
>>> I'm flagging it for more input.
>>>


Re: Random sampling in tests

2018-10-08 Thread Xiao Li
For this specific case, I do not think we should test all the timezones. If
this were fast, I would be fine with leaving it unchanged. However, it is very
slow. Thus, I would even prefer reducing the tested timezones to a smaller
number, or just hardcoding some specific time zones.

In general, I like Reynold’s idea of including the seed value and adding
the seed to the test case name. This can help us reproduce failures.
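
A sketch of the hardcoding option (the zone choices here are illustrative,
not from any PR): pick a handful of representative zones instead of all of
them, e.g. UTC, a negative offset with DST, a half-hour offset, and an
extreme one:

import java.util.TimeZone

val representativeZones = Seq(
  "UTC",                  // no offset
  "America/Los_Angeles",  // negative offset, observes DST
  "Asia/Kolkata",         // positive half-hour offset
  "Pacific/Kiritimati"    // extreme UTC+14 offset
).map(TimeZone.getTimeZone)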

Xiao

On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:

> I'm personally not a big fan of doing it that way in the PR. It is
> perfectly fine to employ randomized tests, and in this case it might even
> be fine to just pick a couple of different timezones, the way it happened in
> the PR, but we should:
>
> 1. Document in a code comment why we did it that way.
>
> 2. Use a seed and log the seed, so any test failures can be reproduced
> deterministically. For this one, it'd be better to pick the seed from an
> environment variable. If the variable is not set, use a
> random seed.
>
>
>
> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>
>> Recently, I've seen 3 pull requests that try to speed up a test suite
>> that tests a bunch of cases by randomly choosing different subsets of
>> cases to test on each Jenkins run.
>>
>> There's disagreement about whether this is a good approach to improving
>> test runtime. Here's a discussion on one that was committed:
>> https://github.com/apache/spark/pull/22631/files#r223190476
>>
>> I'm flagging it for more input.
>>


Re: Random sampling in tests

2018-10-08 Thread Reynold Xin
I'm personally not a big fan of doing it that way in the PR. It is
perfectly fine to employ randomized tests, and in this case it might even
be fine to just pick a couple of different timezones, the way it happened in
the PR, but we should:

1. Document in a code comment why we did it that way.

2. Use a seed and log the seed, so any test failures can be reproduced
deterministically. For this one, it'd be better to pick the seed from an
environment variable. If the variable is not set, use a
random seed.
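
A minimal sketch of point 2, where SPARK_TEST_SEED is a hypothetical
variable name:

import scala.util.Random

// Take the seed from the environment when set, otherwise pick one at
// random, and always log it so a failure can be replayed exactly.
val seed = sys.env.get("SPARK_TEST_SEED").map(_.toLong)
  .getOrElse(new Random().nextLong())
println(s"Random sampling seed: $seed (set SPARK_TEST_SEED=$seed to reproduce)")

val rng = new Random(seed)
val sampledZones = rng.shuffle(java.util.TimeZone.getAvailableIDs.toSeq).take(10)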



On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:

> Recently, I've seen 3 pull requests that try to speed up a test suite
> that tests a bunch of cases by randomly choosing different subsets of
> cases to test on each Jenkins run.
>
> There's disagreement about whether this is a good approach to improving
> test runtime. Here's a discussion on one that was committed:
> https://github.com/apache/spark/pull/22631/files#r223190476
>
> I'm flagging it for more input.
>


Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-08 Thread Rob Vesse
Folks, thanks for all the great input. Responding to various points raised:

 

Marcelo/Yinan/Felix – 

 

Yes, client mode will work.  The main JAR will be automatically distributed,
and --jars/--files specified dependencies are also distributed, though for
--files user code needs to use the appropriate Spark API to resolve the actual
path, i.e. SparkFiles.get().

 

However, client mode can be awkward if you want to mix spark-submit
distribution with mounting dependencies via volumes, since you may need to
ensure that dependencies appear at the same path both on the local submission
client and when mounted into the executors.  This mainly applies to the case
where user code does not use SparkFiles.get() and simply tries to access the
path directly.
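
For reference, a small sketch of the SparkFiles pattern ("lookup.csv" being a
hypothetical --files entry):

import org.apache.spark.SparkFiles

// A file shipped with --files lookup.csv is resolved on the driver and
// executors via SparkFiles.get, not via its submission-side path.
val path  = SparkFiles.get("lookup.csv")
val lines = scala.io.Source.fromFile(path).getLines().toSeq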

 

Marcelo/Stavros – 

 

Yes, I did give the other resource managers too much credit.  From my past
experience with Mesos and Standalone I had thought this wasn’t an issue, but
going back and looking at what we did for both of those, it appears we were
entirely reliant on the shared file system (whether HDFS, NFS or other
POSIX-compliant filesystems, e.g. Lustre).

 

Since connectivity back to the client is a potential stumbling block for
cluster mode, I wonder if it would be better to think in reverse, i.e. rather
than having the driver pull from the client, have the client push to the
driver pod?

 

You can do this manually yourself via kubectl cp, so it should be possible to
do this programmatically, since it looks like this is just a tar piped into a
kubectl exec. This would keep the relevant logic in the Kubernetes-specific
client, which may or may not be desirable depending on whether we’re looking
to fix this just for K8S or more generally. Of course there is probably a fair
bit of complexity in making this work, but does that sound like something
worth exploring?
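
A rough sketch of what that might look like from Scala, shelling out the same
way kubectl cp does (the pod name, namespace, and target directory are
assumed values):

import scala.sys.process._

val pod       = "spark-driver"   // assumed driver pod name
val namespace = "default"
val localDir  = "deps"           // local directory holding the jars/files

// Stream a tar of the local files into `tar -x` running inside the pod,
// which is the same mechanism kubectl cp uses under the hood.
val localTar = Seq("tar", "cf", "-", "-C", localDir, ".")
val podUntar = Seq("kubectl", "exec", "-i", "-n", namespace, pod, "--",
  "tar", "xf", "-", "-C", "/opt/spark/work-dir")
(localTar #| podUntar).!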

 

I hadn’t really considered the HA aspect; a first step would be to get the
basics working and then look at HA.  Although if the above theoretical
approach is practical, that could simply be part of restarting the
driver.

 

Rob

 

 

From: Felix Cheung 
Date: Sunday, 7 October 2018 at 23:00
To: Yinan Li , Stavros Kontopoulos 

Cc: Rob Vesse , dev 
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes

 

Jars and libraries only accessible locally at the driver are fairly limited?
Don’t you want the same on all executors?

 

 

 

From: Yinan Li 
Sent: Friday, October 5, 2018 11:25 AM
To: Stavros Kontopoulos
Cc: rve...@dotnetrdf.org; dev
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes 

 

> Just to be clear: in client mode things work right? (Although I'm not
really familiar with how client mode works in k8s - never tried it.) 

 

If the driver runs on the submission client machine, yes, it should just work. 
If the driver runs in a pod, however, it faces the same problem as in cluster 
mode.

 

Yinan

 

On Fri, Oct 5, 2018 at 11:06 AM Stavros Kontopoulos 
 wrote:

@Marcelo is correct. Mesos does not have something similar. Only Yarn does due 
to the distributed cache thing. 

I have described most of the above in the jira; there are also some other
options.

 

Best,

Stavros

 

On Fri, Oct 5, 2018 at 8:28 PM, Marcelo Vanzin  
wrote:

On Fri, Oct 5, 2018 at 7:54 AM Rob Vesse  wrote:
> Ideally this would all just be handled automatically for users in the way 
> that all other resource managers do

I think you're giving other resource managers too much credit. In
cluster mode, only YARN really distributes local dependencies, because
YARN has that feature (its distributed cache) and Spark just uses it.

Standalone doesn't do it (see SPARK-4160) and I don't remember seeing
anything similar on the Mesos side.

There are things that could be done; e.g. if you have HDFS you could
do a restricted version of what YARN does (upload files to HDFS, and
change the "spark.jars" and "spark.files" URLs to point to HDFS
instead). Or you could turn the submission client into a file server
that the cluster-mode driver downloads files from - although that
requires connectivity from the driver back to the client.

Neither is great, but better than not having that feature.

Just to be clear: in client mode things work right? (Although I'm not
really familiar with how client mode works in k8s - never tried it.)

-- 
Marcelo




 



Random sampling in tests

2018-10-08 Thread Sean Owen
Recently, I've seen 3 pull requests that try to speed up a test suite
that tests a bunch of cases by randomly choosing different subsets of
cases to test on each Jenkins run.

There's disagreement about whether this is a good approach to improving
test runtime. Here's a discussion on one that was committed:
https://github.com/apache/spark/pull/22631/files#r223190476

I'm flagging it for more input.
