[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-08 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22631#discussion_r223354745
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("cast string to timestamp") {
-for (tz <- ALL_TIMEZONES) {
+for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
--- End diff --

We aren't blocked on CPU time or resources, no. The tests are mostly 
single-threaded, and the big test machines mostly idle throughout the runs. 
I've opened https://github.com/apache/spark/pull/22672 to evaluate that.

I feel strongly that we can't do this kind of thing and am opening a thread 
on dev@ for wider discussion.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22631#discussion_r223314632
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("cast string to timestamp") {
-for (tz <- ALL_TIMEZONES) {
+for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
--- End diff --

> we should categorize timezones, pick up some timezones representing them 
and test fixed set

That would be best, but it requires some deep understanding of timezones to 
make sure the test coverage is good.


---




[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-08 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22631#discussion_r223292455
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("cast string to timestamp") {
-for (tz <- ALL_TIMEZONES) {
+for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
--- End diff --

I don't think that adding parallelism is a good way to improve test time. 
The amount of resources used for testing is limited either way. I think the goal 
here is not (only) to reduce the overall time of the test but also to reduce the 
amount of resources needed to test.

Problems with a specific timezone like the one you mentioned, @HyukjinKwon, are 
exactly the reason why I am proposing this randomized approach, rather than 
picking 3 timezones and always using them as done in `DateExpressionsSuite`: if 
there is a problem with a specific timezone, this way we will be able to 
catch it (though not necessarily on a single run). With a fixed subset of them, 
we cannot.
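A back-of-envelope check of this claim (the numbers below are my own, not from 
the thread): sampling 50 of roughly 600 zones per run, a single problematic zone 
escapes any one run with probability 550/600, so detection is likely only across 
many runs.

```scala
// Hypothetical probability sketch (numbers assumed, not from the thread):
// chance that a single bad timezone is still undetected after k randomized
// runs, when each run samples 50 of ~600 zones.
object CatchProbability {
  def missAfter(runs: Int): Double = math.pow(550.0 / 600.0, runs)

  def main(args: Array[String]): Unit = {
    // Roughly a 42% chance the bad zone has never been sampled after 10 runs.
    println(f"undetected after 10 runs: ${missAfter(10)}%.2f")
  }
}
```

So the randomized subset does catch a lone bad timezone eventually, but only 
with reasonable confidence after on the order of tens of runs.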

The only safe deterministic way would be to run against all of them, as was 
done before, but then I'd argue that we should do the same everywhere tests 
involve different timezones (why are we testing all timezones only for casting 
to timestamp and not for all the other functions involving dates and times, if 
checking all timezones is so important?). But then the amount of time needed to 
run all the tests would be crazy, so it is not doable.


---




[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22631#discussion_r223286771
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("cast string to timestamp") {
-for (tz <- ALL_TIMEZONES) {
+for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
--- End diff --

I mean, some tests with randomized input, say over an integer range, are fine 
in common sense, but this case is different, isn't it?


---




[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22631#discussion_r223285813
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("cast string to timestamp") {
-for (tz <- ALL_TIMEZONES) {
+for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
--- End diff --

I think tests need to be deterministic in general as well.

In this particular case, ideally we should categorize timezones and test a 
fixed set: for instance, timezones with DST, timezones without DST, and some 
exceptions, such as this particular case which Python 3.6 addressed lately 
(https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Lib/datetime.py#L1572-L1574),
 IMHO.
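A minimal sketch of what that categorization could look like, using only 
`java.util.TimeZone` (the bucketing by DST and the representative selection are 
my assumptions, not a proposal from this thread):

```scala
// Hypothetical sketch: bucket the JVM's timezones by DST behavior and test
// one deterministic representative per bucket instead of all ~600 zones.
import java.util.TimeZone

object TimezoneBuckets {
  def representatives(): Seq[TimeZone] = {
    val all = TimeZone.getAvailableIDs.toSeq.map(TimeZone.getTimeZone)
    val (withDst, withoutDst) = all.partition(_.useDaylightTime)
    // Sorting by ID keeps the picks stable across runs (deterministic).
    Seq(withDst, withoutDst).flatMap(_.sortBy(_.getID).headOption)
  }

  def main(args: Array[String]): Unit =
    representatives().foreach(tz => println(tz.getID))
}
```

The known special cases (like the CPython one linked above) would then be 
added as explicit extra entries rather than hoping a random draw hits them.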

Of course, this approach requires a lot of investigation and overhead. 
So, as an alternative, I would be inclined to go for Sean's approach 
(https://github.com/apache/spark/pull/22631/files#r223224573) for this 
particular case.

As for randomness, I think we should primarily have a deterministic set of 
tests first. Maybe we could additionally have some set of randomized input to 
cover cases we haven't foreseen, but that's secondary.


---




[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22631#discussion_r223253381
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("cast string to timestamp") {
-for (tz <- ALL_TIMEZONES) {
+for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
--- End diff --

 > tests need to be deterministic, or else what's the value? failures can't 
be reproduced

That's not true. AFAIK we have a lot of tests that generate data randomly, 
and if one fails, the test name will include the seed (or the generated data), 
so people can easily reproduce it.

I think it's a good strategy to test a subset of possible cases, as a 
tradeoff between how soon we can discover a bug and how fast we can iterate.
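A hedged sketch of that pattern (the helper and names here are mine, not 
Spark's actual test utilities): seed the generator up front and embed the seed 
in the test name, so a failure report alone is enough to replay the exact 
subset.

```scala
import java.util.TimeZone
import scala.util.Random

object SeededSubset {
  // Re-running with the seed printed in a failed test's name replays
  // exactly the same shuffled subset: seeded shuffling is deterministic.
  def subset(all: Seq[String], n: Int, seed: Long): Seq[String] =
    new Random(seed).shuffle(all).take(n)

  def main(args: Array[String]): Unit = {
    val seed = System.currentTimeMillis() // recorded, hence reproducible
    val zones = TimeZone.getAvailableIDs.toSeq
    println(s"cast string to timestamp (seed=$seed)")
    println(subset(zones, 5, seed).mkString(", "))
  }
}
```

This keeps the coverage breadth of the randomized approach while answering the 
reproducibility objection.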


---




[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22631#discussion_r223228194
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("cast string to timestamp") {
-for (tz <- ALL_TIMEZONES) {
+for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
--- End diff --

Surely not by design? Tests need to be deterministic, or else what's the 
value? Failures can't be reproduced. (I know that in practice many things are 
hard to make deterministic.)

Of course, if we're worried that we might not be testing an important case, 
we have to test it. We can't just skip it sometimes to make some tests run a 
little faster.

Testing just 3 timezones might be fine too; I don't know. Testing 50 
randomly seems suboptimal in all cases. 

I'll open a PR to try simply testing in parallel instead.


---




[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-07 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22631#discussion_r223226770
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("cast string to timestamp") {
-for (tz <- ALL_TIMEZONES) {
+for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
--- End diff --

Yes, there are many tests where data is randomly generated. And they are 
not seeded, of course.

As I said, I think the goal here is to test that the function works well 
with different timezones: picking a fixed subset of timezones would be fine 
too, but I prefer taking them randomly from among all of them, because if there 
is a single timezone creating issues (very unlikely IMHO), we would discover it 
anyway (though not on a single run).

Moreover, it would be great to be consistent across the codebase on what we 
test. In `DateExpressionsSuite` we test only 3 timezones and here we test all 
650: that is weird, isn't it? We should probably define the right thing to do 
when timezones are involved and always test the same set. Otherwise, testing 
650 timezones for one specific function and 3 for most of the others is 
nonsense IMHO.


---




[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22631#discussion_r223224573
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("cast string to timestamp") {
-for (tz <- ALL_TIMEZONES) {
+for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
--- End diff --

Tests should be deterministic, ideally; any sources of randomness should be 
seeded. Do you see one that isn't? 

I think this is like deciding we'll randomly run just 90% of all test suites 
every time to save time. I think it goes well against good practice.

There are other solutions:
1) pick a subset of timezones that we're confident exercise the code, and 
explicitly test just those
2) parallelize these tests within the test suite

The latter should be trivial in this case: `ALL_TIMEZONES.par.foreach { tz 
=>` instead. It's the same amount of work, but 8x or 16x faster by wall-clock 
time, depending on how many cores are available. What about that?


---




[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-07 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22631#discussion_r223202913
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("cast string to timestamp") {
-for (tz <- ALL_TIMEZONES) {
+for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
--- End diff --

@srowen I think the point here is to test that this works with different 
timezones. In `DateExpressionsSuite`, for instance, we test only 3-4 timezones. 
I don't think it makes sense to test every one of the 650 possible timezones: 
if it works with many of them, it means the code is generic and respects 
timezones. We could also define a fixed subset of timezones, but IMHO taking 
some of them randomly provides the additional safety that if there is a single 
specific timezone causing problems, we are able to identify it over several 
subsequent runs.

We have many places where we generate data randomly in tests, so we already 
have randomness in the tests. I think the rationale behind them is the same: if 
the function works with varied data, then it generalizes properly.


---




[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22631#discussion_r223190476
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
 ---
@@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("cast string to timestamp") {
-for (tz <- ALL_TIMEZONES) {
+for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
--- End diff --

@mgaido91 @gatorsmile surely we shouldn't do this. We're introducing 
nondeterminism into the test, and also no longer testing things we were testing 
before. If there are a lot of timezones we don't need to test, then don't test 
them. Right? I didn't see this PR, but I would have objected to this change if 
I had.


---




[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22631


---




[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

2018-10-04 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/22631

[SPARK-25605][TESTS] Run cast string to timestamp tests for a subset of 
timezones

## What changes were proposed in this pull request?

The test `cast string to timestamp` used to run for all time zones, so it 
ran more than 600 times. Running the test for a significant subset of time 
zones is probably good enough, and doing this in a randomized manner ensures 
that, across different runs, we eventually test all time zones.

## How was this patch tested?

The test time is reduced from more than 2 minutes to 11 seconds.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-25605

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22631.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22631


commit 48534796d8bd9d2fcd75b4df77149f6b89341dde
Author: Marco Gaido 
Date:   2018-10-04T15:47:38Z

[SPARK-25605][TESTS] Run cast string to timestamp tests for a subset of 
timezones




---
