Sounds promising. Please let us know if we can help in any manner.

From: Cyrille Chépélov [mailto:[email protected]]
Sent: Tuesday, May 12, 2015 11:01 AM
To: [email protected]; [email protected]; 
[email protected]
Subject: Re: Scalding+Cascading+TEZ = ♥ [FOLLOW UP #2]

(Xposted to scalding-dev@ and user@tez, reply-to set to cascading-user@)

Hello,

TL;DR: things look great; we like cascading-3.0.0-wip-115 to the point that we 
would run some production jobs on Tez if we were in a hurry.

Some progress updates:

  *   As of cascading-3.0.0-wip-115, we no longer see a difference in output 
data, whether running with -hadoop, -hadoop2-mr1, or of course -hadoop2-tez

     *   (lots of hard work was involved, this is mostly Sylvain on the 
reporting side and Chris on the fixing side)

  *   We run all three back-ends on our test rig, at least once a week, aiming 
for daily.

     *   Not doing this for a while caused us to miss a regression in -hadoop; 
it has since been reported and fixed

Remaining on our (Transparency) to-do list:

  *   Test again against vanilla tez-0.6.0, but also against tez-0.6.1-SNAPSHOT

     *   in particular, to see whether vanilla 0.6.0 still freezes (I expect it 
will) and whether 0.6.1-SNAPSHOT passes without too much trouble from guava 
version mismatches

  *   Evaluate whether it is better to run cascades of complex flows under TEZ 
with cascading.cascade.maxconcurrentflows=1 rather than the default (no limit), 
since when multiple IO-hungry jobs run at the same time, thrashing may reduce 
performance (see the sketch after this list)
  *   Plug in HBase taps
  *   Try compiling scalding 0.13.1 against cascading 3.0.0-wip-115(+) and see 
what happens with the test suite (under -hadoop)
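
For the maxconcurrentflows item, the experiment boils down to something like the 
sketch below (plain Cascading API; flowA and flowB are placeholders for already 
built flows, not actual code from our jobs):

import java.util.Properties
import cascading.cascade.CascadeConnector
import cascading.flow.Flow

// Placeholder helper: run two already-built flows one at a time instead of concurrently
def runOneAtATime(flowA: Flow[_], flowB: Flow[_]): Unit = {
  val props = new Properties()
  // cap the cascade at one concurrent flow; the default is no limit
  props.setProperty("cascading.cascade.maxconcurrentflows", "1")
  val cascade = new CascadeConnector(props).connect(flowA, flowB)
  cascade.complete()
}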

For now, the recipe is still the same as in the original report (patched 
tez-0.6.0, patched scalding-0.13.1, cascading-3.0.0-wip), except for a newer 
cascading WIP build.
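
For anyone wanting to try the same combination, the dependency side looks 
roughly like the sbt sketch below (indicative only: the public coordinates are 
the usual conjars ones, while our tez and scalding artifacts are patched and 
published locally under our own versions):

// build.sbt sketch, not our exact build
resolvers += "conjars" at "http://conjars.org/repo"

libraryDependencies ++= Seq(
  "cascading"   %  "cascading-core"        % "3.0.0-wip-115",
  "cascading"   %  "cascading-hadoop2-tez" % "3.0.0-wip-115",
  "com.twitter" %% "scalding-core"         % "0.13.1"  // patched locally in our case
)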
    -- Cyrille


On 16/04/2015 19:10, Sylvain Veyrié wrote:
also cross-posted on cascading-user@ and user@tez

Hello all,

Following Cyrille's announcement, I started applying our regression test suite 
to the same code base. This regression test runs our code on a reduced dataset 
(14k rows) and checks that the results are unchanged. As of today, the results 
are the same with 1) Cascading local, 2) Hadoop MR1 ("--hdfs") and 3) Hadoop 2 
YARN. So we now also run it on Tez.

Good news: most of the output is exactly the same, and the run is a lot faster: 
from more than 2 hours down to less than 10 minutes.
Bad news: there are regressions (~3% of the output data); we have identified at 
least one.

[BAD]

val result = input
  .map { blah => blah.bleh }
  .collect { case Some(item) => item }
  .groupBy(bleh => bleh.id)
  .sortBy(bleh => (bleh.foo, bleh.bar)).reverse
  .take(1)
  .values
[/BAD]

It appears that, only when executing with Tez, the take(1) call is simply 
ignored, producing output with elements that should have been eliminated. 
Cyrille suspects it might be a planner issue.

This modified code fixed it:

[GOOD]

val result = input
  .map { blah => blah.bleh }
  .collect { case Some(item) => item }
  .groupBy(bleh => bleh.id)
  .sortedTake(1)(Ordering.by[Bleh, (Double, String)](bleh => (bleh.foo, bleh.bar)).reverse)
  .toTypedPipe.flatMap(xx => xx._2)
[/GOOD]

We know this new code is better (at least from a performance point of view), 
but we still have some take() and head() calls in the code base, so we still 
get some invalid output. In any case it is a bug, and I would prefer not to 
rewrite every occurrence just to check whether that fixes it.

I wanted to post this with a Java/Cascading test case, but I have not been able 
to reproduce it in a simple test case yet, even with Scalding.
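
For the record, the reproducer I am trying to build is shaped like the sketch 
below (TakeOneJob, Bleh and the input/output paths are placeholder names, not 
our actual code); on the small inputs tried so far it behaves correctly, which 
is why there is still no test case to attach:

import com.twitter.scalding._

// Placeholder record type mirroring the shape of the [BAD] snippet above
case class Bleh(id: String, foo: Double, bar: String)

// Minimal job: keep, per id, the single row with the highest (foo, bar)
class TakeOneJob(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(String, Double, String)]("input"))
    .map { case (id, foo, bar) => Bleh(id, foo, bar) }
    .groupBy(_.id)
    .sortBy(b => (b.foo, b.bar)).reverse
    .take(1)                                   // should leave exactly one row per id
    .values
    .map(b => (b.id, b.foo, b.bar))
    .write(TypedTsv[(String, Double, String)]("output"))
}

// In-memory check with JobTest, placed inside a test body
// (local/MR only, so it cannot exercise the Tez planner)
JobTest(new TakeOneJob(_))
  .source(TypedTsv[(String, Double, String)]("input"),
    List(("a", 1.0, "x"), ("a", 2.0, "y"), ("b", 3.0, "z")))
  .sink[(String, Double, String)](TypedTsv[(String, Double, String)]("output")) { out =>
    assert(out.toSet == Set(("a", 2.0, "y"), ("b", 3.0, "z")))
  }
  .run
  .finish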

On the test runs, AFAIK, the debug flow output is exactly the same in both 
cases; however, in the logs:
* BAD : "Tuples_Read=3737, Tuples_Written=3737" <= wat?
* GOOD : "Tuples_Read=13818, Tuples_Written=4029"

I have not dug into why Tuples_Read differs between the two cases (it may be 
the same thing upstream), but having Tuples_Read equal to Tuples_Written 
clearly points to a problem, and it is consistent with our output containing 
elements that should have been eliminated.
-- Sylvain Veyrié

