reductionista commented on pull request #525:
URL: https://github.com/apache/madlib/pull/525#issuecomment-732492520
Thanks for testing it out, Frank!
You're the first to test this with TensorFlow 1.13, and based on one of
the errors you're seeing, I'm guessing you're also the first to test it with
gpdb5. I only tested it on gpdb6 and postgres.
> ```
> WARNING: This version of tensorflow does not support XLA auto-cluster JIT optimization. HINT: upgrading tensorflow may improve performance. (seg1 slice1 10.128.0.41:40001 pid=6271)
> CONTEXT: PL/Python function "fit_transition"
> ```
>
> What does user need to do to enable XLA? I am on TF 1.13.1 currently.
Nothing, probably just upgrading to TF 1.14. I should add the specific
minimum version requirement to the warning message. I think XLA auto-clustering
was added in 1.14, but I've only used 1.10 and 1.14, so I'll need to confirm that.
It could also be that it's been in there for a while and they just changed how
you access it. I'll look both of those up, and also run some more
comprehensive performance tests to see how much of a difference it actually
makes and in what kinds of setups. If it looks like it's not helping much
anyway (even though the TF docs say it should), then we can get rid of this
message... or maybe even not bother to enable it. I figure if it does give
consistently better performance across models, then we may as well let the user
know there's an easy way to get that, without having to tune any parameters or
upgrade hardware. It would also be important to know whether there are cases
where it could hurt performance.
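
For reference, this is roughly how XLA auto-clustering gets turned on for a
TF 1.x session. Treat it as a sketch only; the 1.14 minimum is the assumption
I still need to confirm:

```python
import tensorflow as tf
from distutils.version import LooseVersion

# Assumed minimum version for XLA auto-clustering; still to be confirmed.
MIN_XLA_TF_VERSION = '1.14'

def make_session_config():
    config = tf.ConfigProto()
    if LooseVersion(tf.__version__) >= LooseVersion(MIN_XLA_TF_VERSION):
        # Enable XLA auto-cluster JIT for ops run in sessions created with
        # this config; older TF versions just skip it (and get the warning).
        config.graph_options.optimizer_options.global_jit_level = \
            tf.OptimizerOptions.ON_1
    return config

# Usage: sess = tf.Session(config=make_session_config())
```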
> ```
> ERROR: plpy.Error: madlib_keras_fit_multiple_model error: No GPUs configured on hosts. (plpython.c:5038)
> CONTEXT: Traceback (most recent call last):
>   PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
>     fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
>   PL/Python function "madlib_keras_fit_multiple_model", line 147, in __init__
>   PL/Python function "madlib_keras_fit_multiple_model", line 295, in get_accessible_gpus_for_seg
>   PL/Python function "madlib_keras_fit_multiple_model"
> ```
>
> so it looks like `use_gpus=NULL` is now defaulting to `TRUE` but it should
default to `FALSE`, i.e., CPUs. It used to default to CPUs.
Good catch! For some reason, I got a different error when I tried to
reproduce this (about using BOOLEAN::None in a query). But I think I see the
cause of it--fixing. I thought we had some dev-check tests for this; I'll add
one if we don't.
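
The fix I have in mind is basically to coerce a NULL `use_gpus` to FALSE
before anything touches the GPU code path. A minimal sketch
(`normalize_use_gpus` is a hypothetical helper name, not the actual function):

```python
# Sketch only: plpy passes SQL NULL into PL/Python as None, so treat None
# the same as FALSE to keep the default on CPUs.
def normalize_use_gpus(use_gpus):
    return False if use_gpus is None else bool(use_gpus)
```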
> ```
> ERROR: plpy.SPIError: PRIMARY KEY and DISTRIBUTED BY definitions incompatible
> HINT: When there is both a PRIMARY KEY, and a DISTRIBUTED BY clause, the DISTRIBUTED BY clause must be equal to or a left-subset of the PRIMARY KEY
> CONTEXT: Traceback (most recent call last):
>   PL/Python function "madlib_keras_fit_multiple_model", line 24, in <module>
>     fit_obj.fit_multiple_model()
>   PL/Python function "madlib_keras_fit_multiple_model", line 241, in fit_multiple_model
>   PL/Python function "madlib_keras_fit_multiple_model", line 509, in init_model_output_tbl
>   PL/Python function "madlib_keras_fit_multiple_model"
> ```
>
> I have also seen this error when running on a single segment.
Interesting! Looks like this was due to an unreasonably strict check in the
gpdb5 server code, where it required a specific order for the keys in the
PRIMARY KEY constraint. The server team has fixed that in gpdb6:
https://github.com/greenplum-db/gpdb/commit/e572af54e4255d501d261d51d32e1f3d3cf4b9c9
I can't recall for sure if we decided we care about supporting deep learning
in gpdb5 for this release. But since it's an easy fix, I'll just reverse the
order of the keys so that it works on both.
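
To illustrate what gpdb5 is complaining about, here is a rough sketch of the
kind of DDL the model output table creation involves (table and column names
are illustrative, not the actual MADlib ones): listing the distribution key
first makes the DISTRIBUTED BY clause a left-subset of the PRIMARY KEY, which
both gpdb5 and gpdb6 accept.

```python
# Illustrative only: real table/column names in MADlib differ.  The point is
# the key order -- gpdb5 requires the DISTRIBUTED BY column to be a leading
# column of the PRIMARY KEY.
create_output_table_sql = """
    CREATE TABLE model_output_tbl (
        dist_key       INTEGER,
        mst_key        INTEGER,
        model_weights  BYTEA,
        PRIMARY KEY (dist_key, mst_key)  -- dist_key first, so it is a left-subset
    ) DISTRIBUTED BY (dist_key)
"""
# Inside the PL/Python function this would be run via plpy.execute(...).
```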