Nikhil Kak created MADLIB-1406:
----------------------------------

             Summary: DL: fit multiple takes up unnecessary disk space
                 Key: MADLIB-1406
                 URL: https://issues.apache.org/jira/browse/MADLIB-1406
             Project: Apache MADlib
          Issue Type: Bug
          Components: Deep Learning
            Reporter: Nikhil Kak
             Fix For: v1.17


While testing places10 with fit multiple (gpdb5, 10 iterations, 20 msts), we 
ran out of disk space even though at least 1.5T was free when the query 
started. This workload should not need that much space, which probably means 
there is a bug in the code.

Here is the query and the failure:
{code:sql}

DROP TABLE IF EXISTS mst_table, mst_table_summary;
SELECT load_model_selection_table(
    'model_arch_places10',
    'mst_table',
    ARRAY[1],
    ARRAY[
        $$loss='categorical_crossentropy', optimizer='SGD(lr=0.1, decay=1e-6, nesterov=True)', metrics=['accuracy']$$,
        $$loss='categorical_crossentropy', optimizer='SGD(lr=0.01, decay=1e-6, nesterov=True)', metrics=['accuracy']$$,
        $$loss='categorical_crossentropy', optimizer='SGD(lr=0.001, decay=1e-6, nesterov=True)', metrics=['accuracy']$$,
        $$loss='categorical_crossentropy', optimizer='SGD(lr=0.0001, decay=1e-6, nesterov=True)', metrics=['accuracy']$$,
        $$loss='categorical_crossentropy', optimizer='SGD(lr=0.001, decay=1e-6, nesterov=False)', metrics=['accuracy']$$
    ],
    ARRAY[
        $$batch_size=16, epochs=1, verbose=0$$,
        $$batch_size=20, epochs=1, verbose=0$$,
        $$batch_size=32, epochs=1, verbose=0$$,
        $$batch_size=40, epochs=1, verbose=0$$
    ]
);

DROP TABLE IF EXISTS places10_train_mult_model,
    places10_train_mult_model_summary,
    places10_train_mult_model_info;
SELECT madlib_keras_fit_multiple_model(
    'places10_train_bytea_batched',
    'places10_train_mult_model',
    'mst_table',
    10,
    TRUE
);
-- failed in the 7th iteration

....
Time for training in iteration 6: 6403.70687222 sec
ERROR:  plpy.SPIError: could not extend relation 1663/3721274/1121877: No space left on device  (seg1)
{code}
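
For reference, the "20 msts" figure follows from the query above: one model 
architecture id, five compile parameter strings, and four fit parameter 
strings. The sketch below only reproduces that count in plain Python. It 
assumes load_model_selection_table takes the full cross product of its three 
arrays; the variable names are illustrative and not part of MADlib.

```python
from itertools import product

# Values copied from the load_model_selection_table call above.
model_ids = [1]

# (learning rate, nesterov flag) pairs, kept as strings to match the query text.
sgd_settings = [
    ("0.1", "True"),
    ("0.01", "True"),
    ("0.001", "True"),
    ("0.0001", "True"),
    ("0.001", "False"),
]
compile_params = [
    "loss='categorical_crossentropy', "
    "optimizer='SGD(lr={0}, decay=1e-6, nesterov={1})', "
    "metrics=['accuracy']".format(lr, nesterov)
    for lr, nesterov in sgd_settings
]

fit_params = [
    "batch_size={0}, epochs=1, verbose=0".format(batch_size)
    for batch_size in (16, 20, 32, 40)
]

# Assumption: one model selection tuple (mst) per combination.
mst_rows = list(product(model_ids, compile_params, fit_params))
print(len(mst_rows))
```

Each of these configurations is trained in every one of the 10 iterations, so 
any per-configuration storage that is not released between iterations would 
grow with both numbers.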
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
