Similar question here, asked just a couple of days ago (please do search the 
archives first):
https://groups.google.com/d/msg/julia-users/VInJ4M-yNUY/Z6N8wCCfAwAJ

Someone should just add a serializer to the relevant random forest/decision 
tree packages. These aren't hard to write, and there's an example in the 
linked docs.

Here's a more complicated example: in my own lab's code, we use "tile trees"
to represent sums over little pieces of images. They combine
QuadTrees/OctTrees (depending on spatial dimensionality) with spatio-temporal
factorizations. The main point is that these might seem like fairly
complicated data structures, yet the serializer and deserializer can each be
written in ~10 lines of code, and using them gave me an orders-of-magnitude
performance improvement when saving/loading.

For reference, I've pasted the code below: it's not self-contained, but it 
should give you the idea.

Best,
--Tim

# This contains info needed to reconstruct the BoxTree, but does not store the
# BoxTree itself
type TileTreeSerializer{TT<:Tile}
    tiles::Vector{TT}
    ids::Vector{Int}
    ntiles::Int
    dims::Dims
    Ts::Type
    Tel::Type
    K::Int
    W::Tuple
end
TileTrees.tiletype{TT}(::Type{TileTreeSerializer{TT}}) = TT
TileTrees.tiletype{TT}(::TileTreeSerializer{TT}) = TT

function JLD.readas(serdata::TileTreeSerializer)
    bt = boxtree(serdata.Ts, serdata.Tel, serdata.K, serdata.W,
                 dimspans(serdata.dims[1:end-1]))
    TT = tiletype(serdata)
    tiles = Array(TT, serdata.ntiles)
    for i = 1:length(serdata.tiles)
        id = serdata.ids[i]
        tile = serdata.tiles[i]
        tiles[id] = tile
        roi = boxroi(tile.spans, id)
        push!(bt, roi)
    end
    ttree = TileTree(tiles, bt, serdata.dims)
end

function JLD.writeas(ttree::TileTree)
    tiles = Array(tiletype(ttree), 0)
    ids = Int[]
    for (id, tile) in ttree
        push!(tiles, tile)
        push!(ids, id)
    end
    BT = boxtreetype(ttree)
    ST = splittype(BT)
    TileTreeSerializer{tiletype(ttree)}(
        tiles,
        ids,
        length(ttree.tiles),
        ttree.dims,
        ST,
        eltype(BT),
        splitk(BT),
        (splitwidth(BT)...))
end
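
As a smaller, self-contained sketch of the same idea: the tree types below
are hypothetical (they are not DecisionTree.jl's own types), and the code
uses current Julia syntax (`struct`) rather than the older `type` keyword
seen above. It only shows the flatten/rebuild step; under these assumptions,
`flatten` would be the body of a `JLD.writeas` method and `rebuild` the body
of a `JLD.readas` method.

```julia
# Hypothetical tree types, for illustration only.
abstract type Tree end

struct Leaf <: Tree
    value::Float64
end

struct Node <: Tree
    feature::Int
    threshold::Float64
    left::Tree
    right::Tree
end

# Flat form: one entry per node, in preorder. feature == 0 marks a leaf,
# whose `value` entry is the leaf value; otherwise `value` is the threshold.
struct FlatTree
    feature::Vector{Int}
    value::Vector{Float64}
end

function flatten!(flat::FlatTree, t::Leaf)
    push!(flat.feature, 0)
    push!(flat.value, t.value)
    return flat
end

function flatten!(flat::FlatTree, t::Node)
    push!(flat.feature, t.feature)
    push!(flat.value, t.threshold)
    flatten!(flat, t.left)
    flatten!(flat, t.right)
    return flat
end

flatten(t::Tree) = flatten!(FlatTree(Int[], Float64[]), t)

# Rebuild the subtree whose preorder entries start at index i.
# Returns the subtree and the index of the first entry after it.
function rebuild(flat::FlatTree, i::Int=1)
    if flat.feature[i] == 0
        return Leaf(flat.value[i]), i + 1
    end
    left, j = rebuild(flat, i + 1)
    right, k = rebuild(flat, j)
    return Node(flat.feature[i], flat.value[i], left, right), k
end

# Round trip on a small tree.
t = Node(1, 0.5, Leaf(1.0), Node(2, 1.5, Leaf(2.0), Leaf(3.0)))
flat = flatten(t)
t2, _ = rebuild(flat)
@assert flatten(t2).feature == flat.feature
@assert flatten(t2).value == flat.value
```

The flat form holds two plain numeric vectors instead of many small nested
objects, which is the kind of data that tends to be written to disk much
more efficiently.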


On Sunday, January 24, 2016 02:15:50 AM Pedro Silva wrote:
> I've been training a lot of random forests on a really big dataset, and
> while saving my transformations of the data in JLD files has been a breeze,
> saving the Models and their respective details is not going smoothly. I'm
> experimenting with different sizes of trees and different numbers of
> parameters per tree, so I have 10 forests total, and since they take about 1
> hour each to train I'd like to save them every 7 iterations in case I have
> to shut down a machine. My code for the process is the following:
> 
> using HDF5, JLD, DataFrames, Distributions, DecisionTree, MLBase, StatsBase
> 
> ...
> 
> num_of_trees = collect(10:10:100);
> num_of_features = collect(20:5:50);
> Models = Array{DecisionTree.Ensemble}(length(num_of_trees),length(num_of_features));
> Predictions = Array{Array{Float64,1}}(length(num_of_trees),length(num_of_features));
> RMSEs = Array{Float64}(length(num_of_trees),length(num_of_features));
> train = rand(Bernoulli(0.8), size(Y)) .== 1;
> 
> for i in 1:length(num_of_trees)
>       for j in 1:length(num_of_features)
>               Models[i,j] = build_forest(Y[train],DataSTD[train,:],num_of_features[j],num_of_trees[i]);
>               Predictions[i,j] = apply_forest(Models[i,j], DataSTD[!train,:]);
>               RMSEs[i,j] = root_mean_squared_error(Y[!train], Predictions[i,j]);
>               println("\n", Models[i,j])
>               println("Features: ",num_of_features[j])
>               println("RMSE: ",RMSEs[i,j])
>               display(confusion_matrix_regression(Y[!train],Predictions[i,j],10))
>       end
>       save("Models_run1.jld", "Models", Models, "Features", num_of_features,
>            "Predictions", Predictions, "RMSEs", RMSEs, "Bernoulli", train);
> end
> 
> Finishing the inner for loop takes around 7 hours, which is not a
> surprise, but the save function runs for hours as well. The file keeps
> slowly increasing in size, so I think something is happening, but I'm not
> sure what. Three hours after the inner loop finished, I still haven't
> reached the second iteration of my outer loop. I plan to leave it running
> overnight to see whether it fails or finishes. Any idea why this is
> happening?
