Similar question here, asked just a couple of days ago (please do search the archives first): https://groups.google.com/d/msg/julia-users/VInJ4M-yNUY/Z6N8wCCfAwAJ

Someone should just add a serializer to the relevant random forest/decision
tree packages. These aren't hard to write, and there's an example in the
linked docs.

For reference, here's a more complicated example: in my own lab's code, we use
"tile trees" to represent sums over little pieces of images. They combine
QuadTrees/OctTrees (depending on spatial dimensionality) with spatio-temporal
factorizations. The main point is that these might seem like fairly
complicated data structures, yet the serializer and deserializer can each be
written in ~10 lines of code, and they gave me an orders-of-magnitude
performance improvement when saving/loading.

I've pasted the code below: it's not self-contained, but it should give you
the idea.

Best,
--Tim

# This contains info needed to reconstruct the BoxTree, but does not store the
# BoxTree itself
type TileTreeSerializer{TT<:Tile}
    tiles::Vector{TT}
    ids::Vector{Int}
    ntiles::Int
    dims::Dims
    Ts::Type
    Tel::Type
    K::Int
    W::Tuple
end

TileTrees.tiletype{TT}(::Type{TileTreeSerializer{TT}}) = TT
TileTrees.tiletype{TT}(::TileTreeSerializer{TT}) = TT

# Loading: rebuild the TileTree (and its BoxTree) from the flat serializer
function JLD.readas(serdata::TileTreeSerializer)
    bt = boxtree(serdata.Ts, serdata.Tel, serdata.K, serdata.W,
                 dimspans(serdata.dims[1:end-1]))
    TT = tiletype(serdata)
    tiles = Array(TT, serdata.ntiles)
    for i = 1:length(serdata.tiles)
        id = serdata.ids[i]
        tile = serdata.tiles[i]
        tiles[id] = tile
        roi = boxroi(tile.spans, id)
        push!(bt, roi)
    end
    ttree = TileTree(tiles, bt, serdata.dims)
end

# Saving: flatten the TileTree into plain vectors plus the BoxTree parameters
function JLD.writeas(ttree::TileTree)
    tiles = Array(tiletype(ttree), 0)
    ids = Int[]
    for (id, tile) in ttree
        push!(tiles, tile)
        push!(ids, id)
    end
    BT = boxtreetype(ttree)
    ST = splittype(BT)
    TileTreeSerializer{tiletype(ttree)}(
        tiles, ids, length(ttree.tiles), ttree.dims,
        ST, eltype(BT), splitk(BT), (splitwidth(BT)...))
end

On Sunday, January 24, 2016 02:15:50 AM Pedro Silva wrote:
> I've been training a lot of random forests on a really big dataset, and
> while saving my transformations of the data in JLD files has been a breeze,
> saving the Models and their respective details is not going smoothly. I'm
> experimenting with different sizes of trees and different numbers of
> parameters per tree, so I have 10 forests total and since they take about 1
> hour to train each I'd like to save them every 7 iterations in case I have
> to shut down a machine. My code for the process is the following:
>
> using HDF5, JLD, DataFrames, Distributions, DecisionTree, MLBase, StatsBase
>
> ...
>
> num_of_trees = collect(10:10:100);
> num_of_features = collect(20:5:50);
> Models = Array{DecisionTree.Ensemble}(length(num_of_trees),length(num_of_features));
> Predictions = Array{Array{Float64,1}}(length(num_of_trees),length(num_of_features));
> RMSEs = Array{Float64}(length(num_of_trees),length(num_of_features));
> train = rand(Bernoulli(0.8), size(Y)) .== 1;
>
> for i in 1:length(num_of_trees)
>     for j in 1:length(num_of_features)
>         Models[i,j] = build_forest(Y[train],DataSTD[train,:],num_of_features[j],num_of_trees[i]);
>         Predictions[i,j] = apply_forest(Models[i,j], DataSTD[!train,:]);
>         RMSEs[i,j] = root_mean_squared_error(Y[!train], Predictions[i,j]);
>         println("\n", Models[i,j])
>         println("Features: ",num_of_features[j])
>         println("RMSE: ",RMSEs[i,j])
>
>         display(confusion_matrix_regression(Y[!train],Predictions[i,j],10))
>     end
>     save("Models_run1.jld", "Models", Models, "Features", num_of_features,
>          "Predictions", Predictions, "RMSEs", RMSEs, "Bernoulli", train);
> end
>
> Finishing the internal for loop takes around 7 hours, which is not a
> surprise, but the save function runs for hours as well. The file keeps
> slowly increasing in size, so I think something is happening but I'm not
> sure what. I'm still unable to get to a second iteration of my outer loop
> 3 hours after the inner loop has finished. I plan to leave it running
> overnight to see whether it fails or finishes. Any idea on why this is
> happening?