I like JLD. I really do.

But I have a problem that it can't solve -- not due to a limitation of JLD, 
but due to HDF5.
HDF5 is (AFAIK) designed with rectangular arrays in mind -- everything else 
is secondary.
If the type you are trying to serialize is more or less a structure 
of arrays, you're fine.
If it is an array of structures, you are not fine.
If it is a linked list/tree, you are *really* not fine.


So I have been working with some models from AdaGram.jl 
(https://github.com/sbos/AdaGram)
Very cool stuff in my area. 

So I have one of the models, saved using their custom format 
(https://github.com/sbos/AdaGram.jl/blob/master/src/util.jl#L94)
It is *6GB*, and takes ages to load and save.
I switched to JLD: *13GB*, even worse.
I used Base.serialize: it is just *13MB* -- a full 3 orders of 
magnitude smaller.
And it is fast to read from and write to disk.

This happens to me again and again.
Apparently the types I tend to need do not make JLD happy.



There are 3 problems with using Base.serialize as a data storage format.

   1. *It is not stable* -- the format is evolving as the language evolves, 
   and it breaks the ability to load files
   2. *It is only usable from Julia* -- vs JLD, which is, in the end, fancy 
   HDF5; anything can read it after a little work
   3. *It is not safe from a security perspective*  -- Maliciously crafted 
   ".jsz" files can allow arbitrary code execution to occur during the 
   deserialize step.



Points 2 and 3 are also true of Python's Pickle.


I haven't actually heard of anyone exploiting Point 3, but it seems really 
likely to be possible.
Deserialisation is a standard vector of attack, and this is not a 
security-hardened part of the codebase.
Here is how it can be done using Pickle: 
<https://blog.nelhage.com/2011/03/exploiting-pickle/>
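
To make the Pickle case concrete, here is a minimal and harmless sketch in Python. 
The class names (Payload, NoGlobalsUnpickler) are made up for illustration, and the 
"attack" only evaluates an arithmetic expression. I have not tried the equivalent 
against Base.serialize, so treat this as an analogy rather than a demonstration 
for Julia.

```python
import io
import pickle


class Payload:
    """Hypothetical object whose pickled form asks the loader to call eval."""

    def __reduce__(self):
        # On load, pickle calls eval("6 * 7") and uses the result as the object.
        # A real attacker would put something far nastier here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading runs the embedded call; we get eval's result back, not a Payload.
result = pickle.loads(blob)
print(result)


class NoGlobalsUnpickler(pickle.Unpickler):
    """Refuses to resolve any global, so the blob cannot name a callable."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


try:
    NoGlobalsUnpickler(io.BytesIO(blob)).load()
    blocked = False
except pickle.UnpicklingError:
    blocked = True
print(blocked)
```

The second half shows the usual Python mitigation, overriding find_class so the 
loader cannot look up callables. It only works because Pickle funnels every such 
lookup through one hook.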


Point 2 is pretty serious: it makes for bad open science.


Point 1, though, is what we tend to focus on, at least from what I've 
observed on SO, GitHub issues, julia-users, and in the docs.

I have been bitten by point 1, when I fully knew I was doing the wrong thing 
by using Base.serialize as a storage format, but did it anyway,
because it changed things from hours to seconds.
I updated the Julia nightly and then couldn't load any of my files.
So I ended up having to clone an earlier version, and then wrote a script to 
resave everything into JLD.

One thing that would go some way towards solving points 2 and 3 would be a 
standard script distributed with JLD (or Julia)
that converts ".jsz" files into JLD files in a sandboxed environment.

Solving point 1 would take a script/function (possibly as part of 
Compat) that can convert between versions of the Base.serialize format.
This might be a bit tricky, given that changes to Base.serialize are often tied to 
deep changes in how the language works internally,
in ways that mean the Base.serialize code from before the commit that 
changed it cannot even be run after that commit.
Such a script may need to do what I did: check out two versions of Julia 
and convert via an intermediary format.


The other option would be to try to create a new data storage package 
based on the same logic as Base.serialize,
but without it breaking when the language changes. 
This would be hard, since I suspect what makes it fast is also what makes 
it incompatible with changes: 
the representation on disk is very close to the representation in 
memory.
Still, even if it were 100x worse than Base.serialize, it would still (for my 
kind of data) be 10x better than JLD.



I had hoped ProtoBuf.jl might be a package that I could use as an 
alternative,
but the Protobuf spec doesn't really make it good for general data storage.
See https://github.com/tanmaykm/ProtoBuf.jl/issues/73



So what to do?

Right now, I try to use Base.serialize internally, and convert to JLD 
when I am done messing with the data, and also before updating Julia.

