Let me clarify. Vector{Bool} works fine with JLD; it just doesn't work well with extendible HDF5Datasets, which are what I actually want to use.

On Wednesday, August 13, 2014 6:04:46 PM UTC-6, ggggg wrote:
>
> It turns out Vector{Bool} does not work well with JLD. So I played around
> with BitArray, and I figured out that it would be pretty easy to use for my
> purposes. It seems to me that BitArray could be made a bit more useful by
> exporting a reinterpret method. That would certainly make my use case use
> less code, but it could also replace the current implementation of bits. I
> think it makes more sense for bits to return a BitArray than a String
> anyway, since it would be much faster for uses like bits(4)[1]. Would it
> be worth making a pull request adding something like this in base?
> (Clearly redefining bits would change behavior and break things, so I'm not
> sure how to approach that.)
>
> I wrote up some simple example code and it works fine; however, it isn't
> actually any faster than the current bits implementation, which I found
> surprising. Maybe it would be a bit faster if BitArray were able to be
> constructed from v directly, instead of allocating then immediately
> replacing b.chunks.
>
> function reinterpret2(::Type{BitArray}, v::Vector{Uint64}, dims=(64,-1))
>     dims[2] == -1 && (dims=(64,length(v)))
>     # check to make sure the dims are appropriate for length of v
>     b = BitArray(dims...)
>     b.chunks = v
>     b
> end
>
> function reinterpret2(::Type{BitArray}, i::Uint64, dims=64)
>     assert(dims <= 64)
>     b = BitArray(dims)
>     b.chunks = [i]
>     b
> end
>
> bits2(i::Uint64) = reinterpret2(BitArray, i)
> bits2(x) = reinterpret2(BitArray,reinterpret(Uint64, x))
>
> testbits(n) = [bits(i)[1] for i=1:n]
> testbits2(n) = Bool[bits2(i)[1] for i=1:n]
>
>
> testbits(1);
> @time testbits(100000);
> testbits2(1);
> @time testbits2(100000);
>
>
>
> On Tuesday, August 5, 2014 11:26:39 PM UTC-6, Simon Kornblith wrote:
>>
>> Assuming you have enough memory to write a BitArray to the JLD file
>> initially, if you later open the JLD file with mmaparrays=true and read
>> it, JLD will mmap the underlying Vector{Uint64} so that pieces are read
>> from the disk as they are accessed. (The actual specifics of how this works
>> are up to the OS, but generally it works well.) In principle you can also
>> modify the BitArray and the changes will be saved to the disk, although I'm
>> not sure how well that works since I don't do it in my own code. There is no
>> easy way to resize the BitArray if you do this, though.
>>
>> Simon
>>
>> On Tuesday, August 5, 2014 5:06:16 PM UTC-4, Tim Holy wrote:
>>>
>>> To me it sounds like you've come up with the main options: BitArray or
>>> Array{Bool}. Since a BitArray is, underneath, a Vector{Uint64} with
>>> different indexing semantics, it seems you could probably come up with a
>>> reasonable way to update just part of it. But even if you use Array{Bool},
>>> you're "only" talking a few hundred megabytes, which is not
>>> catastrophically large. Also consider keeping everything in memory; with
>>> 100GB of RAM you could store an awful lot of selections.
>>>
>>> --Tim
>>>
>>> On Tuesday, August 05, 2014 12:01:58 PM ggggg wrote:
>>> > Hello,
>>> >
>>> > I have an application where I have a few hundred million events, and I'd
>>> > like to make and work with different selections of sets of those events.
>>> > The events each have various values associated with them, say for
>>> > simplicity color, timestamp, and loudness. Say one selection includes all
>>> > the events within 5 minutes after a blue event. Or I want to select all
>>> > events that aren't above some loudness threshold. I'd like to be able to
>>> > save these selections in a JLD file for later use on some or all events. I
>>> > also need to be able to update the selections as I observe more events.
>>> >
>>> > My baseline plan is to have an integer associated with each event, where
>>> > each bit in the integer i corresponds to a selection. So bit 1 is true for
>>> > events within 5 minutes and bit 2 is true for events above the loudness
>>> > threshold. Then for each event's integer I can do bits(i)[1] or bits(i)[2]
>>> > to figure out if it is included in each selection. That seems like it would
>>> > be inefficient since bits() returns a string, so I would have to call
>>> > bool(bits(i)[1]). I could use a bitwise mask of some sort like 1&i!=0 for
>>> > the first bit and 2&i!=0 for the second bit.
>>> >
>>> > A BitArray seems like a decent choice, except that you can only read/write
>>> > the entire array from a JLD file, rather than just a part of it. That will
>>> > be inefficient since I'll often want to look at only a small subset of the
>>> > total events. And every time I want to update for new events, I would need
>>> > to read the entire BitArray, extend it in memory, then delete the old JLD
>>> > object and replace it with a new JLD object. It seems plausible I could
>>> > figure out how to read/write part of a BitArray from a JLD as I've already
>>> > done some hacking on HDF5.jl, but that could be a large amount of work.
>>> >
>>> > An Array{Bool} works well with JLD, and seems just as well suited as a
>>> > BitArray. I think it's 8 times bigger than BitArray, and has a similar
>>> > space ratio to an integer (depending on how many selections I actually use)
>>> > because bools are stored as 1 byte? I can probably live with that, although
>>> > again it seems sort of inefficient.
>>> >
>>> > Any advice on how I should go about deciding, or options I hadn't
>>> > considered? Also why does bits() return a string, instead of say
>>> > Vector{Bool} or BitArray?
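
For anyone reading this thread on a current Julia, the bitmask idea from the original question can be sketched roughly as below. This is only an illustration and not code from any of the posts: it assumes Julia 1.x, where Uint64 is spelled UInt64, and the names NEAR_BLUE, LOUD, isselected, and getbit are hypothetical. The second function reads a single bit directly out of a vector of 64-bit chunks, least significant bit first, which is the kind of partial access discussed above for a Vector{Uint64} backing store.

```julia
# Hypothetical selection masks, one bit per selection (names are illustrative).
const NEAR_BLUE = UInt64(1) << 0   # bit 1: within 5 minutes after a blue event
const LOUD      = UInt64(1) << 1   # bit 2: above the loudness threshold

# True when the event's flag word has the given selection bit set.
# No string is built, unlike indexing into the result of bits().
isselected(flags::UInt64, mask::UInt64) = (flags & mask) != 0

# Read bit k (1-based, least significant bit first) from a vector of
# 64-bit chunks without materializing a BitArray.
function getbit(chunks::Vector{UInt64}, k::Integer)
    c, offset = divrem(k - 1, 64)          # chunk index (0-based) and bit offset
    return ((chunks[c + 1] >> offset) & 1) == 1
end

flags = NEAR_BLUE | LOUD
isselected(flags, LOUD)        # true
isselected(UInt64(0), LOUD)    # false
getbit(UInt64[0x5], 3)         # true, since 0x5 is binary 101
```

Since getbit only touches one chunk, the same indexing arithmetic would apply to a slice of chunks read from disk, although how to read such a slice from JLD or HDF5 is not shown here.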