It turns out Vector{Bool} does not work well with JLD, so I played around 
with BitArray and figured out that it would be pretty easy to use for my 
purposes. It seems to me that BitArray could be made a bit more useful by 
exporting a reinterpret method. That would certainly let my use case use 
less code, but it could also replace the current implementation of bits. I 
think it makes more sense for bits to return a BitArray than a String 
anyway, since it would be much faster for uses like bits(4)[1]. Would it 
be worth making a pull request adding something like this to Base? 
(Clearly redefining bits would change behavior and break things, so I'm not 
sure how to approach that.)

I wrote up some simple example code and it works fine; however, it isn't 
actually any faster than the current bits implementation, which I found 
surprising. Maybe it would be a bit faster if a BitArray could be 
constructed from v directly, instead of allocating and then immediately 
replacing b.chunks.

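function reinterpret2(::Type{BitArray}, v::Vector{Uint64}, dims=(64,-1))
    # default shape: one 64-bit column per element of v
    dims[2] == -1 && (dims = (64, length(v)))
    # check to make sure the dims are appropriate for the length of v
    assert(length(v) == div(prod(dims) + 63, 64))
    b = BitArray(dims...)
    b.chunks = v   # reuse v as the backing storage (no copy)
    b
end

function reinterpret2(::Type{BitArray}, i::Uint64, dims=64)
    assert(dims <= 64)
    b = BitArray(dims)
    b.chunks = [i]
    b
end

bits2(i::Uint64) = reinterpret2(BitArray, i)
bits2(x) = reinterpret2(BitArray, reinterpret(Uint64, x))

# Note: bits(i)[1] is the most significant bit (as a Char), while
# bits2(i)[1] is the least significant bit (as a Bool).
testbits(n) = [bits(i)[1] for i=1:n]
testbits2(n) = Bool[bits2(i)[1] for i=1:n]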


testbits(1);
@time testbits(100000);
testbits2(1);
@time testbits2(100000);
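
For comparison, here is a minimal sketch of the bitwise-mask approach mentioned in the original question, which avoids allocating a String or a BitArray per integer. It is written with current Julia names (UInt64 and bitstring, where the 2014 code above uses Uint64 and bits). hasbit and testmask are hypothetical helpers, not part of Base, and b.chunks is an internal field of BitArray rather than a public API.

```julia
# Hypothetical helper (not in Base): true if bit k of x is set,
# counting k = 1 as the least significant bit.
hasbit(x::Unsigned, k::Integer) = ((x >> (k - 1)) & one(x)) == one(x)

flags = UInt64(0b0101)          # selections 1 and 3 are set

# The same bits viewed through a BitVector that uses flags as its chunk.
# Writing to b.chunks relies on BitArray internals (an assumption here).
b = BitVector(undef, 64)
b.chunks[1] = flags

# BitVector index 1 is the least significant bit, matching hasbit(x, 1);
# the string form puts the least significant bit last instead.
lsb_char = bitstring(flags)[end]

# Mask-based counterpart of testbits2: no per-element allocation.
testmask(n) = Bool[hasbit(UInt64(i), 1) for i = 1:n]
```

Timing testmask(100000) next to testbits and testbits2 would show how much of the cost comes from allocating a container for every integer, though I have not measured it here.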



On Tuesday, August 5, 2014 11:26:39 PM UTC-6, Simon Kornblith wrote:
>
> Assuming you have enough memory to write a BitArray to the JLD file 
> initially, if you later open the JLD file with mmaparrays=true and read 
> it, JLD will mmap the underlying Vector{Uint64} so that pieces are read 
> from the disk as they are accessed. (The actual specifics of how this works 
> are up to the OS, but generally it works well.) In principle you can also 
> modify the BitArray and the changes will be saved to the disk, although I'm 
> not sure how well that works since I don't do it in my own code. There is no 
> easy way to resize the BitArray if you do this, though.
>
> Simon
>
> On Tuesday, August 5, 2014 5:06:16 PM UTC-4, Tim Holy wrote:
>>
>> To me it sounds like you've come up with the main options: BitArray or 
>> Array{Bool}. Since a BitArray is, underneath, a Vector{Uint64} with 
>> different indexing semantics, it seems you could probably come up with a 
>> reasonable way to update just part of it. But even if you use Array{Bool}, 
>> you're "only" talking a few hundred megabytes, which is not catastrophically 
>> large. Also consider keeping everything in memory; with 100GB of RAM you 
>> could store an awful lot of selections. 
>>
>> --Tim 
>>
>> On Tuesday, August 05, 2014 12:01:58 PM ggggg wrote: 
>> > Hello, 
>> > 
>> > I have an application where I have a few hundred million events, and I'd 
>> > like to make and work with different selections of sets of those events. 
>> > The events each have various values associated with them, say for 
>> > simplicity color, timestamp, and loudness. Say one selection includes all 
>> > the events within 5 minutes after a blue event.  Or I want to select all 
>> > events that aren't above some loudness threshold. I'd like to be able to 
>> > save these selections in a JLD file for later use on some or all events. I 
>> > also need to be able to update the selections as I observe more events. 
>> > 
>> > My baseline plan is to have an integer associated with each event, where 
>> > each bit in the integer i corresponds to a selection.  So bit 1 is true for 
>> > events within 5 minutes and bit 2 is true for events above the loudness 
>> > threshold.  Then for each event's integer I can do bits(i)[1] or bits(i)[2] 
>> > to figure out if it is included in each selection. That seems like it would 
>> > be inefficient since bits() returns a string, so I would have to call 
>> > bool(bits(i)[1]).  I could use a bitwise mask of some sort like 1&i!=0 for 
>> > the first bit and 2&i!=0 for the second bit. 
>> > 
>> > A BitArray seems like a decent choice, except that you can only read/write 
>> > the entire array from a JLD file, rather than just a part of it.  That will 
>> > be inefficient since I'll often want to look at only a small subset of the 
>> > total events. And every time I want to update for new events, I would need 
>> > to read the entire BitArray, extend it in memory, then delete the old JLD 
>> > object and replace it with a new JLD object.  It seems plausible I could 
>> > figure out how to read/write part of a BitArray from a JLD as I've already 
>> > done some hacking on HDF5.jl, but that could be a large amount of work. 
>> > 
>> > An Array{Bool} works well with JLD, and seems just as well suited as a 
>> > BitArray.  I think it's 8 times bigger than BitArray, and has a similar 
>> > space ratio to an integer (depending on how many selections I actually use) 
>> > because bools are stored as 1 byte? I can probably live with that, although 
>> > again it seems sort of inefficient. 
>> > 
>> > Any advice on how I should go about deciding, or options I hadn't 
>> > considered?  Also why does bits() return a string, instead of say 
>> > Vector{Bool} or BitArray? 
>>
>>
