Let me clarify. Vector{Bool} works fine with JLD; it just doesn't work well 
with extendible HDF5Datasets, which is what I actually want to use.
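
For the per-event integer idea in my first message (quoted at the bottom), 
here is a minimal sketch of testing and setting selection bits with masks 
instead of going through bits(). The helper names are made up for 
illustration and are not part of Base or JLD. Note that a test like 1&i==0 
is true when bit 1 is not set, so the membership test below compares 
against != 0.

```julia
# Hypothetical helpers (names are illustrative, not from Base or JLD).
# Selection k lives in bit k of an event's integer, counting from the
# least significant bit: selection 1 has mask 1, selection 2 has mask 2,
# selection 3 has mask 4, and so on.
selectionmask(flags::Integer, k::Integer) = one(flags) << (k - 1)

# True if the event with these flags belongs to selection k.
inselection(flags::Integer, k::Integer) = (flags & selectionmask(flags, k)) != 0

# Return a copy of flags with the event added to selection k.
addselection(flags::Integer, k::Integer) = flags | selectionmask(flags, k)

flags = 0x05                       # bits 1 and 3 set
inselection(flags, 1)              # true
inselection(flags, 2)              # false
flags2 = addselection(flags, 2)    # 0x07
```

This never builds a string or a BitArray per event, so it should be cheap 
enough to apply to every element of a large integer vector.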

On Wednesday, August 13, 2014 6:04:46 PM UTC-6, ggggg wrote:
>
> It turns out Vector{Bool} does not work well with JLD.  So I played around 
> with BitArray, and I figured out that it would be pretty easy to use for my 
> purposes. It seems to me that BitArray could be made a bit more useful by 
> exporting a reinterpret method.  That would certainly make my use case use 
> less code, but it could also replace the current implementation of bits.  I 
> think it makes more sense for bits to return a BitArray than a String 
> anyway, since it would be much faster for uses like bits(4)[1].  Would it 
> be worth making a pull request adding something like this to Base? 
> (Clearly redefining bits would change behavior and break things, so I'm not 
> sure how to approach that)
>
> I wrote up some simple example code and it works fine. However, it isn't 
> actually any faster than the current bits implementation, which I found 
> surprising. Maybe it would be a bit faster if a BitArray could be 
> constructed from v directly, instead of allocating and then immediately 
> replacing b.chunks.
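> function reinterpret2(::Type{BitArray}, v::Vector{Uint64}, dims=(64,-1))
>     # by default, treat each Uint64 in v as one 64-bit column
>     dims[2] == -1 && (dims = (64, length(v)))
>     # check to make sure the dims are appropriate for the length of v
>     assert(length(v) == div(prod(dims) + 63, 64))
>     b = BitArray(dims...)
>     b.chunks = v   # use v as the backing storage (no copy)
>     b
> end
>
> function reinterpret2(::Type{BitArray}, i::Uint64, dims=64)
>     assert(dims <= 64)
>     b = BitArray(dims)
>     b.chunks = [i]
>     b
> end
>
> bits2(i::Uint64) = reinterpret2(BitArray, i)
> bits2(x) = reinterpret2(BitArray, reinterpret(Uint64, x))
>
> # note: bits(i)[1] is the most significant bit (a Char), while
> # bits2(i)[1] is the least significant bit (a Bool)
> testbits(n) = [bits(i)[1] for i=1:n]
> testbits2(n) = Bool[bits2(i)[1] for i=1:n]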
>
>
> testbits(1);
> @time testbits(100000);
> testbits2(1);
> @time testbits2(100000);
>
>
>
> On Tuesday, August 5, 2014 11:26:39 PM UTC-6, Simon Kornblith wrote:
>>
>> Assuming you have enough memory to write a BitArray to the JLD file 
>> initially, if you later open the JLD file with mmaparrays=true and read 
>> it, JLD will mmap the underlying Vector{Uint64} so that pieces are read 
>> from the disk as they are accessed. (The actual specifics of how this works 
>> are up to the OS, but generally it works well.) In principle you can also 
>> modify the BitArray and the changes will be saved to the disk, although I'm 
>> not sure how well that works since I don't do it in my own code. There is no 
>> easy way to resize the BitArray if you do this, though.
>>
>> Simon
>>
>> On Tuesday, August 5, 2014 5:06:16 PM UTC-4, Tim Holy wrote:
>>>
>>> To me it sounds like you've come up with the main options: BitArray or 
>>> Array{Bool}. Since a BitArray is, underneath, a Vector{Uint64} with 
>>> different indexing semantics, it seems you could probably come up with a 
>>> reasonable way to update just part of it. But even if you use Array{Bool}, 
>>> you're "only" talking a few hundred megabytes, which is not 
>>> catastrophically large. Also consider keeping everything in memory; with 
>>> 100GB of RAM you could store an awful lot of selections. 
>>>
>>> --Tim 
>>>
>>> On Tuesday, August 05, 2014 12:01:58 PM ggggg wrote: 
>>> > Hello, 
>>> > 
>>> > I have an application where I have a few hundred million events, and 
>>> > I'd like to make and work with different selections of sets of those 
>>> > events. The events each have various values associated with them, say 
>>> > for simplicity color, timestamp, and loudness. Say one selection 
>>> > includes all the events within 5 minutes after a blue event. Or I want 
>>> > to select all events that aren't above some loudness threshold. I'd like 
>>> > to be able to save these selections in a JLD file for later use on some 
>>> > or all events. I also need to be able to update the selections as I 
>>> > observe more events. 
>>> > 
>>> > My baseline plan is to have an integer associated with each event, 
>>> > where each bit in the integer i corresponds to a selection. So bit 1 is 
>>> > true for events within 5 minutes and bit 2 is true for events above the 
>>> > loudness threshold. Then for each event's integer I can do bits(i)[1] or 
>>> > bits(i)[2] to figure out if it is included in each selection. That seems 
>>> > like it would be inefficient since bits() returns a string, so I would 
>>> > have to call bool(bits(i)[1]). I could use a bitwise mask of some sort 
>>> > like 1&i==0 for the first bit and 2&i==0 for the second bit. 
>>> > 
>>> > A BitArray seems like a decent choice, except that you can only 
>>> > read/write the entire array from a JLD file, rather than just a part of 
>>> > it. That will be inefficient since I'll often want to look at only a 
>>> > small subset of the total events. And every time I want to update for 
>>> > new events, I would need to read the entire BitArray, extend it in 
>>> > memory, then delete the old JLD object and replace it with a new JLD 
>>> > object. It seems plausible I could figure out how to read/write part of 
>>> > a BitArray from a JLD as I've already done some hacking on HDF5.jl, but 
>>> > that could be a large amount of work. 
>>> > 
>>> > An Array{Bool} works well with JLD, and seems just as well suited as a 
>>> > BitArray. I think it's 8 times bigger than BitArray, and has a similar 
>>> > space ratio to an integer (depending on how many selections I actually 
>>> > use) because bools are stored as 1 byte? I can probably live with that, 
>>> > although again it seems sort of inefficient. 
>>> > 
>>> > Any advice on how I should go about deciding, or options I hadn't 
>>> > considered?  Also why does bits() return a string, instead of say 
>>> > Vector{Bool} or BitArray? 
>>>
>>>
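
On the size question in the original message (Array{Bool} being about 8 
times bigger than a BitArray): a quick check, assuming a Julia version where 
BitArray keeps its data in the internal chunks field, as the code earlier in 
this thread does. The variable names are only for illustration.

```julia
n = 64_000                  # number of events in this toy example

asbools = fill(false, n)    # Vector{Bool}: one byte per element
asbits  = falses(n)         # BitArray: one bit per element, packed into 64-bit chunks

sizeof(asbools)             # 64000 bytes
sizeof(asbits.chunks)       # 8000 bytes; chunks is an internal field of BitArray
```

So for a few hundred million events, one selection is a few hundred 
megabytes as Bool and roughly an eighth of that as a BitArray.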
