Re: [Haskell-cafe] Big Arrays
Thanks for the response. That sequence-comparison use case sounds very impressive.

On Wed, Oct 6, 2010 at 2:23 PM, Ketil Malde wrote:

> Hemanth Kapila writes:
>
> > Let us say, we are using a bit-array of size 2^43 (that is, a byte array
> > of size 2^40) to store a bloom filter. And let us further assume that we
> > are interested in a false-positive probability of 0.01
>
> Since we are just making up numbers, let us instead say we are using a
> bit array of size 2^32 - still too large for Int indexing (which was the
> issue here) but "only" 500MB in size.
>
> > That means, I will be able to use this array to represent a set of
> > cardinality 9.18e11 ~ 10^12
>
> ...bringing it down to less than 10^9, easily reached when building an
> index of k-words (tuples) in e.g. the human genome (3GB).
>
> But: I'm about to start analyzing Illumina sequencing data, where we have
> sequences from two individuals of the same species. I'm interested in
> the differences between these individuals, so I might want to index both
> data sets and screen the sequences of each against the other to identify
> bits that don't appear in the other. Since I'll easily have tens, maybe
> a hundred gigabytes of sequence from each, it's not impossible to get
> into the territory you describe.
>
> (In practice, things will be limited by available RAM, sadly still some
> orders of magnitude less than 2^40 - although apparently SGI can deliver
> it. I guess I just need a bigger budget.)
>
> > I was curious to know what sort of programs would be dealing with sets
> > of 10^12 elements.
>
> Most likely a program using mutation, and probably not copying,
> multi-generation GC.
>
> Other data sets of considerable size are acoustics data (I understand a
> modern sonar delivers about a Gbit/sec continuously) and seismic data.
>
> Also, for more moderate bodies of data and more complex analysis, you
> might want to use a less frugal data structure, like suffix arrays.
>
> -k
> --
> If I haven't seen further, it is by standing in the footprints of giants
Re: [Haskell-cafe] Big Arrays
Hemanth Kapila writes:

> Let us say, we are using a bit-array of size 2^43 (that is, a byte array
> of size 2^40) to store a bloom filter. And let us further assume that we
> are interested in a false-positive probability of 0.01

Since we are just making up numbers, let us instead say we are using a
bit array of size 2^32 - still too large for Int indexing (which was the
issue here) but "only" 500MB in size.

> That means, I will be able to use this array to represent a set of
> cardinality 9.18e11 ~ 10^12

...bringing it down to less than 10^9, easily reached when building an
index of k-words (tuples) in e.g. the human genome (3GB).

But: I'm about to start analyzing Illumina sequencing data, where we have
sequences from two individuals of the same species. I'm interested in
the differences between these individuals, so I might want to index both
data sets and screen the sequences of each against the other to identify
bits that don't appear in the other. Since I'll easily have tens, maybe
a hundred gigabytes of sequence from each, it's not impossible to get
into the territory you describe.

(In practice, things will be limited by available RAM, sadly still some
orders of magnitude less than 2^40 - although apparently SGI can deliver
it. I guess I just need a bigger budget.)

> I was curious to know what sort of programs would be dealing with sets
> of 10^12 elements.

Most likely a program using mutation, and probably not copying,
multi-generation GC.

Other data sets of considerable size are acoustics data (I understand a
modern sonar delivers about a Gbit/sec continuously) and seismic data.

Also, for more moderate bodies of data and more complex analysis, you
might want to use a less frugal data structure, like suffix arrays.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
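A quick illustration of the k-word indexing Ketil mentions - a minimal
sketch, not from the thread, assuming the sequence is held as a strict
ByteString and treating every length-k substring as a k-word:

    import qualified Data.ByteString.Char8 as B

    -- | All k-words (k-mers) of a sequence, as its length-k slices.
    -- A 3GB genome has at most (length - k + 1) of them, i.e. well
    -- under 10^10 positions, and at most 4^k distinct words for DNA.
    kWords :: Int -> B.ByteString -> [B.ByteString]
    kWords k s = [ B.take k (B.drop i s) | i <- [0 .. B.length s - k] ]

    main :: IO ()
    main = print (kWords 4 (B.pack "ACGTACGTGA"))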
Re: [Haskell-cafe] Big Arrays
Thanks Ketil.

On Wed, Oct 6, 2010 at 1:36 AM, Ketil Malde wrote:

> > Just out of curiosity, may I know a use case of such huge arrays?
>
> Bloom filters?

I am probably being dense - I still don't see where one would use such a
big array.

Let us say, we are using a bit-array of size 2^43 (that is, a byte array
of size 2^40) to store a bloom filter. And let us further assume that we
are interested in a false-positive probability of 0.01.

That means I will be able to use this array to represent a set of
cardinality 9.18e11 ~ 10^12.

I was curious to know what sort of programs would be dealing with sets of
10^12 elements.

I am mainly curious about how one would decide that bloom filters are the
best algorithm when dealing with this amount of data. What factors do we
consider when deciding on the algorithm or the data structure? The impact
on GC would be taken into account, for example (I am guessing there would
be at least one copy from a younger generation to a permanent generation)?

/Hemanth K
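For reference, the 9.18e11 figure follows from the standard Bloom-filter
capacity formula n = -m * (ln 2)^2 / ln p for m bits and false-positive
rate p, assuming the optimal number of hash functions; a minimal sketch:

    -- | Approximate set cardinality a Bloom filter of m bits can hold
    -- at false-positive rate p, with an optimally chosen hash count.
    bloomCapacity :: Double -> Double -> Double
    bloomCapacity m p = negate m * log 2 ^ (2 :: Int) / log p

    main :: IO ()
    main = print (bloomCapacity (2 ** 43) 0.01)   -- ~9.18e11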
Re: [Haskell-cafe] Big Arrays
Hemanth Kapila writes:

> Just out of curiosity, may I know a use case of such huge arrays?

Bloom filters?

> At such sizes, I thought, the array would not have the expected array
> properties (constant access time) due to "thrashing".

Yes, but this is true for any array size, due to the cache hierarchy.
Accessing stuff in the same vector register is faster than the L1 cache,
which is faster than L2, then L3, then RAM, then swap (on SSD), then
(rotating) disk. :-)

Perhaps it's not /entirely/ unreasonable to consider array accesses to
have log(N) cost - but I'm fairly sure they're always faster than pointer
chasing in tree structures.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
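One way to put rough numbers on the "arrays still beat pointer chasing"
point - a sketch using the criterion package (sizes and names are my own,
not from the thread), comparing an unboxed-array index with an IntMap
lookup:

    import Criterion.Main (bench, defaultMain, whnf)
    import Data.Array.Unboxed (UArray, listArray, (!))
    import qualified Data.IntMap as IM

    n :: Int
    n = 1000000

    arr :: UArray Int Int
    arr = listArray (0, n - 1) [0 ..]

    im :: IM.IntMap Int
    im = IM.fromList (zip [0 .. n - 1] [0 ..])

    main :: IO ()
    main = defaultMain
      [ bench "UArray (!)"    (whnf (arr !) (n `div` 2))
      , bench "IntMap lookup" (whnf (`IM.lookup` im) (n `div` 2))
      ]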
Re: [Haskell-cafe] Big Arrays
On Sun, Oct 3, 2010 at 19:09, Bryan O'Sullivan wrote:
> On Sun, Oct 3, 2010 at 11:54 AM, Henry Laxen
> wrote:
>>
>> I am trying to create a (relatively) big array,
>> but it seems I cannot make anything larger
>> than 2^30 or so. Here is the code:
>
> Use a 64-bit machine, where Int is 64 bits wide. Trying to create a larger
> array on a 32-bit machine doesn't make any sense.

Sure it does; a 32-bit system can address much more than 2**30 elements.
Artificially limiting how much memory can be allocated by depending on a
poorly specified type like 'Int' is a poor design decision in Haskell and
GHC.

OP: for this particular use case (unboxed Word64), an easy solution is to
have a structure like

    data BigArray a = BigArray (UArray Word32 a) (UArray Word32 a)
                               (UArray Word32 a) (UArray Word32 a)

with each array containing 2^30 elements. You'll need to write custom
indexing and modification functions to process the index and pass it to
the appropriate array. Something like:

    idxBig :: BigArray a -> Word32 -> (UArray Word32 a, Word32)
    idxBig (BigArray a0 a1 a2 a3) i
        | i < 2^30        = (a0, i)
        | i < 2^31        = (a1, i - 2^30)
        | i < 2^30 + 2^31 = (a2, i - 2^31)
        | otherwise       = (a3, i - 2^31 - 2^30)

Then wrap the existing array functions:

    idx :: IArray UArray a => BigArray a -> Word32 -> a
    idx arr i = let (arr', i') = idxBig arr i in arr' ! i'
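Filling in the modification side mentioned above - a self-contained sketch
(setBig is my name, not from the post) that restates the same four-chunk
structure and updates one element with the incremental (//) operator. Note
that (//) still copies the whole affected chunk, so for heavy writing a
mutable IOUArray/STUArray would be a better fit:

    import Data.Array.Unboxed (IArray, UArray, (//))
    import Data.Word (Word32)

    -- Same four-chunk structure as in the message above,
    -- each chunk holding 2^30 elements.
    data BigArray a = BigArray (UArray Word32 a) (UArray Word32 a)
                               (UArray Word32 a) (UArray Word32 a)

    -- | Write one element by rebuilding only the chunk that holds it.
    setBig :: IArray UArray a => BigArray a -> Word32 -> a -> BigArray a
    setBig (BigArray a0 a1 a2 a3) i x
        | i < 2^30        = BigArray (a0 // [(i, x)]) a1 a2 a3
        | i < 2^31        = BigArray a0 (a1 // [(i - 2^30, x)]) a2 a3
        | i < 2^30 + 2^31 = BigArray a0 a1 (a2 // [(i - 2^31, x)]) a3
        | otherwise       = BigArray a0 a1 a2 (a3 // [(i - 2^31 - 2^30, x)])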
Re: [Haskell-cafe] Big Arrays
On Sun, Oct 3, 2010 at 11:54 AM, Henry Laxen wrote:
>
> I am trying to create a (relatively) big array,
> but it seems I cannot make anything larger
> than 2^30 or so. Here is the code:
>

Use a 64-bit machine, where Int is 64 bits wide. Trying to create a larger
array on a 32-bit machine doesn't make any sense.
Re: [Haskell-cafe] Big Arrays
On Sun, Oct 3, 2010 at 11:18 AM, Bulat Ziganshin wrote:
> Hello Henry,
>
> Sunday, October 3, 2010, 7:54:49 PM, you wrote:
>
>> It looks like array ranges can only be Ints, and not Int64 or Word64 types.
>
> yes, it's Int internally for efficiency reasons. you can do your own
> implementation to override this limit :)
>

A good place to start when rolling your own is the primitive package [1]
on Hackage. It is intimately tied to GHC, however.

Antoine

[1] http://hackage.haskell.org/package/primitive
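A minimal sketch of what that can look like (my own example, not from the
post), using Data.Primitive.ByteArray from the primitive package. Sizes
and offsets are still Ints, so a single buffer remains bounded by
maxBound :: Int on the host platform; going beyond that again needs
chunking, as sketched earlier in the thread:

    import Data.Primitive.ByteArray
      (newByteArray, readByteArray, writeByteArray)
    import Data.Word (Word64)

    main :: IO ()
    main = do
      let n = 10 * 1000 * 1000 :: Int            -- number of Word64 slots
      arr <- newByteArray (n * 8)                -- size is given in bytes
      writeByteArray arr 1234567 (42 :: Word64)  -- offset in elements
      x <- readByteArray arr 1234567 :: IO Word64
      print x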
Re: [Haskell-cafe] Big Arrays
Hello Henry,

Sunday, October 3, 2010, 7:54:49 PM, you wrote:

> It looks like array ranges can only be Ints, and not Int64 or Word64 types.

yes, it's Int internally for efficiency reasons. you can do your own
implementation to override this limit :)

--
Best regards,
 Bulat                          mailto:bulat.zigans...@gmail.com