Hi Bernard,

I'm going to see if I can add some performance metrics to the test code to try to determine where the various bottlenecks are. If that goes well, I'll see if I can add a branch that returns Byte[] instead of String for iota/vec.
I suspect this will just require modifying iota.FileVector.getChunk() and iota.FileVector.cachedChunk. I'm still not clear on a good way to handle ByteBuffer with offsets though, because Java's mmap() doesn't support mapping regions larger than Integer/MAX_VALUE bytes. To work around this I've had to patch the following together:
https://github.com/thebusby/iota/blob/master/src/java/iota/Mmap.java

I do think I understand what you're asking for though. Unfortunately, for this level of tuning I still primarily fall back to C and JNI, so my Java/Clojure is likely a little weak here.

Regarding why iota/vec returns nil on empty lines, it's to allow for:

(->> (iota/vec "filename.tsv") (r/filter identity) ...

Hope this helps,
Alan

On Sat, Mar 9, 2013 at 9:41 AM, bernardH <un.compte.pour.tes...@gmail.com> wrote:

> Hi Alan,
>
> On Friday, March 8, 2013 4:02:18 PM UTC+1, Alan Busby wrote:
>>
>> Hi Bernard,
>>
>> I'd certainly like to add support for binary files, but as I haven't had
>> a need for it myself I haven't had a good place to start.
>>
>> As Java NIO's mmap() doesn't support ranges over 2GB, I've had to paste
>> together multiple mmap's to cover files that are larger than 2GB.
>> So if a record ended up spanning two mmap()'s, you couldn't return the
>> raw data as a single object without copying it into a new buffer first.
>>
>> Also, if you provide a fixed record size in bytes for "doing the idx
>> offset maths", why do you need the end idx for the current line as well?
>> For example, if you say file.bin is full of records each 100B in size, and
>> you ask for the 10th record, don't you already know that the length of the
>> record is 100B?
>>
>
> Indeed, the correlation between txt/binary and char (i.e. \n)
> delimited/fixed length record is very strong. However, in my case I want to
> first handle a \n delimited (txt) file as binary for performance reasons.
> The context is that I have to consider all the lines of data, but might
> not have to do "heavy" processing on all of them, so I want to do as little
> work as possible on each line (i.e. not construct any java.lang.String).
> This is in no way Clojure specific. I have two implementations in Java of
> a small Minimum Spanning Tree program:
> - one is constructing Strings from all the lines:
>   https://www.refheap.com/paste/12312
> - one is using offsets from a raw ByteBuffer:
>   https://www.refheap.com/paste/12313
>
> As most of the lines are not really processed (just sorted according to
> the last field), being able to only peek at the relevant bytes instead of
> constructing full-blown java.lang.Strings is a huge performance boost.
> FWIW, as far as performance is concerned, I draw the line not between
> Clojure and Java but between objects (constructed by copying some data
> somewhere on the heap) and arrays of primitive data types, because
> nowadays, cache locality trumps everything (once you get rid of reflection
> calls in Clojure, obviously).
>
> So ideally, maybe 2 x 2 combinations (String / offset in ByteArray) x
> (char delimited / fixed length) would be needed to cover all the needs.
>
> Thanks again for sharing your library!
>
> Cheers,
>
> Bernard
>
> PS: Is there a rationale for returning nil instead of the empty String "" on
> empty lines with iota/vec?
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
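
For reference, the idea discussed in this thread (several mmap()'d chunks stitched together, single bytes peeked at by absolute offset, and a copy made only when a record has to be handed back as one object) could be sketched roughly as follows. This is only an illustration and not the code in iota's Mmap.java: the class name ChunkedMmap, its methods, and the configurable chunk size are all assumptions made here. A real version would use a chunk size close to Integer.MAX_VALUE; a small one is allowed so the chunk-spanning case can be tried on a tiny file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper (not iota's Mmap.java): maps a file as several read-only chunks. */
final class ChunkedMmap {
    private final MappedByteBuffer[] chunks;
    private final long chunkSize;
    private final long size;

    ChunkedMmap(Path file, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            this.size = ch.size();
            int n = (int) ((size + chunkSize - 1) / chunkSize);
            this.chunks = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long start = (long) i * chunkSize;
                long len = Math.min(chunkSize, size - start);
                // Each mapping stays valid after the channel is closed.
                chunks[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, len);
            }
        }
    }

    long size() {
        return size;
    }

    /** Peeks at one byte at an absolute file offset; no String or array is built. */
    byte get(long pos) {
        if (pos < 0 || pos >= size) {
            throw new IndexOutOfBoundsException("pos=" + pos);
        }
        return chunks[(int) (pos / chunkSize)].get((int) (pos % chunkSize));
    }

    /** Copies [start, start + len) into a new array, even across chunk boundaries. */
    byte[] copyRange(long start, int len) {
        if (start < 0 || len < 0 || start + len > size) {
            throw new IndexOutOfBoundsException("start=" + start + ", len=" + len);
        }
        byte[] out = new byte[len];
        int copied = 0;
        while (copied < len) {
            long pos = start + copied;
            // duplicate() gives an independent position over the same mapped bytes.
            ByteBuffer view = chunks[(int) (pos / chunkSize)].duplicate();
            view.position((int) (pos % chunkSize));
            int n = Math.min(len - copied, view.remaining());
            view.get(out, copied, n);
            copied += n;
        }
        return out;
    }
}
```

With something like this, a caller that only needs to compare the last field of each line can call get() on the relevant offsets and skip copyRange() entirely; the copy is only paid for records that actually have to be materialized.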