Hi Bernard,

I'm going to see if I can add some performance metrics to the test code to
try to determine where the bottlenecks are. If that goes well,
I'll see if I can add a branch that returns Byte[] instead of String for
iota/vec.

I suspect this will just require modifying iota.FileVector.getChunk() and
iota.FileVector.cachedChunk.

I'm still not clear on a good way to handle ByteBuffer with offsets though,
because Java's mmap() doesn't support mapping more than Integer/MAX_VALUE
bytes at once, so larger files can't be covered by a single buffer.

To work around this I've had to patch the following together:
https://github.com/thebusby/iota/blob/master/src/java/iota/Mmap.java
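
For readers following along, here is a minimal sketch of the general
workaround (several smaller mappings stitched together behind a long
offset). It is not the code in Mmap.java above; the class name ChunkedMmap
and its methods are hypothetical, and the chunk size is a parameter only so
the idea can be tried on small files.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper (not iota's Mmap.java): maps a file as fixed-size chunks. */
final class ChunkedMmap {
    private final MappedByteBuffer[] chunks;
    private final long chunkSize;
    private final long size;

    ChunkedMmap(Path file, long chunkSize) throws IOException {
        // FileChannel.map rejects sizes above Integer.MAX_VALUE, so each chunk must fit.
        if (chunkSize <= 0 || chunkSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("chunkSize must be in 1..Integer.MAX_VALUE");
        }
        this.chunkSize = chunkSize;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            size = ch.size();
            int n = (int) ((size + chunkSize - 1) / chunkSize);
            chunks = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long start = i * chunkSize;
                long len = Math.min(chunkSize, size - start);
                // A mapping stays valid after the channel that created it is closed.
                chunks[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, len);
            }
        }
    }

    long size() {
        return size;
    }

    /** Reads one byte at an absolute file offset. */
    byte get(long pos) {
        return chunks[(int) (pos / chunkSize)].get((int) (pos % chunkSize));
    }

    /** Copies [start, end) into a new array; a range that spans two chunks cannot be returned without copying. */
    byte[] get(long start, long end) {
        byte[] out = new byte[(int) (end - start)];
        for (int i = 0; i < out.length; i++) {
            out[i] = get(start + i);
        }
        return out;
    }
}
```

The byte-at-a-time copy keeps the sketch short; a real version would
presumably do bulk reads within each chunk. The point is only that a record
crossing a chunk boundary has to be copied into a fresh buffer.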

I do think I understand what you're asking for though.
Unfortunately for this level of tuning I still primarily fall back to C and
JNI, so my Java/Clojure is likely a little weak here.
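
As a rough illustration of the request as I read it (index the lines by
offset and peek at bytes, without building a String per line), something
along these lines might work for a single buffer. LineIndex and its methods
are hypothetical names, and lastField assumes the last field is a
non-negative ASCII decimal integer, which may not match your data. It also
uses int offsets, so it only covers one ByteBuffer under 2GB.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: index '\n'-delimited lines by offset, with no String per line. */
final class LineIndex {
    private final ByteBuffer buf;
    private final int[] ends; // exclusive end offset of each line ('\n' position or buffer limit)

    LineIndex(ByteBuffer buf) {
        this.buf = buf;
        int limit = buf.limit();
        // First pass: count lines so the offset array can be sized exactly.
        int lines = 0;
        for (int i = 0; i < limit; i++) {
            if (buf.get(i) == '\n') lines++;
        }
        if (limit > 0 && buf.get(limit - 1) != '\n') lines++; // unterminated last line
        ends = new int[lines];
        // Second pass: record where each line ends.
        int k = 0;
        for (int i = 0; i < limit; i++) {
            if (buf.get(i) == '\n') ends[k++] = i;
        }
        if (k < lines) ends[k] = limit;
    }

    int count() {
        return ends.length;
    }

    int start(int line) {
        return line == 0 ? 0 : ends[line - 1] + 1;
    }

    int end(int line) {
        return ends[line];
    }

    /** Parses the digits after the last delimiter on the line, reading bytes only (assumes ASCII digits). */
    long lastField(int line, byte delim) {
        int lineStart = start(line);
        int end = end(line);
        int i = end;
        while (i > lineStart && buf.get(i - 1) != delim) i--;
        long value = 0;
        for (; i < end; i++) {
            value = value * 10 + (buf.get(i) - '0');
        }
        return value;
    }
}
```

An empty line shows up here as start(line) == end(line), so the caller can
skip it without any allocation.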


Regarding why iota/vec returns nil on empty lines, it's to allow for
(->> (iota/vec "filename.tsv")
         (r/filter identity)
         ...


Hope this helps,
    Alan


On Sat, Mar 9, 2013 at 9:41 AM, bernardH <un.compte.pour.tes...@gmail.com> wrote:

> Hi Alan,
>
>
>
> On Friday, March 8, 2013 4:02:18 PM UTC+1, Alan Busby wrote:
>>
>> Hi Bernard,
>>
>> I'd certainly like to add support for binary files, but as I haven't had
>> a need for it myself I haven't had a good place to start.
>>
>> As Java NIO's mmap() doesn't support ranges over 2GB, I've had to paste
>> together multiple mmap's to cover files that are larger than 2GB.
>> So if a record ended up spanning two mmap()'s, you couldn't return the
>> raw data as a single object without copying it into a new buffer first.
>>
>> Also, if you provide a fixed record size in bytes for "doing the idx
>> offset maths", why do you need the end idx for the current line as well?
>> For example if you say file.bin is full of records each 100B in size, and
>> you ask for the 10th record; don't you already know that the length of the
>> record is 100B?
>>
>>
> Indeed, the correlation between txt/binary and char (i.e. \n)
> delimited/fixed length record is very strong. However in my case I want to
> first handle a \n delimited (txt) file as binary for performance reasons.
> The context is that I have to consider all the lines of data, but might
> not have to do "heavy" processing on all of them, so I want to do as little
> work as possible on each line (i.e. not construct any java.lang.String).
> This is in no way Clojure specific, I have two implementations in Java of
> a small Minimum Spanning Tree program :
> - one is constructing Strings from all the lines:
> https://www.refheap.com/paste/12312
> - one is using offsets from a raw ByteBuffer :
> https://www.refheap.com/paste/12313
>
> As most of the lines are not really processed (just sorted according to
> the last field), being able to only peek at the relevant bytes instead of
> constructing full blown java.lang.Strings is a huge performance boost.
> FWIW, as far as performance is concerned, I draw the line not between
> Clojure and Java but between objects (constructed by copying some data
> somewhere on the heap) and arrays of primitive data types, because
> nowadays, cache locality trumps everything (once you got rid of reflection
> calls in Clojure, obviously).
>
> So ideally, maybe 2 x 2 combinations (String / offset in ByteArray) x
> (char delimited / fixed length) would be needed to cover all the needs.
>
> Thanks again for sharing your library !
>
> Cheers,
>
> Bernard
>
> PS: Is there a rationale for returning nil instead of empty String "" on
> empty lines with iota/vec?
>
>  --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>
