[GitHub] [arrow-julia] bdklahn opened a new issue, #330: Show Map example in documentation?

GitBox Thu, 11 Aug 2022 09:15:19 -0700


bdklahn opened a new issue, #330:
URL: https://github.com/apache/arrow-julia/issues/330


   I'm so glad someone implemented Arrow for Julia. Thanks!
   
   And I think the intro to the User Manual is the clearest explanation I've 
come across of the what and why of Arrow.
   
   It looks like we can create a map array from a collection (Vector) of Dict 
items:
   
https://github.com/apache/arrow-julia/blob/532b89b2c5740124cadca632a14ebb6cc9a0dca5/src/arraytypes/map.jl#L49-L65
   
   I have been considering storing (caching, really) graph data in Arrow 
structures.
   My thought has been to create a Map of Int to Struct, where the Struct would 
define a node type. I saw that one can define and create an array of structs 
(by registering a custom type with the schema). Could someone post a quick 
example of creating an array of Dict, where the key is an Int and the value is 
a user-defined struct? My first attempt failed, and I think it might be because 
the Arrow schema needs to know about the Dict-of-node-struct type, not (only) 
the struct type.
   
   Whether I should be doing this at all is another question, because I wonder:
   1. Will I lose some of the benefit of Arrow if (de)serialization is needed 
to convert between Arrow and Julia struct types?
   2. Does a Map type really give much benefit?
   
   Someone here can probably answer the first one easily. Perhaps I am better 
off storing more primitive types, then constructing my structs on ingress.
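   
   To make the "primitive types, then construct structs on ingress" idea 
concrete, here is a rough sketch in plain Python (standard library only, not 
Arrow.jl or any Arrow API). The `Node` type, its fields, and the helper names 
are all hypothetical; the point is only that each field becomes its own 
homogeneous column and the structs are rebuilt when the data is read back.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical graph node type, used only for illustration."""
    id: int
    weight: float
    label: str


def to_columns(nodes):
    """Split a list of nodes into one homogeneous list per field."""
    return {
        "id": [n.id for n in nodes],
        "weight": [n.weight for n in nodes],
        "label": [n.label for n in nodes],
    }


def from_columns(cols):
    """Rebuild the node structs from the per-field columns."""
    return [
        Node(i, w, l)
        for i, w, l in zip(cols["id"], cols["weight"], cols["label"])
    ]


nodes = [Node(1, 0.5, "a"), Node(2, 1.5, "b")]
cols = to_columns(nodes)
restored = from_columns(cols)
```

   The cost I am asking about in question 1 is the `from_columns` step: the 
columns can stay in their stored form, but building the structs means touching 
every row.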
   
   For the second, I am not sure what exactly a map type gets you in Arrow 
since, as I understand it, everything is contiguous and read-only anyway. I.e., 
it is not like a Dict, which is hashed in and out of memory (right?). Does 
anyone know whether the Arrow map type does something like create separate, 
but linked, arrays to make indexing faster (implicitly?), because the keys and 
values can each be in their own homogeneously typed array?
   I think [BadgerDB](https://github.com/outcaste-io/badger), for example, gets 
some performance benefit from separating key and value storage. I wonder if it 
is something like that.
   Maybe they apply some implicit red-black tree logic to (sorted) keys (in any 
of their processing functions)?
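   
   As far as I can tell from the Arrow columnar format description, a Map is 
laid out like a List of key/value structs: one offsets array plus a flat keys 
array and a flat values array. Here is a rough plain-Python model of that 
reading (standard library only; the function names are made up for 
illustration and are not Arrow.jl or pyarrow calls):

```python
def build_map_column(rows):
    """Flatten a list of dicts into (offsets, keys, values).

    offsets[i]:offsets[i + 1] is the slice of keys/values belonging to row i.
    """
    offsets, keys, values = [0], [], []
    for row in rows:
        for k, v in row.items():
            keys.append(k)
            values.append(v)
        offsets.append(len(keys))
    return offsets, keys, values


def lookup(column, row_index, key):
    """Find a key within one row by scanning that row's slice.

    There is no hash table in this model, so this is a linear scan.
    """
    offsets, keys, values = column
    start, stop = offsets[row_index], offsets[row_index + 1]
    for i in range(start, stop):
        if keys[i] == key:
            return values[i]
    raise KeyError(key)


rows = [{1: "a", 2: "b"}, {}, {7: "c"}]
column = build_map_column(rows)
found = lookup(column, 0, 2)
```

   If that model is right, the keys and values do live in their own 
homogeneous arrays, but nothing in the layout itself gives hashed or 
tree-based lookup; any faster indexing would have to come from the library on 
top.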
   
   I also think Julia's visibility might benefit from having the clearest, most 
comprehensive set of examples (referenced) in the Arrow documentation. I think 
the Python examples there are currently the most complete and user-friendly of 
any language. I guess that must be because the Arrow folks implemented that 
library(?). I bet Julia folks could do even better, making it much more 
frictionless to try and use Julia for Arrow.

