Re: arrow read/write examples in Java

Siddharth Teotia Tue, 19 Dec 2017 10:04:22 -0800

From Arrow 0.8 onward, the second step, "Grab the corresponding mutator and
accessor objects by calls to getMutator(), getAccessor()", is not needed. In
fact, those methods are no longer there.


On Tue, Dec 19, 2017 at 10:01 AM, Siddharth Teotia <[email protected]>
wrote:

> Hi Animesh,
>
> First, I would suggest switching to the Arrow 0.8 release as soon as
> possible, since you are writing Java programs and the API usage has changed
> drastically. The new APIs are much simpler, with good javadocs and detailed
> internal comments.
>
> If you are writing a stop-gap implementation, it is probably fine to
> continue with the old version, but for the long term the new API is recommended.
>
>
>    - Create an instance of the vector. Note that this doesn't allocate
>    any memory for the elements in the vector
>    - Grab the corresponding mutator and accessor objects by calls to
>    getMutator(), getAccessor().
>    - Allocate memory
>       - *allocateNew()* - allocates memory for a default number of
>       elements in the vector. This applies to both fixed width and
>       variable width vectors.
>       - *allocateNew(valueCount)* - for fixed width vectors. Use this
>       method if you already know the number of elements to store in the
>       vector.
>       - *allocateNew(bytes, valueCount)* - for variable width vectors.
>       Use this method if you already know the total size (in bytes) of all the
>       variable width elements you will be storing in the vector. For example,
>       if you are going to store 1024 elements in the vector and the total size
>       across all variable width elements is under 1MB, you can call
>       allocateNew(1024*1024, 1024).
>    - Populate the vector:
>       - Use the *set()* or *setSafe()* APIs in the mutator interface. From
>       Arrow 0.8 onwards, you can use these APIs directly on the vector
>       instance, and the mutator/accessor are removed.
>       - The difference between set() and the corresponding setSafe() API is
>       that the latter internally takes care of expanding the vector's
>       buffer(s) for storing new data.
>       - Each set() API has a corresponding setSafe() API.
>    - Call setValueCount() with the number of elements you populated
>    in the vector.
>    - Retrieve elements from the vector:
>       - Use the get(), getObject() APIs in the accessor interface. Again,
>       from Arrow 0.8 onwards you can use these APIs directly.
>    - With respect to usage of setInitialCapacity:
>       - Let's say your application always issues calls to allocateNew().
>       It is likely that this will end up over-allocating memory because it
>       assumes a default value count to begin with.
>       - In this case, if you call setInitialCapacity() followed by
>       allocateNew(), the latter does not do the default memory allocation. It
>       allocates exactly for the value capacity you specified in
>       setInitialCapacity().
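
The sizing behaviour in the last two bullets can be sketched with a toy model in plain Java, with no Arrow dependency. ToySizedVector and its DEFAULT_VALUE_COUNT of 4096 are hypothetical, chosen only for illustration; they are not Arrow's classes and not necessarily its real default. Only the method names mirror the ones mentioned above.

```java
/** Toy model of allocation sizing; NOT Arrow's implementation. */
final class ToySizedVector {
    // Assumed default element count, for illustration only.
    static final int DEFAULT_VALUE_COUNT = 4096;

    private int initialCapacity = DEFAULT_VALUE_COUNT;
    private int[] data = new int[0]; // creating the vector allocates no element memory

    /** Records the capacity that the next no-argument allocateNew() should use. */
    void setInitialCapacity(int valueCount) {
        this.initialCapacity = valueCount;
    }

    /** Allocates for the recorded capacity (the default unless setInitialCapacity was called). */
    void allocateNew() {
        data = new int[initialCapacity];
    }

    /** Allocates for exactly valueCount elements. */
    void allocateNew(int valueCount) {
        data = new int[valueCount];
    }

    int getValueCapacity() {
        return data.length;
    }

    public static void main(String[] args) {
        ToySizedVector byDefault = new ToySizedVector();
        byDefault.allocateNew(); // over-allocates if only a few values are needed
        System.out.println("default capacity: " + byDefault.getValueCapacity());

        ToySizedVector sized = new ToySizedVector();
        sized.setInitialCapacity(100);
        sized.allocateNew(); // now sized for exactly 100 values
        System.out.println("sized capacity: " + sized.getValueCapacity());
    }
}
```

The point of the sketch is only the ordering: setInitialCapacity() has to come before allocateNew() to have any effect on how much is allocated.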
>
> I would highly recommend taking a look at
> https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
> This has lots of examples around populating the vector, retrieving from
> vector, using setInitialCapacity(), using set(), setSafe() methods and a
> combination of them to understand when things can go wrong.
>
> Hopefully this helps. Meanwhile we will try to add some internal README
> for the usage of vectors.
>
> Thanks,
> Siddharth
>
> On Tue, Dec 19, 2017 at 8:55 AM, Emilio Lahr-Vivaz <[email protected]>
> wrote:
>
>> This has probably changed with the Java code refactor, but I've posted
>> some answers inline, to the best of my understanding.
>>
>> Thanks,
>>
>> Emilio
>>
>> On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
>>
>>> Thanks Wes for your help.
>>>
>>> Based upon some code reading, I managed to code-up a basic working
>>> example.
>>> The code is here:
>>> https://github.com/animeshtrivedi/ArrowExample/tree/master/src/main/java/com/github/animeshtrivedi/arrowexample
>>>
>>> However, I do have some questions about the concepts in Arrow
>>>
>>> 1. ArrowBlock is the unit of reading/writing. One ArrowBlock essentially
>>> is the amount of the data one must hold in-memory at a time. Is my
>>> understanding correct?
>>>
>> yes
>>
>>>
>>> 2. There are Base[Reader/Writer] interfaces as well as Mutator/Accessor
>>> classes in the ValueVector interface - both are implemented by all
>>> supported data types. What is the relationship between these two? Or when
>>> is one supposed to use one over the other? I only use Mutator/Accessor
>>> classes in my code.
>>>
>> The writer/reader interfaces are parallel implementations that make some
>> things easier, but don't encompass all available functionality (for
>> example, fixed size lists, nested lists, some dictionary operations, etc).
>> However, you should be able to accomplish everything using
>> mutators/accessors.
>>
>>>
>>> 3. What are the "safe" variant functions in the Mutator's code? I could
>>> not understand what they are meant to achieve.
>>>
>> The safe methods ensure that the vector is large enough to set the value.
>> You can use the unsafe versions if you know that your vector has already
>> allocated enough space for your data.
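
As a rough illustration of that difference, here is a toy model in plain Java with no Arrow dependency. ToySafeVector is a hypothetical class, and the doubling growth policy is an assumption made for the sketch, not a statement about how Arrow resizes its buffers.

```java
import java.util.Arrays;

/** Toy model of set() versus setSafe(); NOT Arrow's implementation. */
final class ToySafeVector {
    private long[] data;
    private int valueCount;

    ToySafeVector(int capacity) {
        data = new long[capacity];
    }

    /** Unchecked write: the caller guarantees index < capacity, otherwise this toy throws. */
    void set(int index, long value) {
        data[index] = value;
    }

    /** Checked write: grows the backing array (by doubling, an assumption) until index fits. */
    void setSafe(int index, long value) {
        while (index >= data.length) {
            data = Arrays.copyOf(data, Math.max(1, data.length * 2));
        }
        data[index] = value;
    }

    void setValueCount(int count) {
        valueCount = count;
    }

    int getValueCount() {
        return valueCount;
    }

    long get(int index) {
        return data[index];
    }

    int getValueCapacity() {
        return data.length;
    }

    public static void main(String[] args) {
        ToySafeVector v = new ToySafeVector(2);
        v.set(0, 10L);
        v.set(1, 11L);
        v.setSafe(2, 12L); // does not fit in capacity 2, so the vector grows first
        v.setValueCount(3);
        System.out.println("capacity after setSafe: " + v.getValueCapacity());
    }
}
```

In this toy, calling set(2, ...) on the two-element vector would throw instead of growing, which is the situation the unsafe variants leave to the caller to avoid.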
>>
>>> 4. What are MinorTypes?
>>>
>> Minor types are a representation of the different vector types. I believe
>> they are being de-emphasized in favor of FieldTypes, as minor types don't
>> contain enough information to represent all vectors.
>>
>>>
>>> 5. For a writer, what is a dictionary provider? For example in the
>>> Integration.java code, the reader is given as the dictionary provider for
>>> the writer. But, is it something more than just:
>>> DictionaryProvider.MapDictionaryProvider provider = new
>>> DictionaryProvider.MapDictionaryProvider();
>>> ArrowFileWriter arrowWriter = new ArrowFileWriter(root, provider,
>>> fileOutputStream.getChannel());
>>>
>> The dictionary provider is an interface for looking up dictionary values.
>> When reading a file, the reader itself has already read the dictionaries
>> and thus serves as the provider.
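
To make "an interface for looking up dictionary values" concrete, here is a toy model in plain Java with no Arrow dependency. ToyDictionaryProvider, ToyMapProvider and decode() are hypothetical names invented for this sketch; they only imitate the idea of a provider keyed by dictionary id and are not Arrow's DictionaryProvider API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of dictionary lookup through a provider; NOT Arrow's implementation. */
final class ToyDictionaryDemo {
    /** Hypothetical provider: maps a dictionary id to that dictionary's values. */
    interface ToyDictionaryProvider {
        List<String> lookup(long id);
    }

    /** Map-backed provider, similar in spirit to the map-based provider in the question. */
    static final class ToyMapProvider implements ToyDictionaryProvider {
        private final Map<Long, List<String>> dictionaries = new HashMap<>();

        void put(long id, List<String> values) {
            dictionaries.put(id, values);
        }

        @Override
        public List<String> lookup(long id) {
            return dictionaries.get(id);
        }
    }

    /** Decodes an index column against the dictionary registered under dictionaryId. */
    static String[] decode(int[] indices, long dictionaryId, ToyDictionaryProvider provider) {
        List<String> dictionary = provider.lookup(dictionaryId);
        if (dictionary == null) {
            throw new IllegalArgumentException("no dictionary with id " + dictionaryId);
        }
        String[] decoded = new String[indices.length];
        for (int i = 0; i < indices.length; i++) {
            decoded[i] = dictionary.get(indices[i]);
        }
        return decoded;
    }

    public static void main(String[] args) {
        ToyMapProvider provider = new ToyMapProvider();
        provider.put(1L, Arrays.asList("red", "green", "blue"));
        String[] decoded = decode(new int[] {2, 0, 0, 1}, 1L, provider);
        System.out.println(String.join(",", decoded));
    }
}
```

A writer needs the same lookup in the other direction of the workflow: whatever holds the dictionaries (a map you filled yourself, or a reader that has already loaded them) can play the provider role.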
>>
>>> 6. I am not sure about the sequence of calls that one needs to make to
>>> write through mutators. For example, if I code something like
>>> NullableIntVector intVector = (NullableIntVector) fieldVector;
>>> NullableIntVector.Mutator mutator = intVector.getMutator();
>>> [.write num values]
>>> mutator.setValueCount(num)
>>> then this works for primitive types, but not for VarBinary type. There I
>>> have to set the capacity first,
>>>
>>> NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector)
>>> fieldVector;
>>> varBinaryVector.setInitialCapacity(items);
>>> varBinaryVector.allocateNew();
>>> NullableVarBinaryVector.Mutator mutator = varBinaryVector.getMutator();
>>>
>> The method calls are not very well documented - I would suggest looking
>> at the reader/writer implementations to see what calls are required for
>> which vector types. Generally variable length vectors (lists, var binary,
>> etc) behave differently than fixed width vectors (ints, longs, etc).
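
One way to see why variable width vectors need more up-front sizing is a toy model with an offsets array next to the data buffer. This is plain Java with no Arrow dependency; ToyVarBinaryVector is a hypothetical class, it only supports in-order writes, and it borrows the allocateNew(bytes, valueCount) shape mentioned elsewhere in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy variable-width vector: a data buffer plus offsets; NOT Arrow's implementation. */
final class ToyVarBinaryVector {
    private byte[] data = new byte[0];
    private int[] offsets = new int[1]; // offsets[i] and offsets[i + 1] delimit value i
    private int valueCount;

    /** Sizes both buffers: total payload bytes, and one offset slot per value plus one. */
    void allocateNew(int totalBytes, int valueCount) {
        data = new byte[totalBytes];
        offsets = new int[valueCount + 1];
        this.valueCount = 0;
    }

    /** Writes value at index; in this toy, values must be written in index order. */
    void set(int index, byte[] value) {
        int start = offsets[index];
        System.arraycopy(value, 0, data, start, value.length);
        offsets[index + 1] = start + value.length;
    }

    void setValueCount(int count) {
        valueCount = count;
    }

    int getValueCount() {
        return valueCount;
    }

    byte[] get(int index) {
        return Arrays.copyOfRange(data, offsets[index], offsets[index + 1]);
    }

    public static void main(String[] args) {
        ToyVarBinaryVector v = new ToyVarBinaryVector();
        v.allocateNew(16, 2); // room for 16 payload bytes across 2 values
        v.set(0, "ab".getBytes(StandardCharsets.UTF_8));
        v.set(1, "cdef".getBytes(StandardCharsets.UTF_8));
        v.setValueCount(2);
        System.out.println(new String(v.get(1), StandardCharsets.UTF_8));
    }
}
```

A fixed width vector can derive its buffer size from the element count alone, whereas here the payload size is a second, independent quantity, so writing before any allocation fails in this toy.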
>>
>>> Examples of these are here:
>>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowWrite.java
>>> (writeField[???] functions).
>>>
>>> Thank you very much,
>>> --
>>> Animesh
>>>
>>>
>>>
>>> On Thu, Dec 14, 2017 at 6:15 PM, Wes McKinney <[email protected]>
>>> wrote:
>>>
>>>> Hi Animesh,
>>>>
>>>> I suggest you try the ArrowStreamReader/Writer or
>>>> ArrowFileReader/Writer classes. See
>>>> https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java
>>>> for example working code for this
>>>>
>>>> - Wes
>>>>
>>>> On Thu, Dec 14, 2017 at 8:30 AM, Animesh Trivedi
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> It might be a trivial question, so please let me know if I am missing
>>>>> something.
>>>>>
>>>>> I am trying to write and read files in the Arrow format in Java. My
>>>>> data is a simple flat schema with primitive types. I already have the
>>>>> data in Java.
>>>>> So my questions are:
>>>>> 1. Is this possible, or am I fundamentally missing something about what
>>>>> Arrow can or cannot do (or is designed to do)? I assume that an efficient
>>>>> in-memory columnar data format should work with files too.
>>>>> 2. Can you point me to a working example, or a starting example?
>>>>> Intuitively I am looking for a way to define a schema and write/read
>>>>> column vectors to/from files, as one does with Parquet or ORC.
>>>>>
>>>>> I tried to locate some working examples with the ArrowFile[Reader/Writer]
>>>>> classes in the maven tests, but so far I am not sure where to start.
>>>>>
>>>>> Thanks,
>>>>> --
>>>>> Animesh
>>>>>
>>>>
>>
>
