I think the NullPointerException is caused by some issue in my new
writer (which uses my implementation of the ByteBuffer writable
interface). Let me narrow it down first.

The basic code, which does not use my writer implementation, seems to
work. This is the code that is on GitHub. I have not pushed the new writer
implementation yet.

Thanks
--
Animesh

On 20 Dec 2017 14:51, "Animesh Trivedi" <animesh.triv...@gmail.com> wrote:

Wes, Emilio, Siddharth - many thanks for the helpful replies and comments!

I managed to upgrade the code to the 0.8 API. I have to say that the 0.8 API
is much more intuitive ;) I will summarize my code example with some
documentation in a blog post soon (and post it here too).

- Is there first-class support for reading and writing Arrow files on HDFS?
I ask because the FSData[Output/Input]Stream classes from HDFS do not
implement the [Read/Writeable]ByteChannel interfaces required to instantiate
the ArrowFile readers and writers. I have already implemented something that
works for me, but I am wondering whether it would make sense to have these
facilities as utilities in the Arrow code.
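
A minimal sketch of the kind of adapter described above, using only the JDK
so that it stays self-contained. It assumes the writer side only needs a
java.nio.channels.WritableByteChannel and the reader side a
java.nio.channels.SeekableByteChannel. StreamChannels and PositionedSource
are hypothetical names made up for this illustration; they are not part of
Arrow or Hadoop. PositionedSource stands in for any input stream that
supports positional reads, and a real HDFS version would delegate to the
HDFS stream instead.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.NonWritableChannelException;
import java.nio.channels.SeekableByteChannel;
import java.nio.channels.WritableByteChannel;

public class StreamChannels {

    /** Hypothetical stand-in for an input stream that supports positional reads. */
    public interface PositionedSource {
        /** Reads up to length bytes starting at position; returns -1 at end of data. */
        int read(long position, byte[] dst, int offset, int length) throws IOException;

        long length() throws IOException;
    }

    /** Write side: any OutputStream can be exposed as a WritableByteChannel. */
    public static WritableByteChannel asWritableChannel(OutputStream out) {
        return Channels.newChannel(out);
    }

    /** Read side: a read-only SeekableByteChannel on top of positional reads. */
    public static SeekableByteChannel asSeekableChannel(final PositionedSource source) {
        return new SeekableByteChannel() {
            // Kept as a long: the file in this thread is 3965838890 bytes,
            // which does not fit in an int.
            private long position = 0L;
            private boolean open = true;

            @Override
            public int read(ByteBuffer dst) throws IOException {
                if (!open) {
                    throw new ClosedChannelException();
                }
                if (position >= source.length()) {
                    return -1; // end of data
                }
                if (!dst.hasRemaining()) {
                    return 0;
                }
                byte[] chunk = new byte[Math.min(dst.remaining(), 64 * 1024)];
                int n = source.read(position, chunk, 0, chunk.length);
                if (n > 0) {
                    dst.put(chunk, 0, n);
                    position += n;
                }
                // This may be fewer bytes than requested; callers must loop.
                return n;
            }

            @Override
            public int write(ByteBuffer src) {
                throw new NonWritableChannelException();
            }

            @Override
            public long position() {
                return position;
            }

            @Override
            public SeekableByteChannel position(long newPosition) {
                if (newPosition < 0) {
                    throw new IllegalArgumentException("negative position");
                }
                position = newPosition;
                return this;
            }

            @Override
            public long size() throws IOException {
                return source.length();
            }

            @Override
            public SeekableByteChannel truncate(long size) {
                throw new NonWritableChannelException();
            }

            @Override
            public boolean isOpen() {
                return open;
            }

            @Override
            public void close() {
                open = false;
            }
        };
    }
}
```

The two points that matter in a wrapper like this are that positions and
sizes stay long end to end, and that a single read() may return fewer bytes
than requested.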

My example code runs fine on a small example of 10 rows with
multiple batches, but it fails to read anything larger. I have not
verified whether it worked with version 0.7 or at what row count it starts
to fail. The writes are fine as far as I can tell. For example, I am
writing and then reading TPC-DS data (the store_sales table with int, long,
and double columns) and I get

[...]
Reading the arrow file : ./store_sales.arrow
File size : 3965838890 schema is Schema<ss_sold_date_sk: Int(32, true),
ss_sold_time_sk: Int(32, true), ss_item_sk: Int(32, true), ss_customer_sk:
Int(32, true), ss_cdemo_sk: Int(32, true), ss_hdemo_sk: Int(32, true),
ss_addr_sk: Int(32, true), ss_store_sk: Int(32, true), ss_promo_sk: Int(32,
true), ss_ticket_number: Int(64, true), ss_quantity: Int(32, true),
ss_wholesale_cost: FloatingPoint(DOUBLE), ss_list_price:
FloatingPoint(DOUBLE), ss_sales_price: FloatingPoint(DOUBLE),
ss_ext_discount_amt: FloatingPoint(DOUBLE), ss_ext_sales_price:
FloatingPoint(DOUBLE), ss_ext_wholesale_cost: FloatingPoint(DOUBLE),
ss_ext_list_price: FloatingPoint(DOUBLE), ss_ext_tax:
FloatingPoint(DOUBLE), ss_coupon_amt: FloatingPoint(DOUBLE), ss_net_paid:
FloatingPoint(DOUBLE), ss_net_paid_inc_tax: FloatingPoint(DOUBLE),
ss_net_profit: FloatingPoint(DOUBLE)>
Number of arrow blocks are 19
java.lang.NullPointerException
        at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:256)
        at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:242)
        at org.apache.arrow.vector.ipc.ArrowFileReader.readRecordBatch(ArrowFileReader.java:162)
        at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:113)
        at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:139)
        at com.github.animeshtrivedi.arrowexample.ArrowRead.makeRead(ArrowRead.java:82)
        at com.github.animeshtrivedi.arrowexample.ArrowRead.main(ArrowRead.java:217)


Some context: the file size is 3965838890 bytes and the schema read from
the file is correct. The code where it fails does something like:

        System.out.println("File size : " + arrowFile.length()
                + " schema is " + root.getSchema().toString());
        List<ArrowBlock> arrowBlocks = arrowFileReader.getRecordBlocks();
        System.out.println("Number of arrow blocks are " + arrowBlocks.size());
        for (int i = 0; i < arrowBlocks.size(); i++) {
            ArrowBlock rbBlock = arrowBlocks.get(i);
            // The NullPointerException is thrown from this call (ArrowRead.java:82).
            if (!arrowFileReader.loadRecordBatch(rbBlock)) {
                throw new IOException("Expected to read record batch");
            }
            // [...] rest of the loop body is not shown in this excerpt
        }

The stack trace comes from here:
https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowRead.java#L82

Any idea what might be happening?

Thanks,
--
Animesh

On Tue, Dec 19, 2017 at 7:03 PM, Siddharth Teotia <siddha...@dremio.com>
wrote:

> From Arrow 0.8, the second step "Grab the corresponding mutator and
> accessor objects by calls to getMutator(), getAccessor()" is not needed. In
> fact, it is not even there.
>
> On Tue, Dec 19, 2017 at 10:01 AM, Siddharth Teotia <siddha...@dremio.com>
> wrote:
>
> > Hi Animesh,
> >
> > Firstly, I would suggest switching over to the Arrow 0.8 release as soon
> > as possible, since you are writing Java programs and the API usage has
> > changed drastically. The new APIs are much simpler, with good javadocs and
> > detailed internal comments.
> >
> > If you are writing a stop-gap implementation then it is probably fine to
> > continue with the old version, but for the long term the new API is
> > recommended.
> >
> >
> >    - Create an instance of the vector. Note that this doesn't allocate
> >    any memory for the elements in the vector
> >    - Grab the corresponding mutator and accessor objects by calls to
> >    getMutator(), getAccessor().
> >    - Allocate memory
> >       - *allocateNew()* - allocates memory for a default number of
> >       elements in the vector. This is applicable to both fixed width and
> >       variable width vectors.
> >       - *allocateNew(valueCount)* - for fixed width vectors. Use this
> >       method if you already know the number of elements to store in the
> >       vector.
> >       - *allocateNew(bytes, valueCount)* - for variable width vectors.
> >       Use this method if you already know the total size (in bytes) of all
> >       the variable width elements you will be storing in the vector. For
> >       example, if you are going to store 1024 elements in the vector and
> >       the total size across all variable width elements is under 1MB, you
> >       can call allocateNew(1024*1024, 1024).
> >    - Populate the vector:
> >       - Use the *set()* or *setSafe()* APIs in the mutator interface. From
> >       Arrow 0.8 onwards, you can use these APIs directly on the vector
> >       instance, and mutator/accessor are removed.
> >       - The difference between set() and the corresponding setSafe() API
> >       is that the latter internally takes care of expanding the vector's
> >       buffer(s) for storing new data.
> >       - Each set() API has a corresponding setSafe() API.
> >    - Do a setValueCount() based on the number of elements you populated
> >    in the vector.
> >    - Retrieve elements from the vector:
> >       - Use the get(), getObject() APIs in the accessor interface. Again,
> >       from Arrow 0.8 onwards you can use these APIs directly.
> >    - With respect to usage of setInitialCapacity:
> >       - Let's say your application always issues calls to allocateNew().
> >       It is likely that this will end up over-allocating memory because it
> >       assumes a default value count to begin with.
> >       - In this case, if you do setInitialCapacity() followed by
> >       allocateNew(), then the latter doesn't do the default memory
> >       allocation. It allocates exactly for the value capacity you
> >       specified in setInitialCapacity().
> >
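To make the difference between set(), setSafe() and setInitialCapacity()
concrete, here is a toy model in plain Java. ToyIntVector is a hypothetical
class written only for this illustration; it is not Arrow's implementation,
its default capacity is an arbitrary placeholder, and it ignores nulls and
off-heap buffers. It only mimics the behavior described in the list above:
allocateNew() sizes the storage from the initial capacity, set() assumes the
slot already exists, and setSafe() grows the storage first.

```java
import java.util.Arrays;

/** Toy model of a fixed-width int vector. Hypothetical; NOT Arrow's implementation. */
public class ToyIntVector {
    // Illustrative default only; the real library's default may differ.
    private static final int DEFAULT_CAPACITY = 4096;

    private int initialCapacity = DEFAULT_CAPACITY;
    private int[] values = new int[0]; // creating the vector allocates nothing for elements
    private int valueCount = 0;

    /** Overrides the capacity that the no-argument allocateNew() will use. */
    public void setInitialCapacity(int capacity) {
        this.initialCapacity = capacity;
    }

    public void allocateNew() {
        allocateNew(initialCapacity);
    }

    public void allocateNew(int capacity) {
        values = new int[capacity];
        valueCount = 0;
    }

    public int getValueCapacity() {
        return values.length;
    }

    /** Unsafe variant: fails if index is beyond the allocated capacity. */
    public void set(int index, int value) {
        values[index] = value;
    }

    /** Safe variant: grows the backing storage before writing if needed. */
    public void setSafe(int index, int value) {
        if (index >= values.length) {
            values = Arrays.copyOf(values, Math.max(index + 1, values.length * 2));
        }
        values[index] = value;
    }

    public void setValueCount(int count) {
        valueCount = count;
    }

    public int getValueCount() {
        return valueCount;
    }

    public int get(int index) {
        if (index < 0 || index >= valueCount) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return values[index];
    }

    public static void main(String[] args) {
        ToyIntVector vector = new ToyIntVector();
        vector.setInitialCapacity(2); // avoid the default-sized allocation
        vector.allocateNew();
        vector.set(0, 10);            // fine: slot 0 is allocated
        vector.set(1, 11);            // fine: slot 1 is allocated
        vector.setSafe(2, 12);        // slot 2 is not allocated, so storage grows
        vector.setValueCount(3);
        for (int i = 0; i < vector.getValueCount(); i++) {
            System.out.println(vector.get(i));
        }
    }
}
```

In this toy, calling set(2, 12) instead of setSafe(2, 12) would throw,
because only two slots were allocated.
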
> > I would highly recommend taking a look at
> > https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
> > This has lots of examples around populating the vector, retrieving from
> > vector, using setInitialCapacity(), using set(), setSafe() methods and a
> > combination of them to understand when things can go wrong.
> >
> > Hopefully this helps. Meanwhile we will try to add some internal README
> > for the usage of vectors.
> >
> > Thanks,
> > Siddharth
> >
> > On Tue, Dec 19, 2017 at 8:55 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com>
> > wrote:
> >
> >> This has probably changed with the Java code refactor, but I've posted
> >> some answers inline, to the best of my understanding.
> >>
> >> Thanks,
> >>
> >> Emilio
> >>
> >> On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
> >>
> >>> Thanks Wes for your help.
> >>>
> >>> Based upon some code reading, I managed to code-up a basic working
> >>> example.
> >>> The code is here:
> >>> https://github.com/animeshtrivedi/ArrowExample/tree/master/src/main/java/com/github/animeshtrivedi/arrowexample
> >>>
> >>> However, I do have some questions about the concepts in Arrow:
> >>>
> >>> 1. ArrowBlock is the unit of reading/writing. One ArrowBlock essentially
> >>> is the amount of data one must hold in memory at a time. Is my
> >>> understanding correct?
> >>>
> >> yes
> >>
> >>>
> >>> 2. There are Base[Reader/Writer] interfaces as well as Mutator/Accessor
> >>> classes in the ValueVector interface - both are implemented by all
> >>> supported data types. What is the relationship between these two? Or
> >>> when is one supposed to use one over the other? I only use the
> >>> Mutator/Accessor classes in my code.
> >>>
> >> The writer/reader interfaces are parallel implementations that make some
> >> things easier, but don't encompass all available functionality (for
> >> example, fixed size lists, nested lists, some dictionary operations, etc).
> >> However, you should be able to accomplish everything using
> >> mutators/accessors.
> >>
> >>>
> >>> 3. What are the "safe" variant functions in the Mutator's code? I could
> >>> not understand what they are meant to achieve.
> >>>
> >> The safe methods ensure that the vector is large enough to set the value.
> >> You can use the unsafe versions if you know that your vector has already
> >> allocated enough space for your data.
> >>
> >>> 4. What are MinorTypes?
> >>>
> >> Minor types are a representation of the different vector types. I believe
> >> they are being de-emphasized in favor of FieldTypes, as minor types don't
> >> contain enough information to represent all vectors.
> >>
> >>>
> >>> 5. For a writer, what is a dictionary provider? For example, in the
> >>> Integration.java code, the reader is given as the dictionary provider
> >>> for the writer. But is it something more than just:
> >>> DictionaryProvider.MapDictionaryProvider provider =
> >>>     new DictionaryProvider.MapDictionaryProvider();
> >>> ArrowFileWriter arrowWriter =
> >>>     new ArrowFileWriter(root, provider, fileOutputStream.getChannel());
> >>>
> >> The dictionary provider is an interface for looking up dictionary values.
> >> When reading a file, the reader itself has already read the dictionaries
> >> and thus serves as the provider.
> >>
> >>> 6. I am not entirely sure about the sequence of calls that one needs to
> >>> make to write through mutators. For example, if I code something like
> >>> NullableIntVector intVector = (NullableIntVector) fieldVector;
> >>> NullableIntVector.Mutator mutator = intVector.getMutator();
> >>> [.write num values]
> >>> mutator.setValueCount(num);
> >>> then this works for primitive types, but not for the VarBinary type.
> >>> There I have to set the capacity first:
> >>>
> >>> NullableVarBinaryVector varBinaryVector =
> >>>     (NullableVarBinaryVector) fieldVector;
> >>> varBinaryVector.setInitialCapacity(items);
> >>> varBinaryVector.allocateNew();
> >>> NullableVarBinaryVector.Mutator mutator = varBinaryVector.getMutator();
> >>>
> >> The method calls are not very well documented - I would suggest looking
> >> at the reader/writer implementations to see what calls are required for
> >> which vector types. Generally, variable length vectors (lists, var binary,
> >> etc) behave differently than fixed width vectors (ints, longs, etc).
> >>
> >>> Examples of these are here:
> >>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowWrite.java
> >>> (writeField[???] functions).
> >>>
> >>> Thank you very much,
> >>> --
> >>> Animesh
> >>>
> >>>
> >>>
> >>> On Thu, Dec 14, 2017 at 6:15 PM, Wes McKinney <wesmck...@gmail.com>
> >>> wrote:
> >>>
> >>>> hi Animesh,
> >>>>
> >>>> I suggest you try the ArrowStreamReader/Writer or
> >>>> ArrowFileReader/Writer classes. See
> >>>> https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java
> >>>> for example working code for this
> >>>>
> >>>> - Wes
> >>>>
> >>>> On Thu, Dec 14, 2017 at 8:30 AM, Animesh Trivedi
> >>>> <animesh.triv...@gmail.com> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> It might be a trivial question, so please let me know if I am missing
> >>>>> something.
> >>>>>
> >>>>> I am trying to write and read files in the Arrow format in Java. My
> >>>>> data is a simple flat schema with primitive types. I already have the
> >>>>> data in Java. So my questions are:
> >>>>> 1. Is this possible, or am I fundamentally missing something about
> >>>>> what Arrow can or cannot do (or is designed to do)? I assume that an
> >>>>> efficient in-memory columnar data format should work with files too.
> >>>>> 2. Can you point me to a working example, or a starting example?
> >>>>> Intuitively, I am looking for a way to define a schema and write/read
> >>>>> column vectors to/from files, as one does with Parquet or ORC.
> >>>>>
> >>>>> I tried to locate some working examples with the
> >>>>> ArrowFile[Reader/Writer] classes in the maven tests, but so far I am
> >>>>> not sure where to start.
> >>>>>
> >>>>> Thanks,
> >>>>> --
> >>>>> Animesh
> >>>>>
> >>>>
> >>
> >
>
