Re: lz4 compressed arrow between Python & Java

Joris Peeters Thu, 28 Jan 2021 10:19:21 -0800

To be fair, I'm happy to apply the compression at the IPC level; I just
didn't realise that was a thing. If I understand Antoine's suggestion
correctly, though, then leaving the Python as-is and changing my Java to


    var is = new FileInputStream(path.toFile());
    var reader = new ArrowStreamReader(is, allocator);
    var schema = reader.getVectorSchemaRoot().getSchema();

(i.e. just getting rid of the LZ4 input stream) should work, with the
reader figuring it out? I see no option to specify the compression in the
reader, so presumably it detects it? This, however, gives:

    java.io.IOException: Unexpected end of stream trying to read message.
        at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:700)
        at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:57)
        at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:164)
        at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:170)
        at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:161)
        at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:63)
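
One possible reading of this trace, offered as a guess from the wire
formats rather than anything confirmed in this thread: an LZ4 frame starts
with the magic bytes 04 22 4D 18, whereas a current Arrow IPC stream starts
with the continuation marker FF FF FF FF. A reader handed the LZ4-framed
file would then likely misread the leading bytes as a message length and
run off the end of the file. A minimal sketch for checking which kind of
file is on disk (the helper names `sniff_container` and `sniff_path` are
made up for illustration):

```python
# Leading bytes of the containers discussed in this thread.
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"      # 0x184D2204 stored little-endian
ARROW_CONTINUATION = b"\xff\xff\xff\xff"   # start of an encapsulated IPC message
ARROW_FILE_MAGIC = b"ARROW1"               # Arrow file (random access) format


def sniff_container(head: bytes) -> str:
    """Guess the container type from the first bytes of a file.

    Hypothetical helper: it only inspects magic bytes and does not
    validate the rest of the data.
    """
    if head.startswith(LZ4_FRAME_MAGIC):
        return "lz4-frame"
    if head.startswith(ARROW_FILE_MAGIC):
        return "arrow-file"
    if head.startswith(ARROW_CONTINUATION):
        return "arrow-stream"
    return "unknown"


def sniff_path(path) -> str:
    """Read the first few bytes of `path` and classify them."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(8))
```

If `sniff_path(path)` reports "lz4-frame", the whole file was compressed
and has to be unwrapped before the Arrow reader sees it; "arrow-stream"
means any compression present is inside the IPC messages.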

FWIW - and this makes sense now that I understand there's a difference
between IPC compression and full stream compression - writing it in
Python à la,

    fh = io.BytesIO()
    writer = pa.RecordBatchStreamWriter(fh, table.schema)
    writer.write_table(table)
    writer.close()
    bytes_ = fh.getvalue()
    compressed_bytes = lz4.frame.compress(bytes_, compression_level=3,
                                          block_linked=False)
    with open(path, 'wb') as fh:
        fh.write(compressed_bytes)

works fine with the Java from the original email.

-J


On Thu, Jan 28, 2021 at 6:06 PM Micah Kornfield <[email protected]>
wrote:

> It might be worth opening an issue with the lz4-java library.  It
> seems like the Java implementation doesn't fully support the LZ4 stream
> protocol?
>
> Antoine in this case it looks like Joris is applying the compression and
> decompression at the file level NOT the IPC level.
>
> On Thu, Jan 28, 2021 at 10:01 AM Antoine Pitrou <[email protected]>
> wrote:
>
> >
> > Le 28/01/2021 à 17:59, Joris Peeters a écrit :
> > > From Python, I'm dumping an LZ4-compressed arrow stream to a file, using
> > >
> > >     with pa.output_stream(path, compression = 'lz4') as fh:
> > >         writer = pa.RecordBatchStreamWriter(fh, table.schema)
> > >         writer.write_table(table)
> > >         writer.close()
> > >
> > > I then try reading this file from Java, starting with
> > >
> > >     var is = new LZ4FrameInputStream(new FileInputStream(path.toFile()));
> > >
> > > using the lz4-java library. That fails, however, with
> >
> > Well, that sounds expected.  LZ4 compression in the IPC format does not
> > work by compressing the whole stream.  Instead, buffers in the stream
> > are compressed individually, while metadata is uncompressed.
> >
> > So, you needn't wrap the stream with LZ4 yourself.  Instead, just let
> > the Java implementation of Arrow handle compression.  It *should* work.
> >
> > Regards
> >
> > Antoine.
> >
>
