To be fair, I'm happy to apply the compression at the IPC level; I just
didn't realise that was a thing. If I understand Antoine's suggestion
correctly, though, then leaving the Python as-is and changing my Java to
var is = new FileInputStream(path.toFile());
var reader = new ArrowStreamReader(is, allocator);
var schema = reader.getVectorSchemaRoot().getSchema();
(i.e. dropping the LZ4 input stream and letting the reader figure it out)
should work? I see no option to specify the compression on the reader, so
presumably it is meant to detect it. This, however, gives:
java.io.IOException: Unexpected end of stream trying to read message.
    at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:700)
    at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:57)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:164)
    at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:170)
    at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:161)
    at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:63)
FWIW - and this makes sense now that I understand there's a difference
between IPC compression and full stream compression - writing it in
Python à la,
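import io
import lz4.frame
import pyarrow as pa

# Serialise the table (defined earlier) to an uncompressed IPC stream in memory.
fh = io.BytesIO()
writer = pa.RecordBatchStreamWriter(fh, table.schema)
writer.write_table(table)
writer.close()
bytes_ = fh.getvalue()
# Compress the whole stream as a single LZ4 frame with independent blocks.
compressed_bytes = lz4.frame.compress(bytes_, compression_level=3,
                                      block_linked=False)
with open(path, 'wb') as fh:
    fh.write(compressed_bytes)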
works fine with the Java from the original email.
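
For anyone reading this later, the difference between whole-stream
compression and per-buffer (IPC-style) compression can be sketched with a
toy example. This is only an illustration under loud assumptions: the
framing below (a 4-byte length, then metadata, then buffers) is made up and
is NOT the Arrow IPC layout, zlib stands in for LZ4 so that only the
standard library is needed, and frame() and read_metadata() are
hypothetical helpers.

```python
import struct
import zlib
from typing import List

# Toy framing (an assumption, not the Arrow IPC format):
# [4-byte little-endian metadata length][metadata][buffer bytes...]
def frame(metadata: bytes, buffers: List[bytes]) -> bytes:
    return struct.pack("<I", len(metadata)) + metadata + b"".join(buffers)

def read_metadata(blob: bytes) -> bytes:
    # A "plain" reader: trusts the first 4 bytes to be the metadata length.
    (n,) = struct.unpack_from("<I", blob, 0)
    if n > len(blob) - 4:
        raise EOFError("unexpected end of stream")
    return blob[4:4 + n]

metadata = b"schema: x int64"
buffers = [bytes(1000), b"abc" * 500]

# Whole-stream compression: metadata and buffers all go through the compressor.
whole = zlib.compress(frame(metadata, buffers))

# Per-buffer compression: only the buffer bodies are compressed,
# the metadata stays readable as-is.
per_buffer = frame(metadata, [zlib.compress(b) for b in buffers])

# The plain reader can parse metadata from the per-buffer layout directly...
meta_from_per_buffer = read_metadata(per_buffer)

# ...and from the whole-stream layout only after unwrapping it first.
meta_after_unwrap = read_metadata(zlib.decompress(whole))

# Handing the whole-stream bytes to the plain reader fails: the first
# 4 bytes are the compressor's header, not a sensible length.
try:
    read_metadata(whole)
    plain_read_of_whole_failed = False
except EOFError:
    plain_read_of_whole_failed = True
```

In the toy, the whole-stream file is only readable after it has been
unwrapped, which mirrors the LZ4FrameInputStream approach from the original
email. Whether the same mechanism is behind the Java exception above is
not something this sketch can confirm.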
-J
On Thu, Jan 28, 2021 at 6:06 PM Micah Kornfield wrote:
> It might be worth opening up an issue with the lz4-java library. This
> seems like the java implementation doesn't fully support the LZ4 stream
> protocol?
>
> Antoine in this case it looks like Joris is applying the compression and
> decompression at the file level NOT the IPC level.
>
> On Thu, Jan 28, 2021 at 10:01 AM Antoine Pitrou wrote:
>
> >
> > On 28/01/2021 at 17:59, Joris Peeters wrote:
> > > From Python, I'm dumping an LZ4-compressed arrow stream to a file, using
> > >
> > > with pa.output_stream(path, compression = 'lz4') as fh:
> > >     writer = pa.RecordBatchStreamWriter(fh, table.schema)
> > >     writer.write_table(table)
> > >     writer.close()
> > >
> > > I then try reading this file from Java, starting with
> > >
> > > var is = new LZ4FrameInputStream(new FileInputStream(path.toFile()));
> > >
> > > using the lz4-java library. That fails, however, with
> >
> > Well, that sounds expected. LZ4 compression in the IPC format does not
> > work by compressing the whole stream. Instead, buffers in the stream
> > are compressed individually, while metadata is uncompressed.
> >
> > So, you needn't wrap the stream with LZ4 yourself. Instead, just let
> > the Java implementation of Arrow handle compression. It *should* work.
> >
> > Regards
> >
> > Antoine.
> >
>