Re: Performance of VectorizedRleValuesReader

2020-09-14 Thread Chang Chen
I see. In our case, we use SingleBufferInputStream, so the time is spent duplicating the backing byte buffer. Thanks, Chang. Ryan Blue wrote on Tue, Sep 15, 2020 at 2:04 AM: > Before, the input was a byte array so we could read from it directly. Now, > the input is a `ByteBufferInputStream` so that Parquet can
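To make the "duplicating the backing byte buffer" cost concrete, here is a minimal sketch of a slice operation on a stream backed by a single buffer. The class `SingleBufferSliceSketch` and its methods are hypothetical stand-ins written for illustration, not Parquet's actual `SingleBufferInputStream`. The assumption is that each slice is served by calling `ByteBuffer.duplicate()`, which copies no bytes but does allocate a new view object on every call.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

// Hypothetical stand-in for a stream over one backing buffer; not Parquet's class.
final class SingleBufferSliceSketch {
    private final ByteBuffer buffer;

    SingleBufferSliceSketch(ByteBuffer buffer) {
        // Keep a private view so the caller's position is left untouched.
        this.buffer = buffer.duplicate();
    }

    // Returns a view over the next `length` bytes and advances the stream.
    // No bytes are copied, but every call allocates a fresh ByteBuffer object.
    ByteBuffer slice(int length) {
        if (buffer.remaining() < length) {
            throw new BufferUnderflowException();
        }
        ByteBuffer view = buffer.duplicate();
        view.limit(view.position() + length);
        buffer.position(buffer.position() + length);
        return view;
    }

    int position() {
        return buffer.position();
    }
}
```

If a reader calls such a slice method once per group of 8 values, the number of view allocations grows with the number of groups, which would be consistent with the hot spot reported in this thread.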

Re: Performance of VectorizedRleValuesReader

2020-09-14 Thread Ryan Blue
Before, the input was a byte array so we could read from it directly. Now, the input is a `ByteBufferInputStream` so that Parquet can choose how to allocate buffers. For example, we use vectored reads from S3 that pull back multiple buffers in parallel. Now that the input is a stream based on

Re: Performance of VectorizedRleValuesReader

2020-09-14 Thread Sean Owen
Ryan, do you happen to have any opinion there? That particular section was introduced in the Parquet 1.10 update: https://github.com/apache/spark/commit/cac9b1dea1bb44fa42abf77829c05bf93f70cf20 It looks like it didn't use to make a ByteBuffer each time, but read from `in`. On Sun, Sep 13, 2020 at

Re: Performance of VectorizedRleValuesReader

2020-09-13 Thread Chang Chen
I think we can copy all encoded data into a ByteBuffer once, and unpack values in the loop: while (valueIndex < this.currentCount) { // values are bit packed 8 at a time, so reading bitWidth will always work this.packer.unpack8Values(buffer, buffer.position() + valueIndex,
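The idea of slicing once and unpacking in a loop can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `PackedGroupSketch`, `readPackedGroups`, and this `unpack8Values` are hypothetical helpers written for this note, not Spark's or Parquet's code. It assumes the LSB-first bit packing used by Parquet's RLE/bit-packed hybrid encoding, and it advances the read offset by `bitWidth` bytes per group, since 8 values of `bitWidth` bits occupy exactly `bitWidth` bytes.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch: decode bit-packed groups from one buffer using
// absolute reads, so no ByteBuffer view is created per group.
final class PackedGroupSketch {

    // Unpacks 8 values of `bitWidth` bits (0..32), LSB first, starting at byte `pos`.
    static void unpack8Values(ByteBuffer in, int pos, int[] out, int outPos, int bitWidth) {
        long mask = (1L << bitWidth) - 1;
        for (int k = 0; k < 8; k++) {
            int bitOffset = k * bitWidth;
            int byteIndex = pos + (bitOffset >>> 3);
            int shift = bitOffset & 7;
            int nBytes = (shift + bitWidth + 7) >>> 3; // bytes spanned by this value (at most 5)
            long word = 0;
            for (int b = 0; b < nBytes; b++) {
                word |= (in.get(byteIndex + b) & 0xFFL) << (8 * b);
            }
            out[outPos + k] = (int) ((word >>> shift) & mask);
        }
    }

    // Decodes `numGroups` groups of 8 values and advances the buffer position once at the end.
    static int[] readPackedGroups(ByteBuffer encoded, int numGroups, int bitWidth) {
        int count = numGroups * 8;
        int[] out = new int[count];
        int pos = encoded.position();
        for (int valueIndex = 0; valueIndex < count; valueIndex += 8) {
            unpack8Values(encoded, pos, out, valueIndex, bitWidth);
            pos += bitWidth; // 8 values * bitWidth bits = bitWidth bytes
        }
        encoded.position(pos);
        return out;
    }

    public static void main(String[] args) {
        // The values 0..7 packed with bitWidth 3 (example from the Parquet format spec).
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {(byte) 0x88, (byte) 0xC6, (byte) 0xFA});
        System.out.println(Arrays.toString(readPackedGroups(buf, 1, 3)));
        // prints [0, 1, 2, 3, 4, 5, 6, 7]
    }
}
```

Whether this is actually faster than slicing per group would need to be measured; with a stream backed by several buffers, the single up-front slice may itself require a copy.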

Re: Performance of VectorizedRleValuesReader

2020-09-13 Thread Sean Owen
It certainly can't be called once - it's reading different data each time. There might be a faster way to do it, I don't know. Do you have ideas? On Sun, Sep 13, 2020 at 9:25 PM Chang Chen wrote: > > Hi experts > > it looks like there is a hot spot in VectorizedRleValuesReader#readNextGroup() > >

Performance of VectorizedRleValuesReader

2020-09-13 Thread Chang Chen
Hi experts, it looks like there is a hot spot in VectorizedRleValuesReader#readNextGroup(): case PACKED: int numGroups = header >>> 1; this.currentCount = numGroups * 8; if (this.currentBuffer.length < this.currentCount) { this.currentBuffer = new int[this.currentCount]; }