Ryan, do you happen to have any opinion there? That particular section was introduced in the Parquet 1.10 update: https://github.com/apache/spark/commit/cac9b1dea1bb44fa42abf77829c05bf93f70cf20 It looks like it didn't previously create a ByteBuffer each time, but read from in directly.
On Sun, Sep 13, 2020 at 10:48 PM Chang Chen <baibaic...@gmail.com> wrote:
>
> I think we can copy all the encoded data into a ByteBuffer once, and unpack
> values in the loop:
>
>   while (valueIndex < this.currentCount) {
>     // values are bit packed 8 at a time, so reading bitWidth will always work
>     this.packer.unpack8Values(buffer, buffer.position() + valueIndex,
>       this.currentBuffer, valueIndex);
>     valueIndex += 8;
>   }
>
> On Mon, Sep 14, 2020 at 10:40 AM Sean Owen <sro...@gmail.com> wrote:
>>
>> It certainly can't be called once - it's reading different data each time.
>> There might be a faster way to do it, I don't know. Do you have ideas?
>>
>> On Sun, Sep 13, 2020 at 9:25 PM Chang Chen <baibaic...@gmail.com> wrote:
>> >
>> > Hi experts,
>> >
>> > It looks like there is a hot spot in
>> > VectorizedRleValuesReader#readNextGroup():
>> >
>> >   case PACKED:
>> >     int numGroups = header >>> 1;
>> >     this.currentCount = numGroups * 8;
>> >
>> >     if (this.currentBuffer.length < this.currentCount) {
>> >       this.currentBuffer = new int[this.currentCount];
>> >     }
>> >     currentBufferIdx = 0;
>> >     int valueIndex = 0;
>> >     while (valueIndex < this.currentCount) {
>> >       // values are bit packed 8 at a time, so reading bitWidth will always work
>> >       ByteBuffer buffer = in.slice(bitWidth);
>> >       this.packer.unpack8Values(buffer, buffer.position(),
>> >         this.currentBuffer, valueIndex);
>> >       valueIndex += 8;
>> >     }
>> >
>> > Per my profile, about 30% of readNextGroup()'s time is spent in slice().
>> > Why can't we call slice() outside the loop?
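For what it's worth, here is a minimal, self-contained sketch of the idea being proposed: slice (or just index into) the input once, and advance a byte offset by bitWidth per group of 8 values, instead of calling slice() on every iteration. This is not Spark's or Parquet's actual code; unpack8Values below is a hypothetical stand-in for the real BytePacker, and bitWidth is fixed at 8 so each value is exactly one byte. Note that in the single-buffer variant the byte offset must advance by bitWidth per group of 8 values (i.e. valueIndex / 8 * bitWidth bytes), not by valueIndex itself.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class UnpackOnceSketch {
    // Hypothetical stand-in for Parquet's packer: with bitWidth == 8,
    // each of the 8 values occupies exactly one byte.
    static void unpack8Values(ByteBuffer in, int inPos, int[] out, int outPos) {
        for (int i = 0; i < 8; i++) {
            out[outPos + i] = in.get(inPos + i) & 0xFF;
        }
    }

    public static void main(String[] args) {
        final int bitWidth = 8;       // assumption: byte-aligned values for simplicity
        final int currentCount = 32;  // 4 groups of 8 values
        byte[] encoded = new byte[currentCount * bitWidth / 8];
        for (int i = 0; i < encoded.length; i++) encoded[i] = (byte) i;
        ByteBuffer in = ByteBuffer.wrap(encoded);

        // Variant A: a fresh view per group (models in.slice(bitWidth) in the loop,
        // the allocation the profile attributes 30% of the time to).
        int[] a = new int[currentCount];
        for (int valueIndex = 0; valueIndex < currentCount; valueIndex += 8) {
            ByteBuffer group = in.duplicate();            // new buffer object each pass
            group.position(valueIndex / 8 * bitWidth);
            unpack8Values(group, group.position(), a, valueIndex);
        }

        // Variant B: one buffer, advance a byte offset instead of slicing.
        int[] b = new int[currentCount];
        int byteOffset = in.position();
        for (int valueIndex = 0; valueIndex < currentCount; valueIndex += 8) {
            unpack8Values(in, byteOffset, b, valueIndex);
            byteOffset += bitWidth;   // 8 values consume bitWidth bytes
        }

        if (!Arrays.equals(a, b)) throw new AssertionError("variants differ");
        System.out.println(Arrays.toString(Arrays.copyOf(b, 8)));
    }
}
```

Whether this is safe in the real reader depends on whether `in` is guaranteed to hold the whole packed run contiguously; the per-iteration slice() in the current code also serves as a bounds-checked read from the underlying stream.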