Re: [I] [CH] New byte buffer takes most of time in SourceFromJavalter::generate [incubator-gluten]
baibaichen closed issue #4943: [CH] New byte buffer takes most of time in SourceFromJavalter::generate URL: https://github.com/apache/incubator-gluten/issues/4943 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [I] [CH] New byte buffer takes most of time in SourceFromJavalter::generate [incubator-gluten]
zzcclp commented on issue #4943: URL: https://github.com/apache/incubator-gluten/issues/4943#issuecomment-2003077901

> Cause: while the query runs there are 26,200 `new byte[1024*1024]` operations, 78 per task on average, taking 8 s in total, while the whole query only takes 30+ s.
>
> Question: why does it take the copying OnHeapCopyShuffleInputStream path instead of the zero-copy LowCopyNettyShuffleInputStream?
>
> Call chain
>
> ```
> CHColumnarBatchSerializerInstance.deserializeStream
> CHStreamReader.CHStreamReader
> CHShuffleReadStreamFactory.create
> ```
>
> ```java
> public static ShuffleInputStream create(
>     InputStream in, boolean forceCompress, boolean isCustomizedShuffleCodec) {
>   final InputStream unwrapped = unwrapInputStream(in, forceCompress, isCustomizedShuffleCodec);
>   if (unwrapped != null) {
>     return createCompressedShuffleInputStream(in, unwrapped);
>   }
>   return new OnHeapCopyShuffleInputStream(in, false);
> }
>
> private static InputStream unwrapInputStream(
>     InputStream in, boolean forceCompress, boolean isCustomizedShuffleCodec) {
>   if (forceCompress) {
>     return unwrapSparkInputStream(in);
>   } else if (isCustomizedShuffleCodec) {
>     return unwrapSparkWithCompressedInputStream(in);
>   }
>   return null;
> }
> ```
>
> Because my local environment does not set Celeborn as the shuffle manager, the read ends up in OnHeapCopyShuffleInputStream. The current implementation of OnHeapCopyShuffleInputStream is not very efficient, which ultimately causes the problem described in the title.

You may need to check the call chain in your local run. It should go through LowCopyFileSegmentShuffleInputStream, because the data is read directly from a local file.
Re: [I] [CH] New byte buffer takes most of time in SourceFromJavalter::generate [incubator-gluten]
taiyang-li closed issue #4943: [CH] New byte buffer takes most of time in SourceFromJavalter::generate URL: https://github.com/apache/incubator-gluten/issues/4943
Re: [I] [CH] New byte buffer takes most of time in SourceFromJavalter::generate [incubator-gluten]
taiyang-li commented on issue #4943: URL: https://github.com/apache/incubator-gluten/issues/4943#issuecomment-1993717645

After changing the configuration to `--conf spark.shuffle.manager=org.apache.spark.shuffle.gluten.celeborn.CelebornShuffleManager`, the flame graph looks like this:

![image](https://github.com/apache/incubator-gluten/assets/8181003/1b2c0fba-e12e-4eef-8922-4230be948c22)
Re: [I] [CH] New byte buffer takes most of time in SourceFromJavalter::generate [incubator-gluten]
taiyang-li commented on issue #4943: URL: https://github.com/apache/incubator-gluten/issues/4943#issuecomment-1993693855

Cause: while the query runs there are 26,200 `new byte[1024*1024]` operations, 78 per task on average, taking 8 s in total, while the whole query only takes 30+ s.

Question: why does it take the copying OnHeapCopyShuffleInputStream path instead of the zero-copy LowCopyNettyShuffleInputStream?

Call chain

```
CHColumnarBatchSerializerInstance.deserializeStream
CHStreamReader.CHStreamReader
CHShuffleReadStreamFactory.create
```

``` java
public static ShuffleInputStream create(
    InputStream in, boolean forceCompress, boolean isCustomizedShuffleCodec) {
  final InputStream unwrapped = unwrapInputStream(in, forceCompress, isCustomizedShuffleCodec);
  if (unwrapped != null) {
    return createCompressedShuffleInputStream(in, unwrapped);
  }
  return new OnHeapCopyShuffleInputStream(in, false);
}

private static InputStream unwrapInputStream(
    InputStream in, boolean forceCompress, boolean isCustomizedShuffleCodec) {
  if (forceCompress) {
    return unwrapSparkInputStream(in);
  } else if (isCustomizedShuffleCodec) {
    return unwrapSparkWithCompressedInputStream(in);
  }
  return null;
}
```

Because my local environment does not set Celeborn as the shuffle manager, the read ends up in OnHeapCopyShuffleInputStream. The current implementation of OnHeapCopyShuffleInputStream is not very efficient, which ultimately causes the problem described in the title.
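To make the dispatch above easier to follow, here is a minimal model of it. It is only a sketch: `ShuffleStreamDispatchSketch`, `StreamKind`, and `choose` are hypothetical names, not Gluten's API, and it assumes the unwrap helpers return a non-null stream whenever they are called, which the quoted code does not guarantee.

```java
// Hypothetical, simplified model of CHShuffleReadStreamFactory.create as quoted above.
// Assumption: an attempted unwrap always succeeds (returns non-null).
final class ShuffleStreamDispatchSketch {
  enum StreamKind { COMPRESSED, ON_HEAP_COPY }

  // Mirrors unwrapInputStream: unwrapping is attempted only when one of the flags is set.
  static boolean triesToUnwrap(boolean forceCompress, boolean isCustomizedShuffleCodec) {
    return forceCompress || isCustomizedShuffleCodec;
  }

  // Mirrors create: with nothing unwrapped, the on-heap copy stream is the fallback.
  static StreamKind choose(boolean forceCompress, boolean isCustomizedShuffleCodec) {
    return triesToUnwrap(forceCompress, isCustomizedShuffleCodec)
        ? StreamKind.COMPRESSED
        : StreamKind.ON_HEAP_COPY;
  }

  public static void main(String[] args) {
    boolean[] flags = {false, true};
    for (boolean forceCompress : flags) {
      for (boolean customCodec : flags) {
        System.out.println(
            "forceCompress=" + forceCompress
                + ", isCustomizedShuffleCodec=" + customCodec
                + " -> " + choose(forceCompress, customCodec));
      }
    }
  }
}
```

In this model, only the combination with both flags false reaches the on-heap copy path, which matches the situation described above for a local run without Celeborn.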
Re: [I] [CH] New byte buffer takes most of time in SourceFromJavalter::generate [incubator-gluten]
zhanglistar commented on issue #4943: URL: https://github.com/apache/incubator-gluten/issues/4943#issuecomment-1993376883

`optoruntime::new_array_c` may be slow because the `memory.m_capacity` passed in is too large. In addition, the JDK memsets the newly allocated memory, which makes this function take too much time.
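The cost described here can be observed with a rough sketch like the one below. It compares allocating a fresh 1 MiB array on every call with reusing a single array. The class and method names are hypothetical, the call count of 26,200 is taken from the figures reported elsewhere in this thread, and this is not a rigorous benchmark: a harness such as JMH would be needed for trustworthy numbers. At 1 MiB per allocation, 26,200 allocations amount to roughly 25.6 GiB of memory that the JVM has to hand out zero-filled.

```java
// Rough, hypothetical comparison of per-call allocation versus buffer reuse.
// Not a rigorous benchmark: no warm-up control, timings vary by JVM and machine.
final class AllocationCostSketch {
  static final int ONE_MIB = 1024 * 1024;

  // Allocates a fresh array on every call; each new array must be zero-filled by the JVM.
  static long allocateEachTime(int calls, int size) {
    long sink = 0;
    for (int i = 0; i < calls; i++) {
      byte[] buf = new byte[size];
      buf[i % size] = 1;                  // touch the array so it is actually used
      sink += buf[i % size] + buf.length;
    }
    return sink;
  }

  // Allocates once and reuses the same array for every call.
  static long reuseOneBuffer(int calls, int size) {
    byte[] buf = new byte[size];
    long sink = 0;
    for (int i = 0; i < calls; i++) {
      buf[i % size] = 1;
      sink += buf[i % size] + buf.length;
    }
    return sink;
  }

  public static void main(String[] args) {
    int calls = 26_200; // allocation count reported in this thread

    long t0 = System.nanoTime();
    long a = allocateEachTime(calls, ONE_MIB);
    long t1 = System.nanoTime();
    long b = reuseOneBuffer(calls, ONE_MIB);
    long t2 = System.nanoTime();

    System.out.println("allocate each time: " + (t1 - t0) / 1_000_000 + " ms (sink " + a + ")");
    System.out.println("reuse one buffer:   " + (t2 - t1) / 1_000_000 + " ms (sink " + b + ")");
  }
}
```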
[I] [CH] New byte buffer takes most of time in SourceFromJavalter::generate [incubator-gluten]
taiyang-li opened a new issue, #4943: URL: https://github.com/apache/incubator-gluten/issues/4943

### Description

![d722f3fabeb6881fe8b49f58cf0eb6c](https://github.com/apache/incubator-gluten/assets/8181003/8244ef97-fd00-4838-a341-adcb669847ec)

```
bool ReadBufferFromJavaInputStream::nextImpl()
{
    int count = readFromJava();
    if (count > 0)
        working_buffer.resize(count);
    return count > 0;
}

int ReadBufferFromJavaInputStream::readFromJava() const
{
    GET_JNIENV(env)
    jint count = safeCallIntMethod(
        env, java_in, ShuffleReader::input_stream_read, reinterpret_cast<jlong>(working_buffer.begin()), memory.m_capacity);
    CLEAN_JNIENV
    return count;
}
```

```
@Override
public long read(long destAddress, long maxReadSize) {
  return GlutenException.wrap(
      () -> {
        int maxReadSize32 = Math.toIntExact(maxReadSize);
        if (buffer == null || maxReadSize32 > buffer.length) {
          this.buffer = new byte[maxReadSize32];
        }
        // The code conducts copy as long as 'in' wraps off-heap data,
        // which is about to be moved to heap
        int read = in.read(buffer, 0, maxReadSize32);
        if (read == -1 || read == 0) {
          return 0;
        }
        // The code conducts copy, from heap to off-heap
        // memCopyFromHeap(buffer, destAddress, read);
        PlatformDependent.copyMemory(buffer, 0, destAddress, read);
        bytesRead += read;
        return read;
      });
}
```
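One way to avoid sizing the heap buffer to `maxReadSize` is to copy through a small scratch array that is allocated once. The sketch below illustrates that idea only and is not the fix adopted in Gluten, which this thread does not describe. It rests on several assumptions: `BoundedCopyReaderSketch` is a hypothetical class, a direct `ByteBuffer` stands in for the raw `destAddress`, and unlike the quoted `read` it loops until the destination is full or the stream ends.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical sketch: copy from an InputStream into off-heap memory through a
// small scratch array that is allocated once, regardless of the requested size.
final class BoundedCopyReaderSketch {
  private final InputStream in;
  private final byte[] chunk; // reused for every read call

  BoundedCopyReaderSketch(InputStream in, int chunkSize) {
    this.in = in;
    this.chunk = new byte[chunkSize];
  }

  // Fills dest until it is full or the stream is exhausted; returns the bytes copied.
  long read(ByteBuffer dest) {
    long total = 0;
    try {
      while (dest.hasRemaining()) {
        int want = Math.min(chunk.length, dest.remaining());
        int n = in.read(chunk, 0, want);
        if (n <= 0) {
          break; // end of stream
        }
        dest.put(chunk, 0, n); // heap to off-heap copy
        total += n;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return total;
  }

  public static void main(String[] args) {
    byte[] data = new byte[200_000];
    for (int i = 0; i < data.length; i++) {
      data[i] = (byte) i;
    }
    BoundedCopyReaderSketch reader =
        new BoundedCopyReaderSketch(new ByteArrayInputStream(data), 64 * 1024);
    ByteBuffer dest = ByteBuffer.allocateDirect(1024 * 1024); // stands in for destAddress
    System.out.println("bytes copied: " + reader.read(dest));
  }
}
```

The trade-off is more `in.read` calls per request in exchange for a scratch array whose size no longer depends on the capacity requested by the native side.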