[ https://issues.apache.org/jira/browse/BEAM-5439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662934#comment-16662934 ]
Luke Cwik commented on BEAM-5439:
---------------------------------

Great

> StringUtf8Coder is slower than expected
> ---------------------------------------
>
> Key: BEAM-5439
> URL: https://issues.apache.org/jira/browse/BEAM-5439
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Affects Versions: 2.6.0
> Reporter: Julien Tournay
> Assignee: Julien Tournay
> Priority: Major
> Labels: perfomance
> Time Spent: 10m
> Remaining Estimate: 0h
>
> While working on Scio's next version, I noticed that {{StringUtf8Coder}} is
> slower than expected.
> I wrote a small micro-benchmark using {{jmh}} that serialises a (Scala) List
> of 1000 Strings using a custom {{Coder[List[_]]}}. While profiling it, I
> noticed that a lot of time is spent in
> {{java.io.DataInputStream.<init>(java.io.InputStream)}}.
> Looking into the code for {{StringUtf8Coder}}, the {{readString}} method
> reads bytes directly. A {{DataInputStream}} therefore does not seem to be
> necessary.
> I replaced {{StringUtf8Coder}} with a {{Coder[String]}} implementation (in
> Scala) that is essentially the same as {{StringUtf8Coder}} but does not use
> {{DataInputStream}}.
>
> {code:scala}
> import java.io.{InputStream, OutputStream}
> import org.apache.beam.sdk.coders.{AtomicCoder, CoderException}
>
> private final object ScioStringCoder extends AtomicCoder[String] {
>   import org.apache.beam.sdk.util.VarInt
>   import java.nio.charset.StandardCharsets
>   import org.apache.beam.sdk.values.TypeDescriptor
>   import com.google.common.base.Utf8
>   import com.google.common.io.ByteStreams
>
>   def decode(dis: InputStream): String = {
>     val len = VarInt.decodeInt(dis)
>     if (len < 0) {
>       throw new CoderException("Invalid encoded string length: " + len)
>     }
>     val bytes = new Array[Byte](len)
>     // InputStream.read may return fewer bytes than requested; readFully
>     // keeps reading until the array is full or throws EOFException.
>     ByteStreams.readFully(dis, bytes)
>     new String(bytes, StandardCharsets.UTF_8)
>   }
>
>   def encode(value: String, outStream: OutputStream): Unit = {
>     val bytes = value.getBytes(StandardCharsets.UTF_8)
>     // Length prefix (varint) followed by the raw UTF-8 bytes.
>     VarInt.encode(bytes.length, outStream)
>     outStream.write(bytes)
>   }
>
>   override def verifyDeterministic() = ()
>   override def consistentWithEquals() = true
>
>   private val TYPE_DESCRIPTOR = new TypeDescriptor[String] {}
>   override def getEncodedTypeDescriptor() = TYPE_DESCRIPTOR
>
>   override def getEncodedElementByteSize(value: String): Long = {
>     if (value == null) {
>       throw new CoderException("cannot encode a null String")
>     }
>     val size = Utf8.encodedLength(value)
>     VarInt.getLength(size) + size
>   }
> }
> {code}
>
> Using that {{Coder}} is about 27% faster than {{StringUtf8Coder}}. I've added
> the jmh output in "Docs Text".
> Is there any particular reason to use {{DataInputStream}}?
> Do you think we can remove it to make {{StringUtf8Coder}} more efficient?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
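
The approach described in the issue (a varint length prefix followed by raw UTF-8 bytes, read straight from the {{InputStream}} without a {{DataInputStream}} wrapper) can be sketched in plain Java without any Beam dependency. This is only an illustration, not Beam's {{StringUtf8Coder}}: the class name {{LengthPrefixedUtf8Sketch}} and its {{writeVarInt}}/{{readVarInt}} helpers are hypothetical stand-ins for Beam's {{VarInt}}, and the byte layout is assumed to be a standard base-128 varint rather than verified against Beam's wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; names are illustrative, not Beam APIs.
final class LengthPrefixedUtf8Sketch {

  // Writes an int as a base-128 varint, low 7 bits first
  // (assumed stand-in for Beam's VarInt.encode).
  static void writeVarInt(int value, OutputStream out) throws IOException {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }

  // Reads a varint of at most 5 bytes directly from the stream.
  static int readVarInt(InputStream in) throws IOException {
    int result = 0;
    for (int shift = 0; shift < 35; shift += 7) {
      int b = in.read();
      if (b < 0) {
        throw new EOFException("stream ended inside the length prefix");
      }
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return result;
      }
    }
    throw new IOException("varint is longer than 5 bytes");
  }

  static void encode(String value, OutputStream out) throws IOException {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    writeVarInt(bytes.length, out);
    out.write(bytes);
  }

  static String decode(InputStream in) throws IOException {
    int len = readVarInt(in);
    if (len < 0) {
      throw new IOException("Invalid encoded string length: " + len);
    }
    byte[] bytes = new byte[len];
    int off = 0;
    // A single read() may return fewer bytes than requested, so loop
    // until the buffer is full or the stream ends early.
    while (off < len) {
      int n = in.read(bytes, off, len - off);
      if (n < 0) {
        throw new EOFException("expected " + len + " bytes, got " + off);
      }
      off += n;
    }
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    encode("hello", out);
    encode("beam", out);
    InputStream in = new ByteArrayInputStream(out.toByteArray());
    System.out.println(decode(in));
    System.out.println(decode(in));
  }
}
```

The read loop is the one piece that {{DataInputStream.readFully}} would otherwise provide. Keeping it inline avoids allocating a wrapper object per decoded element, which is where the profile in the issue showed the time going.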