ConeyLiu commented on issue #25470: [SPARK-28751][Core][WIP] Improve java serializer deserialization performance
URL: https://github.com/apache/spark/pull/25470#issuecomment-524615974

Hi @cloud-fan, it's difficult to reuse the `ObjectInputStream`. `ObjectOutputStream` writes a class descriptor as follows:

```java
/**
 * Writes representation of given class descriptor to stream.
 */
private void writeClassDesc(ObjectStreamClass desc, boolean unshared)
    throws IOException
{
    int handle;
    if (desc == null) {
        writeNull();
    } else if (!unshared && (handle = handles.lookup(desc)) != -1) {
        writeHandle(handle);
    } else if (desc.isProxy()) {
        writeProxyDesc(desc, unshared);
    } else {
        writeNonProxyDesc(desc, unshared);
    }
}
```

The first time the stream meets a class, it writes the full class descriptor, including the class name. After that it writes only `TC_REFERENCE` and the handle id. `ObjectInputStream` mirrors this:

```java
/**
 * Reads in and returns (possibly null) class descriptor. Sets passHandle
 * to class descriptor's assigned handle. If class descriptor cannot be
 * resolved to a class in the local VM, a ClassNotFoundException is
 * associated with the class descriptor's handle.
 */
private ObjectStreamClass readClassDesc(boolean unshared)
    throws IOException
{
    byte tc = bin.peekByte();
    ObjectStreamClass descriptor;
    switch (tc) {
        case TC_NULL:
            descriptor = (ObjectStreamClass) readNull();
            break;
        case TC_REFERENCE:
            descriptor = (ObjectStreamClass) readHandle(unshared);
            break;
        case TC_PROXYCLASSDESC:
            descriptor = readProxyDesc(unshared);
            break;
        case TC_CLASSDESC:
            descriptor = readNonProxyDesc(unshared);
            break;
        default:
            throw new StreamCorruptedException(
                String.format("invalid type code: %02X", tc));
    }
    if (descriptor != null) {
        validateDescriptor(descriptor);
    }
    return descriptor;
}
```

If the class has already been encountered, the descriptor is read from the handle (`descriptor = (ObjectStreamClass) readHandle(unshared);`), so the class does not have to be resolved from its name again.
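The effect of the handle table can be observed from outside the JDK classes. The following is a minimal sketch, not code from this PR: `HandleDemo` and `Point` are hypothetical names invented for the illustration. It compares the number of bytes produced when several objects of one class share a single `ObjectOutputStream` with the number produced when each object gets a fresh stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class HandleDemo {
    /** Hypothetical payload type, used only for this illustration. */
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        Point(int x) { this.x = x; }
    }

    /** Bytes written when `count` Point objects share one ObjectOutputStream. */
    static int sharedStreamSize(int count) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int i = 0; i < count; i++) {
                // Only the first write carries the full class descriptor;
                // later writes refer to it through a handle.
                out.writeObject(new Point(i));
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.size();
    }

    /** Bytes written when every Point gets its own ObjectOutputStream. */
    static int separateStreamsSize(int count) {
        int total = 0;
        for (int i = 0; i < count; i++) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                // A fresh stream has an empty handle table, so the full
                // class descriptor is written every time.
                out.writeObject(new Point(i));
            } catch (IOException e) {
                throw new IllegalStateException(e);
            }
            total += bytes.size();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("shared stream:    " + sharedStreamSize(10) + " bytes");
        System.out.println("separate streams: " + separateStreamsSize(10) + " bytes");
    }
}
```

Within a single stream pair, a repeated class therefore costs only a handle reference, both in bytes written and in resolution work on the reading side.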
However, the mapping between handle ids and classes has to stay identical on the `ObjectOutputStream` side and the `ObjectInputStream` side. If we reuse the `ObjectInputStream`, it also reuses its previous handle cache, which no longer matches what the writer produced, so the mapping breaks. If we call `ObjectInputStream.reset` before reusing the `ObjectInputStream`, we still need to resolve each class from its class name. That is why reusing the `ObjectInputStream` is difficult. In the current approach we keep a cache of resolved classes, similar to what `ObjectInputStream` does internally. The difference is that this cache can be shared across multiple `ObjectInputStream` instances.
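As a rough illustration of the idea of a resolved class cache shared across streams, here is a minimal sketch. It is not the code in this PR: `ClassCacheDemo`, `CachingObjectInputStream`, `Message`, `RESOLVED` and `MISSES` are hypothetical names, and keying the cache by class name alone assumes that every stream resolves classes through the same class loader. The only JDK hook it relies on is the protected method `ObjectInputStream.resolveClass(ObjectStreamClass)`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class ClassCacheDemo {
    /** Hypothetical payload type, used only for this illustration. */
    static class Message implements Serializable {
        private static final long serialVersionUID = 1L;
        final String text;
        Message(String text) { this.text = text; }
    }

    /** ObjectInputStream whose resolved classes are cached across instances. */
    static class CachingObjectInputStream extends ObjectInputStream {
        // Shared by every instance. Keyed by class name only, which assumes
        // that all streams resolve classes through the same class loader.
        static final ConcurrentHashMap<String, Class<?>> RESOLVED = new ConcurrentHashMap<>();
        // Counts lookups that fell through to the default resolution.
        static final AtomicInteger MISSES = new AtomicInteger();

        CachingObjectInputStream(InputStream in) throws IOException {
            super(in);
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc)
                throws IOException, ClassNotFoundException {
            Class<?> cached = RESOLVED.get(desc.getName());
            if (cached != null) {
                return cached;
            }
            MISSES.incrementAndGet();
            Class<?> resolved = super.resolveClass(desc);
            RESOLVED.put(desc.getName(), resolved);
            return resolved;
        }
    }

    /** Serializes one object with a fresh ObjectOutputStream. */
    static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    /** Deserializes one object with a fresh CachingObjectInputStream. */
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in =
                 new CachingObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(new Message("hello"));
        deserialize(data);  // first stream: falls through to super.resolveClass
        deserialize(data);  // second stream: served from the shared cache
        System.out.println("default resolutions: " + CachingObjectInputStream.MISSES.get());
    }
}
```

Each stream still reads the full class descriptor from its own bytes, so the handle tables of writer and reader stay in step. Only the lookup from class name to `Class` is skipped on later streams.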