Martin Wiesner created OPENNLP-1556:
---------------------------------------
Summary: Improve speed of checksum computation in TwoPassDataIndexer
Key: OPENNLP-1556
URL: https://issues.apache.org/jira/browse/OPENNLP-1556
Project: OpenNLP
Issue Type: Improvement
Components: Machine Learning
Affects Versions: 2.3.3, 2.3.2, 2.3.1, 2.3.0, 2.2.0, 2.1.0, 2.0.0, 1.9.0
Reporter: Martin Wiesner
Assignee: Martin Wiesner
Fix For: 2.3.4

For training ML models, all observations (Events) are indexed via {{TwoPassDataIndexer#index(ObjectStream<Event> eventStream)}}.

When #index(..) is run, a temporary file is written and then read back in. To validate the content processed, instances of HashSumEventStream wrap the event stream in both directions. For each Event, a cryptographic (MD5) message digest is computed over the bytes of the rather slow toString() implementation in Event.

This, however, is much slower than simply computing a checksum (such as a CRC32c value) for both directions (write/read). The slowdown becomes more problematic when larger training corpora are (pre-)processed, that is, indexed in advance.

Aims:
- Speed up the (IO-bound) indexing part prior to the actual CPU-bound training phase.
- Switch from MD5 to CRC32c, as there is no need for a cryptographic hash function here; all that is required is a checksum to decide whether the bytes written are the same bytes that are read.
- Remove the untested class HashSumEventStream, which is just a wrapper that calls the slow toString() in Event to obtain bytes for computing a checksum / message digest.
- Provide a replacement for HashSumEventStream, e.g. ChecksumEventStream, that makes use of the faster CRC32c checksum computation, avoiding cryptographic hash functions such as MD5.
- Make sure all existing tests hold.
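The idea behind the proposed replacement can be sketched roughly as follows. This is only an illustration and not the OpenNLP implementation: the class name ChecksumSketch, the helper drain(..), and the use of plain strings in place of Event objects are assumptions made for this example. The only library API relied upon is java.util.zip.CRC32C, which is available since Java 9.

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

/**
 * Hypothetical stand-in for a checksum-computing stream wrapper.
 * Strings take the place of Event objects; this is not the OpenNLP class.
 */
public class ChecksumSketch implements Iterator<String> {

  private final Iterator<String> delegate;
  // Non-cryptographic checksum; sufficient for "written bytes == read bytes".
  private final Checksum checksum = new CRC32C();

  public ChecksumSketch(Iterator<String> delegate) {
    this.delegate = delegate;
  }

  @Override
  public boolean hasNext() {
    return delegate.hasNext();
  }

  @Override
  public String next() {
    String item = delegate.next();
    // Feed the bytes of every item passing through into the running checksum.
    byte[] bytes = item.getBytes(StandardCharsets.UTF_8);
    checksum.update(bytes, 0, bytes.length);
    return item;
  }

  /** Returns the checksum over all items consumed so far. */
  public long getChecksum() {
    return checksum.getValue();
  }

  /** Consumes all items and returns the resulting checksum (helper for this sketch). */
  static long drain(List<String> items) {
    ChecksumSketch stream = new ChecksumSketch(items.iterator());
    while (stream.hasNext()) {
      stream.next();
    }
    return stream.getChecksum();
  }

  public static void main(String[] args) {
    // Assumed scenario: the same items are seen once on write and once on read.
    List<String> written = List.of("outcome=A f1 f2", "outcome=B f3");
    List<String> readBack = List.of("outcome=A f1 f2", "outcome=B f3");
    System.out.println("checksums match: " + (drain(written) == drain(readBack)));
  }
}
```

Wrapping the stream once for the write pass and once for the read pass, then comparing both getChecksum() values, gives the same "bytes written equal bytes read" guarantee as the MD5 variant for accidental corruption, without the cost of a cryptographic digest.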