[jira] [Closed] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer
     [ https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martin Wiesner closed OPENNLP-1556.
-----------------------------------
    Resolution: Fixed

> Improve speed of checksum computation in TwoPassDataIndexer
> -----------------------------------------------------------
>
>                 Key: OPENNLP-1556
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1556
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Machine Learning
>    Affects Versions: 1.9.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3
>            Reporter: Martin Wiesner
>            Assignee: Martin Wiesner
>            Priority: Major
>             Fix For: 2.3.4
>
>
> For training ML models, all observations (Events) are indexed via
> {{TwoPassDataIndexer#index(ObjectStream<Event> eventStream)}}.
> When #index(..) is run, a tmp file is written and read in again. For the
> purpose of checksum validation, instances of HashSumEventStream are used to
> validate the content processed.
> Based on a rather slow toString() implementation in Event, a cryptographic
> (MD5) message digest is computed. This, however, is much slower than simply
> computing a checksum (such as a CRC32c value) for both directions
> (write/read). The slowdown is more problematic when larger training
> corpora are (pre-)processed, that is, indexed in advance.
> Aims:
> - Speed up the (IO-bound) indexing part prior to the actual CPU-bound training
> phase.
> - Switch from MD5 to CRC32c, as there is *no* need for a cryptographic hash
> function here; all that is required is a checksum to decide whether all
> bytes written are the same bytes that are read.
> - Remove the untested class HashSumEventStream, which is just a wrapper for
> calling a slow toString() in Event to get some bytes to use for the
> computation of a checksum / message digest.
> - Provide a replacement for HashSumEventStream, e.g. ChecksumEventStream, that
> makes use of the faster CRC32c checksum computation, avoiding cryptographic
> hash functions such as MD5.
> - Make sure all existing tests still pass.
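
To illustrate the write/read validation described above, here is a minimal sketch, not the code from the actual fix. It assumes Java 9 or later (for `java.util.zip.CRC32C`) and uses a hypothetical class `ChecksumRoundTrip` with hypothetical helper methods; the records are plain strings standing in for serialized Events. The checksum is updated as a side effect of the byte stream itself, so no `toString()` call and no message digest is involved.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.CRC32C;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

// Hypothetical demo class; it is not part of OpenNLP.
final class ChecksumRoundTrip {

  /** Writes the records to the file and returns the CRC32C of all bytes written. */
  static long writeWithChecksum(Path file, List<String> records) throws IOException {
    CRC32C crc = new CRC32C();
    // CheckedOutputStream updates the checksum for every byte passing through it.
    try (DataOutputStream out = new DataOutputStream(new CheckedOutputStream(
        new BufferedOutputStream(Files.newOutputStream(file)), crc))) {
      for (String record : records) {
        out.writeUTF(record);
      }
    }
    return crc.getValue();
  }

  /** Reads the whole file and returns the CRC32C of all bytes read. */
  static long readWithChecksum(Path file) throws IOException {
    CRC32C crc = new CRC32C();
    try (InputStream in = new CheckedInputStream(
        new BufferedInputStream(Files.newInputStream(file)), crc)) {
      byte[] buffer = new byte[8192];
      while (in.read(buffer) != -1) {
        // Nothing to do: the checksum is updated as a side effect of reading.
      }
    }
    return crc.getValue();
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("events", ".tmp");
    try {
      long written = writeWithChecksum(tmp, List.of("outcome=A f1 f2", "outcome=B f3"));
      long read = readWithChecksum(tmp);
      if (written != read) {
        throw new IOException("Checksum mismatch between written and read data");
      }
      System.out.println("Checksums match: " + Long.toHexString(read));
    } finally {
      Files.deleteIfExists(tmp);
    }
  }
}
```

A CRC is enough for this purpose because the goal is only to detect accidental differences between the bytes written and the bytes read back, not to resist deliberate tampering.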
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer
     [ https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844159#comment-17844159 ]

ASF GitHub Bot commented on OPENNLP-1556:
-----------------------------------------

mawiesne merged PR #600:
URL: https://github.com/apache/opennlp/pull/600