Martin Wiesner created OPENNLP-1556:
---------------------------------------

             Summary: Improve speed of checksum computation in TwoPassDataIndexer
                 Key: OPENNLP-1556
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1556
             Project: OpenNLP
          Issue Type: Improvement
          Components: Machine Learning
    Affects Versions: 2.3.3, 2.3.2, 2.3.1, 2.3.0, 2.2.0, 2.1.0, 2.0.0, 1.9.0
            Reporter: Martin Wiesner
            Assignee: Martin Wiesner
             Fix For: 2.3.4


For training ML models, all observations (Events) are indexed via 
{{TwoPassDataIndexer#index(ObjectStream<Event> eventStream)}}. 

When {{#index(..)}} runs, a temporary file is written and then read back in.
Instances of HashSumEventStream are used on both passes to validate the
processed content via a checksum.

The checksum is a cryptographic (MD5) message digest, computed over the output
of a rather slow toString() implementation in Event. This is much slower than
simply computing a checksum (such as a CRC32c value) for both directions
(write/read). The slowdown becomes more problematic as the training corpora
that are (pre-)processed, that is, indexed in advance, get larger.
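
To illustrate the difference between the two approaches, here is a minimal
sketch that uses only JDK classes (java.security.MessageDigest and
java.util.zip.CRC32C, the latter available since Java 9). The class name
ChecksumVsDigest and the sample input are made up for illustration and are not
part of OpenNLP; no timing claims are implied by the sketch itself.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32C;

// Hypothetical demo class, not part of OpenNLP.
public class ChecksumVsDigest {

    // Cryptographic message digest: MD5 yields a 128-bit (16-byte) value.
    static byte[] md5(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        return md.digest(data);
    }

    // Non-cryptographic checksum: CRC32C yields a 32-bit value (returned as long).
    static long crc32c(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample bytes standing in for one serialized event.
        byte[] written = "NP w=house pos=NN".getBytes(StandardCharsets.UTF_8);
        byte[] readBack = written.clone();

        // Both approaches answer the same question: are the bytes identical?
        System.out.println("MD5 length (bytes): " + md5(written).length);
        System.out.println("CRC32C match: " + (crc32c(written) == crc32c(readBack)));
    }
}
```

For detecting accidental differences between the bytes written and the bytes
read, the 32-bit checksum is sufficient; the collision resistance of a
cryptographic hash is not needed.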

Aims:
- Speed up the (IO-bound) indexing part prior to the actual CPU-bound training
phase.
- Switch from MD5 to CRC32c, as there is no need for a cryptographic hash
function here; it's simply a checksum that is required to decide whether all
bytes written are the same bytes that are read.
- Remove the untested class HashSumEventStream, which is just a wrapper for
calling a slow toString() in Event to get some bytes to use for the computation
of a checksum / message digest.
- Provide a replacement for HashSumEventStream, e.g. ChecksumEventStream that 
makes use of the faster CRC32c checksum computation, avoiding cryptographic 
hash functions such as MD5.
- Make sure all existing tests hold.
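
The idea behind such a replacement can be sketched as follows. This is not the
actual OpenNLP implementation: SimpleEvent, ChecksummingIterator and
checksumOf are hypothetical names, and a plain java.util.Iterator stands in
for ObjectStream<Event> to keep the example self-contained. The sketch assumes
that the checksum is fed directly from the outcome and context strings instead
of going through toString().

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

// Hypothetical sketch, not part of OpenNLP.
public class ChecksumStreamSketch {

    // Hypothetical stand-in for an event: an outcome plus its context features.
    static final class SimpleEvent {
        final String outcome;
        final String[] context;

        SimpleEvent(String outcome, String... context) {
            this.outcome = outcome;
            this.context = context;
        }
    }

    // Wraps an iterator and updates a CRC32C checksum for every event passed through.
    static final class ChecksummingIterator implements Iterator<SimpleEvent> {
        private final Iterator<SimpleEvent> delegate;
        private final Checksum checksum = new CRC32C();

        ChecksummingIterator(Iterator<SimpleEvent> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean hasNext() {
            return delegate.hasNext();
        }

        @Override
        public SimpleEvent next() {
            SimpleEvent event = delegate.next();
            update(event.outcome);
            for (String feature : event.context) {
                update(feature);
            }
            return event;
        }

        // Feeds one field into the checksum, followed by a 0 byte as separator
        // so that ("ab", "c") and ("a", "bc") do not produce the same byte sequence.
        private void update(String field) {
            byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
            checksum.update(bytes, 0, bytes.length);
            checksum.update(0);
        }

        long value() {
            return checksum.getValue();
        }
    }

    // Consumes all events and returns the resulting checksum.
    static long checksumOf(List<SimpleEvent> events) {
        ChecksummingIterator it = new ChecksummingIterator(events.iterator());
        while (it.hasNext()) {
            it.next();
        }
        return it.value();
    }
}
```

In the write/read scenario described above, the value obtained while writing
the events would be compared with the value obtained while reading them back;
a mismatch would indicate that the content differs.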



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
