[jira] [Closed] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer

2024-05-07 Thread Martin Wiesner (Jira)


 [ 
https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner closed OPENNLP-1556.
---
Resolution: Fixed

> Improve speed of checksum computation in TwoPassDataIndexer
> ---
>
> Key: OPENNLP-1556
> URL: https://issues.apache.org/jira/browse/OPENNLP-1556
> Project: OpenNLP
> Issue Type: Improvement
> Components: Machine Learning
> Affects Versions: 1.9.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3
> Reporter: Martin Wiesner
> Assignee: Martin Wiesner
> Priority: Major
> Fix For: 2.3.4
>
>
> For training ML models, all observations (Events) are indexed via 
> {{TwoPassDataIndexer#index(ObjectStream eventStream)}}. 
> When #index(..) runs, a temporary file is written and then read back in. To
> validate the content processed, instances of HashSumEventStream compute a
> hash over the events in both directions (write/read).
> Currently, a cryptographic (MD5) message digest is computed from the output
> of a rather slow toString() implementation in Event. This is much slower
> than simply computing a checksum (such as a CRC32c value). The slowdown
> becomes more problematic when larger training corpora are (pre-)processed,
> that is, indexed in advance.
> Aims:
> - Speed up the (IO-bound) indexing part prior to the actual CPU-bound
> training phase.
> - Switch from MD5 to CRC32c, as there is *no* need for a cryptographic hash 
> function here; it's simply a checksum that is required to decide whether all 
> bytes written are the same bytes that are read.
> - Remove the untested class HashSumEventStream, which is just a wrapper that
> calls a slow toString() in Event to get some bytes for computing a
> checksum / message digest.
> - Provide a replacement for HashSumEventStream, e.g. ChecksumEventStream,
> that makes use of the faster CRC32c checksum computation, avoiding
> cryptographic hash functions such as MD5.
> - Make sure all existing tests hold.
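
The idea behind the aims above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code merged for this issue: the class ChecksumSketch, the type SimpleEvent and the method checksumOf are hypothetical names, and SimpleEvent merely stands in for OpenNLP's Event. The only library calls used are java.util.zip.CRC32C (available since Java 9) and the java.util.zip.Checksum interface. Each field is fed directly into one running CRC32C instead of building a toString() representation and hashing it with MD5.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

class ChecksumSketch {

    /** Hypothetical stand-in for an indexed event: an outcome plus its context features. */
    static final class SimpleEvent {
        final String outcome;
        final List<String> context;

        SimpleEvent(String outcome, List<String> context) {
            this.outcome = outcome;
            this.context = context;
        }
    }

    /** Feeds every field of every event into one running CRC32C and returns its value. */
    static long checksumOf(List<SimpleEvent> events) {
        Checksum crc = new CRC32C();
        for (SimpleEvent event : events) {
            update(crc, event.outcome);
            for (String feature : event.context) {
                update(crc, feature);
            }
        }
        return crc.getValue();
    }

    private static void update(Checksum crc, String field) {
        byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
        crc.update(bytes, 0, bytes.length);
        crc.update(0); // separator byte, so field boundaries influence the checksum
    }

    public static void main(String[] args) {
        // Simulates the events written in the first pass and read back in the second.
        List<SimpleEvent> written = List.of(new SimpleEvent("NN", List.of("w=dog", "p=the")));
        List<SimpleEvent> readBack = List.of(new SimpleEvent("NN", List.of("w=dog", "p=the")));
        System.out.println(checksumOf(written) == checksumOf(readBack)); // prints "true"
    }
}
```

Comparing the value computed while writing with the value computed while reading is enough to detect accidental corruption of the temporary file; a CRC offers no protection against deliberate tampering, which is not a concern in this use case.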



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OPENNLP-1556) Improve speed of checksum computation in TwoPassDataIndexer

2024-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OPENNLP-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844159#comment-17844159
 ] 

ASF GitHub Bot commented on OPENNLP-1556:
-

mawiesne merged PR #600:
URL: https://github.com/apache/opennlp/pull/600



