#20395: metrics-lib should be able to handle large descriptor files
---------------------------------+-----------------------------------
 Reporter:  iwakeh               |          Owner:  karsten
     Type:  defect               |         Status:  new
 Priority:  Medium               |      Milestone:  metrics-lib 2.0.0
Component:  Metrics/metrics-lib  |        Version:
 Severity:  Normal               |     Resolution:
 Keywords:                       |  Actual Points:
Parent ID:                       |         Points:
 Reviewer:                       |        Sponsor:
---------------------------------+-----------------------------------
Comment (by iwakeh):

 I hope I didn't overlook anything:

 `DescriptorFile#getDescriptors()` and `DescriptorParser#parseDescriptors()`
 don't access files. They receive Descriptor objects or bytes and have to
 keep the bytes, but these methods don't cause an OOM unless their caller
 provides too much.

 The problem lies in the implementation of
 `DescriptorReaderImpl$DescriptorReaderRunnable` (which, as an aside, should
 be a separate class). There the `readFile` method attempts to read an
 entire file and chokes when it encounters a huge file.
 `DescriptorReaderRunnable` should check the file size before opening the
 file, so that it can handle files according to their size.

 The OOM is caused by reading the entire file into memory and then creating
 all the Descriptor objects from it in memory (possibly copying the raw
 bytes, I didn't verify).

 Memory usage could be reduced
  1. by only reading parts of the huge file and also
  2. by not adding the bytes to the descriptor objects and instead simply
     keeping the file path and position inside the file in memory.

 Assumptions:
  * many Descriptor objects w/o bytes occupy way less space than the
    Descriptor objects do currently
  * the files containing the descriptors remain available as long as there
    are Descriptor objects referring to them

 A sketch of changes:
  * Introduce descriptors that either hold their bytes in memory or have a
    file path and in-file position(s) for accessing raw bytes, but don't
    store the bytes.
  * `DescriptorImpl` parses bytes and produces a list of the adapted
    Descriptor objects.
  * `DescriptorReaderRunnable` needs to read a certain chunk of a large
    file, parse enough to determine the next descriptor, and also provide
    the parser with the beginning and end positions in the file.

 This stays very close to the current implementation; the details need some
 more work, and it might be necessary to change more.
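 To make the two memory-saving ideas above more concrete, here is a minimal
 sketch, not a patch against metrics-lib. The names `DescriptorRef` and
 `ChunkedDescriptorIndexer` are hypothetical and do not exist in the library.
 The sketch assumes that every descriptor begins with a known ASCII keyword
 at the start of a line (passed in as `startToken`, which must not contain a
 newline), that bytes before the first such keyword can be ignored, and that
 a single descriptor is smaller than 2 GiB. The file is scanned through a
 bounded buffer, and only the path, offset and length of each descriptor are
 kept in memory.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical descriptor handle: remembers where the bytes live, not the bytes. */
final class DescriptorRef {
  final Path file;
  final long offset;
  final int length;

  DescriptorRef(Path file, long offset, int length) {
    this.file = file;
    this.offset = offset;
    this.length = length;
  }

  /** Loads the raw descriptor bytes from disk only when they are requested. */
  byte[] readRawBytes() throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
      byte[] bytes = new byte[length];
      raf.seek(offset);
      raf.readFully(bytes);
      return bytes;
    }
  }
}

/** Hypothetical indexer: scans a file through a bounded buffer, recording byte ranges. */
final class ChunkedDescriptorIndexer {

  static List<DescriptorRef> index(Path file, String startToken, int chunkSize)
      throws IOException {
    byte[] token = startToken.getBytes(StandardCharsets.US_ASCII);
    List<Long> starts = new ArrayList<>();
    long pos = 0;     // offset of the byte currently being examined
    int matched = 0;  // token bytes matched so far on this line; -1 = line cannot match
    try (InputStream in = new BufferedInputStream(Files.newInputStream(file), chunkSize)) {
      int b;
      while ((b = in.read()) != -1) {
        if (b == '\n') {
          matched = 0;  // the next byte starts a new line
        } else if (matched >= 0 && matched < token.length) {
          if (b == (token[matched] & 0xff)) {
            matched++;
            if (matched == token.length) {
              starts.add(pos - token.length + 1);  // offset of the token's first byte
            }
          } else {
            matched = -1;
          }
        }
        pos++;
      }
    }
    // Each descriptor ends where the next one starts; the last one ends at end of file.
    List<DescriptorRef> refs = new ArrayList<>();
    for (int i = 0; i < starts.size(); i++) {
      long start = starts.get(i);
      long end = (i + 1 < starts.size()) ? starts.get(i + 1) : pos;
      refs.add(new DescriptorRef(file, start, (int) (end - start)));
    }
    return refs;
  }
}
```

 A caller would run something like
 `ChunkedDescriptorIndexer.index(path, "router ", 1 << 20)` and later call
 `readRawBytes()` only on the descriptors whose raw content is actually
 needed. This depends on the second assumption above: the file must still be
 in place and unchanged when the bytes are fetched.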
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/20395#comment:7>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
_______________________________________________
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs