[ https://issues.apache.org/jira/browse/TIKA-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281225#comment-16281225 ]
Tim Allison edited comment on TIKA-2519 at 12/8/17 2:51 PM:
------------------------------------------------------------

Thank you for opening this issue. That’s definitely a bug. Parsers should be threadsafe.

was (Author: talli...@mitre.org):
    Thank you for opening this issue. That’s definitely a bug. Parsers should be multi-threadable.

> Issue parsing multiple CHM files concurrently
> ---------------------------------------------
>
>                 Key: TIKA-2519
>                 URL: https://issues.apache.org/jira/browse/TIKA-2519
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Eamonn Saunders
>            Priority: Blocker
>             Fix For: 1.17
>
>
> Should I expect to be able to parse multiple CHM files concurrently in
> multiple threads?
> What I'm noticing when attempting to parse 2 different CHM files in different
> threads is that:
> - ChmExtractor.extractChmEntry() gets a ChmBlockInfo as follows:
> {code}
>     ChmBlockInfo bb = ChmBlockInfo.getChmBlockInfoInstance(
>             directoryListingEntry, (int) getChmLzxcResetTable()
>                     .getBlockLen(), getChmLzxcControlData());
> {code}
> - ChmBlockInfo.getChmBlockInfoInstance() is a static method that appears to
> limit the number of ChmBlockInfo instances to 1.
> {code}
>     public static ChmBlockInfo getChmBlockInfoInstance(
>             DirectoryListingEntry dle, int bytesPerBlock,
>             ChmLzxcControlData clcd) {
>         setChmBlockInfo(new ChmBlockInfo());
>         getChmBlockInfo().setStartBlock(dle.getOffset() / bytesPerBlock);
>         getChmBlockInfo().setEndBlock(
>                 (dle.getOffset() + dle.getLength()) / bytesPerBlock);
>         getChmBlockInfo().setStartOffset(dle.getOffset() % bytesPerBlock);
>         getChmBlockInfo().setEndOffset(
>                 (dle.getOffset() + dle.getLength()) % bytesPerBlock);
>         // potential problem with casting long to int
>         getChmBlockInfo().setIniBlock(
>                 getChmBlockInfo().startBlock - getChmBlockInfo().startBlock
>                         % (int) clcd.getResetInterval());
>         // (getChmBlockInfo().startBlock - getChmBlockInfo().startBlock)
>         // % (int) clcd.getResetInterval());
>         return getChmBlockInfo();
>     }
> {code}
> Is there a good reason why there should only ever be one instance of
> ChmBlockInfo?
> Should we forget about attempting to process CHM files in parallel and
> instead queue them up to be processed sequentially?

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
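
For illustration, the quoted factory method stores each new object through what appears to be a shared static holder, so two threads can overwrite each other's instance between the set and get calls. One way to avoid that is to build the object in a local variable and return it directly. The following is a minimal sketch under that assumption, not the change actually made in Tika: BlockInfoSketch, BlockInfo, of() and countMismatches() are hypothetical names, and the iniBlock calculation is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockInfoSketch {

    // Hypothetical stand-in for ChmBlockInfo; this is not the Tika class.
    static final class BlockInfo {
        final long startBlock;
        final long endBlock;
        final long startOffset;
        final long endOffset;

        private BlockInfo(long startBlock, long endBlock,
                          long startOffset, long endOffset) {
            this.startBlock = startBlock;
            this.endBlock = endBlock;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        // All state is computed from the arguments and stored in a fresh,
        // immutable object. No static field is written, so concurrent
        // callers cannot see each other's values.
        static BlockInfo of(long offset, long length, int bytesPerBlock) {
            return new BlockInfo(
                    offset / bytesPerBlock,
                    (offset + length) / bytesPerBlock,
                    offset % bytesPerBlock,
                    (offset + length) % bytesPerBlock);
        }
    }

    // Calls the factory from several threads and counts results that do
    // not match the values expected for that task's own inputs.
    static int countMismatches(int threads, int tasks) {
        final int bytesPerBlock = 32768; // arbitrary block size for the sketch
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final long offset = i * 1000L + 7;
                final long length = 500L + i;
                results.add(pool.submit(() -> {
                    BlockInfo info = BlockInfo.of(offset, length, bytesPerBlock);
                    return info.startBlock == offset / bytesPerBlock
                            && info.endBlock == (offset + length) / bytesPerBlock
                            && info.startOffset == offset % bytesPerBlock
                            && info.endOffset == (offset + length) % bytesPerBlock;
                }));
            }
            int mismatches = 0;
            for (Future<Boolean> result : results) {
                if (!result.get()) {
                    mismatches++;
                }
            }
            return mismatches;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(8, 10_000));
    }
}
```

Because each call returns its own immutable object, the callers need no locking, and there is no reason to queue the files for sequential processing on this account.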