With Tika 1.13 and the code below, I can't reproduce any problems with the 
handful of test files in our unit tests.  

Exactly what code are you using?  How are you doing detection?


    @Test
    public void testMultiThreadedEncodingDetection() throws Exception {
        Path testDocs = Paths.get(this.getClass().getResource("/test-documents").toURI());
        List<Path> paths = new ArrayList<>();
        Map<Path, String> encodings = new ConcurrentHashMap<>();
        for (File file : testDocs.toFile().listFiles()) {
            if (file.getName().endsWith(".txt") || file.getName().endsWith(".html")) {
                String encoding = getEncoding(file.toPath());
                paths.add(file.toPath());
                encodings.put(file.toPath(), encoding);
            }
        }
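        // Thread.run() executes on the calling thread, so submit the detectors to
        // a pool (java.util.concurrent) to make them actually run concurrently.
        ExecutorService executor = Executors.newFixedThreadPool(10);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            futures.add(executor.submit(new EncodingDetector(paths, encodings)));
        }
        executor.shutdown();
        for (Future<?> future : futures) {
            future.get(); // rethrows any failure from a worker as ExecutionException
        }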
        assertTrue("success!", true);
    }

    private class EncodingDetector implements Runnable {
        private final List<Path> paths;
        private final Map<Path, String> encodings;
        private final Random r = new Random();
        private EncodingDetector(List<Path> paths, Map<Path, String> encodings) {
            this.paths = paths;
            this.encodings = encodings;
        }

        @Override
        public void run() {
            for (int i = 0; i < 100; i++) {
                int pInd = r.nextInt(paths.size());
                String detectedEncoding = null;
                try {
                    detectedEncoding = getEncoding(paths.get(pInd));
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                String trueEncoding = encodings.get(paths.get(pInd));
                if (! detectedEncoding.equals(trueEncoding)) {
                    throw new RuntimeException("detected: " + detectedEncoding +
                            " but should have been: "+trueEncoding);
                }
            }
        }
    }

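    public String getEncoding(Path p) throws Exception {
        // Close the reader as well as the underlying stream.
        try (InputStream is = TikaInputStream.get(p);
             AutoDetectReader reader = new AutoDetectReader(is)) {
            // Check the charset itself; calling toString() on null would throw.
            Charset charset = reader.getCharset();
            if (charset == null) {
                return "NULL";
            } else {
                return charset.toString();
            }
        }
    }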
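
For trying this against other detection code, here is a minimal, library-independent sketch of the same kind of check. It is not part of Tika: the class name ConcurrentDetectionCheck, the method findMismatches, and its parameters are all hypothetical, and the detector argument stands in for whatever call does the detection. It holds every worker at a latch so that they start together, and it collects mismatches instead of stopping at the first one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not a Tika class.
public class ConcurrentDetectionCheck {

    /**
     * Applies detector to every key of expected from numThreads threads at once,
     * iterations times per thread, and returns one message per result that
     * differs from the expected value. The expected map must not be modified
     * while this runs.
     */
    public static <K> List<String> findMismatches(Map<K, String> expected,
                                                   Function<K, String> detector,
                                                   int numThreads,
                                                   int iterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        CountDownLatch startGate = new CountDownLatch(1);
        List<Future<List<String>>> futures = new ArrayList<>();
        try {
            for (int t = 0; t < numThreads; t++) {
                Callable<List<String>> worker = () -> {
                    startGate.await(); // hold each worker until all are submitted
                    List<String> mismatches = new ArrayList<>();
                    for (int i = 0; i < iterations; i++) {
                        for (Map.Entry<K, String> e : expected.entrySet()) {
                            String detected = detector.apply(e.getKey());
                            if (!e.getValue().equals(detected)) {
                                mismatches.add(e.getKey() + ": detected " + detected
                                        + " but expected " + e.getValue());
                            }
                        }
                    }
                    return mismatches;
                };
                futures.add(pool.submit(worker));
            }
            startGate.countDown(); // release all workers together
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.addAll(f.get()); // rethrows a worker's exception, if any
            }
            return all;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

To use it with the test above, the encodings map could be passed as expected. getEncoding would first need a small wrapper that rethrows its checked exception as a RuntimeException, as run() already does, because Function.apply cannot throw checked exceptions.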

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, July 25, 2016 9:21 PM
To: user@tika.apache.org
Subject: RE: Is Tika (especially CharsetDetector) considered thread-safe?

Charset detection _should_ be thread safe.  If you can help us track down the 
problem (unit test?), we need to fix this.

Thank you for raising this.

Best,

         Tim
