monkmachine commented on PR #558: URL: https://github.com/apache/tika/pull/558#issuecomment-1130370795
If I read byte by byte (i.e. `byte[] bytes = new byte[1];`), I get the correct result:

![image](https://user-images.githubusercontent.com/36521886/169118333-e9a5509e-8fb4-4b28-9be4-6d326a03059a.png)

If I read with any buffer larger than one byte, the output contains extra bytes/strings from some other part of the file:

![image](https://user-images.githubusercontent.com/36521886/169118508-6bd9559c-ffe9-4146-a74b-38141c585fbc.png)

It only happens with this one JSON file, but on that file I can reproduce it every time.

```
@Test
public void jsonConvert() throws FileNotFoundException, IOException {
    try (FileInputStream fis = new FileInputStream("c:\\temp1\\dwgreadout.json");
         FileOutputStream fos = new FileOutputStream("c:\\temp1\\dwgreadoutClean.json")) {
        byte[] bytes = new byte[1000];
        while (fis.read(bytes) != -1) {
            byte[] fixedBytes = new String(bytes, StandardCharsets.UTF_8)
                    //.replaceAll(dwgc.getCleanDwgReadRegexToReplace(), dwgc.getCleanDwgReadReplaceWith())
                    //.replaceAll(" nan ", " 0 ")
                    //.replaceAll(" nan,", " 0,")
                    .getBytes(StandardCharsets.UTF_8);
            String st = new String(fixedBytes, StandardCharsets.UTF_8);
            fos.write(fixedBytes, 0, fixedBytes.length);
        }
    }
}
```
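One possible explanation, offered as a sketch rather than a confirmed diagnosis: the loop in the test discards the return value of `read(bytes)`, so when a read fills only part of the buffer, the tail still holds bytes from an earlier read and is written out again. With a one-byte buffer a successful read always fills the buffer, which would explain why that case looks correct. The class and method names below (`ChunkedCopySketch`, `copyIgnoringCount`, `copyUsingCount`) are hypothetical, and in-memory streams stand in for the real files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical helper class, for illustration only; not part of Tika.
class ChunkedCopySketch {

    // Mirrors the pattern in the test: the number of bytes read is ignored,
    // so the whole buffer is written even after a partial read.
    static void copyIgnoringCount(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        while (in.read(buffer) != -1) {
            out.write(buffer, 0, buffer.length); // may include stale bytes from an earlier read
        }
    }

    // Writes only the bytes that the most recent read actually delivered.
    static void copyUsingCount(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        // 2500 bytes of ASCII: two full 1000-byte reads, then one partial read of 500.
        byte[] input = new byte[2500];
        for (int i = 0; i < input.length; i++) {
            input[i] = (byte) ('a' + i % 26);
        }

        ByteArrayOutputStream bad = new ByteArrayOutputStream();
        copyIgnoringCount(new ByteArrayInputStream(input), bad, 1000);

        ByteArrayOutputStream good = new ByteArrayOutputStream();
        copyUsingCount(new ByteArrayInputStream(input), good, 1000);

        System.out.println("input bytes:         " + input.length);
        System.out.println("ignoring read count: " + bad.size());
        System.out.println("using read count:    " + good.size());
        System.out.println("round trip matches:  " + Arrays.equals(input, good.toByteArray()));
    }
}
```

If the string replacements are re-enabled, decoding each chunk separately can also split a multi-byte UTF-8 character across two chunks; wrapping the stream in an `InputStreamReader` and working on characters avoids that.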