[GitHub] [tika] monkmachine commented on pull request #558: TIKA-1735 - Adding DWGRead parser to Tika if available

2022-05-18 Thread GitBox
monkmachine commented on PR #558: URL: https://github.com/apache/tika/pull/558#issuecomment-1130493082 > No, that probably won't work. Sorry. If you send me some examples, I can try some things. Can't send you examples unfortunately :( I did manage to process thousands of files today w

[GitHub] [tika] monkmachine commented on pull request #558: TIKA-1735 - Adding DWGRead parser to Tika if available

2022-05-18 Thread GitBox
monkmachine commented on PR #558: URL: https://github.com/apache/tika/pull/558#issuecomment-1130489653 > Can you guarantee that reading per line will be ok on this json-disaster? If so, that's the way to go. > > The other thing is that you'll want to specify the encoding on your read
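
The advice quoted above (read per line, and specify the encoding on the read) can be sketched as follows. This is a minimal illustration, not the code from the PR: the class name `LineCopy`, the method `copyLines`, and the choice of UTF-8 are assumptions, since the snippet does not say which encoding the JSON files actually use.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika: copies a text file line by line
// with the charset stated explicitly on both the read and the write side.
public class LineCopy {
    static void copyLines(Path src, Path dst) throws IOException {
        // UTF-8 is an assumption here; swap in the real encoding once known.
        try (BufferedReader reader = Files.newBufferedReader(src, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(dst, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Any per-line cleanup would go here.
                writer.write(line);
                // Write '\n' explicitly so the output does not depend on the platform.
                writer.write('\n');
            }
        }
    }
}
```

Naming the charset on both sides keeps the result independent of the JVM's default encoding, which can differ between machines.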

[GitHub] [tika] monkmachine commented on pull request #558: TIKA-1735 - Adding DWGRead parser to Tika if available

2022-05-18 Thread GitBox
monkmachine commented on PR #558: URL: https://github.com/apache/tika/pull/558#issuecomment-1130489224 > Can you tell if they're writing UTF-8? Are there any ASCII accented data items or non-ASCII characters that you can use to figure out what their default encoding is? If you can h

[GitHub] [tika] monkmachine commented on pull request #558: TIKA-1735 - Adding DWGRead parser to Tika if available

2022-05-18 Thread GitBox
monkmachine commented on PR #558: URL: https://github.com/apache/tika/pull/558#issuecomment-1130486579 > No, that probably won't work. Sorry. If you send me some examples, I can try some things. Yeah we'd be ok if Jackson allowed "nan" as well as "NaN" as we could use JsonReadFeature
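
One way to work around a case-sensitive `NaN` token is to normalise lowercase `nan` before the text reaches the JSON parser. The sketch below is an assumption-laden illustration rather than what the PR does: `NanNormalizer` and `normalize` are hypothetical names, and the regex is a heuristic that only looks at surrounding punctuation, so it would also rewrite a matching `nan` that happens to sit inside a string value.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or Jackson.
public class NanNormalizer {
    // A bare "nan" that follows ':', '[' or ',' and is followed by ',', ']' or '}'.
    // The lookahead leaves the trailing delimiter unconsumed so that
    // adjacent array elements such as [nan,nan] are both matched.
    private static final Pattern BARE_NAN =
            Pattern.compile("([:\\[,]\\s*)nan(?=\\s*[,\\]}])");

    static String normalize(String json) {
        return BARE_NAN.matcher(json).replaceAll("$1NaN");
    }
}
```

For example, `NanNormalizer.normalize("{\"x\": nan}")` returns `{"x": NaN}`, while a value such as `"nano"` is left alone because the quote character breaks the match.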

[GitHub] [tika] monkmachine commented on pull request #558: TIKA-1735 - Adding DWGRead parser to Tika if available

2022-05-18 Thread GitBox
monkmachine commented on PR #558: URL: https://github.com/apache/tika/pull/558#issuecomment-1130393422 If I use a buffered reader I get the correct output but it's slower: 3s vs 10s (it's quite a large file) ```public void jsonConvert() throws FileNotFoundException, IOException {

[GitHub] [tika] monkmachine commented on pull request #558: TIKA-1735 - Adding DWGRead parser to Tika if available

2022-05-18 Thread GitBox
monkmachine commented on PR #558: URL: https://github.com/apache/tika/pull/558#issuecomment-1130370795 If I read byte by byte (i.e. byte[] bytes = new byte[1];) I get the correct result: ![image](https://user-images.githubusercontent.com/36521886/169118333-e9a5509e-8fb4-4b28-9be4-6d326a0

[GitHub] [tika] monkmachine commented on pull request #558: TIKA-1735 - Adding DWGRead parser to Tika if available

2022-05-18 Thread GitBox
monkmachine commented on PR #558: URL: https://github.com/apache/tika/pull/558#issuecomment-1130203562 Help! @tballison @nddipiazza Any reason why this section would sometimes write extra lines out? On some JSON files, when cleaning up, it writes out the file correctly then appends another
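
The code in question is not visible in this snippet, so the following is only a guess at a common cause of this symptom: a copy loop that writes the whole buffer instead of only the bytes that `InputStream.read(byte[])` reported. The last read usually fills just part of the buffer, and the stale tail from the previous read then gets appended. That would also be consistent with a one-byte buffer giving the correct result, because a one-byte buffer has no stale tail. The class and method names below are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not the code from the PR.
public class StreamCopy {
    // Buggy variant: ignores how many bytes read() actually filled.
    static void copyIgnoringCount(InputStream in, OutputStream out, int bufSize) throws IOException {
        byte[] buf = new byte[bufSize];
        while (in.read(buf) != -1) {
            out.write(buf); // writes the full buffer, including leftovers from the previous read
        }
    }

    // Corrected variant: writes only the n bytes returned by read().
    static void copy(InputStream in, OutputStream out, int bufSize) throws IOException {
        byte[] buf = new byte[bufSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "abcdefghij".getBytes(StandardCharsets.US_ASCII);

        ByteArrayOutputStream buggy = new ByteArrayOutputStream();
        copyIgnoringCount(new ByteArrayInputStream(data), buggy, 4);
        // Prints "abcdefghijgh": the final 2-byte read leaves "gh" in the buffer.
        System.out.println(new String(buggy.toByteArray(), StandardCharsets.US_ASCII));

        ByteArrayOutputStream fixed = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(data), fixed, 4);
        // Prints "abcdefghij".
        System.out.println(new String(fixed.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

If this is the cause, keeping a large buffer and passing the returned count to `write` should give correct output without the slowdown of reading byte by byte.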

[GitHub] [tika] monkmachine commented on pull request #558: TIKA-1735 - Adding DWGRead parser to Tika if available

2022-05-16 Thread GitBox
monkmachine commented on PR #558: URL: https://github.com/apache/tika/pull/558#issuecomment-1128022252 > > > > @nddipiazza @tballison This looks messy, can you advise a way to clean it up? A better way of doing it? Still think it's worth having the comments there? > >

[GitHub] [tika] monkmachine commented on pull request #558: TIKA-1735 - Adding DWGRead parser to Tika if available

2022-05-13 Thread GitBox
monkmachine commented on PR #558: URL: https://github.com/apache/tika/pull/558#issuecomment-1126521837 @nddipiazza @tballison This looks messy, can you advise a way to clean it up? A better way of doing it? Still think it's worth having the comments there? https://github.com/apache/ti

[GitHub] [tika] monkmachine commented on pull request #558: TIKA-1735 - Adding DWGRead parser to Tika if available

2022-05-13 Thread GitBox
monkmachine commented on PR #558: URL: https://github.com/apache/tika/pull/558#issuecomment-1126153049 > should we use TestContainers to test this within a docker container to make sure it works? or is it sufficient to just run test only if dwgread is installed? Would thi

[GitHub] [tika] monkmachine commented on pull request #558: TIKA-1735 - Adding DWGRead parser to Tika if available

2022-05-13 Thread GitBox
monkmachine commented on PR #558: URL: https://github.com/apache/tika/pull/558#issuecomment-1126150480 > @tballison @monkmachine > > > Or do you want to use our current parser only if the dwg executable is not available. > > I would vote +1 on _use current parser only if the dw