These are interesting results... I'm not a Unicode specialist, but a Japanese query should not be able to match Arabic documents if both are correctly encoded.
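To illustrate why I would not expect a match, here is a minimal sketch using only the plain JDK (not the Lucene API). The class name `ScriptCheck` and the helpers `scriptsOf` and `shareScript` are made up for this example, and the two sample strings are my own, not taken from your index. It only shows that correctly encoded Japanese and Arabic letters belong to different Unicode scripts, so their terms cannot be equal unless something rewrites them.

```java
import java.util.EnumSet;
import java.util.Set;

public class ScriptCheck {
    // Collect the Unicode scripts of all letters in the given text.
    static Set<Character.UnicodeScript> scriptsOf(String text) {
        Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> scripts.add(Character.UnicodeScript.of(cp)));
        return scripts;
    }

    // True if the two strings have at least one script in common.
    static boolean shareScript(String a, String b) {
        Set<Character.UnicodeScript> common = scriptsOf(a);
        common.retainAll(scriptsOf(b));
        return !common.isEmpty();
    }

    public static void main(String[] args) {
        // Hypothetical sample terms, written as escapes to avoid source-encoding issues.
        String japanese = "\u65E5\u672C\u8A9E";     // "Japanese language" in kanji
        String arabic = "\u0639\u0631\u0628\u064A"; // "Arabic" in Arabic letters
        System.out.println(scriptsOf(japanese));
        System.out.println(scriptsOf(arabic));
        System.out.println(shareScript(japanese, arabic)); // prints "false"
    }
}
```

If the same kind of check on your real query and document terms shows an unexpected script (for example, Latin letters where you expected Japanese), that would suggest the text was decoded with the wrong charset somewhere before or during indexing.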
I can't recommend this use case (a single field for all languages), but you may want to inspect the "indexed" (analyzed) tokens rather than the "stored" data. Are there any CharFilters or TokenFilters in your analysis chain that change (or corrupt) tokens unexpectedly?

Thanks,
Tomoko