[ https://issues.apache.org/jira/browse/DRILL-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
benj updated DRILL-7444:
------------------------
    Description: 
Two files (a.json and b.json) and the concatenation of these two files (ab.json) produce different results on a simple _SELECT_ when using +Drill embedded+.
The problem appears once the total size reaches a certain number of bytes (~ 102 400 000 in my case).
{code:bash}
#!/bin/bash
# script gen.sh to reproduce the problem
# usage: gen.sh N  (writes N JSON records of 10 000 bytes each to stdout)
for ((i=1;i<=$1;++i));
do
    # opening of the record: 7 bytes
    echo -n '{"At":"'
    # payload: 999 x 10 bytes
    for j in {1..999};
    do
        echo -n 'aaaaabbbbb'
    done
    # closing of the record: 2 bytes + newline
    echo '"}'
done
{code}
{noformat}
== I ==
$ gen.sh 10000 > a.json
$ gen.sh 239 > b.json
$ wc -c *.json
100000000 a.json
  2390000 b.json
102390000 total
$ bash drill-embedded
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
+--------------------+
| At                 |
+--------------------+
| aaaaabbbbaaaaab... |
+--------------------+
=> All is fine here

== II ==
$ gen.sh 10000 > a.json
$ gen.sh 240 > b.json
$ wc -c *.json
100000000 a.json
  2400000 b.json
102400000 total
$ bash drill-embedded
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
+--------------------+
| At                 |
+--------------------+
|                    |
+--------------------+
=> Surprisingly, field `At` is empty

== III ==
$ gen.sh 10240 > ab.json
$ wc -c *.json
102400000 ab.json
$ bash drill-embedded
apache drill> SELECT * FROM dfs.tmp.`ab.json` LIMIT 1;
+--------------------+
| At                 |
+--------------------+
| aaaaabbbbaaaaab... |
+--------------------+
=> All is fine here, although the total number of lines is the same as in case II
{noformat}
The Drill 1.17 build tested here is the latest as of 2019-11-13.
This problem doesn't appear with Drill embedded 1.16.

> JSON blank result on SELECT when too much byte in multiple files on embedded
> ----------------------------------------------------------------------------
>
>                 Key: DRILL-7444
>                 URL: https://issues.apache.org/jira/browse/DRILL-7444
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - JSON
>    Affects Versions: 1.17.0
>            Reporter: benj
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
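The byte counts quoted in the three cases can be checked without generating the files. Each record written by gen.sh is 10 000 bytes: 7 for `{"At":"`, 999 x 10 for the payload, 2 for `"}` and 1 for the trailing newline. The sketch below only redoes that arithmetic; `record_bytes` and `total_bytes` are hypothetical helper names that are not part of the report, and nothing here touches Drill itself.

```bash
#!/bin/bash
# Hypothetical helpers (not part of the original report) that recompute
# the file sizes shown by `wc -c` in cases I, II and III.

# Size in bytes of one record produced by gen.sh.
record_bytes() {
    local prefix='{"At":"' chunk='aaaaabbbbb' suffix='"}'
    # prefix + 999 payload chunks + suffix + newline added by the last echo
    echo $(( ${#prefix} + 999 * ${#chunk} + ${#suffix} + 1 ))
}

# Total size in bytes of files holding the given numbers of records.
total_bytes() {
    local sum=0 n
    for n in "$@"; do
        sum=$(( sum + n * $(record_bytes) ))
    done
    echo "$sum"
}

total_bytes 10000 239    # case I:   a.json + b.json
total_bytes 10000 240    # case II:  a.json + b.json
total_bytes 10240        # case III: ab.json alone
```

Cases II and III give the same total (102400000), so the two runs differ only in how the same number of bytes is split across files, which is the contrast the report relies on.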