[ https://issues.apache.org/jira/browse/SPARK-27507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-27507.
----------------------------------
    Resolution: Cannot Reproduce

{code}
Input length: 2264	Output length: 2264
Input length: 2265	Output length: 2265
Input length: 2667	Output length: 2667
Input length: 2666	Output length: 2666
Input length: 2668	Output length: 2668
Input length: 26000	Output length: 26000
{code}

I can't reproduce this in the current master, as shown above. It would be great if we could identify which JIRA fixed it and see whether the fix can be backported. For now, I am leaving this resolved.

> get_json_object fails somewhat arbitrarily on long input
> --------------------------------------------------------
>
>                 Key: SPARK-27507
>                 URL: https://issues.apache.org/jira/browse/SPARK-27507
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Michael Chirico
>            Priority: Major
>         Attachments: Screen Shot 2019-04-18 at 7.13.02 PM.png
>
>
> Some long JSON objects are parsed incorrectly by {{get_json_object}}.
> The specific string we noticed this on can't be shared, but here's some
> reproduction in Pyspark:
> {code:java}
> # v2.3.1
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> from string import ascii_lowercase
> # create a long string
> alpha_rep = ascii_lowercase*1000
> # create a simple query on a simple json object which contains this string
> test_q = '''
> select get_json_object('{{"a": "{}"}}', '$.a')
> '''
> def run_q(s):
>     return len(spark.sql(test_q.format(s)).collect()[0][0])
> def diagnose(s):
>     out_len = run_q(s)
>     # input & output should be identical (length match is a necessary condition)
>     print('Input length: %d\tOutput length: %d' % (len(s), out_len))
>     return True
> def test_l(n):
>     diagnose(alpha_rep[:n])
>     return True
> test_l(2264)
> test_l(2265)
> test_l(2667)
> test_l(2666)
> test_l(2668)
> test_l(len(alpha_rep)){code}
> With results on my instance:
> {code:java}
> Input length: 2264	Output length: 2264
> Input length: 2265	Output length: 2265
> Input length: 2667	Output length: 2660  <---- problematic!!
> Input length: 2666	Output length: 2666
> Input length: 2668	Output length: 2661  <---- problematic!!
> Input length: 26000	Output length: 26000
> {code}
> It's strange that the error triggers for some lengths, but it's apparently
> not exclusively about the input being large.
>
> More details from a {{pandas}} exploration:
> {code:java}
> import pandas as pd
> DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})
> N = DF.shape[0]
> # note -- takes about 20 minutes to run on my machine
> for ii in range(N):
>     DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
>     if ii % 520 == 0:
>         print("%.0f%% Done" % (100.0*ii/N))
> DF[DF['n'] != DF['m']].shape
> # (1326, 2)
> DF['miss'] = DF['n'] - DF['m']
> DF.plot('n', 'miss')
> {code}
> Plot attached
> So it appears to fail for a narrowly defined range of about 1300 characters
> before recovering and continuing to function as expected.
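
To help narrow down which change fixed this, it may be useful to report the exact boundaries of the failing range rather than only the count of mismatches. The following is a minimal sketch, not part of the original report: `failing_ranges` is a hypothetical helper, and `fake_output_len` is a made-up stand-in for the Spark query so the example runs without a cluster. Its numbers are illustrative only and do not describe actual Spark behavior.

```python
from typing import Callable, Iterable, List, Tuple


def failing_ranges(
    lengths: Iterable[int], output_len: Callable[[int], int]
) -> List[Tuple[int, int]]:
    """Group input lengths whose output length differs from the input length
    into contiguous (first_bad, last_bad) ranges.

    Assumes `lengths` yields consecutive, ascending integers.
    """
    ranges = []
    start = None  # first length of the failing run currently open, if any
    prev = None   # previous length visited
    for n in lengths:
        bad = output_len(n) != n
        if bad and start is None:
            start = n                      # a failing run begins here
        elif not bad and start is not None:
            ranges.append((start, prev))   # the run ended at the previous length
            start = None
        prev = n
    if start is not None:
        ranges.append((start, prev))       # run still open at the end of the scan
    return ranges


def fake_output_len(n: int) -> int:
    """Hypothetical stand-in for the query: drops 7 characters for 2667..2670."""
    return n - 7 if 2667 <= n <= 2670 else n


print(failing_ranges(range(2660, 2680), fake_output_len))
```

Against a real session, the second argument would presumably be something like `lambda n: run_q(alpha_rep[:n])`, reusing `run_q` and `alpha_rep` from the reproduction in the quoted description.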