[ https://issues.apache.org/jira/browse/SPARK-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485772#comment-14485772 ]
Stefano Parmesan commented on SPARK-6677:
-----------------------------------------

Hi Davies,

Thanks for taking the time to look into this; that is in fact the expected output. The point is that back then the issue happened randomly, about once every five executions. I am no longer able to reproduce it with the data I originally posted, but I was able to reproduce it again (100% of the time) with the data I just added to the gist; the exception I am getting is:

{noformat}
$ ./bin/pyspark ./spark_test.py
[...]
key: 31491
res1 data as row: [Row(foo=31491, key=u'31491')]
res2 data as row: [Row(bar=1574550000, key=u'31491', other=u'foobar', some=u'thing', that=u'this', this=u'that')]
res1 and res2 fields: (u'foo', u'key') (u'bar', u'key', u'other', u'some', u'that', u'this')
res1 data as tuple: 31491 31491
res2 data as tuple: 1574550000 31491 foobar
key: 31497
res1 data as row: []
res2 data as row: [Row(foo=1574850000, key=u'31497')]
key: 31495
res1 data as row: [Traceback (most recent call last):
  File "/path/to/spark-1.3.0-bin-hadoop2.4/./spark_test.py", line 25, in <module>
    print "res1 data as row:", list(res_x)
  File "/path/to/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py", line 1214, in __repr__
    for n in self.__FIELDS__))
  File "/path/to/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py", line 1214, in <genexpr>
    for n in self.__FIELDS__))
IndexError: tuple index out of range
{noformat}

which is the same one I am getting in my more complex script.
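The {{IndexError}} in the traceback above can be reproduced in plain Python. This is only a sketch of the mechanism, not the actual pyspark code: the class names and the {{\_\_FIELDS\_\_}}-based attribute lookup are simplified assumptions about how the generated Row classes behave when the field names from one schema get attached to a row holding the other schema's values.

```python
# Illustration of the failure mode: a tuple-backed Row whose
# __FIELDS__ list is longer than its tuple of values raises
# IndexError as soon as it is printed.

class Row(tuple):
    __FIELDS__ = ()  # in pyspark this is filled in per schema

    def __getattr__(self, name):
        # simplified lookup: position of the field name in __FIELDS__
        return self[self.__FIELDS__.index(name)]

    def __repr__(self):
        return "Row(%s)" % ", ".join(
            "%s=%r" % (n, getattr(self, n)) for n in self.__FIELDS__)


class GoodRow(Row):
    # field names match the two values -> repr works
    __FIELDS__ = ("foo", "key")


class BadRow(Row):
    # field names leaked in from the *other* file: six names, two values
    __FIELDS__ = ("bar", "key", "other", "some", "that", "this")


print(repr(GoodRow((31495, "31495"))))  # prints Row(foo=31495, key='31495')
try:
    print(repr(BadRow((31495, "31495"))))
except IndexError as e:
    print("IndexError:", e)  # tuple index out of range
```

Here the third field name ({{other}}) maps to index 2 of a two-element tuple, which is exactly the "tuple index out of range" seen above.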
Some considerations:
1) It may be that this particular input triggers the issue only on my machine, so I have added another file that generates random inputs; I got this exception on two out of three randomly generated samples, so please go ahead and run it with different values of {{N}} if the uploaded data does not make pyspark crash in your environment.
2) Interestingly, given the sample data, it always crashes on the same key: {{31495}}. That key does not seem "magic" to me, though (and of course input files containing just those two elements do not make pyspark crash in any way):
{noformat}
data/sample_a.json:{"foo": 31495, "key": "31495"}
data/sample_b.json:{"other": "foobar", "bar": 1574750000, "key": "31495", "that": "this", "this": "that", "some": "thing"}
{noformat}
3) What happens is that either {{res_x.data\[0\].\_\_FIELDS\_\_}} or {{res_y.data\[0\].\_\_FIELDS\_\_}} gets the wrong field names, leading to the {{IndexError}} (there are too many field names and the row does not contain enough data).

> pyspark.sql nondeterministic issue with row fields
> --------------------------------------------------
>
>                 Key: SPARK-6677
>                 URL: https://issues.apache.org/jira/browse/SPARK-6677
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.3.0
>         Environment: spark version: spark-1.3.0-bin-hadoop2.4
> python version: Python 2.7.6
> operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Stefano Parmesan
>              Labels: pyspark, row, sql
>
> The following issue happens only when running pyspark in the python
> interpreter; it works correctly with spark-submit.
> Reading two json files containing objects with a different structure
> sometimes leads to the definition of wrong Rows, where the fields of one
> file are used for the other one.
> I was able to write sample code that reproduces this issue one out of three
> times; the code snippet is available at the following link, together with
> some (very simple) data samples:
> https://gist.github.com/armisael/e08bb4567d0a11efe2db

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)