[ https://issues.apache.org/jira/browse/SPARK-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485772#comment-14485772 ]

Stefano Parmesan commented on SPARK-6677:
-----------------------------------------

Hi Davies,

Thanks for taking the time to look into this; that is indeed the expected 
output. The point is that back then the issue happened randomly, once every 
five executions or so. I'm no longer able to reproduce it with the data I 
originally posted, but I was able to reproduce it again (100% of the time) 
with the data I just added to the gist; the exception I'm getting is:

{noformat}
$ ./bin/pyspark ./spark_test.py
[...]
key: 31491
res1 data as row: [Row(foo=31491, key=u'31491')]
res2 data as row: [Row(bar=1574550000, key=u'31491', other=u'foobar', some=u'thing', that=u'this', this=u'that')]
res1 and res2 fields: (u'foo', u'key') (u'bar', u'key', u'other', u'some', u'that', u'this')
res1 data as tuple: 31491 31491
res2 data as tuple: 1574550000 31491 foobar
key: 31497
res1 data as row: []
res2 data as row: [Row(foo=1574850000, key=u'31497')]
key: 31495
res1 data as row: [Traceback (most recent call last):
  File "/path/to/spark-1.3.0-bin-hadoop2.4/./spark_test.py", line 25, in <module>
    print "res1 data as row:", list(res_x)
  File "/path/to/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py", line 1214, in __repr__
    for n in self.__FIELDS__))
  File "/path/to/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py", line 1214, in <genexpr>
    for n in self.__FIELDS__))
IndexError: tuple index out of range
{noformat}
which is the same exception I'm getting in my more complex script.
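
For context, the relevant part of {{spark_test.py}} boils down to something 
like the sketch below (reconstructed here from the traceback and the printed 
output; the exact script is in the gist, so the cogroup structure and file 
names are assumptions):
{noformat}
# Simplified sketch of the repro; see the gist for the actual spark_test.py.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="spark_test")
sqlContext = SQLContext(sc)

# Two JSON files whose objects share a "key" field but have different schemas.
df_a = sqlContext.jsonFile("data/sample_a.json")
df_b = sqlContext.jsonFile("data/sample_b.json")

# Key each DataFrame's rows by the shared field and cogroup the two RDDs.
rdd_a = df_a.rdd.keyBy(lambda row: row.key)
rdd_b = df_b.rdd.keyBy(lambda row: row.key)

for key, (res_x, res_y) in rdd_a.cogroup(rdd_b).collect():
    print "key:", key
    print "res1 data as row:", list(res_x)  # the line that raises IndexError
    print "res2 data as row:", list(res_y)
{noformat}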

Some considerations:
1) it may be that this particular input triggers the issue only on my machine, 
so I've added another file to generate some random inputs; I got the exception 
on two out of three randomly-generated samples, so please go ahead and run it 
with different values of {{N}} if the uploaded data does not make pyspark 
crash in your environment;
2) interestingly, given the sample data, it always crashes on the same key: 
{{31495}}; however, these two records do not seem "magic" to me (and of course 
input files containing just these two elements do not make pyspark crash in 
any way):
{noformat}
data/sample_a.json:{"foo": 31495, "key": "31495"}
data/sample_b.json:{"other": "foobar", "bar": 1574750000, "key": "31495", 
"that": "this", "this": "that", "some": "thing"}
{noformat}
3) what happens is that either {{res_x.data\[0\].__FIELDS__}} or 
{{res_y.data\[0\].__FIELDS__}} gets the wrong field names, leading to the 
{{IndexError}} (there are more field names than values in the row); see the 
toy example below.
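
To make point 3 concrete, here is a toy example of the failure mode (this is 
not pyspark's actual {{Row}} implementation, just an illustration of why a 
field list longer than the underlying tuple raises the {{IndexError}} in 
{{__repr__}}):
{noformat}
# Toy stand-in for pyspark's Row (the real code is in pyspark/sql/types.py):
# like Row, it resolves field names to positions in the underlying tuple.
class FakeRow(tuple):
    # six field names, as in res2's schema...
    __FIELDS__ = ("bar", "key", "other", "some", "that", "this")

    def __repr__(self):
        # pairs each field name with the value at the same position,
        # mirroring the genexpr at pyspark/sql/types.py line 1214
        return "Row(%s)" % ", ".join(
            "%s=%r" % (n, self[i]) for i, n in enumerate(self.__FIELDS__))

# ...but only two values, as in res1's rows: repr() walks past the tuple's end
row = FakeRow((31495, u"31495"))
print repr(row)  # IndexError: tuple index out of range
{noformat}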

> pyspark.sql nondeterministic issue with row fields
> --------------------------------------------------
>
>                 Key: SPARK-6677
>                 URL: https://issues.apache.org/jira/browse/SPARK-6677
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.3.0
>         Environment: spark version: spark-1.3.0-bin-hadoop2.4
> python version: Python 2.7.6
> operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Stefano Parmesan
>              Labels: pyspark, row, sql
>
> The following issue happens only when running pyspark in the python 
> interpreter; it works correctly with spark-submit.
> Reading two JSON files containing objects with different structures 
> sometimes leads to the definition of wrong Rows, where the fields of one 
> file are used for the other.
> I was able to write sample code that reproduces this issue one out of three 
> times; the code snippet is available at the following link, together with 
> some (very simple) data samples:
> https://gist.github.com/armisael/e08bb4567d0a11efe2db


