Kalle Jepsen created SPARK-17294:
------------------------------------

             Summary: Caching invalidates data on mildly wide dataframes
                 Key: SPARK-17294
                 URL: https://issues.apache.org/jira/browse/SPARK-17294
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.0, 1.6.2
            Reporter: Kalle Jepsen


Caching a dataframe with more than 200 columns causes the data within to be 
read back as zeros under certain circumstances.

Consider the following code, where we create a one-row dataframe containing the 
numbers from 0 to 200.

{code}
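from pyspark.sql import functions as F

# `spark` is assumed to be an existing SparkSession
n_cols = 201
rng = list(range(n_cols))  # a list, so schema inference also works on Python 3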
df = spark.createDataFrame(
    data=[rng]
)

last = df.columns[-1]
print(df.select(last).collect())
df.select(F.greatest(*df.columns).alias('greatest')).show()
{code}

Returns:

{noformat}
[Row(_201=200)]

+--------+
|greatest|
+--------+
|     200|
+--------+
{noformat}

As expected, column {{_201}} contains the number 200 and the greatest value 
within that single row is 200.

Now if we call {{.cache()}} on {{df}}:

{code}
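from pyspark.sql import functions as F

# `spark` is assumed to be an existing SparkSession
n_cols = 201
rng = list(range(n_cols))  # a list, so schema inference also works on Python 3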
df = spark.createDataFrame(
    data=[rng]
).cache()

last = df.columns[-1]
print(df.select(last).collect())
df.select(F.greatest(*df.columns).alias('greatest')).show()
{code}

Returns:

{noformat}
[Row(_201=200)]

+--------+
|greatest|
+--------+
|       0|
+--------+
{noformat}

The last column {{_201}} still seems to contain the correct value, but when I 
try to select the greatest value within the row, 0 is returned. When I call 
{{.show()}} on the dataframe, all values are zero. As soon as I restrict the 
selection to fewer than 200 columns, everything looks fine again.

When the number of columns is < 200 from the beginning, even the cache will not 
break things and everything works as expected.

It doesn't matter whether the data is loaded from disk or created on the fly, 
and this happens in Spark 1.6.2 and 2.0.0 (I haven't tested other versions).
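
To help narrow down where things break, here is a minimal sketch of a sweep over 
column counts that compares the cached and uncached result of {{F.greatest}}. 
The helper names {{first_inconsistent_width}} and {{cached_matches_uncached}} 
are made up for illustration and are not part of Spark; {{spark}} is assumed to 
be an existing SparkSession (on 1.6.2 a {{sqlContext}} would be passed instead).

```python
def first_inconsistent_width(is_consistent, widths):
    """Return the first width for which is_consistent(width) is False, else None."""
    for n_cols in widths:
        if not is_consistent(n_cols):
            return n_cols
    return None


def cached_matches_uncached(spark, n_cols):
    """Compare greatest() over one row 0..n_cols-1, with and without cache().

    n_cols must be at least 2, because greatest() needs two or more columns.
    """
    # Imported lazily so the sweep helper above can be used without PySpark.
    from pyspark.sql import functions as F

    row = list(range(n_cols))
    plain = spark.createDataFrame(data=[row])
    cached = spark.createDataFrame(data=[row]).cache()

    def greatest_of(df):
        result = df.select(F.greatest(*df.columns).alias('greatest'))
        return result.collect()[0]['greatest']

    try:
        return greatest_of(plain) == greatest_of(cached)
    finally:
        # Release the cached data before the next width is tried.
        cached.unpersist()


# Hypothetical usage against a live session:
# widths = range(190, 211)
# print(first_inconsistent_width(
#     lambda n: cached_matches_uncached(spark, n), widths))
```

The comparison is kept separate from the sweep so that the same loop can be 
reused with a different check, for example one that reads the data from disk.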

Can anyone confirm this?


