[ https://issues.apache.org/jira/browse/SPARK-17294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-17294.
-------------------------------
    Resolution: Duplicate

Duplicate #5; a popular issue.

> Caching invalidates data on mildly wide dataframes
> --------------------------------------------------
>
>                 Key: SPARK-17294
>                 URL: https://issues.apache.org/jira/browse/SPARK-17294
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Kalle Jepsen
>
> Caching a dataframe with more than 200 columns causes the data within it to simply vanish under certain circumstances.
> Consider the following code, where we create a one-row dataframe containing the numbers from 0 to 200:
> {code}
> from pyspark.sql import functions as F
>
> # `spark` is the SparkSession provided by the pyspark shell
> n_cols = 201
> rng = range(n_cols)
> df = spark.createDataFrame(
>     data=[rng]
> )
> last = df.columns[-1]
> print(df.select(last).collect())
> df.select(F.greatest(*df.columns).alias('greatest')).show()
> {code}
> Returns:
> {noformat}
> [Row(_201=200)]
> +--------+
> |greatest|
> +--------+
> |     200|
> +--------+
> {noformat}
> As expected, column {{_201}} contains the number 200, and the greatest value within that single row is 200.
> Now we introduce a {{.cache()}} on {{df}}:
> {code}
> from pyspark.sql import functions as F
>
> n_cols = 201
> rng = range(n_cols)
> df = spark.createDataFrame(
>     data=[rng]
> ).cache()
> last = df.columns[-1]
> print(df.select(last).collect())
> df.select(F.greatest(*df.columns).alias('greatest')).show()
> {code}
> Returns:
> {noformat}
> [Row(_201=200)]
> +--------+
> |greatest|
> +--------+
> |       0|
> +--------+
> {noformat}
> The last column {{_201}} still seems to contain the correct value, but when I try to select the greatest value within the row, 0 is returned. When I issue {{.show()}} on the dataframe, all values are zero. As soon as I limit the selection to fewer than 200 columns, everything looks fine again.
> When the number of columns is below 200 from the beginning, even the cache does not break things and everything works as expected.
> It doesn't matter whether the data is loaded from disk or created on the fly, and this happens in Spark 1.6.2 and 2.0.0 (I haven't tested anything else).
> Can anyone confirm this?
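For anyone hitting this before upgrading past the affected versions: below is a minimal sketch of the workaround the report describes, i.e. restricting the selection to fewer than 200 columns after caching. It assumes a running {{SparkSession}} bound to {{spark}} (as in the {{pyspark}} shell); the name {{subset}} and the 199-column cutoff are illustrative choices, not part of the report.

{code}
from pyspark.sql import functions as F

# Build the same one-row, 201-column dataframe from the report and cache it.
n_cols = 201
df = spark.createDataFrame(data=[list(range(n_cols))]).cache()

# Selecting across all 201 columns reads back zeros on the affected
# versions, per the report. Restricting the selection to fewer than
# 200 columns returns correct values again.
subset = df.columns[:199]  # illustrative cutoff below the 200-column threshold
df.select(F.greatest(*subset).alias('greatest')).show()
{code}

This merely narrows the projection, following the reporter's observation; it is a stopgap, not a fix for the cached data itself.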