[ https://issues.apache.org/jira/browse/SPARK-17294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-17294.
-------------------------------
    Resolution: Duplicate

Duplicate #5; a popular issue.

> Caching invalidates data on mildly wide dataframes
> --------------------------------------------------
>
>                 Key: SPARK-17294
>                 URL: https://issues.apache.org/jira/browse/SPARK-17294
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Kalle Jepsen
>
> Caching a dataframe with more than 200 columns causes the data within it to simply vanish under certain circumstances.
> Consider the following code, where we create a one-row dataframe containing the numbers from 0 to 200:
> {code}
> from pyspark.sql import functions as F
>
> # `spark` is the SparkSession provided by the pyspark shell
> n_cols = 201
> rng = range(n_cols)
> df = spark.createDataFrame(
>     data=[rng]
> )
> last = df.columns[-1]
> print(df.select(last).collect())
> df.select(F.greatest(*df.columns).alias('greatest')).show()
> {code}
> Returns:
> {noformat}
> [Row(_201=200)]
> +--------+
> |greatest|
> +--------+
> |     200|
> +--------+
> {noformat}
> As expected, column {{_201}} contains the number 200, and the greatest value within that single row is 200.
> Now we introduce a {{.cache()}} on {{df}}:
> {code}
> from pyspark.sql import functions as F
>
> n_cols = 201
> rng = range(n_cols)
> df = spark.createDataFrame(
>     data=[rng]
> ).cache()
> last = df.columns[-1]
> print(df.select(last).collect())
> df.select(F.greatest(*df.columns).alias('greatest')).show()
> {code}
> Returns:
> {noformat}
> [Row(_201=200)]
> +--------+
> |greatest|
> +--------+
> |       0|
> +--------+
> {noformat}
> The last column {{_201}} still seems to contain the correct value, but when I try to select the greatest value within the row, 0 is returned. When I issue {{.show()}} on the dataframe, all values are zero. As soon as I limit the selection to fewer than 200 columns, everything looks fine again.
> When the number of columns is below 200 from the beginning, even the cache does not break things and everything works as expected.
> It doesn't matter whether the data is loaded from disk or created on the fly, and this happens in Spark 1.6.2 and 2.0.0 (I haven't tested anything else).
> Can anyone confirm this?
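For anyone hitting this before upgrading past the affected versions: below is a minimal sketch of the workaround the report describes, i.e. restricting the selection to fewer than 200 columns after caching. It assumes a running {{SparkSession}} bound to {{spark}} (as in the {{pyspark}} shell); the name {{subset}} and the 199-column cutoff are illustrative choices, not part of the report.

{code}
from pyspark.sql import functions as F

# Build the same one-row, 201-column dataframe from the report and cache it.
n_cols = 201
df = spark.createDataFrame(data=[list(range(n_cols))]).cache()

# Selecting across all 201 columns reads back zeros on the affected
# versions, per the report. Restricting the selection to fewer than
# 200 columns returns correct values again.
subset = df.columns[:199]  # illustrative cutoff below the 200-column threshold
df.select(F.greatest(*subset).alias('greatest')).show()
{code}

This merely narrows the projection, following the reporter's observation; it is a stopgap, not a fix for the cached data itself.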