[ https://issues.apache.org/jira/browse/SPARK-17043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-17043.
-------------------------------
    Resolution: Duplicate

> Cannot call zipWithIndex on RDD with more than 200 columns (get wrong result)
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-17043
>                 URL: https://issues.apache.org/jira/browse/SPARK-17043
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Barry Becker
>
> I have a method that adds a row index column to a DataFrame. It only works
> correctly if the DataFrame has fewer than 200 columns. With more than 200
> columns, nearly all of the data comes back empty (empty strings for the values).
> {code}
> def zipWithIndex(df: DataFrame, rowIdxColName: String): DataFrame = {
>   val nullable = false
>   df.sparkSession.createDataFrame(
>     df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) },
>     StructType(df.schema.fields :+ StructField(rowIdxColName, LongType, nullable))
>   )
> }
> {code}
> This might be related to https://issues.apache.org/jira/browse/SPARK-16664,
> but I'm not sure; the 200-column threshold there made me think the two could
> be related. I saw this problem in Spark 1.6.2 and 2.0.0. Maybe it is fixed in
> 2.0.1 (I have not tried it yet). I have no idea why the 200-column threshold
> is significant.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
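
A minimal, self-contained sketch of the reported setup, in case it helps with reproduction. The 250-column width, the c0/v0-style column and value names, the local master, and the ZipWithIndexRepro object name are illustrative assumptions, not part of the original report; the zipWithIndex helper itself is the one quoted in the description above.

{code}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object ZipWithIndexRepro {

  // Same helper as in the issue description: append a Long row-index column.
  def zipWithIndex(df: DataFrame, rowIdxColName: String): DataFrame = {
    val nullable = false
    df.sparkSession.createDataFrame(
      df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) },
      StructType(df.schema.fields :+ StructField(rowIdxColName, LongType, nullable))
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("zipWithIndex-repro").getOrCreate()

    // Build a single-row DataFrame wider than the reported 200-column threshold.
    val numCols = 250
    val schema = StructType((0 until numCols).map(i => StructField(s"c$i", StringType, nullable = false)))
    val row = Row.fromSeq((0 until numCols).map(i => s"v$i"))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(row)), schema)

    // According to the report, values come back as empty strings once the column
    // count exceeds ~200, while the same code works with fewer columns.
    zipWithIndex(df, "rowIdx").show(truncate = false)

    spark.stop()
  }
}
{code}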