Trying to build an ML model using LogisticRegression, I ran into the following unexplainable issue. Here's the snippet of code that reproduces it:

    training, testing = data.randomSplit([0.8, 0.2], seed=42)
    print("number of rows in testing = {}".format(testing.count()))
    print("number of rows in training = {}".format(training.count()))
    testing.coalesce(1).write.json('testing')
    training.coalesce(1).write.json('training')
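For what it's worth, my only guess so far is that the split is somehow order-dependent: if the rows are re-evaluated in a different order between the count() calls and the write, the same seed could pick out different rows. Here is a minimal pure-Python sketch of that idea (just an illustration of a seeded, order-dependent split, not Spark's actual implementation; the `split` function and 80/20 threshold are my own assumptions):

    import random

    def split(rows, seed=42):
        # Bernoulli-style split: a seeded RNG decides train/test for each
        # row IN ARRIVAL ORDER, so the result depends on row ordering.
        rng = random.Random(seed)
        train, test = [], []
        for r in rows:
            (train if rng.random() < 0.8 else test).append(r)
        return train, test

    data = list(range(1390))
    train1, test1 = split(data)            # first evaluation of the "lineage"

    reordered = data[:]
    random.Random(7).shuffle(reordered)    # same rows, recomputed in another order
    train2, test2 = split(reordered)       # second evaluation, same seed

    # Same seed and same row count, so the split SIZES match, but
    # different rows end up on each side of the split.
    print(len(train1), len(test1))
    print(sorted(train1) == sorted(train2))

If something like this is what randomSplit does internally, it would explain why the totals still add up to 1390 while the per-file contents differ from the earlier counts.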
The first two print statements give (there should be 1390 total rows in the data set):

    number of rows in testing = 290
    number of rows in training = 1100

The thing I can't explain is that the json files for testing and training contain 805 and 585 rows (json objects), respectively. That adds up to the expected 1390, but somehow, after coalescing and writing, the number of objects in each data set has changed! I have no clue why. Is this a bug?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DataFrameWriter-bug-after-RandomSplit-tp27582.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.