Trying to build an ML model using LogisticRegression, I ran into the following unexplainable issue. Here's the snippet of code that reproduces it:

    training, testing = data.randomSplit([0.8, 0.2], seed=42)
    print("number of rows in testing = {}".format(testing.count()))
    print("number of rows in training = {}".format(training.count()))
    testing.coalesce(1).write.json('testing')
    training.coalesce(1).write.json('training')
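For what it's worth, my only guess so far is that the split is somehow order-dependent: if the rows are re-evaluated in a different order between the count() calls and the write, the same seed could pick out different rows. Here is a minimal pure-Python sketch of that idea (just an illustration of a seeded, order-dependent split, not Spark's actual implementation; the `split` function and 80/20 threshold are my own assumptions):

    import random

    def split(rows, seed=42):
        # Bernoulli-style split: a seeded RNG decides train/test for each
        # row IN ARRIVAL ORDER, so the result depends on row ordering.
        rng = random.Random(seed)
        train, test = [], []
        for r in rows:
            (train if rng.random() < 0.8 else test).append(r)
        return train, test

    data = list(range(1390))
    train1, test1 = split(data)            # first evaluation of the "lineage"

    reordered = data[:]
    random.Random(7).shuffle(reordered)    # same rows, recomputed in another order
    train2, test2 = split(reordered)       # second evaluation, same seed

    # Same seed and same row count, so the split SIZES match, but
    # different rows end up on each side of the split.
    print(len(train1), len(test1))
    print(sorted(train1) == sorted(train2))

If something like this is what randomSplit does internally, it would explain why the totals still add up to 1390 while the per-file contents differ from the earlier counts.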
The first two print statements give (there should be 1390 total rows in the data set):

    number of rows in testing = 290
    number of rows in training = 1100

The thing I can't explain is that the json files for testing and training contain 805 and 585 rows (json objects), respectively. That adds up to the expected 1390, but somehow, after coalescing and writing, the number of objects in each data set has changed! I have no clue why. Is this a bug?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DataFrameWriter-bug-after-RandomSplit-tp27582.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.