Ivan Gozali created SPARK-19352:
-----------------------------------

             Summary: Sorting issues on relatively big datasets
                 Key: SPARK-19352
                 URL: https://issues.apache.org/jira/browse/SPARK-19352
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0
         Environment: Spark version 2.1.0
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
macOS 10.12.3
            Reporter: Ivan Gozali
_More details, including the script to generate the synthetic dataset (requires pandas and numpy), are in this GitHub gist:_
https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f

Given a relatively large synthetic time series dataset of various users (4.1GB), when attempting to:
* partition this dataset by user ID
* sort the time series data for each user by timestamp
* write each partition to a single CSV file

some of the resulting files are unsorted in a very specific manner. In one of the supposedly sorted files, the rows looked as follows:

{code}
2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39
2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22
2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47
2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14
2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24
2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43
{code}

The above was attempted using the following Scala/Spark code:

{code}
val inpth = "/tmp/gen_data_3cols_small"
spark
  .read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(inpth)
  .repartition($"userId")
  .sortWithinPartitions("timestamp")
  .write
  .partitionBy("userId")
  .option("header", "true")
  .csv(inpth + "_sorted")
{code}

This issue is not seen with a smaller dataset produced by shrinking the time span (354MB, same number of columns).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
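The mis-ordering above is easy to detect mechanically: the output files should have a monotonically non-decreasing first column. Below is a minimal Python sketch (not part of the original report; the helper name is my own) that checks this, run against the six rows quoted above. It assumes, as in the sample data, that every timestamp carries the same fixed UTC offset, so plain string comparison of the ISO-8601 values is order-correct.

```python
import csv

def is_sorted_by_timestamp(lines):
    """Return True if the ISO-8601 timestamps in column 0 are non-decreasing.

    With a uniform fixed offset (here -08:00 throughout), ISO-8601
    timestamps compare correctly as plain strings.
    """
    timestamps = [row[0] for row in csv.reader(lines)]
    return all(a <= b for a, b in zip(timestamps, timestamps[1:]))

# The interleaved rows from the report above.
bad_rows = """\
2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39
2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22
2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47
2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14
2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24
2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43
""".splitlines()

print(is_sorted_by_timestamp(bad_rows))          # False: rows are interleaved
print(is_sorted_by_timestamp(sorted(bad_rows)))  # True once re-sorted
```

Running such a check over every file in the output directory makes it straightforward to confirm which partitions came out unsorted.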