[pyspark] How to ensure rdd.takeSample produces the same set every time?

2016-08-22 Thread Chua Jie Sheng
Hi all, I have been trying to make rdd.takeSample produce the same set on different machines, but without success. I seeded the method with the same value on each machine, but the results differ. Any idea why? Best Regards Jie Sheng

Dataframe corrupted when sqlContext.read.json on a Gzipped file that contains more than one file

2016-08-21 Thread Chua Jie Sheng
Hi Spark user list! I have been encountering corrupted records when reading Gzipped files that contain more than one file. Example: I have two .json files, [a.json, b.json]. Each has multiple records (one line, one record). I tar both of them together on Mac OS X, 10.11.6 bsdtar 2.8.3 -
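
One plausible explanation, not confirmed anywhere in this thread, is that a gzipped tar archive is not a plain concatenation of its members: each file is preceded by a 512-byte tar header and followed by NUL padding. A reader that only undoes the gzip layer and then splits on newlines would see those header bytes glued onto the first record of each file. The sketch below reproduces that effect with the Python standard library only (no Spark). The helper names and the sample records are hypothetical, and the archive is built in memory with Python's tarfile module rather than bsdtar, so details may differ from the original setup.

```python
import gzip
import io
import json
import tarfile


def build_tar_gz(files):
    """Return the bytes of a .tar.gz archive holding the given {name: text} files."""
    buf = io.BytesIO()
    # USTAR format keeps the headers simple (no extended pax records).
    with tarfile.open(fileobj=buf, mode="w:gz", format=tarfile.USTAR_FORMAT) as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def count_bad_lines(raw):
    """Count non-empty lines that fail to parse as JSON."""
    bad = 0
    for line in raw.split(b"\n"):
        # Skip lines that are empty or consist only of tar NUL padding.
        if not line.strip(b"\x00 \t\r"):
            continue
        try:
            json.loads(line.decode("utf-8", errors="replace"))
        except ValueError:
            bad += 1
    return bad


def read_records(archive):
    """Parse one JSON record per line from every member of a .tar.gz archive."""
    records = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            for line in tar.extractfile(member).read().splitlines():
                if line.strip():
                    records.append(json.loads(line.decode("utf-8")))
    return records


# Hypothetical sample data: two files, one JSON record per line.
archive = build_tar_gz({
    "a.json": '{"id": 1}\n{"id": 2}\n',
    "b.json": '{"id": 3}\n{"id": 4}\n',
})

# Undoing only the gzip layer leaves the tar headers mixed into the text.
print("unparseable lines:", count_bad_lines(gzip.decompress(archive)))

# Reading the members through tarfile recovers clean records.
print("records:", read_records(archive))
```

If this is indeed the cause, gzipping each .json file on its own (without tar), or unpacking the archive before handing the files to sqlContext.read.json, would avoid the header bytes; this is a suggestion to test, not a verified fix.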