Hi Spark user list!
I have been encountering corrupted records when reading gzipped archives
that contain more than one file.
Example:
I have two .json files, [a.json, b.json].
Each has multiple records (one record per line).
I tar and gzip both of them together on
Mac OS X 10.11.6
(bsdtar 2.8.3 - libarchive 2.8.3):
tar -czf a.tgz *.json
When I attempt to read them (via PySpark):
from pyspark.sql import SQLContext
filename = "a.tgz"
sqlContext = SQLContext(sc)
datasets = sqlContext.read.json(filename)
datasets.show(1, truncate=False)
The first record is always corrupted and ends up in _corrupt_record.
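As a sanity check, here is a plain-Python sketch (no Spark involved) of what the decompressed stream looks like if only the gzip layer is stripped; I am assuming that is what Spark's gzip codec does with a .tgz. The file names and contents below are made up just to illustrate:

```python
import gzip
import io
import tarfile

# Build a tiny .tgz in memory containing two one-record-per-line JSON files
# (hypothetical contents, just to illustrate the layout).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tf:
    for name, text in [("a.json", '{"k": 1}\n'), ("b.json", '{"k": 2}\n')]:
        data = text.encode()
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

# Strip only the gzip layer (what a gzip codec would do): the result is a
# raw tar stream, so the first "line" starts with the 512-byte tar header,
# not with valid JSON.
tar_stream = gzip.decompress(buf.getvalue())
first_line = tar_stream.split(b"\n", 1)[0]
print(first_line[:20])  # begins with the file name from the tar header
```

If that assumption holds, it would explain why the first record of each archived file fails to parse as JSON.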
Does anyone have any idea whether this is a feature or a defect?
Best Regards
Jie Sheng
Important: This email is confidential and may be privileged. If you are not
the intended recipient, please delete it and notify us immediately; you
should not copy or use it for any purpose, nor disclose its contents to any
other person. Thank you.