Hi Spark user list! I have been encountering corrupted records when reading gzipped tar archives (.tgz) that contain more than one file.
Example: I have two JSON files, a.json and b.json, each containing multiple records (one record per line). I tar and gzip them together on Mac OS X 10.11.6 (bsdtar 2.8.3 - libarchive 2.8.3):

    tar -czf a.tgz *.json

When I attempt to read the archive (via Python):

    filename = "a.tgz"
    sqlContext = SQLContext(sc)
    datasets = sqlContext.read.json(filename)
    datasets.show(1, truncate=False)

the first record always comes back corrupted, showing up in _corrupt_record. Does anyone know whether this is a feature or a defect?

Best Regards
Jie Sheng
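For what it's worth, the symptom can be reproduced without Spark. This is a minimal sketch, assuming the JSON reader strips only the gzip layer and then treats the remaining bytes as plain text: a .tgz is a gzipped tar stream, and the first 512 bytes of a tar stream are the tar header for the first member (its file name padded with NUL bytes), not JSON.

```python
import gzip
import io
import json
import tarfile

# Build a tiny .tgz in memory containing two one-line JSON files,
# mirroring the a.json / b.json setup in the report above.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, rec in [("a.json", {"k": 1}), ("b.json", {"k": 2})]:
        data = (json.dumps(rec) + "\n").encode()
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Strip only the gzip layer, as a line-oriented text reader would:
# what remains is the raw tar stream, whose first 512 bytes are the
# tar header for a.json rather than a JSON record.
raw = gzip.decompress(buf.getvalue())
first_line = raw.split(b"\n", 1)[0]
print(first_line[:20])  # begins with b"a.json" padded with NULs, not "{"
```

So the first "line" the reader sees is tar header bytes, which would land in _corrupt_record exactly as described.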