Hi Spark user list!
I have been encountering corrupted records when reading gzipped archives
that contain more than one file.
Example:
I have two .json files, [a.json, b.json].
Each has multiple records (one record per line).
I tar and gzip both of them together on
Mac OS X 10.11.6
(bsdtar 2.8.3 - libarchive 2.8.3):
tar -czf a.tgz *.json
When I attempt to read them (via PySpark):
from pyspark.sql import SQLContext
filename = "a.tgz"
sqlContext = SQLContext(sc)
datasets = sqlContext.read.json(filename)
datasets.show(1, truncate=False)
The first record is always corrupted and ends up in _corrupt_record.
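As a sanity check, here is a plain-Python sketch (no Spark involved) of what the decompressed stream looks like if only the gzip layer is stripped; I am assuming that is what Spark's gzip codec does with a .tgz. The file names and contents below are made up just to illustrate:

```python
import gzip
import io
import tarfile

# Build a tiny .tgz in memory containing two one-record-per-line JSON files
# (hypothetical contents, just to illustrate the layout).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tf:
    for name, text in [("a.json", '{"k": 1}\n'), ("b.json", '{"k": 2}\n')]:
        data = text.encode()
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

# Strip only the gzip layer (what a gzip codec would do): the result is a
# raw tar stream, so the first "line" starts with the 512-byte tar header,
# not with valid JSON.
tar_stream = gzip.decompress(buf.getvalue())
first_line = tar_stream.split(b"\n", 1)[0]
print(first_line[:20])  # begins with the file name from the tar header
```

If that assumption holds, it would explain why the first record of each archived file fails to parse as JSON.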
Does anyone have any idea whether this is a feature or a defect?
Best Regards
Jie Sheng
Important: This email is confidential and may be privileged. If you are not
the intended recipient, please delete it and notify us immediately; you
should not copy or use it for any purpose, nor disclose its contents to any
other person. Thank you.