Hi Spark user list! I have been encountering corrupted records when reading gzipped tar archives (.tgz) that contain more than one file.
Example: I have two JSON files, a.json and b.json, each containing multiple records (one record per line). I tar and gzip them together on Mac OS X 10.11.6 (bsdtar 2.8.3 - libarchive 2.8.3):

    tar -czf a.tgz *.json

When I attempt to read the archive (via Python):

    filename = "a.tgz"
    sqlContext = SQLContext(sc)
    datasets = sqlContext.read.json(filename)
    datasets.show(1, truncate=False)

the first record always comes back corrupted, showing up in _corrupt_record. Does anyone know whether this is a feature or a defect?

Best Regards
Jie Sheng
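For what it's worth, the symptom can be reproduced without Spark. This is a minimal sketch, assuming the JSON reader strips only the gzip layer and then treats the remaining bytes as plain text: a .tgz is a gzipped tar stream, and the first 512 bytes of a tar stream are the tar header for the first member (its file name padded with NUL bytes), not JSON.

```python
import gzip
import io
import json
import tarfile

# Build a tiny .tgz in memory containing two one-line JSON files,
# mirroring the a.json / b.json setup in the report above.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, rec in [("a.json", {"k": 1}), ("b.json", {"k": 2})]:
        data = (json.dumps(rec) + "\n").encode()
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Strip only the gzip layer, as a line-oriented text reader would:
# what remains is the raw tar stream, whose first 512 bytes are the
# tar header for a.json rather than a JSON record.
raw = gzip.decompress(buf.getvalue())
first_line = raw.split(b"\n", 1)[0]
print(first_line[:20])  # begins with b"a.json" padded with NULs, not "{"
```

So the first "line" the reader sees is tar header bytes, which would land in _corrupt_record exactly as described.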