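You are attempting to read a gzipped tar archive (a.tgz), not a gzipped JSON file. That won't work: Spark does not unpack tar archives, so the tar header bytes most likely end up in front of your first record. A gzip-compressed JSON file (for example, a.json.gz) would.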
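
If re-creating the files is inconvenient, something along these lines could repack the archive into one gzip file per JSON file. This is only a minimal sketch using the Python standard library; the function name `repack_tar_as_gzip_json` and the output directory name are made up for illustration, and it assumes every regular file in the archive is line-delimited JSON.

```python
import gzip
import os
import shutil
import tarfile


def repack_tar_as_gzip_json(tar_path, out_dir):
    """Write each regular file in the tar archive as its own .gz file in out_dir.

    Hypothetical helper for illustration; returns the list of paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    # "r:*" lets tarfile detect the compression (gzip for a .tgz).
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            out_name = os.path.basename(member.name) + ".gz"
            out_path = os.path.join(out_dir, out_name)
            # Stream the member's bytes straight into a gzip file, without the tar header.
            with archive.extractfile(member) as src, gzip.open(out_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(out_path)
    return written
```

Calling `repack_tar_as_gzip_json("a.tgz", "json_gz")` should leave a.json.gz and b.json.gz in json_gz/, and `sqlContext.read.json("json_gz")` should then be able to read the whole directory. Alternatively, running `gzip a.json b.json` instead of tar avoids the problem in the first place.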

On Sun, Aug 21, 2016, 12:52 Chua Jie Sheng <chuajiesh...@gmail.com> wrote:

> Hi Spark user list!
>
> I have been encountering corrupted records when reading Gzipped files that
> contains more than one file.
>
> Example:
> I have two .json file, [a.json, b.json]
> Each have multiple records (one line, one record).
>
> I tar both of them together on
>
> Mac OS X, 10.11.6
> bsdtar 2.8.3 - libarchive 2.8.3
>
> i.e. tar -czf a.tgz *.json
>
>
> When I attempt to read them (via Python):
>
> filename = "a.tgz"
> sqlContext = SQLContext(sc)
> datasets = sqlContext.read.json(filename)
>
> datasets.show(1, truncate=False)
>
>
> My first record will always be corrupted, showing up in _corrupt_record.
>
> Does anyone have any idea if it is feature or a defect?
>
> Best Regards
> Jie Sheng
>
> Important: This email is confidential and may be privileged. If you are
> not the intended recipient, please delete it and notify us immediately; you
> should not copy or use it for any purpose, nor disclose its contents to any
> other person. Thank you.