Re: Dataframe corrupted when sqlContext.read.json on a Gzipped file that contains more than one file

Sean Owen Sun, 21 Aug 2016 05:07:58 -0700

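You are attempting to read a tar archive, not a gzipped JSON file. That won't
work: Spark decompresses gzip transparently, but it does not unpack tar, so the
tar headers end up mixed into your records. A gzip-compressed JSON file
(e.g. a.json.gz) would work.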
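
If you are stuck with an existing .tgz, one workaround is to repack it outside
Spark so that each JSON file becomes its own gzip file. The following is only a
minimal sketch using Python's standard tarfile and gzip modules; the helper
name tgz_to_gzipped_json and the output directory "json_gz" are made up for
illustration, and it assumes each archive member fits in memory.

```python
import gzip
import os
import tarfile


def tgz_to_gzipped_json(tgz_path, out_dir):
    """Rewrite each .json member of a .tgz as a standalone .json.gz file.

    Hypothetical helper, not part of Spark. Returns the paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with tarfile.open(tgz_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Skip directories and anything that is not a .json file.
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            source = archive.extractfile(member)
            target = os.path.join(out_dir, os.path.basename(member.name) + ".gz")
            # Assumption: the member is small enough to read in one go.
            with gzip.open(target, "wb") as out:
                out.write(source.read())
            written.append(target)
    return written


# Usage (paths are assumptions), then point Spark at the directory:
#   tgz_to_gzipped_json("a.tgz", "json_gz")
#   datasets = sqlContext.read.json("json_gz")
```

Each resulting .json.gz contains only JSON lines, so no tar headers can leak
into the first record. Keep in mind that gzip files are not splittable, so
each file is read by a single task.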


On Sun, Aug 21, 2016, 12:52 Chua Jie Sheng <chuajiesh...@gmail.com> wrote:

> Hi Spark user list!
>
> I have been encountering corrupted records when reading Gzipped files that
> contain more than one file.
>
> Example:
> I have two .json files, [a.json, b.json].
> Each has multiple records (one line, one record).
>
> I tar both of them together on
>
> Mac OS X, 10.11.6
> bsdtar 2.8.3 - libarchive 2.8.3
>
> i.e. tar -czf a.tgz *.json
>
>
> When I attempt to read them (via Python):
>
> filename = "a.tgz"
> sqlContext = SQLContext(sc)
> datasets = sqlContext.read.json(filename)
>
> datasets.show(1, truncate=False)
>
>
> My first record will always be corrupted, showing up in _corrupt_record.
>
> Does anyone have any idea whether this is a feature or a defect?
>
> Best Regards
> Jie Sheng
>
