Well, my guess is to just pkunzip it and then re-compress it with bzip2, or leave it as it
is.
Databricks handles *.bz2 type files. I know that.
Anyway that is the easy part :)
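For what it's worth, a minimal sketch of reading a bzip2-compressed CSV through the spark-csv package (the HDFS path and options are placeholders; this assumes a spark-shell started with the --packages flag shown later in this thread):

```scala
// Sketch only: reading a bzip2-compressed CSV with spark-csv 1.x.
// `sc` is the SparkContext provided by spark-shell; the path is a placeholder.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line of each file is a header
  .option("inferSchema", "true")  // infer column types from the data
  .load("hdfs:///staging/data.csv.bz2")  // Hadoop's Bzip2Codec decompresses transparently
df.printSchema()
```

Hadoop's built-in codecs cover .bz2 and .gz, which is why re-compressing to bzip2 works; plain .zip (pkzip) has no built-in codec, hence the suggestion to unzip first.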
Dr Mich Talebzadeh
LinkedIn *
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Hi Mich,
I forgot to mention that the source data provider gives us (Windows) pkzip-compressed
files; this is the ugly part. Will Spark uncompress these
automatically? I haven't been able to make it work.
Thanks,
Ben
> On Mar 30, 2016, at 2:27 PM, Mich Talebzadeh wrote:
Hi Ben,
Well, I have done it for standard CSV files downloaded from spreadsheets to a
staging directory on HDFS and loaded from there.
First, you may not need to unzip them: Databricks can read them (in my
case) as zipped files.
Check this. Mine is slightly different from what you have. First I
Hi Mich,
You are correct. I am talking about the Databricks package spark-csv you have
below.
The files are stored in s3 and I download, unzip, and store each one of them in
a variable as a string using the AWS SDK (aws-java-sdk-1.10.60.jar).
Here is some of the code.
val filesRdd =
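(The snippet is cut off here.) As one possible sketch of the missing step, the unzipped strings can be split into lines and fields in plain Scala before being parallelized into an RDD; `parseCsvString` and `fileStrings` are hypothetical names, and this naive split does not handle quoted fields containing commas:

```scala
// Hypothetical helper: split one unzipped file's contents into CSV rows.
// Naive parsing only - no support for quoted fields with embedded commas.
def parseCsvString(contents: String): Seq[Array[String]] =
  contents
    .split("\n")
    .filter(_.trim.nonEmpty)  // drop blank lines
    .map(_.split(",", -1))    // -1 keeps trailing empty fields
    .toSeq

// In spark-shell one might then build the RDD like:
//   val filesRdd = sc.parallelize(fileStrings.flatMap(parseCsvString))
```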
Just to clarify, are you talking about the Databricks CSV package?
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0
Where are these zipped files? Are they copied to a staging directory in
HDFS?
HTH
I have a quick question. I have downloaded multiple zipped files from S3 and
unzipped each one of them into a string. The next step is to parse them with a CSV
parser. I want to know if there is an easy way to use the spark-csv package
for this?
Thanks,
Ben