Hi Mich,

You are correct. I am talking about the Databricks spark-csv package you reference below.
The files are stored in S3, and I download, unzip, and store each one of them in a variable as a string using the AWS SDK (aws-java-sdk-1.10.60.jar). Here is some of the code.

val filesRdd = sc.parallelize(lFiles, 250)
filesRdd.foreachPartition(files => {
  val s3Client = new AmazonS3Client(new EnvironmentVariableCredentialsProvider())
  files.foreach(file => {
    val s3Object = s3Client.getObject(new GetObjectRequest(s3Bucket, file))
    val zipFile = new ZipInputStream(s3Object.getObjectContent())
    val csvFile = readZipStream(zipFile)
  })
})

This function reads the zip entry and converts it to a single string.

def readZipStream(stream: ZipInputStream): String = {
  stream.getNextEntry
  val stuff = new ListBuffer[String]()
  val scanner = new Scanner(stream)
  while (scanner.hasNextLine) {
    stuff += scanner.nextLine
  }
  stuff.toList.mkString("\n")
}

The next step is to parse the CSV string and convert it to a DataFrame, which will populate a Hive/HBase table. I have sketched what I had in mind at the bottom of this message, after the quoted thread. If you can help, I would be truly grateful.

Thanks,
Ben

> On Mar 30, 2016, at 2:06 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Just to clarify, are you talking about the Databricks CSV package?
>
> $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0
>
> Where are these zipped files? Are they copied to a staging directory in HDFS?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 30 March 2016 at 15:17, Benjamin Kim <bbuil...@gmail.com> wrote:
> I have a quick question. I have downloaded multiple zipped files from S3 and
> unzipped each one of them into strings. The next step is to parse them using a
> CSV parser. I want to know if there is a way to easily use the spark-csv package
> for this?
>
> Thanks,
> Ben
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
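P.S. Here is the sketch I mentioned above for the CSV-to-DataFrame step. It is untested and rests on two assumptions: that spark-csv's CsvParser exposes a csvRdd method that takes an RDD[String] of CSV lines, and that I switch my foreachPartition to mapPartitions so the unzipped lines come back as an RDD instead of being discarded on the executors. It reuses lFiles/filesRdd, s3Bucket, and readZipStream from my snippet above; the table name is just a placeholder.

import java.util.zip.ZipInputStream
import com.amazonaws.auth.EnvironmentVariableCredentialsProvider
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.GetObjectRequest
import com.databricks.spark.csv.CsvParser
import org.apache.spark.rdd.RDD

// Return the CSV lines from each zip instead of dropping them in foreachPartition.
val csvLines: RDD[String] = filesRdd.mapPartitions { files =>
  val s3Client = new AmazonS3Client(new EnvironmentVariableCredentialsProvider())
  files.flatMap { file =>
    val s3Object = s3Client.getObject(new GetObjectRequest(s3Bucket, file))
    val zipStream = new ZipInputStream(s3Object.getObjectContent())
    readZipStream(zipStream).split("\n")   // one RDD element per CSV line
  }
}

// Assumption: CsvParser.csvRdd parses an RDD[String] of CSV lines into a DataFrame.
// If every file carries its own header row, the extra header lines would need to be
// filtered out first; here I assume only the first line is treated as the header.
val df = new CsvParser()
  .withUseHeader(true)
  .withInferSchema(true)
  .csvRdd(sqlContext, csvLines)

df.registerTempTable("csv_staging")   // placeholder name; insert into the Hive/HBase table from here

Does this look like a reasonable way to go about it?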