Hi Mich,

You are correct. I am talking about the Databricks spark-csv package you 
mention below.

The files are stored in S3. I download, unzip, and read each one of them into 
a variable as a string using the AWS SDK (aws-java-sdk-1.10.60.jar).

Here is some of the code.

val filesRdd = sc.parallelize(lFiles, 250)
filesRdd.foreachPartition(files => {
  // One S3 client per partition; credentials come from environment variables.
  val s3Client = new AmazonS3Client(new EnvironmentVariableCredentialsProvider())
  files.foreach(file => {
    val s3Object = s3Client.getObject(new GetObjectRequest(s3Bucket, file))
    val zipFile = new ZipInputStream(s3Object.getObjectContent())
    try {
      // csvFile now holds the unzipped CSV content as a single string.
      val csvFile = readZipStream(zipFile)
    } finally {
      zipFile.close() // also releases the underlying S3 connection
    }
  })
})

This function does the unzipping and converts the contents to a string.

def readZipStream(stream: ZipInputStream): String = {
  stream.getNextEntry // position at the first (and assumed only) entry
  val stuff = new ListBuffer[String]()
  val scanner = new Scanner(stream)
  while (scanner.hasNextLine) {
    stuff += scanner.nextLine
  }
  stuff.mkString("\n")
}
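For what it's worth, the function can be exercised without S3 by zipping a 
sample string in memory and reading it back. This is a self-contained sketch 
(it repeats readZipStream so it runs standalone; the sample CSV payload is 
made up):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.Scanner
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}
import scala.collection.mutable.ListBuffer

def readZipStream(stream: ZipInputStream): String = {
  stream.getNextEntry // position at the first (and assumed only) entry
  val stuff = new ListBuffer[String]()
  val scanner = new Scanner(stream)
  while (scanner.hasNextLine) {
    stuff += scanner.nextLine
  }
  stuff.mkString("\n")
}

// Zip a sample CSV payload in memory to stand in for the S3 object content.
val bytes = new ByteArrayOutputStream()
val zipOut = new ZipOutputStream(bytes)
zipOut.putNextEntry(new ZipEntry("sample.csv"))
zipOut.write("id,name\n1,alice\n2,bob".getBytes("UTF-8"))
zipOut.closeEntry()
zipOut.close()

// Reading it back should reproduce the original CSV text.
val result = readZipStream(new ZipInputStream(new ByteArrayInputStream(bytes.toByteArray)))
println(result)
```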

The next step is to parse the CSV string and convert it to a dataframe, which 
will populate a Hive/HBase table.
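To show the shape of that step, here is a minimal sketch of splitting the 
unzipped string into a header and rows in plain Scala. The helper csvToRows is 
hypothetical, and the naive split does not handle quoted fields containing 
commas or embedded newlines; spark-csv's parser does, which is the reason to 
hand the string over to it instead:

```scala
// Hypothetical helper: split an unzipped CSV string into a header and rows.
// Naive splitting only; a real CSV parser is needed for quoted fields.
def csvToRows(csv: String): (Array[String], Seq[Array[String]]) = {
  val lines = csv.split("\n").toSeq
  val header = lines.head.split(",")
  val rows = lines.tail.map(_.split(","))
  (header, rows)
}

val sample = "id,name\n1,alice\n2,bob"
val (header, rows) = csvToRows(sample)
println(header.mkString("|"))                     // id|name
println(rows.map(_.mkString("|")).mkString(";"))  // 1|alice;2|bob

// With spark-csv 1.x on the classpath, I believe (assumption, not verified
// here) its CsvParser can build a DataFrame directly from an RDD[String],
// avoiding the manual split above, along the lines of:
//   import com.databricks.spark.csv.CsvParser
//   val df = new CsvParser().withUseHeader(true)
//     .csvRdd(sqlContext, sc.parallelize(sample.split("\n")))
```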

If you can help, I would be truly grateful.

Thanks,
Ben


> On Mar 30, 2016, at 2:06 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Just to clarify: are you talking about the Databricks CSV package?
> 
> $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0
> 
> Where are these zipped files? Are they copied to a staging directory in hdfs?
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn: 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
> On 30 March 2016 at 15:17, Benjamin Kim <bbuil...@gmail.com> wrote:
> I have a quick question. I have downloaded multiple zipped files from S3 and 
> unzipped each one of them into a string. The next step is to parse them with 
> a CSV parser. Is there an easy way to use the spark-csv package for this?
> 
> Thanks,
> Ben
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 
