I am using Spark-CSV to load about 50 GB of data spread over roughly 10,000 CSV files into a couple of unified DataFrames. Since this process is slow, I have written this snippet:

    targetList.foreach { target =>
      // getTrace uses sqlContext.load on the list of files for this target,
      // applying the StructType built earlier from the schema files
      getTrace(target, sqlContext)
        .reduce(_ unionAll _)
        .registerTempTable(target.toUpperCase())
      sqlContext.sql("SELECT * FROM " + target.toUpperCase())
        .saveAsParquetFile(processedTraces + target)
    }

This loads the CSV files, unions all the files that share a schema, and writes each group out as a single Parquet file (with its part files). The problem is that both the CPU (not all cores are busy) and the disk (an SSD, at 1 MB/s at most) are barely utilized. What am I doing wrong?
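
For reference, here is a simplified variant of the same loop that skips the temp-table round trip; I am assuming that calling saveAsParquetFile directly on the unioned DataFrame is equivalent to the SELECT * pass above:

    targetList.foreach { target =>
      // build one DataFrame per CSV file for this target, then union them
      val unioned = getTrace(target, sqlContext).reduce(_ unionAll _)
      // write the unioned DataFrame straight to Parquet (no temp table / SQL pass)
      unioned.saveAsParquetFile(processedTraces + target)
    }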

snippet for getTrace: 

    def getTrace(target: String, sqlContext: SQLContext): Seq[DataFrame] = {
      // one DataFrame per CSV file under the target's log folder
      logFiles(mainLogFolder + target).map { file =>
        sqlContext.load(
          driver, // the data source name, e.g. "com.databricks.spark.csv"
          // schemaSelect builds the StructType once per target
          schemaSelect(schemaFile, target, sqlContext),
          Map("path" -> file, "header" -> "false", "delimiter" -> ","))
      }
    }
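
One thing I have been wondering about is whether I could avoid the per-file load plus unionAll entirely by pointing a single load at all of a target's files with a glob path (I am assuming here that spark-csv resolves globs the same way sc.textFile does). A rough sketch, with a hypothetical getTraceSingleLoad:

    // hypothetical variant: one load per target instead of one per file
    def getTraceSingleLoad(target: String, sqlContext: SQLContext): DataFrame =
      sqlContext.load(
        driver,
        schemaSelect(schemaFile, target, sqlContext),
        // glob over every file in the target's folder (assumption: globs are supported)
        Map("path" -> (mainLogFolder + target + "/*"),
            "header" -> "false", "delimiter" -> ","))

If globs do work, that would replace the reduce(_ unionAll _) step.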

Thanks for any help.




-----
regards,
mohamad
