Hi,

I'm new to Spark and am working on a proof of concept.  I'm using Spark
1.3.0 and running in local mode.

I can read and parse an RCFile using Spark; however, the performance is not
as good as I had hoped.
I'm testing with ~800k rows, and it takes about 30 minutes to process them.

Is there a better way to load and process an RCFile?  The only reference to
RCFile in 'Learning Spark' is in the Spark SQL chapter.  Is Spark SQL the
recommended way to work with RCFiles, and should I avoid the Spark core
API for them?
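
For reference, my understanding of the Spark SQL route from that chapter is
roughly the following, assuming the RCFile data has already been registered
as a Hive table (the table name "records" and the column list here are just
placeholders for my schema):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Assumes a Hive table named "records" is already defined over the RCFile data.
    val df = hiveContext.sql("SELECT time, id, name, application FROM records")
    df.take(10).foreach(println)

If that is the recommended approach I'm happy to go that route; I mainly want
to understand why my core-API version (below) is so slow.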

I'm using the following code (imports included) to build an RDD[Record]:

    import java.util.Properties
    import scala.collection.JavaConverters._
    import scala.collection.mutable.Buffer
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hive.ql.io.RCFileInputFormat
    import org.apache.hadoop.hive.serde2.columnar.{BytesRefArrayWritable, ColumnarSerDe}
    import org.apache.hadoop.hive.serde2.objectinspector.{StructField, StructObjectInspector}
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.{IntObjectInspector, StringObjectInspector}
    import org.apache.hadoop.io.LongWritable
    import org.apache.spark.rdd.RDD

    val records: RDD[Record] = sc.hadoopFile(
        rcFile,
        classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
        classOf[LongWritable],
        classOf[BytesRefArrayWritable])
      .map(x => (x._1.get, parse(x._2)))  // key = file offset, value = parsed Record
      .map(pair => pair._2)               // keep only the Record
The function parse is defined as:

  def parse(braw: BytesRefArrayWritable): Record = {
    // Note: a new ColumnarSerDe is built and initialized for every record.
    val serDe = new ColumnarSerDe()
    val tbl = new Properties()
    tbl.setProperty("serialization.format", "9")
    tbl.setProperty("columns", "time,id,name,application")
    tbl.setProperty("columns.types", "string:int:string:string")
    tbl.setProperty("serialization.null.format", "NULL")
    serDe.initialize(new Configuration(), tbl)

    val oi = serDe.getObjectInspector()
    val soi: StructObjectInspector = oi.asInstanceOf[StructObjectInspector]
    val fieldRefs: Buffer[_ <: StructField] = soi.getAllStructFieldRefs().asScala
    val row = serDe.deserialize(braw)

    // Pull the raw field data for each of the four columns.
    val timeRec = soi.getStructFieldData(row, fieldRefs(0))
    val idRec = soi.getStructFieldData(row, fieldRefs(1))
    val nameRec = soi.getStructFieldData(row, fieldRefs(2))
    val applicationRec = soi.getStructFieldData(row, fieldRefs(3))

    // Convert each field to a JVM value via its primitive object inspector.
    val timeOI = fieldRefs(0).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
    val time = timeOI.getPrimitiveJavaObject(timeRec)
    val idOI = fieldRefs(1).getFieldObjectInspector().asInstanceOf[IntObjectInspector]
    val id = idOI.get(idRec)
    val nameOI = fieldRefs(2).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
    val name = nameOI.getPrimitiveJavaObject(nameRec)
    val appOI = fieldRefs(3).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
    val app = appOI.getPrimitiveJavaObject(applicationRec)

    new Record(time, id, name, app)
  }
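
One thing I noticed while writing this up: parse builds and initializes a new
ColumnarSerDe (plus the Properties and object inspectors) for every one of the
~800k records.  Would hoisting that setup into mapPartitions, so it happens
once per partition instead, be the right fix?  A sketch of what I mean
(untested, same imports as above):

    val records: RDD[Record] = sc.hadoopFile(
        rcFile,
        classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
        classOf[LongWritable],
        classOf[BytesRefArrayWritable])
      .mapPartitions { iter =>
        // Build the SerDe and inspectors once per partition, not once per record.
        val serDe = new ColumnarSerDe()
        val tbl = new Properties()
        tbl.setProperty("serialization.format", "9")
        tbl.setProperty("columns", "time,id,name,application")
        tbl.setProperty("columns.types", "string:int:string:string")
        tbl.setProperty("serialization.null.format", "NULL")
        serDe.initialize(new Configuration(), tbl)
        val soi = serDe.getObjectInspector().asInstanceOf[StructObjectInspector]
        val fieldRefs = soi.getAllStructFieldRefs().asScala
        val timeOI = fieldRefs(0).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
        val idOI   = fieldRefs(1).getFieldObjectInspector().asInstanceOf[IntObjectInspector]
        val nameOI = fieldRefs(2).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
        val appOI  = fieldRefs(3).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
        iter.map { case (_, braw) =>
          // Deserialize each record with the shared, per-partition SerDe.
          val row = serDe.deserialize(braw)
          new Record(
            timeOI.getPrimitiveJavaObject(soi.getStructFieldData(row, fieldRefs(0))),
            idOI.get(soi.getStructFieldData(row, fieldRefs(1))),
            nameOI.getPrimitiveJavaObject(soi.getStructFieldData(row, fieldRefs(2))),
            appOI.getPrimitiveJavaObject(soi.getStructFieldData(row, fieldRefs(3))))
        }
      }

Is that the right direction, or is there something else about reading RCFiles
through hadoopFile that I'm missing?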


Thanks in advance,
Glenda


