Hi,
I'm new to Spark and am working on a proof of concept. I'm using Spark
1.3.0 and running in local mode.
I can read and parse an RCFile using Spark, but the performance is not as
good as I had hoped: processing ~800k rows takes about 30 minutes.
Is there a better way to load and process an RCFile? The only reference to
RCFiles in 'Learning Spark' is in the Spark SQL chapter. Is Spark SQL the
recommended way to work with RCFiles, and should I avoid the Spark core API
for them?
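For what it's worth, my understanding is that the Spark SQL route would look
something like the sketch below. The table name `events`, and the assumption
that it is already declared in the Hive metastore as STORED AS RCFILE, are
mine; I haven't tried this yet.

```scala
import org.apache.spark.sql.hive.HiveContext

// Assumes a Hive table "events" already exists in the metastore
// and is declared STORED AS RCFILE.
val hiveContext = new HiveContext(sc)
val df = hiveContext.sql("SELECT time, id, name, application FROM events")
df.take(5).foreach(println)
```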
I'm using the following code to build an RDD[Record]:
val records: RDD[Record] = sc.hadoopFile(rcFile,
    classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
    classOf[LongWritable],
    classOf[BytesRefArrayWritable])
  .map(x => parse(x._2))  // the key (byte offset) is discarded
The function parse is defined as:
import java.util.Properties
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.serde2.columnar.{BytesRefArrayWritable, ColumnarSerDe}
import org.apache.hadoop.hive.serde2.objectinspector.{StructField, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.{IntObjectInspector, StringObjectInspector}
import scala.collection.JavaConverters._

def parse(braw: BytesRefArrayWritable): Record = {
  // Note: a fresh SerDe is constructed and initialized on every call.
  val serDe = new ColumnarSerDe()
  val tbl = new Properties()
  tbl.setProperty("serialization.format", "9")
  tbl.setProperty("columns", "time,id,name,application")
  tbl.setProperty("columns.types", "string:int:string:string")
  tbl.setProperty("serialization.null.format", "NULL")
  serDe.initialize(new Configuration(), tbl)

  val soi = serDe.getObjectInspector().asInstanceOf[StructObjectInspector]
  val fieldRefs: Seq[_ <: StructField] = soi.getAllStructFieldRefs().asScala
  val row = serDe.deserialize(braw)

  // Extract each column through its field-specific object inspector.
  val timeOI = fieldRefs(0).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
  val time = timeOI.getPrimitiveJavaObject(soi.getStructFieldData(row, fieldRefs(0)))
  val idOI = fieldRefs(1).getFieldObjectInspector().asInstanceOf[IntObjectInspector]
  val id = idOI.get(soi.getStructFieldData(row, fieldRefs(1)))
  val nameOI = fieldRefs(2).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
  val name = nameOI.getPrimitiveJavaObject(soi.getStructFieldData(row, fieldRefs(2)))
  val appOI = fieldRefs(3).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
  val app = appOI.getPrimitiveJavaObject(soi.getStructFieldData(row, fieldRefs(3)))

  new Record(time, id, name, app)
}
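One thing I suspect is the bottleneck: parse builds and initializes a new
ColumnarSerDe (plus its object inspectors) for every single row. Would a
mapPartitions variant like the sketch below, which sets up the SerDe once
per partition, be the right fix? (parseRow here is a hypothetical
refactoring of parse that takes the pre-built serDe, soi, and fieldRefs as
arguments instead of creating them.)

```scala
// Sketch: amortize SerDe setup across a whole partition instead of per row.
val records: RDD[Record] = sc.hadoopFile(rcFile,
    classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
    classOf[LongWritable],
    classOf[BytesRefArrayWritable])
  .mapPartitions { iter =>
    // One SerDe per partition, reused for every row in the iterator.
    val serDe = new ColumnarSerDe()
    val tbl = new Properties()
    tbl.setProperty("serialization.format", "9")
    tbl.setProperty("columns", "time,id,name,application")
    tbl.setProperty("columns.types", "string:int:string:string")
    tbl.setProperty("serialization.null.format", "NULL")
    serDe.initialize(new Configuration(), tbl)
    val soi = serDe.getObjectInspector().asInstanceOf[StructObjectInspector]
    val fieldRefs = soi.getAllStructFieldRefs().asScala
    // parseRow: hypothetical variant of parse using the shared inspectors.
    iter.map { case (_, braw) => parseRow(serDe, soi, fieldRefs, braw) }
  }
```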
Thanks in advance,
Glenda