Hey,

We have files organized on HDFS in this manner:

base_folder
|- <ID1>
|   |- file1
|   |- file2
|   |- ...
|- <ID2>
|   |- file1
|   |- file2
|   |- ...
|- ...

We want to be able to do the following operation on our data:

- for each ID, parse the lines into records (timestamp, record data),
giving us a list[(timestamp, record_data)] per ID
- then perform an operation on that list[(timestamp, record_data)], giving
us a list[output]; note that this is not a simple map operation
(timestamp, record_data) -> output, since it requires the full list of
records for an ID
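
To make the two steps concrete, here is a minimal sketch of the signatures involved; the type definitions, the "timestamp,data" line format, and the bodies are illustrative assumptions, only the shapes (Iterator[String] in, full List for one ID required by the second step) come from the description above:

```scala
// Hypothetical record/output types -- placeholders for the real ones.
case class RecordData(payload: String)
case class Output(summary: String)

// Step 1: parse raw lines into timestamped records for one ID.
// Assumes each line looks like "timestamp,data" (an illustrative format).
def parseRecords(lines: Iterator[String], id: String): List[(Long, RecordData)] =
  lines.map { line =>
    val Array(ts, rest) = line.split(",", 2)
    (ts.toLong, RecordData(rest))
  }.toList

// Step 2: needs the *full* record list for the ID (e.g. to sort the whole
// timeline), so it cannot be expressed as a per-record map.
def generateOutput(records: List[(Long, RecordData)], id: String): List[Output] = {
  val sorted = records.sortBy(_._1)
  List(Output(s"$id: ${sorted.size} records"))
}
```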

Currently we are doing this in the following way:

val ids: List[String] = <insert list of all our ids>
val idsWithPaths: List[(String, List[String])] = <append path strings for ids>
sc.parallelize(idsWithPaths, partitions)
  .map { case (id, pathList) =>
    val sourceList: List[Source] = <convert pathList to sources>
    val combinedIterator: Iterator[String] =
      sourceList.map(_.getLines()).reduceLeft(_ ++ _)
    val records: List[(Timestamp, RecordData)] = parseRecords(combinedIterator, id)
    val output: List[Output] = generateOutput(records, id)
    output // return the result, otherwise the map body yields Unit
  }

I would like to know if this is a good way to do this operation. It seems to
me that it doesn't make full use of Spark's capabilities (data locality, for
example, since the partitioner has no way of knowing how to distribute the
IDs close to their files on HDFS). Some attempts were made to translate this
using sc.textFile and sc.wholeTextFiles, but in some small benchmarks those
turned out slower (which could be due to the specific implementation, since
it required some groupByKey/reduceByKey steps to gather the data for each ID
into a list[(timestamp, record_data)] before the generateOutput function
could run).
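
For reference, the shaping the wholeTextFiles attempt needs is roughly the following: key each (path, content) pair by the ID extracted from the path, then gather all records of an ID before generateOutput can run. This sketch uses plain Scala collections for clarity; with Spark it would be sc.wholeTextFiles(...).flatMap(...).groupByKey(). The idFromPath helper and the "timestamp,data" line format are illustrative assumptions:

```scala
// Assumes paths like .../base_folder/<ID>/fileN (hypothetical layout helper).
def idFromPath(path: String): String =
  path.split("/").init.last

// Gather all (timestamp, data) records per ID -- the groupByKey-style step
// that the RDD translation required before generateOutput could be applied.
def gatherById(files: Seq[(String, String)]): Map[String, List[(Long, String)]] =
  files
    .flatMap { case (path, content) =>
      content.linesIterator.map { line =>
        val Array(ts, data) = line.split(",", 2)
        (idFromPath(path), (ts.toLong, data))
      }
    }
    .groupBy(_._1)
    .map { case (id, kvs) => id -> kvs.map(_._2).toList }
```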



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/use-case-reading-files-split-per-id-tp28044.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
