One approach: if you are using fileStream, you can access the individual filenames from the partitions, and with each filename apply your decompression/parsing logic.
Like:

    UnionPartition upp = (UnionPartition) ds.values().getPartitions()[i];
    NewHadoopPartition npp = (NewHadoopPartition) upp.split();
    String fPath = npp.serializableHadoopSplit().value().toString();

Another approach would be to create a custom RecordReader and InputFormat, then pass them along with your fileStream; within the reader you do your decompression/parsing etc. (a rough sketch of this wiring follows the quoted message below). You can also look into Mahout's XmlInputFormat
<https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java>.

Thanks
Best Regards

On Mon, Mar 16, 2015 at 11:28 AM, Vijay Innamuri <vijay.innam...@gmail.com> wrote:

> Hi All,
>
> Processing streaming JSON files with Spark features (Spark Streaming and
> Spark SQL) is very efficient and works like a charm.
>
> Below is the code snippet to process JSON files:
>
>     windowDStream.foreachRDD(IncomingFiles => {
>       val IncomingFilesTable = sqlContext.jsonRDD(IncomingFiles);
>       IncomingFilesTable.registerAsTable("IncomingFilesTable");
>       val result = sqlContext.sql("select text from IncomingFilesTable").collect;
>       sc.parallelize(result).saveAsTextFile("filepath");
>     })
>
> But I feel it is difficult to use Spark features efficiently with
> streaming XML files (each compressed file would be 4 MB).
>
> What is the best approach for processing compressed XML files?
>
> Regards
> Vijay
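Here is the promised sketch of the second approach: wiring Mahout's XmlInputFormat into fileStream. It is minimal and untested; it assumes Spark 1.2+, that XmlInputFormat (or a copy of that class) is on your classpath, and that the input directory "hdfs:///incoming/xml" and the <record> start/end tags are placeholders you would replace with your own:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.mahout.classifier.bayes.XmlInputFormat
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object XmlFileStream {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("XmlFileStream"),
          Seconds(30)) // batch interval is arbitrary

        // XmlInputFormat reads these two keys from the Hadoop configuration
        // to delimit records; <record> is a made-up tag name.
        val hadoopConf = ssc.sparkContext.hadoopConfiguration
        hadoopConf.set("xmlinput.start", "<record>")
        hadoopConf.set("xmlinput.end", "</record>")

        // Each value is the raw text of one <record>...</record> element.
        val records = ssc
          .fileStream[LongWritable, Text, XmlInputFormat]("hdfs:///incoming/xml")
          .map { case (_, xml) => xml.toString }

        records.foreachRDD { rdd =>
          // plug your XML parsing in here, e.g. scala.xml.XML.loadString
          rdd.foreach(println)
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

One caveat: as far as I can tell, XmlInputFormat's record reader opens the file directly rather than going through a compression codec, so decompressing your compressed files would still need to happen inside a customized reader (or before the files land in the watched directory), which is exactly the part the second approach asks you to write.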