One approach: if you are using fileStream, you can access the individual
filenames from the partitions, and with each filename you can apply your own
uncompression/parsing logic.


Like:

        UnionPartition upp = (UnionPartition) ds.values().getPartitions()[i];
        NewHadoopPartition npp = (NewHadoopPartition) upp.split();
        String fPath = npp.serializableHadoopSplit().value().toString();
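
Once a path like fPath is recovered, the uncompression/parsing step itself could
look roughly like this in Scala (a minimal sketch; the helper name is mine, and
it assumes gzip-compressed files with scala-xml on the classpath):

    import java.util.zip.GZIPInputStream

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    import scala.xml.XML

    // Open the file behind fPath, decompress it, and parse the XML payload.
    def readCompressedXml(fPath: String): scala.xml.Elem = {
      val path = new Path(fPath)
      val fs = path.getFileSystem(new Configuration())
      val in = new GZIPInputStream(fs.open(path))   // uncompression
      try XML.load(in)                              // parsing
      finally in.close()
    }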


Another approach would be to create a custom RecordReader and InputFormat,
pass them to your fileStream, and do your uncompression/parsing inside the
reader. You can also look into Mahout's XMLInputFormat
<https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java>.
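
If it helps, here is a rough Scala sketch of that second approach (class names
are hypothetical and gzip compression is assumed; swap in whatever codec and
parsing you actually use): a whole-file InputFormat whose reader decompresses
each small (~4 MB) file and emits its XML content as a single Text record.

    import java.io.ByteArrayOutputStream
    import java.util.zip.GZIPInputStream

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

    // Whole-file InputFormat: one record per compressed XML file.
    class CompressedXmlInputFormat extends FileInputFormat[NullWritable, Text] {
      // The files are small (~4 MB), so never split them.
      override def isSplitable(context: JobContext, file: Path): Boolean = false

      override def createRecordReader(split: InputSplit,
                                      context: TaskAttemptContext): RecordReader[NullWritable, Text] =
        new CompressedXmlRecordReader
    }

    // Reads the whole file, decompresses it, and hands back the XML text.
    class CompressedXmlRecordReader extends RecordReader[NullWritable, Text] {
      private var processed = false
      private var fileSplit: FileSplit = _
      private var conf: Configuration = _
      private val currentValue = new Text()

      override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
        fileSplit = split.asInstanceOf[FileSplit]
        conf = context.getConfiguration
      }

      override def nextKeyValue(): Boolean = {
        if (processed) return false
        val path = fileSplit.getPath
        val fs = path.getFileSystem(conf)
        val in = new GZIPInputStream(fs.open(path))   // uncompression happens here
        try {
          val out = new ByteArrayOutputStream()
          val chunk = new Array[Byte](8192)
          var n = in.read(chunk)
          while (n != -1) { out.write(chunk, 0, n); n = in.read(chunk) }
          currentValue.set(out.toByteArray)           // the raw XML of one file
        } finally {
          in.close()
        }
        processed = true
        true
      }

      override def getCurrentKey: NullWritable = NullWritable.get()
      override def getCurrentValue: Text = currentValue
      override def getProgress: Float = if (processed) 1.0f else 0.0f
      override def close(): Unit = ()
    }

Plugged into the stream (the directory is just a placeholder):

    val xmlStream = ssc.fileStream[NullWritable, Text, CompressedXmlInputFormat]("hdfs:///input/xml")
    xmlStream.map(_._2.toString).foreachRDD { xmlDocs =>
      // parse each XML document here
    }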




Thanks
Best Regards

On Mon, Mar 16, 2015 at 11:28 AM, Vijay Innamuri <vijay.innam...@gmail.com>
wrote:

> Hi All,
>
> Processing streaming JSON files with Spark features (Spark Streaming and
> Spark SQL) is very efficient and works like a charm.
>
> Below is the code snippet to process JSON files.
>
>         windowDStream.foreachRDD(IncomingFiles => {
>           val IncomingFilesTable = sqlContext.jsonRDD(IncomingFiles);
>           IncomingFilesTable.registerAsTable("IncomingFilesTable");
>           val result = sqlContext.sql("select text from IncomingFilesTable").collect;
>           sc.parallelize(result).saveAsTextFile("filepath");
>         })
>
>
> But I feel it's difficult to use Spark features efficiently with
> streaming XML files (each compressed file would be 4 MB).
>
> What is the best approach for processing compressed XML files?
>
> Regards
> Vijay
>
