textFileStream and default fileStream recognizes the compressed xml(.xml.gz) files.
Each line in the xml file is an element in RDD[string]. Then whole RDD is converted to a proper xml format data and stored in a *Scala variable*. - I believe storing huge data in a *Scala variable* is inefficient. Is there any alternative processing for xml files? - How to create Spark SQL table with the above xml data? Regards Vijay Innamuri On 16 March 2015 at 12:12, Akhil Das <ak...@sigmoidanalytics.com> wrote: > One approach would be, If you are using fileStream you can access the > individual filenames from the partitions and with that filename you can > apply your uncompression logic/parsing logic and get it done. > > > Like: > > UnionPartition upp = (UnionPartition) ds.values().getPartitions()[i]; > NewHadoopPartition npp = (NewHadoopPartition) upp.split(); String > *fPath* = npp.serializableHadoopSplit().value().toString(); > > > Another approach would be to create a custom inputReader and InpurFormat, > then pass it along with your fileStream and within the reader, you do your > uncompression/parsing etc. You can also look into XMLInputFormat > <https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java> > of mahout. > > > > > Thanks > Best Regards > > On Mon, Mar 16, 2015 at 11:28 AM, Vijay Innamuri <vijay.innam...@gmail.com > > wrote: > >> Hi All, >> >> Processing streaming JSON files with Spark features (Spark streaming and >> Spark SQL), is very efficient and works like a charm. >> >> Below is the code snippet to process JSON files. >> >> windowDStream.foreachRDD(IncomingFiles => { >> val IncomingFilesTable = sqlContext.jsonRDD(IncomingFiles); >> IncomingFilesTable.registerAsTable("IncomingFilesTable"); >> val result = sqlContext.sql("select text from >> IncomingFilesTable").collect; >> sc.parallelize(result).saveAsTextFile("filepath"); >> } >> >> >> But, I feel its difficult to use spark features efficiently with >> streaming xml files (each compressed file would be 4 MB). >> >> What is the best approach for processing compressed xml files? >> >> Regards >> Vijay >> > >