Hi John,

Glad you're enjoying the Spark training at UMD.
Is the 43 GB of XML data in a single file or split across multiple BZIP2 files? Is the file in an HDFS cluster or on a single Linux machine? If you're using BZIP2 with splittable compression (in HDFS), you'll need at least Hadoop 1.1: https://issues.apache.org/jira/browse/HADOOP-7823

Or, if you've got the file on a single Linux machine, consider decompressing it manually with command-line tools before loading it into Spark.

You'll want roughly 1 GB per partition to start, so if the uncompressed file is 100 GB, try about 100 partitions. Even if the entire dataset is in one file (which Spark might initially read into just 1 or 2 partitions), you can use the repartition(numPartitions) transformation to spread it across 100 partitions.

Then you'll have to make sense of the XML schema. You have a few options:

1. Take advantage of Scala's XML functionality in the scala.xml package to parse the data. Here is a blog post with some code examples: http://stevenskelton.ca/real-time-data-mining-spark/

2. Try sc.wholeTextFiles(). It reads each entire file into a single string record, so make sure you have enough memory to hold that string. This Cloudera blog post (about halfway down) has some regex examples of using Scala to parse an XML file into a collection of tuples: http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

3. Search for XMLInputFormat on Google. There are some implementations that let you specify the <tag> to split on, e.g.: https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collection/XMLInputFormat.java

Good luck!

Sameer F.
Client Services @ Databricks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-BZ2-XML-file-in-Spark-tp16954p16960.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
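To make option 1 concrete, here is a minimal sketch of parsing one XML record with scala.xml. It assumes a hypothetical record shape with <page>, <title>, and <text> elements (your actual schema and field names will differ), and it shows the per-record function you would map over an RDD once the file has been split into records:

```scala
import scala.xml.XML

// Hypothetical example: turn one XML record string (e.g. a <page>
// element produced by an XMLInputFormat split, or sliced out of a
// wholeTextFiles() string) into a (title, text) tuple.
// Element names here are assumptions, not your real schema.
def parseRecord(record: String): (String, String) = {
  val elem  = XML.loadString(record)        // parse the string into an Elem
  val title = (elem \ "title").text         // first-level <title> child text
  val body  = (elem \ "text").text          // first-level <text> child text
  (title, body)
}

val sample = "<page><title>Spark</title><text>Fast engine</text></page>"
val parsed = parseRecord(sample)
// parsed == ("Spark", "Fast engine")
```

In Spark you would apply it with something like records.map(parseRecord) after reading and repartitioning, e.g. sc.textFile(path).repartition(100) for line-oriented splits, or over the records extracted by an XMLInputFormat. Keep the parse function pure (no driver-side state) so it serializes cleanly to the executors.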