Hi John,

Glad you're enjoying the Spark training at UMD.

Is the 43 GB XML data in a single file or split across multiple BZIP2 files?
Is the file on an HDFS cluster or on a single Linux machine?

If you're relying on BZIP2's splittable compression (in HDFS), you'll need
at least Hadoop 1.1:
https://issues.apache.org/jira/browse/HADOOP-7823

Or, if the file is on a single Linux machine, consider uncompressing it with
command-line tools before loading it into Spark.

You'll want roughly 1 GB per partition as a starting point, so if the
uncompressed file is 100 GB, start with about 100 partitions. Even if the
entire dataset is in one file (which Spark might initially read into just 1
or 2 partitions), you can use the repartition(numPartitions) transformation
to spread it across 100 partitions.
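For example, in spark-shell (where sc is the SparkContext), that might look
like this; the path and partition count below are placeholders, not your
actual data:

```scala
// In spark-shell, where `sc` is the SparkContext. Path is a placeholder.
// Spark decompresses .bz2 transparently via the Hadoop codec.
val raw = sc.textFile("hdfs:///data/dump.xml.bz2")

// A single file often arrives in very few partitions; check first:
println(raw.partitions.length)

// Spread the data across 100 partitions for better parallelism.
val spread = raw.repartition(100)
```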

Then you'll have to make sense of the XML schema. You have a few options to
do this.

You can take advantage of Scala's XML functionality in the scala.xml package
to parse the data. Here is a blog post with some example code:
http://stevenskelton.ca/real-time-data-mining-spark/
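As a minimal sketch (the <page>/<title>/<text> record shape here is
hypothetical, not from your data), parsing one XML record with scala.xml
looks like:

```scala
import scala.xml.XML

// Hypothetical record shape -- substitute your real tags.
val record = "<page><title>Spark</title><text>Fast engine</text></page>"

val elem  = XML.loadString(record) // parse one record
val title = (elem \ "title").text  // "Spark"
val body  = (elem \ "text").text   // "Fast engine"

// On an RDD of record strings, do the same inside a map():
// records.map { s =>
//   val e = XML.loadString(s)
//   ((e \ "title").text, (e \ "text").text)
// }
```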

Or try sc.wholeTextFiles(), which reads each entire file into a single
string record. Make sure you have enough memory to hold a whole file's
contents as one string.
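A sketch, again assuming spark-shell and a placeholder path:

```scala
// Returns an RDD of (filePath, fileContents) pairs;
// each file becomes exactly one record.
val files = sc.wholeTextFiles("hdfs:///data/xml-dir")

// Keep just the contents; each element is one whole file as a String.
val docs = files.map(_._2)
```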

This Cloudera blog post (about halfway down) has some regex examples of
using Scala to parse an XML file into a collection of tuples:
http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
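In the same spirit, here is a small self-contained regex sketch (the tag
names are hypothetical); on an RDD you'd apply the extraction inside a map():

```scala
// Non-greedy capture groups pull out one field per tag.
val titleRe = "<title>(.*?)</title>".r
val textRe  = "<text>(.*?)</text>".r

val record = "<page><title>Spark</title><text>Fast engine</text></page>"

// findFirstMatchIn returns an Option[Match]; group(1) is the captured text.
val title = titleRe.findFirstMatchIn(record).map(_.group(1)).getOrElse("")
val body  = textRe.findFirstMatchIn(record).map(_.group(1)).getOrElse("")
// (title, body) == ("Spark", "Fast engine")
```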

You can also search for XMLInputFormat on Google. There are some
implementations that allow you to specify the <tag> to split on, e.g.:
https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collection/XMLInputFormat.java
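Wiring one of those up might look roughly like this. The
"xmlinput.start"/"xmlinput.end" key names match Cloud9's
START_TAG_KEY/END_TAG_KEY constants, but double-check the implementation you
pick; this is an untested sketch with placeholder paths and tags:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}

// Assumes an XmlInputFormat built on the new mapreduce API (e.g. Mahout's);
// Cloud9's version uses the older mapred API, which pairs with sc.hadoopFile.
val conf = new Configuration()
conf.set("xmlinput.start", "<page>") // record opening tag
conf.set("xmlinput.end", "</page>")  // record closing tag

val pages = sc.newAPIHadoopFile(
    "hdfs:///data/dump.xml",         // placeholder path
    classOf[XmlInputFormat],         // the input format class you chose
    classOf[LongWritable],
    classOf[Text],
    conf
  ).map(_._2.toString)               // one <page>...</page> block per record
```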

Good luck!

Sameer F.
Client Services @ Databricks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-BZ2-XML-file-in-Spark-tp16954p16960.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
