(sorry about the previous spam... Google Inbox wouldn't let me cancel
the miserable send action :-/)

So what I was about to say: it's a real pain in the ass to parse the
Wikipedia articles in the dump because of these multiline articles...
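For reference, each article in the dump is wrapped in a <page> element
spanning many lines, roughly like this (simplified from memory; the real
dump carries more attributes and children):

<page>
  <title>Anarchism</title>
  <id>12</id>
  <revision>
    <text xml:space="preserve">...the article wikitext, over many lines...</text>
  </revision>
</page>

so a plain line-oriented textFile() read can't split records correctly.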

However, there is a way to handle that "quite" easily, although I found it
rather slow.

*1/ use the XML record reader*
Pull in the "org.apache.hadoop" % "hadoop-streaming" % "1.0.4" dependency
(it ships StreamXmlRecordReader).

*2/ configure the hadoop job*
import org.apache.hadoop.streaming.StreamXmlRecordReader
import org.apache.hadoop.mapred.JobConf

val jobConf = new JobConf()
jobConf.set("stream.recordreader.class",
            "org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<page")
jobConf.set("stream.recordreader.end", "</page>")
org.apache.hadoop.mapred.FileInputFormat.addInputPaths(
  jobConf, s"hdfs://$master:9000/data.xml")

// Load documents (one record per <page> element).
val documents = sparkContext.hadoopRDD(
  jobConf,
  classOf[org.apache.hadoop.streaming.StreamInputFormat],
  classOf[org.apache.hadoop.io.Text],
  classOf[org.apache.hadoop.io.Text])
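One gotcha worth flagging: the records come back as Hadoop Text objects,
which the record reader reuses and which are not Java-serializable, so
convert them to String right away before collecting or caching (step 3
below does exactly that):

// Sanity check: map Text to String first, otherwise take()/collect()
// can fail with NotSerializableException or return stale, reused objects.
documents.map(_._1.toString).take(2).foreach(println)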


*3/ use the result as XML doc*
import scala.xml.XML

val texts = documents.map(_._1.toString)
  .map { s =>
    val xml    = XML.loadString(s)
    val id     = (xml \ "id").text.toDouble
    val title  = (xml \ "title").text
    val text   = (xml \ "revision" \ "text").text.replaceAll("\\W", " ")
    val tknzed = text.split("\\W").filter(_.size > 3).toList
    (id, title, tknzed)
  }
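From there you can eyeball a few parsed articles to check the pipeline
(a hypothetical quick look, just to show the shape of the result):

// Print (id, title, first few tokens) for a couple of articles.
texts.take(2).foreach { case (id, title, tokens) =>
  println(s"$id  $title  ${tokens.take(10).mkString(" ")}")
}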

HTH
andy
On Tue Nov 18 2014 at 2:52:00 AM Tobias Pfeiffer <t...@preferred.jp> wrote:

> Hi,
>
> see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for
> one solution.
>
> One issue with those XML files is that they cannot be processed line by
> line in parallel; plus you inherently need shared/global state to parse XML
> or check for well-formedness, I think. (Same issue with multi-line JSON, by
> the way.)
>
> Tobias
>
>
