Hi,

Parallel processing of xml files may be an issue due to the tags in the xml
file. The xml file has to be intact as while parsing it matches the start
and end entity and if its distributed in parts to workers possibly it may
or may not find start and end tags within the same worker which will give
an exception.

Thanks.

On Wed, Nov 19, 2014 at 6:26 AM, ssimanta [via Apache Spark User List] <
ml-node+s1001560n19239...@n3.nabble.com> wrote:

> If there a one big XML file (e.g., Wikipedia dump 44GB or the larger dump
> that all revision information also) that is stored in HDFS, is it possible
> to parse it in parallel/faster using Spark? Or do we have to use something
> like a PullParser or Iteratee?
>
> My current solution is to read the single XML file in the first pass -
> write it to HDFS and then read the small files in parallel on the Spark
> workers.
>
> Thanks
> -Soumya
>
>
>
>
>
> ------------------------------
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Parsing-a-large-XML-file-using-Spark-tp19239.html
>  To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=1&code=cHJhbm5veUBzaWdtb2lkYW5hbHl0aWNzLmNvbXwxfC0xNTI2NTg4NjQ2>
> .
> NAML
> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parsing-a-large-XML-file-using-Spark-tp19239p19477.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Reply via email to