Sorry for the late reply. Currently, the library only supports loading XML documents as they are.
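In the meantime, one possible workaround is to flatten the DataFrame after loading it, by walking the inferred schema instead of naming columns by hand. The following is only a rough sketch and not something spark-xml provides: the `FlattenSketch` object, the `flatten` helper, and the underscore-joined column names are all hypothetical, and it assumes a Spark version where `explode` is available in `org.apache.spark.sql.functions`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Hypothetical helper, not part of spark-xml.
object FlattenSketch {

  // Repeatedly expands the first struct or array column found in the schema
  // until only flat columns remain. The expanded column moves to the end,
  // so the original column order is not preserved.
  @annotation.tailrec
  def flatten(df: DataFrame): DataFrame = {
    val nested = df.schema.fields.find { f =>
      f.dataType.isInstanceOf[StructType] || f.dataType.isInstanceOf[ArrayType]
    }
    nested match {
      case None => df // nothing left to expand
      case Some(field) =>
        // Backticks keep unusual column names from being parsed as paths.
        val others = df.columns.filter(_ != field.name).map(n => col(s"`$n`"))
        field.dataType match {
          case st: StructType =>
            // Promote each child of the struct to a top-level column,
            // named parent_child (an assumed naming convention).
            val expanded = st.fieldNames.map { child =>
              col(s"`${field.name}`.`$child`").alias(s"${field.name}_$child")
            }
            flatten(df.select(others ++ expanded: _*))
          case _ =>
            // Array column: one output row per element. Note that explode
            // drops rows whose array is null or empty.
            val exploded = explode(col(s"`${field.name}`")).alias(field.name)
            flatten(df.select(others :+ exploded: _*))
        }
    }
  }
}
```

It would be applied to the loaded DataFrame, for example `FlattenSketch.flatten(sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "emp").load("/tmp/sample.xml"))`. One caveat: if a row contains several independent arrays, exploding them one after another produces every combination of their elements, which may not be what you want.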
Would you mind opening an issue with a more detailed explanation? The tracker is here: https://github.com/databricks/spark-xml/issues

2016-08-17 7:22 GMT+09:00 Sreekanth Jella <srikanth.je...@gmail.com>:

> Hi Experts,
>
> Please suggest. Thanks in advance.
>
> Thanks,
> Sreekanth
>
> *From:* Sreekanth Jella [mailto:srikanth.je...@gmail.com]
> *Sent:* Sunday, August 14, 2016 11:46 AM
> *To:* 'Hyukjin Kwon' <gurwls...@gmail.com>
> *Cc:* 'user @spark' <user@spark.apache.org>
> *Subject:* Re: Flattening XML in a DataFrame
>
> Hi Hyukjin Kwon,
>
> Thank you for the reply.
>
> There are several types of XML documents with different schemas that need
> to be parsed, and the tag names are not known in advance. All we know is
> the XSD for the given XML.
>
> Is it possible to get the same results even when we do not know the XML
> tags like manager.id, manager.name, or is it possible to read the tag
> names from the XSD and use them?
>
> Thanks,
> Sreekanth
>
> On Aug 12, 2016 9:58 PM, "Hyukjin Kwon" <gurwls...@gmail.com> wrote:
>
> Hi Sreekanth,
>
> Assuming you are using Spark 1.x,
>
> I believe this code below:
>
> sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "emp").load("/tmp/sample.xml")
>   .selectExpr("manager.id", "manager.name", "explode(manager.subordinates.clerk) as clerk")
>   .selectExpr("id", "name", "clerk.cid", "clerk.cname")
>   .show()
>
> would print the results below as you want:
>
> +---+----+---+-----+
> | id|name|cid|cname|
> +---+----+---+-----+
> |  1| foo|  1|  foo|
> |  1| foo|  1|  foo|
> +---+----+---+-----+
>
> I hope this is helpful.
>
> Thanks!
>
> 2016-08-13 9:33 GMT+09:00 Sreekanth Jella <srikanth.je...@gmail.com>:
>
> Hi Folks,
>
> I am trying to flatten a variety of XMLs using DataFrames. I'm using the
> spark-xml package, which automatically infers my schema and creates a
> DataFrame.
>
> I do not want to hard-code any column names in the DataFrame, as I have a
> lot of varieties of XML documents and each may have much deeper nesting of
> child nodes. I simply want to flatten any type of XML and then write the
> output data to a Hive table. Can you please give some expert advice on this?
>
> Example XML and expected output are given below.
>
> Sample XML:
>
> <emplist>
>   <emp>
>     <manager>
>       <id>1</id>
>       <name>foo</name>
>       <subordinates>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>       </subordinates>
>     </manager>
>   </emp>
> </emplist>
>
> Expected output:
>
> id, name, clerk.cid, clerk.cname
> 1, foo, 2, cname2
> 1, foo, 3, cname3
>
> Thanks,
> Sreekanth Jella