Apologies, here is the link to the product: https://sonra.io/flexter-for-xml/
and how it can be used with Hive: https://sonra.io/2018/01/27/converting-xml-hive/

On Mon, Jun 11, 2018 at 5:46 PM, Mich Talebzadeh <[email protected]> wrote:

Many thanks, but I cannot see any specific product name there?

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 11 June 2018 at 14:10, kristijan berta <[email protected]> wrote:

The XPath stuff works reasonably well for simple XML files.

However, for complex XML files that change frequently and need to be ingested in real time, you might look at a third-party solution, e.g. here:
https://dataworkssummit.com/san-jose-2018/session/add-a-spark-to-your-etl/

On Mon, Jun 11, 2018 at 3:05 PM, kristijan berta <[email protected]> wrote:

Thanks Jorn. The only alternative is to use the xpath UDFs?
It works as shown below, but it is tedious.

Like the example below:

$ cat employees.xml
<employee>
<id>1</id>
<name>Satish Kumar</name>
<designation>Technical Lead</designation>
</employee>
<employee>
<id>2</id>
<name>Ramya</name>
<designation>Testing</designation>
</employee>

Step 1: Bring each record onto one line by executing the command below:

$ cat employees.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed 's|</employee>|</employee>\n|g' | grep -v '^\s*$' > employees_records.xml

$ cat employees_records.xml
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>

Step 2: Load the file to HDFS:

$ hadoop fs -mkdir /user/hive/sample-xml-inputs
$ hadoop fs -put employees_records.xml /user/hive/sample-xml-inputs
$ hadoop fs -cat /user/hive/sample-xml-inputs/employees_records.xml
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>

Step 3: Create an external Hive table pointing at the XML file:

hive> CREATE EXTERNAL TABLE xml_table_org (xmldata string) LOCATION '/user/hive/sample-xml-inputs/';

hive> SELECT * FROM xml_table_org;
OK
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>

Step 4: From the staging table we can query the elements and load them into another table:

hive> CREATE TABLE xml_table AS SELECT xpath_int(xmldata,'employee/id'), xpath_string(xmldata,'employee/name'), xpath_string(xmldata,'employee/designation') FROM xml_table_org;

Dr Mich Talebzadeh

On 9 June 2018 at 07:42, Jörn Franke <[email protected]> wrote:

Yes. SerDe must have been removed then in 2.x.

On 8. Jun 2018, at 23:52, Mich Talebzadeh <[email protected]> wrote:

OK, I am looking at this jar file:

jar tf hive-serde-3.0.0.jar | grep -i abstractserde
org/apache/hadoop/hive/serde2/AbstractSerDe.class

Is this the correct one?

Thanks

Dr Mich Talebzadeh

On 8 June 2018 at 22:34, Mich Talebzadeh <[email protected]> wrote:

Thanks Jorn, so what is the resolution? Do I need another jar file?
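As a quick sanity check, the Step 1 transform and the Step 4 xpath_* extraction from the walkthrough above can be reproduced outside Hive. The sketch below is illustrative only: it mimics, but does not use, Hive's xpath_int/xpath_string UDFs on the same sample data.

```python
# Illustrative sketch (not Hive itself): mimic Step 1 (one record per line)
# and Step 4 (xpath_* extraction) from the walkthrough above.
import xml.etree.ElementTree as ET

raw = """<employee>
<id>1</id>
<name>Satish Kumar</name>
<designation>Technical Lead</designation>
</employee>
<employee>
<id>2</id>
<name>Ramya</name>
<designation>Testing</designation>
</employee>"""

# Step 1 equivalent of: tr -d '&' | tr '\n' ' ' | sed 's|</employee>|</employee>\n|g'
flat = raw.replace("&", "").replace("\n", " ")
records = [seg.strip() + "</employee>"
           for seg in flat.split("</employee>") if seg.strip()]

# Step 4 equivalent of xpath_int(xmldata, 'employee/id') and friends:
# each record's root element is <employee>, so read its children directly.
rows = []
for rec in records:
    root = ET.fromstring(rec)
    rows.append((int(root.findtext("id")),
                 root.findtext("name"),
                 root.findtext("designation")))

print(rows)  # [(1, 'Satish Kumar', 'Technical Lead'), (2, 'Ramya', 'Testing')]
```

If these values come out right, the XPath expressions in the CTAS statement are at least shaped correctly; Hive's own xpath_* UDFs evaluate full XPath, so treat this only as a shape check.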
On 8 June 2018 at 21:56, Jörn Franke <[email protected]> wrote:

Oh, I see now: SerDe is a deprecated interface; if I am not wrong it has been replaced by the abstract class AbstractSerDe.

On 8. Jun 2018, at 22:22, Mich Talebzadeh <[email protected]> wrote:

Thanks Jorn.
Spark 2.3.3 (labelled as stable).

First I put the jar file hivexmlserde-1.0.5.3.jar under $HIVE_HOME/lib and also loaded it explicitly with ADD JAR in the hive session:

hive> ADD JAR hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar;
Added [/tmp/hive/7feb5165-780b-4ab6-aca8-f516d0388823_resources/hivexmlserde-1.0.5.3.jar] to class path
Added resources: [hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar]

Then I ran a simple example given here: https://github.com/dvasilen/Hive-XML-SerDe/issues/41

hive> CREATE TABLE xml_41 (imap map<string,string>)
    > ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    > WITH SERDEPROPERTIES (
    >   "column.xpath.imap"="/file-format/data-set/element",
    >   "xml.map.specification.element"="@name->#content"
    > )
    > STORED AS
    > INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    > TBLPROPERTIES (
    >   "xmlinput.start"="<file-format>",
    >   "xmlinput.end"="</file-format>"
    > );
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
org/apache/hadoop/hive/serde2/SerDe

And this is the full error:

2018-06-08T21:17:20,775 INFO [7feb5165-780b-4ab6-aca8-f516d0388823 main] ql.Driver: Starting task [Stage-0:DDL] in serial mode
2018-06-08T21:17:20,776 ERROR [7feb5165-780b-4ab6-aca8-f516d0388823 main] exec.DDLTask: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/serde2/SerDe
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
        at org.apache.hadoop.hive.ql.exec.DDLTask.validateSerDe(DDLTask.java:4213)
        at org.apache.hadoop.hive.ql.plan.CreateTableDesc.toTable(CreateTableDesc.java:723)
        at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4321)
        at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.serde2.SerDe
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 40 more

The jar file has the classes!
jar tf hivexmlserde-1.0.5.3.jar
META-INF/
META-INF/MANIFEST.MF
com/
com/ibm/
com/ibm/spss/
com/ibm/spss/hive/
com/ibm/spss/hive/serde2/
com/ibm/spss/hive/serde2/xml/
com/ibm/spss/hive/serde2/xml/objectinspector/
com/ibm/spss/hive/serde2/xml/processor/
com/ibm/spss/hive/serde2/xml/processor/java/
com/ibm/spss/hive/serde2/xml/HiveXmlRecordReader.class
com/ibm/spss/hive/serde2/xml/objectinspector/XmlListObjectInspector.class
com/ibm/spss/hive/serde2/xml/objectinspector/XmlMapObjectInspector.class
com/ibm/spss/hive/serde2/xml/objectinspector/XmlObjectInspectorFactory$1.class
com/ibm/spss/hive/serde2/xml/objectinspector/XmlObjectInspectorFactory.class
com/ibm/spss/hive/serde2/xml/objectinspector/XmlStructObjectInspector$1.class
com/ibm/spss/hive/serde2/xml/objectinspector/XmlStructObjectInspector.class
com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor$1.class
com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor$2.class
com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor.class
com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor$1.class
com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor$2.class
com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor.class
com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlQuery.class
com/ibm/spss/hive/serde2/xml/processor/java/NodeArray.class
com/ibm/spss/hive/serde2/xml/processor/SerDeArray.class
com/ibm/spss/hive/serde2/xml/processor/XmlMapEntry.class
com/ibm/spss/hive/serde2/xml/processor/XmlMapFacet$Type.class
com/ibm/spss/hive/serde2/xml/processor/XmlMapFacet.class
com/ibm/spss/hive/serde2/xml/processor/XmlNode$1.class
com/ibm/spss/hive/serde2/xml/processor/XmlNode$2.class
com/ibm/spss/hive/serde2/xml/processor/XmlNode.class
com/ibm/spss/hive/serde2/xml/processor/XmlNodeArray.class
com/ibm/spss/hive/serde2/xml/processor/XmlProcessor.class
com/ibm/spss/hive/serde2/xml/processor/XmlProcessorContext.class
com/ibm/spss/hive/serde2/xml/processor/XmlQuery.class
com/ibm/spss/hive/serde2/xml/processor/XmlTransformer.class
com/ibm/spss/hive/serde2/xml/processor/XmlUtils$1.class
com/ibm/spss/hive/serde2/xml/processor/XmlUtils.class
com/ibm/spss/hive/serde2/xml/SplittableXmlInputFormat.class
com/ibm/spss/hive/serde2/xml/XmlInputFormat$XmlRecordReader.class
com/ibm/spss/hive/serde2/xml/XmlInputFormat.class
com/ibm/spss/hive/serde2/xml/XmlSerDe$1.class
com/ibm/spss/hive/serde2/xml/XmlSerDe.class
META-INF/maven/
META-INF/maven/com.ibm.spss.hive.serde2.xml/
META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/
META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/pom.xml
META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/pom.properties

Dr Mich Talebzadeh

On 8 June 2018 at 17:58, Jörn Franke <[email protected]> wrote:

Can you get the log files and start Hive with more detailed logs?
It could be that not all libraries are loaded (I don't remember anymore, but I think this one needs more; I can look next week in my docs), or that it does not support maps (not sure).
You can try first with a simpler extraction with a string field to see if it works.

Hive has always had external libraries for XML support, and I used the one below with Hive 1.x, but it should also work with 2.x (3.x I'm not sure, but it should if it works in 2.x).

On 8. Jun 2018, at 17:53, Mich Talebzadeh <[email protected]> wrote:

I tried Hive 2.0.1, 2.3.2 and now Hive 3.

I explicitly added the hivexmlserde jar file with ADD JAR as shown below:

0: jdbc:hive2://rhes75:10099/default> ADD JAR hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar;
No rows affected (0.002 seconds)

But I still cannot create an XML table:

0: jdbc:hive2://rhes75:10099/default> CREATE TABLE xml_41 (imap map<string,string>) ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' WITH SERDEPROPERTIES ("column.xpath.imap"="/file-format/data-set/element", "xml.map.specification.element"="@name->#content") STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' TBLPROPERTIES ("xmlinput.start"="<file-format>", "xmlinput.end"="</file-format>");

Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hive/serde2/SerDe (state=08S01,code=1)

Does anyone know the cause of this or which version of Hive supports creating an XML table?
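For reference, here is a minimal sketch of what the SERDEPROPERTIES in that DDL are intended to produce. The `<file-format>` sample document is invented for illustration, since the thread never shows the input XML, and the real extraction is performed by the IBM XmlSerDe in Java.

```python
# Hypothetical input document, invented to match the XPaths in the DDL;
# the real extraction is performed by the IBM XmlSerDe in Java.
import xml.etree.ElementTree as ET

sample = """<file-format>
  <data-set>
    <element name="colour">red</element>
    <element name="size">large</element>
  </data-set>
</file-format>"""

# column.xpath.imap = /file-format/data-set/element selects the element nodes;
# xml.map.specification.element = @name->#content maps each node's name
# attribute to its text content, yielding the imap map<string,string> column.
root = ET.fromstring(sample)
imap = {el.get("name"): el.text for el in root.findall("./data-set/element")}

print(imap)  # {'colour': 'red', 'size': 'large'}
```

So, once the table creation itself succeeds, a row of this table should expose one map entry per `<element>` node between each xmlinput.start/xmlinput.end pair.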
Thanks

Dr Mich Talebzadeh