GitHub user magpierre opened a pull request:

    https://github.com/apache/drill/pull/451

    Drill 3878

    Please review my fix for JIRA DRILL-3878, which provides XML support for
    Apache Drill.
    The fix reuses the existing JSON support by converting XML to JSON with a
    simple SAX parser built for the purpose.
    The parser tries to produce acceptable JSON documents, which are then fed
    into the JSONRecordReader for further processing.
    
    To add XML support to Apache Drill, place the built package in the
    3rdparty folder of the built Apache Drill environment and start Drill.
    Then add:
    
    "xml": {
      "type": "xml",
      "extensions": [
        "xml"
      ],
      "keepPrefix": true
    }
    
    to the formats section of the dfs storage plugin configuration.
    Setting keepPrefix to false removes the namespace prefix from tag names in
    Apache Drill, since a namespace prefix can be named differently between
    documents and is not really part of the tag name.
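
    As a rough illustration of the keepPrefix option, the sketch below strips
    a namespace prefix from a qualified tag name. It is written in Python for
    brevity and is not the code in this pull request; the function name
    tag_name and its parameters are hypothetical.

```python
def tag_name(qname: str, keep_prefix: bool) -> str:
    """Return the field name to use for an XML qualified name.

    Hypothetical helper: with keep_prefix=False, the namespace prefix
    (everything up to and including the first colon) is dropped.
    """
    if keep_prefix:
        return qname
    # split(":", 1) yields one part when there is no prefix, two otherwise;
    # the last part is the local name in both cases.
    return qname.split(":", 1)[-1]
```

    With this sketch, the tags ns1:item and ns2:item from two different
    documents would both map to the field name item when keep_prefix is
    false.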
    
    The parser tries to be nice to the Drill JSON reader by avoiding mixed
    types, arranging recurring values in arrays, and removing empty elements.
    This is meant to minimize the number of JSON errors caused by the
    different natures of XML and Drill.
    
    Convention in JSON:
    Attributes are named with an @ prefix followed by the attribute name, and
    store simple values.
    Everything else is stored as an object with a #value field.
    This is somewhat consistent with Apache Spark XML, but I need to store all
    values in objects to avoid as many problems with maps of differing types
    as possible.
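
    To make the convention concrete, here is a minimal sketch of the same
    idea using Python's standard xml.sax module. It is an illustration under
    stated assumptions, not the Java parser in this pull request: the class
    XmlToJsonHandler and the function xml_to_json are hypothetical names, and
    the sketch turns a tag into an array only when it repeats under the same
    parent.

```python
import json
import xml.sax


class XmlToJsonHandler(xml.sax.ContentHandler):
    """Builds a dict tree: attributes as '@name', element text as '#value'."""

    def __init__(self):
        super().__init__()
        self.stack = [{}]  # stack[0] is a synthetic holder for the root element
        self.text = []     # one text buffer per currently open element

    def startElement(self, name, attrs):
        # Attributes become simple values keyed by '@' + attribute name.
        node = {"@" + key: value for key, value in attrs.items()}
        self.stack.append(node)
        self.text.append([])

    def characters(self, content):
        if self.text:
            self.text[-1].append(content)

    def endElement(self, name):
        node = self.stack.pop()
        value = "".join(self.text.pop()).strip()
        if value:
            node["#value"] = value
        if not node:
            return  # empty element: leave it out of the output
        parent = self.stack[-1]
        if name not in parent:
            parent[name] = node
        elif isinstance(parent[name], list):
            parent[name].append(node)
        else:
            # Second occurrence of the tag: consolidate into an array.
            parent[name] = [parent[name], node]


def xml_to_json(text: str) -> str:
    """Convert an XML document string to a JSON string."""
    handler = XmlToJsonHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return json.dumps(handler.stack[0])
```

    For example, calling xml_to_json on
    <order id="7"><item>pen</item><item>ink</item><note/></order>
    stores id as "@id", collects the two item elements into an array of
    objects with a #value field, and drops the empty note element.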
    
    Current limitations:
    DTD tags are currently not supported.
    The schema is not validated against XSDs.
    
    Also: since I am not a Drill developer, I may have broken every possible
    rule of syntax, format, layout, and test frameworks, as well as of how to
    submit pull requests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/magpierre/drill DRILL-3878

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/451.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #451
    
----
commit 844f34a16e75719535ff94c54d5337746ea18c20
Author: MPierre <magnus.pie...@icloud.com>
Date:   2015-11-05T14:42:06Z

    Initial commit
    
    XML support in Apache Drill

commit 592b3af06c2ff45198136577561f2ec1f7caaee0
Author: MPierre <magnus.pie...@icloud.com>
Date:   2015-11-05T21:21:42Z

    Fixed some minor outstanding bugs
    
    EasyRecordReader has a new field userName, and I forgot to change
    jsonProcessor to protected from private.

commit 8fad811edab43d3499b41bb66cb419248d11208f
Author: MPierre <magnus.pie...@icloud.com>
Date:   2015-11-09T08:59:08Z

    Merge remote-tracking branch 'apache/master' into DRILL-3878

commit 38f4884fe9b8456c1cde5de44c1e54177301a974
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-16T11:33:15Z

    Syncing to latest release of drill

commit 909c5dec8bdb01bfe0ed358ebc64c959785738df
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-16T11:34:10Z

    syncing to latest release of drill

commit 597d9657d613fa35df2c10dff23681545b13e531
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-18T08:55:51Z

    Cleaned up deliver
    
    Cleaned up the output generated by the SAX Parser, and removed all
    unnecessary code.

commit 0cfaa31ab9af89833417288a290d21d0ce88c4ac
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-18T10:29:51Z

    Merge remote-tracking branch 'apache/master' into DRILL-3878

commit aaaff05eb921125ad64854c89c179292c4441fb7
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-24T13:05:53Z

    Adjusted output from Parser to fit Drill better
    
    I have adjusted the SAX parser to produce JSON that Drill likes. Among
    the things corrected is to remove empty objects from the tree built.
    And to consolidate repeating values in arrays.

commit ba19a356d850224c01b9e807183377b46cf7e545
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-24T13:10:57Z

    Fixed small typo

commit 8ba6705be42c7847d469611ab070b869e0c76d8c
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-24T21:17:30Z

    Further enhancements of the output format to fit Drill

commit e2273f13b8e0136a33c1576c4667f16e23e1631c
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-24T21:22:41Z

    Removed comment

commit c1b6ff8375a7e3c8161167d1a5f2b34ba165e750
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-29T12:48:53Z

    Merge remote-tracking branch 'apache/master' into DRILL-3878

----

