[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread magpierre
Github user magpierre commented on the pull request:

https://github.com/apache/drill/pull/451#issuecomment-207091026
  
I will make the requested changes, thanks @jaltekruse. A quick note on the XML 
files it can support: I have tested around 100 different XML formats, ranging 
from simple to extremely complex structures, and the only limitation I have 
seen so far is that DTD statements in XML documents are not handled correctly. 
So with some thought I believe I have been able to map XML onto JSON in a way 
that is compatible with JSON and also circumvents most schema restrictions 
imposed by Drill. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread magpierre
Github user magpierre commented on a diff in the pull request:

https://github.com/apache/drill/pull/451#discussion_r58944903
  
--- Diff: contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLFormatPlugin.java ---
@@ -0,0 +1,139 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.xml;
+
+import java.io.IOException;
+import java.net.URL;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.common.logical.StoragePluginConfig;
+import org.apache.drill.exec.ExecConstants;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.proto.ExecProtos.FragmentHandle;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+import org.apache.drill.exec.server.DrillbitContext;
+import org.apache.drill.exec.store.RecordReader;
+import org.apache.drill.exec.store.RecordWriter;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.dfs.FileSystemConfig;
+import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
+import org.apache.drill.exec.store.dfs.easy.EasyWriter;
+import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.drill.exec.store.easy.xml.XMLFormatPlugin.XMLFormatConfig;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.drill.exec.store.easy.xml.XMLRecordReader;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Maps;
+
+/**
+ * Created by mpierre on 15-11-04.
+ */
+
+public class XMLFormatPlugin extends EasyFormatPlugin {
+
+private static final boolean IS_COMPRESSIBLE = false;
+private static final String DEFAULT_NAME = "xml";
+private Boolean keepPrefix = true;
+private XMLFormatConfig xmlConfig;
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig) {
+this(name, context, fsConf, storageConfig, new XMLFormatConfig());
+}
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, XMLFormatConfig formatPluginConfig) {
+super(name, context, fsConf, config, formatPluginConfig, true, false, false, IS_COMPRESSIBLE, formatPluginConfig.getExtensions(), DEFAULT_NAME);
+xmlConfig = formatPluginConfig;
+}
+
+@Override
+public RecordReader getRecordReader(FragmentContext context, DrillFileSystem dfs, FileWork fileWork,
+List columns, String userName) throws ExecutionSetupException {
+return new XMLRecordReader(context, fileWork.getPath(), dfs, columns, xmlConfig);
+}
+
+
+@Override
+public int getReaderOperatorType() {
+return CoreOperatorType.JSON_SUB_SCAN_VALUE;
--- End diff --

I was considering setting it to -1, but since it is really the JSON pieces 
already built into Drill that do all the heavy lifting, I decided to keep it 
this way. I will switch to -1 if that removes confusion.




[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread magpierre
Github user magpierre commented on a diff in the pull request:

https://github.com/apache/drill/pull/451#discussion_r58944565
  
--- Diff: contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLRecordReader.java ---
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.easy.xml;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.easy.json.JSONRecordReader;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.Path;
+import org.xml.sax.SAXException;
+import javax.xml.parsers.ParserConfigurationException;
+import javax.xml.parsers.SAXParser;
+import javax.xml.parsers.SAXParserFactory;
+import java.io.IOException;
+import java.util.List;
+
+
+public class XMLRecordReader extends JSONRecordReader {
--- End diff --

I will consider this. 




[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread magpierre
Github user magpierre commented on a diff in the pull request:

https://github.com/apache/drill/pull/451#discussion_r58944333
  
--- Diff: contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLRecordReader.java ---
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.easy.xml;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.easy.json.JSONRecordReader;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.Path;
+import org.xml.sax.SAXException;
+import javax.xml.parsers.ParserConfigurationException;
+import javax.xml.parsers.SAXParser;
+import javax.xml.parsers.SAXParserFactory;
+import java.io.IOException;
+import java.util.List;
+
+
+public class XMLRecordReader extends JSONRecordReader {
+private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(XMLRecordReader.class);
+private XMLSaxParser handler;
+private SAXParser xmlParser;
+private JsonNode node;
+
+public XMLRecordReader(FragmentContext fragmentContext, String inputPath, DrillFileSystem fileSystem, List columns, XMLFormatPlugin.XMLFormatConfig xmlConfig) throws OutOfMemoryException {
+super(fragmentContext, inputPath, fileSystem, columns);
+try {
+FSDataInputStream fsStream = fileSystem.open(new Path(inputPath));
+SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
+xmlParser = saxParserFactory.newSAXParser();
+handler = new XMLSaxParser();
+handler.setRemoveNameSpace(!xmlConfig.getKeepPrefix());
+xmlParser.parse(fsStream.getWrappedStream(), handler);
+ObjectMapper mapper = new ObjectMapper();
+node = mapper.valueToTree(handler.getVal());
+logger.debug("XML Plugin, Produced JSON:" + handler.getVal().toJSONString());
+xmlParser = null;
+handler = null;
+saxParserFactory = null;
+super.stream = null;
+super.embeddedContent = node;
+super.hadoopPath = null;
+}
+catch (SAXException | ParserConfigurationException | IOException e) {
+logger.debug("XML Plugin:" + e.getMessage());
+
+}
+}
+
+
+@Override
+public void setup(OperatorContext context, OutputMutator output) throws ExecutionSetupException {
--- End diff --

Nope, it is really just an effect of creating the class with a GUI wizard. I 
will remove them.




[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread magpierre
Github user magpierre commented on a diff in the pull request:

https://github.com/apache/drill/pull/451#discussion_r58943891
  
--- Diff: contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLRecordReader.java ---
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.easy.xml;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.easy.json.JSONRecordReader;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.Path;
+import org.xml.sax.SAXException;
+import javax.xml.parsers.ParserConfigurationException;
+import javax.xml.parsers.SAXParser;
+import javax.xml.parsers.SAXParserFactory;
+import java.io.IOException;
+import java.util.List;
+
+
+public class XMLRecordReader extends JSONRecordReader {
+private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(XMLRecordReader.class);
+private XMLSaxParser handler;
+private SAXParser xmlParser;
+private JsonNode node;
+
+public XMLRecordReader(FragmentContext fragmentContext, String inputPath, DrillFileSystem fileSystem, List columns, XMLFormatPlugin.XMLFormatConfig xmlConfig) throws OutOfMemoryException {
+super(fragmentContext, inputPath, fileSystem, columns);
+try {
+FSDataInputStream fsStream = fileSystem.open(new Path(inputPath));
+SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
+xmlParser = saxParserFactory.newSAXParser();
+handler = new XMLSaxParser();
+handler.setRemoveNameSpace(!xmlConfig.getKeepPrefix());
+xmlParser.parse(fsStream.getWrappedStream(), handler);
+ObjectMapper mapper = new ObjectMapper();
+node = mapper.valueToTree(handler.getVal());
--- End diff --

The XML parser is streaming-based. SAX is a stateless parser in which each tag 
triggers a series of events and then drops the information, so it comes with 
very little memory overhead. However, as noted, I keep all parsed objects in a 
stack since I need to link them. Because of the way the parsing works, the 
root of the document receives its children last, which means that once parsing 
is done I have the whole document in memory, but in JSON form, so at least I 
do not keep both a JSON document and an XML document in memory at the same 
time. I have considered storing state in files instead, but I don't think it 
would perform well given that I need to revisit objects frequently. If you 
have any suggestions, such as memory-mapped files that could spill to disk 
when large, they would be appreciated. 
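
A minimal sketch of the stack-based linking described in this comment, for 
readers who want to see why the whole tree ends up in memory. It is not the 
XMLSaxParser class from the pull request: the class name StackTreeBuilder, the 
parse helper, and the choice of nested maps are assumptions made for 
illustration only, and the sketch ignores attributes and repeated siblings.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not the XMLSaxParser from the pull request.
final class StackTreeBuilder extends DefaultHandler {
    // One entry per currently open element; the top is the innermost one.
    private final Deque<Map<String, Object>> open = new ArrayDeque<>();
    // Text collected so far for each open element.
    private final Deque<StringBuilder> text = new ArrayDeque<>();
    private Map<String, Object> root;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(new LinkedHashMap<>());
        text.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.peek().append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // An element is complete only when its end tag arrives, so it is
        // linked into its parent here. The root is therefore finished last.
        Map<String, Object> done = open.pop();
        String value = text.pop().toString().trim();
        if (!value.isEmpty()) {
            done.put("#value", value);
        }
        if (open.isEmpty()) {
            root = new LinkedHashMap<>();
            root.put(qName, done);
        } else {
            // Simplification: a repeated sibling name overwrites the earlier one.
            open.peek().put(qName, done);
        }
    }

    static Map<String, Object> parse(String xml) {
        StackTreeBuilder handler = new StackTreeBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
        return handler.root;
    }

    public static void main(String[] args) {
        // prints {a={b={#value=1}, c={#value=x}}}
        System.out.println(parse("<a><b>1</b><c>x</c></a>"));
    }
}
```

The stack never holds more entries than the nesting depth of the document, 
but every completed element stays reachable from its parent, which is why the 
finished tree, rather than the parser, dominates memory use.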




[GitHub] drill pull request: Drill 3878

2016-03-29 Thread magpierre
GitHub user magpierre opened a pull request:

https://github.com/apache/drill/pull/451

Drill 3878

Please review my fix for JIRA DRILL-3878, which provides XML support for 
Apache Drill.
The fix uses the existing support for JSON by converting XML to JSON with a 
simple SAX parser built for the purpose.
The parser tries to produce acceptable JSON documents that are then fed 
into the JSONRecordReader for further processing.

To add XML support to Apache Drill, place the built package in the 3rdparty 
folder of the built Apache Drill environment, and start Drill.
Add:

"xml": {
  "type": "xml",
  "extensions": [
    "xml"
  ],
  "keepPrefix": true
}

to the type section in dfs 
(keepPrefix = false removes the namespace prefix from tags in Apache Drill, 
since a namespace prefix can be named differently between documents and is not 
really part of the tag name)
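
A minimal sketch of what removing the prefix could look like, assuming the 
prefix is everything up to the first colon of the qualified tag name. The 
class and method names are hypothetical and are not taken from the pull 
request.

```java
// Hypothetical illustration of the keepPrefix option; not code from the PR.
final class PrefixStripper {
    // Returns the tag name as it would be used for the JSON field.
    static String tagName(String qName, boolean keepPrefix) {
        if (keepPrefix) {
            return qName;
        }
        int colon = qName.indexOf(':');
        // No colon means the tag has no namespace prefix to remove.
        return colon < 0 ? qName : qName.substring(colon + 1);
    }

    public static void main(String[] args) {
        System.out.println(tagName("ns1:order", true));   // prints ns1:order
        System.out.println(tagName("ns1:order", false));  // prints order
        System.out.println(tagName("order", false));      // prints order
    }
}
```

With the prefix removed, two documents that bind the same namespace to 
different prefixes produce the same field names, at the cost of possible 
collisions between tags that differ only in namespace.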

The parser tries to be nice to Drill and its JSON reader by avoiding mixed 
types, arranging recurring values in arrays, and removing empty elements. 
This is done in order to minimize the number of JSON errors caused by the 
different natures of XML and Drill.
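
One way to picture the array consolidation and empty-element removal is the 
sketch below. It assumes each parsed element is held as a map and that a 
parent collects children by tag name; ChildMerger, addChild, and leaf are 
hypothetical names, and the actual parser in the pull request may work 
differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the parser from the pull request.
final class ChildMerger {
    // Attaches a finished child to its parent under the given tag name.
    @SuppressWarnings("unchecked")
    static void addChild(Map<String, Object> parent, String name, Map<String, Object> child) {
        if (child.isEmpty()) {
            return; // empty elements are dropped
        }
        Object existing = parent.get(name);
        if (existing == null) {
            parent.put(name, child);               // first occurrence: plain object
        } else if (existing instanceof List) {
            ((List<Object>) existing).add(child);  // third and later: append
        } else {
            List<Object> list = new ArrayList<>(); // second occurrence: switch to array
            list.add(existing);
            list.add(child);
            parent.put(name, list);
        }
    }

    // Convenience helper building an element that only carries a text value.
    static Map<String, Object> leaf(String value) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("#value", value);
        return m;
    }

    public static void main(String[] args) {
        Map<String, Object> parent = new LinkedHashMap<>();
        addChild(parent, "item", leaf("a"));
        addChild(parent, "item", leaf("b"));
        addChild(parent, "note", new LinkedHashMap<>()); // empty, so dropped
        System.out.println(parent); // prints {item=[{#value=a}, {#value=b}]}
    }
}
```

Note that in this sketch a tag that occurs once stays a plain object while a 
tag that occurs twice becomes an array, so the same field can still differ in 
shape between records; the description does not say how the pull request 
deals with that case.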

Convention in JSON
Attributes are named using the convention of @ followed by the attribute name, 
and store simple values.
All other objects are stored as objects with a #value field.
This somewhat conforms with Apache Spark XML, but I need to store all values 
in objects in order to avoid as many map-of-different-type problems as 
possible.
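
The convention can be sketched as follows, using the standard SAX Attributes 
interface. The class ElementConvention and its toObject method are 
hypothetical and only illustrate the naming scheme; they are not taken from 
the pull request.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration of the @attribute / #value convention.
final class ElementConvention {
    // Builds the JSON-like object for one element from its attributes and text.
    static Map<String, Object> toObject(Attributes atts, String text) {
        Map<String, Object> obj = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            // Attributes become simple values keyed by "@" plus the name.
            obj.put("@" + atts.getQName(i), atts.getValue(i));
        }
        if (text != null && !text.trim().isEmpty()) {
            // Element text always goes into a "#value" field, so every
            // element is an object whether or not it has attributes.
            obj.put("#value", text.trim());
        }
        return obj;
    }

    public static void main(String[] args) {
        // Corresponds to <price currency="EUR">9.99</price>
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "currency", "currency", "CDATA", "EUR");
        System.out.println(toObject(atts, "9.99")); // prints {@currency=EUR, #value=9.99}
    }
}
```

Keeping the text in #value even when there are no attributes means a field 
such as price is an object in every record, rather than a string in some 
records and an object in others.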

Current limitations:
DTD tags are currently not handled correctly. 
Schema is not validated against XSDs.

Also: since I am not a Drill developer, I might have broken every possible 
rule of syntax, format, layout, and test frameworks, as well as of how to 
submit pull requests. 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/magpierre/drill DRILL-3878

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/451.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #451


commit 844f34a16e75719535ff94c54d5337746ea18c20
Author: MPierre <magnus.pie...@icloud.com>
Date:   2015-11-05T14:42:06Z

Initial commit

XML support in Apache Drill

commit 592b3af06c2ff45198136577561f2ec1f7caaee0
Author: MPierre <magnus.pie...@icloud.com>
Date:   2015-11-05T21:21:42Z

Fixed some minor outstanding bugs

EasyRecordReader have a new field userName, and I forgot to change
jsonProcessor to protected from private.

commit 8fad811edab43d3499b41bb66cb419248d11208f
Author: MPierre <magnus.pie...@icloud.com>
Date:   2015-11-09T08:59:08Z

Merge remote-tracking branch 'apache/master' into DRILL-3878

commit 38f4884fe9b8456c1cde5de44c1e54177301a974
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-16T11:33:15Z

Syncing to latest release of drill

commit 909c5dec8bdb01bfe0ed358ebc64c959785738df
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-16T11:34:10Z

syncing to latest release of drill

commit 597d9657d613fa35df2c10dff23681545b13e531
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-18T08:55:51Z

Cleaned up deliver

Cleaned up the output generated by the SAX Parser, and removed all
unnecessary code.

commit 0cfaa31ab9af89833417288a290d21d0ce88c4ac
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-18T10:29:51Z

Merge remote-tracking branch 'apache/master' into DRILL-3878

commit aaaff05eb921125ad64854c89c179292c4441fb7
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-24T13:05:53Z

Adjusted output from Parser to fit Drill better

I have adjusted the SAX parser to produce JSON that Drill likes. Among
the things corrected is to remove empty objects from the tree built.
And to consolidate repeating values in arrays.

commit ba19a356d850224c01b9e807183377b46cf7e545
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-24T13:10:57Z

Fixed small typo

commit 8ba6705be42c7847d469611ab070b869e0c76d8c
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-24T21:17:30Z

Further enhancements of the output format to fit Drill

commit e2273f13b8e0136a33c1576c4667f16e23e1631c
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-24T21:22:41Z

Removed comment

commit c1b6ff8375a7e3c8161167d1a5f2b34ba165e750
Author: MPierre <magnus.pie...@icloud.com>
Date:   2016-03-29T12:48:53Z

Merge remote-tracking branch 'apache/master' into DRILL-3878



