[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread magpierre
Github user magpierre commented on the pull request:

https://github.com/apache/drill/pull/451#issuecomment-207091026
  
I will do as requested, thanks @jaltekruse. A quick note on the XML 
files it can support: I have tested around 100 different XML formats, ranging 
from simple to extremely complex structures, and the only limitation I have seen 
so far is that DTD statements in XML documents are not 
handled correctly. With some thought, I believe I have been able to fit XML 
into JSON in a way that is compatible with JSON while also circumventing 
most of the schema restrictions imposed by Drill. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread magpierre
Github user magpierre commented on a diff in the pull request:

https://github.com/apache/drill/pull/451#discussion_r58944903
  
--- Diff: 
contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLFormatPlugin.java
 ---
@@ -0,0 +1,139 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.xml;
+
+import java.io.IOException;
+import java.net.URL;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.common.logical.StoragePluginConfig;
+import org.apache.drill.exec.ExecConstants;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.proto.ExecProtos.FragmentHandle;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+import org.apache.drill.exec.server.DrillbitContext;
+import org.apache.drill.exec.store.RecordReader;
+import org.apache.drill.exec.store.RecordWriter;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.dfs.FileSystemConfig;
+import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
+import org.apache.drill.exec.store.dfs.easy.EasyWriter;
+import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.drill.exec.store.easy.xml.XMLFormatPlugin.XMLFormatConfig;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.drill.exec.store.easy.xml.XMLRecordReader;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Maps;
+
+/**
+ * Created by mpierre on 15-11-04.
+ */
+
+public class XMLFormatPlugin extends EasyFormatPlugin {
+
+private static final boolean IS_COMPRESSIBLE = false;
+private static final String DEFAULT_NAME = "xml";
+private Boolean keepPrefix = true;
+private XMLFormatConfig xmlConfig;
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig) {
+this(name, context, fsConf, storageConfig, new XMLFormatConfig());
+}
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, XMLFormatConfig formatPluginConfig) {
+super(name, context, fsConf, config, formatPluginConfig, true, false, false, IS_COMPRESSIBLE, formatPluginConfig.getExtensions(), DEFAULT_NAME);
+xmlConfig = formatPluginConfig;
+}
+
+@Override
+public RecordReader getRecordReader(FragmentContext context, DrillFileSystem dfs, FileWork fileWork,
+List columns, String userName) throws ExecutionSetupException {
+return new XMLRecordReader(context, fileWork.getPath(), dfs, columns, xmlConfig);
+}
+
+
+@Override
+public int getReaderOperatorType() {
+return CoreOperatorType.JSON_SUB_SCAN_VALUE;
--- End diff --

I was considering setting it to -1, but since it is really the JSON pieces 
already built into Drill that do all the heavy lifting, I decided to keep 
it this way. I will switch to -1 if it removes confusion.




[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread magpierre
Github user magpierre commented on a diff in the pull request:

https://github.com/apache/drill/pull/451#discussion_r58944565
  
--- Diff: 
contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLRecordReader.java
 ---
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.easy.xml;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.easy.json.JSONRecordReader;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.Path;
+import org.xml.sax.SAXException;
+import javax.xml.parsers.ParserConfigurationException;
+import javax.xml.parsers.SAXParser;
+import javax.xml.parsers.SAXParserFactory;
+import java.io.IOException;
+import java.util.List;
+
+
+public class XMLRecordReader extends JSONRecordReader {
--- End diff --

I will consider this. 




[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread magpierre
Github user magpierre commented on a diff in the pull request:

https://github.com/apache/drill/pull/451#discussion_r58944333
  
--- Diff: 
contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLRecordReader.java
 ---
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.easy.xml;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.easy.json.JSONRecordReader;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.Path;
+import org.xml.sax.SAXException;
+import javax.xml.parsers.ParserConfigurationException;
+import javax.xml.parsers.SAXParser;
+import javax.xml.parsers.SAXParserFactory;
+import java.io.IOException;
+import java.util.List;
+
+
+public class XMLRecordReader extends JSONRecordReader {
+private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(XMLRecordReader.class);
+private XMLSaxParser handler;
+private SAXParser xmlParser;
+private JsonNode node;
+
+public XMLRecordReader(FragmentContext fragmentContext, String inputPath, DrillFileSystem fileSystem, List columns, XMLFormatPlugin.XMLFormatConfig xmlConfig) throws OutOfMemoryException {
+super(fragmentContext, inputPath, fileSystem, columns);
+try {
+FSDataInputStream fsStream = fileSystem.open(new Path(inputPath));
+SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
+xmlParser = saxParserFactory.newSAXParser();
+handler = new XMLSaxParser();
+handler.setRemoveNameSpace(!xmlConfig.getKeepPrefix());
+xmlParser.parse(fsStream.getWrappedStream(), handler);
+ObjectMapper mapper = new ObjectMapper();
+node = mapper.valueToTree(handler.getVal());
+logger.debug("XML Plugin, Produced JSON:" + handler.getVal().toJSONString());
+xmlParser = null;
+handler = null;
+saxParserFactory = null;
+super.stream = null;
+super.embeddedContent = node;
+super.hadoopPath = null;
+}
+catch (SAXException | ParserConfigurationException | IOException e) {
+logger.debug("XML Plugin:" + e.getMessage());
+
+}
+}
+
+
+@Override
+public void setup(OperatorContext context, OutputMutator output) throws ExecutionSetupException {
--- End diff --

Nope, it is really just a side effect of creating the class using a GUI wizard. I 
will remove them.




[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread magpierre
Github user magpierre commented on a diff in the pull request:

https://github.com/apache/drill/pull/451#discussion_r58943891
  
--- Diff: 
contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLRecordReader.java
 ---
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.easy.xml;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.easy.json.JSONRecordReader;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.Path;
+import org.xml.sax.SAXException;
+import javax.xml.parsers.ParserConfigurationException;
+import javax.xml.parsers.SAXParser;
+import javax.xml.parsers.SAXParserFactory;
+import java.io.IOException;
+import java.util.List;
+
+
+public class XMLRecordReader extends JSONRecordReader {
+private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(XMLRecordReader.class);
+private XMLSaxParser handler;
+private SAXParser xmlParser;
+private JsonNode node;
+
+public XMLRecordReader(FragmentContext fragmentContext, String inputPath, DrillFileSystem fileSystem, List columns, XMLFormatPlugin.XMLFormatConfig xmlConfig) throws OutOfMemoryException {
+super(fragmentContext, inputPath, fileSystem, columns);
+try {
+FSDataInputStream fsStream = fileSystem.open(new Path(inputPath));
+SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
+xmlParser = saxParserFactory.newSAXParser();
+handler = new XMLSaxParser();
+handler.setRemoveNameSpace(!xmlConfig.getKeepPrefix());
+xmlParser.parse(fsStream.getWrappedStream(), handler);
+ObjectMapper mapper = new ObjectMapper();
+node = mapper.valueToTree(handler.getVal());
--- End diff --

The XML parser is streaming-based. SAX is a stateless parser in which each tag 
triggers a series of events and the information is then dropped, so it 
comes with very little memory overhead. However, as noted, I keep all parsed 
objects in a stack because I need to link them. Because of the way the parsing 
works, the root of the document gets its children last, which means that 
once parsing is done I have the whole document in memory, but in JSON form, so at 
least I do not keep both a JSON document and an XML document in memory at the 
same time. I have considered storing state in files instead, but I don't think 
it would perform well, given that I need to revisit objects frequently. If 
you have any suggestions, such as memory-mapped files that could spill to 
disk when large, they would be appreciated. 
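
As a rough illustration of the stack-based linking described above, the following sketch shows why the root element is the last one to be completed. It is not the XMLSaxParser from this pull request: the class `StackSaxSketch`, its `Node` type, and the `parse` helper are hypothetical names used only for this example, and it relies solely on the standard `javax.xml.parsers` SAX API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the XMLSaxParser from the pull request.
class StackSaxSketch extends DefaultHandler {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Elements that have been opened but not yet closed.
    private final Deque<Node> open = new ArrayDeque<>();
    // Order in which elements are closed, i.e. have all their children linked.
    final List<String> completionOrder = new ArrayList<>();
    Node root;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        open.push(new Node(qName));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        Node done = open.pop();
        completionOrder.add(done.name);
        if (open.isEmpty()) {
            root = done;                      // the root is always completed last
        } else {
            open.peek().children.add(done);   // link the finished child to its parent
        }
    }

    static StackSaxSketch parse(String xml) {
        try {
            StackSaxSketch handler = new StackSaxSketch();
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Children are completed before their parents, so "a" comes last.
        System.out.println(parse("<a><b/><c><d/></c></a>").completionOrder);
    }
}
```

The stack itself only ever holds the currently open path, but because every finished child stays linked to its parent, the complete tree is reachable from the root once parsing ends, which is the memory cost discussed in this comment.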




[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread jaltekruse
Github user jaltekruse commented on the pull request:

https://github.com/apache/drill/pull/451#issuecomment-206989347
  
Hey @magpierre, thanks for the work on this. We have had requests for an XML 
reader from a number of community members in the past.

This is the right way to post your contribution. For some time now we have 
been using pull requests instead of the Apache Review Board instance.

Could you write some unit tests and documentation about the types of 
transformations you are doing to convert XML to JSON? I know that not all XML 
concepts fit into JSON, so it would be good to be explicit about what kinds of 
XML files this can support.




[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread jaltekruse
Github user jaltekruse commented on a diff in the pull request:

https://github.com/apache/drill/pull/451#discussion_r58904346
  
--- Diff: 
contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLRecordReader.java
 ---
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.easy.xml;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.easy.json.JSONRecordReader;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.Path;
+import org.xml.sax.SAXException;
+import javax.xml.parsers.ParserConfigurationException;
+import javax.xml.parsers.SAXParser;
+import javax.xml.parsers.SAXParserFactory;
+import java.io.IOException;
+import java.util.List;
+
+
+public class XMLRecordReader extends JSONRecordReader {
+private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(XMLRecordReader.class);
+private XMLSaxParser handler;
+private SAXParser xmlParser;
+private JsonNode node;
+
+public XMLRecordReader(FragmentContext fragmentContext, String inputPath, DrillFileSystem fileSystem, List columns, XMLFormatPlugin.XMLFormatConfig xmlConfig) throws OutOfMemoryException {
+super(fragmentContext, inputPath, fileSystem, columns);
+try {
+FSDataInputStream fsStream = fileSystem.open(new Path(inputPath));
+SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
+xmlParser = saxParserFactory.newSAXParser();
+handler = new XMLSaxParser();
+handler.setRemoveNameSpace(!xmlConfig.getKeepPrefix());
+xmlParser.parse(fsStream.getWrappedStream(), handler);
+ObjectMapper mapper = new ObjectMapper();
+node = mapper.valueToTree(handler.getVal());
--- End diff --

Considering that Drill can be used to read very large JSON files, and 
presumably users would expect to process similarly sized XML files, it probably 
makes sense to have this transformation happen in a streaming fashion rather 
than loading the whole file into memory and transforming it.
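
One possible shape for such a streaming transformation is sketched below. It assumes, purely for illustration, that the document is a root element wrapping a sequence of record elements, and it hands each record to a callback as soon as its closing tag is seen instead of retaining the whole tree. The class `StreamingRecordsSketch` and the method `forEachRecord` are hypothetical names; the sketch uses the standard StAX API (`javax.xml.stream`) rather than anything from this pull request, and it flattens each record to its text content to stay short.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: emit one record per direct child of the root element.
class StreamingRecordsSketch {
    static void forEachRecord(String xml, Consumer<String> sink) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            int depth = 0;
            String recordName = null;
            StringBuilder text = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) {             // a new record starts
                        recordName = reader.getLocalName();
                        text = new StringBuilder();
                    }
                } else if (event == XMLStreamConstants.CHARACTERS && depth >= 2) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth == 2) {             // record complete: emit it and forget it
                        sink.accept(recordName + "=" + text.toString().trim());
                        text = null;
                    }
                    depth--;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        forEachRecord("<rows><row>1</row><row>2</row></rows>", System.out::println);
    }
}
```

With this layout only one record is held in memory at a time. Whether the approach fits documents that do not follow a root-plus-records layout is an open question that the sketch does not address.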




[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread jaltekruse
Github user jaltekruse commented on a diff in the pull request:

https://github.com/apache/drill/pull/451#discussion_r58903153
  
--- Diff: 
contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLFormatPlugin.java
 ---
@@ -0,0 +1,139 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.xml;
+
+import java.io.IOException;
+import java.net.URL;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.common.logical.StoragePluginConfig;
+import org.apache.drill.exec.ExecConstants;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.proto.ExecProtos.FragmentHandle;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+import org.apache.drill.exec.server.DrillbitContext;
+import org.apache.drill.exec.store.RecordReader;
+import org.apache.drill.exec.store.RecordWriter;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.dfs.FileSystemConfig;
+import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
+import org.apache.drill.exec.store.dfs.easy.EasyWriter;
+import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.drill.exec.store.easy.xml.XMLFormatPlugin.XMLFormatConfig;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.drill.exec.store.easy.xml.XMLRecordReader;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Maps;
+
+/**
+ * Created by mpierre on 15-11-04.
+ */
+
+public class XMLFormatPlugin extends EasyFormatPlugin {
+
+private static final boolean IS_COMPRESSIBLE = false;
+private static final String DEFAULT_NAME = "xml";
+private Boolean keepPrefix = true;
+private XMLFormatConfig xmlConfig;
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig) {
+this(name, context, fsConf, storageConfig, new XMLFormatConfig());
+}
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, XMLFormatConfig formatPluginConfig) {
+super(name, context, fsConf, config, formatPluginConfig, true, false, false, IS_COMPRESSIBLE, formatPluginConfig.getExtensions(), DEFAULT_NAME);
+xmlConfig = formatPluginConfig;
+}
+
+@Override
+public RecordReader getRecordReader(FragmentContext context, DrillFileSystem dfs, FileWork fileWork,
+List columns, String userName) throws ExecutionSetupException {
+return new XMLRecordReader(context, fileWork.getPath(), dfs, columns, xmlConfig);
+}
+
+
+@Override
+public int getReaderOperatorType() {
+return CoreOperatorType.JSON_SUB_SCAN_VALUE;
+}
+
+@Override
+public int getWriterOperatorType() {
+throw new UnsupportedOperationException();
+}
+
+@Override
+public boolean supportsPushDown() {
+return true;
+}
+
+@Override
+public RecordWriter getRecordWriter(FragmentContext context, EasyWriter writer) throws IOException {
+return null;
+}
+
+@JsonTypeName("xml")
+public static class XMLFormatConfig implements FormatPluginConfig {
+
+public List extensions;
--- End diff --

This is a bit of a lingering cleanup task, but the current formats don't 
share code for handling extensions. If you have some time if you would like to 
try to 

[GitHub] drill pull request: Drill 3878: XML support in Apache Drill

2016-04-07 Thread jaltekruse
Github user jaltekruse commented on a diff in the pull request:

https://github.com/apache/drill/pull/451#discussion_r58902477
  
--- Diff: 
contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLFormatPlugin.java
 ---
@@ -0,0 +1,139 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.xml;
+
+import java.io.IOException;
+import java.net.URL;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.common.logical.StoragePluginConfig;
+import org.apache.drill.exec.ExecConstants;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.proto.ExecProtos.FragmentHandle;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+import org.apache.drill.exec.server.DrillbitContext;
+import org.apache.drill.exec.store.RecordReader;
+import org.apache.drill.exec.store.RecordWriter;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.dfs.FileSystemConfig;
+import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
+import org.apache.drill.exec.store.dfs.easy.EasyWriter;
+import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.drill.exec.store.easy.xml.XMLFormatPlugin.XMLFormatConfig;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.drill.exec.store.easy.xml.XMLRecordReader;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Maps;
+
+/**
+ * Created by mpierre on 15-11-04.
+ */
+
+public class XMLFormatPlugin extends EasyFormatPlugin {
+
+private static final boolean IS_COMPRESSIBLE = false;
+private static final String DEFAULT_NAME = "xml";
+private Boolean keepPrefix = true;
+private XMLFormatConfig xmlConfig;
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig) {
+this(name, context, fsConf, storageConfig, new XMLFormatConfig());
+}
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, XMLFormatConfig formatPluginConfig) {
+super(name, context, fsConf, config, formatPluginConfig, true, false, false, IS_COMPRESSIBLE, formatPluginConfig.getExtensions(), DEFAULT_NAME);
+xmlConfig = formatPluginConfig;
+}
+
+@Override
+public RecordReader getRecordReader(FragmentContext context, DrillFileSystem dfs, FileWork fileWork,
+List columns, String userName) throws ExecutionSetupException {
+return new XMLRecordReader(context, fileWork.getPath(), dfs, columns, xmlConfig);
+}
+
+
+@Override
+public int getReaderOperatorType() {
+return CoreOperatorType.JSON_SUB_SCAN_VALUE;
--- End diff --

There is a limitation in Drill that makes this hard to implement correctly. 
In the past we have added a new value to a Protobuf definition for each 
new storage or format plugin so that we can send information about the 
operation and report what type it is. Unfortunately, this is not flexible and 
doesn't really fit the user-extensible model for format plugins. I don't expect 
you to fix this issue as part of this pull request; I'll start a thread on the 
list about it.

To prevent confusion, it might be better to just return -1 here, which 
should show up on the web UI query profiles as unknown operator.



[GitHub] drill pull request: Drill 3878

2016-03-29 Thread magpierre
GitHub user magpierre opened a pull request:

https://github.com/apache/drill/pull/451

Drill 3878

Please review my fix for JIRA DRILL-3878, which provides XML support for Apache 
Drill.
The fix uses the existing JSON support by converting XML to JSON 
with a simple SAX parser built for the purpose.
The parser tries to produce acceptable JSON documents, which are then fed 
into the JSONRecordReader for further processing.

To add XML support to Apache Drill, place the built package in the 
3rdparty folder of the built Apache Drill environment, and start.
Add:

"xml": {
  "type": "xml",
  "extensions": [
    "xml"
  ],
  "keepPrefix": true
}

to the type section in dfs.
(Setting keepPrefix to false removes namespace prefixes from tags in Apache 
Drill, since a namespace prefix can be named differently between documents and 
is not really part of the tag name.)
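
The effect of keepPrefix can be pictured with the small sketch below. It is an assumption about the behavior described here, not code taken from the plugin: `PrefixSketch` and `tagName` are hypothetical names, and the sketch simply drops everything up to the first colon of a qualified tag name when keepPrefix is false.

```java
// Hypothetical sketch of the keepPrefix option; not the plugin's own code.
class PrefixSketch {
    // Returns the tag name as it would appear as a field name.
    static String tagName(String qualifiedName, boolean keepPrefix) {
        if (keepPrefix) {
            return qualifiedName;
        }
        int colon = qualifiedName.indexOf(':');
        return colon < 0 ? qualifiedName : qualifiedName.substring(colon + 1);
    }

    public static void main(String[] args) {
        System.out.println(tagName("ns1:order", true));   // prints ns1:order
        System.out.println(tagName("ns1:order", false));  // prints order
    }
}
```

Two documents that bind the same namespace to different prefixes (for example ns1:order and o:order) would then both yield the field name order.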

The parser tries to be nice to Drill's JSON reader by avoiding mixed 
types, arranging recurring values in arrays, and removing empty elements. 
This is done to minimize the number of JSON errors caused by the different 
natures of XML and Drill.
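
A minimal sketch of two of these rules, collecting recurring values into an array and dropping empty elements, could look like the following. The class `RepeatedValuesSketch` and the method `addChild` are hypothetical and do not come from the pull request; plain `java.util` maps and lists stand in for whatever JSON object model the parser actually uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the parser in the pull request may work differently.
class RepeatedValuesSketch {
    // Adds value under key. A second value for the same key turns the entry
    // into a list, and empty objects are skipped entirely.
    @SuppressWarnings("unchecked")
    static void addChild(Map<String, Object> parent, String key, Object value) {
        if (value == null || (value instanceof Map && ((Map<?, ?>) value).isEmpty())) {
            return; // empty element: leave it out of the output
        }
        Object existing = parent.get(key);
        if (existing == null) {
            parent.put(key, value);
        } else if (existing instanceof List) {
            ((List<Object>) existing).add(value);
        } else {
            List<Object> values = new ArrayList<>();
            values.add(existing);
            values.add(value);
            parent.put(key, values);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> order = new LinkedHashMap<>();
        addChild(order, "item", "a");
        addChild(order, "item", "b");
        addChild(order, "note", new LinkedHashMap<String, Object>());
        System.out.println(order); // prints {item=[a, b]}
    }
}
```

Note that this sketch leaves a value that occurs only once as a scalar, so the same tag could appear as a scalar in one document and as an array in another. How the actual parser deals with that case is not stated in this description.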

Convention in JSON:
Attributes are named using the convention of @ followed by the attribute name, 
and they store simple values.
All other objects are stored as objects with a #value field.
This conforms somewhat to Apache Spark XML, but I need to store all 
values in objects in order to avoid as many map-of-different-type problems as 
possible.
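
To make the convention concrete, here is a small sketch that applies it with a SAX handler: attributes become "@name" entries holding simple values, and element text is stored under "#value" inside an object. It is an illustration under these assumptions rather than the parser from this pull request, whose exact output may differ; `ConventionSketch` and `convert` are hypothetical names, and repeated sibling tags are not handled here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of the @attribute / #value convention.
class ConventionSketch extends DefaultHandler {
    private final Deque<Map<String, Object>> open = new ArrayDeque<>();
    private final Deque<StringBuilder> text = new ArrayDeque<>();
    final Map<String, Object> result = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        Map<String, Object> obj = new LinkedHashMap<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            obj.put("@" + attrs.getQName(i), attrs.getValue(i)); // attribute as simple value
        }
        open.push(obj);
        text.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!text.isEmpty()) {
            text.peek().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        Map<String, Object> obj = open.pop();
        String value = text.pop().toString().trim();
        if (!value.isEmpty()) {
            obj.put("#value", value); // element text lives inside the object
        }
        Map<String, Object> parent = open.isEmpty() ? result : open.peek();
        parent.put(qName, obj); // a repeated sibling would overwrite; not handled in this sketch
    }

    static Map<String, Object> convert(String xml) {
        try {
            ConventionSketch handler = new ConventionSketch();
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.result;
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("<book id=\"42\"><title>Drill</title></book>"));
    }
}
```

Because every element becomes an object, a tag that carries attributes in one place and only text in another still maps to the same kind of value, which is the reason given above for not storing text as a bare scalar.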

Current limitations:
DTD tags are currently not handled. 
Schema is not validated against XSDs.

Also: since I am not a Drill developer, I might have broken every possible 
rule of syntax, format, layout, and test frameworks, as well as how to submit 
pull requests. 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/magpierre/drill DRILL-3878

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/451.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #451


commit 844f34a16e75719535ff94c54d5337746ea18c20
Author: MPierre 
Date:   2015-11-05T14:42:06Z

Initial commit

XML support in Apache Drill

commit 592b3af06c2ff45198136577561f2ec1f7caaee0
Author: MPierre 
Date:   2015-11-05T21:21:42Z

Fixed some minor outstanding bugs

EasyRecordReader have a new field userName, and I forgot to change
jsonProcessor to protected from private.

commit 8fad811edab43d3499b41bb66cb419248d11208f
Author: MPierre 
Date:   2015-11-09T08:59:08Z

Merge remote-tracking branch 'apache/master' into DRILL-3878

commit 38f4884fe9b8456c1cde5de44c1e54177301a974
Author: MPierre 
Date:   2016-03-16T11:33:15Z

Syncing to latest release of drill

commit 909c5dec8bdb01bfe0ed358ebc64c959785738df
Author: MPierre 
Date:   2016-03-16T11:34:10Z

syncing to latest release of drill

commit 597d9657d613fa35df2c10dff23681545b13e531
Author: MPierre 
Date:   2016-03-18T08:55:51Z

Cleaned up deliver

Cleaned up the output generated by the SAX Parser, and removed all
unnecessary code.

commit 0cfaa31ab9af89833417288a290d21d0ce88c4ac
Author: MPierre 
Date:   2016-03-18T10:29:51Z

Merge remote-tracking branch 'apache/master' into DRILL-3878

commit aaaff05eb921125ad64854c89c179292c4441fb7
Author: MPierre 
Date:   2016-03-24T13:05:53Z

Adjusted output from Parser to fit Drill better

I have adjusted the SAX parser to produce JSON that Drill likes. Among
the things corrected is to remove empty objects from the tree built.
And to consolidate repeating values in arrays.

commit ba19a356d850224c01b9e807183377b46cf7e545
Author: MPierre 
Date:   2016-03-24T13:10:57Z

Fixed small typo

commit 8ba6705be42c7847d469611ab070b869e0c76d8c
Author: MPierre 
Date:   2016-03-24T21:17:30Z

Further enhancements of the output format to fit Drill

commit e2273f13b8e0136a33c1576c4667f16e23e1631c
Author: MPierre 
Date:   2016-03-24T21:22:41Z

Removed comment

commit c1b6ff8375a7e3c8161167d1a5f2b34ba165e750
Author: MPierre 
Date:   2016-03-29T12:48:53Z

Merge remote-tracking branch 'apache/master' into DRILL-3878



