[GitHub] drill pull request: Drill 3878: XML support in Apache Drill
Github user magpierre commented on the pull request: https://github.com/apache/drill/pull/451#issuecomment-207091026

I will do as requested, thanks @jaltekruse. Just a quick note on the XML files it can support: I have tested around 100 different XML formats, ranging from simple to extremely complex structures, and the only limitation I have seen so far is that DTD declarations in the XML documents are not handled correctly. So, with some thought, I believe I've been able to fit XML into JSON in such a way that it is in fact compatible with JSON, while also circumventing most schema restrictions imposed by Drill.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
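The comment above does not say how the DTD declarations fail. As a rough sketch only, assuming the trouble comes from the parser trying to load an external DTD, a SAX handler can answer `resolveEntity` with an empty source so the declaration is skipped. The class and method names here (`DtdTolerantSaxSketch`, `rootElementName`) are hypothetical and are not part of the pull request.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: parse XML that declares an external DTD without loading it. */
class DtdTolerantSaxSketch {

  /** Returns the name of the root element, ignoring any external DTD or entity. */
  static String rootElementName(String xml) {
    final String[] root = new String[1];
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public InputSource resolveEntity(String publicId, String systemId) {
        // Assumption: the DTD content is not needed, so every external
        // entity (including the DTD itself) is replaced by an empty document.
        return new InputSource(new StringReader(""));
      }

      @Override
      public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (root[0] == null) {
          root[0] = qName; // the first start tag is the document root
        }
      }
    };
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse XML", e);
    }
    return root[0];
  }

  public static void main(String[] args) {
    String xml = "<!DOCTYPE catalog SYSTEM \"missing-catalog.dtd\"><catalog><item/></catalog>";
    System.out.println(rootElementName(xml));
  }
}
```

Whether this fits the reader in the pull request depends on what actually goes wrong with DTDs there; documents that rely on entities defined in the DTD would still fail with this approach.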
[GitHub] drill pull request: Drill 3878: XML support in Apache Drill
Github user magpierre commented on a diff in the pull request: https://github.com/apache/drill/pull/451#discussion_r58944903 --- Diff: contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLFormatPlugin.java --- @@ -0,0 +1,139 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.easy.xml; + +import java.io.IOException; +import java.net.URL; +import java.util.List; +import java.util.Map; + +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.logical.FormatPluginConfig; +import org.apache.drill.common.logical.StoragePluginConfig; +import org.apache.drill.exec.ExecConstants; +import org.apache.drill.exec.ops.FragmentContext; +import org.apache.drill.exec.proto.ExecProtos.FragmentHandle; +import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType; +import org.apache.drill.exec.server.DrillbitContext; +import org.apache.drill.exec.store.RecordReader; +import org.apache.drill.exec.store.RecordWriter; +import org.apache.drill.exec.store.dfs.DrillFileSystem; +import org.apache.drill.exec.store.dfs.FileSystemConfig; +import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin; +import org.apache.drill.exec.store.dfs.easy.EasyWriter; +import org.apache.drill.exec.store.dfs.easy.FileWork; +import org.apache.drill.exec.store.easy.xml.XMLFormatPlugin.XMLFormatConfig; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileSystem; +import org.apache.drill.exec.store.easy.xml.XMLRecordReader; + +import com.fasterxml.jackson.annotation.JsonInclude; +import com.fasterxml.jackson.annotation.JsonTypeName; +import com.google.common.collect.ImmutableList; +import com.google.common.collect.Maps; + +/** + * Created by mpierre on 15-11-04. 
+ */
+
+public class XMLFormatPlugin extends EasyFormatPlugin {
+
+private static final boolean IS_COMPRESSIBLE = false;
+private static final String DEFAULT_NAME = "xml";
+private Boolean keepPrefix = true;
+private XMLFormatConfig xmlConfig;
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig) {
+this(name, context, fsConf, storageConfig, new XMLFormatConfig());
+}
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, XMLFormatConfig formatPluginConfig) {
+super(name, context, fsConf, config, formatPluginConfig, true, false, false, IS_COMPRESSIBLE, formatPluginConfig.getExtensions(), DEFAULT_NAME);
+xmlConfig = formatPluginConfig;
+}
+
+@Override
+public RecordReader getRecordReader(FragmentContext context, DrillFileSystem dfs, FileWork fileWork,
+List columns, String userName) throws ExecutionSetupException {
+return new XMLRecordReader(context, fileWork.getPath(), dfs, columns, xmlConfig);
+}
+
+
+@Override
+public int getReaderOperatorType() {
+return CoreOperatorType.JSON_SUB_SCAN_VALUE;
--- End diff --

I was considering setting it to -1, but since it is really the JSON pieces already built into Drill that do all the heavy lifting, I decided to keep it this way. I will switch to -1 if that removes confusion.
[GitHub] drill pull request: Drill 3878: XML support in Apache Drill
Github user magpierre commented on a diff in the pull request: https://github.com/apache/drill/pull/451#discussion_r58944565 --- Diff: contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLRecordReader.java --- @@ -0,0 +1,95 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+package org.apache.drill.exec.store.easy.xml;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.impl.OutputMutator;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.easy.json.JSONRecordReader;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.Path;
+import org.xml.sax.SAXException;
+import javax.xml.parsers.ParserConfigurationException;
+import javax.xml.parsers.SAXParser;
+import javax.xml.parsers.SAXParserFactory;
+import java.io.IOException;
+import java.util.List;
+
+
+public class XMLRecordReader extends JSONRecordReader {
--- End diff --

I will consider this.
[GitHub] drill pull request: Drill 3878: XML support in Apache Drill
Github user magpierre commented on a diff in the pull request: https://github.com/apache/drill/pull/451#discussion_r58944333 --- Diff: contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLRecordReader.java --- @@ -0,0 +1,95 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.drill.exec.store.easy.xml; + +import com.fasterxml.jackson.annotation.JsonInclude; +import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.ObjectMapper; +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.exec.exception.OutOfMemoryException; +import org.apache.drill.exec.ops.FragmentContext; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.physical.impl.OutputMutator; +import org.apache.drill.exec.store.dfs.DrillFileSystem; +import org.apache.drill.exec.store.easy.json.JSONRecordReader; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.Path; +import org.xml.sax.SAXException; +import javax.xml.parsers.ParserConfigurationException; +import javax.xml.parsers.SAXParser; +import javax.xml.parsers.SAXParserFactory; +import java.io.IOException; +import java.util.List; + + +public class XMLRecordReader extends JSONRecordReader { +private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(XMLRecordReader.class); +private XMLSaxParser handler; +private SAXParser xmlParser; +private JsonNode node; + +public XMLRecordReader(FragmentContext fragmentContext, String inputPath, DrillFileSystem fileSystem, List columns, XMLFormatPlugin.XMLFormatConfig xmlConfig) throws OutOfMemoryException { +super(fragmentContext, inputPath, fileSystem, columns); +try { +FSDataInputStream fsStream = fileSystem.open(new Path(inputPath)); +SAXParserFactory saxParserFactory = SAXParserFactory.newInstance(); +xmlParser = saxParserFactory.newSAXParser(); +handler = new XMLSaxParser(); +handler.setRemoveNameSpace(!xmlConfig.getKeepPrefix()); +xmlParser.parse(fsStream.getWrappedStream(), handler); +ObjectMapper mapper = new ObjectMapper(); +node = mapper.valueToTree(handler.getVal()); +logger.debug("XML Plugin, 
Produced JSON:" + handler.getVal().toJSONString());
+xmlParser = null;
+handler = null;
+saxParserFactory = null;
+super.stream = null;
+super.embeddedContent = node;
+super.hadoopPath = null;
+}
+catch (SAXException | ParserConfigurationException | IOException e) {
+logger.debug("XML Plugin:" + e.getMessage());
+
+}
+}
+
+
+@Override
+public void setup(OperatorContext context, OutputMutator output) throws ExecutionSetupException {
--- End diff --

Nope, it is really just the effect of creating the class using a GUI wizard. I will remove them.
[GitHub] drill pull request: Drill 3878: XML support in Apache Drill
Github user magpierre commented on a diff in the pull request: https://github.com/apache/drill/pull/451#discussion_r58943891 --- Diff: contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLRecordReader.java --- @@ -0,0 +1,95 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.drill.exec.store.easy.xml; + +import com.fasterxml.jackson.annotation.JsonInclude; +import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.ObjectMapper; +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.exec.exception.OutOfMemoryException; +import org.apache.drill.exec.ops.FragmentContext; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.physical.impl.OutputMutator; +import org.apache.drill.exec.store.dfs.DrillFileSystem; +import org.apache.drill.exec.store.easy.json.JSONRecordReader; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.Path; +import org.xml.sax.SAXException; +import javax.xml.parsers.ParserConfigurationException; +import javax.xml.parsers.SAXParser; +import javax.xml.parsers.SAXParserFactory; +import java.io.IOException; +import java.util.List; + + +public class XMLRecordReader extends JSONRecordReader { +private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(XMLRecordReader.class); +private XMLSaxParser handler; +private SAXParser xmlParser; +private JsonNode node; + +public XMLRecordReader(FragmentContext fragmentContext, String inputPath, DrillFileSystem fileSystem, List columns, XMLFormatPlugin.XMLFormatConfig xmlConfig) throws OutOfMemoryException { +super(fragmentContext, inputPath, fileSystem, columns); +try { +FSDataInputStream fsStream = fileSystem.open(new Path(inputPath)); +SAXParserFactory saxParserFactory = SAXParserFactory.newInstance(); +xmlParser = saxParserFactory.newSAXParser(); +handler = new XMLSaxParser(); +handler.setRemoveNameSpace(!xmlConfig.getKeepPrefix()); +xmlParser.parse(fsStream.getWrappedStream(), handler); +ObjectMapper mapper = new ObjectMapper(); +node = mapper.valueToTree(handler.getVal()); --- End diff -- The XML parser 
is streaming based. SAX is a stateless parser: each tag triggers a series of events and the information is then dropped, so it comes with very little memory overhead. However, as noted, I keep all parsed objects in a stack since I need to link them. Because of the way the parsing works, the root of the document gets its children last, which means that once parsing is done I have the whole document in memory, but in JSON format, so at least I do not keep both a JSON document and an XML document in memory at the same time. I've been considering storing state in files instead, but I don't think it would perform well, considering I need to revisit objects frequently. If you have any suggestions, such as memory-mapped files that could spill to disk when large, they would be appreciated.
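To illustrate the point about the root getting its children last, here is a minimal sketch, not taken from the pull request, that records the order in which SAX closes elements. The class name `SaxCompletionOrderSketch` and its helper are hypothetical; only the standard `javax.xml.parsers` and `org.xml.sax` APIs are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: shows that SAX finishes child elements before their parents. */
class SaxCompletionOrderSketch extends DefaultHandler {
  private final Deque<String> open = new ArrayDeque<>();      // elements started but not yet closed
  private final List<String> completed = new ArrayList<>();   // elements in the order they closed
  private int maxDepth;

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attributes) {
    open.push(qName);
    maxDepth = Math.max(maxDepth, open.size());
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    // A parent can only be linked to its children once they have all closed.
    completed.add(open.pop());
  }

  /** Returns the completion order and the deepest stack seen, as one string. */
  static String completionOrder(String xml) {
    SaxCompletionOrderSketch handler = new SaxCompletionOrderSketch();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse XML", e);
    }
    return handler.completed + " maxDepth=" + handler.maxDepth;
  }

  public static void main(String[] args) {
    System.out.println(completionOrder("<a><b><c/></b><d/></a>"));
  }
}
```

The stack itself only grows with nesting depth; it is the finished child objects attached to still-open parents that accumulate, which is why the whole converted document ends up in memory before the root closes.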
[GitHub] drill pull request: Drill 3878: XML support in Apache Drill
Github user jaltekruse commented on the pull request: https://github.com/apache/drill/pull/451#issuecomment-206989347

Hey @magpierre, thanks for the work on this; we have had requests for an XML reader from a number of community members in the past. This is the right way to post your contribution: for some time now we have been using pull requests instead of the Apache reviewboard instance. Could you write some unit tests and documentation about the types of transformations you are doing to convert XML to JSON? I know that not all XML concepts fit into JSON, so it would be good to be explicit about what kinds of XML files this can support.
[GitHub] drill pull request: Drill 3878: XML support in Apache Drill
Github user jaltekruse commented on a diff in the pull request: https://github.com/apache/drill/pull/451#discussion_r58904346 --- Diff: contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLRecordReader.java --- @@ -0,0 +1,95 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.drill.exec.store.easy.xml; + +import com.fasterxml.jackson.annotation.JsonInclude; +import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.ObjectMapper; +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.exec.exception.OutOfMemoryException; +import org.apache.drill.exec.ops.FragmentContext; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.physical.impl.OutputMutator; +import org.apache.drill.exec.store.dfs.DrillFileSystem; +import org.apache.drill.exec.store.easy.json.JSONRecordReader; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.Path; +import org.xml.sax.SAXException; +import javax.xml.parsers.ParserConfigurationException; +import javax.xml.parsers.SAXParser; +import javax.xml.parsers.SAXParserFactory; +import java.io.IOException; +import java.util.List; + + +public class XMLRecordReader extends JSONRecordReader { +private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(XMLRecordReader.class); +private XMLSaxParser handler; +private SAXParser xmlParser; +private JsonNode node; + +public XMLRecordReader(FragmentContext fragmentContext, String inputPath, DrillFileSystem fileSystem, List columns, XMLFormatPlugin.XMLFormatConfig xmlConfig) throws OutOfMemoryException { +super(fragmentContext, inputPath, fileSystem, columns); +try { +FSDataInputStream fsStream = fileSystem.open(new Path(inputPath)); +SAXParserFactory saxParserFactory = SAXParserFactory.newInstance(); +xmlParser = saxParserFactory.newSAXParser(); +handler = new XMLSaxParser(); +handler.setRemoveNameSpace(!xmlConfig.getKeepPrefix()); +xmlParser.parse(fsStream.getWrappedStream(), handler); +ObjectMapper mapper = new ObjectMapper(); +node = mapper.valueToTree(handler.getVal()); --- End diff -- Considering 
that Drill can be used to read very large JSON files, and presumably users would expect to process similarly sized XML files, it probably makes sense to have this transformation happen in a streaming fashion rather than loading the whole file into memory and transforming it.
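As one possible reading of "in a streaming fashion", the sketch below uses the JDK's StAX pull parser to hand over one record at a time instead of building the whole document first. This is not what the pull request does. The names `StreamingRecordsSketch`, `forEachRecord` and the `recordTag` parameter are hypothetical, and the sketch assumes flat records whose child elements contain only text.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: emit one record per recordTag element while reading. */
class StreamingRecordsSketch {

  /** Calls sink once per record element and returns the number of records seen. */
  static int forEachRecord(InputStream in, String recordTag, Consumer<Map<String, String>> sink) {
    int count = 0;
    try {
      XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
      Map<String, String> current = null; // fields of the record being read, if any
      String field = null;                // child element currently open inside the record
      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          String name = reader.getLocalName();
          if (name.equals(recordTag)) {
            current = new LinkedHashMap<>();
          } else if (current != null) {
            field = name;
          }
        } else if (event == XMLStreamConstants.CHARACTERS && current != null && field != null) {
          // Text may arrive in several chunks, so append rather than overwrite.
          current.merge(field, reader.getText(), String::concat);
        } else if (event == XMLStreamConstants.END_ELEMENT) {
          if (current != null && reader.getLocalName().equals(recordTag)) {
            sink.accept(current); // only this one record is held in memory
            current = null;
            count++;
          }
          field = null;
        }
      }
      reader.close();
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException("could not parse XML", e);
    }
    return count;
  }
}
```

A reader built this way needs to know which element marks a record, which is a design decision the thread has not settled; arbitrarily nested XML would need a more general approach.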
[GitHub] drill pull request: Drill 3878: XML support in Apache Drill
Github user jaltekruse commented on a diff in the pull request: https://github.com/apache/drill/pull/451#discussion_r58903153 --- Diff: contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLFormatPlugin.java --- @@ -0,0 +1,139 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.easy.xml; + +import java.io.IOException; +import java.net.URL; +import java.util.List; +import java.util.Map; + +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.logical.FormatPluginConfig; +import org.apache.drill.common.logical.StoragePluginConfig; +import org.apache.drill.exec.ExecConstants; +import org.apache.drill.exec.ops.FragmentContext; +import org.apache.drill.exec.proto.ExecProtos.FragmentHandle; +import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType; +import org.apache.drill.exec.server.DrillbitContext; +import org.apache.drill.exec.store.RecordReader; +import org.apache.drill.exec.store.RecordWriter; +import org.apache.drill.exec.store.dfs.DrillFileSystem; +import org.apache.drill.exec.store.dfs.FileSystemConfig; +import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin; +import org.apache.drill.exec.store.dfs.easy.EasyWriter; +import org.apache.drill.exec.store.dfs.easy.FileWork; +import org.apache.drill.exec.store.easy.xml.XMLFormatPlugin.XMLFormatConfig; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileSystem; +import org.apache.drill.exec.store.easy.xml.XMLRecordReader; + +import com.fasterxml.jackson.annotation.JsonInclude; +import com.fasterxml.jackson.annotation.JsonTypeName; +import com.google.common.collect.ImmutableList; +import com.google.common.collect.Maps; + +/** + * Created by mpierre on 15-11-04. 
+ */
+
+public class XMLFormatPlugin extends EasyFormatPlugin {
+
+private static final boolean IS_COMPRESSIBLE = false;
+private static final String DEFAULT_NAME = "xml";
+private Boolean keepPrefix = true;
+private XMLFormatConfig xmlConfig;
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig) {
+this(name, context, fsConf, storageConfig, new XMLFormatConfig());
+}
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, XMLFormatConfig formatPluginConfig) {
+super(name, context, fsConf, config, formatPluginConfig, true, false, false, IS_COMPRESSIBLE, formatPluginConfig.getExtensions(), DEFAULT_NAME);
+xmlConfig = formatPluginConfig;
+}
+
+@Override
+public RecordReader getRecordReader(FragmentContext context, DrillFileSystem dfs, FileWork fileWork,
+List columns, String userName) throws ExecutionSetupException {
+return new XMLRecordReader(context, fileWork.getPath(), dfs, columns, xmlConfig);
+}
+
+
+@Override
+public int getReaderOperatorType() {
+return CoreOperatorType.JSON_SUB_SCAN_VALUE;
+}
+
+@Override
+public int getWriterOperatorType() {
+throw new UnsupportedOperationException();
+}
+
+@Override
+public boolean supportsPushDown() {
+return true;
+}
+
+@Override
+public RecordWriter getRecordWriter(FragmentContext context, EasyWriter writer) throws IOException {
+return null;
+}
+
+@JsonTypeName("xml")
+public static class XMLFormatConfig implements FormatPluginConfig {
+
+public List extensions;
--- End diff --

This is a bit of a lingering cleanup task, but the current formats don't share code for handling extensions. If you have some time if you would like to try to
[GitHub] drill pull request: Drill 3878: XML support in Apache Drill
Github user jaltekruse commented on a diff in the pull request: https://github.com/apache/drill/pull/451#discussion_r58902477 --- Diff: contrib/storage-xml/src/main/java/org/apache/drill/exec/store/easy/xml/XMLFormatPlugin.java --- @@ -0,0 +1,139 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.easy.xml; + +import java.io.IOException; +import java.net.URL; +import java.util.List; +import java.util.Map; + +import org.apache.drill.common.exceptions.ExecutionSetupException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.common.logical.FormatPluginConfig; +import org.apache.drill.common.logical.StoragePluginConfig; +import org.apache.drill.exec.ExecConstants; +import org.apache.drill.exec.ops.FragmentContext; +import org.apache.drill.exec.proto.ExecProtos.FragmentHandle; +import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType; +import org.apache.drill.exec.server.DrillbitContext; +import org.apache.drill.exec.store.RecordReader; +import org.apache.drill.exec.store.RecordWriter; +import org.apache.drill.exec.store.dfs.DrillFileSystem; +import org.apache.drill.exec.store.dfs.FileSystemConfig; +import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin; +import org.apache.drill.exec.store.dfs.easy.EasyWriter; +import org.apache.drill.exec.store.dfs.easy.FileWork; +import org.apache.drill.exec.store.easy.xml.XMLFormatPlugin.XMLFormatConfig; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileSystem; +import org.apache.drill.exec.store.easy.xml.XMLRecordReader; + +import com.fasterxml.jackson.annotation.JsonInclude; +import com.fasterxml.jackson.annotation.JsonTypeName; +import com.google.common.collect.ImmutableList; +import com.google.common.collect.Maps; + +/** + * Created by mpierre on 15-11-04. 
+ */
+
+public class XMLFormatPlugin extends EasyFormatPlugin {
+
+private static final boolean IS_COMPRESSIBLE = false;
+private static final String DEFAULT_NAME = "xml";
+private Boolean keepPrefix = true;
+private XMLFormatConfig xmlConfig;
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig) {
+this(name, context, fsConf, storageConfig, new XMLFormatConfig());
+}
+
+public XMLFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, XMLFormatConfig formatPluginConfig) {
+super(name, context, fsConf, config, formatPluginConfig, true, false, false, IS_COMPRESSIBLE, formatPluginConfig.getExtensions(), DEFAULT_NAME);
+xmlConfig = formatPluginConfig;
+}
+
+@Override
+public RecordReader getRecordReader(FragmentContext context, DrillFileSystem dfs, FileWork fileWork,
+List columns, String userName) throws ExecutionSetupException {
+return new XMLRecordReader(context, fileWork.getPath(), dfs, columns, xmlConfig);
+}
+
+
+@Override
+public int getReaderOperatorType() {
+return CoreOperatorType.JSON_SUB_SCAN_VALUE;
--- End diff --

There is a limitation in Drill that makes this hard to implement correctly. In the past we have been adding a new value to a Protobuf definition for each new storage or format plugin so that we can send information about this operation and report what type it is. Unfortunately, this is not flexible and doesn't really fit the user-extensible model for format plugins. I don't expect you to fix this issue as part of this pull request; I'll start a thread on the list about this. To prevent confusion, it might be better to just return -1 here, which should show up in the web UI query profiles as an unknown operator.
[GitHub] drill pull request: Drill 3878
GitHub user magpierre opened a pull request: https://github.com/apache/drill/pull/451

Drill 3878

Please review my fix for JIRA DRILL-3878, which provides XML support for Apache Drill. The fix utilizes the existing support for JSON by converting XML to JSON using a simple SAX parser built for the purpose. The parser tries to produce acceptable JSON documents that are then fed into the JSONRecordReader for further processing.

To add XML support to Apache Drill, place the built package in the 3rdparty folder of the built Apache Drill environment, and start Drill. Add:

    "xml": { "type": "xml", "extensions": [ "xml" ], "keepPrefix": true }

to the type section in dfs. (keepPrefix = false removes the namespace from tags in Apache Drill, since a namespace can be named differently between documents and is not really part of the tag name.)

The parser tries to be nice to Drill / the JSON reader by avoiding mixing types, arranging recurring values in arrays, and removing empty elements. This is in order to minimize the number of JSON errors caused by the different natures of XML and Drill.

Convention in JSON: attributes are named using the convention @ followed by the attribute name, and store simple values. All other objects are stored as objects with a #value field. This somewhat conforms with Apache Spark XML, but I need to store all values in objects in order to avoid as many "map of different type" problems as possible.

Current limitations: DTD tags are currently not handled. Schemas are not validated against XSDs.

Also: since I am not a Drill developer, I might have broken every possible rule of syntax, format, layout, and test frameworks, as well as how to submit pull requests.
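The conventions described above (attributes as @name, element text under #value, recurring elements gathered into arrays, empty elements dropped) can be sketched roughly as follows. This is an illustration written for this thread, not the XMLSaxParser from the pull request: the class name `XmlConventionSketch` is hypothetical, the output is a plain `Map` rather than Drill's JSON objects, and namespace prefix handling (keepPrefix) is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of the "@attribute / #value" mapping; not the PR's parser. */
class XmlConventionSketch extends DefaultHandler {
  private final Deque<Map<String, Object>> nodes = new ArrayDeque<>(); // open elements
  private final Deque<StringBuilder> text = new ArrayDeque<>();        // text of each open element
  private Map<String, Object> root;

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attributes) {
    Map<String, Object> node = new LinkedHashMap<>();
    for (int i = 0; i < attributes.getLength(); i++) {
      node.put("@" + attributes.getQName(i), attributes.getValue(i)); // attributes keep simple values
    }
    nodes.push(node);
    text.push(new StringBuilder());
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    text.peek().append(ch, start, length);
  }

  @Override
  @SuppressWarnings("unchecked")
  public void endElement(String uri, String localName, String qName) {
    Map<String, Object> node = nodes.pop();
    String value = text.pop().toString().trim();
    if (!value.isEmpty()) {
      node.put("#value", value); // element text always lives in an object
    }
    if (nodes.isEmpty()) {       // the root is the last element to close
      root = new LinkedHashMap<>();
      if (!node.isEmpty()) {
        root.put(qName, node);
      }
      return;
    }
    if (node.isEmpty()) {
      return;                    // drop empty elements
    }
    Map<String, Object> parent = nodes.peek();
    Object existing = parent.get(qName);
    if (existing == null) {
      parent.put(qName, node);
    } else if (existing instanceof List) {
      ((List<Object>) existing).add(node);
    } else {                     // second occurrence: turn the value into an array
      List<Object> list = new ArrayList<>();
      list.add(existing);
      list.add(node);
      parent.put(qName, list);
    }
  }

  /** Parses xml and returns the nested map built with the conventions above. */
  static Map<String, Object> convert(String xml) {
    XmlConventionSketch handler = new XmlConventionSketch();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse XML", e);
    }
    return handler.root;
  }
}
```

Note that in this sketch an element that occurs once stays a single object and only becomes an array on its second occurrence, so two documents with different repetition counts can still yield different types for the same field. How the pull request deals with that case is not stated in the description.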
You can merge this pull request into a Git repository by running:

$ git pull https://github.com/magpierre/drill DRILL-3878

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/451.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #451

commit 844f34a16e75719535ff94c54d5337746ea18c20
Author: MPierre
Date: 2015-11-05T14:42:06Z
Initial commit XML support in Apache Drill

commit 592b3af06c2ff45198136577561f2ec1f7caaee0
Author: MPierre
Date: 2015-11-05T21:21:42Z
Fixed some minor outstanding bugs EasyRecordReader have a new field userName, and I forgot to change jsonProcessor to protected from private.

commit 8fad811edab43d3499b41bb66cb419248d11208f
Author: MPierre
Date: 2015-11-09T08:59:08Z
Merge remote-tracking branch 'apache/master' into DRILL-3878

commit 38f4884fe9b8456c1cde5de44c1e54177301a974
Author: MPierre
Date: 2016-03-16T11:33:15Z
Syncing to latest release of drill

commit 909c5dec8bdb01bfe0ed358ebc64c959785738df
Author: MPierre
Date: 2016-03-16T11:34:10Z
syncing to latest release of drill

commit 597d9657d613fa35df2c10dff23681545b13e531
Author: MPierre
Date: 2016-03-18T08:55:51Z
Cleaned up deliver Cleaned up the output generated by the SAX Parser, and removed all unnecessary code.

commit 0cfaa31ab9af89833417288a290d21d0ce88c4ac
Author: MPierre
Date: 2016-03-18T10:29:51Z
Merge remote-tracking branch 'apache/master' into DRILL-3878

commit aaaff05eb921125ad64854c89c179292c4441fb7
Author: MPierre
Date: 2016-03-24T13:05:53Z
Adjusted output from Parser to fit Drill better I have adjusted the SAX parser to produce JSON that Drill likes. Among the things corrected is to remove empty objects from the tree built. And to consolidate repeating values in arrays.

commit ba19a356d850224c01b9e807183377b46cf7e545
Author: MPierre
Date: 2016-03-24T13:10:57Z
Fixed small typo

commit 8ba6705be42c7847d469611ab070b869e0c76d8c
Author: MPierre
Date: 2016-03-24T21:17:30Z
Further enhancements of the output format to fit Drill

commit e2273f13b8e0136a33c1576c4667f16e23e1631c
Author: MPierre
Date: 2016-03-24T21:22:41Z
Removed comment

commit c1b6ff8375a7e3c8161167d1a5f2b34ba165e750
Author: MPierre
Date: 2016-03-29T12:48:53Z
Merge remote-tracking branch 'apache/master' into DRILL-3878