Re: Apache Drill
Inline

On Sun, Oct 18, 2015 at 11:37 AM, Julian Hyde wrote:

> ...
> My proposed “solution” -- and I suspect you’re not going to like it -- is to
> ignore, for now, harder XML problems and focus on the easier ones.

Hmm, I think that this may or may not be easy. But it is really important.

> A lot of XML documents do not have repeating scalar values. They are
> collections of records, perhaps with nested records or nested collections
> of records.

The scalar-ness of my example was just a simplification. The same problem occurs every time there is a list that sometimes contains only 1 element.

> Whitespace can be safely thrown away. Namespaces are not used.

Fine.

> A lot of data is in XML format because XML was the only option considered,
> not because the data structure pushed the limits of what XML’s rich model
> can express.

True.

> I think 90% of cases can be handled using a simple XML-to-JSON mapper that
> takes hints such as that the “employee” tag is to become a list of JSON
> maps and the “salary” and “name” tags are to be treated as attributes.

Great.

The real question is whether or not the XML community already has such a hinting mechanism. Or is Drill about to reinvent that?

> I really think that if we focus on the harder cases we’ll end up with the
> wrong solution.

No doubt. This isn't one of those.
[jira] [Created] (DRILL-3948) Partitioning columns of a Parquet table should be made visible to end user
Aman Sinha created DRILL-3948:
---------------------------------

Summary: Partitioning columns of a Parquet table should be made visible to end user
Key: DRILL-3948
URL: https://issues.apache.org/jira/browse/DRILL-3948
Project: Apache Drill
Issue Type: Improvement
Components: Metadata, Query Planning & Optimization
Affects Versions: 1.2.0
Reporter: Aman Sinha

For Parquet files, Drill can do partition pruning for filter conditions on a column that satisfies the following criterion: each Parquet file has a single value of that column. The Parquet metadata is examined for the min and max values of that column, and if they are the same, the column is considered a partitioning column.

When CTAS auto-partition is used, this criterion is enforced, but files created through external methods could also satisfy it. It is difficult for users to know exactly which columns in the table are candidate partitioning columns. We should provide this information in a user-friendly way, for instance:

- a special 'show partition columns for ' command
- in the Explain plan, show partition columns for the table in the Scan node

More options should be discussed.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
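The criterion in this issue (a column is a candidate partitioning column when its min and max are equal in every file) can be sketched in a few lines. This is only an illustration under stated assumptions: the `FileStats` layout and the `candidate_partition_columns` helper are hypothetical and are not Drill's metadata API or planner code.

```python
from typing import Dict, List, Tuple

# Hypothetical per-file statistics: {file name: {column: (min, max)}}.
# Drill reads the real values from Parquet metadata; this layout is assumed.
FileStats = Dict[str, Dict[str, Tuple[object, object]]]


def candidate_partition_columns(stats: FileStats) -> List[str]:
    """Return the columns whose min equals max in every file."""
    if not stats:
        return []
    # Only columns that have statistics in every file can qualify.
    common = set.intersection(*(set(cols) for cols in stats.values()))
    return sorted(
        col
        for col in common
        if all(cols[col][0] == cols[col][1] for cols in stats.values())
    )


stats = {
    "part-0.parquet": {"year": (2014, 2014), "amount": (5, 900)},
    "part-1.parquet": {"year": (2015, 2015), "amount": (1, 750)},
}
print(candidate_partition_columns(stats))
```

In this made-up data set, `year` holds a single value per file and `amount` does not, so only `year` would be reported as a candidate.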
Re: [DISCUSS] Design Documents
Parth,

Thanks for bringing this up. We definitely need to do a better job of discussing development decisions. I think whether this is done as a set of descriptions and comments on JIRA or a formal doc on Google is less important (and I wouldn't be inclined to enforce one over the other).

That being said, I think there is something else that limits the success of such an effort. We first must ask: how do we get more people responding and providing feedback on the things people have already posted? I know I've experienced silence numerous times when asking for feedback, and so have others. Some recent examples I've seen in the community:

- DRILL-3738 has received very little to no feedback despite providing an initial design document
- DRILL-3229 has one general response (a request for more detail) from you with a follow-up from Steven, but no additional feedback on the actual proposal

So I put it back to you and the general list: how do we get people to provide more feedback on all contributions and proposals? I think it goes beyond designs. More issues should be opened with better descriptions and proposals around why one would do something. When the basic outline has consensus and feedback, people can move to more thorough designs. Why haven't we seen responses on these issues?

I can't see a requirement of reviewed design docs being enforced until we start seeing people providing feedback on feature proposals and existing (albeit thin) design documents. So +1 long term but -1 until people start to respond and provide feedback on the outstanding items.

Contributors need to perceive value in presenting a design doc. Let's get the WIIFM right so that developer incentives are aligned.

Jacques

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Oct 16, 2015 at 10:21 AM, Parth Chandra wrote:

> Hi guys,
>
> Now that 1.2 is out I wanted to bring up the exciting topic of design documents for Drill. As the project gets more contributors, we definitely need to start documenting our designs and also allow for a more substantial review process. In particular, we need to make sure that there is sufficient time for comment as well as a time limit for comments so that developers are not left stranded. It is understood that committers should ensure they spend enough time in reviewing designs.
>
> I can see some substantial improvements in the works (some may even have pull requests for initial work) and I think that this is a good time to make sure that the design is done and understood by all before we get too far ahead with the implementation.
>
> [1] is an example from Spark, though that might be asking for a lot.
>
> [2] is an example from Drill - Hash Aggregation in Drill - This is an ideal design document. It could be improved even further perhaps by adding some implementation level details (for example parameters that could be used to tune Hash aggregation) that could aid QA/documentation.
>
> What do people think? Can we start enforcing the requirement to have reviewed design docs before submitting pull requests for *advanced* features?
>
> Parth
>
> [1] http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
> [2] https://issues.apache.org/jira/secure/attachment/12622804/DrillAggrs.pdf
[GitHub] drill pull request: DRILL-3947: Use setSafe() for date, time, time...
Github user zfong commented on the pull request:

    https://github.com/apache/drill/pull/208#issuecomment-149087340

    Is there a small unit test case that reproduces this problem?

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
Re: [DISCUSS] Design Documents
+1

From: Parth Chandra
To: dev@drill.apache.org
Sent: Friday, October 16, 2015 10:21 AM
Subject: [DISCUSS] Design Documents
Re: Apache Drill
Kasper,

Are you interested in working on a Drill MetaModel format plugin? That way, anything that MetaModel exposes would be available in Drill. It seems like this would add great value to many users.

Jacques

--
Jacques Nadeau
CTO and Co-Founder, Dremio
[GitHub] drill pull request: DRILL-3947: Use setSafe() for date, time, time...
Github user mehant commented on the pull request:

    https://github.com/apache/drill/pull/208#issuecomment-149087927

    +1. Changes look fine to me. Given that all the other types use setSafe(), it makes sense to use the same method for DATE and TIMESTAMP for consistency. However, I feel that we might be addressing a symptom here. Do you have any thoughts on what the actual issue might be?
Re: Apache Drill
Hi Ted,

Actually in MetaModel you then have two choices with your mapping to table format.

1) Either map the "item" as the granularity of a record. That way you will get three rows - one for each item. On the last of the two rows you would have the same values for any element that is registered at the scope.

2) You can also map 2 tables instead - one for and one for and then join them as you like.

2015-10-18 20:24 GMT+02:00 Ted Dunning:

> Kasper,
>
> This might work.
>
> One issue that I see is that Metamodel seems to take a very XML centric view of things while Drill takes a pretty JSON view of things.
>
> The point at which I think that this might cause problems is that Drill currently has troubles when it sees a records like
>
> 1
> 23
>
> This is fine as far as XML is concerned, but if you think about it in terms of JSON, it is probably best to view these records as
>
> {"item":[1]}
> {"item":[2,3]}
>
> Unfortunately, from the first record, there is no way to tell that it should not be viewed as
>
> {"item":1}
>
> Do you have a suggestion that would help with this?
>
> On Sun, Oct 18, 2015 at 8:41 AM, Kasper Sørensen <i.am.kasper.soren...@gmail.com> wrote:
>
>> Hi there,
>>
>> Sorry for barging in, but maybe this is a place where Drill and MetaModel could benefit from each other? We've considered that before at least ...
>>
>> MetaModel already has support for both DOM and SAX based XML querying. They basically inherit some characteristics from DOM and SAX respectively:
>>
>> - In the DOM variant we can infer a schema and all the user has to do is select a XML file/resource anywhere.
>> - In the SAX variant the user has to specify which paths in the XML document should represent logical "tables" and what paths represent their columns.
>>
>> See [1] for more info. Hope this might be of interest to integrate into Drill?
>>
>> Best regards,
>> Kasper Sørensen (from the MetaModel project)
>>
>> [1] http://wiki.apache.org/metamodel/examples/XmlTableMapping
>>
>> 2015-10-18 0:35 GMT+02:00 Magnus Pierre :
>>
>>> Well, very few lines of code imho. And simple. Been able to parse pretty deep structures with no issues so far. Performance? 10-15 5mb xml's in less than a second on my laptop but then I run it using Storm with some parallelism in place. Don't know if it's good or bad. I'll share the code next time I use computer. You don't need to use it, but it works at least.
>>>
>>> /M
>>> On 17 Oct 2015 at 10:43 PM, "Matt Burgess" wrote:
>>>
>>>> If the converter is clean and performant then I'm sure the community (including me) is interested :)
>>>>
>>>> However I wonder if Drill can afford to add a translation layer between data formats, could we be better served with similar parsing in Drill for XML as we do for JSON, or can it be pushed down far enough (to the parser) to not make a noticeable difference (which is what I think Julian is implying)?
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On Oct 17, 2015, at 1:41 PM, Magnus Pierre wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> Just wrote a simple sax implementation that converts xml to json and that is able to deal with decently complex xml's, that I currently use in Storm. Takes attributes, and everything.
>>>>>
>>>>> I can share it with the community if interesting.
>>>>>
>>>>> /Magnus
>>>>> On 17 Oct 2015 at 7:02 PM, "Julian Hyde" wrote:
>>>>>
>>>>>> Seems to me the biggest problem is to make drill understand the nested structure of an xml document. That work has been done for json, so let's build on it. Suppose there was a translator that converted xml to json (adding attributes for things that json lacks, such as namespaces, text, element tags). Drill knows how to handle json, even if it is a bit verbose. The translator could be applied on the fly.
>>>>>>
>>>>>> Julian
>>>>>>
>>>>>> Sent from my iPad
>>>>>>
>>>>>>> On Oct 16, 2015, at 2:31 PM, Stefán Baxter <ste...@activitystream.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> It's not possible but there has been some talk here about supporting it. If I remember correctly it's rather complicated and not really feasible. (I'm just a newbie so don't take my words for it)
>>>>>>>
>>>>>>> Regards,
>>>>>>> -Stefan
>>>>>>>
>>>>>>> On Fri, Oct 16, 2015 at 8:54 PM, Daniel Ajo <daniel@abarcahealth.com> wrote:
>>>>>>>
>>>>>>>> Hey there,
>>>>>>>>
>>>>>>>> I
[GitHub] drill pull request: DRILL-3947: Use setSafe() for date, time, time...
GitHub user amansinha100 opened a pull request:

    https://github.com/apache/drill/pull/208

    DRILL-3947: Use setSafe() for date, time, timestamp types while populating pruning vector (other types were already using setSafe).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/amansinha100/incubator-drill date1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/208.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #208

commit 7f0a03c17ca08433ebd2b0f122e23212b63c1570
Author: Aman Sinha
Date:   2015-10-18T16:59:19Z

    DRILL-3947: Use setSafe() for date, time, timestamp types while populating pruning vector (other types were already using setSafe).
Re: Apache Drill
I actually think this is a special version of object promotion. Steven is already working on solving the same problem where it exists when we have the JSON file:

{"item":1}
{"item":[2,3]}

The main difference is that in XML it isn't until we see the "3" value that we need to promote the "2" value to be the first element in an array. In Drill this is simply a value copy (not something that is currently exposed in ComplexWriter, but it exists at the vector level already). We could also do the promotion with lookahead and a markable/rewindable event stream.

In general though, I agree with Julian. Let's get a basic reader working first. Does someone want to create a JIRA and propose a basic design? I'm more than happy to help people through the format plugin definitions as necessary, since documentation there is lean.

--
Jacques Nadeau
CTO and Co-Founder, Dremio
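The scalar-to-array promotion described in this message can be illustrated outside Drill. The following is a minimal sketch on plain Python dictionaries, assuming scalar values only; `add_value` is a hypothetical helper and does not represent Drill's ComplexWriter or its vector-level copy.

```python
from typing import Any, Dict


def add_value(record: Dict[str, Any], key: str, value: Any) -> None:
    """Add a scalar value to a record, promoting to a list on repetition."""
    if key not in record:
        # First occurrence: nothing yet says this is a list, keep a scalar.
        record[key] = value
    elif isinstance(record[key], list):
        # Already promoted: just append.
        record[key].append(value)
    else:
        # Second occurrence: copy the earlier scalar into a new list.
        record[key] = [record[key], value]


record: Dict[str, Any] = {}
add_value(record, "item", 2)
add_value(record, "item", 3)
print(record)
```

A record that only ever sees one value stays a scalar under this scheme, which is exactly the ambiguity raised earlier in the thread; the sketch shows the promotion step only and does not resolve that ambiguity.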
Re: Apache Drill
Kasper,

How is the mapping you suggest specified?

In my example, I meant for there to be many records in a file and each record element to be a record insofar as Drill is concerned. I also didn't include other information that presumably would make it more interesting to talk about a record element as a unit.

Your suggestion (1) is essentially to denest the records, but that loses the nice hierarchical structure expressed in the original that so easily could be expressed in the JSON data model.

For your option (2), what do you mean by map 2 tables? Does MetaModel inherently assume that all output is purely relational?
[MongoDB] - Why not returning the _id when using *
Hello,

I do not understand why the '_id' is not returned when I do:

select * from mongo.db.collection

Any reason?

I would like to remove this line:
https://github.com/apache/drill/blob/master/contrib/storage-mongo/src/main/java/org/apache/drill/exec/store/mongo/MongoRecordReader.java#L82

(This would solve this issue at the same time: https://issues.apache.org/jira/browse/DRILL-3505 )

Regards
Tug
@tgrall
Re: Apache Drill
Ted, My proposed “solution” — and I suspect you’re not going to like it — is to ignore, for now, harder XML problems and focus on the easier ones. A lot of XML documents do not have repeating scalar values. They are collections of records, perhaps with nested records or nested collections of records. Whitespace can be safely thrown away. Namespaces are not used. A lot of data is in XML format because XML was the only option considered, not because the data structure pushed the limits of what XML’s rich model can express. I think 90% of cases can be handled using a simple XML-to-JSON mapper that takes hints such as that the “employee” tag is to become a list of JSON maps and the “salary” and “name” tags are to be treated as attributes. I really think that if we focus on the harder cases we’ll end up with the wrong solution. Julian > On Oct 18, 2015, at 11:24 AM, Ted Dunningwrote: > > Kasper, > > This might work. > > One issue that I see is that Metamodel seems to take a very XML centric > view of things while Drill takes a pretty JSON view of things. > > The point at which I think that this might cause problems is that Drill > currently has troubles when it sees a records like > > 1 > 23 > > This is fine as far as XML is concerned, but if you think about it in terms > of JSON, it is probably best to view these records as > > {"item":[1]} > {"item":[2,3]} > > Unfortunately, from the first record, there is no way to tell that it > should not be viewed as > > {"item":1} > > Do you have a suggestion that would help with this? > > > On Sun, Oct 18, 2015 at 8:41 AM, Kasper Sørensen < > i.am.kasper.soren...@gmail.com> wrote: > >> Hi there, >> >> Sorry for barging in, but maybe this is a place where Drill and MetaModel >> could benefit from each other? We've considered that before at least ... >> >> MetaModel already has support for both DOM and SAX based XML querying. 
Re: Apache Drill
Hi there,

Sorry for barging in, but maybe this is a place where Drill and MetaModel could benefit from each other? We've considered that before at least ...

MetaModel already has support for both DOM and SAX based XML querying. They basically inherit some characteristics from DOM and SAX respectively:

- In the DOM variant we can infer a schema and all the user has to do is select a XML file/resource anywhere.
- In the SAX variant the user has to specify which paths in the XML document should represent logical "tables" and what paths represent their columns.

See [1] for more info. Hope this might be of interest to integrate into Drill?

Best regards,
Kasper Sørensen (from the MetaModel project)

[1] http://wiki.apache.org/metamodel/examples/XmlTableMapping

2015-10-18 0:35 GMT+02:00 Magnus Pierre:

> Well, very few lines of code imho. And simple. Been able to parse pretty deep structures with no issues so far. Performance? 10-15 5mb xml's in less than a second on my laptop but then I run it using Storm with some parallelism in place. Don't know if it's good or bad. I'll share the code next time I use computer. You don't need to use it, but it works at least.
>
> /M
>
> On 17 Oct 2015 at 10:43 PM, "Matt Burgess" wrote:
>
> > If the converter is clean and performant then I'm sure the community (including me) is interested :)
> >
> > However I wonder if Drill can afford to add a translation layer between data formats, could we be better served with similar parsing in Drill for XML as we do for JSON, or can it be pushed down far enough (to the parser) to not make a noticeable difference (which is what I think Julian is implying)?
> >
> > Sent from my iPhone
> >
> > > On Oct 17, 2015, at 1:41 PM, Magnus Pierre wrote:
> > >
> > > Hello,
> > >
> > > Just wrote a simple sax implementation that converts xml to json and that is able to deal with decently complex xml's, that I currently use in Storm. Takes attributes, and everything.
> > >
> > > I can share it with the community if interesting.
> > >
> > > /Magnus
> > >
> > > On 17 Oct 2015 at 7:02 PM, "Julian Hyde" wrote:
> > >
> > > > Seems to me the biggest problem is to make drill understand the nested structure of an xml document. That work has been done for json, so let's build on it. Suppose there was a translator that converted xml to json (adding attributes for things that json lacks, such as namespaces, text, element tags). Drill knows how to handle json, even if it is a bit verbose. The translator could be applied on the fly.
> > > >
> > > > Julian
> > > >
> > > > Sent from my iPad
> > > >
> > > > > On Oct 16, 2015, at 2:31 PM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > It's not possible but there has been some talk here about supporting it. If I remember correctly it's rather complicated and not really feasible. (I'm just a newbie so don't take my words for it)
> > > > >
> > > > > Regards,
> > > > > -Stefan
> > > > >
> > > > > On Fri, Oct 16, 2015 at 8:54 PM, Daniel Ajo <daniel@abarcahealth.com> wrote:
> > > > >
> > > > > > Hey there,
> > > > > >
> > > > > > I was wondering if it is possible to query XML files using Apache Drill?
> > > > > >
> > > > > > I see there are several formats, and maybe it would work using an xpath query of some sorts, but just wondering if it would work to directly query it using some sort of plug-in.
> > > > > >
> > > > > > Well, let me know,
> > > > > >
> > > > > > Daniel Ajo
> > > > > >
> > > > > > CONFIDENTIALITY NOTE: This electronic transmission contains information belonging to Abarca Health LLC, which is confidential or legally privileged. If you are not the intended recipient, please immediately advise the sender by reply e-mail or telephone that this message has been inadvertently transmitted to you and delete this e-mail from your system. If you have received this transmission in error, you are hereby notified that any disclosure, copying, distribution or the taking of any action in reliance on the contents of the information is strictly prohibited.
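
The SAX-style mapping Kasper describes (the user names the path that represents a logical "table" and the paths that represent its columns) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not MetaModel's or Drill's API, and the names read_table, row_path, column_paths and list_columns, as well as the "/a/b/c" path syntax, are hypothetical. The list_columns argument plays the role of the "hint" discussed in this thread: a hinted column is always a list, so a record with a single <item> and a record with several come out with the same shape.

```python
import xml.sax


class TableHandler(xml.sax.ContentHandler):
    """Build rows for one logical table from SAX events.

    row_path     -- absolute path of the element that starts a new row
    column_paths -- column name -> absolute path of the element holding it
    list_columns -- columns that must always be lists (the hypothetical hint)
    """

    def __init__(self, row_path, column_paths, list_columns=()):
        super().__init__()
        self.row_path = row_path
        self.column_paths = column_paths
        self.list_columns = set(list_columns)
        self.stack = []   # names of the currently open elements
        self.text = []    # character data seen since the last tag
        self.row = None   # row under construction, or None outside a row
        self.rows = []

    def _path(self):
        return "/" + "/".join(self.stack)

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.text = []
        if self._path() == self.row_path:
            # Hinted columns start as empty lists, the rest as None.
            self.row = {c: [] if c in self.list_columns else None
                        for c in self.column_paths}

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        path = self._path()
        if self.row is not None:
            value = "".join(self.text).strip()
            for column, column_path in self.column_paths.items():
                if column_path != path:
                    continue
                if column in self.list_columns:
                    self.row[column].append(value)
                else:
                    # Unhinted columns keep only the last value seen.
                    self.row[column] = value
            if path == self.row_path:
                self.rows.append(self.row)
                self.row = None
        self.stack.pop()
        self.text = []


def read_table(xml_text, row_path, column_paths, list_columns=()):
    """Parse xml_text and return the rows of the table rooted at row_path."""
    handler = TableHandler(row_path, column_paths, list_columns)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.rows


records = ("<data><rec><item>1</item></rec>"
           "<rec><item>2</item><item>3</item></rec></data>")
rows = read_table(records, "/data/rec", {"item": "/data/rec/item"},
                  list_columns=["item"])
print(rows)
```

With the hint, both records yield a list for "item". Without it, this sketch would silently keep only the last value of a repeated element, which is one way to see why some form of hint or schema is needed. Values are left as strings; type inference is out of scope here.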
[jira] [Created] (DRILL-3947) IndexOutOfBoundsException for pruning on date column (at large scale)
Aman Sinha created DRILL-3947:
---------------------------------

Summary: IndexOutOfBoundsException for pruning on date column (at large scale)
Key: DRILL-3947
URL: https://issues.apache.org/jira/browse/DRILL-3947
Project: Apache Drill
Issue Type: Bug
Components: Query Planning & Optimization
Affects Versions: 1.2.0
Reporter: Aman Sinha
Assignee: Aman Sinha

On a large table (about 52 B records, 10K files, created with CTAS auto-partitioning) partitioned by a 'date' column, partition pruning encounters an error. At smaller scales, partition pruning succeeds. At this time, the problem seems specific to date columns only. This column is a nullable column and has NULL values in the data.

Here's the query:

{code}
explain plan for select count(*) from `table` where `date` = '2015-07-01';
{code}

Here's the error stack:

{code}
WARN o.a.d.e.p.l.partition.PruneScanRule - Exception while trying to prune partition.
java.lang.IndexOutOfBoundsException: index: 4096, length: 1 (expected: range(0, 4096))
    at io.netty.buffer.DrillBuf.checkIndexD(DrillBuf.java:189) ~[drill-java-exec-1.2.0.jar:4.0.27.Final]
    at io.netty.buffer.DrillBuf.chk(DrillBuf.java:211) ~[drill-java-exec-1.2.0.jar:4.0.27.Final]
    at io.netty.buffer.DrillBuf.setByte(DrillBuf.java:612) ~[drill-java-exec-1.2.0.jar:4.0.27.Final]
    at org.apache.drill.exec.vector.UInt1Vector$Mutator.set(UInt1Vector.java:411) ~[drill-java-exec-1.2.0.jar:1.2.0]
    at org.apache.drill.exec.vector.NullableDateVector$Mutator.set(NullableDateVector.java:440) ~[drill-java-exec-1.2.0.jar:1.2.0]
    at org.apache.drill.exec.store.parquet.ParquetGroupScan.populatePruningVector(ParquetGroupScan.java:420) ~[drill-java-exec-1.2.0.jar:1.2.0]
    at org.apache.drill.exec.planner.ParquetPartitionDescriptor.populatePartitionVectors(ParquetPartitionDescriptor.java:96) ~[drill-java-exec-1.2.0.jar:1.2.0]
    at org.apache.drill.exec.planner.logical.partition.PruneScanRule.doOnMatch(PruneScanRule.java:212) ~[drill-java-exec-1.2.0.jar:1.2.0]
    at org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2.onMatch(ParquetPruneScanRule.java:87) [drill-java-exec-1.2.0.jar:1.2.0]
    at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:228) [calcite-core-1.4.0-drill-r5.jar:1.4.0-drill-r5]
    at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808) [calcite-core-1.4.0-drill-r5.jar:1.4.0-drill-r5]
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
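
For readers unfamiliar with the message in the trace above: it has the shape of a bounds check on a fixed-capacity buffer, where a one-byte write at index 4096 is rejected because the valid range is 0 to 4095. The Python sketch below only illustrates that kind of check. FixedBuffer and fill are hypothetical names, this is not Drill's DrillBuf code, and it says nothing about why the vector in the report had a capacity of 4096.

```python
class FixedBuffer:
    """Hypothetical stand-in for a fixed-capacity byte buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray(capacity)

    def set_byte(self, index, value):
        # Reject any one-byte write that falls outside [0, capacity).
        if index < 0 or index + 1 > self.capacity:
            raise IndexError(
                f"index: {index}, length: 1 "
                f"(expected: range(0, {self.capacity}))"
            )
        self.data[index] = value


def fill(buf, count):
    """Write one byte per entry and return how many writes succeeded."""
    written = 0
    try:
        for i in range(count):
            buf.set_byte(i, 1)
            written += 1
    except IndexError as err:
        print(f"failed after {written} writes: {err}")
    return written


fill(FixedBuffer(4096), 4096)    # fits: every index is within range
fill(FixedBuffer(4096), 10000)   # the write at index 4096 is rejected
```

In this sketch, any number of entries up to the capacity is written without error and the first write past it fails, which is at least consistent with the report that pruning succeeds at smaller scales. The report itself does not identify the cause.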
Re: Apache Drill
Kasper,

This might work.

One issue that I see is that Metamodel seems to take a very XML centric view of things while Drill takes a pretty JSON view of things.

The point at which I think that this might cause problems is that Drill currently has trouble when it sees records like

<item>1</item>
<item>2</item><item>3</item>

This is fine as far as XML is concerned, but if you think about it in terms of JSON, it is probably best to view these records as

{"item":[1]}
{"item":[2,3]}

Unfortunately, from the first record, there is no way to tell that it should not be viewed as

{"item":1}

Do you have a suggestion that would help with this?

On Sun, Oct 18, 2015 at 8:41 AM, Kasper Sørensen <i.am.kasper.soren...@gmail.com> wrote:

> Hi there,
>
> Sorry for barging in, but maybe this is a place where Drill and MetaModel could benefit from each other? We've considered that before at least ...
>
> MetaModel already has support for both DOM and SAX based XML querying. They basically inherit some characteristics from DOM and SAX respectively:
>
> - In the DOM variant we can infer a schema and all the user has to do is select a XML file/resource anywhere.
> - In the SAX variant the user has to specify which paths in the XML document should represent logical "tables" and what paths represent their columns.
>
> See [1] for more info. Hope this might be of interest to integrate into Drill?
>
> Best regards,
> Kasper Sørensen (from the MetaModel project)
>
> [1] http://wiki.apache.org/metamodel/examples/XmlTableMapping
>
> 2015-10-18 0:35 GMT+02:00 Magnus Pierre:
>
> > Well, very few lines of code imho. And simple. Been able to parse pretty deep structures with no issues so far. Performance? 10-15 5mb xml's in less than a second on my laptop but then I run it using Storm with some parallelism in place. Don't know if it's good or bad. I'll share the code next time I use computer. You don't need to use it, but it works at least.
> > > > /M > > Den 17 okt 2015 10:43 em skrev "Matt Burgess" : > > > > > If the converter is clean and performant then I'm sure the community > > > (including me) is interested :) > > > > > > However I wonder if Drill can afford to add a translation layer between > > > data formats, could we be better served with similar parsing in Drill > for > > > XML as we do for JSON, or can it be pushed down far enough (to the > > parser) > > > to not make a noticeable difference (which is what I think Julian is > > > implying)? > > > > > > Sent from my iPhone > > > > > > > On Oct 17, 2015, at 1:41 PM, Magnus Pierre > > wrote: > > > > > > > > Hello, > > > > > > > > Just wrote a simple sax implementation that converts xml to json and > > that > > > > is able to deal with decently complex xml's, that I currently use in > > > Storm. > > > > Takes attributes, and everything. > > > > > > > > I can share it with the community if interesting. > > > > > > > > /Magnus > > > > Den 17 okt 2015 7:02 em skrev "Julian Hyde" : > > > > > > > >> Seems to me the biggest problem is to make drill understand the > nested > > > >> structure of an xml document. That work has been done for json, so > > let's > > > >> build on it. Suppose there was a translator that converted xml to > json > > > >> (adding attributes for things that json lacks, such as namespaces, > > text, > > > >> element tags). Drill knows how to handle json, even if it is a bit > > > verbose. > > > >> The translator could be applied on the fly. > > > >> > > > >> Julian > > > >> > > > >> > > > >> > > > >> Sent from my iPad > > > On Oct 16, 2015, at 2:31 PM, Stefán Baxter < > > ste...@activitystream.com > > > > > > > >>> wrote: > > > >>> > > > >>> Hi, > > > >>> > > > >>> It's not possible but there has been some talk here about > supporting > > > it. > > > >>> If I remember correctly it's rather complicated and not really > > > feasible. 
> > > >>> (I'm just a newbie so don't take my words for it) > > > >>> > > > >>> > > > >>> Regards, > > > >>> -Stefan > > > >>> > > > >>> On Fri, Oct 16, 2015 at 8:54 PM, Daniel Ajo < > > > daniel@abarcahealth.com > > > >>> > > > >>> wrote: > > > >>> > > > Hey there, > > > > > > I was wondering if it is possible to query XML files using Apache > > > Drill? > > > > > > I see there are several formats, and maybe it would work using an > > > xpath > > > query of some sorts, but just wondering if it would work to > directly > > > >> query > > > it using some sort of plug-in. > > > > > > Well, let me know, > > > > > > Daniel Ajo > > > * > > > >> CONFIDENTIALITY > > > NOTE: This electronic transmission contains information belonging > to > > > >> Abarca > > > Health LLC, which is confidential or legally privileged. If you > are > > > not > > > >> the > > > intended recipient, please immediately advise the