Re: Apache Drill

2015-10-18 Thread Ted Dunning
Inline

On Sun, Oct 18, 2015 at 11:37 AM, Julian Hyde  wrote:

> ...
> My proposed “solution” — and I suspect you’re not going to like it — is to
> ignore, for now, harder XML problems and focus on the easier ones.


Hmm. I think that this may or may not be easy, but it is really important.


> A lot of XML documents do not have repeating scalar values. They are
> collections of records, perhaps with nested records or nested collections
> of records.


The scalar-ness of my example was just a simplification. The same problem
occurs every time there is a list that sometimes contains 1 element.


> Whitespace can be safely thrown away. Namespaces are not used.


Fine.


> A lot of data is in XML format because XML was the only option considered,
> not because the data structure pushed the limits of what XML’s rich model
> can express.
>

True.


> I think 90% of cases can be handled using a simple XML-to-JSON mapper that
> takes hints such as that the “employee” tag is to become a list of JSON
> maps and the “salary” and “name” tags are to be treated as attributes.
>

Great.

The real question is whether or not the XML community already has such a
hinting mechanism.  Or is Drill about to reinvent that?
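
For concreteness, here is a minimal sketch of the kind of hinted mapper described above. It is only an illustration: the function name `xml_to_json` and the `list_tags` hint are made up for this example and are not an existing Drill or XML-community mechanism. XML attributes, namespaces and mixed content are ignored on purpose.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_json(elem, list_tags):
    """Convert an element into a JSON-compatible value.

    list_tags is a hypothetical hint: tag names that always become
    lists, even when only one such child is present.
    """
    children = list(elem)
    if not children:
        # Leaf element: keep its text as a plain scalar string.
        return (elem.text or "").strip()
    result = {}
    for child in children:
        value = xml_to_json(child, list_tags)
        if child.tag in list_tags:
            # Hinted tag: always collect into a list.
            result.setdefault(child.tag, []).append(value)
        else:
            # Unhinted tag: treat as a simple field of the map.
            result[child.tag] = value
    return result


doc = ET.fromstring(
    "<company>"
    "<employee><name>Ann</name><salary>10</salary></employee>"
    "</company>"
)
print(json.dumps(xml_to_json(doc, list_tags={"employee"})))
```

With the hint, a single "employee" still comes out as a one-element list, which is exactly the case that cannot be inferred from the data alone.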


>
> I really think that if we focus on the harder cases we’ll end up with the
> wrong solution.
>

No doubt.  This isn't one of those.


[jira] [Created] (DRILL-3948) Partitioning columns of a Parquet table should be made visible to end user

2015-10-18 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-3948:
-

         Summary: Partitioning columns of a Parquet table should be made visible to end user
             Key: DRILL-3948
             URL: https://issues.apache.org/jira/browse/DRILL-3948
         Project: Apache Drill
      Issue Type: Improvement
      Components: Metadata, Query Planning & Optimization
Affects Versions: 1.2.0
        Reporter: Aman Sinha


For Parquet files, Drill can do partition pruning for filter conditions on a 
column which satisfies the following criteria: 
  Each parquet file has a single value of that column. The parquet metadata is 
examined for the min and max values of that column and if they are the same, 
the column is considered a partitioning column. 

  When CTAS auto-partition is used, the above criterion is enforced, but files 
created through external methods could also satisfy it.  
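
As an illustration only, the criterion above can be sketched as follows. The input shape (one dict of per-column (min, max) pairs per file) and the function name are assumptions made for this example; they do not reflect how Drill actually reads or stores Parquet metadata.

```python
def candidate_partition_columns(file_stats):
    """Return columns whose min equals max in every file.

    file_stats is a hypothetical list with one entry per Parquet file,
    each mapping a column name to its (min, max) statistics.
    """
    if not file_stats:
        return []
    candidates = []
    for column in file_stats[0]:
        # A column qualifies only if every file holds a single value for it.
        if all(
            column in stats and stats[column][0] == stats[column][1]
            for stats in file_stats
        ):
            candidates.append(column)
    return candidates


stats = [
    {"year": (2014, 2014), "amount": (5, 90)},
    {"year": (2015, 2015), "amount": (1, 70)},
]
print(candidate_partition_columns(stats))  # ['year']
```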

It is difficult for users to know exactly which columns are the candidate 
partitioning columns in a table.  We should provide this information in a 
user-friendly way, for instance: 
  - a special 'show partition columns for <table>' command
  - In the Explain plan, show partition columns for the table in Scan node
 More options should be discussed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Design Documents

2015-10-18 Thread Jacques Nadeau
Parth,

Thanks for bringing this up. We definitely need to do a better job of
discussing development decisions. I think whether this is done as a set of
descriptions and comments on JIRA or a formal doc on Google is less
important (and I wouldn't be inclined to enforce one over the other).

That being said, I think there is something else that limits the success of
such an effort. We first must ask: how do we get more people responding and
providing feedback on the things people have already posted? I know I've
experienced silence numerous times when asking for feedback and so have
others. Some recent examples I've seen in the community:

 - DRILL-3738 has received very little to no feedback despite providing an
initial design document
 - DRILL-3229 has one general response (ask for more detail) from you with
a follow-up from Steven but no additional feedback on the actual proposal

So I put it back to you and the general list, how do we get people to
provide more feedback on all contributions and proposals? I think it goes
beyond designs. More issues should be opened with better descriptions and
proposals around why one would do something. When the basic outline has
consensus and feedback, people can move to more thorough designs. Why
haven't we seen response on these issues?

I can't see a requirement of reviewed design docs being enforced until we
start seeing people providing feedback on feature proposals and existing
(albeit thin) design documents. So +1 long term but -1 until people start
to respond and provide feedback on the outstanding items. Contributors need
to perceive value in presenting a design doc. Let's get the WIIFM right so
that developer incentives are aligned.

Jacques



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Oct 16, 2015 at 10:21 AM, Parth Chandra  wrote:

> Hi guys,
>
> Now that 1.2 is out I wanted to bring up the exciting topic of design
> documents for Drill. As the project gets more contributors, we definitely
> need to start documenting our designs and also allow for a more substantial
> review process. In particular, we need to make sure that there is
> sufficient time for comment as well as a time limit for comments so that
> developers are not left stranded. It is understood that committers should
> ensure they spend enough time in reviewing designs.
>
> I can see some substantial improvements in the works (some may even have
> pull requests for initial work) and I think that this is a good time to
> make sure that the design is done and understood by all before we get too
> far ahead with the implementation.
>
> [1] is an example from Spark, though that might be asking for a lot.
>
> [2] is an example from Drill - Hash Aggregation in Drill - This is an ideal
> design document. It could be improved even further perhaps by adding some
> implementation level details (for example parameters that could be used to
> tune Hash aggregation) that could aid QA/documentation.
>
> What do people think? Can we start enforcing the requirement to have
> reviewed design docs before submitting pull requests for *advanced*
> features?
>
> Parth
>
> [1] http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
> [2]
> https://issues.apache.org/jira/secure/attachment/12622804/DrillAggrs.pdf
>


[GitHub] drill pull request: DRILL-3947: Use setSafe() for date, time, time...

2015-10-18 Thread zfong
Github user zfong commented on the pull request:

https://github.com/apache/drill/pull/208#issuecomment-149087340
  
Is there a small unit test case that reproduces this problem? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] Design Documents

2015-10-18 Thread yuliya Feldman
+1

Re: Apache Drill

2015-10-18 Thread Jacques Nadeau
Kasper,

Are you interested in working on a Drill metamodel format plugin? That way,
anything that Metamodel exposes would be available in Drill. It seems like
this would add great value to many users.

Jacques

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Sun, Oct 18, 2015 at 1:18 PM, Kasper Sørensen <
i.am.kasper.soren...@gmail.com> wrote:

> Hi Ted,
>
> Actually in MetaModel you then have two choices with your mapping to table
> format.
>
> 1) Either map the "item" as the granularity of a record. That way you will
> get three rows - one for each item. On the last two rows you would
> have the same values for any element that is registered at the <record>
> scope.
>
> 2) You can also map 2 tables instead - one for <record> and one for <item> -
> and then join them as you like.
>
>
> 2015-10-18 20:24 GMT+02:00 Ted Dunning :
>
> > Kasper,
> >
> > This might work.
> >
> > One issue that I see is that Metamodel seems to take a very XML centric
> > view of things while Drill takes a pretty JSON view of things.
> >
> > The point at which I think that this might cause problems is that Drill
> > currently has trouble when it sees records like
> >
> > <record><item>1</item></record>
> > <record><item>2</item><item>3</item></record>
> >
> > This is fine as far as XML is concerned, but if you think about it in
> terms
> > of JSON, it is probably best to view these records as
> >
> > {"item":[1]}
> > {"item":[2,3]}
> >
> > Unfortunately, from the first record, there is no way to tell that it
> > should not be viewed as
> >
> > {"item":1}
> >
> > Do you have a suggestion that would help with this?
> >
> >
> > On Sun, Oct 18, 2015 at 8:41 AM, Kasper Sørensen <
> > i.am.kasper.soren...@gmail.com> wrote:
> >
> > > Hi there,
> > >
> > > Sorry for barging in, but maybe this is a place where Drill and
> MetaModel
> > > could benefit from each other? We've considered that before at least
> ...
> > >
> > > MetaModel already has support for both DOM and SAX based XML querying.
> > They
> > > basically inherit some characteristics from DOM and SAX respectively:
> > >
> > >  - In the DOM variant we can infer a schema and all the user has to do
> is
> > > select a XML file/resource anywhere.
> > >  - In the SAX variant the user has to specify which paths in the XML
> > > document should represent logical "tables" and what paths represent
> their
> > > columns.
> > >
> > > See [1] for more info. Hope this might be of interest to integrate into
> > > Drill?
> > >
> > > Best regards,
> > > Kasper Sørensen (from the MetaModel project)
> > >
> > > [1] http://wiki.apache.org/metamodel/examples/XmlTableMapping
> > >
> > > 2015-10-18 0:35 GMT+02:00 Magnus Pierre :
> > >
> > > > Well, very few lines of code imho. And simple. Been able to parse
> > pretty
> > > > deep structures with no issues so far. Performance? 10-15 5mb xml's
> in
> > > less
> > > > than a second on my laptop but then I run it using Storm with some
> > > > parallelism in place. Don't know if it's good or bad. I'll share the
> > code
> > > > next time I use computer. You don't need to use it, but it works at
> > > least.
> > > >
> > > > /M
> > > > Den 17 okt 2015 10:43 em skrev "Matt Burgess" :
> > > >
> > > > > If the converter is clean and performant then I'm sure the
> community
> > > > > (including me) is interested :)
> > > > >
> > > > > However I wonder if Drill can afford to add a translation layer
> > between
> > > > > data formats, could we be better served with similar parsing in
> Drill
> > > for
> > > > > XML as we do for JSON, or can it be pushed down far enough (to the
> > > > parser)
> > > > > to not make a noticeable difference (which is what I think Julian
> is
> > > > > implying)?
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > > On Oct 17, 2015, at 1:41 PM, Magnus Pierre  >
> > > > wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Just wrote a simple sax implementation that converts xml to json
> > and
> > > > that
> > > > > > is able to deal with decently complex xml's, that I currently use
> > in
> > > > > Storm.
> > > > > > Takes attributes, and everything.
> > > > > >
> > > > > > I can share it with the community if interesting.
> > > > > >
> > > > > > /Magnus
> > > > > > Den 17 okt 2015 7:02 em skrev "Julian Hyde" <
> jul...@hydromatic.net
> > >:
> > > > > >
> > > > > >> Seems to me the biggest problem is to make drill understand the
> > > nested
> > > > > >> structure of an xml document. That work has been done for json,
> so
> > > > let's
> > > > > >> build on it. Suppose there was a translator that converted xml
> to
> > > json
> > > > > >> (adding attributes for things that json lacks, such as
> namespaces,
> > > > text,
> > > > > >> element tags). Drill knows how to handle json, even if it is a
> bit
> > > > > verbose.
> > > > > >> The translator could be applied on the fly.
> > > > > >>
> > > > > >> Julian
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> Sent from my iPad
> > > > >  On Oct 16, 2015, at 

[GitHub] drill pull request: DRILL-3947: Use setSafe() for date, time, time...

2015-10-18 Thread mehant
Github user mehant commented on the pull request:

https://github.com/apache/drill/pull/208#issuecomment-149087927
  
+1. 

Changes look fine to me, given that all the other types use setSafe() it 
makes sense to use the same method for DATE, TIMESTAMP for consistency. However 
I feel that we might be addressing a symptom here, do you have any thoughts on 
what the actual issue might be?




Re: Apache Drill

2015-10-18 Thread Kasper Sørensen
Hi Ted,

Actually in MetaModel you then have two choices with your mapping to table
format.

1) Either map the "item" as the granularity of a record. That way you will
get three rows - one for each item. On the last two rows you would
have the same values for any element that is registered at the <record>
scope.

2) You can also map 2 tables instead - one for <record> and one for <item> -
and then join them as you like.
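
To make option (1) concrete, here is a small sketch of the denesting, written in plain Python rather than with MetaModel's own API. The element names "records", "record" and "item" and the helper `item_rows` are assumptions chosen for this illustration.

```python
import xml.etree.ElementTree as ET


def item_rows(root, record_tag="record", item_tag="item"):
    """Produce one row per item, repeating the enclosing record's identity."""
    rows = []
    for record_id, record in enumerate(root.iter(record_tag)):
        for item in record.findall(item_tag):
            # Record-level values (here just the position) repeat on each row.
            rows.append({"record_id": record_id, "item": item.text})
    return rows


root = ET.fromstring(
    "<records>"
    "<record><item>1</item></record>"
    "<record><item>2</item><item>3</item></record>"
    "</records>"
)
for row in item_rows(root):
    print(row)
```

This yields three rows, and the last two share the same record_id, which is the repetition of record-scope values mentioned in option (1).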


2015-10-18 20:24 GMT+02:00 Ted Dunning :

> Kasper,
>
> This might work.
>
> One issue that I see is that Metamodel seems to take a very XML centric
> view of things while Drill takes a pretty JSON view of things.
>
> The point at which I think that this might cause problems is that Drill
> currently has trouble when it sees records like
>
> <record><item>1</item></record>
> <record><item>2</item><item>3</item></record>
>
> This is fine as far as XML is concerned, but if you think about it in terms
> of JSON, it is probably best to view these records as
>
> {"item":[1]}
> {"item":[2,3]}
>
> Unfortunately, from the first record, there is no way to tell that it
> should not be viewed as
>
> {"item":1}
>
> Do you have a suggestion that would help with this?
>
>
> On Sun, Oct 18, 2015 at 8:41 AM, Kasper Sørensen <
> i.am.kasper.soren...@gmail.com> wrote:
>
> > Hi there,
> >
> > Sorry for barging in, but maybe this is a place where Drill and MetaModel
> > could benefit from each other? We've considered that before at least ...
> >
> > MetaModel already has support for both DOM and SAX based XML querying.
> They
> > basically inherit some characteristics from DOM and SAX respectively:
> >
> >  - In the DOM variant we can infer a schema and all the user has to do is
> > select a XML file/resource anywhere.
> >  - In the SAX variant the user has to specify which paths in the XML
> > document should represent logical "tables" and what paths represent their
> > columns.
> >
> > See [1] for more info. Hope this might be of interest to integrate into
> > Drill?
> >
> > Best regards,
> > Kasper Sørensen (from the MetaModel project)
> >
> > [1] http://wiki.apache.org/metamodel/examples/XmlTableMapping
> >
> > 2015-10-18 0:35 GMT+02:00 Magnus Pierre :
> >
> > > Well, very few lines of code imho. And simple. Been able to parse
> pretty
> > > deep structures with no issues so far. Performance? 10-15 5mb xml's in
> > less
> > > than a second on my laptop but then I run it using Storm with some
> > > parallelism in place. Don't know if it's good or bad. I'll share the
> code
> > > next time I use computer. You don't need to use it, but it works at
> > least.
> > >
> > > /M
> > > Den 17 okt 2015 10:43 em skrev "Matt Burgess" :
> > >
> > > > If the converter is clean and performant then I'm sure the community
> > > > (including me) is interested :)
> > > >
> > > > However I wonder if Drill can afford to add a translation layer
> between
> > > > data formats, could we be better served with similar parsing in Drill
> > for
> > > > XML as we do for JSON, or can it be pushed down far enough (to the
> > > parser)
> > > > to not make a noticeable difference (which is what I think Julian is
> > > > implying)?
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On Oct 17, 2015, at 1:41 PM, Magnus Pierre 
> > > wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > Just wrote a simple sax implementation that converts xml to json
> and
> > > that
> > > > > is able to deal with decently complex xml's, that I currently use
> in
> > > > Storm.
> > > > > Takes attributes, and everything.
> > > > >
> > > > > I can share it with the community if interesting.
> > > > >
> > > > > /Magnus
> > > > > Den 17 okt 2015 7:02 em skrev "Julian Hyde"  >:
> > > > >
> > > > >> Seems to me the biggest problem is to make drill understand the
> > nested
> > > > >> structure of an xml document. That work has been done for json, so
> > > let's
> > > > >> build on it. Suppose there was a translator that converted xml to
> > json
> > > > >> (adding attributes for things that json lacks, such as namespaces,
> > > text,
> > > > >> element tags). Drill knows how to handle json, even if it is a bit
> > > > verbose.
> > > > >> The translator could be applied on the fly.
> > > > >>
> > > > >> Julian
> > > > >>
> > > > >>
> > > > >>
> > > > >> Sent from my iPad
> > > >  On Oct 16, 2015, at 2:31 PM, Stefán Baxter <
> > > ste...@activitystream.com
> > > > >
> > > > >>> wrote:
> > > > >>>
> > > > >>> Hi,
> > > > >>>
> > > > >>> It's not possible but there has been some talk here about
> > supporting
> > > > it.
> > > > >>> If I remember correctly it's rather complicated and not really
> > > > feasible.
> > > > >>> (I'm just a newbie so don't take my words for it)
> > > > >>>
> > > > >>>
> > > > >>> Regards,
> > > > >>> -Stefan
> > > > >>>
> > > > >>> On Fri, Oct 16, 2015 at 8:54 PM, Daniel Ajo <
> > > > daniel@abarcahealth.com
> > > > >>>
> > > > >>> wrote:
> > > > >>>
> > > >  Hey there,
> > > > 
> > > >  I 

[GitHub] drill pull request: DRILL-3947: Use setSafe() for date, time, time...

2015-10-18 Thread amansinha100
GitHub user amansinha100 opened a pull request:

https://github.com/apache/drill/pull/208

DRILL-3947: Use setSafe() for date, time, timestamp types while popul…

…ating pruning vector (other types were already using setSafe).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/amansinha100/incubator-drill date1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/208.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #208


commit 7f0a03c17ca08433ebd2b0f122e23212b63c1570
Author: Aman Sinha 
Date:   2015-10-18T16:59:19Z

DRILL-3947: Use setSafe() for date, time, timestamp types while populating 
pruning vector (other types were already using setSafe).






Re: Apache Drill

2015-10-18 Thread Jacques Nadeau
I actually think this is generally a special version of object promotion.
Steven is already working on solving the same problem, which exists when
we have the JSON file:

{"item":1}
{"item":[2,3]}

The main difference is that in XML it isn't until we see the "3" value that we
need to promote the "2" value to be the first element in an array. In Drill this is
simply a value copy (not something that is currently exposed in
ComplexWriter but exists at the vector level already). We could also do the
promotion with lookahead and a markable/rewindable event stream.
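
A rough sketch of the in-record promotion described above, assuming the reader sees a flat stream of (tag, value) pairs for one record. `build_record` is a made-up name, and this is plain Python for illustration, not Drill's ComplexWriter or vector API.

```python
def build_record(pairs):
    """Build one record, promoting a scalar to a list on its second occurrence."""
    record = {}
    for tag, value in pairs:
        if tag not in record:
            # First occurrence: store as a scalar.
            record[tag] = value
        elif isinstance(record[tag], list):
            # Already promoted: just append.
            record[tag].append(value)
        else:
            # Second occurrence: copy the earlier scalar into a new list.
            record[tag] = [record[tag], value]
    return record


print(build_record([("item", 1)]))               # {'item': 1}
print(build_record([("item", 2), ("item", 3)]))  # {'item': [2, 3]}
```

Note that this only handles promotion inside a single record. The two outputs still disagree on the type of "item" across records, which is the JSON case mentioned above.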

In general though, I agree with Julian. Let's get a basic reader working
first.

Does someone want to create a JIRA and propose a basic design? I'm more
than happy to help people through the format plugin definitions as
necessary since documentation there is lean.



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Sun, Oct 18, 2015 at 11:24 AM, Ted Dunning  wrote:

> Kasper,
>
> This might work.
>
> One issue that I see is that Metamodel seems to take a very XML centric
> view of things while Drill takes a pretty JSON view of things.
>
> The point at which I think that this might cause problems is that Drill
> currently has trouble when it sees records like
>
> <record><item>1</item></record>
> <record><item>2</item><item>3</item></record>
>
> This is fine as far as XML is concerned, but if you think about it in terms
> of JSON, it is probably best to view these records as
>
> {"item":[1]}
> {"item":[2,3]}
>
> Unfortunately, from the first record, there is no way to tell that it
> should not be viewed as
>
> {"item":1}
>
> Do you have a suggestion that would help with this?
>
>
> On Sun, Oct 18, 2015 at 8:41 AM, Kasper Sørensen <
> i.am.kasper.soren...@gmail.com> wrote:
>
> > Hi there,
> >
> > Sorry for barging in, but maybe this is a place where Drill and MetaModel
> > could benefit from each other? We've considered that before at least ...
> >
> > MetaModel already has support for both DOM and SAX based XML querying.
> They
> > basically inherit some characteristics from DOM and SAX respectively:
> >
> >  - In the DOM variant we can infer a schema and all the user has to do is
> > select a XML file/resource anywhere.
> >  - In the SAX variant the user has to specify which paths in the XML
> > document should represent logical "tables" and what paths represent their
> > columns.
> >
> > See [1] for more info. Hope this might be of interest to integrate into
> > Drill?
> >
> > Best regards,
> > Kasper Sørensen (from the MetaModel project)
> >
> > [1] http://wiki.apache.org/metamodel/examples/XmlTableMapping
> >
> > 2015-10-18 0:35 GMT+02:00 Magnus Pierre :
> >
> > > Well, very few lines of code imho. And simple. Been able to parse
> pretty
> > > deep structures with no issues so far. Performance? 10-15 5mb xml's in
> > less
> > > than a second on my laptop but then I run it using Storm with some
> > > parallelism in place. Don't know if it's good or bad. I'll share the
> code
> > > next time I use computer. You don't need to use it, but it works at
> > least.
> > >
> > > /M
> > > Den 17 okt 2015 10:43 em skrev "Matt Burgess" :
> > >
> > > > If the converter is clean and performant then I'm sure the community
> > > > (including me) is interested :)
> > > >
> > > > However I wonder if Drill can afford to add a translation layer
> between
> > > > data formats, could we be better served with similar parsing in Drill
> > for
> > > > XML as we do for JSON, or can it be pushed down far enough (to the
> > > parser)
> > > > to not make a noticeable difference (which is what I think Julian is
> > > > implying)?
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On Oct 17, 2015, at 1:41 PM, Magnus Pierre 
> > > wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > Just wrote a simple sax implementation that converts xml to json
> and
> > > that
> > > > > is able to deal with decently complex xml's, that I currently use
> in
> > > > Storm.
> > > > > Takes attributes, and everything.
> > > > >
> > > > > I can share it with the community if interesting.
> > > > >
> > > > > /Magnus
> > > > > Den 17 okt 2015 7:02 em skrev "Julian Hyde"  >:
> > > > >
> > > > >> Seems to me the biggest problem is to make drill understand the
> > nested
> > > > >> structure of an xml document. That work has been done for json, so
> > > let's
> > > > >> build on it. Suppose there was a translator that converted xml to
> > json
> > > > >> (adding attributes for things that json lacks, such as namespaces,
> > > text,
> > > > >> element tags). Drill knows how to handle json, even if it is a bit
> > > > verbose.
> > > > >> The translator could be applied on the fly.
> > > > >>
> > > > >> Julian
> > > > >>
> > > > >>
> > > > >>
> > > > >> Sent from my iPad
> > > >  On Oct 16, 2015, at 2:31 PM, Stefán Baxter <
> > > ste...@activitystream.com
> > > > >
> > > > >>> wrote:
> > > > >>>
> > > > >>> Hi,
> > > > >>>
> > > > >>> It's not possible but there 

Re: Apache Drill

2015-10-18 Thread Ted Dunning
Kasper,

How is the mapping you suggest specified?

In my example, I meant for there to be many records in a file and each
record element to be a record insofar as Drill is concerned.  I also didn't
include other information that presumably would make it more interesting to
talk about a record element as a unit.

Your suggestion (1) is essentially to denest the records, but that loses
the nice hierarchical structure expressed in the original that so easily
could be expressed in the JSON data model.

For your option (2), what do you mean by map 2 tables?  Does MetaModel
inherently assume that all output is purely relational?




On Sun, Oct 18, 2015 at 1:18 PM, Kasper Sørensen <
i.am.kasper.soren...@gmail.com> wrote:

> Hi Ted,
>
> Actually in MetaModel you then have two choices with your mapping to table
> format.
>
> 1) Either map the "item" as the granularity of a record. That way you will
> get three rows - one for each item. On the last two rows you would
> have the same values for any element that is registered at the <record>
> scope.
>
> 2) You can also map 2 tables instead - one for <record> and one for <item> -
> and then join them as you like.
>
>
> 2015-10-18 20:24 GMT+02:00 Ted Dunning :
>
> > Kasper,
> >
> > This might work.
> >
> > One issue that I see is that Metamodel seems to take a very XML centric
> > view of things while Drill takes a pretty JSON view of things.
> >
> > The point at which I think that this might cause problems is that Drill
> > currently has trouble when it sees records like
> >
> > <record><item>1</item></record>
> > <record><item>2</item><item>3</item></record>
> >
> > This is fine as far as XML is concerned, but if you think about it in
> terms
> > of JSON, it is probably best to view these records as
> >
> > {"item":[1]}
> > {"item":[2,3]}
> >
> > Unfortunately, from the first record, there is no way to tell that it
> > should not be viewed as
> >
> > {"item":1}
> >
> > Do you have a suggestion that would help with this?
> >
> >
> > On Sun, Oct 18, 2015 at 8:41 AM, Kasper Sørensen <
> > i.am.kasper.soren...@gmail.com> wrote:
> >
> > > Hi there,
> > >
> > > Sorry for barging in, but maybe this is a place where Drill and
> MetaModel
> > > could benefit from each other? We've considered that before at least
> ...
> > >
> > > MetaModel already has support for both DOM and SAX based XML querying.
> > They
> > > basically inherit some characteristics from DOM and SAX respectively:
> > >
> > >  - In the DOM variant we can infer a schema and all the user has to do
> is
> > > select a XML file/resource anywhere.
> > >  - In the SAX variant the user has to specify which paths in the XML
> > > document should represent logical "tables" and what paths represent
> their
> > > columns.
> > >
> > > See [1] for more info. Hope this might be of interest to integrate into
> > > Drill?
> > >
> > > Best regards,
> > > Kasper Sørensen (from the MetaModel project)
> > >
> > > [1] http://wiki.apache.org/metamodel/examples/XmlTableMapping
> > >
> > > 2015-10-18 0:35 GMT+02:00 Magnus Pierre :
> > >
> > > > Well, very few lines of code imho. And simple. Been able to parse
> > pretty
> > > > deep structures with no issues so far. Performance? 10-15 5mb xml's
> in
> > > less
> > > > than a second on my laptop but then I run it using Storm with some
> > > > parallelism in place. Don't know if it's good or bad. I'll share the
> > code
> > > > next time I use computer. You don't need to use it, but it works at
> > > least.
> > > >
> > > > /M
> > > > Den 17 okt 2015 10:43 em skrev "Matt Burgess" :
> > > >
> > > > > If the converter is clean and performant then I'm sure the
> community
> > > > > (including me) is interested :)
> > > > >
> > > > > However I wonder if Drill can afford to add a translation layer
> > between
> > > > > data formats, could we be better served with similar parsing in
> Drill
> > > for
> > > > > XML as we do for JSON, or can it be pushed down far enough (to the
> > > > parser)
> > > > > to not make a noticeable difference (which is what I think Julian
> is
> > > > > implying)?
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > > On Oct 17, 2015, at 1:41 PM, Magnus Pierre  >
> > > > wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Just wrote a simple sax implementation that converts xml to json
> > and
> > > > that
> > > > > > is able to deal with decently complex xml's, that I currently use
> > in
> > > > > Storm.
> > > > > > Takes attributes, and everything.
> > > > > >
> > > > > > I can share it with the community if interesting.
> > > > > >
> > > > > > /Magnus
> > > > > > Den 17 okt 2015 7:02 em skrev "Julian Hyde" <
> jul...@hydromatic.net
> > >:
> > > > > >
> > > > > >> Seems to me the biggest problem is to make drill understand the
> > > nested
> > > > > >> structure of an xml document. That work has been done for json,
> so
> > > > let's
> > > > > >> build on it. Suppose there was a translator that converted xml
> to
> > > json
> > > > > >> 

[MongoDB] - Why not returning the _id when using *

2015-10-18 Thread Tugdual Grall
Hello,

I do not understand why the '_id' is not returned when I do:

select * from mongo.db.collection

Any reason?

I would like to remove this line:
https://github.com/apache/drill/blob/master/contrib/storage-mongo/src/main/java/org/apache/drill/exec/store/mongo/MongoRecordReader.java#L82


(This would solve this issue at the same time:
https://issues.apache.org/jira/browse/DRILL-3505 )

Regards
Tug
@tgrall


Re: Apache Drill

2015-10-18 Thread Julian Hyde
Ted,

My proposed “solution” — and I suspect you’re not going to like it — is to 
ignore, for now, harder XML problems and focus on the easier ones. A lot of XML 
documents do not have repeating scalar values. They are collections of records, 
perhaps with nested records or nested collections of records. Whitespace can be 
safely thrown away. Namespaces are not used. A lot of data is in XML format 
because XML was the only option considered, not because the data structure 
pushed the limits of what XML’s rich model can express.

I think 90% of cases can be handled using a simple XML-to-JSON mapper that 
takes hints such as that the “employee” tag is to become a list of JSON maps 
and the “salary” and “name” tags are to be treated as attributes.

I really think that if we focus on the harder cases we’ll end up with the wrong 
solution.

Julian


> On Oct 18, 2015, at 11:24 AM, Ted Dunning  wrote:
> 
> Kasper,
> 
> This might work.
> 
> One issue that I see is that Metamodel seems to take a very XML centric
> view of things while Drill takes a pretty JSON view of things.
> 
> The point at which I think that this might cause problems is that Drill
> currently has troubles when it sees a records like
> 
> <item>1</item>
> <item>2</item><item>3</item>
> 
> This is fine as far as XML is concerned, but if you think about it in terms
> of JSON, it is probably best to view these records as
> 
> {"item":[1]}
> {"item":[2,3]}
> 
> Unfortunately, from the first record, there is no way to tell that it
> should not be viewed as
> 
> {"item":1}
> 
> Do you have a suggestion that would help with this?
> 
> 
> On Sun, Oct 18, 2015 at 8:41 AM, Kasper Sørensen <
> i.am.kasper.soren...@gmail.com> wrote:
> 
>> Hi there,
>> 
>> Sorry for barging in, but maybe this is a place where Drill and MetaModel
>> could benefit from each other? We've considered that before at least ...
>> 
>> MetaModel already has support for both DOM and SAX based XML querying. They
>> basically inherit some characteristics from DOM and SAX respectively:
>> 
>> - In the DOM variant we can infer a schema and all the user has to do is
>> select a XML file/resource anywhere.
>> - In the SAX variant the user has to specify which paths in the XML
>> document should represent logical "tables" and what paths represent their
>> columns.
>> 
>> See [1] for more info. Hope this might be of interest to integrate into
>> Drill?
>> 
>> Best regards,
>> Kasper Sørensen (from the MetaModel project)
>> 
>> [1] http://wiki.apache.org/metamodel/examples/XmlTableMapping
>> 
>> 2015-10-18 0:35 GMT+02:00 Magnus Pierre :
>> 
>>> Well, very few lines of code imho. And simple. Been able to parse pretty
>>> deep structures with no issues so far. Performance? 10-15 5mb xml's in
>> less
>>> than a second on my laptop but then I run it using Storm with some
>>> parallelism in place. Don't know if it's good or bad. I'll share the code
>>> next time I use computer. You don't need to use it, but it works at
>> least.
>>> 
>>> /M
>>> Den 17 okt 2015 10:43 em skrev "Matt Burgess" :
>>> 
>>>> If the converter is clean and performant then I'm sure the community
>>>> (including me) is interested :)
>>>>
>>>> However I wonder if Drill can afford to add a translation layer between
>>>> data formats, could we be better served with similar parsing in Drill for
>>>> XML as we do for JSON, or can it be pushed down far enough (to the parser)
>>>> to not make a noticeable difference (which is what I think Julian is
>>>> implying)?
>>>>
>>>> Sent from my iPhone
>>>>
> On Oct 17, 2015, at 1:41 PM, Magnus Pierre 
>>> wrote:
> 
> Hello,
> 
> Just wrote a simple sax implementation that converts xml to json and
>>> that
> is able to deal with decently complex xml's, that I currently use in
 Storm.
> Takes attributes, and everything.
> 
> I can share it with the community if interesting.
> 
> /Magnus
> Den 17 okt 2015 7:02 em skrev "Julian Hyde" :
> 
>> Seems to me the biggest problem is to make drill understand the
>> nested
>> structure of an xml document. That work has been done for json, so
>>> let's
>> build on it. Suppose there was a translator that converted xml to
>> json
>> (adding attributes for things that json lacks, such as namespaces,
>>> text,
>> element tags). Drill knows how to handle json, even if it is a bit
 verbose.
>> The translator could be applied on the fly.
>> 
>> Julian
>> 
>> 
>> 
>> Sent from my iPad
 On Oct 16, 2015, at 2:31 PM, Stefán Baxter <
>>> ste...@activitystream.com
> 
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> It's not possible but there has been some talk here about
>> supporting
 it.
>>> If I remember correctly it's rather complicated and not really
 feasible.
>>> (I'm just a newbie so don't take my words for it)
>>> 
>>> 
>>> 

Re: Apache Drill

2015-10-18 Thread Kasper Sørensen
Hi there,

Sorry for barging in, but maybe this is a place where Drill and MetaModel
could benefit from each other? We've considered that before at least ...

MetaModel already has support for both DOM and SAX based XML querying. They
basically inherit some characteristics from DOM and SAX respectively:

 - In the DOM variant we can infer a schema and all the user has to do is
select an XML file/resource anywhere.
 - In the SAX variant the user has to specify which paths in the XML
document should represent logical "tables" and which paths represent their
columns.
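To illustrate the SAX-style mapping, here is a rough Python sketch using the standard `xml.sax` module. It is not MetaModel's API: the class `TableHandler`, the function `read_table`, and the path syntax are all hypothetical, and the sample document is invented. The user names one path that marks a row and one path per column.

```python
import xml.sax


class TableHandler(xml.sax.ContentHandler):
    """Collect one row per element at row_path, one value per column path."""

    def __init__(self, row_path, column_paths):
        super().__init__()
        self.row_path = row_path
        self.column_paths = column_paths  # column name -> absolute path
        self.path = []    # stack of currently open element names
        self.text = []    # character data seen since the last start tag
        self.row = None   # row being built, or None when outside a row
        self.rows = []

    def _current(self):
        return "/" + "/".join(self.path)

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if self._current() == self.row_path:
            self.row = {}

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        current = self._current()
        if self.row is not None:
            for column, path in self.column_paths.items():
                if path == current:
                    self.row[column] = "".join(self.text).strip()
        if current == self.row_path:
            self.rows.append(self.row)
            self.row = None
        self.path.pop()


def read_table(xml_text, row_path, column_paths):
    """Stream the document once and return the rows as a list of dicts."""
    handler = TableHandler(row_path, column_paths)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.rows


# Made-up sample input.
doc = (
    "<company>"
    "<employee><name>Ann</name><salary>100</salary></employee>"
    "<employee><name>Bob</name><salary>90</salary></employee>"
    "</company>"
)
rows = read_table(
    doc,
    "/company/employee",
    {"name": "/company/employee/name", "salary": "/company/employee/salary"},
)
print(rows)
```

Because the handler streams the document, it never holds the whole tree in memory, which is the usual reason to prefer SAX over DOM for large files. The price is that the user must supply the paths up front.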

See [1] for more info. Hope this might be of interest to integrate into
Drill?

Best regards,
Kasper Sørensen (from the MetaModel project)

[1] http://wiki.apache.org/metamodel/examples/XmlTableMapping

2015-10-18 0:35 GMT+02:00 Magnus Pierre :

> Well, very few lines of code imho. And simple. Been able to parse pretty
> deep structures with no issues so far. Performance? 10-15 5mb xml's in less
> than a second on my laptop but then I run it using Storm with some
> parallelism in place. Don't know if it's good or bad. I'll share the code
> next time I use computer. You don't need to use it, but it works at least.
>
> /M
> Den 17 okt 2015 10:43 em skrev "Matt Burgess" :
>
> > If the converter is clean and performant then I'm sure the community
> > (including me) is interested :)
> >
> > However I wonder if Drill can afford to add a translation layer between
> > data formats, could we be better served with similar parsing in Drill for
> > XML as we do for JSON, or can it be pushed down far enough (to the
> parser)
> > to not make a noticeable difference (which is what I think Julian is
> > implying)?
> >
> > Sent from my iPhone
> >
> > > On Oct 17, 2015, at 1:41 PM, Magnus Pierre 
> wrote:
> > >
> > > Hello,
> > >
> > > Just wrote a simple sax implementation that converts xml to json and
> that
> > > is able to deal with decently complex xml's, that I currently use in
> > Storm.
> > > Takes attributes, and everything.
> > >
> > > I can share it with the community if interesting.
> > >
> > > /Magnus
> > > Den 17 okt 2015 7:02 em skrev "Julian Hyde" :
> > >
> > >> Seems to me the biggest problem is to make drill understand the nested
> > >> structure of an xml document. That work has been done for json, so
> let's
> > >> build on it. Suppose there was a translator that converted xml to json
> > >> (adding attributes for things that json lacks, such as namespaces,
> text,
> > >> element tags). Drill knows how to handle json, even if it is a bit
> > verbose.
> > >> The translator could be applied on the fly.
> > >>
> > >> Julian
> > >>
> > >>
> > >>
> > >> Sent from my iPad
> >  On Oct 16, 2015, at 2:31 PM, Stefán Baxter <
> ste...@activitystream.com
> > >
> > >>> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> It's not possible but there has been some talk here about supporting
> > it.
> > >>> If I remember correctly it's rather complicated and not really
> > feasible.
> > >>> (I'm just a newbie so don't take my words for it)
> > >>>
> > >>>
> > >>> Regards,
> > >>> -Stefan
> > >>>
> > >>> On Fri, Oct 16, 2015 at 8:54 PM, Daniel Ajo <
> > daniel@abarcahealth.com
> > >>>
> > >>> wrote:
> > >>>
> >  Hey there,
> > 
> >  I was wondering if it is possible to query XML files using Apache
> > Drill?
> > 
> >  I see there are several formats, and maybe it would work using an
> > xpath
> >  query of some sorts, but just wondering if it would work to directly
> > >> query
> >  it using some sort of plug-in.
> > 
> >  Well, let me know,
> > 
> >  Daniel Ajo
> > >>
> >
>


[jira] [Created] (DRILL-3947) IndexOutOfBoundsException for pruning on date column (at large scale)

2015-10-18 Thread Aman Sinha (JIRA)
Aman Sinha created DRILL-3947:
-

 Summary: IndexOutOfBoundsException for pruning on date column (at large scale)
 Key: DRILL-3947
 URL: https://issues.apache.org/jira/browse/DRILL-3947
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization
Affects Versions: 1.2.0
Reporter: Aman Sinha
Assignee: Aman Sinha


When a large table (about 52 B records, 10K files, created with CTAS 
auto-partitioning) is partitioned by a 'date' column, partition pruning 
encounters an error.  At smaller scales, partition pruning succeeds. At this 
time, the problem seems specific to date columns. This column is nullable 
and has NULL values in the data. 

Here's the query:
{code}
explain plan for select count(*) from `table` where `date` = '2015-07-01';
{code}

Here's the error stack: 
{code}
WARN  o.a.d.e.p.l.partition.PruneScanRule - Exception while trying to prune partition.
java.lang.IndexOutOfBoundsException: index: 4096, length: 1 (expected: range(0, 4096))
    at io.netty.buffer.DrillBuf.checkIndexD(DrillBuf.java:189) ~[drill-java-exec-1.2.0.jar:4.0.27.Final]
    at io.netty.buffer.DrillBuf.chk(DrillBuf.java:211) ~[drill-java-exec-1.2.0.jar:4.0.27.Final]
    at io.netty.buffer.DrillBuf.setByte(DrillBuf.java:612) ~[drill-java-exec-1.2.0.jar:4.0.27.Final]
    at org.apache.drill.exec.vector.UInt1Vector$Mutator.set(UInt1Vector.java:411) ~[drill-java-exec-1.2.0.jar:1.2.0]
    at org.apache.drill.exec.vector.NullableDateVector$Mutator.set(NullableDateVector.java:440) ~[drill-java-exec-1.2.0.jar:1.2.0]
    at org.apache.drill.exec.store.parquet.ParquetGroupScan.populatePruningVector(ParquetGroupScan.java:420) ~[drill-java-exec-1.2.0.jar:1.2.0]
    at org.apache.drill.exec.planner.ParquetPartitionDescriptor.populatePartitionVectors(ParquetPartitionDescriptor.java:96) ~[drill-java-exec-1.2.0.jar:1.2.0]
    at org.apache.drill.exec.planner.logical.partition.PruneScanRule.doOnMatch(PruneScanRule.java:212) ~[drill-java-exec-1.2.0.jar:1.2.0]
    at org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2.onMatch(ParquetPruneScanRule.java:87) [drill-java-exec-1.2.0.jar:1.2.0]
    at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:228) [calcite-core-1.4.0-drill-r5.jar:1.4.0-drill-r5]
    at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808) [calcite-core-1.4.0-drill-r5.jar:1.4.0-drill-r5]
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Apache Drill

2015-10-18 Thread Ted Dunning
Kasper,

This might work.

One issue that I see is that MetaModel seems to take a very XML-centric
view of things while Drill takes a pretty JSON-centric view of things.

The point at which I think that this might cause problems is that Drill
currently has trouble when it sees records like

<item>1</item>
<item>2</item><item>3</item>

This is fine as far as XML is concerned, but if you think about it in terms
of JSON, it is probably best to view these records as

{"item":[1]}
{"item":[2,3]}

Unfortunately, from the first record, there is no way to tell that it
should not be viewed as

{"item":1}
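The ambiguity can be reproduced with a small Python sketch. The `<rec>` wrapper element and the function names `naive_convert` and `hinted_convert` are assumptions made for illustration; neither function reflects how Drill reads data. The naive version decides per record whether a tag is a list, so the two records come out with different types. The hinted version is told in advance that "item" repeats.

```python
import xml.etree.ElementTree as ET


def naive_convert(xml_text):
    """Infer structure per record: a tag becomes a list only if it repeats."""
    record = {}
    for child in ET.fromstring(xml_text):
        value = int(child.text)
        if child.tag not in record:
            record[child.tag] = value
        elif isinstance(record[child.tag], list):
            record[child.tag].append(value)
        else:
            # Second occurrence: promote the scalar to a list.
            record[child.tag] = [record[child.tag], value]
    return record


def hinted_convert(xml_text, list_tags):
    """Same conversion, but tags named in list_tags always become lists."""
    record = {}
    for child in ET.fromstring(xml_text):
        value = int(child.text)
        if child.tag in list_tags:
            record.setdefault(child.tag, []).append(value)
        else:
            # Unhinted tags are scalars; a repeat would overwrite the value.
            record[child.tag] = value
    return record


# Hypothetical records; the <rec> wrapper is an assumption.
first = "<rec><item>1</item></rec>"
second = "<rec><item>2</item><item>3</item></rec>"

naive = [naive_convert(first), naive_convert(second)]
hinted = [hinted_convert(first, {"item"}), hinted_convert(second, {"item"})]
print(naive)
print(hinted)
```

The naive output mixes a scalar and a list under the same key, which is the type change that causes trouble. The hinted output is a list in both records.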

Do you have a suggestion that would help with this?


On Sun, Oct 18, 2015 at 8:41 AM, Kasper Sørensen <
i.am.kasper.soren...@gmail.com> wrote:

> Hi there,
>
> Sorry for barging in, but maybe this is a place where Drill and MetaModel
> could benefit from each other? We've considered that before at least ...
>
> MetaModel already has support for both DOM and SAX based XML querying. They
> basically inherit some characteristics from DOM and SAX respectively:
>
>  - In the DOM variant we can infer a schema and all the user has to do is
> select a XML file/resource anywhere.
>  - In the SAX variant the user has to specify which paths in the XML
> document should represent logical "tables" and what paths represent their
> columns.
>
> See [1] for more info. Hope this might be of interest to integrate into
> Drill?
>
> Best regards,
> Kasper Sørensen (from the MetaModel project)
>
> [1] http://wiki.apache.org/metamodel/examples/XmlTableMapping
>
> 2015-10-18 0:35 GMT+02:00 Magnus Pierre :
>
> > Well, very few lines of code imho. And simple. Been able to parse pretty
> > deep structures with no issues so far. Performance? 10-15 5mb xml's in
> less
> > than a second on my laptop but then I run it using Storm with some
> > parallelism in place. Don't know if it's good or bad. I'll share the code
> > next time I use computer. You don't need to use it, but it works at
> least.
> >
> > /M
> > Den 17 okt 2015 10:43 em skrev "Matt Burgess" :
> >
> > > If the converter is clean and performant then I'm sure the community
> > > (including me) is interested :)
> > >
> > > However I wonder if Drill can afford to add a translation layer between
> > > data formats, could we be better served with similar parsing in Drill
> for
> > > XML as we do for JSON, or can it be pushed down far enough (to the
> > parser)
> > > to not make a noticeable difference (which is what I think Julian is
> > > implying)?
> > >
> > > Sent from my iPhone
> > >
> > > > On Oct 17, 2015, at 1:41 PM, Magnus Pierre 
> > wrote:
> > > >
> > > > Hello,
> > > >
> > > > Just wrote a simple sax implementation that converts xml to json and
> > that
> > > > is able to deal with decently complex xml's, that I currently use in
> > > Storm.
> > > > Takes attributes, and everything.
> > > >
> > > > I can share it with the community if interesting.
> > > >
> > > > /Magnus
> > > > Den 17 okt 2015 7:02 em skrev "Julian Hyde" :
> > > >
> > > >> Seems to me the biggest problem is to make drill understand the
> nested
> > > >> structure of an xml document. That work has been done for json, so
> > let's
> > > >> build on it. Suppose there was a translator that converted xml to
> json
> > > >> (adding attributes for things that json lacks, such as namespaces,
> > text,
> > > >> element tags). Drill knows how to handle json, even if it is a bit
> > > verbose.
> > > >> The translator could be applied on the fly.
> > > >>
> > > >> Julian
> > > >>
> > > >>
> > > >>
> > > >> Sent from my iPad
> > >  On Oct 16, 2015, at 2:31 PM, Stefán Baxter <
> > ste...@activitystream.com
> > > >
> > > >>> wrote:
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> It's not possible but there has been some talk here about
> supporting
> > > it.
> > > >>> If I remember correctly it's rather complicated and not really
> > > feasible.
> > > >>> (I'm just a newbie so don't take my words for it)
> > > >>>
> > > >>>
> > > >>> Regards,
> > > >>> -Stefan
> > > >>>
> > > >>> On Fri, Oct 16, 2015 at 8:54 PM, Daniel Ajo <
> > > daniel@abarcahealth.com
> > > >>>
> > > >>> wrote:
> > > >>>
> > >  Hey there,
> > > 
> > >  I was wondering if it is possible to query XML files using Apache
> > > Drill?
> > > 
> > >  I see there are several formats, and maybe it would work using an
> > > xpath
> > >  query of some sorts, but just wondering if it would work to
> directly
> > > >> query
> > >  it using some sort of plug-in.
> > > 
> > >  Well, let me know,
> > > 
> > >  Daniel Ajo
> > >  *
> > > >> CONFIDENTIALITY
> > >  NOTE: This electronic transmission contains information belonging
> to
> > > >> Abarca
> > >  Health LLC, which is confidential or legally privileged. If you
> are
> > > not
> > > >> the
> > >  intended recipient, please immediately advise the