[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986926#comment-14986926
 ] 

ASF GitHub Bot commented on DRILL-3229:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/180


> Create a new EmbeddedVector
> ---
>
> Key: DRILL-3229
> URL: https://issues.apache.org/jira/browse/DRILL-3229
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Execution - Codegen, Execution - Data Types, Execution - 
> Relational Operators, Functions - Drill
>Reporter: Jacques Nadeau
>Assignee: Hanifi Gunes
> Fix For: Future
>
>
> Embedded Vector will leverage a binary encoding for holding information about 
> type for each individual field.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-11-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986533#comment-14986533
 ] 

ASF GitHub Bot commented on DRILL-3229:
---

Github user jacques-n commented on the pull request:

https://github.com/apache/drill/pull/180#issuecomment-153219549
  
This seems very useful to users as an experimental feature. +1 with the 
default behavior as disabled.




[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-10-26 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974674#comment-14974674
 ] 

Parth Chandra commented on DRILL-3229:
--

Just realised that with untyped nulls, we would need to resolve the question of 
how we will handle schema-only queries. We can't send back a schema with 
no type.




[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-10-23 Thread Steven Phillips (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971766#comment-14971766
 ] 

Steven Phillips commented on DRILL-3229:


Regarding the list writer, I know it is a bit confusing, so I will try to give 
a better explanation for how it works. It confuses me at times as well.

The type promotion was designed with the possibility of allowing other 
promotions in mind, but I am currently only doing promotion to Union. We should 
have a discussion about what other promotions we want to allow.

Screen currently returns a Union type to the user. This is an area that will 
require additional enhancement. The DrillClient has no problem dealing with a 
Union vector. The JDBC driver, on the other hand, currently has only limited 
support for a Union type. I think we might need to add a feature similar to what 
we have with complex types, which will determine whether the client is able to 
handle Union types, and convert to JSON if it isn't. So metadata queries will 
also return a Union type.

As for case statements, I am leaning more toward a general philosophy of trying 
as much as we can to not fail queries, and so if there is something Drill can 
do to execute a query, it should do that. So I am leaning toward option 3.

An untyped-null type is supported as part of a Union vector. This null value is 
encoded in the 'type' vector. This patch does not introduce a standalone 
Untyped Null Vector. That will be a separate patch.

I will update the design document with what I have said here.



[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-10-23 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972246#comment-14972246
 ] 

Parth Chandra commented on DRILL-3229:
--

... get a headstart on fixing the C++ client. (sorry for the break in 
transmission).



[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-10-23 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972238#comment-14972238
 ] 

Parth Chandra commented on DRILL-3229:
--

I think we do need to do the same thing as with complex types, see if the 
client supports complex types or not, and if not, then convert to JSON. 
Otherwise we will break the C++ client (fixing the C++ client for this would be 
painful, and useless since no consumer of the API can currently handle the 
Union type). 
Untyped nulls will also break the C++ client, though it is easier to fix that, 
so we probably should. If you include a quick proposal on that, I can get a 
headstart on the 




[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-10-22 Thread Steven Phillips (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969613#comment-14969613
 ] 

Steven Phillips commented on DRILL-3229:


Design document: https://gist.github.com/StevenMPhillips/41b4a1bd745943d508d2

> Create a new EmbeddedVector
> ---
>
> Key: DRILL-3229
> URL: https://issues.apache.org/jira/browse/DRILL-3229
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Execution - Codegen, Execution - Data Types, Execution - 
> Relational Operators, Functions - Drill
>Reporter: Jacques Nadeau
>Assignee: Steven Phillips
> Fix For: Future
>
>
> Embedded Vector will leverage a binary encoding for holding information about 
> type for each individual field.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-10-22 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970439#comment-14970439
 ] 

Parth Chandra commented on DRILL-3229:
--

Nice doc. I have a couple of quick questions - 
In the list writer, when the map() method is called, I didn't quite follow the 
reason for tracking the current field name. What is it needed for? 
The type promotion proposal is excellent. But with type promotion we will 
update the underlying writer to a UnionWriter the moment a type change occurs. 
Is it possible for us to have a hierarchy of promotable types, so that we first 
promote to a higher scalar type (e.g. Int gets promoted to a Varchar) and 
promote to Union only if we encounter more than one type change or a change to 
a complex type? I'm OK if we think this is too complex to implement.
How will Screen handle a Union type? In general, a user level tool (sqlline 
included) will not know how to handle this. Can we have screen return a varchar 
representation of the Union type? During data exploration the user will then 
see there are type changes and can then use the type introspection and cast 
methods appropriately. 
What about metadata-only queries (i.e. select * ... limit 0)? What type would 
the user application get?
For Function Evaluation my preference is to have code generation rather than 
have UDFs that take a union parameter.
For case statements - if a case statement can output a Union type, the end user 
will presumably have to resolve the different types using type introspection 
and an outer case statement. Actually, I don't know enough about end-user 
use cases to choose which is more desirable. Should we leave it as choice #2 
and see what users ask for?
Jacques had mentioned that you have an idea for introducing an untyped null 
type. How would that fit in with this design?




[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-10-01 Thread Steven Phillips (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939640#comment-14939640
 ] 

Steven Phillips commented on DRILL-3229:


i) In this first iteration, Union types will be enabled with an option, and 
they will be created in Json Reader and Mongo reader automatically if the 
option is enabled. Everything will be a Union type in this case. A future patch 
will work on promoting from non-union once it is necessary to promote.
ii) Your understanding is correct. One change from the earlier comment, there 
is no "bits" vector. The underlying primitive type vectors will have their own 
"bits" for tracking nulls. The type vector with a value of zero will also 
indicate null.

Without going into much detail at this point, I can answer the next paragraph 
of questions by saying that this patch will allow reading of any valid json. It 
also has a more literal representation of the json, e.g. null values will be 
treated as null, instead of empty maps/lists. The patch also includes functions 
for inspecting the type of a field, which can be used with case statements to 
handle the data based on which type it is. Though it may be somewhat 
cumbersome, with these tools you should be able to run almost any query against 
dynamic json data. This will generally involve using introspection and case 
statements to remove the Union types early in the query. Future work will 
eliminate the need for this in many cases. One notable exception is that 
flatten is not supported in this initial patch.
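
The idea of using type introspection plus a case statement to remove a Union 
early can be sketched in plain Java. This is only an illustration of the 
dispatch logic, not Drill code: the class name, method name, and the BIGINT and 
VARCHAR codes are hypothetical. Only the convention that a type code of zero 
means null comes from the comment above.

```java
// Toy illustration only: not part of Drill. Type codes other than 0 are made up.
final class UnionCaseSketch {
  static final int TYPE_NULL = 0;     // zero in the type vector indicates null
  static final int TYPE_BIGINT = 1;   // hypothetical code for this sketch
  static final int TYPE_VARCHAR = 2;  // hypothetical code for this sketch

  // Mirrors "CASE WHEN <is bigint> THEN x WHEN <is varchar> THEN CAST(x) ... END":
  // inspect the type code of a value and normalize it to a single type.
  static Long toBigInt(int typeCode, Object value) {
    switch (typeCode) {
      case TYPE_NULL:
        return null;                              // nulls stay null
      case TYPE_BIGINT:
        return (Long) value;                      // already the target type
      case TYPE_VARCHAR:
        return Long.parseLong((String) value);    // cast quoted numbers
      default:
        throw new IllegalArgumentException("unsupported type code " + typeCode);
    }
  }

  public static void main(String[] args) {
    System.out.println(toBigInt(TYPE_BIGINT, 5L));
    System.out.println(toBigInt(TYPE_VARCHAR, "17"));
    System.out.println(toBigInt(TYPE_NULL, null));
  }
}
```

After this step every value has one type, which is what lets the rest of a 
query run without Union-aware operators.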



[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-10-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939656#comment-14939656
 ] 

ASF GitHub Bot commented on DRILL-3229:
---

GitHub user StevenMPhillips opened a pull request:

https://github.com/apache/drill/pull/180

DRILL-3229: Implement Union type vector



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/StevenMPhillips/drill drill-3229

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/180.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #180


commit e5a529d7a276597f2b62cdcb9a1cab2fae8bc52f
Author: Steven Phillips 
Date:   2015-10-01T10:26:34Z

DRILL-3229: Implement Union type vector






[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-09-24 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907299#comment-14907299
 ] 

Parth Chandra commented on DRILL-3229:
--

The union type looks good (haven't delved into the UnionListVector, though it 
doesn't look too far removed from the UnionVector). I'm missing some details - 
i) When do we create a Union type?
ii) The Union Vector will have a map vector which will have a field for each 
minor type. The fields will be nullable vectors of the corresponding minor 
type. For a given value, only one of the value vectors will have the bits field 
set. Is my understanding correct? A picture would be a big help.

More importantly, can we write up a couple of notes on the big picture so I can 
see where this fits in?  For instance, it is not clear in what cases we plan to 
use this. There are different use cases where changing schema is encountered.  
For instance, a large number of nulls followed by a schema that materializes is 
one frequently encountered case. The other common case is that of a primitive 
type that appears within quotes in a particular record and gets interpreted as 
a varchar. More complex cases can occur that have the same information 
represented differently, e.g. a timestamp that is written either as a string or 
as a long. (I'm not yet considering the rather extreme example in the yelp data 
set where a null field shows up as an empty map). Which of these types of cases 
are we addressing with UnionVectors? 

Also, one question I've never resolved in my own mind is that of FieldMetadata. 
Does a ValueVector require FieldMetadata to describe its structure? Or is it 
the other way around: FieldMetadata can be derived from the ValueVector. Either 
way, how do we define FieldMetadata for Union types? What is the impact on 
ODBC/JDBC, if any? 

Would a shared doc be a better way to discuss this? Then we can consolidate and 
add the result to https://drill.apache.org/docs/value-vectors/.





[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-09-18 Thread Jacques Nadeau (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14875831#comment-14875831
 ] 

Jacques Nadeau commented on DRILL-3229:
---

[~hgunes] and [~parthc], it would be good to get your feedback on this design. 
I'll post my notes shortly



[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector

2015-09-17 Thread Steven Phillips (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804637#comment-14804637
 ] 

Steven Phillips commented on DRILL-3229:


Basic design outline:

A Union type represents a field where the type can vary between records. The 
data for a field of type Union will be stored in a UnionVector.

h4. UnionVector
Internally uses a MapVector to hold the vectors for the various types. 
The types include all of the MinorTypes, including List and Map.
For example, the internal MapVector will have a subfield named 
"bigInt", which will refer to a NullableBigIntVector.

In addition to the vectors corresponding to the minor types, there will 
be two additional fields, both represented by UInt1Vectors. These are
"bits" and "types", which will represent the nullability and types of 
the underlying data. The "bits" vector will work the same way it works in other
nullable vectors. The "types" vector will store the number 
corresponding to the value of the MinorType as defined in the protobuf 
definition. There
will be mutator methods for setting null and type.
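
To make the layout concrete, here is a minimal, self-contained sketch of the 
same idea: one child store per type, plus per-row "types" and "bits" arrays. It 
is a toy model and not the actual UnionVector. The class name, the method 
names, and the two type codes are assumptions made for illustration; the real 
design stores the MinorType number from the protobuf definition.

```java
// Toy model only: not Drill's UnionVector. Type codes here are made up.
final class ToyUnionVector {
  static final byte TYPE_BIGINT = 1;   // hypothetical code
  static final byte TYPE_VARCHAR = 2;  // hypothetical code

  private final byte[] types;      // per-row type code ("types" vector)
  private final byte[] bits;       // per-row null flag: 1 = set, 0 = null ("bits" vector)
  private final long[] bigInts;    // child storage for BIGINT values
  private final String[] varChars; // child storage for VARCHAR values

  ToyUnionVector(int capacity) {
    types = new byte[capacity];
    bits = new byte[capacity];
    bigInts = new long[capacity];
    varChars = new String[capacity];
  }

  // Each setter records the type and the non-null bit along with the value.
  void setBigInt(int index, long value) {
    types[index] = TYPE_BIGINT;
    bits[index] = 1;
    bigInts[index] = value;
  }

  void setVarChar(int index, String value) {
    types[index] = TYPE_VARCHAR;
    bits[index] = 1;
    varChars[index] = value;
  }

  void setNull(int index) {
    bits[index] = 0;
  }

  boolean isNull(int index) {
    return bits[index] == 0;
  }

  // Reads dispatch on the stored type code to pick the right child store.
  Object getObject(int index) {
    if (isNull(index)) {
      return null;
    }
    switch (types[index]) {
      case TYPE_BIGINT:
        return bigInts[index];
      case TYPE_VARCHAR:
        return varChars[index];
      default:
        throw new IllegalStateException("unknown type code " + types[index]);
    }
  }

  public static void main(String[] args) {
    ToyUnionVector v = new ToyUnionVector(3);
    v.setBigInt(0, 42L);
    v.setVarChar(1, "forty-two");
    // Row 2 is never written, so its bit stays 0 and it reads as null.
    for (int i = 0; i < 3; i++) {
      System.out.println(i + ": " + v.getObject(i));
    }
  }
}
```

The point of the sketch is that a value is only readable correctly when its 
type code and null bit were written together with it.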

h4. UnionWriter
The UnionWriter implements and overrides all of the methods of 
FieldWriter. It holds field writers corresponding to each of the types included 
in the underlying
UnionVector, and delegates the method calls for each type to the 
corresponding writer. For example, the BigIntWriter interface:

{code}
public interface BigIntWriter extends BaseWriter {
  public void write(BigIntHolder h);

  public void writeBigInt(long value);
}
{code}
UnionWriter overrides these methods:

{code}
  @Override
  public void writeBigInt(long value) {
    // Record the type and mark the slot non-null before delegating.
    data.getMutator().setType(idx(), MinorType.BIGINT);
    data.getMutator().setNotNull(idx());
    getBigIntWriter().setPosition(idx());
    getBigIntWriter().writeBigInt(value);
  }

  @Override
  public void write(BigIntHolder h) {
    // Same bookkeeping as above; the value comes from the holder.
    data.getMutator().setType(idx(), MinorType.BIGINT);
    data.getMutator().setNotNull(idx());
    getBigIntWriter().setPosition(idx());
    getBigIntWriter().writeBigInt(h.value);
  }
{code}

This requires users of the interface to go through the UnionWriter, 
rather than using the underlying BigIntWriter directly. Otherwise, the "types" 
and "bits" vectors would not get set correctly.

h4. UnionReader
Much the same as the UnionWriter, the UnionReader overrides the 
methods of FieldReader, and delegates to a corresponding specific FieldReader 
implementation depending on which type 
the current value is.

h4. UnionListVector
UnionListVector extends BaseRepeatedVector. It works much the same as 
other Repeated vectors; there is a data vector and an offset vector. The data 
vector in this case is a UnionVector.
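
The data-plus-offsets arrangement can be illustrated with a small standalone 
sketch. This is not the UnionListVector itself: the class and method names are 
hypothetical, a plain list of objects stands in for the UnionVector, and 
startNewValue() takes no index here for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a List<Object> stands in for the UnionVector data vector.
final class ToyUnionListVector {
  private final List<Object> data = new ArrayList<>();      // flattened elements of all lists
  private final List<Integer> offsets = new ArrayList<>();  // list i spans [offsets[i], offsets[i+1])

  ToyUnionListVector() {
    offsets.add(0); // the first list, once started, begins at element 0
  }

  // Begin a new (initially empty) list at the current end of the data vector.
  void startNewValue() {
    offsets.add(data.size());
  }

  // Append an element of any type to the list that was started most recently.
  void add(Object value) {
    if (offsets.size() == 1) {
      throw new IllegalStateException("call startNewValue() before add()");
    }
    data.add(value);
    offsets.set(offsets.size() - 1, data.size()); // move the end offset forward
  }

  int listCount() {
    return offsets.size() - 1;
  }

  List<Object> get(int index) {
    return new ArrayList<>(data.subList(offsets.get(index), offsets.get(index + 1)));
  }

  public static void main(String[] args) {
    ToyUnionListVector v = new ToyUnionListVector();
    v.startNewValue();
    v.add(1L);
    v.add("two");      // mixed element types within one list
    v.startNewValue();
    v.add(3.0);
    for (int i = 0; i < v.listCount(); i++) {
      System.out.println(i + ": " + v.get(i));
    }
  }
}
```

Because every write goes through add(), the end offset always matches the data 
vector, which is the same reason the writer described next routes writes 
through the mutator.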

h4. UnionListWriter
The UnionListWriter overrides all FieldWriter methods. When starting a 
new list, the startList() method is called. This calls the startNewValue(int 
index) method
of the underlying UnionListVector.Mutator. Subsequent calls to the 
ListWriter methods (such as bigint()), return the UnionListWriter itself, and 
calls to write are handled by calling
the appropriate method on the underlying UnionListVector.Mutator, which 
handles updating the offset vector.

In the case that the map() method is called (i.e. repeated map), the 
UnionListWriter is itself returned, but a state variable is updated to indicate 
that it should operate as a MapWriter.
While in MapWriter mode, calls to the MapWriter methods will also 
return the UnionListWriter itself, but will also update the field indicating 
what the name of the current field is.
Subsequent writes to the ScalarWriter methods will write to the 
underlying UnionVector using the UnionWriter interface.

For example,

{code}
UnionListWriter list;
...

list.startList();
list.map().bigInt("a").writeBigInt(1);
{code}

This code first indicates that a new list is starting. By doing this, 
the offset vector is correctly set. Calling map() sets the internal state of 
the writer to "MAP". bigInt("a") sets the current
field of the writer to "a", and writeBigInt(1) writes the value 1 to 
the underlying UnionVector.
Another example:

{code}
MapWriter mapWriter = list.map().map("a");
{code}

In this case, the final call to map("a") delegates to the underlying 
UnionWriter, and returns a new MapWriter, with the position set according to 
the current offset.

> Create a new EmbeddedVector
> ---
>
> Key: DRILL-3229
> URL: https://issues.apache.org/jira/browse/DRILL-3229
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Execution - Codegen, Execution - Data Types, Execution - 
> Relational Operators, Functions - Drill
>Reporter: