Re: [DRILL with ALTERYX]

2019-12-10 Thread Paul Rogers
Hi Thiago,

Just wanted to follow up with a bit more detail.

The use case you describe is what is sometimes called "query integration": 
having a single tool accept a query, then turn around and issue other queries 
to other data sources. Finally, the query integrator combines the resulting 
data. Drill has some of this functionality, depending on the data sources you 
want to use.


You'd use a JDBC or ODBC driver to connect Drill to Alteryx so you can send 
queries to Drill, and obtain the results back from Drill.
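
For example, a minimal JDBC client sketch might look like the following (the 
connection URL and sample table are illustrative; point the URL at your own 
Drillbit or ZooKeeper quorum, and put the Drill JDBC driver jar on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillJdbcExample {
  public static void main(String[] args) throws Exception {
    // "drillbit=localhost" connects directly to one Drillbit;
    // "zk=host:2181" style URLs go through ZooKeeper instead.
    try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT full_name FROM cp.`employee.json` LIMIT 5")) {
      while (rs.next()) {
        System.out.println(rs.getString("full_name"));
      }
    }
  }
}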


Although Drill can connect to many data sources, query integration has not been 
the primary use case for Drill historically. Drill has mostly focused on 
reading tables in HDFS, MFS, S3 and other distributed file systems.


Query integration rapidly becomes complex: one must decide whether it is better 
to, say, scan both DBs A and B, or scan DB A and do per-row lookups in B, or 
perhaps vice versa.

As it turns out, Drill uses Apache Calcite for query planning. One could add 
Calcite rules to help decide how best to divide up a query. You would need some 
statistics about your data source, such as the number of rows expected from a 
query to a DB. Getting these numbers right for each data source can be tricky. 
Still, if you've read about the existing data sources, you'll see the community 
has integrated with Kafka, HBase, MapRDB and more.

This kind of cross-DB planning exists in Drill in only the most rudimentary 
form. We'd welcome contributions to build on Calcite to expand this 
functionality.


You mention a data abstraction layer. In addition to just combining queries, 
such layers also handle type conversions. Maybe a product code is an INT in 
system A, but a VARCHAR in system B. Maybe names are stored as a single string 
in system B, but as First/last name in system C. Tools exist to handle this 
complexity, but Drill does not do so directly. You can create views that handle 
normalization, but you might need a tool that handles data unification if the 
differences between data models are significant. (Such a tool could be built on 
Drill, but I don't know of anyone who has yet done so.)

Can you explain a bit more about your use case so we can make better 
suggestions? For example, how many data sources do you need to query? How 
similar are the data models?

Thanks,
- Paul

 

On Tuesday, December 3, 2019, 7:06:15 PM PST, Charles Givre 
 wrote:  
 
 Hi Thiago, 
Welcome to the Drill community!  I'd be happy to help you out and from what 
you're describing, Drill may be a great tool for this use case.  Can you share 
a bit more about what kinds of systems you are looking to query with Drill?  I 
assume you've seen the documentation at drill.apache.org? I'll put in a shameless 
plug for the Drill book as well, which might be useful. [1]

Best,
-- C

[1] https://amzn.to/33P2QwC 



> On Dec 3, 2019, at 9:54 PM, Thiago Samuel dos Santos Ribeiro 
>  wrote:
> 
> Hi Apache Team,
>  
> Please, in Brazil we are evaluating a very large solution using Apache Drill 
> at a TELCO company. This solution must connect to several data sources and, 
> through joins, bring the data back to Alteryx; in this case Drill would work 
> as a data abstraction layer.
>  
> I am worried that the Drill community does not have enough information or 
> use cases that we can take advantage of to drive our project here.
>  
> Please, could someone point me to a person or to community documentation 
> about this approach?
>  
>    

Re: Integrating Arrow with Drill

2019-12-10 Thread Paul Rogers
Hi Nai Yan,

You posted this same question a few days ago and we responded with some 
questions and discussion. Perhaps the dev list e-mail is going into your spam 
folder? You can find the discussion in the e-mail archives [1].

We would still like to learn more about how you might use Arrow with Drill.


Thanks,
- Paul


[1] http://mail-archives.apache.org/mod_mbox/drill-dev/201912.mbox/browser  See 
posts for Dec. 9.



 

On Tuesday, December 10, 2019, 5:15:09 PM PST, Nai Yan. 
 wrote:  
 
 Greetings, 
      Per Drill Dev Day 2018, there's a proposal to integrate arrow exec. 
engine with Drill. I was wondering if there's any solid plan? 

      Any comments are appreciated.  Thanks in advance. 

Nai Yan
 
  

[jira] [Created] (DRILL-7480) Revisit parameterized type design for Metadata API

2019-12-10 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7480:
--

 Summary: Revisit parameterized type design for Metadata API
 Key: DRILL-7480
 URL: https://issues.apache.org/jira/browse/DRILL-7480
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


Grabbed latest master and found that the code will not build in Eclipse due to 
a type mismatch in the statistics code. Specifically, the problem is that we 
have several parameterized classes, but we often omit the parameters. 
Evidently, doing so is fine for some compilers, but is an error in Eclipse.

Then, while fixing the immediate issue, I found an opposite problem: code that 
would satisfy Eclipse, but which failed in the Maven build.

I spent time making another pass through the metadata code to add type 
parameters, remove "rawtypes" ignores and so on. See DRILL-7479.

Stepping back a bit, it seems that we are perhaps using the type parameters in 
a way that does not serve our needs in this particular case.

We have many classes that hold onto particular values of some type, such as 
{{StatisticsHolder}}, which can hold a String, a Double, etc. So, we 
parameterize.

But, after that, we treat the items generically. We don't care that {{foo}} is 
a {{StatisticsHolder<String>}} and {{bar}} is a {{StatisticsHolder<Double>}}; we 
just want to create, combine and work with lists of statistics.

The same is true in several other places such as column type, comparator type, 
etc. For comparators, we don't really care what type they compare; we just 
want, given two generic {{StatisticsHolder}}s, to get the corresponding 
comparator.

This is very similar to the situation with the "column accessors" in EVF: each 
column is a {{VARCHAR}} or a {{FLOAT8}}, but most code just treats them 
generically. So, the type-ness of the value was treated as a runtime 
attribute, not a compile-time attribute.

This is a subtle point. Most code in Drill does not work with types directly in 
Java code. Instead, Drill is an interpreter: it works with generic objects 
which, at run time, resolve to actual typed objects. It is the difference 
between writing an application (directly uses types) and writing a language 
(generically works with all types.)

For example, a {{StatisticsHolder}} probably only needs to be type-aware at the 
moment it is populated or used, but not in all the generic column-level and 
table-level code. (The same is true of properties in the column metadata class, 
as an example.)

IMHO, {{StatisticsHolder}} probably wants to be a non-parameterized class. It 
should have a declaration object that, say, provides the name, type, comparator 
and other metadata. When the actual value is needed, a typed getter can be 
provided:
{code:java}
<T> T getValue();
{code}
As it is, the type system is very complex but we get no value. Since it is so 
complex, the code just punted and sprinkled raw types and ignores in many 
places, which defeats the purpose of parameterized types anyway.
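
For illustration only, a rough, non-Drill sketch of that shape might look like 
the following; the class and method names here are hypothetical, not the actual 
metadata API:
{code:java}
import java.util.Comparator;

// Hypothetical declaration object: carries the name and comparator once,
// rather than parameterizing every holder.
class StatisticDefn {
  final String name;
  final Comparator<Object> comparator;   // generic comparison, resolved at run time

  StatisticDefn(String name, Comparator<Object> comparator) {
    this.name = name;
    this.comparator = comparator;
  }
}

// Hypothetical non-parameterized holder: type-awareness is deferred to the getter,
// used only at the point where a caller actually needs the concrete type.
class StatisticsHolderSketch {
  private final StatisticDefn defn;
  private final Object value;

  StatisticsHolderSketch(StatisticDefn defn, Object value) {
    this.defn = defn;
    this.value = value;
  }

  StatisticDefn defn() { return defn; }

  @SuppressWarnings("unchecked")
  <T> T getValue() { return (T) value; }
}
{code}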

Suggestion: let's revisit this work after the upcoming release and see if we 
can simplify it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7479) Short-term fixes for metadata API parameterized type issues

2019-12-10 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7479:
--

 Summary: Short-term fixes for metadata API parameterized type 
issues
 Key: DRILL-7479
 URL: https://issues.apache.org/jira/browse/DRILL-7479
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


See DRILL-7480 for a discussion of the issues with how we currently use 
parameterized types in the metadata API.

This ticket is for short-term fixes that convert unsafe generic types from the 
raw form {{StatisticsHolder}} to the parameterized form {{StatisticsHolder<T>}} 
so that the compiler does not complain with many warnings (and a few 
Eclipse-only errors.)

The topic should be revisited later in the context of DRILL-7480.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: About integration of drill and arrow

2019-12-09 Thread Paul Rogers
Hi All,

Would be good to do some design brainstorming around this.

Integration with other tools depends on the APIs (the first two items I 
mentioned.) Last time I checked (more than a year ago), memory layout of Arrow 
is close to that in Drill; so conversion is around "packaging" and metadata, 
which can be encapsulated in an API.

Converting internals is a major undertaking. We have large amounts of complex, 
critical code that works directly with the details of value vectors. My thought 
was to first convert code to use the column readers/writers we've developed. 
Then, once all internal code uses that abstraction, we can replace the 
underlying vector implementation with Arrow. This lets us work in small stages, 
each of which is deliverable by itself.

The other approach is to change all code that works directly with Drill vectors 
to instead work with Arrow. Because that code is so detailed and fragile, that 
is a huge, risky project.

There are other approaches as well. Would be good to explore them before we 
dive into a major project.

Thanks,
- Paul

 

On Monday, December 9, 2019, 07:07:31 AM PST, Charles Givre 
 wrote:  
 
 Hi Igor, 
That would be really great if you could see that through to completion.  IMHO, 
the value from this is not so much performance related but rather the ability 
to use Drill to gather and prep data and seamlessly "hand it off" to other 
platforms for machine learning.  
-- C


> On Dec 9, 2019, at 5:48 AM, Igor Guzenko  wrote:
> 
> Hello Nai and Paul,
> 
> I would like to contribute full Apache Arrow integration.
> 
> Thanks,
> Igor
> 
> On Mon, Dec 9, 2019 at 8:56 AM Paul Rogers 
> wrote:
> 
>> Hi Nai Yan,
>> 
>> Integration is still in the discussion stages. Work has been progressing
>> on some foundations which would help that integration.
>> 
>> At the Developer's Day we talked about several ways to integrate. These
>> include:
>> 
>> 1. A storage plugin to read Arrow buffers from some source so that you
>> could use Arrow data in a Drill query.
>> 
>> 2. A new Drill client API that produces Arrow buffers from a Drill query
>> so that an Arrow-based tool can consume Arrow data from Drill.
>> 
>> 3. Replacement of the Drill value vectors internally with Arrow buffers.
>> 
>> The first two are relatively straightforward; they just need someone to
>> contribute an implementation. The third is a major long-term project
>> because of the way Drill value vectors and Arrow vectors have diverged.
>> 
>> 
>> I wonder, which of these use cases is of interest to you? How might you
>> use that integration in your project?
>> 
>> 
>> Thanks,
>> - Paul
>> 
>> 
>> 
>>    On Sunday, December 8, 2019, 10:33:23 PM PST, Nai Yan. <
>> zhaon...@gmail.com> wrote:
>> 
>> Greetings,
>>      As mentioned at Drill Developer Day 2018, there's a plan for Drill to
>> integrate Arrow (Gandiva from Dremio). I was wondering how it is going.
>> 
>>      Thanks in advance.
>> 
>> 
>> 
>> Nai Yan
>> 
  

[jira] [Resolved] (DRILL-7303) Filter record batch does not handle zero-length batches

2019-11-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7303.

Resolution: Duplicate

> Filter record batch does not handle zero-length batches
> ---
>
> Key: DRILL-7303
> URL: https://issues.apache.org/jira/browse/DRILL-7303
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>    Reporter: Paul Rogers
>    Assignee: Paul Rogers
>Priority: Major
>
> Testing of the row-set-based JSON reader revealed a limitation of the Filter 
> record batch: if an incoming batch has zero records, the length of the 
> associated SV2 is left at -1. In particular:
> {code:java}
> public class SelectionVector2 implements AutoCloseable {
>   // Indicates actual number of rows in the RecordBatch
>   // container which owns this SV2 instance
>   private int batchActualRecordCount = -1;
> {code}
> Then:
> {code:java}
> public abstract class FilterTemplate2 implements Filterer {
>   @Override
>   public void filterBatch(int recordCount) throws SchemaChangeException{
> if (recordCount == 0) {
>   outgoingSelectionVector.setRecordCount(0);
>   return;
> }
> {code}
> Notice there is no call to set the actual record count. The solution is to 
> insert one line of code:
> {code:java}
> if (recordCount == 0) {
>   outgoingSelectionVector.setRecordCount(0);
>   outgoingSelectionVector.setBatchActualRecordCount(0); // <-- Add this
>   return;
> }
> {code}
> Without this, the query fails with an error due to an invalid index of -1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-7311) Partial fixes for empty batch bugs

2019-11-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7311.

Resolution: Duplicate

> Partial fixes for empty batch bugs
> --
>
> Key: DRILL-7311
> URL: https://issues.apache.org/jira/browse/DRILL-7311
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>    Reporter: Paul Rogers
>    Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.18.0
>
>
> DRILL-7305 explains that multiple operators have serious bugs when presented 
> with empty batches. DRILL-7306 explains that the EVF (AKA "new scan 
> framework") was originally coded to emit an empty "fast schema" batch, but 
> that the feature was disabled because of the many empty-batch operator 
> failures.
> This ticket covers a set of partial fixes for empty-batch issues. This is the 
> result of work done to get the converted JSON reader to work with a "fast 
> schema." The JSON work, in the end, revealed that Drill has too many bugs to 
> enable fast schema, and so the DRILL-7306 was implemented instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-7305) Multiple operators do not handle empty batches

2019-11-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7305.

Resolution: Duplicate

> Multiple operators do not handle empty batches
> --
>
> Key: DRILL-7305
> URL: https://issues.apache.org/jira/browse/DRILL-7305
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>    Reporter: Paul Rogers
>Priority: Major
>
> While testing the new "EVF" framework, it was found that multiple operators 
> incorrectly handle empty batches. The EVF framework is set up to return a 
> "fast schema" empty batch with only schema as its first batch. It turns out 
> that many operators fail with problems such as:
> * Failure to set the value counts in the output container
> * Fail to initialize the offset vector position 0 to 0 for variable-width or 
> repeated vectors
> And so on.
> Partial fixes are in the JSON reader PR.
> For now, the easiest work-around is to disable the "fast schema" path in the 
> EVF: DRILL-7306.
> To discover the remaining issues, enable the 
> {{ScanOrchestratorBuilder.enableSchemaBatch}} option and run unit tests. You 
> can use the {{VectorChecker}} and {{VectorAccessorUtilities.verify()}} 
> methods to check state. Insert a call to {{verify()}} in each "next" method: 
> verify the incoming and outgoing batches. The checker only verifies a few 
> vector types; but these are enough to show many problems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7458) Base storage plugin framework

2019-11-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7458:
--

 Summary: Base storage plugin framework
 Key: DRILL-7458
 URL: https://issues.apache.org/jira/browse/DRILL-7458
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


The "Easy" framework allows third-parties to add format plugins to Drill with 
moderate effort. (The process could be easier, but "Easy" makes it as simple as 
possible given the current structure.)

At present, no such "starter" framework exists for storage plugins. Further, 
multiple storage plugins have implemented filter push down, seemingly by 
copying large blocks of code.

This ticket offers a "base" framework for storage plugins and for filter 
push-downs. The framework builds on the EVF, allowing plugins to also support 
project push down.

The framework has a "test mule" storage plugin to verify functionality, and was 
used as the basis of an REST-like plugin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7457) Join assignment is random when table costs are identical

2019-11-22 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7457:
--

 Summary: Join assignment is random when table costs are identical
 Key: DRILL-7457
 URL: https://issues.apache.org/jira/browse/DRILL-7457
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


Create a simple test: a join between two identical scans, call them t1 and t2. 
Ensure that the scans report the same cost. Capture the logical plan. Repeat 
the exercise several times. You will see that Drill randomly assigns t1 to the 
left side or right side.

Operationally this might not make a difference. But, in tests, it means that 
trying to compare an "actual" and "golden" plan is impossible as the plans are 
unstable.

Also, if only the estimates are the same, but the table size differs, then 
runtime performance will randomly be better on some query runs than others.

A better approach would be to fall back to the SQL statement's table order when 
the two tables are otherwise identical in cost.

This may be a Calcite issue rather than a Drill issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7456) Batch count fixes for 12 additional operators

2019-11-22 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7456:
--

 Summary: Batch count fixes for 12 additional operators
 Key: DRILL-7456
 URL: https://issues.apache.org/jira/browse/DRILL-7456
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


Enables batch validation for 12 additional operators:

* MergingRecordBatch
* OrderedPartitionRecordBatch
* RangePartitionRecordBatch
* TraceRecordBatch
* UnionAllRecordBatch
* UnorderedReceiverBatch
* UnpivotMapsRecordBatch
* WindowFrameRecordBatch
* TopNBatch
* HashJoinBatch
* ExternalSortBatch
* WriterRecordBatch

Fixes issues found with those checks so that this set of operators passes all 
checks.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7455) "Renaming" projection operator to avoid physical copies

2019-11-22 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7455:
--

 Summary: "Renaming" projection operator to avoid physical copies
 Key: DRILL-7455
 URL: https://issues.apache.org/jira/browse/DRILL-7455
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


Drill/Calcite inserts project operators for three main reasons:

1. To rename columns: {{SELECT a AS x ...}}

2. To compute a new column: {{SELECT a + b AS c ...}}

3. To remove columns: {{SELECT a ...}} when a data source provides columns {{a}} 
and {{b}}.

Example of case 1:

{code:json}
"pop" : "project",
"@id" : 4,
"exprs" : [ {
  "ref" : "`a0`",
  "expr" : "`a`"
}, {
  "ref" : "`b0`",
  "expr" : "`b`"
} ],
{code}

Of these, only case 2 requires row-by-row computation of new values. Case 1 
simply creates a new vector with only the name changed, but the same data. Case 
3 preserves some vectors and drops others.

In cases 1 and 3, a simple data transfer from input to output would be 
adequate. Yet, if one steps through the code (with the generated code saved so 
it can be inspected), one will see that Drill steps through each record in all 
three cases, even calling an empty per-record compute block.

A better-performance solution is to separate out the renames/drops (cases 1 and 
3) from the column computations (case 2). This can be done either:

1. At plan time, identify that all columns are renames, and replace the 
row-by-row project with a column-level project.

2. At run time, identify the column-level projections (cases 1 and 3) and 
handle those with transfer pairs, doing row-by-row computes only if case 
2 exists.

Since row-by-row copies are among the most expensive operations in Drill, this 
optimization could improve performance by a decent amount.
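
As an illustration of the cost difference only (plain Java arrays, not Drill 
vectors or transfer pairs):
{code:java}
public class ProjectCopySketch {
  // What the generated project code effectively does today for a rename-only case:
  // per-row work even though the data is unchanged.
  static int[] rowByRowCopy(int[] input) {
    int[] output = new int[input.length];
    for (int row = 0; row < input.length; row++) {
      output[row] = input[row];
    }
    return output;
  }

  // What a rename-only "transfer" amounts to: reuse the same buffer under a new name.
  static int[] renameOnlyTransfer(int[] input) {
    return input;
  }
}
{code}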

Note that a further optimization is to remove "trivial" projects such as the 
following:

{code:json}
"pop" : "project",
"@id" : 2,
"exprs" : [ {
  "ref" : "`a`",
  "expr" : "`a`"
}, {
  "ref" : "`b`",
  "expr" : "`b`"
}, {
  "ref" : "`b0`",
  "expr" : "`b0`"
} ],
{code}

The only value of such a projection is to say, "remove all vectors except 
{{a}}, {{b}} and {{b0}}." In fact, the only time such a projection should be 
needed is:

1. On top of a data source that does not support projection push down.

2. When Calcite knows it wants to discard certain intermediate columns.

Otherwise, Calcite knows which columns emerge from operator x, and should not 
need to add a project to enforce that schema if it is already what the project 
will emit.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7451) Planner inserts project node even if scan handles project push-down

2019-11-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7451:
--

 Summary: Planner inserts project node even if scan handles project 
push-down
 Key: DRILL-7451
 URL: https://issues.apache.org/jira/browse/DRILL-7451
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


I created a "dummy" storage plugin for testing. The test does a simple query:

{code:sql}
SELECT a, b, c from dummy.myTable
{code}

The first test is to mark the plugin's group scan as supporting projection push 
down. However, Drill still creates a projection node in the logical plan:

{code:json}
  "graph" : [ {
"pop" : "DummyGroupScan",
"@id" : 2,
"columns" : [ "`**`" ],
"userName" : "progers",
"cost" : {
  "memoryCost" : 1.6777216E7,
  "outputRowCount" : 1.0
}
  }, {
"pop" : "project",
"@id" : 1,
"exprs" : [ {
  "ref" : "`a`",
  "expr" : "`a`"
}, {
  "ref" : "`b`",
  "expr" : "`b`"
}, {
  "ref" : "`c`",
  "expr" : "`c`"
} ],
"child" : 2,
"outputProj" : true,
"initialAllocation" : 100,
"maxAllocation" : 100,
"cost" : {
  "memoryCost" : 1.6777216E7,
  "outputRowCount" : 1.0
}
  }, {
"pop" : "screen",
"@id" : 0,
"child" : 1,
"initialAllocation" : 100,
"maxAllocation" : 100,
"cost" : {
  "memoryCost" : 1.6777216E7,
  "outputRowCount" : 1.0
}
  } ]
{code}

There is [a comment in the 
code|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjectIntoScanRule.java#L109]
 that suggests the project should be removed:

{code:java}
// project above scan may be removed in ProjectRemoveRule for
// the case when it is trivial
{code}

As shown in the example, the project is trivial. There is a subtlety: it may be 
that the scan, unknown to the planner, produces additional columns, say {{d}} 
and {{e}}, which the project operator is needed to remove.

If this is the reason the project remains, perhaps we can add a flag of some 
kind where the group scan can insist that not only does it handle projection, 
it will not insert additional columns. At that point, the project is completely 
unnecessary in this case.
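
A purely hypothetical sketch of what such a flag could look like; these names 
do not exist in Drill today:
{code:java}
// Hypothetical only: lets a group scan advertise that its projection push-down is
// exact, so the planner could safely drop the trivial project above it.
public interface ExactProjectionScan {
  /** True if the scan handles projection push-down at all. */
  boolean canPushdownProjects();

  /** True if the scan emits exactly the projected columns and never adds extras. */
  boolean isProjectionExact();
}
{code}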

This is not a functional bug; just a performance issue: we exercise the 
machinery of the project operator to do exactly nothing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7447) Simplify the Mock reader

2019-11-16 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7447:
--

 Summary: Simplify the Mock reader
 Key: DRILL-7447
 URL: https://issues.apache.org/jira/browse/DRILL-7447
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


The mock reader is used to generate large volumes of data. It has evolved over 
time and has many crufty vestiges of prior implementations.

Also, the Mock reader allows specifying that types are nullable, and the rate 
of null values. This change adds to the existing "encoding" to allow specifying 
this property via SQL: add an "n" to the column name to specify nullable, a 
number to specify percent. To specify INT columns with 10%, 50% and 90% nulls:

{noformat}
SELECT a_in10, b_n50, b_n90 FROM mock.dummy1000
{noformat}

The default is 25% nulls (which already existed in the code) if no numeric 
suffix is provided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7446) Eclipse compilation issue in AbstractParquetGroupScan

2019-11-16 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7446:
--

 Summary: Eclipse compilation issue in AbstractParquetGroupScan
 Key: DRILL-7446
 URL: https://issues.apache.org/jira/browse/DRILL-7446
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


When the recent master branch is loaded in Eclipse, we get a compiler error in 
{{AbstractParquetGroupScan}}:

{noformat}
The method getFiltered(OptionManager, FilterPredicate) from the type 
AbstractGroupScanWithMetadata.GroupScanWithMetadataFilterer is not visible 
AbstractParquetGroupScan.java   
/drill-java-exec/src/main/java/org/apache/drill/exec/store/parquet  line 
242  Java Problem

Type mismatch: cannot convert from 
AbstractGroupScanWithMetadata.GroupScanWithMetadataFilterer to 
AbstractParquetGroupScan.RowGroupScanFilterer AbstractParquetGroupScan.java   
/drill-java-exec/src/main/java/org/apache/drill/exec/store/parquet  line 
237  Java Problem
{noformat}

The issue appears to be due to using the raw type rather than using parameters 
with the type.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Storage Plugin Assistance

2019-11-15 Thread Paul Rogers
Hi Charles,

Looking at the code in your PR, it seems that you are, in fact, using Drill's 
JSON reader to decode the message JSON. (See [1]). Is that where you are having 
problems?

Looks like this reader handles JSON passed as a string or from a file? In 
either case, get a local copy of the JSON, then use the JsonReader directly. 
The JSON reader wants a container and other knick-knacks which you can create 
within a test that extends the SubOperatorTest. That framework gives you things 
like an allocator so you can create the vectors, allocate memory, and so on.

This code uses the old JSON reader, so it is going to be pretty fiddly. This 
reader would greatly benefit from the newer EVF-based JSON reader, but we can 
work on that later.

Thanks,

- Paul


[1] 
https://github.com/apache/drill/pull/1892/files#diff-59df95a0bedb082b25742242eef0bb9c


 

On Friday, November 15, 2019, 12:40:22 PM PST, Charles Givre 
 wrote:  
 
 

> On Nov 15, 2019, at 1:39 PM, Paul Rogers  wrote:
> 
> Hi Charles,
> 
> A thought on debugging deserialization is to not do it in a query. Capture 
> the JSON returned from a rest call. Write a simple unit test that 
> deserializes that by itself from a string or file. Deserialization is a bit 
> of a black art, and is really a problem separate from Drill itself.

So dumb non-dev question... How exactly do I do that?  I have SeDe unit 
test(s), but the query in question is failing in the first part of the unit 
test.

@Test
public void testSerDe() throws Exception {
  String sql = "SELECT COUNT(*) FROM http.`/json?lat=36.7201600=-4.4203400=2019-10-02`";
  String plan = queryBuilder().sql(sql).explainJson();
  long cnt = queryBuilder().physical(plan).singletonLong();
  assertEquals("Counts should match", 1L, cnt);
}


  

Re: Storage Plugin Assistance

2019-11-15 Thread Paul Rogers
Hi Charles,

A thought on debugging deserialization is to not do it in a query. Capture the 
JSON returned from a rest call. Write a simple unit test that deserializes that 
by itself from a string or file. Deserialization is a bit of a black art, and 
is really a problem separate from Drill itself.
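
For instance, a bare-bones sketch along these lines (the PluginConfig class 
below is a stand-in for whatever class fails to deserialize, not a real Drill 
class):

import com.fasterxml.jackson.databind.ObjectMapper;

public class ConfigDeserExample {
  // Stand-in for the plugin config / group scan class under test.
  static class PluginConfig {
    public String url;
    public int timeout;
  }

  public static void main(String[] args) throws Exception {
    // Paste the JSON captured from the REST call (or read it from a file) here.
    String json = "{\"url\": \"http://localhost:8091\", \"timeout\": 30}";
    PluginConfig config = new ObjectMapper().readValue(json, PluginConfig.class);
    System.out.println(config.url + " / " + config.timeout);
  }
}

Once that round-trips cleanly outside Drill, the remaining failures are usually 
about missing @JsonCreator constructors or injected dependencies rather than the 
JSON itself.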

As it turns out, for my "day job" I'm doing a POC using Drill to query 
SumoLogic. I took this as an opportunity to fill that gap you mentioned in our 
book: how to create a storage plugin. See [1]. This is a work in progress, but 
it has helped me build the planner-side stuff up to the batch reader, after 
which the work is identical to that for a format plugin.

The Sumo API is REST-based, but for now I'm using the clunky REST client 
available in the Sumo public repo because of some unfortunate details of the 
Sumo REST service when used for this purpose. (Sumo returns data as a set of 
key/value pairs, not as a fixed JSON schema. [4])

Poking around elsewhere, it turns out someone wrote a very simple Presto 
connector for REST [2] using the Retrofit library from Square [3] which seems 
very simple to use. If we create a generic REST plugin, we might want to look 
at how it was done in Presto. Presto requires an up-front schema which Retrofit 
can provide. Drill, of course, does not require such a schema and so works with 
ad-hoc schemas, such as the one that Sumo's API provides. 

Actually, better than using a deserializer would be to use Drill's existing 
JSON parser to read data directly into value vectors. But, that existing code 
has lots of tech debt. I've been working on a PR for a new version based on EVF, 
but that is a while off, and won't help us today.

It is interesting to note that neither the JSON reader nor a generic REST API 
would work with the Sumo API because of its structure. I think the JSON reader 
would read an entire batch of Sumo results as a single record composed of a 
repeated Map, with elements being the key/value pairs. Not at all ideal.

So, both the JSON reader, and the REST API, should eventually handle data 
formats which are generic (name/value pairs) rather than expressed in the 
structure of JSON objects (as required by Jackson and Retrofit.) That is a 
topic for later, but is why the Sumo plugin has to be custom to Sumo's API for 
now.


Thanks,
- Paul


[1] https://github.com/paul-rogers/drill/wiki/Create-a-Storage-Plugin

[2] https://github.com/prestosql-rocks/presto-rest

[3] https://square.github.io/retrofit/

[4] https://help.sumologic.com/APIs/Search-Job-API/About-the-Search-Job-API



 

On Friday, November 15, 2019, 09:04:21 AM PST, Charles Givre 
 wrote:  
 
 Hi Igor, 
Thanks for the advice.  I've been doing some digging and am still pretty stuck 
here.  Can you recommend any techniques about how to debug the Jackson 
serialization/deserialization?  I added a unit test that serializes a query and 
then deserializes it and that test fails.  I've tracked this back to a 
constructor not receiving the plugin config and then throwing a NPE. What I 
can't seem to figure out is where that is being called from and why.

Any advice would be greatly appreciated.  Code can be found here: 
https://github.com/apache/drill/pull/1892
Thanks,
-- C


> On Oct 12, 2019, at 3:27 AM, Igor Guzenko  wrote:
> 
> Hello Charles,
> 
> Looks like you found another new issue. Maybe I explained it unclearly, but my
> previous suggestion wasn't about the EXPLAIN PLAN construct, but rather:
> 1)  Use http client like Postman or simply browser to save response of
> requested rest service into json file
> 2)  Try to debug reading the file by Drill in order to compare how
> Calcite's conversion from AST SqlNode to RelNode tree differs for existing
> dfs storage plugin from same flow in your storage plugin.
> 
> From your last email I can figure out that another issue exists with the class
> HttpGroupScan: at some point Drill tried to deserialize json into an instance
> of HttpGroupScan and the jackson library didn't find how to do this. Probably
> you missed some constructor with jackson metadata; for example, see the
> HiveScan operator:
> 
> @JsonCreator
> public HiveScan(@JsonProperty("userName") final String userName,
>                @JsonProperty("hiveReadEntry") final HiveReadEntry
> hiveReadEntry,
>                @JsonProperty("hiveStoragePluginConfig") final
> HiveStoragePluginConfig hiveStoragePluginConfig,
>                @JsonProperty("columns") final List columns,
>                @JsonProperty("confProperties") final Map<String, String> confProperties,
>                @JacksonInject final StoragePluginRegistry
> pluginRegistry) throws ExecutionSetupException {
>  this(userName,
>      hiveReadEntry,
>      (HiveStoragePlugin) pluginRegistry.getPlugin(hiveStoragePluginConfig),
>      columns,
&

[jira] [Created] (DRILL-7445) Create batch copier based on result set framework

2019-11-14 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7445:
--

 Summary: Create batch copier based on result set framework
 Key: DRILL-7445
 URL: https://issues.apache.org/jira/browse/DRILL-7445
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


The result set framework now provides both a reader and writer. Provide a 
copier that copies batches using this framework. Such a copier can:

* Copy selected records
* Copy all records, such as for an SV2 or SV4
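
As an illustration of the "copy selected" case only, using plain arrays rather 
than Drill batches; an SV2-driven copier is essentially this indirection:
{code:java}
public class CopierSketch {
  // "selection" holds the indexes of the surviving rows, in output order.
  static int[] copySelected(int[] values, int[] selection) {
    int[] out = new int[selection.length];
    for (int i = 0; i < selection.length; i++) {
      out[i] = values[selection[i]];
    }
    return out;
  }
}
{code}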




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7442) Create multi-batch row set reader

2019-11-10 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7442:
--

 Summary: Create multi-batch row set reader
 Key: DRILL-7442
 URL: https://issues.apache.org/jira/browse/DRILL-7442
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


The "row set" work provided a {{RowSetWriter}} and {{RowSetReader}} to write to 
and read from a single batch. The {{ResultSetLoader}} class provided a writer 
that spans multiple batches, handling schema changes across batches and so on.

This ticket introduces a reader equivalent, the {{ResultSetReader}} that reads 
an entire result set of multiple batches, handling schema changes along the way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7441) Fix issues with fillEmpties, offset vectors

2019-11-10 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7441:
--

 Summary: Fix issues with fillEmpties, offset vectors
 Key: DRILL-7441
 URL: https://issues.apache.org/jira/browse/DRILL-7441
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


Enable the vector validator with full testing of offset vectors. A number of 
operators trigger errors. Tracking down the issues and adding detailed tests 
revealed that:

* Drill has an informal standard that zero-length batches should have 
zero-length offset vectors, while a batch of size 1 will have offset vectors of 
size 2. Thus, zero-length is a special case.
* Nullable, repeated and variable-width vectors have "fill empties" logic that 
is used in two places: when setting the value count and when preparing to write 
a new value. The current logic is not quite right for either case.

Detailed vector checks fail due to inconsistencies in how the above works. This 
PR fixes those issues.
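
As an illustration of the offset-vector convention described above, using plain 
arrays rather than Drill's vector classes:
{code:java}
public class OffsetSketch {
  static int[] offsetsFor(String[] values) {
    if (values.length == 0) {
      return new int[0];                 // zero-length batch: zero-length offsets
    }
    int[] offsets = new int[values.length + 1];
    offsets[0] = 0;                      // position 0 must be initialized to 0
    for (int i = 0; i < values.length; i++) {
      offsets[i + 1] = offsets[i] + values[i].length();  // n values need n + 1 offsets
    }
    return offsets;
  }
}
{code}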



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Use cases for DFDL

2019-11-07 Thread Paul Rogers
Hi Charles,

Your suggestion to read the schema in each reader can work. In this case, the 
planner knows nothing about the schema; it is discovered at scan time, by each 
reader, as the file is read.


Let's take a step back. Drill is designed for big data distributed processing. 
We might imagine having 100+ files of some DFDL format on HDFS, with, say, 10+ 
Drillbits reading those files using, say, 50 scan operators in separate 
threads (minor fragments).

My hunch is that, since the schema is the same for all files, it would be more 
efficient to read the schema at plan time, then pass the schema along as part 
of the "physical plan" to each scan operator. That way, in the scenario above, 
the schema would be read once (by the planner) rather than 100 times (by each 
reader in each scan operator.)

Further, Drill would know the type of the columns which can avoid ambiguities 
that occur when types are unknown.

Arina recently added schema support via a "provided schema." We passed this 
information to the CSV reader so it can operate with a schema. Perhaps we can 
look at what Arina did and figure out something similar for this use case. Or, 
maybe even use the DFDL schema in place of the "provided" schema. Someone will 
need to poke around a bit to figure out the best answer.

Thanks,
- Paul

 

On Thursday, November 7, 2019, 10:40:39 AM PST, Charles Givre 
 wrote:  
 
 @Paul, 
Do you think a format plugin is the right way to integrate this?  My thought 
was that we could create a folder for dfdl schemata, then the format plugin 
could specify which schema would be used during read.  IE:

"dfdl" :{
  "type":"dfdl",
  "file":"myschema.dfdl",
  "extensions":["xml"]
}

I was envisioning this working in much the same way as other format plugins 
that use an external parser.
-- C


> On Nov 7, 2019, at 1:35 PM, Paul Rogers  wrote:
> 
> Hi All,
> 
> One thought to add is that if DFDL defines the file schema, then it would be 
> ideal to use that schema at plan time as well as run time. Drill's Calcite 
> integration provides means to do this, though I am personally a bit hazy on 
> the details.
> 
> Certainly getting the reader to work is the first step; thanks Charles for 
> the excellent summary. Then, add the needed Calcite integration to make the 
> schema available to the planner at plan time.
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre 
> wrote:  
> 
> Hi Steve, 
> Thanks for responding... Here's how Drill reads a file:
> 
> Drill uses what are called "format plugins" which basically read the file in 
> question and map fields to column vectors.  Note:  Drill supports nested data 
> structures, so a column could contain a MAP or LIST. 
> 
> The basic steps are:
> 1.  Open the inputstream and read the file
> 2.  If the schema is known, it is advantageous to define the schema using a 
> schemaBuilder object in advance and create schemaWriters for each column.  In 
> this case, since we'd be using DFDL, we do know the schema so we could create 
> the schema BEFORE the data actually gets read.  If the schema is not known in 
> advance, JSON for instance, Drill can discover the schema as it is reading 
> the data, by dynamically adding column vectors as data is ingested, but 
> that's not the case here... 
> 3.  Once the schema is defined, Drill will then read the file row by row, 
> parse the data, and assign values to each column vector. 
> 
> There are a few more details but that's the essence.  
> 
> What would be great is if we could create a function that could directly map 
> a DFDL schema directly to a Drill SchemaBuilder. (Docs here [1])  Drill does 
> natively support JSON, however, it would probably be more effective and 
> efficient if there was an InfosetOutputter custom for Drill.  Ideally, we 
> need some sort of Iterable object so that Drill can map the parsed fields to 
> the schema.  
> 
> If you want to take a look at a relatively simple format plugin take a look 
> here: [2]. This file is the BatchReader which is where most of the heavy 
> lifting takes place.  This plugin is for ESRI Shape files and has a mix of 
> pre-defined fields, nested fields and fields that are defined after reading 
> starts.
> 
> 
> [1]: 
> https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md
>  
> <https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md>
> [2]: 
> https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java
>  
> <https://github.com/apache/drill/blob/mast

Re: Use cases for DFDL

2019-11-07 Thread Paul Rogers
Hi All,

One thought to add is that if DFDL defines the file schema, then it would be 
ideal to use that schema at plan time as well as run time. Drill's Calcite 
integration provides means to do this, though I am personally a bit hazy on the 
details.

Certainly getting the reader to work is the first step; thanks Charles for the 
excellent summary. Then, add the needed Calcite integration to make the schema 
available to the planner at plan time.

Thanks,
- Paul

 

On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre 
 wrote:  
 
 Hi Steve, 
Thanks for responding... Here's how Drill reads a file:

Drill uses what are called "format plugins" which basically read the file in 
question and map fields to column vectors.  Note:  Drill supports nested data 
structures, so a column could contain a MAP or LIST. 

The basic steps are:
1.  Open the inputstream and read the file
2.  If the schema is known, it is advantageous to define the schema using a 
schemaBuilder object in advance and create schemaWriters for each column.  In 
this case, since we'd be using DFDL, we do know the schema so we could create 
the schema BEFORE the data actually gets read.  If the schema is not known in 
advance, JSON for instance, Drill can discover the schema as it is reading the 
data, by dynamically adding column vectors as data is ingested, but that's not 
the case here... 
3.  Once the schema is defined, Drill will then read the file row by row, parse 
the data, and assign values to each column vector. 

There are a few more details but that's the essence.  

What would be great is if we could create a function that could directly map a 
DFDL schema directly to a Drill SchemaBuilder. (Docs here [1])  Drill does 
natively support JSON, however, it would probably be more effective and 
efficient if there was an InfosetOutputter custom for Drill.  Ideally, we need 
some sort of Iterable object so that Drill can map the parsed fields to the 
schema.  
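
As a rough sketch only, assuming the SchemaBuilder API described in the row set 
framework docs in [1] (package locations vary by Drill version, and the column 
names and types are illustrative stand-ins for DFDL-derived ones):

import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.exec.record.metadata.TupleMetadata;

public class DfdlSchemaSketch {
  static TupleMetadata buildSchema() {
    return new SchemaBuilder()
        .add("record_id", MinorType.INT)            // required INT column
        .addNullable("payload", MinorType.VARCHAR)  // nullable VARCHAR column
        .buildSchema();
  }
}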

If you want to take a look at a relatively simple format plugin take a look 
here: [2]. This file is the BatchReader which is where most of the heavy 
lifting takes place.  This plugin is for ESRI Shape files and has a mix of 
pre-defined fields, nested fields and fields that are defined after reading 
starts.


[1]: 
https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md
 

[2]: 
https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java
 



I can start a draft PR on the Drill side over the weekend and will share the 
link to this list.
Respectfully, 
-- C


> On Nov 5, 2019, at 8:12 AM, Steve Lawrence  
> wrote:
> 
> I definitely agree. Apache Drill seems like a logical place to add
> Daffodil support. And I'm sure many of us, including myself, would be
> happy to provide some time towards this effort.
> 
> The Daffodil API is actually fairly simple and is usually fairly
> straightforward to integrate--most of the complexity comes from the DFDL
> schemas. There's a good "hello world" available [1] that shows more API
> functionality/errors/etc., but the jist of it is:
> 
> 1) Compile a DFDL schema to a data processor:
> 
>  Compiler c = Daffodil.compiler();
>  ProcessorFactory pf = c.compileFile(file);
>  DataProcessor dp = pf.onPath("/");
> 
> 2) Create an input source for the data
> 
>  InputStream is = ...
>  InputSourceDataInputStream in = new InputSourceDataInputStream(is);
> 
> 3) Create an infoset outputter (we have a handful of differnt kinds)
> 
>  JDOMInfosetOutputter out = new JDOMInfosetOutputter();
> 
> 4) Use the DataProcessor to parse the input data to the infoset outputter
> 
>  ParseResult pr = dataProcessor.parse(in, out)
> 
> So I guess the parts that we would need more Drill understanding is what
> the InfosetOutputter (step 3) needs to look like to better integrate
> into Drill. Is there a standard data structure that Drill expects
> representations of data to look like and Drill does the querying on the
> data structure? And is there some sort of schema that Daffodil would
> need to create to describe what this structure looks like so it could
> query it? Perhaps we'd have a custom Drill InfosetOutputter that create
> this data structure, unless Drill already supports XML or JSON.
> 
> Or is it completely up to the Storage Plugin (is that the right term) to
> determine how to take a Drill query and find the appropriate data from
> the data store?
> 
> - Steve
> 
> [1]
> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java
> 
> 
> On 11/3/19 9:31 AM, Charles Givre wrote:
>> Hi Julian,
>> It seems like there is a beginning of convergence of the minds 

Re: Help for DRILL-3609

2019-11-06 Thread Paul Rogers
Hi Nitin,

As it turns out, I just had to fix a bug in the windowing operator. I'm not an 
expert on this operator, but perhaps I can offer a suggestion or two.

We have a few existing unit tests for window functions in TestWindowFrame. They 
are a bit hard to follow, however. Take a look at testFix3605(), which does:

select
  col2,
  lead(col2) over(partition by col2 order by col0) as lead_col2
from
  dfs.`window/fewRowsAllData.parquet`

When executed, we use the NoFrameSupportTemplate [1] class to do the work. 
Specifically, processPartition() contains code that handles the lead/lag by 1 
case.

I found it useful to enable saving of the generated code: [2]. When you step 
into the generated code, in Eclipse, you can set the source path to include 
/tmp/drill/codegen (on Linux/Mac). You can then see the contents of the 
generated copyPrev() and copyNext() functions.

If you step into these generated functions, you can see that the code simply 
takes two indexes: a to and a from, then copies the data from one to the other. 
As a result, I suspect that you do not need to change the generated code to 
achieve your goal.

Instead, you may want to change the processPartition() function: rather than 
the simple +/-1 logic it currently has, have it use your lead/lag offset.
By the way, a handy way to share work is simply to push your work to your 
private GitHub repo, then link to that code.

Thanks,
- Paul


[1] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/window/NoFrameSupportTemplate.java#L139
[2] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/window/WindowFrameRecordBatch.java#L366
 

On Wednesday, November 6, 2019, 10:04:12 PM PST, Nitin Pawar 
 wrote:  
 
 any help on this?


On Tue, Nov 5, 2019 at 7:09 PM Nitin Pawar  wrote:

> Ohh ok
> let me provide a google drive url
> Here
> 
> is the link. Can you check if can access it.
>
> Thanks,
> Nitin
>
> On Tue, Nov 5, 2019 at 7:02 PM Charles Givre  wrote:
>
>> Hi Nitin,
>> It seems to have been filtered out.
>>
>>
>> > On Nov 5, 2019, at 8:29 AM, Nitin Pawar 
>> wrote:
>> >
>> > Hi Charles,
>> >
>> > I have attached git patch.
>> > I was currently doing for lag function only for testing purposes
>> >
>> > Thanks,
>> > Nitin
>> >
>> > On Tue, Nov 5, 2019 at 6:34 PM Charles Givre > cgi...@gmail.com>> wrote:
>> > Hi Nitin,
>> > Thanks for your question.  Could you/did you share your code?  If not,
>> could you please post a draft PR so that we can take a look and offer
>> suggestions?
>> > Thanks,
>> > -- C
>> >
>> >
>> > > On Nov 5, 2019, at 7:27 AM, Nitin Pawar > > wrote:
>> > >
>> > > Hi Devs,
>> > >
>> > > I had sent request for this almost 2.5 years ago. Trying it again now.
>> > >
>> > > Currently Apache drill window functions LEAD and LAG support offset
>> as 1.
>> > > In another words in a given window these functions can return either
>> > > previous or next row only.
>> > >
>> > >
>> > > I am trying modify the behavior these function and allow offset >=1 in
>> > > query such as
>> > > select employee_id, department_id,salary, lag(salary,*4*)
>> over(partition by
>> > > department_id order by salary asc) from  cp.`employee.json`;
>> > >
>> > > I have managed to remove the limitation which fails the query can not
>> have
>> > > offset > 1 and able to pass the offset to actual function
>> implementation.
>> > >
>> > > Currently I am stuck where the record processor is crossing the window
>> > > boundary of department_id and gets row from next/previous window in
>> > > lead/lag function
>> > >
>> > > For eg: If you notice in row 2 for department_id=2, it is getting
>> previous
>> > > windows of department_id=1
>> > >
>> > > Here is sample output for below query
>> > > apache drill> select  employee_id, department_id,salary, lag(salary,4)
>> > > over(partition by department_id order by salary asc) from
>> > > cp.`employee.json` where department_id <=3;
>> > > +-+---+-+--+
>> > > | employee_id | department_id | salary  |  EXPR$3  |
>> > > +-+---+-+--+
>> > > | 20          | 1            | 3.0 | null    |
>> > > | 5          | 1            | 35000.0 | null    |
>> > > | 22          | 1            | 35000.0 | null    |
>> > > | 21          | 1            | 35000.0 | null    |
>> > > | 2          | 1            | 4.0 | 3.0  |
>> > > | 4          | 1            | 4.0 | 35000.0  |
>> > > | 1          | 1            | 8.0 | 35000.0  |
>> > > | 37          | 2            | 6700.0  | null    |
>> > > | 38          | 2            | 8000.0  | 4.0  |
>> > > | 39          | 2            | 1.0 | 4.0  |
>> > > | 40          | 2            | 1.0 | 8.0  |
>> > > | 6          | 2            | 25000.0 | 6700.0  |

[jira] [Created] (DRILL-7439) Batch count fixes for six additional operators

2019-11-05 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7439:
--

 Summary: Batch count fixes for six additional operators
 Key: DRILL-7439
 URL: https://issues.apache.org/jira/browse/DRILL-7439
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


Enables vector checks, and fixes batch count and vector issues for:

* StreamingAggBatch
* RuntimeFilterRecordBatch
* FlattenRecordBatch
* MergeJoinBatch
* NestedLoopJoinBatch
* LimitRecordBatch




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Drill Storage Plugins

2019-11-05 Thread Paul Rogers
Hi Charles,

Storage plugins are a bit complex because they integrate not just with the 
runtime engine, but also with the Calcite planning engine. Format plugins are 
simpler because they are mostly runtime-only. The "Easy" framework hides much 
of the planner integration, and the EVF "Easier" revisions hide the details 
even more.

Yes, we did choose to omit the storage plugins from the Drill book because of 
the large amount of complexity involved. I like the suggestion that we gather 
information about how to create a storage plugin and post it somewhere. If we 
do an "Expanded and Revised" edition of the book, we can incorporate the 
material at that time.

We may also want to create something like the "Easy" framework that hides (or 
at least simplifies) all the Calcite knick-knacks that we must currently fiddle 
with.

Finally, the simplest possible storage plugin is the "MockStoragePlugin" in 
Drill itself. This plugin uses some crazy tricks to generate random data which 
we use when testing operators, such as when we need a large number of rows to 
test sort spilling. The Mock plugin is a bit of a mess, but it at least mostly 
"factors out" any additional complexity that comes from interfacing with an 
external system.

Thanks,
- Paul

 

On Tuesday, November 5, 2019, 6:14:06 AM PST, Charles Givre 
 wrote:  
 
 One more thing: I've found code for storage plugins (in various states of 
completion) for the following systems:
DynamoDB: https://github.com/fineoio/drill-dynamo-adapter
Apache Druid (current draft PR): https://github.com/apache/drill/pull/1888
Couchbase: https://github.com/LyleLeo/Apache-Drill-CouchDB-Storage-Plugin (author 
said he would consider submitting it as a PR)
ElasticSearch: https://github.com/javiercanillas/drill-storage-elastic, 
https://github.com/gaoshui87/drill-storage-elastic
Apache Solr

Are there others that anyone knows of?


> On Nov 4, 2019, at 10:23 PM, Charles Givre  wrote:
> 
> Hello all, 
> I've written some UDFs and Format plugins for Drill and I'm interested in 
> tackling a storage plugin.  One of my regrets from the Drill book was that we 
> didn't get into this topic.  For those of you who have written one, my hat's 
> off to you. I wanted to ask if there are any resources or tutorials available 
> that you found particularly helpful?  I'm having a little trouble figuring 
> out what all the pieces do and how they fit together.
> 
> Does anyone have any ideas about which storage plugins should be implemented?  
> Personally I'd really like to see one for ElasticSearch,
> Best,
> -- C
  

[jira] [Created] (DRILL-7436) Fix record count, vector structure issues in several operators

2019-11-03 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7436:
--

 Summary: Fix record count, vector structure issues in several 
operators
 Key: DRILL-7436
 URL: https://issues.apache.org/jira/browse/DRILL-7436
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


This is the next in a continuing series of fixes to the container record count, 
batch record count, and vector structure in several operators. This batch 
represents the smallest change needed to add checking for the Filter operator.

In order to get Filter to pass checks, many of its upstream operators needed to 
be fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7435) JSON reader incorrectly adds a LATE type to union vector

2019-11-03 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7435:
--

 Summary: JSON reader incorrectly adds a LATE type to union vector
 Key: DRILL-7435
 URL: https://issues.apache.org/jira/browse/DRILL-7435
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


Run Drill with a fix for DRILL-7434. Now, another test fails: 
{{TestJsonReader.testTypeCase()}} fails when it tries to set the value count. 
Evidently the JSON reader has added the {{LATE}} type to the Union vector. 
However, there is no vector type associated with the {{LATE}} type. An attempt 
to get the member for this type throws an exception.

The simple work around is to special-case this type when setting the value 
count. The longer-term fix is to not add the {{LATE}} type to a union vector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7434) TopNBatch constructs Union vector incorrectly

2019-11-03 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7434:
--

 Summary: TopNBatch constructs Union vector incorrectly
 Key: DRILL-7434
 URL: https://issues.apache.org/jira/browse/DRILL-7434
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


The Union type is an "experimental" type that has never been completed. Yet, we 
use it as if it works.

Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this with 
the new batch validator enabled. This test creates a union vector. Here is how 
the schema looks:

{noformat}
(UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
  children=([`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)])])
{noformat}

This is very hard to follow because the Union vector structure is complex (and 
has many issues.) Let's work though it.

We are looking at the {{MaterializedField}} for the union vector. It tells us 
that this Union has two types: {{FLOAT8}} and {{INT}}. All good.

The Union has a vector per type, stored in an "internal map". That map shows 
up as a child; it is there on the {{children}} list as {{internal}}. However, the 
metadata claims that only one vector exists in that map: the {{types}} vector 
(the one that tells us what type to use for each row.)  The vectors for 
{{FLOAT8}} and {{INT}} are missing.

If, however, we use our debugger and inspect the actual contents of the 
{{internal}} map, we get the following:

{noformat}
[`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` 
(FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
{noformat}

That is, the internal map has the correct schema, but the Union vector itself 
has the wrong (incomplete) schema.

This is an inherent design flaw with Union vector: it requires two copies of 
the schema to be in sync. Further {{MaterializedField}} was designed to be 
immutable, but the map and Union types require mutation. If the Union simply 
points to the actual Map vector {{MaterializedField}}, it will drift out of 
date since the map vector creates a new schema each time we add fields; the 
Union vector ends up pointing to the old one.

This is not a simple bug to fix, but the result of the bug is that the vectors 
end up corrupted, as detected by the Batch Validator. In fact, the bug itself 
is subtle.

The TopNBatch does pass vector validation. However, because of the incorrect 
metadata, the downstream {{RemovingRecordBatch}} creates the derived Union 
vector incorrectly: it fails to set the value count for the {{INT}} type.

{noformat}
Found one or more vector errors from RemovingRecordBatch
kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
{noformat}

Here, {{kl-type-INT}} is an ad-hoc way of saying that we are checking the {{INT}} 
type vector for a Union named {{kl}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7428) Drill incorrectly allows a repeated map field to be projected to top level

2019-10-29 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7428:
--

 Summary: Drill incorrectly allows a repeated map field to be 
projected to top level
 Key: DRILL-7428
 URL: https://issues.apache.org/jira/browse/DRILL-7428
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


Consider the following query from the [Mongo DB 
tests|https://github.com/apache/drill/blob/master/contrib/storage-mongo/src/test/java/org/apache/drill/exec/store/mongo/MongoTestConstants.java#L80]:

{noformat}
select t.name as name, t.topping.type as type 
  from mongo.%s.`%s` t where t.sales >= 150
{noformat}


The query is used in 
[{{TestMongoQueries.testUnShardedDBInShardedClusterWithProjectionAndFilter()}}|https://github.com/apache/drill/blob/master/contrib/storage-mongo/src/test/java/org/apache/drill/exec/store/mongo/TestMongoQueries.java#L89].
 
Here it turns out that {{topping}} is a repeated map. The query is projecting 
the members of that map to the top level. The query has five rows, but 24 
values in the repeated map. The Project operator allows the projection, 
resulting in an output batch in which most vectors have 5 values, but the 
{{topping}} column, now at the top level and no longer in the map, has 24 
values.

As a result, the first five values, formerly associated with the first record, 
are now associated with the first five top-level records, while the values 
formerly associated with records 1-4 are lost.

Thus, this is a data corruption bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7425) Remove redundant record count field from operators

2019-10-27 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7425:
--

 Summary: Remove redundant record count field from operators
 Key: DRILL-7425
 URL: https://issues.apache.org/jira/browse/DRILL-7425
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


Work on the container count operator bugs revealed that multiple operators 
(Project, Unnest, probably others) maintain a record count field. As it turns 
out, the only semantically valid value for this field is to have the same value 
as the container record count. Hence, the record count field is redundant, and 
is just another detail to get right.

The only real use of the variable is to report a record count of 0 before the 
first batch is created. Because of the way the container reports counts, the 
container will throw an exception if asked for its count after it is created 
but before the count has been set to 0.

The goal of this ticket is to remove the variable, addressing any resulting 
issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7424) Project operator fails to set the container row count

2019-10-27 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7424:
--

 Summary: Project operator fails to set the container row count
 Key: DRILL-7424
 URL: https://issues.apache.org/jira/browse/DRILL-7424
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


Enabled the "batch validator" for the Project operator. Ran tests. Exceptions 
occurred because, in some paths, the Project operator fails to set the 
container row count.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-7333) Batch of container count fixes

2019-10-27 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7333.

Resolution: Incomplete

Abandoned this one; too many changes in one go. Will submit the work as a set 
of smaller PRs.

> Batch of container count fixes
> --
>
> Key: DRILL-7333
> URL: https://issues.apache.org/jira/browse/DRILL-7333
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>    Reporter: Paul Rogers
>    Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.17.0
>
>
> See DRILL-7325. The set of operators to be fixed is large. This ticket covers 
> a subset of operators.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover

2019-10-20 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7414:
--

 Summary: EVF incorrectly sets buffer writer index after rollover
 Key: DRILL-7414
 URL: https://issues.apache.org/jira/browse/DRILL-7414
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


A full test run, with vector validation enabled and with the "new" scan 
enabled,  revealed the following in {{TestMockPlugin.testSizeLimit()}}:

{noformat}
comments_s2 - VarCharVector: Row count = 838, but value count = 839
{noformat}

Adding vector validation to the result set loader overflow tests reveals that 
the problem is in overflow. In 
{{TestResultSetLoaderOverflow.testOverflowWithNullables()}}:

{noformat}
a - RepeatedIntVector: Row count = 2952, but value count = 2953
b - RepeatedVarCharVector: Row count = 2952, but value count = 2953
b - RepeatedVarCharVector: Vector has 2953 values, but offset vector labels 
32472 values
c - RepeatedIntVector: Row count = 2952, but value count = 2953
d - RepeatedIntVector: Row count = 2952, but value count = 2953
{noformat}

The problem is that EVF incorrectly sets the offset buffer writer index after a 
rollover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7413) Scan operator does not set the container record count

2019-10-20 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7413:
--

 Summary: Scan operator does not set the container record count
 Key: DRILL-7413
 URL: https://issues.apache.org/jira/browse/DRILL-7413
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


Enable the vector checking provided in DRILL-7403. Enable just for the JSON 
reader. You will get the following error:

{noformat}
12:36:57.399 [22549a3d-a937-df51-2e13-4b032ba143f9:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
ScanBatch
ScanBatch: Container record count not set
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7412) Minor unit test improvements

2019-10-20 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7412:
--

 Summary: Minor unit test improvements
 Key: DRILL-7412
 URL: https://issues.apache.org/jira/browse/DRILL-7412
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


Many tests intentionally trigger errors. A debug-only log setting sent those 
errors to stdout, and the resulting stack dumps simply cluttered the test 
output, so this change disables error output to the console.

Drill can apply bounds checks to vectors. Tests run via Maven enable bounds 
checking. Now, bounds checking is also enabled in "debug mode" (when assertions 
are enabled, as in an IDE.)
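
For reference, the usual Java idiom for detecting whether assertions are 
enabled (and hence whether this "debug mode" applies) looks like the sketch 
below; this is standard Java, shown only to illustrate the mechanism, not a 
copy of Drill's actual check.

{code:java}
// Returns true only when the JVM runs with assertions enabled (-ea),
// e.g. when unit tests are launched from an IDE in debug mode.
public static boolean assertionsEnabled() {
  boolean enabled = false;
  assert enabled = true;  // the assignment executes only under -ea
  return enabled;
}
{code}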

Drill contains two test frameworks. The older BaseTestQuery was marked as 
deprecated, but many tests still use it and are unlikely to be changed soon. 
So, this change removes the deprecated marker to reduce the number of spurious 
warnings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7403) Validate batch checks, vector integrity, in unit tests

2019-10-13 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7403:
--

 Summary: Validate batch checks, vector integrity, in unit tests
 Key: DRILL-7403
 URL: https://issues.apache.org/jira/browse/DRILL-7403
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


Drill provides a {{BatchValidator}} that checks vectors. It is disabled by 
default. This enhancement adds more checks, including checks for row counts (of 
which there are surprisingly many.)

Since most operators will fail if the check is enabled, this enhancement also 
adds a table to keep track of which operators pass the checks (and for which 
checks should be enabled) and those that still need work. This allows the 
checks to exist in the code, and to be enabled incrementally as we fix the 
various problems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7402) Suppress batch dumps for expected failures in tests

2019-10-13 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7402:
--

 Summary: Suppress batch dumps for expected failures in tests
 Key: DRILL-7402
 URL: https://issues.apache.org/jira/browse/DRILL-7402
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


Drill provides a way to dump the last few batches when an error occurs. 
However, in tests, we often deliberately cause something to fail. In this case, 
the batch dump is unnecessary.

This enhancement adds a config property, disabled in tests, that controls the 
dump activity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-5914) CSV (text) reader fails to parse quoted newlines in trailing fields

2019-10-05 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-5914.

Resolution: Fixed

This issue was fixed as part of the "Compliant text reader V3" project. The 
test cited in the description now correctly reports 4 lines for the 
{{COUNT(*)}} query.

> CSV (text) reader fails to parse quoted newlines in trailing fields
> ---
>
> Key: DRILL-5914
> URL: https://issues.apache.org/jira/browse/DRILL-5914
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> Consider the existing `TestCsvHeader.testCountOnCsvWithHeader()` unit test. 
> The input file is as follows:
> {noformat}
> Year,Make,Model,Description,Price
> 1997,Ford,E350,"ac, abs, moon",3000.00
> 1999,Chevy,"Venture ""Extended Edition""","",4900.00
> 1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
> 1996,Jeep,Grand Cherokee,"MUST SELL!
> air, moon roof, loaded",4799.00
> {noformat}
> Note the newline inside the description in the last record.
> If we do a `SELECT *` query, the file is parsed fine; we get 4 records.
> If we do a `SELECT Year, Model` query, the CSV reader uses a special trick: 
> it short-circuits reads on the three columns that are not wanted:
> {code}
> TextReader.parseRecord() {
> ...
> if (earlyTerm) {
>   if (ch != newLine) {
> input.skipLines(1); // <-- skip lines
>   }
>   break;
> }
> {code}
> This method skips forward in the file, discarding characters until it hits a 
> newline:
> {code}
>   do {
> nextChar();
>   } while (lineCount < expectedLineCount);
> {code}
> Note that this code handles individual characters, it is not aware of 
> per-field semantics. That is, unlike the higher-level parser methods, the 
> `nextChar()` method does not consider newlines inside of quoted fields to be 
> special.
> This problem shows up acutely in a `SELECT COUNT\(*)` style query that skips 
> all fields; the result is we count the input as five lines, not four.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: MongoDB Question

2019-09-26 Thread Paul Rogers
Hi Charles,

This kind of data-specific setting really should be associated with a schema so 
that it is DB-specific, but consistent across queries. That is almost the 
definition of a schema...

In the future, once the recently-added schema system is more widely used, one 
might be able to set this as a property of a table with ALTER SCHEMA. However, 
the Mongo plugin does not yet support that functionality.

Since we don't have a way to do that, there is an obscure hack that can be 
used. In the drill-override.conf file, one can change the default value for 
system/session variables.

drill.exec.options: {
    store.mongo.read_numbers_as_double: true
}

This is not a documented feature, but it can be handy in unusual situations 
such as this.

Caveat: I've not tried this, but it should work as we have done something 
similar in other obscure cases. Perhaps someone can try it.

Thanks,
- Paul

 

On Thursday, September 26, 2019, 5:56:43 AM PDT, Charles Givre 
 wrote:  
 
 Hello all, 
Is it possible to set store.mongo.read_numbers_as_double in the storage plugin 
configuration, or someplace else so this is set permanently?
Thanks,
-- C  

Re: EVF Question: FULL BATCH?

2019-09-24 Thread Paul Rogers
Hi Charles,

Looks like I forgot the extra newlines needed to make my e-mail provider work 
with Apache's mailer. Let me try again.

In the "classic" readers, each reader picks some number of rows per batch, 
often 1K, 4K, 4000, etc. The idea is that, on average, this row count will give 
us a decently-sized record batch.

EVF uses a different approach. It tallies the total space used by the batch, 
and the space taken by each vector, and it will decide when the batch is full. 
(You can also set a row count limit. By default, the limit is 64K, the maximum 
row count allowed in a Drill batch.)

So, if you are converting from a "classic" reader to EVF, you will want to 
remove the reader-imposed row count limit. (Or, if there is reason to do so, 
you can pass that limit to EVF and let EVF enforce it.)

For example, convert code that looks like this:

for (int rowCount = 0; rowCount < MAX_BATCH_SIZE; rowCount++) {
  // Load a row
}

Into code that looks like this:

RowSetLoader rowWriter = // get from EVF per examples
while (! rowWriter.isFull()) {
   // Load a row
    rowWriter.save();
}

Note that the "save" is needed because EVF let's you discard a row: you can 
load it, check it, and decide to skip it. This will, eventually, allow us to 
push filtering down to the reader. For now, you just need to call save().
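
Putting the pieces together, a batch reader's next() method under EVF ends up 
looking roughly like the sketch below. ResultSetLoader and RowSetLoader are the 
EVF classes discussed above; loadOneRow() is a hypothetical per-reader helper 
that writes one row's columns and returns false at end of input.

public boolean next() {
  RowSetLoader rowWriter = loader.writer();  // 'loader' is this reader's ResultSetLoader
  while (!rowWriter.isFull()) {              // EVF decides when the batch is full
    if (!loadOneRow(rowWriter)) {
      return false;                          // end of input
    }
    rowWriter.save();                        // commit the row (or skip it to filter)
  }
  return true;                               // batch is full; EVF will call next() again
}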


Thanks,
- Paul

 

On Tuesday, September 24, 2019, 09:56:34 AM PDT, Paul Rogers 
 wrote:  
 
 So the usual pattern is:
while (!rowWriter.isFull()) {
  // Load the row
  rowWriter.save();
}
Is it the case that PCAP is trying to force the row count to, say 4K or 8K or 
whatever? If so, ignore that count.
The error is telling you that at least one vector has reached 16 MB in size (or 
you've reached the row count limit, if you set that.)
Thanks,
- Paul

 

    On Monday, September 23, 2019, 08:09:38 PM PDT, Charles Givre 
 wrote:  
 
 Ok... so I have yet another question relating to the EVF.  I'm working on a 
project to improve (hopefully) the PCAP plugin with the ultimate goal being to 
include parsed PCAP packet data.  In any event, I've run into a snag.  In one 
unit test, I'm getting the error below when I call rowWriter.save().  I suspect 
what is happening is that the TupleWriter is not full when it starts writing 
the row, but by the time it is finished writing the fields, the batch is full.  
Does that even make sense?

Here's a link to the offending code 
<https://github.com/cgivre/drill/blob/caa69e7f27f68aeedaa28902596a2250ab32cd84/exec/java-exec/src/main/java/org/apache/drill/exec/store/pcap/PcapBatchReader.java#L232>:
 


Any suggestions?
Thanks!
--C




[Error Id: bf6446fe-95d5-4ba7-bf7f-32e8a5dd4d1e on 192.168.1.21:31010]

  (java.lang.IllegalStateException) Unexpected state: FULL_BATCH
    
org.apache.drill.exec.physical.resultSet.impl.ResultSetLoaderImpl.saveRow():530
    org.apache.drill.exec.physical.resultSet.impl.RowSetLoaderImpl.save():73
    org.apache.drill.exec.store.pcap.PcapBatchReader.addDataToTable():232
    
org.apache.drill.exec.store.pcap.PcapBatchReader.parsePcapFilesAndPutItToTable():174
    org.apache.drill.exec.store.pcap.PcapBatchReader.next():102
    
org.apache.drill.exec.physical.impl.scan.framework.ShimBatchReader.next():132
    org.apache.drill.exec.physical.impl.scan.ReaderState.readBatch():412
    org.apache.drill.exec.physical.impl.scan.ReaderState.next():369
    org.apache.drill.exec.physical.impl.scan.ScanOperatorExec.nextAction():261
    org.apache.drill.exec.physical.impl.scan.ScanOperatorExec.next():232
    org.apache.drill.exec.physical.impl.protocol.OperatorDriver.doNext():196
    org.apache.drill.exec.physical.impl.protocol.OperatorDriver.start():174
    org.apache.drill.exec.physical.impl.protocol.OperatorDriver.next():124
    org.apache.drill.exec.physical.impl.protocol.OperatorRecordBatch.next():148
    
org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next():237
    org.apache.drill.exec.record.AbstractRecordBatch.next():126
    org.apache.drill.exec.record.AbstractRecordBatch.next():116
    org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63
    
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141
    org.apache.drill.exec.record.AbstractRecordBatch.next():186
    
org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next():237
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():83
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1746
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
    org.apache.dri

Re: EVF Question: FULL BATCH?

2019-09-24 Thread Paul Rogers
So the usual pattern is:
while (!rowWriter.isFull()) {
  // Load the row
  rowWriter.save();
}
Is it the case that PCAP is trying to force the row count to, say 4K or 8K or 
whatever? If so, ignore that count.
The error is telling you that at least one vector has reached 16 MB in size (or 
you've reached the row count limit, if you set that.)
Thanks,
- Paul

 

On Monday, September 23, 2019, 08:09:38 PM PDT, Charles Givre 
 wrote:  
 
 Ok... so I have yet another question relating to the EVF.  I'm working on a 
project to improve (hopefully) the PCAP plugin with the ultimate goal being to 
include parsed PCAP packet data.  In any event, I've run into a snag.  In one 
unit test, I'm getting the error below when I call rowWriter.save().  I suspect 
what is happening is that the TupleWriter is not full when it starts writing 
the row, but by the time it is finished writing the fields, the batch is full.  
Does that even make sense?

Here's a link to the offending code 
:
 


Any suggestions?
Thanks!
--C




[Error Id: bf6446fe-95d5-4ba7-bf7f-32e8a5dd4d1e on 192.168.1.21:31010]

  (java.lang.IllegalStateException) Unexpected state: FULL_BATCH
    
org.apache.drill.exec.physical.resultSet.impl.ResultSetLoaderImpl.saveRow():530
    org.apache.drill.exec.physical.resultSet.impl.RowSetLoaderImpl.save():73
    org.apache.drill.exec.store.pcap.PcapBatchReader.addDataToTable():232
    
org.apache.drill.exec.store.pcap.PcapBatchReader.parsePcapFilesAndPutItToTable():174
    org.apache.drill.exec.store.pcap.PcapBatchReader.next():102
    
org.apache.drill.exec.physical.impl.scan.framework.ShimBatchReader.next():132
    org.apache.drill.exec.physical.impl.scan.ReaderState.readBatch():412
    org.apache.drill.exec.physical.impl.scan.ReaderState.next():369
    org.apache.drill.exec.physical.impl.scan.ScanOperatorExec.nextAction():261
    org.apache.drill.exec.physical.impl.scan.ScanOperatorExec.next():232
    org.apache.drill.exec.physical.impl.protocol.OperatorDriver.doNext():196
    org.apache.drill.exec.physical.impl.protocol.OperatorDriver.start():174
    org.apache.drill.exec.physical.impl.protocol.OperatorDriver.next():124
    org.apache.drill.exec.physical.impl.protocol.OperatorRecordBatch.next():148
    
org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next():237
    org.apache.drill.exec.record.AbstractRecordBatch.next():126
    org.apache.drill.exec.record.AbstractRecordBatch.next():116
    org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63
    
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141
    org.apache.drill.exec.record.AbstractRecordBatch.next():186
    
org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next():237
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():83
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1746
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1149
    java.util.concurrent.ThreadPoolExecutor$Worker.run():624
    java.lang.Thread.run():748

    at 
org.apache.drill.exec.rpc.RpcException.mapException(RpcException.java:60) 
~[classes/:na]
    at 
org.apache.drill.exec.client.DrillClient$ListHoldingResultsListener.getResults(DrillClient.java:881)
 ~[classes/:na]
    at org.apache.drill.exec.client.DrillClient.runQuery(DrillClient.java:583) 
~[classes/:na]
    at 
org.apache.drill.test.BaseTestQuery.testRunAndReturn(BaseTestQuery.java:340) 
~[test-classes/:na]
    at 
org.apache.drill.test.BaseTestQuery.testSqlWithResults(BaseTestQuery.java:321) 
~[test-classes/:na]
    at 
org.apache.drill.exec.store.pcap.TestPcapRecordReader.runSQLWithResults(TestPcapRecordReader.java:90)
 ~[test-classes/:na]
    at 
org.apache.drill.exec.store.pcap.TestPcapRecordReader.runSQLVerifyCount(TestPcapRecordReader.java:85)
 ~[test-classes/:na]
    at 
org.apache.drill.exec.store.pcap.TestPcapRecordReader.testCorruptPCAPQuery(TestPcapRecordReader.java:47)
 ~[test-classes/:na]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_211]
Caused by: org.apache.drill.common.exceptions.UserRemoteException: 
EXECUTION_ERROR ERROR: Unexpected state: FULL_BATCH

Read failed for reader PcapBatchReader
Fragment 0:0

[Error Id: bf6446fe-95d5-4ba7-bf7f-32e8a5dd4d1e on 192.168.1.21:31010]

  (java.lang.IllegalStateException) Unexpected 

Re: UDFs not Working in Unit Tests

2019-09-20 Thread Paul Rogers
Hi Charles,

Your workaround sounds right. In general, unit tests should make as few 
assumptions as possible about the system under test; they should focus on the 
one part of the system they want to exercise. (There is a large body of testing 
theory behind this statement...)

In your case, it may be that the contrib project that contains your UDFs is not 
a dependency of the contrib project with your format plugin. Nor should it be.

We have a separate set of end-to-end tests which would be a great place to, 
say, make sure that all your UDFs work with all your format plugins. 
Unfortunately, these can't be run by developers; you'd need help from whoever 
currently owns the test framework.

Thanks,
- Paul

 

On Friday, September 20, 2019, 09:50:54 AM PDT, Charles Givre 
 wrote:  
 
 Hey Paul, 
Thanks for your help.  I should have clarified that the work I'm doing is for 
an ESRI Shape File plugin which is in contrib.  The unit test in question calls 
a function which is in the contrib/udfs and I think you've pinpointed the 
issue.  Unfortunately, running the test with maven produced the same error.  I 
may just remove the unit test in question since the others pass, and this 
really doesn't have anything to do with the functionality of the code. 
-- C

> On Sep 20, 2019, at 12:40 PM, Paul Rogers  wrote:
> 
> Hi Charles,
> 
> I seem to recall fighting with something similar in the past. The problem is 
> not with your setup; it is with how Drill finds your (custom?) UDF on the 
> classpath.
> 
> My memory is hazy; but I think it had to do with the way that Drill uses the 
> drill-override.conf file to extend class path scanning, and how it finds the 
> UDF code. I think I banged on it until it worked, but I don't recall what I 
> did.
> 
> Maybe I do remember. I think that, to run your code in the IDE, you need to 
> add the source code directory to your class path. (Recall that Drill needs 
> access to both function source and compiled code.) I think I modified my IDE 
> debug launch command to always include the proper sources. I don't have that 
> config in front of me; I'll check it this weekend to see if I can find the 
> exact item you must add.
> 
> A workaround may be to run the test using Maven [1], since the Maven configs 
> will do the needed magic:
> 
> cd exec/java-exec
> mvn surefire:test -Dtest=YourTest
> 
> 
> The other possibility is that you have your UDF in the "contrib" project, 
> while you are running unit tests in the "exec" project. Exec does not depend 
> on contrib, so contrib code is not visible to unit tests in Exec. The same is 
> true, by the way, for the use of the JDBC driver, since that is in the Maven 
> project after exec.
> 
> Thanks,
> - Paul
> 
> [1] 
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/single-test.html
> 
> 
> 
> 
>    On Friday, September 20, 2019, 07:44:55 AM PDT, Charles Givre 
> wrote:  
> 
> Hello Drillers, 
> I'm encountering a strange error in a unit test.  The code is included below, 
> and it fails because when Drill attempts to execute the test, it cannot find 
> the function st_astext().  If I build Drill and execute the query in the CLI 
> it works, so I suspect there is some environment issue rather than a Drill 
> issue.  Does anyone have any suggestions?
> Thanks!
> 
> 
> @BeforeClass
> public static void setup() throws Exception {
>  startCluster(ClusterFixture.builder(dirTestWatcher));
> 
>  DrillbitContext context = cluster.drillbit().getContext();
>  FileSystemConfig original = (FileSystemConfig) 
>context.getStorage().getPlugin("cp").getConfig();
>  Map newFormats = new 
>HashMap<>(original.getFormats());
>  newFormats.put("shp", new ShpFormatConfig());
>  FileSystemConfig pluginConfig = new 
>FileSystemConfig(original.getConnection(), original.getConfig(), 
>original.getWorkspaces(), newFormats);
>  pluginConfig.setEnabled(true);
>  context.getStorage().createOrUpdate("cp", pluginConfig, true);
> }
> ...
> 
> @Test
> public void testShpQuery() throws Exception {
> 
>  testBuilder()
>    .sqlQuery("select gid, srid, shapeType, name, st_astext(geom) as wkt "
>      + "from cp.`CA-cities.shp` where gid = 100")
>    .ordered()
>    .baselineColumns("gid", "srid", "shapeType", "name", "wkt")
>    .baselineValues(100, 4326, "Point", "Jenny Lind", "POINT (-120.8699371 
>38.0949216)")
>    .build()
>    .run();
> }
  

Re: UDFs not Working in Unit Tests

2019-09-20 Thread Paul Rogers
Hi Charles,

I seem to recall fighting with something similar in the past. The problem is 
not with your setup; it is with how Drill finds your (custom?) UDF on the 
classpath.

My memory is hazy; but I think it had to do with the way that Drill uses the 
drill-override.conf file to extend class path scanning, and how it finds the 
UDF code. I think I banged on it until it worked, but I don't recall what I did.

Maybe I do remember. I think that, to run your code in the IDE, you need to add 
the source code directory to your class path. (Recall that Drill needs access 
to both function source and compiled code.) I think I modified my IDE debug 
launch command to always include the proper sources. I don't have that config 
in front of me; I'll check it this weekend to see if I can find the exact item 
you must add.

A workaround may be to run the test using Maven [1], since the Maven configs 
will do the needed magic:

cd exec/java-exec
mvn surefire:test -Dtest=YourTest


The other possibility is that you have your UDF in the "contrib" project, while 
you are running unit tests in the "exec" project. Exec does not depend on 
contrib, so contrib code is not visible to unit tests in Exec. The same is 
true, by the way, for the use of the JDBC driver, since that is in the Maven 
project after exec.

Thanks,
- Paul

[1] 
https://maven.apache.org/surefire/maven-surefire-plugin/examples/single-test.html


 

On Friday, September 20, 2019, 07:44:55 AM PDT, Charles Givre 
 wrote:  
 
 Hello Drillers, 
I'm encountering a strange error in a unit test.  The code is included below, 
and it fails because when Drill attempts to execute the test, it cannot find 
the function st_astext().  If I build Drill and execute the query in the CLI it 
works, so I suspect there is some environment issue rather than a Drill issue.  
Does anyone have any suggestions?
Thanks!


@BeforeClass
public static void setup() throws Exception {
  startCluster(ClusterFixture.builder(dirTestWatcher));

  DrillbitContext context = cluster.drillbit().getContext();
  FileSystemConfig original = (FileSystemConfig) 
context.getStorage().getPlugin("cp").getConfig();
  Map newFormats = new 
HashMap<>(original.getFormats());
  newFormats.put("shp", new ShpFormatConfig());
  FileSystemConfig pluginConfig = new 
FileSystemConfig(original.getConnection(), original.getConfig(), 
original.getWorkspaces(), newFormats);
  pluginConfig.setEnabled(true);
  context.getStorage().createOrUpdate("cp", pluginConfig, true);
}
...

@Test
public void testShpQuery() throws Exception {

  testBuilder()
    .sqlQuery("select gid, srid, shapeType, name, st_astext(geom) as wkt "
      + "from cp.`CA-cities.shp` where gid = 100")
    .ordered()
    .baselineColumns("gid", "srid", "shapeType", "name", "wkt")
    .baselineValues(100, 4326, "Point", "Jenny Lind", "POINT (-120.8699371 
38.0949216)")
    .build()
    .run();
}
  

Re: [DISCUSS]: Changes to Formatting Rules

2019-08-16 Thread Paul Rogers
Hi Charles,
Agree. I am reviewing a PR in which formatting was changed. Both the original 
and revised formatting are fine, and each is often seen in open source 
projects. But we really should settle on one style so we don't have each of us 
reformatting code to our favorite styles.

Drill does have documented standards, [1], but they are out of date with 
respect to the Checkstyle rules. The IntelliJ and Eclipse formatting templates 
are out of date with respect to the style preferred by some team members (and, 
it seems are inconsistent with the Checkstyle rules.) Not surprisingly, this 
inconsistency leads to a bit of chaos and uncertainty.

Code formatting is a hassle and the "right" style is always what any one person 
prefers. Still, it might be worthwhile to discuss and formalize our current 
preferences so we can update the format templates to match.

Perhaps each of us can toss out ideas for either a) the preferred formatting 
not covered by Checkstyle, or b) where the format templates and Checkstyle 
rules conflict.

Would we prefer to do that here, or via a JIRA ticket?

Thanks,
- Paul

[1] http://drill.apache.org/docs/apache-drill-contribution-guidelines/

 

On Friday, August 16, 2019, 09:21:51 AM PDT, Charles Givre 
 wrote:  
 
 Hello all, 
I recently submitted a PR and after running the code formatter in IntelliJ, 
there still were a lot of formatting issues.  Personally, I don't really care 
> what the conventions are that we use, but is it possible to have the 
formatting template in IntelliJ match the Drill requirements?
-- C    

Re: WebUI is Vulnerable to CSRF?

2019-08-15 Thread Paul Rogers
Hi Don,

The one saving grace is that no one should ever host the Drill web UI on a 
public-facing web site. The UI provides lots of admin operations that one would 
not really want to expose openly.


A much better solution would be to wrap Drill in a custom-made web app that 
controls what someone can do; the same way that a DB is exposed via a custom 
app, not by a public-facing PhpMyAdmin...

Still, this should be fixed. Please file a JIRA with your findings.

Thanks,
- Paul

 

On Thursday, August 15, 2019, 8:33:19 PM PDT, Don Perial 
 wrote:  
 
 It seems that there is no way to protect the WebUI from CSRF and the fact that 
the value for the access-control-allow-origin header is '*' appears to confound 
this issue as well. I have searched the documentation and also did quite a bit 
of Googling but have not seen any references to this. Is this known and/or 
intended behavior?
The attached file should demonstrate the (elementary) attack.

Thanks In advance,
P
  

Re: complex data structure aggregators?

2019-08-12 Thread Paul Rogers
Hi Ted,

You are now at the point that you'll have to experiment. Drill provides an 
annotation for aggregate state:  @Workspace. The value must be declared as a 
"holder". You'll have to check if VarBinaryHolder is allowed, and, if so, how 
you allocate memory and remember the offset into the array. (My guess is that 
this may not work.)
@Workspace does allow you to specify a holder for a Java object, but such 
objects won't be spilled to disk when, say, the hash aggregate spills. This 
means your aggregate will work fine at small scale, then mysteriously fail once 
moved into production. Fun.
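
For orientation, here is a minimal aggregate UDF skeleton (a simple sum of 
doubles) showing where the @Workspace state lives. It deliberately uses a 
scalar holder; whether a VarBinaryHolder can be used the same way is exactly 
the open question above, so treat this as a sketch rather than a recipe.

import org.apache.drill.exec.expr.DrillAggFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.annotations.Workspace;
import org.apache.drill.exec.expr.holders.Float8Holder;

@FunctionTemplate(name = "my_sum", scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
    nulls = FunctionTemplate.NullHandling.INTERNAL)
public class MySumFunction implements DrillAggFunc {
  @Param Float8Holder input;      // the column being aggregated
  @Workspace Float8Holder total;  // aggregate state; must live in a holder
  @Output Float8Holder output;    // the per-group result

  @Override public void setup()  { total.value = 0; }
  @Override public void add()    { total.value += input.value; }
  @Override public void output() { output.value = total.value; }
  @Override public void reset()  { total.value = 0; }
}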

Unless aggregate UDFs are special, they can return a VarChar or VarBinary 
result. The book explains how to do this for VarChar, some poking around in the 
Drill source should identify how to do so for VarBinary. (There are crufty 
details about allocating space, copying over data, etc.)

FWIW: There is a pile of information on UDF internals on my GitHub Wiki. [1] 
Aggregate UDFS are covered in [2]. Once we learn the answers to your specific 
questions, we can add the info to the Wiki.

Thanks,
- Paul

[1] https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information


[2] https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs




 

On Monday, August 12, 2019, 01:19:33 PM PDT, Ted Dunning 
 wrote:  
 
 I am trying to figure out how to build an approximate percentile estimator.

I have a fancy data structure that will do this. It can live in bounded
memory with no allocation. I can add numbers to the digest easily enough.
And the required results can be extracted from the structure.

What I would need to know:

- how to use a fixed array of bytes as the state of an aggregating UDF

- how to pass in an argument to an aggregator OR (better) how to use the
binary result of an aggregator in another function.

On Mon, Aug 12, 2019 at 11:25 AM Charles Givre  wrote:

> Ted,
> Can we ask what it is you are trying to build a UDF for?
> --C
>
> > On Aug 12, 2019, at 2:23 PM, Paul Rogers 
> wrote:
> >
> > Hi Ted,
> >
> > Thanks for the link; I suspected there was some trick for stddev. The
> point still stands that, if the algorithm requires multiple passes over the
> data (ML, say), can't be done in Drill.
> >
> > Each UDF must return exactly one value. It can return a map if you want
> multiple values (though someone would have to check that projection works
> to convert these to scalar top-level values). AFAIK, a UDF can produce a
> binary buffer as output (type VarBinary). But, an aggregate UDF cannot
> accumulate a VarChar or VarBinary because Drill cannot insert values into
> an existing variable-length vector.
> >
> > UDFs need your knack for finding a workaround to get your job done; they
> have pretty strong limitations on the surface.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
> >
> > Is it possible for a UDF to produce multiple scalar results? Can it
> produce
> > a binary result?
> >
> > Also, as a nit, standard deviation doesn't require buffering all the
> data.
> > It just requires that you have three accumulators, one for count, one for
> > mean and one for mean squared deviation.  There is a slightly tricky
> > algorithm called Welford's algorithm
> > <
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
> >
> > which
> > allows good numerical stability while computing this on-line.
> >
> > On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers 
> > wrote:
> >
> >> Hi Ted,
> >>
> >> Last I checked (when we wrote the book chapter on the subject),
> aggregate
> >> state are limited to scalars and Drill-defined types. There is no
> support
> >> to spill aggregate state, so that state will be lost if spilling is
> >> required to handle large aggregate batches. The current solution works
> for
> >> simple cases such as totals and averages.
> >>
> >> Aggregate UDFs share no state, so it is not possible for one function to
> >> use state accumulated by another. If, for example, you want sum, average
> >> and standard deviation, you'll have to accumulate the total three times,
> >> average twice, and so on. Note that the std dev function will require
> >> buffering all data in one's own array (without any spilling or other
> >> support), to allow computing the (X-bar - X)^2 part of the calculation.
> >>
> >> A UDF can emit a byte array (have to check it this is true of aggregate
> >> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a

Re: complex data structure aggregators?

2019-08-12 Thread Paul Rogers
Hi Ted,

Thanks for the link; I suspected there was some trick for stddev. The point 
still stands that, if an algorithm requires multiple passes over the data (ML, 
say), it can't be done in Drill.
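
As an aside, the three-accumulator trick Ted mentions below (Welford's 
algorithm) looks roughly like this in plain Java; this is textbook code, not a 
Drill UDF, shown just to make the std-dev discussion concrete.

// One pass, no buffering: count, running mean, and sum of squared deviations.
static double stdDev(double[] values) {
  double count = 0, mean = 0, m2 = 0;
  for (double x : values) {
    count++;
    double delta = x - mean;
    mean += delta / count;
    m2 += delta * (x - mean);   // uses the updated mean
  }
  return Math.sqrt(m2 / count); // population form; use m2 / (count - 1) for a sample
}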

Each UDF must return exactly one value. It can return a map if you want 
multiple values (though someone would have to check that projection works to 
convert these to scalar top-level values). AFAIK, a UDF can produce a binary 
buffer as output (type VarBinary). But, an aggregate UDF cannot accumulate a 
VarChar or VarBinary because Drill cannot insert values into an existing 
variable-length vector.

UDFs need your knack for finding a workaround to get your job done; they have 
pretty strong limitations on the surface.

Thanks,
- Paul

 

On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning 
 wrote:  
 
 Is it possible for a UDF to produce multiple scalar results? Can it produce
a binary result?

Also, as a nit, standard deviation doesn't require buffering all the data.
It just requires that you have three accumulators, one for count, one for
mean and one for mean squared deviation.  There is a slightly tricky
algorithm called Welford's algorithm
<https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm>
which
allows good numerical stability while computing this on-line.

On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers 
wrote:

> Hi Ted,
>
> Last I checked (when we wrote the book chapter on the subject), aggregate
> state are limited to scalars and Drill-defined types. There is no support
> to spill aggregate state, so that state will be lost if spilling is
> required to handle large aggregate batches. The current solution works for
> simple cases such as totals and averages.
>
> Aggregate UDFs share no state, so it is not possible for one function to
> use state accumulated by another. If, for example, you want sum, average
> and standard deviation, you'll have to accumulate the total three times,
> average twice, and so on. Note that the std dev function will require
> buffering all data in one's own array (without any spilling or other
> support), to allow computing the (X-bar - X)^2 part of the calculation.
>
> A UDF can emit a byte array (have to check it this is true of aggregate
> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
> VarChar.
>
> All this is from memory and so is only approximately accurate. YMMV.
>
> Thanks,
> - Paul
>
>
>
>    On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  What is the current state of building aggregators that have complex state
> via UDFs?
>
> Is it possible to define multi-level aggregators in a UDF?
>
> Can the output of a UDF be a byte array?
>
>
> (these are three different questions)
>
  

Re: complex data structure aggregators?

2019-08-12 Thread Paul Rogers
Hi Ted,

Last I checked (when we wrote the book chapter on the subject), aggregate state 
is limited to scalars and Drill-defined types. There is no support to spill 
aggregate state, so that state will be lost if spilling is required to handle 
large aggregate batches. The current solution works for simple cases such as 
totals and averages.

Aggregate UDFs share no state, so it is not possible for one function to use 
state accumulated by another. If, for example, you want sum, average and 
standard deviation, you'll have to accumulate the total three times, average 
twice, and so on. Note that the std dev function will require buffering all 
data in one's own array (without any spilling or other support), to allow 
computing the (X-bar - X)^2 part of the calculation.

A UDF can emit a byte array (I'd have to check whether this is true of aggregate UDFs). 
A VarChar is simply a special kind of array, and UDFs can emit a VarChar.

All this is from memory and so is only approximately accurate. YMMV.

Thanks,
- Paul

 

On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning 
 wrote:  
 
 What is the current state of building aggregators that have complex state
via UDFs?

Is it possible to define multi-level aggregators in a UDF?

Can the output of a UDF be a byte array?


(these are three different questions)
  

Re: [DISCUSS]: Drill after MapR

2019-08-08 Thread Paul Rogers
Hi Charles,

Thanks for raising this issue.

The short answer is probably "the cloud." The longer answer should include:

* Who will manage the AWS/GCE instances?

* Who will pay for the instances?

* The MapR infrastructure uses the MapR file system and Hadoop distro. Probably 
should use Hadoop (or EMR) for Apache. Who will do the port?

* The test framework itself is public in the MapR Github repo [1]. Perhaps it 
should migrate to the Apache Drill repo?

* The tests are run within MapR using a very nice Jenkins job that Abhishek 
created. How can we make this public and port it to the cloud?

We might look at what Impala did. Cloudera hosts a public build system: anyone 
can submit a build job (though the jobs are throttled, etc.)

Thanks,
- Paul

[1] https://github.com/mapr/drill-test-framework


 

On Thursday, August 8, 2019, 09:49:42 AM PDT, Charles Givre 
 wrote:  
 
 Hello all, 
Now that MapR is officially part of HPE, I wanted to ask everyone their 
thoughts about how we can continue to release Drill without the MapR owned 
infrastructure.  Assuming that HPE is not likely to continue supporting Drill 
(or maybe they will, but I'm guessing not) what infrastructure does the 
community need to take over?
Can we start compiling a list and formulating a plan for this?
Thanks,
-- C  

Re: [QUESTION]: Caching UDFs

2019-08-08 Thread Paul Rogers
Hi Charles,

In general, we cannot know if a function is deterministic. Your function might 
be rand(seed, max). It might do a JDBC lookup or a REST call. Drill can't know 
(unless we add some way to know that a function is deterministic: maybe a 
@Deterministic annotation.)

That said, you can build in caching inside the function. Should your cache be 
separate from mine for security reasons? Should the cache be shared across 
execution threads on a given node? Local to a single minor fragment?

Aggregates are an example of functions that have internal state; perhaps the 
idea can be extended to a function-specific results cache.
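
To make that concrete, the sketch below shows the sort of memoization a UDF 
body could do on its own. expensiveLookup() is a hypothetical stand-in for the 
non-trivial work, and nothing here reflects an existing Drill facility; where 
such a cache should live (per query, per minor fragment, per node) is the open 
question above.

// Illustrative only: cache results of a presumed-deterministic function.
private final java.util.Map<String, String> memo = new java.util.HashMap<>();

String foo(String x, String y) {
  String key = x + "\u0000" + y;           // cheap composite key
  String hit = memo.get(key);
  if (hit != null) {
    return hit;                            // cache hit: skip the expensive work
  }
  String result = expensiveLookup(x, y);   // hypothetical non-trivial call
  memo.put(key, result);
  return result;
}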

Thanks,
- Paul

 

On Thursday, August 8, 2019, 09:46:12 AM PDT, Charles Givre 
 wrote:  
 
 Hello Drill Devs, 
I have a question about UDFs.  Let's say you have a 
non-trivial UDF called foo(x,y) which returns some value.  Assuming that if the 
arguments are the same, the function foo() will return the same result, does 
Drill have any optimizations to prevent running the non-trivial function?  
I was thinking that it might make sense to cache the arguments and results in 
memory and before the function is executed, check the cache to see if they're 
there.  If they are, return the cached results, and if not, execute the 
function.  I was thinking that for some functions, like date/time functions, we 
might want to include something in the code to ensure that the results do not 
get cached. 
Thoughts?

Charles S. Givre, CISSP
Data Scientist, Co-Founder GTK Cyber LLC
charles.givre@gtkcyber.com
Mobile: (443) 762-3286


  

[jira] [Created] (DRILL-7333) Batch of

2019-07-27 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7333:
--

 Summary: Batch of 
 Key: DRILL-7333
 URL: https://issues.apache.org/jira/browse/DRILL-7333
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [ANNOUNCE] New Committer: Igor Guzenko

2019-07-23 Thread Paul Rogers
Congrats Igor!
Thanks,
- Paul

 

On Monday, July 22, 2019, 07:02:44 AM PDT, Arina Ielchiieva 
 wrote:  
 
 The Project Management Committee (PMC) for Apache Drill has invited Igor
Guzenko to become a committer, and we are pleased to announce that he has
accepted.

Igor has been contributing into Drill for 9 months and made a number of
significant contributions, including cross join syntax support, Hive views
support, as well as improving performance for Hive show schema and unit
tests. Currently he is working on supporting Hive complex types
[DRILL-3290]. He already added support for list type and working on struct
and canonical map.

Welcome Igor, and thank you for your contributions!

- Arina
(on behalf of the Apache Drill PMC)
  

Re: drill.exec.grace_period_ms' Errors

2019-07-20 Thread Paul Rogers
Hi Charles,

I just ran some unit tests, using master, and did not see the 
drill.exec.grace_period_ms error that you saw.

drill.exec.grace_period_ms is defined in ExecConstants.java, is used in 
Drillbit startup in Drillbit.java, and has a value defined in 
src/main/resources/drill-module.conf.

In other words, it seems everything is set up the way it should be. I wonder, 
do you have an old version of drill-module.conf? If you check your working 
branch do you have any unexpected changes? ("git status"). Also, have you 
grabbed the latest master branch recently? ("git checkout master; git pull 
apache master; git checkout ; git rebase master", where "apache" 
is whatever you named your Drill GitHub remote.)

Thanks,
- Paul

 

On Tuesday, July 16, 2019, 6:45:01 PM PDT, Charles Givre  
wrote:  
 
> Also, I am working on updating a few format plugins and kept getting the 
> following error when I try to run unit tests:
> 
> at org.apache.drill.test.ClusterFixture.(ClusterFixture.java:152)
>    at 
>org.apache.drill.test.ClusterFixtureBuilder.build(ClusterFixtureBuilder.java:283)
>    at org.apache.drill.test.ClusterTest.startCluster(ClusterTest.java:83)
>    at 
>org.apache.drill.exec.store.excel.TestExcelFormat.setup(TestExcelFormat.java:49)
> Caused by: com.typesafe.config.ConfigException$Missing: No configuration 
> setting found for key 'drill.exec.grace_period_ms'
>    at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:115)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:136)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:142)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:142)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:150)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:155)
>    at 
>com.typesafe.config.impl.SimpleConfig.getConfigNumber(SimpleConfig.java:170)
>    at com.typesafe.config.impl.SimpleConfig.getInt(SimpleConfig.java:181)
>    at org.apache.drill.common.config.NestedConfig.getInt(NestedConfig.java:96)
>    at org.apache.drill.common.config.DrillConfig.getInt(DrillConfig.java:44)
>    at org.apache.drill.common.config.NestedConfig.getInt(NestedConfig.java:96)
>    at org.apache.drill.common.config.DrillConfig.getInt(DrillConfig.java:44)
>    at org.apache.drill.exec.server.Drillbit.(Drillbit.java:160)
>    at org.apache.drill.exec.server.Drillbit.(Drillbit.java:138)
>    at 
>org.apache.drill.test.ClusterFixture.startDrillbits(ClusterFixture.java:228)
>    at org.apache.drill.test.ClusterFixture.(ClusterFixture.java:146)
>    ... 3 more
> 
> 
> Process finished with exit code 255
> 
> I understand that I have to set the variable drill.exec.grace_period_ms, but 
> I'm not sure how/where to do this.  Here is the beginning of my unit test 
> code:
> 
> @ClassRule
> public static final BaseDirTestWatcher dirTestWatcher = new 
> BaseDirTestWatcher();
> 
> @BeforeClass
> public static void setup() throws Exception {
>  
>ClusterTest.startCluster(ClusterFixture.builder(dirTestWatcher).maxParallelization(1));
>  definePlugin();
> }
> 
> private static void definePlugin() throws ExecutionSetupException {
>  ExcelFormatConfig sampleConfig = new ExcelFormatConfig();
> 
>  // Define a temporary plugin for the "cp" storage plugin.
>  Drillbit drillbit = cluster.drillbit();
>  final StoragePluginRegistry pluginRegistry = 
>drillbit.getContext().getStorage();
>  final FileSystemPlugin plugin = (FileSystemPlugin) 
>pluginRegistry.getPlugin("cp");
>  final FileSystemConfig pluginConfig = (FileSystemConfig) plugin.getConfig();
>  pluginConfig.getFormats().put("sample", sampleConfig);
>  pluginRegistry.createOrUpdate("cp", pluginConfig, false);
> }
> 
> @Test
> public void testStarQuery() throws RpcException {
>  String sql = "SELECT * FROM cp.`excel/test_data.xlsx` LIMIT 5";
> 
>  RowSet results = client.queryBuilder().sql(sql).rowSet();
>  TupleMetadata expectedSchema = new SchemaBuilder()
>          .add("id", TypeProtos.MinorType.FLOAT8, TypeProtos.DataMode.OPTIONAL)
>          .add("first__name", TypeProtos.MinorType.VARCHAR, 
>TypeProtos.DataMode.OPTIONAL)
>          .add("last__name", TypeProtos.MinorType.VARCHAR, 
>TypeProtos.DataMode.OPTIONAL)
>          .add("email", TypeProtos.MinorType.VARCHAR, 
>TypeProtos.DataMode.OPTIONAL)
>          .add("gender", TypeProtos.MinorType.VARCHAR, 
>TypeProtos.DataMode.OPTIONAL)
>          .add("birthdate", TypeProtos.MinorType.VARCHAR, 
>TypeProtos.DataMode.OPTIONAL)
>          .add("balance", TypeProtos.MinorType.FLOAT8, 
>TypeProtos.DataMode.OPTIONAL)
>          .add("order__count", TypeProtos.MinorType.FLOAT8, 
>TypeProtos.DataMode.OPTIONAL)
>          .add("average__order", TypeProtos.MinorType.FLOAT8, 
>TypeProtos.DataMode.OPTIONAL)
>          .buildSchema();
> 
>  RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
>          .addRow(1.0, "Cornelia", "Matej", "cmat...@mtv.com", "Female", 

Re: EVF Log Regex Errors

2019-07-20 Thread Paul Rogers
Hi Charles,

Turns out that there are two problems here. First, I mucked up the Jackson 
serialization of the schema objects. Second, you need to use the Joda format 
(with "HH") as we discussed. Once both those changes are made, things seem to 
work (at least in unit tests.)

There is a PR for the fix. Please review.

Thanks,
- Paul

 

On Tuesday, July 16, 2019, 6:45:01 PM PDT, Charles Givre  
wrote:  
 
 Hi Paul, 
Thanks for the response.  Unfortunately, I tried simply setting a fieldName and 
got an error. 

 "ssdlog": {
      "type": "logRegex",
      "regex": 
"(\\w{3}\\s\\d{1,2}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\[(\\d+)\\]:\\s(.*?(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*?)",
      "extension": "ssdlog",
      "maxErrors": 10,
      "schema": [{"fieldName": "test"}]
    },
--C


> On Jul 16, 2019, at 7:08 PM, Paul Rogers  wrote:
> 
> Hi Charles,
> 
> Thanks much for the feedback. I'll take a look.
> 
> A quick look at your config suggests that the timestamp might be the issue. 
> As I recall, there were no such tests in the unit test class. So, perhaps 
> something slipped through. (We should add a test for this case.)
> 
> 
> In EVF, we use the Joda (not Java 8) date/time classes. [1] (We do this for 
> obscure reasons related to how Drill handles intervals, and the fact that the 
> Java 8 date/time classes are not a full replacement for Joda.)
> 
> With Joda, your format should be: "MMM dd  HH:mm:ss" (Note the upper case 
> "H"). Try this to see if it gets you unstuck.
> 
> What we should really do is support SQL format strings. These are not 
> standard, but the Postgres format seem common [2]. Someone added this feature 
> to Drill a while back, so we must have a Postgres-to-Joda format converter in 
> the code somewhere we could use.
> 
> Thanks,
> - Paul
> 
> 
> [1] 
> https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html
> 
> [2] https://www.postgresql.org/docs/9.1/functions-formatting.html
> 
> 
> 
> 
>    On Tuesday, July 16, 2019, 02:23:50 PM PDT, Charles Givre 
> wrote:  
> 
> 
> Hello All, 
> First, a big thank you Paul for updating the log regex reader to the new EVF 
> framework.  I am having a little trouble getting it to work however...
> Here is my config:
> 
> ,
>    "ssdlog": {
>      "type": "logRegex",
>      "regex": 
>"(\\w{3}\\s\\d{1,2}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\[(\\d+)\\]:\\s(.*?(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*?)",
>      "extension": "ssdlog",
>      "maxErrors": 10,
>      "schema": [
>          {"fieldName":"eventDate"}
>          ]
>    },
> 
> This works if I leave the schema null, however if I attempt to populate it, I 
> get JSON errors.  This was what I originally had:
> 
> "schema" : [ {
>        "fieldName" : "eventDate",
>        "fieldType" : "TIMESTAMP",
>        "format" : "MMM dd  hh:mm:ss"
>      }, {
>        "fieldName" : "process_name"
>      }, {
>        "fieldName" : "pid",
>        "fieldType" : "INT"
>      }, {
>        "fieldName" : "message"
>      }, {
>        "fieldName" : "src_ip"
>      } ]
> 
> which worked.  
> 
> 
> Also, I am working on updating a few format plugins and kept getting the 
> following error when I try to run unit tests:
> 
> at org.apache.drill.test.ClusterFixture.(ClusterFixture.java:152)
>    at 
>org.apache.drill.test.ClusterFixtureBuilder.build(ClusterFixtureBuilder.java:283)
>    at org.apache.drill.test.ClusterTest.startCluster(ClusterTest.java:83)
>    at 
>org.apache.drill.exec.store.excel.TestExcelFormat.setup(TestExcelFormat.java:49)
> Caused by: com.typesafe.config.ConfigException$Missing: No configuration 
> setting found for key 'drill.exec.grace_period_ms'
>    at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:115)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:136)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:142)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:142)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:150)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:155)
>    at 
>com.typesafe.config.impl.SimpleConfig.getConfigNumber(SimpleConfig.java:170)
>    at com.typesafe.config.impl.SimpleConfig.getInt

[jira] [Resolved] (DRILL-7327) Log Regex Plugin Won't Recognize Schema

2019-07-20 Thread Paul Rogers (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7327.

Resolution: Not A Bug

> Log Regex Plugin Won't Recognize Schema
> ---
>
> Key: DRILL-7327
> URL: https://issues.apache.org/jira/browse/DRILL-7327
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>    Assignee: Paul Rogers
>Priority: Major
> Attachments: firewall.ssdlog
>
>
> When I attempt to define a schema for the new `logRegex` plugin, 
> Drill does not recognize the plugin if the configuration includes a schema.
> {code:json}
> {,
> "ssdlog": {
>   "type": "logRegex",
>   "regex": 
> "(\\w{3}\\s\\d{1,2}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\[(\\d+)\\]:\\s(.*?(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*?)",
>   "extension": "ssdlog",
>   "maxErrors": 10,
>   "schema": []
> }
> {code}
> This configuration works, however, this does not:
> {code:json}
> {,
> "ssdlog": {
>   "type": "logRegex",
>   "regex": 
> "(\\w{3}\\s\\d{1,2}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\[(\\d+)\\]:\\s(.*?(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*?)",
>   "extension": "ssdlog",
>   "maxErrors": 10,
>   "schema": [
> {"fieldName":"eventDate"}
> ]
> }
> {code}
> [~paul-rogers]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: EVF Log Regex Errors

2019-07-16 Thread Paul Rogers
Hi Charles,

Please file a JIRA ticket and include a short file and the config. I'll take a 
look in a few days and figure out what's broken.

Thanks,
- Paul

 

On Tuesday, July 16, 2019, 6:45:01 PM PDT, Charles Givre  
wrote:  
 
 Hi Paul, 
Thanks for the response.  Unfortunately, I tried simply setting a fieldName and 
got an error. 

 "ssdlog": {
      "type": "logRegex",
      "regex": 
"(\\w{3}\\s\\d{1,2}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\[(\\d+)\\]:\\s(.*?(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*?)",
      "extension": "ssdlog",
      "maxErrors": 10,
      "schema": [{"fieldName": "test"}]
    },
--C


> On Jul 16, 2019, at 7:08 PM, Paul Rogers  wrote:
> 
> Hi Charles,
> 
> Thanks much for the feedback. I'll take a look.
> 
> A quick look at your config suggests that the timestamp might be the issue. 
> As I recall, there were no such tests in the unit test class. So, perhaps 
> something slipped through. (We should add a test for this case.)
> 
> 
> In EVF, we use the Joda (not Java 8) date/time classes. [1] (We do this for 
> obscure reasons related to how Drill handles intervals, and the fact that the 
> Java 8 date/time classes are not a full replacement for Joda.)
> 
> With Joda, your format should be: "MMM dd  HH:mm:ss" (Note the upper case 
> "H"). Try this to see if it gets you unstuck.
> 
> What we should really do is support SQL format strings. These are not 
> standard, but the Postgres format seems common [2]. Someone added this feature 
> to Drill a while back, so we must have a Postgres-to-Joda format converter in 
> the code somewhere we could use.
> 
> Thanks,
> - Paul
> 
> 
> [1] 
> https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html
> 
> [2] https://www.postgresql.org/docs/9.1/functions-formatting.html
> 
> 
> 
> 
>    On Tuesday, July 16, 2019, 02:23:50 PM PDT, Charles Givre 
> wrote:  
> 
> 
> Hello All, 
> First, a big thank you Paul for updating the log regex reader to the new EVF 
> framework.  I am having a little trouble getting it to work however...
> Here is my config:
> 
> ,
>    "ssdlog": {
>      "type": "logRegex",
>      "regex": 
>"(\\w{3}\\s\\d{1,2}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\[(\\d+)\\]:\\s(.*?(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*?)",
>      "extension": "ssdlog",
>      "maxErrors": 10,
>      "schema": [
>          {"fieldName":"eventDate"}
>          ]
>    },
> 
> This works if I leave the schema null, however if I attempt to populate it, I 
> get JSON errors.  This was what I originally had:
> 
> "schema" : [ {
>        "fieldName" : "eventDate",
>        "fieldType" : "TIMESTAMP",
>        "format" : "MMM dd  hh:mm:ss"
>      }, {
>        "fieldName" : "process_name"
>      }, {
>        "fieldName" : "pid",
>        "fieldType" : "INT"
>      }, {
>        "fieldName" : "message"
>      }, {
>        "fieldName" : "src_ip"
>      } ]
> 
> which worked.  
> 
> 
> Also, I am working on updating a few format plugins and kept getting the 
> following error when I try to run unit tests:
> 
> at org.apache.drill.test.ClusterFixture.<init>(ClusterFixture.java:152)
>    at 
>org.apache.drill.test.ClusterFixtureBuilder.build(ClusterFixtureBuilder.java:283)
>    at org.apache.drill.test.ClusterTest.startCluster(ClusterTest.java:83)
>    at 
>org.apache.drill.exec.store.excel.TestExcelFormat.setup(TestExcelFormat.java:49)
> Caused by: com.typesafe.config.ConfigException$Missing: No configuration 
> setting found for key 'drill.exec.grace_period_ms'
>    at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:115)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:136)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:142)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:142)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:150)
>    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:155)
>    at 
>com.typesafe.config.impl.SimpleConfig.getConfigNumber(SimpleConfig.java:170)
>    at com.typesafe.config.impl.SimpleConfig.getInt(SimpleConfig.java:181)
>    at org.apache.drill.common.config.NestedConfig.getInt(NestedConfig.java:96)
>    at org.apache.drill.common.config.DrillConfig.getInt(DrillConfig.java

Re: EVF Log Regex Errors

2019-07-16 Thread Paul Rogers
Hi Charles,

Thanks much for the feedback. I'll take a look.

A quick look at your config suggests that the timestamp might be the issue. As 
I recall, there were no such tests in the unit test class. So, perhaps 
something slipped through. (We should add a test for this case.)


In EVF, we use the Joda (not Java 8) date/time classes. [1] (We do this for 
obscure reasons related to how Drill handles intervals, and the fact that the 
Java 8 date/time classes are not a full replacement for Joda.)

With Joda, your format should be: "MMM dd  HH:mm:ss" (Note the upper case 
"H"). Try this to see if it gets you unstuck.

What we should really do is support SQL format strings. These are not standard, 
but the Postgres format seems common [2]. Someone added this feature to Drill a 
while back, so we must have a Postgres-to-Joda format converter in the code 
somewhere we could use.

Thanks,
- Paul


[1] 
https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html

[2] https://www.postgresql.org/docs/9.1/functions-formatting.html


 

On Tuesday, July 16, 2019, 02:23:50 PM PDT, Charles Givre 
 wrote:  
 
 
Hello All, 
First, a big thank you Paul for updating the log regex reader to the new EVF 
framework.  I am having a little trouble getting it to work however...
Here is my config:

,
    "ssdlog": {
      "type": "logRegex",
      "regex": 
"(\\w{3}\\s\\d{1,2}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\[(\\d+)\\]:\\s(.*?(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*?)",
      "extension": "ssdlog",
      "maxErrors": 10,
      "schema": [
          {"fieldName":"eventDate"}
          ]
    },

This works if I leave the schema null, however if I attempt to populate it, I 
get JSON errors.  This was what I originally had:

"schema" : [ {
        "fieldName" : "eventDate",
        "fieldType" : "TIMESTAMP",
        "format" : "MMM dd  hh:mm:ss"
      }, {
        "fieldName" : "process_name"
      }, {
        "fieldName" : "pid",
        "fieldType" : "INT"
      }, {
        "fieldName" : "message"
      }, {
        "fieldName" : "src_ip"
      } ]

which worked.  


Also, I am working on updating a few format plugins and kept getting the 
following error when I try to run unit tests:

at org.apache.drill.test.ClusterFixture.<init>(ClusterFixture.java:152)
    at 
org.apache.drill.test.ClusterFixtureBuilder.build(ClusterFixtureBuilder.java:283)
    at org.apache.drill.test.ClusterTest.startCluster(ClusterTest.java:83)
    at 
org.apache.drill.exec.store.excel.TestExcelFormat.setup(TestExcelFormat.java:49)
Caused by: com.typesafe.config.ConfigException$Missing: No configuration 
setting found for key 'drill.exec.grace_period_ms'
    at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:115)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:136)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:142)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:142)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:150)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:155)
    at 
com.typesafe.config.impl.SimpleConfig.getConfigNumber(SimpleConfig.java:170)
    at com.typesafe.config.impl.SimpleConfig.getInt(SimpleConfig.java:181)
    at org.apache.drill.common.config.NestedConfig.getInt(NestedConfig.java:96)
    at org.apache.drill.common.config.DrillConfig.getInt(DrillConfig.java:44)
    at org.apache.drill.common.config.NestedConfig.getInt(NestedConfig.java:96)
    at org.apache.drill.common.config.DrillConfig.getInt(DrillConfig.java:44)
    at org.apache.drill.exec.server.Drillbit.<init>(Drillbit.java:160)
    at org.apache.drill.exec.server.Drillbit.<init>(Drillbit.java:138)
    at 
org.apache.drill.test.ClusterFixture.startDrillbits(ClusterFixture.java:228)
    at org.apache.drill.test.ClusterFixture.<init>(ClusterFixture.java:146)
    ... 3 more


Process finished with exit code 255

I understand that I have to set the variable drill.exec.grace_period_ms, but 
I'm not sure how/where to do this.  Here is the beginning of my unit test code:

@ClassRule
public static final BaseDirTestWatcher dirTestWatcher = new 
BaseDirTestWatcher();

@BeforeClass
public static void setup() throws Exception {
  
ClusterTest.startCluster(ClusterFixture.builder(dirTestWatcher).maxParallelization(1));
  definePlugin();
}

private static void definePlugin() throws ExecutionSetupException {
  ExcelFormatConfig sampleConfig = new ExcelFormatConfig();

  // Define a temporary plugin for the "cp" storage plugin.
  Drillbit drillbit = cluster.drillbit();
  final StoragePluginRegistry pluginRegistry = 
drillbit.getContext().getStorage();
  final FileSystemPlugin plugin = (FileSystemPlugin) 
pluginRegistry.getPlugin("cp");
  final FileSystemConfig pluginConfig = (FileSystemConfig) plugin.getConfig();
  pluginConfig.getFormats().put("sample", sampleConfig);
  pluginRegistry.createOrUpdate("cp", pluginConfig, false);
}

@Test
public void 

[jira] [Created] (DRILL-7325) Scan, Project, Hash Join do not set container record count

2019-07-14 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7325:
--

 Summary: Scan, Project, Hash Join do not set container record count
 Key: DRILL-7325
 URL: https://issues.apache.org/jira/browse/DRILL-7325
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.17.0


See DRILL-7324. The following are problems found because some operators fail to 
set the record count for their containers.

h4. Scan

TestComplexTypeReader, on cluster setup, using the PojoRecordReader:

ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
from ScanBatch
ScanBatch: Container record count not set

Reason: ScanBatch never sets the record count of its container (this is a 
generic issue, not specific to the PojoRecordReader).
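
As a sketch only (not a tested patch), the missing housekeeping is the usual 
container call once the readers have filled the batch:

{code:java}
// Illustrative: the outgoing container's row count must be set along with
// the per-vector value counts.
container.setRecordCount(recordCount);
{code}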

h4. Filter

{{TestComplexTypeReader.testNonExistentFieldConverting()}}:

{noformat}
ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
from FilterRecordBatch
FilterRecordBatch: Container record count not set
{noformat}

h4. Hash Join

{{TestComplexTypeReader.test_array()}}:

{noformat}
ERROR o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors 
from HashJoinBatch
HashJoinBatch: Container record count not set
{noformat}

Occurs on the first batch in which the hash join returns {{OK_NEW_SCHEMA}} with 
no records.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (DRILL-7324) Many vector-validity errors from unit tests

2019-07-14 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7324:
--

 Summary: Many vector-validity errors from unit tests
 Key: DRILL-7324
 URL: https://issues.apache.org/jira/browse/DRILL-7324
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers


Drill's value vectors contain many counts that must be maintained in sync. 
Drill provides a utility, {{BatchValidator}} to check (a subset of) these 
values for consistency.

The {{IteratorValidatorBatchIterator}} class is used in tests to validate the 
state of each operator (AKA "record batch") as Drill runs the Volcano iterator. 
This class can also validate vectors by setting the {{VALIDATE_VECTORS}} 
constant to `true`.

This was done, then unit tests were run. Many tests failed. Examples:

{noformat}
[INFO] Running org.apache.drill.TestUnionDistinct
18:44:26.742 [22d42585-74c2-d418-6f59-9b1870d04770:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
LimitRecordBatch
key - NullableBitVector: Row count = 0, but value count = 2
18:44:26.745 [22d42585-74c2-d418-6f59-9b1870d04770:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
LimitRecordBatch
key - NullableBitVector: Row count = 0, but value count = 2

[INFO] Running org.apache.drill.TestUnionDistinct
8:44:48.302 [22d4256e-c90b-847c-5104-02d6cdf5223e:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
LimitRecordBatch
key - NullableBitVector: Row count = 0, but value count = 2
18:44:48.703 [22d4256e-ccf3-2af6-f56a-140e9c3e55bb:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
FilterRecordBatch
n_nationkey - IntVector: Row count = 2, but value count = 25
n_regionkey - IntVector: Row count = 2, but value count = 25
18:44:48.731 [22d4256e-ccf3-2af6-f56a-140e9c3e55bb:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
FilterRecordBatch
n_nationkey - IntVector: Row count = 4, but value count = 25
n_regionkey - IntVector: Row count = 4, but value count = 25
18:44:49.039 [22d4256f-6b39-d2ab-d145-4f2b0db315a3:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
FilterRecordBatch
n_nationkey - IntVector: Row count = 2, but value count = 25
18:44:49.363 [22d4256e-3d91-850f-9ab4-5939219ac0d0:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
FilterRecordBatch
c_custkey - IntVector: Row count = 4, but value count = 1500
18:44:49.597 [22d4256d-c113-ae5c-6f31-4dd1ec091365:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
FilterRecordBatch
n_nationkey - IntVector: Row count = 5, but value count = 25
n_regionkey - IntVector: Row count = 5, but value count = 25
18:44:49.610 [22d4256d-c113-ae5c-6f31-4dd1ec091365:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
FilterRecordBatch
r_regionkey - IntVector: Row count = 1, but value count = 5
18:44:53.029 [22d4256a-8b70-5f3b-f79b-806e194c5ed2:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
LimitRecordBatch
n_nationkey - IntVector: Row count = 0, but value count = 25
n_name - VarCharVector: Row count = 0, but value count = 25
n_regionkey - IntVector: Row count = 0, but value count = 25
18:44:53.033 [22d4256a-8b70-5f3b-f79b-806e194c5ed2:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
LimitRecordBatch
n_regionkey - IntVector: Row count = 5, but value count = 25
18:44:53.331 [22d4256a-526c-7815-c216-8e45752a4a6c:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
LimitRecordBatch
n_nationkey - IntVector: Row count = 5, but value count = 25
n_name - VarCharVector: Row count = 5, but value count = 25
n_regionkey - IntVector: Row count = 5, but value count = 25
18:44:53.337 [22d4256a-526c-7815-c216-8e45752a4a6c:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
LimitRecordBatch
n_regionkey - IntVector: Row count = 0, but value count = 25
18:44:53.646 [22d42569-c293-ced0-c3d0-e9153cc4a70a:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
LimitRecordBatch
key - NullableBitVector: Row count = 0, but value count = 2

Running org.apache.drill.TestTpchSingleMode
18:45:01.299 [22d42563-0ed6-1501-86a1-4cb375a9cad4:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
FilterRecordBatch

Running org.apache.drill.TestMergeFilterPlan
18:45:03.738 [22d4255f-b322-fd56-2f93-34b7f5c709c1:frag:0:0] ERROR 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
FilterRecordBatch
o_orderkey - IntVector: Row count = 561, but value cou

Re: Drill storage plugin for IPFS, any suggestion is welcome :)

2019-07-08 Thread Paul Rogers
Hello Wang Liang,


Very creative use of Drill! We usually think of Drill as a tool for "big data" 
distributed file systems such as HDFS, MFS and S3. IPFS seems to be for storing 
web content. I like how you've shown that IPFS is, in fact, a distributed file 
system, and made Drill work in this context.

Perhaps data scientists might benefit from Minerva: instead of everyone 
downloading large data sets and doing queries locally, a data scientist could 
instead query the data where it lives on the web. Such a feature would be 
especially useful if the data changes over time.

As Charles mentioned, it would be great if you could offer Minerva changes to 
the Drill project. Most extensions live within the Drill project itself, 
typically in the "contrib" module.

The other choice would be for Minerva to be a separate project or repo that can 
be integrated with Drill. We have often talked about creating a true plugin 
architecture to support such a model, but gaps remain. Minerva might be a good 
reason to fix the gaps. 
Thanks,
- Paul

 

On Saturday, July 6, 2019, 02:31:27 AM PDT, 王亮  
wrote:  
 
 Hi all,

After reading that excellent book "Learning Apache Drill: Query and Analyze
Distributed Data Sources with SQL", my classmate and I also wanted to write
a Drill storage plugin. We found that most distributed and network file
systems are already supported by Drill, so we chose a relatively new and
promising distributed file system, IPFS.

So we built Minerva, a Drill storage plugin that connects IPFS's
decentralized storage and Drill's flexible query engine. Any data file
stored on IPFS can be easily accessed from Drill's query interface, just
like a file stored on a local disk. The basic idea is very simple: run a
Drill instance alongside the IPFS daemon, and you can connect to other users on
IPFS who are also using Minerva. If one of those users happens to have stored
the file you are trying to query, then Drill can send an execution plan to
that node, which executes the operations locally and returns the results
back. Of course, other users can benefit from your node as well, if you are
sharing the data they want. If there are enough people running Minerva,
data sharing and querying can be made distributed and more efficient!

The query process is as follows:
0 The user inputs an SQL statement, referencing a file on IPFS by its CID;
1 The Foreman resolves the CIDs of the "pieces" of the data file, as well
as the IPFS providers of these pieces, by querying the DHT of IPFS;
2 The Foreman distributes jobs to drillbits running on the providers.
3 Drillbits on the providers read data from the piece of file on their
local disk, perform any necessary relational operations, and return results
to the Foreman.
4 The Foreman returns the results to the user.

Thanks to the modular design of Drill, we could rather "easily" write this
storage plugin. Now this plugin supports basic query operations, both read
and write, but only works with json and csv files. It is not very stable
for now, and the performance is still poor, mainly because it takes too
long to do DHT queries on IPFS. We are trying to address these problems in
the future.

If you are interested, we have made a few slides that explain the ideas in
detail:
https://www.slideshare.net/BowenDing4/minerva-ipfs-storage-plugin-for-ipfs

Any suggestion is welcome. ^_^

Find the code on GitHub: https://github.com/bdchain/Minerva

Best,
Wang Liang
  

[jira] [Created] (DRILL-7318) Unify type-to-string implementations

2019-07-06 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7318:
--

 Summary: Unify type-to-string implementations
 Key: DRILL-7318
 URL: https://issues.apache.org/jira/browse/DRILL-7318
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.16.0
Reporter: Paul Rogers


Drill has many places that perform type-to-string conversions. Unfortunately, 
these multiple implementations are inconsistent in subtle ways. The suggestion 
here is to unify them around an Arrow-like style (as in 
[{{PrimitiveColumnMetadata.typeString()}}|https://github.com/apache/drill/blob/master/exec/vector/src/main/java/org/apache/drill/exec/record/metadata/PrimitiveColumnMetadata.java#L186])
 but using SQL type names (as in 
[{{Types.getBaseSqlTypeName()}}|https://github.com/apache/drill/blob/master/common/src/main/java/org/apache/drill/common/types/Types.java#L140]).

Some of the many places where we do type-to-string conversions are:

* {{Types.java}} - This is supposed to be the definitive location, though the 
{{getExtendedSqlTypeName()}} method does not properly handle the {{VARDECIMAL}} 
type nor the optional width for {{VARCHAR}}, etc.
* {{MaterializedField.toString()}} - Uses internal type names. Was handling 
precision incorrectly.
* {{AbstractColumnMetadata.toString()}} - Uses internal type names. Was 
handling precision incorrectly.
* {{PrimitiveColumnMetadata.typeString()}} - Uses ad-hoc solution for some SQL 
names, internal names for other types, does not correctly ignore precision for 
types for which precision is not valid. (Assumes precision will be zero in 
those cases.)
* {{WebUserConnection.sendData()}} - Uses an ad-hoc type-to-string 
implementation that uses internal names, and makes incorrect use of 
{{hasPrecision()}} to detect whether the precision is non-zero. (DRILL-7308).
* The {{typeOf}} and {{sqlTypeOf()}} SQL functions.

There are probably others. The suggestion is:

* For internal use (e.g. {{toString()}}), use internal names: the {{MinorType}} 
names.
* For user-visible use, use the SQL type names from {{Types}}.
* Define a method in {{Types}} to state whether a type takes a precision.
* For Decimal, always include the precision. For VarChar, etc., include the 
precision (where it represents the width) only when non-zero.
* Define a method in {{Types}} to state whether a type takes a scale. (Only the 
decimal types do.)
* Include scale only for the types which accept them. (For Decimal, include the 
scale even if it is zero.)
* Use the Arrow-like "ARRAY<...>" syntax, for repeated types in the new schema 
file.
* Use the SQL "NOT NULL" syntax for user-visible strings for the {{OPTIONAL}} 
cardinality. Use just the type itself for the {{REQUIRED}} cardinality.
* Use the {{DataMode}} enums in internal strings.

In general, user-visible strings should be in the form that could be used in a 
SQL {{CREATE TABLE}} statement.
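
A rough sketch of these rules (illustrative only; this is a proposed helper, 
not existing Drill code, and only {{Types.getBaseSqlTypeName()}} is an existing 
method referenced above):

{code:java}
public static String sqlTypeString(MajorType type) {
  StringBuilder buf = new StringBuilder(Types.getBaseSqlTypeName(type));
  MinorType minor = type.getMinorType();
  if (minor == MinorType.VARDECIMAL) {
    // Decimal: always show precision and scale, even a zero scale.
    buf.append("(").append(type.getPrecision())
       .append(", ").append(type.getScale()).append(")");
  } else if ((minor == MinorType.VARCHAR || minor == MinorType.VARBINARY)
      && type.getPrecision() > 0) {
    // Width only when non-zero for the variable-width types.
    buf.append("(").append(type.getPrecision()).append(")");
  }
  switch (type.getMode()) {
    case REQUIRED: return buf.append(" NOT NULL").toString();
    case REPEATED: return "ARRAY<" + buf + ">";
    default:       return buf.toString();   // OPTIONAL: just the type name
  }
}
{code}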



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7311) Partial fixes for empty batch bugs

2019-06-30 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7311:
--

 Summary: Partial fixes for empty batch bugs
 Key: DRILL-7311
 URL: https://issues.apache.org/jira/browse/DRILL-7311
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.17.0


DRILL-7305 explains that multiple operators have serious bugs when presented 
with empty batches. DRILL-7306 explains that the EVF (AKA "new scan framework") 
was originally coded to emit an empty "fast schema" batch, but that the feature 
was disabled because of the many empty-batch operator failures.

This ticket covers a set of partial fixes for empty-batch issues. This is the 
result of work done to get the converted JSON reader to work with a "fast 
schema." The JSON work, in the end, revealed that Drill has too many bugs to 
enable fast schema, and so the DRILL-7306 was implemented instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7309) Improve documentation for table functions

2019-06-25 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7309:
--

 Summary: Improve documentation for table functions
 Key: DRILL-7309
 URL: https://issues.apache.org/jira/browse/DRILL-7309
 Project: Apache Drill
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Bridget Bevens


Consider the [documentation of table 
functions|https://drill.apache.org/docs/plugin-configuration-basics/], the 
"Using the Formats Attributes as Table Function Parameters" section. The 
documentation is a bit sparse and it always takes me a long time to remember 
how to use table functions. Here are some improvements.

> ...use the table function syntax:

> select a, b from table({table function name}(parameters))

> The table function name is the table name, the type parameter is the format 
> name, ...

Change the second line to:

```
select a, b from table(<table name>(type='<plugin type>', <parameters>))
```

The use of the angle brackets is a bit more consistent with other doc pages 
such as [this one|https://drill.apache.org/docs/query-directory-functions/]. We 
already mentioned the {{type}} parameter, but did not show it in the template. 
Then say:

The type parameter must match the name of a format plugin. This is the name you 
put in the {{type}} field of your plugin JSON as explained above. Note that 
this is *not* the name of your format config. That is, it might be "text", not 
"csv" or "csvh".

The type parameter *must* be the first parameter. Other parameters can appear 
in any order. You must provide required parameters. Only string, Boolean and 
integer parameters are supported. Table functions do not support lists (so you 
cannot specify the {{extensions}}, for example.)

If parameter names are the same as SQL reserved words, quote the parameter with 
back-ticks as for table and column names. Quote string values with 
single-quotes. Do not quote integer values.

If the string value contains back-slashes, you must escape them with a second 
back-slash. If the string contains a single-quote, you must escape it with 
another single-quote. Example:

```
`regex` => '''(\\d)'''
```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Strange metadata from Text Reader

2019-06-24 Thread Paul Rogers
Hi All,

To close the loop on this, see the detailed comments in DRILL-7308 which 
Charles kindly filed. There is a code bug in the REST metadata feature itself 
which causes the schema to repeat for every returned record batch, and which 
causes it to display precision and scale for VARCHAR columns. Should be easy to 
fix.

Thanks,
- Paul

 

On Monday, June 24, 2019, 12:21:10 PM PDT, Arina Yelchiyeva 
 wrote:  
 
It would be good to help identify the commit that actually caused the bug.
Personally, I don’t recall anything that might have broken this functionality.

Kind regards,
Arina

> On Jun 24, 2019, at 10:19 PM, Charles Givre  wrote:
> 
> I don't have that version of Drill anymore but this feature worked correctly 
> until recently.  I'm using the latest build of Drill. 
> 
>> On Jun 24, 2019, at 3:18 PM, Arina Yelchiyeva  
>> wrote:
>> 
>> Just to confirm, in Drill 1.15 it works correctly?
>> 
>> Kind regards,
>> Arina
>> 
>>> On Jun 24, 2019, at 10:15 PM, Charles Givre  wrote:
>>> 
>>> Hi Arina, 
>>> It doesn't seem to make a difference unfortunately. :-(
>>> --C 
>>> 
 On Jun 24, 2019, at 3:09 PM, Arina Yelchiyeva  
 wrote:
 
 Hi Charles,
 
 Please try with v3 reader enabled: set 
 `exec.storage.enable_v3_text_reader` = true.
 Does it behave the same?
 
 Kind regards,
 Arina
 
> On Jun 24, 2019, at 9:38 PM, Charles Givre  wrote:
> 
> Hello Drill Devs,
> I'm noticing some strange behavior with the newest version of Drill.  If 
> you query a CSV file, you get the following metadata:
> 
> SELECT * FROM dfs.test.`domains.csvh` LIMIT 1
> 
> {
> "queryId": "22eee85f-c02c-5878-9735-091d18788061",
> "columns": [
> "domain"
> ],
> "rows": [
> {
>  "domain": "thedataist.com"
> }
> ],
> "metadata": [
> "VARCHAR(0, 0)",
> "VARCHAR(0, 0)"
> ],
> "queryState": "COMPLETED",
> "attemptedAutoLimit": 0
> }
> 
> 
> There are two issues here:
> 1.  VARCHAR now has precision 
> 2.  There are twice as many columns as there should be.
> 
> Additionally, if you query a regular CSV, without the columns extracted, 
> you get the following:
> 
> "rows": [
> {
>  "columns": "[\"ACCT_NUM\",\"PRODUCT\",\"MONTH\",\"REVENUE\"]"
> }
> ],
> "metadata": [
> "VARCHAR(0, 0)",
> "VARCHAR(0, 0)"
> ],
> 
> This is bizarre in that the data type is not being reported correctly, it 
> should be LIST or something like that, AND we're getting too many columns 
> in the metadata.  I'll submit a JIRA as well, but could someone please 
> take a look?
> Thanks,
> -- C
> 
> 
> 
 
>>> 
>> 
> 
  

Re: Strange metadata from Text Reader

2019-06-24 Thread Paul Rogers
Hi Charles,

Latest master? Please file a JIRA with repo steps. I’ll take a look.

- Paul

Sent from my iPhone

> On Jun 24, 2019, at 11:38 AM, Charles Givre  wrote:
> 
> Hello Drill Devs,
> I'm noticing some strange behavior with the newest version of Drill.  If you 
> query a CSV file, you get the following metadata:
> 
> SELECT * FROM dfs.test.`domains.csvh` LIMIT 1
> 
> {
>  "queryId": "22eee85f-c02c-5878-9735-091d18788061",
>  "columns": [
>"domain"
>  ],
>  "rows": [
>{
>  "domain": "thedataist.com"
>}
>  ],
>  "metadata": [
>"VARCHAR(0, 0)",
>"VARCHAR(0, 0)"
>  ],
>  "queryState": "COMPLETED",
>  "attemptedAutoLimit": 0
> }
> 
> 
> There are two issues here:
> 1.  VARCHAR now has precision 
> 2.  There are twice as many columns as there should be.
> 
> Additionally, if you query a regular CSV, without the columns extracted, you 
> get the following:
> 
> "rows": [
>{
>  "columns": "[\"ACCT_NUM\",\"PRODUCT\",\"MONTH\",\"REVENUE\"]"
>}
>  ],
>  "metadata": [
>"VARCHAR(0, 0)",
>"VARCHAR(0, 0)"
>  ],
> 
> This is bizarre in that the data type is not being reported correctly, it 
> should be LIST or something like that, AND we're getting too many columns in 
> the metadata.  I'll submit a JIRA as well, but could someone please take a 
> look?
> Thanks,
> -- C
> 
> 
> 



Re: Multi char csv delimiter

2019-06-24 Thread Paul Rogers
Hi Matthias,

Field delimiters, quotes and quote escapes can be only one character. The line 
delimiter can be multi.

Are you setting the line delimiter?

- Paul

Sent from my iPhone

> On Jun 24, 2019, at 12:10 PM, Arina Yelchiyeva  
> wrote:
> 
> Hi Matthias,
> 
> Attachments are not supported on the mailing list, please include text 
> describing your configuration.
> 
> Kind regards,
> Arina
> 
>> On Jun 24, 2019, at 2:21 PM, Rosenthaler Matthias (PS-DI/ETF1.1) 
>>  wrote:
>> 
>> Hi,
>> 
>> It seems that multi char delimiter “\n\r” is not supported for csv format 
>> drill 1.16.
>> The documentation mentions it should work, but it does not work for me. It 
>> always says “invalid JSON syntax” if I try to change the storage plugin 
>> configuration.
>> 
>> 
>> 
>> Mit freundlichen Grüßen / Best regards 
>> 
>> Matthias Rosenthaler
>> 
> 



[jira] [Created] (DRILL-7306) Disable "fast schema" batch for new scan framework

2019-06-23 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7306:
--

 Summary: Disable "fast schema" batch for new scan framework
 Key: DRILL-7306
 URL: https://issues.apache.org/jira/browse/DRILL-7306
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.17.0


 The EVF framework is set up to return a "fast schema" empty batch with only 
schema as its first batch because, when the code was written, it seemed that's 
how we wanted operators to work. However, DRILL-7305 notes that many operators 
cannot handle empty batches.

Since the empty-batch bugs show that Drill does not, in fact, provide a "fast 
schema" batch, this ticket asks to disable the feature in the new scan 
framework. The feature is disabled with a config option; it can be re-enabled 
if ever it is needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7305) Multiple operators do not handle empty batches

2019-06-23 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7305:
--

 Summary: Multiple operators do not handle empty batches
 Key: DRILL-7305
 URL: https://issues.apache.org/jira/browse/DRILL-7305
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers


While testing the new "EVF" framework, it was found that multiple operators 
incorrectly handle empty batches. The EVF framework is set up to return a "fast 
schema" empty batch with only schema as its first batch. It turns out that many 
operators fail with problems such as:

* Failure to set the value counts in the output container
* Failure to initialize the offset vector at position 0 to 0 for variable-width 
or repeated vectors

And so on.

Partial fixes are in the JSON reader PR.

For now, the easiest work-around is to disable the "fast schema" path in the 
EVF.
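
For reference, a sketch of the kind of housekeeping the failing operators omit 
(illustrative only; the exact fix differs per operator):

{code:java}
// Even an empty batch needs consistent counts before it is passed downstream.
// (Initializing offset position 0, noted above, is a separate per-vector fix.)
for (VectorWrapper<?> w : container) {
  w.getValueVector().getMutator().setValueCount(0);  // value count, even if 0
}
container.setRecordCount(0);                         // container row count
{code}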



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7304) Filter record batch misses schema changes within maps

2019-06-22 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7304:
--

 Summary: Filter record batch misses schema changes within maps
 Key: DRILL-7304
 URL: https://issues.apache.org/jira/browse/DRILL-7304
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Paul Rogers


While testing the new row-set based JSON reader, it was found that the Filter 
record batch does not properly handle a change of schema within a map. Consider 
this test:

{code:java}
  @Test
  public void adHoc() {
    try {
  client.alterSession(ExecConstants.JSON_ALL_TEXT_MODE, true);
  String sql = "select a from dfs.`jsoninput/drill_3353` where e = true";
  RowSet results = runTest(sql);
  results.print();
  results.clear();
    } finally {
  client.resetSession(ExecConstants.JSON_ALL_TEXT_MODE);
    }
  }
{code}

The "drill_3353" directory contains two files. {{a.json}}:

{code:json}
{ a : { b : 1, c : 1 }, e : false } 
{ a : { b : 1, c : 1 }, e : false } 
{ a : { b : 1, c : 1 }, e : true  } 
{code}

And {{b.json}}:

{code:json}
{ a : { b : 1, d : 1 }, e : false } 
{ a : { b : 1, d : 1 }, e : false } 
{ a : { b : 1, d : 1 }, e : true  } 
{code}

Notice that both files contain the field {{a.b}}, but the first contains 
{{a.c}} while the second contains {{a.d}}.

The test is configured to return the schema of each file without any "schema 
smoothing." That is, there is a hard schema change between files.

The filter record batch fails to notice the schema change, tries to use the 
{{b.json}} schema with {{a.json}}, and results in an exception due to invalid 
offset vectors.

The problem is a symptom of this code:

{code:java}
  protected boolean setupNewSchema() throws SchemaChangeException {
...
switch (incoming.getSchema().getSelectionVectorMode()) {
  case NONE:
if (sv2 == null) {
  sv2 = new SelectionVector2(oContext.getAllocator());
}
filter = generateSV2Filterer();
break;
...
if (container.isSchemaChanged()) {
  container.buildSchema(SelectionVectorMode.TWO_BYTE);
  return true;
}
{code}

That is, if, after calling {{generateSV2Filterer()}}, the schema of our 
outgoing container changes, rebuild the schema. Since we changed a map 
structure, the schema should be changed, but it is not.

Digging deeper, the following adds/gets each incoming field to the outgoing 
container:

{code:java}
  protected Filterer generateSV2Filterer() throws SchemaChangeException {
...
    for (final VectorWrapper<?> v : incoming) {
  final TransferPair pair = 
v.getValueVector().makeTransferPair(container.addOrGet(v.getField(), callBack));
  transfers.add(pair);
}
{code}

Now, since the top-level field {{a}} already exists in the container, we'd have 
to do a bit of sleuthing to see if its contents changed, but we don't:

{code:java}
  public <T extends ValueVector> T addOrGet(final MaterializedField field, 
final SchemaChangeCallBack callBack) {
final TypedFieldId id = 
getValueVectorId(SchemaPath.getSimplePath(field.getName()));
final ValueVector vector;
if (id != null) {
  vector = getValueAccessorById(id.getFieldIds()).getValueVector();
  if (id.getFieldIds().length == 1 && 
!vector.getField().getType().equals(field.getType())) {
final ValueVector newVector = TypeHelper.getNewVector(field, 
this.getAllocator(), callBack);
replace(vector, newVector);
return (T) newVector;
  }
...
{code}

The logic is to check if we have a vector of the top-level name. If so, we 
check if the types are equal. If not, we go ahead and replace the existing 
vector (which has {{a\{b,d\}}}) with the new one (which has {{a\{b, c\}}}).

However, when running the code, the {{if}}-statement is not triggered, so the 
vector is not replaced, and the schema is not marked as changed.

The next question is why {{equals()}} considers the two maps the same. It 
seems that the Protobuf-generated code does not consider child types, so it 
ignores differences in the maps' members.

As it turns out, prior work already ran into this issue in another context, and 
a solution is available: {{MaterializedField.isEquivalent()}} already does the 
proper checks, including proper type checking for types of Unions and member 
checking for maps.

Replacing {{MajorType.equals()}} with {{MaterializedField.isEquivalent()}} 
fixes the issue.
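
Sketched against the {{addOrGet()}} snippet above, the change is:

{code:java}
// Compare the full materialized fields (including a map's children) rather
// than only the top-level MajorType.
if (id.getFieldIds().length == 1
    && !vector.getField().isEquivalent(field)) {
  final ValueVector newVector = TypeHelper.getNewVector(field,
      this.getAllocator(), callBack);
  replace(vector, newVector);
  return (T) newVector;
}
{code}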

Digging deeper, the reason that {{equals()}} says that the two types are 
equal is that they are the same object. The input container evolved its map 
type as the JSON reader discovered new columns. But, the filter record batch 
simply reused the same object. That is, since the two containers share the same 
{{MaterializedField}}, changes made by the incoming operator immediately 
changed the schema of the filter operator's container. Said another w

[jira] [Created] (DRILL-7303) Filter record batch does not handle zero-length batches

2019-06-21 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7303:
--

 Summary: Filter record batch does not handle zero-length batches
 Key: DRILL-7303
 URL: https://issues.apache.org/jira/browse/DRILL-7303
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Testing of the row-set-based JSON reader revealed a limitation of the Filter 
record batch: if an incoming batch has zero records, the length of the 
associated SV2 is left at -1. In particular:

{code:java}
public class SelectionVector2 implements AutoCloseable {
  // Indicates actual number of rows in the RecordBatch
  // container which owns this SV2 instance
  private int batchActualRecordCount = -1;
{code}

Then:

{code:java}
public abstract class FilterTemplate2 implements Filterer {
  @Override
  public void filterBatch(int recordCount) throws SchemaChangeException{
if (recordCount == 0) {
  outgoingSelectionVector.setRecordCount(0);
  return;
}
{code}

Notice there is no call to set the actual record count. The solution is to 
insert one line of code:

{code:java}
if (recordCount == 0) {
  outgoingSelectionVector.setRecordCount(0);
  outgoingSelectionVector.setBatchActualRecordCount(0); // <-- Add this
  return;
}
{code}

Without this, the query fails with an error due to an invalid index of -1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7301) Assertion failure in HashAgg with mem prediction off

2019-06-18 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7301:
--

 Summary: Assertion failure in HashAgg with mem prediction off
 Key: DRILL-7301
 URL: https://issues.apache.org/jira/browse/DRILL-7301
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Boaz Ben-Zvi


DRILL-6951 revised the mock data source to use the new "EVF". A side effect is 
that the new version minimizes batch internal fragmentation (which is a good 
thing.) As it turns out, the {{TestHashAggrSpill}} unit tests based their 
spilling tests on total memory, including wasted internal fragmentation. After 
the upgrade to the mock data source, some of the {{TestHashAggrSpill}} tests 
failed because they no longer spilled.

The revised mock limits batch sizes to 10 MB by default. The code ensures that 
the largest vector, likely the one for {{empid_s17}}, is near 100% full.

Experimentation showed that doubling the row count provided sufficient memory 
usage to cause the operator to spill as requested. But, one test now fails with 
an assertion error:

{code:java}
  /**
   * Test with "needed memory" prediction turned off
   * (i.e., exercise code paths that catch OOMs from the Hash Table and recover)
   */
  @Test
  public void testNoPredictHashAggrSpill() throws Exception {
testSpill(58_000_000, 16, 2, 2, false, false /* no prediction */, null,
DEFAULT_ROW_COUNT, 1, 1, 1);
  }
{code}

Partial stack:

{noformat}
at 
org.apache.drill.exec.physical.impl.common.HashTableTemplate.outputKeys(HashTableTemplate.java:910)
 ~[classes/:na]
at 
org.apache.drill.exec.test.generated.HashAggregatorGen0.outputCurrentBatch(HashAggTemplate.java:1184)
 ~[na:na]
at 
org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext(HashAggBatch.java:267)
 ~[classes/:na]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:186)
 ~[classes/:na]
{noformat}

Failure line:

{code:java}
  @Override
  public boolean outputKeys(int batchIdx, VectorContainer outContainer, int 
numRecords) {
assert batchIdx < batchHolders.size(); // <-- Fails here
return batchHolders.get(batchIdx).outputKeys(outContainer, numRecords);
  }
{code}

Perhaps the increase in row count forced the operator into an operating range 
with insufficient memory. If so, the test should have failed with some kind of 
OOM rather than an index assertion.

To test the low-memory theory, the memory limit was increased to 
{{60_000_000}}. Now the code failed at a different point:

{noformat}
at 
org.apache.drill.exec.physical.impl.common.HashTableTemplate.put(HashTableTemplate.java:678)
 ~[classes/:na]
at 
org.apache.drill.exec.test.generated.HashAggregatorGen0.checkGroupAndAggrValues(HashAggTemplate.java:1337)
 ~[na:na]
at 
org.apache.drill.exec.test.generated.HashAggregatorGen0.doWork(HashAggTemplate.java:606)
 ~[na:na]
at 
org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext(HashAggBatch.java:296)
 ~[classes/:na]
{noformat}

Code line:

{code:java}
  @Override
  public PutStatus put(int incomingRowIdx, IndexPointer htIdxHolder, int 
hashCode, int targetBatchRowCount) throws SchemaChangeException, 
RetryAfterSpillException {
...
for ( int currentIndex = startIdx;
 ... {
  // remember the current link, which would be the last when the next link 
is empty
  lastEntryBatch = batchHolders.get((currentIndex >>> 16) & BATCH_MASK); // 
<-- Here
{code}

Increasing memory to {{62_000_000}} produced this error:

{noformat}
at 
org.apache.drill.exec.physical.impl.common.HashTableTemplate.outputKeys(HashTableTemplate.java:910)
 ~[classes/:na]
at 
org.apache.drill.exec.test.generated.HashAggregatorGen0.outputCurrentBatch(HashAggTemplate.java:1184)
 ~[na:na]
at 
org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext(HashAggBatch.java:267)
 ~[classes/:na]
{noformat}

At the line shown for the first exception.

Increasing memory to {{64_000_000}} triggered the second error again.

Increasing memory to {{66_000_000}} triggered the first error again.

The errors recurred at memory (later jumping by 20M) up to 140 M, at which 
point the test failed because the query ran, but did not spill. The query fails 
with a memory limit of 130M.

At {{135_000_000}} the query works, but the returned row count is wrong:

{noformat}
java.lang.AssertionError: expected:<240> but was:<2334465>
{noformat}

At:

{code:java}
  private void runAndDump(...
  assertEquals(expectedRows, summary.recordCount());
{code}

There does seem to be something wrong with this code path. All other tests run 
fine with the new mock data source (and adjusted row counts.)

Have disabled the offending test until this bug can be fixed.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7300) Drill console, query output no longer has "Edit Query" button

2019-06-18 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7300:
--

 Summary: Drill console, query output no longer has "Edit Query" 
button
 Key: DRILL-7300
 URL: https://issues.apache.org/jira/browse/DRILL-7300
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.16.0
Reporter: Paul Rogers


In previous versions of Drill, one can use the following workflow to refine a 
query:

* Type the query
* Run it
* Refine it by clicking the "Edit Query" button (If I remember correctly)

In the current Drill 1.17, this functionality is broken:

* If the query runs successfully, there is a link to open the profile in a new 
window, from which one can navigate to Edit Query.
* Hitting the browser back button took me to the "Storage" tab. (I had 
previously added a workspace, but how did this end up as the previous page?)
* If the query fails, there is a "Back <|" button, but it also takes me to the 
Storage tab.

Expected a simple "Edit Query" button on the top of the results page that will 
edit the query within the same browser tab.

The workaround is to copy the query text before hitting "Submit", then use the 
"Query" tab and paste the query back into the editor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7299) Infinite exception loop in Sqlline after kill process

2019-06-18 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7299:
--

 Summary: Infinite exception loop in Sqlline after kill process
 Key: DRILL-7299
 URL: https://issues.apache.org/jira/browse/DRILL-7299
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers


Tried killing Sqlline using the "kill" command. Ended up in an infinite loop 
that repeatedly printed the following to the console:

{noformat}
java.lang.IllegalStateException
at 
org.jline.reader.impl.LineReaderImpl.readLine(LineReaderImpl.java:464)
at 
org.jline.reader.impl.LineReaderImpl.readLine(LineReaderImpl.java:445)
at sqlline.SqlLine.begin(SqlLine.java:537)
at sqlline.SqlLine.start(SqlLine.java:266)
at sqlline.SqlLine.main(SqlLine.java:205)
java.lang.IllegalStateException
at 
org.jline.reader.impl.LineReaderImpl.readLine(LineReaderImpl.java:464)
at 
org.jline.reader.impl.LineReaderImpl.readLine(LineReaderImpl.java:445)
at sqlline.SqlLine.begin(SqlLine.java:537)
at sqlline.SqlLine.start(SqlLine.java:266)
at sqlline.SqlLine.main(SqlLine.java:205)
...
{noformat}

Using "kill -9" properly killed the process.

Expected a simple "kill" ({{SIGTERM}}) to have done the job.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7298) Revise log regex plugin to work with table functions

2019-06-18 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7298:
--

 Summary: Revise log regex plugin to work with table functions
 Key: DRILL-7298
 URL: https://issues.apache.org/jira/browse/DRILL-7298
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.16.0
Reporter: Paul Rogers


See the [PR for DRILL-7293|https://github.com/apache/drill/pull/1807], the 
discussion regarding table properties. The logRegex plugin contains a list of 
{{LogFormatField}} objects:

{code:java}
  private List<LogFormatField> schema;
{code}

As it turns out, such a list cannot be used with table properties. This ticket 
asks to find a solution, perhaps using the suggestions from the PR.

The log format plugin allows users to read any text file that can be described 
with a regex. The plugin lets the user provide the regex, and a list of fields 
that match the groups within the regex. These fields are described with the 
{{schema}} list. The schema defines a name, type and parse pattern.

Because of the versatility of logRegex, it would be great to be able to specify 
the pattern and field in a table function so that users do not have to create a 
new plugin config each time they want to query a new kind of file. DRILL-7293 
allows the user to specify the regex and schema using the recently added schema 
provisioning system. Still, it would be handy to use table functions.

The required changes are to use types that the table functions can handle, which 
limits choices to strings and numbers. For ad-hoc query use, it might be fine 
to just list field names. Or, perhaps, if no field names are provided, use the 
{{columns}} array as in CSV. For ad-hoc use, type conversions can be expressed 
as casts rather than as types in the table functions.
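
One possible shape for a table-function-friendly config, sketched only to 
illustrate the strings-and-numbers restriction (field names here are 
hypothetical, not the actual plugin config):

{code:java}
// Hypothetical sketch: every property is a string or a number, so all of
// them can be supplied through a table function.
public class LogRegexTableFnConfig {
  public String regex;        // the row pattern
  public String extension;    // default file extension
  public int maxErrors = 10;
  // Flattened replacement for List<LogFormatField>: a comma-delimited list
  // of group names; types would come from CASTs in the query instead.
  public String fieldNames;   // e.g. "eventDate,process_name,pid,message,src_ip"
}
{code}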

h4. Backward Compatibility

Care must be taken when changing the config structure of an existing plugin. In 
the past, Drill would refuse to start if the JSON configs stored in ZK did not 
match the schema that Jackson expects based on the config class. Any fix or 
this problem *must* ensure that existing configs do not cause Drill startup to 
fail. Ideally, configs would be automatically upgraded so that users don't have 
to take any manual steps when upgrading Drill with the features requested here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Extended Vector Framework ready for use

2019-06-13 Thread Paul Rogers
Hi All,

A previous note explained how Drill has added the "Extended Vector Framework" 
(also called the "Row Set Framework") to improve the user's experience with 
Drill. One of Drill's key contributions is "schema-on-read": Drill can make 
sense of many kinds of data files without the hassle of setting up the Hive 
Meta Store (HMS). While Drill can use HMS, it is often more convenient to 
just query a table (directory of files) without first defining a schema in HMS.

The EVF helps to solve two problems that crop up with the schema-on-read 
approach:

* Drill does not know the size of the data to be read, yet each reader must 
limit record batch sizes to a configured maximum.

* File schemas can be ambiguous, resulting in two scan fragments picking 
different column types, which can lead to query failures when Drill tries to 
combine the results.

For the user, EVF simply makes Drill work better, especially if they use CREATE 
SCHEMA to tell Drill how to resolve schema ambiguities.

To achieve our goals, storage and format plugins must change (or be created) to 
use EVF. This is where you come in if you create or maintain plugins.

We've prepared multiple ways for you to learn how to use the EVF:

* The documentation of the CREATE SCHEMA statement. [1]

* The text format plugin now uses EVF. This is, however, not the best example 
because the plugin itself is rather complex.

*  Chapter 12 of the Learning Apache Drill book explains how to create a format 
plugin. It uses the log format plugin as an example. We've converted the log 
format plugin to use EVF (pull request pending at the moment.)

* We've created an EVF tutorial that shows how to convert the log plugin to use 
EVF. This connects up Chapter 12 of the Drill book with the recent EVF work. [2]


Please use this mailing list to share questions, comments and suggestions as 
you tackle your own plugins. Each plugin has its own unique quirks and issues 
which we can discuss here.


Thanks,
- Paul

[1] https://drill.apache.org/docs/create-or-replace-schema/


[2] 
https://github.com/paul-rogers/drill/wiki/Developer%27s-Guide-to-the-Enhanced-Vector-Framework





[jira] [Created] (DRILL-7293) Convert the regex ("log") plugin to use EVF

2019-06-12 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7293:
--

 Summary: Convert the regex ("log") plugin to use EVF
 Key: DRILL-7293
 URL: https://issues.apache.org/jira/browse/DRILL-7293
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.17.0


The "log" plugin (which uses a regex to define the row format) is the subject 
of Chapter 12 of the Learning Apache Drill book (though the version in the book 
is simpler than the one in the master branch.)

The recently-completed "Enhanced Vector Framework" (EVF, AKA the "row set 
framework") gives Drill control over the size of batches created by readers, 
and allows readers to use the recently-added provided schema mechanism.

We wish to use the log reader as an example for how to convert a Drill format 
plugin to use the EVF so that other developers can convert their own plugins.

This PR provides the first set of log plugin changes to enable us to publish a 
tutorial on the EVF.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7292) Remove V1, V2 text readers

2019-06-12 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7292:
--

 Summary: Remove V1, V2 text readers
 Key: DRILL-7292
 URL: https://issues.apache.org/jira/browse/DRILL-7292
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.17.0


Now that the "V3" text reader (based on the "extended vector framework) is 
fully functional, we wish to remove the prior (V2) text reader.

Drill also contains the original, "V1" text reader that has not been used or 
supported for several years. We will remove this reader also to leave the code 
simpler and cleaner.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSSION] DRILL-7097 Rename MapVector to StructVector

2019-06-04 Thread Paul Rogers
Hi Bohdan,

As you note, the two constraints we have are 1) avoiding breaking 
compatibility, and 2) providing the true map (DICT) type.

As we noted, the DICT type will allow SqlLine to present the type as a map.

There seem to be advantages to reusing existing work where possible.

You have looked at the issue longer than the rest of us - can you see a path to 
building the feature this way? What issues would need resolution?

Does it make sense to try working out the proposed approach by revising your 
spec or JIRA ticket description so we can see if we are on the right track?

Thanks,

- Paul

Sent from my iPhone

> On Jun 4, 2019, at 1:57 AM, Bohdan Kazydub  wrote:
> 
> Hi Paul,
> 
> if I understood you correctly, you are talking about implementation of
> "true map" as list of STRUCT (which is currently named MAP in Drill). While
> this implementation is viable we still do need to introduce a new type for
> such "true map" as REPEATED MAP is still a different data type. That is
> while a "true map" can be implemented using REPEATED MAP under the hood
> these are not the same types (e.g., what if user wants to use REPEATED MAP
> (AKA repeated struct) and not "true map").
> Is my understanding correct?
> 
> The approach found in [1] was taken similarly to that done in Hive[2] as I
> find it clearer and not to meddle with MAP's innards.
> 
> Also worth mentioning that there is this[3] open [WIP] PR into Apache Arrow
> which introduces MapVector (opened a few days ago) which uses the approach
> you suggested.
> 
> [1]
> https://docs.google.com/presentation/d/1FG4swOrkFIRL7qjiP7PSOPy8a1vnxs5Z9PM3ZfRPRYo/edit#slide=id.p
> [2]
> https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/MapColumnVector.java#L30
> [3]https://github.com/apache/arrow/pull/
> 
> On Tue, Jun 4, 2019 at 2:59 AM Paul Rogers 
> wrote:
> 
>> Hi Igor,
>> 
>> Glad the community was able to provide a bit of help.
>> 
>> Let's talk about another topic. You said: "And main purpose will be
>> hiding of repeated map meta keys
>> ("key","value") and simulation of real map functionality."
>> 
>> On the one hand, we are all accustomed to thinking of a Java (or Python)
>> map as a black box: store (key, value) pairs, retrieve values by key. This
>> is the programming view. I wonder, however, if it is the best SQL view.
>> 
>> Drill is, of course, SQL-based. It may be easier to bring the data to SQL
>> than to bring SQL to the data. SQL works on tables (relations) and is very
>> powerful when doing so. Standard SQL does not, however, provide tools to
>> work with dictionaries. (There is an extension, SQL++, that might provide
>> such extensions. But, even if Drill supported SQL++, no front-end tools
>> provides such support AFAIK.)
>> 
>> So, how do we bring the DICT type to SQL? We do so by noting that a DICT
>> is really a little table of (key, value) pairs (with a uniqueness
>> constraint on the key.) Once we adopt this view, we can apply (I hope!) the
>> nested table mechanism recently added to Drill.
>> 
>> This means that the user DOES want to know the name of the key and value
>> columns: they are columns in a tuple (relation) that can be joined and
>> filtered. Suppose each customer has a DICT of contact information with keys
>> as "office", "home", "cell",... and values as the phone number. You can use
>> SQL to find the office numbers:
>> 
>> 
>> SELECT custName, contactInfo.value as phone WHERE contactInfo.key =
>> "office"...
>> 
>> 
>> So, rather than wanting to hide the (key, value) structure of a DICT, we
>> could argue that exposing that structure allows the DICT to look like a
>> relation, and thus exploit existing Drill features. In fact, this may make
>> Drill more powerful when working Hive maps than is Hive itself (If Hive
>> treats maps as opaque objects.)
>> 
>> 
>> You also showed the SQLLine output you would like for a DICT column. This
>> example exposes a "lie" (a short-cut) that Sqlline exploits. SqlLine asks
>> Drill to convert a column to a Java Object of some sort, then SqlLine calls
>> toString() on that object to produce the value you see in SqlLine output.
>> 
>> Some examples. An array (repeated) column is a set of values. Drill
>> converts the repeated value to a Java array, which toString() converts to
>> something like "[1, 2, 3]". The same is true of MAP: Drill converts it to a
>> Java Map, toString converts it to a JSON-like presentation

Re: [DISCUSSION] DRILL-7097 Rename MapVector to StructVector

2019-06-03 Thread Paul Rogers
Hi Igor,

Glad the community was able to provide a bit of help.

Let's talk about another topic. You said: "And main purpose will be 
hiding of repeated map meta keys
("key","value") and simulation of real map functionality."

On the one hand, we are all accustomed to thinking of a Java (or Python) map as 
a black box: store (key, value) pairs, retrieve values by key. This is the 
programming view. I wonder, however, if it is the best SQL view.

Drill is, of course, SQL-based. It may be easier to bring the data to SQL than 
to bring SQL to the data. SQL works on tables (relations) and is very powerful 
when doing so. Standard SQL does not, however, provide tools to work with 
dictionaries. (There is an extension, SQL++, that might provide such 
capabilities. But, even if Drill supported SQL++, no front-end tool provides 
such support, AFAIK.)

So, how do we bring the DICT type to SQL? We do so by noting that a DICT is 
really a little table of (key, value) pairs (with a uniqueness constraint on 
the key.) Once we adopt this view, we can apply (I hope!) the nested table 
mechanism recently added to Drill.

This means that the user DOES want to know the name of the key and value 
columns: they are columns in a tuple (relation) that can be joined and 
filtered. Suppose each customer has a DICT of contact information with keys as 
"office", "home", "cell",... and values as the phone number. You can use SQL to 
find the office numbers:


SELECT custName, contactInfo.value AS phone FROM ... WHERE contactInfo.key = 'office'


So, rather than wanting to hide the (key, value) structure of a DICT, we could 
argue that exposing that structure allows the DICT to look like a relation, and 
thus exploit existing Drill features. In fact, this may make Drill more 
powerful when working with Hive maps than Hive itself is (if Hive treats maps 
as opaque objects).


You also showed the SqlLine output you would like for a DICT column. This 
example exposes a "lie" (a short-cut) that SqlLine exploits. SqlLine asks Drill 
to convert a column to a Java Object of some sort, then SqlLine calls 
toString() on that object to produce the value you see in SqlLine output.

Some examples. An array (repeated) column is a set of values. Drill converts 
the repeated value to a Java array, which toString() converts to something like 
"[1, 2, 3]". The same is true of MAP: Drill converts it to a Java Map, toString 
converts it to a JSON-like presentation.

So, your DICT (or repeated map) type should provide a getObject() method that 
converts the repeated map to a Java Map. SqlLine will convert the map object to 
the display format you showed in your example. (My guess is that a repeated map 
today produces an array of Java Map objects: you want a single Java Map built 
from the key/value pairs.)
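
To make that concrete, here is a tiny plain-Java illustration (not Drill code) 
of folding one row's (key, value) entries into a single java.util.Map whose 
toString() gives that kind of display:

import java.util.LinkedHashMap;
import java.util.Map;

public class DictDisplay {
  public static void main(String[] args) {
    // One entry per element of the row's repeated (key, value) map.
    Map<String, String> dict = new LinkedHashMap<>();
    dict.put("office", "555-1234");
    dict.put("home", "555-2020");
    dict.put("cell", "555-8888");
    // Map.toString() yields {office=555-1234, home=555-2020, cell=555-8888},
    // which is the kind of value SqlLine would then display for the column.
    System.out.println(dict);
  }
}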


A JDBC user can use the getObject() method to retrieve a Java Map 
representation of a Drill DICT. (This functionality is not available in ODBC 
AFAIK.) The same is true for anyone brave enough to use the native Drill client 
API.
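
A rough JDBC sketch of that usage (assuming a Drillbit on localhost and a 
hypothetical dfs.tmp.`customers` table holding the contactInfo column used 
above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Map;

public class DictViaJdbc {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT custName, contactInfo FROM dfs.tmp.`customers`")) {
      while (rs.next()) {
        // getObject() hands the DICT column back as a java.util.Map.
        Map<?, ?> contact = (Map<?, ?>) rs.getObject("contactInfo");
        System.out.println(rs.getString("custName") + ": " + contact.get("office"));
      }
    }
  }
}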


Thanks,
- Paul

 

On Monday, June 3, 2019, 7:08:42 AM PDT, Igor Guzenko 
 wrote:  
 
 Hi all,

So finally, I'm going to abandon the renaming ticket DRILL-7097 and
related PR (1803).

Next, the DRILL-7096 should be rewritten to cover addition of new DICT
type. But, if I understand correctly,
based on repeated vector, now result for new type will be returned like:

row |  dict_column MAP
--
  1  | [{"key":1, "value":"v1"}, {"key":2, "value":"v2"} ]
  2  | [{"key":0, "value":"v7"}, {"key":2, "value":"v2"}, {"key":4,
"value":"v4"} ]
  3  | [{"key":-1, "value":"o"}]

And main purpose will be hiding of repeated map meta keys
("key","value") and simulation of real map functionality.

I believe that actually it won't be so easy to reuse all existing
functionality for repeated maps to return logically correct
results for DICT, because it's usage of repeated map in unexpected
way. Also I'd like to hear thoughts from Bohdan about
such application of repeated maps instead of new vector.

Thanks, Igor

  

Re: [DISCUSSION] DRILL-7097 Rename MapVector to StructVector

2019-06-01 Thread Paul Rogers
Hi All,

TLDR; Drill already provides a number of powerful features that give us 80-90% 
of what we need for DICT type. Much time could be saved by using them, focusing 
efforts on adding the remaining bits specific to DICT.

We can divide the DICT problem into two categories:

1. Internal representation, the topic of the previous note which suggested that 
a DICT is really just a repeated MAP.

2. DICT semantics, which is the topic here.

Item 2, semantics, can itself be further divided into two groups:

3. Functionality already in Drill that can be extended/repurposed for the DICT 
type, if DICT is implemented as a repeated MAP.

4. New functionality which must be added.

Existing functionality includes things like:

* The flatten() function which, essentially, joins a DICT with its containing 
row.
* The powerful nested table functionality (added by Parth, Aman and others over 
the last year) that lets users treat a map array (hence a DICT) as a nested 
table and allows sorting, filtering, aggregation and many other SQL operations.

For item 4, Igor probably has a list of new functionality. Some might include:

* A DICT data type which is a repeated map with the addition of identifying the 
key column. (Add a column property in ColumnMetadata, a field in 
MaterializedField.)

* Using the implied uniqueness constraint on the key column to plan nested 
table operations (some operations might be simpler if we know the key is unique 
within each map array.)

* Providing DICT functions such as extracting a value by key (noting that this 
can be done via a SELECT on the nested table.)

* And so on.


Leveraging functionality Drill already has should reduce the cost of 
implementation, and should avoid the compatibility issues that started this 
discussion.

Thanks,
- Paul

 

  
 

Re: How to implement AbstractRecordWriter

2019-05-31 Thread Paul Rogers
Hi Nicolas,

Yet another suggestion, FWIW. We already have tests for writing JSON and CSV. 
We also have tests for reading MapR DB. So, try making changes to those and 
seeing if you can get those to run. For example, create a test that reads a 
file in CSV, writes it to JSON, reads it back as JSON, and validates the 
results using the Row Set stuff. You only need 2-3 rows of data. Plenty of 
examples show how 
this is done, and we can provide specific pointers where you get stuck so that 
others benefit.


If you get this to work (in exec), try moving the test to your contrib project. 
Can you get it to work there? If not, what are the issues?

Now you are in fine shape to, say, change the initial read part from CSV to 
MapR DB. That will ensure you've got the existing MapR DB stuff working. 
Finally, change the write-to and read-from JSON to use MapR DB instead (the 
write using your new code.)

This way, you can tackle issues one by one, always being close to having 
something work (even if it is not the full final functionality.)

Thanks,
- Paul

 

On Friday, May 31, 2019, 12:16:09 PM PDT, Nicolas A Perez 
 wrote:  
 
Is there a chance we can get into a webex call at some point so someone can
help me out with an initial test run?

On Fri, May 31, 2019 at 19:38 Paul Rogers  wrote:


  

Re: How to implement AbstractRecordWriter

2019-05-31 Thread Paul Rogers
Hi Nicolas,

To address your last issue about the wide variety of ways we have to write 
tests... Yes, you are right that there is a wonderful variety of techniques 
that evolved over the life of the project. Unlike Spark, we do not enjoy an 
over-abundance of contributors, so we've pretty much left the old-style tests 
(and code) unchanged, just added newer ones on top. This also explains why 
"plug-ins" are not actually pluggable. Ugly, yes, but the best the team can do 
with limited resources. (We're always looking for volunteers!)


For the MapR DB tests, one approach is to just do whatever your predecessors 
did on the read side.

On the other hand, the path of least resistance is to follow the patterns in 
the CSV tests (and in ExampleTest) to use the newer frameworks for setup, the 
newer tools for running queries and capturing results, and the Row Set 
framework to verify the results. I find I can whip out a unit test in just a 
few minutes using these newer tools.

Also, please do contribute improvements where you can, or at least file JIRA 
tickets with your suggestions so that the project benefits from your experience 
learning how to contribute to Drill.

Thanks,
- Paul

 

On Friday, May 31, 2019, 10:10:14 AM PDT, Nicolas A Perez 
 wrote:  
 
One of the issues I have is that I haven't found a way to debug my tests
from IntelliJ. It continues to say that some constructs from other modules
are missing.

Also, I haven't found *simple* examples of how to write *simple* tests.
Every time I look at the existing code, the tests are done in a different
way.

Now, on the other hand, plugins should be independent from Drill core
modules. If you think about it, I can easily write a library that can be
injected into Spark without touching Spark code. For instance, the
DataSource API will load the required parts from my code at run time. Drill
does the same, but the problem is the coupling between Drill and its
extension points.

On the test side, you have another problem: you cannot easily test your
new modules unless they are within Drill core code. Maybe it is time to
decouple the test framework from Drill itself, too.

On Fri, May 31, 2019 at 18:38 Paul Rogers  wrote:

> Hi Nicolas,
>
> Charles outlined the choices quite well.
>
> Let's talk about your observation that you find it annoying to deal with
> the full Drill code. There may be some tricks here that can help you.
>
> As you know, I've been revising the text reader and the "EVF" (row set
> framework). Doing so requires a series of pull requests. To move fast, I've
> found the following workflow to be helpful:
>
> * Use a machine with an SSD. A Mac is ideal. A Linux desktop also works
> (mine uses Linux Mint.) The SSD makes rebuilds very fast.
>
> * Use unit tests for all your testing. For example, I created dozens of
> unit tests for CSV files to exercise the text reader, and many more to
> exercise the EVF. All development and testing consists of adding/changing
> code, adding/changing tests, and stepping through the unit test and
> underlying code to find bugs.
>
> * Use JUnit categories to run selected unit tests as a group.
>
> In most cases, you let your IDE do the build; you don't need Maven nor do
> you need to build jar files. Edit a file, run a unit test from your IDE and
> step through code. My edit/compile/debug cycle tends to be seconds.
>
> If, however, you find yourself using Maven to build Drill, then are
> running unit tests from Maven, and attaching a debugger, then your
> edit/compile/debug cycle will be 5+ minutes, which is going to be
> irritating.
>
> If you are doing a full build so you can use SqlLine to test, then this
> suggests it is time to write a unit test case for that issue so you can run
> it from the IDE. Using the RowSet stuff makes such tests easy. See
> TestCsvWithHeaders [1] for some examples.
>
> If you run from the IDE, and find things don't work then perhaps there is
> a config issue. Do we have code that looks for a file in
> $DRILL_HOME/whatever rather than using the class path? Is a required native
> library not on the LD_LIBRARY_PATH for the IDE?
>
> Most unit tests are designed to be stateless. They read a file stored in
> resources, or they write a test file, read the file, and discard the file
> when done.
>
> You are using MapRDB to insert data, which, of course, is stateful. So,
> perhaps your test can put the DB into a known start state, insert some
> records, read those records, compare them with the expected results, and
> clean up the state so you are ready for the next test run. Your target is
> that edit/compile/debug cycle of a few seconds.
>
>
> Overall, if you can master the art of running Drill, using unit tests, in
> your IDE, you can move forward very quickly.
>
> Use Maven builds, and ru

Re: [DISCUSSION] DRILL-7097 Rename MapVector to StructVector

2019-05-31 Thread Paul Rogers
Ted, you found the simple, elegant solution, as usual!

As it turns out, this project could be made far simpler if we just look at it 
from the right angle. Bear with me.

Ted and Charles each have, on occasion, suggested the value of a correlated 
list, along with functions to zip/unzip values in various ways. Drill has kv 
functions. Charles (I believe) suggested other, more general cases such as that 
zip function.


The new, proposed "DICT" type is, in one sense a hash map. But, the unique-name 
aspect does not fit well with write-once vectors. So, the DICT type is really a 
list of pairs (k, v). We provide functions to add a (k, v) pair, to find a 
value given a key, etc.

Here's the thing, to quote Monty Python, "we already got one": it is called a 
Map Array (AKA repeated map). Consider a Map of two fields (k VARCHAR, v 
VARCHAR). This is exactly the data structure proposed for the new DICT type. 
All that is missing are the functions to give it DICT-like behavior.

Why limit a DICT to VARCHAR keys or values? With a Map array, we can have (k 
INT, v DATE[]) (making up an array notation.) Want a "variant" value? Then it is 
(k VARCHAR, v UNION) (assuming we get UNION to work everywhere.)


We don't need to limit ourselves to a single value, either, since a MAP allows 
any number of fields. Maybe (id INT, name VARCHAR, age INT) or even (k INT, 
MAP(name VARCHAR, age INT)).

A quick read of the Hive MAP presentation (Igor's [3]) shows this is what the 
new Map is supposed to do.

You get the idea. In this world a DICT is just an alias for a map array, along 
with new functions.

This gives four big benefits:

1. No renaming needed, so no compatibility issues.
2. We can use the existing "complex" and "row set" readers and writers, etc.
3. No new low-level vectors, mutators, accessors etc. Just use the existing 
(fully tested) map array.
4. Done right, with a DICT as a correlated list (a map array), we can offer the 
additional functions that Ted and Charles suggested, killing two birds with one 
stone (to be a bit politically incorrect.)


To read/write a DICT, just use the existing Map Array features of the row set 
mechanism. Or, if you are old-school, the complex readers/writers for map 
arrays. (Or, if you are even older school, and have a strong stomach, you can 
diddle with the vectors directly.)
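
As a minimal sketch of that, assuming the row set framework's SchemaBuilder 
(method names follow the row set test code and could differ slightly), a DICT 
column is declared as an ordinary map array with the two conventional fields:

import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.exec.record.metadata.TupleMetadata;

public class DictAsMapArray {
  public static TupleMetadata dictSchema() {
    // A "DICT" is just a repeated map carrying the (key, value) convention.
    return new SchemaBuilder()
        .addMapArray("contactInfo")
          .add("key", MinorType.VARCHAR)     // unique within each row's map array
          .add("value", MinorType.VARCHAR)
          .resumeSchema()
        .buildSchema();
  }
}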


Just to be very clear here: if we introduce DICT as an alias for repeated MAP, 
then you IMMEDIATELY can use all our existing tools to create a Hive Map, and 
clients can already read the results. You can then make it fancier and easier 
to use by adding DICT-specific features.

Thanks,
- Paul

 

On Friday, May 31, 2019, 10:54:29 AM PDT, Ted Dunning 
 wrote:  
 
 Would it be possible to call the new structure a Dict (following Python's
inspiration)?

That would avoid the large disruption of renaming Map*.



On Fri, May 31, 2019 at 10:10 AM Paul Rogers 
wrote:

> Hi Igor,
>
> Thank you for finally addressing a long-running irritation: that the Drill
> Map type is not a map, it is a tuple.
>
> Perhaps you can divide the discussion into three parts.
>
> 1. Renaming classes, enums and other items internal to the Drill source
> code.
>
> 2. Renaming classes that are part of a public or ad-hoc API.
>
> 3. Renaming items visible to users.
>
> Changing items in part 1 causes a one-time disruption to anyone who has a
> working branch. However, a rebase onto master would easily resolve any
> issues. So, changes in this group are pretty safe.
>
>
> The PR also seems to change symbols visible to the anyone who has code in
> a repo separate from Drill, but that builds against Drill. All UDFs and
> plugins that use the former map classes must change. This means that those
> contributions can support only Drill before your PR or after; the
> maintainer would need two separate branches to support both versions of
> Drill.
>
> Such breaking of (implied) API compatibility is often considered a "bad
> thing." We may not want to complicate the lives of those who have
> graciously created Drill extensions and integrations.
>
> Finally, if we change anything visible from SqlLine, we break running
> applications, which we almost certainly do not want to do. See the changes
> to Types.java as an example.
>
> Can you make the change in a way that all your changes fall only into
> group 1, provide a gradual migration for group 2, and do not change
> anything in group 3?
>
> For example; the MinorType enum is a de-facto public API and must retain
> MAP with its current meaning, at least for some number of releases. You
> could add a STRUCT enum and mark MAP deprecated, and encourage third-party
> code to migrate. But we must still support MAP for some period of time to
> provide time for the migration. Then, add the new "map

Re: How to implement AbstractRecordWriter

2019-05-31 Thread Paul Rogers
Hi Nicolas,

Regarding your point that plugins should be, well, plugins -- independent of 
Drill code. Yes, that is true. But no one has invested the time to make it so. 
Doing so would require a clear, stable code API and an easy way to develop such 
code without the "build jar, copy to DRILL_HOME, restart Drill" approach that 
Charles mentioned.

There were some recent improvements around the bootstrap file, which is great. 
In the meantime, and since the MapR plugin code is already part of Drill, 
let's see if we can get the "work within Drill" approach to work for you. Then, 
perhaps you can use your experience to suggest changes that could be made to 
achieve the "true plugin" goal. All the Drill contributors who are not part of 
the core Drill team would likely very much appreciate a true plugin capability.


I use Eclipse, perhaps others who use IntelliJ can comment on the specifics of 
that IDE.

Drill is divided into modules: your code in the contrib module depends on Drill 
code in java-exec, vector and so on. When I run tests in java-exec in Eclipse, 
Eclipse automatically detects and rebuilds changes in dependent modules such as 
common or vector. This establishes that Eclipse, at least, understands Maven 
dependencies.


I seem to recall that I also got this to work when writing the Drill book when 
I created an example plugin in the contrib module. I don't recall having to 
change anything to get it to work. Perhaps others who have worked on other 
contrib modules can offer their experience.


So, one thing to check is if the Maven dependencies are configured correctly 
for the MapR plugin.

One issue which I thought we solved is test-time dependencies. Tim did some 
work to ensure that code in src/test is visible to downstream modules. Which 
symbols/constructs are causing you problems? Perhaps there is more to fix?

For now, perhaps you can target the goal of getting the existing MapR plugin 
code to work properly in the IDE. This is supposed to work, so it might just be 
a matter of resolving a few specific glitches.

Has anyone worked on the MapR DB plugin previously and can offer advice?

Thanks,
- Paul

 

On Friday, May 31, 2019, 10:10:14 AM PDT, Nicolas A Perez 
 wrote:  
 
One of the issues I have is that I haven't found a way to debug my tests
from IntelliJ. It continues to say that some constructs from other modules
are missing.

Also, I haven't found *simple* examples of how to write *simple* tests.
Every time I look at the existing code, the tests are done in a different
way.

Now, on the other hand, plugins should be independent from Drill core
modules. If you think about it, I can easily write a library that can be
injected into Spark without touching Spark code. For instance, the
DataSource API will load the required parts from my code at run time. Drill
does the same, but the problem is the coupling between Drill and its
extension points.

On the test side, you have another problem: you cannot easily test your
new modules unless they are within Drill core code. Maybe it is time to
decouple the test framework from Drill itself, too.

On Fri, May 31, 2019 at 18:38 Paul Rogers  wrote:

> Hi Nicolas,
>
> Charles outlined the choices quite well.
>
> Let's talk about your observation that you find it annoying to deal with
> the full Drill code. There may be some tricks here that can help you.
>
> As you know, I've been revising the text reader and the "EVF" (row set
> framework). Doing so requires a series of pull requests. To move fast, I've
> found the following workflow to be helpful:
>
> * Use a machine with an SSD. A Mac is ideal. A Linux desktop also works
> (mine uses Linux Mint.) The SSD makes rebuilds very fast.
>
> * Use unit tests for all your testing. For example, I created dozens of
> unit tests for CSV files to exercise the text reader, and many more to
> exercise the EVF. All development and testing consists of adding/changing
> code, adding/changing tests, and stepping through the unit test and
> underlying code to find bugs.
>
> * Use JUnit categories to run selected unit tests as a group.
>
> In most cases, you let your IDE do the build; you don't need Maven nor do
> you need to build jar files. Edit a file, run a unit test from your IDE and
> step through code. My edit/compile/debug cycle tends to be seconds.
>
> If, however, you find yourself using Maven to build Drill, then are
> running unit tests from Maven, and attaching a debugger, then your
> edit/compile/debug cycle will be 5+ minutes, which is going to be
> irritating.
>
> If you are doing a full build so you can use SqlLine to test, then this
> suggests it is time to write a unit test case for that issue so you can run
> it from the IDE. Using the RowSet stuff makes such tests easy. See
> TestCsvWithHeaders [1] for some examples.
>

Re: [DISCUSSION] DRILL-7097 Rename MapVector to StructVector

2019-05-31 Thread Paul Rogers
Hi Igor,

Thank you for finally addressing a long-running irritation: that the Drill Map 
type is not a map, it is a tuple.

Perhaps you can divide the discussion into three parts.

1. Renaming classes, enums and other items internal to the Drill source code.

2. Renaming classes that are part of a public or ad-hoc API.

3. Renaming items visible to users.

Changing items in part 1 causes a one-time disruption to anyone who has a 
working branch. However, a rebase onto master would easily resolve any issues. 
So, changes in this group are pretty safe.


The PR also seems to change symbols visible to anyone who has code in a 
repo separate from Drill, but that builds against Drill. All UDFs and plugins 
that use the former map classes must change. This means that those 
contributions can support only Drill before your PR or after; the maintainer 
would need two separate branches to support both versions of Drill.

Such breaking of (implied) API compatibility is often considered a "bad thing." 
We may not want to complicate the lives of those who have graciously created 
Drill extensions and integrations.

Finally, if we change anything visible from SqlLine, we break running 
applications, which we almost certainly do not want to do. See the changes to 
Types.java as an example.

Can you make the change in a way that all your changes fall only into group 1, 
provide a gradual migration for group 2, and do not change anything in group 3?

For example, the MinorType enum is a de-facto public API and must retain MAP 
with its current meaning, at least for some number of releases. You could add a 
STRUCT enum value, mark MAP deprecated, and encourage third-party code to 
migrate. But we must still support MAP for some period to give time for the 
migration. Then, add the new "map" as, say, KVMAP, TRUEMAP, KVPAIRS, HIVEMAP, 
MAP2 or whatever. (Awkward, yes, but necessary.) In the future, when the old 
MAP enum value is retired, it can be repurposed as an alias for KVMAP (or 
whatever), and the KVMAP enum marked as deprecated, to be removed after several 
more releases.
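
As a generic illustration of that deprecate-then-alias pattern (Drill's actual 
MinorType is generated from protobuf, so the mechanics there differ; this only 
sketches the shape of the idea):

public enum ExampleMinorType {
  STRUCT,        // new name for today's tuple-like "map"
  KVMAP,         // the new true (key, value) map
  @Deprecated
  MAP;           // kept so existing UDFs and plugins still compile

  // Old code that passes MAP keeps working; new code should use STRUCT.
  public ExampleMinorType canonical() {
    return this == MAP ? STRUCT : this;
  }
}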


Similarly, the SQL "MAP" type keyword cannot change, nor can the name of any 
SQL function (UDF) that uses the "map" term. Such changes would break SQL 
written by users, which generally does not end well. Again, you can add a new 
alias and encourage use of that alias.

One could certainly argue that making a breaking change will impact a limited 
number of people, and that the benefit justifies the cost. I'll leave that 
debate to others, focusing here on the mechanics.


Thanks,
- Paul

 

On Friday, May 31, 2019, 12:06:35 AM PDT, Igor Guzenko 
 wrote:  
 
 Hello Drillers,

I'm working on the renaming of the Map vector [1] and related stuff to make
room for the new canonical Map vector [2][3]. I believe this renaming has a
big impact on Drill and on related client code (ODBC/JDBC).

So I'd like to be sure that this renaming is really necessary and
everybody agrees with the changes. Please check the draft PR [4] and
reply on the email.

An alternative solution is to simply leave the current map vector as is and
name the newly created Map vector (plus readers, writers, etc.) differently.

[1] https://issues.apache.org/jira/browse/DRILL-7097
[2] https://issues.apache.org/jira/browse/DRILL-7096
[3] 
https://docs.google.com/presentation/d/1FG4swOrkFIRL7qjiP7PSOPy8a1vnxs5Z9PM3ZfRPRYo/edit#slide=id.p
[4] https://github.com/apache/drill/pull/1803

Thanks, Igor Guzenko
  

Re: How to implement AbstractRecordWriter

2019-05-31 Thread Paul Rogers
Hi Nicolas,

Charles outlined the choices quite well.

Let's talk about your observation that you find it annoying to deal with the 
full Drill code. There may be some tricks here that can help you.

As you know, I've been revising the text reader and the "EVF" (row set 
framework). Doing so requires a series of pull requests. To move fast, I've 
found the following workflow to be helpful:

* Use a machine with an SSD. A Mac is ideal. A Linux desktop also works (mine 
uses Linux Mint.) The SSD makes rebuilds very fast.

* Use unit tests for all your testing. For example, I created dozens of unit 
tests for CSV files to exercise the text reader, and many more to exercise the 
EVF. All development and testing consists of adding/changing code, 
adding/changing tests, and stepping through the unit test and underlying code 
to find bugs.

* Use JUnit categories to run selected unit tests as a group.

In most cases, you let your IDE do the build; you don't need Maven nor do you 
need to build jar files. Edit a file, run a unit test from your IDE and step 
through code. My edit/compile/debug cycle tends to be seconds.

If, however, you find yourself using Maven to build Drill, running unit tests 
from Maven, and attaching a debugger, then your edit/compile/debug cycle will 
be 5+ minutes, which is going to be irritating.

If you are doing a full build so you can use SqlLine to test, then this 
suggests it is time to write a unit test case for that issue so you can run it 
from the IDE. Using the RowSet stuff makes such tests easy. See 
TestCsvWithHeaders [1] for some examples.

If you run from the IDE, and find things don't work then perhaps there is a 
config issue. Do we have code that looks for a file in $DRILL_HOME/whatever 
rather than using the class path? Is a required native library not on the 
LD_LIBRARY_PATH for the IDE?

Most unit tests are designed to be stateless. They read a file stored in 
resources, or they write a test file, read the file, and discard the file when 
done.

You are using MapRDB to insert data, which, of course, is stateful. So, perhaps 
your test can put the DB into a known start state, insert some records, read 
those records, compare them with the expected results, and clean up the state 
so you are ready for the next test run. Your target is that edit/compile/debug 
cycle of a few seconds.


Overall, if you can master the art of running Drill, using unit tests, in your 
IDE, you can move forward very quickly.

Use Maven builds, and run tests via Maven, only when getting ready to submit a 
PR. If you change, say, only the contrib module, you only need build and test 
that module. If you also change exec, say, then you can just build those two 
modules.

To use categories, tag your tests as follows:

@Category(RowSetTests.class) class MyTest ...

(I'll send the Maven command line separately; I'm not on that machine at the 
moment.)
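
Filled out, such a tagged test looks roughly like this (the 
org.apache.drill.categories package path is assumed here from the Drill test 
sources; the test body is just a placeholder):

import org.apache.drill.categories.RowSetTests;
import org.junit.Test;
import org.junit.experimental.categories.Category;

@Category(RowSetTests.class)
public class MyTest {
  @Test
  public void testSomething() {
    // Exercise the row set framework (or your plugin code) here.
  }
}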


Thanks much to the team members who helped make this happen. I've since worked 
on other projects that don't have this power, and it is truly a grueling 
experience to wait for long builds and deploys after every change.


Thanks,
- Paul

[1] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/test/java/org/apache/drill/exec/store/easy/text/compliant/TestCsvWithHeaders.java



 

On Friday, May 31, 2019, 5:17:40 AM PDT, Charles Givre  
wrote:  
 
 Hi Nicolas, 

You have two options:  
1.  You can develop format plugins and UDFs in Drill by adding them to the 
contrib/ folder and then test them with unit tests.  Take a look at this PR as 
an example[1].  If you're intending to submit your work to Drill for inclusion, 
this would be my recommendation as you can write the unit tests as you go, and 
it doesn't take very long to build and you can debug.
2.  Alternatively, you can package the code separately as shown here[2]. 
However, this option requires you to build it, then copy the jars over to 
DRILL_HOME/jars/3rd_party along with any dependencies, then run Drill.  I'm not 
sure how you could write unit tests this way. 

I hope this helps.


[1]: https://github.com/apache/drill/pull/1749
[2]: https://github.com/cgivre/drill-excel-plugin


> On May 31, 2019, at 8:06 AM, Nicolas A Perez  wrote:
> 
> Paul,
> 
> Is it possible to develop my plugin outside of the drill code, let's say in
> my own repository and then package it and add it to the location where the
> plugins live? Does that work, too? I just find annoying to deal with the
> full drill code in order to develop a plugin. At the same time, I might
> want to detach the development of plugins from the drill life cycle itself.
> 
> Please advise.
> 
> Best Regards,
> 
> Nicolas A Perez
> 
> On Thu, May 30, 2019 at 9:58 PM Paul Rogers 
> wrote:
> 
>> Hi Nicolas,
>> 
>> A quick check of the code suggests that AbstractWriter is a
>> Json-serialized description of the ph

Re: How to implement AbstractRecordWriter

2019-05-30 Thread Paul Rogers
Hi Nicolas,

A quick check of the code suggests that AbstractWriter is a Json-serialized 
description of the physical plan. It represents the information sent from the 
planner to the execution engine, and is interpreted by the scan operator. That 
is, it is the "physical plan."

The question is, how does the execution engine create the actual writer based 
on the physical plan? The only good example seems to be for the 
FileSystemPlugin. That particular storage plugin is complicated by the 
additional layer of the format plugins.

There is a bit of magic here. Briefly, Drill uses a BatchCreator to create your 
writer. It does so via some Java introspection magic. Drill looks for all 
subclasses of BatchCreator, then uses the type of the second argument to the 
getBatch() method to find the correct class. This may mean that you need to 
create one with MapRDBFormatPluginConfig as the type of the second argument.

The getBatch() method then creates the CloseableRecordBatch implementation. 
This is a full Drill operator, meaning it must handle the Volcano iterator 
protocol. Looks like you can perhaps use WriterRecordBatch as the writer 
operator itself. (See EasyWriterBatchCreator and follow the code to understand 
the plumbing.)

You create a RecordWriter to do the actual work. AFAIK, MapRDB supports JSON 
data model (at least in some form). If this is the version you are working on, 
the fastest development path might just be to copy the JsonRecordWriter, and 
replace the writes to JSON with writes to MapRDB. At least this gives you a 
place to start looking.
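
A very rough outline of that plumbing follows. The MapRDBWriter and 
MapRDBRecordWriter names here are hypothetical placeholders, imports are 
omitted, and the exact generics and constructor arguments should be copied 
from EasyWriterBatchCreator and WriterRecordBatch rather than from this sketch:

// Sketch only -- not compilable as-is: MapRDBWriter (the physical-plan writer
// definition) and MapRDBRecordWriter are hypothetical, and imports of Drill's
// BatchCreator, CloseableRecordBatch, RecordBatch, ExecutorFragmentContext,
// RecordWriter and WriterRecordBatch are omitted.
public class MapRDBWriterBatchCreator implements BatchCreator<MapRDBWriter> {
  @Override
  public CloseableRecordBatch getBatch(ExecutorFragmentContext context,
      MapRDBWriter config, List<RecordBatch> children)
      throws ExecutionSetupException {
    // The writer operator consumes a single incoming batch stream.
    RecordBatch incoming = children.iterator().next();
    // Hypothetical RecordWriter that pushes rows into MapR DB.
    RecordWriter recordWriter = new MapRDBRecordWriter(config);
    // Wrap it in the generic writer operator (argument order approximate;
    // see EasyWriterBatchCreator for the real call).
    return new WriterRecordBatch(config, incoming, context, recordWriter);
  }
}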


A more general solution would be to build the writer using some of the recent 
additions to Drill such as the row set mechanisms for reading a record batch. 
But, since copying the JSON approach provides a quick & dirty solution, perhaps 
that is good enough for this particular use case.


In our book, we recommend building each step one-by-one and doing a quick test 
to verify that each step works as you expect. If you create your BatchCreator, 
but not the writer, things won't actually work, but you can set a breakpoint in 
the getBatch() method to verify that Drill did find your class. And so on.


Thanks,
- Paul

 

On Thursday, May 30, 2019, 3:05:39 AM PDT, Nicolas A Perez 
 wrote:  
 
 Can anyone give me an overview of how to implement AbstractRecordWriter?

What are the mechanics it follows, what should I do, and so on? It will be very
helpful.

Best Regards,

Nicolas A Perez
-- 

Sent by Nicolas A Perez from my GMAIL account.

  

Re: Questions about bushy join

2019-05-29 Thread Paul Rogers
er.parquet]],
> > selectionRoot=classpath:/tpch/supplier.parquet, numFiles=1,
> > usedMetadataFile=false, columns=[`s_suppkey`, `s_nationkey`]]]) :
> > rowType = RecordType(ANY s_suppkey, ANY s_nationkey): rowcount =
> > 100.0, cumulative cost = {100.0 rows, 200.0 cpu, 0.0 io, 0.0 network,
> > 0.0 memory}, id = 10929
> > 00-10                      HashJoin(condition=[=($2, $3)],
> > joinType=[inner]) : rowType = RecordType(ANY n_name, ANY n_nationkey,
> > ANY n_regionkey, ANY r_regionkey, ANY r_name): rowcount = 25.0,
> > cumulative cost = {62.0 rows, 417.0 cpu, 0.0 io, 0.0 network, 17.6
> > memory}, id = 10934
> > 00-15                        Scan(groupscan=[ParquetGroupScan
> > [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]],
> > selectionRoot=classpath:/tpch/nation.parquet, numFiles=1,
> > usedMetadataFile=false, columns=[`n_name`, `n_nationkey`,
> > `n_regionkey`]]]) : rowType = RecordType(ANY n_name, ANY n_nationkey,
> > ANY n_regionkey): rowcount = 25.0, cumulative cost = {25.0 rows, 75.0
> > cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 10930
> > 00-14                        SelectionVectorRemover : rowType =
> > RecordType(ANY r_regionkey, ANY r_name): rowcount = 1.0, cumulative
> > cost = {11.0 rows, 34.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
> > 10933
> > 00-18                          Filter(condition=[=($1, 'EUROPE')]) :
> > rowType = RecordType(ANY r_regionkey, ANY r_name): rowcount = 1.0,
> > cumulative cost = {10.0 rows, 33.0 cpu, 0.0 io, 0.0 network, 0.0
> > memory}, id = 10932
> > 00-20                            Scan(groupscan=[ParquetGroupScan
> > [entries=[ReadEntryWithPath [path=classpath:/tpch/region.parquet]],
> > selectionRoot=classpath:/tpch/region.parquet, numFiles=1,
> > usedMetadataFile=false, columns=[`r_regionkey`, `r_name`]]]) : rowType
> > = RecordType(ANY r_regionkey, ANY r_name): rowcount = 5.0, cumulative
> > cost = {5.0 rows, 10.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
> > 10931
> >
> >
> > On Mon, May 27, 2019 at 8:23 PM weijie tong 
> > wrote:
> >
> > > Thanks for the answer. The blog[1] from hive shows that a optimal bushy
> > > tree plan could give a better query performance.At the bushy join case,
> > it
> > > will make the more build side of hash join nodes works parallel  also
> > with
> > > reduced intermediate data size.  To the worry about plan time cost,
> most
> > > bushy join query optimization use the heuristic planner [2] to identify
> > the
> > > pattern matches the bushy join to reduce the tree space(That's also
> what
> > > calcite does).  I wonder whether we can replace the
> >  LoptOptimizeJoinRule
> > > with MultiJoinOptimizeBushyRule.
> > >
> > > [1]
> > >
> > >
> >
> https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/
> > > [2] http://www.vldb.org/pvldb/vol9/p1401-chen.pdf
> > >
> > > On Tue, May 28, 2019 at 5:48 AM Paul Rogers  >
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > Weijie, do you have some example plans that would appear to be
> > > > sub-optimal, and would be improved with a bushy join plan? What
> > > > characteristic of the query or schema causes the need for a busy
> plan?
> > > >
> > > > FWIW, Impala uses a compromise approach: it evaluates left-deep
> plans,
> > > > then will "flip" a join if the build side turns out to be larger than
> > the
> > > > probe side. This may just be an artifact of Impala's cost model which
> > is
> > > > designed for star schemas, looks only one step ahead, and struggles
> > with
> > > > queries that do not fit the pattern. (Impala especially struggles
> with
> > > > multi-key joins and correlated filters on joined tables.) But, since
> > the
> > > > classic data warehouse use case tends to have simple star schemas,
> the
> > > > Impala approach works pretty well in practice. (Turns out that
> > Snowflake,
> > > > in their paper, claims to do something similar. [1])
> > > >
> > > > On the other hand, it might be that Calcite, because it uses a true
> > cost
> > > > model, already produces optimal plans and the join-flip trick is
> > > > unnecessary.
> > > >
> > > > A case where this trick seemed to help is the idea of joining two
> fact
> > > > tables, each of which is filtered via dimension tables. Making

Re: adding insert

2019-05-27 Thread Paul Rogers
Hi Ted,

Drill can do a CTAS today, which uses a writer provided by the format plugin. 
One would think this same structure could work for an INSERT operation, with a 
writer provided by the storage plugin. The devil, of course, is always in the 
details. And in finding resources to do the work...

Thanks,
- Paul

 

On Monday, May 27, 2019, 5:28:27 PM PDT, Ted Dunning 
 wrote:  
 
 I have in mind the ability to push rows to an underlying DB without any
transactional support.




  

Schema support in storage and format plugins

2019-05-27 Thread Paul Rogers
Hi All,

Drill 1.16 introduced the "provided schema" mechanism to help you query the 
kind of messy files found in the real world. Arina and Bridget created nice 
documentation [1] for the feature. Sorabh presented the feature at the recent 
Drill Meetup. If you are a plugin developer, we need your help to expand the 
feature to other plugins.


To understand the need for the feature, it helps to remember that there are two 
popular ways to query data in a distributed file system (DFS) such as Hadoop: 
direct query or ETL.

 Most major query engines require an ETL step: use Hive or Spark to transform 
your data into a standard format such as Parquet or ORC. Then, use a tool such 
as Drill (or Impala, Presto, Hive LLAP, Snowflake, Big Query, etc.) to query 
the data. The ETL approach works well, but it has a cost: you must maintain 
multiple copies of the data, manage an ETL pipeline, and so on. This cost is 
justified if your users query the data frequently, as in the classic "data 
warehouse" use case.


There are other use cases (such as log analysis, data exploration, data 
science) where the benefit of the two-step ETL process is less clear. These use 
cases are better served by directly querying your "raw" data. Here your choices 
are mostly Drill, Spark or the original Hive. Although Spark is very powerful, 
Drill is far easier to use for tasks that can be expressed in SQL using your 
favorite BI tool.


Drill's "schema on read" (AKA "schemaless") approach allows Drill to read a 
data file directly: just point Drill at a file and immediately run queries on 
that file. However, we've seen over the years that files can be messy or 
ambiguous. Let's look at the two most common problems and how the provided 
schema solves them: schema evolution and ambiguous data.

Schema evolution occurs when, say, a table started with two columns (a, b), 
then newer versions added a third column (c, say). If you query SELECT a, b, c 
FROM ..., Drill has to guess a type for column c in the old files (without the 
column). Drill generally guesses Nullable Int. But, if the column is actually 
VarChar, then a schema conflict (AKA "hard schema change") will occur and your 
query may fail. With a provided schema, you can tell Drill that column "c" is a 
VarChar, and even provide a default value. Now, Drill knows what to do for 
files without column "c".

Another kind of ambiguity occurs when Drill attempts to guess a data type from 
looking at the first few rows of a file. The classic JSON example is a 
two-record file: {a: 10} {a: 10.1} -- a column starts as an INT, but then we 
want to store FLOAT data into it, causing an error. With a hint, the user can 
just declare the column as FLOAT, avoiding the ambiguity.


The "provided schema" feature solves these problems by supplying hints about 
how to interpret a file. The feature avoids the heavy-weight cost of the Hive 
metastore (HMS) that is used by Hive, Impala and Presto. Instead, the schema is 
a simple file stored directly in the DFS alongside your data.


 You can enable schema support in a plugin by using the new "enhanced vector 
framework" (EVF) (AKA the "row set framework" or the "new scan framework".) 
This framework was originally developed to control reader memory use by 
limiting batch and vector size and to minimize vector memory fragmentation. 
Solving those problems turned out to also solve the problems needed to support 
a provided schema.

We are actively working to prepare the EVF for your use. We are converting the 
Log (regex) format plugin and preparing a tutorial based on that conversion. 
(The log reader was the basis of the format plugin chapter of the Learning 
Apache Drill book, so it is a good choice for the EVF tutorial.)

If you are a user, please try out the feature on text files and let us know how 
it works for you. That way, we can address any issues before we convert the 
other plugins.


Thanks,
- Paul

[1] https://drill.apache.org/docs/create-or-replace-schema/


Re: Questions about bushy join

2019-05-27 Thread Paul Rogers
Hi All,

Weijie, do you have some example plans that would appear to be sub-optimal, and 
would be improved with a bushy join plan? What characteristic of the query or 
schema causes the need for a bushy plan?

FWIW, Impala uses a compromise approach: it evaluates left-deep plans, then 
will "flip" a join if the build side turns out to be larger than the probe 
side. This may just be an artifact of Impala's cost model which is designed for 
star schemas, looks only one step ahead, and struggles with queries that do not 
fit the pattern. (Impala especially struggles with multi-key joins and 
correlated filters on joined tables.) But, since the classic data warehouse use 
case tends to have simple star schemas, the Impala approach works pretty well 
in practice. (Turns out that Snowflake, in their paper, claims to do something 
similar. [1])

On the other hand, it might be that Calcite, because it uses a true cost model, 
already produces optimal plans and the join-flip trick is unnecessary.

A case where this trick seemed to help is the idea of joining two fact tables, 
each of which is filtered via dimension tables. Making something up:

- join on itemid
  - join on sales.stateid = state.id
    - state table where state.name = "CA"
    - sales
  - join on returns.reasonId = reason.id
    - reason table where reason.name = "defective"
    - returns


That is, we have large fact tables for sales and returns. We filter both using 
a dimension table. Then, we join the (greatly reduced) fact data sets on the 
item ID. A left-deep plan will necessarily be less efficient because of the 
need to move an entire fact set though a join. (Though the JPPD feature might 
reduce the cost by filtering early.)


In any event, it would be easy to experiment with this idea in Drill. Drill 
already has several post-Calcite rule sets. It might be fairly easy to add one 
that implements the join-flip case. Running this experiment on a test workload 
would identify if the rule is ever needed, and if it is triggered, if the 
result improves performance.
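
For anyone who wants to run that experiment, here is a minimal sketch of 
grouping the Calcite rules into a rule set. MultiJoinOptimizeBushyRule is the 
rule named earlier in this thread; JoinToMultiJoinRule is its usual companion 
that builds the MultiJoin node it operates on; how Drill's PlannerPhase 
actually registers rules is glossed over here:

import org.apache.calcite.rel.rules.JoinToMultiJoinRule;
import org.apache.calcite.rel.rules.MultiJoinOptimizeBushyRule;
import org.apache.calcite.tools.RuleSet;
import org.apache.calcite.tools.RuleSets;

public class BushyJoinRules {
  public static RuleSet bushyJoinRuleSet() {
    return RuleSets.ofList(
        JoinToMultiJoinRule.INSTANCE,          // collapse a join tree into a MultiJoin node
        MultiJoinOptimizeBushyRule.INSTANCE);  // enumerate bushy orders over the MultiJoin
  }
}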


Thanks,
- Paul

[1] http://info.snowflake.net/rs/252-RFO-227/images/Snowflake_SIGMOD.pdf
 

On Monday, May 27, 2019, 2:04:29 PM PDT, Aman Sinha  
wrote:  
 
 Hi Weijie,
As you might imagine, bushy joins have pros and cons compared to left-deep-only
plans: The main pro is that they enumerate a lot more plan choices
such that the planner is likely to find the optimal join order.  On the
other hand, there are significant cons: (a) by enumerating more join
orders, they would substantially increase planning time (depending on the
number of tables).  (b) the size of the intermediate results produced by
the join must be accurately estimated in order to avoid situations where
hash join build side turns out to be orders of magnitude more than
estimated.  This could happen easily in big data systems where statistics
are constantly changing due to new data ingestion and even running ANALYZE
continuously is not feasible.
That said, it is not a bad idea to experiment with such plans with say more
than 5 table joins and compare with left-deep plans.

Aman

On Mon, May 27, 2019 at 7:00 AM weijie tong  wrote:

> Hi all:
>  Does anyone know why we don't support bushy join in the query plan
> generation while hep planner is enabled. The codebase shows the fact that
> the PlannerPhase.JOIN_PLANNING use the LoptOptimizeJoinRule not calcite's
> MultiJoinOptimizeBushyRule.
>
  

Re: adding insert

2019-05-27 Thread Paul Rogers
Hi Ted,

From item 3, it sounds like you are focusing on using Drill to front a DB 
system, rather than proposing to use Drill to update files in a distributed 
file system (DFS).


Turns out that, for the DFS case, the former HortonWorks put quite a bit into 
working out viable insert/update semantics in Hive with the Hive ACID support. 
[1], [2] This was a huge amount of work done in conjunction with various 
partners, and is on its third version as Hive learns the semantics and how to 
get ACID to perform well under load. Adding ACID support to Drill would be a 
"non-trivial" exercise (unless Drill could actually borrow Hive's code, but 
even that might not be simple.)


Drill is far simpler than Hive because Drill has long exploited the fact that 
data is read-only. Once data can change, we must revisit various aspects to 
account for that fact. Since change can occur concurrently with queries (and 
other changes), some kind of concurrency control is needed. Hive has worked out 
a way to ensure that only completed transactions are included in a query by 
using delta files. Hive delta files can include inserts, updates and deletes.

If insert is all that is needed, then there may be simpler solutions: just 
track which files are newly added. If the underlying file system is atomic, 
then even this can be simplified down to just noticing that a file exist when 
planning a query. If the file is visible before it is complete, then some form 
of mechanism is needed to detect in-progress files. Of course, Drill must 
already handle this case for files created outside of Drill, so it may "just 
work" for the DFS case.


And, if the goal is simply to push insert into a DB, then the DB itself can 
handle transactions and concurrency. Generally, most DBs manage transactions as 
part of a session. To ensure Drill does a consistent insert, Drill would need 
to push the update though a single client (single minor fragment). A 
distributed insert (using multiple minor fragments each inserting a subset of 
rows) would require two-phase commit, or would have to forgo consistency. (The 
CAP problem.) Further, Drill would have to handle insert failures (deadlock 
detection, duplicate keys, etc.) reported by the target DB and return that 
error to the Drill client (hopefully in a form other than a long Java stack 
trace...)

All this said, I suspect you have in mind a specific use case that is far 
simpler than the general case. Can you explain more a bit what you have in mind?

Thanks,
- Paul

[1] 
https://hortonworks.com/tutorial/using-hive-acid-transactions-to-insert-update-and-delete-data/
[2] 
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_3_internals.html



 

On Monday, May 27, 2019, 1:15:36 PM PDT, Ted Dunning 
 wrote:  
 
 I would like to start a discussion about how to add insert capabilities to
drill.

It seems that the basic outline is:

1) making sure Calcite will parse it (almost certain)
2) defining an upsert operator in the logical plan
3) push rules into Drill from the DB driver to allow Drill to push down the
upsert into DB

Are these generally correct?

Can anybody point me to analogous operations?
  

[jira] [Created] (DRILL-7279) Support provided schema for CSV without headers

2019-05-26 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7279:
--

 Summary: Support provided schema for CSV without headers
 Key: DRILL-7279
 URL: https://issues.apache.org/jira/browse/DRILL-7279
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.17.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7278) Refactor result set loader projection mechanism

2019-05-25 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7278:
--

 Summary: Refactor result set loader projection mechanism
 Key: DRILL-7278
 URL: https://issues.apache.org/jira/browse/DRILL-7278
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.17.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7261) Rollup of CSV V3 fixes

2019-05-15 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7261:
--

 Summary: Rollup of CSV V3 fixes
 Key: DRILL-7261
 URL: https://issues.apache.org/jira/browse/DRILL-7261
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.17.0


Rollup of related CSV V3 fixes along with supporting row set framework fixes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [ANNOUNCE] New Committer: Jyothsna Donapati

2019-05-09 Thread Paul Rogers
Congratulations! Well deserved.

Thanks,
- Paul

 

On Thursday, May 9, 2019, 2:28:09 PM PDT, Aman Sinha  
wrote:  
 
 The Project Management Committee (PMC) for Apache Drill has invited Jyothsna
Donapati to become a committer, and we are pleased to announce that she has
accepted.

Jyothsna has been contributing to Drill for about 1 1/2 years.  She
initially contributed the graceful shutdown capability and more recently
has made several crucial improvements in the parquet metadata caching which
have gone into the 1.16 release.  She also co-authored the design document
for this feature.

Welcome Jyothsna, and thank you for your contributions.  Keep up the good work
!

-Aman
(on behalf of Drill PMC)
  

[jira] [Created] (DRILL-7224) Update example row set test

2019-04-28 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7224:
--

 Summary: Update example row set test
 Key: DRILL-7224
 URL: https://issues.apache.org/jira/browse/DRILL-7224
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.17.0


The example row set test {{ExampleTest}} is a bit outdated. This PR will update 
it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Hangout Discussion Topics for 04-16-2019

2019-04-24 Thread Paul Rogers
Hi Igor,

Thanks for the recap. You asked about vector allocation. Here is where I think 
things stand. Others can fill in details that I may miss.

We have several ways to size value vectors; but no single standard. As you 
note, the most common way is simply to accept the cost of letting the vector 
double in size multiple times.

One way to pre-allocate vectors is to use the "sizer" along with its associated 
allocation helper. This was always meant to be a quick & dirty temporary 
solution, but has turned out, I believe, to be the primary vector size 
management solution in most operators.

Another is the new row set framework: vector size (in terms of number of items 
and estimated item size) is expressed in metadata, then is used to allocate 
each new batch to the desired size.

You can also just do the work yourself: pick a number, and, when allocating a 
vector, tell it to use that size. You then take on the task of estimating 
average width, picking a good target number of rows for your batch, working out 
the number of items in arrays, etc. (This is, in fact, what the other two 
methods mentioned above actually do.)
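
For that do-it-yourself case, the arithmetic is roughly the following (plain 
Java, no Drill API; the 65,536 figure is Drill's per-batch record limit):

public class BatchSizeEstimate {
  private static final int MAX_ROWS_PER_BATCH = 65_536;   // Drill's per-batch record limit

  // Given a memory budget for the whole batch and estimated per-column widths
  // (in bytes), pick a target row count for vector allocation.
  public static int targetRowCount(long batchBudgetBytes, int... estColumnWidthsBytes) {
    long rowWidth = 0;
    for (int width : estColumnWidthsBytes) {
      rowWidth += width;
    }
    long rows = batchBudgetBytes / Math.max(rowWidth, 1);
    return (int) Math.min(rows, MAX_ROWS_PER_BATCH);
  }

  public static void main(String[] args) {
    // Example: an 8 MB batch budget with columns estimated at 4, 8 and 50 bytes.
    // Prints 65536: here the row cap, not the memory budget, is the binding limit.
    System.out.println(targetRowCount(8 * 1024 * 1024, 4, 8, 50));
  }
}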

The key problem with the ad-hoc techniques is that they can't limit maximum 
vector size to 16 MB (to avoid Netty fragmentation) nor limit overall batch 
size to some reasonable number. The ad-hoc techniques can also lead to internal 
fragmentation (excessive unused space within each vector.) Solving these 
problems is what the row set framework was designed to do.

Thanks,
- Paul

 

On Wednesday, April 24, 2019, 10:48:44 AM PDT, Igor Guzenko 
 wrote:  
 
 Hello Everyone,

Sorry for the late reply, here are the presentations:

Map vector    -
https://docs.google.com/presentation/d/1FG4swOrkFIRL7qjiP7PSOPy8a1vnxs5Z9PM3ZfRPRYo/edit#slide=id.p
Hive complex types  -
https://docs.google.com/presentation/d/1nc0ID5aju-qj-7hjquFpH-TwGjeReWTYogsExuOe8ZA/edit?usp=sharing
.

Discussion results for Map new vector:
- Need to eliminate possibility of key duplication;
- Need to check Hive behavior when ORDER BY is performed for Map
complex type column;
- Need to describe design and all use cases for the vector in design document.

Discussion results for Hive complex types:
- Aman Sinha made a few great suggestions. First, the creation of Hive writers
may be done once per table scan; second, at this moment it would be good to
calculate sizes for vectors and allocate early. Need to provide a few examples
describing how the allocation will work for complex types.
- Need to describe suggested approach in design document and proceed
discussion there.

Question from my side: have we already implemented predicted allocation of
value vectors somewhere? Any example would be useful, because right now I can
see that our existing vector writers usually use the mutator's setSafe(...)
methods, inside which the buffer size may be increased when necessary.

The future design document will be located at
https://docs.google.com/document/d/1yEcaJi9dyksfMs4w5_GsZCQH_Pffe-HLeLVNNKsV7CA/edit?usp=sharing
.
Please feel free to leave your comments and suggestions in the
document and presentations.

Thanks,
Igor Guzenko


On Wed, Apr 17, 2019 at 3:04 AM Jyothsna Reddy  wrote:
>
> Hi All,
> The hangout will start at 9:30 AM PST instead of 10 AM PST on 04-18-2019.
>
>
> Thank you,
> Jyothsna
>
>
>
>
> On Tue, Apr 16, 2019 at 2:00 PM Jyothsna Reddy 
> wrote:
>
> > Hi Charles,
> > Yes, sure!! Probably we can start with your discussion first and Hive
> > complex types later since there will be some discussion around the latter
> > topic.
> >
> > Thank you,
> > Jyothsna
> >
> >
> >
> >
> > On Tue, Apr 16, 2019 at 1:40 PM Charles Givre  wrote:
> >
> >> Hi Jyothsna,
> >> Could I get a few minutes on the next Hangout to promote the Drill day at
> >> ApacheCon?
> >> Thanks
> >>
> >> > On Apr 16, 2019, at 16:38, Jyothsna Reddy 
> >> wrote:
> >> >
> >> > Hi Everyone,
> >> >
> >> > Here are some key points of today's hangout discussion:
> >> >
> >> > Sorabh mentioned that there are some regressions in TPCDS queries and
> >> > it's a blocker for the 1.16 release.
> >> >
> >> > Bohdan presented their proposal for Hive Complex types support. Here are
> >> > some of the important points
> >> >
> >> >  - Structure of MapVector: keys are of a primitive type, while values can
> >> >  be of either a primitive or a complex type.
> >> >  - MapReader and MapWriter are used to read from and write to the
> >> >  MapVector.
> >> >  - MapWriter tracks the current row/length and is used to calculate the
> >> >  write position and offset.
> >> >
> >> > The following are some of the questions from the audience:
> >> >
> >> >  - Will the types be implicitly cast, since Calcite supports keys of
> >> >  type int and string?
> >> >  - Future improvements include sorting the keys for better lookup. Is
> >> >  it per row or across all the rows?
> >> >
> >> > Since there is more to discuss, there will be a hangout session on
> >> > 04-18-2019 at 10 AM PST (link
> >> > 

Re: QUESTION: Packet Parser for PCAP Plugin

2019-04-23 Thread Paul Rogers
Hi Charles,

Two comments. 

First, Drill "maps" are actually structs (nested tuples): every record must 
have the same set of columns within the "map." That is, though the Drill type 
is called a "map", and you might assume that, given that name, it would act 
like a JSON, Python of Java map, the actual implementation is, in fact, a 
struct. (I saw a JIRA ticket to rename the Map type in some context because of 
this unfortunate mismatch of name and implementation.)

By contrast, Hive defines both Map and Struct types. A Drill "Map" is like a 
Hive Struct, and Drill has no equivalent of a Hive Map. Still, there are 
solutions.

To use a single parsed_packet map column, you'd have to know the union of all 
the columns you'll create across all the packet types and define a map schema 
that includes all these columns. Define this map in all batches so you have a 
consistent schema. This means including all columns for all packet types, even 
if the data does not happen to have all packet types.

Or, you could define a different map for each packet type; but you'd still have 
to define the needed ones up front. You could do this if you had columns 
called, say, parsed_x_packet, parsed_y_packet, etc. If that packet type is 
projected (appears in the SELECT ... clause), then define the required schema 
for all records. The user just selects the packet types of interest.

This brings us to the second comment. The long work to merge the row set 
framework into Drill is coming to a close, and it is now available for you to 
use. The row set framework provides a very simple way to define your map 
schemas (once you know what they are). It also handles projection: the user 
selects some of your parsed packets, but not others, or projects some of the 
packet map columns, but not others.
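
For example, a map schema for this idea could be declared with the row set
framework's SchemaBuilder roughly as follows. This is a sketch only: the
column names are invented for illustration, and package locations may differ
slightly between Drill versions:

  import org.apache.drill.common.types.TypeProtos.MinorType;
  import org.apache.drill.exec.record.metadata.SchemaBuilder;
  import org.apache.drill.exec.record.metadata.TupleMetadata;

  // One map per packet type, each with a fixed set of member columns.
  TupleMetadata schema = new SchemaBuilder()
      .add("packet_type", MinorType.VARCHAR)
      .addMap("parsed_dns_packet")               // hypothetical column
        .addNullable("query_name", MinorType.VARCHAR)
        .addNullable("response_code", MinorType.INT)
        .resumeSchema()
      .addMap("parsed_icmp_packet")              // hypothetical column
        .addNullable("icmp_type", MinorType.INT)
        .addNullable("icmp_code", MinorType.INT)
        .resumeSchema()
      .buildSchema();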

Drill 1.16 migrates the CSV reader to the new framework (where it also supports 
user-defined schemas and type conversions.) The next step in the row set work 
is to migrate a few other readers to the new framework. Perhaps, PCAP might be 
a good candidate to enable your new packet-parsing feature.


Thanks,
- Paul

 

On Tuesday, April 23, 2019, 9:34:16 AM PDT, Charles Givre 
 wrote:  
 
 Hello all,
I saw a few open source libraries that parse actual packet content and was 
interested in incorporating this into Drill's PCAP parser.  I was thinking 
initially of writing this as a UDF; however, I think it would be much better to 
include this directly in Drill.  What I was thinking was to create a field 
called parsed_packet that would be a Drill Map.  The contents of this field 
would vary depending on the type of packet.  For instance, if it is a DNS 
packet, you get all the DNS info; the same for ICMP, and so on.
Does the community think this is a good idea?  Also, given the structure of the 
PCAP plugin, I'm not quite sure how to create a Map field with variable 
contents.  Are there any examples that use the same architecture as the PCAP 
plugin?
Thanks,
-- C  

Re: [Discuss] Integrate Arrow gandiva into Drill

2019-04-20 Thread Paul Rogers
Hi Weijie,

Thanks much for the explanation. Sounds like you are making good progress.


For which operator is the filter pushed into the scan? Although Impala does 
this for all scans, AFAIK, Drill does not do so. For example, the text and JSON 
readers do not handle filtering. Filtering is instead done by the Filter 
operator in these cases. Perhaps you have your own special scan which handles 
filtering?


The concern in DRILL-6340 was the user might do a project operation that causes 
the output batch to be much larger than the input batch. Someone suggested 
flatten as one example. String concatenation is another example. The input 
batch might be large. The result of the concatenation could be too large for 
available memory. So, the idea was to project the single input batch into two 
(or more) output batches to control batch size.
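
A rough sketch of that splitting idea (the numbers here are illustrative
assumptions, not taken from DRILL-6340):

  // Decide how many output batches a single input batch must be split into so
  // that no output batch exceeds a memory budget after, say, concatenation.
  int inputRows        = 65_536;            // rows in the incoming batch (assumption)
  int estOutputRowSize = 1_024;             // estimated bytes per projected row (assumption)
  int outputBudget     = 16 * 1024 * 1024;  // per-output-batch budget (assumption)

  int rowsPerOutputBatch = Math.max(1, outputBudget / estOutputRowSize);              // 16,384
  int outputBatchCount   = (inputRows + rowsPerOutputBatch - 1) / rowsPerOutputBatch; // 4, i.e. ceil()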


I like how you've categorized the vectors into the set that Gandiva can 
project, and the set that Drill must handle. Maybe you can extend this idea for 
the case where input batches are split into multiple output batches.

Let Drill handle VarChar expressions that could increase column width (such as 
the concatenation operator). Let Drill decide the number of rows in the output 
batch. Then, for the columns that Gandiva can handle, project just those rows 
needed for the current output batch.

Your solution might also be extended to handle the Gandiva library issue. Since 
you are splitting vectors into the Drill group and the Gandiva group, if Drill 
runs on a platform without Gandiva support, or if the Gandiva library can't be 
found, just let all vectors fall into the Drill vector group.
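
A sketch of that fallback logic, kept generic since it does not use any actual
Drill or Gandiva classes; the capability check is passed in as a predicate:

  import java.util.List;
  import java.util.function.Predicate;

  // Partition projection expressions into a Gandiva group and a Drill group.
  // When the native library cannot be loaded, pass a predicate that always
  // returns false so every expression falls back to Drill's own codegen.
  static <T> void partition(List<T> exprs, Predicate<T> gandivaSupports,
                            List<T> gandivaGroup, List<T> drillGroup) {
    for (T expr : exprs) {
      if (gandivaSupports.test(expr)) {
        gandivaGroup.add(expr);
      } else {
        drillGroup.add(expr);
      }
    }
  }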

If the user wants to use Gandiva, he/she could set a config option to point to 
the Gandiva library (and supporting files, if any.) Or, use the existing 
LD_LIBRARY_PATH env. variable.

Thanks,
- Paul

 

On Thursday, April 18, 2019, 11:45:08 PM PDT, weijie tong 
 wrote:  
 
 Hi Paul:
Currently Gandiva only supports Project and Filter operations. My work is to
integrate the Project operator, since most Filter operations will be
pushed down to the Scan.

The Gandiva project interface works at the RecordBatch level. It accepts
the memory addresses of the vectors of the input RecordBatch. Before that,
it also needs to construct a binary schema object to describe the input
RecordBatch schema.

The integration work mainly has two parts:
  1. At the setup step, find the expressions which can be handled by
Gandiva. The matched expressions will be handled by Gandiva; the others will
still be handled by Drill.
  2. Invoke the Gandiva native project method. For the matched expressions,
corresponding ValueVectors with Arrow-style null representation will be
allocated, and the null bits of the input vectors will be set. The same work
is also done for the output ValueVectors: the Arrow-style output null vectors
are transferred back to Drill's null vectors. Since the native method only
cares about the physical memory addresses, invoking it is not hard work.

Since my current implementation predates DRILL-6340, it does not handle the
case where the project's output batch must be smaller than the input batch. To
cover that case, there is some more work to do which I have not yet focused on.

To contribute this to the community, there is also a test-case problem which
needs to be considered, since the Gandiva jar is platform dependent.




On Fri, Apr 19, 2019 at 8:43 AM Paul Rogers 
wrote:

> Hi Weijie,
>
> Thanks much for the update on your Gandiva work. It is great work.
>
> Can you say more about how you are doing the integration?
>
> As you mentioned the memory layout of Arrow's null vector differs from the
> "is set" vector in Drill. How did you work around that?
>
> The Project operator is pretty simple if we are just copying or removing
> columns. However, much of Project deals with invoking Drill-provided
> functions: simple ones (add two ints) and complex ones (perform a regex
> match). To be useful, the integration would have to mimic Drill's behavior
> for each of these many functions.
>
> Project currently works row-by-row. But, to get the maximum performance,
> it would work column-by-column to take full advantage of vectorization.
> Doing that would require large changes to the code that sets up codegen,
> and iterates over the batch.
>
>
> For operators such as Sort, the only vector-based operations are 1) sort a
> batch using defined keys to get an offset vector, and 2) create a new
> vector by copying values, row-by-row, from one batch to another according
> to the offset vector.
>
> The join and aggregate operations are even more complex, as are the
> partition senders and receivers.
>
> Can you tell us where you've used Gandiva? Which operators? How did you
> handle the function integration? I am very curious how you were able to
> solve these problems.
>
>
> Thanks,
>
>
