Re: Single Hdfs block per parquet file

2017-03-22 Thread Padma Penumarthy
Yes, it seems it is possible to create files with different block sizes.
We could potentially pass the configured store.parquet.block-size to the
create call. I will try it out and let you know.

Thanks,
Padma 
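
For reference, a minimal sketch (not Drill code; the file name and the
parquetBlockSize value are illustrative) of the FileSystem.create overload
discussed in this thread, whose final long argument sets the per-file HDFS
block size:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleBlockFileWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // e.g. the configured store.parquet.block-size
    long parquetBlockSize = 512L * 1024 * 1024;
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);
    short replication = fs.getDefaultReplication(new Path("/"));

    // Matching the per-file HDFS block size to the Parquet block size keeps
    // each file within a single HDFS block.
    FSDataOutputStream out = fs.create(new Path("/tmp/example.parquet"),
        true, bufferSize, replication, parquetBlockSize);
    // ... write the Parquet bytes ...
    out.close();
  }
}
{code}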


> On Mar 22, 2017, at 4:16 PM, François Méthot wrote:
> 
> Here are 2 links I could find:
> 
> http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
> 
> http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
> 
> Francois
> 
> On Wed, Mar 22, 2017 at 4:29 PM, Padma Penumarthy wrote:
> 
>> I think we create one file for each parquet block.
>> If the underlying HDFS block size is 128 MB and the parquet block size is
>> greater than 128 MB, it will create more blocks on HDFS.
>> Can you let me know which HDFS API would allow you to do otherwise?
>> 
>> Thanks,
>> Padma
>> 
>> 
>>> On Mar 22, 2017, at 11:54 AM, François Méthot wrote:
>>> 
>>> Hi,
>>> 
>>> Is there a way to force Drill to store CTAS-generated parquet files as a
>>> single block when using HDFS? The Java HDFS API allows this: files could
>>> be created with the Parquet block size.
>>> 
>>> We are using Drill on HDFS configured with a block size of 128 MB.
>>> Changing this size is not an option at this point.
>>> 
>>> It would be ideal for us to have a single parquet file per HDFS block;
>>> setting store.parquet.block-size to 128 MB would fix our issue, but we
>>> would end up with a lot more files to deal with.
>>> 
>>> Thanks
>>> Francois
>> 
>> 



[GitHub] drill pull request #793: DRILL-4678: Tune metadata by generating a dispatche...

2017-03-22 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/793#discussion_r107561519
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdRowCount.java ---
@@ -14,35 +14,71 @@
  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  * See the License for the specific language governing permissions and
  * limitations under the License.
- **/
+ */
 package org.apache.drill.exec.planner.cost;
 
+import org.apache.calcite.rel.SingleRel;
 import org.apache.calcite.rel.core.Aggregate;
 import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.core.Sort;
+import org.apache.calcite.rel.core.Union;
 import org.apache.calcite.rel.metadata.ReflectiveRelMetadataProvider;
 import org.apache.calcite.rel.metadata.RelMdRowCount;
 import org.apache.calcite.rel.metadata.RelMetadataProvider;
+import org.apache.calcite.rel.metadata.RelMetadataQuery;
 import org.apache.calcite.util.BuiltInMethod;
 import org.apache.calcite.util.ImmutableBitSet;
+import org.apache.drill.exec.planner.common.DrillLimitRelBase;
 
 public class DrillRelMdRowCount extends RelMdRowCount{
   private static final DrillRelMdRowCount INSTANCE = new DrillRelMdRowCount();
 
   public static final RelMetadataProvider SOURCE =
       ReflectiveRelMetadataProvider.reflectiveSource(BuiltInMethod.ROW_COUNT.method, INSTANCE);
 
   @Override
-  public Double getRowCount(Aggregate rel) {
+  public Double getRowCount(Aggregate rel, RelMetadataQuery mq) {
     ImmutableBitSet groupKey = ImmutableBitSet.range(rel.getGroupCount());
 
     if (groupKey.isEmpty()) {
       return 1.0;
     } else {
-      return super.getRowCount(rel);
+      return super.getRowCount(rel, mq);
     }
   }
 
   @Override
-  public Double getRowCount(Filter rel) {
-    return rel.getRows();
+  public Double getRowCount(Filter rel, RelMetadataQuery mq) {
--- End diff --

Seems to me we do not have to override these getRowCount() calls; the parent 
class already provides the implementations. Any reason why you want to 
override these methods? 




[jira] [Resolved] (DRILL-5001) Join only supports implicit casts error even when I have explicit cast

2017-03-22 Thread Rahul Challapalli (JIRA)

 [ https://issues.apache.org/jira/browse/DRILL-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rahul Challapalli resolved DRILL-5001.
--
Resolution: Not A Bug

Ok... this is not a bug. The underlying parquet data actually contained a 
varchar type that I wrongly assumed was a date type.

> Join only supports implicit casts error even when I have explicit cast
> --
>
> Key: DRILL-5001
> URL: https://issues.apache.org/jira/browse/DRILL-5001
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Reporter: Rahul Challapalli
> Attachments: error.log, fewtypes_null_large.tgz
>
>
> git.commit.id.abbrev=190d5d4
> The query below fails even though I have an explicit cast on the right-hand 
> side of the join condition. The data also contains a metadata cache.
> {code}
> select
>   a.int_col,
>   b.date_col 
> from
>   dfs.`/drill/testdata/parquet_date/metadata_cache/mixed/fewtypes_null_large` a 
>   inner join
> (
>   select
> * 
>   from
>     dfs.`/drill/testdata/parquet_date/metadata_cache/mixed/fewtypes_null_large` 
>   where
> dir0 = '1.2' 
> and date_col > '1996-03-07' 
> )
> b 
> on a.date_col = cast(date_add(b.date_col, 5) as date) 
> where
>   a.int_col = 7 
>   and a.dir0 = '1.9' 
> group by
>   a.int_col,
>   b.date_col;
> Error: SYSTEM ERROR: DrillRuntimeException: Join only supports implicit casts 
> between 1. Numeric data 2. Varchar, Varbinary data 3. Date, Timestamp data 
> Left type: DATE, Right type: VARCHAR. Add explicit casts to avoid this error
> Fragment 2:0
> [Error Id: a1b26420-af35-4892-9a87-d9b04e4423dc on qa-node190.qa.lab:31010] 
> (state=,code=0)
> {code}
> I attached the data and the log file.





Re: Single Hdfs block per parquet file

2017-03-22 Thread François Méthot
Here are 2 links I could find:

http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)

http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)

Francois

On Wed, Mar 22, 2017 at 4:29 PM, Padma Penumarthy wrote:

> I think we create one file for each parquet block.
> If the underlying HDFS block size is 128 MB and the parquet block size is
> greater than 128 MB, it will create more blocks on HDFS.
> Can you let me know which HDFS API would allow you to do otherwise?
>
> Thanks,
> Padma
>
>
> > On Mar 22, 2017, at 11:54 AM, François Méthot wrote:
> >
> > Hi,
> >
> > Is there a way to force Drill to store CTAS-generated parquet files as a
> > single block when using HDFS? The Java HDFS API allows this: files could
> > be created with the Parquet block size.
> >
> > We are using Drill on HDFS configured with a block size of 128 MB.
> > Changing this size is not an option at this point.
> >
> > It would be ideal for us to have a single parquet file per HDFS block;
> > setting store.parquet.block-size to 128 MB would fix our issue, but we
> > would end up with a lot more files to deal with.
> >
> > Thanks
> > Francois
>
>


[GitHub] drill pull request #793: DRILL-4678: Tune metadata by generating a dispatche...

2017-03-22 Thread Serhii-Harnyk
GitHub user Serhii-Harnyk opened a pull request:

https://github.com/apache/drill/pull/793

DRILL-4678: Tune metadata by generating a dispatcher at runtime

Changes for rebasing to Calcite 1.4.0-drill-r20 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Serhii-Harnyk/drill DRILL-4678

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/793.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #793


commit 794e9537bbe7d25abe3809db5847d898c45c73e1
Author: Serhii-Harnyk 
Date:   2017-03-03T15:24:26Z

DRILL-4678: Tune metadata by generating a dispatcher at runtime






Re: Single Hdfs block per parquet file

2017-03-22 Thread Padma Penumarthy
I think we create one file for each parquet block.
If the underlying HDFS block size is 128 MB and the parquet block size is
greater than 128 MB, it will create more blocks on HDFS.
Can you let me know which HDFS API would allow you to do otherwise?

Thanks,
Padma


> On Mar 22, 2017, at 11:54 AM, François Méthot wrote:
> 
> Hi,
> 
> Is there a way to force Drill to store CTAS-generated parquet files as a
> single block when using HDFS? The Java HDFS API allows this: files could
> be created with the Parquet block size.
> 
> We are using Drill on HDFS configured with a block size of 128 MB.
> Changing this size is not an option at this point.
> 
> It would be ideal for us to have a single parquet file per HDFS block;
> setting store.parquet.block-size to 128 MB would fix our issue, but we
> would end up with a lot more files to deal with.
> 
> Thanks
> Francois



Single Hdfs block per parquet file

2017-03-22 Thread François Méthot
Hi,

Is there a way to force Drill to store CTAS-generated parquet files as a
single block when using HDFS? The Java HDFS API allows this: files could
be created with the Parquet block size.

We are using Drill on HDFS configured with a block size of 128 MB.
Changing this size is not an option at this point.

It would be ideal for us to have a single parquet file per HDFS block;
setting store.parquet.block-size to 128 MB would fix our issue, but we
would end up with a lot more files to deal with.

Thanks
Francois


[jira] [Created] (DRILL-5376) Rationalize Drill's row structure for simpler code, better performance

2017-03-22 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5376:
--

 Summary: Rationalize Drill's row structure for simpler code, 
better performance
 Key: DRILL-5376
 URL: https://issues.apache.org/jira/browse/DRILL-5376
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.10.0
Reporter: Paul Rogers


Drill is a columnar system, but data is ultimately represented as rows (AKA 
records or tuples.) The way that Drill represents rows leads to excessive code 
complexity and runtime cost.

Data in Drill is stored in vectors: one (or more) per column. Vectors do not 
stand alone, however; they are "bundled" into various forms of grouping: the 
{{VectorContainer}}, {{RecordBatch}}, {{VectorAccessible}}, 
{{VectorAccessibleSerializable}}, and more. Each has slightly different 
semantics, requiring large amounts of code to bridge between the 
representations.

Consider only a simple row: one with only scalar columns. In classic relational 
theory, such a row is a tuple:

{code}
R = (a, b, c, d, ...)
{code}

A tuple is defined as an ordered list of column values. Unlike a list or array, 
the column values also have names and may have varying data types.

In SQL, columns are referenced by either position or name. In most execution 
engines, columns are referenced by position (since positions, in most systems, 
cannot change.) A 1:1 mapping is provided between names and positions. (See the 
JDBC {{RecordSet}} interface.)

This allows code to be very fast: code references columns by index, not by 
name, avoiding name lookups for each column reference.

Drill provides a murky, hybrid approach. Some structures ({{BatchSchema}}, for 
example) appear to provide a fixed column ordering, allowing indexed column 
access. But other abstractions provide only an iterator. Others (such as 
{{VectorContainer}}) provide name-based access or, with clever programming, 
indexed access.

As a result, it is never clear exactly how to access a column quickly: by 
name directly, or by mapping the name to a multi-part index and then to a 
vector?

Of course, Drill also supports maps, which add to the complexity. First, we 
must understand that a "map" in Drill is not a "map" in the classic sense: it 
is not a JSON-style collection of (name, value) pairs in which each instance 
may have a different set of pairs.

Instead, in Drill, a "map" is really a nested tuple: a map has the same 
structure as a Drill record: a collection of names and values in which all rows 
have the same structure. (This is so because maps are really a collection of 
value vectors, and the vectors cut across all rows.)

Drill, however, does not reflect this symmetry: that a row and a map are both 
tuples. There are no common abstractions for the two. Instead, maps are 
represented as a {{MapVector}} that contains a (name, vector) map for its 
children.

Because of this name-based mapping, high-speed indexed access to vectors is not 
provided "out of the box." Certainly each consumer of a map can build its own 
indexing mechanism. But, this leads to code complexity and redundancy.

This ticket asks to rationalize Drill's row, map and schema abstractions around 
the tuple concept. A schema is a description of a tuple and should (as in JDBC) 
provide both name and index based access. That is, provide methods of the form:

{code}
MaterializedField getField(int index);
MaterializedField getField(String name);
...
ValueVector getVector(int index);
ValueVector getVector(String name);
{code}

Provide a common abstraction for rows and maps, recognizing their structural 
similarity.

There is an obvious issue with indexing columns in a row when the row contains 
maps. Should indexing be multi-part (index into row, then into map) as today? A 
better alternative is to provide a flattened interface:

{code}
0: a, 1: b.x, 2: b.y, 3: c, ...
{code}

Use this change to simplify client code, over time, toward simple 
index-based column access.
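
A hypothetical sketch of what such a common abstraction could look like. 
{{TupleSchema}} and {{TupleAccessor}} are illustrative names, not existing 
Drill APIs; {{MaterializedField}} and {{ValueVector}} are Drill's own classes:

{code}
import org.apache.drill.exec.record.MaterializedField;
import org.apache.drill.exec.vector.ValueVector;

// Describes a tuple: a row or a Drill "map", since both are structurally
// tuples. Offers both name- and index-based access, as in JDBC.
interface TupleSchema {
  int columnCount();
  MaterializedField getField(int index);    // indexed access
  MaterializedField getField(String name);  // name-based access
  int columnIndex(String name);             // 1:1 name-to-position mapping
}

// Accessor shared by rows (VectorContainer) and maps (MapVector). A
// flattened variant could index nested columns as 0: a, 1: b.x, 2: b.y, ...
interface TupleAccessor {
  TupleSchema schema();
  ValueVector getVector(int index);
  ValueVector getVector(String name);
}
{code}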





[GitHub] drill pull request #792: DRILL-4971: query encounters system error: Statemen...

2017-03-22 Thread vdiravka
GitHub user vdiravka opened a pull request:

https://github.com/apache/drill/pull/792

DRILL-4971: query encounters system error: Statement "break AndOP3" i…

…s not enclosed by a breakable statement with label "AndOP3"

- Evaluation blocks generated for boolean operators should always be 
enclosed in braces, since they use labels.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vdiravka/drill DRILL-4971

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/792.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #792


commit b8950f88ee0ffda6ae2f6dd31a29dedda08e0189
Author: Vitalii Diravka 
Date:   2017-03-17T11:41:46Z

DRILL-4971: query encounters system error: Statement "break AndOP3" is not 
enclosed by a breakable statement with label "AndOP3"
- Evaluation blocks generated for boolean operators should always be 
enclosed in braces, since they use labels.
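
For context, a minimal hand-written illustration (not the actual generated
code) of the Java rule behind this fix: a "break AndOP3" statement compiles
only when it is lexically enclosed by a statement labeled AndOP3, which is
why the generated boolean-operator blocks need enclosing braces:

{code}
public class LabeledBreakDemo {
  public static void main(String[] args) {
    boolean left = false;
    boolean right = true;
    boolean result;
    // The braces make the labeled region a single enclosing block, so the
    // "break AndOP3" below is legal. Without them, the label would apply
    // only to the first statement and the break would not compile.
    AndOP3: {
      if (!left) {
        result = false;   // short-circuit AND: left is false
        break AndOP3;
      }
      result = right;
    }
    System.out.println("left AND right = " + result);
  }
}
{code}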






[jira] [Created] (DRILL-5375) Nested loop join: return correct result for left join

2017-03-22 Thread Arina Ielchiieva (JIRA)
Arina Ielchiieva created DRILL-5375:
---

 Summary: Nested loop join: return correct result for left join
 Key: DRILL-5375
 URL: https://issues.apache.org/jira/browse/DRILL-5375
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Arina Ielchiieva
Assignee: Arina Ielchiieva


Mini repro:
1. Create 2 Hive tables with data
{code}
CREATE TABLE t1 (
  FYQ varchar(999),
  dts varchar(999),
  dte varchar(999)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

2016-Q1,2016-06-01,2016-09-30
2016-Q2,2016-09-01,2016-12-31
2016-Q3,2017-01-01,2017-03-31
2016-Q4,2017-04-01,2017-06-30

CREATE TABLE t2 (
  who varchar(999),
  event varchar(999),
  dt varchar(999)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

aperson,did somthing,2017-01-06
aperson,did somthing else,2017-01-12
aperson,had chrsitmas,2016-12-26
aperson,went wild,2016-01-01
{code}
2. Impala Query shows correct result
{code}
select t2.dt, t1.fyq, t2.who, t2.event
from t2
left join t1 on t2.dt between t1.dts and t1.dte
order by t2.dt;
+------------+---------+---------+-------------------+
| dt         | fyq     | who     | event             |
+------------+---------+---------+-------------------+
| 2016-01-01 | NULL    | aperson | went wild         |
| 2016-12-26 | 2016-Q2 | aperson | had chrsitmas     |
| 2017-01-06 | 2016-Q3 | aperson | did somthing      |
| 2017-01-12 | 2016-Q3 | aperson | did somthing else |
+------------+---------+---------+-------------------+
{code}

3. Drill query shows wrong results:
{code}
alter session set planner.enable_nljoin_for_scalar_only=false;
use hive;
select t2.dt, t1.fyq, t2.who, t2.event
from t2
left join t1 on t2.dt between t1.dts and t1.dte
order by t2.dt;

+-------------+----------+----------+--------------------+
|     dt      |   fyq    |   who    |       event        |
+-------------+----------+----------+--------------------+
| 2016-12-26  | 2016-Q2  | aperson  | had chrsitmas      |
| 2017-01-06  | 2016-Q3  | aperson  | did somthing       |
| 2017-01-12  | 2016-Q3  | aperson  | did somthing else  |
+-------------+----------+----------+--------------------+
3 rows selected (2.523 seconds)
{code}





Is it possible to delegate data joins and filtering to the datasource ?

2017-03-22 Thread Muhammad Gelbana
I'm trying to use Drill with a proprietary datasource that is very fast at
applying data joins (i.e. SQL joins) and query filters (i.e. SQL where
conditions).

To connect to that datasource, I first have to write a storage plugin, but
I'm not sure whether my main goal is achievable.

My main goal is to configure Drill to let the datasource perform the joins
and filters and return only the resulting data; Drill can then perform
further processing based on the original SQL query it received.

Is this possible by developing a storage plugin? Where exactly should I be
looking?

I've been going through this wiki and I don't think I understood every
concept, so if there is another source of information about storage plugin
development, please point it out.
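
A rough sketch of where such pushdown typically hooks in: the storage plugin
supplies planner rules that match a filter (or join) above the plugin's scan
and rewrite the scan so the datasource evaluates the work. This assumes
Drill's StoragePluginOptimizerRule mechanism; MyFilterPushDownRule and the
pushdown details are hypothetical:

{code}
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.drill.exec.planner.logical.DrillFilterRel;
import org.apache.drill.exec.planner.logical.DrillScanRel;
import org.apache.drill.exec.planner.logical.RelOptHelper;
import org.apache.drill.exec.store.StoragePluginOptimizerRule;

public class MyFilterPushDownRule extends StoragePluginOptimizerRule {

  public MyFilterPushDownRule() {
    // Match: a Drill filter whose input is a scan.
    super(RelOptHelper.some(DrillFilterRel.class,
        RelOptHelper.any(DrillScanRel.class)), "MyFilterPushDownRule");
  }

  @Override
  public boolean matches(RelOptRuleCall call) {
    final DrillScanRel scan = call.rel(1);
    // Fire only for scans backed by this plugin's GroupScan (check omitted).
    return scan.getGroupScan() != null;
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    final DrillFilterRel filter = call.rel(0);
    final DrillScanRel scan = call.rel(1);
    // Translate filter.getCondition() into the datasource's own query
    // language, build a new GroupScan carrying that predicate, and register
    // a replacement scan via call.transformTo(...). These details are
    // plugin-specific and omitted here.
  }
}
{code}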

*-*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana