[jira] [Created] (DRILL-3713) Drill-embedded doesn't work with VPN due to use of local IP, should use 127.0.0.1 instead

2015-08-26 Thread Hari Sekhon (JIRA)
Hari Sekhon created DRILL-3713:
--

 Summary: Drill-embedded doesn't work with VPN due to use of local 
IP, should use 127.0.0.1 instead
 Key: DRILL-3713
 URL: https://issues.apache.org/jira/browse/DRILL-3713
 Project: Apache Drill
  Issue Type: Bug
  Components: Client - JDBC
Affects Versions: 1.1.0
 Environment: Drill embedded
Reporter: Hari Sekhon
Assignee: Daniel Barclay (Drill)
Priority: Critical


Drill appears to be using my local IP address (a 192.168.x.x address assigned by 
DHCP on wifi), but when I connect to my company VPN this breaks Drill's access 
due to routing changes made by the VPN client, which I have no control over 
(this is a locked-down company laptop without admin rights, not my personal 
laptop).

I've noticed, however, that 127.0.0.1 remains accessible.

I believe it would be better for drill-embedded to do everything on 127.0.0.1 
to avoid these kinds of issues. Currently I can only use drill-embedded when 
not on the VPN.





Partition pruning inconsistency

2015-08-26 Thread Aman Sinha
We have had some issues where the same query, run at different times
(possibly with other queries running concurrently... not sure about the
concurrency level), either performed partition pruning or did not.  The
failures happened for a couple of reasons:
  (a) allocateNew() in the PruneScanRule failed with an out-of-memory
condition
  (b) the interpreter evaluator encountered an error evaluating a particular
expression type

The PruneScanRule currently logs a warning message and does not fail the
query, since this is a performance optimization.  While we will address the
root causes of (a) and (b) (there's a JIRA open for (b)), an important
issue is the inconsistent behavior of a query.

Should we provide a system setting that allows the query to fail in this
situation?
Note that other rules in the optimizer could also fail, and some rules log
warnings, but those failures are very rare, while the PruneScanRule does
more complex operations (creating value vectors, doing interpreter
evaluation), so the chance of something failing is higher.
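
For illustration, such a switch could be exposed like other planner options; the
option name below is purely hypothetical and does not exist in Drill today:

-- hypothetical option name, shown only to illustrate the proposal
alter system set `planner.partition_pruning.fail_on_error` = true;
-- or per session, with the usual session-over-system precedence
alter session set `planner.partition_pruning.fail_on_error` = false;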

Aman


Re: Partition pruning inconsistency

2015-08-26 Thread Vicky Markman
There is a JIRA for (a) as well:
https://issues.apache.org/jira/browse/DRILL-3045

On Wed, Aug 26, 2015 at 8:31 AM, Aman Sinha  wrote:

> We have had some issues where the same query run at different times
> (possibly with other queries running concurrently...not sure about the
> concurrency level)  either performed partition pruning or did not.  The
> times where it failed happened due to couple of reasons :
>   (a) allocateNew() in the PruneScanRule failed with an out of memory
> condition
>   (b) the interpreter evaluator encountered an error with a particular
> expression type evaluation
>
> The PruneScanRule currently logs a warning message and does not fail the
> query since this is a performance optimization.  While we will address the
> root cause of (a) and (b) (there's a JIRA open for (b) )  an important
> issue is the inconsistent behavior of a query.
>
> Should we provide a system setting that allows the query to fail in this
> situation ?
> Note that other rules in the optimizer could also fail and some rules  log
> warnings but those failures are very rare, while PruneScan rule is doing
> more complex operations - creating value vectors, doing interpreter
> evaluation - so the chances of something failing increases.
>
> Aman
>


Re: Partition pruning inconsistency

2015-08-26 Thread Jinfeng Ni
The idea behind not failing the entire query when an optimizer rule fails is
that the rule's failure might only happen in a plan which turns out to be
sub-optimal, and we do not want to block the optimizer from continuing to
find the final/optimal plan.

In the case of PruneScanRule, the filters to be pushed might be pushed down
from ancestor operators (a join, another filter, etc). That is, PruneScanRule
could be fired multiple times, each with a different filter condition. If
Drill stopped the entire query after one filter triggered a failure, Drill
would not have the chance to find the final plan.

For either issue (a) or (b), one criterion is whether the same filter
evaluation would hit the same failure under the run-time evaluation model as
under the interpreter model. If the interpreter model hits a failure that is
not going to happen in the run-time evaluation model, that means there is a
bug which deserves a fix.

On the other hand, I understand that providing such an option would save
users from seeing inconsistent behavior in partition pruning.

On Wed, Aug 26, 2015 at 8:51 AM, Vicky Markman 
wrote:

> There is Jira for (a) as well:
> https://issues.apache.org/jira/browse/DRILL-3045
>
> On Wed, Aug 26, 2015 at 8:31 AM, Aman Sinha  wrote:
>
> > We have had some issues where the same query run at different times
> > (possibly with other queries running concurrently...not sure about the
> > concurrency level)  either performed partition pruning or did not.  The
> > times where it failed happened due to couple of reasons :
> >   (a) allocateNew() in the PruneScanRule failed with an out of memory
> > condition
> >   (b) the interpreter evaluator encountered an error with a particular
> > expression type evaluation
> >
> > The PruneScanRule currently logs a warning message and does not fail the
> > query since this is a performance optimization.  While we will address
> the
> > root cause of (a) and (b) (there's a JIRA open for (b) )  an important
> > issue is the inconsistent behavior of a query.
> >
> > Should we provide a system setting that allows the query to fail in this
> > situation ?
> > Note that other rules in the optimizer could also fail and some rules
> log
> > warnings but those failures are very rare, while PruneScan rule is doing
> > more complex operations - creating value vectors, doing interpreter
> > evaluation - so the chances of something failing increases.
> >
> > Aman
> >
>


Re: Lucene Format Plugin

2015-08-26 Thread rahul challapalli
Stefan,

I have some changes to push. I will push them and also rebase the branch on
top of the latest master. I will do it sometime tomorrow.

- Rahul

On Tue, Aug 25, 2015 at 11:49 PM, Stefán Baxter 
wrote:

> Hi Rahul,
>
> I will start working on this later this week and over the weekend. I'm not
> sure how long it will take me to become productive but hopefully I will be
> able to share something soon.
>
> I will fork your repo on github. Can you please make sure it's up to date
> with master?
> I'm assuming that it runs in current state so I can get straight to work
> :).
>
> Best regards,
>  -Stefan
>
> On Sun, Aug 23, 2015 at 1:28 AM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
>> Hi Stefan,
>>
>> I was not able to make any further progress on this. Below are a list of
>> things to-do from a high level
>>
>> 1. Cleanup LuceneScanSpec : The current implementation serializes a lot
>> of low level state information to serialize/de-serialize lucene's
>> SegmentReader. This has to be changed otherwise the plugin is tightly
>> coupled to Lucene's implementation details
>> 2. Serialization of Lucene Query object
>> 3. Convert Sql filter into Lucene Query object : I just started it and
>> made it work in the simplest case. You can take a look at it here.
>>
>> https://github.com/rchallapalli/drill/blob/lucene/contrib/format-lucene/src/main/java/org/apache/drill/exec/planner/logical/SqlFilterToLuceneQuery.java
>> As part of the ElasticSearch storage plugin, Andrew has converted the
>> sql filter to Elastic Search Query. It looks like he handled many cases. We
>> can leverage
>> this for the Lucene format plugin. Below is his code
>>
>> https://github.com/aleph-zero/drill/blob/elastic/contrib/storage-elasticsearch/src/main/java/org/apache/drill/exec/store/elasticsearch/rules/PredicateAnalyzer.java
>> 4. Currently the lucene format plugin does not work on HDFS/MaprFs. This
>> should be handled
>> 5. Pushing Agg functions and Limits into the scan. (This will be an
>> improvement)
>> 6. Testing
>>
>> I want to work on (1) sometime next week.
>>
>> - Rahul
>>
>>
>> On Sat, Aug 22, 2015 at 12:00 AM, Stefán Baxter <
>> ste...@activitystream.com> wrote:
>>
>>> Hi Rahul,
>>>
>>> Can you elaborate a bit on the status of the Lucene plugin and what
>>> needs to be done before using it?
>>>
>>> Also let me know if there are specific things that need improving. We
>>> want to try to using it in our project and perhaps we can contribute
>>> something meaningful.
>>>
>>> Regards,
>>>  -Stefan
>>>
>>>
>>>
>>> On Mon, Aug 10, 2015 at 5:01 AM, Sudip Mukherjee <
>>> smukher...@commvault.com> wrote:
>>>
 Hi Rahul,

 Thanks for sharing your code. I was trying to get plugin for solr
 engine. But I thought of using solr's rest api to do the queries ,get
 schema metadata info etc.
 The goal for me is to expose a solr engine to tools like Tableau or  MS
 Excel and user can do stuff there.

 I am still very new to this and there is a learning curve. It would be
 great if you can comment/review whatever I've done so far.

 https://github.com/sudipmukherjee/drill/tree/master/contrib/storage-solr

 Thanks,
 Sudip

 -Original Message-
 From: rahul challapalli [mailto:challapallira...@gmail.com]
 Sent: 10 August 2015 AM 05:21
 To: dev@drill.apache.org
 Subject: Re: Lucene Format Plugin

 Below is the link to my branch which contains the changes related to
 the format plugin.

 https://github.com/rchallapalli/drill/tree/lucene/contrib/format-lucene

 Any thoughts on how to handle contributions like this which still have
 some work to be done?

 - Rahul


 On Mon, Aug 3, 2015 at 12:21 PM, rahul challapalli <
 challapallira...@gmail.com> wrote:

 > Thanks Jason.
 >
 > I want to look at the solr plugin and see where we can collaborate or
 > if we already duplicated part of the effort.
 >
 > I still need to push a few commits. I will share the code once I get
 > these changes pushed.
 >
 > - Rahul
 >
 >
 >
 > On Mon, Aug 3, 2015 at 11:31 AM, Jason Altekruse
 > >>> > > wrote:
 >
 >> Hey Rahul,
 >>
 >> This is really cool! Thanks for all of the time you put into writing
 >> this, I think we have a lot of available opportunities to reach new
 >> communities with efforts like this.
 >>
 >> I noticed last week another contributor opened a JIRA for a solr
 >> plugin, there might be a good opportunity for the two of you to join
 >> efforts, as I believe he likely stated working on a lucene reader as
 >> part of his solr work.
 >>
 >> Would you like to post a link to your work on Github or another
 >> public host of your code?
 >>
 >> https://issues.apache.org/jira/browse/DRILL-3585
 >>
 >> On Mon, Aug 3, 2015 at 2:29 AM, Stefán Baxter
 >> 
 >> wrote:

No of files created by CTAS auto partition feature

2015-08-26 Thread rahul challapalli
Drillers,

I executed the below query on TPCH SF100 with Drill, and it took ~2 hrs to
complete on a 2-node cluster.

alter session set `planner.width.max_per_node` = 4;
alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
create table lineitem partition by (l_shipdate, l_receiptdate) as select *
from dfs.`/drill/testdata/tpch100/lineitem`;

The below query returned 75780, so I expected Drill to create the same number
of files, or maybe a few more. But Drill created so many files that a
"hadoop fs -count" command failed with "GC overhead limit exceeded". (I
did not change the default Parquet block size.)

select count(*) from (select l_shipdate, l_receiptdate from
dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate)
sub;
+---------+
| EXPR$0  |
+---------+
| 75780   |
+---------+


Any thoughts on why Drill is creating so many files?

- Rahul


Re: No of files created by CTAS auto partition feature

2015-08-26 Thread Steven Phillips
It would be helpful if you could figure out what the file count is. But
here are some thoughts:

What is the value of the option:
store.partition.hash_distribute

If it is false, which it is by default, then every fragment will
potentially have data in every partition. In this case, that could increase
the number of files by a factor of 8.
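
For example, to check the current value and then enable redistribution so that
all rows for a given partition key are routed to a single writer (a usage
sketch using the option named above):

select * from sys.options where name = 'store.partition.hash_distribute';
alter session set `store.partition.hash_distribute` = true;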

On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
challapallira...@gmail.com> wrote:

> Drillers,
>
> I executed the below query on TPCH SF100 with drill and it took ~2hrs to
> complete on a 2 node cluster.
>
> alter session set `planner.width.max_per_node` = 4;
> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
> create table lineitem partition by (l_shipdate, l_receiptdate) as select *
> from dfs.`/drill/testdata/tpch100/lineitem`;
>
> The below query returned 75780, so I expected drill to create the same no
> of files or may be a little more. But drill created so many files that a
> "hadoop fs -count" command failed with a "GC overhead limit exceeded". (I
> did not change the default parquet block size)
>
> select count(*) from (select l_shipdate, l_receiptdate from
> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate)
> sub;
> +-+
> | EXPR$0  |
> +-+
> | 75780   |
> +-+
>
>
> Any thoughts on why drill is creating so many files?
>
> - Rahul
>


Re: No of files created by CTAS auto partition feature

2015-08-26 Thread Stefán Baxter
Hi,

Is it possible that the combined values of (l_shipdate, l_receiptdate) have a
very high cardinality?
I would think you are then creating a partition file for each small subset of
the data.

Please keep in mind that I know nothing about TPCH SF100 and only a little
about Drill :).

Regards,
 -Stefan

On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips  wrote:

> It would be helpful if you could figure out what the file count is. But
> here are some thoughs:
>
> What is the value of the option:
> store.partition.hash_distribute
>
> If it is false, which it is by default, then every fragment will
> potentially have data in every partition. In this case, that could increase
> the number of files by a factor of 8.
>
> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Drillers,
> >
> > I executed the below query on TPCH SF100 with drill and it took ~2hrs to
> > complete on a 2 node cluster.
> >
> > alter session set `planner.width.max_per_node` = 4;
> > alter session set `planner.memory.max_query_memory_per_node` =
> 8147483648;
> > create table lineitem partition by (l_shipdate, l_receiptdate) as select
> *
> > from dfs.`/drill/testdata/tpch100/lineitem`;
> >
> > The below query returned 75780, so I expected drill to create the same no
> > of files or may be a little more. But drill created so many files that a
> > "hadoop fs -count" command failed with a "GC overhead limit exceeded". (I
> > did not change the default parquet block size)
> >
> > select count(*) from (select l_shipdate, l_receiptdate from
> > dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
> l_receiptdate)
> > sub;
> > +-+
> > | EXPR$0  |
> > +-+
> > | 75780   |
> > +-+
> >
> >
> > Any thoughts on why drill is creating so many files?
> >
> > - Rahul
> >
>


Re: No of files created by CTAS auto partition feature

2015-08-26 Thread Andries Engelbrecht
What is the distinct count for these columns? IIRC TPC-H has at least 5 years of 
data irrespective of SF, so you are requesting a lot of partitions. 76K sounds 
about right for 5 years of TPC-H l_shipdate and correlated l_receiptdate data, 
and your query doesn't count the actual files.

Try to partition on just the l_shipdate column first.

—Andries


> On Aug 26, 2015, at 12:34 PM, Stefán Baxter  wrote:
> 
> Hi,
> 
> Is it possible that the combination values of  (l_shipdate,
> l_receiptdate) have a very high cardinality?
> I would think you are creating partition files for a small subset of the
> data.
> 
> Please keep in mind that I know nothing about TPCH SF100 and only a little
> about Drill :).
> 
> Regards,
> -Stefan
> 
> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips  wrote:
> 
>> It would be helpful if you could figure out what the file count is. But
>> here are some thoughs:
>> 
>> What is the value of the option:
>> store.partition.hash_distribute
>> 
>> If it is false, which it is by default, then every fragment will
>> potentially have data in every partition. In this case, that could increase
>> the number of files by a factor of 8.
>> 
>> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
>> challapallira...@gmail.com> wrote:
>> 
>>> Drillers,
>>> 
>>> I executed the below query on TPCH SF100 with drill and it took ~2hrs to
>>> complete on a 2 node cluster.
>>> 
>>> alter session set `planner.width.max_per_node` = 4;
>>> alter session set `planner.memory.max_query_memory_per_node` =
>> 8147483648;
>>> create table lineitem partition by (l_shipdate, l_receiptdate) as select
>> *
>>> from dfs.`/drill/testdata/tpch100/lineitem`;
>>> 
>>> The below query returned 75780, so I expected drill to create the same no
>>> of files or may be a little more. But drill created so many files that a
>>> "hadoop fs -count" command failed with a "GC overhead limit exceeded". (I
>>> did not change the default parquet block size)
>>> 
>>> select count(*) from (select l_shipdate, l_receiptdate from
>>> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
>> l_receiptdate)
>>> sub;
>>> +-+
>>> | EXPR$0  |
>>> +-+
>>> | 75780   |
>>> +-+
>>> 
>>> 
>>> Any thoughts on why drill is creating so many files?
>>> 
>>> - Rahul
>>> 
>> 



[jira] [Created] (DRILL-3714) Query runs out of memory and remains in CANCELLATION_REQUESTED state until drillbit is restarted

2015-08-26 Thread Victoria Markman (JIRA)
Victoria Markman created DRILL-3714:
---

 Summary: Query runs out of memory and remains in 
CANCELLATION_REQUESTED state until drillbit is restarted
 Key: DRILL-3714
 URL: https://issues.apache.org/jira/browse/DRILL-3714
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Flow
Affects Versions: 1.2.0
Reporter: Victoria Markman
Assignee: Chris Westin


This is a variation of DRILL-3705, with a difference in Drill's behavior when 
hitting the OOM condition.

Query runs out of memory during execution and remains in 
"CANCELLATION_REQUESTED" state until drillbit is bounced.
Client (sqlline in this case) never gets a response from the server.

Reproduction details:
Single node drillbit installation.
DRILL_MAX_DIRECT_MEMORY="8G"
DRILL_HEAP="4G"

Run this query on TPCDS SF100 data set
{code}
SELECT SUM(ss.ss_net_paid_inc_tax) OVER (PARTITION BY ss.ss_store_sk) AS 
TotalSpend FROM store_sales ss WHERE ss.ss_store_sk IS NOT NULL ORDER BY 1 
LIMIT 10;
{code}

drillbit.log
{code}
2015-08-26 16:54:58,469 [2a2210a7-7a78-c774-d54c-c863d0b77bb0:frag:3:22] INFO  
o.a.d.e.w.f.FragmentStatusReporter - 2a2210a7-7a78-c774-d54c-c863d0b77bb0:3:22: 
State to report: RUNNING
2015-08-26 16:55:50,498 [BitServer-5] WARN  o.a.drill.exec.rpc.data.DataServer 
- Message of mode REQUEST of rpc type 3 took longer than 500ms.  Actual 
duration was 2569ms.
2015-08-26 16:56:31,086 [BitServer-5] ERROR o.a.d.exec.rpc.RpcExceptionHandler 
- Exception in RPC communication.  Connection: /10.10.88.133:31012 <--> 
/10.10.88.133:54554 (data server).  Closing connection.
io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Direct 
buffer memory
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:233)
 ~[netty-codec-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
 [netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
 [netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
 [netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
 [netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
 [netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
 [netty-transport-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:618)
 [netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
at 
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:329) 
[netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:250) 
[netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
 [netty-common-4.0.27.Final.jar:4.0.27.Final]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_71]
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) 
~[na:1.7.0_71]
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) 
~[na:1.7.0_71]
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:437) 
~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) 
~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) 
~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
at io.netty.buffer.PoolArena.reallocate(PoolArena.java:280) 
~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:110) 
~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
at 
io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251) 
~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849) 
~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841) 
~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831) 
~[netty-buffer

Re: zeroVectors() interface for value vectors

2015-08-26 Thread Steven Phillips
One possible exception to the access pattern occurs when vectors wrap other
vectors. Specifically, the offset vectors in Variable Length and Repeated
vectors. These vectors are accessed and mutated multiple times. If we are
going to implement strict enforcement, we need to consider that case.

On Tue, Aug 25, 2015 at 7:15 PM, Jacques Nadeau  wrote:

> Yes, my recommendation is to correct the usage in StreamingAggBatch
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Aug 25, 2015 at 4:52 PM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > I think zeroVector() is mainly used to fill the vector with zeros, which
> is
> > fine if you call it while the vector is in "mutate" state, but
> > StreamingAggBatch does actually call it after setting the value count of
> > the value vector which is against the paradigm.
> >
> >
> > On Tue, Aug 25, 2015 at 3:51 PM, Jacques Nadeau 
> > wrote:
> >
> > > In all but one situations, this is an internal concern (making sure to
> > zero
> > > out the memory).  For fixed width vectors, there is an assumption that
> an
> > > initial allocation is clean memory (e.g. all zeros in the faces of an
> int
> > > vector).  So this should be pulled off a public vector interface.  The
> > one
> > > place where it is being used today is StreamingAggBatch and I think we
> > > should fix that to follow the state paradigm described above.
> > >
> > >
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Tue, Aug 25, 2015 at 3:41 PM, Abdel Hakim Deneche <
> > > adene...@maprtech.com>
> > > wrote:
> > >
> > > > Another question: FixedWidthVector interface defines a zeroVector()
> > > method
> > > > that
> > > > "Zero out the underlying buffer backing this vector" according to
> it's
> > > > javadoc.
> > > >
> > > > Where does this method fit in the value vector states described
> > earlier ?
> > > > it doesn't clear the vector yet it doesn't reset everything to the
> > after
> > > > allocate state.
> > > >
> > > > On Tue, Aug 25, 2015 at 10:46 AM, Abdel Hakim Deneche <
> > > > adene...@maprtech.com
> > > > > wrote:
> > > >
> > > > > One more question about the transition from allocate -> mutate. For
> > > Fixed
> > > > > width vectors and BitVector you can actually call setSafe() without
> > > > calling
> > > > > allocateNew() first and it will work. Should it throw an exception
> > > > instead
> > > > > ?
> > > > > not calling allocateNew() has side effects that could cause
> setSafe()
> > > to
> > > > > throw an OversizedAllocationException if you call setSafe() then
> > > clear()
> > > > > multiple times.
> > > > >
> > > > > On Tue, Aug 25, 2015 at 10:01 AM, Chris Westin <
> > > chriswesti...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Maybe we should start by putting these rules in a comment in the
> > value
> > > > >> vector base interfaces? The lack of such information is why there
> > are
> > > > >> deviations and other expectations.
> > > > >>
> > > > >> On Tue, Aug 25, 2015 at 8:22 AM, Jacques Nadeau <
> jacq...@dremio.com
> > >
> > > > >> wrote:
> > > > >>
> > > > >> > There are a few unspoken "rules" around vectors:
> > > > >> >
> > > > >> > - values need to be written in order (e.g. index 0, 1, 2, 5)
> > > > >> > - null vectors start with all values as null before writing
> > anything
> > > > >> > - for variable width types, the offset vector should be all
> zeros
> > > > before
> > > > >> > writing
> > > > >> > - you must call setValueCount before a vector can be read
> > > > >> > - you should never write to a vector once it has been read.
> > > > >> >
> > > > >> > The ultimate goal we should get to the point where you the
> > > interfaces
> > > > >> > guarantee this order of operation:
> > > > >> >
> > > > >> > allocate > mutate > setvaluecount > access > clear (or allocate
> to
> > > > start
> > > > >> > the process over, xxx).  Any deviation from this pattern should
> > > result
> > > > >> in
> > > > >> > exception.  We should do this only in debug mode as this code is
> > > > >> extremely
> > > > >> > performance sensitive.  Operations like transfer should be built
> > on
> > > > top
> > > > >> of
> > > > >> > this state model.  (In that case, it would mean src moves to
> clear
> > > > state
> > > > >> > and target moves to access state.  It also means that transfer
> > > should
> > > > >> only
> > > > >> > work in access state.)
> > > > >> >
> > > > >> > If we need special purpose data structures that don't operate in
> > > these
> > > > >> > ways, we should make sure to keep them separate rather than
> trying
> > > to
> > > > >> > accommodate a deviation from this pattern in the core vector
> code.
> > > > >> >
> > > > >> > I wrote xxx above because I see the purpose of zeroVectors as
> > being
> > > a
> > > > >> reset
> > > > >> > on the vector state back to the original state.  Maybe we should
> > > > >> actually
> > > > >> > call it 'reset' rather than 'zeroVectors'.  This would basically
> > > pick
> > > > >>

Re: No of files created by CTAS auto partition feature

2015-08-26 Thread rahul challapalli
Steven,

You were right. The count is 606240 which is 8*75780.


Stefan & Andries,

Below is the distinct count or cardinality

select count(*) from (select l_shipdate, l_receiptdate from
dfs.`/drill/testdata/tpch100/
lineitem` group by l_shipdate, l_receiptdate) sub;
+---------+
| EXPR$0  |
+---------+
| 75780   |
+---------+

- Rahul

On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:

> What is the distinct count for this columns? IIRC TPC-H has at least 5
> years of data irrespective of SF, so you are requesting a lot of
> partitions. 76K sounds about right for 5 years of TPCH shipmate and
> correlating receipt date data, your query doesn’t count the actual files.
>
> Try to partition just on the shipmate column first.
>
> —Andries
>
>
> > On Aug 26, 2015, at 12:34 PM, Stefán Baxter 
> wrote:
> >
> > Hi,
> >
> > Is it possible that the combination values of  (l_shipdate,
> > l_receiptdate) have a very high cardinality?
> > I would think you are creating partition files for a small subset of the
> > data.
> >
> > Please keep in mind that I know nothing about TPCH SF100 and only a
> little
> > about Drill :).
> >
> > Regards,
> > -Stefan
> >
> > On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips  wrote:
> >
> >> It would be helpful if you could figure out what the file count is. But
> >> here are some thoughs:
> >>
> >> What is the value of the option:
> >> store.partition.hash_distribute
> >>
> >> If it is false, which it is by default, then every fragment will
> >> potentially have data in every partition. In this case, that could
> increase
> >> the number of files by a factor of 8.
> >>
> >> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
> >> challapallira...@gmail.com> wrote:
> >>
> >>> Drillers,
> >>>
> >>> I executed the below query on TPCH SF100 with drill and it took ~2hrs
> to
> >>> complete on a 2 node cluster.
> >>>
> >>> alter session set `planner.width.max_per_node` = 4;
> >>> alter session set `planner.memory.max_query_memory_per_node` =
> >> 8147483648;
> >>> create table lineitem partition by (l_shipdate, l_receiptdate) as
> select
> >> *
> >>> from dfs.`/drill/testdata/tpch100/lineitem`;
> >>>
> >>> The below query returned 75780, so I expected drill to create the same
> no
> >>> of files or may be a little more. But drill created so many files that
> a
> >>> "hadoop fs -count" command failed with a "GC overhead limit exceeded".
> (I
> >>> did not change the default parquet block size)
> >>>
> >>> select count(*) from (select l_shipdate, l_receiptdate from
> >>> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
> >> l_receiptdate)
> >>> sub;
> >>> +-+
> >>> | EXPR$0  |
> >>> +-+
> >>> | 75780   |
> >>> +-+
> >>>
> >>>
> >>> Any thoughts on why drill is creating so many files?
> >>>
> >>> - Rahul
> >>>
> >>
>
>


Re: zeroVectors() interface for value vectors

2015-08-26 Thread Julien Le Dem
I can take a look at the Vectors and add asserts to enforce that the contract
is respected.

On Wed, Aug 26, 2015 at 2:52 PM, Steven Phillips  wrote:

> One possible exception to the access pattern occurs when vectors wrap other
> vectors. Specifically, the offset vectors in Variable Length and Repeated
> vectors. These vectors are accessed and mutated multiple times. If we are
> going to implement strict enforcement, we need to consider that case.
>
> On Tue, Aug 25, 2015 at 7:15 PM, Jacques Nadeau 
> wrote:
>
> > Yes, my recommendation is to correct the usage in StreamingAggBatch
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Aug 25, 2015 at 4:52 PM, Abdel Hakim Deneche <
> > adene...@maprtech.com>
> > wrote:
> >
> > > I think zeroVector() is mainly used to fill the vector with zeros,
> which
> > is
> > > fine if you call it while the vector is in "mutate" state, but
> > > StreamingAggBatch does actually call it after setting the value count
> of
> > > the value vector which is against the paradigm.
> > >
> > >
> > > On Tue, Aug 25, 2015 at 3:51 PM, Jacques Nadeau 
> > > wrote:
> > >
> > > > In all but one situations, this is an internal concern (making sure
> to
> > > zero
> > > > out the memory).  For fixed width vectors, there is an assumption
> that
> > an
> > > > initial allocation is clean memory (e.g. all zeros in the faces of an
> > int
> > > > vector).  So this should be pulled off a public vector interface.
> The
> > > one
> > > > place where it is being used today is StreamingAggBatch and I think
> we
> > > > should fix that to follow the state paradigm described above.
> > > >
> > > >
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Tue, Aug 25, 2015 at 3:41 PM, Abdel Hakim Deneche <
> > > > adene...@maprtech.com>
> > > > wrote:
> > > >
> > > > > Another question: FixedWidthVector interface defines a zeroVector()
> > > > method
> > > > > that
> > > > > "Zero out the underlying buffer backing this vector" according to
> > it's
> > > > > javadoc.
> > > > >
> > > > > Where does this method fit in the value vector states described
> > > earlier ?
> > > > > it doesn't clear the vector yet it doesn't reset everything to the
> > > after
> > > > > allocate state.
> > > > >
> > > > > On Tue, Aug 25, 2015 at 10:46 AM, Abdel Hakim Deneche <
> > > > > adene...@maprtech.com
> > > > > > wrote:
> > > > >
> > > > > > One more question about the transition from allocate -> mutate.
> For
> > > > Fixed
> > > > > > width vectors and BitVector you can actually call setSafe()
> without
> > > > > calling
> > > > > > allocateNew() first and it will work. Should it throw an
> exception
> > > > > instead
> > > > > > ?
> > > > > > not calling allocateNew() has side effects that could cause
> > setSafe()
> > > > to
> > > > > > throw an OversizedAllocationException if you call setSafe() then
> > > > clear()
> > > > > > multiple times.
> > > > > >
> > > > > > On Tue, Aug 25, 2015 at 10:01 AM, Chris Westin <
> > > > chriswesti...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> Maybe we should start by putting these rules in a comment in the
> > > value
> > > > > >> vector base interfaces? The lack of such information is why
> there
> > > are
> > > > > >> deviations and other expectations.
> > > > > >>
> > > > > >> On Tue, Aug 25, 2015 at 8:22 AM, Jacques Nadeau <
> > jacq...@dremio.com
> > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >> > There are a few unspoken "rules" around vectors:
> > > > > >> >
> > > > > >> > - values need to be written in order (e.g. index 0, 1, 2, 5)
> > > > > >> > - null vectors start with all values as null before writing
> > > anything
> > > > > >> > - for variable width types, the offset vector should be all
> > zeros
> > > > > before
> > > > > >> > writing
> > > > > >> > - you must call setValueCount before a vector can be read
> > > > > >> > - you should never write to a vector once it has been read.
> > > > > >> >
> > > > > >> > The ultimate goal we should get to the point where you the
> > > > interfaces
> > > > > >> > guarantee this order of operation:
> > > > > >> >
> > > > > >> > allocate > mutate > setvaluecount > access > clear (or
> allocate
> > to
> > > > > start
> > > > > >> > the process over, xxx).  Any deviation from this pattern
> should
> > > > result
> > > > > >> in
> > > > > >> > exception.  We should do this only in debug mode as this code
> is
> > > > > >> extremely
> > > > > >> > performance sensitive.  Operations like transfer should be
> built
> > > on
> > > > > top
> > > > > >> of
> > > > > >> > this state model.  (In that case, it would mean src moves to
> > clear
> > > > > state
> > > > > >> > and target moves to access state.  It also means that transfer
> > > > should
> > > > > >> only
> > > > > >> > work in access state.)
> > > > > >> >
> > > > > >> > If we need special purpose data structures that don't operate
> in
> > > > these
> > > > > >> > ways, we should make sure to keep them 

[GitHub] drill pull request: DRILL-3153: Fix JDBC's getIdentifierQuoteStrin...

2015-08-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/99




[GitHub] drill pull request: First pass at test re-factoring

2015-08-26 Thread aleph-zero
GitHub user aleph-zero opened a pull request:

https://github.com/apache/drill/pull/135

First pass at test re-factoring

Unifying test class hierarchy and adding randomization functionality.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aleph-zero/drill issues/DRILL-2026

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/135.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #135


commit 3a9a0a1469265b7b75c11d4b36895e627a835b55
Author: andrew 
Date:   2015-08-26T22:11:17Z

First pass at test re-factoring

Unifying test class hierarchy and adding randomization functionality.






Re: No of files created by CTAS auto partition feature

2015-08-26 Thread Andries Engelbrecht
Looks like Drill is doing the partitioning as requested then. May not be 
optimal though.

Is there a reason why you want to subpartition this much? You may be better off 
just partitioning by l_shipdate (not shipmate, autocorrect got me there). Or 
use columns with much lower cardinality to test subpartitioning.

—Andries


> On Aug 26, 2015, at 3:05 PM, rahul challapalli  
> wrote:
> 
> Steven,
> 
> You were right. The count is 606240 which is 8*75780.
> 
> 
> Stefan & Andries,
> 
> Below is the distinct count or cardinality
> 
> select count(*) from (select l_shipdate, l_receiptdate from
> dfs.`/drill/testdata/tpch100/
> lineitem` group by l_shipdate, l_receiptdate) sub;
> +-+
> | EXPR$0  |
> +-+
> | 75780   |
> +-+
> 
> - Rahul
> 
> 
> 
> 
> 
> On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <
> aengelbre...@maprtech.com> wrote:
> 
>> What is the distinct count for this columns? IIRC TPC-H has at least 5
>> years of data irrespective of SF, so you are requesting a lot of
>> partitions. 76K sounds about right for 5 years of TPCH shipmate and
>> correlating receipt date data, your query doesn’t count the actual files.
>> 
>> Try to partition just on the shipmate column first.
>> 
>> —Andries
>> 
>> 
>>> On Aug 26, 2015, at 12:34 PM, Stefán Baxter 
>> wrote:
>>> 
>>> Hi,
>>> 
>>> Is it possible that the combination values of  (l_shipdate,
>>> l_receiptdate) have a very high cardinality?
>>> I would think you are creating partition files for a small subset of the
>>> data.
>>> 
>>> Please keep in mind that I know nothing about TPCH SF100 and only a
>> little
>>> about Drill :).
>>> 
>>> Regards,
>>> -Stefan
>>> 
>>> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips  wrote:
>>> 
 It would be helpful if you could figure out what the file count is. But
 here are some thoughs:
 
 What is the value of the option:
 store.partition.hash_distribute
 
 If it is false, which it is by default, then every fragment will
 potentially have data in every partition. In this case, that could
>> increase
 the number of files by a factor of 8.
 
 On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
 challapallira...@gmail.com> wrote:
 
> Drillers,
> 
> I executed the below query on TPCH SF100 with drill and it took ~2hrs
>> to
> complete on a 2 node cluster.
> 
> alter session set `planner.width.max_per_node` = 4;
> alter session set `planner.memory.max_query_memory_per_node` =
 8147483648;
> create table lineitem partition by (l_shipdate, l_receiptdate) as
>> select
 *
> from dfs.`/drill/testdata/tpch100/lineitem`;
> 
> The below query returned 75780, so I expected drill to create the same
>> no
> of files or may be a little more. But drill created so many files that
>> a
> "hadoop fs -count" command failed with a "GC overhead limit exceeded".
>> (I
> did not change the default parquet block size)
> 
> select count(*) from (select l_shipdate, l_receiptdate from
> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
 l_receiptdate)
> sub;
> +-+
> | EXPR$0  |
> +-+
> | 75780   |
> +-+
> 
> 
> Any thoughts on why drill is creating so many files?
> 
> - Rahul
> 
 
>> 
>> 



Re: No of files created by CTAS auto partition feature

2015-08-26 Thread rahul challapalli
Well, this is for generating some test data.

- Rahul

On Wed, Aug 26, 2015 at 3:47 PM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:

> Looks like Drill is doing the partitioning as requested then. May not be
> optimal though.
>
> Is there a reason why you want to subpartition this much? You may be
> better of to just partition by l_shipdate (not shipmate, autocorrect got me
> there). Or use columns with much lower cardinality to test subpartitioning.
>
> —Andries
>
>
> > On Aug 26, 2015, at 3:05 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
> >
> > Steven,
> >
> > You were right. The count is 606240 which is 8*75780.
> >
> >
> > Stefan & Andries,
> >
> > Below is the distinct count or cardinality
> >
> > select count(*) from (select l_shipdate, l_receiptdate from
> > dfs.`/drill/testdata/tpch100/
> > lineitem` group by l_shipdate, l_receiptdate) sub;
> > +-+
> > | EXPR$0  |
> > +-+
> > | 75780   |
> > +-+
> >
> > - Rahul
> >
> >
> >
> >
> >
> > On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <
> > aengelbre...@maprtech.com> wrote:
> >
> >> What is the distinct count for this columns? IIRC TPC-H has at least 5
> >> years of data irrespective of SF, so you are requesting a lot of
> >> partitions. 76K sounds about right for 5 years of TPCH shipmate and
> >> correlating receipt date data, your query doesn’t count the actual
> files.
> >>
> >> Try to partition just on the shipmate column first.
> >>
> >> —Andries
> >>
> >>
> >>> On Aug 26, 2015, at 12:34 PM, Stefán Baxter  >
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Is it possible that the combination values of  (l_shipdate,
> >>> l_receiptdate) have a very high cardinality?
> >>> I would think you are creating partition files for a small subset of
> the
> >>> data.
> >>>
> >>> Please keep in mind that I know nothing about TPCH SF100 and only a
> >> little
> >>> about Drill :).
> >>>
> >>> Regards,
> >>> -Stefan
> >>>
> >>> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips 
> wrote:
> >>>
>  It would be helpful if you could figure out what the file count is.
> But
>  here are some thoughs:
> 
>  What is the value of the option:
>  store.partition.hash_distribute
> 
>  If it is false, which it is by default, then every fragment will
>  potentially have data in every partition. In this case, that could
> >> increase
>  the number of files by a factor of 8.
> 
>  On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
>  challapallira...@gmail.com> wrote:
> 
> > Drillers,
> >
> > I executed the below query on TPCH SF100 with drill and it took ~2hrs
> >> to
> > complete on a 2 node cluster.
> >
> > alter session set `planner.width.max_per_node` = 4;
> > alter session set `planner.memory.max_query_memory_per_node` =
>  8147483648;
> > create table lineitem partition by (l_shipdate, l_receiptdate) as
> >> select
>  *
> > from dfs.`/drill/testdata/tpch100/lineitem`;
> >
> > The below query returned 75780, so I expected drill to create the
> same
> >> no
> > of files or may be a little more. But drill created so many files
> that
> >> a
> > "hadoop fs -count" command failed with a "GC overhead limit
> exceeded".
> >> (I
> > did not change the default parquet block size)
> >
> > select count(*) from (select l_shipdate, l_receiptdate from
> > dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
>  l_receiptdate)
> > sub;
> > +-+
> > | EXPR$0  |
> > +-+
> > | 75780   |
> > +-+
> >
> >
> > Any thoughts on why drill is creating so many files?
> >
> > - Rahul
> >
> 
> >>
> >>
>
>


[GitHub] drill pull request: Support all encoded vector types for Hash Aggr...

2015-08-26 Thread straightflush
GitHub user straightflush opened a pull request:

https://github.com/apache/drill/pull/136

Support all encoded vector types for Hash Aggregator.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/straightflush/drill hashagg-sv2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/136.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #136


commit 578e3f61131534b03fa183e7d05b19d5ae4b6896
Author: Amit Hadke 
Date:   2015-08-26T22:47:01Z

Support all encoded vector types for Hash Aggregator.






[GitHub] drill pull request: Support all encoded vector types for Hash Aggr...

2015-08-26 Thread straightflush
Github user straightflush commented on the pull request:

https://github.com/apache/drill/pull/136#issuecomment-135200237
  
@StevenMPhillips @jaltekruse 




[jira] [Created] (DRILL-3715) Enable selection vector sv2 and sv4 in hash aggregator

2015-08-26 Thread amit hadke (JIRA)
amit hadke created DRILL-3715:
-

 Summary: Enable selection vector sv2 and sv4 in hash aggregator
 Key: DRILL-3715
 URL: https://issues.apache.org/jira/browse/DRILL-3715
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators
Reporter: amit hadke
Assignee: amit hadke
Priority: Minor


HashAggregator already can read sv2 and sv4 vectors. Enable support for all of 
them.





[GitHub] drill pull request: DRILL-3715 Support all encoded vector types fo...

2015-08-26 Thread jaltekruse
Github user jaltekruse commented on the pull request:

https://github.com/apache/drill/pull/136#issuecomment-135203792
  
We should have a plan verification test to make sure that this change 
actually has the desired effect. This should be able to execute fine without 
your change as well, but the plan will have an unnecessary 
SelectionVectorRemover in it today. You can see examples of plan verification 
tests in 
exec/java-exec/src/test/java/org/apache/drill/TestCTASPartitionFilter.java 
and other subclasses of PlanTestBase.
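
As a rough manual analogue of such a check (the actual tests extend 
PlanTestBase and match patterns in the plan text), one can inspect the 
physical plan for an aggregate over a filter and look for whether a 
SelectionVectorRemover still appears between the Filter and the HashAgg. The 
table and filter below are just placeholders, and whether the planner picks a 
hash aggregate also depends on session settings:

explain plan for
select l_shipdate, count(*)
from dfs.`/drill/testdata/tpch100/lineitem`
where l_receiptdate > date '1995-01-01'
group by l_shipdate;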




[GitHub] drill pull request: DRILL-3715 Support all encoded vector types fo...

2015-08-26 Thread mehant
Github user mehant commented on the pull request:

https://github.com/apache/drill/pull/136#issuecomment-135205449
  
There have been bugs in the handling of sv2 in the case where the first and 
second batches are filtered out. Could you add a test case for this? A similar 
test was added for the streaming aggregate as part of DRILL-3069: 
https://github.com/apache/drill/commit/97a63168e93a70c7ed88d2e801dd2ea2e5f1dd74




Re: No of files created by CTAS auto partition feature

2015-08-26 Thread Jason Altekruse
I feel like there is a little misunderstanding here.

Rahul, did you try setting the option that Steven suggested?
`store.partition.hash_distribute`

This will cause a re-distribution of the data so that the rows that belong
in a particular partition will all be written by a single writer. They will
not necessarily be all in one file, as we have a limit on file sizes and I
don't think we cap partition size.

The default behavior is not to re-distribute, because it is expensive. This
however means that every fragment will write out a file for whichever keys
appear in the data that ends up at that fragment.

If there is a large number of fragments and the data is spread out pretty
randomly, then there is a reasonable case for turning on this option to
co-locate data in a single partition to a single writer to reduce the
number of smaller files. There is no magic formula for when it is best to
turn on this option, but in most cases it will reduce the number of files
produced.



On Wed, Aug 26, 2015 at 3:48 PM, rahul challapalli <
challapallira...@gmail.com> wrote:

> Well this for generating some testdata
>
> - Rahul
>
> On Wed, Aug 26, 2015 at 3:47 PM, Andries Engelbrecht <
> aengelbre...@maprtech.com> wrote:
>
> > Looks like Drill is doing the partitioning as requested then. May not be
> > optimal though.
> >
> > Is there a reason why you want to subpartition this much? You may be
> > better of to just partition by l_shipdate (not shipmate, autocorrect got
> me
> > there). Or use columns with much lower cardinality to test
> subpartitioning.
> >
> > —Andries
> >
> >
> > > On Aug 26, 2015, at 3:05 PM, rahul challapalli <
> > challapallira...@gmail.com> wrote:
> > >
> > > Steven,
> > >
> > > You were right. The count is 606240 which is 8*75780.
> > >
> > >
> > > Stefan & Andries,
> > >
> > > Below is the distinct count or cardinality
> > >
> > > select count(*) from (select l_shipdate, l_receiptdate from
> > > dfs.`/drill/testdata/tpch100/
> > > lineitem` group by l_shipdate, l_receiptdate) sub;
> > > +-+
> > > | EXPR$0  |
> > > +-+
> > > | 75780   |
> > > +-+
> > >
> > > - Rahul
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <
> > > aengelbre...@maprtech.com> wrote:
> > >
> > >> What is the distinct count for this columns? IIRC TPC-H has at least 5
> > >> years of data irrespective of SF, so you are requesting a lot of
> > >> partitions. 76K sounds about right for 5 years of TPCH shipmate and
> > >> correlating receipt date data, your query doesn’t count the actual
> > files.
> > >>
> > >> Try to partition just on the shipmate column first.
> > >>
> > >> —Andries
> > >>
> > >>
> > >>> On Aug 26, 2015, at 12:34 PM, Stefán Baxter <
> ste...@activitystream.com
> > >
> > >> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> Is it possible that the combination values of  (l_shipdate,
> > >>> l_receiptdate) have a very high cardinality?
> > >>> I would think you are creating partition files for a small subset of
> > the
> > >>> data.
> > >>>
> > >>> Please keep in mind that I know nothing about TPCH SF100 and only a
> > >> little
> > >>> about Drill :).
> > >>>
> > >>> Regards,
> > >>> -Stefan
> > >>>
> > >>> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips 
> > wrote:
> > >>>
> >  It would be helpful if you could figure out what the file count is.
> > But
> >  here are some thoughs:
> > 
> >  What is the value of the option:
> >  store.partition.hash_distribute
> > 
> >  If it is false, which it is by default, then every fragment will
> >  potentially have data in every partition. In this case, that could
> > >> increase
> >  the number of files by a factor of 8.
> > 
> >  On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
> >  challapallira...@gmail.com> wrote:
> > 
> > > Drillers,
> > >
> > > I executed the below query on TPCH SF100 with drill and it took
> ~2hrs
> > >> to
> > > complete on a 2 node cluster.
> > >
> > > alter session set `planner.width.max_per_node` = 4;
> > > alter session set `planner.memory.max_query_memory_per_node` =
> >  8147483648;
> > > create table lineitem partition by (l_shipdate, l_receiptdate) as
> > >> select
> >  *
> > > from dfs.`/drill/testdata/tpch100/lineitem`;
> > >
> > > The below query returned 75780, so I expected drill to create the
> > same
> > >> no
> > > of files or may be a little more. But drill created so many files
> > that
> > >> a
> > > "hadoop fs -count" command failed with a "GC overhead limit
> > exceeded".
> > >> (I
> > > did not change the default parquet block size)
> > >
> > > select count(*) from (select l_shipdate, l_receiptdate from
> > > dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
> >  l_receiptdate)
> > > sub;
> > > +-+
> > > | EXPR$0  |
> > > +-+
> > > | 75780   |
> > > +-+
> > >
> > >
> 

Re: No of files created by CTAS auto partition feature

2015-08-26 Thread rahul challapalli
Jason,

What you described is exactly my understanding.

I did kick off a run after setting `store.partition.hash_distribute`. It is
still running. I am expecting the number of files to be slightly more than or
equal to 75780 (as the default Parquet block size should be sufficient for
most of the partitions).

- Rahul



On Wed, Aug 26, 2015 at 4:36 PM, Jason Altekruse 
wrote:

> I feel like there is a little misunderstanding here.
>
> Rahul, did you try setting the option that Steven suggested?
> `store.partition.hash_distribute`
>
> This will cause a re-distribution of the data so that the rows that belong
> in a particular partition will all be written by a single writer. They will
> not necessarily be all in one file, as we have a limit on file sizes and I
> don't think we cap partition size.
>
> The default behavior is not to re-distribute, because it is expensive. This
> however means that every fragment will write out a file for whichever keys
> appear in the data that ends up at that fragment.
>
> If there is a large number of fragments and the data is spread out pretty
> randomly, then there is a reasonable case for turning on this option to
> co-locate data in a single partition to a single writer to reduce the
> number of smaller files. There is no magic formula for when it is best to
> turn on this option, but in most cases it will reduce the number of files
> produced.
>
>
>
> On Wed, Aug 26, 2015 at 3:48 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Well this for generating some testdata
> >
> > - Rahul
> >
> > On Wed, Aug 26, 2015 at 3:47 PM, Andries Engelbrecht <
> > aengelbre...@maprtech.com> wrote:
> >
> > > Looks like Drill is doing the partitioning as requested then. May not
> be
> > > optimal though.
> > >
> > > Is there a reason why you want to subpartition this much? You may be
> > > better of to just partition by l_shipdate (not shipmate, autocorrect
> got
> > me
> > > there). Or use columns with much lower cardinality to test
> > subpartitioning.
> > >
> > > —Andries
> > >
> > >
> > > > On Aug 26, 2015, at 3:05 PM, rahul challapalli <
> > > challapallira...@gmail.com> wrote:
> > > >
> > > > Steven,
> > > >
> > > > You were right. The count is 606240 which is 8*75780.
> > > >
> > > >
> > > > Stefan & Andries,
> > > >
> > > > Below is the distinct count or cardinality
> > > >
> > > > select count(*) from (select l_shipdate, l_receiptdate from
> > > > dfs.`/drill/testdata/tpch100/
> > > > lineitem` group by l_shipdate, l_receiptdate) sub;
> > > > +-+
> > > > | EXPR$0  |
> > > > +-+
> > > > | 75780   |
> > > > +-+
> > > >
> > > > - Rahul
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <
> > > > aengelbre...@maprtech.com> wrote:
> > > >
> > > >> What is the distinct count for this columns? IIRC TPC-H has at
> least 5
> > > >> years of data irrespective of SF, so you are requesting a lot of
> > > >> partitions. 76K sounds about right for 5 years of TPCH shipmate and
> > > >> correlating receipt date data, your query doesn’t count the actual
> > > files.
> > > >>
> > > >> Try to partition just on the shipmate column first.
> > > >>
> > > >> —Andries
> > > >>
> > > >>
> > > >>> On Aug 26, 2015, at 12:34 PM, Stefán Baxter <
> > ste...@activitystream.com
> > > >
> > > >> wrote:
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> Is it possible that the combination values of  (l_shipdate,
> > > >>> l_receiptdate) have a very high cardinality?
> > > >>> I would think you are creating partition files for a small subset
> of
> > > the
> > > >>> data.
> > > >>>
> > > >>> Please keep in mind that I know nothing about TPCH SF100 and only a
> > > >> little
> > > >>> about Drill :).
> > > >>>
> > > >>> Regards,
> > > >>> -Stefan
> > > >>>
> > > >>> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips 
> > > wrote:
> > > >>>
> > >  It would be helpful if you could figure out what the file count
> is.
> > > But
> > >  here are some thoughs:
> > > 
> > >  What is the value of the option:
> > >  store.partition.hash_distribute
> > > 
> > >  If it is false, which it is by default, then every fragment will
> > >  potentially have data in every partition. In this case, that could
> > > >> increase
> > >  the number of files by a factor of 8.
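A quick way to confirm which behavior is in effect is to look the option up in
sys.options, something like the following (exact column names can vary a little
between Drill versions):

select name, bool_val from sys.options
where name = 'store.partition.hash_distribute';

With it off, 2 nodes times planner.width.max_per_node = 4 presumably gives 8
writing fragments, and 8 x 75780 distinct (l_shipdate, l_receiptdate)
combinations lines up with the 606240 files reported elsewhere in this thread.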
> > > 
> > >  On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
> > >  challapallira...@gmail.com> wrote:
> > > 
> > > > Drillers,
> > > >
> > > > I executed the below query on TPCH SF100 with drill and it took
> > ~2hrs
> > > >> to
> > > > complete on a 2 node cluster.
> > > >
> > > > alter session set `planner.width.max_per_node` = 4;
> > > > alter session set `planner.memory.max_query_memory_per_node` =
> > >  8147483648;
> > > > create table lineitem partition by (l_shipdate, l_receiptdate) as
> > > >> select
> > >  *
> > > > from dfs.`/drill/testdata/tpch100/lineitem`;
> > > >
> > > 

Re: No of files created by CTAS auto partition feature

2015-08-26 Thread rahul challapalli
Steven, Jason :

Below is my understanding of when we should spill to disk while performing
a sort. Let me know if I am missing anything

alter session set `planner.width.max_per_node` = 4;
alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
(~8GB)
create table lineitem partition by (l_shipdate, l_receiptdate) as select *
from dfs.`/drill/testdata/tpch100/lineitem`;

1. The above query creates 4 minor fragments and each minor fragment gets
~2GB for the sort phase.
2. Once a minor fragment consumes ~2GB of memory, it starts spilling each
partition into a separate file on disk.
3. The spilled files would be of different sizes.
4. Now if it is a regular CTAS (with no partition by clause), each spilled
file should be roughly 2GB in size.

I just have a hunch that we are spilling a little early :)

- Rahul
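For reference, a small sketch for double-checking those two settings and the
step-1 arithmetic. This is only the back-of-envelope division from the list
above, not necessarily Drill's exact internal formula for carving up sort
memory, and the sys.options column names may differ slightly between versions:

-- confirm the values actually in effect for the session
select name, num_val from sys.options
where name in ('planner.width.max_per_node',
               'planner.memory.max_query_memory_per_node');

-- 8147483648 bytes / 4 minor fragments per node ~= 1.9 GiB of sort budget each
select 8147483648 / 4 as approx_sort_bytes_per_fragment from (values(1)) t;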


On Wed, Aug 26, 2015 at 4:49 PM, rahul challapalli <
challapallira...@gmail.com> wrote:

> Jason,
>
> What you described is exactly my understanding.
>
> I did kick off a run after setting `store.partition.hash_distribute`. It is
> still running. I am expecting the no of files to be slightly more than or
> equal to 75780. (As the default parquet block size should be sufficient for
> most of the partitions)
>
> - Rahul
>
>
>
> On Wed, Aug 26, 2015 at 4:36 PM, Jason Altekruse  > wrote:
>
>> I feel like there is a little misunderstanding here.
>>
>> Rahul, did you try setting the option that Steven suggested?
>> `store.partition.hash_distribute`
>>
>> This will cause a re-distribution of the data so that the rows that belong
>> in a particular partition will all be written by a single writer. They
>> will
>> not necessarily be all in one file, as we have a limit on file sizes and I
>> don't think we cap partition size.
>>
>> The default behavior is not to re-distribute, because it is expensive.
>> This
>> however means that every fragment will write out a file for whichever keys
>> appear in the data that ends up at that fragment.
>>
>> If there is a large number of fragments and the data is spread out pretty
>> randomly, then there is a reasonable case for turning on this option to
>> co-locate data in a single partition to a single writer to reduce the
>> number of smaller files. There is no magic formula for when it is best to
>> turn on this option, but in most cases it will reduce the number of files
>> produced.
>>
>>
>>
>> On Wed, Aug 26, 2015 at 3:48 PM, rahul challapalli <
>> challapallira...@gmail.com> wrote:
>>
>> > Well, this is for generating some test data.
>> >
>> > - Rahul
>> >
>> > On Wed, Aug 26, 2015 at 3:47 PM, Andries Engelbrecht <
>> > aengelbre...@maprtech.com> wrote:
>> >
>> > > Looks like Drill is doing the partitioning as requested then. May not
>> be
>> > > optimal though.
>> > >
>> > > Is there a reason why you want to subpartition this much? You may be
>> > > better off to just partition by l_shipdate (not shipmate, autocorrect
>> got
>> > me
>> > > there). Or use columns with much lower cardinality to test
>> > subpartitioning.
>> > >
>> > > —Andries
>> > >
>> > >
>> > > > On Aug 26, 2015, at 3:05 PM, rahul challapalli <
>> > > challapallira...@gmail.com> wrote:
>> > > >
>> > > > Steven,
>> > > >
>> > > > You were right. The count is 606240 which is 8*75780.
>> > > >
>> > > >
>> > > > Stefan & Andries,
>> > > >
>> > > > Below is the distinct count or cardinality
>> > > >
>> > > > select count(*) from (select l_shipdate, l_receiptdate from
>> > > > dfs.`/drill/testdata/tpch100/
>> > > > lineitem` group by l_shipdate, l_receiptdate) sub;
>> > > > +-+
>> > > > | EXPR$0  |
>> > > > +-+
>> > > > | 75780   |
>> > > > +-+
>> > > >
>> > > > - Rahul
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <
>> > > > aengelbre...@maprtech.com> wrote:
>> > > >
>> > > >> What is the distinct count for these columns? IIRC TPC-H has at
>> least 5
>> > > >> years of data irrespective of SF, so you are requesting a lot of
>> > > >> partitions. 76K sounds about right for 5 years of TPCH shipmate and
>> > > >> correlating receipt date data, your query doesn’t count the actual
>> > > files.
>> > > >>
>> > > >> Try to partition just on the shipmate column first.
>> > > >>
>> > > >> —Andries
>> > > >>
>> > > >>
>> > > >>> On Aug 26, 2015, at 12:34 PM, Stefán Baxter <
>> > ste...@activitystream.com
>> > > >
>> > > >> wrote:
>> > > >>>
>> > > >>> Hi,
>> > > >>>
>> > > >>> Is it possible that the combination values of  (l_shipdate,
>> > > >>> l_receiptdate) have a very high cardinality?
>> > > >>> I would think you are creating partition files for a small subset
>> of
>> > > the
>> > > >>> data.
>> > > >>>
>> > > >>> Please keep in mind that I know nothing about TPCH SF100 and only
>> a
>> > > >> little
>> > > >>> about Drill :).
>> > > >>>
>> > > >>> Regards,
>> > > >>> -Stefan
>> > > >>>
>> > > >>> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips 
>> > > wrote:
>> > > >>>
>> > >  It would be helpful if you c

Re: No of files created by CTAS auto partition feature

2015-08-26 Thread Steven Phillips
That's not really how it works. The only "spilling" to disk occurs during
External Sort, and the spill files are not created based on partition.

What makes you think it is spilling prematurely?
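To make that distinction concrete: the files under the spill directory are
temporary sort runs written whenever the External Sort exceeds its memory
budget, while the files in the target table directory are parquet files
produced by the writer and capped in size. A sketch of checking the writer-side
cap, assuming store.parquet.block-size is the relevant option (default 512MB,
if memory serves):

select name, num_val from sys.options
where name = 'store.parquet.block-size';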

On Wed, Aug 26, 2015 at 5:15 PM, rahul challapalli <
challapallira...@gmail.com> wrote:

> Steven, Jason :
>
> Below is my understanding of when we should spill to disk while performing
> a sort. Let me know if I am missing anything
>
> alter session set `planner.width.max_per_node` = 4;
> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
> (~8GB)
> create table lineitem partition by (l_shipdate, l_receiptdate) as select *
> from dfs.`/drill/testdata/tpch100/lineitem`;
>
> 1. The above query creates 4 minor fragments and each minor fragment gets
> ~2GB for the sort phase.
> 2. Once a minor fragment consumes ~2GB of memory, it starts spilling each
> partition into a separate file to disk
> 3. The spilled files would be of different sizes.
> 4. Now if it is a regular CTAS (with no partition by clause), each spilled
> file should be approximately ~2GB in size
>
> I just have a hunch that we are spilling a little early :)
>
> - Rahul
>
>
> On Wed, Aug 26, 2015 at 4:49 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Jason,
> >
> > What you described is exactly my understanding.
> >
> > I did kick off a run after setting `store.partition.hash_distribute`. It
> is
> > still running. I am expecting the no of files to be slightly more than or
> > equal to 75780. (As the default parquet block size should be sufficient
> for
> > most of the partitions)
> >
> > - Rahul
> >
> >
> >
> > On Wed, Aug 26, 2015 at 4:36 PM, Jason Altekruse <
> altekruseja...@gmail.com
> > > wrote:
> >
> >> I feel like there is a little misunderstanding here.
> >>
> >> Rahul, did you try setting the option that Steven suggested?
> >> `store.partition.hash_distribute`
> >>
> >> This will cause a re-distribution of the data so that the rows that
> belong
> >> in a particular partition will all be written by a single writer. They
> >> will
> >> not necessarily be all in one file, as we have a limit on file sizes
> and I
> >> don't think we cap partition size.
> >>
> >> The default behavior is not to re-distribute, because it is expensive.
> >> This
> >> however means that every fragment will write out a file for whichever
> keys
> >> appear in the data that ends up at that fragment.
> >>
> >> If there is a large number of fragments and the data is spread out
> pretty
> >> randomly, then there is a reasonable case for turning on this option to
> >> co-locate data in a single partition to a single writer to reduce the
> >> number of smaller files. There is no magic formula for when it is best
> to
> >> turn on this option, but in most cases it will reduce the number of
> files
> >> produced.
> >>
> >>
> >>
> >> On Wed, Aug 26, 2015 at 3:48 PM, rahul challapalli <
> >> challapallira...@gmail.com> wrote:
> >>
> >> > Well, this is for generating some test data.
> >> >
> >> > - Rahul
> >> >
> >> > On Wed, Aug 26, 2015 at 3:47 PM, Andries Engelbrecht <
> >> > aengelbre...@maprtech.com> wrote:
> >> >
> >> > > Looks like Drill is doing the partitioning as requested then. May
> not
> >> be
> >> > > optimal though.
> >> > >
> >> > > Is there a reason why you want to subpartition this much? You may be
> >> > > better off to just partition by l_shipdate (not shipmate, autocorrect
> >> got
> >> > me
> >> > > there). Or use columns with much lower cardinality to test
> >> > subpartitioning.
> >> > >
> >> > > —Andries
> >> > >
> >> > >
> >> > > > On Aug 26, 2015, at 3:05 PM, rahul challapalli <
> >> > > challapallira...@gmail.com> wrote:
> >> > > >
> >> > > > Steven,
> >> > > >
> >> > > > You were right. The count is 606240 which is 8*75780.
> >> > > >
> >> > > >
> >> > > > Stefan & Andries,
> >> > > >
> >> > > > Below is the distinct count or cardinality
> >> > > >
> >> > > > select count(*) from (select l_shipdate, l_receiptdate from
> >> > > > dfs.`/drill/testdata/tpch100/
> >> > > > lineitem` group by l_shipdate, l_receiptdate) sub;
> >> > > > +-+
> >> > > > | EXPR$0  |
> >> > > > +-+
> >> > > > | 75780   |
> >> > > > +-+
> >> > > >
> >> > > > - Rahul
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <
> >> > > > aengelbre...@maprtech.com> wrote:
> >> > > >
> >> > > >> What is the distinct count for these columns? IIRC TPC-H has at
> >> least 5
> >> > > >> years of data irrespective of SF, so you are requesting a lot of
> >> > > >> partitions. 76K sounds about right for 5 years of TPCH shipmate
> and
> >> > > >> correlating receipt date data, your query doesn’t count the
> actual
> >> > > files.
> >> > > >>
> >> > > >> Try to partition just on the shipmate column first.
> >> > > >>
> >> > > >> —Andries
> >> > > >>
> >> > > >>
> >> > > >>> On Aug 26, 2015, at 12:34 PM, Stefán Baxter <
> >> > ste...@activitystream.com
> >> > > >
> >> > > >> w

Re: No of files created by CTAS auto partition feature

2015-08-26 Thread rahul challapalli
Thanks for clarifying.

In the spill directory under a specific minor fragment, I saw files with very
different sizes: a few are just 50 MB each, while others are 1.5 GB. It's not
clear to me why the file sizes are so different.

- Rahul
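A rough back-of-envelope, assuming SF100 lineitem is somewhere around 70-80 GB
of raw data spread across 8 minor fragments: that is roughly 9-10 GB of input
per fragment against a ~2 GB sort budget, so each fragment would spill several
runs, and there is no obvious reason the runs (the last, partial one in
particular) would all come out the same size.

-- illustrative arithmetic only; the 75 GB figure is an assumption, not a measurement
select 75.0 / 8 as approx_gb_per_fragment,
       (75.0 / 8) / 2 as approx_spill_runs_per_fragment
from (values(1)) t;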

On Wed, Aug 26, 2015 at 5:48 PM, Steven Phillips  wrote:

> That's not really how it works. The only "spilling" to disk occurs during
> External Sort, and the spill files are not created based on partition.
>
> What makes you think it is spilling prematurely?
>
> On Wed, Aug 26, 2015 at 5:15 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Steven, Jason :
> >
> > Below is my understanding of when we should spill to disk while
> performing
> > a sort. Let me know if I am missing anything
> >
> > alter session set `planner.width.max_per_node` = 4;
> > alter session set `planner.memory.max_query_memory_per_node` =
> 8147483648;
> > (~8GB)
> > create table lineitem partition by (l_shipdate, l_receiptdate) as select
> *
> > from dfs.`/drill/testdata/tpch100/lineitem`;
> >
> > 1. The above query creates 4 minor fragments and each minor fragment gets
> > ~2GB for the sort phase.
> > 2. Once a minor fragment consumes ~2GB of memory, it starts spilling each
> > partition into a separate file to disk
> > 3. The spilled files would be of different sizes.
> > 4. Now if it is a regular CTAS (with no partition by clause), each
> spilled
> > file should be approximately ~2GB in size
> >
> > I just have a hunch that we are spilling a little early :)
> >
> > - Rahul
> >
> >
> > On Wed, Aug 26, 2015 at 4:49 PM, rahul challapalli <
> > challapallira...@gmail.com> wrote:
> >
> > > Jason,
> > >
> > > What you described is exactly my understanding.
> > >
> > > I did kick off a run after setting `store.partition.hash_distribute`. It
> > is
> > > still running. I am expecting the no of files to be slightly more than
> or
> > > equal to 75780. (As the default parquet block size should be sufficient
> > for
> > > most of the partitions)
> > >
> > > - Rahul
> > >
> > >
> > >
> > > On Wed, Aug 26, 2015 at 4:36 PM, Jason Altekruse <
> > altekruseja...@gmail.com
> > > > wrote:
> > >
> > >> I feel like there is a little misunderstanding here.
> > >>
> > >> Rahul, did you try setting the option that Steven suggested?
> > >> `store.partition.hash_distribute`
> > >>
> > >> This will cause a re-distribution of the data so that the rows that
> > belong
> > >> in a particular partition will all be written by a single writer. They
> > >> will
> > >> not necessarily be all in one file, as we have a limit on file sizes
> > and I
> > >> don't think we cap partition size.
> > >>
> > >> The default behavior is not to re-distribute, because it is expensive.
> > >> This
> > >> however means that every fragment will write out a file for whichever
> > keys
> > >> appear in the data that ends up at that fragment.
> > >>
> > >> If there is a large number of fragments and the data is spread out
> > pretty
> > >> randomly, then there is a reasonable case for turning on this option
> to
> > >> co-locate data in a single partition to a single writer to reduce the
> > >> number of smaller files. There is no magic formula for when it is best
> > to
> > >> turn on this option, but in most cases it will reduce the number of
> > files
> > >> produced.
> > >>
> > >>
> > >>
> > >> On Wed, Aug 26, 2015 at 3:48 PM, rahul challapalli <
> > >> challapallira...@gmail.com> wrote:
> > >>
> > >> > Well, this is for generating some test data.
> > >> >
> > >> > - Rahul
> > >> >
> > >> > On Wed, Aug 26, 2015 at 3:47 PM, Andries Engelbrecht <
> > >> > aengelbre...@maprtech.com> wrote:
> > >> >
> > >> > > Looks like Drill is doing the partitioning as requested then. May
> > not
> > >> be
> > >> > > optimal though.
> > >> > >
> > >> > > Is there a reason why you want to subpartition this much? You may
> be
> > >> > > better off to just partition by l_shipdate (not shipmate,
> autocorrect
> > >> got
> > >> > me
> > >> > > there). Or use columns with much lower cardinality to test
> > >> > subpartitioning.
> > >> > >
> > >> > > —Andries
> > >> > >
> > >> > >
> > >> > > > On Aug 26, 2015, at 3:05 PM, rahul challapalli <
> > >> > > challapallira...@gmail.com> wrote:
> > >> > > >
> > >> > > > Steven,
> > >> > > >
> > >> > > > You were right. The count is 606240 which is 8*75780.
> > >> > > >
> > >> > > >
> > >> > > > Stefan & Andries,
> > >> > > >
> > >> > > > Below is the distinct count or cardinality
> > >> > > >
> > >> > > > select count(*) from (select l_shipdate, l_receiptdate from
> > >> > > > dfs.`/drill/testdata/tpch100/
> > >> > > > lineitem` group by l_shipdate, l_receiptdate) sub;
> > >> > > > +-+
> > >> > > > | EXPR$0  |
> > >> > > > +-+
> > >> > > > | 75780   |
> > >> > > > +-+
> > >> > > >
> > >> > > > - Rahul
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <
> > >> > > > aengelbre...@maprtech.com> wrote:
> > >> >