Re: Issue faced in Apache drill

2019-04-09 Thread rahul challapalli
My above solution makes an implicit assumption that we return null even if a
single value in column b is null. However, you can modify the query to
replace nulls with 0's if that is what you want to do.
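
A minimal sketch of that variant (untested against the all-null JSON case;
since Drill may read an all-null column as VarChar, a cast plus COALESCE is
one way to substitute 0 before aggregating):

select
  a,
  sum(coalesce(cast(b as int), 0)) as total_b
from dfs.`sample.json`
group by a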

On Tue, Apr 9, 2019 at 4:41 PM rahul challapalli 
wrote:

> I haven't tried it myself but something like the below workaround should
> be helpful
>
> select
>   a,
>   case
>     when exists (select 1 from dfs.`sample.json` where b is null) then null
>     else sum(b)
>   end
> from dfs.`sample.json`
> group by a
>
> - Rahul
>
> On Tue, Apr 9, 2019 at 4:32 PM Gayathri Selvaraj <
> gayathri.selvar...@gmail.com> wrote:
>
>> Hi Team,
>>
>>
>> Facing some issues with the following case:
>>
>> The JSON file (*sample.json*) has the following content:
>> {"a":2,"b":null} {"a":2,"b":null} {"a":3,"b":null} {"a":4,"b":null}
>>
>> *Query:*
>>
>> SELECT a, sum(b) FROM dfs.`C:\\Users\\user\\Desktop\\sample.json` group by a;
>>
>> *Error:*
>>
>> UNSUPPORTED_OPERATION ERROR: Only COUNT, MIN and MAX aggregate functions
>> supported for VarChar type
>>
>> *Observation:*
>>
>> If we query without using group by, then it works fine without any
>> error. If group by is used, then the sum of null values throws the above
>> error.
>>
>>
>>
>> Can anyone please let us know the solution for this, or whether there is any
>> alternative. I have raised a JIRA ticket for the same -
>> https://issues.apache.org/jira/browse/DRILL-7161
>>
>>
>> Regards,
>>
>> Gayathri
>>
>


Re: Issue faced in Apache drill

2019-04-09 Thread rahul challapalli
I haven't tried it myself but something like the below workaround should be
helpful

select
  a,
  case
    when exists (select 1 from dfs.`sample.json` where b is null) then null
    else sum(b)
  end
from dfs.`sample.json`
group by a

- Rahul

On Tue, Apr 9, 2019 at 4:32 PM Gayathri Selvaraj <
gayathri.selvar...@gmail.com> wrote:

> Hi Team,
>
>
> Facing some issues with the following case:
>
> The JSON file (*sample.json*) has the following content:
> {"a":2,"b":null} {"a":2,"b":null} {"a":3,"b":null} {"a":4,"b":null}
>
> *Query:*
>
> SELECT a, sum(b) FROM dfs.`C:\\Users\\user\\Desktop\\sample.json` group by a;
>
> *Error:*
>
> UNSUPPORTED_OPERATION ERROR: Only COUNT, MIN and MAX aggregate functions
> supported for VarChar type
>
> *Observation:*
>
> If we query without using group by, then it works fine without any
> error. If group by is used, then the sum of null values throws the above
> error.
>
>
>
> Can anyone please let us know the solution for this, or whether there is any
> alternative. I have raised a JIRA ticket for the same -
> https://issues.apache.org/jira/browse/DRILL-7161
>
>
> Regards,
>
> Gayathri
>


Re: Apache Drill issue

2018-06-04 Thread rahul challapalli
In addition to what Padma said, it would be helpful if you could post the
query that you are trying to execute. Also, as a sanity check, can you list
the tables present in Hive? Run the below commands

use hive;
show tables;

On Mon, Jun 4, 2018 at 8:05 AM, Padma Penumarthy 
wrote:

> Did you verify the permissions ? Check the drillbit log.
> That will give some clues.
>
> Thanks
> Padma
>
>
> > On Jun 3, 2018, at 7:28 AM, Samiksha Kapoor 
> wrote:
> >
> > Hi Team,
> >
> > I am doing a POC on Apache Drill for my organization and have installed
> > Drill on my Linux environment. However, I am constantly facing one issue
> > even after trying many configurations. When I try to view the Hive
> > tables, I get no output; the query executes fine but returns no
> > results.
> >
> > Please help me find a resolution so that I can do a successful POC
> > for my organization.
> >
> > Looking forward to hearing from you.
> >
> > Thanks,
> > Samiksha Kapoor
>
>


Re: question about views

2018-03-19 Thread rahul challapalli
First I would suggest ignoring the view and trying out a query which has the
required filters as part of the subqueries on both sides of the union (for
both the database and the partitioned parquet data). The plan for such a query
should have the answers to your question. If both subqueries
independently prune out unnecessary data, using partitions or indexes, I
don't think adding a union between them would alter that behavior.
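
As a rough sketch of what I mean (the table and column names here are
placeholders, not from your setup), the shape of the query to test would be
something like:

select ts, val
from dfs.`/archive/events_parquet`
where ts >= date '2018-01-01'
union all
select ts, val
from mydb.recent_events
where ts >= date '2018-01-01'

If the plan for each branch shows partition pruning on the parquet side and
index use on the database side, the union should preserve both.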

-Rahul

On Mon, Mar 19, 2018 at 1:44 PM, Ted Dunning  wrote:

> IF I create a view that is a union of partitioned parquet files and a
> database that has secondary indexes, will Drill be able to properly push
> down query limits into both parts of the union?
>
> In particular, if I have lots of archival data and parquet partitioned by
> time but my query only asks for recent data that is in the database, will
> the query avoid the parquet files entirely (as you would wish)?
>
> Conversely, if the data I am asking for is entirely in the archive, will
> the query make use of the partitioning on my parquet files correctly?
>


Re: [MongoDB] How does drill aggregate data

2017-10-02 Thread rahul challapalli
This will largely depend on the implementation of the MongoDB storage
plugin. Based on a quick glimpse at the plugin code [1], it looks like we read
all the data from MongoDB and then perform the aggregation in Drill.

[1]
https://github.com/apache/drill/blob/master/contrib/storage-mongo/src/main/java/org/apache/drill/exec/store/mongo/MongoGroupScan.java
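
One way to check this on your own setup (a sketch; the plugin, database and
table names are assumptions) is to look at the plan and see whether the
aggregation sits above the Mongo scan:

explain plan for
select `user`, sum(amount)
from mongo.mydb.sales
where type = 1
group by `user`;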

On Mon, Oct 2, 2017 at 8:42 AM, Andy  wrote:

> Hi, Drill Team
> My name is Andy.
> Currently, I'm considering using Apache Drill to query and aggregate data
> from MongoDB.
>
> But I am really confused about how aggregation works.
> For example, I have this query:
> SELECT user,SUM(amount)
> FROM sales WHERE type=1
> GROUP BY user
>
> Then, I have 2 thoughts on this:
> 1. Drill will use aggregation from MongoDB's API to do the GROUP BY
> 2. Drill will filter the data (find from the Mongo API) and then do the
> aggregation in its own way (such as collecting all matching documents into
> memory and doing the group by)
>
> So, can you help me understand how it works?
> Thanks in advance
>


Re: Query Optimization

2017-08-17 Thread rahul challapalli
Could you be running into https://issues.apache.org/jira/browse/DRILL-3846 ?

- Rahul

On Thu, Aug 17, 2017 at 9:13 PM, Padma Penumarthy 
wrote:

> It is supposed to work like you expected. Maybe you are running into a
> bug.
> Why is it reading all files after the metadata refresh? That is difficult to
> answer without
> looking at the logs and query profile. If you look at the query profile,
> you can maybe
> check what the usedMetadataFile flag says for the scan.
> Also, I am thinking that if you created so many files, your metadata
> cache file could be big. Maybe you can manually sanity
> check whether it looks ok (look for the .drill.parquet_metadata file in the root
> directory) and is not
> corrupted?
>
> Thanks,
> Padma
>
>
> On Aug 17, 2017, at 8:10 PM, Khurram Faraaz  wrote:
>
> Please share your SQL query and the query plan.
>
> To get the query plan, execute EXPLAIN PLAN FOR ;
>
>
> Thanks,
>
> Khurram
>
> 
> From: Divya Gehlot 
> Sent: Friday, August 18, 2017 7:15:18 AM
> To: user@drill.apache.org
> Subject: Re: Query Optimization
>
> Hi,
> Yes, it's the same query; I just ran the metadata refresh command.
> My understanding is that the metadata refresh command saves reading the metadata.
> How about column values? Why is it reading all the files after the metadata
> refresh?
> Partitioning helps to retrieve data faster.
> Like in Hive, when you mention the partition column in the where
> condition, it just reads that partition and improves the query performance.
> In my query the where condition also has the partitioning column, so it should
> read only those partitioned files, right?
> Why is it taking more time?
> Does Drill work in a different way compared to Hive?
>
>
> Thanks,
> Divya
>
> On 18 August 2017 at 07:37, Padma Penumarthy  wrote:
>
> It might read all those files if some new data gets added after running
> refresh metadata cache.
> If everything is the same before and after the metadata refresh, i.e. no
> new data added and the query is exactly the same, then it should not do that.
> Also, check if you can partition in a way that will not create so many
> files in the
> first place.
>
> Thanks,
> Padma
>
>
> On Aug 16, 2017, at 10:54 PM, Divya Gehlot  wrote:
>
> Hi,
> Another observation is
> My query had where conditions based on the partition values
>
> Total number of parquet files in the directory - 102290
> Before metadata refresh - it's reading only 4 files
> After metadata refresh - it's reading 102290 files
>
>
> Is this how refresh metadata works, i.e. it scans each and every file
> and gets the results?
>
> I don't have access to the logs right now.
>
> Thanks,
> Divya
>
> On 17 August 2017 at 13:48, Divya Gehlot  wrote:
>
> Hi,
> Another observation is
> My query had where conditions based on the partition values
> Before metadata refresh - it's reading only 4 files
> After metadata refresh - it's reading 102290 files
>
> Thanks,
> Divya
>
> On 17 August 2017 at 13:03, Padma Penumarthy  wrote:
>
> Does your query have a partition filter?
> Execution time is increased most likely because partition pruning is
> not
> happening.
> Did you get a chance to look at the logs? That might give some clues.
>
> Thanks,
> Padma
>
>
> On Aug 16, 2017, at 9:32 PM, Divya Gehlot  wrote:
>
> Hi,
> Even I am surprised .
> I am running Drill version 1.10  on MapR enterprise version.
> *Query *- Selecting all the columns on partitioned parquet table
>
> I observed a few things from the query statistics:
>
> Value      | Before Refresh Metadata | After Refresh Metadata
> Fragments  | 1                       | 13
> DURATION   | 01 min 0.233 sec        | 18 min 0.744 sec
> PLANNING   | 59.818 sec              | 33.087 sec
> QUEUED     | Not Available           | Not Available
> EXECUTION  | 0.415 sec               | 17 min 27.657 sec
>
> The planning time is reduced by approx 60%, but the execution time
> increased drastically.
> I would like to understand why the execution time increases after the
> metadata refresh.
>
>
> Appreciate the help.
>
> Thanks,
> divya
>
>
> On 17 August 2017 at 11:54, Padma Penumarthy  wrote:
>
> Refresh table metadata should help reduce query planning time.
> It is odd that it went up after you did refresh table metadata.
> Did you check the logs to see what is happening? You might have to
> turn on some debug logging if needed.
> BTW, what version of Drill are you running?
>
> Thanks,
> Padma
>
>
> On Aug 16, 2017, at 8:15 PM, Divya Gehlot  wrote:
>
> Hi,
> I have data in parquet file format.
> When I run the query and look at the execution plan, I see the
> following
> statistics
>
> TOTAL FRAGMENTS: 1
> DURATION: 01 min 0.233 sec
> PLANNING: 59.818 sec
> QUEUED: Not Available
> EXECUTION: 0.415 sec
>
>
>
> As its a paquet file fo

Re: append data to already existing table saved in parquet format

2017-07-25 Thread rahul challapalli
I am not aware of any clean way to do this. However, if your data is
partitioned based on directories, then you can use the below hack which
leverages temporary tables [1]. Essentially, you back up your partition to a
temp table, then overwrite it by taking the union of the new partition data and
the existing partition data. This way we are not overwriting the entire table.

create temporary table mytable_2017 (col1, col2) as select col1, col2,
... from mytable where dir0 = '2017';
drop table `mytable/2017`;
create table `mytable/2017` as
select col1, col2, ... from new_partition_data
union
select col1, col2, ... from mytable_2017;
drop table mytable_2017;

Caveat : Temporary tables get dropped automatically if the session ends or
the drillbit crashes. In the above sequence, if the connection gets dropped
(there are known issues causing this) between the client and drillbit after
executing the "DROP" statement, then your partition data is lost forever.
And since drill doesn't support transactions, the mentioned approach is
dangerous.

[1] https://drill.apache.org/docs/create-temporary-table-as-cttas/


On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot 
wrote:

> Hi,
> I am new to Apache Drill.
> As I have data coming in every hour, when I searched I couldn't find an
> insert-into-partition command in Apache Drill.
> How can we insert data into a particular partition without rewriting the whole
> data set?
>
>
> Appreciate the help.
> Thanks,
> Divya
>


Re: Index out of bounds for SELECT * from 'directory'

2017-07-13 Thread rahul challapalli
With the amount of information provided it's hard to guide you. Index out of
bounds errors are generally bad as they indicate some accounting
corruption. I suggest that you go ahead and file a jira with the below
information:

1. Query
2. Drill Version
3. Data sets used
4. Logs and profiles
5. files under drill-conf directory

- Rahul

On Thu, Jul 13, 2017 at 10:12 AM, Dan Holmes  wrote:

> I am getting the following error with this query.
>
> SELECT COUNT(*)
> FROM dfs.`/home/dan/twm/sales`
>
> version 1.10.0
>
> all the files are .txt.  Here is the relevant part of the profile for dfs
> "txt": {
>   "type": "text",
>   "extensions": [
> "txt"
>   ],
>   "extractHeader": true,
>   "delimiter": "|"
>
> how do i diagnose this?
>
> thank you
> dan
> Query Failed: An Error
> Occurredorg.apache.drill.common.exceptions.UserRemoteException:
> SYSTEM ERROR: IndexOutOfBoundsException: index: 32384, length: 4 (expected:
> range(0, 16384)) Fragment 1:0 [Error Id:
> 7cacf366-21dc-4528-9f4c-eda3c2e28a8b on ubuntu:31010]
>


Re: Reading Parquet files with array or list columns

2017-06-30 Thread rahul challapalli
Hmm... I too see no simple workaround for the second case. Can you also
file a jira for the CTAS case? Drill could have been running short on heap
memory.

- Rahul

On Fri, Jun 30, 2017 at 11:46 AM, David Kincaid 
wrote:

> The view only works for the first example in the Jira I created. That was
> the workaround we have been using since January.
>
> Recently we've had a use case where we are running a Spark script to
> pre-join some data before we try to use it in Drill. That was the subject
> of the initial e-mail in this thread and the topic of the comment I made in
> the JIra on 6/17. As far as I've been able to tell there isn't a similar
> work around for this case that will make the column appear as an array.
>
> Note, I tried to use Drill to do that pre-join of the Parquet data using
> CTAS, but it ran for about 4 hours then crashed. The Spark script to do it
> runs in 14 minutes successfully.
>
> - Dave
>
> On Fri, Jun 30, 2017 at 1:38 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Like I suggested in the comment for DRILL-5183, can you try using a view
> as
> > a workaround until the issue gets resolved?
> >
> > On Fri, Jun 30, 2017 at 10:41 AM, David Kincaid 
> > wrote:
> >
> > > As far as I was able to discern it is not possible to actually use this
> > > column as an array in Drill at all. It just does not correctly read the
> > > Parquet. I have had a very similar defect I created in Jira back in
> > January
> > > that has had no attention at all. So we are moving on to other tools. I
> > > understand Drill is free and no one developing it owes me anything.
> It's
> > > just not going to work for us without proper support for nested objects
> > in
> > > Parquet format.
> > >
> > > Thanks for the reply though. It's much appreciated to have some
> > > acknowledgment that I raised a valid issue.
> > >
> > > - Dave
> > >
> > > On Fri, Jun 30, 2017 at 12:06 PM, François Méthot  >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Have you tried:
> > > >select column['list'][0]['element'] from ...
> > > >should return "My First Value".
> > > >
> > > > or try:
> > > > select flatten(column['list'])['element'] from ...
> > > >
> > > > Hope it helps, in our data we have a column that looks like this:
> > > > [{"NAME:":"Aname", "DATA":"thedata"},{"NAME:":"Aname2",
> > > > "DATA":"thedata2"},.]
> > > >
> > > > We ended up writing a custom function to do the lookup instead of using the
> > > > costly flatten technique.
> > > >
> > > > Francois
> > > >
> > > >
> > > >
> > > > On Sat, Jun 17, 2017 at 10:04 PM, David Kincaid <
> > kincaid.d...@gmail.com>
> > > > wrote:
> > > >
> > > > > I'm having a problem querying Parquet files that were created from
> > > Spark
> > > > > and have columns that are array or list types. When I do a SELECT
> on
> > > > these
> > > > > columns they show up like this:
> > > > >
> > > > > {"list": [{"element": "My first value"}, {"element": "My second
> > > value"}]}
> > > > >
> > > > > which Drill does not recognize as a REPEATED column and is not
> really
> > > > > workable to hack around like I did in DRILL-5183 (
> > > > > https://issues.apache.org/jira/browse/DRILL-5183). I can get to
> one
> > > > value
> > > > > using something like t.columnName.`list`.`element` but that's not
> > > really
> > > > > feasible to use in a query.
> > > > >
> > > > > The little I could find on this by Googling around led me to this
> > > > document
> > > > > on the Parquet format Github page -
> > > > > https://github.com/apache/parquet-format/blob/master/
> LogicalTypes.md
> > .
> > > > This
> > > > > seems to say that Spark is writing these files correctly, but Drill
> > is
> > > > not
> > > > > interpreting them properly.
> > > > >
> > > > > Is there a workaround that anyone can help me to turn these columns
> > > into
> > > > > values that Drill understands as repeated values? This is a fairly
> > > urgent
> > > > > issue for us.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Dave
> > > > >
> > > >
> > >
> >
>


Re: Reading Parquet files with array or list columns

2017-06-30 Thread rahul challapalli
Like I suggested in the comment for DRILL-5183, can you try using a view as
a workaround until the issue gets resolved?
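
For reference, a rough sketch of the kind of view I mean (the path and the
columnName field are placeholders, and this uses the flatten approach François
mentioned rather than the exact view from the DRILL-5183 comment):

create or replace view dfs.tmp.spark_list_v as
select flatten(t.`columnName`.`list`) as item
from dfs.`/path/to/spark_parquet` t;

-- then read the unwrapped values through the view
select v.item.`element` from dfs.tmp.spark_list_v v;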

On Fri, Jun 30, 2017 at 10:41 AM, David Kincaid 
wrote:

> As far as I was able to discern it is not possible to actually use this
> column as an array in Drill at all. It just does not correctly read the
> Parquet. I have had a very similar defect I created in Jira back in January
> that has had no attention at all. So we are moving on to other tools. I
> understand Drill is free and no one developing it owes me anything. It's
> just not going to work for us without proper support for nested objects in
> Parquet format.
>
> Thanks for the reply though. It's much appreciated to have some
> acknowledgment that I raised a valid issue.
>
> - Dave
>
> On Fri, Jun 30, 2017 at 12:06 PM, François Méthot 
> wrote:
>
> > Hi,
> >
> > Have you tried:
> >select column['list'][0]['element'] from ...
> >should return "My First Value".
> >
> > or try:
> > select flatten(column['list'])['element'] from ...
> >
> > Hope it helps, in our data we have a column that looks like this:
> > [{"NAME:":"Aname", "DATA":"thedata"},{"NAME:":"Aname2",
> > "DATA":"thedata2"},.]
> >
> > We ended up writing a custom function to do the lookup instead of using the
> > costly flatten technique.
> >
> > Francois
> >
> >
> >
> > On Sat, Jun 17, 2017 at 10:04 PM, David Kincaid 
> > wrote:
> >
> > > I'm having a problem querying Parquet files that were created from
> Spark
> > > and have columns that are array or list types. When I do a SELECT on
> > these
> > > columns they show up like this:
> > >
> > > {"list": [{"element": "My first value"}, {"element": "My second
> value"}]}
> > >
> > > which Drill does not recognize as a REPEATED column and is not really
> > > workable to hack around like I did in DRILL-5183 (
> > > https://issues.apache.org/jira/browse/DRILL-5183). I can get to one
> > value
> > > using something like t.columnName.`list`.`element` but that's not
> really
> > > feasible to use in a query.
> > >
> > > The little I could find on this by Googling around led me to this
> > document
> > > on the Parquet format Github page -
> > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md.
> > This
> > > seems to say that Spark is writing these files correctly, but Drill is
> > not
> > > interpreting them properly.
> > >
> > > Is there a workaround that anyone can help me to turn these columns
> into
> > > values that Drill understands as repeated values? This is a fairly
> urgent
> > > issue for us.
> > >
> > > Thanks,
> > >
> > > Dave
> > >
> >
>


Re: Pushing down Joins, Aggregates and filters, and data distribution questions

2017-06-01 Thread rahul challapalli
I would first recommend you spend some time reading the execution flow
inside drill [1]. Try to understand specifically what major/minor fragments
are and that different major fragments can have different levels of
parallelism.

Let us take a simple query which runs on a 2 node cluster

select * from employee where salary > 10;

Now how do we control parallelism for the above query? Unfortunately the
generic answer is not a simple one. But since I conveniently took a simple
query with a single major fragment, lets make an effort to understand this.
There are 3 variables which control the parallelism

1. No of cores available
2. planner.width.max_per_node : Maximum number of minor fragments within a
major fragment per node
3. Parallelism supported by the scan for the particular storage plugin
involved

Let's try to understand the last parameter, which is of interest to storage
plugin developers. Like you hinted, the number of sub-scans determines the
parallelism of the above query in the absence of the first 2 variables. But
how many subscans can exist? This unfortunately depends on how you can
split the data (while respecting row boundaries) and so is dependent on the
storage format. Hypothetically, let's say you have a file which is composed
of 100 parts, each part contains a few records, and you know that a single
record is not split across multiple parts. Now with this setup, the storage
plugin simply has to get the number of parts present in the data and
instantiate that many subscans.

So in the above simplistic setup the max parallelization that can be
achieved for the major fragment (and in effect the whole query) is
determined by the number of parts present in the data which is 100. Now if
you do not set (2), the default max parallelization limit is 70% of the
number of cores available. If (2) is set by the user, that determines the
max threads that can be used per node. So for our example, the max
parallelization that can be supported is MIN(100,
planner.width.max_per_node). So if the user has planner.width.max_per_node
set to 30, then we end up with a total of 60 threads (on 2 nodes combined)
which need to run 100 minor fragments

With this understanding let's move to the next related topic, which is
"Assignment". Now we have 60 threads (across 2 nodes) and 100 minor
fragments. So how do you assign minor fragments to specific nodes? This is
determined by the affinity that a particular node has for handling a
particular subscan. This can be controlled by the storage plugin by using
the "public List<EndpointAffinity> getOperatorAffinity()" method in the
GroupScan class.

Now to your questions

1. If I have multiple *SubScan*s to be executed, will each *SubScan* be
   handled by a single *Scan* operator ? So whenever I have *n* *SubScan*s,
   I'll have *n* Scan operators distributed among Drill's cluster ?

I am not sure if I even understood your question correctly. Each minor
fragment gets executed in a single thread. In my example, each minor
fragment executes one subscan, followed by project, filter etc. Read [1] to
understand more about this.

2. How can I control the amount of any type of physical operators per
   Drill cluster or node ? For instance, what if I want to have less
   *Filter* operators or more *Scan* operators, how can I do that ?

I am not sure if we can control parallelism at the operator level within a
major fragment.

[1] https://drill.apache.org/docs/drill-query-execution/


On Thu, Jun 1, 2017 at 5:17 AM, Muhammad Gelbana 
wrote:
>
> First of all, I was very happy to at last attend the hangouts meeting,
I've
> been trying to do so for quite sometime.
>
> I know I confused most of you during the meeting but that's because my
> requirements aren't crystal clear at the moment and I'm still learning
what
> Drill can do. Hopefully I learn enough so I would be confident about the
> options I have when I need to make implementation decisions.
>
> Now to the point, and let me restate my case..
>
> We have a proprietary datasource that can perform limits, aggregations,
> filters and joins very fast. This datasource can handle SQL queries but
not
> all possible SQL syntax. I've been successful, so far, to pushdown joins,
> filters and limits, but I'm still struggling with aggregates. I've sent an
> email about aggregates to Calcite's mailing list.
>
> The amount of data this datasource may be required to process can be
> billions of records and 100s of GBs of data. So we are looking forward to
> distribute this data among multiple servers to overcome storage
limitations
> and maximize throughput.
>
> This distribution can be just duplicating the data to maximize throughput,
> so each server will have the same set of data, *or* records may be
> distributed among different servers, without duplication among these
> servers because a single server may not be able to hold all the data. So
> some tables may be duplicated and some tables may be distributed among
> servers. Let's assume that the distribution d

Re: Partitioning for parquet

2017-05-31 Thread rahul challapalli
If most of your queries use date column in the filter condition, I would
partition the data on the date column. Then you can simply say

select * from events where `date` between '2016-11-11' and '2017-01-23';
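
If it helps, a minimal sketch of creating that layout with CTAS (the target
workspace and table name are placeholders; the columns come from your sample):

create table dfs.tmp.events_by_date partition by (`date`) as
select application, status, `date`, message from events;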

- Rahul

On Wed, May 31, 2017 at 3:22 PM, Raz Baluchi  wrote:

> So, if I understand you correctly, I would have to include the 'yr' and
> 'mnth' columns in addition to the 'date' column in the query?
>
> e.g.
>
> select * from events where yr in (2016, 2017)  and mnth in (11,12,1) and
> date between '2016-11-11' and '2017-01-23';
>
> Is that correct?
>
> On Wed, May 31, 2017 at 4:49 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > How to partition data is dependent on how you want to access your data.
> If
> > you can foresee that most of the queries use year and month, then
> go-ahead
> > and partition the data on those 2 columns. You can do that like below
> >
> > create table partitioned_data partition by (yr, mnth) as select
> > extract(year from `date`) yr, extract(month from `date`) mnth, `date`,
> > ... from mydata;
> >
> > For partitioning to have any benefit, your queries should have filters on
> > month and year columns.
> >
> > - Rahul
> >
> > On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi 
> > wrote:
> >
> > > Hi all,
> > >
> > > Trying to understand how parquet partitioning works.
> > >
> > > What is the recommended partitioning scheme for event data that will be
> > > queried primarily by date. I assume that partitioning by year and month
> > > would be optimal?
> > >
> > > Lets say I have data that looks like:
> > >
> > > application,status,date,message
> > > kafka,down,2017-03023 04:53,zookeeper is not available
> > >
> > >
> > > Would I have to create new columns for year and month?
> > >
> > > e.g.
> > > application,status,date,message,year,month
> > > kafka,down,2017-03023 04:53,zookeeper is not available,2017,03
> > >
> > > and then perform a CTAS using the year and month columns as the
> > 'partition
> > > by'?
> > >
> > > Thanks
> > >
> >
>


Re: Partitioning for parquet

2017-05-31 Thread rahul challapalli
How to partition data is dependent on how you want to access your data. If
you can foresee that most of the queries use year and month, then go-ahead
and partition the data on those 2 columns. You can do that like below

create table partitioned_data partition by (yr, mnth) as select
extract(year from `date`) yr, extract(month from `date`) mnth, `date`,
... from mydata;

For partitioning to have any benefit, your queries should have filters on
month and year columns.

- Rahul

On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi  wrote:

> Hi all,
>
> Trying to understand how parquet partitioning works.
>
> What is the recommended partitioning scheme for event data that will be
> queried primarily by date. I assume that partitioning by year and month
> would be optimal?
>
> Lets say I have data that looks like:
>
> application,status,date,message
> kafka,down,2017-03023 04:53,zookeeper is not available
>
>
> Would I have to create new columns for year and month?
>
> e.g.
> application,status,date,message,year,month
> kafka,down,2017-03023 04:53,zookeeper is not available,2017,03
>
> and then perform a CTAS using the year and month columns as the 'partition
> by'?
>
> Thanks
>


Re: Apache Drill takes 5-6 secs in fetching 1000 records from PostgreSQL table

2017-05-30 Thread rahul challapalli
5-6 seconds is a lot of time for the query and dataset size you mentioned.
Did you check the profile to see where the time is being spent?

On Tue, May 30, 2017 at 2:53 AM,  wrote:

> Hi,
>
> I am creating an UNLOGGED table in PostgreSQL and reading it using Apache
> Drill. Table contains just one column with 1000 UUID entries.
> It is taking 5-6 secs for me to read those records.
>
> I am fetching data using below query,
>
> Select uuidColumn from pgPlugin.public.uuidTable
>
>
> Is there anything that I am missing, or is any Drill-level tweaking
> required so that queries can be executed in milliseconds?
>
> Thanks in advance.
>
> Regards,
> Jasbir singh
>


Re: External Sort - Unable to Allocate Buffer error

2017-05-02 Thread rahul challapalli
This is clearly a bug and, like Zelaine suggested, the new sort is still a work
in progress. We have a few similar bugs open for the new sort. I could have
pointed to the jiras but unfortunately JIRA is not working for me due to
firewall issues.

Another suggestion is to build Drill from the latest master and try it out, if
you are willing to spend some time. But again there is no guarantee yet.

Please go ahead and raise a new jira. If it is a duplicate, I will mark it
as such later. Thank you.

- Rahul

On Tue, May 2, 2017 at 8:24 AM, Nate Butler  wrote:

> Zelaine, thanks for the suggestion. I added this option both to the
> drill-override and in the session and this time the query did stay running
> for much longer but it still eventually failed with the same error,
> although much different memory values.
>
>   (org.apache.drill.exec.exception.OutOfMemoryException) Unable to
> allocate
> buffer of size 134217728 due to memory limit. Current allocation:
> 10653214316
> org.apache.drill.exec.memory.BaseAllocator.buffer():220
> org.apache.drill.exec.memory.BaseAllocator.buffer():195
> org.apache.drill.exec.vector.VarCharVector.reAlloc():425
> org.apache.drill.exec.vector.VarCharVector.copyFromSafe():278
> org.apache.drill.exec.vector.NullableVarCharVector.copyFromSafe():379
> org.apache.drill.exec.test.generated.PriorityQueueCopierGen8.
> doCopy():22
> org.apache.drill.exec.test.generated.PriorityQueueCopierGen8.next():76
>
> org.apache.drill.exec.physical.impl.xsort.managed.
> CopierHolder$BatchMerger.next():234
>
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.
> doMergeAndSpill():1408
>
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.
> mergeAndSpill():1376
>
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.
> spillFromMemory():1339
>
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.
> processBatch():831
>
> org.apache.drill.exec.physical.impl.xsort.managed.
> ExternalSortBatch.loadBatch():618
>
> org.apache.drill.exec.physical.impl.xsort.managed.
> ExternalSortBatch.load():660
>
> org.apache.drill.exec.physical.impl.xsort.managed.
> ExternalSortBatch.innerNext():559
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>
> org.apache.drill.exec.physical.impl.aggregate.
> StreamingAggBatch.innerNext():137
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>
> org.apache.drill.exec.physical.impl.partitionsender.
> PartitionSenderRootExec.innerNext():144
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1657
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> java.lang.Thread.run():745 (state=,code=0)
>
> At first I didn't change planner.width.max_per_query and the default on a
> 32 core machine makes it 23. This query failed after 34 minutes. I then
> tried setting planner.width.max_per_query=1 and this query also failed but
> of course took took longer, about 2 hours. In both cases,
> planner.memory.max_query_memory_per_node was set to 230G.
>
>
> On Mon, May 1, 2017 at 11:09 AM, Zelaine Fong  wrote:
>
> > Nate,
> >
> > The Jira you’ve referenced relates to the new external sort, which is not
> > enabled by default, as it is still going through some additional testing.
> > If you’d like to try it to see if it resolves your problem, you’ll need
> to
> > set “sort.external.disable_managed” as follows  in your
> > drill-override.conf file:
> >
> > drill.exec: {
> >   cluster-id: "drillbits1",
> >   zk.connect: "localhost:2181",
> >   sort.external.disable_managed: false
> > }
> >
> > and run the following query:
> >
> > ALTER SESSION SET `exec.sort.disable_managed` = false;
> >
> > -- Zelaine
> >
> > On 5/1/17, 7:44 AM, "Nate Butler"  wrote:
> >
> > We keep running into this issue when trying to issue a query with
> > hashagg
> > disabled. When I look at system memory usage though, drill doesn't
> > seem to
> > be using much of it but still hits this error.
> >
> > Our environment:
> >
> > - 1 r3.8xl
> > - 1 drillbit version 1.10.0 configured with 4GB of Heap and 230G of
> > Direct
> > - Data stored on S3 is compressed CSV
> >
> > I've tried increasing planner.memory.max_query_memory_per_node to
> > 230G and
> > lowered plann

Re: Apache Drill Query Planning Performance

2017-04-26 Thread rahul challapalli
If your Hive metastore contains a lot of metadata (many databases, tables,
columns etc), then Drill might spend a significant amount of time fetching the
metadata the first time. It caches the metadata, so subsequent runs should
be faster. The fact that other queries are run in between the first and
second run of your query does not invalidate the cached metadata. It's not
clear from what you mentioned whether the second run's planning time is as
long as the first run's when you run some other queries in the middle. If so,
there is something else going on.

Also, if you have attached any images (or files), they will be filtered
out. If you want to share something (logs, profiles etc), then go ahead and
raise a jira with all the information you have.

- Rahul

On Wed, Apr 26, 2017 at 7:12 AM, Ivan Kovacevic 
wrote:

> Dear Sir or Madam,
>
> I would like to ask a question regarding query planning, since I am
> writing a chapter about Apache Drill in my master thesis.
> My DrillBit is installed within a Cloudera VM, and there is a separate VM
> with MongoDb installed.
> At the time of writing, I'm performing analysis on the yelp academic
> dataset contained in Hive tables, and joining it with separate data in
> MongoDb.
> When running queries, I have noticed that there is a significant
> difference in the duration of the first planning phase of a query and the
> following planning phases of the same query, e.g.:
>
> SELECT COUNT(*) FROM `hive.yelp_academic_dataset`.review_impala;
>
> - The first time the query is run:
>   - PLANNING: *30.230 sec*
>   - EXECUTION: 27.968 sec
> - The next time the same query is run (given that other queries are
>   not run in the meantime):
>   - PLANNING: *0.087 sec*
>   - EXECUTION: 34.682 sec
>
>
> The reason I find it rather odd is that if another query runs in the
> meantime, the next time the first query is re-run, it will again take a
> long time to finish the query planning phase.
> What causes such a difference in the query planning phase duration?
> I'm looking forward to your answer.
>
> Best Regards,
> Ivan Kovačević
>


Re: Support for ORC files

2017-04-13 Thread rahul challapalli
What you need is a format plugin. You can take a look at the Text format
plugin while reading Paul's documentation which Abhishek already shared.
Don't look at parquet as it is more complicated. A short summary of what
you need (maybe too short to be of any use :) ):

1. A group of classes which make Drill recognize your format plugin.
2. An ORC reader. This will be the heart of the project. Essentially you
provide a way to read data (columns) from ORC files and then populate
Drill's value vectors. You can later enhance this by parallelizing the
reads of individual columns.
3. Once you have the format plugin working, you might want to start playing
with planner rules if you want features like "filter pushdown into the
scan" etc.
- Rahul

On Apr 13, 2017 2:57 PM, "Manoj Murumkar"  wrote:

Thanks. I knew about the hive table format support. I'll look into reading
directly from orc files on hdfs (a la parquet). Is there some documentation
around how to develop a new storage plugin?

> On Apr 13, 2017, at 2:51 PM, Abhishek Girish  wrote:
>
> Drill does not support ORC as a DFS file format. You are welcome to
> contribute. As a workaround, Drill supports reading ORC files via the Hive
> plugin, so you should be able use that.
>
> On Thu, Apr 13, 2017 at 2:19 PM, Manoj Murumkar 
> wrote:
>
>> Hi!
>>
>> I am wondering if someone is actively working on ORC support already.
>> Appreciate any pointers.
>>
>> Thanks,
>>
>> Manoj
>>


Re: Support for ORC files

2017-04-13 Thread rahul challapalli
Drill indirectly supports reading ORC files through the Hive plugin. Apart
from that, I am not aware of any efforts from the community to come up with a
format plugin for ORC.
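
For completeness, a minimal sketch of the Hive-plugin route (the table name is
a placeholder and assumes an ORC-backed table already registered in the Hive
metastore):

select * from hive.`my_orc_table` limit 10;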

Rahul

On Apr 13, 2017 2:19 PM, "Manoj Murumkar"  wrote:

> Hi!
>
> I am wondering if someone is actively working on ORC support already.
> Appreciate any pointers.
>
> Thanks,
>
> Manoj
>


Re: Quoting queries

2017-03-30 Thread rahul challapalli
Hmm...strange. It works for me on drill 1.9.0 from the sqlline client. Can
you try running it from sqlline just so that we can eliminate other tools
trying to do some validation and failing?

0: jdbc:drill:zk=x.x.x.x:5181> select * from `a/b/c.json`;
+-----+
| id  |
+-----+
| 1   |
+-----+
1 row selected (0.209 seconds)

0: jdbc:drill:zk=x.x.x.x:5181> select * from `a/*/c.json`;
+-------+-----+
| dir0  | id  |
+-------+-----+
| b     | 1   |
+-------+-----+
1 row selected (0.233 seconds)

On Thu, Mar 30, 2017 at 12:57 AM, Lane David (ST-ESS/MKP3.2) <
david.l...@de.bosch.com> wrote:

> Hi all,
>
> we are experimenting with Drill at the moment. Everything is working fine
> on the server and I can execute any queries I need there successfully. I
> followed the instructions on the following page to get it working from a
> client computer:
> https://drill.apache.org/docs/using-jdbc-with-squirrel-on-windows/
>
> This has also been successful and I can execute simple queries from
> Squirrel. However I get various parse errors when certain other patterns
> are used.
>
> The following works:
> select * from `a/b/c.json`;
>
> The following doesn't:
> select * from `a/*/c.json`;
>
> Error: PARSE ERROR: Lexical error at line 1, column 44.  Encountered:
>  after : "`a"
>
> I tried the following:
> select * from `a/\*/c.json`;
> Error: VALIDATION ERROR: From line 1, column 33 to line 1, column 60:
> Table 'a/\*/c.json' not found
> And various other combinations of \, /, ', etc.
>
> I have tried Squirrel, SQL WB, Intellij
>
> Anyone got this working?
>
> Mit freundlichen Grüßen / Best regards
>
> David Lane
>
> Product Management Cloud-based Services (ST-ESS/MKP3.2)
> Bosch Sicherheitssysteme GmbH | Postfach 11 11 | 85626 Grasbrunn | GERMANY
> | boschbuildingsecurity.com
> Tel. +49(89)6290-1674 | Fax +49(89)6290 | david.l...@de.bosch.com
>
> Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart HRB 23118
> Aufsichtsratsvorsitzender: Stefan Hartung; Geschäftsführung: Gert van
> Iperen, Andreas Bartz, Thomas Quante, Bernhard Schuster
>
>
>
>


Re: JDBC disconnections over remote networks

2017-03-30 Thread rahul challapalli
I haven't used G1GC in any of my testing. So I cannot comment much on
whether it would be helpful or not.

On Thu, Mar 30, 2017 at 8:35 AM, Wesley Chow  wrote:

> Sorry I haven't had time to look into this much and fix our logging setup,
> but I did try explicitly setting JVM heap values in the client rather than
> relying on the default allocation and after a few runs it does seem that
> fixed it. I'm going to cautiously say that was the issue. Thanks!
>
> Would it be prudent to use G1GC for all our clients, since its pauses are
> supposed to be far less severe?
>
> Wes
>
>
> On Tue, Mar 28, 2017 at 1:42 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Also how much memory did you configure your client to use? If the client
> > does not have sufficient memory to run, then garbage collector could
> start
> > running and thereby causing the client to become un-responsive to
> > heartbeats. So also kindly check the sqlline logs as well for any
> > exceptions
> >
> > On Mon, Mar 27, 2017 at 1:43 PM, Wesley Chow  wrote:
> >
> > > That's totally possible. The ErrorIds are stored on the drillbit
> machines
> > > right? Our logging is configured incorrectly at the moment so I can't
> > find
> > > the error. Will fix that and report back.
> > >
> > > If I limit to 100,000 rows the query consistently works. If I limit to
> 1M
> > > rows then the query consistently disconnects. If I CTAS on 1M rows then
> > it
> > > works, so it does appear to be an issue only when returning results to
> > the
> > > client. I don't know if there is some value between 100k and 1M for
> which
> > > it sometimes works and sometimes doesn't. Is that useful to know? I can
> > do
> > > a little binary searching on values if that would help.
> > >
> > > Wes
> > >
> > >
> > > On Mon, Mar 27, 2017 at 4:13 PM, rahul challapalli <
> > > challapallira...@gmail.com> wrote:
> > >
> > > > Do you think that the error you are seeing is related to DRILL-4708
> > > > <https://issues.apache.org/jira/browse/DRILL-4708> ? If not kindly
> > > provide
> > > > more information about the error (message, stack trace etc). And also
> > > does
> > > > the connection error happen consistently after returning X number of
> > > > records or is it random?
> > > >
> > > > - Rahul
> > > >
> > > > On Mon, Mar 27, 2017 at 1:07 PM, Wesley Chow 
> > wrote:
> > > >
> > > > > hi all,
> > > > >
> > > > > I've been noticing that queries that return large numbers of rows
> > (1M+,
> > > > > each row maybe around 500 bytes) via the JDBC connector (and thus
> > > > sqlline)
> > > > > from our office to drillbits in EC2 consistently disconnect with a
> > > > > connection error while streaming the results back. The same query
> > > > initiated
> > > > > from an EC2 machine works fine. Any thoughts on what I should be
> > > looking
> > > > > at? When the disconnection occurs, none of my other network
> > connections
> > > > > such as ssh are affected, just the Drill JDBC connector.
> > > > >
> > > > > Thanks,
> > > > > Wes
> > > > >
> > > >
> > >
> >
>


Re: Apache Drill Clarification on Reading Parquet files

2017-03-29 Thread rahul challapalli
Welcome to the community and we are glad you are considering drill for your
use-case.

1. There are a few ways in which you can make drill avoid reading all the
files. Take a look at the below items
  a) Partition your data and store the partition information in the
parquet footer. Documentation can be found at
https://drill.apache.org/docs/partition-by-clause/
  b) Partition your data based on directory structure. Documentation
can be found at https://drill.apache.org/docs/how-to-partition-data/
  c) You can also leverage parquet filter pushdown, which works at the
row-group level. Even with a single parquet file you can avoid reading all of
its row groups. Documentation can be found at
https://drill.apache.org/docs/parquet-filter-pushdown/

2. So you do not have a distributed file system like MapR-FS or HDFS but
still want to run Drill on multiple nodes. One obvious requirement would be
to make sure your data is replicated exactly on all the nodes where Drill
is running. Drill also uses ZooKeeper for coordination, so you would still
need to install that. Since this is not a widely used/tested configuration,
I wouldn't be surprised if you run into issues.

Also, if you have a lot of parquet files, you may want to take a look at the
parquet metadata caching feature (
https://drill.apache.org/docs/optimizing-parquet-metadata-reading/).
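
To make (a), (b) and the metadata cache concrete, here is a rough sketch (the
workspace, paths and column names below are placeholders, not from your setup):

create table dfs.tmp.sales_by_year partition by (yr) as
select extract(year from `date`) yr, `date`, amount
from dfs.`/data/sales_parquet`;

refresh table metadata dfs.tmp.sales_by_year;

-- with a filter on the partition column, Drill should prune to matching files
select * from dfs.tmp.sales_by_year where yr = 2016;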

- Rahul

On Wed, Mar 29, 2017 at 1:02 PM, basil arockia edwin <
basil.edwi...@gmail.com> wrote:

> Dear team,
>We are planning to use Apache Drill in our project to query parquet
> files residing in the file system / OpenStack Swift, which we would use
> in our web application for analytics purposes.
> We need the below questions clarified to take a further decision.
>
> 1. If we have 1000 parquet files in a directory and our required results
> are in only 5 files, does Drill search the metadata of the entire 1000 parquet
> files or only the associated 5 files?
>
> 2. Is it possible to install Apache Drill in cluster mode without using
> HDFS for scaling?
>
> Thanks,
> Basil
>


Re: JDBC disconnections over remote networks

2017-03-28 Thread rahul challapalli
Also, how much memory did you configure your client to use? If the client
does not have sufficient memory to run, the garbage collector could start
running, thereby causing the client to become unresponsive to
heartbeats. So kindly check the sqlline logs as well for any
exceptions.

On Mon, Mar 27, 2017 at 1:43 PM, Wesley Chow  wrote:

> That's totally possible. The ErrorIds are stored on the drillbit machines
> right? Our logging is configured incorrectly at the moment so I can't find
> the error. Will fix that and report back.
>
> If I limit to 100,000 rows the query consistently works. If I limit to 1M
> rows then the query consistently disconnects. If I CTAS on 1M rows then it
> works, so it does appear to be an issue only when returning results to the
> client. I don't know if there is some value between 100k and 1M for which
> it sometimes works and sometimes doesn't. Is that useful to know? I can do
> a little binary searching on values if that would help.
>
> Wes
>
>
> On Mon, Mar 27, 2017 at 4:13 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Do you think that the error you are seeing is related to DRILL-4708
> > <https://issues.apache.org/jira/browse/DRILL-4708> ? If not kindly
> provide
> > more information about the error (message, stack trace etc). And also
> does
> > the connection error happen consistently after returning X number of
> > records or is it random?
> >
> > - Rahul
> >
> > On Mon, Mar 27, 2017 at 1:07 PM, Wesley Chow  wrote:
> >
> > > hi all,
> > >
> > > I've been noticing that queries that return large numbers of rows (1M+,
> > > each row maybe around 500 bytes) via the JDBC connector (and thus
> > sqlline)
> > > from our office to drillbits in EC2 consistently disconnect with a
> > > connection error while streaming the results back. The same query
> > initiated
> > > from an EC2 machine works fine. Any thoughts on what I should be
> looking
> > > at? When the disconnection occurs, none of my other network connections
> > > such as ssh are affected, just the Drill JDBC connector.
> > >
> > > Thanks,
> > > Wes
> > >
> >
>


Re: JDBC disconnections over remote networks

2017-03-27 Thread rahul challapalli
Do you think that the error you are seeing is related to DRILL-4708
<https://issues.apache.org/jira/browse/DRILL-4708>? If not, kindly provide
more information about the error (message, stack trace etc). And also, does
the connection error happen consistently after returning X number of
records, or is it random?

- Rahul

On Mon, Mar 27, 2017 at 1:07 PM, Wesley Chow  wrote:

> hi all,
>
> I've been noticing that queries that return large numbers of rows (1M+,
> each row maybe around 500 bytes) via the JDBC connector (and thus sqlline)
> from our office to drillbits in EC2 consistently disconnect with a
> connection error while streaming the results back. The same query initiated
> from an EC2 machine works fine. Any thoughts on what I should be looking
> at? When the disconnection occurs, none of my other network connections
> such as ssh are affected, just the Drill JDBC connector.
>
> Thanks,
> Wes
>


Re: Minimise query plan time for dfs plugin for local file system on tsv file

2017-03-07 Thread rahul challapalli
I did not get a chance to review the log file.

However the next thing I would try is to make your cluster a single node
cluster first and then run the same explain plan query separately on each
individual file.



On Mar 7, 2017 5:09 AM, "PROJJWAL SAHA"  wrote:

> Hi Rahul,
>
> thanks for your suggestions. However, I am still not able to see any
> reduction in query planning time
> by explicit column names, removing extract headers and using columns[index]
>
> As I said, I ran explain plan and its taking 30+ secs for me.
> My data is 1 GB tsv split into 20 files of 5 MB each.
> Each 5MB file has close to 50k records
> Its a 5 node cluster, and width per node is 4
> Therefore, total number of minor fragments for one major fragment is 20
> I have copied the source directory in all the drillbit nodes
>
> Can you give me a reasonable estimate of the time within which I can expect
> Drill to return results for a query in the scenario described above?
> Query is - select columns[0] from 
> dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv`
> where columns[0] ='41' and columns[3] ='568'
>
> attached is the json profile
> and the drillbit.log
>
> I also have the tracing enabled.
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
> org.apache.drill.exec.work.foreman.Foreman
> However, I see the duration of the various steps in the order of ms in the logs.
> I am not sure where planning time on the order of 30 secs is being consumed.
>
> Please help
>
> Regards,
> Projjwal
>
>
>
>
>
>
>
> On Mon, Mar 6, 2017 at 11:23 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
>> You can try the below things. For each of the below check the planning
>> time
>> individually
>>
>> 1. Run explain plan for a simple "select * from `
>> /scratch/localdisk/drill/testdata/Cust_1G_tsv`"
>> 2. Replace the '*' in your query with explicit column names
>> 3. Remove the extract header from your storage plugin configuration and
>> from your data files? Rewrite your query to use, columns[0_based_index]
>> instead of explicit column names
>>
>> Also how many columns do you have in your text files and what is the size
>> of each file? Like gautam suggested, it would be good to take a look at
>> drillbit.log file (from the foreman node where planning occurred) and the
>> query profile as well.
>>
>> - Rahul
>>
>> On Mon, Mar 6, 2017 at 9:30 AM, Gautam Parai  wrote:
>>
>> > Can you please provide the drillbit.log file?
>> >
>> >
>> > Gautam
>> >
>> > 
>> > From: PROJJWAL SAHA 
>> > Sent: Monday, March 6, 2017 1:45:38 AM
>> > To: user@drill.apache.org
>> > Subject: Fwd: Minimise query plan time for dfs plugin for local file
>> > system on tsv file
>> >
>> > all, please help me in giving suggestions on what areas i can look into
>> > why the query planning time is taking so long for files which are local
>> to
>> > the drill machines. I have the same directory structure copied on all
>> the 5
>> > nodes of the cluster. I am accessing the source files using out of the
>> box
>> > dfs storage plugin.
>> >
>> > Query planning time is approx 30 secs
>> > Query execution time is apprx 1.5 secs
>> >
>> > Regards,
>> > Projjwal
>> >
>> > -- Forwarded message --
>> > From: PROJJWAL SAHA mailto:proj.s...@gmail.com>>
>> > Date: Fri, Mar 3, 2017 at 5:06 PM
>> > Subject: Minimise query plan time for dfs plugin for local file system
>> on
>> > tsv file
>> > To: user@drill.apache.org<mailto:user@drill.apache.org>
>> >
>> >
>> > Hello all,
>> >
>> > I am quering select * from dfs.xxx where yyy (filter condition)
>> >
>> > I am using dfs storage plugin that comes out of the box from drill on a
>> > 1GB file, local to the drill cluster.
>> > The 1GB file is split into 10 files of 100 MB each.
>> > As expected I see 11 minor and 2 major fagments.
>> > The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
>> >
>> > One observation is that the query plan time is more than 30 seconds. I
>> ran
>> > the explain plan query to validate this.
>> > The query execution time is 2 secs.
>> > total time taken is 32secs
>> >
>> > I wanted to understand how can i minimise the query plan time.
>> Suggestions
>> > ?
>> > Is the time taken described above expected ?
>> > Attached is result from explain plan query
>> >
>> > Regards,
>> > Projjwal
>> >
>> >
>> >
>>
>
>


Re: Minimise query plan time for dfs plugin for local file system on tsv file

2017-03-06 Thread rahul challapalli
You can try the below things. For each of the below check the planning time
individually

1. Run explain plan for a simple "select * from `
/scratch/localdisk/drill/testdata/Cust_1G_tsv`"
2. Replace the '*' in your query with explicit column names
3. Remove the extract header from your storage plugin configuration and
from your data files, and rewrite your query to use columns[0_based_index]
instead of explicit column names (a sketch follows below)
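
For item 3, a minimal sketch based on the query from your earlier mail (the
filter values are yours; the indexes are 0-based):

select columns[0]
from dfs.`/scratch/localdisk/drill/testdata/Cust_1G_tsv`
where columns[0] = '41' and columns[3] = '568';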

Also, how many columns do you have in your text files and what is the size
of each file? Like Gautam suggested, it would be good to take a look at the
drillbit.log file (from the foreman node where planning occurred) and the
query profile as well.

- Rahul

On Mon, Mar 6, 2017 at 9:30 AM, Gautam Parai  wrote:

> Can you please provide the drillbit.log file?
>
>
> Gautam
>
> 
> From: PROJJWAL SAHA 
> Sent: Monday, March 6, 2017 1:45:38 AM
> To: user@drill.apache.org
> Subject: Fwd: Minimise query plan time for dfs plugin for local file
> system on tsv file
>
> all, please help me in giving suggestions on what areas i can look into
> why the query planning time is taking so long for files which are local to
> the drill machines. I have the same directory structure copied on all the 5
> nodes of the cluster. I am accessing the source files using out of the box
> dfs storage plugin.
>
> Query planning time is approx 30 secs
> Query execution time is apprx 1.5 secs
>
> Regards,
> Projjwal
>
> -- Forwarded message --
> From: PROJJWAL SAHA mailto:proj.s...@gmail.com>>
> Date: Fri, Mar 3, 2017 at 5:06 PM
> Subject: Minimise query plan time for dfs plugin for local file system on
> tsv file
> To: user@drill.apache.org
>
>
> Hello all,
>
> I am quering select * from dfs.xxx where yyy (filter condition)
>
> I am using dfs storage plugin that comes out of the box from drill on a
> 1GB file, local to the drill cluster.
> The 1GB file is split into 10 files of 100 MB each.
> As expected I see 11 minor and 2 major fagments.
> The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
>
> One observation is that the query plan time is more than 30 seconds. I ran
> the explain plan query to validate this.
> The query execution time is 2 secs.
> total time taken is 32secs
>
> I wanted to understand how can i minimise the query plan time. Suggestions
> ?
> Is the time taken described above expected ?
> Attached is result from explain plan query
>
> Regards,
> Projjwal
>
>
>


Re: Explain Plan for Parquet data is taking a lot of time

2017-03-06 Thread rahul challapalli
lto:ppenumar...@mapr.com]
>
> > Sent: Friday, February 24, 2017 11:22 PM
>
> > To: user@drill.apache.org
>
> > Subject: Re: Explain Plan for Parquet data is taking a lot of timre
>
> >
>
> > Yes, limit is pushed down to parquet reader in 1.9. But, that will not
> help with planning time.
>
> > It is definitely worth trying with 1.9 though.
>
> >
>
> > Thanks,
>
> > Padma
>
> >
>
> >
>
> >> On Feb 24, 2017, at 7:26 AM, Andries Engelbrecht <aengelbre...@mapr.com> wrote:
>
> >>
>
> >> Looks like the metadata cache is being used  "usedMetadataFile=true, ".
> But to be sure did you perform a REFRESH TABLE METADATA  on
> the parquet data?
>
> >>
>
> >>
>
> >> However it looks like it is reading a full batch " rowcount = 32600.0,
> cumulative cost = {32600.0 rows, 32600.0"
>
> >>
>
> >>
>
> >> Didn't the limit operator get pushed down to the parquet reader in 1.9?
>
> >>
>
> >> Perhaps try 1.9 and see if in the ParquetGroupScan the number of rows
> gets reduced to 100.
>
> >>
>
> >>
>
> >> Can you look in the query profile where time is spend, also how long it
> takes before the query starts to run in the WebUI profile.
>
> >>
>
> >>
>
> >> Best Regards
>
> >>
>
> >>
>
> >> Andries Engelbrecht
>
> >>
>
> >>
>
> >> Senior Solutions Architect
>
> >>
>
> >> MapR Alliances and Channels Engineering
>
> >>
>
> >>
>
> >> aengelbre...@mapr.com
>
> >>
>
> >>
>
>
> >>
>
> >> 
>
> >> From: Jinfeng Ni <j...@apache.org>
>
> >> Sent: Thursday, February 23, 2017 4:53:34 PM
>
> >> To: user
>
> >> Subject: Re: Explain Plan for Parquet data is taking a lot of timre
>
> >>
>
> >> The reason the plan shows only one single parquet file is because
>
> >> "LIMIT 100" is applied and filter out the rest of them.
>
> >>
>
> >> Agreed that parquet metadata caching might help reduce planning time,
>
> >> when there are large number of parquet files.
>
> >>
>
> >> On Thu, Feb 23, 2017 at 4:44 PM, rahul challapalli
>
> >> <challapallira...@gmail.com> wrote:
>
> >>> You said there are 2144 parquet files but the plan suggests that you
>
> >>> only have a single parquet file. In any case its a long time to plan
> the query.
>
> >>> Did you try the metadata caching feature [1]?
>
> >>>
>
> >>> Also how many rowgroups and columns are present in the parquet file?
>
> >>>
>
> >>> [1]
>
> >>> https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
>
> >>>
>
> >>> - Rahul
>
> >>>
>
> >>> On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod <jeena.vi...@oracle.com> wrote:
>
> >>>
>
> >>>> Hi,
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>> Drill is taking 23 minutes for a simple select * query with limit
>
> >>>> 100 on 1GB uncompressed parquet data. EXPLAIN PLAN for this query
>
> >>>> is also taking that long(~23 minutes).
>
> >>>>
>
> >>>> Query: select * from .root.`testdata` limit 100;
>
> >>>>
>
> >>>> Query  Plan:
>
> >>>>
>
> >>>> 00-00Screen : rowType = RecordType(ANY *): rowcount = 100.0,
>
> >>>> cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network,
>
> >>>> 0.0 memory}, id = 1429
>
> >>>>
>
> >>>> 00-01  Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
>
> >>>> 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0
>
> >>>> network,
>
> >>>> 0.0 memory}, id = 1428
>
> >>>>
>
> >>>> 00-02SelectionVectorRemover : rowType = (DrillRecordRow[*]):
>
> >>>> rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0
>
> >>>> io, 0.0 network, 0.0 memory}, id = 1427
>
> >>>>
>
> >>>> 00-03  Limit(fetch=[100]) : rowType = (DrillRecordRow[*]):
>
> >>>> rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0
>
> >>>> io, 0.0 network, 0.0 memory}, id = 1426
>
> >>>>
>
> >>>> 00-04Scan(groupscan=[ParquetGroupScan
>
> >>>> [entries=[ReadEntryWithPath [path=/testdata/part-r-0-
>
> >>>> 097f7399-7bfb-4e93-b883-3348655fc658.parquet]],
>
> >>>> selectionRoot=/testdata, numFiles=1, usedMetadataFile=true,
>
> >>>> cacheFileRoot=/testdata,
>
> >>>> columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount =
>
> >>>> 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0
>
> >>>> network, 0.0 memory}, id = 1425
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>> I am using Drill1.8 and it is setup on 5 node 32GB cluster and the
>
> >>>> data is in Oracle Storage Cloud Service. When I run the same query
>
> >>>> on 1GB TSV file in this location it is taking only 38 seconds .
>
> >>>>
>
> >>>> Also testdata contains around 2144 .parquet files each around 500KB.
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>> Is there any additional configuration required for parquet?
>
> >>>>
>
> >>>> Kindly suggest how to improve the response time here.
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>> Regards
>
> >>>> Jeena
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>>
>
> >>>>
>
> >
>
>
>
>
>


Re: Metadata Caching

2017-03-06 Thread rahul challapalli
There is no need to refresh the metadata for every query. You only need to
generate the metadata cache once for each folder. Now if your data gets
updated, then any subsequent query you submit will automatically refresh
the metadata cache. Again you need not run the "refresh table metadata
" command  explicitly. Refer to [1] and ignore the reference
to "session" on that page.

[1] https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
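For example, generating the cache once for a Parquet folder looks roughly
like the below (the path is only an illustration):

refresh table metadata dfs.`/data/parquet/my_table`;

After that, any subsequent query will refresh the cache automatically if the
underlying data changes.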

- Rahul



On Mon, Mar 6, 2017 at 7:49 AM, Chetan Kothari 
wrote:

> Hi All
>
>
>
> As I understand,  we can trigger generation of the Parquet Metadata Cache
> File by using REFRESH TABLE METADATA .
>
> It seems we need to run this command on a directory, nested or flat, once
> during the session.
>
>
>
> Why we need to run for every session? That implies if I use REST API to
> fire query, I have to generate meta-data cache file as part of every REST
> API call.
>
> This seems to be issue as I have seen that generation of meta-data cache
> file takes some significant time.
>
>
>
> Can't we define/configure  cache expiry time so that we can keep meta-data
> in cache for longer duration?
>
>
>
> Any inputs on this will be helpful.
>
>
>
> Regards
>
> Chetan
>
>
>


Re: Drill 1.9 Null pointer Exception

2017-03-03 Thread rahul challapalli
It looks like you are trying to query a Hive table (backed by an HBase
table) from Drill. Can you try querying the same table from Hive itself? I
would also log in to HBase and check whether the underlying table exists or
not.
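As a sketch, the Hive-side check could be as simple as the query below (the
schema and table names are taken from the stack trace further down in this
thread); the HBase-side check would be running "list" from the hbase shell:

select * from social.twitter_test_nlp limit 1;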

On Thu, Mar 2, 2017 at 2:14 AM, Khurram Faraaz  wrote:

> Can you please share your query and the type of data over which the query
> is executed ?
>
> 
> From: Anas A 
> Sent: Thursday, March 2, 2017 2:18:32 PM
> To: user@drill.apache.org
> Cc: prasann...@trinitymobility.com; 'Sushil'
> Subject: RE: Drill 1.9 Null pointer Exception
>
> Hi Khurram,
> Thanks for your response. The HBase and hive version is not changed , we
> only updated the drill version to 1.9 . our requirement is to work with
> Spatial queries which is supported from 1.9 .  is there any way to fix the
> issue.
>
> Thanks & Regards
> Anas A,
> Trinity Mobility Pvt. Ltd | Bangalore | +91-7736368236
>
>
>
>
> -Original Message-
> From: Khurram Faraaz [mailto:kfar...@mapr.com]
> Sent: 02 March 2017 14:02
> To: user@drill.apache.org
> Cc: prasann...@trinitymobility.com; 'Sushil' 
> Subject: Re: Drill 1.9 Null pointer Exception
>
> Hi Anas,
>
>
> Not sure what is causing the NPE, is your HBase version same as before ?
>
>
> This assertion that you see in the stack trace below, was recently fixed in
> DRILL-5040, you may want to try to latest available build Drill 1.10.0
>
>
> Caused by: java.lang.AssertionError: Internal error: Error while applying
> rule DrillPushProjIntoScan, args
> [rel#35532:LogicalProject.NONE.ANY([]).[](input=rel#
> 35531:Subset#0.ENUMERABL
> E.ANY([]).[],$f0=0),
> rel#35516:EnumerableTableScan.ENUMERABLE.ANY([]).[](table=[dfs.tmp,
> bfe2dad0-921a-4f06-9799-494ab8a7246d/851a124c-80a1-
> 45e3-9496-d2562007911e])]
> at org.apache.calcite.util.Util.newInternal(Util.java:792)
> ~[calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
> at
> org.apache.calcite.plan.volcano.VolcanoRuleCall.
> onMatch(VolcanoRuleCall.java
> :251) ~[calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.
> findBestExp(VolcanoPlanner.ja
> va:808) ~[calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
> at
> org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303)
> ~[calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
> at
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(
> Defau
> ltSqlHandler.java:404)
> ~[drill-java-exec-1.10.0-SNAPSHOT.jar:1.10.0-SNAPSHOT]
>
>
>
> Thanks,
>
> Khurram
>
> 
> From: Anas A 
> Sent: Thursday, March 2, 2017 1:07:46 PM
> To: user@drill.apache.org
> Cc: prasann...@trinitymobility.com; 'Sushil'
> Subject: Drill 1.9 Null pointer Exception
>
> Hi,
> I am trying to work with Apache drill 1.9 to access HBase and Hive tables
> am
> getting a nullpointer Exception. The same table I queried using drill 1.8
> and it worked fine without any issues. Attaching the Error .Please suggest.
>
>
>
> ERROR :
>
> select * from twitter_test_nlp limit 10;
> Error: SYSTEM ERROR: NullPointerException
>
>
> [Error Id: 8c747c22-4f7f-4ba7-b30c-cb5fb3614a41 on
> master01.trinitymobility.local:31010]
>
>   (org.apache.drill.exec.work.foreman.ForemanException) Unexpected
> exception
> during fragment initialization: Internal error: Error while applying rule
> DrillPushProjIntoScan, args
> [rel#980:LogicalProject.NONE.ANY([]).[](input=rel#979:Subset#
> 0.ENUMERABLE.AN
> Y([]).[],id=$0,dates=$1,times=$2,time_zone=$3,users=$4,
> profile_image_url=$5,
> latitude=$6,longitude=$7,twitter_handle=$8,sentiment=$
> 9,language=$10,lang_pr
> obability=$11,text=$12),
> rel#964:EnumerableTableScan.ENUMERABLE.ANY([]).[](table=[hive, social,
> twitter_test_nlp])]
> org.apache.drill.exec.work.foreman.Foreman.run():281
> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> java.lang.Thread.run():745
>   Caused By (java.lang.AssertionError) Internal error: Error while applying
> rule DrillPushProjIntoScan, args
> [rel#980:LogicalProject.NONE.ANY([]).[](input=rel#979:Subset#
> 0.ENUMERABLE.AN
> Y([]).[],id=$0,dates=$1,times=$2,time_zone=$3,users=$4,
> profile_image_url=$5,
> latitude=$6,longitude=$7,twitter_handle=$8,sentiment=$
> 9,language=$10,lang_pr
> obability=$11,text=$12),
> rel#964:EnumerableTableScan.ENUMERABLE.ANY([]).[](table=[hive, social,
> twitter_test_nlp])]
> org.apache.calcite.util.Util.newInternal():792
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch():251
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp():808
> org.apache.calcite.tools.Programs$RuleSetProgram.run():303
>
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform():
> 404
>
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform():
> 343
>
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.
> convertToDrel()
> :24

Re: RetriesExhaustedException in drill

2017-03-02 Thread rahul challapalli
It would be helpful if you provide more context.

1. What sort of query are you running
2. Where is the data stored and in what format
3. What is the size of the data
4. Full stack trace of the exception

- Rahul

On Thu, Mar 2, 2017 at 3:11 AM, prasanna lakshmi  wrote:

> Hi All,
>
>   In regular time intervals whenever we  request for data through
> Drill, Application throws exception . Even Drill bit service status remains
> unchanged as Running at this point of time  . On restarting the Drill Bit
> service Application resumes to work well. I am mentioning the exception
> below:
>
>java.sql.SQLException: SYSTEM ERROR: RetriesExhaustedException: Can't
> get the locations
>


Re: Explain Plan for Parquet data is taking a lot of time

2017-02-23 Thread rahul challapalli
You said there are 2144 parquet files but the plan suggests that you only
have a single parquet file. In any case its a long time to plan the query.
Did you try the metadata caching feature [1]?

Also how many rowgroups and columns are present in the parquet file?

[1] https://drill.apache.org/docs/optimizing-parquet-metadata-reading/

- Rahul

On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod  wrote:

> Hi,
>
>
>
> Drill is taking 23 minutes for a simple select * query with limit 100 on
> 1GB uncompressed parquet data. EXPLAIN PLAN for this query is also taking
> that long(~23 minutes).
>
> Query: select * from .root.`testdata` limit 100;
>
> Query  Plan:
>
> 00-00Screen : rowType = RecordType(ANY *): rowcount = 100.0,
> cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 1429
>
> 00-01  Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
> 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 1428
>
> 00-02SelectionVectorRemover : rowType = (DrillRecordRow[*]):
> rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 1427
>
> 00-03  Limit(fetch=[100]) : rowType = (DrillRecordRow[*]):
> rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 1426
>
> 00-04Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=/testdata/part-r-0-
> 097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata,
> numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata,
> columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0,
> cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 1425
>
>
>
> I am using Drill1.8 and it is setup on 5 node 32GB cluster and the data is
> in Oracle Storage Cloud Service. When I run the same query on 1GB TSV file
> in this location it is taking only 38 seconds .
>
> Also testdata contains around 2144 .parquet files each around 500KB.
>
>
>
> Is there any additional configuration required for parquet?
>
> Kindly suggest how to improve the response time here.
>
>
>
> Regards
> Jeena
>
>
>
>
>
>
>
>
>
>
>


Re: Issue with drill query

2017-02-07 Thread rahul challapalli
Your query is the longest query I have heard of :)

In any case, let's try the below steps:

1. Can you first try your query directly on hive? If hive reports an error
during its metadata operations, then you can expect drill to fail as well
during planning.
2. Increase the heap memory: Since the query is failing at the planning
stage, let's give the planner more memory and see if we can make any progress.

- Rahul


On Tue, Feb 7, 2017 at 7:55 AM,  wrote:

> Hi all,
>
> I hope this is the correct place to ask for help.
>
> We have some hive tables stored as textfile that we try to query from, and
> perform some joins.
> Because we have programmed a SQL Query generator and based on end user’s
> selection, it came up with a SQL query that is about 1 million characters
> long and has about 700 selects/nested queries.
> After submitting the query thru Drill’s web interface, the query is
> accepted but stuck in Starting state. Further checks from log file, we got
> the following.
> We are running a 2 node drill cluster, m4.2xlarge 8 vCPU and 32GB of ram.
>
> Thus wanted to check if anyone has ran a long query like ours before or we
> should be looking at another area on the issue.
>
> Thank you in advance!
>
> 2017-02-07 15:25:45,995 [27661a58-2d7e-650c--b1c82b38bb9a:foreman]
> ERROR o.a.d.e.s.hive.HiveMetadataProvider - Failed to parse Hive stats in
> metastore.
> java.lang.NumberFormatException: null
> at java.lang.Long.parseLong(Long.java:552) ~[na:1.8.0_111]
> at java.lang.Long.valueOf(Long.java:803) ~[na:1.8.0_111]
> at org.apache.drill.exec.store.hive.HiveMetadataProvider.
> getStatsFromProps(HiveMetadataProvider.java:212)
> [drill-storage-hive-core-1.9.0.jar:1.9.0]
> at org.apache.drill.exec.store.hive.HiveMetadataProvider.
> getStats(HiveMetadataProvider.java:90) [drill-storage-hive-core-1.9.
> 0.jar:1.9.0]
> at 
> org.apache.drill.exec.store.hive.HiveScan.getScanStats(HiveScan.java:224)
> [drill-storage-hive-core-1.9.0.jar:1.9.0]
> at org.apache.drill.exec.physical.base.AbstractGroupScan.
> getScanStats(AbstractGroupScan.java:79) [drill-java-exec-1.9.0.jar:1.9.0]
> at org.apache.drill.exec.planner.logical.DrillScanRel.
> computeSelfCost(DrillScanRel.java:159) [drill-java-exec-1.9.0.jar:1.9.0]
> at org.apache.calcite.rel.metadata.RelMdPercentageOriginalRows.
> getNonCumulativeCost(RelMdPercentageOriginalRows.java:165)
> [calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
> at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
> ~[na:na]
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_111]
> at java.lang.reflect.Method.invoke(Method.java:498)
> ~[na:1.8.0_111]
> at org.apache.calcite.rel.metadata.ReflectiveRelMetadataProvider$
> 1$1.invoke(ReflectiveRelMetadataProvider.java:182)
> [calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
> at com.sun.proxy.$Proxy71.getNonCumulativeCost(Unknown Source)
> [na:na]
> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
> ~[na:na]
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_111]
> at java.lang.reflect.Method.invoke(Method.java:498)
> ~[na:1.8.0_111]
> at org.apache.calcite.rel.metadata.ChainedRelMetadataProvider$
> ChainedInvocationHandler.invoke(ChainedRelMetadataProvider.java:109)
> [calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
> at com.sun.proxy.$Proxy71.getNonCumulativeCost(Unknown Source)
> [na:na]
> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
> ~[na:na]
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_111]
> at java.lang.reflect.Method.invoke(Method.java:498)
> ~[na:1.8.0_111]
> at org.apache.calcite.rel.metadata.CachingRelMetadataProvider$
> CachingInvocationHandler.invoke(CachingRelMetadataProvider.java:132)
> [calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
> at com.sun.proxy.$Proxy71.getNonCumulativeCost(Unknown Source)
> [na:na]
> at org.apache.calcite.rel.metadata.RelMetadataQuery.
> getNonCumulativeCost(RelMetadataQuery.java:115)
> [calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
> at org.apache.calcite.plan.volcano.VolcanoPlanner.
> getCost(VolcanoPlanner.java:1112) [calcite-core-1.4.0-drill-r19.
> jar:1.4.0-drill-r19]
> at org.apache.calcite.plan.volcano.RelSubset.
> propagateCostImprovements0(RelSubset.java:363)
> [calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
> at org.apache.calcite.plan.volcano.RelSubset.
> propagateCostImprovements(RelSubset.java:344)
> [calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
> at org.apache.calcite.plan.volcano.VolcanoPlanner.
> addRelToSet(VolcanoPlanner.java:1827) [calcite-core-1.4.0-drill-r19.
> jar:1.4.0-drill-r19]
> at org.apache.calcite.plan.volcano.VolcanoPlanne

Re: Storage Plugin for accessing Hive ORC Table from Drill

2017-01-22 Thread rahul challapalli
As Chunhui mentioned, this could very well be a compatibility issue of Drill
with Hive 2.0. Since Drill has never been tested against Hive 2.0, this is
not a total surprise. Can you try the below 2 things?

1. Make sure you can read the table with hive.
2. Create a very simple Hive ORC table with a single column (use "stored as
orc" instead of explicitly mentioning the input and output formats in your
DDL). Now try reading this simple table from Drill; a minimal sketch is below.
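For example (table and column names are just placeholders):

-- in Hive
create table orc_single_col (id int) stored as orc;
insert into table orc_single_col values (1);

-- then from Drill
select * from hive.orc_single_col;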

On Jan 22, 2017 9:55 AM, "Anup Tiwari"  wrote:

> can you point me to any specific line or sentence on that link?
>
> Also please correct me if i am misinterpreting, but as written in 1st
> line "*Drill
> 1.1 and later supports Hive 1.0*", does that mean Drill 1.1 and later
> doesn't support OR partially support Hive 2.x?
>
> Regards,
> *Anup Tiwari*
>
> On Sat, Jan 21, 2017 at 8:48 PM, Zelaine Fong  wrote:
>
> > Have you taken a look at http://drill.apache.org/docs/
> hive-storage-plugin/
> > ?
> >
> > -- Zelaine
> >
> > On 1/20/17, 10:07 PM, "Anup Tiwari"  wrote:
> >
> > @Andries, We are using Hive 2.1.1 with Drill 1.9.0.
> >
> > @Zelaine, Could this be a problem in your Hive metastore?--> As i
> > mentioned
> > earlier, i am able to read hive parquet tables in Drill through hive
> > storage plugin. So can you tell me a bit more like which type of
> > configuration i am missing in metastore?
> >
> > Regards,
> > *Anup Tiwari*
> >
> > On Sat, Jan 21, 2017 at 4:56 AM, Zelaine Fong 
> wrote:
> >
> > > The stack trace shows the following:
> > >
> > > Caused by: org.apache.drill.common.exceptions.
> DrillRuntimeException:
> > > java.io.IOException: Failed to get numRows from HiveTable
> > >
> > > The Drill optimizer is trying to read rowcount information from
> Hive.
> > > Could this be a problem in your Hive metastore?
> > >
> > > Has anyone else seen this before?
> > >
> > > -- Zelaine
> > >
> > > On 1/20/17, 7:35 AM, "Andries Engelbrecht" 
> > wrote:
> > >
> > > What version of Hive are you using?
> > >
> > >
> > > --Andries
> > >
> > > 
> > > From: Anup Tiwari 
> > > Sent: Friday, January 20, 2017 3:00:43 AM
> > > To: user@drill.apache.org; d...@drill.apache.org
> > > Subject: Re: Storage Plugin for accessing Hive ORC Table from
> > Drill
> > >
> > > Hi,
> > >
> > > Please find below Create Table Statement and subsequent Drill
> > Error :-
> > >
> > > *Table Structure :*
> > >
> > > CREATE TABLE `logindetails_all`(
> > >   `sid` char(40),
> > >   `channel_id` tinyint,
> > >   `c_t` bigint,
> > >   `l_t` bigint)
> > > PARTITIONED BY (
> > >   `login_date` char(10))
> > > CLUSTERED BY (
> > >   channel_id)
> > > INTO 9 BUCKETS
> > > ROW FORMAT SERDE
> > >   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> > > STORED AS INPUTFORMAT
> > >   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> > > OUTPUTFORMAT
> > >   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> > > LOCATION
> > >   'hdfs://hostname1:9000/usr/hive/warehouse/logindetails_all'
> > > TBLPROPERTIES (
> > >   'compactorthreshold.hive.compactor.delta.num.threshold'='6',
> > >   'compactorthreshold.hive.compactor.delta.pct.threshold'
> ='0.5',
> > >   'transactional'='true',
> > >   'transient_lastDdlTime'='1484313383');
> > > ;
> > >
> > > *Drill Error :*
> > >
> > > *Query* : select * from hive.logindetails_all limit 1;
> > >
> > > *Error :*
> > > 2017-01-20 16:21:12,625 [277e145e-c6bc-3372-01d0-
> > 6c5b75b92d73:foreman]
> > > INFO  o.a.drill.exec.work.foreman.Foreman - Query text for
> > query id
> > > 277e145e-c6bc-3372-01d0-6c5b75b92d73: select * from
> > > hive.logindetails_all
> > > limit 1
> > > 2017-01-20 16:21:12,831 [277e145e-c6bc-3372-01d0-
> > 6c5b75b92d73:foreman]
> > > ERROR o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR:
> > > NumberFormatException: For input string: "004_"
> > >
> > >
> > > [Error Id: 53fa92e1-477e-45d2-b6f7-6eab9ef1da35 on
> > > prod-hadoop-101.bom-prod.aws.games24x7.com:31010]
> > > org.apache.drill.common.exceptions.UserException: SYSTEM
> ERROR:
> > > NumberFormatException: For input string: "004_"
> > >
> > >
> > > [Error Id: 53fa92e1-477e-45d2-b6f7-6eab9ef1da35 on
> > > prod-hadoop-101.bom-prod.aws.games24x7.com:31010]
> > > at
> > > org.apache.drill.common.exceptions.UserException$
> > > Builder.build(UserException.java:543)
> > > ~[drill-common-1.9.0.jar:1.9.0]
> > > at
> > > org.apache.drill.exec.work.foreman.Foreman$For

Re: Array or list type attributes from MongoDB

2017-01-18 Thread rahul challapalli
I suggest that you try using "flatten" [1] along with the MongoDB storage
plugin. I did not understand what you meant by "entities could be offered
to QlikSense as if it were different tables". Below is an example of
flatten usage:

select d.id, flatten(d.doc.files), d.doc.requestId
from (
  select data.response._id id, flatten(data.response.docs) doc
  from `input.json` data
) d;

[1] https://drill.apache.org/docs/flatten/

On Wed, Jan 18, 2017 at 7:54 AM, Virilo Tejedor 
wrote:

> Hi,
>
>
> I’d like to use Apache Drill to feed a QlikSense application from a MongoDB
> database.
>
> I’m not very sure about how should I manage list type attributes.
>
>
> I have a MongoDB collection containing entities like this example, where a
> “response” contains a list of “docs”, and each doc has a list of “files”
>
> response = {   '_id': ObjectId('12346e41234b0dbfe5f5b4d38'),
>
> 'creationDate': '2016-11-02T17:33:58+01:00',
>
> 'docs': [   {   'files': [   {
>
>  'content': '...',
>
>  'name': None,
>
>  'type': 't1'},
>
>  {
>
>  'content': '...',
>
>  'name': None,
>
>  'type': 't31'},
>
>  {
>
>  'content': '...',
>
>  'name': 'ORIGINAL',
>
>  'type': 'tX'}],
>
> 'requestId': 39},
>
> {   'files': [   {
>
>  'content': '...',
>
>  'name': None,
>
>  'type': 'tG'},
>
>  {
>
>  'content': '...',
>
>  'name': None,
>
>  'type': 'tX'}],
>
> 'requestId': None}],
>
> 'entityCode': '360',
>
> 'entityType': '13',
>
> 'messageType': 'msgZ',
>
> 'processId': 'ID001294',
>
> 'registryCode': '00015',
>
> 'registryType': '2',
>
> 'returnCode': '200',
>
> 'returnDescription': None}
>
>
>
> It could be great if this entities could be offered to QlikSense as if it
> were different tables:
>
>
>
> Is it possible to configure Apache Drill on this way?
>
>
>
> I know that it is possible using MongoDB BI Connector.  But this connector
> is an enterprise feature only.
>
> https://docs.mongodb.com/bi-connector/master/schema-configuration/#arrays
>
>
>
>
> Thanks in advance.
>
>
>
>
> Best regards,
>
>
> Virilo
>


Re: Stored Procedure & Function in Apache

2017-01-18 Thread rahul challapalli
I believe you have to re-write your SQL Server procedures leveraging the
CTAS [1] and DROP [2] commands of Drill, keeping in mind that Drill does not
support INSERT/UPDATE commands.

[1] https://drill.apache.org/docs/create-table-as-ctas/
[2] https://drill.apache.org/docs/drop-table/
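As an illustration only (the table, path, and column names below are made
up), a procedure step that rebuilds a summary table could be rewritten as:

-- IF EXISTS needs a reasonably recent Drill release
drop table if exists dfs.tmp.`monthly_summary`;

create table dfs.tmp.`monthly_summary` as
select region, sum(amount) as total_amount
from dfs.`/data/sales`
group by region;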

- Rahul

On Wed, Jan 18, 2017 at 5:29 AM, Sanjiv Kumar  wrote:

> Hello
> I want to know how to use stored procedure and function in Apache
> Drill.?
> Does Drill Supports Procedure and Function ? & if not Is there any way to
> run  stored procedure and function which are stored in my Local SqlServer
> through Drill.
>
>
>
>
> *Thanks & Regards,*
> *Sanjiv Kumar*
>
>-
>


Re: WARC files

2017-01-17 Thread rahul challapalli
I believe what you you need is a format plugin.

Once you manage to read a file and populate drill's internal data
structures(value vectors), then the format of the file no longer comes into
picture. So from here on you can use any sql operators (filter, join etc)
or UDF's

To my knowledge there is no format plugin available for drill to read WARC
files. However if hive supports reading WARC files, then you can use drill
and query them through the hive plugin for better query runtimes.

- Rahul

On Mon, Jan 16, 2017 at 7:05 PM, Bob Rudis  wrote:

> Hey folks,
>
> Does anyone know if there have been UDFs made to enable working with
> WARC files in Drill?
>
> WARC: http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
>
> thx,
>
> -Bob
>


Re: Directory Based Partition Pruning Documentation

2016-11-16 Thread rahul challapalli
I raised https://issues.apache.org/jira/browse/DRILL-5046 to track this.
Thank you, Bridget.

On Wed, Nov 16, 2016 at 12:28 PM, Bridget Bevens 
wrote:

> Hi Rahul,
>
> If there's specific content pertaining to partition pruning that is missing
> from the Apache Drill doc set, please file a JIRA indicating the doc gap,
> reference any related JIRAs, and assign it to me.
>
> Thanks,
> Bridget
>
> On Wed, Nov 16, 2016 at 11:33 AM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Folks,
> >
> > After a quick glance through our documentation, I couldn't find much
> about
> > directory based partition pruning feature in drill. All I could find was
> > [1]. Can someone point me to the relevant docs on this feature?
> >
> > [1] https://drill.apache.org/docs/partition-pruning/
> >
> > - Rahul
> >
>


Directory Based Partition Pruning Documentation

2016-11-16 Thread rahul challapalli
Folks,

After a quick glance through our documentation, I couldn't find much about
directory based partition pruning feature in drill. All I could find was
[1]. Can someone point me to the relevant docs on this feature?

[1] https://drill.apache.org/docs/partition-pruning/

- Rahul


Re: Drill where clause vs Hive on non-partition column

2016-11-16 Thread rahul challapalli
Assume you have a hive table "revenue" partitioned by year. Then the folder
structure for the table on maprfs/hdfs looks something like below


*revenue*
*|year=2015*
*|year=2016*
*|year=2017*

Now if you want to leverage partition pruning, you can use something like
below
Hive plugin : select count(*) from hive.revenue where `year` = 2015
DFS plugin : select count(*) from dfs.`/user/hive/warehouse/revenue` where
dir0='year=2015'

I am not sure if we have a jira for tracking parquet filter pushdown when
using hive + native parquet reader

- Rahul

On Wed, Nov 16, 2016 at 7:24 AM, Sonny Heer  wrote:

> thats a lot of good information Rahul!! - thanks.
>
> "modify the query to take advantage of drill's directory based
> partitioning"
>
> What does this entail?  Do you have to tell it on which column the
> directories are partitioned by?
>
> I think option 3 is probably the way to go.  Is there a ticket tracking
> work on this?
>
> Thanks again
>
> On Tue, Nov 15, 2016 at 10:25 AM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Robert's suggestion is with using the DFS plugin. If you directly use DFS
> > instead of hive plugin then
> >
> > 1. DFS plugin has to determine the underlying data format on the fly.
> > 2. DFS plugin does not know the schema in advance. But in the case
> parquet
> > drill would get this information from the parquet metadata. However if
> the
> > hive table is backed by a csv file, then you cast the columns
> appropriately
> > in the query or create a view.
> > 3. If the underlying hive table is partitioned, then drill does not know
> > anything about partitions. However since hive partitions are just
> > sub-directories, you can still modify the query to take advantage of
> > drill's directory based partitioning
> > 4. In terms of performance, I am not aware of any published benchmarks
> > comparing hive plugin and dfs plugin for parquet format. But from my
> > general experience it appears as though DFS plugin is faster.
> >
> > Also do not forget the 3rd option in my first response (Hive Plugin +
> Drill
> > native parquet reader). We do have plans to support filter pushdown for
> > this scenario in the future.
> >
> > - Rahul
> >
> > On Tue, Nov 15, 2016 at 8:01 AM, Sonny Heer  wrote:
> >
> > > Thanks Robert.
> > >
> > > "You can then use Drill to query the Hive table and get predicate
> > pushdown"
> > >
> > > This is using the DFS plugin and going directly to the hive table
> folder?
> > >
> > > Can someone speak to what advantages there are to use the hive plugin
> vs
> > > going directly to dfs
> > >
> > > On Tue, Nov 15, 2016 at 12:32 AM, Robert Hou 
> wrote:
> > >
> > > > I have used Hive 1.2 and I have found that the stats in parquet files
> > are
> > > > populated for some data types.  Integer, bigint, float, double, date
> > > work.
> > > > String does not seem to work.
> > > >
> > > > You can then use Drill to query the Hive table and get predicate
> > pushdown
> > > > for simple compare filters.  This has the form "where col = value".
> > > Other
> > > > standard operators are !=, <, <=, >, >=.  Compound filters can use
> > > "and/or"
> > > > logic.  This will be supported in Drill 1.9.
> > > >
> > > > In the future, we will add expressions and functions.
> > > >
> > > > Thanks.
> > > >
> > > > --Robert
> > > >
> > > >
> > > > On Mon, Nov 14, 2016 at 3:53 PM, Sonny Heer 
> > wrote:
> > > >
> > > > > Is there a way to do that during the creation of the parquet table?
> > > > Might
> > > > > be a hive question but all we do is 'STORED AS parquet' and then
> > during
> > > > > insert set the parquet.* properties.  I'm just trying to see if #2
> is
> > > an
> > > > > option for us to utilize filter pushdown via dfs
> > > > >
> > > > > On Mon, Nov 14, 2016 at 3:43 PM, rahul challapalli <
> > > > > challapallira...@gmail.com> wrote:
> > > > >
> > > > > > I do not know of any plans to support filter pushdown when using
> > the
> > > > hive
> > > > > > plugin.
> > > > > > If you run analyze stats then hive computes the table stats and
> 

Re: MySQL CONNECTION_ID() equivalent in Drill

2016-11-15 Thread rahul challapalli
I couldn't find any documented functions which do what you are describing.

On Mon, Nov 14, 2016 at 2:01 AM, Nagarajan Chinnasamy <
nagarajanchinnas...@gmail.com> wrote:

> Hi,
>
> I would like to know if there is a function or column in system tables that
> is equivalent to MySQL's CONNECTION_ID function.
>
> I am basically trying to achieve the technique answered by Justin Swanhart
> for the following stackoverflow question:
>
> http://stackoverflow.com/questions/2281890/can-i-
> create-view-with-parameter-in-mysql
>
>
> Best Regards,
> Nagu.
>


Re: Drill where clause vs Hive on non-partition column

2016-11-15 Thread rahul challapalli
Robert's suggestion is with using the DFS plugin. If you directly use DFS
instead of hive plugin then

1. DFS plugin has to determine the underlying data format on the fly.
2. DFS plugin does not know the schema in advance. But in the case of
parquet, drill would get this information from the parquet metadata. However,
if the hive table is backed by a csv file, then you have to cast the columns
appropriately in the query or create a view (see the sketch after this list).
3. If the underlying hive table is partitioned, then drill does not know
anything about partitions. However since hive partitions are just
sub-directories, you can still modify the query to take advantage of
drill's directory based partitioning
4. In terms of performance, I am not aware of any published benchmarks
comparing hive plugin and dfs plugin for parquet format. But from my
general experience it appears as though DFS plugin is faster.
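A rough sketch of point 2 (the path follows the revenue example used
elsewhere in this thread; the column positions and types are assumptions):

create view dfs.tmp.`revenue_typed` as
select cast(columns[0] as int) as `year`, cast(columns[1] as double) as amount
from dfs.`/user/hive/warehouse/revenue`;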

Also do not forget the 3rd option in my first response (Hive Plugin + Drill
native parquet reader). We do have plans to support filter pushdown for
this scenario in the future.

- Rahul

On Tue, Nov 15, 2016 at 8:01 AM, Sonny Heer  wrote:

> Thanks Robert.
>
> "You can then use Drill to query the Hive table and get predicate pushdown"
>
> This is using the DFS plugin and going directly to the hive table folder?
>
> Can someone speak to what advantages there are to use the hive plugin vs
> going directly to dfs
>
> On Tue, Nov 15, 2016 at 12:32 AM, Robert Hou  wrote:
>
> > I have used Hive 1.2 and I have found that the stats in parquet files are
> > populated for some data types.  Integer, bigint, float, double, date
> work.
> > String does not seem to work.
> >
> > You can then use Drill to query the Hive table and get predicate pushdown
> > for simple compare filters.  This has the form "where col = value".
> Other
> > standard operators are !=, <, <=, >, >=.  Compound filters can use
> "and/or"
> > logic.  This will be supported in Drill 1.9.
> >
> > In the future, we will add expressions and functions.
> >
> > Thanks.
> >
> > --Robert
> >
> >
> > On Mon, Nov 14, 2016 at 3:53 PM, Sonny Heer  wrote:
> >
> > > Is there a way to do that during the creation of the parquet table?
> > Might
> > > be a hive question but all we do is 'STORED AS parquet' and then during
> > > insert set the parquet.* properties.  I'm just trying to see if #2 is
> an
> > > option for us to utilize filter pushdown via dfs
> > >
> > > On Mon, Nov 14, 2016 at 3:43 PM, rahul challapalli <
> > > challapallira...@gmail.com> wrote:
> > >
> > > > I do not know of any plans to support filter pushdown when using the
> > hive
> > > > plugin.
> > > > If you run analyze stats then hive computes the table stats and
> stores
> > > them
> > > > in the hive metastore for the relevant table. I believe drill uses
> some
> > > of
> > > > these stats. However running analyze stats command does not alter(or
> > add)
> > > > the metadata in the parquet files themselves. The parquet level
> > metadata
> > > > should be written when the parquet file itself is created in the
> first
> > > > place.
> > > >
> > > > - Rahul
> > > >
> > > > On Mon, Nov 14, 2016 at 3:32 PM, Sonny Heer 
> > wrote:
> > > >
> > > > > Rahul,
> > > > >
> > > > > Thanks for the details.  Is there any plans to support filter
> > pushdown
> > > > for
> > > > > #1?  Do you know if we run analyze stats through hive on a parquet
> > file
> > > > if
> > > > > that will have enough info to do the pushdown?
> > > > >
> > > > > Thanks again.
> > > > >
> > > > > On Mon, Nov 14, 2016 at 9:50 AM, rahul challapalli <
> > > > > challapallira...@gmail.com> wrote:
> > > > >
> > > > > > Sonny,
> > > > > >
> > > > > > If the underlying data in the hive table is in parquet format,
> > there
> > > > are
> > > > > 3
> > > > > > ways to query from drill :
> > > > > >
> > > > > > 1. Using the hive plugin : This does not support filter pushdown
> > for
> > > > any
> > > > > > formats (ORC, Parquet, Text...etc)
> > > > > > 2. Directly Querying the folder in maprfs/hdfs which contains the
> > > > parquet
> > > > > > f

Re: Drill where clause vs Hive on non-partition column

2016-11-14 Thread rahul challapalli
I do not know of any plans to support filter pushdown when using the hive
plugin.
If you run analyze stats then hive computes the table stats and stores them
in the hive metastore for the relevant table. I believe drill uses some of
these stats. However running analyze stats command does not alter(or add)
the metadata in the parquet files themselves. The parquet level metadata
should be written when the parquet file itself is created in the first
place.
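For reference, the Hive stats commands being referred to look roughly like
this (the table name is a placeholder):

analyze table my_parquet_table compute statistics;
analyze table my_parquet_table compute statistics for columns;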

- Rahul

On Mon, Nov 14, 2016 at 3:32 PM, Sonny Heer  wrote:

> Rahul,
>
> Thanks for the details.  Is there any plans to support filter pushdown for
> #1?  Do you know if we run analyze stats through hive on a parquet file if
> that will have enough info to do the pushdown?
>
> Thanks again.
>
> On Mon, Nov 14, 2016 at 9:50 AM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Sonny,
> >
> > If the underlying data in the hive table is in parquet format, there are
> 3
> > ways to query from drill :
> >
> > 1. Using the hive plugin : This does not support filter pushdown for any
> > formats (ORC, Parquet, Text...etc)
> > 2. Directly Querying the folder in maprfs/hdfs which contains the parquet
> > files using DFS plugin: With DRILL-1950, we can now do a filter pushdown
> > into the parquet files. In order to take advantage of this feature, the
> > underlying parquet files should have the relevant stats. This feature
> will
> > only be available with the 1.9.0 release
> > 3. Using the drill's native parquet reader in conjunction with the hive
> > plugin (See store.hive.optimize_scan_with_native_readers) : This allows
> > drill to fetch all the metadata about the hive table from the metastore
> and
> > then drill uses its own parquet reader for actually reading the files.
> This
> > approach currently does not support parquet filter pushdown but this
> might
> > be added in the next release after 1.9.0.
> >
> > - Rahul
> >
> > On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer 
> wrote:
> >
> > > I'm running a drill query with a where clause on a non-partitioned
> column
> > > via hive storage plugin.  This query inspects all partitions (kind of
> > > expected), but when i run the same query in Hive I can see a predicate
> > > passed down to the query plan.  This particular query is much faster in
> > > Hive vs Drill.  BTW these are parquet files.
> > >
> > > Hive:
> > >
> > > Stage-0
> > >
> > > Fetch Operator
> > >
> > > limit:-1
> > >
> > > Select Operator [SEL_2]
> > >
> > > outputColumnNames:["_col0"]
> > >
> > > Filter Operator [FIL_4]
> > >
> > > predicate:(my_column = 123) (type: boolean)
> > >
> > > TableScan [TS_0]
> > >
> > > alias:my_table
> > >
> > >
> > > Any idea on why this is?  My guess is Hive is storing hive specific
> info
> > in
> > > the parquet file since it was created through Hive.  Although it seems
> > > drill-hive plugin should honor this to.  Not sure, but willing to look
> > > through code if someone can point me in the right direction.  Thanks!
> > >
> > > --
> > >
> >
>
>
>
> --
>
>
> Pushpinder S. Heer
> Senior Software Engineer
> m: 360-434-4354 h: 509-884-2574
>


Re: INTERVAL date arithmetic

2016-11-14 Thread rahul challapalli
If you have an intervalday, then you should be able to use *extract(day
from )* function. Also if you have a date then you can try something
like below

0: jdbc:drill:zk=10.10.100.190:5181> select datediff(NOW(), l_shipdate)
from cp.`tpch/lineitem.parquet` limit 1;

+---------+
| EXPR$0  |
+---------+
| 7552    |
+---------+

1 row selected (0.475 seconds)

If you have a timestamp then cast it to a date and try the above query.
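For instance, if published_at is a timestamp, something like the below
(untested sketch) should work:

select datediff(NOW(), cast(p.post.published_at as date))
from dfs.data.ghost_posts_rm p;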


- Rahul

On Thu, Nov 10, 2016 at 9:01 AM, Robin Moffatt <
robin.moff...@rittmanmead.com> wrote:

> Hi,
>
> I get an error:
>
> 0: jdbc:drill:zk=local> select extract(day from cast(p.post.published_at as
> interval day))
> . . . . . . . . . . . > from dfs.data.ghost_posts_rm
> . . . . . . . . . . . > p;
> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format:
> "2003-06-28T23:00:00.000Z"
>
> ​thanks
> ​
>
> On 10 November 2016 at 16:50, rahul challapalli <
> challapallira...@gmail.com>
> wrote:
>
> > Can you try the below query?
> >
> > select extract(day from cast(p.post.published_at as interval day))
> > from dfs.data.ghost_posts_rm
> > p;
> >
> > - Rahul
> >
> > On Thu, Nov 10, 2016 at 3:01 AM, Robin Moffatt <
> > robin.moff...@rittmanmead.com> wrote:
> >
> > > Hi,
> > > I have a date in a table, that I want to calculate how many days it is
> > > between then and current date.
> > > I have read the docs on date time formats, including intervals (
> > > http://drill.apache.org/docs/date-time-and-timestamp/), as well as
> date
> > > time functions (
> > > http://drill.apache.org/docs/date-time-functions-and-arithmetic/).
> > >
> > > I have a query that returns the interval:
> > >
> > > 0: jdbc:drill:zk=local> select p.post.published_at,age(p.
> > > post.published_at)
> > > FROM   dfs.data.ghost_posts_rm p limit 5;
> > > +---+---+
> > > |  EXPR$0   |  EXPR$1   |
> > > +---+---+
> > > | 2003-06-28T23:00:00.000Z  | P162M24D  |
> > >
> > > but I can't see how to transform the INTERVALDAY into an int of days
> > alone.
> > >
> > > Any suggestions?
> > >
> > > thanks.
> > >
> >
>


Re: Drill where clause vs Hive on non-partition column

2016-11-14 Thread rahul challapalli
Sonny,

If the underlying data in the hive table is in parquet format, there are 3
ways to query from drill :

1. Using the hive plugin : This does not support filter pushdown for any
formats (ORC, Parquet, Text...etc)
2. Directly Querying the folder in maprfs/hdfs which contains the parquet
files using DFS plugin: With DRILL-1950, we can now do a filter pushdown
into the parquet files. In order to take advantage of this feature, the
underlying parquet files should have the relevant stats. This feature will
only be available with the 1.9.0 release
3. Using the drill's native parquet reader in conjunction with the hive
plugin (See store.hive.optimize_scan_with_native_readers) : This allows
drill to fetch all the metadata about the hive table from the metastore and
then drill uses its own parquet reader for actually reading the files. This
approach currently does not support parquet filter pushdown but this might
be added in the next release after 1.9.0.
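For reference, the native-reader behavior in approach 3 is controlled by a
session/system option and can be toggled like this:

alter session set `store.hive.optimize_scan_with_native_readers` = true;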

- Rahul

On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer  wrote:

> I'm running a drill query with a where clause on a non-partitioned column
> via hive storage plugin.  This query inspects all partitions (kind of
> expected), but when i run the same query in Hive I can see a predicate
> passed down to the query plan.  This particular query is much faster in
> Hive vs Drill.  BTW these are parquet files.
>
> Hive:
>
> Stage-0
>
> Fetch Operator
>
> limit:-1
>
> Select Operator [SEL_2]
>
> outputColumnNames:["_col0"]
>
> Filter Operator [FIL_4]
>
> predicate:(my_column = 123) (type: boolean)
>
> TableScan [TS_0]
>
> alias:my_table
>
>
> Any idea on why this is?  My guess is Hive is storing hive specific info in
> the parquet file since it was created through Hive.  Although it seems
> drill-hive plugin should honor this to.  Not sure, but willing to look
> through code if someone can point me in the right direction.  Thanks!
>
> --
>


Re: SYSTEM ERROR: CompileException

2016-11-11 Thread rahul challapalli
This is a bug and its weird that changing the literal in the condition
makes it work. Can you go ahead and raise a jira for the same?

On Thu, Nov 10, 2016 at 10:58 AM, Josson Paul  wrote:

> Hi,
>
>   My query is below
>
> select MIN(case when (CMP__acIds like '%6%') then A__ln__intrRt else
> null end) acID2,MIN(case when (A__ln__intrRt > 4.0) then 1 else null
> end) acID1,MIN(case when (A__ln__intrRt > 4.0) and CMP__acIds like
> '%6%' then 1 else null end) acID3 from 
>
>
> The above query returns
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> CompileException: Line 151, Column 22: Statement "break AndOP0" is not
> enclosed by a breakable statement with label "AndOP0" Fragment 1:0
>
> I am using drill 1.8
>
> If I make different values for A__ln__intrRt greater than condition, then
> the query works. If both values are same, the query gives above error.
> Currently as you see in the query, both have 4.0 as values. Another way to
> make this query works is to remove one of the select field
> --
> Thanks
> Josson
>


Re: INTERVAL date arithmetic

2016-11-10 Thread rahul challapalli
Can you try the below query?

select extract(day from cast(p.post.published_at as interval day))
from dfs.data.ghost_posts_rm p;

- Rahul

On Thu, Nov 10, 2016 at 3:01 AM, Robin Moffatt <
robin.moff...@rittmanmead.com> wrote:

> Hi,
> I have a date in a table, that I want to calculate how many days it is
> between then and current date.
> I have read the docs on date time formats, including intervals (
> http://drill.apache.org/docs/date-time-and-timestamp/), as well as date
> time functions (
> http://drill.apache.org/docs/date-time-functions-and-arithmetic/).
>
> I have a query that returns the interval:
>
> 0: jdbc:drill:zk=local> select p.post.published_at,age(p.
> post.published_at)
> FROM   dfs.data.ghost_posts_rm p limit 5;
> +---+---+
> |  EXPR$0   |  EXPR$1   |
> +---+---+
> | 2003-06-28T23:00:00.000Z  | P162M24D  |
>
> but I can't see how to transform the INTERVALDAY into an int of days alone.
>
> Any suggestions?
>
> thanks.
>


Re: Parquet Date Format Problem

2016-11-01 Thread rahul challapalli
The fix will be available with the Drill 1.9 release unless you want to
build from source yourself.

On Tue, Nov 1, 2016 at 11:24 AM, Lee, David  wrote:

> Nevermind.. Found the problem..
>
> https://issues.apache.org/jira/browse/DRILL-4203
>
>
> David Lee
> Vice President | BlackRock
> Phone: +1.415.670.2744 | Mobile: +1.415.706.6874
>
> From: Lee, David
> Sent: Tuesday, November 01, 2016 11:21 AM
> To: 'user@drill.apache.org' 
> Subject: Parquet Date Format Problem
>
> I created a parquet file using Drill, but date values in the parquet files
> don’t appear to be a logical INT32 type and as such when I’m trying to read
> the parquet file in Spark it looks corrupted..
>
> Here’s my test case..
>
>
> A. Create a test.txt file in /tmp:
>
> as_of
> 2016-09-30
>
>
> B. Convert it to parquet using Drill:
>
> 0: jdbc:drill:zk=local> create table dfs.tmp.`/test` as select cast(as_of
> AS date) as as_of from table(dfs.`/tmp/test.txt`(type => 'text',
> fieldDelimiter => ',', extractHeader => true));
>
>
> C.Read the new file using Drill which looks fine:
>
>
> 0: jdbc:drill:zk=local> select * from dfs.`/tmp/test`;
> +-+
> |as_of|
> +-+
> | 2016-09-30  |
> +-+
>
>
> D.However running parquet-tools on it gives a completely different
> result:
>
> java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/test
> as_of = 4898250
>
> java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/test/0_0_0.parquet
> message root {
>   required int32 as_of (DATE);
> }
>
> According to the Parquet docs.. 4898250 days after Jan 1st 1970 is
> sometime in the year 15,435..
>
> https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md
> DATE
> DATE is used to for a logical date type, without a time of day. It must
> annotate an int32 that stores the number of days from the Unix epoch, 1
> January 1970.
>
>
>
> David Lee
> Vice President | BlackRock
> Phone: +1.415.670.2744 | Mobile: +1.415.706.6874
>
>
> This message may contain information that is confidential or privileged.
> If you are not the intended recipient, please advise the sender immediately
> and delete this message. See http://www.blackrock.com/
> corporate/en-us/compliance/email-disclaimers for further information.
> Please refer to http://www.blackrock.com/corporate/en-us/compliance/
> privacy-policy for more information about BlackRock’s Privacy Policy.
> For a list of BlackRock's office addresses worldwide, see
> http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.
>
> © 2016 BlackRock, Inc. All rights reserved.
>


Re: Drill issue - Reading DATE & TIME data type from Parquet

2016-10-17 Thread rahul challapalli
Amarnath,

This is a known issue and the work is already in progress. You can track it
using [1]

[1] https://issues.apache.org/jira/browse/DRILL-4203

- Rahul

On Mon, Oct 17, 2016 at 9:53 AM, Amarnath Vibhute <
amarnath.vibh...@gmail.com> wrote:

> Dear Team,
>
> I have started using Drill recently (a couple of weeks now). I am using
> Drill 1.8 version on a 3 node cluster (MapR community edition).
> I have some data stored in Hive tables in Parquet format. When I am reading
> this Parquet data using Drill queries I am facing problem for attributes
> which are stored in DATE & TIMESTAMP data types. Basically, I can't see the
> correct data values which I loaded in the table while querying using Drill.
> But if I read data using Hive, I can see correct data without any issues.
>
> I have explained the problem in detail on MapR community, please refer link
> -
> https://community.mapr.com/thread/18883-getting-weird-
> output-for-date-timestamp-data-type-columns-while-
> selecting-data-from-parquet-file-in-drill
>
> I understood that for TIMESTAMP I can use the CONVERT_FROM function to get
> correct value but not still sure about DATE values.
>
> Can you help/guide me to read correct DATE stored in Parquet using Drill
> query?
>
> I do not want to store DATE in string format as I am sure next Drill
> versions will surely support reading DATE from Parquet data.
>
> Thanks in advance!
> Amarnath Vibhute
>


Re: Reading column from parquet file saved using spark.1.6

2016-10-17 Thread rahul challapalli
This is tracked by https://issues.apache.org/jira/browse/DRILL-4203

On Mon, Oct 17, 2016 at 10:14 AM, Tushar Pathare  wrote:

>
> Hello Team,
>  I am getting wrong values for date
> columns(START_DT,END_DT)timestamp while querying.
> Please see the attached screenshot.I am using the latest build from drill
> 1.9. snapshot
>
> The value I am getting is [B@471ce738
>
> The schema is as follows
>
> |-- _IDENTIFIER: string (nullable = true)
> |-- START_DT: timestamp (nullable = true)
> |-- END_DT: timestamp (nullable = true)
> |-- PATIENT_HISTORY_CODE: string (nullable = true)
> |-- VAL: string (nullable = true)
>
>  Could you please help me on this.
>
> Thanks
>
>
>
>
> Tushar B Pathare
> High Performance Computing (HPC) Administrator
> General Parallel File System
> Scientific Computing
> Bioinformatics Division
> Research
>
> "what ever the mind of man can conceive and believe, drill can query"
>
> Sidra Medical and Research Centre
> Sidra OPC Building
> PO Box 26999  |  Doha, Qatar
> Near QNCC,5th Floor
> Office 4003  ext 37443 | M +974 74793547
> tpath...@sidra.org | www.sidra.org sidra.org/>
>
> Disclaimer: This email and its attachments may be confidential and are
> intended solely for the use of the individual to whom it is addressed. If
> you are not the intended recipient, any reading, printing, storage,
> disclosure, copying or any other action taken in respect of this e-mail is
> prohibited and may be unlawful. If you are not the intended recipient,
> please notify the sender immediately by using the reply function and then
> permanently delete what you have received. Any views or opinions expressed
> are solely those of the author and do not necessarily represent those of
> Sidra Medical and Research Center.
>


Re: Query hangs on planning

2016-09-01 Thread rahul challapalli
While planning we use heap memory. 2GB of heap should be sufficient for
what you mentioned. This looks like a bug to me. Can you raise a jira for
the same? And it would be super helpful if you can also attach the data set
used.

Rahul

On Wed, Aug 31, 2016 at 9:14 AM, Oscar Morante  wrote:

> Sure,
> This is what I remember:
>
> * Failure
>- embedded mode on my laptop
>- drill memory: 2Gb/4Gb (heap/direct)
>- cpu: 4cores (+hyperthreading)
>- `planner.width.max_per_node=6`
>
> * Success
>- AWS Cluster 2x c3.8xlarge
>- drill memory: 16Gb/32Gb
>- cpu: limited by kubernetes to 24cores
>- `planner.width.max_per_node=23`
>
> I'm very busy right now to test again, but I'll try to provide better info
> as soon as I can.
>
>
>
> On Wed, Aug 31, 2016 at 05:38:53PM +0530, Khurram Faraaz wrote:
>
>> Can you please share the number of cores on the setup where the query hung
>> as compared to the number of cores on the setup where the query went
>> through successfully.
>> And details of memory from the two scenarios.
>>
>> Thanks,
>> Khurram
>>
>> On Wed, Aug 31, 2016 at 4:50 PM, Oscar Morante 
>> wrote:
>>
>> For the record, I think this was just bad memory configuration after all.
>>> I retested on bigger machines and everything seems to be working fine.
>>>
>>>
>>> On Tue, Aug 09, 2016 at 10:46:33PM +0530, Khurram Faraaz wrote:
>>>
>>> Oscar, can you please report a JIRA with the required steps to reproduce
 the OOM error. That way someone from the Drill team will take a look and
 investigate.

 For others interested here is the stack trace.

 2016-08-09 16:51:14,280 [285642de-ab37-de6e-a54c-378aaa4ce50e:foreman] ERROR o.a.drill.common.CatastrophicFailure -
 Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in Foreman.
 java.lang.OutOfMemoryError: Java heap space
     at java.util.Arrays.copyOfRange(Arrays.java:2694) ~[na:1.7.0_111]
     at java.lang.String.<init>(String.java:203) ~[na:1.7.0_111]
     at java.lang.StringBuilder.toString(StringBuilder.java:405) ~[na:1.7.0_111]
     at org.apache.calcite.util.Util.newInternal(Util.java:785) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
     at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:251) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
     at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
     at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:404) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:343) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:240) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:290) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
     at org.apache.drill.exec.planner.sql.handlers.ExplainHandler.getPlan(ExplainHandler.java:61) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
     at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:94) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
     at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:978) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
     at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:257) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
     at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]

 Thanks,
 Khurram

 On Tue, Aug 9, 2016 at 7:46 PM, Oscar Morante 
 wrote:

 Yeah, when I uncomment only the `upload_date` lines (a dir0 alias),

> explain succeeds within ~30s.  Enabling any of the other lines triggers
> the
> failure.
>
> This is a log with the `upload_date` lines and `usage <> 'Test'`
> enabled:
> https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e
>
> The client t

Re: Querying Delimited Sequence file

2016-08-31 Thread rahul challapalli
I will try the split_part function myself to see if I can reproduce your
issue. Also, I couldn't see the query which references the output of the split
function. Something like the below should work:

SELECT
  d.columns_arr[0],
  d.columns_arr[1]
FROM (SELECT
  split(CONVERT_FROM(binary_value, 'UTF8'), chr(1)) columns_arr
FROM data) d;


On Wed, Aug 31, 2016 at 1:50 AM, Robin Moffatt <
robin.moff...@rittmanmead.com> wrote:

> Thanks, SPLIT_PART looks useful.
>
> 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> select
> split_part(version,'.',1),split_part(version,'.',2),
> split_part(version,'.',3)
> from sys.version;
> +-+-+-+
> | EXPR$0  | EXPR$1  | EXPR$2  |
> +-+-+-+
> | 1   | 7   | 0   |
> +-+-+-+
> 1 row selected (0.351 seconds)
>
> But used with my actual data (sequence file), I get an error. I've
> successfully SPLIT it using CHR(1) for the \x01 delimiter:
>
> 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
> split(CONVERT_FROM(binary_value, 'UTF8'),chr(1)) from
>  `/user/oracle/seq/pdb.soe.logon` limit 1;
> ++
> | EXPR$0 |
> ++
> | ["\u\u\u|I","PDB.SOE.LOGON","2016-08-30
> 10:34:01.000145","2016-08-30T11:34:07.934000","01558898","","
> 338328","13645","2016-08-30:11:34:01"]
> |
> ++
>
> But if I now try to access one of those elements, it errors:
>
> 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
> split_part(CONVERT_FROM(binary_value, 'UTF8'),chr(1),1) from
>  `/user/oracle/seq/pdb.soe.logon` limit 5;
> Error: SYSTEM ERROR: IllegalArgumentException: length: -123 (expected: >=
> 0)
>
> Fragment 0:0
>
> [Error Id: beba85c3-8c5b-4c05-9ae7-d12263811af4 on
> cdh57-01-node-02.moffatt.me:31010] (state=,code=0)
> 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
> split_part(CONVERT_FROM(binary_value, 'UTF8'),chr(1),2) from
>  `/user/oracle/seq/pdb.soe.logon` limit 5;
> Error: SYSTEM ERROR: IllegalArgumentException: length: -6 (expected: >= 0)
>
> Fragment 0:0
>
> [Error Id: b4f18223-2999-4388-9450-dc9683c543ec on
> cdh57-01-node-02.moffatt.me:31010] (state=,code=0)
>
>
> Should this work?
>
> thanks.
>
>
> On 30 August 2016 at 19:06, rahul challapalli 
> wrote:
>
> > You should be able to use split_part function (I haven't tried it
> > myself...but it is supported). With this function you can extract
> > individual columns. Unfortunately I couldn't find the documentation for
> > this function as well. But it should be similar to how other databases
> > implement this function.
> >
> > Also as you have observed, split does not support delimiters with more
> than
> > one character. You can raise a jira and mark it as documentation related.
> >
> > Rahul
> >
> >
> >
> > On Tue, Aug 30, 2016 at 8:58 AM, Robin Moffatt <
> > robin.moff...@rittmanmead.com> wrote:
> >
> > > Hi,
> > >
> > > Thanks - I think SPLIT gets me some of the way, but after the FLATTEN I
> > > want to PIVOT, so instead of :
> > >
> > > 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> select
> > > flatten(split(version,'.')) from sys.version;
> > > +-+
> > > | EXPR$0  |
> > > +-+
> > > | 1   |
> > > | 7   |
> > > | 0   |
> > > +-+
> > >
> > > I'd get something like:
> > >
> > > +-+-+-+
> > > | EXPR$0  | EXPR$1  | EXPR$2  |
> > > +-+-+-+
> > > | 1   | 7   | 0   |
> > > +-+-+-+
> > >
> > > I'm guessing this isn't possible in Drill yet?
> > >
> > > Also, what would be be the syntax to enter the \x01 character in the
> > SPLIT
> > > function? Entered literally I get an error:
> > >
> > > 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
> > > split(CONVERT_FROM(binary_value, 'UTF8'),'\x01') from
> > >  `/user/oracle/seq/pdb.soe.logon` limit 5;
> > > Error: SYSTEM ERROR: IllegalArgumentException: Only single character
> > > delimiters are supported for split()
> > >
> > > BTW I didn't realise SPLIT was supported, and it's not listed in
> > > https://drill.apache.org/docs/string-manipulation/ or
> > > htt

Re: Drill Queries Timing Out

2016-08-31 Thread rahul challapalli
Scott,

Can you post the drill query profiles for the run on 40 nodes and for the run after adding the 60 nodes (100 nodes total)?

I am assuming that you are using the same version of drill in both
scenarios. If not let us know.

Rahul

On Aug 31, 2016 8:28 AM, "scott"  wrote:

>

> Hello,
> I'm having some performance issues testing Drill on a large MapR cluster.
> I've been building a cluster of 100 nodes for the past few weeks. When the
> cluster had only 40 nodes, I ran a benchmark test where Drill performed
> very well, returning in 80 seconds from counting a large table. After
> adding the additional 60 nodes, the same benchmark test is not finishing.
> It times out after approx. 5 minutes due to configured timeout value of
> 30. My understanding of Drill is that performance should improve when
> you increase the cluster size. Each drillbit is configured with 16G. Can
> someone tell me if there are some configuration settings that can improve
> this? Or, is there some point where Drill performance decreases when the
> size of the cluster is too large?
>
> Thanks,
> Scott


Re: Querying Delimited Sequence file

2016-08-30 Thread rahul challapalli
Also you can refer to [1] for the list of string functions implemented.

[1]
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctions.java
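
A quick way to check the call shape (string, delimiter, 1-based field index) without touching
your own data is to run it against sys.version, as Robin's reply elsewhere in this thread also shows:

select split_part(version, '.', 1) as major,
       split_part(version, '.', 2) as minor
from sys.version;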

On Tue, Aug 30, 2016 at 11:06 AM, rahul challapalli <
challapallira...@gmail.com> wrote:

> You should be able to use split_part function (I haven't tried it
> myself...but it is supported). With this function you can extract
> individual columns. Unfortunately I couldn't find the documentation for
> this function as well. But it should be similar to how other databases
> implement this function.
>
> Also as you have observed, split does not support delimiters with more
> than one character. You can raise a jira and mark it as documentation
> related.
>
> Rahul
>
>
>
> On Tue, Aug 30, 2016 at 8:58 AM, Robin Moffatt <
> robin.moff...@rittmanmead.com> wrote:
>
>> Hi,
>>
>> Thanks - I think SPLIT gets me some of the way, but after the FLATTEN I
>> want to PIVOT, so instead of :
>>
>> 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> select
>> flatten(split(version,'.')) from sys.version;
>> +-+
>> | EXPR$0  |
>> +-+
>> | 1   |
>> | 7   |
>> | 0   |
>> +-+
>>
>> I'd get something like:
>>
>> +-+-+-+
>> | EXPR$0  | EXPR$1  | EXPR$2  |
>> +-+-+-+
>> | 1   | 7   | 0   |
>> +-+-+-+
>>
>> I'm guessing this isn't possible in Drill yet?
>>
>> Also, what would be be the syntax to enter the \x01 character in the SPLIT
>> function? Entered literally I get an error:
>>
>> 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
>> split(CONVERT_FROM(binary_value, 'UTF8'),'\x01') from
>>  `/user/oracle/seq/pdb.soe.logon` limit 5;
>> Error: SYSTEM ERROR: IllegalArgumentException: Only single character
>> delimiters are supported for split()
>>
>> BTW I didn't realise SPLIT was supported, and it's not listed in
>> https://drill.apache.org/docs/string-manipulation/ or
>> https://drill.apache.org/search/?q=split -- is there somewhere I should
>> log
>> this kind of documentation issue?
>>
>> thanks, Robin.
>>
>>
>> On 30 August 2016 at 16:07, Zelaine Fong  wrote:
>>
>> > If the column is delimited by some character, you can use the SPLIT()
>> > function to separate the value into an array of values.  You can then
>> use
>> > the FLATTEN() function to separate the array of values into individual
>> > records.
>> >
>> > E.g., if your column has the value "a:b", where your delimiter is ":",
>> you
>> > would run the following query:
>> >
>> > 0: jdbc:drill:zk=local> select flatten(split(columns[0],':')) from
>> > `/tmp/foo.csv`;
>> > +-+
>> > | EXPR$0  |
>> > +-+
>> > | a   |
>> > | b   |
>> > +-+
>> > 2 rows selected (0.319 seconds)
>> >
>> > Is that what you had in mind?
>> >
>> > -- Zelaine
>> >
>> > On Tue, Aug 30, 2016 at 7:17 AM, Robin Moffatt <
>> > robin.moff...@rittmanmead.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > I'm trying to read a sequence file, in which the key is null and the
>> > value
>> > > holds multiple columns [1], delimited by \x01. In Hive I simply
>> define it
>> > > as :
>> > >
>> > > CREATE EXTERNAL TABLE foo (col1 string, col2 string, col3 timestamp)
>> > > ROW FORMAT DELIMITED
>> > > STORED as sequencefile
>> > > LOCATION '/user/oracle/foo/bar';
>> > >
>> > > In Drill I've got as far as
>> > >
>> > > SELECT CONVERT_FROM(binary_value, 'UTF8') from  `/user/oracle/foo/bar`
>> > >
>> > > which yields the data but as a single column. I can cast it to
>> individual
>> > > columns but this is no use if the field positions change
>> > >
>> > > SELECT substr(CONVERT_FROM(binary_value, 'UTF8'),5,1) as
>> > > col0,substr(CONVERT_FROM(binary_value, 'UTF8'),7,13) as
>> > > col1,substr(CONVERT_FROM(binary_value, 'UTF8'),20,20) as col3 from
>> > >  `/user/oracle/seq/pdb.soe.logon` limit 5;
>> > > +---++---+
>> > > | col0  |  col1  | col3  |
>> > > +---++---+
>> > > | I | PDB.SOE.LOGON  | 2016-07-29 13:36:40  |
>> > >
>> > >
>> > > Is there a way to treat a column as delimited and burst it out into
>> > > multiple columns? Presumably I could somehow dump the string contents
>> to
>> > > CSV and then re-read it - but I'm interested here in using Drill the
>> > query
>> > > existing data; wrangling it to suit Drill isn't really what I'm
>> looking
>> > for
>> > > (and maybe Drill just isn't the right tool here?).
>> > >
>> > >
>> > > thanks,
>> > >
>> > > Robin.
>> > >
>> > > [1]
>> > > https://docs.oracle.com/goldengate/bd1221/gg-bd/GADBD/
>> > > GUID-85A82B2E-CD51-463A-8674-3D686C3C0EC0.htm#GADBD-GUID-
>> > > 4CAFC347-0F7D-49AB-B293-EFBCE95B66D6
>> > >
>> >
>>
>
>


Re: Querying Delimited Sequence file

2016-08-30 Thread rahul challapalli
You should be able to use split_part function (I haven't tried it
myself...but it is supported). With this function you can extract
individual columns. Unfortunately I couldn't find the documentation for
this function as well. But it should be similar to how other databases
implement this function.

Also as you have observed, split does not support delimiters with more than
one character. You can raise a jira and mark it as documentation related.

Rahul



On Tue, Aug 30, 2016 at 8:58 AM, Robin Moffatt <
robin.moff...@rittmanmead.com> wrote:

> Hi,
>
> Thanks - I think SPLIT gets me some of the way, but after the FLATTEN I
> want to PIVOT, so instead of :
>
> 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> select
> flatten(split(version,'.')) from sys.version;
> +-+
> | EXPR$0  |
> +-+
> | 1   |
> | 7   |
> | 0   |
> +-+
>
> I'd get something like:
>
> +-+-+-+
> | EXPR$0  | EXPR$1  | EXPR$2  |
> +-+-+-+
> | 1   | 7   | 0   |
> +-+-+-+
>
> I'm guessing this isn't possible in Drill yet?
>
> Also, what would be be the syntax to enter the \x01 character in the SPLIT
> function? Entered literally I get an error:
>
> 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
> split(CONVERT_FROM(binary_value, 'UTF8'),'\x01') from
>  `/user/oracle/seq/pdb.soe.logon` limit 5;
> Error: SYSTEM ERROR: IllegalArgumentException: Only single character
> delimiters are supported for split()
>
> BTW I didn't realise SPLIT was supported, and it's not listed in
> https://drill.apache.org/docs/string-manipulation/ or
> https://drill.apache.org/search/?q=split -- is there somewhere I should
> log
> this kind of documentation issue?
>
> thanks, Robin.
>
>
> On 30 August 2016 at 16:07, Zelaine Fong  wrote:
>
> > If the column is delimited by some character, you can use the SPLIT()
> > function to separate the value into an array of values.  You can then use
> > the FLATTEN() function to separate the array of values into individual
> > records.
> >
> > E.g., if your column has the value "a:b", where your delimiter is ":",
> you
> > would run the following query:
> >
> > 0: jdbc:drill:zk=local> select flatten(split(columns[0],':')) from
> > `/tmp/foo.csv`;
> > +-+
> > | EXPR$0  |
> > +-+
> > | a   |
> > | b   |
> > +-+
> > 2 rows selected (0.319 seconds)
> >
> > Is that what you had in mind?
> >
> > -- Zelaine
> >
> > On Tue, Aug 30, 2016 at 7:17 AM, Robin Moffatt <
> > robin.moff...@rittmanmead.com> wrote:
> >
> > > Hi,
> > >
> > > I'm trying to read a sequence file, in which the key is null and the
> > value
> > > holds multiple columns [1], delimited by \x01. In Hive I simply define
> it
> > > as :
> > >
> > > CREATE EXTERNAL TABLE foo (col1 string, col2 string, col3 timestamp)
> > > ROW FORMAT DELIMITED
> > > STORED as sequencefile
> > > LOCATION '/user/oracle/foo/bar';
> > >
> > > In Drill I've got as far as
> > >
> > > SELECT CONVERT_FROM(binary_value, 'UTF8') from  `/user/oracle/foo/bar`
> > >
> > > which yields the data but as a single column. I can cast it to
> individual
> > > columns but this is no use if the field positions change
> > >
> > > SELECT substr(CONVERT_FROM(binary_value, 'UTF8'),5,1) as
> > > col0,substr(CONVERT_FROM(binary_value, 'UTF8'),7,13) as
> > > col1,substr(CONVERT_FROM(binary_value, 'UTF8'),20,20) as col3 from
> > >  `/user/oracle/seq/pdb.soe.logon` limit 5;
> > > +---++---+
> > > | col0  |  col1  | col3  |
> > > +---++---+
> > > | I | PDB.SOE.LOGON  | 2016-07-29 13:36:40  |
> > >
> > >
> > > Is there a way to treat a column as delimited and burst it out into
> > > multiple columns? Presumably I could somehow dump the string contents
> to
> > > CSV and then re-read it - but I'm interested here in using Drill the
> > query
> > > existing data; wrangling it to suit Drill isn't really what I'm looking
> > for
> > > (and maybe Drill just isn't the right tool here?).
> > >
> > >
> > > thanks,
> > >
> > > Robin.
> > >
> > > [1]
> > > https://docs.oracle.com/goldengate/bd1221/gg-bd/GADBD/
> > > GUID-85A82B2E-CD51-463A-8674-3D686C3C0EC0.htm#GADBD-GUID-
> > > 4CAFC347-0F7D-49AB-B293-EFBCE95B66D6
> > >
> >
>


Re: Partition reading problem (like operator) while using hive partition table in drill

2016-08-03 Thread rahul challapalli
DRILL-4665 has been fixed. Can you try it out with the latest master and
see if it works for you now?

- Rahul

On Wed, Aug 3, 2016 at 10:28 AM, Shankar Mane 
wrote:

> has any 1 started working on this ?
>
> On Wed, Jun 1, 2016 at 8:27 PM, Zelaine Fong  wrote:
>
> > Shankar,
> >
> > Work on this issue has not yet started.  Hopefully, the engineer assigned
> > to the issue will be able to take a look in a week or so.
> >
> > -- Zelaine
> >
> > On Tue, May 31, 2016 at 10:33 PM, Shankar Mane <
> shankar.m...@games24x7.com
> > >
> > wrote:
> >
> > > I didn't get any response or updates on this jira ticket ( DRILL-4665).
> > >
> > > Does anyone looking into this?
> > > On 11 May 2016 03:31, "Aman Sinha"  wrote:
> > >
> > > > The Drill test team was able to repro this and is now filed as:
> > > > https://issues.apache.org/jira/browse/DRILL-4665
> > > >
> > > > On Tue, May 10, 2016 at 8:16 AM, Aman Sinha 
> > > wrote:
> > > >
> > > > > This is supposed to work, especially since LIKE predicate is not
> even
> > > on
> > > > > the partitioning column (it should work either way).  I did a quick
> > > test
> > > > > with file system tables and it works for LIKE conditions.  Not sure
> > yet
> > > > > about Hive tables.  Could you pls file a JIRA and we'll follow up.
> > > > > Thanks.
> > > > >
> > > > > -Aman
> > > > >
> > > > > On Tue, May 10, 2016 at 1:09 AM, Shankar Mane <
> > > > shankar.m...@games24x7.com>
> > > > > wrote:
> > > > >
> > > > >> Problem:
> > > > >>
> > > > >> 1. In drill, we are using hive partition table. But explain plan
> > (same
> > > > >> query) for like and = operator differs and used all partitions in
> > case
> > > > of
> > > > >> like operator.
> > > > >> 2. If you see below drill explain plans: Like operator uses *all*
> > > > >> partitions where
> > > > >> = operator uses *only* partition filtered by log_date condition.
> > > > >>
> > > > >> FYI- We are storing our logs in hive partition table (parquet,
> > > > >> gz-compressed). Each partition is having ~15 GB data. Below is the
> > > > >> describe
> > > > >> statement output from hive:
> > > > >>
> > > > >>
> > > > >> /* Hive */
> > > > >> hive> desc hive_kafkalogs_daily ;
> > > > >> OK
> > > > >> col_name data_type comment
> > > > >> sessionid   string
> > > > >> ajaxurl string
> > > > >>
> > > > >> log_date string
> > > > >>
> > > > >> # Partition Information
> > > > >> # col_name data_type   comment
> > > > >>
> > > > >> log_date string
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> /* Drill Plan (query with LIKE) */
> > > > >>
> > > > >> explain plan for select sessionid, servertime, ajaxUrl from
> > > > >> hive.hive_kafkalogs_daily where log_date = '2016-05-09' and
> ajaxUrl
> > > like
> > > > >> '%utm_source%' limit 1 ;
> > > > >>
> > > > >> +--+--+
> > > > >> | text | json |
> > > > >> +--+--+
> > > > >> | 00-00Screen
> > > > >> 00-01  Project(sessionid=[$0], servertime=[$1], ajaxUrl=[$2])
> > > > >> 00-02SelectionVectorRemover
> > > > >> 00-03  Limit(fetch=[1])
> > > > >> 00-04UnionExchange
> > > > >> 01-01  SelectionVectorRemover
> > > > >> 01-02Limit(fetch=[1])
> > > > >> 01-03  Project(sessionid=[$0], servertime=[$1],
> > > > >> ajaxUrl=[$2])
> > > > >> 01-04SelectionVectorRemover
> > > > >> 01-05  Filter(condition=[AND(=($3,
> > '2016-05-09'),
> > > > >> LIKE($2, '%utm_source%'))])
> > > > >> 01-06Scan(groupscan=[HiveScan
> > > > >> [table=Table(dbName:default, tableName:hive_kafkalogs_daily),
> > > > >> columns=[`sessionid`, `servertime`, `ajaxurl`, `log_date`],
> > > > >> numPartitions=29, partitions= [Partition(values:[2016-04-11]),
> > > > >> Partition(values:[2016-04-12]), Partition(values:[2016-04-13]),
> > > > >> Partition(values:[2016-04-14]), Partition(values:[2016-04-15]),
> > > > >> Partition(values:[2016-04-16]), Partition(values:[2016-04-17]),
> > > > >> Partition(values:[2016-04-18]), Partition(values:[2016-04-19]),
> > > > >> Partition(values:[2016-04-20]), Partition(values:[2016-04-21]),
> > > > >> Partition(values:[2016-04-22]), Partition(values:[2016-04-23]),
> > > > >> Partition(values:[2016-04-24]), Partition(values:[2016-04-25]),
> > > > >> Partition(values:[2016-04-26]), Partition(values:[2016-04-27]),
> > > > >> Partition(values:[2016-04-28]), Partition(values:[2016-04-29]),
> > > > >> Partition(values:[2016-04-30]), Partition(values:[2016-05-01]),
> > > > >> Partition(values:[2016-

Re: How drill works internally

2016-07-25 Thread rahul challapalli
You can start with the high level architecture [1]. Then the community
might help you if you have any specific questions.

[1] https://drill.apache.org/architecture/

On Sun, Jul 24, 2016 at 11:36 PM, Sanjiv Kumar  wrote:

> How does drill run queries internally? I want to know how drill executes queries for
> different data sources, and the internal process drill follows.
>
>
>
>  ..
>   Thanks & Regards
>   *Sanjiv Kumar*
>


Re: Performance with multiple FLATTENs

2016-07-19 Thread rahul challapalli
Matt,

Having multiple flattens in your query leads to a cross-join between the
outputs of the flattens, so a performance hit is expected with the addition
of each flatten. There could also be a genuine performance bug in this
scenario; to be sure it is a bug we need more information, as Abhishek
pointed out.

However, if you want to do some computations after you have flattened out your
data, it can sometimes help to rewrite the query so that the flattens fall
into separate sub-queries; you may see some performance improvement. Let me
know how it goes.
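
As a minimal sketch of that rewrite for the JSON shown below (the file path and aliases are
illustrative), flatten the array once in an inner sub-query and index the flattened element in
the outer query instead of calling FLATTEN twice:

select t.id,
       t.elem[0] as dttm,
       t.elem[1] as result
from (select id, flatten(data) as elem
      from dfs.`/path/to/data.json`) t;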

- Rahul

On Tue, Jul 19, 2016 at 1:22 PM, Abhishek Girish 
wrote:

> Hi Matt,
>
> Can you please share more information on your setup, specifically the size
> of your dataset, including an approximate average size of individual JSON
> files, the number of nodes, including Drillbit memory config.
>
> Also can you share the query profiles for the few scenarios you mention.
>
> Regards,
> Abhishek
>
> On Friday, July 15, 2016, Matt  wrote:
>
> > I have JSON data with with a nested list and am using FLATTEN to extract
> > two of three list elements as:
> >
> > ~~~
> > SELECT id, FLATTEN(data)[0] AS dttm, FLATTEN(data)[1] AS result FROM ...
> > ~~~
> >
> > This works, but each FLATTEN seems to slow the query down dramatically,
> 3x
> > slower with the second flatten.
> >
> > Is there a better approach to extracting list elements?
> >
> > ~~~
> > [
> >   {
> > "id": 16,
> > "data": [
> >   [
> > "2016-07-13 00:00",
> > 509,
> > "OK"
> >   ],
> >   [
> > "2016-07-13 00:01",
> > 461,
> > "OK"
> >   ],
> >   [
> > "2016-07-13 00:02",
> > 508,
> > "OK"
> >   ],
> > ~~~
> >
>


Re: Best way to set schema to handle different json structures

2016-07-11 Thread rahul challapalli
Did you try creating a view with the merged schema? Then you can try
running all your queries on top of that view.
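
A minimal sketch of that approach (the view name, file path, column names, and types are
placeholders for your actual data); the casts encode the merged schema once, and every query
then targets the view:

create or replace view dfs.tmp.events_v as
select cast(t.id as integer)      as id,
       cast(t.`label` as varchar) as `label`,
       cast(t.amount as double)   as amount,
       t.dir1 as `year`, t.dir2 as `month`, t.dir3 as `day`
from dfs.`/path/to/events` t;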

- Rahul

On Mon, Jul 11, 2016 at 3:23 PM, Scott Kinney  wrote:

> We have several different json structures we want to run queries across. I
> can take a sample of each and merge the json together as python
> dictionaries then write that out to a file and have drill read that file
> first to set the schema but I dont think this will be very practical as we
> will have out data in s3 in a name/year/month/day and I dont want to have
> to put this schema file in every directory is s3. that seems unmanageable.
>
>
> Is there a way to set the schema from a file before making a query via the
> REST api?
>
>
>
> 
> Scott Kinney | DevOps
> stem    |   m  510.282.1299
> 100 Rollins Road, Millbrae, California 94030
>
> This e-mail and/or any attachments contain Stem, Inc. confidential and
> proprietary information and material for the sole use of the intended
> recipient(s). Any review, use or distribution that has not been expressly
> authorized by Stem, Inc. is strictly prohibited. If you are not the
> intended recipient, please contact the sender and delete all copies. Thank
> you.
>


Re: CHAR data type

2016-07-11 Thread rahul challapalli
I raised https://issues.apache.org/jira/browse/DRILL-4772 to fix the doc
issue. Thanks Santosh.

- Rahul

On Mon, Jul 4, 2016 at 10:11 AM, Santosh Kulkarni <
santoshskulkarn...@gmail.com> wrote:

> Here is the link:
>
> https://drill.apache.org/docs/supported-data-types/
>
>
>
> On Mon, Jul 4, 2016 at 12:06 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Can you point us to where you are looking? The documentation should only
> > say that "CHAR" datatype in hive is supported from Drill 1.7 onward.
> >
> > - Rahul
> >
> > On Mon, Jul 4, 2016 at 9:53 AM, Santosh Kulkarni <
> > santoshskulkarn...@gmail.com> wrote:
> >
> > > Thanks Shankar. I was looking in Drill documentation but did not
> realize
> > to
> > > check in 1.7 Release notes.
> > >
> > >
> > >
> > > On Mon, Jul 4, 2016 at 10:02 AM, Shankar Mane <
> > shankar.m...@games24x7.com>
> > > wrote:
> > >
> > > > It is being supported since 1.7.0. Please check this link
> > > > https://drill.apache.org/docs/apache-drill-1-7-0-release-notes/
> > > >
> > > > On 04-Jul-2016 8:07 PM, "Santosh Kulkarni" <
> > santoshskulkarn...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > Hello,
> > > >
> > > > While running another simple query for select count(*) from
> table_name,
> > > > Drill gave an error for Unsupported Hive data type CHAR.
> > > >
> > > > The column is of CHAR(6) data type. Drill documentation shows CHAR as
> > > > supported data type.
> > > >
> > > > This is on Drill version 1.6
> > > >
> > > > Thanks,
> > > >
> > > > Santosh
> > > >
> > >
> >
>


Re: Looking for workaround to Schema detection problems

2016-07-08 Thread rahul challapalli
In the past, setting the below parameter has not always fixed this issue, but
it is still worth a try:

ALTER SESSION SET `store.json.all_text_mode` = true;

You might also want to try explicit casting to varchar for this specific
column.
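
For example, a sketch applied to the query from the original mail (same path and column):

select cast(t.b as varchar) as b
from dfs.`D:\MyData\test.json` t
where t.b is not null;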

On Fri, Jul 8, 2016 at 8:14 AM, Zelaine Fong  wrote:

> Have you tried using
>
> ALTER SESSION SET `store.json.all_text_mode` = true;
>
> -- Zelaine
>
> On Fri, Jul 8, 2016 at 6:37 AM, Holy Alexander <
> alexander.h...@derstandard.at> wrote:
>
> > Hi Vitalii!
> >
> >
> > This is what I tried:
> >
> > Altered the setting system-wide:
> >
> > ALTER SYSTEM SET `exec.enable_union_type` = true
> >
> > Verified that the setting is really altered
> >
> > SELECT *
> > FROM sys.options
> > WHERE type in ('SYSTEM','SESSION') order by name
> >
> > And re-run the query
> >
> > Unfortunately this does not solve the problem.
> > It just causes a different error:
> >
> > [30027]Query execution error. Details:[
> > SYSTEM ERROR: NullPointerException
> > Fragment 0:0
> > [Error Id: 0f9cb7ae-d2d5-474c-ad57-2d558041e2c6 on
> >
> > (I tried this on Drill 1.7 and 1.6)
> >
> > Best regards,
> > Alexander
> >
> >
> > -Original Message-
> > From: Vitalii Diravka [mailto:vitalii.dira...@gmail.com]
> > Sent: 08 July 2016 13:30
> > To: user@drill.apache.org
> > Subject: Re: Looking for workaround to Schema detection problems
> >
> > Hi Alexander,
> >
> > Please try with turning on the union type:
> >
> > ALTER SESSION SET `exec.enable_union_type` = true;
> >
> > Kind regards
> > Vitalii
> >
> > 2016-07-08 10:50 GMT+00:00 Holy Alexander  >:
> >
> > > My JSON data looks - simplified - like this
> > >
> > > {"ID":1,"a":"some text"}
> > > {"ID":2,"a":"some text","b":"some other text"} {"ID":3,"a":"some
> > > text"}
> > >
> > > Column b is only physically serialized when it is not null.
> > > It is the equivalent of a NULLable VARCHAR() column in SQL.
> > >
> > > I run queries like these:
> > >
> > > SELECT b
> > > FROM dfs.`D:\MyData\test.json`
> > > WHERE b IS NOT NULL
> > >
> > > And normally all is fine.
> > > However, among my thousands of data files, I have two files where the
> > > first occurrence of b happens a few thousand records down the file.
> > > These two data files would look like this:
> > >
> > > {"ID":1,"a":"some text"}
> > > {"ID":2,"a":"some text"}
> > > ... 5000 more records without column b ...
> > > {"ID":5002,"a":"some text","b":"some other text"} {"ID":5003,"a":"some
> > > text"}
> > >
> > > In this case, my simple SQL query above fails:
> > >
> > > [30027]Query execution error. Details:[ DATA_READ ERROR: Error parsing
> > > JSON - You tried to write a VarChar type when you are using a
> > > ValueWriter of type NullableIntWriterImpl.
> > > File  /D:/MyData/test.json
> > > Record 5002 Fragment ...
> > >
> > > It seems that the Schema inference mechanism of Drill only samples a
> > > certain amount of bytes (or records) to determine the schema.
> > > If the first occurrence of a schema detail happens to far down things
> > > go boom.
> > >
> > > I am now looking for a sane way to work around this.
> > > Preferred by extending the query and not by altering my massive
> > > amounts of data.
> > >
> > > BTW, I tried altering the data by chaning the first line:
> > > {"ID":1,"a":"some text","b":null}
> > > does not help.
> > >
> > > Of course, changing the first line to
> > > {"ID":1,"a":"some text","b":""}
> > > solves the problem, but this is not a practical solution.
> > >
> > > Any help appreciated.
> > > Alexander
> > >
> >
>


Re: What are the JDBC/ODBC requirements to connect to Drill?

2016-07-06 Thread rahul challapalli
Few Answers inline

On Tue, Jul 5, 2016 at 3:40 AM, Juan Diego Ruiz Perea  wrote:

> Hello,
>
> We want to test connecting Oracle BI (OBI) to Apache Drill. We saw the
> JDBC/ODBC drivers option and have the following questions:
>
>- Do you know if someone has already tested connecting OBI to Apache
>Drill? -- Not that I know of. I have tested it from Spotfire and a
> custom built jdbc application

   - Do you know if there is any specific requirement to use properly the
>JDBC driver? I mean any JDBC client at all should connect with no
> issue? Kindly refer to the relevant documentation at
> https://drill.apache.org/docs/using-the-jdbc-driver/
>- I saw your ODBC driver is 3.8 version, do you know if Apache Drill
>supports also previous ODBC versions?
>
> Thanks a lot in advance.
>
> Kind regards,
>
> Juan Diego
>


Re: Help with the Optimizer of Apache Drill

2016-07-05 Thread rahul challapalli
For a start, below is the relevant piece from the documentation [1]. You
can also prepend any query with "explain plan for" to view the exact plan
generated by Drill's Optimizer.
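
For example (cp.`employee.json` is the sample data set bundled with Drill; any query can go
after the prefix):

explain plan for
select employee_id, full_name
from cp.`employee.json`
limit 5;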

• Optimizer: Drill uses various standard database optimizations such as
rule based/cost based, as well as data locality and other optimization
rules exposed by the storage engine to re-write and split the query. The
output of the optimizer is a distributed physical query plan that
represents the most efficient and fastest way to execute the query across
different nodes in the cluster.

[1] https://drill.apache.org/architecture/

On Tue, Jul 5, 2016 at 7:59 AM, Benamor, Adel  wrote:

> Hello,
> I'm new in the utilization of data virtualization and I try  to understand
> the running of Apache Drill.
> I browsed the documentation but I didn't understand how is the running of
> the optimizer.
> Indeed, I learned that it's a cost-base optimizer, but nothing else.
> I want to know how the optimizer works ?
>
> Thanks for your help
> Warm regards
> Adel BENAMOR
>
>


Re: Initial Feed Back on 1.7.0 Release

2016-07-05 Thread rahul challapalli
John,

Once you add/update data in one of your sub-folders, the immediate next
query should update the metadata cache automatically and all subsequent
queries should fetch metadata from the cache. If this is not the case, it's
a bug. Can you confirm your findings?
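
For reference, the cache can also be rebuilt explicitly; a sketch against a placeholder table path:

refresh table metadata dfs.`/data/mytable`;

The automatic refresh described above should make this unnecessary, but an explicit run is a
quick way to tell whether the cache itself is the problem.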

- Rahul

On Tue, Jul 5, 2016 at 9:53 AM, John Omernik  wrote:

> Hey Abdel, thanks for the response..  on questions 1 and 2, from what I
> understood, nothing was changed, but then I had to make the third query for
> it to take.  I'll keep observing to determine what that may be.
>
> On 3, a logical place to implement, or start implementing incremental may
> be allowing a directories refresh automatically update the parents data
> without causing a cascading (update everything) refresh.  So if if I have a
> structure like this:
>
> mytable
> ...dir0=2016-06-06
> ...dir1=23
>
> (basically table, days, hours)
>
> that if I update data in hour 23, it would update 2016-06-06 with the new
> timestamps and update mytable with the new timestamps.  The only issue
> would be figuring out a way to take a lock. (Say you had multiple loads
> happening, you want to ensure that one days updates don't clobber another
> days)
>
> Just a thought on that.
>
> Yep, the incremental issue would come into play here.  Are there any design
> docs or JIRAs on the incremental updates to metadata?
>
> Thanks for your reply.  I am looking forward other dev's thoughts on your
> answer to 3 as well.
>
> Thanks!
>
> John
>
>
> On Tue, Jul 5, 2016 at 11:05 AM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > answers inline.
> >
> > On Tue, Jul 5, 2016 at 8:39 AM, John Omernik  wrote:
> >
> > > Working with the 1.7.0, the feature that I was very interested in was
> the
> > > fixing of the Metadata Caching while using user impersonation.
> > >
> > > I have a large table, with a day directory that can contain up to 1000
> > > parquet files each.
> > >
> > >
> > > Planning was getting terrible on this table as I added new data, and
> the
> > > metadata cache wasn't an option for me because of impersonation.
> > >
> > > Well now will 1.7.0 that's working, and it makes a HUGE difference. A
> > query
> > > that would take 120 seconds now takes 20 seconds.   Etc.
> > >
> > > Overall, this is a great feature and folks should look into it for
> > > performance of large Parquet tables.
> > >
> > > Some observations that I would love some help with.
> > >
> > > 1. Drill "Seems" to know when a new subdirectory was added and it
> > generates
> > > the metadata for that directory with the missing data. This is without
> > > another REFRESH TABLE METADATA command.  That works great for new
> > > directories, however, what happens if you just copy new files into an
> > > existing directory? Will it use the metadata cache that only lists the
> > old
> > > files. or will things get updated? I guess, how does it know things are
> > in
> > > sync?
> > >
> >
> > When you query folder A that contains metadata cache, Drill will check
> all
> > it's sub-directories' last modification time to figure out if anything
> > changed since last time the metadata cache was refreshed. If data was
> > added/removed, Drill will refresh the metadata cache for folder A.
> >
> >
> > > 2.  Pertaining to point 1, when new data was added, the first query
> that
> > > used that directory partition, seemed to write the metadata file.
> > However,
> > > the second query ran ALSO rewrote the file (and it ran with the speed
> of
> > an
> > > uncached directory).  However, the third query was now running at
> cached
> > > speeds. (the 20 seconds vs. 120 seconds).  This seems odd, but maybe
> > there
> > > is an reason?
> > >
> >
> > Unfortunately, the current implementation of metadata cache doesn't
> support
> > incremental refresh, so each time Drill detects a change inside the
> folder,
> > it will run a "full" metadata cache refresh before running the query,
> > that's what explains why your second query took so long to finish.
> >
> >
> > > 3. Is Drill ok with me running REFRESH TABLE METADATA only for
> > > subdirectory?  So if I load a day, can I issue REFRESH TABLE METADATA
> > > `mytable/2016-07-04`  and have things be all where drill is happy?
> I.e.
> > > does the mytable metadata need to be updated as well or is that wasted
> > > cycles?
> > >
> >
> > Drill keeps a metadata cache file for every subdirectory of your table.
> So
> > you'll end up with a cache file in "mytable" and another one in
> > "mytable/2016-07-04".
> > I'm not sure about the following, and other developers will correct soon
> > enough, but my understanding is that you can run a refresh command on the
> > subfolder and it will only cause that particular cache (and any of it's
> > subfolders) to be updated and it won't cause the cache file on "mytable"
> > and any other of it's subfolders to be updated.
> > Also, as long as you only query this particular day, Drill won't detect
> the
> > change and won't try to update any other 

Re: CHAR data type

2016-07-04 Thread rahul challapalli
Can you point us to where you are looking? The documentation should only
say that "CHAR" datatype in hive is supported from Drill 1.7 onward.

- Rahul

On Mon, Jul 4, 2016 at 9:53 AM, Santosh Kulkarni <
santoshskulkarn...@gmail.com> wrote:

> Thanks Shankar. I was looking in Drill documentation but did not realize to
> check in 1.7 Release notes.
>
>
>
> On Mon, Jul 4, 2016 at 10:02 AM, Shankar Mane 
> wrote:
>
> > It is being supported since 1.7.0. Please check this link
> > https://drill.apache.org/docs/apache-drill-1-7-0-release-notes/
> >
> > On 04-Jul-2016 8:07 PM, "Santosh Kulkarni"  >
> > wrote:
> >
> > Hello,
> >
> > While running another simple query for select count(*) from table_name,
> > Drill gave an error for Unsupported Hive data type CHAR.
> >
> > The column is of CHAR(6) data type. Drill documentation shows CHAR as
> > supported data type.
> >
> > This is on Drill version 1.6
> >
> > Thanks,
> >
> > Santosh
> >
>


Re: Querying Parquet: Filtering on a sorted column

2016-07-01 Thread rahul challapalli
This is something which is not currently supported. The "parquet filter
pushdown" feature should be able to achieve this. Its still under
development.

- Rahul

On Fri, Jul 1, 2016 at 12:10 PM, Dan Wild  wrote:

> Hi,
>
> I'm attempting to query a directory of parquet files that are partitioned
> on column A (int) and sorted on column B (also int).  When I run a query
> such as SELECT * FROM mydirectory WHERE A = 123 AND B = 456, I can see that
> the physical query plan is using the criteria on A to choose the correct
> parquet file, but it is performing a ParquetGroupScan on ALL rows in that
> file despite the criteria on the sorted column B.
>
> Based on my understanding of parquet, Drill should be using the page and/or
> column metadata to avoid scanning the entire file when filtering on a
> sorted column.  However, there is no performance benefit when filtering on
> column B compared to any other non-sorted column.
>
> Is there something I can do to make Drill take advantage of the fact that
> my file is sorted?
>
> Thanks,
> Dan
>


Re: Drill with mapreduce

2016-06-28 Thread rahul challapalli
This looks like a bug in the JDBC driver packaging. Can you raise a JIRA
for the same?

On Tue, Jun 28, 2016 at 9:10 PM, GameboyNO1 <7304...@qq.com> wrote:

> Hi,
> I'm trying to use drill with mapreduce.
> Details are:
> I put a list of drill queries in a file as mapper's input, some to query
> hbase, some to query qarquet files. Every query is executed in mapper, and
> the query result is sorted in reducer.
> In mapper, I connect to drill with JDBC, and have problem of hitting Java
> exception in mapper: NoClassDefFoundError on oadd/org/apache/log4j/Logger.
> Anyone can give some help about how to fix it?
> And also welcome comments on my solution.
> Thanks!
>
>
> Alfie


Re: Is this normal view behavior?

2016-06-23 Thread rahul challapalli
I couldn't reproduce the problem either. I tried with both csv files and
parquet files. Can you point us to the commit which you are using? I am
curious to know how you ended up seeing "_DEFAULT_COL_TO_READ_" :)

- Rahul

On Thu, Jun 23, 2016 at 3:15 PM, Jinfeng Ni  wrote:

> Tried on a commit on 1.7.0-SNAPSHOT. Looks like I could not re-produce
> the problem.  Which version are u using?
>
> create view dfs.tmp.myview as select dir0 as p_day, l_partkey,
> l_orderkey, l_suppkey from dfs.tmp.t2;
> +---+-+
> |  ok   | summary |
> +---+-+
> | true  | View 'myview' created successfully in 'dfs.tmp' schema  |
> +---+-+
>
> select * from dfs.tmp.myview;
> +++-++
> | p_day  | l_partkey  | l_orderkey  | l_suppkey  |
> +++-++
> | 1990   | 11001  | 42128896| 36002  |
>
>
> select p_day from dfs.tmp.myview;
> ++
> | p_day  |
> ++
> | 1990   |
>
>
> select dir0 from dfs.tmp.myview;
>
> Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 11:
> Column 'dir0' not found in any table
>
>
>
> On Thu, Jun 23, 2016 at 1:37 PM, Neeraja Rentachintala
>  wrote:
> > This is a bug.
> >
> > On Thu, Jun 23, 2016 at 1:32 PM, rahul challapalli <
> > challapallira...@gmail.com> wrote:
> >
> >> This looks like a bug. If you renamed the dir0 column as p_day, then you
> >> should see that in sqlline as well. And I have never seen
> >> "_DEFAULT_COL_TO_READ_"
> >> before. Can you file a jira?
> >>
> >> - Rahul
> >>
> >> On Thu, Jun 23, 2016 at 12:33 PM, John Omernik 
> wrote:
> >>
> >> > I have a table that is a directory of parquet files, each row had say
> 3
> >> > columns, and the table is split into subdirectories that allow me to
> use
> >> > dir0 partitioning.
> >> >
> >> > so if I select * from `table`
> >> >
> >> > I get col1, col2, col3, and dir0 as my fields returned.
> >> >
> >> > So if I create a view
> >> >
> >> > CREATE VIEW view_myview as
> >> > select dir0 as `p_day`, col1, col2, col3 from `path/to/table`
> >> >
> >> > and run
> >> > select * from view_myview
> >> >
> >> > why, in sqlline, isn't the first column named "p_day"
> >> >
> >> > I can reference things in my query by p_day, however, the returned
> >> results,
> >> > still say dir0?
> >> >
> >> > I dir0 | col1| col2 | col3 |
> >> >
> >> > If I do select p_day, col1 then I get
> >> >
> >> > | dir0 | col1|
> >> >
> >> > if I do select p_day then I get
> >> >
> >> > | _DEFAULT_COL_TO_READ_ | dir0 |
> >> >
> >> > where the first column (DEFAULT_COL_TO_READ) is always null.
> >> >
> >> > If I do select dir0 from view I get "dir0" not found.
> >> >
> >> > I guess, the "expected" (principal of least surprise) would be to
> have it
> >> > just be a column, that is always labeled p_day, and if I only select
> >> that,
> >> > I get the dir0 value repeated for each value.
> >> >
> >> > Am I over thinking minutia again? :)
> >> >
> >>
>


Re: Is this normal view behavior?

2016-06-23 Thread rahul challapalli
This looks like a bug. If you renamed the dir0 column as p_day, then you
should see that in sqlline as well. And I have never seen
"_DEFAULT_COL_TO_READ_"
before. Can you file a jira?

- Rahul

On Thu, Jun 23, 2016 at 12:33 PM, John Omernik  wrote:

> I have a table that is a directory of parquet files, each row had say 3
> columns, and the table is split into subdirectories that allow me to use
> dir0 partitioning.
>
> so if I select * from `table`
>
> I get col1, col2, col3, and dir0 as my fields returned.
>
> So if I create a view
>
> CREATE VIEW view_myview as
> select dir0 as `p_day`, col1, col2, col3 from `path/to/table`
>
> and run
> select * from view_myview
>
> why, in sqlline, isn't the first column named "p_day"
>
> I can reference things in my query by p_day, however, the returned results,
> still say dir0?
>
> I dir0 | col1| col2 | col3 |
>
> If I do select p_day, col1 then I get
>
> | dir0 | col1|
>
> if I do select p_day then I get
>
> | _DEFAULT_COL_TO_READ_ | dir0 |
>
> where the first column (DEFAULT_COL_TO_READ) is always null.
>
> If I do select dir0 from view I get "dir0" not found.
>
> I guess, the "expected" (principal of least surprise) would be to have it
> just be a column, that is always labeled p_day, and if I only select that,
> I get the dir0 value repeated for each value.
>
> Am I over thinking minutia again? :)
>


Re: Apache Drill vs PrestoDB

2016-06-08 Thread rahul challapalli
The post on quora gives a good overview. It would be helpful if you can
provide some insight into what you are trying to achieve. A few questions to
that end:

  1. Who will be the users of your application
  2. Where does your data live and in what format
  3. What is the scale of data you want to the tool to handle
  4. Interactive queries or long running queries(> 1Hr)
  5. Maximum no of concurrent users you expect
  6. Authentication/Authorization requirements
  7. Any SLA's around query response times
  8. Any specific BI tools that need to be supported

- Rahul

On Tue, Jun 7, 2016 at 8:07 PM, Santosh Kulkarni <
santoshskulkarn...@gmail.com> wrote:

> Hi,
>
> While searching for comparison between Drill and Presto, google search
> gives a high level design comparison posted on Quora.
>
> Does anyone has more detailed comparison on these 2 tools?
>
> Thanks in advance.
>
> Santosh
>


Re: HiveMetastore HA with Drill

2016-06-02 Thread rahul challapalli
Not sure if our hive storage plugin supports this feature. Even if the
feature is available, we haven't tested it.

- Rahul

On Tue, May 31, 2016 at 12:01 PM, Veera Naranammalpuram <
vnaranammalpu...@maprtech.com> wrote:

> Anyone has any insights into how the Hive storage plug-in can handle Hive
> MetaStore HA? The Hive storage plug-in has only one property for
> hive.metastore.uris and it takes only one IP:port. When I add a second one,
> the update of the storage plug-in fails.
>
>   "configProps": {
> "hive.metastore.uris": "thrift://:9083"
>   }
>
> How can we give 2 IP's to Drill so it knows to try the second IP if its not
> able to talk to the first one?
>
> Thanks in advance.
>
> --
> Veera Naranammalpuram
> Product Specialist - SQL on Hadoop
> *MapR Technologies (www.mapr.com )*
> *(Email) vnaranammalpu...@maprtech.com *
> *(Mobile) 917 683 8116 - can text *
> *Timezone: ET (UTC -5:00 / -4:00)*
>


Re: Error with flatten function on MongoDB documents that contain array of key-value pairs

2016-05-25 Thread rahul challapalli
Just to be sure, can you run the below query which does not contain
flatten? If this query also fails, then it could be bad data in the "Pnl"
column (maybe an empty string?).

SELECT x.DateValueCollection
FROM `mongo`.`db_name`.`some.random.collection.name` AS x;



On Wed, May 25, 2016 at 10:32 AM, Arman Siddiqui <
asiddi...@symmetryinvestments.com> wrote:

> Good afternoon,
>
> I am receiving the following error when using the flatten function to
> query a MongoDB collection:
> Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a
> VarChar type when you are using a ValueWriter of type
> NullableFloat8WriterImpl.
>
> The collection contains a number documents, where each document consists
> of some key-value pairs and a single array which itself contains exactly 2
> key-value pairs per element.
>
> Here is an example document in the collection:
> {
>
> "_id" : ObjectId("1234567890abcdef12345678"),
> "TradeId" : NumberInt(12345),
> "DateValueCollection" : [
> {
> "ScenarioDate" : ISODate("2011-05-20T00:00:00.000+"),
> "Pnl" : 22.0
> },
> {
> "ScenarioDate" : ISODate("2011-05-23T00:00:00.000+"),
> "Pnl" : -30.0
> },
> {
> "ScenarioDate" : ISODate("2011-05-24T00:00:00.000+"),
> "Pnl" : 15.0
> },
> {
> "ScenarioDate" : ISODate("2011-05-25T00:00:00.000+"),
> "Pnl" : 9.0
> }
> ]
> }
>
>
> Within the array, I have checked that every ScenarioDate value is of
> MongoDB type Date and that every PnL value is of MongoDB type double.
> There are no null values.
>
> Each document contains this same array structure with approx. 1k such
> elements.
>
> When I copy a small set of these documents into a new collection, the
> flatten functions works correctly.  But in the full collection, the flatten
> function fails with the error above.
>
> I have tried toggling the store.mongo.read_numbers_as_double and
> store.mongo.all_text_mode flags with no luck.
>
>
>
> Command used (via shell or web client) and verbose error output for
> reference:
>
> 0: jdbc:drill:zk=local> SELECT flatten(`x`.`DateValueCollection`) FROM
> `mongo`.`db_name`.`some.random.collection.name` AS `x` limit 10;
>
> Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a
> VarChar type when you are using a ValueWriter of type
> NullableFloat8WriterImpl.
>
> Fragment 0:0
>
> [Error Id: a43e57f6-f8ce-4850-8040-65858828056f on
> SYM156.options-it.com:31010] (state=,code=0)
> java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: You tried
> to write a VarChar type when you are using a ValueWriter of type
> NullableFloat8WriterImpl.
>
> Fragment 0:0
>
> [Error Id: a43e57f6-f8ce-4850-8040-65858828056f on somehostname:31010]
> at
> org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:247)
> at
> org.apache.drill.jdbc.impl.DrillCursor.loadInitialSchema(DrillCursor.java:290)
> at
> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:1923)
> at
> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:73)
> at
> net.hydromatic.avatica.AvaticaConnection.executeQueryInternal(AvaticaConnection.java:404)
> at
> net.hydromatic.avatica.AvaticaStatement.executeQueryInternal(AvaticaStatement.java:355)
> at
> net.hydromatic.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:338)
> at
> net.hydromatic.avatica.AvaticaStatement.execute(AvaticaStatement.java:69)
> at
> org.apache.drill.jdbc.impl.DrillStatementImpl.execute(DrillStatementImpl.java:101)
> at sqlline.Commands.execute(Commands.java:841)
> at sqlline.Commands.sql(Commands.java:751)
> at sqlline.SqlLine.dispatch(SqlLine.java:746)
> at sqlline.SqlLine.begin(SqlLine.java:621)
> at sqlline.SqlLine.start(SqlLine.java:375)
> at sqlline.SqlLine.main(SqlLine.java:268)
> Caused by: org.apache.drill.common.exceptions.UserRemoteException: SYSTEM
> ERROR: IllegalArgumentException: You tried to write a VarChar type when you
> are using a ValueWriter of type NullableFloat8WriterImpl.
>
> Fragment 0:0
>
> [Error Id: a43e57f6-f8ce-4850-8040-65858828056f on somehostname:31010]
> at
> org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:119)
> at
> org.apache.drill.exec.rpc.user.UserClient.handleReponse(UserClient.java:113)
> at
> org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:46)
> at
> org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:31)
> at org.apache.drill.exec.rpc.RpcBus.handle(RpcBus.java:67)
> at
> org.apache.drill.exec.rpc.RpcBus$RequestEvent.run(RpcBus.java:374)
> at
> org.apache.drill.common.SerializedExecutor$RunnableProcessor.run(Se

Re: [ANNOUNCE] New PMC Chair of Apache Drill

2016-05-25 Thread rahul challapalli
Congratulations Parth!

Thank You Jacques for your leadership over the last few years.

On Wed, May 25, 2016 at 10:26 AM, Gautam Parai  wrote:

> Congratulations Parth!
>
> On Wed, May 25, 2016 at 9:02 AM, Jinfeng Ni  wrote:
>
> > Big congratulations, Parth!
> >
> > Thank you, Jacques, for your contribution and leadership over the last
> > few years!
> >
> >
> > On Wed, May 25, 2016 at 8:35 AM, Jacques Nadeau 
> > wrote:
> > > I'm pleased to announce that the Drill PMC has voted to elect Parth
> > Chandra
> > > as the new PMC chair of Apache Drill. Please join me in congratulating
> > > Parth!
> > >
> > > thanks,
> > > Jacques
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> >
>


Re: File size limit for CTAS?

2016-01-21 Thread rahul challapalli
Ignoring the CTAS part, can you try running just the select query and see if
it completes? My suspicion is that some record/field in your large file is
causing drill to break. It would also be helpful if you can give more
information from drillbit.log when this error happens (search for
da53d687-a8d5-4927-88ec-e56d5da17112).
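
In other words, something like the below, using the same path as your CTAS (a plain select, no
create table wrapper):

select * from `/csv/customer/customer_20151017.csv`;

If this select alone hits the same IllegalArgumentException, the problem is in reading the CSV
rather than in writing the Parquet output.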

- Rahul

On Thu, Jan 21, 2016 at 4:10 PM, Matt  wrote:

> Converting CSV files to Parquet with CTAS, and getting errors on some
> larger files:
>
> With a source file of 16.34GB (as reported in the HDFS explorer):
>
> ~~~
> create table `/parquet/customer_20151017` partition by (date_tm) AS select
> * from `/csv/customer/customer_20151017.csv`;
> Error: SYSTEM ERROR: IllegalArgumentException: length: -484 (expected: >=
> 0)
>
> Fragment 1:1
>
> [Error Id: da53d687-a8d5-4927-88ec-e56d5da17112 on es07:31010]
> (state=,code=0)
> ~~~
>
> But an optation on a 70 MB file of the same format succeeds.
>
> Given some HDFS advice is to avoid large numbers of small files [1], is
> there a general guideline for the max file size to ingest into Parquet
> files with CTAS?
>
> ---
>
> [1] HDFS put performance is very poor with a large number of small files,
> thus trying to find the right amount of source rollup to perform. Pointers
> to HDFS configuration guides for beginners would be appreciated too. I have
> only used HDFS for Drill - no other Hadoop experience.
>


Re: Efficient joins in Drill - avoiding the massive overhead of scan based joins

2016-01-17 Thread rahul challapalli
The unit of parallelization in the lucene plugin is a segment.

Stefan,

I think it would be more accurate if you rewrite your join query so that we
push the join keys into the lucene group scan and then compare the numbers.
Something like the below

   select * from tbl1 a
   left join (select * from tbl2 where tbl2.col1 in (select col1 from tbl1)) b
     on a.col1 = b.col1;

- Rahul

On Sun, Jan 17, 2016 at 11:20 AM, Jacques Nadeau  wrote:

> Can you give more detail about the join stats themselves? You also state
> 20x slower but I'm trying to understand what that means. 20x slower than
> what? Are you parallelizing the Lucene read or is this a single reader?
>
> For example:
>
> I have a join.
> The left side has a billion rows.
> The right side has 10 million rows.
> When applying the join condition, only 10k rows are needed from the right
> side.
>
> How long does it take to read a few million records from Lucene? (Recently
> with Elastic we've been seeing ~50-100k/second per thread when only
> retrieving a single stored field.)
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Sat, Jan 16, 2016 at 12:11 PM, Stefán Baxter  >
> wrote:
>
> > Hi Jacques,
> >
> > Thank you for taking the time, it's appreciated.
> >
> > I'm trying to contribute to the Lucene reader for Drill (Started by Rahul
> > Challapalli). We would like to use it for storage of metadata used in our
> > Drill setup.
> > This is perfectly suited for our needs as the metadata is already
> available
> > in Lucene document+indexes and it's tenant specific (So this is not the
> > global metadata that should reside in Postgres/HBase or something
> similar)
> >
> > I think it's best that I confess that I'm not sure what I'm looking for
> or
> > how to ask for it, at least not in proper Drill terms.
> >
> > The Lucene reader is working but the joins currently rely on full scan
> > which introduces ~20 time longer execution time on simple data sets (few
> > million records) so I need to get the index based joins going but I don't
> > know how.
> >
> > We have resources to do this now but our knowlidge of Drill is limited
> and
> > I could not, in my initial scan of the project, find any use
> > of DrillJoinRel that indicated indexes were involved (please forgive me
> if
> > this is a false assumption).
> >
> > Can you please clarify things for me a bit:
> >
> >- Is the JDBC connector already doing proper pushdown of filters for
> >joins? (If so then I must really get my reading glasses on)
> >- What will change with this new approach.
> >
> > I'm not really sure what you need from me now but I'm more than happy to
> > share everything except the data it self :).
> >
> > The fork is places here:
> > https://github.com/activitystream/drill/tree/lucene-work but no tests
> > files
> > are included in the repo, sorry, and this is all very immature.
> >
> > Regards,
> >  -Stefán
> >
> >
> >
> >
> > On Sat, Jan 16, 2016 at 7:46 PM, Jacques Nadeau 
> > wrote:
> >
> > > Closest things already done to date is the join pushdown in the jdbc
> > > connector and the prototype code someone built a while back to do a
> join
> > > using HBase as a hash table. Aman and I have an ongoing thread
> discussing
> > > using elastic indexing and sideband communication to accelerate joins.
> If
> > > would be great if you could cover exactly what you're doing (including
> > > relevant stats), that would give us a better idea of how to point you
> in
> > > the right direction.
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Sat, Jan 16, 2016 at 5:18 AM, Stefán Baxter <
> > ste...@activitystream.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Can anyone point me to an implementation where joins are implemented
> > with
> > > > full support for filters and efficient handling of joins based on
> > > indexes.
> > > >
> > > > The only code I have come across all seems to rely on complete scan
> of
> > > the
> > > > related table and that is not acceptable for the use case we are
> > working
> > > on
> > > > (Lucene reader).
> > > >
> > > > Regards,
> > > >  -Stefán
> > > >
> > >
> >
>


Re: Lucene Plugin :: Join Filter and pushdown

2016-01-14 Thread rahul challapalli
Use Case : In the case of a left join between a non-index table and a
lucene index, it is more efficient to read the join keys from the non-index
table and push them into the LuceneGroupScan. This way we can avoid reading
the whole index.
I was suggesting converting the plan for Q1 into a plan similar to Q2 using
an optimizer rule.
  Q1.) select * from tbl1 left join tbl2 on tbl1.col1 = tbl2.col1
  Q2.) select * from tbl1 left join (select * from tbl2 where tbl2.col1 in
       (select col1 from tbl1)) t2 on tbl1.col1 = t2.col1

Any other suggestions or pointers are appreciated

- Rahul


On Thu, Jan 14, 2016 at 2:52 PM, Stefán Baxter 
wrote:

> Hi,
>
> I'm working on the Lucene plugin (see previous email) and the focus now is
> support for joins with filter push-down to avoid the default table scan
> that is provided by default.
>
> I'm fairly new to Drill and in over my head, to be honest, but this is fun
> and with this addition the Lucene plugin could be come usable or at least
> worth exploring.
>
> Is there anyone here that could assist me a bit?
>
> Current status:
>
>- The lucene plugin is working and join filters are partially
>- RelOptRuleOperand is constructed and DrillJoinRel.conditions are
>processed by a sceleton class (The "normal" queries are already being
>processed fairly well)
>
> There are probably more things involved then I can imagine at this point
> and perhaps I'm naive in thinking someone has the time to assist a relative
> noob on such a task but examples are also appreciated. The plugins that I
> have seen seem to have relatively no join-filter logic so a rich
> example/blueprint would also be great.
>
> Regards,
>  -Stefán
>


Re: Classpath scanning & udfs

2016-01-12 Thread rahul challapalli
Adding the drill-module.conf file in the drill-conf directory with the
custom package names worked for me. Thanks for the assistance Jason, Julien
and Jacques.
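
For anyone who hits the same issue later: what worked for me was placing a
drill-module.conf under the drill conf directory on every drillbit with
roughly the below content (the package name here is just an example; use
whatever package your UDF classes live in):

  drill.classpath.scanning.packages : ${?drill.classpath.scanning.packages} [
    org.apache.drill.udfs
  ]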

- Rahul

On Tue, Jan 12, 2016 at 8:09 AM, Jason Altekruse 
wrote:

> Copying info over from Slack.
>
> For anyone who finds this thread, empty drill-module.conf files do not
> cause problems. The issue was a misunderstanding about the function of
> drill-override.conf. Values cannot be added to existing property lists
> using this file, it is designed for overriding the default values.
>
> If you put your UDFs in a package that is already being scanned as part of
> the default list then there is no need to add a new package to the list.
> The default list is generated by merging the drill-module.conf files in all
> of the default Drill packages. Here is one example from the drill exec
> module [1]
>
> To augment the list, add a drill-module.conf file to the classpath, the
> easiest way to do this is just include it in your jar.
>
> [1] -
>
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/resources/drill-module.conf
> <
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/resources/drill-module.conf
> >
>
> - Jason
>
> On Mon, Jan 11, 2016 at 11:24 AM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Sure!
> >
> > On Mon, Jan 11, 2016 at 11:06 AM, Jason Altekruse <
> > altekruseja...@gmail.com>
> > wrote:
> >
> > > The error you posted originally is from deserializing a storage plugin
> > > config, not the drill-override or drill-module files. That being said,
> we
> > > find the list of available storage plugins using classpath scanning.
> > >
> > > I don't know why the inclusion or exclusion of contents inside of a
> > > module.conf file would impact our scanning for storage plugins, but I
> > > believe this is the issue you are seeing. Somehow the package that
> > contains
> > > the hbase plugin is being removed from the list of packages to scan
> when
> > > you add the empty file.
> > >
> > > Do you want to jump on slack to chat about this?
> > >
> > > - Jason
> > >
> > > On Mon, Jan 11, 2016 at 10:52 AM, rahul challapalli <
> > > challapallira...@gmail.com> wrote:
> > >
> > > > Julien,
> > > >
> > > > I have an empty drill-module.conf file in the udf jar file. Below are
> > the
> > > > contents of my global drill-override.conf file
> > > >
> > > > drill.classpath.scanning.packages :
> > > ${?drill.classpath.scanning.packages} [
> > > > org.apache.drill.udfs ]
> > > > drill.exec: {
> > > >   cluster-id: "rahul_cluster_com-drillbits",
> > > >   zk.connect: "localhost:5181"
> > > > }
> > > >
> > > > With this, the drillbit fails to start with the error I posted in the
> > > first
> > > > email. May be I am getting the syntax wrong?
> > > >
> > > > The reason I am insisting an empty drill-module.conf file is because,
> > > udf's
> > > > developed prior to the "classpath scanning change" had an empty
> > > > drill-module.conf file and we used to override the udf package name
> in
> > > the
> > > > global conf file.
> > > >
> > > > - Rahul
> > > >
> > > > On Mon, Jan 11, 2016 at 10:46 AM, Julien Le Dem 
> > > wrote:
> > > >
> > > > > Yes I believe that should work:
> > > > > -  add an empty drill-module.conf in the root of the udf jar
> > > > > -  add the package to drill.classpath.scanning.packages in the
> drill
> > > conf
> > > > > (possibly using drill-override.conf)
> > > > >
> > > > > However if you are adding the drill-module.conf file to the jar,
> you
> > > > might
> > > > > as well add the package in it. (unless there's some other reason)
> > > > >
> > > > > On Mon, Jan 11, 2016 at 10:28 AM, rahul challapalli <
> > > > > challapallira...@gmail.com> wrote:
> > > > >
> > > > > > Just to be sure, If I have an empty drill-module.conf in the root
> > of
> > > my
> > > > > udf
> > > > > > jar, then there is no way to add the package information to the
> > > global
> > > > > > drill-override.conf file?
> > > > 

Re: Classpath scanning & udfs

2016-01-11 Thread rahul challapalli
Sure!

On Mon, Jan 11, 2016 at 11:06 AM, Jason Altekruse 
wrote:

> The error you posted originally is from deserializing a storage plugin
> config, not the drill-override or drill-module files. That being said, we
> find the list of available storage plugins using classpath scanning.
>
> I don't know why the inclusion or exclusion of contents inside of a
> module.conf file would impact our scanning for storage plugins, but I
> believe this is the issue you are seeing. Somehow the package that contains
> the hbase plugin is being removed from the list of packages to scan when
> you add the empty file.
>
> Do you want to jump on slack to chat about this?
>
> - Jason
>
> On Mon, Jan 11, 2016 at 10:52 AM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Julien,
> >
> > I have an empty drill-module.conf file in the udf jar file. Below are the
> > contents of my global drill-override.conf file
> >
> > drill.classpath.scanning.packages :
> ${?drill.classpath.scanning.packages} [
> > org.apache.drill.udfs ]
> > drill.exec: {
> >   cluster-id: "rahul_cluster_com-drillbits",
> >   zk.connect: "localhost:5181"
> > }
> >
> > With this, the drillbit fails to start with the error I posted in the
> first
> > email. May be I am getting the syntax wrong?
> >
> > The reason I am insisting an empty drill-module.conf file is because,
> udf's
> > developed prior to the "classpath scanning change" had an empty
> > drill-module.conf file and we used to override the udf package name in
> the
> > global conf file.
> >
> > - Rahul
> >
> > On Mon, Jan 11, 2016 at 10:46 AM, Julien Le Dem 
> wrote:
> >
> > > Yes I believe that should work:
> > > -  add an empty drill-module.conf in the root of the udf jar
> > > -  add the package to drill.classpath.scanning.packages in the drill
> conf
> > > (possibly using drill-override.conf)
> > >
> > > However if you are adding the drill-module.conf file to the jar, you
> > might
> > > as well add the package in it. (unless there's some other reason)
> > >
> > > On Mon, Jan 11, 2016 at 10:28 AM, rahul challapalli <
> > > challapallira...@gmail.com> wrote:
> > >
> > > > Just to be sure, If I have an empty drill-module.conf in the root of
> my
> > > udf
> > > > jar, then there is no way to add the package information to the
> global
> > > > drill-override.conf file?
> > > >
> > > > On Mon, Jan 11, 2016 at 10:26 AM, Julien Le Dem 
> > > wrote:
> > > >
> > > > > You are correct:
> > > > > The jar containing the UDFs should have a drill-module.conf at the
> > root
> > > > > adding your package to the property
> drill.classpath.scanning.packages
> > > for
> > > > > scanning
> > > > > drill.classpath.scanning.packages :
> > > > ${?drill.classpath.scanning.packages} [
> > > > > my.package.containing.my.udfs
> > > > > ]
> > > > > Jars that don't contain a drill-module.conf will not get scanned.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jan 11, 2016 at 10:17 AM, rahul challapalli <
> > > > > challapallira...@gmail.com> wrote:
> > > > >
> > > > > > Thanks for your reply Jason.
> > > > > >
> > > > > > If we cannot override the global configuration file, then for
> > > existing
> > > > > > UDF's we have to re-compile them by modifying the
> drill-module.conf
> > > > file.
> > > > > > If so our UDF's are not backward compatible. Appreciate it if
> > someone
> > > > can
> > > > > > confirm this.
> > > > > >
> > > > > > - Rahul
> > > > > >
> > > > > > On Mon, Jan 11, 2016 at 9:59 AM, Jason Altekruse <
> > > > > altekruseja...@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Rahul,
> > > > > > >
> > > > > > > The error message you are seeing is in reading a storage plugin
> > > > > > > configuration file. I am planning to fix these kinds of
> messages
> > to
> > > > > > > actually direct users at the file that is failing parsing.

Re: Classpath scanning & udfs

2016-01-11 Thread rahul challapalli
Julien,

I have an empty drill-module.conf file in the udf jar file. Below are the
contents of my global drill-override.conf file

drill.classpath.scanning.packages : ${?drill.classpath.scanning.packages} [
org.apache.drill.udfs ]
drill.exec: {
  cluster-id: "rahul_cluster_com-drillbits",
  zk.connect: "localhost:5181"
}

With this, the drillbit fails to start with the error I posted in the first
email. Maybe I am getting the syntax wrong?

The reason I am insisting on an empty drill-module.conf file is that UDFs
developed prior to the "classpath scanning change" had an empty
drill-module.conf file, and we used to override the UDF package name in the
global conf file.

- Rahul

On Mon, Jan 11, 2016 at 10:46 AM, Julien Le Dem  wrote:

> Yes I believe that should work:
> -  add an empty drill-module.conf in the root of the udf jar
> -  add the package to drill.classpath.scanning.packages in the drill conf
> (possibly using drill-override.conf)
>
> However if you are adding the drill-module.conf file to the jar, you might
> as well add the package in it. (unless there's some other reason)
>
> On Mon, Jan 11, 2016 at 10:28 AM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Just to be sure, If I have an empty drill-module.conf in the root of my
> udf
> > jar, then there is no way to add the package information to the global
> > drill-override.conf file?
> >
> > On Mon, Jan 11, 2016 at 10:26 AM, Julien Le Dem 
> wrote:
> >
> > > You are correct:
> > > The jar containing the UDFs should have a drill-module.conf at the root
> > > adding your package to the property drill.classpath.scanning.packages
> for
> > > scanning
> > > drill.classpath.scanning.packages :
> > ${?drill.classpath.scanning.packages} [
> > >     my.package.containing.my.udfs
> > > ]
> > > Jars that don't contain a drill-module.conf will not get scanned.
> > >
> > >
> > >
> > > On Mon, Jan 11, 2016 at 10:17 AM, rahul challapalli <
> > > challapallira...@gmail.com> wrote:
> > >
> > > > Thanks for your reply Jason.
> > > >
> > > > If we cannot override the global configuration file, then for
> existing
> > > > UDF's we have to re-compile them by modifying the drill-module.conf
> > file.
> > > > If so our UDF's are not backward compatible. Appreciate it if someone
> > can
> > > > confirm this.
> > > >
> > > > - Rahul
> > > >
> > > > On Mon, Jan 11, 2016 at 9:59 AM, Jason Altekruse <
> > > altekruseja...@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > Rahul,
> > > > >
> > > > > The error message you are seeing is in reading a storage plugin
> > > > > configuration file. I am planning to fix these kinds of messages to
> > > > > actually direct users at the file that is failing parsing. I have
> > seen
> > > > this
> > > > > in the past when the classpath was incorrect and one of the plugins
> > > (like
> > > > > Hbase) was not included.
> > > > >
> > > > > Julien can confirm, but I think this might be intentional to have
> the
> > > > paths
> > > > > read out of the modules configuration rather than the global one to
> > > save
> > > > > time when scanning the path (rather than scanning all of the jars
> for
> > > all
> > > > > paths given in the override file).
> > > > >
> > > > > On Fri, Jan 8, 2016 at 4:32 PM, rahul challapalli <
> > > > > challapallira...@gmail.com> wrote:
> > > > >
> > > > > > Before 1.2, my udfs project contained an empty
> drill-override.conf
> > > file
> > > > > and
> > > > > > I used to update the drill-override.conf on all the drillbits to
> > > > specify
> > > > > > the package of my UDF. This is no longer working for me. I tried
> a
> > > few
> > > > > > things and below is how my drill-override.conf file looks now
> > > > > >
> > > > > > drill.classpath.scanning.packages :
> > > > > ${?drill.classpath.scanning.packages} [
> > > > > > org.apache.drill.udfs ]
> > > > > > drill.exec: {
> > > > > >   cluster-id: "rahul_cluster_com-drillbits",
> > > > > >   zk.connect: "localhost:5181"
> > > > > > }
> > > > > >
> > > > > > When I restart the drillbits, I get this strange error " Caused
> by:
> > > > > > com.fasterxml.jackson.databind.JsonMappingException: Could not
> > > resolve
> > > > > type
> > > > > > id 'hbase' into a subtype of [simple type, class
> > > > > > org.apache.drill.common.logical.StoragePluginConfig]"
> > > > > >
> > > > > > If I moved the package information to the drill-module.conf in my
> > > udf's
> > > > > > project, then things are working fine. However this requires
> > > > re-compiling
> > > > > > the udfs which is not desirable. Is there any other way around
> > this ?
> > > > > >
> > > > > > - Rahul
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Julien
> > >
> >
>
>
>
> --
> Julien
>


Re: Classpath scanning & udfs

2016-01-11 Thread rahul challapalli
Just to be sure: if I have an empty drill-module.conf in the root of my UDF
jar, then there is no way to add the package information to the global
drill-override.conf file?

On Mon, Jan 11, 2016 at 10:26 AM, Julien Le Dem  wrote:

> You are correct:
> The jar containing the UDFs should have a drill-module.conf at the root
> adding your package to the property drill.classpath.scanning.packages for
> scanning
> drill.classpath.scanning.packages : ${?drill.classpath.scanning.packages} [
> my.package.containing.my.udfs
> ]
> Jars that don't contain a drill-module.conf will not get scanned.
>
>
>
> On Mon, Jan 11, 2016 at 10:17 AM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Thanks for your reply Jason.
> >
> > If we cannot override the global configuration file, then for existing
> > UDF's we have to re-compile them by modifying the drill-module.conf file.
> > If so our UDF's are not backward compatible. Appreciate it if someone can
> > confirm this.
> >
> > - Rahul
> >
> > On Mon, Jan 11, 2016 at 9:59 AM, Jason Altekruse <
> altekruseja...@gmail.com
> > >
> > wrote:
> >
> > > Rahul,
> > >
> > > The error message you are seeing is in reading a storage plugin
> > > configuration file. I am planning to fix these kinds of messages to
> > > actually direct users at the file that is failing parsing. I have seen
> > this
> > > in the past when the classpath was incorrect and one of the plugins
> (like
> > > Hbase) was not included.
> > >
> > > Julien can confirm, but I think this might be intentional to have the
> > paths
> > > read out of the modules configuration rather than the global one to
> save
> > > time when scanning the path (rather than scanning all of the jars for
> all
> > > paths given in the override file).
> > >
> > > On Fri, Jan 8, 2016 at 4:32 PM, rahul challapalli <
> > > challapallira...@gmail.com> wrote:
> > >
> > > > Before 1.2, my udfs project contained an empty drill-override.conf
> file
> > > and
> > > > I used to update the drill-override.conf on all the drillbits to
> > specify
> > > > the package of my UDF. This is no longer working for me. I tried a
> few
> > > > things and below is how my drill-override.conf file looks now
> > > >
> > > > drill.classpath.scanning.packages :
> > > ${?drill.classpath.scanning.packages} [
> > > > org.apache.drill.udfs ]
> > > > drill.exec: {
> > > >   cluster-id: "rahul_cluster_com-drillbits",
> > > >   zk.connect: "localhost:5181"
> > > > }
> > > >
> > > > When I restart the drillbits, I get this strange error " Caused by:
> > > > com.fasterxml.jackson.databind.JsonMappingException: Could not
> resolve
> > > type
> > > > id 'hbase' into a subtype of [simple type, class
> > > > org.apache.drill.common.logical.StoragePluginConfig]"
> > > >
> > > > If I moved the package information to the drill-module.conf in my
> udf's
> > > > project, then things are working fine. However this requires
> > re-compiling
> > > > the udfs which is not desirable. Is there any other way around this ?
> > > >
> > > > - Rahul
> > > >
> > >
> >
>
>
>
> --
> Julien
>


Re: Classpath scanning & udfs

2016-01-11 Thread rahul challapalli
Thanks for your reply Jason.

If we cannot override the global configuration file, then for existing
UDFs we have to re-compile them by modifying the drill-module.conf file.
If so, our UDFs are not backward compatible. I would appreciate it if someone
can confirm this.

- Rahul

On Mon, Jan 11, 2016 at 9:59 AM, Jason Altekruse 
wrote:

> Rahul,
>
> The error message you are seeing is in reading a storage plugin
> configuration file. I am planning to fix these kinds of messages to
> actually direct users at the file that is failing parsing. I have seen this
> in the past when the classpath was incorrect and one of the plugins (like
> Hbase) was not included.
>
> Julien can confirm, but I think this might be intentional to have the paths
> read out of the modules configuration rather than the global one to save
> time when scanning the path (rather than scanning all of the jars for all
> paths given in the override file).
>
> On Fri, Jan 8, 2016 at 4:32 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Before 1.2, my udfs project contained an empty drill-override.conf file
> and
> > I used to update the drill-override.conf on all the drillbits to specify
> > the package of my UDF. This is no longer working for me. I tried a few
> > things and below is how my drill-override.conf file looks now
> >
> > drill.classpath.scanning.packages :
> ${?drill.classpath.scanning.packages} [
> > org.apache.drill.udfs ]
> > drill.exec: {
> >   cluster-id: "rahul_cluster_com-drillbits",
> >   zk.connect: "localhost:5181"
> > }
> >
> > When I restart the drillbits, I get this strange error " Caused by:
> > com.fasterxml.jackson.databind.JsonMappingException: Could not resolve
> type
> > id 'hbase' into a subtype of [simple type, class
> > org.apache.drill.common.logical.StoragePluginConfig]"
> >
> > If I moved the package information to the drill-module.conf in my udf's
> > project, then things are working fine. However this requires re-compiling
> > the udfs which is not desirable. Is there any other way around this ?
> >
> > - Rahul
> >
>


Classpath scanning & udfs

2016-01-08 Thread rahul challapalli
Before 1.2, my udfs project contained an empty drill-override.conf file and
I used to update the drill-override.conf on all the drillbits to specify
the package of my UDF. This is no longer working for me. I tried a few
things and below is how my drill-override.conf file looks now

drill.classpath.scanning.packages : ${?drill.classpath.scanning.packages} [
org.apache.drill.udfs ]
drill.exec: {
  cluster-id: "rahul_cluster_com-drillbits",
  zk.connect: "localhost:5181"
}

When I restart the drillbits, I get this strange error " Caused by:
com.fasterxml.jackson.databind.JsonMappingException: Could not resolve type
id 'hbase' into a subtype of [simple type, class
org.apache.drill.common.logical.StoragePluginConfig]"

If I move the package information to the drill-module.conf in my UDF
project, then things work fine. However, this requires re-compiling
the UDFs, which is not desirable. Is there any other way around this?

- Rahul


Re: Announcing new committer: Kristine Hahn

2015-12-04 Thread rahul challapalli
Congratulations Kristine :)

On Fri, Dec 4, 2015 at 9:43 AM, Abdel Hakim Deneche 
wrote:

> Congrats Kristine :D
>
> On Fri, Dec 4, 2015 at 9:36 AM, Sudheesh Katkam 
> wrote:
>
> > Congratulations and welcome, Kris!
> >
> > > On Dec 4, 2015, at 9:19 AM, Jacques Nadeau  wrote:
> > >
> > > The Apache Drill PMC is very pleased to announce Kristine Hahn as a new
> > > committer.
> > >
> > > Kris has worked tirelessly on creating and improving the Drill
> > > documentation. She has been extraordinary in her engagement with the
> > > community and has greatly accelerated the speed to resolution of doc
> > issues
> > > and improvements.
> > >
> > > Welcome Kristine!
> >
> >
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   
>
>
> Now Available - Free Hadoop On-Demand Training
> <
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >
>


Re: Order of records read in a parquet file

2015-11-06 Thread rahul challapalli
I did try your suggestion and sqlline displayed the columns from the json
file just fine. Raised the below jira to track this issue
https://issues.apache.org/jira/browse/DRILL-4048

On Fri, Nov 6, 2015 at 5:52 PM, Jacques Nadeau  wrote:

> I wouldn't jump to that conclusion. Sqlline uses toString. If we changed
> the toString behavior, it could be a problem. Maybe do a ctas to a json
> file to confirm.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Fri, Nov 6, 2015 at 5:40 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > From a previous build, I got the data for these columns just fine from
> > sqlline. So I think we can eliminate any display issues unless I am
> missing
> > something?
> >
> > - Rahul
> >
> > On Fri, Nov 6, 2015 at 5:34 PM, Jacques Nadeau 
> wrote:
> >
> > > Can you confirm if this is a display bug in sqlline or jdbc to string
> > > versus an actual data return?
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Fri, Nov 6, 2015 at 5:31 PM, rahul challapalli <
> > > challapallira...@gmail.com> wrote:
> > >
> > > > Jason,
> > > >
> > > > You were partly correct. We are not dropping records however we are
> > > > corrupting dictionary encoded binary columns. I got confused that we
> > are
> > > > returning different records, but we are trimming (or returning
> > unreadable
> > > > chars) some columns which are binary. I was able to reproduce with
> the
> > > > lineitem data set. I will raise a jira and I think this should be
> > treated
> > > > critical. Thoughts?
> > > >
> > > > - Rahul
> > > >
> > > > On Fri, Nov 6, 2015 at 4:30 PM, rahul challapalli <
> > > > challapallira...@gmail.com> wrote:
> > > >
> > > > > Jason,
> > > > >
> > > > > I missed that. Let me check whether we are dropping any records. I
> > > would
> > > > > be surprised if our regression tests missed that :)
> > > > >
> > > > > - Rahul
> > > > >
> > > > > On Fri, Nov 6, 2015 at 4:19 PM, Jason Altekruse <
> > > > altekruseja...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Rahul,
> > > > >>
> > > > >> Thanks for working on a reproduction of the issue. You didn't
> > actually
> > > > >> answer my first question, are you getting the same data out of the
> > > file,
> > > > >> just in a different order? It seems much more likely that we are
> > > > dropping
> > > > >> some records at the beginning than reordering them somehow,
> > although I
> > > > >> would have expected an error like this to be caught by the unit or
> > > > >> regression tests.
> > > > >>
> > > > >> Thanks,
> > > > >> Jason
> > > > >>
> > > > >> On Fri, Nov 6, 2015 at 4:13 PM, rahul challapalli <
> > > > >> challapallira...@gmail.com> wrote:
> > > > >>
> > > > >> > Thanks for your replies. The file is private and I will try to
> > > > >> construct a
> > > > >> > file without sensitive data which can expose this behavior.
> > > > >> >
> > > > >> > - Rahul
> > > > >> >
> > > > >> > On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse <
> > > > >> altekruseja...@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Is this a large or private parquet file? Can you share it to
> > allow
> > > > me
> > > > >> to
> > > > >> > > debug the read path for it?
> > > > >> > >
> > > > >> > > On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse <
> > > > >> > altekruseja...@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > The changes to parquet were not supposed to be functional at
> > > all.
> > > > We
> > > > >> > had
> > > > >> > > > been maintaining our fork of parquet-mr to have a ByteBuffer
> > > 

Re: Order of records read in a parquet file

2015-11-06 Thread rahul challapalli
From a previous build, I got the data for these columns just fine from
sqlline. So I think we can eliminate any display issues unless I am missing
something?

- Rahul

On Fri, Nov 6, 2015 at 5:34 PM, Jacques Nadeau  wrote:

> Can you confirm if this is a display bug in sqlline or jdbc to string
> versus an actual data return?
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Fri, Nov 6, 2015 at 5:31 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Jason,
> >
> > You were partly correct. We are not dropping records however we are
> > corrupting dictionary encoded binary columns. I got confused that we are
> > returning different records, but we are trimming (or returning unreadable
> > chars) some columns which are binary. I was able to reproduce with the
> > lineitem data set. I will raise a jira and I think this should be treated
> > critical. Thoughts?
> >
> > - Rahul
> >
> > On Fri, Nov 6, 2015 at 4:30 PM, rahul challapalli <
> > challapallira...@gmail.com> wrote:
> >
> > > Jason,
> > >
> > > I missed that. Let me check whether we are dropping any records. I
> would
> > > be surprised if our regression tests missed that :)
> > >
> > > - Rahul
> > >
> > > On Fri, Nov 6, 2015 at 4:19 PM, Jason Altekruse <
> > altekruseja...@gmail.com>
> > > wrote:
> > >
> > >> Rahul,
> > >>
> > >> Thanks for working on a reproduction of the issue. You didn't actually
> > >> answer my first question, are you getting the same data out of the
> file,
> > >> just in a different order? It seems much more likely that we are
> > dropping
> > >> some records at the beginning than reordering them somehow, although I
> > >> would have expected an error like this to be caught by the unit or
> > >> regression tests.
> > >>
> > >> Thanks,
> > >> Jason
> > >>
> > >> On Fri, Nov 6, 2015 at 4:13 PM, rahul challapalli <
> > >> challapallira...@gmail.com> wrote:
> > >>
> > >> > Thanks for your replies. The file is private and I will try to
> > >> construct a
> > >> > file without sensitive data which can expose this behavior.
> > >> >
> > >> > - Rahul
> > >> >
> > >> > On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse <
> > >> altekruseja...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Is this a large or private parquet file? Can you share it to allow
> > me
> > >> to
> > >> > > debug the read path for it?
> > >> > >
> > >> > > On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse <
> > >> > altekruseja...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > The changes to parquet were not supposed to be functional at
> all.
> > We
> > >> > had
> > >> > > > been maintaining our fork of parquet-mr to have a ByteBuffer
> based
> > >> read
> > >> > > and
> > >> > > > write path to reduce heap memory usage. The work done was just
> > >> getting
> > >> > > > these changes merged back into parquet-mr and making
> corresponding
> > >> > > changes
> > >> > > > in Drill to accommodate any interface modifications introduced
> > >> since we
> > >> > > > last rebased (there were mostly just package renames). There
> were
> > a
> > >> lot
> > >> > > of
> > >> > > > comments on the PR, and a decent amount of refactoring that was
> > >> done to
> > >> > > > consolidate and otherwise clean up the code, but there shouldn't
> > >> have
> > >> > > been
> > >> > > > any changes to the behavior of the reader or writer.
> > >> > > >
> > >> > > > Are you getting all of the same data out if you read the whole
> > file,
> > >> > just
> > >> > > > in a different order?
> > >> > > >
> > >> > > > On Fri, Nov 6, 2015 at 3:31 PM, rahul challapalli <
> > >> > > > challapallira...@gmail.com> wrote:
> > >> > > >
> > >> > > >> parquet-meta command suggests that there is only one row group
> > >> > > >>
> > >> > > >> On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau <
> > jacq...@dremio.com
> > >> >
> > >> > > >> wrote:
> > >> > > >>
> > >> > > >> > How many row groups?
> > >> > > >> >
> > >> > > >> > --
> > >> > > >> > Jacques Nadeau
> > >> > > >> > CTO and Co-Founder, Dremio
> > >> > > >> >
> > >> > > >> > On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <
> > >> > > >> > challapallira...@gmail.com> wrote:
> > >> > > >> >
> > >> > > >> > > Drillers,
> > >> > > >> > >
> > >> > > >> > > With the new parquet library update, can someone throw some
> > >> light
> > >> > on
> > >> > > >> the
> > >> > > >> > > order in which the records are read from a single parquet
> > file?
> > >> > > >> > >
> > >> > > >> > > With the older library, when I run the below query on a
> > single
> > >> > > parquet
> > >> > > >> > > file, I used to get a set of records. Now after the parquet
> > >> > library
> > >> > > >> > update,
> > >> > > >> > > I am seeing a different set of records. Just wanted to
> > >> understand
> > >> > > what
> > >> > > >> > > specifically has changed.
> > >> > > >> > >
> > >> > > >> > > select * from `file.parquet` limit 5;
> > >> > > >> > >
> > >> > > >> > > - Rahul
> > >> > > >> > >
> > >> > > >> >
> > >> > > >>
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>


Re: Order of records read in a parquet file

2015-11-06 Thread rahul challapalli
Jason,

You were partly correct. We are not dropping records; however, we are
corrupting dictionary-encoded binary columns. I was initially confused into
thinking we are returning different records, but we are actually trimming
(or returning unreadable characters in) some columns which are binary. I was
able to reproduce this with the lineitem data set. I will raise a JIRA, and I
think this should be treated as critical. Thoughts?

- Rahul

On Fri, Nov 6, 2015 at 4:30 PM, rahul challapalli <
challapallira...@gmail.com> wrote:

> Jason,
>
> I missed that. Let me check whether we are dropping any records. I would
> be surprised if our regression tests missed that :)
>
> - Rahul
>
> On Fri, Nov 6, 2015 at 4:19 PM, Jason Altekruse 
> wrote:
>
>> Rahul,
>>
>> Thanks for working on a reproduction of the issue. You didn't actually
>> answer my first question, are you getting the same data out of the file,
>> just in a different order? It seems much more likely that we are dropping
>> some records at the beginning than reordering them somehow, although I
>> would have expected an error like this to be caught by the unit or
>> regression tests.
>>
>> Thanks,
>> Jason
>>
>> On Fri, Nov 6, 2015 at 4:13 PM, rahul challapalli <
>> challapallira...@gmail.com> wrote:
>>
>> > Thanks for your replies. The file is private and I will try to
>> construct a
>> > file without sensitive data which can expose this behavior.
>> >
>> > - Rahul
>> >
>> > On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse <
>> altekruseja...@gmail.com>
>> > wrote:
>> >
>> > > Is this a large or private parquet file? Can you share it to allow me
>> to
>> > > debug the read path for it?
>> > >
>> > > On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse <
>> > altekruseja...@gmail.com>
>> > > wrote:
>> > >
>> > > > The changes to parquet were not supposed to be functional at all. We
>> > had
>> > > > been maintaining our fork of parquet-mr to have a ByteBuffer based
>> read
>> > > and
>> > > > write path to reduce heap memory usage. The work done was just
>> getting
>> > > > these changes merged back into parquet-mr and making corresponding
>> > > changes
>> > > > in Drill to accommodate any interface modifications introduced
>> since we
>> > > > last rebased (there were mostly just package renames). There were a
>> lot
>> > > of
>> > > > comments on the PR, and a decent amount of refactoring that was
>> done to
>> > > > consolidate and otherwise clean up the code, but there shouldn't
>> have
>> > > been
>> > > > any changes to the behavior of the reader or writer.
>> > > >
>> > > > Are you getting all of the same data out if you read the whole file,
>> > just
>> > > > in a different order?
>> > > >
>> > > > On Fri, Nov 6, 2015 at 3:31 PM, rahul challapalli <
>> > > > challapallira...@gmail.com> wrote:
>> > > >
>> > > >> parquet-meta command suggests that there is only one row group
>> > > >>
>> > > >> On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau > >
>> > > >> wrote:
>> > > >>
>> > > >> > How many row groups?
>> > > >> >
>> > > >> > --
>> > > >> > Jacques Nadeau
>> > > >> > CTO and Co-Founder, Dremio
>> > > >> >
>> > > >> > On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <
>> > > >> > challapallira...@gmail.com> wrote:
>> > > >> >
>> > > >> > > Drillers,
>> > > >> > >
>> > > >> > > With the new parquet library update, can someone throw some
>> light
>> > on
>> > > >> the
>> > > >> > > order in which the records are read from a single parquet file?
>> > > >> > >
>> > > >> > > With the older library, when I run the below query on a single
>> > > parquet
>> > > >> > > file, I used to get a set of records. Now after the parquet
>> > library
>> > > >> > update,
>> > > >> > > I am seeing a different set of records. Just wanted to
>> understand
>> > > what
>> > > >> > > specifically has changed.
>> > > >> > >
>> > > >> > > select * from `file.parquet` limit 5;
>> > > >> > >
>> > > >> > > - Rahul
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>>
>
>


Re: Order of records read in a parquet file

2015-11-06 Thread rahul challapalli
Jason,

I missed that. Let me check whether we are dropping any records. I would be
surprised if our regression tests missed that :)

- Rahul

On Fri, Nov 6, 2015 at 4:19 PM, Jason Altekruse 
wrote:

> Rahul,
>
> Thanks for working on a reproduction of the issue. You didn't actually
> answer my first question, are you getting the same data out of the file,
> just in a different order? It seems much more likely that we are dropping
> some records at the beginning than reordering them somehow, although I
> would have expected an error like this to be caught by the unit or
> regression tests.
>
> Thanks,
> Jason
>
> On Fri, Nov 6, 2015 at 4:13 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Thanks for your replies. The file is private and I will try to construct
> a
> > file without sensitive data which can expose this behavior.
> >
> > - Rahul
> >
> > On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse <
> altekruseja...@gmail.com>
> > wrote:
> >
> > > Is this a large or private parquet file? Can you share it to allow me
> to
> > > debug the read path for it?
> > >
> > > On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse <
> > altekruseja...@gmail.com>
> > > wrote:
> > >
> > > > The changes to parquet were not supposed to be functional at all. We
> > had
> > > > been maintaining our fork of parquet-mr to have a ByteBuffer based
> read
> > > and
> > > > write path to reduce heap memory usage. The work done was just
> getting
> > > > these changes merged back into parquet-mr and making corresponding
> > > changes
> > > > in Drill to accommodate any interface modifications introduced since
> we
> > > > last rebased (there were mostly just package renames). There were a
> lot
> > > of
> > > > comments on the PR, and a decent amount of refactoring that was done
> to
> > > > consolidate and otherwise clean up the code, but there shouldn't have
> > > been
> > > > any changes to the behavior of the reader or writer.
> > > >
> > > > Are you getting all of the same data out if you read the whole file,
> > just
> > > > in a different order?
> > > >
> > > > On Fri, Nov 6, 2015 at 3:31 PM, rahul challapalli <
> > > > challapallira...@gmail.com> wrote:
> > > >
> > > >> parquet-meta command suggests that there is only one row group
> > > >>
> > > >> On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau 
> > > >> wrote:
> > > >>
> > > >> > How many row groups?
> > > >> >
> > > >> > --
> > > >> > Jacques Nadeau
> > > >> > CTO and Co-Founder, Dremio
> > > >> >
> > > >> > On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <
> > > >> > challapallira...@gmail.com> wrote:
> > > >> >
> > > >> > > Drillers,
> > > >> > >
> > > >> > > With the new parquet library update, can someone throw some
> light
> > on
> > > >> the
> > > >> > > order in which the records are read from a single parquet file?
> > > >> > >
> > > >> > > With the older library, when I run the below query on a single
> > > parquet
> > > >> > > file, I used to get a set of records. Now after the parquet
> > library
> > > >> > update,
> > > >> > > I am seeing a different set of records. Just wanted to
> understand
> > > what
> > > >> > > specifically has changed.
> > > >> > >
> > > >> > > select * from `file.parquet` limit 5;
> > > >> > >
> > > >> > > - Rahul
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>


Re: Order of records read in a parquet file

2015-11-06 Thread rahul challapalli
Thanks for your replies. The file is private and I will try to construct a
file without sensitive data which can expose this behavior.

- Rahul

On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse 
wrote:

> Is this a large or private parquet file? Can you share it to allow me to
> debug the read path for it?
>
> On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse 
> wrote:
>
> > The changes to parquet were not supposed to be functional at all. We had
> > been maintaining our fork of parquet-mr to have a ByteBuffer based read
> and
> > write path to reduce heap memory usage. The work done was just getting
> > these changes merged back into parquet-mr and making corresponding
> changes
> > in Drill to accommodate any interface modifications introduced since we
> > last rebased (there were mostly just package renames). There were a lot
> of
> > comments on the PR, and a decent amount of refactoring that was done to
> > consolidate and otherwise clean up the code, but there shouldn't have
> been
> > any changes to the behavior of the reader or writer.
> >
> > Are you getting all of the same data out if you read the whole file, just
> > in a different order?
> >
> > On Fri, Nov 6, 2015 at 3:31 PM, rahul challapalli <
> > challapallira...@gmail.com> wrote:
> >
> >> parquet-meta command suggests that there is only one row group
> >>
> >> On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau 
> >> wrote:
> >>
> >> > How many row groups?
> >> >
> >> > --
> >> > Jacques Nadeau
> >> > CTO and Co-Founder, Dremio
> >> >
> >> > On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <
> >> > challapallira...@gmail.com> wrote:
> >> >
> >> > > Drillers,
> >> > >
> >> > > With the new parquet library update, can someone throw some light on
> >> the
> >> > > order in which the records are read from a single parquet file?
> >> > >
> >> > > With the older library, when I run the below query on a single
> parquet
> >> > > file, I used to get a set of records. Now after the parquet library
> >> > update,
> >> > > I am seeing a different set of records. Just wanted to understand
> what
> >> > > specifically has changed.
> >> > >
> >> > > select * from `file.parquet` limit 5;
> >> > >
> >> > > - Rahul
> >> > >
> >> >
> >>
> >
> >
>


Re: Order of records read in a parquet file

2015-11-06 Thread rahul challapalli
The parquet-meta command suggests that there is only one row group.

On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau  wrote:

> How many row groups?
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > Drillers,
> >
> > With the new parquet library update, can someone throw some light on the
> > order in which the records are read from a single parquet file?
> >
> > With the older library, when I run the below query on a single parquet
> > file, I used to get a set of records. Now after the parquet library
> update,
> > I am seeing a different set of records. Just wanted to understand what
> > specifically has changed.
> >
> > select * from `file.parquet` limit 5;
> >
> > - Rahul
> >
>


Order of records read in a parquet file

2015-11-06 Thread rahul challapalli
Drillers,

With the new parquet library update, can someone throw some light on the
order in which the records are read from a single parquet file?

With the older library, when I run the below query on a single parquet
file, I used to get a set of records. Now after the parquet library update,
I am seeing a different set of records. Just wanted to understand what
specifically has changed.

select * from `file.parquet` limit 5;

- Rahul


Semantics for boolean expressions in order by clause

2015-11-02 Thread rahul challapalli
Drillers,

What are the semantics for the below query? Should this syntax even be
supported?

select * from hive.dest2 order by key+1 = 497;

- Rahul


Re: Externally created Parquet files and partition pruning

2015-10-21 Thread rahul challapalli
Chris,

It is not just sufficient to specify which column is the partition column.
The data should also be organized accordingly. Below is a high-level
description of how partition pruning works with parquet files:

1. Use CTAS with a PARTITION BY clause: here Drill creates one (or more)
files for each distinct partition column value, and all the records which
have the same partition column value go into that file. The metadata of each
parquet file contains the information necessary to identify the partition
column(s) in that file.
2. Use a query with a filter on the partition column: during planning, if
Drill detects a filter on the partition column, it instructs the execution
engine to only scan the files whose partition column value matches the
filter condition (see the sketch below).
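
To make that concrete, below is a minimal sketch (the workspace, table and
column names are made up for illustration):

  -- step 1: CTAS with PARTITION BY writes one or more files per distinct year
  create table dfs.tmp.`orders_p` partition by (order_year) as
  select order_id, order_total, order_year from dfs.`/data/orders`;

  -- step 2: a filter on the partition column lets the planner prune files
  select order_id, order_total from dfs.tmp.`orders_p` where order_year = 2014;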

Hope this helps.

- Rahul

On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews  wrote:

> We create a JSON format schema for the Parquet file using the Avro
> specification and use this schema when loading data.
>
> Is there anything special we have to do to flag a column as a partitioning
> column ?
> Sorry I don’t understand your answer. What do you mean by ‘discover the
> columns with a single value’ ?
>
> Cheers — Chris
>
> > On 21 Oct 2015, at 20:02, Mehant Baid  wrote:
> >
> > The information is stored in the footer of the parquet files. Drill
> reads the metadata information stored in the parquet footer to discover the
> columns with a single value and treats them as partitioning columns.
> >
> > Thanks
> > Mehant
> >
> > On 10/21/15 11:52 AM, Chris Mathews wrote:
> >> Thank Mehant; yes we did look at doing this, but the advantages of
> using the new PARTITION BY feature is that the partitioned columns are
> automatically detected during any subsequent queries.  This is a major
> advantage as our customers are using the Tableau BI tool, and knowing
> details such as the exact partition levels and directories is not an option.
> >>
> >> By the way, having created a table using PARTITION BY and CTAS ,how
> does a query know how to action the pruning ?  Where is this information
> stored for the query to access the tables/files efficiently ?
> >>
> >> Cheers — Chris
> >>
> >>> On 21 Oct 2015, at 19:37, Mehant Baid  wrote:
> >>>
> >>> In addition to the auto partitioning done by CTAS, Drill also supports
> directory based pruning. You could load data into different(nested)
> directories underneath the top level table location and use the 'where'
> clause to get the pruning performance benefits. Following is a typical
> example
> >>>
> >>> Table location: /home/user/table_name
> >>> Within this you could create nested directory structure of the form
> >>> /home/user/table_name/2010/jan
> >>> /home/user/table_name/2010/feb
> >>> ...
> >>> /home/user/table_name/2010/dec
> >>>
> >>> /home/user/table_name/2011/jan
> >>> ...
> >>> /home/user/table_name/2011/dec
> >>>
> >>> Given this directory structure you could have a query that looks like
> >>>
> >>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and
> dir1 = jan;
> >>>
> >>> This would prune out scanning the parquet files under the other
> directories.
> >>>
> >>> Thanks
> >>> Mehant
> >>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>  We have an existing ETL framework processing machine generated data,
> which we are updating to write Parquet files out directly to HDFS using
> AvroParquetWriter for access by Drill.
> 
>  Some questions:
> 
>  How do we take advantage of Drill’s partition pruning capabilities
> with PARTITION BY if we are not using CTAS to load the Parquet files ?
> 
>  It seems there is no way of taking advantage of these features if the
> Parquet files are created externally to CTAS - am I correct ?
> 
>  If I am, then is there any way using a Drill API of programatically
> loading our data into Parquet files and utilise Drill's parallelisation
> techniques using CTAS, or do we have to write the data out to a file and
> then load that file again as input to a CTAS command ?
> 
>  Another potential issue is that we are constantly writing Parquet
> files out to HDFS directories so the data in these files eventually appears
> as additional data in a Drill query - so how can we do this with CTAS ?
> Does CTAS append to an existing directory structure or does it insist on a
> new table name each time it is executed ?
> 
>  What I am getting at here is that there seem to be performance
> enhancement features available to Drill when the Parquet files are created
> using an existing file as input to a CTAS that are not possible otherwise.
> With the volumes of data we are talking about it is not really an option to
> write the files out, form them to then be read back in again for conversion
> using CTAS; which is why we write the Parquet files out directly to HDFS
> and append them to existing directories.
> 
>  Am I missing something obvious here - quite possibly yes ?
> 
>  Thanks fo

Re: How is dir0 inferred from the directory path

2015-10-20 Thread rahul challapalli
Thanks for your replies. For my use case, treating the first variable
directory as dir0 made sense. I just wanted to confirm that this was not an
unintended side-effect.

- Rahul

On Tue, Oct 20, 2015 at 10:34 AM, Jason Altekruse 
wrote:

> I can understand an argument for consistency starting at the root requested
> directory. However, I don't think it isn't crazy to start at the first
> variable directory, because anything before that is providing information
> back to users that they put into the query explicitly themselves.
>
> On Mon, Oct 19, 2015 at 10:42 PM, Jacques Nadeau 
> wrote:
>
> > The first variable directory gets treated as a dirX starting point I
> > believe.
> >
> > Doesn't seem like a bug to me.
> > On Oct 19, 2015 9:56 AM, "rahul challapalli"  >
> > wrote:
> >
> > > Drillers,
> > >
> > > The below result suggests that 'dir0' is inferred treating
> > > '/drill/testdata/audits' as the root in the below query. Is this by
> > design
> > > that the first '*' gets treated as dir0?
> > >
> > > select * from dfs.`/drill/testdata/audits/*/audit/*.json` limit 1;
> > >
> > >
> >
> ++++---+--++--++-+
> > > |  dir0 |  dir1  |timestamp| operation  |
> uid
> > >  |   ipAddress   | columnFamily  | columnQualifier  |tableFid|
> > >
> > >
> >
> +--++--+---+--++--++-+
> > > | node1  | audit  | 2015-06-06 10:41:19.248  | op1  | 0|
> > > 10.10.105.51  | CF1| clq1  | 123
> > >   |
> > >
> > >
> >
> ++++---+--++--++-+
> > >
> > > - Rahul
> > >
> >
>


How is dir0 inferred from the directory path

2015-10-19 Thread rahul challapalli
Drillers,

The below result suggests that 'dir0' is inferred treating
'/drill/testdata/audits' as the root in the below query. Is this by design
that the first '*' gets treated as dir0?

select * from dfs.`/drill/testdata/audits/*/audit/*.json` limit 1;
+--------+--------+--------------------------+------------+------+---------------+---------------+------------------+-----------+
| dir0   | dir1   | timestamp                | operation  | uid  | ipAddress     | columnFamily  | columnQualifier  | tableFid  |
+--------+--------+--------------------------+------------+------+---------------+---------------+------------------+-----------+
| node1  | audit  | 2015-06-06 10:41:19.248  | op1        | 0    | 10.10.105.51  | CF1           | clq1             | 123       |
+--------+--------+--------------------------+------------+------+---------------+---------------+------------------+-----------+

- Rahul


RE: CSV with windows carriage return causes issues

2015-09-30 Thread rahul challapalli
Looks like a bug to me. Can you raise a JIRA for this if you haven't done
so already?
On Sep 30, 2015 8:04 AM,  wrote:

> I've seen that issue too... ;)
>
> My personal opinion is that Drill (and sqlline) should treat Windows
> end-of-line characters the same as Unix end-of-line characters.  It doesn't
> seem reasonable to expect users (especially in an enterprise setting) to
> use dos2unix on every data file just so they can trust their query results.
>
>
> Phil
>
>
> Philip A Grim II
> Chief Engineer
> L-3 Data Tactics
> 7901 Jones Branch Dr.
> Suite 700
> McLean, VA  22102
>
>
> 
> From: Christopher Matta [cma...@mapr.com]
> Sent: Wednesday, September 30, 2015 10:56 AM
> To: user@drill.apache.org
> Subject: CSV with windows carriage return causes issues
>
> I’ve created a very simple reproduction of an issue I’ve observed with
> files that have a carriage return (\r) instead of a line feed (\n) ending.
>
> My CSV file was created using notepad on Windows and looks like this when
> queried directly from drill:
>
> 0: jdbc:drill:zk=local> select * from
> dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
> +---+
> |columns|
> +---+
> | ["1","test1","test2","test3","test4\r"]   |
> | ["2","test5","test6","test7","test8\r"]   |
> | ["3","test9","test10","test11","test12"]  |
> +---+
>
> As you can see the first two rows have \r at the end, also note that
> column[0] has five digits.
>
> When casting into their own columns the a column gets a digit truncated:
>
> 0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a,
> cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32))
> as c, cast(columns[3] as varchar(32)) as d, cast(columns[4] as
> varchar(32)) as e from
> dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
> +++-+-+-+
> |   a|   b|c|d|e|
> +++-+-+-+
>   |  | test1  | test2   | test3   | test4
>   |  | test5  | test6   | test7   | test8
> | 3  | test9  | test10  | test11  | test12  |
> +++-+-+-+
>
> I can get around this by using regexp_replace on the last column:
>
> 0: jdbc:drill:zk=local> select cast(columns[0] as integer) as a,
> cast(columns[1] as varchar(32)) as b, cast(columns[2] as varchar(32))
> as c, cast(columns[3] as varchar(32)) as d,
> cast(regexp_replace(columns[4], '\r', '') as varchar(32)) as e from
> dfs.`Users/cmatta/Downloads/windows_drill_test.csv`;
> +++-+-+-+
> |   a|   b|c|d|e|
> +++-+-+-+
> | 1  | test1  | test2   | test3   | test4   |
> | 2  | test5  | test6   | test7   | test8   |
> | 3  | test9  | test10  | test11  | test12  |
> +++-+-+-+
>
> Is this expected, or should Drill treat carriage returns as line feeds? Is
> this simply sqlline interpreting the \r character?
> Chris mattacma...@mapr.com
> 215-701-3146
> ​
>


Re: :querying avro data stored in Hbase through drill

2015-09-29 Thread rahul challapalli
Once you have serialized your Avro data into HBase, Avro should no longer
come into the picture. Now your table is just a normal HBase table. You can
refer to the below documentation on querying HBase tables:

https://drill.apache.org/docs/querying-hbase/
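
For reference, the query shape would look something like the below (the table
name, column family and qualifier are hypothetical). Drill exposes HBase cells
as VARBINARY, so plain-string cells are typically wrapped in CONVERT_FROM,
while cells holding Avro-encoded bytes simply come back as binary values:

  select convert_from(row_key, 'UTF8') as rk,
         t.`cf1`.`payload` as payload
  from hbase.`my_table` t
  limit 10;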

- Rahul

On Tue, Sep 29, 2015 at 12:14 AM, Amandeep Singh 
wrote:

> Hi,
>
> I need to use sql queries as supported by drill to fetch data from hbase
> which is stored in avro serialized format having predefined schema
> definition.
> Please suggest a way for the same.
>
> Regards,
> Amandeep Singh
>


Re: Making parquet data available to Tableau

2015-09-28 Thread rahul challapalli
There has been discussion around this in the past. But I am not sure if
there is a JIRA open for it. Can you please go ahead and raise a JIRA for
this?

- Rahul

On Mon, Sep 28, 2015 at 9:13 AM, Chris Mathews  wrote:

> Thank you Rahul for confirmation - I thought I was losing the plot for a
> while there.
>
> Are there any plans for Drill to to utilise the metadata from the footer
> of the parquet files, or even the new metadata cache files, or should a
> Jira request be raised for this as it seems a major step towards
> simplification for reporting tools ?
>
> Cheers — Chris
>
> > On 28 Sep 2015, at 16:50, rahul challapalli 
> wrote:
> >
> > Your observation is right. We need to create a view on top of any
> > file/folder for it to be available in Tableau or any reporting tool. This
> > makes sense with text and even json formats as drill does not know the
> data
> > types for the fields until it executes the queries. With parquet however
> > drill could leverage that information from the footers and make it
> > available to reporting tools. But currently it does not do that.
> >
> > With the new "REFRESH TABLE METADATA" feature, we collect all the
> > information from the parquet footers and store it in a cache file. Even
> in
> > this case, drill does not leverage this information to provide metadata
> to
> > reporting tools
> >
> > - Rahul
> >
> > On Mon, Sep 28, 2015 at 6:25 AM, Chris Mathews  wrote:
> >
> >> Hi
> >>
> >> Being new to Drill I am working on a capabilities study to store
> telecoms
> >> probe data as parquet files on an HDFS server, for later analysis and
> >> visualisation using Tableau Desktop/Server with Drill and Zookeeper via
> >> ODBC/JDBC etc.
> >>
> >> We store the parquet files on the HDFS server using an in-house ETL
> >> platform, which amongst other things transforms the massive volumes of
> >> telecoms probe data into millions of parquet files, writing out the
> parquet
> >> files directly to HDFS using AvroParquetWriter. The probe data arrives
> at
> >> regular intervals (5 to 15 minutes; configurable), so for performance
> >> reasons we use this direct AvroParquetWriter approach rather than
> writing
> >> out intermediate files and loading them via the Drill CTAS route.
> >>
> >> There has been some success, together with some frustration. After
> >> extensive experimentation we have come to the conclusion that to access
> >> these parquet files using Tableau we have to configure Drill with
> >> individual views for each parquet schema, and cast the columns to
> specific
> >> data types before Tableau can access the data correctly.
> >>
> >> This is a surprise as I thought Drill would have some way of exporting
> the
> >> schemas to Tableau having defined AVRO schemas for each parquet file,
> and
> >> the parquet files storing the schema as part of the data.  We now find
> we
> >> have to generate schema definitions in AVRO for the AvroParquetWriter
> >> phase, and also a Drill view for each schema to make them visible to
> >> Tableau.
> >>
> >> Also, as part of our experimentation we did create some parquet files
> >> using CTAS. The directory is created and the files contain the data but
> the
> >> tables do not seem to be displayed when we do a SHOW TABLES command.
> >>
> >> Are we correct in our thinking about Tableau requiring views to be
> >> created, or have we missed something obvious here ?
> >>
> >> Will the new REFRESH TABLE METADATA  feature (Drill 1.2
> ?)
> >> help us when it becomes available ?
> >>
> >> Help and suggestions much appreciated.
> >>
> >> Cheers -- Chris
> >>
> >>
>
>


Re: Making parquet data available to Tableau

2015-09-28 Thread rahul challapalli
Your observation is right. We need to create a view on top of any
file/folder for it to be available in Tableau or any other reporting tool.
This makes sense for text and even JSON formats, as Drill does not know the
data types of the fields until it executes the queries. With Parquet,
however, Drill could leverage that information from the footers and make it
available to reporting tools, but currently it does not do that.
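
For illustration, a view along these lines is what typically gets exposed to
Tableau (the path and column names are made up; the casts are what give the
ODBC/JDBC layer concrete column types):

  create view dfs.tmp.`probe_data_v` as
  select cast(t.`probe_id` as integer) as probe_id,
         cast(t.`event_ts` as timestamp) as event_ts,
         cast(t.`cell_name` as varchar(64)) as cell_name
  from dfs.`/data/probes/parquet` t;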

With the new "REFRESH TABLE METADATA" feature, we collect all the
information from the parquet footers and store it in a cache file. Even in
this case, Drill does not leverage this information to provide metadata to
reporting tools.

- Rahul

On Mon, Sep 28, 2015 at 6:25 AM, Chris Mathews  wrote:

> Hi
>
> Being new to Drill I am working on a capabilities study to store telecoms
> probe data as parquet files on an HDFS server, for later analysis and
> visualisation using Tableau Desktop/Server with Drill and Zookeeper via
> ODBC/JDBC etc.
>
> We store the parquet files on the HDFS server using an in-house ETL
> platform, which amongst other things transforms the massive volumes of
> telecoms probe data into millions of parquet files, writing out the parquet
> files directly to HDFS using AvroParquetWriter. The probe data arrives at
> regular intervals (5 to 15 minutes; configurable), so for performance
> reasons we use this direct AvroParquetWriter approach rather than writing
> out intermediate files and loading them via the Drill CTAS route.
>
> There has been some success, together with some frustration. After
> extensive experimentation we have come to the conclusion that to access
> these parquet files using Tableau we have to configure Drill with
> individual views for each parquet schema, and cast the columns to specific
> data types before Tableau can access the data correctly.
>
> This is a surprise as I thought Drill would have some way of exporting the
> schemas to Tableau having defined AVRO schemas for each parquet file, and
> the parquet files storing the schema as part of the data.  We now find we
> have to generate schema definitions in AVRO for the AvroParquetWriter
> phase, and also a Drill view for each schema to make them visible to
> Tableau.
>
> Also, as part of our experimentation we did create some parquet files
> using CTAS. The directory is created and the files contain the data but the
> tables do not seem to be displayed when we do a SHOW TABLES command.
>
> Are we correct in our thinking about Tableau requiring views to be
> created, or have we missed something obvious here ?
>
> Will the new REFRESH TABLE METADATA  feature (Drill 1.2 ?)
> help us when it becomes available ?
>
> Help and suggestions much appreciated.
>
> Cheers -- Chris
>
>


Re: Regarding drill jdbc with big file

2015-08-28 Thread rahul challapalli
Thanks Abhishek for digging that up.

Kunal,

Can you add a comment to the JIRA with your use case as well? One of the
developers is working on a new memory allocator. Once that change gets
merged in, we can re-run the different use cases and verify them.

- Rahul

On Fri, Aug 28, 2015 at 8:29 AM, Abhishek Girish 
wrote:

> This looks similar to DRILL-2882
> <https://issues.apache.org/jira/browse/DRILL-2882>.
>
> On Fri, Aug 28, 2015 at 7:56 AM, Andries Engelbrecht <
> aengelbre...@maprtech.com> wrote:
>
> > I also commented on the JIRA.
> >
> > How much memory is available on the system for Drill?
> >
> > Also see what happens when you increase the planner query memory on the
> > node, as the files are large and will execute in a single thread. Normally
> > it is better to have JSON files in the 128-256 MB size range, depending on
> > the use case, as it will allow for better execution with more threads than
> > a single large file.
> >
> > See what the query memory per node is set at and increase it to see if it
> > resolves your problem.
> > The parameter is planner.memory.max_query_memory_per_node.
> > Query sys.options to see what it is set to and use ALTER SYSTEM to
> > modify it.
> > https://drill.apache.org/docs/configuring-drill-memory/ <
> > https://drill.apache.org/docs/configuring-drill-memory/>
> > https://drill.apache.org/docs/alter-system/ <
> > https://drill.apache.org/docs/alter-system/>
> > https://drill.apache.org/docs/configuration-options-introduction/ <
> > https://drill.apache.org/docs/configuration-options-introduction/>
> >
> > —Andries
> >
> >
> > > On Aug 28, 2015, at 7:00 AM, rahul challapalli <
> > challapallira...@gmail.com> wrote:
> > >
> > > Can you search for the error id in the logs and post the stack trace?
> > >
> > > It looks like an overflow bug to me.
> > >
> > > - Rahul
> > > On Aug 28, 2015 6:47 AM, "Kunal Ghosh"  wrote:
> > >
> > >> Hi,
> > >>
> > >> I am new to Apache Drill. I have configured Apache Drill on a machine
> > >> with CentOS.
> > >>
> > >> "DRILL_MAX_DIRECT_MEMORY" = 25g
> > >> "DRILL_HEAP" = 4g
> > >>
> > >> I have a 600 MB and a 3 GB JSON file [sample file attached]. When I fire
> > >> a query on a relatively small file everything works fine, but when I fire
> > >> the same query against the 600 MB and 3 GB files it gives the following
> > >> error.
> > >>
> > >> Query -
> > >> select tbl5.product_id product_id,tbl5.gender gender,tbl5.item_number
> > >> item_number,tbl5.price price,tbl5.description
> > >> description,tbl5.color_swatch.image image,tbl5.color_swatch.color
> color
> > from
> > >> (select tbl4.product_id product_id,tbl4.gender gender,tbl4.item_number
> > >> item_number,tbl4.price price,tbl4.size.description
> > >> description,FLATTEN(tbl4.size.color_swatch) color_swatch from
> > >> (select tbl3.product_id product_id,tbl3.catalog_item.gender
> > >> gender,tbl3.catalog_item.item_number
> item_number,tbl3.catalog_item.price
> > >> price,FLATTEN(tbl3.catalog_item.size) size from
> > >> (select tbl2.product.product_id as
> > >> product_id,FLATTEN(tbl2.product.catalog_item) as catalog_item from
> > >> (select FLATTEN(tbl1.catalog.product) product from
> dfs.root.`demo.json`
> > >> tbl1) tbl2) tbl3) tbl4) tbl5
> > >>
> > >>
> >
> --
> > >> Error -
> > >>
> > >> SYSTEM ERROR: IllegalArgumentException: initialCapacity: -2147483648
> > >> (expectd: 0+)
> > >>
> > >> Fragment 0:0
> > >>
> > >> [Error Id: 60cf1b95-762d-4a0d-8cae-a2db418d4ea9 on sinhagad:31010]
> > >>
> > >>
> > >>
> >
> --
> > >>
> > >> 1) Am I doing something wrong or missing something (probably because I
> > >> am not using a cluster)?
> > >>
> > >> Please guide me through this.
> > >>
> > >> Thanks & Regards
> > >>
> > >> Kunal Ghosh
> > >>
> >
> >
>
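
For reference, checking and then raising the per-node query memory option
that Andries mentions in the quoted reply above could look like the sketch
below (the 8 GB value is only an example):

-- inspect the current setting
select name, num_val from sys.options
where name = 'planner.memory.max_query_memory_per_node';

-- raise it; 8589934592 bytes (8 GB) is chosen purely for illustration
alter system set `planner.memory.max_query_memory_per_node` = 8589934592;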


Re: Regarding drill jdbc with big file

2015-08-28 Thread rahul challapalli
Can you search for the error id in the logs and post the stack trace?

It looks like an overflow bug to me.

- Rahul
On Aug 28, 2015 6:47 AM, "Kunal Ghosh"  wrote:

> Hi,
>
> I am new to Apache Drill. I have configured Apache Drill on a machine with
> CentOS.
>
> "DRILL_MAX_DIRECT_MEMORY" = 25g
> "DRILL_HEAP" = 4g
>
> I have a 600 MB and a 3 GB JSON file [sample file attached]. When I fire a
> query on a relatively small file everything works fine, but when I fire the
> same query against the 600 MB and 3 GB files it gives the following error.
>
> Query -
> select tbl5.product_id product_id,tbl5.gender gender,tbl5.item_number
> item_number,tbl5.price price,tbl5.description
> description,tbl5.color_swatch.image image,tbl5.color_swatch.color color from
> (select tbl4.product_id product_id,tbl4.gender gender,tbl4.item_number
> item_number,tbl4.price price,tbl4.size.description
> description,FLATTEN(tbl4.size.color_swatch) color_swatch from
> (select tbl3.product_id product_id,tbl3.catalog_item.gender
> gender,tbl3.catalog_item.item_number item_number,tbl3.catalog_item.price
> price,FLATTEN(tbl3.catalog_item.size) size from
> (select tbl2.product.product_id as
> product_id,FLATTEN(tbl2.product.catalog_item) as catalog_item from
> (select FLATTEN(tbl1.catalog.product) product from dfs.root.`demo.json`
> tbl1) tbl2) tbl3) tbl4) tbl5
>
> --
> Error -
>
> SYSTEM ERROR: IllegalArgumentException: initialCapacity: -2147483648
> (expectd: 0+)
>
> Fragment 0:0
>
> [Error Id: 60cf1b95-762d-4a0d-8cae-a2db418d4ea9 on sinhagad:31010]
>
>
> --
>
> 1) Am I doing something wrong or missing something (probably because I am
> not using a cluster)?
>
> Please guide me through this.
>
> Thanks & Regards
>
> Kunal Ghosh
>


No of files created by CTAS auto partition feature

2015-08-26 Thread rahul challapalli
Drillers,

I executed the below query on TPCH SF100 with Drill and it took ~2 hrs to
complete on a 2-node cluster.

alter session set `planner.width.max_per_node` = 4;
alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
create table lineitem partition by (l_shipdate, l_receiptdate) as select *
from dfs.`/drill/testdata/tpch100/lineitem`;

The below query returned 75780, so I expected Drill to create the same number
of files, or maybe a few more. But Drill created so many files that a
"hadoop fs -count" command failed with a "GC overhead limit exceeded" error.
(I did not change the default parquet block size.)

select count(*) from (select l_shipdate, l_receiptdate from
dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate)
sub;
+-+
| EXPR$0  |
+-+
| 75780   |
+-+


Any thoughts on why Drill is creating so many files?

- Rahul


Re: Issue in using drill JDBC jar in Java code for Hive storage

2015-08-14 Thread rahul challapalli
I believe this has nothing to do with JDBC in particular. Your hive storage
plugin configuration seems to be corrupted on your workstation. From the
error message it looks like the embedded Drillbit itself failed to start.

Can you back up "/tmp/drill/sys.storage_plugins/hive.sys.drill" and then
delete that file? From the UI, try re-creating the hive storage plugin again
and see if that changes anything.
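
Once the plugin has been re-created, a quick sanity check over the same JDBC
connection is to re-run the query your code already issues and confirm that
the hive schema shows up again:

show databases;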

- Rahul

On Fri, Aug 14, 2015 at 1:26 AM, Devender Yadav <
devender.ya...@impetus.co.in> wrote:

> Hi,
>
>
>
> I am using drill in embedded mode. I added plugin configuration for hive:
>
> {
>   "type": "hive",
>   "enabled": true,
>   "configProps": {
>     "hive.metastore.uris": "",
>     "javax.jdo.option.ConnectionURL": "jdbc:mysql://localhost:3306/metastore",
>     "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
>     "javax.jdo.option.ConnectionUserName": "root",
>     "javax.jdo.option.ConnectionPassword": "root",
>     "hive.metastore.warehouse.dir": "/user/hive/warehouse",
>     "fs.default.name": "file:///",
>     "hive.metastore.sasl.enabled": "false"
>   }
> }
>
> Useful portion of my code:
>
> Connection conn = new Driver().connect("jdbc:drill:zk=local", null);
> Statement stmt = conn.createStatement();
> ResultSet rs = stmt.executeQuery("show databases");
> while (rs.next()) {
>     String SCHEMA_NAME = rs.getString("SCHEMA_NAME");
>     System.out.println(SCHEMA_NAME);
> }
>
> Exception:
>
> java.sql.SQLException: Failure in starting embedded Drillbit:
> java.lang.RuntimeException: Unable to deserialize
> "/tmp/drill/sys.storage_plugins/hive.sys.drill"
> at
> org.apache.drill.jdbc.impl.DrillConnectionImpl.<init>(DrillConnectionImpl.java:109)
> at
> org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:66)
> at
> org.apache.drill.jdbc.impl.DrillFactory.newConnection(DrillFactory.java:69)
> at
> net.hydromatic.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:126)
> at org.apache.drill.jdbc.Driver.connect(Driver.java:78)
> at java.sql.DriverManager.getConnection(DriverManager.java:571)
> at java.sql.DriverManager.getConnection(DriverManager.java:187)
> at com.mkyong.App.main(App.java:28)
> Caused by: java.lang.RuntimeException: Unable to deserialize
> "/tmp/drill/sys.storage_plugins/hive.sys.drill"
> at
> org.apache.drill.exec.store.sys.local.FilePStore.get(FilePStore.java:140)
> at
> org.apache.drill.exec.store.sys.local.FilePStore$Iter$DeferredEntry.getValue(FilePStore.java:219)
> at
> org.apache.drill.exec.store.StoragePluginRegistry.createPlugins(StoragePluginRegistry.java:168)
> at
> org.apache.drill.exec.store.StoragePluginRegistry.init(StoragePluginRegistry.java:132)
> at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:244)
> at
> org.apache.drill.jdbc.impl.DrillConnectionImpl.<init>(DrillConnectionImpl.java:100)
> ... 7 more
> Caused by: com.fasterxml.jackson.databind.JsonMappingException: Could not
> resolve type id 'hive' into a subtype of [simple type, class
> org.apache.drill.common.logical.StoragePluginConfig]
> at [Source: [B@c64ff75; line: 2, column: 3]
> at
> com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148)
> at
> com.fasterxml.jackson.databind.DeserializationContext.unknownTypeException(DeserializationContext.java:849)
> at
> com.fasterxml.jackson.databind.jsontype.impl.TypeDeserializerBase._findDeserializer(TypeDeserializerBase.java:167)
> at
> com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer._deserializeTypedForId(AsPropertyTypeDeserializer.java:99)
> at
> com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.deserializeTypedFromObject(AsPropertyTypeDeserializer.java:84)
> at
> com.fasterxml.jackson.databind.deser.AbstractDeserializer.deserializeWithType(AbstractDeserializer.java:132)
> at
> com.fasterxml.jackson.databind.deser.impl.TypeWrappedDeserializer.deserialize(TypeWrappedDeserializer.java:41)
> at
> com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:1269)
> at
> com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:912)
> at
> org.apache.drill.exec.store.sys.serialize.JacksonSerializer.deserialize(JacksonSerializer.java:44)
> at
> org.apache.drill.exec.store.sys.local.FilePStore.get(FilePStore.java:138)
> ... 12 more
>
>
> ?
>
>
> Regards,
> Devender
>
> 
>
>
>
>
>
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, warrant and/or guarantee,
> that the integrity of this communication has been maintained nor that the
> communication is free of errors, virus, interception or interference.
>

