Small dataset query issue and the workaround we found

2022-08-25 Thread François Méthot
Hi,

I am looking for an explanation for a workaround I found to an issue
that has been bugging my team for the past few weeks.

Last June we moved from Drill 1.12 to Drill 1.19... a long overdue upgrade!
Not long after, we started getting the issue described below.

We run a query daily on about 410GB of text data spread over ~2200 files;
it has a cost of ~22 billion and is queued as a Large query.
When the same query runs on 200MB spread over 130 files (same data
format), with a cost of ~36 million, also queued as a Large query, it never
completes.

The small dataset query would stop making any progress after a few minutes;
left running for hours, it made no progress and never completed.

The last running fragment is a HASH_PARTITION_SENDER showing 100% fragment
time.

After many shot-in-the-dark debugging sessions, analysing the source data, etc.,
we reviewed our cluster configuration. When we changed
 exec.queue.threshold from 3000 to 6000,
so that our 200MB dataset is categorized as a small query,
the small dataset query started completing consistently in less than 10
seconds.
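(For reference, the change itself is a single system option; applied via ALTER SYSTEM it would look something like the statement below, though it could equally be set through drill-override.conf:)

```sql
-- Raise the queue cost threshold so the ~36M-cost query lands in the small queue.
ALTER SYSTEM SET `exec.queue.threshold` = 6000;
```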

The physical plan is identical whether the query is Large or Small.

Is there a difference internally in Drill's execution depending on whether the
query is queued as small or large?
Would you be able to provide an explanation of why this workaround works?

Cluster detail:
8 drillbits running in Kubernetes
16GB direct memory
4GB heap
data stored on a very efficient NFS, exposed as k8s PV/PVCs to the drillbit pods.

Thanks for any insight you can provide on that setting or regarding our
initial problem.
François


Query compile exception after upgrade to 1.16

2019-08-20 Thread François Méthot
Hi all,

 Drill 1.12 has been serving us well so far; we are now interested in
Kafka ad hoc query support and decided to upgrade to Drill 1.16.

I have a query that involves multiple nested select statements and now
fails after the upgrade.
It uses our custom function regexExtract, which parses each full row of text
into an array of strings using a regex. The function has been rebuilt using
the Drill 1.16 jars.
Here is the relevant part of the query:

create table FinalResult as (
   select NAME, GROUP, STATUS from
(select ... from
(select ... from
   (select regexExtract(rowTextData) from some.files)
)
)
   where STATUS is not null and NAME is not null and GROUP is not null
)


The error I get is:

Error: SYSTEM ERROR: Compile Exception: Line 167, Column 74 "value" is
neither a method, a field nor a member of class
"org.apache.drill.exec.vector.UntypedNullHolder".


I could make the query work by:
  1) generating an intermediate table without the "IS NOT NULL" filters
(though I don't know how reliable this workaround is):

  alter session set `store.format` = 'parquet';
  create table IntermediateResult as(
   select NAME,GROUP,STATUS from
(select ... from
(select ... from
   (select regexExtract(rowTextData) from some.files)
)
)
  ) -- where filter is omitted;



  2) then creating my FinalResult table using the filter on the intermediate
table:

  create table FinalResult as (
 select NAME, GROUP, STATUS from IntermediateResult
 where STATUS is not null and NAME is not null and GROUP is not null
 -- where filter applied on the intermediate result
 )


So what seems to trigger the issue is the filter "where STATUS IS NOT NULL
and NAME IS NOT NULL and GROUP IS NOT NULL".

The error happens in code generated by Drill for that query and is
difficult to debug, so for now I have dedicated my time to finding a workaround.

From the drillbit log I get:

ClassTransformationException: Failure generating transformation classes for
value:

...Java code..
Class FilterGen58166
{
...
}



I am aware that the description of this issue is broad, but if it rings a
bell for anyone regarding a known bug, please let me know. I am trying to
replicate the problem with a simple data set.

Thanks in advance for reading and for any advice.

Francois


Kafka Message Reader With Null Value

2019-07-10 Thread François Méthot
Hi,

  When using Drill (1.15) with Kafka topics containing JSON data, if the
message value is null, the JsonMessageReader is not able to process the row
and stops the query.

  Error: DATA_READ_ERROR: Failure while reading message from Kafka.
RecordReader was at record 1

  null
  Fragment 1:0


When using a KStreams StateStore backed by a changelog topic, it is common
to see changelog topics with null message values: a delete in a StateStore
generates a null-value (tombstone) message in the changelog topic.

Is there any way to deal with Null message value when using Drill Kafka
plugin?

Thanks
Francois
P.S. Drill supports filter pushdown on message offset and message
timestamp; it is a very neat feature!


Work around for JSON type error

2017-11-23 Thread François Méthot
Hi,

Is there a workaround for this Jira issue:

Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you
are using a ValueWriter of type NullableVarCharWriterImpl.

File /tmp/test.json
Record 2
Fragment 0:0

https://issues.apache.org/jira/browse/DRILL-4520


I tried a union with a source file and got the same issue (hoping Drill
would properly set the column type from the beginning).

The only workaround I could find is to force the query to run on one thread
only and hope that the thread will be assigned a file that does not cause this
issue as its first item to scan. It is a very slow solution... (3000+ files
on HDFS)
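(For clarity, forcing a single thread here means something like the session option below, assuming the standard width setting; it caps each query at one minor fragment per node:)

```sql
-- Limit the query to a single scan thread per node, so the first file
-- scanned decides the column type.
ALTER SESSION SET `planner.width.max_per_node` = 1;
```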

The other solution would be to write a MapReduce job to validate and fix
the faulty column.


Any advice is welcome.

Thanks
Francois


Re: log flooded by "date values definitively CORRECT"

2017-10-19 Thread François Méthot
Thanks for your input.

After investigation, we essentially fixed the issue by re-adjusting the
size of the parquet files we generate: bigger files, fewer of them.

It was interesting to explore the extreme limits in terms of the number of files
and how they impact a single foreman at planning time. Good to know.
We will be adjusting the log level as well.

François



On 17 October 2017 at 14:43, Kunal Khatua  wrote:

> Ouch!
>
> Looks like a logger was left behind in DEBUG mode. Can you manually turn
> that off?
>
> More memory would help in this case, because it seems that the foreman
> node is the one running out of heap space as it goes through the metadata
> for all the files. Is there a reason you are generating so many files to
> query? There is most likely a lower threshold for a parquet file size,
> below which you might be better off just using something like a CSV format.
>
>
>
> -----Original Message-----
> From: François Méthot [mailto:fmetho...@gmail.com]
> Sent: Tuesday, October 17, 2017 10:35 AM
> To: dev@drill.apache.org
> Subject: log flooded by "date values definitively CORRECT"
>
> Hi again,
>
>   I am running into an issue with a query over 760 000 parquet files
> stored in HDFS. We are using Drill 1.10, 8GB heap, 20GB direct mem. Drill
> runs with debug logging enabled all the time.
>
> The query is a standard select of 8 fields from hdfs.`/path` where this =
> that
>
>
> For about an hour I see this message on the foreman:
>
> [pool-9-thread-##] DEBUG o.a.d.exec.store.parquet.Metadata - It is
> determined from metadata that the date values are definitely CORRECT
>
> Then
>
> [some UUID:foreman] INFO o.a.d.exec.store.parquet.Metadata - Fetch
> parquet metadata : Executed 761659 out of 761659 using 16 threads. Time :
> 3022416ms
>
> Then :
> java.lang.OutOfMemoryError: Java heap space
>at java.util.Arrays.copyOf
>...
>at java.io.PrintWriter.println(PrintWriter.java:757)
>at org.apache.calcite.rel.externalize.RelWriterImpl.explain
> (RelWriterImpl.java:118)
>at org.apache.calcite.rel.externalize.RelWriterImpl.done
> (RelWriterImpl.java:160)
> ...
>at org.apache.calcite.plan.RelOptUtil.toString (RelOptUtil.java:1927)
>at
> org.apache.drill.exec.planner.sql.handlers.DefaultSQLHandler.log(
> DefaultSQLHandler.java:138)
>...
>at
> org.apache.drill.exec.planner.sql.handlers.CreateTableHandler.getPlan(
> CreateTableHandler.java:102)
>at
> org.apache.drill.exec.planner.DrillSqlWorker.getQueryPlan(
> DrillSqlWorker.java:131)
>...
>at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:1050)
>at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:281)
>
>
>
> I think it might be caused by having too many files to query; chunking our
> select into smaller pieces actually helped.
> I also suspect that the DEBUG logging is taxing the poor node a bit much.
>
> Do you think adding more memory would address the issue (I can't try this
> right now), or do you think it is caused by a bug?
>
>
> Thanks in advance for any advice,
>
> Francois
>


log flooded by "date values definitively CORRECT"

2017-10-17 Thread François Méthot
Hi again,

  I am running into an issue with a query over 760 000 parquet files
stored in HDFS. We are using Drill 1.10, 8GB heap, 20GB direct mem. Drill
runs with debug logging enabled all the time.

The query is a standard select of 8 fields from hdfs.`/path` where this =
that


For about an hour I see this message on the foreman:

[pool-9-thread-##] DEBUG o.a.d.exec.store.parquet.Metadata - It is
determined from metadata that the date values are definitely CORRECT

Then

[some UUID:foreman] INFO o.a.d.exec.store.parquet.Metadata - Fetch parquet
metadata : Executed 761659 out of 761659 using 16 threads. Time : 3022416ms

Then :
java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOf
   ...
   at java.io.PrintWriter.println(PrintWriter.java:757)
   at org.apache.calcite.rel.externalize.RelWriterImpl.explain
(RelWriterImpl.java:118)
   at org.apache.calcite.rel.externalize.RelWriterImpl.done
(RelWriterImpl.java:160)
...
   at org.apache.calcite.plan.RelOptUtil.toString (RelOptUtil.java:1927)
   at
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.log(DefaultSqlHandler.java:138)
   ...
   at
org.apache.drill.exec.planner.sql.handlers.CreateTableHandler.getPlan(CreateTableHandler.java:102)
   at
org.apache.drill.exec.planner.DrillSqlWorker.getQueryPlan(DrillSqlWorker.java:131)
   ...
   at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:1050)
   at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:281)



I think it might be caused by having too many files to query; chunking our
select into smaller pieces actually helped.
I also suspect that the DEBUG logging is taxing the poor node a bit much.

Do you think adding more memory would address the issue (I can't try this
right now), or do you think it is caused by a bug?


Thanks in advance for any advice,

Francois


Re: Parquet Metadata table on Rolling window

2017-10-16 Thread François Méthot
Thanks Padma,

Would we benefit at all from generating metadata on directories that we know
we will never modify?

We would end up with:
/mydata/3/(Metadata generated...)
/mydata/4/(Metadata generated...)
/mydata/.../(Metadata generated...)
/mydata/109/(Metadata generated...)
/mydata/110/(Current partition : Metadata NOT generated yet...)

When users query /mydata, would Drill take advantage of the metadata
available in each subfolder?

Francois


Parquet Metadata table on Rolling window

2017-10-05 Thread François Méthot
Hi,

  I have been using Drill for more than a year now; we are running 1.10.

My queries can spend from 5 to 10 minutes on planning because I am dealing
with lots of files in HDFS (and then 5 to 60 minutes on execution).

I maintain a rolling window of data partitioned by the epoch seconds
rounded to the hour.
/mydata/3/   -> Next partition to be deleted (nightly check)
/mydata/4/
/mydata/.../
/mydata/109/
/mydata/110/ -> current hour, this is where new parquet files are added

I am considering using REFRESH TABLE METADATA.
Is it beneficial at all in a situation where new files are added
constantly (but only to the latest partition; older partitions are set in
stone)?
Will Drill detect that new files are added to the latest partition (110)?
Will it trigger a metadata refresh on the whole directory, or just on
/mydata/110?
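(For reference, the command under consideration would be run against the table root, e.g.:)

```sql
-- Build (or rebuild) the parquet metadata cache files under /mydata.
REFRESH TABLE METADATA dfs.`/mydata`;
```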


Thanks for your help
François


Re: Dedupping json records based on nested value

2017-08-31 Thread François Méthot
I managed to implement a single UDF that returns a copy of a MapHolder input
var; it allowed me to figure out how to use a SingleMapReaderImpl input and a
ComplexWriter output.

I tried to move that approach into an aggregation function that looks like
the snippet below.
I want to return the first MapHolder value encountered in a group-by
operation.

select firstvalue(tb1.field1), firstvalue(tb1.field2),
firstvalue(tb1.field3), firstvalue(tb1.field4) from dfs.`doc.json` tb1
group by tb1.field4.key_data

I get:
Error: SYSTEM ERROR: UnsupportedOperationException: Unable to get new
vector for minor type [LATE] and mode [REQUIRED]

I am not sure if we can use the "out" variable within the add() method.
Any hint from the experts to put me on track would be appreciated.

Thanks
Francois


@FunctionTemplate(name = "firstvalue", scope =
FunctionTemplate.FunctionScope.POINT_AGGREGATE)
public static class FirstValueComplex implements DrillAggFunc
{
  @Param
  MapHolder map;

  @Workspace
  BitHolder firstSeen;

  @Output
  ComplexWriter out;

  @Override
  public void setup()
  {
    firstSeen.value = 0;
  }

  @Override
  public void add()
  {
    if (firstSeen.value == 0)
    {
      // Cast the holder to its concrete reader type; using a new local name
      // avoids the illegal self-referencing declaration in my first attempt.
      org.apache.drill.exec.vector.complex.impl.SingleMapReaderImpl reader =
          (org.apache.drill.exec.vector.complex.impl.SingleMapReaderImpl) (Object) map;
      reader.copyAsValue(out.rootAsMap());
      firstSeen.value = 1;
    }
  }

  @Override
  public void output()
  {
  }

  @Override
  public void reset()
  {
    out.clear();
    firstSeen.value = 0;
  }
}

On 30 August 2017 at 16:57, François Méthot  wrote:

>
> Hi,
>
> Congrats on the 1.11 release; we are happy to have our suggestion
> implemented in the new release (automatic HDFS block size for parquet
> files).
>
> It seems like we are pushing the limits of Drill with a new type of query... (I
> am learning new SQL tricks in the process)
>
> We are trying to aggregate a json document based on a nested value.
>
> Document looks like this:
>
> {
>  "field1" : {
>  "f1_a" : "infoa",
>  "f1_b" : "infob"
>   },
>  "field2" : "very long string",
>  "field3" : {
>  "f3_a" : "infoc",
>  "f3_b" : "infod",
>  "f4_c" : {
>   
>   }
>   },
>   "field4" : {
>  "key_data" : "String to aggregate on",
>  "f4_b" : "a string2",
>  "f4_c" : {
>    complex structure...
>   }
>   }
> }
>
>
> We want the first, or last (or any), occurrence of field1, field2, field3 and
> field4, grouped by field4.key_data.
>
>
> Unfortunately the min and max functions do not support json complex columns
> (MapHolder). Therefore group-by type queries do not work.
>
> We tried a window function like this
> create table  as (
>   select first_value(tb1.field1) over (partition by tb1.field4.key_data)
> as field1,
>first_value(tb1.field2) over (partition by tb1.field4.key_data) as
> field2,
>first_value(tb1.field3) over (partition by tb1.field4.key_data) as
> field3,
>first_value(tb1.field4) over (partition by tb1.field4.key_data) as
> field4
> from dfs.`doc.json` tb1
> )
>
> We get IndexOutOfBoundsException.
>
> We got better success with:
> create table  as (
>  select * from
>   (select tb1.*,
>   row_number() over (partition by tb1.field4.key_data) as row_num
>from  dfs.`doc.json` tb1
>   ) t
>  where t.row_num = 1
> )
>
> This works on a single json file, or with multiple files in a session
> configured with planner.width.max_per_node=1.
>
> As soon as we use more than 1 thread per query, we get
> IndexOutOfBoundsException.
> This was tried on 1.10 and 1.11.
> It looks like a bug.
>
>
> Would you have other suggestions to bypass this issue?
> Is there an existing aggregation function (to work with group by) that
> would return the first, last, or a random MapHolder column from a json document?
> If not, I am thinking of implementing one; would there be an example of
> how to clone a MapHolder within a function? (pretty sure I can't assign the
> "in" param to the output within a function)
>
>
> Thank you for your time reading this.
> Any suggestions to try are welcome.
>
> Francois
>


Dedupping json records based on nested value

2017-08-30 Thread François Méthot
Hi,

Congrats on the 1.11 release; we are happy to have our suggestion
implemented in the new release (automatic HDFS block size for parquet
files).

It seems like we are pushing the limits of Drill with a new type of query... (I am
learning new SQL tricks in the process)

We are trying to aggregate a json document based on a nested value.

Document looks like this:

{
 "field1" : {
 "f1_a" : "infoa",
 "f1_b" : "infob"
  },
 "field2" : "very long string",
 "field3" : {
 "f3_a" : "infoc",
 "f3_b" : "infod",
 "f4_c" : {
  
  }
  },
  "field4" : {
 "key_data" : "String to aggregate on",
 "f4_b" : "a string2",
 "f4_c" : {
   complex structure...
  }
  }
}


We want the first, or last (or any), occurrence of field1, field2, field3 and
field4, grouped by field4.key_data.


Unfortunately the min and max functions do not support json complex columns
(MapHolder). Therefore group-by type queries do not work.

We tried a window function like this:
create table  as (
  select first_value(tb1.field1) over (partition by tb1.field4.key_data) as
field1,
   first_value(tb1.field2) over (partition by tb1.field4.key_data) as
field2,
   first_value(tb1.field3) over (partition by tb1.field4.key_data) as
field3,
   first_value(tb1.field4) over (partition by tb1.field4.key_data) as
field4
  from dfs.`doc.json` tb1
)

We get IndexOutOfBoundsException.

We got better success with:
create table  as (
 select * from
  (select tb1.*,
  row_number() over (partition by tb1.field4.key_data) as row_num
   from  dfs.`doc.json` tb1
  ) t
 where t.row_num = 1
)

This works on a single json file, or with multiple files in a session
configured with planner.width.max_per_node=1.

As soon as we use more than 1 thread per query, we get
IndexOutOfBoundsException.
This was tried on 1.10 and 1.11.
It looks like a bug.


Would you have other suggestions to bypass this issue?
Is there an existing aggregation function (to work with group by) that
would return the first, last, or a random MapHolder column from a json document?
If not, I am thinking of implementing one; would there be an example of how
to clone a MapHolder within a function? (pretty sure I can't assign the "in"
param to the output within a function)


Thank you for your time reading this.
Any suggestions to try are welcome.

Francois


Re: Parquet files size

2017-06-30 Thread François Méthot
Thanks for your opinion. I will look into
reducing planner.width.max_per_node for now, and try raising it back up when
smaller parquet files get rolled out.

On Thu, Jun 29, 2017 at 11:21 AM, Andries Engelbrecht  wrote:

> With limited memory and what seems to be higher concurrency you may want
> to reduce the minor fragments (threads) per query per node.
> See if you can reduce planner.width.max_per_node on the cluster and not
> have too much impact on the response times.
>
> Slightly smaller (512MB) parquet files may potentially also help, but that
> is usually harder to restructure the data than system settings.
>
> --Andries
>
>
>
> On 6/29/17, 7:39 AM, "François Méthot"  wrote:
>
> Hi,
>
>   I am investigating an issue where we started getting Out of Heap
> space errors when querying parquet files in Drill 1.10. It is currently set
> to 8GB heap and 20GB off-heap. We can't spare more.
>
> We usually query 0.7 to 1.2 GB parquet files; recently we have been
> more on the 1.2GB side, for the same number of files.
>
> It now fails on a simple
> select of a bunch of fields with needle-in-a-haystack type
> params.
>
>
> Drill is configured with the old reader
> (store.parquet.use_new_reader=false)
> because of bug DRILL-5435 (limit causes a memory leak).
>
> I have temporarily set the max number of large queries to 2 instead of
> 10;
> it has helped so far.
>
> My questions:
> Could parquet file size be related to these new exceptions?
> Would reducing the max file size help improve the robustness of queries in
> Drill
> (at the expense of having more files to scan)?
>
> Thanks
> Francois
>
>
>


Parquet files size

2017-06-29 Thread François Méthot
Hi,

  I am investigating an issue where we started getting Out of Heap space
errors when querying parquet files in Drill 1.10. It is currently set to 8GB
heap and 20GB off-heap. We can't spare more.

We usually query 0.7 to 1.2 GB parquet files; recently we have been more on
the 1.2GB side, for the same number of files.

It now fails on a simple
   select of a bunch of fields with needle-in-a-haystack type params.


Drill is configured with the old reader
(store.parquet.use_new_reader=false)
because of bug DRILL-5435 (limit causes a memory leak).

I have temporarily set the max number of large queries to 2 instead of 10;
it has helped so far.
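(Presumably this goes through the standard queue options; assuming the usual option name, the change would look something like:)

```sql
-- Allow at most 2 concurrent "large" queries instead of the default 10.
ALTER SYSTEM SET `exec.queue.large` = 2;
```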

My questions:
Could parquet file size be related to these new exceptions?
Would reducing the max file size help improve the robustness of queries in Drill
(at the expense of having more files to scan)?

Thanks
Francois


Re: Jars for BaseTestQuery

2017-04-26 Thread François Méthot
Hi Paul,

  Thanks for your detailed comments. I am keeping your email as a reference.

We tend to favor system-level testing first because it is easier to
maintain while refactoring code. After trying the system-level tests,
we tried a pure JUnit test approach, and we realized that some of the
builders/constructors/factories for objects (DrillBuf, BufferLedger) that are
required by some functions are package-protected.

I will revisit our test revamp based on your comments once 1.11 is released
(DRILL-5323 and DRILL-5318).


François





On Thu, Apr 20, 2017 at 7:11 PM, Paul Rogers  wrote:

> Hi François,
>
> You raised two issues, I’ll address both.
>
> First, it is true that Maven’s model is that test code is not packaged, it
> is visible only to the maven module in which the test code resides. As you
> point out, this is an inconvenience in multiple-module projects such as
> Drill. Drill gets around the problem by minimizing unit testing; most
> testing outside of java-exec is done via system tests: running all of Drill
> and throwing queries at it.
>
> Drill, at present, has no support for reusing tests outside of their home
> module. It would be great if someone volunteers to solve the problem. Here
> are two references: [1], [2]
>
> Second, you mentioned you want to unit test a storage plugin. Here, it is
> necessary to understand how Drill’s usage of the term “unit test" differs
> from common industry usage. In the industry, a “unit test” would be one
> where you test your reader in isolation. Specially, give it an operator
> definition (the so-called “physical operator” or “sub scan POP” in Drill.)
> You’d then grab data and verify that the returned data batches are correct.
>
> Similarly, for the planning side of the plugin, you’d let Drill plan the
> query, then verify that the plan JSON is as you expect it to be.
>
> Drill, however, uses “unit test” to mean a system-level test written using
> JUnit. That is, most Drill tests run a query and examine the results. The
> BaseTestQuery class you mentioned is a JUnit test, but it is a system level
> test: it starts up an embedded Drillbit to which you can send queries. It
> has helper classes that let you examine results o the entire query (not
> just of your reader.) If you construct the correct SQL, your query can
> include nothing but a scan and the screen operator. Still, this approach
> introduces many layers between your test and your reader. (I call it trying
> to fix a watch while wearing oven mitts.)
>
> There are two recent additions to Drill’s test tools that may be of
> interest. First, we have a simpler way to run system tests based on a
> “cluster test fixture”. BaseTestQuery provides very poor control over
> boot-time configuration, but the test fixture gives you much better
> control. Plus, the new fixture lets you reuse the “TestBuilder” classes
> from BestTestQuery while also providing very easy ways to run queries, time
> results and so on. Check out the package-info in [3] and the example test
> in [4]. Unfortunately, this code has the same Maven packaging issues as
> described above.
>
> Of course, even the simplified test fixture is still a system test. We are
> in the process of checking in a new set of “sub-operator” unit test
> fixtures that enable true unit tests: you test only your code. See
> DRILL-5323 and DRILL-5318. Those PRs will be followed by a complete set of
> tests for the sort operator. I can point you to my personal dev branch if
> you want a preview.
>
> With these tools, you can set up to run just your own reader, then set up
> expected results and validate that things work as expected. Unit tests let
> you verify behavior at a very fine grain: verify each kind of column data
> type, verify filters you wish to push and so on. This is important because
> Drill suffers from a very large number of minor bugs: bugs that are hard to
> find using system tests, but which become obvious when using true unit
> tests.
>
> The in-flight version of the test framework was built for an “internal”
> operator (the sort.) Some work will be required to extend the tests to work
> with a reader (and to refactor the reader so it does not depend on a
> running Drillbit.) This is a worthwhile effort that I can help with if you
> want to go this route.
>
> Thanks,
>
> - Paul
>
> [1] http://stackoverflow.com/questions/14722873/sharing-
> src-test-classes-between-modules-in-a-multi-module-maven-project
> [2] http://maven.apache.org/guides/mini/guide-attached-tests.html
> [3] https://github.com/apache/drill/blob/master/exec/java-
> exec/src/test/java/org/apache/drill/test/package-info.java
> [4] https://github.com/apache/drill/blob/master/exec/java-
> exec/src/test/java/org/apache/drill/test/ExampleT

Jars for BaseTestQuery

2017-04-20 Thread François Méthot
Hi,

   I need to develop unit tests for our storage plugins and, if possible, I
would like to borrow from the tests done in "TestCsvHeader.java" and other
classes in that package.

Those tests depend on the BaseTestQuery, DrillTest and ExecTest classes, which
are not packaged in the Drill release (please correct me if I am wrong).

Are those jars shared somewhere for storage plugin developers who rely on
the pre-built jars?

Thanks
Francois


Re: Memory was Leaked error when using "limit" in 1.10

2017-04-13 Thread François Méthot
Yes it did; the problem is gone. Thanks.

I will share the details I have in a Jira ticket now.



On Tue, Apr 11, 2017 at 9:22 PM, Kunal Khatua  wrote:

> Did this help resolve the memory leak, Francois?
>
>
> Could you share the stack trace and other relevant logs on a JIRA?
>
>
> Thanks
>
> Kunal
>
>
>
>
> 
> From: Kunal Khatua 
> Sent: Wednesday, April 5, 2017 2:03:19 PM
> To: dev@drill.apache.org
> Subject: Re: Memory was Leaked error when using "limit" in 1.10
>
> Hi Francois
>
> Could you try those queries with the AsyncPageReader turned off?
>
> alter session set `store.parquet.reader.pagereader.async` = false;
>
> For Drill 1.9+, this feature is enabled by default. However, there were some
> perf-related improvements that Drill 1.10 carried out.
>
> If the problem goes away, could you file a JIRA and share the sample query
> and data to allow us a repro ?
>
> Thanks
>
> Kunal
>
> 
> From: François Méthot 
> Sent: Wednesday, April 5, 2017 1:39:38 PM
> To: dev@drill.apache.org
> Subject: Memory was Leaked error when using "limit" in 1.10
>
> Hi,
>
>   I am still investigating this problem, but I will describe the symptoms to
> you in case there is a known issue with Drill 1.10.
>
>   We migrated our production system from Drill 1.9 to 1.10 just 5 days ago
> (220-node cluster).
>
> Our logs show some 900+ queries ran without problem in the first 4
> days (similar queries that never used the `limit` clause).
>
> Yesterday we started doing simple ad hoc select * ... limit 10 queries (like
> we often do; that was our first use of limit with 1.10)
> and we got the `Memory was leaked` exception below.
>
> Also, once we get the error, most subsequent user queries fail with a
> Channel Closed Exception. We need to restart Drill to bring it back to
> normal.
>
> A day later, I ran a similar select * limit 10 query, and the same thing
> happened; we had to restart Drill.
>
> The exception referred to a file (1_0_0.parquet).
> I moved that file to a smaller test cluster (12 nodes) and got the error on
> the first attempt, but I am no longer able to reproduce the issue on that
> file. Between the 12- and 220-node clusters, a different column name and row
> group start were listed in the error.
> The parquet file was generated by Drill 1.10.
>
> I tried the same file with a local drill-embedded 1.9 and 1.10 and had no
> issue.
>
>
> Here is the error (manually typed); if you think of anything obvious, let
> us know.
>
>
> AsyncPageReader - User Error Occurred: Exception occurred while reading from
> disk (can not read class o.a.parquet.format.PageHeader:
> java.io.IOException: input stream is closed.)
>
> File:/1_0_0.parquet
> Column: StringColXYZ
> Row Group Start: 115215476
>
> [Error Id: ]
>   at UserException.java:544)
>   at
> o.a.d.exec.store.parquet.columnreaders.AsyncPageReader.
> handleAndThrowException(AsyncPageReader.java:199)
>   at
> o.a.d.exec.store.parquet.columnreaders.AsyncPageReader.
> access(AsyncPageReader.java:81)
>   at
> o.a.d.exec.store.parquet.columnreaders.AsyncPageReader.
> AsyncPageReaderTask.call(AsyncPageReader.java:483)
>   at
> o.a.d.exec.store.parquet.columnreaders.AsyncPageReader.
> AsyncPageReaderTask.call(AsyncPageReader.java:392)
> ...
> Caused by: java.io.IOException: can not read class
> org.apache.parquet.format.PageHeader: java.io.IOException: Input Stream is
> closed.
>at o.a.parquet.format.Util.read(Util.java:216)
>at o.a.parquet.format.Util.readPageHeader(Util.java:65)
>at
> o.a.drill.exec.store.parquet.columnreaders.AsyncPageReader(
> AsyncPageReaderTask:430)
> Caused by: parquet.org.apache.thrift.transport.TTransportException: Input
> stream is closed
>at ...read(TIOStreamTransport.java:129)
>at TTransport.readAll(TTransport.java:84)
>at TCompactProtocol.readByte(TCompactProtocol.java:474)
>at TCompactProtocol.readFieldBegin(TCompactProtocol.java:481)
>at InterningProtocol.readFieldBegin(InterningProtocol.java:158)
>at o.a.parquet.format.PageHeader.read(PageHeader.java:828)
>at o.a.parquet.format.Util.read(Util.java:213)
>
>
> Fragment 0:0
> [Error id: ...]
> o.a.drill.common.exception.UserException: SYSTEM ERROR:
> IllegalStateException: Memory was leaked by query. Memory leaked: (524288)
> Allocator(op:0:0:4:ParquetRowGroupScan) 100/524288/39919616/
> 100
>   at o.a.d.common.excepti

Re: [jira] [Created] (DRILL-5432) Want a memory format for PCAP files

2017-04-13 Thread François Méthot
Hi Ted,

  We did a proof of concept reading pcap from Drill. Our approach was
to avoid writing yet another pcap decoder, so we tried to adapt Drill to use
an existing one. We took TShark as an example; it already comes with 1000s
of dissectors.

We approached the problem from a different angle: How to drive and read the
output of an external application from a SQL query within Drill.

Our experiment started with the text input storage plugin from Drill, which we
modified slightly to be a .pcap plugin.

When a Drill query is run on a pcap file, the plugin's RecordReader setup
function launches the TShark external app for each file that Drill needs to
scan.
The columns specified in the select statement are passed as input
parameters to the external application.

In the RecordReader's next() method, it reads each record streamed back by
TShark. The output stream of the process is parsed by a slightly modified
TextInput. Once the data is streamed into the Drill space, the user can leverage
the SQL language to do all kinds of data aggregation.

For this technique to work, the external application needs to support
Streaming in and out data.

To run on HDFS with a native application that has not been built for HDFS,
the storage plugin launches: "hdfs cat test.pcap | tshark "

For this to work, TShark needs to be deployed everywhere a drill bit is
running.
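The core of that launch-and-stream approach can be sketched with plain Java. This is a minimal sketch, not the plugin code itself: ProcessBuilder stands in for the RecordReader setup, and the `echo` command stands in for the real `hdfs cat ... | tshark` pipeline so the sketch runs anywhere.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;
import java.util.stream.Collectors;

public class StreamScanSketch {
    // Launch an external process and stream its stdout line by line, the
    // way the experimental pcap plugin drove TShark from its RecordReader.
    static List<String> streamLines(List<String> command) throws Exception {
        Process p = new ProcessBuilder(command)
                .redirectErrorStream(true) // fold stderr into the stream we read
                .start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            List<String> lines = r.lines().collect(Collectors.toList());
            p.waitFor();
            return lines;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in command; the plugin would run something like
        // ["bash", "-c", "hdfs cat test.pcap | tshark ..."] instead.
        List<String> out = streamLines(List.of("echo", "ip.src,ip.dst,tcp.port"));
        System.out.println(out.get(0)); // prints "ip.src,ip.dst,tcp.port"
    }
}
```

Each line read this way plays the role of one record handed back from next().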

I don't have any metrics on performance; this was a proof of concept, but
it works. It will probably not beat the performance of the solution you are
aiming for, but it leverages years of development of an existing tool.


Francois



On Wed, Apr 12, 2017 at 2:25 PM, Ted Dunning (JIRA)  wrote:

> Ted Dunning created DRILL-5432:
> --
>
>  Summary: Want a memory format for PCAP files
>  Key: DRILL-5432
>  URL: https://issues.apache.org/jira/browse/DRILL-5432
>  Project: Apache Drill
>   Issue Type: New Feature
> Reporter: Ted Dunning
>
>
> PCAP files [1] are the de facto standard for storing network capture data.
> In security and protocol applications, it is very common to want to extract
> particular packets from a capture for further analysis.
>
> At a first level, it is desirable to query and filter by source and
> destination IP and port or by protocol. Beyond that, however, it would be
> very useful to be able to group packets by TCP session and eventually to
> look at packet contents. For now, however, the most critical requirement is
> that we should be able to scan captures at very high speed.
>
> I previously wrote a (kind of working) proof of concept for a PCAP decoder
> that did lazy deserialization and could traverse hundreds of MB of PCAP
> data per second per core. This compares to roughly 2-3 MB/s for widely
> available Apache-compatible open source PCAP decoders.
>
> This JIRA covers the integration and extension of that proof of concept as
> a Drill file format.
>
> Initial work is available at https://github.com/mapr-demos/pcap-query
>
>
> [1] https://en.wikipedia.org/wiki/Pcap
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.15#6346)
>


Memory was Leaked error when using "limit" in 1.10

2017-04-05 Thread François Méthot
Hi,

  I am still investigating this problem, but I will describe the symptom to
you in case there is a known issue with Drill 1.10.

  We migrated our production system from Drill 1.9 to 1.10 just 5 days ago.
(220 nodes cluster)

Our logs show some 900+ queries ran without problem in the first 4
days (similar queries, none of which use the `limit` clause).

Yesterday we started doing simple ad hoc select * ... limit 10 queries (like
we often do; that was our first use of limit with 1.10)
and we got the `Memory was leaked` exception below.

Also, once we get the error, most subsequent user queries fail with a
Channel Closed Exception. We need to restart Drill to bring it back to
normal.

A day later, I ran a similar select * limit 10 query, and the same thing
happened; we had to restart Drill.

The exception referred to a file (1_0_0.parquet).
I moved that file to a smaller test cluster (12 nodes) and got the error on
the first attempt, but I am no longer able to reproduce the issue on that
file. Between the 12 and 220 node clusters, a different column name and Row
Group Start were listed in the error.

I tried the same file with a local drill-embedded 1.9 and 1.10 and had no
issue.


Here is the error (manually typed), if you think of anything obvious, let
us know.


AsyncPageReader - User Error Occured: Exception Occurred while reading from
disk (can not read class o.a.parquet.format.PageHeader:
java.io.IOException: input stream is closed.)

File:/1_0_0.parquet
Column: StringColXYZ
Row Group Start: 115215476

[Error Id: ]
  at UserException.java:544)
  at
o.a.d.exec.store.parquet.columnreaders.AsyncPageReader.handleAndThrowException(AsyncPageReader.java:199)
  at
o.a.d.exec.store.parquet.columnreaders.AsyncPageReader.access(AsyncPageReader.java:81)
  at
o.a.d.exec.store.parquet.columnreaders.AsyncPageReader.AsyncPageReaderTask.call(AsyncPageReader.java:483)
  at
o.a.d.exec.store.parquet.columnreaders.AsyncPageReader.AsyncPageReaderTask.call(AsyncPageReader.java:392)
  at
o.a.d.exec.store.parquet.columnreaders.AsyncPageReader.AsyncPageReaderTask.call(AsyncPageReader.java:392)
...
Caused by: java.io.IOException: can not read class
org.apache.parquet.format.PageHeader: java.io.IOException: Input Stream is
closed.
   at o.a.parquet.format.Util.read(Util.java:216)
   at o.a.parquet.format.Util.readPageHeader(Util.java:65)
   at
o.a.drill.exec.store.parquet.columnreaders.AsyncPageReader(AsyncPageReaderTask:430)
Caused by: parquet.org.apache.thrift.transport.TTransportException: Input
stream is closed
   at ...read(TIOStreamTransport.java:129)
   at TTransport.readAll(TTransport.java:84)
   at TCompactProtocol.readByte(TCompactProtocol.java:474)
   at TCompactProtocol.readFieldBegin(TCompactProtocol.java:481)
   at InterningProtocol.readFieldBegin(InterningProtocol.java:158)
   at o.a.parquet.format.PageHeader.read(PageHeader.java:828)
   at o.a.parquet.format.Util.read(Util.java:213)


Fragment 0:0
[Error id: ...]
o.a.drill.common.exception.UserException: SYSTEM ERROR:
IllegalStateException: Memory was leaked by query. Memory leaked: (524288)
Allocator(op:0:0:4:ParquetRowGroupScan) 100/524288/39919616/100
  at o.a.d.common.exceptions.UserException (UserException.java:544)
  at
o.a.d.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:293)
  at o.a.d.exec.work.fragment.FragmentExecutor.cleanup(
FragmentExecutor.java:160)
  at
o.a.d.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:262)
...
Caused by: IllegalStateException: Memory was leaked by query. Memory
leaked: (524288)
  at o.a.d.exec.memory.BaseAllocator.close(BaseAllocator.java:502)
  at o.a.d.exec.ops.OperatorContextImpl(OperatorContextImpl.java:149)
  at
o.a.d.exec.ops.FragmentContext.suppressingClose(FragmentContext.java:422)
  at o.a.d.exec.ops.FragmentContext.close(FragmentContext.java:411)
  at
o.a.d.exec.work.fragment.FragmentExecutor.closeOutResources(FragmentExecutor.java:318)
  at
o.a.d.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:155)



Francois


Re: Single Hdfs block per parquet file

2017-03-24 Thread François Méthot
Done,
Thanks for the feedback

https://issues.apache.org/jira/browse/DRILL-5379


On Thu, Mar 23, 2017 at 4:29 PM, Kunal Khatua  wrote:

> This seems like a reasonable feature request. It could also be expanded to
> detect the underlying block size for the location being written to.
>
>
> Could you file a JIRA for this?
>
>
> Thanks
>
> Kunal
>
> ____
> From: François Méthot 
> Sent: Thursday, March 23, 2017 9:08:51 AM
> To: dev@drill.apache.org
> Subject: Re: Single Hdfs block per parquet file
>
> After further investigation, Drill uses the hadoop ParquetFileWriter (
> https://github.com/Parquet/parquet-mr/blob/master/
> parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java
> ).
> This is where the file creation occurs so it might be tricky after all.
>
> However ParquetRecordWriter.java (
> https://github.com/apache/drill/blob/master/exec/java-
> exec/src/main/java/org/apache/drill/exec/store/parquet/
> ParquetRecordWriter.java)
> in Drill creates the ParquetFileWriter with an hadoop configuration object.
>
> However something to explore: Could the block size be set as a property
> within the Configuration object before passing it to ParquetFileWriter
> constructor?
>
> François
>
> On Wed, Mar 22, 2017 at 11:55 PM, Padma Penumarthy 
> wrote:
>
> > Yes, seems like it is possible to create files with different block
> sizes.
> > We could potentially pass the configured store.parquet.block-size to the
> > create call.
> > I will try it out and see. will let you know.
> >
> > Thanks,
> > Padma
> >
> >
> > > On Mar 22, 2017, at 4:16 PM, François Méthot 
> > wrote:
> > >
> > > Here are 2 links I could find:
> > >
> > > http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/
> > apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.
> > fs.Path,%20boolean,%20int,%20short,%20long)
> > >
> > > http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/
> > apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.
> > fs.Path,%20boolean,%20int,%20short,%20long)
> > >
> > > Francois
> > >
> > > On Wed, Mar 22, 2017 at 4:29 PM, Padma Penumarthy <
> ppenumar...@mapr.com>
> > > wrote:
> > >
> > >> I think we create one file for each parquet block.
> > >> If underlying HDFS block size is 128 MB and parquet block size  is  >
> > >> 128MB,
> > >> it will create more blocks on HDFS.
> > >> Can you let me know what is the HDFS API that would allow you to
> > >> do otherwise ?
> > >>
> > >> Thanks,
> > >> Padma
> > >>
> > >>
> > >>> On Mar 22, 2017, at 11:54 AM, François Méthot 
> > >> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> Is there a way to force Drill to store CTAS generated parquet file
> as a
> > >>> single block when using HDFS? Java HDFS API allows to do that, files
> > >> could
> > >>> be created with the Parquet block-size.
> > >>>
> > >>> We are using Drill on hdfs configured with block size of 128MB.
> > Changing
> > >>> this size is not an option at this point.
> > >>>
> > >>> It would be ideal for us to have single parquet file per hdfs block,
> > >> setting
> > >>> store.parquet.block-size to 128MB would fix our issue but we end up
> > with
> > >> a
> > >>> lot more files to deal with.
> > >>>
> > >>> Thanks
> > >>> Francois
> > >>
> > >>
> >
> >
>


Re: Single Hdfs block per parquet file

2017-03-23 Thread François Méthot
After further investigation, Drill uses the hadoop ParquetFileWriter (
https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java
).
This is where the file creation occurs so it might be tricky after all.

However, ParquetRecordWriter.java (
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java)
in Drill creates the ParquetFileWriter with a Hadoop Configuration object.

So, something to explore: could the block size be set as a property
on that Configuration object before passing it to the ParquetFileWriter
constructor?
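To make the idea concrete, here is a minimal stdlib-only sketch. java.util.Properties stands in for org.apache.hadoop.conf.Configuration (which would need hadoop-common on the classpath); `dfs.blocksize` is the standard HDFS per-file block-size setting, though whether ParquetFileWriter honours it when set this way is exactly what would need to be verified.

```java
import java.util.Properties;

public class BlockSizeSketch {
    // Sketch: set the HDFS block size on the configuration object before
    // handing it to the writer, so each CTAS parquet file would land in a
    // single HDFS block. Properties stands in for Hadoop's Configuration.
    static Properties withBlockSize(Properties conf, long parquetBlockBytes) {
        conf.setProperty("dfs.blocksize", Long.toString(parquetBlockBytes));
        return conf;
    }

    public static void main(String[] args) {
        // Match the HDFS block size to store.parquet.block-size (e.g. 512MB)
        Properties conf = withBlockSize(new Properties(), 512L * 1024 * 1024);
        System.out.println(conf.getProperty("dfs.blocksize")); // prints "536870912"
    }
}
```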

François

On Wed, Mar 22, 2017 at 11:55 PM, Padma Penumarthy 
wrote:

> Yes, seems like it is possible to create files with different block sizes.
> We could potentially pass the configured store.parquet.block-size to the
> create call.
> I will try it out and see. will let you know.
>
> Thanks,
> Padma
>
>
> > On Mar 22, 2017, at 4:16 PM, François Méthot 
> wrote:
> >
> > Here are 2 links I could find:
> >
> > http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/
> apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.
> fs.Path,%20boolean,%20int,%20short,%20long)
> >
> > http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/
> apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.
> fs.Path,%20boolean,%20int,%20short,%20long)
> >
> > Francois
> >
> > On Wed, Mar 22, 2017 at 4:29 PM, Padma Penumarthy 
> > wrote:
> >
> >> I think we create one file for each parquet block.
> >> If underlying HDFS block size is 128 MB and parquet block size  is  >
> >> 128MB,
> >> it will create more blocks on HDFS.
> >> Can you let me know what is the HDFS API that would allow you to
> >> do otherwise ?
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >>> On Mar 22, 2017, at 11:54 AM, François Méthot 
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Is there a way to force Drill to store CTAS generated parquet file as a
> >>> single block when using HDFS? Java HDFS API allows to do that, files
> >> could
> >>> be created with the Parquet block-size.
> >>>
> >>> We are using Drill on hdfs configured with block size of 128MB.
> Changing
> >>> this size is not an option at this point.
> >>>
> >>> It would be ideal for us to have single parquet file per hdfs block,
> >> setting
> >>> store.parquet.block-size to 128MB would fix our issue but we end up
> with
> >> a
> >>> lot more files to deal with.
> >>>
> >>> Thanks
> >>> Francois
> >>
> >>
>
>


Re: Single Hdfs block per parquet file

2017-03-22 Thread François Méthot
Here are 2 links I could find:

http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)

http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)

Francois

On Wed, Mar 22, 2017 at 4:29 PM, Padma Penumarthy 
wrote:

> I think we create one file for each parquet block.
> If underlying HDFS block size is 128 MB and parquet block size  is  >
> 128MB,
> it will create more blocks on HDFS.
> Can you let me know what is the HDFS API that would allow you to
> do otherwise ?
>
> Thanks,
> Padma
>
>
> > On Mar 22, 2017, at 11:54 AM, François Méthot 
> wrote:
> >
> > Hi,
> >
> > Is there a way to force Drill to store CTAS generated parquet file as a
> > single block when using HDFS? Java HDFS API allows to do that, files
> could
> > be created with the Parquet block-size.
> >
> > We are using Drill on hdfs configured with block size of 128MB. Changing
> > this size is not an option at this point.
> >
> > It would be ideal for us to have single parquet file per hdfs block,
> setting
> > store.parquet.block-size to 128MB would fix our issue but we end up with
> a
> > lot more files to deal with.
> >
> > Thanks
> > Francois
>
>


Single Hdfs block per parquet file

2017-03-22 Thread François Méthot
Hi,

Is there a way to force Drill to store a CTAS-generated parquet file as a
single block when using HDFS? The Java HDFS API allows that: files could
be created with the Parquet block size.

We are using Drill on HDFS configured with a block size of 128MB. Changing
this size is not an option at this point.

It would be ideal for us to have a single parquet file per HDFS block;
setting store.parquet.block-size to 128MB would fix our issue, but we would
end up with a lot more files to deal with.

Thanks
Francois


Re: [Drill 1.9.0] : [CONNECTION ERROR] :- (user client) closed unexpectedly. Drillbit down?

2017-03-21 Thread François Méthot
Hi,

 We had client-foreman connection and ZkConnection issues a few
months ago. They went from annoying to a show stopper when we moved from a
12 node cluster to a 220 node cluster.

Nodes specs
- 8 cores total (2 x E5620)
- 72 GB  RAM Total
- Other applications share the same hardware.

~ 100 TB parquet data on hdfs.




Based on our observation we have done few months ago, we ended up with
those setting/guideline/changes:

- Memory Setting
  DRILL_MAX_DIRECT_MEMORY="20G"
  DRILL_HEAP="8G"

  Remaining RAM is for other applications


- Threading
  planner.width.max_per_node = 4

  We think a higher number of threads generates more network traffic or
more context switches on each node, leading to more chances of a Zk
disconnection.
  But we observed that even with max_per_node of 1, we would still get
disconnections. We had no clear indication from Cloudera Manager that
Mem/CPU/Network was overloaded on the faulty node, although on very rare
occasions we would get no stats data at all from certain nodes.

- Affinity Factor
  We changed the affinity factor from the default to a big value:
  planner.affinity_factor = 1000.0

  This fixed an issue where some drillbits in our cluster were scanning
data stored on remote nodes; it maximizes the chances of a drillbit
reading local data. When drillbits only scan local data, there is less
network traffic, queries run faster, and the chance of a ZkDisconnect
drops.

- If using hdfs, make sure each data file is stored in a single block

- Try a more recent 1.8 JVM or switch to JVM 1.7
  We have had CLIENT to FOREMAN disconnection issues with certain versions
of the JVM (Linux, Windows, Mac). (We sent an email about this to the dev
mailing list in the past.)

- Query Pattern
  The more fields are selected (select * vs a few specific fields), the
more likely the error. More data selected means more CPU/network activity,
leading to more chances of ZooKeeper skipping a heartbeat.


- Foreman QueryManager Resilience Hack
When a query failed, our logs indicated that a drillbit was getting
unregistered and then registered again a short time after (a few ms to a
few seconds), but the Foreman's QueryManager would catch the
"drillbitUnregistered" event and fail the query right away. As a test, we
changed the QueryManager to not fail queries when a drillbit gets
unregistered. We put this change in place in 1.8, and our logs now indicate
Zk disconnect-reconnect cycles while the query keeps running, so we kept
that test code in. A query will now fail only if the drillbit loses its
connection with other drillbits (through the RPC bus). We have since moved
to 1.9 with that change as well. I haven't had a chance to try 1.9 without
the hack.

org.apache.drill.exec.work.foreman
   QueryManager.java
private void drillbitUnregistered(.)

if (atLeastOneFailure)
   -> just log the error, do not cancel query.

Our query success rate went from <50% to >95% with all the changes above.
We hope to get rid of the hack when an official fix is available.



To cover the remaining ~5% of errors (any other type), we advise users to
simply retry. We also have a retry strategy built into our hourly Python
scripts that aggregate data.
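The retry pattern in those hourly scripts is nothing fancy; it is essentially the following (sketched here in Java rather than Python, with made-up names, and the backoff delays are illustrative):

```java
public class Retry {
    // Generic retry with exponential backoff around a flaky operation
    // (e.g. submitting a Drill query that may hit a transient failure).
    interface Attempt<T> { T run() throws Exception; }

    static <T> T withRetries(Attempt<T> attempt, int maxTries) throws Exception {
        Exception last = null;
        for (int i = 0; i < maxTries; i++) {          // maxTries >= 1 assumed
            try {
                return attempt.run();
            } catch (Exception e) {
                last = e;
                Thread.sleep(100L << i);              // 100ms, 200ms, 400ms, ...
            }
        }
        throw last;                                   // give up, surface last error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice with a transient error, then succeeds on the 3rd try.
        String r = withRetries(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient");
            return "ok";
        }, 5);
        System.out.println(r + " after " + calls[0] + " tries"); // prints "ok after 3 tries"
    }
}
```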

Hope it helps

Francois









On Thu, Mar 9, 2017 at 2:31 PM, Anup Tiwari 
wrote:

> Hi John,
>
> First of all sorry for delayed response and thanks for your suggestion,
> reducing value of "planner.width.max_per_node" helped me a lot, above issue
> which was coming 8 out of 10 times earlier now it is coming only 2 out of
> 10 times.
>
> As mentioned above occurrences of connection error came down considerably,
> but now sometimes i get "Heap Space Error" for few queries and due to this
> sometimes drill-bits on some/all nodes gets killed. Let me know if any
> other variable i can check for this(As of now, i have 8GB of Heap and 20GB
> of Direct memory) :
>
> *Error Log :*
>
> ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure
> Occurred,
> exiting. Information message: Unable to handle out of memory condition in
> FragmentExecutor.
> java.lang.OutOfMemoryError: Java heap space
> at org.apache.xerces.dom.DeferredDocumentImpl.
> getNodeObject(Unknown
> Source) ~[xercesImpl-2.11.0.jar:na]
> at
> org.apache.xerces.dom.DeferredDocumentImpl.synchronizeChildren(Unknown
> Source) ~[xercesImpl-2.11.0.jar:na]
> at
> org.apache.xerces.dom.DeferredElementImpl.synchronizeChildren(Unknown
> Source) ~[xercesImpl-2.11.0.jar:na]
> at org.apache.xerces.dom.ElementImpl.normalize(Unknown Source)
> ~[xercesImpl-2.11.0.jar:na]
> at org.apache.xerces.dom.ElementImpl.normalize(Unknown Source)
> ~[xercesImpl-2.11.0.jar:na]
> at org.apache.xerces.dom.ElementImpl.normalize(Unknown Source)
> ~[xercesImpl-2.11.0.jar:na]
> at com.games24x7.device.NewDeviceData.setup(NewDeviceData.java:94)
> ~[DeviceDataClient-0.0.1-SNAPSHOT.jar:na]
> at
> org.ap

Question on Dynamic UDF with hdfs

2016-12-16 Thread François Méthot
Hi,

  The Dynamic UDF support is a very neat new feature. We are trying to make
it work on HDFS.


we are using a config that looks like this:

drill.exec.udf {
  retry-attempts: 5,
  directory : {
    fs: "hdfs://ourname.node:8020",
    root: "/an_hdfs/drill_path/udf",
    staging: "/staging",
    registry: "/registry",
    tmp: "/a_drillbit/local/tmp",
    local: "/udf"
  }
}

We drop UDF jar in staging directory on hdfs.

>hadoop fs -copyFromLocal drill_test.udf-1.0.0.jar
/an_hdfs/drill_path/udf/staging/drill.test_udf.jar

Then in Drill :

CREATE FUNCTION USING JAR 'drill.test_udf.jar';

1st problem:
  It returns:
  Files does not exist:
/an_hdfs/drill_path/udf/drill.test_udf-sources.jar


So we copy the same file again (with "-sources" added):
>hadoop fs -copyFromLocal drill_test.udf-1.0.0.jar
/an_hdfs/drill_path/udf/drill.test_udf-sources.jar

So that we have 2 identical files, one with "-sources" in its name.


Redo the create function:
CREATE FUNCTION USING JAR 'drill.test_udf.jar';
This time it works:
   The following  UDFs in jar drill.test_udf.jar have been registered:
   [hello_word(VARCHAR_OPTIONAL), hello_world(VARCHAR-REQUIRED)]

   ( Note:  if we do CREATE FUNCTION USING JAR
'drill.test_udf-sources.jar', it complains it can't find
'drill.test_udf-sources-sources.jar')
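Both errors are consistent with Drill deriving the companion sources-jar name mechanically from whatever jar name you register. This is our inference from the messages above, not the actual Drill code, but the rule would look something like:

```java
public class SourcesJarName {
    // Inferred rule: strip a trailing ".jar" from the registered binary
    // jar name and append "-sources.jar" to get the companion jar Drill
    // looks for. (Inference from the observed errors, not Drill source.)
    static String sourcesJar(String binaryJar) {
        return binaryJar.replaceAll("\\.jar$", "") + "-sources.jar";
    }

    public static void main(String[] args) {
        System.out.println(sourcesJar("drill.test_udf.jar"));
        // prints "drill.test_udf-sources.jar"
        System.out.println(sourcesJar("drill.test_udf-sources.jar"));
        // prints "drill.test_udf-sources-sources.jar", which would explain
        // why registering the "-sources" jar asks for a "-sources-sources" jar
    }
}
```

So the staging directory is expected to contain both the binary jar and a matching `*-sources.jar` before CREATE FUNCTION USING JAR is run.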

2nd problem
  We are unable to use the UDF,
  No Match for function signature hello_world()...

Before I debug the function code itself, I would like to make sure that
the UDF is actually being seen locally by each drillbit.

Based on the doc for the "local" property:
   "The relative path concatenated to the Drill temporary directory to
indicate the local UDF directory."

I was expecting Drill to copy the Dynamic UDF jar from HDFS to a local dir
on each drillbit in
/a_drillbit/local/tmp/udf

Is that where it should be, based on our config?

Thanks


Re: Limit the number of output parquet files in CTAS

2016-11-01 Thread François Méthot
Thanks Andries,

I experimented with the order by and it works as you mentioned.

I will do some reading and experimentation with
store.partition.hash_distribute.

Francois




On Mon, Oct 31, 2016 at 4:24 PM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:

> You can try and set store.partition.hash_distribute to true, but it is
> still listed as an alpha feature.
>
> You can also add a sort operation (order by) to the CTAS statement to
> force a single data stream at output. I believe this was discussed a while
> back on the user list.
>
> Ideally you want to look at the data set size and how much parallelism
> would work best in your environment for reading the data later.
>
> --Andries
>
>
> > On Oct 31, 2016, at 12:57 PM, François Méthot 
> wrote:
> >
> > Hi,
> >
> > Is there a way to limit the number of files produced by a CTAS query ?
> > I would like the speed benefits of having hundreds of scanner fragment
> but
> > don't want to deal with hundreds of output files.
> >
> > Our usecase right now is using 880 thread to scan and produce a report
> > output spread over... 880 parquets files.
> > Each resulting file is ~7M.
> >
> > Only way I found to reduce those files to smaller set is  to a perform
> > second CTAS query on the aggregated files with
> planner.width.max_per_query
> > set to smaller number.
> >
> > Any possible way to do this in one query?
> >
> > Thanks
> > Francois
>
>


Limit the number of output parquet files in CTAS

2016-10-31 Thread François Méthot
Hi,

Is there a way to limit the number of files produced by a CTAS query?
I would like the speed benefit of having hundreds of scanner fragments but
don't want to deal with hundreds of output files.

Our use case right now uses 880 threads to scan and produce a report
output spread over... 880 parquet files.
Each resulting file is ~7MB.

The only way I found to reduce those files to a smaller set is to perform a
second CTAS query on the aggregated files with planner.width.max_per_query
set to a smaller number.

Is there any way to do this in one query?

Thanks
Francois


Re: ZK lost connectivity issue on large cluster

2016-10-26 Thread François Méthot
Hi,

Sorry it took so long; I lost the original picture and had to go to an MS
Paint training. Here we go:

https://github.com/fmethot/imagine/blob/master/affinity_factor.png



On Thu, Oct 20, 2016 at 12:57 PM, Sudheesh Katkam 
wrote:

> The mailing list does not seem to allow for images. Can you put the image
> elsewhere (Github or Dropbox), and reply with a link to it?
>
> - Sudheesh
>
> > On Oct 19, 2016, at 5:37 PM, François Méthot 
> wrote:
> >
> > We had problem on the 220 nodes cluster. No problem on the 12 nodes
> cluster.
> >
> > I agree that the data may not be distributed evenly. It would be a long
> and tedious process for me to produce a report.
> >
> > Here is a drawing  of the fragments overview before and after the
> changes of the affinity factory on a sample query ran on the 220 nodes
> cluster.  max_width_per_node=8 on both, but it turned out to be irrelevant
> to the issue.
> >
> >
> >
> >
> >
> >
> > Before: SYSTEM ERROR: ForemanException: One more more nodes lost
> connectivity during query.  Identified nodes were [server121:31010].
> >
> > After: error is gone
> >
> > Before: low disk io, high network io on the bottom part of the graph
> > after : high disk io, low network io on the bottom part of the graph
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Oct 18, 2016 at 12:58 AM, Padma Penumarthy <
> ppenumar...@maprtech.com <mailto:ppenumar...@maprtech.com>> wrote:
> > Hi Francois,
> >
> > It would be good to understand how increasing affinity_factor helped in
> your case
> > so we can better document and also use that knowledge to improve things
> in future release.
> >
> > If you have two clusters,  it is not clear whether you had the problem
> on 12 node cluster
> > or 220 node cluster or both. Is the dataset same on both ? Is
> max_width_per_node=8 in both clusters ?
> >
> > Increasing affinity factor will lower remote reads  by scheduling more
> fragments/doing more work
> > on nodes which have data available locally.  So, there seem to be some
> kind of non uniform
> > data distribution for sure. It would be good if you can provide more
> details i.e. how the data is
> > distributed in the cluster and how the load on the nodes changed when
> affinity factor was increased.
> >
> > Thanks,
> > Padma
> >
> >
> > > On Oct 14, 2016, at 6:45 PM, François Méthot  <mailto:fmetho...@gmail.com>> wrote:
> > >
> > > We have  a 12 nodes cluster and a 220 nodes cluster, but they do not
> talk
> > > to each other. So Padma's analysis do not apply but thanks for your
> > > comments. Our goal had been to run Drill on the 220 nodes cluster
> after it
> > > proved worthy of it on the small cluster.
> > >
> > > planner.width.max_per_node was eventually reduced to 2 when we were
> trying
> > > to figure this out, it would still fail. After we figured out the
> > > affinity_factor, we put it back to its original value and it would work
> > > fine.
> > >
> > >
> > >
> > > Sudheesh: Indeed, The Zk/drill services use the same network on our
> bigger
> > > cluster.
> > >
> > > potential improvements:
> > > - planner.affinity_factor should be better documented.
> > > - When ZK disconnected, the running queries systematically failed.
> When we
> > > disabled the ForemanException thrown in the QueryManager.
> > > drillbitUnregistered method, most of our query started to run
> successfully,
> > > we would sometime get Drillbit Disconnected error within the rpc work
> bus.
> > > It did confirm that we still had something on our network going on,
> but it
> > > also showed that the RPC bus between drillbits was more resilient to
> > > network hiccup. I could not prove it, but I think under certain
> condition,
> > > the ZK session gets recreated, which cause a Query Manager unregistered
> > > (query fail) and register call right after, but the RPC
> > > bus  would remains connected.
> > >
> > >
> > > We really appreciate your feedback and we hope to contribute to this
> great
> > > project in the future.
> > > Thanks
> > > Francois
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy <
> ppenumar...@maprtech.com <mailto:ppenumar...@maprtech.com>>
> > > wrote:
> > >
> > >>
> > >>

Re: ZK lost connectivity issue on large cluster

2016-10-19 Thread François Méthot
We had problems on the 220 node cluster. No problems on the 12 node cluster.

I agree that the data may not be distributed evenly. It would be a long and
tedious process for me to produce a report.

Here is a drawing of the fragments overview before and after the change to
the affinity factor on a sample query run on the 220 node cluster.
max_width_per_node=8 on both, but it turned out to be irrelevant to the
issue.





Before: SYSTEM ERROR: ForemanException: One more more nodes lost
connectivity during query. Identified nodes were [server121:31010].

After: error is gone

Before: low disk io, high network io on the bottom part of the graph
After: high disk io, low network io on the bottom part of the graph
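One way to picture why a large planner.affinity_factor flipped that disk/network profile is a toy scoring model. This is an illustration of the locality-vs-load trade-off only, not Drill's actual assignment code; the node names and weights are made up:

```java
import java.util.Map;

public class AffinityToyModel {
    // Toy model: a scan fragment goes to the node with the best
    // affinity * localFraction - currentLoad score. With a small factor an
    // idle remote node can win (remote reads, network traffic); with a huge
    // factor the node holding the data always wins (local disk reads).
    static String pickNode(Map<String, double[]> nodes, double affinityFactor) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : nodes.entrySet()) {
            double localFraction = e.getValue()[0]; // share of the file's bytes on this node
            double currentLoad = e.getValue()[1];   // how busy the node already is
            double score = affinityFactor * localFraction - currentLoad;
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> nodes = Map.of(
                "data-node",   new double[]{1.0, 0.9},  // holds the data but busy
                "remote-node", new double[]{0.0, 0.1}); // idle, no local data
        System.out.println(pickNode(nodes, 0.5));    // prints "remote-node"
        System.out.println(pickNode(nodes, 1000.0)); // prints "data-node"
    }
}
```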







On Tue, Oct 18, 2016 at 12:58 AM, Padma Penumarthy  wrote:

> Hi Francois,
>
> It would be good to understand how increasing affinity_factor helped in
> your case
> so we can better document and also use that knowledge to improve things in
> future release.
>
> If you have two clusters,  it is not clear whether you had the problem on
> 12 node cluster
> or 220 node cluster or both. Is the dataset same on both ? Is
> max_width_per_node=8 in both clusters ?
>
> Increasing affinity factor will lower remote reads  by scheduling more
> fragments/doing more work
> on nodes which have data available locally.  So, there seem to be some
> kind of non uniform
> data distribution for sure. It would be good if you can provide more
> details i.e. how the data is
> distributed in the cluster and how the load on the nodes changed when
> affinity factor was increased.
>
> Thanks,
> Padma
>
>
> > On Oct 14, 2016, at 6:45 PM, François Méthot 
> wrote:
> >
> > We have  a 12 nodes cluster and a 220 nodes cluster, but they do not talk
> > to each other. So Padma's analysis do not apply but thanks for your
> > comments. Our goal had been to run Drill on the 220 nodes cluster after
> it
> > proved worthy of it on the small cluster.
> >
> > planner.width.max_per_node was eventually reduced to 2 when we were
> trying
> > to figure this out, it would still fail. After we figured out the
> > affinity_factor, we put it back to its original value and it would work
> > fine.
> >
> >
> >
> > Sudheesh: Indeed, The Zk/drill services use the same network on our
> bigger
> > cluster.
> >
> > potential improvements:
> > - planner.affinity_factor should be better documented.
> > - When ZK disconnected, the running queries systematically failed. When
> we
> > disabled the ForemanException thrown in the QueryManager.
> > drillbitUnregistered method, most of our query started to run
> successfully,
> > we would sometime get Drillbit Disconnected error within the rpc work
> bus.
> > It did confirm that we still had something on our network going on, but
> it
> > also showed that the RPC bus between drillbits was more resilient to
> > network hiccup. I could not prove it, but I think under certain
> condition,
> > the ZK session gets recreated, which cause a Query Manager unregistered
> > (query fail) and register call right after, but the RPC
> > bus  would remains connected.
> >
> >
> > We really appreciate your feedback and we hope to contribute to this
> great
> > project in the future.
> > Thanks
> > Francois
> >
> >
> >
> >
> >
> >
> > On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy <
> ppenumar...@maprtech.com>
> > wrote:
> >
> >>
> >> Seems like you have 215 nodes, but the data for your query is there on
> >> only 12 nodes.
> >> Drill tries to distribute the scan fragments across the cluster more
> >> uniformly (trying to utilize all CPU resources).
> >> That is why you have lot of remote reads going on and increasing
> affinity
> >> factor eliminates running scan
> >> fragments on the other (215-12) nodes.
> >>
> >> you also mentioned planner.width.max_per_node is set to 8.
> >> So, with increased affinity factor,  you have 8 scan fragments doing a
> lot
> >> more work on these 12 nodes.
> >> Still, you got 10X improvement. Seems like your network is the obvious
> >> bottleneck. Is it a 10G or 1G ?
> >>
> >> Also, increasing affinity factor helped in your case because there is no
> >> data on other nodes.
> >> But, if you have data non uniformly distributed across more nodes, you
> >> might still have the problem.
> >>
> >> Thanks,
> >> Padma
> >>
> >>> On Oct 14, 2016, at 11:18 AM, Sudheesh 

Re: ZK lost connectivity issue on large cluster

2016-10-14 Thread François Méthot
We have a 12-node cluster and a 220-node cluster, but they do not talk
to each other, so Padma's analysis does not apply; thanks for your
comments anyway. Our goal had been to run Drill on the 220-node cluster after
it proved itself on the small cluster.

planner.width.max_per_node was eventually reduced to 2 while we were trying
to figure this out, but queries would still fail. After we discovered the
affinity_factor setting, we put it back to its original value and everything
worked fine.
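
To make the interplay concrete, here is a toy model (assumed behavior, not
Drill's actual planner code) of how planner.width.max_per_node caps scan
parallelism per node:

```python
# Toy model (not Drill's planner code): the planner spreads scan
# fragments evenly across nodes, but each node runs at most
# width_max_per_node of them.
def fragments_per_node(total_fragments, num_nodes, width_max_per_node):
    even_share = -(-total_fragments // num_nodes)  # ceiling division
    return min(even_share, width_max_per_node)

# e.g. a 1000-fragment scan on 215 nodes: ~5 fragments per node at
# the original cap of 8, and only 2 once the cap is lowered to 2.
print(fragments_per_node(1000, 215, 8))  # -> 5
print(fragments_per_node(1000, 215, 2))  # -> 2
```

Lowering the cap never helped because the bottleneck was remote reads, not
per-node parallelism.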



Sudheesh: Indeed, the ZK/Drill services use the same network on our bigger
cluster.

potential improvements:
- planner.affinity_factor should be better documented.
- When ZK disconnected, the running queries systematically failed. When we
disabled the ForemanException thrown in the
QueryManager.drillbitUnregistered method, most of our queries started to run
successfully; we would sometimes still get a Drillbit Disconnected error
within the RPC work bus. This confirmed that we still had something going on
with our network, but it also showed that the RPC bus between drillbits is
more resilient to network hiccups. I could not prove it, but I think that
under certain conditions the ZK session gets recreated, which causes a
QueryManager unregister call (failing the query) followed by a register call
right after, while the RPC bus remains connected.


We really appreciate your feedback and we hope to contribute to this great
project in the future.
Thanks
Francois






On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy 
wrote:

>
> Seems like you have 215 nodes, but the data for your query is there on
> only 12 nodes.
> Drill tries to distribute the scan fragments across the cluster more
> uniformly (trying to utilize all CPU resources).
> That is why you have lot of remote reads going on and increasing affinity
> factor eliminates running scan
> fragments on the other (215-12) nodes.
>
> you also mentioned planner.width.max_per_node is set to 8.
> So, with increased affinity factor,  you have 8 scan fragments doing a lot
> more work on these 12 nodes.
> Still, you got 10X improvement. Seems like your network is the obvious
> bottleneck. Is it a 10G or 1G ?
>
> Also, increasing affinity factor helped in your case because there is no
> data on other nodes.
> But, if you have data non uniformly distributed across more nodes, you
> might still have the problem.
>
> Thanks,
> Padma
>
> > On Oct 14, 2016, at 11:18 AM, Sudheesh Katkam 
> wrote:
> >
> > Hi Francois,
> >
> > Thank you for posting your findings! Glad to see a 10X improvement.
> >
> > By increasing affinity factor, looks like Drill’s parallelizer is forced
> to assign fragments on nodes with data i.e. with high favorability for data
> locality.
> >
> > Regarding the random disconnection, I agree with your guess that the
> network bandwidth is being used up by remote reads which causes lags in
> drillbit to ZooKeeper heartbeats (since these services use the same
> network)? Maybe others can comment here.
> >
> > Thank you,
> > Sudheesh
> >
> >> On Oct 12, 2016, at 6:06 PM, François Méthot 
> wrote:
> >>
> >> Hi,
> >>
> >> We finally got rid of this error. We have tried many, many things  (like
> >> modifying drill to ignore the error!), it ultimately came down to this
> >> change:
> >>
> >> from default
> >> planner.affinity_factor=1.2
> >>  to
> >> planner.affinity_factor=100
> >>
> >> Basically this encourages fragment to only care about locally stored
> files.
> >> We looked at the code that used that property and figured that 100 would
> >> have strong impact.
> >>
> >> What led us to this property is the fact that 1/4 of our fragments would
> >> take a lot more time to complete their scan, up to  10x the time of the
> >> fastest nodes.  On the slower nodes, Cloudera Manager would show very
> low
> >> Disk IOPS with high Network IO compare to our faster nodes. We had
> noticed
> >> that before but figured it would be some optimization to be done later
> when
> >> more pressing issue would be fixed, like Zk disconnection and OOM. We
> were
> >> desperate and decided to fix anything that  would look unusual.
> >>
> >> After this change, query ran up to 10x faster.
> >>
> >> We no longer get random disconnection between node and Zookeeper.
> >>
> >> We are still wondering why exactly. Network should not be a bottleneck.
> >> Could high network traffic between a Drillbit and HDFS causes the
> Drillbit
> >> to timeout with zookeeper?
> >>
> >>
> >> On Fri, Sep 30, 2016 at 4:21 PM, François Méthot 
> >> wro

Re: ZK lost connectivity issue on large cluster

2016-10-12 Thread François Méthot
Hi,

 We finally got rid of this error. We tried many, many things (like
modifying Drill to ignore the error!); it ultimately came down to this
change:

 from default
planner.affinity_factor=1.2
   to
planner.affinity_factor=100

Basically this encourages fragments to only care about locally stored files.
We looked at the code that uses that property and figured that 100 would
have a strong impact.
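
For reference, here is a rough sketch (a simplification, not Drill's actual
assignment code) of the kind of node scoring that planner.affinity_factor
scales; it shows why jumping from 1.2 to 100 effectively forces scans onto
nodes that hold the data locally:

```python
# Simplified sketch (not Drill's real assignment code): score each
# candidate node for a scan unit by how much of the data it holds
# locally, scaled by the affinity factor.
def node_score(local_bytes, total_bytes, affinity_factor):
    locality = local_bytes / total_bytes if total_bytes else 0.0
    return 1.0 + affinity_factor * locality

total = 64_000_000  # one ~60 MB parquet file
# Default factor: a fully local node is only ~2.2x preferable...
print(node_score(total, total, 1.2) / node_score(0, total, 1.2))
# ...but at 100 it is ~101x preferable, so remote nodes lose every time.
print(node_score(total, total, 100) / node_score(0, total, 100))
```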

What led us to this property is the fact that 1/4 of our fragments would
take a lot more time to complete their scan, up to 10x the time of the
fastest nodes. On the slower nodes, Cloudera Manager would show very low
disk IOPS with high network IO compared to our faster nodes. We had noticed
that before but figured it was an optimization to be done later, once more
pressing issues like ZK disconnections and OOM were fixed. We were
desperate and decided to fix anything that looked unusual.

After this change, queries ran up to 10x faster.

We no longer get random disconnection between node and Zookeeper.

We are still wondering why exactly; the network should not be a bottleneck.
Could high network traffic between a Drillbit and HDFS cause the Drillbit
to time out with ZooKeeper?


On Fri, Sep 30, 2016 at 4:21 PM, François Méthot 
wrote:

> After the 30 seconds gap, all the Drill nodes receives the following:
>
> 2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active drillbit set
> changed. Now includes 220 total bits. New Active drill bits
> ...faulty node is not on the list...
> 2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active drillbit set
> changed. Now includes 221 total bits. New Active drill bits
> ...faulty node is back on the list...
>
>
> So the faulty Drill node get unregistered and registered right after.
>
> Drill is using the low level API for registering and unregistering, and
> the only place with unregistering occurs is when the drillbit is closed at
> shutdown.
>
> That particular drillbit is still up and running after those log, it could
> not have trigger the unregistering process through a shutdown.
>
>
>
>
> Would you have an idea what else could cause a Drillbit to be unregistered
> from the DiscoveryService and registered again right after?
>
>
>
> We are using Zookeeper 3.4.5
>
>
>
>
>
>
>
>
>
>
> On Wed, Sep 28, 2016 at 10:36 AM, François Méthot 
> wrote:
>
>> Hi,
>>
>>  We have switched to 1.8 and we are still getting node disconnection.
>>
>> We did many tests, we thought initially our stand alone parquet converter
>> was generating parquet files with problematic data (like 10K characters
>> string), but we were able to reproduce it with employee data from the
>> tutorial.
>>
>> For example,  we duplicated the Drill Tutorial "Employee" data to reach
>> 500 M records spread over 130 parquet files.
>> Each files is ~60 MB.
>>
>>
>> We ran this query over and over on 5 different sessions using a script:
>>select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does
>> not exist%';
>>
>>Query return no rows and would take ~35 to 45 seconds to return.
>>
>> Leaving the script running on each node, we eventually hit the "nodes
>> lost connectivity during query" error.
>>
>> One the done that failed,
>>
>>We see those log:
>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO
>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to report:
>> RUNNING
>> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010
>>
>> <--- 30 seconds gap for that fragment --->
>>
>> 2016-09-26 20:37:09,976 [BitServer-2] WARN 
>> o.a.d.exec.rpc.control.ControlServer
>> - Message of mode REQUEST of rpc type 2 took longer then 500 ms. Actual
>> duration was 23617ms.
>>
>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
>> o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
>> remaining: 0 incoming schema BatchSchema [, selectionVector=TWO_BYTE]
>> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
>> o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing fragment
>> manager : ...uuid...
>>
>>
>>
>> For the same query on a working node:
>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO
>> o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to report:
>> RUNNING
>> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG
>> o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on server125:31010
>> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
>> o.a.d

Re: ZK lost connectivity issue on large cluster

2016-09-30 Thread François Méthot
After the 30-second gap, all the Drill nodes receive the following:

2016-09-26 20:07:38,629 [Curator-ServiceCache-0] Debug Active drillbit set
changed. Now includes 220 total bits. New Active drill bits
...faulty node is not on the list...
2016-09-26 20:07:38,897 [Curator-ServiceCache-0] Debug Active drillbit set
changed. Now includes 221 total bits. New Active drill bits
...faulty node is back on the list...


So the faulty Drill node gets unregistered, then registered again right after.

Drill uses the low-level API for registering and unregistering, and the
only place where unregistering occurs is when the drillbit is closed at
shutdown.

That particular drillbit was still up and running after those logs, so it
could not have triggered the unregistering process through a shutdown.




Would you have an idea what else could cause a Drillbit to be unregistered
from the DiscoveryService and registered again right after?



We are using Zookeeper 3.4.5
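
The log pattern is consistent with a ZK session expiry followed by an
immediate reconnect: the ephemeral registration znode vanishes, then the
drillbit re-registers. A toy illustration of what the service cache sees
(purely illustrative, not Drill's Curator code):

```python
# Toy model (not Drill's Curator code) of what the service cache
# observes when a drillbit's ZK session expires and then reconnects.
active_bits = {f"server{i:03d}:31010" for i in range(221)}

def on_session_expired(bits, node):
    bits.discard(node)  # ZK removes the ephemeral znode
    print(f"Active drillbit set changed. Now includes {len(bits)} total bits")

def on_reconnect(bits, node):
    bits.add(node)      # the drillbit registers itself again
    print(f"Active drillbit set changed. Now includes {len(bits)} total bits")

on_session_expired(active_bits, "server064:31010")  # -> 220 total bits
on_reconnect(active_bits, "server064:31010")        # -> 221 total bits
```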

On Wed, Sep 28, 2016 at 10:36 AM, François Méthot 
wrote:

> Hi,
>
>  We have switched to 1.8 and we are still getting node disconnection.
>
> We did many tests, we thought initially our stand alone parquet converter
> was generating parquet files with problematic data (like 10K characters
> string), but we were able to reproduce it with employee data from the
> tutorial.
>
> For example,  we duplicated the Drill Tutorial "Employee" data to reach
> 500 M records spread over 130 parquet files.
> Each files is ~60 MB.
>
>
> We ran this query over and over on 5 different sessions using a script:
>select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does
> not exist%';
>
>Query return no rows and would take ~35 to 45 seconds to return.
>
> Leaving the script running on each node, we eventually hit the "nodes lost
> connectivity during query" error.
>
> One the done that failed,
>
>We see those log:
> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO 
> o.a.d.e.w.f.FragmentStatusReporter
> - ...uuid...:1:10: State to report: RUNNING
> 2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
> o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010
>
> <--- 30 seconds gap for that fragment --->
>
> 2016-09-26 20:37:09,976 [BitServer-2] WARN 
> o.a.d.exec.rpc.control.ControlServer
> - Message of mode REQUEST of rpc type 2 took longer then 500 ms. Actual
> duration was 23617ms.
>
> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG 
> o.a.d.e.p.i.s.RemovingRecordBatch
> - doWork(): 0 records copied out of 0, remaining: 0 incoming schema
> BatchSchema [, selectionVector=TWO_BYTE]
> 2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG 
> o.a.d.exec.rpc.control.WorkEventBus
> - Cancelling and removing fragment manager : ...uuid...
>
>
>
> For the same query on a working node:
> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO 
> o.a.d.e.w.f.FragmentStatusReporter
> - ...uuid...:1:2: State to report: RUNNING
> 2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG
> o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on server125:31010
> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG 
> o.a.d.e.p.i.s.RemovingRecordBatch
> - doWork(): 0 records copied out of 0, remaining: 0 incoming schema
> BatchSchema [, selectionVector=TWO_BYTE]
> 2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG 
> o.a.d.e.p.i.s.RemovingRecordBatch
> - doWork(): 0 records copied out of 0, remaining: 0 incoming schema
> BatchSchema [, selectionVector=TWO_BYTE]
> 2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG 
> o.a.d.e.s.p.c.ParquetRecordReader
> - Read 87573 records out of row groups(0) in file `/data/drill/tmp/PARQUET_
> EMPLOYEE/0_0_14.parquet
>
>
>
>
> We are investigating what could get cause that 30 seconds gap for that
> fragment.
>
> Any idea let us know
>
> Thanks
> Francois
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Mon, Sep 19, 2016 at 2:59 PM, François Méthot 
> wrote:
>
>> Hi Sudheesh,
>>
>>   If I add selection filter so that no row are returned, the same problem
>> occur. I also simplified the query to include only few integer columns.
>>
>> That particular data repo is ~200+ Billions records spread over ~50 000
>> parquet files.
>>
>> We have other CSV data repo that are 100x smaller that does not trigger
>> this issue.
>>
>>
>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
>> There is also a bizarre case where the node that is reported as lost is the
>> node itself.
>> Yes, the stack trace is from the ticket, It did occurred once or twice
>> (in the

Re: select count(1) : Cannot convert Indexed schema to NamePart

2016-09-30 Thread François Méthot
I have created a ticket:

https://issues.apache.org/jira/browse/DRILL-4919

The error happens on CSV files with headers.

The actual error from Drill's original TextFormatPlugin is:

Error: UNSUPPORTED_OPERATION ERROR: With extractHeader enabled, only header
names are supported


Forget about the originally reported error; it happens in a modified
version of the TextFormatPlugin that we are using.


On Wed, Sep 28, 2016 at 1:01 PM, Jinfeng Ni  wrote:

> I tried to query a regular csv file and a csv.gz file, and did not run
> into the problem you saw. When you create a JIRA, it would be helpful
> if you can share a sample file for re-produce purpose.
>
>
>
> On Wed, Sep 28, 2016 at 9:33 AM, Aman Sinha  wrote:
> > Is this specific to CSV format files ?  Yes, you should create a JIRA for
> > this.   Thanks for reporting.
> >
> > On Wed, Sep 28, 2016 at 8:55 AM, François Méthot 
> > wrote:
> >
> >> Hi,
> >>
> >>  Since release 1.8,
> >>
> >> we have a workspace hdfs.datarepo1 mapped to
> >> /year/month/day/
> >> containging csv.gz
> >>
> >> if we do select count(1) on any level of the dir structure like
> >>select count(1) from hdfs.datarepo1.`/2016/08`;
> >> We get
> >> Error: SYSTEM ERROR: IllegalStateException: You cannot convert a
> >> indexed schema path to a   NamePart. NameParts can only reference
> Vectors,
> >> not individual records or values.
> >>
> >> same error with
> >>select count(1) from hdfs.datarepo1.`/` where dir0=2016 and dir1=08;
> >>
> >>
> >> While this query works (or any select column)
> >>select count(column1) from hdfs.datarepo1.`/2016/08`;
> >>
> >>
> >> Should I create a ticket?
> >>
> >>
> >> Francois
> >>
>


select count(1) : Cannot convert Indexed schema to NamePart

2016-09-28 Thread François Méthot
Hi,

 Since release 1.8,

we have a workspace hdfs.datarepo1 mapped to
/year/month/day/
containing csv.gz files.

If we do a select count(1) at any level of the dir structure, like
   select count(1) from hdfs.datarepo1.`/2016/08`;
We get
Error: SYSTEM ERROR: IllegalStateException: You cannot convert a
indexed schema path to a   NamePart. NameParts can only reference Vectors,
not individual records or values.

same error with
   select count(1) from hdfs.datarepo1.`/` where dir0=2016 and dir1=08;


While this query works (as does any select of a named column):
   select count(column1) from hdfs.datarepo1.`/2016/08`;


Should I create a ticket?


Francois


Re: ZK lost connectivity issue on large cluster

2016-09-28 Thread François Méthot
Hi,

 We have switched to 1.8 and we are still getting node disconnections.

We did many tests; we initially thought our standalone parquet converter
was generating parquet files with problematic data (like 10K-character
strings), but we were able to reproduce the issue with the employee data
from the tutorial.

For example, we duplicated the Drill Tutorial "Employee" data to reach 500
M records spread over 130 parquet files.
Each file is ~60 MB.


We ran this query over and over on 5 different sessions using a script:
   select * from hdfs.tmp.`PARQUET_EMPLOYEE` where full_name like '%does
not exist%';

   The query returns no rows and takes ~35 to 45 seconds.

Leaving the script running on each node, we eventually hit the "nodes lost
connectivity during query" error.

On the node that failed, we see these logs:
2016-09-26 20:07:09,029 [...uuid...frag:1:10] INFO
o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:10: State to report:
RUNNING
2016-09-26 20:07:09,029 [...uuid...frag:1:10] DEBUG
o.a.d.e.w.FragmentExecutor - Starting fragment 1:10 on server064:31010

<--- 30 seconds gap for that fragment --->

2016-09-26 20:37:09,976 [BitServer-2] WARN
o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type
2 took longer then 500 ms. Actual duration was 23617ms.

2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
remaining: 0 incoming schema BatchSchema [, selectionVector=TWO_BYTE]
2016-09-26 20:07:38,211 [...uuid...frag:1:10] DEBUG
o.a.d.exec.rpc.control.WorkEventBus - Cancelling and removing fragment
manager : ...uuid...



For the same query on a working node:
2016-09-26 20:07:09,056 [...uuid...frag:1:2] INFO
o.a.d.e.w.f.FragmentStatusReporter - ...uuid...:1:2: State to report:
RUNNING
2016-09-26 20:07:09,056 [...uuid...frag:1:2] DEBUG
o.a.d.e.w.FragmentExecutor - Starting fragment 1:2 on server125:31010
2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
remaining: 0 incoming schema BatchSchema [, selectionVector=TWO_BYTE]
2016-09-26 20:07:09,749 [...uuid...frag:1:2] DEBUG
o.a.d.e.p.i.s.RemovingRecordBatch - doWork(): 0 records copied out of 0,
remaining: 0 incoming schema BatchSchema [, selectionVector=TWO_BYTE]
2016-09-26 20:07:11,005 [...uuid...frag:1:2] DEBUG
o.a.d.e.s.p.c.ParquetRecordReader - Read 87573 records out of row groups(0)
in file `/data/drill/tmp/PARQUET_EMPLOYEE/0_0_14.parquet




We are investigating what could cause that 30-second gap for that
fragment.
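
For what it's worth, the stall can be measured precisely by diffing the
fragment's consecutive log timestamps (standard library only; the
timestamps below are the ones from the logs above):

```python
# Diff the fragment's consecutive drillbit.log timestamps to
# measure the stall.
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"
start = datetime.strptime("2016-09-26 20:07:09,029", FMT)
resume = datetime.strptime("2016-09-26 20:07:38,211", FMT)
print((resume - start).total_seconds())  # -> 29.182
```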

If you have any ideas, let us know.

Thanks
Francois


On Mon, Sep 19, 2016 at 2:59 PM, François Méthot 
wrote:

> Hi Sudheesh,
>
>   If I add selection filter so that no row are returned, the same problem
> occur. I also simplified the query to include only few integer columns.
>
> That particular data repo is ~200+ Billions records spread over ~50 000
> parquet files.
>
> We have other CSV data repo that are 100x smaller that does not trigger
> this issue.
>
>
> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
> There is also a bizarre case where the node that is reported as lost is the
> node itself.
> Yes, the stack trace is from the ticket, It did occurred once or twice (in
> the many many attempts) that it was the node itself.
>
> + Is there a spike in memory usage of the Drillbit this is the Foreman for
> the query (process memory, not just heap)?
> We don't notice any unusual spike, each nodes gets busy in the same range
> when query is running.
>
> I tried running with 8GB/20GB and 4GB/24GB heap/off-heap, did not see any
> improvement.
>
>
> We will update from 1.7 to 1.8 before going ahead with more investigation.
>
> Thanks a lot.
>
>
>
>
>
>
>
>
>
>
> On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam 
> wrote:
>
>> Hi Francois,
>>
>> A simple query with only projections is not an “ideal” use case, since
>> Drill is bound by how fast the client can consume records. There are 1000
>> scanners sending data to 1 client (vs far fewer scanners sending data in
>> the 12 node case).
>>
>> This might increase the load on the Drillbit that is the Foreman for the
>> query. In the query profile, the scanners should be spending a lot more
>> time “waiting” to send records to the client (via root fragment).
>> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
>> There is also a bizarre case where the node that is reported as lost is the
>> node itself.
>> + Is there a spike in memory usage of the Drillbit this is the Foreman
>> for the query (process memory, not just heap)?
>>
>> Regarding the warnings ...
>>
>> > 2016-09-19 

Re: ZK lost connectivity issue on large cluster

2016-09-19 Thread François Méthot
Hi Sudheesh,

  If I add selection filter so that no row are returned, the same problem
occur. I also simplified the query to include only few integer columns.

That particular data repo is ~200+ Billions records spread over ~50 000
parquet files.

We have other CSV data repo that are 100x smaller that does not trigger
this issue.


+ Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
There is also a bizarre case where the node that is reported as lost is the
node itself.
Yes, the stack trace is from the ticket, It did occurred once or twice (in
the many many attempts) that it was the node itself.

+ Is there a spike in memory usage of the Drillbit this is the Foreman for
the query (process memory, not just heap)?
We don't notice any unusual spike, each nodes gets busy in the same range
when query is running.

I tried running with 8GB/20GB and 4GB/24GB heap/off-heap, did not see any
improvement.


We will update from 1.7 to 1.8 before going ahead with more investigation.

Thanks a lot.

On Mon, Sep 19, 2016 at 1:19 PM, Sudheesh Katkam 
wrote:

> Hi Francois,
>
> A simple query with only projections is not an “ideal” use case, since
> Drill is bound by how fast the client can consume records. There are 1000
> scanners sending data to 1 client (vs far fewer scanners sending data in
> the 12 node case).
>
> This might increase the load on the Drillbit that is the Foreman for the
> query. In the query profile, the scanners should be spending a lot more
> time “waiting” to send records to the client (via root fragment).
> + Is atsqa4-133.qa.lab [1] the Foreman node for the query in this case?
> There is also a bizarre case where the node that is reported as lost is the
> node itself.
> + Is there a spike in memory usage of the Drillbit this is the Foreman for
> the query (process memory, not just heap)?
>
> Regarding the warnings ...
>
> > 2016-09-19 13:31:56,866 [BitServer-7] WARN
> > o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc
> type
> > 6 took longer than 500 ms. Actual Duration was 16053ms.
>
>
> RPC type 6 is a cancellation request; DRILL-4766 [2] should help in this
> case, which is resolved in the latest version of Drill. So as Chun
> suggested, upgrade the cluster to the latest version of Drill.
>
> > 2016-09-19 14:15:33,357 [BitServer-4] WARN
> > o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc
> type
> > 1 took longer than 500 ms. Actual Duration was 981ms.
>
> I am surprised that responses are taking that long to handle.
> + Are both messages on the same Drillbit?
>
> The other warnings can be ignored.
>
> Thank you,
> Sudheesh
>
> [1] I just realized that atsqa4-133.qa.lab is in one of our test
> environments :)
> [2] https://issues.apache.org/jira/browse/DRILL-4766 <
> https://issues.apache.org/jira/browse/DRILL-4766>
>
> > On Sep 19, 2016, at 9:15 AM, François Méthot 
> wrote:
> >
> > Hi Sudheesh,
> >
> >
> > + Does the query involve any aggregations or filters? Or is this a select
> > query with only projections?
> > Simple query with only projections
> >
> > + Any suspicious timings in the query profile?
> > Nothing specially different than our working query on our small cluster.
> >
> > + Any suspicious warning messages in the logs around the time of failure
> on
> > any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one
> > (“..” are place holders):
> >  Message of mode .. of rpc type .. took longer than ..ms.  Actual
> duration
> > was ..ms.
> >
> > Well we do see this warning on the failing node (on my last test), I
> found
> > this WARNING in our log for the past month for pretty much every node I
> > checked.
> > 2016-09-19 13:31:56,866 [BitServer-7] WARN
> > o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc
> type
> > 6 took longer than 500 ms. Actual Duration was 16053ms.
> > 2016-09-19 14:15:33,357 [BitServer-4] WARN
> > o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc
> type
> > 1 took longer than 500 ms. Actual Duration was 981ms.
> >
> > We really appreciate your help. I will dig in the source code for when
> and
> > why this error happen.
> >
> >
> > Francois
> >
> > P.S.:
> > We do see this also:
> > 2016-09-19 14:48:23,444 [drill-executor-9] WARN
> > o.a.d.exec.rpc.control.WorkEventBus - Fragment ..:1:2 not found in
> the
> > work bus.
> > 2016-09-19 14:48:23,444 [drill-executor-11] WARN
> > o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:222 not found in
> the
> > work bus.
> > 2016-09-19 14:

Re: ZK lost connectivity issue on large cluster

2016-09-19 Thread François Méthot
Hi Sudheesh,


+ Does the query involve any aggregations or filters? Or is this a select
query with only projections?
Simple query with only projections

+ Any suspicious timings in the query profile?
Nothing specially different than our working query on our small cluster.

+ Any suspicious warning messages in the logs around the time of failure on
any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one
(“..” are place holders):
  Message of mode .. of rpc type .. took longer than ..ms.  Actual duration
was ..ms.

Well, we do see this warning on the failing node (in my last test); in fact
I found this WARNING in our logs over the past month on pretty much every
node I checked.
2016-09-19 13:31:56,866 [BitServer-7] WARN
o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type
6 took longer than 500 ms. Actual Duration was 16053ms.
2016-09-19 14:15:33,357 [BitServer-4] WARN
o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type
1 took longer than 500 ms. Actual Duration was 981ms.

We really appreciate your help. I will dig into the source code to find
when and why this error happens.


Francois

P.S.:
We do see this also:
2016-09-19 14:48:23,444 [drill-executor-9] WARN
o.a.d.exec.rpc.control.WorkEventBus - Fragment ..:1:2 not found in the
work bus.
2016-09-19 14:48:23,444 [drill-executor-11] WARN
o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:222 not found in the
work bus.
2016-09-19 14:48:23,444 [drill-executor-12] WARN
o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:442 not found in the
work bus.
2016-09-19 14:48:23,444 [drill-executor-10] WARN
o.a.d.exec.rpc.control.WorkEventBus - Fragment :1:662 not found in the
work bus.




On Sun, Sep 18, 2016 at 2:57 PM, Sudheesh Katkam 
wrote:

> Hi Francois,
>
> More questions..
>
> > + Can you share the query profile?
> >   I will sum it up:
> >  It is a select on 18 columns: 9 string, 9 integers.
> >  Scan is done on 13862 parquet files spread  on 1000 fragments.
> >  Fragments are spread accross 215 nodes.
>
> So ~5 leaf fragments (or scanners) per Drillbit seems fine.
>
> + Does the query involve any aggregations or filters? Or is this a select
> query with only projections?
> + Any suspicious timings in the query profile?
> + Any suspicious warning messages in the logs around the time of failure
> on any of the drillbits? Specially on atsqa4-133.qa.lab? Specially this one
> (“..” are place holders):
>   Message of mode .. of rpc type .. took longer than ..ms.  Actual
> duration was ..ms.
>
> Thank you,
> Sudheesh
>
> > On Sep 15, 2016, at 11:27 AM, François Méthot 
> wrote:
> >
> > Hi Sudheesh,
> >
> > + How many zookeeper servers in the quorum?
> > The quorum has 3 servers, everything looks healthy
> >
> > + What is the load on atsqa4-133.qa.lab when this happens? Any other
> > applications running on that node? How many threads is the Drill process
> > using?
> > The load on the failing node(8 cores) is 14, when Drill is running. Which
> > is nothing out of the ordinary according to our admin.
> > HBase is also running.
> > planner.width.max_per_node is set to 8
> >
> > + When running the same query on 12 nodes, is the data size same?
> > Yes
> >
> > + Can you share the query profile?
> >   I will sum it up:
> >  It is a select on 18 columns: 9 string, 9 integers.
> >  Scan is done on 13862 parquet files spread  on 1000 fragments.
> >  Fragments are spread accross 215 nodes.
> >
> >
> > We are in process of increasing our Zookeeper session timeout config to
> see
> > if it helps.
> >
> > thanks
> >
> > Francois
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam 
> > wrote:
> >
> >> Hi Francois,
> >>
> >> Few questions:
> >> + How many zookeeper servers in the quorum?
> >> + What is the load on atsqa4-133.qa.lab when this happens? Any other
> >> applications running on that node? How many threads is the Drill process
> >> using?
> >> + When running the same query on 12 nodes, is the data size same?
> >> + Can you share the query profile?
> >>
> >> This may not be the right thing to do, but for now, If the cluster is
> >> heavily loaded, increase the zk timeout.
> >>
> >> Thank you,
> >> Sudheesh
> >>
> >>> On Sep 14, 2016, at 11:53 AM, François Méthot 
> >> wrote:
> >>>
> >>> We are running 1.7.
> >>> The log were taken from the jira tickets.
> >>>
> >>> We will try out 1.8 soon.
> >>>
> >>&g

Re: ZK lost connectivity issue on large cluster

2016-09-15 Thread François Méthot
Hi Sudheesh,

+ How many zookeeper servers in the quorum?
The quorum has 3 servers, everything looks healthy

+ What is the load on atsqa4-133.qa.lab when this happens? Any other
applications running on that node? How many threads is the Drill process
using?
The load on the failing node (8 cores) is 14 when Drill is running, which
is nothing out of the ordinary according to our admin.
HBase is also running.
planner.width.max_per_node is set to 8

+ When running the same query on 12 nodes, is the data size same?
Yes

+ Can you share the query profile?
   I will sum it up:
   It is a select on 18 columns: 9 strings, 9 integers.
   The scan is done on 13862 parquet files spread over 1000 fragments.
   Fragments are spread across 215 nodes.


We are in the process of increasing our ZooKeeper session timeout config to
see if it helps.
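
One detail to keep in mind while tuning (ZooKeeper's documented negotiation
rule, sketched here with assumed default settings): the server clamps the
client's requested session timeout, so raising it on the client side alone
may have no effect.

```python
# ZooKeeper clamps the client's requested session timeout between
# minSessionTimeout (default 2 * tickTime) and maxSessionTimeout
# (default 20 * tickTime).
def negotiated_session_timeout(requested_ms, tick_time_ms=2000,
                               min_ms=None, max_ms=None):
    lo = min_ms if min_ms is not None else 2 * tick_time_ms
    hi = max_ms if max_ms is not None else 20 * tick_time_ms
    return max(lo, min(requested_ms, hi))

# With the default tickTime of 2000 ms, even a 60 s request is
# negotiated down to a 40 s cap.
print(negotiated_session_timeout(60_000))  # -> 40000
```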

thanks

Francois







On Wed, Sep 14, 2016 at 4:40 PM, Sudheesh Katkam 
wrote:

> Hi Francois,
>
> Few questions:
> + How many zookeeper servers in the quorum?
> + What is the load on atsqa4-133.qa.lab when this happens? Any other
> applications running on that node? How many threads is the Drill process
> using?
> + When running the same query on 12 nodes, is the data size same?
> + Can you share the query profile?
>
> This may not be the right thing to do, but for now, If the cluster is
> heavily loaded, increase the zk timeout.
>
> Thank you,
> Sudheesh
>
> > On Sep 14, 2016, at 11:53 AM, François Méthot 
> wrote:
> >
> > We are running 1.7.
> > The log were taken from the jira tickets.
> >
> > We will try out 1.8 soon.
> >
> >
> >
> >
> > On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang  wrote:
> >
> >> Looks like you are running 1.5. I believe there are some work done in
> that
> >> area and the newer release should behave better.
> >>
> >> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot 
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>>  We are trying to find a solution/workaround to issue:
> >>>
> >>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
> >>> o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
> >>> One more more nodes lost connectivity during query.  Identified nodes
> >>> were [atsqa4-133.qa.lab:31010].
> >>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> >>> ForemanException: One more more nodes lost connectivity during query.
> >>> Identified nodes were [atsqa4-133.qa.lab:31010].
> >>>at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.
> >>> close(Foreman.java:746)
> >>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
> >>> processEvent(Foreman.java:858)
> >>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
> >>> processEvent(Foreman.java:790)
> >>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
> >>> moveToState(Foreman.java:792)
> >>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>at org.apache.drill.exec.work.foreman.Foreman.moveToState(
> >>> Foreman.java:909)
> >>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>at org.apache.drill.exec.work.foreman.Foreman.access$2700(
> >>> Foreman.java:110)
> >>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>at org.apache.drill.exec.work.foreman.Foreman$StateListener.
> >>> moveToState(Foreman.java:1183)
> >>> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >>>
> >>>
> >>> DRILL-4325  <https://issues.apache.org/jira/browse/DRILL-4325>
> >>> ForemanException:
> >>> One or more nodes lost connectivity during query
> >>>
> >>>
> >>>
> >>> Any one experienced this issue ?
> >>>
> >>> It happens when running query involving many parquet files on a cluster
> >> of
> >>> 200 nodes. Same query on a smaller cluster of 12 nodes runs fine.
> >>>
> >>> It is not caused by garbage collection, (checked on both ZK node and
> the
> >>> involved drill bit).
> >>>
> >>> Negotiated max session timeout is 40 seconds.
> >>>
> >>> The sequence seems:
> >>> - Drill Query begins, using an existing ZK session.
> >>> - Drill Zk session timeouts
> >>>  - perhaps it was writing something that took too long
> >>> - Drill attempts to renew session
> >>>   - drill believes that the write operation failed, so it attempts
> >> to
> >>> re-create the zk node, which trigger another exception.
> >>>
> >>> We are open to any suggestion. We will report any finding.
> >>>
> >>> Thanks
> >>> Francois
> >>>
> >>
>
>


Re: Proposed changes for DRILL-3178

2016-09-14 Thread François Méthot
Hi Zelaine,

  I don't have the assign button available. I must be missing some
privilege.

This is the only group I am part of:
Groups: jira-users

Thanks for your quick reply

On Wed, Sep 14, 2016 at 2:57 PM, Zelaine Fong  wrote:

> Francois,
>
> Yes, feel free to assign the Jira to yourself and post a pull request.
>
> -- Zelaine
>
> On Wed, Sep 14, 2016 at 11:49 AM, François Méthot 
> wrote:
>
> > Hi,
> >
> >   I have on my local repo a fix for
> >
> > https://issues.apache.org/jira/browse/DRILL-3178
> >  csv reader should allow newlines inside quotes
> >
> > Can I be assigned this ticket so I can submit my proposed change?
> >
> > Francois
> >
>


Re: ZK lost connectivity issue on large cluster

2016-09-14 Thread François Méthot
We are running 1.7.
The logs were taken from the jira tickets.

We will try out 1.8 soon.




On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang  wrote:

> Looks like you are running 1.5. I believe there are some work done in that
> area and the newer release should behave better.
>
> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot 
> wrote:
>
> > Hi,
> >
> >   We are trying to find a solution/workaround to issue:
> >
> > 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
> > o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
> > One more more nodes lost connectivity during query.  Identified nodes
> > were [atsqa4-133.qa.lab:31010].
> > org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> > ForemanException: One more more nodes lost connectivity during query.
> > Identified nodes were [atsqa4-133.qa.lab:31010].
> > at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.
> > close(Foreman.java:746)
> > [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> > at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
> > processEvent(Foreman.java:858)
> > [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> > at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
> > processEvent(Foreman.java:790)
> > [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> > at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.
> > moveToState(Foreman.java:792)
> > [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> > at org.apache.drill.exec.work.foreman.Foreman.moveToState(
> > Foreman.java:909)
> > [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> > at org.apache.drill.exec.work.foreman.Foreman.access$2700(
> > Foreman.java:110)
> > [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> > at org.apache.drill.exec.work.foreman.Foreman$StateListener.
> > moveToState(Foreman.java:1183)
> > [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> >
> >
> > DRILL-4325  <https://issues.apache.org/jira/browse/DRILL-4325>
> > ForemanException:
> > One or more nodes lost connectivity during query
> >
> >
> >
> > Any one experienced this issue ?
> >
> > It happens when running query involving many parquet files on a cluster
> of
> > 200 nodes. Same query on a smaller cluster of 12 nodes runs fine.
> >
> > It is not caused by garbage collection, (checked on both ZK node and the
> > involved drill bit).
> >
> > Negotiated max session timeout is 40 seconds.
> >
> > The sequence seems:
> > - Drill Query begins, using an existing ZK session.
> > - Drill Zk session timeouts
> >   - perhaps it was writing something that took too long
> > - Drill attempts to renew session
> >- drill believes that the write operation failed, so it attempts
> to
> > re-create the zk node, which trigger another exception.
> >
> >  We are open to any suggestion. We will report any finding.
> >
> > Thanks
> > Francois
> >
>


Proposed changes for DRILL-3178

2016-09-14 Thread François Méthot
Hi,

  I have on my local repo a fix for

https://issues.apache.org/jira/browse/DRILL-3178
 csv reader should allow newlines inside quotes

Can I be assigned this ticket so I can submit my proposed change?

Francois


ZK lost connectivity issue on large cluster

2016-09-14 Thread François Méthot
Hi,

  We are trying to find a solution/workaround to issue:

2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR
o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException:
One more more nodes lost connectivity during query.  Identified nodes
were [atsqa4-133.qa.lab:31010].
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
ForemanException: One more more nodes lost connectivity during query.
Identified nodes were [atsqa4-133.qa.lab:31010].
at 
org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746)
[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at 
org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858)
[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at 
org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790)
[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at 
org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792)
[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at 
org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909)
[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at 
org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110)
[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at 
org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183)
[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]


DRILL-4325  ForemanException:
One or more nodes lost connectivity during query



Has anyone experienced this issue?

It happens when running a query involving many parquet files on a cluster of
200 nodes. The same query on a smaller cluster of 12 nodes runs fine.

It is not caused by garbage collection (checked on both the ZK node and the
involved drillbit).

Negotiated max session timeout is 40 seconds.

The sequence seems to be:
- Drill query begins, using an existing ZK session.
- Drill's ZK session times out
  - perhaps it was writing something that took too long
- Drill attempts to renew the session
   - Drill believes that the write operation failed, so it attempts to
re-create the ZK node, which triggers another exception.

 We are open to any suggestion. We will report any finding.

Thanks
Francois
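The failure sequence above — a write that actually landed server-side, a session timeout before the ack arrived, then a failing re-create — is a classic ambiguous-failure hazard. A small self-contained simulation makes it concrete (this illustrates the general hazard only, not Drill's or Curator's actual code; every name below is made up):

```python
class NodeExistsError(Exception):
    """Raised when a path is created twice, like ZK's NodeExists."""


class FlakyStore:
    """In-memory stand-in for a ZK ensemble that can lose one ack."""

    def __init__(self):
        self.nodes = set()
        self.drop_next_ack = False  # simulate a session timeout after commit

    def create(self, path):
        if path in self.nodes:
            raise NodeExistsError(path)
        self.nodes.add(path)            # the write is committed server-side...
        if self.drop_next_ack:
            self.drop_next_ack = False
            raise TimeoutError(path)    # ...but the client never sees the ack


def create_idempotent(store, path):
    """Retry a create, treating 'already exists' after a timeout as success."""
    try:
        store.create(path)
    except TimeoutError:
        try:
            store.create(path)          # retry after the ambiguous failure
        except NodeExistsError:
            pass                        # our earlier write actually landed
```

A naive retry that treats `NodeExistsError` as fatal reproduces exactly the "re-create the ZK node, which triggers another exception" step, while the idempotent version recovers cleanly.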


Re: Permission denied for queries on individual file.

2016-04-28 Thread François Méthot
Thanks for trying it out.

In our case we don't use impersonation but we do have security enabled.

We will keep investigating, It could be a combination of

Cloudera/HDFS/Drill Security
+ the fact that when we moved to 1.6, we also downgraded from Java 8 _31
 to Java 7 _60 because of this issue: DRILL-4609
<https://issues.apache.org/jira/browse/DRILL-4609>

I will see if I can reproduce the issue on a separate Cloudera cluster; it
will take time but I will report my findings.


Francois

On Wed, Apr 27, 2016 at 4:56 PM, Abhishek Girish 
wrote:

> Hello Francois,
>
> I tried this on latest master (1.7.0) and on a MapR cluster. Was able to
> query an individual file with just read permissions:
>
> > select c_first_name from dfs.tmp.`customer.parquet` limit 1;
> +---+
> | c_first_name  |
> +---+
> | Javier|
> +---+
> 1 row selected (1.1 seconds)
>
>
> # hadoop fs -ls /tmp/customer.parquet
> -r--r--r--   3 root root7778841 2016-04-27 13:50 /tmp/customer.parquet
>
> Note: I did not use impersonation or have security enabled on my cluster.
>
> -Abhishek
>
> On Wed, Apr 27, 2016 at 12:24 PM, François Méthot 
> wrote:
>
> > Has anyone experienced the same issue?  We are Using HDFS managed by
> > Cloudera Manager.
> >
> > A simple upgrade to 1.6 caused queries done directly on individual files
> to
> > fail with "Permission denied: user=drill, access=EXECUTE" error.
> > Also, Using filter with "dir0" file structure also causes the issue to
> > happen.
> >ex: select col1 from hdfs.`/datasrc/` where dir0>= 1234567;
> >
> > We ended up giving "execute" access to all the data file.
> >
> > We would really like to know if this is the intend to have drill to
> expect
> > execute access permission on data files.
> >
> > Thanks
> >
> > On Tue, Apr 26, 2016 at 11:23 AM, François Méthot 
> > wrote:
> >
> > > Hi,
> > >
> > >   We just switched to version 1.6. Using java 1.7_60
> > >
> > > We noticed that we can no longer query individual files stored in HDFS
> > > from CLI and WebUI.
> > >
> > > select col1 from hdfs.`/data/file1.parquet`;
> > >
> > > Error: SYSTEM ERROR: RemoteException: Permission denied: user=drill,
> > > access=EXECUTE inode=/data/file1.parquet":mygroup:drill:-rw-rw-r--
> > >
> > > If we give execution permission to the file
> > >
> > > hdfs fs -chmod +x /data/file1.parquet
> > >
> > > Then the query works.
> > >
> > > If we query the parent folder (hdfs.`/data/`), the query works as well.
> > >
> > > Is it the expected behavior in 1.6?
> > >
> > > Francois
> > >
> > >
> >
>


Re: Permission denied for queries on individual file.

2016-04-27 Thread François Méthot
Has anyone experienced the same issue? We are using HDFS managed by
Cloudera Manager.

A simple upgrade to 1.6 caused queries run directly on individual files to
fail with a "Permission denied: user=drill, access=EXECUTE" error.
Using a filter on the "dir0" directory structure also triggers the issue:
   ex: select col1 from hdfs.`/datasrc/` where dir0>= 1234567;

We ended up giving "execute" access to all the data files.

We would really like to know whether it is intended for Drill to require
execute permission on data files.

Thanks

On Tue, Apr 26, 2016 at 11:23 AM, François Méthot 
wrote:

> Hi,
>
>   We just switched to version 1.6. Using java 1.7_60
>
> We noticed that we can no longer query individual files stored in HDFS
> from CLI and WebUI.
>
> select col1 from hdfs.`/data/file1.parquet`;
>
> Error: SYSTEM ERROR: RemoteException: Permission denied: user=drill,
> access=EXECUTE inode=/data/file1.parquet":mygroup:drill:-rw-rw-r--
>
> If we give execution permission to the file
>
> hdfs fs -chmod +x /data/file1.parquet
>
> Then the query works.
>
> If we query the parent folder (hdfs.`/data/`), the query works as well.
>
> Is it the expected behavior in 1.6?
>
> Francois
>
>


Permission denied for queries on individual file.

2016-04-26 Thread François Méthot
Hi,

  We just switched to version 1.6, using Java 1.7_60.

We noticed that we can no longer query individual files stored in HDFS from
CLI and WebUI.

select col1 from hdfs.`/data/file1.parquet`;

Error: SYSTEM ERROR: RemoteException: Permission denied: user=drill,
access=EXECUTE inode=/data/file1.parquet":mygroup:drill:-rw-rw-r--

If we give execute permission to the file:

hdfs dfs -chmod +x /data/file1.parquet

Then the query works.

If we query the parent folder (hdfs.`/data/`), the query works as well.

Is it the expected behavior in 1.6?

Francois
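As a side note, the mode string in the error above ("-rw-rw-r--") carries no execute bit for any class, which is exactly what an access=EXECUTE check would trip on. A tiny helper shows how such ls-style strings decode (illustrative only; the function and its names are not part of Drill or Hadoop):

```python
def has_execute(mode: str, who: str) -> bool:
    """Return True if `who` ('user', 'group', or 'other') has the execute
    bit in an ls-style mode string such as '-rw-rw-r--'.

    The string is one type character followed by three rwx triplets;
    the x flag sits at the third position of each triplet."""
    offset = {"user": 1, "group": 4, "other": 7}[who]
    return mode[offset + 2] == "x"
```

With the mode from the error, `has_execute("-rw-rw-r--", "user")` is False, which matches the observation that `chmod +x` made the query work.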


Re: Wrong result in select with multiple identical UDF call

2016-04-18 Thread François Méthot
The issue where "select true, true, true from table" won't output
true,true,true on all rows can be reproduced using:
JDK 1.8.0_31 and Drill 1.6
JDK 1.8.0_31 and Drill 1.5


"select name, ilike(name, 'jack'), ilike(name, 'jack'), ilike(name,
'jack'), ilike(name, 'jack'), ilike(name, 'jack') from hdfs.`/data/` where
ilike(name, 'jack')" won't output true on all column on :
JDK 1.8.0_45 and Drill 1.5
JDK 1.8.0_31 and Drill 1.6
JDK 1.8.0_31 and Drill 1.5



It is not reproducible on JDK 1.7.

It looks like a problem with the specific JDK we were using.

For now, we will stay away from 1.8.
we will test with the most recent version of 1.8 when we get the chance.


Francois


On Fri, Apr 15, 2016 at 10:57 AM, François Méthot 
wrote:

> We dig down the problem even more and we now have a reproducible issue
> even without using UDFs:
>
> https://issues.apache.org/jira/browse/DRILL-4609
>
> Doing a simple "select true, true, true from table" won't output
> true,true,true on all generated rows.
>
> We hope that fixing this first should resolve the inconsistency we see
> using the UDFs.
>
> Thanks
>
>
>
> On Thu, Apr 14, 2016 at 1:20 PM, François Méthot 
> wrote:
>
>> I was able to reproduce this on 1.5 running on a cluster
>> and on 1.6 in embedded mode.
>>
>> Within a single select, if I select the same udf(value) multiple time,
>> different result may get outputted for each columns.
>>
>> ex:
>> select name, ilike(name, 'jack'), ilike(name, 'jack'), ilike(name,
>> 'jack'), ilike(name, 'jack'), ilike(name, 'jack') from hdfs.`/data/` where
>> ilike(name, 'jack');
>>
>> I get
>>
>> jack | false | true | false
>> jack | true | true | true
>> jack | true | true | false
>> .
>> most of them are jack | true | true | true
>>
>> I observed this on parquet files as well as CSV file. I restart drill,
>> perform the query and it happens. Sometime it does not!
>>
>>
>>
>> If I do
>> select count(1) from hdfs.`/data/` where ilike(name, 'jack') = true;
>> or
>> select count(1) from hdfs.`/data/` where ilike(name, 'jack') = true and
>> like(name, 'jack') = true and like(name, 'jack') = true and like(name,
>> 'jack') = true;
>>
>> The count is always the same, which is good. It looks like the select
>> part is crippled with some issue.
>>
>> Francois
>> P.S. I ended up doing these weird tests because I was getting those same
>> inconsistent result from my own UDF, at some point I started testing the
>> built-in UDF in drill for my own sanity because I could see what could be
>> wrong with my code...
>>
>>
>>
>>
>>
>>
>>
>>
>


Re: Wrong result in select with multiple identical UDF call

2016-04-15 Thread François Méthot
We dug into the problem further and now have a reproducible issue even
without using UDFs:

https://issues.apache.org/jira/browse/DRILL-4609

Doing a simple "select true, true, true from table" won't output
true,true,true on all generated rows.

We hope that fixing this first will resolve the inconsistency we see when
using the UDFs.

Thanks



On Thu, Apr 14, 2016 at 1:20 PM, François Méthot 
wrote:

> I was able to reproduce this on 1.5 running on a cluster
> and on 1.6 in embedded mode.
>
> Within a single select, if I select the same udf(value) multiple time,
> different result may get outputted for each columns.
>
> ex:
> select name, ilike(name, 'jack'), ilike(name, 'jack'), ilike(name,
> 'jack'), ilike(name, 'jack'), ilike(name, 'jack') from hdfs.`/data/` where
> ilike(name, 'jack');
>
> I get
>
> jack | false | true | false
> jack | true | true | true
> jack | true | true | false
> .
> most of them are jack | true | true | true
>
> I observed this on parquet files as well as CSV file. I restart drill,
> perform the query and it happens. Sometime it does not!
>
>
>
> If I do
> select count(1) from hdfs.`/data/` where ilike(name, 'jack') = true;
> or
> select count(1) from hdfs.`/data/` where ilike(name, 'jack') = true and
> like(name, 'jack') = true and like(name, 'jack') = true and like(name,
> 'jack') = true;
>
> The count is always the same, which is good. It looks like the select part
> is crippled with some issue.
>
> Francois
> P.S. I ended up doing these weird tests because I was getting those same
> inconsistent result from my own UDF, at some point I started testing the
> built-in UDF in drill for my own sanity because I could see what could be
> wrong with my code...
>
>
>
>
>
>
>
>


Wrong result in select with multiple identical UDF call

2016-04-14 Thread François Méthot
I was able to reproduce this on 1.5 running on a cluster
and on 1.6 in embedded mode.

Within a single select, if I select the same udf(value) multiple times,
different results may be output for each column.

ex:
select name, ilike(name, 'jack'), ilike(name, 'jack'), ilike(name, 'jack'),
ilike(name, 'jack'), ilike(name, 'jack') from hdfs.`/data/` where
ilike(name, 'jack');

I get

jack | false | true | false
jack | true | true | true
jack | true | true | false
.
most of them are jack | true | true | true

I observed this on parquet files as well as CSV files. I restart Drill,
perform the query, and it happens. Sometimes it does not!



If I do
select count(1) from hdfs.`/data/` where ilike(name, 'jack') = true;
or
select count(1) from hdfs.`/data/` where ilike(name, 'jack') = true and
like(name, 'jack') = true and like(name, 'jack') = true and like(name,
'jack') = true;

The count is always the same, which is good. It looks like the select part
is affected by some issue.

Francois
P.S. I ended up doing these weird tests because I was getting those same
inconsistent results from my own UDF; at some point I started testing the
built-in UDFs in Drill for my own sanity, because I couldn't see what could
be wrong with my code...


Re: Can this scenario cause a query to hang ?

2016-04-08 Thread François Méthot
It might just add to the mystery of this issue, but when we start getting
those hanging CTAS queries, restarting our Drill cluster makes the problem
go away.

Next time we start getting this problem I will try to collect the jstack
output of the foreman too.

Thanks for looking into this.

Francois



On Fri, Apr 8, 2016 at 2:20 AM, Abdel Hakim Deneche 
wrote:

> Opened DRILL-4595 [1]  to track this issue.
>
> Thanks
>
> [1] https://issues.apache.org/jira/browse/DRILL-4595
>
> On Fri, Apr 8, 2016 at 6:42 AM, Abdel Hakim Deneche  >
> wrote:
>
> > Hey John, thanks for sharing your experience. If you see this again try
> > collecting the jstack output for the foreman node of the query, and also
> > check in the query profile which fragments are still marked as RUNNING.
> >
> > Thanks
> >
> > On Thu, Apr 7, 2016 at 2:29 PM, John Omernik  wrote:
> >
> >> Abdel -
> >>
> >> I think I've seen this on a MapR cluster I run, especially on CTAS.  For
> >> me, I have not brought it up because the cluster I am running on has
> some
> >> serious personal issues (like being hardware that's near 7 years old,
> its
> >> a
> >> test cluster) and given the "hard to reproduce" nature of the problem,
> >> I've
> >> been reluctant to create noise. Given what you've described, it seems
> very
> >> similar to CTAS hangs I've seen, but couldn't accurately reproduce.
> >>
> >> This didn't add much to your post, but I wanted to give you a +1 for
> >> outlining this potential problem.  Once I move to more robust hardware,
> >> and
> >> I am in similar situations, I will post more verbose details from my
> side.
> >>
> >> John
> >>
> >>
> >>
> >> On Thu, Apr 7, 2016 at 2:29 AM, Abdel Hakim Deneche <
> >> adene...@maprtech.com>
> >> wrote:
> >>
> >> > So, we've been seeing some queries hang, I've come up with a possible
> >> > explanation, but so far it's really difficult to reproduce. Let me
> know
> >> if
> >> > you think this explanation doesn't hold up or if you have any ideas
> how
> >> we
> >> > can reproduce it. Thanks
> >> >
> >> > - generally it's a CTAS running on a large cluster (lot's of writers
> >> > running in parallel)
> >> > - logs show that the user channel was closed and UserServer caused the
> >> root
> >> > fragment to move to a FAILED state [1]
> >> > - jstack shows that the root fragment is blocked in it's receiver
> >> waiting
> >> > for data [2]
> >> > - jstack also shows that ALL other fragments are no longer running,
> and
> >> the
> >> > logs show that all of them succeeded [3]
> >> > - the foreman waits *forever* for the root fragment to finish
> >> >
> >> > [1] the only case I can think off is when the user channel closed
> while
> >> the
> >> > fragment was waiting for an ack from the user client
> >> > [2] if a writer finishes earlier than the others, it will send a data
> >> batch
> >> > to the root fragment that will be sent to the user. The root will then
> >> > immediately block on it's receiver waiting for the remaining writers
> to
> >> > finish
> >> > [3] once the root fragment moves to a failed state, the receiver will
> >> > immediately release any received batch and return an OK to the sender
> >> > without putting the batch in it's blocking queue.
> >> >
> >> > Abdelhakim Deneche
> >> >
> >> > Software Engineer
> >> >
> >> >   
> >> >
> >> >
> >> > Now Available - Free Hadoop On-Demand Training
> >> > <
> >> >
> >>
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >> > >
> >> >
> >>
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   
>
>
> Now Available - Free Hadoop On-Demand Training
> <
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >
>


Re: CTAS query stuck

2016-04-05 Thread François Méthot
We restarted our Drill Cluster and the problem disappeared.

Our queries so far complete successfully without getting stuck.

Francois




On Mon, Apr 4, 2016 at 1:41 PM, François Méthot  wrote:

> Hi,
>
> Using Drill 1.5, 13 nodes (6 cores each, Max Dir. Mem: 32GB, Max Heap 8
> GB).
>
>   Sometime our CTAS queries gets "stuck". It runs for few hours or days
> and in the Fragment profiles graph, we see that the query is waiting on a
> single fragment ( for what appears to be a  single horizontal line in the
> graph ). When this happens, if we hit CTL-C on the console that initiated
> the query, everything returns normally and the expected table is created
> and valid.
>
> Once when the query seemed stuck after 1 day, we let it go, and after 5
> days it returned.
>
> We see this behavior with a query returning 0, 1 or millions of row.
>
> There was a post on this last year, I I can't find how or if the issue was
> resolved:
>
> http://mail-archives.apache.org/mod_mbox/drill-user/201505.mbox/%3cac7542f0-de0f-42e4-9858-0a1f0983b...@maprtech.com%3E
>
>
> Francois
>
>
>
>
>
>


Re: Simple query on 150 billion records

2016-04-05 Thread François Méthot
What I ended up doing is restart our Drill cluster.

The same query ran in 19 minutes, scanning the same number of rows (~79
billion).

So it looks like after a long period of uptime and heavy usage, our Drill
cluster gets into a certain state and becomes difficult to work with.

Until we find a better solution, we might just have to restart Drill every
morning, and whenever we encounter certain types of query error or
performance degradation.

Has anyone else noticed something similar?
Is there a specific query error that could lead to performance and
stability issues and would inevitably require a restart?


Thanks for help
On Tue, Apr 5, 2016 at 9:18 AM, Darshan Singh 
wrote:

> Hi,
>
> How much data  you got from this query
>
> create table ANALYSIS_RESULT as (
> select Int32Field1 as SECONDS from hdfs.`/data/` where Int32Field2=123456
> or Int32Field2=4567898);
>
> As per your email you said single record.Also, in this query you used
> Int32Field1 as Seconds whereas in the first query it was just seconds.Are
> these same fields or do you have some sort for conversion for these fields
> in first query.
>
> A plan would be grateful as well.
>
> Thanks
>
> On Mon, Apr 4, 2016 at 3:09 PM, François Méthot 
> wrote:
>
> > Hi,
> >
> >   Querying 150 Billion records spread over ~21 000 parquets stored in
> hdfs
> > on 13 nodes (6 cores each, Max Dir. Mem: 32GB, Max Heap 8 GB).
> >
> > Is their a known issue or drill limitation that would explain why the
> first
> > query below can't return the expected single row and aggregation ?
> >
> > create table ANALYSIS_RESULT as (
> > select to_date(to_timestamp((SECONDS)), count(1)
> > from hdfs.`/data/
> > where Int32Field2=123456 or Int32Field2=4567898
> > group by to_date(to_timestamp((SECONDS)));
> >
> > After *20 hours*, SYSTEM ERROR: Foreman Exception: One more more nodes
> lost
> > connectivity during query.
> >
> >
> > If we do the query in 2 steps:
> > create table ANALYSIS_RESULT as (
> > select Int32Field1 as SECONDS from hdfs.`/data/` where Int32Field2=123456
> > or Int32Field2=4567898);
> >
> > result was returned in *43 minutes* ( a single record ).
> >
> > select to_date(to_timestamp((SECONDS)), count(1)
> > from ANALYSIS_RESULT
> > group by to_date(to_timestamp((SECONDS));
> >
> > Aggregation of that single record is of course done in  < 1 second.
> >2016-04-04  1
> >
> >
> >
> > I also tried
> > select to_date(to_timestamp((SECONDS)), count(1)  from (
> > select Int32Field1 as SECONDS
> > from hdfs.`/data/`
> > where Int32Field2=123456 or Int32Field2=4567898)
> > group by o_date(to_timestamp((SECONDS))
> >
> > Same thing: After *21 hours*, SYSTEM ERROR: Foreman Exception: One more
> > more nodes lost connectivity during query.
> >
> >
> > Thanks for your help
> > Francois
> >
>


CTAS query stuck

2016-04-04 Thread François Méthot
Hi,

Using Drill 1.5, 13 nodes (6 cores each, Max Dir. Mem: 32GB, Max Heap 8 GB).

  Sometimes our CTAS queries get "stuck": a query runs for a few hours or
days, and in the fragment profiles graph we see it waiting on a single
fragment (what appears to be a single horizontal line in the graph). When
this happens, if we hit Ctrl-C on the console that initiated the query,
everything returns normally and the expected table is created and valid.

Once, when a query seemed stuck after 1 day, we let it go, and it returned
after 5 days.

We see this behavior with queries returning 0, 1 or millions of rows.

There was a post on this last year, but I can't find how or whether the
issue was resolved:
http://mail-archives.apache.org/mod_mbox/drill-user/201505.mbox/%3cac7542f0-de0f-42e4-9858-0a1f0983b...@maprtech.com%3E


Francois


Simple query on 150 billion records

2016-04-04 Thread François Méthot
Hi,

  Querying 150 Billion records spread over ~21 000 parquets stored in hdfs
on 13 nodes (6 cores each, Max Dir. Mem: 32GB, Max Heap 8 GB).

Is there a known issue or Drill limitation that would explain why the first
query below can't return the expected single row and aggregation?

create table ANALYSIS_RESULT as (
select to_date(to_timestamp(SECONDS)), count(1)
from hdfs.`/data/`
where Int32Field2=123456 or Int32Field2=4567898
group by to_date(to_timestamp(SECONDS)));

After *20 hours*, SYSTEM ERROR: Foreman Exception: One more more nodes lost
connectivity during query.


If we do the query in 2 steps:
create table ANALYSIS_RESULT as (
select Int32Field1 as SECONDS from hdfs.`/data/` where Int32Field2=123456
or Int32Field2=4567898);

result was returned in *43 minutes* ( a single record ).

select to_date(to_timestamp(SECONDS)), count(1)
from ANALYSIS_RESULT
group by to_date(to_timestamp(SECONDS));

Aggregation of that single record is of course done in  < 1 second.
   2016-04-04  1



I also tried
select to_date(to_timestamp(SECONDS)), count(1) from (
select Int32Field1 as SECONDS
from hdfs.`/data/`
where Int32Field2=123456 or Int32Field2=4567898)
group by to_date(to_timestamp(SECONDS));

Same thing: After *21 hours*, SYSTEM ERROR: Foreman Exception: One more
more nodes lost connectivity during query.


Thanks for your help
Francois


Re: Aggregation OutOfMemoryException

2016-03-30 Thread François Méthot
Thank you Abdel,

 After following your recommendation, our query actually went through. In
the Fragment Profiles overview of the Web UI, we saw that after 2 days of
processing, a thread was still doing some work. We let it run for another 3
days; nothing was happening, so we hit Ctrl-C from the console that had
initiated the query, and it returned successfully.

Francois


On Wed, Mar 16, 2016 at 1:30 PM, Abdel Hakim Deneche 
wrote:

> actually:
>
> sort limit = MQMPN / (NS * MPN * NC * 0.7)
>
> On Wed, Mar 16, 2016 at 6:30 PM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > sort memory limit is computed, as follows:
> >
> > MQMPN = planner.memory.max_query_memory_per_node
> > MPN = planner.width.max_per_node
> > NC = number of core in each cluster node
> > NS = number of sort operators in the query
> >
> > sort limit = MQMPN / (MPN * NC * 0.7)
> >
> > In your case I assume the query contains a single sort operator and you
> > have 16 cores per node. To increase the sort limit you can increase the
> > value of max_query_memory_per_node and you can also reduce the value of
> > planner.width.max_per_node. Please note that reducing the value of the
> > latter option may increase the query's execution time.
> >
> > On Wed, Mar 16, 2016 at 2:47 PM, François Méthot 
> > wrote:
> >
> >> The default spill directory (/tmp) did not have enough space. We fixed
> >> that. (thanks John)
> >>
> >> I altered session to set
> >> planner.memory.max_query_memory_per_node = 17179869184 (16GB)
> >> planner.enable_hashjoin=false;
> >> planner.enable_hashadd=false;
> >>
> >> We ran our aggregation.
> >>
> >> After 7h44m.
> >>
> >> We got
> >>
> >> Error: RESOURCE ERROR: External Sort encountered an error while spilling
> >> to
> >> disk
> >>
> >> Fragment 7:35
> >>
> >> Caused by org.apache.drill.exec.exception.OutOfMemory: Unable to
> allocate
> >> buffer of size 65536 (rounded from 37444) due to memory limit. Current
> >> allocation: 681080448.
> >>
> >>
> org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:216)
> >>
> >>
> org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:191)
> >>
> >>
> >>
> org.apache.drill.exec.cache.VectorAccessibleSerializable.readFromStream(VectorAccessibleSerializable.java:112)
> >>
> >>
> >>
> org.apache.drill.exec.physical.impl.xsort.BatchGroup.getBatch(BatchGroup.java:110)
> >>
> >>
> >>
> org.apache.drill.exec.physical.impl.xsort.BatchGroup.getNextIndex(BatchGroup.java:136)
> >>
> >>
> >>
> org.apache.drill.exec.test.generated.PriorityQueueCopierGen975.next(PriorityQueueCopierTemplate.java:76)
> >>
> >>
> >>
> org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:557)
> >>
> >>
> >> I think we were close to having the query complete. In the Fragment
> >> Profiles Web UI, the bottom 2 major fragments (out of 5) were showing
> >> that they were done.
> >> I had the same query working on a (20x) smaller set of data.
> >> Should I add more mem to planner.memory.max_query_memory_per_node ?
> >>
> >>
> >>
> >> Abdel:
> >> We did get the memory leak below while doing streaming aggregation, when
> >> our /tmp directory was too small.
> >> After fixing that, our streaming  aggregation got us the error above.
> >>
> >> Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query.
> >> Memory leaked: (389120)
> >> Allocator(op:6:51:2:ExternalSort) 200/389120/680576640/715827882
> >> (res/actual/peak/limit)
> >>
> >> Fragment 6:51
> >>
> >> [Error Id: . on node014.prod:31010]
> >>
> >>
> >>   (java.lang.IllegalStateException) Memory was leaked by query. Memory
> >> leaked: (389120)
> >> Allocator(op:6:51:2:ExternalSort) 200/389120/680576640/715827882
> >> (res/actual/peak/limit)
> >> org.apache.drill.exec.memory.BaseAllocator.close():492
> >> org.apache.drill.exec.ops.OperatorContextImpl.close():124
> >> org.apache.drill.exec.ops.FragmentContext.suppressingClose():416
> >> org.apache.drill.exec.ops.FragmentContext.close():405
> >>
> >>
> >>
> org.apache.drill.exec.work.fragment.FragmentExecutor.closeOu

Re: Aggregation OutOfMemoryException

2016-03-19 Thread François Méthot
The default spill directory (/tmp) did not have enough space. We fixed
that. (thanks John)
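For anyone hitting the same thing: the fix amounts to pointing the external sort's spill location at a filesystem with enough room. In drill-override.conf the relevant block looks roughly like this (key names as I understand them from drill-default.conf; the path is illustrative, not our actual one):

```
drill.exec: {
  sort.external.spill: {
    directories: [ "/data/drill/spill" ],
    fs: "file:///"
  }
}
```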

I altered session to set
planner.memory.max_query_memory_per_node = 17179869184 (16GB)
planner.enable_hashjoin=false;
planner.enable_hashagg=false;

We ran our aggregation.

After 7h44m.

We got

Error: RESOURCE ERROR: External Sort encountered an error while spilling to
disk

Fragment 7:35

Caused by org.apache.drill.exec.exception.OutOfMemoryException: Unable to allocate
buffer of size 65536 (rounded from 37444) due to memory limit. Current
allocation: 681080448.

org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:216)

org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:191)

org.apache.drill.exec.cache.VectorAccessibleSerializable.readFromStream(VectorAccessibleSerializable.java:112)

org.apache.drill.exec.physical.impl.xsort.BatchGroup.getBatch(BatchGroup.java:110)

org.apache.drill.exec.physical.impl.xsort.BatchGroup.getNextIndex(BatchGroup.java:136)

org.apache.drill.exec.test.generated.PriorityQueueCopierGen975.next(PriorityQueueCopierTemplate.java:76)

org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:557)


I think we were close to having the query complete. In the Fragment Profiles
Web UI, the bottom 2 major fragments (out of 5) were showing that they were
done.
I had the same query working on a (20x) smaller set of data.
Should I add more mem to planner.memory.max_query_memory_per_node ?
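To get a sense of scale, here is the sort-limit formula Abdel gives elsewhere in this thread, evaluated against our settings. This is only a back-of-envelope sketch: I am taking the formula at face value and assuming a single sort operator, 16 cores per node, and planner.width.max_per_node at its default of 12 (75% of 16).

```python
MQMPN = 17_179_869_184  # planner.memory.max_query_memory_per_node (16 GB)
NS = 1                  # number of sort operators in the query (assumed)
NC = 16                 # cores per node (assumed)
MPN = 12                # planner.width.max_per_node, assumed default 75% of NC

# sort limit = MQMPN / (NS * MPN * NC * 0.7), per the formula in the thread
sort_limit = MQMPN / (NS * MPN * NC * 0.7)
print(f"sort limit ~ {sort_limit / (1 << 20):.0f} MiB per sort instance")
```

If that arithmetic is right, each sort instance only gets on the order of a hundred MiB, which would explain the aggressive spilling.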



Abdel:
We did get the memory leak below while doing streaming aggregation, when
our /tmp directory was too small.
After fixing that, our streaming aggregation got us the error above.

Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query.
Memory leaked: (389120)
Allocator(op:6:51:2:ExternalSort) 200/389120/680576640/715827882
(res/actual/peak/limit)

Fragment 6:51

[Error Id: . on node014.prod:31010]


  (java.lang.IllegalStateException) Memory was leaked by query. Memory
leaked: (389120)
Allocator(op:6:51:2:ExternalSort) 200/389120/680576640/715827882
(res/actual/peak/limit)
org.apache.drill.exec.memory.BaseAllocator.close():492
org.apache.drill.exec.ops.OperatorContextImpl.close():124
org.apache.drill.exec.ops.FragmentContext.suppressingClose():416
org.apache.drill.exec.ops.FragmentContext.close():405

org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():343
org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup():180
org.apache.drill.exec.work.fragment.FragmentExecutor.run():287
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142


Thanks guys for your feedback.



On Sat, Mar 12, 2016 at 1:18 AM, Abdel Hakim Deneche 
wrote:

> Disabling hash aggregation will default to streaming aggregation + sort.
> This will allow you to handle larger data and spill to disk if necessary.
>
> Like stated in the documentation, starting from Drill 1.5 the default
> memory limit of sort may not be enough to process large data, but you can
> bump it up by increasing planner.memory.max_query_memory_per_node (defaults
> to 2GB), and if necessary reducing planner.width.max_per_node (defaults to
> 75% of number of cores).
>
> You said disabling hash aggregate and hash join causes a memory leak. Can
> you give more details about the error ? the query may fail with an out of
> memory but it shouldn't leak.
>
> On Fri, Mar 11, 2016 at 10:53 PM, John Omernik  wrote:
>
> > I've had some luck disabling multi-phase aggregations on some queries
> where
> > memory was an issue.
> >
> > https://drill.apache.org/docs/guidelines-for-optimizing-aggregation/
> >
> > After I try that, than I typically look at the hash aggregation as you
> have
> > done:
> >
> >
> >
> https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/
> >
> > I've had limited success with changing the max_query_memory_per_node and
> > max_width, sometimes it's a weird combination of things that work in
> there.
> >
> > https://drill.apache.org/docs/troubleshooting/#memory-issues
> >
> > Back to your spill stuff if you disable hash aggregation, do you know if
> > your spill directories are setup? That may be part of the issue, I am not
> > sure what the default spill behavior of Drill is for spill directory
> setup.
> >
> >
> >
> > On Fri, Mar 11, 2016 at 2:17 PM, François Méthot 
> > wrote:
> >
> > > Hi,
> > >
> > >Using version 1.5, DirectMemory is currently set at 32GB, heap is at
> > > 8GB. We have been trying to perform multiple aggregations in one query
> > (see
> > > below) on 40 billion+ rows stored on 13 nodes. We are using parquet

Aggregation OutOfMemoryException

2016-03-11 Thread François Méthot
Hi,

   Using version 1.5, DirectMemory is currently set at 32GB, heap is at
8GB. We have been trying to perform multiple aggregations in one query (see
below) on 40 billion+ rows stored on 13 nodes. We are using parquet format.

We keep getting OutOfMemoryException: Failure allocating buffer..

on a query that looks like this:

create table hdfs.`test1234` as
(
select string_field1,
  string_field2,
  min ( int_field3 ),
  max ( int_field4 ),
  count(1),
  count ( distinct int_field5 ),
  count ( distinct int_field6 ),
  count ( distinct string_field7 )
from hdfs.`/data/`
group by string_field1, string_field2
);

The documentation state:
"Currently, hash-based operations do not spill to disk as needed."

and

"If the hash-based operators run out of memory during execution, the query
fails. If large hash operations do not fit in memory on your system, you
can disable these operations. When disabled, Drill creates alternative
plans that allow spilling to disk."

My understanding is that it will fall back to streaming aggregation, which
requires sorting.
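To make the distinction concrete: a streaming aggregate only ever holds the current group's state in memory, but it needs its input sorted on the group keys first. A toy illustration in Python (field names borrowed from the query above; the data and logic are mine, not Drill's):

```python
from itertools import groupby

# Toy rows: (string_field1, int_field5), standing in for the real schema.
rows = [
    ("a", 1), ("a", 1), ("a", 2),
    ("b", 5), ("b", 7),
]

# Sort once on the group key, then stream: only one group's values are
# materialized at a time, instead of a hash table holding every group.
results = {}
for key, group in groupby(sorted(rows), key=lambda r: r[0]):
    vals = [r[1] for r in group]
    results[key] = {"count": len(vals), "distinct": len(set(vals))}
```

This is why the fallback trades the hash table's memory for a sort, and why the sort (which can spill) becomes the pressure point instead.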

but

"As of Drill 1.5, ... the sort operator (in queries that ran successfully
in previous releases) may not have enough memory, resulting in a failed
query"

And indeed, disabling hash agg and hash join resulted in a memory leak error.

So it looks like increasing direct memory is our only option.

Is there a plan to have Hash Aggregation spill to disk in the next
release?


Thanks for your feedback