RE: Drill Capacity

2017-11-08 Thread Yun Liu
Hi Kunal,

Please see the dataset below, which I've provided this week. Hope it helps:

[ {
  "type" : "quality-rules",
  "reference" : {
"href" : "",
"name" : "Avoid unreferenced Tables",
"key" : "1634",
"critical" : false
  },
  "result" : {
"grade" : 2,
"violationRatio" : {
  "totalChecks" : 52,
  "failedChecks" : 5,
  "successfulChecks" : 47,
  "ratio" : 0.9038461538461539
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 1,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : "Microsoft T-SQL",
"result" : {
  "grade" : 2.0769230769230775,
  "violationRatio" : {
"totalChecks" : 52,
"failedChecks" : 5,
"successfulChecks" : 47,
"ratio" : 0.9038461538461539
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 1,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "",
"name" : "Namespace naming convention - case control",
"key" : "3550",
"critical" : false
  },
  "result" : {
"grade" : 4.0,
"violationRatio" : {
  "totalChecks" : 31,
  "failedChecks" : 0,
  "successfulChecks" : 31,
  "ratio" : 1.0
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 0,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : ".NET",
"result" : {
  "grade" : 4.0,
  "violationRatio" : {
"totalChecks" : 31,
"failedChecks" : 0,
"successfulChecks" : 31,
"ratio" : 1.0
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 0,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "2",
"name" : "Interface naming convention - case and character set control",
"key" : "3554",
"critical" : false
  },
  "result" : {
"grade" : 4.0,
"violationRatio" : {
  "totalChecks" : 10,
  "failedChecks" : 0,
  "successfulChecks" : 10,
  "ratio" : 1.0
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 0,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : ".NET",
"result" : {
  "grade" : 4.0,
  "violationRatio" : {
"totalChecks" : 10,
"failedChecks" : 0,
"successfulChecks" : 10,
"ratio" : 1.0
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 0,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "",
"name" : "Enumerations naming convention - case and character set control",
"key" : "3558",
"critical" : false
  },
  "result" : {
"grade" : 4.0,
"violationRatio" : {
  "totalChecks" : 7,
  "failedChecks" : 0,
  "successfulChecks" : 7,
  "ratio" : 1.0
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 0,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : ".NET",
"result" : {
  "grade" : 4.0,
  "violationRatio" : {
"totalChecks" : 7,
"failedChecks" : 0,
"successfulChecks" : 7,
"ratio" : 1.0
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 0,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "",
"name" : "Enumeration Items naming convention - case and character set 
control",
"key" : "3560",
"critical" : false
  },
  "result" : {
"grade" : 4.0,
"violationRatio" : {
  "totalChecks" : 65,
  "failedChecks" : 0,
  "successfulChecks" : 65,
  "ratio" : 1.0
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 0,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : ".NET",
"result" : {
  "grade" : 4.0,
  "violationRatio" : {
"totalChecks" : 65,
"failedChecks" : 0,
"successfulChecks" : 65,
"ratio" : 1.0
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 0,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "",

RE: Drill Capacity

2017-11-08 Thread Yun Liu
Hi Paul,

I've already done this: alter session set `store.json.all_text_mode`=true;

I don't believe this is an accurate error message, because when I reduce the number of 
rows in the Compliance.json file by half (while all fields and queries stay the 
same), everything works with no issues. I've tried the same with another 
dataset (same format and fields, but a smaller size), and there was no issue there 
either. So I am still convinced it's a size issue.

Please let me know what else I could provide to troubleshoot this.

Thanks for all your help so far.

Yun

-Original Message-
From: Paul Rogers [mailto:prog...@mapr.com] 
Sent: Tuesday, November 7, 2017 7:55 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,

I looked at the sqlline.log file you posted. (Thanks much for doing so.) Here’s 
what I noted:

The log shows a failed query, but this one is different than the one we 
discussed earlier. Query:

SELECT * FROM `dfs`.`Inputs`.`./Compliance.json` LIMIT 100

Since this is a LIMIT query, with no ORDER BY, we got a different plan than the 
query we discussed earlier. The earlier one had a stack trace that suggested 
the query had an ORDER BY that used the legacy (non-managed) version of the 
sort.

Despite the fact that the query is different, the above query did, in fact, 
fail, but for a different reason.

JsonReader - User Error Occurred: You tried to write a VarChar type when you 
are using a ValueWriter of type NullableBitWriterImpl. (You tried to write a 
VarChar type when you are using a ValueWriter of type NullableBitWriterImpl.)
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: You tried to 
write a VarChar type when you are using a ValueWriter of type 
NullableBitWriterImpl.

This is saying that there is a schema change: but one within a single batch. 
Variations of this bug can occur if you have several values of the form 10, 20, 
30. But, then later, you have a value like “Hi” — we create a numeric vector 
then try to write a string.

Here, it appears you have values that are boolean, followed by a string:

… “a”: true …
… “a”: false …
… “a”: “a string!”

The JSON writer sees the boolean and locates a bit vector. Then, it sees the 
string, tries to write that into a bit vector, and gets the error displayed 
above.

You can work around this by using “all text mode” that reads all fields as 
text. Or, you can clean up your data.

Once this file works, perhaps you can try another run to recreate the original 
memory issue with the sort so we can track that one down.

Thanks,

- Paul

> On Nov 7, 2017, at 1:49 PM, Kunal Khatua <kkha...@mapr.com> wrote:
> 
> Hi Yun
> 
> The new release might not address this issue as we don't have a repro for 
> this. Any chance you can provide a sample anonymized data set. The JSON data 
> doesn't have to be meaningful, but we need to be able to reproduce it to 
> ensure that we are indeed addressing the issue you faced. 
> 
> Thanks
> ~K
> -Original Message-
> From: Yun Liu [mailto:y@castsoftware.com]
> Sent: Tuesday, November 07, 2017 7:17 AM
> To: user@drill.apache.org
> Subject: RE: Drill Capacity
> 
> Hi Arjun,
> 
> That was already altered and schema was not changed. I've reduced the json 
> size and everything works fine. I believe it was giving a false error. Seems 
> that's the only way to bypass this error until your new release comes out?
> 
> Thanks,
> Yun
> 
> -Original Message-
> From: Arjun kr [mailto:arjun...@outlook.com]
> Sent: Monday, November 6, 2017 7:39 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
> 
> Hi Yun,
> 
> 
> Looking at the log shared, You seems to be running below query.
> 
> 
> 2017-11-06 15:09:37,383 [25ff3e7e-39ef-a175-93e7-e4e62b284add:foreman] 
> INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id 
> 25ff3e7e-39ef-a175-93e7-e4e62b284add: SELECT * FROM 
> `dfs`.`Inputs`.`./Compliance.json` LIMIT 100
> 
> 
> Below is the exception with query failure.
> 
> 
> 2017-11-06 15:09:45,852 
> [25ff3e7e-39ef-a175-93e7-e4e62b284add:frag:0:0] INFO  
> o.a.d.e.vector.complex.fn.JsonReader - User Error Occurred: You tried 
> to write a VarChar type when you are using a ValueWriter of type 
> NullableBitWriterImpl. (You tried to write a VarChar type when you are 
> using a ValueWriter of type NullableBitWriterImpl.)^M
> org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: You tried 
> to write a VarChar type when you are using a ValueWriter of type 
> NullableBitWriterImpl.
> 
> It could be related to schema change. Can you try setting below session 
> parameter if not tried already?
> 
> 
> alter session set `store.json.all_text_mode`=true;
> 
> 
> 
> Thanks,
> 
> Arjun
> 
> From: Yun Liu <y@castsoftw

Re: Drill Capacity

2017-11-07 Thread Paul Rogers
Hi Yun,

I looked at the sqlline.log file you posted. (Thanks much for doing so.) Here’s 
what I noted:

The log shows a failed query, but this one is different than the one we 
discussed earlier. Query:

SELECT * FROM `dfs`.`Inputs`.`./Compliance.json` LIMIT 100

Since this is a LIMIT query, with no ORDER BY, we got a different plan than the 
query we discussed earlier. The earlier one had a stack trace that suggested 
the query had an ORDER BY that used the legacy (non-managed) version of the 
sort.

Despite the fact that the query is different, the above query did, in fact, 
fail, but for a different reason.

JsonReader - User Error Occurred: You tried to write a VarChar type when you 
are using a ValueWriter of type NullableBitWriterImpl. (You tried to write a 
VarChar type when you are using a ValueWriter of type NullableBitWriterImpl.)
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: You tried to 
write a VarChar type when you are using a ValueWriter of type 
NullableBitWriterImpl.

This is saying that there is a schema change, but one within a single batch. 
Variations of this bug can occur if you have several values of the form 10, 20, 
30, but then later a value like “Hi”: we create a numeric vector, then try to 
write a string.

Here, it appears you have values that are boolean, followed by a string:

… “a”: true …
… “a”: false …
… “a”: “a string!”

The JSON writer sees the boolean and locates a bit vector. Then, it sees the 
string, tries to write that into a bit vector, and gets the error displayed 
above.

You can work around this by using “all text mode” that reads all fields as 
text. Or, you can clean up your data.
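For reference, here is a minimal sketch of that workaround, reusing the session option and query already quoted in this thread (the mixed-type field “a” above is only an illustration, not an actual field in Compliance.json):

-- records such as {"a": true} followed by {"a": "a string!"} trigger the
-- NullableBitWriterImpl error; all_text_mode makes the reader treat every
-- field as text instead, so no bit vector is ever created
alter session set `store.json.all_text_mode`=true;
SELECT * FROM `dfs`.`Inputs`.`./Compliance.json` LIMIT 100;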

Once this file works, perhaps you can try another run to recreate the original 
memory issue with the sort so we can track that one down.

Thanks,

- Paul

> On Nov 7, 2017, at 1:49 PM, Kunal Khatua <kkha...@mapr.com> wrote:
> 
> Hi Yun
> 
> The new release might not address this issue as we don't have a repro for 
> this. Any chance you can provide a sample anonymized data set. The JSON data 
> doesn't have to be meaningful, but we need to be able to reproduce it to 
> ensure that we are indeed addressing the issue you faced. 
> 
> Thanks
> ~K
> -Original Message-
> From: Yun Liu [mailto:y@castsoftware.com] 
> Sent: Tuesday, November 07, 2017 7:17 AM
> To: user@drill.apache.org
> Subject: RE: Drill Capacity
> 
> Hi Arjun,
> 
> That was already altered and schema was not changed. I've reduced the json 
> size and everything works fine. I believe it was giving a false error. Seems 
> that's the only way to bypass this error until your new release comes out?
> 
> Thanks,
> Yun
> 
> -Original Message-
> From: Arjun kr [mailto:arjun...@outlook.com]
> Sent: Monday, November 6, 2017 7:39 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
> 
> Hi Yun,
> 
> 
> Looking at the log shared, You seems to be running below query.
> 
> 
> 2017-11-06 15:09:37,383 [25ff3e7e-39ef-a175-93e7-e4e62b284add:foreman] INFO  
> o.a.drill.exec.work.foreman.Foreman - Query text for query id 
> 25ff3e7e-39ef-a175-93e7-e4e62b284add: SELECT * FROM 
> `dfs`.`Inputs`.`./Compliance.json` LIMIT 100
> 
> 
> Below is the exception with query failure.
> 
> 
> 2017-11-06 15:09:45,852 [25ff3e7e-39ef-a175-93e7-e4e62b284add:frag:0:0] INFO  
> o.a.d.e.vector.complex.fn.JsonReader - User Error Occurred: You tried to 
> write a VarChar type when you are using a ValueWriter of type 
> NullableBitWriterImpl. (You tried to write a VarChar type when you are using 
> a ValueWriter of type NullableBitWriterImpl.)^M
> org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: You tried 
> to write a VarChar type when you are using a ValueWriter of type 
> NullableBitWriterImpl.
> 
> It could be related to schema change. Can you try setting below session 
> parameter if not tried already?
> 
> 
> alter session set `store.json.all_text_mode`=true;
> 
> 
> 
> Thanks,
> 
> Arjun
> 
> From: Yun Liu <y@castsoftware.com>
> Sent: Tuesday, November 7, 2017 1:46 AM
> To: user@drill.apache.org
> Subject: RE: Drill Capacity
> 
> Hi Arjun and Paul,
> 
> Yep those are turned and I am reading it from sqlline.log. Only max 
> allocation number I am reading is 10,000,000,000. Posted the logs in my 
> Dropbox:
> https://www.dropbox.com/sh/5akxrzm078jsabw/AADuD92swH6c9jwijTjkkac_a?dl=0

RE: Drill Capacity

2017-11-07 Thread Kunal Khatua
Hi Yun

The new release might not address this issue, as we don't have a repro for it. 
Any chance you can provide a sample anonymized data set? The JSON data doesn't 
have to be meaningful, but we need to be able to reproduce the problem to ensure that we 
are indeed addressing the issue you faced. 

Thanks
~K
-Original Message-
From: Yun Liu [mailto:y@castsoftware.com] 
Sent: Tuesday, November 07, 2017 7:17 AM
To: user@drill.apache.org
Subject: RE: Drill Capacity

Hi Arjun,

That was already altered and schema was not changed. I've reduced the json size 
and everything works fine. I believe it was giving a false error. Seems that's 
the only way to bypass this error until your new release comes out?

Thanks,
Yun

-Original Message-
From: Arjun kr [mailto:arjun...@outlook.com]
Sent: Monday, November 6, 2017 7:39 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,


Looking at the log shared, You seems to be running below query.


2017-11-06 15:09:37,383 [25ff3e7e-39ef-a175-93e7-e4e62b284add:foreman] INFO  
o.a.drill.exec.work.foreman.Foreman - Query text for query id 
25ff3e7e-39ef-a175-93e7-e4e62b284add: SELECT * FROM 
`dfs`.`Inputs`.`./Compliance.json` LIMIT 100


Below is the exception with query failure.


2017-11-06 15:09:45,852 [25ff3e7e-39ef-a175-93e7-e4e62b284add:frag:0:0] INFO  
o.a.d.e.vector.complex.fn.JsonReader - User Error Occurred: You tried to write 
a VarChar type when you are using a ValueWriter of type NullableBitWriterImpl. 
(You tried to write a VarChar type when you are using a ValueWriter of type 
NullableBitWriterImpl.)^M
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: You tried to 
write a VarChar type when you are using a ValueWriter of type 
NullableBitWriterImpl.

It could be related to schema change. Can you try setting below session 
parameter if not tried already?


alter session set `store.json.all_text_mode`=true;



Thanks,

Arjun

From: Yun Liu <y@castsoftware.com>
Sent: Tuesday, November 7, 2017 1:46 AM
To: user@drill.apache.org
Subject: RE: Drill Capacity

Hi Arjun and Paul,

Yep those are turned and I am reading it from sqlline.log. Only max allocation 
number I am reading is 10,000,000,000. Posted the logs in my Dropbox:
https://www.dropbox.com/sh/5akxrzm078jsabw/AADuD92swH6c9jwijTjkkac_a?dl=0




Thank you!
Yun

-Original Message-
From: Arjun kr [mailto:arjun...@outlook.com]
Sent: Monday, November 6, 2017 1:20 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,


Are you running in Drill embedded mode ? If so , the logs will be available in 
sqllline.log and drillbit.log will not be populated. You can enable DEBUG 
logging in logback.xml , run the query and share log file as Paul suggested.


Edit $DRILL_HOME/conf/logback.xml to enable DEBUG level logging.


 


  


Thanks,


Arjun


From: Paul Rogers <prog...@mapr.com>
Sent: Monday, November 6, 2017 10:56 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,

Sorry, it is a bit confusing. The log will contain two kinds of JSON. One is 
the query profile, which is what you found. The other is the physical plan used 
to run the query. It is the physical plan you want to find; that is the one 
that has the max allocation.

If you can post your logs somewhere, I'll d/l them and take a look.

- Paul

> On Nov 6, 2017, at 7:27 AM, Yun Liu <y@castsoftware.com> wrote:
>
> Hi Paul,
>
> I am using Drill v 1.11.0 so I am only seeing sqlline.log and 
> sqlline_queries.log. hopefully the same.
>
> I am following your instructions and I am not seeing any maxAllocation other 
> than 10,000,000,000. No other number (or small number) than this. The query 
> profile reads the following:
>
> {"queryId":"25ff81fc-3b7a-a840-b557-d2194cc6819a","schema":"","queryTe
> xt":"SELECT * FROM `dfs`.`Inputs`.`./ Compliance.json` LIMIT 
> 100","start":1509981699406,"finish":1509981707544,"outcome":"FAILED","
> username":"","remoteAddress":"localhost"}
>
> Is this what you're looking for?
>
> Thanks,
> Yun
>
> -Original Message-
> From: Paul Rogers [mailto:prog...@mapr.com]
> Sent: Friday, November 3, 2017 6:45 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
>
> Thanks for the info. Clearly you are way ahead of me.
>
> In issue 1, although you have only four (top level) fields, your example 
> shows that you have many nested fiel

RE: Drill Capacity

2017-11-07 Thread Yun Liu
Hi Arjun,

That was already altered and the schema was not changed. I've reduced the json size 
and everything works fine. I believe it was giving a false error. It seems that's 
the only way to bypass this error until your new release comes out?

Thanks,
Yun

-Original Message-
From: Arjun kr [mailto:arjun...@outlook.com] 
Sent: Monday, November 6, 2017 7:39 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,


Looking at the log shared, You seems to be running below query.


2017-11-06 15:09:37,383 [25ff3e7e-39ef-a175-93e7-e4e62b284add:foreman] INFO  
o.a.drill.exec.work.foreman.Foreman - Query text for query id 
25ff3e7e-39ef-a175-93e7-e4e62b284add: SELECT * FROM 
`dfs`.`Inputs`.`./Compliance.json` LIMIT 100


Below is the exception with query failure.


2017-11-06 15:09:45,852 [25ff3e7e-39ef-a175-93e7-e4e62b284add:frag:0:0] INFO  
o.a.d.e.vector.complex.fn.JsonReader - User Error Occurred: You tried to write 
a VarChar type when you are using a ValueWriter of type NullableBitWriterImpl. 
(You tried to write a VarChar type when you are using a ValueWriter of type 
NullableBitWriterImpl.)^M
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: You tried to 
write a VarChar type when you are using a ValueWriter of type 
NullableBitWriterImpl.

It could be related to schema change. Can you try setting below session 
parameter if not tried already?


alter session set `store.json.all_text_mode`=true;



Thanks,

Arjun

From: Yun Liu <y@castsoftware.com>
Sent: Tuesday, November 7, 2017 1:46 AM
To: user@drill.apache.org
Subject: RE: Drill Capacity

Hi Arjun and Paul,

Yep those are turned and I am reading it from sqlline.log. Only max allocation 
number I am reading is 10,000,000,000. Posted the logs in my Dropbox:
https://www.dropbox.com/sh/5akxrzm078jsabw/AADuD92swH6c9jwijTjkkac_a?dl=0




Thank you!
Yun

-Original Message-
From: Arjun kr [mailto:arjun...@outlook.com]
Sent: Monday, November 6, 2017 1:20 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,


Are you running in Drill embedded mode ? If so , the logs will be available in 
sqllline.log and drillbit.log will not be populated. You can enable DEBUG 
logging in logback.xml , run the query and share log file as Paul suggested.


Edit $DRILL_HOME/conf/logback.xml to enable DEBUG level logging.


 


  


Thanks,


Arjun


From: Paul Rogers <prog...@mapr.com>
Sent: Monday, November 6, 2017 10:56 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,

Sorry, it is a bit confusing. The log will contain two kinds of JSON. One is 
the query profile, which is what you found. The other is the physical plan used 
to run the query. It is the physical plan you want to find; that is the one 
that has the max allocation.

If you can post your logs somewhere, I'll d/l them and take a look.

- Paul

> On Nov 6, 2017, at 7:27 AM, Yun Liu <y@castsoftware.com> wrote:
>
> Hi Paul,
>
> I am using Drill v 1.11.0 so I am only seeing sqlline.log and 
> sqlline_queries.log. hopefully the same.
>
> I am following your instructions and I am not seeing any maxAllocation other 
> than 10,000,000,000. No other number (or small number) than this. The query 
> profile reads the following:
>
> {"queryId":"25ff81fc-3b7a-a840-b557-d2194cc6819a","schema":"","queryTe
> xt":"SELECT * FROM `dfs`.`Inputs`.`./ Compliance.json` LIMIT 
> 100","start":1509981699406,"finish":1509981707544,"outcome":"FAILED","
> username":"","remoteAddress":"localhost"}
>
> Is this what you're looking for?
>
> Thanks,
> Yun
>
> -Original Message-
> From: Paul Rogers [mailto:prog...@mapr.com]
> Sent: Friday, November 3, 2017 6:45 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
>
> Thanks for the info. Clearly you are way ahead of me.
>
> In issue 1, although you have only four (top level) fields, your example 
> shows that you have many nested fields. It is the total field count (across 
> all maps) that drives total width. And, it is the total amount of data that 
> drives memory consumption.
>
> You mentioned each record is 64KB and 3K rows. That suggests a total size of 
> around 200MB. But, you mention the total file size is 400MB. So, either the 
> rows are twice as large, or there are twice as many. If you have 3K rows of 
> 128MB each, then each batch of data is 400MB, which is pretty large.

Re: Drill Capacity

2017-11-06 Thread Arjun kr
Hi Yun,


Looking at the log shared, you seem to be running the query below.


2017-11-06 15:09:37,383 [25ff3e7e-39ef-a175-93e7-e4e62b284add:foreman] INFO  
o.a.drill.exec.work.foreman.Foreman - Query text for query id 
25ff3e7e-39ef-a175-93e7-e4e62b284add: SELECT * FROM 
`dfs`.`Inputs`.`./Compliance.json` LIMIT 100


Below is the exception from the query failure.


2017-11-06 15:09:45,852 [25ff3e7e-39ef-a175-93e7-e4e62b284add:frag:0:0] INFO  
o.a.d.e.vector.complex.fn.JsonReader - User Error Occurred: You tried to write 
a VarChar type when you are using a ValueWriter of type NullableBitWriterImpl. 
(You tried to write a VarChar type when you are using a ValueWriter of type 
NullableBitWriterImpl.)^M
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: You tried to 
write a VarChar type when you are using a ValueWriter of type 
NullableBitWriterImpl.

It could be related to a schema change. Can you try setting the session 
parameter below, if you have not tried it already?


alter session set `store.json.all_text_mode`=true;



Thanks,

Arjun

From: Yun Liu <y@castsoftware.com>
Sent: Tuesday, November 7, 2017 1:46 AM
To: user@drill.apache.org
Subject: RE: Drill Capacity

Hi Arjun and Paul,

Yep those are turned and I am reading it from sqlline.log. Only max allocation 
number I am reading is 10,000,000,000. Posted the logs in my Dropbox:
https://www.dropbox.com/sh/5akxrzm078jsabw/AADuD92swH6c9jwijTjkkac_a?dl=0




Thank you!
Yun

-Original Message-
From: Arjun kr [mailto:arjun...@outlook.com]
Sent: Monday, November 6, 2017 1:20 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,


Are you running in Drill embedded mode ? If so , the logs will be available in 
sqllline.log and drillbit.log will not be populated. You can enable DEBUG 
logging in logback.xml , run the query and share log file as Paul suggested.


Edit $DRILL_HOME/conf/logback.xml to enable DEBUG level logging.


 


  


Thanks,


Arjun


From: Paul Rogers <prog...@mapr.com>
Sent: Monday, November 6, 2017 10:56 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,

Sorry, it is a bit confusing. The log will contain two kinds of JSON. One is 
the query profile, which is what you found. The other is the physical plan used 
to run the query. It is the physical plan you want to find; that is the one 
that has the max allocation.

If you can post your logs somewhere, I'll d/l them and take a look.

- Paul

> On Nov 6, 2017, at 7:27 AM, Yun Liu <y@castsoftware.com> wrote:
>
> Hi Paul,
>
> I am using Drill v 1.11.0 so I am only seeing sqlline.log and 
> sqlline_queries.log. hopefully the same.
>
> I am following your instructions and I am not seeing any maxAllocation other 
> than 10,000,000,000. No other number (or small number) than this. The query 
> profile reads the following:
>
> {"queryId":"25ff81fc-3b7a-a840-b557-d2194cc6819a","schema":"","queryTe
> xt":"SELECT * FROM `dfs`.`Inputs`.`./ Compliance.json` LIMIT
> 100","start":1509981699406,"finish":1509981707544,"outcome":"FAILED","
> username":"","remoteAddress":"localhost"}
>
> Is this what you're looking for?
>
> Thanks,
> Yun
>
> -Original Message-
> From: Paul Rogers [mailto:prog...@mapr.com]
> Sent: Friday, November 3, 2017 6:45 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
>
> Thanks for the info. Clearly you are way ahead of me.
>
> In issue 1, although you have only four (top level) fields, your example 
> shows that you have many nested fields. It is the total field count (across 
> all maps) that drives total width. And, it is the total amount of data that 
> drives memory consumption.
>
> You mentioned each record is 64KB and 3K rows. That suggests a total size of 
> around 200MB. But, you mention the total file size is 400MB. So, either the 
> rows are twice as large, or there are twice as many. If you have 3K rows of 
> 128MB each, then each batch of data is 400MB, which is pretty large.
>
> If your records are 64K in size, and we read 4K per batch, then the total 
> size is 256MB, which is also large.
>
> So, we are dealing with jumbo records and you really want the "batch size 
> control" feature that we are working on, but have not yet shipped.
>
> Let's work out the math. How many sorts in your query? What other operators 
> does the query include? Let's assume

RE: Drill Capacity

2017-11-06 Thread Yun Liu
Hi Arjun and Paul,

Yep, those are turned on, and I am reading it from sqlline.log. The only max allocation 
number I am reading is 10,000,000,000. Posted the logs in my Dropbox:
https://www.dropbox.com/sh/5akxrzm078jsabw/AADuD92swH6c9jwijTjkkac_a?dl=0

Thank you!
Yun

-Original Message-
From: Arjun kr [mailto:arjun...@outlook.com] 
Sent: Monday, November 6, 2017 1:20 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,


Are you running in Drill embedded mode ? If so , the logs will be available in 
sqllline.log and drillbit.log will not be populated. You can enable DEBUG 
logging in logback.xml , run the query and share log file as Paul suggested.


Edit $DRILL_HOME/conf/logback.xml to enable DEBUG level logging.


 


  


Thanks,


Arjun


From: Paul Rogers <prog...@mapr.com>
Sent: Monday, November 6, 2017 10:56 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,

Sorry, it is a bit confusing. The log will contain two kinds of JSON. One is 
the query profile, which is what you found. The other is the physical plan used 
to run the query. It is the physical plan you want to find; that is the one 
that has the max allocation.

If you can post your logs somewhere, I'll d/l them and take a look.

- Paul

> On Nov 6, 2017, at 7:27 AM, Yun Liu <y@castsoftware.com> wrote:
>
> Hi Paul,
>
> I am using Drill v 1.11.0 so I am only seeing sqlline.log and 
> sqlline_queries.log. hopefully the same.
>
> I am following your instructions and I am not seeing any maxAllocation other 
> than 10,000,000,000. No other number (or small number) than this. The query 
> profile reads the following:
>
> {"queryId":"25ff81fc-3b7a-a840-b557-d2194cc6819a","schema":"","queryTe
> xt":"SELECT * FROM `dfs`.`Inputs`.`./ Compliance.json` LIMIT 
> 100","start":1509981699406,"finish":1509981707544,"outcome":"FAILED","
> username":"","remoteAddress":"localhost"}
>
> Is this what you're looking for?
>
> Thanks,
> Yun
>
> -Original Message-
> From: Paul Rogers [mailto:prog...@mapr.com]
> Sent: Friday, November 3, 2017 6:45 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
>
> Thanks for the info. Clearly you are way ahead of me.
>
> In issue 1, although you have only four (top level) fields, your example 
> shows that you have many nested fields. It is the total field count (across 
> all maps) that drives total width. And, it is the total amount of data that 
> drives memory consumption.
>
> You mentioned each record is 64KB and 3K rows. That suggests a total size of 
> around 200MB. But, you mention the total file size is 400MB. So, either the 
> rows are twice as large, or there are twice as many. If you have 3K rows of 
> 128MB each, then each batch of data is 400MB, which is pretty large.
>
> If your records are 64K in size, and we read 4K per batch, then the total 
> size is 256MB, which is also large.
>
> So, we are dealing with jumbo records and you really want the "batch size 
> control" feature that we are working on, but have not yet shipped.
>
> Let's work out the math. How many sorts in your query? What other operators 
> does the query include? Let's assume a single sort.
>
> Max query memory is 10 GB. 10 GB / 1 sort / max width of 5 = 2 GB per sort. 
> Since your batches are ~400 MB, things should work.
>
> Since things don't work, I suspect that we're missing something.  
> (Note that the memory size we just calculated does not match the 
> numbers shown in an earlier post in which the sort got just ~40 MB of 
> memory...)
>
> Try this:
>
> * With your current settings, enable debug-level logging. Run your query.
>
> * Open the Drillbit log. Look for the JSON version of the query plan (there 
> will be two). One will tell you how much memory is given to the sort:
>
> maxAllocation: (some number)
>
> * Ignore the one that says 10,000,000, find the one with a smaller number. 
> What is that number?
>
> * Then, look in the query profile for your query. Look at the peak memory for 
> your JSON reader scan operator. The peak memory more-or-less reflects the 
> batch size. What is that number?
>
> With those, we can tell if the settings and sizes we think we are using are, 
> in fact, correct.
>
> Thanks,
>
> - Paul
>
>> On Nov 3, 2017, at 1:19 PM, Yun Liu <y@castsoftware.com> wrote:
>>
>> Hi Paul,
>>
>> Thanks for you detailed explanation. First off- I have 2 issues and I wanted 
>> to clear it out before continuing.
>>
>> Current setting: planner.memory.max_query_memory_per_node = 10GB, 

Re: Drill Capacity

2017-11-06 Thread Arjun kr
Hi Yun,


Are you running in Drill embedded mode? If so, the logs will be available in 
sqlline.log and drillbit.log will not be populated. You can enable DEBUG 
logging in logback.xml, run the query, and share the log file as Paul suggested.


Edit $DRILL_HOME/conf/logback.xml to enable DEBUG level logging.
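A minimal logger entry for this, assuming the stock Drill logback.xml layout (the logger name and the FILE appender below are assumptions, not values taken from this thread), would look something like:

<logger name="org.apache.drill" additivity="false">
  <!-- DEBUG level, as suggested above -->
  <level value="debug" />
  <appender-ref ref="FILE" />
</logger>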


 


  


Thanks,


Arjun


From: Paul Rogers <prog...@mapr.com>
Sent: Monday, November 6, 2017 10:56 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,

Sorry, it is a bit confusing. The log will contain two kinds of JSON. One is 
the query profile, which is what you found. The other is the physical plan used 
to run the query. It is the physical plan you want to find; that is the one 
that has the max allocation.

If you can post your logs somewhere, I’ll d/l them and take a look.

- Paul

> On Nov 6, 2017, at 7:27 AM, Yun Liu <y@castsoftware.com> wrote:
>
> Hi Paul,
>
> I am using Drill v 1.11.0 so I am only seeing sqlline.log and 
> sqlline_queries.log. hopefully the same.
>
> I am following your instructions and I am not seeing any maxAllocation other 
> than 10,000,000,000. No other number (or small number) than this. The query 
> profile reads the following:
>
> {"queryId":"25ff81fc-3b7a-a840-b557-d2194cc6819a","schema":"","queryText":"SELECT
>  * FROM `dfs`.`Inputs`.`./ Compliance.json` LIMIT 
> 100","start":1509981699406,"finish":1509981707544,"outcome":"FAILED","username":"","remoteAddress":"localhost"}
>
> Is this what you're looking for?
>
> Thanks,
> Yun
>
> -Original Message-
> From: Paul Rogers [mailto:prog...@mapr.com]
> Sent: Friday, November 3, 2017 6:45 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
>
> Thanks for the info. Clearly you are way ahead of me.
>
> In issue 1, although you have only four (top level) fields, your example 
> shows that you have many nested fields. It is the total field count (across 
> all maps) that drives total width. And, it is the total amount of data that 
> drives memory consumption.
>
> You mentioned each record is 64KB and 3K rows. That suggests a total size of 
> around 200MB. But, you mention the total file size is 400MB. So, either the 
> rows are twice as large, or there are twice as many. If you have 3K rows of 
> 128MB each, then each batch of data is 400MB, which is pretty large.
>
> If your records are 64K in size, and we read 4K per batch, then the total 
> size is 256MB, which is also large.
>
> So, we are dealing with jumbo records and you really want the “batch size 
> control” feature that we are working on, but have not yet shipped.
>
> Let’s work out the math. How many sorts in your query? What other operators 
> does the query include? Let’s assume a single sort.
>
> Max query memory is 10 GB. 10 GB / 1 sort / max width of 5 = 2 GB per sort. 
> Since your batches are ~400 MB, things should work.
>
> Since things don’t work, I suspect that we’re missing something.  (Note that 
> the memory size we just calculated does not match the numbers shown in an 
> earlier post in which the sort got just ~40 MB of memory…)
>
> Try this:
>
> * With your current settings, enable debug-level logging. Run your query.
>
> * Open the Drillbit log. Look for the JSON version of the query plan (there 
> will be two). One will tell you how much memory is given to the sort:
>
> maxAllocation: (some number)
>
> * Ignore the one that says 10,000,000, find the one with a smaller number. 
> What is that number?
>
> * Then, look in the query profile for your query. Look at the peak memory for 
> your JSON reader scan operator. The peak memory more-or-less reflects the 
> batch size. What is that number?
>
> With those, we can tell if the settings and sizes we think we are using are, 
> in fact, correct.
>
> Thanks,
>
> - Paul
>
>> On Nov 3, 2017, at 1:19 PM, Yun Liu <y@castsoftware.com> wrote:
>>
>> Hi Paul,
>>
>> Thanks for you detailed explanation. First off- I have 2 issues and I wanted 
>> to clear it out before continuing.
>>
>> Current setting: planner.memory.max_query_memory_per_node = 10GB, HEAP
>> = 12G, Direct memory = 32G, Perm 1024M, and planner.width.max_per_node
>> = 5
>>
>> Issue # 1:
>> When loading a json file with 400MB I keep getting a DATA_READ ERROR.
>> Each record in the file is about 64KB. Since it's a json file, there are 
>> only 4 fields per each record. Not sure how many records this file contains 
>> as it's too large to open with any tools, but I am guessing about 3k

Re: Drill Capacity

2017-11-06 Thread Paul Rogers
Hi Yun,

Sorry, it is a bit confusing. The log will contain two kinds of JSON. One is 
the query profile, which is what you found. The other is the physical plan used 
to run the query. It is the physical plan you want to find; that is the one 
that has the max allocation.

If you can post your logs somewhere, I’ll d/l them and take a look.

- Paul

> On Nov 6, 2017, at 7:27 AM, Yun Liu <y@castsoftware.com> wrote:
> 
> Hi Paul,
> 
> I am using Drill v 1.11.0 so I am only seeing sqlline.log and 
> sqlline_queries.log. hopefully the same.
> 
> I am following your instructions and I am not seeing any maxAllocation other 
> than 10,000,000,000. No other number (or small number) than this. The query 
> profile reads the following:
> 
> {"queryId":"25ff81fc-3b7a-a840-b557-d2194cc6819a","schema":"","queryText":"SELECT
>  * FROM `dfs`.`Inputs`.`./ Compliance.json` LIMIT 
> 100","start":1509981699406,"finish":1509981707544,"outcome":"FAILED","username":"","remoteAddress":"localhost"}
> 
> Is this what you're looking for?
> 
> Thanks,
> Yun
> 
> -Original Message-
> From: Paul Rogers [mailto:prog...@mapr.com] 
> Sent: Friday, November 3, 2017 6:45 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
> 
> Thanks for the info. Clearly you are way ahead of me.
> 
> In issue 1, although you have only four (top level) fields, your example 
> shows that you have many nested fields. It is the total field count (across 
> all maps) that drives total width. And, it is the total amount of data that 
> drives memory consumption.
> 
> You mentioned each record is 64KB and 3K rows. That suggests a total size of 
> around 200MB. But, you mention the total file size is 400MB. So, either the 
> rows are twice as large, or there are twice as many. If you have 3K rows of 
> 128MB each, then each batch of data is 400MB, which is pretty large.
> 
> If your records are 64K in size, and we read 4K per batch, then the total 
> size is 256MB, which is also large.
> 
> So, we are dealing with jumbo records and you really want the “batch size 
> control” feature that we are working on, but have not yet shipped.
> 
> Let’s work out the math. How many sorts in your query? What other operators 
> does the query include? Let’s assume a single sort.
> 
> Max query memory is 10 GB. 10 GB / 1 sort / max width of 5 = 2 GB per sort. 
> Since your batches are ~400 MB, things should work.
> 
> Since things don’t work, I suspect that we’re missing something.  (Note that 
> the memory size we just calculated does not match the numbers shown in an 
> earlier post in which the sort got just ~40 MB of memory…)
> 
> Try this:
> 
> * With your current settings, enable debug-level logging. Run your query.
> 
> * Open the Drillbit log. Look for the JSON version of the query plan (there 
> will be two). One will tell you how much memory is given to the sort: 
> 
> maxAllocation: (some number)
> 
> * Ignore the one that says 10,000,000, find the one with a smaller number. 
> What is that number?
> 
> * Then, look in the query profile for your query. Look at the peak memory for 
> your JSON reader scan operator. The peak memory more-or-less reflects the 
> batch size. What is that number?
> 
> With those, we can tell if the settings and sizes we think we are using are, 
> in fact, correct.
> 
> Thanks,
> 
> - Paul
> 
>> On Nov 3, 2017, at 1:19 PM, Yun Liu <y@castsoftware.com> wrote:
>> 
>> Hi Paul,
>> 
>> Thanks for you detailed explanation. First off- I have 2 issues and I wanted 
>> to clear it out before continuing.
>> 
>> Current setting: planner.memory.max_query_memory_per_node = 10GB, HEAP 
>> = 12G, Direct memory = 32G, Perm 1024M, and planner.width.max_per_node 
>> = 5
>> 
>> Issue # 1:
>> When loading a json file with 400MB I keep getting a DATA_READ ERROR.
>> Each record in the file is about 64KB. Since it's a json file, there are 
>> only 4 fields per each record. Not sure how many records this file contains 
>> as it's too large to open with any tools, but I am guessing about 3k rows.
>> With all the recommendations provided by various experts, nothing has worked.
>> 
>> Issue 2#:
>> While processing a query with is a join of 2 functional .json files, I am 
>> getting a RESOURCE ERROR: One or more nodes ran out of memory while 
>> executing the query. These 2 json files alone process fine but when joined 
>> together, Drill throws me that error.
>> Json#1 is 11k KB, has 8 fields with 74091 rows
>> Json#2

RE: Drill Capacity

2017-11-06 Thread Yun Liu
Hi Paul,

I am using Drill v1.11.0, so I am only seeing sqlline.log and 
sqlline_queries.log. Hopefully it is the same.

I am following your instructions, and I am not seeing any maxAllocation other 
than 10,000,000,000. There is no other (or smaller) number than this. The query 
profile reads the following:

{"queryId":"25ff81fc-3b7a-a840-b557-d2194cc6819a","schema":"","queryText":"SELECT
 * FROM `dfs`.`Inputs`.`./ Compliance.json` LIMIT 
100","start":1509981699406,"finish":1509981707544,"outcome":"FAILED","username":"","remoteAddress":"localhost"}

Is this what you're looking for?

Thanks,
Yun

-Original Message-----
From: Paul Rogers [mailto:prog...@mapr.com] 
Sent: Friday, November 3, 2017 6:45 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Thanks for the info. Clearly you are way ahead of me.

In issue 1, although you have only four (top level) fields, your example shows 
that you have many nested fields. It is the total field count (across all maps) 
that drives total width. And, it is the total amount of data that drives memory 
consumption.

You mentioned each record is 64KB and 3K rows. That suggests a total size of 
around 200MB. But, you mention the total file size is 400MB. So, either the 
rows are twice as large, or there are twice as many. If you have 3K rows of 
128MB each, then each batch of data is 400MB, which is pretty large.

If your records are 64K in size, and we read 4K per batch, then the total size 
is 256MB, which is also large.

So, we are dealing with jumbo records and you really want the “batch size 
control” feature that we are working on, but have not yet shipped.

Let’s work out the math. How many sorts in your query? What other operators 
does the query include? Let’s assume a single sort.

Max query memory is 10 GB. 10 GB / 1 sort / max width of 5 = 2 GB per sort. 
Since your batches are ~400 MB, things should work.

Since things don’t work, I suspect that we’re missing something.  (Note that 
the memory size we just calculated does not match the numbers shown in an 
earlier post in which the sort got just ~40 MB of memory…)

Try this:

* With your current settings, enable debug-level logging. Run your query.

* Open the Drillbit log. Look for the JSON version of the query plan (there 
will be two). One will tell you how much memory is given to the sort: 

maxAllocation: (some number)

* Ignore the one that says 10,000,000, find the one with a smaller number. What 
is that number?

* Then, look in the query profile for your query. Look at the peak memory for 
your JSON reader scan operator. The peak memory more-or-less reflects the batch 
size. What is that number?

With those, we can tell if the settings and sizes we think we are using are, in 
fact, correct.

Thanks,

- Paul

> On Nov 3, 2017, at 1:19 PM, Yun Liu <y@castsoftware.com> wrote:
> 
> Hi Paul,
> 
> Thanks for you detailed explanation. First off- I have 2 issues and I wanted 
> to clear it out before continuing.
> 
> Current setting: planner.memory.max_query_memory_per_node = 10GB, HEAP 
> = 12G, Direct memory = 32G, Perm 1024M, and planner.width.max_per_node 
> = 5
> 
> Issue # 1:
> When loading a json file with 400MB I keep getting a DATA_READ ERROR.
> Each record in the file is about 64KB. Since it's a json file, there are only 
> 4 fields per each record. Not sure how many records this file contains as 
> it's too large to open with any tools, but I am guessing about 3k rows.
> With all the recommendations provided by various experts, nothing has worked.
> 
> Issue 2#:
> While processing a query with is a join of 2 functional .json files, I am 
> getting a RESOURCE ERROR: One or more nodes ran out of memory while executing 
> the query. These 2 json files alone process fine but when joined together, 
> Drill throws me that error.
> Json#1 is 11k KB, has 8 fields with 74091 rows
> Json#2 is 752kb, has 8 fields with 4245 rows
> 
> Besides breaking them up to smaller files, not sure what else I could do.
> 
> Thanks for the help so far!
> 
> Yun
> 
> -Original Message-
> From: Paul Rogers [mailto:prog...@mapr.com]
> Sent: Thursday, November 2, 2017 11:06 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
> 
> Hi Yun,
> 
> I’m going to give you multiple ways to understand the issue based on the 
> information you’ve provided. I generally like to see the full logs to 
> diagnose such problems, but we’ll start with what you’ve provided thus far.
> 
> How large is each record in your file? How many fields? How many 
> bytes? (Alternatively, how big is a single input file and how many 
> records does it contain?)
> 
> You mention the limit of 64K columns in CSV. This makes me wonde

Re: Drill Capacity

2017-11-03 Thread Paul Rogers
Thanks for the info. Clearly you are way ahead of me.

In issue 1, although you have only four (top level) fields, your example shows 
that you have many nested fields. It is the total field count (across all maps) 
that drives total width. And, it is the total amount of data that drives memory 
consumption.

You mentioned each record is 64KB and 3K rows. That suggests a total size of 
around 200MB. But, you mention the total file size is 400MB. So, either the 
rows are twice as large, or there are twice as many. If you have 3K rows of 
128MB each, then each batch of data is 400MB, which is pretty large.

If your records are 64K in size, and we read 4K per batch, then the total size 
is 256MB, which is also large.

So, we are dealing with jumbo records and you really want the “batch size 
control” feature that we are working on, but have not yet shipped.

Let’s work out the math. How many sorts in your query? What other operators 
does the query include? Let’s assume a single sort.

Max query memory is 10 GB. 10 GB / 1 sort / max width of 5 = 2 GB per sort. 
Since your batches are ~400 MB, things should work.
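A sketch of that arithmetic in terms of the two session options already mentioned in this thread (the byte value is simply the 10,000,000,000 figure from your log):

-- per-sort budget = max_query_memory_per_node / (number of sorts * width per node)
--                 = 10,000,000,000 / (1 * 5) = 2,000,000,000 bytes, i.e. ~2 GB
alter session set `planner.memory.max_query_memory_per_node` = 10000000000;
alter session set `planner.width.max_per_node` = 5;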

Since things don’t work, I suspect that we’re missing something.  (Note that 
the memory size we just calculated does not match the numbers shown in an 
earlier post in which the sort got just ~40 MB of memory…)

Try this:

* With your current settings, enable debug-level logging. Run your query.

* Open the Drillbit log. Look for the JSON version of the query plan (there 
will be two). One will tell you how much memory is given to the sort: 

maxAllocation: (some number)

* Ignore the one that says 10,000,000,000; find the one with a smaller number. What 
is that number?

* Then, look in the query profile for your query. Look at the peak memory for 
your JSON reader scan operator. The peak memory more-or-less reflects the batch 
size. What is that number?

With those, we can tell if the settings and sizes we think we are using are, in 
fact, correct.

Thanks,

- Paul

> On Nov 3, 2017, at 1:19 PM, Yun Liu <y@castsoftware.com> wrote:
> 
> Hi Paul,
> 
> Thanks for you detailed explanation. First off- I have 2 issues and I wanted 
> to clear it out before continuing.
> 
> Current setting: planner.memory.max_query_memory_per_node = 10GB, HEAP = 12G, 
> Direct memory = 32G, Perm 1024M, and planner.width.max_per_node = 5
> 
> Issue # 1:
> When loading a json file with 400MB I keep getting a DATA_READ ERROR.
> Each record in the file is about 64KB. Since it's a json file, there are only 
> 4 fields per each record. Not sure how many records this file contains as 
> it's too large to open with any tools, but I am guessing about 3k rows.
> With all the recommendations provided by various experts, nothing has worked.
> 
> Issue 2#:
> While processing a query with is a join of 2 functional .json files, I am 
> getting a RESOURCE ERROR: One or more nodes ran out of memory while executing 
> the query. These 2 json files alone process fine but when joined together, 
> Drill throws me that error.
> Json#1 is 11k KB, has 8 fields with 74091 rows
> Json#2 is 752kb, has 8 fields with 4245 rows
> 
> Besides breaking them up to smaller files, not sure what else I could do.
> 
> Thanks for the help so far!
> 
> Yun
> 
> -Original Message-
> From: Paul Rogers [mailto:prog...@mapr.com] 
> Sent: Thursday, November 2, 2017 11:06 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
> 
> Hi Yun,
> 
> I’m going to give you multiple ways to understand the issue based on the 
> information you’ve provided. I generally like to see the full logs to 
> diagnose such problems, but we’ll start with what you’ve provided thus far.
> 
> How large is each record in your file? How many fields? How many bytes? 
> (Alternatively, how big is a single input file and how many records does it 
> contain?)
> 
> You mention the limit of 64K columns in CSV. This makes me wonder if you have 
> a “jumbo” record. If each individual record is large, then there won’t be 
> enough space in the sort to take even a single batch of records, and you’ll 
> get the sv2 error that you saw.
> 
> We can guess the size, however, from the info you provided:
> 
> batchGroups.size 1
> spilledBatchGroups.size 0
> allocated memory 42768000
> allocator limit 41943040
> 
> This says you have a batch in memory and are trying to allocate some memory 
> (the “sv2”). The allocated memory number tells us that each batch size is 
> probably ~43 MB. But, the sort only has 42 MB to play with. The sort needs at 
> least two batches in memory to make progress, hence the out-of-memory errors.
> 
> It would be nice to confirm this from the logs, but unfortunately, Drill does 
> not normally log the size of each batch. As it turns out, however, the 
> “manag

RE: Drill Capacity

2017-11-03 Thread Yun Liu
Yes, I guess breaking them into smaller files will solve this.

Thanks!
Yun

-Original Message-
From: Arjun kr [mailto:arjun...@outlook.com] 
Sent: Friday, November 3, 2017 5:40 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity



I have seen a use-case where query fails for 12 GB single json file having 
structure ''{ "key":[obj1, obj2, obj3..objn]}''. Here json file has a key 
element and value is array of json object 'obj'. There were around 175K objects 
in this array and each obj is again complex json object with nested array 
elements. From what I understood, Drill reads entire file content as  single 
json record (which actually is) and fails with DATA_READ ERROR.


The solution was to re organize the data to either of following structure. Also 
to break single file into multiple smaller file for better parallelism.


Structure 2: File has array of json object like below
[ {obj1},{obj2}..,{objn}]

Structure 3:  File has  json objects as below
{obj1}
{obj1}
..
{objn}


I was checking if this is the case here..


Thanks,


Arjun



From: Yun Liu <y@castsoftware.com>
Sent: Saturday, November 4, 2017 2:27 AM
To: user@drill.apache.org
Subject: RE: Drill Capacity

Hi Arjun,

Column 4 has the most data and a bit long here. The other 3 columns has maybe a 
word or 2. Thanks for your patience.

[ {
  "type" : "quality-rules",
  "reference" : {
"href" : "",
"name" : "Avoid unreferenced Tables",
"key" : "1634",
"critical" : false
  },
  "result" : {
"grade" : 2,
"violationRatio" : {
  "totalChecks" : 52,
  "failedChecks" : 5,
  "successfulChecks" : 47,
  "ratio" : 0.9038461538461539
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 1,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : "Microsoft T-SQL",
"result" : {
  "grade" : 2.0769230769230775,
  "violationRatio" : {
"totalChecks" : 52,
"failedChecks" : 5,
"successfulChecks" : 47,
"ratio" : 0.9038461538461539
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 1,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "",
"name" : "Namespace naming convention - case control",
"key" : "3550",
"critical" : false
  },
  "result" : {
"grade" : 4.0,
"violationRatio" : {
  "totalChecks" : 31,
  "failedChecks" : 0,
  "successfulChecks" : 31,
  "ratio" : 1.0
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 0,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : ".NET",
"result" : {
  "grade" : 4.0,
  "violationRatio" : {
"totalChecks" : 31,
"failedChecks" : 0,
"successfulChecks" : 31,
"ratio" : 1.0
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 0,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "2",
"name" : "Interface naming convention - case and character set control",
"key" : "3554",
"critical" : false
  },
  "result" : {
"grade" : 4.0,
"violationRatio" : {
  "totalChecks" : 10,
  "failedChecks" : 0,
  "successfulChecks" : 10,
  "ratio" : 1.0
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 0,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : ".NET",
"result" : {
  "grade" : 4.0,
  "violationR

Re: Drill Capacity

2017-11-03 Thread Arjun kr


I have seen a use case where a query fails for a single 12 GB json file having the 
structure '{ "key":[obj1, obj2, obj3..objn]}'. Here the json file has a key 
element whose value is an array of json objects 'obj'. There were around 175K objects 
in this array, and each obj is again a complex json object with nested array 
elements. From what I understood, Drill reads the entire file content as a single 
json record (which it actually is) and fails with a DATA_READ ERROR.


The solution was to reorganize the data into either of the following structures, and 
also to break the single file into multiple smaller files for better parallelism.


Structure 2: File has array of json object like below
[ {obj1},{obj2}..,{objn}]

Structure 3:  File has  json objects as below
{obj1}
{obj1}
..
{objn}
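
As a concrete sketch of Structure 3 built from the "quality-rules" objects earlier in this thread (fields abbreviated, so this is illustrative rather than the actual file contents), each object sits on its own line:

{"type":"quality-rules","reference":{"key":"1634","name":"Avoid unreferenced Tables"},"result":{"grade":2}}
{"type":"quality-rules","reference":{"key":"3550","name":"Namespace naming convention - case control"},"result":{"grade":4.0}}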


I was checking if this is the case here..


Thanks,


Arjun



From: Yun Liu <y@castsoftware.com>
Sent: Saturday, November 4, 2017 2:27 AM
To: user@drill.apache.org
Subject: RE: Drill Capacity

Hi Arjun,

Column 4 has the most data and a bit long here. The other 3 columns has maybe a 
word or 2. Thanks for your patience.

[ {
  "type" : "quality-rules",
  "reference" : {
"href" : "",
"name" : "Avoid unreferenced Tables",
"key" : "1634",
"critical" : false
  },
  "result" : {
"grade" : 2,
"violationRatio" : {
  "totalChecks" : 52,
  "failedChecks" : 5,
  "successfulChecks" : 47,
  "ratio" : 0.9038461538461539
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 1,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : "Microsoft T-SQL",
"result" : {
  "grade" : 2.0769230769230775,
  "violationRatio" : {
"totalChecks" : 52,
"failedChecks" : 5,
"successfulChecks" : 47,
"ratio" : 0.9038461538461539
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 1,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "",
"name" : "Namespace naming convention - case control",
"key" : "3550",
"critical" : false
  },
  "result" : {
"grade" : 4.0,
"violationRatio" : {
  "totalChecks" : 31,
  "failedChecks" : 0,
  "successfulChecks" : 31,
  "ratio" : 1.0
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 0,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : ".NET",
"result" : {
  "grade" : 4.0,
  "violationRatio" : {
"totalChecks" : 31,
"failedChecks" : 0,
"successfulChecks" : 31,
"ratio" : 1.0
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 0,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "2",
"name" : "Interface naming convention - case and character set control",
"key" : "3554",
"critical" : false
  },
  "result" : {
"grade" : 4.0,
"violationRatio" : {
  "totalChecks" : 10,
  "failedChecks" : 0,
  "successfulChecks" : 10,
  "ratio" : 1.0
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 0,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : ".NET",
"result" : {
  "grade" : 4.0,
  "violationRatio" : {
"totalChecks" : 10,
"failedChecks" : 0,
"successfulChecks" : 10,
"ratio" : 1.0
  },
  "evolutionSummary" : {
"addedCri

RE: Drill Capacity

2017-11-03 Thread Yun Liu
Hi Arjun,

Column 4 has the most data and a bit long here. The other 3 columns has maybe a 
word or 2. Thanks for your patience.

[ {
  "type" : "quality-rules",
  "reference" : {
"href" : "",
"name" : "Avoid unreferenced Tables",
"key" : "1634",
"critical" : false
  },
  "result" : {
"grade" : 2,
"violationRatio" : {
  "totalChecks" : 52,
  "failedChecks" : 5,
  "successfulChecks" : 47,
  "ratio" : 0.9038461538461539
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 1,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : "Microsoft T-SQL",
"result" : {
  "grade" : 2.0769230769230775,
  "violationRatio" : {
"totalChecks" : 52,
"failedChecks" : 5,
"successfulChecks" : 47,
"ratio" : 0.9038461538461539
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 1,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "",
"name" : "Namespace naming convention - case control",
"key" : "3550",
"critical" : false
  },
  "result" : {
"grade" : 4.0,
"violationRatio" : {
  "totalChecks" : 31,
  "failedChecks" : 0,
  "successfulChecks" : 31,
  "ratio" : 1.0
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 0,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : ".NET",
"result" : {
  "grade" : 4.0,
  "violationRatio" : {
"totalChecks" : 31,
"failedChecks" : 0,
"successfulChecks" : 31,
"ratio" : 1.0
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 0,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "2",
"name" : "Interface naming convention - case and character set control",
"key" : "3554",
"critical" : false
  },
  "result" : {
"grade" : 4.0,
"violationRatio" : {
  "totalChecks" : 10,
  "failedChecks" : 0,
  "successfulChecks" : 10,
  "ratio" : 1.0
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 0,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : ".NET",
"result" : {
  "grade" : 4.0,
  "violationRatio" : {
"totalChecks" : 10,
"failedChecks" : 0,
"successfulChecks" : 10,
"ratio" : 1.0
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 0,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "",
"name" : "Enumerations naming convention - case and character set control",
"key" : "3558",
"critical" : false
  },
  "result" : {
"grade" : 4.0,
"violationRatio" : {
  "totalChecks" : 7,
  "failedChecks" : 0,
  "successfulChecks" : 7,
  "ratio" : 1.0
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 0,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : ".NET",
"result" : {
  "grade" : 4.0,
  "violationRatio" : {
"totalChecks" : 7,
"failedChecks" : 0,
"successfulChecks" : 7,
"ratio" : 1.0
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 0,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : "quality-rules",
  "reference" : {
"href" : "",
"name" : "Enumeration Items naming convention - case and character set 
control",
"key" : "3560",
"critical" : false
  },
  "result" : {
"grade" : 4.0,
"violationRatio" : {
  "totalChecks" : 65,
  "failedChecks" : 0,
  "successfulChecks" : 65,
  "ratio" : 1.0
},
"evolutionSummary" : {
  "addedCriticalViolations" : 0,
  "removedCriticalViolations" : 0,
  "addedViolations" : 0,
  "removedViolations" : 0
}
  },
  "technologyResults" : [ {
"technology" : ".NET",
"result" : {
  "grade" : 4.0,
  "violationRatio" : {
"totalChecks" : 65,
"failedChecks" : 0,
"successfulChecks" : 65,
"ratio" : 1.0
  },
  "evolutionSummary" : {
"addedCriticalViolations" : 0,
"removedCriticalViolations" : 0,
"addedViolations" : 0,
"removedViolations" : 0
  }
}
  } ]
}, {
  "type" : 

Re: Drill Capacity

2017-11-03 Thread Arjun kr
Hi Yun,


Could you please provide more details on your json data structure for 400 MB 
json file.


Structure 1:


‘{ "key":[obj1, obj2, obj3..objn]}’


Structure 2:
[ {obj1},{obj2}..,{objn}]

Structure 3:
{obj1}
{obj1}
..
{objn}



Thanks,


Arjun



From: Yun Liu <y@castsoftware.com>
Sent: Saturday, November 4, 2017 1:49 AM
To: user@drill.apache.org
Subject: RE: Drill Capacity

Hi Paul,

Thanks for you detailed explanation. First off- I have 2 issues and I wanted to 
clear it out before continuing.

Current setting: planner.memory.max_query_memory_per_node = 10GB, HEAP = 12G, 
Direct memory = 32G, Perm 1024M, and planner.width.max_per_node = 5

Issue # 1:
When loading a json file with 400MB I keep getting a DATA_READ ERROR.
Each record in the file is about 64KB. Since it's a json file, there are only 4 
fields per each record. Not sure how many records this file contains as it's 
too large to open with any tools, but I am guessing about 3k rows.
With all the recommendations provided by various experts, nothing has worked.

Issue 2#:
While processing a query with is a join of 2 functional .json files, I am 
getting a RESOURCE ERROR: One or more nodes ran out of memory while executing 
the query. These 2 json files alone process fine but when joined together, 
Drill throws me that error.
Json#1 is 11k KB, has 8 fields with 74091 rows
Json#2 is 752kb, has 8 fields with 4245 rows

Besides breaking them up to smaller files, not sure what else I could do.

Thanks for the help so far!

Yun

-Original Message-
From: Paul Rogers [mailto:prog...@mapr.com]
Sent: Thursday, November 2, 2017 11:06 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,

I’m going to give you multiple ways to understand the issue based on the 
information you’ve provided. I generally like to see the full logs to diagnose 
such problems, but we’ll start with what you’ve provided thus far.

How large is each record in your file? How many fields? How many bytes? 
(Alternatively, how big is a single input file and how many records does it 
contain?)

You mention the limit of 64K columns in CSV. This makes me wonder if you have a 
“jumbo” record. If each individual record is large, then there won’t be enough 
space in the sort to take even a single batch of records, and you’ll get the 
sv2 error that you saw.

We can guess the size, however, from the info you provided:

batchGroups.size 1
spilledBatchGroups.size 0
allocated memory 42768000
allocator limit 41943040

This says you have a batch in memory and are trying to allocate some memory 
(the “sv2”). The allocated memory number tells us that each batch size is 
probably ~43 MB. But, the sort only has 42 MB to play with. The sort needs at 
least two batches in memory to make progress, hence the out-of-memory errors.

It would be nice to confirm this from the logs, but unfortunately, Drill does 
not normally log the size of each batch. As it turns out, however, the 
“managed” version that Boaz mentioned added more logging around this problem: 
it will tell you how large it thinks each batch is, and will warn if you have, 
say, a 43 MB batch but only 42 MB in which to sort.

(If you do want to use the “managed” version of the sort, I suggest you try 
Drill 1.12 when it is released as that version contains additional fixes to 
handle constrained memory.)

Also, at present, The JSON record reader loads 4096 records into each batch. If 
your file has at least that many records, then we can guess each record is 
about 43 MB / 4096 =~ 10K in size. (You can confirm, as noted above, by 
dividing total file size by record count.)

We are doing work to handle such large batches, but the work is not yet 
available in a release. Unfortunately, in the meanwhile, we also don’t let you 
control the batch size. But, we can provide another solution.

Let's explain why the message you provided said that the “allocator limit” was 
42 MB. Drill does the following to allocate memory to the sort:

* Take the “max query memory per node” (default of 2 GB regardless of actual 
direct memory),
* Divide by the number of sort operators in the plan (as shown in the 
visualized query profile)
* Divide by the “planner width” which is, by default, 70% of the number of 
cores on your system.

In your case, if you are using the default 2 GB total, but getting 41 MB per 
sort, the divisor is 50. Maybe you have 2 sorts and 32 cores? (2 * 32 * 70% =~ 
45.) Or some other combination.

We can’t reduce the number of sorts; that’s determined by your query. But, we 
can play with the other numbers.

First, we can increase the memory per query:

ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4,294,967,296

That is, 4 GB. This obviously means you must have at least 6 GB of direct 
memory; more is better.

And/or, we can reduce the number of fragments:

ALTER SESSION SET `planner.width.max_per_node` = 

The value is a bit tricky. Drill normally 

RE: Drill Capacity

2017-11-03 Thread Yun Liu
Hi Paul,

Thanks for you detailed explanation. First off- I have 2 issues and I wanted to 
clear it out before continuing.

Current setting: planner.memory.max_query_memory_per_node = 10GB, HEAP = 12G, 
Direct memory = 32G, Perm 1024M, and planner.width.max_per_node = 5

Issue # 1:
When loading a json file with 400MB I keep getting a DATA_READ ERROR.
Each record in the file is about 64KB. Since it's a json file, there are only 4 
fields per each record. Not sure how many records this file contains as it's 
too large to open with any tools, but I am guessing about 3k rows.
With all the recommendations provided by various experts, nothing has worked.

Issue 2#:
While processing a query with is a join of 2 functional .json files, I am 
getting a RESOURCE ERROR: One or more nodes ran out of memory while executing 
the query. These 2 json files alone process fine but when joined together, 
Drill throws me that error.
Json#1 is 11k KB, has 8 fields with 74091 rows
Json#2 is 752kb, has 8 fields with 4245 rows

Besides breaking them up to smaller files, not sure what else I could do.

Thanks for the help so far!

Yun

-Original Message-
From: Paul Rogers [mailto:prog...@mapr.com] 
Sent: Thursday, November 2, 2017 11:06 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,

I’m going to give you multiple ways to understand the issue based on the 
information you’ve provided. I generally like to see the full logs to diagnose 
such problems, but we’ll start with what you’ve provided thus far.
 
How large is each record in your file? How many fields? How many bytes? 
(Alternatively, how big is a single input file and how many records does it 
contain?)

You mention the limit of 64K columns in CSV. This makes me wonder if you have a 
“jumbo” record. If each individual record is large, then there won’t be enough 
space in the sort to take even a single batch of records, and you’ll get the 
sv2 error that you saw.

We can guess the size, however, from the info you provided:

batchGroups.size 1
spilledBatchGroups.size 0
allocated memory 42768000
allocator limit 41943040

This says you have a batch in memory and are trying to allocate some memory 
(the “sv2”). The allocated memory number tells us that each batch size is 
probably ~43 MB. But, the sort only has 42 MB to play with. The sort needs at 
least two batches in memory to make progress, hence the out-of-memory errors.

It would be nice to confirm this from the logs, but unfortunately, Drill does 
not normally log the size of each batch. As it turns out, however, the 
“managed” version that Boaz mentioned added more logging around this problem: 
it will tell you how large it thinks each batch is, and will warn if you have, 
say, a 43 MB batch but only 42 MB in which to sort.

(If you do want to use the “managed” version of the sort, I suggest you try 
Drill 1.12 when it is released as that version contains additional fixes to 
handle constrained memory.)

Also, at present, The JSON record reader loads 4096 records into each batch. If 
your file has at least that many records, then we can guess each record is 
about 43 MB / 4096 =~ 10K in size. (You can confirm, as noted above, by 
dividing total file size by record count.)

We are doing work to handle such large batches, but the work is not yet 
available in a release. Unfortunately, in the meanwhile, we also don’t let you 
control the batch size. But, we can provide another solution.

Let's explain why the message you provided said that the “allocator limit” was 
42 MB. Drill does the following to allocate memory to the sort:

* Take the “max query memory per node” (default of 2 GB regardless of actual 
direct memory),
* Divide by the number of sort operators in the plan (as shown in the 
visualized query profile)
* Divide by the “planner width” which is, by default, 70% of the number of 
cores on your system.

In your case, if you are using the default 2 GB total, but getting 41 MB per 
sort, the divisor is 50. Maybe you have 2 sorts and 32 cores? (2 * 32 * 70% =~ 
45.) Or some other combination.

We can’t reduce the number of sorts; that’s determined by your query. But, we 
can play with the other numbers.

First, we can increase the memory per query:

ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4,294,967,296

That is, 4 GB. This obviously means you must have at least 6 GB of direct 
memory; more is better.

And/or, we can reduce the number of fragments:

ALTER SESSION SET `planner.width.max_per_node` = 

The value is a bit tricky. Drill normally creates a number of fragments equal 
to 70% of the number of CPUs on your system. Let’s say you have 32 cores. If 
so, change the max_per_node to, say, 10 or even 5. This will mean fewer sorts 
and so more memory per sort, helping compensate for the “jumbo” batches in your 
query. Pick a number based on your actual number of cores.

As an alternative, as Ted suggested, you could create a larger number of 
smaller files as this would

RE: Drill Capacity

2017-11-03 Thread Yun Liu
Hi Boaz,

Looks like I've already had those set to "false". So it didn't change much.

Thanks,
Yun

-Original Message-
From: Boaz Ben-Zvi [mailto:bben-...@mapr.com] 
Sent: Thursday, November 2, 2017 6:14 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

 Hi Yun,

 Can you try using the “managed” version of the external sort – either 
change this option to false:

0: jdbc:drill:zk=local> select * from sys.options where name like '%man%';
++--+---+--+--+--+-+---++
|name|   kind   | accessibleScopes  | optionScope  |  
status  | num_val  | string_val  | bool_val  | float_val  |
++--+---+--+--+--+-+---++
| exec.sort.disable_managed  | BOOLEAN  | ALL   | BOOT | 
DEFAULT  | null | null| false | null   |
++--+---+--+--+--+-+---++

Or override it into ‘false’ in the configuration:

0: jdbc:drill:zk=local> select * from sys.boot where name like '%managed%';
+---+--+---+--+-+--+-+---++
| name  |   kind   | accessibleScopes  
| optionScope  | status  | num_val  | string_val  | bool_val  | float_val  |
+---+--+---+--+-+--+-+---++
| drill.exec.options.exec.sort.disable_managed  | BOOLEAN  | BOOT  
| BOOT | BOOT| null | null| false | null   |
+---+--+---+--+-+--+-+---++

i.e., in the drill-override.conf file:

  sort: {
 external: {
 disable_managed: false
  }
  }

  Please let us know if this change helped,

 -- Boaz 


On 11/2/17, 1:12 PM, "Yun Liu" <y@castsoftware.com> wrote:

Please help me as to what further information I could provide to get this 
going. I am also experiencing a separate issue:

RESOURCE ERROR: One or more nodes ran out of memory while executing the 
query.

Unable to allocate sv2 for 8501 records, and not enough batchGroups to 
spill.
batchGroups.size 1
spilledBatchGroups.size 0
allocated memory 42768000
allocator limit 41943040

Current setting is: 
planner.memory.max_query_memory_per_node= 10GB 
HEAP to 12G 
Direct memory to 32G 
Perm to 1024M

What is the issue here?

Thanks,
Yun

-Original Message-
From: Yun Liu [mailto:y@castsoftware.com] 
Sent: Thursday, November 2, 2017 3:52 PM
To: user@drill.apache.org
Subject: RE: Drill Capacity

Yes- I increased planner.memory.max_query_memory_per_node to 10GB HEAP to 
12G Direct memory to 16G And Perm to 1024M

It didn't have any schema changes. As with the same file format but less 
data- it works perfectly ok. I am unable to tell if there's corruption.

Yun

-Original Message-
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com]
Sent: Thursday, November 2, 2017 3:35 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

What memory setting did you increase? Have you tried 6 or 8GB?

How much memory is allocated to Drill Heap and Direct memory for the 
embedded Drillbit?

Also did you check the larger document doesn’t have any schema changes or 
corruption?

--Andries



On 11/2/17, 12:31 PM, "Yun Liu" <y@castsoftware.com> wrote:

Hi Kunal and Andries,

Thanks for your reply. We need json in this case because Drill only 
supports up to 65536 columns in a csv file. I also tried increasing the memory 
size to 4GB but I am still experiencing same issues. Drill is installed in 
Embedded Mode.

Thanks,
Yun

-Original Message-
From: Kunal Khatua [mailto:kkha...@mapr.com] 
Sent: Thursday, November 2, 2017 2:01 PM
    To: user@drill.apache.org
Subject: RE: Drill Capacity

Hi Yun

Andries solution should address your problem. However, do understand 
that, unlike CSV files, a JSON file cannot be processed in parallel, because 
there is no clear record delimiter (CSV data usually has a new-line character 
to indicate the end of a record). So, the larger a file gets, the more work a 
single minor fragment has to do in processing it, including maintaining 
i

RE: Drill Capacity

2017-11-03 Thread Yun Liu
Hi Boaz,



Seems I've already had those set to false. So it didn't help ☹



Thanks,

Yun



-Original Message-
From: Boaz Ben-Zvi [mailto:bben-...@mapr.com]
Sent: Thursday, November 2, 2017 6:14 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity



Hi Yun,



 Can you try using the “managed” version of the external sort – either 
change this option to false:



0: jdbc:drill:zk=local> select * from sys.options where name like '%man%';

++--+---+--+--+--+-+---++

|name|   kind   | accessibleScopes  | optionScope  |  
status  | num_val  | string_val  | bool_val  | float_val  |

++--+---+--+--+--+-+---++

| exec.sort.disable_managed  | BOOLEAN  | ALL   | BOOT | 
DEFAULT  | null | null| false | null   |

++--+---+--+--+--+-+---++



Or override it into ‘false’ in the configuration:



0: jdbc:drill:zk=local> select * from sys.boot where name like '%managed%';

+---+--+---+--+-+--+-+---++

| name  |   kind   | accessibleScopes  
| optionScope  | status  | num_val  | string_val  | bool_val  | float_val  |

+---+--+---+--+-+--+-+---++

| drill.exec.options.exec.sort.disable_managed  | BOOLEAN  | BOOT  
| BOOT | BOOT| null | null| false | null   |

+---+--+---+--+-+--+-+---++



i.e., in the drill-override.conf file:



  sort: {

 external: {

 disable_managed: false

  }

  }



  Please let us know if this change helped,



 -- Boaz





On 11/2/17, 1:12 PM, "Yun Liu" 
<y@castsoftware.com<mailto:y@castsoftware.com>> wrote:



Please help me as to what further information I could provide to get this 
going. I am also experiencing a separate issue:



RESOURCE ERROR: One or more nodes ran out of memory while executing the 
query.



Unable to allocate sv2 for 8501 records, and not enough batchGroups to 
spill.

batchGroups.size 1

spilledBatchGroups.size 0

allocated memory 42768000

allocator limit 41943040



Current setting is:

planner.memory.max_query_memory_per_node= 10GB

HEAP to 12G

Direct memory to 32G

Perm to 1024M



What is the issue here?



Thanks,

Yun



-Original Message-

From: Yun Liu [mailto:y@castsoftware.com]

Sent: Thursday, November 2, 2017 3:52 PM

To: user@drill.apache.org<mailto:user@drill.apache.org>

Subject: RE: Drill Capacity



Yes- I increased planner.memory.max_query_memory_per_node to 10GB HEAP to 
12G Direct memory to 16G And Perm to 1024M



It didn't have any schema changes. As with the same file format but less 
data- it works perfectly ok. I am unable to tell if there's corruption.



Yun



-Original Message-

From: Andries Engelbrecht [mailto:aengelbre...@mapr.com]

Sent: Thursday, November 2, 2017 3:35 PM

To: user@drill.apache.org<mailto:user@drill.apache.org>

Subject: Re: Drill Capacity



What memory setting did you increase? Have you tried 6 or 8GB?



How much memory is allocated to Drill Heap and Direct memory for the 
embedded Drillbit?



Also did you check the larger document doesn’t have any schema changes or 
corruption?



--Andries







On 11/2/17, 12:31 PM, "Yun Liu" 
<y@castsoftware.com<mailto:y@castsoftware.com>> wrote:



Hi Kunal and Andries,



Thanks for your reply. We need json in this case because Drill only 
supports up to 65536 columns in a csv file. I also tried increasing the memory 
size to 4GB but I am still experiencing same issues. Drill is installed in 
Embedded Mode.



Thanks,

Yun



-Original Message-

From: Kunal Khatua [mailto:kkha...@mapr.com]

Sent: Thursday, November 2, 2017 2:01 PM

To: user@drill.apache.org<mailto:user@drill.apache.org>

Subject: RE: Drill Capacity



Hi Yun



Andries solution should address your problem. However, do understand 
that, unlike CSV files, a JSON file cannot be processed in parallel, because 
there is no clear record delimiter (CSV data usually has a new-line character 
to

Re: Drill Capacity

2017-11-02 Thread Paul Rogers
Hi Yun,

I’m going to give you multiple ways to understand the issue based on the 
information you’ve provided. I generally like to see the full logs to diagnose 
such problems, but we’ll start with what you’ve provided thus far.
 
How large is each record in your file? How many fields? How many bytes? 
(Alternatively, how big is a single input file and how many records does it 
contain?)

You mention the limit of 64K columns in CSV. This makes me wonder if you have a 
“jumbo” record. If each individual record is large, then there won’t be enough 
space in the sort to take even a single batch of records, and you’ll get the 
sv2 error that you saw.

We can guess the size, however, from the info you provided:

batchGroups.size 1
spilledBatchGroups.size 0
allocated memory 42768000
allocator limit 41943040

This says you have a batch in memory and are trying to allocate some memory 
(the “sv2”). The allocated memory number tells us that each batch size is 
probably ~43 MB. But, the sort only has 42 MB to play with. The sort needs at 
least two batches in memory to make progress, hence the out-of-memory errors.

It would be nice to confirm this from the logs, but unfortunately, Drill does 
not normally log the size of each batch. As it turns out, however, the 
“managed” version that Boaz mentioned added more logging around this problem: 
it will tell you how large it thinks each batch is, and will warn if you have, 
say, a 43 MB batch but only 42 MB in which to sort.

(If you do want to use the “managed” version of the sort, I suggest you try 
Drill 1.12 when it is released as that version contains additional fixes to 
handle constrained memory.)

Also, at present, The JSON record reader loads 4096 records into each batch. If 
your file has at least that many records, then we can guess each record is 
about 43 MB / 4096 =~ 10K in size. (You can confirm, as noted above, by 
dividing total file size by record count.)

We are doing work to handle such large batches, but the work is not yet 
available in a release. Unfortunately, in the meanwhile, we also don’t let you 
control the batch size. But, we can provide another solution.

Let's explain why the message you provided said that the “allocator limit” was 
42 MB. Drill does the following to allocate memory to the sort:

* Take the “max query memory per node” (default of 2 GB regardless of actual 
direct memory),
* Divide by the number of sort operators in the plan (as shown in the 
visualized query profile)
* Divide by the “planner width” which is, by default, 70% of the number of 
cores on your system.

In your case, if you are using the default 2 GB total, but getting 41 MB per 
sort, the divisor is 50. Maybe you have 2 sorts and 32 cores? (2 * 32 * 70% =~ 
45.) Or some other combination.

We can’t reduce the number of sorts; that’s determined by your query. But, we 
can play with the other numbers.

First, we can increase the memory per query:

ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4,294,967,296

That is, 4 GB. This obviously means you must have at least 6 GB of direct 
memory; more is better.

And/or, we can reduce the number of fragments:

ALTER SESSION SET `planner.width.max_per_node` = 

The value is a bit tricky. Drill normally creates a number of fragments equal 
to 70% of the number of CPUs on your system. Let’s say you have 32 cores. If 
so, change the max_per_node to, say, 10 or even 5. This will mean fewer sorts 
and so more memory per sort, helping compensate for the “jumbo” batches in your 
query. Pick a number based on your actual number of cores.

As an alternative, as Ted suggested, you could create a larger number of 
smaller files as this would solve the batch size problem while also getting the 
parallelization benefits that Kunal mentioned.

That is three separate possible solutions. Try them one by one or (carefully) 
together.

- Paul

>> On 11/2/17, 12:31 PM, "Yun Liu"  wrote:
>> 
>>Hi Kunal and Andries,
>> 
>>Thanks for your reply. We need json in this case because Drill only
>> supports up to 65536 columns in a csv file.


Re: Drill Capacity

2017-11-02 Thread Ted Dunning
What happens if you split your large file into 5 smaller files?



On Thu, Nov 2, 2017 at 12:52 PM, Yun Liu <y@castsoftware.com> wrote:

> Yes- I increased planner.memory.max_query_memory_per_node to 10GB
> HEAP to 12G
> Direct memory to 16G
> And Perm to 1024M
>
> It didn't have any schema changes. As with the same file format but less
> data- it works perfectly ok. I am unable to tell if there's corruption.
>
> Yun
>
> -Original Message-
> From: Andries Engelbrecht [mailto:aengelbre...@mapr.com]
> Sent: Thursday, November 2, 2017 3:35 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
>
> What memory setting did you increase? Have you tried 6 or 8GB?
>
> How much memory is allocated to Drill Heap and Direct memory for the
> embedded Drillbit?
>
> Also did you check the larger document doesn’t have any schema changes or
> corruption?
>
> --Andries
>
>
>
> On 11/2/17, 12:31 PM, "Yun Liu" <y@castsoftware.com> wrote:
>
> Hi Kunal and Andries,
>
> Thanks for your reply. We need json in this case because Drill only
> supports up to 65536 columns in a csv file. I also tried increasing the
> memory size to 4GB but I am still experiencing same issues. Drill is
> installed in Embedded Mode.
>
> Thanks,
> Yun
>
> -Original Message-
> From: Kunal Khatua [mailto:kkha...@mapr.com]
> Sent: Thursday, November 2, 2017 2:01 PM
> To: user@drill.apache.org
> Subject: RE: Drill Capacity
>
> Hi Yun
>
> Andries solution should address your problem. However, do understand
> that, unlike CSV files, a JSON file cannot be processed in parallel,
> because there is no clear record delimiter (CSV data usually has a new-line
> character to indicate the end of a record). So, the larger a file gets, the
> more work a single minor fragment has to do in processing it, including
> maintaining internal data-structures to represent the complex JSON document.
>
> The preferable way would be to create more JSON files so that the
> files can be processed in parallel.
>
> Hope that helps.
>
> ~ Kunal
>
> -Original Message-
> From: Andries Engelbrecht [mailto:aengelbre...@mapr.com]
> Sent: Thursday, November 02, 2017 10:26 AM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
>
> How much memory is allocated to the Drill environment?
> Embedded or in a cluster?
>
> I don’t think there is a particular limit, but a single JSON file will
> be read by a single minor fragment, in general it is better to match the
> number/size of files to the Drill environment.
>
> In the short term try to bump up planner.memory.max_query_memory_per_node
> in the options and see if that works for you.
>
> --Andries
>
>
>
> On 11/2/17, 7:46 AM, "Yun Liu" <y@castsoftware.com> wrote:
>
> Hi,
>
> I've been using Apache Drill actively and just wondering what is
> the capacity of Drill? I have a json file which is 390MB and it keeps
> throwing me an DATA_READ ERROR. I have another json file with exact same
> format but only 150MB and it's processing fine. When I did a *select* on
> the large json, it returns successfully for some of the fields. None of
> these errors really apply to me. So I am trying to understand the capacity
> of the json files Drill supports up to. Or if there's something else I
> missed.
>
> Thanks,
>
> Yun Liu
> Solutions Delivery Consultant
> 321 West 44th St | Suite 501 | New York, NY 10036
> +1 212.871.8355 office | +1 646.752.4933 mobile
>
> CAST, Leader in Software Analysis and Measurement
> Achieve Insight. Deliver Excellence.
> Join the discussion http://blog.castsoftware.com/
> LinkedIn<http://www.linkedin.com/companies/162909> | Twitter<
> http://twitter.com/onquality> | Facebook<http://www.facebook.
> com/pages/CAST/105668942817177>
>
>
>
>
>
>


Re: Drill Capacity

2017-11-02 Thread Boaz Ben-Zvi
 Hi Yun,

 Can you try using the “managed” version of the external sort – either 
change this option to false:

0: jdbc:drill:zk=local> select * from sys.options where name like '%man%';
++--+---+--+--+--+-+---++
|name|   kind   | accessibleScopes  | optionScope  |  
status  | num_val  | string_val  | bool_val  | float_val  |
++--+---+--+--+--+-+---++
| exec.sort.disable_managed  | BOOLEAN  | ALL   | BOOT | 
DEFAULT  | null | null| false | null   |
++--+---+--+--+--+-+---++

Or override it into ‘false’ in the configuration:

0: jdbc:drill:zk=local> select * from sys.boot where name like '%managed%';
+---+--+---+--+-+--+-+---++
| name  |   kind   | accessibleScopes  
| optionScope  | status  | num_val  | string_val  | bool_val  | float_val  |
+---+--+---+--+-+--+-+---++
| drill.exec.options.exec.sort.disable_managed  | BOOLEAN  | BOOT  
| BOOT | BOOT| null | null| false | null   |
+---+--+---+--+-+--+-+---++

i.e., in the drill-override.conf file:

  sort: {
 external: {
 disable_managed: false
  }
  }

  Please let us know if this change helped,

 -- Boaz 


On 11/2/17, 1:12 PM, "Yun Liu" <y@castsoftware.com> wrote:

Please help me as to what further information I could provide to get this 
going. I am also experiencing a separate issue:

RESOURCE ERROR: One or more nodes ran out of memory while executing the 
query.

Unable to allocate sv2 for 8501 records, and not enough batchGroups to 
spill.
batchGroups.size 1
spilledBatchGroups.size 0
allocated memory 42768000
allocator limit 41943040

Current setting is: 
planner.memory.max_query_memory_per_node= 10GB 
HEAP to 12G 
Direct memory to 32G 
Perm to 1024M

What is the issue here?

Thanks,
Yun

-Original Message-
From: Yun Liu [mailto:y@castsoftware.com] 
Sent: Thursday, November 2, 2017 3:52 PM
To: user@drill.apache.org
    Subject: RE: Drill Capacity

Yes- I increased planner.memory.max_query_memory_per_node to 10GB HEAP to 
12G Direct memory to 16G And Perm to 1024M

It didn't have any schema changes. As with the same file format but less 
data- it works perfectly ok. I am unable to tell if there's corruption.

Yun

-Original Message-
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com]
Sent: Thursday, November 2, 2017 3:35 PM
To: user@drill.apache.org
    Subject: Re: Drill Capacity

What memory setting did you increase? Have you tried 6 or 8GB?

How much memory is allocated to Drill Heap and Direct memory for the 
embedded Drillbit?

Also did you check the larger document doesn’t have any schema changes or 
corruption?

--Andries



On 11/2/17, 12:31 PM, "Yun Liu" <y@castsoftware.com> wrote:

Hi Kunal and Andries,

Thanks for your reply. We need json in this case because Drill only 
supports up to 65536 columns in a csv file. I also tried increasing the memory 
size to 4GB but I am still experiencing same issues. Drill is installed in 
Embedded Mode.

Thanks,
Yun

-Original Message-
From: Kunal Khatua [mailto:kkha...@mapr.com] 
Sent: Thursday, November 2, 2017 2:01 PM
To: user@drill.apache.org
Subject: RE: Drill Capacity

Hi Yun

Andries solution should address your problem. However, do understand 
that, unlike CSV files, a JSON file cannot be processed in parallel, because 
there is no clear record delimiter (CSV data usually has a new-line character 
to indicate the end of a record). So, the larger a file gets, the more work a 
single minor fragment has to do in processing it, including maintaining 
internal data-structures to represent the complex JSON document. 

The preferable way would be to create more JSON files so that the files 
can be processed in parallel. 

Hope that helps.

~ Kunal

-Origi

RE: Drill Capacity

2017-11-02 Thread Yun Liu
Please help me as to what further information I could provide to get this 
going. I am also experiencing a separate issue:

RESOURCE ERROR: One or more nodes ran out of memory while executing the query.

Unable to allocate sv2 for 8501 records, and not enough batchGroups to spill.
batchGroups.size 1
spilledBatchGroups.size 0
allocated memory 42768000
allocator limit 41943040

Current setting is: 
planner.memory.max_query_memory_per_node= 10GB 
HEAP to 12G 
Direct memory to 32G 
Perm to 1024M

What is the issue here?

Thanks,
Yun

-Original Message-
From: Yun Liu [mailto:y@castsoftware.com] 
Sent: Thursday, November 2, 2017 3:52 PM
To: user@drill.apache.org
Subject: RE: Drill Capacity

Yes- I increased planner.memory.max_query_memory_per_node to 10GB HEAP to 12G 
Direct memory to 16G And Perm to 1024M

It didn't have any schema changes. As with the same file format but less data- 
it works perfectly ok. I am unable to tell if there's corruption.

Yun

-Original Message-
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com]
Sent: Thursday, November 2, 2017 3:35 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

What memory setting did you increase? Have you tried 6 or 8GB?

How much memory is allocated to Drill Heap and Direct memory for the embedded 
Drillbit?

Also did you check the larger document doesn’t have any schema changes or 
corruption?

--Andries



On 11/2/17, 12:31 PM, "Yun Liu" <y@castsoftware.com> wrote:

Hi Kunal and Andries,

Thanks for your reply. We need json in this case because Drill only 
supports up to 65536 columns in a csv file. I also tried increasing the memory 
size to 4GB but I am still experiencing same issues. Drill is installed in 
Embedded Mode.

Thanks,
Yun

-Original Message-
From: Kunal Khatua [mailto:kkha...@mapr.com] 
Sent: Thursday, November 2, 2017 2:01 PM
To: user@drill.apache.org
    Subject: RE: Drill Capacity

Hi Yun

Andries solution should address your problem. However, do understand that, 
unlike CSV files, a JSON file cannot be processed in parallel, because there is 
no clear record delimiter (CSV data usually has a new-line character to 
indicate the end of a record). So, the larger a file gets, the more work a 
single minor fragment has to do in processing it, including maintaining 
internal data-structures to represent the complex JSON document. 

The preferable way would be to create more JSON files so that the files can 
be processed in parallel. 

Hope that helps.

~ Kunal

-Original Message-
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com] 
Sent: Thursday, November 02, 2017 10:26 AM
To: user@drill.apache.org
    Subject: Re: Drill Capacity

How much memory is allocated to the Drill environment?
Embedded or in a cluster?

I don’t think there is a particular limit, but a single JSON file will be 
read by a single minor fragment, in general it is better to match the 
number/size of files to the Drill environment.

In the short term try to bump up planner.memory.max_query_memory_per_node 
in the options and see if that works for you.

--Andries



On 11/2/17, 7:46 AM, "Yun Liu" <y@castsoftware.com> wrote:

Hi,

I've been using Apache Drill actively and just wondering what is the 
capacity of Drill? I have a json file which is 390MB and it keeps throwing me 
an DATA_READ ERROR. I have another json file with exact same format but only 
150MB and it's processing fine. When I did a *select* on the large json, it 
returns successfully for some of the fields. None of these errors really apply 
to me. So I am trying to understand the capacity of the json files Drill 
supports up to. Or if there's something else I missed.

Thanks,

Yun Liu
Solutions Delivery Consultant
321 West 44th St | Suite 501 | New York, NY 10036
+1 212.871.8355 office | +1 646.752.4933 mobile

CAST, Leader in Software Analysis and Measurement
Achieve Insight. Deliver Excellence.
Join the discussion http://blog.castsoftware.com/
LinkedIn<http://www.linkedin.com/companies/162909> | 
Twitter<http://twitter.com/onquality> | 
Facebook<http://www.facebook.com/pages/CAST/105668942817177>







RE: Drill Capacity

2017-11-02 Thread Yun Liu
Yes- I increased planner.memory.max_query_memory_per_node to 10GB
HEAP to 12G
Direct memory to 16G
And Perm to 1024M

It didn't have any schema changes. As with the same file format but less data- 
it works perfectly ok. I am unable to tell if there's corruption.

Yun

-Original Message-
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com] 
Sent: Thursday, November 2, 2017 3:35 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

What memory setting did you increase? Have you tried 6 or 8GB?

How much memory is allocated to Drill Heap and Direct memory for the embedded 
Drillbit?

Also did you check the larger document doesn’t have any schema changes or 
corruption?

--Andries



On 11/2/17, 12:31 PM, "Yun Liu" <y@castsoftware.com> wrote:

Hi Kunal and Andries,

Thanks for your reply. We need json in this case because Drill only 
supports up to 65536 columns in a csv file. I also tried increasing the memory 
size to 4GB but I am still experiencing same issues. Drill is installed in 
Embedded Mode.

Thanks,
Yun

-Original Message-
From: Kunal Khatua [mailto:kkha...@mapr.com] 
Sent: Thursday, November 2, 2017 2:01 PM
To: user@drill.apache.org
    Subject: RE: Drill Capacity

Hi Yun

Andries solution should address your problem. However, do understand that, 
unlike CSV files, a JSON file cannot be processed in parallel, because there is 
no clear record delimiter (CSV data usually has a new-line character to 
indicate the end of a record). So, the larger a file gets, the more work a 
single minor fragment has to do in processing it, including maintaining 
internal data-structures to represent the complex JSON document. 

The preferable way would be to create more JSON files so that the files can 
be processed in parallel. 

Hope that helps.

~ Kunal

-Original Message-
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com] 
Sent: Thursday, November 02, 2017 10:26 AM
To: user@drill.apache.org
    Subject: Re: Drill Capacity

How much memory is allocated to the Drill environment?
Embedded or in a cluster?

I don’t think there is a particular limit, but a single JSON file will be 
read by a single minor fragment, in general it is better to match the 
number/size of files to the Drill environment.

In the short term try to bump up planner.memory.max_query_memory_per_node 
in the options and see if that works for you.

--Andries



On 11/2/17, 7:46 AM, "Yun Liu" <y@castsoftware.com> wrote:

Hi,

I've been using Apache Drill actively and just wondering what is the 
capacity of Drill? I have a json file which is 390MB and it keeps throwing me 
an DATA_READ ERROR. I have another json file with exact same format but only 
150MB and it's processing fine. When I did a *select* on the large json, it 
returns successfully for some of the fields. None of these errors really apply 
to me. So I am trying to understand the capacity of the json files Drill 
supports up to. Or if there's something else I missed.

Thanks,

Yun Liu
Solutions Delivery Consultant
321 West 44th St | Suite 501 | New York, NY 10036
+1 212.871.8355 office | +1 646.752.4933 mobile

CAST, Leader in Software Analysis and Measurement
Achieve Insight. Deliver Excellence.
Join the discussion http://blog.castsoftware.com/
LinkedIn<http://www.linkedin.com/companies/162909> | 
Twitter<http://twitter.com/onquality> | 
Facebook<http://www.facebook.com/pages/CAST/105668942817177>







RE: Drill Capacity

2017-11-02 Thread Kunal Khatua
Hi Yun

Andries solution should address your problem. However, do understand that, 
unlike CSV files, a JSON file cannot be processed in parallel, because there is 
no clear record delimiter (CSV data usually has a new-line character to 
indicate the end of a record). So, the larger a file gets, the more work a 
single minor fragment has to do in processing it, including maintaining 
internal data-structures to represent the complex JSON document. 

The preferable way would be to create more JSON files so that the files can be 
processed in parallel. 

Hope that helps.

~ Kunal

-Original Message-
From: Andries Engelbrecht [mailto:aengelbre...@mapr.com] 
Sent: Thursday, November 02, 2017 10:26 AM
To: user@drill.apache.org
Subject: Re: Drill Capacity

How much memory is allocated to the Drill environment?
Embedded or in a cluster?

I don’t think there is a particular limit, but a single JSON file will be read 
by a single minor fragment, in general it is better to match the number/size of 
files to the Drill environment.

In the short term try to bump up planner.memory.max_query_memory_per_node in 
the options and see if that works for you.

--Andries



On 11/2/17, 7:46 AM, "Yun Liu" <y@castsoftware.com> wrote:

Hi,

I've been using Apache Drill actively and just wondering what is the 
capacity of Drill? I have a json file which is 390MB and it keeps throwing me 
an DATA_READ ERROR. I have another json file with exact same format but only 
150MB and it's processing fine. When I did a *select* on the large json, it 
returns successfully for some of the fields. None of these errors really apply 
to me. So I am trying to understand the capacity of the json files Drill 
supports up to. Or if there's something else I missed.

Thanks,

Yun Liu
Solutions Delivery Consultant
321 West 44th St | Suite 501 | New York, NY 10036
+1 212.871.8355 office | +1 646.752.4933 mobile

CAST, Leader in Software Analysis and Measurement
Achieve Insight. Deliver Excellence.
Join the discussion http://blog.castsoftware.com/
LinkedIn<http://www.linkedin.com/companies/162909> | 
Twitter<http://twitter.com/onquality> | 
Facebook<http://www.facebook.com/pages/CAST/105668942817177>





Re: Drill Capacity

2017-11-02 Thread Andries Engelbrecht
How much memory is allocated to the Drill environment?
Embedded or in a cluster?

I don’t think there is a particular limit, but a single JSON file will be read 
by a single minor fragment, in general it is better to match the number/size of 
files to the Drill environment.

In the short term try to bump up planner.memory.max_query_memory_per_node in 
the options and see if that works for you.

--Andries



On 11/2/17, 7:46 AM, "Yun Liu"  wrote:

Hi,

I've been using Apache Drill actively and just wondering what is the 
capacity of Drill? I have a json file which is 390MB and it keeps throwing me 
an DATA_READ ERROR. I have another json file with exact same format but only 
150MB and it's processing fine. When I did a *select* on the large json, it 
returns successfully for some of the fields. None of these errors really apply 
to me. So I am trying to understand the capacity of the json files Drill 
supports up to. Or if there's something else I missed.

Thanks,

Yun Liu
Solutions Delivery Consultant
321 West 44th St | Suite 501 | New York, NY 10036
+1 212.871.8355 office | +1 646.752.4933 mobile

CAST, Leader in Software Analysis and Measurement
Achieve Insight. Deliver Excellence.
Join the discussion http://blog.castsoftware.com/
LinkedIn | 
Twitter | 
Facebook





Re: Drill Capacity

2017-11-02 Thread Prasad Nagaraj Subramanya
Hi Yun,

Drill is designed to query large datasets. There is no specific limit on
the size, it works well even when data is in hundreds of GBs.

DATA_READ ERROR has something to do with the data in your file. The data in
some of the columns may not be consistent with the datatype.
Please refer to this link for one such example -
https://stackoverflow.com/questions/40217328/apache-drill-mysql-and-data-read-error-failure-while-attempting-to-read-from


Thanks,
Prasad

On Thu, Nov 2, 2017 at 7:46 AM, Yun Liu  wrote:

> Hi,
>
> I've been using Apache Drill actively and just wondering what is the
> capacity of Drill? I have a json file which is 390MB and it keeps throwing
> me an DATA_READ ERROR. I have another json file with exact same format but
> only 150MB and it's processing fine. When I did a *select* on the large
> json, it returns successfully for some of the fields. None of these errors
> really apply to me. So I am trying to understand the capacity of the json
> files Drill supports up to. Or if there's something else I missed.
>
> Thanks,
>
> Yun Liu
> Solutions Delivery Consultant
> 321 West 44th St | Suite 501 | New York, NY 10036
> +1 212.871.8355 office | +1 646.752.4933 mobile
>
> CAST, Leader in Software Analysis and Measurement
> Achieve Insight. Deliver Excellence.
> Join the discussion http://blog.castsoftware.com/
> LinkedIn | Twitter<
> http://twitter.com/onquality> | Facebook com/pages/CAST/105668942817177>
>
>


Drill Capacity

2017-11-02 Thread Yun Liu
Hi,

I've been using Apache Drill actively and just wondering what is the capacity 
of Drill? I have a json file which is 390MB and it keeps throwing me an 
DATA_READ ERROR. I have another json file with exact same format but only 150MB 
and it's processing fine. When I did a *select* on the large json, it returns 
successfully for some of the fields. None of these errors really apply to me. 
So I am trying to understand the capacity of the json files Drill supports up 
to. Or if there's something else I missed.

Thanks,

Yun Liu
Solutions Delivery Consultant
321 West 44th St | Suite 501 | New York, NY 10036
+1 212.871.8355 office | +1 646.752.4933 mobile

CAST, Leader in Software Analysis and Measurement
Achieve Insight. Deliver Excellence.
Join the discussion http://blog.castsoftware.com/
LinkedIn | 
Twitter | 
Facebook