[jira] [Created] (DRILL-5485) Remove WebServer dependency on DrillClient
Sorabh Hamirwasia created DRILL-5485:
-------------------------------------

             Summary: Remove WebServer dependency on DrillClient
                 Key: DRILL-5485
                 URL: https://issues.apache.org/jira/browse/DRILL-5485
             Project: Apache Drill
          Issue Type: Improvement
          Components: Web Server
            Reporter: Sorabh Hamirwasia
             Fix For: 1.11.0


With encryption support using SASL, clients won't be able to authenticate with the PLAIN mechanism when encryption is enabled on the cluster. Today the WebServer, which is embedded inside the Drillbit, creates a DrillClient instance for each WebClient session, and the WebUser is authenticated using the PLAIN mechanism as part of the authentication between that DrillClient instance and the Drillbit. With encryption enabled this will fail, since encryption doesn't support authentication using the PLAIN mechanism; hence no WebClient can connect to a Drillbit.

There are other issues with this approach as well:
1) Since a DrillClient is created per WebUser session, this is expensive: each instance carries the heavyweight RPC layer for DrillClient and all its dependencies.
2) If a node other than the local one is selected as the Foreman for a WebUser's query, there is an extra hop to transfer data back to the WebClient.

To resolve all of the above issues, it would be better to authenticate the WebUser locally on the Drillbit on which the WebServer is running, without creating a DrillClient instance. We can use the local PAM authenticator to authenticate the user. After authentication succeeds, the local Drillbit can also serve as the Foreman for all queries submitted by the WebUser. This can be achieved by submitting the query to the local Drillbit's Foreman work queue. This also removes the requirement to encrypt the channel opened between the WebServer (DrillClient) and the selected Drillbit, since with this approach there won't be any physical channel opened between them.
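A minimal sketch of the proposed local-authentication path (illustrative only: WebUserAuthenticator is a hypothetical class, and the UserAuthenticator interface below merely mirrors the shape of Drill's org.apache.drill.exec.rpc.user.security.UserAuthenticator):

{code}
// Hypothetical sketch: authenticate the WebUser against the locally
// configured authenticator (e.g. the PAM-based implementation) instead
// of opening a DrillClient connection back to the Drillbit.
interface UserAuthenticator {
  void authenticate(String user, String password)
      throws UserAuthenticationException;
}

class UserAuthenticationException extends Exception {
  UserAuthenticationException(String message) { super(message); }
}

class WebUserAuthenticator {
  private final UserAuthenticator localAuthenticator;

  WebUserAuthenticator(UserAuthenticator localAuthenticator) {
    this.localAuthenticator = localAuthenticator;
  }

  // Returns true on success. No RPC channel or DrillClient is created,
  // so the PLAIN-over-encryption restriction never comes into play, and
  // the local Drillbit can act as Foreman for the session's queries.
  boolean login(String user, String password) {
    try {
      localAuthenticator.authenticate(user, password);
      return true;
    } catch (UserAuthenticationException e) {
      return false;
    }
  }
}
{code}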
[jira] [Created] (DRILL-5484) easy.text.compliant.RepeatedVarCharOutput creates unnecessary 64K byte field
Paul Rogers created DRILL-5484:
--------------------------------

             Summary: easy.text.compliant.RepeatedVarCharOutput creates unnecessary 64K byte field
                 Key: DRILL-5484
                 URL: https://issues.apache.org/jira/browse/DRILL-5484
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.10.0
            Reporter: Paul Rogers
            Priority: Minor


The "Easy" text readers include a "compliant" reader for reading files such as CSV. That mechanism includes a class, {{RepeatedVarCharOutput}}, which gathers field data into a single array, "columns". Part of the work is to implement projection by reading only the needed columns. This is done with a {{fields}} array. Since the constructor that sets up the array does not know the number of fields, it assumes the maximum: 64K.

{code}
public static final int MAXIMUM_NUMBER_COLUMNS = 64 * 1024;
...
boolean[] fields = new boolean[MAXIMUM_NUMBER_COLUMNS];
{code}

This is, of course, a quick & dirty solution, but it is a heavy price to pay for what amounts to a single bit indicating that we want to read all fields. It is not clear that the performance advantage of a flag check is worth the cost of having many 64K heap blocks allocated: we need one per file per reader.
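One possible shape for a fix (a sketch only, not an actual patch; FieldProjection is a hypothetical class): keep a single "project everything" flag for the wildcard case, and allocate a per-field array only when an explicit projection list is known, sized to fit it.

{code}
// Illustrative sketch: avoid the 64K boolean allocation in the wildcard
// case by replacing it with a single flag; allocate the per-field array
// only for an explicit projection, sized to the largest projected index.
class FieldProjection {
  private final boolean selectAll;
  private final boolean[] fields; // null when selectAll is true

  // Wildcard projection (SELECT *): no array needed at all.
  FieldProjection() {
    this.selectAll = true;
    this.fields = null;
  }

  // Explicit projection: size the array to the largest projected index.
  FieldProjection(int[] projectedIndexes) {
    this.selectAll = false;
    int max = 0;
    for (int i : projectedIndexes) {
      max = Math.max(max, i);
    }
    this.fields = new boolean[max + 1];
    for (int i : projectedIndexes) {
      this.fields[i] = true;
    }
  }

  boolean isProjected(int fieldIndex) {
    return selectAll || (fieldIndex < fields.length && fields[fieldIndex]);
  }
}
{code}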
[jira] [Updated] (DRILL-5483) Production-quality solution to define text file field types and widths
[ https://issues.apache.org/jira/browse/DRILL-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-5483:
-------------------------------

    Description:

This bug is in response to the work done in DRILL-5419. In that PR, we essentially:

* Define field width in a CAST statement: CAST(columns[2] AS VARCHAR(10))
* Propagate known size information up through the internal representation to compute widths for each column in the result set.

This is a wonderful start and a big improvement. Users can create views that contain the needed casts. The size information allows tools that need it to work correctly. All is good.

However, if we start thinking about the user implications, we quickly realize that the above is just a very partial fix for real-world use in a data lake application.

* A data lake has many types of files, with new ones added constantly.
* Drill is often used for data discovery: to see what is in each file.
* The number of files is typically huge: 100K, 1M or more. (This is, after all, Big Data.)
* New files of existing types arrive constantly. For example, web server logs might arrive every five minutes.

Let's consider how the DRILL-5419 fix would apply in this environment.

* If a user queries a file directly, without a CAST, Drill has no column width information and returns a width of 64K (the maximum field width allowed by Drill) to the client, which will fail due to over-sized buffers.
* The user must repeat the query, but assign a width to each column. Since this is data discovery, the user does not know the widths.
* So, the user must run a query to compute the maximum width of each column by scanning all the data, write down the answers, and use them in a CAST in each subsequent query.

Now, the above can be simplified. Once we know the widths:

* Create a view for the file(s). This requires that the analytic user have write access to the file system and use a tool other than Drill to create the view.
* As new files arrive, rerun the max-length query to check for new lengths. If lengths have changed, manually update the views.

For the above to work, views must be created for each and every file (or users must share expected widths somehow and write the CAST statements into queries for files for which views are not defined). But this is a big data system, so there are millions of files. So, work out a way to create views for all these files. Perhaps create scripts that scan all new files and contain code to revise the views.

But this is a multi-user system, so the users must agree on who will do the full-table scan to compute the widths. For ad-hoc use, they must set up a wiki, an e-mail list, or some other means to share the widths to use in casts (with the information eventually going into views). If done by script, the scripts have to handle the race conditions that occur when replacing views while users may be trying to access them. Done wrong, users will get occasional failures due to missing or partial view definitions as they try to read a file while the script (or a human) is updating its views.

Views require different names in the query than the table. So, users must know when to use the actual file name (with manually tracked field widths) and when to use view names. That information must be published somewhere so users can consult it. Simply looking at available files is not enough; the user must know that file a/b/c/d.csv must be queried with a/b/c/d-view.drill. Train the users on these rules.

Views must be stored somewhere. Putting them with the data has permission and directory time-stamp issues. (Adding a view changes the directory time-stamp, but that time-stamp is often used to detect new files.) Putting them in some other location requires knowing the file-to-view location mapping. Either solution is a major cost.

The full table scans to compute field lengths duplicate work to be done by the proposed statistics system. (The stats collection will compute other values, but not maximum field width.) This doubles the load on the system.

Drill CAST operations are based on the idea that, if the user says the field should be 20 characters, then go ahead and truncate the rest. But the use case here is that we want the actual field width; the CAST is just a work-around. Truncating the data can mean data loss. (Truncating "12345678" to "12345", because that's what we expect, changes the meaning of the number and should be considered data corruption.)

The Drill CAST operator works by making a copy. So, we greatly degrade performance by making data copies when all we really want to do is specify a width. Thus, the trade-off is to overload the client tool or to slow query performance.

Drill must scan the views prior to each query. To improve performance, we'd want to cache the views. But each Drillbit is independent of the others, so each would cache its own copy. With millions of vi
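To make the per-file view step described above concrete, the DDL would look something like the following (file, workspace, and view names are examples only, and the widths stand in for whatever the max-length scan reported; the client call follows the same cluster-fixture style used in the DRILL-5470 comments below):

{code}
// Hypothetical per-file view: the widths come from a prior max-length
// scan of the file. Names and widths are illustrative only.
String ddl =
    "CREATE VIEW `dfs.tmp`.`d_view` AS " +
    "SELECT CAST(columns[0] AS VARCHAR(10)) AS h, " +
    "       CAST(columns[1] AS VARCHAR(20)) AS u " +
    "FROM `dfs.data`.`csv/d.csv`";
client.queryBuilder().sql(ddl).printCsv();
{code}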
[jira] [Created] (DRILL-5483) Production-quality solution to define text file field types and widths
Paul Rogers created DRILL-5483:
--------------------------------

             Summary: Production-quality solution to define text file field types and widths
                 Key: DRILL-5483
                 URL: https://issues.apache.org/jira/browse/DRILL-5483
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.10.0
            Reporter: Paul Rogers


This bug is in response to the work done in DRILL-5419. In that PR, we essentially:

* Define field width in a CAST statement: CAST(columns[2] AS VARCHAR(10))
* Propagate known size information up through the internal representation to compute widths for each column in the result set.

This is a wonderful start and a big improvement. Users can create views that contain the needed casts. The size information allows tools that need it to work correctly. All is good.

However, if we start thinking about the user implications, we quickly realize that the above is just a very partial fix for real-world use in a data lake application.

* A data lake has many types of files, with new ones added constantly.
* Drill is often used for data discovery: to see what is in each file.
* The number of files is typically huge: 100K, 1M or more. (This is, after all, Big Data.)
* New files of existing types arrive constantly. For example, web server logs might arrive every five minutes.

Let's consider how the DRILL-5419 fix would apply in this environment.

* If a user queries a file directly, without a CAST, Drill has no column width information and returns a width of 64K (the maximum field width allowed by Drill) to the client, which will fail due to over-sized buffers.
* The user must repeat the query, but assign a width to each column. Since this is data discovery, the user does not know the widths.
* So, the user must run a query to compute the maximum width of each column by scanning all the data, write down the answers, and use them in a CAST in each subsequent query.

Now, the above can be simplified. Once we know the widths:

* Create a view for the file(s). This requires that the analytic user have write access to the file system and use a tool other than Drill to create the view.
* As new files arrive, rerun the max-length query to check for new lengths. If lengths have changed, manually update the views.

For the above to work, views must be created for each and every file (or users must share expected widths somehow and write the CAST statements into queries for files for which views are not defined). But this is a big data system, so there are millions of files. So, work out a way to create views for all these files. Perhaps create scripts that scan all new files and contain code to revise the views.

But this is a multi-user system, so the users must agree on who will do the full-table scan to compute the widths. For ad-hoc use, they must set up a wiki, an e-mail list, or some other means to share the widths to use in casts (with the information eventually going into views). If done by script, the scripts have to handle the race conditions that occur when replacing views while users may be trying to access them. Done wrong, users will get occasional failures due to missing or partial view definitions as they try to read a file while the script (or a human) is updating its views.

Views require different names in the query than the table. So, users must know when to use the actual file name (with manually tracked field widths) and when to use view names. That information must be published somewhere so users can consult it. Simply looking at available files is not enough; the user must know that file a/b/c/d.csv must be queried with a/b/c/d-view.drill. Train the users on these rules.

Views must be stored somewhere. Putting them with the data has permission and directory time-stamp issues. (Adding a view changes the directory time-stamp, but that time-stamp is often used to detect new files.) Putting them in some other location requires knowing the file-to-view location mapping. Either solution is a major cost.

The full table scans to compute field lengths duplicate work to be done by the proposed statistics system. (The stats collection will compute other values, but not maximum field width.) This doubles the load on the system.

Drill CAST operations are based on the idea that, if the user says the field should be 20 characters, then go ahead and truncate the rest. But the use case here is that we want the actual field width; the CAST is just a work-around. Truncating the data can mean data loss. (Truncating "12345678" to "12345", because that's what we expect, changes the meaning of the number and should be considered data corruption.)

The Drill CAST operator works by making a copy. So, we greatly degrade performance by making data copies when all we really want to do is specify a width. Thus, the trade-off is to overload the client tool or to slow query performance.

Drill must scan the views prior to each query. To improve performance, we'd want to cache the views. But each Drillbit is independent of the others, so each would cache its own copy. With millions of vi
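The "max-length query" this description keeps referring to would be a full scan along these lines (file and workspace names are examples only; CHAR_LENGTH is standard Drill SQL, and the client call follows the same cluster-fixture style used in the DRILL-5470 comments below):

{code}
// Hypothetical discovery query: scan the whole file once just to learn
// each column's maximum width, then hand-copy the results into CASTs or
// a view definition. This is the duplicated full-table-scan work the
// description objects to.
String sql =
    "SELECT MAX(CHAR_LENGTH(columns[0])) AS w0, " +
    "       MAX(CHAR_LENGTH(columns[1])) AS w1 " +
    "FROM `dfs.data`.`csv/d.csv`";
client.queryBuilder().sql(sql).printCsv();
{code}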
[jira] [Commented] (DRILL-5470) CSV reader data corruption on truncated lines
[ https://issues.apache.org/jira/browse/DRILL-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1687#comment-1687 ]

Paul Rogers commented on DRILL-5470:
------------------------------------

A workaround: for the same test file as above, decent results can be had by ignoring the header and instead reading the file as an array of strings. Code:

{code}
TextFormatConfig csvFormat = new TextFormatConfig();
csvFormat.fieldDelimiter = ',';
csvFormat.skipFirstLine = true;
csvFormat.extractHeader = false;
cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
String sql = "SELECT columns[0] AS h, columns[1] AS u FROM `dfs.data`.`csv/test4.csv`";
{code}

Input file:

{code}
h,u
abc,def
ghi
{code}

Output:

{code}
h,u
abc,def
ghi,null
{code}

The cost is a bit more fiddling in the query, and a data copy from the column array into the named columns. But at least we don't get the bogus field lengths.

> CSV reader data corruption on truncated lines
> ---------------------------------------------
>
>                 Key: DRILL-5470
>                 URL: https://issues.apache.org/jira/browse/DRILL-5470
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Server
>    Affects Versions: 1.10.0
>         Environment: - ubuntu 14.04
> - r3.8xl (32 CPU/240GB Mem)
> - openjdk version "1.8.0_111"
> - drill 1.10.0 with 8656c83b00f8ab09fb6817e4e9943b2211772541 cherry-picked
>            Reporter: Nathan Butler
>            Assignee: Paul Rogers
>            Priority: Critical
>
> Per the mailing list discussion and Rahul's and Paul's suggestion I'm filing this Jira issue. Drill seems to be running out of memory when doing an External Sort. Per Zelaine's suggestion I enabled sort.external.disable_managed in drill-override.conf and in the sqlline session. This caused the query to run for longer but it still would fail with the same message.
> Per Paul's suggestion, I enabled debug logging for the org.apache.drill.exec.physical.impl.xsort.managed package and re-ran the query.
> Here's the initial DEBUG line for ExternalSortBatch for our query:
> bq. 2017-05-03 12:02:56,095 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:15] DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Config: memory limit = 10737418240, spill file size = 268435456, spill batch size = 8388608, merge limit = 2147483647, merge batch size = 16777216
> And here's the last DEBUG line before the stack trace:
> bq. 2017-05-03 12:37:44,249 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:4] DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Available memory: 10737418240, buffer memory = 10719535268, merge memory = 10707140978
> And the stacktrace:
> {quote}
> 2017-05-03 12:38:02,927 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:6] INFO o.a.d.e.p.i.x.m.ExternalSortBatch - User Error Occurred: External Sort encountered an error while spilling to disk (Unable to allocate buffer of size 268435456 due to memory limit. Current allocation: 10579849472)
> org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: External Sort encountered an error while spilling to disk
> [Error Id: 5d53c677-0cd9-4c01-a664-c02089670a1c ]
>   at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:544) ~[drill-common-1.10.0.jar:1.10.0]
>   at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.doMergeAndSpill(ExternalSortBatch.java:1447) [drill-java-exec-1.10.0.jar:1.10.0]
>   at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:1376) [drill-java-exec-1.10.0.jar:1.10.0]
>   at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.spillFromMemory(ExternalSortBatch.java:1339) [drill-java-exec-1.10.0.jar:1.10.0]
>   at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.processBatch(ExternalSortBatch.java:831) [drill-java-exec-1.10.0.jar:1.10.0]
>   at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch(ExternalSortBatch.java:618) [drill-java-exec-1.10.0.jar:1.10.0]
>   at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load(ExternalSortBatch.java:660) [drill-java-exec-1.10.0.jar:1.10.0]
>   at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext(ExternalSortBatch.java:559) [drill-java-exec-1.10.0.jar:1.10.0]
>   at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) [drill-java-exec-1.10.0.jar:1.10.0]
>   at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) [drill-java-exec-1.10.0.jar:1.10.0]
>   at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) [drill-java-exec-1.10.0.jar:1.10.0]
[jira] [Commented] (DRILL-5470) CSV reader data corruption on truncated lines
[ https://issues.apache.org/jira/browse/DRILL-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15999714#comment-15999714 ]

Paul Rogers commented on DRILL-5470:
------------------------------------

Raised the priority to Critical since this is both a data corruption issue and an issue that exhausts memory and causes queries to fail.
[jira] [Updated] (DRILL-5470) CSV reader data corruption on truncated lines
[ https://issues.apache.org/jira/browse/DRILL-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-5470:
-------------------------------
    Priority: Critical  (was: Major)
[jira] [Comment Edited] (DRILL-5470) CSV reader data corruption on truncated lines
[ https://issues.apache.org/jira/browse/DRILL-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15999712#comment-15999712 ]

Paul Rogers edited comment on DRILL-5470 at 5/7/17 6:59 AM:
------------------------------------------------------------

To illustrate the CSV data corruption, I created a CSV file, test4.csv, of the following form:

{code}
h,u
abc,def
ghi
{code}

Then, I created a simple test using the "cluster fixture" framework:

{code}
@Test
public void readerTest() throws Exception {
  FixtureBuilder builder = ClusterFixture.builder()
      .maxParallelization(1);
  try (ClusterFixture cluster = builder.build();
       ClientFixture client = cluster.clientFixture()) {
    TextFormatConfig csvFormat = new TextFormatConfig();
    csvFormat.fieldDelimiter = ',';
    csvFormat.skipFirstLine = false;
    csvFormat.extractHeader = true;
    cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
    String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
    client.queryBuilder().sql(sql).printCsv();
  }
}
{code}

The results show we've got a problem:

{code}
Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: IllegalArgumentException: length: -3 (expected: >= 0)
{code}

If the last line were:

{code}
efg,
{code}

then the offset vector should look like this:

{code}
[0, 3, 3]
{code}

Very likely we have an offset vector that looks like this instead:

{code}
[0, 3, 0]
{code}

When we compute the second column of the second row, we should compute:

{code}
length = offset[2] - offset[1] = 3 - 3 = 0
{code}

Instead we get:

{code}
length = offset[2] - offset[1] = 0 - 3 = -3
{code}

Somehow, in the user's scenario, the numbers are far larger and the value has wrapped around to the bogus length shown.

The summary is that a premature EOF appears to cause the "missing" columns to be skipped; they are not filled with a blank value to "bump" the offset vectors to fill in the last row. Instead, they are left at 0, causing havoc downstream in the query.
[jira] [Commented] (DRILL-5470) CSV reader data corruption on truncated lines
[ https://issues.apache.org/jira/browse/DRILL-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15999712#comment-15999712 ]

Paul Rogers commented on DRILL-5470:
------------------------------------

To illustrate the CSV data corruption, I created a CSV file, test4.csv, of the following form:

{code}
h,u
abc,def
ghi
{code}

Then, I created a simple test using the "cluster fixture" framework:

{code}
@Test
public void readerTest() throws Exception {
  FixtureBuilder builder = ClusterFixture.builder()
      .maxParallelization(1);
  try (ClusterFixture cluster = builder.build();
       ClientFixture client = cluster.clientFixture()) {
    TextFormatConfig csvFormat = new TextFormatConfig();
    csvFormat.fieldDelimiter = ',';
    csvFormat.skipFirstLine = false;
    csvFormat.extractHeader = true;
    cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
    String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
    client.queryBuilder().sql(sql).printCsv();
  }
}
{code}

The results show we've got a problem:

{code}
Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: IllegalArgumentException: length: -3 (expected: >= 0)
{code}

If the last line were:

{code}
efg,
{code}

then the offset vector should look like this:

{code}
[0, 3, 3]
{code}

Very likely we have an offset vector that looks like this instead:

{code}
[0, 3, 0]
{code}

When we compute the second column of the second row, we should compute:

{code}
length = offset[2] - offset[1] = 3 - 3 = 0
{code}

Instead we get:

{code}
length = offset[2] - offset[1] = 0 - 3 = -3
{code}

Somehow, in the user's scenario, the numbers are far larger and the value has wrapped around to the bogus length shown.

The summary is that a premature EOF appears to cause the "missing" columns to be skipped; they are not filled with a blank value to "bump" the offset vectors to fill in the last row. Instead, they are left at 0, causing havoc downstream in the query.
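A minimal, self-contained model of the offset-vector arithmetic described in the comment above (plain Java arrays standing in for Drill's offset vectors, not the actual vector classes; the "fix" loop at the end is only a sketch of the bump-the-offsets idea):

{code}
// Offset vector model: entry i+1 holds the end offset of value i, so
// length(i) = offsets[i + 1] - offsets[i].
public class OffsetVectorDemo {
  public static void main(String[] args) {
    int[] good = {0, 3, 3}; // second value written as empty: length 0
    int[] bad  = {0, 3, 0}; // second value never written, entry left at 0

    System.out.println(good[2] - good[1]); // prints  0 (correct)
    System.out.println(bad[2] - bad[1]);   // prints -3 (the corruption)

    // Sketch of the suggested fix: on a premature EOF, "bump" each
    // missing column's end offset up to the previous end offset so
    // every length computes as zero rather than negative.
    for (int i = 1; i < bad.length; i++) {
      if (bad[i] < bad[i - 1]) {
        bad[i] = bad[i - 1];
      }
    }
    System.out.println(bad[2] - bad[1]);   // prints 0 after the fix
  }
}
{code}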