[jira] [Commented] (DRILL-8200) Update hadoop-common to ≥ 3.2.3 for CVE-2022-26612
[ https://issues.apache.org/jira/browse/DRILL-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528249#comment-17528249 ] Ted Dunning commented on DRILL-8200: My reading of the CVE indicates that this applies only on Windows. Do others see it the same? > Update hadoop-common to ≥ 3.2.3 for CVE-2022-26612 > -- > > Key: DRILL-8200 > URL: https://issues.apache.org/jira/browse/DRILL-8200 > Project: Apache Drill > Issue Type: Bug > Components: library >Affects Versions: 1.20.0 >Reporter: James Turton >Assignee: James Turton >Priority: Critical > Fix For: 2.0.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (DRILL-7949) documentation error - missing link
Ted Dunning created DRILL-7949: -- Summary: documentation error - missing link Key: DRILL-7949 URL: https://issues.apache.org/jira/browse/DRILL-7949 Project: Apache Drill Issue Type: Task Reporter: Ted Dunning In checking rc1 for 1.19, I noted that this page: [https://drill.apache.org/docs/configuring-storage-plugins/] has a "Start the web UI" link pointing to [https://drill.apache.org/docs/starting-the-web-console/], and that page does not exist. I think that link should go to [https://drill.apache.org/docs/starting-the-web-ui/] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7277) Bug in planner with redundant order-by
[ https://issues.apache.org/jira/browse/DRILL-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848071#comment-16848071 ] Ted Dunning commented on DRILL-7277: This query:
{code}
select row_number() over (order by department_id desc) r, department_id
from (select department_id from cp.`employee.json` order by department_id desc);
{code}
fails as shown below, but putting department_id first in the output list does not.
{code}
java.sql.SQLException: [MapR][DrillJDBCDriver](500165) Query execution error. Details: SYSTEM ERROR: CannotPlanException: Node [rel#26937:Subset#4.LOGICAL.ANY([]).[1 DESC]] could not be implemented; planner state:

Root: rel#26937:Subset#4.LOGICAL.ANY([]).[1 DESC]

Original rel:
LogicalProject(subset=[rel#26937:Subset#4.LOGICAL.ANY([]).[1 DESC]], r=[$1], department_id=[$0]): rowcount = 100.0, cumulative cost = {100.0 rows, 200.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 26935
  LogicalWindow(subset=[rel#26934:Subset#3.NONE.ANY([]).[1 DESC]], window#0=[window(partition {} order by [0 DESC] rows between UNBOUNDED PRECEDING and CURRENT ROW aggs [ROW_NUMBER()])]): rowcount = 100.0, cumulative cost = {100.0 rows, 200.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 26933
    LogicalSort(subset=[rel#26932:Subset#2.NONE.ANY([]).[0 DESC]], sort0=[$0], dir0=[DESC]): rowcount = 100.0, cumulative cost = {100.0 rows, 1842.0680743952366 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 26931
      LogicalProject(subset=[rel#26930:Subset#1.NONE.ANY([]).[]], department_id=[$1]): rowcount = 100.0, cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 26929
        EnumerableTableScan(subset=[rel#26928:Subset#0.ENUMERABLE.ANY([]).[]], table=[[cp, employee.json]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 26880

Sets:
Set#0, type: RecordType(DYNAMIC_STAR **, ANY department_id)
  rel#26928:Subset#0.ENUMERABLE.ANY([]).[], best=rel#26880, importance=0.59049001
    rel#26880:EnumerableTableScan.ENUMERABLE.ANY([]).[](table=[cp, employee.json]), rowcount=100.0, cumulative cost={100.0 rows, 101.0 cpu, 0.0 io, 0.0 network, 0.0 memory}
  rel#26952:Subset#0.LOGICAL.ANY([]).[], best=rel#26954, importance=0.3247695
    rel#26954:DrillScanRel.LOGICAL.ANY([]).[](table=[cp, employee.json],groupscan=EasyGroupScan [selectionRoot=classpath:/employee.json, numFiles=1, columns=[`**`, `department_id`], files=[classpath:/employee.json]]), rowcount=463.0, cumulative cost={463.0 rows, 463.0 cpu, 0.0 io, 0.0 network, 0.0 memory}
Set#1, type: RecordType(ANY department_id)
  rel#26930:Subset#1.NONE.ANY([]).[], best=null, importance=0.6561
    rel#26929:LogicalProject.NONE.ANY([]).[](input=rel#26928:Subset#0.ENUMERABLE.ANY([]).[],department_id=$1), rowcount=100.0, cumulative cost={inf}
    rel#26931:LogicalSort.NONE.ANY([]).[0 DESC](input=rel#26930:Subset#1.NONE.ANY([]).[],sort0=$0,dir0=DESC), rowcount=100.0, cumulative cost={inf}
  rel#26943:Subset#1.LOGICAL.ANY([]).[], best=rel#26950, importance=0.405
    rel#26944:DrillSortRel.LOGICAL.ANY([]).[0 DESC](input=rel#26943:Subset#1.LOGICAL.ANY([]).[],sort0=$0,dir0=DESC), rowcount=463.0, cumulative cost={926.0 rows, 11830.070504167705 cpu, 0.0 io, 0.0 network, 0.0 memory}
    rel#26950:DrillScanRel.LOGICAL.ANY([]).[](table=[cp, employee.json],groupscan=EasyGroupScan [selectionRoot=classpath:/employee.json, numFiles=1, columns=[`department_id`], files=[classpath:/employee.json]]), rowcount=463.0, cumulative cost={463.0 rows, 463.0 cpu, 0.0 io, 0.0 network, 0.0 memory}
    rel#26953:DrillProjectRel.LOGICAL.ANY([]).[](input=rel#26952:Subset#0.LOGICAL.ANY([]).[],department_id=$1), rowcount=463.0, cumulative cost={926.0 rows, 4630463.0 cpu, 0.0 io, 0.0 network, 0.0 memory}
  rel#26946:Subset#1.NONE.ANY([]).[0 DESC], best=null, importance=0.7291
    rel#26931:LogicalSort.NONE.ANY([]).[0 DESC](input=rel#26930:Subset#1.NONE.ANY([]).[],sort0=$0,dir0=DESC), rowcount=100.0, cumulative cost={inf}
  rel#26947:Subset#1.LOGICAL.ANY([]).[1 DESC], best=null, importance=0.81
  rel#26948:Subset#1.LOGICAL.ANY([]).[0 DESC], best=rel#26944, importance=0.405
    rel#26944:DrillSortRel.LOGICAL.ANY([]).[0 DESC](input=rel#26943:Subset#1.LOGICAL.ANY([]).[],sort0=$0,dir0=DESC), rowcount=463.0, cumulative cost={926.0 rows, 11830.070504167705 cpu, 0.0 io, 0.0 network, 0.0 memory}
Set#3, type: RecordType(ANY department_id, BIGINT w0$o0)
  rel#26934:Subset#3.NONE.ANY([]).[1 DESC], best=null, importance=0.81
    rel#26933:LogicalWindow.NONE.ANY([]).[[1 DESC]](input=rel#26946:Subset#1.NONE.ANY([]).[0 DESC],window#0=window(partition {} order by [0 DESC] rows between UNBOUNDED PRECEDING and CURRENT ROW aggs [ROW_NUMBER()]
{code}
[jira] [Created] (DRILL-7277) Bug in planner with redundant order-by
Ted Dunning created DRILL-7277: -- Summary: Bug in planner with redundant order-by Key: DRILL-7277 URL: https://issues.apache.org/jira/browse/DRILL-7277 Project: Apache Drill Issue Type: Bug Affects Versions: 1.14.0 Reporter: Ted Dunning -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-4223) PIVOT and UNPIVOT to rotate table valued expressions
[ https://issues.apache.org/jira/browse/DRILL-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388637#comment-16388637 ] Ted Dunning commented on DRILL-4223: This is actually related to list_aggregate and some kind of inverse of flatten/unnest. My guess is that if we had a JSON constructor, this would be just about as good. The idea would be that columns could be specified to determine the key and value in an object. Aggregation would be the final step to get what John wants. Aggregation over structures is an open question since you don't necessarily know the keys in a structure. It would be nice to be able to apply an aggregation function to all members of the structure without knowing which members exist. > PIVOT and UNPIVOT to rotate table valued expressions > > > Key: DRILL-4223 > URL: https://issues.apache.org/jira/browse/DRILL-4223 > Project: Apache Drill > Issue Type: New Feature > Components: Execution - Codegen, SQL Parser >Reporter: Ashwin Aravind >Priority: Major > > Capability to PIVOT and UNPIVOT table valued expressions which are results of > a SELECT query -- This message was sent by Atlassian JIRA (v7.6.3#76005)
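The comment above describes aggregating over a structure whose member keys are not known in advance. A rough sketch of that idea in plain Java (not Drill code; the class and method names here are mine, and each row-structure is modeled as a simple map):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class StructAggregate {
    // Apply the same aggregate function to every member of every
    // row-structure, without knowing up front which keys exist.
    public static Map<String, Long> aggregate(List<Map<String, Long>> rows,
                                              BinaryOperator<Long> agg) {
        Map<String, Long> out = new HashMap<>();
        for (Map<String, Long> row : rows) {
            for (Map.Entry<String, Long> e : row.entrySet()) {
                // merge() handles keys that appear for the first time here.
                out.merge(e.getKey(), e.getValue(), agg);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> rows = List.of(
            Map.of("a", 1L, "b", 2L),
            Map.of("b", 3L, "c", 4L));
        // e.g. sums per key: a=1, b=5, c=4 (map iteration order unspecified)
        System.out.println(aggregate(rows, Long::sum));
    }
}
```

The aggregate function is a parameter, so the same walk works for sum, max, or count without the caller enumerating the keys.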
[jira] [Commented] (DRILL-6190) Packets can be bigger than strictly legal
[ https://issues.apache.org/jira/browse/DRILL-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382894#comment-16382894 ] Ted Dunning commented on DRILL-6190: Wasn't this already reviewed? The changes since then are trivial. Same for 6191. > Packets can be bigger than strictly legal > - > > Key: DRILL-6190 > URL: https://issues.apache.org/jira/browse/DRILL-6190 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning >Assignee: Ted Dunning >Priority: Major > Labels: ready-to-commit > Fix For: 1.13.0 > > > Packets, especially those generated by malware, can be bigger than the legal > limit for IP. The fix is to leave 64kB padding in the buffers instead of 9kB. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6191) Need more information on TCP flags
[ https://issues.apache.org/jira/browse/DRILL-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381466#comment-16381466 ] Ted Dunning commented on DRILL-6191: Fixed the test to release results. Updated pull request. This pull may now conflict with DRILL-6190, but probably not. > Need more information on TCP flags > -- > > Key: DRILL-6191 > URL: https://issues.apache.org/jira/browse/DRILL-6191 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning >Assignee: Ted Dunning >Priority: Major > Fix For: 1.13.0 > > > > This is a small fix based on input from Charles Givre -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6190) Packets can be bigger than strictly legal
[ https://issues.apache.org/jira/browse/DRILL-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381429#comment-16381429 ] Ted Dunning commented on DRILL-6190: Travis build is fixed: h3. [ #5031 passed|https://travis-ci.org/apache/drill/builds/347567906] * Ran for 43 min 7 sec > Packets can be bigger than strictly legal > - > > Key: DRILL-6190 > URL: https://issues.apache.org/jira/browse/DRILL-6190 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning >Assignee: Ted Dunning >Priority: Major > Fix For: 1.13.0 > > > Packets, especially those generated by malware, can be bigger than the legal > limit for IP. The fix is to leave 64kB padding in the buffers instead of 9kB. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6191) Need more information on TCP flags
[ https://issues.apache.org/jira/browse/DRILL-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated DRILL-6191: --- Fix Version/s: 1.13.0 > Need more information on TCP flags > -- > > Key: DRILL-6191 > URL: https://issues.apache.org/jira/browse/DRILL-6191 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning >Priority: Major > Fix For: 1.13.0 > > > > This is a small fix based on input from Charles Givre -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6190) Packets can be bigger than strictly legal
[ https://issues.apache.org/jira/browse/DRILL-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated DRILL-6190: --- Fix Version/s: 1.13.0 > Packets can be bigger than strictly legal > - > > Key: DRILL-6190 > URL: https://issues.apache.org/jira/browse/DRILL-6190 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning >Priority: Major > Fix For: 1.13.0 > > > Packets, especially those generated by malware, can be bigger than the legal > limit for IP. The fix is to leave 64kB padding in the buffers instead of 9kB. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6191) Need more information on TCP flags
[ https://issues.apache.org/jira/browse/DRILL-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378816#comment-16378816 ] Ted Dunning commented on DRILL-6191: Created pull request for this > Need more information on TCP flags > -- > > Key: DRILL-6191 > URL: https://issues.apache.org/jira/browse/DRILL-6191 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning >Priority: Major > > > This is a small fix based on input from Charles Givre -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6190) Packets can be bigger than strictly legal
[ https://issues.apache.org/jira/browse/DRILL-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378817#comment-16378817 ] Ted Dunning commented on DRILL-6190: Created pull request for this. > Packets can be bigger than strictly legal > - > > Key: DRILL-6190 > URL: https://issues.apache.org/jira/browse/DRILL-6190 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning >Priority: Major > > Packets, especially those generated by malware, can be bigger than the legal > limit for IP. The fix is to leave 64kB padding in the buffers instead of 9kB. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-6191) Need more information on TCP flags
Ted Dunning created DRILL-6191: -- Summary: Need more information on TCP flags Key: DRILL-6191 URL: https://issues.apache.org/jira/browse/DRILL-6191 Project: Apache Drill Issue Type: Bug Reporter: Ted Dunning This is a small fix based on input from Charles Givre -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-6190) Packets can be bigger than strictly legal
Ted Dunning created DRILL-6190: -- Summary: Packets can be bigger than strictly legal Key: DRILL-6190 URL: https://issues.apache.org/jira/browse/DRILL-6190 Project: Apache Drill Issue Type: Bug Reporter: Ted Dunning Packets, especially those generated by malware, can be bigger than the legal limit for IP. The fix is to leave 64kB padding in the buffers instead of 9kB. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
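The fix described above rests on simple arithmetic: the IP length field allows packets up to 65535 bytes, so 9 kB of headroom can be overrun by a legal-length (or malicious) packet, while 64 kB cannot. A minimal sketch of the invariant, with hypothetical names (the actual reader's buffer management differs):

```java
public class PacketBuffer {
    // An IPv4 packet can claim up to 65535 bytes, so captures can exceed
    // the ~9 kB jumbo-frame assumption the old padding was based on.
    static final int MAX_IP_PACKET = 65535;    // hard limit from the IP length field
    static final int OLD_PADDING = 9 * 1024;   // jumbo-frame-sized headroom (too small)
    static final int NEW_PADDING = 64 * 1024;  // headroom after the fix

    // Should the reader refill its buffer before decoding the next packet?
    // With NEW_PADDING, a refill always happens before any legal-length
    // packet could run past the end of the buffer.
    public static boolean needRefill(int validBytes, int offset, int padding) {
        return validBytes - offset < padding;
    }

    public static void main(String[] args) {
        System.out.println(MAX_IP_PACKET > OLD_PADDING);  // true: overrun was possible
        System.out.println(MAX_IP_PACKET <= NEW_PADDING); // true: 64 kB headroom suffices
    }
}
```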
[jira] [Created] (DRILL-6067) Add acknowledgement sequence number and flags to TCP fields
Ted Dunning created DRILL-6067: -- Summary: Add acknowledgement sequence number and flags to TCP fields Key: DRILL-6067 URL: https://issues.apache.org/jira/browse/DRILL-6067 Project: Apache Drill Issue Type: Bug Reporter: Ted Dunning Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5957) Wire protocol versioning, version negotiation
[ https://issues.apache.org/jira/browse/DRILL-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249063#comment-16249063 ] Ted Dunning commented on DRILL-5957: This suggestion has the virtue that only breaking changes will cause a version update, but it still has the problem that the version has to move no matter what part of the protocol changes. This is reminiscent of the old CORBA versioning nightmares. Also, is there really any way to negotiate the value vector format without having a reformatting step inserted with fairly catastrophic performance hit? I don't see a consideration of the cost of maintaining old version compatibility, either. If old client versions work, then there will be no incentive to upgrade. That will increase pressure to keep adding multiple protocol support to the server and will seemingly lock down any real progress just as much as client/server lockstepping. It seems that the short term desire here is to allow the vector format to change. What about making the current dvector parts be optional and adding alternative (optional) dvector parts in new formats? This effectively allows versioning of only the dvector stuff, leaving all the rest of the protocol to be soft-versioned as is currently done. The client advertised version could be used to trigger one format or the other and the incentive to upgrade is in the form of much slower transfer for the old format due to transcoding. > Wire protocol versioning, version negotiation > - > > Key: DRILL-5957 > URL: https://issues.apache.org/jira/browse/DRILL-5957 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.11.0 >Reporter: Paul Rogers > > Drill has very limited support for evolving its wire protocol. As Drill > becomes more widely deployed, this limitation will constrain the project's > ability to rapidly evolve the wire protocol based on user experience to > improve simplicitly, performance or minimize resource use. 
> Proposed is a standard mechanism to version the API and negotiate the API > version between client and server at connect time. The focus here is between > Drill clients (JDBC, ODBC) and the Drill server. The same mechanism can also > be used between servers to support rolling upgrades. > This proposal is an outline; it is not a detailed design. The purpose here is > to drive understanding of the problem. Once we have that, we can focus on the > implementation details. > h4. Problem Statement > The problem we wish to address here concerns both the _syntax_ and > _semantics_ of API messages. Syntax concerns: > * The set of messages and their sequence > * The format of bytes on the wire > * The format of message packets > Semantics concerns: > * The meaning of each field. > * The layout of non-message data (vectors, in Drill.) > We wish to introduce a system whereby both syntax and semantics can be > evolved in a controlled, known manner such that: > * A client of version x can connect to, and interoperate with, a server in a > range of versions (x-y, x+z) for some values of y and z. > For example, version x of the Drill client is deployed in the field. It must > connect to the oldest Drill cluster available to that client. (That is it > must connect to servers up to y versions old.) During an upgrade, the server > may be upgraded before the client. Thus, the client must also work with > servers up to z versions newer than the client. > If we wish to tackle rolling upgrades, then y and z can both be 1 for > server-to-server APIs. A version x server will talk with (x-1) servers when > the cluster upgrades to x, and will talk to (x+1) servers when the cluster is > upgraded to version (x+1). > h4. Current State > Drill currently provides some ad-hoc version compatibility: > * Slow change. Drill's APIs have not changed much since Drill 1.0, thereby > avoiding the issue. > * Protobuf support. 
Drill uses Protobuf for message bodies, leveraging that > format's ability to absorb the addition or deprecation of individual fields. > * API version number. The API holds a version number, though the code to use > it is rather ad-hoc. > The above has allowed clever coding to handle some version changes, but each > is a one-off, ad-hoc solution. The recent security work is an example that, > with enough effort, ad-hoc solutions can be found. > The above cannot handle: > * Change in the message order > * Change in the "pbody/dbody" structure of each message. > * Change in the structure of serialized value vectors. > As a result, the current structure prevents any change to Drill's core > mechanism, value vectors, as there is no way for clients and servers to > negotiate the vector wire format. For example, Drill cannot adopt Arrow > because a pre-Arrow client would n
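The "client advertises a version, server selects a format" idea in the comment above can be sketched as follows. This is purely illustrative; the constants and the negotiation call are not Drill's actual wire protocol:

```java
public class FormatNegotiation {
    // Hypothetical wire-format identifiers; not actual Drill constants.
    public static final int VECTOR_V1 = 1;  // current value-vector layout
    public static final int VECTOR_V2 = 2;  // e.g. a newer (Arrow-style) layout

    // Pick the newest format both sides understand. An old client that
    // advertises only V1 still connects, but pays the transcoding cost,
    // which is the upgrade incentive described in the comment.
    public static int negotiate(int clientMax, int serverMax) {
        int agreed = Math.min(clientMax, serverMax);
        if (agreed < VECTOR_V1) {
            throw new IllegalArgumentException("no common wire format");
        }
        return agreed;
    }

    public static void main(String[] args) {
        System.out.println(negotiate(VECTOR_V1, VECTOR_V2)); // old client: 1
        System.out.println(negotiate(VECTOR_V2, VECTOR_V2)); // current client: 2
    }
}
```

The point of the sketch is that only the dvector encoding is versioned; everything else in the protocol stays soft-versioned as it is today.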
[jira] [Created] (DRILL-5790) PCAP format explicitly opens local file
Ted Dunning created DRILL-5790: -- Summary: PCAP format explicitly opens local file Key: DRILL-5790 URL: https://issues.apache.org/jira/browse/DRILL-5790 Project: Apache Drill Issue Type: Bug Reporter: Ted Dunning Note the new FileInputStream line
{code}
@Override
public void setup(final OperatorContext context, final OutputMutator output) throws ExecutionSetupException {
  try {
    this.output = output;
    this.buffer = new byte[10];
    this.in = new FileInputStream(inputPath);
    this.decoder = new PacketDecoder(in);
    this.validBytes = in.read(buffer);
    this.projectedCols = getProjectedColsIfItNull();
    setColumns(projectedColumns);
  } catch (IOException io) {
    throw UserException.dataReadError(io)
        .addContext("File name:", inputPath)
        .build(logger);
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
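A sketch of the direction a fix could take: have the reader obtain its stream from the filesystem abstraction rather than from FileInputStream directly, so that DFS-resident captures work too. The interface and names below are illustrative, not Drill's actual API (the real fix would go through the storage plugin's filesystem open call):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PcapReaderSetup {
    // Models the filesystem abstraction as a supplier of streams so the
    // reader never assumes the capture is a local file. Names are mine.
    interface StreamOpener {
        InputStream open(String path) throws IOException;
    }

    private byte[] buffer;
    private int validBytes;

    public void setup(StreamOpener fs, String inputPath) {
        try (InputStream in = fs.open(inputPath)) {  // DFS-aware, not local-only
            this.buffer = new byte[100_000];         // illustrative size
            this.validBytes = in.read(buffer);
        } catch (IOException io) {
            // In Drill this would be wrapped as UserException.dataReadError(io).
            throw new RuntimeException("File name: " + inputPath, io);
        }
    }

    public int validBytes() {
        return validBytes;
    }

    public static void main(String[] args) {
        PcapReaderSetup reader = new PcapReaderSetup();
        // A fake in-memory "filesystem" stands in for the DFS here.
        reader.setup(path -> new ByteArrayInputStream(new byte[]{1, 2, 3}), "/capture.pcap");
        System.out.println(reader.validBytes()); // 3
    }
}
```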
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992131#comment-15992131 ] Ted Dunning commented on DRILL-5432: Grouping by TCP stream now works. The only remaining significant issues are: 1) rebase to track master (should be trivial since all changes were to new files) 2) decide whether to support pcap-ng now or later (likely it will be later when somebody asks for it) 3) measure speed > Want a memory format for PCAP files > --- > > Key: DRILL-5432 > URL: https://issues.apache.org/jira/browse/DRILL-5432 > Project: Apache Drill > Issue Type: New Feature >Reporter: Ted Dunning > > PCAP files [1] are the de facto standard for storing network capture data. In > security and protocol applications, it is very common to want to extract > particular packets from a capture for further analysis. > At a first level, it is desirable to query and filter by source and > destination IP and port or by protocol. Beyond that, however, it would be > very useful to be able to group packets by TCP session and eventually to look > at packet contents. For now, however, the most critical requirement is that > we should be able to scan captures at very high speed. > I previously wrote a (kind of working) proof of concept for a PCAP decoder > that did lazy deserialization and could traverse hundreds of MB of PCAP data > per second per core. This compares to roughly 2-3 MB/s for widely available > Apache-compatible open source PCAP decoders. > This JIRA covers the integration and extension of that proof of concept as a > Drill file format. > Initial work is available at https://github.com/mapr-demos/drill-pcap-format > [1] https://en.wikipedia.org/wiki/Pcap -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981793#comment-15981793 ] Ted Dunning commented on DRILL-5432: The version in github is now working. Thanks to Charles for the MAC address code.
{code}
0: jdbc:drill:zk=local> select src_ip, count(1), sum(packet_length) from dfs.`/Users/tdunning/Apache/drill-pcap-format/x.pcap` group by src_ip;
+------------------+---------+---------+
|      src_ip      | EXPR$1  | EXPR$2  |
+------------------+---------+---------+
| 10.0.1.5         | 24      | 3478    |
| 23.72.217.110    | 1       | 66      |
| 199.59.150.11    | 1       | 66      |
| 35.167.153.146   | 2       | 194     |
| 149.174.66.131   | 1       | 54      |
| 152.163.13.6     | 1       | 54      |
| 35.166.185.92    | 2       | 194     |
| 173.194.202.189  | 2       | 145     |
| 23.72.187.41     | 2       | 132     |
| 108.174.10.10    | 4       | 561     |
| 12.220.154.66    | 1       | 174     |
| 52.20.156.183    | 1       | 98      |
| 74.125.28.189    | 1       | 73      |
| 192.30.253.124   | 1       | 66      |
+------------------+---------+---------+
{code}
This is now up to the basic idea that we would like to have. The only major thing missing is the ability to group by TCP stream. You can emulate that by grouping by src_ip, dst_ip, src_port, dst_port, but we want something better. Can somebody take a look at the code? > Want a memory format for PCAP files > --- > > Key: DRILL-5432 > URL: https://issues.apache.org/jira/browse/DRILL-5432 > Project: Apache Drill > Issue Type: New Feature >Reporter: Ted Dunning > > PCAP files [1] are the de facto standard for storing network capture data. In > security and protocol applications, it is very common to want to extract > particular packets from a capture for further analysis. > At a first level, it is desirable to query and filter by source and > destination IP and port or by protocol. Beyond that, however, it would be > very useful to be able to group packets by TCP session and eventually to look > at packet contents. For now, however, the most critical requirement is that > we should be able to scan captures at very high speed. 
> I previously wrote a (kind of working) proof of concept for a PCAP decoder > that did lazy deserialization and could traverse hundreds of MB of PCAP data > per second per core. This compares to roughly 2-3 MB/s for widely available > Apache-compatible open source PCAP decoders. > This JIRA covers the integration and extension of that proof of concept as a > Drill file format. > Initial work is available at https://github.com/mapr-demos/drill-pcap-format > [1] https://en.wikipedia.org/wiki/Pcap -- This message was sent by Atlassian JIRA (v6.3.15#6346)
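The emulation mentioned in the comment above (grouping on src_ip, dst_ip, src_port, dst_port) splits each TCP conversation into its two directions. Ordering the two endpoints canonically makes both directions land in the same group; here is a sketch of that in plain Java, with class and method names of my own choosing:

```java
import java.util.Objects;

public class FlowKey {
    // A TCP "stream" is a bidirectional conversation, so grouping by the
    // raw (src_ip, src_port, dst_ip, dst_port) tuple produces two groups
    // per session. Putting the lexicographically smaller endpoint first
    // makes both directions map to the same key.
    final String ipA, ipB;
    final int portA, portB;

    private FlowKey(String ipA, int portA, String ipB, int portB) {
        this.ipA = ipA; this.portA = portA;
        this.ipB = ipB; this.portB = portB;
    }

    public static FlowKey of(String srcIp, int srcPort, String dstIp, int dstPort) {
        if (srcIp.compareTo(dstIp) < 0 || (srcIp.equals(dstIp) && srcPort <= dstPort)) {
            return new FlowKey(srcIp, srcPort, dstIp, dstPort);
        }
        return new FlowKey(dstIp, dstPort, srcIp, srcPort);
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof FlowKey)) return false;
        FlowKey k = (FlowKey) o;
        return portA == k.portA && portB == k.portB
            && ipA.equals(k.ipA) && ipB.equals(k.ipB);
    }

    @Override public int hashCode() {
        return Objects.hash(ipA, portA, ipB, portB);
    }

    public static void main(String[] args) {
        // Both directions of the same conversation produce equal keys.
        FlowKey out = FlowKey.of("10.0.1.5", 51594, "23.72.217.110", 5000);
        FlowKey back = FlowKey.of("23.72.217.110", 5000, "10.0.1.5", 51594);
        System.out.println(out.equals(back)); // true
    }
}
```

A real session column would additionally need to separate successive connections reusing the same port pair, which is why the comment asks for something better than the 4-tuple.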
[jira] [Updated] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated DRILL-5432: --- Description: PCAP files [1] are the de facto standard for storing network capture data. In security and protocol applications, it is very common to want to extract particular packets from a capture for further analysis. At a first level, it is desirable to query and filter by source and destination IP and port or by protocol. Beyond that, however, it would be very useful to be able to group packets by TCP session and eventually to look at packet contents. For now, however, the most critical requirement is that we should be able to scan captures at very high speed. I previously wrote a (kind of working) proof of concept for a PCAP decoder that did lazy deserialization and could traverse hundreds of MB of PCAP data per second per core. This compares to roughly 2-3 MB/s for widely available Apache-compatible open source PCAP decoders. This JIRA covers the integration and extension of that proof of concept as a Drill file format. Initial work is available at https://github.com/mapr-demos/drill-pcap-format [1] https://en.wikipedia.org/wiki/Pcap was: PCAP files [1] are the de facto standard for storing network capture data. In security and protocol applications, it is very common to want to extract particular packets from a capture for further analysis. At a first level, it is desirable to query and filter by source and destination IP and port or by protocol. Beyond that, however, it would be very useful to be able to group packets by TCP session and eventually to look at packet contents. For now, however, the most critical requirement is that we should be able to scan captures at very high speed. I previously wrote a (kind of working) proof of concept for a PCAP decoder that did lazy deserialization and could traverse hundreds of MB of PCAP data per second per core. 
This compares to roughly 2-3 MB/s for widely available Apache-compatible open source PCAP decoders. This JIRA covers the integration and extension of that proof of concept as a Drill file format. Initial work is available at https://github.com/mapr-demos/pcap-query [1] https://en.wikipedia.org/wiki/Pcap > Want a memory format for PCAP files > --- > > Key: DRILL-5432 > URL: https://issues.apache.org/jira/browse/DRILL-5432 > Project: Apache Drill > Issue Type: New Feature >Reporter: Ted Dunning > > PCAP files [1] are the de facto standard for storing network capture data. In > security and protocol applications, it is very common to want to extract > particular packets from a capture for further analysis. > At a first level, it is desirable to query and filter by source and > destination IP and port or by protocol. Beyond that, however, it would be > very useful to be able to group packets by TCP session and eventually to look > at packet contents. For now, however, the most critical requirement is that > we should be able to scan captures at very high speed. > I previously wrote a (kind of working) proof of concept for a PCAP decoder > that did lazy deserialization and could traverse hundreds of MB of PCAP data > per second per core. This compares to roughly 2-3 MB/s for widely available > Apache-compatible open source PCAP decoders. > This JIRA covers the integration and extension of that proof of concept as a > Drill file format. > Initial work is available at https://github.com/mapr-demos/drill-pcap-format > [1] https://en.wikipedia.org/wiki/Pcap -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967049#comment-15967049 ] Ted Dunning commented on DRILL-5432: Wow. Missed that. New URL: https://github.com/mapr-demos/drill-pcap-format I will update the original comment so as to limit the number of people who are confused. > Want a memory format for PCAP files > --- > > Key: DRILL-5432 > URL: https://issues.apache.org/jira/browse/DRILL-5432 > Project: Apache Drill > Issue Type: New Feature >Reporter: Ted Dunning > > PCAP files [1] are the de facto standard for storing network capture data. In > security and protocol applications, it is very common to want to extract > particular packets from a capture for further analysis. > At a first level, it is desirable to query and filter by source and > destination IP and port or by protocol. Beyond that, however, it would be > very useful to be able to group packets by TCP session and eventually to look > at packet contents. For now, however, the most critical requirement is that > we should be able to scan captures at very high speed. > I previously wrote a (kind of working) proof of concept for a PCAP decoder > that did lazy deserialization and could traverse hundreds of MB of PCAP data > per second per core. This compares to roughly 2-3 MB/s for widely available > Apache-compatible open source PCAP decoders. > This JIRA covers the integration and extension of that proof of concept as a > Drill file format. > Initial work is available at https://github.com/mapr-demos/pcap-query > [1] https://en.wikipedia.org/wiki/Pcap -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5432) Want a memory format for PCAP files
[ https://issues.apache.org/jira/browse/DRILL-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967041#comment-15967041 ] Ted Dunning commented on DRILL-5432: Charles, I don't understand your comment. Tug reported the following output from a sample file:
{code}
select * from dfs.`data`.`airtunes.pcap` limit 10
+-------+----------+--------------------------+-----------------+-----------------+-----------+-----------+----------------+-------+
| Type  | Network  |        Timestamp         |     dst_ip      |     src_ip      | src_port  | dst_port  | packet_length  | data  |
+-------+----------+--------------------------+-----------------+-----------------+-----------+-----------+----------------+-------+
| TCP   | 1        | 2012-03-29 22:05:41.808  | /192.168.3.123  | /192.168.3.107  | 51594     | 5000      | 78             | []    |
| TCP   | 1        | 2012-03-29 22:05:41.808  | /192.168.3.107  | /192.168.3.123  | 5000      | 51594     | 78             | []    |
| TCP   | 1        | 2012-03-29 22:05:41.808  | /192.168.3.123  | /192.168.3.107  | 51594     | 5000      | 66             | []    |
+-------+----------+--------------------------+-----------------+-----------------+-----------+-----------+----------------+-------+
{code}
What is your change going to do? > Want a memory format for PCAP files > --- > > Key: DRILL-5432 > URL: https://issues.apache.org/jira/browse/DRILL-5432 > Project: Apache Drill > Issue Type: New Feature >Reporter: Ted Dunning > > PCAP files [1] are the de facto standard for storing network capture data. In > security and protocol applications, it is very common to want to extract > particular packets from a capture for further analysis. > At a first level, it is desirable to query and filter by source and > destination IP and port or by protocol. Beyond that, however, it would be > very useful to be able to group packets by TCP session and eventually to look > at packet contents. For now, however, the most critical requirement is that > we should be able to scan captures at very high speed. > I previously wrote a (kind of working) proof of concept for a PCAP decoder > that did lazy deserialization and could traverse hundreds of MB of PCAP data > per second per core. This compares to roughly 2-3 MB/s for widely available > Apache-compatible open source PCAP decoders. 
> This JIRA covers the integration and extension of that proof of concept as a > Drill file format. > Initial work is available at https://github.com/mapr-demos/pcap-query > [1] https://en.wikipedia.org/wiki/Pcap -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (DRILL-5432) Want a memory format for PCAP files
Ted Dunning created DRILL-5432: -- Summary: Want a memory format for PCAP files Key: DRILL-5432 URL: https://issues.apache.org/jira/browse/DRILL-5432 Project: Apache Drill Issue Type: New Feature Reporter: Ted Dunning PCAP files [1] are the de facto standard for storing network capture data. In security and protocol applications, it is very common to want to extract particular packets from a capture for further analysis. At a first level, it is desirable to query and filter by source and destination IP and port or by protocol. Beyond that, however, it would be very useful to be able to group packets by TCP session and eventually to look at packet contents. For now, however, the most critical requirement is that we should be able to scan captures at very high speed. I previously wrote a (kind of working) proof of concept for a PCAP decoder that did lazy deserialization and could traverse hundreds of MB of PCAP data per second per core. This compares to roughly 2-3 MB/s for widely available Apache-compatible open source PCAP decoders. This JIRA covers the integration and extension of that proof of concept as a Drill file format. Initial work is available at https://github.com/mapr-demos/pcap-query [1] https://en.wikipedia.org/wiki/Pcap -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-4884) Drill produced IOB exception while querying data of 65536 limitation using non batched reader
[ https://issues.apache.org/jira/browse/DRILL-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15603623#comment-15603623 ] Ted Dunning commented on DRILL-4884: Hmm putting four copies of my parquet file into a directory made no difference. Can't seem to replicate this. > Drill produced IOB exception while querying data of 65536 limitation using > non batched reader > - > > Key: DRILL-4884 > URL: https://issues.apache.org/jira/browse/DRILL-4884 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning & Optimization >Affects Versions: 1.8.0 > Environment: CentOS 6.5 / JAVA 8 >Reporter: Hongze Zhang >Assignee: Jinfeng Ni > Original Estimate: 168h > Remaining Estimate: 168h > > Drill produces IOB while using a non batched scanner and limiting SQL by > 65536. > SQL: > {noformat} > select id from xx limit 1 offset 65535 > {noformat} > Result: > {noformat} > at > org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534) > ~[classes/:na] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:324) > [classes/:na] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184) > [classes/:na] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290) > [classes/:na] > at > org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) > [classes/:na] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [na:1.8.0_101] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [na:1.8.0_101] > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_101] > Caused by: java.lang.IndexOutOfBoundsException: index: 131072, length: 2 > (expected: range(0, 131072)) > at io.netty.buffer.DrillBuf.checkIndexD(DrillBuf.java:175) > ~[classes/:4.0.27.Final] > at io.netty.buffer.DrillBuf.chk(DrillBuf.java:197) > 
~[classes/:4.0.27.Final] > at io.netty.buffer.DrillBuf.setChar(DrillBuf.java:517) > ~[classes/:4.0.27.Final] > at > org.apache.drill.exec.record.selection.SelectionVector2.setIndex(SelectionVector2.java:79) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.limitWithNoSV(LimitRecordBatch.java:167) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.doWork(LimitRecordBatch.java:145) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:93) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext(LimitRecordBatch.java:115) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:215) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:94) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:215) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) > ~[classes/:na] > at > 
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:132) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:215) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) > ~[classes/:na] > at >
[jira] [Commented] (DRILL-4884) Drill produced IOB exception while querying data of 65536 limitation using non batched reader
[ https://issues.apache.org/jira/browse/DRILL-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15603615#comment-15603615 ] Ted Dunning commented on DRILL-4884:

I just did an experiment (which must be flawed) to try to recreate this problem. First, I created a table:
{code}
drop table maprfs.ted.`q1.parquet`;
create table maprfs.ted.`q1.parquet` as
with
  x1(a,b) as (values (1, rand()-0.5), (1, rand()-0.5)),
  x2 as (select t1.a as a, t1.b + t2.b + t3.b + t4.b as b
         from x1 t1, x1 t2, x1 t3, x1 t4
         where t1.a = t2.a and t2.a = t3.a and t3.a = t4.a),
  x3 as (select t1.a as a, t1.b + t2.b + t3.b + t4.b as b
         from x2 t1, x2 t2, x2 t3, x2 t4
         where t1.a = t2.a and t2.a = t3.a and t3.a = t4.a),
  x4 as (select t1.a as a, t1.b + t2.b + t3.b + t4.b as b
         from x1 t1, x1 t2, x1 t3, x3 t4
         where t1.a = t2.a and t2.a = t3.a and t3.a = t4.a)
select * from x4;
{code}
This table has about half a million rows (x1 has 2 rows, x2 has 2^4 = 16, x3 has 16^4 = 65536, and x4 has 2 * 2 * 2 * 65,536):
{code}
0: jdbc:drill:> select count(*) from maprfs.ted.`q1.parquet`;
+---------+
| EXPR$0  |
+---------+
| 524288  |
+---------+
{code}
Unfortunately, I can't get Drill to fail using a limit of 65536±1. Or 100,000. Or 200,000. Is the phrase "non batched scanner" somehow magical here? Or do I need to have multiple files in a directory?

> Drill produced IOB exception while querying data of 65536 limitation using
> non batched reader
> -----
>
> Key: DRILL-4884
> URL: https://issues.apache.org/jira/browse/DRILL-4884
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 1.8.0
> Environment: CentOS 6.5 / JAVA 8
> Reporter: Hongze Zhang
> Assignee: Jinfeng Ni
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Drill produces IOB while using a non batched scanner and limiting SQL by
> 65536.
> SQL: > {noformat} > select id from xx limit 1 offset 65535 > {noformat} > Result: > {noformat} > at > org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534) > ~[classes/:na] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:324) > [classes/:na] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184) > [classes/:na] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290) > [classes/:na] > at > org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) > [classes/:na] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [na:1.8.0_101] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [na:1.8.0_101] > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_101] > Caused by: java.lang.IndexOutOfBoundsException: index: 131072, length: 2 > (expected: range(0, 131072)) > at io.netty.buffer.DrillBuf.checkIndexD(DrillBuf.java:175) > ~[classes/:4.0.27.Final] > at io.netty.buffer.DrillBuf.chk(DrillBuf.java:197) > ~[classes/:4.0.27.Final] > at io.netty.buffer.DrillBuf.setChar(DrillBuf.java:517) > ~[classes/:4.0.27.Final] > at > org.apache.drill.exec.record.selection.SelectionVector2.setIndex(SelectionVector2.java:79) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.limitWithNoSV(LimitRecordBatch.java:167) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.doWork(LimitRecordBatch.java:145) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:93) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext(LimitRecordBatch.java:115) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) > ~[classes/:na] > at > 
org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:215) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) > ~[classes/:na] > at > org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:94) > ~[classes/:na] > at > org.apache.drill.exec.record.AbstractR
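The numbers in the reported exception line up with simple arithmetic: SelectionVector2 stores one 2-byte (char-sized) entry per record, so a buffer sized for 65536 records spans bytes [0, 131072), and the entry for record index 65536 starts at byte 65536 * 2 = 131072, exactly one entry past the end. The following Python check is a sketch of that bounds arithmetic only, not Drill or Netty code:

```python
# Arithmetic behind "index: 131072, length: 2 (expected: range(0, 131072))"
# from the stack trace (a sketch, not Drill code).

RECORDS = 65536
ENTRY_BYTES = 2               # one char-sized selection index per record
buffer_len = RECORDS * ENTRY_BYTES

def set_entry_offset(record_index: int) -> int:
    """Return the byte offset an entry write would start at, or raise
    the same style of error DrillBuf.checkIndexD reports."""
    offset = record_index * ENTRY_BYTES
    if offset + ENTRY_BYTES > buffer_len:
        raise IndexError(f"index: {offset}, length: {ENTRY_BYTES} "
                         f"(expected: range(0, {buffer_len}))")
    return offset

print(set_entry_offset(65535))   # last valid record: byte offset 131070
```

So the failure is an off-by-one at exactly the 65536-record boundary, which matches the `limit 1 offset 65535` query in the report.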
[jira] [Commented] (DRILL-4754) Missing values are not missing
[ https://issues.apache.org/jira/browse/DRILL-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350440#comment-15350440 ] Ted Dunning commented on DRILL-4754:

This other bug (from 18 months ago, with no apparent progress) notes the conflation of empty and missing, but doesn't directly address it.

> Missing values are not missing
> ------------------------------
>
> Key: DRILL-4754
> URL: https://issues.apache.org/jira/browse/DRILL-4754
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Ted Dunning
>
> If I have a query which reads from a JSON file where a field is a list or is
> missing, then the records where the field should be missing will instead have
> a value for that field that is an empty list:
> {code}
> 0: jdbc:drill:> select * from maprfs.ted.`bug.json`;
> +----+--------+-------+
> | a  | b      | c     |
> +----+--------+-------+
> | 3  | [3,2]  | xyz   |
> | 7  | []     | wxy   |
> | 7  | []     | null  |
> +----+--------+-------+
> 2 rows selected (1.279 seconds)
> {code}
> where the file in question contains these three records:
> {code}
> {'a':3, 'b':[3,2], 'c':'xyz'}
> {'a':7, 'c':'wxy'}
> {"a":7, "b":[]}
> {code}
> The problem is in the second record of the result. I would have expected b to
> have had the value NULL.
> I am using drill-1.6.0.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
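The distinction the issue turns on is that a JSON key can be absent, explicitly null, or an empty list, and these are three different things. A plain-Python restatement (a sketch of the semantics, not Drill's JSON reader) makes the conflation concrete:

```python
import json

# Three states a field can be in: present, missing, or an empty list.
# The reported Drill output collapses "missing" into "empty list".
records = [
    json.loads('{"a": 3, "b": [3, 2], "c": "xyz"}'),
    json.loads('{"a": 7, "c": "wxy"}'),          # "b" is missing here
    json.loads('{"a": 7, "b": []}'),             # "b" is an empty list here
]

def describe_b(rec):
    if "b" not in rec:
        return "missing"          # the case the issue says should be NULL
    if rec["b"] == []:
        return "empty list"
    return rec["b"]

print([describe_b(r) for r in records])
```

The second record is the one the issue flags: a reader that materializes a vector of lists defaults the absent field to an empty list instead of NULL.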
[jira] [Commented] (DRILL-4754) Missing values are not missing
[ https://issues.apache.org/jira/browse/DRILL-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350430#comment-15350430 ] Ted Dunning commented on DRILL-4754: Hmm... I can't find any such JIRA's just off hand. I see DRILL-3831, but that seems to be a very different matter. I will look further. > Missing values are not missing > -- > > Key: DRILL-4754 > URL: https://issues.apache.org/jira/browse/DRILL-4754 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning > > If I have a query which reads from a JSON file where a field is a list or is > missing, then the records where the field should missing will instead have a > value for that field that is an empty list: > {code} > 0: jdbc:drill:> select * from maprfs.ted.`bug.json`; > +++--+ > | *a* | *b*| *c* | > | 3 | [3,2] | xyz | > | 7 | [] | wxy | > | 7 | [] | null | > +++--+ > 2 rows selected (1.279 seconds) > {code} > where the file in question contains these three records: > {code} > {'a':3, 'b':[3,2], 'c':'xyz'} > {'a':7, 'c':'wxy'} > {"a":7, "b":[]} > {code} > The problem is in the second record of the result. I would have expected b to > have had the value NULL. > I am using drill-1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DRILL-4754) Missing values are not missing
[ https://issues.apache.org/jira/browse/DRILL-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated DRILL-4754: --- Description: If I have a query which reads from a JSON file where a field is a list or is missing, then the records where the field should missing will instead have a value for that field that is an empty list: {code} 0: jdbc:drill:> select * from maprfs.ted.`bug.json`; +++--+ | *a* | *b*| *c* | | 3 | [3,2] | xyz | | 7 | [] | wxy | | 7 | [] | null | +++--+ 2 rows selected (1.279 seconds) {code} where the file in question contains these three records: {code} {'a':3, 'b':[3,2], 'c':'xyz'} {'a':7, 'c':'wxy'} {"a":7, "b":[]} {code} The problem is in the second record of the result. I would have expected b to have had the value NULL. I am using drill-1.6.0. was: If I have a query which reads from a JSON file where a field is a list or is missing, then the records where the field should missing will instead have a value for that field that is an empty list: {{ 0: jdbc:drill:> select * from maprfs.ted.`bug.json`; +++--+ | *a* | *b*| *c* | | 3 | [3,2] | xyz | | 7 | [] | wxy | | 7 | [] | null | +++--+ 2 rows selected (1.279 seconds) }} where the file in question contains these two records: {{ {'a':3, 'b':[3,2], 'c':'xyz'} {'a':7, 'c':'wxy'} {"a":7, "b":[]} }} The problem is in the second record of the result. I would have expected b to have had the value NULL. I am using drill-1.6.0. 
> Missing values are not missing > -- > > Key: DRILL-4754 > URL: https://issues.apache.org/jira/browse/DRILL-4754 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning > > If I have a query which reads from a JSON file where a field is a list or is > missing, then the records where the field should missing will instead have a > value for that field that is an empty list: > {code} > 0: jdbc:drill:> select * from maprfs.ted.`bug.json`; > +++--+ > | *a* | *b*| *c* | > | 3 | [3,2] | xyz | > | 7 | [] | wxy | > | 7 | [] | null | > +++--+ > 2 rows selected (1.279 seconds) > {code} > where the file in question contains these three records: > {code} > {'a':3, 'b':[3,2], 'c':'xyz'} > {'a':7, 'c':'wxy'} > {"a":7, "b":[]} > {code} > The problem is in the second record of the result. I would have expected b to > have had the value NULL. > I am using drill-1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DRILL-4754) Missing values are not missing
[ https://issues.apache.org/jira/browse/DRILL-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated DRILL-4754: --- Description: If I have a query which reads from a JSON file where a field is a list or is missing, then the records where the field should missing will instead have a value for that field that is an empty list: {{ 0: jdbc:drill:> select * from maprfs.ted.`bug.json`; +++--+ | *a* | *b*| *c* | | 3 | [3,2] | xyz | | 7 | [] | wxy | | 7 | [] | null | +++--+ 2 rows selected (1.279 seconds) }} where the file in question contains these two records: {{ {'a':3, 'b':[3,2], 'c':'xyz'} {'a':7, 'c':'wxy'} {"a":7, "b":[]} }} The problem is in the second record of the result. I would have expected b to have had the value NULL. I am using drill-1.6.0. was: If I have a query which reads from a JSON file where a field is a list or is missing, then the records where the field should missing will instead have a value for that field that is an empty list: {{ 0: jdbc:drill:> select * from maprfs.ted.`bug.json`; +++--+ | *a* | b| c | | 3 | [3,2] | xyz | | 7 | [] | wxy | | 7 | [] | null | +++--+ 2 rows selected (1.279 seconds) }} where the file in question contains these two records: {{ {'a':3, 'b':[3,2], 'c':'xyz'} {'a':7, 'c':'wxy'} {"a":7, "b":[]} }} The problem is in the second record of the result. I would have expected b to have had the value NULL. I am using drill-1.6.0. 
> Missing values are not missing > -- > > Key: DRILL-4754 > URL: https://issues.apache.org/jira/browse/DRILL-4754 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning > > If I have a query which reads from a JSON file where a field is a list or is > missing, then the records where the field should missing will instead have a > value for that field that is an empty list: > {{ > 0: jdbc:drill:> select * from maprfs.ted.`bug.json`; > +++--+ > | *a* | *b*| *c* | > | 3 | [3,2] | xyz | > | 7 | [] | wxy | > | 7 | [] | null | > +++--+ > 2 rows selected (1.279 seconds) > }} > where the file in question contains these two records: > {{ > {'a':3, 'b':[3,2], 'c':'xyz'} > {'a':7, 'c':'wxy'} > {"a":7, "b":[]} > }} > The problem is in the second record of the result. I would have expected b to > have had the value NULL. > I am using drill-1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DRILL-4754) Missing values are not missing
[ https://issues.apache.org/jira/browse/DRILL-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated DRILL-4754: --- Description: If I have a query which reads from a JSON file where a field is a list or is missing, then the records where the field should missing will instead have a value for that field that is an empty list: {{ 0: jdbc:drill:> select * from maprfs.ted.`bug.json`; +++--+ | *a* | b| c | | 3 | [3,2] | xyz | | 7 | [] | wxy | | 7 | [] | null | +++--+ 2 rows selected (1.279 seconds) }} where the file in question contains these two records: {{ {'a':3, 'b':[3,2], 'c':'xyz'} {'a':7, 'c':'wxy'} {"a":7, "b":[]} }} The problem is in the second record of the result. I would have expected b to have had the value NULL. I am using drill-1.6.0. was: If I have a query which reads from a JSON file where a field is a list or is missing, then the records where the field should missing will instead have a value for that field that is an empty list: 0: jdbc:drill:> select * from maprfs.ted.`bug.json`; +++--+ | a | b| c | +++--+ | 3 | [3,2] | xyz | | 7 | [] | wxy | +++--+ 2 rows selected (1.279 seconds) where the file in question contains these two records: {'a':3, 'b':[3,2], 'c':'xyz'} {'a':7, 'c':'wxy'} The problem is in the second record of the result. I would have expected b to have had the value NULL. I am using drill-1.6.0. 
> Missing values are not missing > -- > > Key: DRILL-4754 > URL: https://issues.apache.org/jira/browse/DRILL-4754 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning > > If I have a query which reads from a JSON file where a field is a list or is > missing, then the records where the field should missing will instead have a > value for that field that is an empty list: > {{ > 0: jdbc:drill:> select * from maprfs.ted.`bug.json`; > +++--+ > | *a* | b| c | > | 3 | [3,2] | xyz | > | 7 | [] | wxy | > | 7 | [] | null | > +++--+ > 2 rows selected (1.279 seconds) > }} > where the file in question contains these two records: > {{ > {'a':3, 'b':[3,2], 'c':'xyz'} > {'a':7, 'c':'wxy'} > {"a":7, "b":[]} > }} > The problem is in the second record of the result. I would have expected b to > have had the value NULL. > I am using drill-1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-4754) Missing values are not missing
Ted Dunning created DRILL-4754:
--------------------------------
Summary: Missing values are not missing
Key: DRILL-4754
URL: https://issues.apache.org/jira/browse/DRILL-4754
Project: Apache Drill
Issue Type: Bug
Reporter: Ted Dunning

If I have a query which reads from a JSON file where a field is a list or is missing, then the records where the field should be missing will instead have a value for that field that is an empty list:
{code}
0: jdbc:drill:> select * from maprfs.ted.`bug.json`;
+----+--------+------+
| a  | b      | c    |
+----+--------+------+
| 3  | [3,2]  | xyz  |
| 7  | []     | wxy  |
+----+--------+------+
2 rows selected (1.279 seconds)
{code}
where the file in question contains these two records:
{code}
{'a':3, 'b':[3,2], 'c':'xyz'}
{'a':7, 'c':'wxy'}
{code}
The problem is in the second record of the result. I would have expected b to have had the value NULL. I am using drill-1.6.0.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-3912) Common subexpression elimination
[ https://issues.apache.org/jira/browse/DRILL-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947721#comment-14947721 ] Ted Dunning commented on DRILL-3912: It sounds like this only deals with common sub-expressions in expressions. A far more significant optimization would be to deal with common sub-expressions at a larger scale. A classic case is multiple re-use of a single expression in a common table expression. For instance, {code} with x as (select dir0, id from dfs.tdunning.zoom where id < 12), y as (select id, count(*) cnt from x group by id), z as (select count(distinct id) id_count from x) select dir0, x.id, y.cnt from x , y, z where x.id = y.id and y.cnt / z.id_count > 3 {code} Without good sub-expression elimination, table zoom will be scanned three times. Last I heard, DRILL doesn't optimize this away. > Common subexpression elimination > > > Key: DRILL-3912 > URL: https://issues.apache.org/jira/browse/DRILL-3912 > Project: Apache Drill > Issue Type: Bug >Reporter: Steven Phillips >Assignee: Steven Phillips > > Drill currently will evaluate the full expression tree, even if there are > redundant subtrees. Many of these redundant evaluations can be eliminated by > reusing the results from previously evaluated expression trees. > For example, > {code} > select a + 1, (a + 1)* (a - 1) from t > {code} > Will compute the entire (a + 1) expression twice. With CSE, it will only be > evaluated once. > The benefit will be reducing the work done when evaluating expressions, as > well as reducing the amount of code that is generated, which could also lead > to better JIT optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
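The core idea of common-subexpression elimination can be shown with a toy evaluator (hypothetical code, not Drill's planner or codegen): expressions are trees, and a cache keyed by the subtree guarantees each distinct subtree is evaluated once, whether it recurs inside one expression or across a SELECT list.

```python
# Toy CSE: expressions are ("op", left, right) tuples; a cache keyed by
# the whole subtree means each distinct subtree is computed exactly once.

def evaluate(expr, env, cache, counter):
    if not isinstance(expr, tuple):              # leaf: column name or literal
        return env.get(expr, expr)
    if expr in cache:                            # CSE: reuse prior result
        return cache[expr]
    op, left, right = expr
    counter[op] = counter.get(op, 0) + 1         # count *real* evaluations
    l = evaluate(left, env, cache, counter)
    r = evaluate(right, env, cache, counter)
    result = {"+": l + r, "-": l - r, "*": l * r}[op]
    cache[expr] = result
    return result

# The DRILL-3912 example: select a + 1, (a + 1) * (a - 1) -- the (a + 1)
# subtree appears twice but is evaluated once.
a_plus_1 = ("+", "a", 1)
query = [a_plus_1, ("*", a_plus_1, ("-", "a", 1))]
cache, counter = {}, {}
rows = [evaluate(e, {"a": 5}, cache, counter) for e in query]
print(rows, counter)
```

Ted's comment asks for the same cache one level up: if the shared unit is a whole CTE rather than a scalar subtree, the analogous move is materializing the CTE's result once instead of re-running its scan for each reference.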
[jira] [Commented] (DRILL-3894) Directory functions (MaxDir, MinDir ..) should have optional filename parameter
[ https://issues.apache.org/jira/browse/DRILL-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943954#comment-14943954 ] Ted Dunning commented on DRILL-3894:

Actually, I just tested a bit with this. I agree that this is a valid request, but there is a trivial (though not necessarily obvious) workaround. TLDR: Just use '.' as the table name.

I created a workspace zoom under my home directory as the directory zoom. From my home directory, I can use {{MAXDIR}} as expected:
{code}
0: jdbc:drill:> select count(*) from dfs.tdunning.zoom;
+---------+
| EXPR$0  |
+---------+
| 600     |
+---------+
1 row selected (0.378 seconds)
0: jdbc:drill:> select count(*) from dfs.tdunning.zoom where dir0 = MAXDIR('dfs.tdunning', 'zoom');
+---------+
| EXPR$0  |
+---------+
| 200     |
+---------+
1 row selected (0.799 seconds)
{code}
So that all works. If I try to touch the zoom workspace, I immediately have some issues because a workspace isn't a table.
{code}
0: jdbc:drill:> select count(*) from dfs.zoom;
Error: PARSE ERROR: From line 1, column 22 to line 1, column 24: Table 'dfs.zoom' not found
{code}
Using the hack of {{`.`}} as a table resolves this, however:
{code}
0: jdbc:drill:> select count(*) from dfs.zoom.`.`;
+---------+
| EXPR$0  |
+---------+
| 600     |
+---------+
1 row selected (0.336 seconds)
0: jdbc:drill:> select count(*) from dfs.zoom.`.` where dir0 = maxdir('dfs.zoom', '.');
+---------+
| EXPR$0  |
+---------+
| 200     |
+---------+
1 row selected (0.777 seconds)
0: jdbc:drill:>
{code}

> Directory functions (MaxDir, MinDir ..) should have optional filename
> parameter
> ---
>
> Key: DRILL-3894
> URL: https://issues.apache.org/jira/browse/DRILL-3894
> Project: Apache Drill
> Issue Type: Improvement
> Components: Functions - Drill
> Affects Versions: 1.2.0
> Reporter: Neeraja
>
> https://drill.apache.org/docs/query-directory-functions/
> The directory functions documented above should provide the ability to have
> the second parameter (file name) be optional.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
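What MAXDIR computes can be restated in a few lines of Python: the lexicographically greatest subdirectory name under a workspace path, which queries then compare against dir0. This is a sketch of the semantics only, not Drill's implementation, and the date-named directories are fabricated for the example:

```python
import os
import tempfile

# Rough semantics of MAXDIR (a sketch, not Drill code): the greatest
# subdirectory name, by string comparison, directly under a path.

def maxdir(path: str) -> str:
    subdirs = [d for d in os.listdir(path)
               if os.path.isdir(os.path.join(path, d))]
    return max(subdirs)

with tempfile.TemporaryDirectory() as root:
    for d in ("2015-01", "2015-02", "2015-03"):
        os.mkdir(os.path.join(root, d))
    latest = maxdir(root)
print(latest)
```

Because the comparison is lexicographic, the trick works best with zero-padded, sortable partition names like the dated directories above.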
[jira] [Commented] (DRILL-3815) unknown suffixes .not_json and .json_not treated differently (multi-file case)
[ https://issues.apache.org/jira/browse/DRILL-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901691#comment-14901691 ] Ted Dunning commented on DRILL-3815: Daniel, Did you trace this a bit to see where the extensions are being matched? Could it be a naively constructed regex? Kinda smells like that. > unknown suffixes .not_json and .json_not treated differently (multi-file case) > -- > > Key: DRILL-3815 > URL: https://issues.apache.org/jira/browse/DRILL-3815 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Other >Reporter: Daniel Barclay (Drill) >Assignee: Jacques Nadeau > > In scanning a directory subtree used as a table, unknown filename extensions > seem to be treated differently depending on whether they're similar to known > file extensions. The behavior suggests that Drill checks whether a file name > _contains_ an extension's string rather than _ending_ with it. > For example, given these subtrees with almost identical leaf file names: > {noformat} > $ find /tmp/testext_xx_json/ > /tmp/testext_xx_json/ > /tmp/testext_xx_json/voter2.not_json > /tmp/testext_xx_json/voter1.json > $ find /tmp/testext_json_xx/ > /tmp/testext_json_xx/ > /tmp/testext_json_xx/voter1.json > /tmp/testext_json_xx/voter2.json_not > $ > {noformat} > the results of trying to use them as tables differs: > {noformat} > 0: jdbc:drill:zk=local> SELECT * FROM `dfs.tmp`.`testext_xx_json`; > Sep 21, 2015 11:41:50 AM > org.apache.calcite.sql.validate.SqlValidatorException > ... 
> Error: VALIDATION ERROR: From line 1, column 17 to line 1, column 25: Table > 'dfs.tmp.testext_xx_json' not found > [Error Id: 6fe41deb-0e39-43f6-beca-de27b39d276b on dev-linux2:31010] > (state=,code=0) > 0: jdbc:drill:zk=local> SELECT * FROM `dfs.tmp`.`testext_json_xx`; > +---+ > | onecf | > +---+ > | {"name":"someName1"} | > | {"name":"someName2"} | > +---+ > 2 rows selected (0.149 seconds) > {noformat} > (Other probing seems to indicate that there is also some sensitivity to > whether the extension contains an underscore character.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
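Ted's "naively constructed regex" suspicion can be stated in two one-line predicates (hypothetical code, not Drill's actual matching logic): testing whether a filename *contains* the extension string instead of *ending with* it would explain why `voter2.json_not` is accepted as JSON while `voter2.not_json` is not, even though both suffixes are unknown.

```python
# The suspected bug in miniature (hypothetical -- not Drill's code).

def contains_match(name: str, ext: str) -> bool:
    return ext in name                      # naive: substring test

def suffix_match(name: str, ext: str) -> bool:
    return name.endswith("." + ext)         # correct: true extension test

files = ["voter1.json", "voter2.json_not", "voter2.not_json"]
print([f for f in files if contains_match(f, "json")])
print([f for f in files if suffix_match(f, "json")])
```

Under the substring test all three names "match" json; under the suffix test only `voter1.json` does. Note that `.not_json` also contains the substring, so a pure contains-test doesn't fully explain the reported asymmetry; the underscore sensitivity mentioned in the issue suggests the real pattern is slightly more involved.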
[jira] [Commented] (DRILL-3698) Expose Show Files Command As SQL for sorting/filtering
[ https://issues.apache.org/jira/browse/DRILL-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708633#comment-14708633 ] Ted Dunning commented on DRILL-3698:

I think that making [show tables] into the equivalent of a select query is the way to go with this. If that can work syntactically, then everything you want just falls out directly.

> Expose Show Files Command As SQL for sorting/filtering
> ------------------------------------------------------
>
> Key: DRILL-3698
> URL: https://issues.apache.org/jira/browse/DRILL-3698
> Project: Apache Drill
> Issue Type: Improvement
> Components: SQL Parser
> Affects Versions: Future
> Environment: All
> Reporter: John Omernik
> Assignee: Aman Sinha
> Labels: features
> Fix For: Future
>
> When using drill, I had a workspace set up, and I found myself using the show
> files command often to find my directories etc. The thing is, the return of
> show files is not ordered, and when looking at file system data there are
> many possible ways to order the results for efficiency as a user.
> Consider the ls command in unix; the ability to specify different sorting is
> built in there. I checked out
> http://drill.apache.org/docs/show-files-command/ as well as tried the
> "obvious" show files order by name, and that didn't work, nor did I see in
> the documentation how I could do it.
> Based on a mailing list discussion there is no way to do that currently in
> Drill, hence this JIRA. I think just adding ORDER BY SQL methodology would be
> perfect here; you have 8 fields (seen below) and ordering by any one of them,
> or a group of them, with ASC/DESC just like a standard SQL order by would be
> a huge win.
> I suppose one could potentially ask for a WHERE clause (filtering) too, and
> maybe a select (which fields to display); however, I am more concerned with
> the order, but if I had to implement all three I could see examples below:
> (All three: select, where, and order. I.e. after "Files", if the token isn't
> WHERE or ORDER then check for the fields; if it's not a valid field list,
> error.)
> SHOW FILES name, accessTime WHERE name like '%.csv' ORDER BY name;
> (Where clause and order; note the token after FILES is WHERE)
> SHOW FILES WHERE name like '%.csv' ORDER BY length ASC, name DESC;
> (Only order; ORDER is the first token after FILES)
> SHOW FILES ORDER BY length ASC, name DESC
> I don't think we have to grant full SQL functionality here (i.e. aggregates),
> just the ability to display various fields, filter on criteria, and apply
> ordering.
> If you wanted to get fancy, I suppose you could take the results, make them a
> quick in-memory table, and then utilize the whole drill stack on it. Lots of
> options. I just wanted to get this down in an email/JIRA as it was something
> I found myself wishing I had over and over during data exploration.
> Fields currently returned:
> |name|isDirectory|isFile|length|owner|group|permissions|accessTime|modificationTime|
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-3545) Need documentation on BINARY_STRING and STRING_BINARY functions
Ted Dunning created DRILL-3545: -- Summary: Need documentation on BINARY_STRING and STRING_BINARY functions Key: DRILL-3545 URL: https://issues.apache.org/jira/browse/DRILL-3545 Project: Apache Drill Issue Type: Bug Reporter: Ted Dunning These are darn handy but we need to document them so the community at large can find out about them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-3544) Need better error messages when convert_to is given a bad type
Ted Dunning created DRILL-3544:
--------------------------------
Summary: Need better error messages when convert_to is given a bad type
Key: DRILL-3544
URL: https://issues.apache.org/jira/browse/DRILL-3544
Project: Apache Drill
Issue Type: Bug
Reporter: Ted Dunning

The first query below fails because I used UTF-8 instead of UTF8. This should have a decent error message.
{code}
0: jdbc:drill:zk=local> SELECT CONVERT_TO('[ [1, 2], [3, 4], [5]]' ,'UTF-8') AS MYCOL1 FROM sys.version;
Error: SYSTEM ERROR: org.apache.drill.exec.work.foreman.ForemanException: Unexpected exception during fragment initialization: null
[Error Id: 899207da-2338-4b09-bdc8-8e12e320b661 on 172.16.0.61:31010] (state=,code=0)
0: jdbc:drill:zk=local> SELECT CONVERT_TO('[ [1, 2], [3, 4], [5]]' ,'UTF8') AS MYCOL1 FROM sys.version;
+-------------+
|   MYCOL1    |
+-------------+
| [B@71f3d3a  |
+-------------+
1 row selected (0.108 seconds)
0: jdbc:drill:zk=local>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
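The kind of error message being asked for is easy to sketch: on an unknown type name, echo the bad input back and suggest the closest valid alternatives instead of surfacing a null ForemanException. The code below is a hypothetical illustration in Python, not Drill's CONVERT_TO implementation, and the VALID_TYPES set is a made-up sample rather than Drill's real type list.

```python
import difflib

# A sketch of a friendlier type-name check (hypothetical -- not Drill
# code; VALID_TYPES is an illustrative sample, not Drill's actual list).
VALID_TYPES = {"UTF8", "UTF16", "JSON", "INT", "BIGINT", "DATE_EPOCH"}

def check_convert_type(name: str) -> str:
    if name in VALID_TYPES:
        return name
    # difflib finds near-misses such as 'UTF-8' -> 'UTF8'
    hints = difflib.get_close_matches(name, VALID_TYPES, n=3)
    suggestion = f" Did you mean {', '.join(hints)}?" if hints else ""
    raise ValueError(f"Unknown CONVERT_TO type '{name}'.{suggestion}")

print(check_convert_type("UTF8"))
```

With a check like this, the failing query in the report would die with "Unknown CONVERT_TO type 'UTF-8'. Did you mean UTF8..." instead of an opaque system error.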
[jira] [Comment Edited] (DRILL-3517) Add a UDF development FAQ
[ https://issues.apache.org/jira/browse/DRILL-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634309#comment-14634309 ] Ted Dunning edited comment on DRILL-3517 at 7/21/15 1:44 AM: - 8) My UDF is not a pure function so calling it with the same arguments will result in different values each time. Drill is only using the result of calling my function once. How can I fix that? - Add the isRandom flag to the annotation that defines your class as a function: @FunctionTemplate(isRandom = true, ...) was (Author: tdunning): 8) My UDF is not a pure function so calling it with the same arguments will result in different values each time. Drill is only using the result of calling my function once. How can I fix that? > Add a UDF development FAQ > - > > Key: DRILL-3517 > URL: https://issues.apache.org/jira/browse/DRILL-3517 > Project: Apache Drill > Issue Type: Sub-task > Components: Documentation >Reporter: Jacques Nadeau >Assignee: Bridget Bevens > > Lets create a UDF FAQ of common issues, log entries, etc so that people know > what to do when they hit certain issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
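The flag in the comment above drops into a standard Drill simple-function skeleton. A minimal sketch, assuming Drill's UDF API (`DrillSimpleFunc` with the `@FunctionTemplate`/`@Param`/`@Output` annotations); the class and function names are hypothetical, and it needs drill-java-exec on the classpath to compile:

```java
// Hypothetical UDF sketch; requires org.apache.drill:drill-java-exec to compile.
@FunctionTemplate(name = "my_impure_fn",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL,
    isRandom = true)  // marks the function non-deterministic so Drill doesn't reuse one result
public class MyImpureFn implements DrillSimpleFunc {
  @Param  Float8Holder in;   // inputs and outputs must be *Holder types
  @Output Float8Holder out;

  public void setup() { }

  public void eval() {
    // Class references must be fully qualified; imports are not visible here.
    out.value = in.value + java.lang.Math.random();
  }
}
```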
[jira] [Commented] (DRILL-3517) Add a UDF development FAQ
[ https://issues.apache.org/jira/browse/DRILL-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634309#comment-14634309 ] Ted Dunning commented on DRILL-3517: 8) My UDF is not a pure function so calling it with the same arguments will result in different values each time. Drill is only using the result of calling my function once. How can I fix that? > Add a UDF development FAQ > - > > Key: DRILL-3517 > URL: https://issues.apache.org/jira/browse/DRILL-3517 > Project: Apache Drill > Issue Type: Sub-task > Components: Documentation >Reporter: Jacques Nadeau >Assignee: Bridget Bevens > > Lets create a UDF FAQ of common issues, log entries, etc so that people know > what to do when they hit certain issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-3520) Provide better logging around UDF loading and module loading
[ https://issues.apache.org/jira/browse/DRILL-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634267#comment-14634267 ] Ted Dunning commented on DRILL-3520: I think that the function logging should become much more extensive if there are any functions that fail to load. A list of jars that had failed functions, a list that were loaded without error and so on would be very helpful, as would a dump of the classpath. The volume of logging isn't a big deal since this is a one-off that only occurs when things are borked. > Provide better logging around UDF loading and module loading > > > Key: DRILL-3520 > URL: https://issues.apache.org/jira/browse/DRILL-3520 > Project: Apache Drill > Issue Type: Sub-task >Reporter: Jacques Nadeau > > When adding an extension to Drill, sometimes it is hard to know what is going > on. We should: > - improve logging so that we report at INFO level information about all the > Jar files that were included in Drill's consideration set (marked) and those > that are not. (and include this debug analysis in the documentation) > - If Drill fails to load a function, register an error function so that > trying to invoke the UDF will provide the user with information about why the > function failed to load. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-3518) Do a better job of providing conceptual overview to UDF creation
[ https://issues.apache.org/jira/browse/DRILL-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634262#comment-14634262 ] Ted Dunning commented on DRILL-3518: Pitfalls I have fallen into: 1) All data that a UDF uses must be annotated and must be a type that the annotation accepts 2) All class references must be fully qualified 3) It is super-easy to make a UDF that doesn't actually load and it is hard to see why (at first) 4) UDAF's can't have complex @Workspace variables because there seems to be no way to allocate even a Repeated* value, much less to have a ComplexWriter in the @Workspace 5) The annotated input, workspace and output variables have life-cycles that aren't apparent from the lexical structure of a UDAF. The fact that at add time the add() method can't see the output and that at output time both workspace and output variables are visible is confusing. 6) figuring out the maven-fu to create acceptable jars takes quite a while > Do a better job of providing conceptual overview to UDF creation > > > Key: DRILL-3518 > URL: https://issues.apache.org/jira/browse/DRILL-3518 > Project: Apache Drill > Issue Type: Sub-task > Components: Documentation >Reporter: Jacques Nadeau >Assignee: Bridget Bevens > > Since UDFs are effectively written in Java, people find it confusing when > some Java features aren't supported. Let's try to do a better job of > outlining the pitfalls. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-3517) Add a UDF development FAQ
[ https://issues.apache.org/jira/browse/DRILL-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634258#comment-14634258 ] Ted Dunning commented on DRILL-3517: More in the line of FAQ's: 1) my UDF didn't get loaded. What happened? - see the log - you probably have a subtle error related to the lexical environment 2) my UDF got loaded, but didn't get used. What happened? - you may not have the right types/may need more type variants 3) Is there an example of a simple transforming UDF? 4) Is there a sample of a UDAF? 5) How can I use my own types as temporary data in my UDAF? 6) I defined fields in my class that implements a UDF, but I get a compile error that says that they are undefined. Heh? 7) How can I see the generated code that includes my UDF? > Add a UDF development FAQ > - > > Key: DRILL-3517 > URL: https://issues.apache.org/jira/browse/DRILL-3517 > Project: Apache Drill > Issue Type: Sub-task > Components: Documentation >Reporter: Jacques Nadeau >Assignee: Bridget Bevens > > Lets create a UDF FAQ of common issues, log entries, etc so that people know > what to do when they hit certain issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-3517) Add a UDF development FAQ
[ https://issues.apache.org/jira/browse/DRILL-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634253#comment-14634253 ] Ted Dunning commented on DRILL-3517: Let's start in these comments: 1) Drill UDF's are not run in the lexical environment that you might think 1.a) imports will just confuse you 1.b) all class references need to be fully qualified 1.c) fields that you define will get left behind 2) Inputs and outputs have to be in *Holder types 3) complex outputs can be created using the ComplexWriter 4) you need to build both source and binary jars. Examples available Not sure if these are FAQ format anymore. > Add a UDF development FAQ > - > > Key: DRILL-3517 > URL: https://issues.apache.org/jira/browse/DRILL-3517 > Project: Apache Drill > Issue Type: Sub-task > Components: Documentation >Reporter: Jacques Nadeau >Assignee: Bridget Bevens > > Lets create a UDF FAQ of common issues, log entries, etc so that people know > what to do when they hit certain issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-3516) UDF documentation doesn't get people where they need to be
Ted Dunning created DRILL-3516: -- Summary: UDF documentation doesn't get people where they need to be Key: DRILL-3516 URL: https://issues.apache.org/jira/browse/DRILL-3516 Project: Apache Drill Issue Type: Bug Reporter: Ted Dunning The UDF documentation on the web site rooted at http://drill.apache.org/docs/develop-custom-functions/ does not describe the high-level process for how UDF's are used, nor does it describe why simple things like toString can't work on *Holder data structures. This leads to huge confusion and frustration on the part of potential contributors. Here are some pertinent threads: http://mail-archives.apache.org/mod_mbox/drill-user/201507.mbox/%3CCACAwhF%3DBxs-bXNdrm0pNJ4e8hZiaueqtZMhJ%3DRiBpf%3Dt%3DzEOWA%40mail.gmail.com%3E http://mail-archives.apache.org/mod_mbox/drill-user/201507.mbox/%3CCACAwhFmwWkP6udc05UEGFTzpEsaRvAxSRKW%2B2Mg-ijYX8QoQxQ%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DRILL-3461) Need to add javadocs to class where they are missing
[ https://issues.apache.org/jira/browse/DRILL-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated DRILL-3461: --- Attachment: (was: no-javadoc-no-comments.txt) > Need to add javadocs to class where they are missing > > > Key: DRILL-3461 > URL: https://issues.apache.org/jira/browse/DRILL-3461 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning > Attachments: no-javadocs-templates.txt, no-javadocs.txt, > no-javadocs.txt > > > 1220 classes in Drill have no Javadocs whatsoever. I will attach a detailed > list. > Some kind of expression of intent and basic place in the architecture should > be included in all classes. > The good news is that at least there are 1838 (1868 in 1.1.0 branch) classes > that have at least some kind of javadocs. > I would be happy to help write comments, but I can't figure out what these > classes do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DRILL-3461) Need to add javadocs to class where they are missing
[ https://issues.apache.org/jira/browse/DRILL-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated DRILL-3461: --- Attachment: (was: no-comments.txt) > Need to add javadocs to class where they are missing > > > Key: DRILL-3461 > URL: https://issues.apache.org/jira/browse/DRILL-3461 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning > Attachments: no-javadocs-templates.txt, no-javadocs.txt, > no-javadocs.txt > > > 1220 classes in Drill have no Javadocs whatsoever. I will attach a detailed > list. > Some kind of expression of intent and basic place in the architecture should > be included in all classes. > The good news is that at least there are 1838 (1868 in 1.1.0 branch) classes > that have at least some kind of javadocs. > I would be happy to help write comments, but I can't figure out what these > classes do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DRILL-3461) Need to add javadocs to class where they are missing
[ https://issues.apache.org/jira/browse/DRILL-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated DRILL-3461: --- Attachment: no-javadocs-templates.txt no-javadocs.txt Updated lists of files. Current count is 1239 Java files with no javadoc and 67 templates. The previous count was possibly distorted by generated files and by counting the Apache license header as a comment. > Need to add javadocs to class where they are missing > > > Key: DRILL-3461 > URL: https://issues.apache.org/jira/browse/DRILL-3461 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning > Attachments: no-comments.txt, no-javadoc-no-comments.txt, > no-javadocs-templates.txt, no-javadocs.txt, no-javadocs.txt > > > 1220 classes in Drill have no Javadocs whatsoever. I will attach a detailed > list. > Some kind of expression of intent and basic place in the architecture should > be included in all classes. > The good news is that at least there are 1838 (1868 in 1.1.0 branch) classes > that have at least some kind of javadocs. > I would be happy to help write comments, but I can't figure out what these > classes do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-3462) There appears to be no way to have complex intermediate state
Ted Dunning created DRILL-3462: -- Summary: There appears to be no way to have complex intermediate state Key: DRILL-3462 URL: https://issues.apache.org/jira/browse/DRILL-3462 Project: Apache Drill Issue Type: Bug Reporter: Ted Dunning After spending several frustrating days on the problem (see also DRILL-3461), it appears that there is no viable idiom for building an aggregator that has internal state that is anything more than a scalar. What is needed is: 1) The ability to allocate a Repeated* type for use in Workspace variables. Currently, new works to get the basic structure, but there is no good way to allocate the corresponding vector. 2) The ability to use and to allocate a ComplexWriter in the Workspace variables. 3) The ability to write a UDAF that supports multi-phase aggregation. It would be just fine if I simply had to write a combine method on my UDAF class. I don't think that there is any way to infer such a combiner from the parameters and workspace variables. An alternative API would be to have a form of the output function that is given an Iterable, but that is probably much less efficient than simply having a combine method that is called repeatedly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
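The multi-phase point in item 3 is easy to illustrate outside Drill. A plain-Java sketch (not Drill API, all names here are hypothetical) of partial aggregates for an average, with the kind of combine method the ticket asks for:

```java
// Plain-Java sketch of multi-phase aggregation state: each partition builds a
// partial (sum, count), and combine() merges partials before the final output.
public class CombineSketch {
    public static final class AvgState {
        public double sum;
        public long count;

        public void add(double v) { sum += v; count++; }

        // The combine step a multi-phase UDAF would need: fold another
        // partition's partial state into this one.
        public void combine(AvgState other) {
            sum += other.sum;
            count += other.count;
        }

        public double output() { return sum / count; }
    }

    public static void main(String[] args) {
        AvgState p1 = new AvgState();
        p1.add(1.0); p1.add(2.0);          // partition 1 sees {1, 2}
        AvgState p2 = new AvgState();
        p2.add(3.0); p2.add(4.0);          // partition 2 sees {3, 4}
        p1.combine(p2);                    // merge the partials
        System.out.println(p1.output());   // 2.5
    }
}
```

The Iterable alternative mentioned above would have to buffer every partial; a pairwise combine like this can run incrementally as partials arrive.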
[jira] [Updated] (DRILL-3461) Need to meet basic coding standards
[ https://issues.apache.org/jira/browse/DRILL-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated DRILL-3461: --- Attachment: no-javadoc-no-comments.txt no-comments.txt Here are other views of the situation. A quick summary is that only one file that has no javadoc has any // comments. > Need to meet basic coding standards > --- > > Key: DRILL-3461 > URL: https://issues.apache.org/jira/browse/DRILL-3461 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning > Attachments: no-comments.txt, no-javadoc-no-comments.txt, > no-javadocs.txt > > > 1220 classes in Drill have no Javadocs whatsoever. I will attach a detailed > list. > Some kind of expression of intent and basic place in the architecture should > be included in all classes. > The good news is that at least there are 1838 (1868 in 1.1.0 branch) classes > that have at least some kind of javadocs. > I would be happy to help write comments, but I can't figure out what these > classes do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DRILL-3461) Need to meet basic coding standards
[ https://issues.apache.org/jira/browse/DRILL-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated DRILL-3461: --- Attachment: no-javadocs.txt This is a list of the 1220 classes with no javadocs > Need to meet basic coding standards > --- > > Key: DRILL-3461 > URL: https://issues.apache.org/jira/browse/DRILL-3461 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning > Attachments: no-javadocs.txt > > > 1220 classes in Drill have no Javadocs whatsoever. I will attach a detailed > list. > Some kind of expression of intent and basic place in the architecture should > be included in all classes. > The good news is that at least there are 1838 (1868 in 1.1.0 branch) classes > that have at least some kind of javadocs. > I would be happy to help write comments, but I can't figure out what these > classes do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-3461) Need to meet basic coding standards
Ted Dunning created DRILL-3461: -- Summary: Need to meet basic coding standards Key: DRILL-3461 URL: https://issues.apache.org/jira/browse/DRILL-3461 Project: Apache Drill Issue Type: Bug Reporter: Ted Dunning 1220 classes in Drill have no Javadocs whatsoever. I will attach a detailed list. Some kind of expression of intent and basic place in the architecture should be included in all classes. The good news is that at least there are 1838 (1868 in 1.1.0 branch) classes that have at least some kind of javadocs. I would be happy to help write comments, but I can't figure out what these classes do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-3444) Implement Is Not Null/Is Null on List of objects - [isnotnull(MAP-REPEATED)] error
[ https://issues.apache.org/jira/browse/DRILL-3444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611048#comment-14611048 ] Ted Dunning commented on DRILL-3444: I talked to Tug about this today and we walked through the code looking at how to fix this. The key gap is the missing IsNull operator. Tug started trying to figure out how to write such an operator. Right now we have a bunch of template-generated operators for all of the specific scalar types and also the uniform list types. What we don't have is a null operator for general lists. Can somebody point at how such an operator ought to be implemented? > Implement Is Not Null/Is Null on List of objects - [isnotnull(MAP-REPEATED)] > error > --- > > Key: DRILL-3444 > URL: https://issues.apache.org/jira/browse/DRILL-3444 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types, Functions - Drill > Environment: Drill 1.0 >Reporter: Tugdual Grall >Assignee: Daniel Barclay (Drill) >Priority: Critical > > It is not possible to use the IS NULL / IS NOT NULL operator on an attribute > that contains a list of "object". (it is working with a list of scalar types) > Query: > {code} > select * > from dfs.`/working/json_array/*.json` p > where p.tags IS NOT NULL > {code} > Document: > {code} > { > "name" : "PPRODUCT_002", > "price" : 200.00, > "tags" : [ { "type" : "sports" } , { "type" : "ocean" }] > } > {code} > Error: > {code} > org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: > org.apache.drill.exec.exception.SchemaChangeException: Failure while trying > to materialize incoming schema. Errors: Error in expression at index -1. > Error: Missing function implementation: [isnotnull(MAP-REPEATED)]. Full > expression: --UNKNOWN EXPRESSION--.. 
Fragment 0:0 [Error Id: > 384e6b86-ce17-4eb9-b5eb-27870a341c90 on 192.168.99.13:31010] > {code} > Workaround: > By using a sub element it is working, for example: > {code} > select * > from dfs.`/Users/tgrall/working/json_array/*.json` p > where p.tags.type IS NULL > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-3226) Upper and lower casing doesn't work correctly on non-ASCII characters
Ted Dunning created DRILL-3226: -- Summary: Upper and lower casing doesn't work correctly on non-ASCII characters Key: DRILL-3226 URL: https://issues.apache.org/jira/browse/DRILL-3226 Project: Apache Drill Issue Type: Bug Reporter: Ted Dunning {code} 0: jdbc:drill:zk=local> select z, lower(z), upper(z) from dfs.root.`/Users/tdunning/tmp/data.json`; +--+-+-+ | z | EXPR$1 | EXPR$2 | +--+-+-+ | åäö | åäö | åäö | | aBc | abc | ABC | +--+-+-+ {code} Expected result would be {code} +--+-+-+ | z | EXPR$1 | EXPR$2 | +--+-+-+ | åäö | åäö | ÅÄÖ | | aBc | abc | ABC | +--+-+-+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
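The expected output in the report matches what Java's own Unicode case mapping produces, which can be checked in plain Java (using Locale.ROOT to keep the mapping locale-independent):

```java
import java.util.Locale;

// Unicode case mapping in plain Java handles å/ä/ö correctly; Locale.ROOT
// avoids locale-specific surprises (e.g. the Turkish dotless i).
public class CaseDemo {
    public static void main(String[] args) {
        System.out.println("åäö".toUpperCase(Locale.ROOT));  // ÅÄÖ
        System.out.println("åäö".toLowerCase(Locale.ROOT));  // åäö
        System.out.println("aBc".toUpperCase(Locale.ROOT));  // ABC
    }
}
```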
[jira] [Created] (DRILL-3222) Need a zip function to combine coordinated lists
Ted Dunning created DRILL-3222: -- Summary: Need a zip function to combine coordinated lists Key: DRILL-3222 URL: https://issues.apache.org/jira/browse/DRILL-3222 Project: Apache Drill Issue Type: Bug Reporter: Ted Dunning It is often very useful to be able to turn a pair (or more) of lists into a single list of pairs. Thus zip([a,b], [1,2]) => [[a,1], [b,2]]. The handling of short lists, more than two lists and so on is TBD, but the base function is an important one. One use case is in time series where storing times as one list and values as another is very handy but processing these results would be much better done by using flatten(zip(times, values)). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
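The semantics in the ticket can be sketched in plain Java (not Drill code; the class name is hypothetical). Short lists are truncated to the shorter input here, which is one possible answer to the TBD:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the proposed semantics: zip([a,b], [1,2]) => [[a,1], [b,2]].
public class ZipSketch {
    public static List<List<Object>> zip(List<?> left, List<?> right) {
        int n = Math.min(left.size(), right.size());  // truncate to the shorter list
        List<List<Object>> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(Arrays.<Object>asList(left.get(i), right.get(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(zip(Arrays.asList("a", "b"), Arrays.asList(1, 2)));
        // [[a, 1], [b, 2]]
    }
}
```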
[jira] [Created] (DRILL-3164) Compilation fails with Java 8
Ted Dunning created DRILL-3164: -- Summary: Compilation fails with Java 8 Key: DRILL-3164 URL: https://issues.apache.org/jira/browse/DRILL-3164 Project: Apache Drill Issue Type: Bug Reporter: Ted Dunning I just got this: {code} ted:drill[1.0.0*]$ mvn package -DskipTests ... Detected JDK Version: 1.8.0-40 is not in the allowed range [1.7,1.8). ... {code} Clearly there is an overly restrictive pattern at work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
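The message comes from a Maven enforcer-style requireJavaVersion check whose upper bound excludes Java 8. A hypothetical pom.xml fragment showing the kind of fix (the rule's exact location in Drill's build is an assumption):

```xml
<!-- [1.7,1.8) rejects JDK 1.8.0-40; widening the upper bound admits Java 8. -->
<requireJavaVersion>
  <version>[1.7,1.9)</version>
</requireJavaVersion>
```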
[jira] [Commented] (DRILL-2620) Casting to float is changing the value slightly
[ https://issues.apache.org/jira/browse/DRILL-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386941#comment-14386941 ] Ted Dunning commented on DRILL-2620: What did you expect to see? In SQL the default precision of a FLOAT is implementation defined. I strongly suspect that in Drill the default is 24 (i.e. single precision). If you care (and you seem to), you might be better served by specifying DOUBLE as the type or FLOAT(53). Single precision floating point (aka float) only provides 6 digits of precision. You, as the lucky person you are, got 7. http://en.wikipedia.org/wiki/Single-precision_floating-point_format > Casting to float is changing the value slightly > --- > > Key: DRILL-2620 > URL: https://issues.apache.org/jira/browse/DRILL-2620 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Data Types >Reporter: Rahul Challapalli >Assignee: Daniel Barclay (Drill) > > git.commit.id.abbrev=c11fcf7 > Data Set : > {code} > 2345552345.5342 > 4784.5735 > {code} > Drill Query : > {code} > select cast(columns[0] as float) from `abc.tbl`; > ++ > | EXPR$0 | > ++ > | 2.34555238E9 | > | 4784.5737 | > ++ > {code} > I am not sure whether this is a known limitation or a bug -- This message was sent by Atlassian JIRA (v6.3.4#6332)
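Assuming Drill's FLOAT maps to a 32-bit Java float (consistent with the reported output), the rounding can be reproduced in plain Java:

```java
// A float has a 24-bit significand (~6-7 decimal digits). Near 2.3e9, adjacent
// floats are 256 apart, so the fractional part cannot survive the cast.
public class FloatPrecision {
    public static void main(String[] args) {
        // Print via double, which shows the stored float value exactly:
        System.out.println((double) (float) 4784.5735);        // 4784.57373046875
        System.out.println((double) (float) 2345552345.5342);  // 2.345552384E9
    }
}
```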
[jira] [Commented] (DRILL-2521) Revert from protobuf 2.6 to 2.5
[ https://issues.apache.org/jira/browse/DRILL-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376514#comment-14376514 ] Ted Dunning commented on DRILL-2521: Any reason for this? > Revert from protobuf 2.6 to 2.5 > --- > > Key: DRILL-2521 > URL: https://issues.apache.org/jira/browse/DRILL-2521 > Project: Apache Drill > Issue Type: Bug > Components: Execution - RPC >Reporter: Jacques Nadeau >Assignee: Jacques Nadeau > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-1918) Drill does very obscure things when a JRE is used instead of a JDK.
[ https://issues.apache.org/jira/browse/DRILL-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262674#comment-14262674 ] Ted Dunning commented on DRILL-1918: oops. Excess of zeal in creating JIRA's > Drill does very obscure things when a JRE is used instead of a JDK. > --- > > Key: DRILL-1918 > URL: https://issues.apache.org/jira/browse/DRILL-1918 > Project: Apache Drill > Issue Type: Bug >Reporter: Ted Dunning > > In > http://answers.mapr.com/questions/161911/apache-drill-07-errors.html#comment-161933 > a user describes the consequences of running Drill with a JRE instead of a > JDK. > Surely this could be detected and this confusion could be avoided. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-1918) Drill does very obscure things when a JRE is used instead of a JDK.
Ted Dunning created DRILL-1918: -- Summary: Drill does very obscure things when a JRE is used instead of a JDK. Key: DRILL-1918 URL: https://issues.apache.org/jira/browse/DRILL-1918 Project: Apache Drill Issue Type: Bug Reporter: Ted Dunning In http://answers.mapr.com/questions/161911/apache-drill-07-errors.html#comment-161933 a user describes the consequences of running Drill with a JRE instead of a JDK. Surely this could be detected and this confusion could be avoided. -- This message was sent by Atlassian JIRA (v6.3.4#6332)