[jira] [Created] (DRILL-6034) repeated_contains returns a count, not a boolean, subject to overflow

2017-12-15 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6034:
--

 Summary: repeated_contains returns a count, not a boolean, subject 
to overflow
 Key: DRILL-6034
 URL: https://issues.apache.org/jira/browse/DRILL-6034
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Paul Rogers


Consider the existing Drill unit tests 
{{testJsonReader.testRepeatedContains()}}. Consider the following query:

{code}
select repeated_contains(str_list, 'asdf') from 
cp.`store/json/json_basic_repeated_varchar.json`
{code}

According to the 
[documentation|http://drill.apache.org/docs/repeated-contains/]:

bq. REPEATED_CONTAINS returns true if Drill finds a match; otherwise, the 
function returns false.

Run the above query and print the results:

{noformat}
select repeated_contains(str_list, 'asdf') from 
cp.`store/json/json_basic_repeated_varchar.json`
#: EXPR$0
0: 5
1: 0
2: 0
3: 0
{noformat}

Note that the first row has a value of 5 which is *not* a Boolean. Drill has no 
Boolean type and instead uses the traditional encoding to integers: {{TRUE}} = 
1, {{FALSE}} = 0. A value of 5 is not a valid Boolean value. It may be that the 
following expression will fail:

{code}
SELECT * FROM cp.`store/json/json_basic_repeated_varchar.json`
  WHERE repeated_contains(str_list, 'asdf') = TRUE
{code}

The schema of the returned count value uses the Drill {{BIT}} type. For various 
historical reasons, Drill implements {{BIT}} as "UInt1" -- an unsigned 8 bit 
integer.

Further, since the function seems to return a count, it is subject to overflow 
if the count is 256, 512 or any multiple o 256. That is, if a list has 256 
occurrences of the pattern, {{repeated_contains}} will return 256 modulo 256 = 
0, which is the equivalent of SQL {{FALSE}}.

The recommendation is that the function be modified to return either 1 or 0. If 
there is a reason to have a count, use the existing {{repeated_count}} function.

Note that the "test" never caught this because it simply ran the query, but did 
not verify results:

{code}
  @Test
  public void testRepeatedContains() throws Exception {
test("select repeated_contains(str_list, 'asdf') from 
cp.`store/json/json_basic_repeated_varchar.json`");
...
{code}

The issue was revealed when adding verification. For now, the new test verifies 
the incorrect results; it should be modified to match the documented results 
if/when the code is updated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: REPEATED_CONTAINS

2016-03-30 Thread Jean-Claude Cote
Hey Jacques,

I have done some rudimentary tests and realized my use case is a bit more
complicated then stated above.

{"name":"classic","fillings":[ {"name":"sugar","cal":500} ,
{"name":"flour","cal":300} ] }

SELECT a,b,c,d,e,f,g, max(suggar_val) FROM (
SELECT a,b,c,d,e,f,g, case when fill['name'] = 'sugar' then fill['cal']
else null as suggar_val FROM (
SELECT a,b,c,d,e,f,g, FLATTEN(t.fillings) AS fill FROM
dfs.flatten.`test.json` t
)
)
group by a,b,c,d,e,f,g

Usng this technique I can "pull" the values that interest me out of the
repeated field which results in an easy to work with schema
a,b,c,d,e,f,g,sugar,flour,butter

However as you pointed out colapsing those rows back from the flattened
might be costly. For example the group by will need to check that all the
a,b,c,d,e,f,g fields are the same. Some of these fields are quite large
VARCHARs so it would need to check every byte of the VARCHAR?

I've done some testing with the above technique and performance was so so.
I'll also try my hands at using the FieldReader so I have something to
compare with.

Cheers


On Wed, 30 Mar 2016 at 00:23 Jacques Nadeau  wrote:

> I think the best answer is to test it and share your findings.
> Hypothesizing about performance in complicated systems is also suspect :)
>
> That said, I'll make a guess...
>
> In general, I would expect the flatten to be faster in your example since a
> flatten without a cartesian is trivial operation and can be done in
> vectorized fashion because of the shape of how data is held in memory. This
> is different than how complex UDFs are written today (using the FieldReader
> model). These UDFs are object-based execution, record by record. So,
> vectorized and full runtime code generation
>
> That being said, if you changed your code to be something more like [select
> a,b,c,d,e,f,g, flatten(t.fillings) as fill], you might see the two be
> closer together. This is because this would then require a cartesian copy
> of all the fields abcdefg, which then have to be filtered out. In this
> case, the extra cost of the copies might be more expensive than the object
> overhead required for traversing the complex object structure.
>
> In general, start with the methodology that works. If we don't see the
> performance to satisfy your usecase, we can see if we can suggest some
> things. (For example, supporting operation pushdowns that push through
> FLATTEN would probably be very helpful.)
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Mar 29, 2016 at 6:37 PM, Jean-Claude Cote 
> wrote:
>
> > I've noticed drill offers a REPEATED_CONTAINS which can be applied to
> > fields which are arrays.
> >
> > https://drill.apache.org/docs/repeated-contains/
> >
> > I have a schema stored in parquet files which contain a repeated field
> > containing a key and a value. However such structures can't be queried
> > using the REPEATED_CONTAINS. I was thinking of writing a user defined
> > function to look through it.
> >
> > My question is: is it worth it? Will it be faster than doing this?
> >
> > {"name":"classic","fillings":[ {"name":"sugar","cal":500} ,
> > {"name":"flour","cal":300} ] }
> >
> > SELECT flat.fill FROM (SELECT FLATTEN(t.fillings) AS fill FROM
> > dfs.flatten.`test.json` t) flat WHERE flat.fill.name like 'sug%';
> >
> > Specifically what's the cost of using FLATTEN compared to iterating over
> > the array right in a UDF?
> >
> > Thanks
> > Jean-Claude
> >
>


Re: REPEATED_CONTAINS

2016-03-29 Thread Jacques Nadeau
I think the best answer is to test it and share your findings.
Hypothesizing about performance in complicated systems is also suspect :)

That said, I'll make a guess...

In general, I would expect the flatten to be faster in your example since a
flatten without a cartesian is trivial operation and can be done in
vectorized fashion because of the shape of how data is held in memory. This
is different than how complex UDFs are written today (using the FieldReader
model). These UDFs are object-based execution, record by record. So,
vectorized and full runtime code generation

That being said, if you changed your code to be something more like [select
a,b,c,d,e,f,g, flatten(t.fillings) as fill], you might see the two be
closer together. This is because this would then require a cartesian copy
of all the fields abcdefg, which then have to be filtered out. In this
case, the extra cost of the copies might be more expensive than the object
overhead required for traversing the complex object structure.

In general, start with the methodology that works. If we don't see the
performance to satisfy your usecase, we can see if we can suggest some
things. (For example, supporting operation pushdowns that push through
FLATTEN would probably be very helpful.)



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Mar 29, 2016 at 6:37 PM, Jean-Claude Cote  wrote:

> I've noticed drill offers a REPEATED_CONTAINS which can be applied to
> fields which are arrays.
>
> https://drill.apache.org/docs/repeated-contains/
>
> I have a schema stored in parquet files which contain a repeated field
> containing a key and a value. However such structures can't be queried
> using the REPEATED_CONTAINS. I was thinking of writing a user defined
> function to look through it.
>
> My question is: is it worth it? Will it be faster than doing this?
>
> {"name":"classic","fillings":[ {"name":"sugar","cal":500} ,
> {"name":"flour","cal":300} ] }
>
> SELECT flat.fill FROM (SELECT FLATTEN(t.fillings) AS fill FROM
> dfs.flatten.`test.json` t) flat WHERE flat.fill.name like 'sug%';
>
> Specifically what's the cost of using FLATTEN compared to iterating over
> the array right in a UDF?
>
> Thanks
> Jean-Claude
>


REPEATED_CONTAINS

2016-03-29 Thread Jean-Claude Cote
I've noticed drill offers a REPEATED_CONTAINS which can be applied to
fields which are arrays.

https://drill.apache.org/docs/repeated-contains/

I have a schema stored in parquet files which contain a repeated field
containing a key and a value. However such structures can't be queried
using the REPEATED_CONTAINS. I was thinking of writing a user defined
function to look through it.

My question is: is it worth it? Will it be faster than doing this?

{"name":"classic","fillings":[ {"name":"sugar","cal":500} ,
{"name":"flour","cal":300} ] }

SELECT flat.fill FROM (SELECT FLATTEN(t.fillings) AS fill FROM
dfs.flatten.`test.json` t) flat WHERE flat.fill.name like 'sug%';

Specifically what's the cost of using FLATTEN compared to iterating over
the array right in a UDF?

Thanks
Jean-Claude


[jira] [Resolved] (DRILL-1757) Add support for wildcards within REPEATED_CONTAINS

2015-03-18 Thread Mehant Baid (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mehant Baid resolved DRILL-1757.

Resolution: Fixed

Fixed in bfbc0e01423ab85f362013a03ef0f00e95cd721b

> Add support for wildcards within REPEATED_CONTAINS
> --
>
> Key: DRILL-1757
> URL: https://issues.apache.org/jira/browse/DRILL-1757
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Functions - Drill
>Affects Versions: 0.6.0
> Environment: 0.7.0 on MacOSX, embedded mode
>Reporter: Bob Rumsby
>Assignee: Mehant Baid
> Fix For: Future
>
> Attachments: DRILL-1757.patch
>
>
> REPEATED_CONTAINS does not seem to support the % or * wildcard. If this is 
> expected behavior (I assume not), it needs to be documented.
> 0: jdbc:drill:zk=local> select name, repeated_contains(categories, 'Doctors') 
> from dfs.yelp.`yelp_academic_dataset_business.json` limit 1;
> +++
> |name|   EXPR$1   |
> +++
> | Eric Goldberg, MD | true   |
> +++
> 1 row selected (0.087 seconds)
> 0: jdbc:drill:zk=local> select name, repeated_contains(categories, 'Doc%') 
> from dfs.yelp.`yelp_academic_dataset_business.json` limit 1;
> +++
> |name|   EXPR$1   |
> +++
> | Eric Goldberg, MD | false  |
> +++
> 1 row selected (0.093 seconds)
> 0: jdbc:drill:zk=local> select name, repeated_contains(categories, 'Doc*') 
> from dfs.yelp.`yelp_academic_dataset_business.json` limit 1;
> +++
> |name|   EXPR$1   |
> +++
> | Eric Goldberg, MD | false  |
> +++
> 1 row selected (0.09 seconds)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 29666: DRILL-1757 Add support for wildcards within REPEATED_CONTAINS

2015-02-28 Thread Mehant Baid

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29666/#review74686
---

Ship it!



exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SimpleRepeatedFunctions.java
<https://reviews.apache.org/r/29666/#comment121356>

I think it would be better if you use matches() instead of find(). Since 
you are invoking reset() above it will get the same result however invoking 
matches() explicitly states the intention of looking for the pattern in the 
entire string input.


Looks ok, just one very minor comment.

- Mehant Baid


On Feb. 19, 2015, 8:17 p.m., Venkata krishnan Sowrirajan wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29666/
> ---
> 
> (Updated Feb. 19, 2015, 8:17 p.m.)
> 
> 
> Review request for drill and Mehant Baid.
> 
> 
> Repository: drill-git
> 
> 
> Description
> ---
> 
> Support for wildcards with in repeated_contains() is implemented similar to 
> MS SQL contains(). MS SQL contains() documentation 
> http://msdn.microsoft.com/en-us/library/ms187787.aspx 
> 
> 
> Diffs
> -
> 
>   
> exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SimpleRepeatedFunctions.java
>  f7f0e91 
>   
> exec/java-exec/src/test/java/org/apache/drill/exec/fn/impl/TestNewSimpleRepeatedFunctions.java
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/29666/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Venkata krishnan Sowrirajan
> 
>



Re: Review Request 29666: DRILL-1757 Add support for wildcards within REPEATED_CONTAINS

2015-02-19 Thread Venkata krishnan Sowrirajan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29666/
---

(Updated Feb. 19, 2015, 8:17 p.m.)


Review request for drill and Mehant Baid.


Changes
---

Last patch is not completely viewable. So added a new working one.


Repository: drill-git


Description
---

Support for wildcards with in repeated_contains() is implemented similar to MS 
SQL contains(). MS SQL contains() documentation 
http://msdn.microsoft.com/en-us/library/ms187787.aspx 


Diffs (updated)
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SimpleRepeatedFunctions.java
 f7f0e91 
  
exec/java-exec/src/test/java/org/apache/drill/exec/fn/impl/TestNewSimpleRepeatedFunctions.java
 PRE-CREATION 

Diff: https://reviews.apache.org/r/29666/diff/


Testing
---


Thanks,

Venkata krishnan Sowrirajan



Re: Review Request 29666: DRILL-1757 Add support for wildcards within REPEATED_CONTAINS

2015-02-19 Thread Venkata krishnan Sowrirajan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29666/
---

(Updated Feb. 19, 2015, 8:13 p.m.)


Review request for drill and Mehant Baid.


Changes
---

Added new couple test cases for the changes


Repository: drill-git


Description
---

Support for wildcards with in repeated_contains() is implemented similar to MS 
SQL contains(). MS SQL contains() documentation 
http://msdn.microsoft.com/en-us/library/ms187787.aspx 


Diffs (updated)
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SimpleRepeatedFunctions.java
 f7f0e91 
  
exec/java-exec/src/test/java/org/apache/drill/exec/fn/impl/TestNewSimpleRepeatedFunctions.java
 PRE-CREATION 

Diff: https://reviews.apache.org/r/29666/diff/


Testing
---


Thanks,

Venkata krishnan Sowrirajan



Re: Review Request 29666: DRILL-1757 Add support for wildcards within REPEATED_CONTAINS

2015-02-10 Thread Mehant Baid

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29666/#review71766
---


Code changes look ok. Could you please add a couple of unit tests.

- Mehant Baid


On Jan. 7, 2015, 7:43 p.m., Venkata krishnan Sowrirajan wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29666/
> ---
> 
> (Updated Jan. 7, 2015, 7:43 p.m.)
> 
> 
> Review request for drill and Mehant Baid.
> 
> 
> Repository: drill-git
> 
> 
> Description
> ---
> 
> Support for wildcards with in repeated_contains() is implemented similar to 
> MS SQL contains(). MS SQL contains() documentation 
> http://msdn.microsoft.com/en-us/library/ms187787.aspx 
> 
> 
> Diffs
> -
> 
>   
> exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SimpleRepeatedFunctions.java
>  f7f0e91 
> 
> Diff: https://reviews.apache.org/r/29666/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Venkata krishnan Sowrirajan
> 
>



[jira] [Resolved] (DRILL-1727) REPEATED_CONTAINS sometimes doesn't work

2015-01-21 Thread Mehant Baid (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mehant Baid resolved DRILL-1727.

Resolution: Fixed

> REPEATED_CONTAINS sometimes doesn't work
> 
>
> Key: DRILL-1727
> URL: https://issues.apache.org/jira/browse/DRILL-1727
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Tomer Shiran
>Assignee: Mehant Baid
> Fix For: 1.0.0
>
>
> Here are three queries. It seems that they should all work (the first is the 
> most intuitive one), but only one of them does.
> {code}
> 0: jdbc:drill:zk=localhost:2181> SELECT name FROM 
> dfs.root.`Users/tshiran/Development/demo/data/yelp/business.json` WHERE 
> REPEATED_CONTAINS(categories, 'Mediterranean') LIMIT 1;
> Query failed: Failure while running sql.
> Error: exception while executing query: Failure while executing query. 
> (state=,code=0)
> 0: jdbc:drill:zk=localhost:2181> SELECT name FROM 
> dfs.root.`Users/tshiran/Development/demo/data/yelp/business.json` WHERE true 
> and REPEATED_CONTAINS(categories, 'Mediterranean') LIMIT 1;
> ++
> |name|
> ++
> | People's Bakery |
> ++
> 1 row selected (0.15 seconds)
> 0: jdbc:drill:zk=localhost:2181> SELECT name FROM 
> dfs.root.`Users/tshiran/Development/demo/data/yelp/business.json` WHERE 1 and 
> REPEATED_CONTAINS(categories, 'Mediterranean') LIMIT 1;
> Query failed: Failure while running sql.
> Error: exception while executing query: Failure while executing query. 
> (state=,code=0)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Review Request 29666: DRILL-1757 Add support for wildcards within REPEATED_CONTAINS

2015-01-07 Thread Venkata krishnan Sowrirajan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29666/
---

Review request for drill and Mehant Baid.


Repository: drill-git


Description
---

Support for wildcards with in repeated_contains() is implemented similar to MS 
SQL contains(). MS SQL contains() documentation 
http://msdn.microsoft.com/en-us/library/ms187787.aspx 


Diffs
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SimpleRepeatedFunctions.java
 f7f0e91 

Diff: https://reviews.apache.org/r/29666/diff/


Testing
---


Thanks,

Venkata krishnan Sowrirajan