Review Request 49619: sorting of tuple array using multiple fields

2016-07-04 Thread Simanchal Das


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/49619/
---

Review request for hive and Carl Steinbach.


Repository: hive-git


Description
---

Problem Statement:

When working with complex data structures such as Avro, we often encounter
arrays that contain multiple tuples, each of which has a struct schema.

Suppose the struct schema is as follows:

{
  "name": "employee",
  "type": [{
    "type": "record",
    "name": "Employee",
    "namespace": "com.company.Employee",
    "fields": [{
      "name": "empId",
      "type": "int"
    }, {
      "name": "empName",
      "type": "string"
    }, {
      "name": "age",
      "type": "int"
    }, {
      "name": "salary",
      "type": "double"
    }]
  }]
}


When a Hive query runs over such data, the complex column appears as an array
of employee objects.
Example: 
// (array<struct<empId:int,empName:string,age:int,salary:double>>)

Array[Employee(100,Foo,20,20990),Employee(500,Boo,30,50990),Employee(700,Harry,25,40990),Employee(100,Tom,35,70990)]


In day-to-day business use cases we frequently need to sort such a tuple array
by one or more specific fields, such as empId or salary.


Proposal:

I have developed a UDF, 'sort_array_field', which sorts a tuple array by one
or more fields in natural order.

Example:
1. Select sort_array_field(array[struct(100,Foo,20,20990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"Salary");
output:
array[struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(500,Boo,30,50990),struct(100,Tom,35,70990)]

2. Select sort_array_field(array[struct(100,Foo,20,20990),struct(500,Boo,30,80990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"Name","Salary");
output:
array[struct(500,Boo,30,50990),struct(500,Boo,30,80990),struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)]

3. Select sort_array_field(array[struct(100,Foo,20,20990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"Name","Salary","Age");
output:
array[struct(500,Boo,30,50990),struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)]
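
The multi-field ordering in the examples above can be sketched in plain Java as
a chain of comparators. This is only an illustration of the idea and not the
code of GenericUDFSortArrayField: the class MultiFieldSortSketch, the Employee
holder, and the methods byField and sortBy are hypothetical names, and field
lookup by a case-insensitive name is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; this is NOT the code of GenericUDFSortArrayField.
final class MultiFieldSortSketch {

    // Hypothetical stand-in for one struct element of the array.
    static final class Employee {
        final int empId;
        final String empName;
        final int age;
        final double salary;

        Employee(int empId, String empName, int age, double salary) {
            this.empId = empId;
            this.empName = empName;
            this.age = age;
            this.salary = salary;
        }
    }

    // Returns a natural-order comparator for one field (name matched case-insensitively).
    static Comparator<Employee> byField(String field) {
        switch (field.toLowerCase(Locale.ROOT)) {
            case "empid":
                return Comparator.comparingInt((Employee e) -> e.empId);
            case "empname":
                return Comparator.comparing((Employee e) -> e.empName);
            case "age":
                return Comparator.comparingInt((Employee e) -> e.age);
            case "salary":
                return Comparator.comparingDouble((Employee e) -> e.salary);
            default:
                throw new IllegalArgumentException("Unknown field: " + field);
        }
    }

    // Sorts a copy of the input by the first field, breaking ties with each later field.
    static List<Employee> sortBy(List<Employee> input, String... fields) {
        if (fields.length == 0) {
            throw new IllegalArgumentException("At least one field is required");
        }
        Comparator<Employee> cmp = byField(fields[0]);
        for (int i = 1; i < fields.length; i++) {
            cmp = cmp.thenComparing(byField(fields[i]));
        }
        List<Employee> out = new ArrayList<>(input);
        out.sort(cmp); // List.sort is stable, so full ties keep their input order
        return out;
    }

    public static void main(String[] args) {
        List<Employee> input = Arrays.asList(
            new Employee(100, "Foo", 20, 20990),
            new Employee(500, "Boo", 30, 80990),
            new Employee(500, "Boo", 30, 50990),
            new Employee(700, "Harry", 25, 40990),
            new Employee(100, "Tom", 35, 70990));
        // Mirrors example 2: order by name, then by salary within equal names.
        for (Employee e : sortBy(input, "empName", "salary")) {
            System.out.println(e.empName + " " + e.salary);
        }
    }
}
```

Later fields matter only when all earlier fields compare equal, which is why
the two "Boo" rows in example 2 end up ordered by salary.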


Diffs
-

  itests/src/test/resources/testconfiguration.properties 1ab914d
  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 2f4a94c
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSortArrayField.java PRE-CREATION
  ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFSortArrayField.java PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_field_wrong1.q PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_field_wrong2.q PRE-CREATION
  ql/src/test/queries/clientpositive/udf_sort_array_field.q PRE-CREATION
  ql/src/test/results/clientnegative/udf_sort_array_field_wrong1.q.out PRE-CREATION
  ql/src/test/results/clientnegative/udf_sort_array_field_wrong2.q.out PRE-CREATION
  ql/src/test/results/clientpositive/udf_sort_array_field.q.out PRE-CREATION

Diff: https://reviews.apache.org/r/49619/diff/


Testing
---

JUnit test cases and query .q files are attached.


Thanks,

Simanchal Das

Re: Review Request 49619: sorting of tuple array using multiple fields

2016-07-06 Thread Simanchal Das



(Updated July 7, 2016, 5:03 a.m.)


Review request for hive and Carl Steinbach.


Changes
---

Added the UDF name to the show_functions q.out file.


Repository: hive-git


Diffs (updated)
-

  itests/src/test/resources/testconfiguration.properties 1ab914d
  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 2f4a94c
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSortArrayField.java PRE-CREATION
  ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFSortArrayField.java PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_field_wrong1.q PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_field_wrong2.q PRE-CREATION
  ql/src/test/queries/clientpositive/udf_sort_array_field.q PRE-CREATION
  ql/src/test/results/beelinepositive/show_functions.q.out 4f3ec40
  ql/src/test/results/clientnegative/udf_sort_array_field_wrong1.q.out PRE-CREATION
  ql/src/test/results/clientnegative/udf_sort_array_field_wrong2.q.out PRE-CREATION
  ql/src/test/results/clientpositive/udf_sort_array_field.q.out PRE-CREATION

Diff: https://reviews.apache.org/r/49619/diff/


Testing
---

JUnit test cases and query .q files are attached.


Thanks,

Simanchal Das

Re: Review Request 49619: sorting of tuple array using multiple fields

2016-07-06 Thread Carl Steinbach


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/49619/#review141130
---




ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java (line 427)


To me "sort_array_field" makes it sound like this function sorts the 
elements in an array field, as opposed to sorting an array on a particular 
field, which is what it actually does. I think the purpose of this function 
would be clearer if the name were changed to 'sort_array_on_field' or 
'sort_array_by' (I prefer the latter).



ql/src/test/queries/clientpositive/udf_sort_array_field.q (line 1)


Is this really necessary?



ql/src/test/queries/clientpositive/udf_sort_array_field.q (line 9)


No need for this. Please remove.



ql/src/test/queries/clientpositive/udf_sort_array_field.q (line 16)


The rows should have different struct values.



ql/src/test/queries/clientpositive/udf_sort_array_field.q (line 25)


Consider using named_struct() instead of struct(). This will allow you to 
provide names for the struct fields.



ql/src/test/results/beelinepositive/show_functions.q.out (line 183)


The number of rows is off by 8. This looks like a bug, though not one 
caused by this patch.



ql/src/test/results/beelinepositive/show_functions.q.out (line 184)


It looks like you're stripping whitespace out of the patch. I suspect this 
is the cause of the failure in show_functions.q


- Carl Steinbach



Re: Review Request 49619: sorting of tuple array using multiple fields

2016-07-08 Thread Simanchal Das



(Updated July 8, 2016, 12:35 p.m.)


Review request for hive, Ashutosh Chauhan and Carl Steinbach.


Changes
---

Renamed the UDF to sort_array_by and addressed all the review comments.


Repository: hive-git


Description (updated)
---

Problem Statement:

When working with complex data structures such as Avro, we often encounter
arrays that contain multiple tuples, each of which has a struct schema.
Suppose the struct schema is as follows:
{
  "name": "employee",
  "type": [{
    "type": "record",
    "name": "Employee",
    "namespace": "com.company.Employee",
    "fields": [{
      "name": "empId",
      "type": "int"
    }, {
      "name": "empName",
      "type": "string"
    }, {
      "name": "age",
      "type": "int"
    }, {
      "name": "salary",
      "type": "double"
    }]
  }]
}

When a Hive query runs over such data, the complex column appears as an array
of employee objects.
Example: 
// (array<struct<empId:int,empName:string,age:int,salary:double>>)

Array[Employee(100,Foo,20,20990),Employee(500,Boo,30,50990),Employee(700,Harry,25,40990),Employee(100,Tom,35,70990)]

In day-to-day business use cases we frequently need to sort such a tuple array
by one or more specific fields, such as empId, name, or salary, in ASC or DESC
order.

Proposal:
I have developed a UDF, 'sort_array_by', which sorts a tuple array by one or
more fields in the ASC or DESC order specified by the user; the default is
ascending order.
Example:
1. Select sort_array_by(array[struct(100,Foo,20,20990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"Salary","ASC");
output:
array[struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(500,Boo,30,50990),struct(100,Tom,35,70990)]

2. Select sort_array_by(array[struct(100,Foo,20,20990),struct(500,Boo,30,80990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"Name","Salary","ASC");
output:
array[struct(500,Boo,30,50990),struct(500,Boo,30,80990),struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)]

3. Select sort_array_by(array[struct(100,Foo,20,20990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"Name","Salary","Age","ASC");
output:
array[struct(500,Boo,30,50990),struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)]
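
One way to read the argument convention in these examples is that an optional
trailing "ASC" or "DESC" selects the direction for the whole sort. The Java
sketch below illustrates that reading only; it is not taken from
GenericUDFSortArrayByField. The class SortOrderSketch and its methods are
hypothetical, rows are modelled as maps from field name to value, every sort
field is assumed to be present and non-null, and treating a final argument that
matches ASC/DESC (case-insensitively) as the direction is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; this is NOT the code of GenericUDFSortArrayByField.
final class SortOrderSketch {

    // Hypothetical helper that builds one struct-like row.
    static Map<String, Object> row(String name, double salary) {
        Map<String, Object> r = new HashMap<>();
        r.put("name", name);
        r.put("salary", salary);
        return r;
    }

    // Compares one field of two rows in natural order (values assumed non-null and Comparable).
    @SuppressWarnings("unchecked")
    private static int compareField(Map<String, Object> a, Map<String, Object> b, String field) {
        Comparable<Object> left = (Comparable<Object>) a.get(field);
        return left.compareTo(b.get(field));
    }

    // args = one or more field names, optionally followed by "ASC" or "DESC" (default ASC).
    static List<Map<String, Object>> sortArrayBy(List<Map<String, Object>> rows, String... args) {
        if (args.length == 0) {
            throw new IllegalArgumentException("At least one field name is required");
        }
        int fieldCount = args.length;
        boolean descending = false;
        String last = args[args.length - 1].toUpperCase(Locale.ROOT);
        if (last.equals("ASC") || last.equals("DESC")) {
            descending = last.equals("DESC");
            fieldCount--;
        }
        if (fieldCount == 0) {
            throw new IllegalArgumentException("A sort order needs at least one field name");
        }
        final String[] fields = Arrays.copyOf(args, fieldCount);

        // Compare field by field; the first non-zero result decides the order.
        Comparator<Map<String, Object>> cmp = (a, b) -> {
            for (String f : fields) {
                int c = compareField(a, b, f);
                if (c != 0) {
                    return c;
                }
            }
            return 0;
        };
        if (descending) {
            cmp = cmp.reversed();
        }
        List<Map<String, Object>> out = new ArrayList<>(rows);
        out.sort(cmp);
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = Arrays.asList(
            row("Foo", 20990), row("Boo", 50990), row("Harry", 40990), row("Tom", 70990));
        for (Map<String, Object> r : sortArrayBy(rows, "salary", "DESC")) {
            System.out.println(r.get("name") + " " + r.get("salary"));
        }
    }
}
```

Under this reading a struct field that is itself named "asc" or "desc" could
not be used as the last sort key, which is one trade-off of encoding the
direction as a trailing string argument.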


Diffs (updated)
-

  itests/src/test/resources/testconfiguration.properties 1ab914d
  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 2f4a94c
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSortArrayByField.java PRE-CREATION
  ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFSortArrayByField.java PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_by_wrong1.q PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_by_wrong2.q PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_by_wrong3.q PRE-CREATION
  ql/src/test/queries/clientpositive/udf_sort_array_by.q PRE-CREATION
  ql/src/test/results/beelinepositive/show_functions.q.out 4f3ec40
  ql/src/test/results/clientnegative/udf_sort_array_by_wrong1.q.out PRE-CREATION
  ql/src/test/results/clientnegative/udf_sort_array_by_wrong2.q.out PRE-CREATION
  ql/src/test/results/clientnegative/udf_sort_array_by_wrong3.q.out PRE-CREATION
  ql/src/test/results/clientpositive/show_functions.q.out a811747
  ql/src/test/results/clientpositive/udf_sort_array_by.q.out PRE-CREATION

Diff: https://reviews.apache.org/r/49619/diff/


Testing
---

JUnit test cases and query .q files are attached.


Thanks,

Simanchal Das

Re: Review Request 49619: sorting of tuple array using multiple fields

2016-07-08 Thread Simanchal Das



> On July 7, 2016, 6:45 a.m., Carl Steinbach wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java, line 427
> > 
> >
> > To me "sort_array_field" makes it sound like this function sorts the 
> > elements in an array field, as opposed to sorting an array on a particular 
> > field, which is what it actually does. I think the purpose of this function 
> > would be clearer if the name were changed to 'sort_array_on_field' or 
> > 'sort_array_by' (I prefer the latter).

fixed


> On July 7, 2016, 6:45 a.m., Carl Steinbach wrote:
> > ql/src/test/queries/clientpositive/udf_sort_array_field.q, line 1
> > 
> >
> > Is this really necessary?

removed


> On July 7, 2016, 6:45 a.m., Carl Steinbach wrote:
> > ql/src/test/queries/clientpositive/udf_sort_array_field.q, line 9
> > 
> >
> > No need for this. Please remove.

removed


> On July 7, 2016, 6:45 a.m., Carl Steinbach wrote:
> > ql/src/test/queries/clientpositive/udf_sort_array_field.q, line 16
> > 
> >
> > The rows should have different struct values.

Changed the values.


> On July 7, 2016, 6:45 a.m., Carl Steinbach wrote:
> > ql/src/test/queries/clientpositive/udf_sort_array_field.q, line 25
> > 
> >
> > Consider using named_struct() instead of struct(). This will allow you 
> > to provide names for the struct fields.

Used named_struct()


- Simanchal



Re: Review Request 49619: sorting of tuple array using multiple fields

2016-07-08 Thread Simanchal Das



> On July 7, 2016, 6:45 a.m., Carl Steinbach wrote:
> > ql/src/test/results/beelinepositive/show_functions.q.out, line 183
> > 
> >
> > The number of rows is off by 8. This looks like a bug, though not one 
> > caused by this patch.

Re-ran the test.


> On July 7, 2016, 6:45 a.m., Carl Steinbach wrote:
> > ql/src/test/results/beelinepositive/show_functions.q.out, line 184
> > 
> >
> > It looks like you're stripping whitespace out of the patch. I suspect 
> > this is the cause of the failure in show_functions.q

Re-ran the test.


- Simanchal



Re: Review Request 49619: sorting of tuple array using multiple fields

2016-07-09 Thread Carl Steinbach

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/49619/#review141413
---

Some more comments.

ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSortArrayByField.java (line 56)

A couple notes:

1. I think the example should actually return a value that is different 
than the input. It would also be good to include more than two elements in the 
input. If screen space is an issue I recommend only including a single element 
in each of the structs in the example, which I think has the added benefit of 
making the example clearer by not distracting the reader with irrelevant 
details.

2. It looks like the default sorting order (ASC) is actually the reverse of 
what I would expect it to be, i.e. I expect 'b' to come before 'g'.

3. Related to point (2), I think it's important to ensure that the sorting 
order of this UDF is consistent with ORDER BY. For example, for a table t 
containing a single row with a single array-of-struct field a_struct_array, 
the query "SELECT a_struct FROM t LATERAL VIEW explode(a_struct_array) 
structTable AS a_struct ORDER BY a_struct.col1 DESC" should return the same 
results as "SELECT a_struct FROM t LATERAL VIEW 
explode(sort_array_by(a_struct_array, 'col1', 'DESC')) structTable AS 
a_struct". Note that I probably didn't get the syntax for LATERAL VIEW and 
explode() correct.

ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSortArrayByField.java (line 93)

Unnecessary string concatenation operators.

ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSortArrayByField.java (line 104)

Unnecessary "+" operator.

ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFSortArrayByField.java (line 1)

Does this unit test provide any additional coverage or advantages over the 
q file tests? Is it necessary to have both?

Note that I am a strong advocate of end-to-end qfile tests over unit tests, 
which is an opinion that not everyone holds.

- Carl Steinbach


Re: Review Request 49619: sorting of tuple array using multiple fields

2016-07-11 Thread Simanchal Das



> On July 9, 2016, 8:53 p.m., Carl Steinbach wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSortArrayByField.java,
> >  line 56
> > 
> >
> > A couple notes:
> > 
> > 1. I think the example should actually return a value that is different 
> > than the input. It would also be good to include more than two elements in 
> > the input. If screen space is an issue I recommend only including a single 
> > element in each of the structs in the example, which I think has the added 
> > benefit of making the example clearer by not distracting the reader with 
> > irrelevant details.
> > 
> > 2. It looks like the default sorting order (ASC) is actually the 
> > reverse of what I would expect it to be, i.e. I expect 'b' to come before 
> > 'g'.
> > 
> > 3. Related to point (2), I think it's important to ensure that the 
> > sorting order of this UDF is consistent with ORDER BY. For example, for a 
> > table t containing a single row with a single array-of-struct field 
> > a_struct_array, the query "SELECT a_struct FROM t LATERAL VIEW 
> > explode(a_struct_array) structTable AS a_struct ORDER BY a_struct.col1 
> > DESC" should return the same results as "SELECT a_struct FROM t LATERAL 
> > VIEW explode(sort_array_by(a_struct_array, 'col1', 'DESC')) structTable AS 
> > a_struct". Note that I probably didn't get the syntax for LATERAL VIEW and 
> > explode() correct.

1. Sorry, there was a typo.
2. Corrected the sorting order in the example.
3. As you suggested, I have added test examples using LATERAL VIEW 
explode(array) and explode(udf), which give the same results.


> On July 9, 2016, 8:53 p.m., Carl Steinbach wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSortArrayByField.java,
> >  line 93
> > 
> >
> > Unnecessary string concatenation operators.

removed


> On July 9, 2016, 8:53 p.m., Carl Steinbach wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSortArrayByField.java,
> >  line 104
> > 
> >
> > Unnecessary "+" operator.

removed


> On July 9, 2016, 8:53 p.m., Carl Steinbach wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFSortArrayByField.java,
> >  line 1
> > 
> >
> > Does this unit test provide any additional coverage or advantages over 
> > the q file tests? Is it necessary to have both?
> > 
> > Note that I am a strong advocate of end-to-end qfile tests over unit 
> > tests, which is an opinion that not everyone holds.

These are much the same as the q file tests. I feel that test cases in Test 
classes are good for unit testing during development and take less time to 
identify a problem compared to q files.
In any case, I have removed some test cases from the Test class.


- Simanchal



Re: Review Request 49619: sorting of tuple array using multiple fields

2016-07-11 Thread Simanchal Das



(Updated July 11, 2016, 8:37 a.m.)


Review request for hive, Ashutosh Chauhan and Carl Steinbach.


Repository: hive-git


Description
---

Problem Statement:

When working with complex data structures such as Avro, we often encounter
arrays that contain multiple tuples, where each tuple has a struct schema.
Suppose the struct schema is as follows:
{
"name": "employee",
"type": [{
"type": "record",
"name": "Employee",
"namespace": "com.company.Employee",
"fields": [{
"name": "empId",
"type": "int"
}, {
"name": "empName",
"type": "string"
}, {
"name": "age",
"type": "int"
}, {
"name": "salary",
"type": "double"
}]
}]
}

When running a Hive query, the complex array then looks like an array of
employee objects.
Example:
// array<struct<empId:int,empName:string,age:int,salary:double>>

Array[Employee(100,Foo,20,20990),Employee(500,Boo,30,50990),Employee(700,Harry,25,40990),Employee(100,Tom,35,70990)]

When implementing day-to-day business use cases, we often need to sort a tuple
array by specific field[s] such as empId, name, salary, etc., in ASC or DESC
order.

Proposal:

I have developed a UDF 'sort_array_by' which sorts a tuple array by one or more
fields in the ASC or DESC order specified by the user; the default is ascending
order.

Example:
1. Select
sort_array_by(array[struct(100,Foo,20,20990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"Salary","ASC");
output:
array[struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(500,Boo,30,50990),struct(100,Tom,35,70990)]

2. Select
sort_array_by(array[struct(100,Foo,20,20990),struct(500,Boo,30,80990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"Name","Salary","ASC");
output:
array[struct(500,Boo,30,50990),struct(500,Boo,30,80990),struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)]

3. Select
sort_array_by(array[struct(100,Foo,20,20990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"Name","Salary","Age","ASC");
output:
array[struct(500,Boo,30,50990),struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)]
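
To make the intended ordering concrete, here is a minimal sketch in plain Java
of sorting by several fields with one direction. It is not the code in
GenericUDFSortArrayByField.java and does not use any Hive API. The class
SortArrayBySketch, the Employee holder, and the helper names are hypothetical,
the field names are taken from the Avro schema above, and it assumes that a
single ASC/DESC flag applies to all listed fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the UDF under review.
class SortArrayBySketch {

    // Stand-in for one struct element of the array (fields from the schema).
    static final class Employee {
        final int empId;
        final String empName;
        final int age;
        final double salary;

        Employee(int empId, String empName, int age, double salary) {
            this.empId = empId;
            this.empName = empName;
            this.age = age;
            this.salary = salary;
        }

        @Override
        public String toString() {
            return "struct(" + empId + "," + empName + "," + age + "," + (long) salary + ")";
        }
    }

    // Maps a field name (case-insensitive) to a comparator on that field.
    static Comparator<Employee> comparatorFor(String field) {
        switch (field.toLowerCase(Locale.ROOT)) {
            case "empid":
                return Comparator.comparingInt((Employee e) -> e.empId);
            case "empname":
                return Comparator.comparing((Employee e) -> e.empName);
            case "age":
                return Comparator.comparingInt((Employee e) -> e.age);
            case "salary":
                return Comparator.comparingDouble((Employee e) -> e.salary);
            default:
                throw new IllegalArgumentException("Unknown field: " + field);
        }
    }

    // Sorts a copy of the input by the given fields, in priority order.
    // Assumption: one direction flag applies to the whole composite key.
    static List<Employee> sortArrayBy(List<Employee> input, boolean ascending, String... fields) {
        if (fields.length == 0) {
            throw new IllegalArgumentException("At least one field is required");
        }
        Comparator<Employee> cmp = comparatorFor(fields[0]);
        for (int i = 1; i < fields.length; i++) {
            cmp = cmp.thenComparing(comparatorFor(fields[i]));
        }
        if (!ascending) {
            cmp = cmp.reversed();
        }
        List<Employee> copy = new ArrayList<>(input);
        copy.sort(cmp);
        return copy;
    }

    public static void main(String[] args) {
        List<Employee> input = Arrays.asList(
                new Employee(100, "Foo", 20, 20990),
                new Employee(500, "Boo", 30, 80990),
                new Employee(500, "Boo", 30, 50990),
                new Employee(700, "Harry", 25, 40990),
                new Employee(100, "Tom", 35, 70990));
        System.out.println(sortArrayBy(input, true, "empName", "salary"));
    }
}
```

With the five-element input of example 2, sorting by empName and then salary in
ascending order puts the two Boo entries first, with the lower salary ahead of
the higher one, followed by Foo, Harry and Tom.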


Diffs (updated)
---

  itests/src/test/resources/testconfiguration.properties 1ab914d
  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 2f4a94c
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSortArrayByField.java PRE-CREATION
  ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFSortArrayByField.java PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_by_wrong1.q PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_by_wrong2.q PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_by_wrong3.q PRE-CREATION
  ql/src/test/queries/clientpositive/udf_sort_array_by.q PRE-CREATION
  ql/src/test/results/beelinepositive/show_functions.q.out 4f3ec40
  ql/src/test/results/clientnegative/udf_sort_array_by_wrong1.q.out PRE-CREATION
  ql/src/test/results/clientnegative/udf_sort_array_by_wrong2.q.out PRE-CREATION
  ql/src/test/results/clientnegative/udf_sort_array_by_wrong3.q.out PRE-CREATION
  ql/src/test/results/clientpositive/show_functions.q.out a811747
  ql/src/test/results/clientpositive/udf_sort_array_by.q.out PRE-CREATION

Diff: https://reviews.apache.org/r/49619/diff/


Testing
---

JUnit test cases and .q query files are attached.


Thanks,

Simanchal Das

Re: Review Request 49619: sorting of tuple array using multiple fields

2016-07-11 Thread Simanchal Das


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/49619/
---

(Updated July 11, 2016, 1:32 p.m.)


Review request for hive, Ashutosh Chauhan and Carl Steinbach.


Repository: hive-git


Description (updated)
---

https://issues.apache.org/jira/browse/HIVE-14159

Problem Statement:

When working with complex data structures such as Avro, we often encounter
arrays that contain multiple tuples, where each tuple has a struct schema.
Suppose the struct schema is as follows:
{
"name": "employee",
"type": [{
"type": "record",
"name": "Employee",
"namespace": "com.company.Employee",
"fields": [{
"name": "empId",
"type": "int"
}, {
"name": "empName",
"type": "string"
}, {
"name": "age",
"type": "int"
}, {
"name": "salary",
"type": "double"
}]
}]
}

When running a Hive query, the complex array then looks like an array of
employee objects.
Example:
// array<struct<empId:int,empName:string,age:int,salary:double>>

Array[Employee(100,Foo,20,20990),Employee(500,Boo,30,50990),Employee(700,Harry,25,40990),Employee(100,Tom,35,70990)]

When implementing day-to-day business use cases, we often need to sort a tuple
array by specific field[s] such as empId, name, salary, etc., in ASC or DESC
order.

Proposal:

I have developed a UDF 'sort_array_by' which sorts a tuple array by one or more
fields in the ASC or DESC order specified by the user; the default is ascending
order.

Example:
1. Select
sort_array_by(array[struct(100,Foo,20,20990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"Salary","ASC");
output:
array[struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(500,Boo,30,50990),struct(100,Tom,35,70990)]

2. Select
sort_array_by(array[struct(100,Foo,20,20990),struct(500,Boo,30,80990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"Name","Salary","ASC");
output:
array[struct(500,Boo,30,50990),struct(500,Boo,30,80990),struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)]

3. Select
sort_array_by(array[struct(100,Foo,20,20990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"Name","Salary","Age","ASC");
output:
array[struct(500,Boo,30,50990),struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)]


Diffs
---

  itests/src/test/resources/testconfiguration.properties 1ab914d
  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 2f4a94c
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSortArrayByField.java PRE-CREATION
  ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFSortArrayByField.java PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_by_wrong1.q PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_by_wrong2.q PRE-CREATION
  ql/src/test/queries/clientnegative/udf_sort_array_by_wrong3.q PRE-CREATION
  ql/src/test/queries/clientpositive/udf_sort_array_by.q PRE-CREATION
  ql/src/test/results/beelinepositive/show_functions.q.out 4f3ec40
  ql/src/test/results/clientnegative/udf_sort_array_by_wrong1.q.out PRE-CREATION
  ql/src/test/results/clientnegative/udf_sort_array_by_wrong2.q.out PRE-CREATION
  ql/src/test/results/clientnegative/udf_sort_array_by_wrong3.q.out PRE-CREATION
  ql/src/test/results/clientpositive/show_functions.q.out a811747
  ql/src/test/results/clientpositive/udf_sort_array_by.q.out PRE-CREATION

Diff: https://reviews.apache.org/r/49619/diff/


Testing
---

JUnit test cases and .q query files are attached.


Thanks,

Simanchal Das
