Not sure if there is a function for that but I wrote a UDF to do so -
https://github.com/chandeepsingh/Hive-UDFs
<https://github.com/chandeepsingh/Hive-UDFs>
hive> ADD JAR hive-udfs-1.0-uber.jar;
Added [hive-udfs-1.0-uber.jar] to class path
Added resources: [hive-udfs-1.0-uber.jar]
hive> CREATE TEMPORARY FUNCTION array_dedup AS 'com.hive.udfs.UdfArrayDeDup';
OK
Time taken: 0.015 seconds
hive> SELECT array_dedup(array("blah","blah","blah")) from table1 limit 1;
OK
["blah"]
Time taken: 0.502 seconds, Fetched: 1 row(s)
Here is the code:
package com.hive.udfs;
/**
*
* @author chandeepsingh
* Remove duplicates from an array
*/
import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
import java.util.HashSet;
import java.util.List;
public class UdfArrayDeDup extends GenericUDF {
ListObjectInspector arrayOI = null;
@Override
public ObjectInspector initialize(ObjectInspector[] arguments)
throws UDFArgumentException {
arrayOI = (ListObjectInspector) arguments[0];
return ObjectInspectorUtils.getStandardObjectInspector(arrayOI);
}
@Override
public Object evaluate(DeferredObject[] arguments) throws HiveException {
List<?> myArr = (List<?>)
ObjectInspectorUtils.copyToStandardObject(arguments[0].get(), arrayOI);
HashSet<Object> myHashSet = new HashSet<>();
myHashSet.addAll(myArr);
if (myHashSet != null) {
return new ArrayList<>(myHashSet);
} else {
return null;
}
}
@Override
public String getDisplayString(String[] input) {
return new String();
}
}
> On Mar 13, 2016, at 1:30 AM, Rex X <[email protected]> wrote:
>
> For the first question, is there any way to use "set" instead of an "array"
> to dedupe all elements?
>
> "select array(1,1)" will return "[1,1]", not "[1]".
>
>
>
> On Sat, Mar 12, 2016 at 5:26 PM, Rex X <[email protected]
> <mailto:[email protected]>> wrote:
> Thank you, Chandeep. Yes, my first problem solved.
> How about the second one? Is there any way to append an element to an
> existing array?
>
>
>
> On Sat, Mar 12, 2016 at 5:10 PM, Chandeep Singh <[email protected]
> <mailto:[email protected]>> wrote:
> If you only want the array while you’re querying table1 your example should
> work. If you want to add AB to the table you’ll probably need to create a new
> table by selecting everything you need from table1.
>
> hive> select * from table1 limit 1;
> OK
> temp1 temp2 temp3
>
> hive> select f1, array(f2, f3) AS AB from table1 limit 1;
> OK
> temp1 [“temp2”,"temp3"]
>
>
>> On Mar 13, 2016, at 12:33 AM, Rex X <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>> How to make the following work?
>>
>> 1. combine columns A and B to make one array as a new column AB. Both column
>> A and B are string types.
>>
>> select
>> string_columnA,
>> string_columnB,
>> array(string_columnA, string_columnB) as AB
>> from Table1;
>>
>> 2. append columnA to an existing array-type column B
>>
>> select
>> string_columnA,
>> array_columnB,
>> array_flatmerge(string_columnA, array_columnB) as AB
>> from Table2;
>>
>> In fact, I should say "set" instead of "array" above, since I expect no
>> duplicates.
>>
>> Any idea?
>>
>
>
>