[
https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747635#comment-13747635
]
Edward Capriolo commented on HIVE-3562:
---------------------------------------
{quote}1. Is it possible to use one DataStructure (TreeSet probably) for two
implementations of ReduceSinkHash. If so, that will simplify this code quite a
bit.{quote}
I think we are better off using the MinMaxPriorityQueue
http://highscalability.com/blog/2013/5/22/strategy-stop-using-linked-lists.html
Structures that use Node's are really inefficient in java. The pointers bloat
the structure, and the non-local memory access and causes a lot of cache
misses. I benched a similar scenario sorting large tree sets, and even though
the big O is smaller (believe it or not) to sort a list then maintain a
treeSet. MinMaxPriorityQueue has a middle-ground implementation which real
world works much faster then TreeSet.
{code}
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;
import java.util.TreeSet;
public class TreeSetTest {
static Random r = new Random();
public static void main(String [] args){
doIt(500000,new TreeSet<Double>());
doIt(500000,new TreeSet<Double>());
doIt(500000,new TreeSet<Double>());
doIt(500000,new TreeSet<Double>());
{
List<Double> arrayList = new ArrayList<Double>();
doIt(500000,arrayList);
sortIt(arrayList);
}
{
List<Double> arrayList = new ArrayList<Double>();
doIt(500000,arrayList);
sortIt(arrayList);
}
{
List<Double> arrayList = new LinkedList<Double>();
doIt(500000,arrayList);
sortIt(arrayList);
}
{
List<Double> arrayList = new ArrayList<Double>();
doIt(500000,arrayList);
sortIt(arrayList);
}
{
List<Double> arrayList = new ArrayList<Double>();
doIt(500000,arrayList);
sortIt(arrayList);
}
doIt(500000,new TreeSet<Double>());
doIt(500000,new TreeSet<Double>());
{
List<Double> arrayList = new LinkedList<Double>();
doIt(500000,arrayList);
sortIt(arrayList);
}
}
public static void sortIt(List l){
long start = System.currentTimeMillis();
Collections.sort(l);
long end = System.currentTimeMillis();
System.out.println(l.getClass().getName()+" sort it "+ (end-start));
}
public static void doIt(int n, Collection l){
long start = System.currentTimeMillis();
for (int i =0;i<n;i++){
randomDouble(l);
}
long end = System.currentTimeMillis();
System.out.println(l.getClass().getName()+" add it "+ (end-start));
}
private static void randomDouble(Collection l){
int rangeMin=0;
int rangeMax=20000000;
double randomValue = rangeMin + (rangeMax - rangeMin) * r.nextDouble();
l.add(randomValue);
}
}
{code}
> Some limit can be pushed down to map stage
> ------------------------------------------
>
> Key: HIVE-3562
> URL: https://issues.apache.org/jira/browse/HIVE-3562
> Project: Hive
> Issue Type: Bug
> Reporter: Navis
> Assignee: Navis
> Priority: Trivial
> Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch,
> HIVE-3562.D5967.3.patch, HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch,
> HIVE-3562.D5967.6.patch
>
>
> Queries with limit clause (with reasonable number), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> makes operator tree,
> TS-SEL-RS-EXT-LIMIT-FS
> But LIMIT can be partially calculated in RS, reducing size of shuffling.
> TS-SEL-RS(TOP-N)-EXT-LIMIT-FS
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira