[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

Ashutosh Chauhan (JIRA) Sun, 13 Sep 2009 00:47:23 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754671#action_12754671
 ]


Ashutosh Chauhan commented on PIG-953:
--------------------------------------

1. [Pradeep] zebra store function would basically needs to know the sort keys 
in order and which of them are asc/dsc. For this they would iterate over our 
data structure and require that the ordering of the keys match the 
primary/secondary order of the sort keys

[Ashutosh] What about LinkedHashMap? It provides all the properties we are 
seeking here, one data structure, O(1) lookup and guaranteed iteration order. 

2. In Utils.java
{code}
public static boolean checkNullAndClass(Object obj1, Object obj2) {
        return checkNullEquals(obj1, obj2, false) && obj1.getClass() == 
obj2.getClass();
}
{code}

will result in NPE when both obj1 and obj2 are null. 

A minor detail:  Suppose obj1 is declared of type ArrayList<Integer> and obj2 
is declared of type ArrayList<String>, obj1.getClass() == obj2.getClass() will 
return true thanks to type erasure by java compiler at compile time. Not sure 
if thats OK or not for the check here. 

3. In StoreConfig.java One of the scenarios in which SortInfo is returned as 
null is
{code}
* 3) the store follows an "order by" but the schema
* of "order by" does not have column name(s) for the sort
* column(s)
{code}
I understand that reason for this additional constraint is because SortInfo 
maintains list of column names. But even if schema contains only type 
information and not the column names, that still is a sufficient information to 
build indexes. Information about on which column data is sorted on can be 
recorded using column positions isn't it? Does zebra requires columns to be 
named? If it doesn't then SortInfo could be changed in such a way that it can 
provide column position instead of names to loader, if columns arent named.

In POMergeJoin.java
4.
{code}
+        currentFileName = lFile.getFileName();
+        loader = 
(LoadFunc)PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
+        is = FileLocalizer.open(currentFileName, offset, pc);
+        if (currentFileName.endsWith(".bz") || 
currentFileName.endsWith(".bz2")) {
+            is = new CBZip2InputStream((SeekableInputStream)is, 9);
+        } else if (currentFileName.endsWith(".gz")) {
+            is = new GZIPInputStream(is);
+        }
+
{code}

Isnt this blocked on https://issues.apache.org/jira/browse/PIG-930 ?

5.
{code}
default: // We don't deal with ERR/NULL. just pass them down
    return res;
{code}
 should be changed to                       
{code}
default: 
    throwProcessingException(false,null);
{code}

because if status is Error, execution should be stopped and exception should be 
thrown as early as possible instead of continue doing work which will be 
wasted. If status is Null NPE will occur while doing join.

6.
{code}
InputStream is = FileLocalizer.open(rightInputFileName, pc);
rightLoader.bindTo(rightInputFileName, new BufferedPositionedInputStream(is), 
0, Long.MAX_VALUE);
{code}
I dont see any use of this code. I think its not required and can be removed.

Infact, there is no need of following function too:
{code}
    /**
     * @param rightInputFileName the rightInputFileName to set
     */
    public void setRightInputFileName(String rightInputFileName) {
        this.rightInputFileName = rightInputFileName;
    }
{code} 
file name of right side is obtained from index which is contained in index 
file. Index file is directly passed as a constructor argument of 
indexableLoadFunc, so there is no need of passing rightinputfilename from 
MRCompiler to POMergeJoin.
And if this reasoning is correct then DefaultIndexableLoader.bindTo() should 
throw an IOException, because contract on DefaultIndexableLoader is that it is 
initialized with all the info it needs in constructor and then seekNear is 
called on it to seek to correct location. bindTo() shouldn't be used for this 
loader. 
Also, seekNear() doesn't sound right. How about seekToClosest() ?       

7. I think introducing order preserving flag on logical operator is a good 
idea. 
First its self documenting as the information is contained within operator and 
not checked by doing instanceof else where in code. 
Second its a useful information which if present can help make optimizer smart 
decisions. As an example, optimizer can rewrite a symmetric hash join to 
merge-sort join if all the logical operators in query DAG from join inputs to 
the root has these flags set to true. Without this flag, doing such 
optimizations will be hard.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-953
>                 URL: https://issues.apache.org/jira/browse/PIG-953
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

Reply via email to