[ 
https://issues.apache.org/jira/browse/PIG-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249630#comment-13249630
 ] 

Jonathan Coveney commented on PIG-2600:
---------------------------------------

Prashant, thanks for tackling this. Maps have been a second class citizen for 
too long :) Some comments:

General:
- I personally usually do "if (input == null || input.size() == 0)" just so you 
don't get the array index error, but that's fine. Same issue as in VALUELIST 
comes up with that as well.

VALUELIST/SET
- in some cases maps actually have information on the type of their keys. If 
this is the case, that information should be used to create the appropriate 
output schema.
- I personally like the approach of making TupleFactory and BagFactory a static 
final member of classes, but that's hardly Pig dogma :)
- In the case of a null map that is passed in, there are two options. One is to 
return null as you do, another is to return an empty bag... the importance is 
important because a null value won't get thrown out on a flatten, but a null 
bag will. I think I'd return a null bag, but would be open either way. It's 
just something we have to be conscious of.

KEYSET/LIST
{code}
+        //Inner schema for keys would always be type chararray
+        return new Schema(new Schema.FieldSchema(null, DataType.BAG)); 
{code}
You should build the Schema so that it takes into account the fact that you 
know it will be chararrays.

VALUESET
- You have an interesting comment about how you could iterate over the 
collection, and 

from HashSet, here's the constructor code:
{code}
public HashSet(Collection<? extends E> c) {
    map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
    addAll(c);
}
{code}

And addAll does what you're predict. The point of this being that you can 
probably save in the net amount of logic by iterating over the values directly 
as you suggest and checking on the HashSet if it already exists. There wouldn't 
be much of a change compared to what's going on under the covers anyway, as 
it's not like they do anything special to predict the size it's going to 
be...you'll just be doing the resizes as you iterate over values manually 
instead of all at once.

INVERSEMAP
- in doInverse, I don't think you really need to do the chain of isntanceofs. 
You're really just checking that it's not a DataBag or Tuple, right? you could 
do something like
{code}
if (!(o instanceof Tuple || o instanceof DataBag))
  newKey = o.toString();
{code}
The casting is unecessary. If it is the proper type, then toString will give 
you what you want irrespective.
- For updating hashmaps in cases like this, I'm a fan of the following to avoid 
having to do containsKey and then a get right afterwards
{code}
DataBag bag = (DataBag)inverseMap.get(newKey);
if (bag==null) {
    //logic
} else {
    bad.add( ... //more logic)
}
{code}

Thanks again for the legwork on this.
                
> Better Map support
> ------------------
>
>                 Key: PIG-2600
>                 URL: https://issues.apache.org/jira/browse/PIG-2600
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Prashant Kommireddi
>             Fix For: 0.11
>
>         Attachments: PIG-2600.patch, PIG-2600_2.patch, PIG-2600_3.patch, 
> PIG-2600_4.patch
>
>
> It would be nice if Pig played better with Maps. To that end, I'd like to add 
> a lot of utility around Maps.
> - TOBAG should take a Map and output {(key, value)}
> - TOMAP should take a Bag in that same form and make a map.
> - KEYSET should return the set of keys.
> - VALUESET should return the set of values.
> - VALUELIST should return the List of values (no deduping).
> - INVERSEMAP would return a Map of values => the set of keys that refer to 
> that Key
> This would all be pretty easy. A more substantial piece of work would be to 
> make Pig support non-String keys (this is especially an issue since UDFs and 
> whatnot probably assume that they are all Integers). Not sure if it is worth 
> it.
> I'd love to hear other things that would be useful for people!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to