[
https://issues.apache.org/jira/browse/PIG-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249630#comment-13249630
]
Jonathan Coveney commented on PIG-2600:
---------------------------------------
Prashant, thanks for tackling this. Maps have been a second class citizen for
too long :) Some comments:
General:
- I personally usually do "if (input == null || input.size() == 0)" just so you
don't get the array index error, but that's fine. Same issue as in VALUELIST
comes up with that as well.
VALUELIST/SET
- in some cases maps actually have information on the type of their keys. If
this is the case, that information should be used to create the appropriate
output schema.
- I personally like the approach of making TupleFactory and BagFactory a static
final member of classes, but that's hardly Pig dogma :)
- In the case of a null map that is passed in, there are two options. One is to
return null as you do, another is to return an empty bag... the importance is
important because a null value won't get thrown out on a flatten, but a null
bag will. I think I'd return a null bag, but would be open either way. It's
just something we have to be conscious of.
KEYSET/LIST
{code}
+ //Inner schema for keys would always be type chararray
+ return new Schema(new Schema.FieldSchema(null, DataType.BAG));
{code}
You should build the Schema so that it takes into account the fact that you
know it will be chararrays.
VALUESET
- You have an interesting comment about how you could iterate over the
collection, and
from HashSet, here's the constructor code:
{code}
public HashSet(Collection<? extends E> c) {
map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
addAll(c);
}
{code}
And addAll does what you're predict. The point of this being that you can
probably save in the net amount of logic by iterating over the values directly
as you suggest and checking on the HashSet if it already exists. There wouldn't
be much of a change compared to what's going on under the covers anyway, as
it's not like they do anything special to predict the size it's going to
be...you'll just be doing the resizes as you iterate over values manually
instead of all at once.
INVERSEMAP
- in doInverse, I don't think you really need to do the chain of isntanceofs.
You're really just checking that it's not a DataBag or Tuple, right? you could
do something like
{code}
if (!(o instanceof Tuple || o instanceof DataBag))
newKey = o.toString();
{code}
The casting is unecessary. If it is the proper type, then toString will give
you what you want irrespective.
- For updating hashmaps in cases like this, I'm a fan of the following to avoid
having to do containsKey and then a get right afterwards
{code}
DataBag bag = (DataBag)inverseMap.get(newKey);
if (bag==null) {
//logic
} else {
bad.add( ... //more logic)
}
{code}
Thanks again for the legwork on this.
> Better Map support
> ------------------
>
> Key: PIG-2600
> URL: https://issues.apache.org/jira/browse/PIG-2600
> Project: Pig
> Issue Type: Improvement
> Reporter: Jonathan Coveney
> Assignee: Prashant Kommireddi
> Fix For: 0.11
>
> Attachments: PIG-2600.patch, PIG-2600_2.patch, PIG-2600_3.patch,
> PIG-2600_4.patch
>
>
> It would be nice if Pig played better with Maps. To that end, I'd like to add
> a lot of utility around Maps.
> - TOBAG should take a Map and output {(key, value)}
> - TOMAP should take a Bag in that same form and make a map.
> - KEYSET should return the set of keys.
> - VALUESET should return the set of values.
> - VALUELIST should return the List of values (no deduping).
> - INVERSEMAP would return a Map of values => the set of keys that refer to
> that Key
> This would all be pretty easy. A more substantial piece of work would be to
> make Pig support non-String keys (this is especially an issue since UDFs and
> whatnot probably assume that they are all Integers). Not sure if it is worth
> it.
> I'd love to hear other things that would be useful for people!
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira