[jira] Commented: (PIG-653) Make fieldsToRead work in loader
[ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671355#action_12671355 ]

Pradeep Kamath commented on PIG-653:

Interface for passing required fields information to the loader

Proposal

Two new classes will be introduced in the API call to the loader for passing information about required fields.

{code}
class RequiredField {
    // Name of the field (null if not supplied).
    String alias;

    // Index (position) of the required field, 0-based (-1 if not supplied).
    int index;

    // A list of sub fields in this field (this could be a list of hash keys, for example).
    // Null if the entire field is required and no specific sub fields are required.
    // In the initial implementation only one level of sub fields will be populated.
    List<RequiredField> subFields;

    // Type of this field. The value could be any current Pig DataType (as specified by
    // the constants in the DataType class). A new type BAG_OF_MAP will be added to
    // represent a bag of maps field.
    byte type;

    // Constructor, getters and setters follow.
    // Getters are getAlias(), getIndex(), getSubFields(), getType().
    // Setters are setAlias(), setIndex(), setSubFields(), setType().
}
{code}

NOTE: Both alias and index could be set. The index is the position the field would have, as seen by Pig, if the loader sent all fields to it.

For performance, when a single key in a map is requested, the loader should preferably return a map with just that key. Likewise, when the required field is a key in a bag of maps field, the expected value from the loader is a bag of maps where the maps contain that key (preferably only that key, since this reduces the data handed over by the loader).

{code}
class RequiredFieldResponse {
    // true if the loader will return a schema containing only the List of
    // RequiredFields, in that order; false if the loader will return all
    // fields in the data.
    boolean requiredFieldRequestHonored;
}
{code}

The reason we have a RequiredFieldResponse class encapsulating the boolean is to allow for future extensibility. For example, in the future the loader may be able to honor all top level field requests but not sub fields in hashes, so it may hand back top level maps in return for sub field requests. The loader will then need to tell the caller which fields will be returned exactly the way they were requested and which will be sent as top level fields (even though the request was for sub fields). For the first pass, though, it is all or nothing, conveyed through the boolean.

The API call in LoadFunc will change from

{code}
void fieldsToRead(Schema schema)
{code}

to

{code}
RequiredFieldResponse fieldsToRead(List<RequiredField> requiredFields, boolean allFieldsRequired);
{code}

NOTE:
1. It is expected that the loader returns the required fields in exactly the same order as in the List provided in the above call.
2. The boolean flag allFieldsRequired is set to true when all fields are required. The loader should first check this flag and use the List ONLY if this flag is false.

Use Cases
=========

Use Cases which only use aliases

{noformat}
1. Required fields are columns x (int), y (long)
[
  { alias => x, index => -1, subfields => null, type => DataType.INTEGER },
  { alias => y, index => -1, subfields => null, type => DataType.LONG }
]

2. Required fields are m1#key1 (map subcolumn), b1#key2 (subcolumn from a bag of maps)
[
  { alias => m1, index => -1,
    subfields => [
      { alias => key1, index => -1,
        subfields => null, // only one sublevel in the initial implementation, so this has to be null!
        type => DataType.BYTEARRAY }
    ],
    type => DataType.MAP },
  { alias => b1, index => -1,
    subfields => [
      { alias => key2, index => -1,
        subfields => null, // only one sublevel in the initial implementation, so this has to be null!
        type => DataType.BYTEARRAY }
    ],
    type => DataType.BAG_OF_MAP }
]

3. Required fields are m2#(key3, key4) (map subcolumns), b2#(key5, key6) (subcolumns from bag of maps)
[
  { alias => m2, index => -1,
    subfields => [
      { alias => key3, index => -1,
        subfields => null, // only one sublevel in the initial implementation, so this has to be null!
        type => DataType.BYTEARRAY },
      { alias => key4, index => -1,
        subfields => null, // only one sublevel in the initial implementation, so this has to be null!
        type => DataType.BYTEARRAY }
    ],
    type => DataType.MAP },
  { alias => b2, index => -1,
    subfields => [
      { alias => key5, index => -1,
        subfields => null, // only one sublevel in the initial implem
[jira] Commented: (PIG-653) Make fieldsToRead work in loader
[ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671371#action_12671371 ]

Hong Tang commented on PIG-653:
---

I don't like the idea of adding a BAG_OF_MAP type. It really is a composite of two existing types: BAG of MAP. Here is another idea I came up with and briefly discussed with Pradeep.

{code}
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.pig.data.DataType;

public interface Filter {
    /**
     * Return the actual type of the filter. It can then be downcast to the
     * actual Filter.
     *
     * @return one of the following constants defined in DataType: TUPLE, BAG,
     *         and MAP
     */
    byte getType();
}

class TupleFilter implements Filter {
    private static class TupleFilterEntry {
        String alias;
        Filter filter;

        TupleFilterEntry(String a, Filter f) {
            alias = a;
            filter = f;
        }
    }

    // Field index -> entry, kept sorted by index.
    SortedMap<Integer, TupleFilterEntry> entries;

    @Override
    public byte getType() {
        return DataType.TUPLE;
    }

    public TupleFilter() {
        entries = new TreeMap<Integer, TupleFilterEntry>();
    }

    /**
     * Convenience constructor for simple position-based filtering.
     *
     * @param indices
     *          The indices of the fields we are interested in
     */
    public TupleFilter(int... indices) {
        entries = new TreeMap<Integer, TupleFilterEntry>();
        for (int i : indices) {
            entries.put(i, new TupleFilterEntry(null, null));
        }
    }

    /**
     * Add an entry to the filter. (Building the filter.)
     *
     * @param index
     *          The index of the field we are interested in
     * @param alias
     *          The alias name of the field, optional
     * @param filter
     *          Further filtering on the field; null means no more nested
     *          filter.
     */
    public synchronized void add(int index, String alias, Filter filter) {
        entries.put(index, new TupleFilterEntry(alias, filter));
    }

    /**
     * Get the fields of interest.
     *
     * @return The indices of the fields of interest, sorted in ascending
     *         order.
     */
    public synchronized int[] getFields() {
        int[] ret = new int[entries.size()];
        int i = 0;
        for (Integer index : entries.keySet()) {
            ret[i++] = index;
        }
        return ret;
    }

    public synchronized String getAlias(int index) {
        TupleFilterEntry entry = entries.get(index);
        if (entry == null) {
            throw new IllegalArgumentException("Unrecognized field index");
        }
        return entry.alias;
    }

    public synchronized Filter getFilter(int index) {
        TupleFilterEntry entry = entries.get(index);
        if (entry == null) {
            throw new IllegalArgumentException("Unrecognized field index");
        }
        return entry.filter;
    }
}

class MapFilter implements Filter {
    // Key -> nested filter (null means no further filtering on that key).
    Map<String, Filter> entries;

    public MapFilter() {
        entries = new TreeMap<String, Filter>();
    }

    /**
     * Convenience constructor for simple key-matching filtering.
     *
     * @param keys
     *          The keys of interest
     */
    public MapFilter(String... keys) {
        this();
        add(keys);
    }

    /**
     * Add keys to the set of keys of interest without further filtering.
     *
     * @param keys
     *          The keys of interest
     */
    public void add(String... keys) {
        add(null, keys);
    }

    /**
     * Add keys to the set of keys of interest with further filtering.
     *
     * @param f
     *          The filter
     * @param keys
     *          The keys
     */
    public synchronized void add(Filter f, String... keys) {
        for (String k : keys) {
            entries.put(k, f);
        }
    }

    @Override
    public byte getType() {
        return DataType.MAP;
    }

    public synchronized Map<String, Filter> getKeyFilterMapping() {
        return entries;
    }
}

class BagFilter implements Filter {
    Filter filter;

    public BagFilter(TupleFilter filter) {
        this.filter = filter;
    }

    @Override
    public byte getType() {
        return DataType.BAG;
    }

    public Filter getTupleFilter() {
        return filter;
    }
}
{code}

> Make fieldsToRead work in loader
>
> Key: PIG-653
> URL: https://issues.apache.org/jira/browse/PIG-653
> Project: Pig
> Issue Type: New Feature
> Reporter: Alan Gates
> Assignee: Pradeep Kamath
>
> Currently pig does not call the fieldsToRead function in LoadFunc, thus it
> does not provide information to load functions on what fields are needed. We
> need to implement a visitor that determines (where possible) which fields in
> a file will be used and relays that information to the load function.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
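The point of the nested-filter idea in the comment above is that a bag of maps needs no dedicated type: it can be expressed by composing a bag filter, a tuple filter and a map filter. The following is only a rough sketch of that composition. It uses simplified, hypothetical stand-in classes (FieldFilter, MapKeys, TupleFields, BagOf, FilterSketch) rather than the proposed Filter classes, so that it compiles without Pig on the classpath; none of these names come from Pig or from the proposal.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical, simplified stand-ins for the proposed Filter classes.
// They do not depend on Pig and exist only to illustrate the nesting.
interface FieldFilter {}

final class MapKeys implements FieldFilter {
    final List<String> keys;
    MapKeys(String... keys) { this.keys = Arrays.asList(keys); }
}

final class TupleFields implements FieldFilter {
    // Field index -> nested filter (null means the whole field is wanted).
    final SortedMap<Integer, FieldFilter> fields = new TreeMap<>();
    TupleFields add(int index, FieldFilter nested) {
        fields.put(index, nested);
        return this;
    }
}

final class BagOf implements FieldFilter {
    final TupleFields tuple;
    BagOf(TupleFields tuple) { this.tuple = tuple; }
}

final class FilterSketch {
    // Renders a filter as a compact string so the nesting is visible.
    static String describe(FieldFilter f) {
        if (f == null) return "*";
        if (f instanceof MapKeys) return "#" + ((MapKeys) f).keys;
        if (f instanceof BagOf) return "bag" + describe(((BagOf) f).tuple);
        StringBuilder sb = new StringBuilder("(");
        for (Map.Entry<Integer, FieldFilter> e : ((TupleFields) f).fields.entrySet()) {
            if (sb.length() > 1) sb.append(", ");
            sb.append('$').append(e.getKey()).append(describe(e.getValue()));
        }
        return sb.append(')').toString();
    }

    // Field 0 in full, keys key3 and key4 of the map in field 2, and key5 of
    // each map held in the bag in field 3 (a "bag of maps" by composition).
    static FieldFilter example() {
        return new TupleFields()
            .add(0, null)
            .add(2, new MapKeys("key3", "key4"))
            .add(3, new BagOf(new TupleFields().add(0, new MapKeys("key5"))));
    }

    public static void main(String[] args) {
        System.out.println(describe(example()));
    }
}
```

The field positions (0, 2, 3) and the idea that each tuple in the bag holds its map at position 0 are assumptions made for the example only.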
[jira] Commented: (PIG-653) Make fieldsToRead work in loader
[ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672015#action_12672015 ]

Hong Tang commented on PIG-653:
---

Should subFields also have the type RequiredFieldList?

> Make fieldsToRead work in loader
>
> Key: PIG-653
> URL: https://issues.apache.org/jira/browse/PIG-653
> Project: Pig
> Issue Type: New Feature
> Reporter: Alan Gates
> Assignee: Pradeep Kamath
> Attachments: PIG-653-2.comment
>
> Currently pig does not call the fieldsToRead function in LoadFunc, thus it
> does not provide information to load functions on what fields are needed. We
> need to implement a visitor that determines (where possible) which fields in
> a file will be used and relays that information to the load function.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-653) Make fieldsToRead work in loader
[ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672020#action_12672020 ]

Pradeep Kamath commented on PIG-653:

Not so sure about using RequiredFieldList for subFields - this will mean we could ask for all subFields in two ways:
1. By just asking for the main field (this will imply we need all sub fields).
2. By asking for the main field with a subField which has its allFieldsRequired flag set to true.

I think it would be better to keep subFields as a plain list holding only THE required sub fields. RequiredFieldList is specifically being introduced to carry top level information to the loader, which may not be applicable at a field level. Thoughts?
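The convention argued for in the comment above (a null subFields list means the whole field is wanted, a non-null list names exactly the wanted sub fields) could be read by a loader roughly as follows. This is a minimal sketch, not the Pig implementation: RequiredFieldSketch is a cut-down, hypothetical stand-in for the proposed RequiredField class (the type field is omitted), and keysToKeep is an invented helper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the proposed RequiredField class; the type field
// is left out because this sketch does not need it.
final class RequiredFieldSketch {
    final String alias;                         // null if not supplied
    final int index;                            // -1 if not supplied
    final List<RequiredFieldSketch> subFields;  // null means "entire field"

    RequiredFieldSketch(String alias, int index, List<RequiredFieldSketch> subFields) {
        this.alias = alias;
        this.index = index;
        this.subFields = subFields;
    }

    // Invented helper: given the keys present in a map value, return the keys
    // a loader would keep for this required field.
    static List<String> keysToKeep(RequiredFieldSketch field, List<String> keysInData) {
        if (field.subFields == null) {
            // No sub fields listed: the entire map is required.
            return new ArrayList<>(keysInData);
        }
        List<String> kept = new ArrayList<>();
        for (RequiredFieldSketch sub : field.subFields) {
            if (keysInData.contains(sub.alias)) {
                kept.add(sub.alias);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("key3", "key4", "key9");
        RequiredFieldSketch whole = new RequiredFieldSketch("m2", -1, null);
        RequiredFieldSketch partial = new RequiredFieldSketch("m2", -1, Arrays.asList(
            new RequiredFieldSketch("key3", -1, null),
            new RequiredFieldSketch("key4", -1, null)));
        System.out.println(keysToKeep(whole, data));
        System.out.println(keysToKeep(partial, data));
    }
}
```

With this reading there is exactly one way to ask for every sub field (leave subFields null), which is the ambiguity the comment is trying to avoid.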
[jira] Commented: (PIG-653) Make fieldsToRead work in loader
[ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672176#action_12672176 ]

Hong Tang commented on PIG-653:
---

My quibble is that the interface uses null to indicate that all nested fields are required, but uses a concrete class for top-level fields. Is there any justification for why possible future extensions are only applicable to top-level fields but not to nested fields?
[jira] Commented: (PIG-653) Make fieldsToRead work in loader
[ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785926#action_12785926 ]

Yan Zhou commented on PIG-653:
--

+1

> Make fieldsToRead work in loader
>
> Key: PIG-653
> URL: https://issues.apache.org/jira/browse/PIG-653
> Project: Pig
> Issue Type: New Feature
> Reporter: Alan Gates
> Assignee: Pradeep Kamath
> Attachments: PIG-653-2.comment, PIG-653-3-proposal.txt, PIG-653.patch
>
> Currently pig does not call the fieldsToRead function in LoadFunc, thus it
> does not provide information to load functions on what fields are needed. We
> need to implement a visitor that determines (where possible) which fields in
> a file will be used and relays that information to the load function.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-653) Make fieldsToRead work in loader
[ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785928#action_12785928 ]

Hadoop QA commented on PIG-653:
---

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12426879/PIG-653.patch
against trunk revision 887049.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 97 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
-1 release audit. The applied patch generated 395 release audit warnings (more than the trunk's current 368 warnings).
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/89/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/89/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/89/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/89/console

This message is automatically generated.
[jira] Commented: (PIG-653) Make fieldsToRead work in loader
[ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785936#action_12785936 ]

Yan Zhou commented on PIG-653:
--

The 27 release audit failures are from 25 pig test scripts and 2 test data files; none of them are source files, so they should be ignored.
[jira] Commented: (PIG-653) Make fieldsToRead work in loader
[ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785937#action_12785937 ]

Yan Zhou commented on PIG-653:
--

A typo in my last comment: it should have been 27 audit *warnings*, not *failures*.
[jira] Commented: (PIG-653) Make fieldsToRead work in loader
[ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786046#action_12786046 ]

Yan Zhou commented on PIG-653:
--

Zebra changes committed to both trunk and the 6.0 branch.