[jira] Commented: (PIG-653) Make fieldsToRead work in loader

2009-02-06 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671355#action_12671355
 ] 

Pradeep Kamath commented on PIG-653:


Interface for passing required fields information to the loader
Proposal
Two new classes will be introduced in the API call to the loader for passing 
information about required fields.
{code}
class RequiredField {
    // Name of the field (null if not supplied).
    String alias;

    // Index (position) of the required field, 0 based (-1 if not supplied).
    int index;

    // A list of sub fields in this field (this could be a list of hash keys,
    // for example). Null if the entire field is required and no specific sub
    // fields are required. In the initial implementation only one level of
    // sub fields will be populated.
    List<RequiredField> subFields;

    // Type of this field: any current Pig DataType, as specified by the
    // constants in the DataType class. A new type BAG_OF_MAP will be added to
    // represent a bag of maps field.
    byte type;

    // Constructor, getters, and setters follow.
    // Getters are getAlias(), getIndex(), getSubFields(), getType()
    // Setters are setAlias(), setIndex(), setSubFields(), setType()
}
{code}

NOTE: Both alias and index could be set. The index is the position of the 
field as Pig would see it if the loader sent all fields.

For performance, when a single key in a map is requested, the loader should 
ideally return a map with just that key. Likewise, when the required field is 
a key in a bag of maps field, the expected value from the loader is a bag of 
maps in which each map contains that key (preferably only that key, since this 
reduces the data handed over by the loader).
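The pruning described above could look roughly like the following. This is only a sketch: SubFieldPruner, pruneMap and pruneBagOfMaps are hypothetical names, not part of the Pig API, and a bag of maps is modeled here as a plain List of Maps rather than Pig's own data types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Pig API: illustrates the pruning a
// loader could apply before handing data to Pig.
final class SubFieldPruner {
    private SubFieldPruner() {}

    // Returns a new map holding only the requested keys that are present.
    static Map<String, Object> pruneMap(Map<String, Object> full, List<String> keys) {
        Map<String, Object> pruned = new HashMap<String, Object>();
        for (String key : keys) {
            if (full.containsKey(key)) {
                pruned.put(key, full.get(key));
            }
        }
        return pruned;
    }

    // Applies the same pruning to every map in a bag of maps
    // (assumption: the bag is modeled as a List of Maps).
    static List<Map<String, Object>> pruneBagOfMaps(List<Map<String, Object>> bag,
                                                    List<String> keys) {
        List<Map<String, Object>> out = new ArrayList<Map<String, Object>>();
        for (Map<String, Object> map : bag) {
            out.add(pruneMap(map, keys));
        }
        return out;
    }
}
```

For a request such as m1#key1, the loader would call pruneMap with the single key key1 and hand back the smaller map.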

{code}
class RequiredFieldResponse {
    // True if the loader will return a schema containing only the List of
    // RequiredFields, in that order. False if the loader will return all
    // fields in the data.
    boolean requiredFieldRequestHonored;
}
{code}

The reason we have a RequiredFieldResponse class encapsulating the boolean is 
to allow for future extensibility. For example, in the future the loader may be 
able to honor all top level field requests but not subfields in hashes. So it 
may hand back top level maps in return for sub field requests. The loader will 
then need to tell the caller which fields will be returned exactly the way 
they were requested and which will be sent as top level fields (even though 
the request was for subfields). For the first pass, though, it is all or none, 
conveyed through the boolean.

The API call in LoadFunc will change from 
{code}
void fieldsToRead(Schema schema) 
{code}
to
{code}
RequiredFieldResponse fieldsToRead(List<RequiredField> requiredFields, 
boolean allFieldsRequired);
{code}

NOTE: 
1.  It is expected that the loader returns the required fields in exactly 
the same order as in the List provided in the above call.
2.  The boolean flag allFieldsRequired is set to true when all fields are 
required. The loader should first check this flag and use the 
List ONLY if this flag is false.
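To make the calling convention concrete, here is a minimal sketch of how a loader might implement the new call. RequiredField and RequiredFieldResponse are cut-down stand-ins for the proposed classes, and PositionalLoaderSketch, project() and isRequiredFieldRequestHonored() are hypothetical names used only for illustration. The proposal does not spell out what the response should be when allFieldsRequired is true; this sketch returns false in that case because all fields are returned.

```java
import java.util.List;

// Cut-down stand-in for the proposed RequiredField (only what is used here).
class RequiredField {
    private final String alias;
    private final int index;
    RequiredField(String alias, int index) { this.alias = alias; this.index = index; }
    String getAlias() { return alias; }
    int getIndex() { return index; }
}

// Cut-down stand-in for the proposed RequiredFieldResponse.
class RequiredFieldResponse {
    private final boolean requiredFieldRequestHonored;
    RequiredFieldResponse(boolean honored) { this.requiredFieldRequestHonored = honored; }
    boolean isRequiredFieldRequestHonored() { return requiredFieldRequestHonored; }
}

// Hypothetical loader over positional rows; not an actual Pig LoadFunc.
class PositionalLoaderSketch {
    private int[] projection; // null means "return every column"

    RequiredFieldResponse fieldsToRead(List<RequiredField> requiredFields,
                                       boolean allFieldsRequired) {
        projection = null;
        // Check the flag first; the list is consulted only when it is false.
        if (allFieldsRequired) {
            return new RequiredFieldResponse(false); // assumption: all fields returned
        }
        int[] wanted = new int[requiredFields.size()];
        for (int i = 0; i < wanted.length; i++) {
            int index = requiredFields.get(i).getIndex();
            if (index < 0) {
                // Alias-only request: this loader cannot resolve it, so it
                // declines the whole request (all or none).
                return new RequiredFieldResponse(false);
            }
            wanted[i] = index;
        }
        projection = wanted;
        return new RequiredFieldResponse(true);
    }

    // Returns the columns in exactly the order of the requested list.
    Object[] project(Object[] row) {
        if (projection == null) {
            return row.clone();
        }
        Object[] out = new Object[projection.length];
        for (int i = 0; i < projection.length; i++) {
            out[i] = row[projection[i]];
        }
        return out;
    }
}
```

A request for fields at positions 2 and 0, in that order, would make project() return the third column followed by the first.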

Use Cases
=

Use Cases which only use aliases

{noformat}
1.  Required fields are columns x (int), y (long)
[
    {
        alias => x,
        index => -1,
        subfields => null,
        type => DataType.INTEGER
    },
    {
        alias => y,
        index => -1,
        subfields => null,
        type => DataType.LONG
    }
]

2.  Required fields are m1#key1 (map subcolumn), b1#key2 (subcolumn from a bag of maps)
[
    {
        alias => m1,
        index => -1,
        subfields => [
            {
                alias => key1,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            }
        ],
        type => DataType.MAP
    },
    {
        alias => b1,
        index => -1,
        subfields => [
            {
                alias => key2,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            }
        ],
        type => DataType.BAG_OF_MAP
    }
]

3.  Required fields are m2#(key3, key4) (map subcolumns), b2#(key5, key6) (subcolumns from bag of maps)
[
    {
        alias => m2,
        index => -1,
        subfields => [
            {
                alias => key3,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            },
            {
                alias => key4,
                index => -1,
                subfields => null, // only one sublevel in the initial implementation, so this has to be null!
                type => DataType.BYTEARRAY
            }
        ],
        type => DataType.MAP
    },
    {
        alias => b2,
        index => -1,
        subfields => [
            {
                alias => key5,
                index => -1,
                subfields => null, // only one sublevel in the initial implem

[jira] Commented: (PIG-653) Make fieldsToRead work in loader

2009-02-06 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671371#action_12671371
 ] 

Hong Tang commented on PIG-653:
---

I don't like the idea of adding a BAG_OF_MAP type. It really is a composite of 
two existing types: a BAG of MAPs.

Here is another idea I came up with and briefly discussed with Pradeep.

{code}
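import java.util.Iterator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.pig.data.DataType;

public interface Filter {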
  /**
   * Return the actual type of the filter. It can then be downcast to the
   * actual Filter.
   * 
   * @return one of the following constants defined in DataType: TUPLE, BAG, and
   * MAP
   */
  byte getType();
}

class TupleFilter implements Filter {
  private static class TupleFilterEntry {
    String alias;
    Filter filter;

    TupleFilterEntry(String a, Filter f) {
      alias = a;
      filter = f;
    }
  }

  // Keyed by field index, so iteration is in ascending index order.
  SortedMap<Integer, TupleFilterEntry> entries;

  @Override
  public byte getType() { return DataType.TUPLE; }

  public TupleFilter() {
    entries = new TreeMap<Integer, TupleFilterEntry>();
  }

  /**
   * Convenience constructor for simple position-based filtering.
   * @param indices
   *  The indices of the fields we are interested in
   */
  public TupleFilter(int... indices) {
    entries = new TreeMap<Integer, TupleFilterEntry>();
    for (int i : indices) {
      entries.put(i, new TupleFilterEntry(null, null));
    }
  }

  /**
   * Adding an entry into the filter. (Building the filter.)
   * 
   * @param index
   *  The index of the field we are interested in
   * @param alias
   *  The alias name of the field, optional
   * @param filter
   *  Further filtering on the field, null means no more nested filter.
   */
  public synchronized void add(int index, String alias, Filter filter) {
    entries.put(index, new TupleFilterEntry(alias, filter));
  }

  /**
   * Get the interested fields.
   * 
   * @return The indices to the interested fields, sorted in ascending order.
   */
  public synchronized int[] getFields() {
    int[] ret = new int[entries.size()];
    int i = 0;
    for (Iterator<Integer> it = entries.keySet().iterator(); it.hasNext(); ++i) {
      ret[i] = it.next();
    }
    return ret;
  }

  public synchronized String getAlias(int index) {
    TupleFilterEntry entry = entries.get(index);
    if (entry == null) {
      throw new IllegalArgumentException("Unrecognized field index");
    }
    return entry.alias;
  }

  public synchronized Filter getFilter(int index) {
    TupleFilterEntry entry = entries.get(index);
    if (entry == null) {
      throw new IllegalArgumentException("Unrecognized field index");
    }
    return entry.filter;
  }
}

class MapFilter implements Filter {
  // Maps each interested key to its nested filter (null means no nested filter).
  Map<String, Filter> entries;

  public MapFilter() {
    entries = new TreeMap<String, Filter>();
  }

  /**
   * Convenience constructor for simple key matching filtering.
   * 
   * @param keys
   *  interested keys
   */
  public MapFilter(String... keys) {
    this();
    add(keys);
  }

  /**
   * Adding keys to the interested key set without further filtering.
   * 
   * @param keys
   *  interested keys.
   */
  public void add(String... keys) {
    add(null, keys);
  }

  /**
   * Adding keys to the interested key set with further filtering
   * 
   * @param f
   *  The filter
   * @param keys
   *  the keys
   */
  public synchronized void add(Filter f, String... keys) {
    for (String k : keys) {
      entries.put(k, f);
    }
  }

  @Override
  public byte getType() {
    return DataType.MAP;
  }

  public synchronized Map<String, Filter> getKeyFilterMapping() {
    return entries;
  }
}

class BagFilter implements Filter {
  Filter filter;

  public BagFilter(TupleFilter filter) {
    this.filter = filter;
  }

  @Override
  public byte getType() {
    return DataType.BAG;
  }

  public Filter getTupleFilter() {
    return filter;
  }
}
{code}
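As an illustration of how this composite approach could express a bag of maps without a BAG_OF_MAP type, the sketch below builds a filter tree for a request like m1#key1, b1#key2 and walks it by dispatching on getType(). It uses condensed stand-ins for the classes above so that it compiles without Pig: the Types constants are placeholder values rather than Pig's DataType constants, the add() signature is shortened, and FilterPaths, describe() and the path notation are hypothetical. Placing m1 at position 0 and b1 at position 1 is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

interface Filter { byte getType(); }

// Placeholder type codes; NOT the values of Pig's DataType constants.
final class Types { static final byte MAP = 1, TUPLE = 2, BAG = 3; }

class MapFilter implements Filter {
    final Map<String, Filter> entries = new TreeMap<String, Filter>();
    MapFilter(String... keys) { for (String k : keys) entries.put(k, null); }
    public byte getType() { return Types.MAP; }
}

class TupleFilter implements Filter {
    final SortedMap<Integer, Filter> entries = new TreeMap<Integer, Filter>();
    // Shortened, chainable variant of add(); a null nested filter means
    // "the whole field".
    TupleFilter add(int index, Filter nested) { entries.put(index, nested); return this; }
    public byte getType() { return Types.TUPLE; }
}

class BagFilter implements Filter {
    final TupleFilter filter;
    BagFilter(TupleFilter filter) { this.filter = filter; }
    public byte getType() { return Types.BAG; }
}

// Hypothetical consumer: lists every requested path in the filter tree.
final class FilterPaths {
    static List<String> describe(Filter root) {
        List<String> out = new ArrayList<String>();
        collect(root, "", out);
        return out;
    }

    private static void collect(Filter f, String prefix, List<String> out) {
        if (f == null) { // no nested filter: the path ends here
            out.add(prefix);
            return;
        }
        switch (f.getType()) {
        case Types.MAP:
            for (Map.Entry<String, Filter> e : ((MapFilter) f).entries.entrySet()) {
                collect(e.getValue(), prefix + "#" + e.getKey(), out);
            }
            break;
        case Types.TUPLE:
            for (Map.Entry<Integer, Filter> e : ((TupleFilter) f).entries.entrySet()) {
                collect(e.getValue(), prefix + "$" + e.getKey(), out);
            }
            break;
        case Types.BAG:
            collect(((BagFilter) f).filter, prefix + "{}", out);
            break;
        default:
            throw new IllegalArgumentException("Unknown filter type");
        }
    }
}
```

With this modeling, the bag of maps request is a BagFilter wrapping a TupleFilter whose field 0 carries a MapFilter on key2, so the nesting is described by composing the three existing types.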

> Make fieldsToRead work in loader
> 
>
> Key: PIG-653
> URL: https://issues.apache.org/jira/browse/PIG-653
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Pradeep Kamath
>
> Currently pig does not call the fieldsToRead function in LoadFunc, thus it 
> does not provide information to load functions on what fields are needed.  We 
> need to implement a visitor that determines (where possible) which fields in 
> a file will be used and relays that information to the load function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-653) Make fieldsToRead work in loader

2009-02-09 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672015#action_12672015
 ] 

Hong Tang commented on PIG-653:
---

Should subFields also have the type RequiredFieldList?




[jira] Commented: (PIG-653) Make fieldsToRead work in loader

2009-02-09 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672020#action_12672020
 ] 

Pradeep Kamath commented on PIG-653:


Not so sure about using RequiredFieldList for subFields - this will mean we 
could ask for all subFields in two ways - 
1. By just asking for the main field (this will imply we need all sub fields)
2. By asking for the main field with a subField which has its allFieldsRequired 
flag set to true.

I think it would be better to keep the subFields as only THE required subfields 
represented as a list. RequiredFieldList is specifically being introduced to 
handle top level information to be given to the loader which may not be 
applicable at a field level. 

Thoughts?




[jira] Commented: (PIG-653) Make fieldsToRead work in loader

2009-02-09 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672176#action_12672176
 ] 

Hong Tang commented on PIG-653:
---

My quibble is that the interface uses null to indicate that all nested fields 
are required, but uses a concrete class for top-level fields. Is there any 
justification for why possible future extensions are applicable only to 
top-level fields and not to nested fields?




[jira] Commented: (PIG-653) Make fieldsToRead work in loader

2009-12-04 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785926#action_12785926
 ] 

Yan Zhou commented on PIG-653:
--

+1




[jira] Commented: (PIG-653) Make fieldsToRead work in loader

2009-12-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785928#action_12785928
 ] 

Hadoop QA commented on PIG-653:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12426879/PIG-653.patch
  against trunk revision 887049.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 97 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 395 release audit warnings 
(more than the trunk's current 368 warnings).

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/89/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/89/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/89/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/89/console

This message is automatically generated.




[jira] Commented: (PIG-653) Make fieldsToRead work in loader

2009-12-04 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785936#action_12785936
 ] 

Yan Zhou commented on PIG-653:
--

The 27 release audit failures are from 25 pig test scripts and 2 test data 
files; none of them are source files, so they should be ignored.




[jira] Commented: (PIG-653) Make fieldsToRead work in loader

2009-12-04 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785937#action_12785937
 ] 

Yan Zhou commented on PIG-653:
--

A typo in my last comment: it should have been 27 audit *warnings*, not *failures*.




[jira] Commented: (PIG-653) Make fieldsToRead work in loader

2009-12-04 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786046#action_12786046
 ] 

Yan Zhou commented on PIG-653:
--

Zebra changes committed to both trunk and the 6.0 branch.

> Make fieldsToRead work in loader
> 
>
> Key: PIG-653
> URL: https://issues.apache.org/jira/browse/PIG-653
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Pradeep Kamath
> Attachments: PIG-653-2.comment, PIG-653-3-proposal.txt, PIG-653.patch
>
>
> Currently pig does not call the fieldsToRead function in LoadFunc, thus it 
> does not provide information to load functions on what fields are needed.  We 
> need to implement a visitor that determines (where possible) which fields in 
> a file will be used and relays that information to the load function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.