Make a JIRA and attach the patch, please.

2012/11/19 pablomar <pablo.daniel.marti...@gmail.com>

> hi all,
>
> I did it as simple as I could. What about this changes ?
>
>
> PigStorage.java
> original:
>     private void readField(byte[] buf, int start, int end) {
>         if (start == end) {
>             // NULL value
>             mProtoTuple.add(null);
>         } else {
>             mProtoTuple.add(new DataByteArray(buf, start, end));
>         }
>     }
>
>
> new:
>     protected void addToCurrentTuple(DataByteArray data) {
>             mProtoTuple.add(data);
>     }
>
>     protected void readField(byte[] buf, int start, int end) {
>         if (start == end) {
>             // NULL value
>             addToCurrentTuple(null);
>         } else {
>             addToCurrentTuple(new DataByteArray(buf, start, end));
>         }
>     }
>
>
>
> so, what's the advantage ?
> with the new version, if I want to extend PigStorage just to override
> readField, I do something like:
>
> import org.apache.pig.builtin.PigStorage;
> import org.apache.pig.data.DataByteArray;
>
> public class MyLoader extends PigStorage {
>   public MyLoader() {
>     this("\t");
>   }
>
>   public MyLoader(String delimiter) {
>     super(delimiter);
>   }
>
>   @Override
>   protected void readField(byte[] buf, int start, int end) {
>     // remove the field name (example: min=, max=)
>     for (int i=start; i<=end;i++) {
>       if (buf[i] == '=') {
>         start = i + 1;
>         break;
>       }
>     }
>
>     if (start == end) {
>       // NULL value
>       addToCurrentTuple(null);
>     } else {
>       addToCurrentTuple(new DataByteArray(buf, start, end));
>     }
>   }
> }
>
> currently, just to override readField, I also need to copy/paste a lot from
> PigStorage:
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.Properties;
> import org.apache.hadoop.io.Text;
> import org.apache.pig.backend.executionengine.ExecException;
> import org.apache.pig.builtin.PigStorage;
> import org.apache.pig.PigException;
> import org.apache.pig.data.DataByteArray;
> import org.apache.pig.data.Tuple;
> import org.apache.pig.data.TupleFactory;
> import org.apache.pig.impl.util.ObjectSerializer;
> import org.apache.pig.impl.util.StorageUtil;
> import org.apache.pig.impl.util.UDFContext;
>
> public class MyLoaderextends PigStorage {
>   private byte fieldDel = '\t';
>   private ArrayList<Object> mProtoTuple = null;
>   private TupleFactory mTupleFactory = TupleFactory.getInstance();
>   private boolean mRequiredColumnsInitialized = false;
>
>   public MyLoader() {
>     this("\t");
>   }
>
>   public MyLoader(String delimiter) {
>     super(delimiter);
>     fieldDel = StorageUtil.parseFieldDel(delimiter);
>   }
>
>   // exactly the same than in PigStorage, but I needed to modify readField
> which is private, so I have to copy all of it
>     @Override
>     public Tuple getNext() throws IOException {
>         mProtoTuple = new ArrayList<Object>();
>         if (!mRequiredColumnsInitialized) {
>             if (signature!=null) {
>                 Properties p =
> UDFContext.getUDFContext().getUDFProperties(this.getClass());
>                 mRequiredColumns =
> (boolean[])ObjectSerializer.deserialize(p.getProperty(signature));
>             }
>             mRequiredColumnsInitialized = true;
>         }
>         try {
>             boolean notDone = in.nextKeyValue();
>             if (!notDone) {
>                 return null;
>
> }
>
>             Text value = (Text) in.getCurrentValue();
>             byte[] buf = value.getBytes();
>             int len = value.getLength();
>             int start = 0;
>             int fieldID = 0;
>             for (int i = 0; i < len; i++) {
>                 if (buf[i] == fieldDel) {
>                     if (mRequiredColumns==null ||
> (mRequiredColumns.length>fieldID && mRequiredColumns[fieldID]))
>                         readField(buf, start, i);
>                     start = i + 1;
>                     fieldID++;
>                 }
>             }
>             // pick up the last field
>             if (start <= len && (mRequiredColumns==null ||
> (mRequiredColumns.length>fieldID && mRequiredColumns[fieldID]))) {
>                 readField(buf, start, len);
>             }
>             Tuple t =  mTupleFactory.newTupleNoCopy(mProtoTuple);
>             return t;
>         } catch (InterruptedException e) {
>             int errCode = 6018;
>             String errMsg = "Error while reading input";
>             throw new ExecException(errMsg, errCode,
>                     PigException.REMOTE_ENVIRONMENT, e);
>         }
>
>     }
>
>   private void readField(byte[] buf, int start, int end) {
>     // remove the field name (example: min=, max=)
>     for (int i=start; i<=end;i++) {
>       if (buf[i] == '=') {
>         start = i + 1;
>         break;
>       }
>     }
>
>     if (start == end) {
>       // NULL value
>       mProtoTuple.add(null);
>     } else {
>       mProtoTuple.add(new DataByteArray(buf, start, end));
>     }
>   }
> }
>
>
> On Mon, Nov 19, 2012 at 12:24 PM, pablomar
> <pablo.daniel.marti...@gmail.com>wrote:
>
> > sure. My initial (and dirty) idea changed only 2 lines. I completely
> agree
> > with you
> >
> >
> >
> >
> >
> > On Mon, Nov 19, 2012 at 12:16 PM, Bill Graham <billgra...@gmail.com
> >wrote:
> >
> >> +1 as well, but I'd suggest we do the following:
> >>
> >> - Keep mProtoTuple private and add protected getters/setters instead
> with
> >> javadocs describing expected usage.
> >> - Rename mProtoTuple and the getters/setters to something more
> descriptive
> >> than mProtoTuple.
> >>
> >>
> >> On Fri, Nov 16, 2012 at 2:15 PM, Dmitriy Ryaboy <dvrya...@gmail.com>
> >> wrote:
> >>
> >> > That sounds reasonable, I've run into the same problem. Do you mind
> >> > submitting a patch?
> >> >
> >> > On Fri, Nov 16, 2012 at 12:48 PM, pablomar
> >> > <pablo.daniel.marti...@gmail.com> wrote:
> >> > > hi all,
> >> > >
> >> > > I'm using Pig 0.9.2 (Apache Pig version 0.9.2-cdh4.0.1, precisely)
> >> > > I got a case today on which I needed to clean up some fields before
> >> > > processing. I will need to do the same for all my scripts. So
> instead
> >> of
> >> > > doing it inside the scripts, I thought in extending PigStorage and
> do
> >> it
> >> > > inside my own Loader. My scripts will be shorter and cleaner
> >> > >
> >> > > in fact, the only method that I needed to overwrite was :
> >> > > void *readField*(byte[] buf, int start, int end)
> >> > >
> >> > >
> >> > > Everything was ok and it is working. Problem was that I had to
> >> > copy/paste a
> >> > > lot just because private declarations
> >> > > for example:
> >> > >   private byte fieldDel = '\t';
> >> > >   private ArrayList<Object> mProtoTuple = null;
> >> > >   private TupleFactory mTupleFactory = TupleFactory.getInstance();
> >> > >   private boolean mRequiredColumnsInitialized = false;
> >> > >
> >> > > and of course:
> >> > > *private *void readField(byte[] buf, int start, int end)
> >> > >
> >> > > so I had to copy/paste:
> >> > > public Tuple getNext() and all the aforementioned variables just to
> be
> >> > able
> >> > > to write my own *readField*
> >> > >
> >> > >
> >> > > would it be possible in next versions of Pig to have *readField
> >> > *protected
> >> > > as well as *mProtoTuple *? I think it could be useful in some cases
> >> like
> >> > > mine
> >> > > I'm asking because I don't know the reasoning after the decisions of
> >> made
> >> > > them private
> >> > >
> >> > > thanks a lot,
> >> >
> >>
> >>
> >>
> >> --
> >> *Note that I'm no longer using my Yahoo! email address. Please email me
> at
> >> billgra...@gmail.com going forward.*
> >>
> >
> >
>

Reply via email to