Make a JIRA and attach the patch, please.
2012/11/19 pablomar <pablo.daniel.marti...@gmail.com> > hi all, > > I did it as simple as I could. What about this changes ? > > > PigStorage.java > original: > private void readField(byte[] buf, int start, int end) { > if (start == end) { > // NULL value > mProtoTuple.add(null); > } else { > mProtoTuple.add(new DataByteArray(buf, start, end)); > } > } > > > new: > protected void addToCurrentTuple(DataByteArray data) { > mProtoTuple.add(data); > } > > protected void readField(byte[] buf, int start, int end) { > if (start == end) { > // NULL value > addToCurrentTuple(null); > } else { > addToCurrentTuple(new DataByteArray(buf, start, end)); > } > } > > > > so, what's the advantage ? > with the new version, if I want to extend PigStorage just to override > readField, I do something like: > > import org.apache.pig.builtin.PigStorage; > import org.apache.pig.data.DataByteArray; > > public class MyLoader extends PigStorage { > public MyLoader() { > this("\t"); > } > > public MyLoader(String delimiter) { > super(delimiter); > } > > @Override > protected void readField(byte[] buf, int start, int end) { > // remove the field name (example: min=, max=) > for (int i=start; i<=end;i++) { > if (buf[i] == '=') { > start = i + 1; > break; > } > } > > if (start == end) { > // NULL value > addToCurrentTuple(null); > } else { > addToCurrentTuple(new DataByteArray(buf, start, end)); > } > } > } > > currently, just to override readField, I also need to copy/paste a lot from > PigStorage: > > import java.io.IOException; > import java.util.ArrayList; > import java.util.Properties; > import org.apache.hadoop.io.Text; > import org.apache.pig.backend.executionengine.ExecException; > import org.apache.pig.builtin.PigStorage; > import org.apache.pig.PigException; > import org.apache.pig.data.DataByteArray; > import org.apache.pig.data.Tuple; > import org.apache.pig.data.TupleFactory; > import org.apache.pig.impl.util.ObjectSerializer; > import org.apache.pig.impl.util.StorageUtil; > import org.apache.pig.impl.util.UDFContext; > > public class MyLoaderextends PigStorage { > private byte fieldDel = '\t'; > private ArrayList<Object> mProtoTuple = null; > private TupleFactory mTupleFactory = TupleFactory.getInstance(); > private boolean mRequiredColumnsInitialized = false; > > public MyLoader() { > this("\t"); > } > > public MyLoader(String delimiter) { > super(delimiter); > fieldDel = StorageUtil.parseFieldDel(delimiter); > } > > // exactly the same than in PigStorage, but I needed to modify readField > which is private, so I have to copy all of it > @Override > public Tuple getNext() throws IOException { > mProtoTuple = new ArrayList<Object>(); > if (!mRequiredColumnsInitialized) { > if (signature!=null) { > Properties p = > UDFContext.getUDFContext().getUDFProperties(this.getClass()); > mRequiredColumns = > (boolean[])ObjectSerializer.deserialize(p.getProperty(signature)); > } > mRequiredColumnsInitialized = true; > } > try { > boolean notDone = in.nextKeyValue(); > if (!notDone) { > return null; > > } > > Text value = (Text) in.getCurrentValue(); > byte[] buf = value.getBytes(); > int len = value.getLength(); > int start = 0; > int fieldID = 0; > for (int i = 0; i < len; i++) { > if (buf[i] == fieldDel) { > if (mRequiredColumns==null || > (mRequiredColumns.length>fieldID && mRequiredColumns[fieldID])) > readField(buf, start, i); > start = i + 1; > fieldID++; > } > } > // pick up the last field > if (start <= len && (mRequiredColumns==null || > (mRequiredColumns.length>fieldID && mRequiredColumns[fieldID]))) { > readField(buf, start, len); > } > Tuple t = mTupleFactory.newTupleNoCopy(mProtoTuple); > return t; > } catch (InterruptedException e) { > int errCode = 6018; > String errMsg = "Error while reading input"; > throw new ExecException(errMsg, errCode, > PigException.REMOTE_ENVIRONMENT, e); > } > > } > > private void readField(byte[] buf, int start, int end) { > // remove the field name (example: min=, max=) > for (int i=start; i<=end;i++) { > if (buf[i] == '=') { > start = i + 1; > break; > } > } > > if (start == end) { > // NULL value > mProtoTuple.add(null); > } else { > mProtoTuple.add(new DataByteArray(buf, start, end)); > } > } > } > > > On Mon, Nov 19, 2012 at 12:24 PM, pablomar > <pablo.daniel.marti...@gmail.com>wrote: > > > sure. My initial (and dirty) idea changed only 2 lines. I completely > agree > > with you > > > > > > > > > > > > On Mon, Nov 19, 2012 at 12:16 PM, Bill Graham <billgra...@gmail.com > >wrote: > > > >> +1 as well, but I'd suggest we do the following: > >> > >> - Keep mProtoTuple private and add protected getters/setters instead > with > >> javadocs describing expected usage. > >> - Rename mProtoTuple and the getters/setters to something more > descriptive > >> than mProtoTuple. > >> > >> > >> On Fri, Nov 16, 2012 at 2:15 PM, Dmitriy Ryaboy <dvrya...@gmail.com> > >> wrote: > >> > >> > That sounds reasonable, I've run into the same problem. Do you mind > >> > submitting a patch? > >> > > >> > On Fri, Nov 16, 2012 at 12:48 PM, pablomar > >> > <pablo.daniel.marti...@gmail.com> wrote: > >> > > hi all, > >> > > > >> > > I'm using Pig 0.9.2 (Apache Pig version 0.9.2-cdh4.0.1, precisely) > >> > > I got a case today on which I needed to clean up some fields before > >> > > processing. I will need to do the same for all my scripts. So > instead > >> of > >> > > doing it inside the scripts, I thought in extending PigStorage and > do > >> it > >> > > inside my own Loader. My scripts will be shorter and cleaner > >> > > > >> > > in fact, the only method that I needed to overwrite was : > >> > > void *readField*(byte[] buf, int start, int end) > >> > > > >> > > > >> > > Everything was ok and it is working. Problem was that I had to > >> > copy/paste a > >> > > lot just because private declarations > >> > > for example: > >> > > private byte fieldDel = '\t'; > >> > > private ArrayList<Object> mProtoTuple = null; > >> > > private TupleFactory mTupleFactory = TupleFactory.getInstance(); > >> > > private boolean mRequiredColumnsInitialized = false; > >> > > > >> > > and of course: > >> > > *private *void readField(byte[] buf, int start, int end) > >> > > > >> > > so I had to copy/paste: > >> > > public Tuple getNext() and all the aforementioned variables just to > be > >> > able > >> > > to write my own *readField* > >> > > > >> > > > >> > > would it be possible in next versions of Pig to have *readField > >> > *protected > >> > > as well as *mProtoTuple *? I think it could be useful in some cases > >> like > >> > > mine > >> > > I'm asking because I don't know the reasoning after the decisions of > >> made > >> > > them private > >> > > > >> > > thanks a lot, > >> > > >> > >> > >> > >> -- > >> *Note that I'm no longer using my Yahoo! email address. Please email me > at > >> billgra...@gmail.com going forward.* > >> > > > > >