I am adding support to the FileSplitterInput for ignoring files that are
still being copied, that is, files whose names end with "_COPYING_".
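
A rough sketch of the kind of filter this amounts to, assuming a plain
java.util.regex pattern. The class and method names below are made up for
illustration and are not the Malhar API, and the dot before "_COPYING_" is
escaped here, unlike the ".*._COPYING_" value quoted further down:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Malhar: decides whether a scanned path
// should be skipped because it looks like an in-progress copy.
class CopyingFileFilter {
    // Assumed pattern: any path that ends with the literal suffix "._COPYING_".
    private static final Pattern IGNORE = Pattern.compile(".*\\._COPYING_");

    // Returns true when the whole path matches the ignore pattern.
    static boolean shouldIgnore(String path) {
        return IGNORE.matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldIgnore("/data/in/part-0001._COPYING_"));
        System.out.println(shouldIgnore("/data/in/part-0001"));
    }
}
```

Since matches() has to consume the whole path, a file that merely contains
"_COPYING_" somewhere in the middle of its name is not skipped.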

However, I don't understand why the unsupported character is set to ":".
Why would files with ":" in the name/path exist on HDFS if these are
unsupported by HDFS?

Thanks,
Chandni

On Mon, May 9, 2016 at 11:29 AM, Pramod Immaneni <[email protected]>
wrote:

> Chandni,
>
> I agree with your original assessment that there shouldn't be a separate
> operator if the new functionality falls under the "functionality domain" of
> the original operator and the features should just be added to the original
> operator. Based on your description, I agree with points 1, 2, and 3.
>
> However, if you delete an operator that is useful in some use cases, what is
> the substitute for that knowledge? For example, it looks like the
> HDFSFileSplitter ignores some commonly present temporary files. Does
> everyone have to learn this and figure it out themselves?
>
> Thanks
>
> On Fri, May 6, 2016 at 4:44 PM, Chandni Singh <[email protected]>
> wrote:
>
> > Just saw that there is an *HDFSFileSplitter* in the library as well.
> > This sets *ignoreFilePatternRegularExp* to ".*._COPYING_" and
> > *unsupportedChar* to ":".
> >
> > IMO this class should be removed as well.
> >
> > Chandni
> >
> > On Fri, May 6, 2016 at 4:16 PM, Chandni Singh <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > Recently an FSFileSplitter was added to the Malhar library.
> > > I have created https://issues.apache.org/jira/browse/APEXMALHAR-2081 to
> > > remove this operator and add its functionality to the FileSplitterInput.
> > >
> > > The reason is that this extension just adds 3 trivial features, which
> > > makes it difficult for the user to know which operator to use. It
> > > adds more classes which essentially do the same thing.
> > >
> > > This operator adds 3 properties to FileSplitterInput.
> > >
> > > 1. ignoreFilePatternRegularExp: regular expression that specifies which
> > > files to ignore.
> > > This is useful to have in the FileSplitterInput.
> > >
> > > 2. unsupportedChar: first of all, this is a String. A file whose
> > > name/path contains this String will be ignored.
> > > IMO this is redundant. #1 can be used to accomplish this.
> > > I think this should be removed.
> > >
> > > 3. sequentialFileReader: when this property is set, the block metadata of
> > > the same file have the same hashcode. I think this may have been done so
> > > that all the block metadata of a particular file go to the same block
> > > reader.
> > >
> > > IMO this is a hacky way of accomplishing this. If an application needs
> > > this, then it should be done using a StreamCodec.
> > >
> > > I think this should be removed.
> > >
> > > Thanks,
> > > Chandni
> > >
> >
>
