Hi,

As I recall, an issue was found during testing. HDFS does not support ":"
in file names, so ideally such files shouldn't exist on HDFS, but I
remember finding a bug. Let me look for a reference. If I can't find one,
I will do some more testing around it and we can decide whether to remove
it.

-Priyanka

On Mon, May 9, 2016 at 11:37 AM, Pramod Immaneni <[email protected]>
wrote:

> I see, let's wait for their response on the colon.
>
> Thanks
>
> On Mon, May 9, 2016 at 11:34 AM, Chandni Singh <[email protected]>
> wrote:
>
> > I am adding support to FileSplitterInput for ignoring files that are
> > still being copied, that is, files that end with "_COPYING_".
> >
> > However, I don't understand setting the ignored character to ":". Why
> > would files with ":" in the name/path exist on HDFS if HDFS does not
> > support them?
> >
> > Thanks,
> > Chandni
> >
> > On Mon, May 9, 2016 at 11:29 AM, Pramod Immaneni <[email protected]>
> > wrote:
> >
> > > Chandni,
> > >
> > > I agree with your original assessment that there shouldn't be a
> > > separate operator if the new functionality falls under the
> > > "functionality domain" of the original operator, and that the features
> > > should just be added to the original operator. Based on your
> > > description, I agree with points 1, 2, and 3.
> > >
> > > However, if you delete an operator that is useful in some use cases,
> > > what is the substitute for that knowledge? For example, it looks like
> > > HDFSFileSplitter ignores some commonly present temporary files. Does
> > > everyone have to learn this themselves and figure it out?
> > >
> > > Thanks
> > >
> > > On Fri, May 6, 2016 at 4:44 PM, Chandni Singh <[email protected]>
> > > wrote:
> > >
> > > > Just saw that there is *HDFSFileSplitter* in the library as well.
> > > > This sets *ignoreFilePatternRegularExp* to ".*._COPYING_" and
> > > > *unsupportedChar* to ":".
> > > >
> > > > IMO this class should be removed as well.
> > > >
> > > > Chandni
> > > >
> > > > > On Fri, May 6, 2016 at 4:16 PM, Chandni Singh <[email protected]>
> > > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > FSFileSplitter was recently added to the Malhar library.
> > > > > > I have created
> > > > > > https://issues.apache.org/jira/browse/APEXMALHAR-2081
> > > > > > to remove this operator and add its functionality to
> > > > > > FileSplitterInput.
> > > > > >
> > > > > > The reason is that this extension adds just 3 trivial features,
> > > > > > and having a separate operator for them makes it difficult for
> > > > > > the user to know which operator to use. It also adds more
> > > > > > classes that essentially do the same thing.
> > > > > >
> > > > > > This operator adds 3 properties to FileSplitterInput.
> > > > >
> > > > > > 1. ignoreFilePatternRegularExp: regular expression that specifies
> > > > > > which files to ignore.
> > > > > > This is useful to have in FileSplitterInput.
> > > > > >
> > > > > > 2. unsupportedChar: first of all, this is a String despite its
> > > > > > name. Files having this String in their name/path will be ignored.
> > > > > > IMO this is redundant. #1 can be used to accomplish this.
> > > > > > I think this should be removed.
> > > > >
> > > > > > 3. sequentialFileReader: when this property is set, the block
> > > > > > metadata of the same file have the same hashcode. I think this
> > > > > > may have been done so that all the block metadata of a particular
> > > > > > file go to the same block reader.
> > > > > >
> > > > > > IMO this is a hacky way of accomplishing this. If an application
> > > > > > needs this, then it should be done using a StreamCodec.
> > > > > >
> > > > > > I think this should be removed.
> > > > >
> > > > > Thanks,
> > > > > Chandni
> > > > >
> > > >
> > >
> >
>
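
For reference, point 2 in the original proposal (that the ignore regex can
cover what unsupportedChar does) can be sketched in plain Java. This is only
an illustration, not the Malhar implementation: the class and method names
below are hypothetical, the dot before _COPYING_ is escaped here although the
thread's pattern leaves it unescaped, and the check is assumed to run against
the file name rather than a full URI (which would itself contain ":").

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Malhar: shows one regex covering both cases.
public class IgnorePatternSketch {

    // First alternative: names ending in "._COPYING_" (files still being copied).
    // Second alternative: names containing ':' anywhere.
    static final Pattern IGNORE = Pattern.compile(".*\\._COPYING_|.*:.*");

    // Returns true when the whole file name matches the ignore pattern.
    static boolean shouldIgnore(String fileName) {
        return IGNORE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldIgnore("part-0001._COPYING_")); // in-flight copy
        System.out.println(shouldIgnore("report:2016.txt"));     // contains ':'
        System.out.println(shouldIgnore("part-0001"));           // regular file
    }
}
```

With a pattern like this, a single regex property would be enough, and a
separate unsupportedChar property adds nothing.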
