Re: Multiple directories

Munagala Ramanath Wed, 25 May 2016 09:18:50 -0700

You have 2 options: (a) AbstractFileInputOperator (b)
FileSplitter/BlockReader


For (a), each partition (i.e. replica or the operator) can scan only a
single directory, so if you have 100
directories, you can simply start with 100 partitions; since each partition
is scanning its own directory
you don't need to worry about which files the lines came from. This
approach however needs a custom
definePartition() implementation in your subclass to assign the appropriate
directory and XML parsing
config file to each partition; it also needs adequate cluster resources to
be able to spin up the required
number of partitions.

For (b), there is some documentation in the Operators section at
http://docs.datatorrent.com/ including
sample code. There operators support scanning multiple directories out of
the box but have more
elaborate configuration options. Check this out and see if it works in your
use case.

Ram

On Wed, May 25, 2016 at 8:17 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <
[email protected]> wrote:

> Hello Ram/Team,
>
>
>
> My requirement is to read input feeds from different locations on HDFS and
> parse those files by reading XML configuration files (each input feed has
> configuration file which defines the fields inside the input feeds).
>
>
>
> My approach : I would like to define a mapping file which contains
> individual feed identifier, feed location , configuration file location. I
> would like to read this mapping file at initial load within setup() method
> and define my DirectoryScan.acceptFiles. Here my challenge is when I read
> the files , I should parse the lines by reading the individual
> configuration files. How do I know the line is from particular file , if I
> know this I can read the corresponding configuration file before parsing
> the line.
>
>
>
> Please let me know how do I handle this.
>
>
>
> Regards,
>
> Surya Vamshi
>
>
>
> *From:* Munagala Ramanath [mailto:[email protected]]
> *Sent:* 2016, May, 24 5:49 PM
> *To:* Mukkamula, Suryavamshivardhan (CWM-NR)
> *Subject:* Multiple directories
>
>
>
> One way of addressing the issue is to use some sort of external tool (like
> a script) to
>
> copy all the input files to a common directory (making sure that the file
> names are
>
> unique to prevent one file from overwriting another) before the Apex
> application starts.
>
>
>
> The Apex application then starts and processes files from this directory.
>
>
>
> If you set the partition count of the file input operator to N, it will
> create N partitions and
>
> the files will be automatically distributed among the partitions. The
> partitions will work
>
> in parallel.
>
>
>
> Ram
>
> _______________________________________________________________________
>
> This [email] may be privileged and/or confidential, and the sender does
> not waive any related rights and obligations. Any distribution, use or
> copying of this [email] or the information it contains by other than an
> intended recipient is unauthorized. If you received this [email] in error,
> please advise the sender (by return [email] or otherwise) immediately. You
> have consented to receive the attached electronically at the above-noted
> address; please retain a copy of this confirmation for future reference.
>
>

Re: Multiple directories

Reply via email to