How about also allowing a vararg of multiple file names for the input format?
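Purely to illustrate the shape of such an overload (all names here are hypothetical, this is not actual Flink API — a real implementation would union the per-file `DataSource`s), modeled with plain lists:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VarargUnionSketch {

    // Hypothetical shape: one mandatory input plus a vararg of further
    // inputs, all merged into a single logical source. Lists stand in
    // for real Flink DataSources; a real version would call DataSet#union.
    @SafeVarargs
    public static <T> List<T> readAll(List<T> first, List<T>... rest) {
        List<T> union = new ArrayList<>(first);
        for (List<T> source : rest) {
            union.addAll(source);
        }
        return union;
    }

    public static void main(String[] args) {
        List<String> merged = readAll(
                Arrays.asList("line-a"),
                Arrays.asList("line-b"),
                Arrays.asList("line-c"));
        System.out.println(merged); // [line-a, line-b, line-c]
    }
}
```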
We'd then have the option of:
- a file or directory
- a list of files or directories
- a base directory plus a regex that matches contained file paths

On Wed, Jul 1, 2015 at 10:13 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> +1 :)
>
> On Wed, Jul 1, 2015 at 10:08 AM, chan fentes <chanfen...@gmail.com> wrote:
>
>> Thank you all for your help and for pointing out different possibilities.
>> It would be nice to have an input format that takes a directory and a
>> regex pattern (for file names) to create one data source instead of 1500.
>> That would have helped me avoid the problem. Maybe this can be included
>> in one of the future releases. ;)
>>
>> 2015-06-30 19:02 GMT+02:00 Stephan Ewen <se...@apache.org>:
>>
>>> I agree with Aljoscha and Ufuk.
>>>
>>> As said, it will be hard for the system (currently) to handle 1500
>>> sources, but handling a parallel source with 1500 files will be very
>>> efficient. This is possible if all sources (files) deliver the same
>>> data type and can be unioned.
>>>
>>> If that is the case, you can:
>>>
>>> - Specify the input as a directory.
>>>
>>> - If you cannot do that because there is no common parent directory,
>>>   you can "union" the files into one data source with a simple trick,
>>>   as described here:
>>>   http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/open-multiple-file-from-list-of-uri-tp1804p1807.html
>>>
>>> On Tue, Jun 30, 2015 at 5:36 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>
>>>> Hi Chan,
>>>> Flink sources support giving a directory as the input path. If you do
>>>> this, Flink will read each of the files in that directory. The way you
>>>> do it leads to a very big plan, because the plan is replicated 1500
>>>> times; this could be what causes the OutOfMemoryError.
>>>>
>>>> Is there a specific reason why you create 1500 separate sources?
>>>>
>>>> Regards,
>>>> Aljoscha
>>>>
>>>> On Tue, 30 Jun 2015 at 17:17 chan fentes <chanfen...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> How many data sources can I use in one Flink plan? Is there any
>>>>> limit? I get a
>>>>> java.lang.OutOfMemoryError: unable to create new native thread
>>>>> when I have approx. 1500 files. What I basically do is the following:
>>>>> DataSource -> Map -> Map -> GroupBy -> GroupReduce per file,
>>>>> and then
>>>>> Union -> GroupBy -> Sum in a tree-like reduction.
>>>>>
>>>>> I have checked the workflow. It runs on a cluster without any problem
>>>>> if I only use a few files. Does Flink use a thread per operator? It
>>>>> seems as if I am limited in the number of threads I can use. How can
>>>>> I avoid the exception mentioned above?
>>>>>
>>>>> Best regards
>>>>> Chan
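Until such an input format exists, the "base directory + regex" selection can be done outside Flink with plain java.nio: enumerate the matching file paths first, then feed them into the union trick from Stephan's link so they become one data source instead of 1500. A minimal sketch (the class and method names are made up for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RegexFileLister {

    // Collect all regular files under baseDir whose file name matches the
    // given regex. The resulting paths could then be unioned into a single
    // parallel source instead of creating one source per file.
    public static List<String> matchingFiles(String baseDir, String fileNameRegex)
            throws IOException {
        Pattern pattern = Pattern.compile(fileNameRegex);
        try (Stream<Path> paths = Files.walk(Paths.get(baseDir))) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> pattern.matcher(p.getFileName().toString()).matches())
                    .map(Path::toString)
                    .sorted() // deterministic order for reproducible plans
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. print all CSV files under the given (or current) directory
        for (String f : matchingFiles(args.length > 0 ? args[0] : ".", ".*\\.csv")) {
            System.out.println(f);
        }
    }
}
```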