Thanks to all for your inputs. Yes, I could put all these common configurations/rules in a DB and the workers could pick them up dynamically at run time. But in that case, do the DB configuration/connection details need to be hard coded? Is there any way a worker can pick up the DB name, credentials, etc. dynamically at run time?

I am going through the feature/API documentation, but how about using a function closure and setGlobalJobParameters()/getGlobalJobParameters()?
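To make the question concrete, below is roughly the pattern I have in mind (just an untested sketch; the parameter names, the socket source, and the RuleEnricher class are placeholders I made up). The DB coordinates are supplied as program arguments at submit time, registered as global job parameters, and then read by each parallel task in open(), so nothing is hard coded in the jar:

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class DynamicConfigJob {

        public static void main(String[] args) throws Exception {
            // DB name/credentials are passed on the command line at submit time,
            // e.g. --db.host ... --db.user ..., so they are not hard coded in the jar.
            ParameterTool params = ParameterTool.fromArgs(args);

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Make the submitted parameters visible to every task via the runtime context.
            env.getConfig().setGlobalJobParameters(params);

            env.socketTextStream("localhost", 9999)   // placeholder source
               .map(new RuleEnricher())
               .print();

            env.execute("dynamic-config-example");
        }

        public static class RuleEnricher extends RichMapFunction<String, String> {
            private transient String dbHost;
            private transient String dbUser;

            @Override
            public void open(Configuration conf) throws Exception {
                // Each parallel task reads the submitted parameters here and could
                // open its own DB connection / load the rules it needs.
                ParameterTool params = (ParameterTool)
                    getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
                dbHost = params.getRequired("db.host");
                dbUser = params.get("db.user", "flink");
                // connect to the DB / load rules here ...
            }

            @Override
            public String map(String value) {
                return value + " (rules loaded from " + dbHost + " as " + dbUser + ")";
            }
        }
    }

Would that be the recommended way, or is there something better for the DataStream API?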
+Baswaraj

On Wed, Aug 24, 2016 at 5:17 PM, Maximilian Michels <m...@apache.org> wrote:
> Hi!
>
> 1. The community is working on adding side inputs to the DataStream
> API. That will allow you to easily distribute data to all of your
> workers.
>
> 2. In the meantime, you could use `.broadcast()` on a DataSet to
> broadcast data to all workers. You still have to join that data with
> another stream though.
>
> 3. The easiest method of all is to simply load your file in the
> RichMapFunction's open() method. The file can reside in a distributed
> file system which is accessible by all workers.
>
> Cheers,
> Max
>
> On Wed, Aug 24, 2016 at 6:45 AM, Jark Wu <wuchong...@alibaba-inc.com> wrote:
> > Hi,
> >
> > I think what Baswaraj wants is exactly something like the Storm Distributed
> > Cache API [1] (if I'm not misunderstanding).
> >
> > The distributed cache feature in Storm is used to efficiently distribute
> > files (or blobs, which is the equivalent terminology for a file in the
> > distributed cache and is used interchangeably in this document) that are
> > large and can change during the lifetime of a topology, such as geo-location
> > data, dictionaries, etc. Typical use cases include phrase recognition,
> > entity extraction, document classification, URL re-writing, location/address
> > detection and so forth. Such files may be several KB to several GB in size.
> > For small datasets that don't need dynamic updates, including them in the
> > topology jar could be fine. But for large files, the startup times could
> > become very large. In these cases, the distributed cache feature can provide
> > fast topology startup, especially if the files were previously downloaded
> > for the same submitter and are still in the cache. This is useful with
> > frequent deployments, sometimes a few times a day with updated jars, because
> > the large cached files will remain available without changes. The large
> > cached blobs that do not change frequently will remain available in the
> > distributed cache.
> >
> > We can look into whether this is a common use case and how to implement
> > it in Flink.
> >
> > [1] http://storm.apache.org/releases/2.0.0-SNAPSHOT/distcache-blobstore.html
> >
> > - Jark Wu
> >
> > On Aug 23, 2016, at 9:45 PM, Lohith Samaga M <lohith.sam...@mphasis.com> wrote:
> >
> > Hi,
> > Maybe you could use Cassandra to store and fetch all such reference data.
> > This way the reference data can be updated without restarting your
> > application.
> >
> > Lohith
> >
> > Sent from my Sony Xperia™ smartphone
> >
> > ---- Baswaraj Kasture wrote ----
> >
> > Thanks Kostas!
> > I am using the DataStream API.
> >
> > I have a few config/property files (key-value text files) and also some
> > business rule files (JSON).
> > These rules and configurations are needed when we process an incoming event.
> > Is there any way to share them with the task nodes from the driver program?
> > I think this is a very common use case and am sure other users face
> > similar issues.
> >
> > +Baswaraj
> >
> > On Mon, Aug 22, 2016 at 4:56 PM, Kostas Kloudas
> > <k.klou...@data-artisans.com> wrote:
> >>
> >> Hello Baswaraj,
> >>
> >> Are you using the DataSet (batch) or the DataStream API?
> >>
> >> If you are using the first, you can use a broadcast variable for your task.
> >> If you are using the DataStream one, then there is no proper support for
> >> that.
> >>
> >> Thanks,
> >> Kostas
> >>
> >> On Aug 20, 2016, at 12:33 PM, Baswaraj Kasture <kbaswar...@gmail.com>
> >> wrote:
> >>
> >> I am running a Flink standalone cluster.
> >>
> >> I have a text file that needs to be shared across tasks when I submit my
> >> application; in other words, put this text file on the class path of the
> >> running tasks.
> >>
> >> How can we achieve this with Flink?
> >>
> >> In Spark, spark-submit has a --jars option that puts all the specified files
> >> on the class path of the executors (executors run in separate JVMs and are
> >> spawned dynamically, so this is possible).
> >>
> >> Flink's task managers run tasks as separate threads inside the task manager
> >> JVM (?), so how can we make this text file accessible to all tasks spawned
> >> by the current application?
> >>
> >> Using HDFS, NFS, or including the file in the program jar is one way that I
> >> know of, but I am looking for a solution that allows me to provide the text
> >> file at run time and still have it accessible in all tasks.
> >> Thanks.
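P.S. To double-check my understanding of option 3 in Max's reply above (loading the file in the RichMapFunction's open() method from a file system reachable by all workers), is something along these lines what you meant? Again only an untested sketch; the hdfs:// path and the key=value rules format are made-up examples.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.fs.FSDataInputStream;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;

    /**
     * Loads a key=value rules file from a shared file system in open(), so every
     * parallel task has its own copy before the first event arrives. The HDFS
     * path is only an example; any file system reachable from all task managers
     * (NFS mount, etc.) would work the same way.
     */
    public class RuleLookupMapper extends RichMapFunction<String, String> {

        private final String rulesPath;          // e.g. "hdfs:///shared/rules.properties"
        private transient Map<String, String> rules;

        public RuleLookupMapper(String rulesPath) {
            this.rulesPath = rulesPath;
        }

        @Override
        public void open(Configuration conf) throws Exception {
            rules = new HashMap<>();
            Path path = new Path(rulesPath);
            FileSystem fs = path.getFileSystem();
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    int eq = line.indexOf('=');
                    if (eq > 0) {
                        rules.put(line.substring(0, eq).trim(),
                                  line.substring(eq + 1).trim());
                    }
                }
            }
        }

        @Override
        public String map(String event) {
            // Look up whatever rule applies to this event; identity fallback here.
            return rules.getOrDefault(event, event);
        }
    }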