On Fri, Apr 17, 2020 at 3:52 PM Robert Bradshaw <[email protected]> wrote:
> On Fri, Apr 17, 2020 at 2:56 PM Holden Karau <[email protected]> wrote:
>
>> On Fri, Apr 17, 2020 at 2:45 PM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> Hi Holden!
>>>
>>> I agree with Kyle that it makes sense to have some caveat about Flink
>>> and Spark, though at this point they're not /that/ new (at least not
>>> Flink).
>>>
>> True, maybe "early-stage" would be better wording? The TFX PyBeam Flink
>> support isn't yet mature enough (although there is interest in integrating
>> it in Kubeflow I believe, it hasn't happened yet).
>>
>
> I might just say "not as mature." Most of the work being done now is
> fit-n-finish. There's also some extra flags that need to be passed to work
> around bugs in Flink itself encountered when running TFX jobs.
>

Does this currently work at scale? The last time I tried to use TFX on Beam on Flink it had difficulty at data above ~10 MB.

> (There's the separate question of using Kubernetes to deploy/manage the
> Flink cluster itself, but the mode where Flink workers invoke docker to
> start up the Python binaries is pretty stable at this point.)
>

So we would say maybe the OSS path would be to run TFX on Beam on Flink on YARN (like EMR)?

>>> I am curious what extra support Kubeflow is "missing" (or, conversely,
>>> what extra support it has for Dataflow that goes beyond just specifying a
>>> different runner) to the point that these runners are declared
>>> "unsupported." Or is it literally a matter of not providing user support?
>>>
>> So the Kubeflow TFX components (in
>> https://github.com/kubeflow/pipelines/tree/master/components) are
>> limited to local mode.
>>
>
> So in that sense it's not less supported than Dataflow?
>

From the component side it’s the same. But if someone wanted to do it “by hand” Dataflow offers better support.

>>> On Fri, Apr 17, 2020 at 12:27 PM Kyle Weaver <[email protected]>
>>> wrote:
>>>
>>>> Hi Holden,
>>>>
>>>> The note on Flink & Spark support sounds reasonable to me.
>>>> I am optimistic about getting Flink + TFX + Kubeflow working fairly
>>>> soon, but I agree that we don't want to over-promise.
>>>>
>>>> I'm not so sure about the status of Dataflow here, perhaps someone else
>>>> can comment on that.
>>>>
>>>> Looking forward to the book :)
>>>>
>>>> Kyle
>>>>
>>>> On Fri, Apr 17, 2020 at 1:14 PM Holden Karau <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Apache Beam Developers,
>>>>>
>>>>> I'm working on a book about Kubeflow, which naturally has a section on
>>>>> TFX. I want to set users' expectations correctly so I wanted to know what
>>>>> y'all thought of this NOTE we were thinking of including in the early
>>>>> release:
>>>>>
>>>>> Apache Beam’s Python support outside of Google Cloud's Dataflow is
>>>>> relatively new. TFX is a Python tool, so scaling it depends on Apache
>>>>> Beam's Python support. You can scale your job by using the non-portable
>>>>> Dataflow component, but this requires changing your pipeline code and
>>>>> isn't supported by Kubeflow's current TFX components. As Apache Beam's
>>>>> support for Apache Flink & Spark improves, support may be added for
>>>>> scaling the TFX components in a portable manner.
>>>>>
>>>>> Does this sound reasonable to folks? I don't want to over-promise but
>>>>> I also don't want to scare people away given all of the progress that is
>>>>> being made in supporting the open-source runners with language
>>>>> portability.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
