On Fri, Apr 17, 2020 at 3:52 PM Robert Bradshaw <[email protected]> wrote:

> On Fri, Apr 17, 2020 at 2:56 PM Holden Karau <[email protected]> wrote:
>
>>
>> On Fri, Apr 17, 2020 at 2:45 PM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> Hi Holden!
>>>
>>> I agree with Kyle that it makes sense to have some caveat about Flink
>>> and Spark, though at this point they're not /that/ new (at least not
>>> Flink).
>>>
>> True, maybe "early-stage" would be better wording?  The TFX PyBeam Flink
>> support isn't yet mature enough (although there is interest in integrating
>> it in Kubeflow I believe, it hasn't happened yet).
>>
>
> I might just say "not as mature." Most of the work being done now is
> fit-n-finish. There are also some extra flags that need to be passed to work
> around bugs in Flink itself encountered when running TFX jobs.
>
Does this currently work at scale? The last time I tried to use TFX on Beam
on Flink it had difficulty with data above ~10 MB.

> (There's the separate question of using Kubernetes to deploy/manage the
> Flink cluster itself, but the mode where Flink workers invoke docker to
> start up the Python binaries is pretty stable at this point.)
>
So we would say maybe the OSS path would be to run TFX on Beam on Flink on
YARN (like EMR)?

>
>
>>> I am curious what extra support Kubeflow is "missing" (or, conversely,
>>> what extra support it has for Dataflow that goes beyond just specifying a
>>> different runner) to the point that these runners are declared
>>> "unsupported." Or is it literally a matter of not providing user support?
>>>
>> So the Kubeflow TFX components (in
>> https://github.com/kubeflow/pipelines/tree/master/components) are
>> limited to local mode.
>>
>
> So in that sense it's not less supported than Dataflow?
>
From the component side it’s the same. But if someone wanted to do it “by
hand”, Dataflow offers better support.

>
>
>>
>>> On Fri, Apr 17, 2020 at 12:27 PM Kyle Weaver <[email protected]>
>>> wrote:
>>>
>>>> Hi Holden,
>>>>
>>>> The note on Flink & Spark support sounds reasonable to me. I am
>>>> optimistic about getting Flink + TFX + Kubeflow working fairly soon, but I
>>>> agree that we don't want to over-promise.
>>>>
>>>> I'm not so sure about the status of Dataflow here, perhaps someone else
>>>> can comment on that.
>>>>
>>>> Looking forward to the book :)
>>>>
>>>> Kyle
>>>>
>>>> On Fri, Apr 17, 2020 at 1:14 PM Holden Karau <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Apache Beam Developers,
>>>>>
>>>>> I'm working on a book about Kubeflow, which naturally has a section on
>>>>> TFX. I want to set users' expectations correctly so I wanted to know what
>>>>> y'all thought of this NOTE we were thinking of including in the early
>>>>> release:
>>>>>
>>>>> Apache Beam’s Python support outside of Google Cloud's Dataflow is
>>>>> relatively new. TFX is a Python tool, so scaling it depends on Apache
>>>>> Beam's Python support. You can scale your job by using the non-portable
>>>>> Dataflow component, but this requires changing your pipeline code and
>>>>> isn't supported by Kubeflow's current TFX components. As Apache Beam's
>>>>> support for Apache Flink & Spark improves, support may be added for
>>>>> scaling the TFX components in a portable manner.
>>>>>
>>>>> Does this sound reasonable to folks? I don't want to over-promise but
>>>>> I also don't want to scare people away given all of the progress that is
>>>>> being made in supporting the open-source runners with language
>>>>> portability.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>
>>
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
