Re: Reference to Beam in upcoming Kubeflow Book

Robert Bradshaw Fri, 17 Apr 2020 17:27:21 -0700

On Fri, Apr 17, 2020 at 4:58 PM Holden Karau wrote:


>
> On Fri, Apr 17, 2020 at 3:52 PM Robert Bradshaw wrote:
>
>> On Fri, Apr 17, 2020 at 2:56 PM Holden Karau wrote:
>>
>>>
>>> On Fri, Apr 17, 2020 at 2:45 PM Robert Bradshaw wrote:
>>>
>>>> Hi Holden!
>>>>
>>>> I agree with Kyle that it makes sense to have some caveat about Flink
>>>> and Spark, though at this point they're not /that/ new (at least not
>>>> Flink).
>>>>
>>> True, maybe "early-stage" would be better wording?  The TFX PyBeam Flink
>>> support isn't yet mature enough (although there is interest in integrating
>>> it in Kubeflow I believe, it hasn't happened yet).
>>>
>>
>> I might just say "not as mature." Most of the work being done now is
>> fit-and-finish. There are also some extra flags that need to be passed to
>> work around bugs in Flink itself encountered when running TFX jobs.
>>
> Does this currently work at scale? The last time I tried to use TFX on
> Beam on Flink it had difficulty with data above ~10 MB.
>

The largest TFX job I've personally run on Flink is about 1 GB (local
cluster), but that was quite a while ago. As mentioned, there is a flag or
two (BATCH_FORCED IIRC) you have to pass to work around Flink getting stuck
in its memory allocation routines. (I don't remember what the final status
of the TFX benchmarks on Flink is though...)
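
For readers who want to see where such a flag would go, here is a minimal
sketch of assembling Flink runner options for a Beam Python pipeline. The
helper name flink_args and the localhost:8081 address are made up for
illustration. The option name --execution_mode_for_batch is an assumption
based on the "BATCH_FORCED IIRC" remark above, so verify it against the
Flink runner documentation for your Beam version.

```python
def flink_args(flink_master, force_batch=True):
    """Build command-line style options for a Beam job on Flink.

    Illustrative helper only; it is not part of Beam or TFX.
    """
    args = [
        "--runner=FlinkRunner",
        f"--flink_master={flink_master}",
    ]
    if force_batch:
        # Assumed option name; BATCH_FORCED is the value mentioned above.
        args.append("--execution_mode_for_batch=BATCH_FORCED")
    return args


if __name__ == "__main__":
    # Hypothetical local cluster address.
    print(" ".join(flink_args("localhost:8081")))
```

A list like this could then be handed to Beam's PipelineOptions when the
pipeline is constructed.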

>> (There's the separate question of using Kubernetes to deploy/manage the
>> Flink cluster itself, but the mode where Flink workers invoke docker to
>> start up the Python binaries is pretty stable at this point.)
>>
> So we would say maybe the OSS path would be to run TFX on Beam on Flink on
> YARN (like EMR)?
>

Flink has several deployment options, and Beam doesn't care which one you
use. The basic mode of operation is that you submit an uber jar just like an
"ordinary" Flink job, and the docker command must be available on the
workers. (There are more complicated setups, like the one that Lyft uses to
avoid docker-in-docker on their Kubernetes deployment, but that's more
advanced usage...)
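
As a rough illustration of the "docker command must be available on the
workers" requirement, a worker-side sanity check might look like the sketch
below. The function name is hypothetical, and the check only confirms that
an executable named docker is on the PATH, not that the Docker daemon is
running or that the worker may use it.

```python
import shutil


def docker_on_path(which=shutil.which):
    """Return True if an executable named 'docker' is found on PATH.

    The lookup function is injectable so the check can be exercised
    without Docker installed.
    """
    return which("docker") is not None


if __name__ == "__main__":
    print("docker found" if docker_on_path() else "docker missing")
```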

But perhaps we're getting a bit off topic here. I think "not as mature"
explains things the best. I see no reason it shouldn't run at scale, but I
would like to have regular benchmarking set up before promising anything.


>>>> I am curious what extra support Kubeflow is "missing" (or, conversely,
>>>> what extra support it has for Dataflow that goes beyond just specifying a
>>>> different runner) to the point that these runners are declared
>>>> "unsupported." Or is it literally a matter of not providing user support?
>>>>
>>> So the Kubeflow TFX components (in
>>> https://github.com/kubeflow/pipelines/tree/master/components) are
>>> limited to local mode.
>>>
>>
>> So in that sense it's not less supported than Dataflow?
>>
> From the component side it’s the same. But if someone wanted to do it “by
> hand”, Dataflow offers better support.
>

Ack.


>
>>
>>>
>>>> On Fri, Apr 17, 2020 at 12:27 PM Kyle Weaver wrote:
>>>>
>>>>> Hi Holden,
>>>>>
>>>>> The note on Flink & Spark support sounds reasonable to me. I am
>>>>> optimistic about getting Flink + TFX + Kubeflow working fairly soon, but I
>>>>> agree that we don't want to over-promise.
>>>>>
>>>>> I'm not so sure about the status of Dataflow here, perhaps someone
>>>>> else can comment on that.
>>>>>
>>>>> Looking forward to the book :)
>>>>>
>>>>> Kyle
>>>>>
>>>>> On Fri, Apr 17, 2020 at 1:14 PM Holden Karau wrote:
>>>>>
>>>>>> Hi Apache Beam Developers,
>>>>>>
>>>>>> I'm working on a book about Kubeflow, which naturally has a section
>>>>>> on TFX. I want to set users' expectations correctly, so I wanted to know
>>>>>> what y'all thought of this NOTE we were thinking of including in the early
>>>>>> release:
>>>>>>
>>>>>> Apache Beam’s Python support outside of Google Cloud's Dataflow is
>>>>>> relatively new. TFX is a Python tool, so scaling it depends on Apache
>>>>>> Beam's Python support. You can scale your job by using the non-portable
>>>>>> Dataflow component, but this requires changing your pipeline code and
>>>>>> isn't supported by Kubeflow's current TFX components. As Apache Beam's
>>>>>> support for Apache Flink & Spark improves, support may be added for
>>>>>> scaling the TFX components in a portable manner.
>>>>>>
>>>>>> Does this sound reasonable to folks? I don't want to over-promise, but
>>>>>> I also don't want to scare people away given all of the progress that is
>>>>>> being made in supporting the open-source runners with language
>>>>>> portability.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Holden :)
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>
>>>
