Re: FlumeJava - what happened to PObjects

Kenneth Knowles Mon, 03 Feb 2025 06:49:30 -0800

We did actually get to the point with the SQL shell that you can issue
windowed streaming queries and watch the results come in (on the
DirectRunner with a hack probably similar to interactive runner). But it
turns out that interactively waiting for the end of the window is a bit
boring :-)


Kenn

On Fri, Jan 31, 2025 at 1:20 PM Robert Burke <rob...@frantil.com> wrote:

> Really the main difference that makes it a little more complicated (but
> not terribly so) is Flume isn't windowed (along with the metadata). Beam
> would need to make that up front and available for use.
>
> On Fri, Jan 31, 2025, 9:50 AM Robert Bradshaw via dev <dev@beam.apache.org>
> wrote:
>
>> On Fri, Jan 31, 2025 at 8:19 AM Joey Tran <joey.t...@schrodinger.com>
>> wrote:
>>
>>> Is there an equivalent to `PCollectionView` in python?
>>>
>>> > ... as there's not much novel one can do with a PObject vs. a
>>> singleton PCollection.
>>>
>>> Ah maybe I misunderstood how PObjects work. From the FlumeJava paper:
>>>
>>>
>>>> These features can be used to express a computation that needs
>>>> to iterate until the computed data converges:
>>>>
>>>> PCollection<Data> results = computeInitialApproximation();
>>>> for (;;) {
>>>>     results = computeNextApproximation(results);
>>>>     PCollection<Boolean> haveConverged =
>>>>         results.parallelDo(checkIfConvergedFn(),
>>>>                            collectionOf(booleans()));
>>>>     PObject<Boolean> allHaveConverged =
>>>>         haveConverged.combine(AND_BOOLS);
>>>>     FlumeJava.run();
>>>>     if (allHaveConverged.getValue()) break;
>>>> }
>>>> ... continue working with converged results ...
>>>
>>>
>>> I had understood this to mean that the `PObject`   will materialize the
>>> pcollection into in-memory values, which is maybe a little novel? At least,
>>> in the python SDK, I've always been writing elements to disk by transform
>>> and then reading it manually outside of the pipeline back into the original
>>> python objects.
>>>
>>
>> FWIW, this wasn't unique to PObjects in FlumeJava, one could do the same
>> for PCollections. While this is useful on the one hand, this notion of
>> "materialization" (especially combined with further execution) has
>> complexities in that keeping the main program alive now is essential to
>> pipeline completion (when this feature is used) as opposed to the "fire and
>> forget" model of Dataflow. (There were thoughts about doing runner-side
>> lazy graph expansion, but they never got fleshed out.)
>>
>> Note that one can easily write a PTransform that internally executes a
>> write and exposes an API to return the set of elements as in-memory objects
>> (using the coder that was used for write) post pipeline completion. One
>> would probably need to supply a distributed filestore to use, as there's
>> not really a good "default." This doesn't have to be provided as part of
>> Beam (though arguably it's a common enough usecase that maybe it should
>> be). There's also some exploration in this area with Python's interactive
>> utilities. One could conceivably even support this in streaming (though not
>> without providing manual details of a backing store, e.g. a cofka instance
>> or pubsub topic to use).
>>
>>
>>> On Fri, Jan 31, 2025 at 9:50 AM Kenneth Knowles <k...@apache.org> wrote:
>>>
>>>> Fun fact, this was one of my onboarding projects :-)
>>>>
>>>> It is the way it is because the invariant we need in order to apply
>>>> windowing independent of core business logic is: if you apply a transform
>>>> to windowed input, each window should contain the same output it would if
>>>> it were the entirety of the data. (composites are permitted to break
>>>> this rule, but the core compute primitives never do)
>>>>
>>>> And I guess it seemed wrong to call something an "Object" when it was
>>>> really an object per window. TBH the "P" was probably always a misnomer...
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Jan 30, 2025 at 7:26 PM Robert Bradshaw via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> On Thu, Jan 30, 2025 at 4:25 PM Reuven Lax via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> PCollectionView is the equivalent of PObject. Given that the Beam API
>>>>>> needed to work with the windowing model, we needed something like a 
>>>>>> PObject
>>>>>> that could be windowed. This is what PCollectionView provides.
>>>>>>
>>>>>
>>>>> +1, I'd forgotten about the windowing complexities as well.
>>>>>
>>>>>
>>>>>> On Thu, Jan 30, 2025 at 4:20 PM Joey Tran <joey.t...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I read the FlumeJava paper and I was just curious what happened to
>>>>>>> PObjects. They seem like a useful construct. Do they exist in the java 
>>>>>>> SDK
>>>>>>> in some version still? Or were they done away with because they made
>>>>>>> pipeline optimization more difficult?
>>>>>>>
>>>>>>

Re: FlumeJava - what happened to PObjects

Reply via email to