Re: Why is Pipeline not Serializable and can it be changed to be Serializable
Thanks for the information. I will take a look.

Best Regards,
Pulasthi

On Fri, Nov 15, 2019 at 2:07 PM Luke Cwik wrote:
> They are serialized but not with Java serialization. [...]
Re: Why is Pipeline not Serializable and can it be changed to be Serializable
Serializable classes are not required to have default, no-arg constructors.

Reuven

On Fri, Nov 15, 2019 at 11:00 AM Pulasthi Supun Wickramasinghe wrote:
> However "TimestampedValueCoder" does not have a default constructor so it
> cannot be deserialized. [...]
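The point that Serializable classes do not need a no-arg constructor can be checked with plain Java serialization. The following is a minimal sketch, not Beam code: WrappingCoder is a hypothetical stand-in for a coder whose only constructor takes an argument, and roundTrip is a helper invented for this example. Java deserialization only requires an accessible no-arg constructor on the first non-serializable superclass, which here is Object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Shows that a Serializable class without a no-arg constructor still round-trips. */
final class NoDefaultConstructorDemo {

  /** Hypothetical stand-in for a coder that wraps another coder; not a Beam class. */
  static final class WrappingCoder implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String innerCoderName;

    // The only constructor takes an argument; there is no no-arg constructor.
    WrappingCoder(String innerCoderName) {
      this.innerCoderName = innerCoderName;
    }

    String innerCoderName() {
      return innerCoderName;
    }
  }

  /** Writes the value with Java serialization and reads it back from the bytes. */
  static <T extends Serializable> T roundTrip(T value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(value);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        @SuppressWarnings("unchecked")
        T copy = (T) in.readObject();
        return copy;
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("Java serialization round trip failed", e);
    }
  }

  public static void main(String[] args) {
    WrappingCoder copy = roundTrip(new WrappingCoder("VarLongCoder"));
    System.out.println(copy.innerCoderName()); // prints "VarLongCoder"
  }
}
```

The final field is restored even though no constructor of WrappingCoder runs during deserialization, so a missing no-arg constructor on its own is not what would stop a coder from being deserialized.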
Re: Why is Pipeline not Serializable and can it be changed to be Serializable
They are serialized but not with Java serialization. There is a CloudObject serialization[1] layer that only Dataflow uses while all other runners who need to serialize are using the Coder -> Proto serialization layer[2]. The CloudObject representation is slated for deletion once we can migrate Dataflow to the pure proto representation.

Feel free to send a PR to improve the wording in the Javadoc.

1: https://github.com/apache/beam/blob/5de27d5c0cb86962e28b5168b7f1dec62352230b/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/CloudObjects.java#L87
2: https://github.com/apache/beam/blob/5de27d5c0cb86962e28b5168b7f1dec62352230b/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslation.java#L68

On Fri, Nov 15, 2019 at 11:00 AM Pulasthi Supun Wickramasinghe wrote:
> Aren't the coders supposed to be serializable? [...]
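To illustrate the general idea of serializing coders through a language-neutral description instead of Java serialization, here is a minimal sketch. It does not use Beam's classes or protobuf: CoderSpec, SketchCoder, LongCoder, ListCoder, and the "sketch:coder:*" identifiers are all hypothetical names made up for this example. Beam's real translation is in the CoderTranslation class linked above and may differ in every detail.

```java
import java.util.List;

/** Sketch of describing coders as an identifier plus component descriptions. */
final class CoderSpecSketch {

  /** Language-neutral description of a coder (hypothetical; a real runner would use a proto). */
  static final class CoderSpec {
    final String urn;
    final List<CoderSpec> components;

    CoderSpec(String urn, List<CoderSpec> components) {
      this.urn = urn;
      this.components = components;
    }
  }

  /** Minimal coder interface for this sketch; it is not Beam's Coder. */
  interface SketchCoder {
    CoderSpec toSpec();
  }

  /** Leaf coder with no components. */
  static final class LongCoder implements SketchCoder {
    @Override
    public CoderSpec toSpec() {
      return new CoderSpec("sketch:coder:long", List.of());
    }
  }

  /** Composite coder; note it is neither Serializable nor has a no-arg constructor. */
  static final class ListCoder implements SketchCoder {
    final SketchCoder element;

    ListCoder(SketchCoder element) {
      this.element = element;
    }

    @Override
    public CoderSpec toSpec() {
      return new CoderSpec("sketch:coder:list", List.of(element.toSpec()));
    }
  }

  /** Rebuilds a coder on the receiving side by looking at the identifier and recursing. */
  static SketchCoder fromSpec(CoderSpec spec) {
    switch (spec.urn) {
      case "sketch:coder:long":
        return new LongCoder();
      case "sketch:coder:list":
        return new ListCoder(fromSpec(spec.components.get(0)));
      default:
        throw new IllegalArgumentException("Unknown coder identifier: " + spec.urn);
    }
  }

  public static void main(String[] args) {
    CoderSpec spec = new ListCoder(new LongCoder()).toSpec();
    SketchCoder rebuilt = fromSpec(spec);
    System.out.println(rebuilt.getClass().getSimpleName());
  }
}
```

Because the receiver constructs each coder through its normal constructor, the coder classes themselves do not have to support Java serialization in this scheme.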
Re: Why is Pipeline not Serializable and can it be changed to be Serializable
Hi Luke,

Aren't the coders supposed to be serializable? The doc on the Coder interface has the following Javadoc comment, which seems to mean that they should be, and most of the basic coders seem to be serializable.

"{@link Coder} instances are serialized during job creation and deserialized before use. This will generally be performed by serializing the object via Java Serialization."

However, "TimestampedValueCoder" does not have a default constructor, so it cannot be deserialized. Adding a private no-arg constructor would solve the problem, but this may not be the only class that has this issue. I can work on a pull request to add private no-arg constructors to the coders that are missing them, if this was not done intentionally. WDYT?

Best Regards,
Pulasthi

On Fri, Nov 15, 2019 at 12:05 AM Pulasthi Supun Wickramasinghe wrote:
> That is the approach i am taking currently to handle the functions. [...]

--
Pulasthi S. Wickramasinghe
PhD Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035
Re: Why is Pipeline not Serializable and can it be changed to be Serializable
Hi Luke,

That is the approach I am currently taking to handle the functions. I might have to do the same for Coders as well, since some coders have the same issue of not having default constructors.

I also initially considered converting the pipeline into a JSON format and sending that over to the workers. I will take a look at the option you have mentioned, since we do plan to implement a portable pipeline runner for Twister2 as well. Thanks for the information.

Best Regards,
Pulasthi

On Thu, Nov 14, 2019 at 2:30 PM Luke Cwik wrote:
> You should create placeholders inside of your Twister2/OpenMPI
> implementation that represent these functions and then instantiate actual
> instances of them on the workers [...]

--
Pulasthi S. Wickramasinghe
PhD Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035
Re: Why is Pipeline not Serializable and can it be changed to be Serializable
If you want to write your own pipeline representation and format for OpenMPI/Twister2, you should create placeholders inside of your Twister2/OpenMPI implementation that represent these functions and then instantiate actual instances of them on the workers.

Or consider converting the pipeline to its proto representation and building a portable pipeline runner. This way you could run Go and Python pipelines as well. The best example of this is the current Flink integration[1].

1: https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkBatchPortablePipelineTranslator.java

On Wed, Nov 13, 2019 at 7:44 PM Pulasthi Supun Wickramasinghe wrote:
> Currently, the Pipeline class in Beam is not Serializable. [...]
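The placeholder approach described above could look roughly like the following. This is a sketch under assumptions, not Twister2 or Beam code: FnPlaceholder, WorkerFnRegistry, defaultWorkerRegistry, and the "sum-longs" id are hypothetical names, and a BinaryOperator stands in for a non-serializable function such as "SystemReduceFn". Only the small placeholder is shipped to the workers; each worker looks up the id and constructs the real function locally.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Supplier;

/** Sketch of shipping serializable placeholders and instantiating functions on workers. */
final class PlaceholderSketch {

  /** Serializable stand-in that travels to the workers instead of the function itself. */
  static final class FnPlaceholder implements Serializable {
    private static final long serialVersionUID = 1L;
    final String fnId;

    FnPlaceholder(String fnId) {
      this.fnId = fnId;
    }
  }

  /** Worker-side registry mapping placeholder ids to factories for the real functions. */
  static final class WorkerFnRegistry {
    private final Map<String, Supplier<BinaryOperator<Long>>> factories = new HashMap<>();

    void register(String fnId, Supplier<BinaryOperator<Long>> factory) {
      factories.put(fnId, factory);
    }

    /** Creates a fresh function instance on the worker for the given placeholder. */
    BinaryOperator<Long> instantiate(FnPlaceholder placeholder) {
      Supplier<BinaryOperator<Long>> factory = factories.get(placeholder.fnId);
      if (factory == null) {
        throw new IllegalArgumentException("No function registered for id: " + placeholder.fnId);
      }
      return factory.get();
    }
  }

  /** Registry every worker would build at startup (hypothetical contents). */
  static WorkerFnRegistry defaultWorkerRegistry() {
    WorkerFnRegistry registry = new WorkerFnRegistry();
    registry.register("sum-longs", () -> Long::sum);
    return registry;
  }

  public static void main(String[] args) {
    // The driver side would serialize and send only this placeholder.
    FnPlaceholder placeholder = new FnPlaceholder("sum-longs");
    // The worker side resolves it to an actual function instance.
    BinaryOperator<Long> fn = defaultWorkerRegistry().instantiate(placeholder);
    System.out.println(fn.apply(2L, 3L)); // prints 5
  }
}
```

A real runner would likely carry extra configuration in the placeholder (for example, the parameters needed to construct the function), but the split between a serializable description and worker-side instantiation stays the same.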
Why is Pipeline not Serializable and can it be changed to be Serializable
Hi Devs,

Currently, the Pipeline class in Beam is not Serializable. This is not a problem for the current runners, since the pipeline is translated and submitted through a centralized, driver-like model. However, if the runner has a decentralized model similar to OpenMPI (MPI), which is also the case with Twister2, for which I am currently developing a runner, it would be better if the pipeline itself were Serializable.

Currently, I am trying to transform the Pipeline into a Twister2 graph and then send it over to the workers. However, since there are some functions such as "SystemReduceFn" that are not serializable, this is also somewhat troublesome.

Was the decision to make Pipelines not Serializable made for some specific reason, or because none of the current use cases presented a valid requirement to make them Serializable?

Best Regards,
Pulasthi

--
Pulasthi S. Wickramasinghe
PhD Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035