Hi JB,

On Fri, Nov 25, 2016 at 2:36 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:
>
> By the way, you can also use TensorFrame, allowing you to use TensorFlow
> directly with Spark dataframes, with more direct access. I discussed
> that with Tim Hunter from Databricks, who's working on TensorFrame.
>

Yes, we have been discussing and experimenting a bit with TensorFrame.
The work is very interesting, although it has some limitations. Actually,
that would mean taking a step back in our plan of getting away from the
specifics of the concrete processing engine.

> Back on Beam, what you could do:
>
> 1. you expose the service on a microservice container (for instance
> Apache Karaf ;))
>
> In your pipeline, you have two options:
>
> 2.a. in your Beam pipeline, in a DoFn, in the @Setup you can create the
> REST client (using CXF, or whatever), and in the @ProcessElement you can
> use the service (hosted by Karaf)
>

Besides a different microservice infrastructure, I already started to
play with DoFn and the concepts around it.
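What I have in mind is roughly the sketch below (RestClassifierFn, the
endpoint URL, and the String element types are all hypothetical; I'm
using the plain JAX-RS 2.0 client API, which CXF implements, but any REST
client would do):

import org.apache.beam.sdk.transforms.DoFn;

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;
import javax.ws.rs.core.MediaType;

public class RestClassifierFn extends DoFn<String, String> {

  // Hypothetical endpoint of the classifier service hosted on Karaf.
  private static final String ENDPOINT = "http://localhost:8181/cxf/classifier";

  // The client is not serializable, so it is created in @Setup rather
  // than shipped with the DoFn.
  private transient Client client;

  @Setup
  public void setup() {
    // Create the REST client once per DoFn instance, not per element.
    client = ClientBuilder.newClient();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // One request per element; the external service does the actual
    // classification and we just emit its answer.
    String label = client.target(ENDPOINT)
        .request(MediaType.TEXT_PLAIN)
        .post(Entity.text(c.element()), String.class);
    c.output(label);
  }

  @Teardown
  public void teardown() {
    if (client != null) {
      client.close();
    }
  }
}

In the pipeline that would then simply be ParDo.of(new RestClassifierFn()).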
> 2.b. I also have a RestIO (source and sink) that can request a REST
> endpoint. However, for now, this IO acts as a pipeline endpoint
> (PTransform<PBegin, PCollection> or PTransform<PCollection, PDone>). In
> your case, if the service called is a step of your pipeline, ParDo(your
> DoFn) would be easier.
>

Yes, that's what I understood of the Beam design: IOs are expected at the
head or the tail of the pipeline.

> Is it what you mean by microservice?
>

Yep, exactly that. Thanks so much!
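Just to double-check the IO-vs-ParDo placement, the pipeline shape I
picture is roughly the following (TextIO is just an arbitrary example of
a head and a tail IO here, and the paths are placeholders):

pipeline
    .apply(TextIO.read().from("gs://bucket/input*"))   // IO at the head
    .apply(ParDo.of(new RestClassifierFn()))           // service call as a middle step
    .apply(TextIO.write().to("gs://bucket/output"));   // IO at the tail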
On 11/25/2016 01:18 PM, Sergio Fernández wrote:
> Hi JB,
>
> On Tue, Nov 22, 2016 at 11:14 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> DoFn will execute per element (possibly with hooks on StartBundle,
>> FinishBundle, and Teardown). It's basically the way it works in the IO
>> WriteFn: we create the connection in StartBundle and send each element
>> (within a batch) to the external resource.
>>
>> PTransform is maybe more flexible in case of interaction with "outside"
>> resources.
>>
>
> Probably PTransform would be a better place. I'm still pretty new to
> some of the Beam terms and APIs.
>
>> Do you have a use case, to be sure I understand?
>
> Yes. Well, it's far more complex, but for this question I can simplify
> it:
>
> We have a TensorFlow-based classifier. In our pipeline, one step
> performs that classification of the data. Currently it's implemented as
> a Spark function, because TensorFlow models can be directly embedded
> within pipelines using PySpark.
>
> Therefore I'm looking for the best option to move such a classification
> process one level up in abstraction with Beam, so I could make it
> portable. The first idea I'm exploring is relying on an external
> function (i.e., a microservice) that I'd need to scale up and down
> independently of the pipeline. So I'm more than welcome to discuss
> ideas ;-)
>
> Thanks.
>
> Cheers,
>
>> On 11/22/2016 10:39 AM, Sergio Fernández wrote:
>>>
>>> Hi,
>>>
>>> I'd like to resume the idea of having TensorFlow-based tasks running
>>> in a Beam pipeline. So far the cleanest approach I can imagine would
>>> be to have them running outside (Functions in GCP, Lambdas in AWS,
>>> microservices generally speaking).
>>>
>>> Therefore, does the current Beam model provide the sense of a DoFn
>>> which actually runs externally?
>>>
>>> Thanks in advance for the feedback.
>>>
>>> Cheers,
>>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernan...@redlink.co
> w: http://redlink.co

> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com

--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co