Re: Python IO Connector

2020-01-07 Thread Lucas Magalhães
Hi Peter.

Why don't you use this external library?
https://pypi.org/project/beam-nuggets/   It already uses SQLAlchemy and is
pretty easy to use.
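
For reference, a read with beam-nuggets looks roughly like the snippet
below. This is adapted from memory of the project's README, so parameter
names may differ from the current release, and the connection details and
table are placeholders:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from beam_nuggets.io import relational_db

    # Placeholder connection details; drivername follows SQLAlchemy's
    # "<dialect>+<driver>" convention.
    source_config = relational_db.SourceConfiguration(
        drivername='postgresql+pg8000',
        host='localhost',
        port=5432,
        username='postgres',
        password='password',
        database='calendar',
    )

    with beam.Pipeline(options=PipelineOptions()) as p:
        _ = (
            p
            | 'ReadFromDB' >> relational_db.ReadFromDB(
                source_config=source_config,
                table_name='months',  # hypothetical table
            )
            | 'Print' >> beam.Map(print)
        )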


On Mon, Jan 6, 2020 at 10:17 PM Luke Cwik  wrote:

> Eugene, the JdbcIO output should be updated to support Beam's schema
> format which would allow for "rows" to cross the language boundaries.
>
> If the connector is easy to write and maintain then it makes sense to do
> it natively. Maybe the Python version will have an easier time supporting
> splitting and hence could overtake the Java implementation in useful
> features.
>
> On Mon, Jan 6, 2020 at 3:55 PM  wrote:
>
>> Apache Airflow went with the DB API approach as well and it seems to have
>> worked well for them. We will likely need to add extras_require entries
>> for each database engine's Python package though, which adds some
>> complexity, but not a lot.
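
A rough sketch of what such per-engine extras could look like in setup.py;
the extra names and driver packages below are illustrative, not a decided
list:

    # Illustrative only: per-engine extras so users install just the driver
    # they need, e.g. `pip install example-sql-io[postgres]`.
    from setuptools import find_packages, setup

    setup(
        name='example-sql-io',  # placeholder project name
        version='0.0.1',
        packages=find_packages(),
        install_requires=['apache-beam'],
        extras_require={
            'postgres': ['psycopg2-binary'],
            'mysql': ['pymysql'],
            'mssql': ['pyodbc'],
        },
    )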
>>
>> On Jan 6, 2020, at 6:12 PM, Eugene Kirpichov  wrote:
>>
>> Agreed with above, it seems prudent to develop a pure-Python connector
>> for something as common as interacting with a database. It's likely easier
>> to achieve an idiomatic API, familiar to non-Beam Python SQL users, within
>> pure Python.
>>
>> Developing a cross-language connector here might be plain impossible,
>> because rows read from a database are (at least in JDBC) not encodable -
>> they require a user's callback to translate to an encodable user type, and
>> the callback can't be in Python because then you have to encode its input
>> before giving it to Python. Same holds for the write transform.
>>
>> Not sure about sqlalchemy though, maybe use plain DB-API
>> https://www.python.org/dev/peps/pep-0249/ instead? Seems like the Python
>> one is more friendly than JDBC in the sense that it actually returns rows
>> as tuples of simple data types.
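
To illustrate the DB-API shape, here is a minimal, self-contained example
using sqlite3 from the standard library (the table and rows are made up;
any PEP 249 driver returns results the same way):

    import sqlite3

    # In-memory database with a made-up table, just to show the row shape.
    conn = sqlite3.connect(':memory:')
    cur = conn.cursor()
    cur.execute('CREATE TABLE users (id INTEGER, name TEXT)')
    cur.executemany('INSERT INTO users VALUES (?, ?)',
                    [(1, 'ada'), (2, 'grace')])
    conn.commit()

    cur.execute('SELECT id, name FROM users')
    for row in cur.fetchall():
        print(row)  # plain tuples such as (1, 'ada') -- easy to encode
    conn.close()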
>>
>> On Mon, Jan 6, 2020 at 1:42 PM Robert Bradshaw 
>> wrote:
>>
>>> On Mon, Jan 6, 2020 at 1:39 PM Chamikara Jayalath 
>>> wrote:
>>>
>>>> Regarding cross-language transforms, we need to add better
>>>> documentation, but for now you'll have to go with existing examples and
>>>> tests. For example,
>>>>
>>>>
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/gcp/pubsub.py
>>>>
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/kafka.py
>>>>
>>>> Note that the cross-language transforms feature is currently only
>>>> available for the Flink runner. Dataflow support is in development.
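
For a sense of what using one of those linked examples looks like, here is
a sketch based on kafka.py; the parameter names follow my reading of that
file and should be checked against it, and the broker, topic and expansion
service address are placeholders:

    import apache_beam as beam
    from apache_beam.io.external.kafka import ReadFromKafka

    with beam.Pipeline() as p:
        _ = (
            p
            | 'ReadFromKafka' >> ReadFromKafka(
                consumer_config={'bootstrap.servers': 'localhost:9092'},
                topics=['my_topic'],
                expansion_service='localhost:8097',  # placeholder address
            )
            | 'Print' >> beam.Map(print)
        )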
>>>>
>>>
>>> I think it works with all non-Dataflow runners, with the exception of
>>> the Java and Go Direct runners. (It does work with the Python direct
>>> runner.)
>>>
>>>
>>>> I'm fine with developing this natively for Python as well. AFAIK the
>>>> Java JdbcIO connector is not super complicated, and it should be fine to
>>>> make relatively easy-to-maintain, widely usable connectors available in
>>>> multiple SDKs.
>>>>
>>>
>>> Yes, a case can certainly be made for having native connectors for
>>> particular common/simple sources. (We certainly don't call out to
>>> cross-language transforms to read text files, for example.)
>>>
>>>
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>
>>>> On Mon, Jan 6, 2020 at 10:56 AM Luke Cwik  wrote:
>>>>
>>>>> +Chamikara Jayalath  +Heejong Lee
>>>>> 
>>>>>
>>>>> On Mon, Jan 6, 2020 at 10:20 AM  wrote:
>>>>>
>>>>>> How do I go about doing that? From the docs, it appears cross-language
>>>>>> transforms are currently undocumented.
>>>>>> https://beam.apache.org/roadmap/connectors-multi-sdk/
>>>>>> On Jan 6, 2020, at 12:55 PM, Luke Cwik  wrote:
>>>>>>
>>>>>> What about using a cross language transform between Python and the
>>>>>> already existing Java JdbcIO transform?
>>>>>>
>>>>>> On Sun, Jan 5, 2020 at 5:18 AM Peter Dannemann 
>>>>>> wrote:
>>>>>>
>>>>>>> I’d like to develop the Python SDK’s SQL IO connector. I was
>>>>>>> thinking it would be easiest to use sqlalchemy to achieve maximum 
>>>>>>> database
>>>>>>> engine support, but I suppose I could also create an ABC for databases 
>>>>>>> that
>>>>>>> follow the DB API and create subclasses for each database engine that
>>>>>>> override a connect method. What are your thoughts on the best way to do
>>>>>>> this?
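
A rough sketch of the ABC idea, with sqlite3 standing in for a real engine;
the class and method names here are made up for illustration, not a
proposed API:

    import abc
    import sqlite3


    class DatabaseEngine(abc.ABC):
        """Thin wrapper around a DB-API (PEP 249) connection factory."""

        @abc.abstractmethod
        def connect(self):
            """Return an open DB-API connection."""


    class SqliteEngine(DatabaseEngine):
        def __init__(self, path):
            self.path = path

        def connect(self):
            return sqlite3.connect(self.path)


    # A PostgresEngine, MySqlEngine, etc. would override connect() with
    # their own driver (psycopg2, pymysql, ...) behind the matching extra.
    conn = SqliteEngine(':memory:').connect()
    print(conn.execute('SELECT 1').fetchall())  # [(1,)]
    conn.close()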
>>>>>>>
>>>>>>

-- 
Lucas Magalhães,
CTO

Paralelo CS - Consultoria e Serviços
Tel: +55 (11) 3090-5557
Cel: +55 (11) 99420-4667
lucas.magalh...@paralelocs.com.br

www.paralelocs.com.br


Re: Reading from RDB, ParDo or BoundedSource

2019-09-28 Thread Lucas Magalhães
Hi Pablo.

Thanks for that. That is exactly what I needed, and it is much simpler than
I thought, hehe.


On Sat, Sep 28, 2019, 00:31, Pablo Estrada 
wrote:

> Hi Lucas!
> That makes sense. I saw a question for this on StackOverflow recently.
> Perhaps that was you? [1] - perhaps not, but then you're not the only one
> trying to do this.
>
> I do not know a lot about connecting to RDBs from Python - it seemed to me
> that you'd need to also install ODBC / JDBC drivers, and that's not that
> easy to do on Dataflow - so you would need to code a special transform
> depending on the database you're reading from.
>
> As far as I know, Postgres also does not have an easy way to read data in
> multiple threads in parallel, so the results of your query would be
> consumed in a single thread, and you can do that with a relatively simple
> DoFn. Check my answer to the question [2], which has a DoFn for reading
> from Postgres and one for MySQL.
>
> LMK if that helps!
>
> [1]
> https://stackoverflow.com/questions/46528343/how-to-use-gcp-cloud-sql-as-dataflow-source-and-or-sink-with-python/58106722#58106722
> [2] https://stackoverflow.com/a/58106722/1255356
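
A sketch along the same lines, using sqlite3 so it runs as-is; for Cloud
SQL Postgres you would open the connection with a driver such as psycopg2
instead, and the database path and query below are placeholders:

    import sqlite3

    import apache_beam as beam


    class ReadSqlRows(beam.DoFn):
        """Runs one query on a single worker and emits the result rows."""

        def __init__(self, db_path, query):
            self.db_path = db_path
            self.query = query

        def process(self, unused_element):
            conn = sqlite3.connect(self.db_path)
            try:
                for row in conn.cursor().execute(self.query):
                    yield row  # plain tuples
            finally:
                conn.close()


    with beam.Pipeline() as p:
        rows = (
            p
            | 'Seed' >> beam.Create([None])  # one element => one reader
            | 'Read' >> beam.ParDo(ReadSqlRows('example.db',
                                               'SELECT * FROM my_table'))
        )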
>
> On Fri, Sep 27, 2019 at 4:43 PM Eugene Kirpichov 
> wrote:
>
>> I'm actually very surprised that, to this day, nobody has written a
>> Python connector for the Python Database API, like JdbcIO.
>> Do we maybe have a way to use JdbcIO from Python via the cross-language
>> connectors stuff?
>>
>> On Fri, Sep 27, 2019 at 4:28 PM Lucas Magalhães <
>> lucas.magalh...@paralelocs.com.br> wrote:
>>
>>> Hi guys.
>>>
>>> Sorry, I forgot to mention that: I'm using the Python SDK. It seems that
>>> the Java SDK is more mature, but I have no skill in that language.
>>>
>>> I'm trying to extract data from Postgres (Cloud SQL), do some
>>> aggregations and save into BigQuery.
>>>
>>> On Fri, Sep 27, 2019, 19:21, Pablo Estrada 
>>> wrote:
>>>
>>>> Hi Lucas!
>>>> Can you share more information about your use case? Java has JdbcIO.
>>>> Maybe that's all you need? Or perhaps you're using Python SDK?
>>>> Best
>>>> -P.
>>>>
>>>> On Fri, Sep 27, 2019 at 3:08 PM Eugene Kirpichov 
>>>> wrote:
>>>>
>>>>> Hi Lucas,
>>>>> Any reason why you can't use JdbcIO?
>>>>> You almost certainly should *not* use BoundedSource, nor Splittable
>>>>> DoFn for this. BoundedSource is obsolete in favor of assembling your
>>>>> connector from regular transforms and/or using an SDF, and SDF is an
>>>>> extremely advanced feature whose primary audience is Beam SDK authors.
>>>>>
>>>>> On Fri, Sep 27, 2019 at 2:52 PM Lucas Magalhães <
>>>>> lucas.magalh...@paralelocs.com.br> wrote:
>>>>>
>>>>>> Hi guys.
>>>>>>
>>>>>> I'm new to Apache Beam and I would like some help to understand some
>>>>>> behaviours.
>>>>>>
>>>>>> 1. Is there some performance issue when I'm reading data from a
>>>>>> relational database using a ParDo instead of a BoundedSource?
>>>>>>
>>>>>> 2. If I'm going to implement a BoundedSource, how does Beam manage
>>>>>> the connection? Do I need to open and close it in every method, like
>>>>>> split, read, estimate size and so on?
>>>>>>
>>>>>> 3. I read something about Splittable DoFn, but I didn't find
>>>>>> instructions about how to implement it. Does anyone have something
>>>>>> about it?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>


Re: Reading from RDB, ParDo or BoundedSource

2019-09-27 Thread Lucas Magalhães
Hi guys.

Sorry, I forgot to mention that: I'm using the Python SDK. It seems that
the Java SDK is more mature, but I have no skill in that language.

I'm trying to extract data from Postgres (Cloud SQL), do some aggregations
and save into BigQuery.

On Fri, Sep 27, 2019, 19:21, Pablo Estrada 
wrote:

> Hi Lucas!
> Can you share more information about your use case? Java has JdbcIO. Maybe
> that's all you need? Or perhaps you're using Python SDK?
> Best
> -P.
>
> On Fri, Sep 27, 2019 at 3:08 PM Eugene Kirpichov 
> wrote:
>
>> Hi Lucas,
>> Any reason why you can't use JdbcIO?
>> You almost certainly should *not* use BoundedSource, nor Splittable DoFn
>> for this. BoundedSource is obsolete in favor of assembling your connector
>> from regular transforms and/or using an SDF, and SDF is an extremely
>> advanced feature whose primary audience is Beam SDK authors.
>>
>> On Fri, Sep 27, 2019 at 2:52 PM Lucas Magalhães <
>> lucas.magalh...@paralelocs.com.br> wrote:
>>
>>> Hi guys.
>>>
>>> I'm new to Apache Beam and I would like some help to understand some
>>> behaviours.
>>>
>>> 1. Is there some performance issue when I'm reading data from a
>>> relational database using a ParDo instead of a BoundedSource?
>>>
>>> 2. If I'm going to implement a BoundedSource, how does Beam manage the
>>> connection? Do I need to open and close it in every method, like split,
>>> read, estimate size and so on?
>>>
>>> 3. I read something about Splittable DoFn, but I didn't find
>>> instructions about how to implement it. Does anyone have something
>>> about it?
>>>
>>> Thanks
>>>
>>>
>>>
>>>


Reading from RDB, ParDo or BoundedSource

2019-09-27 Thread Lucas Magalhães
Hi guys.

I'm new to Apache Beam and I would like some help to understand some
behaviours.

1. Is there some performance issue when I'm reading data from a relational
database using a ParDo instead of a BoundedSource?

2. If I'm going to implement a BoundedSource, how does Beam manage the
connection? Do I need to open and close it in every method, like split,
read, estimate size and so on?

3. I read something about Splittable DoFn, but I didn't find instructions
about how to implement it. Does anyone have something about it?

Thanks


Re: MQTT to Python SDK

2019-09-16 Thread Lucas Magalhães
Thanks, Altay. Do you know where I could find more about cross-language
transforms? Documentation and examples as well.

thanks again

On Mon, Sep 16, 2019 at 4:00 PM Ahmet Altay  wrote:

> A framework for the Python SDK to use a native unbounded connector does
> not exist yet. You might be able to use the same connector from Java using
> cross-language transforms.
>
> /cc +Chamikara Jayalath 
>
> On Mon, Sep 16, 2019 at 11:00 AM Lucas Magalhães <
> lucas.magalh...@paralelocs.com.br> wrote:
>
>> Hello, all!
>>
>> I'm starting a new project here and the main source is MQTT.
>>
>> I couldn't find any documentation about how to develop an unbounded
>> connector.
>>
>> Could anyone send me some instructions or guidelines?
>>
>> Thanks a lot
>>
>> --
>> Lucas Magalhães,
>> CTO
>>
>> Paralelo CS - Consultoria e Serviços
>> Tel: +55 (11) 3090-5557
>> Cel: +55 (11) 99420-4667
>> lucas.magalh...@paralelocs.com.br
>>
>> www.paralelocs.com.br
>>
>

-- 
Lucas Magalhães,
CTO

Paralelo CS - Consultoria e Serviços
Tel: +55 (11) 3090-5557
Cel: +55 (11) 99420-4667
lucas.magalh...@paralelocs.com.br

www.paralelocs.com.br


MQTT to Python SDK

2019-09-16 Thread Lucas Magalhães
Hello, all!

I'm starting a new project here and the main source is MQTT.

I couldn't find any documentation about how to develop an unbounded
connector.

Could anyone send me some instructions or guidelines?

Thanks a lot

-- 
Lucas Magalhães,
CTO

Paralelo CS - Consultoria e Serviços
Tel: +55 (11) 3090-5557
Cel: +55 (11) 99420-4667
lucas.magalh...@paralelocs.com.br

www.paralelocs.com.br