Re: Pipeline manager/scheduler frameworks

2019-02-08 Thread Adeel Ahmad
Airflow would be a good fit, but you will probably have to modify it to support stream processing. Any DAG-based manager would be useful in your case. Luigi works too, but Airflow has a sleeker UI. You could also try StreamSets. GCP provides Cloud Composer, which uses Airflow, and Dataflow for Beam. AWS has Glue

Re: Pipeline manager/scheduler frameworks

2019-02-08 Thread Rui Wang
Apache Airflow is a scheduling system that can help manage data pipelines. I have seen Airflow used to manage a few thousand Hive/Spark/Presto pipelines. -Rui On Fri, Feb 8, 2019 at 4:08 PM Sridevi Nookala <snook...@parallelwireless.com> wrote: > Hi, > > > Our analytics app has many data

Pipeline manager/scheduler frameworks

2019-02-08 Thread Sridevi Nookala
Hi, Our analytics app has many data pipelines, some in Python/Java (using Beam), etc. Any suggestions for a pipeline manager/scheduler framework that manages/orchestrates these different pipelines? Thanks, Sri

Scio 0.7.1

2019-02-08 Thread Claire McGinty
Hi all! Scio 0.7.1 has just been released. It includes a few new features and improvements from 0.7.0: https://github.com/spotify/scio/releases/tag/v0.7.1 *"Taxidea Taxus"* Features - New HashCode-based partitioning method for keyed SCollections (#1654

What choose: HashMap to CSV ? or select * to CSV ?

2019-02-08 Thread Henrique Molina
Hi folks, I am running the query select * from VIEW_*1* and, after that, View_*2* on a database, and the next step is to collect the rows and export them to CSV. This is where I am so far: PCollection> view1 = p.apply(JdbcIO.>read() .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
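
For the "collect rows and export to CSV" step asked about above, here is a minimal sketch of turning one row into one CSV line. It assumes each row arrives as a `Map<String, Object>` keyed by column name (the message only hints at a HashMap), and the class and method names `CsvRows`, `escape`, and `toCsvLine` are hypothetical. It uses only the Java standard library. In a Beam pipeline a function like this would probably be applied per element (for example via `MapElements`) before writing the lines out with `TextIO.write()`, but that wiring is not shown here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Beam: formats one database row as one CSV line.
public class CsvRows {

    // Quote a field only when it contains a comma, quote, or line break;
    // embedded quotes are doubled. A null value becomes an empty field.
    static String escape(Object value) {
        if (value == null) {
            return "";
        }
        String s = String.valueOf(value);
        boolean needsQuotes = s.contains(",") || s.contains("\"")
                || s.contains("\n") || s.contains("\r");
        return needsQuotes ? "\"" + s.replace("\"", "\"\"") + "\"" : s;
    }

    // Emit the fields in an explicit column order, because a plain HashMap
    // does not guarantee iteration order and columns could otherwise shuffle.
    static String toCsvLine(Map<String, Object> row, List<String> columns) {
        return columns.stream()
                .map(column -> escape(row.get(column)))
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 7);
        row.put("name", "Ada, Countess");
        // prints: 7,"Ada, Countess"
        System.out.println(toCsvLine(row, Arrays.asList("id", "name")));
    }
}
```

Passing the column list explicitly keeps the output columns stable across rows, which matters more for a `select *` over a view than the choice of map type.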