Re: Pipeline is passing on local runner and failing on Dataflow runner - help with error

2018-06-21 Thread Ahmet Altay
Hi Ella, I do not know which package includes indexes.base. You could first try using the --requirements_file option to stage all those dependencies to see if your problem is resolved. After that you could eliminate them. You can also find the list of all pre-installed packages [1] for Dataflow worker

Re: Go SDK: Bigquery and Legacy SQL

2018-06-21 Thread Henning Rohde
Legacy SQL was the default when the IO was written. IIRC Standard SQL changed the structure of table names and had some problematic corner cases in that regard, so it was easier to just stay with legacy SQL. Feel free to open a JIRA and/or take a stab at it. Btw, you should be able to use time.Tim

Re: Go SDK: Bigquery and nullable field types.

2018-06-21 Thread Henning Rohde
The Go SDK can't actually serialize named types -- we serialize the structural information and recreate assignment-compatible isomorphic unnamed types at runtime for convenience. This usually works fine, but perhaps not if inspected reflectively. Have you tried to Register the Record (or bigquery.N
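Henning's suggestion to Register the type might be sketched as below; `Record` is a hypothetical stand-in for the poster's row struct, and the import path assumes the 2018-era Go SDK layout. Registration lets the SDK look the named type up at runtime instead of recreating an isomorphic unnamed type.

```go
package main

import (
	"reflect"

	"github.com/apache/beam/sdks/go/pkg/beam"
)

// Record is a hypothetical stand-in for the row struct being decoded.
type Record struct {
	Name string
}

func init() {
	// Register the named type so runtime decoding can find it by name
	// rather than reconstructing an unnamed structural equivalent.
	beam.RegisterType(reflect.TypeOf((*Record)(nil)).Elem())
}

func main() {}
```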

Go SDK: Bigquery and Legacy SQL

2018-06-21 Thread eduardo . morales
I am trying to read a column of type TIMESTAMP, which the bigquery client maps to time.Time. Unfortunately, it is not possible to use time.Time structs because this type contains slices, which are not supported by the Beam Go SDK (they run fine in the direct runner, but panic on Dataflow)

Go SDK: Bigquery and nullable field types.

2018-06-21 Thread eduardo . morales
I am using the bigqueryio transform with the following struct to collect a data row: type Record struct { source_service bigquery.NullString .. etc... } This works fine with the direct runner, but when I try it with the Dataflow runner I get the following exception: java.util.con
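With valid Go syntax and the cloud.google.com/go/bigquery nullable types, the struct might be sketched as follows. Only the source_service column comes from the original message; the exported field name and struct tag are illustrative (the field must be exported for the SDK to see it via reflection, and the tag is how the BigQuery client maps a differently named column).

```go
package main

import "cloud.google.com/go/bigquery"

// Record collects one BigQuery row. Only source_service appears in the
// original message; everything else here is a hypothetical sketch.
type Record struct {
	SourceService bigquery.NullString `bigquery:"source_service"`
	// ... other nullable columns ...
}

func main() {}
```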

Re: Issues with self-executing jar and FileSystems API

2018-06-21 Thread Sameer Abhyankar
That was it! Thanks Lukasz. I had to use a custom assembly to get around this. Thanks! On Thu, Jun 21, 2018 at 3:28 PM Lukasz Cwik wrote: > The FileSystems API uses a ServiceLoader[1] to find Apache Beam FileSystem > implementations. The ServiceLoader works by finding "service" files on the > cl

Re: Pipeline is passing on local runner and failing on Dataflow runner - help with error

2018-06-21 Thread OrielResearch Eila Arich-Landkof
Hi Ahmet, Thank you. I have attached the requirements.txt that was generated from pip freeze > requirments.txt on the Datalab notebook. The file does not include the apache-beam package, only apache-airflow==1.9.0. Could you please let me know which package includes indexes.base. Best, Eila On Thu, J

Re: Issues with self-executing jar and FileSystems API

2018-06-21 Thread Lukasz Cwik
The FileSystems API uses a ServiceLoader[1] to find Apache Beam FileSystem implementations. The ServiceLoader works by finding "service" files on the classpath containing a list of classes implementing the Apache Beam FileSystem API. The way in which you're creating an executable jar is likely droppi
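For readers hitting this with the Maven Shade plugin, the usual fix is the ServicesResourceTransformer, which concatenates the META-INF/services files from all input jars instead of letting one jar's copy overwrite the rest. A minimal sketch of the relevant POM fragment (plugin version and the surrounding build section omitted):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <!-- Merge META-INF/services files so every FileSystem
           registration survives the fat-jar build -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
  </configuration>
</plugin>
```

For builds using the Maven Assembly plugin (as Sameer's "custom assembly" reply suggests), the rough equivalent is the `metaInf-services` container descriptor handler in the assembly descriptor.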

Issues with self-executing jar and FileSystems API

2018-06-21 Thread Sameer Abhyankar
Hello! I am trying to package a Beam Dataflow pipeline as a self-executing jar using these instructions. However, I am running into a weird issue when attempting to execute this jar. My pipeline needs to read a file (avr

Re: Perform aggregations across multiple windows

2018-06-21 Thread Robert Bradshaw
You can re-window after the first aggregation (say, into the global window) and state will be stored with respect to this window. On Thu, Jun 21, 2018 at 12:02 PM Harshvardhan Agrawal wrote: > > Hello, > > We are currently working on implementing a data pipeline using Beam on top of > Flink. We h

Perform aggregations across multiple windows

2018-06-21 Thread Harshvardhan Agrawal
Hello, We are currently working on implementing a data pipeline using Beam on top of Flink. We have an unbounded data source that sends us some financial positions data. For each account, we perform certain aggregations (let’s assume it’s summation for simplicity) across all products owned by the

Re: Pipeline is passing on local runner and failing on Dataflow runner - help with error

2018-06-21 Thread Ahmet Altay
Hi Ella, It seems like the package related to indexes.base is not installed on the workers. Could you try one of the methods in "Managing Python Pipeline Dependencies" [1] to stage that dependency? Ahmet [1] https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/ On Thu, Jun

Re: Pipeline is passing on local runner and failing on Dataflow runner - help with error

2018-06-21 Thread OrielResearch Eila Arich-Landkof
Hello all, Exploring that issue (local runner works great and Dataflow fails), there might be a mismatch between the apache_beam version and the Dataflow version. Please let me know what your thoughts are. If it is a version issue, what updates should be executed? How do I cover the installation

Re: Python: Single vs Multiple DoFns for Image Processing

2018-06-21 Thread Robert Bradshaw
I would write these as three separate DoFns; they will get fused together to minimize IO. 400 workers may not be overkill, depending on how many images you have. Is Dataflow not scaling up and sharing the work? Where is your list of images coming from? On Thu, Jun 21, 2018 at 8:49 AM Cristian Gar

Python: Single vs Multiple DoFns for Image Processing

2018-06-21 Thread Cristian Garcia
Hi, I am running Beam with the DataflowRunner and want to do 3 tasks: 1. Read an image from GCS 2. Process the image (data augmentation) 3. Serialize the image to a string I could do all this in a single DoFn, but I could also split it into these 3 stages. I don't know what would be bet

Re: Using a user-developed source in streaming Python

2018-06-21 Thread Lukasz Cwik
+d...@beam.apache.org Python streaming custom source support will be available via SplittableDoFn. It is actively being worked on by a few contributors but to my knowledge there is no roadmap yet for having support for this for Dataflow. On Thu, Jun 21, 2018 at 1:19 AM Sebastien Morand < sebasti

Using a user-developed source in streaming Python

2018-06-21 Thread Sebastien Morand
Hi, We need to set up streaming Dataflow in Python using a user-developed source (as opposed to a native source) for Cloud SQL integration and Firestore integration. As far as I understand the documentation this is currently not supported; can you confirm? Any roadmap on the topic? Thanks in advance,