Re: Is there a way (settings) to limit the number of elements per worker machine

2021-06-02 Thread OrielResearch Eila Arich-Landkof
Hi Roberts, Thank you. I usually work with the custom worker configuration options. I will customize it to a low number of cores with large memory and see if it solves my problem. Thanks so much, — Eila www.orielresearch.com https://www.meetup.com/Deep-Learning-In-Production Sent from my iPhone > On
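
A minimal sketch of that custom-machine approach, assuming the Beam Python WorkerOptions view (the attribute corresponds to the SDK's --worker_machine_type flag; verify the name against your SDK version) and a hypothetical custom Compute Engine shape in the custom-CPUS-MEM_MB format:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

    options = PipelineOptions()
    # 2 vCPUs with 30 GB of memory; the exact shape here is an illustration only.
    options.view_as(WorkerOptions).machine_type = 'custom-2-30720'
    p = beam.Pipeline(options=options)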

BQ pipeline fires an error

2020-09-12 Thread OrielResearch Eila Arich-Landkof
Hi all. I am initiating the following pipeline to read from a table and write to a new table, and receive the following error. The table is ~60,000 rows.
input_query = "select * from table"
p = beam.Pipeline(options=options)
# the first list is the root idx_and_sample[0:1]
(p | 'Step 5.1.1: read each ro
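
A hedged sketch of the read-from-one-table, write-to-another shape this thread describes; the table names, schema, and options object are placeholders, not the poster's actual values:

    import apache_beam as beam

    p = beam.Pipeline(options=options)
    (p | 'read' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT * FROM dataset.table'))
       | 'write' >> beam.io.WriteToBigQuery(
             'project:dataset.new_table',
             schema='col_a:STRING,col_b:INTEGER',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
    p.run().wait_until_finish()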

Sorting JSON output by key

2020-06-25 Thread OrielResearch Eila Arich-Landkof
Hi all, I have a pipeline that outputs JSON elements. Next, I want to read the JSON elements in order by one of the keys. Any idea how to do it within the pipeline? Thanks, Eila — Eila www.orielresearch.com https://www.meetup.com/Deep-Learning-In-Production Sent from my iPhone
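
PCollections are unordered, so an in-pipeline sort only works by gathering the elements somewhere they can be sorted together. A minimal sketch, assuming the JSON elements fit in one worker's memory and 'sort_key' is a hypothetical name for the key to order by:

    import apache_beam as beam

    sorted_out = (
        json_elements                                  # PCollection of dicts
        | 'one key' >> beam.Map(lambda d: (None, d))   # funnel everything under one key
        | 'gather'  >> beam.GroupByKey()
        | 'sort'    >> beam.Map(lambda kv: sorted(kv[1], key=lambda d: d['sort_key'])))

The output is a single element holding the sorted list; any order-dependent work (e.g., formatting one output file) should happen in that same step, since a downstream PCollection is unordered again.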

Writing pipeline output to a Google Sheet in Google Drive

2020-06-06 Thread OrielResearch Eila Arich-Landkof
Hello, Is it possible to have the pipeline sink to a Google Sheet within a specific Google Drive directory? Something like this:
p = beam.Pipeline(options=options)
(p | 'Step 1: read file ' >> beam.io.ReadFromText(path/to/file)
   | 'Step 2: process data ' >> beam.ParDo(get_data(...))
   | 's
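
Beam has no built-in Google Sheets sink; one hedged option is a DoFn that appends rows through the third-party gspread client. Everything here is an assumption for illustration: the gspread dependency (which must be staged to the workers), its service_account() credential helper, and the spreadsheet key.

    import apache_beam as beam

    class WriteToSheetFn(beam.DoFn):
        def __init__(self, sheet_key):
            self.sheet_key = sheet_key  # hypothetical spreadsheet ID

        def start_bundle(self):
            import gspread  # third-party client, not part of Beam
            self.ws = gspread.service_account().open_by_key(self.sheet_key).sheet1

        def process(self, element):
            self.ws.append_row(list(element))  # one sheet row per element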

Read a file => process => write to multiple files

2020-05-29 Thread OrielResearch Eila Arich-Landkof
Hi all, I am looking for a way to read a large file and generate the following 3 files: 1. extract the header 2. extract column #1 from all lines 3. extract column #2 from all lines. I use a DoFn to extract the values, and I am looking for a way to redirect the output to three different files. My thought
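
A minimal sketch of the one-read, three-writes split using a multi-output ParDo with tagged outputs; the paths and column positions are placeholders:

    import apache_beam as beam
    from apache_beam import pvalue

    class SplitFn(beam.DoFn):
        def process(self, line):
            cols = line.split(',')
            yield pvalue.TaggedOutput('col1', cols[0])
            yield pvalue.TaggedOutput('col2', cols[1])
            yield line  # main output: the untouched line

    results = (p | beam.io.ReadFromText('gs://bucket/large_file.txt')
                 | beam.ParDo(SplitFn()).with_outputs('col1', 'col2', main='lines'))
    results.col1  | 'w1' >> beam.io.WriteToText('gs://bucket/out_col1')
    results.col2  | 'w2' >> beam.io.WriteToText('gs://bucket/out_col2')
    results.lines | 'w3' >> beam.io.WriteToText('gs://bucket/out_lines')

The header would still need separate handling (e.g., skip_header_lines plus a one-off side write), since ReadFromText gives no ordering guarantee.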

Re: GoogleCloudOptions.worker_machine_type = 'n1-highcpu-96'

2020-05-12 Thread OrielResearch Eila Arich-Landkof
ons) > google_cloud_options.project = 'my-project-id' > ... > # Create the Pipeline with the specified options. > p = Pipeline(options=options) > > Alternatively you should be able to just specify --worker_machine_type at > the command line if you're parsing the Pip

GoogleCloudOptions.worker_machine_type = 'n1-highcpu-96'

2020-05-12 Thread OrielResearch Eila Arich-Landkof
Hello, I am trying to check whether the resource settings are actually being applied. What would be the right way to do it? *The code is:* GoogleCloudOptions.worker_machine_type = 'n1-highcpu-96' and *the Dataflow view is* the following (nothing reflects the highcpu machine). Please advi
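
One likely cause of the setting not showing up: the snippet assigns to the GoogleCloudOptions class itself rather than to the options instance the pipeline is built with. A hedged sketch of the instance-level form (attribute name per the SDK's --worker_machine_type flag; the setting only takes effect if these options reach beam.Pipeline):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

    options = PipelineOptions()
    options.view_as(WorkerOptions).machine_type = 'n1-highcpu-96'
    p = beam.Pipeline(options=options)

Equivalently, pass --worker_machine_type=n1-highcpu-96 on the command line, as the reply above suggests.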

Re: resource management on the worker machine, or how to debug a hanging execution on a worker machine

2020-05-12 Thread OrielResearch Eila Arich-Landkof
name: "step-3-2-n1-highcpu-96-5912" project_id: "***" region: "us-central1" step_id: "" } type: "dataflow_step" } severity: "ERROR" timestamp: "2020-05-12T05:34:12.500823Z" } On Mon, May 11, 2020 at 12:52 PM OrielResearch Eil

resource management on the worker machine, or how to debug a hanging execution on a worker machine

2020-05-11 Thread OrielResearch Eila Arich-Landkof
Hi all, I am trying to run the Kallisto package command on the Apache Beam worker. Below is a table that describes my steps in the Apache Beam pipeline code and on a local Debian compute machine (a new machine); I used both of them for debugging and comparison. On the local machine, the execution completes wit
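
A minimal sketch of shelling out to an external tool from a DoFn so that failures surface in the worker logs instead of hanging silently; the kallisto arguments and paths are placeholders, and capture_output/text need Python 3.7+:

    import subprocess
    import apache_beam as beam

    class RunKallistoFn(beam.DoFn):
        def process(self, element):
            result = subprocess.run(
                ['kallisto', 'quant', '-i', '/opt/userowned/index.idx', '-o', '/tmp/out', element],
                capture_output=True, text=True)
            if result.returncode != 0:
                # Raising makes the error visible in Dataflow's worker logs.
                raise RuntimeError('kallisto failed: %s' % result.stderr)
            yield result.stdout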

Re: compute engine cap - request for configuration advice

2020-05-08 Thread OrielResearch Eila Arich-Landkof
Great. thanks On Fri, May 8, 2020 at 2:33 PM Luke Cwik wrote: > You should set max num workers to set an upper bound on how many will be > used in your job. It will not fail the pipeline when that limit is reached. > > On Fri, May 8, 2020 at 10:03 AM OrielResearch Eila Arich-L

compute engine cap - request for configuration advice

2020-05-08 Thread OrielResearch Eila Arich-Landkof
Hi all, I am hitting the compute engine limit and need to escalate the request with Google support to increase this quota. Could you please advise on the best way to manage this? See attached the exact quota screenshot. Will limiting num_of_workers allow the job to proceed with a li
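
A minimal sketch of capping autoscaling, per the reply above; the cap bounds scaling, it does not fail the job (the value 10 is a hypothetical number below the quota):

    from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

    options = PipelineOptions()
    options.view_as(WorkerOptions).max_num_workers = 10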

Re: adding apt-get to setup.py fails passing apt-get commands

2020-05-07 Thread OrielResearch Eila Arich-Landkof
Java folder to PATH. ['export', 'PATH=$PATH:/opt/userowned/jdk-14.0.1/bin'] error: [Errno 2] No such file or directory: 'export' What is the right way to use export at setup time? Thanks, Eila On Tue, May 5, 2020 at 12:47 AM OrielResearch Eila Arich-Landkof < e.
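
'export' is a shell builtin, not an executable on disk, which is why an exec-style subprocess call fails with ENOENT. A minimal sketch of the usual workaround, setting the variable on the Python side so later child processes inherit it:

    import os
    import subprocess

    # Equivalent of: export PATH=$PATH:/opt/userowned/jdk-14.0.1/bin
    os.environ['PATH'] += ':/opt/userowned/jdk-14.0.1/bin'
    subprocess.check_call(['java', '-version'])  # children inherit the updated PATH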

Re: adding apt-get to setup.py fails passing apt-get commands

2020-05-04 Thread OrielResearch Eila Arich-Landkof
python/apache_beam/examples/complete/juliaset/setup.py Thanks, Eila subprocess.CalledProcessError: Command '['apt-get', '--assume-yes', 'install', 'unzip']' returned non-zero exit status 100 On Mon, May 4, 2020 at 1:43 PM OrielResearch Eila Arich-La

Re: adding apt-get to setup.py fails passing apt-get commands

2020-05-04 Thread OrielResearch Eila Arich-Landkof
cate is not intended to be used when the amount of output is > large. If you need to just run the process, I would recommend a simple > subprocess.check_output(). > > On Mon, May 4, 2020 at 9:00 AM OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >>

Re: adding apt-get to setup.py fails passing apt-get commands

2020-05-04 Thread OrielResearch Eila Arich-Landkof
e code now:* stdout_data, *stderr_data* = p.communicate() print('Command output: %s' % stdout_data) *print('Command error data : %s' % stderr_data)* So that issue is resolved for me. Thanks, Eila On Sat, May 2, 2020 at 11:27 PM OrielResearch Eila Arich-Landkof < e...@orielresearch.

adding apt-get to setup.py fails passing apt-get commands

2020-05-02 Thread OrielResearch Eila Arich-Landkof
Hi all, I have experienced very odd behaviour when executing setup.py with the following CUSTOM COMMANDS: CUSTOM_COMMANDS = [['echo', 'Custom command worked!'], ['apt-get', 'update'], ['apt-get', 'install', '-y', 'unzip']] Everything works great. When ex
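
For reference, a sketch of the custom-command pattern this thread is using, modeled on the juliaset example setup.py linked later in the thread; the package metadata is a placeholder. Running 'apt-get update' before 'install' is one common way to avoid the exit status 100 seen above.

    import subprocess
    import setuptools
    from distutils.command.build import build as _build

    CUSTOM_COMMANDS = [
        ['apt-get', 'update'],
        ['apt-get', '--assume-yes', 'install', 'unzip'],
    ]

    class CustomCommands(setuptools.Command):
        user_options = []
        def initialize_options(self): pass
        def finalize_options(self): pass
        def run(self):
            for cmd in CUSTOM_COMMANDS:
                subprocess.check_call(cmd)   # fails loudly on a non-zero exit status

    class build(_build):
        sub_commands = _build.sub_commands + [('CustomCommands', None)]

    setuptools.setup(name='pipeline-deps', version='0.0.1',
                     packages=setuptools.find_packages(),
                     cmdclass={'build': build, 'CustomCommands': CustomCommands})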

Error and initiating new worker when executing a subprocess with Popen

2020-05-02 Thread OrielResearch Eila Arich-Landkof
Hi all, I would appreciate any help on this issue; I have no idea where to start debugging it. On a local machine, it works fine. The challenge is to have it working on a worker machine. There is an option to run the command below (kallisto) using multiple threads with -t [# threads].

Failure executing a third-party library command on Beam workers

2020-04-22 Thread OrielResearch Eila Arich-Landkof
Hello, I would like to run a third-party library. The library can be copied to the machines or installed via conda. I have tried the following two methods; both of them failed. *Method 1:* - use setup.py to wget & tar the library into the workers' folder /opt/userowned/ - execute the command w

Re: Copying tar.gz libraries to apache-beam workers

2020-04-21 Thread OrielResearch Eila Arich-Landkof
t;] > > See the Anaconda silent install[1] instructions for more details. > > 1: https://docs.anaconda.com/anaconda/install/silent-mode/#linux-macos > > > On Fri, Apr 17, 2020 at 9:28 PM OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >>

Re: Copying tar.gz libraries to apache-beam workers

2020-04-17 Thread OrielResearch Eila Arich-Landkof
nks, Eila On Fri, Apr 17, 2020 at 12:12 PM Luke Cwik wrote: > On Dataflow you should be able to use /opt/userowned > > On Fri, Apr 17, 2020 at 9:01 AM OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >> See inline >> >> >> — >&g

Re: Copying tar.gz libraries to apache-beam workers

2020-04-17 Thread OrielResearch Eila Arich-Landkof
l > It should help you locate any errors that might have happened when executing > the custom commands. Will try that. Is it correct to target this folder? Is any other folder ‘dedicated’ to custom downloads? Thanks Eila > >> On Thu, Apr 16, 2020 at 6:30 PM OrielRes

Copying tar.gz libraries to apache-beam workers

2020-04-16 Thread OrielResearch Eila Arich-Landkof
Hi all, This is a question that I posted on user-h...@apache.org; in case that is not a valid address, I am posting it here again. If you have already received it, apologies for the spam. I hope that you are all well. I would like to copy a tools library to the worker machines and use the

Re: transpose CSV transform

2019-02-09 Thread OrielResearch Eila Arich-Landkof
ble[1]. It might be possible to >> modify this to work on a CSV source. >> >> [1] >> https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-bigquery-transpose >> >> >> On Sun, Jan 13, 2019 at 1:58 AM OrielResearch Eila Ari

INFO:oauth2client.client:Attempting refresh to obtain initial access_token

2019-01-18 Thread OrielResearch Eila Arich-Landkof
Hello all, My pipeline looks like the following: (p | "read TXT " >> beam.io.ReadFromText('gs://path/to/file.txt',skip_header_lines= False) | "transpose file " >> beam.ParDo(transposeCSVFn(List1,List2))) The following error is being fired for the read: INFO:oauth2client.client:Attempting refr

transpose CSV transform

2019-01-12 Thread OrielResearch Eila Arich-Landkof
Hi all, I am working with many CSV files where the common part is the row names; therefore, my processing should be by columns. My plan is to have the tables transposed and the combined tables written into BQ. So, the code should perform: 1. transpose the tables (columns -> new_rows, rows
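
A minimal sketch of the transpose as a Beam shuffle, assuming each element arrives as a (row_name, [values...]) pair and that any single column's values fit in one worker's memory:

    import apache_beam as beam

    transposed = (
        rows  # PCollection of (row_name, [v0, v1, ...])
        | 'explode' >> beam.FlatMap(
              lambda kv: [(col_idx, (kv[0], v)) for col_idx, v in enumerate(kv[1])])
        | 'regroup' >> beam.GroupByKey())  # one element per original column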

Re: Recordings and presentations from Beam Summit London 2018

2018-12-21 Thread OrielResearch Eila Arich-Landkof
Thank you for sharing. will definitely watch!!! Happy New Year to all of you, Eila On Thu, Dec 20, 2018 at 6:49 PM Manu Zhang wrote: > Thanks for sharing. The YouTube channel link is > https://www.youtube.com/channel/UChNnb_YO_7B0HlW6FhAXZZQ > > Thanks, > Manu Zhang > On Nov 2, 2018, 11:06 PM +0

Re: 2019 Beam Events

2018-12-04 Thread OrielResearch Eila Arich-Landkof
agree 👍 On Tue, Dec 4, 2018 at 5:41 AM Chaim Turkel wrote: > Israel would be nice to have one > chaim > On Tue, Dec 4, 2018 at 12:33 AM Griselda Cuevas wrote: > > > > Hi Beam Community, > > > > I started curating industry conferences, meetups and events that are > relevant for Beam, this initia

using tfma / ModelAnalysis with tensorflow (not estimator) model

2018-10-19 Thread OrielResearch Eila Arich-Landkof
Hello all, I would like to use the ModelAnalysis API for model debugging. I don't have a background in model serving or in generating the model eval graph for it, so there might be basic background that I am missing. As a start, I would like to add TFMA to this colab (open to other suggestion

Re: Advice for piping many CSVs with different column names to one BigQuery table

2018-09-27 Thread OrielResearch Eila Arich-Landkof
ache_beam/examples/complete/game/leader_board.py#L326 > > May I know how many distinct column are you expecting across all files? > > > On Wed, Sep 26, 2018 at 8:06 PM OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >> Hi Ankur / users, >> >

Re: Advice for piping many CSVs with different column names to one BigQuery table

2018-09-26 Thread OrielResearch Eila Arich-Landkof
miss something here / what your thoughts are Many thanks, Eila On Wed, Sep 26, 2018 at 12:04 PM OrielResearch Eila Arich-Landkof < e...@orielresearch.org> wrote: > Hi Ankur, > > Thank you. Trying this approach now. Will let you know if I have any issue > implementing it. > Be

Re: Advice for piping many CSVs with different column names to one BigQuery table

2018-09-26 Thread OrielResearch Eila Arich-Landkof
BQ. > > Thanks, > Ankur > > > On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >> Hello, >> I would like to write large number of CSV file to BQ where the headers >> from all of them is aggrega

Advice for piping many CSVs with different column names to one BigQuery table

2018-09-25 Thread OrielResearch Eila Arich-Landkof
Hello, I would like to write a large number of CSV files to BQ, where the headers from all of them are aggregated into one common header. Any advice is very appreciated. The details are: 1. 2.5M CSV files 2. Each CSV file: a header of 50-60 columns 3. Each CSV file: one data row. There are common columns
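
A hedged sketch of the aggregated-header approach: each file becomes one dict keyed by its own header, and BigQuery leaves the union columns that a given file lacks as NULL. It assumes the union schema is precomputed with NULLABLE fields, and that file_paths/union_schema are placeholder variables:

    import csv
    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    def csv_to_row(path):
        with FileSystems.open(path) as f:
            header, values = list(csv.reader(f.read().decode('utf-8').splitlines()))[:2]
        return dict(zip(header, values))  # absent union columns simply aren't in the dict

    (p | beam.Create(file_paths)          # the 2.5M paths
       | beam.Map(csv_to_row)
       | beam.io.WriteToBigQuery('project:dataset.table', schema=union_schema))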

Creating a pCollection from large numpy matrix with row and column names

2018-08-28 Thread OrielResearch Eila Arich-Landkof
Hello all, I would like to process a large numpy matrix with dimensions (100K+, 30K+). The column names and the row names are meaningful. My plan was to save the numpy matrix values as a txt file and read it into a PCollection. However, I am not sure how to add the row names to the element for pr
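
A minimal sketch of keeping the names with the data by pairing each row with its name before creating the PCollection; the file name and row_names variable are placeholders. For a matrix this size, writing (row_name, values) records to files and reading those back is likely more practical than Create():

    import apache_beam as beam
    import numpy as np

    matrix = np.load('matrix.npy')
    elements = [(row_names[i], matrix[i, :].tolist()) for i in range(matrix.shape[0])]
    rows = p | beam.Create(elements)   # each element: (row_name, [values...])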

Re: INFO:root:Executing Error when executing a pipeline on dataflow

2018-08-22 Thread OrielResearch Eila Arich-Landkof
The issue was with the pip version: --download was deprecated. I don't know where this needs to be mentioned / fixed. Running pip install pip==9.0.3 solved the issue. Thanks, Eila On Wed, Aug 22, 2018 at 11:20 AM OrielResearch Eila Arich-Landkof < e...@orielresearch.org> wrote: >

Re: INFO:root:Executing Error when executing a pipeline on dataflow

2018-08-22 Thread OrielResearch Eila Arich-Landkof
cmd) 191 return 0 192 CalledProcessError: Command '['/usr/local/envs/py2env/bin/python', '-m', 'pip', 'install', '--download', '/tmp/tmpyyiizo', 'google-cloud-dataflow==2.0.0', '--no-binary', '

INFO:root:Executing Error when executing a pipeline on dataflow

2018-08-22 Thread OrielResearch Eila Arich-Landkof
Hello all, I am running a pipeline that used to execute on dataflow with no issues. I am using the datalab environment. See the error below; to my understanding, it happens before the pipeline code is executed. Any idea what went wrong? Thanks, Eila Executing the pipeline: *p.

Re: Generating data to beam.io.Write(beam.io.BigQuerySink(

2018-08-14 Thread OrielResearch Eila Arich-Landkof
ou already know), but >> it should be immediately queryable. If you look at the table details, you >> should see records in the streaming buffer. >> >> Kind Regards, >> >> Damien >> >> On Mon, 13 Aug 2018, 20:00 OrielResearch Eila Arich-Landkof, < >

Re: Generating data to beam.io.Write(beam.io.BigQuerySink(

2018-08-13 Thread OrielResearch Eila Arich-Landkof
ase let me know if I am writing the data in the right format to BQ (I had no issues writing it to other types of output). Thanks for any help, Eila On Mon, Aug 13, 2018 at 1:55 PM, OrielResearch Eila Arich-Landkof < e...@orielresearch.org> wrote: > update: > > I tried the following op

Re: Generating data to beam.io.Write(beam.io.BigQuerySink(

2018-08-13 Thread OrielResearch Eila Arich-Landkof
, OrielResearch Eila Arich-Landkof < e...@orielresearch.org> wrote: > Hello, > > I am generating a data to be written in new BQ table with a specific > schema. The data is generated at DoFn function. > > My question is: what is the recommended format of data that I should > retur

Generating data to beam.io.Write(beam.io.BigQuerySink(

2018-08-13 Thread OrielResearch Eila Arich-Landkof
Hello, I am generating data to be written to a new BQ table with a specific schema. The data is generated in a DoFn function. My question is: what is the recommended format of the data that I should return from the DoFn (getValuesStrFn below)? Is it a dictionary? A list? Other? I tried list and str and it fi
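
BigQuery writes in the Beam Python SDK consume one dictionary per row, keyed by column name. A minimal sketch, with placeholder field names and a class name that mirrors the thread's getValuesStrFn:

    import apache_beam as beam

    class GetValuesStrFn(beam.DoFn):
        def process(self, element):
            # One dict per output row: {column_name: value}, matching the sink's schema.
            yield {'name': element[0], 'value': str(element[1])}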

Re: writing to BQ - need help with the error (probably syntax issue)

2018-08-10 Thread OrielResearch Eila Arich-Landkof
validate=False, coder=None): > > On Fri, Aug 10, 2018 at 4:48 PM OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >> Hi, >> >> This pipeline is sinking the rows to BQ. I use the following syntax: >> >> | 'write to BQ'

writing to BQ - need help with the error (probably syntax issue)

2018-08-10 Thread OrielResearch Eila Arich-Landkof
Hi, This pipeline is sinking the rows to BQ. I use the following syntax: | 'write to BQ' >> beam.io.Write(beam.io.BigQuerySink(dataset='dataset_cell_line.cell_lines_1', schema='accession_list:STRING,comment_list:STRING,derived_from:STRING,disease_list:STRING,name_list:STRING,\ reference_lis

Re: PCollection from DataFrame

2018-08-08 Thread OrielResearch Eila Arich-Landkof
Hi Jon, thank you. will try that. Best, Eila On Wed, Aug 8, 2018 at 9:00 AM, Jon Goodrum wrote: > Hi Eila, > > > You can turn your DataFrame into a list via *df.values.tolist()* and pass > that into *beam.Create(...)* directly: > > > import apache_beam

Re: google.cloud.bigQuery version on workers - please HELP

2018-07-16 Thread OrielResearch Eila Arich-Landkof
Hi Ahmet, thank you for the detailed explanation. Looking forward to the latest BQ - Beam version upgrade. Best, Eila On Fri, Jul 13, 2018 at 9:02 PM, Ahmet Altay wrote: > > > On Thu, Jul 12, 2018 at 7:35 PM, OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wr

How to enforce google-cloud-bigquery==0.28.0 installation on the Dataflow workers

2018-07-13 Thread OrielResearch Eila Arich-Landkof
Hi all, I am having a very hard time getting the latest version of bigquery working on the dataflow workers. Can someone advise what would be the easiest way to make this work? Having a setup.py with google-cloud-bigquery==0.28.0 fails the installation of dataflow on the workers at startup time. As you

Re: google.cloud.bigQuery version on workers - please HELP

2018-07-12 Thread OrielResearch Eila Arich-Landkof
ependencies in > the workers. Using requirements.txt is one of those options. > > Ahmet > > [1] https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies#version-250_1 > > On Thu, Jul 12, 2018 at 8:51 AM, OrielResearch Eila Arich-Landkof < > e...@o

google.cloud.bigQuery version on workers - please HELP

2018-07-12 Thread OrielResearch Eila Arich-Landkof
Hi all, I am running a python pipeline with the google.cloud.bigquery library. On the local runner everything runs great: bigquery.__version__ is 0.28.0. On the dataflow runner the version is 0.23.0: bigquery.__version__ is 0.23.0, and there are many API changes between these versions. What will be the

INFO:oauth2client.client:Refreshing due to a 401 (attempt 1/2)

2018-07-09 Thread OrielResearch Eila Arich-Landkof
Hello, I am running a pipeline that extracts columns from a bigquery table and writes the extracted data to a tsv file. The pipeline is stuck on this message: oauth2client.client:Refreshing due to a 401 (attempt 1/2). Could you please let me know what might be the reason for that? Is it another i

Re: Help with adding python package dependencies when executing a Python pipeline

2018-07-03 Thread OrielResearch Eila Arich-Landkof
', '--no-binary', ':all:']' returned non-zero exit status 1. Any suggestion? Thanks, Eila On Tue, Jul 3, 2018 at 5:25 PM, OrielResearch Eila Arich-Landkof < e...@orielresearch.org> wrote: > thank you. Where do I add the reference to requirements.txt? Can I do

Re: Help with adding python package dependencies when executing a Python pipeline

2018-07-03 Thread OrielResearch Eila Arich-Landkof
8 at 2:09 PM OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >> Hello all, >> >> I am using Python code to run my pipeline, similar to the following: >> >> options = PipelineOptions() >> google_cloud_options = >> options.view_as

Help with adding python package dependencies when executing a Python pipeline

2018-07-03 Thread OrielResearch Eila Arich-Landkof
Hello all, I am using Python code to run my pipeline, similar to the following:
options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'my-project-id'
google_cloud_options.job_name = 'myjob'
google_cloud_options.staging_location = 'g
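
A minimal sketch of attaching dependency staging to those same options, which is where requirements.txt gets referenced (the file names shown are the conventional ones):

    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    options = PipelineOptions()
    setup_options = options.view_as(SetupOptions)
    setup_options.requirements_file = 'requirements.txt'  # pip packages, staged to workers
    setup_options.setup_file = './setup.py'               # for custom/non-pip install steps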

Re: [Events] Big Data in Production Boston Meetup, Today!

2018-06-28 Thread OrielResearch Eila Arich-Landkof
Sure. great great! On Thu, Jun 28, 2018 at 5:52 AM, Matthias Baetens wrote: > Looks awesome! Shall I add the talk to the Beam YouTube channel as well? > > On Tue, 26 Jun 2018 at 23:40 Griselda Cuevas wrote: > >> In case you'd like to follow the talk live, here is the livestream link: >> >> http

Re: Pipeline is passing on local runner and failing on Dataflow runner - help with error

2018-06-21 Thread OrielResearch Eila Arich-Landkof
et > > [1] https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/ > > On Thu, Jun 21, 2018 at 9:40 AM, OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >> Hello all, >> >> Exploring that issue (Local runner - works

Re: Pipeline is passing on local runner and failing on Dataflow runner - help with error

2018-06-21 Thread OrielResearch Eila Arich-Landkof
to TSV files/Write/WriteImpl/GroupByKey/Reify+writing to TSV files/Write/WriteImpl/GroupByKey/Write failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on: label-archs4-tsv-06210931-a4r1-harness-rlqz, labe

Pipeline is passing on local runner and failing on Dataflow runner - help with error

2018-06-20 Thread OrielResearch Eila Arich-Landkof
Hello, I am running the following pipeline on the local runner with no issues.
logging.info('Define the pipeline')
p = beam.Pipeline(options=options)
samplePath = outputPath
ExploreData = (p | "Extract the rows from dataframe" >> beam.io.Read(beam.io.BigQuerySource('archs4.Debug_annotation'))

Re: Returning dataframe from parDo and printing its value - advice?

2018-06-19 Thread OrielResearch Eila Arich-Landkof
Thanks!!! On Mon, Jun 18, 2018 at 4:41 PM, Chamikara Jayalath wrote: > A ParDo should always return an iterator not a string. So if you want to > output a single string it should either be "return [str]" or "yield str". > > > On Mon, Jun 18, 2018 at 1:39
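
A minimal sketch of the fix described in the reply above; the class name is borrowed from the thread:

    import apache_beam as beam

    class CreateColForSampleFn(beam.DoFn):
        def process(self, element):
            row = ','.join(str(v) for v in element)
            yield row   # or: return [row], but never a bare string,
                        # which Beam would iterate character by character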

Re: Returning dataframe from parDo and printing its value - advice?

2018-06-18 Thread OrielResearch Eila Arich-Landkof
> after CreateColForSampleFn which takes the 1x164 record and concatenates > each value with ',' in between. > > On Mon, Jun 18, 2018 at 9:00 AM OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >> Hi, >> >> Is anyone listening on th

Re: Returning dataframe from parDo and printing its value - advice?

2018-06-18 Thread OrielResearch Eila Arich-Landkof
to be able to open the output file with Google Sheets. Thanks, Eila On Fri, Jun 15, 2018 at 2:45 PM, OrielResearch Eila Arich-Landkof < e...@orielresearch.org> wrote: > Hi all, > > I am running a pipeline, where a table from BQ is being processed line by > line using ParDo functio

Returning dataframe from parDo and printing its value - advice?

2018-06-15 Thread OrielResearch Eila Arich-Landkof
Hi all, I am running a pipeline where a table from BQ is being processed line by line using a ParDo function. CreateColForSampleFn generates a data frame with headers and values (shape: 1x164) that I want to pass to WriteToText. See the following:
ExploreData = (p | "Extract the rows from dataf

Re: Celebrating Pride... in the Apache Beam Logo

2018-06-15 Thread OrielResearch Eila Arich-Landkof
👍👍👍 On Fri, Jun 15, 2018 at 1:50 PM, Griselda Cuevas wrote: > Someone in my team edited some Open-Source-Projects' logos to celebrate > pride and Apache Beam was included! > > > I'm attaching what she did... sprinkling some fun in the mailing list, > because it's Friday! > -- Eila www.orielr

Re: saving image object on GCS from DoFn

2018-04-27 Thread OrielResearch Eila Arich-Landkof
👍 thank you! On Fri, Apr 27, 2018 at 11:03 AM, Eugene Kirpichov wrote: > You need to close the file object, I think that should fix the issue. > > > On Fri, Apr 27, 2018, 7:47 AM OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >> I tried

Re: saving image object on GCS from DoFn

2018-04-27 Thread OrielResearch Eila Arich-Landkof
Eugene Kirpichov wrote: > You can use FileSystems.create() > <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py> > to create a file on gs:// and you can pass the result of that method to > img.save(). > > > On Thu, Apr 26, 2018, 9:02 AM OrielResearch E

saving image object on GCS from DoFn

2018-04-26 Thread OrielResearch Eila Arich-Landkof
Hello all, I am running the following simplified code from a DoFn (ParDo):
from PIL import Image
img = Image.fromarray(array)
img.save('testrgb.png')
img.save() with a gs:// path does not work. What would be the recommended way to save the img object on Google Storage as a .png file? Any advice is apprec
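
A minimal sketch of the approach from the replies: FileSystems.create() plus an explicit close (the with-block closes the file, which is what actually flushes the upload). The bucket and path are placeholders:

    from apache_beam.io.filesystems import FileSystems
    from PIL import Image

    img = Image.fromarray(array)
    with FileSystems.create('gs://bucket/images/testrgb.png') as f:
        img.save(f, format='PNG')  # PIL can write to any file-like object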

Re: Passing parameter to DoFn in Python

2018-04-10 Thread OrielResearch Eila Arich-Landkof
process(self, element): > # use self.samplePath here, will get to remote workers via pickling > > On Tue, Apr 10, 2018 at 4:27 PM OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >> Hi all, >> >> Is it possible to pass a string parameter with D

Passing parameter to DoFn in Python

2018-04-10 Thread OrielResearch Eila Arich-Landkof
Hi all, Is it possible to pass a string parameter to a DoFn function, and what would be the syntax? The call should look something like this: beam.ParDo(SampleFn(samplePath)). How would the class definition be updated?
class SampleFn(beam.DoFn):
    def process(self, element):
Thanks, -- Eila www
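
A minimal sketch of the constructor-parameter pattern, matching the reply above; the parameter travels to the remote workers via pickling of the DoFn instance:

    import apache_beam as beam

    class SampleFn(beam.DoFn):
        def __init__(self, sample_path):
            self.sample_path = sample_path

        def process(self, element):
            yield '%s/%s' % (self.sample_path, element)

    # usage: beam.ParDo(SampleFn(samplePath))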

Re: Beam modules are not recognized after dataflow SDK upgrade

2018-04-05 Thread OrielResearch Eila Arich-Landkof
-cloud-dataflow on a fresh virtual environment ? > > Thanks, > Cham > > > On Thu, Apr 5, 2018 at 8:51 AM OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >> Hello, >> >> It is probably a FAQ question: >> >> After exec

Beam modules are not recognized after dataflow SDK upgrade

2018-04-05 Thread OrielResearch Eila Arich-Landkof
Hello, It is probably a FAQ question: after executing *pip install --upgrade google-cloud-dataflow*, the following command fires an unrecognized-module error: from apache_beam.options.pipeline_options import GoogleCloudOptions. Could you please advise what is the way to get rid of that message? I

Re: H5 potential intermediate solution

2018-04-04 Thread OrielResearch Eila Arich-Landkof
t the summit that there will be a way to >>> write to BQ without schema. Is something like that on the roadmap? >>> >> >> I don't think supporting this is in the immediate road map of Beam but >> any contributions in this space are welcome. >> >

Re: H5 potential intermediate solution

2018-04-02 Thread OrielResearch Eila Arich-Landkof
.WriteToText('gs://archs4/output/', file_name_suffix='.txt')) *Is there a workaround for providing a schema for beam.io.BigQuerySink?* Many thanks, Eila On Mon, Apr 2, 2018 at 11:33 AM, OrielResearch Eila Arich-Landkof < e...@orielresearch.org> wrote:

H5 potential intermediate solution

2018-04-02 Thread OrielResearch Eila Arich-Landkof
Hello all, I would like to try a different way to leverage Apache Beam for H5 => BQ (file-to-table transfer). For my use case, I would like to read every 10K rows of H5 data (numpy array format), transpose them, and write them to BQ as 10K columns; 10K is the BQ column limit. My code is below and fires
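
A minimal sketch of the 10K-row blocking with h5py; the file and dataset names are placeholders. Each transposed block is at most 10K columns wide, which stays under the BQ limit mentioned above:

    import h5py

    with h5py.File('archs4.h5', 'r') as f:
        dset = f['data/expression']               # hypothetical dataset path
        for start in range(0, dset.shape[0], 10000):
            block = dset[start:start + 10000, :]  # numpy array, <= 10K rows
            block_t = block.T                     # rows become columns
            # ... hand block_t's rows to the BQ write step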

Executing a pipeline from datalab - run.wait_until_finished() error

2018-03-23 Thread OrielResearch Eila Arich-Landkof
Hello all, When I run the pipeline with 4 samples (a very small dataset), I don't get any error on DirectRunner or DataflowRunner. When I run it with a 50-sample dataset, I get the following error from run.wait_until_finished(). What does this error mean? Thanks, Eila KeyErrorTraceback (most recen

Dynamic output - Python

2018-03-12 Thread OrielResearch Eila Arich-Landkof
Hello all, I would like to print each PCollection element to a different folder on gs://.../; the folder name is not available in advance, but it is available in the PCollection data (or can be calculated prior to printing). My questions are: - Is it possible to use the folder path as a parameter to
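
Sink paths in WriteToText are fixed at pipeline-construction time, so an element-dependent folder needs either the newer fileio dynamic-destination writes or a DoFn that opens the path itself. A minimal sketch of the DoFn variant, assuming (folder_name, text) pairs and a placeholder bucket; it writes one object per element, so GroupByKey on the folder first if you want batched files:

    import uuid
    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    class WriteToFolderFn(beam.DoFn):
        def process(self, element):
            folder, payload = element
            path = 'gs://bucket/%s/part-%s.txt' % (folder, uuid.uuid4())
            with FileSystems.create(path) as f:
                f.write(payload.encode('utf-8'))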

Re: Regarding Beam SlackChannel

2018-03-09 Thread OrielResearch Eila Arich-Landkof
thanks!! On Fri, Mar 9, 2018 at 1:58 PM, Lukasz Cwik wrote: > Invite sent, welcome. > > On Thu, Mar 8, 2018 at 7:08 PM, OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >> Hi Lukasz, >> >> Could you please add me as well >> Than

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread OrielResearch Eila Arich-Landkof
Hi Eugene, is there a video that I can watch? Many thanks, Eila On Thu, Mar 8, 2018 at 2:49 PM, Eugene Kirpichov wrote: > Hey all, > > The slides for my yesterday's talk at Strata San Jose https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63696 have been > posted on the

Re: Regarding Beam SlackChannel

2018-03-08 Thread OrielResearch Eila Arich-Landkof
Hi Lukasz, Could you please add me as well Thanks, Eila On Thu, Mar 8, 2018 at 2:56 PM, Lukasz Cwik wrote: > Invite sent, welcome. > > On Thu, Mar 8, 2018 at 11:50 AM, Chang Liu wrote: > >> Hello >> >> Can someone please add me to the Beam slackchannel? >> >> Thanks. >> >> >> Best regards/祝好,

Processing genomics data from bigQuery prior to training model

2018-02-23 Thread OrielResearch Eila Arich-Landkof
Hi all, I am looking for a good reference for processing data prior to training a model using Apache Beam. *Phase 1:* 30K+ columns of features, partitioned between BigQuery tables of 10K columns each, with 100K+ rows. *Phase 2:* more columns and more rows. Any reference is highly appreciated. Thank you

Fwd: dataflow HDF5 loading pipeline errors

2018-02-13 Thread OrielResearch Eila Arich-Landkof
Hello, Any help will be greatly appreciated!!! I am using Dataflow to process an H5 (HDF5 format) file. The H5 file was uploaded to Google Storage from: https://amp.pharm.mssm.edu/archs4/download.html. H5 / HDF5 is a hierarchical data structure to present scie