Re: Implicit file-size limit of input files?

Tobias Feldhaus Fri, 10 Feb 2017 03:18:57 -0800

Addendum: When running in streaming mode with version 0.5 of the SDK, 
the elements are stuck before being emitted [0]. The process starts and 
runs until, most likely, the memory is full (GC overhead error), and then 
it crashes.


It seems that the Reshuffle that takes place prevents any output from being 
produced. To get rid of it, I would need to find another way to write to a 
partition in BigQuery in batch mode without using the workaround described 
in [1], but I don't know how.
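
For readers unfamiliar with the target of that write: BigQuery addresses a single day partition with a decorator of the form table$YYYYMMDD. The following is only a minimal sketch of deriving such a decorator from an element timestamp. It assumes daily partitions and UTC, the table name "events" is hypothetical, and it is plain Python rather than the Java SDK code used in the pipeline.

```python
from datetime import datetime, timezone


def partition_decorator(table: str, epoch_seconds: float) -> str:
    """Return 'table$YYYYMMDD' for the UTC day that contains the timestamp."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{table}${day:%Y%m%d}"


if __name__ == "__main__":
    # Hypothetical table name; the timestamp is 2017-02-10 11:18:57 UTC.
    print(partition_decorator("events", 1486725537))
```

Every element whose timestamp falls on the same UTC day maps to the same decorator, which is the grouping a per-day window would produce.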

[0] https://puu.sh/tWInq/f41beae65b.png
[1] http://stackoverflow.com/questions/38114306/creating-writing-to-parititoned-bigquery-table-via-google-cloud-dataflow/40863609#40863609

On 10.02.17, 10:34, "Tobias Feldhaus" <tobias.feldh...@localsearch.ch> wrote:

    Hi,
    
    I am currently facing a problem with a relatively simple pipeline [0] that
    is reading gzipped JSON files from Google Cloud Storage (GCS), adding a
    timestamp, and pushing them into BigQuery. The only special thing I am
    also doing is partitioning via a PartioningWindowFn that assigns a
    partition to each element, as described here [1].
    
    The pipeline works locally and remotely on the Google Cloud Dataflow Service
    (GCDS) with smaller test files, but if I run it on the roughly 100 real
    ones, at 2 GB each, it breaks down in both streaming and batch mode, with
    different errors.
    
    The pipeline runs in batch mode, but in the end it gets stuck processing
    only 1000-5000 streaming inserts per second to BQ, while constantly scaling
    up the number of instances [2]. As you can see in the screenshot, the
    shuffle had still not started when I had to stop the job to cut costs.
    
    If run in streaming mode, pipeline creation fails because of a resource
    allocation failure (Step setup_resource_disks_harness19: Set up of resource
    disks_harness failed: Unable to create data disk(s): One or more operations 
    had an error: [QUOTA_EXCEEDED] 'Quota 'DISKS_TOTAL_GB' exceeded.  
    Limit: 80000.0). This means it requested more than 80 (!) TB for a job
    that operates on 200 GB of compressed (or 2 TB of uncompressed) files.
    
    I’ve tried to run it with instances that are as large as n1-highmem-16 
    (104 GB memory each) and 1200 GB local storage.
    
    I know this is the Apache Beam mailing list and not intended for GCDS
    support; my question is therefore whether anyone has faced this issue with
    the SDK before, or whether there is a known size limit for files.
    
    
    Thanks,
    Tobias
    
    [0] https://gist.github.com/james-woods/98901f7ef2b405a7e58760057c48162f
    [1] http://stackoverflow.com/a/40863609/5497956
    [2] https://puu.sh/tWzkh/49b99477e3.png
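
To put the figures from the quoted message side by side, here is a small worked calculation. It uses only the numbers stated above (about 100 files of 2 GB each, roughly 2 TB uncompressed, and the DISKS_TOTAL_GB limit of 80000) and assumes decimal units (1 TB = 1000 GB). The function name is illustrative.

```python
GB_PER_TB = 1000  # assumption: decimal units, so 80000 GB reads as 80 TB


def disk_request_summary(files=100, gb_per_file=2, uncompressed_tb=2,
                         disk_quota_gb=80000):
    """Compare the disk quota that was exceeded with the input data size."""
    compressed_gb = files * gb_per_file
    uncompressed_gb = uncompressed_tb * GB_PER_TB
    return {
        "compressed_gb": compressed_gb,
        "quota_tb": disk_quota_gb / GB_PER_TB,
        # How many times larger the exceeded quota is than the input data.
        "quota_vs_compressed": disk_quota_gb / compressed_gb,
        "quota_vs_uncompressed": disk_quota_gb / uncompressed_gb,
    }


if __name__ == "__main__":
    for key, value in disk_request_summary().items():
        print(f"{key}: {value}")
```

Under these assumptions, the disk request exceeded a limit that is 400 times the compressed input size and 40 times the estimated uncompressed size.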
