[ https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Hausel updated BEAM-2815:
-------------------------------
    Attachment: Screen Shot 2017-08-27 at 9.00.29 AM.png
                Screen Shot 2017-08-27 at 9.06.00 AM.png

> Python DirectRunner is unusable with input files in the 100-250MB range
> -----------------------------------------------------------------------
>
>                 Key: BEAM-2815
>                 URL: https://issues.apache.org/jira/browse/BEAM-2815
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-direct
>    Affects Versions: 2.1.0
>         Environment: python 2.7.10, beam 2.1, os x
>            Reporter: Peter Hausel
>            Assignee: Thomas Groh
>         Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot 2017-08-27 at 9.06.00 AM.png
>
>
> The current Python DirectRunner implementation seems to be unusable with training data sets that are bigger than tiny samples, which makes serious local development impossible or very cumbersome. I am aware of some of the limitations of the current DirectRunner implementation [1][2][3], but I was not sure whether this odd behavior is expected.
> [1] https://stackoverflow.com/a/44765621
> [2] https://issues.apache.org/jira/browse/BEAM-1442
> [3] https://beam.apache.org/documentation/runners/direct/
> Repro:
> The simple script below blew up my laptop (MBP 2015); I had to terminate the process after 10 minutes or so (screenshots showing the high memory and CPU utilization are also attached).
> {code}
> from apache_beam.io import textio
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
> import argparse
>
>
> def run(argv=None):
>   """Main entry point; defines and runs the pipeline."""
>   parser = argparse.ArgumentParser()
>   parser.add_argument('--input',
>                       dest='input',
>                       default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>                       help='Input file to process.')
>   known_args, pipeline_args = parser.parse_known_args(argv)
>   pipeline_options = PipelineOptions(pipeline_args)
>   pipeline_options.view_as(SetupOptions).save_main_session = True
>   pipeline = beam.Pipeline(options=pipeline_options)
>   raw_data = (
>       pipeline
>       | 'ReadTrainData' >> textio.ReadFromText(known_args.input, skip_header_lines=1)
>       | 'Map' >> beam.Map(lambda line: line.lower())
>   )
>   result = pipeline.run()
>   result.wait_until_finish()
>   print(raw_data)
>
>
> if __name__ == '__main__':
>   run()
> {code}
> Example dataset: https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
> For comparison:
> {code}
> lines = [line.lower() for line in open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
> print(len(lines))
> {code}
> This vanilla Python script runs on the same hardware and dataset in 0m4.909s.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
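Possibly useful when reproducing: a minimal sketch of the same pipeline with an explicit sink added, so the mapped output is actually written somewhere. The {{--output}} flag, its path, and the {{WriteTrainData}} step below are hypothetical additions and are not part of the original report; this is only one way to double-check that the memory and CPU blow-up is unrelated to the pipeline having no sink.

{code}
# Sketch only: same repro pipeline as above, plus an explicit sink.
# The --output flag and the WriteTrainData step are hypothetical additions.
import argparse

import apache_beam as beam
from apache_beam.io import textio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def run(argv=None):
  parser = argparse.ArgumentParser()
  parser.add_argument('--input', dest='input', required=True,
                      help='Input CSV file to process.')
  parser.add_argument('--output', dest='output', required=True,
                      help='Output path prefix (hypothetical addition).')
  known_args, pipeline_args = parser.parse_known_args(argv)
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = True
  pipeline = beam.Pipeline(options=pipeline_options)
  (
      pipeline
      | 'ReadTrainData' >> textio.ReadFromText(known_args.input, skip_header_lines=1)
      | 'Map' >> beam.Map(lambda line: line.lower())
      # Writing the results out rules out the missing sink as a factor.
      | 'WriteTrainData' >> textio.WriteToText(known_args.output)
  )
  result = pipeline.run()
  result.wait_until_finish()


if __name__ == '__main__':
  run()
{code}

Invoked, for example, as {{python repro_with_sink.py --input <csv> --output /tmp/lowercased}} (script name and paths hypothetical).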