[ https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Hausel updated BEAM-2815:
-------------------------------
    Attachment: Screen Shot 2017-08-27 at 9.00.29 AM.png
                Screen Shot 2017-08-27 at 9.06.00 AM.png

> Python DirectRunner is unusable with input files in the 100-250MB range
> -----------------------------------------------------------------------
>
>                 Key: BEAM-2815
>                 URL: https://issues.apache.org/jira/browse/BEAM-2815
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-direct
>    Affects Versions: 2.1.0
>         Environment: python 2.7.10, beam 2.1, os x 
>            Reporter: Peter Hausel
>            Assignee: Thomas Groh
>         Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot 2017-08-27 at 9.06.00 AM.png
>
>
> The current Python DirectRunner implementation seems to be unusable with 
> training data sets larger than tiny samples, making serious local 
> development impossible or very cumbersome. I am aware of some of the 
> limitations of the current DirectRunner implementation [1][2][3], but I 
> was not sure whether this odd behavior is expected.
> [1] https://stackoverflow.com/a/44765621
> [2] https://issues.apache.org/jira/browse/BEAM-1442
> [3] https://beam.apache.org/documentation/runners/direct/
> Repro:
> The simple script below blew up my laptop (MBP 2015); I had to terminate the 
> process after 10 minutes or so (screenshots showing the high memory and CPU 
> utilization are also attached).
> {code}
> import argparse
>
> import apache_beam as beam
> from apache_beam.io import textio
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
>
>
> def run(argv=None):
>     """Main entry point; defines and runs the pipeline."""
>     parser = argparse.ArgumentParser()
>     parser.add_argument(
>         '--input',
>         dest='input',
>         default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>         help='Input file to process.')
>     known_args, pipeline_args = parser.parse_known_args(argv)
>     pipeline_options = PipelineOptions(pipeline_args)
>     pipeline_options.view_as(SetupOptions).save_main_session = True
>     pipeline = beam.Pipeline(options=pipeline_options)
>     raw_data = (
>         pipeline
>         | 'ReadTrainData' >> textio.ReadFromText(known_args.input, skip_header_lines=1)
>         | 'Map' >> beam.Map(lambda line: line.lower())
>     )
>     result = pipeline.run()
>     result.wait_until_finish()
>     print(raw_data)
>
>
> if __name__ == '__main__':
>     run()
> {code}
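> One possible local workaround, sketched here with a hypothetical `write_sample` helper and a placeholder line count (not part of the original report), is to pre-truncate the dataset with plain Python before handing it to the DirectRunner, so local iteration stays fast:

```python
import itertools


def write_sample(src_path, dst_path, n_lines=10000):
    # Copy the header line plus the first n_lines data rows into a
    # smaller file suitable for local DirectRunner development.
    with open(src_path) as src, open(dst_path, 'w') as dst:
        dst.writelines(itertools.islice(src, n_lines + 1))
```

> The truncated file can then be passed via --input in place of the full CSV.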
> Example dataset:  
> https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
> For comparison: 
> {code}
> lines = [line.lower() for line in open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
> print(len(lines))
> {code}
> This vanilla Python script runs on the same hardware and dataset in 0m4.909s. 
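> The same baseline can also be timed in-process rather than with the shell's time command; `time_lowercase` below is a hypothetical helper (not from the report) wrapping the same list comprehension:

```python
import time


def time_lowercase(path):
    # Run the list-comprehension baseline and measure wall-clock time.
    start = time.time()
    with open(path) as f:
        lines = [line.lower() for line in f]
    return len(lines), time.time() - start
```

> Comparing this number against the DirectRunner's wall-clock time on the same file makes the overhead concrete.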



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
