[jira] [Commented] (BEAM-1442) Performance improvement of the Python DirectRunner
[ https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371695#comment-16371695 ] Konstantinos Katsiapis commented on BEAM-1442: -- Is this now fixed (and expected to land in Beam 2.4)? > Performance improvement of the Python DirectRunner > -- > > Key: BEAM-1442 > URL: https://issues.apache.org/jira/browse/BEAM-1442 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Pablo Estrada >Assignee: Charles Chen >Priority: Major > Labels: gsoc2017, mentor, python > > The DirectRunner for Python and Java are intended to act as policy enforcers, > and correctness checkers for Beam pipelines; but there are users that run > data processing tasks in them. > Currently, the Python Direct Runner has less-than-great performance, although > some work has gone into improving it. There are more opportunities for > improvement. > Skills for this project: > * Python > * Cython (nice to have) > * Working through the Beam getting started materials (nice to have) > To start figuring out this problem, it is advisable to run a simple pipeline, > and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions > directly on JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range
[ https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291988#comment-16291988 ] Konstantinos Katsiapis commented on BEAM-2815: -- Until https://issues.apache.org/jira/browse/BEAM-1442 is resolved, is it worth trying out a (temporary) workaround along the lines of: beam.Pipeline(runner=apache_beam.runners.portability.fn_api_runner.FnApiRunner()): ... (Thanks [~charleschen] for the tip). > Python DirectRunner is unusable with input files in the 100-250MB range > --- > > Key: BEAM-2815 > URL: https://issues.apache.org/jira/browse/BEAM-2815 > Project: Beam > Issue Type: Bug > Components: runner-direct, sdk-py-core >Affects Versions: 2.1.0 > Environment: python 2.7.10, beam 2.1, os x >Reporter: Peter Hausel >Assignee: Charles Chen > Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot > 2017-08-27 at 9.06.00 AM.png > > > The current python DirectRunner implementation seems to be unusable with > training data sets that are bigger than tiny samples - making serious local > development impossible or very cumbersome. I am aware of some of the > limitations of the current DirectRunner implementation[1][2][3], however I > was not sure if this odd behavior is expected. > [1][2][3] > https://stackoverflow.com/a/44765621 > https://issues.apache.org/jira/browse/BEAM-1442 > https://beam.apache.org/documentation/runners/direct/ > Repro: > The simple script below blew up my laptop (MBP 2015) and had to terminate the > process after 10 minutes or so (screenshots about high memory and CPU > utilization are also attached). > {code} > from apache_beam.io import textio > import apache_beam as beam > from apache_beam.options.pipeline_options import PipelineOptions > from apache_beam.options.pipeline_options import SetupOptions > import argparse > def run(argv=None): > """Main entry point; defines and runs the pipeline.""" > parser = argparse.ArgumentParser() > parser.add_argument('--input', > dest='input', > > default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv', > help='Input file to process.') > known_args, pipeline_args = parser.parse_known_args(argv) > pipeline_options = PipelineOptions(pipeline_args) > pipeline_options.view_as(SetupOptions).save_main_session = True > pipeline = beam.Pipeline(options=pipeline_options) > raw_data = ( >pipeline >| 'ReadTrainData' >> textio.ReadFromText(known_args.input, > skip_header_lines=1) >| 'Map' >> beam.Map(lambda line: line.lower()) > ) > result = pipeline.run() > result.wait_until_finish() > print(raw_data) > if __name__ == '__main__': > run() > {code} > Example dataset: > https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009 > for comparison: > {code} > lines = [line.lower() for line in > open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')] > print(len(lines)) > {code} > this vanilla python script runs on the same hardware and dataset in 0m4.909s. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-1442) Performance improvement of the Python DirectRunner
[ https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261433#comment-16261433 ] Konstantinos Katsiapis commented on BEAM-1442: -- [~charleschen] Are you working on this (multi process direct runner)? > Performance improvement of the Python DirectRunner > -- > > Key: BEAM-1442 > URL: https://issues.apache.org/jira/browse/BEAM-1442 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Pablo Estrada >Assignee: Ahmet Altay > Labels: gsoc2017, mentor, python > > The DirectRunner for Python and Java are intended to act as policy enforcers, > and correctness checkers for Beam pipelines; but there are users that run > data processing tasks in them. > Currently, the Python Direct Runner has less-than-great performance, although > some work has gone into improving it. There are more opportunities for > improvement. > Skills for this project: > * Python > * Cython (nice to have) > * Working through the Beam getting started materials (nice to have) > To start figuring out this problem, it is advisable to run a simple pipeline, > and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions > directly on JIRA. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2966) Allow subclasses of tuple, list, dict as pvalues.
[ https://issues.apache.org/jira/browse/BEAM-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16173400#comment-16173400 ] Konstantinos Katsiapis commented on BEAM-2966: -- [~reuvenlax][~robertwb]: Is this fixed and will it be included in the Beam 2.2.0 release? > Allow subclasses of tuple, list, dict as pvalues. > - > > Key: BEAM-2966 > URL: https://issues.apache.org/jira/browse/BEAM-2966 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Robert Bradshaw >Assignee: Robert Bradshaw > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-778) Make fileio._CompressedFile seekable.
[ https://issues.apache.org/jira/browse/BEAM-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947589#comment-15947589 ] Konstantinos Katsiapis commented on BEAM-778: - In general the Beam programming model usually requires thread-compatibility (not the stronger thread-safety). I would hazard a guess that we should follow suite for _CompressedFile (ie no need for locking?). Having said that it might be worth documenting the thread-compatibility but [~chamikara] or [~altay] would have a better sense about overall documentation etc. > Make fileio._CompressedFile seekable. > - > > Key: BEAM-778 > URL: https://issues.apache.org/jira/browse/BEAM-778 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Chamikara Jayalath >Assignee: Tibor Kiss > Fix For: Not applicable > > > We have a TODO to make fileio._CompressedFile seekable. > https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L692 > Without this, compressed file objects produce for FileBasedSource > implementations may not be able to use libraries that utilize methods seek() > and tell(). > For example tarfile.open(). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-778) Make fileio._CompressedFile seekable.
[ https://issues.apache.org/jira/browse/BEAM-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940890#comment-15940890 ] Konstantinos Katsiapis commented on BEAM-778: - Yes, when I had looked at that in the past, it was very easy to use gzip.GzipFile(..., fileobj = self._file, ...) [1] but unfortunately not so for Bzip2 [2] or Snappy [3]. And we wanted to share as much implementation as possible (as opposed to have completely different codepaths for each compression type). Provided that we can have a single interface that allows us to handle Gzip/Bzip2 (and ideally in the future Snappy and other whole-file compression techniques) with minimal diffs, changing the underlying implementation is I think fair game. [1] https://docs.python.org/2/library/gzip.html#gzip.GzipFile and https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py#L103 [2] https://docs.python.org/2/library/bz2.html#bz2.BZ2File [3] https://pypi.python.org/pypi/python-snappy > Make fileio._CompressedFile seekable. > - > > Key: BEAM-778 > URL: https://issues.apache.org/jira/browse/BEAM-778 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Chamikara Jayalath >Assignee: Tibor Kiss > Fix For: Not applicable > > > We have a TODO to make fileio._CompressedFile seekable. > https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L692 > Without this, compressed file objects produce for FileBasedSource > implementations may not be able to use libraries that utilize methods seek() > and tell(). > For example tarfile.open(). -- This message was sent by Atlassian JIRA (v6.3.15#6346)