[jira] [Commented] (BEAM-1442) Performance improvement of the Python DirectRunner

2018-02-21 Thread Konstantinos Katsiapis (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371695#comment-16371695
 ] 

Konstantinos Katsiapis commented on BEAM-1442:
--

Is this now fixed (and expected to land in Beam 2.4)?

> Performance improvement of the Python DirectRunner
> --
>
> Key: BEAM-1442
> URL: https://issues.apache.org/jira/browse/BEAM-1442
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Pablo Estrada
> Assignee: Charles Chen
> Priority: Major
> Labels: gsoc2017, mentor, python
>
> The Python and Java DirectRunners are intended to act as policy enforcers 
> and correctness checkers for Beam pipelines, but some users also run 
> data processing tasks on them.
> Currently, the Python DirectRunner has less-than-great performance, although 
> some work has gone into improving it. There are more opportunities for 
> improvement.
> Skills for this project:
> * Python
> * Cython (nice to have)
> * Working through the Beam getting started materials (nice to have)
> To start figuring out this problem, it is advisable to run a simple pipeline, 
> and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions 
> directly on JIRA.
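
As a hedged starting point for the profiling suggested above, the sketch below uses the standard-library cProfile and pstats modules to rank functions by cumulative time. The run_workload function is a hypothetical stand-in for code that builds and runs a simple pipeline; it does not call Beam, and profile_call is an illustrative helper name.

```python
import cProfile
import io
import pstats


def run_workload(n=10000):
    """Hypothetical stand-in for code that builds and runs a simple pipeline."""
    return sum(len(str(i).lower()) for i in range(n))


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time and keep only the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()


if __name__ == "__main__":
    result, report = profile_call(run_workload)
    print(report)
```

Replacing run_workload with a function that calls `Pipeline.run` would, under the same assumptions, show which runner methods dominate the run time.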



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range

2017-12-14 Thread Konstantinos Katsiapis (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291988#comment-16291988
 ] 

Konstantinos Katsiapis commented on BEAM-2815:
--

Until https://issues.apache.org/jira/browse/BEAM-1442 is resolved, is it worth 
trying out a (temporary) workaround along the lines of:

with beam.Pipeline(runner=apache_beam.runners.portability.fn_api_runner.FnApiRunner()) as p:
   ...

(Thanks [~charleschen] for the tip).

> Python DirectRunner is unusable with input files in the 100-250MB range
> ---
>
> Key: BEAM-2815
> URL: https://issues.apache.org/jira/browse/BEAM-2815
> Project: Beam
> Issue Type: Bug
> Components: runner-direct, sdk-py-core
> Affects Versions: 2.1.0
> Environment: python 2.7.10, beam 2.1, os x
> Reporter: Peter Hausel
> Assignee: Charles Chen
> Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot 
> 2017-08-27 at 9.06.00 AM.png
>
>
> The current Python DirectRunner implementation seems to be unusable with 
> training data sets that are bigger than tiny samples, making serious local 
> development impossible or very cumbersome. I am aware of some of the 
> limitations of the current DirectRunner implementation [1][2][3]; however, I 
> was not sure if this odd behavior is expected.
> [1] https://stackoverflow.com/a/44765621
> [2] https://issues.apache.org/jira/browse/BEAM-1442
> [3] https://beam.apache.org/documentation/runners/direct/
> Repro:
> The simple script below blew up my laptop (MBP 2015), and I had to terminate the 
> process after 10 minutes or so (screenshots showing high memory and CPU 
> utilization are also attached).
> {code}
> import argparse
>
> import apache_beam as beam
> from apache_beam.io import textio
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
>
>
> def run(argv=None):
>     """Main entry point; defines and runs the pipeline."""
>     parser = argparse.ArgumentParser()
>     parser.add_argument(
>         '--input',
>         dest='input',
>         default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>         help='Input file to process.')
>     known_args, pipeline_args = parser.parse_known_args(argv)
>     pipeline_options = PipelineOptions(pipeline_args)
>     pipeline_options.view_as(SetupOptions).save_main_session = True
>     pipeline = beam.Pipeline(options=pipeline_options)
>     # Read the CSV (skipping its header row) and lowercase every line.
>     raw_data = (
>         pipeline
>         | 'ReadTrainData' >> textio.ReadFromText(
>             known_args.input, skip_header_lines=1)
>         | 'Map' >> beam.Map(lambda line: line.lower())
>     )
>     result = pipeline.run()
>     result.wait_until_finish()
>     # Prints the PCollection object itself, not its elements.
>     print(raw_data)
>
>
> if __name__ == '__main__':
>     run()
> {code}
> Example dataset: 
> https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
> For comparison: 
> {code}
> lines = [line.lower() for line in
>          open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
> print(len(lines))
> {code}
> This vanilla Python script runs on the same hardware and dataset in 0m4.909s. 
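
A minimal sketch of how the plain-Python baseline above could be timed reproducibly. It assumes a small synthetic CSV-like file written to a temporary path rather than the real dataset; make_sample_file, lowercase_lines, and timed are illustrative names, and the file contents are placeholders.

```python
import os
import tempfile
import time


def lowercase_lines(path):
    """Read a text file and return its lines lowercased (the plain-Python baseline)."""
    with open(path) as f:
        return [line.lower() for line in f]


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def make_sample_file(num_lines):
    """Write a synthetic CSV-like file; a placeholder, not the real dataset."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w") as f:
        f.write("YEAR,TYPE\n")
        for i in range(num_lines):
            f.write("2016,PASSENGER VEHICLE %d\n" % i)
    return path


if __name__ == "__main__":
    path = make_sample_file(100000)
    try:
        lines, seconds = timed(lowercase_lines, path)
        print("%d lines in %.3fs" % (len(lines), seconds))
    finally:
        os.remove(path)
```

Wrapping the pipeline's `run()` call in the same timed helper would give a like-for-like number for the DirectRunner on the same input.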





[jira] [Commented] (BEAM-1442) Performance improvement of the Python DirectRunner

2017-11-21 Thread Konstantinos Katsiapis (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261433#comment-16261433
 ] 

Konstantinos Katsiapis commented on BEAM-1442:
--

[~charleschen] Are you working on this (a multi-process DirectRunner)?

> Performance improvement of the Python DirectRunner
> --
>
> Key: BEAM-1442
> URL: https://issues.apache.org/jira/browse/BEAM-1442
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Pablo Estrada
> Assignee: Ahmet Altay
> Labels: gsoc2017, mentor, python
>





[jira] [Commented] (BEAM-2966) Allow subclasses of tuple, list, dict as pvalues.

2017-09-20 Thread Konstantinos Katsiapis (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16173400#comment-16173400
 ] 

Konstantinos Katsiapis commented on BEAM-2966:
--

[~reuvenlax][~robertwb]: Is this fixed and will it be included in the Beam 
2.2.0 release?

> Allow subclasses of tuple, list, dict as pvalues.
> -
>
> Key: BEAM-2966
> URL: https://issues.apache.org/jira/browse/BEAM-2966
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Robert Bradshaw
> Assignee: Robert Bradshaw
>






[jira] [Commented] (BEAM-778) Make fileio._CompressedFile seekable.

2017-03-29 Thread Konstantinos Katsiapis (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947589#comment-15947589
 ] 

Konstantinos Katsiapis commented on BEAM-778:
-

The Beam programming model usually requires thread-compatibility 
(not the stronger thread-safety). I would hazard a guess that we should follow 
suit for _CompressedFile (i.e. no need for locking?). Having said that, it might 
be worth documenting the thread-compatibility, but [~chamikara] or [~altay] 
would have a better sense of the overall documentation, etc.
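
To illustrate the distinction, here is a minimal sketch, not taken from Beam: a thread-compatible class does no locking of its own and is safe only when callers serialize access to it. Counter and run_workers are hypothetical names used for illustration.

```python
import threading


class Counter(object):
    """Thread-compatible: no internal locking; callers must serialize access."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1


def run_workers(counter, lock, num_threads=4, increments=1000):
    """Each worker takes the external lock around every call to the counter."""

    def work():
        for _ in range(increments):
            with lock:
                counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers(Counter(), threading.Lock()))
```

A thread-safe variant would instead hold the lock inside increment(), which is the extra cost the comment above suggests _CompressedFile does not need to pay.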

> Make fileio._CompressedFile seekable.
> -
>
> Key: BEAM-778
> URL: https://issues.apache.org/jira/browse/BEAM-778
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py
> Reporter: Chamikara Jayalath
> Assignee: Tibor Kiss
> Fix For: Not applicable
>
>
> We have a TODO to make fileio._CompressedFile seekable.
> https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L692
> Without this, compressed file objects produced for FileBasedSource 
> implementations may not be usable with libraries that rely on the seek() 
> and tell() methods, for example tarfile.open().





[jira] [Commented] (BEAM-778) Make fileio._CompressedFile seekable.

2017-03-24 Thread Konstantinos Katsiapis (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940890#comment-15940890
 ] 

Konstantinos Katsiapis commented on BEAM-778:
-

Yes, when I looked at that in the past, it was very easy to use 
gzip.GzipFile(..., fileobj = self._file, ...) [1], but unfortunately not so for 
Bzip2 [2] or Snappy [3]. And we wanted to share as much implementation as 
possible (as opposed to having completely different codepaths for each 
compression type).

Provided that we can have a single interface that allows us to handle 
Gzip/Bzip2 (and ideally, in the future, Snappy and other whole-file compression 
techniques) with minimal diffs, changing the underlying implementation is, I 
think, fair game.

[1] https://docs.python.org/2/library/gzip.html#gzip.GzipFile and 
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py#L103
[2] https://docs.python.org/2/library/bz2.html#bz2.BZ2File
[3] https://pypi.python.org/pypi/python-snappy
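
One hedged way to picture a shared code path: both zlib.decompressobj and bz2.BZ2Decompressor expose an incremental decompress(bytes) method, so a single reader can wrap either one. The sketch below is an illustration under that assumption, not the Beam implementation; DecompressingReader and make_decompressor are hypothetical names, and the reader is read-only with no seek() or tell().

```python
import bz2
import gzip
import io
import zlib


def make_decompressor(kind):
    """Return an incremental decompressor for 'gzip' or 'bzip2'."""
    if kind == "gzip":
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompressobj(16 + zlib.MAX_WBITS)
    if kind == "bzip2":
        return bz2.BZ2Decompressor()
    raise ValueError("unsupported compression type: %r" % kind)


class DecompressingReader(object):
    """Read-only wrapper that decompresses an underlying binary file object."""

    def __init__(self, fileobj, kind, chunk_size=16 * 1024):
        self._file = fileobj
        self._decompressor = make_decompressor(kind)
        self._chunk_size = chunk_size
        self._buffer = b""

    def read(self, num_bytes):
        """Return up to num_bytes of decompressed data; b'' at end of input."""
        while len(self._buffer) < num_bytes:
            chunk = self._file.read(self._chunk_size)
            if not chunk:
                break
            self._buffer += self._decompressor.decompress(chunk)
        data, self._buffer = self._buffer[:num_bytes], self._buffer[num_bytes:]
        return data


if __name__ == "__main__":
    payload = b"hello beam\n" * 1000
    for kind, blob in (("gzip", gzip.compress(payload)),
                       ("bzip2", bz2.compress(payload))):
        reader = DecompressingReader(io.BytesIO(blob), kind)
        assert reader.read(len(payload)) == payload
```

Only make_decompressor differs per compression type here, which is the kind of minimal diff the comment above argues for; a Snappy decompressor with the same method shape could, in principle, be added in the same place.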



