[ 
https://issues.apache.org/jira/browse/FLINK-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi resolved FLINK-528.
-------------------------------

    Resolution: Invalid

> First release for python-language-binding
> -----------------------------------------
>
>                 Key: FLINK-528
>                 URL: https://issues.apache.org/jira/browse/FLINK-528
>             Project: Flink
>          Issue Type: Bug
>          Components: Python API
>            Reporter: GitHub Import
>              Labels: github-import
>             Fix For: pre-apache
>
>
> Hi, 
> since the python-language-binding is moving towards a first release, I 
> wanted to open an issue to show the current state, the syntax and the planned 
> roadmap towards the release (and the future roadmap), and to open it for 
> discussion ;) 
> Current State in short
> ( https://github.com/filiphaase/stratosphere/tree/langbinding )
> python-language-binding enables the user to write stratosphere jobs in python 
> including the following operators: FileInputFormat, CSVOutputFormat, Join, 
> Cross, CoGroup, Map, Reduce (without Combiner) and Union for all of them. 
> Jobs can be executed locally or on a cluster. For cluster execution a job can 
> be submitted via the stratosphere bash script, whereby the java code of the 
> framework is used as a jar, and the python script of the user and the python 
> files of the language-binding framework are shipped via the configuration.
> To show you the syntax I set up a short documentation in GDocs: 
> https://docs.google.com/document/d/1Caml9rr7irecKo32TmfM5p-ns9OUDhw4fQ5Xzb_wAS0/edit?usp=sharing
> And as always a WordCount-Example to show current syntax:
> ```python
> # Imports of the language binding itself (TextInputFormat, ValueType)
> # are not shown in this example.
> import re
>
> inputPath1 = r"file:///home/filip/Documents/stratosphere/hamlet.txt"
> outputPath = r"file:///home/filip/Documents/stratosphere/WCresult.txt"
>
> def split(record, collector):
>     # Lower-case the line, replace non-word characters with spaces,
>     # and emit one (word, 1) record per word.
>     filteredLine = re.sub(r"\W+", " ", record[0].lower())
>     for s in filteredLine.split():
>         collector.collect((s, 1))
>
> def count(records, collector):
>     # Count the records of one key group and emit (word, count).
>     total = 0
>     record = None
>     for val in records:
>         record = val
>         total += 1
>     if record is not None:
>         collector.collect((record[0], total))
>
> TextInputFormat(inputPath1).map(split, [ValueType.String, ValueType.Int]) \
>     .reduce(count, [ValueType.String, ValueType.Int], keyInd = 0) \
>     .outputCSV(outputPath, [0,1], fieldDelimiter = " ", recordDelimiter = "\n") \
>     .execute()
> ```
> Release-Roadmap
> - Add fault-tolerance and debugging possibilities for users. 
>     Currently stdin/stdout is used for IPC, therefore the user is not allowed 
>     to use print() and any debugging must be done via files. Furthermore, an 
>     error can occur in the python process while java waits endlessly for an 
>     answer.
>     Solution: Use files for IPC (or shared memory directly, if easily 
>     implementable) and run the execution with three threads: one for 
>     execution, one for stdout (to allow the user to debug), and one for 
>     stderr and error detection.
> - Add pyInstaller to the execution process. pyInstaller can pack the python 
>   script with all its dependencies into an executable and therefore enables 
>   users to use any libraries and scripts which are only installed on the 
>   master machine and not on the whole cluster.
> - Add missing functionality:
>     - ValueType.long
>     - CSVInputFormat
> - Add a pyStratosphere bash script for calling the python-lang-binding and 
>   enable the user to pass command line arguments to the python process. 
> Longterm-Roadmap (partly covered in the mailing list thread “Contributing to 
> the language binding”)
> - Missing functionality:
>     - Iterators
>     - Aggregators
>     - Accumulators and Counters
>     - Combinable
>     - Broadcast Variables
> - use shared memory/improved serialization/type handling for improved speed
> - "standalone" driver for the language binding to use it on "small" data on a 
> local machine & for development
> - develop machine-learning use-case
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/528
> Created by: [filiphaase|https://github.com/filiphaase]
> Labels: 
> Created at: Mon Mar 03 17:50:24 CET 2014
> State: open
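
To make the WordCount example in the quoted issue easier to follow without a stratosphere installation, here is a minimal plain-Python sketch of what its two user functions compute. It is not the language binding and not the "standalone" driver mentioned in the long-term roadmap. `ListCollector` and `word_count` are hypothetical helpers introduced only for this illustration, and it assumes that a text record is a one-field tuple (the quoted code reads `record[0]`) and that reduce receives all records of one key as an iterator.

```python
import re
from itertools import groupby
from operator import itemgetter


class ListCollector:
    """Hypothetical stand-in for the binding's collector: stores records in a list."""

    def __init__(self):
        self.records = []

    def collect(self, record):
        self.records.append(record)


def split(record, collector):
    # Lower-case the line, replace non-word characters with spaces,
    # and emit one (word, 1) record per word.
    filtered_line = re.sub(r"\W+", " ", record[0].lower())
    for word in filtered_line.split():
        collector.collect((word, 1))


def count(records, collector):
    # Count the records of one key group and emit (word, count).
    total = 0
    last = None
    for last in records:
        total += 1
    if last is not None:
        collector.collect((last[0], total))


def word_count(lines):
    """Run split on every line, group by field 0, and run count per group."""
    mapped = ListCollector()
    for line in lines:
        split((line,), mapped)  # assumption: one-field tuple per text line

    reduced = ListCollector()
    key = itemgetter(0)
    # groupby only merges adjacent records, so sort by key first.
    for _, group in groupby(sorted(mapped.records, key=key), key=key):
        count(group, reduced)
    return reduced.records


if __name__ == "__main__":
    print(word_count(["To be, or not to be"]))
    # prints [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Sorting before grouping is only a convenience of this sketch; how the framework actually groups records by `keyInd` is not described in the issue.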

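The three-thread execution scheme proposed in the release roadmap can be sketched roughly as follows. The issue describes the java side supervising the python process; this sketch shows the same pattern in Python for consistency with the example above. It assumes that record exchange happens elsewhere (for example via files, as proposed), so stdout is free for user debugging output and stderr plus the exit code are used for error detection. `run_worker` is a hypothetical name, not part of the language binding.

```python
import subprocess
import sys
import threading


def run_worker(cmd):
    """Run cmd and drain its stdout and stderr on two separate threads.

    Returns (exit_code, stdout_lines, stderr_lines).
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    out_lines, err_lines = [], []

    def drain(stream, sink):
        # Read until EOF so the child never blocks on a full pipe.
        for line in stream:
            sink.append(line.rstrip("\n"))
        stream.close()

    readers = [
        threading.Thread(target=drain, args=(proc.stdout, out_lines)),
        threading.Thread(target=drain, args=(proc.stderr, err_lines)),
    ]
    for reader in readers:
        reader.start()

    # The calling thread takes the "execution" role and waits for the child.
    exit_code = proc.wait()
    for reader in readers:
        reader.join()
    return exit_code, out_lines, err_lines


if __name__ == "__main__":
    child = "import sys; print('debug output'); sys.exit('worker failed')"
    code, out, err = run_worker([sys.executable, "-c", child])
    if code != 0:
        # A non-zero exit code signals failure instead of waiting forever.
        print("worker exited with", code, "stderr:", err)
    print("stdout:", out)
```

Because the parent waits on the process rather than on a reply over stdin/stdout, a crash in the child shows up as a non-zero exit code with the error text collected from stderr. A child that hangs without exiting would still need a timeout, which this sketch leaves out.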


--
This message was sent by Atlassian JIRA
(v6.2#6252)
