[
https://issues.apache.org/jira/browse/FLINK-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ufuk Celebi updated FLINK-528:
------------------------------
Component/s: Python API
> First release for python-language-binding
> -----------------------------------------
>
> Key: FLINK-528
> URL: https://issues.apache.org/jira/browse/FLINK-528
> Project: Flink
> Issue Type: Bug
> Components: Python API
> Reporter: GitHub Import
> Labels: github-import
> Fix For: pre-apache
>
>
> Hi,
> since the python-language-binding is moving towards a first release, I
> wanted to open an issue to show the current state, the syntax and the planned
> roadmap towards the release (plus a longer-term roadmap), and to open it for
> discussion ;)
> Current State in short
> ( https://github.com/filiphaase/stratosphere/tree/langbinding )
> The python-language-binding enables users to write Stratosphere jobs in Python
> using the following operators: FileInputFormat, CSVOutputFormat, Join,
> Cross, CoGroup, Map and Reduce (without Combiner), plus Union for all of them.
> Jobs can be executed locally or on a cluster. For cluster execution, a job is
> submitted via the Stratosphere bash script: the Java code of the framework is
> used as the jar, while the user's Python script and the Python files of the
> language-binding framework are shipped via the configuration.
> To show you the syntax, I set up a little documentation in GDocs:
> https://docs.google.com/document/d/1Caml9rr7irecKo32TmfM5p-ns9OUDhw4fQ5Xzb_wAS0/edit?usp=sharing
> And, as always, a WordCount example to show the current syntax:
> ```python
> import re
>
> inputPath1 = r"file:///home/filip/Documents/stratosphere/hamlet.txt"
> outputPath = r"file:///home/filip/Documents/stratosphere/WCresult.txt"
>
> def split(record, collector):
>     # Lowercase the line, replace runs of non-word characters with a space,
>     # and emit one (word, 1) pair per word.
>     filteredLine = re.sub(r"\W+", " ", record[0].lower())
>     for s in filteredLine.split():
>         collector.collect((s, 1))
>
> def count(iter, collector):
>     # Count the records of one key group and emit (word, count).
>     sum = 0
>     record = None
>
>     for val in iter:
>         record = val
>         sum += 1
>
>     if record is not None:
>         collector.collect((record[0], int(sum)))
>
> TextInputFormat(inputPath1).map(split, [ValueType.String, ValueType.Int]) \
>     .reduce(count, [ValueType.String, ValueType.Int], keyInd = 0) \
>     .outputCSV(outputPath, [0,1], fieldDelimiter = " ", recordDelimiter = "\n") \
>     .execute()
> ```
> Release-Roadmap
> - Add fault tolerance and debugging possibilities for users.
> Currently stdin/stdout is used for IPC, so the user cannot use print() and
> any debugging must be done via files. Furthermore, an error can occur in
> the Python process while Java waits endlessly for an answer.
> Solution:
> Use files for IPC (or shared memory directly, if easily implementable) and
> run the execution with three threads: one for execution, one for stdout (to
> allow user debugging), and one for stderr and error detection.
> - Add PyInstaller to the execution process. PyInstaller can pack the Python
> script with all its dependencies into an executable, which lets users use
> libraries and scripts that are installed only on the master machine and not
> on the whole cluster.
> - Add missing functionality:
> - ValueType.long
> - CSVInputFormat
> - Add a pyStratosphere bash script to invoke the python-language-binding and
> let users pass command-line arguments to the Python process.
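
The three-thread idea from the first roadmap item can be illustrated with a small, self-contained sketch. It is written in plain Python rather than in the Java code of the framework, and it is not the planned implementation: `drain`, `run_worker` and the `timeout_seconds` parameter are hypothetical names chosen for this example, and the timeout is an added assumption for avoiding the endless wait. The data channel (files or shared memory) is left out. The calling thread waits for the child process, while two reader threads empty stdout and stderr so that neither pipe can block the child.

```python
import subprocess
import sys
import threading


def drain(stream, sink):
    """Copy every line of a child pipe into a list so the pipe never fills up."""
    for line in stream:
        sink.append(line.rstrip("\n"))
    stream.close()


def run_worker(argv, timeout_seconds=10.0):
    """Run a child process and return (exit_code, stdout_lines, stderr_lines).

    exit_code is None if the child had to be killed after the timeout.
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    out_lines, err_lines = [], []
    # One reader per pipe: stdout carries user debug output, stderr carries errors.
    readers = [
        threading.Thread(target=drain, args=(proc.stdout, out_lines)),
        threading.Thread(target=drain, args=(proc.stderr, err_lines)),
    ]
    for reader in readers:
        reader.start()
    try:
        # The calling thread plays the role of the "execution" thread.
        exit_code = proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # Assumption: a hung worker is killed instead of being waited on forever.
        proc.kill()
        proc.wait()
        exit_code = None
    for reader in readers:
        reader.join()
    return exit_code, out_lines, err_lines


if __name__ == "__main__":
    # A child that prints debug output and then fails with a message on stderr.
    child = [sys.executable, "-c", "import sys; print('debug'); sys.exit('boom')"]
    print(run_worker(child))
```

A non-zero exit code or any content in the stderr list can then be treated as a failed task, which covers the case where the Python process dies without sending an answer.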
> Longterm-Roadmap (partly covered in mailing list “Contributing to the language
> binding”)
> - Missing functionality:
> - Iterators
> - Aggregators
> - Accumulators and Counters
> - Combinable
> - Broadcast Variables
> - use shared memory/improved serialization/type handling for improved speed
> - "standalone" driver for the language binding to use it on "small" data on a
> local machine & for development
> - develop machine-learning use-case
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/528
> Created by: [filiphaase|https://github.com/filiphaase]
> Labels:
> Created at: Mon Mar 03 17:50:24 CET 2014
> State: open
--
This message was sent by Atlassian JIRA
(v6.2#6252)