[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-01 Thread aljoscha
GitHub user aljoscha opened a pull request: https://github.com/apache/flink/pull/1084 [FLINK-2583] Add Stream Sink For Rolling HDFS Files Note: The rolling sink is not yet integrated with checkpointing/fault-tolerance. You can merge this pull request into a Git repository by runni

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-01 Thread rmetzger
Github user rmetzger commented on a diff in the pull request: https://github.com/apache/flink/pull/1084#discussion_r38435351 --- Diff: flink-staging/flink-streaming/flink-streaming-connectors/flink-connector-hdfs/pom.xml --- @@ -0,0 +1,107 @@ + + +http://maven.apache.o

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-02 Thread StephanEwen
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-137094936 Looks very good from a first glance! Can you explain how the writing behaves in cases of failues? Will it start a new file? Will it try to append to the prev

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-02 Thread aljoscha
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-137125224 If it fails in the middle of writing or before sync/flush is called on the writer then the data can be in an inconsistent state. I see three ways of dealing with this,

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-02 Thread tillrohrmann
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-137143368 Concerning the second short term option: Is the problem there that the checkpointing is not aligned with the rolling? Thus, you can take a checkpoint but you still

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-02 Thread aljoscha
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-137144072 Should be, but then we can also just keep them in memory and write when checkpointing. But this has more potential of blowing up because OOM. --- If your project is se

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-02 Thread tillrohrmann
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-137148970 What about renaming the files upon checkpointing? I mean not triggering the rolling mechanism but to write the current temp file as the latest part and start a new

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-03 Thread rmetzger
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-137390414 I'm currently trying out the module. Some comments: - Why do we name the module `flink-connector-hdfs`. I think a name such as `flink-connector-filesystems` or `flin

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-03 Thread rmetzger
Github user rmetzger commented on a diff in the pull request: https://github.com/apache/flink/pull/1084#discussion_r38627730 --- Diff: docs/apis/streaming_guide.md --- @@ -1836,6 +1837,110 @@ More about information about Elasticsearch can be found [here](https://elastic.c

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-03 Thread StephanEwen
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-137393554 I think using truncate for exactly once is the way to go. To support users with older HDFS versions, how about this: 1. We consider only valid what was writt

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-03 Thread aljoscha
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-137396268 Question is, should we do exactly once now or put it in as it is (more or less)? --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-04 Thread aljoscha
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-137764994 I (almost) completely reworked the sink. It is now called `RollingSink` and the module is called `flink-connector-filesystem` to show that it works with any Hadoop File

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-04 Thread StephanEwen
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-137782086 How hard is it to support the truncate file code path also for regular Unix file systems (rather than only HDFS 2.7+)? The reason is that this way we would s

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-05 Thread aljoscha
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-137962730 What do you mean? This is using the Hadoop FileSystem for everything. Is your suggestion to abstract away the filesystems behind our own FileSystem class again?

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-07 Thread StephanEwen
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-138237480 I mean can we get the "truncate()" behavior for local file systems even if you build for Hadoop versions earlier than 2.7? --- If your project is set up for it, yo

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-07 Thread aljoscha
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-138252252 It could be done by circumventing FileSystem and just using the regular Java File I/O API to perform the truncate if we detect that the FileSystem works on file "file:/

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-07 Thread StephanEwen
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-138268930 Let's add this after we merge this, but it sounds quite valueable to me... --- If your project is set up for it, you can reply to this email and have your reply appe

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-13 Thread aminouvic
Github user aminouvic commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-139911603 Since the RollingSink was a little inspired by flume's HDFS Sink, it would be nice to include another really valuable features that could make it more complete.

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-14 Thread fhueske
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-139994388 Yes, we just had a discussion on the user mailing list about a partitioned output format for batch processing. Definitely a good addition for both DataSet and DataStream

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-14 Thread aljoscha
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-139995006 Yes, I would be very much in favor of adding it in the current incarnation before adding more stuff. It sill has a bug right now with exactly once, somehow some

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-14 Thread StephanEwen
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-140048228 Yes, let's merge it before extending it. @aminouvic If you want, can you create a JIRA with a description of that behavior, as a followup task? As a

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-15 Thread aminouvic
Github user aminouvic commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-140325179 Yeah you're right better have an operational version of the sink first, followup JIRA created https://issues.apache.org/jira/browse/FLINK-2672 --- If your project is

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-15 Thread aljoscha
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-140523411 I fixed the bug and now I'm confident that it should work. Also, I updated the travis build matrix because the sink does not work with hadoop 2.0.0-alpha. We ha

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-16 Thread rmetzger
Github user rmetzger commented on a diff in the pull request: https://github.com/apache/flink/pull/1084#discussion_r39624668 --- Diff: .travis.yml --- @@ -19,9 +19,9 @@ matrix: - jdk: "oraclejdk7" # this will also deploy a uberjar to s3 at some point env: PROFIL

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-16 Thread rmetzger
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-140732373 I think the pull request has grown quite a lot. I think we should merge it now and then improve it from there. --- If your project is set up for it, you can reply to t

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-16 Thread aljoscha
Github user aljoscha commented on a diff in the pull request: https://github.com/apache/flink/pull/1084#discussion_r39640472 --- Diff: .travis.yml --- @@ -19,9 +19,9 @@ matrix: - jdk: "oraclejdk7" # this will also deploy a uberjar to s3 at some point env: PROFIL

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-16 Thread rmetzger
Github user rmetzger commented on a diff in the pull request: https://github.com/apache/flink/pull/1084#discussion_r39640862 --- Diff: .travis.yml --- @@ -19,9 +19,9 @@ matrix: - jdk: "oraclejdk7" # this will also deploy a uberjar to s3 at some point env: PROFIL

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-18 Thread StephanEwen
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/1084#issuecomment-141411612 Looks good, will merge this! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] flink pull request: [FLINK-2583] Add Stream Sink For Rolling HDFS ...

2015-09-18 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1084 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enab