Re: Does 'DataStream.writeAsCsv' suppose to work like this?

2015-10-26 Thread Márton Balassi
Hey Rex, Writing half-baked records is definitely unwanted, thanks for spotting this. Most likely it can be solved by adding a flush at the end of every invoke call, let me check. Best, Marton On Mon, Oct 26, 2015 at 7:56 AM, Rex Ge wrote: > Hi, flinkers! > > I'm new to this whole thing, > an

Re: Does 'DataStream.writeAsCsv' suppose to work like this?

2015-10-26 Thread Márton Balassi
The problem persists in the current master; a format.flush() is simply missing here [1]. I'll do a quick hotfix. Thanks for the report again! [1] https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/FileSinkFunction.java#L99 O
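For illustration, a minimal sketch of what such a hotfix could look like inside the sink's invoke method. This is a sketch only: the field and method names (buffer, updateCondition, resetParameters) are assumed from the linked FileSinkFunction and may not match the actual commit; the format.flush() call is the one named in the message above.

    @Override
    public void invoke(IN value) throws Exception {
        buffer.add(value);
        if (updateCondition()) {
            // write out everything buffered so far ...
            for (IN record : buffer) {
                format.writeRecord(record);
            }
            buffer.clear();
            // ... and forward the flush to the output format, so no
            // half-written records linger in its internal buffers
            format.flush();
            resetParameters();
        }
    }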

Re: Error running an hadoop job from web interface

2015-10-26 Thread Flavio Pompermaier
Running from the shell, everything works. Is it a problem with the classloader hierarchy in the webapp? On Fri, Oct 23, 2015 at 5:53 PM, Maximilian Michels wrote: > ./bin/flink run /path/to/jar arguments > > or > > ./bin/flink run -c MainClass /path/to/jar arguments > > On Fri, Oct 23, 2015 at 5:50 PM

Re: Error running an hadoop job from web interface

2015-10-26 Thread Maximilian Michels
Correct. I'll fix it today. Cheers, Max On Mon, Oct 26, 2015 at 9:08 AM, Flavio Pompermaier wrote: > Running from the shell everything works..is it a problem of classloaders > hierarchy in the webapp? > > On Fri, Oct 23, 2015 at 5:53 PM, Maximilian Michels > wrote: > >> ./bin/flink run /path/t

Re: Does 'DataStream.writeAsCsv' suppose to work like this?

2015-10-26 Thread Maximilian Michels
Not sure whether we really want to flush on every invoke call. If you want to flush every time, you can set the update condition to 0 milliseconds; that way, flush will be called for each record. In the API this is exposed through the FileSinkFunctionByMillis. If you flush every time, performan
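A minimal sketch of the approach Max describes, assuming the writeAsCsv overload that takes an update frequency in milliseconds (the one backed by FileSinkFunctionByMillis in this era of the API); the path and sample data are made up:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlushEveryRecord {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<Tuple2<String, Integer>> stream =
                    env.fromElements(new Tuple2<>("a", 1), new Tuple2<>("b", 2));
            // 0 ms update frequency: the sink flushes on every record,
            // trading throughput for freshness of the output file
            stream.writeAsCsv("file:///tmp/out.csv", 0L);
            env.execute("flush-every-record");
        }
    }

Flushing on every record defeats the buffering and costs throughput, so it is only worth it when freshness of the output matters more than speed.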

Re: reading csv file from null value

2015-10-26 Thread Maximilian Michels
As far as I know, null support was removed from the Table API because it was not consistently supported across all operations. See https://issues.apache.org/jira/browse/FLINK-2236 On Fri, Oct 23, 2015 at 7:18 PM, Shiti Saxena wrote: > For a similar problem where we wanted to preserve and t

Re: Reading null value from datasets

2015-10-26 Thread Maximilian Michels
As far as I know, null support was removed from the Table API because it was not consistently supported across all operations. See https://issues.apache.org/jira/browse/FLINK-2236 On Fri, Oct 23, 2015 at 7:20 PM, Shiti Saxena wrote: > For a similar problem where we wanted to preserve and

Re: reading csv file from null value

2015-10-26 Thread Philip Lee
Thanks for your reply. What if I do not use the Table API? The error happens when using just env.readFromCsvFile(). I heard that a RowSerializer would handle the null values, but a TypeInformation error occurs when it is converted On Mon, Oct 26, 2015 at 10:26 AM, Maximilian Michels wrote

Re: reading csv file from null value

2015-10-26 Thread Fabian Hueske
Hi Philip, the CsvInputFormat does not support reading empty fields. I see two ways to achieve this functionality: - Use a TextInputFormat that returns each line as a String and do the parsing in a subsequent MapFunction - Extend the CsvInputFormat to support empty fields Cheers, Fabian 2015-10
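A minimal sketch of Fabian's first option, assuming a two-column name,count file; the path, schema, and sentinel values are made up for illustration (Flink's tuple serializer does not accept null fields, so empty fields are mapped to sentinels here rather than to null):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class EmptyFieldCsvRead {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<Tuple2<String, Integer>> parsed = env
                    .readTextFile("file:///tmp/input.csv") // each line as a plain String
                    .map(new MapFunction<String, Tuple2<String, Integer>>() {
                        @Override
                        public Tuple2<String, Integer> map(String line) {
                            // limit -1 keeps trailing empty fields in the split result
                            String[] fields = line.split(",", -1);
                            String name = fields[0].isEmpty() ? "n/a" : fields[0];
                            int count = fields[1].isEmpty() ? -1 : Integer.parseInt(fields[1]);
                            return new Tuple2<>(name, count);
                        }
                    });
            parsed.print();
        }
    }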

Re: Specially introduced Flink to chinese users in CNCC(China National Computer Congress)

2015-10-26 Thread Maximilian Michels
Hi Liang, We greatly appreciate that you introduced Flink to the Chinese users at CNCC! We would love to hear how people like Flink. Please keep us up to date, and point the users to the mailing list or Stack Overflow if they have any difficulties. Best regards, Max On Sat, Oct 24, 2015 at 5:48 PM, Li

Re: Does 'DataStream.writeAsCsv' suppose to work like this?

2015-10-26 Thread Márton Balassi
Hey Max, The solution I am proposing is not flushing on every record; rather, it makes sure to forward the flush from the sink function to the output format whenever it is triggered. Practically this means that the buffering is done (almost) solely in the sink and no longer in the output format. O

FastR-Flink: a new open source Truffle project

2015-10-26 Thread Juan Fumero
Hi everyone, we have just published a new open source Truffle project, FastR-Flink. It is available at https://bitbucket.org/allr/fastr-flink FastR is an implementation of the R language on top of Truffle and Graal [3], developed by Purdue University, Johannes Kepler University and Oracle Labs

Re: FastR-Flink: a new open source Truffle project

2015-10-26 Thread Maximilian Michels
Hi Juan, Thanks for sharing. This looks very promising. A great way for people who want to use R and Flink without compromising much on performance. I would be curious about some cluster performance tests. Have you run any yet? Best, Max https://bitbucket.org/allr/fastr-flink/src/71cf3f264a1faf

Re: Error running an hadoop job from web interface

2015-10-26 Thread Flavio Pompermaier
Now that I've recompiled Flink and restarted the web client, everything works fine. However, when I select the job I want to run, I see parallelism 1 in the right panel, but when I click the "Run Job" button with "show optimizer plan" checked, I see parallelism 36. Is that a bug of the first preview? On Mo

Re: FastR-Flink: a new open source Truffle project

2015-10-26 Thread Juan Fumero
Hi Max, yes, we started running some benchmarks, but this is still a very preliminary version. Concerning performance, what I can tell is that we have very good speedups on shared memory compared to FastR. Concerning cluster applications, we do not have good speedups yet for big R applications. We are s

Wrong owner of HDFS output folder

2015-10-26 Thread Flavio Pompermaier
Hi to all, when I run my job on my Hadoop cluster (both from the command line and from the webapp), writing my job's output to HDFS works fine only if I set the write parallelism to 1 (the output file is created with the user running the job). If I leave the default parallelism (>1), the job fails because it c

Re: Error running an hadoop job from web interface

2015-10-26 Thread Maximilian Michels
Did you set the default parallelism of the cluster to 36? This happens because the plan gets optimized against the cluster configuration when you try to run the uploaded program; before that, it doesn't do any optimization. This might not be very intuitive. We should probably change that. On Mon, Oct 26, 20

Re: Error running an hadoop job from web interface

2015-10-26 Thread Flavio Pompermaier
No, I just use the default parallelism On Mon, Oct 26, 2015 at 3:05 PM, Maximilian Michels wrote: > Did you set the default parallelism of the cluster to 36? This is because > the plan gets optimized against the cluster configuration when you try to > run the uploaded program. Before, it doesn't

Re: Wrong owner of HDFS output folder

2015-10-26 Thread Maximilian Michels
Hi Flavio, Are you running your Flink cluster with root permissions? The directory that holds the output splits is created by the JobManager. So if you run the JobManager with root permissions, it will create a folder owned by root. If the task managers are not run with root permissions, this could

Re: Error running an hadoop job from web interface

2015-10-26 Thread Maximilian Michels
That's odd. Does it also execute with parallelism 36 then? On Mon, Oct 26, 2015 at 3:06 PM, Flavio Pompermaier wrote: > No, I just use the default parallelism > > On Mon, Oct 26, 2015 at 3:05 PM, Maximilian Michels > wrote: > >> Did you set the default parallelism of the cluster to 36? This is

Re: Error running an hadoop job from web interface

2015-10-26 Thread Flavio Pompermaier
Yes, it's only the first job preview that has a wrong parallelism. The optimized plan and the job execution work just fine. On Mon, Oct 26, 2015 at 3:11 PM, Maximilian Michels wrote: > That's odd. Does it also execute with parallelism 36 then? > > On Mon, Oct 26, 2015 at 3:06 PM, Flavio Pomperma

Re: Wrong owner of HDFS output folder

2015-10-26 Thread Flavio Pompermaier
Yes, the JobManager starts as a root process, while the TaskManagers start with my user. Is that normal? I was convinced that start-cluster.sh was starting all processes with the same user :O On Mon, Oct 26, 2015 at 3:09 PM, Maximilian Michels wrote: > Hi Flavio, > > Are you runing your Flink cluster wit

Re: Wrong owner of HDFS output folder

2015-10-26 Thread Flavio Pompermaier
I just stopped the cluster with stop-cluster.sh, but I had to manually kill the root process because the script was not able to terminate it. Then I restarted the cluster via start-cluster.sh, and now all processes run with the user they were supposed to. Probably once in the pas

Re: Wrong owner of HDFS output folder

2015-10-26 Thread Maximilian Michels
The problem is that non-root processes may not be able to read root-owned files/folders. Therefore, we cannot really check as non-root users whether root-owned clusters have been started. It's better not to run Flink with root permissions. You're welcome. Cheers, Max On Mon, Oct 26, 2015 at 3: