Re: writeAsCsv

2015-10-07 Thread Robert Schmidtke
Hi, as far as I know only collect(), print() and execute() actually trigger
the execution. What you're missing is a call to env.execute() after the
writeAsCsv call. Hope this helps.
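
For illustration, a minimal sketch of the complete pattern (class name, sample
data and output path below are made-up placeholders, not from Lydia's job):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

import java.util.Arrays;

public class WriteCsvSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Double, Double>> data = env.fromCollection(
                Arrays.asList(new Tuple2<>(1.0, 2.0), new Tuple2<>(3.0, 4.0)));

        // writeAsCsv only declares a sink; it does not run the job by itself.
        data.writeAsCsv("file:///tmp/spectrum.csv");

        // Only execute() (or collect()/print()) actually triggers the execution.
        env.execute("writeAsCsv example");
    }
}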

On Wed, Oct 7, 2015 at 9:35 PM, Lydia Ickler 
wrote:

> Hi,
>
> stupid question: Why is this not saved to file?
> I want to transform an array to a DataSet but the Graph stops at collect().
>
> // Transform Spectrum to DataSet
> List<Tuple2<Double, Double>> dataList = new LinkedList<Tuple2<Double, Double>>();
> double[][] arr = filteredSpectrum.getAs2DDoubleArray();
> for (int i = 0; i < arr[0].length; i++) {
>     dataList.add(new Tuple2<Double, Double>(arr[0][i], arr[1][i]));
> }
> env.fromCollection(dataList).writeAsCsv(output);
>
> Best regards,
> Lydia
>
>



Re: writeAsCsv

2015-10-07 Thread Lydia Ickler
ok, thanks! :)

I will try that!



> On 07.10.2015 at 21:35, Lydia Ickler wrote:
> 
> Hi, 
> 
> stupid question: Why is this not saved to file?
> I want to transform an array to a DataSet but the Graph stops at collect().
> 
> // Transform Spectrum to DataSet
> List<Tuple2<Double, Double>> dataList = new LinkedList<Tuple2<Double, Double>>();
> double[][] arr = filteredSpectrum.getAs2DDoubleArray();
> for (int i = 0; i < arr[0].length; i++) {
>     dataList.add(new Tuple2<Double, Double>(arr[0][i], arr[1][i]));
> }
> env.fromCollection(dataList).writeAsCsv(output);
> Best regards, 
> Lydia
> 



Re: writeAsCsv on HDFS

2015-06-25 Thread Robert Metzger
Hi,

Flink is not loading the Hadoop configuration from the classloader. You
have to specify the path to the Hadoop configuration in the flink
configuration "fs.hdfs.hadoopconf"

On Thu, Jun 25, 2015 at 2:50 PM, Flavio Pompermaier 
wrote:

> Hi to all,
> I'm experiencing some problem in writing a file as csv on HDFS with flink
> 0.9.0.
> The code I use is
>   myDataset.writeAsCsv(new Path("hdfs:///tmp", "myFile.csv").toString());
>
> If I run the job from Eclipse everything works fine but when I deploy the
> job on the cluster (cloudera 5.1.3) I obtain the following exception:
>
> Caused by: java.io.IOException: The given HDFS file URI
> (hdfs:///tmp/myFile.csv) did not describe the HDFS NameNode. The attempt to
> use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault'
> or 'fs.hdfs.hdfssite' config parameter failed due to the following problem:
> Either no default file system was registered, or the provided configuration
> contains no valid authority component (fs.default.name or fs.defaultFS)
> describing the (hdfs namenode) host and port.
> at
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:291)
> at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:258)
> at org.apache.flink.core.fs.Path.getFileSystem(Path.java:309)
> at
> org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:273)
> at
> org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:84)
> at
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
> ... 25 more
>
> The core-site.xml is present in the fat jar and contains the property
>
> 
> fs.defaultFS
> hdfs://myServerX:8020
>   
>
> I compiled flink with the following command:
>
>  mvn clean  install -Dhadoop.version=2.3.0-cdh5.1.3
> -Dhbase.version=0.98.1-cdh5.1.3 -Dhadoop.core.version=2.3.0-mr1-cdh5.1.3
> -DskipTests -Pvendor-repos
>
> How can I fix that?
>
> Best,
> Flavio
>
>


Re: writeAsCsv on HDFS

2015-06-25 Thread Flavio Pompermaier
Could you describe it better with an example, please? Why doesn't Flink
automatically load the properties of the Hadoop conf files within the jar?

On Thu, Jun 25, 2015 at 2:55 PM, Robert Metzger  wrote:

> Hi,
>
> Flink is not loading the Hadoop configuration from the classloader. You
> have to specify the path to the Hadoop configuration in the flink
> configuration "fs.hdfs.hadoopconf"
>
> On Thu, Jun 25, 2015 at 2:50 PM, Flavio Pompermaier 
> wrote:
>
>> Hi to all,
>> I'm experiencing some problem in writing a file as csv on HDFS with flink
>> 0.9.0.
>> The code I use is
>>   myDataset.writeAsCsv(new Path("hdfs:///tmp", "myFile.csv").toString());
>>
>> If I run the job from Eclipse everything works fine but when I deploy the
>> job on the cluster (cloudera 5.1.3) I obtain the following exception:
>>
>> Caused by: java.io.IOException: The given HDFS file URI
>> (hdfs:///tmp/myFile.csv) did not describe the HDFS NameNode. The attempt to
>> use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault'
>> or 'fs.hdfs.hdfssite' config parameter failed due to the following problem:
>> Either no default file system was registered, or the provided configuration
>> contains no valid authority component (fs.default.name or fs.defaultFS)
>> describing the (hdfs namenode) host and port.
>> at
>> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:291)
>> at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:258)
>> at org.apache.flink.core.fs.Path.getFileSystem(Path.java:309)
>> at
>> org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:273)
>> at
>> org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:84)
>> at
>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
>> ... 25 more
>>
>> The core-site.xml is present in the fat jar and contains the property
>>
>> 
>> fs.defaultFS
>> hdfs://myServerX:8020
>>   
>>
>> I compiled flink with the following command:
>>
>>  mvn clean  install -Dhadoop.version=2.3.0-cdh5.1.3
>> -Dhbase.version=0.98.1-cdh5.1.3 -Dhadoop.core.version=2.3.0-mr1-cdh5.1.3
>> -DskipTests -Pvendor-repos
>>
>> How can I fix that?
>>
>> Best,
>> Flavio
>>
>>
>


Re: writeAsCsv on HDFS

2015-06-25 Thread Robert Metzger
Hi Flavio,

there is a file called "conf/flink-conf.yaml".
Add a new line to the file with the following contents:

fs.hdfs.hadoopconf: /path/to/your/hadoop/config

This should fix the problem.
Flink cannot load the configuration file from the jar containing the user
code, because the file system is initialized independently of the job. So
there is (currently) no way of initializing the file system using the user
code classloader.

What you can do is make the configuration file available to Flink's
system classloader, for example by putting your user jar into the lib/
folder of Flink. You can also add the path to the Hadoop configuration
files to the CLASSPATH of Flink (but you need to do that on all machines).

I think the easiest approach is using Flink's configuration file.


On Thu, Jun 25, 2015 at 2:59 PM, Flavio Pompermaier 
wrote:

> Could you describe it better with an example please? Why Flink doesn't
> load automatically the properties of the hadoop conf files within the jar?
>
> On Thu, Jun 25, 2015 at 2:55 PM, Robert Metzger 
> wrote:
>
>> Hi,
>>
>> Flink is not loading the Hadoop configuration from the classloader. You
>> have to specify the path to the Hadoop configuration in the flink
>> configuration "fs.hdfs.hadoopconf"
>>
>> On Thu, Jun 25, 2015 at 2:50 PM, Flavio Pompermaier > > wrote:
>>
>>> Hi to all,
>>> I'm experiencing some problem in writing a file as csv on HDFS with
>>> flink 0.9.0.
>>> The code I use is
>>>   myDataset.writeAsCsv(new Path("hdfs:///tmp", "myFile.csv").toString());
>>>
>>> If I run the job from Eclipse everything works fine but when I deploy
>>> the job on the cluster (cloudera 5.1.3) I obtain the following exception:
>>>
>>> Caused by: java.io.IOException: The given HDFS file URI
>>> (hdfs:///tmp/myFile.csv) did not describe the HDFS NameNode. The attempt to
>>> use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault'
>>> or 'fs.hdfs.hdfssite' config parameter failed due to the following problem:
>>> Either no default file system was registered, or the provided configuration
>>> contains no valid authority component (fs.default.name or fs.defaultFS)
>>> describing the (hdfs namenode) host and port.
>>> at
>>> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:291)
>>> at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:258)
>>> at org.apache.flink.core.fs.Path.getFileSystem(Path.java:309)
>>> at
>>> org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:273)
>>> at
>>> org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:84)
>>> at
>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
>>> ... 25 more
>>>
>>> The core-site.xml is present in the fat jar and contains the property
>>>
>>> 
>>> fs.defaultFS
>>> hdfs://myServerX:8020
>>>   
>>>
>>> I compiled flink with the following command:
>>>
>>>  mvn clean  install -Dhadoop.version=2.3.0-cdh5.1.3
>>> -Dhbase.version=0.98.1-cdh5.1.3 -Dhadoop.core.version=2.3.0-mr1-cdh5.1.3
>>> -DskipTests -Pvendor-repos
>>>
>>> How can I fix that?
>>>
>>> Best,
>>> Flavio
>>>
>>>
>>
>


Re: writeAsCsv on HDFS

2015-06-25 Thread Flavio Pompermaier
Does *fs.hdfs.hadoopconf* represent the folder containing the Hadoop config
files (*-site.xml) or just one specific Hadoop config file (e.g.
core-site.xml or hdfs-site.xml)?

On Thu, Jun 25, 2015 at 3:04 PM, Robert Metzger  wrote:

> Hi Flavio,
>
> there is a file called "conf/flink-conf.yaml"
> Add a new line in the file with the following contents:
>
> fs.hdfs.hadoopconf: /path/to/your/hadoop/config
>
> This should fix the problem.
> Flink can not load the configuration file from the jar containing the user
> code, because the file system is initialized independent of the the job. So
> there is (currently) no way of initializing the file system using the user
> code classloader.
>
> What you can do is making the configuration file available to Flink's
> system classloader. For example by putting your user jar into the lib/
> folder of Flink. You can also add the path to the Hadoop configuration
> files into the CLASSPATH of Flink (but you need to do that on all machines).
>
> I think the easiest approach is using Flink's configuration file.
>
>
> On Thu, Jun 25, 2015 at 2:59 PM, Flavio Pompermaier 
> wrote:
>
>> Could you describe it better with an example please? Why Flink doesn't
>> load automatically the properties of the hadoop conf files within the jar?
>>
>> On Thu, Jun 25, 2015 at 2:55 PM, Robert Metzger 
>> wrote:
>>
>>> Hi,
>>>
>>> Flink is not loading the Hadoop configuration from the classloader. You
>>> have to specify the path to the Hadoop configuration in the flink
>>> configuration "fs.hdfs.hadoopconf"
>>>
>>> On Thu, Jun 25, 2015 at 2:50 PM, Flavio Pompermaier <
>>> pomperma...@okkam.it> wrote:
>>>
 Hi to all,
 I'm experiencing some problem in writing a file as csv on HDFS with
 flink 0.9.0.
 The code I use is
   myDataset.writeAsCsv(new Path("hdfs:///tmp",
 "myFile.csv").toString());

 If I run the job from Eclipse everything works fine but when I deploy
 the job on the cluster (cloudera 5.1.3) I obtain the following exception:

 Caused by: java.io.IOException: The given HDFS file URI
 (hdfs:///tmp/myFile.csv) did not describe the HDFS NameNode. The attempt to
 use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault'
 or 'fs.hdfs.hdfssite' config parameter failed due to the following problem:
 Either no default file system was registered, or the provided configuration
 contains no valid authority component (fs.default.name or
 fs.defaultFS) describing the (hdfs namenode) host and port.
 at
 org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:291)
 at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:258)
 at org.apache.flink.core.fs.Path.getFileSystem(Path.java:309)
 at
 org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:273)
 at
 org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:84)
 at
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
 ... 25 more

 The core-site.xml is present in the fat jar and contains the property

 
 fs.defaultFS
 hdfs://myServerX:8020
   

 I compiled flink with the following command:

  mvn clean  install -Dhadoop.version=2.3.0-cdh5.1.3
 -Dhbase.version=0.98.1-cdh5.1.3 -Dhadoop.core.version=2.3.0-mr1-cdh5.1.3
 -DskipTests -Pvendor-repos

 How can I fix that?

 Best,
 Flavio


>>>
>>
>


Re: writeAsCsv on HDFS

2015-06-25 Thread Chiwan Park
It represents the folder containing the hadoop config files. :)

Regards,
Chiwan Park


> On Jun 25, 2015, at 10:07 PM, Flavio Pompermaier  wrote:
> 
> fs.hdfs.hadoopconf represents the folder containing the hadoop config files 
> (*-site.xml) or just one specific hadoop config file (e.g. core-site.xml or 
> the hdfs-site.xml)?
> 
> On Thu, Jun 25, 2015 at 3:04 PM, Robert Metzger  wrote:
> Hi Flavio,
> 
> there is a file called "conf/flink-conf.yaml"
> Add a new line in the file with the following contents:
> 
> fs.hdfs.hadoopconf: /path/to/your/hadoop/config
> 
> This should fix the problem.
> Flink can not load the configuration file from the jar containing the user 
> code, because the file system is initialized independent of the the job. So 
> there is (currently) no way of initializing the file system using the user 
> code classloader.
> 
> What you can do is making the configuration file available to Flink's system 
> classloader. For example by putting your user jar into the lib/ folder of 
> Flink. You can also add the path to the Hadoop configuration files into the 
> CLASSPATH of Flink (but you need to do that on all machines).
> 
> I think the easiest approach is using Flink's configuration file.
> 
> 
> On Thu, Jun 25, 2015 at 2:59 PM, Flavio Pompermaier  
> wrote:
> Could you describe it better with an example please? Why Flink doesn't load 
> automatically the properties of the hadoop conf files within the jar?
> 
> On Thu, Jun 25, 2015 at 2:55 PM, Robert Metzger  wrote:
> Hi,
> 
> Flink is not loading the Hadoop configuration from the classloader. You have 
> to specify the path to the Hadoop configuration in the flink configuration 
> "fs.hdfs.hadoopconf"
> 
> On Thu, Jun 25, 2015 at 2:50 PM, Flavio Pompermaier  
> wrote:
> Hi to all,
> I'm experiencing some problem in writing a file as csv on HDFS with flink 
> 0.9.0.
> The code I use is 
>   myDataset.writeAsCsv(new Path("hdfs:///tmp", "myFile.csv").toString());
> 
> If I run the job from Eclipse everything works fine but when I deploy the job 
> on the cluster (cloudera 5.1.3) I obtain the following exception:
> 
> Caused by: java.io.IOException: The given HDFS file URI 
> (hdfs:///tmp/myFile.csv) did not describe the HDFS NameNode. The attempt to 
> use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault' 
> or 'fs.hdfs.hdfssite' config parameter failed due to the following problem: 
> Either no default file system was registered, or the provided configuration 
> contains no valid authority component (fs.default.name or fs.defaultFS) 
> describing the (hdfs namenode) host and port.
>   at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:291)
>   at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:258)
>   at org.apache.flink.core.fs.Path.getFileSystem(Path.java:309)
>   at 
> org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:273)
>   at 
> org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:84)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
>   ... 25 more
> 
> The core-site.xml is present in the fat jar and contains the property
> 
> 
> fs.defaultFS
> hdfs://myServerX:8020
>   
> 
> I compiled flink with the following command:
> 
>  mvn clean  install -Dhadoop.version=2.3.0-cdh5.1.3 
> -Dhbase.version=0.98.1-cdh5.1.3 -Dhadoop.core.version=2.3.0-mr1-cdh5.1.3 
> -DskipTests -Pvendor-repos
> 
> How can I fix that?
> 
> Best,
> Flavio
> 



Re: writeAsCsv on HDFS

2015-06-25 Thread Flavio Pompermaier
Do I have to put the hadoop conf file on each task manager or just on the
job-manager?

On Thu, Jun 25, 2015 at 3:12 PM, Chiwan Park  wrote:

> It represents the folder containing the hadoop config files. :)
>
> Regards,
> Chiwan Park
>
>
> > On Jun 25, 2015, at 10:07 PM, Flavio Pompermaier 
> wrote:
> >
> > fs.hdfs.hadoopconf represents the folder containing the hadoop config
> files (*-site.xml) or just one specific hadoop config file (e.g.
> core-site.xml or the hdfs-site.xml)?
> >
> > On Thu, Jun 25, 2015 at 3:04 PM, Robert Metzger 
> wrote:
> > Hi Flavio,
> >
> > there is a file called "conf/flink-conf.yaml"
> > Add a new line in the file with the following contents:
> >
> > fs.hdfs.hadoopconf: /path/to/your/hadoop/config
> >
> > This should fix the problem.
> > Flink can not load the configuration file from the jar containing the
> user code, because the file system is initialized independent of the the
> job. So there is (currently) no way of initializing the file system using
> the user code classloader.
> >
> > What you can do is making the configuration file available to Flink's
> system classloader. For example by putting your user jar into the lib/
> folder of Flink. You can also add the path to the Hadoop configuration
> files into the CLASSPATH of Flink (but you need to do that on all machines).
> >
> > I think the easiest approach is using Flink's configuration file.
> >
> >
> > On Thu, Jun 25, 2015 at 2:59 PM, Flavio Pompermaier <
> pomperma...@okkam.it> wrote:
> > Could you describe it better with an example please? Why Flink doesn't
> load automatically the properties of the hadoop conf files within the jar?
> >
> > On Thu, Jun 25, 2015 at 2:55 PM, Robert Metzger 
> wrote:
> > Hi,
> >
> > Flink is not loading the Hadoop configuration from the classloader. You
> have to specify the path to the Hadoop configuration in the flink
> configuration "fs.hdfs.hadoopconf"
> >
> > On Thu, Jun 25, 2015 at 2:50 PM, Flavio Pompermaier <
> pomperma...@okkam.it> wrote:
> > Hi to all,
> > I'm experiencing some problem in writing a file as csv on HDFS with
> flink 0.9.0.
> > The code I use is
> >   myDataset.writeAsCsv(new Path("hdfs:///tmp", "myFile.csv").toString());
> >
> > If I run the job from Eclipse everything works fine but when I deploy
> the job on the cluster (cloudera 5.1.3) I obtain the following exception:
> >
> > Caused by: java.io.IOException: The given HDFS file URI
> (hdfs:///tmp/myFile.csv) did not describe the HDFS NameNode. The attempt to
> use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault'
> or 'fs.hdfs.hdfssite' config parameter failed due to the following problem:
> Either no default file system was registered, or the provided configuration
> contains no valid authority component (fs.default.name or fs.defaultFS)
> describing the (hdfs namenode) host and port.
> >   at
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:291)
> >   at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:258)
> >   at org.apache.flink.core.fs.Path.getFileSystem(Path.java:309)
> >   at
> org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:273)
> >   at
> org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:84)
> >   at
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
> >   ... 25 more
> >
> > The core-site.xml is present in the fat jar and contains the property
> >
> > 
> > fs.defaultFS
> > hdfs://myServerX:8020
> >   
> >
> > I compiled flink with the following command:
> >
> >  mvn clean  install -Dhadoop.version=2.3.0-cdh5.1.3
> -Dhbase.version=0.98.1-cdh5.1.3 -Dhadoop.core.version=2.3.0-mr1-cdh5.1.3
> -DskipTests -Pvendor-repos
> >
> > How can I fix that?
> >
> > Best,
> > Flavio
> >
>
>


Re: writeAsCsv on HDFS

2015-06-25 Thread Robert Metzger
You have to put it on all machines.

On Thu, Jun 25, 2015 at 3:17 PM, Flavio Pompermaier 
wrote:

> Do I have to put the hadoop conf file on each task manager or just on the
> job-manager?
>
> On Thu, Jun 25, 2015 at 3:12 PM, Chiwan Park 
> wrote:
>
>> It represents the folder containing the hadoop config files. :)
>>
>> Regards,
>> Chiwan Park
>>
>>
>> > On Jun 25, 2015, at 10:07 PM, Flavio Pompermaier 
>> wrote:
>> >
>> > fs.hdfs.hadoopconf represents the folder containing the hadoop config
>> files (*-site.xml) or just one specific hadoop config file (e.g.
>> core-site.xml or the hdfs-site.xml)?
>> >
>> > On Thu, Jun 25, 2015 at 3:04 PM, Robert Metzger 
>> wrote:
>> > Hi Flavio,
>> >
>> > there is a file called "conf/flink-conf.yaml"
>> > Add a new line in the file with the following contents:
>> >
>> > fs.hdfs.hadoopconf: /path/to/your/hadoop/config
>> >
>> > This should fix the problem.
>> > Flink can not load the configuration file from the jar containing the
>> user code, because the file system is initialized independent of the the
>> job. So there is (currently) no way of initializing the file system using
>> the user code classloader.
>> >
>> > What you can do is making the configuration file available to Flink's
>> system classloader. For example by putting your user jar into the lib/
>> folder of Flink. You can also add the path to the Hadoop configuration
>> files into the CLASSPATH of Flink (but you need to do that on all machines).
>> >
>> > I think the easiest approach is using Flink's configuration file.
>> >
>> >
>> > On Thu, Jun 25, 2015 at 2:59 PM, Flavio Pompermaier <
>> pomperma...@okkam.it> wrote:
>> > Could you describe it better with an example please? Why Flink doesn't
>> load automatically the properties of the hadoop conf files within the jar?
>> >
>> > On Thu, Jun 25, 2015 at 2:55 PM, Robert Metzger 
>> wrote:
>> > Hi,
>> >
>> > Flink is not loading the Hadoop configuration from the classloader. You
>> have to specify the path to the Hadoop configuration in the flink
>> configuration "fs.hdfs.hadoopconf"
>> >
>> > On Thu, Jun 25, 2015 at 2:50 PM, Flavio Pompermaier <
>> pomperma...@okkam.it> wrote:
>> > Hi to all,
>> > I'm experiencing some problem in writing a file as csv on HDFS with
>> flink 0.9.0.
>> > The code I use is
>> >   myDataset.writeAsCsv(new Path("hdfs:///tmp",
>> "myFile.csv").toString());
>> >
>> > If I run the job from Eclipse everything works fine but when I deploy
>> the job on the cluster (cloudera 5.1.3) I obtain the following exception:
>> >
>> > Caused by: java.io.IOException: The given HDFS file URI
>> (hdfs:///tmp/myFile.csv) did not describe the HDFS NameNode. The attempt to
>> use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault'
>> or 'fs.hdfs.hdfssite' config parameter failed due to the following problem:
>> Either no default file system was registered, or the provided configuration
>> contains no valid authority component (fs.default.name or fs.defaultFS)
>> describing the (hdfs namenode) host and port.
>> >   at
>> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:291)
>> >   at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:258)
>> >   at org.apache.flink.core.fs.Path.getFileSystem(Path.java:309)
>> >   at
>> org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:273)
>> >   at
>> org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:84)
>> >   at
>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
>> >   ... 25 more
>> >
>> > The core-site.xml is present in the fat jar and contains the property
>> >
>> > 
>> > fs.defaultFS
>> > hdfs://myServerX:8020
>> >   
>> >
>> > I compiled flink with the following command:
>> >
>> >  mvn clean  install -Dhadoop.version=2.3.0-cdh5.1.3
>> -Dhbase.version=0.98.1-cdh5.1.3 -Dhadoop.core.version=2.3.0-mr1-cdh5.1.3
>> -DskipTests -Pvendor-repos
>> >
>> > How can I fix that?
>> >
>> > Best,
>> > Flavio
>> >
>>
>>
>
>


Re: writeAsCsv on HDFS

2015-06-25 Thread Stephan Ewen
You could also just qualify the HDFS URL, if that is simpler (put host and
port of the namenode in there): "hdfs://myhost:40010/path/to/file"
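
For example (a sketch only; the NameNode host, port and paths are placeholders):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class QualifiedHdfsUriSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<String, Integer>> myDataset =
                env.fromElements(new Tuple2<>("a", 1), new Tuple2<>("b", 2));

        // With a fully qualified URI the NameNode host and port come from the
        // path itself, so no extra Hadoop configuration is needed to resolve it.
        myDataset.writeAsCsv("hdfs://namenode-host:8020/tmp/myFile.csv");

        env.execute("qualified HDFS URI example");
    }
}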

On Thu, Jun 25, 2015 at 3:20 PM, Robert Metzger  wrote:

> You have to put it into all machines
>
> On Thu, Jun 25, 2015 at 3:17 PM, Flavio Pompermaier 
> wrote:
>
>> Do I have to put the hadoop conf file on each task manager or just on the
>> job-manager?
>>
>> On Thu, Jun 25, 2015 at 3:12 PM, Chiwan Park 
>> wrote:
>>
>>> It represents the folder containing the hadoop config files. :)
>>>
>>> Regards,
>>> Chiwan Park
>>>
>>>
>>> > On Jun 25, 2015, at 10:07 PM, Flavio Pompermaier 
>>> wrote:
>>> >
>>> > fs.hdfs.hadoopconf represents the folder containing the hadoop config
>>> files (*-site.xml) or just one specific hadoop config file (e.g.
>>> core-site.xml or the hdfs-site.xml)?
>>> >
>>> > On Thu, Jun 25, 2015 at 3:04 PM, Robert Metzger 
>>> wrote:
>>> > Hi Flavio,
>>> >
>>> > there is a file called "conf/flink-conf.yaml"
>>> > Add a new line in the file with the following contents:
>>> >
>>> > fs.hdfs.hadoopconf: /path/to/your/hadoop/config
>>> >
>>> > This should fix the problem.
>>> > Flink can not load the configuration file from the jar containing the
>>> user code, because the file system is initialized independent of the the
>>> job. So there is (currently) no way of initializing the file system using
>>> the user code classloader.
>>> >
>>> > What you can do is making the configuration file available to Flink's
>>> system classloader. For example by putting your user jar into the lib/
>>> folder of Flink. You can also add the path to the Hadoop configuration
>>> files into the CLASSPATH of Flink (but you need to do that on all machines).
>>> >
>>> > I think the easiest approach is using Flink's configuration file.
>>> >
>>> >
>>> > On Thu, Jun 25, 2015 at 2:59 PM, Flavio Pompermaier <
>>> pomperma...@okkam.it> wrote:
>>> > Could you describe it better with an example please? Why Flink doesn't
>>> load automatically the properties of the hadoop conf files within the jar?
>>> >
>>> > On Thu, Jun 25, 2015 at 2:55 PM, Robert Metzger 
>>> wrote:
>>> > Hi,
>>> >
>>> > Flink is not loading the Hadoop configuration from the classloader.
>>> You have to specify the path to the Hadoop configuration in the flink
>>> configuration "fs.hdfs.hadoopconf"
>>> >
>>> > On Thu, Jun 25, 2015 at 2:50 PM, Flavio Pompermaier <
>>> pomperma...@okkam.it> wrote:
>>> > Hi to all,
>>> > I'm experiencing some problem in writing a file as csv on HDFS with
>>> flink 0.9.0.
>>> > The code I use is
>>> >   myDataset.writeAsCsv(new Path("hdfs:///tmp",
>>> "myFile.csv").toString());
>>> >
>>> > If I run the job from Eclipse everything works fine but when I deploy
>>> the job on the cluster (cloudera 5.1.3) I obtain the following exception:
>>> >
>>> > Caused by: java.io.IOException: The given HDFS file URI
>>> (hdfs:///tmp/myFile.csv) did not describe the HDFS NameNode. The attempt to
>>> use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault'
>>> or 'fs.hdfs.hdfssite' config parameter failed due to the following problem:
>>> Either no default file system was registered, or the provided configuration
>>> contains no valid authority component (fs.default.name or fs.defaultFS)
>>> describing the (hdfs namenode) host and port.
>>> >   at
>>> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:291)
>>> >   at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:258)
>>> >   at org.apache.flink.core.fs.Path.getFileSystem(Path.java:309)
>>> >   at
>>> org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:273)
>>> >   at
>>> org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:84)
>>> >   at
>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
>>> >   ... 25 more
>>> >
>>> > The core-site.xml is present in the fat jar and contains the property
>>> >
>>> > 
>>> > fs.defaultFS
>>> > hdfs://myServerX:8020
>>> >   
>>> >
>>> > I compiled flink with the following command:
>>> >
>>> >  mvn clean  install -Dhadoop.version=2.3.0-cdh5.1.3
>>> -Dhbase.version=0.98.1-cdh5.1.3 -Dhadoop.core.version=2.3.0-mr1-cdh5.1.3
>>> -DskipTests -Pvendor-repos
>>> >
>>> > How can I fix that?
>>> >
>>> > Best,
>>> > Flavio
>>> >
>>>
>>>
>>
>>
>


Re: writeAsCsv on HDFS

2015-06-25 Thread Hawin Jiang
Hi Flavio,

Here is the example from Marton:
You can use the writeAsText method directly.


StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.addSource(new PersistentKafkaSource(..))
  .map(/* do your operations */)
  .writeAsText("hdfs://<namenode>:<port>/path/to/your/file");

Check out the relevant section of the streaming docs for more info. [1]

[1]
http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#connecting-to-the-outside-world


On Thu, Jun 25, 2015 at 6:25 AM, Stephan Ewen  wrote:

> You could also just qualify the HDFS URL, if that is simpler (put host and
> port of the namenode in there): "hdfs://myhost:40010/path/to/file"
>
> On Thu, Jun 25, 2015 at 3:20 PM, Robert Metzger 
> wrote:
>
>> You have to put it into all machines
>>
>> On Thu, Jun 25, 2015 at 3:17 PM, Flavio Pompermaier > > wrote:
>>
>>> Do I have to put the hadoop conf file on each task manager or just on
>>> the job-manager?
>>>
>>> On Thu, Jun 25, 2015 at 3:12 PM, Chiwan Park 
>>> wrote:
>>>
 It represents the folder containing the hadoop config files. :)

 Regards,
 Chiwan Park


 > On Jun 25, 2015, at 10:07 PM, Flavio Pompermaier <
 pomperma...@okkam.it> wrote:
 >
 > fs.hdfs.hadoopconf represents the folder containing the hadoop config
 files (*-site.xml) or just one specific hadoop config file (e.g.
 core-site.xml or the hdfs-site.xml)?
 >
 > On Thu, Jun 25, 2015 at 3:04 PM, Robert Metzger 
 wrote:
 > Hi Flavio,
 >
 > there is a file called "conf/flink-conf.yaml"
 > Add a new line in the file with the following contents:
 >
 > fs.hdfs.hadoopconf: /path/to/your/hadoop/config
 >
 > This should fix the problem.
 > Flink can not load the configuration file from the jar containing the
 user code, because the file system is initialized independent of the the
 job. So there is (currently) no way of initializing the file system using
 the user code classloader.
 >
 > What you can do is making the configuration file available to Flink's
 system classloader. For example by putting your user jar into the lib/
 folder of Flink. You can also add the path to the Hadoop configuration
 files into the CLASSPATH of Flink (but you need to do that on all 
 machines).
 >
 > I think the easiest approach is using Flink's configuration file.
 >
 >
 > On Thu, Jun 25, 2015 at 2:59 PM, Flavio Pompermaier <
 pomperma...@okkam.it> wrote:
 > Could you describe it better with an example please? Why Flink
 doesn't load automatically the properties of the hadoop conf files within
 the jar?
 >
 > On Thu, Jun 25, 2015 at 2:55 PM, Robert Metzger 
 wrote:
 > Hi,
 >
 > Flink is not loading the Hadoop configuration from the classloader.
 You have to specify the path to the Hadoop configuration in the flink
 configuration "fs.hdfs.hadoopconf"
 >
 > On Thu, Jun 25, 2015 at 2:50 PM, Flavio Pompermaier <
 pomperma...@okkam.it> wrote:
 > Hi to all,
 > I'm experiencing some problem in writing a file as csv on HDFS with
 flink 0.9.0.
 > The code I use is
 >   myDataset.writeAsCsv(new Path("hdfs:///tmp",
 "myFile.csv").toString());
 >
 > If I run the job from Eclipse everything works fine but when I deploy
 the job on the cluster (cloudera 5.1.3) I obtain the following exception:
 >
 > Caused by: java.io.IOException: The given HDFS file URI
 (hdfs:///tmp/myFile.csv) did not describe the HDFS NameNode. The attempt to
 use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault'
 or 'fs.hdfs.hdfssite' config parameter failed due to the following problem:
 Either no default file system was registered, or the provided configuration
 contains no valid authority component (fs.default.name or
 fs.defaultFS) describing the (hdfs namenode) host and port.
 >   at
 org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:291)
 >   at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:258)
 >   at org.apache.flink.core.fs.Path.getFileSystem(Path.java:309)
 >   at
 org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:273)
 >   at
 org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:84)
 >   at
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
 >   ... 25 more
 >
 > The core-site.xml is present in the fat jar and contains the property
 >
 > 
 > fs.defaultFS
 > hdfs://myServerX:8020
 >   
 >
 > I compiled flink with the following command:
 >
 >  mvn clean  install -Dhadoop.version=2.3.0-cdh5.1.3
 -Dhbase.version=0.98.1-c

Re: writeAsCSV with partitionBy

2016-02-15 Thread Fabian Hueske
Hi Srikanth,

DataSet.partitionBy() will partition the data on the declared partition
fields.
If you append a DataSink with the same parallelism as the partition
operator, the data will be written out with the defined partitioning.
It should be possible to achieve the behavior you described using
DataSet.partitionByHash() or partitionByRange().

Best, Fabian
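
A minimal sketch of that pattern (field index, sample data and output path are
arbitrary placeholders):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class PartitionedSinkSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Integer, String>> data = env.fromElements(
                new Tuple2<>(1, "a"), new Tuple2<>(2, "b"), new Tuple2<>(1, "c"));

        // The sink is appended directly after partitionByHash and uses the same
        // (default) parallelism, so each parallel sink task writes one part file
        // that contains only the keys hashed to that task.
        data.partitionByHash(0)
            .writeAsCsv("file:///tmp/partitioned-output");

        env.execute("partitioned sink example");
    }
}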


2016-02-12 20:53 GMT+01:00 Srikanth :

> Hello,
>
>
>
> Is there a Hive(or Spark dataframe) partitionBy equivalent in Flink?
>
> I'm looking to save output as CSV files partitioned by two columns(date
> and hour).
>
> The partitionBy dataset API is more to partition the data based on a
> column for further processing.
>
>
>
> I'm thinking there is no direct API to do this. But what will be the best
> way of achieving this?
>
>
>
> Srikanth
>
>
>


Re: writeAsCSV with partitionBy

2016-02-16 Thread Srikanth
Fabian,

Not sure if we are on the same page. If I do something like the code below,
it will hash-partition on field 0 and each task will write a separate part
file in parallel.

val sink = data1.join(data2)
.where(1).equalTo(0) { ((l,r) => ( l._3, r._3) ) }
.partitionByHash(0)
.writeAsCsv(pathBase + "output/test", rowDelimiter="\n",
fieldDelimiter="\t" , WriteMode.OVERWRITE)

This will create folder ./output/test/<1,2,3,4...>

But what I was looking for is Hive style partitionBy that will output with
folder structure

   ./output/field0=1/file
   ./output/field0=2/file
   ./output/field0=3/file
   ./output/field0=4/file

Assuming field0 is Int and has unique values 1,2,3&4.

Srikanth


On Mon, Feb 15, 2016 at 6:20 AM, Fabian Hueske  wrote:

> Hi Srikanth,
>
> DataSet.partitionBy() will partition the data on the declared partition
> fields.
> If you append a DataSink with the same parallelism as the partition
> operator, the data will be written out with the defined partitioning.
> It should be possible to achieve the behavior you described using
> DataSet.partitionByHash() or partitionByRange().
>
> Best, Fabian
>
>
> 2016-02-12 20:53 GMT+01:00 Srikanth :
>
>> Hello,
>>
>>
>>
>> Is there a Hive(or Spark dataframe) partitionBy equivalent in Flink?
>>
>> I'm looking to save output as CSV files partitioned by two columns(date
>> and hour).
>>
>> The partitionBy dataset API is more to partition the data based on a
>> column for further processing.
>>
>>
>>
>> I'm thinking there is no direct API to do this. But what will be the best
>> way of achieving this?
>>
>>
>>
>> Srikanth
>>
>>
>>
>
>


Re: writeAsCSV with partitionBy

2016-02-16 Thread Fabian Hueske
Yes, you're right. I did not understand your question correctly.

Right now, Flink does not feature an output format that writes records to
output files depending on a key attribute.
You would need to implement such an output format yourself and append it as
follows:

val data = ...
data.partitionByHash(0) // partition to send all records with the same key
to the same machine
  .output(new YourOutputFormat())

In case of many distinct keys, you would need to limit the number of open
file handles. The output format will be easier to implement if you do a
sortPartition(0, Order.ASCENDING) before the output format to sort the data
by key.

Cheers, Fabian
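
A minimal sketch of such an output format, assuming Tuple2<Integer, String>
records (the class name and file layout are made up for illustration; it writes
to the local file system via java.io for brevity, whereas an HDFS version would
go through Flink's FileSystem API, and many distinct keys mean many open file
handles, as noted above):

import org.apache.flink.api.common.io.OutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Writes each record to <basePath>/field0=<key>/part-<subtask>, one writer per key.
public class KeyPartitionedCsvOutputFormat implements OutputFormat<Tuple2<Integer, String>> {

    private final String basePath;
    private transient Map<Integer, BufferedWriter> writers;
    private transient int taskNumber;

    public KeyPartitionedCsvOutputFormat(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void configure(Configuration parameters) {
        // no parameters needed for this sketch
    }

    @Override
    public void open(int taskNumber, int numTasks) throws IOException {
        this.taskNumber = taskNumber;
        this.writers = new HashMap<>();
    }

    @Override
    public void writeRecord(Tuple2<Integer, String> record) throws IOException {
        BufferedWriter writer = writers.get(record.f0);
        if (writer == null) {
            // Lazily open one directory and file per distinct key value.
            File dir = new File(basePath, "field0=" + record.f0);
            dir.mkdirs();
            writer = new BufferedWriter(new FileWriter(new File(dir, "part-" + taskNumber)));
            writers.put(record.f0, writer);
        }
        writer.write(record.f0 + "\t" + record.f1);
        writer.newLine();
    }

    @Override
    public void close() throws IOException {
        for (BufferedWriter writer : writers.values()) {
            writer.close();
        }
    }
}

It could then be appended as described above, e.g.
data.partitionByHash(0).output(new KeyPartitionedCsvOutputFormat("/tmp/output"));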




2016-02-16 19:52 GMT+01:00 Srikanth :

> Fabian,
>
> Not sure if we are on the same page. If I do something like below code, it
> will groupby field 0 and each task will write a separate part file in
> parallel.
>
> val sink = data1.join(data2)
> .where(1).equalTo(0) { ((l,r) => ( l._3, r._3) ) }
> .partitionByHash(0)
> .writeAsCsv(pathBase + "output/test", rowDelimiter="\n",
> fieldDelimiter="\t" , WriteMode.OVERWRITE)
>
> This will create folder ./output/test/<1,2,3,4...>
>
> But what I was looking for is Hive style partitionBy that will output with
> folder structure
>
>./output/field0=1/file
>./output/field0=2/file
>./output/field0=3/file
>./output/field0=4/file
>
> Assuming field0 is Int and has unique values 1,2,3&4.
>
> Srikanth
>
>
> On Mon, Feb 15, 2016 at 6:20 AM, Fabian Hueske  wrote:
>
>> Hi Srikanth,
>>
>> DataSet.partitionBy() will partition the data on the declared partition
>> fields.
>> If you append a DataSink with the same parallelism as the partition
>> operator, the data will be written out with the defined partitioning.
>> It should be possible to achieve the behavior you described using
>> DataSet.partitionByHash() or partitionByRange().
>>
>> Best, Fabian
>>
>>
>> 2016-02-12 20:53 GMT+01:00 Srikanth :
>>
>>> Hello,
>>>
>>>
>>>
>>> Is there a Hive(or Spark dataframe) partitionBy equivalent in Flink?
>>>
>>> I'm looking to save output as CSV files partitioned by two columns(date
>>> and hour).
>>>
>>> The partitionBy dataset API is more to partition the data based on a
>>> column for further processing.
>>>
>>>
>>>
>>> I'm thinking there is no direct API to do this. But what will be the
>>> best way of achieving this?
>>>
>>>
>>>
>>> Srikanth
>>>
>>>
>>>
>>
>>
>


Re: writeAsCSV with partitionBy

2016-05-23 Thread KirstiLaurila
Are there any plans to implement this kind of feature (the possibility to
write data to specified partitions) in the near future?





Re: writeAsCSV with partitionBy

2016-05-23 Thread Fabian Hueske
Hi Kirsti,

I'm not aware of anybody working on this issue.
Would you like to create a JIRA issue for it?

Best, Fabian

2016-05-23 16:56 GMT+02:00 KirstiLaurila :

> Is there any plans to implement this kind of feature (possibility to write
> to
> data specified partitions) in the near future?
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/writeAsCSV-with-partitionBy-tp4893p7099.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


Re: writeAsCSV with partitionBy

2016-05-23 Thread KirstiLaurila
Yeah, created this one  https://issues.apache.org/jira/browse/FLINK-3961
  






Re: writeAsCSV with partitionBy

2016-05-24 Thread Srikanth
Isn't this related to -- https://issues.apache.org/jira/browse/FLINK-2672 ??

This can be achieved with a RollingSink[1] & custom Bucketer probably.

[1]
https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/connectors/fs/RollingSink.html

Srikanth

On Tue, May 24, 2016 at 1:07 AM, KirstiLaurila 
wrote:

> Yeah, created this one  https://issues.apache.org/jira/browse/FLINK-3961
> 
>
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/writeAsCSV-with-partitionBy-tp4893p7118.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


Re: writeAsCSV with partitionBy

2016-05-24 Thread Juho Autio
RollingSink is part of Flink Streaming API. Can it be used in Flink Batch
jobs, too?

As implied in FLINK-2672, RollingSink doesn't support dynamic bucket paths
based on the tuple fields. The path must be given when creating the
RollingSink instance, i.e. before deploying the job. Yes, a custom Bucketer
can be provided, but with the current method signature the tuple is not
passed to the Bucketer.

On Tue, May 24, 2016 at 4:45 PM, Srikanth  wrote:

> Isn't this related to -- https://issues.apache.org/jira/browse/FLINK-2672
> ??
>
> This can be achieved with a RollingSink[1] & custom Bucketer probably.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/connectors/fs/RollingSink.html
>
> Srikanth
>
> On Tue, May 24, 2016 at 1:07 AM, KirstiLaurila 
> wrote:
>
>> Yeah, created this one  https://issues.apache.org/jira/browse/FLINK-3961
>> 
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/writeAsCSV-with-partitionBy-tp4893p7118.html
>> Sent from the Apache Flink User Mailing List archive. mailing list
>> archive at Nabble.com.
>>
>


Re: writeAsCSV with partitionBy

2016-05-24 Thread KirstiLaurila
Maybe, I don't know, but with streaming. How about batch?


Srikanth wrote
> Isn't this related to -- https://issues.apache.org/jira/browse/FLINK-2672
> ??
> 
> This can be achieved with a RollingSink[1] & custom Bucketer probably.
> 
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/connectors/fs/RollingSink.html







Re: writeAsCSV with partitionBy

2016-05-25 Thread Aljoscha Krettek
Hi,
the RollingSink can only be used with streaming. Adding support for dynamic
paths based on element contents is certainly interesting. I imagine it can
be tricky, though, to figure out when to close/flush the buckets.

Cheers,
Aljoscha

On Wed, 25 May 2016 at 08:36 KirstiLaurila  wrote:

> Maybe, I don't know, but with streaming. How about batch?
>
>
> Srikanth wrote
> > Isn't this related to --
> https://issues.apache.org/jira/browse/FLINK-2672
> > ??
> >
> > This can be achieved with a RollingSink[1] & custom Bucketer probably.
> >
> > [1]
> >
> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/connectors/fs/RollingSink.html
>
>
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/writeAsCSV-with-partitionBy-tp4893p7140.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE

2015-06-30 Thread Till Rohrmann
Hi Mihail,

have you checked that the DataSet you want to write to HDFS actually
contains data elements? You can try calling collect which retrieves the
data to your client to see what’s in there.

Cheers,
Till
​

On Tue, Jun 30, 2015 at 12:01 PM, Mihail Vieru <
vi...@informatik.hu-berlin.de> wrote:

> Hi,
>
> the writeAsCsv method is not writing anything to HDFS (version 1.2.1) when
> the WriteMode is set to OVERWRITE.
> A file is created but it's empty. And no trace of errors in the Flink or
> Hadoop logs on all nodes in the cluster.
>
> What could cause this issue? I really really need this feature..
>
> Best,
> Mihail
>


Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE

2015-06-30 Thread Mihail Vieru

Hi Till,

thank you for your reply.

I have the following code snippet:

intermediateGraph.getVertices().writeAsCsv(tempGraphOutputPath, "\n", ";", WriteMode.OVERWRITE);


When I remove the WriteMode parameter, it works. So I can reason that 
the DataSet contains data elements.


Cheers,
Mihail


On 30.06.2015 12:06, Till Rohrmann wrote:


Hi Mihail,

have you checked that the |DataSet| you want to write to HDFS actually 
contains data elements? You can try calling |collect| which retrieves 
the data to your client to see what’s in there.


Cheers,
Till

​

On Tue, Jun 30, 2015 at 12:01 PM, Mihail Vieru 
mailto:vi...@informatik.hu-berlin.de>> 
wrote:


Hi,

the writeAsCsv method is not writing anything to HDFS (version
1.2.1) when the WriteMode is set to OVERWRITE.
A file is created but it's empty. And no trace of errors in the
Flink or Hadoop logs on all nodes in the cluster.

What could cause this issue? I really really need this feature..

Best,
Mihail






Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE

2015-06-30 Thread Mihail Vieru

I think my problem is related to a loop in my job.

Before the loop, the writeAsCsv method works fine, even in overwrite mode.

In the loop, in the first iteration, it writes an empty folder 
containing empty files to HDFS. Even though the DataSet it is supposed 
to write contains elements.


Needless to say, this doesn't occur in a local execution environment, 
when writing to the local file system.



I would appreciate any input on this.

Best,
Mihail


On 30.06.2015 12:10, Mihail Vieru wrote:

Hi Till,

thank you for your reply.

I have the following code snippet:

/intermediateGraph.getVertices().writeAsCsv(tempGraphOutputPath, "\n", 
";", WriteMode.OVERWRITE);/


When I remove the WriteMode parameter, it works. So I can reason that 
the DataSet contains data elements.


Cheers,
Mihail


On 30.06.2015 12:06, Till Rohrmann wrote:


Hi Mihail,

have you checked that the |DataSet| you want to write to HDFS 
actually contains data elements? You can try calling |collect| which 
retrieves the data to your client to see what’s in there.


Cheers,
Till

​

On Tue, Jun 30, 2015 at 12:01 PM, Mihail Vieru 
> wrote:


Hi,

the writeAsCsv method is not writing anything to HDFS (version
1.2.1) when the WriteMode is set to OVERWRITE.
A file is created but it's empty. And no trace of errors in the
Flink or Hadoop logs on all nodes in the cluster.

What could cause this issue? I really really need this feature..

Best,
Mihail








Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE

2015-06-30 Thread Maximilian Michels
Hi Mihail,

Thank you for your question. Do you have a short example that reproduces
the problem? It is hard to find the cause without an error message or some
example code.

I wonder how your loop works without WriteMode.OVERWRITE because it should
throw an exception in this case. Or do you change the file names on every
write?

Cheers,
Max

On Tue, Jun 30, 2015 at 3:47 PM, Mihail Vieru  wrote:

>  I think my problem is related to a loop in my job.
>
> Before the loop, the writeAsCsv method works fine, even in overwrite mode.
>
> In the loop, in the first iteration, it writes an empty folder containing
> empty files to HDFS. Even though the DataSet it is supposed to write
> contains elements.
>
> Needless to say, this doesn't occur in a local execution environment, when
> writing to the local file system.
>
>
> I would appreciate any input on this.
>
> Best,
> Mihail
>
>
>
> On 30.06.2015 12:10, Mihail Vieru wrote:
>
> Hi Till,
>
> thank you for your reply.
>
> I have the following code snippet:
>
> *intermediateGraph.getVertices().writeAsCsv(tempGraphOutputPath, "\n",
> ";", WriteMode.OVERWRITE);*
>
> When I remove the WriteMode parameter, it works. So I can reason that the
> DataSet contains data elements.
>
> Cheers,
> Mihail
>
>
> On 30.06.2015 12:06, Till Rohrmann wrote:
>
>  Hi Mihail,
>
> have you checked that the DataSet you want to write to HDFS actually
> contains data elements? You can try calling collect which retrieves the
> data to your client to see what’s in there.
>
> Cheers,
> Till
> ​
>
> On Tue, Jun 30, 2015 at 12:01 PM, Mihail Vieru <
> vi...@informatik.hu-berlin.de> wrote:
>
>> Hi,
>>
>> the writeAsCsv method is not writing anything to HDFS (version 1.2.1)
>> when the WriteMode is set to OVERWRITE.
>> A file is created but it's empty. And no trace of errors in the Flink or
>> Hadoop logs on all nodes in the cluster.
>>
>> What could cause this issue? I really really need this feature..
>>
>> Best,
>> Mihail
>>
>
>
>
>


Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE

2015-07-02 Thread Maximilian Michels
Hi Mihail,

Thanks for the code. I'm trying to reproduce the problem now.

On Wed, Jul 1, 2015 at 8:30 PM, Mihail Vieru 
wrote:

>  Hi Max,
>
> thank you for your reply. I wanted to revise and dismiss all other factors
> before writing back. I've attached you my code and sample input data.
>
> I run the *APSPNaiveJob* using the following arguments:
>
> *0 100 hdfs://path/to/vertices-test-100 hdfs://path/to/edges-test-100
> hdfs://path/to/tempgraph 10 0.5 hdfs://path/to/output-apsp 9*
>
> I was wrong, I originally thought that the first writeAsCsv call (line 50)
> doesn't work. An exception is thrown without the WriteMode.OVERWRITE when
> the file exists.
>
> But the problem lies with the second call (line 74), trying to write to
> the same path on HDFS.
>
> This issue is blocking me, because I need to persist the vertices dataset
> between iterations.
>
> Cheers,
> Mihail
>
> P.S.: I'm using the latest 0.10-SNAPSHOT and HDFS 1.2.1.
>
>
>
> On 30.06.2015 16:51, Maximilian Michels wrote:
>
>   HI Mihail,
>
>  Thank you for your question. Do you have a short example that reproduces
> the problem? It is hard to find the cause without an error message or some
> example code.
>
>  I wonder how your loop works without WriteMode.OVERWRITE because it
> should throw an exception in this case. Or do you change the file names on
> every write?
>
>  Cheers,
>  Max
>
> On Tue, Jun 30, 2015 at 3:47 PM, Mihail Vieru <
> vi...@informatik.hu-berlin.de> wrote:
>
>>  I think my problem is related to a loop in my job.
>>
>> Before the loop, the writeAsCsv method works fine, even in overwrite mode.
>>
>> In the loop, in the first iteration, it writes an empty folder containing
>> empty files to HDFS. Even though the DataSet it is supposed to write
>> contains elements.
>>
>> Needless to say, this doesn't occur in a local execution environment,
>> when writing to the local file system.
>>
>>
>> I would appreciate any input on this.
>>
>> Best,
>> Mihail
>>
>>
>>
>> On 30.06.2015 12:10, Mihail Vieru wrote:
>>
>> Hi Till,
>>
>> thank you for your reply.
>>
>> I have the following code snippet:
>>
>> *intermediateGraph.getVertices().writeAsCsv(tempGraphOutputPath, "\n",
>> ";", WriteMode.OVERWRITE);*
>>
>> When I remove the WriteMode parameter, it works. So I can reason that the
>> DataSet contains data elements.
>>
>> Cheers,
>> Mihail
>>
>>
>> On 30.06.2015 12:06, Till Rohrmann wrote:
>>
>>  Hi Mihail,
>>
>> have you checked that the DataSet you want to write to HDFS actually
>> contains data elements? You can try calling collect which retrieves the
>> data to your client to see what’s in there.
>>
>> Cheers,
>> Till
>> ​
>>
>> On Tue, Jun 30, 2015 at 12:01 PM, Mihail Vieru <
>> vi...@informatik.hu-berlin.de> wrote:
>>
>>> Hi,
>>>
>>> the writeAsCsv method is not writing anything to HDFS (version 1.2.1)
>>> when the WriteMode is set to OVERWRITE.
>>> A file is created but it's empty. And no trace of errors in the Flink or
>>> Hadoop logs on all nodes in the cluster.
>>>
>>> What could cause this issue? I really really need this feature..
>>>
>>> Best,
>>> Mihail
>>>
>>
>>
>>
>>
>
>


Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE

2015-07-02 Thread Maximilian Michels
The problem is that your input and output paths are the same. Because Flink
executes in a pipelined fashion, all the operators will come up at once.
When you set WriteMode.OVERWRITE for the sink, it will delete the path
before writing anything. That means that when your DataSource reads the
input, there will be nothing to read from. Thus you get an empty DataSet
which you write to HDFS again. Any further loops will then just write
nothing.

You can circumvent this problem by prefixing every output file with a
counter that you increment in your loop. Alternatively, if you only want to
keep the latest output, you can use two paths and let them alternate as
input and output.

Let me know if you have any further questions.

Kind regards,
Max
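
A minimal sketch of the alternating-paths variant (the paths, schema and loop
count are placeholders, and the actual algorithm step is elided):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem.WriteMode;

public class AlternatingPathsSketch {
    public static void main(String[] args) throws Exception {
        // The initial vertex data is assumed to have been written to paths[0]
        // before the loop starts.
        String[] paths = {"hdfs:///tmp/graph-a", "hdfs:///tmp/graph-b"};

        for (int i = 0; i < 10; i++) {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            String inputPath = paths[i % 2];        // read from one path ...
            String outputPath = paths[(i + 1) % 2]; // ... and write to the other

            DataSet<Tuple2<Long, Double>> vertices = env
                    .readCsvFile(inputPath)
                    .types(Long.class, Double.class);

            // ... apply one iteration of the algorithm to 'vertices' here ...

            // OVERWRITE is safe here because the path being deleted is never
            // the one currently being read.
            vertices.writeAsCsv(outputPath, WriteMode.OVERWRITE);
            env.execute("iteration " + i);
        }
    }
}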

On Thu, Jul 2, 2015 at 10:20 AM, Maximilian Michels  wrote:

> Hi Mihail,
>
> Thanks for the code. I'm trying to reproduce the problem now.
>
> On Wed, Jul 1, 2015 at 8:30 PM, Mihail Vieru <
> vi...@informatik.hu-berlin.de> wrote:
>
>>  Hi Max,
>>
>> thank you for your reply. I wanted to revise and dismiss all other
>> factors before writing back. I've attached you my code and sample input
>> data.
>>
>> I run the *APSPNaiveJob* using the following arguments:
>>
>> *0 100 hdfs://path/to/vertices-test-100 hdfs://path/to/edges-test-100
>> hdfs://path/to/tempgraph 10 0.5 hdfs://path/to/output-apsp 9*
>>
>> I was wrong, I originally thought that the first writeAsCsv call (line
>> 50) doesn't work. An exception is thrown without the WriteMode.OVERWRITE
>> when the file exists.
>>
>> But the problem lies with the second call (line 74), trying to write to
>> the same path on HDFS.
>>
>> This issue is blocking me, because I need to persist the vertices dataset
>> between iterations.
>>
>> Cheers,
>> Mihail
>>
>> P.S.: I'm using the latest 0.10-SNAPSHOT and HDFS 1.2.1.
>>
>>
>>
>> On 30.06.2015 16:51, Maximilian Michels wrote:
>>
>>   HI Mihail,
>>
>>  Thank you for your question. Do you have a short example that reproduces
>> the problem? It is hard to find the cause without an error message or some
>> example code.
>>
>>  I wonder how your loop works without WriteMode.OVERWRITE because it
>> should throw an exception in this case. Or do you change the file names on
>> every write?
>>
>>  Cheers,
>>  Max
>>
>> On Tue, Jun 30, 2015 at 3:47 PM, Mihail Vieru <
>> vi...@informatik.hu-berlin.de> wrote:
>>
>>>  I think my problem is related to a loop in my job.
>>>
>>> Before the loop, the writeAsCsv method works fine, even in overwrite
>>> mode.
>>>
>>> In the loop, in the first iteration, it writes an empty folder
>>> containing empty files to HDFS. Even though the DataSet it is supposed to
>>> write contains elements.
>>>
>>> Needless to say, this doesn't occur in a local execution environment,
>>> when writing to the local file system.
>>>
>>>
>>> I would appreciate any input on this.
>>>
>>> Best,
>>> Mihail
>>>
>>>
>>>
>>> On 30.06.2015 12:10, Mihail Vieru wrote:
>>>
>>> Hi Till,
>>>
>>> thank you for your reply.
>>>
>>> I have the following code snippet:
>>>
>>> *intermediateGraph.getVertices().writeAsCsv(tempGraphOutputPath, "\n",
>>> ";", WriteMode.OVERWRITE);*
>>>
>>> When I remove the WriteMode parameter, it works. So I can reason that
>>> the DataSet contains data elements.
>>>
>>> Cheers,
>>> Mihail
>>>
>>>
>>> On 30.06.2015 12:06, Till Rohrmann wrote:
>>>
>>>  Hi Mihail,
>>>
>>> have you checked that the DataSet you want to write to HDFS actually
>>> contains data elements? You can try calling collect which retrieves the
>>> data to your client to see what’s in there.
>>>
>>> Cheers,
>>> Till
>>> ​
>>>
>>> On Tue, Jun 30, 2015 at 12:01 PM, Mihail Vieru <
>>> vi...@informatik.hu-berlin.de> wrote:
>>>
 Hi,

 the writeAsCsv method is not writing anything to HDFS (version 1.2.1)
 when the WriteMode is set to OVERWRITE.
 A file is created but it's empty. And no trace of errors in the Flink
 or Hadoop logs on all nodes in the cluster.

 What could cause this issue? I really really need this feature..

 Best,
 Mihail

>>>
>>>
>>>
>>>
>>
>>
>


Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE

2015-07-02 Thread Mihail Vieru

I've implemented the alternating 2 files solution and everything works now.

Thanks a lot! You saved my day :)

Cheers,
Mihail

On 02.07.2015 12:37, Maximilian Michels wrote:
The problem is that your input and output paths are the same. Because
Flink executes in a pipelined fashion, all the operators will come up
at once. When you set WriteMode.OVERWRITE for the sink, it will delete
the path before writing anything. That means that when your DataSource
reads the input, there will be nothing left to read. Thus you get an
empty DataSet, which you write to HDFS again. Any further loop
iterations will then just write nothing.


You can circumvent this problem by prefixing every output file with a
counter that you increment in your loop. Alternatively, if you only
want to keep the latest output, you can use two files and let them
alternate between input and output.
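
For example, the counter variant might look roughly like this (a sketch
only; the base path, the Tuple2<Long, Double> element type and the
per-iteration transformation are illustrative placeholders):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem.WriteMode;

public class CounterPrefixSketch {

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // basePath + "-0" is assumed to hold the initial data.
        final String basePath = "hdfs:///path/to/tempgraph";
        final int maxIterations = 10;

        for (int i = 0; i < maxIterations; i++) {
            // Round i reads what round i-1 wrote and writes to a fresh path,
            // so the sink never deletes the file the source is reading.
            String inputPath = basePath + "-" + i;
            String outputPath = basePath + "-" + (i + 1);

            DataSet<Tuple2<Long, Double>> vertices =
                    env.readCsvFile(inputPath).types(Long.class, Double.class);

            // ... per-iteration transformation goes here ...

            // OVERWRITE only matters if the job is re-run over old files.
            vertices.writeAsCsv(outputPath, WriteMode.OVERWRITE);
            env.execute("iteration " + i);
        }
    }
}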


Let me know if you have any further questions.

Kind regards,
Max



Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE

2015-07-03 Thread Maximilian Michels
You're welcome. I'm glad I could help out :)

Cheers,
Max

On Thu, Jul 2, 2015 at 9:17 PM, Mihail Vieru 
wrote:

>  I've implemented the alternating 2 files solution and everything works
> now.
>
> Thanks a lot! You saved my day :)
>
> Cheers,
> Mihail
>
>
> On 02.07.2015 12:37, Maximilian Michels wrote:
>
>   The problem is that your input and output paths are the same. Because
> Flink executes in a pipelined fashion, all the operators will come up at
> once. When you set WriteMode.OVERWRITE for the sink, it will delete the
> path before writing anything. That means that when your DataSource reads
> the input, there will be nothing left to read. Thus you get an empty
> DataSet, which you write to HDFS again. Any further loop iterations will
> then just write nothing.
>
>  You can circumvent this problem by prefixing every output file with a
> counter that you increment in your loop. Alternatively, if you only want to
> keep the latest output, you can use two files and let them alternate between
> input and output.
>
>  Let me know if you have any further questions.
>
>  Kind regards,
>  Max
>
> On Thu, Jul 2, 2015 at 10:20 AM, Maximilian Michels 
> wrote:
>
>> Hi Mihail,
>>
>> Thanks for the code. I'm trying to reproduce the problem now.
>>
>> On Wed, Jul 1, 2015 at 8:30 PM, Mihail Vieru <
>> vi...@informatik.hu-berlin.de> wrote:
>>
>>>  Hi Max,
>>>
>>> thank you for your reply. I wanted to revise and dismiss all other
>>> factors before writing back. I've attached you my code and sample input
>>> data.
>>>
>>> I run the *APSPNaiveJob* using the following arguments:
>>>
>>> *0 100 hdfs://path/to/vertices-test-100 hdfs://path/to/edges-test-100
>>> hdfs://path/to/tempgraph 10 0.5 hdfs://path/to/output-apsp 9*
>>>
>>> I was wrong; I originally thought that the first writeAsCsv call (line
>>> 50) doesn't work. Without WriteMode.OVERWRITE, an exception is thrown
>>> when the file exists.
>>>
>>> But the problem lies with the second call (line 74), trying to write to
>>> the same path on HDFS.
>>>
>>> This issue is blocking me, because I need to persist the vertices
>>> dataset between iterations.
>>>
>>> Cheers,
>>> Mihail
>>>
>>> P.S.: I'm using the latest 0.10-SNAPSHOT and HDFS 1.2.1.
>>>
>>>
>>>
>>> On 30.06.2015 16:51, Maximilian Michels wrote:
>>>
>>>   Hi Mihail,
>>>
>>>  Thank you for your question. Do you have a short example that
>>> reproduces the problem? It is hard to find the cause without an error
>>> message or some example code.
>>>
>>>  I wonder how your loop works without WriteMode.OVERWRITE because it
>>> should throw an exception in this case. Or do you change the file names on
>>> every write?
>>>
>>>  Cheers,
>>>  Max
>>>
>>> On Tue, Jun 30, 2015 at 3:47 PM, Mihail Vieru <
>>> vi...@informatik.hu-berlin.de> wrote:
>>>
  I think my problem is related to a loop in my job.

 Before the loop, the writeAsCsv method works fine, even in overwrite
 mode.

 In the loop, in the first iteration, it writes an empty folder
 containing empty files to HDFS. Even though the DataSet it is supposed to
 write contains elements.

 Needless to say, this doesn't occur in a local execution environment,
 when writing to the local file system.


 I would appreciate any input on this.

 Best,
 Mihail



 On 30.06.2015 12:10, Mihail Vieru wrote:

 Hi Till,

 thank you for your reply.

 I have the following code snippet:

 *intermediateGraph.getVertices().writeAsCsv(tempGraphOutputPath, "\n",
 ";", WriteMode.OVERWRITE);*

 When I remove the WriteMode parameter, it works. So I can reason that
 the DataSet contains data elements.

 Cheers,
 Mihail


 On 30.06.2015 12:06, Till Rohrmann wrote:

  Hi Mihail,

 have you checked that the DataSet you want to write to HDFS actually
 contains data elements? You can try calling collect which retrieves
 the data to your client to see what’s in there.

 Cheers,
 Till
 ​

 On Tue, Jun 30, 2015 at 12:01 PM, Mihail Vieru <
 vi...@informatik.hu-berlin.de> wrote:

> Hi,
>
> the writeAsCsv method is not writing anything to HDFS (version 1.2.1)
> when the WriteMode is set to OVERWRITE.
> A file is created but it's empty. And no trace of errors in the Flink
> or Hadoop logs on all nodes in the cluster.
>
> What could cause this issue? I really really need this feature..
>
> Best,
> Mihail
>




>>>
>>>
>>
>
>