[jira] [Commented] (BEAM-2831) Pipeline crashes due to Beam encoder breaking Flink memory management

2018-03-08 Thread Guillaume Balaine (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391419#comment-16391419
 ] 

Guillaume Balaine commented on BEAM-2831:
-----------------------------------------

I fixed that in every Coder, including the KryoAtomicCoder (I am using Scio), 
and it now works properly on Flink, thanks.
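
For context, a minimal sketch of the kind of coder change involved (illustrative 
only, not the actual patch; EofSafeCoder and doDecode are hypothetical names): 
the decode path must let Flink's EOFException escape unwrapped, because Flink's 
memory manager catches exactly that exception to trigger disk spillover.

{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.coders.CustomCoder;

// Hypothetical sketch: rethrow EOFException as-is so Flink can spill to disk.
public abstract class EofSafeCoder<T> extends CustomCoder<T> {

  // Assumed helper that does the actual decoding work.
  protected abstract T doDecode(InputStream inStream) throws IOException;

  @Override
  public T decode(InputStream inStream) throws CoderException, IOException {
    try {
      return doDecode(inStream);
    } catch (EOFException e) {
      // Flink's memory management relies on seeing the raw EOFException.
      throw e;
    } catch (IOException e) {
      // Wrapping other I/O errors is fine; wrapping EOFException is what
      // broke the spillover mechanism.
      throw new CoderException("Error decoding element", e);
    }
  }
}
{code}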

> Pipeline crashes due to Beam encoder breaking Flink memory management
> ----------------------------------------------------------------------
>
> Key: BEAM-2831
> URL: https://issues.apache.org/jira/browse/BEAM-2831
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Affects Versions: 2.0.0, 2.1.0
> Environment: Flink 1.2.1 and 1.3.0, Java HotSpot and OpenJDK 8, macOS 
> 10.12.6 and unknown Linux
>Reporter: Reinier Kip
>Assignee: Aljoscha Krettek
>Priority: Major
>
> I’ve been running a Beam pipeline on Flink. Depending on the dataset size and 
> the heap memory configuration of the jobmanager and taskmanager, I may run 
> into an EOFException, which causes the job to fail.
> As [discussed on Flink's mailing list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/EOFException-related-to-memory-segments-during-run-of-Beam-pipeline-on-Flink-td15255.html]
>  (stacktrace enclosed), Flink catches these EOFExceptions and activates disk 
> spillover. Because Beam wraps these exceptions, this mechanism fails, the 
> exception travels up the stack, and the job aborts.
> Hopefully this is enough information and this is something that can be 
> adjusted for in Beam. I'd be glad to provide more information where needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-2831) Pipeline crashes due to Beam encoder breaking Flink memory management

2018-03-07 Thread Guillaume Balaine (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389772#comment-16389772
 ] 

Guillaume Balaine commented on BEAM-2831:
-----------------------------------------

The implication here is that, from 2.1 onwards, it is impossible to run any 
reasonably sized batch with the FlinkRunner using binary formats like Avro and 
Protobuf with the default block size of FileIO...

> Pipeline crashes due to Beam encoder breaking Flink memory management
> ----------------------------------------------------------------------
>
> Key: BEAM-2831
> URL: https://issues.apache.org/jira/browse/BEAM-2831
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Affects Versions: 2.0.0, 2.1.0
> Environment: Flink 1.2.1 and 1.3.0, Java HotSpot and OpenJDK 8, macOS 
> 10.12.6 and unknown Linux
>Reporter: Reinier Kip
>Assignee: Aljoscha Krettek
>Priority: Major
>
> I’ve been running a Beam pipeline on Flink. Depending on the dataset size and 
> the heap memory configuration of the jobmanager and taskmanager, I may run 
> into an EOFException, which causes the job to fail.
> As [discussed on Flink's mailing list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/EOFException-related-to-memory-segments-during-run-of-Beam-pipeline-on-Flink-td15255.html]
>  (stacktrace enclosed), Flink catches these EOFExceptions and activates disk 
> spillover. Because Beam wraps these exceptions, this mechanism fails, the 
> exception travels up the stack, and the job aborts.
> Hopefully this is enough information and this is something that can be 
> adjusted for in Beam. I'd be glad to provide more information where needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3649) HadoopSeekableByteChannel breaks when backing InputStream doesn't support ByteBuffers

2018-02-28 Thread Guillaume Balaine (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380436#comment-16380436
 ] 

Guillaume Balaine commented on BEAM-3649:
-----------------------------------------

I was using HDFS mostly for its compatibility with other APIs such as Parquet 
(there is another ongoing PR for this in Beam), but certainly a custom S3 client 
is better for simply appending.
The thing is, Hadoop has a huge ecosystem, and hadoop-fs is often targeted by 
improved S3 access layers such as SparkTC/stocator. So the S3 implementation in 
Beam needs to be pretty solid if it wants to get used.




> HadoopSeekableByteChannel breaks when backing InputStream doesn't support 
> ByteBuffers
> --------------------------------------------------------------------------------------
>
> Key: BEAM-3649
> URL: https://issues.apache.org/jira/browse/BEAM-3649
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Guillaume Balaine
>Priority: Minor
> Fix For: Not applicable
>
>
> This happened last summer, when I wanted to use S3A as the backing HDFS 
> access implementation.
> It happens because, while this method is called: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java#L145]
> the class below does not implement ByteBufferReadable: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
> I fixed it by manually incrementing the read position and copying the backing 
> array instead of buffering:
> [https://github.com/Igosuki/beam/commit/3838f0db43b6422833a045d1f097f6d7643219f1]
> I know the S3 direct implementation is the preferred path, but this setup is 
> possible, and likely affects a lot of developers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (BEAM-3649) HadoopSeekableByteChannel breaks when backing InputStream doesn't support ByteBuffers

2018-02-26 Thread Guillaume Balaine (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guillaume Balaine closed BEAM-3649.
-----------------------------------
   Resolution: Fixed
Fix Version/s: 2.4.0

Fixed by BEAM-2790

> HadoopSeekableByteChannel breaks when backing InputStream doesn't support 
> ByteBuffers
> --------------------------------------------------------------------------------------
>
> Key: BEAM-3649
> URL: https://issues.apache.org/jira/browse/BEAM-3649
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Guillaume Balaine
>Priority: Minor
> Fix For: 2.4.0
>
>
> This happened last summer, when I wanted to use S3A as the backing HDFS 
> access implementation. 
> This is because while this method is called : 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java#L145]
> This class does not implement ByteBuffer readable 
> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
> I fixed it by manually incrementing the read position and copying the backing 
> array instead of buffering.
> [https://github.com/Igosuki/beam/commit/3838f0db43b6422833a045d1f097f6d7643219f1]
> I know the s3 direct implementation is the preferred path, but this is 
> possible, and likely happens to a lot of developers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3649) HadoopSeekableByteChannel breaks when backing InputStream doesn't support ByteBuffers

2018-02-26 Thread Guillaume Balaine (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377051#comment-16377051
 ] 

Guillaume Balaine commented on BEAM-3649:
-----------------------------------------

Hello Ismaël, thanks, I saw the fix after I rebased on master yesterday. This 
was indeed the error I was getting; I should just submit my patches faster!

Closing this.

> HadoopSeekableByteChannel breaks when backing InputStream doesn't support 
> ByteBuffers
> --------------------------------------------------------------------------------------
>
> Key: BEAM-3649
> URL: https://issues.apache.org/jira/browse/BEAM-3649
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Guillaume Balaine
>Priority: Minor
>
> This happened last summer, when I wanted to use S3A as the backing HDFS 
> access implementation.
> It happens because, while this method is called: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java#L145]
> the class below does not implement ByteBufferReadable: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
> I fixed it by manually incrementing the read position and copying the backing 
> array instead of buffering:
> [https://github.com/Igosuki/beam/commit/3838f0db43b6422833a045d1f097f6d7643219f1]
> I know the S3 direct implementation is the preferred path, but this setup is 
> possible, and likely affects a lot of developers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-3649) HadoopSeekableByteChannel breaks when backing InputStream doesn't support ByteBuffers

2018-02-08 Thread Guillaume Balaine (JIRA)
Guillaume Balaine created BEAM-3649:
------------------------------------

 Summary: HadoopSeekableByteChannel breaks when backing InputStream 
doesn't support ByteBuffers
 Key: BEAM-3649
 URL: https://issues.apache.org/jira/browse/BEAM-3649
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-extensions
Affects Versions: 2.2.0, 2.1.0, 2.0.0
Reporter: Guillaume Balaine
Assignee: Reuven Lax


This happened last summer, when I wanted to use S3A as the backing HDFS access 
implementation.

It happens because, while this method is called: 
[https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java#L145]

the class below does not implement ByteBufferReadable: 
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

I fixed it by manually incrementing the read position and copying the backing 
array instead of buffering:

[https://github.com/Igosuki/beam/commit/3838f0db43b6422833a045d1f097f6d7643219f1]

I know the S3 direct implementation is the preferred path, but this setup is 
possible, and likely affects a lot of developers.
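
For illustration, a sketch of the workaround described above, under the 
assumption of a heap-backed (non-direct) ByteBuffer; the class and method names 
are made up for the example, only the Hadoop APIs (FSDataInputStream, 
ByteBufferReadable) are real:

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.ByteBufferReadable;
import org.apache.hadoop.fs.FSDataInputStream;

// Hypothetical helper: choose a read path based on whether the wrapped stream
// actually supports ByteBuffer reads (S3AInputStream did not at the time).
final class ByteBufferSafeReader {
  private final FSDataInputStream in;

  ByteBufferSafeReader(FSDataInputStream in) {
    this.in = in;
  }

  int read(ByteBuffer dst) throws IOException {
    if (in.getWrappedStream() instanceof ByteBufferReadable) {
      return in.read(dst); // fast path: direct ByteBuffer read
    }
    // Fallback: read into the heap buffer's backing array and advance the
    // position manually, since read(byte[], ...) does not touch it.
    int read = in.read(dst.array(), dst.arrayOffset() + dst.position(), dst.remaining());
    if (read > 0) {
      dst.position(dst.position() + read);
    }
    return read;
  }
}
{code}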



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-793) JdbcIO can create a deadlock when parallelism is greater than 1

2017-09-21 Thread Guillaume Balaine (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174736#comment-16174736
 ] 

Guillaume Balaine commented on BEAM-793:
----------------------------------------

Alright, I made it work by having a backoff strategy and rebuilding the batch 
every time. It's likely not the best solution, but it guarantees writes, which 
is good for bounded pipelines.


{code:java}
private void processRecord(T record) throws RuntimeException {
  try {
    preparedStatement.clearParameters();
    spec.getPreparedStatementSetter().setParameters(record, preparedStatement);
    preparedStatement.addBatch();
  } catch (Exception e) {
    throw new RuntimeException(e);
  }
}

private void executeBatch() throws SQLException, IOException, InterruptedException {
  LOG.info("Writing bundle {} batch of {} statements",
      this.bundleUUID.toString(), batchCount);
  Sleeper sleeper = Sleeper.DEFAULT;
  BackOff backoff = BUNDLE_WRITE_BACKOFF.backoff();
  while (true) {
    // Batch upsert entities.
    try {
      int[] updates = preparedStatement.executeBatch();
      connection.commit();
      LOG.info("Successfully wrote {} statements of bundle {}",
          updates.length, this.bundleUUID.toString());
      // Break if the commit threw no exception.
      break;
    } catch (SQLException exception) {
      LOG.error("Error writing bundle {} to the database ({}): {}",
          this.bundleUUID.toString(), exception.getErrorCode(), exception.getMessage());
      if (!BackOffUtils.next(sleeper, backoff)) {
        LOG.error("Aborting bundle {} after {} retries.",
            this.bundleUUID.toString(), MAX_RETRIES);
        throw exception;
      } else {
        // Rebuild the statement batch from the retained records before retrying.
        records.stream().forEach(this::processRecord);
      }
    }
  }
  clearBatch();
}

private void clearBatch() {
  batchCount = 0;
  records.clear();
}
{code}
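
Note the records list retained alongside the PreparedStatement batch: after a 
failed executeBatch(), the JDBC batch state is driver-dependent, so rebuilding 
the batch from the retained records before the next attempt is what makes the 
retry safe.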


> JdbcIO can create a deadlock when parallelism is greater than 1
> ----------------------------------------------------------------
>
> Key: BEAM-793
> URL: https://issues.apache.org/jira/browse/BEAM-793
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Reporter: Jean-Baptiste Onofré
>Assignee: Jean-Baptiste Onofré
> Fix For: 2.2.0
>
>
> With the following JdbcIO configuration, if the parallelism is greater than 
> 1, we can have a {{Deadlock found when trying to get lock; try restarting 
> transaction}}.
> {code}
> MysqlDataSource dbCfg = new MysqlDataSource();
> dbCfg.setDatabaseName("db");
> dbCfg.setUser("user");
> dbCfg.setPassword("pass");
> dbCfg.setServerName("localhost");
> dbCfg.setPortNumber(3306);
> p.apply(Create.of(data))
>     .apply(JdbcIO.<Tuple5<Integer, Integer, ByteString, Long, Long>>write()
>         .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(dbCfg))
>         .withStatement("INSERT INTO smth(loc,event_type,hash,begin_date,end_date) VALUES(?, ?, ?, ?, ?) ON DUPLICATE KEY UPDATE event_type=VALUES(event_type),end_date=VALUES(end_date)")
>         .withPreparedStatementSetter(
>             new JdbcIO.PreparedStatementSetter<Tuple5<Integer, Integer, ByteString, Long, Long>>() {
>               public void setParameters(
>                   Tuple5<Integer, Integer, ByteString, Long, Long> element,
>                   PreparedStatement statement) throws Exception {
>                 statement.setInt(1, element.f0);
>                 statement.setInt(2, element.f1);
>                 statement.setBytes(3, element.f2.toByteArray());
>                 statement.setLong(4, element.f3);
>                 statement.setLong(5, element.f4);
>               }
>             }));
> {code}
> This can happen due to the {{autocommit}}. I'm going to investigate.
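
As an aside, a minimal sketch of the direction the autocommit remark points at 
(illustrative, not the actual JdbcIO fix; an open java.sql.Connection and 
PreparedStatement are assumed): take explicit control of the transaction 
instead of relying on autocommit, so a bundle's batch commits or rolls back as 
one unit.

{code:java}
// Hypothetical sketch, assuming connection and preparedStatement already exist.
connection.setAutoCommit(false);
try {
  preparedStatement.executeBatch();
  connection.commit();
} catch (SQLException e) {
  connection.rollback(); // undo the partial batch before retrying or failing
  throw e;
}
{code}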



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-793) JdbcIO can create a deadlock when parallelism is greater than 1

2017-09-21 Thread Guillaume Balaine (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174391#comment-16174391
 ] 

Guillaume Balaine commented on BEAM-793:
----------------------------------------

Hello Jean-Baptiste, thanks for being so responsive.
I copied over JdbcIO and am trying to make it work. The use case is a bounded 
data pipeline, so adding a few seconds of processing time does not matter. I 
use the same strategy as the Google Datastore implementation, with BackOffUtils 
making the batch flush fn sleep increasingly in case of deadlocks.
Unfortunately, it seems that my pipeline then terminates before all batches can 
go through, perhaps because of the @Teardown which, in the current JdbcIO impl, 
closes the statement without caring whether a retry loop is still ongoing.
I'll let you know if that works.

> JdbcIO can create a deadlock when parallelism is greater than 1
> ----------------------------------------------------------------
>
> Key: BEAM-793
> URL: https://issues.apache.org/jira/browse/BEAM-793
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Reporter: Jean-Baptiste Onofré
>Assignee: Jean-Baptiste Onofré
> Fix For: 2.2.0
>
>
> With the following JdbcIO configuration, if the parallelism is greater than 
> 1, we can have a {{Deadlock found when trying to get lock; try restarting 
> transaction}}.
> {code}
> MysqlDataSource dbCfg = new MysqlDataSource();
> dbCfg.setDatabaseName("db");
> dbCfg.setUser("user");
> dbCfg.setPassword("pass");
> dbCfg.setServerName("localhost");
> dbCfg.setPortNumber(3306);
> p.apply(Create.of(data))
>     .apply(JdbcIO.<Tuple5<Integer, Integer, ByteString, Long, Long>>write()
>         .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(dbCfg))
>         .withStatement("INSERT INTO smth(loc,event_type,hash,begin_date,end_date) VALUES(?, ?, ?, ?, ?) ON DUPLICATE KEY UPDATE event_type=VALUES(event_type),end_date=VALUES(end_date)")
>         .withPreparedStatementSetter(
>             new JdbcIO.PreparedStatementSetter<Tuple5<Integer, Integer, ByteString, Long, Long>>() {
>               public void setParameters(
>                   Tuple5<Integer, Integer, ByteString, Long, Long> element,
>                   PreparedStatement statement) throws Exception {
>                 statement.setInt(1, element.f0);
>                 statement.setInt(2, element.f1);
>                 statement.setBytes(3, element.f2.toByteArray());
>                 statement.setLong(4, element.f3);
>                 statement.setLong(5, element.f4);
>               }
>             }));
> {code}
> This can happen due to the {{autocommit}}. I'm going to investigate.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-793) JdbcIO can create a deadlock when parallelism is greater than 1

2017-09-20 Thread Guillaume Balaine (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173911#comment-16173911
 ] 

Guillaume Balaine commented on BEAM-793:
----------------------------------------

This is a major problem: it easily occurs if batches are available in quick 
succession and contain unordered data.
There needs to be a lock so that finishBundle does not try to write while the 
previous batch is not yet finished.
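
For illustration, one way to realize the lock this comment proposes, sketched 
against hypothetical fields (batchLock, preparedStatement, and connection are 
assumptions for the example, not JdbcIO's actual code):

{code:java}
// Hypothetical sketch: serialize batch flushes so a new flush cannot start
// while a previous batch's transaction is still in flight on this writer.
private final Object batchLock = new Object();

private void flushBatch() throws SQLException {
  synchronized (batchLock) {
    preparedStatement.executeBatch();
    connection.commit();
  }
}
{code}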

> JdbcIO can create a deadlock when parallelism is greater than 1
> ----------------------------------------------------------------
>
> Key: BEAM-793
> URL: https://issues.apache.org/jira/browse/BEAM-793
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Reporter: Jean-Baptiste Onofré
>Assignee: Jean-Baptiste Onofré
>
> With the following JdbcIO configuration, if the parallelism is greater than 
> 1, we can have a {{Deadlock found when trying to get lock; try restarting 
> transaction}}.
> {code}
> MysqlDataSource dbCfg = new MysqlDataSource();
> dbCfg.setDatabaseName("db");
> dbCfg.setUser("user");
> dbCfg.setPassword("pass");
> dbCfg.setServerName("localhost");
> dbCfg.setPortNumber(3306);
> p.apply(Create.of(data))
>     .apply(JdbcIO.<Tuple5<Integer, Integer, ByteString, Long, Long>>write()
>         .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(dbCfg))
>         .withStatement("INSERT INTO smth(loc,event_type,hash,begin_date,end_date) VALUES(?, ?, ?, ?, ?) ON DUPLICATE KEY UPDATE event_type=VALUES(event_type),end_date=VALUES(end_date)")
>         .withPreparedStatementSetter(
>             new JdbcIO.PreparedStatementSetter<Tuple5<Integer, Integer, ByteString, Long, Long>>() {
>               public void setParameters(
>                   Tuple5<Integer, Integer, ByteString, Long, Long> element,
>                   PreparedStatement statement) throws Exception {
>                 statement.setInt(1, element.f0);
>                 statement.setInt(2, element.f1);
>                 statement.setBytes(3, element.f2.toByteArray());
>                 statement.setLong(4, element.f3);
>                 statement.setLong(5, element.f4);
>               }
>             }));
> {code}
> This can happen due to the {{autocommit}}. I'm going to investigate.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2500) Add support for S3 as an Apache Beam FileSystem

2017-08-11 Thread Guillaume Balaine (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123445#comment-16123445
 ] 

Guillaume Balaine commented on BEAM-2500:
-----------------------------------------

Sorry about that, I discovered a bug later down the line: we actually have to 
set the position in the buffer manually after using the backing array, like 
this:

{code:java}
@Override
public int read(ByteBuffer dst) throws IOException {
  if (closed) {
    throw new IOException("Channel is closed");
  }
  int read = 0;
  try {
    read = inputStream.read(dst);
  } catch (UnsupportedOperationException e) {
    // Fallback read into the buffer's backing array, honoring the current
    // position and offset; read(byte[], ...) does not advance the
    // ByteBuffer's position, so move it manually.
    read = inputStream.read(dst.array(), dst.arrayOffset() + dst.position(), dst.remaining());
    if (read > 0) {
      dst.position(dst.position() + read);
    }
  }
  return read;
}
{code}

> Add support for S3 as an Apache Beam FileSystem
> -----------------------------------------------
>
> Key: BEAM-2500
> URL: https://issues.apache.org/jira/browse/BEAM-2500
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-extensions
>Reporter: Luke Cwik
>Priority: Minor
> Attachments: hadoop_fs_patch.patch
>
>
> Note that this is for providing direct integration with S3 as an Apache Beam 
> FileSystem.
> There is already support for using the Hadoop S3 connector by depending on 
> the Hadoop File System module[1], configuring HadoopFileSystemOptions[2] with 
> a S3 configuration[3].
> 1: https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system
> 2: 
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L53
> 3: https://wiki.apache.org/hadoop/AmazonS3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (BEAM-2500) Add support for S3 as an Apache Beam FileSystem

2017-08-11 Thread Guillaume Balaine (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123184#comment-16123184
 ] 

Guillaume Balaine edited comment on BEAM-2500 at 8/11/17 11:02 AM:
-------------------------------------------------------------------

This patch works for me [^hadoop_fs_patch.patch]. Thanks Steve.


was (Author: igosuki):
This patch works for me.

> Add support for S3 as an Apache Beam FileSystem
> -----------------------------------------------
>
> Key: BEAM-2500
> URL: https://issues.apache.org/jira/browse/BEAM-2500
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-extensions
>Reporter: Luke Cwik
>Priority: Minor
> Attachments: hadoop_fs_patch.patch
>
>
> Note that this is for providing direct integration with S3 as an Apache Beam 
> FileSystem.
> There is already support for using the Hadoop S3 connector by depending on 
> the Hadoop File System module[1], configuring HadoopFileSystemOptions[2] with 
> a S3 configuration[3].
> 1: https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system
> 2: 
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L53
> 3: https://wiki.apache.org/hadoop/AmazonS3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2500) Add support for S3 as an Apache Beam FileSystem

2017-08-11 Thread Guillaume Balaine (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guillaume Balaine updated BEAM-2500:
------------------------------------
Attachment: hadoop_fs_patch.patch

This patch works for me.

> Add support for S3 as an Apache Beam FileSystem
> -----------------------------------------------
>
> Key: BEAM-2500
> URL: https://issues.apache.org/jira/browse/BEAM-2500
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-extensions
>Reporter: Luke Cwik
>Priority: Minor
> Attachments: hadoop_fs_patch.patch
>
>
> Note that this is for providing direct integration with S3 as an Apache Beam 
> FileSystem.
> There is already support for using the Hadoop S3 connector by depending on 
> the Hadoop File System module[1], configuring HadoopFileSystemOptions[2] with 
> a S3 configuration[3].
> 1: https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system
> 2: 
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L53
> 3: https://wiki.apache.org/hadoop/AmazonS3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2500) Add support for S3 as an Apache Beam FileSystem

2017-08-11 Thread Guillaume Balaine (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123047#comment-16123047
 ] 

Guillaume Balaine commented on BEAM-2500:
-----------------------------------------

Thanks, that's fine really; the only trouble was that I had to dig into some 
example code to find it out, because no stacktraces pop up in Beam. It's just 
that resolving a ResourceId against a name containing ':' gives you an 
incomplete URI, where the base path is truncated:

(s3a://mybucket/myfolder/somefilename.fmt).resolve(somefilename-12:30-13:30.fmt)
 -> ResourceId{URI{somefilename-12:30-13:30.fmt}}
while
(s3a://mybucket/myfolder/somefilename.fmt).resolve(somefilename-12.30-13.30.fmt)
 -> ResourceId{URI{s3a://mybucket/myfolder/somefilename-12.30-13.30.fmt}}

so people need to be aware of their file name policies in Beam.
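
A self-contained demonstration of the underlying URI behavior (bucket and file 
names are made up): a relative name containing an unescaped ':' parses as if it 
had a URI scheme, so resolve() returns it unchanged instead of prepending the 
base path.

{code:java}
import java.net.URI;

public class ColonResolveDemo {
  public static void main(String[] args) {
    URI base = URI.create("s3a://mybucket/myfolder/");

    // Dots are safe: the name is a relative reference, so the base is kept.
    System.out.println(base.resolve("somefilename-12.30-13.30.fmt"));
    // -> s3a://mybucket/myfolder/somefilename-12.30-13.30.fmt

    // A ':' makes "somefilename-12" parse as a URI scheme; the reference is
    // then absolute, resolve() returns it as-is, and the base path is lost.
    System.out.println(base.resolve("somefilename-12:30-13:30.fmt"));
    // -> somefilename-12:30-13:30.fmt
  }
}
{code}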

On another note, reads don't work because S3 input streams don't implement 
ByteBufferReadable, as you mentioned here: 
https://stackoverflow.com/questions/44792884/apache-beam-unable-to-read-text-file-from-s3-using-hadoop-file-system-sdk
so I guess fixing that would be enough to resolve this issue.


> Add support for S3 as an Apache Beam FileSystem
> -----------------------------------------------
>
> Key: BEAM-2500
> URL: https://issues.apache.org/jira/browse/BEAM-2500
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-extensions
>Reporter: Luke Cwik
>Priority: Minor
>
> Note that this is for providing direct integration with S3 as an Apache Beam 
> FileSystem.
> There is already support for using the Hadoop S3 connector by depending on 
> the Hadoop File System module[1], configuring HadoopFileSystemOptions[2] with 
> a S3 configuration[3].
> 1: https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system
> 2: 
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L53
> 3: https://wiki.apache.org/hadoop/AmazonS3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2500) Add support for S3 as an Apache Beam FileSystem

2017-08-09 Thread Guillaume Balaine (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120095#comment-16120095
 ] 

Guillaume Balaine commented on BEAM-2500:
-----------------------------------------

I got s3a to work on a simple aggregation job: I just write to s3a text files 
and include "org.apache.hadoop" % "hadoop-aws" % "2.7.3".
Is there anything we're missing? The only trouble I had was in debugging, where 
my file name policy was formatting ':' characters into file names, which gave a 
wrong ResourceId in Beam.
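
For reference, a hedged sketch of the wiring the issue description points at, 
passing an s3a Hadoop Configuration to HadoopFileSystemOptions (the fs.s3a.* 
keys come from hadoop-aws; the credential values are placeholders):

{code:java}
import java.util.Collections;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class S3aOptionsSketch {
  public static void main(String[] args) {
    // Standard hadoop-aws configuration keys for s3a access.
    Configuration conf = new Configuration();
    conf.set("fs.s3a.access.key", "ACCESS_KEY"); // placeholder
    conf.set("fs.s3a.secret.key", "SECRET_KEY"); // placeholder

    // Hand the configuration to Beam's Hadoop FileSystem options.
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
    options.setHdfsConfiguration(Collections.singletonList(conf));
  }
}
{code}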

> Add support for S3 as an Apache Beam FileSystem
> -----------------------------------------------
>
> Key: BEAM-2500
> URL: https://issues.apache.org/jira/browse/BEAM-2500
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-extensions
>Reporter: Luke Cwik
>Priority: Minor
>
> Note that this is for providing direct integration with S3 as an Apache Beam 
> FileSystem.
> There is already support for using the Hadoop S3 connector by depending on 
> the Hadoop File System module[1], configuring HadoopFileSystemOptions[2] with 
> a S3 configuration[3].
> 1: https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system
> 2: 
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L53
> 3: https://wiki.apache.org/hadoop/AmazonS3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)