[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-10-09 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643903#comment-16643903
 ] 

Tim Robertson commented on BEAM-5036:
-

{quote}Should this be marked as a blocker for 2.8.0 ? PR is still in review.
{quote}
 
 I don't think this can block 2.8.0, due to time constraints. The [PR (now 
closed)|https://github.com/apache/beam/pull/6289] generated a lot of discussion 
but was not suitable for merging. I suggest we modify {{HDFSFileSystem}} to 
always overwrite (i.e. move the necessary bits from the [existing 
PR|https://github.com/apache/beam/pull/6289] into the HDFS implementation only); 
the solution will then be simpler, and the change can be made in {{rename()}}.
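
A minimal sketch of what that could look like (hypothetical, not code from the 
closed PR; it assumes the {{fileSystem}} field of type 
{{org.apache.hadoop.fs.FileSystem}} already present in {{HadoopFileSystem}}):
{code:java}
@Override
protected void rename(
    List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
    throws IOException {
  for (int i = 0; i < srcResourceIds.size(); ++i) {
    Path src = srcResourceIds.get(i).toPath();
    Path dest = destResourceIds.get(i).toPath();
    // Always overwrite: remove an existing destination before the rename.
    if (fileSystem.exists(dest)) {
      fileSystem.delete(dest, false);
    }
    if (!fileSystem.rename(src, dest)) {
      throw new IOException(
          String.format("Unable to rename resource %s to %s.", src, dest));
    }
  }
}
{code}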

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 11h
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-10-09 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-5036:

Fix Version/s: (was: 2.8.0)

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 10h 50m
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-5107) Support ES 6.x for ElasticsearchIO

2018-10-08 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-5107:

Fix Version/s: (was: 2.7.0)
   2.8.0

> Support ES 6.x for ElasticsearchIO
> --
>
> Key: BEAM-5107
> URL: https://issues.apache.org/jira/browse/BEAM-5107
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Dat Tran
>Assignee: Etienne Chauchot
>Priority: Major
> Fix For: 2.8.0
>
>  Time Spent: 11h 50m
>  Remaining Estimate: 0h
>
> Elasticsearch has released 6.3.2, but ElasticsearchIO only supports 2.x-5.x.
> We should support ES 6.x for ElasticsearchIO.
> https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
> https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5425) FileSystems contract rename/delete to allow prior partial successes to make continued progress

2018-09-20 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621666#comment-16621666
 ] 

Tim Robertson commented on BEAM-5425:
-

Please also see discussion on BEAM-5036 and 
[PR/6289|https://github.com/apache/beam/pull/6289]

> FileSystems contract rename/delete to allow prior partial successes to make 
> continued progress
> --
>
> Key: BEAM-5425
> URL: https://issues.apache.org/jira/browse/BEAM-5425
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core, sdk-py-core
>Reporter: Luke Cwik
>Assignee: Ahmet Altay
>Priority: Major
>
> The [filesystems 
> contract|https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#heading=h.kpqagzh8i11w]
>  for delete/rename say that an error should be raised if the resources don't 
> exist.
>  
> I believe the contract should be updated to not have failures if the 
> resources don't exist as we want them to be retried on failure without 
> needing the caller know that a prior call may have been partially successful.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-5425) FileSystems contract rename/delete to allow prior partial successes to make continued progress

2018-09-20 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-5425:

Description: 
The [filesystems 
contract|https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#heading=h.kpqagzh8i11w]
 for delete/rename say that an error should be raised if the resources don't 
exist.

 

I believe the contract should be updated to not have failures if the resources 
don't exist as we want them to be retried on failure without needing the caller 
know that a prior call may have been partially successful.

 

  was:
The filesystems contract for delete/rename say that an error should be raised 
if the resources don't exist.

 

I believe the contract should be updated to not have failures if the resources 
don't exist as we want them to be retried on failure without needing the caller 
know that a prior call may have been partially successful.

 


> FileSystems contract rename/delete to allow prior partial successes to make 
> continued progress
> --
>
> Key: BEAM-5425
> URL: https://issues.apache.org/jira/browse/BEAM-5425
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core, sdk-py-core
>Reporter: Luke Cwik
>Assignee: Ahmet Altay
>Priority: Major
>
> The [filesystems 
> contract|https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#heading=h.kpqagzh8i11w]
>  for delete/rename say that an error should be raised if the resources don't 
> exist.
>  
> I believe the contract should be updated to not have failures if the 
> resources don't exist as we want them to be retried on failure without 
> needing the caller know that a prior call may have been partially successful.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-09-19 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620565#comment-16620565
 ] 

Tim Robertson commented on BEAM-5036:
-

BEAM-5429 is created for the GCS implementation ( CC [~sinisa_lyh] )

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-09-19 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620565#comment-16620565
 ] 

Tim Robertson edited comment on BEAM-5036 at 9/19/18 1:19 PM:
--

BEAM-5429 is created for the GCS implementation ( CC [~sinisa_lyh] ) and I'll 
aim to complete this one in time for 2.8.0


was (Author: timrobertson100):
BEAM-5429 is created for the GCS implementation ( CC [~sinisa_lyh] )

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-5429) Optimise GCSFilesystem rename implementation

2018-09-19 Thread Tim Robertson (JIRA)
Tim Robertson created BEAM-5429:
---

 Summary: Optimise GCSFilesystem rename implementation
 Key: BEAM-5429
 URL: https://issues.apache.org/jira/browse/BEAM-5429
 Project: Beam
  Issue Type: Improvement
  Components: io-java-gcp
Affects Versions: 2.6.0
Reporter: Tim Robertson
Assignee: Chamikara Jayalath


{{GCSFileSystem}} implements a {{rename()}} with a {{copy}} and {{delete}} 
operation.

However, GCS has an [Objects: 
rewrite|https://cloud.google.com/storage/docs/json_api/v1/objects/rewrite] 
which looks like it would be a metadata operation only and therefore has the 
potential to be much quicker.

Once BEAM-5036 is fixed, IOs that write files will make use of a {{rename}} 
operation.
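
For illustration, a hedged sketch of driving that rewrite call via the 
{{google-api-services-storage}} client (the {{storage}} client object and the 
bucket/object names are placeholders, and this is not an implementation inside 
{{GcsFileSystem}}):
{code:java}
// Rewrite copies server-side and may need several calls for large objects;
// loop on the rewrite token until the operation reports completion.
Storage.Objects.Rewrite rewrite =
    storage.objects().rewrite(srcBucket, srcObject, destBucket, destObject, null);
RewriteResponse response = rewrite.execute();
while (!response.getDone()) {
  rewrite.setRewriteToken(response.getRewriteToken());
  response = rewrite.execute();
}
{code}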



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-30 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597976#comment-16597976
 ] 

Tim Robertson commented on BEAM-5036:
-

Thanks [~reuvenlax] - that was in response to my concern about the files 
already existing, right? It doesn't affect whether we use the copy/delete or the 
rename approach, or am I missing something?

I have added {{FileAlreadyExistsException}} in the [PR for changing 
HDFSFileSystem.rename()|https://github.com/apache/beam/pull/6285]. With that we 
can handle the failure case when the destination already exists: delete it 
and retry, thus forcing the overwrite. Together with IGNORE_MISSING_FILES, 
that should be as idempotent as we can achieve, I think.

Sound reasonable? 
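
For illustration, the caller side of that delete-and-retry could look roughly 
like this (a sketch only, assuming {{rename()}} surfaces 
{{java.nio.file.FileAlreadyExistsException}} as in the PR):
{code:java}
try {
  FileSystems.rename(srcFiles, destFiles, StandardMoveOptions.IGNORE_MISSING_FILES);
} catch (FileAlreadyExistsException e) {
  // The destination exists from a prior (possibly partial) attempt:
  // delete it and retry, forcing the overwrite.
  FileSystems.delete(destFiles, StandardMoveOptions.IGNORE_MISSING_FILES);
  FileSystems.rename(srcFiles, destFiles, StandardMoveOptions.IGNORE_MISSING_FILES);
}
{code}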

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-30 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597502#comment-16597502
 ] 

Tim Robertson commented on BEAM-5036:
-

Thanks to everyone for contributing to this. [~JozoVilcek], I've come to a 
similar conclusion overnight and think we need to do one of:
 # surface {{FileAlreadyExistsException}} as well as {{FileNotFoundException}} 
from {{FileSystem.rename()}} and let the caller decide (here I presume we would 
opt to overwrite by deleting the target only if the source still exists, and 
then retrying)
 # document and implement that {{FileSystem.rename()}} will always replace 
existing files for all filesystems
 # expose a {{forceOverwrite}} flag / option and use it here (see the sketch 
after this list)
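
Option 3 might look like this at the call site (purely hypothetical: 
{{StandardMoveOptions}} currently defines only {{IGNORE_MISSING_FILES}}, so 
{{OVERWRITE_EXISTING_FILES}} below is an assumed new value, not an existing one):
{code:java}
FileSystems.rename(
    srcFiles,
    destFiles,
    StandardMoveOptions.IGNORE_MISSING_FILES,
    StandardMoveOptions.OVERWRITE_EXISTING_FILES); // hypothetical new option
{code}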

I propose we open a separate issue to explore optimising rename for Gcs. 
I had simply overlooked the rewrite option (sorry, I am not all that familiar 
with Gcs).

I still have some concern about rewriting output files that already exist 
though. Isn't it the case that if "run 1" produced 45 Avro file parts but for 
some reason "run 2" split differently and produced 43 file parts, anything 
using a glob on the directory would get incorrect data (i.e. the 2 leftover 
parts from run 1)? This would be relevant for bounded pipelines, but possibly 
even for a restart / recovery of a streaming scenario?

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596831#comment-16596831
 ] 

Tim Robertson commented on BEAM-5036:
-

[~reuvenlax], I believe those scenarios are covered as you write. I closed the 
PR today because I uncovered one other scenario by running a simple 
avro-to-avro file conversion job twice.

What should we do for filesystems that support an atomic rename() when the 
destination files exist before the job starts? Beam 2.6.0 will simply overwrite 
for all filesystems. If we change to use rename() then GCS and S3 will overwrite, 
while HDFS would either throw an exception (if we merge the [PR for 
5036|https://github.com/apache/beam/pull/6285]) or do nothing (which is clearly 
wrong) if we don't.

The Beam FileSystem [rename() 
documentation|https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java#L108]
 doesn't cover this scenario and I don't find it mentioned anywhere in the 
FileBasedSink docs either.

MapReduce and Spark jobs (on HDFS at least) fail for this scenario at start-up, 
presumably to prevent inadvertently overwriting existing data. I think that is 
the better behaviour, but it is not what Beam has done to date.
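
For illustration, a fail-fast guard in that spirit could be built on the 
existing {{FileSystems.matchResources()}} (a sketch only; whether and where to 
hook it in is exactly the open question):
{code:java}
// Hypothetical pre-flight check: refuse to start if any destination exists,
// mirroring the MapReduce/Spark "output directory already exists" behaviour.
List<MatchResult> results = FileSystems.matchResources(destResourceIds);
for (int i = 0; i < results.size(); ++i) {
  if (results.get(i).status() != MatchResult.Status.NOT_FOUND) {
    throw new IOException("Output already exists: " + destResourceIds.get(i));
  }
}
{code}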

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596407#comment-16596407
 ] 

Tim Robertson commented on BEAM-5036:
-

Yes, [~sinisa_lyh], it does.

I have observed big performance gains: rewriting 1.5TB of Avro files 
({{AvroIO.write()}}) using Beam on Spark with HDFS dropped from 1.7 hrs to 42 
minutes on one of my clusters.
 It will only impact HDFS and LocalFileSystem though, as S3 and Gcs do not have 
the ability to do an optimised rename.

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596353#comment-16596353
 ] 

Tim Robertson commented on BEAM-5036:
-

Thanks [~echauchot]
Gcs and S3 have no notion of a rename; they are copy (overwrite) and delete (see 
links in the comment above).

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596353#comment-16596353
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/29/18 1:51 PM:
--

Thanks [~echauchot]
Gcs and S3 have no notion of a rename and are implemented as copy (overwrite) 
and delete (see links in comment above).


was (Author: timrobertson100):
Thanks [~echauchot]
Gcs and S3 have no notion of a rename they are copy (overwrite) and delete (see 
links in comment above).

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596233#comment-16596233
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/29/18 12:20 PM:
---

The changes (yet to be merged) to rename() in BEAM-4861 now create directories 
if missing, but also surface an exception if the underlying operation reports 
that it did not complete.

This means it will fail with an exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource 
hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d
 to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information 
provided by underlying filesystem.
at 
org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
at 
org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
at 
org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()? I am used to 
Hadoop operations failing if the output already exists, so for me it is correct 
to fail if the output exists - I'd rather be forced to delete manually than 
accidentally overwrite TBs of data.


was (Author: timrobertson100):
The changes (yet to be merged) to rename() in BEAM-4861 now create directories 
if missing, but also surface an exception if the underlying operation reports 
that it did not complete.

This means it will fail with an exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource 
hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d
 to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information 
provided by underlying filesystem.
at 
org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
at 
org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
at 
org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()? I am used to 
Hadoop operations failing if the output already exists, so for me it sounds 
wrong - I'd rather be forced to delete manually than accidentally overwrite 
TBs of data.

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596233#comment-16596233
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/29/18 12:03 PM:
---

The changes (yet to be merged) to rename() in BEAM-4861 now create directories 
if missing, but also surface an exception if the underlying operation reports 
that it did not complete.

This means it will fail with an exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource 
hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d
 to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information 
provided by underlying filesystem.
at 
org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
at 
org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
at 
org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()? I am used to 
Hadoop operations failing if the output already exists, so for me it sounds 
wrong - I'd rather be forced to delete manually than accidentally overwrite 
TBs of data.


was (Author: timrobertson100):
The changes (yet to be merged) to rename() in BEAM-4861 now create directories 
if missing, but also surface an exception if the underlying operation reports 
that it did not complete.

This means it will fail with an exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource 
hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d
 to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information 
provided by underlying filesystem.
at 
org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
at 
org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
at 
org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()?

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596233#comment-16596233
 ] 

Tim Robertson commented on BEAM-5036:
-

The changes (yet to be merged) to rename() in BEAM-4861 now create directories 
if missing, but also surface an exception if the underlying operation reports 
that it did not complete.

This means it will fail with an exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource 
hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d
 to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information 
provided by underlying filesystem.
at 
org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
at 
org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
at 
org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()?

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596213#comment-16596213
 ] 

Tim Robertson commented on BEAM-5036:
-

I think the cross-FS check is actually already in place here, [~reuvenlax]:

[https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L307]

In the logs you get the following from rename() if the schemes are different:

{{Caused by: java.lang.IllegalArgumentException: Expect srcResourceIds and 
destResourceIds have the same scheme, but received file, hdfs.}}
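
For reference, the guard boils down to a scheme comparison along these lines (a 
paraphrase of the linked check, using Guava's {{Preconditions.checkArgument}}; 
not a verbatim copy):
{code:java}
// Collect the schemes of all source and destination resources and
// require exactly one, so cross-filesystem renames fail fast.
Set<String> schemes =
    Stream.concat(srcResourceIds.stream(), destResourceIds.stream())
        .map(ResourceId::getScheme)
        .collect(Collectors.toSet());
checkArgument(
    schemes.size() == 1,
    "Expect srcResourceIds and destResourceIds have the same scheme, but received %s.",
    schemes);
{code}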

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-28 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595337#comment-16595337
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/28/18 5:58 PM:
--

For info on the other rename() methods:
 * {{S3FileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
 * {{GcsFileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
 * {{LocalFileSystem}} implements {{rename()}} by [making the parent directory 
if 
necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164]
 and then does a file move
 * {{HDFSFileSystem}} following BEAM-4861 (fixed and ready to merge) now 
implements {{rename()}} by creating missing parent directories and doing the 
move

The move across different filesystems is not (fully) supported because the 
{{FileSystems.rename}} gets only the [filesystem for the source 
resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325].
 It is not clear to me what might happen if the source were an 
{{HDFSFilesystem}} which itself can span multiple Filesystems. It is also not 
currently clear to me where we can best do the check - we could simply log a 
warn before the call to rename().


was (Author: timrobertson100):
For info on the other rename() methods:
 * {{S3FileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
 * {{GcsFileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
 * {{LocalFileSystem}} implements {{rename()}} by [making the parent directory 
if 
necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164]
 and then does a file move
 * {{HDFSFileSystem}} following BEAM-4861 (fixed and ready to merge) now 
implements {{rename()}} by creating missing parent directories and doing the 
move

The move across different filesystems is not (fully) supported because the 
{{FileSystems.rename}} gets only the [filesystem for the source 
resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325].
 It is not clear to me what might happen if the source were an 
{{HDFSFilesystem}} which itself can span multiple Filesystems. It is also not 
currently clear to me where we can best do the check.

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-28 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595337#comment-16595337
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/28/18 5:26 PM:
--

For info on the other rename() methods:
 * {{S3FileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
 * {{GcsFileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
 * {{LocalFileSystem}} implements {{rename()}} by [making the parent directory 
if 
necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164]
 and then does a file move
 * {{HDFSFileSystem}} following BEAM-4861 (fixed and ready to merge) now 
implements {{rename()}} by creating missing parent directories and doing the 
move

The move across different filesystems is not (fully) supported because the 
{{FileSystems.rename}} gets only the [filesystem for the source 
resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325].
 It is not clear to me what might happen if the source were an 
{{HDFSFilesystem}} which itself can span multiple Filesystems. It is also not 
currently clear to me where we can best do the check.


was (Author: timrobertson100):
For info on the other FileSystem rename():
 * {{S3FileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
 * {{GcsFileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
 * {{LocalFileSystem}} implements {{rename()}} by [making the parent directory 
if 
necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164]
 and then does a file move
 * {{HDFSFileSystem}} following BEAM-4861 (fixed and ready to merge) now 
implements {{rename()}} by creating missing parent directories and doing the 
move

The move across different filesystems is not (fully) supported because the 
{{FileSystems.rename}} gets only the [filesystem for the source 
resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325].
 It is not clear to me what might happen if the source were an 
{{HDFSFilesystem}} which itself can span multiple Filesystems. It is also not 
currently clear to me where we can best do the check.

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-28 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595337#comment-16595337
 ] 

Tim Robertson commented on BEAM-5036:
-

For info on the other FileSystem rename():
 * {{S3FileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
 * {{GcsFileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
 * {{LocalFileSystem}} implements {{rename()}} by [making the parent directory 
if 
necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164]
 and then does a file move
 * {{HDFSFileSystem}} following BEAM-4861 (fixed and ready to merge) now 
implements {{rename()}} by creating missing parent directories and doing the 
move

The move across different filesystems is not (fully) supported because the 
{{FileSystems.rename}} gets only the [filesystem for the source 
resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325].
 It is not clear to me what might happen if the source were an 
{{HDFSFilesystem}} which itself can span multiple Filesystems. It is also not 
currently clear to me where we can best do the check.

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4952) Beam Dependency Update Request: org.apache.hbase:hbase-hadoop-compat 2.1.0

2018-08-28 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594936#comment-16594936
 ] 

Tim Robertson commented on BEAM-4952:
-

Thanks [~chamikara]

Specifically on this one - I'm reluctant to bump HBase while the discussion on 
dev@ is underway. I'm concerned that Beam could alienate Hadoop users if we 
don't support the versions in use by Amazon EMR, Cloudera, Hortonworks, etc.

Do you use HBase 2.1 yourself, please?

> Beam Dependency Update Request: org.apache.hbase:hbase-hadoop-compat 2.1.0
> --
>
> Key: BEAM-4952
> URL: https://issues.apache.org/jira/browse/BEAM-4952
> Project: Beam
>  Issue Type: Sub-task
>  Components: dependencies
>Reporter: Beam JIRA Bot
>Assignee: Tim Robertson
>Priority: Major
>
> 2018-07-25 20:28:24.987897
> Please review and upgrade the org.apache.hbase:hbase-hadoop-compat to 
> the latest version 2.1.0 
>  
> cc: 
> 2018-08-06 12:11:58.406173
> Please review and upgrade the org.apache.hbase:hbase-hadoop-compat to 
> the latest version 2.1.0 
>  
> cc: 
> 2018-08-13 12:13:31.045787
> Please review and upgrade the org.apache.hbase:hbase-hadoop-compat to 
> the latest version 2.1.0 
>  
> cc: 
> 2018-08-20 12:14:04.735400
> Please review and upgrade the org.apache.hbase:hbase-hadoop-compat to 
> the latest version 2.1.0 
>  
> cc: 
> 2018-08-27 12:15:07.483727
> Please review and upgrade the org.apache.hbase:hbase-hadoop-compat to 
> the latest version 2.1.0 
>  
> cc: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-28 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-4861:
---

Assignee: Tim Robertson  (was: Chamikara Jayalath)

> Hadoop Filesystem silently fails
> 
>
> Key: BEAM-4861
> URL: https://issues.apache.org/jira/browse/BEAM-4861
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi,
> Beam Filesystem operations copy, rename and delete are void in the SDK. Hadoop 
> native filesystem operations are not and return a boolean. The current 
> implementation in Beam ignores the result and passes as long as no exception 
> is thrown.
> I got burned by this when using 'rename' to do a 'move' operation on HDFS. If 
> the target directory does not exist, the operation returns false and does not 
> touch the file.
> [https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java#L148]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-28 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594794#comment-16594794
 ] 

Tim Robertson edited comment on BEAM-4861 at 8/28/18 10:33 AM:
---

On further inspection I think {{delete}} is correct to swallow a {{false}} 
response, [~JozoVilcek]:
 * A {{delete}}, for example, will return {{false}} when you try to delete a 
non-existing file, which seems reasonable to swallow. It will throw an exception 
for the scenarios that matter.

The {{copy}} seems indifferent, so we might as well throw an exception to be 
cautious:
 * The {{copy}} returns false only if there is an issue with {{mkdirs}}, and the 
HDFS docs [1] state that it always returns true.

For {{rename()}} we can create the directory if it does not exist and then 
throw an exception on any response that is false.

 

[1] 
[https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d]
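
Put concretely, the {{rename()}} loop body could become something like this (a 
minimal sketch against {{org.apache.hadoop.fs.FileSystem}}, assuming the 
existing {{fileSystem}} field):
{code:java}
Path src = srcResourceIds.get(i).toPath();
Path dest = destResourceIds.get(i).toPath();
// Create a missing parent directory rather than letting rename()
// return false silently.
if (!fileSystem.exists(dest.getParent())) {
  fileSystem.mkdirs(dest.getParent());
}
// Treat any remaining false as an error instead of swallowing it.
if (!fileSystem.rename(src, dest)) {
  throw new IOException(
      String.format("Unable to rename resource %s to %s.", src, dest));
}
{code}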
 

 


was (Author: timrobertson100):
On further inspection I think {{delete}} and {{copy}} are correct to swallow a 
{{false}} response, [~JozoVilcek]:
 * A {{delete}}, for example, will return {{false}} when you try to delete a 
non-existing file, which seems reasonable to swallow. It will throw an exception 
for the scenarios that matter.
 * The {{copy}} returns false only if there is an issue with {{mkdirs}}, and the 
HDFS docs [1] state that it always returns true even if the directory is not 
created. I think we can ignore the local filesystem implementation.

For {{rename()}} we can create the directory if it does not exist and then 
throw an exception on any response that is false.

 

[1] 
[https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d]
 

 

> Hadoop Filesystem silently fails
> 
>
> Key: BEAM-4861
> URL: https://issues.apache.org/jira/browse/BEAM-4861
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Reporter: Jozef Vilcek
>Assignee: Chamikara Jayalath
>Priority: Major
>
> Hi,
> Beam Filesystem operations copy, rename and delete are void in the SDK. Hadoop 
> native filesystem operations are not and return a boolean. The current 
> implementation in Beam ignores the result and passes as long as no exception 
> is thrown.
> I got burned by this when using 'rename' to do a 'move' operation on HDFS. If 
> the target directory does not exist, the operation returns false and does not 
> touch the file.
> [https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java#L148]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-28 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594794#comment-16594794
 ] 

Tim Robertson commented on BEAM-4861:
-

On further inspection I think {{delete}} and {{copy}} are correct to swallow a 
{{false}} response, [~JozoVilcek]:
 * A {{delete}}, for example, will return {{false}} when you try to delete a 
non-existing file, which seems reasonable to swallow. It will throw an exception 
for the scenarios that matter.
 * The {{copy}} returns false only if there is an issue with {{mkdirs}}, and the 
HDFS docs [1] state that it always returns true even if the directory is not 
created. I think we can ignore the local filesystem implementation.

For {{rename()}} we can create the directory if it does not exist and then 
throw an exception on any response that is false.

 

[1] 
[https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d]
 

 

> Hadoop Filesystem silently fails
> 
>
> Key: BEAM-4861
> URL: https://issues.apache.org/jira/browse/BEAM-4861
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Reporter: Jozef Vilcek
>Assignee: Chamikara Jayalath
>Priority: Major
>
> Hi,
> Beam Filesystem operations copy, rename and delete are void in the SDK. Hadoop 
> native filesystem operations are not and return a boolean. The current 
> implementation in Beam ignores the result and passes as long as no exception 
> is thrown.
> I got burned by this when using 'rename' to do a 'move' operation on HDFS. If 
> the target directory does not exist, the operation returns false and does not 
> touch the file.
> [https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java#L148]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-27 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593406#comment-16593406
 ] 

Tim Robertson edited comment on BEAM-4861 at 8/27/18 10:22 AM:
---

The {{HadoopFileSystem}} has the following methods:
{code:java}
  @Override
  protected void copy(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      // Unfortunately HDFS FileSystems don't support a native copy operation so we are forced
      // to use the inefficient implementation found in FileUtil which copies all the bytes
      // through the local machine.
      //
      // HDFS FileSystem does define a concat method but could only find the DFSFileSystem
      // implementing it. The DFSFileSystem implemented concat by deleting the srcs after which
      // is not what we want. Also, all the other FileSystem implementations I saw threw
      // UnsupportedOperationException within concat.
      FileUtil.copy(
          fileSystem,
          srcResourceIds.get(i).toPath(),
          fileSystem,
          destResourceIds.get(i).toPath(),
          false,
          true,
          fileSystem.getConf());
    }
  }

  @Override
  protected void rename(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      fileSystem.rename(srcResourceIds.get(i).toPath(), destResourceIds.get(i).toPath());
    }
  }

  @Override
  protected void delete(Collection<HadoopResourceId> resourceIds) throws IOException {
    for (HadoopResourceId resourceId : resourceIds) {
      fileSystem.delete(resourceId.toPath(), false);
    }
  }
{code}
{{FileUtil.copy}}, {{fileSystem.rename}} and {{fileSystem.delete}} can all 
return false indicating that the operation was not performed.

 

*1. Informing the user of the unsuccessful operation*

We could either:
 # Change the Beam {{FileSystem}} API to propagate this, although the [rules 
for HDFS 
rename()|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d]
 are not trivial to document and this might prove to be invasive in many places.
 # Throw an {{IOException}} to signal that the operation was not successful if 
the response is false.

I tend towards leaving the API returning void but throwing an exception on the 
first error within the loops - thoughts?

 

*2. rename() in HDFS*

What do we believe are the expectations for {{rename()}} on HDFS?

Currently the user is not informed that nothing happens when an attempt is made 
to rename a file into a non-existent directory. This is obviously bad.

We could change the behaviour to one of:
 # Throw an exception if the directory does not exist
 # Create the directory where necessary, letting existing files be overwritten 
(the equivalent of e.g. {{S3FileSystem}})
 # Verify that the directory does not exist, and only then create it and 
proceed, otherwise failing with an exception (the usual behaviour of a MapReduce 
/ Spark {{FileOutputFormat}} at job startup, where it quickly fails with 
"directory already exists").

Note that {{S3FileSystem}} and {{GcsFileSystem}} treat a rename as a {{copy()}} 
and {{delete()}} operation internally.

I tend towards creating the directory where necessary, allowing for overwriting 
- thoughts?

CC [~reuvenlax] as this relates to BEAM-5036 as well.


was (Author: timrobertson100):
The {{HadoopFileSystem}} has the following methods:
{code:java}
  @Override
  protected void copy(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      // Unfortunately HDFS FileSystems don't support a native copy operation so we are forced
      // to use the inefficient implementation found in FileUtil which copies all the bytes
      // through the local machine.
      //
      // HDFS FileSystem does define a concat method but could only find the DFSFileSystem
      // implementing it. The DFSFileSystem implemented concat by deleting the srcs after which
      // is not what we want. Also, all the other FileSystem implementations I saw threw
      // UnsupportedOperationException within concat.
      FileUtil.copy(
          fileSystem,
          srcResourceIds.get(i).toPath(),
          fileSystem,
          destResourceIds.get(i).toPath(),
          false,
          true,
          fileSystem.getConf());
    }
  }

  @Override
  protected void rename(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      fileSystem.rename(srcResourceIds.get(i).toPath(), destResourceIds.get(i).toPath());
    }
  }

  @Override
  protected void delete(Collection<HadoopResourceId> resourceIds) throws IOException {
    for (HadoopResourceId resourceId : resourceIds) {

[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-27 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593439#comment-16593439
 ] 

Tim Robertson commented on BEAM-5036:
-

Thanks [~reuvenlax]

1. Adding a cross FS check seems reasonable as a precaution.

2. Please see this comment on BEAM-4861 where we have a decision to make on the 
HDFS parent directory not existing. I'd appreciate your and [~JozoVilcek]'s 
thoughts on that (and others).

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement a move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-27 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593439#comment-16593439
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/27/18 10:09 AM:
---

Thanks [~reuvenlax]

1. Adding a cross FS check seems reasonable as a precaution.
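
For the record, such a guard might look like this sketch (hypothetical - the 
check could equally live in {{FileSystems.validateSrcDestLists}}):
{code:java}
// precaution: refuse to rename across filesystem schemes
String srcScheme = srcResourceIds.get(i).getScheme();
String destScheme = destResourceIds.get(i).getScheme();
if (!srcScheme.equals(destScheme)) {
  throw new IllegalArgumentException(
      String.format(
          "Cannot rename %s to %s: schemes %s and %s differ",
          srcResourceIds.get(i), destResourceIds.get(i), srcScheme, destScheme));
}
{code}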

2. Please see [this 
comment|https://issues.apache.org/jira/browse/BEAM-4861?focusedCommentId=16593406=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593406]
 on BEAM-4861 where we have a decision to make on the HDFS parent directory not 
existing. Appreciate your and [~JozoVilcek] thoughts on that (and others).


was (Author: timrobertson100):
Thanks [~reuvenlax]

1. Adding a cross FS check seems reasonable as a precaution.

2. Please see this comment on BEAM-4861 where we have a decision to make on the 
HDFS parent directory not existing. Appreciate your and [~JozoVilcek] thoughts 
on that (and others).

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>
> moveToOutput() methods in FileBasedSink.WriteOperation implements move by 
> copy+delete. It would be better to use a rename() which can be much more 
> effective for some filesystems.
> Filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-27 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593406#comment-16593406
 ] 

Tim Robertson edited comment on BEAM-4861 at 8/27/18 10:04 AM:
---

The {{HadoopFileSystem}} has the following methods:
{code:java}
  @Override
  protected void copy(List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      // Unfortunately HDFS FileSystems don't support a native copy operation so we are forced
      // to use the inefficient implementation found in FileUtil which copies all the bytes through
      // the local machine.
      //
      // HDFS FileSystem does define a concat method but could only find the DFSFileSystem
      // implementing it. The DFSFileSystem implemented concat by deleting the srcs after which
      // is not what we want. Also, all the other FileSystem implementations I saw threw
      // UnsupportedOperationException within concat.
      FileUtil.copy(
          fileSystem,
          srcResourceIds.get(i).toPath(),
          fileSystem,
          destResourceIds.get(i).toPath(),
          false,
          true,
          fileSystem.getConf());
    }
  }

  @Override
  protected void rename(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      fileSystem.rename(srcResourceIds.get(i).toPath(), destResourceIds.get(i).toPath());
    }
  }

  @Override
  protected void delete(Collection<HadoopResourceId> resourceIds) throws IOException {
    for (HadoopResourceId resourceId : resourceIds) {
      fileSystem.delete(resourceId.toPath(), false);
    }
  }
{code}
{{FileUtil.copy}}, {{fileSystem.rename}} and {{fileSystem.delete}} can all 
return false indicating that the operation was not performed.

 

*1. Informing the user of the unsuccessful operation*

We could either:
 # Change the Beam {{FileSystem}} API to propagate this, although the [rules 
for HDFS 
rename()|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d]
 are not trivial to document and this might prove to be invasive in many places.
 # Throw an {{IOException}} to signal that the operation was not successful if 
the response is false?

I tend towards leaving the API returning void, but throwing an exception on 
the first error within the loops - thoughts?

 

*2. rename() in HDFS*

What do we believe are the expectations for {{rename()}} on HDFS?

Currently the user is not informed that nothing happens when an attempt is 
made to rename a file into a non-existent directory. This is obviously bad.

We could change behaviour to one of:
 # Throw an exception if the directory does not exist
 # Create the directory where necessary, letting existing files be overwritten 
(equivalent to e.g. {{S3FileSystem}})
 # Verify that the directory does not exist, and only then create it and 
proceed, otherwise alerting with an Exception (the usual behaviour of a 
{{MapReduce FileOutputFormat}} at job startup, where it quickly fails with 
"directory already exists").

Note that {{S3FileSystem}} and {{GcsFileSystem}} treat a rename as a {{copy()}} 
and {{delete()}} operation internally.

I tend towards creating the directory where necessary and allowing 
overwriting - thoughts?

CC [~reuvenlax] as relates to BEAM-5036 as well.


was (Author: timrobertson100):
The {{HadoopFileSystem}} has the following methods:
{code:java}
  @Override
  protected void copy(List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      // Unfortunately HDFS FileSystems don't support a native copy operation so we are forced
      // to use the inefficient implementation found in FileUtil which copies all the bytes through
      // the local machine.
      //
      // HDFS FileSystem does define a concat method but could only find the DFSFileSystem
      // implementing it. The DFSFileSystem implemented concat by deleting the srcs after which
      // is not what we want. Also, all the other FileSystem implementations I saw threw
      // UnsupportedOperationException within concat.
      FileUtil.copy(
          fileSystem,
          srcResourceIds.get(i).toPath(),
          fileSystem,
          destResourceIds.get(i).toPath(),
          false,
          true,
          fileSystem.getConf());
    }
  }

  @Override
  protected void rename(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      fileSystem.rename(srcResourceIds.get(i).toPath(), destResourceIds.get(i).toPath());
    }
  }

  @Override
  protected void delete(Collection<HadoopResourceId> resourceIds) throws IOException {
    for (HadoopResourceId resourceId : resourceIds) {
      fileSystem.delete(resourceId.toPath(), false);
    }
  }
{code}

[jira] [Comment Edited] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-27 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593406#comment-16593406
 ] 

Tim Robertson edited comment on BEAM-4861 at 8/27/18 10:03 AM:
---

The {{HadoopFileSystem}} has the following methods:
{code:java}
  @Override
  protected void copy(List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      // Unfortunately HDFS FileSystems don't support a native copy operation so we are forced
      // to use the inefficient implementation found in FileUtil which copies all the bytes through
      // the local machine.
      //
      // HDFS FileSystem does define a concat method but could only find the DFSFileSystem
      // implementing it. The DFSFileSystem implemented concat by deleting the srcs after which
      // is not what we want. Also, all the other FileSystem implementations I saw threw
      // UnsupportedOperationException within concat.
      FileUtil.copy(
          fileSystem,
          srcResourceIds.get(i).toPath(),
          fileSystem,
          destResourceIds.get(i).toPath(),
          false,
          true,
          fileSystem.getConf());
    }
  }

  @Override
  protected void rename(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      fileSystem.rename(srcResourceIds.get(i).toPath(), destResourceIds.get(i).toPath());
    }
  }

  @Override
  protected void delete(Collection<HadoopResourceId> resourceIds) throws IOException {
    for (HadoopResourceId resourceId : resourceIds) {
      fileSystem.delete(resourceId.toPath(), false);
    }
  }
{code}
{{FileUtil.copy}}, {{fileSystem.rename}} and {{fileSystem.delete}} can all 
return false indicating that the operation was not performed.

 

*1. Informing the user of the unsuccessful operation*

We could either:
 # Change the Beam {{FileSystem}} API to propagate this, although the [rules 
for HDFS 
rename()|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d]
 are not trivial to document and this might prove to be invasive in many places.
 # Throw an {{IOException}} to signal that the operation was not successful if 
the response is false?

I tend towards leaving the API returning void, but throwing an exception on 
the first error within the loops - thoughts?

 

*2. rename() in HDFS*

What do we believe are the expectations for {{rename()}} on HDFS?

Currently the user is not informed that nothing happens when an attempt is 
made to rename a file into a non-existent directory. This is obviously bad.

We could change behaviour to one of:
 # Throw an exception if the directory does not exist
 # Create the directory where necessary, letting files be overwritten if it 
does exist (equivalent to e.g. {{S3FileSystem}})
 # Verify that the directory does not exist, and only then create it and 
proceed, otherwise alerting with an Exception (the usual behaviour of a 
{{MapReduce FileOutputFormat}} at job startup, where it quickly fails with 
"directory already exists").

Note that {{S3FileSystem}} and {{GcsFileSystem}} treat a rename as a {{copy()}} 
and {{delete()}} operation internally.

I tend towards creating the directory where necessary and allowing 
overwriting - thoughts?

CC [~reuvenlax] as relates to BEAM-5036 as well.


was (Author: timrobertson100):
The {{HadoopFileSystem}} has the following methods:
{code:java}
  @Override
  protected void copy(List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      // Unfortunately HDFS FileSystems don't support a native copy operation so we are forced
      // to use the inefficient implementation found in FileUtil which copies all the bytes through
      // the local machine.
      //
      // HDFS FileSystem does define a concat method but could only find the DFSFileSystem
      // implementing it. The DFSFileSystem implemented concat by deleting the srcs after which
      // is not what we want. Also, all the other FileSystem implementations I saw threw
      // UnsupportedOperationException within concat.
      FileUtil.copy(
          fileSystem,
          srcResourceIds.get(i).toPath(),
          fileSystem,
          destResourceIds.get(i).toPath(),
          false,
          true,
          fileSystem.getConf());
    }
  }

  @Override
  protected void rename(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      fileSystem.rename(srcResourceIds.get(i).toPath(), destResourceIds.get(i).toPath());
    }
  }

  @Override
  protected void delete(Collection<HadoopResourceId> resourceIds) throws IOException {
    for (HadoopResourceId resourceId : resourceIds) {
      fileSystem.delete(resourceId.toPath(), false);
    }
  }
{code}

[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-27 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593406#comment-16593406
 ] 

Tim Robertson commented on BEAM-4861:
-

The {{HadoopFileSystem}} has the following methods:
{code:java}
  @Override
  protected void copy(List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      // Unfortunately HDFS FileSystems don't support a native copy operation so we are forced
      // to use the inefficient implementation found in FileUtil which copies all the bytes through
      // the local machine.
      //
      // HDFS FileSystem does define a concat method but could only find the DFSFileSystem
      // implementing it. The DFSFileSystem implemented concat by deleting the srcs after which
      // is not what we want. Also, all the other FileSystem implementations I saw threw
      // UnsupportedOperationException within concat.
      FileUtil.copy(
          fileSystem,
          srcResourceIds.get(i).toPath(),
          fileSystem,
          destResourceIds.get(i).toPath(),
          false,
          true,
          fileSystem.getConf());
    }
  }

  @Override
  protected void rename(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      fileSystem.rename(srcResourceIds.get(i).toPath(), destResourceIds.get(i).toPath());
    }
  }

  @Override
  protected void delete(Collection<HadoopResourceId> resourceIds) throws IOException {
    for (HadoopResourceId resourceId : resourceIds) {
      fileSystem.delete(resourceId.toPath(), false);
    }
  }
{code}
{{FileUtil.copy}}, {{fileSystem.rename}} and {{fileSystem.delete}} can all 
return false indicating that the operation was not performed.

 

*1. Informing the user of the unsuccessful operation*

We could either:
 # Change the Beam {{FileSystem}} API to propagate this, although the [rules 
for HDFS 
rename()|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d]
 are not trivial to document and this might prove to be invasive in many places.
 # Throw an {{IOException}} to signal that the operation was not successful if 
the response is false?

I tend towards leaving the API returning void, but throwing an exception on 
the first error within the loops - thoughts?

 

*2. rename() in HDFS*

What do we believe are the expectations for {{rename()}} on HDFS?

Currently the user is not informed if an attempt is made to rename a file into 
a non-existent directory. This is obviously bad.

We could change behaviour to one of:
 # Throw an exception if the directory does not exist
 # Create the directory where necessary, letting files be overwritten if it 
does exist (equivalent to e.g. {{S3FileSystem}})
 # Verify that the directory does not exist, and only then create it and 
proceed, otherwise alerting with an Exception (the usual behaviour of a 
{{MapReduce FileOutputFormat}} at job startup, where it quickly fails with 
"directory already exists").

Note that {{S3FileSystem}} and {{GcsFileSystem}} treat a rename as a {{copy()}} 
and {{delete()}} operation internally.

I tend towards creating the directory where necessary and allowing 
overwriting - thoughts?

CC [~reuvenlax] as relates to BEAM-5036 as well.

> Hadoop Filesystem silently fails
> 
>
> Key: BEAM-4861
> URL: https://issues.apache.org/jira/browse/BEAM-4861
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Reporter: Jozef Vilcek
>Assignee: Chamikara Jayalath
>Priority: Major
>
> Hi,
> beam Filesystem operations copy, rename and delete are void in SDK. Hadoop 
> native filesystem operations are not and returns void. Current implementation 
> in Beam ignores the result and pass as long as exception is not thrown.
> I got burned by this when using 'rename' to do a 'move' operation on HDFS. If 
> target directory does not exists, operations returns false and do not touch 
> the file.
> [https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java#L148]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-27 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593406#comment-16593406
 ] 

Tim Robertson edited comment on BEAM-4861 at 8/27/18 9:44 AM:
--

The {{HadoopFileSystem}} has the following methods:
{code:java}
  @Override
  protected void copy(List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      // Unfortunately HDFS FileSystems don't support a native copy operation so we are forced
      // to use the inefficient implementation found in FileUtil which copies all the bytes through
      // the local machine.
      //
      // HDFS FileSystem does define a concat method but could only find the DFSFileSystem
      // implementing it. The DFSFileSystem implemented concat by deleting the srcs after which
      // is not what we want. Also, all the other FileSystem implementations I saw threw
      // UnsupportedOperationException within concat.
      FileUtil.copy(
          fileSystem,
          srcResourceIds.get(i).toPath(),
          fileSystem,
          destResourceIds.get(i).toPath(),
          false,
          true,
          fileSystem.getConf());
    }
  }

  @Override
  protected void rename(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      fileSystem.rename(srcResourceIds.get(i).toPath(), destResourceIds.get(i).toPath());
    }
  }

  @Override
  protected void delete(Collection<HadoopResourceId> resourceIds) throws IOException {
    for (HadoopResourceId resourceId : resourceIds) {
      fileSystem.delete(resourceId.toPath(), false);
    }
  }
{code}
{{FileUtil.copy}}, {{fileSystem.rename}} and {{fileSystem.delete}} can all 
return false indicating that the operation was not performed.

 

*1. Informing the user of the unsuccessful operation*

We could either:
 # Change the Beam {{FileSystem}} API to propagate this, although the [rules 
for HDFS 
rename()|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d]
 are not trivial to document and this might prove to be invasive in many places.
 # Throw an {{IOException}} to signal that the operation was not successful if 
the response is false?

I tend towards leaving the API returning void, but throwing an exception on 
the first error within the loops - thoughts?

 

*2. rename() in HDFS*

What do we believe are the expectations for {{rename()}} on HDFS?

Currently the user is not informed if an attempt is made to rename a file into 
a non-existent directory. This is obviously bad.

We could change behaviour to one of:
 # Throw an exception if the directory does not exist
 # Create the directory where necessary, letting files be overwritten if it 
does exist (equivalent to e.g. {{S3FileSystem}})
 # Verify that the directory does not exist, and only then create it and 
proceed, otherwise alerting with an Exception (the usual behaviour of a 
{{MapReduce FileOutputFormat}} at job startup, where it quickly fails with 
"directory already exists").

Note that {{S3FileSystem}} and {{GcsFileSystem}} treat a rename as a {{copy()}} 
and {{delete()}} operation internally.

I tend towards creating the directory where necessary and allowing 
overwriting - thoughts?

CC [~reuvenlax] as relates to BEAM-5036 as well.


was (Author: timrobertson100):
The {{HadoopFileSystem}} has the following methods:
{code:java}
  @Override
  protected void copy(List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      // Unfortunately HDFS FileSystems don't support a native copy operation so we are forced
      // to use the inefficient implementation found in FileUtil which copies all the bytes through
      // the local machine.
      //
      // HDFS FileSystem does define a concat method but could only find the DFSFileSystem
      // implementing it. The DFSFileSystem implemented concat by deleting the srcs after which
      // is not what we want. Also, all the other FileSystem implementations I saw threw
      // UnsupportedOperationException within concat.
      FileUtil.copy(
          fileSystem,
          srcResourceIds.get(i).toPath(),
          fileSystem,
          destResourceIds.get(i).toPath(),
          false,
          true,
          fileSystem.getConf());
    }
  }

  @Override
  protected void rename(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      fileSystem.rename(srcResourceIds.get(i).toPath(), destResourceIds.get(i).toPath());
    }
  }

  @Override
  protected void delete(Collection<HadoopResourceId> resourceIds) throws IOException {
    for (HadoopResourceId resourceId : resourceIds) {
      fileSystem.delete(resourceId.toPath(), false);
    }
  }
{code}

[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-23 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590032#comment-16590032
 ] 

Tim Robertson commented on BEAM-4861:
-

[~chamikara]  - may I take this issue please as I am looking at BEAM-5036 which 
is related?

> Hadoop Filesystem silently fails
> 
>
> Key: BEAM-4861
> URL: https://issues.apache.org/jira/browse/BEAM-4861
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Reporter: Jozef Vilcek
>Assignee: Chamikara Jayalath
>Priority: Major
>
> Hi,
> beam Filesystem operations copy, rename and delete are void in SDK. Hadoop 
> native filesystem operations are not and returns void. Current implementation 
> in Beam ignores the result and pass as long as exception is not thrown.
> I got burned by this when using 'rename' to do a 'move' operation on HDFS. If 
> target directory does not exists, operations returns false and do not touch 
> the file.
> [https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java#L148]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-23 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-5036:
---

Assignee: Tim Robertson  (was: Eugene Kirpichov)

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>
> moveToOutput() methods in FileBasedSink.WriteOperation implements move by 
> copy+delete. It would be better to use a rename() which can be much more 
> effective for some filesystems.
> Filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the case of HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3654) Port FilterExamplesTest off DoFnTester

2018-08-22 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588659#comment-16588659
 ] 

Tim Robertson edited comment on BEAM-3654 at 8/22/18 1:51 PM:
--

My apologies for the Kudu error.

-I have lodged and will fix https://issues.apache.org/jira/browse/BEAM-5193 now 
which was simply a mistake.-

Now fixed on master

 

 


was (Author: timrobertson100):
My apologies for the Kudu error.

I have lodged and will fix https://issues.apache.org/jira/browse/BEAM-5193 now 
which was simply a mistake.

> Port FilterExamplesTest off DoFnTester
> --
>
> Key: BEAM-3654
> URL: https://issues.apache.org/jira/browse/BEAM-3654
> Project: Beam
>  Issue Type: Sub-task
>  Components: examples-java
>Reporter: Kenneth Knowles
>Assignee: Mikhail
>Priority: Major
>  Labels: beginner, newbie, starter
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-2277) IllegalArgumentException when using Hadoop file system for WordCount example.

2018-08-22 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588751#comment-16588751
 ] 

Tim Robertson commented on BEAM-2277:
-

Thank you again [~JozoVilcek]

> IllegalArgumentException when using Hadoop file system for WordCount example.
> -
>
> Key: BEAM-2277
> URL: https://issues.apache.org/jira/browse/BEAM-2277
> Project: Beam
>  Issue Type: Bug
>  Components: z-do-not-use-sdk-java-extensions
>Affects Versions: 2.6.0
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>Priority: Blocker
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> IllegalArgumentException when using Hadoop file system for WordCount example.
> Occurred when running WordCount example using Spark runner on a YARN cluster.
> Command-line arguments:
> {code:none}
> --runner=SparkRunner --inputFile=hdfs:///user/myuser/kinglear.txt 
> --output=hdfs:///user/myuser/wc/wc
> {code}
> Stack trace:
> {code:none}
> java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds 
> have the same scheme, but received file, hdfs.
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
>   at 
> org.apache.beam.sdk.io.FileSystems.validateSrcDestLists(FileSystems.java:394)
>   at org.apache.beam.sdk.io.FileSystems.copy(FileSystems.java:236)
>   at 
> org.apache.beam.sdk.io.FileBasedSink$WriteOperation.copyToOutputFiles(FileBasedSink.java:626)
>   at 
> org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalize(FileBasedSink.java:516)
>   at 
> org.apache.beam.sdk.io.WriteFiles$2.processElement(WriteFiles.java:592)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3654) Port FilterExamplesTest off DoFnTester

2018-08-22 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588659#comment-16588659
 ] 

Tim Robertson commented on BEAM-3654:
-

My apologies for the Kudu error.

I have lodged and will fix https://issues.apache.org/jira/browse/BEAM-5193 now 
which was simply a mistake.

> Port FilterExamplesTest off DoFnTester
> --
>
> Key: BEAM-3654
> URL: https://issues.apache.org/jira/browse/BEAM-3654
> Project: Beam
>  Issue Type: Sub-task
>  Components: examples-java
>Reporter: Kenneth Knowles
>Assignee: Mikhail
>Priority: Major
>  Labels: beginner, newbie, starter
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-5193) KuduIO testWrite not correctly verifying behaviour

2018-08-22 Thread Tim Robertson (JIRA)
Tim Robertson created BEAM-5193:
---

 Summary: KuduIO testWrite not correctly verifying behaviour
 Key: BEAM-5193
 URL: https://issues.apache.org/jira/browse/BEAM-5193
 Project: Beam
  Issue Type: Bug
  Components: io-ideas
Affects Versions: 2.6.0
Reporter: Tim Robertson
Assignee: Tim Robertson
 Fix For: 2.7.0


The {{testWrite}} in {{KuduIOTest}} has two typos which were committed in error.

The following code block

{code:java}
for (int i = 1; i <= targetParallelism + 1; i++) {
  expectedWriteLogs.verifyDebug(String.format(FakeWriter.LOG_OPEN_SESSION, i));
  expectedWriteLogs.verifyDebug(
      String.format(FakeWriter.LOG_WRITE, i)); // at least one per writer
  expectedWriteLogs.verifyDebug(String.format(FakeWriter.LOG_CLOSE_SESSION, i));
}
// verify all entries written
for (int n = 0; n > numberRecords; n++) {
  expectedWriteLogs.verifyDebug(
      String.format(FakeWriter.LOG_WRITE_VALUE, n)); // at least one per writer
}
{code}
 
Should have read:
{code:java}
for (int i = 1; i <= targetParallelism; i++) {
  expectedWriteLogs.verifyDebug(String.format(FakeWriter.LOG_OPEN_SESSION, i));
  expectedWriteLogs.verifyDebug(
      String.format(FakeWriter.LOG_WRITE, i)); // at least one per writer
  expectedWriteLogs.verifyDebug(String.format(FakeWriter.LOG_CLOSE_SESSION, i));
}
// verify all entries written
for (int n = 0; n < numberRecords; n++) {
  expectedWriteLogs.verifyDebug(
      String.format(FakeWriter.LOG_WRITE_VALUE, n)); // at least one per writer
}
{code}

This went unnoticed because 1) {{targetParallelism}} is a function of the 
cores available, and the test uses only 3 (which is the minimum in the 
{{DirectRunner}}), and 2) the incorrect {{>}} in the test simply meant the 
loop never ran.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-2277) IllegalArgumentException when using Hadoop file system for WordCount example.

2018-08-21 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587910#comment-16587910
 ] 

Tim Robertson commented on BEAM-2277:
-

Thank you for investigating [~JozoVilcek].

bq. Is anyone have this problem if it is using HDFS paths in form with vs 
without authority?
Yes, but I think you've deduced that already

bq. The question is, what would be an elegant fix to that.
I'll try and find time to explore, but I am afraid it won't be until later this 
week. It would be great if we can solve this for 2.7.0.
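
For reference, the two path forms in question (illustrative values only):
{code:none}
hdfs://ha-nn/user/myuser/out   # with authority (the HA nameservice)
hdfs:///user/myuser/out        # without authority - resolved against fs.defaultFS
{code}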

> IllegalArgumentException when using Hadoop file system for WordCount example.
> -
>
> Key: BEAM-2277
> URL: https://issues.apache.org/jira/browse/BEAM-2277
> Project: Beam
>  Issue Type: Bug
>  Components: z-do-not-use-sdk-java-extensions
>Affects Versions: 2.6.0
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>Priority: Blocker
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> IllegalArgumentException when using Hadoop file system for WordCount example.
> Occurred when running WordCount example using Spark runner on a YARN cluster.
> Command-line arguments:
> {code:none}
> --runner=SparkRunner --inputFile=hdfs:///user/myuser/kinglear.txt 
> --output=hdfs:///user/myuser/wc/wc
> {code}
> Stack trace:
> {code:none}
> java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds 
> have the same scheme, but received file, hdfs.
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
>   at 
> org.apache.beam.sdk.io.FileSystems.validateSrcDestLists(FileSystems.java:394)
>   at org.apache.beam.sdk.io.FileSystems.copy(FileSystems.java:236)
>   at 
> org.apache.beam.sdk.io.FileBasedSink$WriteOperation.copyToOutputFiles(FileBasedSink.java:626)
>   at 
> org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalize(FileBasedSink.java:516)
>   at 
> org.apache.beam.sdk.io.WriteFiles$2.processElement(WriteFiles.java:592)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (BEAM-2277) IllegalArgumentException when using Hadoop file system for WordCount example.

2018-08-20 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reopened BEAM-2277:
-

Reopening - 2 people on here and 1 on the user@ list have stumbled upon this

> IllegalArgumentException when using Hadoop file system for WordCount example.
> -
>
> Key: BEAM-2277
> URL: https://issues.apache.org/jira/browse/BEAM-2277
> Project: Beam
>  Issue Type: Bug
>  Components: z-do-not-use-sdk-java-extensions
>Affects Versions: 2.6.0
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>Priority: Blocker
> Fix For: 2.0.0
>
>
> IllegalArgumentException when using Hadoop file system for WordCount example.
> Occurred when running WordCount example using Spark runner on a YARN cluster.
> Command-line arguments:
> {code:none}
> --runner=SparkRunner --inputFile=hdfs:///user/myuser/kinglear.txt 
> --output=hdfs:///user/myuser/wc/wc
> {code}
> Stack trace:
> {code:none}
> java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds 
> have the same scheme, but received file, hdfs.
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
>   at 
> org.apache.beam.sdk.io.FileSystems.validateSrcDestLists(FileSystems.java:394)
>   at org.apache.beam.sdk.io.FileSystems.copy(FileSystems.java:236)
>   at 
> org.apache.beam.sdk.io.FileBasedSink$WriteOperation.copyToOutputFiles(FileBasedSink.java:626)
>   at 
> org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalize(FileBasedSink.java:516)
>   at 
> org.apache.beam.sdk.io.WriteFiles$2.processElement(WriteFiles.java:592)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-2277) IllegalArgumentException when using Hadoop file system for WordCount example.

2018-08-20 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-2277:

Affects Version/s: 2.6.0

> IllegalArgumentException when using Hadoop file system for WordCount example.
> -
>
> Key: BEAM-2277
> URL: https://issues.apache.org/jira/browse/BEAM-2277
> Project: Beam
>  Issue Type: Bug
>  Components: z-do-not-use-sdk-java-extensions
>Affects Versions: 2.6.0
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>Priority: Blocker
> Fix For: 2.0.0
>
>
> IllegalArgumentException when using Hadoop file system for WordCount example.
> Occurred when running WordCount example using Spark runner on a YARN cluster.
> Command-line arguments:
> {code:none}
> --runner=SparkRunner --inputFile=hdfs:///user/myuser/kinglear.txt 
> --output=hdfs:///user/myuser/wc/wc
> {code}
> Stack trace:
> {code:none}
> java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds 
> have the same scheme, but received file, hdfs.
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
>   at 
> org.apache.beam.sdk.io.FileSystems.validateSrcDestLists(FileSystems.java:394)
>   at org.apache.beam.sdk.io.FileSystems.copy(FileSystems.java:236)
>   at 
> org.apache.beam.sdk.io.FileBasedSink$WriteOperation.copyToOutputFiles(FileBasedSink.java:626)
>   at 
> org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalize(FileBasedSink.java:516)
>   at 
> org.apache.beam.sdk.io.WriteFiles$2.processElement(WriteFiles.java:592)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-2277) IllegalArgumentException when using Hadoop file system for WordCount example.

2018-08-20 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586042#comment-16586042
 ] 

Tim Robertson commented on BEAM-2277:
-

A workaround is to instruct the temporary files to also go on HDFS, such as:

{code:java}
transform.apply("Write",
    AvroIO.writeGenericRecords(schema)
        .to(FileSystems.matchNewResource(options.getTarget(), true))
        // BEAM-2277 workaround
        .withTempDirectory(
            FileSystems.matchNewResource("hdfs://ha-nn/tmp/beam-avro", true)));
{code}

> IllegalArgumentException when using Hadoop file system for WordCount example.
> -
>
> Key: BEAM-2277
> URL: https://issues.apache.org/jira/browse/BEAM-2277
> Project: Beam
>  Issue Type: Bug
>  Components: z-do-not-use-sdk-java-extensions
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>Priority: Blocker
> Fix For: 2.0.0
>
>
> IllegalArgumentException when using Hadoop file system for WordCount example.
> Occurred when running WordCount example using Spark runner on a YARN cluster.
> Command-line arguments:
> {code:none}
> --runner=SparkRunner --inputFile=hdfs:///user/myuser/kinglear.txt 
> --output=hdfs:///user/myuser/wc/wc
> {code}
> Stack trace:
> {code:none}
> java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds 
> have the same scheme, but received file, hdfs.
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
>   at 
> org.apache.beam.sdk.io.FileSystems.validateSrcDestLists(FileSystems.java:394)
>   at org.apache.beam.sdk.io.FileSystems.copy(FileSystems.java:236)
>   at 
> org.apache.beam.sdk.io.FileBasedSink$WriteOperation.copyToOutputFiles(FileBasedSink.java:626)
>   at 
> org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalize(FileBasedSink.java:516)
>   at 
> org.apache.beam.sdk.io.WriteFiles$2.processElement(WriteFiles.java:592)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-2277) IllegalArgumentException when using Hadoop file system for WordCount example.

2018-08-20 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585979#comment-16585979
 ] 

Tim Robertson commented on BEAM-2277:
-

Seeing this in 2.6.0 (not word count, but my own pipeline)

> IllegalArgumentException when using Hadoop file system for WordCount example.
> -
>
> Key: BEAM-2277
> URL: https://issues.apache.org/jira/browse/BEAM-2277
> Project: Beam
>  Issue Type: Bug
>  Components: z-do-not-use-sdk-java-extensions
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>Priority: Blocker
> Fix For: 2.0.0
>
>
> IllegalArgumentException when using Hadoop file system for WordCount example.
> Occurred when running WordCount example using Spark runner on a YARN cluster.
> Command-line arguments:
> {code:none}
> --runner=SparkRunner --inputFile=hdfs:///user/myuser/kinglear.txt 
> --output=hdfs:///user/myuser/wc/wc
> {code}
> Stack trace:
> {code:none}
> java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds 
> have the same scheme, but received file, hdfs.
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
>   at 
> org.apache.beam.sdk.io.FileSystems.validateSrcDestLists(FileSystems.java:394)
>   at org.apache.beam.sdk.io.FileSystems.copy(FileSystems.java:236)
>   at 
> org.apache.beam.sdk.io.FileBasedSink$WriteOperation.copyToOutputFiles(FileBasedSink.java:626)
>   at 
> org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalize(FileBasedSink.java:516)
>   at 
> org.apache.beam.sdk.io.WriteFiles$2.processElement(WriteFiles.java:592)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-5147) Expose document metadata in ElasticsearchIO read

2018-08-14 Thread Tim Robertson (JIRA)
Tim Robertson created BEAM-5147:
---

 Summary: Expose document metadata in ElasticsearchIO read
 Key: BEAM-5147
 URL: https://issues.apache.org/jira/browse/BEAM-5147
 Project: Beam
  Issue Type: Improvement
  Components: io-java-elasticsearch
Affects Versions: 2.7.0
Reporter: Tim Robertson
Assignee: Etienne Chauchot


The Beam ElasticsearchIO read does not give access to document metadata. A 
Beam user has requested that it be possible to expose the entire Elasticsearch 
document, including metadata, from the search result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5107) Support ES 6.x for ElasticsearchIO

2018-08-12 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577690#comment-16577690
 ] 

Tim Robertson commented on BEAM-5107:
-

Thank you for the pull request [~dattran.vn01].

If you could explore tests for the changes you provided in the first PR, that 
would be great. Beam favours embedded servers for unit tests, and ES also has 
a suite of integration tests.

I am not sure what will happen when you put an ES6 dependency on the classpath 
for a unit test, though. It could be that you simply can't do that, and if so 
I'd suggest 1) doing an integration test copying the existing ones (without 
the "type" test) to demonstrate that it works, and 2) documenting in the 
JavaDoc that the IO uses ES5 clients but is known to work with ES6. It's not 
ideal, but it will get ES6 support into Beam, and we can set up the IT to run 
regularly to give confidence it all works.

Can you please try to include a test that covers the withIdFn and withIndexFn 
for ES6? I see the full addressing test is not supported (ES6 drops the 
ability to dynamically set a type) but the other two are useful.

Please also be aware that the future is likely to be built around the BEAM-3199 
version since it makes use of the newest ES client (RestHighLevelClient).

> Support ES 6.x for ElasticsearchIO
> --
>
> Key: BEAM-5107
> URL: https://issues.apache.org/jira/browse/BEAM-5107
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Dat Tran
>Assignee: Etienne Chauchot
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Elasticsearch has released 6.3.2 but ElasticsearchIO only supports 2x-5.x.
> We should support ES 6.x for ElasticsearchIO.
> https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
> https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3199) Upgrade to Elasticsearch 6.x

2018-08-08 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573735#comment-16573735
 ] 

Tim Robertson commented on BEAM-3199:
-

Hi [~jeroens] - I see ES6 becoming increasingly desirable - possibly in our 
project during 2018.

I am curious if you have any plans/time to further this or if you'd welcome 
help? Are you using it already in production?

> Upgrade to Elasticsearch 6.x
> 
>
> Key: BEAM-3199
> URL: https://issues.apache.org/jira/browse/BEAM-3199
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Jean-Baptiste Onofré
>Assignee: Jeroen Steggink
>Priority: Major
>
> Elasticsearch 6.x is now GA. As it's fully compatible with Elasticsearch 5.x, 
> it makes sense to upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3199) Upgrade to Elasticsearch 6.x

2018-08-08 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573734#comment-16573734
 ] 

Tim Robertson commented on BEAM-3199:
-

We should bring the Hadoop IF dependency up to the same version for 
consistency when this is done.

> Upgrade to Elasticsearch 6.x
> 
>
> Key: BEAM-3199
> URL: https://issues.apache.org/jira/browse/BEAM-3199
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Jean-Baptiste Onofré
>Assignee: Jeroen Steggink
>Priority: Major
>
> Elasticsearch 6.x is now GA. As it's fully compatible with Elasticsearch 5.x, 
> it makes sense to upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3026) Improve retrying in ElasticSearch client

2018-08-02 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566886#comment-16566886
 ] 

Tim Robertson edited comment on BEAM-3026 at 8/2/18 2:37 PM:
-

Thanks for checking [~aalbatross] - things evolved from the original idea 
proposed on this jira. I would suggest mirroring the SolrIO as closely as 
possible. An overarching goal in Beam IO is to aim for consistency/familiarity.

If I remember correctly, in ES retrying is supported but not for `429` 
responses, whereas in Solr we needed to handle a multitude of exceptions.
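
As a sketch of what mirroring the SolrIO retry configuration could look like 
on ElasticsearchIO (the method names follow the SolrIO pattern and are 
assumptions, not a settled API):
{code:java}
ElasticsearchIO.ConnectionConfiguration conn =
    ElasticsearchIO.ConnectionConfiguration.create(
        new String[] {"http://localhost:9200"}, "test", "test");

ElasticsearchIO.write()
    .withConnectionConfiguration(conn)
    // retry throttled (HTTP 429) bulk requests with bounded attempts and total
    // duration, mirroring SolrIO.RetryConfiguration (Duration is org.joda.time)
    .withRetryConfiguration(
        ElasticsearchIO.RetryConfiguration.create(10, Duration.standardMinutes(3)));
{code}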


was (Author: timrobertson100):
Thanks for checking [~aalbatross] - things evolved from the original idea 
proposed on this jira. I would suggest mirroring the `SolrIO` as closely as 
possible. An overarching goal in Beam IO is to aim for consistency/familiarity.

If I remember correctly, in ES retrying is supported but not for `429` 
responses, whereas in Solr we needed to handle a multitude of exceptions.

> Improve retrying in ElasticSearch client
> 
>
> Key: BEAM-3026
> URL: https://issues.apache.org/jira/browse/BEAM-3026
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Tim Robertson
>Assignee: Ravi Pathak
>Priority: Major
>
> Currently an overloaded ES server will result in clients failing fast.
> I suggest implementing backoff pauses.  Perhaps something like this:
> {code}
> ElasticsearchIO.ConnectionConfiguration conn = 
> ElasticsearchIO.ConnectionConfiguration
>   .create(new String[]{"http://...:9200"}, "test", "test")
>   .retryWithWaitStrategy(WaitStrategies.exponentialBackoff(1000, 
> TimeUnit.MILLISECONDS)
>   .retryWithStopStrategy(StopStrategies.stopAfterAttempt(10)
> );
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3026) Improve retrying in ElasticSearch client

2018-08-02 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566886#comment-16566886
 ] 

Tim Robertson edited comment on BEAM-3026 at 8/2/18 2:37 PM:
-

Thanks for checking [~aalbatross] - things evolved from the original idea 
proposed on this jira. I would suggest mirroring the `SolrIO` as closely as 
possible. An overarching goal in Beam IO is to aim for consistency/familiarity.

If I remember correctly, in ES retrying is supported but not for `429` 
responses, whereas in Solr we needed to handle a multitude of exceptions.


was (Author: timrobertson100):
Thanks for checking [~aalbatross] - things did evolve from the original idea 
proposed on this jira. I would suggest mirroring the `SolrIO` as closely as 
possible. An overarching goal in Beam IO is to aim for consistency/familiarity.

If I remember correctly, in ES retrying is supported but not for `429` 
responses, whereas in Solr we needed to handle a multitude of exceptions.

> Improve retrying in ElasticSearch client
> 
>
> Key: BEAM-3026
> URL: https://issues.apache.org/jira/browse/BEAM-3026
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Tim Robertson
>Assignee: Ravi Pathak
>Priority: Major
>
> Currently an overloaded ES server will result in clients failing fast.
> I suggest implementing backoff pauses.  Perhaps something like this:
> {code}
> ElasticsearchIO.ConnectionConfiguration conn = 
> ElasticsearchIO.ConnectionConfiguration
>   .create(new String[]{"http://...:9200"}, "test", "test")
>   .retryWithWaitStrategy(WaitStrategies.exponentialBackoff(1000, 
> TimeUnit.MILLISECONDS)
>   .retryWithStopStrategy(StopStrategies.stopAfterAttempt(10)
> );
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3026) Improve retrying in ElasticSearch client

2018-08-02 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566886#comment-16566886
 ] 

Tim Robertson commented on BEAM-3026:
-

Thanks for checking [~aalbatross] - things did evolve from the original idea 
proposed on this jira. I would suggest mirroring the `SolrIO` as closely as 
possible. An overarching goal in Beam IO is to aim for consistency/familiarity.

If I remember correctly, in ES retrying is supported but not for `429` 
responses, whereas in Solr we needed to handle a multitude of exceptions.

> Improve retrying in ElasticSearch client
> 
>
> Key: BEAM-3026
> URL: https://issues.apache.org/jira/browse/BEAM-3026
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Tim Robertson
>Assignee: Ravi Pathak
>Priority: Major
>
> Currently an overloaded ES server will result in clients failing fast.
> I suggest implementing backoff pauses.  Perhaps something like this:
> {code}
> ElasticsearchIO.ConnectionConfiguration conn = 
> ElasticsearchIO.ConnectionConfiguration
>   .create(new String[]{"http://...:9200"}, "test", "test")
>   .retryWithWaitStrategy(WaitStrategies.exponentialBackoff(1000, 
> TimeUnit.MILLISECONDS)
>   .retryWithStopStrategy(StopStrategies.stopAfterAttempt(10)
> );
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3026) Improve retrying in ElasticSearch client

2018-08-02 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-3026:
---

Assignee: Ravi Pathak  (was: Tim Robertson)

> Improve retrying in ElasticSearch client
> 
>
> Key: BEAM-3026
> URL: https://issues.apache.org/jira/browse/BEAM-3026
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Tim Robertson
>Assignee: Ravi Pathak
>Priority: Major
>
> Currently an overloaded ES server will result in clients failing fast.
> I suggest implementing backoff pauses.  Perhaps something like this:
> {code}
> ElasticsearchIO.ConnectionConfiguration conn = 
> ElasticsearchIO.ConnectionConfiguration
>   .create(new String[]{"http://...:9200"}, "test", "test")
>   .retryWithWaitStrategy(WaitStrategies.exponentialBackoff(1000, 
> TimeUnit.MILLISECONDS)
>   .retryWithStopStrategy(StopStrategies.stopAfterAttempt(10)
> );
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-2661) Add KuduIO

2018-07-31 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson resolved BEAM-2661.
-
   Resolution: Fixed
Fix Version/s: 2.7.0

> Add KuduIO
> --
>
> Key: BEAM-2661
> URL: https://issues.apache.org/jira/browse/BEAM-2661
> Project: Beam
>  Issue Type: New Feature
>  Components: io-ideas
>Reporter: Jean-Baptiste Onofré
>Assignee: Tim Robertson
>Priority: Major
> Fix For: 2.7.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> New IO for Apache Kudu ([https://kudu.apache.org/overview.html]).
> This work is in progress [on this 
> branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with 
> design aspects documented below.
> h2. The API
> The {{KuduIO}} API requires the user to provide a function to convert objects 
> into operations. This is similar to the {{JdbcIO}} but different to others, 
> such as {{HBaseIO}} which requires a pre-transform stage beforehand to 
> convert into the mutations to apply. It was originally intended to copy the 
> {{HBaseIO}} approach, but this was not possible:
>  # The Kudu 
> [Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
>  is a fat class, and is a subclass of {{KuduRpc}}. It 
> holds RPC logic, callbacks and a Kudu client. Because of this the 
> {{Operation}} does not serialize and furthermore, the logic for encoding the 
> operations (Insert, Upsert etc) in the Kudu Java API are one way only (no 
> decode) because the server is written in C++.
>  # An alternative could be to introduce a new object to beam (e.g. 
> {{o.a.b.sdk.io.kudu.KuduOperation}}) to enable 
> {{PCollection}}. This was considered but was discounted 
> because:
>  ## It is not a familiar API to those already knowing Kudu
>  ## It still requires serialization and deserialization of the operations. 
> Using the existing Kudu approach of serializing into compact byte arrays 
> would require a decoder along the lines of [this almost complete 
> example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e].
>  This is possible but has fragilities given the Kudu code itself continues to 
> evolve. 
>  ## It becomes a trivial codebase in Beam to maintain by defer the object to 
> mutation mapping to within the KuduIO transform. {{JdbcIO}} gives us the 
> precedent to do this.
> h2. Testing framework
> {{Kudu}} is written in C++. While a 
> [TestMiniKuduCluster|https://github.com/cloudera/kudu/blob/master/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKuduCluster.java]
>  does exist in Java, it requires binaries to be available for the target 
> environment which is not portable (edit: this is now a [work in 
> progress|https://issues.apache.org/jira/browse/KUDU-2411] in Kudu). Therefore 
> we opt for the following:
>  # Unit tests will use a mock Kudu client
>  # Integration tests will cover the full aspects of the {{KuduIO}} and use a 
> Docker based Kudu instance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (BEAM-4260) Document usage for hcatalog 1.1

2018-07-27 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-4260:

Comment: was deleted

(was: Thanks [~iemejia] - I will do that.

How would you feel about a 1 liner in the JDoc too? I just know it is where I'd 
look first)

> Document usage for hcatalog 1.1
> ---
>
> Key: BEAM-4260
> URL: https://issues.apache.org/jira/browse/BEAM-4260
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-hcatalog, website
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The {{HCatalogIO}} does not work with environments providing Hive Server 1.x 
> which is in widespread use - as an example the latest Cloudera (5.14.2) 
> provides 1.1.x
>  
> The {{HCatalogIO}} marks it's Hive dependencies as provided, so I believe the 
> intention was to be open to multiple versions.
>  
> The issues come from the following:  
>  - use of {{HCatUtil.getHiveMetastoreClient(hiveConf)}} while previous 
> versions used the [now 
> deprecated|https://github.com/apache/hive/blob/master/hcatalog/core/src/main/java/org/apache/hive/hcatalog/common/HCatUtil.java#L586]
>  {{getHiveClient(HiveConf hiveConf)}}  
>  - Changes to the signature of {{RetryingMetaStoreClient.getProxy(...)}}
>  
> Given this doesn't work in a major Hadoop distro, and will not until the next 
> CDH release later in 2018 (i.e. widespread adoption only expected in 2019) I 
> think it would be worthwhile providing a fix/workaround.
> I _think_ building for 2.3 and relocating in your own app might be a 
> workaround although I'm still testing it.  If that is successful I'd propose 
> adding it to the project README or in a separate markdown file linked from 
> the README.
> Does that sound like a reasonable approach please?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-2661) Add KuduIO

2018-07-23 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-2661:

Description: 
New IO for Apache Kudu ([https://kudu.apache.org/overview.html]).

This work is in progress [on this 
branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with 
design aspects documented below.
h2. The API

The {{KuduIO}} API requires the user to provide a function to convert objects 
into operations. This is similar to the {{JdbcIO}} but different to others, 
such as {{HBaseIO}} which requires a pre-transform stage beforehand to convert 
into the mutations to apply. It was originally intended to copy the {{HBaseIO}} 
approach, but this was not possible:
 # The Kudu 
[Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
 is a fat class, and is a subclass of {{KuduRpc}}. It holds 
RPC logic, callbacks and a Kudu client. Because of this the {{Operation}} does 
not serialize and furthermore, the logic for encoding the operations (Insert, 
Upsert etc) in the Kudu Java API are one way only (no decode) because the 
server is written in C++.
 # An alternative could be to introduce a new object to beam (e.g. 
{{o.a.b.sdk.io.kudu.KuduOperation}}) to enable {{PCollection}}. 
This was considered but was discounted because:
 ## It is not a familiar API to those already knowing Kudu
 ## It still requires serialization and deserialization of the operations. 
Using the existing Kudu approach of serializing into compact byte arrays would 
require a decoder along the lines of [this almost complete 
example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e].
 This is possible but is fragile given that the Kudu code itself continues to 
evolve. 
 ## It becomes a trivial codebase in Beam to maintain by deferring the 
object-to-mutation mapping to within the KuduIO transform. {{JdbcIO}} gives us 
the precedent to do this (a sketch follows below).
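
To make this concrete, a sketch of how the write API might look (hedged: names 
such as {{withFormatFn}}, {{TableAndRecord}} and {{MyRecord}} are illustrative 
and may change in review):
{code:java}
// Sketch only - the user supplies a serializable function mapping each
// element to Kudu Operations, so Beam never serializes an Operation itself
// (mirroring the JdbcIO precedent).
PCollection<MyRecord> records = ...;
records.apply(
    KuduIO.<MyRecord>write()
        .withMasterAddresses("kudu-master1:7051,kudu-master2:7051")
        .withTable("my_table")
        .withFormatFn(tableAndRecord -> {
          Insert insert = tableAndRecord.getTable().newInsert();
          insert.getRow().addString("id", tableAndRecord.getRecord().getId());
          return Collections.singletonList(insert);
        }));
{code}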

h2. Testing framework

{{Kudu}} is written in C++. While a 
[TestMiniKuduCluster|https://github.com/cloudera/kudu/blob/master/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKuduCluster.java]
 does exist in Java, it requires binaries to be available for the target 
environment which is not portable (edit: this is now a [work in 
progress|https://issues.apache.org/jira/browse/KUDU-2411] in Kudu). Therefore 
we opt for the following:
 # Unit tests will use a mock Kudu client
 # Integration tests will cover the full aspects of the {{KuduIO}} and use a 
Docker based Kudu instance

  was:
New IO for Apache Kudu ([https://kudu.apache.org/overview.html]).

This work is in progress [on this 
branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with 
design aspects documented below.
h2. The API

The {{KuduIO}} API requires the user to provide a function to convert objects 
into operations. This is similar to the {{JdbcIO}} but different to others, 
such as {{HBaseIO}} which requires a pre-transform stage beforehand to convert 
into the mutations to apply. It was originally intended to copy the {{HBaseIO}} 
approach, but this was not possible:
 # The Kudu 
[Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
 is a fat class, and is a subclass of {{KuduRpc}}. It holds 
RPC logic, callbacks and a Kudu client. Because of this the {{Operation}} does 
not serialize and furthermore, the logic for encoding the operations (Insert, 
Upsert etc) in the Kudu Java API are one way only (no decode) because the 
server is written in C++.
 # An alternative could be to introduce a new object to beam (e.g. 
{{o.a.b.sdk.io.kudu.KuduOperation}}) to enable {{PCollection<KuduOperation>}}. 
This was considered but was discounted because:
 ## It is not a familiar API to those already knowing Kudu
 ## It still requires serialization and deserialization of the operations. 
Using the existing Kudu approach of serializing into compact byte arrays would 
require a decoder along the lines of [this almost complete 
example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e].
 This is possible but is fragile given that the Kudu code itself continues to 
evolve. 
 ## It becomes a trivial codebase in Beam to maintain by deferring the 
object-to-mutation mapping to within the KuduIO transform. {{JdbcIO}} gives us 
the precedent to do this.

h2. Testing framework

{{Kudu}} is written in C++. While a 
[TestMiniKuduCluster|https://github.com/cloudera/kudu/blob/master/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKuduCluster.java]
 does exist in Java, it requires binaries to be available for the target 
environment which is not portable. Therefore we opt for the following:
 # Unit tests will use a mock Kudu client
 # Integration tests will cover the full aspects of the {{KuduIO}} and use a 
Docker based Kudu instance


> Add KuduIO
> --
>
> Key: BEAM-2661
> URL: 

[jira] [Updated] (BEAM-2661) Add KuduIO

2018-06-17 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-2661:

Description: 
New IO for Apache Kudu ([https://kudu.apache.org/overview.html]).

This work is in progress [on this 
branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with 
design aspects documented below.
h2. The API

The {{KuduIO}} API requires the user to provide a function to convert objects 
into operations. This is similar to the {{JdbcIO}} but different to others, 
such as {{HBaseIO}} which requires a pre-transform stage beforehand to convert 
into the mutations to apply. It was originally intended to copy the {{HBaseIO}} 
approach, but this was not possible:
 # The Kudu 
[Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
 is a fat class, and is a subclass of {{KuduRpc}}. It holds 
RPC logic, callbacks and a Kudu client. Because of this the {{Operation}} does 
not serialize and furthermore, the logic for encoding the operations (Insert, 
Upsert etc) in the Kudu Java API are one way only (no decode) because the 
server is written in C++.
 # An alternative could be to introduce a new object to beam (e.g. 
{{o.a.b.sdk.io.kudu.KuduOperation}}) to enable {{PCollection<KuduOperation>}}. 
This was considered but was discounted because:
 ## It is not a familiar API to those already knowing Kudu
 ## It still requires serialization and deserialization of the operations. 
Using the existing Kudu approach of serializing into compact byte arrays would 
require a decoder along the lines of [this almost complete 
example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e].
 This is possible but is fragile given that the Kudu code itself continues to 
evolve. 
 ## It becomes a trivial codebase in Beam to maintain by deferring the 
object-to-mutation mapping to within the KuduIO transform. {{JdbcIO}} gives us 
the precedent to do this.

h2. Testing framework

{{Kudu}} is written in C++. While a 
[TestMiniKuduCluster|https://github.com/cloudera/kudu/blob/master/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKuduCluster.java]
 does exist in Java, it requires binaries to be available for the target 
environment which is not portable. Therefore we opt for the following:
 # Unit tests will use a mock Kudu client
 # Integration tests will cover the full aspects of the {{KuduIO}} and use a 
Docker based Kudu instance

  was:
New IO for Apache Kudu (https://kudu.apache.org/overview.html).

This work is in progress [on this 
branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with 
design aspects documented below.

h2. The API

The {{KuduIO}} API requires the user to provide a function to convert objects 
into operations. This is similar to the {{JdbcIO}} but different to others, 
such as {{HBaseIO}} which requires a pre-transform stage beforehand to convert 
into the mutations to apply. It was originally intended to copy the {{HBaseIO}} 
approach, but this was not possible:

# The Kudu 
[Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
 is a fat class, and is a subclass of {{KuduRpc}}. It holds 
RPC logic, callbacks and a Kudu client. Because of this the {{Operation}} does 
not serialize and furthermore, the logic for encoding the operations (Insert, 
Upsert etc) in the Kudu Java API are one way only (no decode) because the 
server is written in C++.
# An alternative could be to introduce a new object to beam  (e.g. 
{{o.a.b.sdk.io.kudu.KuduOperation}}) to enable {{PCollection<KuduOperation>}}. 
This was considered but was discounted because:
## It is not a familiar API to those already knowing Kudu
## It still requires serialization and deserialization of the operations. Using 
the existing Kudu approach of serializing into compact byte arrays would 
require a decoder along the lines of [this almost complete 
example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e]

h2. Testing framework

{{Kudu}} is written in C++. While a 
[TestMiniKuduCluster|https://github.com/cloudera/kudu/blob/master/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKuduCluster.java]
 does exist in Java, it requires binaries to be available for the target 
environment which is not portable.  Therefore we opt for the following:

# Unit tests will use a mock Kudu client 
# Integration tests will cover the full aspects of the {{KuduIO}} and use a 
Docker based Kudu instance



> Add KuduIO
> --
>
> Key: BEAM-2661
> URL: https://issues.apache.org/jira/browse/BEAM-2661
> Project: Beam
>  Issue Type: New Feature
>  Components: io-ideas
>Reporter: Jean-Baptiste Onofré
>Assignee: Tim Robertson
>Priority: Major
>
> New IO for Apache Kudu ([https://kudu.apache.org/overview.html]).
> This work is in progress [on this 
> 

[jira] [Updated] (BEAM-2661) Add KuduIO

2018-06-17 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-2661:

Description: 
New IO for Apache Kudu (https://kudu.apache.org/overview.html).

This work is in progress [on this 
branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with 
design aspects documented below.

h2. The API

The {{KuduIO}} API requires the user to provide a function to convert objects 
into operations. This is similar to the {{JdbcIO}} but different to others, 
such as {{HBaseIO}} which requires a pre-transform stage beforehand to convert 
into the mutations to apply. It was originally intended to copy the {{HBaseIO}} 
approach, but this was not possible:

# The Kudu 
[Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
 is a fat class, and is a subclass of {{KuduRpc}}. It holds 
RPC logic, callbacks and a Kudu client. Because of this the {{Operation}} does 
not serialize and furthermore, the logic for encoding the operations (Insert, 
Upsert etc) in the Kudu Java API are one way only (no decode) because the 
server is written in C++.
# An alternative could be to introduce a new object to beam  (e.g. 
{{o.a.b.sdk.io.kudu.KuduOperation}}) to enable {{PCollection<KuduOperation>}}. 
This was considered but was discounted because:
## It is not a familiar API to those already knowing Kudu
## It still requires serialization and deserialization of the operations. Using 
the existing Kudu approach of serializing into compact byte arrays would 
require a decoder along the lines of [this almost complete 
example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e]

h2. Testing framework

{{Kudu}} is written in C++. While a 
[TestMiniKuduCluster|https://github.com/cloudera/kudu/blob/master/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKuduCluster.java]
 does exist in Java, it requires binaries to be available for the target 
environment which is not portable.  Therefore we opt for the following:

# Unit tests will use a mock Kudu client 
# Integration tests will cover the full aspects of the {{KuduIO}} and use a 
Docker based Kudu instance


  was:
New IO for Apache Kudu (https://kudu.apache.org/overview.html).

This work is in progress [on this 
branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO].

Design aspects are documented below.

The API
# The Kudu 
[Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
 is a fat class, and is a subclass of {{KuduRpc}}. It holds 
RPC logic, callbacks and a Kudu client. Because of this the {{Operation}} does 
not serialize and furthermore, the logic for encoding the operations (Insert, 
Upsert etc) in the Kudu Java API are one way only (no decode) because the 
server is written in C++.
# An alternative could be to introduce a new object to beam  (e.g. 
{{o.a.b.sdk.io.kudu.KuduOperation}}) to enable {{PCollection<KuduOperation>}}. 
This was considered but was discounted because:
## It is not a familiar API to those already knowing Kudu
## It still requires serialization and deserialization of the operations. Using 
the existing Kudu approach of serializing into compact byte arrays would 
require a decoder along the lines of [this almost complete 
example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e]





> Add KuduIO
> --
>
> Key: BEAM-2661
> URL: https://issues.apache.org/jira/browse/BEAM-2661
> Project: Beam
>  Issue Type: New Feature
>  Components: io-ideas
>Reporter: Jean-Baptiste Onofré
>Assignee: Tim Robertson
>Priority: Major
>
> New IO for Apache Kudu (https://kudu.apache.org/overview.html).
> This work is in progress [on this 
> branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with 
> design aspects documented below.
> h2. The API
> The {{KuduIO}} API requires the user to provide a function to convert objects 
> into operations. This is similar to the {{JdbcIO}} but different to others, 
> such as {{HBaseIO}} which requires a pre-transform stage beforehand to 
> convert into the mutations to apply. It was originally intended to copy the 
> {{HBaseIO}} approach, but this was not possible:
> # The Kudu 
> [Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
>  is a fat class, and is a subclass of {{KuduRpc}}. It 
> holds RPC logic, callbacks and a Kudu client. Because of this the 
> {{Operation}} does not serialize and furthermore, the logic for encoding the 
> operations (Insert, Upsert etc) in the Kudu Java API are one way only (no 
> decode) because the server is written in C++.
> # An alternative could be to introduce a new object to beam  (e.g. 
> {{o.a.b.sdk.io.kudu.KuduOperation}}) to enable 
> {{PCollection<KuduOperation>}}. This was considered but was discounted 
> because:
> ## It is not a 

[jira] [Updated] (BEAM-2661) Add KuduIO

2018-06-17 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-2661:

Description: 
New IO for Apache Kudu (https://kudu.apache.org/overview.html).

This work is in progress [on this 
branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO].

Design aspects are documented below.

The API
# The Kudu 
[Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
 is a fat class, and is a subclass of {{KuduRpc}}. It holds 
RPC logic, callbacks and a Kudu client. Because of this the {{Operation}} does 
not serialize and furthermore, the logic for encoding the operations (Insert, 
Upsert etc) in the Kudu Java API are one way only (no decode) because the 
server is written in C++.
# An alternative could be to introduce a new object to beam  (e.g. 
{{o.a.b.sdk.io.kudu.KuduOperation}}) to enable {{PCollection<KuduOperation>}}. 
This was considered but was discounted because:
## It is not a familiar API to those already knowing Kudu
## It still requires serialization and deserialization of the operations. Using 
the existing Kudu approach of serializing into compact byte arrays would 
require a decoder along the lines of [this almost complete 
example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e]




  was:New IO for Apache Kudu (https://kudu.apache.org/overview.html).


> Add KuduIO
> --
>
> Key: BEAM-2661
> URL: https://issues.apache.org/jira/browse/BEAM-2661
> Project: Beam
>  Issue Type: New Feature
>  Components: io-ideas
>Reporter: Jean-Baptiste Onofré
>Assignee: Tim Robertson
>Priority: Major
>
> New IO for Apache Kudu (https://kudu.apache.org/overview.html).
> This work is in progress [on this 
> branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO].
> Design aspects are documented below.
> The API
> # The Kudu 
> [Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
>  is a fat class, and is a subclass of {{KuduRpc}}. It 
> holds RPC logic, callbacks and a Kudu client. Because of this the 
> {{Operation}} does not serialize and furthermore, the logic for encoding the 
> operations (Insert, Upsert etc) in the Kudu Java API are one way only (no 
> decode) because the server is written in C++.
> # An alternative could be to introduce a new object to beam  (e.g. 
> {{o.a.b.sdk.io.kudu.KuduOperation}}) to enable 
> {{PCollection<KuduOperation>}}. This was considered but was discounted 
> because:
> ## It is not a familiar API to those already knowing Kudu
> ## It still requires serialization and deserialization of the operations. 
> Using the existing Kudu approach of serializing into compact byte arrays 
> would require a decoder along the lines of [this almost complete 
> example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3848) SolrIO: Add retrying mechanism in client writes

2018-06-11 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-3848:

Issue Type: Improvement  (was: New Feature)

> SolrIO: Add retrying mechanism in client writes
> ---
>
> Key: BEAM-3848
> URL: https://issues.apache.org/jira/browse/BEAM-3848
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-solr
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Minor
> Fix For: 2.5.0
>
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> A busy Solr server is prone to return a RemoteSolrException on writing, which 
> currently fails a complete task (e.g. a partition of a Spark RDD being 
> written to Solr).
> A good addition would be the ability to provide a retrying mechanism for the 
> batch in flight, rather than failing fast, which will most likely trigger a 
> much larger retry of more writes.
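> A sketch of what the addition might look like (hedged: the 
> {{RetryConfiguration}} naming mirrors other IOs, and the exact signature is 
> to be settled in review):
> {code}
> SolrIO.write()
> .withConnectionConfiguration(connectionConfiguration)
> .withRetryConfiguration(
> SolrIO.RetryConfiguration.create(10, Duration.standardMinutes(3)))
> .to("my-collection");
> {code}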



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3201) ElasticsearchIO should allow the user to optionally pass id, type and index per document

2018-06-11 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-3201:

Issue Type: New Feature  (was: Improvement)

> ElasticsearchIO should allow the user to optionally pass id, type and index 
> per document
> 
>
> Key: BEAM-3201
> URL: https://issues.apache.org/jira/browse/BEAM-3201
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Reporter: Etienne Chauchot
>Assignee: Tim Robertson
>Priority: Major
> Fix For: 2.5.0
>
>
> *Dynamic document id*: Today the ESIO only inserts the payload of the ES 
> documents. Elasticsearch generates a document id for each record inserted, so 
> each new insertion is considered a new document. Users want to be able to 
> update documents using the IO. So, for the write part of the IO, users should 
> be able to provide a document id so that they can update already stored 
> documents. Providing an id for the documents would also help the user with 
> idempotency.
> *Dynamic ES type and ES index*: In some cases (streaming pipeline with high 
> throughput) partitioning the PCollection so it can be plugged into different 
> ESIO instances (pointing to different index/type) is not very practical; the 
> users would like to be able to set the ES index/type per document.
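> A sketch of the resulting write API (hedged: {{ExtractValueFn}} is an 
> illustrative helper reading a field from the JSON document, not an existing 
> class):
> {code}
> source.apply(
>   ElasticsearchIO.write()
> .withConnectionConfiguration(connectionConfiguration)
> .withIdFn(new ExtractValueFn("id"))
> .withIndexFn(new ExtractValueFn("index"))
> .withTypeFn(new ExtractValueFn("type"))
> {code}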



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4311) Enforce ErrorProne analysis in Flink runner project

2018-06-06 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-4311:
---

Assignee: Tim Robertson

> Enforce ErrorProne analysis in Flink runner project
> ---
>
> Key: BEAM-4311
> URL: https://issues.apache.org/jira/browse/BEAM-4311
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is to clean ErrorProne warnings and add enforcement in 
> {{beam-runners-flink}}. Additional context discussed on the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-runners-flink:assemble}}
> # Fix each ErrorProne warning from the {{runners/flink}} project.
> # In {{runners/flink/build.gradle}}, add {{failOnWarning: true}} to the call 
> to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
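> For clarity, the resulting call in {{build.gradle}} would look roughly like 
> (taken from the linked example):
> {code}
> applyJavaNature(failOnWarning: true)
> {code}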
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3199) Upgrade to Elasticsearch 6.x

2018-05-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494141#comment-16494141
 ] 

Tim Robertson edited comment on BEAM-3199 at 5/29/18 7:53 PM:
--

This is fabulous to see. I've also been following the ES issues; a few comments:

* I see your comment about the FluentBackoff not being serializable. I'd 
suggest copying the approach from [SolrIO 
here|https://github.com/apache/beam/blob/master/sdks/java/io/solr/src/main/java/org/apache/beam/sdk/io/solr/SolrIO.java#L778]
 which is consistent with JdbcIO.
* Repeating [~echauchot] but can we make sure that the dynamic routing for 
index and document ID are included please, as they are necessary for updates 
(upserts to be precise)? I know type is being dropped in ES so that can go 
[BEAM-3201] 
* Partial update support [is about to be merged | 
https://github.com/apache/beam/pull/5463] to fix [BEAM-4389] as well and is 
something I know one team relies on already
* We might want to consider the discussion on SolrIO versions 6&7 [BEAM-3947] 
when considering packaging (single/multiple modules) so we are consistent.

I'll be happy to help out of course - and thanks for sharing this.
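
(For reference, the SolrIO approach is roughly the following - a sketch from 
memory, not the exact code - keeping only serializable retry *parameters* in 
the transform and building the non-serializable {{FluentBackoff}} from them 
inside the {{DoFn}} at runtime:)
{code:java}
import java.io.Serializable;
import org.joda.time.Duration;

// Sketch of the SolrIO/JdbcIO pattern: the transform holds only these
// serializable retry parameters; FluentBackoff is derived from them at
// runtime (nested in the IO and generated with AutoValue in the real code).
public abstract class RetryConfiguration implements Serializable {
  public abstract int getMaxAttempts();
  public abstract Duration getMaxDuration();
}
{code}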



was (Author: timrobertson100):
This is fabulous to see. I've also been following the ES issues; a few comments:

* I see your comment about the FluentBackoff not being serializable. I'd 
suggest copying the approach from [SolrIO 
here|https://github.com/apache/beam/blob/master/sdks/java/io/solr/src/main/java/org/apache/beam/sdk/io/solr/SolrIO.java#L778]
 which is consistent with JdbcIO.
* Repeating [~echauchot] but can we make sure that the dynamic routing for 
index and document ID are included please, as they are necessary for updates 
(upserts to be precise)? I know type is being dropped in ES so that can go 
[BEAM-3201] 
* Partial update support [is about to be merged | 
https://github.com/apache/beam/pull/5463] to fix [BEAM-4389] as well and is 
something I know one team relies on already
* We might want to consider the SolrIO v6&7 discussion [BEAM-3947] when 
considering packaging as one or several modules so we are consistent.

I'll be happy to help out of course - and thanks for sharing this.


> Upgrade to Elasticsearch 6.x
> 
>
> Key: BEAM-3199
> URL: https://issues.apache.org/jira/browse/BEAM-3199
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Jean-Baptiste Onofré
>Assignee: Jeroen Steggink
>Priority: Major
>
> Elasticsearch 6.x is now GA. As it's fully compatible with Elasticsearch 5.x, 
> it makes sense to upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3199) Upgrade to Elasticsearch 6.x

2018-05-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494141#comment-16494141
 ] 

Tim Robertson edited comment on BEAM-3199 at 5/29/18 7:52 PM:
--

This is fabulous to see. I've also been following the ES issues; a few comments:

* I see your comment about the FluentBackoff not being serializable. I'd 
suggest copying the approach from [SolrIO 
here|https://github.com/apache/beam/blob/master/sdks/java/io/solr/src/main/java/org/apache/beam/sdk/io/solr/SolrIO.java#L778]
 which is consistent with JdbcIO.
* Repeating [~echauchot] but can we make sure that the dynamic routing for 
index and document ID are included please, as they are necessary for updates 
(upserts to be precise)? I know type is being dropped in ES so that can go 
[BEAM-3201] 
* Partial update support [is about to be merged | 
https://github.com/apache/beam/pull/5463] to fix [BEAM-4389] as well and is 
something I know one team relies on already
* We might want to consider the SolrIO v6&7 discussion [BEAM-3947] when 
considering packaging as one or several modules so we are consistent.

I'll be happy to help out of course - and thanks for sharing this.



was (Author: timrobertson100):
This is fabulous to see. I've also been following the ES issues; a few comments:

* I see your comment about the FluentBackoff not being serializable. I'd 
suggest copying the approach from [SolrIO 
here|https://github.com/apache/beam/blob/master/sdks/java/io/solr/src/main/java/org/apache/beam/sdk/io/solr/SolrIO.java#L778]
 which is consistent with JdbcIO.
* Repeating [~echauchot] but can we make sure that the dynamic routing for 
index and document ID are included please, as they are necessary for updates 
(upserts to be precise)? I know type is being dropped in ES so that can go 
[BEAM-3201] 
* Partial update support [is about to be merged | 
https://github.com/apache/beam/pull/5463] to fix [BEAM-4389] as well and is 
something I know one team relies on already

I'll be happy to help out of course.


> Upgrade to Elasticsearch 6.x
> 
>
> Key: BEAM-3199
> URL: https://issues.apache.org/jira/browse/BEAM-3199
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Jean-Baptiste Onofré
>Assignee: Jeroen Steggink
>Priority: Major
>
> Elasticsearch 6.x is now GA. As it's fully compatible with Elasticsearch 5.x, 
> it makes sense to upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3199) Upgrade to Elasticsearch 6.x

2018-05-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494141#comment-16494141
 ] 

Tim Robertson commented on BEAM-3199:
-

This is fabulous to see. I've also been following the ES issues; a few comments:

* I see your comment about the FluentBackoff not being serializable. I'd 
suggest copying the approach from [SolrIO 
here|https://github.com/apache/beam/blob/master/sdks/java/io/solr/src/main/java/org/apache/beam/sdk/io/solr/SolrIO.java#L778]
 which is consistent with JdbcIO.
* Repeating [~echauchot] but can we make sure that the dynamic routing for 
index and document ID are included please, as they are necessary for updates 
(upserts to be precise)? I know type is being dropped in ES so that can go 
[BEAM-3201] 
* Partial update support [is about to be merged | 
https://github.com/apache/beam/pull/5463] to fix [BEAM-4389] as well and is 
something I know one team relies on already

I'll be happy to help out of course.


> Upgrade to Elasticsearch 6.x
> 
>
> Key: BEAM-3199
> URL: https://issues.apache.org/jira/browse/BEAM-3199
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Jean-Baptiste Onofré
>Assignee: Jeroen Steggink
>Priority: Major
>
> Elasticsearch 6.x is now GA. As it's fully compatible with Elasticsearch 5.x, 
> it makes sense to upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3111) Upgrade ElasticsearchIO elastic dependences to 5.6.3

2018-05-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494104#comment-16494104
 ] 

Tim Robertson commented on BEAM-3111:
-

Should this be closed?

> Upgrade ElasticsearchIO elastic dependences to 5.6.3
> 
>
> Key: BEAM-3111
> URL: https://issues.apache.org/jira/browse/BEAM-3111
> Project: Beam
>  Issue Type: Improvement
>  Components: z-do-not-use-sdk-java-extensions
>Reporter: Etienne Chauchot
>Assignee: Etienne Chauchot
>Priority: Minor
> Fix For: 2.3.0
>
>
> Now that ES 6.0.RC1 is out, it is time to upgrade deps to the latest 5.x



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-67) Apache Sqoop connector

2018-05-24 Thread Tim Robertson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489068#comment-16489068
 ] 

Tim Robertson commented on BEAM-67:
---

Given the existence of `JdbcIO` (and to some extent `HCatalogIO`) can we detail 
a little what we're aiming to gain and how Beam would interact with Sqoop 
please? 

> Apache Sqoop connector
> --
>
> Key: BEAM-67
> URL: https://issues.apache.org/jira/browse/BEAM-67
> Project: Beam
>  Issue Type: New Feature
>  Components: io-ideas
>Reporter: Daniel Halperin
>Priority: Minor
>
> Bounded source has been requested in the past.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-67) Apache Sqoop connector

2018-05-24 Thread Tim Robertson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489068#comment-16489068
 ] 

Tim Robertson edited comment on BEAM-67 at 5/24/18 2:08 PM:


Given the existence of {{JdbcIO}} (and to some extent {{HCatalogIO}}) can we 
detail a little what we're aiming to gain and how Beam would interact with 
Sqoop please? 


was (Author: timrobertson100):
Given the existence of `JdbcIO` (and to some extent `HCatalogIO`) can we detail 
a little what we're aiming to gain and how Beam would interact with Sqoop 
please? 

> Apache Sqoop connector
> --
>
> Key: BEAM-67
> URL: https://issues.apache.org/jira/browse/BEAM-67
> Project: Beam
>  Issue Type: New Feature
>  Components: io-ideas
>Reporter: Daniel Halperin
>Priority: Minor
>
> Bounded source has been requested in the past.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4396) Remove all mentions of "mvn" from javadoc

2018-05-24 Thread Tim Robertson (JIRA)
Tim Robertson created BEAM-4396:
---

 Summary: Remove all mentions of "mvn" from javadoc
 Key: BEAM-4396
 URL: https://issues.apache.org/jira/browse/BEAM-4396
 Project: Beam
  Issue Type: Sub-task
  Components: build-system
Reporter: Tim Robertson
Assignee: Luke Cwik


There are (at least) 5 places in javadoc instructing the user to run {{mvn}}.  

An example is in 
https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch-tests/elasticsearch-tests-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOITCommon.java#L47

IDEA "find in files" reports:
 - CassandraTestDataSet
 - KafkaIOTest
 - HCatalogIOTest
 - PrintBase64Encodings 
 - ElasticsearchIOITCommon

It would be good to provide {{gradle}} instructions exclusively.
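
An example of the shape of the replacement (hedged: the exact module path and 
task would need checking against the Gradle build):

{code}
./gradlew :beam-sdks-java-io-elasticsearch-tests-common:test
{code}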



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3249) Use Gradle to build/release project

2018-05-24 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-3249:
---

Assignee: Tim Robertson  (was: Luke Cwik)

> Use Gradle to build/release project
> ---
>
> Key: BEAM-3249
> URL: https://issues.apache.org/jira/browse/BEAM-3249
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system, testing
>Reporter: Luke Cwik
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 14h 20m
>  Remaining Estimate: 0h
>
> I have collected data by running several builds against master using Gradle 
> and Maven without using Gradle's support for incremental builds.
> Gradle (mins)
> min: 25.04
> max: 160.14
> median: 45.78
> average: 52.19
> stdev: 30.80
> Maven (mins)
> min: 56.86
> max: 216.55
> median: 87.93
> average: 109.10
> stdev: 48.01
> I excluded a few timeouts (240 mins) that happened during the Maven build 
> from its numbers but we can see conclusively that Gradle is about twice as 
> fast for the build when compared to Maven when run using Jenkins.
> Original dev@ thread: 
> https://lists.apache.org/thread.html/225dddcfc78f39bbb296a0d2bbef1caf37e17677c7e5573f0b6fe253@%3Cdev.beam.apache.org%3E
> The data is available here 
> https://docs.google.com/spreadsheets/d/1MHVjF-xoI49_NJqEQakUgnNIQ7Qbjzu8Y1q_h3dbF1M/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4330) Sporadic ZipExceptions in tests with gradle and "isRelease"

2018-05-24 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-4330:

Summary: Sporadic ZipExceptions in tests with gradle and "isRelease"   
(was: Some tests fail with ZipException)

> Sporadic ZipExceptions in tests with gradle and "isRelease" 
> 
>
> Key: BEAM-4330
> URL: https://issues.apache.org/jira/browse/BEAM-4330
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-java-elasticsearch
>Reporter: Etienne Chauchot
>Priority: Major
>
> Some tests recently started to fail (for example with ElasticsearchIO) with:
>   
> {code:java}
>  java.util.ServiceConfigurationError at ElasticsearchIOTest.java:106
> Caused by: java.util.zip.ZipException at ElasticsearchIOTest.java:106
>  
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4330) Some tests fail with ZipException

2018-05-24 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-4330:

Issue Type: Sub-task  (was: Test)
Parent: BEAM-3249

> Some tests fail with ZipException
> -
>
> Key: BEAM-4330
> URL: https://issues.apache.org/jira/browse/BEAM-4330
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-java-elasticsearch
>Reporter: Etienne Chauchot
>Priority: Major
>
> Some tests recently started to fail (for example with ElasticsearchIO) with:
>   
> {code:java}
>  java.util.ServiceConfigurationError at ElasticsearchIOTest.java:106
> Caused by: java.util.zip.ZipException at ElasticsearchIOTest.java:106
>  
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3249) Use Gradle to build/release project

2018-05-24 Thread Tim Robertson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488725#comment-16488725
 ] 

Tim Robertson commented on BEAM-3249:
-

Linking these, as the Slack chat seems to confirm the {{ZipException}} relates 
to a Gradle issue.

> Use Gradle to build/release project
> ---
>
> Key: BEAM-3249
> URL: https://issues.apache.org/jira/browse/BEAM-3249
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system, testing
>Reporter: Luke Cwik
>Assignee: Luke Cwik
>Priority: Major
>  Time Spent: 14h 20m
>  Remaining Estimate: 0h
>
> I have collected data by running several builds against master using Gradle 
> and Maven without using Gradle's support for incremental builds.
> Gradle (mins)
> min: 25.04
> max: 160.14
> median: 45.78
> average: 52.19
> stdev: 30.80
> Maven (mins)
> min: 56.86
> max: 216.55
> median: 87.93
> average: 109.10
> stdev: 48.01
> I excluded a few timeouts (240 mins) that happened during the Maven build 
> from its numbers but we can see conclusively that Gradle is about twice as 
> fast for the build when compared to Maven when run using Jenkins.
> Original dev@ thread: 
> https://lists.apache.org/thread.html/225dddcfc78f39bbb296a0d2bbef1caf37e17677c7e5573f0b6fe253@%3Cdev.beam.apache.org%3E
> The data is available here 
> https://docs.google.com/spreadsheets/d/1MHVjF-xoI49_NJqEQakUgnNIQ7Qbjzu8Y1q_h3dbF1M/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-4341) Enforce ErrorProne analysis in the google-cloud-platform IO project

2018-05-24 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson resolved BEAM-4341.
-
   Resolution: Fixed
Fix Version/s: 2.5.0

> Enforce ErrorProne analysis in the google-cloud-platform IO project
> ---
>
> Key: BEAM-4341
> URL: https://issues.apache.org/jira/browse/BEAM-4341
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
> Fix For: 2.5.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is to clean ErrorProne warnings and add enforcement in 
> {{beam-sdks-java-io-google-cloud-platform}}. Additional context discussed on 
> the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-sdks-java-io-google-cloud-platform:assemble}}
> # Fix each ErrorProne warning from the {{sdks/java/io/google-cloud-platform}} 
> project.
> # In {{sdks/java/io/google-cloud-platform/build.gradle}}, add 
> {{failOnWarning: true}} to the call to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4389) Enable partial updates Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-4389:

Summary: Enable partial updates Elasticsearch  (was: Enable updates and 
upserts for Elasticsearch)

> Enable partial updates Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different 
> categories of information of the target entity (e.g. one for taxonomic 
> processing, another for geospatial processing). A read and merge is not 
> possible inside the batch call, meaning the only way to do it is through a 
> join. The join approach is slow, and also stops the ability to run a single 
> process in isolation (e.g. reprocess the geospatial component of all docs).
> Use of this configuration parameter has to be used in conjunction with 
> controlling the document ID (possible since BEAM-3201) to make sense.
> The client API would include a {{withUseUpdate(...)}} such as:
> {code}
> source.apply(
>   ElasticsearchIO.write()
> .withConnectionConfiguration(connectionConfiguration)
> .withIdFn(new ExtractValueFn("id"))
> .withUseUpdate(true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4389) Enable updates and upserts for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-4389:

Description: 
Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
updates rather than full document inserts. 

Rationale: We have the case where different pipelines process different 
categories of information of the target entity (e.g. one for taxonomic 
processing, another for geospatial processing). A read and merge is not 
possible inside the batch call, meaning the only way to do it is through a 
join. The join approach is slow, and also stops the ability to run a single 
process in isolation (e.g. reprocess the geospatial component of all docs).

Use of this configuration parameter has to be used in conjunction with 
controlling the document ID (possible since BEAM-3201) to make sense.

The client API would include a {{withUsePartialUpdate(true)}} such as:

{code}
source.apply(
  ElasticsearchIO.write()
.withConnectionConfiguration(connectionConfiguration)
.withIdFn(new ExtractValueFn("id"))
.withUpdateMode(UpdateMode.PARTIAL)
{code}
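
For reference, with partial updates each action line in the {{bulk}} request 
would switch from {{index}} to {{update}}, with the document wrapped in a 
{{doc}} field (standard Elasticsearch bulk API; index/type/id values are 
illustrative):

{code}
{ "update" : { "_index" : "myindex", "_type" : "mytype", "_id" : "1" } }
{ "doc" : { "field" : "value" } }
{code}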



  was:
Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
updates rather than full document inserts. 

Rationale: We have the case where different pipelines process different 
categories of information of the target entity (e.g. one for taxonomic 
processing, another for geospatial processing). A read and merge is not 
possible inside the batch call, meaning the only way to do it is through a 
join. The join approach is slow, and also stops the ability to run a single 
process in isolation (e.g. reprocess the geospatial component of all docs).

Use of this configuration parameter has to be used in conjunction with 
controlling the document ID (possible since BEAM-3201) to make sense.

The client API would include a {{withUsePartialUpdate(true)}} such as:

{code}
source.apply(
  ElasticsearchIO.write()
.withConnectionConfiguration(connectionConfiguration)
.withIdFn(new ExtractValueFn("id"))
.withUsePartialUpdate(true)
{code}




> Enable updates and upserts for Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different 
> categories of information of the target entity (e.g. one for taxonomic 
> processing, another for geospatial processing). A read and merge is not 
> possible inside the batch call, meaning the only way to do it is through a 
> join. The join approach is slow, and also stops the ability to run a single 
> process in isolation (e.g. reprocess the geospatial component of all docs).
> Use of this configuration parameter has to be used in conjunction with 
> controlling the document ID (possible since BEAM-3201) to make sense.
> The client API would include a {{withUsePartialUpdate(true)}} such as:
> {code}
> source.apply(
>   ElasticsearchIO.write()
> .withConnectionConfiguration(connectionConfiguration)
> .withIdFn(new ExtractValueFn("id"))
> .withUpdateMode(UpdateMode.PARTIAL)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-4389) Enable updates and upserts for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487245#comment-16487245
 ] 

Tim Robertson edited comment on BEAM-4389 at 5/23/18 1:47 PM:
--

I was just pondering that [~echauchot].
[edited response follows]

Default behaviour when explicitly controlling the document ID is a full 
document upsert already (create or replace doc). This will add partial updates 
only.

Elasticsearch also has the notion of scripted updates (useful for e.g. 
incrementing counters) which I don't propose we support.



was (Author: timrobertson100):
I was just pondering that [~echauchot] - what about something like (pseudo 
code):

{code}
// Mode being "partial update" or "document upsert", if not set default is 
"insert"
public Write withUpdateMode(Mode mode)
{code}

It looks like v2 and v5 both support this. I'd favour that we only support 
insert (default), partial update, and document upsert  (i.e. not support 
scripted upserts)

Thanks for the input
 

> Enable updates and upserts for Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different 
> categories of information of the target entity (e.g. one for taxonomic 
> processing, another for geospatial processing). A read and merge is not 
> possible inside the batch call, meaning the only way to do it is through a 
> join. The join approach is slow, and also stops the ability to run a single 
> process in isolation (e.g. reprocess the geospatial component of all docs).
> Use of this configuration parameter has to be used in conjunction with 
> controlling the document ID (possible since BEAM-3201) to make sense.
> The client API would include a {{withUseUpdate(...)}} such as:
> {code}
> source.apply(
>   ElasticsearchIO.write()
> .withConnectionConfiguration(connectionConfiguration)
> .withIdFn(new ExtractValueFn("id"))
> .withUseUpdate(true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-4389) Enable updates and upserts for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487245#comment-16487245
 ] 

Tim Robertson edited comment on BEAM-4389 at 5/23/18 1:47 PM:
--

I was just pondering that [~echauchot].
[edited response follows]

Default behaviour when explicitly controlling the document ID is a full 
document upsert already (create or replace doc). This will add partial updates 
only.

Elasticsearch also has the notion of scripted updates (useful for e.g. 
incrementing counters) which I don't propose we support.

Thanks for the input



was (Author: timrobertson100):
I was just pondering that [~echauchot].
[edited response follows]

Default behaviour when explicitly controlling the document ID is a full 
document upsert already (create or replace doc). This will add partial updates 
only.

Elasticsearch also has the notion of scripted updates (useful for e.g. 
incrementing counters) which I don't propose we support.


> Enable updates and upserts for Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different 
> categories of information of the target entity (e.g. one for taxonomic 
> processing, another for geospatial processing). A read and merge is not 
> possible inside the batch call, meaning the only way to do it is through a 
> join. The join approach is slow, and also stops the ability to run a single 
> process in isolation (e.g. reprocess the geospatial component of all docs).
> Use of this configuration parameter has to be used in conjunction with 
> controlling the document ID (possible since BEAM-3201) to make sense.
> The client API would include a {{withUseUpdate(...)}} such as:
> {code}
> source.apply(
>   ElasticsearchIO.write()
> .withConnectionConfiguration(connectionConfiguration)
> .withIdFn(new ExtractValueFn("id"))
> .withUseUpdate(true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4389) Enable updates and upserts for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-4389:

Description: 
Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
updates rather than full document inserts. 

Rationale: We have the case where different pipelines process different 
categories of information of the target entity (e.g. one for taxonomic 
processing, another for geospatial processing). A read and merge is not 
possible inside the batch call, meaning the only way to do it is through a 
join. The join approach is slow, and also stops the ability to run a single 
process in isolation (e.g. reprocess the geospatial component of all docs).

Use of this configuration parameter has to be used in conjunction with 
controlling the document ID (possible since BEAM-3201) to make sense.

The client API would include a {{withUseUpdate(...)}} such as:

{code}
source.apply(
  ElasticsearchIO.write()
.withConnectionConfiguration(connectionConfiguration)
.withIdFn(new ExtractValueFn("id"))
.withUseUpdate(true)
{code}



  was:
Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
updates rather than full document inserts. 

Rationale: We have the case where different pipelines process different 
categories of information of the target entity (e.g. one for taxonomic 
processing, another for geospatial processing). A read and merge is not 
possible inside the batch call, meaning the only way to do it is through a 
join. The join approach is slow, and also stops the ability to run a single 
process in isolation (e.g. reprocess the geospatial component of all docs).

Use of this configuration parameter has to be used in conjunction with 
controlling the document ID (possible since BEAM-3201) to make sense.

The client API would include a {{withUpdateMode(...)}} such as:

{code}
source.apply(
  ElasticsearchIO.write()
.withConnectionConfiguration(connectionConfiguration)
.withIdFn(new ExtractValueFn("id"))
.withUpdateMode(UpdateMode.PARTIAL)
{code}




> Enable updates and upserts for Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different 
> categories of information of the target entity (e.g. one for taxonomic 
> processing, another for geospatial processing). A read and merge is not 
> possible inside the batch call, meaning the only way to do it is through a 
> join. The join approach is slow, and also stops the ability to run a single 
> process in isolation (e.g. reprocess the geospatial component of all docs).
> Use of this configuration parameter has to be used in conjunction with 
> controlling the document ID (possible since BEAM-3201) to make sense.
> The client API would include a {{withUseUpdate(...)}} such as:
> {code}
> source.apply(
>   ElasticsearchIO.write()
> .withConnectionConfiguration(connectionConfiguration)
> .withIdFn(new ExtractValueFn("id"))
> .withUseUpdate(true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4389) Enable updates and upserts for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-4389:

Description: 
Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
updates rather than full document inserts. 

Rationale: We have the case where different pipelines process different 
categories of information of the target entity (e.g. one for taxonomic 
processing, another for geospatial processing). A read and merge is not 
possible inside the batch call, meaning the only way to do it is through a 
join. The join approach is slow, and also stops the ability to run a single 
process in isolation (e.g. reprocess the geospatial component of all docs).

Use of this configuration parameter has to be used in conjunction with 
controlling the document ID (possible since BEAM-3201) to make sense.

The client API would include a {{withUpdateMode(...)}} such as:

{code}
source.apply(
  ElasticsearchIO.write()
.withConnectionConfiguration(connectionConfiguration)
.withIdFn(new ExtractValueFn("id"))
.withUpdateMode(UpdateMode.PARTIAL)
{code}



  was:
Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
updates rather than full document inserts. 

Rationale: We have the case where different pipelines process different 
categories of information of the target entity (e.g. one for taxonomic 
processing, another for geospatial processing). A read and merge is not 
possible inside the batch call, meaning the only way to do it is through a 
join. The join approach is slow, and also stops the ability to run a single 
process in isolation (e.g. reprocess the geospatial component of all docs).

Use of this configuration parameter has to be used in conjunction with 
controlling the document ID (possible since BEAM-3201) to make sense.

The client API would include a {{withUsePartialUpdate(true)}} such as:

{code}
source.apply(
  ElasticsearchIO.write()
.withConnectionConfiguration(connectionConfiguration)
.withIdFn(new ExtractValueFn("id"))
.withUpdateMode(UpdateMode.PARTIAL)
{code}




> Enable updates and upserts for Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different 
> categories of information of the target entity (e.g. one for taxonomic 
> processing, another for geospatial processing). A read and merge is not 
> possible inside the batch call, meaning the only way to do it is through a 
> join. The join approach is slow, and also stops the ability to run a single 
> process in isolation (e.g. reprocess the geospatial component of all docs).
> Use of this configuration parameter has to be used in conjunction with 
> controlling the document ID (possible since BEAM-3201) to make sense.
> The client API would include a {{withUpdateMode(...)}} such as:
> {code}
> source.apply(
>   ElasticsearchIO.write()
> .withConnectionConfiguration(connectionConfiguration)
> .withIdFn(new ExtractValueFn("id"))
> .withUpdateMode(UpdateMode.PARTIAL)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4389) Enable updates and upserts for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-4389:

Summary: Enable updates and upserts for Elasticsearch  (was: Enable partial 
updates for Elasticsearch)

> Enable updates and upserts for Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different
> categories of information of the target entity (e.g. one for taxonomic
> processing, another for geospatial processing). A read and merge is not
> possible inside the batch call, so the only way to do it is through a join.
> The join approach is slow, and it also prevents running a single process in
> isolation (e.g. reprocessing the geospatial component of all docs).
> This configuration parameter only makes sense when used in conjunction with
> controlling the document ID (possible since BEAM-3201).
> The client API would include a {{withUsePartialUpdate(true)}} such as:
> {code}
> source.apply(
>     ElasticsearchIO.write()
>         .withConnectionConfiguration(connectionConfiguration)
>         .withIdFn(new ExtractValueFn("id"))
>         .withUsePartialUpdate(true));
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4389) Enable partial updates for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487245#comment-16487245
 ] 

Tim Robertson commented on BEAM-4389:
-

I was just pondering that [~echauchot] - what about something like this
(pseudocode):

{code}
// Mode being "partial update" or "document upsert"; if not set, the default is "insert"
public Write withUpdateMode(Mode mode)
{code}

It looks like v2 and v5 both support this. I'd favour supporting only insert
(default), partial update, and document upsert (i.e. not scripted upserts).
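
To make that concrete, here is a rough sketch of how the bulk entry could be
built per mode (the {{Mode}} enum and {{bulkEntry}} helper are illustrative
names only; the {{doc}} / {{doc_as_upsert}} wrapping follows the Elasticsearch
bulk API):

{code}
enum Mode { INSERT, PARTIAL_UPDATE, DOCUMENT_UPSERT }

static String bulkEntry(Mode mode, String documentAddress, String document) {
  switch (mode) {
    case PARTIAL_UPDATE:
      // an "update" action with a "doc" body performs a partial update
      return String.format("{ \"update\" : %s }%n{ \"doc\" : %s }%n", documentAddress, document);
    case DOCUMENT_UPSERT:
      // "doc_as_upsert" : true makes Elasticsearch insert the document when it is absent
      return String.format(
          "{ \"update\" : %s }%n{ \"doc\" : %s, \"doc_as_upsert\" : true }%n",
          documentAddress, document);
    default:
      // INSERT keeps today's behaviour: a plain "index" action
      return String.format("{ \"index\" : %s }%n%s%n", documentAddress, document);
  }
}
{code}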

Thanks for the input
 

> Enable partial updates for Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different
> categories of information of the target entity (e.g. one for taxonomic
> processing, another for geospatial processing). A read and merge is not
> possible inside the batch call, so the only way to do it is through a join.
> The join approach is slow, and it also prevents running a single process in
> isolation (e.g. reprocessing the geospatial component of all docs).
> This configuration parameter only makes sense when used in conjunction with
> controlling the document ID (possible since BEAM-3201).
> The client API would include a {{withUsePartialUpdate(true)}} such as:
> {code}
> source.apply(
>     ElasticsearchIO.write()
>         .withConnectionConfiguration(connectionConfiguration)
>         .withIdFn(new ExtractValueFn("id"))
>         .withUsePartialUpdate(true));
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-4389) Enable partial updates for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487085#comment-16487085
 ] 

Tim Robertson edited comment on BEAM-4389 at 5/23/18 11:38 AM:
---

Thanks for the quick reply [~echauchot]

The {{withUsePartialUpdate(true)}} would simply change the {{bulk}} list sent
to ES to use {{update}} instead of {{index}} operations. Server-side,
Elasticsearch treats this as a "get document, apply edits, save document"
operation.

In our code I think it would be something as simple as exposing the 
configuration toggle and changing:
{code}
  batch.add(String.format("{ \"index\" : %s }%n%s%n", documentAddress, document));
{code}

to

{code}
  String operation = spec.isPartialUpdate() ? "update" : "index";
  batch.add(String.format("{ \"%s\" : %s }%n%s%n", operation, documentAddress, document));
{code}
 
New fields being introduced and schema compatibility seem no different to the
current model (you can already push nonsense JSON to a live Elasticsearch
today). Or am I overlooking something, please?

Edited to add: I'd probably include a {{"_retry_on_conflict" : 5}} or similar
for the updates as well, to improve handling of potential race conditions.
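
For reference (illustrative values only), a partial-update entry on the wire
would then look something like:

{code}
{ "update" : { "_index" : "my-index", "_type" : "doc", "_id" : "1", "_retry_on_conflict" : 5 } }
{ "doc" : { "location" : { "lat" : 55.68, "lon" : 12.57 } } }
{code}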


was (Author: timrobertson100):
Thanks for the quick reply [~echauchot]

The {{withUsePartialUpdate(true)}} would simply change the {{bulk}} list sent 
to ES to have {{update}} instead of {{index}} operations. Server side 
Elasticsearch treats this as a "get document, apply edits, save document" 
operation.

In our code I think it would be something as simple as exposing the 
configuration toggle and changing:
{code}
  batch.add(String.format("{ \"index\" : %s }%n%s%n", documentAddress, document));
{code}

to

{code}
  String operation = spec.isPartialUpdate() ? "update" : "index";
  batch.add(String.format("{ \"%s\" : %s }%n%s%n", operation, documentAddress, document));
{code}
 
New fields being introduced and schema compatibility seem no different to the 
current model (you can push nonsense JSON to a live Elasticsearch using today). 
Or am I overlooking something please? 

Edited to add: I'd probably include a {{"_retry_on_conflict" : 5}} or similar 
for the updates as well

> Enable partial updates for Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different
> categories of information of the target entity (e.g. one for taxonomic
> processing, another for geospatial processing). A read and merge is not
> possible inside the batch call, so the only way to do it is through a join.
> The join approach is slow, and it also prevents running a single process in
> isolation (e.g. reprocessing the geospatial component of all docs).
> This configuration parameter only makes sense when used in conjunction with
> controlling the document ID (possible since BEAM-3201).
> The client API would include a {{withUsePartialUpdate(true)}} such as:
> {code}
> source.apply(
>     ElasticsearchIO.write()
>         .withConnectionConfiguration(connectionConfiguration)
>         .withIdFn(new ExtractValueFn("id"))
>         .withUsePartialUpdate(true));
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-4389) Enable partial updates for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487085#comment-16487085
 ] 

Tim Robertson edited comment on BEAM-4389 at 5/23/18 11:37 AM:
---

Thanks for the quick reply [~echauchot]

The {{withUsePartialUpdate(true)}} would simply change the {{bulk}} list sent 
to ES to have {{update}} instead of {{index}} operations. Server side 
Elasticsearch treats this as a "get document, apply edits, save document" 
operation.

In our code I think it would be something as simple as exposing the 
configuration toggle and changing:
{code}
  batch.add(String.format("{ \"index\" : %s }%n%s%n", documentAddress, document));
{code}

to

{code}
  String operation = spec.isPartialUpdate() ? "update" : "index";
  batch.add(String.format("{ \"%s\" : %s }%n%s%n", operation, documentAddress, document));
{code}
 
New fields being introduced and schema compatibility seem no different to the 
current model (you can push nonsense JSON to a live Elasticsearch using today). 
Or am I overlooking something please? 

Edited to add: I'd probably include a {{"_retry_on_conflict" : 5}} or similar 
for the updates as well


was (Author: timrobertson100):
Thanks for the quick reply [~echauchot]

The {{withUsePartialUpdate(true)}} would simply change the {{bulk}} list sent 
to ES to have {{update}} instead of {{index}} operations. Server side 
Elasticsearch treats this as a "get document, apply edits, save document" 
operation.

In our code I think it would be something as simple as exposing the 
configuration toggle and changing:
{code}
  batch.add(String.format("{ \"index\" : %s }%n%s%n", documentAddress, document));
{code}

to

{code}
  String operation = spec.isPartialUpdate() ? "update" : "index";
  batch.add(String.format("{ \"%s\" : %s }%n%s%n", operation, documentAddress, document));
{code}
 
New fields being introduced and schema compatibility seem no different to the 
current model (you can push nonsense JSON to a live Elasticsearch using today). 
Or am I overlooking something please? 

Edited to add: I'd probably include a {{"_retry_on_conflict" : 5}} or similar 
as well

> Enable partial updates for Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different
> categories of information of the target entity (e.g. one for taxonomic
> processing, another for geospatial processing). A read and merge is not
> possible inside the batch call, so the only way to do it is through a join.
> The join approach is slow, and it also prevents running a single process in
> isolation (e.g. reprocessing the geospatial component of all docs).
> This configuration parameter only makes sense when used in conjunction with
> controlling the document ID (possible since BEAM-3201).
> The client API would include a {{withUsePartialUpdate(true)}} such as:
> {code}
> source.apply(
>     ElasticsearchIO.write()
>         .withConnectionConfiguration(connectionConfiguration)
>         .withIdFn(new ExtractValueFn("id"))
>         .withUsePartialUpdate(true));
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-4389) Enable partial updates for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487085#comment-16487085
 ] 

Tim Robertson edited comment on BEAM-4389 at 5/23/18 11:37 AM:
---

Thanks for the quick reply [~echauchot]

The {{withUsePartialUpdate(true)}} would simply change the {{bulk}} list sent 
to ES to have {{update}} instead of {{index}} operations. Server side 
Elasticsearch treats this as a "get document, apply edits, save document" 
operation.

In our code I think it would be something as simple as exposing the 
configuration toggle and changing:
{code}
  batch.add(String.format("{ \"index\" : %s }%n%s%n", documentAddress, document));
{code}

to

{code}
  String operation = spec.isPartialUpdate() ? "update" : "index";
  batch.add(String.format("{ \"%s\" : %s }%n%s%n", operation, documentAddress, document));
{code}
 
New fields being introduced and schema compatibility seem no different to the 
current model (you can push nonsense JSON to a live Elasticsearch using today). 
Or am I overlooking something please? 

Edited to add: I'd probably include a {{"_retry_on_conflict" : 5}} or similar 
as well


was (Author: timrobertson100):
Thanks for the quick reply [~echauchot]

The {{withUsePartialUpdate(true)}} would simply change the {{bulk}} list sent 
to ES to have {{update}} instead of {{index}} operations. Server side 
Elasticsearch treats this as a "get document, apply edits, save document" 
operation.

In our code I think it would be something as simple as exposing the 
configuration toggle and changing:
{code}
  batch.add(String.format("{ \"index\" : %s }%n%s%n", documentAddress, document));
{code}

to

{code}
  String operation = spec.isPartialUpdate() ? "update" : "index";
  batch.add(String.format("{ \"%s\" : %s }%n%s%n", operation, documentAddress, document));
{code}
 
New fields being introduced and schema compatibility seem no different to the 
current model (you can push nonsense JSON to a live Elasticsearch using today). 
Or am I overlooking something please? 

> Enable partial updates for Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different
> categories of information of the target entity (e.g. one for taxonomic
> processing, another for geospatial processing). A read and merge is not
> possible inside the batch call, so the only way to do it is through a join.
> The join approach is slow, and it also prevents running a single process in
> isolation (e.g. reprocessing the geospatial component of all docs).
> This configuration parameter only makes sense when used in conjunction with
> controlling the document ID (possible since BEAM-3201).
> The client API would include a {{withUsePartialUpdate(true)}} such as:
> {code}
> source.apply(
>     ElasticsearchIO.write()
>         .withConnectionConfiguration(connectionConfiguration)
>         .withIdFn(new ExtractValueFn("id"))
>         .withUsePartialUpdate(true));
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-4389) Enable partial updates for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487085#comment-16487085
 ] 

Tim Robertson edited comment on BEAM-4389 at 5/23/18 11:09 AM:
---

Thanks for the quick reply [~echauchot]

The {{withUsePartialUpdate(true)}} would simply change the {{bulk}} list sent 
to ES to have {{update}} instead of {{index}} operations. Server side 
Elasticsearch treats this as a "get document, apply edits, save document" 
operation.

In our code I think it would be something as simple as exposing the 
configuration toggle and changing:
{code}
  batch.add(String.format("{ \"index\" : %s }%n%s%n", documentAddress, document));
{code}

to

{code}
  String operation = spec.isPartialUpdate() ? "update" : "index";
  batch.add(String.format("{ \"%s\" : %s }%n%s%n", operation, documentAddress, document));
{code}
 
New fields being introduced and schema compatibility seem no different to the 
current model (you can push nonsense JSON to a live Elasticsearch using today). 
Or am I overlooking something please? 


was (Author: timrobertson100):
Thanks for the quick reply [~echauchot]

The {withUsePartialUpdate(true)} would simply change the {bulk} list sent to ES 
to have {update} instead of {index} operations. Server side Elasticsearch 
treats this as a "get document, apply edits, save document" operation.

In our code I think it would be something as simple as exposing the 
configuration toggle and changing:
{code}
  batch.add(String.format("{ \"index\" : %s }%n%s%n", documentAddress, document));
{code}

to

{code}
  String operation = spec.isPartialUpdate() ? "update" : "index";
  batch.add(String.format("{ \"%s\" : %s }%n%s%n", operation, documentAddress, document));
{code}
 
New fields being introduced and schema compatibility seem no different to the 
current model (you can push nonsense JSON to a live Elasticsearch using today). 
Or am I overlooking something please? 

> Enable partial updates for Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different
> categories of information of the target entity (e.g. one for taxonomic
> processing, another for geospatial processing). A read and merge is not
> possible inside the batch call, so the only way to do it is through a join.
> The join approach is slow, and it also prevents running a single process in
> isolation (e.g. reprocessing the geospatial component of all docs).
> This configuration parameter only makes sense when used in conjunction with
> controlling the document ID (possible since BEAM-3201).
> The client API would include a {{withUsePartialUpdate(true)}} such as:
> {code}
> source.apply(
>     ElasticsearchIO.write()
>         .withConnectionConfiguration(connectionConfiguration)
>         .withIdFn(new ExtractValueFn("id"))
>         .withUsePartialUpdate(true));
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4389) Enable partial updates for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487085#comment-16487085
 ] 

Tim Robertson commented on BEAM-4389:
-

Thanks for the quick reply [~echauchot]

The {{withUsePartialUpdate(true)}} would simply change the {{bulk}} list sent
to ES to have {{update}} instead of {{index}} operations. Server side
Elasticsearch treats this as a "get document, apply edits, save document"
operation.

In our code I think it would be something as simple as exposing the 
configuration toggle and changing:
{code}
  batch.add(String.format("{ \"index\" : %s }%n%s%n", documentAddress, document));
{code}

to

{code}
  String operation = spec.isPartialUpdate() ? "update" : "index";
  batch.add(String.format("{ \"%s\" : %s }%n%s%n", operation, documentAddress, document));
{code}
 
New fields being introduced and schema compatibility seem no different to the 
current model (you can push nonsense JSON to a live Elasticsearch using today). 
Or am I overlooking something please? 

> Enable partial updates for Elasticsearch
> 
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-elasticsearch
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have the case where different pipelines process different
> categories of information of the target entity (e.g. one for taxonomic
> processing, another for geospatial processing). A read and merge is not
> possible inside the batch call, so the only way to do it is through a join.
> The join approach is slow, and it also prevents running a single process in
> isolation (e.g. reprocessing the geospatial component of all docs).
> This configuration parameter only makes sense when used in conjunction with
> controlling the document ID (possible since BEAM-3201).
> The client API would include a {{withUsePartialUpdate(true)}} such as:
> {code}
> source.apply(
>     ElasticsearchIO.write()
>         .withConnectionConfiguration(connectionConfiguration)
>         .withIdFn(new ExtractValueFn("id"))
>         .withUsePartialUpdate(true));
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4389) Enable partial updates for Elasticsearch

2018-05-23 Thread Tim Robertson (JIRA)
Tim Robertson created BEAM-4389:
---

 Summary: Enable partial updates for Elasticsearch
 Key: BEAM-4389
 URL: https://issues.apache.org/jira/browse/BEAM-4389
 Project: Beam
  Issue Type: New Feature
  Components: io-java-elasticsearch
Affects Versions: 2.4.0
Reporter: Tim Robertson
Assignee: Tim Robertson


Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
updates rather than full document inserts. 

Rationale: We have the case where different pipelines process different
categories of information of the target entity (e.g. one for taxonomic
processing, another for geospatial processing). A read and merge is not
possible inside the batch call, so the only way to do it is through a join.
The join approach is slow, and it also prevents running a single process in
isolation (e.g. reprocessing the geospatial component of all docs).

This configuration parameter only makes sense when used in conjunction with
controlling the document ID (possible since BEAM-3201).

The client API would include a {{withUsePartialUpdate(true)}} such as:

{code}
source.apply(
    ElasticsearchIO.write()
        .withConnectionConfiguration(connectionConfiguration)
        .withIdFn(new ExtractValueFn("id"))
        .withUsePartialUpdate(true));
{code}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (BEAM-4350) Enforce ErrorProne analysis in the mqtt IO project

2018-05-18 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson closed BEAM-4350.
---
   Resolution: Duplicate
Fix Version/s: 2.5.0

The title says MQTT but the body refers to MongoDB; issues for both already
exist elsewhere.

I suspect this one was just created in error when opening many issues as
clones.

CC [~swegner] in case I'm overlooking something.

> Enforce ErrorProne analysis in the mqtt IO project
> --
>
> Key: BEAM-4350
> URL: https://issues.apache.org/jira/browse/BEAM-4350
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-mongodb
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
> Fix For: 2.5.0
>
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is to clean ErrorProne warnings and add enforcement in 
> {{beam-sdks-java-io-mongodb}}. Additional context discussed on the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-sdks-java-io-mongodb:assemble}}
> # Fix each ErrorProne warning from the {{sdks/java/io/mongodb}} project.
> # In {{sdks/java/io/mongodb/build.gradle}}, add {{failOnWarning: true}} to 
> the call to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4361) Document usage of HBase TableSnapshotInputFormat

2018-05-18 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-4361:

Description: 
Add a paragraph demonstrating the usage of {{TableSnapshotInputFormat}} as a 
mechanism for doing efficient full scans over HBase to 
https://beam.apache.org/documentation/io/built-in/hadoop/

Typically in MR / Spark this yields up to a 3-4x improvement over hitting
region servers directly, and it keeps load (GC etc.) off those services.

I have tested it and have an example ready.

  was:
Add a paragraph demonstrating the usage of {{TableSnapshotInputFormat }} as a 
mechanism for doing efficient full scans over HBase to 
https://beam.apache.org/documentation/io/built-in/hadoop/

Typically in MR / Spark this yields up to a 3-4x improvement over hitting
region servers directly, and it keeps load (GC etc.) off those services.

I have tested it and have an example ready.


> Document usage of HBase TableSnapshotInputFormat 
> -
>
> Key: BEAM-4361
> URL: https://issues.apache.org/jira/browse/BEAM-4361
> Project: Beam
>  Issue Type: Task
>  Components: website
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Trivial
>
> Add a paragraph demonstrating the usage of {{TableSnapshotInputFormat}} as a 
> mechanism for doing efficient full scans over HBase to 
> https://beam.apache.org/documentation/io/built-in/hadoop/
> Typically in MR / Spark this yields up to a 3-4x improvement over hitting
> region servers directly, and it keeps load (GC etc.) off those services.
> I have tested it and have an example ready.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4361) Document usage of HBase TableSnapshotInputFormat

2018-05-18 Thread Tim Robertson (JIRA)
Tim Robertson created BEAM-4361:
---

 Summary: Document usage of HBase TableSnapshotInputFormat 
 Key: BEAM-4361
 URL: https://issues.apache.org/jira/browse/BEAM-4361
 Project: Beam
  Issue Type: Task
  Components: website
Affects Versions: 2.4.0
Reporter: Tim Robertson
Assignee: Tim Robertson


Add a paragraph demonstrating the usage of {{TableSnapshotInputFormat }} as a 
mechanism for doing efficient full scans over HBase to 
https://beam.apache.org/documentation/io/built-in/hadoop/

Typically in MR / Spark this yields up to a 3-4x improvement over hitting
region servers directly, and it keeps load (GC etc.) off those services.

I have tested it and have an example ready.
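
For reference, a rough sketch of the intended usage via {{HadoopInputFormatIO}}
(the snapshot name, paths and surrounding pipeline are illustrative only):

{code}
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.rootdir", "hdfs:///hbase");
conf.setClass("mapreduce.job.inputformat.class", TableSnapshotInputFormat.class, InputFormat.class);
conf.setClass("key.class", ImmutableBytesWritable.class, Object.class);
conf.setClass("value.class", Result.class, Object.class);
// "my_snapshot" must already exist; the second argument is a scratch dir used to restore it
TableSnapshotInputFormatImpl.setInput(conf, "my_snapshot", new Path("/tmp/snapshot-restore"));

PCollection<KV<ImmutableBytesWritable, Result>> rows = pipeline.apply(
    HadoopInputFormatIO.<ImmutableBytesWritable, Result>read().withConfiguration(conf));
{code}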



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4260) Document usage for hcatalog 1.1

2018-05-18 Thread Tim Robertson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480680#comment-16480680
 ] 

Tim Robertson commented on BEAM-4260:
-

Thanks [~iemejia] - I will do that.

How would you feel about a one-liner in the Javadoc too? I just know that is
where I'd look first.
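
For example, something as small as (illustrative wording only):

{code}
/**
 * ...
 * <p>Note: requires HCatalog/Hive 2.x on the classpath; for Hive 1.1
 * environments (e.g. CDH 5.x) see the workaround described in the README.
 */
{code}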

> Document usage for hcatalog 1.1
> ---
>
> Key: BEAM-4260
> URL: https://issues.apache.org/jira/browse/BEAM-4260
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-hcatalog, website
>Affects Versions: 2.4.0
>Reporter: Tim Robertson
>Assignee: Tim Robertson
>Priority: Minor
>
> The {{HCatalogIO}} does not work with environments providing Hive Server 1.x,
> which is in widespread use - as an example, the latest Cloudera (5.14.2)
> provides 1.1.x.
>  
> The {{HCatalogIO}} marks its Hive dependencies as provided, so I believe the 
> intention was to be open to multiple versions.
>  
> The issues come from the following:  
>  - use of {{HCatUtil.getHiveMetastoreClient(hiveConf)}} while previous 
> versions used the [now 
> deprecated|https://github.com/apache/hive/blob/master/hcatalog/core/src/main/java/org/apache/hive/hcatalog/common/HCatUtil.java#L586]
>  {{getHiveClient(HiveConf hiveConf)}}  
>  - Changes to the signature of {{RetryingMetaStoreClient.getProxy(...)}}
>  
> Given this doesn't work in a major Hadoop distro, and will not until the next
> CDH release later in 2018 (i.e. widespread adoption only expected in 2019), I
> think it would be worthwhile providing a fix/workaround.
> I _think_ building for 2.3 and relocating in your own app might be a
> workaround, although I'm still testing it. If that is successful, I'd propose
> adding it to the project README or to a separate markdown file linked from
> the README.
> Does that sound like a reasonable approach please?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4342) Enforce ErrorProne analysis in the hadoop IO projects

2018-05-18 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-4342:
---

Assignee: Tim Robertson

> Enforce ErrorProne analysis in the hadoop IO projects
> -
>
> Key: BEAM-4342
> URL: https://issues.apache.org/jira/browse/BEAM-4342
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-hadoop
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is clean ErrorProne warnings and add enforcement in 
> {{beam-sdks-java-io-hadoop-*}}. Additional context discussed on the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-sdks-java-io-hadoop-common:assemble 
> :beam-sdks-java-io-hadoop-file-system:assemble 
> :beam-sdks-java-io-hadoop-input-format:assemble}}
> # Fix each ErrorProne warning from the {{sdks/java/io/hadoop*}} projects.
> # In {{sdks/java/io/hadoop-common/build.gradle}}, 
> {{sdks/java/io/hadoop-file-system/build.gradle}}, and 
> {{sdks/java/io/hadoop-input-format/build.gradle}}, add {{failOnWarning: 
> true}} to the call to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4335) Enforce ErrorProne analysis in the amazon-web-services IO project

2018-05-18 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-4335:
---

Assignee: Tim Robertson

> Enforce ErrorProne analysis in the amazon-web-services IO project
> -
>
> Key: BEAM-4335
> URL: https://issues.apache.org/jira/browse/BEAM-4335
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-aws
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is to clean ErrorProne warnings and add enforcement in 
> {{beam-sdks-java-io-amazon-web-services}}. Additional context discussed on 
> the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-sdks-java-io-amazon-web-services:assemble}}
> # Fix each ErrorProne warning from the {{sdks/java/io/amazon-web-services}} 
> project.
> # In {{sdks/java/io/amazon-web-services/build.gradle}}, add {{failOnWarning: 
> true}} to the call to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4339) Enforce ErrorProne analysis in the elasticsearch IO project

2018-05-18 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-4339:
---

Assignee: Tim Robertson

> Enforce ErrorProne analysis in the elasticsearch IO project
> ---
>
> Key: BEAM-4339
> URL: https://issues.apache.org/jira/browse/BEAM-4339
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is to clean ErrorProne warnings and add enforcement in 
> {{beam-sdks-java-io-elasticsearch}} and related project. Additional context 
> discussed on the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-sdks-java-io-elasticsearch:assemble 
> :beam-sdks-java-io-elasticsearch-tests-2:assemble 
> :beam-sdks-java-io-elasticsearch-tests-5:assemble 
> :beam-sdks-java-io-elasticsearch-tests-common:assemble}}
> # Fix each ErrorProne warning from the {{sdks/java/io/elasticsearch*}} 
> projects.
> # In {{sdks/java/io/elasticsearch/build.gradle}}, 
> {{sdks/java/io/elasticsearch-tests/elasticsearch-tests-2/build.gradle}}, and 
> {{sdks/java/io/elasticsearch-tests/elasticsearch-tests-5/build.gradle}}, and 
> {{sdks/java/io/elasticsearch-tests/elasticsearch-tests-common/build.gradle}}, 
> add {{failOnWarning: true}} to the call to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4341) Enforce ErrorProne analysis in the google-cloud-platform IO project

2018-05-18 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-4341:
---

Assignee: Tim Robertson

> Enforce ErrorProne analysis in the google-cloud-platform IO project
> ---
>
> Key: BEAM-4341
> URL: https://issues.apache.org/jira/browse/BEAM-4341
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is to clean ErrorProne warnings and add enforcement in 
> {{beam-sdks-java-io-google-cloud-platform}}. Additional context discussed on 
> the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-sdks-java-io-google-cloud-platform:assemble}}
> # Fix each ErrorProne warning from the {{sdks/java/io/google-cloud-platform}} 
> project.
> # In {{sdks/java/io/google-cloud-platform/build.gradle}}, add 
> {{failOnWarning: true}} to the call to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4340) Enforce ErrorProne analysis in the file-based-io-tests project

2018-05-18 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-4340:
---

Assignee: Tim Robertson

> Enforce ErrorProne analysis in the file-based-io-tests project
> --
>
> Key: BEAM-4340
> URL: https://issues.apache.org/jira/browse/BEAM-4340
> Project: Beam
>  Issue Type: Improvement
>  Components: io-ideas
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is to clean ErrorProne warnings and add enforcement in 
> {{beam-sdks-java-io-file-based-io-tests}}. Additional context discussed on 
> the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-sdks-java-io-file-based-io-tests:assemble}}
> # Fix each ErrorProne warning from the {{sdks/java/io/file-based-io-tests}} 
> project.
> # In {{sdks/java/io/file-based-io-tests/build.gradle}}, add {{failOnWarning: 
> true}} to the call to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4337) Enforce ErrorProne analysis in the Cassandra IO project

2018-05-17 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-4337:
---

Assignee: Tim Robertson

> Enforce ErrorProne analysis in the Cassandra IO project
> ---
>
> Key: BEAM-4337
> URL: https://issues.apache.org/jira/browse/BEAM-4337
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-cassandra
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is to clean ErrorProne warnings and add enforcement in 
> {{beam-sdks-java-io-cassandra}}. Additional context discussed on the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-sdks-java-io-cassandra:assemble}}
> # Fix each ErrorProne warning from the {{sdks/java/io/cassandra}} project.
> # In {{sdks/java/io/cassandra/build.gradle}}, add {{failOnWarning: true}} to 
> the call to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4338) Enforce ErrorProne analysis in the common IO project

2018-05-17 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-4338:
---

Assignee: Tim Robertson

> Enforce ErrorProne analysis in the common IO project
> 
>
> Key: BEAM-4338
> URL: https://issues.apache.org/jira/browse/BEAM-4338
> Project: Beam
>  Issue Type: Improvement
>  Components: io-ideas
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is to clean ErrorProne warnings and add enforcement in 
> {{beam-sdks-java-io-common}}. Additional context discussed on the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-sdks-java-io-common:assemble}}
> # Fix each ErrorProne warning from the {{sdks/java/io/common}} project.
> # In {{sdks/java/io/common/build.gradle}}, add {{failOnWarning: true}} to the 
> call to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4355) Enforce ErrorProne analysis in the xml IO project

2018-05-17 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-4355:
---

Assignee: Tim Robertson

> Enforce ErrorProne analysis in the xml IO project
> -
>
> Key: BEAM-4355
> URL: https://issues.apache.org/jira/browse/BEAM-4355
> Project: Beam
>  Issue Type: Improvement
>  Components: io-ideas
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is to clean ErrorProne warnings and add enforcement in 
> {{beam-sdks-java-io-xml}}. Additional context discussed on the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-sdks-java-io-xml:assemble}}
> # Fix each ErrorProne warning from the {{sdks/java/io/xml}} project.
> # In {{sdks/java/io/xml/build.gradle}}, add {{failOnWarning: true}} to the 
> call to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4354) Enforce ErrorProne analysis in the tika IO project

2018-05-17 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-4354:
---

Assignee: Tim Robertson

> Enforce ErrorProne analysis in the tika IO project
> --
>
> Key: BEAM-4354
> URL: https://issues.apache.org/jira/browse/BEAM-4354
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-tika
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is to clean ErrorProne warnings and add enforcement in 
> {{beam-sdks-java-io-tika}}. Additional context discussed on the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-sdks-java-io-tika:assemble}}
> # Fix each ErrorProne warning from the {{sdks/java/io/tika}} project.
> # In {{sdks/java/io/tika/build.gradle}}, add {{failOnWarning: true}} to the 
> call to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4352) Enforce ErrorProne analysis in the redis IO project

2018-05-17 Thread Tim Robertson (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson reassigned BEAM-4352:
---

Assignee: Tim Robertson

> Enforce ErrorProne analysis in the redis IO project
> ---
>
> Key: BEAM-4352
> URL: https://issues.apache.org/jira/browse/BEAM-4352
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-redis
>Reporter: Scott Wegner
>Assignee: Tim Robertson
>Priority: Minor
>  Labels: errorprone, starter
>
> Java ErrorProne static analysis was [recently 
> enabled|https://github.com/apache/beam/pull/5161] in the Gradle build 
> process, but only as warnings. ErrorProne errors are generally useful and 
> easy to fix. Some work was done to [make sdks-java-core 
> ErrorProne-clean|https://github.com/apache/beam/pull/5319] and add 
> enforcement. This task is to clean ErrorProne warnings and add enforcement in 
> {{beam-sdks-java-io-redis}}. Additional context discussed on the [dev 
> list|https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E].
> Fixing this issue will involve:
> # Follow instructions in the [Contribution 
> Guide|https://beam.apache.org/contribute/] to set up a {{beam}} development 
> environment.
> # Run the following command to compile and run ErrorProne analysis on the 
> project: {{./gradlew :beam-sdks-java-io-redis:assemble}}
> # Fix each ErrorProne warning from the {{sdks/java/io/redis}} project.
> # In {{sdks/java/io/redis/build.gradle}}, add {{failOnWarning: true}} to the 
> call to {{applyJavaNature()}} 
> ([example|https://github.com/apache/beam/pull/5319/files#diff-9390c20635aed5f42f83b97506a87333R20]).
> This starter issue is sponsored by [~swegner]. Feel free to [reach 
> out|https://beam.apache.org/community/contact-us/] with questions or code 
> review:
> * JIRA: [~swegner]
> * GitHub: [@swegner|https://github.com/swegner]
> * Slack: [@Scott Wegner|https://s.apache.org/beam-slack-channel]
> * Email: swegner at google dot com



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

