Re: apache-beam-jenkins-15 out of disk

2019-07-03 Thread Yifan Zou
I reimaged beam15. The worker is re-enabled. Let us know if anything
weird happens on any agent.

Thanks.
Yifan

On Mon, Jul 1, 2019 at 10:00 AM Yifan Zou  wrote:

> https://issues.apache.org/jira/browse/BEAM-7650 tracks the docker issue.
>
> On Sun, Jun 30, 2019 at 2:35 PM Mark Liu  wrote:
>
>> Thank you for triaging and working out a solution, Yifan and Ankur.
>>
>> Ankur, from what you discovered, we should fix this race condition,
>> otherwise the same problem will happen again in the future. Is there a JIRA
>> tracking this issue?
>>
>> On Fri, Jun 28, 2019 at 4:56 PM Yifan Zou  wrote:
>>
>>> Sorry for the inconvenience. I disabled the worker. I'll need more time
>>> to restore it.
>>>
>>> On Fri, Jun 28, 2019 at 3:56 PM Daniel Oliveira 
>>> wrote:
>>>
 Any updates to this issue today? It seems like this (or a similar bug)
 is still happening across many Pre and Postcommits.

 On Fri, Jun 28, 2019 at 12:33 AM Yifan Zou  wrote:

> I did the prune on beam15. The disk was freed, but all jobs failed with
> other weird problems. It looks like the docker prune was too aggressive, but I don't have
> evidence. Will look further in the morning.
>
> On Thu, Jun 27, 2019 at 11:20 PM Udi Meiri  wrote:
>
>> See how the hdfs IT already avoids tag collisions.
>>
>> On Thu, Jun 27, 2019, 20:42 Yichi Zhang  wrote:
>>
>>> For the flakiness, I guess a tag is needed to keep concurrent builds
>>> apart.
>>>
>>> On Thu, Jun 27, 2019 at 8:39 PM Yichi Zhang 
>>> wrote:
>>>
 Maybe a cron job on the Jenkins nodes that runs a docker prune every day?

 On Thu, Jun 27, 2019 at 6:58 PM Ankur Goenka 
 wrote:

> This highlights the race condition caused by using a single docker
> registry on a machine.
> If 2 tests create "jenkins-docker-apache.bintray.io/beam/python" one
> after another, then the 2nd one will replace the 1st and cause
> flakiness.
>
> Is there a way to dynamically create and destroy a docker repository
> on a machine and clean up all the relevant data?
>
> On Thu, Jun 27, 2019 at 3:15 PM Yifan Zou 
> wrote:
>
>> The problem was caused by the large quantity of stale docker
>> images generated by the Python portable tests and the HDFS IT.
>>
>> Dumping the docker disk usage gives me:
>>
>> TYPE            TOTAL   ACTIVE  SIZE      RECLAIMABLE
>> *Images         1039    356     424GB     384.2GB (90%)*
>> Containers      987     2       2.042GB   2.041GB (99%)
>> Local Volumes   126     0       392.8MB   392.8MB (100%)
>>
>> REPOSITORY                                     TAG     IMAGE ID      CREATED       SIZE     SHARED SIZE  UNIQUE SIZE  CONTAINERS
>> jenkins-docker-apache.bintray.io/beam/python3  latest  ff1b949f4442  22 hours ago  1.639GB  922.3MB      716.9MB      0
>> jenkins-docker-apache.bintray.io/beam/python   latest  1dda7b9d9748  22 hours ago  1.624GB  913.7MB      710.3MB      0
>> <none>                                         <none>  05458187a0e3  22 hours ago  732.9MB  625.1MB      107.8MB      4
>> <none>                                         <none>  896f35dd685f  23 hours ago  1.639GB  922.3MB      716.9MB      0
>> <none>                                         <none>  db4d24ca9f2b  23 hours ago  1.624GB  913.7MB      710.3MB      0
>> <none>                                         <none>  547df4d71c31  23 hours ago  732.9MB  625.1MB      107.8MB      4
>> <none>                                         <none>  dd7d9582c3e0  23 hours ago  1.639GB  922.3MB      716.9MB      0
>> <none>                                         <none>  664aae255239  23 hours ago  1.624GB  913.7MB      710.3MB      0
>> <none>                                         <none>  b528fedf9228  23 hours ago  732.9MB  625.1MB      107.8MB      4
>> <none>                                         <none>  8e996f22435e  25 hours ago  1.624GB

Python Utilities

2019-07-03 Thread Shannon Duncan
I have been writing a bunch of utilities for the Python SDK, such as joins,
selections, composite transforms, etc.

I am working with my company to see if I can open source the utilities.
Would it be best to post them in a separate PyPI project, or to PR them
into the Beam SDK? I assume that if they let me open source them they will
want some attribution or something like that.

Thanks,
Shannon


Re: Dataflow IT failures on Python being investigated internally

2019-07-03 Thread Udi Meiri
https://issues.apache.org/jira/browse/BEAM-7687

On Wed, Jul 3, 2019 at 11:57 AM Udi Meiri  wrote:

> It seems that at least some of the failures to start pipelines on DF were
> due to a CMEK misconfiguration.
>
> On Tue, Jul 2, 2019 at 6:45 PM Udi Meiri  wrote:
>
>> The failures are of the type where the pipeline fails very quickly (10
>> seconds) and there's a "Pipeline execution failed" or "Workflow failed"
>> error.
>>
>




Re: Dataflow IT failures on Python being investigated internally

2019-07-03 Thread Udi Meiri
It seems that at least some of the failures to start pipelines on DF were
due to a CMEK misconfiguration.

On Tue, Jul 2, 2019 at 6:45 PM Udi Meiri  wrote:

> The failures are of the type where the pipeline fails very quickly (10
> seconds) and there's a "Pipeline execution failed" or "Workflow failed"
> error.
>




[Discuss] Create stackoverflow tags for python, java and go SDKs?

2019-07-03 Thread Rui Wang
Hi Community,

When reading apache-beam related questions on Stack Overflow, it happens
that some questions only mention a version number (e.g. 2.8.0) but do not
mention which SDK they relate to. Sometimes I can tell which SDK it is from
code snippets; sometimes I cannot, as there is no code snippet. So in order
to answer those questions I first need to comment and ask which SDK.

I noticed that there is no tag for a specific SDK for Apache Beam. Adding
such tags would be helpful because:
1. Questions with such a tag say which SDK they are about.
2. If questions do not mention the SDK and lack such a tag, I (or anyone
else) can help tag them.

Note that creating tags is a privilege on Stack Overflow that requires >1500
reputation [1]. If people are generally OK with this idea, we will need to
ask for help in the community to see who is able to create the tags.


[1]: https://stackoverflow.com/help/privileges/create-tags

Rui


Re: [VOTE] Vendored dependencies release process

2019-07-03 Thread Jens Nyman
+1

On 2019/07/02 23:49:10, Lukasz Cwik  wrote:
> Please vote based on the vendored dependencies release process as
> discussed[1] and documented[2].
>
> Please vote as follows:
> +1: Adopt the vendored dependency release process
> -1: The vendored release process needs to change because ...
>
> Since many people in the US may be out due to the holiday schedule, I'll
> try to close the vote and tally the results on July 9th so please vote
> before then.
>
> 1:
> https://lists.apache.org/thread.html/e2c49a5efaee2ad416b083fbf3b9b6db60fdb04750208bfc34cecaf0@%3Cdev.beam.apache.org%3E
> 2: https://s.apache.org/beam-release-vendored-artifacts
>


Re: Change of Behavior - JDBC Set Command

2019-07-03 Thread Alireza Samadian
Before this PR, the set command was using a map to store values and then it
was using PipelineOptionsFactory#fromArgs to parse those values. Therefore,
by using PipelieOptionsFactory#parseObjects, we keep the same value parsing
behavior for the SET command as before. Using PipelineOptionsFactory for
parsing the objects also has two more advantages: It will prevent code
duplication for parsing objects, and PipelineOptionsFactory does some extra
checks (For instance checks if the runner is a valid type of runner). Thus,
I think parsing the values the same way as
PipelieOptionsFactory#parseObjects will be a better option.
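
(For readers skimming the thread, a minimal sketch of the value-parsing
behavior being preserved: the same string-to-object conversion and
validation that command-line arguments go through. The option value here is
just an example.)

    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class SetParsingSketch {
      public static void main(String[] args) {
        // A SQL `SET runner = 'DirectRunner'` reduces to the same
        // parsing/validation path as this command-line form:
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs("--runner=DirectRunner").create();
        // fromArgs checks that the value names a registered runner and
        // converts the string to the typed property.
        System.out.println(options.getRunner().getSimpleName());
      }
    }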


On Tue, Jul 2, 2019 at 3:50 PM Lukasz Cwik  wrote:

> I see, in the current PR it seems like we are trying to apply the parsing
> logic of PipelineOptions command-line value parsing to all SQL use cases,
> since we are exposing the parseOption method to be used in
> PipelineOptionsReflectionSetter#setOption.
>
> I should have asked in my earlier e-mail whether we want string-to-value
> parsing to match what we do inside of PipelineOptionsFactory. If not,
> then PipelineOptionsReflectionSetter#setOption should take an Object type
> for value instead of String.
>
> On Tue, Jul 2, 2019 at 9:39 AM Anton Kedin  wrote:
>
>> The proposed API assumes you already have a property name and a value
>> parsed somehow, and now want to update a field on a pre-existing options
>> object with that value, so there is no assumption about parsing being the
>> same or not. E.g. if you set a property called `runner` to a string value
>> `DirectRunner`, it should behave the same way whether it came from command
>> line args, SQL SET command, JDBC connection args, or anywhere else.
>>
>> That said, we parse the SET command differently from command-line args [1].
>> We also parse the pipeline options from the connection args [2], which have a
>> different syntax as well. I don't know whether we can easily deal with this
>> aspect at this point (or whether we should), but once a value has been
>> parsed, the idea is that it should behave the same way from then on.
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/codegen/includes/parserImpls.ftl#L307
>> [2]
>> https://github.com/apache/beam/blob/b2fd4e392ede19f03a48997252970b8bba8535f1/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/JdbcConnection.java#L82
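>>
>> To make "set a field by name on a pre-existing options object" concrete,
>> here is a rough sketch of the reflection approach (a hypothetical helper
>> for illustration, not the PipelineOptionsReflectionSetter API from the
>> PR; it assumes the value has already been parsed to the right type):
>>
>>     import java.lang.reflect.Method;
>>     import org.apache.beam.sdk.options.PipelineOptions;
>>
>>     class OptionsSetterSketch {
>>       // Look for a setter like setRunner(...) on the options proxy and
>>       // invoke it with the already-parsed value.
>>       static void setOption(PipelineOptions options, String name, Object value)
>>           throws ReflectiveOperationException {
>>         String setter =
>>             "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
>>         for (Method m : options.getClass().getMethods()) {
>>           if (m.getName().equals(setter) && m.getParameterCount() == 1) {
>>             m.invoke(options, value);
>>             return;
>>           }
>>         }
>>         throw new IllegalArgumentException("Unknown option: " + name);
>>       }
>>     }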
>>
>> On Fri, Jun 28, 2019 at 7:57 AM Lukasz Cwik  wrote:
>>
>>> Do we want SQL argument parsing to always be 1-1 with how command line
>>> parsing is being done?
>>> Note that this is different from the JSON <-> PipelineOptions conversion.
>>>
>>> I can see why the wrapper makes sense, just want to make sure that the
>>> JDBC SET command aligns with what we are trying to expose.
>>>
>>> On Thu, Jun 27, 2019 at 9:17 PM Anton Kedin  wrote:
>>>
 I think we thought about this approach but decided to get rid of the
 map representation wherever we can while still supporting setting of the
 options by name.

 One of the less important downsides of keeping the map around is that
 we will need to do `fromArgs` at least twice.

 Another downside is that we will probably have to keep and maintain two
 representations of the pipeline options at the same time and have extra
 validations and probably reconciliation logic.

 We need the map representation in the JDBC/command-line use case where
 it's the only way for a user to specify the options. A user runs a special
 SQL command which goes through normal parsing and execution logic. On top
 of that we have a case of mixed Java/SQL pipelines, where we already have
 an instance of PipelineOptions and don't need a user to set the options
 from within a query. Right now this is impossible for other reasons as
 well. But to support both JDBC and Java+SQL use cases we currently pass
 both a map and a PipelineOptions instance around, which makes things
 confusing. We can probably reduce the passing around, but I think we will
 still need to keep both representations.

 Ideally, I think, mixed Java+SQL pipelines should be backed by that
 same JDBC logic as much as possible. So potentially we should allow users
 to set the pipeline options from within a complicated query even in
 SqlTransform in a Java pipeline. However, setting an option from within SQL
 persists it in the map, but in the mixed case we already have the
 PipelineOptions instance that we got from the SqlTransform. So now we will
 need to maintain logic to reconcile the two representations. That will
 probably involve either something similar to the proposed reflection
 approach, or serializing both representations to a map or JSON, then
 reconciling and reconstructing from there. This sounds unnecessary,
 and we can avoid it if we are able to just set the pipeline options by
 name in the first place. In that 

Re: Stop using Perfkit Benchmarker tool in all tests?

2019-07-03 Thread Lukasz Cwik
Makes sense to me to move forward with your suggestion.

On Wed, Jul 3, 2019 at 3:57 AM Łukasz Gajowy 
wrote:

> Are there features in Perfkit that we would like to be using that we
>> aren't?
>>
>
> Besides the Kubernetes-related code I mentioned above (which, I believe,
> can be easily replaced), I don't see any added value in keeping Perfkit. The
> Kubernetes parts could be replaced with a set of fine-grained Gradle tasks
> invoked by other high-level tasks and Jenkins job steps. There also seem
> to be some Gradle + Kubernetes plugins out there that might prove useful
> here (though I haven't done solid research in that area).
>
>
>> Can we make the integration with Perfkit less brittle?
>>
>
> There was an idea to move all the Beam benchmark code from Perfkit
> (beam_benchmark_helper.py, beam_integration_benchmark.py) into the Beam
> repository and inject it into Perfkit every time we use it. However,
> that would require investing time and effort, and it would still not solve
> the problems I listed above. It would also still require Beam developers
> to know how Perfkit works, while we can avoid that and use the existing
> tools (Gradle, Jenkins).
>
> Thanks!
>
> On Fri, Jun 28, 2019 at 5:31 PM Lukasz Cwik  wrote:
>
>> +1 for removing tests that are not maintained.
>>
>> Are there features in Perfkit that we would like to be using that we
>> aren't?
>> Can we make the integration with Perfkit less brittle?
>>
>> If we aren't getting much and don't plan to get much value in the short
>> term, removal makes sense to me.
>>
>> On Thu, Jun 27, 2019 at 3:16 AM Łukasz Gajowy  wrote:
>>
>>> Hi all,
>>>
>>> moving the discussion to the dev list:
>>> https://github.com/apache/beam/pull/8919. I think that Perfkit
>>> Benchmarker should be removed from all our tests.
>>>
>>> Problems that we face currently:
>>>
>>>1. Changes to Gradle tasks/build configuration in the Beam codebase
>>>have to be reflected in Perfkit code. This requires PRs to Perfkit, which
>>>can take a while, and the tests sometimes break because of this (no change
>>>in Perfkit + change already there in Beam = incompatibility). This is what
>>>happened in PR 8919 (above),
>>>2. It can't run in Python 3 (it depends on Python 2-only libraries like
>>>functools32),
>>>3. It is black-box testing, which makes it hard to collect
>>>pipeline-related metrics,
>>>4. Measurement of run time is inaccurate,
>>>5. It offers relatively little flexibility in comparison with, e.g.,
>>>Jenkins tasks when setting up the testing infrastructure (runners,
>>>databases). For example, if we'd like to set up a Flink runner and reuse it
>>>in subsequent tests in one go, that would be impossible. We can easily do
>>>this in Jenkins.
>>>
>>> Tests that use Perfkit:
>>>
>>>1.  IO integration tests,
>>>2.  Python performance tests,
>>>3.  beam_PerformanceTests_Dataflow (disabled),
>>>4.  beam_PerformanceTests_Spark (constantly failing - looks
>>>unmaintained).
>>>
>>> From the IOIT perspective (1), only the code that sets up/tears down
>>> Kubernetes resources is useful right now, but these parts can easily be
>>> implemented in Jenkins/Gradle code. That would make Perfkit obsolete for
>>> IOIT because we already collect metrics using the Metrics API and store them in
>>> BigQuery directly.
>>>
>>> As for point 2: I have no knowledge of how complex the task would be
>>> (help needed).
>>>
>>> Regarding 3 and 4: those tests seem to be unmaintained - should we remove
>>> them?
>>>
>>> Opinions?
>>>
>>> Thank you,
>>> Łukasz
>>>
>>>
>>>
>>>
>>>


Re: Wiki access?

2019-07-03 Thread Lukasz Cwik
I have added you. Thanks for helping out with the docs.

On Wed, Jul 3, 2019 at 8:22 AM Ryan Skraba  wrote:

> Oof, sorry: ryanskraba
>
> Thanks in advance!  There's a lot of great info in there.
>
> On Wed, Jul 3, 2019 at 5:03 PM Lukasz Cwik  wrote:
>
>> Can you share your login id for cwiki.apache.org?
>>
>> On Wed, Jul 3, 2019 at 7:21 AM Ryan Skraba  wrote:
>>
>>> Hello -- I've been reading through a lot of Beam documentation recently,
>>> and noting minor typos here and there... Is it possible to get Wiki access
>>> to make fixes on the spot?
>>>
>>> Best regards, Ryan
>>>
>>


Re: Wiki access?

2019-07-03 Thread Ryan Skraba
Oof, sorry: ryanskraba

Thanks in advance!  There's a lot of great info in there.

On Wed, Jul 3, 2019 at 5:03 PM Lukasz Cwik  wrote:

> Can you share your login id for cwiki.apache.org?
>
> On Wed, Jul 3, 2019 at 7:21 AM Ryan Skraba  wrote:
>
>> Hello -- I've been reading through a lot of Beam documentation recently,
>> and noting minor typos here and there... Is it possible to get Wiki access
>> to make fixes on the spot?
>>
>> Best regards, Ryan
>>
>


Re: Wiki access?

2019-07-03 Thread Lukasz Cwik
Can you share your login id for cwiki.apache.org?

On Wed, Jul 3, 2019 at 7:21 AM Ryan Skraba  wrote:

> Hello -- I've been reading through a lot of Beam documentation recently,
> and noting minor typos here and there... Is it possible to get Wiki access
> to make fixes on the spot?
>
> Best regards, Ryan
>


Wiki access?

2019-07-03 Thread Ryan Skraba
Hello -- I've been reading through a lot of Beam documentation recently,
and noting minor typos here and there... Is it possible to get Wiki access
to make fixes on the spot?

Best regards, Ryan


Re: Stop using Perfkit Benchmarker tool in all tests?

2019-07-03 Thread Łukasz Gajowy
>
> Are there features in Perfkit that we would like to be using that we
> aren't?
>

Besides the Kubernetes-related code I mentioned above (which, I believe, can
be easily replaced), I don't see any added value in keeping Perfkit. The
Kubernetes parts could be replaced with a set of fine-grained Gradle tasks
invoked by other high-level tasks and Jenkins job steps. There also seem
to be some Gradle + Kubernetes plugins out there that might prove useful
here (though I haven't done solid research in that area).


> Can we make the integration with Perfkit less brittle?
>

There was an idea to move all the Beam benchmark code from Perfkit
(beam_benchmark_helper.py, beam_integration_benchmark.py) into the Beam
repository and inject it into Perfkit every time we use it. However, that
would require investing time and effort, and it would still not solve the
problems I listed above. It would also still require Beam developers to know
how Perfkit works, while we can avoid that and use the existing tools
(Gradle, Jenkins).

Thanks!
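
(As a side note on the Metrics API collection mentioned in the quoted IOIT
paragraph below, a minimal sketch of what that looks like from inside a
pipeline. The namespace and counter names are placeholders, not the ones
IOIT actually uses.)

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;

    // Counts processed elements; the counter value can be queried from the
    // PipelineResult after the run and exported (e.g. to BigQuery).
    class CountingFn extends DoFn<String, String> {
      private final Counter elements = Metrics.counter("ioit", "elements_read");

      @ProcessElement
      public void processElement(@Element String element,
          OutputReceiver<String> out) {
        elements.inc();
        out.output(element);
      }
    }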

On Fri, Jun 28, 2019 at 5:31 PM Lukasz Cwik  wrote:

> +1 for removing tests that are not maintained.
>
> Are there features in Perfkit that we would like to be using that we
> aren't?
> Can we make the integration with Perfkit less brittle?
>
> If we aren't getting much and don't plan to get much value in the short
> term, removal makes sense to me.
>
> On Thu, Jun 27, 2019 at 3:16 AM Łukasz Gajowy  wrote:
>
>> Hi all,
>>
>> moving the discussion to the dev list:
>> https://github.com/apache/beam/pull/8919. I think that Perfkit
>> Benchmarker should be removed from all our tests.
>>
>> Problems that we face currently:
>>
>>1. Changes to Gradle tasks/build configuration in the Beam codebase
>>have to be reflected in Perfkit code. This requires PRs to Perfkit, which
>>can take a while, and the tests sometimes break because of this (no change
>>in Perfkit + change already there in Beam = incompatibility). This is what
>>happened in PR 8919 (above),
>>2. It can't run in Python 3 (it depends on Python 2-only libraries like
>>functools32),
>>3. It is black-box testing, which makes it hard to collect
>>pipeline-related metrics,
>>4. Measurement of run time is inaccurate,
>>5. It offers relatively little flexibility in comparison with, e.g.,
>>Jenkins tasks when setting up the testing infrastructure (runners,
>>databases). For example, if we'd like to set up a Flink runner and reuse it
>>in subsequent tests in one go, that would be impossible. We can easily do
>>this in Jenkins.
>>
>> Tests that use Perfkit:
>>
>>1.  IO integration tests,
>>2.  Python performance tests,
>>3.  beam_PerformanceTests_Dataflow (disabled),
>>4.  beam_PerformanceTests_Spark (constantly failing - looks
>>unmaintained).
>>
>> From the IOIT perspective (1), only the code that sets up/tears down
>> Kubernetes resources is useful right now, but these parts can easily be
>> implemented in Jenkins/Gradle code. That would make Perfkit obsolete for
>> IOIT because we already collect metrics using the Metrics API and store them in
>> BigQuery directly.
>>
>> As for point 2: I have no knowledge of how complex the task would be
>> (help needed).
>>
>> Regarding 3 and 4: those tests seem to be unmaintained - should we remove
>> them?
>>
>> Opinions?
>>
>> Thank you,
>> Łukasz
>>
>>
>>
>>
>>


Re: [Python] Read Hadoop Sequence File?

2019-07-03 Thread Ismaël Mejía
That's great. I can help whenever you need. We just need to choose its
destination. Both the `hadoop-format` and `hadoop-file-system` modules
are good candidates; I would even be inclined to put it in its own
module, `sdks/java/extensions/sequencefile`, to make it easier for end
users to discover.

One thing to consider is the SeekableByteChannel adapters; we can move
those into hadoop-common if needed and refactor the modules to share
code. It is worth taking a look at
org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel
to see if some of it could be useful.
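
(For anyone following the thread in the meantime: reading a SequenceFile
from the Java SDK can already go through HadoopFormatIO. A hedged sketch,
with the Text key/value types as placeholder assumptions and the gs:// glob
taken from Shannon's original question; reading gs:// paths from Hadoop
also assumes the GCS connector is on the classpath.)

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.hadoop.format.HadoopFormatIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

    public class SequenceFileReadSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Required by HadoopFormatIO: the InputFormat plus key/value classes.
        conf.set("mapreduce.job.inputformat.class",
            SequenceFileInputFormat.class.getName());
        conf.set("key.class", Text.class.getName());
        conf.set("value.class", Text.class.getName());
        // Placeholder input glob.
        conf.set("mapreduce.input.fileinputformat.inputdir",
            "gs://bucket/link/SequenceFile-*");

        Pipeline p = Pipeline.create();
        PCollection<KV<Text, Text>> records =
            p.apply(HadoopFormatIO.<Text, Text>read().withConfiguration(conf));
        // ... downstream transforms, then p.run().
      }
    }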

On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein  wrote:
>
> Hi all,
>
> I wrote those classes with the intention of upstreaming them to Beam. I can
> try to make some time this quarter to clean them up. I would need a bit of
> guidance from a Beam expert on how to make them coexist with HadoopFormatIO,
> though.
>
>
> On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis  wrote:
>>
>> +Igor Bernstein who wrote the Cloud Bigtable Sequence File classes.
>>
>> Solomon Duskis | Google Cloud clients | sdus...@google.com | 914-462-0531
>>
>>
>> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía  wrote:
>>>
>>> (Adding dev@ and Solomon Duskis to the discussion)
>>>
>>> I was not aware of these, thanks for sharing, David. It would definitely
>>> be a great addition if we could have those donated as an extension on
>>> the Beam side. We could even evolve them in the future to be more
>>> FileIO-like. Any chance this can happen? Maybe Solomon and his team?
>>>
>>>
>>>
>>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek  wrote:
>>> >
>>> > Hi, you can use SequenceFileSink and SequenceFileSource from the Bigtable
>>> > client. Those work nicely with FileIO.
>>> >
>>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
>>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>>> >
>>> > It would be really cool to move these into Beam, but it's up to the
>>> > Googlers to decide whether they want to donate this.
>>> >
>>> > D.
>>> >
>>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan 
>>> >  wrote:
>>> >>
>>> >> It's not outside the realm of possibilities. For now I've created an
>>> >> intermediate step: a Hadoop job that converts from sequence files to
>>> >> text files.
>>> >>
>>> >> Looking into better options.
>>> >>
>>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath  
>>> >> wrote:
>>> >>>
>>> >>> The Java SDK has HadoopInputFormatIO, with which you should be able to
>>> >>> read sequence files:
>>> >>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>> >>> I don't think there's a direct alternative for this in Python.
>>> >>>
>>> >>> Is it possible to write to a well-known format such as Avro instead of
>>> >>> a Hadoop-specific format, which would allow you to read from both
>>> >>> Dataproc/Hadoop and the Beam Python SDK?
>>> >>>
>>> >>> Thanks,
>>> >>> Cham
>>> >>>
>>> >>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan 
>>> >>>  wrote:
>>> 
 That's a pretty big hole for a missing source/sink when looking at
 transitioning from Dataproc to Dataflow using GCS as a storage buffer
 instead of traditional HDFS.

 From what I've been able to tell from the source code and documentation,
 Java is able to do this but Python is not?
>>> 
>>>  Thanks,
>>>  Shannon
>>> 
>>>  On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath 
>>>   wrote:
>>> >
> I don't think we have a source/sink for reading Hadoop sequence
> files. Your best bet currently will probably be to use the FileSystem
> abstraction to open a file from a ParDo and read directly from
> there using a library that can read sequence files.
>>> >
>>> > Thanks,
>>> > Cham
>>> >
>>> > On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan 
>>> >  wrote:
>>> >>
>>> >> I want to read a Sequence/Map file from Hadoop, stored on Google
>>> >> Cloud Storage at "gs://bucket/link/SequenceFile-*", via the
>>> >> Python SDK.
>>> >>
>>> >> I cannot locate any good adapters for this, and the one Hadoop
>>> >> filesystem reader seems to only read from an "hdfs://" URL.
>>> >>
>>> >> I want to use Dataflow and GCS exclusively to start mixing
>>> >> Beam pipelines in with our current Hadoop pipelines.
>>> >>
>>> >> Is this a feature that is supported or will be supported in the
>>> >> future?
>>> >> Does anyone have any good suggestions for doing this performantly?
>>> >>
>>> >> I'd also like to be able to write back out to a SequenceFile if 
>>> >> 

Re: WebSocket/Https connector for Apache Beam (Java)?

2019-07-03 Thread Ismaël Mejía
I was not aware of that websocket work, thanks for sharing, Alexey.
I-Feng, a WebSocket-based IO would be a really nice contribution, so it
would be worth bringing it into the project if that is in your plans. It
may also be worth syncing with JB on that.

If by any chance you or anyone else is interested in adding an IO
for raw sockets, there used to be an UnboundedSocketSource as part of
the Beam Flink runner that was removed recently:
https://github.com/apache/beam/pull/8574

I think this could easily be re-wrapped and pushed into its own
module, and I think it would be useful not only for pipelines but also
for demo purposes or interactive stuff like the SQL CLI / IPython
notebooks.
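
(For anyone who wants to experiment before a proper IO exists, a minimal
sketch of pushing elements out over a WebSocket from a DoFn using the JDK
11 client. This is an illustration, not the transform from Romain's blog
post, and the endpoint URI is a placeholder.)

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.WebSocket;
    import org.apache.beam.sdk.transforms.DoFn;

    // Sketch only: one connection per DoFn instance, no retries or batching.
    class WebSocketWriteFn extends DoFn<String, Void> {
      private transient WebSocket socket;

      @Setup
      public void setup() {
        socket = HttpClient.newHttpClient()
            .newWebSocketBuilder()
            .buildAsync(URI.create("wss://example.com/feed"),
                new WebSocket.Listener() {})  // no-op listener
            .join();
      }

      @ProcessElement
      public void processElement(@Element String element) {
        // Send each element as a complete text frame.
        socket.sendText(element, true).join();
      }

      @Teardown
      public void teardown() {
        if (socket != null) {
          socket.sendClose(WebSocket.NORMAL_CLOSURE, "done").join();
        }
      }
    }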

On Wed, Jul 3, 2019 at 10:07 AM Alexey Romanenko
 wrote:
>
> Probably, this blog, which Romain Manni-Bucau wrote a while ago, could be 
> helpful:
> https://rmannibucau.metawerx.net/post/apache-beam-websocket-output
>
> It’s not in Beam upstream but you can use it as your own transform.
>
> On 2 Jul 2019, at 13:34, I-Feng Lin  wrote:
>
> Hello all,
>
> I have an Apache Beam pipeline in Java where I would like to read data that
> comes from a WebSocket and write data to the server over HTTPS.
>
> I have been looking for some connectors, but so far my search has been
> unsuccessful. I know that it is possible to create custom connectors, but I
> want to check whether anything already exists.
>
> Thanks in advance,
>
> Ifeng
>
>


Re: WebSocket/Https connector for Apache Beam (Java)?

2019-07-03 Thread Alexey Romanenko
Probably, this blog, which Romain Manni-Bucau wrote a while ago, could be
helpful:
https://rmannibucau.metawerx.net/post/apache-beam-websocket-output

It’s not in Beam upstream but you can use it as your own transform.

> On 2 Jul 2019, at 13:34, I-Feng Lin  wrote:
> 
> Hello all,
> 
> I have an Apache Beam pipeline in Java where I would like to read data that
> comes from a WebSocket and write data to the server over HTTPS.
>
> I have been looking for some connectors, but so far my search has been
> unsuccessful. I know that it is possible to create custom connectors, but I
> want to check whether anything already exists.
> 
> Thanks in advance,
> 
> Ifeng
>