Re: Stability of Timer.withOutputTimestamp

2020-02-05 Thread Kenneth Knowles
It is definitely too new to be stable in the sense of guaranteeing no
changes, even tiny ones, to the API or to runtime compatibility.

However, in my opinion it is so fundamental (and overdue) it will certainly
exist in some form.

Feel free to use it if you are OK with the possibility of minor
compile-time adjustments and you do not require Dataflow pipeline update
compatibility.

Kenn

On Wed, Feb 5, 2020 at 10:31 AM Luke Cwik  wrote:

> +Reuven Lax 
>
> On Wed, Feb 5, 2020 at 7:33 AM Steve Niemitz  wrote:
>
>> Also, as a follow-up, I'm curious about this commit:
>>
>> https://github.com/apache/beam/commit/80862f2de6f224c3a1e7885d197d1ca952ec07e3
>>
>> My use case is that I want to set a timer to fire after the max timestamp
>> of a window, but hold the watermark to the max timestamp until it fires,
>> essentially delaying the window closing by some amount of event time.
>> Prior to that revert commit, it seems like that would have been possible,
>> but now it would fail (since the target is after the window's maxTimestamp).
>>
>> What was the reason this was reverted, and are there plans to un-revert
>> it?
>>
>> On Wed, Feb 5, 2020 at 10:01 AM Steve Niemitz 
>> wrote:
>>
>>> I noticed that Timer.withOutputTimestamp has landed in 2.19, but I
>>> didn't see any mention of it in the release notes.
>>>
>>> Is this feature considered stable (specifically on Dataflow)?
>>>
>>


Re: Stability of Timer.withOutputTimestamp

2020-02-05 Thread Luke Cwik
+Reuven Lax 

On Wed, Feb 5, 2020 at 7:33 AM Steve Niemitz  wrote:

> Also, as a follow-up, I'm curious about this commit:
>
> https://github.com/apache/beam/commit/80862f2de6f224c3a1e7885d197d1ca952ec07e3
>
> My use case is that I want to set a timer to fire after the max timestamp
> of a window, but hold the watermark to the max timestamp until it fires,
> essentially delaying the window closing by some amount of event time.
> Prior to that revert commit, it seems like that would have been possible,
> but now it would fail (since the target is after the window's maxTimestamp).
>
> What was the reason this was reverted, and are there plans to un-revert it?
>
> On Wed, Feb 5, 2020 at 10:01 AM Steve Niemitz  wrote:
>
>> I noticed that Timer.withOutputTimestamp has landed in 2.19, but I didn't
>> see any mention of it in the release notes.
>>
>> Is this feature considered stable (specifically on Dataflow)?
>>
>


Re: seems beam.util.GroupIntoBatches is not supported in Dataflow. Any alternative?

2020-02-05 Thread Robert Bradshaw
Yes, you should use BatchElements. Stateful DoFns are not yet
supported for Python Dataflow. (The difference is that
GroupIntoBatches has the capability to batch across bundles, which can
be important for streaming.)
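
For illustration, a minimal sketch of both transforms (the input data and
the choice of key here are made up):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    rows = p | beam.Create([{'id': i} for i in range(1000)])

    # BatchElements batches within each bundle: no keys and no state
    # are needed, so it runs on any runner, including Python Dataflow.
    batched = rows | beam.BatchElements(min_batch_size=10,
                                        max_batch_size=100)

    # GroupIntoBatches requires keyed input and stateful DoFn support,
    # but it can batch across bundles, which matters for streaming.
    keyed_batches = (rows
                     | beam.Map(lambda row: (row['id'] % 10, row))
                     | beam.GroupIntoBatches(100))

    # Each element of `batched` is a list of up to 100 rows.
    _ = batched | beam.Map(len)
```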



On Wed, Feb 5, 2020 at 7:53 AM Alan Krumholz  wrote:
>
> OK, seems like beam.BatchElements(max_batch_size=x) will do the trick for me 
> and runs fine in Dataflow!
>
> On Wed, Feb 5, 2020 at 7:38 AM Alan Krumholz  
> wrote:
>>
>> Actually beam.GroupIntoBatches() gives me the same error as  
>> beam.util.GroupIntoBatches() :(
>> back to square one.
>>
>> Any other ideas?
>>
>> Thank you!
>>
>>
>> On Wed, Feb 5, 2020 at 7:32 AM Alan Krumholz  
>> wrote:
>>>
>>> Never mind, there seems to be a beam.GroupIntoBatches() that I should
>>> have originally used instead of beam.util.GroupIntoBatches()
>>>
>>> On Wed, Feb 5, 2020 at 7:19 AM Alan Krumholz  
>>> wrote:

>>>> Hello, I'm having issues running beam.util.GroupIntoBatches() in Dataflow.
>>>>
>>>> I get the following error message:
>>>>
>>>>> Exception: Requested execution of a stateful DoFn, but no user state
>>>>> context is available. This likely means that the current runner does not
>>>>> support the execution of stateful DoFns
>>>>
>>>> Seems to be related to:
>>>> https://stackoverflow.com/questions/56403572/no-userstate-context-is-available-google-cloud-dataflow
>>>>
>>>> Is there another way I can achieve the same using another Beam function?
>>>>
>>>> I basically want to batch rows into groups of 100, as it is a lot faster to
>>>> transform them all at once than one by one.
>>>>
>>>> I was also planning to use this function for a custom Snowflake sink (so I
>>>> could insert many rows at once).
>>>>
>>>> I'm sure there must be another way to do this in Dataflow, but I'm not sure how.
>>>>
>>>> Thanks so much!


Re: seems beam.util.GroupIntoBatches is not supported in Dataflow. Any alternative?

2020-02-05 Thread Alan Krumholz
OK, seems like beam.BatchElements(max_batch_size=x) will do the trick for
me and runs fine in Dataflow!

On Wed, Feb 5, 2020 at 7:38 AM Alan Krumholz 
wrote:

> Actually beam.GroupIntoBatches() gives me the same error as
> beam.util.GroupIntoBatches() :(
> back to square one.
>
> Any other ideas?
>
> Thank you!
>
>
> On Wed, Feb 5, 2020 at 7:32 AM Alan Krumholz 
> wrote:
>
>> Never mind, there seems to be a beam.GroupIntoBatches() that I should
>> have originally used instead of beam.util.GroupIntoBatches()
>>
>> On Wed, Feb 5, 2020 at 7:19 AM Alan Krumholz 
>> wrote:
>>
>>> Hello, I'm having issues running beam.util.GroupIntoBatches() in
>>> Dataflow.
>>>
>>> I get the following error message:
>>>
>>>> Exception: Requested execution of a stateful DoFn, but no user state
>>>> context is available. This likely means that the current runner does not
>>>> support the execution of stateful DoFns
>>>
>>> Seems to be related to:
>>>
>>> https://stackoverflow.com/questions/56403572/no-userstate-context-is-available-google-cloud-dataflow
>>>
>>> Is there another way I can achieve the same using another Beam function?
>>>
>>> I basically want to batch rows into groups of 100, as it is a lot faster
>>> to transform them all at once than one by one.
>>>
>>> I was also planning to use this function for a custom Snowflake sink (so
>>> I could insert many rows at once).
>>>
>>> I'm sure there must be another way to do this in Dataflow, but I'm not
>>> sure how.
>>>
>>> Thanks so much!
>>>
>>


Re: seems beam.util.GroupIntoBatches is not supported in Dataflow. Any alternative?

2020-02-05 Thread Alan Krumholz
Actually beam.GroupIntoBatches() gives me the same error as
beam.util.GroupIntoBatches() :(
back to square one.

Any other ideas?

Thank you!


On Wed, Feb 5, 2020 at 7:32 AM Alan Krumholz 
wrote:

> Never mind, there seems to be a beam.GroupIntoBatches() that I should
> have originally used instead of beam.util.GroupIntoBatches()
>
> On Wed, Feb 5, 2020 at 7:19 AM Alan Krumholz 
> wrote:
>
>> Hello, I'm having issues running beam.util.GroupIntoBatches() in Dataflow.
>>
>> I get the following error message:
>>
>>> Exception: Requested execution of a stateful DoFn, but no user state
>>> context is available. This likely means that the current runner does not
>>> support the execution of stateful DoFns
>>
>> Seems to be related to:
>>
>> https://stackoverflow.com/questions/56403572/no-userstate-context-is-available-google-cloud-dataflow
>>
>> Is there another way I can achieve the same using another Beam function?
>>
>> I basically want to batch rows into groups of 100, as it is a lot faster
>> to transform them all at once than one by one.
>>
>> I was also planning to use this function for a custom Snowflake sink (so
>> I could insert many rows at once).
>>
>> I'm sure there must be another way to do this in Dataflow, but I'm not
>> sure how.
>>
>> Thanks so much!
>>
>


Re: seems beam.util.GroupIntoBatches is not supported in Dataflow. Any alternative?

2020-02-05 Thread Alan Krumholz
Never mind, there seems to be a beam.GroupIntoBatches() that I should have
originally used instead of beam.util.GroupIntoBatches()

On Wed, Feb 5, 2020 at 7:19 AM Alan Krumholz 
wrote:

> Hello, I'm having issues running beam.util.GroupIntoBatches() in Dataflow.
>
> I get the following error message:
>
>> Exception: Requested execution of a stateful DoFn, but no user state
>> context is available. This likely means that the current runner does not
>> support the execution of stateful DoFns
>
> Seems to be related to:
>
> https://stackoverflow.com/questions/56403572/no-userstate-context-is-available-google-cloud-dataflow
>
> Is there another way I can achieve the same using another Beam function?
>
> I basically want to batch rows into groups of 100, as it is a lot faster to
> transform them all at once than one by one.
>
> I was also planning to use this function for a custom Snowflake sink (so I
> could insert many rows at once).
>
> I'm sure there must be another way to do this in Dataflow, but I'm not sure
> how.
>
> Thanks so much!
>


Re: Stability of Timer.withOutputTimestamp

2020-02-05 Thread Steve Niemitz
Also, as a follow-up, I'm curious about this commit:
https://github.com/apache/beam/commit/80862f2de6f224c3a1e7885d197d1ca952ec07e3

My use case is that I want to set a timer to fire after the max timestamp
of a window, but hold the watermark to the max timestamp until it fires,
essentially delaying the window closing by some amount of event time.
Prior to that revert commit, it seems like that would have been possible,
but now it would fail (since the target is after the window's maxTimestamp).

What was the reason this was reverted, and are there plans to un-revert it?

On Wed, Feb 5, 2020 at 10:01 AM Steve Niemitz  wrote:

> I noticed that Timer.withOutputTimestamp has landed in 2.19, but I didn't
> see any mention of it in the release notes.
>
> Is this feature considered stable (specifically on Dataflow)?
>


seems beam.util.GroupIntoBatches is not supported in Dataflow. Any alternative?

2020-02-05 Thread Alan Krumholz
Hello, I'm having issues running beam.util.GroupIntoBatches() in Dataflow.

I get the following error message:

> Exception: Requested execution of a stateful DoFn, but no user state
> context is available. This likely means that the current runner does not
> support the execution of stateful DoFns

Seems to be related to:
https://stackoverflow.com/questions/56403572/no-userstate-context-is-available-google-cloud-dataflow

Is there another way I can achieve the same using another Beam function?

I basically want to batch rows into groups of 100, as it is a lot faster to
transform them all at once than one by one.

I was also planning to use this function for a custom Snowflake sink (so I
could insert many rows at once).

I'm sure there must be another way to do this in Dataflow, but I'm not sure how.

Thanks so much!


Stability of Timer.withOutputTimestamp

2020-02-05 Thread Steve Niemitz
I noticed that Timer.withOutputTimestamp has landed in 2.19, but I didn't
see any mention of it in the release notes.

Is this feature considered stable (specifically on Dataflow)?


Re: Dropping expired sessions with Apache Beam

2020-02-05 Thread Jan Lukavský

Hi Juliana,

I'm not quite familiar with the Python SDK, so I can only give generic
advice. The problem you describe seems to be handled well by a stateful
DoFn [1], where you would hold the last event timestamp per session and
set a timer on each incoming event to the expiration time (if the
timestamp of that event is greater than the greatest seen so far). Once
you receive LOGOUT, you reset this timer and expire the session
(probably by unsetting the last received event timestamp). Note that
events will generally arrive out of order (not sorted by timestamp), so
you must keep the maximal timestamp and update it only with events that
carry a higher timestamp.


> In normal Python I would keep a dict with each session as key and the
> last timestamp as value. For each new entry of a given key I would
> check the timedelta: if bigger than the window, expired; otherwise,
> update the last timestamp. But I don't know how to handle this in Beam.


This is essentially what you should do, just use the stateful API [2].
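
Roughly, a minimal and untested sketch of that pattern (assuming a runner
and SDK that support Python stateful processing with timers, and that the
pipeline's element timestamps follow the log's event time; the class name,
the gap constant, and the entry fields are illustrative, and the session
id is stashed in state so the timer callback can emit it):

```python
import apache_beam as beam
from apache_beam.coders import StrUtf8Coder, VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    CombiningValueStateSpec, TimerSpec, on_timer)
from apache_beam.utils.timestamp import Timestamp

SESSION_GAP_SECS = 5 * 60  # expire after 5 minutes without activity


class DetectExpiredSessions(beam.DoFn):
    # Highest event timestamp seen so far; events may arrive out of
    # order, so only the maximum matters.
    MAX_TS = CombiningValueStateSpec('max_ts', VarIntCoder(), max)
    # Counts LOGOUT events; 0 means the session never logged out.
    LOGGED_OUT = CombiningValueStateSpec('logged_out', VarIntCoder(), sum)
    # Every value added is the same session id, so max() returns it.
    SESSION = CombiningValueStateSpec('session', StrUtf8Coder(), max)
    EXPIRY = TimerSpec('expiry', TimeDomain.WATERMARK)

    def process(self, element,
                max_ts=beam.DoFn.StateParam(MAX_TS),
                logged_out=beam.DoFn.StateParam(LOGGED_OUT),
                session=beam.DoFn.StateParam(SESSION),
                expiry=beam.DoFn.TimerParam(EXPIRY)):
        session_id, entry = element  # input must be keyed by session id
        session.add(session_id)
        if entry.operation == 'LOGOUT':
            logged_out.add(1)
            return
        max_ts.add(int(entry.timestamp.timestamp()))
        # Re-setting the timer replaces the previous deadline, so it
        # always tracks the latest activity plus the expiry gap.
        expiry.set(Timestamp(seconds=max_ts.read() + SESSION_GAP_SECS))

    @on_timer(EXPIRY)
    def expire(self,
               logged_out=beam.DoFn.StateParam(LOGGED_OUT),
               session=beam.DoFn.StateParam(SESSION)):
        if logged_out.read() == 0:  # no LOGOUT before the gap elapsed
            yield session.read()
```

You would apply it to the keyed collection, e.g.
beam.Map(lambda e: (e.session, e)) | beam.ParDo(DetectExpiredSessions()),
and it emits the ids of sessions that went quiet without a LOGOUT. Note
that on bounded input the watermark only advances once all data is read,
so the timer fires once per session at the end; the pattern really pays
off in streaming.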

Hope this helps,

 Jan

[1] https://beam.apache.org/blog/2017/02/13/stateful-processing.html

[2] https://beam.apache.org/releases/pydoc/2.18.0/apache_beam.transforms.userstate.html


On 2/5/20 12:58 AM, Juliana Pereira wrote:
I have a web log file that contains session ids and interactions; there
are three interaction types: `GET`, `LOGIN`, `LOGOUT`. Something like:


```
00:00:01;session1;GET
00:00:03;session2;LOGIN
00:01:01;session1;LOGOUT
00:03:01;session2;GET
00:08:15;session2;GET
```

and so on.

I want to be able to identify (right now I'm dealing with bounded data)
which sessions have expired. By expired I mean any session that does not
have any interaction within a 5-minute interval.


Of course, if the user sends LOGOUT, expiration will not be applied. In
the data above, session2 should be considered expired.


I have the following pipeline:
```
(p
 | 'Read Files' >> ReadFromText(known_args.input, coder=LogCoder())
 | beam.ParDo(LogToSession())
 | beam.Map(lambda entry: (entry.session, entry))
 | beam.GroupByKey()
)
```

`LogCoder()` is responsible for correctly reading the input files.
`LogToSession` converts a log line into a Python class that correctly
models the data structure, making its properties accessible.


For example, I can fetch `entry.session`, `entry.timestamp`, or
`entry.operation`.


Once processed by `LogToSession`, `entry.timestamp` is a Python
`datetime`, while `entry.session` and `entry.operation` are both `str`.


In normal Python I would keep a dict with each session as key and the
last timestamp as value. For each new entry of a given key I would check
the timedelta: if bigger than the window, expired; otherwise, update the
last timestamp. But I don't know how to handle this in Beam.


How should I handle the next steps?