dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
Hi,

I was running a dataflow job in GCP last night and it was running fine.
This morning the exact same job is failing with the following error:

Error message from worker: Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/apache_beam/internal/pickler.py", line 286, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 577, in _load_type
    return _reverse_typemap[name]
KeyError: 'ClassType'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/dataflow_worker/batchworker.py", line 648, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.5/site-packages/dataflow_worker/executor.py", line 176, in execute
    op.start()
  File "apache_beam/runners/worker/operations.py", line 649, in apache_beam.runners.worker.operations.DoOperation.start
  File "apache_beam/runners/worker/operations.py", line 651, in apache_beam.runners.worker.operations.DoOperation.start
  File "apache_beam/runners/worker/operations.py", line 652, in apache_beam.runners.worker.operations.DoOperation.start
  File "apache_beam/runners/worker/operations.py", line 261, in apache_beam.runners.worker.operations.Operation.start
  File "apache_beam/runners/worker/operations.py", line 266, in apache_beam.runners.worker.operations.Operation.start
  File "apache_beam/runners/worker/operations.py", line 597, in apache_beam.runners.worker.operations.DoOperation.setup
  File "apache_beam/runners/worker/operations.py", line 602, in apache_beam.runners.worker.operations.DoOperation.setup
  File "/usr/local/lib/python3.5/site-packages/apache_beam/internal/pickler.py", line 290, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 577, in _load_type
    return _reverse_typemap[name]
KeyError: 'ClassType'


If I use a local runner it still runs fine.
Anyone else experiencing something similar today? (or know how to fix this?)

Thanks!
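
For context, the failing frames are Beam's pickler round-tripping the
pipeline's functions through dill, so a version or format mismatch can be
reproduced without Dataflow. A minimal local sketch (MyDoFn is a
hypothetical stand-in for the job's own DoFn):

# Sketch: replay the worker's unpickling path locally. Beam's
# apache_beam.internal.pickler wraps dill, so a dill version/format
# mismatch between the submitting environment and the worker surfaces
# exactly as the KeyError in _load_type above.
import apache_beam as beam
from apache_beam.internal import pickler

class MyDoFn(beam.DoFn):  # hypothetical stand-in for the job's DoFn
    def process(self, element):
        yield element

payload = pickler.dumps(MyDoFn)    # what the SDK serializes at submit time
restored = pickler.loads(payload)  # what the worker does in DoOperation.setup
print(restored)                    # a mismatch raises KeyError: 'ClassType'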


Re: Best approach for recalculating statistics based on amended or deleted events?

2020-02-04 Thread Stephen Young
Thank you Reza. That was very helpful!

On 2020/02/03 01:03:18, Reza Rokni  wrote: 
> Hi,
> 
> So https://issues.apache.org/jira/browse/BEAM-91 would be nice... but not
> there sadly.
> 
> Not ideal but some thoughts:
> 
> With regards to state, the time scale (for example, you mentioned a week)
> could be problematic if the number of events per key is large. Could you
> create an aggregation value for your events, or do all events need to be
> available when doing your recalculation? A count, I assume, would be OK
> to keep as 1-hour aggregates (1 being arbitrary, of course). Things like
> unique counts are only possible if you are OK with < 100% accuracy, using
> algorithms like HyperLogLog with sketches kept in state (see the Beam HLL
> docs).
> 
> The other option is to keep x min / hours / days in state but also push
> out information to a backing database for the longer periods. So if a
> redaction is for 3 weeks ago, then in the stateful DoFn read that key's
> data from the backing database and recompute. Assuming most updates fall
> within the smaller timeframes, they should be dealt with by the data in
> state (cache hit) rather than off state (cache miss).
> 
> Cheers
> 
> Reza
> 
> On Thu, 30 Jan 2020 at 22:31, Stephen Young 
> wrote:
> 
> > I am currently looking into how Beam can support a live data collection
> > platform. We want to collect certain data in real-time. This data will be
> > sent to Kafka and we want to use Beam to calculate statistics and derived
> > events from it.
> >
> > An important thing we need to be able to handle is amendment or deletion
> > events. For example, we may get an event that someone has performed an
> > action and from this we'd calculate how many of these actions they had
> > taken in total. We'd also build calculations on top of that, for example
> > top 10 rankings by these counts, or arbitrarily many layers of calculations
> > beyond that. But sometime later (this could be a few seconds or a week) we
> > receive an amendment event to that action. This indicates that the action
> > was taken by a different person or from a different location. We then need
> > Beam to recalculate all of our downstream stats i.e. the counts need to be
> > changed and rankings need to be adjusted.
> >
> > We could potentially solve this problem by using Beam's stateful operators
> > to store all actions in state and then we always calculate the stats from
> > this state. But we're concerned this approach won't scale well when we have
> > lots of events and we need to store lots of events that happen often.
> >
> > An alternative is the differential dataflow approach described here:
> > https://docs.rs/differential-dataflow/0.11.0/differential_dataflow/. They
> > explain:
> >
> > "Once you have defined a differential dataflow computation, you may then
> > add records to or remove records from its inputs; the system will
> > automatically update the computation's outputs with the appropriate
> > corresponding additions and removals, and report these changes to you."
> >
> > Is anything like this achievable with Beam? Thanks!
> >
> 
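
Both of Reza's suggestions above can be sketched in Beam's Python SDK. The
snippets below are illustrative only, and lookup_history is a hypothetical
backing-store call:

# Sketch: keep approximate uniques as cheap hourly aggregates instead of
# raw events, using ApproximateUnique from the Python SDK.
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.stats import ApproximateUnique

with beam.Pipeline() as p:
    (p
     | beam.Create([('user1', 'a'), ('user1', 'b'), ('user2', 'a')])
     | beam.WindowInto(window.FixedWindows(60 * 60))  # 1-hour aggregates
     | ApproximateUnique.PerKey(error=0.05)           # HLL-style estimate
     | beam.Map(print))

And the cache-hit / cache-miss shape with a stateful DoFn (the input
PCollection must be keyed):

# Sketch: recent per-key events live in Beam state (cache hit); a
# weeks-old amendment falls back to a backing database (cache miss).
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import BagStateSpec

class Recount(beam.DoFn):
    RECENT = BagStateSpec('recent', VarIntCoder())

    def process(self, element, recent=beam.DoFn.StateParam(RECENT)):
        key, delta = element
        recent.add(delta)
        total = sum(recent.read())  # events still held in state
        # For amendments older than what state keeps, merge in history
        # read from a backing store, e.g. lookup_history(key) (hypothetical).
        yield key, total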


Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Pablo Estrada
Hi Alan,
could it be that you're picking up the new Apache Beam 2.19.0 release?
Could you try depending on Beam 2.18.0 to see whether the issue only
surfaces with the new release?

If something was working and no longer works, it sounds like a bug. This
may have to do with how we pickle (dill / cloudpickle) - see this question
https://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype
Best
-P.
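
If downgrading helps, one way to pin what Dataflow stages is a setup.py
passed to the job via --setup_file. This is a sketch with illustrative
names and pins, not a confirmed fix:

# Sketch: pin the SDK version staged to workers (names are illustrative).
import setuptools

setuptools.setup(
    name='my-dataflow-job',          # hypothetical package name
    version='0.1.0',
    install_requires=[
        'apache-beam[gcp]==2.18.0',  # the downgrade suggested above
    ],
    packages=setuptools.find_packages(),
)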

On Tue, Feb 4, 2020 at 6:22 AM Alan Krumholz 
wrote:

> [Alan's original message and worker traceback quoted in full; snipped]


Re: Kafka Avro Schema Registry Support

2020-02-04 Thread Ismaël Mejía
Support for Confluent Schema Registry was merged into KafkaIO today. You
can test it with tomorrow's snapshots (version 2.20.0-SNAPSHOT) or when
2.20.0 gets released. Note that this was already possible, but Alexey took
care of making it more user friendly, because this is (was) a frequently
requested feature by Kafka/Avro users.



On Fri, Sep 28, 2018 at 6:58 PM Raghu Angadi  wrote:

> Looks like your producer is writing Avro specific records.
>
> Can you read the records using the bundled console consumer? I think it
> will be simpler for you to first get it returning valid records, and then
> use the same deserializer config with your KafkaIO reader.
>
> On Fri, Sep 28, 2018 at 9:33 AM Vishwas Bm  wrote:
>
>> Hi Raghu,
>>
>> Thanks for the response.  We are now trying with GenericAvroDeserializer
>> but still seeing issues.
>> We have a producer which sends messages to kafka in format
>> .
>>
>> Below is the code snippet we have used with Beam KafkaIO.
>>
>> org.apache.avro.Schema schema = null;
>> try {
>>     schema = new org.apache.avro.Schema.Parser().parse(new File("Schema path"));
>> } catch (Exception e) {
>>     e.printStackTrace();
>> }
>>
>> KafkaIO.Read kafkaIoRead = KafkaIO.read()
>>     .withBootstrapServers(bootstrapServerUrl)
>>     .withTopic(topicName)
>>     .withKeyDeserializer(StringDeserializer.class)
>>     .withValueDeserializerAndCoder(GenericAvroDeserializer.class,
>>         AvroCoder.of(schema))
>>     .updateConsumerProperties(
>>         ImmutableMap.of("schema.registry.url", schemaUrl))
>>     .withTimestampPolicyFactory((tp, prevWatermark) ->
>>         new KafkaCustomTimestampPolicy(maxDelay, timestampInfo,
>>             prevWatermark));
>>
>> Below is the error seen:
>>
>> Caused by: avro.shaded.com.google.common.util.concurrent.UncheckedExecutionException:
>> org.apache.avro.AvroRuntimeException: Not a Specific class: interface
>> org.apache.avro.generic.GenericRecord
>>     at avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2234)
>>     at avro.shaded.com.google.common.cache.LocalCache.get(LocalCache.java:3965)
>>     at avro.shaded.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969)
>>     at avro.shaded.com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
>>     at org.apache.avro.specific.SpecificData.getSchema(SpecificData.java:225)
>>     ... 8 more
>> Caused by: org.apache.avro.AvroRuntimeException: Not a Specific class:
>> interface org.apache.avro.generic.GenericRecord
>>     at org.apache.avro.specific.SpecificData.createSchema(SpecificData.java:285)
>>     at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:594)
>>     at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218)
>>     at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215)
>>     at avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
>>     at avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
>>     at avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
>>     at avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
>>
>>
>> Can you provide some pointers on this?
>>
>>
>> *Thanks & Regards,*
>>
>> *Vishwas *
>>
>>
>>
>> On Fri, Sep 28, 2018 at 3:12 AM Raghu Angadi  wrote:
>>
>>> It is a compilation error due to a type mismatch for the value type.
>>>
>>> Please match key and value types for the KafkaIO reader. I.e. if you have
>>> KafkaIO.read()<KeyType, ValueType>, 'withValueDeserializer()' needs a
>>> class object which extends 'Deserializer<ValueType>'. Since
>>> KafkaAvroDeserializer extends 'Deserializer<Object>', your ValueType
>>> needs to be Object instead of String.
>>>
>>> Btw, it might be better to use GenericAvroDeserializer or
>>> SpecificAvroDeserializer from the same package.
>>>
>>>
>>> On Thu, Sep 27, 2018 at 10:31 AM Vishwas Bm  wrote:
>>>

 Hi Raghu,

 The deserializer is provided by the Confluent
 *io.confluent.kafka.serializers* package.

 When we set valueDeserializer as KafkaAvroDeserializer, we are
 getting the below error:
    The method withValueDeserializer(Class<? extends Deserializer<String>>)
    in the type KafkaIO.Read<String, String> is not applicable for the
    arguments (Class<KafkaAvroDeserializer>)

 From the error, it looks like Beam does not support this deserializer.
 Also, we wanted to use the Schema Registry from Confluent; is this
 supported in Beam?


 *Thanks & Regards,*
 *Vishwas *


 On Thu, Sep 27, 2018 at 10:28 PM Raghu Angadi 
 wrote:

> You can set key/value deserializers:
> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L101
> What are the errors you see?
>
> Also note

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
Hi Pablo,
This is strange... it doesn't seem to be the latest Beam release, as last
night it was already using 2.19.0. I wonder if it was some release from the
Dataflow team (not Beam related):
Job type: Batch
Job status: Succeeded
SDK version: Apache Beam Python 3.5 SDK 2.19.0
Region: us-central1
Start time: February 3, 2020 at 9:28:35 PM GMT-8
Elapsed time: 5 min 11 sec

On Tue, Feb 4, 2020 at 9:15 AM Pablo Estrada  wrote:

> [Pablo's reply and Alan's original message quoted in full; snipped]


Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Pablo Estrada
Hm that's odd. No changes to the pipeline? Are you able to share some of
the code?

+Udi Meiri  do you have any idea what could be going on
here?

On Tue, Feb 4, 2020 at 9:25 AM Alan Krumholz 
wrote:

> [Alan's job details and the earlier thread history quoted in full; snipped]


Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
I tried breaking apart my pipeline. The step that seems to break it is:
beam.io.WriteToBigQuery

Let me see if I can create a self-contained example that breaks, to share
with you.

Thanks!

On Tue, Feb 4, 2020 at 9:53 AM Pablo Estrada  wrote:

> [Pablo's question and the earlier thread history quoted in full; snipped]


Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Mikhail Gryzykhin
Hi Alan,

+Valentyn Tymofieiev  Can you verify if my assumption
is correct?

It seems that the problem might come from a dill version mismatch. The
dill version should match between the worker and the user code. Between
Beam 2.17 and Beam 2.18 we upgraded the dill version to 0.3.1.1, which has
a format incompatible with earlier versions.

Which version of dill do you use when submitting the pipeline?

Try using a dill version below 0.3.1 with Beam 2.17 and earlier, and dill
0.3.1 or above with Beam 2.18 and above.
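
A quick submit-side check along those lines (a sketch; the pairings are
the ones stated above):

# Sketch: print the submit-side versions to compare against the rule
# above (Beam <= 2.17 -> dill < 0.3.1; Beam >= 2.18 -> dill >= 0.3.1.1).
import apache_beam as beam
import dill

print('beam', beam.__version__, '/ dill', dill.__version__)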

Regards,
--Mikhail.


On Tue, Feb 4, 2020 at 10:04 AM Alan Krumholz 
wrote:

> [Alan's message and the earlier thread history quoted in full; snipped]

Re: Kafka Avro Schema Registry Support

2020-02-04 Thread rahul patwari
Thanks Ismael for the update.
Thanks Alexey for the enhancement.
We will test it with the 2.20 release.

On Tue, 4 Feb 2020, 10:53 pm Ismaël Mejía,  wrote:

> [Ismaël's announcement and the 2018 thread history quoted in full;
> snipped]

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
Here is a test job that sometimes fails and sometimes doesn't (but most
times it does). There seems to be something stochastic causing this, as
after several tests a couple of them did succeed:


def test_error(bq_table: str) -> str:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class GenData(beam.DoFn):
        def process(self, _):
            for _ in range(2):
                yield {'a': 1, 'b': 2}

    def get_bigquery_schema():
        from apache_beam.io.gcp.internal.clients import bigquery

        table_schema = bigquery.TableSchema()
        columns = [
            ["a", "integer", "nullable"],
            ["b", "integer", "nullable"]
        ]

        for column in columns:
            column_schema = bigquery.TableFieldSchema()
            column_schema.name = column[0]
            column_schema.type = column[1]
            column_schema.mode = column[2]
            table_schema.fields.append(column_schema)

        return table_schema

    pipeline = beam.Pipeline(options=PipelineOptions(
        project='my-project',
        temp_location='gs://my-bucket/temp',
        staging_location='gs://my-bucket/staging',
        runner='DataflowRunner'
    ))
    #pipeline = beam.Pipeline()

    (
        pipeline
        | 'Empty start' >> beam.Create([''])
        | 'Generate Data' >> beam.ParDo(GenData())
        #| 'print' >> beam.Map(print)
        | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
            project=bq_table.split(':')[0],
            dataset=bq_table.split(':')[1].split('.')[0],
            table=bq_table.split(':')[1].split('.')[1],
            schema=get_bigquery_schema(),
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
    )

    result = pipeline.run()
    result.wait_until_finish()

    return True


test_error(bq_table='my-project:my_dataset.my_table')

On Tue, Feb 4, 2020 at 10:04 AM Alan Krumholz 
wrote:

> [Alan's earlier message and thread history quoted in full; snipped]

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
BTW it doesn't seem to be related to the BQ sink. My job is failing now too
without that part (and it wasn't earlier today):

def test_error(bq_table: str) -> str:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class GenData(beam.DoFn):
        def process(self, _):
            for _ in range(2):
                yield {'a': 1, 'b': 2}

    pipeline = beam.Pipeline(options=PipelineOptions(
        project='my-project',
        temp_location='gs://my-bucket/temp',
        staging_location='gs://my-bucket/staging',
        runner='DataflowRunner'
    ))
    #pipeline = beam.Pipeline()

    (
        pipeline
        | 'Empty start' >> beam.Create([''])
        | 'Generate Data' >> beam.ParDo(GenData())
        | 'print' >> beam.Map(print)
    )

    result = pipeline.run()
    result.wait_until_finish()

    return True


test_error(bq_table='my-project:my_dataset.my_table')

On Tue, Feb 4, 2020 at 11:21 AM Alan Krumholz 
wrote:

> [Alan's test job code and the earlier thread history quoted in full;
> snipped]

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Valentyn Tymofieiev
I don't think there is a mismatch between dill versions here, but
https://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype
mentions a similar error and may be related. What is the output of pip
freeze on your machine (or better: pip install pipdeptree; pipdeptree)?
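
For just the pickling-related packages, a small sketch that also works on
the Python 3.5 environment in question:

# Sketch: dump only the versions relevant to the unpickling error.
import pkg_resources

for pkg in ('apache-beam', 'dill', 'cloudpickle'):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, 'not installed')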


On Tue, Feb 4, 2020 at 11:22 AM Alan Krumholz 
wrote:

> [Alan's test job code and the earlier thread history quoted in full;
> snipped]
Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
Hi Valentyn,

Here is the pip freeze from my machine (note that the error is in
Dataflow; the job runs fine on my machine):

ansiwrap==0.8.4
apache-beam==2.19.0
arrow==0.15.5
asn1crypto==1.3.0
astroid==2.3.3
astropy==3.2.3
attrs==19.3.0
avro-python3==1.9.1
azure-common==1.1.24
azure-storage-blob==2.1.0
azure-storage-common==2.1.0
backcall==0.1.0
bcolz==1.2.1
binaryornot==0.4.4
bleach==3.1.0
boto3==1.11.9
botocore==1.14.9
cachetools==3.1.1
certifi==2019.11.28
cffi==1.13.2
chardet==3.0.4
Click==7.0
cloudpickle==1.2.2
colorama==0.4.3
configparser==4.0.2
confuse==1.0.0
cookiecutter==1.7.0
crcmod==1.7
cryptography==2.8
cycler==0.10.0
daal==2019.0
datalab==1.1.5
decorator==4.4.1
defusedxml==0.6.0
dill==0.3.1.1
distro==1.0.1
docker==4.1.0
docopt==0.6.2
docutils==0.15.2
entrypoints==0.3
enum34==1.1.6
fairing==0.5.3
fastavro==0.21.24
fasteners==0.15
fsspec==0.6.2
future==0.18.2
gcsfs==0.6.0
gitdb2==2.0.6
GitPython==3.0.5
google-api-core==1.16.0
google-api-python-client==1.7.11
google-apitools==0.5.28
google-auth==1.11.0
google-auth-httplib2==0.0.3
google-auth-oauthlib==0.4.1
google-cloud-bigquery==1.17.1
google-cloud-bigtable==1.0.0
google-cloud-core==1.2.0
google-cloud-dataproc==0.6.1
google-cloud-datastore==1.7.4
google-cloud-language==1.3.0
google-cloud-logging==1.14.0
google-cloud-monitoring==0.31.1
google-cloud-pubsub==1.0.2
google-cloud-secret-manager==0.1.1
google-cloud-spanner==1.13.0
google-cloud-storage==1.25.0
google-cloud-translate==2.0.0
google-compute-engine==20191210.0
google-resumable-media==0.4.1
googleapis-common-protos==1.51.0
grpc-google-iam-v1==0.12.3
grpcio==1.26.0
h5py==2.10.0
hdfs==2.5.8
html5lib==1.0.1
htmlmin==0.1.12
httplib2==0.12.0
icc-rt==2020.0.133
idna==2.8
ijson==2.6.1
imageio==2.6.1
importlib-metadata==1.4.0
intel-numpy==1.15.1
intel-openmp==2020.0.133
intel-scikit-learn==0.19.2
intel-scipy==1.1.0
ipykernel==5.1.4
ipython==7.9.0
ipython-genutils==0.2.0
ipython-sql==0.3.9
ipywidgets==7.5.1
isort==4.3.21
jedi==0.16.0
Jinja2==2.11.0
jinja2-time==0.2.0
jmespath==0.9.4
joblib==0.14.1
json5==0.8.5
jsonschema==3.2.0
jupyter==1.0.0
jupyter-aihub-deploy-extension==0.1
jupyter-client==5.3.4
jupyter-console==6.1.0
jupyter-contrib-core==0.3.3
jupyter-contrib-nbextensions==0.5.1
jupyter-core==4.6.1
jupyter-highlight-selected-word==0.2.0
jupyter-http-over-ws==0.0.7
jupyter-latex-envs==1.4.6
jupyter-nbextensions-configurator==0.4.1
jupyterlab==1.2.6
jupyterlab-git==0.9.0
jupyterlab-server==1.0.6
keyring==10.1
keyrings.alt==1.3
kiwisolver==1.1.0
kubernetes==10.0.1
lazy-object-proxy==1.4.3
llvmlite==0.31.0
lxml==4.4.2
Markdown==3.1.1
MarkupSafe==1.1.1
matplotlib==3.0.3
mccabe==0.6.1
missingno==0.4.2
mistune==0.8.4
mkl==2019.0
mkl-fft==1.0.6
mkl-random==1.0.1.1
mock==2.0.0
monotonic==1.5
more-itertools==8.1.0
nbconvert==5.6.1
nbdime==1.1.0
nbformat==5.0.4
networkx==2.4
nltk==3.4.5
notebook==6.0.3
numba==0.47.0
numpy==1.15.1
oauth2client==3.0.0
oauthlib==3.1.0
opencv-python==4.1.2.30
oscrypto==1.2.0
packaging==20.1
pandas==0.25.3
pandas-profiling==1.4.0
pandocfilters==1.4.2
papermill==1.2.1
parso==0.6.0
pathlib2==2.3.5
pbr==5.4.4
pexpect==4.8.0
phik==0.9.8
pickleshare==0.7.5
Pillow-SIMD==6.2.2.post1
pipdeptree==0.13.2
plotly==4.5.0
pluggy==0.13.1
poyo==0.5.0
prettytable==0.7.2
prometheus-client==0.7.1
prompt-toolkit==2.0.10
protobuf==3.11.2
psutil==5.6.7
ptyprocess==0.6.0
py==1.8.1
pyarrow==0.15.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.19
pycrypto==2.6.1
pycryptodomex==3.9.6
pycurl==7.43.0
pydaal==2019.0.0.20180713
pydot==1.4.1
Pygments==2.5.2
pygobject==3.22.0
PyJWT==1.7.1
pylint==2.4.4
pymongo==3.10.1
pyOpenSSL==19.1.0
pyparsing==2.4.6
pyrsistent==0.15.7
pytest==5.3.4
pytest-pylint==0.14.1
python-apt==1.4.1
python-dateutil==2.8.1
pytz==2019.3
PyWavelets==1.1.1
pyxdg==0.25
PyYAML==5.3
pyzmq==18.1.1
qtconsole==4.6.0
requests==2.22.0
requests-oauthlib==1.3.0
retrying==1.3.3
rsa==4.0
s3transfer==0.3.2
scikit-image==0.15.0
scikit-learn==0.19.2
scipy==1.1.0
seaborn==0.9.1
SecretStorage==2.3.1
Send2Trash==1.5.0
simplegeneric==0.8.1
six==1.14.0
smmap2==2.0.5
snowflake-connector-python==2.2.0
SQLAlchemy==1.3.13
sqlparse==0.3.0
tbb==2019.0
tbb4py==2019.0
tenacity==6.0.0
terminado==0.8.3
testpath==0.4.4
textwrap3==0.9.2
tornado==5.1.1
tqdm==4.42.0
traitlets==4.3.3
typed-ast==1.4.1
typing==3.7.4.1
typing-extensions==3.7.4.1
unattended-upgrades==0.1
uritemplate==3.0.1
urllib3==1.24.2
virtualenv==16.7.9
wcwidth==0.1.8
webencodings==0.5.1
websocket-client==0.57.0
Werkzeug==0.16.1
whichcraft==0.6.1
widgetsnbextension==3.5.1
wrapt==1.11.2
zipp==1.1.0


On Tue, Feb 4, 2020 at 11:33 AM Valentyn Tymofieiev 
wrote:

> [Valentyn's question quoted in full; snipped]

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Valentyn Tymofieiev
The fact that you have cloudpickle==1.2.2 further confirms that you may be
hitting the same error as
https://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype.

Could you try to start over with a clean virtual environment?
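
The usual route is python3 -m venv plus pip on the command line; scripted,
a sketch of the same experiment (POSIX paths assumed):

# Sketch: create a fresh environment with only the Beam SDK installed,
# to rule out pre-installed packages such as cloudpickle.
import subprocess
import sys

subprocess.check_call([sys.executable, '-m', 'venv', 'beam-clean'])
subprocess.check_call(['beam-clean/bin/pip', 'install',
                       'apache-beam[gcp]==2.19.0'])  # version from this thread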

On Tue, Feb 4, 2020 at 11:46 AM Alan Krumholz 
wrote:

> [Alan's message with the full pip freeze quoted; snipped]

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
I'm using a managed notebook instance from GCP.
It seems those already come with cloudpickle==1.2.2 as soon as you
provision them. apache-beam[gcp] will then install dill==0.3.1.1. I'm going
to try to uninstall cloudpickle before installing apache-beam and see if
this fixes the problem.
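
In case it is useful, here is a quick local sanity check (a sketch only;
`MyDoFn` is a stand-in for the real DoFn, and a round-trip that passes
locally can still fail on the worker, but a failure here reproduces the
error without launching a Dataflow job):

```
# Round-trip a DoFn through Beam's internal pickler and print the
# pickling-related package versions, so a mismatch with the Dataflow
# worker image is easier to spot.
import dill
import apache_beam as beam
from apache_beam.internal import pickler

class MyDoFn(beam.DoFn):  # stand-in for the pipeline's real DoFn
    def process(self, element):
        yield element

print('apache-beam', beam.__version__, '/ dill', dill.__version__)
try:
    import cloudpickle
    print('cloudpickle', cloudpickle.__version__)
except ImportError:
    print('cloudpickle not installed')

# A KeyError: 'ClassType' raised here reproduces the worker failure
# locally; a clean round-trip prints the DoFn's class name.
roundtripped = pickler.loads(pickler.dumps(MyDoFn()))
print('round-trip OK:', type(roundtripped).__name__)
```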

Thank you

On Tue, Feb 4, 2020 at 11:54 AM Valentyn Tymofieiev 
wrote:

> The fact that you have cloudpickle==1.2.2 further confirms that you may
> be hitting the same error as
> https://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype
>  .
>
> Could you try to start over with a clean virtual environment?
>
> On Tue, Feb 4, 2020 at 11:46 AM Alan Krumholz 
> wrote:
>
>> Hi Valentyn,
>>
>> Here is my pip freeze on my machine (note that the error is in dataflow,
>> the job runs fine in my machine)
>>
>> [pip freeze list snipped; identical to the list above]

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
Seems like the image we use in KFP to orchestrate the job has
cloudpickle==0.8.1, and that one doesn't seem to cause issues.
I think I'm unblocked for now, but I'm sure I won't be the last one to try
to do this using GCP managed notebooks :(
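
For anyone else hitting this from a managed notebook, a hypothetical
workaround (the version pins are assumptions taken from this thread, not
verified recommendations) is to force the older cloudpickle before
installing Beam:

```
# Hypothetical notebook cell: pin cloudpickle to the version the KFP
# image ships before installing Beam, so the notebook environment does
# not drift from the one the job is actually built with. The pins are
# assumptions from this thread, not tested recommendations.
import subprocess
import sys

for pkg in ('cloudpickle==0.8.1', 'apache-beam[gcp]==2.19.0'):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', pkg])
```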

Thanks for all the help!


On Tue, Feb 4, 2020 at 12:24 PM Alan Krumholz 
wrote:

> I'm using a managed notebook instance from GCP.
> It seems those already come with cloudpickle==1.2.2 as soon as you
> provision them. apache-beam[gcp] will then install dill==0.3.1.1. I'm
> going to try to uninstall cloudpickle before installing apache-beam and
> see if this fixes the problem.
>
> Thank you
>
> On Tue, Feb 4, 2020 at 11:54 AM Valentyn Tymofieiev 
> wrote:
>
>> The fact that you have cloudpickle==1.2.2 further confirms that you may
>> be hitting the same error as
>> https://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype
>>  .
>>
>> Could you try to start over with a clean virtual environment?
>>
>> On Tue, Feb 4, 2020 at 11:46 AM Alan Krumholz 
>> wrote:
>>
>>> Hi Valentyn,
>>>
>>> Here is my pip freeze on my machine (note that the error is in dataflow,
>>> the job runs fine in my machine)
>>>
>>> [pip freeze list snipped; identical to the list above]

Seattle Beam Meetup - March 2

2020-02-04 Thread Aizhamal Nurmamat kyzy
Hello everyone,

We are hosting a Beam Meetup in Seattle on March 2! If you are in the
Seattle area, please come and join us at the Google office in South Lake
Union.

Meetup agenda:
18:00 - Registration, speed networking, food and drinks.
18:30 - Encoding free-text drug names in electronic health records using
Apache Beam by Damon Douglas (Dito)
19:00 - Beam+Rust : Portable Correct Performance by Sean Jensen-Grey
(Google)
19:30 - Evolution of Dataswarm, a Facebook Scale Data Pipeline Platform by
Shawn Wang and Kaushik Raj (Facebook)
20:00 - Networking

The detailed agenda and logistics are published here:

https://www.meetup.com/Seattle-Apache-Beam-Meetup/events/268302476/

I hope to see a lot of you around!

Thanks,
Aizhamal


Dropping expired sessions with Apache Beam

2020-02-04 Thread Juliana Pereira
I have a web log file that contains session ids and interactions; there
are three interaction types, `GET`, `LOGIN`, and `LOGOUT`. Something like:

```
00:00:01;session1;GET
00:00:03;session2;LOGIN
00:01:01;session1;LOGOUT
00:03:01;session2;GET
00:08:15;session2;GET
```

and goes on.

I want to be able to identify (right now, I'm dealing with bounded data)
which sessions have expired. By expired I mean any session that does not
have any interaction within a 5-minute interval.

Of course, if the user sends `LOGOUT`, expiration is not applied. In the
data above, session2 should be considered expired.

I have the following pipeline:
```
( p
  | 'Read Files' >> ReadFromText(known_args.input, coder=LogCoder())
  | ParDo(LogToSession())
  | beam.Map(lambda entry: (entry.session, entry))
  | beam.GroupByKey()
)
```

The `LogCoder()` is responsible for correctly reading the input files. The
`LogToSession` DoFn converts a log line into a Python class that correctly
handles the data structure, so that I am able to access its properties.

For example, I can fetch `entry.session`, `entry.timestamp`, or
`entry.operation`.

Once processed by `LogToSession`, `entry.timestamp` is a Python `datetime`,
`entry.session` is a `str`, and `entry.operation` is also a `str`.

In plain Python I would keep a dict with each session as key and the last
timestamp as value. For each new entry of a given key I would check the
timedelta: if it is bigger than the window, the session is expired;
otherwise, I update the last timestamp (roughly as in the sketch below).
But I don't know how to handle this in Beam.
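
A plain-Python sketch of that dict-based approach (assuming the entry
objects described above):

```
# Single-process sketch of the dict-based approach described above,
# assuming entry objects with .session/.timestamp/.operation fields.
from datetime import timedelta

EXPIRY = timedelta(minutes=5)

def find_expired(entries):
    last_seen = {}      # session id -> timestamp of its last event
    logged_out = set()  # sessions explicitly closed with LOGOUT
    expired = set()
    for entry in sorted(entries, key=lambda e: e.timestamp):
        prev = last_seen.get(entry.session)
        if prev is not None and entry.timestamp - prev > EXPIRY:
            expired.add(entry.session)
        last_seen[entry.session] = entry.timestamp
        if entry.operation == 'LOGOUT':
            logged_out.add(entry.session)
    return expired - logged_out
```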

How to handle the next steps?
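
One possible direction for bounded data (a sketch only, untested; it
reuses `p`, `known_args`, `LogCoder`, and `LogToSession` from the snippet
above, and relies on `GroupByKey` collecting each whole session):

```
# After GroupByKey each element is (session_id, entries): sort the
# entries by timestamp and scan for a gap longer than the expiry
# window, honoring the LOGOUT rule described above.
import apache_beam as beam
from datetime import timedelta

EXPIRY = timedelta(minutes=5)

def classify(element):
    session_id, entries = element
    entries = sorted(entries, key=lambda e: e.timestamp)
    if entries[-1].operation == 'LOGOUT':
        return  # logged-out sessions are never considered expired
    for prev, curr in zip(entries, entries[1:]):
        if curr.timestamp - prev.timestamp > EXPIRY:
            yield (session_id, 'expired')
            return

expired = (
    p
    | 'Read Files' >> beam.io.ReadFromText(known_args.input, coder=LogCoder())
    | beam.ParDo(LogToSession())
    | beam.Map(lambda entry: (entry.session, entry))
    | beam.GroupByKey()
    | beam.FlatMap(classify)
)
```

For unbounded data, session windows (`beam.window.Sessions` with a
five-minute gap) would likely be a more natural starting point, though the
LOGOUT rule would still need custom handling.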


[ANNOUNCE] Beam 2.19.0 Released

2020-02-04 Thread Boyuan Zhang
The Apache Beam team is pleased to announce the release of version 2.19.0.

Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing. See https://beam.apache.org

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed on
the Beam blog: https://beam.apache.org/blog/2020/02/04/beam-2.19.0.html

Thanks to everyone who contributed to this release, and we hope you enjoy
using Beam 2.19.0.
-- Boyuan Zhang, on behalf of The Apache Beam team


Re: [ANNOUNCE] Beam 2.19.0 Released

2020-02-04 Thread Connell O'Callaghan
Well done and thank you Boyuan (and all involved)!!!

On Tue, Feb 4, 2020 at 4:25 PM Boyuan Zhang  wrote:

> The Apache Beam team is pleased to announce the release of version 2.19.0.
>
> Apache Beam is an open source unified programming model to define and
> execute data processing pipelines, including ETL, batch and stream
> (continuous) processing. See https://beam.apache.org
>
> You can download the release here:
>
> https://beam.apache.org/get-started/downloads/
>
> This release includes bug fixes, features, and improvements detailed on
> the Beam blog: https://beam.apache.org/blog/2020/02/04/beam-2.19.0.html
>
> Thanks to everyone who contributed to this release, and we hope you enjoy
> using Beam 2.19.0.
> -- Boyuan Zhang, on behalf of The Apache Beam team
>


Re: [ANNOUNCE] Beam 2.19.0 Released

2020-02-04 Thread Hannah Jiang
Thanks Boyuan!


On Tue, Feb 4, 2020 at 4:46 PM Connell O'Callaghan 
wrote:

> Well done and thank you Boyuan (and all involved)!!!
>
> On Tue, Feb 4, 2020 at 4:25 PM Boyuan Zhang  wrote:
>
>> The Apache Beam team is pleased to announce the release of version 2.19.0.
>>
>> Apache Beam is an open source unified programming model to define and
>> execute data processing pipelines, including ETL, batch and stream
>> (continuous) processing. See https://beam.apache.org
>>
>> You can download the release here:
>>
>> https://beam.apache.org/get-started/downloads/
>>
>> This release includes bug fixes, features, and improvements detailed on
>> the Beam blog: https://beam.apache.org/blog/2020/02/04/beam-2.19.0.html
>>
>> Thanks to everyone who contributed to this release, and we hope you enjoy
>> using Beam 2.19.0.
>> -- Boyuan Zhang, on behalf of The Apache Beam team
>>
>