Re: Side Inputs size

2019-04-08 Thread Lukasz Cwik
Side input performance and scaling are runner dependent. Runners should
attempt to provide efficient random-access lookups into the map. Side
inputs should also be cached across elements if the map hasn't changed,
which runners should likewise be capable of doing.

So yes, side input size can impact performance depending on which runner
you choose to use. Some runners don't deal with side inputs at all while
others may scale to support terabytes in size.

Saving it as a static class variable may be a useful workaround if the
runner is not performing as well as you would like.
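The static-variable workaround can be sketched in plain Python (no Beam dependency; the class and `loader` callable here are illustrative, not Beam's API). The point is that the expensive map is built at most once per worker process, guarded by a lock, and every element afterwards only pays for a dict lookup:

```python
import threading

class WordCountFn:
    """Sketch of the static-variable workaround: keep the big side-input
    map in a class-level slot so each worker process loads it only once.
    `loader` is a hypothetical callable that builds the map; in a real
    Beam DoFn this would typically happen in setup()."""

    _counts = None
    _lock = threading.Lock()

    def __init__(self, loader):
        self._loader = loader

    def _get_counts(self):
        if WordCountFn._counts is None:
            with WordCountFn._lock:
                if WordCountFn._counts is None:  # double-checked locking
                    WordCountFn._counts = self._loader()
        return WordCountFn._counts

    def process(self, word):
        # Cheap dict lookup; the expensive load happened at most once.
        return self._get_counts().get(word, 0)
```

Note the cache is shared across instances of the class, so a runner that constructs the DoFn many times in one process still loads the map only once.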

Map side inputs are usually used to produce joins. Have you tried using
CoGroupByKey to do the join instead?
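The semantics of the CoGroupByKey join can be shown in a few lines of plain Python (a runner-free sketch, not Beam's actual implementation): for each key, collect the values from both keyed collections side by side.

```python
from collections import defaultdict

def cogroup_by_key(left, right):
    """Plain-Python sketch of CoGroupByKey semantics: for each key,
    gather the values from both keyed collections."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)

word_counts = [("beam", 3), ("flink", 1)]
documents = [("beam", "doc-a"), ("beam", "doc-b")]
joined = cogroup_by_key(word_counts, documents)
# joined["beam"] -> ([3], ["doc-a", "doc-b"])
```

Unlike a map side input, this shape lets the runner shuffle both datasets by key, so neither one has to fit in memory on any single worker.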

On Mon, Apr 8, 2019 at 10:30 AM augusto@gmail.com 
wrote:

> Hi,
>
> In one of my transforms I am using a Map, the result of a previous
> transform, as a sideInput. This Map is potentially very large,
> holding counts of all words that appeared in all documents.
>
> The step that uses the sideInput is quite slow because it seems to be
> initialising a huge HashMap for every element it processes (I followed
> this example:
> https://beam.apache.org/documentation/programming-guide/#side-inputs)
>
> Is this the wrong way of using sideInputs? And by this I mean: can a
> sideInput be too big to be a sideInput? I also thought about saving the
> sideInput as a static class variable; then, in principle, I only have to
> read it once per "transform" initialised in the cluster.
>
> Am I going totally wrong about this, should I try other approaches?
>
> Best regards,
> Augusto
>
>
>


Re: Implementation of an S3 file system for the Python SDK - Updated

2019-04-08 Thread Lukasz Cwik
A filesystem is a lower-level abstraction that a PTransform can use, so
there is no need to consider SDF when creating the S3 filesystem.
If we were redesigning the interface to all filesystems, then SDF should
be considered.
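The layering argument can be made concrete with a small sketch (illustrative interface, not Beam's actual `FileSystem` API): transforms depend only on a few narrow operations, so adding an S3 backend means implementing those operations and nothing SDF-related.

```python
import abc
import io

class FileSystemSketch(abc.ABC):
    """Illustrative lower-level filesystem interface. A transform that
    reads files needs only these operations, regardless of backend."""

    @abc.abstractmethod
    def match(self, pattern):
        """Return the paths matching a glob-like pattern."""

    @abc.abstractmethod
    def open(self, path):
        """Return a readable file-like object for `path`."""

class InMemoryFileSystem(FileSystemSketch):
    """Toy backend showing how small the contract is; an S3 backend
    would implement the same two methods against the S3 API."""

    def __init__(self, files):
        self._files = files  # path -> bytes

    def match(self, pattern):
        prefix = pattern.rstrip("*")
        return sorted(p for p in self._files if p.startswith(prefix))

    def open(self, path):
        return io.BytesIO(self._files[path])

fs = InMemoryFileSystem({"s3://bucket/a.txt": b"hi"})
# fs.match("s3://bucket/*") -> ["s3://bucket/a.txt"]
```

Splitting and scaling decisions (where SDF lives) happen in the transform that calls `match`/`open`, a layer above this interface.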

On Mon, Apr 8, 2019 at 10:54 AM Lara Schmidt  wrote:

> I'd push towards waiting until SDF is working end to end to begin
> converting things, unless it's something like the Text.ReadAll batch API
> that gets benefits without an SDF implementation. I don't have a lot of
> context on what file APIs Python already supports.
>
> On Mon, Apr 8, 2019 at 10:06 AM Pablo Estrada  wrote:
>
>> Currently, Pasan is working on a design for adding a couple of
>> implementations to the Filesystem interface in Python, and it's not
>> necessary to consider SDF here, IMHO.
>>
>> On the other hand, Python's fileio[1] could probably use SDF-based
>> improvements to split when many files are being matched.
>> Best
>> -P.
>>
>> On Mon, Apr 8, 2019 at 10:00 AM Alex Amato  wrote:
>>
>>> +Lukasz Cwik , +Boyuan Zhang , +Lara
>>> Schmidt 
>>>
>>> Should splittable DoFn be considered in this design, in order to split
>>> and scale the source step properly?
>>>
>>> On Mon, Apr 8, 2019 at 9:11 AM Ahmet Altay  wrote:
>>>
 +dev  +Pablo Estrada  +Chamikara
 Jayalath  +Udi Meiri 

 Thank you Pasan. I quickly looked at the proposal and it looks good.
 Added a few folks who could offer additional feedback.

 On Mon, Apr 8, 2019 at 12:13 AM Pasan Kamburugamuwa <
 pasankamburugamu...@gmail.com> wrote:

> Hi,
>
> I have updated the project proposal according to the given feedback.
> Could you check my proposal again and give me your feedback on the
> corrections I have made?
>
> Here is the link to the updated project proposal
>
> https://docs.google.com/document/d/1i_PoIrbmhNgwKCS1TYWC28A9RsyZQFsQCJic3aCXO-8/edit?usp=sharing
>
> Thank you
> Pasan Kamburugamuwa
>



Side Inputs size

2019-04-08 Thread augusto.mcc
Hi,

In one of my transforms I am using a Map, the result of a previous
transform, as a sideInput. This Map is potentially very large,
holding counts of all words that appeared in all documents.

The step that uses the sideInput is quite slow because it seems to be
initialising a huge HashMap for every element it processes (I followed this
example: https://beam.apache.org/documentation/programming-guide/#side-inputs)

Is this the wrong way of using sideInputs? And by this I mean: can a
sideInput be too big to be a sideInput? I also thought about saving the
sideInput as a static class variable; then, in principle, I only have to
read it once per "transform" initialised in the cluster.

Am I going totally wrong about this, should I try other approaches?

Best regards,
Augusto




Re: Implementation of an S3 file system for the Python SDK - Updated

2019-04-08 Thread Pablo Estrada
Currently, Pasan is working on a design for adding a couple of
implementations to the Filesystem interface in Python, and it's not
necessary to consider SDF here, IMHO.

On the other hand, Python's fileio[1] could probably use SDF-based
improvements to split when many files are being matched.
Best
-P.

On Mon, Apr 8, 2019 at 10:00 AM Alex Amato  wrote:

> +Lukasz Cwik , +Boyuan Zhang , +Lara
> Schmidt 
>
> Should splittable DoFn be considered in this design, in order to split
> and scale the source step properly?
>
> On Mon, Apr 8, 2019 at 9:11 AM Ahmet Altay  wrote:
>
>> +dev  +Pablo Estrada  +Chamikara
>> Jayalath  +Udi Meiri 
>>
>> Thank you Pasan. I quickly looked at the proposal and it looks good.
>> Added a few folks who could offer additional feedback.
>>
>> On Mon, Apr 8, 2019 at 12:13 AM Pasan Kamburugamuwa <
>> pasankamburugamu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have updated the project proposal according to the given feedback.
>>> Could you check my proposal again and give me your feedback on the
>>> corrections I have made?
>>>
>>> Here is the link to the updated project proposal
>>>
>>> https://docs.google.com/document/d/1i_PoIrbmhNgwKCS1TYWC28A9RsyZQFsQCJic3aCXO-8/edit?usp=sharing
>>>
>>> Thank you
>>> Pasan Kamburugamuwa
>>>
>>


Re: Is there an integration test available for filesystem checking

2019-04-08 Thread Pablo Estrada
I recommend you send these questions to the dev@ list Pasan.

Have you looked at the *_test.py files corresponding to each one of the
file systems? Are they all mocking their access to GCS?
Best
-P.
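The mocking pattern those `*_test.py` files use can be sketched with the standard library alone (the `read_words` helper and the `gs://` path here are hypothetical, purely for illustration): the code under test takes its filesystem access as a dependency, and the test swaps in an in-memory stub so nothing ever touches GCS.

```python
import io
from unittest import mock

def read_words(open_fn, path):
    """Toy code under test: reads whitespace-separated words through a
    filesystem-style `open_fn` (illustrative, not a real Beam helper)."""
    with open_fn(path) as handle:
        return handle.read().decode("utf-8").split()

# The test never contacts a real service: the filesystem call is replaced
# by a stub that serves bytes from memory.
fake_open = mock.Mock(side_effect=lambda path: io.BytesIO(b"hello world"))
words = read_words(fake_open, "gs://bucket/words.txt")
fake_open.assert_called_once_with("gs://bucket/words.txt")
# words == ["hello", "world"]
```

An integration test is the same shape with the stub removed and a real bucket configured, which is why the unit-test files are the natural place to start looking.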

On Sun, Apr 7, 2019 at 11:12 PM Pasan Kamburugamuwa <
pasankamburugamu...@gmail.com> wrote:

> Hello,
>
> I am currently updating the project proposal which I have already sent
> to the community for feedback. I have a question: is there any
> integration test available for testing the filesystem?
>
> Thanks
> pasan kamburugamuwa
>


Re: Implementation of an S3 file system for the Python SDK - Updated

2019-04-08 Thread Ahmet Altay
+dev  +Pablo Estrada  +Chamikara
Jayalath  +Udi Meiri 

Thank you Pasan. I quickly looked at the proposal and it looks good. Added
a few folks who could offer additional feedback.

On Mon, Apr 8, 2019 at 12:13 AM Pasan Kamburugamuwa <
pasankamburugamu...@gmail.com> wrote:

> Hi,
>
> I have updated the project proposal according to the given feedback.
> Could you check my proposal again and give me your feedback on the
> corrections I have made?
>
> Here is the link to the updated project proposal
>
> https://docs.google.com/document/d/1i_PoIrbmhNgwKCS1TYWC28A9RsyZQFsQCJic3aCXO-8/edit?usp=sharing
>
> Thank you
> Pasan Kamburugamuwa
>


Re: Couchbase

2019-04-08 Thread Ismaël Mejía
Hello,

Guobao is working on this, but he is OOO at least until the end of next
week, so if you can wait it will be available 'soon'.
If you need this urgently and decide to write your own write
implementation, it would be a valuable contribution that I would be happy
to review.

Regards,
Ismaël
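A write connector of the kind requested usually boils down to a batching pattern, as in DatastoreV1.Write: buffer elements as they arrive and flush them to the store in bulk. A runner-free sketch (the class name and `bulk_insert` callable are hypothetical; for Couchbase the callable might wrap the SDK's multi-insert):

```python
class BatchingWriter:
    """Sketch of the batching pattern used by write connectors: buffer
    elements and flush them to the backend in fixed-size batches."""

    def __init__(self, bulk_insert, batch_size=500):
        self._bulk_insert = bulk_insert  # hypothetical bulk-write callable
        self._batch_size = batch_size
        self._buffer = []

    def process(self, element):
        self._buffer.append(element)
        if len(self._buffer) >= self._batch_size:
            self.finish_batch()

    def finish_batch(self):
        # Also called at bundle finish so no buffered element is lost.
        if self._buffer:
            self._bulk_insert(list(self._buffer))
            self._buffer.clear()

batches = []
writer = BatchingWriter(batches.append, batch_size=2)
for doc in ({"id": 1}, {"id": 2}, {"id": 3}):
    writer.process(doc)
writer.finish_batch()
# batches == [[{"id": 1}, {"id": 2}], [{"id": 3}]]
```

In a real Beam DoFn, `process` maps to `@ProcessElement` and the final flush to `@FinishBundle`; the batch size trades memory for fewer round trips to the store.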



On Mon, Apr 8, 2019 at 2:16 PM Joshua Fox  wrote:

> Note that the Read part has recently been developed. I need very
> simple write functionality -- simply inserting JsonObjects into
> Couchbase.
>
> On Mon, Apr 8, 2019 at 3:13 PM Joshua Fox  wrote:
>
>> I am looking for the equivalent of
>> org.apache.beam.sdk.io.gcp.datastore.DatastoreV1.Write for Couchbase
>>
>> What is the status? From this
>>  and this
>> , it does not seem to
>> be in-progress.
>>
>>
>
> --
>
>
> *JOSHUA FOX*
> Director, Software Architecture | Freightos
>
>
>
> *T (Israel): *+972-545691165 | *T (US)*:  +1-3123400953
> Smooth shipping.
>
>
>
>


Re: Couchbase

2019-04-08 Thread Joshua Fox
Note that the Read part has recently been developed. I need very simple
write functionality -- simply inserting JsonObjects into Couchbase.

On Mon, Apr 8, 2019 at 3:13 PM Joshua Fox  wrote:

> I am looking for the equivalent of
> org.apache.beam.sdk.io.gcp.datastore.DatastoreV1.Write for Couchbase
>
> What is the status? From this
>  and this
> , it does not seem to be
> in-progress.
>
>

-- 


*JOSHUA FOX*
Director, Software Architecture | Freightos



*T (Israel): *+972-545691165 | *T (US)*:  +1-3123400953
Smooth shipping.


Couchbase

2019-04-08 Thread Joshua Fox
I am looking for the equivalent of
org.apache.beam.sdk.io.gcp.datastore.DatastoreV1.Write for Couchbase

What is the status? From this
 and this
, it does not seem to be
in-progress.


Re: Is AvroCoder the right coder for me?

2019-04-08 Thread Augusto Ribeiro
Hi Ryan,

Thanks for the input. When I last tried running my pipeline, this problem
didn't seem to be a huge bottleneck; I probably had other things that were
making it worse. I still find it odd that when you take a thread-dump
"snapshot" most of the threads are waiting on that lock, so after I fix
my other problems I might come back to this.

Best regards,
Augusto
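The contention Ryan describes below -- a per-class cache guarded by a single lock -- can be modelled in a few lines of illustrative Python (a sketch of the pattern, not Avro's actual code). Even cache hits take the lock, which is why a thread dump shows most threads queued on one method:

```python
import threading

class SchemaCacheSketch:
    """Model of a synchronized per-class schema cache, the pattern in
    Avro's ReflectData. Every lookup -- hit or miss -- acquires the same
    lock, so under many threads they serialize at this point."""

    def __init__(self, compute):
        self._compute = compute  # stand-in for expensive reflection
        self._cache = {}
        self._lock = threading.Lock()

    def schema_for(self, cls):
        with self._lock:  # all threads queue here, even on cache hits
            if cls not in self._cache:
                self._cache[cls] = self._compute(cls)
            return self._cache[cls]

computed = []
def compute(cls):
    computed.append(cls)
    return "schema:" + cls.__name__

cache = SchemaCacheSketch(compute)
cache.schema_for(dict)
cache.schema_for(dict)  # second call is a hit, but still takes the lock
# computed == [dict]
```

The fix Ryan suggests (AvroCoder.of(Schema) with a precomputed schema) sidesteps the cache lookup entirely rather than making the lock cheaper.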

On 2019/04/04 12:39:52, Ryan Skraba wrote:
> Hello Augusto!
>
> I just took a look. The behaviour that you're seeing looks like it's set
> in Avro ReflectData -- to avoid doing expensive reflection calls for each
> serialization/deserialization, it uses a cache per-class AND access is
> synchronized [1]. Only one thread in your executor JVM is accessing the
> cached ClassAccessorData at a time, and so it's "normal" that the others
> are waiting... Of course, this doesn't mean that only one thread in the
> executor is running at a time, just that they always need to wait their
> turn before passing through that one method.
>
> You could have more executors with fewer cores per executor. That might
> shed some light, but it's not really a workaround or solution.
>
> We've had really good results with AvroCoder.of(Schema), which uses
> GenericData underneath. We already knew the schema we wanted, so it was
> ok to lose the "magic" of ReflectData and its automatic schema inference,
> etc. I'm a bit surprised that this hasn't come up as a bottleneck before
> in Avro, but I didn't find an existing JIRA.
>
> If Avro serialization isn't important to you, you might want to check out
> the custom Coder route. I'd love to hear if you see a big gain in perf!
>
> I hope this helps, Ryan
>
> [1]
> https://github.com/apache/avro/blame/branch-1.8/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java#L262
>
> On Tue, Apr 2, 2019 at 5:52 PM Maximilian Michels wrote:
>
>> Hey Augusto,
>>
>> I haven't used @DefaultCoder, but it could be the problem here.
>>
>> What if you specify the coder directly for your PCollection? For example:
>>
>>    pCol.setCoder(AvroCoder.of(YourClazz.class));
>>
>> Thanks,
>> Max
>>
>> On 01.04.19 17:52, Augusto Ribeiro wrote:
>>> Hi Max,
>>>
>>> I tried to run the job again in a cluster; this is a thread dump from
>>> one of the Spark executors (16 cores):
>>>
>>> https://imgur.com/u2Gz0xY
>>>
>>> As you can see, almost all threads are blocked on that single Avro
>>> reflection method.
>>>
>>> Best regards,
>>> Augusto
>>>
>>> On 2019/03/27 07:43:17, Augusto Ribeiro wrote:
>>>> Hi Max,
>>>>
>>>> Thanks for the answer. I will give it another try after I have sorted
>>>> out some other things. I will try to save more data next time
>>>> (screenshots, thread dumps) so that if it happens again I can be more
>>>> specific in my questions.
>>>>
>>>> Best regards,
>>>> Augusto
>>>>
>>>> On 2019/03/26 12:31:54, Maximilian Michels wrote:
>>>>> Hi Augusto,
>>>>>
>>>>> Generally speaking, Avro should provide very good performance. The
>>>>> calls you are seeing should not be significant because Avro caches
>>>>> the schema information for a type. It only creates a schema via
>>>>> reflection the first time it sees a new type.
>>>>>
>>>>> You can optimize further by using your domain knowledge and creating
>>>>> a custom coder. However, if you do not do anything fancy, I think the
>>>>> odds are low that you will see a performance increase.
>>>>>
>>>>> Cheers,
>>>>> Max
>>>>>
>>>>> On 26.03.19 09:35, Augusto Ribeiro wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> Sorry for bumping this thread, but nobody really came with insight.
>>>>>>
>>>>>> Should I be defining my own coders for my objects, or is it common
>>>>>> practice to use the AvroCoder or maybe some other coder?
>>>>>>
>>>>>> Best regards,
>>>>>> Augusto
>>>>>>
>>>>>> On 2019/03/21 07:35:07, au...@gmail.com wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying out Beam to do some data aggregations. Many of the
>>>>>>> inputs/outputs of my transforms are complex objects (not super
>>>>>>> complex, but containing Maps/Lists/Sets sometimes), so when I was
>>>>>>> prompted to define a coder for these objects I added the annotation
>>>>>>> @DefaultCoder(AvroCoder.class) and things worked in my development
>>>>>>> environment.
>>>>>>>
>>>>>>> Now that I am trying to run it on "real" data I

Implementation of an S3 file system for the Python SDK - Updated

2019-04-08 Thread Pasan Kamburugamuwa
Hi,

I have updated the project proposal according to the given feedback.
Could you check my proposal again and give me your feedback on the
corrections I have made?

Here is the link to the updated project proposal
https://docs.google.com/document/d/1i_PoIrbmhNgwKCS1TYWC28A9RsyZQFsQCJic3aCXO-8/edit?usp=sharing

Thank you
Pasan Kamburugamuwa


Is there an integration test available for filesystem checking

2019-04-08 Thread Pasan Kamburugamuwa
Hello,

I am currently updating the project proposal which I have already sent to
the community for feedback. I have a question: is there any integration
test available for testing the filesystem?

Thanks
pasan kamburugamuwa