Re: Shuffling on shardnum, is it necessary?

2019-09-27 Thread Shannon Duncan
shuffle will not be performed. > > Reuven > > On Fri, Sep 27, 2019 at 12:12 PM Shannon Duncan < > joseph.dun...@liveramp.com> wrote: > >> Interesting. Right now we are only doing batch processing so I hadn't >> thought about the windowing aspect. >> >

Re: Shuffling on shardnum, is it necessary?

2019-09-27 Thread Shannon Duncan
exactly this > case. > > Reuven > > On Fri, Sep 27, 2019 at 8:47 AM Shannon Duncan > wrote: > >> Yes, specifically TextIO withNumShards(). >> >> On Fri, Sep 27, 2019 at 10:45 AM Reuven Lax wrote: >> >>> I'm not sure what you mean by "w

Re: Shuffling on shardnum, is it necessary?

2019-09-27 Thread Shannon Duncan
Yes, specifically TextIO withNumShards(). On Fri, Sep 27, 2019 at 10:45 AM Reuven Lax wrote: > I'm not sure what you mean by "write out to a specific shard number." Are > you talking about FileIO sinks? > > Reuven > > On Fri, Sep 27, 2019 at 7:41 AM Shannon D

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Shannon Duncan
I see two main options here: (1) create an in-memory Iterable as you do your first iteration (a poor implementation, imo), or (2) separate your iterations into separate DoFns and call them separately with the PCollection output from Shuffle. There are many different paths but finding the most parallel way is prob
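The two options in this message can be sketched in plain Python. This is an illustration only, not the Beam API: `grouped_values` is a hypothetical stand-in for the grouped output, and it is assumed here that each read yields an iterator that can be consumed only once.

```python
from typing import Iterator


def grouped_values() -> Iterator[int]:
    """Hypothetical stand-in for grouped output: each call is one fresh, one-shot read."""
    return iter([3, 1, 4, 1, 5])


# A one-shot iterator is exhausted after the first pass.
once = grouped_values()
first_pass = sum(once)
second_pass = list(once)  # nothing left to iterate over

# Option 1: materialize the values in memory during the first iteration,
# then iterate over the cached copy as often as needed.
cached = list(grouped_values())
total_cached = sum(cached)
largest_cached = max(cached)


# Option 2: keep each iteration in its own step and give every step
# its own read of the grouped output.
def step_sum(values: Iterator[int]) -> int:
    return sum(values)


def step_max(values: Iterator[int]) -> int:
    return max(values)


total_separate = step_sum(grouped_values())
largest_separate = step_max(grouped_values())
```

Option 1 holds every value for a key in memory at once, which is the cost the message objects to; option 2 avoids the cache at the price of reading the grouped output once per step.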

Shuffling on shardnum, is it necessary?

2019-09-27 Thread Shannon Duncan
ffle service is enabled. Thoughts? Thanks, Shannon Duncan

Re: Prevent Shuffling on Writing Files

2019-09-19 Thread Shannon Duncan
that the Total shuffle data process counter counts the number of > bytes written to shuffle + the number of bytes read. So if you shuffle 1GB > of data, you should expect to see 2GB on the counter. > > On Wed, Sep 18, 2019 at 2:39 PM Shannon Duncan > wrote: > >> Ok just ran
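The counter arithmetic described in this reply can be written out as a short sketch. The function name is hypothetical, and the only assumption is the one stated in the reply: the counter adds bytes written to shuffle and bytes read back.

```python
def expected_shuffle_counter_gb(shuffled_gb: float) -> float:
    """Return the counter value if it sums bytes written to shuffle and bytes read."""
    written_gb = shuffled_gb  # data written into shuffle
    read_gb = shuffled_gb     # the same data read back out of shuffle
    return written_gb + read_gb


# The 1 GB case from the reply.
one_gb_case = expected_shuffle_counter_gb(1.0)

# The 29.32 GB input reported elsewhere in this thread.
thread_case = expected_shuffle_counter_gb(29.32)
```

Under this reading, the 29.32 GB going into the shuffle would put roughly 58.64 GB on the counter, which is close to the 58.66 GB reported in the thread.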

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Shannon Duncan
Sorry, missed a part of the map output for flatten: [image: image.png] However, the shuffle shows only 29.32 GB going into it, but the Total Shuffled data output is 58.66 GB [image: image.png] On Wed, Sep 18, 2019 at 4:39 PM Shannon Duncan wrote: > Ok just ran the job on a small in

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Shannon Duncan
On Wed, Sep 18, 2019 at 4:24 PM Reuven Lax wrote: > > > On Wed, Sep 18, 2019 at 2:12 PM Shannon Duncan > wrote: > >> I will attempt to do without sharding (though I believe we did do a run >> without shards and it incurred the extra shuffle costs). >> >

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Shannon Duncan
On Wed, Sep 18, 2019 at 4:08 PM Reuven Lax wrote: > In that case you should be able to leave sharding unspecified, and you > won't incur the extra shuffle. Specifying explicit sharding is generally > necessary only for streaming. > > On Wed, Sep 18, 20

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Shannon Duncan
Batch on DataflowRunner. On Wed, Sep 18, 2019 at 4:05 PM Reuven Lax wrote: > Are you using streaming or batch? Also which runner are you using? > > On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan > wrote: > >> So I followed up on why TextIO shuffles and dug into the code

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Shannon Duncan
o I'm expanding it to dev as well. +dev Finding a solution that prevents quadrupling shuffle costs when simply writing out a file is a necessity for large scale jobs that work with 100+ TB of data. If anyone has any ideas I'd love to hear them. Thanks, Shannon Duncan On Wed, Sep 18, 2019 at

Re: [Python] Read Hadoop Sequence File?

2019-07-16 Thread Shannon Duncan
> On Fri, Jul 12, 2019 at 2:55 PM Shannon Duncan > wrote: > >> Clarification on previous message. Only happens on local file system >> where it is unable to match a pattern string. Via a `gs://` link it >> is able to do multiple file matching. >> >>

Re: [Python] Read Hadoop Sequence File?

2019-07-12 Thread Shannon Duncan
Clarification on previous message: it only happens on the local file system, where it is unable to match a pattern string. Via a `gs://` link it is able to do multiple file matching. On Fri, Jul 12, 2019 at 1:36 PM Shannon Duncan wrote: > Awesome. I got it working for a single file, but for a struct

Re: [Python] Read Hadoop Sequence File?

2019-07-12 Thread Shannon Duncan
cloud/bigtable/beam/sequencefiles/ImportJob.java#L159-L173> > for an example for hbase Results > > On Wed, Jul 10, 2019 at 10:58 AM Shannon Duncan < > joseph.dun...@liveramp.com> wrote: > >> If I wanted to go ahead and include this within a new Java Pipeline, what >> woul

Re: [Java] Using a complex datastructure as Key for KV

2019-07-12 Thread Shannon Duncan
support within Apache Beam that might be able to provide > guidance (+Reuven Lax +Brian Hulette > ). > > On Fri, Jul 12, 2019 at 11:05 AM Shannon Duncan < > joseph.dun...@liveramp.com> wrote: > >> I have a working TreeMapCoder now. Got it all setup and done, and the >

Re: Python Utilities

2019-07-10 Thread Shannon Duncan
/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/extensions/join-library/src/main/java/org/apache/beam/sdk/extensions/joinlibrary/Join.java So how does it handle that? On Mon, Jul 8, 2019 at 12:39 PM Shannon Duncan wrote: > Yeah these are for local testing right now. I was hoping to gain insight > on better

Re: [Python] Read Hadoop Sequence File?

2019-07-10 Thread Shannon Duncan
ttps://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java > >>> > > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bi

Re: Python Utilities

2019-07-08 Thread Shannon Duncan
sforms or/and putting it in an > extension Python module, instead of the main ones? > > Best, > Robin > > On Mon, Jul 8, 2019 at 9:19 AM Shannon Duncan > wrote: > >> As a follow up. Here is the repo that contains the utilities for now. >> https://github.com/s

Re: Python Utilities

2019-07-08 Thread Shannon Duncan
As a follow up. Here is the repo that contains the utilities for now. https://github.com/shadowcodex/apache-beam-utilities. Will put together a proper PR as code gets closer to production quality. - Shannon On Mon, Jul 8, 2019 at 9:20 AM Shannon Duncan wrote: > Thanks Frederik, > >

Re: Python Utilities

2019-07-08 Thread Shannon Duncan
named addressee you should not > disseminate, distribute or copy this e-mail. Please notify the sender > immediately by e-mail if you have received this e-mail by mistake and > delete this e-mail from your system. If you are not the intended recipient > you are notified that discl

Re: Python Utilities

2019-07-08 Thread Shannon Duncan
n On Sun, Jul 7, 2019 at 10:47 PM Rui Wang wrote: > Maybe also adding Aggregation/GroupBy as utilities? > > > -Rui > > On Sun, Jul 7, 2019 at 1:46 PM Shannon Duncan > wrote: > >> Thanks Valentyn, >> >> I'll outline the utilities and accept any s

Re: Python Utilities

2019-07-07 Thread Shannon Duncan
> > - If your change is large or it is your first change, it is a good idea to > discuss it on the dev@ mailing list > - For large changes create a design doc (template, examples) and email it > to the dev@ mailing list. > > Thanks, > Valentyn > > On Wed, Jul 3, 2019 at 3

Python Utilities

2019-07-03 Thread Shannon Duncan
I have been writing a bunch of utilities for the Python SDK such as joins, selections, composite transforms, etc. I am working with my company to see if I can open source the utilities. Would it be best to post them on a separate PyPI project, or to PR them into the Beam SDK? I assume if they le

Re: [Python] Read Hadoop Sequence File?

2019-07-02 Thread Shannon Duncan
glers to decide, whether they want to donate this. > > > > D. > > > > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan < > joseph.dun...@liveramp.com> wrote: > >> > >> It's not outside the realm of possibilities. For now I've created an