Re: Indexing a PCollection

2019-02-21 Thread Robert Bradshaw
On Thu, Feb 21, 2019 at 11:34 PM Daniel Erenrich wrote: > > Does this code snippet represent what you are talking about, Mike? > > val partitionedWords = distinctWords.keyBy(w => r.nextInt(N)) > val partitionCounts = partitionedWords.countByKey.asMapSideInput > val out = > partitioned

Re: Indexing a PCollection

2019-02-21 Thread Daniel Erenrich
Does this code snippet represent what you are talking about, Mike? val partitionedWords = distinctWords.keyBy(w => r.nextInt(N)) val partitionCounts = partitionedWords.countByKey.asMapSideInput val out = partitionedWords.groupByKey.withSideInputs(partitionCounts).flatMap((t, s) => {

Re: Running WordCount With DataflowRunner

2019-02-21 Thread Henrique Molina
Hi Shrikant, Pay attention in your parameter --output=gs://test-bucket*/c *\ Your configuration is indicate directory /c , not */b* So, check in your Storage GCP if exist this directory : /c ( -output=gs://test-bucket/c ) and check : gsutil ls gs://test-bucket*/c* Cheers Carlos Molina On Thu,

Running WordCount With DataflowRunner

2019-02-21 Thread shrikant bang
Hi Team, I have tried running WordCount with DataflowRunner on GC. However I am getting exception as Caused by: java.lang.IllegalArgumentException: *Output path does not exist or is not writeable: gs://test-bucket/b.* However, gs://test-bucket/b exists and can be accessible by g

Re: Indexing a PCollection

2019-02-21 Thread Mike Pedersen
One could group it by a random integer up to N like Bradshaw suggests, and then globally find a offset for each key such that it becomes dense. The only non-distributed work is finding the offsets, which should over at most N rows, which in turn should be somewhere in the magnitude of your paralle

Re: Some questions about ensuring correctness with windowing and triggering

2019-02-21 Thread Robert Bradshaw
On Wed, Feb 20, 2019 at 6:54 PM Raghu Angadi wrote: > > On Tue, Feb 12, 2019 at 10:28 AM Robert Bradshaw wrote: >> >> Correct, even within the same key there's no promise of event time ordering >> mapping of panes to real time ordering because the downstream operations may >> happen on a differ

Re: Indexing a PCollection

2019-02-21 Thread Robert Bradshaw
Your intuition is correct that this will not scale well. Here distinctWords will need to be read in its entirety into memory, and in addition doing a lookup with indexOf for each word will be O(n^2) work. If it is sufficient to get indices that are mostly, but not perfectly, dense, I would recomme