On Thu, Feb 21, 2019 at 11:34 PM Daniel Erenrich wrote:
>
> Does this code snippet represent what you are talking about, Mike?
>
> val partitionedWords = distinctWords.keyBy(w => r.nextInt(N))
> val partitionCounts = partitionedWords.countByKey.asMapSideInput
> val out =
> partitioned
Does this code snippet represent what you are talking about, Mike?
val partitionedWords = distinctWords.keyBy(w => r.nextInt(N))
val partitionCounts = partitionedWords.countByKey.asMapSideInput
val out =
partitionedWords.groupByKey.withSideInputs(partitionCounts).flatMap((t, s)
=> {
Hi Shrikant,
Pay attention in your parameter --output=gs://test-bucket*/c *\
Your configuration is indicate directory /c , not */b*
So, check in your Storage GCP if exist this directory : /c (
-output=gs://test-bucket/c )
and check :
gsutil ls gs://test-bucket*/c*
Cheers
Carlos Molina
On Thu,
Hi Team,
I have tried running WordCount with DataflowRunner on GC. However I am
getting exception as
Caused by: java.lang.IllegalArgumentException: *Output
path does not exist or is not writeable: gs://test-bucket/b.*
However, gs://test-bucket/b exists and can be accessible by g
One could group it by a random integer up to N like Bradshaw suggests, and
then globally find a offset for each key such that it becomes dense.
The only non-distributed work is finding the offsets, which should over at
most N rows, which in turn should be somewhere in the magnitude of your
paralle
On Wed, Feb 20, 2019 at 6:54 PM Raghu Angadi wrote:
>
> On Tue, Feb 12, 2019 at 10:28 AM Robert Bradshaw wrote:
>>
>> Correct, even within the same key there's no promise of event time ordering
>> mapping of panes to real time ordering because the downstream operations may
>> happen on a differ
Your intuition is correct that this will not scale well. Here
distinctWords will need to be read in its entirety into memory, and in
addition doing a lookup with indexOf for each word will be O(n^2)
work.
If it is sufficient to get indices that are mostly, but not perfectly,
dense, I would recomme