As a generic answer: in a distributed environment like Spark, making sure
that data is distributed evenly among all nodes (assuming every node is the
same or similar) can help performance.
repartition thus controls the data distribution across all nodes. However,
it is not that straightforward. Your
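To make the "distributed evenly" point concrete, here is a rough plain-Python sketch of hash partitioning. The key names are made up, and Spark's actual HashPartitioner does the JVM equivalent of this on key hashCodes, but the load-balancing idea behind repartition(n) is the same:

```python
# Rough plain-Python illustration (not Spark internals): rows are spread
# over n partitions by hashing a key, which keeps the distribution roughly
# even when keys are diverse.
from collections import Counter

def assign_partition(key, num_partitions):
    # Python's % on a positive modulus always yields a non-negative bucket
    return hash(key) % num_partitions

keys = [f"device-{i}" for i in range(1000)]  # hypothetical key values
counts = Counter(assign_partition(k, 4) for k in keys)
print(dict(counts))  # roughly 250 keys land in each of the 4 partitions
```

When keys are heavily skewed (many rows share one key), this hashing scheme puts them all in one partition, which is one reason repartitioning is "not that straightforward".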
Please find below the exact exception:
Exception in thread "streaming-job-executor-3" java.lang.OutOfMemoryError:
Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.la
Can you please help?
Thanks
Amit
On Sun, Nov 8, 2020 at 1:35 PM Amit Sharma wrote:
> Hi , I am using 16 nodes spark cluster with below config
> 1. Executor memory 8 GB
> 2. 5 cores per executor
> 3. Driver memory 12 GB.
>
>
> We have a streaming job. We do not usually see a problem, but sometimes we get
>
Do you mean I need to index them, one-hot encode them, and then interact them?
I tried both ways:
Index -> interact -> onehotencode: it gave me 25 combinations.
Index -> onehotencode -> interact: it gave me 16 combinations.
Neither of them gave me expected 24 combinations. Did I miss something?
Thanks,
Fr
Append mode will wait until the watermark expires:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
This is because append mode doesn't output rows until a row is finalized.
Theoretically, data for any row can appear anytime as long as it's in the
waterm
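A toy model of that rule, in plain Python rather than the Spark API: the watermark is the max event time seen so far minus the allowed lateness, and append mode emits a window's aggregate row only once the watermark passes that window's end.

```python
# Toy model (not the Spark API): append mode emits a window only after
# watermark = max(event time seen) - allowed lateness passes its end.
def finalized_windows(event_times, window_size, delay):
    watermark = max(event_times) - delay
    starts = {t - t % window_size for t in event_times}  # tumbling windows
    # a window [start, start + window_size) is final once the watermark
    # reaches its end; until then append mode stays silent about it
    return sorted(s for s in starts if watermark >= s + window_size)

# events at t=5, 12, 61 seconds; 10s windows; 30s allowed lateness:
# watermark = 61 - 30 = 31, so [0,10) and [10,20) are final, [60,70) waits
print(finalized_windows([5, 12, 61], 10, 30))  # [0, 10]
```

This is why an append-mode sink can look "stuck": until new events push the watermark forward, nothing is finalized and nothing is emitted.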
Hi,
Just need some advise.
- When we have multiple Spark nodes running code, under what conditions does
a repartition make sense?
- Can we repartition and cache the result --> df = spark.sql("select from
...").repartition(4).cache()
- If we choose a repartition (4), will that repartition ap
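On the second bullet: yes, repartition and cache can be chained, and cache() is itself lazy, so nothing is materialized until the first action. A plain-Python analogy (not the Spark API) for why caching the repartitioned result pays off when df is reused across several actions:

```python
# Plain-Python analogy, not Spark: why spark.sql(...).repartition(4).cache()
# helps when df is reused -- the expensive lineage runs once, and later
# actions hit the materialized result instead of recomputing it.
calls = {"count": 0}

def expensive_lineage():
    calls["count"] += 1          # stands in for the scan + shuffle
    return list(range(1000))

cached = None

def action():
    global cached
    if cached is None:           # like .cache(): materialize on first action
        cached = expensive_lineage()
    return len(cached)

action()
action()
print(calls["count"])  # 1: the lineage ran once; the second action reused it
```

Without the cache, every action on df would replay the query and the repartition shuffle from scratch.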
Working on a POC to determine if Spark is a good fit for our use case.
I have a stream of readings from IoT devices that are being published to a
Kafka topic. I need to consume those readings along with recently consumed
readings for the same device to determine the quality of the current
reading.
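This reads like classic per-key stateful stream processing; in Spark Structured Streaming that is roughly the territory of flatMapGroupsWithState (Scala/Java) or applyInPandasWithState (PySpark). A hypothetical plain-Python sketch of the per-device state such a job would keep -- the quality rule, window size, and names here are all invented for illustration:

```python
from collections import defaultdict, deque

# Hypothetical sketch (rule and sizes invented): keep the last few readings
# per device and judge the current reading against that recent history --
# roughly the state a Spark stateful operator would manage per key.
recent = defaultdict(lambda: deque(maxlen=5))  # last 5 readings per device

def quality(device_id, value):
    history = recent[device_id]
    if history:
        baseline = sum(history) / len(history)
        ok = abs(value - baseline) <= 0.2 * abs(baseline)  # within 20%
    else:
        ok = True                 # no history yet: accept the reading
    history.append(value)
    return ok

print(quality("dev-1", 10.0))  # True  (no history yet)
print(quality("dev-1", 10.5))  # True  (within 20% of 10.0)
print(quality("dev-1", 20.0))  # False (far from the recent mean 10.25)
```

The point for the POC: Spark can hold this kind of keyed state for you, with watermarks to bound how long stale device state is retained.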
I think you have this flipped around - you want to one-hot encode, then
compute interactions. As it is you are treating the product of {0,1,2,3,4}
x {0,1,2,3,4} as if it's a categorical index. That doesn't have nearly 25
possible values and probably is not what you intend.
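For what it's worth, the 16 and 25 counts from the thread fall out of simple arithmetic: Spark's OneHotEncoder defaults to dropLast=True, so a 5-level feature encodes to a 4-element vector, and interacting two encoded features multiplies the encoded widths. A plain-Python check, no Spark needed:

```python
# Column-count arithmetic behind the thread's 16 vs 25: Spark's
# OneHotEncoder drops the last category by default (dropLast=True),
# so k levels encode to k-1 columns; interacting two encoded features
# multiplies the widths.
levels_a, levels_b = 5, 5

def encoded_width(levels, drop_last=True):
    return levels - 1 if drop_last else levels

# one-hot (dropLast=True) then interact: 4 * 4 columns
print(encoded_width(levels_a) * encoded_width(levels_b))                # 16
# one-hot (dropLast=False) then interact: 5 * 5 columns
print(encoded_width(levels_a, False) * encoded_width(levels_b, False))  # 25
```

So the one-hot-then-interact ordering is the right one; whether you see 16 or 25 columns just depends on the dropLast setting.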
On Mon, Nov 9, 2020 at 7
Hi,
How are you doing?
Please allow me to first introduce myself. I am Yi Du, and I work at a mortgage
insurance company called ‘Arch Capital Group’, based in the Washington DC office
in the US. I found your profile under the Spark repo on GitHub and would like to
ask you about one particular coding issue in Spark