Re: repartition in Spark

2020-11-09 Thread Mich Talebzadeh
As a generic answer in a distributed environment like Spark, making sure that data is distributed evenly among all nodes (assuming every node is the same or similar) can help performance. Repartition thus controls the data distribution among all nodes. However, it is not that straightforward.
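
To illustrate the even-distribution point, here is a toy plain-Python sketch (not the Spark API itself) of how hash partitioning spreads keys across partitions; the key names and partition count are made up for illustration:

```python
from collections import Counter

def assign_partition(key, num_partitions):
    # Mimics the idea behind Spark's HashPartitioner:
    # partition = hash(key) mod numPartitions
    return hash(key) % num_partitions

# 10,000 synthetic keys spread over 4 partitions
keys = [f"device-{i}" for i in range(10_000)]
counts = Counter(assign_partition(k, 4) for k in keys)

# With a reasonable hash, each partition holds roughly 2,500 keys
for partition, n in sorted(counts.items()):
    print(partition, n)
```

The caveat in the answer above still applies: a hash spreads distinct keys evenly, but heavily skewed key frequencies can still leave some partitions much larger than others.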

Re: Out of memory issue

2020-11-09 Thread Amit Sharma
Please find below the exact exception:

Exception in thread "streaming-job-executor-3" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at

Re: Out of memory issue

2020-11-09 Thread Amit Sharma
Can you please help? Thanks, Amit

On Sun, Nov 8, 2020 at 1:35 PM Amit Sharma wrote:
> Hi, I am using a 16-node spark cluster with the below config
> 1. Executor memory 8 GB
> 2. 5 cores per executor
> 3. Driver memory 12 GB.
>
> We have a streaming job. We do not see a problem but sometimes we get
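
For reference, the cluster settings described above map onto standard spark-submit options. This is an illustrative config fragment only: the values mirror the poster's stated setup, the jar name is a placeholder, and the right numbers depend on the workload:

```shell
# Illustrative only: the memory/core values match the setup described
# above; spark.memory.fraction is shown at its default (0.6).
spark-submit \
  --conf spark.driver.memory=12g \
  --conf spark.executor.memory=8g \
  --conf spark.executor.cores=5 \
  --conf spark.memory.fraction=0.6 \
  your-streaming-app.jar
```

A heap-space OutOfMemoryError in a streaming driver thread (as in the stack trace in this thread) often points at driver-side accumulation rather than executor sizing, so raising spark.driver.memory alone may only delay the failure.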

RE: Ask about Pyspark ML interaction

2020-11-09 Thread Du, Yi
Do you mean I need to index them, one-hot encode them, and interact them? I tried both ways: Index -> interact -> one-hot encode: it gave me 25 combinations. Index -> one-hot encode -> interact: it gave me 16 combinations. Neither of them gave me the expected 24 combinations. Did I miss something? Thanks,
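
One plausible source of the 16 (assuming two 5-level features, which the 25 suggests): Spark ML's OneHotEncoder defaults to dropLast=True, so a k-level feature becomes k-1 output columns, and interacting two encoded 5-level features yields 4 x 4 = 16 columns; with dropLast=False it would be the full 5 x 5 = 25. A quick arithmetic check, plain Python, no Spark needed:

```python
k = 5  # levels in each indexed feature (assumed from the 25/16 counts)

with_drop_last = (k - 1) * (k - 1)   # OneHotEncoder default: dropLast=True
without_drop_last = k * k            # dropLast=False keeps all k columns

print(with_drop_last)     # 16
print(without_drop_last)  # 25
```

This does not explain the expected 24, which presumably comes from some other feature cardinality, but it does account for the 16-vs-25 difference between the two orderings tried.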

Re: [Structured Streaming] Join stream of readings with collection of previous related readings

2020-11-09 Thread Lalwani, Jayesh
Append mode will wait till the watermark expires https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes This is because append mode doesn't output rows until a row is finalized. Theoretically, data for any row can appear anytime as long as it's in the
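
The finalization rule can be modeled in a few lines of plain Python (a toy sketch, not the Spark API; function and parameter names are invented): a windowed aggregate is emitted only once the watermark, i.e. the maximum event time seen minus the allowed lateness, passes the window's end.

```python
from collections import defaultdict

def append_mode_emit(events, window_size, max_lateness):
    """Toy model of Structured Streaming's append mode.

    events: list of (event_time, value) in arrival order.
    A window [start, start + window_size) is emitted only after the
    watermark (max event time seen - max_lateness) passes its end.
    """
    windows = defaultdict(list)
    emitted = []
    max_seen = 0
    for t, v in events:
        max_seen = max(max_seen, t)
        watermark = max_seen - max_lateness
        windows[(t // window_size) * window_size].append(v)
        # Finalize (emit) every window whose end is at or below the watermark
        for start in [s for s in windows if s + window_size <= watermark]:
            emitted.append((start, sorted(windows.pop(start))))
    return emitted

out = append_mode_emit(
    [(1, "a"), (3, "b"), (12, "c"), (25, "d")],
    window_size=10, max_lateness=5)
print(out)
```

Note how windows [0, 10) and [10, 20) are held back until the event at time 25 pushes the watermark to 20; this is exactly the latency the answer above describes.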

repartition in Spark

2020-11-09 Thread ashok34...@yahoo.com.INVALID
Hi, Just need some advice.
- When we have multiple spark nodes running code, under what conditions does a repartition make sense?
- Can we repartition and cache the result --> df = spark.sql("select from ...").repartition(4).cache
- If we choose a repartition (4), will that repartition
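
On the second bullet: transformations are lazy, and cache() marks the DataFrame as it exists after the repartition, so the first action materializes and stores the 4-partition layout, and later actions reuse it (in PySpark it must be called as .cache(), with parentheses). A toy plain-Python model of that laziness, with an invented LazyNode class standing in for a DataFrame:

```python
# Toy model of Spark's lazy evaluation + cache (not the Spark API itself).
class LazyNode:
    def __init__(self, compute_fn, name):
        self.compute_fn = compute_fn
        self.name = name
        self._cached = None
        self._cache_enabled = False
        self.materializations = 0  # how many times compute actually ran

    def cache(self):
        # Like DataFrame.cache(): only marks the node; nothing runs yet
        self._cache_enabled = True
        return self

    def collect(self):
        # An "action": forces computation, reusing the cache if present
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.materializations += 1
        result = self.compute_fn()
        if self._cache_enabled:
            self._cached = result
        return result

# "repartition(4)" modeled as splitting rows into 4 chunks, then cached
rows = list(range(8))
df = LazyNode(lambda: [rows[i::4] for i in range(4)], "repartition(4)")
df.cache()

first = df.collect()
second = df.collect()
print(len(first), df.materializations)  # 4 partitions, computed once
```

The design point the sketch shows: cache() by itself does nothing; the expensive work (including the shuffle a real repartition implies) happens on the first action, and only then is the 4-partition result stored.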

[Structured Streaming] Join stream of readings with collection of previous related readings

2020-11-09 Thread nathan.brinks
Working on a POC to determine if Spark is a good fit for our use case. I have a stream of readings from IoT devices that are being published to a Kafka topic. I need to consume those readings along with recently consumed readings for the same device to determine the quality of the current
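
The "current reading plus recent readings for the same device" pattern is essentially per-key state (in Spark's Scala API this is what stateful operators like mapGroupsWithState are for). A plain-Python sketch of the state logic, with all class, method, and threshold names invented for illustration:

```python
from collections import defaultdict, deque

class DeviceQualityChecker:
    """Toy sketch: keep the last N readings per device and score each
    new reading against that recent history."""

    def __init__(self, history_size=3):
        self.history = defaultdict(lambda: deque(maxlen=history_size))

    def process(self, device_id, value):
        recent = self.history[device_id]
        if recent:
            baseline = sum(recent) / len(recent)
            # Arbitrary illustrative rule: flag large jumps from the baseline
            quality = "ok" if abs(value - baseline) <= 10 else "suspect"
        else:
            quality = "ok"  # no history yet for this device
        recent.append(value)
        return quality

checker = DeviceQualityChecker()
results = [checker.process("sensor-1", v) for v in [20, 21, 19, 55, 22]]
print(results)
```

In a real Spark job the deque would become managed per-group state keyed by device id, so the same logic runs partitioned across the cluster with one history per device.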

Re: Ask about Pyspark ML interaction

2020-11-09 Thread Sean Owen
I think you have this flipped around - you want to one-hot encode, then compute interactions. As it is you are treating the product of {0,1,2,3,4} x {0,1,2,3,4} as if it's a categorical index. That doesn't have nearly 25 possible values and probably is not what you intend. On Mon, Nov 9, 2020 at
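
The collision point can be checked directly in plain Python: the product of two indices in {0..4} collapses many distinct category pairs onto the same number, so it cannot stand in for a 25-way categorical.

```python
# All products of two StringIndexer-style indices in {0, 1, 2, 3, 4}
pairs = [(i, j) for i in range(5) for j in range(5)]
products = {i * j for i, j in pairs}

print(len(pairs))        # 25 distinct category pairs
print(sorted(products))  # only 10 distinct products; 0 alone absorbs 9 pairs
```

This is why the ordering matters: interact the one-hot columns, where each pair keeps its own column, rather than the raw indices.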

Ask about Pyspark ML interaction

2020-11-09 Thread Du, Yi
Hi, How are you doing? Please allow me to first introduce myself. I am Yi Du, working at a mortgage insurance company called ‘Arch Capital Group’, based in the Washington DC office in the US. I found your profile under the Spark repo on GitHub and would like to ask you about one particular coding issue under