[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242966#comment-17242966 ]
Vladislav Sterkhov commented on SPARK-33620:
--------------------------------------------

[~hyukjin.kwon] Hello. I have created the associated question: https://stackoverflow.com/questions/65087925/spark-task-not-starting-after-filtering

> Task not started after filtering
> --------------------------------
>
>                 Key: SPARK-33620
>                 URL: https://issues.apache.org/jira/browse/SPARK-33620
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.4.7
>            Reporter: Vladislav Sterkhov
>            Priority: Major
>         Attachments: VlwWJ.png, mgg1s.png
>
> Hello, I have a problem with large memory usage (~2000 GB HDFS stack). With 300 GB used the tasks start and complete, but we need to use an unlimited stack. Please help.
>
> !VlwWJ.png|width=644,height=150!
>
> !mgg1s.png|width=651,height=182!
>
> This is my code:
> {code:scala}
> var filteredRDD = sparkContext.emptyRDD[String]
> for (path <- pathBuffer) {
>   val someRDD = sparkContext.textFile(path)
>   if (isValidRDD(someRDD))
>     filteredRDD = filteredRDD.++(someRDD.filter(row => {...}))
> }
> hiveService.insertRDD(filteredRDD.repartition(10), outTable)
> {code}
>
> It also goes another way: after many iterations Spark throws a StackOverflowError:
> {code}
> java.lang.StackOverflowError
>   at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2303)
>   at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2596)
>   at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2606)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1319)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1707)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1345)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> {code}
>
> How should I structure my code with repartition and persist/coalesce so that the nodes do not crash?
> I tried rebuilding the program in different ways: moving the repartitioning and the in-memory/disk persistence inside the loop, and setting a large number of partitions (200).
> The program either hangs at the "repartition" stage or crashes with exit code 143 (out of memory), throwing a StackOverflowError in a strange way.
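A common way to avoid the deep lineage that this loop builds is to union the per-path RDDs in a single call and checkpoint the result so the lineage is truncated before the final write. The sketch below is only an illustration, not the reporter's code: it reuses the names sparkContext, pathBuffer, isValidRDD, hiveService and outTable from the snippet above, the checkpoint directory is an assumption, and the row predicate is a placeholder for the original {...}.

{code:scala}
import org.apache.spark.rdd.RDD

// Assumption: the names below come from the reporter's snippet; the checkpoint
// directory and the row predicate are placeholders for this sketch.
sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

// Build one RDD per path, keep only the valid ones, apply the row filter,
// and union everything in a single call instead of chaining ++ inside a loop.
val perPathRDDs: Seq[RDD[String]] = pathBuffer.toSeq
  .map(path => sparkContext.textFile(path))
  .filter(isValidRDD)
  .map(_.filter(row => row.nonEmpty)) // stand-in for the original {...} predicate

val filteredRDD = sparkContext.union(perPathRDDs)

// checkpoint() materializes the RDD to the checkpoint directory and truncates its
// lineage, so tasks no longer have to (de)serialize hundreds of parent RDDs.
filteredRDD.checkpoint()

hiveService.insertRDD(filteredRDD.repartition(10), outTable)
{code}

Note that persist()/cache() does not shorten the lineage; only checkpointing (or writing the intermediate data out and reading it back) actually cuts it, which is what matters for the StackOverflowError seen during task (de)serialization.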