my first data science project with spark

2021-12-26 Thread bitfox
Hello list, Thanks to the Spark project and the community, I have made my first data statistics project with Spark. The URL: https://github.com/bitfoxtop/EmailRankings Surely this is not really big data... I could even write a Python script to finish the job more quickly. But since the job was done

some questions when using structured streaming

2021-12-26 Thread fangmin
Hi developers, I am using Structured Streaming + Kafka. Last week in the prd environment, when one node (of the Kafka cluster) crashed, my application's consumption became slow, but when I used the Kafka console it could consume messages and the speed was OK. In the fat environment, I kill the Kafka process

some errors occur when using structured streaming

2021-12-26 Thread fangmin
Hi Spark developers, I asked a question on the issue board: SPARK-37720 (Error reading delta file, hdfs://BMT163/state/0/0/2879.delta does not exist). The answer: a mismatch between Spark core and Python. I am confused: if it were caused by a version mismatch, it would probably happen always, but now it occurred

Pyspark debugging best practices

2021-12-26 Thread Andrew Davidson
Hi, I am having trouble debugging my driver. It runs correctly on smaller data sets but fails on large ones. It is very hard to figure out what the bug is. I suspect it may have something to do with the way Spark is installed and configured. I am using Google Cloud Platform Dataproc PySpark The

Pyspark garbage collection and cache management best practices

2021-12-26 Thread Andrew Davidson
Hi, Below is typical pseudocode I find myself writing over and over again. There is only a single action at the very end of the program. The early narrow transformations potentially hold on to a lot of needless data. I have a for loop over join (i.e. a wide transformation), followed by a bunch