Garbage collections issue on MapPartitions

2016-01-29 Thread rcollich
Hi all, I currently have a mapPartitions job that flat-maps each value in the iterator, and I'm running into an issue where certain executions incur major GC costs. Some executors take 20 minutes, 15 of which are pure garbage collection, and I believe that a lot of it has to
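
For illustration only, the pattern described in this preview (flat-mapping every value of a partition iterator) might be sketched lazily as below, so that output is produced one element at a time instead of being collected into a list first. The preview is cut off before the poster names a cause, so this is not a diagnosis; `flat_map_partition` and `tokens` are hypothetical names, and the sketch uses only the Python standard library rather than Spark itself.

```python
from itertools import chain


def flat_map_partition(rows, expand):
    """Lazily flat-map one partition.

    `rows` is any iterator of records and `expand` maps one record to an
    iterable of outputs. Only the current record's output is held at a time.
    """
    return chain.from_iterable(map(expand, rows))


def tokens(line):
    # Hypothetical per-record expansion; stands in for whatever the real job emits.
    return line.split()


# A plain iterator stands in for the partition iterator handed to the job.
partition = iter(["a b", "c", "d e f"])
for item in flat_map_partition(partition, tokens):
    print(item)
```

Under the assumption that the job runs on PySpark, a function like this could be passed as `rdd.mapPartitions(lambda it: flat_map_partition(it, tokens))`, since `mapPartitions` accepts a function from an iterator to an iterable.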

Setting up data for columnsimilarity

2016-01-28 Thread rcollich
Hi all, I need to be able to find the cosine similarity of a series of vectors (for the sake of argument, let's say that every vector is a tweet). However, I'm having trouble preparing my data for the columnSimilarities function. I'm receiving these vectors in row format
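
As a rough sketch of the data-layout question, and assuming the function meant is MLlib's `RowMatrix.columnSimilarities`, which compares the columns of a matrix: vectors that arrive one per row have to be laid out one per column before column-wise cosine similarity compares the vectors themselves. The pure-Python example below shows the idea on a tiny in-memory matrix; `transpose` and `column_cosine_similarities` are hypothetical helpers written for this illustration, not Spark APIs, and a real distributed transpose would look different.

```python
import math


def transpose(rows):
    """Turn a list of row vectors into a list of column vectors."""
    return [list(col) for col in zip(*rows)]


def column_cosine_similarities(matrix):
    """Cosine similarity for every pair of columns (i < j) of `matrix`.

    Returns a dict keyed by (i, j); pairs involving an all-zero column
    are skipped because their cosine similarity is undefined.
    """
    cols = transpose(matrix)
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    sims = {}
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if norms[i] == 0 or norms[j] == 0:
                continue
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


# Three illustrative vectors (for example, three tweets), one per row.
vectors = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
]

# Put each vector in its own column, then compare columns.
sims = column_cosine_similarities(transpose(vectors))
print(sims)
```

Here the keys of `sims` index the original vectors: pair `(0, 1)` is close to 1 because the first two vectors are identical, and the pairs involving the third vector are 0 because it shares no nonzero coordinate with the others.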