Hello, I have ~5 million text documents, each around 10-15KB in size and split into ~15 columns. I intend to do machine learning on them, so I need to extract all of the data at the same time and potentially update everything on every run.
So far I've just used JSON serialization, or simply cached the RDD to disk. However, I feel like there must be a better way. I have tried HBase, but I had a hard time setting it up and getting it to work properly, and it felt like a lot of work for my simple requirements. I want something /simple/. Any suggestions?
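For concreteness, here is roughly what I'm doing today, as a minimal sketch; the paths and the "docs" name are placeholders, not my actual code:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("doc-store").getOrCreate()

    // ~5M rows, ~15 text columns, read back in from JSON
    val docs = spark.read.json("hdfs:///data/docs.json")

    // Option 1: serialize everything back out as JSON after each run
    docs.write.mode("overwrite").json("hdfs:///data/docs-out.json")

    // Option 2: just persist the underlying RDD to disk between stages
    docs.rdd.persist(StorageLevel.DISK_ONLY)

Both options work, but rewriting the full dataset every run feels wasteful, which is why I'm asking.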