(Am I doing this mailing-list thing right? Never used this ...)
I do not have a cluster.
Initially I tried to set up hadoop+hbase+spark, but after spending a week
trying to get it to work, I gave up. I had a million problems with mismatching
versions, and things working locally on the server, but not
programmatically through my client computer, and vice versa. There was
/always something/ that did not work, one way or another.
And since I had to actually get things /done/ rather than becoming an
expert in clustering, I gave up and just used simple serializing.
Now I'm going to make a second attempt, but this time around I'll ask
for help :p
--
Best regards,
Patrick Skjennum
On 04.02.2016 22.14, Ted Yu wrote:
bq. had a hard time setting it up
Mind sharing your experience in more detail :-)
If you already have a hadoop cluster, it should be relatively
straightforward to set up.
Tuning needs extra effort.
On Thu, Feb 4, 2016 at 12:58 PM, habitats <m...@habitats.no
<mailto:m...@habitats.no>> wrote:
Hello
I have ~5 million text documents, each around 10-15KB in size, and
split into ~15 columns. I intend to do machine learning, and thus I
need to extract all of the data at the same time, and potentially
update everything on every run.
So far I've just used JSON serializing, or simply cached the RDD
to disk.
However, I feel like there must be a better way.
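For context, the "simple serializing" I mean is roughly the following
(a minimal plain-Python sketch with newline-delimited JSON; the field
names and document shape here are made-up placeholders, not my actual
schema):

```python
import json
import tempfile
from pathlib import Path

# Each document is a dict of columns; dump one JSON object per line
# (newline-delimited JSON), then read the whole file back for a run.
# Three tiny fake documents stand in for the real ~5M corpus.
docs = [
    {"id": i, "title": f"doc-{i}", "body": "some text " * 10}
    for i in range(3)
]

out = Path(tempfile.mkdtemp()) / "docs.jsonl"
with out.open("w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# A "training run" then just loads everything back into memory:
with out.open() as f:
    loaded = [json.loads(line) for line in f]

assert loaded == docs
```

This works, but rewriting the whole file to update anything is what
makes me think there must be a better way.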
I have tried HBase, but I had a hard time setting it up and getting
it to work properly. It also felt like a lot of work for my simple
requirements. I want something /simple/.
Any suggestions?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-storage-solution-for-my-setup-5M-items-10KB-pr-tp26150.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
<mailto:user-unsubscr...@spark.apache.org>
For additional commands, e-mail: user-h...@spark.apache.org
<mailto:user-h...@spark.apache.org>