If I'm not mistaken, your data is roughly 5 million documents at 10-15KB each, 
so somewhere around 50-75GB of text? In which case simple flat text files in 
JSON or CSV seem ideal, as you are already doing. If you are using Spark then 
DataFrames can read/write either of these formats.
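
If it helps, this is roughly what that looks like with the DataFrame API 
(untested sketch against Spark 1.x; the paths are made up, and on 1.x the CSV 
part needs the external spark-csv package):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="flat-file-io")
sqlContext = SQLContext(sc)

# Read newline-delimited JSON, one document per line
docs = sqlContext.read.json("/data/docs.json")

# Write back out as JSON; CSV on 1.x goes through com.databricks:spark-csv
docs.write.mode("overwrite").json("/data/docs_json")
docs.write.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("/data/docs_csv")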

For that size of data you may not require Spark. Single-instance scikit-learn 
or VW or whatever should be ok (depending on what model you want to build).
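
For instance, if the text fits in memory on one box, a plain single-machine 
pipeline along these lines may be all you need (illustrative only -- the file 
path and the "body"/"label" field names are invented):

import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts, labels = [], []
with open("/data/docs.json") as f:
    for line in f:
        doc = json.loads(line)          # one JSON document per line
        texts.append(doc["body"])       # hypothetical text field
        labels.append(doc["label"])     # hypothetical target field

# Sparse TF-IDF features keep the memory footprint manageable
X = TfidfVectorizer(max_features=100000).fit_transform(texts)
model = LogisticRegression().fit(X, labels)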

If you need any search & filtering capability I'd recommend Elasticsearch, 
which has a very good Spark connector in the elasticsearch-hadoop project. 
It's also easy to set up and get started with (but trickier to actually run in 
production).
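
Writing a DataFrame into Elasticsearch through that connector is only a few 
lines -- sketch below, with the index name, node address and connector version 
as placeholders:

# Submit with the connector on the classpath, e.g.:
#   spark-submit --packages org.elasticsearch:elasticsearch-spark_2.10:2.2.0 app.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="es-io")
sqlContext = SQLContext(sc)
docs = sqlContext.read.json("/data/docs.json")

# Index the documents into Elasticsearch ("index/type")
docs.write.format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "localhost") \
    .mode("append") \
    .save("docs/doc")

# Read back as a DataFrame; filters get pushed down to Elasticsearch
hits = sqlContext.read.format("org.elasticsearch.spark.sql").load("docs/doc")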

PostgreSQL may also be a good option with its JSON support.
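
Something like this (psycopg2 sketch; the connection string, table and field 
names are invented, and jsonb needs Postgres 9.4+):

import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=docs user=me")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS documents (id serial PRIMARY KEY, doc jsonb)")
cur.execute("INSERT INTO documents (doc) VALUES (%s)",
            [Json({"title": "example", "body": "..."})])
conn.commit()

# ->> extracts a field from the JSON as text, so you can filter on it
cur.execute("SELECT doc->>'body' FROM documents WHERE doc->>'title' = %s",
            ["example"])
print(cur.fetchall())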

Hope that helps


> On 4 Feb 2016, at 23:23, Patrick Skjennum <m...@habitats.no> wrote:
> 
> (Am I doing this mailing list thing right? Never used this ...)
> 
> I do not have a cluster.
> 
> Initially I tried to set up hadoop+hbase+spark, but after spending a week 
> trying to get it to work, I gave up. I had a million problems with mismatched 
> versions, and things that worked locally on the server but not 
> programmatically through my client computer, and vice versa. There was always 
> something that did not work, one way or another.
> 
> And since I had to actually get things done rather than become an expert in 
> clustering, I gave up and just used simple serialization.
> 
> Now I'm going to make a second attempt, but this time around I'll ask for 
> help :p
> -- 
> mvh
> Patrick Skjennum
> 
> 
>> On 04.02.2016 22.14, Ted Yu wrote:
>> bq. had a hard time setting it up
>> 
>> Mind sharing your experience in more detail :-)
>> If you already have a hadoop cluster, it should be relatively 
>> straightforward to set up.
>> 
>> Tuning needs extra effort.
>> 
>>> On Thu, Feb 4, 2016 at 12:58 PM, habitats <m...@habitats.no> wrote:
>>> Hello
>>> 
>>> I have ~5 million text documents, each around 10-15KB in size, and split
>>> into ~15 columns. I intend to do machine learning, and thus I need to
>>> extract all of the data at the same time, and potentially update everything
>>> on every run.
>>> 
>>> So far I've just used JSON serialization, or simply cached the RDD to disk.
>>> However, I feel like there must be a better way.
>>> 
>>> I have tried HBase, but I had a hard time setting it up and getting it to
>>> work properly. It also felt like a lot of work for my simple requirements. I
>>> want something /simple/.
>>> 
>>> Any suggestions?
>>> 
>>> 
>>> 
> 
