Recommended storage solution for my setup (~5M items, 10KB pr.)

2016-02-04 Thread habitats
Hello

I have ~5 million text documents, each around 10-15KB in size, and split
into ~15 columns. I intend to do machine learning, and thus I need to
extract all of the data at the same time, and potentially update everything
on every run.

So far I've just used JSON serialization, or simply cached the RDD to disk.
However, I feel like there must be a better way.
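
For reference, my current approach is roughly this (just a sketch: parse_doc
and the paths are made up, and sc is the usual SparkContext):

    from pyspark import StorageLevel

    # One JSON document per line; parse_doc is my own parser (made up here).
    docs = sc.textFile("/data/docs.json").map(parse_doc)

    # "Cache the RDD to disk" so later stages re-read it from local disk.
    docs.persist(StorageLevel.DISK_ONLY)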

I have tried HBase, but I had a hard time setting it up and getting it to
work properly. It also felt like a lot of work for my simple requirements. I
want something /simple/.

Any suggestions?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-storage-solution-for-my-setup-5M-items-10KB-pr-tp26150.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Recommended storage solution for my setup (~5M items, 10KB pr.)

2016-02-04 Thread Patrick Skjennum

(Am I doing this mailing list thing right? Never used this ...)

I do not have a cluster.

Initially I tried to set up hadoop+hbase+spark, but after spending a week 
trying to get it to work, I gave up. I had a million problems with mismatching 
versions, and things working locally on the server but not 
programmatically through my client computer, and vice versa. There was 
/always something/ that did not work, one way or another.


And since I had to actually get things /done/ rather than become an 
expert in clustering, I gave up and just used simple serializing.


Now I'm going to make a second attempt, but this time around I'll ask 
for help :p


--
Best regards,
Patrick Skjennum


On 04.02.2016 22.14, Ted Yu wrote:

> bq. had a hard time setting it up
>
> Mind sharing your experience in more detail :-)
> If you already have a hadoop cluster, it should be relatively
> straightforward to set up.
>
> Tuning needs extra effort.

> On Thu, Feb 4, 2016 at 12:58 PM, habitats <m...@habitats.no> wrote:
>> [...]


Re: Recommended storage solution for my setup (~5M items, 10KB pr.)

2016-02-04 Thread Ted Yu
bq. had a hard time setting it up

Mind sharing your experience in more detail :-)
If you already have a hadoop cluster, it should be relatively
straightforward to set up.
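
e.g. a minimal hbase-site.xml for running on an existing HDFS looks
something like this (illustrative only; hostnames and ports are placeholders):

    <configuration>
      <!-- where HBase keeps its data, on the existing HDFS -->
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://namenode:8020/hbase</value>
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
      <!-- the zookeeper ensemble HBase should use -->
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>zk1,zk2,zk3</value>
      </property>
    </configuration>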

Tuning needs extra effort.

On Thu, Feb 4, 2016 at 12:58 PM, habitats <m...@habitats.no> wrote:

> [...]


Re: Recommended storage solution for my setup (~5M items, 10KB pr.)

2016-02-04 Thread Nick Pentreath
If I'm not mistaken, your data comes to roughly 50-75GB of text documents 
(~5M x 10-15KB)? In which case simple flat text files in JSON or CSV seem 
ideal, as you are already doing. If you are using Spark then DataFrames can 
read/write either of these formats.
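
For example (Spark 1.6 style, with a plain SQLContext; paths are placeholders):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)   # sc: an existing SparkContext

    # Read the JSON-lines data into a ~15-column DataFrame, write it back out.
    df = sqlContext.read.json("/data/docs.json")
    df.write.mode("overwrite").json("/data/docs.out")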

For that size of data you may not require Spark. Single-instance scikit-learn 
or VW or whatever should be ok (depending on what model you want to build).
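
A rough out-of-core sketch with scikit-learn, assuming JSON-lines input and
invented field names ("text", "label"), so the full dataset never has to fit
in memory:

    import json
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vec = HashingVectorizer(n_features=2**20)   # stateless, nothing to fit
    clf = SGDClassifier()

    batch = []
    with open("/data/docs.json") as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) == 10000:
                X = vec.transform(d["text"] for d in batch)
                # classes must be declared up front for partial_fit
                clf.partial_fit(X, [d["label"] for d in batch], classes=[0, 1])
                batch = []
    # (last partial batch dropped for brevity)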

If you need any search & filtering capability I'd recommend elasticsearch, 
which has a very good Spark connector within the elasticsearch-hadoop project. 
It's also easy to set up and get started with (but trickier to actually run in 
production).
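
Sketch of that connector (reusing df/sqlContext from above; needs the
elasticsearch-hadoop jar on the Spark classpath, and "docs/doc", the node
address and the "lang" field are placeholders):

    # Index a DataFrame into elasticsearch via the es-hadoop data source.
    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "localhost:9200")
       .mode("append")
       .save("docs/doc"))

    # Read back; filters are pushed down to elasticsearch.
    hits = (sqlContext.read
              .format("org.elasticsearch.spark.sql")
              .load("docs/doc")
              .where("lang = 'en'"))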

PostgreSQL may also be a good option, given its JSON support.
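
A minimal sketch with psycopg2 and a jsonb column (PostgreSQL 9.4+; the DSN,
table and field names are invented):

    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=docs")   # placeholder DSN
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS docs "
                "(id serial PRIMARY KEY, body jsonb)")
    cur.execute("INSERT INTO docs (body) VALUES (%s)",
                [Json({"text": "...", "lang": "en"})])
    cur.execute("SELECT body->>'text' FROM docs WHERE body->>'lang' = %s",
                ["en"])
    rows = cur.fetchall()
    conn.commit()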

Hope that helps

Sent from my iPhone

> On 4 Feb 2016, at 23:23, Patrick Skjennum <m...@habitats.no> wrote:
>
> [...]