Thanks, sounds interesting! How do you load the files into Spark? Did you
consider using multiple files instead of file lines?

From: Hector Yee [mailto:hector....@gmail.com]
Sent: Wednesday, April 01, 2015 11:36 AM
To: Ulanov, Alexander
Cc: Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

I use Thrift, base64-encode the serialized binary, and save it as lines of a
text file compressed with snappy or gzip.

This makes it very easy to copy small chunks locally and play with subsets of
the data, without any dependency on HDFS/Hadoop server infrastructure, for
example.
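
For reference, the write/read paths look roughly like this (untested sketch;
LabeledVector stands in for whatever your generated Thrift struct actually is):

    import java.util.Base64  // Java 8
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.thrift.{TDeserializer, TSerializer}
    import org.apache.thrift.protocol.TBinaryProtocol

    // records: RDD[LabeledVector]. One base64 line per record; the codec
    // compresses each output part-file.
    records.mapPartitions { it =>
      val ser = new TSerializer(new TBinaryProtocol.Factory()) // not thread-safe, so one per partition
      it.map(r => Base64.getEncoder.encodeToString(ser.serialize(r)))
    }.saveAsTextFile("hdfs:///data/features", classOf[GzipCodec])

    // textFile picks the codec from the file extension on the way back in.
    val loaded = sc.textFile("hdfs:///data/features").mapPartitions { it =>
      val des = new TDeserializer(new TBinaryProtocol.Factory())
      it.map { line =>
        val r = new LabeledVector()
        des.deserialize(r, Base64.getDecoder.decode(line))
        r
      }
    }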


On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
Thanks, Evan. What do you think about Protobuf? Twitter has a library for
managing protobuf files in HDFS: https://github.com/twitter/elephant-bird
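
Something along these lines is what I had in mind (untested; LabeledVector is
a hypothetical generated protobuf message):

    import org.apache.hadoop.io.{BytesWritable, LongWritable}

    // vectors: RDD[LabeledVector]. Key by record index; the value is the
    // serialized message (toByteArray comes with every generated message).
    vectors.zipWithIndex()
      .map { case (v, i) => (i, v.toByteArray) }
      .saveAsSequenceFile("hdfs:///data/features-pb")

    // Copy out exactly getLength bytes: Hadoop reuses and over-allocates
    // the BytesWritable buffer.
    val loaded = sc
      .sequenceFile("hdfs:///data/features-pb", classOf[LongWritable], classOf[BytesWritable])
      .map { case (_, bw) =>
        LabeledVector.parseFrom(java.util.Arrays.copyOfRange(bw.getBytes, 0, bw.getLength))
      }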


From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, March 26, 2015 2:34 PM
To: Stephen Boesch
Cc: Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

On binary file formats: I looked at HDF5 + Spark a couple of years ago and
found it barely JVM-friendly and very Hadoop-unfriendly (e.g., the APIs
required filenames as input; you couldn't pass in anything like an
InputStream). I don't know whether it has gotten any better.

Parquet plays much more nicely, and lots of Spark-related projects already use
it. Keep in mind that it's column-oriented, which might impact performance;
but basically you're going to want your features in a byte array, and
deserialization should be pretty straightforward.
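
Concretely, something like this (untested sketch against the 1.3 DataFrame
API, packing the doubles into a byte array by hand):

    import java.nio.ByteBuffer

    case class Example(label: Double, features: Array[Byte])

    def toBytes(xs: Array[Double]): Array[Byte] = {
      val bb = ByteBuffer.allocate(8 * xs.length)
      xs.foreach(bb.putDouble)
      bb.array()
    }

    // data: RDD[(Double, Array[Double])] -- label plus dense features.
    import sqlContext.implicits._
    data.map { case (label, vec) => Example(label, toBytes(vec)) }
      .toDF().saveAsParquetFile("hdfs:///data/features.parquet")

    // Read back and unpack (column 0 = label, column 1 = features).
    val loaded = sqlContext.parquetFile("hdfs:///data/features.parquet").rdd.map { row =>
      val bb = ByteBuffer.wrap(row.getAs[Array[Byte]](1))
      (row.getDouble(0), Array.fill(bb.remaining / 8)(bb.getDouble))
    }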

On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch <java...@gmail.com> wrote:
There are some convenience methods you might consider, including:

    MLUtils.loadLibSVMFile
    MLUtils.loadLabeledPoints
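
Usage is one-liner territory; an untested example with illustrative paths:

    import org.apache.spark.mllib.util.MLUtils

    // Parses LIBSVM text into an RDD[LabeledPoint].
    val training = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm")
    MLUtils.saveAsLibSVMFile(training, "hdfs:///data/train-out")

Note that LIBSVM is a sparse text format, though, so it trades away the
small-binary-footprint requirement.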

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ula...@hp.com>:

> Hi,
>
> Could you suggest a reasonable file format for storing feature vector
> data for machine learning in Spark MLlib? Are there any best practices
> for Spark?
>
> My data is dense feature vectors with labels. Some of the requirements are
> that the format should be easy to load/serialize, randomly accessible, and
> have a small footprint (binary). I am considering Parquet, HDF5, and
> protocol buffers (protobuf), but I have little to no experience with them,
> so any suggestions would be really appreciated.
>
> Best regards, Alexander
>



--
Yee Yang Li Hector
google.com/+HectorYee
