Hello,

I have an application that generates data files in a custom binary format. Each file provides the following information:

- A column list; each column has a data type (64-bit integer, 32-bit string index, 64-bit IEEE float, or 1-byte boolean).
- Catalogs that give the possible values (modalities) for some columns (e.g., column 1 contains only the values A, B, C, D).
- An array with the actual data; each row has a fixed size determined by its columns' types.

Here is an example:

Col1, 64-bit integer
Col2, 32-bit string index
Col3, 64-bit integer
Col4, 64-bit float

Catalog for Col1 = 10, 20, 30, 40, 50
Catalog for Col2 = Big, Small, Large, Tall
Catalog for Col3 = 101, 102, 103, 500, 5000
Catalog for Col4 = (no catalog)

Data array =
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
...
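
For reference, a row in this example works out to 8 + 4 + 8 + 8 = 28 bytes, so each record can be decoded on its own with something like the sketch below (the Record name, the field order and the little-endian byte order are just for illustration, not the point of the question):

import java.nio.{ByteBuffer, ByteOrder}

// One row of the example layout: Long, catalog index (Int), Long, Double = 28 bytes.
case class Record(col1: Long, col2Idx: Int, col3: Long, col4: Double)

def decode(bytes: Array[Byte]): Record = {
  // The byte order is a guess here; it depends on how the writer serializes the rows.
  val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
  Record(buf.getLong(), buf.getInt(), buf.getLong(), buf.getDouble())
}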

I would like to use this kind of file as a source for various ML-related computations (CART, Random Forest, Gradient Boosting...), and Spark looks very interesting in this area. However, I'm a bit lost as to what I should write to have Spark use this file format as a source for its computations. Since these files are quite big (100 million rows, hundreds of gigabytes on disk), I'd rather not convert them into one of the built-in formats; instead I'd like to write some code that lets Spark read the files as they are.
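
From browsing the API docs, SparkContext.binaryRecords, which reads fixed-length binary records as byte arrays, looks like it might be the closest fit. Would something along these lines be a reasonable starting point? (The path is a placeholder, 28 is the row size from the example, decode is the helper sketched above, and I'm assuming the data array can be read without the column/catalog header in front of it.)

// `spark` is the SparkSession provided by spark-shell.
// binaryRecords slices the whole input into equal-sized records, so this assumes
// the data array sits on its own, with the column/catalog header skipped or stored separately.
val raw = spark.sparkContext.binaryRecords("hdfs:///path/to/data.bin", 28)

// Turn each raw 28-byte record into the Record case class from the sketch above.
val records = raw.map(decode)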

I looked around and saw the textFile method, but it is not applicable to my case. I also saw the spark.read.format("libsvm") syntax, which tells me there is a list of formats known to Spark (read into what I believe are called DataFrames), but I could not find any tutorial on this subject.
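
If I understand DataFrames correctly, I would then attach an explicit schema so the result can be handed to the ML algorithms. This is roughly what I have in mind (the column names and types simply mirror the example above; Col2 stays an integer catalog index, with the string catalog joined back in later if needed):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Schema mirroring the example columns.
val schema = StructType(Seq(
  StructField("Col1", LongType,    nullable = false),
  StructField("Col2", IntegerType, nullable = false),
  StructField("Col3", LongType,    nullable = false),
  StructField("Col4", DoubleType,  nullable = false)
))

// Build Rows from the decoded records and wrap them in a DataFrame.
val rows = records.map(r => Row(r.col1, r.col2Idx, r.col3, r.col4))
val df   = spark.createDataFrame(rows, schema)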

Would you have any suggestions or links to documentation that would get me started?

Regards,
Olivier
