Try
https://mapr.com/blog/spark-data-source-api-extending-our-spark-sql-query-engine/
Thanks,
Assaf.
-----Original Message-----
From: OBones [mailto:[email protected]]
Sent: Monday, June 12, 2017 1:01 PM
To: [email protected]
Subject: [How-To] Custom file format as source
Hello,
I have an application here that generates data files in a custom binary format
that provides the following information:
Column list, each column has a data type (64 bit integer, 32 bit string index,
  64 bit IEEE float, 1 byte boolean)
Catalogs that give modalities for some columns (i.e., column 1 contains only
  the following values: A, B, C, D)
Array for actual data, each row has a fixed size according to the columns.
Here is an example:
Col1, 64bit integer
Col2, 32bit string index
Col3, 64bit integer
Col4, 64bit float
Catalog for Col1 = 10, 20, 30, 40, 50
Catalog for Col2 = Big, Small, Large, Tall
Catalog for Col3 = 101, 102, 103, 500, 5000
Catalog for Col4 = (no catalog)
Data array =
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
...
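As an illustration of the layout above, here is a minimal sketch in plain Python of how such rows could be decoded. It assumes little-endian byte order, no padding between fields, and that the 32 bit string index is a 0-based position in the column's catalog; none of this is stated in the format description, and the names ROW, COL2_CATALOG and decode_rows are hypothetical.

```python
import struct

# Assumed layout for the example above: little-endian, no padding.
# q = 64 bit integer, i = 32 bit string index, d = 64 bit IEEE float.
ROW = struct.Struct("<qiqd")  # 8 + 4 + 8 + 8 = 28 bytes per row

# Catalog for Col2, taken from the example; 0-based indexing is an assumption.
COL2_CATALOG = ["Big", "Small", "Large", "Tall"]

def decode_rows(buf):
    """Yield (col1, col2_label, col3, col4) tuples from a bytes-like data array."""
    for col1, col2_idx, col3, col4 in ROW.iter_unpack(buf):
        yield col1, COL2_CATALOG[col2_idx], col3, col4

# Build two rows in memory to exercise the decoder.
data = ROW.pack(10, 0, 101, 1.5) + ROW.pack(50, 3, 5000, -2.25)
rows = list(decode_rows(data))
```

Since every row occupies the same number of bytes, row n starts at a byte offset that can be computed directly, so no scan of the preceding rows is needed.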
I would like to use this kind of file as a source for various ML-related
computations (CART, Random Forest, gradient boosting...) and Spark is very
interesting in this area.
However, I'm a bit lost as to what I should write to have Spark use that file
format as a source for its computation. Considering that those files are quite
big (100 million lines, hundreds of gigs on disk), I'd rather not create
something that writes a new file in a built-in format, but I'd rather write
some code that makes Spark accept the file as it is.
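Reading the file in place is helped by the fixed row size: the data array can be cut into byte ranges that never split a row, and each range can then be handed to an independent reader. The following is a rough sketch of that computation in plain Python, not Spark code. The function name and the header size of 128 bytes (standing in for the column list and catalogs that precede the data array) are hypothetical.

```python
def row_aligned_splits(data_offset, row_size, num_rows, num_splits):
    """Return (byte_offset, row_count) pairs covering all rows without cutting one."""
    base, extra = divmod(num_rows, num_splits)
    splits = []
    row = 0
    for i in range(num_splits):
        # The first `extra` splits take one additional row each.
        count = base + (1 if i < extra else 0)
        if count == 0:
            continue  # more splits requested than rows available
        splits.append((data_offset + row * row_size, count))
        row += count
    return splits

# Hypothetical file: 128-byte header, 28-byte rows, 10 rows, 3 readers.
splits = row_aligned_splits(data_offset=128, row_size=28, num_rows=10, num_splits=3)
```

Each pair tells a reader where to seek and how many rows to decode, which is the kind of information a custom source would need to hand to its parallel tasks.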
I looked around and saw the textFile method, but it is not applicable to my
case. I also saw the spark.read.format("libsvm") syntax, which tells me that
there is a list of supported formats known to Spark, which I believe are called
DataFrames, but I could not find any tutorial on this subject.
Would you have any suggestion or links to documentation that would get me
started?
Regards,
Olivier
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]