Iskander14yo opened a new issue, #917:
URL: https://github.com/apache/incubator-graphar/issues/917
### Describe the enhancement requested
Today the conversion/import path does not seem to scale well for larger
datasets.
From reading the current code and experimenting with it in practice, I see that:
- the C++ high-level builders are convenience APIs and keep data in memory
until `Dump()`
- the Spark writer scales better in principle, but still does heavy batch
work such as index generation, joins, sorting, repartitioning, and offset
construction
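For context on the last point, the "offset construction" step essentially builds a CSR-style offset array over edges sorted by source vertex id, which the Spark writer does in a distributed way via sort/group/window operations. A minimal single-machine sketch (illustrative only, not the GraphAr API; `BuildOffsets` and its signature are invented for this example):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative sketch, not GraphAr code: given (src, dst) edge pairs, sort by
// source vertex and build a CSR-style offset array, so that the edges with
// source v occupy the range [offsets[v], offsets[v + 1]) in the sorted list.
std::vector<int64_t> BuildOffsets(
    std::vector<std::pair<int64_t, int64_t>>& edges, int64_t num_vertices) {
  std::sort(edges.begin(), edges.end());
  std::vector<int64_t> offsets(num_vertices + 1, 0);
  for (const auto& e : edges) {
    ++offsets[e.first + 1];  // count out-degree of each source vertex
  }
  for (int64_t v = 0; v < num_vertices; ++v) {
    offsets[v + 1] += offsets[v];  // prefix sum turns counts into offsets
  }
  return offsets;
}
```

Doing this at scale is exactly the kind of shuffle-heavy work (global sort, per-vertex grouping, prefix aggregation) that Spark is built for, which is part of why I think the Spark path is the right one to optimize.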
Because GraphAr is positioned for use with "large-scale graph data", it
would be useful for the community to have a clearer path for scalable conversion.
Assuming I'm not missing something, my suggestion is:
- keep the C++ high-level writer/builder path simple/reference-oriented
and convenient for small/medium imports
- optimize the Spark API/writer as the primary path for large-scale
conversion
This way we treat Spark as the practical scalable backend for data lakes,
object stores, HDFS, and distributed preprocessing.
Why Spark seems like the better place to optimize first:
- Spark is a first-class citizen in data lake stacks and is used by many orgs
in production, so in practice it is more accessible to end users (compared to
provisioning a dedicated VM just for a C++ import)
- storage backends such as S3/HDFS are abstracted through Spark/Hadoop
- large joins / remapping / repartitioning are natural Spark workloads
- avoiding two separate "fully optimized" implementations (Spark and C++)
may be easier to maintain long-term
To sum up, would the project agree with this direction?
If it sounds reasonable, I’d be happy to help investigate and propose
concrete improvements in the Spark conversion path.
### Component(s)
C++, Spark
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]