This will depend on multiple factors. Assuming we are talking about significant volumes of data, I'd prefer Sqoop over Spark on YARN if ingestion performance is the sole consideration (which is true in many production use cases). Sqoop offers some potential optimisations, especially around native database bulk-extraction tools (its --direct mode), that Spark cannot take advantage of. The performance inefficiency of using MR (actually map-only) is insignificant over a large corpus of data. Further, in a shared cluster, if the data volume is skewed for the given partition key, Spark without dynamic container allocation can be significantly inefficient from a cluster resource usage perspective. With dynamic allocation enabled it is less so, but Sqoop still has a slight edge because of how long Spark holds on to its executors before releasing them.
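To make the skew point concrete: Spark's JDBC reader splits the range of a numeric partition column into equal-width strides, one WHERE clause per partition. The sketch below is a simplified pure-Python imitation of that stride logic (not Spark's actual code; the function name and exact SQL text are illustrative), showing why equal-width ranges give unbalanced partitions when the key distribution is skewed:

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Simplified imitation of how Spark's JDBC source splits
    [lower, upper) into equal-width strides, one predicate per
    partition. Rows outside the bounds fall into the end partitions."""
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        if i == 0:
            preds.append(f"{column} < {hi}")                      # open-ended low side
        elif i == num_partitions - 1:
            preds.append(f"{column} >= {lo}")                     # open-ended high side
        else:
            preds.append(f"{column} >= {lo} AND {column} < {hi}")
    return preds

# Equal-width ranges say nothing about row counts: if 90% of the rows
# have id < 250, the first task does 90% of the work while the other
# three executors sit mostly idle (until dynamic allocation reclaims them).
for p in jdbc_partition_predicates("id", 0, 1000, 4):
    print(p)
```

The same lowerBound/upperBound/numPartitions triple is what you pass to the Spark JDBC options, so picking a partition column with a roughly uniform distribution matters as much as picking the right tool.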
If ingestion is part of a more complex DAG that relies on the Spark cache (RDD, DataFrame, or Dataset), then using Spark JDBC can have a significant advantage: the data can be cached without first being persisted into HDFS. Whether this translates into significantly better overall performance of the DAG or the cluster depends on the DAG's stages and their individual performance. In general, if the ingestion stage is the dominant bottleneck in the DAG, the advantage will be significant.

Hope this provides a general direction to consider in your case.

On 25 Aug 2016 3:09 a.m., "Venkata Penikalapati" <mail.venkatakart...@gmail.com> wrote:

> Team,
> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
> Sqoop has lot of optimizations to fetch data does spark jdbc also has those?
>
> I'm performing few analytics using spark data for which data is residing in rdbms.
>
> Please guide me with this.
>
> Thanks
> Venkata Karthik P