We are trying to switch from Postgres to Spark's built-in Hive (with the Thrift server) as the data sink for persisting ML result data, in the hope that Hive would improve the performance of our ML pipeline. However, it turned out that persisting dataframes to Hive (via Spark SQL's saveAsTable API) takes significantly longer than persisting them to Postgres over JDBC. Has anyone experienced similar problems with Hive? Any recommendations for improving performance would be highly appreciated.
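
To make the comparison concrete, the two write paths look roughly like the PySpark sketch below. This is only an illustration, not our actual pipeline code: the helper names `timed` and `compare_sinks`, the table name, and the JDBC URL and properties are all placeholders. It assumes `df` is an existing Spark DataFrame and that the session was created with Hive support enabled.

```python
import time


def timed(label, action):
    """Run action() once and return (label, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    action()
    return label, time.perf_counter() - start


def compare_sinks(df, jdbc_url, jdbc_props, table):
    """Write the same DataFrame to both sinks and return the timings.

    df         -- a Spark DataFrame (assumed to exist already)
    jdbc_url   -- placeholder, e.g. "jdbc:postgresql://host:5432/db"
    jdbc_props -- placeholder dict with "user", "password", "driver"
    table      -- placeholder table name used for both sinks
    """
    results = {}

    # Path 1: Spark's built-in Hive support, via DataFrameWriter.saveAsTable
    label, secs = timed(
        "hive", lambda: df.write.mode("overwrite").saveAsTable(table)
    )
    results[label] = secs

    # Path 2: Postgres through the JDBC data source
    label, secs = timed(
        "jdbc",
        lambda: df.write.jdbc(
            url=jdbc_url, table=table, mode="overwrite", properties=jdbc_props
        ),
    )
    results[label] = secs

    return results
```

A single run of each write is a rough measure at best. Repeating the writes and caching `df` beforehand would keep the cost of recomputing the DataFrame from being charged to whichever sink runs first.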

We are using Spark in standalone mode. I would assume that running Spark against a real Hive database, or simply on Hadoop, would be preferable. Has anyone done a performance comparison between running Spark with built-in Hive (with just the metastore), Spark on a full-fledged Hive DB, and Spark with built-in Hive on Hadoop? Thanks!

-- ND



