We are trying to switch from Postgres to Spark's built-in Hive with the
Thrift server as the data sink for persisting ML result data, in the
hope that Hive would improve the ML pipeline's performance. However, it
turned out that persisting dataframes to Hive (via Spark SQL's
saveAsTable API) took significantly longer than persisting them to
Postgres using JDBC. Has anyone experienced similar problems with Hive?
Any recommendations for improving performance would be highly appreciated.
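
For concreteness, the comparison could be timed along the lines below.
This is only a minimal sketch, not our actual pipeline: the helper names
(timed, compare_sinks), the table name "ml_results", and the JDBC URL and
properties are all placeholders. It assumes df is an ordinary Spark
DataFrame and that the Postgres JDBC driver is on the classpath.

```python
import time


def timed(action):
    """Run a zero-argument callable and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


def compare_sinks(df, jdbc_url, jdbc_props, table="ml_results"):
    """Time the same DataFrame write against Hive and against Postgres.

    All arguments are placeholders; `table` is a hypothetical table name.
    """
    # Cache and materialize first so neither write pays for recomputing df.
    df.cache().count()

    # Hive path: DataFrameWriter.saveAsTable goes through the metastore.
    hive_secs = timed(lambda: df.write.mode("overwrite").saveAsTable(table))

    # Postgres path: DataFrameWriter.jdbc writes over a JDBC connection.
    jdbc_secs = timed(
        lambda: df.write.jdbc(
            jdbc_url, table, mode="overwrite", properties=jdbc_props
        )
    )
    return {"hive": hive_secs, "jdbc": jdbc_secs}
```

Caching the dataframe before timing should keep the cost of computing
the ML results out of both measurements, so the numbers reflect the
sinks rather than the upstream pipeline.
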
We are using Spark in standalone mode. I would assume that running
Spark against a real Hive database, or simply on Hadoop, would be
preferable. Has anyone done a performance comparison between running
Spark with built-in Hive (with just the metastore), Spark on a
full-fledged Hive DB, and Spark with built-in Hive on Hadoop? Thanks!
-- ND