Hi Teng,
What version of Spark are you using as the execution engine? Are you using a vendor's product here?

Thanks,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 27 May 2016 at 13:05, Teng Qiu <teng...@gmail.com> wrote:
> I agree with Koert and Reynold, Spark works well with large datasets now.
>
> Back to the original discussion: comparing SparkSQL vs Hive on Spark vs
> the Spark API.
>
> SparkSQL vs Spark API: you can simply imagine you are in the RDBMS world.
> SparkSQL is pure SQL, and the Spark API is the language for writing
> stored procedures.
>
> Hive on Spark is similar to SparkSQL: it is a pure SQL interface that
> uses Spark as the execution engine. SparkSQL uses Hive's syntax, so as a
> language, I would say they are almost the same.
>
> But Hive on Spark has much better support for Hive features, especially
> hiveserver2 and the security features. The Hive features in SparkSQL are
> really buggy. There is a hiveserver2 implementation in SparkSQL, but in
> the latest release version (1.6.x), hiveserver2 in SparkSQL doesn't work
> with the hivevar and hiveconf arguments anymore, and the username for
> login via JDBC doesn't work either...
> see https://issues.apache.org/jira/browse/SPARK-13983
>
> I believe Hive support in the Spark project is really very low priority
> stuff...
>
> Sadly, Hive on Spark integration is not that easy; there are a lot of
> dependency conflicts... such as
> https://issues.apache.org/jira/browse/HIVE-13301
>
> Our requirement is using Spark with hiveserver2 in a secure way (with
> authentication and authorization). Currently SparkSQL alone cannot
> provide this, so we are using Ranger/Sentry + Hive on Spark.
>
> hope this can help you to get a better idea which direction you should go.
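[Editor's note: the "pure SQL vs stored procedure" distinction Teng draws can be sketched as below. This is an illustrative sketch only — the table and column names (`employees`, `department`, `salary`) are made up, and it assumes a Spark 1.6.x `HiveContext` named `sqlContext`.]

```scala
// Hypothetical example: the same aggregation expressed two ways.
// Assumes `sqlContext` is an org.apache.spark.sql.hive.HiveContext (Spark 1.6.x)
// and a registered table `employees(department, salary)` -- both made up here.

// "Pure SQL" style -- what SparkSQL or Hive on Spark would run:
val bySql = sqlContext.sql(
  "SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department")

// "Stored procedure" style -- the same logic via the DataFrame API,
// where each step is an ordinary Scala expression you can compose and reuse:
import org.apache.spark.sql.functions.avg
val byApi = sqlContext.table("employees")
  .groupBy("department")
  .agg(avg("salary").as("avg_salary"))
```

Both produce the same plan through Catalyst; the API form just lets you build the query up programmatically, the way you would a stored procedure.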
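[Editor's note: for readers unfamiliar with how Hive on Spark differs operationally from SparkSQL — it is configured from the Hive side, not the Spark side. A minimal fragment, assuming Hive is built with Spark support per HIVE-7292 and the dependency issues Teng mentions are resolved:]

```
-- In a Hive CLI / beeline session (or hive-site.xml), switch Hive's
-- execution engine from MapReduce to Spark:
set hive.execution.engine=spark;
```

Queries then go through hiveserver2 as usual, which is why Hive's existing security integrations (Ranger/Sentry) keep working.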
> Cheers,
> Teng

> 2016-05-27 2:36 GMT+02:00 Koert Kuipers <ko...@tresata.com>:
> > We do disk-to-disk iterative algorithms in Spark all the time, on
> > datasets that do not fit in memory, and it works well for us. I usually
> > have to do some tuning of the number of partitions for a new dataset,
> > but that's about it in terms of inconveniences.
> >
> > On May 26, 2016 2:07 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
> > >
> > > Spark can handle this, true, but it is optimized for the idea that it
> > > works on the same full dataset in-memory, due to the underlying
> > > nature of machine learning algorithms (iterative). Of course, you can
> > > spill over, but you should avoid that.
> > >
> > > That being said, you should have read my final sentence about this.
> > > Both systems develop and change.
> > >
> > > On 25 May 2016, at 22:14, Reynold Xin <r...@databricks.com> wrote:
> > >
> > > On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> > >>
> > >> Spark is more for machine learning, working iteratively over the
> > >> whole same dataset in memory. Additionally it has streaming and
> > >> graph processing capabilities that can be used together.
> > >
> > > Hi Jörn,
> > >
> > > The first part is actually not true. Spark can handle data far
> > > greater than the aggregate memory available on a cluster. The more
> > > recent versions (1.3+) of Spark have external operations for almost
> > > all built-in operators, and while things may not be perfect, those
> > > external operators are becoming more and more robust with each
> > > version of Spark.
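[Editor's note: the partition tuning Koert mentions can be sketched as follows. Illustrative only — the path and partition count are made up. More, smaller partitions keep each task's working set small enough to be processed (or spilled) one at a time, which is what lets Spark work through datasets larger than cluster memory.]

```scala
// Hypothetical sketch, assuming a SparkContext `sc` and Spark 1.6.x `sqlContext`.
val raw = sc.textFile("hdfs:///data/events")  // path is made up

// Raise the partition count so no single partition must fit in memory at once:
val repartitioned = raw.repartition(2000)

// For DataFrame/SQL shuffles, the equivalent knob (default in Spark 1.x is 200):
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
```

The right number is workload-dependent, which matches Koert's point that a new dataset usually needs some one-off tuning.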