Hi Ted,

Do you mean Hive 2 with a Spark 2 snapshot build as the execution engine, using just the binaries from the snapshot (all OK)?

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 27 May 2016 at 16:39, Ted Yu <yuzhih...@gmail.com> wrote:

> Teng:
> Why not try out the 2.0 SNAPSHOT build?
>
> Thanks
>
> > On May 27, 2016, at 7:44 AM, Teng Qiu <teng...@gmail.com> wrote:
> >
> > ah, yes, the version is another mess!... no vendor's product
> >
> > i tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1, doesn't work.
> >
> > hadoop 2.6.2, hive 2.0.1 with spark 1.6.1, works, but need to fix this
> > from hive side https://issues.apache.org/jira/browse/HIVE-13301
> >
> > the jackson-databind lib from calcite-avatica.jar is too old.
> >
> > will try hadoop 2.7, hive 2.0.1 and spark 2.0.0, when spark 2.0.0 is
> > released.
> >
> > 2016-05-27 16:16 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
> >> Hi Teng,
> >>
> >> what version of spark are you using as the execution engine? are you
> >> using a vendor's product here?
> >>
> >> thanks
> >>
> >> Dr Mich Talebzadeh
> >>
> >>> On 27 May 2016 at 13:05, Teng Qiu <teng...@gmail.com> wrote:
> >>>
> >>> I agree with Koert and Reynold, spark works well with large datasets
> >>> now.
> >>>
> >>> back to the original discussion, comparing SparkSQL vs Hive on Spark vs
> >>> Spark API.
> >>>
> >>> SparkSQL vs Spark API: you can simply imagine you are in the RDBMS world,
> >>> SparkSQL is pure SQL, and Spark API is the language for writing stored
> >>> procedures.
> >>>
> >>> Hive on Spark is similar to SparkSQL, it is a pure SQL interface that
> >>> uses spark as the execution engine. SparkSQL uses Hive's syntax,
> >>> so as a language, i would say they are almost the same.
> >>>
> >>> but Hive on Spark has much better support for hive features,
> >>> especially hiveserver2 and security features. hive features in
> >>> SparkSQL are really buggy: there is a hiveserver2 impl in SparkSQL, but
> >>> in the latest release version (1.6.x), hiveserver2 in SparkSQL doesn't
> >>> work with the hivevar and hiveconf arguments anymore, and the username
> >>> for login via jdbc doesn't work either...
> >>> see https://issues.apache.org/jira/browse/SPARK-13983
> >>>
> >>> i believe hive support in the spark project is really very low priority
> >>> stuff...
> >>>
> >>> sadly Hive on Spark integration is not that easy, there are a lot of
> >>> dependency conflicts... such as
> >>> https://issues.apache.org/jira/browse/HIVE-13301
> >>>
> >>> our requirement is using spark with hiveserver2 in a secure way (with
> >>> authentication and authorization). currently SparkSQL alone can not
> >>> provide this, so we are using ranger/sentry + Hive on Spark.
> >>>
> >>> hope this can help you to get a better idea which direction you should
> >>> go.
> >>>
> >>> Cheers,
> >>>
> >>> Teng
> >>>
> >>> 2016-05-27 2:36 GMT+02:00 Koert Kuipers <ko...@tresata.com>:
> >>>> We do disk-to-disk iterative algorithms in spark all the time, on
> >>>> datasets that do not fit in memory, and it works well for us. I usually
> >>>> have to do some tuning of the number of partitions for a new dataset,
> >>>> but that's about it in terms of inconveniences.
> >>>>
> >>>> On May 26, 2016 2:07 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
> >>>>
> >>>> Spark can handle this, true, but it is optimized for the idea that it
> >>>> works on the same full dataset in-memory due to the underlying nature
> >>>> of machine learning algorithms (iterative). Of course, you can spill
> >>>> over, but that you should avoid.
> >>>>
> >>>> That being said you should have read my final sentence about this.
> >>>> Both systems develop and change.
> >>>>
> >>>> On 25 May 2016, at 22:14, Reynold Xin <r...@databricks.com> wrote:
> >>>>
> >>>> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfra...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Spark is more for machine learning working iteratively over the whole
> >>>>> same dataset in memory. Additionally it has streaming and graph
> >>>>> processing capabilities that can be used together.
> >>>>
> >>>> Hi Jörn,
> >>>>
> >>>> The first part is actually not true. Spark can handle data far greater
> >>>> than the aggregate memory available on a cluster. The more recent
> >>>> versions (1.3+) of Spark have external operations for almost all
> >>>> built-in operators, and while things may not be perfect, those external
> >>>> operators are becoming more and more robust with each version of Spark.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org