tried spark 2.0.0 preview, but no assembly jar there... then just gave up... :p
2016-05-27 17:39 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
> Teng:
> Why not try out the 2.0 SNAPSHOT build?
>
> Thanks
>
>> On May 27, 2016, at 7:44 AM, Teng Qiu <teng...@gmail.com> wrote:
>>
>> ah, yes, the version is another mess!... no vendor's product
>>
>> i tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1: doesn't work.
>>
>> hadoop 2.6.2, hive 2.0.1 with spark 1.6.1 works, but we needed to fix
>> this on the hive side: https://issues.apache.org/jira/browse/HIVE-13301
>> (the jackson-databind lib from calcite-avatica.jar is too old).
>>
>> will try hadoop 2.7, hive 2.0.1 and spark 2.0.0 when spark 2.0.0 is
>> released.
>>
>> 2016-05-27 16:16 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>>> Hi Teng,
>>>
>>> what version of spark are you using as the execution engine? are you
>>> using a vendor's product here?
>>>
>>> thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>> On 27 May 2016 at 13:05, Teng Qiu <teng...@gmail.com> wrote:
>>>>
>>>> I agree with Koert and Reynold, spark works well with large datasets
>>>> now.
>>>>
>>>> back to the original discussion: comparing SparkSQL vs Hive on Spark
>>>> vs the Spark API.
>>>>
>>>> for SparkSQL vs the Spark API, you can simply imagine you are in the
>>>> RDBMS world: SparkSQL is pure SQL, and the Spark API is the language
>>>> for writing stored procedures.
>>>>
>>>> Hive on Spark is similar to SparkSQL: it is a pure SQL interface that
>>>> uses spark as the execution engine. SparkSQL uses Hive's syntax, so
>>>> as a language, i would say they are almost the same.
>>>>
>>>> but Hive on Spark has much better support for hive features,
>>>> especially hiveserver2 and the security features. hive features in
>>>> SparkSQL are really buggy: there is a hiveserver2 impl in SparkSQL,
>>>> but in the latest release version (1.6.x), hiveserver2 in SparkSQL
>>>> doesn't work with the hivevar and hiveconf arguments anymore, and
>>>> the username for login via jdbc doesn't work either...
>>>> see https://issues.apache.org/jira/browse/SPARK-13983
>>>>
>>>> i believe hive support in the spark project is really very low
>>>> priority stuff...
>>>>
>>>> sadly, Hive on Spark integration is not that easy either, there are
>>>> a lot of dependency conflicts... such as
>>>> https://issues.apache.org/jira/browse/HIVE-13301
>>>>
>>>> our requirement is using spark with hiveserver2 in a secure way
>>>> (with authentication and authorization); currently SparkSQL alone
>>>> can not provide this, so we are using ranger/sentry + Hive on Spark.
>>>>
>>>> hope this helps you get a better idea of which direction to go.
>>>>
>>>> Cheers,
>>>>
>>>> Teng
>>>>
>>>> 2016-05-27 2:36 GMT+02:00 Koert Kuipers <ko...@tresata.com>:
>>>>> We do disk-to-disk iterative algorithms in spark all the time, on
>>>>> datasets that do not fit in memory, and it works well for us. I
>>>>> usually have to do some tuning of the number of partitions for a
>>>>> new dataset, but that's about it in terms of inconveniences.
>>>>>
>>>>> On May 26, 2016 2:07 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>>>
>>>>> Spark can handle this, true, but it is optimized for the idea that
>>>>> it works on the same full dataset in-memory, due to the underlying
>>>>> nature of (iterative) machine learning algorithms. Of course, you
>>>>> can spill over, but you should avoid that.
>>>>>
>>>>> That being said, you should have read my final sentence about this.
>>>>> Both systems develop and change.
>>>>>
>>>>> On 25 May 2016, at 22:14, Reynold Xin <r...@databricks.com> wrote:
>>>>>
>>>>> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfra...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Spark is more for machine learning, working iteratively over the
>>>>>> same whole dataset in memory. Additionally it has streaming and
>>>>>> graph processing capabilities that can be used together.
>>>>>
>>>>> Hi Jörn,
>>>>>
>>>>> The first part is actually not true. Spark can handle data far
>>>>> greater than the aggregate memory available on a cluster. The more
>>>>> recent versions (1.3+) of Spark have external operations for almost
>>>>> all built-in operators, and while things may not be perfect, those
>>>>> external operators are becoming more and more robust with each
>>>>> version of Spark.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org