Hi Teng,

what version of Spark are you using as the execution engine? are you
using a vendor's product here?

thanks

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 27 May 2016 at 13:05, Teng Qiu <teng...@gmail.com> wrote:

> I agree with Koert and Reynold, Spark works well with large datasets now.
>
> Back to the original discussion: comparing SparkSQL vs Hive on Spark vs
> the Spark API.
>
> For SparkSQL vs the Spark API, you can simply imagine you are in the
> RDBMS world: SparkSQL is pure SQL, and the Spark API is the language
> for writing stored procedures.
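>
> for example, the same aggregation in both worlds (a rough sketch only,
> assuming Spark 1.6's HiveContext; the "events" table and its columns
> are made up):
>
>   // SparkSQL: pure SQL over a (hypothetical) Hive table
>   val hc = new org.apache.spark.sql.hive.HiveContext(sc)
>   val bySql = hc.sql("SELECT user, COUNT(*) FROM events GROUP BY user")
>
>   // Spark API: the "stored procedure" side, equivalent result
>   val byApi = hc.table("events").groupBy("user").count()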
>
> Hive on Spark is similar to SparkSQL: it is a pure SQL interface that
> uses Spark as the execution engine. SparkSQL uses Hive's syntax, so as
> a language, I would say they are almost the same.
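>
> (enabling Hive on Spark itself is a single setting, assuming a Hive
> version built with Spark support:
>
>   SET hive.execution.engine=spark;
>
> either per session or in hive-site.xml; everything else stays plain
> HiveQL.)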
>
> But Hive on Spark has much better support for Hive features, especially
> HiveServer2 and the security features; the Hive features in SparkSQL
> are really buggy. There is a HiveServer2 implementation in SparkSQL,
> but in the latest release version (1.6.x) it doesn't work with the
> hivevar and hiveconf arguments anymore, and the username for login via
> JDBC doesn't work either...
> see https://issues.apache.org/jira/browse/SPARK-13983
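>
> concretely, it is arguments like these that no longer work against the
> SparkSQL thrift server in 1.6 (host, port, user and variable names here
> are only placeholders), while plain hiveserver2 accepts them fine:
>
>   beeline -u jdbc:hive2://localhost:10000 -n myuser \
>       --hivevar tbl=events --hiveconf mapred.job.queue.name=etl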
>
> I believe Hive support is really very low-priority stuff in the Spark
> project...
>
> Sadly, Hive on Spark integration is not that easy, there are a lot of
> dependency conflicts... such as
> https://issues.apache.org/jira/browse/HIVE-13301
>
> Our requirement is to use Spark with HiveServer2 in a secure way (with
> authentication and authorization); currently SparkSQL alone cannot
> provide this, so we are using Ranger/Sentry + Hive on Spark.
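>
> (as a sketch of what "secure" means here, these are the kind of
> hive-site.xml properties involved, with a placeholder principal, before
> the Ranger/Sentry plugin adds its own authorization settings on top:
>
>   <property>
>     <name>hive.server2.authentication</name>
>     <value>KERBEROS</value>
>   </property>
>   <property>
>     <name>hive.server2.authentication.kerberos.principal</name>
>     <value>hive/_HOST@EXAMPLE.COM</value>
>   </property>
>   <property>
>     <name>hive.security.authorization.enabled</name>
>     <value>true</value>
>   </property>
> )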
>
> Hope this helps you get a better idea of which direction you should go.
>
> Cheers,
>
> Teng
>
>
> 2016-05-27 2:36 GMT+02:00 Koert Kuipers <ko...@tresata.com>:
> > We do disk-to-disk iterative algorithms in Spark all the time, on
> > datasets that do not fit in memory, and it works well for us. I
> > usually have to do some tuning of the number of partitions for a new
> > dataset, but that's about it in terms of inconveniences.
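> >
> > (Concretely, that tuning is usually just something like the following,
> > with the number picked per dataset by trial and error; rawRdd and 2000
> > are placeholders:
> >
> >   // more, smaller partitions keep per-task spills manageable
> >   val tuned = rawRdd.repartition(2000)
> >   // the equivalent knob for DataFrame/SQL shuffles
> >   sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
> > )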
> >
> > On May 26, 2016 2:07 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
> >
> >
> > Spark can handle this, true, but it is optimized for working on the
> > same full dataset in memory, due to the underlying (iterative) nature
> > of machine learning algorithms. Of course, you can spill over, but you
> > should avoid that.
> >
> > That being said, you should have read my final sentence about this.
> > Both systems develop and change.
> >
> >
> > On 25 May 2016, at 22:14, Reynold Xin <r...@databricks.com> wrote:
> >
> >
> > On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfra...@gmail.com>
> wrote:
> >>
> >> Spark is more for machine learning, working iteratively over the same
> >> whole dataset in memory. Additionally, it has streaming and graph
> >> processing capabilities that can be used together.
> >
> >
> > Hi Jörn,
> >
> > The first part is actually not true. Spark can handle data far greater
> > than the aggregate memory available on a cluster. The more recent
> > versions (1.3+) of Spark have external operations for almost all
> > built-in operators, and while things may not be perfect, those
> > external operators are becoming more and more robust with each version
> > of Spark.
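> >
> > (For instance, a shuffle-heavy job like this one runs fine on input
> > far larger than cluster memory, because the sort-based shuffle spills
> > to disk; the paths are placeholders:
> >
> >   // word-count-style aggregation over an arbitrarily large input
> >   val counts = sc.textFile("hdfs:///big/input")
> >     .map(line => (line.split("\t")(0), 1L))
> >     .reduceByKey(_ + _)
> >   counts.saveAsTextFile("hdfs:///big/output")
> > )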
> >
