Thanks

Starting the Thrift server with IgniteRDD tables doesn't seem very hard.
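For reference, here is a rough sketch of how that could look, based on the Ignite-for-Spark module of that era; the config path, cache name, and (id, data) schema are assumptions, and exact signatures varied between Ignite and Spark versions:

```scala
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object IgniteThriftSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-thrift"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Ignite cache backed by the Cassandra CacheStore
    // (the Spring config path "config/ignite-cassandra.xml" is hypothetical)
    val ic = new IgniteContext[Int, String](sc, "config/ignite-cassandra.xml")
    val cacheRdd = ic.fromCache("mytable") // RDD[(Int, String)]

    // Expose the cache contents as a temp table visible to JDBC/ODBC clients
    cacheRdd.toDF("id", "data").registerTempTable("mytable")

    // Start the embedded Thrift server on this context
    HiveThriftServer2.startWithContext(sqlContext)
  }
}
```

This only runs against a live Ignite cluster, so it is a deployment sketch rather than something to copy verbatim.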
Implementing a security layer over the Ignite cache may be harder, as I need to:
- get the username from the Thrift server
- intercept each request and check permissions
Maybe Spark will also be able to handle permissions...
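To make the permission check concrete, here is a minimal sketch of the kind of ACL lookup an interception layer could perform once the username is known; the user names, table names, and ACL shape are all made up for illustration:

```scala
object CacheAcl {
  // Hypothetical ACL: which users may query which cached tables
  val acl: Map[String, Set[String]] = Map(
    "alice" -> Set("sales", "customers"),
    "bob"   -> Set("sales")
  )

  // True if `user` is allowed to query `table`
  def isAuthorized(user: String, table: String): Boolean =
    acl.getOrElse(user, Set.empty).contains(table)

  // An intercepted request is rejected before it reaches the Ignite cache
  def intercept(user: String, table: String): Either[String, String] =
    if (isAuthorized(user, table)) Right(s"forwarding $user's query on $table")
    else Left(s"permission denied for $user on $table")
}
```

The real work would be wiring this check into the request path between the Thrift server and Ignite, which is the part that remains to be investigated.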

I will keep you informed

On 6 Oct 2016 at 00:12, "Denis Magda" <dma...@gridgain.com> wrote:

> Vincent,
>
> Please see below
>
> On Oct 5, 2016, at 4:31 AM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
> Hi
> thanks for your explanations. Please find inline more questions
>
> Vincent
>
> 2016-10-05 3:33 GMT+02:00 Denis Magda <dma...@gridgain.com>:
>
>> Hi Vincent,
>>
>> See my answers inline
>>
>> On Oct 4, 2016, at 12:54 AM, vincent gromakowski <
>> vincent.gromakow...@gmail.com> wrote:
>>
>> Hi,
>> I know that Ignite has SQL support but:
>> - The ODBC driver doesn't seem to provide HTTP(S) support, which would be
>> easier to integrate into corporate networks with rules, firewalls, and proxies
>>
>>
>> *Igor Sapego*, what URIs are supported presently?
>>
>> - The SQL engine doesn't seem to scale the way Spark SQL would. For
>> instance, Spark won't generate an OOM if the dataset (source or result)
>> doesn't fit in memory. From the Ignite side, it's not clear…
>>
>>
>> OOM is not related to the scalability topic at all; this is about the
>> application's logic.
>>
>> Ignite's SQL engine scales out perfectly along with your cluster. Moreover,
>> Ignite supports indexes, which give you O(log N) running-time complexity
>> for your SQL queries, while with Spark you will face full scans (O(N))
>> all the time.
>>
>> However, to benefit from Ignite SQL queries you have to put all the data
>> in memory. Ignite doesn't go to a CacheStore (Cassandra, relational
>> database, MongoDB, etc.) while a SQL query is executed and won't preload
>> anything from an underlying CacheStore. Automatic preloading works for
>> key-value queries like cache.get(key).
>>
>
>
> This is an issue because I will potentially have to query terabytes of data.
> If I use the Spark Thrift server backed by IgniteRDD, does it solve this
> point, and can I get automatic preloading from C*?
>
>
> IgniteRDD will load missing (key-value) tuples from Cassandra because
> essentially IgniteRDD is an IgniteCache and Cassandra is a CacheStore. The
> only thing left to check is whether the Spark Thrift server can work with
> IgniteRDDs. Hope you will be able to figure this out and share your feedback
> with us.
>
>
>
>> - The Spark Thrift server can manage multi-tenancy: different users can
>> connect to the same SQL engine and share the cache. In Ignite it's one
>> cache per user, so a big waste of RAM.
>>
>>
>> Everyone can connect to an Ignite cluster and work with the same set of
>> distributed caches. I’m not sure why you need to create caches with the
>> same content for every user.
>>
>
> It's a security issue: the Ignite cache doesn't provide multiple user
> accounts per cache. I am thinking of using Spark to authenticate multiple
> users and then having Spark use a shared account on the Ignite cache.
>
>
> Basically, Ignite provides basic security interfaces and some
> implementations which you can rely on when building your secure solution.
> This article can be useful for your case:
> http://smartkey.co.uk/development/securing-an-apache-ignite-cluster/
>
> —
> Denis
>
>
>> If you need real multi-tenancy support, where cacheA is allowed to be
>> accessed only by users from group A and cacheB only by users from group B,
>> then you can take a look at GridGain, which is built on top of Ignite:
>> https://gridgain.readme.io/docs/multi-tenancy
>>
>>
>>
> OK, but I am evaluating open-source-only solutions (Kylin, Druid,
> Alluxio...); it's a constraint from my management.
>
>>
>> What I want to achieve is:
>> - use Cassandra as the data store, as it provides idempotence (HDFS/Hive
>> doesn't), resulting in exactly-once semantics without any duplicates
>> - use the Spark SQL Thrift server in multi-tenancy for large-scale ad hoc
>> analytics queries (> 1 TB) from an ODBC driver through HTTP(S)
>> - accelerate Cassandra reads when the data modeling of the Cassandra
>> table doesn't fit the queries. Queries would be OLAP style: targeting
>> multiple C* partitions, with group-bys or filters on lots of dimensions
>> that aren't necessarily in the C* table key.
>>
>>
>> As mentioned, Ignite uses Cassandra as a CacheStore; you should keep
>> this in mind. Before trying to assemble the whole chain, I would
>> recommend trying to connect the Spark SQL Thrift server directly to Ignite
>> and working with its shared RDDs [1]. A shared RDD (basically an Ignite
>> cache) can be backed by Cassandra. Probably this chain will work for you,
>> but I can't give more precise guidance on this.
>>
>>
> I will try to make it work and give you feedback
>
>
>
>> [1] https://apacheignite-fs.readme.io/docs/ignite-for-spark
>>
>> —
>> Denis
>>
>> Thanks for your advice
>>
>>
>> 2016-10-04 6:51 GMT+02:00 Jörn Franke <jornfra...@gmail.com>:
>>
>>> I am not sure that this will be performant. What do you want to achieve
>>> here? Fast lookups? Then the Cassandra Ignite store might be the right
>>> solution. If you want to do more analytic-style queries, then you can put
>>> the data on HDFS/Hive and use the Ignite HDFS cache to cache certain
>>> partitions/tables of Hive in memory. If you want to run iterative
>>> machine-learning algorithms, you can use Spark on top of this. You can
>>> then also use the Ignite cache for Spark RDDs.
>>>
>>> On 4 Oct 2016, at 02:24, Alexey Kuznetsov <akuznet...@gridgain.com>
>>> wrote:
>>>
>>> Hi, Vincent!
>>>
>>> Ignite also has SQL support (also scalable); I think it will be much
>>> faster to query directly from Ignite than through Spark.
>>> Also please note that before executing queries you should load all the
>>> needed data into the cache.
>>> To load data from Cassandra into Ignite you may use the Cassandra store [1].
>>>
>>> [1] https://apacheignite.readme.io/docs/ignite-with-apache-cassandra
>>>
>>> On Tue, Oct 4, 2016 at 4:19 AM, vincent gromakowski <vincent.
>>> gromakow...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I am evaluating the possibility of using Spark SQL (and its scalability)
>>>> over an Ignite cache with a Cassandra persistent store to speed up read
>>>> workloads like OLAP-style analytics.
>>>> Is there any way to configure the Spark Thrift server to load an external
>>>> table in Ignite, like we can do with Cassandra?
>>>> Here is an example configuration for Spark backed by Cassandra:
>>>>
>>>> CREATE EXTERNAL TABLE MyHiveTable
>>>>     ( id int, data string )
>>>>     STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
>>>>     TBLPROPERTIES ( "cassandra.host" = "x.x.x.x",
>>>>       "cassandra.ks.name" = "test",
>>>>       "cassandra.cf.name" = "mytable",
>>>>       "cassandra.ks.repfactor" = "1",
>>>>       "cassandra.ks.strategy" =
>>>>         "org.apache.cassandra.locator.SimpleStrategy" );
>>>>
>>>>
>>>
>>>
>>> --
>>> Alexey Kuznetsov
>>>
>>>
>
