Hi Vincent,

Can you please explain what you mean by HTTP(S) support for the ODBC driver?
I'm not quite sure I get it.

Best Regards,
Igor

On Thu, Oct 6, 2016 at 9:59 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:

> Thanks.
>
> Starting the Thrift server with IgniteRDD tables doesn't seem very hard.
> Implementing a security layer over the Ignite cache may be harder, as I need to:
> - get the username from the Thrift server
> - intercept each request and check permissions
> Maybe Spark will also be able to handle permissions...
>
> I will keep you informed.
>
> On 6 Oct 2016, at 00:12, "Denis Magda" <dma...@gridgain.com> wrote:
>
>> Vincent,
>>
>> Please see below.
>>
>> On Oct 5, 2016, at 4:31 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>>
>> Hi,
>> thanks for your explanations. Please find more questions inline.
>>
>> Vincent
>>
>> 2016-10-05 3:33 GMT+02:00 Denis Magda <dma...@gridgain.com>:
>>
>>> Hi Vincent,
>>>
>>> See my answers inline.
>>>
>>> On Oct 4, 2016, at 12:54 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>>>
>>> Hi,
>>> I know that Ignite has SQL support, but:
>>> - The ODBC driver doesn't seem to provide HTTP(S) support, which is easier to integrate on corporate networks with rules, firewalls, and proxies.
>>>
>>> *Igor Sapego*, what URIs are supported presently?
>>>
>>> - The SQL engine doesn't seem to scale like Spark SQL would. For instance, Spark won't generate an OOM if the dataset (source or result) doesn't fit in memory. On the Ignite side, it's not clear…
>>>
>>> OOM is not related to the scalability topic at all; this is about the application's logic.
>>>
>>> The Ignite SQL engine scales out along with your cluster. Moreover, Ignite supports indexes, which gives you O(log N) running time for your SQL queries, while with Spark you will face full scans (O(N)) all the time.
>>>
>>> However, to benefit from Ignite SQL queries you have to put all the data in memory.
>>> Ignite doesn't go to a CacheStore (Cassandra, a relational database, MongoDB, etc.) while a SQL query is executed, and it won't preload anything from an underlying CacheStore. Automatic preloading works for key-value queries like cache.get(key).
>>
>> This is an issue because I will potentially have to query TBs of data. If I use the Spark Thrift server backed by an IgniteRDD, does it solve this point, and can I get automatic preloading from C*?
>>
>> An IgniteRDD will load missing (key, value) tuples from Cassandra, because essentially an IgniteRDD is an IgniteCache and Cassandra is a CacheStore. The only thing left to check is whether the Spark Thrift server can work with IgniteRDDs. I hope you will be able to figure this out and share your feedback with us.
>>
>>> - The Spark Thrift server can manage multi-tenancy: different users can connect to the same SQL engine and share the cache. In Ignite it's one cache per user, so a big waste of RAM.
>>>
>>> Everyone can connect to an Ignite cluster and work with the same set of distributed caches. I'm not sure why you need to create caches with the same content for every user.
>>
>> It's a security issue: an Ignite cache doesn't provide multiple user accounts per cache. I am thinking of using Spark to authenticate multiple users, and then having Spark use a shared account on the Ignite cache.
>>
>> Basically, Ignite provides basic security interfaces, and some implementations, which you can rely on when building your secure solution.
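Coming back to the open question above (whether the Spark Thrift server can work with IgniteRDDs): a rough sketch of exposing a shared RDD to Spark SQL might look like the following. This is illustrative only; it assumes a running Ignite cluster, a cache named "vincentCache" (hypothetical), and the Ignite 1.x-era ignite-spark API, whose signatures vary between versions.

```scala
// Sketch: expose an Ignite shared RDD to Spark SQL so that JDBC/ODBC
// clients of the Thrift server can query it. `sc` is the SparkContext
// of the application hosting the Thrift server.
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.sql.SQLContext

val ic = new IgniteContext[Long, String](sc,
  () => new IgniteConfiguration())  // or a path to a Spring XML config

// A shared RDD is a view over an Ignite cache; key-value reads that miss
// in memory go through the cache's CacheStore (e.g. Cassandra).
val sharedRdd = ic.fromCache("vincentCache")

// Register it as a temp table visible to SQL clients.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = sharedRdd.map { case (k, v) => (k, v) }.toDF("id", "data")
df.registerTempTable("my_ignite_table")
```

Whether tables registered this way are actually visible to an already-running Thrift server session is exactly the point Vincent offered to verify.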
>> This article can be useful for your case:
>> http://smartkey.co.uk/development/securing-an-apache-ignite-cluster/
>>
>> —
>> Denis
>>
>>> If you need real multi-tenancy support, where cacheA may be accessed only by users from group A and cacheB only by users from group B, then you can take a look at GridGain, which is built on top of Ignite:
>>> https://gridgain.readme.io/docs/multi-tenancy
>>
>> OK, but I am evaluating open-source-only solutions (Kylin, Druid, Alluxio...); it's a constraint from my hierarchy.
>>
>>> What I want to achieve is:
>>> - use Cassandra as the data store, as it provides idempotence (HDFS/Hive doesn't), resulting in exactly-once semantics without any duplicates;
>>> - use the Spark SQL Thrift server in multi-tenancy mode for large-scale ad-hoc analytics queries (> 1 TB) from an ODBC driver over HTTP(S);
>>> - accelerate Cassandra reads when the data modeling of the Cassandra table doesn't fit the queries. Queries would be OLAP-style: they target multiple C* partitions, with group-bys or filters on lots of dimensions that aren't necessarily in the C* table key.
>>>
>>> As was mentioned, Ignite uses Cassandra as a CacheStore; you should keep this in mind. Before trying to assemble the whole chain, I would recommend you try connecting the Spark SQL Thrift server directly to Ignite and working with its shared RDDs [1]. A shared RDD (basically an Ignite cache) can be backed by Cassandra. Probably this chain will work for you, but I can't give more precise guidance on this.
>>
>> I will try to make it work and give you feedback.
>>
>>> [1] https://apacheignite-fs.readme.io/docs/ignite-for-spark
>>>
>>> —
>>> Denis
>>>
>>> Thanks for your advice.
>>>
>>> 2016-10-04 6:51 GMT+02:00 Jörn Franke <jornfra...@gmail.com>:
>>>
>>>> I am not sure that this will be performant. What do you want to achieve here? Fast lookups? Then the Cassandra Ignite store might be the right solution.
>>>> If you want to do more analytic-style queries, then you can put the data on HDFS/Hive and use the Ignite HDFS cache to cache certain partitions/tables of Hive in memory. If you want to run iterative machine-learning algorithms, you can go for Spark on top of this. You can then also use the Ignite cache for Spark RDDs.
>>>>
>>>> On 4 Oct 2016, at 02:24, Alexey Kuznetsov <akuznet...@gridgain.com> wrote:
>>>>
>>>> Hi, Vincent!
>>>>
>>>> Ignite also has SQL support (also scalable); I think it will be much faster to query Ignite directly than to query through Spark.
>>>> Also please mind that before executing queries you should load all the needed data into the cache.
>>>> To load data from Cassandra into Ignite you may use the Cassandra store [1].
>>>>
>>>> [1] https://apacheignite.readme.io/docs/ignite-with-apache-cassandra
>>>>
>>>> On Tue, Oct 4, 2016 at 4:19 AM, vincent gromakowski <vincent.gromakowsk...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I am evaluating the possibility of using Spark SQL (and its scalability) over an Ignite cache with a Cassandra persistent store to speed up read workloads like OLAP-style analytics.
>>>>> Is there any way to configure the Spark Thrift server to load an external table in Ignite, as we can do with Cassandra?
>>>>> Here is an example of a config for Spark backed by Cassandra:
>>>>>
>>>>> CREATE EXTERNAL TABLE MyHiveTable
>>>>>   ( id int, data string )
>>>>>   STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
>>>>>   TBLPROPERTIES ("cassandra.host" = "x.x.x.x",
>>>>>     "cassandra.ks.name" = "test",
>>>>>     "cassandra.cf.name" = "mytable",
>>>>>     "cassandra.ks.repfactor" = "1",
>>>>>     "cassandra.ks.strategy" = "org.apache.cassandra.locator.SimpleStrategy");
>>>>
>>>> --
>>>> Alexey Kuznetsov
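For reference on the Cassandra store mentioned in the thread: it is configured on the Ignite cache itself via the ignite-cassandra module, roughly as in the Spring XML fragment below. The cache name is illustrative, and the data-source and persistence-settings beans referenced here are assumed to be defined elsewhere in the same Spring context.

```xml
<bean class="org.apache.ignite.configuration.CacheConfiguration">
  <property name="name" value="vincentCache"/>
  <!-- Read-through lets cache.get(key) fall back to Cassandra on a miss;
       note this applies to key-value access, not to SQL queries. -->
  <property name="readThrough" value="true"/>
  <property name="writeThrough" value="true"/>
  <property name="cacheStoreFactory">
    <bean class="org.apache.ignite.cache.store.cassandra.CassandraCacheStoreFactory">
      <!-- Names of beans defined elsewhere in the Spring context. -->
      <property name="dataSourceBean" value="cassandraDataSource"/>
      <property name="persistenceSettingsBean" value="cassandraPersistenceSettings"/>
    </bean>
  </property>
</bean>
```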