Nikolay,

Let's estimate the strategy implementation work, and then decide whether to
merge the code in its current state or not. If anything is unclear, please
start a separate discussion.

-Val

On Fri, Nov 24, 2017 at 5:42 AM, Николай Ижиков <nizhikov....@gmail.com>
wrote:

> Hello, Val, Denis.
>
> > Personally, I think that we should release the integration only after
> > the strategy is fully supported.
>
> I see two major reasons to propose merging the DataFrame API
> implementation without the custom strategy:
>
> 1. My PR is already relatively huge. From my experience of interacting
> with the Ignite community, the bigger a PR becomes, the more committer
> time is required to review it.
> So I propose to move in smaller, but complete, steps here.
>
> 2. It is not clear to me what exactly "custom strategy and
> optimization" includes.
> It seems additional discussion is required.
> I think I can put my thoughts on paper and start that discussion right
> after the basic implementation is done.
>
> > Custom strategy implementation is actually very important for this
> integration.
>
> Understood and fully agreed.
> I'm ready to continue working in that area.
>
> On 23.11.2017 02:15, Denis Magda wrote:
>
>> Val, Nikolay,
>>
>> Personally, I think that we should release the integration only after the
>> strategy is fully supported. Without the strategy we don’t really leverage
>> Ignite’s SQL engine, and we introduce redundant data movement between
>> Ignite and Spark nodes.
>>
>> How big is the effort to support the strategy in terms of the amount of
>> work left? 40%, 60%, 80%?
>>
>> —
>> Denis
>>
>> On Nov 22, 2017, at 2:57 PM, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:
>>>
>>> Nikolay,
>>>
>>> Custom strategy implementation is actually very important for this
>>> integration. Basically, it will allow us to create a SQL query for
>>> Ignite and execute it directly on the cluster. Your current
>>> implementation only adds a new DataSource, which means that Spark will
>>> fetch data into its own memory first, and then do most of the work
>>> there (like joins, for example). Does it make sense to you? Can you
>>> please take a look at this and provide your thoughts on how much
>>> development is implied there?
>>>
>>> The current code looks good to me, though, and I'm OK if the strategy is
>>> implemented as a next step in the scope of a separate ticket. I will do
>>> the final review early next week and will merge it if everything is OK.
>>>
>>> -Val
>>>
>>> On Thu, Oct 19, 2017 at 7:29 AM, Николай Ижиков <nizhikov....@gmail.com>
>>> wrote:
>>>
>>>> Hello.
>>>>
>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
>>>>> implementations and what is the difference?
>>>>
>>>> IgniteCatalog removed.
>>>>
>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be
>>>>> set manually on SQLContext each time it's created....Is there any way to
>>>>> automate this and improve usability?
>>>>
>>>> IgniteStrategy and IgniteOptimization are removed, as they are empty now.
>>>>
>>>>> Actually, I think it makes sense to create a builder similar to
>>>>> SparkSession.builder()...
>>>>
>>>> IgniteBuilder added.
>>>> Syntax looks like:
>>>>
>>>> ```
>>>> val igniteSession = IgniteSparkSession.builder()
>>>>     .appName("Spark Ignite catalog example")
>>>>     .master("local")
>>>>     .config("spark.executor.instances", "2")
>>>>     .igniteConfig(CONFIG)
>>>>     .getOrCreate()
>>>>
>>>> igniteSession.catalog.listTables().show()
>>>> ```
>>>>
>>>> Please, see updated PR - https://github.com/apache/ignite/pull/2742
>>>>
>>>> 2017-10-18 20:02 GMT+03:00 Николай Ижиков <nizhikov....@gmail.com>:
>>>>
>>>>> Hello, Valentin.
>>>>>
>>>>> My answers are below.
>>>>> Dmitry, do we need to move the discussion to Jira?
>>>>>
>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our codebase?
>>>>>
>>>>> As I mentioned earlier, to implement and override the Spark Catalog one
>>>>> has to use the internal (private) Spark API.
>>>>> So I have to use the package `org.apache.spark.sql.***` to have access
>>>>> to private classes and variables.
>>>>>
>>>>> For example, the SharedState class, which stores the link to
>>>>> ExternalCatalog, is declared as `private[sql] class SharedState` - i.e.
>>>>> package-private.
>>>>>
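>>>>> For illustration, here is a minimal sketch (the subclass is
>>>>> hypothetical, not from the PR) of why the package placement matters:
>>>>>
>>>>> ```
>>>>> package org.apache.spark.sql.ignite
>>>>>
>>>>> import org.apache.spark.SparkContext
>>>>> import org.apache.spark.sql.internal.SharedState
>>>>>
>>>>> // This compiles only because the file lives under org.apache.spark.sql:
>>>>> // SharedState is `private[sql]`, visible to that package and its
>>>>> // subpackages only.
>>>>> private[sql] class IgniteSharedState(sc: SparkContext)
>>>>>     extends SharedState(sc) {
>>>>>   // the externalCatalog link can be overridden here
>>>>> }
>>>>> ```
>>>>>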
>>>>> Can these classes reside under org.apache.ignite.spark instead?
>>>>>>
>>>>>
>>>>> No, as long as we want to have our own implementation of
>>>>> ExternalCatalog.
>>>>>
>>>>>> 2. IgniteRelationProvider contains multiple constants which I guess are
>>>>>> some kind of config options. Can you describe the purpose of each of
>>>>>> them?
>>>>>
>>>>> I extended the comments for these options.
>>>>> Please see my commit [1] or the PR HEAD.
>>>>>
>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
>>>>>> implementations and what is the difference?
>>>>>
>>>>> Good catch, thank you!
>>>>> After additional research I found that only IgniteExternalCatalog is
>>>>> required.
>>>>> I will update the PR with the IgniteCatalog removal in a few days.
>>>>>
>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What are
>>>>>> our plans on implementing them? Also, what exactly is planned in
>>>>>> IgniteOptimization and what is its purpose?
>>>>>
>>>>> Actually, this is a very good question :)
>>>>> And I need advice from experienced community members here:
>>>>>
>>>>> The purpose of `IgniteOptimization` is to modify the query plan created
>>>>> by Spark.
>>>>> Currently, we have one optimization, described in IGNITE-3084 [2] by
>>>>> you, Valentin :) :
>>>>>
>>>>> “If there are non-Ignite relations in the plan, we should fall back to
>>>>> native Spark strategies”
>>>>>
>>>>> I think we can go a little further and reduce a join of two
>>>>> Ignite-backed Data Frames into a single Ignite SQL query (see the
>>>>> sketch below). Currently, this feature is unimplemented.
>>>>>
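>>>>> To make the idea concrete, a hypothetical skeleton of such a rule
>>>>> (IgniteRelation is a placeholder marker trait, not a class from the PR):
>>>>>
>>>>> ```
>>>>> import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
>>>>> import org.apache.spark.sql.catalyst.rules.Rule
>>>>> import org.apache.spark.sql.execution.datasources.LogicalRelation
>>>>> import org.apache.spark.sql.sources.BaseRelation
>>>>>
>>>>> trait IgniteRelation { self: BaseRelation => } // placeholder marker
>>>>>
>>>>> object FuseIgniteJoins extends Rule[LogicalPlan] {
>>>>>   override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
>>>>>     // Both join children read from Ignite, so the whole JOIN could be
>>>>>     // executed as one Ignite SQL query instead of a Spark-side join.
>>>>>     case join @ Join(LogicalRelation(_: IgniteRelation, _, _),
>>>>>                      LogicalRelation(_: IgniteRelation, _, _), _, _) =>
>>>>>       join // TODO: replace with a single pushed-down Ignite relation
>>>>>   }
>>>>> }
>>>>> ```
>>>>>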
>>>>> *Do we need it now? Or can we postpone it and concentrate on the basic
>>>>> Data Frame and Catalog implementation?*
>>>>>
>>>>> The purpose of a `Strategy`, as you correctly mentioned in [2], is to
>>>>> transform a LogicalPlan into physical operators.
>>>>> I don’t have ideas on how to use this opportunity, so I think we don’t
>>>>> need IgniteStrategy.
>>>>>
>>>>> Can you or anyone else suggest some optimization strategy to speed up
>>>>> SQL query execution?
>>>>>
>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be
>>>>>> set manually on SQLContext each time it's created....Is there any way
>>>>>> to automate this and improve usability?
>>>>>
>>>>> These classes are added to `extraOptimizations` when one uses
>>>>> IgniteSparkSession.
>>>>> As far as I know, there is no way to automatically add these classes to
>>>>> a regular SparkSession.
>>>>>
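>>>>> For reference, a sketch of that manual registration on a plain
>>>>> SparkSession (assuming IgniteOptimization is a Rule[LogicalPlan] and
>>>>> IgniteStrategy a Strategy):
>>>>>
>>>>> ```
>>>>> // Every newly created session needs these lines again, which is
>>>>> // exactly the usability concern raised in the question.
>>>>> spark.experimental.extraOptimizations ++= Seq(IgniteOptimization)
>>>>> spark.experimental.extraStrategies ++= Seq(IgniteStrategy)
>>>>> ```
>>>>>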
>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used in
>>>>>> IgniteCatalogExample but not in IgniteDataFrameExample, which is
>>>>>> confusing.
>>>>>
>>>>> The DataFrame API is a *public* Spark API. So anyone can provide an
>>>>> implementation and plug it into Spark. That’s why IgniteDataFrameExample
>>>>> doesn’t need any Ignite-specific session.
>>>>>
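>>>>> For example, reading an Ignite SQL table needs nothing but the public
>>>>> reader API (the format name "ignite" and the option keys below are
>>>>> assumptions for illustration):
>>>>>
>>>>> ```
>>>>> val df = spark.read
>>>>>     .format("ignite")             // resolved to IgniteRelationProvider
>>>>>     .option("config", CONFIG)     // path to the Ignite XML config
>>>>>     .option("table", "person")
>>>>>     .load()
>>>>>
>>>>> df.printSchema()
>>>>> ```
>>>>>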
>>>>> The Catalog API is an *internal* Spark API. There is no way to plug a
>>>>> custom catalog implementation into Spark [3]. So we have to use
>>>>> `IgniteSparkSession`, which extends the regular SparkSession and
>>>>> overrides the links to `ExternalCatalog`.
>>>>>
>>>>>> 7. To create IgniteSparkSession we first create IgniteContext. Is it
>>>>>> really needed? It looks like we can directly provide the configuration
>>>>>> file; if IgniteSparkSession really requires IgniteContext, it can
>>>>>> create it by itself under the hood.
>>>>>
>>>>> Actually, IgniteContext is the base class for the Ignite <-> Spark
>>>>> integration for now, so I tried to reuse it here. I like the idea of
>>>>> removing the explicit usage of IgniteContext.
>>>>> I will implement it in a few days.
>>>>>
>>>>>> Actually, I think it makes sense to create a builder similar to
>>>>>> SparkSession.builder()...
>>>>>
>>>>> Great idea! I will implement such a builder in a few days.
>>>>>
>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the case
>>>>>> when we don't have SQL configured on the Ignite side?
>>>>>
>>>>> Yes, IgniteCacheRelation is the Data Frame implementation for a
>>>>> key-value cache.
>>>>>
>>>>>> I thought we decided not to support this, no? Or is this something
>>>>>> else?
>>>>>
>>>>> My understanding is the following:
>>>>>
>>>>> 1. We can’t support automatic resolution of key-value caches in
>>>>> *ExternalCatalog*, because there is no way to reliably detect the key
>>>>> and value classes.
>>>>>
>>>>> 2. We can support key-value caches in the regular Data Frame
>>>>> implementation, because we can require the user to provide the key and
>>>>> value classes explicitly (see the sketch below).
>>>>>
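>>>>> A minimal sketch of what that explicit API could look like (the option
>>>>> names keyClass/valueClass are hypothetical, not from the PR):
>>>>>
>>>>> ```
>>>>> val kvDF = spark.read
>>>>>     .format("ignite")
>>>>>     .option("config", CONFIG)
>>>>>     .option("cache", "testCache")
>>>>>     .option("keyClass", "java.lang.Long")
>>>>>     .option("valueClass", "org.example.Person")
>>>>>     .load()
>>>>> ```
>>>>>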
>>>>>> 8. Can you clarify the query syntax in
>>>>>> IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
>>>>>
>>>>> Key-value cache:
>>>>>
>>>>> key - java.lang.Long
>>>>> value - case class Person(name: String, birthDate: java.util.Date)
>>>>>
>>>>> The schema of the data frame for this cache is:
>>>>>
>>>>> key - long
>>>>> value.name - string
>>>>> value.birthDate - date
>>>>>
>>>>> So we can select data from the cache:
>>>>>
>>>>> ```
>>>>> SELECT
>>>>>   key, `value.name`, `value.birthDate`
>>>>> FROM
>>>>>   testCache
>>>>> WHERE key >= 2 AND `value.name` LIKE '%0'
>>>>> ```
>>>>>
>>>>> [1] https://github.com/apache/ignite/pull/2742/commits/faf3ed6febf417bc59b0519156fd4d09114c8da7
>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-3084?focusedCommentId=15794210&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15794210
>>>>> [3] https://issues.apache.org/jira/browse/SPARK-17767?focusedCommentId=15543733&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15543733
>>>>>
>>>>>
>>>>> On 18.10.2017 04:39, Dmitriy Setrakyan wrote:
>>>>>
>>>>>> Val, thanks for the review. Can I ask you to add the same comments to
>>>>>> the ticket?
>>>>>>
>>>>>> On Tue, Oct 17, 2017 at 3:20 PM, Valentin Kulichenko <
>>>>>> valentin.kuliche...@gmail.com> wrote:
>>>>>>
>>>>>>> Nikolay, Anton,
>>>>>>>
>>>>>>> I did a high level review of the code. First of all, impressive
>>>>>>> results!
>>>>>>> However, I have some questions/comments.
>>>>>>>
>>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our
>>>>>>> codebase? Can these classes reside under org.apache.ignite.spark
>>>>>>> instead?
>>>>>>> 2. IgniteRelationProvider contains multiple constants which I guess
>>>>>>> are some kind of config options. Can you describe the purpose of each
>>>>>>> of them?
>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
>>>>>>> Catalog implementations and what is the difference?
>>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What
>>>>>>> are our plans on implementing them? Also, what exactly is planned in
>>>>>>> IgniteOptimization and what is its purpose?
>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be
>>>>>>> set manually on SQLContext each time it's created. This seems to be
>>>>>>> very error prone. Is there any way to automate this and improve
>>>>>>> usability?
>>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used
>>>>>>> in IgniteCatalogExample but not in IgniteDataFrameExample, which is
>>>>>>> confusing.
>>>>>>> 7. To create IgniteSparkSession we first create IgniteContext. Is it
>>>>>>> really needed? It looks like we can directly provide the
>>>>>>> configuration file; if IgniteSparkSession really requires
>>>>>>> IgniteContext, it can create it by itself under the hood. Actually, I
>>>>>>> think it makes sense to create a builder similar to
>>>>>>> SparkSession.builder(); it would be good if our APIs here are
>>>>>>> consistent with Spark APIs.
>>>>>>> 8. Can you clarify the query syntax in
>>>>>>> IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
>>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the case
>>>>>>> when we don't have SQL configured on the Ignite side? I thought we
>>>>>>> decided not to support this, no? Or is this something else?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -Val
>>>>>>>
>>>>>>> On Tue, Oct 17, 2017 at 4:40 AM, Anton Vinogradov <
>>>>>>> avinogra...@gridgain.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sounds awesome.
>>>>>>>>
>>>>>>>> I'll try to review the API & tests this week.
>>>>>>>>
>>>>>>>> Val,
>>>>>>>> Your review is still required :)
>>>>>>>>
>>>>>>>> On Tue, Oct 17, 2017 at 2:36 PM, Николай Ижиков <
>>>>>>>> nizhikov....@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes
>>>>>>>>>
>>>>>>>>> On Oct 17, 2017, at 2:34 PM, "Anton Vinogradov" <avinogra...@gridgain.com> wrote:
>>>>>>>>>
>>>>>>>>>> Nikolay,
>>>>>>>>>>
>>>>>>>>>> So, it will be able to start regular Spark and Ignite clusters
>>>>>>>>>> and, using peer classloading via the Spark context, perform any
>>>>>>>>>> DataFrame request, correct?
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 17, 2017 at 2:25 PM, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello, Anton.
>>>>>>>>>>>
>>>>>>>>>>> The example you provide is a path to a file *local* to the master.
>>>>>>>>>>> These libraries are added to the classpath of each remote node
>>>>>>>>>>> running the submitted job.
>>>>>>>>>>>
>>>>>>>>>>> Please, see the documentation:
>>>>>>>>>>>
>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addJar(java.lang.String)
>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addFile(java.lang.String)
>>>>>>>>>>>
>>>>>>>>>>> 2017-10-17 13:10 GMT+03:00 Anton Vinogradov <avinogra...@gridgain.com>:
>>>>>>>>>>>
>>>>>>>>>>>> Nikolay,
>>>>>>>>>>>>
>>>>>>>>>>>>> With Data Frame API implementation there are no requirements to
>>>>>>>>>>>>> have any Ignite files on spark worker nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> What do you mean? I see code like:
>>>>>>>>>>>>
>>>>>>>>>>>> spark.sparkContext.addJar(MAVEN_HOME +
>>>>>>>>>>>>     "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-core-2.3.0-SNAPSHOT.jar")
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have created an example application to run Ignite Data Frames
>>>>>>>>>>>>> on a standalone Spark cluster.
>>>>>>>>>>>>> With the Data Frame API implementation there are no requirements
>>>>>>>>>>>>> to have any Ignite files on the Spark worker nodes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I ran this application on a free dataset: ATP tennis match
>>>>>>>>>>>>> statistics.
>>>>>>>>>>>>>
>>>>>>>>>>>>> data - https://github.com/nizhikov/atp_matches
>>>>>>>>>>>>> app - https://github.com/nizhikov/ignite-spark-df-example
>>>>>>>>>>>>>
>>>>>>>>>>>>> Valentin, did you have a chance to look at my changes?
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <valentin.kuliche...@gmail.com>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Nikolay,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry for the delay on this, got a little swamped lately. I
>>>>>>>>>>>>>> will do my best to review the code this week.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Val
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello, Valentin.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Did you have a chance to look at my changes?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Now I think I have done almost all the required features.
>>>>>>>>>>>>>>> I want to run some performance tests to ensure my
>>>>>>>>>>>>>>> implementation works properly with a significant amount of
>>>>>>>>>>>>>>> data.
>>>>>>>>>>>>>>> And I definitely need some feedback on my changes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <nizhikov....@gmail.com>:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Which version of Spark do we want to use?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Currently, Ignite depends on Spark 2.1.0.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>      * Can be run on JDK 7.
>>>>>>>>>>>>>>>>      * Still supported: 2.1.2 will be released soon.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2. The latest Spark version is 2.2.0.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>      * Can be run only on JDK 8+.
>>>>>>>>>>>>>>>>      * Released Jul 11, 2017.
>>>>>>>>>>>>>>>>      * Already supported by huge vendors (Amazon, for example).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Note that in IGNITE-3084 I implement some internal Spark API.
>>>>>>>>>>>>>>>> So it will take some effort to switch between Spark 2.1 and 2.2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
>>>>>>>>>>>>>>>> valentin.kuliche...@gmail.com>:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I will review in the next few days.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Val
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <dma...@apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello Nikolay,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is good news. Finally this capability is coming to
>>>>>>>>>>>>>>>>>> Ignite.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Val, Vladimir, could you do a preliminary review?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Answering your questions.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. Yardstick should be enough for performance measurements.
>>>>>>>>>>>>>>>>>> As a Spark user, I will be curious to know what’s the point
>>>>>>>>>>>>>>>>>> of this integration. Probably we need to compare the
>>>>>>>>>>>>>>>>>> Spark + Ignite case against Spark + Hive or Spark + RDBMS
>>>>>>>>>>>>>>>>>> cases.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2. If the Spark community is reluctant, let’s include the
>>>>>>>>>>>>>>>>>> module in the ignite-spark integration.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> —
>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sep 25, 2017, at 11:14 AM, Николай Ижиков <nizhikov....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Currently, I’m working on the integration between Spark
>>>>>>>>>>>>>>>>>>> and Ignite [1].
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For now, I have implemented the following:
>>>>>>>>>>>>>>>>>>>     * Ignite DataSource implementation (IgniteRelationProvider)
>>>>>>>>>>>>>>>>>>>     * DataFrame support for Ignite SQL tables.
>>>>>>>>>>>>>>>>>>>     * IgniteCatalog implementation for a transparent
>>>>>>>>>>>>>>>>>>>       resolving of Ignite SQL tables.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The implementation can be found in PR [2].
>>>>>>>>>>>>>>>>>>> It would be great if someone provides feedback for the
>>>>>>>>>>>>>>>>>>> prototype.
>>>>>>>>>>>>>>>>>>> I made some examples in the PR so you can see how the API
>>>>>>>>>>>>>>>>>>> is supposed to be used [3], [4].
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I need some advice. Can you help me?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. How should this PR be tested?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Of course, I need to provide some unit tests. But what
>>>>>>>>>>>>>>>>>>> about scalability tests, etc.?
>>>>>>>>>>>>>>>>>>> Maybe we need some Yardstick benchmark or similar?
>>>>>>>>>>>>>>>>>>> What are your thoughts?
>>>>>>>>>>>>>>>>>>> Which scenarios should I consider in the first place?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. Should we provide the Spark Catalog implementation
>>>>>>>>>>>>>>>>>>> inside the Ignite codebase?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The current implementation of the Spark Catalog is based
>>>>>>>>>>>>>>>>>>> on the *internal Spark API*.
>>>>>>>>>>>>>>>>>>> The Spark community seems not interested in making the
>>>>>>>>>>>>>>>>>>> Catalog API public or in including an Ignite Catalog in
>>>>>>>>>>>>>>>>>>> the Spark code base [5], [6].
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *Should we include a Spark internal API implementation
>>>>>>>>>>>>>>>>>>> inside the Ignite code base?*
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Or should we consider including the Catalog implementation
>>>>>>>>>>>>>>>>>>> in some external module that will be created and released
>>>>>>>>>>>>>>>>>>> outside Ignite? (We can still support and develop it
>>>>>>>>>>>>>>>>>>> inside the Ignite community.)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-3084
>>>>>>>>>>>>>>>>>>> [2] https://github.com/apache/ignite/pull/2742
>>>>>>>>>>>>>>>>>>> [3] https://github.com/apache/ignite/pull/2742/files#diff-f4ff509cef3018e221394474775e0905
>>>>>>>>>>>>>>>>>>> [4] https://github.com/apache/ignite/pull/2742/files#diff-f2b670497d81e780dfd5098c5dd8a89c
>>>>>>>>>>>>>>>>>>> [5] http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Core-Custom-Catalog-Integration-between-Apache-Ignite-and-Apache-Spark-td22452.html
>>>>>>>>>>>>>>>>>>> [6] https://issues.apache.org/jira/browse/SPARK-17767
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>>>>>>>>>>> nizhikov....@gmail.com
>>>>>>>>>>>>>>>>>>>
>>>>
>>>> --
>>>> Nikolay Izhikov
>>>> nizhikov....@gmail.com
>>>>
>>>>
>>
