Re: R and SparkR Support

Amos B. Elberg Tue, 23 Feb 2016 11:41:16 -0800

I continue to not see a point to engaging in this as a debate. 

The user acceptance speaks for itself. (As just one example, the only person who 
hasn't gotten the display system working in 208 is Eric.) So does the rate of 
change: there has been a series of pushes to 702 in the past month or two, 
either fixing problems related to (1), or adding functionality that Eric 
originally said wasn't required or was a bad idea, but put in after I pointed 
it out or users complained.  The reason that process slowed is that I've 
stopped highlighting the gaps.


If anyone has a question about any of this, I'll address it.

> On Feb 23, 2016, at 2:06 PM, Eric Charles <[email protected]> wrote:
> 
> 
>> On 23/02/16 19:52, Amos B. Elberg wrote:
>> Eric, they're not equivalent. 208 continues to have functionality 702 
>> doesn't, including the display system.
>> 
>> I'm not going to tell you what you're doing wrong in your implementation and 
>> "test" of 208, because the users don't seem to have the same confusion, and 
>> I've essentially been guiding your development process by pointing out the 
>> issues.
>> 
>> All three of the issues you raise were addressed already in other threads:
>> 
>> 1. The proposed approach to rscala actually introduces maintenance issues 
>> that have already broken 702. 702 was then revised to work around that, by 
>> distributing part of rscala in binary form. But the workaround doesn't deal 
>> with the issue of R users updating their own installations, and it 
>> eliminates the purported benefit of the approach.
> 
> Using binary form with a specific version at build time is the classical way 
> to deploy on machines. Upgrading machines with a new rscala library implies 
> rebuilding and redeploying.
> 
> This flexibility is only possible with binaries and not with forked fixed 
> source code. With 702, you can choose to build with scala 2.xx and rscala 
> 1.0.8 or the version you want to align with the library available on your 
> machines.
> 
>> 2. This is purely cosmetic. 208 is outside the spark module because it made 
>> development, testing and merging cleaner.
> 
> Sure, this is cosmetic, but I have tried to stick to the existing pyspark 
> implementation to avoid additional maven modules. Btw, having two magic 
> keywords as 208 offers is also something I have avoided to align with current 
> practices and make it simple for the end user.
> 
>> 
>> 3. 208 has supported the HTML, TABLE and IMG display system all along, in an 
>> R-consistent manner. 702 originally did not support any of it. After I 
>> pointed out the gap and users complained, 702 was revised to implement it 
>> partially. 702 still does not. That's why the user questions about this all 
>> get asked on 702 - the people using 208 don't need to ask about it, because 
>> it works as expected.
> 
> I quickly pulled and tested your branch today, but running 
> print("%html <h1>hello</h1>") didn't work. I will try again tomorrow.
> 
>>> On Feb 23, 2016, at 1:20 PM, Eric Charles <[email protected]> wrote:
>>> 
>>> It would make no sense merging both.
>>> 
>>> From an end-user perspective, I guess both are equivalent, although with 
>>> the last commit I made, the Zeppelin Display system is supported in 702 (I 
>>> had no luck when testing this functionality with 208). As I said, feel free 
>>> to test both and send feature requests.
>>> 
>>> From a developer perspective, I will reiterate the points I sent on [1] 
>>> which are addressed in 702 (these points make sense to me but haven't 
>>> received a response so far - I would like to get feedback on these):
>>> 
>>> 1.- Use the rscala jar instead of forking -> this makes it possible to 
>>> support the platform version (scala version...) and to benefit from new 
>>> rscala releases and patches without maintaining a fork in the zeppelin 
>>> source tree.
>>> 
>>> 2.- Just like Python, develop R in the Spark module
>>> 
>>> 3.- Support the same behavior as the rest (no TABLE when output is a 
>>> dataframe, support the HTML, TABLE and IMG display system, support the 
>>> Dynamic Form system).
>>> 
>>> I still have the Dynamic Form system operational.
>>> 
>>> [1] 
>>> http://mail-archives.apache.org/mod_mbox/incubator-zeppelin-dev/201512.mbox/%3C5683E471.9010001%40apache.org%3E
>>> 
>>>> On 23/02/16 19:09, Jeff Steinmetz wrote:
>>>> Thank you Amos Elberg & Eric Charles:
>>>> Is the goal of the community to merge both 208 and 702 at some point as 
>>>> two “different” R interpreters?
>>>> 
>>>> One that is
>>>>   %r
>>>> And another that is
>>>>   %spark.r
>>>> 
>>>> Still trying to wrap my head around the difference.
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 2/23/16, 9:34 AM, "Amos B. Elberg" <[email protected]> wrote:
>>>>> 
>>>>> Jeff - 702 isn't a fork, it's an alternative based on 208 that has a 
>>>>> subset of 208's features.  208 is the superset. 208 is also what the 
>>>>> community is now attempting to integrate.
>>>>> 
>>>>> R does support serialization of functions.
>>>>> 
>>>>> 208 does support passing a spark table back and forth between R and 
>>>>> scala. Passing a data.frame through the Zeppelin context will fail in 
>>>>> spark up to 1.5. It may now be working for some data frames in 1.6.
>>>>> 
>>>>> There are examples that do all these things in the documentation for 208 
>>>>> on my repo at github.com/elbamos/Zeppelin-With-R
>>>>> 
>>>>>> On Feb 23, 2016, at 12:03 PM, Jeff Steinmetz 
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> Hello zeppelin dev group,
>>>>>> 
>>>>>> Regarding the R Interpreter pull requests 208 and 702: I am trying to 
>>>>>> figure out whether their functionality overlaps, or whether one 
>>>>>> supports something the other does not.  Is 702 a superset of 208 
>>>>>> (702 is a fork of 208)?
>>>>>> 
>>>>>> Can you pass the reference of a distributed (parallelized) dataframe 
>>>>>> built in %spark (scala) to the R interpreter?   Similar to z.put("myDF", 
>>>>>> myDF)?
>>>>>> 
>>>>>> Similarly, since R doesn’t support serialization of functions (unless 
>>>>>> you use something from the SparkR library), is there an example of 
>>>>>> collecting the parallel DF to a local DF (which I realize means the 
>>>>>> dataset needs to fit in local memory on the zeppelin server)?
>>>>>> 
>>>>>> I can dig into this a bit and help out where appropriate; however, it's 
>>>>>> unclear which PR to focus my efforts on.
>>>>>> 
>>>>>> Best,
>>>>>> Jeff Steinmetz
>>>>>> Principal Architect
>>>>>> Akili Interactive Labs
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 2/23/16, 8:01 AM, "elbamos" <[email protected]> wrote:
>>>>>>> 
>>>>>>> Github user elbamos commented on the pull request:
>>>>>>> 
>>>>>>>   
>>>>>>> https://github.com/apache/incubator-zeppelin/pull/702#issuecomment-187764059
>>>>>>> 
>>>>>>>   @btiernay support for that has been in 208 all along...
>>>>>>> 
>>>>>>>> On Feb 23, 2016, at 9:27 AM, Bob Tiernay <[email protected]> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> @echarles This is great! Thanks for all your hard work. Very much 
>>>>>>>> appreciated!
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>
