Re: [DISCUSS] ZEPPELIN-2619. Save note in [Title].zpln instead of [NOTEID]/note.json

2018-08-13 Thread Felix Cheung
Perhaps one concern is users having characters in the note name that are invalid 
in a file name/file path?



From: Mohit Jaggi 
Sent: Sunday, August 12, 2018 6:02 PM
To: users@zeppelin.apache.org
Cc: dev
Subject: Re: [DISCUSS] ZEPPELIN-2619. Save note in [Title].zpln instead of 
[NOTEID]/note.json

sounds like a good idea!

On Sun, Aug 12, 2018 at 5:34 PM Jeff Zhang <zjf...@gmail.com> wrote:
Motivation

   The motivation of ZEPPELIN-2619 is to change the note storage structure. 
Previously we stored each note as {noteId}/note.json; we'd like to change this to 
{note_name}_{note_id}.zpln. There are several reasons for this change.


  1.  {noteId}/note.json is not scalable. We put all notes in one root folder in a 
flat structure, and when the Zeppelin server starts we need to read every note.json 
to get the note names and build the note folder structure (the note name, which is 
needed to build the notebook menu, is stored inside note.json). This becomes a 
nightmare when you have a large number of notes.

  2.  {noteId}/note.json is not maintainable. It is difficult for a 
developer or administrator to find a note file based on its name.

  3.  {noteId}/note.json has no folder structure. Currently Zeppelin has to build 
the folder structure internally in memory according to the note names, which is a 
big overhead.

New Approach

   As mentioned above, I propose changing the note storage structure to 
{note_name}_{note_id}.zpln. note_name can contain folders, e.g. 
folder_1/mynote_abcd.zpln.
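
To make this concrete (and to address the concern raised at the top of this 
thread about characters that are invalid in file names), here is a minimal 
Scala sketch of how such a note path could be built and parsed. The object and 
helper names are illustrative only, not Zeppelin's actual API:

{code:scala}
object NotePath {
  // Characters commonly disallowed in file names; replacing them with '-' is an
  // assumption of this sketch -- a real implementation might escape or reject them.
  private val invalid = """[\\:*?"<>|]""".r

  // Sanitize each segment of the note name, keeping '/' as the folder separator.
  def sanitize(noteName: String): String =
    noteName.split("/").map(seg => invalid.replaceAllIn(seg, "-").trim).mkString("/")

  // Build e.g. "folder_1/mynote_abcd.zpln" from a note name and a note id.
  def fileName(noteName: String, noteId: String): String =
    s"${sanitize(noteName)}_${noteId}.zpln"

  // Recover (noteName, noteId) by splitting on the last '_' of the base name,
  // so listing the storage folder is enough to build the notebook menu.
  def parse(path: String): Option[(String, String)] = {
    val base = path.stripSuffix(".zpln")
    val idx  = base.lastIndexOf('_')
    if (idx > 0) Some((base.substring(0, idx), base.substring(idx + 1))) else None
  }
}
{code}

For example, parse("folder_1/mynote_abcd.zpln") yields ("folder_1/mynote", 
"abcd") without reading the file contents at all.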

This kind of note storage structure could bring several benefits.

  1.  We don’t need to load all notes when Zeppelin starts. We just need to 
list each folder to get the note names and note ids.

  2.  It is much more maintainable: it is easy to find a note file based on 
its note name.

  3.  It already has a folder structure, which can be mapped directly to the 
note folder structure.

Side Effect

This approach only works for file-system-based storage, which means we have to drop 
support for MongoNotebookRepo. I think that is OK: I haven’t seen any users discuss 
it in the community, so I assume no one is using it.


This is the overall design; any comments and feedback are welcome. Thanks.


Here's the Google doc; you can also comment there:

https://docs.google.com/document/d/126egAQmhQOL4ynxJ3AQJQRBBLdW8TATYcGkDL1DNZoE/edit?usp=sharing




Re: [R] discuss: removing lint-r checks for old branches

2018-08-11 Thread Felix Cheung
SGTM for old branches.

I recall we need to upgrade to a newer lintr since it is missing some tests.

Also, these seem like real test failures? Are they only happening in 2.1 and 
2.2?



From: shane knapp 
Sent: Friday, August 10, 2018 4:04 PM
To: Sean Owen
Cc: Shivaram Venkataraman; Reynold Xin; dev
Subject: Re: [R] discuss: removing lint-r checks for old branches

/agreemsg

On Fri, Aug 10, 2018 at 4:02 PM, Sean Owen <sro...@gmail.com> wrote:
Seems OK to proceed with shutting off lintr, as it was masking those.

On Fri, Aug 10, 2018 at 6:01 PM shane knapp <skn...@berkeley.edu> wrote:
ugh...  R unit tests failed on both of these builds.
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94583/artifact/R/target/
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94584/artifact/R/target/



On Fri, Aug 10, 2018 at 1:58 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
Sounds good to me as well. Thanks Shane.

Shivaram
On Fri, Aug 10, 2018 at 1:40 PM Reynold Xin <r...@databricks.com> wrote:
>
> SGTM
>
> On Fri, Aug 10, 2018 at 1:39 PM shane knapp <skn...@berkeley.edu> wrote:
>>
>> https://issues.apache.org/jira/browse/SPARK-25089
>>
>> basically since these branches are old, and there will be a greater than 
>> zero amount of work to get lint-r to pass (on the new ubuntu workers), sean 
>> and i are proposing to remove the lint-r checks for the builds.
>>
>> this is super not important for the 2.4 cut/code freeze, but i wanted to get 
>> this done before it gets pushed down my queue and before we revisit the 
>> ubuntu port.
>>
>> thanks in advance,
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu



--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu



--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-04 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569109#comment-16569109
 ] 

Felix Cheung commented on SPARK-24924:
--

 

I tend to agree that we shouldn't "magically" remap different implementations 
or changes behavior across versions, esp. since we have never really tested 
them for compatibility and documented in any way as such.

Do we have agreement on what the behavior should be then? Could someone 
summarize?

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to the followings.
>  # Like `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to built-in Avro data source.
>  # Remove incorrect error message, `Please find an Avro package at ...`.
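
For illustration only, a minimal Scala sketch of the kind of provider-name 
remapping point 1 describes; the map contents and the resolve helper are 
assumptions of this sketch, not Spark's actual lookup logic:

{code:scala}
// Hypothetical alias table: legacy external data source names -> built-in providers.
object ProviderAliases {
  private val aliases = Map(
    "com.databricks.spark.csv"  -> "csv",
    "com.databricks.spark.avro" -> "avro"
  )

  // Resolve a user-supplied format name to the provider that should be loaded,
  // e.g. format("com.databricks.spark.avro") is treated as the built-in "avro".
  def resolve(name: String): String = aliases.getOrElse(name, name)
}
{code}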



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23683) FileCommitProtocol.instantiate to require 3-arg constructor for dynamic partition overwrite

2018-07-26 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558009#comment-16558009
 ] 

Felix Cheung commented on SPARK-23683:
--

Should this be backported to branch-2.3? We ran into this problem, and the 
original change seems to be in the 2.3.0 release (SPARK-20236).

> FileCommitProtocol.instantiate to require 3-arg constructor for dynamic 
> partition overwrite
> ---
>
> Key: SPARK-23683
> URL: https://issues.apache.org/jira/browse/SPARK-23683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Fix For: 2.4.0
>
>
> with SPARK-20236 {{FileCommitProtocol.instantiate()}} looks for a three 
> argument constructor, passing in the {{dynamicPartitionOverwrite}} parameter. 
> If there is no such constructor, it falls back to the classic two-arg one.
> When {{InsertIntoHadoopFsRelationCommand}} passes down that 
> {{dynamicPartitionOverwrite}} flag to  {{FileCommitProtocol.instantiate()}}, 
> it _assumes_ that the instantiated protocol supports the specific 
> requirements of dynamic partition overwrite. It does not notice when this 
> does not hold, and so the output generated may be incorrect.
> Proposed: when dynamicPartitionOverwrite == true, require the protocol 
> implementation to have a 3-arg constructor.
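
As a rough Scala sketch of the proposed rule (not the actual Spark code; the 
constructor argument shapes below are assumptions):

{code:scala}
import scala.util.{Failure, Success, Try}

// Prefer the 3-arg constructor; fall back to the 2-arg one only when dynamic
// partition overwrite was NOT requested, otherwise fail fast instead of
// silently producing incorrect output.
def instantiate[T](clazz: Class[T], jobId: String, outputPath: String,
                   dynamicPartitionOverwrite: Boolean): T = {
  Try(clazz.getConstructor(classOf[String], classOf[String], classOf[Boolean])) match {
    case Success(ctor) =>
      ctor.newInstance(jobId, outputPath, Boolean.box(dynamicPartitionOverwrite))
    case Failure(_) =>
      require(!dynamicPartitionOverwrite,
        s"${clazz.getName} needs a 3-arg constructor for dynamic partition overwrite")
      clazz.getConstructor(classOf[String], classOf[String]).newInstance(jobId, outputPath)
  }
}
{code}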



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Felix Cheung
+1 this has been problematic.

Also, this list needs to be updated every time we make a new release?

Plus, can we cache them on Jenkins? Maybe we can avoid downloading the same 
thing from the Apache archive on every test run.



From: Marco Gaido 
Sent: Monday, July 16, 2018 11:12 PM
To: Hyukjin Kwon
Cc: Sean Owen; dev
Subject: Re: Cleaning Spark releases from mirrors, and the flakiness of 
HiveExternalCatalogVersionsSuite

+1 too

On Tue, 17 Jul 2018, 05:38 Hyukjin Kwon, <gurwls...@gmail.com> wrote:
+1

On Tue, Jul 17, 2018 at 7:34 AM, Sean Owen <sro...@apache.org> wrote:
Fix is committed to branches back through 2.2.x, where this test was added.

There is still some issue; I'm seeing that 
archive.apache.org is rate-limiting downloads and 
frequently returning 503 errors.

We can help, I guess, by avoiding testing against non-current releases. Right 
now we should be testing against 2.3.1, 2.2.2, 2.1.3, right? 2.0.x is now 
effectively EOL right?

I can make that quick change too if everyone's amenable, in order to prevent 
more failures in this test from master.

On Sun, Jul 15, 2018 at 3:51 PM Sean Owen <sro...@gmail.com> wrote:
Yesterday I cleaned out old Spark releases from the mirror system -- we're 
supposed to only keep the latest release from active branches out on mirrors. 
(All releases are available from the Apache archive site.)

Having done so I realized quickly that the HiveExternalCatalogVersionsSuite 
relies on the versions it downloads being available from mirrors. It has been 
flaky, as sometimes mirrors are unreliable. I think now it will not work for 
any versions except 2.3.1, 2.2.2, 2.1.3.

Because we do need to clean those releases out of the mirrors soon anyway, and 
because they're flaky sometimes, I propose adding logic to the test to fall 
back on downloading from the Apache archive site.
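
A minimal Scala sketch of that fallback; the URLs, file name, and layout here 
are illustrative assumptions, not the suite's actual code:

{code:scala}
import java.io.File
import java.net.URL
import java.nio.file.{Files, StandardCopyOption}
import scala.util.Try

// Try the preferred mirror first; if that fails (release already cleaned out of
// the mirrors, or the mirror is flaky), fall back to archive.apache.org, which
// keeps every release.
def downloadSpark(version: String, mirrorUrl: String, destDir: File): File = {
  val fileName   = s"spark-$version-bin-hadoop2.7.tgz"
  val candidates = Seq(
    s"$mirrorUrl/spark-$version/$fileName",
    s"https://archive.apache.org/dist/spark/spark-$version/$fileName")
  val target = new File(destDir, fileName)
  val ok = candidates.exists { url =>
    Try {
      val in = new URL(url).openStream()
      try Files.copy(in, target.toPath, StandardCopyOption.REPLACE_EXISTING)
      finally in.close()
    }.isSuccess
  }
  require(ok, s"Could not download Spark $version from the mirror or the archive")
  target
}
{code}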

... and I'll do that right away to unblock HiveExternalCatalogVersionsSuite 
runs. I think it needs to be backported to other branches as they will still be 
testing against potentially non-current Spark releases.

Sean


Re: [VOTE] SPIP: Standardize SQL logical plans

2018-07-18 Thread Felix Cheung
+1



From: Bruce Robbins 
Sent: Wednesday, July 18, 2018 3:02 PM
To: Ryan Blue
Cc: Spark Dev List
Subject: Re: [VOTE] SPIP: Standardize SQL logical plans

+1 (non-binding)

On Tue, Jul 17, 2018 at 10:59 AM, Ryan Blue <b...@apache.org> wrote:
Hi everyone,

From discussion on the proposal doc and the discussion thread, I think we have 
consensus around the plan to standardize logical write operations for 
DataSourceV2. I would like to call a vote on the proposal.

The proposal doc is here: SPIP: Standardize SQL logical 
plans.

This vote is for the plan in that doc. The related SPIP with APIs to 
create/alter/drop tables will be a separate vote.

Please vote in the next 72 hours:

[+1]: Spark should adopt the SPIP
[-1]: Spark should not adopt the SPIP because . . .

Thanks for voting, everyone!

--
Ryan Blue



[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2018-07-12 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541191#comment-16541191
 ] 

Felix Cheung commented on SPARK-20202:
--

How likely is it that there will be a Hive release? HIVE-16391 is still open.

Staying with Hive 1.2 will slowly become a big problem for us within a few 
months...

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24781) Using a reference from Dataset in Filter/Sort might not work.

2018-07-11 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539627#comment-16539627
 ] 

Felix Cheung commented on SPARK-24781:
--

[~jerryshao]

> Using a reference from Dataset in Filter/Sort might not work.
> -
>
> Key: SPARK-24781
> URL: https://issues.apache.org/jira/browse/SPARK-24781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takuya Ueshin
>Priority: Blocker
>
> When we use a reference from {{Dataset}} in {{filter}} or {{sort}}, which was 
> not used in the prior {{select}}, an {{AnalysisException}} occurs, e.g.,
> {code:scala}
> val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
> df.select(df("name")).filter(df("id") === 0).show()
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing 
> from name#5 in operator !Filter (id#6 = 0).;;
> !Filter (id#6 = 0)
>+- AnalysisBarrier
>   +- Project [name#5]
>  +- Project [_1#2 AS name#5, _2#3 AS id#6]
> +- LocalRelation [_1#2, _2#3]
> {noformat}
> If we use {{col}} instead, it works:
> {code:scala}
> val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
> df.select(col("name")).filter(col("id") === 0).show()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-07-11 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539622#comment-16539622
 ] 

Felix Cheung commented on SPARK-14220:
--

this shouldn't block 2.3.2 right?

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23461) vignettes should include model predictions for some ML models

2018-07-11 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-23461:


Assignee: Huaxin Gao

> vignettes should include model predictions for some ML models
> -
>
> Key: SPARK-23461
> URL: https://issues.apache.org/jira/browse/SPARK-23461
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 2.4.0
>
>
> eg. 
> Linear Support Vector Machine (SVM) Classifier
> h4. Logistic Regression
> Tree - GBT, RF, DecisionTree
> (and ALS was disabled)
> By doing something like {{head(select(gmmFitted, "V1", "V2", "prediction"))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23461) vignettes should include model predictions for some ML models

2018-07-11 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23461.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

> vignettes should include model predictions for some ML models
> -
>
> Key: SPARK-23461
> URL: https://issues.apache.org/jira/browse/SPARK-23461
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 2.4.0
>
>
> eg. 
> Linear Support Vector Machine (SVM) Classifier
> h4. Logistic Regression
> Tree - GBT, RF, DecisionTree
> (and ALS was disabled)
> By doing something like {{head(select(gmmFitted, "V1", "V2", "prediction"))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes

2018-07-11 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23529.
--
   Resolution: Fixed
 Assignee: Andrew Korzhuev  (was: Anirudh Ramanathan)
Fix Version/s: 2.4.0

> Specify hostpath volume and mount the volume in Spark driver and executor 
> pods in Kubernetes
> 
>
> Key: SPARK-23529
> URL: https://issues.apache.org/jira/browse/SPARK-23529
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Suman Somasundar
>Assignee: Andrew Korzhuev
>Priority: Minor
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2

2018-07-09 Thread Felix Cheung
I recall this might be a problem when running Spark on Java 9.



From: Shivaram Venkataraman 
Sent: Monday, July 9, 2018 2:17 PM
To: dev; Felix Cheung; Tom Graves
Subject: Fwd: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2

The upcoming 2.2.2 release was submitted to CRAN. I think there are
some known issues on Windows, but does anybody know what the following
error with Netty is?

> WARNING: Illegal reflective access by 
> io.netty.util.internal.PlatformDependent0$1 
> (file:/home/hornik/.cache/spark/spark-2.2.2-bin-hadoop2.7/jars/netty-all-4.0.43.Final.jar)
>  to field java.nio.Buffer.address

Thanks
Shivaram


-- Forwarded message -
From: 
Date: Mon, Jul 9, 2018 at 12:12 PM
Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2
To: 
Cc: 


Dear maintainer,

package SparkR_2.2.2.tar.gz does not pass the incoming checks
automatically, please see the following pre-tests:
Windows: 
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.2.2_20180709_175630/Windows/00check.log>
Status: 1 ERROR, 1 WARNING
Debian: 
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.2.2_20180709_175630/Debian/00check.log>
Status: 1 ERROR, 2 WARNINGs

Last released version's CRAN status: ERROR: 1, OK: 1
See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>

CRAN Web: <https://cran.r-project.org/package=SparkR>

Please fix all problems and resubmit a fixed version via the webform.
If you are not sure how to fix the problems shown, please ask for help
on the R-package-devel mailing list:
<https://stat.ethz.ch/mailman/listinfo/r-package-devel>
If you are fairly certain the rejection is a false positive, please
reply-all to this message and explain.

More details are given in the directory:
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.2.2_20180709_175630/>
The files will be removed after roughly 7 days.

No strong reverse dependencies to be checked.

Best regards,
CRAN teams' auto-check service
Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
Check: CRAN incoming feasibility, Result: WARNING
Maintainer: 'Shivaram Venkataraman '

New submission

Package was archived on CRAN

Insufficient package version (submitted: 2.2.2, existing: 2.3.0)

Possibly mis-spelled words in DESCRIPTION:
Frontend (4:10, 5:28)

CRAN repository db overrides:
X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
corrected despite reminders.

Found the following (possibly) invalid URLs:
URL: http://spark.apache.org/docs/latest/api/R/mean.html
From: inst/doc/sparkr-vignettes.html
Status: 404
Message: Not Found

Flavor: r-devel-windows-ix86+x86_64
Check: running tests for arch 'x64', Result: ERROR
Running 'run-all.R' [175s]
Running the tests in 'tests/run-all.R' failed.
Complete output:
> #
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements. See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License. You may obtain a copy of the License at
> #
> # http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
>
> library(testthat)
> library(SparkR)

Attaching package: 'SparkR'

The following object is masked from 'package:testthat':

describe

The following objects are masked from 'package:stats':

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from 'package:base':

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

>
> # Turn all warnings into errors
> options("warn" = 2)
>
> if (.Platform$OS.type == "windows") {
+ Sys.setenv(TZ = "GMT")
+ }
>
> # Setup global test environment
> # Install Spark first to set SPARK_HOME
>
> # NOTE(shivaram): We set overwrite to handle any old tar.gz
files or directories left behind on
> # CRAN machines. For Jenkins we should already have SPARK_HOME set.
> install.spark(overwrite = TRUE)
Overwrite = TRUE: download and overwrite the tar fileand Spark
package directory if they exist.
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: http://mirror.dkd.de/apache/spark
Downloading spark-2.2.2 for Had

[jira] [Commented] (SPARK-24674) Spark on Kubernetes BLAS performance

2018-07-07 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535983#comment-16535983
 ] 

Felix Cheung commented on SPARK-24674:
--

Kinda - there have been discussions on whether Alpine is appropriate as the base 
system for the images. Would it address your problem if you could get BLAS and 
LAPACK?

> Spark on Kubernetes BLAS performance
> 
>
> Key: SPARK-24674
> URL: https://issues.apache.org/jira/browse/SPARK-24674
> Project: Spark
>  Issue Type: Question
>  Components: Build, Kubernetes, MLlib
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 SNAPSHOT (as of June 25th)
> Kubernetes version 1.7.5
> Kubernetes cluster, consisting of 4 Nodes with 16 GB RAM, 8 core Intel 
> processors.
>Reporter: Dennis Aumiller
>Priority: Minor
>  Labels: performance
>
>  
> Usually native BLAS libraries speed up the execution time of CPU-heavy 
> operations as for example in MLlib quite significantly.
>  Of course, the initial error
> {code:java}
> WARN  BLAS:61 - Failed to load implementation from: 
> com.github.fommil.netlib.NativeSystemBLAS
> {code}
> can be resolved not so easily, since, as reported 
> [here|[https://github.com/apache/spark/pull/19717/files/7d2b30373b2e4d8d5311e10c3f9a62a2d900d568],]
>  this seems to be the issue because of the underlying image used by the Spark 
> Dockerfile.
>  Re-building spark with
> {code:java}
> -Pnetlib-lgpl
> {code}
> also does not solve the problem, but I managed to build BLAS and LAPACK into 
> Alpine, with a lot of tricks involved.
> Interestingly, I noticed that the performance of PCA in my case dropped quite 
> significantly (with BLAS support, compared to the netlib-java fallback). I am 
> aware of [#SPARK-21305] as well, but that did not help my case, either.
>  Furthermore, calling SVD on a matrix of only size 5000x5000 (density 1%) 
> already throws an error when trying to use native ARPACK, but runs perfectly 
> fine with the fallback version.
> The question would be whether there has been some investigation in that 
> direction already.
>  Or, if not, whether it would be interesting for the Spark community to 
> provide a
>  * more detailed report with respect to timings/configurations/test setup
>  * a provided Dockerfile to build Spark with BLAS/LAPACK/ARPACK using the 
> shipped Dockerfile as a basis
>   
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem

2018-07-06 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535345#comment-16535345
 ] 

Felix Cheung commented on SPARK-24152:
--

I'm pretty sure it's a problem on the server side though



> SparkR CRAN feasibility check server problem
> 
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch test fails with the following SparkR error with 
> unknown reason. The following is an error message from that.
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we already start to merge the PR by ignoring this 
> **known unkonwn** SparkR failure.
> - https://github.com/apache/spark/pull/21175



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24535) Fix java version parsing in SparkR on Windows

2018-07-06 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-24535.
--
  Resolution: Fixed
   Fix Version/s: 2.4.0
  2.3.2
Target Version/s: 2.3.2, 2.4.0  (was: 2.3.2)

> Fix java version parsing in SparkR on Windows
> -
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>    Assignee: Felix Cheung
>Priority: Blocker
> Fix For: 2.3.2, 2.4.0
>
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR on Windows

2018-07-06 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534502#comment-16534502
 ] 

Felix Cheung commented on SPARK-24535:
--

[~shivaram] I've merged it, but it could be great if you could review the fix

> Fix java version parsing in SparkR on Windows
> -
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>    Assignee: Felix Cheung
>Priority: Blocker
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24535) Fix java version parsing in SparkR on Windows

2018-07-03 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530963#comment-16530963
 ] 

Felix Cheung edited comment on SPARK-24535 at 7/3/18 7:50 AM:
--

ok, I submitted a modified package to win-builder, and was able to confirm the 
problem with launchScript. updating with the fix, result here 
[https://win-builder.r-project.org/zD6OfPID9JtR/00check.log]

basically, I don't think launchScript(.. wait = T) is working on Windows.


was (Author: felixcheung):
ok, I submitted a modified package to win-builder, and was able to confirm the 
problem with launchScript. updating with the fix, result here 
https://win-builder.r-project.org/zD6OfPID9JtR/00check.log

> Fix java version parsing in SparkR on Windows
> -
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>    Assignee: Felix Cheung
>Priority: Blocker
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24535) Fix java version parsing in SparkR on Windows

2018-07-03 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24535:
-
Summary: Fix java version parsing in SparkR on Windows  (was: Fix java 
version parsing in SparkR)

> Fix java version parsing in SparkR on Windows
> -
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>    Assignee: Felix Cheung
>Priority: Blocker
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR

2018-07-03 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530963#comment-16530963
 ] 

Felix Cheung commented on SPARK-24535:
--

ok, I submitted a modified package to win-builder, and was able to confirm the 
problem with launchScript. updating with the fix, result here 
https://win-builder.r-project.org/zD6OfPID9JtR/00check.log

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>    Assignee: Felix Cheung
>Priority: Blocker
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [ANNOUNCE] Apache Zeppelin 0.8.0 released

2018-06-29 Thread Felix Cheung
Hmm... this is built automatically from the git tag when the release is 
official. We will need to investigate..


From: Krishna Pandey 
Sent: Thursday, June 28, 2018 3:26:46 PM
To: dev@zeppelin.apache.org
Subject: Re: [ANNOUNCE] Apache Zeppelin 0.8.0 released

Awesome. Lots of new features made their way into this release. Kudos to all
contributors.

Thanks,
Krishna Pandey

On Thu, Jun 28, 2018 at 2:10 PM Chaoran Yu  wrote:

> Thanks Jeff for preparing the release.
>
> But the Docker image for 0.8.0 is failing:
> https://hub.docker.com/r/apache/zeppelin/builds/
>
> Can someone upload the new image to apache's docker hub account?
>
> Thank you,
> Chaoran Yu
>
>
> On 2018/06/28 03:06:46, Jeff Zhang  wrote:
> > The Apache Zeppelin community is pleased to announce the availability of
> > the 0.8.0 release.
> >
> > Zeppelin is a collaborative data analytics and visualization tool for
> > distributed, general-purpose data processing system such as Apache Spark,
> > Apache Flink, etc.
> >
> > This is another major release after the last minor release 0.7.3.
> > The community put significant effort into improving Apache Zeppelin since
> > the last release. 122 contributors fixed totally 602 issues. Lots of
> > new features are introduced, such as inline configuration, ipython
> > interpreter, yarn-cluster mode support, interpreter lifecycle manager
> > and etc.
> >
> > We encourage you to download the latest release
> > from http://zeppelin.apache.org/download.html
> >
> > Release note is available
> > at http://zeppelin.apache.org/releases/zeppelin-release-0.8.0.html
> >
> > We welcome your help and feedback. For more information on the project and
> > how to get involved, visit our website at http://zeppelin.apache.org/
> >
> > Thank you all users and contributors who have helped to improve Apache
> > Zeppelin.
> >
> > Regards,
> > The Apache Zeppelin community
> >
>


[jira] [Updated] (SPARK-24535) Fix java version parsing in SparkR

2018-06-28 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24535:
-
Target Version/s: 2.3.2

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>    Assignee: Felix Cheung
>Priority: Blocker
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24535) Fix java version parsing in SparkR

2018-06-28 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24535:
-
Priority: Blocker  (was: Major)

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Priority: Blocker
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24535) Fix java version parsing in SparkR

2018-06-28 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-24535:


Assignee: Felix Cheung

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>    Assignee: Felix Cheung
>Priority: Blocker
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Felix Cheung
If I recall correctly, we stopped releasing Hadoop 2.3 and 2.4 builds in newer 
releases (2.2+?) - that might be why they are not in the release script.



From: Marcelo Vanzin 
Sent: Thursday, June 28, 2018 11:12:45 AM
To: Sean Owen
Cc: Marcelo Vanzin; dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

Alright, uploaded the missing packages.

I'll send a PR to update the release scripts just in case...

On Thu, Jun 28, 2018 at 10:08 AM, Sean Owen  wrote:
> If it's easy enough to produce them, I agree you can just add them to the RC
> dir.
>
> On Thu, Jun 28, 2018 at 11:56 AM Marcelo Vanzin
>  wrote:
>>
>> I just noticed this RC is missing builds for hadoop 2.3 and 2.4, which
>> existed in the previous version:
>> https://dist.apache.org/repos/dist/release/spark/spark-2.1.2/
>>
>> How important do we think are those? I think I can just build them and
>> publish them to the RC directory without having to create a new RC.
>>
>> On Tue, Jun 26, 2018 at 1:25 PM, Marcelo Vanzin 
>> wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 2.1.3.
>> >
>> > The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if
>> > a
>> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.1.3
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>> > https://github.com/apache/spark/tree/v2.1.3-rc2
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1275/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>> >
>> > The list of bug fixes going into 2.1.3 can be found at the following
>> > URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12341660
>> >
>> > Notes:
>> >
>> > - RC1 was not sent for a vote. I had trouble building it, and by the
>> > time I got
>> >   things fixed, there was a blocker bug filed. It was already tagged in
>> > git
>> >   at that time.
>> >
>> > - If testing the source package, I recommend using Java 8, even though
>> > 2.1
>> >   supports Java 7 (and the RC was built with JDK 7). This is because
>> > Maven
>> >   Central has updated some configuration that makes the default Java 7
>> > SSL
>> >   config not work.
>> >
>> > - There are Maven artifacts published for Scala 2.10, but binary
>> > releases are only
>> >   available for Scala 2.11. This matches the previous release (2.1.2),
>> > but if there's
>> >   a need / desire to have pre-built distributions for Scala 2.10, I can
>> > probably
>> >   amend the RC without having to create a new one.
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with a out of date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.1.3?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.1.3 can be found at:
>> > https://s.apache.org/spark-2.1.3
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>> >
>> > --
>> > Marcelo
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>



--
Marcelo


Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Felix Cheung
Exactly...


From: Marcelo Vanzin 
Sent: Thursday, June 28, 2018 9:16:08 AM
To: Tom Graves
Cc: Felix Cheung; dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

Yeah, we should be more careful with that in general. Like we state
that "Spark runs on Java 8+"...

On Thu, Jun 28, 2018 at 9:13 AM, Tom Graves  wrote:
> Right we say we support R3.1+ but we never actually did, so agree its a bug
> but its not a regression since we never really supported them or tested with
> them and its not a logic or security bug that ends in corruptions or bad
> behavior so in my opinion its not a blocker.   Again I'm fine with adding it
> though if others agree.   Maybe we should really change our documentation to
> state more clearly what versions we know it works with and have tested with
> since someone could read R3.1+ as it works with R4 (once released) which
> very well might not be the case.
>
>
> I'm +1 on the release.
>
> Tom
>
> On Thursday, June 28, 2018, 10:28:21 AM CDT, Felix Cheung
>  wrote:
>
>
> Not pushing back, but our support message has always been R 3.1+ so it a bit
> off to say we don’t support newer releases.
>
> https://spark.apache.org/docs/2.1.2/
>
> But looking back, this was found during 2.1.2 RC2 and didn’t fix (in time)
> for 2.1.2?
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555
>
> Since it isn’t a regression I’d say +1 from me.
>
>
> ____
> From: Tom Graves 
> Sent: Thursday, June 28, 2018 6:56:16 AM
> To: Marcelo Vanzin; Felix Cheung
> Cc: dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> If this is just supporting newer versions of R that 2.1 never supported then
> I would say its not a blocker. But if you feel its useful enough then I
> would say its up to Marcelo if he wants to pull in and spin another rc.
>
> Tom
>
> On Wednesday, June 27, 2018, 8:57:25 PM CDT, Felix Cheung
>  wrote:
>
>
> Yes, this is broken with newer version of R.
>
> We check explicitly for warning for the R check which should fail the test
> run.
>
> 
> From: Marcelo Vanzin 
> Sent: Wednesday, June 27, 2018 6:55 PM
> To: Felix Cheung
> Cc: Marcelo Vanzin; Tom Graves; dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> Not sure I understand that bug. Is it a compatibility issue with new
> versions of R?
>
> It's at least marked as fixed in 2.2(.1).
>
> We do run jenkins on these branches, but that seems like just a
> warning, which would not fail those builds...
>
> On Wed, Jun 27, 2018 at 6:12 PM, Felix Cheung 
> wrote:
>> (I don’t want to block the release(s) per se...)
>>
>> We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)
>>
>> This is fixed in 2.3 back in Nov 2017
>>
>> https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6
>>
>> Perhaps we don't get Jenkins run on these branches? It should have been
>> detected.
>>
>> * checking for code/documentation mismatches ... WARNING
>> Codoc mismatches from documentation object 'attach':
>> attach
>> Code: function(what, pos = 2L, name = deparse(substitute(what),
>> backtick = FALSE), warn.conflicts = TRUE)
>> Docs: function(what, pos = 2L, name = deparse(substitute(what)),
>> warn.conflicts = TRUE)
>> Mismatches in argument default values:
>> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs:
>> deparse(substitute(what))
>>
>> Codoc mismatches from documentation object 'glm':
>> glm
>> Code: function(formula, family = gaussian, data, weights, subset,
>> na.action, start = NULL, etastart, mustart, offset,
>> control = list(...), model = TRUE, method = "glm.fit",
>> x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
>> NULL, ...)
>> Docs: function(formula, family = gaussian, data, weights, subset,
>> na.action, start = NULL, etastart, mustart, offset,
>> control = list(...), model = TRUE, method = "glm.fit",
>> x = FALSE, y = TRUE, contrasts = NULL, ...)
>> Argument names in code not in docs:
>> singular.ok
>> Mismatches in argument names:
>> Position: 16 Code: singular.ok Docs: contrasts
>> Position: 17 Code: contrasts Docs: ...
>>
>> 
>> From: Sean Owen 
>> Sent: Wednesday, June 27, 2018 5:02:37 AM
>> To: Marcelo Vanzin
>> Cc: dev
>> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>>
>> +1 from me too for the usual reasons.
>>
>> On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin
>> 
>> wro

Re: Time for 2.3.2?

2018-06-28 Thread Felix Cheung
Yap will do


From: Marcelo Vanzin 
Sent: Thursday, June 28, 2018 9:04:41 AM
To: Felix Cheung
Cc: Spark dev list
Subject: Re: Time for 2.3.2?

Could you mark that bug as blocker and set the target version, in that case?

On Thu, Jun 28, 2018 at 8:46 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
+1

I’d like to fix SPARK-24535 first though


From: Stavros Kontopoulos <stavros.kontopou...@lightbend.com>
Sent: Thursday, June 28, 2018 3:50:34 AM
To: Marco Gaido
Cc: Takeshi Yamamuro; Xingbo Jiang; Wenchen Fan; Spark dev list; Saisai Shao; 
van...@cloudera.com.invalid
Subject: Re: Time for 2.3.2?

+1 makes sense.

On Thu, Jun 28, 2018 at 12:07 PM, Marco Gaido <marcogaid...@gmail.com> wrote:
+1 too, I'd consider also to include SPARK-24208 if we can solve it timely...

2018-06-28 8:28 GMT+02:00 Takeshi Yamamuro <linguin@gmail.com>:
+1, I heard some Spark users have skipped v2.3.1 because of these bugs.

On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:
+1

On Thu, Jun 28, 2018 at 2:06 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
Hi Saisai, that's great! please go ahead!

On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:
+1, as mentioned by Marcelo, these issues seem quite severe.

I can work on the release if short of hands :).

Thanks
Jerry


On Thu, Jun 28, 2018 at 11:40 AM, Marcelo Vanzin wrote:
+1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
for those out.

(Those are what delayed 2.2.2 and 2.1.3 for those watching...)

On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
> Hi all,
>
> Spark 2.3.1 was released just a while ago, but unfortunately we discovered
> and fixed some critical issues afterward.
>
> SPARK-24495: SortMergeJoin may produce wrong result.
> This is a serious correctness bug, and is easy to hit: have duplicated join
> key from the left table, e.g. `WHERE t1.a = t2.b AND t1.a = t2.c`, and the
> join is a sort merge join. This bug is only present in Spark 2.3.
>
> SPARK-24588: stream-stream join may produce wrong result
> This is a correctness bug in a new feature of Spark 2.3: the stream-stream
> join. Users can hit this bug if one of the join side is partitioned by a
> subset of the join keys.
>
> SPARK-24552: Task attempt numbers are reused when stages are retried
> This is a long-standing bug in the output committer that may introduce data
> corruption.
>
> SPARK-24542: UDFXPath allow users to pass carefully crafted XML to
> access arbitrary files
> This is a potential security issue if users build access control module upon
> Spark.
>
> I think we need a Spark 2.3.2 to address these issues(especially the
> correctness bugs) ASAP. Any thoughts?
>
> Thanks,
> Wenchen



--
Marcelo

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org<mailto:dev-unsubscr...@spark.apache.org>



--
---
Takeshi Yamamuro




--
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p:  +30 6977967274

e: stavros.kontopou...@lightbend.com




--
Marcelo


Re: [ANNOUNCE] Apache Zeppelin 0.8.0 released

2018-06-28 Thread Felix Cheung
Congrats and thanks for putting together the release


From: Miquel Angel Andreu Febrer 
Sent: Wednesday, June 27, 2018 11:02:20 PM
To: dev@zeppelin.apache.org
Cc: us...@zeppelin.apache.org
Subject: Re: [ANNOUNCE] Apache Zeppelin 0.8.0 released

Great news,

It has been hard work to announce this release

Thank you very much Jeff for your work and your patience





On Thu, Jun 28, 2018 at 6:05, Sanjay Dasgupta wrote:

> This is really a great milestone.
>
> Thanks to those behind the grand effort.
>
> On Thu, Jun 28, 2018 at 8:51 AM, Prabhjyot Singh  >
> wrote:
>
> > Awesome! congratulations team.
> >
> >
> >
> > On Thu 28 Jun, 2018, 8:39 AM Taejun Kim,  wrote:
> >
> >> Awesome! Thanks for your great work :)
> >>
> >> On Thu, Jun 28, 2018 at 12:07 PM, Jeff Zhang wrote:
> >>
> >>> The Apache Zeppelin community is pleased to announce the availability
> of
> >>> the 0.8.0 release.
> >>>
> >>> Zeppelin is a collaborative data analytics and visualization tool for
> >>> distributed, general-purpose data processing system such as Apache
> Spark,
> >>> Apache Flink, etc.
> >>>
> >>> This is another major release after the last minor release 0.7.3.
> >>> The community put significant effort into improving Apache Zeppelin
> since
> >>> the last release. 122 contributors fixed totally 602 issues. Lots of
> >>> new features are introduced, such as inline configuration, ipython
> >>> interpreter, yarn-cluster mode support , interpreter lifecycle manager
> >>> and etc.
> >>>
> >>> We encourage you to download the latest release
> >>> fromhttp://zeppelin.apache.org/download.html
> >>>
> >>> Release note is available
> >>> athttp://zeppelin.apache.org/releases/zeppelin-release-0.8.0.html
> >>>
> >>> We welcome your help and feedback. For more information on the project
> >>> and
> >>> how to get involved, visit our website at http://zeppelin.apache.org/
> >>>
> >>> Thank you all users and contributors who have helped to improve Apache
> >>> Zeppelin.
> >>>
> >>> Regards,
> >>> The Apache Zeppelin community
> >>>
> >> --
> >> Taejun Kim
> >>
> >> Data Mining Lab.
> >> School of Electrical and Computer Engineering
> >> University of Seoul
> >>
> >
>


Re: Time for 2.3.2?

2018-06-28 Thread Felix Cheung
+1

I’d like to fix SPARK-24535 first though


From: Stavros Kontopoulos 
Sent: Thursday, June 28, 2018 3:50:34 AM
To: Marco Gaido
Cc: Takeshi Yamamuro; Xingbo Jiang; Wenchen Fan; Spark dev list; Saisai Shao; 
van...@cloudera.com.invalid
Subject: Re: Time for 2.3.2?

+1 makes sense.

On Thu, Jun 28, 2018 at 12:07 PM, Marco Gaido 
mailto:marcogaid...@gmail.com>> wrote:
+1 too, I'd also consider including SPARK-24208 if we can solve it in time...

2018-06-28 8:28 GMT+02:00 Takeshi Yamamuro 
mailto:linguin@gmail.com>>:
+1, I heard some Spark users have skipped v2.3.1 because of these bugs.

On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang 
mailto:jiangxb1...@gmail.com>> wrote:
+1

Wenchen Fan mailto:cloud0...@gmail.com>> wrote on Thu, Jun 28, 2018 at 2:06 PM:
Hi Saisai, that's great! please go ahead!

On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
mailto:sai.sai.s...@gmail.com>> wrote:
+1, as mentioned by Marcelo, these issues seem quite severe.

I can work on the release if short of hands :).

Thanks
Jerry


Marcelo Vanzin  wrote on Thu, Jun 28, 2018 at 11:40 AM:
+1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
for those out.

(Those are what delayed 2.2.2 and 2.1.3 for those watching...)

On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
mailto:cloud0...@gmail.com>> wrote:
> Hi all,
>
> Spark 2.3.1 was released just a while ago, but unfortunately we discovered
> and fixed some critical issues afterward.
>
> SPARK-24495: SortMergeJoin may produce wrong result.
> This is a serious correctness bug, and is easy to hit: have duplicated join
> key from the left table, e.g. `WHERE t1.a = t2.b AND t1.a = t2.c`, and the
> join is a sort merge join. This bug is only present in Spark 2.3.
>
> SPARK-24588: stream-stream join may produce wrong result
> This is a correctness bug in a new feature of Spark 2.3: the stream-stream
> join. Users can hit this bug if one of the join side is partitioned by a
> subset of the join keys.
>
> SPARK-24552: Task attempt numbers are reused when stages are retried
> This is a long-standing bug in the output committer that may introduce data
> corruption.
>
> SPARK-24542: UDFXPath allow users to pass carefully crafted XML to
> access arbitrary files
> This is a potential security issue if users build access control module upon
> Spark.
>
> I think we need a Spark 2.3.2 to address these issues(especially the
> correctness bugs) ASAP. Any thoughts?
>
> Thanks,
> Wenchen



--
Marcelo

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org



--
---
Takeshi Yamamuro




--
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p:  +30 6977967274

e: stavros.kontopou...@lightbend.com



Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Felix Cheung
Not pushing back, but our support message has always been R 3.1+, so it is a bit off 
to say we don’t support newer releases.

https://spark.apache.org/docs/2.1.2/

But looking back, this was found during 2.1.2 RC2 and wasn’t fixed (in time) for 
2.1.2?

http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555

Since it isn’t a regression I’d say +1 from me.



From: Tom Graves 
Sent: Thursday, June 28, 2018 6:56:16 AM
To: Marcelo Vanzin; Felix Cheung
Cc: dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

If this is just supporting newer versions of R that 2.1 never supported then I 
would say its not a blocker. But if you feel its useful enough then I would say 
its up to Marcelo if he wants to pull in and spin another rc.

Tom

On Wednesday, June 27, 2018, 8:57:25 PM CDT, Felix Cheung 
 wrote:


Yes, this is broken with newer versions of R.

We check explicitly for warnings in the R check, which should fail the test run.


From: Marcelo Vanzin 
Sent: Wednesday, June 27, 2018 6:55 PM
To: Felix Cheung
Cc: Marcelo Vanzin; Tom Graves; dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

Not sure I understand that bug. Is it a compatibility issue with new
versions of R?

It's at least marked as fixed in 2.2(.1).

We do run jenkins on these branches, but that seems like just a
warning, which would not fail those builds...

On Wed, Jun 27, 2018 at 6:12 PM, Felix Cheung  wrote:
> (I don’t want to block the release(s) per se...)
>
> We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)
>
> This is fixed in 2.3 back in Nov 2017
> https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6
>
> Perhaps we don't get Jenkins run on these branches? It should have been
> detected.
>
> * checking for code/documentation mismatches ... WARNING
> Codoc mismatches from documentation object 'attach':
> attach
> Code: function(what, pos = 2L, name = deparse(substitute(what),
> backtick = FALSE), warn.conflicts = TRUE)
> Docs: function(what, pos = 2L, name = deparse(substitute(what)),
> warn.conflicts = TRUE)
> Mismatches in argument default values:
> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs:
> deparse(substitute(what))
>
> Codoc mismatches from documentation object 'glm':
> glm
> Code: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
> NULL, ...)
> Docs: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, contrasts = NULL, ...)
> Argument names in code not in docs:
> singular.ok
> Mismatches in argument names:
> Position: 16 Code: singular.ok Docs: contrasts
> Position: 17 Code: contrasts Docs: ...
>
> 
> From: Sean Owen 
> Sent: Wednesday, June 27, 2018 5:02:37 AM
> To: Marcelo Vanzin
> Cc: dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> +1 from me too for the usual reasons.
>
> On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin 
> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.3.
>>
>> The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.3
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>> https://github.com/apache/spark/tree/v2.1.3-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1275/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>>
>> The list of bug fixes going into 2.1.3 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12341660
>>
>> Notes:
>>
>> - RC1 was not sent for a vote. I had trouble building it, and by the time
>> I got
>> things fixed,

Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-27 Thread Felix Cheung
Yes, this is broken with newer versions of R.

We check explicitly for warnings in the R check, which should fail the test run.


From: Marcelo Vanzin 
Sent: Wednesday, June 27, 2018 6:55 PM
To: Felix Cheung
Cc: Marcelo Vanzin; Tom Graves; dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

Not sure I understand that bug. Is it a compatibility issue with new
versions of R?

It's at least marked as fixed in 2.2(.1).

We do run jenkins on these branches, but that seems like just a
warning, which would not fail those builds...

On Wed, Jun 27, 2018 at 6:12 PM, Felix Cheung  wrote:
> (I don’t want to block the release(s) per se...)
>
> We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)
>
> This is fixed in 2.3 back in Nov 2017
> https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6
>
> Perhaps we don't get Jenkins run on these branches? It should have been
> detected.
>
> * checking for code/documentation mismatches ... WARNING
> Codoc mismatches from documentation object 'attach':
> attach
> Code: function(what, pos = 2L, name = deparse(substitute(what),
> backtick = FALSE), warn.conflicts = TRUE)
> Docs: function(what, pos = 2L, name = deparse(substitute(what)),
> warn.conflicts = TRUE)
> Mismatches in argument default values:
> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs:
> deparse(substitute(what))
>
> Codoc mismatches from documentation object 'glm':
> glm
> Code: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
> NULL, ...)
> Docs: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, contrasts = NULL, ...)
> Argument names in code not in docs:
> singular.ok
> Mismatches in argument names:
> Position: 16 Code: singular.ok Docs: contrasts
> Position: 17 Code: contrasts Docs: ...
>
> 
> From: Sean Owen 
> Sent: Wednesday, June 27, 2018 5:02:37 AM
> To: Marcelo Vanzin
> Cc: dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> +1 from me too for the usual reasons.
>
> On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin 
> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.3.
>>
>> The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.3
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>> https://github.com/apache/spark/tree/v2.1.3-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1275/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>>
>> The list of bug fixes going into 2.1.3 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12341660
>>
>> Notes:
>>
>> - RC1 was not sent for a vote. I had trouble building it, and by the time
>> I got
>> things fixed, there was a blocker bug filed. It was already tagged in
>> git
>> at that time.
>>
>> - If testing the source package, I recommend using Java 8, even though 2.1
>> supports Java 7 (and the RC was built with JDK 7). This is because Maven
>> Central has updated some configuration that makes the default Java 7 SSL
>> config not work.
>>
>> - There are Maven artifacts published for Scala 2.10, but binary
>> releases are only
>> available for Scala 2.11. This matches the previous release (2.1.2),
>> but if there's
>> a need / desire to have pre-built distributions for Scala 2.10, I can
>> probably
>> amend the RC without having to create a new one.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> ===

Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-27 Thread Felix Cheung
(I don’t want to block the release(s) per se...)

We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)

This is fixed in 2.3 back in Nov 2017 
https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6

Perhaps we don't get Jenkins run on these branches? It should have been 
detected.

* checking for code/documentation mismatches ... WARNING
Codoc mismatches from documentation object 'attach':
attach
Code: function(what, pos = 2L, name = deparse(substitute(what),
backtick = FALSE), warn.conflicts = TRUE)
Docs: function(what, pos = 2L, name = deparse(substitute(what)),
warn.conflicts = TRUE)
Mismatches in argument default values:
Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: 
deparse(substitute(what))

Codoc mismatches from documentation object 'glm':
glm
Code: function(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
NULL, ...)
Docs: function(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, contrasts = NULL, ...)
Argument names in code not in docs:
singular.ok
Mismatches in argument names:
Position: 16 Code: singular.ok Docs: contrasts
Position: 17 Code: contrasts Docs: ...
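
(For anyone who wants to reproduce this check locally, here is a minimal sketch using 
the tools package; "path/to/spark/R/pkg" is a placeholder for the SparkR source 
directory:)

{code}
# Run the same code/documentation consistency check that R CMD check performs.
library(tools)
mismatches <- codoc(dir = "path/to/spark/R/pkg")
print(mismatches)  # lists functions whose code signature and Rd \usage disagree, as above
{code}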


From: Sean Owen 
Sent: Wednesday, June 27, 2018 5:02:37 AM
To: Marcelo Vanzin
Cc: dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

+1 from me too for the usual reasons.

On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin  
wrote:
Please vote on releasing the following candidate as Apache Spark version 2.1.3.

The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.1.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
https://github.com/apache/spark/tree/v2.1.3-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1275/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/

The list of bug fixes going into 2.1.3 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12341660

Notes:

- RC1 was not sent for a vote. I had trouble building it, and by the time I got
  things fixed, there was a blocker bug filed. It was already tagged in git
  at that time.

- If testing the source package, I recommend using Java 8, even though 2.1
  supports Java 7 (and the RC was built with JDK 7). This is because Maven
  Central has updated some configuration that makes the default Java 7 SSL
  config not work.

- There are Maven artifacts published for Scala 2.10, but binary
releases are only
  available for Scala 2.11. This matches the previous release (2.1.2),
but if there's
  a need / desire to have pre-built distributions for Scala 2.10, I can probably
  amend the RC without having to create a new one.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with a out of date RC going forward).

===
What should happen to JIRA tickets still targeting 2.1.3?
===

The current list of open tickets targeted at 2.1.3 can be found at:
https://s.apache.org/spark-2.1.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a 

Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread Felix Cheung
I'm not sure we have completed support for Java 10


From: Rahul Agrawal 
Sent: Thursday, June 21, 2018 7:22:42 AM
To: user@spark.apache.org
Subject: Spark 2.3.1 not working on Java 10

Dear Team,


I have installed Java 10, Scala 2.12.6 and Spark 2.3.1 on my desktop running 
Ubuntu 16.04. I am getting an error when opening spark-shell.

Failed to initialize compiler: object java.lang.Object in compiler mirror not 
found.

Please let me know if there is any way to run spark in Java 10.

Thanks,
Rahul


[jira] [Created] (SPARK-24572) "eager execution" for R shell, IDE

2018-06-16 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-24572:


 Summary: "eager execution" for R shell, IDE
 Key: SPARK-24572
 URL: https://issues.apache.org/jira/browse/SPARK-24572
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Felix Cheung


like python in SPARK-24215

we could also have eager execution when a SparkDataFrame is returned to the R shell



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
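
A rough illustration of what eager execution could look like on the R side (purely a 
sketch, not actual SparkR code; the config name is borrowed from the Python change in 
SPARK-24215 and its use here is an assumption):

{code}
library(SparkR)

# Sketch: override the S4 show() method so that printing a SparkDataFrame at the
# R prompt eagerly shows the first rows, similar to what PySpark does in SPARK-24215.
setMethod("show", signature("SparkDataFrame"), function(object) {
  eager <- unlist(sparkR.conf("spark.sql.repl.eagerEval.enabled", "false"))
  if (isTRUE(as.logical(eager))) {
    showDF(object, numRows = 20, truncate = TRUE)  # eager: print the first 20 rows
  } else {
    # lazy fallback: print only the column names and types
    cols <- sapply(dtypes(object), paste, collapse = ":")
    cat("SparkDataFrame[", paste(cols, collapse = ", "), "]\n")
  }
})
{code}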



[jira] [Commented] (SPARK-23435) R tests should support latest testthat

2018-06-15 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513416#comment-16513416
 ] 

Felix Cheung commented on SPARK-23435:
--

sorry, I did try but couldn't get it to work, but something like this?

 
{code:java}
# for testthat after 1.0.2, call test_dir directly since run_tests was removed
if (packageVersion("testthat") >= "2.0.0") {
  # run the tests in an environment that inherits from the SparkR namespace
  test_pkg_env <- list2env(as.list(getNamespace("SparkR"), all.names = TRUE),
                           parent = parent.env(getNamespace("SparkR")))
  withr::local_options(list(topLevelEnvironment = test_pkg_env))
  test_dir(file.path(sparkRDir, "pkg", "tests", "fulltests"),
           env = test_pkg_env,
           stop_on_failure = TRUE,
           stop_on_warning = FALSE)
} else {
  # older testthat (1.x) still ships the internal run_tests helper
  testthat:::run_tests("SparkR",
                       file.path(sparkRDir, "pkg", "tests", "fulltests"),
                       NULL,
                       "summary")
}

{code}

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
>     Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was 
> released in Dec 2017, and its method has been changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1 though, we need to check if it is going to work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR

2018-06-14 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513372#comment-16513372
 ] 

Felix Cheung commented on SPARK-24535:
--

is this only failing on windows?

I wonder if this is related to how stdout is redirected in launchScript

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Priority: Major
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
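
A more defensive way to pull the version out of the "java -version" output might look 
like the minimal sketch below (illustrative only, not the actual SparkR fix; the helper 
name is made up). The idea is to first select the line that actually carries the 
version string, so extra lines such as "Picked up _JAVA_OPTIONS: ..." cannot shift the 
indices:

{code}
# Sketch: defensive parsing of the Java version string.
parse_java_version <- function() {
  out <- tryCatch(
    system2("java", "-version", stdout = TRUE, stderr = TRUE),  # -version prints to stderr
    error = function(e) character(0)
  )
  versionLine <- grep("version", out, value = TRUE)
  if (length(versionLine) == 0) {
    return(NA_character_)  # nothing parsable; let the caller decide how to proceed
  }
  # the version is the first quoted token, e.g. java version "1.8.0_144"
  parts <- strsplit(versionLine[[1]], "\"")[[1]]
  if (length(parts) < 2) NA_character_ else parts[[2]]
}

parse_java_version()  # e.g. "1.8.0_144" on a Java 8 machine, NA_character_ otherwise
{code}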



[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-14 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513363#comment-16513363
 ] 

Felix Cheung commented on SPARK-24359:
--

[~shivaram] sure - do you mean 2.3.1.1 though? 2.4.0 release is not out yet

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between consciousness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
> When calls need to be chained, like above example, syntax can nicely 
> translate to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_r

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1

2018-06-12 Thread Felix Cheung
For #1, are system requirements not honored?

For #2, it looks like Oracle JDK?


From: Shivaram Venkataraman 
Sent: Tuesday, June 12, 2018 3:17:52 PM
To: dev
Cc: Felix Cheung
Subject: Fwd: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1

Corresponding to the Spark 2.3.1 release, I submitted the SparkR build
to CRAN yesterday. Unfortunately it looks like there are a couple of
issues (full message from CRAN is forwarded below)

1. There are some builds started with Java 10
(http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Debian/00check.log)
which are right now counted as test failures. I wonder if we should
somehow mark them as skipped ? I can ping the CRAN team about this.

2. There is another issue with Java version parsing which
unfortunately affects even Java 8 builds. I've created
https://issues.apache.org/jira/browse/SPARK-24535 to track this.

Thanks
Shivaram


-- Forwarded message -
From: 
Date: Mon, Jun 11, 2018 at 11:24 AM
Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1
To: 
Cc: 


Dear maintainer,

package SparkR_2.3.1.tar.gz does not pass the incoming checks
automatically, please see the following pre-tests:
Windows: 
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.3.1_20180611_200923/Windows/00check.log>
Status: 2 ERRORs, 1 NOTE
Debian: 
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.3.1_20180611_200923/Debian/00check.log>
Status: 1 ERROR, 1 WARNING, 1 NOTE

Last released version's CRAN status: ERROR: 1, OK: 1
See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>

CRAN Web: <https://cran.r-project.org/package=SparkR>

Please fix all problems and resubmit a fixed version via the webform.
If you are not sure how to fix the problems shown, please ask for help
on the R-package-devel mailing list:
<https://stat.ethz.ch/mailman/listinfo/r-package-devel>
If you are fairly certain the rejection is a false positive, please
reply-all to this message and explain.

More details are given in the directory:
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.3.1_20180611_200923/>
The files will be removed after roughly 7 days.

No strong reverse dependencies to be checked.

Best regards,
CRAN teams' auto-check service
Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
Check: CRAN incoming feasibility, Result: NOTE
  Maintainer: 'Shivaram Venkataraman '

  New submission

  Package was archived on CRAN

  Possibly mis-spelled words in DESCRIPTION:
Frontend (4:10, 5:28)

  CRAN repository db overrides:
X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
  corrected despite reminders.

Flavor: r-devel-windows-ix86+x86_64
Check: running tests for arch 'i386', Result: ERROR
Running 'run-all.R' [30s]
  Running the tests in 'tests/run-all.R' failed.
  Complete output:
> #
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements.  See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License.  You may obtain a copy of the License at
> #
> #http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
>
> library(testthat)
> library(SparkR)

Attaching package: 'SparkR'

The following objects are masked from 'package:testthat':

describe, not

The following objects are masked from 'package:stats':

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from 'package:base':

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

>
> # Turn all warnings into errors
> options("warn" = 2)
>
> if (.Platform$OS.type == "windows") {
+   Sys.setenv(TZ = "GMT")
+ }
>
> # Setup global test environment
> # Install Spark first to set SPARK_HOME
>
> # NOTE(shivaram): We set overwrite to handle any old tar.gz
files or directories left behind on
> # CRAN machines. For Jenkins we should already have SPARK_HOME set.
> install.spark(overwrite = TRUE)
Overwrite = TRUE: download and overwrite the tar file and Spark
package directo
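
One way to address #1 above would be for tests/run-all.R to bail out cleanly when a 
supported JVM is not found, so that CRAN machines running Java 10 record a skip rather 
than a failure. A rough sketch (the version check, messages, and test path are 
assumptions, not what was eventually committed):

{code}
# Sketch for tests/run-all.R: skip the whole suite instead of failing when a
# supported Java (assumed to be 8 here) cannot be found on the machine.
supported_java <- tryCatch({
  out <- system2("java", "-version", stdout = TRUE, stderr = TRUE)
  any(grepl("\"(1\\.8\\.|8\\.)", out))  # matches java version "1.8.0_x" or "8.0.x"
}, error = function(e) FALSE)

if (!supported_java) {
  message("Skipping SparkR tests: no supported Java runtime found")
} else {
  library(testthat)
  library(SparkR)
  test_dir("fulltests")  # placeholder path for the SparkR test directory
}
{code}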

[jira] [Commented] (SPARK-13184) Support minPartitions parameter for JSON and CSV datasources as options

2018-06-10 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507476#comment-16507476
 ] 

Felix Cheung commented on SPARK-13184:
--

what's the next step?

> Support minPartitions parameter for JSON and CSV datasources as options
> ---
>
> Key: SPARK-13184
> URL: https://issues.apache.org/jira/browse/SPARK-13184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> After looking through the pull requests below at Spark CSV datasources,
> https://github.com/databricks/spark-csv/pull/256
> https://github.com/databricks/spark-csv/issues/141
> https://github.com/databricks/spark-csv/pull/186
> It looks Spark might need to be able to set {{minPartitions}}.
> {{repartition()}} or {{coalesce()}} can be alternatives but it looks it needs 
> to shuffle the data for most cases.
> Although I am still not sure if it needs this, I will open this ticket just 
> for discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
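
For reference, the workaround alluded to above looks roughly like this from SparkR (a 
sketch; the file path and option values are placeholders): repartitioning after the 
read gives the desired number of partitions, at the cost of a shuffle.

{code}
library(SparkR)
sparkR.session()

# The CSV/JSON readers expose no minPartitions option, so partitioning is adjusted
# after the read. "data/people.csv" is a placeholder path.
df <- read.df("data/people.csv", source = "csv", header = "true", inferSchema = "true")

# repartition() yields exactly the requested number of partitions, but shuffles the data
df64 <- repartition(df, numPartitions = 64L)

# coalesce() can only reduce the number of partitions, but avoids a full shuffle
df8 <- coalesce(df64, 8L)
{code}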



[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-08 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506521#comment-16506521
 ] 

Felix Cheung edited comment on SPARK-24359 at 6/8/18 8:16 PM:
--

Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 
such that it contains only branch-2.4 official release commits + any 
alterations for SparkML and then tag it as v2.4.0.1.

I think many of us would agree to separate repo only for convenience - so if 
one would sign up to handle the branching and commit porting etc and we get 
community to vote on such a "SparkML only" release, then it is ok. Though 
thinking about it we would still have officially a Spark 2.4.0.1 release (with 
no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way 
the release/tag process work.

 


was (Author: felixcheung):
Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 
such that it contains only branch-2.3 official release commits + any 
alterations for SparkML and then tag it as v2.4.0.1.

I think many of us would agree to separate repo only for convenience - so if 
one would sign up to handle the branching and commit porting etc and we get 
community to vote on such a "SparkML only" release, then it is ok. Though 
thinking about it we would still have officially a Spark 2.4.0.1 release (with 
no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way 
the release/tag process work.

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conscious

[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-08 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506521#comment-16506521
 ] 

Felix Cheung edited comment on SPARK-24359 at 6/8/18 8:16 PM:
--

Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 
such that it contains only branch-2.4 official release commits + any 
alterations for SparkML and then tag it as v2.4.0.1.

I think many of us would agree to separate repo only for convenience - so if 
one would sign up to handle the branching and commit porting etc and we get 
community to vote on such a "SparkML only" release, then it is ok. Though 
thinking about it we would still have officially a Spark 2.4.0.1 release (with 
no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way 
the release/tag process work. And likely this 2.4.0.1 would be visible in 
release share, maven etc. too

 


was (Author: felixcheung):
Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 
such that it contains only branch-2.4 official release commits + any 
alterations for SparkML and then tag it as v2.4.0.1.

I think many of us would agree to separate repo only for convenience - so if 
one would sign up to handle the branching and commit porting etc and we get 
community to vote on such a "SparkML only" release, then it is ok. Though 
thinking about it we would still have officially a Spark 2.4.0.1 release (with 
no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way 
the release/tag process work.

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-08 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506521#comment-16506521
 ] 

Felix Cheung commented on SPARK-24359:
--

Thanks Joseph - correct one possibility is to branch 2.4.0 into branch-2.4.0.1 
such that it contains only branch-2.3 official release commits + any 
alterations for SparkML and then tag it as v2.4.0.1.

I think many of us would agree to separate repo only for convenience - so if 
one would sign up to handle the branching and commit porting etc and we get 
community to vote on such a "SparkML only" release, then it is ok. Though 
thinking about it we would still have officially a Spark 2.4.0.1 release (with 
no change from 2.4.0 hopefully) in addition to SparkML 2.4.0.1 due to the way 
the release/tag process work.

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between consciousness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
>

Re: Scala 2.12 support

2018-06-07 Thread Felix Cheung
+1

Spoke to Dean as well and mentioned the problem with 2.11.12 
https://github.com/scala/bug/issues/10913

_
From: Sean Owen 
Sent: Wednesday, June 6, 2018 12:23 PM
Subject: Re: Scala 2.12 support
To: Holden Karau 
Cc: Dean Wampler , Reynold Xin , 
dev 


If it means no change to 2.11 support, seems OK to me for Spark 2.4.0. The 2.12 
support is separate and has never been mutually compatible with 2.11 builds 
anyway. (I also hope, suspect that the changes are minimal; tests are already 
almost entirely passing with no change to the closure cleaner when built for 
2.12)

On Wed, Jun 6, 2018 at 1:33 PM Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
Just chatted with Dean @ the summit and it sounds like from Adriaan there is a 
fix in 2.13 for the API change issue that could be back ported to 2.12 so how 
about we try and get this ball rolling?

It sounds like it would also need a closure cleaner change, which could be 
backwards compatible but since it’s such a core component and we might want to 
be cautious with it, we could when building for 2.11 use the old cleaner code 
and for 2.12 use the new code so we don’t break anyone.

How do folks feel about this?





Re: Time for 2.2.2 release

2018-06-07 Thread Felix Cheung
+1 and thanks!


From: Tom Graves 
Sent: Wednesday, June 6, 2018 7:54:43 AM
To: Dev
Subject: Time for 2.2.2 release

Hello all,

I think its time for another 2.2 release.
I took a look at Jira and I don't see anything explicitly targeted for 2.2.2 
that is not yet complete.

So I'd like to propose to release 2.2.2 soon. If there are important
fixes that should go into the release, please let those be known (by
replying here or updating the bug in Jira), otherwise I'm volunteering
to prepare the first RC soon-ish (by early next week since Spark Summit is this 
week).

Thanks!
Tom Graves



[jira] [Resolved] (SPARK-24403) reuse r worker

2018-06-04 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-24403.
--
Resolution: Duplicate

> reuse r worker
> --
>
> Key: SPARK-24403
> URL: https://issues.apache.org/jira/browse/SPARK-24403
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Deepansh
>Priority: Major
>  Labels: sparkR
>
> Currently, SparkR doesn't support reuse of its workers, so broadcast and 
> closure are transferred to workers each time. Can we add the idea of python 
> worker reuse to SparkR also, to enhance its performance?
> performance issues reference 
> [https://issues.apache.org/jira/browse/SPARK-23650]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-24403) reuse r worker

2018-06-04 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung closed SPARK-24403.


> reuse r worker
> --
>
> Key: SPARK-24403
> URL: https://issues.apache.org/jira/browse/SPARK-24403
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Deepansh
>Priority: Major
>  Labels: sparkR
>
> Currently, SparkR doesn't support reuse of its workers, so broadcast and 
> closure are transferred to workers each time. Can we add the idea of python 
> worker reuse to SparkR also, to enhance its performance?
> performance issues reference 
> [https://issues.apache.org/jira/browse/SPARK-23650]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-04 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500634#comment-16500634
 ] 

Felix Cheung commented on SPARK-24359:
--

+1 on the `spark-website` model for faster iterations. This was my suggestion 
originally, not just for releases but also for publishing on CRAN. But if you can 
get the package source into a state where it "works" without Spark (JVM) and SparkR, 
that will make the publication process easier.

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, like above example, syntax can nicely 
> translate to a natural pipeline st
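For contrast with the chained-setter style sketched above, the existing formula-based 
SparkR API looks roughly like the following (a minimal sketch, assuming a running Spark 
session and a SparkDataFrame named `training` with a `label` column; parameter names are 
as in the current SparkR documentation, not the proposed SparkML package).

library(SparkR)
sparkR.session()

# Current formula-based wrapper; regParam and maxIter correspond to the MLlib
# parameters the proposed set_reg_param()/set_max_iter() setters would expose.
model <- spark.logit(training, label ~ ., regParam = 0.1, maxIter = 10)
summary(model)
preds <- predict(model, training)
head(preds)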

[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-06-03 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499528#comment-16499528
 ] 

Felix Cheung commented on SPARK-23206:
--

[~irashid] sorry I thought I had replied -

On our end, there is some debate on whether metrics collected at the 
NodeManager (YARN) level are sufficient. IMO we definitely need some breakdown 
of disk IO per app_id (and that will be hard to separate out at the NM level), so 
that we can identify the heavy-shuffle app.

I don't think we should increase the payload significantly, so this shouldn't 
affect the design much.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
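As a quick illustration of the waste estimate in the description above, the per-application 
unused memory follows directly from the stated formula. The numbers below are placeholders 
for illustration only, not measurements from this ticket.

# Sketch in R: estimate requested-but-unused executor memory for one application.
num_executors      <- 100
executor_memory_gb <- 16              # spark.executor.memory per executor
max_jvm_used_gb    <- 0.35 * 16       # ~35% average utilization, per the description

unused_gb <- num_executors * (executor_memory_gb - max_jvm_used_gb)
unused_gb                             # 1040 GB unused for this hypothetical application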



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-06-02 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498901#comment-16498901
 ] 

Felix Cheung commented on SPARK-24434:
--

I see. Why not then pass along the pod template as a (yaml/json) file?

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-06-01 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498225#comment-16498225
 ] 

Felix Cheung commented on SPARK-24434:
--

Java properties are basically the "standard" in Spark; I'm not sure I've seen 
anything else.

What goes into the "pod template"? Is it just an ID?

As a reference, Spark conf is kind of the way to go for Spark jobs - I'm not sure 
I see a problem with the current mechanism of setting the pod configuration in Spark conf. 

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2018-06-01 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498213#comment-16498213
 ] 

Felix Cheung commented on SPARK-20202:
--

Prefer newer Hive also

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24326) Add local:// scheme support for the app jar in mesos cluster mode

2018-05-31 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-24326.
--
   Resolution: Fixed
 Assignee: Stavros Kontopoulos
Fix Version/s: 2.4.0

>  Add local:// scheme support for the app jar in mesos cluster mode
> --
>
> Key: SPARK-24326
> URL: https://issues.apache.org/jira/browse/SPARK-24326
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 2.4.0
>
>
> It is often useful to reference an application jar within the image used to 
> deploy a Spark job on mesos in cluster mode. This is not possible right now 
> because the mesos dispatcher will try to resolve the local://... uri on the 
> host (via the fetcher) and not in the container. Target is to have a scheme 
> like local:/// being resolved in the container's fs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] SPIP ML Pipelines in R

2018-05-31 Thread Felix Cheung
+1
With my concerns in the SPIP discussion.


From: Hossein 
Sent: Wednesday, May 30, 2018 2:03:03 PM
To: dev@spark.apache.org
Subject: [VOTE] SPIP ML Pipelines in R

Hi,

I started discussion 
thread
 for a new R package to expose MLlib pipelines in 
R.

To summarize we will work on utilities to generate R wrappers for MLlib 
pipeline API for a new R package. This will lower the burden for exposing new 
API in future.

Following the SPIP 
process, I am proposing 
the SPIP for a vote.

+1: Let's go ahead and implement the SPIP.
+0: Don't really care.
-1: I do not think this is a good idea for the following reasons.

Thanks,
--Hossein


Re: Revisiting Online serving of Spark models?

2018-05-30 Thread Felix Cheung
Hi!

Thank you! Let’s meet then

June 6 4pm

Moscone West Convention Center
800 Howard Street, San Francisco, CA 94103

Ground floor (outside of conference area - should be available for all) - we 
will meet and decide where to go

(I would not send an invite because that would be too much noise for dev@)

To paraphrase Joseph, we will use this to kick off the discussion, post notes 
afterwards, and follow up online. As for Seattle, I would be very interested to 
meet in person later and discuss ;)


_
From: Saikat Kanjilal 
Sent: Tuesday, May 29, 2018 11:46 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Maximiliano Felice 
Cc: Felix Cheung , Holden Karau 
, Joseph Bradley , Leif Walsh 
, dev 


Would love to join but am in Seattle, thoughts on how to make this work?

Regards

Sent from my iPhone

On May 29, 2018, at 10:35 AM, Maximiliano Felice 
mailto:maximilianofel...@gmail.com>> wrote:

Big +1 to a meeting with fresh air.

Could anyone send the invites? I don't really know which is the place Holden is 
talking about.

2018-05-29 14:27 GMT-03:00 Felix Cheung 
mailto:felixcheun...@hotmail.com>>:
You had me at blue bottle!

_
From: Holden Karau mailto:hol...@pigscanfly.ca>>
Sent: Tuesday, May 29, 2018 9:47 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Felix Cheung mailto:felixcheun...@hotmail.com>>
Cc: Saikat Kanjilal mailto:sxk1...@hotmail.com>>, 
Maximiliano Felice 
mailto:maximilianofel...@gmail.com>>, Joseph 
Bradley mailto:jos...@databricks.com>>, Leif Walsh 
mailto:leif.wa...@gmail.com>>, dev 
mailto:dev@spark.apache.org>>



I'm down for that, we could all go for a walk maybe to the mint plaza blue 
bottle and grab coffee (if the weather holds have our design meeting outside 
:p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Bump.


From: Felix Cheung mailto:felixcheun...@hotmail.com>>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev

Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 4pm at (near) the 
Summit?

(I propose we meet at the venue entrance so we could accommodate people who might 
not be in the conference)


From: Saikat Kanjilal mailto:sxk1...@hotmail.com>>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model 
serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice 
mailto:maximilianofel...@gmail.com>> wrote:

Hi!

I don't usually write a lot on this list but I keep up to date with the 
discussions and I'm a heavy user of Spark. This topic caught my attention, as 
we're currently facing this issue at work. I'm attending the summit and was 
wondering if it would be possible for me to join that meeting. I might be 
able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh 
mailto:leif.wa...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success 
using pyarrow’s parquet reader and have been quite happy with it so far. If 
your target is python (and probably if not now, then soon, R), you should look 
in to it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley 
mailto:jos...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Specifically I’d like to bring part of the discussion to Model and PipelineModel, 
and various ModelReader and SharedReadWrite implementations that rely on 
SparkContext. This is a big blocker on reusing  trained models outside of Spark 
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_
From: Felix Cheung mai

[jira] [Resolved] (SPARK-24331) Add arrays_overlap / array_repeat / map_entries

2018-05-30 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-24331.
--
  Resolution: Fixed
   Fix Version/s: 2.4.0
Target Version/s: 2.4.0

> Add arrays_overlap / array_repeat / map_entries  
> -
>
> Key: SPARK-24331
> URL: https://issues.apache.org/jira/browse/SPARK-24331
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 2.4.0
>
>
> Add SparkR equivalent to:
>  * arrays_overlap - SPARK-23922
>  * array_repeat - SPARK-23925
>  * map_entries - SPARK-23935
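For reference, the SparkR column functions listed above can be exercised roughly as 
follows once available (a minimal sketch, assuming Spark 2.4+ with a running SparkR 
session; the data is made up).

library(SparkR)
sparkR.session()

# Build a tiny DataFrame with array and map columns via SQL, which is the
# simplest way to get complex types from R.
df <- sql("SELECT array(1, 2, 3) AS a, array(3, 4, 5) AS b, map('k1', 1, 'k2', 2) AS m")

head(select(df,
            arrays_overlap(df$a, df$b),   # TRUE: the two arrays share the element 3
            array_repeat(df$a, 2),        # an array containing column a's value twice
            map_entries(df$m)))           # the map as an array of key/value structs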



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24331) Add arrays_overlap / array_repeat / map_entries

2018-05-30 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-24331:


Assignee: Marek Novotny

> Add arrays_overlap / array_repeat / map_entries  
> -
>
> Key: SPARK-24331
> URL: https://issues.apache.org/jira/browse/SPARK-24331
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 2.4.0
>
>
> Add SparkR equivalent to:
>  * arrays_overlap - SPARK-23922
>  * array_repeat - SPARK-23925
>  * map_entries - SPARK-23935



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24403) reuse r worker

2018-05-29 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494707#comment-16494707
 ] 

Felix Cheung edited comment on SPARK-24403 at 5/30/18 5:38 AM:
---

Reuse worker (daemon process) is actually supported and the default for SparkR.

The specific use case you have linked to, R UDF, is a fairly different code 
path, and that might be a different issue altogether.

Please refer back to the original issue  - don't open a new JIRA. Thanks. 


was (Author: felixcheung):
Reuse worker (daemon process) is actually supported and the default for SparkR.

The specific use case you have linked to, R UDF, might be a different issue 
altogether.

Please refer back to the original issue  - don't open a new JIRA. Thanks. 

> reuse r worker
> --
>
> Key: SPARK-24403
> URL: https://issues.apache.org/jira/browse/SPARK-24403
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Deepansh
>Priority: Major
>  Labels: sparkR
>
> Currently, SparkR doesn't support reuse of its workers, so broadcast and 
> closure are transferred to workers each time. Can we add the idea of python 
> worker reuse to SparkR also, to enhance its performance?
> performance issues reference 
> [https://issues.apache.org/jira/browse/SPARK-23650]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)

2018-05-29 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494708#comment-16494708
 ] 

Felix Cheung commented on SPARK-23650:
--

sorry, I really don't have time/resource to investigate this for now.

hopefully later...

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
> Attachments: packageReload.txt, read_model_in_udf.txt, 
> sparkR_log2.txt, sparkRlag.txt
>
>
> For eg, I am getting streams from Kafka and I want to implement a model made 
> in R for those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)   # broadcast the model once from the driver
> kafka <- read.stream("kafka", subscribe = "source",
>                      kafka.bootstrap.servers = "localhost:9092", topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   i_model <- SparkR:::value(randomBr)   # fetch the broadcast model on the worker
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))   # fromJSON/toJSON are from jsonlite
>     y$predict <- predict(i_model, y)
>     y <- toJSON(y)
>     x[row, "value"] <- y
>   }
>   x
> }, schema)
> Every time Kafka streams are fetched, the dapply method creates a new 
> runner thread and ships the variables again, which causes a huge lag (~2s for 
> shipping the model) every time. I even tried without broadcast variables but it 
> takes the same time to ship the variables. Can some other techniques be applied to 
> improve its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24403) reuse r worker

2018-05-29 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494707#comment-16494707
 ] 

Felix Cheung commented on SPARK-24403:
--

Reuse worker (daemon process) is actually supported and the default for SparkR.

The specific use case you have linked to, R UDF, might be a different issue 
altogether.

Please refer back to the original issue  - don't open a new JIRA. Thanks. 

> reuse r worker
> --
>
> Key: SPARK-24403
> URL: https://issues.apache.org/jira/browse/SPARK-24403
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Deepansh
>Priority: Major
>  Labels: sparkR
>
> Currently, SparkR doesn't support reuse of its workers, so broadcast and 
> closure are transferred to workers each time. Can we add the idea of python 
> worker reuse to SparkR also, to enhance its performance?
> performance issues reference 
> [https://issues.apache.org/jira/browse/SPARK-23650]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Revisiting Online serving of Spark models?

2018-05-29 Thread Felix Cheung
You had me at blue bottle!

_
From: Holden Karau 
Sent: Tuesday, May 29, 2018 9:47 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Felix Cheung 
Cc: Saikat Kanjilal , Maximiliano Felice 
, Joseph Bradley , Leif 
Walsh , dev 


I'm down for that, we could all go for a walk maybe to the mint plaza blue 
bottle and grab coffee (if the weather holds have our design meeting outside 
:p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Bump.


From: Felix Cheung mailto:felixcheun...@hotmail.com>>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev

Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 4pm at (near) the 
Summit?

(I propose we meet at the venue entrance so we could accommodate people who might 
not be in the conference)


From: Saikat Kanjilal mailto:sxk1...@hotmail.com>>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model 
serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice 
mailto:maximilianofel...@gmail.com>> wrote:

Hi!

I don't usually write a lot on this list but I keep up to date with the 
discussions and I'm a heavy user of Spark. This topic caught my attention, as 
we're currently facing this issue at work. I'm attending the summit and was 
wondering if it would be possible for me to join that meeting. I might be 
able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh 
mailto:leif.wa...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success 
using pyarrow’s parquet reader and have been quite happy with it so far. If 
your target is python (and probably if not now, then soon, R), you should look 
in to it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley 
mailto:jos...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Specifically I’d like to bring part of the discussion to Model and PipelineModel, 
and various ModelReader and SharedReadWrite implementations that rely on 
SparkContext. This is a big blocker on reusing  trained models outside of Spark 
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_____
From: Felix Cheung mailto:felixcheun...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau mailto:hol...@pigscanfly.ca>>, Joseph 
Bradley mailto:jos...@databricks.com>>
Cc: dev mailto:dev@spark.apache.org>>



Huge +1 on this!


From:holden.ka...@gmail.com<mailto:holden.ka...@gmail.com> 
mailto:holden.ka...@gmail.com>> on behalf of Holden 
Karau mailto:hol...@pigscanfly.ca>>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
mailto:jos...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark

Re: Revisiting Online serving of Spark models?

2018-05-29 Thread Felix Cheung
Bump.


From: Felix Cheung 
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev
Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 4pm at (near) the 
Summit?

(I propose we meet at the venue entrance so we could accommodate people who might 
not be in the conference)


From: Saikat Kanjilal 
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model 
serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice 
mailto:maximilianofel...@gmail.com>> wrote:

Hi!

I don't usually write a lot on this list but I keep up to date with the 
discussions and I'm a heavy user of Spark. This topic caught my attention, as 
we're currently facing this issue at work. I'm attending the summit and was 
wondering if it would be possible for me to join that meeting. I might be 
able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh 
mailto:leif.wa...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success 
using pyarrow’s parquet reader and have been quite happy with it so far. If 
your target is python (and probably if not now, then soon, R), you should look 
in to it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley 
mailto:jos...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Specifically I’d like to bring part of the discussion to Model and PipelineModel, 
and various ModelReader and SharedReadWrite implementations that rely on 
SparkContext. This is a big blocker on reusing  trained models outside of Spark 
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_____
From: Felix Cheung mailto:felixcheun...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau mailto:hol...@pigscanfly.ca>>, Joseph 
Bradley mailto:jos...@databricks.com>>
Cc: dev mailto:dev@spark.apache.org>>



Huge +1 on this!


From: holden.ka...@gmail.com<mailto:holden.ka...@gmail.com> 
mailto:holden.ka...@gmail.com>> on behalf of Holden 
Karau mailto:hol...@pigscanfly.ca>>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
mailto:jos...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark itself.  
Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a 
Row to the current Models.  Instead, it would be ideal to have local, 
lightweight versions of models in mllib-local, outside of the main mllib 
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize 
elements of Spark SQL, particularly Rows and Types, which could be moved into a 
local sql package.
* This architecture may require some awkward APIs currently to have model 
prediction logic in mllib-lo

[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492341#comment-16492341
 ] 

Felix Cheung edited comment on SPARK-24359 at 5/28/18 7:10 AM:
---

re repo/release/version - again to be clear, what Shivaram and I were referring 
to, was the need to update the R code to address CRAN issue without necessarily 
any change to Spark JVM. For instance, say we add this SparkML package R code 
in the 2.4.0 release. We build RC, test and vote on the release. 2.4.0 
released. We submit package to CRAN (We can't do this in a different order 
since CRAN doesn't support "updating" an existing version) - say this 
submission fails. Then what would be the next step? Kick off 2.4.1 release 
immediately? Wait for the eventual 2.4.1 release? Since all the source code is in 
the same repo, this means we will need to release Spark JVM 2.4.1 as well. This 
is the reason why the SparkR package has taken this long.

re Column type - IMO it would be great to have code gen consider the R style 
and preferences (like using the df$col syntax). Maybe not for "v1"

ok then. Look forward to this!

 

 


was (Author: felixcheung):
re repo/release/version - again to be clear, what Shivaram and I were referring 
to, was the need to update the R code to address CRAN issue without necessarily 
any change to Spark JVM. For instance, we add this SparkML package R code in 
the 2.4.0 release. We build RC, test and vote on the release. 2.4.0 released. 
We submit package to CRAN - say this submission fails. Then what would be the 
next step? Kick off 2.4.1 release immediately? Wait for the eventual 2.4.1 
release? Since all the source code is in the same repo, this means we will need to 
release Spark JVM 2.4.1 as well. This is the reason why the SparkR package has 
taken this long.

re Column type - IMO it would be great to have code gen consider the R style 
and preferences (like using the df$col syntax). Maybe not for "v1"

ok then. Look forward to this!

 

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492341#comment-16492341
 ] 

Felix Cheung commented on SPARK-24359:
--

re repo/release/version - again to be clear, what Shivaram and I were referring 
to, was the need to update the R code to address CRAN issue without necessarily 
any change to Spark JVM. For instance, we add this SparkML package R code in 
the 2.4.0 release. We build RC, test and vote on the release. 2.4.0 released. 
We submit package to CRAN - say this submission fails. Then what would be the 
next step? Kick off 2.4.1 release immediately? Wait for the eventual 2.4.1 
release? Since all the source code is in the same repo, this means we will need to 
release Spark JVM 2.4.1 as well. This is the reason why the SparkR package has 
taken this long.

re Column type - IMO it would be great to have code gen consider the R style 
and preferences (like using the df$col syntax). Maybe not for "v1"

ok then. Look forward to this!

 

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. 
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will 

[jira] [Commented] (SPARK-23435) R tests should support latest testthat

2018-05-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491908#comment-16491908
 ] 

Felix Cheung commented on SPARK-23435:
--

Work was stopped for a while. Hopefully I can get back to it in a few weeks.

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>    Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was 
> released in Dec 2017, and its method has been changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1 though, we need to check if it is going to work.
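A rough sketch, in R, of the version check this implies; the actual test entry points 
differ between testthat 1.x and 2.x and are deliberately not named here.

# Detect the installed testthat version and branch accordingly. testthat 2.0.0
# changed the API the SparkR test runner relies on, so both paths must be kept.
if (packageVersion("testthat") >= "2.0.0") {
  message("testthat 2.x detected - use the 2.x-compatible test entry point")
} else {
  message("testthat 1.x detected - use the legacy path (what Jenkins runs today)")
}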



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491867#comment-16491867
 ] 

Felix Cheung edited comment on SPARK-24359 at 5/27/18 3:24 AM:
---

# are the set_input_col / set_output_col methods going to accept Column type? 
ie. df$col
 # could you add more details on the steps for method documentation, how to 
pick from javadoc, how to manually add, esp. when methods are created via code 
gen (also would be a requirement for submitting to CRAN)
 # could you add more details on how/whether to check for release compatibility 
between SparkML-SparkR-Spark/JVM (re: section on CRAN Release Management)
 # could you add more info on `train_validation_split()`? also in the example 
`model % fit(training)` - is `training` supposed to be a SparkDataFrame (from 
SparkR, to be clear)?

 

btw, I'd also suggest avoiding `-` in names (e.g. set_train-ratio() in the pdf)

 

thanks, minor comments, I've reviewed this.


was (Author: felixcheung):
# are the set_input_col / set_output_col methods going to accept Column type? 
ie. df$col
 # could you add more details on the steps for method documentation, how to 
pick from javadoc, how to manually add, esp. when methods are created via code 
gen (also would be a requirement for submitting to CRAN)
 # could you add more details on how/whether to check for release compatibility 
(re: section on CRAN Release Management)
 # could you add more info on `train_validation_split()`? also in the example 
`model % fit(training)` - is `training` supposed to be a SparkDataFrame (from 
SparkR, to be clear)?

 

btw, I'd also suggest avoiding `-` in names (e.g. set_train-ratio() in the pdf)

 

thanks, minor comments, I've reviewed this.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib A

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491867#comment-16491867
 ] 

Felix Cheung commented on SPARK-24359:
--

# are the set_input_col / set_output_col methods going to accept Column type? 
ie. df$col
 # could you add more details on the steps for method documentation, how to 
pick from javadoc, how to manually add, esp. when methods are created via code 
gen (also would be a requirement for submitting to CRAN)
 # could you add more details on how/whether to check for release compatibility 
(re: section on CRAN Release Management)
 # could you add more info on `train_validation_split()`? also in the example 
`model % fit(training)` - is `training` supposed to be a SparkDataFrame (from 
SparkR, to be clear)?

 

btw, I'd also suggest avoiding `-` in names (e.g. set_train-ratio() in the pdf)

 

thanks, minor comments, I've reviewed this.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
>

Re: Revisiting Online serving of Spark models?

2018-05-26 Thread Felix Cheung
Hi! How about we meet the community and discuss on June 6 4pm at (near) the 
Summit?

(I propose we meet at the venue entrance so we could accommodate people who might 
not be in the conference)


From: Saikat Kanjilal <sxk1...@hotmail.com>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model 
serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice 
<maximilianofel...@gmail.com<mailto:maximilianofel...@gmail.com>> wrote:

Hi!

I don't usually write a lot on this list but I keep up to date with the 
discussions and I'm a heavy user of Spark. This topic caught my attention, as 
we're currently facing this issue at work. I'm attending the summit and was 
wondering if it would be possible for me to join that meeting. I might be 
able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh 
<leif.wa...@gmail.com<mailto:leif.wa...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success 
using pyarrow’s parquet reader and have been quite happy with it so far. If 
your target is python (and probably if not now, then soon, R), you should look 
in to it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Specifically I’d like bring part of the discussion to Model and PipelineModel, 
and various ModelReader and SharedReadWrite implementations that rely on 
SparkContext. This is a big blocker on reusing  trained models outside of Spark 
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>, Joseph 
Bradley <jos...@databricks.com<mailto:jos...@databricks.com>>
Cc: dev <dev@spark.apache.org<mailto:dev@spark.apache.org>>



Huge +1 on this!


From: holden.ka...@gmail.com<mailto:holden.ka...@gmail.com> 
<holden.ka...@gmail.com<mailto:holden.ka...@gmail.com>> on behalf of Holden 
Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark itself.  
Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a 
Row to the current Models.  Instead, it would be ideal to have local, 
lightweight versions of models in mllib-local, outside of the main mllib 
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize 
elements of Spark SQL, particularly Rows and Types, which could be moved into a 
local sql

[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

2018-05-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491788#comment-16491788
 ] 

Felix Cheung commented on SPARK-24356:
--

Interesting. We will definitely look into this.

Is the plan to turn this into a PR to fix in Spark?

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> --
>
> Key: SPARK-24356
> URL: https://issues.apache.org/jira/browse/SPARK-24356
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
> Attachments: SPARK-24356.01.patch
>
>
> I recently analyzed a heap dump of Yarn Node Manager that was suffering from 
> high GC pressure due to high object churn. Analysis was done with the jxray 
> tool ([www.jxray.com|http://www.jxray.com/]) that checks a heap dump for a 
> number of well-known memory issues. One problem that it found in this dump is 
> 19.5% of memory wasted due to duplicate strings. Of these duplicates, more 
> than a half come from {{FileInputStream.path}} and {{File.path}}. All the 
> {{FileInputStream}} objects that JXRay shows are garbage - looks like they 
> are used for a very short period and then discarded (I guess there is a 
> separate question of whether that's a good pattern). But {{File}} instances 
> are traceable to 
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here 
> is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very 
> similar, so I think {{FileInputStream}}s are generated by the 
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely 
> come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}}'s in this case, it is suggested 
> that in the above code we create a File with a complete, normalized pathname, 
> that has already been interned. This will prevent the code inside 
> {{java.io.File}} from modifying this string, and thus it will use the 
> interned copy, and will pass it to FileInputStream. Essentially the current 
> line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)), 
> filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) + 
> File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>  






Re: SparkR was removed from CRAN on 2018-05-01

2018-05-25 Thread Felix Cheung
This is the fix
https://github.com/apache/spark/commit/f27a035daf705766d3445e5c6a99867c11c552b0#diff-e1e1d3d40573127e9ee0480caf1283d6

I don't have the email though.


From: Hossein 
Sent: Friday, May 25, 2018 10:58:42 AM
To: dev@spark.apache.org
Subject: SparkR was removed from CRAN on 2018-05-01

Would you please forward the email from CRAN? Is there a JIRA?

Thanks,
--Hossein


[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488512#comment-16488512
 ] 

Felix Cheung edited comment on SPARK-24359 at 5/24/18 6:34 AM:
---

re this 
 * This also leads to the question of how the SparkML package APIs are going to 
depend on Spark versions. Are we only going to have code that depends on older 
Spark releases or are we going to have cases where we introduce the Java/Scala 
side code at the same time as the R API ?

The mix and match of different Spark / SparkR / SparkML release versions might 
need to be checked and enforced more strongly.

For instance, what if someone has Spark 2.3.0, SparkR 2.3.1 and SparkML 2.4 - 
does it work at all? What kind of quality (as in, testing) would we assert for 
something like this?
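To make the kind of enforcement concrete, a minimal sketch of a load-time check a 
SparkML package could run - everything here except SparkR::sparkR.version() is a 
hypothetical assumption, not an existing API:

{code}
# Hypothetical guard comparing the (assumed) SparkML package version against the
# running Spark version; sparkR.version() is existing SparkR API but needs an
# active Spark session.
check_spark_compat <- function() {
  spark_ver <- SparkR::sparkR.version()                        # e.g. "2.3.0"
  pkg_ver   <- as.character(utils::packageVersion("SparkML"))  # hypothetical package
  major_minor <- function(v) paste(strsplit(v, ".", fixed = TRUE)[[1]][1:2],
                                   collapse = ".")
  if (major_minor(spark_ver) != major_minor(pkg_ver)) {
    warning(sprintf("SparkML %s has not been tested against Spark %s",
                    pkg_ver, spark_ver))
  }
}
{code}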

re release

I'm not sure I'd advocate a separate git repo though, simply because of the 
uncertainty around licensing and governance. If it's intended to be a fully 
separate project (like sparklyr, or SparkR in the early days) then by all 
means. And to agree with Shivaram's point, one other possibility is a new git 
repo under ASF. And it might be more agile / faster to iterate in a separate 
codebase.

A separate repo also adds to the complexity of release mix & match - what if we 
need to patch Spark & SparkR for a critical security issue? Should we always 
re-release SparkML even when there is "no change"? Or are we going to allow 
"patch release compatibility" (or semantic versioning)? What if we have to make 
a protocol change in a patch release (thus breaking semantic versioning)? 
Either way this is complicated.

re R package

I'd recommend (learning from the year (?!) we spent with SparkR) completely 
decoupling all R package content (package tests, vignettes) from any JVM 
dependency, i.e. the package could be made to run standalone without a Java 
JRE/JDK and without the Spark release jar. This would make getting this submitted 
to CRAN much, much easier...
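For example, the package's own test suite could guard every JVM-dependent test so 
that CRAN check machines (which have neither Spark nor necessarily a JDK) still 
pass; a minimal sketch with testthat:

{code}
# Sketch of decoupling package tests from the JVM / Spark jars: skip on CRAN,
# and skip whenever no Spark installation is available locally.
library(testthat)

skip_if_no_spark <- function() {
  skip_on_cran()
  if (!nzchar(Sys.getenv("SPARK_HOME"))) {
    skip("SPARK_HOME not set; skipping JVM-dependent test")
  }
}

test_that("JVM-dependent behavior", {
  skip_if_no_spark()
  # ... assertions that need a running Spark/JVM would go here ...
})
{code}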

re naming

Sounds like you are planning to have these as S4 classes/generics/methods then.

Again I'd strongly recommend against using both the spark.name AND set_param styles. Not only 
is it inconsistent, the spark.name style is generally flagged (there is a lintr rule on this) 
and conflicts with S3 OO style / method dispatch 
([http://adv-r.had.co.nz/OO-essentials.html#s3]), e.g. mean.a

(yes I realize there are also many examples of . in method names)
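To make the S3 concern concrete, a short runnable example of why dot-separated names 
collide with method dispatch:

{code}
# A function literally named mean.a is picked up as the mean() S3 method for
# objects of class "a" - which is why dot-separated API names are risky in R.
mean.a <- function(x, ...) "dispatched to mean.a()"

obj <- structure(list(1, 2, 3), class = "a")
mean(obj)         # returns "dispatched to mean.a()"
mean(c(1, 2, 3))  # unaffected, returns 2
{code}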

 

That's it for now. More to add after review.

 


was (Author: felixcheung):
re this 
 * This also leads to the question of how the SparkML package APIs are going to 
depend on Spark versions. Are we only going to have code that depends on older 
Spark releases or are we going to have cases where we introduce the Java/Scala 
side code at the same time as the R API ?

The mix and match of different Spark / SparkR / SparkML release versions might 
need to be checked and enforced more strongly.

For instance, what if someone has Spark 2.3.0, SparkR 2.3.1 and SparkML 2.4 - 
does it work at all? what kind of quality (as in, testing) we would assert for 
something like this?

re release

I'm not sure I'd advocate a separate git/repo though simply because of the 
uncertainty around licensing and governance. If it's intend to be a fully 
separate project (like sparklyr, or SparkR in the early days) then by all 
means. One other possibility is a new git repo under ASF. And it might be more 
agile / faster to iterate in a separate codebase.

Separate repo also add to the complicity of release mix & match - what if we 
need to patch Spark & SparkR for a critical security issue? Should we always 
re-release SparkML even when there is "no change"? or we are going to allow of 
"patch release compatibility" (or semantic versioning)? Either way this is 
complicated.

re R package

I'd recommend (learning from the year (?!) we spent with SparkR) completely 
decoupling all R package content (package tests, vignettes) from all JVM 
dependency, ie. the package could be made to run standalone without Java 
JRE/JDK and without Spark release jar. This would make getting this submitted 
to CRAN much much easier...

re naming

Sounds like you are planning to have these as S4 classes/generics/methods then.

Again I'd strongly recommend against spark.name AND set_param style. Not only 
it is inconsistent, spark.name style is generally flagged (lintr rule on this) 
and conflict with S3 OO style / method dispatch 
([http://adv-r.had.co.nz/OO-essentials.html#s3)] eg. mean.a

(yes I realize there are also many examples of . in method names)

 

That's it for now. More to add after review.

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488512#comment-16488512
 ] 

Felix Cheung commented on SPARK-24359:
--

re this 
 * This also leads to the question of how the SparkML package APIs are going to 
depend on Spark versions. Are we only going to have code that depends on older 
Spark releases or are we going to have cases where we introduce the Java/Scala 
side code at the same time as the R API ?

The mix and match of different Spark / SparkR / SparkML release versions might 
need to be checked and enforced more strongly.

For instance, what if someone has Spark 2.3.0, SparkR 2.3.1 and SparkML 2.4 - 
does it work at all? What kind of quality (as in, testing) would we assert for 
something like this?

re release

I'm not sure I'd advocate a separate git repo though, simply because of the 
uncertainty around licensing and governance. If it's intended to be a fully 
separate project (like sparklyr, or SparkR in the early days) then by all 
means. One other possibility is a new git repo under ASF. And it might be more 
agile / faster to iterate in a separate codebase.

A separate repo also adds to the complexity of release mix & match - what if we 
need to patch Spark & SparkR for a critical security issue? Should we always 
re-release SparkML even when there is "no change"? Or are we going to allow 
"patch release compatibility" (or semantic versioning)? Either way this is 
complicated.

re R package

I'd recommend (learning from the year (?!) we spent with SparkR) completely 
decoupling all R package content (package tests, vignettes) from all JVM 
dependency, ie. the package could be made to run standalone without Java 
JRE/JDK and without Spark release jar. This would make getting this submitted 
to CRAN much much easier...

re naming

Sounds like you are planning to have these as S4 classes/generics/methods then.

Again I'd strongly recommend against using both the spark.name AND set_param styles. Not only 
is it inconsistent, the spark.name style is generally flagged (there is a lintr rule on this) 
and conflicts with S3 OO style / method dispatch 
([http://adv-r.had.co.nz/OO-essentials.html#s3]), e.g. mean.a

(yes I realize there are also many examples of . in method names)

 

That's it for now. More to add after review.

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib APIs, but sparklyr’s API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers 

[jira] [Commented] (SPARK-22055) Port release scripts

2018-05-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486728#comment-16486728
 ] 

Felix Cheung commented on SPARK-22055:
--

Interesting - I'd definitely be happy to help.

Do you have it scripted to inject the signing key into the Docker image?

 

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these to enable changes like SPARK-22054 .






[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486719#comment-16486719
 ] 

Felix Cheung edited comment on SPARK-24359 at 5/23/18 5:21 AM:
---

# could you include the design doc as a google doc - it will be easier to comment, 
ask questions etc
 # is it the plan to tightly couple the SparkML package to a particular Spark 
ASF release and its jar (like SparkR), or is SparkML going to work with 
multiple Spark releases (like sparklyr)?
 # if SparkML does not depend on SparkR, how do you propose it communicates 
with the Spark JVM? How do you get data into SparkML (on the JVM side, Spark's 
ML Pipeline Model still depends on Spark's Dataset/DataFrame), or simply 
work with a SparkSession?
 # one of the first comments - please be consistent with the naming convention 
(there is no . notation in R): both `train.validation.split()` and 
`set_estimator(lr)` are methods? Please don't mix `.` and `_` in names, and 
hopefully also avoid mixing in Scala's camelCasing.

Releasing onto CRAN takes a lot of work - lots of scripts, tests and so on 
which would now be "duplicated" for a new 2nd R package. The process is 
particularly much, much harder for any R package that depends on the JVM. Hope 
we keep this in mind for this proposal.

link to https://issues.apache.org/jira/browse/SPARK-18822
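On point 3 above (how a SparkML package could talk to the Spark JVM if it does not 
depend on SparkR's DataFrame layer), note that SparkR already exports low-level 
JVM-interop helpers; a minimal sketch, assuming an active Spark session - whether 
SparkML would reuse these or ship its own backend is exactly the open question:

{code}
# SparkR's exported JVM-call helpers (experimental APIs in recent Spark releases).
library(SparkR)
sparkR.session()

sparkR.callJStatic("java.lang.System", "currentTimeMillis")  # static method call

arr <- sparkR.newJObject("java.util.ArrayList")              # construct a JVM object
sparkR.callJMethod(arr, "add", 1L)                           # instance method call
sparkR.callJMethod(arr, "size")                              # returns 1
{code}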

 


was (Author: felixcheung):
# could you include design doc as google doc - it will be easier to comment, 
ask questions etc
 # is it the plan to tightly couple the SparkML package on the particular Spark 
ASF release and its jar (like SparkR), or is SparkML going to work with 
multiple Spark releases (like sparklyr)?
 # if SparkML does not depend on SparkR, how do you propose it communicates 
with the Spark JVM? How do you get data into SparkML (on the JVM side, Spark's 
ML Pipeline Model still depends on Spark's Dataset/DataFrame), or simply to 
work with a SparkSession?
 # one of the first comment - please be consistent with style - 

Releasing on to CRAN takes a lot of work - lots of scripts, tests and so on 
which now would be "duplicated" for a new 2nd R package. The process is 
particularly much much harder for any R package that depends on the JVM. Hope 
we keep this in mind for this proposal.

link to https://issues.apache.org/jira/browse/SPARK-18822

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib APIs, but sparklyr’s API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline

[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486719#comment-16486719
 ] 

Felix Cheung edited comment on SPARK-24359 at 5/23/18 5:18 AM:
---

# could you include the design doc as a google doc - it will be easier to comment, 
ask questions etc
 # is it the plan to tightly couple the SparkML package to a particular Spark 
ASF release and its jar (like SparkR), or is SparkML going to work with 
multiple Spark releases (like sparklyr)?
 # if SparkML does not depend on SparkR, how do you propose it communicates 
with the Spark JVM? How do you get data into SparkML (on the JVM side, Spark's 
ML Pipeline Model still depends on Spark's Dataset/DataFrame), or simply 
work with a SparkSession?
 # one of the first comments - please be consistent with style - 

Releasing onto CRAN takes a lot of work - lots of scripts, tests and so on 
which would now be "duplicated" for a new 2nd R package. The process is 
particularly much, much harder for any R package that depends on the JVM. Hope 
we keep this in mind for this proposal.

link to https://issues.apache.org/jira/browse/SPARK-18822

 


was (Author: felixcheung):
# could you include design doc as google doc - it will be easier to comment, 
ask questions etc
 # is it the plan to tightly couple the SparkML package on the particular Spark 
ASF release and its jar (like SparkR), or is SparkML going to work with 
multiple Spark releases (like sparklyr)?
 # if SparkML does not depend on SparkR, how do you propose it communicates 
with the Spark JVM? How do you get data into SparkML (on the JVM side, Spark's 
ML Pipeline Model still depends on Spark's Dataset/DataFrame), or simply to 
work with a SparkSession?

Releasing on to CRAN takes a lot of work - lots of scripts, tests and so on 
which now would be "duplicated" for a new 2nd R package. The process is 
particularly much much harder for any R package that depends on the JVM. Hope 
we keep this in mind for this proposal.

link to https://issues.apache.org/jira/browse/SPARK-18822

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib APIs, but sparklyr’s API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existi

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486719#comment-16486719
 ] 

Felix Cheung commented on SPARK-24359:
--

# could you include the design doc as a google doc - it will be easier to comment, 
ask questions etc
 # is it the plan to tightly couple the SparkML package to a particular Spark 
ASF release and its jar (like SparkR), or is SparkML going to work with 
multiple Spark releases (like sparklyr)?
 # if SparkML does not depend on SparkR, how do you propose it communicates 
with the Spark JVM? How do you get data into SparkML (on the JVM side, Spark's 
ML Pipeline Model still depends on Spark's Dataset/DataFrame), or simply 
work with a SparkSession?

Releasing onto CRAN takes a lot of work - lots of scripts, tests and so on 
which would now be "duplicated" for a new 2nd R package. The process is 
particularly much, much harder for any R package that depends on the JVM. Hope 
we keep this in mind for this proposal.

link to https://issues.apache.org/jira/browse/SPARK-18822

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib APIs, but sparklyr’s API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, 

Re: Revisiting Online serving of Spark models?

2018-05-21 Thread Felix Cheung
+1 on meeting up!


From: Holden Karau <hol...@pigscanfly.ca>
Sent: Monday, May 21, 2018 2:52:20 PM
To: Joseph Bradley
Cc: Felix Cheung; dev
Subject: Re: Revisiting Online serving of Spark models?

(Oh also the write API has already been extended to take formats).

On Mon, May 21, 2018 at 2:51 PM Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
I like that idea. I’ll be around Spark Summit.

On Mon, May 21, 2018 at 1:52 PM Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Specifically I’d like bring part of the discussion to Model and PipelineModel, 
and various ModelReader and SharedReadWrite implementations that rely on 
SparkContext. This is a big blocker on reusing  trained models outside of Spark 
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_____
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>, Joseph 
Bradley <jos...@databricks.com<mailto:jos...@databricks.com>>
Cc: dev <dev@spark.apache.org<mailto:dev@spark.apache.org>>



Huge +1 on this!


From: holden.ka...@gmail.com<mailto:holden.ka...@gmail.com> 
<holden.ka...@gmail.com<mailto:holden.ka...@gmail.com>> on behalf of Holden 
Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark itself.  
Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a 
Row to the current Models.  Instead, it would be ideal to have local, 
lightweight versions of models in mllib-local, outside of the main mllib 
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize 
elements of Spark SQL, particularly Rows and Types, which could be moved into a 
local sql package.
* This architecture may require some awkward APIs currently to have model 
prediction logic in mllib-local, local model classes in mllib-local, and 
regular (DataFrame-friendly) model classes in mllib.  We might find it helpful 
to break some DeveloperApis in Spark 3.0 to facilitate this architecture while 
making it feasible for 3rd party developers to extend MLlib APIs (especially in 
Java).
I agree this could be interesting, and feed into the other discussion around 
when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to 
avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as 
important as per-Row transformations, but they would be helpful for batching 
for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau 
<hol...@pigscanfly.ca<mailto:h

Re: Running lint-java during PR builds?

2018-05-21 Thread Felix Cheung
One concern is with the volume of test runs on Travis.

In ASF projects Travis could get significantly
backed up since - if I recall - all of ASF shares one queue.

Given the number of PRs Spark has, this could be a big issue.



From: Marcelo Vanzin 
Sent: Monday, May 21, 2018 9:08:28 AM
To: Hyukjin Kwon
Cc: Dongjoon Hyun; dev
Subject: Re: Running lint-java during PR builds?

I'm fine with it. I tried to use the existing checkstyle sbt plugin
(trying to fix SPARK-22269), but it depends on an ancient version of
checkstyle, and I don't know sbt enough to figure out how to hack
classpaths and class loaders when applying rules, so gave up.

On Mon, May 21, 2018 at 1:47 AM, Hyukjin Kwon  wrote:
> I am going to open an INFRA JIRA if there's no explicit objection in few
> days.
>
> 2018-05-21 13:09 GMT+08:00 Hyukjin Kwon :
>>
>> I would like to revive this proposal for Travis CI. Shall we give this a try? I
>> think it's worth trying.
>>
>> 2016-11-17 3:50 GMT+08:00 Dongjoon Hyun :
>>>
>>> Hi, Marcelo and Ryan.
>>>
>>> That was the main purpose of my proposal about Travis CI.
>>> IMO, that is the only way to achieve that without any harmful side-effect
>>> on the Jenkins infra.
>>>
>>> Spark is already ready for that. Like AppVeyor, if one of you files an
>>> INFRA jira issue to enable that, they will turn on that. Then, we can try it
>>> and see the result. Also, you can turn off easily again if you don't want.
>>>
>>> Without this, we will consume more community efforts. For example, we
>>> merged lint-java error fix PR seven hours ago, but the master branch still
>>> has one lint-java error.
>>>
>>> https://travis-ci.org/dongjoon-hyun/spark/jobs/176351319
>>>
>>> Actually, I've been monitoring the history here. (It's synced every 30
>>> minutes.)
>>>
>>> https://travis-ci.org/dongjoon-hyun/spark/builds
>>>
>>> Could we give a chance to this?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On 2016-11-15 13:40 (-0800), "Shixiong(Ryan) Zhu"
>>>  wrote:
>>> > I remember it's because you need to run `mvn install` before running
>>> > lint-java if the maven cache is empty, and `mvn install` is pretty
>>> > heavy.
>>> >
>>> > On Tue, Nov 15, 2016 at 1:21 PM, Marcelo Vanzin 
>>> > wrote:
>>> >
>>> > > Hey all,
>>> > >
>>> > > Is there a reason why lint-java is not run during PR builds? I see it
>>> > > seems to be maven-only, is it really expensive to run after an sbt
>>> > > build?
>>> > >
>>> > > I see a lot of PRs coming in to fix Java style issues, and those all
>>> > > seem a little unnecessary. Either we're enforcing style checks or
>>> > > we're not, and right now it seems we aren't.
>>> > >
>>> > > --
>>> > > Marcelo
>>> > >
>>> > > -
>>> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >
>>> > >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>
>



--
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Integrating ML/DL frameworks with Spark

2018-05-20 Thread Felix Cheung
Very cool. We would be very interested in this.

What is the plan forward to make progress in each of the three areas?



From: Bryan Cutler 
Sent: Monday, May 14, 2018 11:37:20 PM
To: Xiangrui Meng
Cc: Reynold Xin; dev
Subject: Re: Integrating ML/DL frameworks with Spark

Thanks for starting this discussion, I'd also like to see some improvements in 
this area and glad to hear that the Pandas UDFs / Arrow functionality might be 
useful.  I'm wondering if from your initial investigations you found anything 
lacking from the Arrow format or possible improvements that would simplify the 
data representation?  Also, while data could be handed off in a UDF, would it 
make sense to also discuss a more formal way to externalize the data in a way 
that would also work for the Scala API?

Thanks,
Bryan

On Wed, May 9, 2018 at 4:31 PM, Xiangrui Meng 
> wrote:
Shivaram: Yes, we can call it "gang scheduling" or "barrier synchronization". 
Spark doesn't support it now. The proposal is to have proper support in 
Spark's job scheduler, so we can integrate well with MPI-like frameworks.


On Tue, May 8, 2018 at 11:17 AM Nan Zhu 
> wrote:
.how I skipped the last part

On Tue, May 8, 2018 at 11:16 AM, Reynold Xin 
> wrote:
Yes, Nan, totally agree. To be on the same page, that's exactly what I wrote 
wasn't it?

On Tue, May 8, 2018 at 11:14 AM Nan Zhu 
> wrote:
besides that, one of the things which is needed by multiple frameworks is to 
schedule tasks in a single wave

i.e.

if some frameworks like xgboost/mxnet require 50 parallel workers, Spark is 
expected to provide a capability to ensure that either we run 50 tasks at once, 
or we quit the complete application/job after some timeout period

Best,

Nan

On Tue, May 8, 2018 at 11:10 AM, Reynold Xin 
> wrote:
I think that's what Xiangrui was referring to. Instead of retrying a single 
task, retry the entire stage, and the entire stage of tasks need to be 
scheduled all at once.


On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman 
> wrote:


  *   Fault tolerance and execution model: Spark assumes fine-grained task 
recovery, i.e. if something fails, only that task is rerun. This doesn’t match 
the execution model of distributed ML/DL frameworks that are typically 
MPI-based, and rerunning a single task would lead to the entire system hanging. 
A whole stage needs to be re-run.

This is not only useful for integrating with 3rd-party frameworks, but also 
useful for scaling MLlib algorithms. One of my earliest attempts in Spark MLlib 
was to implement All-Reduce primitive 
(SPARK-1485). But we ended up 
with some compromised solutions. With the new execution model, we can set up a 
hybrid cluster and do all-reduce properly.

Is there a particular new execution model you are referring to or do we plan to 
investigate a new execution model? For the MPI-like model, we also need gang 
scheduling (i.e. schedule all tasks at once or none of them) and I don't think 
we have support for that in the scheduler right now.

--

Xiangrui Meng

Software Engineer

Databricks Inc. [http://databricks.com] 






Re: Revisiting Online serving of Spark models?

2018-05-20 Thread Felix Cheung
Specifically I’d like bring part of the discussion to Model and PipelineModel, 
and various ModelReader and SharedReadWrite implementations that rely on 
SparkContext. This is a big blocker on reusing  trained models outside of Spark 
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_
From: Felix Cheung <felixcheun...@hotmail.com>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley <jos...@databricks.com>
Cc: dev <dev@spark.apache.org>


Huge +1 on this!


From: holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of Holden Karau 
<hol...@pigscanfly.ca>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark itself.  
Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a 
Row to the current Models.  Instead, it would be ideal to have local, 
lightweight versions of models in mllib-local, outside of the main mllib 
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize 
elements of Spark SQL, particularly Rows and Types, which could be moved into a 
local sql package.
* This architecture may require some awkward APIs currently to have model 
prediction logic in mllib-local, local model classes in mllib-local, and 
regular (DataFrame-friendly) model classes in mllib.  We might find it helpful 
to break some DeveloperApis in Spark 3.0 to facilitate this architecture while 
making it feasible for 3rd party developers to extend MLlib APIs (especially in 
Java).
I agree this could be interesting, and feed into the other discussion around 
when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to 
avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as 
important as per-Row transformations, but they would be helpful for batching 
for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as 
any to revisit the online serving situation in Spark ML. DB & others have done 
some excellent work moving a lot of the necessary tools into a local linear 
algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, 
but currently our individual transform/predict methods are private so they 
either need to copy or re-implement (or put themselves in org.apache.spark) to 
access them. How would folks feel about adding a new trait for ML pipeline 
stages to expose to do transformation of single element inputs (or local 
collections) that could be optionally implemented by stages which support this? 
That way we can have less copy and paste code possibly getting out of sync with 
our model training.

I think continuing to have on-line serving grow in different projects is 
probably the right path forward (folks have different needs), but I'd love to 
see us make it simpler for other projects to build reliable serving tools.

I realize this maybe puts some of the folks in an awkward position with their 
own commercial offerings, but hopefully if we make it easier for everyone the 
commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]<http://databricks.com/>



--
Twitter: https://twitter.com/holdenkarau




[jira] [Created] (ZEPPELIN-3469) Confusing behavior when running on Java 9

2018-05-17 Thread Felix Cheung (JIRA)
Felix Cheung created ZEPPELIN-3469:
--

 Summary: Confusing behavior when running on Java 9
 Key: ZEPPELIN-3469
 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3469
 Project: Zeppelin
  Issue Type: Bug
  Components: Interpreters
Reporter: Felix Cheung


confusing error when running on Java 9 
 # start zeppelin on java 9
 # open the sample notebook
 # shift-enter to run (should run the spark interpreter)

 
{code:java}
java.lang.NullPointerException at 
org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:44) at 
org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:39) at 
org.apache.zeppelin.spark.OldSparkInterpreter.createSparkContext_2(OldSparkInterpreter.java:375)
 at 
org.apache.zeppelin.spark.OldSparkInterpreter.createSparkContext(OldSparkInterpreter.java:364)
 at 
org.apache.zeppelin.spark.OldSparkInterpreter.getSparkContext(OldSparkInterpreter.java:172)
 at 
org.apache.zeppelin.spark.OldSparkInterpreter.open(OldSparkInterpreter.java:740)
 at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:61) 
at 
org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
 at 
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:617)
 at org.apache.zeppelin.scheduler.Job.run(Job.java:188) at 
org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:140) at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
 at
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:299)
 at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
 at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
 at java.base/java.lang.Thread.run(Thread.java:844)
{code}





[jira] [Created] (ZEPPELIN-3468) Undo paragraph change after find/replace erratic

2018-05-17 Thread Felix Cheung (JIRA)
Felix Cheung created ZEPPELIN-3468:
--

 Summary: Undo paragraph change after find/replace erratic
 Key: ZEPPELIN-3468
 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3468
 Project: Zeppelin
  Issue Type: Bug
  Components: front-end
Affects Versions: 0.7.3, 0.8.0
Reporter: Felix Cheung


Find/Replace - then Undo behavior seemed erratic

 
 # find some text
 # replace text with something else
 # go to paragraph to try to undo the replace (Ctrl-Z or Cmd-Z)
- replace seemed to be going into block of text (but not always the whole 
paragraph)





Re: [VOTE] Release Apache Zeppelin 0.8.0 (RC2)

2018-05-16 Thread Felix Cheung
Tested and some observations / issues:

- We should remove *.md5 from dist.apache.org before publishing (on the other 
hand, .md5 is needed for maven / repository.apache.org)

- also this doc page has hardcoded versions (to 0.7.0), might also need a few 
more updates
https://git-wip-us.apache.org/repos/asf?p=zeppelin.git;a=blob;f=docs/usage/interpreter/installation.md;h=8bc3b3d8d186f6a2311ba52597dbe694e48835eb;hb=a88e4679a2f28a914fa181ad2df55e3744a8ff6b

- confusing error when running on Java 9 (will open JIRA)
java.lang.NullPointerException
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:44)
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:39)
at 
org.apache.zeppelin.spark.OldSparkInterpreter.createSparkContext_2(OldSparkInterpreter.java:375)
at 
org.apache.zeppelin.spark.OldSparkInterpreter.createSparkContext(OldSparkInterpreter.java:364)
at 
org.apache.zeppelin.spark.OldSparkInterpreter.getSparkContext(OldSparkInterpreter.java:172)
at 
org.apache.zeppelin.spark.OldSparkInterpreter.open(OldSparkInterpreter.java:740)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:61)
at 
org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at 
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:617)
at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:140)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:299)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.base/java.lang.Thread.run(Thread.java:844)

- Find/Replace - then Undo behavior seemed erratic (will open JIRA)
1. find some text
2. replace text with something else
3. go to paragraph to try to undo the replace - replace seemed to be going into 
block of text (but not always the whole paragraph)


From: Belousov Maksim Eduardovich 
Sent: Wednesday, May 16, 2018 5:43:10 AM
To: dev@zeppelin.apache.org
Subject: RE: [VOTE] Release Apache Zeppelin 0.8.0 (RC2)

+1
Let's release!


I built "branch-0.8" from github with commands [1]
I have got error [2] and I had to switch java version from 1.7.0_67 to 
1.8.0_171.
Did anybody face this problem?


1.
 ./dev/change_scala_version.sh 2.11
mvn clean package -DskipRat -DskipTests -Dcheckstyle.skip

2. Exception in thread "main" java.lang.UnsupportedClassVersionError: 
javax/ws/rs/core/Application : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)

Regards,

Maksim Belousov

-Original Message-
From: Sanjay Dasgupta [mailto:sanjay.dasgu...@gmail.com]
Sent: Wednesday, May 16, 2018 5:28 AM
To: dev@zeppelin.apache.org
Subject: Re: [VOTE] Release Apache Zeppelin 0.8.0 (RC2)

Hi jongyoul,

Thanks for the clarification, I hadn't thought of that.

Will be good to see 0.8.0 released.

Sanjay

On Wed, May 16, 2018 at 7:24 AM, Jongyoul Lee  wrote:

> Hi, Sanjay,
>
> This is because 0.8.0 is not deployed. these artifacts will be
> deployed after passing vote.
>
> Hope this help,
>
> BTW,
>
> I'll give +1 for this version.
>
> JL
>
> On Wed, May 16, 2018 at 8:13 AM, moon soo Lee  wrote:
>
> > Thanks Jeff preparing the release.
> >
> > +1
> >
> >  - License exists in source/binary package
> >  - Signature
> >  - Build from source
> >  - Run tutorial notebooks
> >
> > I think ./bin/install-interpreter.sh does not work because 0.8.0
> artifacts
> > are not published yet. If ./bin/install-interpreter.sh command have
> > a capability to add staging repo (
> > https://repository.apache.org/content/repositories/
> orgapachezeppelin-1124/
> > )
> > that'll be helpful verifying RC.
> >
> > On Tue, May 15, 2018 at 10:42 AM Sanjay Dasgupta <
> > sanjay.dasgu...@gmail.com>
> > wrote:
> >
> > > -1
> > >
> > > I downloaded zeppelin-0.8.0-bin-netinst.tgz, and then tried to
> > > install
> > the
> > > shell and jdbc interpreters. Both attempts failed (Ubuntu 16.04,
> > > Oracle
> > JDK
> > > 1.8.0_171).
> > >
> > > The error while trying to install the shell interpreter is shown below:
> > >
> > > $ ./bin/install-interpreter.sh --name shell
> > >
> > > Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
> > > MaxPermSize=512m; support was removed in 8.0
> > > SLF4J: Class path contains multiple SLF4J bindings.
> > > SLF4J: Found binding in
> > >
> > > [jar:file:/home/yajnas/Desktop/Data-Analytics-AI/
> > zeppelin/zeppelin-0.8.0-rc2-bin-netinst/lib/interpreter/
> > slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

[jira] [Commented] (SPARK-23455) Default Params in ML should be saved separately

2018-05-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476747#comment-16476747
 ] 

Felix Cheung commented on SPARK-23455:
--

does this affect R?

> Default Params in ML should be saved separately
> ---
>
> Key: SPARK-23455
> URL: https://issues.apache.org/jira/browse/SPARK-23455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> We save ML's user-supplied params and default params as one entity in JSON. 
> During loading the saved models, we set all the loaded params into created ML 
> model instances as user-supplied params.
> It causes some problems, e.g., if we strictly disallow some params to be set 
> at the same time, a default param can fail the param check because it is 
> treated as user-supplied param after loading.
> The loaded default params should not be set as user-supplied params. We 
> should save ML default params separately in JSON.






[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-14 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475202#comment-16475202
 ] 

Felix Cheung commented on SPARK-23780:
--

[~vanzin] - FYI - if we have another RC this can be pulled into 2.3.1.

(I guess I just missed your cut of RC1)

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ivan Dzikovsky
>    Assignee: Felix Cheung
>Priority: Major
>  Labels: regression
> Fix For: 2.4.0, 2.3.2
>
>
> I tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as it was with Spark
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23780.
--
   Resolution: Fixed
Fix Version/s: (was: 2.3.1)

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ivan Dzikovsky
>    Assignee: Felix Cheung
>Priority: Major
>  Labels: regression
> Fix For: 2.4.0
>
>
> I tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as it was with Spark
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23780:
-
Fix Version/s: 2.3.2

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ivan Dzikovsky
>    Assignee: Felix Cheung
>Priority: Major
>  Labels: regression
> Fix For: 2.4.0, 2.3.2
>
>
> I tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as it was with Spark
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-23780:


Assignee: Felix Cheung

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ivan Dzikovsky
>    Assignee: Felix Cheung
>Priority: Major
>  Labels: regression
> Fix For: 2.3.1, 2.4.0
>
>
> I tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as it was with Spark
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23780:
-
Fix Version/s: 2.4.0
   2.3.1

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ivan Dzikovsky
>    Assignee: Felix Cheung
>Priority: Major
>  Labels: regression
> Fix For: 2.3.1, 2.4.0
>
>
> I tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as it was with Spark
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24265) lintr checks not failing PR build

2018-05-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473651#comment-16473651
 ] 

Felix Cheung commented on SPARK-24265:
--

example: 
https://github.com/apache/spark/pull/21315/files#diff-5277c0f5b53da38579f8c0d5c63fba3eR66

> lintr checks not failing PR build
> -
>
> Key: SPARK-24265
> URL: https://issues.apache.org/jira/browse/SPARK-24265
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Felix Cheung
>Priority: Major
>
> A few lintr violations went through recently; we need to check why they are
> not flagged by the Jenkins build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23780:
-
Affects Version/s: 2.3.0

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ivan Dzikovsky
>Priority: Major
>  Labels: regression
>
> I tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as it was with Spark
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23780:
-
Labels: regression  (was: regresion)

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ivan Dzikovsky
>Priority: Major
>  Labels: regression
>
> I tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as it was with Spark
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23780:
-
Labels: regresion  (was: )

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ivan Dzikovsky
>Priority: Major
>  Labels: regression
>
> I tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as it was with Spark
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24265) lintr checks not failing PR build

2018-05-13 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-24265:


 Summary: lintr checks not failing PR build
 Key: SPARK-24265
 URL: https://issues.apache.org/jira/browse/SPARK-24265
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0, 2.3.1
Reporter: Felix Cheung


A few lintr violations went through recently; we need to check why they are not 
flagged by the Jenkins build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24263:
-
Description: 
testing with openjdk, noticed that it breaks because the version string is 
different

 

instead of

"java version"

it has

"openjdk version \"1.8.0_91\""

  was:
testing with openjdk, noticed that it breaks because the open string is 
different

 

instead of "java version" it has

"openjdk version \"1.8.0_91\""


> SparkR java check breaks on openjdk
> ---
>
> Key: SPARK-24263
> URL: https://issues.apache.org/jira/browse/SPARK-24263
> Project: Spark
>  Issue Type: Bug
>      Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Felix Cheung
>Priority: Blocker
>
> Testing with OpenJDK, I noticed that it breaks because the version string is
> different.
>  
> instead of
> "java version"
> it has
> "openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24263:
-
Description: 
testing with openjdk, noticed that it breaks because the version string is 
different

 

instead of

"java version \"1.8.0\"""

it has

"openjdk version \"1.8.0_91\""

  was:
testing with openjdk, noticed that it breaks because the version string is 
different

 

instead of

"java version"

it has

"openjdk version \"1.8.0_91\""


> SparkR java check breaks on openjdk
> ---
>
> Key: SPARK-24263
> URL: https://issues.apache.org/jira/browse/SPARK-24263
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Felix Cheung
>Priority: Blocker
>
> Testing with OpenJDK, I noticed that it breaks because the version string is
> different.
>  
> instead of
> "java version \"1.8.0\"""
> it has
> "openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24263:
-
Priority: Blocker  (was: Major)

> SparkR java check breaks on openjdk
> ---
>
> Key: SPARK-24263
> URL: https://issues.apache.org/jira/browse/SPARK-24263
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Felix Cheung
>Priority: Blocker
>
> testing with openjdk, noticed that it breaks because the open string is 
> different
>  
> instead of "java version" it has
> "openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24263:
-
Affects Version/s: (was: 2.3.0)
   2.3.1

> SparkR java check breaks on openjdk
> ---
>
> Key: SPARK-24263
> URL: https://issues.apache.org/jira/browse/SPARK-24263
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Felix Cheung
>Priority: Blocker
>
> testing with openjdk, noticed that it breaks because the open string is 
> different
>  
> instead of "java version" it has
> "openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24263) SparkR java check breaks on openjdk

2018-05-13 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-24263:


 Summary: SparkR java check breaks on openjdk
 Key: SPARK-24263
 URL: https://issues.apache.org/jira/browse/SPARK-24263
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Felix Cheung


testing with openjdk, noticed that it breaks because the open string is 
different

 

instead of "java version" it has

"openjdk version \"1.8.0_91\""



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


