[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX

2020-07-10 Thread Shivaram Venkataraman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155697#comment-17155697
 ] 

Shivaram Venkataraman commented on SPARK-31918:
---

Yes – this is the reason that SparkR has been temporarily removed from CRAN. We 
need a new Spark release before we can upload a new version of the package, and 
AFAIK efforts to release Spark 2.4.7 and Spark 3.0.1 are ongoing.

cc'ing the release managers [~holden] [~ruifengz] [~prashant]

> SparkR CRAN check gives a warning with R 4.0.0 on OSX
> -
>
> Key: SPARK-31918
> URL: https://issues.apache.org/jira/browse/SPARK-31918
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Shivaram Venkataraman
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> When the SparkR package is run through a CRAN check (i.e. with something like 
> R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR 
> vignette as a part of the checks.
> However this seems to be failing with R 4.0.0 on OSX -- both on my local 
> machine and on CRAN 
> https://cran.r-project.org/web/checks/check_results_SparkR.html
> cc [~felixcheung]






[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX

2020-06-22 Thread Shivaram Venkataraman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142618#comment-17142618
 ] 

Shivaram Venkataraman commented on SPARK-31918:
---

That's great, [~hyukjin.kwon] -- so we can get around the installation issue if 
we can build on R 4.0.0. However, I guess we will still have the serialization 
issue. BTW, does the serialization issue go away if we build in R 4.0.0 and run 
with R 3.6.3?
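
For context, a minimal sketch in plain base R (not SparkR's actual driver/worker 
protocol) of the serialization-format difference that makes mixing R versions 
risky; the `version` argument is the knob that controls cross-version 
compatibility:

{code}
# Base R serialization: R >= 3.5 can write format version 3, which older R
# releases cannot read; format version 2 is readable much further back.
obj <- list(ids = 1:3, name = "spark")

raw_v3 <- serialize(obj, connection = NULL)               # version 3 is the default in R >= 3.6
raw_v2 <- serialize(obj, connection = NULL, version = 2)  # backward-compatible format

unserialize(raw_v2)
{code}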


> SparkR CRAN check gives a warning with R 4.0.0 on OSX
> -
>
> Key: SPARK-31918
> URL: https://issues.apache.org/jira/browse/SPARK-31918
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Shivaram Venkataraman
>Priority: Blocker
>
> When the SparkR package is run through a CRAN check (i.e. with something like 
> R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR 
> vignette as a part of the checks.
> However this seems to be failing with R 4.0.0 on OSX -- both on my local 
> machine and on CRAN 
> https://cran.r-project.org/web/checks/check_results_SparkR.html
> cc [~felixcheung]






[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX

2020-06-22 Thread Shivaram Venkataraman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142558#comment-17142558
 ] 

Shivaram Venkataraman commented on SPARK-31918:
---

I can confirm that with a build of Spark 3.0.0 from source and R 4.0.2, I see the 
following error while building the vignettes.

{{R worker produced errors: Error in lapply(part, FUN) : attempt to bind a 
variable to R_UnboundValue}}
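
For reference, a minimal sketch of the kind of worker-side call in the vignette 
where this error shows up (assuming a local Spark install; the function and data 
here are illustrative, not the vignette's exact code):

{code}
library(SparkR)
sparkR.session(master = "local[2]")

# spark.lapply ships the function to R worker processes; the
# "attempt to bind a variable to R_UnboundValue" error above is raised
# from inside those workers under R 4.0.x.
res <- spark.lapply(1:4, function(i) i * i)
unlist(res)

sparkR.session.stop()
{code}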

> SparkR CRAN check gives a warning with R 4.0.0 on OSX
> -
>
> Key: SPARK-31918
> URL: https://issues.apache.org/jira/browse/SPARK-31918
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Shivaram Venkataraman
>Priority: Blocker
>
> When the SparkR package is run through a CRAN check (i.e. with something like 
> R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR 
> vignette as a part of the checks.
> However this seems to be failing with R 4.0.0 on OSX -- both on my local 
> machine and on CRAN 
> https://cran.r-project.org/web/checks/check_results_SparkR.html
> cc [~felixcheung]






[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX

2020-06-22 Thread Shivaram Venkataraman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142532#comment-17142532
 ] 

Shivaram Venkataraman commented on SPARK-31918:
---

[~hyukjin.kwon] I have R 4.0.2 and will try to do a fresh build of Spark 3.0.0 
from source.

> SparkR CRAN check gives a warning with R 4.0.0 on OSX
> -
>
> Key: SPARK-31918
> URL: https://issues.apache.org/jira/browse/SPARK-31918
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Shivaram Venkataraman
>Priority: Blocker
>
> When the SparkR package is run through a CRAN check (i.e. with something like 
> R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR 
> vignette as a part of the checks.
> However this seems to be failing with R 4.0.0 on OSX -- both on my local 
> machine and on CRAN 
> https://cran.r-project.org/web/checks/check_results_SparkR.html
> cc [~felixcheung]






[jira] [Commented] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX

2020-06-22 Thread Shivaram Venkataraman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142466#comment-17142466
 ] 

Shivaram Venkataraman commented on SPARK-31918:
---

Thanks [~hyukjin.kwon]. It looks like there is another problem. From what I saw 
today, R 4.0.0 cannot load packages that were built with R 3.6.0. Thus, when 
SparkR workers try to start up with the pre-built SparkR package, we see a 
failure. I'm not really sure what a good way to handle this is. Options include:
- Building the SparkR package using R 4.0.0 (we need to check whether that works 
with R 3.6)
- Copying the package from the driver (where it is usually built) and making the 
SparkR workers use the package installed on the driver

Any other ideas?
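
As a quick diagnostic (a sketch, assuming SparkR is installed on the machine 
being checked), the R version a given SparkR installation was built under can be 
compared against the R that is actually running, on both the driver and the 
workers:

{code}
# "Built" records the R version the package was installed under,
# e.g. "R 3.6.0; ; 2020-06-05 ...; unix"
packageDescription("SparkR")$Built

# the R version running this process
getRversion()
{code}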

> SparkR CRAN check gives a warning with R 4.0.0 on OSX
> -
>
> Key: SPARK-31918
> URL: https://issues.apache.org/jira/browse/SPARK-31918
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Shivaram Venkataraman
>Priority: Blocker
>
> When the SparkR package is run through a CRAN check (i.e. with something like 
> R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR 
> vignette as a part of the checks.
> However this seems to be failing with R 4.0.0 on OSX -- both on my local 
> machine and on CRAN 
> https://cran.r-project.org/web/checks/check_results_SparkR.html
> cc [~felixcheung]






[jira] [Commented] (SPARK-32008) 3.0.0 release build fails

2020-06-16 Thread Shivaram Venkataraman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137937#comment-17137937
 ] 

Shivaram Venkataraman commented on SPARK-32008:
---

It looks like the R vignette build failed, and judging by the error message this 
seems related to https://github.com/rstudio/rmarkdown/issues/1831 -- I think it 
should work fine if you use R version >= 3.6.
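
A quick way to confirm this in the R used for the release build (a sketch; 
`vignetteInfo` is the object that the rmarkdown code path in that issue looks 
up):

{code}
# tools::vignetteInfo is only exported from the tools namespace in R >= 3.6,
# so builds with older R hit the "not an exported object" error above.
getRversion()
"vignetteInfo" %in% getNamespaceExports("tools")
{code}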

> 3.0.0 release build fails
> -
>
> Key: SPARK-32008
> URL: https://issues.apache.org/jira/browse/SPARK-32008
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.0.0
>Reporter: Philipp Dallig
>Priority: Major
>
> Hi,
> I am trying to build the Spark 3.0.0 release myself.
> I got the following error.
> {code}  
> 20/06/16 15:20:49 WARN PrefixSpan: Input data is not cached.
> 20/06/16 15:20:50 WARN Instrumentation: [b307b568] regParam is zero, which 
> might cause numerical instability and overfitting.
> Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
> 'vignetteInfo' is not an exported object from 'namespace:tools'
> Execution halted
> {code}
> I can reproduce this error with a small Dockerfile.
> {code}
> FROM ubuntu:18.04 as builder
> ENV MVN_VERSION=3.6.3 \
> M2_HOME=/opt/apache-maven \
> MAVEN_HOME=/opt/apache-maven \
> MVN_HOME=/opt/apache-maven \
> 
> MVN_SHA512=c35a1803a6e70a126e80b2b3ae33eed961f83ed74d18fcd16909b2d44d7dada3203f1ffe726c17ef8dcca2dcaa9fca676987befeadc9b9f759967a8cb77181c0
>  \
> MAVEN_OPTS="-Xmx3g -XX:ReservedCodeCacheSize=1g" \
> R_HOME=/usr/lib/R \
> GIT_REPO=https://github.com/apache/spark.git \
> GIT_BRANCH=v3.0.0 \
> SPARK_DISTRO_NAME=hadoop3.2 \
> SPARK_LOCAL_HOSTNAME=localhost
> # Preparation
> RUN /usr/bin/apt-get update && \
> # APT
> INSTALL_PKGS="openjdk-8-jdk-headless git wget python3 python3-pip 
> python3-setuptools r-base r-base-dev pandoc pandoc-citeproc 
> libcurl4-openssl-dev libssl-dev libxml2-dev texlive qpdf language-pack-en" && 
> \
> DEBIAN_FRONTEND=noninteractive /usr/bin/apt-get -y install 
> --no-install-recommends $INSTALL_PKGS && \
> rm -rf /var/lib/apt/lists/* && \
> Rscript -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 
> 'testthat', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \
> # Maven
> /usr/bin/wget -nv -O apache-maven.tar.gz 
> "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=maven/maven-3/${MVN_VERSION}/binaries/apache-maven-${MVN_VERSION}-bin.tar.gz";
>  && \
> echo "${MVN_SHA512} apache-maven.tar.gz" > apache-maven.sha512 && \
> sha512sum --strict -c apache-maven.sha512 && \
> tar -xvzf apache-maven.tar.gz -C /opt && \
> rm -v apache-maven.sha512 apache-maven.tar.gz && \
> /bin/ln -vs /opt/apache-maven-${MVN_VERSION} /opt/apache-maven && \
> /bin/ln -vs /opt/apache-maven/bin/mvn /usr/bin/mvn
> # Spark Distribution Build
> RUN mkdir -p /workspace && \
> cd /workspace && \
> git clone --branch ${GIT_BRANCH} ${GIT_REPO} && \
> cd /workspace/spark && \
> ./dev/make-distribution.sh --name ${SPARK_DISTRO_NAME} --pip --r --tgz 
> -Psparkr -Phadoop-3.2 -Phive-2.3 -Phive-thriftserver -Pyarn -Pkubernetes
> {code}
> I would be very grateful for any help.






[jira] [Created] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0

2020-06-06 Thread Shivaram Venkataraman (Jira)
Shivaram Venkataraman created SPARK-31918:
-

 Summary: SparkR CRAN check gives a warning with R 4.0.0
 Key: SPARK-31918
 URL: https://issues.apache.org/jira/browse/SPARK-31918
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.4.6
Reporter: Shivaram Venkataraman


When the SparkR package is run through a CRAN check (i.e. with something like R 
CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR 
vignette as a part of the checks.

However this seems to be failing with R 4.0.0 on OSX -- both on my local 
machine and on CRAN 
https://cran.r-project.org/web/checks/check_results_SparkR.html

cc [~felixcheung]






[jira] [Updated] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX

2020-06-06 Thread Shivaram Venkataraman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-31918:
--
Summary: SparkR CRAN check gives a warning with R 4.0.0 on OSX  (was: 
SparkR CRAN check gives a warning with R 4.0.0)

> SparkR CRAN check gives a warning with R 4.0.0 on OSX
> -
>
> Key: SPARK-31918
> URL: https://issues.apache.org/jira/browse/SPARK-31918
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.6
>Reporter: Shivaram Venkataraman
>Priority: Major
>
> When the SparkR package is run through a CRAN check (i.e. with something like 
> R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR 
> vignette as a part of the checks.
> However this seems to be failing with R 4.0.0 on OSX -- both on my local 
> machine and on CRAN 
> https://cran.r-project.org/web/checks/check_results_SparkR.html
> cc [~felixcheung]






[jira] [Commented] (SPARK-24255) Require Java 8 in SparkR description

2018-11-15 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689038#comment-16689038
 ] 

Shivaram Venkataraman commented on SPARK-24255:
---

This is a great list -- I don't think we are able to handle all of these 
scenarios. [~kiszk], do you know of any existing library that parses all of these 
version strings?
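
For illustration, a hedged sketch (not SparkR's actual `checkJavaVersion`) of 
pulling the major version out of `java -version` output for both the legacy 
`1.8.0_144` style and the newer `11.0.2` style strings:

{code}
java_out <- system2("java", "-version", stdout = TRUE, stderr = TRUE)
version_line <- grep(" version ", java_out, value = TRUE, fixed = TRUE)[1]
version_str <- strsplit(version_line, '"')[[1]][2]   # e.g. "1.8.0_144" or "11.0.2"

major <- if (startsWith(version_str, "1.")) {
  as.integer(strsplit(version_str, ".", fixed = TRUE)[[1]][2])   # "1.8.0_144" -> 8
} else {
  as.integer(strsplit(version_str, "[.\\-+]")[[1]][1])           # "11.0.2"    -> 11
}
major
{code}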

> Require Java 8 in SparkR description
> 
>
> Key: SPARK-24255
> URL: https://issues.apache.org/jira/browse/SPARK-24255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> CRAN checks require that the Java version be set both in package description 
> and checked during runtime.






[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs

2018-10-26 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665254#comment-16665254
 ] 

Shivaram Venkataraman commented on SPARK-12172:
---

+1 -- I think if spark.lapply uses only one or two of these functions, we could 
even inline them.

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Task
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Major
>







[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR on Windows

2018-07-06 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535105#comment-16535105
 ] 

Shivaram Venkataraman commented on SPARK-24535:
---

Thanks [~felixcheung] -- I just got back to work today. Will take a look at the 
fix now.

> Fix java version parsing in SparkR on Windows
> -
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
>Priority: Blocker
> Fix For: 2.3.2, 2.4.0
>
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log






[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR

2018-06-15 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514020#comment-16514020
 ] 

Shivaram Venkataraman commented on SPARK-24535:
---

I was going to do it manually. If we can do it in the PR builder, that would be 
great!

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Priority: Major
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log






[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR

2018-06-15 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514012#comment-16514012
 ] 

Shivaram Venkataraman commented on SPARK-24535:
---

I'm not sure whether it's only failing on Windows -- the Debian test on CRAN did 
not have the same Java version. I can run a test on Windows later today to see 
what I find.
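
One hedged idea for the step that fails in the quoted trace (illustrative names, 
not a patch): guard the filtered `java -version` output so a missing version line 
produces a clear error instead of `subscript out of bounds`:

{code}
java_out <- system2("java", "-version", stdout = TRUE, stderr = TRUE)
javaVersionFilter <- Filter(function(x) grepl(" version", x, fixed = TRUE), java_out)

if (length(javaVersionFilter) == 0) {
  stop("Could not find a version line in `java -version` output:\n",
       paste(java_out, collapse = "\n"))
}
strsplit(javaVersionFilter[[1]], "[\"]")[[1]][2]
{code}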

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Priority: Major
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log






[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-14 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513384#comment-16513384
 ] 

Shivaram Venkataraman commented on SPARK-24359:
---

Yes -- that's what I meant, [~felixcheung].

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. 
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark.logistic.regression()), 10), 
> > 0.1){code}
> When calls need to be chained, like above example, syntax can nicely 
> translate to a natural pipeline style with help from very popular[ magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 classes that inherit from four basic classes. 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-14 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513001#comment-16513001
 ] 

Shivaram Venkataraman commented on SPARK-24359:
---

Sounds good. Thanks [~falaki].

[~felixcheung], on a related note, maybe we can formalize these 2.4.0.1 releases 
for SparkR as well -- i.e., releases where we only have changes in R code and the 
package is compatible with Spark 2.4.0 (we might need to revisit some of the code 
that figures out the Spark version based on the SparkR version). Shall I open a 
new JIRA for that?

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. 
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark.logistic.regression()), 10), 
> > 0.1){code}
> When calls need to be chained, like above example, syntax can nicely 
> translate to a natural pipeline style with help from very popular[ magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_reg

[jira] [Created] (SPARK-24535) Fix java version parsing in SparkR

2018-06-12 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-24535:
-

 Summary: Fix java version parsing in SparkR
 Key: SPARK-24535
 URL: https://issues.apache.org/jira/browse/SPARK-24535
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.1, 2.4.0
Reporter: Shivaram Venkataraman


We see errors on CRAN of the form 
{code:java}
  java version "1.8.0_144"
  Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
  Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
  Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
  -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
--
  subscript out of bounds
  1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
sparkConfig = sparkRTestConfig) at 
D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
  2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
sparkExecutorEnvMap, 
 sparkJars, sparkPackages)
  3: checkJavaVersion()
  4: strsplit(javaVersionFilter[[1]], "[\"]")
{code}

The complete log file is at 
http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log






[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-04 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501037#comment-16501037
 ] 

Shivaram Venkataraman commented on SPARK-24359:
---

If you have a separate repo, that makes it much cleaner to tag SparkML releases 
and test that they work with the existing Spark releases -- say by tagging them 
as 2.4.0.1, 2.4.0.2, etc. for every small change that needs to be made on the R 
side. If they are in the same repo, then the tag will apply to all other Spark 
changes at that point, making it harder to separate out just the R changes that 
went into this tag.

Also, this separate repo does not need to be permanent. If we find that the 
package is stable on CRAN, we can move it back into the main repo. I just think 
that for the first few releases on CRAN it'll be much easier if it's not tied to 
Spark releases.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. 
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-03 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499612#comment-16499612
 ] 

Shivaram Venkataraman commented on SPARK-24359:
---

I think where the code sits matters if we want to make SparkML releases more 
frequently than Spark releases. If we have a separate repo, then it's much easier 
/ cleaner to create releases more frequently.

[~josephkb] it won't be a separate project -- just a new repo in apache/, similar 
to how `spark-website` is right now. It will be maintained by the same set of 
committers and have the same JIRA, etc.

I'd just like us to understand the pros/cons of this approach vs. the current 
approach of tying releases to Spark releases, and list them out to make sure we 
are making the right call.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. 
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark.logistic.regression()), 10), 
> > 0.1){code}
> When calls

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487754#comment-16487754
 ] 

Shivaram Venkataraman commented on SPARK-24359:
---

I'd just like to echo the point on release and testing strategy raised by 
[~felixcheung]:
 * For a new CRAN package, tying it to the Spark release cycle can be especially 
challenging, as it takes a bunch of iterations to get things right.
 * This also leads to the question of how the SparkML package APIs are going to 
depend on Spark versions. Are we only going to have code that depends on older 
Spark releases, or are we going to have cases where we introduce the Java/Scala 
side code at the same time as the R API?
 * One more idea could be to have a new repo in Apache that has its own release 
cycle (like the spark-website repo).

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. 
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> constructors are dot separated (e.g., spark.logistic.regression()) and all 
> setters and getters are snake case (e.g., set_max_iter()). If a constructor 
> gets arguments, they will be named arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark.logistic.regression()), 10), 
> > 0.1){code}
> When calls need t

[jira] [Commented] (SPARK-24272) Require Java 8 for SparkR

2018-05-14 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16474546#comment-16474546
 ] 

Shivaram Venkataraman commented on SPARK-24272:
---

[~smilegator] - I opened https://issues.apache.org/jira/browse/SPARK-24255 for 
this. But I forgot to tag the PR with it. 

> Require Java 8 for SparkR
> -
>
> Key: SPARK-24272
> URL: https://issues.apache.org/jira/browse/SPARK-24272
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> This change updates the SystemRequirements and also includes a runtime check 
> if the JVM is being launched by R. The runtime check is done by querying java 
> -version






[jira] [Commented] (SPARK-24255) Require Java 8 in SparkR description

2018-05-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472817#comment-16472817
 ] 

Shivaram Venkataraman commented on SPARK-24255:
---

Resolved by https://github.com/apache/spark/pull/21278

> Require Java 8 in SparkR description
> 
>
> Key: SPARK-24255
> URL: https://issues.apache.org/jira/browse/SPARK-24255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Shivaram Venkataraman
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> CRAN checks require that the Java version be set both in package description 
> and checked during runtime.






[jira] [Resolved] (SPARK-24255) Require Java 8 in SparkR description

2018-05-11 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-24255.
---
   Resolution: Fixed
 Assignee: Shivaram Venkataraman
Fix Version/s: 2.4.0
   2.3.1

> Require Java 8 in SparkR description
> 
>
> Key: SPARK-24255
> URL: https://issues.apache.org/jira/browse/SPARK-24255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> CRAN checks require that the Java version be set both in package description 
> and checked during runtime.






[jira] [Created] (SPARK-24255) Require Java 8 in SparkR description

2018-05-11 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-24255:
-

 Summary: Require Java 8 in SparkR description
 Key: SPARK-24255
 URL: https://issues.apache.org/jira/browse/SPARK-24255
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Shivaram Venkataraman


CRAN checks require that the Java version be set both in package description 
and checked during runtime.
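
Both pieces can be inspected from R once SparkR is installed (a minimal sketch; 
the exact `SystemRequirements` wording in DESCRIPTION is whatever the fix settles 
on):

{code}
# the Java requirement declared in the package DESCRIPTION
packageDescription("SparkR")$SystemRequirements

# the Java actually found on the PATH at runtime
system2("java", "-version", stdout = TRUE, stderr = TRUE)
{code}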






[jira] [Commented] (SPARK-24152) Flaky Test: SparkR

2018-05-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461931#comment-16461931
 ] 

Shivaram Venkataraman commented on SPARK-24152:
---

If this is blocking all PRs, I think it's fine to temporarily remove the CRAN 
check from Jenkins -- we'll just need to be extra careful while merging SparkR 
PRs for a short period of time.

> Flaky Test: SparkR
> --
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> PR builder and master branch test fails with the following SparkR error with 
> unknown reason. The following is an error message from that.
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we have already started merging PRs while ignoring 
> this **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175






[jira] [Commented] (SPARK-24152) Flaky Test: SparkR

2018-05-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461920#comment-16461920
 ] 

Shivaram Venkataraman commented on SPARK-24152:
---

Unfortunately I don't have time to look at this until Friday. Do we know whether 
the problem is in SparkR or comes from some other package?

> Flaky Test: SparkR
> --
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> PR builder and master branch test fails with the following SparkR error with 
> unknown reason. The following is an error message from that.
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we have already started merging PRs while ignoring 
> this **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175






[jira] [Commented] (SPARK-24023) Built-in SQL Functions improvement in SparkR

2018-04-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448835#comment-16448835
 ] 

Shivaram Venkataraman commented on SPARK-24023:
---

Thanks [~hyukjin.kwon]. +1 to what Felix said. I think it's fine to have one 
sub-task be a collection of functions if they are all small and related.

> Built-in SQL Functions improvement in SparkR
> 
>
> Key: SPARK-24023
> URL: https://issues.apache.org/jira/browse/SPARK-24023
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA targets adding R functions corresponding to SPARK-23899. We have 
> usually been adding a function with Scala alone or with both Scala and Python 
> APIs.
> It could get messy if there are duplicates for the R side in SPARK-23899. A 
> follow-up for each JIRA might be possible, but that is again messy to manage.






[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2018-01-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308765#comment-16308765
 ] 

Shivaram Venkataraman commented on SPARK-16693:
---

Did we have the discussion on dev@? I think it's a good idea to remove this, but 
I just want to make sure we gave enough of a heads-up on dev@ and user@.

> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0






[jira] [Created] (SPARK-22889) CRAN checks can fail if older Spark install exists

2017-12-22 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-22889:
-

 Summary: CRAN checks can fail if older Spark install exists
 Key: SPARK-22889
 URL: https://issues.apache.org/jira/browse/SPARK-22889
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.1, 2.3.0
Reporter: Shivaram Venkataraman


Since all CRAN checks go through the same machine, if there is an older partial 
download or partial install of Spark left behind, the tests fail. One solution 
is to overwrite the install files when running the tests. 
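
A rough sketch of what that overwrite could look like in run-all.R (assuming 
install.spark()'s overwrite flag, and that it returns the directory it installed into):

{code}
library(SparkR)

# Force a fresh download/extract so a partial install left behind by an
# earlier CRAN check on this machine cannot poison the run.
install_dir <- install.spark(overwrite = TRUE)

sparkR.session(master = "local[2]", sparkHome = install_dir)
{code}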



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22315) Check for version match between R package and JVM

2017-11-06 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-22315.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 19624
[https://github.com/apache/spark/pull/19624]

> Check for version match between R package and JVM
> -
>
> Key: SPARK-22315
> URL: https://issues.apache.org/jira/browse/SPARK-22315
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Shivaram Venkataraman
> Fix For: 2.2.1, 2.3.0
>
>
> With the release of SparkR on CRAN we could have scenarios where users have a 
> newer version of package when compared to the Spark cluster they are 
> connecting to.
> We should print appropriate warnings on either (a) connecting to a different 
> version R Backend (b) connecting to a Spark master running a different 
> version of Spark (this should ideally happen inside Scala ?)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-11-01 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234372#comment-16234372
 ] 

Shivaram Venkataraman commented on SPARK-22344:
---

Right, I was considering that - but every time we run `check-cran.sh` in Jenkins 
or locally we are running with `NOT_CRAN` false and `SPARK_HOME` set ?

> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>Priority: Major
>
> When R CMD check is run on the SparkR package it leaves behind files in /tmp 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-31 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16233683#comment-16233683
 ] 

Shivaram Venkataraman commented on SPARK-22344:
---

So I started working on this, but it is a little non-trivial: if the user had 
supplied a `SPARK_HOME`, then we would end up deleting JARs from that 
directory ? 
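
The kind of guard I have in mind, as a sketch only (sparkCachePath() is the 
internal cache helper; the wrapper name is made up):

{code}
# Only clean up directories that live under SparkR's own download cache,
# never a user-supplied SPARK_HOME.
cleanup_spark_install <- function(install_dir) {
  cache_root <- normalizePath(sparkCachePath(), mustWork = FALSE)
  target <- normalizePath(install_dir, mustWork = FALSE)
  if (!nzchar(Sys.getenv("SPARK_HOME")) && startsWith(target, cache_root)) {
    unlink(target, recursive = TRUE)
  }
}
{code}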

> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>Priority: Major
>
> When R CMD check is run on the SparkR package it leaves behind files in /tmp 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-30 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225924#comment-16225924
 ] 

Shivaram Venkataraman commented on SPARK-22344:
---

To consider this narrow case of CRAN checks -- we explicitly call install.spark 
from run-all.R and that returns the directory where it put Spark. So we could 
just delete that returned directory when we exit ?
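
i.e. something like this in run-all.R, as a sketch, assuming install.spark() 
does return the directory it used:

{code}
library(testthat)
library(SparkR)

# install.spark() returns the directory it downloaded/extracted Spark into.
install_dir <- install.spark()

test_package("SparkR")

# Remove the cached distribution so the CRAN check machine is left clean.
unlink(install_dir, recursive = TRUE)
{code}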

> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>
> When R CMD check is run on the SparkR package it leaves behind files in /tmp 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-29 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224301#comment-16224301
 ] 

Shivaram Venkataraman commented on SPARK-22344:
---

Well, uninstall is just removing `sparkCachePath()/` -- should be 
relatively easy to put together ?

> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>
> When R CMD check is run on the SparkR package it leaves behind files in /tmp 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16223844#comment-16223844
 ] 

Shivaram Venkataraman commented on SPARK-22344:
---

The ~/.cache directory is created by `install.spark` -- so it's in our control whether we 
want to "uninstall" Spark at the end of the tests ?

> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>
> When R CMD check is run on the SparkR package it leaves behind files in /tmp 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-26 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220884#comment-16220884
 ] 

Shivaram Venkataraman commented on SPARK-22344:
---

Thanks for investigating. Is 
`/tmp/ubuntu/8201eb2c-8065-458c-b564-1e61b3dc5b7d/` a symlink to 
`/tmp/hive/ubuntu/8201eb2c-8065-458c-b564-1e61b3dc5b7d/`  or are they just 
different directories ? 

We can disable the hsperfdata files with the suggested flag and also change 
java.io.tmpdir, which should at least fix the block manager I think. I will open 
a PR for this.
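
For reference, roughly the session configuration the PR would amount to -- a 
sketch only, the exact wiring into the tests may differ:

{code}
library(SparkR)

# Keep all scratch space inside R's per-session temp dir (Rtmp...), and
# disable the JVM's hsperfdata_* files.
tmp <- tempdir()
sparkR.session(
  master = "local[2]",
  sparkConfig = list(
    spark.local.dir = tmp,
    spark.driver.extraJavaOptions =
      paste0("-Djava.io.tmpdir=", tmp, " -XX:-UsePerfData")
  )
)
{code}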

Regarding the Hive directories being created even though Hive support is off, I have 
no idea why that is happening. [~falaki] [~hyukjin.kwon] do you have any idea why 
this happens ?
 

> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>
> When R CMD check is run on the SparkR package it leaves behind files in /tmp 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-24 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-22344:
-

 Summary: Prevent R CMD check from using /tmp
 Key: SPARK-22344
 URL: https://issues.apache.org/jira/browse/SPARK-22344
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.2
Reporter: Shivaram Venkataraman


When R CMD check is run on the SparkR package it leaves behind files in /tmp 
which is a violation of CRAN policy. We should instead write to Rtmpdir. Notes 
from CRAN are below

{code}
Checking this leaves behind dirs

   hive/$USER
   $USER

and files named like

   b4f6459b-0624-4100-8358-7aa7afbda757_resources

in /tmp, in violation of the CRAN Policy.
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-10-24 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217310#comment-16217310
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

I created https://issues.apache.org/jira/browse/SPARK-22344

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22281) Handle R method breaking signature changes

2017-10-22 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214663#comment-16214663
 ] 

Shivaram Venkataraman commented on SPARK-22281:
---

Thanks for looking into this [~felixcheung]. Is there any way we can remove the 
`usage` entry as well from the Rd doc ? This might also be something to raise 
with the roxygen project for a longer-term solution.
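
One idea (purely illustrative, and assuming roxygen2's @usage NULL tag is 
acceptable here) would be to suppress the generated usage block entirely, so 
there is no signature left for codoc to compare against base R:

{code}
# Illustrative roxygen block only -- not SparkR's actual source. @usage NULL
# drops the generated \usage entry from the Rd file.
#' Attach a SparkDataFrame to the R search path
#' @param what a SparkDataFrame
#' @usage NULL
#' @rdname attach
NULL
{code}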

> Handle R method breaking signature changes
> --
>
> Key: SPARK-22281
> URL: https://issues.apache.org/jira/browse/SPARK-22281
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> As discussed here
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555
> this WARNING on R-devel
> * checking for code/documentation mismatches ... WARNING
> Codoc mismatches from documentation object 'attach':
> attach
>   Code: function(what, pos = 2L, name = deparse(substitute(what),
>  backtick = FALSE), warn.conflicts = TRUE)
>   Docs: function(what, pos = 2L, name = deparse(substitute(what)),
>  warn.conflicts = TRUE)
>   Mismatches in argument default values:
> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: 
> deparse(substitute(what))
> Checked the latest release, R 3.4.1, and the signature change wasn't there. 
> This likely indicates an upcoming change in the next R release that could 
> trigger this new warning when we attempt to publish the package.
> Not sure what we can do now since we work with multiple versions of R and 
> they will then have different signatures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22315) Check for version match between R package and JVM

2017-10-19 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-22315:
-

 Summary: Check for version match between R package and JVM
 Key: SPARK-22315
 URL: https://issues.apache.org/jira/browse/SPARK-22315
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.2.1
Reporter: Shivaram Venkataraman


With the release of SparkR on CRAN we could have scenarios where users have a 
newer version of package when compared to the Spark cluster they are connecting 
to.

We should print appropriate warnings on either (a) connecting to a different 
version R Backend (b) connecting to a Spark master running a different version 
of Spark (this should ideally happen inside Scala ?)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors

2017-10-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209927#comment-16209927
 ] 

Shivaram Venkataraman commented on SPARK-17902:
---

I think [~falaki] might have a test case that we could test against ?
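
Something along these lines would work as a regression check -- just a sketch; 
the asserts describe the intended behaviour that is currently being ignored:

{code}
library(SparkR)
sparkR.session(master = "local[2]")

df <- createDataFrame(iris)

with_factors <- collect(df, stringsAsFactors = TRUE)
without_factors <- collect(df, stringsAsFactors = FALSE)

# Species is a string column in Spark; the flag should decide whether it
# comes back as a factor or as character.
stopifnot(is.factor(with_factors$Species))
stopifnot(is.character(without_factors$Species))
{code}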

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems it is completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-10-12 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202260#comment-16202260
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

This is now live ! https://cran.r-project.org/web/packages/SparkR/

Marking this issue as resolved.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15799) Release SparkR on CRAN

2017-10-12 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-15799.
---
   Resolution: Fixed
 Assignee: Shivaram Venkataraman
Fix Version/s: 2.1.2

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22202) Release tgz content differences for python and R

2017-10-05 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193111#comment-16193111
 ] 

Shivaram Venkataraman commented on SPARK-22202:
---

I think the differences happen because we build the CRAN package from one of 
the Hadoop versions ? 

> Release tgz content differences for python and R
> 
>
> Key: SPARK-22202
> URL: https://issues.apache.org/jira/browse/SPARK-22202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SparkR
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> As a follow up to SPARK-22167, currently we are running different 
> profiles/steps in make-release.sh for hadoop2.7 vs hadoop2.6 (and others), we 
> should consider if these differences are significant and whether they should 
> be addressed.
> A couple of things:
> - R.../doc directory is not in any release jar except hadoop 2.6
> - python/dist, python.egg-info are not in any release jar except hadoop 2.7
> - R DESCRIPTION has a few additions
> I've checked to confirm these are the same in 2.1.1 release so this isn't a 
> regression.
> {code}
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/doc:
> sparkr-vignettes.Rmd
> sparkr-vignettes.R
> sparkr-vignettes.html
> index.html
> Only in spark-2.1.2-bin-hadoop2.7/python: dist
> Only in spark-2.1.2-bin-hadoop2.7/python/pyspark: python
> Only in spark-2.1.2-bin-hadoop2.7/python: pyspark.egg-info
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/DESCRIPTION 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/DESCRIPTION
> 25a26,27
> > NeedsCompilation: no
> > Packaged: 2017-10-03 00:42:30 UTC; holden
> 31c33
> < Built: R 3.4.1; ; 2017-10-02 23:18:21 UTC; unix
> ---
> > Built: R 3.4.1; ; 2017-10-03 00:45:27 UTC; unix
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR: doc
> diff -r spark-2.1.2-bin-hadoop2.7/R/lib/SparkR/html/00Index.html 
> spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/html/00Index.html
> 16a17
> > User guides, package vignettes and other 
> > documentation.
> Only in spark-2.1.2-bin-hadoop2.6/R/lib/SparkR/Meta: vignette.rds
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22063) Upgrade lintr to latest commit sha1 ID

2017-10-01 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187551#comment-16187551
 ] 

Shivaram Venkataraman commented on SPARK-22063:
---

[~shaneknapp] [~felixcheung] Let's move the discussion to the JIRA ? 

I think there are a couple of ways to address this issue -- the first, as 
[~hyukjin.kwon] pointed out, is to make the lint-r script do the installation. 
I am not much in favor of that, as it would result in the script modifying 
packages at runtime.

Instead I was thinking we could create R environments for each Spark version 
-- https://stackoverflow.com/questions/24283171/virtual-environment-in-r has a 
bunch of ideas on how to do this. Any thoughts on the approaches listed there ?
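
A minimal sketch of the per-version library idea using only base R's .libPaths 
(the SPARK_BRANCH variable and the directory layout are made up for illustration):

{code}
# One branch == one library, so upgrading lintr/testthat for one Spark
# version cannot break another version's build.
branch <- Sys.getenv("SPARK_BRANCH", "master")
lib <- file.path("/home/jenkins/R-libs", branch)
dir.create(lib, recursive = TRUE, showWarnings = FALSE)

# Put the branch-specific library first on the search path for this build.
.libPaths(c(lib, .libPaths()))

# Packages installed with an explicit lib= land only in that branch's library.
install.packages("lintr", lib = lib, repos = "https://cloud.r-project.org")
{code}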

> Upgrade lintr to latest commit sha1 ID
> --
>
> Key: SPARK-22063
> URL: https://issues.apache.org/jira/browse/SPARK-22063
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this 
> pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026])
>  and SPARK-14074.
> Today, I tried to upgrade the latest, 
> https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72
> This fixes many bugs and now finds many instances that I have observed and 
> thought should be caught time to time:
> {code}
> inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis 
> in a function call.
>   return (output)
>  ^
> R/column.R:241:1: style: Lines should not be more than 100 characters.
> #'
> \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{
> ^~~~
> R/context.R:332:1: style: Variable and function names should not be longer 
> than 30 characters.
> spark.getSparkFilesRootDirectory <- function() {
> ^~~~
> R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters.
> #' @param j,select expression for the single Column or a list of columns to 
> select from the SparkDataFrame.
> ^~~
> R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters.
> #' @return A new SparkDataFrame containing only the rows that meet the 
> condition with selected columns.
> ^~~
> R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a 
> function call.
>   return (joinRes)
>  ^
> R/DataFrame.R:2652:1: style: Variable and function names should not be longer 
> than 30 characters.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
> ^
> R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a 
> function call.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
>   ^
> R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a 
> function call.
> stop ("The following column name: ", newJoin, " occurs more than once 
> in the 'DataFrame'.",
>  ^
> R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters.
> #' @note The statistics provided by \code{summary} were change in 2.3.0 use 
> \link{describe} for previous defaults.
> ^~
> R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{cube} creates a single global 
> aggregate and is equivalent to
> ^~~
> R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{rollup} creates a single global 
> aggregate and is equivalent to
> ^
> R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a 
> function call.
>   switch (type,
>  ^
> R/functions.R:41:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{window}, it must be a time Column 
> of \code{TimestampType}.
> ^~~

[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-08-29 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146624#comment-16146624
 ] 

Shivaram Venkataraman commented on SPARK-21349:
---

Thanks for checking. In that case I don't think we can do much about this 
specific case. For RDDs created from the driver, it is inevitable that we need 
to ship the data to the executors. 

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, 
> SPARK-2185. Although this is just a warning message, this issue tries to make 
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.
> According to the Jenkins log, we also have 123 warnings even in our unit test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-08-29 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146585#comment-16146585
 ] 

Shivaram Venkataraman commented on SPARK-21349:
---

I think this might be because we create a ParallelCollectionRDD for the statement 
`(1 to (24*365*3))` -- the values are stored in the partitions of this RDD [1].
[~dongjoon] If you use fewer values (e.g. 1 to 100) or more partitions (I'm 
not sure how many partitions are created in this example), does the warning go 
away ?

[1] 
https://github.com/apache/spark/blob/e47f48c737052564e92903de16ff16707fae32c3/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L32

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, 
> SPARK-2185. Although this is just a warning message, this issue tries to make 
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.
> According to the Jenkins log, we also have 123 warnings even in our unit test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-08-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138602#comment-16138602
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

The email I got from CRAN is pasted below. There are three points there -- one 
about `attach`, one about the Description field and one about the vignettes.

{code}
Thanks, we see:


* checking R code for possible problems ... NOTE
Found the following calls to attach():
File 'SparkR/R/DataFrame.R':
  attach(newEnv, pos = pos, name = name, warn.conflicts = warn.conflicts)
See section 'Good practice' in '?attach'.

The Description field should not start with "The SparkR package". Simply start 
"Provides ".



and then

* checking re-building of vignette outputs ...
and nothing happens. Apparently you expect some installed Hadoop or Spark 
software for running the vignettes?

But there is no SystemRequirements field?


Please fix and resubmit.
{code}

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6809) Make numPartitions optional in pairRDD APIs

2017-07-26 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101900#comment-16101900
 ] 

Shivaram Venkataraman commented on SPARK-6809:
--

Yeah, I don't think this JIRA is applicable any more. We can close this.

> Make numPartitions optional in pairRDD APIs
> ---
>
> Key: SPARK-6809
> URL: https://issues.apache.org/jira/browse/SPARK-6809
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6809) Make numPartitions optional in pairRDD APIs

2017-07-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-6809.
--
Resolution: Not A Problem

> Make numPartitions optional in pairRDD APIs
> ---
>
> Key: SPARK-6809
> URL: https://issues.apache.org/jira/browse/SPARK-6809
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins

2017-07-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082775#comment-16082775
 ] 

Shivaram Venkataraman commented on SPARK-21367:
---

Is there any way to get more info on the pandoc error ? I think that is the root 
cause of the problem - "Error code 1" does not seem to turn up anything useful in 
searches.
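
A few things we could run on the worker to pull out more detail -- a sketch, 
assuming rmarkdown is available there and the vignette path is relative to R/pkg:

{code}
# Which pandoc is the build picking up, and what version is it?
Sys.which("pandoc")
rmarkdown::pandoc_available()
rmarkdown::pandoc_version()

# Re-knitting the vignette by hand usually surfaces the underlying error
# instead of the bare "Error code 1".
rmarkdown::render("vignettes/sparkr-vignettes.Rmd", quiet = FALSE)
{code}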

> R older version of Roxygen2 on Jenkins
> --
>
> Key: SPARK-21367
> URL: https://issues.apache.org/jira/browse/SPARK-21367
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: shane knapp
> Attachments: R.paks
>
>
> Getting this message from a recent build.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console
> Warning messages:
> 1: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> 2: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> * installing *source* package 'SparkR' ...
> ** R
> We have been running with 5.0.1 and haven't changed for a year.
> NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins

2017-07-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082708#comment-16082708
 ] 

Shivaram Venkataraman commented on SPARK-21367:
---

From [~felixcheung]'s earlier parsing of the error messages, my guess is that 
the knitr package probably has some issues working with roxygen2 5.0.0 ? In 
other words, I wonder if the problem is caused by incompatible versions 
across packages. 

[~shaneknapp] Could you list all the versions of packages installed right 
now on Jenkins ? We can then try to use that and see if we can piece together 
why this problem comes up.
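
Something like this run on each worker would give us the snapshot (the package 
list is just the handful relevant to doc builds):

{code}
# Snapshot of the doc/test toolchain versions on a worker.
pkgs <- c("roxygen2", "knitr", "rmarkdown", "evaluate", "devtools", "testthat")
ip <- as.data.frame(installed.packages()[, c("Package", "Version")],
                    stringsAsFactors = FALSE)
ip[ip$Package %in% pkgs, ]
{code}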

> R older version of Roxygen2 on Jenkins
> --
>
> Key: SPARK-21367
> URL: https://issues.apache.org/jira/browse/SPARK-21367
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: shane knapp
>
> Getting this message from a recent build.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console
> Warning messages:
> 1: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> 2: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> * installing *source* package 'SparkR' ...
> ** R
> We have been running with 5.0.1 and haven't changed for a year.
> NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins

2017-07-10 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081223#comment-16081223
 ] 

Shivaram Venkataraman commented on SPARK-21367:
---

[~dongjoon] Do you have a particular output file that we can use to debug ?

> R older version of Roxygen2 on Jenkins
> --
>
> Key: SPARK-21367
> URL: https://issues.apache.org/jira/browse/SPARK-21367
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: shane knapp
>
> Getting this message from a recent build.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console
> Warning messages:
> 1: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> 2: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> * installing *source* package 'SparkR' ...
> ** R
> We have been running with 5.0.1 and haven't changed for a year.
> NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-07-10 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080875#comment-16080875
 ] 

Shivaram Venkataraman commented on SPARK-21349:
---

Well, 100K is already too large IMHO, and I'm not sure adding another config 
property really helps things just to silence some log messages. Looking 
at the code, it seems that the larger task sizes mostly stem from the 
TaskMetrics objects getting bigger -- especially with a number of new SQL 
metrics being added. I think the right fix here is to improve the serialization 
of TaskMetrics (especially if the structure is empty, why bother sending 
anything at all to the worker ?)

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, 
> SPARK-2185. Although this is just a warning message, this issue tries to make 
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.
> According to the Jenkins log, we also have 123 warnings even in our unit test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins

2017-07-10 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080826#comment-16080826
 ] 

Shivaram Venkataraman commented on SPARK-21367:
---

I just checked the transitive dependencies and I think it should be fine to 
manually install roxygen 5.0.0

> R older version of Roxygen2 on Jenkins
> --
>
> Key: SPARK-21367
> URL: https://issues.apache.org/jira/browse/SPARK-21367
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>
> Getting this message from a recent build.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console
> Warning messages:
> 1: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> 2: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> * installing *source* package 'SparkR' ...
> ** R
> We have been running with 5.0.1 and haven't changed for a year.
> NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051094#comment-16051094
 ] 

Shivaram Venkataraman commented on SPARK-21093:
---

Thanks [~hyukjin.kwon] -- those are very useful debugging notes. In addition to 
filing the bug in R, I am wondering if there is something we can do in our 
SparkR code to mitigate this. Could we, say, add a sleep or pause before the 
gapply tests ? Or, in other words, do the pipes / sockets disappear after some 
time ?
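
By a sleep or pause I mean something as blunt as the sketch below in the test 
file -- purely illustrative; the real fix presumably belongs in R itself or in 
RRunner:

{code}
# Give the forked R workers from the previous gapply a moment to release
# their pipes/sockets before starting the next one.
gapply_with_pause <- function(df, pause = 2) {
  Sys.sleep(pause)
  collect(gapply(df, "a", function(key, x) { x }, schema(df)))
}
{code}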

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>
> On Centos 7.2.1511 with R 3.4.0/3.3.0, multiple execution of {{gapply}} looks 
> failed as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x19

[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-14 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049361#comment-16049361
 ] 

Shivaram Venkataraman commented on SPARK-21093:
---

So it looks like the R worker process crashes on CentOS and that leads to the 
task failures. I think the only way to debug this might be to get a core dump 
from the R process, attach gdb to it and see the stack trace at the time of the 
crash ?

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>
> On Centos 7.2.1511 with R 3.4.0/3.3.0, multiple execution of {{gapply}} looks 
> failed as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1a

[jira] [Assigned] (SPARK-20877) Investigate if tests will time out on CRAN

2017-05-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-20877:
-

Assignee: Felix Cheung

> Investigate if tests will time out on CRAN
> --
>
> Key: SPARK-20877
> URL: https://issues.apache.org/jira/browse/SPARK-20877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20877) Investigate if tests will time out on CRAN

2017-05-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-20877.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18104
[https://github.com/apache/spark/pull/18104]

> Investigate if tests will time out on CRAN
> --
>
> Key: SPARK-20877
> URL: https://issues.apache.org/jira/browse/SPARK-20877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20877) Investigate if tests will time out on CRAN

2017-05-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027870#comment-16027870
 ] 

Shivaram Venkataraman commented on SPARK-20877:
---

I managed to get the tests to pass on a Windows VM (note that the VM has only 1 
core and 4G of memory). The timing breakdown is at 
https://gist.github.com/shivaram/dc235c50b6369cbc60d859c25b13670d and the 
overall run time was close to 1 hr. I think AppVeyor might have a beefier 
machine ?

Anyway, the most expensive tests remain the same across Linux and 
Windows -- I think we can disable them when running on CRAN / Windows ? Are 
there other options to make these tests run faster ?
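
A sketch of the kind of helper the slow suites could call at the top of each 
test (testthat's skip() is real, the NOT_CRAN check follows testthat's 
convention, the helper name is made up):

{code}
skip_expensive_tests <- function() {
  on_cran <- !identical(tolower(Sys.getenv("NOT_CRAN")), "true")
  on_windows <- .Platform$OS.type == "windows"
  if (on_cran || on_windows) {
    testthat::skip("expensive test skipped on CRAN and on Windows")
  }
}
{code}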

> Investigate if tests will time out on CRAN
> --
>
> Key: SPARK-20877
> URL: https://issues.apache.org/jira/browse/SPARK-20877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20877) Investigate if tests will time out on CRAN

2017-05-25 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025878#comment-16025878
 ] 

Shivaram Venkataraman commented on SPARK-20877:
---

I've been investigating the breakdown of time taken by each test case by using 
the List reporter in testthat 
(https://github.com/hadley/testthat/blob/master/R/reporter-list.R#L7). The 
relevant code change in run-all.R was
{code}
res <- test_package("SparkR", reporter = "list")  # list reporter records per-test timings
sink(stderr(), type = "output")                   # redirect output so the table is not swallowed
write.table(res, sep = ",")                       # dump the timing table as CSV
sink(NULL, type = "output")                       # restore normal output
{code}
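
The dump can then be sorted so the slowest cases come first with a couple of 
lines like these (column names are whatever the list reporter produced, so 
treat this as a sketch):

{code}
timings <- as.data.frame(res)
# Longest-running test cases first; "real" is wall-clock seconds per test.
timings <- timings[order(timings$real, decreasing = TRUE),
                   c("file", "test", "real")]
head(timings, 20)
{code}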

The results from running `./R/check-cran.sh` on my Mac are at 
https://gist.github.com/shivaram/2923bc8535b3d71e710aa760935a2c0e -- the table 
is sorted by time taken, with the longest-running tests first. I am trying to 
get a similar table from a Windows VM to see how similar or different it looks. 
A couple of takeaways: 

- The gapply tests and a few of the MLlib tests dominate the overall time taken.
- I think the Windows runs might be slower because we don't use daemons on 
Windows. My current guess is that this, coupled with the 200 reducers we use by 
default for some of the group-by tests, leads to the slower runs on Windows. 
Getting better timings would help verify this.

> Investigate if tests will time out on CRAN
> --
>
> Key: SPARK-20877
> URL: https://issues.apache.org/jira/browse/SPARK-20877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20727) Skip SparkR tests when missing Hadoop winutils on CRAN windows machines

2017-05-12 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-20727:
-

 Summary: Skip SparkR tests when missing Hadoop winutils on CRAN 
windows machines
 Key: SPARK-20727
 URL: https://issues.apache.org/jira/browse/SPARK-20727
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.1, 2.2.0
Reporter: Shivaram Venkataraman


We should skip tests that use the Hadoop libraries while running
the CRAN check with Windows as the operating system. This is to handle
cases where the Hadoop winutils binaries are not available on the target
system. The skipped tests will consist of:
1. Tests that save and load a model in MLlib
2. Tests that save and load CSV, JSON and Parquet files in SQL
3. Hive tests

Note that these tests will still be run on AppVeyor for every PR, so our 
overall test coverage should not go down



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20613) Double quotes in Windows batch script

2017-05-05 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-20613:
-

Assignee: Jarrett Meyer

> Double quotes in Windows batch script
> -
>
> Key: SPARK-20613
> URL: https://issues.apache.org/jira/browse/SPARK-20613
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.1.1
>Reporter: Jarrett Meyer
>Assignee: Jarrett Meyer
> Fix For: 2.1.2, 2.2.0, 2.3.0
>
>
> This is a new issue in version 2.1.1. This problem was not present in 2.1.0.
> In {{bin/spark-class2.cmd}}, both the line that sets the {{RUNNER}} and the 
> line that invokes the {{RUNNER}} have quotes. This opens and closes the quote 
> immediately, producing something like
> {code}
> RUNNER=""C:\Program Files (x86)\Java\jre1.8.0_131\bin\java""
>ab  c
> {code}
> The quote above {{a}} opens the quote. The quote above {{b}} closes the 
> quote. This creates a space at position {{c}}, which is invalid syntax.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19237) SparkR package on Windows waiting for a long time when no java is found launching spark-submit

2017-03-21 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-19237:
-

Assignee: Felix Cheung

> SparkR package on Windows waiting for a long time when no java is found 
> launching spark-submit
> --
>
> Key: SPARK-19237
> URL: https://issues.apache.org/jira/browse/SPARK-19237
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
> When installing SparkR as an R package (install.packages) on Windows, it will 
> check for a Spark distribution and automatically download and cache it. But if 
> there is no Java runtime on the machine, spark-submit will just hang.
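
A minimal sketch of a fail-fast check before handing off to spark-submit (the 
function name is illustrative; the real fix may live elsewhere in the launch 
path):

{code}
# Stop early with a clear message instead of hanging when no JVM is present.
checkJavaInstalled <- function() {
  javaHome <- Sys.getenv("JAVA_HOME")
  javaBin <- if (nzchar(javaHome)) file.path(javaHome, "bin", "java") else "java"
  status <- suppressWarnings(
    system2(javaBin, "-version", stdout = FALSE, stderr = FALSE)
  )
  if (status != 0) {
    stop("Java is required to run SparkR but was not found. ",
         "Please install a JRE and/or set JAVA_HOME.")
  }
  invisible(TRUE)
}
{code}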



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19237) SparkR package on Windows waiting for a long time when no java is found launching spark-submit

2017-03-21 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-19237.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16596
[https://github.com/apache/spark/pull/16596]

> SparkR package on Windows waiting for a long time when no java is found 
> launching spark-submit
> --
>
> Key: SPARK-19237
> URL: https://issues.apache.org/jira/browse/SPARK-19237
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
> When installing SparkR as an R package (install.packages) on Windows, it will 
> check for a Spark distribution and automatically download and cache it. But if 
> there is no Java runtime on the machine, spark-submit will just hang.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893736#comment-15893736
 ] 

Shivaram Venkataraman commented on SPARK-19796:
---

I think (a) is worth exploring in a new JIRA -- We should try to avoid sending 
data that we don't need on the executors during task execution.

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Blocker
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19387) CRAN tests do not run with SparkR source package

2017-02-14 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-19387.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16720
[https://github.com/apache/spark/pull/16720]

> CRAN tests do not run with SparkR source package
> 
>
> Key: SPARK-19387
> URL: https://issues.apache.org/jira/browse/SPARK-19387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
> It looks like sparkR.session() is not installing Spark - as a result, running 
> R CMD check --as-cran SparkR_*.tar.gz fails, blocking possible submission to 
> CRAN.
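
For context, a sketch of the CRAN-friendly bootstrap the check needs; 
install.spark() can download and cache a Spark distribution so the session has 
something to run against:

{code}
library(SparkR)

if (!nzchar(Sys.getenv("SPARK_HOME"))) {
  install.spark()   # download and cache a Spark distribution
}
sparkR.session(master = "local[1]", enableHiveSupport = FALSE)
{code}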



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19568) Must include class/method documentation for CRAN check

2017-02-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865022#comment-15865022
 ] 

Shivaram Venkataraman commented on SPARK-19568:
---

Is it possible to add this as a part of nightly builds ? That might be a better 
way to test things than kicking off individual release builds

> Must include class/method documentation for CRAN check
> --
>
> Key: SPARK-19568
> URL: https://issues.apache.org/jira/browse/SPARK-19568
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> While tests are running, R CMD check --as-cran is still complaining
> {code}
> * checking for missing documentation entries ... WARNING
> Undocumented code objects:
>   ‘add_months’ ‘agg’ ‘approxCountDistinct’ ‘approxQuantile’ ‘arrange’
>   ‘array_contains’ ‘as.DataFrame’ ‘as.data.frame’ ‘asc’ ‘ascii’ ‘avg’
>   ‘base64’ ‘between’ ‘bin’ ‘bitwiseNOT’ ‘bround’ ‘cache’ ‘cacheTable’
>   ‘cancelJobGroup’ ‘cast’ ‘cbrt’ ‘ceil’ ‘clearCache’ ‘clearJobGroup’
>   ‘collect’ ‘colnames’ ‘colnames<-’ ‘coltypes’ ‘coltypes<-’ ‘column’
>   ‘columns’ ‘concat’ ‘concat_ws’ ‘contains’ ‘conv’ ‘corr’ ‘count’
>   ‘countDistinct’ ‘cov’ ‘covar_pop’ ‘covar_samp’ ‘crc32’
>   ‘createDataFrame’ ‘createExternalTable’ ‘createOrReplaceTempView’
>   ‘crossJoin’ ‘crosstab’ ‘cume_dist’ ‘dapply’ ‘dapplyCollect’
>   ‘date_add’ ‘date_format’ ‘date_sub’ ‘datediff’ ‘dayofmonth’
>   ‘dayofyear’ ‘decode’ ‘dense_rank’ ‘desc’ ‘describe’ ‘distinct’ ‘drop’
> ...
> {code}
> This is because of lack of .Rd files in a clean environment when running 
> against the content of the R source package.
> I think we need to generate the .Rd files under man\ when building the 
> release and then package with them.
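
A sketch of that generation step, assuming devtools/roxygen2 are available at 
packaging time:

{code}
# Regenerate man/*.Rd from the roxygen2 comments so the source package ships them.
library(devtools)
devtools::document(pkg = "R/pkg", roclets = c("rd"))
{code}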



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19571) appveyor windows tests are failing

2017-02-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864033#comment-15864033
 ] 

Shivaram Venkataraman commented on SPARK-19571:
---

cc [~hyukjin.kwon]

> appveyor windows tests are failing
> --
>
> Key: SPARK-19571
> URL: https://issues.apache.org/jira/browse/SPARK-19571
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>
> Between 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/751-master
> https://github.com/apache/spark/commit/7a7ce272fe9a703f58b0180a9d2001ecb5c4b8db
> And
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/758-master
> https://github.com/apache/spark/commit/c618ccdbe9ac103dfa3182346e2a14a1e7fca91a
> Something is changed (not likely caused by R) such that tests running on 
> Windows are consistently failing with
> {code}
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 
> C:\Users\appveyor\AppData\Local\Temp\1\spark-75266bb9-bd54-4ee2-ae54-2122d2c011e8\metastore.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown
>  Source)
>   at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown 
> Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown
>  Source)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
>   at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown 
> Source)
>   at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown 
> Source)
> {code}
> Since we run AppVeyor only when there are R changes, it is a bit harder to 
> track down which change specifically caused this.
> We also can't run appveyor on branch-2.1, so it could also be broken there.
> This could be a blocker, since it could fail tests for the R release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19487) Low latency execution for Spark

2017-02-06 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-19487:
-

 Summary: Low latency execution for Spark
 Key: SPARK-19487
 URL: https://issues.apache.org/jira/browse/SPARK-19487
 Project: Spark
  Issue Type: Umbrella
  Components: ML, Scheduler, Structured Streaming
Affects Versions: 2.1.0
Reporter: Shivaram Venkataraman


This JIRA tracks the design discussion for supporting low latency execution in 
Apache Spark. The motivation for this comes from need to support lower latency 
stream processing and lower latency iterations for sparse ML workloads.

Overview of proposed design (in the format of Spark Improvement Proposal) is at 
https://docs.google.com/document/d/1m_q83DjQcWQonEz4IsRUHu4QSjcDyqRpl29qE4LJc4s/edit?usp=sharing

Source code prototype is at: https://github.com/amplab/drizzle-spark

Lets use this JIRA to discuss high level design and we can create subtasks as 
we break this down into smaller PRs.

This is joint work with [~kayousterhout]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19486) Investigate using multiple threads for task serialization

2017-02-06 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-19486:
-

 Summary: Investigate using multiple threads for task serialization
 Key: SPARK-19486
 URL: https://issues.apache.org/jira/browse/SPARK-19486
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Shivaram Venkataraman


This is related to SPARK-18890, where all the serialization logic is moved into 
the Scheduler backend thread. As a follow on to this we can investigate using a 
thread pool to serialize a number of tasks together instead of using a single 
thread to serialize all of them.

Note that this may not yield sufficient benefits unless the driver has enough 
cores and we don't run into contention across threads. We can first investigate 
potential benefits and if there are sufficient benefits we can create a PR for 
this.

cc [~kayousterhout]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19485) Launch tasks async i.e. dont wait for the network

2017-02-06 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-19485:
-

 Summary: Launch tasks async i.e. dont wait for the network
 Key: SPARK-19485
 URL: https://issues.apache.org/jira/browse/SPARK-19485
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Shivaram Venkataraman


Currently the scheduling thread in CoarseGrainedSchedulerBackend is used to 
both walk through the list of offers and to serialize, create RPCs and send 
messages over the network.

For stages with large number of tasks we can avoid blocking on RPCs / 
serialization by moving that to a separate thread in CGSB. As a part of this 
JIRA we can first investigate the potential benefits of doing this for 
different kinds of jobs (one large stage, many independent small stages etc.) 
and then propose a code change.

cc [~kayousterhout]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18131) Support returning Vector/Dense Vector from backend

2017-01-31 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847473#comment-15847473
 ] 

Shivaram Venkataraman commented on SPARK-18131:
---

Hmm - this is tricky. We ran into a similar issue in SQL and we added reader and 
writer objects in SQL that are registered with the SerDe in core. See 
https://github.com/apache/spark/blob/ce112cec4f9bff222aa256893f94c316662a2a7e/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L39
for how we did that. We could do a similar thing in MLlib as well ? 

cc [~mengxr]

> Support returning Vector/Dense Vector from backend
> --
>
> Key: SPARK-18131
> URL: https://issues.apache.org/jira/browse/SPARK-18131
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> For `spark.logit`, there is a `probabilityCol`, which is a vector in the 
> backend (scala side). When we do collect(select(df, "probabilityCol")), 
> backend returns the java object handle (memory address). We need to implement 
> a method to convert a Vector/Dense Vector column as R vector, which can be 
> read in SparkR. It is a followup JIRA of adding `spark.logit`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19324) JVM stdout output is dropped in SparkR

2017-01-27 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-19324.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Resolved by https://github.com/apache/spark/pull/16670

> JVM stdout output is dropped in SparkR
> --
>
> Key: SPARK-19324
> URL: https://issues.apache.org/jira/browse/SPARK-19324
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
> Whenever there are stdout outputs from Spark in JVM (typically when calling 
> println()) they are dropped by SparkR.
> For example, explain() for Column
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend

2017-01-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827237#comment-15827237
 ] 

Shivaram Venkataraman commented on SPARK-16578:
---

I think its fine to remove the target version for this - As [~mengxr] said its 
not clear what the requirements are for this deploy mode or what kind of 
applications will use this etc. If somebody puts that together we can retarget ?

> Configurable hostname for RBackend
> --
>
> Key: SPARK-16578
> URL: https://issues.apache.org/jira/browse/SPARK-16578
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Junyang Qian
>
> One of the requirements that comes up with SparkR being a standalone package 
> is that users can now install just the R package on the client side and 
> connect to a remote machine which runs the RBackend class.
> We should check if we can support this mode of execution and what are the 
> pros / cons of it



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-01-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827066#comment-15827066
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

The work on this is mostly complete - But you can put me down as a Shepherd - 
if required I can create more sub-tasks etc.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19232) SparkR distribution cache location is wrong on Windows

2017-01-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-19232.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16590
[https://github.com/apache/spark/pull/16590]

> SparkR distribution cache location is wrong on Windows
> --
>
> Key: SPARK-19232
> URL: https://issues.apache.org/jira/browse/SPARK-19232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Trivial
> Fix For: 2.1.1, 2.2.0
>
>
> On Linux:
> {code}
> ~/.cache/spark# ls -lart
> total 12
> drwxr-xr-x 12  500  500 4096 Dec 16 02:18 spark-2.1.0-bin-hadoop2.7
> drwxr-xr-x  3 root root 4096 Dec 18 00:03 ..
> drwxr-xr-x  3 root root 4096 Dec 18 00:06 .
> {code}
> On Windows:
> {code}
> C:\Users\felix\AppData\Local\spark\spark\Cache
> 01/13/2017  11:25 AM  spark-2.1.0-bin-hadoop2.7
> 01/13/2017  11:25 AM33,471,940 spark-2.1.0-bin-hadoop2.7.tgz
> {code}
> If we follow https://pypi.python.org/pypi/appdirs, appauthor should be 
> "Apache"?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19221) Add winutils binaries to Path in AppVeyor for Hadoop libraries to call native libraries properly

2017-01-14 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-19221.
---
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/16584

> Add winutils binaries to Path in AppVeyor for Hadoop libraries to call native 
> libraries properly
> 
>
> Key: SPARK-19221
> URL: https://issues.apache.org/jira/browse/SPARK-19221
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, SparkR
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> It seems Hadoop libraries need {{hadoop.dll}} for native libraries in the 
> path. It is not a problem in tests for now because we are only testing SparkR 
> on Windows via AppVeyor but it can be a problem if we run Scala tests via 
> AppVeyor as below:
> {code}
>  - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 
> seconds, 937 milliseconds)
>org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
> Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. 
> org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:609)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:599)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:159)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>at org.scalatest.Transformer.apply(Transformer.scala:22)
>at org.scalatest.Transformer.apply(Transformer.scala:20)
>at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>at scala.collection.immutable.List.foreach(List.scala:381)
>at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>at org.scalatest.Suite$class.run(Suite.scala:1424)
>at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>at org.scalatest.BeforeA

[jira] [Resolved] (SPARK-18335) Add a numSlices parameter to SparkR's createDataFrame

2017-01-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-18335.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16512
[https://github.com/apache/spark/pull/16512]

> Add a numSlices parameter to SparkR's createDataFrame
> -
>
> Key: SPARK-18335
> URL: https://issues.apache.org/jira/browse/SPARK-18335
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> SparkR's createDataFrame doesn't have a `numSlices` parameter. The user 
> cannot set a partition number when converting a large R dataframe to SparkR 
> dataframe. A workaround is using `repartition`, but it requires a shuffle 
> stage. It's better to support the `numSlices` parameter in the 
> `createDataFrame` method.
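
A hedged usage sketch: the parameter is proposed here as numSlices, and the 
sketch below assumes it ends up exposed on createDataFrame as numPartitions.

{code}
localDF <- data.frame(x = runif(1e6))

# Control the partition count at creation time instead of repartition()-ing later.
df <- createDataFrame(localDF, numPartitions = 16L)
nrow(df)
{code}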



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19177) SparkR Data Frame operation between columns elements

2017-01-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821986#comment-15821986
 ] 

Shivaram Venkataraman commented on SPARK-19177:
---

Thanks for the example - this is very useful. Just to confirm the problem here 
is in dealing with schema objects and specifically being able to append and / 
or optionally mutate the schema ? I'll update the JIRA title appropriately and 
we'll work on a fix for this.

> SparkR Data Frame operation between columns elements
> 
>
> Key: SPARK-19177
> URL: https://issues.apache.org/jira/browse/SPARK-19177
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
>Reporter: Vicente Masip
>Priority: Minor
>  Labels: schema, sparkR, struct
>
> I have commented this in other thread, but I think it can be important to 
> clarify that:
> What happens when you are working with 50 columns and gapply? Do I rewrite the 
> 50-column schema to add the new column from the gapply operation? I think there is 
> no alternative because structFields cannot be appended to a structType. Any 
> suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?

2017-01-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819446#comment-15819446
 ] 

Shivaram Venkataraman commented on SPARK-18823:
---

Yeah I think it makes sense to not handle the case where we take a local 
vector. However adding support for `[` and `[[` to support literals and 
existing columns would be good. This is the only item remaining from what is 
summarized as #1 above I think ?
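
A short sketch of the usage under discussion (assuming `[[<-` ends up accepting 
both an existing Column and a literal on the right-hand side):

{code}
df <- createDataFrame(faithful)
colName <- "waiting2"

df[[colName]] <- df$waiting * 2   # assign from an existing Column
df[["flag"]] <- lit(1L)           # assign a literal value

head(select(df, "waiting", colName, "flag"))
{code}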

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or can be done with some function:
> Sometimes it is very important to assign something to a column whose name has to 
> be accessed through a variable. Normally, I have always used double 
> brackets like this, outside of SparkR:
> # df could be the faithful data set as a normal data frame or data table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even column number
> df[[2]] <- df$eruptions
> The error is not caused by the right-hand side of the "<-" assignment operator. 
> The problem is that I can't assign to a column name using a variable or 
> column number as I do in these examples outside of Spark. It doesn't matter if I am 
> modifying or creating a column. Same problem.
> I have also tried to use this with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19130) SparkR should support setting and adding new column with singular value implicitly

2017-01-11 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-19130.
---
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 2.2.0
   2.1.1

Resolved by https://github.com/apache/spark/pull/16510

> SparkR should support setting and adding new column with singular value 
> implicitly
> --
>
> Key: SPARK-19130
> URL: https://issues.apache.org/jira/browse/SPARK-19130
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
> for parity with frameworks like dplyr



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19177) SparkR Data Frame operation between columns elements

2017-01-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819260#comment-15819260
 ] 

Shivaram Venkataraman commented on SPARK-19177:
---

Thanks [~masip85] - Can you include a small code snippet that shows the problem 
?

> SparkR Data Frame operation between columns elements
> 
>
> Key: SPARK-19177
> URL: https://issues.apache.org/jira/browse/SPARK-19177
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
>Reporter: Vicente Masip
>Priority: Minor
>  Labels: schema, sparkR, struct
>
> I have commented this in other thread, but I think it can be important to 
> clarify that:
> What happens when you are working with 50 columns and gapply? Do I rewrite the 
> 50-column schema to add the new column from the gapply operation? I think there is 
> no alternative because structFields cannot be appended to a structType. Any 
> suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19108) Broadcast all shared parts of tasks (to reduce task serialization time)

2017-01-06 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15806774#comment-15806774
 ] 

Shivaram Venkataraman commented on SPARK-19108:
---

+1 - This is a good idea. One thing I'd like to add is that it might be better 
to create one broadcast rather than two broadcasts for sake of efficiency. For 
each broadcast variable we contact the driver to get location information and 
then initiate some fetches -- Thus to keep the number of messages lower having 
one broadcast variable will be better.

> Broadcast all shared parts of tasks (to reduce task serialization time)
> ---
>
> Key: SPARK-19108
> URL: https://issues.apache.org/jira/browse/SPARK-19108
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Kay Ousterhout
>
> Expand the amount of information that's broadcasted for tasks, to avoid 
> serializing data per-task that should only be sent to each executor once for 
> the entire stage.
> Conceptually, this means we'd have new classes  specially for sending the 
> minimal necessary data to the executor, like:
> {code}
> /**
>   * metadata about the taskset needed by the executor for all tasks in this 
> taskset.  Subset of the
>   * full data kept on the driver to make it faster to serialize and send to 
> executors.
>   */
> class ExecutorTaskSetMeta(
>   val stageId: Int,
>   val stageAttemptId: Int,
>   val properties: Properties,
>   val addedFiles: Map[String, String],
>   val addedJars: Map[String, String]
>   // maybe task metrics here?
> )
> class ExecutorTaskData(
>   val partitionId: Int,
>   val attemptNumber: Int,
>   val taskId: Long,
>   val taskBinary: Broadcast[Array[Byte]],
>   val taskSetMeta: Broadcast[ExecutorTaskSetMeta]
> )
> {code}
> Then all the info you'd need to send to the executors would be a serialized 
> version of ExecutorTaskData.  Furthermore, given the simplicity of that 
> class, you could serialize manually, and then for each task you could just 
> modify the first two ints & one long directly in the byte buffer.  (You could 
> do the same trick for serialization even if ExecutorTaskSetMeta was not a 
> broadcast, but that will keep the msgs small as well.)
> There a bunch of details I'm skipping here: you'd also need to do some 
> special handling for the TaskMetrics; the way tasks get started in the 
> executor would change; you'd also need to refactor {{Task}} to let it get 
> reconstructed from this information (or add more to ExecutorTaskSetMeta); and 
> probably other details I'm overlooking now.
> (this is copied from SPARK-18890 and [~imranr]'s comment there; cc 
> [~shivaram])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2016-12-22 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15771602#comment-15771602
 ] 

Shivaram Venkataraman commented on SPARK-18924:
---

Yeah I think we could convert this JIRA into a few sub-tasks - the first one 
could be profiling some of the existing code to get a breakdown of how much 
time is spent where. The next one could be the JVM side changes like boxing / 
unboxing improvements etc. 
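
For the profiling sub-task, a minimal sketch with base R's Rprof should be enough 
to get a first breakdown of where the round-trip time goes:

{code}
library(SparkR)
sparkR.session(master = "local[2]")

x <- data.frame(x = runif(1e5))

Rprof("serde-profile.out", interval = 0.01)
localDF <- collect(createDataFrame(x))
Rprof(NULL)

# The top self-time entries point at the hot spots in serialize.R / deserialize.R.
head(summaryRprof("serde-profile.out")$by.self, 10)
{code}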

> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * 
> [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between JVM and R suffers from huge storage and computation 
> overhead. For example, a short round trip of 1 million doubles surprisingly 
> took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(100)
>user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages. It is also unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe in R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns 
> in particular.
> 3. Some overhead in the serialization protocol/impl.
> 1) might be discussed before because R packages like rJava exist before 
> SparkR. I'm not sure whether we have a license issue in depending on those 
> libraries. Another option is to switch to low-level R'C interface or Rcpp, 
> which again might have license issue. I'm not an expert here. If we have to 
> implement our own, there still exist much space for improvement, discussed 
> below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
> which collects rows to local and then constructs columns. However,
> * it ignores column types and results in boxing/unboxing overhead
> * it collects all objects to the driver and results in high GC pressure
> A relatively simple change is to implement specialized column builder based 
> on column types, primitive types in particular. We need to handle null/NA 
> values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read the entire vector 
> in a single method call. The speed seems reasonable (at the order of GB/s):
> {code}
> > x <- runif(1000) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>user  system elapsed
>   0.036   0.021   0.059
> > > system.time(y <- readBin(r, double(), 1000))
>user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in 
> general, it should be feasible to obtain 20x or more performance gain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18836) Serialize Task Metrics once per stage

2016-12-20 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-18836:
--
Fix Version/s: (was: 1.3.0)
   2.2.0

> Serialize Task Metrics once per stage
> -
>
> Key: SPARK-18836
> URL: https://issues.apache.org/jira/browse/SPARK-18836
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 2.2.0
>
>
> Right now we serialize the empty task metrics once per task -- Since this is 
> shared across all tasks we could use the same serialized task metrics across 
> all tasks of a stage



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2016-12-19 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762548#comment-15762548
 ] 

Shivaram Venkataraman commented on SPARK-18924:
---

This is a good thing to investigate - Just to provide some historical context, 
the functions in serialize.R and deserialize.R were primarily designed to 
enable function calls between the JVM and R, and serialization performance is 
less critical there as it's mostly just function names, arguments etc. 

For the data path we were originally using R's own serializer and deserializer, 
but that doesn't work if we want to parse the data in the JVM. So the whole 
dfToCols was a retrofit to make things work.

In terms of design options:
- I think removing the boxing / unboxing overheads and making `readIntArray` or 
`readStringArray` in R more efficient would be a good starting point (a small 
sketch follows after this list)
- In terms of using other packages - there are licensing questions and also 
usability questions. So far users mostly don't require any extra R package to 
use SparkR and hence we are compatible across a bunch of R versions etc. So I 
think we should first look at the points about how we can make our existing 
architecture better
- If the bottleneck is due to R function call overheads after the above changes 
we can explore writing a C module (similar to our old hashCode implementation). 
While this has lesser complications in terms of licensing, versions matches 
etc. -  there is still some complexity on how we build and distribute this in a 
binary package.
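
A sketch of the first point above (the function name is illustrative, mirroring 
deserialize.R): read the array length, then pull all the values with a single 
vectorized readBin() call instead of looping one element at a time.

{code}
readIntArrayVectorized <- function(con) {
  # The SerDe writes Java (big-endian) ints: the array length first, then the values.
  n <- readBin(con, integer(), n = 1L, endian = "big")
  if (n <= 0L) return(integer(0))
  readBin(con, integer(), n = n, endian = "big")
}
{code}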

> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * 
> [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between JVM and R suffers from huge storage and computation 
> overhead. For example, a short round trip of 1 million doubles surprisingly 
> took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(100)
>user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages. It is also unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe in R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns 
> in particular.
> 3. Some overhead in the serialization protocol/impl.
> 1) might be discussed before because R packages like rJava exist before 
> SparkR. I'm not sure whether we have a license issue in depending on those 
> libraries. Another option is to switch to low-level R'C interface or Rcpp, 
> which again might have license issue. I'm not an expert here. If we have to 
> implement our own, there still exist much space for improvement, discussed 
> below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
> which collects rows to local and then constructs columns. However,
> * it ignores column types and results in boxing/unboxing overhead
> * it collects all objects to the driver and results in high GC pressure
> A relatively simple change is to implement specialized column builder based 
> on column types, primitive types in particular. We need to handle null/NA 
> values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read the entire vector 
> in a single method call. The speed seems reasonable (at the order of GB/s):
> {code}
> > x <- runif(1000) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>user  system elapsed
>   0.036   0.021   0.059
> > > system.time(y <- readBin(r, double(), 1000))
>user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and 

[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-16 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15756403#comment-15756403
 ] 

Shivaram Venkataraman commented on SPARK-18817:
---

[~felixcheung] What do you think are the down sides of disabling Hive by 
default ? https://github.com/apache/spark/pull/16290 should fix the warehouse 
dir issue and if we disable hive by default then derby.log and metastore_db 
should not be created. 
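
For reference, the per-session switch already exists; the open question is only 
the default. A session started as below should leave no derby.log / metastore_db 
behind and keeps its warehouse under tempdir():

{code}
sparkR.session(
  master = "local[1]",
  enableHiveSupport = FALSE,
  sparkConfig = list(spark.sql.warehouse.dir = tempdir())
)
{code}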

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18895) Fix resource-closing-related and path-related test failures in identified ones on Windows

2016-12-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-18895.
---
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/16305

> Fix resource-closing-related and path-related test failures in identified 
> ones on Windows
> -
>
> Key: SPARK-18895
> URL: https://issues.apache.org/jira/browse/SPARK-18895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> There are several tests failing due to resource-closing-related and 
> path-related  problems on Windows as below.
> - {{RPackageUtilsSuite}}:
> {code}
> - build an R package from a jar end to end *** FAILED *** (1 second, 625 
> milliseconds)
>   java.io.IOException: Unable to delete file: 
> C:\projects\spark\target\tmp\1481729427517-0\a\dep2\d\dep2-d.jar
>   at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
>   at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
> - faulty R package shows documentation *** FAILED *** (359 milliseconds)
>   java.io.IOException: Unable to delete file: 
> C:\projects\spark\target\tmp\1481729428970-0\dep1-c.jar
>   at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
>   at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
> - SparkR zipping works properly *** FAILED *** (47 milliseconds)
>   java.util.regex.PatternSyntaxException: Unknown character property name {r} 
> near index 4
> C:\projects\spark\target\tmp\1481729429282-0
> ^
>   at java.util.regex.Pattern.error(Pattern.java:1955)
>   at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2781)
> {code}
> - {{InputOutputMetricsSuite}}:
> {code}
> - input metrics for old hadoop with coalesce *** FAILED *** (240 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics with cache and coalesce *** FAILED *** (109 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics for new Hadoop API with coalesce *** FAILED *** (0 
> milliseconds)
>   java.lang.IllegalArgumentException: Wrong FS: 
> file://C:\projects\spark\target\tmp\spark-9366ec94-dac7-4a5c-a74b-3e7594a692ab\test\InputOutputMetricsSuite.txt,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)
> - input metrics when reading text file *** FAILED *** (110 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records read - simple *** FAILED *** (125 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records read - more stages *** FAILED *** (110 
> milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records - New Hadoop API *** FAILED *** (16 milliseconds)
>   java.lang.IllegalArgumentException: Wrong FS: 
> file://C:\projects\spark\target\tmp\spark-3f10a1a4-7820-4772-b821-25fd7523bf6f\test\InputOutputMetricsSuite.txt,
>  expected: file:///
>   at org.apache.hadoop

[jira] [Commented] (SPARK-18902) Include Apache License in R source Package

2016-12-16 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755386#comment-15755386
 ] 

Shivaram Venkataraman commented on SPARK-18902:
---

A couple of points 
- We already include a DESCRIPTION file which lists the license as Apache
- To do this I think we just need to make a copy of LICENSE and put it in 
R/pkg/ -- `R CMD build` should automatically pick that up then, I'd guess

cc [~felixcheung]
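
A one-line sketch of that copy step, run from the Spark source root (the release 
packaging scripts may of course do this differently):

{code}
file.copy("LICENSE", file.path("R", "pkg", "LICENSE"), overwrite = TRUE)
{code}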

> Include Apache License in R source Package
> --
>
> Key: SPARK-18902
> URL: https://issues.apache.org/jira/browse/SPARK-18902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Shivaram Venkataraman
>
> Per [~srowen]'s email on the dev mailing list
> {quote}
> I don't see an Apache license / notice for the Pyspark or SparkR artifacts. 
> It would be good practice to include this in a convenience binary. I'm not 
> sure if it's strictly mandatory, but something to adjust in any event. I 
> think that's all there is to do for SparkR
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18902) Include Apache License in R source Package

2016-12-16 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755385#comment-15755385
 ] 

Shivaram Venkataraman commented on SPARK-18902:
---

A couple of points 
- We already include a DESCRIPTION file which lists the license as Apache
- To do this I think we just need to make a copy of LICENSE and put it in 
R/pkg/ -- `R CMD build` should automatically pick that up then, I'd guess

cc [~felixcheung]

> Include Apache License in R source Package
> --
>
> Key: SPARK-18902
> URL: https://issues.apache.org/jira/browse/SPARK-18902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Shivaram Venkataraman
>
> Per [~srowen]'s email on the dev mailing list
> {quote}
> I don't see an Apache license / notice for the Pyspark or SparkR artifacts. 
> It would be good practice to include this in a convenience binary. I'm not 
> sure if it's strictly mandatory, but something to adjust in any event. I 
> think that's all there is to do for SparkR
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18902) Include Apache License in R source Package

2016-12-16 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-18902:
-

 Summary: Include Apache License in R source Package
 Key: SPARK-18902
 URL: https://issues.apache.org/jira/browse/SPARK-18902
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Shivaram Venkataraman


Per [~srowen]'s email on the dev mailing list

{quote}
I don't see an Apache license / notice for the Pyspark or SparkR artifacts. It 
would be good practice to include this in a convenience binary. I'm not sure if 
it's strictly mandatory, but something to adjust in any event. I think that's 
all there is to do for SparkR
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18897) Fix SparkR SQL Test to drop test table

2016-12-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-18897:
--
Fix Version/s: (was: 2.1.0)
   2.2.0
   2.1.1

> Fix SparkR SQL Test to drop test table
> --
>
> Key: SPARK-18897
> URL: https://issues.apache.org/jira/browse/SPARK-18897
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Currently, the SparkR tests (`R/run-tests.sh`) succeed only once because 
> `test_sparkSQL.R` does not clean up the test table `people`.
> As a result, the test data accumulates at every run and the test cases 
> fail.
> The following is the failure result for the second run.
> {code}
> Failed 
> -
> 1. Failure: create DataFrame from RDD (@test_sparkSQL.R#204) 
> ---
> collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to 
> c(16).
> Lengths differ: 2 vs 1
> 2. Failure: create DataFrame from RDD (@test_sparkSQL.R#206) 
> ---
> collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal 
> to c(176.5).
> Lengths differ: 2 vs 1
> {code}
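
A hedged sketch of the missing cleanup (whether a DROP TABLE or DROP VIEW is 
needed depends on how the test registers `people`):

{code}
test_that("create DataFrame from RDD", {
  # ... existing assertions against the 'people' table ...
  sql("DROP TABLE IF EXISTS people")
})
{code}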



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18897) Fix SparkR SQL Test to drop test table

2016-12-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-18897:
--
Assignee: Dongjoon Hyun

> Fix SparkR SQL Test to drop test table
> --
>
> Key: SPARK-18897
> URL: https://issues.apache.org/jira/browse/SPARK-18897
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Currently, the SparkR test script `R/run-tests.sh` succeeds only once because
> `test_sparkSQL.R` does not clean up the test table `people`.
> As a result, test data accumulates on every run and the test cases fail.
> The following is the failure output from the second run.
> {code}
> Failed 
> -
> 1. Failure: create DataFrame from RDD (@test_sparkSQL.R#204) 
> ---
> collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to 
> c(16).
> Lengths differ: 2 vs 1
> 2. Failure: create DataFrame from RDD (@test_sparkSQL.R#206) 
> ---
> collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal 
> to c(176.5).
> Lengths differ: 2 vs 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18897) Fix SparkR SQL Test to drop test table

2016-12-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-18897.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.3

Issue resolved by pull request 16310
[https://github.com/apache/spark/pull/16310]

> Fix SparkR SQL Test to drop test table
> --
>
> Key: SPARK-18897
> URL: https://issues.apache.org/jira/browse/SPARK-18897
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
> Fix For: 2.0.3, 2.1.0
>
>
> Currently, the SparkR test script `R/run-tests.sh` succeeds only once because
> `test_sparkSQL.R` does not clean up the test table `people`.
> As a result, test data accumulates on every run and the test cases fail.
> The following is the failure output from the second run.
> {code}
> Failed 
> -
> 1. Failure: create DataFrame from RDD (@test_sparkSQL.R#204) 
> ---
> collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to 
> c(16).
> Lengths differ: 2 vs 1
> 2. Failure: create DataFrame from RDD (@test_sparkSQL.R#206) 
> ---
> collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal 
> to c(176.5).
> Lengths differ: 2 vs 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753588#comment-15753588
 ] 

Shivaram Venkataraman commented on SPARK-18817:
---

Just to check: is your Spark installation built with Hive support (i.e., with -Phive)?

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.
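
For illustration, one possible workaround (a sketch, not the fix adopted for this ticket) is to point the SQL warehouse at R's session temporary directory, so that at least the warehouse itself is not written to the user's working directory:

{code}
library(SparkR)

# Redirect the warehouse location to R's tempdir(); it is cleaned up when the
# R session ends, which keeps the user's working directory untouched.
sparkR.session(sparkConfig = list(spark.sql.warehouse.dir = tempdir()))
{code}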



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752973#comment-15752973
 ] 

Shivaram Venkataraman commented on SPARK-18817:
---

In that case, an easier fix might be to disable Hive support by default?
cc [~felixcheung]

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752902#comment-15752902
 ] 

Shivaram Venkataraman commented on SPARK-18817:
---

[~bdwyer] Does this still happen if you disable Hive? One way to test that is
to stop the sparkSession and create a new one with `enableHiveSupport=F`.
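
A minimal sketch of that check, assuming SparkR is attached and a session is currently running:

{code}
# Stop the current session and start a new one without Hive support, then see
# whether spark-warehouse/ (and the Derby metastore files) still appear in the
# working directory.
sparkR.session.stop()
sparkR.session(enableHiveSupport = FALSE)
{code}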

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752898#comment-15752898
 ] 

Shivaram Venkataraman commented on SPARK-18817:
---

Yeah, I don't know how to avoid creating those two -- it doesn't look like it's
configurable.

cc [~cloud_fan] [~rxin]

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18849) Vignettes final checks for Spark 2.1

2016-12-14 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-18849.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16286
[https://github.com/apache/spark/pull/16286]

> Vignettes final checks for Spark 2.1
> 
>
> Key: SPARK-18849
> URL: https://issues.apache.org/jira/browse/SPARK-18849
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Xiangrui Meng
>Assignee: Felix Cheung
> Fix For: 2.1.0
>
>
> Make a final pass over the vignettes and ensure the content is consistent.
> * remove "since version" because is not that useful for vignettes
> * re-order/group the list of ML algorithms so there exists a logical ordering
> * check for warning or error in output message
> * anything else that seems out of place



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18875) Fix R API doc generation by adding `DESCRIPTION` file

2016-12-14 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-18875:
--
Assignee: Dongjoon Hyun

> Fix R API doc generation by adding `DESCRIPTION` file
> -
>
> Key: SPARK-18875
> URL: https://issues.apache.org/jira/browse/SPARK-18875
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SparkR
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.3, 2.1.0
>
>
> Since 1.4.0, the R API documentation index page has had a broken link to the
> `DESCRIPTION` file. This issue aims to fix that.
> * Official Latest Website: 
> http://spark.apache.org/docs/latest/api/R/index.html
> * Apache Spark 2.1.0-rc2: 
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


