Re: RDD Location

2016-12-30 Thread Sun Rui
but it seems to be suspended when executing this function. But if I move the code to other places, like the main() function, it runs well. What is the reason for it? Thanks, Fei > On Fri, Dec 30, 2016 at 2:38 AM, Sun Rui <sunrise_...@163.com> wrote:

Re: RDD Location

2016-12-29 Thread Sun Rui
Maybe you can create your own subclass of RDD and override getPreferredLocations() to implement the logic of dynamically changing the locations. > On Dec 30, 2016, at 12:06, Fei Hu wrote: > Dear all, > Is there any way to change the host location for a certain
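A minimal sketch of that suggestion (untested; `chooseHosts` is a hypothetical placeholder for whatever placement logic you need):

    import scala.reflect.ClassTag
    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // Wraps an existing RDD and overrides getPreferredLocations so the
    // scheduler prefers the hosts returned by chooseHosts for each partition.
    class LocationAwareRDD[T: ClassTag](
        prev: RDD[T],
        chooseHosts: Partition => Seq[String])
      extends RDD[T](prev) {

      override protected def getPartitions: Array[Partition] = prev.partitions

      override def compute(split: Partition, context: TaskContext): Iterator[T] =
        prev.iterator(split, context)

      // Preferred locations are only a hint; the scheduler may still run
      // the task elsewhere depending on locality wait settings.
      override protected def getPreferredLocations(split: Partition): Seq[String] =
        chooseHosts(split)
    }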

Re: shuffle files not deleted after executor restarted

2016-09-02 Thread Sun Rui
Hi, could you give more information about your Spark environment? Cluster manager, Spark version, using dynamic allocation or not, etc. Generally, executors will delete their temporary directories for shuffle files on exit because JVM shutdown hooks are registered, unless they are brutally killed (e.g. with kill -9).
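A conceptual sketch of that point (not the actual Spark code): cleanup registered as a shutdown hook runs on a normal JVM exit or SIGTERM, but never when the process receives SIGKILL, which is why brutally killed executors leave shuffle files behind.

    import java.io.File

    object CleanupSketch {
      // Recursively delete a temporary directory.
      def deleteRecursively(f: File): Unit = {
        if (f.isDirectory) {
          val children = f.listFiles()
          if (children != null) children.foreach(deleteRecursively)
        }
        f.delete()
      }

      // Register the cleanup as a JVM shutdown hook; it fires on normal
      // exit or SIGTERM, but not on SIGKILL (kill -9).
      def registerCleanup(tempDir: File): Unit = {
        Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
          override def run(): Unit = deleteRecursively(tempDir)
        }))
      }
    }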

Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Sun Rui
18:51, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote: > Thank you for your prompt response and great examples, Sun Rui, but I am still confused about one thing. Do you see any particular reason not to merge subsequent limits? Following case: (limit n (m
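For readers skimming the thread, the case under discussion can be reproduced roughly as below (a sketch; `df` is assumed to be any existing DataFrame, and whether the two limits get merged is exactly what explain() lets you check):

    // Chain two limits, as in (limit n (limit m df)).
    val limited = df.limit(1000).limit(10)

    // Print the parsed, analyzed, optimized, and physical plans to see
    // whether the nested Limit operators were collapsed by the optimizer.
    limited.explain(true)

    // Converting to an RDD, as in the subject of the thread, forces the
    // physical plan discussed above.
    val asRdd = limited.rdd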

Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-11 Thread Sun Rui
-1 https://issues.apache.org/jira/browse/SPARK-16379 > On Jul 6, 2016, at 19:28, Maciej Bryński wrote: > -1 https://issues.apache.org/jira/browse/SPARK-16379

Re: spark1.6.2 ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread Sun Rui
Maybe related to "parquet-provided"? Remove the "parquet-provided" profile when making the distribution, or add the parquet jar to the classpath when running Spark. > On Jul 8, 2016, at 09:25, kevin wrote: > parquet-provided

Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Sun Rui
You can read https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. For PySpark data flow on worker nodes, you can read the source code of PythonRDD.scala. Python worker processes communicate with Spark executors

Re: Windows Rstudio to Linux spakR

2016-06-01 Thread Sun Rui
Selvam, first, deploy the Spark distribution on your Windows machine; it must be the same version as the Spark in your Linux cluster. Second, follow the instructions at https://github.com/apache/spark/tree/master/R#using-sparkr-from-rstudio. Specify the Spark master URL for your Linux Spark

Re:

2016-05-22 Thread Sun Rui
No permission is required. Just send your PR :) > On May 22, 2016, at 20:04, 成强 wrote: > spark-15429

Re: spark on kubernetes

2016-05-22 Thread Sun Rui
If it is possible to rewrite URLs in outbound responses in Knox or another reverse proxy, would that solve your issue? > On May 22, 2016, at 14:55, Gurvinder Singh wrote: > On 05/22/2016 08:32 AM, Reynold Xin wrote: >> Kubernetes itself already has facilities for http

Re: spark on kubernetes

2016-05-22 Thread Sun Rui
I think a “reverse proxy” is beneficial for monitoring a cluster in a secure way. This feature is desired not only for Spark standalone, but also for Spark on YARN, and also for projects other than Spark. Maybe Apache Knox can help you; I am not sure how Knox can integrate with Spark. > On May 22, 2016, at

Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
Kai, you can simply ignore this test failure until it is fixed. > On May 20, 2016, at 12:54, Sun Rui <sunrise_...@163.com> wrote: > Yes, I also met this issue. It is likely related to recent R versions. Could you help to submit a JIRA issue? I will take a look at it.

Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
> I guess this issue is related to permissions. It seems I used `sudo ./R/run-tests.sh` and it worked sometimes. Without permission, maybe we couldn't access the /tmp directory. However, the SparkR unit testing is brittle. Could someone give any hints on how to solve this

Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
.909 Thread-1 INFO SparkContext: Successfully stopped SparkContext 1384644 16/05/19 11:28:13.910 Thread-1 INFO ShutdownHookManager: Shutdown hook called 1384645 16/05/19 11:28:13.911 Thread-1 INFO ShutdownHookManager: Deleting directory /private/var/folders/xy/qc

Re: SparkR dataframe error

2016-05-18 Thread Sun Rui
> attaching here again. > On Wed, May 18, 2016 at 5:27 PM, Sun Rui <sunrise_...@163.com> wrote: > It’s wrong behaviour that head(df) outputs no row. Could you send a screenshot displaying the whole error message? >> On May 19, 2016,

Re: SparkR dataframe error

2016-05-18 Thread Sun Rui
It’s wrong behaviour that head(df) outputs no row. Could you send a screenshot displaying the whole error message? > On May 19, 2016, at 08:12, Gayathri Murali wrote: > I am trying to run a basic example in the interactive R shell and ran into the following error.

RE: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-26 Thread Sun, Rui
internally Dataset[Row(value: Row)]. From: Reynold Xin [mailto:r...@databricks.com] Sent: Friday, February 26, 2016 3:55 PM To: Sun, Rui <rui@intel.com> Cc: Koert Kuipers <ko...@tresata.com>; dev@spark.apache.org Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0 The join and joinWith

RE: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Sun, Rui
Vote for option 2. Source compatibility and binary compatibility are very important from the user’s perspective. It’s unfair for Java developers that they don’t have the DataFrame abstraction. As you said, sometimes it is more natural to think about DataFrame. I am wondering if conceptually there is
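For context, "option 2" boils down to making DataFrame a plain type alias rather than a separate class, roughly as in the sketch below (an illustration of the idea, not a quote of the Spark source):

    import org.apache.spark.sql.{Dataset, Row}

    object Option2Sketch {
      // DataFrame becomes an alias for Dataset[Row].
      type DataFrame = Dataset[Row]

      // Scala code written against DataFrame keeps compiling unchanged,
      // because any Dataset[Row] is a DataFrame and vice versa.
      def rowCount(df: DataFrame): Long = df.count()
    }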

RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

2016-02-07 Thread Sun, Rui
This should be solved by your pending PR https://github.com/apache/spark/pull/10480, right? From: Felix Cheung [mailto:felixcheun...@hotmail.com] Sent: Sunday, February 7, 2016 8:50 PM To: Sun, Rui <rui@intel.com>; Andrew Holway <andrew.hol...@otternetworks.de>; dev@spark.apache

RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

2016-02-06 Thread Sun, Rui
DataFrameWriter.jdbc() does not work? From: Felix Cheung [mailto:felixcheun...@hotmail.com] Sent: Sunday, February 7, 2016 9:54 AM To: Andrew Holway; dev@spark.apache.org Subject: Re: Fwd: Writing to jdbc database from SparkR (1.5.2) Unfortunately I couldn't find
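For reference, the Scala-side API being asked about looks roughly like this (a sketch; the URL, table name, and credentials are placeholders, and `df` is assumed to be an existing DataFrame):

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "dbuser")          // placeholder credentials
    props.setProperty("password", "secret")
    props.setProperty("driver", "org.postgresql.Driver")

    // DataFrameWriter.jdbc(url, table, connectionProperties)
    df.write
      .mode("append")
      .jdbc("jdbc:postgresql://dbhost:5432/mydb", "my_table", props)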

RE: Specifying Scala types when calling methods from SparkR

2015-12-10 Thread Sun, Rui
...@alteryx.com] Sent: Friday, December 11, 2015 2:47 AM To: Sun, Rui; shiva...@eecs.berkeley.edu Cc: dev@spark.apache.org Subject: RE: Specifying Scala types when calling methods from SparkR Hi Sun Rui, I’ve had some luck simply using “objectFile” when saving from SparkR directly. The problem is that if you

RE: Specifying Scala types when calling methods from SparkR

2015-12-09 Thread Sun, Rui
Hi, just use "objectFile" instead of "objectFile[PipelineModel]" for callJMethod. You can take the objectFile() in context.R as an example. Since the SparkContext created in SparkR is actually a JavaSparkContext, there is no need to pass the implicit ClassTag. -Original Message- From:
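To see why the plain method name is enough, compare the two APIs (a rough Scala illustration; MyModel is a hypothetical stand-in for PipelineModel):

    import org.apache.spark.SparkContext
    import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical model class, standing in for PipelineModel in the thread.
    case class MyModel(id: Long)

    object ObjectFileSketch {
      // Scala API: an implicit ClassTag[MyModel] must be supplied.
      def scalaSide(sc: SparkContext, path: String): RDD[MyModel] =
        sc.objectFile[MyModel](path)

      // Java API: no ClassTag parameter, so invoking it by the plain
      // method name "objectFile" through SparkR's callJMethod resolves.
      def javaSide(jsc: JavaSparkContext, path: String): JavaRDD[MyModel] =
        jsc.objectFile[MyModel](path)
    }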

RE: SparkR package path

2015-09-24 Thread Sun, Rui
not a small change to spark-submit. Also, additional network traffic overhead would be incurred. I can’t see any compelling demand for this. From: Hossein [mailto:fal...@gmail.com] Sent: Friday, September 25, 2015 5:09 AM To: shiva...@eecs.berkeley.edu Cc: Sun, Rui; dev@spark.apache.org; Dan Putler

RE: SparkR package path

2015-09-24 Thread Sun, Rui
AM To: Sun, Rui Cc: shiva...@eecs.berkeley.edu; dev@spark.apache.org Subject: Re: SparkR package path Requiring users to download the entire Spark distribution to connect to a remote cluster (which is already running Spark) seems like overkill. Even for most Spark users who download Spark source

RE: SparkR package path

2015-09-23 Thread Sun, Rui
there is documentation at https://github.com/apache/spark/tree/master/R From: Hossein [mailto:fal...@gmail.com] Sent: Thursday, September 24, 2015 1:42 AM To: shiva...@eecs.berkeley.edu Cc: Sun, Rui; dev@spark.apache.org Subject: Re: SparkR package path Yes, I think exposing SparkR on CRAN can significantly

RE: SparkR package path

2015-09-21 Thread Sun, Rui
Hossein, is there any strong reason to download and install the SparkR source package separately from the Spark distribution? An R user can simply download the Spark distribution, which contains the SparkR source and binary packages, and directly use sparkR. There is no need to install the SparkR package at all. From:

[SparkR] is toDF() necessary

2015-05-08 Thread Sun, Rui
toDF() is defined to convert an RDD to a DataFrame, but it is just a very thin wrapper around createDataFrame() that helps the caller avoid passing in a SQLContext. Since Scala/PySpark does not have toDF(), and we'd better keep the API as narrow and simple as possible, is toDF() really necessary? Could we