Re: RDD Location

2016-12-30 Thread Sun Rui
You can’t call runJob inside getPreferredLocations().
You can take a look at the source code of HadoopRDD to see how to implement 
getPreferredLocations() appropriately.
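As a rough sketch (not HadoopRDD itself, and only illustrative), a subclass can derive 
its preferred locations from metadata captured when the RDD is constructed, instead of 
calling runJob(); the host map below is a hypothetical placeholder for whatever 
metadata you have:

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class LocatedRDD[T: ClassTag](
    parent: RDD[T],
    hostsByPartition: Map[Int, Seq[String]])  // built from your own metadata, not from runJob()
  extends RDD[T](parent) {

  // reuse the parent's partitioning and computation unchanged
  override protected def getPartitions: Array[Partition] = parent.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    parent.iterator(split, context)

  // locality hints come from the precomputed map, so no job is run here
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    hostsByPartition.getOrElse(split.index, Nil)
}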
> On Dec 31, 2016, at 09:48, Fei Hu <hufe...@gmail.com> wrote:
> 
> That is a good idea.
> 
> I tried adding the following code to the getPreferredLocations() function:
> 
> val results: Array[Array[DataChunkPartition]] = context.runJob(
>   partitionsRDD,
>   (context: TaskContext, partIter: Iterator[DataChunkPartition]) => partIter.toArray,
>   dd, allowLocal = true)
> 
> But it seems to hang when executing this function. If I move the 
> code elsewhere, such as into the main() function, it runs well.
> 
> What is the reason for it?
> 
> Thanks,
> Fei
> 
> On Fri, Dec 30, 2016 at 2:38 AM, Sun Rui <sunrise_...@163.com 
> <mailto:sunrise_...@163.com>> wrote:
> Maybe you can create your own subclass of RDD and override 
> getPreferredLocations() to implement the logic for dynamically changing the 
> locations.
> > On Dec 30, 2016, at 12:06, Fei Hu <hufe...@gmail.com 
> > <mailto:hufe...@gmail.com>> wrote:
> >
> > Dear all,
> >
> > Is there any way to change the host location for a certain partition of RDD?
> >
> > "protected def getPreferredLocations(split: Partition)" can be used to 
> > initialize the location, but how to change it after the initialization?
> >
> >
> > Thanks,
> > Fei
> >
> >
> 
> 
> 



Re: RDD Location

2016-12-29 Thread Sun Rui
Maybe you can create your own subclass of RDD and override 
getPreferredLocations() to implement the logic for dynamically changing the 
locations.
> On Dec 30, 2016, at 12:06, Fei Hu  wrote:
> 
> Dear all,
> 
> Is there any way to change the host location for a certain partition of RDD?
> 
> "protected def getPreferredLocations(split: Partition)" can be used to 
> initialize the location, but how to change it after the initialization?
> 
> 
> Thanks,
> Fei
> 
> 






Re: shuffle files not deleted after executor restarted

2016-09-02 Thread Sun Rui
Hi,
Could you give more information about your Spark environment? Cluster manager, 
Spark version, whether dynamic allocation is used, etc.

Generally, executors will delete temporary directories for shuffle files on 
exit because JVM shutdown hooks are registered, unless they are brutally killed.
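As a minimal illustration of that mechanism (this is not Spark’s actual cleanup code): 
a shutdown hook runs on a normal JVM exit (System.exit, SIGTERM), but never runs if 
the process is killed with SIGKILL, which is why brutally killed executors leave 
their directories behind.

import java.io.File
import java.nio.file.Files

// a scratch directory, roughly analogous to an executor's blockmgr-* directory
val tmpDir: File = Files.createTempDirectory("blockmgr-demo").toFile

def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) Option(f.listFiles).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  f.delete()
}

// registered hooks fire on a clean shutdown, but a kill -9 bypasses them
sys.addShutdownHook {
  deleteRecursively(tmpDir)
}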

You can safely delete the directories once you are sure that the Spark 
applications related to them have finished. A crontab task can be used for 
automatic cleanup.

> On Sep 2, 2016, at 12:18, 汪洋  wrote:
> 
> Hi all,
> 
> I discovered that sometimes an executor exits unexpectedly, and when it is 
> restarted it creates another blockmgr directory without deleting the old 
> ones. Thus, for a long-running application, some shuffle files will never be 
> cleaned up. Sometimes those files can take up the whole disk. 
> 
> Is there a way to clean up those unused files automatically? Or is it safe to 
> delete the old directories manually, leaving only the newest one?
> 
> Here is the executor’s local directory.
> 
> 
> Any advice on this?
> 
> Thanks.
> 
> Yang






Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Sun Rui
Spark does optimise subsequent limits, for example:
scala> df1.limit(3).limit(1).explain
== Physical Plan ==
CollectLimit 1
+- *SerializeFromObject [assertnotnull(input[0, $line14.$read$$iw$$iw$my, 
true], top level non-flat input object).x AS x#2]
   +- Scan ExternalRDDScan[obj#1]

However, limit cannot simply be pushed down across mapping functions, because 
the number of rows may change across functions, for example flatMap().

It seems that limit could be pushed across map(), which won’t change the number of 
rows. Maybe there is room here for a Spark optimisation.
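A small sketch of why the row count matters (assuming a Spark 2.0 spark-shell, where 
spark.implicits._ is already imported; this is not from the original mail):

val ds = Seq(1, 2, 3).toDS()

// limit applied after flatMap: 6 output rows, keep 4 of them
ds.flatMap(x => Seq(x, x)).limit(4).count()   // 4

// naively pushing the limit below flatMap changes the result:
// limit(4) on 3 input rows keeps all 3, and flatMap then doubles them
ds.limit(4).flatMap(x => Seq(x, x)).count()   // 6

For map(), which is one-to-one, both orderings return the same number of rows, which is 
why pushing limit across map() looks like a safe optimisation.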

> On Aug 2, 2016, at 18:51, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:
> 
> Thank you for your prompt response and great examples Sun Rui but I am
> still confused about one thing. Do you see any particular reason to not
> to merge subsequent limits? Following case
> 
>(limit n (map f (limit m ds)))
> 
> could be optimized to:
> 
>(map f (limit n (limit m ds)))
> 
> and further to
> 
>(map f (limit (min n m) ds))
> 
> couldn't it?
> 
> 
> On 08/02/2016 11:57 AM, Sun Rui wrote:
>> Based on your code, here is a simpler test case on Spark 2.0
>> 
>>case class my (x: Int)
>>val rdd = sc.parallelize(0.until(1), 1000).map { x => my(x) }
>>val df1 = spark.createDataFrame(rdd)
>>val df2 = df1.limit(1)
>>df1.map { r => r.getAs[Int](0) }.first
>>df2.map { r => r.getAs[Int](0) }.first // Much slower than the previous line
>> 
>> Actually, Dataset.first is equivalent to Dataset.limit(1).collect, so
>> check the physical plan of the two cases:
>> 
>>scala> df1.map { r => r.getAs[Int](0) }.limit(1).explain
>>== Physical Plan ==
>>CollectLimit 1
>>+- *SerializeFromObject [input[0, int, true] AS value#124]
>>   +- *MapElements , obj#123: int
>>  +- *DeserializeToObject createexternalrow(x#74,
>>StructField(x,IntegerType,false)), obj#122: org.apache.spark.sql.Row
>> +- Scan ExistingRDD[x#74]
>> 
>>scala> df2.map { r => r.getAs[Int](0) }.limit(1).explain
>>== Physical Plan ==
>>CollectLimit 1
>>+- *SerializeFromObject [input[0, int, true] AS value#131]
>>   +- *MapElements , obj#130: int
>>  +- *DeserializeToObject createexternalrow(x#74,
>>StructField(x,IntegerType,false)), obj#129: org.apache.spark.sql.Row
>> +- *GlobalLimit 1
>>+- Exchange SinglePartition
>>   +- *LocalLimit 1
>>  +- Scan ExistingRDD[x#74]
>> 
>> 
>> For the first case, it is related to an optimisation in
>> the CollectLimitExec physical operator. That is, it will first fetch
>> the first partition to get the limit number of rows (1 in this case); if that is not
>> satisfied, it fetches more partitions until the desired limit is
>> reached. So generally, if the first partition is not empty, only the
>> first partition will be computed and fetched. Other partitions will
>> not even be computed.
>> 
>> However, in the second case, the optimisation in CollectLimitExec
>> does not help, because the previous limit operation involves a shuffle.
>> All partitions will be computed, LocalLimit(1) will run
>> on each partition to get 1 row, and then all partitions are shuffled
>> into a single partition. CollectLimitExec will fetch 1 row from the
>> resulting single partition.
>> 
>> 
>>> On Aug 2, 2016, at 09:08, Maciej Szymkiewicz <mszymkiew...@gmail.com
>>> <mailto:mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>>> wrote:
>>> 
>>> Hi everyone,
>>> 
>>> This doesn't look like something expected, does it?
>>> 
>>> http://stackoverflow.com/q/38710018/1560062
>>> 
>>> A quick glance at the UI suggests that there is a shuffle involved and the
>>> input for first is a ShuffledRowRDD.
>>> -- 
>>> Best regards,
>>> Maciej Szymkiewicz
>> 
> 
> -- 
> Maciej Szymkiewicz



Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-11 Thread Sun Rui
-1
https://issues.apache.org/jira/browse/SPARK-16379 


> On Jul 6, 2016, at 19:28, Maciej Bryński  wrote:
> 
> -1
> https://issues.apache.org/jira/browse/SPARK-16379 
> 


Re: spark1.6.2 ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread Sun Rui
Maybe this is related to the "parquet-provided" profile?
Remove the "parquet-provided" profile when making the distribution, or add the 
Parquet jar to the classpath when running Spark.
> On Jul 8, 2016, at 09:25, kevin  wrote:
> 
> parquet-provided



Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Sun Rui
You can read 
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals 

For the PySpark data flow on worker nodes, you can read the source code of 
PythonRDD.scala. Python worker processes communicate with Spark executors via 
sockets instead of pipes.

> On Jul 7, 2016, at 15:49, Amit Rana  wrote:
> 
> Hi all,
> 
> I am trying to trace the data flow in PySpark. I am using IntelliJ IDEA on 
> Windows 7.
> I had submitted a Python job as follows:
> --master local[4]  
> 
> I have gained the following insights after running the above command in debug 
> mode:
> -> Locally, when a PySpark interpreter starts, it also starts a JVM with 
> which it communicates through a socket.
> -> Py4J is used to handle this communication.
> -> Now this JVM acts as the actual Spark driver, and loads a JavaSparkContext 
> which communicates with the Spark executors in the cluster.
> 
> I have read that in the cluster, the data flow between Spark executors and the 
> Python interpreter happens using pipes. But I am not able to trace that data flow.
> 
> Please correct me if my understanding is wrong. It would be very helpful if 
> someone could help me understand the code flow for data transfer between the JVM 
> and Python workers.
> 
> Thanks,
> Amit Rana
> 



Re: Windows RStudio to Linux SparkR

2016-06-01 Thread Sun Rui
Selvam,

First, deploy a Spark distribution on your Windows machine that is the same 
version as the Spark in your Linux cluster.

Second, follow the instructions at 
https://github.com/apache/spark/tree/master/R#using-sparkr-from-rstudio. 
Specify the Spark master URL of your Linux Spark cluster when calling 
sparkR.init(). I don’t know your Spark cluster deployment mode; if it is YARN, 
you may have to copy the YARN conf files from your cluster and set the 
YARN_CONF_DIR environment variable to point to them.

These steps are my personal understanding; I have not tested this scenario. 
Please report back if you have any problems.

> On Jun 1, 2016, at 16:55, Selvam Raman  wrote:
> 
> Hi ,
> 
> How can I connect to SparkR (which is available in a Linux env) using 
> RStudio (Windows env)?
> 
> Please help me.
> 
> -- 
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"



Re:

2016-05-22 Thread Sun Rui
No permission is required. Just send your PR:)
> On May 22, 2016, at 20:04, 成强  > wrote:
> 
> spark-15429



Re: spark on kubernetes

2016-05-22 Thread Sun Rui
If it is possible to rewrite URLs in outbound responses in Knox or another reverse 
proxy, would that solve your issue?
> On May 22, 2016, at 14:55, Gurvinder Singh  wrote:
> 
> On 05/22/2016 08:32 AM, Reynold Xin wrote:
>> Kubernetes itself already has facilities for http proxy, doesn't it?
>> 
> Yeah, Kubernetes has an ingress controller which can act as the L7 load
> balancer and route traffic to the Spark UI in this case. But I am referring
> to the links present in the UI to the worker and application UIs. I replied in
> detail to Sun Rui's mail, where I gave an example of a possible scenario.
> 
> - Gurvinder
>> 
>> On Sat, May 21, 2016 at 9:30 AM, Gurvinder Singh
>> > wrote:
>> 
>>Hi,
>> 
>>I am currently working on deploying Spark on Kubernetes (K8s) and it is
>>working fine. I am running Spark in standalone mode and checkpointing
>>the state to a shared system. So if the master fails, K8s restarts it, recovers
>>the earlier state from the checkpoint, and things just work fine. I
>>have an issue with accessing the worker and application UI links from the
>>Spark master Web UI. In brief, the Kubernetes service model allows me to
>>expose the master service to the internet, but accessing the
>>application/worker UIs is not possible, as I would then have to expose them
>>individually too, and given that I can have multiple applications it becomes
>>hard to manage.
>> 
>>One solution could be for the master to act as a reverse proxy to access
>>information/state/logs from the applications/workers. It has the
>>information about their endpoints when applications/workers register with
>>the master, so when a user initiates a request to access the information,
>>the master can proxy the request to the corresponding endpoint.
>> 
>>So I am wondering if someone has already done work in this direction;
>>it would be great to know if so. If not, would the community be
>>interested in such a feature? If yes, how and where should I get
>>started? It would be helpful for me to have some guidance to start
>>working on this.
>> 
>>Kind Regards,
>>Gurvinder
>> 
>> 
>> 
> 
> 
> 






Re: spark on kubernetes

2016-05-22 Thread Sun Rui
I think a “reverse proxy” is beneficial for monitoring a cluster in a secure way. 
This feature is desired not only for Spark standalone, but also for Spark on 
YARN, and for projects other than Spark.

Maybe Apache Knox can help you. I am not sure how Knox can integrate with Spark.
> On May 22, 2016, at 00:30, Gurvinder Singh  wrote:
> 
> standalone mod



Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
Kai,
You can simply ignore this test failure until it is fixed.
> On May 20, 2016, at 12:54, Sun Rui <sunrise_...@163.com> wrote:
> 
> Yes, I have also hit this issue. It is likely related to recent R versions.
> Could you help submit a JIRA issue? I will take a look at it.
>> On May 20, 2016, at 11:13, Kai Jiang <jiang...@gmail.com 
>> <mailto:jiang...@gmail.com>> wrote:
>> 
>> I was trying to build SparkR this week. hmm~ But I encountered a problem with 
>> SparkR unit testing that is probably similar to what Gayathri encountered.
>> I tried running the ./R/run-tests.sh script many times. It seems like the 
>> tests fail every time.
>> 
>> Here are some environments when I was building:
>> java 7
>> R 3.3.0 (sudo apt-get install r-base-dev under Ubuntu 15.04)
>> set SPARK_HOME=/path
>> 
>> R -e 'install.packages("testthat", repos="http://cran.us.r-project.org")'
>> build with:   build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 
>> -Psparkr -DskipTests -T 1C clean package
>> 
>> ./R/install-dev.sh
>> ./R/run-tests.sh
>> 
>> Here is the error message I got: 
>> https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2 
>> <https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2>
>> 
>> I guess this issue is related to permissions. When I used `sudo 
>> ./R/run-tests.sh`, it worked sometimes. Without the right permissions, maybe we 
>> couldn't access the /tmp directory. However, the SparkR unit testing is brittle.
>> 
>> Could someone give any hints of how to solve this?
>> 
>> Best,
>> Kai.
>> 
>> On Thu, May 19, 2016 at 6:59 PM, Sun Rui <sunrise_...@163.com 
>> <mailto:sunrise_...@163.com>> wrote:
>> You must specify -Psparkr when building from source.
>> 
>>> On May 20, 2016, at 08:09, Gayathri Murali <gayathri.m.sof...@gmail.com 
>>> <mailto:gayathri.m.sof...@gmail.com>> wrote:
>>> 
>>> That helped! Thanks. I am building from source code and I am not sure what 
>>> caused the issue with SparkR.
>>> 
>>> On Thu, May 19, 2016 at 4:17 PM, Xiangrui Meng <men...@gmail.com 
>>> <mailto:men...@gmail.com>> wrote:
>>> We no longer have `SparkRWrappers` in Spark 2.0. So if you are testing the 
>>> latest branch-2.0, there could be an issue with your SparkR installation. 
>>> Did you try `R/install-dev.sh`?
>>> 
>>> On Thu, May 19, 2016 at 11:42 AM Gayathri Murali 
>>> <gayathri.m.sof...@gmail.com <mailto:gayathri.m.sof...@gmail.com>> wrote:
>>> This is on Spark 2.0. I see the following in unit-tests.log when I run 
>>> R/run-tests.sh. This is on a single Mac laptop, on the recently rebased 
>>> master. R version is 3.3.0.
>>> 
>>> 16/05/19 11:28:13.863 Executor task launch worker-1 ERROR Executor: 
>>> Exception in task 0.0 in stage 5186.0 (TID 10370)
>>> 1384595 org.apache.spark.SparkException: R computation failed with
>>> 1384596
>>> 1384597 Execution halted
>>> 1384598
>>> 1384599 Execution halted
>>> 1384600
>>> 1384601 Execution halted
>>> 1384602 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
>>> 1384603 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>>> 1384604 at 
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>>> 1384605 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>>> 1384606 at 
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>>> 1384607 at org.apache.spark.scheduler.Task.run(Task.scala:85)
>>> 1384608 at 
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>>> 1384609 at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> 1384610 at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> 1384611 at java.lang.Thread.run(Thread.java:745)
>>> 1384612 16/05/19 11:28:13.864 Thread-1 INFO ContextHandler: Stopped 
>>> o.s.j.s.ServletContextHandler@22f76fa8{/jobs/json,null,UNAVAILABLE}
>>> 1384613 16/05/19 11:28:13.869 Thread-1 INFO ContextHandler: Stopped 
>>> o.s.j.s.ServletContextHandler@afe0d9f{/jobs,null,UNAVAILABLE}
>>> 1384614 16/05/19 11:28:13.869 Thread-1 INFO SparkUI: Stopped Spark web UI 
>>> at http://localhost:4040 <http://localhost:4040/>
>>

Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
Yes, I have also hit this issue. It is likely related to recent R versions.
Could you help submit a JIRA issue? I will take a look at it.
> On May 20, 2016, at 11:13, Kai Jiang <jiang...@gmail.com> wrote:
> 
> I was trying to build SparkR this week. hmm~ But I encountered a problem with 
> SparkR unit testing that is probably similar to what Gayathri encountered.
> I tried running the ./R/run-tests.sh script many times. It seems like the 
> tests fail every time.
> 
> Here are some environments when I was building:
> java 7
> R 3.3.0 (sudo apt-get install r-base-dev under Ubuntu 15.04)
> set SPARK_HOME=/path
> 
> R -e 'install.packages("testthat", repos="http://cran.us.r-project.org")'
> build with:   build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 
> -Psparkr -DskipTests -T 1C clean package
> 
> ./R/install-dev.sh
> ./R/run-tests.sh
> 
> Here is the error message I got: 
> https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2 
> <https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2>
> 
> I guess this issue is related to permissions. When I used `sudo 
> ./R/run-tests.sh`, it worked sometimes. Without the right permissions, maybe we 
> couldn't access the /tmp directory. However, the SparkR unit testing is brittle.
> 
> Could someone give any hints of how to solve this?
> 
> Best,
> Kai.
> 
> On Thu, May 19, 2016 at 6:59 PM, Sun Rui <sunrise_...@163.com 
> <mailto:sunrise_...@163.com>> wrote:
> You must specify -Psparkr when building from source.
> 
>> On May 20, 2016, at 08:09, Gayathri Murali <gayathri.m.sof...@gmail.com 
>> <mailto:gayathri.m.sof...@gmail.com>> wrote:
>> 
>> That helped! Thanks. I am building from source code and I am not sure what 
>> caused the issue with SparkR.
>> 
>> On Thu, May 19, 2016 at 4:17 PM, Xiangrui Meng <men...@gmail.com 
>> <mailto:men...@gmail.com>> wrote:
>> We no longer have `SparkRWrappers` in Spark 2.0. So if you are testing the 
>> latest branch-2.0, there could be an issue with your SparkR installation. 
>> Did you try `R/install-dev.sh`?
>> 
>> On Thu, May 19, 2016 at 11:42 AM Gayathri Murali 
>> <gayathri.m.sof...@gmail.com <mailto:gayathri.m.sof...@gmail.com>> wrote:
>> This is on Spark 2.0. I see the following in unit-tests.log when I run 
>> R/run-tests.sh. This is on a single Mac laptop, on the recently rebased 
>> master. R version is 3.3.0.
>> 
>> 16/05/19 11:28:13.863 Executor task launch worker-1 ERROR Executor: 
>> Exception in task 0.0 in stage 5186.0 (TID 10370)
>> 1384595 org.apache.spark.SparkException: R computation failed with
>> 1384596
>> 1384597 Execution halted
>> 1384598
>> 1384599 Execution halted
>> 1384600
>> 1384601 Execution halted
>> 1384602 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
>> 1384603 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>> 1384604 at 
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>> 1384605 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>> 1384606 at 
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>> 1384607 at org.apache.spark.scheduler.Task.run(Task.scala:85)
>> 1384608 at 
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>> 1384609 at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> 1384610 at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> 1384611 at java.lang.Thread.run(Thread.java:745)
>> 1384612 16/05/19 11:28:13.864 Thread-1 INFO ContextHandler: Stopped 
>> o.s.j.s.ServletContextHandler@22f76fa8{/jobs/json,null,UNAVAILABLE}
>> 1384613 16/05/19 11:28:13.869 Thread-1 INFO ContextHandler: Stopped 
>> o.s.j.s.ServletContextHandler@afe0d9f{/jobs,null,UNAVAILABLE}
>> 1384614 16/05/19 11:28:13.869 Thread-1 INFO SparkUI: Stopped Spark web UI at 
>> http://localhost:4040 <http://localhost:4040/>
>> 1384615 16/05/19 11:28:13.871 Executor task launch worker-4 ERROR Executor: 
>> Exception in task 1.0 in stage 5186.0 (TID 10371)
>> 1384616 org.apache.spark.SparkException: R computation failed with
>> 1384617
>> 1384618 Execution halted
>> 1384619
>> 1384620 Execution halted
>> 1384621
>> 1384622 Execution halted
>> 1384623 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
>> 1384624 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>&

Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 1384643 16/05/19 11:28:13.909 Thread-1 INFO SparkContext: Successfully 
> stopped SparkContext
> 1384644 16/05/19 11:28:13.910 Thread-1 INFO ShutdownHookManager: Shutdown 
> hook called
> 1384645 16/05/19 11:28:13.911 Thread-1 INFO ShutdownHookManager: Deleting 
> directory 
> /private/var/folders/xy/qc35m0y55vq83dsqzg066_c4gn/T/spark-dfafdddc-fd25-4eb4-bb1d-565915
> 1c8231
> 
> 
> On Thu, May 19, 2016 at 8:46 AM, Xiangrui Meng <men...@gmail.com 
> <mailto:men...@gmail.com>> wrote:
> Is it on 1.6.x?
> 
> 
> On Wed, May 18, 2016, 6:57 PM Sun Rui <sunrise_...@163.com 
> <mailto:sunrise_...@163.com>> wrote:
> I saw it, but I can’t see the complete error message on it.
> I mean the part after “error in invokingJava(…)”
> 
>> On May 19, 2016, at 08:37, Gayathri Murali <gayathri.m.sof...@gmail.com 
>> <mailto:gayathri.m.sof...@gmail.com>> wrote:
>> 
>> There was a screenshot attached to my original email. If you did not get it, 
>> attaching here again.
>> 
>> On Wed, May 18, 2016 at 5:27 PM, Sun Rui <sunrise_...@163.com 
>> <mailto:sunrise_...@163.com>> wrote:
>> It’s wrong behaviour that head(df) outputs no row
>> Could you send a screenshot displaying whole error message?
>>> On May 19, 2016, at 08:12, Gayathri Murali <gayathri.m.sof...@gmail.com 
>>> <mailto:gayathri.m.sof...@gmail.com>> wrote:
>>> 
>>> I am trying to run a basic example on Interactive R shell and run into the 
>>> following error. Also note that head(df) does not display any rows. Can 
>>> someone please help if I am missing something?
>>> 
>>> 
>>> 
>>>  Thanks
>>> Gayathri
>>> 
>>> [Attachment: Screen Shot 2016-05-18 at 5.09.29 PM.png (155K)]
>> 
>> [Attachment: Screen Shot 2016-05-18 at 5.09.29 PM.png (155K)]
> 
> 
> 



Re: SparkR dataframe error

2016-05-18 Thread Sun Rui
I saw it, but I can’t see the complete error message on it.
I mean the part after “error in invokingJava(…)”

> On May 19, 2016, at 08:37, Gayathri Murali <gayathri.m.sof...@gmail.com> 
> wrote:
> 
> There was a screenshot attached to my original email. If you did not get it, 
> attaching here again.
> 
> On Wed, May 18, 2016 at 5:27 PM, Sun Rui <sunrise_...@163.com 
> <mailto:sunrise_...@163.com>> wrote:
> It’s wrong behaviour that head(df) outputs no row
> Could you send a screenshot displaying whole error message?
>> On May 19, 2016, at 08:12, Gayathri Murali <gayathri.m.sof...@gmail.com 
>> <mailto:gayathri.m.sof...@gmail.com>> wrote:
>> 
>> I am trying to run a basic example on Interactive R shell and run into the 
>> following error. Also note that head(df) does not display any rows. Can 
>> someone please help if I am missing something?
>> 
>> 
>> 
>>  Thanks
>> Gayathri
>> 
>> [Attachment: Screen Shot 2016-05-18 at 5.09.29 PM.png (155K)]
> 
> [Attachment: Screen Shot 2016-05-18 at 5.09.29 PM.png (155K)]



Re: SparkR dataframe error

2016-05-18 Thread Sun Rui
It’s wrong behaviour that head(df) outputs no rows.
Could you send a screenshot displaying the whole error message?
> On May 19, 2016, at 08:12, Gayathri Murali  
> wrote:
> 
> I am trying to run a basic example on Interactive R shell and run into the 
> following error. Also note that head(df) does not display any rows. Can 
> someone please help if I am missing something?
> 
> 
> 
>  Thanks
> Gayathri
> 
> [Attachment: Screen Shot 2016-05-18 at 5.09.29 PM.png (155K)]


RE: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-26 Thread Sun, Rui
Thanks for the explanation.

What confuses me is the different internal semantics of Dataset for non-Row types 
(primitive types, for example) and the Row type:

Dataset[Int] is internally actually Dataset[Row(value:Int)]

scala> val ds = sqlContext.createDataset(Seq(1,2,3))
ds: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> ds.schema.json
res17: String = 
{"type":"struct","fields":[{"name":"value","type":"integer","nullable":false,"metadata":{}}]}

But obviously Dataset[Row] is not internally Dataset[Row(value: Row)].
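For contrast, a quick check (Spark 2.0 shell; a sketch, not from the original mail) that a 
Dataset[Row] keeps the Row’s own schema instead of wrapping it in a single “value” column:

val df = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
df.schema.json
// a struct with two fields, "id" and "name"; there is no extra "value" wrapper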

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Friday, February 26, 2016 3:55 PM
To: Sun, Rui <rui@intel.com>
Cc: Koert Kuipers <ko...@tresata.com>; dev@spark.apache.org
Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0

The join and joinWith are just two different join semantics, and is not about 
Dataset vs DataFrame.

join is the relational join, where fields are flattened; joinWith is more like 
a tuple join, where the output has two fields that are nested.

So you can do

Dataset[A] joinWith Dataset[B] = Dataset[(A, B)]

DataFrame[A] joinWith DataFrame[B] = Dataset[(Row, Row)]

Dataset[A] join Dataset[B] = Dataset[Row]

DataFrame[A] join DataFrame[B] = Dataset[Row]



On Thu, Feb 25, 2016 at 11:37 PM, Sun, Rui 
<rui@intel.com<mailto:rui@intel.com>> wrote:
Vote for option 2.
Source compatibility and binary compatibility are very important from the user’s 
perspective.
It’s unfair for Java developers that they don’t have the DataFrame abstraction. As 
you said, sometimes it is more natural to think about DataFrame.

I am wondering if conceptually there is a slight, subtle difference between 
DataFrame and Dataset[Row]? For example,
Dataset[T] joinWith Dataset[U]  produces Dataset[(T, U)]
So,
Dataset[Row] joinWith Dataset[Row]  produces Dataset[(Row, Row)]

While
DataFrame join DataFrame is still DataFrame of Row?

From: Reynold Xin [mailto:r...@databricks.com<mailto:r...@databricks.com>]
Sent: Friday, February 26, 2016 8:52 AM
To: Koert Kuipers <ko...@tresata.com<mailto:ko...@tresata.com>>
Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0

Yes - and that's why source compatibility is broken.

Note that it is not just a "convenience" thing. Conceptually DataFrame is a 
Dataset[Row], and for some developers it is more natural to think about 
"DataFrame" rather than "Dataset[Row]".

If we were in C++, DataFrame would've been a type alias for Dataset[Row] too, 
and some methods would return DataFrame (e.g. sql method).



On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers 
<ko...@tresata.com<mailto:ko...@tresata.com>> wrote:
since a type alias is purely a convenience thing for the scala compiler, does 
option 1 mean that the concept of DataFrame ceases to exist from a java 
perspective, and they will have to refer to Dataset?

On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin 
<r...@databricks.com<mailto:r...@databricks.com>> wrote:
When we first introduced Dataset in 1.6 as an experimental API, we wanted to 
merge Dataset/DataFrame but couldn't because we didn't want to break the 
pre-existing DataFrame API (e.g. map function should return Dataset, rather 
than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame and 
Dataset.

Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two ways 
to implement this:

Option 1. Make DataFrame a type alias for Dataset[Row]

Option 2. DataFrame as a concrete class that extends Dataset[Row]


I'm wondering what you think about this. The pros and cons I can think of are:


Option 1. Make DataFrame a type alias for Dataset[Row]

+ Cleaner conceptually, especially in Scala. It will be very clear what 
libraries or applications need to do, and we won't see type mismatches (e.g. a 
function expects DataFrame, but the user is passing in Dataset[Row])
+ A lot less code
- Breaks source compatibility for the DataFrame API in Java, and binary 
compatibility for Scala/Java


Option 2. DataFrame as a concrete class that extends Dataset[Row]

The pros/cons are basically the inverse of Option 1.

+ In most cases, can maintain source compatibility for the DataFrame API in 
Java, and binary compatibility for Scala/Java
- A lot more code (1000+ loc)
- Less clean, and can be confusing when users pass a Dataset[Row] into a 
function that expects a DataFrame


The concerns are mostly with Scala/Java. For Python, it is very easy to 
maintain source compatibility for both (there is no concept of binary 
compatibility), and for R, we are only supporting the DataFrame operations 
anyway because that's more familiar interface for R users outside of Spark.







RE: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Sun, Rui
Vote for option 2.
Source compatibility and binary compatibility are very important from the user’s 
perspective.
It’s unfair for Java developers that they don’t have the DataFrame abstraction. As 
you said, sometimes it is more natural to think about DataFrame.

I am wondering if conceptually there is a slight, subtle difference between 
DataFrame and Dataset[Row]? For example,
Dataset[T] joinWith Dataset[U]  produces Dataset[(T, U)]
So,
Dataset[Row] joinWith Dataset[Row]  produces Dataset[(Row, Row)]

While
DataFrame join DataFrame is still DataFrame of Row?
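A small Scala sketch (assuming the Spark 2.0 Dataset API with spark.implicits._ imported; 
not part of the original mail) of the two result types in question:

import org.apache.spark.sql.{DataFrame, Dataset}

val ds1 = Seq(1, 2, 3).toDS()   // Dataset[Int], single column named "value"
val ds2 = Seq(2, 3, 4).toDS()

// tuple-style join: the two sides stay nested
val nested: Dataset[(Int, Int)] = ds1.joinWith(ds2, ds1("value") === ds2("value"))

// relational join: the fields are flattened into a Dataset[Row], i.e. a DataFrame
val flat: DataFrame = ds1.join(ds2, ds1("value") === ds2("value"))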

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Friday, February 26, 2016 8:52 AM
To: Koert Kuipers 
Cc: dev@spark.apache.org
Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0

Yes - and that's why source compatibility is broken.

Note that it is not just a "convenience" thing. Conceptually DataFrame is a 
Dataset[Row], and for some developers it is more natural to think about 
"DataFrame" rather than "Dataset[Row]".

If we were in C++, DataFrame would've been a type alias for Dataset[Row] too, 
and some methods would return DataFrame (e.g. sql method).



On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers 
> wrote:
since a type alias is purely a convenience thing for the scala compiler, does 
option 1 mean that the concept of DataFrame ceases to exist from a java 
perspective, and they will have to refer to Dataset?

On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin 
> wrote:
When we first introduced Dataset in 1.6 as an experimental API, we wanted to 
merge Dataset/DataFrame but couldn't because we didn't want to break the 
pre-existing DataFrame API (e.g. map function should return Dataset, rather 
than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame and 
Dataset.

Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two ways 
to implement this:

Option 1. Make DataFrame a type alias for Dataset[Row]

Option 2. DataFrame as a concrete class that extends Dataset[Row]


I'm wondering what you think about this. The pros and cons I can think of are:


Option 1. Make DataFrame a type alias for Dataset[Row]

+ Cleaner conceptually, especially in Scala. It will be very clear what 
libraries or applications need to do, and we won't see type mismatches (e.g. a 
function expects DataFrame, but the user is passing in Dataset[Row])
+ A lot less code
- Breaks source compatibility for the DataFrame API in Java, and binary 
compatibility for Scala/Java


Option 2. DataFrame as a concrete class that extends Dataset[Row]

The pros/cons are basically the inverse of Option 1.

+ In most cases, can maintain source compatibility for the DataFrame API in 
Java, and binary compatibility for Scala/Java
- A lot more code (1000+ loc)
- Less clean, and can be confusing when users pass a Dataset[Row] into a 
function that expects a DataFrame


The concerns are mostly with Scala/Java. For Python, it is very easy to 
maintain source compatibility for both (there is no concept of binary 
compatibility), and for R, we are only supporting the DataFrame operations 
anyway because that's more familiar interface for R users outside of Spark.






RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

2016-02-07 Thread Sun, Rui
This should be solved by your pending PR 
https://github.com/apache/spark/pull/10480, right?

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Sunday, February 7, 2016 8:50 PM
To: Sun, Rui <rui@intel.com>; Andrew Holway 
<andrew.hol...@otternetworks.de>; dev@spark.apache.org
Subject: RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

I mean not exposed from the SparkR API.
Calling it from R without a SparkR API would require either a serializer change 
or a JVM wrapper function.

On Sun, Feb 7, 2016 at 4:47 AM -0800, "Felix Cheung" 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
That does but it's a bit hard to call from R since it is not exposed.



On Sat, Feb 6, 2016 at 11:57 PM -0800, "Sun, Rui" 
<rui@intel.com<mailto:rui@intel.com>> wrote:

DataFrameWriter.jdbc() does not work?



From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Sunday, February 7, 2016 9:54 AM
To: Andrew Holway 
<andrew.hol...@otternetworks.de<mailto:andrew.hol...@otternetworks.de>>; 
dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: Fwd: Writing to jdbc database from SparkR (1.5.2)



Unfortunately I couldn't find a simple workaround. It seems to be an issue with 
DataFrameWriter.save() that does not work with jdbc source/format



For instance, this does not work in Scala either

df1.write.format("jdbc").mode("overwrite").option("url", 
"jdbc:mysql://something.rds.amazonaws.com<http://something.rds.amazonaws.com>:3306?user=user=password").option("dbtable",
 "table").save()



For Spark 1.5.x, it seems the best option would be to write a JVM wrapper and 
call it from R.



_
From: Andrew Holway 
<andrew.hol...@otternetworks.de<mailto:andrew.hol...@otternetworks.de>>
Sent: Saturday, February 6, 2016 11:22 AM
Subject: Fwd: Writing to jdbc database from SparkR (1.5.2)
To: <dev@spark.apache.org<mailto:dev@spark.apache.org>>

Hi,



I have a thread on u...@spark.apache.org<mailto:u...@spark.apache.org> but I 
think this might require developer attention.



I'm reading data from a database: This is working well.

> df <- read.df(sqlContext, source="jdbc", 
> url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass<http://database.foo.eu-west-1.rds.amazonaws.com:3306/?user=user=pass>")



When I try and write something back to the DB I see this following error:



> write.df(fooframe, path="NULL", source="jdbc", 
> url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass<http://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass>",
>  dbtable="db.table", mode="append")



16/02/06 19:05:43 ERROR RBackendHandler: save on 2 failed

Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :

  java.lang.RuntimeException: 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow 
create table as select.

at scala.sys.package$.error(package.scala:27)

at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)

at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)

at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1855)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:497)

at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:132)

at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:79)

at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)

at io.netty.channel.SimpleChannelIn



Any ideas on a workaround?



Thanks,



Andrew




RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

2016-02-06 Thread Sun, Rui
DataFrameWriter.jdbc() does not work?
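For example, something like this (Scala, Spark 1.5 API; the URL, table name and 
credentials are placeholders, and df stands for the DataFrame to write):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "user")
props.setProperty("password", "pass")

// unlike format("jdbc").save(), the jdbc() method does not go through the
// data source's create-table-as-select path
df.write.mode("append").jdbc("jdbc:mysql://host:3306/db", "db.table", props)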

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Sunday, February 7, 2016 9:54 AM
To: Andrew Holway ; dev@spark.apache.org
Subject: Re: Fwd: Writing to jdbc database from SparkR (1.5.2)

Unfortunately I couldn't find a simple workaround. It seems to be an issue with 
DataFrameWriter.save() that does not work with jdbc source/format

For instance, this does not work in Scala either
df1.write.format("jdbc").mode("overwrite").option("url", 
"jdbc:mysql://something.rds.amazonaws.com:3306?user=user=password").option("dbtable",
 "table").save()

For Spark 1.5.x, it seems the best option would be to write a JVM wrapper and 
call it from R.

_
From: Andrew Holway 
>
Sent: Saturday, February 6, 2016 11:22 AM
Subject: Fwd: Writing to jdbc database from SparkR (1.5.2)
To: >

Hi,

I have a thread on u...@spark.apache.org but I 
think this might require developer attention.

I'm reading data from a database: This is working well.

> df <- read.df(sqlContext, source="jdbc", 
> url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass")

When I try and write something back to the DB I see this following error:


> write.df(fooframe, path="NULL", source="jdbc", 
> url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass",
>  dbtable="db.table", mode="append")



16/02/06 19:05:43 ERROR RBackendHandler: save on 2 failed

Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :

  java.lang.RuntimeException: 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow 
create table as select.

at scala.sys.package$.error(package.scala:27)

at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)

at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)

at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1855)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:497)

at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:132)

at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:79)

at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)

at io.netty.channel.SimpleChannelIn



Any ideas on a workaround?



Thanks,



Andrew



RE: Specifying Scala types when calling methods from SparkR

2015-12-10 Thread Sun, Rui
Hi, Chris,

I know your point: the objectFile and saveAsObjectFile pair in SparkR can only be 
used in a SparkR context, as the content of the RDD is assumed to be serialized R 
objects.

It’s fine to drop down to the JVM level in the case where the model is saved as an 
objectFile in Scala, and then load it in SparkR. But I don’t understand “but that 
seems to only work if you specify the type”; it seems there is no need to specify the 
type because of type erasure?

Did you try something like: convert the RDD to a DataFrame, save it, and load it 
as a DataFrame in SparkR and then convert it back to an RDD?

From: Chris Freeman [mailto:cfree...@alteryx.com]
Sent: Friday, December 11, 2015 2:47 AM
To: Sun, Rui; shiva...@eecs.berkeley.edu
Cc: dev@spark.apache.org
Subject: RE: Specifying Scala types when calling methods from SparkR

Hi Sun Rui,

I’ve had some luck simply using “objectFile” when saving from SparkR directly. 
The problem is that if you do it that way, the model object will only work if 
you continue to use the current Spark Context, and I think model persistence 
should really enable you to use the model at a later time. That’s where I found 
that I could drop down to the JVM level and interact with the Scala object 
directly, but that seems to only work if you specify the type.



On December 9, 2015 at 7:59:43 PM, Sun, Rui 
(rui@intel.com<mailto:rui@intel.com>) wrote:
Hi,

Just use "objectFile" instead of "objectFile[PipelineModel]" for callJMethod. 
You can take the objectFile() in context.R as an example.

Since the SparkContext created in SparkR is actually a JavaSparkContext, there 
is no need to pass the implicit ClassTag.

-Original Message-
From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Sent: Thursday, December 10, 2015 8:21 AM
To: Chris Freeman
Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: Specifying Scala types when calling methods from SparkR

The SparkR callJMethod can only invoke methods as they show up in the Java byte 
code. So in this case you'll need to check the SparkContext byte code (with 
javap or something like that) to see how that method looks. My guess is the 
type is passed in as a class tag argument, so you'll need to do something like 
create a class tag for the LinearRegressionModel and pass that in as the first 
or last argument etc.

Thanks
Shivaram

On Wed, Dec 9, 2015 at 10:11 AM, Chris Freeman 
<cfree...@alteryx.com<mailto:cfree...@alteryx.com>> wrote:
> Hey everyone,
>
> I’m currently looking at ways to save out SparkML model objects from
> SparkR and I’ve had some luck putting the model into an RDD and then
> saving the RDD as an Object File. Once it’s saved, I’m able to load it
> back in with something like:
>
> sc.objectFile[LinearRegressionModel](“path/to/model”)
>
> I’d like to try and replicate this same process from SparkR using the
> JVM backend APIs (e.g. “callJMethod”), but so far I haven’t been able
> to replicate my success and I’m guessing that it’s (at least in part)
> due to the necessity of specifying the type when calling the objectFile 
> method.
>
> Does anyone know if this is actually possible? For example, here’s
> what I’ve come up with so far:
>
> loadModel <- function(sc, modelPath) {
> modelRDD <- SparkR:::callJMethod(sc,
>
> "objectFile[PipelineModel]",
> modelPath,
> SparkR:::getMinPartitions(sc, NULL))
> return(modelRDD)
> }
>
> Any help is appreciated!
>
> --
> Chris Freeman
>



RE: Specifying Scala types when calling methods from SparkR

2015-12-09 Thread Sun, Rui
Hi,

Just use "objectFile" instead of "objectFile[PipelineModel]" for callJMethod. 
You can take the objectFile() in context.R as an example.

Since the SparkContext created in SparkR is actually a JavaSparkContext, there 
is no need to pass the implicit ClassTag.
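For reference, a small Scala sketch (illustrative only, against the Spark 1.5/1.6 APIs) of 
the difference:

import scala.reflect.ClassTag
import org.apache.spark.api.java.JavaSparkContext

// Scala API: the type parameter of objectFile[T] becomes an implicit ClassTag
// argument after erasure, so calling it reflectively needs that extra argument
val rdd1 = sc.objectFile[AnyRef]("path/to/model", 2)
// roughly equivalent to: sc.objectFile("path/to/model", 2)(ClassTag.AnyRef)

// Java API: JavaSparkContext.objectFile supplies a ClassTag internally, so only the
// path and minPartitions are needed -- which is what SparkR's callJMethod passes
val jsc = JavaSparkContext.fromSparkContext(sc)
val rdd2 = jsc.objectFile[AnyRef]("path/to/model", 2)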

-Original Message-
From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] 
Sent: Thursday, December 10, 2015 8:21 AM
To: Chris Freeman
Cc: dev@spark.apache.org
Subject: Re: Specifying Scala types when calling methods from SparkR

The SparkR callJMethod can only invoke methods as they show up in the Java byte 
code. So in this case you'll need to check the SparkContext byte code (with 
javap or something like that) to see how that method looks. My guess is the 
type is passed in as a class tag argument, so you'll need to do something like 
create a class tag for the LinearRegressionModel and pass that in as the first 
or last argument etc.

Thanks
Shivaram

On Wed, Dec 9, 2015 at 10:11 AM, Chris Freeman  wrote:
> Hey everyone,
>
> I’m currently looking at ways to save out SparkML model objects from 
> SparkR and I’ve had some luck putting the model into an RDD and then 
> saving the RDD as an Object File. Once it’s saved, I’m able to load it 
> back in with something like:
>
> sc.objectFile[LinearRegressionModel](“path/to/model”)
>
> I’d like to try and replicate this same process from SparkR using the 
> JVM backend APIs (e.g. “callJMethod”), but so far I haven’t been able 
> to replicate my success and I’m guessing that it’s (at least in part) 
> due to the necessity of specifying the type when calling the objectFile 
> method.
>
> Does anyone know if this is actually possible? For example, here’s 
> what I’ve come up with so far:
>
> loadModel <- function(sc, modelPath) {
>   modelRDD <- SparkR:::callJMethod(sc,
>
> "objectFile[PipelineModel]",
> modelPath,
> SparkR:::getMinPartitions(sc, NULL))
>   return(modelRDD)
> }
>
> Any help is appreciated!
>
> --
> Chris Freeman
>




RE: SparkR package path

2015-09-24 Thread Sun, Rui
Yes, the current implementation requires the backend to be on the same host as the 
SparkR package. But this does not prevent SparkR from connecting to a remote 
Spark cluster specified by a Spark master URL. The only thing needed is a 
Spark JAR co-located with the SparkR package on the same client 
machine. This is similar to any Spark application, which also depends on the Spark 
JAR.

Theoretically, as the SparkR package communicates with the backend via a socket, the 
backend could be running on a different host. But this would make the launching 
of SparkR more complex, requiring a non-trivial change to spark-submit. Also, 
additional network traffic overhead would be incurred. I can’t see any 
compelling demand for this.

From: Hossein [mailto:fal...@gmail.com]
Sent: Friday, September 25, 2015 5:09 AM
To: shiva...@eecs.berkeley.edu
Cc: Sun, Rui; dev@spark.apache.org; Dan Putler
Subject: Re: SparkR package path

Right now in sparkR.R the backend hostname is hard coded to "localhost" 
(https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L156).

If we make that address configurable / parameterized, then a user can connect a 
remote Spark cluster with no need to have spark jars on their local machine. I 
have got this request from some R users. Their company has a Spark cluster 
(usually managed by another team), and they want to connect to it from their 
workstation (e.g., from within RStudio, etc).



--Hossein

On Thu, Sep 24, 2015 at 12:25 PM, Shivaram Venkataraman 
<shiva...@eecs.berkeley.edu<mailto:shiva...@eecs.berkeley.edu>> wrote:
I don't think the crux of the problem is about users who download the
source -- Spark's source distribution is clearly marked as something
that needs to be built and they can run `mvn -DskipTests -Psparkr
package` based on instructions in the Spark docs.

The crux of the problem is that with a source or binary R package, the
client side the SparkR code needs the Spark JARs to be available. So
we can't just connect to a remote Spark cluster using just the R
scripts as we need the Scala classes around to create a Spark context
etc.

But this is a use case that I've heard from a lot of users -- my take
is that this should be a separate package / layer on top of SparkR.
Dan Putler (cc'd) had a proposal on a client package for this and
maybe able to add more.

Thanks
Shivaram

On Thu, Sep 24, 2015 at 11:36 AM, Hossein 
<fal...@gmail.com<mailto:fal...@gmail.com>> wrote:
> Requiring users to download entire Spark distribution to connect to a remote
> cluster (which is already running Spark) seems an over kill. Even for most
> spark users who download Spark source, it is very unintuitive that they need
> to run a script named "install-dev.sh" before they can run SparkR.
>
> --Hossein
>
> On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui 
> <rui@intel.com<mailto:rui@intel.com>> wrote:
>>
>> SparkR package is not a standalone R package, as it is actually R API of
>> Spark and needs to co-operate with a matching version of Spark, so exposing
>> it in CRAN does not ease use of R users as they need to download matching
>> Spark distribution, unless we expose a bundled SparkR package to CRAN
>> (packageing with Spark), is this desirable? Actually, for normal users who
>> are not developers, they are not required to download Spark source, build
>> and install SparkR package. They just need to download a Spark distribution,
>> and then use SparkR.
>>
>>
>>
>> For using SparkR in Rstudio, there is a documentation at
>> https://github.com/apache/spark/tree/master/R
>>
>>
>>
>>
>>
>>
>>
>> From: Hossein [mailto:fal...@gmail.com<mailto:fal...@gmail.com>]
>> Sent: Thursday, September 24, 2015 1:42 AM
>> To: shiva...@eecs.berkeley.edu<mailto:shiva...@eecs.berkeley.edu>
>> Cc: Sun, Rui; dev@spark.apache.org<mailto:dev@spark.apache.org>
>> Subject: Re: SparkR package path
>>
>>
>>
>> Yes, I think exposing SparkR in CRAN can significantly expand the reach of
>> both SparkR and Spark itself to a larger community of data scientists (and
>> statisticians).
>>
>>
>>
>> I have been getting questions on how to use SparkR in RStudio. Most of
>> these folks have a Spark Cluster and wish to talk to it from RStudio. While
>> that is a bigger task, for now, first step could be not requiring them to
>> download Spark source and run a script that is named install-dev.sh. I filed
>> SPARK-10776 to track this.
>>
>>
>>
>>
>> --Hossein
>>
>>
>>
>> On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman
>> <shiva...@eecs.berkeley.edu<mailto:shiva...@eecs.berkeley.edu>> wrote:
>>
>>

RE: SparkR package path

2015-09-24 Thread Sun, Rui
If  a user downloads Spark source, of course he needs to build it before 
running it. But a user can download pre-built Spark binary distributions, then 
he can directly use sparkR after deployment of the Spark cluster.

From: Hossein [mailto:fal...@gmail.com]
Sent: Friday, September 25, 2015 2:37 AM
To: Sun, Rui
Cc: shiva...@eecs.berkeley.edu; dev@spark.apache.org
Subject: Re: SparkR package path

Requiring users to download entire Spark distribution to connect to a remote 
cluster (which is already running Spark) seems an over kill. Even for most 
spark users who download Spark source, it is very unintuitive that they need to 
run a script named "install-dev.sh" before they can run SparkR.

--Hossein

On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui 
<rui@intel.com<mailto:rui@intel.com>> wrote:
The SparkR package is not a standalone R package, as it is actually the R API of Spark 
and needs to co-operate with a matching version of Spark, so exposing it on 
CRAN does not ease use for R users, as they would still need to download a matching Spark 
distribution, unless we expose a bundled SparkR package (packaged with Spark) on 
CRAN; is this desirable? Actually, normal users who are not 
developers are not required to download the Spark source, build it and install the 
SparkR package. They just need to download a Spark distribution, and then use 
SparkR.

For using SparkR in RStudio, there is documentation at 
https://github.com/apache/spark/tree/master/R



From: Hossein [mailto:fal...@gmail.com<mailto:fal...@gmail.com>]
Sent: Thursday, September 24, 2015 1:42 AM
To: shiva...@eecs.berkeley.edu<mailto:shiva...@eecs.berkeley.edu>
Cc: Sun, Rui; dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: SparkR package path

Yes, I think exposing SparkR in CRAN can significantly expand the reach of both 
SparkR and Spark itself to a larger community of data scientists (and 
statisticians).

I have been getting questions on how to use SparkR in RStudio. Most of these 
folks have a Spark Cluster and wish to talk to it from RStudio. While that is a 
bigger task, for now, first step could be not requiring them to download Spark 
source and run a script that is named install-dev.sh. I filed SPARK-10776 to 
track this.


--Hossein

On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman 
<shiva...@eecs.berkeley.edu<mailto:shiva...@eecs.berkeley.edu>> wrote:
As Rui says it would be good to understand the use case we want to
support (supporting CRAN installs could be one for example). I don't
think it should be very hard to do as the RBackend itself doesn't use
the R source files. The RRDD does use it and the value comes from
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
AFAIK -- So we could introduce a new config flag that can be used for
this new mode.

Thanks
Shivaram

On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui 
<rui@intel.com<mailto:rui@intel.com>> wrote:
> Hossein,
>
>
>
> Any strong reason to download and install SparkR source package separately
> from the Spark distribution?
>
> An R user can simply download the spark distribution, which contains SparkR
> source and binary package, and directly use sparkR. No need to install
> SparkR package at all.
>
>
>
> From: Hossein [mailto:fal...@gmail.com<mailto:fal...@gmail.com>]
> Sent: Tuesday, September 22, 2015 9:19 AM
> To: dev@spark.apache.org<mailto:dev@spark.apache.org>
> Subject: SparkR package path
>
>
>
> Hi dev list,
>
>
>
> SparkR backend assumes SparkR source files are located under
> "SPARK_HOME/R/lib/." This directory is created by running R/install-dev.sh.
> This setting makes sense for Spark developers, but if an R user downloads
> and installs the SparkR source package, the source files are going to be
> placed in different locations.
>
>
>
> In the R runtime it is easy to find location of package files using
> path.package("SparkR"). But we need to make some changes to R backend and/or
> spark-submit so that, JVM process learns the location of worker.R and
> daemon.R and shell.R from the R runtime.
>
>
>
> Do you think this change is feasible?
>
>
>
> Thanks,
>
> --Hossein




RE: SparkR package path

2015-09-23 Thread Sun, Rui
The SparkR package is not a standalone R package, as it is actually the R API of Spark 
and needs to co-operate with a matching version of Spark, so exposing it on 
CRAN does not ease use for R users, as they would still need to download a matching Spark 
distribution, unless we expose a bundled SparkR package (packaged with Spark) on 
CRAN; is this desirable? Actually, normal users who are not 
developers are not required to download the Spark source, build it and install the 
SparkR package. They just need to download a Spark distribution, and then use 
SparkR.

For using SparkR in RStudio, there is documentation at 
https://github.com/apache/spark/tree/master/R



From: Hossein [mailto:fal...@gmail.com]
Sent: Thursday, September 24, 2015 1:42 AM
To: shiva...@eecs.berkeley.edu
Cc: Sun, Rui; dev@spark.apache.org
Subject: Re: SparkR package path

Yes, I think exposing SparkR in CRAN can significantly expand the reach of both 
SparkR and Spark itself to a larger community of data scientists (and 
statisticians).

I have been getting questions on how to use SparkR in RStudio. Most of these 
folks have a Spark Cluster and wish to talk to it from RStudio. While that is a 
bigger task, for now, first step could be not requiring them to download Spark 
source and run a script that is named install-dev.sh. I filed SPARK-10776 to 
track this.


--Hossein

On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman 
<shiva...@eecs.berkeley.edu<mailto:shiva...@eecs.berkeley.edu>> wrote:
As Rui says it would be good to understand the use case we want to
support (supporting CRAN installs could be one for example). I don't
think it should be very hard to do as the RBackend itself doesn't use
the R source files. The RRDD does use it and the value comes from
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
AFAIK -- So we could introduce a new config flag that can be used for
this new mode.

Thanks
Shivaram

On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui 
<rui@intel.com<mailto:rui@intel.com>> wrote:
> Hossein,
>
>
>
> Any strong reason to download and install SparkR source package separately
> from the Spark distribution?
>
> An R user can simply download the spark distribution, which contains SparkR
> source and binary package, and directly use sparkR. No need to install
> SparkR package at all.
>
>
>
> From: Hossein [mailto:fal...@gmail.com<mailto:fal...@gmail.com>]
> Sent: Tuesday, September 22, 2015 9:19 AM
> To: dev@spark.apache.org<mailto:dev@spark.apache.org>
> Subject: SparkR package path
>
>
>
> Hi dev list,
>
>
>
> SparkR backend assumes SparkR source files are located under
> "SPARK_HOME/R/lib/." This directory is created by running R/install-dev.sh.
> This setting makes sense for Spark developers, but if an R user downloads
> and installs the SparkR source package, the source files are going to be
> placed in different locations.
>
>
>
> In the R runtime it is easy to find location of package files using
> path.package("SparkR"). But we need to make some changes to R backend and/or
> spark-submit so that, JVM process learns the location of worker.R and
> daemon.R and shell.R from the R runtime.
>
>
>
> Do you think this change is feasible?
>
>
>
> Thanks,
>
> --Hossein



RE: SparkR package path

2015-09-21 Thread Sun, Rui
Hossein,

Is there any strong reason to download and install the SparkR source package separately from 
the Spark distribution?
An R user can simply download the Spark distribution, which contains the SparkR 
source and binary package, and directly use sparkR. There is no need to install the SparkR 
package at all.

From: Hossein [mailto:fal...@gmail.com]
Sent: Tuesday, September 22, 2015 9:19 AM
To: dev@spark.apache.org
Subject: SparkR package path

Hi dev list,

The SparkR backend assumes SparkR source files are located under 
"SPARK_HOME/R/lib/". This directory is created by running R/install-dev.sh. 
This setting makes sense for Spark developers, but if an R user downloads and 
installs the SparkR source package, the source files are going to be placed in 
different locations.

In the R runtime it is easy to find location of package files using 
path.package("SparkR"). But we need to make some changes to R backend and/or 
spark-submit so that, JVM process learns the location of worker.R and daemon.R 
and shell.R from the R runtime.

Do you think this change is feasible?

Thanks,
--Hossein


[SparkR] is toDF() necessary

2015-05-08 Thread Sun, Rui
toDF() is defined to convert an RDD to a DataFrame. But it is just a very thin 
wrapper around createDataFrame() that helps the caller avoid passing in the SQLContext.

Since Scala/PySpark does not have toDF(), and we’d better keep the API as narrow 
and simple as possible: is toDF() really necessary? Could we eliminate it?