Re: Dynamic resource allocation to Spark on Mesos

2017-02-08 Thread Sun Rui
marathon to run the shuffle service? > On Tue, Feb 7, 2017 at 7:36 PM, Sun Rui <sunrise_...@163.com> wrote: > Yi Jan, > We have been using Spark on Mesos with dynamic allocation enabled, which > works and improves the overall c

Re: Dynamic resource allocation to Spark on Mesos

2017-02-07 Thread Sun Rui
Yi Jan, We have been using Spark on Mesos with dynamic allocation enabled, which works and improves the overall cluster utilization. In terms of job, do you mean jobs inside a Spark application or jobs among different applications? Maybe you can read
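For reference, a minimal Scala sketch of the configuration this kind of setup usually involves (the executor bounds are illustrative, and the external shuffle service must already be running on each Mesos agent, e.g. launched via Marathon as discussed in the later reply):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative settings only; dynamic allocation additionally requires the
// external shuffle service to be running on every agent.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-on-mesos")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")
val spark = SparkSession.builder().config(conf).getOrCreate()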

Re: RDD Location

2016-12-30 Thread Sun Rui
ut it seems to be suspended when executing this function. But if I move the > code to other places, like the main() function, it runs well. > > What is the reason for it? > > Thanks, > Fei > > On Fri, Dec 30, 2016 at 2:38 AM, Sun Rui <sunrise_...@163.com>

Re: RDD Location

2016-12-29 Thread Sun Rui
Maybe you can create your own subclass of RDD and override the getPreferredLocations() to implement the logic of dynamic changing of the locations. > On Dec 30, 2016, at 12:06, Fei Hu wrote: > > Dear all, > > Is there any way to change the host location for a certain

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-27 Thread Sun Rui
Although the Spark task scheduler is aware of rack-level data locality, it seems that only YARN implements the support for it. However, node-level locality can still work for Standalone. It is not necessary to copy the hadoop config files into the Spark CONF directory. Set HADOOP_CONF_DIR to

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Sun Rui
d.com <mailto:tony@tendcloud.com> > From: Sun Rui <sunrise_...@163.com> > Date: 2016-08-24 22:17 > To: Saisai Shao <sai.sai.s...@gmail.com> > CC: tony@tendcloud.com; user <user@spark.apa

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Sun Rui
>), but not so stable as I tried > before. > On Wed, Aug 24, 2016 at 10:09 PM, Sun Rui <sunrise_...@163.com> wrote: > For HDFS, maybe you can try mounting HDFS as NFS. But not sure about the > stability, and also there is additiona

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Sun Rui
For HDFS, maybe you can try mounting HDFS as NFS, but I am not sure about the stability, and there is also the additional overhead of network I/O and HDFS file replication. > On Aug 24, 2016, at 21:02, Saisai Shao wrote: > > Spark Shuffle uses Java File related API to create local

Re: SparkR error when repartition is called

2016-08-09 Thread Sun Rui
I can’t reproduce your issue with len=1 in local mode. Could you give more environment information? > On Aug 9, 2016, at 11:35, Shane Lee wrote: > > Hi All, > > I am trying out SparkR 2.0 and have run into an issue with repartition. > > Here is the R code

Re: how to run local[k] threads on a single core

2016-08-04 Thread Sun Rui
I don’t think it is possible, as Spark does not support thread-to-CPU affinity. > On Aug 4, 2016, at 14:27, sujeet jog wrote: > > Is there a way we can run multiple tasks concurrently on a single core in > local mode. > > for ex :- i have 5 partition ~ 5 tasks, and only a

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-08-03 Thread Sun Rui
--num-executors does not work for Standalone mode. Try --total-executor-cores > On Jul 26, 2016, at 00:17, Mich Talebzadeh wrote: > > Hi, > > > I am doing some tests > > I have started Spark in Standalone mode. > > For simplicity I am using one node only with 8

Re: How to partition a SparkDataFrame using all distinct column values in sparkR

2016-08-03 Thread Sun Rui
SparkDataFrame.repartition() uses hash partitioning. It guarantees that all rows with the same column value go to the same partition, but it does not guarantee that each partition contains only a single column value. Fortunately, Spark 2.0 comes with gapply() in SparkR. You can apply an R

Re: [2.0.0] mapPartitions on DataFrame unable to find encoder

2016-08-02 Thread Sun Rui
import org.apache.spark.sql.catalyst.encoders.RowEncoder implicit val encoder = RowEncoder(df.schema) df.mapPartitions(_.take(1)) > On Aug 3, 2016, at 04:55, Dragisa Krsmanovic wrote: > > I am trying to use mapPartitions on DataFrame. > > Example: > > import
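A slightly fuller, self-contained sketch of the same idea, assuming Spark 2.x where RowEncoder(schema) yields an ExpressionEncoder[Row] (the sample data is illustrative):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}

val spark = SparkSession.builder().appName("mapPartitions-on-DataFrame").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

// mapPartitions on a DataFrame needs an Encoder[Row] in scope;
// build one from the DataFrame's own schema.
implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)

// Keep only the first row of each partition.
df.mapPartitions(_.take(1)).show()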

Re: Application not showing in Spark History

2016-08-02 Thread Sun Rui
bin/spark-submit will set some env variables, like SPARK_HOME, that Spark later uses to locate spark-defaults.conf, from which default settings for Spark are loaded. I would guess that some configuration option like spark.eventLog.enabled in spark-defaults.conf is skipped by

Re: SPARK Exception thrown in awaitResult

2016-07-28 Thread Sun Rui
Are you using Mesos? If not, https://issues.apache.org/jira/browse/SPARK-16522 is not relevant. Please give more information about your Spark environment, and the full stack trace. > On Jul 28, 2016, at 17:44, Carlo.Allocca

Re: Spark 2.0 on YARN - Dynamic Resource Allocation Behavior change?

2016-07-28 Thread Sun Rui
Yes, this is a change in Spark 2.0. you can take a look at https://issues.apache.org/jira/browse/SPARK-13723 In the latest Spark On Yarn documentation for Spark 2.0, there is

Re: Spark 2.0 SparkSession, SparkConf, SparkContext

2016-07-27 Thread Sun Rui
If you want to keep using RDD API, then you still need to create SparkContext first. If you want to use just Dataset/DataFrame/SQL API, then you can directly create a SparkSession. Generally the SparkContext is hidden although it is internally created and held within the SparkSession. Anytime
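A minimal sketch of this in Spark 2.0 (app name and data are illustrative):

import org.apache.spark.sql.SparkSession

// Create a SparkSession directly for Dataset/DataFrame/SQL work.
val spark = SparkSession.builder()
  .appName("session-demo")
  .master("local[*]")
  .getOrCreate()

// The SparkContext is created and held internally by the session,
// so the RDD API remains reachable when needed.
val sc = spark.sparkContext
println(sc.parallelize(1 to 10).sum())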

Re: Using flatMap on Dataframes with Spark 2.0

2016-07-23 Thread Sun Rui
stom-objects-in-a-dataset-in-spark-1-6 > <http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset-in-spark-1-6> > How did you setup your encoder? > From: "Sun Rui" <sunrise_...@163.com> > To: "Julien Nauroy"

Re: Using flatMap on Dataframes with Spark 2.0

2016-07-23 Thread Sun Rui
I did a try. the schema after flatMap is the same, which is expected. What’s your Row encoder? > On Jul 23, 2016, at 20:36, Julien Nauroy wrote: > > Hi, > > I'm trying to call flatMap on a Dataframe with Spark 2.0 (rc5). > The code is the following: > var data =

Re: How to convert from DataFrame to Dataset[Row]?

2016-07-16 Thread Sun Rui
For Spark 1.6.x, a DataFrame can't be directly converted to a Dataset[Row], but it can be done indirectly as follows: import org.apache.spark.sql.catalyst.encoders.RowEncoder // assume df is a DataFrame implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema) val ds = df.as[Row] However,

Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Sun Rui
You can simply save the join result distributedly, for example, as an HDFS file, and then copy the HDFS file to a local file. There is an alternative, memory-efficient way to collect distributed data back to the driver other than collect(): toLocalIterator. The iterator will consume as much
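A minimal sketch of both options (data and output path are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("collect-alternatives").master("local[*]").getOrCreate()
import spark.implicits._

val a = Seq((1, "x"), (2, "y")).toDF("id", "a")
val b = Seq((1, 10.0), (2, 20.0)).toDF("id", "b")
val joined = a.join(b, Seq("id"))

// Option 1: save the result distributedly (e.g. to HDFS) and copy it locally afterwards.
joined.write.mode("overwrite").parquet("/tmp/join_result")

// Option 2: pull rows back to the driver one partition at a time;
// unlike collect(), this only needs memory for a single partition.
joined.rdd.toLocalIterator.foreach(println)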

Re: Issue in spark job. Remote rpc client dissociated

2016-07-14 Thread Sun Rui
Where is argsList defined? is Launcher.main() thread-safe? Note that if multiple folders are processed in a node, multiple threads may concurrently run in the executor, each processing a folder. > On Jul 14, 2016, at 12:28, Balachandar R.A. wrote: > > Hello Ted, >

Re: Enforcing shuffle hash join

2016-07-04 Thread Sun Rui
You can try setting the “spark.sql.join.preferSortMergeJoin” conf option to false. For detailed join strategies, take a look at the source code of SparkStrategies.scala: /** * Select the proper physical plan for join based on joining keys and size of logical plan. * * At first, uses the
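For example, assuming Spark 2.x, the option can be set when building the session and the effect checked in the physical plan (sample data is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("prefer-shuffle-hash-join")
  .master("local[*]")
  // With this set to false, the planner may pick a shuffle hash join
  // when the other conditions in SparkStrategies allow it.
  .config("spark.sql.join.preferSortMergeJoin", "false")
  .getOrCreate()
import spark.implicits._

val a = Seq((1, "x"), (2, "y")).toDF("id", "a")
val b = Seq((1, 10.0), (2, 20.0)).toDF("id", "b")
a.join(b, Seq("id")).explain()   // inspect which join was selected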

Re: One map per folder in spark or Hadoop

2016-06-30 Thread Sun Rui
Say you have got all of your folder paths into a val folders: Seq[String]. val add = sc.parallelize(folders, folders.size).mapPartitions { iter => val folder = iter.next; val status: Int = …; Seq(status).toIterator } > On Jun 30, 2016, at 16:42, Balachandar R.A.
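A runnable version of that sketch (the folder paths and the external command are illustrative):

import org.apache.spark.sql.SparkSession
import scala.sys.process._

val spark = SparkSession.builder().appName("one-task-per-folder").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical folder list; the paths must be visible from every worker node.
val folders: Seq[String] = Seq("/data/f1", "/data/f2", "/data/f3")

// One partition per folder, so each folder is processed by exactly one task.
val statuses = sc.parallelize(folders, folders.size).mapPartitions { iter =>
  val folder = iter.next()
  // Run some external processing on the folder and keep its exit code
  // ("ls" stands in for the real command here).
  val status: Int = Seq("ls", folder).!
  Iterator(folder -> status)
}
statuses.collect().foreach(println)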

Re: Using R code as part of a Spark Application

2016-06-30 Thread Sun Rui
I would guess that the technology behind Azure R Server is about Revolution Enterprise DistributedR/ScaleR. I don’t know the details beyond the statement in the “Step 6. Install R packages” section of the given documentation page. However, if you need to install R packages on the worker nodes

Re: Using R code as part of a Spark Application

2016-06-29 Thread Sun Rui
Hi, Gilad, You can try the dapply() and gapply() functions in SparkR in Spark 2.0. Yes, it is required that R is installed on each worker node. However, if your Spark application is Scala/Java based, running R code on DataFrames is not supported for now. There is a closed JIRA

Re: sparkR.init() can not load sparkPackages.

2016-06-19 Thread Sun Rui
Hi, Joseph, This is a known issue but not a bug. This issue does not occur when you use interactive SparkR session, while it does occur when you execute an R file. The reason behind this is that in case you execute an R file, the R backend launches before the R interpreter, so there is no

Re: Unable to execute sparkr jobs through Chronos

2016-06-16 Thread Sun Rui
I saw in the job definition an Env Var: SPARKR_MASTER. What is that for? I don’t think SparkR uses it. > On Jun 17, 2016, at 10:08, Sun Rui <sunrise_...@163.com> wrote: > > It seems that spark master URL is not correct. What is it? >> On Jun 16, 2016, at 18:57, Rodrick B

Re: Unable to execute sparkr jobs through Chronos

2016-06-16 Thread Sun Rui
It seems that spark master URL is not correct. What is it? > On Jun 16, 2016, at 18:57, Rodrick Brown wrote: > > Master must start with yarn, spark, mesos, or local

Re: Adding h5 files in a zip to use with PySpark

2016-06-15 Thread Sun Rui
have you tried --files ? > On Jun 15, 2016, at 18:50, ar7 wrote: > > I am using PySpark 1.6.1 for my spark application. I have additional modules > which I am loading using the argument --py-files. I also have a h5 file > which I need to access from one of the modules for

Re: SparkR : glm model

2016-06-11 Thread Sun Rui
You were looking at some old code. The poisson family is supported in the latest master branch. You can try the Spark 2.0 preview release from http://spark.apache.org/news/spark-2.0.0-preview.html > On Jun 10, 2016, at 12:14, april_ZMQ

Re: Slow collecting of large Spark Data Frames into R

2016-06-11 Thread Sun Rui
Hi, Jonathan, Thanks for reporting. This is a known issue that the community would like to address later. Please refer to https://issues.apache.org/jira/browse/SPARK-14037. It would be better that you can profile your use case using the method discussed in the JIRA issue and paste the

Re: SparkR interaction with R libraries (currently 1.5.2)

2016-06-07 Thread Sun Rui
Hi, Ian, You should not use the Spark DataFrame a_df in your closure. For an R function for lapplyPartition, the parameter is a list of lists, representing the rows in the corresponding partition. In Spark 2.0, SparkR provides a new public API called dapply, which can apply an R function to each

Re: --driver-cores for Standalone and YARN only?! What about Mesos?

2016-06-02 Thread Sun Rui
Yes, I think you can file a JIRA issue for this. But why remove the default value? It seems the default number of cores is 1, according to https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala#L110 On Jun 2, 2016, at 05:18, Jacek Laskowski

Re: get and append file name in record being reading

2016-06-02 Thread Sun Rui
You can use SparkContext.wholeTextFiles(). For example, suppose all your files are under /tmp/ABC_input/: val rdd = sc.wholeTextFiles("file:///tmp/ABC_input") val rdd1 = rdd.flatMap { case (path, content) => val fileName = new java.io.File(path).getName content.split("\n").map { line =>
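Expanded into a self-contained sketch (the input directory is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("append-file-name").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Read each file as a (path, content) pair, then prepend the file name to every record.
val rdd = sc.wholeTextFiles("file:///tmp/ABC_input")
val rdd1 = rdd.flatMap { case (path, content) =>
  val fileName = new java.io.File(path).getName
  content.split("\n").map(line => s"$fileName,$line")
}
rdd1.take(5).foreach(println)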

Re: Windows Rstudio to Linux spakR

2016-06-01 Thread Sun Rui
Selvam, First, deploy the Spark distribution on your Windows machine, which is of the same version of Spark in your Linux cluster Second, follow the instructions at https://github.com/apache/spark/tree/master/R#using-sparkr-from-rstudio. Specify the Spark master URL for your Linux Spark

Re: Can we use existing R model in Spark

2016-05-30 Thread Sun Rui
er@spark.apache.org>> > > > Try to invoke a R script from Spark using rdd pipe method , get the work done > & and receive the model back in RDD. > > > for ex :- > . rdd.pipe("") > > > On Mon, May 30, 2016 at 3:57 PM, Sun Rui <sunr

Re: Can we use existing R model in Spark

2016-05-30 Thread Sun Rui
Unfortunately no. Spark does not support loading external models (for example, PMML) for now. Maybe you can try using the existing random forest model in Spark. > On May 30, 2016, at 18:21, Neha Mehta wrote: > > Hi, > > I have an existing random forest model created

Re: Splitting RDD by partition

2016-05-20 Thread Sun Rui
I think the latter approach is better, as it can avoid unnecessary computation by filtering out unneeded partitions. It is better to cache the previous RDD so that it won’t be computed twice. > On May 20, 2016, at 16:59, shlomi wrote: > > Another approach I found: > >
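A minimal sketch of the partition-filtering idea (the helper name and predicate are illustrative):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.reflect.ClassTag

val spark = SparkSession.builder().appName("split-by-partition").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Split an RDD into two RDDs by partition index, caching the parent so
// neither half triggers a recomputation of the original lineage.
def splitByPartition[T: ClassTag](rdd: RDD[T], keep: Int => Boolean): (RDD[T], RDD[T]) = {
  val cached = rdd.cache()
  val kept = cached.mapPartitionsWithIndex(
    (i, it) => if (keep(i)) it else Iterator.empty, preservesPartitioning = true)
  val dropped = cached.mapPartitionsWithIndex(
    (i, it) => if (keep(i)) Iterator.empty else it, preservesPartitioning = true)
  (kept, dropped)
}

val data = sc.parallelize(1 to 100, 10)
val (firstHalf, secondHalf) = splitByPartition(data, _ < 5)
println((firstHalf.count(), secondHalf.count()))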

Re: Does spark support Apache Arrow

2016-05-19 Thread Sun Rui
1. I don’t think so. 2. Arrow is for in-memory columnar execution, while cache is for in-memory columnar storage. > On May 20, 2016, at 10:16, Todd wrote: > > From the official site http://arrow.apache.org/, Apache Arrow is used for > Columnar In-Memory storage. I have two quick

Re: Tar File: On Spark

2016-05-19 Thread Sun Rui
from python? > On 19 May 2016 16:57, "Sun Rui" <sunrise_...@163.com> wrote: > 1. create a temp dir on HDFS, say “/tmp” > 2. write a script to create in the temp dir one file for each tar file. Each > file has only one

Re: dataframe stat corr for multiple columns

2016-05-19 Thread Sun Rui
There is an existing JIRA issue for it: https://issues.apache.org/jira/browse/SPARK-11057 Also there is an PR. Maybe we should help to review and merge it with a higher priority. > On May 20, 2016, at 00:09, Xiangrui Meng

Re: Tar File: On Spark

2016-05-19 Thread Sun Rui
1. create a temp dir on HDFS, say “/tmp” 2. write a script to create in the temp dir one file for each tar file. Each file has only one line: 3. Write a spark application. It is like: val rdd = sc.textFile () rdd.map { line => construct an untar command using the path information in
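A sketch of what step 3 could look like, assuming the temp dir and per-tar files from steps 1–2 (paths and tar options are illustrative):

import org.apache.spark.sql.SparkSession
import scala.sys.process._

val spark = SparkSession.builder().appName("untar-on-spark").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Read the temp dir from steps 1-2: one small file per tar file,
// each containing only the path of that tar file.
val rdd = sc.textFile("/tmp")

// Every task builds and runs an untar command for the path on its line.
val exitCodes = rdd.map { line =>
  val cmd = Seq("tar", "-xf", line, "-C", "/tmp/untarred")
  line -> cmd.!
}
exitCodes.collect().foreach(println)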

Re: SparkR query

2016-05-17 Thread Sun Rui
nd workers looking for Windows path, > Which must be being passed through by the driver I guess. I checked the > spark-env.sh on each node and the appropriate SPARK_HOME is set > correctly…. > > > From: Sun Rui [mailto:sunrise_...@163.com] > Sent: 17 May 2016 11:32 >

Re: SparkR query

2016-05-17 Thread Sun Rui
Lewis, 1. Could you check the values of “SPARK_HOME” environment on all of your worker nodes? 2. How did you start your SparkR shell? > On May 17, 2016, at 18:07, Mike Lewis wrote: > > Hi, > > I have a SparkR driver process that connects to a master running on

Re: Spark 1.6.0: substring on df.select

2016-05-12 Thread Sun Rui
Alternatively, you may try the built-in function: regexp_extract > On May 12, 2016, at 20:27, Ewan Leith wrote: > > You could use a UDF pretty easily, something like this should work, the > lastElement function could be changed to do pretty much any string >
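For example (sample data is illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_extract

val spark = SparkSession.builder().appName("regexp-extract-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("a/b/c", "x/y/z").toDF("path")

// Pull out the last path element with the built-in regexp_extract,
// instead of writing a UDF.
df.select(regexp_extract($"path", "([^/]+)$", 1).alias("last")).show()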

RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-13 Thread Sun, Rui
...@gmail.com] Sent: Thursday, April 14, 2016 5:45 AM To: Sun, Rui <rui@intel.com> Cc: user <user@spark.apache.org> Subject: Re: How does spark-submit handle Python scripts (and how to repeat it)? Julia can pick the env var, and set the system properties or directly fill the co

RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-12 Thread Sun, Rui
c/main/scala/org/apache/spark/deploy/PythonRunner.scala#L47 and https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/RRunner.scala#L65 From: Andrei [mailto:faithlessfri...@gmail.com] Sent: Wednesday, April 13, 2016 4:32 AM To: Sun, Rui <rui@intel.com>

RE: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Sun, Rui
val ALLOW_MULTIPLE_CONTEXTS = booleanConf("spark.sql.allowMultipleContexts", defaultValue = Some(true), doc = "When set to true, creating multiple SQLContexts/HiveContexts is allowed." + "When set to false, only one SQLContext/HiveContext is allowed to be created " +

RE: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Sun, Rui
Which py file is your main file (primary py file)? Zip the other two py files. Leave the main py file alone. Don't copy them to S3 because it seems that only local primary and additional py files are supported. ./bin/spark-submit --master spark://... --py-files -Original Message-

RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-12 Thread Sun, Rui
There is much deployment preparation work handling different deployment modes for pyspark and SparkR in SparkSubmit. It is difficult to summarize it briefly, you had better refer to the source code. Supporting running Julia scripts in SparkSubmit is more than implementing a ‘JuliaRunner’. One

RE: How to process one partition at a time?

2016-04-06 Thread Sun, Rui
Maybe you can try SparkContext.submitJob: def submitJob[T, U, R](rdd: RDD[T], processPartition: (Iterator[T]) ⇒ U, partitions: Seq[Int], resultHandler: (Int, U) ⇒ Unit, resultFunc: ⇒ R):
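A minimal sketch of using submitJob to run over a single partition (partition choice and aggregation are illustrative):

import org.apache.spark.sql.SparkSession
import scala.concurrent.Await
import scala.concurrent.duration.Duration

val spark = SparkSession.builder().appName("submit-job-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext
val rdd = sc.parallelize(1 to 100, 4)

// Run a job over a single partition (index 0 here) and gather its result
// through the resultHandler callback.
val results = scala.collection.mutable.Map[Int, Int]()
val future = sc.submitJob[Int, Int, Map[Int, Int]](
  rdd,
  (it: Iterator[Int]) => it.sum,
  Seq(0),                                  // process only partition 0
  (index, partitionSum) => results(index) = partitionSum,
  results.toMap
)
Await.ready(future, Duration.Inf)
println(results)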

RE: What's the benifit of RDD checkpoint against RDD save

2016-03-24 Thread Sun, Rui
As Mark said, checkpoint() can be called before calling any action on the RDD. The choice between checkpoint and saveXXX depends. If you just want to cut the long RDD lineage, and the data won’t be re-used later, then use checkpoint, because it is simple and the checkpoint data will be cleaned
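A small sketch of checkpointing before any action (the checkpoint directory and lineage are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/spark-checkpoints")

// Build an RDD with a deliberately long lineage.
val longLineage = (1 to 50).foldLeft(sc.parallelize(1 to 1000))((rdd, _) => rdd.map(_ + 1))

longLineage.cache()        // cache first so checkpointing does not recompute the RDD
longLineage.checkpoint()   // legal to call before any action; cuts the lineage
println(longLineage.count())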

RE: Run External R script from Spark

2016-03-21 Thread Sun, Rui
It’s a possible approach. It actually leverages Spark’s parallel execution. PipeRDD’s launching of external processes is just like that in pySpark and SparkR for RDD API. The concern is pipeRDD relies on text based serialization/deserialization. Whether the performance is acceptable actually

RE: Error in "java.io.IOException: No input paths specified in job"

2016-03-19 Thread Sun, Rui
It complains about the file path "./examples/src/main/resources/people.json". You can try to use an absolute path instead of a relative path, and make sure the absolute path is correct. If that still does not work, you can prefix the path with "file://" in case the default file scheme for Hadoop is

RE: sparkR issues ?

2016-03-18 Thread Sun, Rui
Sorry. I am wrong. The issue is not related to as.data.frame(). It seems to be related to a DataFrame naming conflict between S4Vectors and SparkR. Refer to https://issues.apache.org/jira/browse/SPARK-12148 From: Sun, Rui [mailto:rui@intel.com] Sent: Wednesday, March 16, 2016 9:33 AM To: Alex

RE: sparkR issues ?

2016-03-15 Thread Sun, Rui
I have submitted https://issues.apache.org/jira/browse/SPARK-13905 and a PR for it. From: Alex Kozlov [mailto:ale...@gmail.com] Sent: Wednesday, March 16, 2016 12:52 AM To: roni <roni.epi...@gmail.com> Cc: Sun, Rui <rui@intel.com>; user@spark.apache.org Subject: Re: sparkR issue

RE: sparkR issues ?

2016-03-15 Thread Sun, Rui
It seems as.data.frame() defined in SparkR covers the version in the R base package. We can try to see if we can change the implementation of as.data.frame() in SparkR to avoid such covering. From: Alex Kozlov [mailto:ale...@gmail.com] Sent: Tuesday, March 15, 2016 2:59 PM To: roni

RE: lint-r checks failing

2016-03-10 Thread Sun, Rui
This is probably because the installed lintr package got updated. After the update, lintr can detect errors that were skipped before. I will submit a PR for this issue. -Original Message- From: Gayathri Murali [mailto:gayathri.m.sof...@gmail.com] Sent: Friday, March 11, 2016 12:48 PM To:

RE: SparkR Count vs Take performance

2016-03-02 Thread Sun, Rui
This has nothing to do with object serialization/deserialization. It is expected behavior that take(1) most likely runs slower than count() on an empty RDD. It is all about the algorithm with which take() is implemented. take(): 1. Reads one partition to get the elements 2. If the fetched

RE: Apache Arrow + Spark examples?

2016-02-24 Thread Sun, Rui
Spark has not supported Arrow yet. There is a JIRA https://issues.apache.org/jira/browse/SPARK-13391 requesting working on it. From: Robert Towne [mailto:robert.to...@webtrends.com] Sent: Wednesday, February 24, 2016 5:21 AM To: user@spark.apache.org Subject: Apache Arrow + Spark examples? I

RE: Running synchronized JRI code

2016-02-15 Thread Sun, Rui
On computation, RRDD launches one R process for each partition, so there won't be thread-safe issue Could you give more details on your new environment? -Original Message- From: Simon Hafner [mailto:reactorm...@gmail.com] Sent: Monday, February 15, 2016 7:31 PM To: Sun, Rui <

RE: Running synchronized JRI code

2016-02-14 Thread Sun, Rui
For YARN mode, you can set --executor-cores 1 -Original Message- From: Sun, Rui [mailto:rui@intel.com] Sent: Monday, February 15, 2016 11:35 AM To: Simon Hafner <reactorm...@gmail.com>; user <user@spark.apache.org> Subject: RE: Running synchronized JRI code Yes, JR

RE: Running synchronized JRI code

2016-02-14 Thread Sun, Rui
Yes, JRI loads an R dynamic library into the executor JVM, which faces thread-safe issue when there are multiple task threads within the executor. If you are running Spark on Standalone mode, it is possible to run multiple workers per node, and at the same time, limit the cores per worker to be

RE: different behavior while using createDataFrame and read.df in SparkR

2016-02-08 Thread Sun, Rui
le has to be re-assigned to reference a column in the new DataFrame. From: Devesh Raj Singh [mailto:raj.deves...@gmail.com] Sent: Saturday, February 6, 2016 8:31 PM To: Sun, Rui <rui@intel.com> Cc: user@spark.apache.org Subject: Re: different behavior while using createDataFrame and read.d

RE: different behavior while using createDataFrame and read.df in SparkR

2016-02-05 Thread Sun, Rui
: Devesh Raj Singh [mailto:raj.deves...@gmail.com] Sent: Friday, February 5, 2016 2:44 PM To: user@spark.apache.org Cc: Sun, Rui Subject: different behavior while using createDataFrame and read.df in SparkR Hi, I am using Spark 1.5.1 When I do this df <- createDataFrame(sqlContext, i

RE: sparkR not able to create /append new columns

2016-02-03 Thread Sun, Rui
Devesh, Note that DataFrame is immutable. withColumn returns a new DataFrame instead of adding a column in-place to the DataFrame being operated on. So, you can modify the for loop like: for (j in 1:lev) { dummy.df.new<-withColumn(df, paste0(colnames(cat.column),j),

RE: can we do column bind of 2 dataframes in spark R? similar to cbind in R?

2016-02-02 Thread Sun, Rui
Devesh, The cbind-like operation is not supported by Scala DataFrame API, so it is also not supported in SparkR. You may try to workaround this by trying the approach in http://stackoverflow.com/questions/32882529/how-to-zip-twoor-more-dataframe-in-spark You could also submit a JIRA
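For reference, a rough Scala sketch of the zip-by-index workaround described in that Stack Overflow link (function name and sample data are illustrative; the row pairing simply follows each DataFrame's current ordering):

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().appName("cbind-workaround").master("local[*]").getOrCreate()
import spark.implicits._

val left = Seq(("a", 1), ("b", 2)).toDF("k", "v1")
val right = Seq(10.0, 20.0).toDF("v2")

// Align the two DataFrames row-by-row via zipWithIndex, join on the index,
// and rebuild a single DataFrame with the combined schema.
def cbind(df1: DataFrame, df2: DataFrame): DataFrame = {
  val rows = df1.rdd.zipWithIndex.map(_.swap)
    .join(df2.rdd.zipWithIndex.map(_.swap))
    .sortByKey()
    .values
    .map { case (r1, r2) => Row.fromSeq(r1.toSeq ++ r2.toSeq) }
  spark.createDataFrame(rows, StructType(df1.schema.fields ++ df2.schema.fields))
}

cbind(left, right).show()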

RE: building spark 1.6 throws error Rscript: command not found

2016-01-19 Thread Sun, Rui
Hi, Mich, Building Spark with SparkR profile enabled requires installation of R on your building machine. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, January 19, 2016 5:27 AM To: Mich Talebzadeh Cc: user @spark Subject: Re: building spark 1.6 throws error Rscript: command not found

RE: Re: how to use sparkR or spark MLlib to load a csv file on hdfs then calculate covariance

2015-12-28 Thread Sun, Rui
Spark does not support computing a covariance matrix yet, but there is a PR for it. Maybe you can try it: https://issues.apache.org/jira/browse/SPARK-11057 From: zhangjp [mailto:592426...@qq.com] Sent: Tuesday, December 29, 2015 3:21 PM To: Felix Cheung; Andy Davidson; Yanbo Liang Cc: user Subject: Re:

RE: Do existing R packages work with SparkR data frames

2015-12-22 Thread Sun, Rui
Hi, Lan, Generally, it is hard to use existing R packages working with R data frames to work with SparkR data frames transparently. Typically the algorithms have to be re-written to use SparkR DataFrame API. Collect is for collecting the data from a SparkR DataFrame into a local data.frame.

RE: SparkR read.df failed to read file from local directory

2015-12-08 Thread Sun, Rui
Hi, Boyu, Does the local file “/home/myuser/test_data/sparkR/flights.csv” really exist? I just tried, and had no problem creating a DataFrame from a local CSV file. From: Boyu Zhang [mailto:boyuzhan...@gmail.com] Sent: Wednesday, December 9, 2015 1:49 AM To: Felix Cheung Cc:

RE: SparkR DataFrame , Out of memory exception for very small file.

2015-11-22 Thread Sun, Rui
Vipul, Not sure if I understand your question. DataFrame is immutable. You can't update a DataFrame. Could you paste some log info for the OOM error? -Original Message- From: vipulrai [mailto:vipulrai8...@gmail.com] Sent: Friday, November 20, 2015 12:11 PM To: user@spark.apache.org

RE: Connecting SparkR through Yarn

2015-11-13 Thread Sun, Rui
To: Sun, Rui; user@spark.apache.org Subject: Re: Connecting SparkR through Yarn Hi Sun, Thank you for reply. I did the same, but now I am getting another issue. i.e: Not able to connect to ResourceManager after submitting the job the Error message showing something like this Connecting

RE: Connecting SparkR through Yarn

2015-11-10 Thread Sun, Rui
Amit, You can simply set “MASTER” as “yarn-client” before calling sparkR.init(). Sys.setenv("MASTER"="yarn-client") I assume that you have set “YARN_CONF_DIR” env variable required for running Spark on YARN. If you want to set more YARN specific configurations, you can for example Sys.setenv

RE: [sparkR] Any insight on java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-07 Thread Sun, Rui
k.driver.memory”, (also other similar options, like: spark.driver.extraClassPath, spark.driver.extraJavaOptions, spark.driver.extraLibraryPath) in the sparkEnvir parameter for sparkR.init() to take effect. Would you like to give it a try? Note the change is on the master branch, you have to build Spark from source befo

RE: [Spark R]could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'

2015-11-06 Thread Sun, Rui
Hi, Todd, the "--driver-memory" option specifies the maximum heap memory size of the JVM backend for SparkR. The error you faced is a memory allocation error of your R process. They are different. I guess the 2G memory bound for a string is a limitation of the R interpreter? That's the reason why we

RE: sparkR 1.5.1 batch yarn-client mode failing on daemon.R not found

2015-11-01 Thread Sun, Rui
Tom, Have you set the “MASTER” evn variable on your machine? What is the value if set? From: Tom Stewart [mailto:stewartthom...@yahoo.com.INVALID] Sent: Friday, October 30, 2015 10:11 PM To: user@spark.apache.org Subject: sparkR 1.5.1 batch yarn-client mode failing on daemon.R not found I have

RE: SparkR job with >200 tasks hangs when calling from web server

2015-11-01 Thread Sun, Rui
I guess that this is not related to SparkR, but something wrong in the Spark Core. Could you try your application logic within spark-shell (you have to use Scala DataFrame API) instead of SparkR shell and to see if this issue still happens? -Original Message- From: rporcio

RE: How to set memory for SparkR with master="local[*]"

2015-11-01 Thread Sun, Rui
, spark.driver.extraJavaOptions, spark.driver.extraLibraryPath) in the sparkEnvir parameter for sparkR.init() to take effect. Would you like to give it a try? Note the change is on the master branch, you have to build Spark from source before using it. From: Sun, Rui [mailto:rui@intel.com] Sent: Monday

RE: How to set memory for SparkR with master="local[*]"

2015-10-25 Thread Sun, Rui
As documented in http://spark.apache.org/docs/latest/configuration.html#available-properties, Note for “spark.driver.memory”: Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead,

RE: Spark_1.5.1_on_HortonWorks

2015-10-22 Thread Sun, Rui
Frans, SparkR runs with R 3.1+. If possible, the latest version of R is recommended. From: Saisai Shao [mailto:sai.sai.s...@gmail.com] Sent: Thursday, October 22, 2015 11:17 AM To: Frans Thamura Cc: Ajay Chander; Doug Balog; user spark mailing list Subject: Re: Spark_1.5.1_on_HortonWorks SparkR is

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-08 Thread Sun, Rui
Can you extract the spark-submit command from the console output, and run it on the Shell, and see if there is any error message? From: Khandeshi, Ami [mailto:ami.khande...@fmr.com] Sent: Wednesday, October 7, 2015 9:57 PM To: Sun, Rui; Hossein Cc: akhandeshi; user@spark.apache.org Subject: RE

RE: How can I read file from HDFS i sparkR from RStudio

2015-10-08 Thread Sun, Rui
Amit, sqlContext <- sparkRSQL.init(sc) peopleDF <- read.df(sqlContext, "hdfs://master:9000/sears/example.csv") have you restarted the R session in RStudio between the two lines? From: Amit Behera [mailto:amit.bd...@gmail.com] Sent: Thursday, October 8, 2015 5:59 PM To: user@spark.apache.org

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-07 Thread Sun, Rui
Not sure "/C/DevTools/spark-1.5.1/bin/spark-submit.cmd" is a valid? From: Hossein [mailto:fal...@gmail.com] Sent: Wednesday, October 7, 2015 12:46 AM To: Khandeshi, Ami Cc: Sun, Rui; akhandeshi; user@spark.apache.org Subject: Re: SparkR Error in sparkR.init(master=“local”) in RStudio

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread Sun, Rui
What you have done is supposed to work. Need more debugging information to find the cause. Could you add the following lines before calling sparkR.init()? Sys.setenv(SPARKR_SUBMIT_ARGS="--verbose sparkr-shell") Sys.setenv(SPARK_PRINT_LAUNCH_COMMAND=1) Then to see if you can find any hint in

RE: textFile() and includePackage() not found

2015-09-27 Thread Sun, Rui
Eugene, SparkR RDD API is private for now (https://issues.apache.org/jira/browse/SPARK-7230) You can use SparkR::: prefix to access those private functions. -Original Message- From: Eugene Cao [mailto:eugene...@163.com] Sent: Monday, September 28, 2015 8:02 AM To:

RE: SparkR for accumulo

2015-09-23 Thread Sun, Rui
transformations on it. -Original Message- From: madhvi.gupta [mailto:madhvi.gu...@orkash.com] Sent: Wednesday, September 23, 2015 11:42 AM To: Sun, Rui; user Subject: Re: SparkR for accumulo Hi Rui, Cant we use the accumulo data RDD created from JAVA in spark, in sparkR? Thanks and Regards Madhvi

RE: Support of other languages?

2015-09-22 Thread Sun, Rui
Palamuttam [mailto:rahulpala...@gmail.com] Sent: Thursday, September 17, 2015 3:09 PM To: Sun, Rui Cc: user@spark.apache.org Subject: Re: Support of other languages? Hi, Thank you for both responses. Sun you pointed out the exact issue I was referring to, which is copying,serializing, deserializing

RE: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-17 Thread Sun, Rui
The existing algorithms operating on R data.frame can't simply operate on SparkR DataFrame. They have to be re-implemented to be based on SparkR DataFrame API. -Original Message- From: ekraffmiller [mailto:ellen.kraffmil...@gmail.com] Sent: Thursday, September 17, 2015 3:30 AM To:

RE: reading files on HDFS /s3 in sparkR -failing

2015-09-10 Thread Sun, Rui
Hi, Roni, For parquetFile(), it is just a warning, you can get the DataFrame successfully, right? It is a bug has been fixed in the latest repo: https://issues.apache.org/jira/browse/SPARK-8952 For S3, it is not related to SparkR. I guess it is related to

RE: Support of other languages?

2015-09-09 Thread Sun, Rui
Hi, Rahul, To support a new language other than Java/Scala in spark, it is different between RDD API and DataFrame API. For RDD API: RDD is a distributed collection of the language-specific data types whose representation is unknown to JVM. Also transformation functions for RDD are written

RE: SparkR csv without headers

2015-08-20 Thread Sun, Rui
Hi, You can create a DataFrame using read.df() with a specified schema. Something like: schema <- structType(structField("a", "string"), structField("b", "integer"), …); read.df(…, schema = schema) From: Franc Carter [mailto:franc.car...@rozettatech.com] Sent: Wednesday, August 19, 2015 1:48 PM

RE: SparkR

2015-07-27 Thread Sun, Rui
Simply no. Currently SparkR is the R API of Spark DataFrame, no existing algorithms can benefit from it unless they are re-written to be based on the API. There is on-going development on supporting MLlib and ML Pipelines in SparkR: https://issues.apache.org/jira/browse/SPARK-6805 From: Mohit

RE: unserialize error in sparkR

2015-07-27 Thread Sun, Rui
Hi, Do you mean you are running the script with https://github.com/amplab-extras/SparkR-pkg and spark 1.2? I am afraid that currently there is no development effort and support on the SparkR-pkg since it has been integrated into Spark since Spark 1.4. Unfortunately, the RDD API and RDD-like

RE: SparkR Supported Types - Please add bigint

2015-07-24 Thread Sun, Rui
printSchema calls StructField.buildFormattedString() to output schema information. buildFormattedString() uses DataType.typeName as the string representation of the data type. LongType.typeName = long; LongType.simpleString = bigint. I am not sure about the difference between these two type name

RE: SparkR Supported Types - Please add bigint

2015-07-23 Thread Sun, Rui
Exie, Reported your issue: https://issues.apache.org/jira/browse/SPARK-9302 SparkR has support for long(bigint) type in serde. This issue is related to support complex Scala types in serde. -Original Message- From: Exie [mailto:tfind...@prodevelop.com.au] Sent: Friday, July 24, 2015

RE: [SparkR] creating dataframe from json file

2015-07-15 Thread Sun, Rui
is not so complete. You may use scala documentation as reference, and try if some method is supported in SparkR. From: jianshu Weng [jian...@gmail.com] Sent: Wednesday, July 15, 2015 9:37 PM To: Sun, Rui Cc: user@spark.apache.org Subject: Re: [SparkR] creating dataframe

RE: [SparkR] creating dataframe from json file

2015-07-15 Thread Sun, Rui
suppose df <- jsonFile(sqlContext, "json file") You can extract hashtags.text as a Column object using the following command: t <- getField(df$hashtags, "text") and then you can perform operations on the column. You can extract hashtags.text as a DataFrame using the following command: t <-
