Re: Advice on Scaling RandomForest
I'm using dataframes, types are all doubles and I'm only extracting what I need. The caveat on these is that I am porting an existing system for a client, and for their business it's likely to be cheaper to throw hardware (in AWS) at the problem for a couple of hours than re-engineer their algorithms. cheers On 7 June 2016 at 21:54, Jörn Franke wrote: > Before hardware optimization there is always software optimization. > Are you using dataset / dataframe? Are you using the right data types ( > e.g. int where int is appropriate, try to avoid string and char etc) > Do you extract only the stuff needed? What are the algorithm parameters? > > > On 07 Jun 2016, at 13:09, Franc Carter wrote: > > > > > > Hi, > > > > I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and > am interested in how it might be best to scale it - e.g. more CPUs per > instance, more memory per instance, more instances etc. > > > > I'm currently using 32 m3.xlarge instances for a training set with > 2.5 million rows, 1300 columns and a total size of 31GB (parquet) > > > > thanks > > > > -- > > Franc > -- Franc
Advice on Scaling RandomForest
Hi, I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and am interested in how it might be best to scale it - e.g. more CPUs per instance, more memory per instance, more instances etc. I'm currently using 32 m3.xlarge instances for a training set with 2.5 million rows, 1300 columns and a total size of 31GB (parquet) thanks -- Franc
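For reference, a minimal sketch of the knobs being discussed, using the DataFrame-based ML API available in Spark 1.6 (the parameter values are illustrative only, and trainingDF is a hypothetical DataFrame with 'features'/'label' columns):

from pyspark.ml.regression import RandomForestRegressor

# illustrative settings -- these are the "algorithm parameters" being asked about
rf = RandomForestRegressor(featuresCol="features", labelCol="label",
                           numTrees=100, maxDepth=10, maxBins=32,
                           subsamplingRate=0.8, featureSubsetStrategy="onethird")
model = rf.fit(trainingDF)

Fewer, deeper trees stress per-executor memory, while more trees parallelise across more instances - which is why the training parameters matter before choosing hardware.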
Re: installing packages with pyspark
Thanks - I'll give that a try cheers On 20 March 2016 at 09:42, Felix Cheung wrote: > You are running pyspark in Spark client deploy mode. I have run into the > same error as well and I'm not sure if this is graphframes specific - the > python process can't find the graphframes Python code when it is loaded as > a Spark package. > > To work around this, I extract the graphframes Python directory locally > where I run pyspark into a directory called graphframes. > > On Thu, Mar 17, 2016 at 10:11 PM -0700, "Franc Carter" < > franc.car...@gmail.com> wrote: > > > I'm having trouble with that for pyspark, yarn and graphframes. I'm using:- > > pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5 > > which starts and gives me a REPL, but when I try > > from graphframes import * > > I get > > No module named graphframes > > without '--master yarn' it works as expected > > thanks > > > On 18 March 2016 at 12:59, Felix Cheung wrote: > > For some, like graphframes that are Spark packages, you could also use > --packages in the command line of spark-submit or pyspark. See > http://spark.apache.org/docs/latest/submitting-applications.html > > _ > From: Jakob Odersky > Sent: Thursday, March 17, 2016 6:40 PM > Subject: Re: installing packages with pyspark > To: Ajinkya Kale > Cc: > > > > Hi, > regarding 1, packages are resolved locally. That means that when you > specify a package, spark-submit will resolve the dependencies and > download any jars on the local machine, before shipping* them to the > cluster. So, without a priori knowledge of dataproc clusters, it > should be no different to specify packages. > > Unfortunately I can't help with 2. > > --Jakob > > *shipping in this case means making them available via the network > > On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale > wrote: > > Hi all, > > > > I had a couple of questions. > > 1. Is there documentation on how to add the graphframes or any other > package > > for that matter on the google dataproc managed spark clusters ? > > > > 2. Is there a way to add a package to an existing pyspark context > through a > > jupyter notebook ? > > > > --aj > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- > Franc > -- Franc
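For reference, a sketch of Felix's workaround (the ivy cache path and jar name are assumptions - adjust to wherever --packages actually cached the jar; the Python sources of a Spark package are bundled inside its jar):

# extract the bundled Python package next to where pyspark is launched
unzip ~/.ivy2/jars/graphframes_graphframes-0.1.0-spark1.5.jar 'graphframes/*' -d .

# then start pyspark as before; the local graphframes/ directory is importable
pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5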
Re: installing packages with pyspark
I'm having trouble with that for pyspark, yarn and graphframes. I'm using:- pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5 which starts and gives me a REPL, but when I try from graphframes import * I get No module named graphframes without '--master yarn' it works as expected thanks On 18 March 2016 at 12:59, Felix Cheung wrote: > For some, like graphframes that are Spark packages, you could also use > --packages in the command line of spark-submit or pyspark. See > http://spark.apache.org/docs/latest/submitting-applications.html > > _ > From: Jakob Odersky > Sent: Thursday, March 17, 2016 6:40 PM > Subject: Re: installing packages with pyspark > To: Ajinkya Kale > Cc: > > > > Hi, > regarding 1, packages are resolved locally. That means that when you > specify a package, spark-submit will resolve the dependencies and > download any jars on the local machine, before shipping* them to the > cluster. So, without a priori knowledge of dataproc clusters, it > should be no different to specify packages. > > Unfortunately I can't help with 2. > > --Jakob > > *shipping in this case means making them available via the network > > On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale > wrote: > > Hi all, > > > > I had a couple of questions. > > 1. Is there documentation on how to add the graphframes or any other > package > > for that matter on the google dataproc managed spark clusters ? > > > > 2. Is there a way to add a package to an existing pyspark context > through a > > jupyter notebook ? > > > > --aj > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Franc
Re: filter by dict() key in pySpark
A colleague found how to do this, the approach was to use a udf() cheers On 21 February 2016 at 22:41, Franc Carter wrote: > > I have a DataFrame that has a Python dict() as one of the columns. I'd > like to filter the DataFrame for those Rows where the dict() contains a > specific value. e.g. something like this:- > > DF2 = DF1.filter('name' in DF1.params) > > but that gives me this error > > ValueError: Cannot convert column into bool: please use '&' for 'and', '|' > for 'or', '~' for 'not' when building DataFrame boolean expressions. > > How do I express this correctly ? > > thanks > > -- > Franc > -- Franc
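For reference, a minimal sketch of the udf() approach (assuming the dict arrives as a MapType column named params, as in the original post, and 'name' is the key being tested):

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# a MapType column is passed to a Python udf as a plain dict
has_name = udf(lambda params: params is not None and 'name' in params,
               BooleanType())
DF2 = DF1.filter(has_name(DF1.params))

The original expression fails because Python's in operator forces the Column into a boolean immediately, instead of building a Column expression.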
filter by dict() key in pySpark
I have a DataFrame that has a Python dict() as one of the columns. I'd like to filter the DataFrame for those Rows where the dict() contains a specific value. e.g. something like this:- DF2 = DF1.filter('name' in DF1.params) but that gives me this error ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. How do I express this correctly ? thanks -- Franc
Re: sparkR not able to create /append new columns
Yes, I didn't work out how to solve that - sorry On 3 February 2016 at 22:37, Devesh Raj Singh wrote: > Hi, > > but "withColumn" will only add once, if i want to add columns to the same > dataframe in a loop it will keep overwriting the added column and in the > end the last added column( in the loop) will be the added column. like in > my code above. > > On Wed, Feb 3, 2016 at 5:05 PM, Franc Carter > wrote: > >> >> I had problems doing this as well - I ended up using 'withColumn', it's >> not particularly graceful but it worked (1.5.2 on AWS EMR) >> >> cheers >> >> On 3 February 2016 at 22:06, Devesh Raj Singh >> wrote: >> >>> Hi, >>> >>> i am trying to create dummy variables in sparkR by creating new columns >>> for categorical variables. But it is not appending the columns >>> >>> df <- createDataFrame(sqlContext, iris) >>> class(dtypes(df)) >>> >>> cat.column<-vector(mode="character",length=nrow(df)) >>> cat.column<-collect(select(df,df$Species)) >>> lev<-length(levels(as.factor(unlist(cat.column)))) >>> varb.names<-vector(mode="character",length=lev) >>> for (i in 1:lev){ >>> varb.names[i]<-paste0(colnames(cat.column),i) >>> } >>> >>> for (j in 1:lev) >>> { >>> dummy.df.new<-withColumn(df,paste0(colnames(cat.column),j), >>> ifelse(df$Species==levels(as.factor(unlist(cat.column)))[j],1,0)) >>> } >>> >>> I am getting the below output for >>> >>> head(dummy.df.new) >>> >>> output: >>> >>> Sepal_Length Sepal_Width Petal_Length Petal_Width Species Species1 >>> 1 5.1 3.5 1.4 0.2 setosa 1 >>> 2 4.9 3.0 1.4 0.2 setosa 1 >>> 3 4.7 3.2 1.3 0.2 setosa 1 >>> 4 4.6 3.1 1.5 0.2 setosa 1 >>> 5 5.0 3.6 1.4 0.2 setosa 1 >>> 6 5.4 3.9 1.7 0.4 setosa 1 >>> >>> Problem: Species2 and Species3 column are not getting added to the >>> dataframe >>> >>> -- >>> Warm regards, >>> Devesh. >>> >> >> -- >> Franc >> > > -- > Warm regards, > Devesh. > -- Franc
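For reference, a minimal SparkR sketch of one way around the overwriting (a variant of the code above, assuming SparkR's column-aware ifelse from 1.5+): accumulate into the same DataFrame instead of calling withColumn on the original df in every iteration.

dummy.df.new <- df
for (j in 1:lev) {
  # each pass adds its column to the result of the previous pass
  dummy.df.new <- withColumn(dummy.df.new,
                             paste0(colnames(cat.column), j),
                             ifelse(dummy.df.new$Species == levels(as.factor(unlist(cat.column)))[j], 1, 0))
}

Because the loop starts from the accumulated DataFrame each time, Species1..Species3 all survive to the end.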
Re: sparkR not able to create /append new columns
I had problems doing this as well - I ended up using 'withColumn', it's not particularly graceful but it worked (1.5.2 on AWS EMR) cheers On 3 February 2016 at 22:06, Devesh Raj Singh wrote: > Hi, > > i am trying to create dummy variables in sparkR by creating new columns > for categorical variables. But it is not appending the columns > > df <- createDataFrame(sqlContext, iris) > class(dtypes(df)) > > cat.column<-vector(mode="character",length=nrow(df)) > cat.column<-collect(select(df,df$Species)) > lev<-length(levels(as.factor(unlist(cat.column)))) > varb.names<-vector(mode="character",length=lev) > for (i in 1:lev){ > varb.names[i]<-paste0(colnames(cat.column),i) > } > > for (j in 1:lev) > { > dummy.df.new<-withColumn(df,paste0(colnames(cat.column),j), > ifelse(df$Species==levels(as.factor(unlist(cat.column)))[j],1,0)) > } > > I am getting the below output for > > head(dummy.df.new) > > output: > > Sepal_Length Sepal_Width Petal_Length Petal_Width Species Species1 > 1 5.1 3.5 1.4 0.2 setosa 1 > 2 4.9 3.0 1.4 0.2 setosa 1 > 3 4.7 3.2 1.3 0.2 setosa 1 > 4 4.6 3.1 1.5 0.2 setosa 1 > 5 5.0 3.6 1.4 0.2 setosa 1 > 6 5.4 3.9 1.7 0.4 setosa 1 > > Problem: Species2 and Species3 column are not getting added to the > dataframe > > -- > Warm regards, > Devesh. > -- Franc
Re: pyspark: calculating row deltas
Thanks cheers On 10 January 2016 at 22:35, Blaž Šnuderl wrote: > This can be done using spark.sql and window functions. Take a look at > https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html > > On Sun, Jan 10, 2016 at 11:07 AM, Franc Carter > wrote: > >> >> Sure, for a dataframe that looks like this >> >> ID Year Value >> 1 2012 100 >> 1 2013 102 >> 1 2014 106 >> 2 2012 110 >> 2 2013 118 >> 2 2014 128 >> >> I'd like to get back >> >> ID Year Value >> 1 2013 2 >> 1 2014 4 >> 2 2013 8 >> 2 2014 10 >> >> i.e. the Value for an ID,Year combination is the Value for the ID,Year >> minus the Value for the ID,Year-1 >> >> thanks >> >> >> >> >> >> >> On 10 January 2016 at 20:51, Femi Anthony wrote: >> >>> Can you clarify what you mean with an actual example ? >>> >>> For example, if your data frame looks like this: >>> >>> ID Year Value >>> 1 2012 100 >>> 2 2013 101 >>> 3 2014 102 >>> >>> What's your desired output ? >>> >>> Femi >>> >>> >>> On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter >>> wrote: >>> >>>> >>>> Hi, >>>> >>>> I have a DataFrame with the columns >>>> >>>> ID,Year,Value >>>> >>>> I'd like to create a new Column that is Value2-Value1 where the >>>> corresponding Year2=Year-1 >>>> >>>> At the moment I am creating a new DataFrame with renamed columns and >>>> doing >>>> >>>>DF.join(DF2, . . . .) >>>> >>>> This looks cumbersome to me - is there a better way? >>>> >>>> thanks >>>> >>>> >>>> -- >>>> Franc >>>> >>> >>> >>> >>> -- >>> http://www.femibyte.com/twiki5/bin/view/Tech/ >>> http://www.nextmatrix.com >>> "Great spirits have always encountered violent opposition from mediocre >>> minds." - Albert Einstein. >>> >> >> >> >> -- >> Franc >> > > -- Franc
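For reference, a minimal sketch of the window-function approach from that link (assuming the DataFrame is named df; on Spark 1.x window functions need a HiveContext):

from pyspark.sql import Window
from pyspark.sql.functions import col, lag

w = Window.partitionBy("ID").orderBy("Year")
deltas = (df.withColumn("prev", lag("Value").over(w))
            .where(col("prev").isNotNull())
            .select("ID", "Year", (col("Value") - col("prev")).alias("Value")))

lag("Value") picks up the previous Year's Value within each ID, so the subtraction gives the per-year delta without a self-join.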
Re: pyspark: calculating row deltas
Sure, for a dataframe that looks like this

ID Year Value
1 2012 100
1 2013 102
1 2014 106
2 2012 110
2 2013 118
2 2014 128

I'd like to get back

ID Year Value
1 2013 2
1 2014 4
2 2013 8
2 2014 10

i.e. the Value for an ID,Year combination is the Value for the ID,Year minus the Value for the ID,Year-1 thanks On 10 January 2016 at 20:51, Femi Anthony wrote: > Can you clarify what you mean with an actual example ? > > For example, if your data frame looks like this: > > ID Year Value > 1 2012 100 > 2 2013 101 > 3 2014 102 > > What's your desired output ? > > Femi > > > On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter > wrote: > >> >> Hi, >> >> I have a DataFrame with the columns >> >> ID,Year,Value >> >> I'd like to create a new Column that is Value2-Value1 where the >> corresponding Year2=Year-1 >> >> At the moment I am creating a new DataFrame with renamed columns and >> doing >> >>DF.join(DF2, . . . .) >> >> This looks cumbersome to me - is there a better way? >> >> thanks >> >> >> -- >> Franc >> > > > > -- > http://www.femibyte.com/twiki5/bin/view/Tech/ > http://www.nextmatrix.com > "Great spirits have always encountered violent opposition from mediocre > minds." - Albert Einstein. > -- Franc
Re: pyspark: conditionals inside functions
Got it, I needed to use the when/otherwise construct - code below

def getSunday(day):
    day = day.cast("date")
    sun = next_day(day, "Sunday")
    n = datediff(sun, day)
    x = when(n == 7, day).otherwise(sun)
    return x

(The if/else version below fails because n is a Column expression rather than a Python boolean; when/otherwise builds the conditional into the expression itself.)

On 10 January 2016 at 08:41, Franc Carter wrote: > > My Python is not particularly good, so I'm afraid I don't understand what > that means > > cheers > > > On 9 January 2016 at 14:45, Franc Carter wrote: > >> >> Hi, >> >> I'm trying to write a short function that returns the last Sunday of the >> week of a given date, code below >> >> def getSunday(day): >> >> day = day.cast("date") >> >> sun = next_day(day, "Sunday") >> >> n = datediff(sun,day) >> >> if (n == 7): >> >> return day >> >> else: >> >> return sun >> >> >> this gives me >> >> ValueError: Cannot convert column into bool: >> >> >> Can someone point out what I am doing wrong >> >> thanks >> >> >> -- >> Franc >> > > -- > Franc > -- Franc
pyspark: calculating row deltas
Hi, I have a DataFrame with the columns ID,Year,Value I'd like to create a new Column that is Value2-Value1 where the corresponding Year2=Year-1 At the moment I am creating a new DataFrame with renamed columns and doing DF.join(DF2, . . . .) This looks cumbersome to me - is there a better way? thanks -- Franc
Re: pyspark: conditionals inside functions
My Python is not particularly good, so I'm afraid I don't understand what that means cheers On 9 January 2016 at 14:45, Franc Carter wrote: > > Hi, > > I'm trying to write a short function that returns the last Sunday of the > week of a given date, code below > > def getSunday(day): > > day = day.cast("date") > > sun = next_day(day, "Sunday") > > n = datediff(sun,day) > > if (n == 7): > > return day > > else: > > return sun > > > this gives me > > ValueError: Cannot convert column into bool: > > > Can someone point out what I am doing wrong > > thanks > > > -- > Franc > -- Franc
pyspark: conditionals inside functions
Hi, I'm trying to write a short function that returns the last Sunday of the week of a given date, code below

def getSunday(day):
    day = day.cast("date")
    sun = next_day(day, "Sunday")
    n = datediff(sun, day)
    if (n == 7):
        return day
    else:
        return sun

this gives me

ValueError: Cannot convert column into bool:

Can someone point out what I am doing wrong? thanks -- Franc
Re: number of executors in sparkR.init()
Thanks, that works cheers On 26 December 2015 at 16:53, Felix Cheung wrote: > The equivalent for spark-submit --num-executors should be > spark.executor.instances > when used in SparkConf. > http://spark.apache.org/docs/latest/running-on-yarn.html > > Could you try setting that with sparkR.init()? > > > _____ > From: Franc Carter > Sent: Friday, December 25, 2015 9:23 PM > Subject: number of executors in sparkR.init() > To: > > > > Hi, > > I'm having trouble working out how to get the number of executors set when > using sparkR.init(). > > If I start sparkR with > > sparkR --master yarn --num-executors 6 > > then I get 6 executors > > However, if start sparkR with > > sparkR > > followed by > > sc <- sparkR.init(master="yarn-client", > sparkEnvir=list(spark.num.executors='6')) > > then I only get 2 executors. > > Can anyone point me in the direction of what I might be doing wrong? I need > to initialise this way so that RStudio can hook in to SparkR > > thanks > > -- > Franc > > > -- Franc
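For reference, the setting that worked, as a minimal sketch (the value is passed as a string, and yarn-client mode is assumed as in the original post):

sc <- sparkR.init(master = "yarn-client",
                  sparkEnvir = list(spark.executor.instances = "6"))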
number of executors in sparkR.init()
Hi, I'm having trouble working out how to get the number of executors set when using sparkR.init(). If I start sparkR with sparkR --master yarn --num-executors 6 then I get 6 executors However, if start sparkR with sparkR followed by sc <- sparkR.init(master="yarn-client", sparkEnvir=list(spark.num.executors='6')) then I only get 2 executors. Can anyone point me in the direction of what I might be doing wrong? I need to initialise this way so that RStudio can hook in to SparkR thanks -- Franc
Re: SparkR csv without headers
Thanks - works nicely cheers On Fri, Aug 21, 2015 at 12:43 PM, Sun, Rui wrote: > Hi, > > You can create a DataFrame using load.df() with a specified schema. > > Something like: > > schema <- structType(structField("a", "string"), structField("b", "integer"), ...) > > read.df(..., schema = schema) > > *From:* Franc Carter [mailto:franc.car...@rozettatech.com] > *Sent:* Wednesday, August 19, 2015 1:48 PM > *To:* user@spark.apache.org > *Subject:* SparkR csv without headers > > > > Hi, > > Does anyone have an example of how to create a DataFrame in SparkR which > specifies the column names - the csv files I have do not have column names > in the first row. I can read a csv nicely with > com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2, > C3 etc > > thanks > > -- > > Franc Carter | Systems Architect | RoZetta Technology > -- Franc Carter | Systems Architect | RoZetta Technology
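For reference, a fuller sketch of Rui's suggestion (the column names and types are illustrative; read.df is the SparkR entry point and spark-csv the source used in the original post):

schema <- structType(structField("id", "string"),
                     structField("x", "integer"),
                     structField("y", "double"))
df <- read.df(sqlContext, "data.csv",
              source = "com.databricks.spark.csv",
              header = "false", schema = schema)

The structField names become the DataFrame's column names, replacing the C1, C2, C3 defaults.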
SparkR csv without headers
Hi, Does anyone have an example of how to create a DataFrame in SparkR which specifies the column names - the csv files I have do not have column names in the first row. I can read a csv nicely with com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2, C3 etc thanks -- Franc Carter | Systems Architect | RoZetta Technology
Column operation on Spark RDDs.
Hi, I have an RDD with MANY columns (e.g., hundreds), and most of my operation is on columns, e.g., I need to create many intermediate variables from different columns. What is the most efficient way to do this? For example, if my dataRDD[Array[String]] is like below:

123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
..
29, 94, 956, ..., 758

I will need to create a new column or a variable as newCol1 = 2ndCol+19thCol, and another new column based on newCol1 and the existing columns: newCol2 = function(newCol1, 34thCol). What is the best way of doing this? I have been thinking of using an index for the intermediate variables and the dataRDD, and then joining them together on the index to do my calculation:

var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt = dataRDD.zipWithIndex.map(_.swap)
val newCol1 = dataRDD.map(x => x(1)+x(18)).zipWithIndex.map(_.swap)
val newCol2 = newCol1.join(dt).map(x => function(.))

Is there a better way of doing this? Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Column-operation-on-Spark-RDDs-tp23165.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
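For reference, a sketch that avoids the zipWithIndex/join entirely by deriving the new columns in a single pass per row (f stands in for the user's own newCol2 logic and is hypothetical; the parse assumes all fields are numeric):

def f(a: Double, b: Double): Double = a + b  // placeholder for the real function

val dataRDD = sc.textFile("/test.csv")
  .map(_.split(",").map(_.trim.toDouble))

val withNewCols = dataRDD.map { x =>
  val newCol1 = x(1) + x(18)      // 2nd + 19th column (0-based indexing)
  val newCol2 = f(newCol1, x(33)) // newCol1 combined with the 34th column
  x :+ newCol1 :+ newCol2
}

Because each new value depends only on its own row, a map keeps everything local to the partition - no shuffle, unlike the join.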
Re: How to add a column to a spark RDD with many columns?
Thanks for your reply! It is what I am after. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22740.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
How to add a column to a spark RDD with many columns?
Hi all, I have an RDD with *MANY* columns (e.g., *hundreds*), how do I add one more column at the end of this RDD? For example, if my RDD is like below:

123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
..
29, 94, 956, ..., 758

how can I efficiently add a column to it, whose value is the sum of the 2nd and the 200th columns? Thank you very much. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
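For reference, a minimal sketch (assuming an RDD[Array[String]] named dataRDD; the 2nd and 200th columns are 0-based indices 1 and 199):

val withSum = dataRDD.map { x =>
  x :+ (x(1).toDouble + x(199).toDouble).toString
}

A plain map is enough here - appending a value derived from the same row never needs a shuffle.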
Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?
One issue is that 'big' becomes 'not so big' reasonably quickly. A couple of TeraBytes is not that challenging (depending on the algorithm) these days, whereas 5 years ago it was a big challenge. We have a bit over a PetaByte (not using Spark) and using a distributed system is the only viable way to get reasonable performance for reasonable cost. cheers On Tue, Mar 31, 2015 at 4:55 AM, Steve Loughran wrote: > > On 30 Mar 2015, at 13:27, jay vyas wrote: > > Just the same as spark was disrupting the hadoop ecosystem by changing > the assumption that "you can't rely on memory in distributed > analytics"...now maybe we are challenging the assumption that "big data > analytics need to distributed"? > > I've been asking the same question lately and seen similarly that spark > performs quite reliably and well on local single node system even for an > app which I ran for a streaming app which I ran for ten days in a row... I > almost felt guilty that I never put it on a cluster! > > > Modern machines can be pretty powerful: 16 physical cores HT'd to 32, > 384+GB, GPU, giving you lots of compute. What you don't get is the storage > capacity to match, and especially, the IO bandwidth. RAID-0 striping 2-4 > HDDs gives you some boost, but if you are reading, say, a 4 GB file from > HDFS broken in to 256MB blocks, you have that data replicated into (4*4*3) > blocks: 48. Algorithm and capacity permitting, you've just massively > boosted your load time. Downstream, if data can be thinned down, then you > can start looking more at things you can do on a single host : a machine > that can be in your Hadoop cluster. Ask YARN nicely and you can get a > dedicated machine for a couple of days (i.e. until your Kerberos tokens > expire). > > -- Franc Carter | Systems Architect | RoZetta Technology
Re: FW: Submitting jobs to Spark EC2 cluster remotely
ise$DefaultPromise.ready(Promise.scala:219) > >> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > >> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > >> at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > >> at scala.concurrent.Await$.result(package.scala:107) > >> at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127) > >> at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) > >> at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) > >> ... 7 more > >> > >> > >> > >> -----Original Message----- > >> From: Patrick Wendell [mailto:pwend...@gmail.com] > >> Sent: Monday, February 23, 2015 12:17 AM > >> To: Oleg Shirokikh > >> Subject: Re: Submitting jobs to Spark EC2 cluster remotely > >> > >> The reason is that the file needs to be in a globally visible > >> filesystem where the master node can download. So it needs to be on > >> s3, for instance, rather than on your local filesystem. > >> > >> - Patrick > >> > >> On Sun, Feb 22, 2015 at 11:55 PM, olegshirokikh > wrote: > >>> I've set up the EC2 cluster with Spark. Everything works, all > >>> master/slaves are up and running. > >>> > >>> I'm trying to submit a sample job (SparkPi). When I ssh to cluster > >>> and submit it from there - everything works fine. However when > >>> driver is created on a remote host (my laptop), it doesn't work. > >>> I've tried both modes for > >>> `--deploy-mode`: > >>> > >>> **`--deploy-mode=client`:** > >>> > >>> From my laptop: > >>> > >>> ./bin/spark-submit --master > >>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 > >>> --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar > >>> > >>> Results in the following indefinite warnings/errors: > >>> > >>>> WARN TaskSchedulerImpl: Initial job has not accepted any > >>>> resources; check your cluster UI to ensure that workers are > >>>> registered and have sufficient memory 15/02/22 18:30:45 > >>> > >>>> ERROR SparkDeploySchedulerBackend: Asked to remove non-existent > >>>> executor 0 > >>>> 15/02/22 18:30:45 > >>> > >>>> ERROR SparkDeploySchedulerBackend: Asked to remove non-existent > >>>> executor 1 > >>> > >>> ...and failed drivers - in Spark Web UI "Completed Drivers" with > >>> "State=ERROR" appear. > >>> > >>> I've tried to pass limits for cores and memory to submit script but > >>> it didn't help... > >>> > >>> **`--deploy-mode=cluster`:** > >>> > >>> From my laptop: > >>> > >>> ./bin/spark-submit --master > >>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 > >>> --deploy-mode cluster --class SparkPi > >>> ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar > >>> > >>> The result is: > >>> > >>>> Driver successfully submitted as driver-20150223023734-0007 ... > >>>> waiting before polling master for driver state ... polling master > >>>> for driver state State of driver-20150223023734-0007 is ERROR > >>>> Exception from cluster was: java.io.FileNotFoundException: File > >>>> file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.1 > >>>> 0 -0.0.1.jar does not exist. java.io.FileNotFoundException: File > >>>> file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar > >>>> does not exist.
at > >>>> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) > >>>> at > >>>> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) > >>>> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) at > >>>> org.apache.spark.deploy.worker.DriverRunner.org $apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150) > >>>> at > >>>> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunne > >>>> r > >>>> .scala:75) > >>> > >>> So, I'd appreciate any pointers on what is going wrong and some > >>> guidance how to deploy jobs from remote client. Thanks. > >>> > >>> > >>> > >>> -- > >>> View this message in context: > >>> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs- > >>> t o-Spark-EC2-cluster-remotely-tp21762.html > >>> Sent from the Apache Spark User List mailing list archive at Nabble.com. > >>> > >>> > >>> - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For > >>> additional commands, e-mail: user-h...@spark.apache.org > >>> > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Franc Carter | Systems Architect | RoZetta Technology
How to sum up the values in the columns of a dataset in Scala?
I am new to Scala. I have a dataset with many columns, and each column has a column name. Given several column names (these column names are not fixed - they are generated dynamically), I need to sum up the values of these columns. Is there an efficient way of doing this? I worked out a way using a for loop, but I don't think it is efficient:

val AllLabels = List("ID", "val1", "val2", "val3", "val4")
val lbla = List("val1", "val3", "val4")
val index_lbla = lbla.map(x => AllLabels.indexOf(x))

val dataRDD = sc.textFile("../test.csv").map(_.split(","))

dataRDD.map(x => {
  var sum = 0.0
  for (i <- index_lbla) sum = sum + x(i).toDouble
  sum
}).collect

The test.csv looks like below (without column names):

"ID", "val1", "val2", "val3", "val4"
A, 123, 523, 534, 893
B, 536, 98, 1623, 98472
C, 537, 89, 83640, 9265
D, 7297, 98364, 9, 735
...

Your help is very much appreciated! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-sum-up-the-values-in-the-columns-of-a-dataset-in-Scala-tp21639.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
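For reference, a slightly more idiomatic sketch of the same computation (same index_lbla and dataRDD as above):

val sums = dataRDD.map(x => index_lbla.map(i => x(i).trim.toDouble).sum).collect()

Efficiency-wise the for loop and the map/sum are equivalent here - both are a single pass over each row inside the same map stage.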
Re: spark, reading from s3
Check that your timezone is correct as well; an incorrect timezone can make it look like your time is correct when it is actually skewed. cheers On Fri, Feb 13, 2015 at 5:51 AM, Kane Kim wrote: > The thing is that my time is perfectly valid... > > On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das > wrote: > >> It's the timezone actually, you can either use NTP to maintain an >> accurate system clock or you can adjust your system time to match with the >> AWS one. You can do it as: >> >> telnet s3.amazonaws.com 80 >> GET / HTTP/1.0 >> >> >> [image: Inline image 1] >> >> Thanks >> Best Regards >> >> On Wed, Feb 11, 2015 at 6:43 AM, Kane Kim wrote: >> >>> I'm getting this warning when using s3 input: >>> 15/02/11 00:58:37 WARN RestStorageService: Adjusted time offset in >>> response to >>> RequestTimeTooSkewed error. Local machine and S3 server disagree on the >>> time by approximately 0 seconds. Retrying connection. >>> >>> After that there are tons of 403/forbidden errors and then job fails. >>> It's sporadic, so sometimes I get this error and sometimes not, what >>> could be the issue? >>> I think it could be related to network connectivity? >>> >> >> > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: Datastore HDFS vs Cassandra
I forgot to mention that if you do decide to use Cassandra I'd highly recommend jumping on the Cassandra mailing list; if we had taken in some of the advice on that list things would have been considerably smoother cheers On Wed, Feb 11, 2015 at 8:12 PM, Christian Betz < christian.b...@performance-media.de> wrote: > Hi > > Regarding the Cassandra Data model, there's an excellent post on the > ebay tech blog: > http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/. > There's also a slideshare for this somewhere. > > Happy hacking > > Chris > > From: Franc Carter > Date: Wednesday, 11 February 2015 10:03 > To: Paolo Platter > Cc: Mike Trienis , "user@spark.apache.org" < > user@spark.apache.org> > Subject: Re: Datastore HDFS vs Cassandra > > > One additional comment I would make is that you should be careful with > Updates in Cassandra, it does support them but large amounts of Updates > (i.e. changing existing keys) tend to cause fragmentation. If you are > (mostly) adding new keys (e.g. new records in the time series) then > Cassandra can be excellent > > cheers > > > On Wed, Feb 11, 2015 at 6:13 PM, Paolo Platter > wrote: > >> Hi Mike, >> >> I developed a Solution with cassandra and spark, using DSE. >> The main difficult is about cassandra, you need to understand very well >> its data model and its Query patterns. >> Cassandra has better performance than hdfs and it has DR and stronger >> availability. >> Hdfs is a filesystem, cassandra is a dbms. >> Cassandra supports full CRUD without acid. >> Hdfs is more flexible than cassandra. >> >> In my opinion, if you have a real time series, go with Cassandra paying >> attention at your reporting data access patterns. >> >> Paolo >> >> Sent from my Windows Phone >> -- >> From: Mike Trienis >> Sent: 11/02/2015 05:59 >> To: user@spark.apache.org >> Subject: Datastore HDFS vs Cassandra >> >> Hi, >> >> I am considering implementing Apache Spark on top of the Cassandra database after >> listening to a related talk and reading through the slides from DataStax. It >> seems to fit well with our time-series data and reporting requirements. >> >> >> http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data >> >> Does anyone have any experiences using Apache Spark and Cassandra, >> including >> limitations (and or) technical difficulties? How does Cassandra compare >> with >> HDFS and what use cases would make HDFS more suitable? >> >> Thanks, Mike. >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Datastore-HDFS-vs-Cassandra-tp21590.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > > > -- > > Franc Carter | Systems Architect | Rozetta Technology > > franc.car...@rozettatech.com | > www.rozettatechnology.com > > Tel: +61 2 8355 2515 > > Level 4, 55 Harrington St, The Rocks NSW 2000 > > PO Box H58, Australia Square, Sydney NSW 1215 > > AUSTRALIA > > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: Datastore HDFS vs Cassandra
One additional comment I would make is that you should be careful with Updates in Cassandra. It does support them, but large amounts of Updates (i.e. changing existing keys) tend to cause fragmentation. If you are (mostly) adding new keys (e.g. new records in the time series) then Cassandra can be excellent cheers On Wed, Feb 11, 2015 at 6:13 PM, Paolo Platter wrote: > Hi Mike, > > I developed a Solution with cassandra and spark, using DSE. > The main difficult is about cassandra, you need to understand very well > its data model and its Query patterns. > Cassandra has better performance than hdfs and it has DR and stronger > availability. > Hdfs is a filesystem, cassandra is a dbms. > Cassandra supports full CRUD without acid. > Hdfs is more flexible than cassandra. > > In my opinion, if you have a real time series, go with Cassandra paying > attention at your reporting data access patterns. > > Paolo > > Sent from my Windows Phone > -- > From: Mike Trienis > Sent: 11/02/2015 05:59 > To: user@spark.apache.org > Subject: Datastore HDFS vs Cassandra > > Hi, > > I am considering implementing Apache Spark on top of the Cassandra database after > listening to a related talk and reading through the slides from DataStax. It > seems to fit well with our time-series data and reporting requirements. > > > http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data > > Does anyone have any experiences using Apache Spark and Cassandra, > including > limitations (and or) technical difficulties? How does Cassandra compare > with > HDFS and what use cases would make HDFS more suitable? > > Thanks, Mike. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Datastore-HDFS-vs-Cassandra-tp21590.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > ----- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: How to create spark AMI in AWS
Hi, I'm very new to Spark, but experienced with AWS - so take that into account with my suggestions. I started with an AWS base image and then added the pre-built Spark-1.2. I then made a 'Master' version and a 'Worker' version, and made AMIs for them. The Master comes up with a static IP and the Worker image has this baked in. I haven't completed everything I am planning to do, but so far I can bring up the Master and a bunch of Workers inside an ASG and run Spark code successfully. cheers On Mon, Feb 9, 2015 at 10:06 PM, Guodong Wang wrote: > Hi guys, > > I want to launch spark cluster in AWS. And I know there is a spark_ec2.py > script. > > I am using the AWS service in China. But I can not find the AMI in the > region of China. > > So, I have to build one. My question is > 1. Where is the bootstrap script to create the Spark AMI? Is it here( > https://github.com/mesos/spark-ec2/blob/branch-1.3/create_image.sh) ? > 2. What is the base image of the Spark AMI? Eg, the base image of this ( > https://github.com/mesos/spark-ec2/blob/branch-1.3/ami-list/us-west-1/hvm) > 3. Shall I install scala during building the AMI? > > > Thanks. > > Guodong > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: No AMI for Spark 1.2 using ec2 scripts
AMIs are specific to an AWS region, so the ami-id of the Spark AMI in us-west will be different if it exists at all. I can't remember where, but I recall seeing that the AMI was only in us-east cheers On Mon, Jan 26, 2015 at 8:47 PM, Håkan Jonsson wrote: > Thanks, > > I also use Spark 1.2 with prebuilt for Hadoop 2.4. I launch both 1.1 and > 1.2 with the same command: > > ./spark-ec2 -k foo -i bar.pem launch mycluster > > By default this launches in us-east-1. I tried changing the the region > using: > > -r us-west-1 but that had the same result: > > Could not resolve AMI at: > https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm > > Looking up > https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm in a > browser results in the same AMI ID as yours. If I search for ami-7a320f3f > AMI in AWS, I can't find any such image. I tried searching in all regions I > could find in the AWS console. > > The AMI for 1.1 is spark.ami.pvm.v9 (ami-5bb18832). I can find that AMI in > us-west-1. > > Strange. Not sure what to do. > > /Håkan > > > On Mon Jan 26 2015 at 9:02:42 AM Charles Feduke > wrote: > > I definitely have Spark 1.2 running within EC2 using the spark-ec2 > scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later. > > What parameters are you using when you execute spark-ec2? > > > I am launching in the us-west-1 region (ami-7a320f3f) which may explain > things. > > On Mon Jan 26 2015 at 2:40:01 AM hajons wrote: > > Hi, > > When I try to launch a standalone cluster on EC2 using the scripts in the > ec2 directory for Spark 1.2, I get the following error: > > Could not resolve AMI at: > https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm > > It seems there is not yet any AMI available on EC2. Any ideas when there > will be one? > > This works without problems for version 1.1. Starting up a cluster using > these scripts is so simple and straightforward, so I am really missing it > on > 1.2. > > /Håkan > > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/No-AMI-for-Spark-1-2-using-ec2-scripts-tp21362.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Does DecisionTree model in MLlib deal with missing values?
Hi, I am new to the MLlib in Spark. Can the DecisionTree model in MLlib deal with missing values? If so, what data structure should I use for the input? Moreover, my data has categorical features, but LabeledPoint requires the "double" data type - what can I do in this case? Thank you very much. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-DecisionTree-model-in-MLlib-deal-with-missing-values-tp21080.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
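For reference: MLlib's decision trees have no built-in handling for missing values (the usual advice is to impute or drop them before training), while categorical features are supported by encoding each category as a 0-based double index and declaring the arity in categoricalFeaturesInfo. A minimal sketch with hypothetical data:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// feature 0 is categorical with 3 levels, encoded as 0.0 / 1.0 / 2.0
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 5.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 3.4))
))
val categoricalFeaturesInfo = Map(0 -> 3)  // feature index -> number of categories
val model = DecisionTree.trainClassifier(data, 2, categoricalFeaturesInfo,
                                         "gini", 5, 32)

The doubles for a categorical feature are just indices; the tree treats them as unordered categories rather than as numbers.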
Re: Reading from a centralized store
Ah, so it's RDD specific - that would make sense. For those systems where it is possible to extract sensible subsets, the RDDs do so. My use case, which is probably biasing my thinking, is DynamoDB, which I don't think can efficiently extract records from M-to-N cheers On Wed, Jan 7, 2015 at 6:59 AM, Cody Koeninger wrote: > No, most rdds partition input data appropriately. > > On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter > wrote: > >> >> One more question, to clarify: will every node pull in all the data? >> >> thanks >> >> On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger >> wrote: >> >>> If you are not co-locating spark executor processes on the same machines >>> where the data is stored, and using an rdd that knows about which node to >>> prefer scheduling a task on, yes, the data will be pulled over the network. >>> >>> Of the options you listed, S3 and DynamoDB cannot have spark running on >>> the same machines. Cassandra can be run on the same nodes as spark, and >>> recent versions of the spark cassandra connector implement preferred >>> locations. You can run an rdbms on the same nodes as spark, but JdbcRDD >>> doesn't implement preferred locations. >>> >>> On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter < >>> franc.car...@rozettatech.com> wrote: >>> >>>> >>>> Hi, >>>> >>>> I'm trying to understand how a Spark Cluster behaves when the data it >>>> is processing resides on a centralized/remote store (S3, Cassandra, >>>> DynamoDB, RDBMS etc). >>>> >>>> Does every node in the cluster retrieve all the data from the central >>>> store ? >>>> >>>> thanks >>>> >>>> -- >>>> >>>> Franc Carter | Systems Architect | Rozetta Technology >>>> >>>> franc.car...@rozettatech.com | >>>> www.rozettatechnology.com >>>> >>>> Tel: +61 2 8355 2515 >>>> >>>> Level 4, 55 Harrington St, The Rocks NSW 2000 >>>> >>>> PO Box H58, Australia Square, Sydney NSW 1215 >>>> >>>> AUSTRALIA >>>> >>>> >>> >> >> >> -- >> >> Franc Carter | Systems Architect | Rozetta Technology >> >> franc.car...@rozettatech.com | >> www.rozettatechnology.com >> >> Tel: +61 2 8355 2515 >> >> Level 4, 55 Harrington St, The Rocks NSW 2000 >> >> PO Box H58, Australia Square, Sydney NSW 1215 >> >> AUSTRALIA >> >> > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: Reading from a centralized store
One more question, to clarify: will every node pull in all the data? thanks On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger wrote: > If you are not co-locating spark executor processes on the same machines > where the data is stored, and using an rdd that knows about which node to > prefer scheduling a task on, yes, the data will be pulled over the network. > > Of the options you listed, S3 and DynamoDB cannot have spark running on > the same machines. Cassandra can be run on the same nodes as spark, and > recent versions of the spark cassandra connector implement preferred > locations. You can run an rdbms on the same nodes as spark, but JdbcRDD > doesn't implement preferred locations. > > On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter > wrote: > >> >> Hi, >> >> I'm trying to understand how a Spark Cluster behaves when the data it is >> processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, >> RDBMS etc). >> >> Does every node in the cluster retrieve all the data from the central >> store ? >> >> thanks >> >> -- >> >> Franc Carter | Systems Architect | Rozetta Technology >> >> franc.car...@rozettatech.com | >> www.rozettatechnology.com >> >> Tel: +61 2 8355 2515 >> >> Level 4, 55 Harrington St, The Rocks NSW 2000 >> >> PO Box H58, Australia Square, Sydney NSW 1215 >> >> AUSTRALIA >> >> > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: Reading from a centralized store
Thanks, that's what I suspected. cheers On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger wrote: > If you are not co-locating spark executor processes on the same machines > where the data is stored, and using an rdd that knows about which node to > prefer scheduling a task on, yes, the data will be pulled over the network. > > Of the options you listed, S3 and DynamoDB cannot have spark running on > the same machines. Cassandra can be run on the same nodes as spark, and > recent versions of the spark cassandra connector implement preferred > locations. You can run an rdbms on the same nodes as spark, but JdbcRDD > doesn't implement preferred locations. > > On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter > wrote: > >> >> Hi, >> >> I'm trying to understand how a Spark Cluster behaves when the data it is >> processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, >> RDBMS etc). >> >> Does every node in the cluster retrieve all the data from the central >> store ? >> >> thanks >> >> -- >> >> *Franc Carter* | Systems Architect | Rozetta Technology >> >> franc.car...@rozettatech.com | >> www.rozettatechnology.com >> >> Tel: +61 2 8355 2515 >> >> Level 4, 55 Harrington St, The Rocks NSW 2000 >> >> PO Box H58, Australia Square, Sydney NSW 1215 >> >> AUSTRALIA >> >> > -- *Franc Carter* | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Reading from a centralized store
Hi, I'm trying to understand how a Spark Cluster behaves when the data it is processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, RDBMS etc). Does every node in the cluster retrieve all the data from the central store ? thanks -- *Franc Carter* | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: How to compile a Spark project in Scala IDE for Eclipse?
Thanks for your reply Wei, will try this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7224.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How to compile a Spark project in Scala IDE for Eclipse?
Thanks a lot Krishna, this works for me. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7223.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
How to compile a Spark project in Scala IDE for Eclipse?
Hi All, I just downloaded the Scala IDE for Eclipse. After I created a Spark project and clicked "Run", there was an error on this line of code "import org.apache.spark.SparkContext": "object apache is not a member of package org". I guess I need to import the Spark dependency into Scala IDE for Eclipse - can anyone tell me how to do it? Thanks a lot. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
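For reference, one common fix, assuming an sbt-based project: declare the Spark dependency and regenerate the Eclipse project files (e.g. with the sbteclipse plugin), or alternatively add the prebuilt Spark assembly jar to the project's Java Build Path. The version below is illustrative:

// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

Once the jar is on the build path, import org.apache.spark.SparkContext resolves.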
Re: How to get the help or explanation for the functions in Spark shell?
Thank you very much Gerard. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191p7193.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
How to get the help or explanation for the functions in Spark shell?
Hi All, I am new to Spark. In the Spark shell, how can I get the help or explanation for those functions that I can use on a variable or RDD? For example, after I input an RDD's name with a dot (.) at the end, if I press the Tab key, a list of functions that I can use for this RDD will be displayed, but I don't know how to use these functions. Your help is greatly appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: K-nearest neighbors search in Spark
Hi Andrew, Thank you for your info. I will have a look at these links. Thanks, Carter Date: Tue, 27 May 2014 09:06:02 -0700 From: ml-node+s1001560n6436...@n3.nabble.com To: gyz...@hotmail.com Subject: Re: K-nearest neighbors search in Spark Hi Carter, In Spark 1.0 there will be an implementation of k-means available as part of MLLib. You can see the documentation for that below (until 1.0 is fully released). https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/mllib-clustering.html Maybe diving into the source here will help get you started? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala Cheers, Andrew On Tue, May 27, 2014 at 4:10 AM, Carter <[hidden email]> wrote: Any suggestion is very much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6465.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: K-nearest neighbors search in Spark
Hi Krishna, Thank you very much for your code. I will use it as a good start point. Thanks, Carter Date: Tue, 27 May 2014 16:42:39 -0700 From: ml-node+s1001560n6455...@n3.nabble.com To: gyz...@hotmail.com Subject: Re: K-nearest neighbors search in Spark Carter, Just as a quick & simple starting point for Spark. (caveats - lots of improvements reqd for scaling, graceful and efficient handling of RDD et al):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import scala.collection.immutable.ListMap
import scala.collection.immutable.SortedMap

object TopK {
  //
  def getCurrentDirectory = new java.io.File(".").getCanonicalPath
  //
  def distance(x1: List[Int], x2: List[Int]): Double = {
    val dist: Double = math.sqrt(math.pow(x1(1)-x2(1), 2) + math.pow(x1(2)-x2(2), 2))
    dist
  }
  //
  def main(args: Array[String]): Unit = {
    // println(getCurrentDirectory)
    val sc = new SparkContext("local", "TopK", "spark://USS-Defiant.local:7077")
    println(s"Running Spark Version ${sc.version}")
    val file = sc.textFile("data01.csv")
    //
    val data = file
      .map(line => line.split(","))
      .map(x1 => List(x1(0).toInt, x1(1).toInt, x1(2).toInt))
    // val data1 = data.collect
    println("data")
    for (d <- data) {
      println(d)
      println(d(0))
    }
    //
    val distList = for (d <- data) yield { d(0) }
    // for (d <- distList) (println(d))
    val zipList = for (a <- distList.collect; b <- distList.collect) yield { List(a, b) }
    zipList.foreach(println(_))
    //
    val dist = for (l <- zipList) yield {
      println(s"${l(0)} = ${l(1)}")
      val x1a: Array[List[Int]] = data.filter(d => d(0) == l(0)).collect
      val x2a: Array[List[Int]] = data.filter(d => d(0) == l(1)).collect
      val x1: List[Int] = x1a(0)
      val x2: List[Int] = x2a(0)
      val dist = distance(x1, x2)
      Map(dist -> l)
    }
    dist.foreach(println(_))
    // sort this for topK
    //
  }
}

data01.csv

1,68,93
2,12,90
3,45,76
4,86,54

HTH. Cheers On Tue, May 27, 2014 at 4:10 AM, Carter <[hidden email]> wrote: Any suggestion is very much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6464.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: K-nearest neighbors search in Spark
Any suggestion is very much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
K-nearest neighbors search in Spark
Hi all, I want to implement a basic K-nearest neighbors search in Spark, but I am totally new to Scala so don't know where to start. My data consists of millions of points. For each point, I need to compute its Euclidean distance to the other points, and return the top-K points that are closest to it. The data.txt is in the comma-separated format like this:

ID, X, Y
1, 68, 93
2, 12, 90
3, 45, 76
...
100, 86, 54

Could you please tell me what data structure I should use, and how to implement this algorithm in Scala (*some sample code is greatly appreciated*). Thank you very much. Regards, Carter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
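For reference, a naive brute-force sketch in the spirit of the thread (cartesian generates all O(n²) pairs, so this is only workable for modest sizes; the file name and k are illustrative):

val points = sc.textFile("data.txt")
  .filter(!_.startsWith("ID"))            // skip the header row
  .map(_.split(",").map(_.trim))
  .map(a => (a(0).toInt, (a(1).toDouble, a(2).toDouble)))

val k = 5
val knn = points.cartesian(points)
  .filter { case ((id1, _), (id2, _)) => id1 != id2 }
  .map { case ((id1, (x1, y1)), (id2, (x2, y2))) =>
    (id1, (id2, math.sqrt(math.pow(x1 - x2, 2) + math.pow(y1 - y2, 2))))
  }
  .groupByKey()
  .mapValues(_.toSeq.sortBy(_._2).take(k))  // k nearest neighbours per point

For millions of points the all-pairs approach won't scale; some form of spatial partitioning or locality-sensitive hashing would be needed.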
Re: "sbt/sbt run" command returns a JVM problem
Hi Akhil, Thanks for your reply. I have tried this option with different values, but it still doesn't work. The Java version I am using is jre1.7.0_55; does the Java version matter for this problem? Thanks.
RE: "sbt/sbt run" command returns a JVM problem
Hi, I still have over 1G left for my program.

Date: Sun, 4 May 2014 19:14:30 -0700
From: ml-node+s1001560n5340...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: "sbt/sbt run" command returns a JVM problem

The total memory of your machine is 2G, right? Then how much memory is left free? Wouldn't Ubuntu take up quite a big portion of the 2G? Just a guess!

On Sat, May 3, 2014 at 8:15 PM, Carter <[hidden email]> wrote:
> Hi, thanks for all your help. I tried your setting in the sbt file, but the problem is still there. The Java setting in my sbt file is:
>
> java \
>   -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
>   -jar ${JAR} \
>   "$@"
>
> I have tried setting these 3 parameters bigger and smaller, but nothing works. Did I change the right thing? Thank you very much.
Re: "sbt/sbt run" command returns a JVM problem
Hi Michael,

The log after I typed "last" is as below:

> last
scala.tools.nsc.MissingRequirementError: object scala not found.
	at scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)
	at scala.tools.nsc.symtab.Definitions$definitions$.getModule(Definitions.scala:605)
	at scala.tools.nsc.symtab.Definitions$definitions$.ScalaPackage(Definitions.scala:145)
	at scala.tools.nsc.symtab.Definitions$definitions$.ScalaPackageClass(Definitions.scala:146)
	at scala.tools.nsc.symtab.Definitions$definitions$.AnyClass(Definitions.scala:176)
	at scala.tools.nsc.symtab.Definitions$definitions$.init(Definitions.scala:814)
	at scala.tools.nsc.Global$Run.<init>(Global.scala:697)
	at sbt.compiler.Eval$$anon$1.<init>(Eval.scala:53)
	at sbt.compiler.Eval.run$1(Eval.scala:53)
	at sbt.compiler.Eval.unlinkAll$1(Eval.scala:56)
	at sbt.compiler.Eval.eval(Eval.scala:62)
	at sbt.EvaluateConfigurations$.evaluateSetting(Build.scala:104)
	at sbt.BuiltinCommands$$anonfun$set$1.apply(Main.scala:212)
	at sbt.BuiltinCommands$$anonfun$set$1.apply(Main.scala:209)
	at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
	at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
	at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
	at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
	at sbt.Command$.process(Command.scala:90)
	at sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
	at sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
	at sbt.State$$anon$2.process(State.scala:171)
	at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
	at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
	at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
	at sbt.MainLoop$.next(MainLoop.scala:71)
	at sbt.MainLoop$.run(MainLoop.scala:64)
	at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:53)
	at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:50)
	at sbt.Using.apply(Using.scala:25)
	at sbt.MainLoop$.runWithNewLog(MainLoop.scala:50)
	at sbt.MainLoop$.runAndClearLast(MainLoop.scala:33)
	at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:17)
	at sbt.MainLoop$.runLogged(MainLoop.scala:13)
	at sbt.xMain.run(Main.scala:26)
	at xsbt.boot.Launch$.run(Launch.scala:55)
	at xsbt.boot.Launch$$anonfun$explicit$1.apply(Launch.scala:45)
	at xsbt.boot.Launch$.launch(Launch.scala:60)
	at xsbt.boot.Launch$.apply(Launch.scala:16)
	at xsbt.boot.Boot$.runImpl(Boot.scala:31)
	at xsbt.boot.Boot$.main(Boot.scala:20)
	at xsbt.boot.Boot.main(Boot.scala)
[error] scala.tools.nsc.MissingRequirementError: object scala not found.
[error] Use 'last' for the full log.

And my sbt file is like below (my sbt launcher is "sbt-launch-0.12.4.jar" in the same folder):

#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This script launches sbt for this project. If present it uses the system
# version of sbt. If there is no system version of sbt it attempts to download
# sbt locally.
SBT_VERSION=`awk -F "=" '/sbt\.version/ {print $2}' ./project/build.properties`
URL1=http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
URL2=http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
JAR=sbt/sbt-launch-${SBT_VERSION}.jar

# Download sbt launch jar if it hasn't been downloaded yet
if [ ! -f ${JAR} ]; then
  # Download
  printf "Attempting to fetch sbt\n"
  if hash curl 2>/dev/null; then
    curl --progress-bar ${URL1} > ${JAR} || curl --progress-bar ${URL2} > ${JAR}
  elif hash wget 2>/dev/null; then
    wget --progress=bar ${URL1} -O ${JAR} || wget --progress=bar ${URL2} -O ${JAR}
  else
    printf "You do not have curl or wget installed, please install sbt manually from http://www.scala-sbt.org/\n"
    exit -1
  fi
fi
if [ ! -f ${JAR} ]; then
  # We failed to download
  printf "Our at
Re: "sbt/sbt run" command returns a JVM problem
Hi Michael, Thank you very much for your reply. Sorry, I am not very familiar with sbt. Could you tell me where to set the Java options for the sbt fork for my program? I brought up the sbt console and ran

set javaOptions += "-Xmx1G"

in it, but it returned an error:

[error] scala.tools.nsc.MissingRequirementError: object scala not found.
[error] Use 'last' for the full log.

Is this the right way to set the Java options? Thank you very much.
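For reference, the usual place for this setting is the project's build.sbt rather than an interactive `set` in a broken session. A minimal sketch, illustrative and not the poster's actual build file:

// build.sbt - illustrative sketch
// fork := true makes `sbt run` launch the program in a separate JVM,
// so the javaOptions below apply to the forked program, not to sbt itself.
fork := true

javaOptions += "-Xmx1g"

(In .sbt files of that era, settings had to be separated by blank lines, as above.)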
Re: "sbt/sbt run" command returns a JVM problem
Hi, thanks for all your help. I tried your setting in the sbt file, but the problem is still there. The Java setting in my sbt file is:

java \
  -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
  -jar ${JAR} \
  "$@"

I have tried setting these 3 parameters bigger and smaller, but nothing works. Did I change the right thing? Thank you very much.
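As a sanity check on these numbers (my arithmetic, not from the thread): 1200m of heap plus 350m of PermGen plus 256m of code cache is 1806 MB, roughly 1.8G, that the JVM tries to reserve up front, before the OS, the desktop environment, and sbt's own JVM are counted. On a 2G machine that is a plausible cause of "Could not reserve enough space for object heap", which is consistent with the question upthread about how much memory is actually free.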
"sbt/sbt run" command returns a JVM problem
Hi, I have a very simple Spark program written in Scala:

/*** testApp.scala ***/
object testApp {
  def main(args: Array[String]) {
    println("Hello! World!")
  }
}

Then I use the following command to compile it:

$ sbt/sbt package

The compilation finished successfully and I got a JAR file. But when I use this command to run it:

$ sbt/sbt run

it returned an error from the JVM:

[info] Error occurred during initialization of VM
[info] Could not reserve enough space for object heap
[error] Error: Could not create the Java Virtual Machine.
[error] Error: A fatal exception has occurred. Program will exit.
java.lang.RuntimeException: Nonzero exit code returned from runner: 1
	at scala.sys.package$.error(package.scala:27)

My machine has 2G of memory and runs Ubuntu 11.04. I also tried to change the Java parameter settings (e.g., -Xmx, -Xms, -XX:MaxPermSize, -XX:ReservedCodeCacheSize) in the file sbt/sbt, but it looks like none of the changes work. Can anyone help me out with this problem? Thank you very much.
RE: Need help about how hadoop works.
Thank you very much Prashant.

Date: Thu, 24 Apr 2014 01:24:39 -0700
From: ml-node+s1001560n4739...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: Need help about how hadoop works.

It is the same file, and the Hadoop library that we use for splitting takes care of assigning the right split to each node.

Prashant Sharma

On Thu, Apr 24, 2014 at 1:36 PM, Carter <[hidden email]> wrote:
> Thank you very much for your help Prashant. Sorry, I still have another question about your answer: "however if the file ("/home/scalatest.txt") is present on the same path on all systems it will be processed on all nodes." When putting the file at the same path on all nodes, do we simply copy the same file to all nodes, or do we need to split the original file into different parts (each part still with the same file name "scalatest.txt") and copy each part to a different node for parallelization? Thank you very much.
Re: Need help about how hadoop works.
Thank you very much for your help Prashant. Sorry, I still have another question about your answer: "however if the file ("/home/scalatest.txt") is present on the same path on all systems it will be processed on all nodes." When putting the file at the same path on all nodes, do we simply copy the same file to all nodes, or do we need to split the original file into different parts (each part still with the same file name "scalatest.txt") and copy each part to a different node for parallelization? Thank you very much.
Re: Need help about how hadoop works.
Thanks Mayur. So without Hadoop or any other distributed file system, by running:

val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count

we can only get parallelization within the computer where the file is loaded, but not parallelization across the computers in the cluster (Spark cannot automatically duplicate the file to the other computers in the cluster) - is this understanding correct? Thank you.
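To make the distinction concrete, here is a small sketch; the paths and partition counts are illustrative assumptions, not from the thread:

// Without a distributed filesystem: every node that runs a task must be
// able to read this exact path, so the file has to exist on each worker
// (or the job runs only where the file is available).
val localDoc = sc.textFile("/home/scalatest.txt", 5) // 5 = minimum partitions
println(localDoc.count())

// With HDFS (hypothetical path): the blocks are already spread across the
// cluster and tasks are scheduled near their blocks, so the count is
// distributed automatically.
val hdfsDoc = sc.textFile("hdfs:///user/carter/scalatest.txt", 5)
println(hdfsDoc.count())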
Need help about how hadoop works.
Hi, I am a beginner with Hadoop and Spark, and want some help in understanding how Hadoop works. Suppose we have a cluster of 5 computers and install Spark on the cluster WITHOUT Hadoop, and then we run this code on one computer:

val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count

Can the "count" task be distributed to all 5 computers? Or is it only run by 5 parallel threads on the current computer? On the other hand, if we install Hadoop on the cluster and upload the data into HDFS, when running the same code will this "count" task be done by 25 threads? Thank you very much for your help.