Re: Advice on Scaling RandomForest
I'm using dataframes, types are all doubles and I'm only extracting what I need. The caveat on these is that I am porting an existing system for a client and for their business it's likely to be cheaper to throw hardware (in AWS) at the problem for a couple of hours than re-engineer their algorithms cheers On 7 June 2016 at 21:54, Jörn Franke <jornfra...@gmail.com> wrote: > Before hardware optimization there is always software optimization. > Are you using dataset / dataframe? Are you using the right data types ( > eg int where int is appropriate , try to avoid string and char etc) > Do you extract only the stuff needed? What are the algorithm parameters? > > > On 07 Jun 2016, at 13:09, Franc Carter <franc.car...@gmail.com> wrote: > > > > > > Hi, > > > > I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and > am interested in how it might be best to scale it - e.g. more CPUs per > instance, more memory per instance, more instances etc. > > > > I'm currently using 32 m3.xlarge instances for a training set with > 2.5 million rows, 1300 columns and a total size of 31GB (parquet) > > > > thanks > > > > -- > > Franc > -- Franc
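Jörn's point about data types can be seen outside Spark as well. A quick plain-Python check (not Spark-specific, just an illustration of the general overhead) of keeping a numeric value as a string rather than a number:

```python
import sys

# Per-object sizes in CPython: a numeric value stored as its string
# representation costs noticeably more memory than the number itself.
# The same idea applies (often more dramatically) to JVM-side storage
# in Spark, which is why "use int where int is appropriate" matters.
as_double = 1234567.0
as_string = "1234567.0"
print(sys.getsizeof(as_double))
print(sys.getsizeof(as_string))
```

Multiply that per-value gap by 2.5 million rows and 1300 columns and the difference becomes material to cluster sizing.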
Advice on Scaling RandomForest
Hi, I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and am interested in how it might be best to scale it - e.g. more CPUs per instance, more memory per instance, more instances etc. I'm currently using 32 m3.xlarge instances for a training set with 2.5 million rows, 1300 columns and a total size of 31GB (parquet) thanks -- Franc
Re: installing packages with pyspark
Thanks - I'll give that a try cheers On 20 March 2016 at 09:42, Felix Cheung <felixcheun...@hotmail.com> wrote: > You are running pyspark in Spark client deploy mode. I have run into the > same error as well and I'm not sure if this is graphframes specific - the > python process can't find the graphframes Python code when it is loaded as > a Spark package. > > To work around this, I extract the graphframes Python directory locally > where I run pyspark into a directory called graphframes. > > > > > > > On Thu, Mar 17, 2016 at 10:11 PM -0700, "Franc Carter" < > franc.car...@gmail.com> wrote: > > > I'm having trouble with that for pyspark, yarn and graphframes. I'm using:- > > pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5 > > which starts and gives me a REPL, but when I try > >from graphframes import * > > I get > > No module named graphframes > > without '--master yarn' it works as expected > > thanks > > > On 18 March 2016 at 12:59, Felix Cheung <felixcheun...@hotmail.com> wrote: > > For some, like graphframes that are Spark packages, you could also use > --packages in the command line of spark-submit or pyspark. See > http://spark.apache.org/docs/latest/submitting-applications.html > > _ > From: Jakob Odersky <ja...@odersky.com> > Sent: Thursday, March 17, 2016 6:40 PM > Subject: Re: installing packages with pyspark > To: Ajinkya Kale <kaleajin...@gmail.com> > Cc: <user@spark.apache.org> > > > > Hi, > regarding 1, packages are resolved locally. That means that when you > specify a package, spark-submit will resolve the dependencies and > download any jars on the local machine, before shipping* them to the > cluster. So, without a priori knowledge of dataproc clusters, it > should be no different to specify packages. > > Unfortunately I can't help with 2. 
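Felix's workaround boils down to making the package's Python code importable by the driver process. The mechanism can be sketched in plain Python - the package name below is a throwaway stand-in, not the real graphframes layout:

```python
import os
import sys
import tempfile

# Create a throwaway package directory to stand in for the extracted
# graphframes Python code (hypothetical name, for illustration only).
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "fake_graphframes")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("LOADED = True\n")

# Putting the parent directory on sys.path is what makes the import
# succeed; launching pyspark from the directory containing the extracted
# package has the same effect, because the current working directory is
# on sys.path by default.
sys.path.insert(0, tmp)
import fake_graphframes
print(fake_graphframes.LOADED)
```

Under `--master yarn`, the sticking point is that the Python side of the package also has to be visible to the driver/executor Python processes, which is why extracting it locally helps.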
> > --Jakob > > *shipping in this case means making them available via the network > > On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale <kaleajin...@gmail.com> > wrote: > > Hi all, > > > > I had couple of questions. > > 1. Is there documentation on how to add the graphframes or any other > package > > for that matter on the google dataproc managed spark clusters ? > > > > 2. Is there a way to add a package to an existing pyspark context > through a > > jupyter notebook ? > > > > --aj > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > > > > > > -- > Franc > -- Franc
Re: installing packages with pyspark
I'm having trouble with that for pyspark, yarn and graphframes. I'm using:- pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5 which starts and gives me a REPL, but when I try from graphframes import * I get No module named graphframes without '--master yarn' it works as expected thanks On 18 March 2016 at 12:59, Felix Cheung wrote: > For some, like graphframes that are Spark packages, you could also use > --packages in the command line of spark-submit or pyspark. See > http://spark.apache.org/docs/latest/submitting-applications.html > > _ > From: Jakob Odersky > Sent: Thursday, March 17, 2016 6:40 PM > Subject: Re: installing packages with pyspark > To: Ajinkya Kale > Cc: > > > > Hi, > regarding 1, packages are resolved locally. That means that when you > specify a package, spark-submit will resolve the dependencies and > download any jars on the local machine, before shipping* them to the > cluster. So, without a priori knowledge of dataproc clusters, it > should be no different to specify packages. > > Unfortunately I can't help with 2. > > --Jakob > > *shipping in this case means making them available via the network > > On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale > wrote: > > Hi all, > > > > I had a couple of questions. > > 1. Is there documentation on how to add the graphframes or any other > package > > for that matter on the google dataproc managed spark clusters ? > > > > 2. Is there a way to add a package to an existing pyspark context > through a > > jupyter notebook ? > > > > --aj > > -- Franc
Re: filter by dict() key in pySpark
A colleague found how to do this, the approach was to use a udf() cheers On 21 February 2016 at 22:41, Franc Carter <franc.car...@gmail.com> wrote: > > I have a DataFrame that has a Python dict() as one of the columns. I'd > like to filter the DataFrame for those Rows where the dict() contains a > specific value. e.g. something like this:- > > DF2 = DF1.filter('name' in DF1.params) > > but that gives me this error > > ValueError: Cannot convert column into bool: please use '&' for 'and', '|' > for 'or', '~' for 'not' when building DataFrame boolean expressions. > > How do I express this correctly ? > > thanks > > -- > Franc > -- Franc
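Outside Spark, the predicate that the udf() ends up wrapping is just a per-row membership test. A plain-Python sketch of the same filter (the column and key names here are illustrative):

```python
# Each row carries a dict column 'params'; keep rows whose dict has the key.
rows = [
    {"id": 1, "params": {"name": "a", "size": 3}},
    {"id": 2, "params": {"size": 7}},
    {"id": 3, "params": {"name": "b"}},
]

def has_key(params, key="name"):
    # This membership test is the logic a pyspark udf() returning a
    # boolean would wrap; the built-in 'in' cannot be used directly on a
    # Column, which is what triggers the ValueError above.
    return key in params

filtered = [r for r in rows if has_key(r["params"])]
print([r["id"] for r in filtered])
```

The reason `'name' in DF1.params` fails is that Python's `in` operator forces the Column into a boolean immediately, before Spark can build an expression from it - hence the suggestion to push the test into a udf.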
filter by dict() key in pySpark
I have a DataFrame that has a Python dict() as one of the columns. I'd like to filter the DataFrame for those Rows where the dict() contains a specific value. e.g. something like this:- DF2 = DF1.filter('name' in DF1.params) but that gives me this error ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. How do I express this correctly ? thanks -- Franc
Re: sparkR not able to create /append new columns
Yes, I didn't work out how to solve that - sorry On 3 February 2016 at 22:37, Devesh Raj Singh <raj.deves...@gmail.com> wrote: > Hi, > > but "withColumn" will only add once, if i want to add columns to the same > dataframe in a loop it will keep overwriting the added column and in the > end the last added column( in the loop) will be the added column. like in > my code above. > > On Wed, Feb 3, 2016 at 5:05 PM, Franc Carter <franc.car...@gmail.com> > wrote: > >> >> I had problems doing this as well - I ended up using 'withColumn', it's >> not particularly graceful but it worked (1.5.2 on AWS EMR) >> >> cheers >> >> On 3 February 2016 at 22:06, Devesh Raj Singh <raj.deves...@gmail.com> >> wrote: >> >>> Hi, >>> >>> i am trying to create dummy variables in sparkR by creating new columns >>> for categorical variables. But it is not appending the columns >>> >>> >>> df <- createDataFrame(sqlContext, iris) >>> class(dtypes(df)) >>> >>> cat.column<-vector(mode="character",length=nrow(df)) >>> cat.column<-collect(select(df,df$Species)) >>> lev<-length(levels(as.factor(unlist(cat.column)))) >>> varb.names<-vector(mode="character",length=lev) >>> for (i in 1:lev){ >>> >>> varb.names[i]<-paste0(colnames(cat.column),i) >>> >>> } >>> >>> for (j in 1:lev) >>> >>> { >>> >>> dummy.df.new<-withColumn(df,paste0(colnames(cat.column),j),ifelse(df$Species==levels(as.factor(unlist(cat.column)))[j],1,0)) >>> >>> } >>> >>> I am getting the below output for >>> >>> head(dummy.df.new) >>> >>> output: >>> >>> Sepal_Length Sepal_Width Petal_Length Petal_Width Species Species1 >>> 1 5.1 3.5 1.4 0.2 setosa 1 >>> 2 4.9 3.0 1.4 0.2 setosa 1 >>> 3 4.7 3.2 1.3 0.2 setosa 1 >>> 4 4.6 3.1 1.5 0.2 setosa 1 >>> 5 5.0 3.6 1.4 0.2 setosa 1 >>> 6 5.4 3.9 1.7 0.4 setosa 1 >>> >>> Problem: Species2 and Species3 column are not getting added to the >>> dataframe >>> >>> -- >>> Warm regards, >>> Devesh. >>> >> >> >> >> -- >> Franc >> > > > > -- > Warm regards, > Devesh. > -- Franc
Re: sparkR not able to create /append new columns
I had problems doing this as well - I ended up using 'withColumn', it's not particularly graceful but it worked (1.5.2 on AWS EMR) cheers On 3 February 2016 at 22:06, Devesh Raj Singh wrote: > Hi, > > i am trying to create dummy variables in sparkR by creating new columns > for categorical variables. But it is not appending the columns > > > df <- createDataFrame(sqlContext, iris) > class(dtypes(df)) > > cat.column<-vector(mode="character",length=nrow(df)) > cat.column<-collect(select(df,df$Species)) > lev<-length(levels(as.factor(unlist(cat.column)))) > varb.names<-vector(mode="character",length=lev) > for (i in 1:lev){ > > varb.names[i]<-paste0(colnames(cat.column),i) > > } > > for (j in 1:lev) > > { > > dummy.df.new<-withColumn(df,paste0(colnames(cat.column),j),ifelse(df$Species==levels(as.factor(unlist(cat.column)))[j],1,0)) > > } > > I am getting the below output for > > head(dummy.df.new) > > output: > > Sepal_Length Sepal_Width Petal_Length Petal_Width Species Species1 > 1 5.1 3.5 1.4 0.2 setosa 1 > 2 4.9 3.0 1.4 0.2 setosa 1 > 3 4.7 3.2 1.3 0.2 setosa 1 > 4 4.6 3.1 1.5 0.2 setosa 1 > 5 5.0 3.6 1.4 0.2 setosa 1 > 6 5.4 3.9 1.7 0.4 setosa 1 > > Problem: Species2 and Species3 column are not getting added to the > dataframe > > -- > Warm regards, > Devesh. > -- Franc
Re: pyspark: calculating row deltas
Sure, for a dataframe that looks like this ID Year Value 1 2012 100 1 2013 102 1 2014 106 2 2012 110 2 2013 118 2 2014 128 I'd like to get back ID Year Value 1 2013 2 1 2014 4 2 2013 8 2 2014 10 i.e. the Value for an ID,Year combination is the Value for the ID,Year minus the Value for the ID,Year-1 thanks On 10 January 2016 at 20:51, Femi Anthony <femib...@gmail.com> wrote: > Can you clarify what you mean with an actual example ? > > For example, if your data frame looks like this: > > ID Year Value > 1 2012 100 > 2 2013 101 > 3 2014 102 > > What's your desired output ? > > Femi > > > On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter <franc.car...@gmail.com> > wrote: > >> >> Hi, >> >> I have a DataFrame with the columns >> >> ID,Year,Value >> >> I'd like to create a new Column that is Value2-Value1 where the >> corresponding Year2=Year-1 >> >> At the moment I am creating a new DataFrame with renamed columns and >> doing >> >>DF.join(DF2, . . . .) >> >> This looks cumbersome to me, is there a better way ? >> >> thanks >> >> >> -- >> Franc >> > > > > -- > http://www.femibyte.com/twiki5/bin/view/Tech/ > http://www.nextmatrix.com > "Great spirits have always encountered violent opposition from mediocre > minds." - Albert Einstein. > -- Franc
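For reference, the transformation being asked for - each row's Value minus the previous year's Value for the same ID, which is what a lag() over a window partitioned by ID and ordered by Year computes in Spark SQL - can be sketched in plain Python:

```python
# (ID, Year, Value) rows, matching the example in the thread.
data = [
    (1, 2012, 100), (1, 2013, 102), (1, 2014, 106),
    (2, 2012, 110), (2, 2013, 118), (2, 2014, 128),
]

# Index values by (ID, Year), then subtract the prior year's value where
# one exists -- the same result the self-join DF.join(DF2, ...) produces,
# without building a second renamed DataFrame.
by_key = {(i, y): v for i, y, v in data}
deltas = [
    (i, y, v - by_key[(i, y - 1)])
    for i, y, v in data
    if (i, y - 1) in by_key
]
print(deltas)
```

In pyspark the window-function route avoids the self-join entirely, which is the approach suggested later in the thread.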
Re: pyspark: calculating row deltas
Thanks cheers On 10 January 2016 at 22:35, Blaž Šnuderl <snud...@gmail.com> wrote: > This can be done using spark.sql and window functions. Take a look at > https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html > > On Sun, Jan 10, 2016 at 11:07 AM, Franc Carter <franc.car...@gmail.com> > wrote: > >> >> Sure, for a dataframe that looks like this >> >> ID Year Value >> 1 2012 100 >> 1 2013 102 >> 1 2014 106 >> 2 2012 110 >> 2 2013 118 >> 2 2014 128 >> >> I'd like to get back >> >> ID Year Value >> 1 2013 2 >> 1 2014 4 >> 2 2013 8 >> 2 2014 10 >> >> i.e. the Value for an ID,Year combination is the Value for the ID,Year >> minus the Value for the ID,Year-1 >> >> thanks >> >> >> >> >> >> >> On 10 January 2016 at 20:51, Femi Anthony <femib...@gmail.com> wrote: >> >>> Can you clarify what you mean with an actual example ? >>> >>> For example, if your data frame looks like this: >>> >>> ID Year Value >>> 1 2012 100 >>> 2 2013 101 >>> 3 2014 102 >>> >>> What's your desired output ? >>> >>> Femi >>> >>> >>> On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter <franc.car...@gmail.com> >>> wrote: >>> >>>> >>>> Hi, >>>> >>>> I have a DataFrame with the columns >>>> >>>> ID,Year,Value >>>> >>>> I'd like to create a new Column that is Value2-Value1 where the >>>> corresponding Year2=Year-1 >>>> >>>> At the moment I am creating a new DataFrame with renamed columns and >>>> doing >>>> >>>>DF.join(DF2, . . . .) >>>> >>>> This looks cumbersome to me, is there a better way ? >>>> >>>> thanks >>>> >>>> >>>> -- >>>> Franc >>>> >>> >>> >>> >>> -- >>> http://www.femibyte.com/twiki5/bin/view/Tech/ >>> http://www.nextmatrix.com >>> "Great spirits have always encountered violent opposition from mediocre >>> minds." - Albert Einstein. >>> >> >> >> >> -- >> Franc >> > > -- Franc
Re: pyspark: conditionals inside functions
My Python is not particularly good, so I'm afraid I don't understand what that means cheers On 9 January 2016 at 14:45, Franc Carter <franc.car...@gmail.com> wrote: > > Hi, > > I'm trying to write a short function that returns the last Sunday of the > week of a given date, code below > > def getSunday(day): > > day = day.cast("date") > > sun = next_day(day, "Sunday") > > n = datediff(sun,day) > > if (n == 7): > > return day > > else: > > return sun > > > this gives me > > ValueError: Cannot convert column into bool: > > > Can someone point out what I am doing wrong > > thanks > > > -- > Franc > -- Franc
pyspark: calculating row deltas
Hi, I have a DataFrame with the columns ID,Year,Value I'd like to create a new Column that is Value2-Value1 where the corresponding Year2=Year-1 At the moment I am creating a new DataFrame with renamed columns and doing DF.join(DF2, . . . .) This looks cumbersome to me, is there a better way ? thanks -- Franc
Re: pyspark: conditionals inside functions
Got it, I needed to use the when/otherwise construct - code below def getSunday(day): day = day.cast("date") sun = next_day(day, "Sunday") n = datediff(sun,day) x = when(n==7,day).otherwise(sun) return x On 10 January 2016 at 08:41, Franc Carter <franc.car...@gmail.com> wrote: > > My Python is not particularly good, so I'm afraid I don't understand what > that means > > cheers > > > On 9 January 2016 at 14:45, Franc Carter <franc.car...@gmail.com> wrote: > >> >> Hi, >> >> I'm trying to write a short function that returns the last Sunday of the >> week of a given date, code below >> >> def getSunday(day): >> >> day = day.cast("date") >> >> sun = next_day(day, "Sunday") >> >> n = datediff(sun,day) >> >> if (n == 7): >> >> return day >> >> else: >> >> return sun >> >> >> this gives me >> >> ValueError: Cannot convert column into bool: >> >> >> Can someone point out what I am doing wrong >> >> thanks >> >> >> -- >> Franc >> > > > > -- > Franc > -- Franc
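The semantics of that when/otherwise version - return the date itself if it is already a Sunday, otherwise the next Sunday - can be checked outside Spark with plain datetime arithmetic:

```python
from datetime import date, timedelta

def get_sunday(d):
    # weekday(): Monday=0 ... Sunday=6. If d is already a Sunday the
    # offset is 0 and d is returned unchanged; otherwise step forward to
    # the coming Sunday. This mirrors when(n==7, day).otherwise(sun).
    return d + timedelta(days=(6 - d.weekday()) % 7)

print(get_sunday(date(2016, 1, 8)))   # a Friday -> the following Sunday
print(get_sunday(date(2016, 1, 10)))  # already a Sunday -> unchanged
```

The original `if (n == 7): ...` failed because `n` is a Column, not a plain integer, so Python cannot evaluate it as a boolean at function-definition time; when/otherwise builds the conditional into the Spark expression instead.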
pyspark: conditionals inside functions
Hi, I'm trying to write a short function that returns the last Sunday of the week of a given date, code below def getSunday(day): day = day.cast("date") sun = next_day(day, "Sunday") n = datediff(sun,day) if (n == 7): return day else: return sun this gives me ValueError: Cannot convert column into bool: Can someone point out what I am doing wrong thanks -- Franc
number of executors in sparkR.init()
Hi, I'm having trouble working out how to get the number of executors set when using sparkR.init(). If I start sparkR with sparkR --master yarn --num-executors 6 then I get 6 executors However, if I start sparkR with sparkR followed by sc <- sparkR.init(master="yarn-client", sparkEnvir=list(spark.num.executors='6')) then I only get 2 executors. Can anyone point me in the direction of what I might be doing wrong ? I need to initialise it this way so that RStudio can hook in to SparkR thanks -- Franc
Re: number of executors in sparkR.init()
Thanks, that works cheers On 26 December 2015 at 16:53, Felix Cheung <felixcheun...@hotmail.com> wrote: > The equivalent for spark-submit --num-executors should be > spark.executor.instances > when used in SparkConf. > http://spark.apache.org/docs/latest/running-on-yarn.html > > Could you try setting that with sparkR.init()? > > > _____ > From: Franc Carter <franc.car...@gmail.com> > Sent: Friday, December 25, 2015 9:23 PM > Subject: number of executors in sparkR.init() > To: <user@spark.apache.org> > > > > Hi, > > I'm having trouble working out how to get the number of executors set when > using sparkR.init(). > > If I start sparkR with > > sparkR --master yarn --num-executors 6 > > then I get 6 executors > > However, if I start sparkR with > > sparkR > > followed by > > sc <- sparkR.init(master="yarn-client", > sparkEnvir=list(spark.num.executors='6')) > > then I only get 2 executors. > > Can anyone point me in the direction of what I might be doing wrong ? I need > to initialise it this way so that RStudio can hook in to SparkR > > thanks > > -- > Franc > > > -- Franc
Re: SparkR csv without headers
Thanks - works nicely cheers On Fri, Aug 21, 2015 at 12:43 PM, Sun, Rui <rui@intel.com> wrote: Hi, You can create a DataFrame using load.df() with a specified schema. Something like: schema <- structType(structField("a", "string"), structField("b", "integer"), ...) read.df(..., schema = schema) *From:* Franc Carter [mailto:franc.car...@rozettatech.com] *Sent:* Wednesday, August 19, 2015 1:48 PM *To:* user@spark.apache.org *Subject:* SparkR csv without headers Hi, Does anyone have an example of how to create a DataFrame in SparkR which specifies the column names - the csv files I have do not have column names in the first row. I can read a csv nicely with com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2, C3 etc thanks -- *Franc Carter* | Systems Architect | RoZetta Technology L4. 55 Harrington Street, THE ROCKS, NSW, 2000 PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA *T* +61 2 8355 2515 | www.rozettatechnology.com DISCLAIMER: The contents of this email, inclusive of attachments, may be legally privileged and confidential. Any unauthorised use of the contents is expressly prohibited.
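The schema-at-read idea translates to other environments too. In plain Python, supplying column names for a header-less CSV (the file contents and names below are illustrative) looks like:

```python
import csv
import io

# A header-less file: the column names must come from the caller, which
# is exactly what passing a schema to read.df achieves in SparkR.
raw = "1,ACME,10.5\n2,ROZT,11.2\n"
names = ["id", "ticker", "price"]  # caller-supplied, like the schema

# Pair each record with the supplied names instead of a header row.
rows = [dict(zip(names, rec)) for rec in csv.reader(io.StringIO(raw))]
print(rows[0]["ticker"])
```

The SparkR schema additionally carries types ("string", "integer", ...), so values arrive typed rather than as strings - one advantage over fixing up C1, C2, C3 names after the fact.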
SparkR csv without headers
Hi, Does anyone have an example of how to create a DataFrame in SparkR which specifies the column names - the csv files I have do not have column names in the first row. I can read a csv nicely with com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2, C3 etc thanks -- *Franc Carter* | Systems Architect | RoZetta Technology L4. 55 Harrington Street, THE ROCKS, NSW, 2000 PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA *T* +61 2 8355 2515 | www.rozettatechnology.com DISCLAIMER: The contents of this email, inclusive of attachments, may be legally privileged and confidential. Any unauthorised use of the contents is expressly prohibited.
Column operation on Spark RDDs.
Hi, I have an RDD with *MANY *columns (e.g., *hundreds*), and most of my operation is on columns, e.g., I need to create many intermediate variables from different columns, what is the most efficient way to do this? For example, if my dataRDD[Array[String]] is like below: 123, 523, 534, ..., 893 536, 98, 1623, ..., 98472 537, 89, 83640, ..., 9265 7297, 98364, 9, ..., 735 .. 29, 94, 956, ..., 758 I will need to create a new column or a variable as newCol1 = 2ndCol+19thCol, and another new column based on newCol1 and the existing columns: newCol2 = function(newCol1, 34thCol), what is the best way of doing this? I have been thinking of using an index for the intermediate variables and the dataRDD, and then joining them together on the index to do my calculation: var dataRDD = sc.textFile("/test.csv").map(_.split(",")) val dt = dataRDD.zipWithIndex.map(_.swap) val newCol1 = dataRDD.map(x => x(1)+x(18)).zipWithIndex.map(_.swap) val newCol2 = newCol1.join(dt).map(x => function(...)) Is there a better way of doing this? Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Column-operation-on-Spark-RDDs-tp23165.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
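The zipWithIndex/join round trip can usually be avoided: both derived values can be computed inside a single map over each row. A plain-Python sketch of that one-pass version (small rows and low column indices stand in for the hundreds of columns, and `function` is a stand-in for whatever combination is actually needed):

```python
rows = [
    [123, 523, 534, 893],
    [536, 98, 1623, 98472],
]

def function(a, b):
    # Hypothetical stand-in for the real combination of newCol1 and
    # another column.
    return a + b

# One pass per row: newCol1 from two existing columns, then newCol2 from
# newCol1 and a third column -- no index/join needed, because both
# intermediates live in the same row.
def derive(row):
    new_col1 = row[1] + row[2]
    new_col2 = function(new_col1, row[3])
    return row + [new_col1, new_col2]

print([derive(r) for r in rows])
```

In Spark the same shape is a single `dataRDD.map(derive)`, which keeps everything row-local and avoids the shuffle that `join` forces.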
Re: How to add a column to a spark RDD with many columns?
Thanks for your reply! It is what I am after. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22740.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
How to add a column to a spark RDD with many columns?
Hi all, I have an RDD with *MANY *columns (e.g., *hundreds*), how do I add one more column at the end of this RDD? For example, if my RDD is like below: 123, 523, 534, ..., 893 536, 98, 1623, ..., 98472 537, 89, 83640, ..., 9265 7297, 98364, 9, ..., 735 .. 29, 94, 956, ..., 758 how can I efficiently add a column to it, whose value is the sum of the 2nd and the 200th columns? Thank you very much. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
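Appending a derived column is a single map over the rows. A plain-Python sketch of the logic (short rows stand in for the hundreds of columns, and the chosen indices are illustrative):

```python
# Rows arrive as strings, as they would from splitting a CSV line.
rows = [
    ["123", "523", "534", "893"],
    ["536", "98", "1623", "98472"],
]

# Append the sum of two chosen columns to each row. In Spark this is the
# same per-row shape applied inside rdd.map(); no join or shuffle is
# needed because both inputs live in the same row.
i, j = 1, 3
with_sum = [r + [float(r[i]) + float(r[j])] for r in rows]
print(with_sum[0][-1])
```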
Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?
One issue is that 'big' becomes 'not so big' reasonably quickly. A couple of TeraBytes is not that challenging (depending on the algorithm) these days whereas 5 years ago it was a big challenge. We have a bit over a PetaByte (not using Spark) and using a distributed system is the only viable way to get reasonable performance for reasonable cost cheers On Tue, Mar 31, 2015 at 4:55 AM, Steve Loughran ste...@hortonworks.com wrote: On 30 Mar 2015, at 13:27, jay vyas jayunit100.apa...@gmail.com wrote: Just the same as spark was disrupting the hadoop ecosystem by changing the assumption that you can't rely on memory in distributed analytics...now maybe we are challenging the assumption that big data analytics need to be distributed? I've been asking the same question lately and seen similarly that spark performs quite reliably and well on a local single node system, even for a streaming app which I ran for ten days in a row... I almost felt guilty that I never put it on a cluster! Modern machines can be pretty powerful: 16 physical cores HT'd to 32, 384+GB, GPU, giving you lots of compute. What you don't get is the storage capacity to match, and especially, the IO bandwidth. RAID-0 striping 2-4 HDDs gives you some boost, but if you are reading, say, a 4 GB file from HDFS broken into 256MB blocks, you have that data replicated into (4*4*3) blocks: 48. Algorithm and capacity permitting, you've just massively boosted your load time. Downstream, if data can be thinned down, then you can start looking more at things you can do on a single host : a machine that can be in your Hadoop cluster. Ask YARN nicely and you can get a dedicated machine for a couple of days (i.e. until your Kerberos tokens expire). -- *Franc Carter* | Systems Architect | RoZetta Technology L4. 55 Harrington Street, THE ROCKS, NSW, 2000 PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA *T* +61 2 8355 2515 | www.rozettatechnology.com DISCLAIMER: The contents of this email, inclusive of attachments, may be legally privileged and confidential. Any unauthorised use of the contents is expressly prohibited.
Re: FW: Submitting jobs to Spark EC2 cluster remotely
/errors: WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0 15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1 ...and failed drivers - in Spark Web UI Completed Drivers with State=ERROR appear. I've tried to pass limits for cores and memory to the submit script but it didn't help... **`--deploy-mode=cluster`:** From my laptop: ./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar The result is: Driver successfully submitted as driver-20150223023734-0007 ... waiting before polling master for driver state ... polling master for driver state State of driver-20150223023734-0007 is ERROR Exception from cluster was: java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist. java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150) at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75) So, I'd appreciate any pointers on what is going wrong and some guidance on how to deploy jobs from a remote client. Thanks. 
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs-to-Spark-EC2-cluster-remotely-tp21762.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- *Franc Carter* | Systems Architect | RoZetta Technology L4. 55 Harrington Street, THE ROCKS, NSW, 2000 PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA *T* +61 2 8355 2515 | www.rozettatechnology.com DISCLAIMER: The contents of this email, inclusive of attachments, may be legally privileged and confidential. Any unauthorised use of the contents is expressly prohibited.
Re: spark, reading from s3
Check that your timezone is correct as well, an incorrect timezone can make it look like your time is correct when it is skewed. cheers On Fri, Feb 13, 2015 at 5:51 AM, Kane Kim kane.ist...@gmail.com wrote: The thing is that my time is perfectly valid... On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It's with the timezone actually, you can either use NTP to maintain an accurate system clock or you can adjust your system time to match the AWS one. You can do it as: telnet s3.amazonaws.com 80 GET / HTTP/1.0 Thanks Best Regards On Wed, Feb 11, 2015 at 6:43 AM, Kane Kim kane.ist...@gmail.com wrote: I'm getting this warning when using s3 input: 15/02/11 00:58:37 WARN RestStorageService: Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 0 seconds. Retrying connection. After that there are tons of 403/forbidden errors and then the job fails. It's sporadic, so sometimes I get this error and sometimes not, what could be the issue? I think it could be related to network connectivity? -- *Franc Carter* | Systems Architect | RoZetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
How to sum up the values in the columns of a dataset in Scala?
I am new to Scala. I have a dataset with many columns, each column has a column name. Given several column names (these column names are not fixed, they are generated dynamically), I need to sum up the values of these columns. Is there an efficient way of doing this? I worked out a way using a for loop, but I don't think it is efficient: val AllLabels = List("ID", "val1", "val2", "val3", "val4") val lbla = List("val1", "val3", "val4") val index_lbla = lbla.map(x => AllLabels.indexOf(x)) val dataRDD = sc.textFile("../test.csv").map(_.split(",")) dataRDD.map(x => { var sum = 0.0 for (i <- index_lbla) sum = sum + x(i).toDouble sum }).collect The test.csv looks like below (without column names): ID, val1, val2, val3, val4 A, 123, 523, 534, 893 B, 536, 98, 1623, 98472 C, 537, 89, 83640, 9265 D, 7297, 98364, 9, 735 ... Your help is very much appreciated! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-sum-up-the-values-in-the-columns-of-a-dataset-in-Scala-tp21639.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
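The index-lookup approach is sound once the loop iterates over the looked-up indices themselves (rather than 1 to the length of the index list). Here is the same logic in plain Python for illustration, using the thread's sample data:

```python
all_labels = ["ID", "val1", "val2", "val3", "val4"]
wanted = ["val1", "val3", "val4"]  # generated dynamically in practice
idx = [all_labels.index(c) for c in wanted]

rows = [
    ["A", "123", "523", "534", "893"],
    ["B", "536", "98", "1623", "98472"],
]

# Sum only the selected columns -- iterate over the looked-up indices,
# not over a 1..len(idx) range, or the wrong columns get summed.
sums = [sum(float(r[i]) for i in idx) for r in rows]
print(sums)
```

This is a single pass per row with no intermediate collections, so the per-row cost is proportional to the number of selected columns - about as efficient as this operation gets.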
Re: Datastore HDFS vs Cassandra
One additional comment I would make is that you should be careful with Updates in Cassandra, it does support them but large amounts of Updates (i.e. changing existing keys) tend to cause fragmentation. If you are (mostly) adding new keys (e.g. new records in the time series) then Cassandra can be excellent cheers On Wed, Feb 11, 2015 at 6:13 PM, Paolo Platter paolo.plat...@agilelab.it wrote: Hi Mike, I developed a solution with Cassandra and Spark, using DSE. The main difficulty is with Cassandra: you need to understand its data model and its query patterns very well. Cassandra has better performance than HDFS and it has DR and stronger availability. HDFS is a filesystem, Cassandra is a DBMS. Cassandra supports full CRUD without ACID. HDFS is more flexible than Cassandra. In my opinion, if you have a real time series, go with Cassandra, paying attention to your reporting data access patterns. Paolo Sent from my Windows Phone -- From: Mike Trienis mike.trie...@orcsol.com Sent: 11/02/2015 05:59 To: user@spark.apache.org Subject: Datastore HDFS vs Cassandra Hi, I am considering implementing Apache Spark on top of the Cassandra database after listening to a related talk and reading through the slides from DataStax. It seems to fit well with our time-series data and reporting requirements. http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data Does anyone have any experience using Apache Spark and Cassandra, including limitations (and/or) technical difficulties? How does Cassandra compare with HDFS and what use cases would make HDFS more suitable? Thanks, Mike. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Datastore-HDFS-vs-Cassandra-tp21590.html Sent from the Apache Spark User List mailing list archive at Nabble.com. 
--
*Franc Carter* | Systems Architect | Rozetta Technology
franc.car...@rozettatech.com | www.rozettatechnology.com
Tel: +61 2 8355 2515
Level 4, 55 Harrington St, The Rocks NSW 2000
PO Box H58, Australia Square, Sydney NSW 1215, AUSTRALIA
Re: Datastore HDFS vs Cassandra
I forgot to mention that if you do decide to use Cassandra, I'd highly recommend jumping on the Cassandra mailing list; if we had taken in some of the advice on that list, things would have been considerably smoother.

cheers

On Wed, Feb 11, 2015 at 8:12 PM, Christian Betz christian.b...@performance-media.de wrote:

Hi,

Regarding the Cassandra data model, there's an excellent post on the eBay tech blog: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/. There's also a slideshare for this somewhere.

Happy hacking, Chris

From: Franc Carter franc.car...@rozettatech.com
Date: Wednesday, 11 February 2015 10:03
To: Paolo Platter paolo.plat...@agilelab.it
Cc: Mike Trienis mike.trie...@orcsol.com, user@spark.apache.org
Subject: Re: Datastore HDFS vs Cassandra

[earlier quoted messages trimmed; see the previous posts in this thread]
Re: No AMI for Spark 1.2 using ec2 scripts
AMIs are specific to an AWS region, so the ami-id of the Spark AMI in us-west will be different, if it exists at all. I can't remember where, but I have a memory of seeing somewhere that the AMI was only in us-east.

cheers

On Mon, Jan 26, 2015 at 8:47 PM, Håkan Jonsson haj...@gmail.com wrote:

Thanks, I also use Spark 1.2 prebuilt for Hadoop 2.4. I launch both 1.1 and 1.2 with the same command:

./spark-ec2 -k foo -i bar.pem launch mycluster

By default this launches in us-east-1. I tried changing the region using -r us-west-1, but that had the same result:

Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm

Looking up https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm in a browser results in the same AMI ID as yours. If I search for the ami-7a320f3f AMI in AWS, I can't find any such image. I tried searching in all regions I could find in the AWS console. The AMI for 1.1 is spark.ami.pvm.v9 (ami-5bb18832). I can find that AMI in us-west-1. Strange. Not sure what to do.

/Håkan

On Mon Jan 26 2015 at 9:02:42 AM Charles Feduke charles.fed...@gmail.com wrote:

I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts. I downloaded Spark 1.2 prebuilt for "Hadoop 2.4 and later". What parameters are you using when you execute spark-ec2? I am launching in the us-west-1 region (ami-7a320f3f), which may explain things.

On Mon Jan 26 2015 at 2:40:01 AM hajons haj...@gmail.com wrote:

Hi,

When I try to launch a standalone cluster on EC2 using the scripts in the ec2 directory for Spark 1.2, I get the following error:

Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm

It seems there is not yet any AMI available on EC2. Any ideas when there will be one? This works without problems for version 1.1. Starting up a cluster using these scripts is so simple and straightforward, so I am really missing it on 1.2.
/Håkan

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-AMI-for-Spark-1-2-using-ec2-scripts-tp21362.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Does DecisionTree model in MLlib deal with missing values?
Hi, I am new to MLlib in Spark. Can the DecisionTree model in MLlib deal with missing values? If so, what data structure should I use for the input? Moreover, my data has categorical features, but LabeledPoint requires the double data type; in this case, what can I do? Thank you very much.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-DecisionTree-model-in-MLlib-deal-with-missing-values-tp21080.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
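For the categorical-features part of the question, MLlib's DecisionTree expects each category to be encoded as a double (0.0, 1.0, 2.0, ...) and the arity of each categorical feature declared via categoricalFeaturesInfo. A minimal sketch, where the data, feature indices, and arities are made up for illustration:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Hypothetical data: feature 0 is categorical with 3 values {0.0, 1.0, 2.0},
// feature 1 is continuous. Categories must be encoded as doubles 0..k-1.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.5)),
  LabeledPoint(1.0, Vectors.dense(2.0, 0.3))
))

// Map of feature index -> number of categories for that feature.
val categoricalFeaturesInfo = Map(0 -> 3)

val model = DecisionTree.trainClassifier(
  data, numClasses = 2, categoricalFeaturesInfo,
  impurity = "gini", maxDepth = 5, maxBins = 32)
```

With this map in place the tree treats feature 0 as unordered categories rather than as a continuous value.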
Re: Reading from a centralized store
Ah, so it's RDD-specific - that would make sense: for those systems where it is possible to extract sensible subsets, the RDDs do so. My use case, which is probably biasing my thinking, is DynamoDB, which I don't think can efficiently extract records from M-to-N.

cheers

On Wed, Jan 7, 2015 at 6:59 AM, Cody Koeninger c...@koeninger.org wrote:

No, most RDDs partition input data appropriately.

On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter franc.car...@rozettatech.com wrote:

One more question, to clarify: will every node pull in all the data?

thanks

On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger c...@koeninger.org wrote:

If you are not co-locating Spark executor processes on the same machines where the data is stored, and using an RDD that knows which node to prefer scheduling a task on, yes, the data will be pulled over the network. Of the options you listed, S3 and DynamoDB cannot have Spark running on the same machines. Cassandra can be run on the same nodes as Spark, and recent versions of the Spark Cassandra connector implement preferred locations. You can run an RDBMS on the same nodes as Spark, but JdbcRDD doesn't implement preferred locations.

On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter franc.car...@rozettatech.com wrote:

Hi,

I'm trying to understand how a Spark cluster behaves when the data it is processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, RDBMS etc). Does every node in the cluster retrieve all the data from the central store?
thanks
Re: Reading from a centralized store
Thanks, that's what I suspected.

cheers

On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger c...@koeninger.org wrote:

[quoted reply trimmed; see Cody's answer earlier in this thread]
Reading from a centralized store
Hi,

I'm trying to understand how a Spark cluster behaves when the data it is processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, RDBMS etc). Does every node in the cluster retrieve all the data from the central store?

thanks
How to get the help or explanation for the functions in Spark shell?
Hi All, I am new to Spark. In the Spark shell, how can I get help or an explanation for the functions I can use on a variable or RDD? For example, after I input an RDD's name with a dot (.) at the end and press the Tab key, a list of functions that I can use on this RDD is displayed, but I don't know how to use these functions. Your help is greatly appreciated.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How to get the help or explanation for the functions in Spark shell?
Thank you very much Gerard. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191p7193.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
How to compile a Spark project in Scala IDE for Eclipse?
Hi All, I just downloaded the Scala IDE for Eclipse. After I created a Spark project and clicked Run, there was an error on the line "import org.apache.spark.SparkContext": "object apache is not a member of package org". I guess I need to import the Spark dependency into the Scala IDE for Eclipse; can anyone tell me how to do it? Thanks a lot.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
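One common way to pull the Spark dependency into an Eclipse project is to declare it in sbt and then generate the Eclipse project files with the sbteclipse plugin. A hedged sketch of a minimal build.sbt (the project name and version numbers below are assumptions; match them to your Spark installation):

```scala
// build.sbt — minimal sketch; adjust versions to match your cluster.
name := "my-spark-app"

scalaVersion := "2.10.4"

// Use "provided" scope if you submit with spark-submit; leave it off for local runs.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
```

With sbteclipse added to project/plugins.sbt, running `sbt eclipse` writes the .classpath entries that the Scala IDE picks up on the next project refresh.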
Re: How to compile a Spark project in Scala IDE for Eclipse?
Thanks a lot Krishna, this works for me. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7223.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How to compile a Spark project in Scala IDE for Eclipse?
Thanks for your reply Wei, will try this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7224.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: K-nearest neighbors search in Spark
Any suggestion is very much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
K-nearest neighbors search in Spark
Hi all,

I want to implement a basic K-nearest neighbors search in Spark, but I am totally new to Scala, so I don't know where to start. My data consists of millions of points. For each point, I need to compute its Euclidean distance to the other points, and return the top-K points that are closest to it. The data.txt is in comma-separated format like this:

ID, X, Y
1, 68, 93
2, 12, 90
3, 45, 76
...
100, 86, 54

Could you please tell me what data structure I should use, and how to implement this algorithm in Scala (*some sample code is greatly appreciated*). Thank you very much.

Regards, Carter

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
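One way to start is brute force for a single query point: parse the file into (id, x, y) tuples and use takeOrdered to get the k smallest distances. This is a sketch assuming the ID,X,Y layout above and a hypothetical query point; an all-pairs top-K over millions of points would need something smarter than a full cartesian product:

```scala
// Brute-force K nearest neighbours of one query point.
val k = 5
val query = (68.0, 93.0)  // hypothetical query coordinates

val points = sc.textFile("data.txt").map { line =>
  val Array(id, x, y) = line.split(",").map(_.trim)
  (id, x.toDouble, y.toDouble)
}

// takeOrdered returns the k elements smallest under the ordering, here distance.
val nearest = points
  .map { case (id, x, y) =>
    val d = math.sqrt(math.pow(x - query._1, 2) + math.pow(y - query._2, 2))
    (d, id)
  }
  .takeOrdered(k)(Ordering.by(_._1))
```

takeOrdered keeps only a bounded heap per partition, so it avoids collecting and sorting all distances on the driver.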
RE: sbt/sbt run command returns a JVM problem
Hi, I still have over 1G left for my program.

Date: Sun, 4 May 2014 19:14:30 -0700
From: ml-node+s1001560n5340...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: sbt/sbt run command returns a JVM problem

The total memory of your machine is 2G, right? Then how much memory is left free? Wouldn't Ubuntu take up quite a big portion of the 2G? Just a guess!

On Sat, May 3, 2014 at 8:15 PM, Carter [hidden email] wrote:

[quoted message trimmed; see the earlier post in this thread]

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-run-command-returns-a-JVM-problem-tp5157p5412.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: sbt/sbt run command returns a JVM problem
Hi Michael,

The log after I typed "last" is as below:

> last
scala.tools.nsc.MissingRequirementError: object scala not found.
	at scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)
	at scala.tools.nsc.symtab.Definitions$definitions$.getModule(Definitions.scala:605)
	at scala.tools.nsc.symtab.Definitions$definitions$.ScalaPackage(Definitions.scala:145)
	at scala.tools.nsc.symtab.Definitions$definitions$.ScalaPackageClass(Definitions.scala:146)
	at scala.tools.nsc.symtab.Definitions$definitions$.AnyClass(Definitions.scala:176)
	at scala.tools.nsc.symtab.Definitions$definitions$.init(Definitions.scala:814)
	at scala.tools.nsc.Global$Run.init(Global.scala:697)
	at sbt.compiler.Eval$$anon$1.init(Eval.scala:53)
	at sbt.compiler.Eval.run$1(Eval.scala:53)
	at sbt.compiler.Eval.unlinkAll$1(Eval.scala:56)
	at sbt.compiler.Eval.eval(Eval.scala:62)
	at sbt.EvaluateConfigurations$.evaluateSetting(Build.scala:104)
	at sbt.BuiltinCommands$$anonfun$set$1.apply(Main.scala:212)
	at sbt.BuiltinCommands$$anonfun$set$1.apply(Main.scala:209)
	at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
	at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
	at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
	at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
	at sbt.Command$.process(Command.scala:90)
	at sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
	at sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
	at sbt.State$$anon$2.process(State.scala:171)
	at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
	at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
	at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
	at sbt.MainLoop$.next(MainLoop.scala:71)
	at sbt.MainLoop$.run(MainLoop.scala:64)
	at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:53)
	at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:50)
	at sbt.Using.apply(Using.scala:25)
	at sbt.MainLoop$.runWithNewLog(MainLoop.scala:50)
	at sbt.MainLoop$.runAndClearLast(MainLoop.scala:33)
	at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:17)
	at sbt.MainLoop$.runLogged(MainLoop.scala:13)
	at sbt.xMain.run(Main.scala:26)
	at xsbt.boot.Launch$.run(Launch.scala:55)
	at xsbt.boot.Launch$$anonfun$explicit$1.apply(Launch.scala:45)
	at xsbt.boot.Launch$.launch(Launch.scala:60)
	at xsbt.boot.Launch$.apply(Launch.scala:16)
	at xsbt.boot.Boot$.runImpl(Boot.scala:31)
	at xsbt.boot.Boot$.main(Boot.scala:20)
	at xsbt.boot.Boot.main(Boot.scala)
[error] scala.tools.nsc.MissingRequirementError: object scala not found.
[error] Use 'last' for the full log.

And my sbt file is like below (my sbt launcher is sbt-launch-0.12.4.jar in the same folder):

#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This script launches sbt for this project. If present it uses the system
# version of sbt. If there is no system version of sbt it attempts to download
# sbt locally.
SBT_VERSION=`awk -F "=" '/sbt\.version/ {print $2}' ./project/build.properties`
URL1=http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
URL2=http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
JAR=sbt/sbt-launch-${SBT_VERSION}.jar

# Download sbt launch jar if it hasn't been downloaded yet
if [ ! -f ${JAR} ]; then
  # Download
  printf "Attempting to fetch sbt\n"
  if hash curl 2>/dev/null; then
    curl --progress-bar ${URL1} > ${JAR} || curl --progress-bar ${URL2} > ${JAR}
  elif hash wget 2>/dev/null; then
    wget --progress=bar ${URL1} -O ${JAR} || wget --progress=bar ${URL2} -O ${JAR}
  else
    printf "You do not have curl or wget installed, please install sbt manually from http://www.scala-sbt.org/\n"
    exit -1
  fi
fi
if [ ! -f ${JAR} ]; then
  # We failed to download
  printf "Our attempt to
Re: sbt/sbt run command returns a JVM problem
Hi, thanks for all your help. I tried your setting in the sbt file, but the problem is still there. The Java setting in my sbt file is:

java \
  -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
  -jar ${JAR} \
  "$@"

I have tried setting these 3 parameters bigger and smaller, but nothing works. Did I change the right thing? Thank you very much.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-run-command-returns-a-JVM-problem-tp5157p5267.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: sbt/sbt run command returns a JVM problem
Hi Michael, Thank you very much for your reply. Sorry, I am not very familiar with sbt. Could you tell me where to set the Java options for the sbt fork for my program? I brought up the sbt console and ran set javaOptions += "-Xmx1G" in it, but it returned an error:

[error] scala.tools.nsc.MissingRequirementError: object scala not found.
[error] Use 'last' for the full log.

Is this the right way to set the Java options? Thank you very much.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-run-command-returns-a-JVM-problem-tp5157p5294.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
sbt/sbt run command returns a JVM problem
Hi, I have a very simple Spark program written in Scala:

/*** testApp.scala ***/
object testApp {
  def main(args: Array[String]) {
    println("Hello! World!")
  }
}

Then I use the following command to compile it:

$ sbt/sbt package

The compilation finished successfully and I got a JAR file. But when I use this command to run it:

$ sbt/sbt run

it returned an error from the JVM:

[info] Error occurred during initialization of VM
[info] Could not reserve enough space for object heap
[error] Error: Could not create the Java Virtual Machine.
[error] Error: A fatal exception has occurred. Program will exit.
java.lang.RuntimeException: Nonzero exit code returned from runner: 1
	at scala.sys.package$.error(package.scala:27)

My machine has 2G memory and runs Ubuntu 11.04. I also tried changing the Java parameters (e.g. -Xmx, -Xms, -XX:MaxPermSize, -XX:ReservedCodeCacheSize) in the file sbt/sbt, but it looks like none of the changes work. Can anyone help me out with this problem? Thank you very much.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-run-command-returns-a-JVM-problem-tp5157.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
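For reference, the heap flags live on the java invocation at the bottom of the sbt/sbt launcher script, and on a 2G machine the usual cause of "Could not reserve enough space for object heap" is asking for more heap than the free RAM. A sketch of that line with smaller values (these particular numbers are guesses to try, not known-good settings):

```shell
# Inside sbt/sbt — shrink the JVM heap so it fits in the free RAM of a 2G box.
java \
  -Xmx768m -XX:MaxPermSize=128m -XX:ReservedCodeCacheSize=64m \
  -jar ${JAR} \
  "$@"
```

Checking `free -m` first shows how much headroom is actually available before picking an -Xmx value.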
Re: Need help about how hadoop works.
Thanks Mayur. So without Hadoop or any other distributed file system, by running:

val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count

we can only get parallelization within the computer where the file is loaded, but not parallelization across the computers in the cluster (Spark cannot automatically duplicate the file to the other computers in the cluster) - is this understanding correct? Thank you.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638p4734.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Need help about how hadoop works.
Thank you very much for your help Prashant. Sorry, I still have another question about your answer: "however if the file (/home/scalatest.txt) is present on the same path on all systems it will be processed on all nodes." When placing the file at the same path on all nodes, do we simply copy the same file to all nodes, or do we need to split the original file into different parts (each part still with the same file name scalatest.txt) and copy each part to a different node for parallelization? Thank you very much.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638p4738.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: Need help about how hadoop works.
Thank you very much Prashant.

Date: Thu, 24 Apr 2014 01:24:39 -0700
From: ml-node+s1001560n4739...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: Need help about how hadoop works.

It is the same file, and the Hadoop library that we use for splitting takes care of assigning the right split to each node. Prashant Sharma

On Thu, Apr 24, 2014 at 1:36 PM, Carter [hidden email] wrote:

[quoted message trimmed; see the previous post in this thread]

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638p4746.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Need help about how hadoop works.
Hi, I am a beginner with Hadoop and Spark, and want some help in understanding how Hadoop works. If we have a cluster of 5 computers and install Spark on the cluster WITHOUT Hadoop, and then run this code on one computer:

val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count

can the count task be distributed to all 5 computers? Or is it only run by 5 parallel threads on the current computer? On the other hand, if we install Hadoop on the cluster and upload the data into HDFS, when running the same code will this count task be done by 25 threads? Thank you very much for your help.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
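For reference, the second argument to textFile is a *minimum* partition count, and each partition becomes one task that the scheduler can place on any executor in the cluster. A short sketch using the path from the question (the actual partition count depends on the file's size and the Hadoop input-split rules, so it may exceed 5):

```scala
// minPartitions = 5 asks Spark to split the file into at least 5 partitions;
// each partition is a task that can run on any executor, not just the driver.
val doc = sc.textFile("/home/scalatest.txt", 5)
println(doc.partitions.length)  // inspect how many partitions were created
doc.count()
```

So the parallelism comes from partitions times executors, not from a fixed thread count per machine.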