Re: pass one dataframe column value to another dataframe filter expression + Spark 1.5 + scala

2016-02-05 Thread Ali Tajeldin
I think the tricky part here is that the join condition is encoded in the second data frame and not a direct value. Assuming the second data frame (the tags) is small enough, you can collect it (read it into memory) and then construct a "when" expression chain for each of the possible tags,
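A minimal sketch of the "collect, then build a when chain" idea described above. The snippet is truncated, so the details are assumptions: the column names (id, price, tag, filter), the sample rows, and the choice to store each tag's condition as a SQL expression string parsed with functions.expr (available since Spark 1.5) are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Column, SQLContext}
import org.apache.spark.sql.functions.{expr, when}

// Local contexts for a standalone run; in spark-shell, sc and sqlContext already exist.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("when-chain-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical inputs: a data frame to tag, and a small "tags" frame
// whose second column holds the condition as a SQL expression string.
val data = Seq(("a", 5), ("b", 50), ("c", 500)).toDF("id", "price")
val tags = Seq(
  ("cheap", "price < 10"),
  ("mid", "price >= 10 AND price < 100")
).toDF("tag", "filter")

// The tags frame is assumed small, so it is safe to pull it to the driver.
val rules: Array[(String, String)] =
  tags.collect().map(r => (r.getString(0), r.getString(1)))
require(rules.nonEmpty, "need at least one tag rule to start the chain")

// Start the chain with the first rule, then append one when() per remaining rule.
val (firstTag, firstFilter) = rules.head
val tagCol: Column = rules.tail.foldLeft(when(expr(firstFilter), firstTag)) {
  case (chain, (tag, filter)) => chain.when(expr(filter), tag)
}

// Rows matching no rule get null, because no otherwise() is supplied.
val tagged = data.withColumn("tag", tagCol)
```

The first matching rule wins, so the row order of the collected tags decides priority when conditions overlap.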

Re: DataFrame First method is resulting different results in each iteration

2016-02-04 Thread Ali Tajeldin EDU
Hi Satish, Take a look at the smvTopNRecs() function in the SMV package. It does exactly what you are looking for. It might be overkill to bring in all of SMV for just one function but you will also get a lot more than just DF helper functions (modular views, higher level graphs, dynamic

Re: Dynamic sql in Spark 1.5

2016-02-04 Thread Ali Tajeldin EDU
lly appreciate the help. > > > Thanks, > Divya > > > > > > On 3 February 2016 at 11:42, Ali Tajeldin EDU <alitedu1...@gmail.com> wrote: > While you can construct the SQL string dynamically in scala/java/python, it > would be best to use the Datafr

Re: Dynamic sql in Spark 1.5

2016-02-02 Thread Ali Tajeldin EDU
While you can construct the SQL string dynamically in Scala/Java/Python, it would be best to use the DataFrame API for creating dynamic SQL queries. See http://spark.apache.org/docs/1.5.2/sql-programming-guide.html for details. On Feb 2, 2016, at 6:49 PM, Divya Gehlot
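As a rough illustration of building a query dynamically with the DataFrame API rather than concatenating a SQL string, the sketch below assembles the filters and the projection from runtime values. The table contents, column names, and the optional parameters (minAge, state) are hypothetical and not taken from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Column, SQLContext}
import org.apache.spark.sql.functions.col

// Local contexts for a standalone run; in spark-shell, sc and sqlContext already exist.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("dynamic-query-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical input data.
val people = Seq(("Ann", 31, "NY"), ("Bob", 17, "CA"), ("Cy", 45, "NY"))
  .toDF("name", "age", "state")

// Values that would normally arrive at runtime (user input, config, etc.).
val selectCols: Seq[String] = Seq("name", "age")
val minAge: Option[Int] = Some(18)
val state: Option[String] = Some("NY")

// Turn each supplied parameter into a Column predicate; absent ones are skipped.
val conditions: Seq[Column] = Seq(
  minAge.map(a => col("age") >= a),
  state.map(s => col("state") === s)
).flatten

// Apply the predicates one by one, then project the requested columns.
val filtered = conditions.foldLeft(people)((df, cond) => df.filter(cond))
val result = filtered.select(selectCols.map(col): _*)
```

Because the values are passed as Column literals instead of being spliced into a string, there is no quoting or escaping to get wrong.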

Re: Workflow manager for Spark and Spark SQL

2015-12-10 Thread Ali Tajeldin EDU
Hi Alexander, We developed SMV to address the exact issue you mentioned. While it is not a workflow engine per se, it does allow for the creation of modules with dependencies and automates the execution of these modules. See

Re: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-12-07 Thread Ali Tajeldin EDU
Check out the Sameer Farooqui video on YouTube for Spark internals (https://www.youtube.com/watch?v=7ooZ4S7Ay6Y=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX). Starting at 2:15:00, he describes YARN mode. Btw, I highly recommend the entire video. Very detailed and concise. -- Ali On Dec 7, 2015, at 8:38

Re: Local mode: Stages hang for minutes

2015-12-03 Thread Ali Tajeldin EDU
You can try running "jstack" a couple of times while the app is hung to look for patterns in where it is stuck. -- Ali On Dec 3, 2015, at 8:27 AM, Richard Marscher wrote: > I should add that the pauses are not from GC and also in tracing the CPU call > tree in

Re: ClassNotFoundException with a uber jar.

2015-11-26 Thread Ali Tajeldin EDU
I'm not 100% sure, but I don't think a jar within a jar will work without a custom class loader. You can perhaps try to use "maven-assembly-plugin" or "maven-shade-plugin" to build your uber/fat jar. Both of these will build a flattened single jar. -- Ali On Nov 26, 2015, at 2:49 AM, Marc de

Re: Is it possible Running SparkR on 2 nodes without HDFS

2015-11-10 Thread Ali Tajeldin EDU
Make sure "/mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv" is accessible on your slave node. -- Ali On Nov 9, 2015, at 6:06 PM, Sanjay Subramanian wrote: > hey guys > > I have a 2 node SparkR (1 master 1 slave)cluster on AWS

Re: Pivot Data in Spark and Scala

2015-10-30 Thread Ali Tajeldin EDU
You can take a look at the smvPivot function in the SMV library ( https://github.com/TresAmigosSD/SMV ). Should look for method "smvPivot" in SmvDFHelper ( http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvDFHelper). You can also perform the pivot on a

Re: Request for submitting Spark jobs in code purely, without jar

2015-10-22 Thread Ali Tajeldin EDU
The Spark job-server project may help (https://github.com/spark-jobserver/spark-jobserver). -- Ali On Oct 21, 2015, at 11:43 PM, ?? wrote: > Hi developers, I've encountered some problem with Spark, and before opening > an issue, I'd like to hear your thoughts. >

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Ali Tajeldin EDU
Furthermore, even adding aliasing as suggested by the warning doesn't seem to help either. Slight modification to example below: > scala> val largeValues = df.filter('value >= 10).as("lv") And just looking at the join results: > scala> val j = smallValues > .join(largeValues,

Re: dataframe average error: Float does not take parameters

2015-10-21 Thread Ali Tajeldin EDU
Which version of Spark are you using? I just tried the example below on 1.5.1 and it seems to work as expected: scala> val res = df.groupBy("key").count.agg(min("count"), avg("count")) res: org.apache.spark.sql.DataFrame = [min(count): bigint, avg(count): double] scala> res.show

Re: Incremental load of RDD from HDFS?

2015-10-20 Thread Ali Tajeldin EDU
I could be misreading the code, but looking at the code for toLocalIterator (copied below), it should lazily call runJob on each partition in your input. It shouldn't be parsing the entire RDD before returning from the first "next" call. If it is taking a long time on the first "next" call, it
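A small usage sketch of toLocalIterator, consistent with the lazy, per-partition behaviour described above. The RDD contents and partition count are made up for illustration, and the comments about job scheduling restate the message's reading of the code rather than a verified guarantee.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for a standalone run; in spark-shell, sc already exists.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("local-iterator-sketch"))

// Hypothetical input: 1000 numbers spread over 10 partitions (100 per partition).
val rdd = sc.parallelize(1 to 1000, numSlices = 10)

// Creating the iterator is cheap; partitions are expected to be fetched
// one at a time as the iterator advances, not all up front.
val it = rdd.toLocalIterator

// Taking five elements should only need the first partition.
val firstFive = it.take(5).toList
```

If the first "next" call is slow even on a setup like this, the cost is more likely in computing the first partition (for example, an expensive upstream stage) than in the iterator itself.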

Re: Problem of RDD in calculation

2015-10-16 Thread Ali Tajeldin EDU
Since DF2 only has the userID, I'm assuming you are using DF2 to filter for desired userIDs. You can just use the join() and groupBy operations on DataFrame to do what you desire. For example: scala> val df1=app.createDF("id:String; v:Integer", "X,1;X,2;Y,3;Y,4;Z,10") df1:
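The example in the message is cut off, so here is a hedged sketch of the same join-then-groupBy approach using plain toDF instead of the app.createDF helper. The df1 rows mirror the message; the contents of df2 and the choice of sum as the aggregate are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

// Local contexts for a standalone run; in spark-shell, sc and sqlContext already exist.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("join-groupby-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Same rows as the df1 in the message: (id, v).
val df1 = Seq(("X", 1), ("X", 2), ("Y", 3), ("Y", 4), ("Z", 10)).toDF("id", "v")

// Hypothetical df2: only the user IDs we want to keep.
val df2 = Seq(Tuple1("X"), Tuple1("Z")).toDF("id")

// The inner join on "id" acts as the filter; groupBy then aggregates per user.
val result = df1.join(df2, "id")
  .groupBy("id")
  .agg(sum("v").as("total"))
```

Joining on the column name (rather than on an expression such as df1("id") === df2("id")) keeps a single id column in the output, which avoids ambiguous references in the groupBy.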