Windowed stream operations -- These are too lazy for some use cases

2015-08-20 Thread Justin Grimes
We are aggregating real time logs of events, and want to do windows of 30 minutes. However, since the computation doesn't start until 30 minutes have passed, there is a ton of data built up that processing could've already started on. When it comes time to actually process the data, there is too

Re: How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Dhaval Patel
Apologies, I accidentally included the Spark user DL on BCC. The actual email message is below. Hi: I have been working on a few examples using Zeppelin. I have been trying to find a command that would list all *dataframes/RDDs* that

Run scala code with spark submit

2015-08-20 Thread MasterSergius
Is there any possibility to run a standalone Scala program via spark-submit? Or do I always have to put it in some package and build it with Maven (or sbt)? What if I have just a simple program, like the example word counter? Could anyone please show it on this simple test file, Greeting.scala: It

Re: Run scala code with spark submit

2015-08-20 Thread Dean Wampler
I haven't tried it, but scala-shell should work if you give it a scala script file, since it's basically a wrapper around the Scala REPL. dean On Thursday, August 20, 2015, MasterSergius master.serg...@gmail.com wrote: Is there any possibility to run standalone scala program via spark submit?

Re: insert overwrite table phonesall in spark-sql resulted in java.io.StreamCorruptedException

2015-08-20 Thread John Jay
The answer is that my table was not serialized with Kryo, but I started the spark-sql shell with Kryo, so the data could not be deserialized. -- View this message in context:

Re: persist for DStream

2015-08-20 Thread Hemant Bhanawat
Are you asking for something more than this? http://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence On Thu, Aug 20, 2015 at 2:09 PM, Deepesh Maheshwari deepesh.maheshwar...@gmail.com wrote: Hi, there are functions available to cache() or persist() an RDD in
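
For reference, a minimal Scala sketch of what the linked section boils down to (the socket source, host, port, and batch interval are placeholders standing in for the Kafka source):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("dstream-persist").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(30))

    // persist() marks every RDD generated by this DStream to be kept at the given
    // storage level, so windowed or repeated operations reuse it instead of recomputing.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.persist(StorageLevel.MEMORY_ONLY_SER)

    ssc.start()
    ssc.awaitTermination()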

Transformation not happening for reduceByKey or GroupByKey

2015-08-20 Thread satish chandra j
Hi All, I have data in an RDD as mentioned below: RDD: Array[(Int, Int)] = Array((0,1), (0,2), (1,20), (1,30), (2,40)) I am expecting the output Array((0,3), (1,50), (2,40)), just a sum of the values for each key. Code: RDD.reduceByKey((x,y) => x+y) RDD.take(3) Result in console: RDD:
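
For comparison, a minimal self-contained Scala sketch of the intended aggregation (app name and master are illustrative); note that reduceByKey is lazy and returns a new RDD, which then needs an action to materialize:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("sum-per-key").setMaster("local[2]"))
    val rdd = sc.parallelize(Array((0, 1), (0, 2), (1, 20), (1, 30), (2, 40)))

    // reduceByKey returns a new RDD; capture the result and run an action on it
    // to see the per-key sums.
    val summed = rdd.reduceByKey(_ + _)
    summed.collect().foreach(println)   // (0,3), (1,50), (2,40)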

How to get the radius of clusters in spark K means

2015-08-20 Thread ashensw
We can get cluster centers in K-means clustering. Likewise, is there any method in Spark to get the cluster radius? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-radius-of-clusters-in-spark-K-means-tp24353.html Sent from the Apache Spark

SparkSQL concerning materials

2015-08-20 Thread Dawid Wysakowicz
Hi, I would like to dip into SparkSQL and get to know the architecture, good practices, and some internals better. Could you recommend some materials on this matter? Regards Dawid

persist for DStream

2015-08-20 Thread Deepesh Maheshwari
Hi, there are functions available to cache() or persist() an RDD in memory, but I am reading data from Kafka in the form of a DStream, applying operations to it, and I want to persist that DStream in memory for further use. Please suggest a method for how I can persist a DStream in memory. Regards, Deepesh

Re: spark kafka partitioning

2015-08-20 Thread ayan guha
If you have 1 topic, that means you have 1 DStream, which will have a series of RDDs for each batch interval. In receiver-based integration, there is no direct relationship between Kafka partitions and Spark partitions. In the direct approach, 1 partition will be created for each Kafka partition. On Fri,
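
A minimal Scala sketch of the direct approach (broker address, topic name, and batch interval are placeholders; assumes the spark-streaming-kafka artifact for Spark 1.x on the classpath):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("kafka-direct").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Direct integration: each Kafka partition of the topic becomes one Spark partition.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.foreachRDD(rdd => println(s"partitions in this batch: ${rdd.partitions.length}"))
    ssc.start()
    ssc.awaitTermination()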

Spark-Cassandra-connector

2015-08-20 Thread Samya
Hi All, I need to write an RDD to Cassandra using the spark-cassandra-connector from DataStax. My application is using YARN. Some basic questions: 1. Will a call to saveToCassandra(...) use the same connection object across all tasks in a given executor? I mean, is there 1 (one)
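
For context, a minimal sketch of writing an RDD with the DataStax connector (the Cassandra host, keyspace, table, and column names are placeholders); on question 1, my understanding is that the connector caches connections per executor JVM, but the connector documentation is the authoritative source:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-save")
      .setMaster("local[2]")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder host
    val sc = new SparkContext(conf)

    // Keyspace, table, and column names below are placeholders.
    val rows = sc.parallelize(Seq((1, "a"), (2, "b")))
    rows.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "value"))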

Spark 1.3. Insert into hive parquet partitioned table from DataFrame

2015-08-20 Thread Masf
Hi. I have a dataframe and I want to insert this data into a parquet partitioned table in Hive. In Spark 1.4 I can use df.write.partitionBy("x","y").format("parquet").mode("append").saveAsTable("tbl_parquet") but in Spark 1.3 I can't. How can I do it? Thanks -- Regards Miguel

Re: How to overwrite partition when writing Parquet?

2015-08-20 Thread Romi Kuntsman
Cheng - what if I want to overwrite a specific partition? I'll have to remove the folder, as Hemant suggested... On Thu, Aug 20, 2015 at 1:17 PM Cheng Lian lian.cs@gmail.com wrote: You can apply a filter first to filter out data of needed dates and then append them. Cheng On 8/20/15 4:59

Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Dhaval Patel
Apologies, sent too early accidentally. The actual message is below. A dataframe has 2 date columns (datetime type) and I would like to add another column that would have the difference between these two dates. A dataframe snippet is below.

Re: How to overwrite partition when writing Parquet?

2015-08-20 Thread Cheng Lian
You can apply a filter first to filter out data of needed dates and then append them. Cheng On 8/20/15 4:59 PM, Hemant Bhanawat wrote: How can I overwrite only a given partition or manually remove a partition before writing? I don't know if (and I don't think) there is a way to do that
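
A sketch of one way to combine the two suggestions in this thread (delete the partition directory, then append only that day's rows); the path, column name, and date value are assumptions, and it presumes an existing SparkContext sc and DataFrame df:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Remove the existing partition directory for the date being replaced...
    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.delete(new Path("/data/events/date=2015-08-20"), true)

    // ...then write only that date's rows back in append mode.
    df.filter(df("date") === "2015-08-20")
      .write
      .partitionBy("date")
      .format("parquet")
      .mode("append")
      .save("/data/events")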

[SparkR] How to perform a for loop on a DataFrame object

2015-08-20 Thread Florian M
Hi guys, First of all, thank you for your amazing work. As you can see in the subject, I post here because I need to perform a for loop on a DataFrame object. A sample of my dataset (the entire dataset is ~400k lines long): I use Spark 1.4.1 with R 3.2.1. I launch SparkR using

Convert mllib.linalg.Matrix to Breeze

2015-08-20 Thread Naveen
Hi All, Is there any way to convert an MLlib matrix to a Breeze DenseMatrix? Any leads are appreciated. Thanks, Naveen - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail:

Re: Convert mllib.linalg.Matrix to Breeze

2015-08-20 Thread Yanbo Liang
You can use Matrix.toBreeze() https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L56 . 2015-08-20 18:24 GMT+08:00 Naveen nav...@formcept.com: Hi All, Is there anyway to convert a mllib matrix to a Dense Matrix of Breeze? Any leads

DataFrameWriter.jdbc is very slow

2015-08-20 Thread Aram Mkrtchyan
We want to migrate our data (approximately 20M rows) from parquet to postgres. When we use the dataframe writer's jdbc method the execution time is very long; we tried the same with batch inserts and that was much more effective. Is it intentionally implemented in that way?

How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Dhaval Patel
new_df.withColumn('SVCDATE2', (new_df.next_diag_date-new_df.SVCDATE).days).show()

    +-----------+----------+--------------+
    |      PATID|   SVCDATE|next_diag_date|
    +-----------+----------+--------------+
    |12345655545|2012-02-13|    2012-02-13|
    |12345655545|2012-02-13|    2012-02-13|
    |12345655545|2012-02-13|

Data locality with HDFS not being seen

2015-08-20 Thread Sunil
Hello. I am seeing some unexpected issues with achieving HDFS data locality. I expect the tasks to be executed only on the node which has the data, but this is not happening (of course, unless the node is busy, in which case I understand tasks can go to some other node). Could anyone

RE: Any suggestion about sendMessageReliably failed because ack was not received within 120 sec

2015-08-20 Thread java8964
The closest information I could find online related to this error is https://issues.apache.org/jira/browse/SPARK-3633, but it is quite different in our case. In our case, we never saw the (Too many open files) error; the logs simply show the 120 sec timeout. I checked all the GC output from all

RE: SparkR csv without headers

2015-08-20 Thread Sun, Rui
Hi, You can create a DataFrame using load.df() with a specified schema. Something like: schema <- structType(structField("a", "string"), structField("b", "integer"), …) read.df( …, schema = schema) From: Franc Carter [mailto:franc.car...@rozettatech.com] Sent: Wednesday, August 19, 2015 1:48 PM

Any suggestion about sendMessageReliably failed because ack was not received within 120 sec

2015-08-20 Thread java8964
Hi, Sparkers: After the first 2 weeks of Spark in our production cluster, and being more familiar with Spark, we are more confident about avoiding lost executors due to memory issues. So far, most of our jobs won't fail or slow down due to a lost executor. But sometimes, I observed that individual tasks failed due

Re: Kafka Spark Partition Mapping

2015-08-20 Thread Cody Koeninger
In general you cannot guarantee which node an RDD will be processed on. The preferred location for a kafkardd is the kafka leader for that partition, if they're deployed on the same machine. If you want to try to override that behavior, the method is getPreferredLocations But even in that case,

Re: what determine the task size?

2015-08-20 Thread ambujhbti
cwz wrote: sorry, my question is not clear. I mean what determines the size of one task? not how many tasks. One task size = one HDFS block size. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/what-determine-the-task-size-tp24363p24375.html Sent from the

spark kafka partitioning

2015-08-20 Thread Gaurav Agarwal
Hello, Regarding Spark Streaming and Kafka partitioning: when I send messages on a Kafka topic with 3 partitions and listen on a Kafka receiver with local[4], how will I come to know in Spark Streaming that different DStreams are created according to the partitions of the Kafka messages? Thanks

Re: How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Rishitesh Mishra
I am not sure if you can view all RDDs in a session. Tables are maintained in a catalogue, hence it's easier. However, you can see the DAG representation, which lists all the RDDs in a job, in the Spark UI. On 20 Aug 2015 22:34, Dhaval Patel dhaval1...@gmail.com wrote: Apologies I

Re: Windowed stream operations -- These are too lazy for some use cases

2015-08-20 Thread Iulian Dragoș
On Thu, Aug 20, 2015 at 6:58 PM, Justin Grimes jgri...@adzerk.com wrote: We are aggregating real time logs of events, and want to do windows of 30 minutes. However, since the computation doesn't start until 30 minutes have passed, there is a ton of data built up that processing could've already

Re: DAG related query

2015-08-20 Thread Andrew Or
Hi Bahubali, Once RDDs are created, they are immutable (in most cases). In your case you end up with 3 RDDs: (1) the original rdd1 that reads from the text file, (2) rdd2, which applies a map function on (1), and (3) the new rdd1, which applies a map function on (2). There's no cycle because you

Re: Windowed stream operations -- These are too lazy for some use cases

2015-08-20 Thread Justin Grimes
I tried something like that. When I tried just doing count() on the DStream, it didn't seem like it was actually forcing the computation. What (sort of) worked was doing a foreachRDD((rdd) => rdd.count()), or doing a print() on the DStream. The only problem was this seemed to add a lot of

Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Davies Liu
As Aram said, there are two options in Spark 1.4: 1) Use the HiveContext, then you get datediff from Hive: df.selectExpr("datediff(d2, d1)") 2) Use a Python UDF: ``` from datetime import date df = sqlContext.createDataFrame([(date(2008, 8, 18), date(2008, 9, 26))], ['d1', 'd2']) from

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-20 Thread Nicholas Chammas
I'm planning to close the survey to further responses early next week. If you haven't chimed in yet, the link to the survey is here: http://goo.gl/forms/erct2s6KRR We already have some great responses, which you can view. I'll share a summary after the survey is closed. Cheers! Nick On Mon,

Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Dhaval Patel
One more update on this question: I am using Spark 1.4.1. I was just reading the documentation of Spark 1.5 (still in development) and I think there will be a new function *datediff* that will solve the issue. So please let me know if there is any workaround until Spark 1.5 is out :).

Re: DAG related query

2015-08-20 Thread Sean Owen
No. The third line creates a third RDD whose reference simply replaces the reference to the first RDD in your local driver program. The first RDD still exists. On Thu, Aug 20, 2015 at 2:15 PM, Bahubali Jain bahub...@gmail.com wrote: Hi, How would the DAG look like for the below code

PySpark concurrent jobs using single SparkContext

2015-08-20 Thread Mike Sukmanowsky
Hi all, We're using Spark 1.3.0 via a small YARN cluster to do some log processing. The jobs are pretty simple, for a number of customers and a number of days, fetch some event log data, build aggregates and store those aggregates into a data store. The way our script is written right now does

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Sandy Ryza
What version of Spark are you using? Have you set any shuffle configs? On Wed, Aug 19, 2015 at 11:46 AM, unk1102 umesh.ka...@gmail.com wrote: I have one Spark job which seems to run fine but after one hour or so executor start getting lost because of time out something like the following

Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-20 Thread Umesh Kacha
Hi Hemant, sorry for the confusion. I meant the final output part files in the final HDFS directory; I never meant intermediate files. Thanks. My goal is to reduce that many files because of my use case, explained in the first email with calculations. On Aug 20, 2015 5:59 PM, Hemant Bhanawat

Re: SparkSQL concerning materials

2015-08-20 Thread Muhammad Atif
Hi Dawid, The best place to get started is the Spark SQL guide from Apache: http://spark.apache.org/docs/latest/sql-programming-guide.html Regards Muhammad On Thu, Aug 20, 2015 at 5:46 AM, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote: Hi, I would like to dip into SparkSQL. Get to know

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Sandy Ryza
Moving this back onto user@ Regarding GC, can you look in the web UI and see whether the GC time metric dominates the amount of time spent on each task (or at least the tasks that aren't completing)? Also, have you tried bumping your spark.yarn.executor.memoryOverhead? YARN may be killing your

Re: Convert mllib.linalg.Matrix to Breeze

2015-08-20 Thread Burak Yavuz
Matrix.toBreeze is a private method. MLlib matrices have the same structure as Breeze Matrices. Just create a new Breeze matrix like this https://github.com/apache/spark/blob/43e0135421b2262cbb0e06aae53523f663b4f959/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L270 . Best,
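
A small sketch of that approach (assuming a dense MLlib matrix; both MLlib and Breeze store dense matrices in column-major order):

    import breeze.linalg.{DenseMatrix => BDM}
    import org.apache.spark.mllib.linalg.Matrices

    // 2x2 example matrix; toArray exposes the column-major values, which is
    // exactly what Breeze's DenseMatrix constructor expects.
    val m = Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))
    val breezeM = new BDM[Double](m.numRows, m.numCols, m.toArray)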

Data frame created from hive table and its partition

2015-08-20 Thread VIJAYAKUMAR JAWAHARLAL
Hi, I have a question regarding data frame partitioning. I read a Hive table from Spark and the following Spark API converts it to a DF: test_df = sqlContext.sql("select * from hivetable1") How does Spark decide the partitioning of test_df? Is there a way to partition test_df based on some column while

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-20 Thread satish chandra j
Hi All, Could anybody let me know what it is that I am missing here? It should work, as it's a basic transformation. Please let me know if any additional information is required. Regards, Satish On Thu, Aug 20, 2015 at 3:35 PM, satish chandra j jsatishchan...@gmail.com wrote: HI All, I have data in RDD

Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Aram Mkrtchyan
Hi, hope this will help you import org.apache.spark.sql.functions._ import sqlContext.implicits._ import java.sql.Timestamp val df = sc.parallelize(Array((date1, date2))).toDF("day1", "day2") val dateDiff = udf[Long, Timestamp, Timestamp]((value1, value2) =
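
A completed variant of that idea, in case the truncation above makes it hard to follow (column names and the millis-per-day arithmetic are illustrative; assumes an existing SparkContext sc and SQLContext sqlContext):

    import java.sql.Date
    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    val df = sc.parallelize(Seq((Date.valueOf("2012-02-13"), Date.valueOf("2012-03-01"))))
               .toDF("day1", "day2")

    // UDF computing the whole-day difference between two date columns.
    val dateDiff = udf((d1: Date, d2: Date) =>
      (d2.getTime - d1.getTime) / (24L * 60 * 60 * 1000))

    df.withColumn("duration_days", dateDiff(col("day1"), col("day2"))).show()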

DAG related query

2015-08-20 Thread Bahubali Jain
Hi, What would the DAG look like for the code below? JavaRDD<String> rdd1 = context.textFile(SOMEPATH); JavaRDD<String> rdd2 = rdd1.map(DO something); rdd1 = rdd2.map(Do SOMETHING); Does this lead to any kind of cycle? Thanks, Baahu

Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-20 Thread Hemant Bhanawat
Sorry, I misread your mail. Thanks for pointing that out. BTW, are the 8 files shuffle intermediate output and not the final output? I assume yes. I didn't know that you can keep intermediate output on HDFS and I don't think that is recommended. On Thu, Aug 20, 2015 at 2:43 PM, Hemant

Re: Convert mllib.linalg.Matrix to Breeze

2015-08-20 Thread Naveen
Hi, Thanks for the reply. I tried Matrix.toBreeze(), which returns the following error: method toBreeze in trait Matrix cannot be accessed in org.apache.spark.mllib.linalg.Matrix On Thursday 20 August 2015 07:50 PM, Burak Yavuz wrote: Matrix.toBreeze is a private method. MLlib matrices

Re: SparkR - can't create spark context - JVM not ready

2015-08-20 Thread Deborah Siegel
Thanks Shivaram. You got me wondering about the path so I put it in full and it worked. R does not, of course, expand a ~. On Thu, Aug 20, 2015 at 4:35 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Can you check if the file `~/software/spark-1.4.1-bin-hadoop2.4/bin/spark-submit`

Re: spark kafka partitioning

2015-08-20 Thread Cody Koeninger
I'm not clear on your question, can you rephrase it? Also, are you talking about createStream or createDirectStream? On Thu, Aug 20, 2015 at 9:48 PM, Gaurav Agarwal gaurav130...@gmail.com wrote: Hello Regarding Spark Streaming and Kafka Partitioning When i send message on kafka topic with

Re: SparkR csv without headers

2015-08-20 Thread Franc Carter
Thanks - works nicely, cheers On Fri, Aug 21, 2015 at 12:43 PM, Sun, Rui rui@intel.com wrote: Hi, You can create a DataFrame using load.df() with a specified schema. Something like: schema <- structType(structField("a", "string"), structField("b", "integer"), …) read.df( …,

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Sandy Ryza
GC wouldn't necessarily result in errors - it could just be slowing down your job and causing the executor JVMs to stall. If you click on a stage in the UI, you should end up on a page with all the metrics concerning the tasks that ran in that stage. GC Time is one of these task metrics. -Sandy

Re: Creating Spark DataFrame from large pandas DataFrame

2015-08-20 Thread Burak Yavuz
If you would like to try using spark-csv, please use `pyspark --packages com.databricks:spark-csv_2.11:1.2.0` You're missing a dependency. Best, Burak On Thu, Aug 20, 2015 at 1:08 PM, Charlie Hack charles.t.h...@gmail.com wrote: Hi, I'm new to spark and am trying to create a Spark df from a

org.apache.hadoop.security.AccessControlException: Permission denied when access S3

2015-08-20 Thread Shuai Zheng
Hi All, I am trying to access an S3 file using the Hadoop file format. Below is my code: Configuration hadoopConf = ctx.hadoopConfiguration(); hadoopConf.set("fs.s3n.awsAccessKeyId", this.getAwsAccessKeyId());

Re: SparkSQL concerning materials

2015-08-20 Thread Ted Yu
See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.package Cheers On Thu, Aug 20, 2015 at 7:50 AM, Muhammad Atif muhammadatif...@gmail.com wrote: Hi Dawid The best pace to get started is the Spark SQL Guide from Apache

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Umesh Kacha
Hi, where do I see GC time in the UI? I have set spark.yarn.executor.memoryOverhead to 3500, which seems to be good enough, I believe. So you mean only GC could be the reason behind the timeout? I checked the YARN logs and I did not see any GC error there. Please guide. Thanks much. On Thu, Aug 20, 2015 at 8:14 PM,

dataframe json schema scan

2015-08-20 Thread Alex Nastetsky
The doc for DataFrameReader#json(RDD[String]) method says Unless the schema is specified using schema function, this function goes through the input once to determine the input schema. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader Why is this

Re: Data frame created from hive table and its partition

2015-08-20 Thread VIJAYAKUMAR JAWAHARLAL
Thanks Michael. My bad regarding Hive table primary keys. I have one big 140GB HDFS file and an external Hive table defined on it. The table is not partitioned. When I read the external Hive table using sqlContext.sql, how does Spark decide the number of partitions that should be created for that data

load NULL Values in RDD

2015-08-20 Thread SAHA, DEBOBROTA
Hi, Can anyone help me with loading a column that may or may not have NULL values into an RDD? Thanks

Creating Spark DataFrame from large pandas DataFrame

2015-08-20 Thread Charlie Hack
Hi, I'm new to spark and am trying to create a Spark df from a pandas df with ~5 million rows. Using Spark 1.4.1. When I type: df = sqlContext.createDataFrame(pandas_df.where(pd.notnull(didf), None)) (the df.where is a hack I found on the Spark JIRA to avoid a problem with NaN values making

Re: DataFrameWriter.jdbc is very slow

2015-08-20 Thread Michael Armbrust
We will probably fix this in Spark 1.6 https://issues.apache.org/jira/browse/SPARK-10040 On Thu, Aug 20, 2015 at 5:18 AM, Aram Mkrtchyan aram.mkrtchyan...@gmail.com wrote: We want to migrate our data (approximately 20M rows) from parquet to postgres, when we are using dataframe writer's jdbc

Re: Data frame created from hive table and its partition

2015-08-20 Thread Michael Armbrust
There is no such thing as primary keys in the Hive metastore, but Spark SQL does support partitioned hive tables: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables DataFrameWriter also has a partitionBy method. On Thu, Aug 20, 2015 at 7:29
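
A brief sketch of the DataFrameWriter.partitionBy route mentioned above (table and column names are illustrative; assumes an existing DataFrame df with a dt column and a HiveContext-backed session):

    // Each distinct value of "dt" becomes its own partition directory under the table.
    df.write
      .partitionBy("dt")
      .format("parquet")
      .mode("append")
      .saveAsTable("events_partitioned")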

FAILED_TO_UNCOMPRESS error from Snappy

2015-08-20 Thread Kohki Nishio
Right after upgraded to 1.4.1, we started seeing this exception and yes we picked up snappy-java-1.1.1.7 (previously snappy-java-1.1.1.6). Is there anything I could try ? I don't have a repro case. org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5) at

Re: Saving and loading MLlib models as standalone (no Hadoop)

2015-08-20 Thread Robineast
You can't serialize models out of Spark and then use them outside of the Spark context. However there is support for the PMML format - have a look at https://spark.apache.org/docs/latest/mllib-pmml-model-export.html Robin
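
A small sketch of the PMML export path described on that page (the output path is a placeholder; assumes an existing SparkContext sc):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Train a tiny model just to have something exportable.
    val points = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
    val model = KMeans.train(points, 2, 10)   // k = 2, maxIterations = 10

    model.toPMML("/tmp/kmeans.pmml")   // write PMML to a path
    val pmmlString = model.toPMML()    // or keep the PMML XML as a String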

Re: FAILED_TO_UNCOMPRESS error from Snappy

2015-08-20 Thread Ruslan Dautkhanov
https://issues.apache.org/jira/browse/SPARK-7660 ? -- Ruslan Dautkhanov On Thu, Aug 20, 2015 at 1:49 PM, Kohki Nishio tarop...@gmail.com wrote: Right after upgraded to 1.4.1, we started seeing this exception and yes we picked up snappy-java-1.1.1.7 (previously snappy-java-1.1.1.6). Is there

Re: SparkSQL concerning materials

2015-08-20 Thread Dhaval Patel
Or if you're a python lover then this is a good place - https://spark.apache.org/docs/1.4.1/api/python/pyspark.sql.html# On Thu, Aug 20, 2015 at 10:58 AM, Ted Yu yuzhih...@gmail.com wrote: See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.package Cheers

Re: How to get the radius of clusters in spark K means

2015-08-20 Thread Ashen Weerathunga
Okay, thanks. I already did that and wanted to check whether there is any other method to extract it from the model itself. Thanks again for the help. On Thu, Aug 20, 2015 at 8:39 PM, Robin East robin.e...@xense.co.uk wrote: There is no cluster radius method on the model returned from K-means.
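
Since there is no radius accessor on the model, here is a sketch of the manual computation referred to above, defining the radius as the largest distance from a member point to its centre (the data is illustrative; assumes an existing SparkContext sc):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.5, 0.5),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.5, 9.5)))
    val model = KMeans.train(data, 2, 10)   // k = 2, maxIterations = 10

    // Radius per cluster = max distance from a point to its assigned centre.
    val radii = data.map { v =>
      val cluster = model.predict(v)
      (cluster, Vectors.sqdist(v, model.clusterCenters(cluster)))
    }.reduceByKey((a, b) => math.max(a, b))
     .mapValues(math.sqrt)

    radii.collect().foreach { case (c, r) => println(s"cluster $c radius $r") }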

How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Dhaval Patel
Hi: I have been working on a few examples using Zeppelin. I have been trying to find a command that would list all *dataframes/RDDs* that have been created in the current session. Does anyone know if there is any such command available? Something similar to SparkSQL listing all temp tables: show

MLlib Prefixspan implementation

2015-08-20 Thread alexis GILLAIN
I want to use PrefixSpan, so I had a look at the code and the cited paper: Distributed PrefixSpan Algorithm Based on MapReduce. There is a result in the paper I didn't really understand and I couldn't find where it is used in the code. Suppose a sequence database S = {1, 2, ..., n}, a sequence

SparkR - can't create spark context - JVM not ready

2015-08-20 Thread Deborah Siegel
Hello, I have previously successfully run SparkR in RStudio, with: Sys.setenv(SPARK_HOME="~/software/spark-1.4.1-bin-hadoop2.4") .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) library(SparkR) sc <- sparkR.init(master="local[2]", appName="SparkR-example") Then I tried putting some

Spark SQL window functions (RowsBetween)

2015-08-20 Thread Mike Trienis
Hi All, I would like some clarification regarding window functions for Apache Spark 1.4.0 - https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html In particular, the rowsBetween * {{{ * val w = Window.partitionBy("name").orderBy("id") * df.select(
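
A small Scala sketch of how rowsBetween is typically used (column names and data are illustrative; assumes a SparkContext sc and a HiveContext-style sqlContext, since Spark 1.4 window functions need one):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("a", 1, 10.0), ("a", 2, 20.0), ("a", 3, 30.0)))
               .toDF("name", "id", "value")

    // Frame of the previous row, the current row, and the next row per partition.
    val w = Window.partitionBy("name").orderBy("id").rowsBetween(-1, 1)
    df.select(col("name"), col("id"), avg(col("value")).over(w).as("moving_avg")).show()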