Re: Spark 2.0 issue with left_outer join

2017-03-04 Thread ayan guha
How about running this: select * from (select *, count(*) over (partition by id) cnt from filteredDS) f where f.cnt < 7500. On Sun, Mar 5, 2017 at 12:05 PM, Ankur Srivastava <ankur.srivast...@gmail.com> wrote: > Yes, every time I run this code with production-scale data it fails.
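A DataFrame-API equivalent of the windowed count above, as a minimal sketch: filteredDS, the id column, and the 7500 threshold come from the query; the source path and everything else are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.count

    val spark = SparkSession.builder().appName("WindowCountFilter").getOrCreate()
    import spark.implicits._

    // Hypothetical source for filteredDS.
    val filteredDS = spark.read.parquet("/path/to/filtered")

    // Count rows per id with a window function, then keep ids seen fewer than 7500 times.
    val byId = Window.partitionBy("id")
    val result = filteredDS
      .withColumn("cnt", count("id").over(byId))
      .filter($"cnt" < 7500)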

Re: Spark 2.0 issue with left_outer join

2017-03-04 Thread Ankur Srivastava
Yes, every time I run this code with production-scale data it fails. A test case with a small dataset of 50 records on a local box runs fine. Thanks, Ankur > On Mar 4, 2017, at 12:09 PM, ayan guha wrote: > > Just to be sure, can you reproduce the error using

Sharing my DataFrame (DataSet) cheat sheet.

2017-03-04 Thread Yuhao Yang
Sharing some snippets I accumulated while developing with the Apache Spark DataFrame (Dataset) API. Hope they help you in some way: https://github.com/hhbyyh/DataFrameCheatSheet. Regards, Yuhao Yang

Re: Spark 2.0 issue with left_outer join

2017-03-04 Thread ayan guha
Just to be sure, can you reproduce the error using the SQL API? On Sat, 4 Mar 2017 at 2:32 pm, Ankur Srivastava wrote: > Adding DEV. > > Or is there any other way to do subtractByKey using the Dataset API? > > Thanks > Ankur > > On Wed, Mar 1, 2017 at 1:28 PM, Ankur
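The quoted question asks for a Dataset-API substitute for RDD subtractByKey; a left anti join is the usual equivalent. A minimal sketch under assumed data — the key column "id" and both datasets are illustrative, not from the thread.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("SubtractByKey").getOrCreate()
    import spark.implicits._

    val left = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val right = Seq((2, "x")).toDF("id", "other")

    // subtractByKey equivalent: keep left rows whose id has no match on the right.
    val subtracted = left.join(right, Seq("id"), "left_anti")
    subtracted.show()  // rows with id 1 and 3 remain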

RE: using spark to load a data warehouse in real time

2017-03-04 Thread Adaryl Wakefield
That does, thanks. I’m starting to think a straight Kafka solution would be more appropriate. Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics, LLC 913.938.6685 www.massstreet.net www.linkedin.com/in/bobwakefieldmba

RE: using spark to load a data warehouse in real time

2017-03-04 Thread Adaryl Wakefield
For all the work that is necessary to load a warehouse, couldn’t that work be considered a special case of CEP (complex event processing)? By real time I mean getting to zero lag between an event happening in the transactional system and someone being able to do analytics on that data, but not just from that

Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-04 Thread bryan . jeffrey
RDD operation: rdd.map(word => (word, 1)).reduceByKey(_ + _) On Sat, Mar 4, 2017 at 8:59 AM -0500, "Old-School" wrote: Hi, I want to perform some simple transformations and check the execution time, under
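A runnable version of the RDD pattern above with a crude wall-clock timing harness, since the quoted question is about checking execution time; the input path, the partition count, and the System.nanoTime timing are assumptions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("RddTiming").getOrCreate()
    val sc = spark.sparkContext

    // numPartitions controls parallelism for the experiment.
    val numPartitions = 8
    val lines = sc.textFile("/path/to/input.txt", numPartitions)

    val start = System.nanoTime()
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.count()  // an action forces evaluation; transformations alone are lazy
    println(s"reduceByKey took ${(System.nanoTime() - start) / 1e6} ms")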

Re: Spark - Not contains on Spark dataframe

2017-03-04 Thread KhajaAsmath Mohammed
Hi, I was able to resolve the issue with the conditions below. datapoint_df(Constants.Datapoint.Vin).like("012345") datapoint_filter_df.filter(datapoint_filter_df(Constants.Datapoint.Vin) rlike "^[A-Za-z0-9]+$") // checks that the value is alphanumeric. Thanks, Asmath On Tue, Feb 28, 2017 at 10:49 AM,
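For the "not contains" case in the subject line, negating the predicate is the usual approach; a brief sketch in which the column name "vin" and the literal are placeholders, not values from the thread.

    import org.apache.spark.sql.functions.{col, not}

    // Keep rows whose vin does NOT contain the substring "012345".
    val withoutMatch = datapoint_df.filter(!col("vin").contains("012345"))

    // Equivalent form using not() with like():
    val withoutMatch2 = datapoint_df.filter(not(col("vin").like("%012345%")))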

Re: Not able to remove header from a text file while creating a data frame .

2017-03-04 Thread KhajaAsmath Mohammed
You have to use the snippet below. Include the com.databricks.spark.csv dependency in sbt or Maven before doing it. If running via spark-submit, add it with --jars or --packages. sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter",
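A completed sketch of the truncated read above; the path, delimiter, and package coordinates are assumptions. With header set to "true", spark-csv uses the first line for column names instead of returning it as a data row.

    // Launch with the package on the classpath, e.g.:
    // spark-submit --packages com.databricks:spark-csv_2.11:1.5.0 ...

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")       // first line becomes column names, not data
      .option("delimiter", ",")       // assumed comma-delimited
      .option("inferSchema", "true")  // optional: infer column types
      .load("/path/to/file.csv")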

Not able to remove header from a text file while creating a data frame .

2017-03-04 Thread PSwain
Hi All, I am reading a text file to create a dataframe. While I am trying to exclude the header from the text file, I am not able to do it. Now my concern is how to know what options are available that I can use while reading from a source; I checked the API, but there the arguments in option
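If you are on Spark 2.x, the built-in csv reader also takes a header option, without the external package; a minimal sketch, with the path and delimiter assumed.

    val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()

    val df = spark.read
      .option("header", "true")   // consume the first line as column names
      .option("delimiter", ",")   // assumed delimiter
      .csv("/path/to/data.txt")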

[RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-04 Thread Old-School
Hi, I want to perform some simple transformations and check the execution time under various configurations (e.g. number of cores being used, number of partitions, etc.). Since it is not possible to set the partitions of a dataframe, I guess that I should probably use RDDs. I've got a dataset
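For what it's worth, partition counts can also be influenced on the DataFrame side, which may make the RDD detour unnecessary; a brief sketch assuming an existing DataFrame df and SparkSession spark.

    // Explicitly set the partition count of an existing DataFrame.
    val df16 = df.repartition(16)

    // Control the partition count produced by shuffles (joins, aggregations).
    spark.conf.set("spark.sql.shuffle.partitions", "16")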