Re: Joining streaming data with static table data.

2017-12-11 Thread Rishi Mishra
You can join a streaming Dataset with a static Dataset, and I would prefer your first approach. The problem with this approach is performance: unless you cache the static Dataset, every time you fire the join query it will fetch the latest records from the table. Regards, Rishitesh Mishra,
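A minimal sketch of the stream-static join described above, assuming Structured Streaming with a Kafka source and a hypothetical JDBC lookup table sharing an "id" column; caching the static side avoids re-reading it for every micro-batch, at the cost of not seeing later table updates:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

// Static side (placeholder JDBC options); cache() trades freshness for speed.
val staticDf = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "lookup")
  .load()
  .cache()

// Streaming side (placeholder Kafka options).
val streamDf = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

// Stream-static join on the shared "id" column.
val joined = streamDf.join(staticDf, "id")

joined.writeStream.format("console").start().awaitTermination()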

Re: DataFrame joins with Spark-Java

2017-11-29 Thread Rishi Mishra
Hi Sushma, can you try a left anti join as below? In my example, name and id together form the key. df1.alias("a").join(df2.alias("b"), col("a.name").equalTo(col("b.name")).and(col("a.id").equalTo(col("b.id"))), "left_anti").selectExpr("name", "id").show(10,
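A complete Scala equivalent of the truncated Java snippet above (a sketch; df1 and df2 are assumed to share the name and id columns):

import org.apache.spark.sql.functions.col

val missing = df1.as("a")
  .join(df2.as("b"),
    col("a.name") === col("b.name") && col("a.id") === col("b.id"),
    "left_anti")
  .select("name", "id")   // a left anti join keeps only the left side's columns

missing.show(10, false)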

Re: Extend Dataframe API

2016-07-08 Thread Rishi Mishra
Or you can extend SQLContext to add your own plans. Not sure if it fits your requirement, but I am answering to highlight an option. Regards, Rishitesh Mishra, SnappyData (http://www.snappydata.io/) https://in.linkedin.com/in/rishiteshmishra On Thu, Jul 7, 2016 at 8:39 PM, tan shai
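A rough sketch of one such extension point; the strategy and plan names here are hypothetical, and it relies on the developer-level ExperimentalMethods hook rather than subclassing SQLContext itself:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// A strategy that would map a custom logical node to a custom physical node.
object MyStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // case MyLogicalNode(child) => MyPhysicalNode(planLater(child)) :: Nil
    case _ => Nil
  }
}

val sc = new SparkContext(new SparkConf().setAppName("extra-strategies").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
sqlContext.experimental.extraStrategies = Seq(MyStrategy)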

Re: RDD and Dataframes

2016-07-07 Thread Rishi Mishra
Yes, it will finally be converted to an RDD internally. However, DataFrame queries pass through Catalyst, which provides several optimizations, e.g. code generation and smarter shuffles, which is not the case for pure RDDs. Regards, Rishitesh Mishra, SnappyData.
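For example, explain() shows what Catalyst does with a DataFrame query (the input file here is hypothetical):

import org.apache.spark.sql.functions.col

val df = sqlContext.read.json("people.json")          // hypothetical input with an "age" field
// Prints the parsed, analyzed, optimized and physical plans for the query.
df.filter(col("age") > 21).select("name").explain(true)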

Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Rishi Mishra
predicates? > > Telmo Rodrigues > > On 11/05/2016, at 09:49, Rishi Mishra <rmis...@snappydata.io> > wrote: > > It does push the predicate. But as relations are generic and might or > might not handle some of the predicates, it needs to apply a filter of >

Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Rishi Mishra
It does push the predicate. But as relations are generic and might or might not handle some of the predicates, it needs to apply a filter for the unhandled predicates. Regards, Rishitesh Mishra, SnappyData (http://www.snappydata.io/) https://in.linkedin.com/in/rishiteshmishra On Wed, May 11,
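A small illustration of this, assuming a Parquet source and a hypothetical path: predicates the source can handle show up as PushedFilters in the physical plan, while a Filter node remains on top for anything the relation did not handle.

import org.apache.spark.sql.functions.col

val events = sqlContext.read.parquet("/data/events")   // hypothetical dataset
events.filter(col("country") === "IN").explain()
// Look for a PushedFilters entry in the scan, plus a Filter operator above it.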

Re: partitioner aware subtract

2016-05-10 Thread Rishi Mishra
As you have the same partitioner and number of partitions, you can probably use zipPartitions and provide a user-defined function to subtract. A very primitive example, completed in the sketch below: val data1 = Seq(1->1,2->2,3->3,4->4,5->5,6->6,7->7) val data2 = Seq(1->1,2->2,3->3,4->4,5->5,6->6) val rdd1 =
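The sketch, assuming both RDDs are partitioned with the same HashPartitioner so matching keys land in the same partition index:

import org.apache.spark.HashPartitioner

val data1 = Seq(1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4, 5 -> 5, 6 -> 6, 7 -> 7)
val data2 = Seq(1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4, 5 -> 5, 6 -> 6)
val rdd1 = sc.parallelize(data1).partitionBy(new HashPartitioner(4))
val rdd2 = sc.parallelize(data2).partitionBy(new HashPartitioner(4))

// Pairwise subtraction per partition: keep the elements of rdd1 whose
// counterpart is absent from the matching partition of rdd2.
val subtracted = rdd1.zipPartitions(rdd2) { (it1, it2) =>
  val toRemove = it2.toSet
  it1.filterNot(toRemove.contains)
}
subtracted.collect()   // expected to contain only (7,7)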

Re: Updating Values Inside Foreach Rdd loop

2016-05-10 Thread Rishi Mishra
Hi Harsh, you probably need to maintain some state for your values, as you are updating some of the keys in a batch and checking a global state of your equation. Can you check the mapWithState API of DStream? Regards, Rishitesh Mishra, SnappyData (http://www.snappydata.io/)
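A minimal mapWithState sketch, assuming a hypothetical DStream of (String, Int) pairs named pairStream; the State object carries a running sum per key across batches:

import org.apache.spark.streaming.{State, StateSpec}

val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
  val sum = state.getOption.getOrElse(0) + value.getOrElse(0)
  state.update(sum)            // persist the new per-key state for the next batch
  (key, sum)                   // record emitted for this batch
}

val stateful = pairStream.mapWithState(StateSpec.function(mappingFunc))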

Re: Accumulator question

2016-05-10 Thread Rishi Mishra
Your mail does not describe much, but won't a simple reduce function help you? Something like the below: val data = Seq(1,2,3,4,5,6,7) val rdd = sc.parallelize(data, 2) val sum = rdd.reduce((a,b) => a+b) Regards, Rishitesh Mishra, SnappyData (http://www.snappydata.io/)

Re: Spark Streaming share state between two streams

2016-04-08 Thread Rishi Mishra
Hi Shekhar, as both of your state functions do the same thing, can't you do a union of the DStreams before applying mapWithState()? It might be difficult if one state function depends on the other's state; that would require a named state which can be accessed from other state functions. I have not gone
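A sketch of that suggestion, assuming stream1 and stream2 are hypothetical DStream[(String, Int)]s and a mapping function of the same shape as in the previous sketch:

import org.apache.spark.streaming.StateSpec

// One shared state per key: merge the streams first, then track state once.
val merged = stream1.union(stream2)
val stateful = merged.mapWithState(StateSpec.function(mappingFunc))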

Re: About nested RDD

2016-04-08 Thread Rishi Mishra
As mentioned earlier, you can create a broadcast variable containing all of the small RDD's elements. I hope they are really small. Then you can fire A.update(broadcastVariable). Regards, Rishitesh Mishra, SnappyData (http://www.snappydata.io/) https://in.linkedin.com/in/rishiteshmishra On Fri,
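A sketch of the broadcast approach, with smallRdd and bigRdd as hypothetical stand-ins (A.update() in the original is pseudocode, so a plain map is used here):

// Materialise the small RDD on the driver and ship it to every executor once.
val smallValues = smallRdd.collect()
val bcast = sc.broadcast(smallValues)

// Use the broadcast value inside transformations of the big RDD instead of
// referring to another RDD from within a closure.
val updated = bigRdd.map { row =>
  (row, bcast.value.length)
}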

Re: About nested RDD

2016-04-08 Thread Rishi Mishra
rdd.count() is a fairly straightforward operation which can be calculated on the driver, and then the value can be included in the map function. Is your goal to write a generic function which operates on two RDDs, one RDD being evaluated for each partition of the other? Here also you can use
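For the count case, a tiny sketch (rdd1 and rdd2 are hypothetical): the count is an ordinary action evaluated on the driver, and the resulting Long is simply captured by the map closure.

val total = rdd2.count()                    // runs on the cluster, result lands on the driver
val withCount = rdd1.map(x => (x, total))   // no nested RDD needed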

Re: Spark SQL Optimization

2016-03-22 Thread Rishi Mishra
What we have observed so far is that Spark picks the join order in the same order as the tables are specified in the FROM clause. Sometimes reordering benefits the join query. This could be an inbuilt optimization in Spark, but again it is not going to be straightforward, since rather than table size, the selectivity of

Re: sliding Top N window

2016-03-22 Thread Rishi Mishra
Hi Alexy, we are also trying to solve similar problems using approximation. I would like to hear more about your usage; we can discuss this offline without boring others. :) Regards, Rishitesh Mishra, SnappyData (http://www.snappydata.io/) https://in.linkedin.com/in/rishiteshmishra On Tue,

Re: Joins in Spark

2016-03-19 Thread Rishi Mishra
My suspicion is that your input files produce only a few partitions, hence a small number of tasks are started. Can you provide some more details, like how you load the files and how the result size comes to around 500 GB? Regards, Rishitesh Mishra, SnappyData (http://www.snappydata.io/)

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-18 Thread Rishi Mishra
Michael, is there any specific reason why DataFrames do not have partitioners like RDDs? This would be very useful if one is writing custom data sources which keep data in partitions. While storing data, one could pre-partition the data at the Spark level rather than at the data source. Regards,

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread Rishi Mishra
Unfortunately there is none, at least up to 1.5; I have not gone through the new Dataset API of 1.6. There is some basic support for Parquet, like partitionByColumn. If you want to partition your dataset in a certain way, you have to partition it as an RDD and convert that into a DataFrame before
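A sketch of that RDD-first approach, with a hypothetical pair RDD and case class:

import org.apache.spark.HashPartitioner

case class Record(key: Int, value: String)

// Partition at the RDD level, where a Hash/custom partitioner is available...
val partitioned = recordsRdd.partitionBy(new HashPartitioner(8))   // recordsRdd: RDD[(Int, String)]

// ...then convert to a DataFrame for the SQL side.
import sqlContext.implicits._
val df = partitioned.map { case (k, v) => Record(k, v) }.toDF()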

Re: SparkSQL parallelism

2016-02-11 Thread Rishi Mishra
I am not sure why all 3 nodes should query. If you have not specified any partitioning, there should be only one partition of the JDBCRDD, in which the whole dataset resides. On Fri, Feb 12, 2016 at 10:15 AM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote: > Hi, > > I have a spark cluster with One
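For contrast, a sketch of a partitioned JDBC read (URL, table, column and bounds are placeholders); without these options the source yields a single partition, with them the reads are spread across numPartitions tasks:

val props = new java.util.Properties()
props.setProperty("user", "test")        // placeholder credentials
props.setProperty("password", "test")

val df = sqlContext.read.jdbc(
  "jdbc:postgresql://host:5432/db",      // placeholder URL
  "my_table",
  "id",                                  // numeric column to partition on
  0L,                                    // lowerBound
  100000L,                               // upperBound
  3,                                     // numPartitions, e.g. one per node
  props)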

Re: Spark : Unable to connect to Oracle

2016-02-10 Thread Rishi Mishra
AFAIK sc.addJar() will add the jars to the executors' classpath. The data source resolution (createRelation) happens on the driver side, so the driver classpath should contain ojdbc6.jar. You can use the "spark.driver.extraClassPath" config parameter to set that. On Wed, Feb 10, 2016 at 3:08 PM, Jorge
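For example, the jar can be supplied at launch time with spark-submit --driver-class-path /path/to/ojdbc6.jar ... or the equivalent --conf spark.driver.extraClassPath=/path/to/ojdbc6.jar (the path is a placeholder); in client mode, setting it in SparkConf after the driver JVM has started will not help, since the driver classpath is fixed at launch.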

Re: spark.executor.memory ? is used just for cache RDD or both cache RDD and the runtime of cores on worker?

2016-02-04 Thread Rishi Mishra
You would probably like to see http://spark.apache.org/docs/latest/configuration.html#memory-management. Other config parameters are also explained there. On Fri, Feb 5, 2016 at 10:56 AM, charles li wrote: > if set spark.executor.memory = 2G for each worker [ 10 in

Re: Unit test with sqlContext

2016-02-04 Thread Rishi Mishra
Hi Steve, have you cleaned up your SparkContext (sc.stop()) in an afterAll()? The error suggests you are creating more than one SparkContext. On Fri, Feb 5, 2016 at 10:04 AM, Holden Karau wrote: > Thanks for recommending spark-testing-base :) Just wanted to add if
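A minimal sketch of that clean-up pattern, assuming ScalaTest's FunSuite and BeforeAndAfterAll:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class SqlContextSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()   // avoids "more than one SparkContext" errors in later suites
  }

  test("sqlContext is usable") {
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    assert(sqlContext.sparkContext eq sc)
  }
}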

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Rishi Mishra
I agree with Koert that UnionRDD should have narrow dependencies, although a union of two RDDs increases the number of tasks to be executed (rdd1.partitions + rdd2.partitions). If your two RDDs have the same number of partitions, you can also use zipPartitions, which causes a smaller number of tasks,
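A short sketch of the zipPartitions alternative, assuming rdd1 and rdd2 are hypothetical RDDs of the same element type with an equal number of partitions:

// One task per partition pair, instead of one task per partition of each RDD.
val merged = rdd1.zipPartitions(rdd2) { (it1, it2) => it1 ++ it2 }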

Re: How to use collections inside foreach block

2015-12-09 Thread Rishi Mishra
Your list is defined on the driver, whereas the function specified in foreach will be evaluated on each executor. You might want to use an accumulator, or gather a sequence of results from each partition. On Wed, Dec 9, 2015 at 11:19 AM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote: > Hi, > >
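A sketch of the per-partition alternative (rdd and the per-element work are hypothetical): compute the values inside the executors and bring them back with collect(), instead of mutating a driver-side list from within foreach.

val gathered = rdd.mapPartitions { iter =>
  val local = scala.collection.mutable.ArrayBuffer[Int]()
  iter.foreach(x => local += x * 2)      // placeholder per-element work
  local.iterator
}.collect()                              // the driver now holds the combined results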

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-02 Thread Rishi Mishra
As long as all your data is being inserted by Spark, and hence uses the same hash partitioner, what Fengdong mentioned should work. On Wed, Dec 2, 2015 at 9:32 AM, Fengdong Yu wrote: > Hi > you can try: > > if your table is under the location “/test/table/” on HDFS > and has

Re: Join and HashPartitioner question

2015-11-16 Thread Rishi Mishra
AFAIK, and as I can see in the code, both of them should behave the same. On Sat, Nov 14, 2015 at 2:10 AM, Alexander Pivovarov wrote: > Hi Everyone > > Is there any difference in performance btw the following two joins? > > > val r1: RDD[(String, String)] = ??? > val r2: RDD[(String,