Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-20 Thread Petr Novak
Hi Michal, yes, it is logged twice there. It can be seen in the log attached to one of the previous posts, which has more details: 15/09/17 23:06:37 INFO StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook 15/09/17 23:06:37 INFO StreamingContext: Invoking stop(stopGracefully=false) from

Re: Using Spark for portfolio manager app

2015-09-20 Thread Jörn Franke
I think generally the way forward would be to put aggregate statistics into external storage (e.g. HBase); it should not have that much influence on latency. You will probably need it anyway if you need to store historical information. With respect to deltas: always a tricky topic. You may want to work

Re: in joins, does one side stream?

2015-09-20 Thread Rishitesh Mishra
Got it..thnx Reynold.. On 20 Sep 2015 07:08, "Reynold Xin" wrote: > The RDDs themselves are not materialized, but the implementations can > materialize. > > E.g. in cogroup (which is used by RDD.join), it materializes all the data > during grouping. > > In SQL/DataFrame

Re: Using Spark for portfolio manager app

2015-09-20 Thread Huy Banh
Hi Thuy, You can check RDD.lookup(). It requires that the RDD be partitioned and, of course, cached in memory. Or you may consider a distributed cache such as Ehcache or AWS ElastiCache. I think external storage is an option, too, especially NoSQL databases: they can handle updates at high speed, at

Re: Kafka createDirectStream issue

2015-09-20 Thread Petr Novak
val topics="first" shouldn't it be val topics = Set("first") ? On Sun, Sep 20, 2015 at 1:01 PM, Petr Novak wrote: > val topics="first" > > shouldn't it be val topics = Set("first") ? > > On Sat, Sep 19, 2015 at 10:07 PM, kali.tumm...@gmail.com < > kali.tumm...@gmail.com>

Re: in joins, does one side stream?

2015-09-20 Thread Reynold Xin
We do - but I don't think it is feasible to duplicate every single algorithm in DF and in RDD. The only way for this to work is to make one underlying implementation work for both. Right now DataFrame knows how to serialize individual elements well and can manage memory that way -- the RDD API

Problem at sbt/sbt assembly

2015-09-20 Thread Aaroncq4
When I used "sbt/sbt assembly" to compile the spark-1.5.0 source code, I ran into a problem and do not know why. It reports: NOTE: The sbt/sbt script has been relocated to build/sbt. Please update references to point to the new location. Invoking 'build/sbt assembly' now ... Using

Re: Problem at sbt/sbt assembly

2015-09-20 Thread Ted Yu
Have you seen this thread: http://search-hadoop.com/m/q3RTtVJJ3I15OJ251 Cheers On Sun, Sep 20, 2015 at 6:11 PM, Aaroncq4 <475715...@qq.com> wrote: > When I used “sbt/sbt assembly" to compile spark code of spark-1.5.0,I got a > problem and I did not know why.It signs that: > > NOTE: The sbt/sbt

Re: word count (group by users) in spark

2015-09-20 Thread Huy Banh
Hi, If your input format is user -> comment, then you could: val comments = sc.parallelize(List(("u1", "one two one"), ("u2", "three four three"))) val wordCounts = comments. flatMap({case (user, comment) => for (word <- comment.split(" ")) yield(((user, word), 1)) }).
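The snippet above is cut off after the flatMap step, so the rest of the original pipeline is not shown. As a rough sketch of the same idea, the following uses plain Scala collections in place of an RDD so that it runs without a Spark cluster. The object name WordCountByUser and the helper countWords are hypothetical and not from the post. With an RDD of ((user, word), 1) pairs, the grouping-and-summing step would presumably be a reduceByKey(_ + _), but that is an assumption about the truncated part.

```scala
// Plain-Scala sketch: a local Seq stands in for the RDD used in the post.
// WordCountByUser and countWords are illustrative names, not from the thread.
object WordCountByUser {

  // Count how often each user wrote each word.
  def countWords(comments: Seq[(String, String)]): Map[(String, String), Int] =
    comments
      .flatMap { case (user, comment) =>
        // Emit one ((user, word), 1) pair per word, mirroring the post's flatMap.
        comment.split(" ").filter(_.nonEmpty).map(word => ((user, word), 1)).toSeq
      }
      // Group identical (user, word) keys and sum their counts.
      // (Assumed equivalent of reduceByKey(_ + _) on a pair RDD.)
      .groupBy { case (key, _) => key }
      .map { case (key, pairs) => key -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val comments = List(("u1", "one two one"), ("u2", "three four three"))
    // Print the counts in a stable order.
    countWords(comments).toSeq.sortBy(_._1).foreach {
      case ((user, word), n) => println(s"$user $word $n")
    }
  }
}
```

Keying on the (user, word) pair keeps the counts separate per user, which is what distinguishes this from a global word count.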

Re: PrunedFilteredScan does not work for UDTs and Struct fields

2015-09-20 Thread Richard Eggert
Having to restructure my queries isn't a very satisfactory solution, unfortunately. I did notice that if I implement the CatalystScan interface instead, then the filters DO get passed in, but the column identifiers would need to be translated somewhat to be usable, so that's another option.

Re: in joins, does one side stream?

2015-09-20 Thread Koert Kuipers
why dont we want these (broadcast join and sort-merge join) in DataFrame but not in RDD? they dont seem specific to structured data analysis to me. On Sun, Sep 20, 2015 at 2:41 AM, Rishitesh Mishra wrote: > Got it..thnx Reynold.. > On 20 Sep 2015 07:08, "Reynold Xin"

Re: in joins, does one side stream?

2015-09-20 Thread Koert Kuipers
sorry that was a typo. i meant to say: why do we have these features (broadcast join and sort-merge join) in DataFrame but not in RDD? they don't seem specific to structured data analysis to me. thanks! koert On Sun, Sep 20, 2015 at 2:46 PM, Koert Kuipers wrote: > why dont