Re: Different watermark for different kafka partitions in Structured Streaming

2017-09-01 Thread Ryan
I don't think Structured Streaming currently supports a "partitioned" watermark. And why would different partitions' consumption rates vary? If the handling logic is quite different, using different topics is a better way.

Re: Spark GroupBy Save to different files

2017-09-01 Thread Ryan
You may try foreachPartition.

[SPARK-SQL] Spark Persist slower than non-persist calls

2017-09-01 Thread sfbayeng
My settings are: Spark 2.1 on a 3-node YARN cluster with 160 GB, with dynamic allocation turned on, spark.executor.memory=6G, and spark.executor.cores=6. First, I read the Hive tables orders (329 MB) and lineitems (1.43 GB) and do a left outer join. Next, I apply 7 different filter conditions based

Re: update hive metastore in spark session at runtime

2017-09-01 Thread ayan guha
AFAIK, one of the sides must be JDBC.

Is watermark always set using processing time or event time or both?

2017-09-01 Thread kant kodali
Is watermark always set using processing time or event time or both?

Re: [Spark Streaming] Streaming Dynamic Allocation is broken (at least on YARN)

2017-09-01 Thread Karthik Palaniappan
Any ideas @Tathagata? I'd be happy to contribute a patch if you can point me in the right direction.

From: Karthik Palaniappan
Sent: Friday, August 25, 2017 9:15 AM
To: Akhil Das
Cc: user@spark.apache.org; t...@databricks.com
Subject: RE:

Re: isCached

2017-09-01 Thread Nathan Kronenfeld
Thanks for the info.

Re: isCached

2017-09-01 Thread Nick Pentreath
No, unfortunately not - as I recall, storageLevel accesses some private methods to get the result.

Re: isCached

2017-09-01 Thread Nathan Kronenfeld
Ah, in 2.1.0. I'm on 2.0.1 at the moment... is there any way that works that far back?

Re: isCached

2017-09-01 Thread Nick Pentreath
Dataset does have storageLevel, so you can use isCached = (storageLevel != StorageLevel.NONE) as a test. Arguably, isCached could be added to Dataset too; it shouldn't be a controversial change.
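Nick's suggested test can be wrapped in a tiny helper. A minimal sketch of the pattern, written against a stand-in for Dataset.storageLevel so it runs without a Spark session (StorageLevel.NONE mirrors the real Spark constant; the FakeDataset type is purely illustrative and not a Spark API):

```scala
// Stand-ins for org.apache.spark.storage.StorageLevel and Dataset.storageLevel,
// so the pattern can be shown without a running Spark session.
object StorageLevel {
  val NONE        = "NONE"
  val MEMORY_ONLY = "MEMORY_ONLY"
}
case class FakeDataset(storageLevel: String)

// The suggested test: a Dataset counts as cached iff its storage level is not NONE.
def isCached(ds: FakeDataset): Boolean = ds.storageLevel != StorageLevel.NONE

val uncached = FakeDataset(StorageLevel.NONE)
val cached   = FakeDataset(StorageLevel.MEMORY_ONLY)
```

Against a real Dataset the same one-liner applies: `ds.storageLevel != StorageLevel.NONE`.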

isCached

2017-09-01 Thread Nathan Kronenfeld
I'm currently porting some of our code from RDDs to Datasets. With RDDs, it's pretty easy to figure out whether they are cached or not. I notice that the catalog has a function for determining this on Datasets too, but it's private[sql]. Is there any reason for it not to be public? Is there any way

Spark GroupBy Save to different files

2017-09-01 Thread asethia
Hi,

I have a list of person records in the following format:

case class Person(fName:String, city:String)

val l=List(Person("A","City1"),Person("B","City2"),Person("C","City1"))
val rdd:RDD[Person]=sc.parallelize(l)
val groupBy:RDD[(String, Iterable[Person])]=rdd.groupBy(_.city)

I would like to
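Outside Spark, the shape of the answer is simply: group by the key, then write each group to its own file. A plain-Scala sketch of that idea (the file names and temp directory are illustrative; in Spark itself you would do the writing inside foreachPartition, as suggested above, or use the DataFrame writer's partitionBy):

```scala
import java.nio.file.Files

case class Person(fName: String, city: String)

val l = List(Person("A", "City1"), Person("B", "City2"), Person("C", "City1"))

// Group records by city, as rdd.groupBy(_.city) would.
val byCity: Map[String, List[Person]] = l.groupBy(_.city)

// Write one file per city; each line holds one person's first name.
val outDir = Files.createTempDirectory("by-city")
byCity.foreach { case (city, people) =>
  val lines = people.map(_.fName).mkString("\n")
  Files.write(outDir.resolve(s"$city.txt"), lines.getBytes("UTF-8"))
}
```

With the DataFrame API, `df.write.partitionBy("city")` achieves the same one-directory-per-key layout without hand-written file I/O.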

update hive metastore in spark session at runtime

2017-09-01 Thread HARSH TAKKAR
Hi, I have just started using SparkSession with Hive enabled, but I am facing an issue while updating the Hive warehouse directory post SparkSession creation. Use case: I want to read data from Hive on one cluster and write to Hive on another cluster. Please suggest if this can be done?

Re: Different watermark for different kafka partitions in Structured Streaming

2017-09-01 Thread 张万新
Thanks, it's true that a looser watermark can guarantee more data is not dropped, but at the same time more state needs to be kept. I just wonder whether something like Flink's Kafka-partition-aware watermark might be a better solution in Structured Streaming.
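For reference, the Flink behavior being asked about tracks a watermark per Kafka partition and takes the minimum across partitions, so one slow partition holds the global watermark back instead of having its events dropped as late. A minimal plain-Scala sketch of that min-of-partitions rule (the partition numbers, timestamps, and delay are made up for illustration; this is not a Spark or Flink API):

```scala
// Per-partition maximum observed event times (epoch millis), e.g. from a
// Kafka topic with three partitions where partition 2 lags behind.
val maxEventTimePerPartition = Map(0 -> 100000L, 1 -> 98000L, 2 -> 60000L)

// Allowed out-of-orderness within a partition (the watermark delay).
val delayMs = 5000L

// Each partition's watermark is its max observed event time minus the delay.
val partitionWatermarks =
  maxEventTimePerPartition.map { case (p, t) => p -> (t - delayMs) }

// The global watermark is the minimum across partitions, so the slowest
// partition (here partition 2) keeps its events from being dropped.
val globalWatermark = partitionWatermarks.values.min
// globalWatermark == 55000L
```

Spark's withWatermark, by contrast, maintains a single event-time watermark per streaming query, which is why a lagging partition there must be accommodated with a looser (larger) delay.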