[Structured Streaming] How to replay data and overwrite using FileSink

2017-09-20 Thread Bandish Chheda
Hi, We are using Structured Streaming (Spark 2.2.0) for processing data from Kafka. We read from a Kafka topic, do some conversions and computation, and then use FileSink to store data to a partitioned path in HDFS. We have enabled checkpointing (using a dir in HDFS). For cases when there is a bad code
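For context, a minimal PySpark sketch of the kind of pipeline described above; the topic, brokers, schema, and HDFS paths are hypothetical, and the checkpoint directory is what ties a restarted query back to the offsets it has already committed:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Hypothetical topic and brokers.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "events")
           .load())

    # Hypothetical conversion: keep the payload and derive a partition column.
    parsed = raw.select(F.col("value").cast("string").alias("payload"),
                        F.to_date(F.col("timestamp")).alias("dt"))

    query = (parsed.writeStream
             .format("parquet")                                   # the file sink
             .option("path", "hdfs:///data/events")               # partitioned output path
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .partitionBy("dt")
             .outputMode("append")
             .start())

Replaying from Kafka after a code fix generally means starting with a fresh checkpoint directory (and cleaning or switching the output path), since both the checkpoint and the file sink's metadata log record which batches were already committed.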

How to get time slice or the batch time for which the current micro batch is running in Spark Streaming

2017-09-20 Thread SRK
Hi, How do I get the time slice or the batch time for which the current micro batch is running in Spark Streaming? Currently I am using the system time, which causes the key-clearing behavior of reduceByKeyAndWindow to not work properly. Thanks, Swetha
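A hedged sketch (in PySpark, with a hypothetical socket source and batch interval) of reading the batch time from the DStream API instead of the system clock:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="batch-time-example")
    ssc = StreamingContext(sc, 10)                    # hypothetical 10-second batch interval
    lines = ssc.socketTextStream("localhost", 9999)   # hypothetical source

    # foreachRDD accepts a two-argument function; `time` is the batch time the
    # micro-batch was scheduled for, not the wall-clock time at execution.
    def process(time, rdd):
        print("batch time:", time)

    lines.foreachRDD(process)
    ssc.start()
    ssc.awaitTermination()

In the Scala/Java API the same batch time is available as the Time argument of the two-argument forms of foreachRDD and transform.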

Re: Is there a SparkILoop for Java?

2017-09-20 Thread kant kodali
Is there an API like SparkILoop for Python? On Wed, Sep 20, 2017 at 7:17 AM, Jean Georges Perrin wrote: > It all depends what you want to do :) - we've implemented some dynamic > interpretation of
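A rough sketch of the same idea in Python, using only the standard library's code module rather than any Spark API (the code string passed in is hypothetical):

    import code
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("embedded-repl").getOrCreate()

    # Plain-Python interpreter, not a Spark API; `spark` is injected into its
    # namespace so submitted code strings can use the session directly.
    interp = code.InteractiveInterpreter(locals={"spark": spark})
    snippet = 'spark.range(5).show()'   # hypothetical code string received at runtime
    interp.runsource(snippet)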

Re: Spark code to get select fields from ES

2017-09-20 Thread Jean Georges Perrin
Same issue with RDBMS ingestion (I think). I solved it with views. Can you do views on ES? jg > On Sep 20, 2017, at 09:22, Kedarnath Dixit > wrote: > > Hi, > > I want to get only select fields from ES using Spark ES connector. > > I have done some code
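For the original question, a minimal PySpark sketch of restricting the fields read through the ES connector; the host, index, and field names are hypothetical, and it assumes the elasticsearch-hadoop package is on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("es-select").getOrCreate()

    # es.read.field.include limits which document fields the connector fetches.
    df = (spark.read
          .format("org.elasticsearch.spark.sql")
          .option("es.nodes", "localhost")
          .option("es.read.field.include", "name,age")
          .load("my_index/my_type"))

    df.select("name", "age").show()   # a plain projection can also be pushed down by the connector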

Re: graphframes on cluster

2017-09-20 Thread Felix Cheung
Could you include the code where it fails? Generally the best way to use gf is to use the --packages option with the spark-submit command. From: Imran Rajjad Sent: Wednesday, September 20, 2017 5:47:27 AM To: user @spark Subject: graphframes on
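A hedged sketch of that approach; the package coordinate and the job code are illustrative only (the graphframes version has to match your Spark and Scala build):

    # Hypothetical submit command; --packages ships the jar to the driver and executors.
    #   spark-submit --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11 gf_job.py

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame   # resolvable only when the package is on the classpath

    spark = SparkSession.builder.appName("gf-example").getOrCreate()
    vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
    edges = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)
    g.inDegrees.show()   # actions run on the executors, so they need the jar as well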

Re: Pyspark define UDF for windows

2017-09-20 Thread Weichen Xu
A UDF cannot be used as a window function. You can use a built-in window function or a UDAF. On Wed, Sep 20, 2017 at 7:23 PM, Simon Dirmeier wrote: > Dear all, > I am trying to partition a DataFrame into windows and then for every > column and window use a custom function (udf)
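One workaround that stays within this constraint is to run a built-in aggregate such as collect_list over the window and then apply an ordinary UDF to the collected array. A sketch with toy data (a simple median stands in for the poster's median-polish):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("window-udf-workaround").getOrCreate()

    df = spark.createDataFrame(
        [("a", 1, 1.0), ("a", 2, 2.0), ("a", 3, 9.0), ("b", 1, 4.0)],
        ["grp", "t", "x"])

    w = (Window.partitionBy("grp").orderBy("t")
         .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

    # collect_list is a built-in aggregate that works over a window; the UDF then
    # post-processes the collected values row by row.
    median_udf = F.udf(lambda xs: float(sorted(xs)[len(xs) // 2]), DoubleType())

    result = (df.withColumn("vals", F.collect_list("x").over(w))
                .withColumn("win_median", median_udf("vals")))
    result.show()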

Re: Is there a SparkILoop for Java?

2017-09-20 Thread kant kodali
Got it! That's what I thought. Java 9 is going to be released tomorrow (http://www.java9countdown.xyz/), which has the REPL called JShell. On Wed, Sep 20, 2017 at 5:45 AM, Weichen Xu wrote: > I haven't heard of that. It seems that Java does not have an official REPL. > > On Wed,

graphframes on cluster

2017-09-20 Thread Imran Rajjad
Trying to run graphframes on a Spark cluster. Do I need to include the package in the Spark context settings, or is only the driver program supposed to have the graphframes libraries on its classpath? Currently the job crashes when an action is invoked on graphframe classes. regards,

Re: Is there a SparkILoop for Java?

2017-09-20 Thread Weichen Xu
I haven't heard of that. It seems that Java does not have an official REPL. On Wed, Sep 20, 2017 at 8:38 PM, kant kodali wrote: > Hi All, > > I am wondering if there is a SparkILoop > > for >

Is there a SparkILoop for Java?

2017-09-20 Thread kant kodali
Hi All, I am wondering if there is a SparkILoop for Java so I can pass Java code as a string to the REPL? Thanks!

Re: for loops in pyspark

2017-09-20 Thread Weichen Xu
Spark manages memory allocation and release automatically. Can you post the complete program? That would help in checking where it goes wrong. On Wed, Sep 20, 2017 at 8:12 PM, Alexander Czech < alexander.cz...@googlemail.com> wrote: > Hello all, > > I'm running a pyspark script that makes use of a for loop to

for loops in pyspark

2017-09-20 Thread Alexander Czech
Hello all, I'm running a pyspark script that makes use of a for loop to create smaller chunks of my main dataset. Some example code: for chunk in chunks: my_rdd = sc.parallelize(chunk).flatMap(somefunc) # do some stuff with my_rdd my_df = make_df(my_rdd) # do some stuff with
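A hedged, runnable version of that loop (the chunking, helper functions, and output path are hypothetical), with an explicit unpersist so cached blocks from one iteration do not linger into the next:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("chunked-loop").getOrCreate()
    sc = spark.sparkContext

    def somefunc(x):                  # hypothetical per-record expansion
        return [(x, x * 2)]

    def make_df(rdd):                 # hypothetical RDD -> DataFrame helper
        return spark.createDataFrame(rdd, ["k", "v"])

    chunks = [list(range(0, 100)), list(range(100, 200))]   # hypothetical chunks

    for chunk in chunks:
        my_rdd = sc.parallelize(chunk).flatMap(somefunc)
        my_rdd.cache()                            # only worthwhile if the RDD is reused
        my_df = make_df(my_rdd)
        my_df.write.mode("append").parquet("/tmp/chunk_output")   # hypothetical sink
        my_rdd.unpersist()                        # release the cached blocks before the next pass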

Pyspark define UDF for windows

2017-09-20 Thread Simon Dirmeier
Dear all, I am trying to partition a DataFrame into windows and then, for every column and window, use a custom function (udf) through Spark's Python interface. Within that function I cast a column of a window into an m x n matrix to do a median-polish and afterwards return a list again. This

Spark Streaming + Kafka + Hive: delayed

2017-09-20 Thread toletum
Hello. I have a process (Python) that reads a Kafka queue and, for each record, checks it against a table. # Load table in memory table=sqlContext.sql("select id from table") table.cache() kafkaTopic.foreachRDD(processForeach) def processForeach(time, rdd): print(time) for k in rdd.collect(): if
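Calling back into Spark for every record of every batch is usually what makes this slow. A hedged sketch of one alternative (topic, brokers, and record handling are hypothetical): pull the id column to the driver once, broadcast it, and make the per-record check a local set lookup.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils   # needs the spark-streaming-kafka package

    sc = SparkContext(appName="kafka-lookup")
    sqlContext = SQLContext(sc)
    ssc = StreamingContext(sc, 10)

    # Collect the lookup ids once and broadcast them to the executors.
    ids = set(r["id"] for r in sqlContext.sql("select id from table").collect())
    ids_bc = sc.broadcast(ids)

    kafkaTopic = KafkaUtils.createDirectStream(
        ssc, ["topic"], {"metadata.broker.list": "broker:9092"})

    def processForeach(time, rdd):
        print(time)
        # Records from the direct stream are (key, value) pairs; the check is now
        # a local set membership test instead of a Spark lookup per record.
        matched = [v for (_, v) in rdd.collect() if v in ids_bc.value]
        # ... act on `matched`

    kafkaTopic.foreachRDD(processForeach)
    ssc.start()
    ssc.awaitTermination()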

Re: Structured streaming coding question

2017-09-20 Thread kant kodali
Just tried with sparkSession.streams().awaitAnyTermination(); and that's the only await* call I had, and it works! But what if I don't want all my queries to fail or stop making progress if one of them fails? On Wed, Sep 20, 2017 at 2:26 AM, kant kodali wrote: > Hi Burak, > > Are
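One way to keep the driver alive and keep reacting after any single query stops (a PySpark sketch rather than the Java used in this thread, assuming a SparkSession named spark with the queries already started):

    while True:
        spark.streams.awaitAnyTermination()    # returns when some query terminates
        for q in spark.streams.active:         # the surviving queries keep running
            print("still running:", q.name)
        spark.streams.resetTerminated()        # forget the stopped query, then wait again
        # ... optionally restart the failed query here before looping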

Re: Structured streaming coding question

2017-09-20 Thread kant kodali
Hi Burak, Are you saying get rid of both query1.awaitTermination(); query2.awaitTermination(); and just have the line below? sparkSession.streams().awaitAnyTermination(); Thanks! On Wed, Sep 20, 2017 at 12:51 AM, kant kodali wrote: > If I don't get anywhere after

Re: Structured streaming coding question

2017-09-20 Thread kant kodali
If execution doesn't get past query1.awaitTermination(); then I cannot put sparkSession.streams().awaitAnyTermination(); as the last line of code, right? Like below: query1.awaitTermination(); sparkSession.streams().awaitAnyTermination(); On Wed, Sep 20, 2017 at 12:07 AM, Burak Yavuz

Re: Structured streaming coding question

2017-09-20 Thread Burak Yavuz
Please remove query1.awaitTermination(); and query2.awaitTermination(). Once query1.awaitTermination(); is called, you don't even get to query2.awaitTermination(). On Tue, Sep 19, 2017 at 11:59 PM, kant kodali wrote: > Hi Burak, > > Thanks much! had no clue that existed.
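A minimal PySpark sketch of the layout being recommended (rate/console sources and sinks stand in for the thread's Kafka pipelines): start both queries, skip the per-query awaitTermination() calls, and block once at the very end.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("two-queries").getOrCreate()

    src = spark.readStream.format("rate").load()   # placeholder streaming source

    query1 = src.writeStream.format("console").queryName("q1").start()
    query2 = src.writeStream.format("console").queryName("q2").start()

    # No query1.awaitTermination() / query2.awaitTermination() in between;
    # a single blocking call at the end lets both queries run concurrently.
    spark.streams.awaitAnyTermination()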

Re: Structured streaming coding question

2017-09-20 Thread kant kodali
Hi Burak, Thanks much! I had no clue that existed. Now, I changed it to this. StreamingQuery query1 = outputDS1.writeStream().trigger(Trigger.ProcessingTime(1000)).foreach(new KafkaSink("hello1")).start(); StreamingQuery query2 =

Re: Structured streaming coding question

2017-09-20 Thread Burak Yavuz
Hey Kant, That won't work either. Your second query may fail, and as long as your first query is running, you will not know. Put this as the last line instead: spark.streams.awaitAnyTermination() On Tue, Sep 19, 2017 at 10:11 PM, kant kodali wrote: > Looks like my problem