Streaming write to orc problem

2022-04-22 Thread hsy...@gmail.com
Hello all, I'm trying to build a pipeline that reads data from a streaming source and writes it to ORC files. But I don't see any files written to the file system, nor any exceptions. Here is an example: val df = spark.readStream.format("...") .option("Topic", "Some
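
A frequent reason a structured-streaming job writes nothing (and throws nothing) is that the query is never actually started with .start(), the driver exits before .awaitTermination(), or the file sink is missing the checkpointLocation it requires. A hedged PySpark sketch of the intended pipeline; the source format, topic, and paths are placeholders, not the poster's actual configuration:

```python
# Sketch only: "kafka", "some-topic", and the paths are assumptions.
df = (spark.readStream
      .format("kafka")                       # assumed streaming source
      .option("subscribe", "some-topic")
      .load())

query = (df.writeStream
         .format("orc")
         .option("path", "/tmp/orc-out")
         .option("checkpointLocation", "/tmp/orc-ckpt")  # file sinks require this
         .start())                           # without start(), nothing runs

query.awaitTermination()                     # keep the driver alive
```

If start() is called but no checkpointLocation is set, the file sink fails at start time; if start()/awaitTermination() are missing, the pipeline is defined but never executed, which matches "no files and no exceptions".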

I got weird error from a join

2018-02-21 Thread hsy...@gmail.com
from pyspark.sql import Row
A_DF = sc.parallelize([Row(id='A123', name='a1'), Row(id='A234', name='a2')]).toDF()
B_DF = sc.parallelize([Row(id='A123', pid='A234', ename='e1')]).toDF()
join_df = B_DF.join(A_DF, B_DF.id == A_DF.id).drop(B_DF.id)
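
The usual cause of errors with this shape of code is that after the join the result carries two columns named id, and drop(B_DF.id) can then be ambiguous. In PySpark, joining on the column name (B_DF.join(A_DF, 'id')) keeps a single id column. A plain-Python analogue of that name-based join; join_on is a hypothetical helper, not a Spark API:

```python
def join_on(left, right, key):
    """Inner-join two lists of dicts on `key`, emitting a single key
    column (mirrors DataFrame.join(other, 'key'), which avoids the
    duplicate-column ambiguity of join(other, left.key == right.key))."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            merged = dict(match)
            merged.update(row)   # left-side values win on name collisions
            joined.append(merged)
    return joined

B = [{'id': 'A123', 'pid': 'A234', 'ename': 'e1'}]
A = [{'id': 'A123', 'name': 'a1'}, {'id': 'A234', 'name': 'a2'}]
rows = join_on(B, A, 'id')
# One row: B's A123 matched against A's A123, with a single 'id' key.
```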

Questions about using pyspark 2.1.1 pushing data to kafka

2018-01-23 Thread hsy...@gmail.com
I have questions about using PySpark 2.1.1 to push data to Kafka. I don't see any PySpark Streaming API to write data directly to Kafka; if there is one, or an example, please point me to the right page. I implemented my own approach, which uses a global Kafka producer and pushes the data picked from
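
A global producer created on the driver generally cannot be shipped to executors (producer objects are not picklable), so the usual pattern is to create one producer per partition inside foreachPartition and reuse it for every record in that partition. A plain-Python sketch of the pattern, with a stub producer standing in for a real client such as kafka-python's KafkaProducer (all names here are illustrative):

```python
class StubProducer:
    """Stands in for a real Kafka producer; records sends instead of networking."""
    def __init__(self):
        self.sent = []
    def send(self, topic, value):
        self.sent.append((topic, value))
    def flush(self):
        pass

def send_partition(records, topic, make_producer=StubProducer):
    # One producer per partition: it is created on the executor, so no
    # unpicklable object crosses the driver/executor boundary.
    producer = make_producer()
    for rec in records:
        producer.send(topic, rec)
    producer.flush()
    return producer.sent

# In PySpark this would be invoked roughly as:
#   rdd.foreachPartition(lambda it: send_partition(it, "events"))
sent = send_partition(iter([b"a", b"b"]), "events")
```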

Question about accumulator

2018-01-23 Thread hsy...@gmail.com
I have a small application like this:

acc = sc.accumulator(5)  # note: sc.accumulate does not exist; accumulator() is the PySpark API
def t_f(x):
    global acc
    sleep(5)
    acc += x
def f(x):
    global acc
    thread = Thread(target=t_f, args=(x,))
    thread.start()
    # thread.join()  # without this it doesn't work
rdd = sc.parallelize([1,2,4,1])
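
The join()-dependence in this snippet can be reproduced without Spark at all: the worker function delays before updating the shared value, so unless the spawning code joins the thread, the caller reads the value before the update has landed. A minimal stdlib analogue (an Event replaces the sleep so the timing is explicit):

```python
import threading

total = 0

def t_f(x, gate):
    global total
    gate.wait()              # the update is delayed until the gate opens
    total += x

def run(x, join_before_reading):
    global total
    total = 0
    gate = threading.Event()
    t = threading.Thread(target=t_f, args=(x, gate))
    t.start()
    gate.set()               # allow the update to proceed
    if join_before_reading:
        t.join()             # wait for the update to land, then read
        return total
    value = total            # may read before the thread has run at all
    t.join()                 # clean up the thread afterwards
    return value

with_join = run(4, True)     # always sees 4
without_join = run(4, False) # may see 0 or 4, depending on scheduling
```

The same race explains the Spark version: f() returns as soon as the thread is started, the task finishes, and the accumulator update from the still-sleeping thread is lost; thread.join() forces the task to wait for it.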

Re: How to do an interactive Spark SQL

2014-07-23 Thread hsy...@gmail.com
Does anyone have any idea on this? On Tue, Jul 22, 2014 at 7:02 PM, hsy...@gmail.com hsy...@gmail.com wrote: But how do they do the interactive SQL in the demo? https://www.youtube.com/watch?v=dJQ5lV5Tldw And if it can work in local mode, I think it should be able to work in cluster mode

How to do an interactive Spark SQL

2014-07-22 Thread hsy...@gmail.com
Hi guys, I'm able to run some Spark SQL examples, but the SQL is static in the code. I would like to know whether there is a way to read SQL from somewhere else (a shell, for example). I could read the SQL statement from Kafka/ZooKeeper, but I cannot share the SQL with all workers. Broadcast seems not to work for

Re: How to do an interactive Spark SQL

2014-07-22 Thread hsy...@gmail.com
in the code? What do you mean by "cannot share the SQL to all workers"? On Tue, Jul 22, 2014 at 4:03 PM, hsy...@gmail.com hsy...@gmail.com wrote: Hi guys, I'm able to run some Spark SQL examples, but the SQL is static in the code. I would like to know whether there is a way to read SQL from somewhere

Re: How to do an interactive Spark SQL

2014-07-22 Thread hsy...@gmail.com
)) }) ssc.start() ssc.awaitTermination() On Tue, Jul 22, 2014 at 5:10 PM, Zongheng Yang zonghen...@gmail.com wrote: Can you paste a small code example to illustrate your questions? On Tue, Jul 22, 2014 at 5:05 PM, hsy...@gmail.com hsy...@gmail.com wrote: Sorry, typo. What I mean

Re: How to do an interactive Spark SQL

2014-07-22 Thread hsy...@gmail.com
after the StreamingContext has started. Tobias On Wed, Jul 23, 2014 at 9:55 AM, hsy...@gmail.com hsy...@gmail.com wrote: For example, this is what I tested, and it works in local mode. What it does is get both the data and the SQL query from Kafka, run the SQL on each RDD, and output the result back

Re: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-07-21 Thread hsy...@gmail.com
I have the same problem. On Sat, Jul 19, 2014 at 12:31 AM, lihu lihu...@gmail.com wrote: Hi, everyone. I have the following piece of code. When I run it, the error below occurred; it seems that the SparkContext is not serializable, but I do not try to use the SparkContext

Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-17 Thread hsy...@gmail.com
Thanks Tathagata. So can I say the RDD size (from the stream) is the window size, and the overlap between 2 adjacent RDDs is the slide size? But I still don't understand what the batch size is; why do we need it, since data is processed RDD by RDD, right? And does Spark chop the data into RDDs at the very
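
The relationship the thread is circling around: batchDuration is how the incoming stream is chopped into RDDs; windowDuration says how many of those batches a windowed RDD spans; slideDuration says how often a windowed result is produced. Both window and slide must be whole multiples of the batch duration. A small sketch that computes which batch indices each windowed RDD covers, with durations expressed in units of the batch:

```python
def window_batches(num_batches, window, slide):
    """Return the batch indices each windowed RDD covers, where `window`
    and `slide` are given in units of the batch duration (both must be
    whole multiples of it, as in Spark Streaming's window())."""
    windows = []
    end = slide
    while end <= num_batches:
        start = max(0, end - window)
        windows.append(list(range(start, end)))
        end += slide
    return windows

# 6 batches, window = 3 batches, slide = 2 batches:
# adjacent windows overlap by window - slide = 1 batch.
result = window_batches(6, 3, 2)
```

So the batch is the unit of chopping; the window is how much history each computation sees; the slide is how often the computation fires.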

Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread hsy...@gmail.com
While reading the Spark Streaming API, I'm confused by the 3 different durations: StreamingContext(conf: SparkConf http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html , batchDuration: Duration

Re: How to kill running spark yarn application

2014-07-15 Thread hsy...@gmail.com
reproduce it. On Mon, Jul 14, 2014 at 7:36 PM, hsy...@gmail.com hsy...@gmail.com wrote: Before yarn application -kill, if you run jps you'll see both SparkSubmit and ApplicationMaster listed. After you use yarn application -kill, you only kill the SparkSubmit. On Mon, Jul 14, 2014 at 4:29 PM

Re: SQL + streaming

2014-07-15 Thread hsy...@gmail.com
sure it works, and you see output? Also, I recommend going through the previous step-by-step approach to narrow down where the problem is. TD On Mon, Jul 14, 2014 at 9:15 PM, hsy...@gmail.com hsy...@gmail.com wrote: Actually, I deployed this on yarn cluster(spark-submit) and I couldn't find

Re: SQL + streaming

2014-07-15 Thread hsy...@gmail.com
anything in the driver logs! So try doing a collect or take on the RDD returned by the SQL query and print that. TD On Tue, Jul 15, 2014 at 4:28 PM, hsy...@gmail.com hsy...@gmail.com wrote: By the way, have you ever run SQL and streaming together? Do you know any example that works? Thanks! On Tue

How to kill running spark yarn application

2014-07-14 Thread hsy...@gmail.com
Hi all, a newbie question: I start a Spark YARN application through spark-submit. How do I kill this app? I can kill the YARN app with yarn application -kill appid, but the application master is still running. What's the proper way to shut down the entire app? Best, Siyuan
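
As the rest of this thread works out, the behavior depends on the deploy mode: in yarn-cluster mode the driver runs inside the ApplicationMaster, so the YARN kill takes everything down; in yarn-client mode the driver lives in the local SparkSubmit JVM, which YARN does not manage and which must be stopped separately. A sketch of the steps (the application id shown is a made-up example):

```shell
# Find the application id of the running Spark app
yarn application -list -appStates RUNNING

# Kill it via YARN (example id is hypothetical)
yarn application -kill application_1405380385239_0001

# In yarn-client mode, the local SparkSubmit JVM survives the YARN kill;
# find it and stop it separately
jps | grep SparkSubmit
```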

SQL + streaming

2014-07-14 Thread hsy...@gmail.com
Hi All, A couple of days ago I tried to integrate SQL and streaming together. My understanding is that I can transform the RDDs from a DStream into SchemaRDDs and execute SQL on each RDD. But I had no luck. Would you guys help me take a look at my code? Thank you very much! object KafkaSpark { def main(args:
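
The shape of the approach described here, in PySpark terms, is to run the SQL inside foreachRDD, once per micro-batch. A hedged sketch (API names vary across Spark versions; in the 1.x era of this thread the DataFrame role was played by SchemaRDD, and the table/view registration calls were named differently):

```python
# Sketch only: `spark` is a SparkSession and `stream` a DStream from Kafka;
# both are assumed to exist already.
def process(rdd):
    if rdd.isEmpty():
        return
    df = rdd.toDF()                           # RDD -> DataFrame (SchemaRDD in 1.x)
    df.createOrReplaceTempView("events")
    out = spark.sql("SELECT COUNT(*) AS n FROM events")
    out.show()                                # show/collect forces execution;
                                              # without an action, nothing runs

stream.foreachRDD(process)
```

The "no errors but no output" symptom discussed later in the thread usually means the SQL result was never forced with an action, or the output went to executor logs rather than the driver console.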

Re: How to kill running spark yarn application

2014-07-14 Thread hsy...@gmail.com
. This is what I did 2 hours ago. Sorry I cannot provide more help. Sent from my iPhone On 14 Jul, 2014, at 6:05 pm, hsy...@gmail.com hsy...@gmail.com wrote: yarn-cluster On Mon, Jul 14, 2014 at 2:44 PM, Jerry Lam chiling...@gmail.com wrote: Hi Siyuan, I wonder if you --master yarn-cluster

Re: SQL + streaming

2014-07-14 Thread hsy...@gmail.com
but SQL command throwing error? No errors but no output either? TD On Mon, Jul 14, 2014 at 4:06 PM, hsy...@gmail.com hsy...@gmail.com wrote: Hi All, Couple days ago, I tried to integrate SQL and streaming together. My understanding is I can transform RDD from Dstream to schemaRDD and execute

Re: SQL + streaming

2014-07-14 Thread hsy...@gmail.com
that to work, then I would test the Spark SQL stuff. TD On Mon, Jul 14, 2014 at 5:25 PM, hsy...@gmail.com hsy...@gmail.com wrote: No errors but no output either... Thanks! On Mon, Jul 14, 2014 at 4:59 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Could you elaborate on what

Re: Some question about SQL and streaming

2014-07-10 Thread hsy...@gmail.com
:21 AM, hsy...@gmail.com hsy...@gmail.com wrote: Hi guys, I'm a new Spark user. I would like to know whether there is an example of how to use Spark SQL and Spark Streaming together. My use case is that I want to do some SQL on the input stream from Kafka. Thanks! Best, Siyuan

Difference between SparkSQL and shark

2014-07-10 Thread hsy...@gmail.com
I have a newbie question. What is the difference between SparkSQL and Shark? Best, Siyuan

Some question about SQL and streaming

2014-07-09 Thread hsy...@gmail.com
Hi guys, I'm a new Spark user. I would like to know whether there is an example of how to use Spark SQL and Spark Streaming together. My use case is that I want to do some SQL on the input stream from Kafka. Thanks! Best, Siyuan