Re: Structured Streaming Spark 3.0.1

2021-01-20 Thread Jungtaek Lim
I quickly looked into the attached log in the SO post, and the problem doesn't seem to be related to Kafka. The error stack trace is from checkpointing to GCS, and the implementation of OutputStream for GCS seems to be provided by Google. Could you please elaborate on the stack trace or upload the log

Re: Structured Streaming Spark 3.0.1

2021-01-20 Thread German Schiavon
Hi, I couldn't reproduce this error :/ I wonder if there is something else underneath causing it... *Input* ➜ kafka_2.12-2.5.0 ./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test1 {"name": "pedro", "age": 50} >{"name": "pedro", "age": 50} >{"name": "pedro", "age": 50}
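A minimal sketch of how these records would typically be read and parsed in a PySpark structured streaming job, assuming the test1 topic, localhost broker and {"name", "age"} schema from the message above:

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("kafka-json-repro").getOrCreate()

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])

# Read the raw Kafka records and parse the JSON value column with the schema.
parsed = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "test1")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("data"))
          .select("data.*"))
```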

Re: Use case advice

2021-01-20 Thread purav aggarwal
Unsubscribe On Fri, Jan 15, 2021 at 9:52 AM Dilip Desavali wrote: > Unsubscribe >

Re: Spark job stuck after read and not starting next stage

2021-01-20 Thread German Schiavon
Hi, not sure if this is your case, but if the source data is heavy and deeply nested, I'd recommend explicitly providing the schema when reading the JSON: df = spark.read.schema(schema).json(updated_dataset) On Thu, 21 Jan 2021 at 04:15, srinivasarao daruna wrote: > Hi, > I am running a spark
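A minimal sketch of that suggestion; the schema fields and path here are placeholders and would need to mirror the real nested JSON:

```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("explicit-schema").getOrCreate()

# Hypothetical schema; in practice it should mirror the real nested structure.
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])

# Providing the schema up front skips inference, which otherwise costs an
# extra pass over a heavy, deeply nested dataset.
df = spark.read.schema(schema).json("s3://bucket/path/to/json")  # placeholder path
```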

Spark job stuck after read and not starting next stage

2021-01-20 Thread srinivasarao daruna
Hi, I am running a spark job on a huge dataset. I have allocated 10 R5.16xlarge machines (each with 64 cores and 512 GB). The source data is JSON and I need to do some JSON transformations, so I read them as text and then convert to a dataframe. ds = spark.read.textFile() updated_dataset =
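A minimal sketch of the read-as-text-then-transform pattern described above; the path, schema field and variable names are assumptions for illustration:

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("json-transform").getOrCreate()

schema = StructType([StructField("field1", StringType())])   # hypothetical field

# Read the raw JSON lines as text, then parse them into columns explicitly.
raw = spark.read.text("s3://bucket/huge-json/")               # placeholder path
updated_dataset = raw.select(from_json(col("value"), schema).alias("j")).select("j.*")
```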

unsubscribe

2021-01-20 Thread luby
Lu Boying, Information Technology Department, China Investment Corporation. Tel: +86 (0)10 84096521 Fax: +86 (0)10 64086851 8/F New Poly Plaza, 1 Chaoyangmen North Street, Dongcheng District, Beijing 100010 Website: www.china-inv.cn This email contains confidential information. If you are not the intended recipient, please do not read, save, disclose or copy any of its contents, or open any attachments. Please notify the sender by reply immediately and delete this email and its attachments from your computer system entirely. Thank you. This email message

Re: Structured Streaming Spark 3.0.1

2021-01-20 Thread gshen
This SO post is pretty much the exact same issue: https://stackoverflow.com/questions/59962680/spark-structured-streaming-error-while-sending-aggregated-result-to-kafka-topic The user mentions it's an issue with org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4

Structured Streaming Spark 3.0.1

2021-01-20 Thread gshen
Hi all, I am having a strange issue incorporating `groupBy` statements into a structured streaming job when trying to write to Kafka or Delta. Weirdly, it only appears to work if I write to console or to memory... I'm running Spark 3.0.1 with the following dependencies:
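A minimal sketch of the kind of streaming `groupBy` aggregation written back to Kafka that this thread is about, reusing the `parsed` stream from the Kafka-parsing sketch earlier in this digest; the output topic and checkpoint path are assumptions:

```
# Assumes 'parsed' is the streaming DataFrame from the Kafka-parsing sketch above.
from pyspark.sql.functions import to_json, struct, col

counts = parsed.groupBy("name").count()

query = (counts
         .select(to_json(struct(col("name"), col("count"))).alias("value"))
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "test1-agg")                          # placeholder topic
         .option("checkpointLocation", "/tmp/checkpoints/agg")  # placeholder path
         .outputMode("update")
         .start())
```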

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Sean Owen
Heh, that could make sense, but that definitely was not my mental model of how Python binds variables! It's definitely not how Scala works. On Wed, Jan 20, 2021 at 10:00 AM Marco Wong wrote: > Hmm, I think I got what Jingnan means. The lambda function is x != i and i > is not evaluated when the

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Marco Wong
Hmm, I think I got what Jingnan means. The lambda function is x != i, and i is not evaluated when the lambda function is defined. So the pipelined RDD is rdd.filter(lambda x: x != i).filter(lambda x: x != i), rather than having the values of i substituted. Does that make sense to you, Sean? On
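A plain-Python sketch of that late-binding behaviour, plus the usual default-argument workaround; no Spark is needed to see the effect:

```
funcs = []
for i in range(3):
    funcs.append(lambda x: x != i)
# All three lambdas share the same i, which is 2 once the loop has finished.
print([f(0) for f in funcs])           # [True, True, True]

fixed = []
for i in range(3):
    fixed.append(lambda x, i=i: x != i)
# The default argument captures the value of i at definition time instead.
print([f(0) for f in fixed])           # [False, True, True]
```

The same default-argument trick would make the chained RDD filters keep distinct values of i.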

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Sean Owen
No, because the final rdd is really the result of chaining 3 filter operations. They should all execute. It _should_ work like "rdd.filter(...).filter(...).filter(...)" On Wed, Jan 20, 2021 at 9:46 AM Zhu Jingnan wrote: > I thought that was the right result. > > As rdd runs on a lazy basis, so

Re: Spark RDD + HBase: adoption trend

2021-01-20 Thread Sean Owen
RDDs are still relevant in a few ways - there is no Dataset in Python for example, so RDD is still the 'typed' API. They still underpin DataFrames. And of course it's still there because there's probably still a lot of code out there that uses it. Occasionally it's still useful to drop into that

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Zhu Jingnan
I thought that was the right result. As RDDs run on a lazy basis, every time rdd.collect() is executed, i will have been updated to its latest value, so only one element is filtered out. Regards Jingnan On Wed, Jan 20, 2021 at 9:01 AM Sean Owen wrote: > That looks very odd indeed. Things like this

Re: RDD filter in for loop gave strange results

2021-01-20 Thread יורי אולייניקוב
A. Global scope and global variables are bad habits in Python (this is about the 'rdd' and 'i' variables used in the lambda). B. Lambdas are usually misused and abused in Python, especially when they are used in a global context: ideally you'd use pure functions, something like: ``` def
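The message is cut off at the code block; a minimal sketch of the pure-function style it seems to be describing (function names are assumptions), using functools.partial so that i is bound at call time rather than captured from the enclosing scope:

```
from functools import partial
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pure-functions").getOrCreate()

def not_equal(value, x):
    return x != value

rdd = spark.sparkContext.parallelize([0, 1, 2])
for i in range(3):
    # partial binds the current value of i immediately, avoiding late binding.
    rdd = rdd.filter(partial(not_equal, i))

print(rdd.collect())   # [] -- every element is filtered out, as intended
```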

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Sean Owen
That looks very odd indeed. Things like this work as expected:

rdd = spark.sparkContext.parallelize([0,1,2])

def my_filter(data, i):
    return data.filter(lambda x: x != i)

for i in range(3):
    rdd = my_filter(rdd, i)

rdd.collect()

... as does unrolling the loop. But your example behaves as if

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Jacek Laskowski
Hi Marco, A Scala dev here. In short: yet another reason against Python :) Honestly, I've got no idea why the code gives that output. Ran it with 3.1.1-rc1 and got the very same results. Hoping pyspark/python devs will chime in and shed more light on this. Regards, Jacek Laskowski

Re: Spark RDD + HBase: adoption trend

2021-01-20 Thread Jacek Laskowski
Hi Marco, IMHO RDD is only for very sophisticated use cases that very few Spark devs would be capable of. I consider the RDD API a sort of Spark assembler, and most Spark devs should stick to the Dataset API. Speaking of HBase, see

Re: Issue with executer

2021-01-20 Thread Vikas Garg
The issue is resolved. The resolution is a little weird, but it worked. It was due to a Scala version mismatch between projects in my package. On Wed, 20 Jan 2021 at 18:07, Mich Talebzadeh wrote: > Hi Vikas, > > Are you running this on your local laptop etc or using some IDE etc? > > What is your available

Spark RDD + HBase: adoption trend

2021-01-20 Thread Marco Firrincieli
Hi, my name is Marco and I'm one of the developers behind https://github.com/unicredit/hbase-rdd, a project we are currently reviewing for various reasons. We were basically wondering if RDD "is still a thing" nowadays (we see lots of usage of DataFrames or Datasets) and we're not sure how

RDD filter in for loop gave strange results

2021-01-20 Thread Marco Wong
Dear Spark users, I ran the Python code below on a simple RDD, but it gave strange results. The filtered RDD contains elements which should have been filtered away earlier. Any idea why this happened? ``` rdd = spark.sparkContext.parallelize([0,1,2]) for i in range(3): print("RDD is ",
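The code in the message is truncated; a self-contained sketch of the loop pattern it describes, with the result it produces:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-filter-loop").getOrCreate()

rdd = spark.sparkContext.parallelize([0, 1, 2])
for i in range(3):
    print("RDD is ", rdd.collect())
    rdd = rdd.filter(lambda x: x != i)

# Every chained lambda sees the final value of i (2), so only 2 is removed.
print("Result is ", rdd.collect())     # [0, 1] rather than the expected []
```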

Re: Issue with executer

2021-01-20 Thread Mich Talebzadeh
Hi Vikas, Are you running this on your local laptop etc. or using some IDE? What is your available memory for Spark? Start with a minimal setup like below:

def spark_session_local(appName):
    return SparkSession.builder \
        .master('local[1]') \
        .appName(appName) \
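The builder in the message is cut off; a minimal completed sketch of such a local-mode session (the app name is a placeholder):

```
from pyspark.sql import SparkSession

def spark_session_local(appName):
    # local[1]: a single local core, the minimal setup suggested above.
    return (SparkSession.builder
            .master('local[1]')
            .appName(appName)
            .getOrCreate())

spark = spark_session_local("executor-debug")   # placeholder app name
```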

Re: Issue with executer

2021-01-20 Thread Vikas Garg
Hi Sachit, I am running it in local mode. The IP mentioned is a private IP address and is therefore useless to anyone. On Wed, 20 Jan 2021 at 17:37, Sachit Murarka wrote: > Hi Vikas > > 1. Are you running in local mode? Master has local[*] > 2. Pls mask the ip or confidential info while

Re: Issue with executer

2021-01-20 Thread Sachit Murarka
Hi Vikas, 1. Are you running in local mode? Master has local[*] 2. Please mask the IP or confidential info while sharing logs. Thanks Sachit On Wed, 20 Jan 2021, 17:35 Vikas Garg, wrote: > Hi, > > I am facing issue with spark executor. I am struggling with this issue > since last many days and

Process each kafka record for structured streaming

2021-01-20 Thread rajat kumar
Hi, I want to apply custom logic to each row of data I am getting through Kafka, and I want to do it with micro-batches. When I run it, it is not progressing.

kafka_stream_df \
    .writeStream \
    .foreach(process_records) \
    .outputMode("append") \
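The snippet in the message is cut off; a minimal sketch of per-microbatch processing with foreachBatch, which matches the micro-batch wording (the row-by-row foreach sink in the message is also valid). It assumes kafka_stream_df is the streaming DataFrame from the message and process_batch stands in for the custom logic; the query only makes progress once .start() is called:

```
# Assumes kafka_stream_df is the streaming DataFrame mentioned above.
def process_batch(batch_df, batch_id):
    # Apply the custom per-row logic to each micro-batch as a regular DataFrame.
    for row in batch_df.collect():
        print(batch_id, row)

query = (kafka_stream_df
         .writeStream
         .foreachBatch(process_batch)
         .outputMode("append")
         .option("checkpointLocation", "/tmp/checkpoints/custom")  # placeholder path
         .start())

query.awaitTermination()
```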