Spark Streaming Custom Receiver for REST API

2020-10-06 Thread Muhammed Favas
Hi, I have a REST GET endpoint that serves data packets from a queue. I am thinking of implementing a Spark custom receiver to stream the data into Spark. Could somebody please help me with how to integrate the REST endpoint with a Spark custom receiver class? Regards, Favas
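
A minimal Scala sketch of how a custom receiver could poll such an endpoint (the class name, endpoint URL, polling interval, and plain-text payload are assumptions, not part of the original question):

    import java.net.{HttpURLConnection, URL}
    import scala.io.Source
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Hypothetical receiver that polls a REST GET endpoint and hands each payload to Spark.
    class RestPollingReceiver(endpoint: String, pollIntervalMs: Long)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      override def onStart(): Unit = {
        // Poll on a separate thread so onStart() returns immediately.
        new Thread("rest-polling-receiver") {
          override def run(): Unit = poll()
        }.start()
      }

      override def onStop(): Unit = ()  // the polling loop checks isStopped() and exits

      private def poll(): Unit = {
        while (!isStopped()) {
          try {
            val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
            conn.setRequestMethod("GET")
            val body = Source.fromInputStream(conn.getInputStream).mkString
            conn.disconnect()
            if (body.nonEmpty) store(body)  // push the data packet into Spark
          } catch {
            case e: Exception => reportError("Failed to poll REST endpoint", e)
          }
          Thread.sleep(pollIntervalMs)
        }
      }
    }

    // Usage: ssc.receiverStream(new RestPollingReceiver("http://host/queue", 1000L))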

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread David Edwards
After adding the sequential ids you might need a repartition? I've found before, when using monotonically_increasing_id, that the df ends up in a single partition. It usually becomes clear in the Spark UI though. On Tue, 6 Oct 2020, 20:38 Sachit Murarka, wrote: > Yes, Even I tried the same first. Then I moved
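
A short Scala sketch of that suggestion (dataframe and column names are assumed, and the partition count is illustrative only):

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // monotonically_increasing_id() is computed per partition without a shuffle;
    // repartition afterwards if downstream stages end up on too few partitions.
    val withIds = df
      .withColumn("seq_id", monotonically_increasing_id())
      .repartition(200)  // illustrative partition count

    println(withIds.rdd.getNumPartitions)  // confirms in code what the Spark UI shows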

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread Sachit Murarka
Yes, even I tried the same first. Then I moved to the join method because shuffle spill was happening: row_number() without a partition runs on a single task. Instead of processing the entire dataframe on a single task, I have broken it down into df1 and df2 and am joining them. Because df2 is having very less

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread Eve Liao
Try to avoid broadcast. Thought this: https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6 could be helpful. On Tue, Oct 6, 2020 at 12:18 PM Sachit Murarka wrote: > Thanks Eve for response. > > Yes I know we can use broadcast for smaller datasets,I increased

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread Sachit Murarka
Thanks Eve for the response. Yes, I know we can use broadcast for smaller datasets; I increased the threshold (4 GB) for the same, but then also it did not work, and df3 is somewhat greater than 2 GB. Trying by removing broadcast as well. The job has been running for 1 hour. Will let you know. Thanks Sachit

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread Eve Liao
How many rows does df3 have? Broadcast joins are a great way to append data stored in relatively *small* single source of truth data files to large DataFrames. DataFrames up to 2GB can be broadcasted so a data file with tens or even hundreds of thousands of rows is a broadcast candidate. Your
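
For reference, a one-line Scala sketch of an explicit broadcast hint (dataframe and key names are assumptions):

    import org.apache.spark.sql.functions.broadcast

    // The hint ships smallDf to every executor, independent of
    // spark.sql.autoBroadcastJoinThreshold.
    val joined = largeDf.join(broadcast(smallDf), Seq("join_key"))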

Job is not able to perform Broadcast Join

2020-10-06 Thread Sachit Murarka
Hello Users, I am facing an issue in a Spark job where I am doing row_number() without a partition-by clause because I need to add sequentially increasing IDs. But to avoid the large spill I am not doing row_number() over the complete data frame. Instead I am applying monotonically_increasing_id on
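
A Scala sketch of the two approaches being contrasted in this thread (dataframe and column names are assumptions):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{lit, monotonically_increasing_id, row_number}

    // 1) row_number() without PARTITION BY: consecutive ids, but every row is pulled
    //    into a single task, which is what causes the large spill.
    val consecutiveIds = df.withColumn("id", row_number().over(Window.orderBy(lit(1))))

    // 2) monotonically_increasing_id(): computed per partition with no shuffle; ids are
    //    increasing and unique but not consecutive.
    val distributedIds = df.withColumn("id", monotonically_increasing_id())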

Re: Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Ricardo Martinelli de Oliveira
My 2 cents is that this is a complicated question since I'm not confident that Spark is 100% compatible with Hive in terms of query language. I have an unanswered question in this list about this:

Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Manu Jacob
Hi All, Not sure if I need to ask this question on the Spark community or the Hive community. We have a set of Hive scripts that run on EMR (Tez engine). We would like to experiment by moving some of them onto Spark. We are planning to experiment with two options. 1. Use the current code based on
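
A minimal Scala sketch of what "native Spark with Hive integration" can look like: Spark SQL with Hive support enabled against the existing metastore (the database, table, and query are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-scripts-on-spark")
      .enableHiveSupport()  // reuse the Hive metastore; existing Hive tables are visible
      .getOrCreate()

    // Existing HiveQL can often be submitted as-is through spark.sql(...)
    val result = spark.sql(
      "SELECT col_a, count(*) AS cnt FROM my_hive_db.my_table GROUP BY col_a")
    result.show()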

Re: use java in Grouped Map pandas udf to avoid serDe

2020-10-06 Thread Evgeniy Ignatiev
Note: forwarding to the list, incorrectly hit "Reply" first instead of "Reply List". Hello, Does your code run without enabling fallback mode? Arrow vectorization might simply not be getting applied if you still observe "javaToPython" stages in the Spark UI. Also, data is not skewed (partitions are too

Re: use java in Grouped Map pandas udf to avoid serDe

2020-10-06 Thread Lian Jiang
Hi, I used these settings but did not see an obvious improvement (190 minutes reduced to 170 minutes): spark.sql.execution.arrow.pyspark.enabled: True spark.sql.execution.arrow.pyspark.fallback.enabled: True This job heavily uses pandas UDFs and it runs on an EMR cluster of 30 xlarge nodes. Any idea
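
One way to take the Python serialization out of the picture entirely, as the thread title suggests, is to move the grouped transformation to the JVM side. A hedged Scala sketch (the case classes, input path, and per-group logic are assumptions):

    import org.apache.spark.sql.SparkSession

    case class Event(key: String, value: Double)
    case class GroupSummary(key: String, total: Double, n: Long)

    val spark = SparkSession.builder().appName("jvm-grouped-map").getOrCreate()
    import spark.implicits._

    // Equivalent shape to a grouped-map pandas UDF, but executed on the JVM,
    // so no JVM <-> Python (de)serialization is involved.
    val summaries = spark.read.parquet("/path/to/events").as[Event]
      .groupByKey(_.key)
      .mapGroups { (key, rows) =>
        val values = rows.map(_.value).toSeq
        GroupSummary(key, values.sum, values.size.toLong)
      }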

Executor Lost Spark Issue

2020-10-06 Thread Sachit Murarka
Hello, I have to write the aggregated output stored in a DataFrame (about 3 GB) to a single file. For that I have tried using repartition(1) as well as coalesce(1). My job is failing with the following exception: ExecutorLostFailure (executor 32 exited caused by one of the running tasks) Reason: Remote
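
A Scala sketch of the two attempts described (output path and format are assumptions); both variants funnel the whole ~3 GB through a single task, so that one task needs enough executor memory and disk to hold it:

    // repartition(1) shuffles everything to one partition; coalesce(1) avoids the
    // shuffle but still writes through a single task.
    df.repartition(1).write.mode("overwrite").csv("/tmp/aggregated_output")
    df.coalesce(1).write.mode("overwrite").csv("/tmp/aggregated_output")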

Re: Spark Streaming ElasticSearch

2020-10-06 Thread jainshasha
Hi Siva, In that case you can use the Structured Streaming foreach / foreachBatch functions, which can help you process each record and write it into some sink.
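
A hedged Scala sketch of the foreachBatch suggestion (the Elasticsearch format and options assume the elasticsearch-hadoop connector is on the classpath; the index name and streamingDf are hypothetical):

    import org.apache.spark.sql.DataFrame

    // Write each micro-batch with the batch DataFrame writer.
    def writeBatch(batch: DataFrame, batchId: Long): Unit = {
      batch.write
        .format("org.elasticsearch.spark.sql")  // assumption: ES connector available
        .option("es.resource", "my-index")      // assumption: target index
        .mode("append")
        .save()
    }

    // streamingDf: a streaming DataFrame built from readStream (assumed)
    val query = streamingDf.writeStream
      .foreachBatch(writeBatch _)
      .start()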

Spark 3 - Predicate/Projection Pushdown Feature

2020-10-06 Thread Pınar Ersoy
Hello, I am working as a Data Scientist and using Spark with Python while developing my models. Our data model has multiple nested structures (e.g. attributes.product.feature.name). For the first two levels, I can read them as follows with PySpark: data.where(col('attributes.product') !=
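
A Scala sketch of the same idea (the original is PySpark; the schema and the comparison value are assumptions). Dotted column paths are not limited to two levels:

    import org.apache.spark.sql.functions.col

    // Reaching a third/fourth-level struct field directly via the dotted path.
    val filtered = data
      .where(col("attributes.product").isNotNull)
      .where(col("attributes.product.feature.name") === "some_feature")  // assumed value

    // Equivalent explicit form using getField:
    val sameThing = data.where(
      col("attributes").getField("product").getField("feature").getField("name") === "some_feature")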

Re: Arbitrary stateful aggregation: updating state without setting timeout

2020-10-06 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Hi Jungtaek, Thank you very much for the clarification. > On 5 Oct 2020, at 15:17, Jungtaek Lim wrote: > Hi, > That's not explained in the SS guide doc but it is explained in the Scala API doc: > http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/GroupState.html
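
For context, a minimal Scala sketch of updating GroupState without setting any timeout, following the linked GroupState scaladoc (the case classes, update logic, and the assumption that `events` is a keyed streaming Dataset are illustrative only):

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    case class Event(key: String, value: Long)
    case class RunningCount(count: Long)

    def updateState(
        key: String,
        batch: Iterator[Event],
        state: GroupState[RunningCount]): Iterator[(String, Long)] = {
      val previous = state.getOption.getOrElse(RunningCount(0L))
      val updated = RunningCount(previous.count + batch.size)
      state.update(updated)  // no state.setTimeout* call is required
      Iterator((key, updated.count))
    }

    // events: Dataset[Event] built from a streaming source, with spark.implicits._ in scope
    val counts = events
      .groupByKey(_.key)
      .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(updateState)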