Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
It uses Helm to deploy Spark Operator and Nginx. For other parts like creating EKS, IAM role, node group, etc, it uses AWS SDK to provision those AWS resources. On Wed, Feb 23, 2022 at 11:28 AM Bjørn Jørgensen wrote: > So if I get this right you will make a Helm chart to >

Re: Spark Explain Plan and Joins

2022-02-23 Thread Sid
I actually went through the sort-merge algorithm and found that it compares the two values, resets the respective pointer to the last matched position, and then goes on comparing and fetching the records. Could you please go through

Re: Spark Explain Plan and Joins

2022-02-23 Thread Mich Talebzadeh
Yes, correct, because sort-merge can only work for equijoins. The point is that the join columns must be sortable in each DF. In a sort-merge join, the optimizer sorts the first DF by its join columns, sorts the second DF by its join columns, and then merges the intermediate result sets together. As
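The merge phase discussed in this thread, including the pointer reset that Sid describes for handling duplicate keys, can be sketched in plain Python. This is an illustrative model of the algorithm only, not Spark's actual implementation:

```python
def sort_merge_join(left, right):
    """Equi-join two lists of (key, value) pairs by sorting, then merging.

    Illustrative sketch only. The pointer into `right` is reset to the
    first matching position whenever the next left row repeats the key,
    so every duplicate key on the left sees every match on the right.
    """
    left = sorted(left)
    right = sorted(right)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk = right[j][0]
        if lk < rk:
            i += 1          # left key too small: advance left pointer
        elif lk > rk:
            j += 1          # right key too small: advance right pointer
        else:
            # Emit every right-side row sharing this key, remembering
            # where the run of matches started.
            mark = j
            while j < len(right) and right[j][0] == lk:
                out.append((lk, lv, right[j][1]))
                j += 1
            i += 1
            # If the next left row has the same key, reset the right
            # pointer to the start of the matching run.
            if i < len(left) and left[i][0] == lk:
                j = mark
    return out
```

Because both sides are consumed in sorted order, each pointer only moves forward (apart from the bounded reset), which is why the merge needs neither a full cross-product nor repeated scans, and also why the technique only works when the join condition is an equality on sortable columns.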

Structured Streaming + UDF - logic based on checking if a column is present in the Dataframe

2022-02-23 Thread karan alang
Hello All, I'm using StructuredStreaming, and am trying to use UDF to parse each row. Here is the requirement: - we can get alerts of a particular KPI with type 'major' OR 'critical' - for a KPI, if we get alerts of type 'major' eg _major, and we have a critical alert as well _critical,
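The message is truncated, so the exact rule is unclear, but the severity-selection logic it describes might be sketched as plain Python. The column names `kpi_major` and `kpi_critical` are invented for illustration, as is the assumption that a critical alert wins when both kinds are present:

```python
def effective_severity(row):
    """Pick one severity for a KPI when several alert columns may exist.

    Hypothetical sketch: the column names `kpi_critical` and `kpi_major`
    are assumptions, and we assume a critical alert takes precedence
    when both alert types are present for the same KPI.
    """
    if row.get("kpi_critical") is not None:
        return "critical"
    if row.get("kpi_major") is not None:
        return "major"
    return None

# A row carrying both alert types resolves to the critical one.
assert effective_severity({"kpi_major": 1, "kpi_critical": 1}) == "critical"
assert effective_severity({"kpi_major": 1}) == "major"
assert effective_severity({}) is None
```

In a Structured Streaming job, whether a column exists at all can be checked up front with `"kpi_critical" in df.columns` before deciding which columns to pass to the UDF, since a UDF only sees the columns handed to it.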

Re: Spark Explain Plan and Joins

2022-02-23 Thread Sid
From what I understood, you are asking whether sort-merge can be used in either of the conditions? If my understanding is correct then yes because it supports equi joins. Please correct me if I'm wrong. On Thu, Feb 24, 2022 at 1:49 AM Mich Talebzadeh wrote: > OK let me put this question to you

Re: Spark Explain Plan and Joins

2022-02-23 Thread Mich Talebzadeh
OK let me put this question to you if I may. What is the essence of sort-merge assuming we have a SARG WHERE D.deptno = E.deptno? Can we have a sort-merge for WHERE D.deptno >= E.deptno? view my Linkedin profile

Re: Spark Explain Plan and Joins

2022-02-23 Thread Sid
Hi Mich, Thanks for the link. I will go through it. I have two doubts regarding sort-merge join. 1) I came across one article where it mentioned that it is a better join technique since it doesn't have to scan the entire tables, because the keys are sorted. If I have keys like 1,2,4,10 and other

Re: One click to run Spark on Kubernetes

2022-02-23 Thread Bjørn Jørgensen
So if I get this right you will make a Helm chart to deploy Spark and some other stuff on K8S? On Wed, Feb 23, 2022 at 17:49, bo yang wrote: > Hi Sarath, let's follow up offline on this. > > On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy < > sarath.annare...@gmail.com> wrote:

Re: Unable to display JSON records with null values

2022-02-23 Thread Sid
Okay. So what should I do if I get such data? On Wed, Feb 23, 2022 at 11:59 PM Sean Owen wrote: > There is no record "345" here it seems, right? it's not that it exists and > has null fields; it's invalid w.r.t. the schema that the rest suggests. > > On Wed, Feb 23, 2022 at 11:57 AM Sid wrote:

Re: Unable to display JSON records with null values

2022-02-23 Thread Sean Owen
There is no record "345" here it seems, right? it's not that it exists and has null fields; it's invalid w.r.t. the schema that the rest suggests. On Wed, Feb 23, 2022 at 11:57 AM Sid wrote: > Hello experts, > > I have a JSON data like below: > > [ > { > "123": { > "Party1": { >

Re: Spark Explain Plan and Joins

2022-02-23 Thread Mich Talebzadeh
Hi Sid, For now, with regard to point 2, "Predicate push down under the optimized logical plan. Could you please help me to understand the predicate pushdown with some other simple example?" Please see this good explanation with examples: Using Spark predicate push down in Spark SQL queries
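What predicate pushdown buys can be modelled in a few lines of plain Python. This is a conceptual sketch only; the table contents, sizes, and names are invented for illustration. With pushdown the filter runs inside the data source, so far fewer rows ever cross the source boundary:

```python
# Hypothetical in-memory "table"; size and column names invented.
DATA = [{"deptno": d, "name": f"emp{d}"} for d in range(1000)]

def scan(predicate=None):
    """Simulated data source. When a predicate is pushed down, it is
    evaluated inside the source and only matching rows are returned."""
    return DATA if predicate is None else [r for r in DATA if predicate(r)]

# Without pushdown: the source hands back every row; the engine filters later.
no_pushdown = [r for r in scan() if r["deptno"] == 7]

# With pushdown: the predicate is evaluated inside the source itself.
with_pushdown = scan(lambda r: r["deptno"] == 7)

assert no_pushdown == with_pushdown   # same answer either way
assert len(scan()) == 1000            # but 1000 rows crossed without pushdown
assert len(with_pushdown) == 1        # versus 1 row with pushdown
```

In Spark's `EXPLAIN` output this shows up as `PushedFilters` on the scan node of the physical plan (for sources that support it, such as Parquet), rather than a separate `Filter` operator above a full scan.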

Re: Spark Explain Plan and Joins

2022-02-23 Thread Sid
Hi, Can you help me with my doubts? Any links would also be helpful. Thanks, Sid On Wed, Feb 23, 2022 at 1:22 AM Sid Kal wrote: > Hi Mich / Gourav, > > Thanks for your time :) Much appreciated. I went through the article > shared by Mich about the query execution plan. I pretty much

Unable to display JSON records with null values

2022-02-23 Thread Sid
Hello experts, I have a JSON data like below: [ { "123": { "Party1": { "FIRSTNAMEBEN": "ABC", "ALIASBEN": "", "RELATIONSHIPTYPE": "ABC, FGHIJK LMN", "DATEOFBIRTH": "7/Oct/1969" }, "Party2": { "FIRSTNAMEBEN": "ABCC",
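The sample above is truncated, but the behaviour discussed in this thread, where one record simply lacks the fields the others define, can be mimicked with the standard `json` module. The records below are simplified stand-ins, not the original data:

```python
import json

# Simplified stand-in for the truncated sample in this thread; the
# second record deliberately lacks the fields the first one carries.
raw = """
[
  {"123": {"Party1": {"FIRSTNAMEBEN": "ABC", "DATEOFBIRTH": "7/Oct/1969"}}},
  {"345": null}
]
"""
records = json.loads(raw)

def first_name(rec):
    """Mimic schema-driven extraction: fields a record does not carry
    come back as None, much as Spark surfaces null for absent fields
    when an explicit schema is supplied to spark.read.json."""
    (_, body), = rec.items()     # each record has a single top-level key
    if not isinstance(body, dict):
        return None
    return body.get("Party1", {}).get("FIRSTNAMEBEN")

names = [first_name(r) for r in records]
assert names == ["ABC", None]
```

With Spark itself, supplying an explicit schema rather than relying on inference is the usual way to make such records come back with nulls in the missing fields instead of being treated as invalid against the inferred schema.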

Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Hi Sarath, let's follow up offline on this. On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy wrote: > Hi bo > > How do we start? > > Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc > > > Thanks > Sarath > > > Sent from my iPhone > > On Feb 23, 2022, at 10:27 AM, bo yang

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-23 Thread Dennis Suhari
Currently we are trying AnalyticsZoo and Ray. Sent from my iPhone > On 23.02.2022 at 04:53, Bitfox wrote: > > tensorflow itself can implement the distributed computing via a parameter > server. Why did you want spark here? > > regards. > >> On Wed, Feb 23, 2022 at 11:27 AM

Re: One click to run Spark on Kubernetes

2022-02-23 Thread Sarath Annareddy
Hi bo How do we start? Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc Thanks Sarath Sent from my iPhone > On Feb 23, 2022, at 10:27 AM, bo yang wrote: > > Hi Sarath, thanks for your interest and willingness to contribute! The project > supports local development

Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Hi Sarath, thanks for your interest and willingness to contribute! The project supports local development using MiniKube. Similarly there is a one click command with one extra argument to deploy all components in MiniKube, and people could use that to develop on their local MacBook. On Wed, Feb 23,

Re: One click to run Spark on Kubernetes

2022-02-23 Thread Sarath Annareddy
Hi bo I am interested to contribute. But I don't have free access to any cloud provider. Not sure how I can get free access. I know Google, AWS, and Azure only provide temporary free access; it may not be sufficient. Guidance is appreciated. Sarath Sent from my iPhone > On Feb 23, 2022, at 2:01

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sean Owen
The standalone koalas project should have the same functionality for older Spark versions: https://koalas.readthedocs.io/en/latest/ You should be moving to Spark 3 though; 2.x is EOL. On Wed, Feb 23, 2022 at 9:06 AM Sid wrote: > Cool. Here, the problem is I have to run the Spark jobs on Glue

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sid
Cool. Here, the problem is I have to run the Spark jobs on Glue ETL which supports Spark 2.4.3, and I don't think this distributed support was added for pandas in that version. AFAIK, it was added in version 3.2. So how can I do it in Spark 2.4.3? Correct me if I'm wrong. On Wed, Feb

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Bjørn Jørgensen
You will. The pandas API on Spark, imported with `from pyspark import pandas as ps`, is not pandas but an API that uses pyspark under the hood. On Wed, Feb 23, 2022 at 15:54, Sid wrote: > Hi Bjørn, > > Thanks for your reply. This doesn't help while loading huge datasets. > Won't be able to achieve

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sean Owen
This isn't pandas, it's pandas on Spark. It's distributed. On Wed, Feb 23, 2022 at 8:55 AM Sid wrote: > Hi Bjørn, > > Thanks for your reply. This doesn't help while loading huge datasets. > Won't be able to achieve spark functionality while loading the file in > distributed manner. > > Thanks,

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sid
Hi Bjørn, Thanks for your reply. This doesn't help while loading huge datasets. Won't be able to achieve Spark functionality while loading the file in a distributed manner. Thanks, Sid On Wed, Feb 23, 2022 at 7:38 PM Bjørn Jørgensen wrote: > from pyspark import pandas as ps > > > ps.read_excel?

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Bjørn Jørgensen
from pyspark import pandas as ps
ps.read_excel?  # "Support both `xls` and `xlsx` file extensions from a local filesystem or URL"
pdf = ps.read_excel("file")
df = pdf.to_spark()
On Wed, Feb 23, 2022 at 14:57, Sid wrote: > Hi Gourav, > > Thanks for your time. > > I am worried about the

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sid
Hi Gourav, Thanks for your time. I am worried about the distribution of data in case of a huge dataset file. Is Koalas still a better option to go ahead with? If yes, how can I use it with Glue ETL jobs? Do I have to pass some kind of external jars for it? Thanks, Sid On Wed, Feb 23, 2022 at

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Gourav Sengupta
Hi, this looks like a very specific and exact problem in its scope. Do you think that you can load the data into a pandas dataframe and load it back to SPARK using PANDAS UDF? Koalas is now natively integrated with SPARK, try to see if you can use those features. Regards, Gourav On Wed, Feb 23,

Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sid
I have an excel file which unfortunately cannot be converted to CSV format and I am trying to load it using pyspark shell. I tried invoking the below pyspark session with the jars provided. pyspark --jars

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-23 Thread Sean Owen
Petastorm does that https://github.com/uber/petastorm in the sense that it feeds Spark DFs to those frameworks in distributed training. I'm not sure what you mean by native integration that is different? These tools do just what you are talking about and have for a while. On Wed, Feb 23, 2022 at

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-23 Thread Gourav Sengupta
Hi, I am sure those who have actually built a data processing pipeline whose contents have to be then delivered to tensorflow or pytorch (not for POC, or writing a blog to get clicks, or resolving symptomatic bugs, but in real life end-to-end application), will perhaps understand some of the

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-23 Thread Sean Owen
Spark does do distributed ML, but not Tensorflow. Barrier execution mode is an element that things like Horovod uses. Not sure what you are getting at? Ray is not Spark. As I say -- Horovod does this already. The upside over TF distributed is that Spark sets up and manages the daemon processes

Spark 3.1.3 docker pre-built with Python Data science packages

2022-02-23 Thread Mich Talebzadeh
Some people asked me whether it was possible to create a docker file (Spark 3.1.3) with Python packages geared towards DS etc., having the following pre-built packages: pyyaml, TensorFlow, Theano, Pandas, Keras, NumPy, SciPy, Scrapy, SciKit-Learn, XGBoost, Matplotlib, Seaborn, Bokeh, Plotly, pydot, Statsmodels

Unsubscribe

2022-02-23 Thread ashmeet kandhari
Unsubscribe

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-02-23 Thread Mich Talebzadeh
Well, all those parameter settings have no effect because you are only using Hive as metadata storage for your external tables on gcs. Can you give a typical example of your external table creation in Hive and the partitioned column? HTH view my Linkedin profile

Re: One click to run Spark on Kubernetes

2022-02-23 Thread Bitfox
From my viewpoint, if there is such a pay-as-you-go service I would like to use it; otherwise I have to deploy a regular Spark cluster with GCP/AWS etc. and the cost is not low. Thanks. On Wed, Feb 23, 2022 at 4:00 PM bo yang wrote: > Right, normally people start with simple script, then add more

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-23 Thread Gourav Sengupta
Hi, the SPARK community should have been able to build distributed ML capabilities, and as far as I remember that was the idea initially behind SPARK 3.x roadmap (barrier execution mode, https://issues.apache.org/jira/browse/SPARK-24579). Ray, another Berkeley Labs output like SPARK, is trying

Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Right, normally people start with simple script, then add more stuff, like permission and more components. After some time, people want to run the script consistently in different environments. Things will become complex. That is why we want to see whether people have interest for such a "one