Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Well, OK. In a nutshell, you want the result set for every user prepared and emailed to that user, right? This is a form of ETL where those result sets need to be posted somewhere. Say you create a table based on the result set prepared for each user. You may have many raw target tables at the end of

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Hi Mich, First, thank you for that. Great effort put into helping. Second, I don't think this tackles the technical challenge here. I understand the windowing as it serves those ranks you created, but I don't see how the ranks contribute to the solution. Third, the core of the challenge is about

Re: unsubscribe

2023-04-25 Thread santhosh Gandhe
To remove your address from the list, send a message to: On Mon, Apr 24, 2023 at 10:41 PM wrote: > unsubscribe

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Hi Marco, First thoughts. foreach() is an action that iterates/loops over each element in the dataset, i.e., it is cursor-based. That is different from operating on the dataset as a set, which is far more efficient. So in your case, if I understand it correctly, you want to get order
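
A minimal PySpark sketch of the set-based alternative, assuming hypothetical users and orders inputs sharing a user_id column:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical inputs mirroring the thread's example.
    users = spark.createDataFrame(
        [(1, "a@x.com"), (2, "b@x.com")], ["user_id", "email"])
    orders = spark.createDataFrame(
        [(1, 101, 9.99), (1, 102, 5.00), (2, 201, 1.25)],
        ["user_id", "order_id", "amount"])

    # One join + one aggregation over the whole dataset, instead of
    # looping user by user with foreach(): each row is a user's statement.
    statements = (users.join(orders, "user_id")
                       .groupBy("user_id", "email")
                       .agg(F.collect_list(
                            F.struct("order_id", "amount")).alias("orders")))
    statements.show(truncate=False)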

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Thanks Mich, Great idea. I have done it. Those files are attached. I'm interested to know your thoughts. Let's imagine this same structure, but with huge amounts of data as well. Please and thank you, Marco. On Tue, Apr 25, 2023 at 12:12 PM Mich Talebzadeh wrote: > Hi Marco, > > Let us start

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Hi Marco, Let us start simple. Provide a CSV file of 5 rows for the users table. Each row has a unique user_id and one or two other columns, like a fictitious email etc. Also, for each user_id, provide 10 rows for the orders table, meaning that the orders table has 5 x 10 = 50 rows in total. Both as

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Thanks Mich, I have not, but I will certainly read up on this today. To your point that all of the essential data is in the 'orders' table: I agree! That distills the problem nicely. Yet, I still have some questions on which someone may be able to shed some light. 1) If my 'orders' table is very

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Have you thought of using windowing functions to achieve this? Effectively all your information is in the orders table. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United
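
A minimal PySpark sketch of the windowing idea, assuming hypothetical orders columns user_id and order_date:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical orders data: (user_id, order_id, order_date).
    orders = spark.createDataFrame(
        [(1, 101, "2023-04-01"), (1, 102, "2023-04-03"), (2, 201, "2023-04-02")],
        ["user_id", "order_id", "order_date"])

    # Rank each user's orders, newest first; every per-user fact needed
    # for a statement is then available without a driver-side loop.
    w = Window.partitionBy("user_id").orderBy(F.col("order_date").desc())
    orders.withColumn("rn", F.row_number().over(w)).show()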

unsubscribe

2023-04-24 Thread yxj1141
unsubscribe

What is the best way to organize a join within a foreach?

2023-04-24 Thread Marco Costantini
I have two tables: {users, orders}. In this example, let's say that for each 1 User in the users table, there are 10 Orders in the orders table. I have to use pyspark to generate a statement of Orders for each User. So, a single user will need his/her own list of Orders. Additionally, I need

What is the best way to organize a join within a foreach?

2023-04-24 Thread Marco Costantini
I have two tables: {users, orders}. In this example, let's say that for each 1 User in the users table, there are 10 Orders in the orders table. I have to use pyspark to generate a statement of Orders for each User. So, a single user will need

Unsubscribing

2023-04-24 Thread phiroc
Hello, does this mailing list have an administrator, please? I'm trying to unsubscribe, but to no avail. Many thanks. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Reg: create spark using virtual machine through chef

2023-04-24 Thread sunkara akhil sai teja
Hi team, I am Akhil. I am trying to set up Spark on a virtual machine through Chef. Could you please help us with how we can do it? If possible, could you please share the documentation. Regards, Akhil

Re: Use Spark Aggregator in PySpark

2023-04-24 Thread Enrico Minack
Hi, For an aggregating UDF, use spark.udf.registerJavaUDAF(name, className). Enrico. On 23.04.23 at 23:42, Thomas Wang wrote: Hi Spark Community, I have implemented a custom Spark Aggregator (a subclass of org.apache.spark.sql.expressions.Aggregator). Now I'm trying to use it in a
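
A sketch of Enrico's suggestion from the PySpark side; com.example.MyAggregator is a placeholder for the compiled aggregator class, which must be shipped to the cluster (e.g., via --jars):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Register the JVM-side aggregate function under a SQL name
    # (placeholder class name; ship the jar with --jars).
    spark.udf.registerJavaUDAF("my_agg", "com.example.MyAggregator")

    spark.createDataFrame([(1, 10), (1, 20), (2, 5)], ["k", "v"]) \
         .createOrReplaceTempView("t")
    spark.sql("SELECT k, my_agg(v) FROM t GROUP BY k").show()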

Use Spark Aggregator in PySpark

2023-04-23 Thread Thomas Wang
Hi Spark Community, I have implemented a custom Spark Aggregator (a subclass of org.apache.spark.sql.expressions.Aggregator). Now I'm trying to use it in a PySpark application, but for some reason, I'm not able to trigger the function. Here is what I'm doing; could someone help me take a look?

Re: Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Thomas Wang
Thanks Raghavendra, Could you be more specific about how I can use ExpressionEncoder()? More specifically, how can I conform to the return type of Encoder>? Thomas On Sun, Apr 23, 2023 at 9:42 AM Raghavendra Ganesh wrote: > For simple array types setting encoder to ExpressionEncoder() should

Re: Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Raghavendra Ganesh
For simple array types setting encoder to ExpressionEncoder() should work. -- Raghavendra On Sun, Apr 23, 2023 at 9:20 PM Thomas Wang wrote: > Hi Spark Community, > > I'm trying to implement a custom Spark Aggregator (a subclass to > org.apache.spark.sql.expressions.Aggregator). Correct me if

Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Thomas Wang
Hi Spark Community, I'm trying to implement a custom Spark Aggregator (a subclass of org.apache.spark.sql.expressions.Aggregator). Correct me if I'm wrong, but I'm assuming I will be able to use it as an aggregation function like SUM. What I'm trying to do is that I have a column of ARRAY and I

State of GraphX and GraphFrames

2023-04-23 Thread g
Hello, I am currently doing my Master's thesis on data provenance in Apache Spark and would like to extend the provenance capabilities to include GraphX/GraphFrames. I am curious what the current status of both GraphX and GraphFrames is. It seems that GraphX is no longer being updated (but still

Dependency injection for spark executors

2023-04-20 Thread Deepak Patankar
I am writing a Spark application which uses Java and Spring Boot to process rows. For every row it performs some logic and saves data into the database. The logic is performed using some services defined in my application and some external
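
A common pattern for this (a sketch, not from the thread; all names are hypothetical): build heavyweight services lazily on the executors rather than serializing driver-side beans, one instance per executor process:

    from pyspark.sql import SparkSession

    # Hypothetical service; in the real app this wraps the row-saving logic.
    class RowSaver:
        def __init__(self):
            self.count = 0          # stand-in for an expensive client/connection
        def save(self, row):
            self.count += 1         # stand-in for a database write

    _saver = None                   # module-level: one instance per executor

    def get_saver():
        global _saver
        if _saver is None:
            _saver = RowSaver()     # built lazily on the executor, never serialized
        return _saver

    def save_partition(rows):
        saver = get_saver()
        for row in rows:
            saver.save(row)

    spark = SparkSession.builder.getOrCreate()
    spark.range(100).foreachPartition(save_partition)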

Re: Partition by on dataframe causing a Sort

2023-04-20 Thread Nikhil Goyal
Is it possible to use MultipleOutputs and define a custom OutputFormat and then use `saveAsHadoopFile` to be able to achieve this? On Thu, Apr 20, 2023 at 1:29 PM Nikhil Goyal wrote: > Hi folks, > > We are writing a dataframe and doing a partitionby() on it. >

Partition by on dataframe causing a Sort

2023-04-20 Thread Nikhil Goyal
Hi folks, We are writing a dataframe and doing a partitionBy() on it. df.write.partitionBy('col').parquet('output') The job is running super slow because, internally, it does a per-partition sort before starting to output to the final location. This sort isn't useful in any way since the # of files
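
For context, a sketch of one commonly suggested mitigation (an assumption, not from the thread): repartition by the partition column first, so each write task holds few distinct keys and emits one file per partition directory:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).withColumn("col", F.col("id") % 10)

    # Cluster rows for the same output partition into the same task first;
    # the per-task sort then sees far fewer distinct keys and each
    # partition directory gets one file per task instead of many.
    df.repartition("col").write.mode("overwrite").partitionBy("col").parquet("output")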

Re: [Spark on SBT] Executor just keeps running

2023-04-18 Thread Dhruv Singla
You can reproduce the behavior in ordinary Scala code if you keep reduce in an object outside the main method. Hope it might help On Mon, Apr 17, 2023 at 10:22 PM Dhruv Singla wrote: > Hi Team >I was trying to run spark using `sbt console` on the terminal. I am > able to build the

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Ankit Gupta
Thanks Elliot! Let me check it out! On Mon, 17 Apr, 2023, 10:08 pm Elliot West, wrote: > Hi Ankit, > > While not a part of Spark, there is a project called 'WaggleDance' that > can federate multiple Hive metastores so that they are accessible via a > single URI:

[Spark on SBT] Executor just keeps running

2023-04-17 Thread Dhruv Singla
Hi Team, I was trying to run Spark using `sbt console` on the terminal. I am able to build the project successfully using build.sbt, and the following piece of code runs fine in IntelliJ. The only issue I am facing while running the same on the terminal is that the Executor keeps running and is

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Cheng Pan
There is a DSv2-based Hive connector in Apache Kyuubi[1] that supports connecting multiple HMS in a single Spark application. Some limitations - currently only supports Spark 3.3 - has a known issue when using w/ `spark-sql`, but OK w/ spark-shell and normal jar-based Spark application. [1]

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Elliot West
Hi Ankit, While not a part of Spark, there is a project called 'WaggleDance' that can federate multiple Hive metastores so that they are accessible via a single URI: https://github.com/ExpediaGroup/waggle-dance This may be useful or perhaps serve as inspiration. Thanks, Elliot. On Mon, 17 Apr

Spark Log Shipper to Cloud Bucket

2023-04-17 Thread Jayabindu Singh
Greetings Everyone! We need to ship Spark (driver and executor) logs (not Spark event logs) from K8s to a cloud bucket (ADLS/S3). Using Fluent Bit we are able to ship the log files, but only to one single path, container/logs/. This will cause a huge number of files in a single folder and will

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Ankit Gupta
++ User Mailing List. Just a reminder: anyone who can help on this? Thanks a lot! Ankit Prakash Gupta On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta wrote: > Hi All > > The question is regarding the support of multiple Remote Hive Metastore > catalogs with Spark. Starting Spark 3, multiple

Re: Non string type partitions

2023-04-15 Thread Bjørn Jørgensen
I guess that it has to do with indexing and partitioning data across nodes. Have a look at the data partitioning system design concept and key-range partitions

Re: Non string type partitions

2023-04-15 Thread Charles vinodh
Bumping this up again for suggestions. Is the official recommendation not to have *int* or *date* typed partition columns? On Wed, 12 Apr 2023 at 10:44, Charles vinodh wrote: > There are other distributed execution engines (like Hive, Trino) that do > support non-string data types for

CVE-2023-22946: Apache Spark proxy-user privilege escalation from malicious configuration class

2023-04-15 Thread Sean R. Owen
Description: In Apache Spark versions prior to 3.4.0, applications using spark-submit can specify a 'proxy-user' to run as, limiting privileges. The application can execute code with the privileges of the submitting user, however, by providing malicious configuration-related classes on the

Scala commands syntax shortcuts(alias)

2023-04-14 Thread Ankit Singla
Hi there, I'm a user of Spark as part of a Data Engineer profile for daily analytical work. I write a few commands hundreds of times a day, and I always wonder if there is some way to alias Spark commands instead of rewriting the whole syntax every time. I checked and there seems to be no *eval

Re: Spark Kubernetes Operator

2023-04-14 Thread Yuval Itzchakov
I'm not running on GKE. I am wondering what's the long-term strategy around a Spark operator. Operators are the de facto way to run complex deployments. The Flink community now has an official community-led operator, and I was wondering if there are any similar plans for Spark. On Fri, Apr 14,

Re: Spark Kubernetes Operator

2023-04-14 Thread Mich Talebzadeh
Hi, What exactly are you trying to achieve? Spark on GKE works fine and you can run Dataproc now on GKE https://www.linkedin.com/pulse/running-google-dataproc-kubernetes-engine-gke-spark-mich/?trackingId=lz12GC5dRFasLiaJm5qDSw%3D%3D Unless I misunderstood your point. HTH Mich Talebzadeh, Lead

Spark Kubernetes Operator

2023-04-14 Thread Yuval Itzchakov
Hi, ATM I see the most used option for a Spark operator is the one provided by Google: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator Unfortunately, it doesn't seem actively maintained. Are there any plans to support an official Apache Spark community driven operator?

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-14 Thread Mich Talebzadeh
OK, I managed to load the zipped Python file and the runner py file onto S3 for AWS EKS to work. It is a bit of a nightmare compared to the same on the Google SDK, which is simpler. Anyhow, you will require additional jar files to be added to $SPARK_HOME/jars. These two files will be picked up after you build

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-14 Thread Jacek Laskowski
Hi, Start with intercepting stage completions using SparkListenerStageCompleted [1]. That's Spark Core (jobs, stages and tasks). Go up the execution chain to Spark SQL with SparkListenerSQLExecutionStart [2] and SparkListenerSQLExecutionEnd [3], and correlate the info. You may want to look at how

Re: How to create spark udf use functioncatalog?

2023-04-14 Thread Jacek Laskowski
Hi, I'm not sure I understand the question, but if your question is how to register (plug in) your own custom FunctionCatalog, it's through the spark.sql.catalog configuration property, e.g. spark.sql.catalog.catalog-name=com.example.YourCatalogClass spark.sql.catalog registers a CatalogPlugin that
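
A minimal sketch of that registration from PySpark, reusing the placeholder class name from the thread; the catalog, namespace, and function names are hypothetical, and the snippet runs only once the placeholder class is on the classpath:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Plug in a custom CatalogPlugin/FunctionCatalog implementation
        # (com.example.YourCatalogClass is the thread's placeholder).
        .config("spark.sql.catalog.mycat", "com.example.YourCatalogClass")
        .getOrCreate()
    )

    # Invoke a function the catalog exposes, qualified by catalog and
    # namespace (both names hypothetical).
    spark.sql("SELECT mycat.ns.strlen('spark')").show()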

How to create spark udf use functioncatalog?

2023-04-14 Thread ??????
We are using Spark. Today I saw the FunctionCatalog, and I have seen the source of spark/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2FunctionSuite.scala and have implemented the ScalarFunction. But I still do not know how to register it in SQL

[ANNOUNCE] Apache Spark 3.2.4 released

2023-04-13 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.2.4! Spark 3.2.4 is a maintenance release containing stability fixes. This release is based on the branch-3.2 maintenance branch of Spark. We strongly recommend all 3.2 users to upgrade to this stable release. To download Spark 3.2.4,

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-13 Thread Trường Trần Phan An
Hi, Can you give me more details or a tutorial on "You'd have to intercept execution events and correlate them. Not an easy task yet doable"? Thanks. On Wed, 12 Apr 2023 at 21:04, Jacek Laskowski wrote: > Hi, > > tl;dr it's not possible to "reverse-engineer" tasks to

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
Not sure I follow. If my output is my/path/output then the spark metadata will be written to my/path/output/_spark_metadata. All my data will also be stored under my/path/output so there's no way to split it? On Thu, Apr 13, 2023 at 1:14 PM "Yuri Oleynikov (יורי אולייניקוב)" <

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuri Oleynikov (יורי אולייניקוב)
Yeah, but can't you use the following? 1. For data files: my/path/part- 2. For partitioned data: my/path/partition= Best regards. On 13 Apr 2023, at 12:58, Yuval Itzchakov wrote: The problem is that when specifying two lifecycle policies for the same path, the one with the shorter retention wins

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
The problem is that when specifying two lifecycle policies for the same path, the one with the shorter retention wins :( https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex4 "You might specify an S3 Lifecycle configuration in

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuri Oleynikov (יורי אולייניקוב)
My naïve assumption is that specifying a lifecycle policy for _spark_metadata with a longer retention will solve the issue. Best regards > On 13 Apr 2023, at 11:52, Yuval Itzchakov wrote: > > Hi everyone, > > I am using Spark's FileStreamSink in order to write files to S3. On the S3 > bucket, I

_spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
Hi everyone, I am using Spark's FileStreamSink in order to write files to S3. On the S3 bucket, I have a lifecycle policy that deletes data older than X days from the bucket so that it does not grow indefinitely. My problem starts with Spark jobs that don't have frequent data. What will

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-12 Thread Maytas Monsereenusorn
Hi, I was wondering: if it's not possible to map tasks to functions, is it still possible to easily figure out which job and stage completed which part of the query from the UI? For example, in the SQL tab of the Spark UI, I am able to see the query and the Job IDs for that query. However,

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
Thanks! I will have a look. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Bjørn Jørgensen
Yes, it looks inside the docker container's folder. It will work if you are using s3 or gs. On Wed, 12 Apr 2023, 18:02 Mich Talebzadeh wrote: > Hi, > > In my spark-submit to the eks cluster, I use the standard code to submit to > the cluster as below: > > spark-submit --verbose \ >--master

Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
Hi, In my spark-submit to eks cluster, I use the standard code to submit to the cluster as below: spark-submit --verbose \ --master k8s://$KUBERNETES_MASTER_IP:443 \ --deploy-mode cluster \ --name sparkOnEks \ --py-files local://$CODE_DIRECTORY/spark_on_eks.zip \

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Mich Talebzadeh
Hi Lingzhe Sun, Thanks for your comments. I am afraid I won't be able to take part in this project and contribute. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-12 Thread Jacek Laskowski
Hi, tl;dr it's not possible to "reverse-engineer" tasks to functions. In essence, Spark SQL is an abstraction layer over RDD API that's made up of partitions and tasks. Tasks are Scala functions (possibly with some Python for PySpark). A simple-looking high-level operator like DataFrame.join can

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread 孙令哲
Hi Rajesh, It's working fine, at least for now. But you'll need to build your own Spark image using later versions. Lingzhe Sun, Hirain Technologies. Original: From: Rajesh Katkar Date: 2023-04-12 21:36:52 To: Lingzhe Sun Cc: Mich Talebzadeh, user Subject: Re: Re: spark streaming and

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Yi Huang
unsubscribe On Wed, Apr 12, 2023 at 3:59 PM Rajesh Katkar wrote: > Hi Lingzhe, > > We are also started using this operator. > Do you see any issues with it? > > > On Wed, 12 Apr, 2023, 7:25 am Lingzhe Sun, wrote: > >> Hi Mich, >> >> FYI we're using spark operator( >>

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Rajesh Katkar
Hi Lingzhe, We have also started using this operator. Do you see any issues with it? On Wed, 12 Apr, 2023, 7:25 am Lingzhe Sun, wrote: > Hi Mich, > > FYI we're using the spark operator( > https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) to build > stateful structured streaming on k8s

Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-12 Thread Jacek Laskowski
Hi, You could use QueryExecutionListener or Spark listeners to intercept query execution events and extract whatever is required. That's what the web UI does (as it's simply a bunch of SparkListeners --> https://youtu.be/mVP9sZ6K__Y ;-)). Regards, Jacek Laskowski "The Internals Of" Online

PySpark tests fail with java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.sources.FakeSourceOne not found

2023-04-12 Thread Ranga Reddy
Hi Team, I am running the PySpark tests in Spark and they failed with *Provider org.apache.spark.sql.sources.FakeSourceOne not found*. Spark Version: 3.4.0/3.5.0 Python Version: 3.8.10 OS: Ubuntu 20.04 *Steps:* # /opt/data/spark/build/sbt -Phive clean package #

Re: Non string type partitions

2023-04-12 Thread Charles vinodh
There are other distributed execution engines (like Hive, Trino) that do support non-string data types for partition columns, such as date and integer. Any idea why this restriction exists in Spark? On Tue, 11 Apr 2023 at 20:34, Chitral Verma wrote: > Because the name of the directory

Re: Re: spark streaming and kinesis integration

2023-04-11 Thread Lingzhe Sun
Hi Mich, FYI we've been using the spark operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) to build stateful structured streaming on k8s for a year. Haven't tested it using the non-operator way. Besides that, the main contributor of the spark operator, Yinan Li, has been inactive for

Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-11 Thread Chitral Verma
Try explain codegen on your DF and then parse the string. On Fri, 7 Apr, 2023, 3:53 pm Chenghao Lyu, wrote: > Hi, > > The detailed stage page shows the involved WholeStageCodegen Ids in its > DAG visualization from the Spark UI when running a SparkSQL. (e.g., under > the link >
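
In PySpark the same is exposed as an explain mode, which prints the generated code grouped by WholeStageCodegen subtree id (a small sketch):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100).selectExpr("id % 5 AS k").groupBy("k").count()

    # Prints the generated code, one section per WholeStageCodegen subtree,
    # preceded by "Found N WholeStageCodegen subtrees."
    df.explain(mode="codegen")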

Re: Non string type partitions

2023-04-11 Thread Chitral Verma
Because the name of the directory cannot be an object, it has to be a string to create partitioned dirs like "date=2023-04-10" On Tue, 11 Apr, 2023, 8:27 pm Charles vinodh, wrote: > > Hi Team, > > We are running into the below error when we are trying to run a simple > query a partitioned table

Non string type partitions

2023-04-11 Thread Charles vinodh
Hi Team, We are running into the below error when we are trying to run a simple query on a partitioned table in Spark. *MetaException(message: Filtering is supported only on partition keys of type string)* Our partition column has been set to type *date* instead of string, and the query is a very
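
A common workaround (a sketch, not an official recommendation): write the partition key as a formatted string, so metastore-side partition filtering operates on a string-typed key:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical table with a date-typed column 'dt'.
    df = spark.createDataFrame([("a", "2023-04-10")], ["v", "dt"]) \
              .withColumn("dt", F.to_date("dt"))

    # Store the partition column as 'yyyy-MM-dd' strings before writing,
    # so the metastore sees a string-typed partition key.
    df.withColumn("dt", F.date_format("dt", "yyyy-MM-dd")) \
      .write.mode("overwrite").partitionBy("dt").parquet("/tmp/orders")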

Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
Just to clarify, a major benefit of k8s in this case is to host your Spark applications in the form of containers in an automated fashion so that one can easily deploy as many instances of the application as required (autoscaling). From below:

Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
What I said was this: "In so far as I know k8s does not support spark structured streaming?" So it is an open question. I just recalled it. I have not tested it myself. I know structured streaming works on a Google Dataproc cluster, but I have not seen any official link that says Spark Structured

Re: spark streaming and kinesis integration

2023-04-10 Thread Rajesh Katkar
Do you have any link or ticket which justifies that k8s does not support spark streaming? On Thu, 6 Apr, 2023, 9:15 pm Mich Talebzadeh, wrote: > Do you have a high level diagram of the proposed solution? > > In so far as I know k8s does not support spark structured streaming? > > Mich

[ANNOUNCE] Apache Uniffle(Incubating) 0.7.0 available

2023-04-10 Thread Junfan Zhang
Hi all, Apache Uniffle (incubating) Team is glad to announce the new release of Apache Uniffle (incubating) 0.7.0. Apache Uniffle (incubating) is a high performance, general purpose Remote Shuffle Service for distributed compute engines like Apache Spark , Apache

Re: Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-09 Thread Andrew Redd
remove On Wed, Apr 5, 2023 at 8:06 AM Mich Talebzadeh wrote: > OK Spark Structured Streaming. > > How are you getting messages into Spark? Is it Kafka? > > This to me index that the message is incomplete or having another value in > Json > > HTH > > Mich Talebzadeh, > Lead Solutions

[SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-07 Thread Chenghao Lyu
Hi, The detailed stage page shows the involved WholeStageCodegen Ids in its DAG visualization from the Spark UI when running a SparkSQL. (e.g., under the link node:18088/history/application_1663600377480_62091/stages/stage/?id=1&attempt=0). However, I have trouble extracting the WholeStageCodegen ids

Re: spark streaming and kinesis integration

2023-04-06 Thread Rajesh Katkar
The use case is: we want to read/write Kinesis streams using k8s. Officially I could not find a Kinesis connector or reader for Spark like the one it has for Kafka. Checking here if anyone has used the Kinesis and Spark streaming combination? On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, wrote: > Hi

RE: spark streaming and kinesis integration

2023-04-06 Thread Jonske, Kurt
unsubscribe Regards, Kurt Jonske Senior Director Alvarez & Marsal Direct: 212 328 8532 Mobile: 312 560 5040 Email: kjon...@alvarezandmarsal.com www.alvarezandmarsal.com From: Mich Talebzadeh Sent: Thursday, April 06, 2023 11:45 AM To: Rajesh Katkar Cc:

Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Do you have a high level diagram of the proposed solution? In so far as I know k8s does not support spark structured streaming? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies London United Kingdom view my Linkedin profile

Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Hi Rajesh, What is the use case for Kinesis here? I have not used it personally. Which use case does it concern? https://aws.amazon.com/kinesis/ Can you use something else instead? HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies London United Kingdom view

spark streaming and kinesis integration

2023-04-06 Thread Rajesh Katkar
Hi Spark Team, We need to read/write the Kinesis streams using Spark streaming. We checked the official documentation - https://spark.apache.org/docs/latest/streaming-kinesis-integration.html It does not mention a Kinesis connector. An alternative is https://github.com/qubole/kinesis-sql, which is

Raise exception whilst casting instead of defaulting to null

2023-04-05 Thread Yeachan Park
Hi all, The default behaviour of Spark is to add a null value for casts that fail, unless ANSI SQL is enabled (SPARK-30292). Whilst I understand that this is a subset of ANSI-compliant behaviour, I don't understand why this feature is so
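
For reference, the behaviour switch the thread refers to (SPARK-30292's ANSI mode) can be demonstrated like this:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Default (non-ANSI): the failed cast silently yields NULL.
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('abc' AS INT)").show()

    # ANSI mode: the same cast raises a runtime error instead of NULL.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    spark.sql("SELECT CAST('abc' AS INT)").show()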

Re: Portability of dockers built on different cloud platforms

2023-04-05 Thread Mich Talebzadeh
The whole idea of creating a docker container is to have a deployable, self-contained utility. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings. The

Re: Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-05 Thread Mich Talebzadeh
OK, Spark Structured Streaming. How are you getting messages into Spark? Is it Kafka? This, to me, indicates that the message is incomplete or has another value in the JSON. HTH Mich Talebzadeh, Lead Solutions

Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-05 Thread me
Dear Apache Spark users, I have a long running Spark application that is encountering an ArrayIndexOutOfBoundsException once every two weeks. The exception does not disrupt the operation of my app, but I'm still concerned about it and would like to find a solution. Here's some

Re: Portability of dockers built on different cloud platforms

2023-04-05 Thread Ken Peng
ashok34...@yahoo.com.INVALID wrote: Is it possible to use a Spark docker image built on GCP on AWS without rebuilding it from scratch on AWS? I am using the Spark image from Bitnami for running on k8s. And yes, it's deployed by Helm. -- https://kenpeng.pages.dev/

Portability of dockers built on different cloud platforms

2023-04-05 Thread ashok34...@yahoo.com.INVALID
Hello team, Is it possible to use a Spark docker image built on GCP on AWS without rebuilding it from scratch on AWS? Will that work, please? AK

Re: Creating InMemory relations with data in ColumnarBatches

2023-04-04 Thread Bobby Evans
This is not going to work without changes to Spark. InMemoryTableScanExec supports columnar output, but not columnar input. You would have to write code to support that in Spark itself. The second part is that there are only a handful of operators that support columnar output. Really it is just

Re: Slack for PySpark users

2023-04-04 Thread Mich Talebzadeh
That 3-month retention is just a soft setting. For low-volume traffic, it can be negotiated up to a year's retention. Let me see what we can do about it. HTH On Tue, 4 Apr 2023 at 09:31, Bjørn Jørgensen wrote: > One of the things that I don't like about this slack solution is that > questions

Re: Slack for PySpark users

2023-04-04 Thread Bjørn Jørgensen
One of the things that I don't like about this slack solution is that questions and answers disappear after 90 days. Today's maillist solution is indexed by search engines and when one day you wonder about something, you can find solutions with the help of just searching the web. Another question

Re: Slack for PySpark users

2023-04-04 Thread Mich Talebzadeh
Hi Shani, I believe I am an admin, so that is fine by me. Hi Dongjoon, With regard to summarising the discussion etc., there is no need. It is like flogging a dead horse; we have already discussed it enough. I don't see the point of it. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead

Re: Slack for PySpark users

2023-04-04 Thread shani . alishar
Hey Dongjoon, Denny and all, I've created the current slack. All users have the option to create channels for different topics. I don't see a reason for creating a new one. If anyone wants to be an admin on the current slack channel, you are all welcome to send me a msg and I'll grant permission. Have a

Re: Slack for PySpark users

2023-04-03 Thread Dongjoon Hyun
Thank you, Denny. May I interpret your comment as a request to support multiple channels in ASF too? > because it would allow us to create multiple channels for different topics Any other reasons? Dongjoon. On Mon, Apr 3, 2023 at 5:31 PM Denny Lee wrote: > I do think creating a new Slack

Re: Slack for PySpark users

2023-04-03 Thread Denny Lee
I do think creating a new Slack channel would be helpful because it would allow us to create multiple channels for different topics - streaming, graph, ML, etc. We would need a volunteer core to maintain it so we can keep the spirit and letter of ASF / code of conduct. I’d be glad to volunteer

Re: Slack for PySpark users

2023-04-03 Thread Dongjoon Hyun
Shall we summarize the discussion so far? To sum up, "ASF Slack" vs "3rd-party Slack" was the real background to initiate this thread instead of "Slack" vs "Mailing list"? If ASF Slack provides what you need, is it better than creating a new Slack channel? Or, is there another reason for us to

Re: Slack for PySpark users

2023-04-03 Thread Mich Talebzadeh
I agree, whatever individual sentiments are. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at

Re: Slack for PySpark users

2023-04-03 Thread Jungtaek Lim
Just to be clear, if there is no strong volunteer to make the new community channel stay active, I'd probably be OK not to fork the channel. You can see a strong counter-example in the #spark channel in ASF. It is the place where there are only questions and promos but zero answers. I see volunteers

Re: Slack for PySpark users

2023-04-03 Thread Jungtaek Lim
The number of subscribers doesn't give any meaningful value. Please look into the number of mails being sent to the list. https://lists.apache.org/list.html?user@spark.apache.org The latest month in which more than 200 emails were sent was Feb 2022, more than a year ago. It was more than 1k in

Re: Looping through a series of telephone numbers

2023-04-03 Thread Gera Shegalov
+1 to using a UDF. E.g., TransmogrifAI uses libphonenumber https://github.com/google/libphonenumber that normalizes
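
If the pipeline is PySpark, a sketch of the same idea using the `phonenumbers` package (the Python port of libphonenumber; the package choice and the default region "FR" are assumptions, not from the thread):

    import phonenumbers
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()

    @F.udf(StringType())
    def normalize_phone(raw):
        # Parse with an assumed default region and emit E.164, or NULL
        # if the value is not a parseable phone number.
        try:
            parsed = phonenumbers.parse(raw, "FR")
            return phonenumbers.format_number(
                parsed, phonenumbers.PhoneNumberFormat.E164)
        except phonenumbers.NumberParseException:
            return None

    df = spark.createDataFrame([("01 23 45 67 89",)], ["phone"])
    df.withColumn("e164", normalize_phone("phone")).show()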

Re: Slack for PySpark users

2023-04-03 Thread Dongjoon Hyun
Do you think there is a way to put it back to the official ASF-provided Slack channel? Dongjoon. On Mon, Apr 3, 2023 at 2:18 PM Mich Talebzadeh wrote: > > I for myself prefer to use the newly formed slack. > > sparkcommunitytalk.slack.com > > In summary, it may be a good idea to take a tour of

Re: Slack for PySpark users

2023-04-03 Thread Mich Talebzadeh
I for myself prefer to use the newly formed slack. sparkcommunitytalk.slack.com In summary, it may be a good idea to take a tour of it and see for yourself. Topics are sectioned as per user requests. I trust this answers your question. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead

Re: Slack for PySpark users

2023-04-03 Thread Dongjoon Hyun
As Mich Talebzadeh pointed out, Apache Spark has an official Slack channel. > It's unavoidable if "users" prefer to use an alternative communication mechanism rather than the user mailing list. The following is the number of people in the official channels. - user@spark.apache.org has 4519

Re: Looping through a series of telephone numbers

2023-04-02 Thread Mich Talebzadeh
Hi Philippe, Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute
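
A minimal illustration of the mechanism Mich describes, with toy reference data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Reference data shipped once per executor, not once per task.
    prefix_to_region = sc.broadcast({"+44": "UK", "+33": "FR"})

    def region_of(number):
        # Read-only lookup against the executor-local broadcast copy.
        return prefix_to_region.value.get(number[:3], "unknown")

    rdd = sc.parallelize(["+441234", "+335678"])
    print(rdd.map(region_of).collect())   # ['UK', 'FR']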

Re: Looping through a series of telephone numbers

2023-04-02 Thread Philippe de Rochambeau
Hi Mich, what exactly do you mean by « if you prefer to broadcast the reference data »? Philippe > Le 2 avr. 2023 à 18:16, Mich Talebzadeh a écrit : > > Hi Phillipe, > > These are my thoughts besides comments from Sean > > Just to clarify, you receive a CSV file periodically and you already

Re: Looping through a series of telephone numbers

2023-04-02 Thread Philippe de Rochambeau
Wow, you guys, Anastasios, Bjørn and Mich, are stars! Thank you very much for your suggestions. I’m going to print them and study them closely. > Le 2 avr. 2023 à 20:05, Anastasios Zouzias a écrit : > > Hi Philippe, > > I would like to draw your attention to this great library that saved my

Re: Looping through a series of telephone numbers

2023-04-02 Thread Anastasios Zouzias
Hi Philippe, I would like to draw your attention to this great library that saved my day in the past when parsing phone numbers in Spark: https://github.com/google/libphonenumber If you combine it with Bjørn's suggestions you will have a good start on your linkage task. Best regards,
