Ctrl - left and right not working in Spark Shell in Windows 10

2022-11-01 Thread Salil Surendran
I installed Spark on Windows 10. Everything works fine except for the Ctrl - left and Ctrl - right keys, which don't move by a word but just by a character. How do I fix this, or find out what the correct bindings are to move by a word in Spark Shell -- Thanks, Salil "The surest sign that i

Re: spark - local question

2022-10-31 Thread Sean Owen
Sure, as stable and available as your machine is. If you don't need fault tolerance or scale beyond one machine, sure. On Mon, Oct 31, 2022 at 8:43 AM 张健BJ wrote: > Dear developers: > I have a question about the pyspark local > mode. Can it be used in production and Will it cause unexpected

spark - local question

2022-10-31 Thread 张健BJ
Dear developers: I have a question about the pyspark local mode. Can it be used in production, and will it cause unexpected problems? The scenario is as follows: our team wants to develop an ETL component based on Python. Data can be transferred between various data sources. If there i
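For a single-machine ETL like the one described, local mode needs no cluster manager at all; a hypothetical invocation (script name and memory setting are placeholders, not from the thread) might look like:

```shell
# Run a PySpark ETL script on one machine; local[*] uses all local cores.
# As Sean notes, there is no fault tolerance beyond that one machine.
spark-submit --master "local[*]" --driver-memory 8g etl_job.py
```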

RE: Apache Spark Operator for Kubernetes?

2022-10-28 Thread Jim Halfpenny
Hi Clayton, I’m not aware of an official Apache operator, but I can recommend taking a look at the one we’ve created at Stackable. https://github.com/stackabletech/spark-k8s-operator It’s actively maintained and we’d be happy to receive feedback if you have feature requests. Kind regards, Jim

Re: Prometheus with spark

2022-10-27 Thread Denny Lee
Hi Raja, A little atypical way to respond to your question - please check out the most recent Spark AMA where we discuss this: https://www.linkedin.com/posts/apachespark_apachespark-ama-committers-activity-6989052811397279744-jpWH?utm_source=share&utm_medium=member_ios HTH! Denny On Tue,

Re: Running 30 Spark applications at the same time is slower than one on average

2022-10-26 Thread Sean Owen
That just means G = GB mem, C = cores, but yeah the driver and executors are very small, possibly related. On Wed, Oct 26, 2022 at 12:34 PM Artemis User wrote: > Are these Cloudera specific acronyms? Not sure how Cloudera configures > Spark differently, but obviously the number of nodes

Re: Running 30 Spark applications at the same time is slower than one on average

2022-10-26 Thread Artemis User
Are these Cloudera-specific acronyms? Not sure how Cloudera configures Spark differently, but obviously the number of nodes is too small, considering each app only uses a small number of cores and RAM. So you may consider increasing the number of nodes. When all these apps jam on a few nodes

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Chao Sun
Congrats everyone! and thanks Yuming for driving the release! On Wed, Oct 26, 2022 at 7:37 AM beliefer wrote: > > Congratulations everyone have contributed to this release. > > > At 2022-10-26 14:21:36, "Yuming Wang" wrote: > > We are happy to announce the ava

Re:[ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread beliefer
Congratulations to everyone who has contributed to this release. At 2022-10-26 14:21:36, "Yuming Wang" wrote: We are happy to announce the availability of Apache Spark 3.3.1! Spark 3.3.1 is a maintenance release containing stability fixes. This release is based on the branch-3.3 m

Re: Running 30 Spark applications at the same time is slower than one on average

2022-10-26 Thread Sean Owen
nario: > > 1.spark app resource with 2G driver memory/2C driver vcore/1 executor > nums/2G executor memory/2C executor vcore. > 2.one spark app will use 5G4C on yarn. > 3.first, I only run one spark app takes 40s. > 4.Then, I run 30 the same spark app at once, and each spark app tak

Running 30 Spark applications at the same time is slower than one on average

2022-10-26 Thread eab...@163.com
Hi All, I have a CDH5.16.2 hadoop cluster with 1+3 nodes (64C/128G, 1NN/RM + 3DN/NM), and yarn with 192C/240G. I used the following test scenario: 1. spark app resource with 2G driver memory/2C driver vcore/1 executor nums/2G executor memory/2C executor vcore. 2. one spark app will use 5G4C on
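A quick sanity check of the numbers in this scenario (a sketch; the 5G/4C-per-app figure is taken from the post and already includes YARN's container rounding and overhead):

```python
# Each app: 2G/2C driver + one 2G/2C executor, observed as ~5G/4C on YARN.
apps = 30
per_app_mem_gb, per_app_cores = 5, 4
cluster_mem_gb, cluster_cores = 240, 192

need_mem = apps * per_app_mem_gb    # 150 GB requested in total
need_cores = apps * per_app_cores   # 120 vcores requested in total

# The aggregate request fits within the 240G/192C queue, so the slowdown
# is more likely contention: 60 containers share only 3 physical nodes.
print(need_mem <= cluster_mem_gb, need_cores <= cluster_cores)
```

So YARN capacity alone does not explain the 30x case being slower; disk, network, and CPU contention on the three NodeManagers is the more plausible culprit.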

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Jacek Laskowski
Yoohoo! Thanks Yuming for driving this release. A tiny step for Spark, a huge one for my clients (who are still on 3.2.1 or even older :)) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on htt

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Yang,Jie(INF)
Thanks Yuming and all developers ~ Yang Jie From: Maxim Gekk Date: Wednesday, October 26, 2022, 15:19 To: Hyukjin Kwon Cc: "L. C. Hsieh" , Dongjoon Hyun , Yuming Wang , dev , User Subject: Re: [ANNOUNCE] Apache Spark 3.3.1 released Congratulations everyone with the new release, and thanks to Yumi

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Maxim Gekk
ou for driving the release of Apache Spark 3.3.1, Yuming! >> >> On Tue, Oct 25, 2022 at 11:38 PM Dongjoon Hyun >> wrote: >> > >> > It's great. Thank you so much, Yuming! >> > >> > Dongjoon >> > >> > On Tue, Oct 25, 2022 at 11:

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Hyukjin Kwon
Thanks, Yuming. On Wed, 26 Oct 2022 at 16:01, L. C. Hsieh wrote: > Thank you for driving the release of Apache Spark 3.3.1, Yuming! > > On Tue, Oct 25, 2022 at 11:38 PM Dongjoon Hyun > wrote: > > > > It's great. Thank you so much, Yuming! > > > > Dongj

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread L. C. Hsieh
Thank you for driving the release of Apache Spark 3.3.1, Yuming! On Tue, Oct 25, 2022 at 11:38 PM Dongjoon Hyun wrote: > > It's great. Thank you so much, Yuming! > > Dongjoon > > On Tue, Oct 25, 2022 at 11:23 PM Yuming Wang wrote: >> >> We are happy to announ

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-25 Thread Dongjoon Hyun
It's great. Thank you so much, Yuming! Dongjoon On Tue, Oct 25, 2022 at 11:23 PM Yuming Wang wrote: > We are happy to announce the availability of Apache Spark 3.3.1! > > Spark 3.3.1 is a maintenance release containing stability fixes. This > release is based on the branc

[ANNOUNCE] Apache Spark 3.3.1 released

2022-10-25 Thread Yuming Wang
We are happy to announce the availability of Apache Spark 3.3.1! Spark 3.3.1 is a maintenance release containing stability fixes. This release is based on the branch-3.3 maintenance branch of Spark. We strongly recommend all 3.3 users to upgrade to this stable release. To download Spark 3.3.1

Re: Prometheus with spark

2022-10-25 Thread Raja bhupati
We have use case where we would like process Prometheus metrics data with spark On Tue, Oct 25, 2022, 19:49 Jacek Laskowski wrote: > Hi Raj, > > Do you want to do the following? > > spark.read.format("prometheus").load... > > I haven't heard of such a data so
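One common pattern for this use case — a sketch, assuming you pull the metrics from Prometheus's HTTP API (`GET /api/v1/query_range`) yourself and then hand the flattened rows to `spark.createDataFrame`; the payload below is a hand-written sample, not real data:

```python
import json

# Hypothetical Prometheus query_range response, hard-coded for illustration.
payload = json.loads("""
{"status":"success","data":{"resultType":"matrix","result":[
  {"metric":{"__name__":"up","job":"node"},
   "values":[[1666000000,"1"],[1666000060,"1"]]}]}}
""")

# Flatten each time series into (metric, timestamp, value) rows —
# a shape that spark.createDataFrame(rows) can ingest directly.
rows = [
    (series["metric"].get("__name__"), int(ts), float(val))
    for series in payload["data"]["result"]
    for ts, val in series["values"]
]
print(rows)
```

There is no built-in `spark.read.format("prometheus")` source, as Jacek notes, so driver-side flattening like this (or a third-party connector) is the usual route.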

Re: Prometheus with spark

2022-10-25 Thread Jacek Laskowski
<https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Fri, Oct 21, 2022 at 6:12 PM Raj ks wrote: > Hi Team, > > > We wanted to query Prometheus data with spark. Any suggestions will > be appreciated > > Searched for documents but did not got any prompt one >

Prometheus with spark

2022-10-21 Thread Raj ks
Hi Team, We wanted to query Prometheus data with Spark. Any suggestions will be appreciated. Searched for documents but did not find a helpful one.

[PySpark, Spark Streaming] Bug in timestamp handling in Structured Streaming?

2022-10-21 Thread kai-michael.roes...@sap.com.INVALID
Hi, I suspect I may have come across a bug in the handling of data containing timestamps in PySpark "Structured Streaming" using the foreach option. I'm "just" a user of PySpark, no Spark community member, so I don't know how to properly address the issue. I

Re: pyspark connect to spark thrift server port

2022-10-21 Thread Artemis User
I guess there is some confusion here between the metastore and the actual Hive database. Spark (as well as Apache Hive) requires two databases for Hive DB operations. The metastore is used for storing metadata only (e.g., schema info), whereas the actual Hive database, accessible through

Re: pyspark connect to spark thrift server port

2022-10-20 Thread second_co...@yahoo.com.INVALID
AM GMT+8, Artemis User wrote: By default, Spark uses Apache Derby (running in embedded mode with store content defined in local files) for hosting the Hive metastore.  You can externalize the metastore on a JDBC-compliant database (e.g., PostgreSQL) and use the database authentication

Re: pyspark connect to spark thrift server port

2022-10-20 Thread Artemis User
By default, Spark uses Apache Derby (running in embedded mode with store content defined in local files) for hosting the Hive metastore.  You can externalize the metastore on a JDBC-compliant database (e.g., PostgreSQL) and use the database authentication provided by the database.  The JDBC
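A hedged sketch of what externalizing the metastore can look like in spark-defaults.conf (host, database name, and credentials are placeholders; the `javax.jdo` keys are the standard Hive metastore connection properties):

```properties
spark.hadoop.javax.jdo.option.ConnectionURL         jdbc:postgresql://meta-host:5432/hive_metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName  org.postgresql.Driver
spark.hadoop.javax.jdo.option.ConnectionUserName    hive
spark.hadoop.javax.jdo.option.ConnectionPassword    ****
```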

Spark partitioned By

2022-10-20 Thread venkatesh bandaru
Hi Team, I have asked this question on Stack Overflow: pyspark - Apache Spark partition by output path - Stack Overflow <https://stackoverflow.com/questions/74089582/apache-spark-partition-by-output-path> *Requirement* 1. I have huge data coming from source and loaded into Azur

pyspark connect to spark thrift server port

2022-10-20 Thread second_co...@yahoo.com.INVALID
Currently my pyspark code is able to connect to the hive metastore at port 9083. However, using this approach I can't put in place any security mechanism like LDAP and SQL authentication control. Is there any way to connect from pyspark to the spark thrift server on port 1 without exposing
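One way to get server-side authentication is to route clients through the Thrift server over JDBC rather than hitting the metastore directly; a sketch (host is a placeholder, and 10000 is the conventional HiveServer2-compatible default port):

```shell
# Connect via JDBC; LDAP or other auth can then be enforced on the
# Thrift server, and port 9083 need not be exposed to clients at all.
beeline -u "jdbc:hive2://thrift-host:10000/default" -n myuser -p
```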

Re: spark on kubernetes

2022-10-16 Thread Qian Sun
vice accounts token and certificate. > and you are right I have to use `service account cert` to configure > spark.kubernetes.authenticate.caCertFile. > Thanks again. best regards. > > On Sat, Oct 15, 2022 at 4:51 PM Qian Sun wrote: > >> Hi Mohammad >> Did you try this com

Re: How to use neo4j cypher/opencypher to query spark RDD/graphdb

2022-10-16 Thread Artemis User
Spark doesn't offer a native graph database like Neo4j does, since GraphX is still using the RDD tabular data structure. Spark doesn't have a GQL or Cypher query engine either, but uses Google's Pregel API for graph processing. Don't see any prospect that Spark is going to

How to use neo4j cypher/opencypher to query spark RDD/graphdb

2022-10-15 Thread ERSyrfw212oe
I think I saw GraphX here and there. Is it a re-implementation of openCypher or is it a graph db for Spark? I wanted to create a graph db and query it with the Cypher language. I looked around the docs and didn't see any relevant guide. SO seems to be tackling specific problems, and I currently don't even know

Re: spark on kubernetes

2022-10-15 Thread Qian Sun
Hi Mohammad Did you try this command? ./bin/spark-submit \ --master k8s://https://vm13:6443 \ --class com.example.WordCounter \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \ --conf spark.kubernetes.container.image=private-docker-registery/spark/spark:3.2.1
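Putting the pieces of this thread together, a hedged sketch of the full submission (deploy mode, cert path, and jar are placeholders; the CA-cert property name is the one mentioned in the follow-up reply):

```shell
./bin/spark-submit \
  --master k8s://https://vm13:6443 \
  --deploy-mode cluster \
  --class com.example.WordCounter \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
  --conf spark.kubernetes.authenticate.caCertFile=/path/to/sa-ca.crt \
  --conf spark.kubernetes.container.image=private-docker-registery/spark/spark:3.2.1 \
  local:///opt/spark/jars/word-counter.jar
```

The "pods is forbidden: User system:anonymous" error in the linked question typically means the submission client authenticated with no credentials, which is why pointing Spark at the service account's token and CA certificate resolves it.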

spark on kubernetes

2022-10-15 Thread Mohammad Abdollahzade Arani
I have a k8s cluster and a spark cluster. My question is as below: https://stackoverflow.com/questions/74053948/how-to-resolve-pods-is-forbidden-user-systemanonymous-cannot-watch-resourc I have searched and I found lots of other similar questions on stackoverflow without an answer

Re: Apache Spark Operator for Kubernetes?

2022-10-14 Thread Artemis User
If you have the hardware resources, it isn't difficult to set up Spark in a kubernetes cluster.  The online doc describes everything you would need (https://spark.apache.org/docs/latest/running-on-kubernetes.html). You're right, both AWS EMR and Google's environment aren'

Apache Spark Operator for Kubernetes?

2022-10-14 Thread Clayton Wohl
My company has been exploring the Google Spark Operator for running Spark jobs on a Kubernetes cluster, but we've found lots of limitations and problems, and the product seems weakly supported. Is there any official Apache option, or plans for such an option, to run Spark jobs on Kubernete

Re: Why the same INSERT OVERWRITE sql , final table file produced by spark sql is larger than hive sql?

2022-10-12 Thread Chartist
Hi, Sadha I have solved this problem. In my case it was caused by the different compression codecs between Hive and Spark. In detail, Hive takes ZLIB as the default ORC compression codec but Spark takes SNAPPY. Finally, when I used the same codec, the final table file produced by spark
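The fix Chartist describes can be sketched as a one-line session setting (or the equivalent `--conf` at submit time); `zlib` here matches Hive's default:

```sql
-- Make Spark's ORC writer use the same codec as Hive before the
-- INSERT OVERWRITE (Spark's default ORC codec is snappy).
SET spark.sql.orc.compression.codec=zlib;
```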

Re: Why the same INSERT OVERWRITE sql , final table file produced by spark sql is larger than hive sql?

2022-10-11 Thread Sadha Chilukoori
I have faced the same problem, where hive and spark orc were using the snappy compression. Hive 2.1 Spark 2.4.8 I'm curious to learn what could be the root cause of this. -S On Tue, Oct 11, 2022, 2:18 AM Chartist <13289341...@163.com> wrote: > > Hi,All > > I encounte

Re: As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-11 Thread Sean Owen
See the pom.xml file https://github.com/apache/spark/blob/master/pom.xml#L3590 2.13.8 at the moment; IIRC there was some Scala issue that prevented updating to 2.13.9. Search issues/PRs. On Tue, Oct 11, 2022 at 6:11 PM Henrik Park wrote: > scala 2.13.9 was released. do you know which sp

Re: As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-11 Thread Henrik Park
scala 2.13.9 was released. do you know which spark version would have it built-in? thanks Sean Owen wrote: I would imagine that Scala 2.12 support goes away, and Scala 3 support is added, for maybe Spark 4.0, and maybe that happens in a year or so. -- Simple Mail https://simplemail.co.in

Re: As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-11 Thread Sean Owen
For Spark, the issue is maintaining simultaneous support for multiple Scala versions, which has historically been mutually incompatible across minor versions. Until Scala 2.12 support is reasonable to remove, it's hard to also support Scala 3, as it would mean maintaining three versions of co

Re: As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-11 Thread Никита Романов
No one knows for sure except Apache, but I’d learn Scala 2 if I were you. Even if Spark one day migrates to Scala 3 (which is not given), it’ll take a while for the industry to adjust. It even takes a while to move from Spark 2 to Spark 3 (Scala 2.11 to Scala 2.12). I don’t think your

Why the same INSERT OVERWRITE sql , final table file produced by spark sql is larger than hive sql?

2022-10-11 Thread Chartist
='1', 'spark.sql.sources.schema.numParts'='1', 'spark.sql.sources.schema.part.0'='xxx SOME OMITTED CONTENT xxx', 'spark.sql.sources.schema.partCol.0'='pt', 'transient_lastDdlTime'='1653484849') ENV: hive versi

As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-10 Thread Oliver Plohmann
Hello, I was lucky and will be joining a project where Spark is being used in conjunction with Python. Scala will not be used at all. Everything will be Python. This means that I have free choice whether to start diving into Scala 2 or Scala 3. For future Spark jobs knowledge of Scala will

Re: [Spark Core][Release]Can we consider add SPARK-39725 into 3.3.1 or 3.3.2 release?

2022-10-04 Thread Bjørn Jørgensen
I have made a PR <https://github.com/apache/spark/pull/38098> for this now. tir. 4. okt. 2022 kl. 19:02 skrev Sean Owen : > I think it's fine to backport that to 3.3.x, regardless of whether it > clearly affects Spark or not. > > On Tue, Oct 4, 2022 at 11:31 AM phoebe

Re: [Spark Core][Release]Can we consider add SPARK-39725 into 3.3.1 or 3.3.2 release?

2022-10-04 Thread Sean Owen
I think it's fine to backport that to 3.3.x, regardless of whether it clearly affects Spark or not. On Tue, Oct 4, 2022 at 11:31 AM phoebe chen wrote: > Hi: > (Not sure if this mailing group is good to use for such question, but just > try my luck here, thanks) > >

[Spark Core][Release]Can we consider add SPARK-39725 into 3.3.1 or 3.3.2 release?

2022-10-04 Thread phoebe chen
Hi: (Not sure if this mailing group is good to use for such a question, but just trying my luck here, thanks) SPARK-39725 <https://issues.apache.org/jira/browse/SPARK-39725> has fixes for security issues CVE-2022-2047 and CVE-2022-2048 (High), which was set to the 3.4.0 release but that will happen Fe

Re: Spark ML VarianceThresholdSelector Unexpected Results

2022-10-01 Thread 姜鑫
Thank you so much for the reply. You are right, and maybe it would be better if it were mentioned in the docs, because some other ML libraries, e.g. sklearn, use population variance. > On Sep 30, 2022, at 10:49 AM, Sean Owen wrote: > > This is sample variance, not population (i.e. divide by n-1, not n). I t

Re: Spark ML VarianceThresholdSelector Unexpected Results

2022-09-29 Thread Sean Owen
This is sample variance, not population (i.e. divide by n-1, not n). I think that's justified as the data are notionally a sample from a population. On Thu, Sep 29, 2022 at 9:21 PM 姜鑫 wrote: > Hi folks, > > Has anyone used VarianceThresholdSelector refer to > https://spark.apache.org/docs/latest
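The distinction Sean describes can be checked with the standard library (the numbers here are illustrative, not the values from the Spark doc's example):

```python
import statistics

data = [6.0, 7.0, 0.0, 8.0]  # one feature column, illustrative values

pop = statistics.pvariance(data)   # population variance: divide by n
samp = statistics.variance(data)   # sample variance: divide by n - 1

# Spark's VarianceThresholdSelector uses the sample form, so its
# variances come out slightly larger than sklearn's population form.
print(pop, samp)  # 9.6875 vs ~12.9167
```

The practical consequence is that a threshold tuned against sklearn's population variances may select a different feature set in Spark.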

Spark ML VarianceThresholdSelector Unexpected Results

2022-09-29 Thread 姜鑫
Hi folks, Has anyone used VarianceThresholdSelector? Refer to https://spark.apache.org/docs/latest/ml-features.html#variancethresholdselector . In the doc, an example is given and says `The variance for the 6 featu

depolying stage-level scheduling for Spark SQL and how to expose RDD code from Spark SQL?

2022-09-29 Thread Chenghao Lyu
Hi, I am trying to deploy the stage-level scheduling for Spark SQL. Since the current stage-level scheduling only supports the RDD-APIs, I want to expose the RDD transformation code from my Spark SQL code (with SQL syntax). Can you provide any pointers on how to do it? Stage level scheduling

Re: Updating Broadcast Variable in Spark Streaming 2.4.4

2022-09-28 Thread Sean Owen
n < i...@ricobergmann.de> wrote: > Hi folks! > > > I'm trying to implement an update of a broadcast var in Spark Streaming. > The idea is that whenever some configuration value has changed (this is > periodically checked by the driver) the existing broadcast variable is >

Updating Broadcast Variable in Spark Streaming 2.4.4

2022-09-28 Thread Dipl.-Inf. Rico Bergmann
Hi folks! I'm trying to implement an update of a broadcast var in Spark Streaming. The idea is that whenever some configuration value has changed (this is periodically checked by the driver) the existing broadcast variable is unpersisted and then (re-)broadcasted. In a local test
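The unpersist-and-rebroadcast pattern from the post can be sketched with plain-Python stand-ins for the Spark API (`FakeBroadcast` and `FakeSparkContext` are mocks invented here, not Spark classes; in a real job the holder would live on the driver and each micro-batch would read the current broadcast from it):

```python
class FakeBroadcast:
    """Stand-in for pyspark.Broadcast: holds a read-only value."""
    def __init__(self, value):
        self.value = value
        self.valid = True
    def unpersist(self):
        self.valid = False

class FakeSparkContext:
    """Stand-in for SparkContext.broadcast()."""
    def broadcast(self, value):
        return FakeBroadcast(value)

def refresh_if_changed(sc, holder, load_config):
    """Driver-side loop body: rebroadcast only when the config changed."""
    new_conf = load_config()
    if new_conf != holder["bc"].value:
        holder["bc"].unpersist()               # drop stale executor copies
        holder["bc"] = sc.broadcast(new_conf)  # re-broadcast the new value
    return holder["bc"].value

sc = FakeSparkContext()
holder = {"bc": sc.broadcast({"rate": 1})}
print(refresh_if_changed(sc, holder, lambda: {"rate": 2}))
```

The key point of the pattern is indirection: tasks must always go through the mutable holder rather than capturing the broadcast object itself in a closure, otherwise they keep seeing the stale value.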

Re: [Spark Kubernetes] Question about Configurability of Labeling Driver Service

2022-09-27 Thread Shiqi Sun
Forgot to paste the link... the spark option is spark.kubernetes.driver.service.annotation.[AnnotationName], and you can see it in https://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties . Thanks, Shiqi On Tue, Sep 27, 2022 at 3:19 PM Shiqi Sun wrote: > Hi all, &g

[Spark Kubernetes] Question about Configurability of Labeling Driver Service

2022-09-27 Thread Shiqi Sun
Hi all, I have the need to add certain labels to the driver headless service when running Spark in Kubernetes. I know that there is this spark option to add annotation (this one <http://spark.kubernetes.driver.service.annotation.[AnnotationName]>), but is there a similar one to add custom

Re: [Spark Internals]: Is sort order preserved after partitioned write?

2022-09-26 Thread Swetha Baskaran
Hi Enrico, Using Spark version 3.1.3 and turning AQE off seems to fix the sorting. Looking into why, do you have thoughts? Thanks, Swetha On Sat, Sep 17, 2022 at 1:58 PM Enrico Minack wrote: > Hi, > > from a quick glance over your transformations, sortCol should be sorted. > >
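The work-around Swetha reports can be expressed as a session-level setting (whether to keep AQE off globally is a trade-off, since AQE helps many other queries):

```sql
-- Disable Adaptive Query Execution so the sortWithinPartitions order
-- survives the partitioned write (per this thread's observation).
SET spark.sql.adaptive.enabled=false;
```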

Re: Error - Spark STREAMING

2022-09-21 Thread Anupam Singh
Which version of spark are you using? On Tue, Sep 20, 2022, 1:57 PM Akash Vellukai wrote: > Hello, > > > py4j.protocol.Py4JJavaError: An error occurred while calling o80.load. : > java.lang.NoClassDefFoundError: > org/apache/spark/sql/internal/connector/SimpleTableProvide

Error - Spark STREAMING

2022-09-20 Thread Akash Vellukai
Hello, py4j.protocol.Py4JJavaError: An error occurred while calling o80.load. : java.lang.NoClassDefFoundError: org/apache/spark/sql/internal/connector/SimpleTableProvider May anyone help Me to solve this issue. Thanks and regards Akash

Re: Spark Structured Streaming - stderr getting filled up

2022-09-19 Thread karan alang
here is the stackoverflow link https://stackoverflow.com/questions/73780259/spark-structured-streaming-stderr-getting-filled-up On Mon, Sep 19, 2022 at 4:41 PM karan alang wrote: > I've created a stackoverflow ticket for this as well > > On Mon, Sep 19, 2022 at 4:37 PM kara

Re: Spark Structured Streaming - stderr getting filled up

2022-09-19 Thread karan alang
I've created a stackoverflow ticket for this as well On Mon, Sep 19, 2022 at 4:37 PM karan alang wrote: > Hello All, > I've a Spark Structured Streaming job on GCP Dataproc - which picks up > data from Kafka, does processing and pushes data back into kafka topics. > >

Spark Structured Streaming - stderr getting filled up

2022-09-19 Thread karan alang
Hello All, I've a Spark Structured Streaming job on GCP Dataproc - which picks up data from Kafka, does processing and pushes data back into kafka topics. A couple of questions: 1. Does Spark put all the logs (incl. INFO, WARN etc) into stderr? What I notice is that stdout is empty, while al
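One mitigation to look at — an assumption on my part, not something from the thread, and note these rolling properties apply to executor logs under Spark's own log handling, so behaviour can differ per cluster manager:

```properties
# Rotate executor stderr/stdout instead of letting them grow unbounded.
spark.executor.logs.rolling.strategy          time
spark.executor.logs.rolling.time.interval     daily
spark.executor.logs.rolling.maxRetainedFiles  7
```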

Re: [Spark Internals]: Is sort order preserved after partitioned write?

2022-09-17 Thread Enrico Minack
Hi, from a quick glance over your transformations, sortCol should be sorted. Are you using Spark 3.2 or above? Can you try again with AQE turned off in that case? https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution Enrico Am 16.09.22 um 23:28 schrieb

Re: [Spark Internals]: Is sort order preserved after partitioned write?

2022-09-16 Thread Swetha Baskaran
d partitions to be preserved after a dataframe write. We use the following code to write out one file per partition, with the rows sorted by a column. df.repartition($"col1").sortWithinPartitions("col1", "col2").write.partitionBy("col1").csv(path) However we observe unexpected sort order in some files. Does spark guarantee sort order within partitions on write? Thanks, swebask

[Spark Core] Joining Same DataFrame Multiple Times Results in Column not getting dropped

2022-09-16 Thread Shahban Riaz
Hi, We have some PySpark code that joins table_a twice to another table, table_b, using the following code. After joining the tables, we drop the key_hash column from the output DataFrame. This code was working fine in spark version 3.0.1. Since upgrading to spark version 3.2.2, the

Re: [Spark Internals]: Is sort order preserved after partitioned write?

2022-09-15 Thread Enrico Minack
sorted by a column. df.repartition($"col1").sortWithinPartitions("col1", "col2").write.partitionBy("col1").csv(path) However we observe unexpected sort order in some files. Does spark guarantee sort order within partitions on write? Thanks, swebask

Re: EXT: Re: Spark SQL

2022-09-15 Thread Vibhor Gupta
unction, does the underlying thread get killed when a TimeoutExc... stackoverflow.com  Regards, Vibhor From: Gourav Sengupta Sent: Thursday, September 15, 2022 10:22 PM To: Mayur Benodekar Cc: user ; i...@spark.apache.org Subject: EXT: Re: Spark SQL EXTERNAL:

[Spark Internals]: Is sort order preserved after partitioned write?

2022-09-15 Thread Swetha Baskaran
).write.partitionBy("col1").csv(path) However we observe unexpected sort order in some files. Does spark guarantee sort order within partitions on write? Thanks, swebask

Re: Spark SQL

2022-09-15 Thread Gourav Sengupta
Okay, so for the problem to the solution 👍 that is powerful On Thu, 15 Sept 2022, 14:48 Mayur Benodekar, wrote: > Hi Gourav, > > It’s the way the framework is > > > Sent from my iPhone > > On Sep 15, 2022, at 02:02, Gourav Sengupta > wrote: > >  > Hi, >

Re: Spark SQL

2022-09-15 Thread Mayur Benodekar
Hi Gourav, It’s the way the framework is. Sent from my iPhone. On Sep 15, 2022, at 02:02, Gourav Sengupta wrote: Hi, Why spark and why scala? Regards, Gourav On Wed, 7 Sept 2022, 21:42 Mayur Benodekar, <askma...@gmail.com> wrote: am new to scala and spark both. I have a code in scala which ex

Re: Spark SQL

2022-09-14 Thread Gourav Sengupta
Hi, Why spark and why scala? Regards, Gourav On Wed, 7 Sept 2022, 21:42 Mayur Benodekar, wrote: > am new to scala and spark both. > > I have a code in scala which executes queries in a while loop one after the > other. > > What we need to do is if a particular query takes m

Re: Long running task in spark

2022-09-14 Thread Sid
ode level , any property which can be used in Spark > Config? > > > I am using Spark2 hence AQE can not be used. > > > Thanks > Rajat >

Re: EXT: Network time out property is not getting set in Spark

2022-09-13 Thread Sachit Murarka
On Tue, Sep 13, 2022, 21:23 Sachit Murarka wrote: > Hi Vibhor, > > Thanks for your response! > > There are some properties which can be set without changing this flag > "spark.sql.legacy.setCommandRejectsSparkCoreConfs" > post creation of spark session , like shuf

Re: EXT: Network time out property is not getting set in Spark

2022-09-13 Thread Vibhor Gupta
Hi Sachit, Check the migration guide. https://spark.apache.org/docs/latest/sql-migration-guide.html#:~:text=Spark%202.4%20and%20below%3A%20the,legacy.setCommandRejectsSparkCoreConfs%20to%20false. Migration Guide: SQL, Datasets and DataFrame - Spark 3.3.0 Documentation - Apache Spark<ht
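The flag called out in the migration guide can be supplied at submit time; a sketch (the application details are placeholders):

```shell
# Restore the Spark 2.4 behaviour of allowing core confs in SET,
# per the sql-migration-guide section linked above.
spark-submit --conf spark.sql.legacy.setCommandRejectsSparkCoreConfs=false my_app.py
```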

Network time out property is not getting set in Spark

2022-09-13 Thread Sachit Murarka
Hello Everyone, I am trying to set the network timeout property; it used to work in Spark 2.x, but in Spark 3 it gives the following error:- Could you please suggest if it is due to a bug in Spark 3, or do we need some other property, because as per the spark official doc this is the unchanged

Re: [SPARK STRUCTURED STREAMING] : Rocks DB uses off-heap usage

2022-09-12 Thread Artemis User
ised it is using much more off-heap space than expected. Because of this, the executors get killed with: out of physical memory exception. Could you please help in understanding why there is a massive increase in off-heap space, and what can we do about it? We are using SPARK 3.

Re: Spark Issue with Istio in Distributed Mode

2022-09-11 Thread Deepak Sharma
oy-v3-api-field-config-core-v3-httpprotocoloptions-idle-timeout > > > On Sat, Sep 3, 2022 at 4:23 AM Deepak Sharma > wrote: > >> Thank for the reply IIan . >> Can we set this in spark conf or does it need to goto istio / envoy conf? >> >> >> >> On S

Long running task in spark

2022-09-11 Thread rajat kumar
Hello Users, My 2 tasks are running forever. One of them gave a java heap space error. I have 10 joins, and all tables are big. I understand this is data skewness. Apart from changes at the code level, is there any property which can be used in the Spark config? I am using Spark 2, hence AQE cannot be used

[SPARK STRUCTURED STREAMING] : Rocks DB uses off-heap usage

2022-09-11 Thread akshit marwah
, why is there a massive increase in off-heap space, and what can we do about it? We are using, SPARK 3.2.1 with 1 executor and 1 executor core, to understand the memory requirements - 1. Rocks DB Run - took 3.5 GB heap and 11.5 GB Res Memory 2. Hdfs State Manager - took 5 GB heap and 10 GB Res Memory
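A hedged first step (an assumption, not from the thread): RocksDB allocates native memory outside the JVM heap, so the container needs explicit headroom; the value below is illustrative:

```properties
# Reserve off-heap headroom for RocksDB's native allocations so the
# resource manager does not kill the executor for exceeding its limit.
spark.executor.memoryOverhead   4g
```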

Re: Pipelined execution in Spark (???)

2022-09-11 Thread Gourav Sengupta
y execute unrelated DAGs in parallel of course. >> >> On Wed, Sep 7, 2022 at 5:49 PM Sungwoo Park wrote: >> >>> You are right -- Spark can't do this with its current architecture. My >>> question was: if there was a new implementation supporting pipelined

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
arallel of course. > > On Wed, Sep 7, 2022 at 5:49 PM Sungwoo Park wrote: > >> You are right -- Spark can't do this with its current architecture. My >> question was: if there was a new implementation supporting pipelined >> execution, what kind of Spark

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
as intended, but for typical Spark jobs (like SparkSQL jobs), we don't see noticeable performance improvement because Spark tasks are mostly short-running tasks. My question was if there would be some category of Spark jobs that would benefit from pipelined execution. Thanks, --- Sungwoo O

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
Oops, it has been long since Russell labored on Hadoop, speculative execution isn’t the right term - that is something else. Cascading has a declarative interface so you can plan more, whereas Spark is more imperative. Point remains :) On Wed, Sep 7, 2022 at 3:56 PM Russell Jurney wrote: >

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
ask on the cascading user group. https://cascading.wensel.net/ On Wed, Sep 7, 2022 at 3:49 PM Sungwoo Park wrote: > You are right -- Spark can't do this with its current architecture. My > question was: if there was a new implementation supporting pipelined > execution, what kind

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sean Owen
re right -- Spark can't do this with its current architecture. My > question was: if there was a new implementation supporting pipelined > execution, what kind of Spark jobs would benefit (a lot) from it? > > Thanks, > > --- Sungwoo > > On Thu, Sep 8, 2022 at 1:47 AM Russel

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
You are right -- Spark can't do this with its current architecture. My question was: if there was a new implementation supporting pipelined execution, what kind of Spark jobs would benefit (a lot) from it? Thanks, --- Sungwoo On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney wrote: >

Spark SQL

2022-09-07 Thread Mayur Benodekar
am new to scala and spark both. I have a code in scala which executes queries in a while loop one after the other. What we need to do is if a particular query takes more than a certain time, for example 10 mins, we should be able to stop the query execution for that particular query and move
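A generic way to sketch the per-query timeout in plain Python (hedged: `concurrent.futures` can stop *waiting* but cannot kill the running thread — in Spark you would pair this with `setJobGroup`/`cancelJobGroup` so the underlying job is actually cancelled):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def run_with_timeout(fn, seconds):
    """Wait at most `seconds` for fn(); report a timeout and move on."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return ("ok", future.result(timeout=seconds))
        except TimeoutError:
            # Give up waiting; note the underlying thread is not killed.
            return ("timeout", None)

print(run_with_timeout(lambda: 42, 1.0))                # -> ('ok', 42)
print(run_with_timeout(lambda: time.sleep(0.3), 0.05))  # -> ('timeout', None)
```

The loop over queries then calls `run_with_timeout` once per query, logging and skipping any that time out instead of blocking the whole batch.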

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
I don't think Spark can do this with its current architecture. It has to wait for the step to be done, speculative execution isn't possible. Others probably know more about why that is. Thanks, Russell Jurney @rjurney <http://twitter.com/rjurney> russell.jur...@gmail.com LI <

Re: Spark equivalent to hdfs groups

2022-09-07 Thread phiroc
Many thanks, Sean. - Original Message - From: "Sean Owen" To: phi...@free.fr Cc: "User" Sent: Wednesday, September 7, 2022 17:05:55 Subject: Re: Spark equivalent to hdfs groups No, because this is a storage concept, and Spark is not a storage system. You would appeal to

Re: Spark equivalent to hdfs groups

2022-09-07 Thread Sean Owen
No, because this is a storage concept, and Spark is not a storage system. You would appeal to tools and interfaces that the storage system provides, like hdfs. Where or how the hdfs binary is available depends on how you deploy Spark where; it would be available on a Hadoop cluster. It's jus

Re: Spark equivalent to hdfs groups

2022-09-07 Thread phiroc
Hi Sean, I'm talking about HDFS Groups. On Linux, you can type "hdfs groups " to get the list of the groups user1 belongs to. In Zeppelin/Spark, the hdfs executable is not accessible. As a result, I wondered if there was a class in Spark (eg. Security or ACL) which would l

Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
Hello Spark users, I have a question on the architecture of Spark (which could lead to a research problem). In its current implementation, Spark finishes executing all the tasks in a stage before proceeding to child stages. For example, given a two-stage map-reduce DAG, Spark finishes executing

Re: Spark equivalent to hdfs groups

2022-09-07 Thread Sean Owen
Spark isn't a storage system or a user management system; no, there is no notion of groups (groups for what?) On Wed, Sep 7, 2022 at 8:36 AM wrote: > Hello, > is there a Spark equivalent to "hdfs groups "? >

Spark equivalent to hdfs groups

2022-09-07 Thread phiroc
Hello, is there a Spark equivalent to "hdfs groups "? Many thanks. Philippe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Spark Structured Streaming - unable to change max.poll.records (showing as 1)

2022-09-06 Thread karan alang
Hello All, I have a Spark Structured Streaming job which reads from Kafka, does processing, and puts data into Mongo/Kafka/GCP buckets (i.e. it is processing-heavy). I'm consistently seeing the following warnings: ``` 22/09/06 16:
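As background for option naming in the Kafka source: properties prefixed with `kafka.` are forwarded to the underlying consumer, while unprefixed options are interpreted by Spark itself. Spark's Kafka reader also manages some consumer settings internally (this thread reports `max.poll.records` showing as 1), so `maxOffsetsPerTrigger` is the supported way to bound records per micro-batch. A sketch with hypothetical broker and topic names:

```python
# Hypothetical broker/topic names; illustrates the option-naming convention
kafka_options = {
    "kafka.bootstrap.servers": "broker1:9092",  # "kafka." prefix -> forwarded to consumer
    "kafka.max.poll.records": "500",            # forwarded too, but Spark may override it
    "subscribe": "events",                      # interpreted by Spark, not the consumer
    "startingOffsets": "latest",
    "maxOffsetsPerTrigger": "10000",            # Spark-level cap per micro-batch
}
# In a real job: spark.readStream.format("kafka").options(**kafka_options).load()

# The properties the Kafka consumer itself would receive:
consumer_props = {k[len("kafka."):]: v
                  for k, v in kafka_options.items() if k.startswith("kafka.")}
```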

Re: Error in Spark in Jupyter Notebook

2022-09-06 Thread Sean Owen
That just says a task failed - no real info there. You have to look at the Spark logs from the UI to see why. On Tue, Sep 6, 2022 at 7:07 AM Mamata Shee wrote: > Hello, > > I'm using spark in Jupyter Notebook, but when performing some queries > getting the below error, can you pl

Error in Spark in Jupyter Notebook

2022-09-06 Thread Mamata Shee
Hello, I'm using spark in Jupyter Notebook, but when performing some queries getting the below error, can you please tell me what is the actual reason for this or any suggestions to make it work? *Error:* [image: image.png] Thank you -- <https://www.xenonstack.com> CONFIDENTIA

Apache Spark - How to convert DataFrame json string to structured element and using schema_of_json

2022-09-05 Thread M Singh
Hi: In apache spark we can read json using the following: spark.read.json("path"). There is support to convert json string in a dataframe into structured element using (https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#from_json-org.apache.spark.

Re: Spark Issue with Istio in Distributed Mode

2022-09-03 Thread Deepak Sharma
Thanks for the reply, Ilan. Can we set this in the Spark conf, or does it need to go into the Istio/Envoy conf? On Sat, 3 Sept 2022 at 10:28, Ilan Filonenko wrote: > This might be a result of the idle_timeout that is configured in envoy. > The default is an hour. > > On Sat, Sep 3, 2022

Spark Issue with Istio in Distributed Mode

2022-09-02 Thread Deepak Sharma
Hi All, In one of our clusters, we enabled Istio where Spark is running in distributed mode. Spark works fine when we run it with Istio in standalone mode. In Spark distributed mode, we are seeing that every hour or so the workers are getting disassociated from the master, and then the master is not able

Re: Spark 3.3.0/3.2.2: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 15

2022-09-01 Thread FengYu Cao
tach the file to it? I can take a look. > > Chao > > On Thu, Sep 1, 2022 at 4:03 AM FengYu Cao wrote: > > > > I'm trying to upgrade our spark (3.2.1 now) > > > > but with spark 3.3.0 and spark 3.2.2, we had error with specific parquet > file > > &

Re: Spark 3.3.0/3.2.2: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 15

2022-09-01 Thread Chao Sun
Hi Fengyu, Do you still have the Parquet file that caused the error? could you open a JIRA and attach the file to it? I can take a look. Chao On Thu, Sep 1, 2022 at 4:03 AM FengYu Cao wrote: > > I'm trying to upgrade our spark (3.2.1 now) > > but with spark 3.3.0 and spark 3.

Re: Moving to Spark 3x from Spark2

2022-09-01 Thread Martin Andersson
You should check the release notes and upgrade instructions. From: rajat kumar Sent: Thursday, September 1, 2022 12:44 To: user @spark Subject: Moving to Spark 3x from Spark2 EXTERNAL SENDER. Do not click links or open attachments unless you recognize the
