Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-01 Thread Sean Owen
Pretty much what it says? you are creating a table over a path that already has data in it. You can't do that without mode=overwrite at least, if that's what you intend. On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga wrote: > > >- Component: Spark Delta, Spark SQL >- Level: Beginner >-
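A minimal sketch of the fix Sean points to, in PySpark (the path and DataFrame are hypothetical; assumes the Delta Lake package is on the classpath):

```python
# Overwrite whatever already sits at the target path instead of failing.
df.write.format("delta").mode("overwrite").save("/data/events")
```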

Re: WARN: netlib.BLAS

2022-08-01 Thread Sean Owen
Hm, I think the problem is either that you need to build the spark-ganglia-lgpl module in your Spark distro, or the pomOnly() part of your build. You need the code in your app. Yes you need openblas too. On Mon, Aug 1, 2022 at 7:36 AM 陈刚 wrote: > Dear expert, > > > I'm using spark-3.1.1 mllib,

Re: Spark Avro Java 17 Compatibility

2022-07-27 Thread Sean Owen
See the documentation at spark.apache.org . Spark 2.4 definitely does not support versions after Java 8. Spark 3.3 supports 17. (General note to anyone mailing the list, don't use a ".invalid" reply-to address) On Wed, Jul 27, 2022 at 7:47 AM Shivaraj Sivasankaran wrote: > Gentle Reminder on

Re: Updating Broadcast Variable in Spark Streaming 2.4.4

2022-07-22 Thread Sean Owen
I think you're taking the right approach, trying to create a new broadcast var. What part doesn't work? for example I wonder if comparing Map equality like that does what you think, isn't it just reference equality? debug a bit more to see whether it even destroys and recreates the broadcast in
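A sketch of the refresh pattern under discussion, in PySpark for illustration (payload and names are hypothetical; note Python dict equality is by value, so the reference-equality concern is Scala-specific):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

bc = sc.broadcast({"threshold": 1})          # initial broadcast

def refresh(new_value):
    global bc
    if new_value != bc.value:                # value comparison, not reference
        bc.unpersist(blocking=True)          # drop old copies on executors
        bc = sc.broadcast(new_value)         # re-broadcast the new payload
```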

Re: [MLlib] Differences after version upgrade

2022-07-20 Thread Sean Owen
How different? I think quite small variations are to be expected. On Wed, Jul 20, 2022 at 9:13 AM Roger Wechsler wrote: > Hi! > > We've been using Spark 3.0.1 to train Logistic regression models > with MLLIb. > We've recently upgraded to Spark 3.3.0 without making any other code > changes and

Re: Building a ML pipeline with no training

2022-07-20 Thread Sean Owen
The data transformation is all the same. Sure, linear regression is easy: https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression These are components that operate on DataFrames. You'll want to look at VectorAssembler to prepare data into an array column. There
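A short sketch of that pipeline (column names are made up):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train_df)  # train_df: DataFrame with x1..x3, label
predictions = model.transform(test_df)
```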

Re: [Building] Building with JDK11

2022-07-18 Thread Sean Owen
Why do you need Java 11 bytecode though? Java 8 bytecode runs fine on Java 11. The settings in the build are really there for testing, not because it's required to use Java 11. On Mon, Jul 18, 2022 at 10:29 PM Gera Shegalov wrote: > Bytecode version is controlled by javac "-target" option for

Re: very simple UI on webpage to display x/y plots+histogram of data stored in hive

2022-07-18 Thread Sean Owen
Sure, look at any python-based plotting package. plot.ly does this nicely. You pull your data via Spark to a pandas DF and do whatever you want. On Mon, Jul 18, 2022 at 1:42 PM Joris Billen wrote: > Hi, > I am making a very short demo and would like to make the most rudimentary > UI (withouth
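A minimal sketch of that flow (table and column names hypothetical; keep the collected result small, since toPandas() pulls it to the driver):

```python
import plotly.express as px

pdf = spark.table("my_hive_table").select("x", "y").toPandas()  # pull to the driver
px.scatter(pdf, x="x", y="y").show()
px.histogram(pdf, x="x").show()
```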

Re: Issue while building spark project

2022-07-18 Thread Sean Owen
Increase the stack size for the JVM when Maven / SBT run. The build sets this but you may still need something like "-Xss4m" in your MAVEN_OPTS On Mon, Jul 18, 2022 at 11:18 AM rajat kumar wrote: > Hello , > > Can anyone pls help me in below error. It is a maven project. It is coming > while

CVE-2022-33891: Apache Spark shell command injection vulnerability via Spark UI

2022-07-17 Thread Sean Owen
Severity: important Description: The Apache Spark UI offers the possibility to enable ACLs via the configuration option spark.acls.enable. With an authentication filter, this checks whether a user has access permissions to view or modify the application. If ACLs are enabled, a code path in

Re: [EXTERNAL] RDD.pipe() for binary data

2022-07-16 Thread Sean Owen
Use GraphFrames? On Sat, Jul 16, 2022 at 3:54 PM Yuhao Zhang wrote: > Hi Shay, > > Thanks for your reply! I would very much like to use pyspark. However, my > project depends on GraphX, which is only available in the Scala API as far > as I know. So I'm locked with Scala and trying to find a

Re: [Building] Building with JDK11

2022-07-15 Thread Sean Owen
Java 8 binaries are probably on your PATH On Fri, Jul 15, 2022, 5:01 PM Szymon Kuryło wrote: > Hello, > > I'm trying to build a Java 11 Spark distro using the > dev/make-distribution.sh script. > I have set JAVA_HOME to point to JDK11 location, I've also set the > java.version property in

Re: Spark (K8S) IPv6 support

2022-07-14 Thread Sean Owen
I don't know about the state of IPv6 support, but yes you're right in guessing that 3.4.0 might be released perhaps early next year. You can always clone the source repo and build it! On Thu, Jul 14, 2022 at 2:19 PM Valer wrote: > Hi, > > We're starting to use IPv6-only K8S cluster (EKS) which

Re: about cpu cores

2022-07-10 Thread Sean Owen
Jobs consist of tasks, each of which consumes a core (can be set to >1 too, but that's a different story). If there are more tasks ready to execute than available cores, some tasks simply wait. On Sun, Jul 10, 2022 at 3:31 AM Yong Walt wrote: > given my spark cluster has 128 cores totally. > If

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread Sean Owen
I think that is more accurate yes. Though shuffle files are local, not on distributed storage, which is an advantage. MR also had map-only transforms and chained mappers, but harder to use. Not impossible, but you could also say Spark just made it easier to do the more efficient thing. On

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread Sean Owen
You're right. I suppose I just mean most operations don't need a shuffle - you don't have 10 stages for 10 transformations. Also: caching in memory is another way that memory is used to avoid IO. On Sat, Jul 2, 2022, 8:42 AM krexos wrote: > This doesn't add up with what's described in the

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread Sean Owen
Because only shuffle stages write shuffle files. Most stages are not shuffles On Sat, Jul 2, 2022, 7:28 AM krexos wrote: > Hello, > > One of the main "selling points" of Spark is that unlike Hadoop map-reduce > that persists intermediate results of its computation to HDFS (disk), Spark > keeps

Re: Spark Group How to Ask

2022-07-01 Thread Sean Owen
Yes, user@spark.apache.org. This incubator address hasn't been used in about 8 years. On Fri, Jul 1, 2022 at 10:24 AM Zehra Günindi wrote: > Hi, > > Is there any group for asking question related to Apache Spark? > > > Sincerely, > Zehra > > obase > TEL: +90216 527 30 00 > FAX: +90216 527 31 11

Re: Follow up on Jira Issue 39549

2022-06-24 Thread Sean Owen
without writing it to disk because of performance issues. > > > > *Chenyang Zhang* > Software Engineering Intern, Platform > Redwood City, California > <https://c3.ai/?utm_source=signature_campaign=enterpriseai> > © 2022 C3.ai. Confidential Information. > > On J

Re: Follow up on Jira Issue 39549

2022-06-24 Thread Sean Owen
Spark is decoupled from storage. You can write data to any storage you like. Anything that can read that data, can read that data - Spark or not, different session or not. Temp views are specific to a session and do not store data. I think this is trivial and no problem at all, or else I'm not

Re: repartition(n) should be deprecated/alerted

2022-06-22 Thread Sean Owen
Eh, there is a huge caveat - you are making your input non-deterministic, where determinism is assumed. I don't think that supports such a drastic statement. On Wed, Jun 22, 2022 at 12:39 PM Igor Berman wrote: > Hi All > tldr; IMHO repartition(n) should be deprecated or red-flagged, so that >

Re: Spark Summit Europe

2022-06-21 Thread Sean Owen
It's still held, just called the Data and AI Summit. https://databricks.com/dataaisummit/ Next one is next week; the last one in Europe was in November 2020, and I think it might be virtual in Europe if held separately this year. On Tue, Jun 21, 2022 at 7:38 AM Gowran, Declan wrote: > Announcing

Re: How to guarantee dataset is split over unique partitions (partitioned by a column value)

2022-06-20 Thread Sean Owen
repartition() puts all values with the same key in one partition, but, multiple other keys can be in the same partition. It sounds like you want groupBy, not repartition, if you want to handle these separately. On Mon, Jun 20, 2022 at 8:26 AM DESCOTTE Loic - externe wrote: > Hi, > > > > I have
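A sketch of the groupBy route (names assumed); each key's rows arrive together regardless of physical partitioning:

```python
import pandas as pd

def handle_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # all rows for one key are in this single pandas DataFrame
    return pdf

result = df.groupBy("key").applyInPandas(handle_group, schema=df.schema)
```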

Re: how to properly filter a dataset by dates ?

2022-06-17 Thread Sean Owen
> But it returned an empty dataset. > On Fri, Jun 17, 2022 at 20:28, Sean Owen wrote: > >> Same answer as last time - those are strings, not dates. 02-02-2015 as a >> string is before 02-03-2012. >> You apply date functions to dates, not strings. >

Re: how to properly filter a dataset by dates ?

2022-06-17 Thread Sean Owen
Same answer as last time - those are strings, not dates. 02-02-2015 as a string is before 02-03-2012. You apply date function to dates, not strings. You have to parse the dates properly, which was the problem in your last email. On Fri, Jun 17, 2022 at 12:58 PM marc nicole wrote: > Hello, > > I
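A sketch of parsing before comparing (column name and date format are assumed):

```python
from pyspark.sql import functions as F

parsed = df.withColumn("d", F.to_date("date_str", "dd-MM-yyyy"))
parsed.filter(F.col("d") > F.to_date(F.lit("02-03-2012"), "dd-MM-yyyy")).show()
```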

Re: How to recognize and get the min of a date/string column in Java?

2022-06-14 Thread Sean Owen
my below code to work I cast to string the resulting min > column. > > On Tue, Jun 14, 2022 at 21:12, Sean Owen wrote: > >> You haven't shown your input or the result >> >> On Tue, Jun 14, 2022 at 1:40 PM marc nicole wrote: >> >>> Hi Sean,

Re: How to recognize and get the min of a date/string column in Java?

2022-06-14 Thread Sean Owen
Yes that is right. It has to be parsed as a date to correctly reason about ordering. Otherwise you are finding the minimum string alphabetically. Small note, MM is month. mm is minute. You have to fix that for this to work. These are Java format strings. On Tue, Jun 14, 2022, 12:32 PM marc
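The same idea for the min, as a short sketch (column name and format assumed; note MM vs mm):

```python
from pyspark.sql import functions as F

parsed = df.withColumn("d", F.to_date("date_str", "dd-MM-yyyy"))  # MM = month, mm = minute
parsed.agg(F.min("d")).show()   # min over real dates, not strings
```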

Re: API Problem

2022-06-09 Thread Sean Owen
That repartition seems to do nothing? But yes the key point is use col() On Thu, Jun 9, 2022, 9:41 PM Stelios Philippou wrote: > Perhaps > > > finalDF.repartition(finalDF.rdd.getNumPartitions()).withColumn("status_for_batch > > To > >

Re: How the data is distributed

2022-06-06 Thread Sean Owen
Data is not distributed to executors by anything. If you are processing data with Spark, Spark spawns tasks on executors to read chunks of data from wherever they are (S3, HDFS, etc). On Mon, Jun 6, 2022 at 4:07 PM Sid wrote: > Hi experts, > > > When we load any file, I know that based on the

Re: How to convert a Dataset to a Dataset?

2022-06-04 Thread Sean Owen
which would then yield correct > column types. > What do you think? > > On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote: > >> I don't think you want to do that. You get a string representation of >> structured data without the structure, at best. This is part of t

Re: How to convert a Dataset to a Dataset?

2022-06-04 Thread Sean Owen
I don't think you want to do that. You get a string representation of structured data without the structure, at best. This is part of the reason it doesn't work directly this way. You can use a UDF to call .toString on the Row of course, but, again what are you really trying to do? On Sat, Jun 4,

Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-17 Thread Sean Owen
I don't think that is standard SQL? what are you trying to do, and why not do it outside SQL? On Tue, May 17, 2022 at 6:03 PM K. N. Ramachandran wrote: > Gentle ping. Any info here would be great. > > Regards, > Ram > > On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran > wrote: > >> Hello

Re: How do I read parquet with python object

2022-05-09 Thread Sean Owen
That's a parquet library error. It might be this: https://issues.apache.org/jira/browse/PARQUET-1633 That's fixed in recent versions of Parquet. You didn't say what versions of libraries you are using, but try the latest Spark. On Mon, May 9, 2022 at 8:49 AM wrote: > # python: > > import

Re: Reg: CVE-2020-9480

2022-04-28 Thread Sean Owen
It is not a real dependency, so should not be any issue. I am not sure why your tool flags it at all. On Thu, Apr 28, 2022 at 10:04 PM Sundar Sabapathi Meenakshi < sun...@mcruncher.com> wrote: > Hi all, > > I am using spark-sql_2.12 dependency version 3.2.1 in my > project. My

Re: When should we cache / persist ? After or Before Actions?

2022-04-27 Thread Sean Owen
t;> >> Btw, I’m not sure if caching is useful when you have a HUGE dataframe. >> Maybe persisting will be more useful >> >> Best regards >> >> On 21 Apr 2022, at 16:24, Sean Owen wrote: >> >>  >> You persist before actions, not af

Re: Why is spark running multiple stages with the same code line?

2022-04-21 Thread Sean Owen
executors. Or is this assumption wrong? > Thanks, > > Joe > > > On Thu, 2022-04-21 at 09:14 -0500, Sean Owen wrote: > > A job can have multiple stages for sure. One action triggers a job. > > This seems normal. > > > > On Thu, Apr 21, 2022, 9:10 AM Joe wrote:

Re: Why is spark running multiple stages with the same code line?

2022-04-21 Thread Sean Owen
A job can have multiple stages for sure. One action triggers a job. This seems normal. On Thu, Apr 21, 2022, 9:10 AM Joe wrote: > Hi, > When looking at application UI (in Amazon EMR) I'm seeing one job for > my particular line of code, for example: > 64 Running count at MySparkJob.scala:540 > >

Re: When should we cache / persist ? After or Before Actions?

2022-04-21 Thread Sean Owen
You persist before actions, not after, if you want the action's outputs to be persistent. If anything swap line 2 and 3. However, there's no point in the count() here, and because there is already only one action following to write, no caching is useful in that example. On Thu, Apr 21, 2022 at
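The general rule as a sketch (paths hypothetical): mark the DataFrame for caching before the first of several actions that reuse it.

```python
df = spark.read.parquet("/data/in")
df.cache()                                       # before the actions, not after
n = df.count()                                   # first action materializes the cache
df.write.mode("overwrite").parquet("/data/out")  # second action reuses cached data
```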

Re: How is union() implemented? Need to implement column bind

2022-04-21 Thread Sean Owen
and max on column values not work ? > > Cheers, > Sonal > https://github.com/zinggAI/zingg > > > > On Thu, Apr 21, 2022 at 6:50 AM Sean Owen wrote: > >> Oh, Spark directly supports upserts (with the right data destination) and >> yeah you could do this as 1

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
Oh, Spark directly supports upserts (with the right data destination) and yeah you could do this as 1+ updates to a table without any pivoting, etc. It'd still end up being 10K+ single joins along the way but individual steps are simpler. It might actually be pretty efficient I/O wise as

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
I know bigQuery use map reduce like spark. > > > > Kind regards > > > > Andy > > > > *From: *Sean Owen > *Date: *Wednesday, April 20, 2022 at 2:31 PM > *To: *Andrew Melo > *Cc: *Andrew Davidson , Bjørn Jørgensen < > bjornjorgen...@gmail.com>,

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
, 2022 at 4:29 PM Andrew Melo wrote: > It would certainly be useful for our domain to have some sort of native > cbind(). Is there a fundamental disapproval of adding that functionality, > or is it just a matter of nobody implementing it? > > On Wed, Apr 20, 2022 at 16:28 S

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
cs/3.1.1/api/python/reference/api/pyspark.sql.functions.concat.html#pyspark.sql.functions.concat> > like the pyspark version takes 2 columns and concat it to one column. > > ons. 20. apr. 2022 kl. 21:04 skrev Sean Owen : > >> cbind? yeah though the answer is typically a join. I don't know if >>

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
e BigQuery > might work better? I do not know much about the implementation. > > > > No one tool will solve all problems. Once I get the matrix I think it > spark will work well for our need > > > > Kind regards > > > > Andy > > > > *From

Re: Grouping and counting occurences of specific column rows

2022-04-19 Thread Sean Owen
Just .groupBy(...).count() ? On Tue, Apr 19, 2022 at 6:24 AM marc nicole wrote: > Hello guys, > > I want to group by certain column attributes (e.g.,List > groupByQidAttributes) a dataset (initDataset) and then count the > occurrences of associated grouped rows, how do i achieve that neatly? >
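In PySpark terms, a one-line sketch (column names assumed):

```python
counts = df.groupBy("qid1", "qid2").count()
```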

Re: RDD memory use question

2022-04-19 Thread Sean Owen
Don't collect() - that pulls all data into memory. Use count(). On Tue, Apr 19, 2022 at 5:34 AM wilson wrote: > Hello, > > Do you know for a big dataset why the general RDD job can be done, but > the collect() failed due to memory overflow? > > for instance, for a dataset which has xxx million

Re: How is union() implemented? Need to implement column bind

2022-04-18 Thread Sean Owen
A join is the natural answer, but this is a 10114-way join, which probably chokes readily just to even plan it, let alone all the shuffling of huge data. You could tune your way out of it maybe, but not optimistic. It's just huge. You could go off-road and lower-level to take
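One lower-level sketch of a column bind via stable row indexes (names hypothetical; each join still shuffles, so this stays expensive at 10K+ columns):

```python
def with_index(df):
    # zipWithIndex assigns a stable 0..n-1 index to each row
    return (df.rdd.zipWithIndex()
              .map(lambda pair: tuple(pair[0]) + (pair[1],))
              .toDF(df.columns + ["_idx"]))

cbound = with_index(df_a).join(with_index(df_b), "_idx").drop("_idx")
```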

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-18 Thread Sean Owen
It looks good, are you sure it even starts? the problem I see is that you send a copy of the model from the driver for every task. Try broadcasting the model instead. I'm not sure if that resolves it but would be a good practice. On Mon, Apr 18, 2022 at 9:10 AM Xavier Gervilla wrote: > Hi Team,
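A sketch of the broadcast suggestion (the model and its predict API are hypothetical):

```python
bc_model = spark.sparkContext.broadcast(ner_model)   # shipped once per executor

def predict_partition(rows):
    model = bc_model.value            # fetched locally, not re-sent per task
    for row in rows:
        yield model.predict(row)      # hypothetical scoring call

predictions = rdd.mapPartitions(predict_partition)
```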

Re: cannot access class sun.nio.ch.DirectBuffer

2022-04-13 Thread Sean Owen
with jdk17 " or should I open another discussion? > > > > > > > > > > *Thanks And RegardsSibi.ArunachalammCruncher* > > > On Wed, Apr 13, 2022 at 10:16 PM Sean Owen wrote: > >> Yes I think that's a change that has caused difficulties, but,

Re: cannot access class sun.nio.ch.DirectBuffer

2022-04-13 Thread Sean Owen
who used the unsafe API > either directly or indirectly (via netty, etc..) it's a bit surprising that > it was so thoroughly closed off without an escape hatch, but I'm sure there > was a lively discussion around it... > > Cheers > Andrew > > On Wed, Apr 13, 202

Re: cannot access class sun.nio.ch.DirectBuffer

2022-04-13 Thread Sean Owen
ther workaround)? > > Thanks > Andrew > > On Tue, Apr 12, 2022 at 08:45 Sean Owen wrote: > >> In Java 11+, you will need to tell the JVM to allow access to internal >> packages in some cases, for any JVM application. You will need flags like >> "--add-opens=java.b

Re: cannot access class sun.nio.ch.DirectBuffer

2022-04-12 Thread Sean Owen
In Java 11+, you will need to tell the JVM to allow access to internal packages in some cases, for any JVM application. You will need flags like "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED", which you can see in the pom.xml file for the project. Spark 3.2 does not necessarily work with Java 17
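One way to pass such flags from PySpark, as a sketch (only one of the several --add-opens flags from Spark's pom.xml is shown):

```python
from pyspark.sql import SparkSession

opens = "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
spark = (SparkSession.builder
         .config("spark.driver.extraJavaOptions", opens)
         .config("spark.executor.extraJavaOptions", opens)
         .getOrCreate())
```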

Re: Spark Write BinaryType Column as continues file to S3

2022-04-08 Thread Sean Owen
That's for strings, but still doesn't address what is desired w.r.t. writing a binary column On Fri, Apr 8, 2022 at 10:31 AM Bjørn Jørgensen wrote: > In the New spark 3.3 there Will be an sql function > https://github.com/apache/spark/commit/25dd4254fed71923731fd59838875c0dd1ff665a > hope this

Re: Spark Write BinaryType Column as continues file to S3

2022-04-08 Thread Sean Owen
You can certainly write that UDF. You get a column in a DataFrame of array type and you can write that to any appropriate format. What do you mean by continuous byte stream? something besides, say, parquet files holding the byte arrays? On Fri, Apr 8, 2022 at 10:14 AM Philipp Kraus <

Re: Aggregate over a column: the proper way to do

2022-04-08 Thread Sean Owen
Dataset.count() returns one value directly? On Thu, Apr 7, 2022 at 11:25 PM sam smith wrote: > My bad, yes of course that! still I don't like the .. > select("count(myCol)") .. part in my line; is there any replacement to that? > > On Fri, Apr 8, 2022 at 06:13, Sean Owen

Re: Aggregate over a column: the proper way to do

2022-04-07 Thread Sean Owen
Wait, why groupBy at all? After the filter only rows with myCol equal to your target are left. There is only one group. Don't group just count after the filter? On Thu, Apr 7, 2022, 10:27 PM sam smith wrote: > I want to aggregate a column by counting the number of rows having the > value
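As a sketch (column and value assumed):

```python
from pyspark.sql import functions as F

n = df.filter(F.col("myCol") == "targetValue").count()   # no groupBy needed
```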

Re: Spark 3.0.1 and spark 3.2 compatibility

2022-04-07 Thread Sean Owen
(Don't cross post please) Generally you definitely want to compile and test vs what you're running on. There shouldn't be many binary or source incompatibilities -- these are avoided in a major release where possible. So it may need no code change. But I would certainly recompile just on

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-01 Thread Sean Owen
##IMPROVEMENT END > ... > df12=df11.spark.sql(complex stuff) > spark.sql(CACHE TABLE df10) > ... > df13=spark.sql( complex stuff with df12) > df13.write > df14=spark.sql( some other complex stuff with df12) > df14.write > df15=spark.s

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-03-31 Thread Sean Owen
as I cached the >> table in spark sql): >> >> >> sqlContext.sql("UNCACHE TABLE mytableofinterest ") >> spark.stop() >> >> >> Wrt looping: if I want to process 3 years of data, my modest cluster will >> never do it in one go, I wo

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-03-30 Thread Sean Owen
The Spark context does not stop when a job does. It stops when you stop it. There could be many ways mem can leak. Caching maybe - but it will evict. You should be clearing caches when no longer needed. I would guess it is something else your program holds on to in its logic. Also consider not

Re: GraphX Support

2022-03-21 Thread Sean Owen
GraphX is not active, though still there and does continue to build and test with each Spark release. GraphFrames kind of superseded it, but is also not super active FWIW. On Mon, Mar 21, 2022 at 6:03 PM Jacob Marquez wrote: > Hello! > > > > My team and I are evaluating GraphX as a possible

Re: [Spark SQL] Structured Streaming in pyhton can connect to cassandra ?

2022-03-21 Thread Sean Owen
Looks like you are trying to apply this class/function across Spark, but it contains a non-serialized object, the connection. That has to be initialized on use, otherwise you try to send it from the driver and that can't work. On Mon, Mar 21, 2022 at 11:51 AM guillaume farcy <

Re: Continuous ML model training in stream mode

2022-03-17 Thread Sean Owen
Sengupta wrote: > Dear friends, > > a few years ago, I was in a London meetup seeing Sean (Owen) demonstrate > how we can try to predict the gender of individuals who are responding to > tweets after accepting privacy agreements, in case I am not wrong. > > It was real tim

Re: [Pyspark] [Linear Regression] Can't Fit Data

2022-03-17 Thread Sean Owen
The error points you to the answer. Somewhere in your code you are parsing dates, and the date format is no longer valid / supported. These changes are doc'ed in the docs it points you to. It is not related to the regression itself. On Thu, Mar 17, 2022 at 11:35 AM Bassett, Kenneth wrote: >

Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

2022-03-15 Thread Sean Owen
Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, which adds a new corr col every time to a list. On Tue, Mar 15, 2022, 10:30 PM wrote: > Hi all, > > I am stuck at a correlation calculation problem. I have a dataframe like >
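A sketch of that loop, building all the correlation columns in one aggregation (column names assumed):

```python
from pyspark.sql import functions as F

target = "y"
corr_exprs = [F.corr(c, target).alias(f"corr_{c}_{target}")
              for c in df.columns if c != target]
result = df.groupBy("group_id").agg(*corr_exprs)
```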

Re: Continuous ML model training in stream mode

2022-03-15 Thread Sean Owen
There is a streaming k-means example in Spark. https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means On Tue, Mar 15, 2022, 3:46 PM Artemis User wrote: > Has anyone done any experiments of training an ML model using stream > data? especially for unsupervised models? Any
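A minimal sketch of that example (the DStreams of feature Vectors are assumed to be set up elsewhere):

```python
from pyspark.mllib.clustering import StreamingKMeans

model = (StreamingKMeans(k=3, decayFactor=0.5)
         .setRandomCenters(dim=2, weight=1.0, seed=42))
model.trainOn(training_stream)              # updates centers on each batch
predictions = model.predictOn(test_stream)  # labels incoming points
```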

Re: spark distribution build fails

2022-03-14 Thread Sean Owen
Try increasing the stack size in the build. It's the Xss argument you find in various parts of the pom or sbt build. I have seen this and not sure why it happens on certain envs, but that's the workaround On Mon, Mar 14, 2022, 8:59 AM Bulldog20630405 wrote: > > using tag v3.2.1 with java 8

Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-10 Thread Sean Owen
tion since it doesn't do any harm. Spark >> > uses lazy binding so you can do a lot of such "unharmful" things. >> > Developers will have to understand the behaviors of each API before >> when >> > using them.. >> > >> > >> > On 3/9/2

Re: spark jobs don't require the master/worker to startup?

2022-03-09 Thread Sean Owen
You can run Spark in local mode and not require any standalone master or worker. Are you sure you're not using local mode? are you sure the daemons aren't running? What is the Spark master you pass? On Wed, Mar 9, 2022 at 7:35 PM wrote: > What I tried to say is, I didn't start spark

Re: RebaseDateTime with dynamicAllocation

2022-03-09 Thread Sean Owen
Doesn't quite seem the same. What is the rest of the error -- why did the class fail to initialize? On Wed, Mar 9, 2022 at 10:08 AM Andreas Weise wrote: > Hi, > > When playing around with spark.dynamicAllocation.enabled I face the > following error after the first round of executors have been

Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Sean Owen
> Cheers - Rafal > > On Wed, 9 Mar 2022 at 13:15, Sean Owen wrote: > >> That isn't a bug - you can't change the classpath once the JVM is >> executing. >> >> On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła >> wrote: >> >>> Hi, >>> My us

Re: spark jobs don't require the master/worker to startup?

2022-03-09 Thread Sean Owen
Did it start successfully? What do you mean ports were not opened? On Wed, Mar 9, 2022 at 3:02 AM wrote: > Hello > > I have spark 3.2.0 deployed in localhost as the standalone mode. > I even didn't run the start master and worker command: > > start-master.sh > start-worker.sh

Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Sean Owen
That isn't a bug - you can't change the classpath once the JVM is executing. On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła wrote: > Hi, > My use case is that, I have a long running process (orchestrator) with > multiple tasks, some tasks might require extra spark dependencies. It seems > once
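In practice that means configuring packages before the session (and its JVM) exists, as in this sketch (coordinates are illustrative); changing them later requires a new process:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.2.1")
         .getOrCreate())
```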

Re: spark 3.2.1 download

2022-03-07 Thread Sean Owen
Hm, 3.2.1 shows up for me, it's the default. Try refreshing the page? sometimes people have an old cached copy. On Mon, Mar 7, 2022 at 10:30 AM Bulldog20630405 wrote: > > from website spark 3.2.1 has been released in January 2022; however not > available for download from =>

Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Sean Owen
tion for each of the members of the group > yes (or the accumulative per element, don't really know how to phrase > that), and the correlation is affected by the counter used for the column, > right? Top to bottom? > > Ps. Thank you so much for replying so fast! > > On Mon, Feb 28, 2022 at

Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Sean Owen
How are you defining the window? It looks like it's something like "rows unbounded preceding, current" or the reverse, as the correlation varies across the elements of the group as if it's computing them on 1, then 2, then 3 elements. Don't you want the correlation across the group? otherwise

Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Sean Owen
You're computing correlations of two series of values, but each series has one value, a sum. Correlation is not defined in this case (both variances are undefined). This is sample correlation, note. On Mon, Feb 28, 2022 at 7:06 AM Edgar H wrote: > Morning all, been struggling with this for a
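For one correlation per whole group rather than a running window, a grouped aggregate is the usual shape (a sketch; names assumed):

```python
from pyspark.sql import functions as F

df.groupBy("g").agg(F.corr("x", "y").alias("corr_xy")).show()
```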

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sean Owen
"count distinct' does not have that problem, whether in a group-by or not. I'm still not sure these are equivalent queries but maybe not seeing it. Windowing makes sense when you need the whole window, or when you need sliding windows to express the desired groups. It may be unnecessary when your

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sean Owen
Employee, Salary from ( > select d.name as Department, e.name as Employee, e.salary as > Salary, dense_rank() over(partition by d.name order by e.salary desc) as > rnk from Department d join Employee e on e.departmentId=d.id ) a where > rnk<=3 > > Time Taken: 790 ms > > Thanks,

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sean Owen
Those two queries are identical? On Sun, Feb 27, 2022 at 11:30 AM Sid wrote: > Hi Team, > > I am aware that if windowing functions are used, then at first it loads > the entire dataset into one window,scans and then performs the other > mentioned operations for that particular window which

Re: Issue while creating spark app

2022-02-26 Thread Sean Owen
I don't think any of that is related, no. How are your dependencies set up? Manually with IJ, or in a build file (Maven, Gradle)? Normally you do the latter and dependencies are taken care of for you, but your app would definitely have to express a dependency on Scala libs. On Sat, Feb 26, 2022 at

Re: Spark Kafka Integration

2022-02-25 Thread Sean Owen
Spark 3.2.1 is compiled vs Kafka 2.8.0; the forthcoming Spark 3.3 against Kafka 3.1.0. It may well be mutually compatible though. On Fri, Feb 25, 2022 at 2:40 PM Michael Williams (SSI) < michael.willi...@ssigroup.com> wrote: > I believe it is 3.1, but if there is a different version that “works

Re: Spark Kafka Integration

2022-02-25 Thread Sean Owen
That .jar is available on Maven, though typically you depend on it in your app, and compile an uber JAR which will contain it and all its dependencies. You can I suppose manage to compile an uber JAR from that dependency itself with tools if needed. On Fri, Feb 25, 2022 at 1:37 PM Michael
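A sketch of wiring it up (broker, topic, and package coordinates are illustrative):

```python
# submitted with, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1 app.py
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .load())
```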

Re: DataTables 1.10.20 reported vulnerable in spark-core_2.13:3.2.1

2022-02-24 Thread Sean Owen
What is the vulnerability and does it affect Spark? what is the remediation? Can you try updating these and open a pull request if it works? On Thu, Feb 24, 2022 at 7:28 AM vinodh palanisamy wrote: > Hi Team, > We are using spark-core_2.13:3.2.1 in our project. Where in that > version

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-24 Thread Sean Owen
On the contrary, distributed deep learning is not data parallel. It's dominated by the need to share parameters across workers. Gourav, I don't understand what you're looking for. Have you looked at Petastorm and Horovod? they _use Spark_, not another platform like Ray. Why recreate this which has

Re: Unable to display JSON records with null values

2022-02-23 Thread Sean Owen
There is no record "345" here it seems, right? it's not that it exists and has null fields; it's invalid w.r.t. the schema that the rest suggests. On Wed, Feb 23, 2022 at 11:57 AM Sid wrote: > Hello experts, > > I have a JSON data like below: > > [ > { > "123": { > "Party1": { >

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sean Owen
The standalone koalas project should have the same functionality for older Spark versions: https://koalas.readthedocs.io/en/latest/ You should be moving to Spark 3 though; 2.x is EOL. On Wed, Feb 23, 2022 at 9:06 AM Sid wrote: > Cool. Here, the problem is I have to run the Spark jobs on Glue
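A sketch of the distributed read (the path is hypothetical; an Excel reader such as openpyxl must be installed on the cluster):

```python
import pyspark.pandas as ps       # Spark >= 3.2; use the koalas package on 2.x

psdf = ps.read_excel("s3://bucket/report.xlsx")
sdf = psdf.to_spark()             # back to a plain Spark DataFrame
```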

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sean Owen
This isn't pandas, it's pandas on Spark. It's distributed. On Wed, Feb 23, 2022 at 8:55 AM Sid wrote: > Hi Bjørn, > > Thanks for your reply. This doesn't help while loading huge datasets. > Won't be able to achieve spark functionality while loading the file in > distributed manner. > > Thanks,

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-23 Thread Sean Owen
n the fact that if > SPARK were to be able to natively scale out and distribute data to > tensorflow, or pytorch then there will be competition between Ray and SPARK. > > Regards, > Gourav Sengupta > > On Wed, Feb 23, 2022 at 12:35 PM Sean Owen wrote: > >> Spark does do dis

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-23 Thread Sean Owen
tributor libraries, and >>> there has been no major development recently on those libraries. I faced >>> the issue of version dependencies on those and had a hard time fixing the >>> library compatibilities. Hence a couple of below doubts:- >>>

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-22 Thread Sean Owen
y dependencies? >- Any other library which is suitable for my use case.? >- Any example code would really be of great help to understand. > > > > Thanks, > > Vijayant > > > > *From:* Sean Owen > *Sent:* Wednesday, February 23, 2022 8:40 AM > *To:

Re: TensorFlow on Spark

2022-02-22 Thread Sean Owen
Sure, Horovod is commonly used on Spark for this: https://horovod.readthedocs.io/en/stable/spark_include.html On Tue, Feb 22, 2022 at 8:51 PM Vijayant Kumar wrote: > Hi All, > > > > Anyone using Apache spark with TensorFlow for building models. My > requirement is to use TensorFlow distributed

Re: Need to make WHERE clause compulsory in Spark SQL

2022-02-22 Thread Sean Owen
Spark does not use Hive for execution, so Hive params will not have an effect. I don't think you can enforce that in Spark. Typically you enforce things like that at a layer above your SQL engine, or can do so, because there is probably other access you need to lock down. On Tue, Feb 22, 2022 at

Re: Question about spark.sql min_by

2022-02-21 Thread Sean Owen
From the source code, it looks like this function was added to pyspark in Spark 3.3, up for release soon. It exists in SQL. You can still use it in SQL with `spark.sql(...)` in Python though, not hard. On Mon, Feb 21, 2022 at 4:01 AM David Diebold wrote: > Hello all, > > I'm trying to use the
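A sketch of the SQL route (table and column names assumed):

```python
df.createOrReplaceTempView("t")
# min_by(x, y): the value of x on the row where y is smallest
spark.sql("SELECT min_by(price, ts) AS first_price FROM t").show()
```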

Re: Encoders.STRING() causing performance problems in Java application

2022-02-21 Thread Sean Owen
a time by the self-built prediction pipeline (which is also using > other ML techniques apart from Spark). Needs some re-factoring... > > Thanks again for the help. > > Cheers, > > Martin > > > Am 2022-02-18 13:41, schrieb Sean Owen: > > That doesn't make a l

Re: Apache spark 3.0.3 [Spark lower version enhancements]

2022-02-18 Thread Sean Owen
(quoted dependency-scan table referencing avd.aquasec.com/nvd/cve-2018-1000873) > Rajesh Krishnamur

Re: Encoders.STRING() causing performance problems in Java application

2022-02-18 Thread Sean Owen
That doesn't make a lot of sense. Are you profiling the driver, rather than executors where the work occurs? Is your data set quite small such that small overheads look big? Do you even need Spark if your data is not distributed - coming from the driver anyway? The fact that a static final field

Re: Implementing circuit breaker pattern in Spark

2022-02-16 Thread Sean Owen
circuit breaker. So what that essentially means is we should not >> be catching those HTTP 5XX exceptions (which we currently do) and let the >> tasks fail on their own only for spark to retry them a finite number of >> times and then subsequently fail and thereby break the circuit.

Re: Implementing circuit breaker pattern in Spark

2022-02-16 Thread Sean Owen
that microbatch. This approach keeps the > pipeline alive and keeps pushing messages to DLQ microbatch after > microbatch until the microservice is back up. > > > On Wed, Feb 16, 2022 at 6:50 PM Sean Owen wrote: > >> You could use the same pattern in your flatMap function. If y

Re: Implementing circuit breaker pattern in Spark

2022-02-16 Thread Sean Owen
You could use the same pattern in your flatMap function. If you want Spark to keep retrying though, you don't need any special logic, that is what it would do already. You could increase the number of task retries though; see the spark.excludeOnFailure.task.* configurations. You can just
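A sketch of raising the retry budget (the excludeOnFailure settings are separate knobs):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.task.maxFailures", "8")   # default is 4
         .getOrCreate())
```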
