Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Sean Owen
Probably could also be because that coalesce can cause some upstream transformations to also have parallelism of 1. I think (?) an OK solution is to cache the result, then coalesce and write. Or combine the files after the fact. or do what Silvio said. On Wed, Feb 3, 2021 at 12:55 PM James Yu
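A minimal sketch of that cache-then-write approach (the DataFrame and output path here are illustrative):

    # Materialize at full parallelism first; otherwise coalesce(1) can
    # drag the upstream transformations down to a single task.
    df.cache()
    df.count()          # forces evaluation while the data is cached
    df.coalesce(1).write.mode("overwrite").parquet("s3://bucket/out")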

Re: Exception on Avro Schema Object Serialization

2021-02-02 Thread Sean Owen
Your function is somehow capturing the actual Avro schema object, which won't serialize. Try rewriting it to ensure that it isn't used in the function. On Tue, Feb 2, 2021 at 2:32 PM Artemis User wrote: > We tried to standardize the SQL data source management using the Avro > schema, but
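A common workaround is to ship the schema as its JSON string and re-parse it inside the task. A sketch, assuming an RDD of Avro records and a schema object named `schema` (the `process` helper is hypothetical):

    import org.apache.avro.Schema

    val schemaJson = schema.toString  // a plain String, which serializes fine
    val out = rdd.mapPartitions { iter =>
      // Rebuild the Schema once per partition instead of capturing the object
      val localSchema = new Schema.Parser().parse(schemaJson)
      iter.map(record => process(record, localSchema))  // hypothetical helper
    }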

Re: Java/Spark

2021-02-01 Thread Sean Owen
ibited and subject to prosecution to the > fullest extent of the law! If you are not the intended recipient, please > delete this electronic message and DO NOT ACT UPON, FORWARD, COPY OR > OTHERWISE DISSEMINATE IT OR ITS CONTENTS." > > > > *From:* Sean Owen > *Sent:* M

Re: Java/Spark

2021-02-01 Thread Sean Owen
The Spark distro does not include Java. That has to be present in the environment where the Spark cluster is run. It works with Java 8, and 11 in 3.x (Oracle and OpenJDK AFAIK). It seems to 99% work on 14+ even. On Mon, Feb 1, 2021 at 9:11 AM wrote: > Hello, > > > > I am looking for information

Re: Apache Spark

2021-01-26 Thread Sean Owen
To clarify: Apache projects and the ASF do not provide paid support. However there are many vendors who provide distributions of Apache Spark who will provide technical support - not nearly just Databricks but Cloudera, etc. There are also plenty of consultancies and individuals who can provide

Re: Using same rdd from two threads

2021-01-22 Thread Sean Owen
RDDs are immutable, and Spark itself is thread-safe. This should be fine. Something else is going on in your code. On Fri, Jan 22, 2021 at 7:59 AM jelmer wrote: > HI, > > I have a piece of code in which an rdd is created from a main method. > It then does work on this rdd from 2 different

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Sean Owen
ing to find a more elegant approach. > > > > On Thu, Jan 21, 2021 at 5:28 PM Sean Owen wrote: > >> If you mean you want to train N models in parallel, you wouldn't be able >> to do that with a groupBy first. You apply logic to the result of groupBy >> with Spark, bu

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Sean Owen
If you mean you want to train N models in parallel, you wouldn't be able to do that with a groupBy first. You apply logic to the result of groupBy with Spark, but can't use Spark within Spark. You can run N Spark jobs in parallel on the driver but you'd have to have each read the subset of data
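A sketch of that driver-side pattern (the group/feature/label column names and the estimator are hypothetical):

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.ml.regression import LinearRegression

    keys = [r["group"] for r in df.select("group").distinct().collect()]

    def fit_one(key):
        # Each fit launches its own Spark job over that group's subset
        subset = df.filter(df["group"] == key)
        return key, LinearRegression(featuresCol="features",
                                     labelCol="label").fit(subset)

    with ThreadPoolExecutor(max_workers=4) as pool:
        models = dict(pool.map(fit_one, keys))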

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Sean Owen
Is your app accumulating a lot of streaming state? that's one reason something could slow down after a long time. Some memory leak in your app putting GC/memory pressure on the JVM, etc too. On Thu, Jan 21, 2021 at 5:13 AM Eric Beabes wrote: > Hello, > > My Spark Structured Streaming

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Sean Owen
Heh that could make sense, but that definitely was not my mental model of how python binds variables! Definitely is not how Scala works. On Wed, Jan 20, 2021 at 10:00 AM Marco Wong wrote: > Hmm, I think I got what Jingnan means. The lambda function is x != i and i > is not evaluated when the
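Since Python closures capture the variable `i` rather than its value, a common fix is to freeze it with a default argument. A minimal sketch:

    rdd = spark.sparkContext.parallelize([0, 1, 2])
    for i in range(3):
        # 'i=i' captures the current value; a bare 'lambda x: x != i'
        # would see i == 2 in all three filters when they finally run.
        rdd = rdd.filter(lambda x, i=i: x != i)
    print(rdd.collect())  # [] -- each value was filtered out as intended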

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Sean Owen
No, because the final rdd is really the result of chaining 3 filter operations. They should all execute. It _should_ work like "rdd.filter(...).filter(..).filter(...)" On Wed, Jan 20, 2021 at 9:46 AM Zhu Jingnan wrote: > I thought that was the right result. > > As rdd runs on a lazy basis. so

Re: Spark RDD + HBase: adoption trend

2021-01-20 Thread Sean Owen
RDDs are still relevant in a few ways - there is no Dataset in Python for example, so RDD is still the 'typed' API. They still underpin DataFrames. And of course it's still there because there's probably still a lot of code out there that uses it. Occasionally it's still useful to drop into that

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Sean Owen
That looks very odd indeed. Things like this work as expected:

    rdd = spark.sparkContext.parallelize([0, 1, 2])

    def my_filter(data, i):
        return data.filter(lambda x: x != i)

    for i in range(3):
        rdd = my_filter(rdd, i)

    rdd.collect()

... as does unrolling the loop. But your example behaves as if

Re: subscribe user@spark.apache.org

2021-01-19 Thread Sean Owen
You have to sign up by sending an email - see http://spark.apache.org/community.html for what to send where. On Tue, Jan 19, 2021 at 12:25 PM Peter Podlovics < peter.d.podlov...@gmail.com> wrote: > Hello, > > I would like to subscribe to the above mailing list. I already tried > subscribing

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Sean Owen
Hm, FWIW I can't reproduce that on Spark 3.0.1. What version are you using? On Sun, Jan 17, 2021 at 6:22 AM Shiao-An Yuan wrote: > Hi folks, > > I finally found the root cause of this issue. > It can be easily reproduced by the following code. > We ran it on a standalone mode 4 cores * 4

Re: Spark 3.0.1 giving warning while running with Java 11

2021-01-14 Thread Sean Owen
You can ignore that. Spark 3.x works with Java 11 but it will generate some warnings that are safe to disregard. On Thu, Jan 14, 2021 at 11:26 PM Sachit Murarka wrote: > Hi All, > > Getting warning while running spark3.0.1 with Java11 . > > > WARNING: An illegal reflective access operation has

Re: Customizing K-Means for Anomaly Detection

2021-01-12 Thread Sean Owen
You could fit the k-means pipeline, get the cluster centers, create a Transformer using that info, then create a new PipelineModel including all the original elements and the new Transformer. Does that work? It's not out of the question to expose a new parameter in KMeansModel that lets you also

Re: PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Sean Owen
is email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Fri, 8 Jan 2021 at 16:38, Riccardo Ferrari wrote: > >> I think spark checks the python p

Re: PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Sean Owen
I don't see anywhere that you provide 'sparkstuff'? how would the Spark app have this code otherwise? On Fri, Jan 8, 2021 at 10:20 AM Mich Talebzadeh wrote: > Thanks Riccardo. > > I am well aware of the submission form > > However, my question relates to doing submission within PyCharm itself.

Re: Does Spark dynamic allocation work with more than one workers?

2021-01-07 Thread Sean Owen
Yes it does. It controls how many executors are allocated on workers, and isn't related to the number of workers. Something else is wrong with your setup. You would not typically, by the way, run multiple workers per machine at that scale. On Thu, Jan 7, 2021 at 7:15 AM Varun kumar wrote: > Hi,

Re: Extending GraphFrames without running into serialization issues

2021-01-05 Thread Sean Owen
It's because this calls the no-arg superclass constructor that sets _vertices and _edges in the actual GraphFrame class to null. That yields the error. Normally you'd just declare that you want to call the two-arg superclass constructor with "extends GraphFrame(_vertices, _edges)" but that constructor is

Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Sean Owen
. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destr

Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Sean Owen
> > fwhm: 16.7485671 +/- 0.91958379 (5.49%) == '2.000*sigma' > > height: 1182407.88 +/- 15681.8211 (1.33%) == > '0.3183099*amplitude/max(2.220446049250313e-16, sigma)' > > [[Correlations]] (unreported correlations are < 0.100) > > C(amplitude, sigma)

Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Sean Owen
If your data set is 11 points, surely this is not a distributed problem? or are you asking how to build tens of thousands of those projections in parallel? On Tue, Jan 5, 2021 at 6:04 AM Mich Talebzadeh wrote: > Hi, > > I am not sure Spark forum is the correct avenue for this question. > > I am

Re: How Spark Framework works a Compiler

2021-01-03 Thread Sean Owen
No it's much simpler than that. Spark is just a bunch of APIs that user applications call into to cause it to form a DAG and execute it. There's no need for reflection or transpiling or anything. The user app is just calling the framework directly, not the other way around. On Sun, Jan 3, 2021 at

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Sean Owen
y key") and some "pkey" missing. > Since it only happens when executors being preempted, I believe this is a > bug (nondeterministic shuffle) that SPARK-23207 trying to solve. > > Thanks, > > Shiao-An Yuan > > On Tue, Dec 29, 2020 at 10:53 PM Sean Owen wrote:

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Sean Owen
Total guess here, but your key is a case class. It does define hashCode and equals for you, but, you have an array as one of the members. Array equality is by reference, so, two arrays of the same elements are not equal. You may have to define hashCode and equals manually to make them correct. On
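A sketch of a manually corrected key (field names are illustrative):

    // Arrays compare by reference, so two keys holding equal-content
    // arrays would otherwise hash and compare as different keys.
    case class Key(name: String, values: Array[Int]) {
      override def equals(other: Any): Boolean = other match {
        case Key(n, v) => n == name && java.util.Arrays.equals(v, values)
        case _         => false
      }
      override def hashCode: Int =
        31 * name.hashCode + java.util.Arrays.hashCode(values)
    }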

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Sean Owen
Why not just use STDDEV_SAMP? it's probably more accurate than the differences-of-squares calculation. You can write an aggregate UDF that calls numpy and register it for SQL, but, it is already a built-in. On Thu, Dec 24, 2020 at 8:12 AM Mich Talebzadeh wrote: > Thanks for the feedback. > > I
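For reference, the built-in is available from both SQL and the DataFrame API (table and column names here are hypothetical):

    from pyspark.sql import functions as F

    spark.sql("SELECT STDDEV_SAMP(amount) FROM payments").show()
    df.agg(F.stddev_samp("amount")).show()   # same aggregate, DataFrame API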

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Sean Owen
t numpy would come back with > > Thanks > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The aut

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Sean Owen
Why do you want to use this function instead of the built-in stddev function? On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh wrote: > Hi, > > > This is a shot in the dark so to speak. > > > I would like to use the standard deviation std offered by numpy in > PySpark. I am using SQL for now > >

Re: No matter how many instances and cores configured for spark on k8s, only one executor is reading file

2020-12-21 Thread Sean Owen
Pass more partitions to the second argument of parallelize()? On Mon, Dec 21, 2020 at 7:39 AM 沈俊 wrote: > Hi > > I am now trying to use spark to do tcpdump pcap file analysis. The first > step is to read the file and parse the content to dataframe according to > analysis requirements. > > I've
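A minimal sketch; the second argument sets the number of partitions, so the parsing work spreads over that many tasks (the list of pcap paths is illustrative):

    # Without numSlices, a small collection may land in very few partitions.
    rdd = spark.sparkContext.parallelize(pcap_paths, numSlices=64)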

Re: Convert Seq[Any] to Seq[String]

2020-12-18 Thread Sean Owen
It's not really a Spark question. .toDF() takes column names. atrb.head.toSeq.map(_.toString)? but it's not clear what you intend the column names to be. On Fri, Dec 18, 2020 at 8:37 AM Vikas Garg wrote: > Hi, > > Can someone please help me how to convert Seq[Any] to Seq[String] > > For line > val df
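In full, a sketch assuming `atrb.head` really holds the intended column names and `data` holds the rows:

    val names: Seq[String] = atrb.head.toSeq.map(_.toString)
    val df = data.toDF(names: _*)   // toDF takes String*, hence the conversion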

Re: Using Lambda function to generate random data in PySpark throws not defined error

2020-12-13 Thread Sean Owen
'd look in what's inside your Range and what you get out >>>>> of it. I suspect something wrong in there >>>>> >>>>> If there was something with the clustered function, then you should be >>>>> able to take it out of the map() and still have

Re: Using Lambda function to generate random data in PySpark throws not defined error

2020-12-11 Thread Sean Owen
Looks like a simple Python error - you haven't shown the code that produces it. Indeed, I suspect you'll find there is no such symbol. On Fri, Dec 11, 2020 at 9:09 AM Mich Talebzadeh wrote: > Hi, > > This used to work but not anymore. > > I have UsedFunctions.py file that has these functions >

Re: Caching

2020-12-07 Thread Sean Owen
No, it's not true that one action means every DF is evaluated once. This is a good counterexample. On Mon, Dec 7, 2020 at 11:47 AM Amit Sharma wrote: > Thanks for the information. I am using spark 2.3.3 There are few more > questions > > 1. Yes I am using DF1 two times but at the end action is
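A sketch of the usual guard against recomputation (the joins and path are illustrative):

    df1.cache()   # without this, df1's lineage may be re-evaluated
                  # for each branch of the plan that uses it
    result = df1.join(df2, "id").union(df1.join(df3, "id"))
    result.write.parquet("/tmp/out")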

Re: Spark ML / ALS question

2020-12-02 Thread Sean Owen
There is only a fit() method in spark.ml's ALS http://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/recommendation/ALS.html The older spark.mllib interface has a train() method. You'd generally use the spark.ml version. On Wed, Dec 2, 2020 at 2:13 PM Steve Pruitt wrote: > I am

Re: Regexp_extract not giving correct output

2020-12-02 Thread Sean Owen
; and as I mentioned when I am using 2 backslashes it is giving an exception > as follows: > : java.util.regex.PatternSyntaxException: Unknown inline modifier near > index 21 > > (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*) > >

Re: Regexp_extract not giving correct output

2020-12-02 Thread Sean Owen
As in Java/Scala, in Python you'll need to escape the backslashes with \\. "\[" means just "[" in a string. I think you could also prefix the string literal with 'r' to disable Python's handling of escapes. On Wed, Dec 2, 2020 at 9:34 AM Sachit Murarka wrote: > Hi All, > > I am using Pyspark to
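Both forms, sketched against a hypothetical column named "value":

    from pyspark.sql import functions as F

    # Escaped backslashes...
    df.select(F.regexp_extract("value", "\\[UniqueID:\\s([a-zA-Z0-9]*)\\]", 1))
    # ...or a raw string, which disables Python's own escape handling
    df.select(F.regexp_extract("value", r"\[UniqueID:\s([a-zA-Z0-9]*)\]", 1))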

Re: Remove subsets from FP Growth output

2020-12-02 Thread Sean Owen
-dev Increase the threshold? Just filter the rules as desired after they are generated? It's not clear what your criteria are. On Wed, Dec 2, 2020 at 7:30 AM Aditya Addepalli wrote: > Hi, > > Is there a good way to remove all the subsets of patterns from the output > given by FP Growth? > >

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Sean Owen
NFS is a simple option for this kind of usage, yes. But --files is making N copies of the data - you may not want to do that for large data, or for data that you need to mutate. On Wed, Nov 25, 2020 at 9:16 PM Artemis User wrote: > Ah, I almost forgot that there is an even easier solution for

Re: Purpose of type in pandas_udf

2020-11-12 Thread Sean Owen
It's the return value. On Thu, Nov 12, 2020 at 5:20 PM Daniel Stojanov wrote: > Hi, > > > Note "double" in the function decorator. Is this specifying the type of > the data that goes into pandas_mean, or the type returned by that function? > > > Regards, > > > > > @pandas_udf("double",
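That is, the decorator declares what the function returns, not what it receives; the input types come from the function itself. A sketch in the Spark 3.x type-hint style (column names are hypothetical):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")            # <- return type of the UDF
    def pandas_mean(v: pd.Series) -> float:
        return v.mean()

    df.groupBy("key").agg(pandas_mean(df["value"]))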

Re: Spark Dataset withColumn issue

2020-11-12 Thread Sean Owen
You can still simply select the columns by name in order, after .withColumn() On Thu, Nov 12, 2020 at 9:49 AM Vikas Garg wrote: > I am deriving the col2 using with colunn which is why I cant use it like > you told me > > On Thu, Nov 12, 2020, 20:11 German Schiavon > wrote: > >>

Re: Spark 2.4 lifetime

2020-11-11 Thread Sean Owen
I don't think there's an official EOL for Spark 2.4.x, but would expect another maintenance release in the first half of 2021 at least. I'd also guess it wouldn't be maintained by 2022. On Wed, Nov 11, 2020 at 12:24 AM Netanel Malka wrote: > Hi folks, > Do you know about how long Spark will

Re: Ask about Pyspark ML interaction

2020-11-09 Thread Sean Owen
I think you have this flipped around - you want to one-hot encode, then compute interactions. As it is you are treating the product of {0,1,2,3,4} x {0,1,2,3,4} as if it's a categorical index. That doesn't have nearly 25 possible values and probably is not what you intend. On Mon, Nov 9, 2020 at
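A sketch of that ordering (column names are hypothetical; the multi-column OneHotEncoder form is the Spark 3.x API):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, Interaction

    enc = OneHotEncoder(inputCols=["a_idx", "b_idx"],
                        outputCols=["a_vec", "b_vec"])
    inter = Interaction(inputCols=["a_vec", "b_vec"], outputCol="a_x_b")
    model = Pipeline(stages=[enc, inter]).fit(df)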

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Sean Owen
I don't find this trolling; I agree with the observation that 'the skills you have' are a valid and important determiner of what tools you pick. I disagree that you just have to pick the optimal tool for everything. Sounds good until that comes in contact with the real world. For Spark, Python vs

Re: Why spark-submit works with package not with jar

2020-10-21 Thread Sean Owen
Yes, it's reasonable to build an uber-jar in development, using Maven/Ivy to resolve dependencies (and of course excluding 'provided' dependencies like Spark), and push that to production. That gives you a static artifact to run that does not depend on external repo access in production. On Wed,

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Sean Owen
Rather, let --packages (via Ivy) worry about them, because they tell Ivy what they need. There's no 100% guarantee that conflicting dependencies are resolved in a way that works in every single case, which you run into sometimes when using incompatible libraries, but yes this is the point of

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Sean Owen
From the looks of it, it's the com.google.http-client ones. But there may be more. You should not have to reason about this. That's why you let Maven / Ivy resolution figure it out. It is not true that everything in .ivy2 is on the classpath. On Tue, Oct 20, 2020 at 3:48 PM Mich Talebzadeh

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Sean Owen
Probably because your JAR file requires other JARs which you didn't supply. If you specify a package, it reads metadata like a pom.xml file to understand what other dependent JARs also need to be loaded. On Tue, Oct 20, 2020 at 10:50 AM Mich Talebzadeh wrote: > Hi, > > I have a scenario that I
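The practical difference, sketched with a hypothetical coordinate and jar name:

    # Resolves the artifact AND its transitive dependencies via Ivy:
    spark-submit --packages com.example:mylib:1.0 app.py
    # Ships exactly this one jar and nothing it depends on:
    spark-submit --jars mylib.jar app.py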

Re: Exception handling in Spark throws recursive value for DF needs type error

2020-10-02 Thread Sean Owen
It would be quite trivial. None of that affects any of the Spark execution. It doesn't seem like it helps though - you are just swallowing the cause. Just let it fly? On Fri, Oct 2, 2020 at 9:34 AM Mich Talebzadeh wrote: > As a side question consider the following read JDBC read > > > val

Re: Exception handling in Spark throws recursive value for DF needs type error

2020-10-01 Thread Sean Owen
You are reusing HiveDF for two vars and it ends up ambiguous. Just rename one. On Thu, Oct 1, 2020, 5:02 PM Mich Talebzadeh wrote: > Hi, > > > Spark version 2.3.3 on Google Dataproc > > > I am trying to use databricks to other databases > > >

Re: Apache Spark Bogotá Meetup

2020-09-30 Thread Sean Owen
Sure, we just ask people to open a pull request against https://github.com/apache/spark-website to update the page and we can merge it. On Wed, Sep 30, 2020 at 7:30 AM Miguel Angel Díaz Rodríguez < madiaz...@gmail.com> wrote: > Hello > > I am Co-organizer of Apache Spark Bogotá Meetup from

Re: [Spark SQL] does pyspark udf support spark.sql inside def

2020-09-30 Thread Sean Owen
No, you can't use the SparkSession from within a function executed by Spark tasks. On Wed, Sep 30, 2020 at 7:29 AM Lakshmi Nivedita wrote: > Here is a spark udf structure as an example > > Def sampl_fn(x): >Spark.sql(“select count(Id) from sample Where Id = x ”) > > >
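The usual rewrite is to express the lookup as a join instead; a sketch using the thread's table, with hypothetical column names:

    # Compute the counts once, then join, rather than calling
    # spark.sql from inside a UDF (which cannot work).
    counts = spark.sql("SELECT Id, COUNT(Id) AS cnt FROM sample GROUP BY Id")
    result = df.join(counts, on="Id", how="left")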

Re: A simple example that demonstrates that a Spark distributed cluster is faster than Spark Local Standalone

2020-09-25 Thread Sean Owen
se's code that has no material value to me; I'm > interested in seeing a simple example of something working that I can then > carry across to my own datasets with a view to adopting the platform. > > Thx > > > > On Fri, Sep 25, 2020 at 2:29 PM Sean Owen wrote: >> >&

Re: A simple example that demonstrates that a Spark distributed cluster is faster than Spark Local Standalone

2020-09-25 Thread Sean Owen
th some quick code and a large public data set and say > this runs faster on a cluster than standalone. I'd be happy to make a post > myself for any new people interested in Spark. > > Thanks > > > > > > > > > On Thu, Sep 24, 2020 at 9:58 PM Sean Owen w

Re: A simple example that demonstrates that a Spark distributed cluster is faster than Spark Local Standalone

2020-09-24 Thread Sean Owen
If you have the same amount of resource (cores, memory, etc) on one machine, that is pretty much always going to be faster than using those same resources split across several machines. Even if you have somewhat more resource available on a cluster, the distributed version could be slower if you,

Re: Is RDD.persist honoured if multiple actions are executed in parallel

2020-09-23 Thread Sean Owen
It is but it happens asynchronously. If you access the same block twice quickly, the cached block may not be available yet the second time. On Wed, Sep 23, 2020, 7:17 AM Arya Ketan wrote: > Hi, > I have a spark streaming use-case ( spark 2.2.1 ). And in my spark job, I > have multiple

Re: 【Spark ML】How to get access of the MLlib's LogisticRegressionWithSGD after 3.0.0?

2020-09-22 Thread Sean Owen
-dev See the migration guide: https://spark.apache.org/docs/3.0.0/ml-migration-guide.html Use ml.LogisticRegression, which should still let you use SGD On Tue, Sep 22, 2020 at 12:54 AM Lyx <1181245...@qq.com> wrote: > > Hi, > I have updated my Spark to the version of 3.0.0, > and it seems

Re: [DISCUSS] Spark cannot identify the problem executor

2020-09-11 Thread Sean Owen
-dev, +user Executors do not communicate directly, so I don't think that's quite what you are seeing. You'd have to clarify. On Fri, Sep 11, 2020 at 12:08 AM 陈晓宇 wrote: > > Hello all, > > We've been using spark 2.3 with blacklist enabled and often meet the problem > that when executor A has

Re: Missing / Duplicate Data when Spark retries

2020-09-10 Thread Sean Owen
It's more likely a subtle issue with your code or data, but hard to say without knowing more. The lineage is fine and deterministic, but your data or operations might not be. On Thu, Sep 10, 2020 at 12:03 AM Ruijing Li wrote: > > Hi all, > > I am on Spark 2.4.4 using Mesos as the task resource

Re: Iterating all columns in a pyspark dataframe

2020-09-04 Thread Sean Owen
Do you need to iterate anything? you can always write a function of all columns, df.columns. You can operate on a whole Row at a time too. On Fri, Sep 4, 2020 at 2:11 AM Devi P.V wrote: > > Hi all, > What is the best approach for iterating all columns in a pyspark dataframe?I > want to apply
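A sketch of the column-function approach (the trim transformation is illustrative):

    from pyspark.sql import functions as F

    # One select over all columns; no per-row Python iteration needed.
    df2 = df.select([F.trim(F.col(c)).alias(c) for c in df.columns])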

Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

2020-08-25 Thread Sean Owen
That looks roughly right, though you will want to mark Spark dependencies as provided. Do you need netlib directly? Pyspark won't matter here if you're in Scala; what's installed with pip would not matter in any event. On Tue, Aug 25, 2020 at 3:30 AM Aviad Klein wrote: > > Hey Chris and Sean,

Re: Ability to have CountVectorizerModel vocab as empty

2020-08-19 Thread Sean Owen
I think that's true. You're welcome to open a pull request / JIRA to remove that requirement. On Wed, Aug 19, 2020 at 3:21 AM Jatin Puri wrote: > > Hello, > > This is wrt > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244 >

Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

2020-08-17 Thread Sean Owen
Hm, next guess: you need a no-arg constructor this() on FooTransformer? also consider extending UnaryTransformer. On Mon, Aug 17, 2020 at 9:08 AM Aviad Klein wrote: > Hi Owen, it's omitted from what I pasted but I'm using spark 2.4.4 on both. > > On Mon, Aug 17, 2020 at 4:37 PM Sean Ow

Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

2020-08-17 Thread Sean Owen
Looks like you are building vs Spark 3 and running on Spark 2, or something along those lines. On Mon, Aug 17, 2020 at 4:02 AM Aviad Klein wrote: > Hi, I've referenced the same problem on stack overflow and can't seem to > find answers. > > I have custom spark pipelinestages written in scala

Re: Spark - Scala-Java interoperablity

2020-08-16 Thread Sean Owen
That should be fine. The JVM doesn't care how the bytecode it is executing was produced. As long as you were able to compile it together - which sometimes means using plugins like scala-maven-plugin for mixed compilation - the result should be fine. On Sun, Aug 16, 2020 at 4:28 PM Ramesh

Re: How can I use pyspark to upsert one row without replacing entire table

2020-08-12 Thread Sean Owen
It's not so much Spark but the data format, whether it supports upserts. Parquet, CSV, JSON, etc would not. That is what Delta, Hudi et al are for, and yes you can upsert them in Spark. On Wed, Aug 12, 2020 at 9:57 AM Siavash Namvar wrote: > > Hi, > > I have a use case, and read data from a db
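A sketch of a Delta Lake MERGE upsert, assuming the delta package is on the classpath and `updates` holds the new rows (the path and key column are hypothetical):

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/data/events")
    (target.alias("t")
           .merge(updates.alias("u"), "t.id = u.id")
           .whenMatchedUpdateAll()      # update rows whose key already exists
           .whenNotMatchedInsertAll()   # insert the rest
           .execute())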

Re: Spark Streaming with Kafka and Python

2020-08-12 Thread Sean Owen
What supports Python in (Kafka?) 0.8? I don't think Spark ever had a specific Python-Kafka integration. But you have always been able to use it to read DataFrames as in Structured Streaming. Kafka 0.8 support is deprecated (gone in 3.0) but 0.10 means 0.10+ - works with the latest 2.x. What is the

Re: [SPARK-SQL] How to return GenericInternalRow from spark udf

2020-08-06 Thread Sean Owen
The UDF should return the result value you want, not a whole Row. In Scala it figures out the schema of the UDF's result from the signature. On Thu, Aug 6, 2020 at 7:56 AM Amit Joshi wrote: > Hi, > > I have a spark udf written in scala that takes couple of columns and apply > some logic and
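A sketch: return a case class and Spark derives the struct schema from the signature (the field and column names are illustrative):

    import org.apache.spark.sql.functions.udf
    import spark.implicits._

    case class Result(total: Double, flag: Boolean)
    val combine = udf((a: Double, b: Double) => Result(a + b, a > b))
    df.withColumn("out", combine($"colA", $"colB"))  // 'out' is a struct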

Re: Comments conventions in Spark distribution official examples

2020-08-05 Thread Sean Owen
These only matter to our documentation, which includes the source of these examples inline in the docs. For brevity, the examples don't need to show all the imports that are otherwise necessary for the source file. You can ignore them like the compiler does as comments if you are using the example

Re: CVE-2020-9480: Apache Spark RCE vulnerability in auth-enabled standalone master

2020-08-03 Thread Sean Owen
+. For those using vendor distros, you may want to check with your vendor about whether the relevant patch has been applied. Sean On Mon, Jun 22, 2020 at 4:49 PM Sean Owen wrote: > > Severity: Important > > Vendor: The Apache Software Foundation > > Versions Affected: &g

Re: Tab delimited csv import and empty columns

2020-07-31 Thread Sean Owen
Try setting nullValue to anything besides the empty string. Because its default is the empty string, empty strings become null by default. On Fri, Jul 31, 2020 at 3:20 AM Stephen Coy wrote: > That does not work. > > This is Spark 3.0 by the way. > > I have been looking at the Spark unit tests
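A sketch, choosing a sentinel that cannot occur in the data (the path is illustrative):

    df = (spark.read
          .option("sep", "\t")
          .option("header", "true")
          .option("nullValue", "\\N")   # now "" survives as an empty string
          .csv("/data/input.tsv"))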

Re: [Spark ML] existence of Matrix Factorization ALS algorithm's log version

2020-07-29 Thread Sean Owen
No there isn't a log version. You could probably copy and hack the implementation easily if necessary. On Wed, Jul 29, 2020 at 11:05 AM jyuan1986 wrote: > > Hi Team, > > I'm looking for information regarding MF_ALS algorithm's log version if > implemented. In original Hu et al.'s paper

Re: Spark DataFrame Creation

2020-07-22 Thread Sean Owen
You'd probably do best to ask that project, but scanning the source code, that looks like it's how it's meant to work. It downloads to a temp file on the driver then copies to distributed storage then returns a DataFrame for that. I can't see how it would be implemented directly over sftp as there

Re: Using pyspark with Spark 2.4.3 a MultiLayerPerceptron model gives inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.

2020-07-17 Thread Sean Owen
I can't reproduce it (on Databricks / Spark 2.4), but as you say, sounds really specific to some way of executing it. I can't off the top of my head imagine why that would be. As you say, no matter the model, it should be the same result. I don't recall a bug being fixed around there, but

Re: download of spark

2020-07-15 Thread Sean Owen
Works for me - do you have JavaScript disabled? It will be necessary. On Wed, Jul 15, 2020 at 11:52 AM Ming Liao wrote: > To whom it may concern, > > Hope this email finds you well. > I am trying to download spark but I was not able to select the release and > package type. Could you please

Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Sean Owen
It is still copyrighted material, no matter its state of editing. Yes, you should not be sharing this on the internet. On Tue, Jul 14, 2020 at 9:46 AM Anwar AliKhan wrote: > > Please note It is freely available because it is an early unedited raw > edition. > It is not 100% complete , it is not

Re: scala RDD[MyCaseClass] to Dataset[MyCaseClass] perfomance

2020-07-13 Thread Sean Owen
Wouldn't toDS() do this without conversion? On Mon, Jul 13, 2020 at 5:25 PM Ivan Petrov wrote: > > Hi! > I'm trying to understand the cost of RDD to Dataset conversion > It takes me 60 minutes to create RDD [MyCaseClass] with 500.000.000.000 > records > It takes around 15 minutes to convert

Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Sean Owen
There is a multilayer perceptron implementation in Spark ML, but that's not what you're looking for. To parallelize model training developed using standard libraries like Keras, use Horovod from Uber. https://horovod.readthedocs.io/en/stable/spark_include.html On Mon, Jul 13, 2020 at 6:59 AM

Re: Strange WholeStageCodegen UI values

2020-07-09 Thread Sean Owen
It sounds like you have huge data skew? On Thu, Jul 9, 2020 at 4:15 PM Bobby Evans wrote: > > Sadly there isn't a lot you can do to fix this. All of the operations take > iterators of rows as input and produce iterators of rows as output. For > efficiency reasons, the timing is not done for

Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Sean Owen
I haven't used the K8S scheduler personally, but, just based on that comment I wouldn't worry too much. It's been around for several versions and AFAIK works fine in general. We sometimes aren't so great about removing "experimental" labels. That said I know there are still some things that could

Re: com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.9.6 requires Jackson Databind version >= 2.9.0 and < 2.10.0

2020-07-09 Thread Sean Owen
You have a Jackson version conflict somewhere. It might be from other libraries you include in your application. I am not sure Spark 2.3 works with Hadoop 3.1, so this may be the issue. Make sure you match these to Spark, and/or use the latest versions. On Thu, Jul 9, 2020 at 8:23 AM Julian Jiang

Re: When does SparkContext.defaultParallelism have the correct value?

2020-07-07 Thread Sean Owen
If not set explicitly with spark.default.parallelism, it will default to the number of cores currently available (minimum 2). At the very start, some executors haven't completed registering, which I think explains why it goes up after a short time. (In the case of dynamic allocation it will change

Re: Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Sean Owen
2.4 works with Hadoop 3 (optionally) and Hive 1. I doubt it will work connecting to Hadoop 3 / Hive 3; it's possible in a few cases. It's also possible some vendor distributions support this combination. On Mon, Jul 6, 2020 at 7:51 AM Teja wrote: > > We use spark 2.4.0 to connect to Hadoop 2.7

Re: XmlReader not Parsing the Nested elements in XML properly

2020-06-30 Thread Sean Owen
This is more a question about spark-xml, which is not part of Spark. You can ask at https://github.com/databricks/spark-xml/ but if you do please show some example of the XML input and schema and output. On Tue, Jun 30, 2020 at 11:39 AM mars76 wrote: > > Hi, > > I am trying to read XML data

Re: When is a Bigint a long and when is a long a long

2020-06-28 Thread Sean Owen
'bigint' is a long, not a Java BigInteger. On Sun, Jun 28, 2020 at 5:52 AM Anwar AliKhan wrote: > > I wish to draw your attention for your consideration to this approach > where the BigInt data type maps to Long without drawing an error. > >

Re: When is a Bigint a long and when is a long a long

2020-06-27 Thread Sean Owen
reduce(_+_) > <http://www.backbutton.co.uk/> > > > On Sat, 27 Jun 2020, 15:42 Sean Owen, wrote: > >> There are several confusing things going on here. I think this is part >> of the explanation, not 100% sure: >> >> 'bigint' is the Spark SQL type of an 8

Re: When is a Bigint a long and when is a long a long

2020-06-27 Thread Sean Owen
There are several confusing things going on here. I think this is part of the explanation, not 100% sure: 'bigint' is the Spark SQL type of an 8-byte long. 'long' is the type of a JVM primitive. Both are the same, conceptually, but represented differently internally as they are logically somewhat
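A quick way to see the mapping in a Spark shell (a sketch; the printed form may vary slightly by version):

    // BIGINT is Spark SQL's LongType, i.e. a JVM primitive long,
    // not java.math.BigInteger.
    spark.sql("SELECT CAST(1 AS BIGINT) AS x").schema
    // -> StructType(StructField(x, LongType, false))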

Re: Getting PySpark Partitions Locations

2020-06-25 Thread Sean Owen
You can always list the S3 output path, of course. On Thu, Jun 25, 2020 at 7:52 AM Tzahi File wrote: > Hi, > > I'm using pyspark to write df to s3, using the following command: > "df.write.partitionBy("day","hour","country").mode("overwrite").parquet(s3_output)". > > Is there any way to get the

CVE-2020-9480: Apache Spark RCE vulnerability in auth-enabled standalone master

2020-06-22 Thread Sean Owen
Severity: Important
Vendor: The Apache Software Foundation
Versions Affected: Apache Spark 2.4.5 and earlier
Description: In Apache Spark 2.4.5 and earlier, a standalone resource manager's master may be configured to require authentication (spark.authenticate) via a shared secret. When enabled,

Re: Hey good looking toPandas () error stack

2020-06-21 Thread Sean Owen
That part isn't related to Spark. It means you have some code compiled for Java 11, but are running Java 8. On Sun, Jun 21, 2020 at 1:51 PM randy clinton wrote: > You can see from the GitHub history for "toPandas()" that the function has > been in the code for 5 years. > >

Re: [pyspark 2.3+] read/write huge data with smaller block size (128MB per block)

2020-06-19 Thread Sean Owen
Yes you'll generally get 1 partition per block, and 1 task per partition. The amount of RAM isn't directly relevant; it's not loaded into memory. But you may nevertheless get some improvement with larger partitions / tasks, though typically only if your tasks are very small and very fast right now

Re: Spark ml how to extract split points from trained decision tree mode

2020-06-11 Thread Sean Owen
Hm, the root is a leaf? it's possible but that means there are no splits. If it's a toy example, could be. This was just off the top of my head looking at the code, so could be missing something, but a non-trivial tree should start with an InternalNode. On Thu, Jun 11, 2020 at 11:01 PM AaronLee

Re: Spark ml how to extract split points from trained decision tree mode

2020-06-11 Thread Sean Owen
You should be able to look at dtm.rootNode and, treating it as an InternalNode, get the .split from it On Thu, Jun 11, 2020 at 7:02 PM AaronLee wrote: > I am following official spark 2.4.3 tutorial > < >
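A sketch of walking the tree from the root and collecting continuous-split thresholds (assuming `dtm` is the trained DecisionTreeClassificationModel from the thread):

    import org.apache.spark.ml.tree.{ContinuousSplit, InternalNode, Node}

    def thresholds(node: Node): Seq[Double] = node match {
      case n: InternalNode =>
        val here = n.split match {
          case cs: ContinuousSplit => Seq(cs.threshold)
          case _                   => Seq.empty  // categorical split
        }
        here ++ thresholds(n.leftChild) ++ thresholds(n.rightChild)
      case _ => Seq.empty  // leaf node
    }

    val splits = thresholds(dtm.rootNode)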

Re: NoClassDefFoundError: scala/Product$class

2020-06-06 Thread Sean Owen
Spark 3 supports only Scala 2.12. This actually sounds like a third-party library compiled for 2.11 or something. On Fri, Jun 5, 2020 at 11:11 PM charles_cai <1620075...@qq.com> wrote: > Hi Pol, > > thanks for your suggestion, I am going to use Spark-3.0.0 for GPU > acceleration,so I update the

Re: Spark Security

2020-05-29 Thread Sean Owen
tsv file on my local computer is > secure correct? > > > Thanks > > Wilbert J. Seoane > > Sent from iPhone > > On May 29, 2020, at 11:25 AM, Sean Owen wrote: > > > What do you mean by secure here? > > On Fri, May 29, 2020 at 10:21 AM wrote: > &

Re: Spark Security

2020-05-29 Thread Sean Owen
What do you mean by secure here? On Fri, May 29, 2020 at 10:21 AM wrote: > Hello, > > I plan to load in a local .tsv file from my hard drive using sparklyr (an > R package). I have figured out how to do this already on small files. > > When I decide to receive my client’s large .tsv file, can I

Re: CSV parsing issue

2020-05-28 Thread Sean Owen
way I can handle it in code? > > Thanks, > Elango > > On Thu, May 28, 2020, 8:52 PM Sean Owen wrote: > >> Your data doesn't escape double-quotes. >> >> On Thu, May 28, 2020 at 10:21 AM elango vaidyanathan >> wrote: >> >>> >>> Hi team

Re: CSV parsing issue

2020-05-28 Thread Sean Owen
Your data doesn't escape double-quotes. On Thu, May 28, 2020 at 10:21 AM elango vaidyanathan wrote: > > Hi team, > > I am loading an CSV. One column contains a json value. I am unable to > parse that column properly. Below is the details. Can you please check once? > > > > val
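If rewriting the data isn't possible, the parser's escape convention can sometimes be adjusted to cope; a sketch (the path is illustrative, and whether this helps depends on how the file actually quotes its JSON column):

    # Tell the parser that a literal double-quote inside a quoted field
    # is represented by a doubled double-quote.
    df = (spark.read
          .option("header", "true")
          .option("quote", '"')
          .option("escape", '"')
          .csv("/data/input.csv"))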

Re: Regarding Spark 3.0 GA

2020-05-27 Thread Sean Owen
No firm dates; it always depends on RC voting. Another RC is coming soon. It is however looking pretty close to done. On Wed, May 27, 2020 at 3:54 AM ARNAV NEGI SOFTWARE ARCHITECT < negi.ar...@gmail.com> wrote: > Hi, > > I am working on Spark 3.0 preview release for large Spark jobs on >
