Re: Unable to use scala function in pyspark

2021-09-26 Thread Sean Owen
You can also call a Scala UDF from Python in Spark - this doesn't need Zeppelin or relate to the front-end. This may indeed be much easier as a proper UDF; depends on what this function does. However I think the issue may be that you're trying to wrap the resulting DataFrame in a DataFrame or
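
One common route, as a minimal sketch: compile the UDF into a JAR on the Spark classpath and register it by class name from Python. The class name com.example.MyUpper is hypothetical; it would implement org.apache.spark.sql.api.java.UDF1.

```python
# A minimal sketch, assuming a JAR on the classpath containing a class
# com.example.MyUpper (hypothetical) that implements
# org.apache.spark.sql.api.java.UDF1[String, String].
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Register the JVM-side UDF under a SQL name, then call it from Python.
spark.udf.registerJavaFunction("my_upper", "com.example.MyUpper", StringType())

df = spark.createDataFrame([("hello",)], ["word"])
df.selectExpr("my_upper(word) AS upper_word").show()
```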

Re: Spark DStream application memory leak debugging

2021-09-25 Thread Sean Owen
It could be 'normal' - executors won't GC unless they need to. It could be state in your application, if you're storing state. You'd want to dump the heap to take a first look. On Sat, Sep 25, 2021 at 7:24 AM Kiran Biswal wrote: > Hello Experts > > I have a spark streaming application (DStream).

Re: SparkDF writing null values when not to database

2021-09-22 Thread Sean Owen
What is null, what is the type, does it make sense in postgres, etc. Need more info. On Wed, Sep 22, 2021 at 9:18 AM Aram Morcecian wrote: > Hi everyone, I'm facing something weird. After doing some transformations > to a SparkDF I print some rows and I see those perfectly, but when I write >

Re: Does Apache Spark 3 support GPU usage for Spark RDDs?

2021-09-21 Thread Sean Owen
spark-rapids is not part of Spark, so couldn't speak to it, but Spark itself does not use GPUs at all. It does let you configure a task to request a certain number of GPUs, and that would work for RDDs, but it's up to the code being executed to use the GPUs. On Tue, Sep 21, 2021 at 1:23 PM
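
For what a task-level GPU request looks like, a minimal sketch (Spark 3.0+; the discovery script path is illustrative, and actually using the GPU is still up to the task code):

```python
# A minimal sketch of requesting GPUs per task. Spark only assigns GPU
# addresses to tasks; your code must use them (e.g. via CUDA libraries).
from pyspark import TaskContext
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.executor.resource.gpu.discoveryScript",
                 "/opt/spark/scripts/getGpusResources.sh")  # illustrative path
         .config("spark.task.resource.gpu.amount", "1")
         .getOrCreate())

def gpu_for_task(_):
    # Addresses of GPUs Spark assigned to this task, e.g. ["0"].
    return TaskContext.get().resources()["gpu"].addresses

print(spark.sparkContext.parallelize(range(2), 2).map(gpu_for_task).collect())
```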

Re: [apache-spark][Spark SQL][Debug] Maven Spark build fails while compiling spark-hive-thriftserver_2.12 for Hadoop 2.10.1

2021-09-17 Thread Sean Owen
I don't think that has ever shown up in the CI/CD builds and can't recall someone reporting this. What did you change? It may be some local env issue. On Fri, Sep 17, 2021 at 7:09 AM Enrico Minardi wrote: > > Hello, > > > the Maven build of Apache Spark 3.1.2 for user-provided Hadoop 2.10.1

Re: Lock issue with SQLConf.getConf

2021-09-11 Thread Sean Owen
Looks like this was improved in https://issues.apache.org/jira/browse/SPARK-35701 for 3.2.0 On Fri, Sep 10, 2021 at 10:21 PM Kohki Nishio wrote: > Hello, > I'm running spark in local mode and seeing multiple threads showing like > below, anybody knows why it's not using a concurrent hash map ?

Re: issue in Apache Spark install

2021-09-09 Thread Sean Owen
- other lists, please don't cross post to 4 lists (!) This is a problem you'd see with Java 9 or later - I assume you're running that under the hood. However it should be handled by Spark in the case that you can't access certain things in Java 9+, and this may be a bug I'll look into. In the

Re: JavaSerializerInstance is slow

2021-09-03 Thread Sean Owen
I don't know if java serialization is slow in that case; that shows blocking on a class load, which may or may not be directly due to deserialization. Indeed I don't think (some) things are serialized in local mode within one JVM, so not sure that's actually what's going on. On Thu, Sep 2, 2021

Re: Processing Multiple Streams in a Single Job

2021-08-27 Thread Sean Owen
That is something else. Yes, you can create a single, complex stream job that joins different data sources, etc. That is not different than any other Spark usage. What are you looking for w.r.t. docs? We are also saying you can simply run N unrelated streaming jobs in parallel on the driver,

Re: Processing Multiple Streams in a Single Job

2021-08-25 Thread Sean Owen
use my ignorance, but I just can't figure out how to > create a collection across multiple streams using multiple stream readers. > Could you provide some examples or additional references? Thanks! > > On 8/24/21 11:01 PM, Sean Owen wrote: > > No, that applies to the stream

Re: Processing Multiple Streams in a Single Job

2021-08-24 Thread Sean Owen
No, that applies to the streaming DataFrame API too. No, jobs can't communicate with each other. On Tue, Aug 24, 2021 at 9:51 PM Artemis User wrote: > Thanks Daniel. I guess you were suggesting using DStream/RDD. Would it > be possible to use structured streaming/DataFrames for multi-source >

Re: AWS EMR SPARK 3.1.1 date issues

2021-08-23 Thread Sean Owen
Date handling was tightened up in Spark 3. I think you need to compare to a date literal, not a string literal. On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta < gourav.sengupta.develo...@gmail.com> wrote: > Hi, > > while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as "SELECT * > FROM
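
A minimal sketch of the fix; table and column names are hypothetical:

```python
# Under Spark 3's stricter date handling, compare a date column against a
# typed date value rather than a bare string literal.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2021-08-23",)], ["d_str"]).select(
    F.to_date("d_str").alias("d"))
df.createOrReplaceTempView("t")

# SQL: a typed date literal instead of a string literal.
spark.sql("SELECT * FROM t WHERE d = DATE '2021-08-23'").show()

# DataFrame API equivalent.
df.filter(F.col("d") == F.to_date(F.lit("2021-08-23"))).show()
```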

Re: How can I read ftp

2021-08-08 Thread Sean Owen
FTP is definitely not supported. Read the files to distributed storage first then read from there. On Sun, Aug 8, 2021, 10:18 PM igyu wrote: > val ftpUrl = "ftp://ftpuser:ftpuser@10.3.87.51:21/sparkftp/; > > val schemas = StructType(List( > new StructField("name", DataTypes.StringType,

Re: [Spark Core, PySpark] Separate stage level scheduling for consecutive map functions

2021-08-05 Thread Sean Owen
Doesn't a persist break stages? On Thu, Aug 5, 2021, 11:40 AM Tom Graves wrote: > As Sean mentioned its only available at Stage level but you said you don't > want to shuffle so splitting into stages doesn't help you. Without more > details it seems like you could "hack" this by just

Re: [Spark Core, PySpark] Separate stage level scheduling for consecutive map functions

2021-08-01 Thread Sean Owen
Oh I see, I missed that. You can specify at the stage level, nice. I think you are more looking to break these operations into two stages. You can do that with a persist or something - which has a cost but may work fine. Does it actually help much with GPU utilization - in theory yes but

Re: [Spark Core, PySpark] Separate stage level scheduling for consecutive map functions

2021-07-31 Thread Sean Owen
No, unless I'm crazy you can't even change resource requirements at the job level let alone stage. Does it help you though? Is something else even able to use the GPU otherwise? On Sat, Jul 31, 2021, 3:56 AM Andreas Kunft wrote: > I have a setup with two work intensive tasks, one map using GPU

Re: Cloudera Parcel : spark issues after upgrade 1.6 to 2.4

2021-07-30 Thread Sean Owen
(This is a list for OSS Spark - anything vendor-specific should go to vendor lists for better answers.) On Fri, Jul 30, 2021 at 8:35 AM Harsh Sharma wrote: > hi Team , > > we are upgrading our cloudera parcels to 6.X from 5.x , hence we have > upgraded the version of Spark from 1.6 to 2.4. While

Re: Advanced GC Tuning

2021-07-20 Thread Sean Owen
You're right, I think storageFraction is somewhat better to control this, although some things 'counted' in spark.memory.fraction will also be long-lived and in the OldGen. You can also increase the OldGen size if you're pretty sure that's the issue - 'old' objects in the YoungGen. I'm not sure

Re: How to specify “positive class” in sparkml classification?

2021-07-07 Thread Sean Owen
The positive class is "1" and negative is "0" by convention; I don't think you can change that (though you can translate your data if needed). F1 is defined only in a one-vs-rest sense in multi-class evaluation. You can set 'metricLabel' to define which class is 'positive' in multiclass -
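
A minimal sketch with toy data, assuming Spark 3.0+ where metricLabel is available:

```python
# Choose which class counts as 'positive' for per-label F1 via metricLabel.
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()
preds = spark.createDataFrame(
    [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0)],
    ["prediction", "label"])

# F1 computed treating label 1.0 as the positive class.
evaluator = MulticlassClassificationEvaluator(
    metricName="fMeasureByLabel", metricLabel=1.0)
print(evaluator.evaluate(preds))
```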

Re: Increase batch interval in case of delay

2021-07-01 Thread Sean Owen
Wouldn't this happen naturally? the large batches would just take a longer time to complete already. On Thu, Jul 1, 2021 at 6:32 AM András Kolbert wrote: > Hi, > > I have a spark streaming application which generally able to process the > data within the given time frame. However, in certain

Re: OutOfMemoryError

2021-07-01 Thread Sean Owen
You need to set driver memory before the driver starts, on the CLI or however you run your app, not in the app itself. By the time the driver starts to run your app, its heap is already set. On Thu, Jul 1, 2021 at 12:10 AM javaguy Java wrote: > Hi, > > I'm getting Java OOM errors even though
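
For illustration, a minimal sketch of the trap described above; the flag in the comment is the usual route, and the in-app config line is shown only to make the point:

```python
# Set driver memory where the JVM is launched, e.g.:
#
#   spark-submit --driver-memory 8g my_app.py
#
# The builder config below is too late in client mode: by the time this
# code runs, the driver heap is already fixed.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.memory", "8g")  # has no effect on a running JVM
         .getOrCreate())
```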

Re: Spark Null Pointer Exception

2021-06-30 Thread Sean Owen
The error is in your code, which you don't show. You are almost certainly incorrectly referencing something like a SparkContext in a Spark task. On Wed, Jun 30, 2021 at 3:48 PM Amit Sharma wrote: > Hi , I am using spark 2.7 version with scala. I am calling a method as > below > > 1. val

Re: Inclusive terminology usage in Spark

2021-06-30 Thread Sean Owen
This was covered and mostly done last year: https://issues.apache.org/jira/browse/SPARK-32004 In some instances, it's hard to change the terminology as it would break user APIs, and the marginal benefit may not be worth it, but, have a look at the remaining task under that umbrella. On Wed, Jun

Re: CVEs

2021-06-21 Thread Sean Owen
se the >> visibility. >> >> Will 3.2.x be Scala 2.13.x only or cross compiled with 2.12? >> >> I realize Spark is a beast so I just want to help if I can but also not >> create extra work if it is not useful for me or the Spark team/contributors.

Re: CVEs

2021-06-21 Thread Sean Owen
to help if I can but also not > create extra work if it is not useful for me or the Spark team/contributors. > > On Mon, Jun 21, 2021 at 3:43 PM Sean Owen wrote: > >> Whether it matters really depends on whether the CVE affects Spark. >> Sometimes it clearly could and so we'd

Re: CVEs

2021-06-21 Thread Sean Owen
Whether it matters really depends on whether the CVE affects Spark. Sometimes it clearly could and so we'd try to back-port dependency updates to active branches. Sometimes it clearly doesn't and hey sometimes the dependency is updated anyway for good measure (mostly to keep this off static

Re: Why does sparkml random forest classifier not support maxBins < number of total categorical values?

2021-06-16 Thread Sean Owen
I think it's because otherwise you would not be able to consider, at least, K-1 splits among K features, and you want to be able to do that. There may be more technical reasons in the code that this is strictly enforced, but it seems like a decent idea. Agree, more than K doesn't seem to help,

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Sean Owen
run multiple tasks > on multiple nodes. > > On Wed, Jun 9, 2021 at 7:57 PM Sean Owen wrote: > >> Wait. Isn't that what you were trying to parallelize in the first place? >> >> On Wed, Jun 9, 2021 at 1:49 PM Tom Barber wrote: >> >>> Yeah but that

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Sean Owen
That looks like you did some work on the cluster, and now it's stuck doing something else on the driver - not doing everything on 1 machine. On Wed, Jun 9, 2021 at 12:43 PM Tom Barber wrote: > And also as this morning: https://pasteboard.co/K5Q9aEf.png > > Removing the cpu pins gives me more

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Sean Owen
>>> Interesting Sean thanks for that insight, I wasn't aware of that fact, I >>> assume the .persist() at the end of that line doesn't do it? >>> >>> I believe, looking at the output in the SparkUI, it gets to >>> https://github.com/USCDataScience/

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Sean Owen
workload in question: >>>> https://gist.github.com/buggtb/a9e0445f24182bc8eedfe26c0f07a473 >>>> >>>> On 2021/06/09 01:52:39, Tom Barber wrote: >>>> > ExecutorID says driver, and looking at the IP addresses it's >>>> running on

Re: Distributing a FlatMap across a Spark Cluster

2021-06-08 Thread Sean Owen
the workers, so I no longer saturate the master node, but I > also have 3 workers just sat there doing nothing. > > On 2021/06/09 01:26:50, Sean Owen wrote: > > Are you sure it's on the driver? or just 1 executor? > > how many partitions does the groupByKey produce? that would

Re: Distributing a FlatMap across a Spark Cluster

2021-06-08 Thread Sean Owen
Are you sure it's on the driver? or just 1 executor? how many partitions does the groupByKey produce? that would limit your parallelism no matter what if it's a small number. On Tue, Jun 8, 2021 at 8:07 PM Tom Barber wrote: > Hi folks, > > Hopefully someone with more Spark experience than me

Re: Problem in Restoring ML Pipeline with UDF

2021-06-08 Thread Sean Owen
It's a little bit of a guess, but the class name $line103090609224.$read$FeatureModder looks like something generated by the shell. I think it's your 'real' classname in this case. If you redefined this later and loaded it you may not find it matches up. Can you declare this in a package? On Tue,

Re: Petastorm vs horovod vs tensorflowonspark vs spark_tensorflow_distributor

2021-06-05 Thread Sean Owen
All of these tools are reasonable choices. I don't think the Spark project itself has a view on what works best. These things do different things. For example petastorm is not a training framework, but a way to feed data to a distributed DL training process on Spark. For what it's worth,

Re: Missing module spark-hadoop-cloud in Maven central

2021-05-31 Thread Sean Owen
I know it's not enabled by default when the binary artifacts are built, but not exactly sure why it's not built separately at all. It's almost a dependencies-only pom artifact, but there are two source files. Steve do you have an angle on that? On Mon, May 31, 2021 at 5:37 AM Erik Torres wrote:

Re: [apache spark] Does Spark 2.4.8 have issues with ServletContextHandler

2021-05-27 Thread Sean Owen
Despite the name, the error doesn't mean the class isn't found but could not be initialized. What's the rest of the error? I don't believe any testing has ever encountered this error, so it's likely something to do with your environment, but I don't know what. On Thu, May 27, 2021 at 7:32 AM

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
ECS (Elastic Container Service) for this use > case which allows us to autoscale? > > On Tue, May 25, 2021 at 2:16 PM Sean Owen wrote: > >> What you could do is launch N Spark jobs in parallel from the driver. >> Each one would process a directory you supply with spark.read.par

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
What you could do is launch N Spark jobs in parallel from the driver. Each one would process a directory you supply with spark.read.parquet, for example. You would just have 10s or 100s of those jobs running at the same time. You have to write a bit of async code to do it, but it's pretty easy
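
A minimal sketch of that pattern with a thread pool; the directory paths are hypothetical. Spark's scheduler is thread-safe, so each thread can submit its own job on the shared session.

```python
# Launch N independent Spark jobs concurrently from the driver.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dirs = ["/data/in/part1", "/data/in/part2", "/data/in/part3"]  # hypothetical

def process(path):
    df = spark.read.parquet(path)
    df.write.mode("overwrite").parquet(path.replace("/in/", "/out/"))
    return path

with ThreadPoolExecutor(max_workers=8) as pool:
    for done in pool.map(process, dirs):
        print("finished", done)
```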

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
Right, you can't use Spark within Spark. Do you actually need to read Parquet like this vs spark.read.parquet? that's also parallel of course. You'd otherwise be reading the files directly in your function with the Parquet APIs. On Tue, May 25, 2021 at 12:24 PM Eric Beabes wrote: > I've a use

Re: unresolved dependency: graphframes#graphframes;0.8.1-spark2.4-s_2.11: not found

2021-05-19 Thread Sean Owen
I think it's because the bintray repo has gone away. Did you see the recent email about the new repo for these packages? On Wed, May 19, 2021 at 12:42 PM Wensheng Deng wrote: > Hi experts: > > I tried the example as shown on this page, and it is not working for me: >

Re: Merge two dataframes

2021-05-17 Thread Sean Owen
Why join here - just add two columns to the DataFrame directly? On Mon, May 17, 2021 at 1:04 PM Andrew Melo wrote: > Anyone have ideas about the below Q? > > It seems to me that given that "diamond" DAG, that spark could see > that the rows haven't been shuffled/filtered, it could do some type
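
A minimal sketch of adding the columns directly; the expressions are hypothetical stand-ins for whatever the join was computing:

```python
# Add derived columns in place rather than joining a derived frame back on.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 3.0)], ["id", "x"])

df = (df
      .withColumn("x_squared", F.col("x") * F.col("x"))
      .withColumn("x_log", F.log(F.col("x"))))
df.show()
```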

Re: [EXTERNAL] Urgent Help - Py Spark submit error

2021-05-15 Thread Sean Owen
If code running on the executors needs some local file like a config file, then it does have to be passed this way. That much is normal. On Sat, May 15, 2021 at 1:41 AM Gourav Sengupta wrote: > Hi, > > once again lets start with the requirement. Why are you trying to pass xml > and json files to
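
A minimal sketch, assuming the app was submitted with --files config.json; SparkFiles.get resolves the staged copy on whichever node the task runs:

```python
# Ship a local config file to executors and read it inside a task.
# Assumes: spark-submit --files config.json my_app.py
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def uses_config(rows):
    import json
    with open(SparkFiles.get("config.json")) as f:  # local path on the worker
        cfg = json.load(f)
    return [(r, cfg.get("label")) for r in rows]

rdd = spark.sparkContext.parallelize(["a", "b"])
print(rdd.mapPartitions(uses_config).collect())
```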

Re: Merge two dataframes

2021-05-12 Thread Sean Owen
Yeah I don't think that's going to work - you aren't guaranteed to get 1, 2, 3, etc. I think row_number() might be what you need to generate a join ID. RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You could .zip two RDDs you get from DataFrames and manually convert the
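
A minimal sketch of both options; note the row_number route needs some deterministic ordering, and the constant ordering below is only a placeholder:

```python
# Option 1: generate a join ID with row_number(). Option 2: zip the RDDs.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("a",), ("b",)], ["left"])
df2 = spark.createDataFrame([(1,), (2,)], ["right"])

w = Window.orderBy(F.lit(1))  # beware: no real ordering guarantee
j1 = df1.withColumn("rn", F.row_number().over(w))
j2 = df2.withColumn("rn", F.row_number().over(w))
j1.join(j2, "rn").show()

# RDD route: .zip requires equal partition counts and element counts.
paired = df1.rdd.zip(df2.rdd).map(lambda ab: ab[0] + ab[1])
print(paired.collect())
```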

Re: Installation Error - Please Help!

2021-05-11 Thread Sean Owen
spark-shell is not on your path. Give the full path to it. On Tue, May 11, 2021 at 4:10 PM Talha Javed wrote: > Hello Team! > Hope you are doing well > > I have downloaded the Apache Spark version (spark-3.1.1-bin-hadoop2.7). I > have downloaded the winutils file too from github. > Python

Re: Issue while calling foreach in Pyspark

2021-05-08 Thread Sean Owen
It looks like the executor (JVM) stops immediately. Hard to say why - do you have Java installed and a compatible version? I agree it could be a py4j version problem, from that SO link. On Sat, May 8, 2021, 1:35 PM rajat kumar wrote: > Hi Sean/Mich, > > Thanks for response. > > That was the

Re: Issue while calling foreach in Pyspark

2021-05-07 Thread Sean Owen
SparkContext.getOrCreate(sparkConf) >> File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 367, >> in getOrCreate >> SparkContext(conf=conf or SparkConf()) >> File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 133, >> in

Re: Issue while calling foreach in Pyspark

2021-05-07 Thread Sean Owen
foreach definitely works :) This is not a streaming question. The error says that the JVM worker died for some reason. You'd have to look at its logs to see why. On Fri, May 7, 2021 at 11:03 AM Mich Talebzadeh wrote: > Hi, > > I am not convinced foreach works even in 3.1.1 > Try doing the same

Re: Broadcast Variable

2021-05-03 Thread Sean Owen
There is just one copy in memory. No different than if you have two variables pointing to the same dict. On Mon, May 3, 2021 at 7:54 AM Bode, Meikel, NMA-CFD < meikel.b...@bertelsmann.de> wrote: > Hi all, > > > > when broadcasting a large dict containing several million entries to > executors
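
A minimal sketch: the dict is shipped once per executor, and every task on that executor reads the same in-memory copy through .value.

```python
# Broadcast a large dict once; tasks share the single executor-side copy.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

big_dict = {i: str(i) for i in range(1000)}  # stand-in for millions of entries
b = sc.broadcast(big_dict)

rdd = sc.parallelize(range(10))
print(rdd.map(lambda k: b.value.get(k, "missing")).collect())
```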

Re: Cypher language on spark

2021-04-30 Thread Sean Owen
Right, yes, it did not continue. It's not in Spark. On Fri, Apr 30, 2021 at 7:07 AM jonnysettle wrote: > I remember back in 2019 reading about the Cypher language for graph queries being > introduced to Spark 3.x. But I don't see it in the latest version. Has the > project been abandoned (issues

Re: Spark DataFrame CodeGeneration in Java generates Scala specific code?

2021-04-29 Thread Sean Owen
From tracing the code a bit, it might do this if the POJO class has no public constructors - does it? On Thu, Apr 29, 2021 at 9:55 AM Rico Bergmann wrote: > Here is the relevant generated code and the Exception stacktrace. > > The problem in the generated code is at line 35. > >

Re: Spark DataFrame CodeGeneration in Java generates Scala specific code?

2021-04-29 Thread Sean Owen
I don't know this code well, but yes seems like something is looking for members of a companion object when there is none here. Can you show any more of the stack trace or generated code? On Thu, Apr 29, 2021 at 7:40 AM Rico Bergmann wrote: > Hi all! > > A simplified code snippet of what my

Re: How to calculate percentiles in Scala Spark 2.4.x

2021-04-27 Thread Sean Owen
Erm, just https://spark.apache.org/docs/2.3.0/api/sql/index.html#approx_percentile ? On Tue, Apr 27, 2021 at 3:52 AM Ivan Petrov wrote: > Hi, I have billions, potentially dozens of billions of observations. Each > observation is a decimal number. > I need to calculate percentiles 1, 25, 50, 75,
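
A minimal sketch of using it from PySpark via selectExpr; the accuracy argument (10000 here) trades memory for precision:

```python
# approx_percentile takes the column, an array of requested percentiles,
# and an accuracy knob; it runs distributed, no UDF needed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000000).withColumnRenamed("id", "x")

df.selectExpr(
    "approx_percentile(x, array(0.01, 0.25, 0.5, 0.75, 0.9), 10000) AS pct"
).show(truncate=False)
```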

Re: [Spark-Streaming] moving average on categorical data with time windowing

2021-04-26 Thread Sean Owen
You might be able to do this with multiple aggregations on avg(col("col1") == "cat1") etc, but how about pivoting the DataFrame first so that you get columns like "cat1" being 1 or 0? You would end up with (columns × categories) new columns if you want to count all categories in all cols. But then

Re: java.lang.IllegalArgumentException: Unsupported class file major version 55

2021-04-23 Thread Sean Owen
This means you compiled with Java 11, but are running on Java < 11. It's not related to Spark. On Fri, Apr 23, 2021 at 10:23 AM chansonzhang wrote: > I just updated the spark-* version in my pom.xml to match my spark and scala > environment, and this solved the problem > > > > > -- > Sent from:

Re: [Spark Core][Advanced]: Wrong memory allocation on standalone mode cluster

2021-04-18 Thread Sean Owen
Are you sure about the worker mem configuration? What are you setting --memory to, and what does the worker UI think its memory allocation is? On Sun, Apr 18, 2021 at 4:08 AM Mohamadreza Rostami < mohamadrezarosta...@gmail.com> wrote: > I see a bug in executor memory allocation in the standalone

Re: Spark Session error with 30s

2021-04-12 Thread Sean Owen
apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession > > Thanks, > Asmath > > On Mon, Apr 12, 2021 at 2:20 PM KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com> wrote: > >> I am using spark hbase connector provided by hortonwokrs. I was able to >> run witho

Re: Spark Session error with 30s

2021-04-12 Thread Sean Owen
Somewhere you're passing a property that expects a number, but giving it "30s". Is it a time property somewhere that really just wants ms or something? But most time properties (all?) in Spark should accept that type of input anyway. Really depends on what property has a problem and what is setting

Re: Spark and Bintray's shutdown

2021-04-12 Thread Sean Owen
Spark itself is distributed via Maven Central primarily, so I don't think it will be affected? On Mon, Apr 12, 2021 at 11:22 AM Florian CASTELAIN < florian.castel...@redlab.io> wrote: > Hello. > > > > Bintray will shutdown on first May. > > > > I just saw that packages are hosted on Bintray

Re: GPU job in Spark 3

2021-04-09 Thread Sean Owen
(I apologize, I totally missed that this should use GPUs because of RAPIDS. Ignore my previous. But yeah it's more a RAPIDS question.) On Fri, Apr 9, 2021 at 12:09 PM HaoZ wrote: > Hi Martin, > > I tested the local mode in Spark on Rapids Accelerator and it works fine > for > me. > The only

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread Sean Owen
me to complete, please suggest >> best way to get below requirement without using UDF >> >> >> Thanks, >> >> Ankamma Rao B >> -- >> *From:* Sean Owen >> *Sent:* Friday, April 9, 2021 6:11 PM >> *To:* ayan guha >

Re: GPU job in Spark 3

2021-04-09 Thread Sean Owen
I don't see anything in this job that would use a GPU? On Fri, Apr 9, 2021 at 11:19 AM Martin Somers wrote: > > Hi Everyone !! > > I'm trying to get an on-premise GPU instance of Spark 3 running on my ubuntu > box, and I am following: > >

Re: possible bug

2021-04-09 Thread Sean Owen
OK so it's '7 threads overwhelming off heap mem in the JVM' kind of thing. Or running afoul of ulimits in the OS. On Fri, Apr 9, 2021 at 11:19 AM Attila Zsolt Piros < piros.attila.zs...@gmail.com> wrote: > Hi Sean! > > So the "coalesce" without shuffle will create a CoalescedRDD which during

Re: possible bug

2021-04-09 Thread Sean Owen
Yeah I figured it's not something fundamental to the task or Spark. The error is very odd, never seen that. Do you have a theory on what's going on there? I don't! On Fri, Apr 9, 2021 at 10:43 AM Attila Zsolt Piros < piros.attila.zs...@gmail.com> wrote: > Hi! > > I looked into the code and find

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread Sean Owen
This can be significantly faster with a pandas UDF, note, because you can vectorize the operations. On Fri, Apr 9, 2021, 7:32 AM ayan guha wrote: > Hi > > We are using a haversine distance function for this, and wrapping it in > udf. > > from pyspark.sql.functions import acos, cos, sin, lit,
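
A minimal sketch of the vectorized version (Spark 3 pandas UDF type hints; column names are hypothetical):

```python
# A pandas UDF computes haversine distance over whole column batches with
# numpy, avoiding per-row Python overhead.
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

@pandas_udf(DoubleType())
def haversine_km(lat1: pd.Series, lon1: pd.Series,
                 lat2: pd.Series, lon2: pd.Series) -> pd.Series:
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return pd.Series(6371.0 * 2 * np.arcsin(np.sqrt(a)))

df = spark.createDataFrame(
    [(52.52, 13.405, 48.8566, 2.3522)],
    ["lat1", "lon1", "lat2", "lon2"])
df.withColumn("dist_km", haversine_km("lat1", "lon1", "lat2", "lon2")).show()
```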

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Sean Owen
Right, you already established a few times that the difference is the number of partitions. Russell answered with what is almost surely the correct answer, that it's AQE. In toy cases it isn't always a win. Disable it if you need to. It's not a problem per se in 3.1; AQE speeds up more realistic

Re: possible bug

2021-04-08 Thread Sean Owen
That's a very low level error from the JVM. Any chance you are misconfiguring the executor size? like to 10MB instead of 10GB, that kind of thing. Trying to think of why the JVM would have very little memory to operate. An app running out of mem would not look like this. On Thu, Apr 8, 2021 at

Re: Apache ML Agorithm Solution

2021-04-07 Thread Sean Owen
I think this question was asked just a week ago? same company and setup. https://mail-archives.apache.org/mod_mbox/spark-user/202104.mbox/%3CLNXP123MB2604758548BE38E8D3F369EC8A7B9%40LNXP123MB2604.GBRP123.PROD.OUTLOOK.COM%3E On Wed, Apr 7, 2021 at 11:17 AM SRITHALAM, ANUPAMA (Risk Value Stream)

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Sean Owen
You shouldn't be modifying your cluster install. You may at this point have conflicting, excess JARs in there somewhere. I'd start it over if you can. On Wed, Apr 7, 2021 at 7:15 AM Gabor Somogyi wrote: > Not sure what you mean not working. You've added 3.1.1 to packages which > uses: > * 2.6.0

Mesos + Spark users going forward?

2021-04-07 Thread Sean Owen
I noted that Apache Mesos is moving to the attic, so won't be actively developed soon: https://lists.apache.org/thread.html/rab2a820507f7c846e54a847398ab20f47698ec5bce0c8e182bfe51ba%40%3Cdev.mesos.apache.org%3E That doesn't mean people will stop using it as a Spark resource manager soon. But it

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Sean Owen
de. >>> >>> >>> Running on local node with >>> >>> >>> spark-submit --master local[4] --conf >>> spark.pyspark.virtualenv.enabled=true --conf >>> spark.pyspark.virtualenv.type=native --conf >>> spark.pyspark.virtuale

Re: Tuning spark job to make count faster.

2021-04-06 Thread Sean Owen
Hard to say without a lot more info, but 76.5K tasks is very large. How big are the tasks / how long do they take? if very short, you should repartition down. Do you end up with 800 executors? if so why 2 per machine? that generally is a loss at this scale of worker. I'm confused because you have

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Sean Owen
You may be compiling your app against 3.0.1 JARs but submitting to 3.1.1. You do not in general modify the Spark libs. You need to package libs like this with your app at the correct version. On Tue, Apr 6, 2021 at 6:42 AM Mich Talebzadeh wrote: > Thanks Gabor. > > All nodes are running Spark

Re: FW: Email to Spark Org please

2021-04-01 Thread Sean Owen
Yes that's a great option when the modeling process itself doesn't really need Spark. You can use any old modeling tool you want and get the parallelism in tuning via hyperopt's Spark integration. On Thu, Apr 1, 2021 at 10:50 AM Williams, David (Risk Value Stream) wrote: > Classification:

Re: Error Message Suggestion

2021-03-29 Thread Sean Owen
Sure, just open a pull request? On Mon, Mar 29, 2021 at 10:37 AM Josh Herzberg wrote: > Hi, > > I'd like to suggest this change to the PySpark code. I haven't contributed > before so https://spark.apache.org/contributing.html suggested emailing > here first. > > In the error raised here >

Re: Spark Views Functioning

2021-03-26 Thread Sean Owen
Views are simply bookkeeping about how the query is executed, like a DataFrame. There is no data or result to store; it's just how to run a query. The views exist on the driver. The query executes like any other, on the cluster. On Fri, Mar 26, 2021 at 3:38 AM Mich Talebzadeh wrote: > > As a
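
A minimal sketch of what that means in practice: the registration below stores no data; only the subsequent query does work, on the cluster.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumn("even", (F.col("id") % 2) == 0)

df.createOrReplaceTempView("numbers")  # bookkeeping on the driver; nothing runs yet
spark.sql("SELECT even, count(*) AS n FROM numbers GROUP BY even").show()  # executes on the cluster
```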

Re: FW: Email to Spark Org please

2021-03-26 Thread Sean Owen
get that working in distributed, will we get > benefits similar to spark ML? > > > > Best Regards, > > Dave Williams > > > > *From:* Sean Owen > *Sent:* 26 March 2021 13:20 > *To:* Williams, David (Risk Value Stream) > > *Cc:* user@spark.apache.org

Re: FW: Email to Spark Org please

2021-03-26 Thread Sean Owen
> Many thanks for your response Sean. > > > > Question - why is Spark overkill for this and why is sklearn faster > please? It's the same algorithm, right? > > > > Thanks again, > > Dave Williams > > > > *From:* Sean Owen > *Sent:*

Re: convert java dataframe to pyspark dataframe

2021-03-26 Thread Sean Owen
The problem is that both of these are not sharing a SparkContext as far as I can see, so there is no way to share the object across them, let alone languages. You can of course write the data from Java, read it from Python. In some hosted Spark products, you can access the same session from two

Re: FW: Email to Spark Org please

2021-03-25 Thread Sean Owen
Spark is overkill for this problem; use sklearn. But I'd suspect that you are using just 1 partition for such a small data set, and get no parallelism from Spark. repartition your input to many more partitions, but, it's unlikely to get much faster than in-core sklearn for this task. On Thu, Mar

Re: Rdd - zip with index

2021-03-24 Thread Sean Owen
arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Wed, 24 Mar 2021 at 12:40, Sean Owen wrote: > >>

Re: Rdd - zip with index

2021-03-24 Thread Sean Owen
No need to do that. Reading the header with Spark automatically is trivial. On Wed, Mar 24, 2021 at 5:25 AM Mich Talebzadeh wrote: > If it is a csv then it is a flat file somewhere in a directory I guess. > > Get the header out by doing > > */usr/bin/zcat csvfile.gz |head -n 1* > Title
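
A minimal sketch of the simpler route, with a hypothetical path; Spark reads the header itself, even from a gzipped CSV:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (spark.read
      .option("header", "true")       # first row becomes the column names
      .option("inferSchema", "true")
      .csv("/data/csvfile.gz"))
print(df.columns)
```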

Re: Rdd - zip with index

2021-03-23 Thread Sean Owen
It would split 10GB of CSV into multiple partitions by default, unless it's gzipped. Something else is going on here. ‪On Tue, Mar 23, 2021 at 10:04 PM ‫"Yuri Oleynikov (‫יורי אולייניקוב‬‎)"‬‎ < yur...@gmail.com> wrote:‬ > I’m not Spark core developer and do not want to confuse you but it seems

Re: Rdd - zip with index

2021-03-23 Thread Sean Owen
I don't think that would change partitioning? try .repartition(). It isn't necessary to write it out let alone in Avro. ‪On Tue, Mar 23, 2021 at 8:45 PM ‫"Yuri Oleynikov (‫יורי אולייניקוב‬‎)"‬‎ < yur...@gmail.com> wrote:‬ > Hi, Mohammed > I think that the reason that only one executor is running

Re: Repartition or Coalesce not working

2021-03-22 Thread Sean Owen
You need to do something with the result of repartition. You haven't changed textDF. On Mon, Mar 22, 2021, 12:15 PM KhajaAsmath Mohammed wrote: > Hi, > > I have a use case where there are large files in hdfs. > > Size of the file is 3 GB. > > It is an existing code in production and I am trying
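
A minimal sketch of the bug and the fix, with a hypothetical path; repartition, like all DataFrame transformations, returns a new DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
textDF = spark.read.text("/data/large_file.txt")  # hypothetical path

textDF.repartition(48)            # no effect: the result is discarded
textDF = textDF.repartition(48)   # correct: rebind to the repartitioned frame
print(textDF.rdd.getNumPartitions())
```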

Re: Spark version verification

2021-03-21 Thread Sean Owen
I believe you can "SELECT version()" in Spark SQL to see the build version. On Sun, Mar 21, 2021 at 4:41 AM Mich Talebzadeh wrote: > Thanks for the detailed info. > > I was hoping that one can find a simpler answer to the Spark version than > doing forensic examination on base code so to speak.
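
A quick sketch of both checks (the version() SQL function is available in recent releases; spark.version works on any session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("SELECT version()").show(truncate=False)  # SQL function
print(spark.version)                                # session property
```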

Re: Submitting insert query from beeline failing on executor server with java 11

2021-03-16 Thread Sean Owen
That looks like you didn't compile with Java 11 actually. How did you try to do so? On Tue, Mar 16, 2021, 7:50 AM kaki mahesh raja wrote: > HI All, > > We have compiled spark with java 11 ("11.0.9.1") and when testing the > thrift > server we are seeing that insert query from operator using

Re: Sounds like Structured streaming with foreach, can only run on one executor

2021-03-09 Thread Sean Owen
That should not be the case. See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch Maybe you are calling .foreach on some Scala object inadvertently. On Tue, Mar 9, 2021 at 4:41 PM Mich Talebzadeh wrote: > Hi, > > When I use
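
A minimal sketch of the foreachBatch pattern from the linked docs; the rate source is Spark's built-in test source, and the sink path and checkpoint location are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.format("rate").load()  # built-in test source

def write_batch(batch_df, batch_id):
    # Each micro-batch arrives as an ordinary DataFrame; any batch sink works.
    batch_df.write.mode("append").parquet("/data/out")  # hypothetical path

query = (stream.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/ckpt")  # hypothetical
         .start())
query.awaitTermination()
```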

Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread Sean Owen
You can also group by the key in the transformation on each batch. But yes that's faster/easier if it's already partitioned that way. On Tue, Mar 9, 2021 at 7:30 AM Ali Gouta wrote: > Do not know Kinesis, but it looks like it works like Kafka. Your producer > should implement a partitioner that

Re: Creating spark context outside of the driver throws error

2021-03-08 Thread Sean Owen
Yep, you can never use Spark inside Spark. You could run N jobs in parallel from the driver using Spark, however. On Mon, Mar 8, 2021 at 3:14 PM Mich Talebzadeh wrote: > > In structured streaming with pySpark, I need to do some work on the row > *foreach(process_row)* > > below > > > *def

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sean Owen
It's there in the error: No space left on device You ran out of disk space (local disk) on one of your machines. On Mon, Mar 8, 2021 at 2:02 AM Sachit Murarka wrote: > Hi All, > > I am getting the following error in my spark job. > > Can someone please have a look ? > >

Re: Possible upgrade path from Spark 3.1.1-RC2 to Spark 3.1.1 GA

2021-03-04 Thread Sean Owen
I think you're still asking about GCP and Dataproc, and that's really nothing to do with Spark itself. Whatever issues you are having concern Dataproc and how it's run and possibly customizations in Dataproc. 3.1.1-RC2 is not a release, but, also nothing meaningfully changed between it and the

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-03 Thread Sean Owen
I don't have any good answer here, but, I seem to recall that this is because of SQL semantics, which follows column ordering not naming when performing operations like this. It may well be as intended. On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic < oldrich.vla...@datasentics.com> wrote: > Hi,

Re: Please update this notification on Spark download Site

2021-03-02 Thread Sean Owen
That statement is still accurate - it is saying the release will be 3.1.1, not 3.1.0. In any event, 3.1.1 is rolling out as we speak - already in Maven and binaries are up and the website changes are being merged. On Tue, Mar 2, 2021 at 9:10 AM Mich Talebzadeh wrote: > > Can someone please

Re: Spark closures behavior in local mode in IDEs

2021-02-26 Thread Sean Owen
Yeah this is a good question. It is certainly to do with executing within the same JVM, but even I'd have to dig into the code to explain why the spark-sql version operates differently, as that also appears to be local. To be clear this 'shouldn't' work, just happens to not fail in local

Re: Issue after change to 3.0.2

2021-02-26 Thread Sean Owen
That looks to me like you have two different versions of Spark in use somewhere here. Like the cluster and driver versions aren't quite the same. Check your classpaths? On Fri, Feb 26, 2021 at 2:53 AM Bode, Meikel, NMA-CFD < meikel.b...@bertelsmann.de> wrote: > Hi All, > > > > After changing to

Re: A serious bug in the fitting of a binary logistic regression.

2021-02-22 Thread Sean Owen
I'll take a look. At a glance - is it converging? might turn down the tolerance to check. Also what does scikit learn say on the same data? we can continue on the JIRA. On Mon, Feb 22, 2021 at 5:42 PM Yakov Kerzhner wrote: > I have written up a JIRA, and there is a gist attached that has code

Re: spark 3.1.1 release date?

2021-02-20 Thread Sean Owen
Another RC is starting imminently, which looks pretty good. If it succeeds, probably next week. It will support Scala 2.12, but I believe a Scala 2.13 build is only coming in 3.2.0. On Sat, Feb 20, 2021 at 1:54 PM Bulldog20630405 wrote: > > what is the expected ballpark release date of spark

Re: Using Custom Scala Spark ML Estimator in PySpark

2021-02-16 Thread Sean Owen
You won't be able to use it in Python if it is implemented in Java - it needs a Python wrapper too. On Mon, Feb 15, 2021, 11:29 PM HARSH TAKKAR wrote: > Hi , > > I have created a custom Estimator in scala, which i can use successfully > by creating a pipeline model in Java and scala, But when I
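
A rough sketch of what such a wrapper can look like, following the pattern PySpark's own ML classes use; "com.example.MyEstimator" is a hypothetical Scala class, and real wrappers also declare and forward their Params:

```python
from pyspark.ml.wrapper import JavaEstimator, JavaModel

class MyModel(JavaModel):
    """Python handle for the fitted JVM-side model."""
    pass

class MyEstimator(JavaEstimator):
    def __init__(self):
        super(MyEstimator, self).__init__()
        # Instantiate the JVM-side estimator by class name.
        self._java_obj = self._new_java_obj("com.example.MyEstimator", self.uid)

    def _create_model(self, java_model):
        # Called by fit() to wrap the JVM model object.
        return MyModel(java_model)
```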

Re: vm.swappiness value for Spark on Kubernetes

2021-02-16 Thread Sean Owen
You probably don't want swapping in any environment. Some tasks will grind to a halt under mem pressure rather than just fail quickly. You would want to simply provision more memory. On Tue, Feb 16, 2021, 7:57 AM Jahar Tyagi wrote: > Hi, > > We have recently migrated from Spark 2.4.4 to Spark
