You can also call a Scala UDF from Python in Spark - this doesn't need
Zeppelin or relate to the front-end.
This may indeed be much easier as a proper UDF; depends on what this
function does.
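As a minimal sketch of calling a Scala UDF from PySpark (the Scala-side class
com.example.udf.Functions and its registerUdfs method are made up for
illustration, and this goes through PySpark's internal _jvm gateway):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    # Ask the JVM side to register its UDF(s) against this session
    # (the Scala code would call spark.udf.register under a name).
    spark._jvm.com.example.udf.Functions.registerUdfs(spark._jsparkSession)
    # Once registered, the Scala UDF is usable from SQL / expr() in Python.
    spark.range(5).withColumn("out", F.expr("my_scala_udf(id)")).show()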
However I think the issue may be that you're trying to wrap the resulting
DataFrame in a DataFrame or
It could be 'normal' - executors won't GC unless they need to.
It could be state in your application, if you're storing state.
You'd want to dump the heap to take a first look.
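(blank)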
On Sat, Sep 25, 2021 at 7:24 AM Kiran Biswal wrote:
> Hello Experts
>
> I have a spark streaming application(DStream).
What is null, what is its type, does it make sense in Postgres, etc.? Need
more info.
On Wed, Sep 22, 2021 at 9:18 AM Aram Morcecian wrote:
> Hi everyone, I'm facing something weird. After doing some transformations
> to a SparkDF I print some rows and I see those perfectly, but when I write
>
spark-rapids is not part of Spark, so couldn't speak to it, but Spark
itself does not use GPUs at all.
It does let you configure a task to request a certain number of GPUs, and
that would work for RDDs, but it's up to the code being executed to use the
GPUs.
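To illustrate what "up to the code being executed" means, here is a minimal
sketch: the task can ask Spark which GPU addresses it was assigned, but it has
to hand them to a GPU library itself. This assumes GPU resources are already
configured on the cluster; everything else is illustrative.

    from pyspark import TaskContext
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def work(partition):
        # Spark only tells the task which GPU(s) it was allocated; actually
        # using them (CuPy, TensorFlow, etc.) is entirely up to this code.
        res = TaskContext.get().resources()
        addrs = res["gpu"].addresses if "gpu" in res else []
        for x in partition:
            yield (x, addrs)

    print(spark.sparkContext.parallelize(range(8), 4).mapPartitions(work).collect())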
On Tue, Sep 21, 2021 at 1:23 PM
I don't think that has ever shown up in the CI/CD builds, and I can't recall
anyone reporting this. What did you change? It may be some local env issue
On Fri, Sep 17, 2021 at 7:09 AM Enrico Minardi
wrote:
>
> Hello,
>
>
> the Maven build of Apache Spark 3.1.2 for user-provided Hadoop 2.10.1
Looks like this was improved in
https://issues.apache.org/jira/browse/SPARK-35701 for 3.2.0
On Fri, Sep 10, 2021 at 10:21 PM Kohki Nishio wrote:
> Hello,
> I'm running spark in local mode and seeing multiple threads showing like
> below, anybody knows why it's not using a concurrent hash map ?
- other lists, please don't cross post to 4 lists (!)
This is a problem you'd see with Java 9 or later - I assume you're running
that under the hood. However, Spark should handle the case where certain
things can't be accessed in Java 9+, so this may be a bug; I'll look
into it. In the
I don't know if Java serialization is slow in that case; that shows
blocking on a class load, which may or may not be directly due to
deserialization.
Indeed I don't think (some) things are serialized in local mode within one
JVM, so not sure that's actually what's going on.
On Thu, Sep 2, 2021
That is something else. Yes, you can create a single, complex stream job
that joins different data sources, etc. That is not different than any
other Spark usage. What are you looking for w.r.t. docs?
We are also saying you can simply run N unrelated streaming jobs in
parallel on the driver,
> ...excuse my ignorance, but I just can't figure out how to
> create a collection across multiple streams using multiple stream readers.
> Could you provide some examples or additional references? Thanks!
>
> On 8/24/21 11:01 PM, Sean Owen wrote:
>
> No, that applies to the stream
No, that applies to the streaming DataFrame API too.
No, jobs can't communicate with each other.
On Tue, Aug 24, 2021 at 9:51 PM Artemis User wrote:
> Thanks Daniel. I guess you were suggesting using DStream/RDD. Would it
> be possible to use structured streaming/DataFrames for multi-source
>
Date handling was tightened up in Spark 3. I think you need to compare to a
date literal, not a string literal.
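For example, a minimal sketch of the difference, assuming an existing
SparkSession `spark` and a made-up table and column name:

    from pyspark.sql import functions as F

    df = spark.table("events")
    # Comparing to a string relies on implicit casts that were tightened in Spark 3:
    #   df.filter(F.col("event_date") > "2021-08-01")
    # Compare to an explicit date literal instead:
    df.filter(F.col("event_date") > F.to_date(F.lit("2021-08-01"))).show()
    # Or in SQL:  SELECT * FROM events WHERE event_date > DATE '2021-08-01'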
On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <
gourav.sengupta.develo...@gmail.com> wrote:
> Hi,
>
> while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as "SELECT *
> FROM
FTP is definitely not supported. Read the files to distributed storage
first then read from there.
On Sun, Aug 8, 2021, 10:18 PM igyu wrote:
> val ftpUrl = "ftp://ftpuser:ftpuser@10.3.87.51:21/sparkftp/;
>
> val schemas = StructType(List(
> new StructField("name", DataTypes.StringType,
Doesn't a persist break stages?
On Thu, Aug 5, 2021, 11:40 AM Tom Graves
wrote:
> As Sean mentioned its only available at Stage level but you said you don't
> want to shuffle so splitting into stages doesn't help you. Without more
> details it seems like you could "hack" this by just
Oh I see, I missed that. You can specify at the stage level, nice. I think
you are more looking to break these operations into two stages. You can do
that with a persist or something - which has a cost but may work fine.
Does it actually help much with GPU utilization - in theory yes but
No, unless I'm crazy you can't even change resource requirements at the
job level let alone stage. Does it help you though? Is something else even
able to use the GPU otherwise?
On Sat, Jul 31, 2021, 3:56 AM Andreas Kunft wrote:
> I have a setup with two work intensive tasks, one map using GPU
(This is a list of OSS Spark - anything vendor-specific should go to vendor
lists for better answers.)
On Fri, Jul 30, 2021 at 8:35 AM Harsh Sharma
wrote:
> hi Team ,
>
> we are upgrading our Cloudera parcels to 6.x from 5.x, hence we have
> upgraded the version of Spark from 1.6 to 2.4. While
You're right, I think storageFraction is somewhat better to control this,
although some things 'counted' in spark.memory.fraction will also be
long-lived and in the OldGen.
You can also increase the OldGen size if you're pretty sure that's the
issue - 'old' objects in the YoungGen.
I'm not sure
The positive class is "1" and negative is "0" by convention; I don't think
you can change that (though you can translate your data if needed).
F1 is defined only in a one-vs-rest sense in multi-class evaluation. You
can set 'metricLabel' to define which class is 'positive' in multiclass -
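For instance, a minimal sketch with MulticlassClassificationEvaluator (the
column names and the chosen label are illustrative):

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # Overall (weighted) F1 across all classes:
    f1 = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="f1")
    # One-vs-rest F1 where class 1.0 is treated as the 'positive' class:
    f1_for_1 = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction",
        metricName="fMeasureByLabel", metricLabel=1.0)
    # then f1_for_1.evaluate(predictions) on a DataFrame of predictions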
Wouldn't this happen naturally? The large batches would already just take a
longer time to complete.
On Thu, Jul 1, 2021 at 6:32 AM András Kolbert
wrote:
> Hi,
>
> I have a spark streaming application which generally able to process the
> data within the given time frame. However, in certain
You need to set driver memory before the driver starts, on the CLI or
however you run your app, not in the app itself. By the time the driver
starts to run your app, its heap is already set.
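For example (illustrative values) - the setting has to be supplied when the
driver is launched, not from inside the application:

    # Too late - the driver JVM heap is already fixed by the time this runs:
    #   SparkSession.builder.config("spark.driver.memory", "8g").getOrCreate()
    # Pass it at launch instead, e.g.:
    #   spark-submit --driver-memory 8g your_app.py
    # or set spark.driver.memory in conf/spark-defaults.conf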
On Thu, Jul 1, 2021 at 12:10 AM javaguy Java wrote:
> Hi,
>
> I'm getting Java OOM errors even though
The error is in your code, which you don't show. You are almost certainly
incorrectly referencing something like a SparkContext in a Spark task.
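The usual shape of the problem, as a minimal sketch in PySpark terms (the
names are made up; the original code isn't shown):

    class Job:
        def __init__(self, spark):
            self.spark = spark                     # holds the SparkContext

        def bad(self, rdd):
            # Fails to serialize: the lambda captures `self`, dragging the
            # SparkContext into the task closure.
            return rdd.map(lambda x: x * self.spark.sparkContext.defaultParallelism)

        def good(self, rdd):
            # Works: copy plain values out before building the closure.
            n = self.spark.sparkContext.defaultParallelism
            return rdd.map(lambda x: x * n)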
On Wed, Jun 30, 2021 at 3:48 PM Amit Sharma wrote:
> Hi , I am using spark 2.7 version with scala. I am calling a method as
> below
>
> 1. val
This was covered and mostly done last year:
https://issues.apache.org/jira/browse/SPARK-32004
In some instances, it's hard to change the terminology as it would break
user APIs, and the marginal benefit may not be worth it, but, have a look
at the remaining task under that umbrella.
On Wed, Jun
>>
>> Will 3.2.x be Scala 2.13.x only or cross compiled with 2.12?
>>
>> I realize Spark is a beast so I just want to help if I can but also not
>> create extra work if it is not useful for me or the Spark team/contributors.
o help if I can but also not
> create extra work if it is not useful for me or the Spark team/contributors.
>
> On Mon, Jun 21, 2021 at 3:43 PM Sean Owen wrote:
>
>> Whether it matters really depends on whether the CVE affects Spark.
>> Sometimes it clearly could and so we'd
Whether it matters really depends on whether the CVE affects Spark.
Sometimes it clearly could and so we'd try to back-port dependency updates
to active branches.
Sometimes it clearly doesn't and hey sometimes the dependency is updated
anyway for good measure (mostly to keep this off static
I think it's because otherwise you would not be able to consider, at least,
K-1 splits among K features, and you want to be able to do that. There may
be more technical reasons in the code that this is strictly enforced, but
it seems like a decent idea. Agree, more than K doesn't seem to help,
run multiple tasks
> on multiple nodes.
>
> On Wed, Jun 9, 2021 at 7:57 PM Sean Owen wrote:
>
>> Wait. Isn't that what you were trying to parallelize in the first place?
>>
>> On Wed, Jun 9, 2021 at 1:49 PM Tom Barber wrote:
>>
>>> Yeah but that
That looks like you did some work on the cluster, and now it's stuck doing
something else on the driver - not doing everything on 1 machine.
On Wed, Jun 9, 2021 at 12:43 PM Tom Barber wrote:
> And also as this morning: https://pasteboard.co/K5Q9aEf.png
>
> Removing the cpu pins gives me more
>>> Interesting Sean thanks for that insight, I wasn't aware of that fact, I
>>> assume the .persist() at the end of that line doesn't do it?
>>>
>>> I believe, looking at the output in the SparkUI, it gets to
>>> https://github.com/USCDataScience/
>>>> ...workload in question:
>>>> https://gist.github.com/buggtb/a9e0445f24182bc8eedfe26c0f07a473
>>>>
>>>> On 2021/06/09 01:52:39, Tom Barber wrote:
>>>> > ExecutorID says driver, and looking at the IP addresses its
>>>> running on
the workers, so I no longer saturate the master node, but I
> also have 3 workers just sat there doing nothing.
>
> On 2021/06/09 01:26:50, Sean Owen wrote:
> > Are you sure it's on the driver? or just 1 executor?
> > how many partitions does the groupByKey produce? that would
Are you sure it's on the driver, or just 1 executor?
How many partitions does the groupByKey produce? That would limit your
parallelism no matter what if it's a small number.
On Tue, Jun 8, 2021 at 8:07 PM Tom Barber wrote:
> Hi folks,
>
> Hopefully someone with more Spark experience than me
It's a little bit of a guess, but the class name
$line103090609224.$read$FeatureModder looks like something generated by the
shell. I think it's your 'real' classname in this case. If you redefined
this later and loaded it you may not find it matches up. Can you declare
this in a package?
On Tue,
All of these tools are reasonable choices. I don't think the Spark project
itself has a view on what works best. These things do different things. For
example petastorm is not a training framework, but a way to feed data to a
distributed DL training process on Spark. For what it's worth,
I know it's not enabled by default when the binary artifacts are built, but
not exactly sure why it's not built separately at all. It's almost a
dependencies-only pom artifact, but there are two source files. Steve do
you have an angle on that?
On Mon, May 31, 2021 at 5:37 AM Erik Torres wrote:
Despite the name, the error doesn't mean the class wasn't found, but that it
could not be initialized. What's the rest of the error?
I don't believe any testing has ever encountered this error, so it's likely
something to do with your environment, but I don't know what.
On Thu, May 27, 2021 at 7:32 AM
CS (Elastic Container Service) for this use
> case which allows us to autoscale?
>
> On Tue, May 25, 2021 at 2:16 PM Sean Owen wrote:
>
>> What you could do is launch N Spark jobs in parallel from the driver.
>> Each one would process a directory you supply with spark.read.par
What you could do is launch N Spark jobs in parallel from the driver. Each
one would process a directory you supply with spark.read.parquet, for
example. You would just have 10s or 100s of those jobs running at the same
time. You have to write a bit of async code to do it, but it's pretty easy
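A minimal sketch of that async approach (the paths and pool size are
illustrative):

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def count_dir(path):
        # Each call submits its own Spark job; the jobs run concurrently on the cluster.
        return spark.read.parquet(path).count()

    dirs = ["s3://bucket/input/day=%d" % i for i in range(100)]
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(count_dir, dirs))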
Right, you can't use Spark within Spark.
Do you actually need to read Parquet like this vs spark.read.parquet?
that's also parallel of course.
You'd otherwise be reading the files directly in your function with the
Parquet APIs.
On Tue, May 25, 2021 at 12:24 PM Eric Beabes
wrote:
> I've a use
I think it's because the bintray repo has gone away. Did you see the recent
email about the new repo for these packages?
On Wed, May 19, 2021 at 12:42 PM Wensheng Deng
wrote:
> Hi experts:
>
> I tried the example as shown on this page, and it is not working for me:
>
Why join here - just add two columns to the DataFrame directly?
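That is, something like this minimal sketch (the column names are made up):

    from pyspark.sql import functions as F

    # Derive the two new columns in place rather than joining df back to itself:
    df2 = (df
           .withColumn("total", F.col("price") * F.col("qty"))
           .withColumn("large_order", F.col("qty") > 10))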
On Mon, May 17, 2021 at 1:04 PM Andrew Melo wrote:
> Anyone have ideas about the below Q?
>
> It seems to me that given that "diamond" DAG, that spark could see
> that the rows haven't been shuffled/filtered, it could do some type
If code running on the executors needs some local file like a config file,
then it does have to be passed this way. That much is normal.
On Sat, May 15, 2021 at 1:41 AM Gourav Sengupta
wrote:
> Hi,
>
> once again lets start with the requirement. Why are you trying to pass xml
> and json files to
Yeah I don't think that's going to work - you aren't guaranteed to get 1,
2, 3, etc. I think row_number() might be what you need to generate a join
ID.
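A minimal sketch of the row_number() idea, assuming each DataFrame has some
column that gives a deterministic ordering (the ordering column names are
made up):

    from pyspark.sql import Window, functions as F

    w1 = Window.orderBy("order_col_a")
    w2 = Window.orderBy("order_col_b")
    a = df_a.withColumn("row_id", F.row_number().over(w1))
    b = df_b.withColumn("row_id", F.row_number().over(w2))
    joined = a.join(b, "row_id")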
RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You
could .zip two RDDs you get from DataFrames and manually convert the
spark-shell is not on your path. Give the full path to it.
On Tue, May 11, 2021 at 4:10 PM Talha Javed wrote:
> Hello Team!
> Hope you are doing well
>
> I have downloaded the Apache Spark version (spark-3.1.1-bin-hadoop2.7). I
> have downloaded the winutils file too from github.
> Python
It looks like the executor (JVM) stops immediately. Hard to say why - do
you have Java installed and a compatible version? I agree it could be a
py4j version problem, from that SO link.
On Sat, May 8, 2021, 1:35 PM rajat kumar wrote:
> Hi Sean/Mich,
>
> Thanks for response.
>
> That was the
SparkContext.getOrCreate(sparkConf)
>> File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 367,
>> in getOrCreate
>> SparkContext(conf=conf or SparkConf())
>> File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 133,
>> in
foreach definitely works :)
This is not a streaming question.
The error says that the JVM worker died for some reason. You'd have to look
at its logs to see why.
On Fri, May 7, 2021 at 11:03 AM Mich Talebzadeh
wrote:
> Hi,
>
> I am not convinced foreach works even in 3.1.1
> Try doing the same
There is just one copy in memory. No different than if you have two
variables pointing to the same dict.
On Mon, May 3, 2021 at 7:54 AM Bode, Meikel, NMA-CFD <
meikel.b...@bertelsmann.de> wrote:
> Hi all,
>
>
>
> when broadcasting a large dict containing several million entries to
> executors
Right, yes it did not continue. It's not in Spark.
On Fri, Apr 30, 2021 at 7:07 AM jonnysettle
wrote:
> I remeber back in 2019 reading about Cypher language for graph queries been
> introduced to spark 3.X. But I don't see it in the latest version. Has
> the
> project been abandoned (issues
From tracing the code a bit, it might do this if the POJO class has no
public constructors - does it?
On Thu, Apr 29, 2021 at 9:55 AM Rico Bergmann wrote:
> Here is the relevant generated code and the Exception stacktrace.
>
> The problem in the generated code is at line 35.
>
>
I don't know this code well, but yes seems like something is looking for
members of a companion object when there is none here. Can you show any
more of the stack trace or generated code?
On Thu, Apr 29, 2021 at 7:40 AM Rico Bergmann wrote:
> Hi all!
>
> A simplified code snippet of what my
Erm, just
https://spark.apache.org/docs/2.3.0/api/sql/index.html#approx_percentile ?
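For example, a minimal sketch (the column name and accuracy value are
illustrative):

    from pyspark.sql import functions as F

    result = df.select(F.expr(
        "approx_percentile(value, array(0.01, 0.25, 0.5, 0.75, 0.99), 10000)"))
    result.show(truncate=False)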
On Tue, Apr 27, 2021 at 3:52 AM Ivan Petrov wrote:
> Hi, I have billions, potentially dozens of billions of observations. Each
> observation is a decimal number.
> I need to calculate percentiles 1, 25, 50, 75,
You might be able to do this with multiple aggregations on avg(col("col1")
== "cat1") etc., but how about pivoting the DataFrame first so that you get
columns like "cat1" being 1 or 0? You would end up with (columns x
categories) new columns if you want to count all categories in all cols. But
then
This means you compiled with Java 11, but are running on Java < 11. It's
not related to Spark.
On Fri, Apr 23, 2021 at 10:23 AM chansonzhang
wrote:
> I just update the spark-* version in my pom.xml to match my spark and scala
> environment, and this solved the problem
>
>
>
>
> --
> Sent from:
Are you sure about the worker mem configuration? What are you setting
--memory to, and what does the worker UI think its memory allocation is?
On Sun, Apr 18, 2021 at 4:08 AM Mohamadreza Rostami <
mohamadrezarosta...@gmail.com> wrote:
> I see a bug in executer memory allocation in the standalone
apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession
>
> Thanks,
> Asmath
>
> On Mon, Apr 12, 2021 at 2:20 PM KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> I am using spark hbase connector provided by hortonwokrs. I was able to
>> run witho
Somewhere you're passing a property that expects a number, but giving it
"30s". Is it a time property somewhere that really just wants milliseconds or
something? But most time properties (all?) in Spark should accept that type
of input anyway. It really depends on what property has a problem and what is
setting
Spark itself is distributed via Maven Central primarily, so I don't think
it will be affected?
On Mon, Apr 12, 2021 at 11:22 AM Florian CASTELAIN <
florian.castel...@redlab.io> wrote:
> Hello.
>
>
>
> Bintray will shutdown on first May.
>
>
>
> I just saw that packages are hosted on Bintray
(I apologize, I totally missed that this should use GPUs because of RAPIDS.
Ignore my previous. But yeah it's more a RAPIDS question.)
On Fri, Apr 9, 2021 at 12:09 PM HaoZ wrote:
> Hi Martin,
>
> I tested the local mode in Spark on Rapids Accelerator and it works fine
> for
> me.
> The only
me to complete,please suggest
>> best way to get below requirement without using UDF
>>
>>
>> Thanks,
>>
>> Ankamma Rao B
>> --
>> *From:* Sean Owen
>> *Sent:* Friday, April 9, 2021 6:11 PM
>> *To:* ayan guha
I don't see anything in this job that would use a GPU?
On Fri, Apr 9, 2021 at 11:19 AM Martin Somers wrote:
>
> Hi Everyone !!
>
> Im trying to get on premise GPU instance of Spark 3 running on my ubuntu
> box, and I am following:
>
>
OK so it's '7 threads overwhelming off heap mem in the JVM' kind of
thing. Or running afoul of ulimits in the OS.
On Fri, Apr 9, 2021 at 11:19 AM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:
> Hi Sean!
>
> So the "coalesce" without shuffle will create a CoalescedRDD which during
Yeah I figured it's not something fundamental to the task or Spark. The
error is very odd, never seen that. Do you have a theory on what's going on
there? I don't!
On Fri, Apr 9, 2021 at 10:43 AM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:
> Hi!
>
> I looked into the code and find
This can be significantly faster with a pandas UDF, note, because you can
vectorize the operations.
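As a minimal sketch (the column names are made up; the formula is the
standard haversine with a 6371 km earth radius):

    import numpy as np
    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType

    @pandas_udf(DoubleType())
    def haversine_km(lat1: pd.Series, lon1: pd.Series,
                     lat2: pd.Series, lon2: pd.Series) -> pd.Series:
        # Whole-column numpy math instead of row-at-a-time Python.
        p = np.pi / 180.0
        a = (np.sin((lat2 - lat1) * p / 2) ** 2
             + np.cos(lat1 * p) * np.cos(lat2 * p)
             * np.sin((lon2 - lon1) * p / 2) ** 2)
        return pd.Series(2 * 6371.0 * np.arcsin(np.sqrt(a)))

    df2 = df.withColumn("dist_km", haversine_km("lat1", "lon1", "lat2", "lon2"))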
On Fri, Apr 9, 2021, 7:32 AM ayan guha wrote:
> Hi
>
> We are using a haversine distance function for this, and wrapping it in
> udf.
>
> from pyspark.sql.functions import acos, cos, sin, lit,
Right, you already established a few times that the difference is the
number of partitions. Russell answered with what is almost surely the
correct answer, that it's AQE. In toy cases it isn't always a win.
Disable it if you need to. It's not a problem per se in 3.1; AQE speeds up
more realistic
That's a very low level error from the JVM. Any chance you are
misconfiguring the executor size? like to 10MB instead of 10GB, that kind
of thing. Trying to think of why the JVM would have very little memory to
operate.
An app running out of mem would not look like this.
On Thu, Apr 8, 2021 at
I think this question was asked just a week ago? same company and setup.
https://mail-archives.apache.org/mod_mbox/spark-user/202104.mbox/%3CLNXP123MB2604758548BE38E8D3F369EC8A7B9%40LNXP123MB2604.GBRP123.PROD.OUTLOOK.COM%3E
On Wed, Apr 7, 2021 at 11:17 AM SRITHALAM, ANUPAMA (Risk Value Stream)
You shouldn't be modifying your cluster install. You may at this point have
conflicting, excess JARs in there somewhere. I'd start it over if you can.
On Wed, Apr 7, 2021 at 7:15 AM Gabor Somogyi
wrote:
> Not sure what you mean not working. You've added 3.1.1 to packages which
> uses:
> * 2.6.0
I noted that Apache Mesos is moving to the attic, so won't be actively
developed soon:
https://lists.apache.org/thread.html/rab2a820507f7c846e54a847398ab20f47698ec5bce0c8e182bfe51ba%40%3Cdev.mesos.apache.org%3E
That doesn't mean people will stop using it as a Spark resource manager
soon. But it
de.
>>>
>>>
>>> Running on local node with
>>>
>>>
>>> spark-submit --master local[4] --conf
>>> spark.pyspark.virtualenv.enabled=true --conf
>>> spark.pyspark.virtualenv.type=native --conf
>>> spark.pyspark.virtuale
Hard to say without a lot more info, but 76.5K tasks is very large. How big
are the tasks / how long do they take? If very short, you should
repartition down.
Do you end up with 800 executors? If so, why 2 per machine? That generally
is a loss at this scale of worker. I'm confused because you have
You may be compiling your app against 3.0.1 JARs but submitting to 3.1.1.
You do not in general modify the Spark libs. You need to package libs like
this with your app at the correct version.
On Tue, Apr 6, 2021 at 6:42 AM Mich Talebzadeh
wrote:
> Thanks Gabor.
>
> All nodes are running Spark
Yes that's a great option when the modeling process itself doesn't really
need Spark. You can use any old modeling tool you want and get the
parallelism in tuning via hyperopt's Spark integration.
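A minimal sketch of that pattern (the objective and search space are
placeholders; hyperopt has to be installed on the cluster):

    from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK

    def objective(params):
        # Train a small single-node model (sklearn, xgboost, ...) here and
        # return its validation loss.
        loss = (params["x"] - 3.0) ** 2
        return {"loss": loss, "status": STATUS_OK}

    space = {"x": hp.uniform("x", -10, 10)}
    best = fmin(objective, space, algo=tpe.suggest, max_evals=50,
                trials=SparkTrials(parallelism=8))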
On Thu, Apr 1, 2021 at 10:50 AM Williams, David (Risk Value Stream)
wrote:
> Classification:
Sure, just open a pull request?
On Mon, Mar 29, 2021 at 10:37 AM Josh Herzberg wrote:
> Hi,
>
> I'd like to suggest this change to the PySpark code. I haven't contributed
> before so https://spark.apache.org/contributing.html suggested emailing
> here first.
>
> In the error raised here
>
Views are simply bookkeeping about how the query is executed, like a
DataFrame. There is no data or result to store; it's just how to run a
query. The views exist on the driver. The query executes like any other, on
the cluster.
On Fri, Mar 26, 2021 at 3:38 AM Mich Talebzadeh
wrote:
>
> As a
get that working in distributed, will we get
> benefits similar to spark ML?
>
>
>
> Best Regards,
>
> Dave Williams
>
>
>
> *From:* Sean Owen
> *Sent:* 26 March 2021 13:20
> *To:* Williams, David (Risk Value Stream)
>
> *Cc:* user@spark.apache.org
> Many thanks for your response Sean.
>
>
>
> Question - why spark is overkill for this and why is sklearn is faster
> please? It’s the same algorithm right?
>
>
>
> Thanks again,
>
> Dave Williams
>
>
>
> *From:* Sean Owen
> *Sent:*
The problem is that both of these are not sharing a SparkContext as far as
I can see, so there is no way to share the object across them, let alone
languages.
You can of course write the data from Java, read it from Python.
In some hosted Spark products, you can access the same session from two
Spark is overkill for this problem; use sklearn.
But I'd suspect that you are using just 1 partition for such a small data
set, and getting no parallelism from Spark.
You could repartition your input to many more partitions, but it's unlikely
to get much faster than in-core sklearn for this task.
On Thu, Mar
> On Wed, 24 Mar 2021 at 12:40, Sean Owen wrote:
>
>>
No need to do that. Reading the header with Spark automatically is trivial.
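For example, a minimal sketch assuming an existing SparkSession `spark` (the
path is illustrative):

    df = (spark.read
          .option("header", "true")       # use the first line as column names
          .option("inferSchema", "true")
          .csv("/data/csvfile.gz"))       # Spark reads gzipped CSV directly
    df.printSchema()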
On Wed, Mar 24, 2021 at 5:25 AM Mich Talebzadeh
wrote:
> If it is a csv then it is a flat file somewhere in a directory I guess.
>
> Get the header out by doing
>
> */usr/bin/zcat csvfile.gz |head -n 1*
> Title
It would split 10GB of CSV into multiple partitions by default, unless it's
gzipped. Something else is going on here.
On Tue, Mar 23, 2021 at 10:04 PM "Yuri Oleynikov (יורי אולייניקוב)" <
yur...@gmail.com> wrote:
> I’m not Spark core developer and do not want to confuse you but it seems
I don't think that would change partitioning? try .repartition(). It isn't
necessary to write it out let alone in Avro.
On Tue, Mar 23, 2021 at 8:45 PM "Yuri Oleynikov (יורי אולייניקוב)" <
yur...@gmail.com> wrote:
> Hi, Mohammed
> I think that the reason that only one executor is running
You need to do something with the result of repartition. You haven't
changed textDF.
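In other words, something like this sketch (the path and partition count are
illustrative; textDF is the name used in the thread):

    textDF = spark.read.text("hdfs:///path/to/big/file")
    textDF = textDF.repartition(100)     # reassign; DataFrames are immutable
    print(textDF.rdd.getNumPartitions())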
On Mon, Mar 22, 2021, 12:15 PM KhajaAsmath Mohammed
wrote:
> Hi,
>
> I have a use case where there are large files in hdfs.
>
> Size of the file is 3 GB.
>
> It is an existing code in production and I am trying
I believe you can "SELECT version()" in Spark SQL to see the build version.
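For example, from an existing session (the version() SQL function is
available in recent Spark versions):

    spark.sql("SELECT version()").show(truncate=False)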
On Sun, Mar 21, 2021 at 4:41 AM Mich Talebzadeh
wrote:
> Thanks for the detailed info.
>
> I was hoping that one can find a simpler answer to the Spark version than
> doing forensic examination on base code so to speak.
That looks like you didn't compile with Java 11 actually. How did you try
to do so?
On Tue, Mar 16, 2021, 7:50 AM kaki mahesh raja
wrote:
> HI All,
>
> We have compiled spark with java 11 ("11.0.9.1") and when testing the
> thrift
> server we are seeing that insert query from operator using
That should not be the case. See
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
Maybe you are calling .foreach on some Scala object inadvertently.
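A minimal sketch of foreachBatch, assuming `streaming_df` is a streaming
DataFrame and the output path is illustrative:

    def write_batch(batch_df, batch_id):
        # batch_df is an ordinary DataFrame, so any batch sink works here.
        batch_df.write.mode("append").parquet("/tmp/stream_out")

    query = (streaming_df.writeStream
             .foreachBatch(write_batch)
             .start())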
On Tue, Mar 9, 2021 at 4:41 PM Mich Talebzadeh
wrote:
> Hi,
>
> When I use
You can also group by the key in the transformation on each batch. But yes
that's faster/easier if it's already partitioned that way.
On Tue, Mar 9, 2021 at 7:30 AM Ali Gouta wrote:
> Do not know Kenesis, but it looks like it works like kafka. Your producer
> should implement a paritionner that
Yep, you can never use Spark inside Spark.
You could run N jobs in parallel from the driver using Spark, however.
On Mon, Mar 8, 2021 at 3:14 PM Mich Talebzadeh
wrote:
>
> In structured streaming with pySpark, I need to do some work on the row
> *foreach(process_row)*
>
> below
>
>
> *def
It's there in the error: No space left on device
You ran out of disk space (local disk) on one of your machines.
On Mon, Mar 8, 2021 at 2:02 AM Sachit Murarka
wrote:
> Hi All,
>
> I am getting the following error in my spark job.
>
> Can someone please have a look ?
>
>
I think you're still asking about GCP and Dataproc, and that's really
nothing to do with Spark itself.
Whatever issues you are having concern Dataproc and how it's run and
possibly customizations in Dataproc.
3.1.1-RC2 is not a release, but, also nothing meaningfully changed between
it and the
I don't have any good answer here, but, I seem to recall that this is
because of SQL semantics, which follows column ordering not naming when
performing operations like this. It may well be as intended.
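If the surprise was a positional operation (an assumption - the original
message is truncated here), the usual illustration is union, which pairs
columns by position per SQL semantics, versus unionByName:

    a = spark.createDataFrame([(1, "x")], ["id", "val"])
    b = spark.createDataFrame([("y", 2)], ["val", "id"])
    a.union(b).show()          # pairs columns by position, following SQL semantics
    a.unionByName(b).show()    # pairs columns by name instead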
On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic <
oldrich.vla...@datasentics.com> wrote:
> Hi,
That statement is still accurate - it is saying the release will be 3.1.1,
not 3.1.0.
In any event, 3.1.1 is rolling out as we speak - already in Maven and
binaries are up and the website changes are being merged.
On Tue, Mar 2, 2021 at 9:10 AM Mich Talebzadeh
wrote:
>
> Can someone please
Yeah this is a good question. It is certainly to do with executing within
the same JVM, but even I'd have to dig into the code to explain why the
spark-sql version operates differently, as that also appears to be local.
To be clear this 'shouldn't' work, just happens to not fail in local
That looks to me like you have two different versions of Spark in use
somewhere here. Like the cluster and driver versions aren't quite the same.
Check your classpaths?
On Fri, Feb 26, 2021 at 2:53 AM Bode, Meikel, NMA-CFD <
meikel.b...@bertelsmann.de> wrote:
> Hi All,
>
>
>
> After changing to
I'll take a look. At a glance - is it converging? You might turn down the
tolerance to check.
Also, what does scikit-learn say on the same data? We can continue on the
JIRA.
On Mon, Feb 22, 2021 at 5:42 PM Yakov Kerzhner wrote:
> I have written up a JIRA, and there is a gist attached that has code
Another RC is starting imminently, which looks pretty good. If it succeeds,
probably next week.
It will support Scala 2.12, but I believe a Scala 2.13 build is only coming
in 3.2.0.
On Sat, Feb 20, 2021 at 1:54 PM Bulldog20630405
wrote:
>
> what is the expected ballpark release date of spark
You won't be able to use it in Python if it is implemented in Java - it needs
a Python wrapper too.
On Mon, Feb 15, 2021, 11:29 PM HARSH TAKKAR wrote:
> Hi ,
>
> I have created a custom Estimator in scala, which i can use successfully
> by creating a pipeline model in Java and scala, But when i
You probably don't want swapping in any environment. Some tasks will grind
to a halt under mem pressure rather than just fail quickly. You would want
to simply provision more memory.
On Tue, Feb 16, 2021, 7:57 AM Jahar Tyagi wrote:
> Hi,
>
> We have recently migrated from Spark 2.4.4 to Spark