Hi all,
Is anyone here interested in adding the ability to request GPUs to Spark's
client (i.e., spark-submit)? As of now, YARN 3.0's resource manager has the
ability to schedule GPUs as resources via cgroups, but the Spark client
lacks the ability to request them.
The ability to guarantee
Hi all,
We are running into a scenario where a Structured Streaming job exits
after a while, specifically when the Kafka topic is not receiving any data.
From the job logs, I see connections.max.idle.ms = 54. Does that
mean the Spark readStream will close when it does not get data
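For reference (not from the original message): consumer properties can be tuned by prefixing them with "kafka." on the source options; a minimal sketch, with broker and topic names made up:

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")    // hypothetical broker
  .option("subscribe", "my-topic")                    // hypothetical topic
  .option("kafka.connections.max.idle.ms", "540000")  // how long idle connections stay open
  .load()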
Hi all
I am now using Structured Streaming in Continuous Processing mode and I have
run into an odd problem.
My code is very simple and closely follows the sample code in the
documentation.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing
When
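For context, the continuous trigger from that guide looks roughly like this (sink and checkpoint path are placeholders):

import org.apache.spark.sql.streaming.Trigger

val query = df.writeStream
  .format("console")                                  // placeholder sink
  .option("checkpointLocation", "/tmp/checkpoints")   // placeholder path
  .trigger(Trigger.Continuous("1 second"))            // checkpoint interval, not a batch interval
  .start()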
Do you have a code sample and a detailed error message/exception to show?
From: Martin Engen
Date: Tuesday, May 15, 2018 at 9:24 AM
To: "user@spark.apache.org"
Subject: Structured Streaming, Reading and Updating a variable
Hello,
I'm working
Thanks Bryan for the answer
Sent from my iPhone
> On 15 May 2018 at 19:06, Bryan Cutler wrote:
>
> Hi Xavier,
>
> Regarding Arrow usage in Spark, using Arrow format to transfer data between
> Python and Java has been the focus so far because this area stood to benefit
Hi,
I am trying to test my Spark app, which is implemented in Java. In the app I
load a LogisticRegressionModel that I have already created, trained, and
tested using a portion of the training data.
Now, when I test the app with another set of data and try to predict,
I get the error below when
Hi Xavier,
Regarding Arrow usage in Spark, using Arrow format to transfer data between
Python and Java has been the focus so far because this area stood to
benefit the most. It's possible that the scope of Arrow usage could broaden in
the future, but that still needs to be discussed.
Hi,
Is there a way to load a model saved using the sklearn library in PySpark / Scala
Spark for prediction?
Thanks
Hi,
So, what is the workaround? Should I create multiple indexers (one for each
column), then create a pipeline and set its stages to include all the
StringIndexers? (A sketch of that approach is below.)
I am using 2.2.1 as I cannot move to 2.3.0. It looks like
OneHotEncoderEstimator is broken; please see my email sent today with
subject:
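For what it's worth, a sketch of the one-indexer-per-column workaround (column names invented):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer

val cols = Array("colA", "colB", "colC")   // hypothetical categorical columns
val indexers: Array[PipelineStage] = cols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
}
val indexed = new Pipeline().setStages(indexers).fit(df).transform(df)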
Isn't _* just varargs expansion? So you should be able to use a Java array?
On Tue, May 15, 2018, 06:29 onmstester onmstester
wrote:
> I could not find how to pass a list to isin() filter in java, something
> like this could be done with scala:
>
> val ids = Array(1,2)
>
Hi Jacek,
If we use RDDs instead of DataFrames, can we accomplish the same thing? I mean, is
joining between RDDs allowed in Spark Streaming?
Best,
Ravi
On Sun, May 13, 2018 at 11:18 AM Jacek Laskowski wrote:
> Hi,
>
> The exception message should be self-explanatory and says that
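Not an answer from the thread, but a rough sketch of joining each micro-batch RDD of a DStream with a static RDD via transform (names are illustrative):

// stream: DStream[(Int, String)], staticRdd: RDD[(Int, String)], keyed the same way
val joined = stream.transform { rdd => rdd.join(staticRdd) }
joined.print()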
You can use a windowed aggregation for this.
On Tue, May 15, 2018, 09:23 Martin Engen wrote:
> Hello,
>
>
>
> I'm working with Structured Streaming, and I need a method of keeping a
> running average based on last 24hours of data.
>
> To help with this, I can use
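A minimal sketch of such a windowed aggregation, assuming an event-time column called "timestamp" and a numeric column called "value":

import org.apache.spark.sql.functions._

val avg24h = events                                       // events: a streaming DataFrame
  .withWatermark("timestamp", "25 hours")                 // allow some lateness beyond the window
  .groupBy(window(col("timestamp"), "24 hours", "1 hour"))
  .agg(avg(col("value")).as("avg_24h"))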
Hi,
I get the error below when I try to run the OneHotEncoderEstimator example.
https://github.com/apache/spark/blob/b74366481cc87490adf4e69d26389ec737548c15/examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java#L67
which is this line of code:
I am trying to register a UDTF, not a UDF, so I don't think this applies.
Mick
Hello,
I'm working with Structured Streaming, and I need a way to keep a running
average based on the last 24 hours of data.
To help with this, I can use exponential smoothing, which means I really only
need to carry one value from the previous calculation into the new one, and update this
variable
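One way to carry a single value like that across triggers (not necessarily what this thread settled on) is arbitrary stateful processing with mapGroupsWithState; the field names and smoothing factor below are made up:

import spark.implicits._   // encoders for the case classes and the Double state
import org.apache.spark.sql.streaming.GroupStateTimeout

case class Reading(sensorId: String, value: Double)
case class Smoothed(sensorId: String, avg: Double)

val alpha = 0.1            // hypothetical smoothing factor

val smoothed = readings    // readings: Dataset[Reading] from a streaming source
  .groupByKey(_.sensorId)
  .mapGroupsWithState[Double, Smoothed](GroupStateTimeout.NoTimeout) {
    (id, values, state) =>
      var avg = state.getOption.getOrElse(Double.NaN)
      values.foreach { r =>
        avg = if (avg.isNaN) r.value else alpha * r.value + (1 - alpha) * avg
      }
      state.update(avg)    // the only value kept between micro-batches
      Smoothed(id, avg)
  }

Note that mapGroupsWithState requires the query to run in update output mode.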
You can also register UDFs by using the built-in udf function (import
org.apache.spark.sql.functions._).
Something along the lines of
val flattenUdf = udf(udfutils.flatten)
where udfutils is another object and flatten is a method in it.
On Tue, May 15, 2018 at 3:27 AM Mick Davies
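To make that concrete, a small self-contained sketch (the helper object, column and table names are hypothetical):

import org.apache.spark.sql.functions.{col, udf}

object udfutils {
  def flatten(xs: Seq[String]): String = xs.mkString(" ")   // example implementation
}

val flattenUdf = udf(udfutils.flatten _)

// use it directly on a DataFrame column ...
val withFlat = df.withColumn("flat", flattenUdf(col("tokens")))

// ... or register it by name so it can also be used from SQL
spark.udf.register("flatten", udfutils.flatten _)
spark.sql("SELECT flatten(tokens) FROM my_table")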
I have a streaming DataFrame where I insert a UUID into every row, then join it
with a static DataFrame (after which the UUID column is no longer unique), then
group by UUID and do a simple aggregation.
So I know all rows with the same UUID will be in the same micro-batch, guaranteed,
correct? How do I express it
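To pin down the pattern being described, a rough sketch (the join key and the UUID generator are my own placeholders, not from the thread):

import java.util.UUID
import org.apache.spark.sql.functions._

val genUuid = udf(() => UUID.randomUUID().toString)

val tagged     = streamingDf.withColumn("uuid", genUuid())        // one UUID per input row
val joined     = tagged.join(staticDf, Seq("key"))                // "key" is a hypothetical join column
val aggregated = joined.groupBy("uuid").agg(count(lit(1)).as("n"))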
I could not find how to pass a list to the isin() filter in Java; something like
this can be done with Scala:
val ids = Array(1,2)
df.filter(df("id").isin(ids:_*)).show
But in Java, everything that converts a Java list to a Scala Seq fails with an
unsupported literal type exception:
Hi Gourav,
I don't think you can register UDTFs via sparkSession.udf.register
Mick
Hi All,
I am trying to implement a custom sink, and I have a few questions, mainly about
output modes.
1) How does Spark let the sink know that a new row is an update of an
existing row? Does it look at all the values of all columns of the new row
and an existing row for an equality match, or does it
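For reference, a bare-bones custom sink in Spark 2.x can be sketched with ForeachWriter. As far as I understand, in update output mode Spark re-emits the whole changed row rather than a column-level diff, so the sink usually upserts keyed on the grouping columns:

import org.apache.spark.sql.{ForeachWriter, Row}

class MyRowWriter extends ForeachWriter[Row] {
  def open(partitionId: Long, epochId: Long): Boolean = true   // open a connection; return false to skip the partition
  def process(row: Row): Unit = {
    // each changed aggregate arrives here as a full row; upsert it keyed on the grouping columns
  }
  def close(errorOrNull: Throwable): Unit = ()                 // release the connection
}

val query = aggregatedDf.writeStream
  .outputMode("update")
  .foreach(new MyRowWriter)
  .start()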
3,000 filters don't look like something reasonable. That is very difficult to
test and verify, and impossible to maintain.
Could it be that your filters are really another table that you should join with?
(A sketch of that idea is below.)
The example is a bit too artificial to understand the underlying business
case. Can you
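To illustrate the join idea with a toy sketch (column names invented): put the (key1, key2) pairs of interest into their own DataFrame and join, instead of chaining thousands of filter expressions.

import spark.implicits._
import org.apache.spark.sql.functions.broadcast

// wantedPairs: Seq[(String, String)] holding the key1/key2 combinations of interest
val filterDf = wantedPairs.toDF("key1", "key2")

val matched = records.join(broadcast(filterDf), Seq("key1", "key2"))   // inner join keeps only wanted pairs
val counts  = matched.groupBy("key1", "key2").count()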
From the information you provided I would tackle this as a batch problem,
because that way you have access to more sophisticated techniques and you
have more flexibility (maybe HDFS and a Spark job, but also think about a
datastore offering good indexes for the kind of data types and values you
When I terminate a Spark Streaming application and restart it, it always gets
stuck at this step:
>
> Revoking previously assigned partitions [] for group [mygroup]
> (Re-)joining group [mygroup]
If I use a new group id it works fine, but I may lose the data
from the last time I read the
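If this is the DStream-based Kafka integration (spark-streaming-kafka-0-10), one way to keep the existing group id and still resume from where processing stopped is to commit offsets back to Kafka after each batch; a rough sketch, assuming that integration:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)   // commit after processing succeeds
}

This does not explain why the consumer gets stuck re-joining the group, but it addresses the concern about losing data when switching to a new group id.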
Hi,
I am not familiar with ATNConfigSet, but here are some thoughts that might help.
How many distinct key1 (resp. key2) values do you have? Are these values
reasonably stable over time?
Are these records ingested in real time, or are they loaded from a datastore?
In the latter case the DB might be able
Multi-column support for StringIndexer didn't make it into Spark 2.3.0.
The PR is still in progress, I think - it should be available in 2.4.0.
On Mon, 14 May 2018 at 22:32, Mina Aslani wrote:
> Please take a look at the api doc:
>
Hi,
I need to run some queries over a huge number of input records. The input rate
is 100K records/second.
A record is like (key1, key2, value) and the application should report occurrences
of key1 == something && key2 == somethingElse.
The problem is that there are too many filters in my query: more than