Can I maintain backwards compatibility with Spark 2.4 while upgrading to Spark
3.2? Or is there any documentation of those changes, or can someone point me in
the right direction?
Regards
Ben
# python: write a DataFrame with a list-valued column to parquet
import pandas as pd
a = pd.DataFrame([[1, [2.3, 1.2]]], columns=['a', 'b'])
a.to_parquet('a.parquet')

# pyspark: read the file back (the error below appears when the result is displayed)
d2 = spark.read.parquet('a.parquet')

This returns the error:
An error was encountered: An error occurred while calling o277.showString. :
org.apache.spark.SparkException: Job
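As a first check (a minimal sketch, assuming pandas/pyarrow are available in the
same environment), it can help to confirm that the file is readable outside
Spark, which narrows the problem down to Spark's parquet reader:

# python: read the file back with pandas to verify it is valid parquet
import pandas as pd
df_check = pd.read_parquet('a.parquet')
print(df_check.dtypes)   # column 'b' should come back as an object column of lists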
rouped(100).toList.par.map(groupedParts =>
> spark.read.parquet(groupedParts: _*))
>
> val finalDF = dfs.seq.grouped(100).toList.par.map(dfgroup =>
> dfgroup.reduce(_ union _)).reduce(_ union _).coalesce(2000)
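For readers following along in PySpark, a rough equivalent of the grouped
read-and-union pattern quoted above might look like the sketch below (the paths,
group size, and thread count are hypothetical, and it assumes every file shares
one schema):

from functools import reduce
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import DataFrame

paths = ["s3://bucket/table/part-{:05d}".format(i) for i in range(1000)]  # hypothetical
groups = [paths[i:i + 100] for i in range(0, len(paths), 100)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # each task reads one group of paths as a single DataFrame
    dfs = list(pool.map(lambda g: spark.read.parquet(*g), groups))

final_df = reduce(DataFrame.union, dfs).coalesce(2000)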
>
>
>
> *From: *Ben Kaylor
> *Date: *Tuesday, March 16, 2021 at 3:2
:
> P.S.: 3. If fast updates are required, one way would be capturing S3
> events & putting the paths/modifications dates/etc of the paths into
> DynamoDB/your DB of choice.
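As an illustration of that suggestion (a hedged sketch only; the table name
"s3_object_index" and the Lambda-style wiring are assumptions, not something
from the thread), capturing S3 event notifications and recording the modified
keys could look roughly like this with boto3:

import boto3

table = boto3.resource("dynamodb").Table("s3_object_index")  # hypothetical table

def handler(event, context):
    # invoked by an S3 event notification, e.g. through AWS Lambda
    for record in event.get("Records", []):
        s3 = record["s3"]
        table.put_item(Item={
            "bucket": s3["bucket"]["name"],
            "key": s3["object"]["key"],
            "event_time": record["eventTime"],
            "event_name": record["eventName"],
        })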
>
>
>
> *From:* Boris Litvak
> *Sent:* Tuesday, 16 March 2021 9:03
> *To:* Ben Kaylor ;
Not sure of the answer to this, but I am solving similar issues, so I'm looking
for additional feedback on how to do this.
My thought is that if it can't be done via Spark and S3 boto commands, then have
apps self-report those changes: instead of having just mappers discovering
the keys, you have services self
I can also recreate with the very latest master branch (3.1.0-SNAPSHOT) if I
compile it locally
Thanks for that. I have played with this a bit more after your feedback and
found:
I can only recreate the problem with Python 3.6+. If I switch between Python
2.7, Python 3.6 and Python 3.7, the problem occurs in the Python 3.6 and 3.7
cases but not in Python 2.7.
- I have used mini
Hi,
I am having an issue that looks like a potentially serious bug with Spark
2.4.3, as it impacts data accuracy. I have searched the Spark JIRA and
mailing lists as best I can and cannot find any reference to anyone else having
this issue. I am not sure if this would be suitable for raising as a bug i
I've noticed that DataSet.sqlContext is public in Scala but the equivalent
(DataFrame._sc) in PySpark is named as if it should be treated as private.
Is this intentional? If so, what's the rationale? If not, then it feels
like a bug and DataFrame should have some form of public access back to th
rver as root and run a job on it from a user account.
Thanks,
Ben
R_SERVICEACCOUNTNAME))
  .setConf(Constants.SPARK_DRIVER_IAM_ROLE_ARN_KEY,
    config.get[String]("k8s.submitter.spark.driver.iam.role.arn"))
  .setConf(Constants.SPARK_EXECUTOR_IAM_ROLE_ARN_KEY,
    config.get[String]("k8s.submitter.spark.executor.iam.role.arn"))
  .addAppA
Sounds like the same root cause as SPARK-14948 or SPARK-10925.
A workaround is to "clone" df3 like this:
val df3clone = df3.toDF(df.schema.fieldNames:_*)
Then use df3clone in place of df3 in the second join.
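For PySpark users hitting the same issue on a re-joined DataFrame, a hedged
sketch of the equivalent trick (df1, df3 and the join key "id" are placeholders
here) would be:

# "clone" df3 so the second join sees a fresh plan with the same column names
df3_clone = df3.toDF(*df3.columns)
result = df1.join(df3_clone, "id")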
On Wed, Jul 11, 2018 at 2:52 PM Nirav Patel wrote:
> I am trying to join df1 with
s
>>>>>>>>>> generally useful: https://issues.apache.org/jira/browse/SPARK-19031
>>>>>>>>>>
>>>>>>>>>> In the mean time you could try implementing your own Source
Hi Folks,
I have a Spark job reading a CSV file into a dataframe. I register that
dataframe as a tempTable, then I write that dataframe/tempTable to a Hive
external table (using parquet format for storage).
I'm using this kind of command:
hiveContext.sql("INSERT INTO TABLE t PARTITION(statPa
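The command above is truncated in the archive; a hedged reconstruction of the
general pattern follows (the table name t, the partition column stat_part, the
input path and the column names are all assumptions, not from the thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# dynamic partition inserts into Hive tables usually need these settings
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

df = spark.read.csv("/path/to/input.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("tmp_input")

# the partition column goes last in the SELECT for dynamic partitioning
spark.sql("""
  INSERT INTO TABLE t PARTITION (stat_part)
  SELECT col_a, col_b, stat_part FROM tmp_input
""")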
in this case it
would fall back to regenerating from the parent RDD, rather than
providing the wrong answer.
Is this the expected behaviour? It seems a little difficult to work
with in practice.
Cheers,
Ben
n mappings_bc[column]:
key = unicode(x)
return Vectors.sparse(len(mappings_bc[column]), [mappings_bc[column][key]], [1.0])
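The snippet above is cut off at the top; a self-contained sketch of the same
broadcast-mapping idea (the column name "city", the vector size, and the
handling of unseen values are assumptions here, not from the thread):

from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

values = [r[0] for r in df.select("city").distinct().collect()]
mapping = {v: i for i, v in enumerate(values)}
mapping_bc = spark.sparkContext.broadcast(mapping)

def one_hot(value):
    m = mapping_bc.value
    size = len(m) + 1                 # reserve the last index for unseen values
    return Vectors.sparse(size, [m.get(value, size - 1)], [1.0])

df = df.withColumn("city_vec", udf(one_hot, VectorUDT())("city"))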
Ben
> On Aug 19, 2016, at 11:34 PM, Davies Liu wrote:
>
> The OOM happens in the driver, so you may also need more memory for the driver.
>
> On Fri, Aug 19, 2016 at
d to see how it compares.
>
> By the way, the feature size you select for the hasher should be a power of 2
> (e.g. 2**24 to 2**26 may be worth trying) to ensure the feature indexes are
> evenly distributed (see the section on HashingTF under
> http://spark.apache.org/docs/latest/ml
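For concreteness, a minimal PySpark sketch of the hashing approach (the column
names are placeholders; 2**24 just follows the power-of-two advice above):

from pyspark.ml.feature import HashingTF
from pyspark.sql.functions import array, col

# HashingTF expects an array column, so wrap the single categorical value
df2 = df.withColumn("cat_arr", array(col("category").cast("string")))
hasher = HashingTF(inputCol="cat_arr", outputCol="features", numFeatures=2**24)
hashed = hasher.transform(df2)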
.
> On Aug 11, 2016, at 9:48 PM, Gourav Sengupta
> wrote:
>
> Hi Ben,
>
> and that will take care of skewed data?
>
> Gourav
>
> On Thu, Aug 11, 2016 at 8:41 PM, Ben Teeuwen wrote:
> When you read both ‘a’ and ‘b', c
logging.info("{} min: {} max: {}".format(c, min(mappings[c].values()), max(mappings[c].values())))  # some logging to confirm the indexes
logging.info("Missing value = {}".format(mappings[c]['missing']))
return max_index, mappings
I’d love to see the StringIndexe
When you read both ‘a’ and ‘b', can you try repartitioning both by column ‘id’?
If you .cache() and .count() to force a shuffle, it'll push the records that
will be joined to the same executors.
So:
a = spark.read.parquet('path_to_table_a').repartition('id').cache()
a.count()
b = spark.read.pa
, right? So I’m curious what to use
instead.
> On Aug 4, 2016, at 3:54 PM, Nicholas Chammas
> wrote:
>
> Have you looked at pyspark.sql.functions.udf and the associated examples?
> On Thu, Aug 4, 2016 at 9:10 AM, Ben Teeuwen wrote:
> Hi,
>
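For reference, a minimal example of pyspark.sql.functions.udf (the column and
function names below are made up for illustration):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
df = df.withColumn("name_upper", upper_udf("name"))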
-2 status:
After 40 minutes or so, with no activity in the application master, it dies.
Ben
> On Aug 4, 2016, at 12:14 PM, Nick Pentreath wrote:
>
> Hi Ben
>
> Perhaps with this size cardinality it is worth looking at feature hashing for
> your probl
ion? I only see registerFunction in the deprecated
sqlContext at http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html.
As the ‘spark’ object unifies hiveContext and sqlContext, what is the new way
to go?
Ben
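If it helps: in Spark 2.x, UDF registration hangs off the unified 'spark'
session object. A minimal sketch (the function name is made up, and the exact
entry point moved around a little across early 2.x releases):

from pyspark.sql.types import IntegerType

spark.udf.register("str_len", lambda s: len(s), IntegerType())
spark.sql("SELECT str_len('hello')").show()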
at
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
at
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecu
Related question: what are good profiling tools other than watching the
application master while the code runs?
Are there things that can be logged during the run? If I have say 2 ways of
accomplishing the same thing, and I want to learn about the time/memory/general
resource blocking p
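One lightweight option (a sketch, with a made-up workload) is to tag each
variant with a job group so it is easy to spot in the application master UI,
and log wall-clock time around it:

import time

spark.sparkContext.setJobGroup("variant_a", "first way of computing the result")
start = time.time()
result_a = df.groupBy("key").count().collect()   # hypothetical workload
print("variant_a took {:.1f}s".format(time.time() - start))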
Did you check the executor logs to see whether the Kafka offsets were pulled in
evenly over the 4 executors?
I recall a similar situation with such uneven balancing from a kafka stream,
and ended up raising the amount of resources (RAM and cores). Then it nicely
balanced out. I don’t understand t
Hi,
I want to one hot encode a column containing 56 million distinct values. My
dataset is 800m rows + 17 columns.
I first apply a StringIndexer, but it already breaks there, giving an OOM Java
heap space error.
I launch my app on YARN with:
/opt/spark/2.0.0/bin/spark-shell --executor-memory 10G
ate-statesnapshots-to-database-for-later-resume-of-spark-st>.
The context is to be able to pause and resume a streaming app while not losing
the state information. So I want to save and reload (initialize) the state
snapshot. Has any of you already been able to do this?
Thanks,
Ben
n VERY RELUCTANT to edit the spark jars
>
> I know this point has been discussed here before but I do not see a clean
> answer
>
--
Best regards,
Demi Ben-Ari <http://il.linkedin.com/in/demibenari>
Entrepreneur @ Stealth Mode Startup
Twitter: @demibenari <https://twitter.com/demibenari>
Hello,
I tried to use Spark SQL to analyse JSON data streams within a standalone
application.
Here is the code snippet that receives the streaming data:
final JavaReceiverInputDStream lines =
    streamCtx.socketTextStream("localhost", Integer.parseInt(args[0]),
        StorageLevel.MEMORY_AND_DISK_SER_2());
Hi,
I installed a standalone Spark cluster with two workers. I developed a Java
application that uses the Maven dependency of Spark (same version as the
Spark cluster).
In my Spark jobs class I have only two methods, considered as two different
jobs:
the first one is the example of Spark word coun
Hi,
I forgot to mention that I'm using the 1.5.1 version.
Regards,
On Thu, Feb 18, 2016 at 4:20 PM, Mehdi Ben Haj Abbes
wrote:
> Hi folks,
>
> I have a DataFrame with, let's say, this schema:
> -dealId,
> -ptf,
> -ts
> from it I derive another dataframe (lets call i
he original columns I get what is expected as the count, which is 0.
The same goes for isin.
Any help will be more than appreciated.
Best regards,
--
Mehdi BEN HAJ ABBES
Terry
>
> On Fri, Jan 29, 2016 at 5:45 PM, Mehdi Ben Haj Abbes <
> mehdi.ab...@gmail.com> wrote:
>
>> Hi folks,
>>
>> I have a streaming job running for more than 24 hours. It seems that
>> there is a limit on the number of the batches displayed in the Stre
Saturday. I will have the batches that ran in the previous 24 hours, and today
it was like only the previous 3 hours.
Any help will be very appreciated.
--
Mehdi BEN HAJ ABBES
I'm curious to see the feedback others will provide. My impression is the
only way to get Spark to give up resources while it is idle would be to use
the preemption feature of the scheduler you're using in YARN. When another
user comes along the scheduler would preempt one or more Spark executors
Oops - I meant while it is *busy* when I said while it is *idle*.
On Tue, Dec 15, 2015 at 11:35 AM Ben Roling wrote:
> I'm curious to see the feedback others will provide. My impression is the
> only way to get Spark to give up resources while it is idle would be to use
>
Hi,
After reading some documentation about Spark and Ignite,
I am wondering if shared RDDs from Ignite can be used to share data in
memory without any duplication between multiple Spark jobs.
Running on Mesos I can colocate them, but will this be enough to avoid
memory duplication or not?
I am also confused by Tachyon usage compared to Apache Ignite; they seem to
overlap at some points.
Thanks for your help
Regards
Ben
time
-- some graph problems (
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
This is not in any way an attack on Spark. It's an amazing tool that does
its job very well. I'm just curious where it starts breaking down. Let me
know if you have any experiences!
Thanks very much,
Ben
Hi Maximo —
This is a relatively naive answer, but I would consider structuring the RDD
into a DataFrame, then saving the 'splits' using something like
df.write.partitionBy('client').parquet(hdfs_path). You could then
read a DataFrame from each resulting parquet directory and do your
per-
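In current PySpark terms, a hedged sketch of that suggestion (the paths and the
'client' column value are placeholders):

# write one parquet directory per client
df.write.partitionBy("client").parquet("hdfs:///tmp/splits")

# later, read back a single client's slice for per-client processing
one_client = spark.read.parquet("hdfs:///tmp/splits/client=acme")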
Thanks very much Kevin.
I didn't recognize from the title (Maven issues with 1.5-RC) that this
was a 1.5 build problem.
Ben
On 8/30/15 10:28 PM, Kevin Jung [via Apache Spark User List] wrote:
> I expect it because the versions are not in the range defined in pom.xml.
> You sho
I have Spark 1.4 deployed on AWS EMR, but the SparkR read.df method cannot load
data from AWS S3.
1) "read.df" error message
read.df(sqlContext,"s3://some-bucket/some.json","json")
15/07/09 04:07:01 ERROR r.RBackendHandler: loadDF on
org.apache.spark.sql.api.r.SQLUtils failed
jav
--
Best regards,
Demi Ben-Ari <http://il.linkedin.com/in/demibenari>
Senior Software Engineer
Windward Ltd. <http://windward.eu/>
I would like to know whether foreachPartition will result in better
performance, due to a higher level of parallelism, compared to the foreach
method, considering the case in which I'm iterating through an RDD in order to
perform some sums into an accumulator variable.
Thank you,
Beniamino.
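For what it's worth, a small sketch comparing the two approaches (made-up data;
both produce the same sum, foreachPartition just updates the accumulator once
per partition instead of once per record):

rdd = spark.sparkContext.parallelize(range(1000), 8)

acc_each = spark.sparkContext.accumulator(0)
rdd.foreach(lambda x: acc_each.add(x))                    # one update per record

acc_part = spark.sparkContext.accumulator(0)
rdd.foreachPartition(lambda it: acc_part.add(sum(it)))    # one update per partition

print(acc_each.value, acc_part.value)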
Hi, everybody.
There are some cases in which I can obtain the same results by using the
mapPartitions and the foreach method.
For example, in a typical MapReduce approach one would perform a reduceByKey
immediately after a mapPartitions that transforms the original RDD into a
collection of tuples (ke
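As a tiny illustration of that MapReduce shape (made-up data and a made-up
per-record transformation):

data = spark.sparkContext.parallelize([("a", 1), ("a", 2), ("b", 5)], 4)

def per_partition(records):
    # transform each partition's records into (key, value) tuples
    for key, value in records:
        yield (key, value * 2)

result = data.mapPartitions(per_partition).reduceByKey(lambda x, y: x + y).collect()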
Happening right now
https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/
*Ask Me Anything about Apache Spark & big data*
Reddit AMA with Matei Zaharia
Friday, April 3 at 9AM PT/ 12PM ET
Details can be found here:
http://strataconf.com/big-data-conference-uk-2015/public/content/reddit-ama
014/12/spark-configuration-mess-solved.html>
--
Enjoy,
Demi Ben-Ari
Senior Software Engineer
Windward LTD.
months now, and it's incredibly helpful. I hope this post helps
towards fixing the problem!
Thanks,
-Ben
P.S. This is the full initial command I used, in case this is isolated to
particular instance types or anything:
ec2/spark-ec2 -k ... -i ... -z us-east-1d -s 4 -t m3.2xlarge
--ebs-vol-si
so used df -h to see what drives were mounted, and I get the same drives
regardless of whether I specify the --ebs-vol-size parameter when I execute
the script...
Can you give me an exact command line that works for you? I will try and
execute it verbatim.
Thanks,
-Ben
please add
From: "Ben Horner [via Apache Spark User List]"
mailto:ml-node+s1001560n9934...@n3.nabble.com>>
Date: Wednesday, July 16, 2014 at 8:47 AM
To: Ben Horner
Subject: Re: Trouble with spark-ec2 script: --ebs-vol-size
Should I take it fr
Should I take it from the lack of replies that the --ebs-vol-size feature
doesn't work?
-Ben
Hello,
I'm using the spark-0.9.1-bin-hadoop1 distribution, and the ec2/spark-ec2
script within it to spin up a cluster. I tried running my processing just
using the default (ephemeral) HDFS configuration, but my job errored out,
saying that there was no space left. So now I'm trying to increase