Hello
May I know from which version of Spark the RDD syntax can be shortened like
this?
rdd.groupByKey().mapValues(lambda x:len(x)).collect()
[('b', 2), ('d', 1), ('a', 2)]
rdd.groupByKey().mapValues(len).collect()
[('b', 2), ('d', 1), ('a', 2)]
I know in Scala there is a similar syntax: xxx(x => x.len)
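For completeness, a self-contained snippet reproducing the calls above (the sample data is my own assumption; sc is the shell's SparkContext):

# Sample pairs chosen to match the counts above (an assumption).
rdd = sc.parallelize([("b", 1), ("b", 2), ("d", 1), ("a", 1), ("a", 2)])

# The two forms are equivalent: len is simply passed as a function
# object, exactly like the lambda. This is plain Python, not a Spark feature.
rdd.groupByKey().mapValues(lambda x: len(x)).collect()
rdd.groupByKey().mapValues(len).collect()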
What I was trying to say is that I didn't start the Spark master/worker at
all for a standalone deployment.
But I can still log in to pyspark and run jobs. I don't know why.
$ ps -efw|grep spark
$ netstat -ntlp
Neither of the commands above shows any Spark-related output.
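For reference, a quick check from inside pyspark (a sketch; sc is the shell's SparkContext):

# Show which master the shell's SparkContext is bound to.
# Without --master and without daemons this is typically 'local[*]'.
print(sc.master)
# The driver's web UI address, if one was started.
print(sc.uiWebUrl)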
And this machine is managed by myself, I
Hello
I have Spark 3.2.0 deployed on localhost in standalone mode.
I didn't even run the start-master and start-worker commands:
start-master.sh
start-worker.sh spark://127.0.0.1:7077
And the ports (such as 7077) were not open.
But I can still log in to pyspark and run jobs, such as against this table
definition:
desc people;
+----------+-----------+---------+
| col_name | data_type | comment |
+----------+-----------+---------+
| name     | string    |         |
| born     | date      |         |
+----------+-----------+---------+
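For reference, this is the kind of query I am running (a sketch; I assume the pyspark shell's built-in spark session):

# Describe the table from within the pyspark shell.
spark.sql("desc people").show()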
We have been using Hive to build the data warehouse.
Do you think Spark can be used for this purpose? It seems even more
real-time than Hive.
Thanks.
Hi
I got a dataframe object from another application, meaning this object was
not generated by me.
How can I change the data types of some columns in this dataframe?
For example, change the column type from Int to Float.
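A minimal sketch of one possible approach (withColumn plus cast; "salary" is a hypothetical column name):

from pyspark.sql.functions import col

# Cast one column from int to float, leaving the others unchanged.
df2 = df.withColumn("salary", col("salary").cast("float"))
df2.printSchema()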
Thanks.
Hello
After I converted the dataframe to an RDD, I found the data types were
missing.
scala> df.show
+----+---+
|name|age|
+----+---+
|jone| 12|
|rosa| 21|
+----+---+
scala> df.printSchema
root
|-- name: string (nullable = true)
|-- age: integer (nullable = false)
scala> df.rdd.map{ row =>
When I submitted the job from the Scala client, I got these warning messages:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
(file:/opt/spark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor
When creating a dataframe from a list, how can I specify the column types?
For example:
df = spark.createDataFrame(list,["name","title","salary","rate","insurance"])
df.show()
+----+-----+------+----+---------+
|name|title|salary|rate|insurance|
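A sketch of what I'm after: passing an explicit schema instead of bare column names (the concrete types below are my guesses):

# DDL-style schema string; the types here are assumptions.
schema = "name string, title string, salary int, rate float, insurance string"
df = spark.createDataFrame(list, schema)  # 'list' is the original data list
df.printSchema()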
Hello
I am converting some Python code to Scala.
This works in Python:
rdd = sc.parallelize([('apple',1),('orange',2)])
rdd.toDF(['fruit','num']).show()
+------+---+
| fruit|num|
+------+---+
| apple|  1|
|orange|  2|
+------+---+
And in scala:
scala> rdd.toDF("fruit","num").show()
+------+---+
Hello,
For this query:
df.select("*").orderBy("amount",ascending=False).show()
+------+------+
| fruit|amount|
+------+------+
|tomato|     9|
| apple|     6|
|cherry|     5|
|orange|     3|
+------+------+
I want to add a column "top" whose values are 1, 2, 3, ... meaning top1,
top2, and so on.
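To make the goal concrete, a sketch of one possible way (row_number over a window ordered by amount; just my guess at the approach):

from pyspark.sql import Window
from pyspark.sql.functions import desc, row_number

# Rank rows by amount, highest first; ties get distinct consecutive numbers.
w = Window.orderBy(desc("amount"))
df.withColumn("top", row_number().over(w)).show()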
Hello Gourav
As you can see here, orderBy already gives a solution for the "equal
amount" case:
df = sc.parallelize([("orange",2),("apple",3),("tomato",3),("cherry",5)]).toDF(['fruit','amount'])
df.select("*").orderBy("amount",ascending=False).show()
+------+------+
| fruit|amount|
I have got the answer from Mich's reply. Thank you both.
frakass
On 08/02/2022 16:36, Gourav Sengupta wrote:
Hi,
so do you want to rank apple and tomato both as 2? Not quite clear on
the use case here though.
Regards,
Gourav Sengupta
On Tue, Feb 8, 2022 at 7:10 AM wrote:
Hello Gourav
That did resolve my issue.
Thanks a lot.
frakass
On 06/02/2022 17:25, Hannes Bibel wrote:
Hi,
looks like you're packaging your application for Scala 2.13 (should be
specified in your build.sbt) while your Spark installation is built
for Scala 2.12.
Go to
Hello
I wrote this simple job in scala:
$ cat Myjob.scala
import org.apache.spark.sql.SparkSession

object Myjob {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder.appName("Simple Application").getOrCreate()
    val sparkContext =
For example, this works for an RDD object:
scala> val li = List(3,2,1,4,0)
li: List[Int] = List(3, 2, 1, 4, 0)
scala> val rdd = sc.parallelize(li)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.filter(_ > 2).collect()
res0: Array[Int] = Array(3, 4)
For a dataframe object, how can I add an auto-increment column, like
MySQL's auto_increment behavior?
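A sketch of the kind of thing I mean, in pyspark (both variants are assumptions about what counts as "auto increment"):

from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# Variant 1: unique ids, but not necessarily consecutive.
df.withColumn("id", monotonically_increasing_id()).show()

# Variant 2: consecutive 1, 2, 3, ... via a window; "name" is a
# hypothetical column to order by, and this forces a single partition.
df.withColumn("id", row_number().over(Window.orderBy("name"))).show()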
Thank you.
I am a bit confused why this doesn't work in pyspark:
x = sc.parallelize([3,2,1,4])
x.toDF.show()
Traceback (most recent call last):
File "", line 1, in
AttributeError: 'function' object has no attribute 'show'
Thank you.
rdd = sc.parallelize([3,2,1,4])
rdd.toDF().show()
Traceback (most recent call last):
File "", line 1, in
File "/opt/spark/python/pyspark/sql/session.py", line 66, in toDF
return sparkSession.createDataFrame(self, schema, sampleRatio)
File "/opt/spark/python/pyspark/sql/session.py",
Thanks for the reply.
It seems strange that in the scala shell I can do this conversion:
scala> sc.parallelize(List(3,2,1,4)).toDF.show
+-----+
|value|
+-----+
|    3|
|    2|
|    1|
|    4|
+-----+
But in pyspark I have to write it as:
sc.parallelize([3,2,1,4]).map(lambda x:
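Presumably continuing like this (my completion of the truncated line, wrapping each int in a 1-tuple so a schema can be inferred):

# Wrap each element in a tuple so toDF can build a single-column schema.
sc.parallelize([3, 2, 1, 4]).map(lambda x: (x,)).toDF(["value"]).show()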
Indeed. In spark-shell I always omit the parentheses:
scala> sc.parallelize(List(3,2,1,4)).toDF.show
+-----+
|value|
+-----+
|    3|
|    2|
|    1|
|    4|
+-----+
So I thought it would be OK in pyspark too.
But this still doesn't work. Why?
sc.parallelize([3,2,1,4]).toDF().show()
Traceback
Hello
Please help me take a look at why this simple reduce doesn't work:
rdd = sc.parallelize([("a",1),("b",2),("c",3)])
rdd.reduce(lambda x,y: x[1]+y[1])
Traceback (most recent call last):
File "", line 1, in
File "/opt/spark/python/pyspark/rdd.py", line 1001, in reduce
return
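For what it's worth, a sketch of why I think this fails and one possible fix (my reading: reduce must return the same shape it consumes, but x[1]+y[1] returns a plain int):

rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])

# Project out the values first, then reduce over plain ints -> 6.
rdd.map(lambda kv: kv[1]).reduce(lambda x, y: x + y)

# Or more directly:
rdd.values().sum()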
On 22/01/2022 11:07, Renan F. Souza wrote:
unsubscribe
You should be able to unsubscribe yourself from the list by sending an
email to:
user-unsubscr...@spark.apache.org
thanks.