Hi,
I got a DataFrame object from another application, meaning this object
was not generated by me.
How can I change the data types of some columns in this DataFrame?
For example, how can I change a column's type from Int to Float?
Thanks.
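A common way to do this in PySpark (a minimal sketch, assuming a
hypothetical integer column named "age") is to replace the column with a
cast:

from pyspark.sql.functions import col

# Replace the integer "age" column (hypothetical name) with a float cast:
df2 = df.withColumn("age", col("age").cast("float"))
df2.printSchema()  # "age" now shows as float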
Hello
After I converted the DataFrame to an RDD, I found the data type
information was missing.
scala> df.show
+----+---+
|name|age|
+----+---+
|jone| 12|
|rosa| 21|
+----+---+
scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
scala> df.rdd.map{ row =>
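For what it's worth, the schema lives on the DataFrame, not on the RDD:
after df.rdd the rows no longer carry a schema, but each Row field keeps
its typed value. A minimal PySpark sketch of the same round trip
(assuming the df above):

rdd = df.rdd  # RDD of Row objects; the DataFrame schema is no longer attached
# Each field still has its original type, so values can be used directly:
rdd.map(lambda row: (row["name"], row["age"] + 1)).collect()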
In the past we have been using Hive for building the data warehouse.
Do you think Spark can be used for this purpose? It's even more
real-time than Hive.
Thanks.
What I tried to say is: I didn't start the Spark master/worker at all
for a standalone deployment.
But I can still log into pyspark to run the job. I don't know why.
$ ps -efw | grep spark
$ netstat -ntlp
Neither command's output shows any Spark-related info.
And this machine is managed by myself, I
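A likely explanation (an assumption, since the full setup isn't shown):
when pyspark is launched without --master and no cluster is configured,
it falls back to local mode, which runs the driver and executors inside
a single process, so no master/worker daemons or ports are involved.
This can be confirmed from the shell:

# Inside the pyspark shell:
sc.master  # prints 'local[*]' in local mode; no port 7077 is needed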
Hello
I have Spark 3.2.0 deployed on localhost in standalone mode.
I didn't even run the start-master and start-worker commands:
start-master.sh
start-worker.sh spark://127.0.0.1:7077
And the ports (such as 7077) were not opened there.
But I can still log into pyspark to run jobs,
such as querying this table definition:
desc people;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| name      | string     |          |
| born      | date
Hello
I am converting some Python code to Scala.
This works in Python:
rdd = sc.parallelize([('apple',1),('orange',2)])
rdd.toDF(['fruit','num']).show()
+------+---+
| fruit|num|
+------+---+
| apple|  1|
|orange|  2|
+------+---+
And in scala:
scala> rdd.toDF("fruit","num").show()
+------+---+
I have got the answer from Mich's reply. Thank you both.
frakass
On 08/02/2022 16:36, Gourav Sengupta wrote:
Hi,
So, do you want to rank apple and tomato both as 2? I'm not quite clear
on the use case here, though.
Regards,
Gourav Sengupta
On Tue, Feb 8, 2022 at 7:10 AM wrote:
Hello Gourav
As you can see here, orderBy already gives a solution for the "equal
amount" case:
df =
sc.parallelize([("orange",2),("apple",3),("tomato",3),("cherry",5)]).toDF(['fruit','amount'])
df.select("*").orderBy("amount",ascending=False).show()
+------+------+
| fruit|amount|
Hello,
For this query:
df.select("*").orderBy("amount",ascending=False).show()
+------+------+
| fruit|amount|
+------+------+
|tomato|     9|
| apple|     6|
|cherry|     5|
|orange|     3|
+------+------+
I want to add a column "top" whose value is 1, 2, 3..., meaning
top 1, top 2,
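One way to do this (a sketch using a window function; column names
assumed from the output above):

from pyspark.sql.functions import desc, row_number
from pyspark.sql.window import Window

# row_number() numbers rows 1, 2, 3... in descending order of amount;
# use rank() instead if equal amounts should share the same "top" value.
w = Window.orderBy(desc("amount"))
df.withColumn("top", row_number().over(w)).show()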
Thanks for the reply.
It looks strange that in the scala shell I can do this directly:
scala> sc.parallelize(List(3,2,1,4)).toDF.show
+-----+
|value|
+-----+
|    3|
|    2|
|    1|
|    4|
+-----+
But in pyspark I have to write it as:
sc.parallelize([3,2,1,4]).map(lambda x:
rdd = sc.parallelize([3,2,1,4])
rdd.toDF().show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/session.py", line 66, in toDF
    return sparkSession.createDataFrame(self, schema, sampleRatio)
  File "/opt/spark/python/pyspark/sql/session.py",
Indeed. In spark-shell I always omit the parentheses:
scala> sc.parallelize(List(3,2,1,4)).toDF.show
+-----+
|value|
+-----+
|    3|
|    2|
|    1|
|    4|
+-----+
So I thought it would be OK in pyspark too.
But this still doesn't work. Why?
sc.parallelize([3,2,1,4]).toDF().show()
Traceback
I am a bit confused why this doesn't work in pyspark:
x = sc.parallelize([3,2,1,4])
x.toDF.show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'function' object has no attribute 'show'
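Unlike Scala, Python never calls a method implicitly: x.toDF without
parentheses is just the bound method object, which has no .show
attribute. It has to be invoked, and the bare ints still need wrapping
(see above). A sketch:

x = sc.parallelize([3, 2, 1, 4])
# x.toDF is only a method reference; x.toDF() actually calls it.
# The bare ints also need wrapping so a schema can be inferred:
x.map(lambda v: (v,)).toDF(["value"]).show()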
Thank you.
For a dataframe object, how can I add a column that auto-increments,
like MySQL's behavior?
Thank you.
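Spark has no exact equivalent of MySQL's auto_increment, but there are
two common approximations (a sketch; trade-offs noted in the comments):

from pyspark.sql.functions import lit, monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# Unique and increasing, but NOT consecutive (ids jump between partitions):
df.withColumn("id", monotonically_increasing_id())

# Consecutive 1, 2, 3..., but pulls all rows into a single partition:
df.withColumn("id", row_number().over(Window.orderBy(lit(1))))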
For example, this works for an RDD object:
scala> val li = List(3,2,1,4,0)
li: List[Int] = List(3, 2, 1, 4, 0)
scala> val rdd = sc.parallelize(li)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
parallelize at <console>:24
scala> rdd.filter(_ > 2).collect()
res0: Array[Int] = Array(3, 4)
That did resolve my issue.
Thanks a lot.
frakass
On 06/02/2022 17:25, Hannes Bibel wrote:
Hi,
looks like you're packaging your application for Scala 2.13 (should be
specified in your build.sbt) while your Spark installation is built
for Scala 2.12.
Go to
Hello
I wrote this simple job in scala:
$ cat Myjob.scala
import org.apache.spark.sql.SparkSession
object Myjob {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .appName("Simple Application").getOrCreate()
    val sparkContext =
When I submitted the job from the Scala client, I got these warning
messages:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
(file:/opt/spark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor
When creating a dataframe from a list, how can I specify the column
types? For example:
df = spark.createDataFrame(list, ["name","title","salary","rate","insurance"])
df.show()
+-----+-----+------+----+---------+
| name|title|salary|rate|insurance|
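Passing only column names makes Spark infer the types. To set them
explicitly, pass a StructType schema instead (a sketch; the field types
below are assumptions, adjust them to the real data):

from pyspark.sql.types import (StructType, StructField,
                               StringType, DoubleType)

schema = StructType([
    StructField("name", StringType(), True),
    StructField("title", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("rate", DoubleType(), True),
    StructField("insurance", DoubleType(), True),
])
df = spark.createDataFrame(list, schema)  # "list" is the data from above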
On 22/01/2022 11:07, Renan F. Souza wrote:
unsubscribe
You should be able to unsubscribe yourself from the list by sending an
email to:
user-unsubscr...@spark.apache.org
thanks.
Hello
Please help take a look at why this simple reduce doesn't work:
rdd = sc.parallelize([("a",1),("b",2),("c",3)])
rdd.reduce(lambda x,y: x[1]+y[1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/rdd.py", line 1001, in reduce
    return
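The function passed to reduce must return the same type as the RDD's
elements; after the first step x is already a plain int, so x[1] fails.
Projecting out the values first (a sketch) works:

rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
# Project out the values, then reduce over values of the same type:
rdd.map(lambda kv: kv[1]).reduce(lambda x, y: x + y)   # 6
# Or simply:
rdd.values().sum()                                     # 6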
Can Ignite support a Spark-like dataframe API? Thanks
On 17/01/2022 20:31, Pavel Tupitsyn wrote:
Hi,
The reason for deprecation is that SqlQuery is a limited subset of
SqlFieldsQuery, which may be confusing.
https://issues.apache.org/jira/browse/IGNITE-11334
On Mon, Jan 17, 2022 at 2:59 PM
Hello
May I know from what version of Spark the RDD syntax can be shortened
like this?
rdd.groupByKey().mapValues(lambda x:len(x)).collect()
[('b', 2), ('d', 1), ('a', 2)]
rdd.groupByKey().mapValues(len).collect()
[('b', 2), ('d', 1), ('a', 2)]
I know in Scala the syntax is: xxx(x => x.len)
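This is plain Python rather than a Spark version feature: mapValues
accepts any callable, and the built-in len is already a one-argument
function, so the two forms are equivalent in any PySpark version. A
minimal illustration:

f = lambda x: len(x)   # wrapping len in a lambda adds nothing
g = len                # the builtin is already a callable
f(["a", "b"]) == g(["a", "b"])   # True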