how to change data type for columns of dataframe

2022-04-01 Thread capitnfrakass
Hi, I got a dataframe object from another application, meaning this object was not generated by me. How can I change the data types of some columns in this dataframe? For example, change a column's type from Int to Float. Thanks.
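The usual fix is a cast on the column; a minimal PySpark sketch, assuming a running pyspark session and that df is the dataframe from the question with an integer column named age:

from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

# replace the "age" column with the same values cast from int to float
df = df.withColumn("age", col("age").cast(FloatType()))
df.printSchema()   # age is now float

The string form col("age").cast("float") is equivalent.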

data type missing

2022-04-01 Thread capitnfrakass
Hello, after I converted the dataframe to an RDD I found the data types were missing.

scala> df.show
+----+---+
|name|age|
+----+---+
|jone| 12|
|rosa| 21|
+----+---+

scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)

scala> df.rdd.map{ row =>
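The snippet above is Scala, where the Row objects coming out of df.rdd are untyped (fields come back as Any, so the Scala analogue is row.getAs[Int]("age")). As a rough PySpark sketch of the same round trip, assuming a running pyspark shell, the Row objects returned by df.rdd keep their field names and Python types and can be read by name:

df = spark.createDataFrame([("jone", 12), ("rosa", 21)], ["name", "age"])

# df.rdd is an RDD of Row objects; fields can be read by name or position
pairs = df.rdd.map(lambda row: (row["name"], row["age"] + 1))
print(pairs.collect())   # [('jone', 13), ('rosa', 22)]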

spark as data warehouse?

2022-03-25 Thread capitnfrakass
We have been using Hive for building the data warehouse. Do you think Spark can be used for this purpose? It is even more real-time than Hive. Thanks.

Re: spark jobs don't require the master/worker to startup?

2022-03-09 Thread capitnfrakass
What I tried to say is that I didn't start the Spark master/worker at all for a standalone deployment, but I can still log into pyspark and run the job. I don't know why.

$ ps -efw|grep spark
$ netstat -ntlp

Neither of the commands above shows any Spark-related output. And this machine is managed by myself, I

spark jobs don't require the master/worker to startup?

2022-03-09 Thread capitnfrakass
Hello, I have Spark 3.2.0 deployed on localhost in standalone mode. I didn't even run the start master and worker commands:

start-master.sh
start-worker.sh spark://127.0.0.1:7077

And the ports (such as 7077) were not open. But I can still log into pyspark and run jobs.
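One likely explanation (a guess, assuming the default pyspark shell with no --master flag and no spark-defaults.conf entry) is that the shell falls back to local mode, which runs the executors inside the driver process and never contacts the standalone master on port 7077. This can be checked from the shell:

# if this prints local[*] (or similar), the job runs in local mode and no
# standalone master/worker is involved at all
print(spark.sparkContext.master)

# to actually use the standalone cluster, start it and pass the master URL,
# e.g.  pyspark --master spark://127.0.0.1:7077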

can dataframe API deal with subquery

2022-02-26 Thread capitnfrakass
For example, with this table definition:

desc people;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| name      | string     |          |
| born      | date
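Many SQL subqueries can be expressed with the DataFrame API by building the inner query as its own dataframe and then joining or filtering against it; a small sketch, assuming the people table above and a made-up uncorrelated subquery that picks the most recent birth date:

from pyspark.sql import functions as F

people = spark.table("people")        # name: string, born: date

# SQL:  SELECT * FROM people WHERE born = (SELECT max(born) FROM people)
max_born = people.agg(F.max("born").alias("born"))

# the scalar subquery becomes a join against the one-row aggregate
people.join(max_born, on="born", how="inner").show()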

question on the different way of RDD to dataframe

2022-02-08 Thread capitnfrakass
Hello, I am converting some Python code to Scala. This works in Python:

rdd = sc.parallelize([('apple',1),('orange',2)])
rdd.toDF(['fruit','num']).show()
+------+---+
| fruit|num|
+------+---+
| apple|  1|
|orange|  2|
+------+---+

And in Scala:

scala> rdd.toDF("fruit","num").show()
+------+---+
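For reference, a minimal PySpark version of the same round trip, assuming a pyspark shell; the main API difference is that the Scala toDF takes the column names as varargs, rdd.toDF("fruit", "num"), and in a standalone Scala program also needs import spark.implicits._ in scope:

# PySpark: column names are passed as a list
rdd = sc.parallelize([("apple", 1), ("orange", 2)])
rdd.toDF(["fruit", "num"]).show()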

Re: add an auto_increment column

2022-02-08 Thread capitnfrakass
I have got the answer from Mich's reply. Thank you both. frakass On 08/02/2022 16:36, Gourav Sengupta wrote: Hi, so do you want to rank apple and tomato both as 2? Not quite clear on the use case here though. Regards, Gourav Sengupta On Tue, Feb 8, 2022 at 7:10 AM wrote: Hello Gourav

Re: add an auto_increment column

2022-02-07 Thread capitnfrakass
Hello Gourav, as you can see here, orderBy already gives a solution for the "equal amount" case:

df = sc.parallelize([("orange",2),("apple",3),("tomato",3),("cherry",5)]).toDF(['fruit','amount'])
df.select("*").orderBy("amount",ascending=False).show()
+------+------+
| fruit|amount|
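If tied amounts should share the same rank (apple and tomato both ranked 2, as Gourav asks in the follow-up above), a dense_rank window expresses that directly; a sketch over the same data, assuming a pyspark shell:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = sc.parallelize(
    [("orange", 2), ("apple", 3), ("tomato", 3), ("cherry", 5)]
).toDF(["fruit", "amount"])

# dense_rank gives tied amounts the same rank with no gaps:
# cherry -> 1, apple and tomato -> 2, orange -> 3
w = Window.orderBy(F.desc("amount"))
df.withColumn("rank", F.dense_rank().over(w)).show()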

Re: add an auto_increment column

2022-02-07 Thread capitnfrakass
Hello, for this query:

df.select("*").orderBy("amount",ascending=False).show()
+------+------+
| fruit|amount|
+------+------+
|tomato|     9|
| apple|     6|
|cherry|     5|
|orange|     3|
+------+------+

I want to add a column "top", in which the value is 1,2,3... meaning top1, top2,
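A row_number window over the same ordering produces exactly that 1, 2, 3... column; a minimal sketch, assuming df is the fruit/amount dataframe above (swap row_number for rank or dense_rank if ties should share a number):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy(F.desc("amount"))
df.withColumn("top", F.row_number().over(w)).show()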

Re: TypeError: Can not infer schema for type:

2022-02-06 Thread capitnfrakass
Thanks for the reply. It looks strange that in the scala shell I can do this translation:

scala> sc.parallelize(List(3,2,1,4)).toDF.show
+-----+
|value|
+-----+
|    3|
|    2|
|    1|
|    4|
+-----+

But in pyspark I have to write it as: sc.parallelize([3,2,1,4]).map(lambda x:
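The difference is that the Scala implicits can build a single-column dataframe from bare Ints, while PySpark's toDF only infers a schema from Row-like records, so each value has to be wrapped first. A sketch completing the truncated line above:

from pyspark.sql import Row

# wrap each bare int in a Row (or a 1-tuple) so a schema can be inferred
sc.parallelize([3, 2, 1, 4]).map(lambda x: Row(value=x)).toDF().show()

# or skip the RDD entirely
spark.createDataFrame([(x,) for x in [3, 2, 1, 4]], ["value"]).show()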

TypeError: Can not infer schema for type:

2022-02-06 Thread capitnfrakass
rdd = sc.parallelize([3,2,1,4])
rdd.toDF().show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/session.py", line 66, in toDF
    return sparkSession.createDataFrame(self, schema, sampleRatio)
  File "/opt/spark/python/pyspark/sql/session.py",

Re: dataframe doesn't support higher order func, right?

2022-02-06 Thread capitnfrakass
Indeed, in spark-shell I always omit the parentheses:

scala> sc.parallelize(List(3,2,1,4)).toDF.show
+-----+
|value|
+-----+
|    3|
|    2|
|    1|
|    4|
+-----+

So I thought it would be OK in pyspark too. But this still doesn't work, why?

sc.parallelize([3,2,1,4]).toDF().show()
Traceback

Re: dataframe doesn't support higher order func, right?

2022-02-06 Thread capitnfrakass
I am a bit confused about why this doesn't work in pyspark:

x = sc.parallelize([3,2,1,4])
x.toDF.show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'function' object has no attribute 'show'

Thank you.

add an auto_increment column

2022-02-06 Thread capitnfrakass
For a dataframe object, how can I add a column that auto-increments, like MySQL's behavior? Thank you.
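Spark has no exact equivalent of MySQL's AUTO_INCREMENT; the usual stand-ins are monotonically_increasing_id (unique and increasing, but not consecutive) or a row_number window (consecutive, but needs a global ordering, as in the ranking sketches further up). A minimal example with made-up data:

from pyspark.sql import functions as F

df = spark.createDataFrame([("a",), ("b",), ("c",)], ["val"])

# ids are unique and increasing, but NOT gap-free like MySQL's counter
df.withColumn("id", F.monotonically_increasing_id()).show()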

dataframe doesn't support higher order func, right?

2022-02-06 Thread capitnfrakass
For example, this works for an RDD object:

scala> val li = List(3,2,1,4,0)
li: List[Int] = List(3, 2, 1, 4, 0)

scala> val rdd = sc.parallelize(li)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.filter(_ > 2).collect()
res0: Array[Int] = Array(3, 4)
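The DataFrame API takes column expressions rather than arbitrary lambdas, so the RDD filter above maps onto an expression filter; genuine higher-order functions do exist, but only for array columns (in SQL since Spark 2.4, with Python wrappers such as F.filter since 3.1). A rough sketch, assuming a pyspark shell:

from pyspark.sql import functions as F

df = spark.createDataFrame([(3,), (2,), (1,), (4,), (0,)], ["value"])

# expression-based equivalent of rdd.filter(_ > 2)
df.filter(F.col("value") > 2).show()

# higher-order function on an array column (Spark 3.1+ Python API)
arr = spark.createDataFrame([([3, 2, 1, 4, 0],)], ["xs"])
arr.select(F.filter("xs", lambda x: x > 2).alias("big")).show()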

Re: help check my simple job

2022-02-06 Thread capitnfrakass
That did resolve my issue. Thanks a lot. frakass On 06/02/2022 17:25, Hannes Bibel wrote: Hi, looks like you're packaging your application for Scala 2.13 (should be specified in your build.sbt) while your Spark installation is built for Scala 2.12. Go to

help check my simple job

2022-02-06 Thread capitnfrakass
Hello, I wrote this simple job in Scala:

$ cat Myjob.scala
import org.apache.spark.sql.SparkSession

object Myjob {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder.appName("Simple Application").getOrCreate()
    val sparkContext =

how can I remove the warning message

2022-01-28 Thread capitnfrakass
When I submitted the job from the Scala client, I got these warning messages:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor

question for definition of column types

2022-01-26 Thread capitnfrakass
When creating a dataframe from a list, how can I specify the column types? For example:

df = spark.createDataFrame(list,["name","title","salary","rate","insurance"])
df.show()
+-----+-----+------+----+---------+
| name|title|salary|rate|insurance|
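Passing an explicit schema to createDataFrame is the usual way to pin down the column types; a sketch with made-up rows matching the column names in the question (the chosen types are only an assumption):

from pyspark.sql.types import (StructType, StructField,
                               StringType, IntegerType, DoubleType)

schema = StructType([
    StructField("name",      StringType(),  True),
    StructField("title",     StringType(),  True),
    StructField("salary",    IntegerType(), True),
    StructField("rate",      DoubleType(),  True),
    StructField("insurance", DoubleType(),  True),
])

df = spark.createDataFrame([("sam", "engineer", 5000, 0.5, 120.0)], schema)
df.printSchema()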

Re: unsubscribe

2022-01-21 Thread capitnfrakass
On 22/01/2022 11:07, Renan F. Souza wrote: unsubscribe

You should be able to unsubscribe yourself from the list by sending an email to: user-unsubscr...@spark.apache.org. Thanks.

newbie question for reduce

2022-01-18 Thread capitnfrakass
Hello, please help me take a look at why this simple reduce doesn't work:

rdd = sc.parallelize([("a",1),("b",2),("c",3)])
rdd.reduce(lambda x,y: x[1]+y[1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/rdd.py", line 1001, in reduce
    return
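The lambda fails because after the first combine step x is already a plain int, so x[1] no longer applies; project the values out before reducing (or just sum them). A sketch:

rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])

# reduce should only ever see values of one type, so drop to the ints first
print(rdd.map(lambda kv: kv[1]).reduce(lambda a, b: a + b))   # 6

# or simply
print(rdd.values().sum())                                     # 6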

Re: SqlQuery deprecated

2022-01-17 Thread capitnfrakass
Can Ignite support a Spark-like dataframe API? Thanks. On 17/01/2022 20:31, Pavel Tupitsyn wrote: Hi, The reason for deprecation is that SqlQuery is a limited subset of SqlFieldsQuery, which may be confusing. https://issues.apache.org/jira/browse/IGNITE-11334 On Mon, Jan 17, 2022 at 2:59 PM

question of shorten syntax for rdd

2022-01-17 Thread capitnfrakass
Hello, may I know from what version of Spark the RDD syntax can be shortened like this?

rdd.groupByKey().mapValues(lambda x: len(x)).collect()
[('b', 2), ('d', 1), ('a', 2)]
rdd.groupByKey().mapValues(len).collect()
[('b', 2), ('d', 1), ('a', 2)]

I know in Scala the syntax is: xxx(x => x.len)
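As far as I can tell this is not tied to a particular Spark version: mapValues accepts any Python callable, so passing the builtin len directly does the same thing as the explicit lambda. A small sketch:

rdd = sc.parallelize([("a", 1), ("a", 2), ("b", 3), ("b", 4), ("d", 5)])

# both forms call the same builtin on each grouped value
print(rdd.groupByKey().mapValues(lambda x: len(x)).collect())
print(rdd.groupByKey().mapValues(len).collect())
# e.g. [('a', 2), ('b', 2), ('d', 1)]  (ordering may vary)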