Hello
May I know from which version of Spark the RDD syntax can be shortened like
this?
rdd.groupByKey().mapValues(lambda x:len(x)).collect()
[('b', 2), ('d', 1), ('a', 2)]
rdd.groupByKey().mapValues(len).collect()
[('b', 2), ('d', 1), ('a', 2)]
I know in Scala there is a similar syntax: xxx(x => x.len)
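For completeness, a self-contained snippet reproducing the calls above (the sample data is my own assumption; sc is the shell's SparkContext):

# Sample pairs chosen to match the counts above (an assumption).
rdd = sc.parallelize([("b", 1), ("b", 2), ("d", 1), ("a", 1), ("a", 2)])

# The two forms are equivalent: len is simply passed as a function
# object, exactly like the lambda. This is plain Python, not a Spark feature.
rdd.groupByKey().mapValues(lambda x: len(x)).collect()
rdd.groupByKey().mapValues(len).collect()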
What I was trying to say is that I didn't start the Spark master/worker at
all for a standalone deployment.
But I can still log in to pyspark and run jobs. I don't know why.
$ ps -efw|grep spark
$ netstat -ntlp
Neither of the commands above shows any Spark-related output.
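For reference, a quick check from inside pyspark (a sketch; sc is the shell's SparkContext):

# Show which master the shell's SparkContext is bound to.
# Without --master and without daemons this is typically 'local[*]'.
print(sc.master)
# The driver's web UI address, if one was started.
print(sc.uiWebUrl)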
And this machine is managed by myself, I
Hello
I have Spark 3.2.0 deployed on localhost in standalone mode.
I didn't even run the start-master and start-worker commands:
start-master.sh
start-worker.sh spark://127.0.0.1:7077
And the ports (such as 7077) were not open.
But I can still log in to pyspark and run jobs, such as against this table
definition:
desc people;
+----------+-----------+---------+
| col_name | data_type | comment |
+----------+-----------+---------+
| name     | string    |         |
| born     | date      |         |
+----------+-----------+---------+
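For reference, this is the kind of query I am running (a sketch; I assume the pyspark shell's built-in spark session):

# Describe the table from within the pyspark shell.
spark.sql("desc people").show()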
We have been using Hive to build the data warehouse.
Do you think Spark can be used for this purpose? It seems even more
real-time than Hive.
Thanks.
Hi
I got a dataframe object from another application, meaning this object was
not generated by me.
How can I change the data types of some columns in this dataframe?
For example, change the column type from Int to Float.
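A minimal sketch of one possible approach (withColumn plus cast; "salary" is a hypothetical column name):

from pyspark.sql.functions import col

# Cast one column from int to float, leaving the others unchanged.
df2 = df.withColumn("salary", col("salary").cast("float"))
df2.printSchema()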
Thanks.
Hello
After I converted the dataframe to an RDD, I found the data types were
missing.
scala> df.show
+----+---+
|name|age|
+----+---+
|jone| 12|
|rosa| 21|
+----+---+
scala> df.printSchema
root
|-- name: string (nullable = true)
|-- age: integer (nullable = false)
scala> df.rdd.map{ row =>
When I submitted the job from the Scala client, I got these warning messages:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
(file:/opt/spark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor
When creating a dataframe from a list, how can I specify the column types?
For example:
df = spark.createDataFrame(list,["name","title","salary","rate","insurance"])
df.show()
+----+-----+------+----+---------+
|name|title|salary|rate|insurance|
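A sketch of what I'm after: passing an explicit schema instead of bare column names (the concrete types below are my guesses):

# DDL-style schema string; the types here are assumptions.
schema = "name string, title string, salary int, rate float, insurance string"
df = spark.createDataFrame(list, schema)  # 'list' is the original data list
df.printSchema()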
Hello
I am converting some Python code to Scala.
This works in Python:
rdd = sc.parallelize([('apple',1),('orange',2)])
rdd.toDF(['fruit','num']).show()
+------+---+
| fruit|num|
+------+---+
| apple|  1|
|orange|  2|
+------+---+
And in scala:
scala> rdd.toDF("fruit","num").show()
+------+---+
Hello,
For this query:
df.select("*").orderBy("amount",ascending=False).show()
+------+------+
| fruit|amount|
+------+------+
|tomato|     9|
| apple|     6|
|cherry|     5|
|orange|     3|
+------+------+
I want to add a column "top" whose values are 1, 2, 3, ... meaning top1,
top2, and so on.
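To make the goal concrete, a sketch of one possible way (row_number over a window ordered by amount; just my guess at the approach):

from pyspark.sql import Window
from pyspark.sql.functions import desc, row_number

# Rank rows by amount, highest first; ties get distinct consecutive numbers.
w = Window.orderBy(desc("amount"))
df.withColumn("top", row_number().over(w)).show()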
Hello Gourav
As you can see here, orderBy already gives a solution for the "equal
amount" case:
df = sc.parallelize([("orange",2),("apple",3),("tomato",3),("cherry",5)]).toDF(['fruit','amount'])
df.select("*").orderBy("amount",ascending=False).show()
+------+------+
| fruit|amount|
I have got the answer from Mich's reply. Thank you both.
frakass
On 08/02/2022 16:36, Gourav Sengupta wrote:
Hi,
so do you want to rank apple and tomato both as 2? Not quite clear on
the use case here though.
Regards,
Gourav Sengupta
On Tue, Feb 8, 2022 at 7:10 AM wrote:
Hello Gourav
That did resolve my issue.
Thanks a lot.
frakass
On 06/02/2022 17:25, Hannes Bibel wrote:
Hi,
looks like you're packaging your application for Scala 2.13 (should be
specified in your build.sbt) while your Spark installation is built
for Scala 2.12.
Go to
Hello
I wrote this simple job in scala:
$ cat Myjob.scala
import org.apache.spark.sql.SparkSession

object Myjob {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder.appName("Simple Application").getOrCreate()
    val sparkContext =
For example, this works for an RDD object:
scala> val li = List(3,2,1,4,0)
li: List[Int] = List(3, 2, 1, 4, 0)
scala> val rdd = sc.parallelize(li)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.filter(_ > 2).collect()
res0: Array[Int] = Array(3, 4)
For a dataframe object, how can I add an auto-increment column, like
MySQL's auto_increment behavior?
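A sketch of the kind of thing I mean, in pyspark (both variants are assumptions about what counts as "auto increment"):

from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# Variant 1: unique ids, but not necessarily consecutive.
df.withColumn("id", monotonically_increasing_id()).show()

# Variant 2: consecutive 1, 2, 3, ... via a window; "name" is a
# hypothetical column to order by, and this forces a single partition.
df.withColumn("id", row_number().over(Window.orderBy("name"))).show()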
Thank you.
I am a bit confused why this doesn't work in pyspark:
x = sc.parallelize([3,2,1,4])
x.toDF.show()
Traceback (most recent call last):
File "", line 1, in
AttributeError: 'function' object has no attribute 'show'
Thank you.
rdd = sc.parallelize([3,2,1,4])
rdd.toDF().show()
Traceback (most recent call last):
File "", line 1, in
File "/opt/spark/python/pyspark/sql/session.py", line 66, in toDF
return sparkSession.createDataFrame(self, schema, sampleRatio)
File "/opt/spark/python/pyspark/sql/session.py",
Thanks for the reply.
It seems strange that in the scala shell I can do this conversion:
scala> sc.parallelize(List(3,2,1,4)).toDF.show
+-----+
|value|
+-----+
|    3|
|    2|
|    1|
|    4|
+-----+
But in pyspark I have to write it as:
sc.parallelize([3,2,1,4]).map(lambda x:
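Presumably continuing like this (my completion of the truncated line, wrapping each int in a 1-tuple so a schema can be inferred):

# Wrap each element in a tuple so toDF can build a single-column schema.
sc.parallelize([3, 2, 1, 4]).map(lambda x: (x,)).toDF(["value"]).show()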
Indeed. In spark-shell I always omit the parentheses:
scala> sc.parallelize(List(3,2,1,4)).toDF.show
+-----+
|value|
+-----+
|    3|
|    2|
|    1|
|    4|
+-----+
So I thought it would be OK in pyspark too.
But this still doesn't work. Why?
sc.parallelize([3,2,1,4]).toDF().show()
Traceback
Hello
Please help me take a look at why this simple reduce doesn't work:
rdd = sc.parallelize([("a",1),("b",2),("c",3)])
rdd.reduce(lambda x,y: x[1]+y[1])
Traceback (most recent call last):
File "", line 1, in
File "/opt/spark/python/pyspark/rdd.py", line 1001, in reduce
return
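For what it's worth, a sketch of why I think this fails and one possible fix (my reading: reduce must return the same shape it consumes, but x[1]+y[1] returns a plain int):

rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])

# Project out the values first, then reduce over plain ints -> 6.
rdd.map(lambda kv: kv[1]).reduce(lambda x, y: x + y)

# Or more directly:
rdd.values().sum()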
On 22/01/2022 11:07, Renan F. Souza wrote:
unsubscribe
You should be able to unsubscribe yourself from the list by sending an
email to:
user-unsubscr...@spark.apache.org
thanks.