virtual_mailbox_maps usage

2022-08-30 Thread frakass
I have a domain in virtual_mailbox_domains:
a.com

and a domain in virtual_alias_domains:
b.com

I can set up this entry in virtual_alias_maps for a domain alias:

x...@b.com x...@a.com


But what's the usage of virtual_mailbox_maps?

Thank you.
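
For reference, virtual_mailbox_maps plays a different role from aliasing: for a
domain listed in virtual_mailbox_domains it maps each valid recipient address to
a mailbox location under virtual_mailbox_base, so mail is delivered into a local
mailbox rather than rewritten to another address. A minimal sketch, assuming
Maildir-style delivery and an illustrative base directory and username
("someuser" and the paths are assumptions, not from this thread):

# main.cf
virtual_mailbox_domains = a.com
virtual_mailbox_base = /var/mail/vhosts
virtual_mailbox_maps = hash:/etc/postfix/vmailbox

# /etc/postfix/vmailbox (run "postmap /etc/postfix/vmailbox" after editing)
# someuser is a hypothetical mailbox user; the trailing "/" means Maildir delivery
someuser@a.com    a.com/someuser/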



where to setup virtual_mailbox_maps

2022-08-30 Thread frakass
Hello,

I have a domain in virtual_mailbox_domains:

aaa.com

I also have virtual_alias_domains, which includes:

bbb.com

I know how to forward x...@bbb.com to y...@aaa.com by putting this entry in the
"virtual_alias_maps" file:

x...@bbb.com y...@aaa.com

(and running postmap after the changes.)

But how can I set up virtual_mailbox_maps (if I am naming it correctly)? For
example, forwarding 1...@aaa.com to 2...@aaa.com.

Thank you.
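
For what it's worth, forwarding one address to another, even inside the same
virtual mailbox domain, is still done with virtual_alias_maps; virtual_mailbox_maps
only tells Postfix which addresses in aaa.com have a real mailbox and where that
mailbox lives under virtual_mailbox_base. A minimal sketch under those assumptions
(the mailbox directory name "user2" is hypothetical; run postmap on both files
after editing):

# virtual_alias_maps file: forward one address to another
1...@aaa.com    2...@aaa.com

# virtual_mailbox_maps file: give the target address an actual mailbox
# (path is relative to virtual_mailbox_base; trailing "/" means Maildir)
2...@aaa.com    aaa.com/user2/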



Re: sync or async producer

2022-02-15 Thread frakass

Hello,

I did a test with these two Ruby scripts, and they take almost the same time.
Do you have any further ideas?


$ cat async-pub.rb
require 'kafka'

kafka = Kafka.new("localhost:9092", client_id: "ruby-client", resolve_seed_brokers: true)
producer = kafka.async_producer(required_acks: :all, max_buffer_size: 50_000, max_queue_size: 10_000)

1.times do
  message = rand.to_s
  producer.produce(message, topic: "mytest")
end

producer.deliver_messages
producer.shutdown


$ cat sync-pub.rb
require 'kafka'

kafka = Kafka.new("localhost:9092", client_id: "ruby-client", resolve_seed_brokers: true)

producer = kafka.producer(required_acks: :all, max_buffer_size: 50_000)

1.times do
  message = rand.to_s
  producer.produce(message, topic: "mytest")
end

producer.deliver_messages

Thanks

On 2022/2/16 10:18, Luke Chen wrote:

Hi frakass,

I think the main difference between sync and async send (or "publish" as you
said) is the throughput.
You said the performance is almost the same, so I would guess the "acks"
config in your environment might be 0? Or maybe the produce rate is slow?
Or "max.in.flight.requests.per.connection" is 1?
Usually, when "acks=all", in "sync" mode you have to wait for the records to be
completely replicated to all brokers before the server responds, which is
why the throughput will be lower.
In async mode, by comparison, the producer send returns immediately after
appending the records and waits for the response in a callback function, no
matter whether it's acks=0 or acks=all.


Hope that helps.

Luke







On Wed, Feb 16, 2022 at 9:10 AM frakass  wrote:


For a producer, is there a principle for when to use sync publishing
and when to use async publishing?

For simple-format messages, I have tested both, and their performance
is almost the same.

Thank you.
frakass
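
The gap Luke describes usually only shows up when many records are in flight
with acks=all, and as written the scripts above produce only a single message
per run, so there is little to measure. A hedged sketch of the same comparison
using the Python kafka-python client rather than the ruby-kafka gem used above
(the library choice, topic name, and message count are assumptions for
illustration):

# pip install kafka-python
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")

# "sync": block on every send, so each record waits for full replication
start = time.time()
for _ in range(10_000):
    producer.send("mytest", str(time.time()).encode()).get(timeout=10)
print("sync:", time.time() - start)

# "async": queue all sends, let the client batch them, block only on flush()
start = time.time()
for _ in range(10_000):
    producer.send("mytest", str(time.time()).encode())
producer.flush()
print("async:", time.time() - start)

producer.close()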





sync or async producer

2022-02-15 Thread frakass
For a producer, is there a principle for when to use sync publishing
and when to use async publishing?


For simple-format messages, I have tested both, and their performance
is almost the same.


Thank you.
frakass


Re: how to classify column

2022-02-11 Thread frakass

That's good, thanks.

On 2022/2/12 12:11, Raghavendra Ganesh wrote:
.withColumn("newColumn",expr(s"case when score>3 then 'good' else 'bad' 
end"))




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



how to classify column

2022-02-11 Thread frakass

Hello

I have a column (Int type, named score) whose value is from 0 to 5.
I want to query it so that when the score > 3 it is classified as "good",
else classified as "bad".

How do I implement that? With a UDF, something like this?

scala> implicit class Foo(i:Int) {
 |   def classAs(f:Int=>String) = f(i)
 | }
class Foo

scala> 4.classAs { x => if (x > 3) "good" else "bad" }
val res13: String = good

scala> 2.classAs { x => if (x > 3) "good" else "bad" }
val res14: String = bad


Thank you.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: data size exceeds the total ram

2022-02-11 Thread frakass

Hello list

I have imported the data into Spark and I found there is disk I/O on
every node. The memory didn't overflow.


But this query is quite slow:

>>> df.groupBy("rvid").agg({'rate':'avg','rvid':'count'}).show()


May I ask:
1. Since I have 3 nodes (that is, 3 executors?), are there 3
partitions for each job?

2. Can I increase the number of partitions by hand to improve the performance?

Thanks
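
On the two questions above: the number of input partitions normally follows the
file splits (roughly one per HDFS block), not the node count, and the shuffle
behind groupBy/agg uses spark.sql.shuffle.partitions (200 by default). A hedged
sketch of checking and tuning this by hand in pyspark (the partition counts are
illustrative assumptions):

df.rdd.getNumPartitions()                              # how many partitions the DataFrame has now
spark.conf.set("spark.sql.shuffle.partitions", "24")   # partitions used by the groupBy/agg shuffle
df = df.repartition(24)                                # or repartition the data itself, e.g. a few per core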



On 2022/2/11 6:22, frakass wrote:



On 2022/2/11 6:16, Gourav Sengupta wrote:
What is the source data (is it JSON, CSV, Parquet, etc)? Where are you 
reading it from (JDBC, file, etc)? What is the compression format (GZ, 
BZIP, etc)? What is the SPARK version that you are using?


It's a well-formed CSV file (not compressed) stored in HDFS.
Spark 3.2.0.

Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: data size exceeds the total ram

2022-02-11 Thread frakass




On 2022/2/11 6:16, Gourav Sengupta wrote:
What is the source data (is it JSON, CSV, Parquet, etc)? Where are you 
reading it from (JDBC, file, etc)? What is the compression format (GZ, 
BZIP, etc)? What is the SPARK version that you are using?


It's a well-formed CSV file (not compressed) stored in HDFS.
Spark 3.2.0.

Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



data size exceeds the total ram

2022-02-11 Thread frakass

Hello

I have three nodes with total memory 128 GB x 3 = 384 GB,
but the input data is about 1 TB.
How can Spark handle this case?

Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Avro file format with SparkSQL

2022-02-09 Thread frakass

Have you added the dependency to build.sbt?
Can you 'sbt package' the source successfully?

regards
frakass

On 2022/2/10 11:25, Karanika, Anna wrote:
For context, I am invoking spark-submit and adding arguments --packages 
org.apache.spark:spark-avro_2.12:3.2.0.
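
Once that package is on the classpath (via --packages as above, or an equivalent
build dependency), Avro is just another data source name from pyspark; a hedged
sketch with illustrative paths:

# spark-submit --packages org.apache.spark:spark-avro_2.12:3.2.0 my_job.py
df = spark.read.format("avro").load("/tmp/input.avro")   # path is an assumption
df.write.format("avro").save("/tmp/output_avro")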


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: question on the different way of RDD to dataframe

2022-02-08 Thread frakass

I think it's better written as:

df1.map { case (w, x, y, z) => columns(w, x, y, z) }

Thanks


On 2022/2/9 12:46, Mich Talebzadeh wrote:
scala> val df2 = df1.map(p => columns(p(0).toString, p(1).toString, p(2).toString, p(3).toString.toDouble)) // map those columns


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: flatMap for dataframe

2022-02-08 Thread frakass

Is this the Scala syntax?
Yes, in Scala I know how to do it by converting the df to a Dataset.
But how can I do that in pyspark?

Thanks

On 2022/2/9 10:24, oliver dd wrote:

df.flatMap(row => row.getAs[String]("value").split(" "))
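
For the pyspark side of the question, two hedged equivalents of the Scala
one-liner above (assuming the same string column named "value"):

from pyspark.sql.functions import explode, split

# DataFrame-native: split the column and explode the resulting array into rows
df.select(explode(split(df["value"], " ")).alias("word")).show()

# or drop to the underlying RDD, where flatMap is available directly
df.rdd.flatMap(lambda row: row["value"].split(" ")).collect()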


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: question on the different way of RDD to dataframe

2022-02-08 Thread frakass

I know that by using a case class I can control the data types strictly.

scala> val rdd = sc.parallelize(List(("apple",1),("orange",2)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23


scala> rdd.toDF.printSchema
root
 |-- _1: string (nullable = true)
 |-- _2: integer (nullable = false)


I can specify the second column as another type, such as Double, via a case class:

scala> case class Fruit(fruit: String, num: Double)
class Fruit

scala> rdd.map{ case (x,y) => Fruit(x,y) }.toDF.printSchema
root
 |-- fruit: string (nullable = true)
 |-- num: double (nullable = false)



Thank you.
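
A hedged pyspark counterpart of the same idea, using an explicit schema instead
of a case class (column names and types mirror the Scala example; the float()
cast just keeps the values consistent with DoubleType):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

rdd = sc.parallelize([("apple", 1), ("orange", 2)])
schema = StructType([StructField("fruit", StringType(), True),
                     StructField("num", DoubleType(), False)])
df = spark.createDataFrame(rdd.map(lambda t: (t[0], float(t[1]))), schema)
df.printSchema()
# root
#  |-- fruit: string (nullable = true)
#  |-- num: double (nullable = false)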



On 2022/2/8 10:32, Sean Owen wrote:
It's just a possibly tidier way to represent objects with named, typed 
fields, in order to specify a DataFrame's contents.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



flatMap for dataframe

2022-02-08 Thread frakass

Hello

For an RDD I can apply the flatMap method:

>>> sc.parallelize(["a few words","ba na ba na"]).flatMap(lambda x: 
x.split(" ")).collect()

['a', 'few', 'words', 'ba', 'na', 'ba', 'na']


But for a dataframe, how can I apply flatMap like that?

>>> df.show()
+----------------+
|           value|
+----------------+
|     a few lines|
|hello world here|
|     ba na ba na|
+----------------+


Thanks

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: unsubscribe

2022-01-14 Thread frakass
Please send an empty message to user-unsubscr...@spark.apache.org to
unsubscribe yourself from the list.


Thanks

On 2022/1/15 7:04, ALOK KUMAR SINGH wrote:

unsubscribe


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: groupMapReduce

2022-01-14 Thread frakass

OK thanks. I will check that.


On 2022/1/14 7:09, David Diebold wrote:

Hello,

In RDD api, you must be looking for reduceByKey.

Cheers
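
Scala's groupMapReduce combines a group-by-key, a mapping of the values, and a
per-key reduce; in the RDD API the closest single call is indeed reduceByKey
(apply map/mapValues first if the values need transforming). A minimal pyspark
sketch with made-up data:

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd.reduceByKey(lambda x, y: x + y).collect()   # e.g. [('a', 4), ('b', 2)]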

On Fri, Jan 14, 2022 at 11:56, frakass <mailto:capitnfrak...@free.fr> wrote:


Is there a RDD API which is similar to Scala's groupMapReduce?
https://blog.genuine.com/2019/11/scalas-groupmap-and-groupmapreduce/
<https://blog.genuine.com/2019/11/scalas-groupmap-and-groupmapreduce/>

Thank you.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
<mailto:user-unsubscr...@spark.apache.org>



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



groupMapReduce

2022-01-14 Thread frakass

Is there a RDD API which is similar to Scala's groupMapReduce?
https://blog.genuine.com/2019/11/scalas-groupmap-and-groupmapreduce/

Thank you.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: about memory size for loading file

2022-01-13 Thread frakass

For this case, do I have 3 partitions, each processing about 3.33 GB of data? Am I right?
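
A hedged way to check this directly: with the usual 128 MB HDFS block/split
size, a 10 GB file comes back as roughly 80 input partitions spread across the
executors, not one partition per node (the path below is an illustrative
assumption):

rdd = sc.textFile("hdfs:///data/some_10gb_file.csv")   # hypothetical path
rdd.getNumPartitions()   # usually about one per HDFS block, e.g. ~80 for 10 GB / 128 MB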


On 2022/1/14 2:20, Sonal Goyal wrote:

No it should not. The file would be partitioned and read across each node.

On Fri, 14 Jan 2022 at 11:48 AM, frakass <mailto:capitnfrak...@free.fr>> wrote:


Hello list

Say I have a file whose size is 10 GB. The total RAM of the
cluster is 24 GB across three nodes, so each node has only 8 GB.
If I load this file into Spark as an RDD via the sc.textFile interface, will
this operation run into an "out of memory" issue?

Thank you.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
<mailto:user-unsubscr...@spark.apache.org>

--
Cheers,
Sonal
https://github.com/zinggAI/zingg <https://github.com/zinggAI/zingg>



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



about memory size for loading file

2022-01-13 Thread frakass

Hello list

Say I have a file whose size is 10 GB. The total RAM of the
cluster is 24 GB across three nodes, so each node has only 8 GB.
If I load this file into Spark as an RDD via the sc.textFile interface, will
this operation run into an "out of memory" issue?


Thank you.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org