[jira] [Commented] (SPARK-38161) When cleaning data, hope to split one DataFrame or Dataset into two DataFrames

2022-10-17 Thread gaokui (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619262#comment-17619262
 ] 

gaokui commented on SPARK-38161:


Yes, I can use two DataFrames, one with the filter and one with its negation.
But when I face two thousand conditions, I need to write four thousand filters,
and the resulting DAG job is terrible, with poor performance.
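
One way to get every split in a single job is to derive a label column and let
the writer partition the output on it. A minimal sketch, assuming hypothetical
rule names and output paths that are not from this ticket:

import org.apache.spark.sql.functions.{col, lit, when}

// Tag each row with the name of the first matching cleaning rule;
// rows matching no rule fall into the "ok" bucket.
val labeled = dataframe.withColumn("rule",
  when(col("column").isNull, lit("null_column"))
    .when(col("column") === "", lit("empty_column"))
    .otherwise(lit("ok")))

// One scan, one job: each rule value becomes its own output directory,
// e.g. /data/cleaned/rule=null_column and /data/cleaned/rule=ok.
labeled.write.partitionBy("rule").parquet("/data/cleaned")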

> When cleaning data, hope to split one DataFrame or Dataset into two DataFrames
> -------------------------------------------------------------------------------
>
> Key: SPARK-38161
> URL: https://issues.apache.org/jira/browse/SPARK-38161
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> While processing a data clean, I meet such a scene:
> one column needs to be judged by an empty-or-null condition,
> so right now I do something similar to the following code:
> df1 = dataframe.filter("column IS NULL")
> df2 = dataframe.filter("column IS NOT NULL")
> and then write df1 and df2 into HDFS Parquet files.
> But when I have a thousand conditions, every job needs more stages.
> I hope a DataFrame can be filtered by one condition once, not twice, and that
> this can generate two DataFrames.
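
Until such an API exists, the usual workaround is to pay for the scan once by
caching, so the two complementary filters share one read. A minimal sketch,
assuming hypothetical output paths:

import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Materialize the source once so both branches reuse it instead of
// rescanning the input.
val cached = dataframe.persist(StorageLevel.MEMORY_AND_DISK)
cached.filter(col("column").isNull).write.parquet("/out/nulls")
cached.filter(col("column").isNotNull).write.parquet("/out/non_nulls")
cached.unpersist()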






[jira] [Reopened] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to the cleaning rule

2022-04-18 Thread gaokui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaokui reopened SPARK-38812:


You can see my attachment.

> When I clean data, I hope one RDD splits into two RDDs according to the cleaning rule
> --------------------------------------------------------------------------------------
>
> Key: SPARK-38812
> URL: https://issues.apache.org/jira/browse/SPARK-38812
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> When I do data cleaning, one RDD filters data by one value (> or <) and then
> generates two different sets: one is the error-data file, the other is the
> error-free data file.
> Now I use filter, but this filter needs two Spark DAG jobs, which costs too
> much.
> Exactly, some code like iterator.span(predicate) that then returns one
> tuple (iter1, iter2).
> One dataset will be split into two datasets in a one-rule data-cleaning pass.
> I hope to compute once, not twice.
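
A side note on the span suggestion above: on a plain Scala iterator, span cuts
at the first element that fails the predicate, while partition groups by the
predicate across the whole input; the latter is what a cleaning split needs.
A small illustrative example:

val (prefix, rest) = Iterator(1, 5, 2, 6).span(_ <= 3)      // prefix: 1;    rest: 5, 2, 6
val (low, high)    = Iterator(1, 5, 2, 6).partition(_ <= 3) // low:    1, 2; high: 5, 6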






[jira] [Comment Edited] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to the cleaning rule

2022-04-18 Thread gaokui (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523476#comment-17523476
 ] 

gaokui edited comment on SPARK-38812 at 4/18/22 8:05 AM:
-

I see SPARK-2373 and SPARK-6664.

Actually I can get a method that uses one DAG job to calculate, not two.

For example:

val intRDD = sc.makeRDD(Array(1, 2, 3, 4, 5, 6))
intRDD.foreachPartition { iter =>
  val (it1, it2) = iter.partition(x => x <= 3)
  // Right here I cannot use rdd.saveAsTextFile; I need to write a store
  // policy with an interval time and a writing size.
  saveQualityError(it1)
  saveQualityGood(it2)
  // A more serious problem is the short-bucket effect: in one partition the
  // good data is less and the bad data is more, so one write method will
  // wait for the other.
}

But this method means I cannot reuse a lot of the RDD API; I have to copy code
the RDD already includes, such as flushing the writer to HDFS.
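
A hedged sketch of what saveQualityError/saveQualityGood could look like as a
per-partition writer, assuming hypothetical HDFS output paths /clean/bad and
/clean/good:

import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext

intRDD.foreachPartition { iter =>
  val (bad, good) = iter.partition(_ <= 3)
  val fs = FileSystem.get(new Configuration())
  val part = TaskContext.getPartitionId()
  // Each task writes its own part file. Note that draining one side of
  // Iterator.partition first forces the other side to buffer in memory,
  // which is exactly the skew ("short bucket") concern raised above.
  def save(dir: String, data: Iterator[Int]): Unit = {
    val out = new PrintWriter(fs.create(new Path(s"$dir/part-$part")))
    try data.foreach(n => out.println(n)) finally out.close()
  }
  save("/clean/bad", bad)
  save("/clean/good", good)
}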


was (Author: sungaok):
I see SPARK-2373 and SPARK-6664.

Actually I can get a better method than those two: use one job to calculate,
not two.

For example:

val intRDD = sc.makeRDD(Array(1, 2, 3, 4, 5, 6))
intRDD.foreachPartition { iter =>
  val (it1, it2) = iter.partition(x => x <= 3)
  // Right here I cannot use rdd.saveAsTextFile; I need to write a store
  // policy with an interval time and a writing size.
  saveQualityError(it1)
  saveQualityGood(it2)
  // A more serious problem is the short-bucket effect: in one partition the
  // good data is less and the bad data is more, so one write method will
  // wait for the other.
}

> When I clean data, I hope one RDD splits into two RDDs according to the cleaning rule
> --------------------------------------------------------------------------------------
>
> Key: SPARK-38812
> URL: https://issues.apache.org/jira/browse/SPARK-38812
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> When I do data cleaning, one RDD filters data by one value (> or <) and then
> generates two different sets: one is the error-data file, the other is the
> error-free data file.
> Now I use filter, but this filter needs two Spark DAG jobs, which costs too
> much.
> Exactly, some code like iterator.span(predicate) that then returns one
> tuple (iter1, iter2).
> One dataset will be split into two datasets in a one-rule data-cleaning pass.
> I hope to compute once, not twice.






[jira] [Commented] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to the cleaning rule

2022-04-17 Thread gaokui (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523476#comment-17523476
 ] 

gaokui commented on SPARK-38812:


I see SPARK-2373 and SPARK-6664.

Actually I can get a better method than those two: use one job to calculate,
not two.

For example:

val intRDD = sc.makeRDD(Array(1, 2, 3, 4, 5, 6))
intRDD.foreachPartition { iter =>
  val (it1, it2) = iter.partition(x => x <= 3)
  // Right here I cannot use rdd.saveAsTextFile; I need to write a store
  // policy with an interval time and a writing size.
  saveQualityError(it1)
  saveQualityGood(it2)
  // A more serious problem is the short-bucket effect: in one partition the
  // good data is less and the bad data is more, so one write method will
  // wait for the other.
}

> When I clean data, I hope one RDD splits into two RDDs according to the cleaning rule
> --------------------------------------------------------------------------------------
>
> Key: SPARK-38812
> URL: https://issues.apache.org/jira/browse/SPARK-38812
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> When I do data cleaning, one RDD filters data by one value (> or <) and then
> generates two different sets: one is the error-data file, the other is the
> error-free data file.
> Now I use filter, but this filter needs two Spark DAG jobs, which costs too
> much.
> Exactly, some code like iterator.span(predicate) that then returns one
> tuple (iter1, iter2).
> One dataset will be split into two datasets in a one-rule data-cleaning pass.
> I hope to compute once, not twice.






[jira] [Updated] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to the cleaning rule

2022-04-17 Thread gaokui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaokui updated SPARK-38812:
---
Description: 
When I do data cleaning, one RDD filters data by one value (> or <) and then
generates two different sets: one is the error-data file, the other is the
error-free data file.

Now I use filter, but this filter needs two Spark DAG jobs, which costs too
much.

Exactly, some code like iterator.span(predicate) that then returns one
tuple (iter1, iter2).

One dataset will be split into two datasets in a one-rule data-cleaning pass.

I hope to compute once, not twice.

  was:
When I do data cleaning, one RDD filters data by one value (> or <) and then
generates two different sets: one is the error-data file, the other is the
error-free data file.

Now I use filter, but this filter needs two Spark DAG jobs, which costs too
much.

Exactly, some code like iterator.span(predicate) that then returns one
tuple (iter1, iter2).


> When I clean data, I hope one RDD splits into two RDDs according to the cleaning rule
> --------------------------------------------------------------------------------------
>
> Key: SPARK-38812
> URL: https://issues.apache.org/jira/browse/SPARK-38812
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> When I do data cleaning, one RDD filters data by one value (> or <) and then
> generates two different sets: one is the error-data file, the other is the
> error-free data file.
> Now I use filter, but this filter needs two Spark DAG jobs, which costs too
> much.
> Exactly, some code like iterator.span(predicate) that then returns one
> tuple (iter1, iter2).
> One dataset will be split into two datasets in a one-rule data-cleaning pass.
> I hope to compute once, not twice.






[jira] [Updated] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to the cleaning rule

2022-04-06 Thread gaokui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaokui updated SPARK-38812:
---
Description: 
When I do data cleaning, one RDD filters data by one value (> or <) and then
generates two different sets: one is the error-data file, the other is the
error-free data file.

Now I use filter, but this filter needs two Spark DAG jobs, which costs too
much.

Exactly, some code like iterator.span(predicate) that then returns one
tuple (iter1, iter2).

  was:
When I do data cleaning, I hope the RDD filters data by one value (> or <) and
then generates two different sets: one is the error-data file, the other is the
correct-data file.

Now I use filter, but this filter needs two Spark DAG jobs, which costs too
much.

Exactly, some code like iterator.span(predicate) that then returns one
tuple (iter1, iter2).


> When I clean data, I hope one RDD splits into two RDDs according to the cleaning rule
> --------------------------------------------------------------------------------------
>
> Key: SPARK-38812
> URL: https://issues.apache.org/jira/browse/SPARK-38812
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> When I do data cleaning, one RDD filters data by one value (> or <) and then
> generates two different sets: one is the error-data file, the other is the
> error-free data file.
> Now I use filter, but this filter needs two Spark DAG jobs, which costs too
> much.
> Exactly, some code like iterator.span(predicate) that then returns one
> tuple (iter1, iter2).






[jira] [Updated] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to the cleaning rule

2022-04-06 Thread gaokui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaokui updated SPARK-38812:
---
Summary: When I clean data, I hope one RDD splits into two RDDs according
to the cleaning rule  (was: RDD split into two RDDs)

> When I clean data, I hope one RDD splits into two RDDs according to the cleaning rule
> --------------------------------------------------------------------------------------
>
> Key: SPARK-38812
> URL: https://issues.apache.org/jira/browse/SPARK-38812
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> When I do data cleaning, I hope the RDD filters data by one value (> or <) and
> then generates two different sets: one is the error-data file, the other is
> the correct-data file.
> Now I use filter, but this filter needs two Spark DAG jobs, which costs too
> much.
> Exactly, some code like iterator.span(predicate) that then returns one
> tuple (iter1, iter2).






[jira] [Updated] (SPARK-38812) RDD split into two RDDs

2022-04-06 Thread gaokui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaokui updated SPARK-38812:
---
Description: 
When I do data cleaning, I hope the RDD filters data by one value (> or <) and
then generates two different sets: one is the error-data file, the other is the
correct-data file.

Now I use filter, but this filter needs two Spark DAG jobs, which costs too
much.

Exactly, some code like iterator.span(predicate) that then returns one
tuple (iter1, iter2).
Summary: RDD split into two RDDs  (was: RDD split: when I do data
cleaning, I hope the RDD filters data by one value (> or <) into two different
sets, one writes an error file, another is)

> RDD split into two RDDs
> ------------------------
>
> Key: SPARK-38812
> URL: https://issues.apache.org/jira/browse/SPARK-38812
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> When I do data cleaning, I hope the RDD filters data by one value (> or <) and
> then generates two different sets: one is the error-data file, the other is
> the correct-data file.
> Now I use filter, but this filter needs two Spark DAG jobs, which costs too
> much.
> Exactly, some code like iterator.span(predicate) that then returns one
> tuple (iter1, iter2).






[jira] [Created] (SPARK-38812) RDD split: when I do data cleaning, I hope the RDD filters data by one value (> or <) into two different sets, one writes an error file, another is

2022-04-06 Thread gaokui (Jira)
gaokui created SPARK-38812:
--

 Summary: RDD split: when I do data cleaning, I hope the RDD filters data
by one value (> or <) into two different sets, one writes an error file, another is
 Key: SPARK-38812
 URL: https://issues.apache.org/jira/browse/SPARK-38812
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.2.1
Reporter: gaokui









[jira] [Commented] (SPARK-38161) When cleaning data, hope to split one DataFrame or Dataset into two DataFrames

2022-02-10 Thread gaokui (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490625#comment-17490625
 ] 

gaokui commented on SPARK-38161:


Do you mean that it needs two transformations in one Spark DAG?

Could the pseudocode be like
List lsfd = dataframe.filter(cond)
where this list includes two lists?
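
A hypothetical signature for that idea, not an API Spark provides; as written
it would still cost two scans unless the source is cached:

import org.apache.spark.sql.{Column, DataFrame}

// Split one DataFrame by a single predicate into the matching and
// non-matching halves.
def splitBy(df: DataFrame, cond: Column): (DataFrame, DataFrame) =
  (df.filter(cond), df.filter(!cond))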

> When cleaning data, hope to split one DataFrame or Dataset into two DataFrames
> -------------------------------------------------------------------------------
>
> Key: SPARK-38161
> URL: https://issues.apache.org/jira/browse/SPARK-38161
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> While processing a data clean, I meet such a scene:
> one column needs to be judged by an empty-or-null condition,
> so right now I do something similar to the following code:
> df1 = dataframe.filter("column IS NULL")
> df2 = dataframe.filter("column IS NOT NULL")
> and then write df1 and df2 into HDFS Parquet files.
> But when I have a thousand conditions, every job needs more stages.
> I hope a DataFrame can be filtered by one condition once, not twice, and that
> this can generate two DataFrames.






[jira] [Commented] (SPARK-38161) When cleaning data, hope to split one DataFrame or Dataset into two DataFrames

2022-02-10 Thread gaokui (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490606#comment-17490606
 ] 

gaokui commented on SPARK-38161:


OK, but why not use one condition and get two DataFrames, or a DataFrame list,
in one statement?

> When cleaning data, hope to split one DataFrame or Dataset into two DataFrames
> -------------------------------------------------------------------------------
>
> Key: SPARK-38161
> URL: https://issues.apache.org/jira/browse/SPARK-38161
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> While processing a data clean, I meet such a scene:
> one column needs to be judged by an empty-or-null condition,
> so right now I do something similar to the following code:
> df1 = dataframe.filter("column IS NULL")
> df2 = dataframe.filter("column IS NOT NULL")
> and then write df1 and df2 into HDFS Parquet files.
> But when I have a thousand conditions, every job needs more stages.
> I hope a DataFrame can be filtered by one condition once, not twice, and that
> this can generate two DataFrames.






[jira] [Commented] (SPARK-38161) When cleaning data, hope to split one DataFrame or Dataset into two DataFrames

2022-02-09 Thread gaokui (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489971#comment-17489971
 ] 

gaokui commented on SPARK-38161:


Do you mean writing with 'df.write.partitionBy("col")' in Parquet format, or
'df.repartition'?

Could you provide more detail?

Thanks!

> When cleaning data, hope to split one DataFrame or Dataset into two DataFrames
> -------------------------------------------------------------------------------
>
> Key: SPARK-38161
> URL: https://issues.apache.org/jira/browse/SPARK-38161
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: gaokui
>Priority: Major
>
> While processing a data clean, I meet such a scene:
> one column needs to be judged by an empty-or-null condition,
> so right now I do something similar to the following code:
> df1 = dataframe.filter("column IS NULL")
> df2 = dataframe.filter("column IS NOT NULL")
> and then write df1 and df2 into HDFS Parquet files.
> But when I have a thousand conditions, every job needs more stages.
> I hope a DataFrame can be filtered by one condition once, not twice, and that
> this can generate two DataFrames.






[jira] [Created] (SPARK-38161) When cleaning data, hope to split one DataFrame or Dataset into two DataFrames

2022-02-09 Thread gaokui (Jira)
gaokui created SPARK-38161:
--

 Summary: When cleaning data, hope to split one DataFrame or Dataset
into two DataFrames
 Key: SPARK-38161
 URL: https://issues.apache.org/jira/browse/SPARK-38161
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager
Affects Versions: 3.2.1
Reporter: gaokui


While processing a data clean, I meet such a scene:

one column needs to be judged by an empty-or-null condition,

so right now I do something similar to the following code:

df1 = dataframe.filter("column IS NULL")

df2 = dataframe.filter("column IS NOT NULL")

and then write df1 and df2 into HDFS Parquet files.

But when I have a thousand conditions, every job needs more stages.

I hope a DataFrame can be filtered by one condition once, not twice, and that
this can generate two DataFrames.






[jira] [Commented] (SPARK-32341) add multiple filter in RDD function

2020-07-29 Thread gaokui (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167043#comment-17167043
 ] 

gaokui commented on SPARK-32341:


Yes, I can do that. But in that situation I need to create a Kafka topic for
every single dataset, and I have over 1000 datasets, so that would create lots
of Kafka topics. I must then launch the same number of Spark jobs, which would
also exceed 1000. In that situation it is a crazy job to manage and to allocate
machine CPU and memory.

So I need this multiple-filter feature to solve all of these problems.

Thanks

> add multiple filter in RDD function
> -----------------------------------
>
> Key: SPARK-32341
> URL: https://issues.apache.org/jira/browse/SPARK-32341
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.0
>Reporter: gaokui
>Priority: Major
>
> When I use Spark RDDs I often read Kafka data, and the Kafka data holds many
> kinds of datasets.
> I filter these RDDs by Kafka key, so I can use an Array[RDD] to hold every
> topic's RDD.
> But then rdd.filter will generate more than one stage; the data will be
> processed by many tasks, which consumes too much time, and it is not
> necessary.
> I hope to add a multiple-filter function, not rdd.filter, that returns an
> Array[RDD] in one stage by dividing the whole mixed-data RDD into
> single-dataset RDDs.
> A function like Array[RDD] = rdd.multiplefilter(setcondition).
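
Spark has no built-in Array[RDD]-returning filter, but a hypothetical
multiFilter helper can approximate the goal by materializing the mixed RDD once
and slicing it per predicate. A minimal sketch under that assumption:

import org.apache.spark.rdd.RDD

// One pass reads and caches the mixed batch; each predicate then filters
// the cached copy instead of re-reading the Kafka source.
def multiFilter[T](rdd: RDD[T], preds: Seq[T => Boolean]): Seq[RDD[T]] = {
  val cached = rdd.cache()
  preds.map(p => cached.filter(p))
}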






[jira] [Updated] (SPARK-32341) add multiple filter in RDD function

2020-07-22 Thread gaokui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaokui updated SPARK-32341:
---
Affects Version/s: 3.0.0
  Description: 
When I use Spark RDDs I often read Kafka data, and the Kafka data holds many
kinds of datasets.

I filter these RDDs by Kafka key, so I can use an Array[RDD] to hold every
topic's RDD.

But then rdd.filter will generate more than one stage; the data will be
processed by many tasks, which consumes too much time, and it is not necessary.

I hope to add a multiple-filter function, not rdd.filter, that returns an
Array[RDD] in one stage by dividing the whole mixed-data RDD into
single-dataset RDDs.

A function like Array[RDD] = rdd.multiplefilter(setcondition).

  was:
When I use Spark RDDs I often read Kafka data,

but the Kafka data holds many kinds of datasets.

When I use rdd.filter, that will generate more stages.

I hope to add a multiple-filter function, not rdd.filter, that returns all the
single datasets in one stage.

Like Array[RDD] = rdd.mutiplefilter(setcondition).

  Summary: add multiple filter in RDD function  (was: wish to add
multiple filter in RDD function)

> add multiple filter in RDD function
> -----------------------------------
>
> Key: SPARK-32341
> URL: https://issues.apache.org/jira/browse/SPARK-32341
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.0
>Reporter: gaokui
>Priority: Major
>
> When I use Spark RDDs I often read Kafka data, and the Kafka data holds many
> kinds of datasets.
> I filter these RDDs by Kafka key, so I can use an Array[RDD] to hold every
> topic's RDD.
> But then rdd.filter will generate more than one stage; the data will be
> processed by many tasks, which consumes too much time, and it is not
> necessary.
> I hope to add a multiple-filter function, not rdd.filter, that returns an
> Array[RDD] in one stage by dividing the whole mixed-data RDD into
> single-dataset RDDs.
> A function like Array[RDD] = rdd.multiplefilter(setcondition).






[jira] [Updated] (SPARK-32341) wish to add multiple filter in RDD function

2020-07-16 Thread gaokui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaokui updated SPARK-32341:
---
Description: 
When I use Spark RDDs I often read Kafka data,

but the Kafka data holds many kinds of datasets.

When I use rdd.filter, that will generate more stages.

I hope to add a multiple-filter function, not rdd.filter, that returns all the
single datasets in one stage.

Like Array[RDD] = rdd.mutiplefilter(setcondition).

  was:
When I use Spark RDDs I often read Kafka data,

but the Kafka data holds many kinds of datasets.

When I use rdd.filter, that will generate more stages.

I hope to add a multiple-filter function, not rdd.filter, that returns all the
single datasets in one stage.


> wish to add multiple filter in RDD function
> -------------------------------------------
>
> Key: SPARK-32341
> URL: https://issues.apache.org/jira/browse/SPARK-32341
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: gaokui
>Priority: Major
>
> When I use Spark RDDs I often read Kafka data,
> but the Kafka data holds many kinds of datasets.
> When I use rdd.filter, that will generate more stages.
> I hope to add a multiple-filter function, not rdd.filter, that returns all
> the single datasets in one stage.
> Like Array[RDD] = rdd.mutiplefilter(setcondition).






[jira] [Updated] (SPARK-32341) wish to add multiple filter in RDD function

2020-07-16 Thread gaokui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaokui updated SPARK-32341:
---
Description: 
When I use Spark RDDs I often read Kafka data,

but the Kafka data holds many kinds of datasets.

When I use rdd.filter, that will generate more stages.

I hope to add a multiple-filter function, not rdd.filter, that returns all the
single datasets in one stage.

  was:
When I use Spark RDDs I often read Kafka data,

but the Kafka data holds many kinds of datasets.

When I use rdd.filter, that will generate more stages.

I hope rdd.filter in one stage returns every filtered single dataset.


> wish to add multiple filter in RDD function
> -------------------------------------------
>
> Key: SPARK-32341
> URL: https://issues.apache.org/jira/browse/SPARK-32341
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: gaokui
>Priority: Major
>
> When I use Spark RDDs I often read Kafka data,
> but the Kafka data holds many kinds of datasets.
> When I use rdd.filter, that will generate more stages.
> I hope to add a multiple-filter function, not rdd.filter, that returns all
> the single datasets in one stage.






[jira] [Updated] (SPARK-32341) wish to add multiple filter in RDD function

2020-07-16 Thread gaokui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaokui updated SPARK-32341:
---
Issue Type: New Feature  (was: Bug)

> wish to add multiple filter in RDD function
> -------------------------------------------
>
> Key: SPARK-32341
> URL: https://issues.apache.org/jira/browse/SPARK-32341
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: gaokui
>Priority: Major
>
> When I use Spark RDDs I often read Kafka data,
> but the Kafka data holds many kinds of datasets.
> When I use rdd.filter, that will generate more stages.
> I hope rdd.filter in one stage returns every filtered single dataset.






[jira] [Created] (SPARK-32341) wish to add multiple filter in RDD function

2020-07-16 Thread gaokui (Jira)
gaokui created SPARK-32341:
--

 Summary: wish to add multiple filter in RDD function
 Key: SPARK-32341
 URL: https://issues.apache.org/jira/browse/SPARK-32341
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.6
Reporter: gaokui


When I use Spark RDDs I often read Kafka data,

but the Kafka data holds many kinds of datasets.

When I use rdd.filter, that will generate more stages.

I hope rdd.filter in one stage returns every filtered single dataset.


