[jira] [Commented] (SPARK-38161) When cleaning data, hope to split one DataFrame or Dataset into two DataFrames
[ https://issues.apache.org/jira/browse/SPARK-38161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619262#comment-17619262 ]

gaokui commented on SPARK-38161:

Yes, I can use two DataFrames, one with the filter and one with its negation. But when I face two thousand conditions, I need to write four thousand filters, and the result is a terrible DAG with poor performance.

> When cleaning data, hope to split one DataFrame or Dataset into two DataFrames
>
>                 Key: SPARK-38161
>                 URL: https://issues.apache.org/jira/browse/SPARK-38161
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: gaokui
>            Priority: Major
>
> When I am cleaning data, I meet the following scene: one column needs to be
> judged by an empty-or-null condition, so right now I write code similar to:
>
>     df1 = dataframe.filter("column IS NULL")
>     df2 = dataframe.filter("column IS NOT NULL")
>
> and then write df1 and df2 into HDFS as Parquet files. But when I have a
> thousand conditions, every job needs more stages. I hope a DataFrame can be
> filtered by one condition once, not twice, and generate two DataFrames.
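[Editor's note: a minimal sketch of one way to get both halves written in a single job today: tag each row once with a bucket column and let partitionBy route the buckets to separate directories on write. Column and path names are illustrative, not from the ticket.]

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, when}

    val spark = SparkSession.builder().appName("null-split").getOrCreate()
    val dataframe = spark.read.parquet("/tmp/input")  // hypothetical input path

    dataframe
      // One pass tags each row instead of filtering the source twice.
      .withColumn("bucket", when(col("column").isNull, "isnull").otherwise("notnull"))
      .write
      // One write job; rows land in /tmp/output/bucket=isnull/ and bucket=notnull/.
      .partitionBy("bucket")
      .parquet("/tmp/output")

This scans the input once and writes both subsets in one job; the trade-off is that the split lives in the output directory layout rather than in two separate DataFrame handles.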
[jira] [Reopened] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to a data-cleaning rule
[ https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaokui reopened SPARK-38812:

You can see my attachment.

> When I clean data, I hope one RDD splits into two RDDs according to a data-cleaning rule
>
>                 Key: SPARK-38812
>                 URL: https://issues.apache.org/jira/browse/SPARK-38812
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 3.2.1
>            Reporter: gaokui
>            Priority: Major
>
> When I clean data, one RDD is filtered according to one value (> or <) and
> then generates two different sets: one is an error-data file, the other an
> error-free data file. Now I use filter, but that takes two Spark DAG jobs,
> which costs too much. Exactly, some code like iterator.span(predicate) that
> returns a tuple (iter1, iter2): one dataset would be split into two datasets
> in a single rule-based data-cleaning pass. I hope to compute once, not twice.
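[Editor's note: the iterator-level split the description asks for can be sketched with plain Scala's Iterator.partition inside mapPartitions. A sketch only, assuming sc is an existing SparkContext: partition may buffer one side while the other is consumed, and both halves come back tagged in a single RDD rather than as two RDDs.]

    val intRDD = sc.makeRDD(Array(1, 2, 3, 4, 5, 6))

    // One pass per partition: split the iterator by the predicate, then tag
    // each half so a downstream keyed writer can tell them apart.
    val tagged = intRDD.mapPartitions { iter =>
      val (errors, good) = iter.partition(_ <= 3)
      errors.map(x => ("error", x)) ++ good.map(x => ("good", x))
    }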
[jira] [Comment Edited] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to a data-cleaning rule
[ https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523476#comment-17523476 ]

gaokui edited comment on SPARK-38812 at 4/18/22 8:05 AM:
-

I see SPARK-2373 and SPARK-6664. Actually I can get a method that uses a single DAG job to compute, not two. For example:

    val intRDD = sc.makeRDD(Array(1, 2, 3, 4, 5, 6))
    intRDD.foreachPartition { iter =>
      val (it1, it2) = iter.partition(x => x <= 3)
      saveQualityError(it1) // but here I cannot use rdd.saveAsTextFile; I need to write my own store policy with an interval time and a write size
      saveQualityGood(it2)  // same problem as above
      // A more serious problem is the short-bucket effect: in one partition the
      // good data may be small and the bad data large, so one write method
      // will wait for the other.
    }

But with this method I cannot use a lot of the API the RDD already includes; I have to re-copy that code, like flushing a writer to HDFS.

was (Author: sungaok):

I see SPARK-2373 and SPARK-6664. Actually I can get a better method than those two: use a single job to compute, not two. For example:

    val intRDD = sc.makeRDD(Array(1, 2, 3, 4, 5, 6))
    intRDD.foreachPartition { iter =>
      val (it1, it2) = iter.partition(x => x <= 3)
      saveQualityError(it1) // but here I cannot use rdd.saveAsTextFile; I need to write my own store policy with an interval time and a write size
      saveQualityGood(it2)  // same problem as above
      // A more serious problem is the short-bucket effect: in one partition the
      // good data may be small and the bad data large, so one write method
      // will wait for the other.
    }
[jira] [Commented] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to a data-cleaning rule
[ https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523476#comment-17523476 ]

gaokui commented on SPARK-38812:

I see SPARK-2373 and SPARK-6664. Actually I can get a better method than those two: use a single job to compute, not two. For example:

    val intRDD = sc.makeRDD(Array(1, 2, 3, 4, 5, 6))
    intRDD.foreachPartition { iter =>
      val (it1, it2) = iter.partition(x => x <= 3)
      saveQualityError(it1) // but here I cannot use rdd.saveAsTextFile; I need to write my own store policy with an interval time and a write size
      saveQualityGood(it2)  // same problem as above
      // A more serious problem is the short-bucket effect: in one partition the
      // good data may be small and the bad data large, so one write method
      // will wait for the other.
    }
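[Editor's note: one existing way to get per-bucket files out of a single RDD job, without hand-rolling writers inside foreachPartition, is the old Hadoop MultipleTextOutputFormat: key each record by its bucket and let the output format route keys to separate files. A sketch under the same x <= 3 predicate, assuming sc is an existing SparkContext; the output path is illustrative.]

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Route each key to its own subdirectory under the output path.
    class KeyAsDirOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
      // Drop the key from the file contents; only the value is written.
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    val intRDD = sc.makeRDD(Array(1, 2, 3, 4, 5, 6))
    intRDD
      .map(x => (if (x <= 3) "error" else "good", x.toString))
      .saveAsHadoopFile("/tmp/split", classOf[String], classOf[String],
        classOf[KeyAsDirOutputFormat])

This writes error/ and good/ files in one job and reuses Spark's own HDFS writing, at the cost of going through the legacy mapred OutputFormat API.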
[jira] [Updated] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to a data-cleaning rule
[ https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaokui updated SPARK-38812:
---
Description:
When I clean data, one RDD is filtered according to one value (> or <) and then generates two different sets: one is an error-data file, the other an error-free data file. Now I use filter, but that takes two Spark DAG jobs, which costs too much. Exactly, some code like iterator.span(predicate) that returns a tuple (iter1, iter2): one dataset would be split into two datasets in a single rule-based data-cleaning pass. I hope to compute once, not twice.

was:
When I clean data, one RDD is filtered according to one value (> or <) and then generates two different sets: one is an error-data file, the other an error-free data file. Now I use filter, but that takes two Spark DAG jobs, which costs too much. Exactly, some code like iterator.span(predicate) that returns a tuple (iter1, iter2).
[jira] [Updated] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to a data-cleaning rule
[ https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaokui updated SPARK-38812:
---
Description:
When I clean data, one RDD is filtered according to one value (> or <) and then generates two different sets: one is an error-data file, the other an error-free data file. Now I use filter, but that takes two Spark DAG jobs, which costs too much. Exactly, some code like iterator.span(predicate) that returns a tuple (iter1, iter2).

was:
When I clean data, I hope an RDD can be filtered according to one value (> or <) to generate two different sets: one is an error-data file, the other a good-data file. Now I use filter, but that takes two Spark DAG jobs, which costs too much. Exactly, some code like iterator.span(predicate) that returns a tuple (iter1, iter2).
[jira] [Updated] (SPARK-38812) When I clean data, I hope one RDD splits into two RDDs according to a data-cleaning rule
[ https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaokui updated SPARK-38812:
---
Summary: When I clean data, I hope one RDD splits into two RDDs according to a data-cleaning rule (was: Split one RDD into two RDDs)
[jira] [Updated] (SPARK-38812) Split one RDD into two RDDs
[ https://issues.apache.org/jira/browse/SPARK-38812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaokui updated SPARK-38812:
---
Description:
When I clean data, I hope an RDD can be filtered according to one value (> or <) to generate two different sets: one is an error-data file, the other a good-data file. Now I use filter, but that takes two Spark DAG jobs, which costs too much. Exactly, some code like iterator.span(predicate) that returns a tuple (iter1, iter2).

Summary: Split one RDD into two RDDs (was: RDD split: when cleaning data, I hope an RDD can be filtered by one value (> or <) into two different sets, one written to an error file, the other to a good-data file)
[jira] [Created] (SPARK-38812) RDD split: when cleaning data, I hope an RDD can be filtered by one value (> or <) into two different sets, one written to an error file, the other to a good-data file
gaokui created SPARK-38812:
--
             Summary: RDD split: when cleaning data, I hope an RDD can be filtered by one value (> or <) into two different sets, one written to an error file, the other to a good-data file
                 Key: SPARK-38812
                 URL: https://issues.apache.org/jira/browse/SPARK-38812
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 3.2.1
            Reporter: gaokui
[jira] [Commented] (SPARK-38161) When cleaning data, hope to split one DataFrame or Dataset into two DataFrames
[ https://issues.apache.org/jira/browse/SPARK-38161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490625#comment-17490625 ]

gaokui commented on SPARK-38161:

Do you mean two transformations are needed in one Spark DAG? Could the pseudocode look like:

    List[DataFrame] lsfd = dataframe.filter(cond)

where the returned list contains two DataFrames?
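[Editor's note: a hypothetical helper with roughly that shape, one call returning several DataFrames, can be approximated today by caching the input so every filter reuses one evaluation of the upstream lineage. multiFilter and its signature are illustrative, not an existing Spark API, and each returned DataFrame still triggers its own job when acted on.]

    import org.apache.spark.sql.{Column, DataFrame}

    def multiFilter(df: DataFrame, conds: Seq[Column]): Seq[DataFrame] = {
      val cached = df.cache()  // evaluate the source once, reuse for all filters
      conds.map(cond => cached.filter(cond))
    }

For the null/not-null case above: multiFilter(dataframe, Seq(col("c").isNull, col("c").isNotNull)).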
[jira] [Commented] (SPARK-38161) When cleaning data, hope to split one DataFrame or Dataset into two DataFrames
[ https://issues.apache.org/jira/browse/SPARK-38161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490606#comment-17490606 ]

gaokui commented on SPARK-38161:

OK, but why not allow one condition to produce two DataFrames, or a list of DataFrames, in one statement?
[jira] [Commented] (SPARK-38161) When cleaning data, hope to split one DataFrame or Dataset into two DataFrames
[ https://issues.apache.org/jira/browse/SPARK-38161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489971#comment-17489971 ]

gaokui commented on SPARK-38161:

Do you mean something like write.mode(...).parquet(...) with partitionBy("col"), or df.repartition? Could you provide more detail? Thanks!
[jira] [Created] (SPARK-38161) When cleaning data, hope to split one DataFrame or Dataset into two DataFrames
gaokui created SPARK-38161:
--
             Summary: When cleaning data, hope to split one DataFrame or Dataset into two DataFrames
                 Key: SPARK-38161
                 URL: https://issues.apache.org/jira/browse/SPARK-38161
             Project: Spark
          Issue Type: New Feature
          Components: Block Manager
    Affects Versions: 3.2.1
            Reporter: gaokui

When I am cleaning data, I meet the following scene: one column needs to be judged by an empty-or-null condition, so right now I write code similar to:

    df1 = dataframe.filter("column IS NULL")
    df2 = dataframe.filter("column IS NOT NULL")

and then write df1 and df2 into HDFS as Parquet files. But when I have a thousand conditions, every job needs more stages. I hope a DataFrame can be filtered by one condition once, not twice, and generate two DataFrames.
[jira] [Commented] (SPARK-32341) Add a multiple-filter function to RDD
[ https://issues.apache.org/jira/browse/SPARK-32341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167043#comment-17167043 ]

gaokui commented on SPARK-32341:

Yes, I could do that. But in that situation I would need to create a Kafka topic for every single dataset, and I have over 1000 datasets, so that would create a lot of Kafka topics. I would then have to launch the same number of Spark jobs, also over 1000. In that situation it is a crazy job to manage them and allocate machine CPU and memory, so I need this multiple-filter feature to solve all these problems. Thanks.

> Add a multiple-filter function to RDD
>
>                 Key: SPARK-32341
>                 URL: https://issues.apache.org/jira/browse/SPARK-32341
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: gaokui
>            Priority: Major
>
> When I use Spark RDDs, I often read Kafka data, and the Kafka data mixes
> many kinds of datasets. I filter the RDD by Kafka key so I can fill an
> Array[RDD], one RDD per topic. But using rdd.filter generates more than one
> stage: the data is processed by many tasks, which consumes too much time and
> is not necessary. I hope to add a multiple-filter function, not rdd.filter,
> that returns Array[RDD] in one stage by dividing the mixed RDD into
> single-dataset RDDs. A function like Array[RDD] = rdd.multiplefilter(setcondition).
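[Editor's note: absent such an API, one single-job approximation is to batch-read the mixed records once with the Kafka source (requires the spark-sql-kafka package), keep the Kafka key as a column, and let partitionBy on write fan records out per key. A sketch only; broker, topic, and output path are placeholders, and the split ends up as directories rather than an Array[RDD].]

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-fanout").getOrCreate()

    // Batch-read the mixed topic once; the key identifies which dataset a record belongs to.
    val records = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "mixed-topic")               // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING) AS k", "CAST(value AS STRING) AS v")

    // One job; each distinct key gets its own k=<key>/ subdirectory.
    records.write
      .partitionBy("k")
      .parquet("/tmp/by-key") // placeholder output path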
[jira] [Updated] (SPARK-32341) Add a multiple-filter function to RDD
[ https://issues.apache.org/jira/browse/SPARK-32341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaokui updated SPARK-32341:
---
Affects Version/s: 3.0.0

Description:
When I use Spark RDDs, I often read Kafka data, and the Kafka data mixes many kinds of datasets. I filter the RDD by Kafka key so I can fill an Array[RDD], one RDD per topic. But using rdd.filter generates more than one stage: the data is processed by many tasks, which consumes too much time and is not necessary. I hope to add a multiple-filter function, not rdd.filter, that returns Array[RDD] in one stage by dividing the mixed RDD into single-dataset RDDs. A function like Array[RDD] = rdd.multiplefilter(setcondition).

was:
When I use Spark RDDs, I often read Kafka data, but the Kafka data mixes many kinds of datasets. Using rdd.filter generates more stages. I hope to add a multiple-filter function, not rdd.filter, that returns all the single datasets in one stage, like Array[RDD] = rdd.multiplefilter(setcondition).

Summary: Add a multiple-filter function to RDD (was: Wish to add a multiple-filter function to RDD)
[jira] [Updated] (SPARK-32341) Wish to add a multiple-filter function to RDD
[ https://issues.apache.org/jira/browse/SPARK-32341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaokui updated SPARK-32341:
---
Description:
When I use Spark RDDs, I often read Kafka data, but the Kafka data mixes many kinds of datasets. Using rdd.filter generates more stages. I hope to add a multiple-filter function, not rdd.filter, that returns all the single datasets in one stage, like Array[RDD] = rdd.multiplefilter(setcondition).

was:
When I use Spark RDDs, I often read Kafka data, but the Kafka data mixes many kinds of datasets. Using rdd.filter generates more stages. I hope to add a multiple-filter function, not rdd.filter, that returns all the single datasets in one stage.
[jira] [Updated] (SPARK-32341) Wish to add a multiple-filter function to RDD
[ https://issues.apache.org/jira/browse/SPARK-32341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaokui updated SPARK-32341:
---
Description:
When I use Spark RDDs, I often read Kafka data, but the Kafka data mixes many kinds of datasets. Using rdd.filter generates more stages. I hope to add a multiple-filter function, not rdd.filter, that returns all the single datasets in one stage.

was:
When I use Spark RDDs, I often read Kafka data, but the Kafka data mixes many kinds of datasets. Using rdd.filter generates more stages. I hope rdd.filter in one stage can return each filtered single dataset.
[jira] [Updated] (SPARK-32341) Wish to add a multiple-filter function to RDD
[ https://issues.apache.org/jira/browse/SPARK-32341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaokui updated SPARK-32341:
---
Issue Type: New Feature (was: Bug)
[jira] [Created] (SPARK-32341) Wish to add a multiple-filter function to RDD
gaokui created SPARK-32341:
--
             Summary: Wish to add a multiple-filter function to RDD
                 Key: SPARK-32341
                 URL: https://issues.apache.org/jira/browse/SPARK-32341
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.6
            Reporter: gaokui

When I use Spark RDDs, I often read Kafka data, but the Kafka data mixes many kinds of datasets. Using rdd.filter generates more stages. I hope rdd.filter in one stage can return each filtered single dataset.