Hi Santhosh,
My name is not Bipin, its Biplob as is clear from my Signature.
Regarding your question, I have no clue what your map operation is doing on
the grouped data, so I can only suggest you to do :
dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x:
(x[0],x)).reduceByKey
reduceByKey() in this context ?
Santhosh
On Mon, Aug 6, 2018 at 1:20 AM, Biplob Biswas
wrote:
> Hi Santhosh,
>
> If you are not performing any aggregation, then I don't think you can
> replace your groupbykey with a reducebykey, and as I see you are only
> grouping and taking 2 value
Hi Santhosh,
If you are not performing any aggregation, then I don't think you can
replace your groupbykey with a reducebykey, and as I see you are only
grouping and taking 2 values of the result, thus I believe you can't just
replace your groupbykey with that.
Thanks & Regards
Biplob Bi
I am trying to replace groupByKey() with reudceByKey(), I am a pyspark and
python newbie and I am having a hard time figuring out the lambda function
for the reduceByKey() operation.
Here is the code
dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x:
(x[0],x)).groupByKey(25).take(2)
Here
Yeah, you are right. I ran the experiments locally not on YARN.
On Fri, Jul 27, 2018 at 11:54 PM, Vadim Semenov wrote:
> `spark.worker.cleanup.enabled=true` doesn't work for YARN.
> On Fri, Jul 27, 2018 at 8:52 AM dineshdharme
> wrote:
> >
> > I am trying to do few
`spark.worker.cleanup.enabled=true` doesn't work for YARN.
On Fri, Jul 27, 2018 at 8:52 AM dineshdharme wrote:
>
> I am trying to do few (union + reduceByKey) operations on a hiearchical
> dataset in a iterative fashion in rdd. The first few loops run fine but on
> the subs
I am trying to do few (union + reduceByKey) operations on a hiearchical
dataset in a iterative fashion in rdd. The first few loops run fine but on
the subsequent loops, the operations ends up using the whole scratch space
provided to it.
I have set the spark scratch directory, i.e
up that has changed in the batch.
>>
>> On Wed, Oct 25, 2017 at 5:52 PM, Piyush Mukati <piyush.muk...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> we are migrating some jobs from Dstream to Structured Stream.
>>>
>>> Currently to handle agg
h Mukati <piyush.muk...@gmail.com>
> wrote:
>
>> Hi,
>> we are migrating some jobs from Dstream to Structured Stream.
>>
>> Currently to handle aggregations we call map and reducebyKey on each RDD
>> like
>> rdd.map(event => (event._1, event)
Oct 25, 2017 at 5:52 PM, Piyush Mukati <piyush.muk...@gmail.com>
wrote:
> Hi,
> we are migrating some jobs from Dstream to Structured Stream.
>
> Currently to handle aggregations we call map and reducebyKey on each RDD
> like
> rdd.map(event => (event._1, event))
Hi,
we are migrating some jobs from Dstream to Structured Stream.
Currently to handle aggregations we call map and reducebyKey on each RDD
like
rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b))
The final output of each RDD is merged to the sink with support for
aggre
Hi Stephen,
If you use aggregate functions or reduceGroup on KeyValueGroupedDataset it
behaves as reduceByKey on RDD.
Only if you use flatMapGroups and mapGroups it behaves as groupByKey on
RDD and if you read the API documentation it warns of using the API.
Hope this helps.
Thanks
Ankur
Are there plans to add reduceByKey to dataframes, Since switching over to
spark 2 I find myself increasing dissatisfied with the idea of converting
dataframes to RDD to do procedural programming on grouped data(both from a
ease of programming stance and performance stance). So I've been using
is split into two stages, flatMap() and count(). When counting
Tuples, flatMap() takes about 6s and count() takes about 2s, while when
counting Longs, flatMap() takes 18s and count() takes 10s.
I haven't look into Spark's implementation of flatMap/reduceByKey, but I
guess Spark has some
Hi,
Read somewhere that
groupByKey() in RDD disables map-side aggregation as the aggregation
function (appending to a list) does not save any space.
However from my understanding, using something like reduceByKey or
(CombineByKey + a combiner function,) we could reduce the data shuffled
e-from-within-function-passed-to-reduceByKey-tp28082.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
(Spark 2.0 , on AWS EMR yarn cluster)
>> >>> listens to Campaigns based on live stock feeds and the batch duration
>> is 5
>> >>> seconds. The applications uses Kafka DirectStream and based on the
>> feed
>> >>> source there are three streams.
d the batch duration
> is 5
> >>> seconds. The applications uses Kafka DirectStream and based on the feed
> >>> source there are three streams. As given in the code snippet I am
> doing a
> >>> union of three streams and I am trying to remove the dupli
luster)
>>> listens to Campaigns based on live stock feeds and the batch duration is 5
>>> seconds. The applications uses Kafka DirectStream and based on the feed
>>> source there are three streams. As given in the code snippet I am doing a
>>> union of three streams
three streams and I am trying to remove the duplicate campaigns
>> received using reduceByKey based on the customer and campaignId. I could
>> see lot of duplicate email being send out for the same key in the same
>> batch.I was expecting reduceByKey to remove the duplicate campaigns
ree streams and I am trying to remove the duplicate campaigns
> received using reduceByKey based on the customer and campaignId. I could
> see lot of duplicate email being send out for the same key in the same
> batch.I was expecting reduceByKey to remove the duplicate campaigns in a
> bat
snippet I am doing a
union of three streams and I am trying to remove the duplicate campaigns
received using reduceByKey based on the customer and campaignId. I could
see lot of duplicate email being send out for the same key in the same
batch.I was expecting reduceByKey to remove the duplicate
it works similarly as reducebykey.
>
> On 29 Aug 2016 20:57, "Marius Soutier" <mps@gmail.com
> <mailto:mps@gmail.com>> wrote:
> In DataFrames (and thus in 1.5 in general) this is not possible, correct?
>
>> On 11.08.2016, at 05:42, Holden Karau &
If you are confused because of the name of two APIs. I think DF API name
groupBy came from SQL, but it works similarly as reducebykey.
On 29 Aug 2016 20:57, "Marius Soutier" <mps@gmail.com> wrote:
> In DataFrames (and thus in 1.5 in general) this is not possible, correct
hips partial aggregations instead).
>
> I wonder whether the DataFrame API optimizes the code doing something
> similar to what RDD.reduceByKey does.
>
> I am using Spark 1.6.2.
>
> Regards,
> Luis
>
>
>
> --
> View this message in context:
> http://ap
ds,
> Luis
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Is-there-a-reduceByKey-functionality-in-DataFrame-
> API-tp27508.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
t.1001560.n3.nabble.com/Is-there-a-reduceByKey-functionality-in-DataFrame-API-tp27508.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
SO it was indeed my merge function. I created new result object for every
merge and its working now.
Thanks
On Wed, Jun 22, 2016 at 3:46 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
> PS. In my reduceByKey operation I have two mutable object. What I do is
> merge mutable2 i
PS. In my reduceByKey operation I have two mutable object. What I do is
merge mutable2 into mutable1 and return mutable1. I read that it works for
aggregateByKey so thought it will work for reduceByKey as well. I might be
wrong here. Can someone verify if this will work or be un predictable
incorrect result, did you observe any error (on
> workers) ?
>
> Cheers
>
> On Tue, Jun 21, 2016 at 10:42 PM, Nirav Patel <npa...@xactlycorp.com>
> wrote:
>
>> I have an RDD[String, MyObj] which is a result of Join + Map operation.
>> It has no partiti
For the run which returned incorrect result, did you observe any error (on
workers) ?
Cheers
On Tue, Jun 21, 2016 at 10:42 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
> I have an RDD[String, MyObj] which is a result of Join + Map operation. It
> has no partitioner info. I run
Hi,
Could you check the issue also occurs in v1.6.1 and v2.0?
// maropu
On Wed, Jun 22, 2016 at 2:42 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
> I have an RDD[String, MyObj] which is a result of Join + Map operation. It
> has no partitioner info. I run reduceByKey without
I have an RDD[String, MyObj] which is a result of Join + Map operation. It
has no partitioner info. I run reduceByKey without passing any Partitioner
or partition counts. I observed that output aggregation result for given
key is incorrect sometime. like 1 out of 5 times. It looks like reduce
t; I am looking at the option of moving RDD based operations to Dataset based
> operations. We are calling 'reduceByKey' on some pair RDDs we have. What
> would the equivalent be in the Dataset interface - I do not see a simple
> reduceByKey replacement.
>
&
On Tue, Jun 7, 2016 at 2:50 PM, Richard Marscher <rmarsc...@localytics.com>
wrote:
> There certainly are some gaps between the richness of the RDD API and the
> Dataset API. I'm also migrating from RDD to Dataset and ran into
> reduceByKey and join scenarios.
>
> In the spark
There certainly are some gaps between the richness of the RDD API and the
Dataset API. I'm also migrating from RDD to Dataset and ran into
reduceByKey and join scenarios.
In the spark-dev list, one person was discussing reduceByKey being
sub-optimal at the moment and it spawned this JIRA
https
un 7, 2016 at 2:32 PM, Bryan Jeffrey <bryan.jeff...@gmail.com>
> wrote:
>
>> Hello.
>>
>> I am looking at the option of moving RDD based operations to Dataset
>> based operations. We are calling 'reduceByKey' on some pair RDDs we have.
>> What would the
at 2:32 PM, Bryan Jeffrey <bryan.jeff...@gmail.com>
wrote:
> Hello.
>
> I am looking at the option of moving RDD based operations to Dataset based
> operations. We are calling 'reduceByKey' on some pair RDDs we have. What
> would the equivalent be in the Dataset interface -
Hello.
I am looking at the option of moving RDD based operations to Dataset based
operations. We are calling 'reduceByKey' on some pair RDDs we have. What
would the equivalent be in the Dataset interface - I do not see a simple
reduceByKey replacement.
Regards,
Bryan Jeffrey
by keys are empty. How do I avoid empty group by keys in DataFrame? Does
DataFrame avoid empty group by key? I have around 8 keys on which I do group
by.
sourceFrame.select("blabla").groupby("col1","col2","col3",..."col8").agg("bla
bla");
Even though it does not sound intuitive, reduce by key expects all values
for a particular key for a partition to be loaded into memory. So once you
increase the partitions you can run the jobs.
...@cs.stanford.edu> wrote:
>>
>>> Hello,
>>>
>>> I'm using Spark version 1.6.0 and have trouble with memory when trying
>>> to do reducebykey on a dataset with as many as 75 million keys. I.e. I get
>>> the following exception when I run the
, Sung Hwan Chung <
> coded...@cs.stanford.edu> wrote:
>
>> Hello,
>>
>> I'm using Spark version 1.6.0 and have trouble with memory when trying to
>> do reducebykey on a dataset with as many as 75 million keys. I.e. I get the
>> following exception when I r
trying to
> do reducebykey on a dataset with as many as 75 million keys. I.e. I get the
> following exception when I run the task.
>
> There are 20 workers in the cluster. It is running under the standalone
> mode with 12 GB assigned per executor and 4 cores per worker. The
> spark.memo
Hello,
I'm using Spark version 1.6.0 and have trouble with memory when trying to
do reducebykey on a dataset with as many as 75 million keys. I.e. I get the
following exception when I run the task.
There are 20 workers in the cluster. It is running under the standalone
mode with 12 GB assigned
On Monday 25 April 2016 11:28 PM,
Weiping Qu wrote:
Dear Ted,
You are right. ReduceByKey is transformation. My fault.
I would rephrase my question using following code snippet.
object ScalaApp {
def main(args: Array
Dear Ted,
You are right. ReduceByKey is transformation. My fault.
I would rephrase my question using following code snippet.
object ScalaApp {
def main(args: Array[String]): Unit ={
val conf = new SparkConf().setAppName("ScalaApp").setMaster("local")
val sc = n
ds
> on whether they are lazily executed or not.
> As far as I saw from my codes, the reduceByKey will be executed without
> any operations in the Action category.
> Please correct me if I am wrong.
>
> Thanks,
> Regards,
> Weiping
>
> On 25.04.2016 17:20, Chadha Pooja
Thanks.
I read that from the specification.
I thought the way people distinguish actions and transformations depends
on whether they are lazily executed or not.
As far as I saw from my codes, the reduceByKey will be executed without
any operations in the Action category.
Please correct me if I
Hi,
I'd like just to verify that whether reduceByKey is transformation or
actions.
As written in RDD papers, spark flow will not be triggered only if
actions are reached.
I tried and saw that the my flow will be executed once there is a
reduceByKey while it is categorized into transformations
in my job that
send data to the driver - just a pull of data from S3, a map and
reduceByKey and then conversion to dataframe and saveAsTable action that
puts the results back on S3.
I've found a few references to reduceByKey and spark.driver.maxResultSize
having some importance, but cannot fathom how
ke any sense to me, as there are
no actions in my job that send data to the driver - just a pull of data from
S3, a map and reduceByKey and then conversion to dataframe and saveAsTable
action that puts the results back on S3.
I've found a few references to reduceByKey and spark.driver.maxResult
o actions in my job that
> send data to the driver - just a pull of data from S3, a map and
> reduceByKey and then conversion to dataframe and saveAsTable action that
> puts the results back on S3.
>
> I've found a few references to reduceByKey and spark.driver.maxResultSize
>
65), (320,66), (340,67),
> (360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500...
>
> scala> val rddNum = sc.parallelize(nums)
> rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at
> parallelize at :23
>
> scala> val reducedNum
)] = ShuffledRDD[1] at
reduceByKey at :25
scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator,
true).collect.toList
res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0)
To distribute my data more evenly across the partitions I created my own
custom Partitoiner
zes: RDD[Int] = rdd.mapPartitions(iter =>
Array(iter.size).iterator, true)log.info(s"rdd ==>
[${sizes.collect.toList}]")
My question is why does my data end up in one partition after the
reduceByKey? After the filter it can be seen that the data is evenly
distributed, but the
Deep copying the data solved the issue:
data.map(r => {val t = SpecificData.get().deepCopy(r.getSchema, r); (t.id,
List(t)) }).reduceByKey(_ ++ _)
(noted here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1003)
Thanks Igor Ber
Howdy,
We've noticed a strange behavior with Avro serialized data and reduceByKey
RDD functionality. Please see below:
// We're reading a bunch of Avro serialized data
val data: RDD[T] = sparkContext.hadoopFile(path,
classOf[AvroInputFormat[T]], classOf[AvroWrapper[T]], classOf[NullWritable
Hi.
What is slow exactly?
In code-base 1:
When you run the persist() + count() you stored the result in RAM.
Then the map + reducebykey is done on in-memory data.
In the latter case (all-in-oneline) you are doing both steps at the same
time.
So you are saying that if you sum-up the time
yes, the first code takes only 30mins.
but the second method, I wait for 5 hours, only finish 10%
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/job-hangs-when-using-pipe-with-reduceByKey-tp25242p25249.html
Sent from the Apache Spark User List mailing
I meet a situation:
When I use
val a = rdd.pipe("./my_cpp_program").persist()
a.count() // just use it to persist a
val b = a.map(s => (s, 1)).reduceByKey().count()
it 's so fast
but when I use
val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByK
pipe("./my_cpp_program").persist()
a.count() // just use it to persist a
val b = a.map(s => (s, 1)).reduceByKey().count()
it 's so fast
but when I use
val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey().count()
it is so slow
and there are many
Which Spark release are you using ?
Which OS ?
Thanks
On Sat, Oct 31, 2015 at 5:18 AM, hotdog <lisend...@163.com> wrote:
> I meet a situation:
> When I use
> val a = rdd.pipe("./my_cpp_program").persist()
> a.count() // just use it to persist a
> val b = a.map(s
Hi,
We currently use reduceByKey to reduce by a particular metric name in our
Streaming/Batch job. It seems to be doing a lot of shuffles and it has
impact on performance. Does using a custompartitioner before calling
reduceByKey improve performance?
Thanks,
Swetha
--
View this message
ultimately improves performance.
On Tue, Oct 27, 2015 at 1:20 PM, swetha <swethakasire...@gmail.com> wrote:
> Hi,
>
> We currently use reduceByKey to reduce by a particular metric name in our
> Streaming/Batch job. It seems to be doing a lot of shuffles and it has
> impact
So, Wouldn't using a customPartitioner on the rdd upon which the
groupByKey or reduceByKey is performed avoid shuffles and improve
performance? My code does groupByAndSort and reduceByKey on different
datasets as shown below. Would using a custom partitioner on those datasets
before using
do that operation with a reduceByKey?
2. If not, use more partitions. That would cause lesser data in each
partition, so less spilling.
3. You can control the amount memory allocated for shuffles by changing the
configuration spark.shuffle.memoryFraction . More fraction would cause less
spilling
e same.
>
>
> Also, does using a customPartitioner for a reduceByKey improve performance?
>
>
> def getGrpdAndSrtdRecs(rdd: RDD[(String, (Long, String))]): RDD[(String,
> List[(Long, String)])] =
> { val grpdRecs = rdd.groupByKey(); val srtdRecs =
> grpdRecs.mapValues[(List[(Long
.
Also, does using a customPartitioner for a reduceByKey improve performance?
def getGrpdAndSrtdRecs(rdd: RDD[(String, (Long, String))]): RDD[(String,
List[(Long, String)])] =
{ val grpdRecs = rdd.groupByKey(); val srtdRecs =
grpdRecs.mapValues[(List[(Long, String)])](iter =>
iter.toList.sortBy(_
hi Daniel,
Do you solve your problem?
I met the same problem when running massive data using reduceByKey on yarn.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p25023.html
Sent from
It turns out the mesos can overwrite the OS ulimit -n setting. So we have
increased the mesos slave ulimit -n setting.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p25019.html
Sent from the Apache Spark
un(Thread.java:745)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p24985.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
-many-open-files-exception-on-reduceByKey-tp2462p24985.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h
va.lang.Thread.run(Thread.java:745)
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p24985.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com
All the *ByKey aggregations perform an efficient shuffle and preserve
partitioning on the output. If all you need is to call reduceByKey, then don’t
bother with groupBy. You should use groupBy if you really need all the
datapoints from a key for a very custom operation.
From the docs:
Note
Hi,
How to use reduceByKey inside updateStateByKey? Suppose I have a bunch of
keys for which I need to do sum and average inside the updateStateByKey by
joining with old state. How do I accomplish that?
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560
case?
Sent from my iPhone
> On 24 Sep 2015, at 19:47, swetha <swethakasire...@gmail.com> wrote:
>
> Hi,
>
> How to use reduceByKey inside updateStateByKey? Suppose I have a bunch of
> keys for which I need to do sum and average inside the updateStateByKey by
> joini
Hi,
How to make Group By more efficient? Is it recommended to use a custom
partitioner and then do a Group By? And can we use a custom partitioner and
then use a reduceByKey for optimization?
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3
Hi,
I have applied mapToPair and then a reduceByKey on a DStream to obtain a
JavaPairDStreamString, MapString, Object.
I have to apply a flatMapToPair and reduceByKey on the DSTream Obtained
above.
But i do not see any logs from reduceByKey operation.
Can anyone explain why is this happening
I don't see that you invoke any action in this code. It won't do
anything unless you tell it to perform an action that requires the
transformations.
On Wed, Aug 26, 2015 at 7:05 AM, Deepesh Maheshwari
deepesh.maheshwar...@gmail.com wrote:
Hi,
I have applied mapToPair and then a reduceByKey
HI All,
Please find fix info for users who are following the mail chain of this
issue and the respective solution below:
*reduceByKey: Non working snippet*
import org.apache.spark.Context
import org.apache.spark.Context._
import org.apache.spark.SparkConf
val conf = new SparkConf()
val sc = new
UseCompressedOops is set; assuming yes
res0: Array[(Int, Int)] = Array((0,3), (1,50), (2,40))
Yong
--
Date: Fri, 21 Aug 2015 19:24:09 +0530
Subject: Re: Transformation not happening for reduceByKey or GroupByKey
From: jsatishchan...@gmail.com
To: abhis
: Transformation not happening for reduceByKey or GroupByKey
From: zjf...@gmail.com
To: jsatishchan...@gmail.com
CC: robin.e...@xense.co.uk; user@spark.apache.org
Hi Satish,
I don't see where spark support -i, so suspect it is provided by DSE. In that
case, it might be bug of DSE.
On Fri, Aug 21, 2015
know too much about the original problem though.
Yong
--
Date: Fri, 21 Aug 2015 18:19:49 +0800
Subject: Re: Transformation not happening for reduceByKey or GroupByKey
From: zjf...@gmail.com
To: jsatishchan...@gmail.com
CC: robin.e...@xense.co.uk; user
), (0,2),(1,20),(1,30),(2,40))
I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function on
Values for each key
Code:
RDD.reduceByKey((x,y) = x+y)
RDD.take(3)
Result in console:
RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at
console:73
for each key
Code:
RDD.reduceByKey((x,y) = x+y)
RDD.take(3)
Result in console:
RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey
at console:73
res:Array[(Int,Int)] = Array()
Command as mentioned
dse spark --master local --jars postgresql-9.4-1201.jar -i
not happening for reduceByKey or GroupByKey
From: jsatishchan...@gmail.com
To: abhis...@tetrationanalytics.com
CC: user@spark.apache.org
HI Abhishek,
I have even tried that but rdd2 is empty
Regards,Satish
On Fri, Aug 21, 2015 at 6:47 PM, Abhishek R. Singh
abhis...@tetrationanalytics.com wrote:
You had
),(2,40))
I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function
on Values for each key
Code:
RDD.reduceByKey((x,y) = x+y)
RDD.take(3)
Result in console:
RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey
at console:73
res:Array[(Int,Int)] = Array
),(Int)] = Array((0,1), (0,2),(1,20),(1,30),(2,40))
I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function
on Values for each key
Code:
RDD.reduceByKey((x,y) = x+y)
RDD.take(3)
Result in console:
RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey
((0,3),(1,50),(2,40)) just a sum function
on Values for each key
Code:
RDD.reduceByKey((x,y) = x+y)
RDD.take(3)
Result in console:
RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey
at console:73
res:Array[(Int,Int)] = Array()
Command as mentioned
dse spark --master
), (0,2),(1,20),(1,30),(2,40))
I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function on
Values for each key
Code:
RDD.reduceByKey((x,y) = x+y)
RDD.take(3)
Result in console:
RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at
console:73
res:Array[(Int
: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at
console:73
res:Array[(Int,Int)] = Array()
Command as mentioned
dse spark --master local --jars postgresql-9.4-1201.jar -i ScriptFile
Please let me know what is missing in my code, as my resultant Array is
empty
Regards,
Satish
)]= ShuffledRDD[1] at reduceByKey at
console:73
res:Array[(Int,Int)] = Array()
Command as mentioned
dse spark --master local --jars postgresql-9.4-1201.jar -i ScriptFile
Please let me know what is missing in my code, as my resultant Array is
empty
Regards,
Satish
, V) = V.
I am really stuck here. Please help.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Custom-function-for-reduceByKey-missing-arguments-for-method-tp23756.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
really stuck here. Please help.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Custom-function-for-reduceByKey-missing-arguments-for-method-tp23756.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
-spark-user-list.1001560.n3.nabble.com/run-reduceByKey-on-huge-data-in-spark-tp23546p23555.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
I'm running reduceByKey in spark. My program is the simplest example of
spark:
val counts = textFile.flatMap(line = line.split( )).repartition(2).
.map(word = (word, 1))
.reduceByKey(_ + _, 1)
counts.saveAsTextFile(hdfs://...)
but it always run out
mode ?
Cheers
On Tue, Jun 30, 2015 at 10:03 AM, hotdog lisend...@163.com
mailto:lisend...@163.com wrote:
I'm running reduceByKey in spark. My program is the simplest example of
spark:
val counts = textFile.flatMap(line = line.split( )).repartition(2).
.map(word
Which Spark release are you using ?
Are you running in standalone mode ?
Cheers
On Tue, Jun 30, 2015 at 10:03 AM, hotdog lisend...@163.com wrote:
I'm running reduceByKey in spark. My program is the simplest example of
spark:
val counts = textFile.flatMap(line = line.split( )).repartition
), (USA, Colorado)
Output RDD: (USA, [California, Colorado]), (UK, Yorkshire)
Is it possible to use reduceByKey or foldByKey to achieve this, instead of
groupBykey.
Something equivalent to a cons operator from LISP?, so that I could just say
reduceBykey(lambda x,y: (cons x y) ). May be it is more
1 - 100 of 234 matches
Mail list logo