Why not give it a shot? Spark often outruns older MapReduce jobs, so it is worth timing both on a sample of your data.

Thanks
Best Regards
On Sat, Aug 8, 2015 at 8:25 AM, linlma <lin...@gmail.com> wrote:

> I have tens of millions of records, each a customer ID and city ID pair.
> There are tens of millions of unique customer IDs and only a few hundred
> unique city IDs. I want to do a merge that aggregates all city IDs for
> each customer ID, and pull back all records. I plan to do this with a
> group by on customer ID using Pig on Hadoop, and I am wondering whether
> that is the most efficient way.
>
> I am also wondering whether there is overhead for sorting in Hadoop (I do
> not care whether customer1 comes before customer2, as long as all cities
> are aggregated correctly for customer1 and customer2). Do you think Spark
> is better?
>
> Here is an example of the input:
>
> CustomerID1 City1
> CustomerID2 City2
> CustomerID3 City1
> CustomerID1 City3
> CustomerID2 City4
>
> I want output like this:
>
> CustomerID1 City1 City3
> CustomerID2 City2 City4
> CustomerID3 City1
>
> Thanks in advance,
> Lin
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/using-Spark-or-pig-group-by-efficient-in-my-use-case-tp24178.html
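
For anyone who wants to try the Spark route on the example above, here is a minimal sketch in Python, not a tuned implementation. It assumes whitespace-separated input lines like the sample, and that duplicate city IDs for a customer should be collapsed. The function names, the `sc` SparkContext argument, and the `path` argument are all hypothetical; none of them come from the thread. The plain-Python function is only a reference to check results against on a small sample; the Spark function expresses the same aggregation with the RDD method aggregateByKey.

```python
from collections import defaultdict


def parse_line(line):
    """Split 'CustomerID1 City1' into ('CustomerID1', 'City1')."""
    customer_id, city_id = line.split()
    return customer_id, city_id


def group_cities_local(lines):
    """Plain-Python reference: customer ID -> sorted list of distinct city IDs."""
    cities = defaultdict(set)
    for line in lines:
        customer_id, city_id = parse_line(line)
        cities[customer_id].add(city_id)
    return {customer: sorted(ids) for customer, ids in cities.items()}


def group_cities_spark(sc, path):
    """Same aggregation as an RDD pipeline.

    Assumptions: `sc` is an already-created SparkContext and `path` points
    at the input file; both are placeholders, not values from the thread.
    """
    return (
        sc.textFile(path)
        .map(parse_line)
        .aggregateByKey(
            set(),                             # start with an empty city set per customer
            lambda seen, city: seen | {city},  # add one city within a partition
            lambda a, b: a | b,                # merge partial sets across partitions
        )
        .mapValues(sorted)                     # distinct cities in sorted order
    )


# Check the reference version against the sample from the original message.
sample = [
    "CustomerID1 City1",
    "CustomerID2 City2",
    "CustomerID3 City1",
    "CustomerID1 City3",
    "CustomerID2 City4",
]
result = group_cities_local(sample)
for customer in sorted(result):
    print(customer, " ".join(result[customer]))
```

With only a few hundred distinct cities, each per-customer set stays small, which is why building sets during the shuffle looks reasonable here. If repeated cities need to be kept, groupByKey followed by mapValues(list) would be the closer match.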