Re: using Spark or pig group by efficient in my use case?

2015-08-13 Thread Eugene Morozov
I’d say Spark will be faster in this case, because it can keep intermediate
data in memory instead of always writing it to disk between the map and
reduce stages. It should be faster even if you use a Combiner (and I’d
assume Pig is able to figure the Combiner out on its own).

It’s hard to say how much faster, as that will depend on the disks available
(SSD vs. SSHD vs. HDD), the size of the data, etc.

In the end, only an experiment can reveal the truth =)
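
To make that concrete, here is a minimal RDD sketch of the group by,
assuming tab-separated "customerID<TAB>cityID" lines; the HDFS paths are
made up for illustration. aggregateByKey merges the city sets map-side,
much like an MR Combiner, so only the small per-customer sets are shuffled
rather than every raw record:

import org.apache.spark.{SparkConf, SparkContext}

object GroupCitiesByCustomer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("group-cities"))

    // Parse "customerID<TAB>cityID" lines into (customer, city) pairs.
    val pairs = sc.textFile("hdfs:///data/customer_city.tsv") // hypothetical path
      .map { line =>
        val Array(customer, city) = line.split("\t")
        (customer, city)
      }

    // aggregateByKey pre-combines on the map side (like an MR Combiner),
    // so only the per-customer city sets cross the network.
    val cities = pairs.aggregateByKey(Set.empty[String])(_ + _, _ ++ _)

    // Emit "customerID<TAB>city1<TAB>city2 ..." lines, one per customer.
    cities
      .map { case (customer, cs) => (customer +: cs.toSeq).mkString("\t") }
      .saveAsTextFile("hdfs:///out/customer_cities") // hypothetical path

    sc.stop()
  }
}

groupByKey would be the literal analogue of Pig's GROUP BY, but it shuffles
every raw record; aggregateByKey is usually the better fit here, since with
only a few hundred distinct cities each per-customer set stays small.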


Eugene Morozov
fathers...@list.ru






Re: using Spark or pig group by efficient in my use case?

2015-08-09 Thread Akhil Das
Why not give it a shot? Spark generally outruns old MapReduce jobs.

Thanks
Best Regards




using Spark or pig group by efficient in my use case?

2015-08-07 Thread linlma
I have tens of millions of records, each a (customer ID, city ID) pair.
There are tens of millions of unique customer IDs, and only a few hundred
unique city IDs. I want to aggregate all city IDs for each customer ID and
pull back all records. I plan to do this with a group by on customer ID
using Pig on Hadoop, and I am wondering whether that is the most efficient
way.

I am also wondering whether there is sorting overhead in Hadoop (I do not
care whether customer1 comes before customer2, as long as all cities are
aggregated correctly for customer1 and customer2). Do you think Spark is
better?

Here is an example of the input:

CustomerID1 City1
CustomerID2 City2
CustomerID3 City1
CustomerID1 City3
CustomerID2 City4

I want the output to look like this:

CustomerID1 City1 City3
CustomerID2 City2 City4
CustomerID3 City1

Thanks in advance,
Lin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/using-Spark-or-pig-group-by-efficient-in-my-use-case-tp24178.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

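
For reference, the same aggregation can also be written against DataFrames.
This is a sketch assuming Spark 1.6+, where collect_set is available in
org.apache.spark.sql.functions; the paths and column names are made up for
illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.collect_set

object GroupCitiesDF {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("group-cities-df"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Parse the tab-separated pairs and name the columns.
    val df = sc.textFile("hdfs:///data/customer_city.tsv") // hypothetical path
      .map { line =>
        val Array(customer, city) = line.split("\t")
        (customer, city)
      }
      .toDF("customer", "city")

    // Gather the distinct cities per customer.
    val grouped = df.groupBy("customer")
      .agg(collect_set($"city").as("cities"))

    grouped.show()
    sc.stop()
  }
}

Because this grouping aggregation is typically hash-based, no global sort by
customer ID is required, which speaks to the sorting-overhead question above.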