Re: Spark Dataframe 1.4 (GroupBy partial match)
Hi Salih,

Thanks for the links :) This seems very promising to me. When do you think this would be available in the Spark code line?

Thanks,
Suraj

On Fri, Jul 3, 2015 at 2:02 AM, Salih Oztop soz...@yahoo.com wrote:
Re: Spark Dataframe 1.4 (GroupBy partial match)
Hi Michael,

Thanks for the quick response. This sounds like something that would work. However, rethinking the problem statement and the various other use cases, which are growing, there are more scenarios where columns embed structured and unstructured data (JSON, XML, or other kinds of collections). It may make sense to allow a probabilistic groupBy operation, so the user gets the same functionality in one step instead of two. Your thoughts on whether that makes sense?

-Suraj

On Jul 2, 2015 at 12:49 AM, Michael Armbrust mich...@databricks.com wrote:
Re: Spark Dataframe 1.4 (GroupBy partial match)
Hi Suraj,

It seems your requirement is Record Linkage / Entity Resolution:
https://en.wikipedia.org/wiki/Record_linkage
http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf

A presentation from Spark Summit on entity resolution using GraphX:
https://spark-summit.org/east-2015/talk/distributed-graph-based-entity-resolution-using-spark

Kind Regards
Salih Oztop
07856128843
http://www.linkedin.com/in/salihoztop

*From:* Suraj Shetiya surajshet...@gmail.com
*To:* Michael Armbrust mich...@databricks.com
*Cc:* Salih Oztop soz...@yahoo.com; user@spark.apache.org; megha.sridh...@cynepia.com
*Sent:* Thursday, July 2, 2015 10:47 AM
*Subject:* Re: Spark Dataframe 1.4 (GroupBy partial match)
Re: Spark Dataframe 1.4 (GroupBy partial match)
You should probably write a UDF that uses a regular expression or other string munging to canonicalize the subject, and then group on that derived column.

On Tue, Jun 30, 2015 at 10:30 PM, Suraj Shetiya surajshet...@gmail.com wrote:
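The suggestion above (canonicalize the subject, then group on the derived column) can be sketched in plain Python; the `canonicalize_subject` helper is hypothetical, and the regex assumes only "Re:"/"Fwd:"/"FW:" chains occur as prefixes:

```python
import re

# Hypothetical canonicalizer: strips any leading chain of "Re:" / "Fwd:" / "FW:"
# (case-insensitive) so replies and forwards all map to the root subject.
def canonicalize_subject(subject):
    return re.sub(r'^(?:(?:re|fwd?)\s*:\s*)+', '', subject.strip(),
                  flags=re.IGNORECASE)

for s in ["SEC Inquiry", "Re: SEC Inquiry", "Fwd: Re: SEC Inquiry"]:
    print(canonicalize_subject(s))  # all three print "SEC Inquiry"
```

In PySpark, a function like this could presumably be wrapped with `pyspark.sql.functions.udf`, added as a derived column via `withColumn`, and used as the `groupBy` key.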
Re: Spark Dataframe 1.4 (GroupBy partial match)
Thanks Salih. :)

The output of the groupBy is as below:

2015-01-14  SEC Inquiry
2015-01-16  Re: SEC Inquiry
2015-01-18  Fwd: Re: SEC Inquiry

Subsequently, we would like to aggregate all messages with a particular reference subject. For instance, the question we are trying to answer could be: get the count of messages with a particular subject. Looking forward to any suggestions from you.

On Tue, Jun 30, 2015 at 8:42 PM, Salih Oztop soz...@yahoo.com wrote:
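The count-per-subject aggregation described here can be sketched with the same canonicalization idea; this is a minimal plain-Python illustration over the sample rows (the helper name and regex are assumptions, not Spark API):

```python
import re
from collections import Counter

# Assumed helper: strip leading "Re:"/"Fwd:" chains to recover the root subject.
def canonicalize_subject(subject):
    return re.sub(r'^(?:(?:re|fwd?)\s*:\s*)+', '', subject.strip(),
                  flags=re.IGNORECASE)

rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

# Count messages per canonical subject.
counts = Counter(canonicalize_subject(subj) for _, subj in rows)
print(counts["SEC Inquiry"])  # 3
```

In Spark, the equivalent would be grouping on the canonicalized column and calling `count()` on the grouped data.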
Spark Dataframe 1.4 (GroupBy partial match)
I have a dataset (trimmed and simplified) with 2 columns as below.

Date        Subject
2015-01-14  SEC Inquiry
2014-02-12  Happy birthday
2014-02-13  Re: Happy birthday
2015-01-16  Re: SEC Inquiry
2015-01-18  Fwd: Re: SEC Inquiry

I have imported the same into a Spark DataFrame. What I am looking at is a groupBy on the Subject field; however, I need a partial match to identify the discussion topic. For example, in the above case I would like to group all messages whose subject contains "SEC Inquiry", which returns the following grouped frame:

2015-01-14  SEC Inquiry
2015-01-16  Re: SEC Inquiry
2015-01-18  Fwd: Re: SEC Inquiry

Another use case for a similar problem could be group-by-year (in the above example): a partial match on the Date field, i.e. groupBy on Date matching the year 2014 or 2015.

Keenly looking forward to a reply/solution to the above.

- Suraj
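The group-by-year variant amounts to deriving a grouping key from the Date column; a minimal plain-Python sketch over the sample rows, assuming ISO-formatted dates:

```python
from collections import defaultdict

rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

# Derive the grouping key (the year) from the Date column, then group on it.
by_year = defaultdict(list)
for date, subject in rows:
    by_year[date[:4]].append(subject)  # ISO dates: first 4 chars are the year

print(sorted(by_year))        # ['2014', '2015']
print(len(by_year["2015"]))   # 3
```

With a DataFrame, the same derived column could be built with something like `df.Date.substr(1, 4)` and passed to `groupBy`.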
Re: Spark Dataframe 1.4 (GroupBy partial match)
Hi Suraj,

What will be your output after the group by? GroupBy is for aggregations like sum, count, etc. If you want to count the 2015 records, then it is possible.

Kind Regards
Salih Oztop

*From:* Suraj Shetiya surajshet...@gmail.com
*To:* user@spark.apache.org
*Sent:* Tuesday, June 30, 2015 3:05 PM
*Subject:* Spark Dataframe 1.4 (GroupBy partial match)