Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-03 Thread Suraj Shetiya
Hi Salih,

Thanks for the links :) This seems very promising to me.

When do you think this would be available in the Spark codeline?

Thanks,
Suraj


Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-02 Thread Suraj Shetiya
Hi Michael,

Thanks for the quick response. This sounds like something that would work.
However, rethinking the problem statement and the growing number of other
use cases, there are more such scenarios where columns hold structured and
unstructured data (JSON, XML, or other kinds of collections). It may make
sense to allow probabilistic groupBy operations, where the user gets the
same functionality in one step instead of two.

What are your thoughts on whether that makes sense?

-Suraj


Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-02 Thread Salih Oztop
Hi Suraj,

It seems your requirement is Record Linkage/Entity Resolution.
https://en.wikipedia.org/wiki/Record_linkage
http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf

A presentation from Spark Summit using GraphX:
https://spark-summit.org/east-2015/talk/distributed-graph-based-entity-resolution-using-spark

 Kind Regards
Salih Oztop
07856128843
http://www.linkedin.com/in/salihoztop
  

Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-01 Thread Michael Armbrust
You should probably write a UDF that uses a regular expression or other
string munging to canonicalize the subject, and then group on that derived
column.
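
As a rough illustration of the suggestion above (not code from this thread),
the canonicalization step could look like the plain-Python sketch below. The
function name canonical_subject and the prefix pattern are assumptions made
for the example; real mail subjects may need more prefixes (localized "AW:",
"[list-name]" tags, and so on). The sample rows are the ones from the
original question.

```python
import re
from collections import Counter

# Hypothetical pattern: matches any run of leading "Re:", "Fw:", or "Fwd:"
# markers, case-insensitively, together with the whitespace around them.
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def canonical_subject(subject):
    """Return the subject with reply/forward prefixes and outer whitespace removed."""
    return _PREFIX.sub("", subject).strip()


# Sample rows taken from the original question: (Date, Subject).
rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

# Group on the derived (canonical) subject and count messages per topic.
counts = Counter(canonical_subject(subject) for _, subject in rows)
```

In PySpark, a function like this could presumably be wrapped with
pyspark.sql.functions.udf, added as a derived column, and that column passed
to groupBy; the sketch only shows the string logic, which is easy to test
without a cluster.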


Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-06-30 Thread Suraj Shetiya
Thanks Salih. :)


The output of the groupBy would be as below.

2015-01-14   SEC Inquiry
2015-01-16   Re: SEC Inquiry
2015-01-18   Fwd: Re: SEC Inquiry


Subsequently, we would like to aggregate all messages with a particular
reference subject. For instance, the question we are trying to answer could
be: get the count of messages with a particular subject.

Looking forward to any suggestions from you.


Spark Dataframe 1.4 (GroupBy partial match)

2015-06-30 Thread Suraj Shetiya
I have a dataset (trimmed and simplified) with 2 columns as below.

Date         Subject
2015-01-14   SEC Inquiry
2014-02-12   Happy birthday
2014-02-13   Re: Happy birthday
2015-01-16   Re: SEC Inquiry
2015-01-18   Fwd: Re: SEC Inquiry

I have imported this into a Spark DataFrame. What I am looking to do is a
groupBy on the subject field (however, I need a partial match to identify
the discussion topic).

For example, in the above case I would like to group all messages whose
subject contains "SEC Inquiry", which would return the following grouped
frame:

2015-01-14   SEC Inquiry
2015-01-16   Re: SEC Inquiry
2015-01-18   Fwd: Re: SEC Inquiry

Another use case for a similar problem could be grouping by year (in the
above example). That would mean a partial match on the date field, i.e. a
groupBy on Date that matches the year as 2014 or 2015.

Keenly looking forward to a reply/solution to the above.

- Suraj
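
For the group-by-year case described above, one possible reading is that the
"partial match" is just a grouping key derived from the first four characters
of the date. The sketch below is plain Python under that assumption; the
helper name year_of is made up for the example, and it assumes the dates are
always YYYY-MM-DD strings as in the sample data.

```python
from collections import defaultdict


def year_of(date_str):
    """Return the four-character year prefix of a YYYY-MM-DD date string."""
    return date_str[:4]


# Sample rows taken from the question: (Date, Subject).
rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

# Collect subjects under the derived year key, then count per year.
by_year = defaultdict(list)
for date, subject in rows:
    by_year[year_of(date)].append(subject)

counts_by_year = {year: len(subjects) for year, subjects in by_year.items()}
```

On a DataFrame, a comparable key could probably be derived from a substring
of the Date column (for example with Column.substr) and then used in groupBy;
that part is not shown here.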


Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-06-30 Thread Salih Oztop
Hi Suraj,

What will be your output after the group by? GroupBy is for aggregations
like sum, count, etc. If you want to count the 2015 records, then that is
possible.

Kind Regards
Salih Oztop

