Re: unable to do group by with 1st column

2014-12-28 Thread Michael Albert
Greetings!
Thanks for the comment.
I have tried several variants of this, as indicated.
The code works on small sets, but fails on larger sets. However, I don't get
memory errors. I see java.nio.channels.CancelledKeyException and things about
lost task, and then things like "Resubmitting stage 1", and off it goes.
I've already upped the memory (I think the last experiment had
--executor-memory 6G and --driver-memory 6G).
I'm experimenting with recoding this with map-reduce and so far seem to be
having more success (with HADOOP_OPTS=-Xmx6g -Xmx5g)
Again, each grouping should have no more than 6E7 values, and the data is 
(DataKey(Int,Int), Option[Float]), so that shouldn't need 5g?
Anyway, thanks for the info.
Best wishes,
Mike


  From: Sean Owen so...@cloudera.com
 To: Michael Albert m_albert...@yahoo.com 
Cc: user@spark.apache.org 
 Sent: Friday, December 26, 2014 3:23 PM
 Subject: Re: unable to do group by with 1st column
   
Here is a sketch of what you need to do off the top of my head and based on a
guess of what your RDD is like:

val in: RDD[(K,Seq[(C,V)])] = ...
in.flatMap { case (key, colVals) =>
  colVals.map { case (col, value) =>
    (col, (key, value))
  }
}.groupByKey

So the problem with both input and output here is that all values
for each key exist in memory at once. When transposed, each element contains
50M key value pairs. You probably should try to do what you're trying to do a
slightly different way.
Depends on what you mean by resubmitting but I imagine
you need a cache() on an RDD you are reusing.
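
For reference, a self-contained, runnable version of that sketch might look like
the following (the Int/Float types, the sample data, and the name TransposeSketch
are illustrative, not taken from the thread):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on Spark 1.2)
import org.apache.spark.rdd.RDD

object TransposeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("transpose-sketch"))

    // Illustrative input: one record per key, each holding (column, value) pairs.
    val in: RDD[(Int, Seq[(Int, Float)])] = sc.parallelize(Seq(
      (1, Seq((10, 1.1f), (11, 1.2f))),
      (2, Seq((10, 2.1f), (11, 2.2f)))
    ))

    // Re-key by column, then group: every value for a column lands in one Iterable,
    // which is exactly the memory issue discussed in this thread for ~5e7 keys.
    val byColumn: RDD[(Int, Iterable[(Int, Float)])] =
      in.flatMap { case (key, colVals) =>
        colVals.map { case (col, value) => (col, (key, value)) }
      }.groupByKey()

    byColumn.collect().foreach(println)
    sc.stop()
  }
}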

On Dec 26, 2014 4:18 PM, Michael Albert m_albert...@yahoo.com.invalid wrote:

Greetings!
I'm trying to do something similar, and having a very bad time of it.
What I start with is
key1: (col1: val-1-1, col2: val-1-2, col3: val-1-3, col4: val-1-4...)
key2: (col1: val-2-1, col2: val-2-2, col3: val-2-3, col4: val-2-4, ...)
What I want (what I have been asked to produce :-)) is:
col1: (key1: val-1-1, key2: val-2-1, key3: val-3-1, ...)
col2: (key1: val-1-2, key2: val-2-2, key3: val-3-2, ...)
So basically the transpose.  The input is actually avro/parquet with each key
in one record. In the output, the final step is to convert each column into a
MATLAB file. Please don't ask me whether this is a good idea.
I can get this to work for smallish data sets (e.g., a few hundred keys and a
few hundred columns). However, if I crank up the number of keys to about 5e7,
then this fails, even if I turn the number of columns that are actually used
down to 10.
The system seems to spend lots of time resubmitting parts of the first phase in 
which the data is read from the original records and shuffled and never quite 
finishes.
I can't post the code, but I can give folks an idea of what I've tried.
Try #1: Mapper emits data as (DataKey(col-as-int, key-as-int),
value-as-Option[Any]), then create a ShuffledRDD using the col-as-int for
partitioning and then setKeyOrdering on the key-as-int.  This is then fed to
mapPartitionsWithIndex.
Try #2: Emit (col-as-int, (key-as-int, value)) and groupBy, and have a final
map() on each col.
Try #3: Emit (col-as-int, Collection[(key-as-int, value)]), then have a
reduceByKey which takes the union of the collection (union for set, ++ for
list), then have a final map() which attempts the final conversion.
No matter what I do, it works for small numbers of keys (hundreds), but
when I crank it up, it seems to sit there resubmitting the shuffle phase.
Happy holidays, all!
-Mike

  From: Amit Behera amit.bd...@gmail.com
 To: u...@spark.incubator.apache.org 
 Sent: Thursday, December 25, 2014 3:22 PM
 Subject: unable to do group by with 1st column
   
Hi Users,
I am reading a csv file and my data format is like :
key1,value1
key1,value2
key1,value1
key1,value3
key2,value1
key2,value5
key2,value5
key2,value4
key1,value4
key1,value4
key3,value1
key3,value1
key3,value2

required output :
key1:[value1,value2,value1,value3,value4,value4]
key2:[value1,value5,value5,value4]
key3:[value1,value1,value2]

How can I do it? Please help me to do.
Thanks
Amit


   


  

Re: unable to do group by with 1st column

2014-12-28 Thread Sean Owen
One value is at least 12 + 4 + 4 + 12 + 4 = 36 bytes if you factor in
object overhead, if my math is right. 60M of them is about 2.1GB for a
single key. I could imagine that blowing up an executor that's trying
to have one in memory and deserialize another. You won't want to use
groupByKey if the number of values is this big.

MapReduce doesn't quite operate this way. You would not have the
values in memory for a single key in general. That said, you don't
necessarily have to make Spark work this way either. There may be
other ways to do what you want that do not involve groupByKey but
rather reduceByKey or similar.
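
One concrete "other way" that avoids holding a whole column in memory is a
secondary-sort style job: partition by column, sort by key within partitions,
and stream each column straight to its output. This is close in spirit to the
ShuffledRDD/setKeyOrdering attempt described downthread. A hedged sketch,
assuming Spark 1.2's repartitionAndSortWithinPartitions; ColumnPartitioner,
transposeStreaming, and the println placeholder are illustrative names:

import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on Spark 1.2)
import org.apache.spark.rdd.RDD

// Partition by column only, so each partition holds whole columns; the
// (col, key) ordering keeps a column's values contiguous and sorted by key.
class ColumnPartitioner(parts: Int) extends Partitioner {
  def numPartitions: Int = parts
  def getPartition(key: Any): Int = key match {
    case (col: Int, _) => ((col % parts) + parts) % parts
  }
}

def transposeStreaming(pairs: RDD[((Int, Int), Option[Float])], parts: Int): Unit = {
  pairs
    .repartitionAndSortWithinPartitions(new ColumnPartitioner(parts))
    .foreachPartition { it =>
      it.foreach { case ((col, key), value) =>
        // Placeholder: append (key, value) to the writer for column `col`
        // instead of collecting ~6e7 values into one in-memory group.
        println(s"$col,$key,${value.getOrElse("")}")
      }
    }
}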

The error does not show a memory error per se, so it's not clear from
this why the executor is failing. If you search around you'll see
CancelledKeyException is a symptom of a couple things, some of which
are bugs that are recently fixed. Hard to know whether it matters but
you might use 1.2 to make sure.

You might also try the sort-based shuffle instead if you are doing
such a big shuffle?
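
If it helps, one hedged way to request the sort-based shuffle explicitly (on 1.2
it is already the default, so this mainly matters on 1.1; the app name is
illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Select the sort-based shuffle; the spark-submit equivalent is
// --conf spark.shuffle.manager=sort
val conf = new SparkConf()
  .setAppName("transpose")
  .set("spark.shuffle.manager", "sort")
val sc = new SparkContext(conf)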

On Sun, Dec 28, 2014 at 9:02 PM, Michael Albert m_albert...@yahoo.com wrote:
 Greetings!

 Thanks for the comment.

 I have tried several variants of this, as indicated.

 The code works on small sets, but fails on larger sets.
 However, I don't get memory errors.
 I see java.nio.channels.CancelledKeyException and things about lost task
 and then things like "Resubmitting stage 1", and off it goes.

 I've already upped the memory (I think the last experiment had
 --executor-memory 6G and --driver-memory 6G).

 I'm experimenting with recoding this with map-reduce and so far seem to be
 having more success (with HADOOP_OPTS=-Xmx6g -Xmx5g)

 Again, each grouping should have no more than 6E7 values, and the data is
 (DataKey(Int,Int), Option[Float]), so that shouldn't need 5g?

 Anyway, thanks for the info.

 Best wishes,
 Mike



 
 From: Sean Owen so...@cloudera.com
 To: Michael Albert m_albert...@yahoo.com
 Cc: user@spark.apache.org
 Sent: Friday, December 26, 2014 3:23 PM
 Subject: Re: unable to do group by with 1st column

 Here is a sketch of what you need to do off the top of my head and based on
 a guess of what your RDD is like:
 val in: RDD[(K,Seq[(C,V)])] = ...
 in.flatMap { case (key, colVals) =>
   colVals.map { case (col, value) =>
     (col, (key, value))
   }
 }.groupByKey
 So the problem with both input and output here is that all values for each
 key exist in memory at once. When transposed, each element contains 50M key
 value pairs.
 You probably should try to do what you're trying to do a slightly different
 way.
 Depends on what you mean by resubmitting but I imagine you need a cache() on
 an RDD you are reusing.


 On Dec 26, 2014 4:18 PM, Michael Albert m_albert...@yahoo.com.invalid
 wrote:

 Greetings!

 I'm trying to do something similar, and having a very bad time of it.

 What I start with is

 key1: (col1, val-1-1, col2: val-1-2, col3: val-1-3, col4: val-1-4...)
 key2: (col1: val-2-1, col2: val-2-2, col3: val-2-3, col4: val 2-4, ...)
 

 What I want  (what I have been asked to produce :-)) is:

 col1: (key1: val-1-1, key2: val-2-1, key3, val-3-1, ...)
 col2: (key1: val-1-2, key2: val2-2, key3: val-3-2,...)

 So basically the transpose.  The input is actually avro/parquet with each
 key in one record.
 In the output, the final step is to convert each column into a matlab
 file.
 Please don't ask me whether this is a good idea.

 I can get this to work for smallish data sets (e.g, a few hundred keys and a
 few hundred columns).
 However, if I crank up the number of keys to about 5e7, then this fails,
 even if I turn the number of columns that are actually used down to 10.

 The system seems to spend lots of time resubmitting parts of the first phase
 in which the data is read from the original records and shuffled and never
 quite finishes.

 I can't post the code, but I can give folks an idea of what I've tried.

 Try #1: Mapper emits data as (DataKey(col-as-int,key-as-int),
 value-as-Option[Any]),
 then create a ShuffledRDD using the col-as-int for partitioning and
  then setKeyOrdering on the key-as-int.  This is then fed to
  mapPartitionsWithIndex.

 Try #2: Emit (col-as-int, (key-as-int, value)) and groupBy, and have a final
 map() on each col.

 Try #3: Emit (col-as-int, Collection[(key-as-int, value)]), then have a
 reduceByKey
 which takes the union of the collection (union for set, ++ for list) then
 have
 a final map() which attempts the final conversion.

 No matter what I do, it works for small numbers of keys (hundreds),
 but
 when I crank it up, it seems to sit there resubmitting the shuffle phase.

 Happy holidays, all!
 -Mike



 
 From: Amit Behera amit.bd...@gmail.com
 To: u...@spark.incubator.apache.org
 Sent: Thursday, December 25, 2014 3:22 PM
 Subject: unable to do group by with 1st column

 Hi Users,

 I am reading a csv file and my data format is like :

 key1,value1
 key1,value2
 key1,value1
 key1

RE: unable to do group by with 1st column

2014-12-26 Thread Sean Owen
This does not appear to be what the asker wanted as this makes one big
string. groupByKey is correct after parsing to key value pairs.
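
For completeness, a minimal hedged sketch of "parse to pairs, then groupByKey"
for the CSV in the original question (sc is assumed to be the SparkContext; the
path and variable names are illustrative):

import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on Spark 1.2)

val lines = sc.textFile("/data/data.csv")
val grouped = lines
  .map { l => val Array(k, v) = l.split(","); (k, v) }   // "key1,value1" -> ("key1", "value1")
  .groupByKey()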
On Dec 26, 2014 3:55 AM, Somnath Pandeya somnath_pand...@infosys.com
wrote:

  Hi ,

 You can try reduceByKey also,

 Something like this

 JavaPairRDD<String, String> ones = lines
     .mapToPair(new PairFunction<String, String, String>() {
         @Override
         public Tuple2<String, String> call(String s) {
             String[] temp = s.split(",");
             return new Tuple2<String, String>(temp[0], temp[1]);
         }
     });

 JavaPairRDD<String, String> counts = ones
     .reduceByKey(new Function2<String, String, String>() {
         @Override
         public String call(String i1, String i2) {
             return i1 + "," + i2;
         }
     });



 From: Tobias Pfeiffer [mailto:t...@preferred.jp]
 Sent: Friday, December 26, 2014 6:35 AM
 To: Amit Behera
 Cc: u...@spark.incubator.apache.org
 Subject: Re: unable to do group by with 1st column



 Hi,



 On Fri, Dec 26, 2014 at 5:22 AM, Amit Behera amit.bd...@gmail.com wrote:

  How can I do it? Please help me to do.



 Have you considered using groupByKey?

 http://spark.apache.org/docs/latest/programming-guide.html#transformations



 Tobias





Re: unable to do group by with 1st column

2014-12-26 Thread Amit Behera
Hi,

Thank you very much to all for your reply.

I am able to get it by groupByKey

Here is my code :

import au.com.bytecode.opencsv.CSVParser
val data = sc.textFile("/data/data.csv")
def pLines(lines:Iterator[String])={
  val parser=new CSVParser()
  lines.map(l=>{val vs=parser.parseLine(l)
    (vs(0),vs(1).toInt)})
}
val result = data.mapPartitions(pLines).groupByKey.collect
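
If the bracketed-list shape from the original post (key1:[value1,value2,...]) is
wanted, a small hedged variant (mapValues, toList, and mkString are standard;
collecting to the driver only makes sense while the result stays small):

val asLists = data.mapPartitions(pLines).groupByKey.mapValues(_.toList).collect
asLists.foreach { case (k, vs) => println(s"$k:[${vs.mkString(",")}]") }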

Thanks
Amit







On Fri, Dec 26, 2014 at 2:18 PM, Sean Owen so...@cloudera.com wrote:

 This does not appear to be what the asker wanted as this makes one big
 string. groupByKey is correct after parsing to key value pairs.
 On Dec 26, 2014 3:55 AM, Somnath Pandeya somnath_pand...@infosys.com
 wrote:

  Hi ,

 You can try reduceByKey also,

 Something like this

 JavaPairRDD<String, String> ones = lines
     .mapToPair(new PairFunction<String, String, String>() {
         @Override
         public Tuple2<String, String> call(String s) {
             String[] temp = s.split(",");
             return new Tuple2<String, String>(temp[0], temp[1]);
         }
     });

 JavaPairRDD<String, String> counts = ones
     .reduceByKey(new Function2<String, String, String>() {
         @Override
         public String call(String i1, String i2) {
             return i1 + "," + i2;
         }
     });



 From: Tobias Pfeiffer [mailto:t...@preferred.jp]
 Sent: Friday, December 26, 2014 6:35 AM
 To: Amit Behera
 Cc: u...@spark.incubator.apache.org
 Subject: Re: unable to do group by with 1st column



 Hi,



 On Fri, Dec 26, 2014 at 5:22 AM, Amit Behera amit.bd...@gmail.com
 wrote:

  How can I do it? Please help me to do.



 Have you considered using groupByKey?

 http://spark.apache.org/docs/latest/programming-guide.html#transformations



 Tobias





Re: unable to do group by with 1st column

2014-12-26 Thread Michael Albert
Greetings!
I'm trying to do something similar, and having a very bad time of it.
What I start with is
key1: (col1: val-1-1, col2: val-1-2, col3: val-1-3, col4: val-1-4...)
key2: (col1: val-2-1, col2: val-2-2, col3: val-2-3, col4: val-2-4, ...)
What I want (what I have been asked to produce :-)) is:
col1: (key1: val-1-1, key2: val-2-1, key3: val-3-1, ...)
col2: (key1: val-1-2, key2: val-2-2, key3: val-3-2, ...)
So basically the transpose.  The input is actually avro/parquet with each key
in one record. In the output, the final step is to convert each column into a
MATLAB file. Please don't ask me whether this is a good idea.
I can get this to work for smallish data sets (e.g., a few hundred keys and a
few hundred columns). However, if I crank up the number of keys to about 5e7,
then this fails, even if I turn the number of columns that are actually used
down to 10.
The system seems to spend lots of time resubmitting parts of the first phase in 
which the data is read from the original records and shuffled and never quite 
finishes.
I can't post the code, but I can give folks an idea of what I've tried.
Try #1: Mapper emits data as (DataKey(col-as-int, key-as-int),
value-as-Option[Any]), then create a ShuffledRDD using the col-as-int for
partitioning and then setKeyOrdering on the key-as-int.  This is then fed to
mapPartitionsWithIndex (a rough sketch of this attempt appears below).
Try #2: Emit (col-as-int, (key-as-int, value)) and groupBy, and have a final
map() on each col.
Try #3: Emit (col-as-int, Collection[(key-as-int, value)]), then have a
reduceByKey which takes the union of the collection (union for set, ++ for
list), then have a final map() which attempts the final conversion.
No matter what I do, it works for small numbers of keys (hundreds), but
when I crank it up, it seems to sit there resubmitting the shuffle phase.
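
A rough, hedged sketch of what Try #1 above could look like in code. ShuffledRDD
and setKeyOrdering are real (developer-level) Spark APIs; the names DataKey,
ColPartitioner, and tryOne, the Float value type, and the placeholder output are
illustrative, not the actual job:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.{RDD, ShuffledRDD}

// Illustrative composite key: column id first, then the record key.
case class DataKey(col: Int, key: Int)

// Partition on the column only, so one partition sees complete columns.
class ColPartitioner(parts: Int) extends Partitioner {
  def numPartitions: Int = parts
  def getPartition(k: Any): Int = k match {
    case DataKey(col, _) => ((col % parts) + parts) % parts
  }
}

def tryOne(pairs: RDD[(DataKey, Option[Float])], parts: Int): RDD[(Int, Int, Option[Float])] = {
  // Shuffle partitioned by column, with records sorted by (col, key) within each
  // partition, so a whole column can be written by streaming the iterator.
  new ShuffledRDD[DataKey, Option[Float], Option[Float]](pairs, new ColPartitioner(parts))
    .setKeyOrdering(Ordering.by((k: DataKey) => (k.col, k.key)))
    .mapPartitionsWithIndex { case (partIdx, it) =>
      // Placeholder: the real job would open one MATLAB writer per column here.
      it.map { case (DataKey(col, key), value) => (col, key, value) }
    }
}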
Happy holidays, all!
-Mike

  From: Amit Behera amit.bd...@gmail.com
 To: u...@spark.incubator.apache.org 
 Sent: Thursday, December 25, 2014 3:22 PM
 Subject: unable to do group by with 1st column
   
Hi Users,
I am reading a csv file and my data format is like :
key1,value1
key1,value2
key1,value1
key1,value3
key2,value1
key2,value5
key2,value5
key2,value4
key1,value4
key1,value4
key3,value1
key3,value1
key3,value2

required output :
key1:[value1,value2,value1,value3,value4,value4]
key2:[value1,value5,value5,value4]
key3:[value1,value1,value2]

How can I do it? Please help me to do.
Thanks
Amit


  

Re: unable to do group by with 1st column

2014-12-26 Thread Sean Owen
Here is a sketch of what you need to do off the top of my head and based on
a guess of what your RDD is like:

val in: RDD[(K,Seq[(C,V)])] = ...

in.flatMap { case (key, colVals) =>
  colVals.map { case (col, value) =>
    (col, (key, value))
  }
}.groupByKey

So the problem with both input and output here is that all values for each
key exist in memory at once. When transposed, each element contains 50M key
value pairs.

You probably should try to do what you're trying to do a slightly different
way.

Depends on what you mean by resubmitting but I imagine you need a cache()
on an RDD you are reusing.
On Dec 26, 2014 4:18 PM, Michael Albert m_albert...@yahoo.com.invalid
wrote:

 Greetings!

 I'm trying to do something similar, and having a very bad time of it.

 What I start with is

 key1: (col1, val-1-1, col2: val-1-2, col3: val-1-3, col4: val-1-4...)
 key2: (col1: val-2-1, col2: val-2-2, col3: val-2-3, col4: val 2-4, ...)
 

 What I want  (what I have been asked to produce :-)) is:

 col1: (key1: val-1-1, key2: val-2-1, key3, val-3-1, ...)
 col2: (key1: val-1-2, key2: val2-2, key3: val-3-2,...)

 So basically the transpose.  The input is actually avro/parquet with each
 key in one record.
 In the output, the final step is to convert each column into a matlab
 file.
 Please don't ask me whether this is a good idea.

 I can get this to work for smallish data sets (e.g, a few hundred keys and
 a few hundred columns).
 However, if I crank up the number of keys to about 5e7, then this fails,
 even if I turn the number of columns that are actually used down to 10.

 The system seems to spend lots of time resubmitting parts of the first
 phase
 in which the data is read from the original records and shuffled and never
 quite finishes.

 I can't post the code, but I can give folks an idea of what I've tried.

 Try #1: Mapper emits data as (DataKey(col-as-int,key-as-int),
 value-as-Option[Any]),
 then create a ShuffledRDD using the col-as-int for partitioning and
  then setKeyOrdering on the key-as-int.  This is then fed to
  mapPartitionsWithIndex.

 Try #2: Emit (col-as-int, (key-as-int, value)) and groupBy, and have a
 final map() on each col.

 Try #3: Emit (col-as-int, Collection[(key-as-int, value)]), then have a
 reduceByKey
 which takes the union of the collection (union for set, ++ for list)
 then have
 a final map() which attempts the final conversion.

 No matter what I do, it works for small numbers of keys (hundreds),
 but
 when I crank it up, it seems to sit there resubmitting the shuffle phase.

 Happy holidays, all!
 -Mike



   --
  *From:* Amit Behera amit.bd...@gmail.com
 *To:* u...@spark.incubator.apache.org
 *Sent:* Thursday, December 25, 2014 3:22 PM
 *Subject:* unable to do group by with 1st column

 Hi Users,

 I am reading a csv file and my data format is like :

 key1,value1
 key1,value2
 key1,value1
 key1,value3
 key2,value1
 key2,value5
 key2,value5
 key2,value4
 key1,value4
 key1,value4
 key3,value1
 key3,value1
 key3,value2

 required output :

 key1:[value1,value2,value1,value3,value4,value4]
 key2:[value1,value5,value5,value4]
 key3:[value1,value1,value2]


 How can I do it? Please help me to do.

 Thanks
 Amit





Re: unable to do group by with 1st column

2014-12-25 Thread Tobias Pfeiffer
Hi,

On Fri, Dec 26, 2014 at 5:22 AM, Amit Behera amit.bd...@gmail.com wrote:

 How can I do it? Please help me to do.


Have you considered using groupByKey?
http://spark.apache.org/docs/latest/programming-guide.html#transformations

Tobias


RE: unable to do group by with 1st column

2014-12-25 Thread Somnath Pandeya
Hi ,
You can try reduceByKey also,
Something like this
JavaPairRDD<String, String> ones = lines
    .mapToPair(new PairFunction<String, String, String>() {
        @Override
        public Tuple2<String, String> call(String s) {
            String[] temp = s.split(",");
            return new Tuple2<String, String>(temp[0], temp[1]);
        }
    });

JavaPairRDD<String, String> counts = ones
    .reduceByKey(new Function2<String, String, String>() {
        @Override
        public String call(String i1, String i2) {
            return i1 + "," + i2;
        }
    });

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, December 26, 2014 6:35 AM
To: Amit Behera
Cc: u...@spark.incubator.apache.org
Subject: Re: unable to do group by with 1st column

Hi,

On Fri, Dec 26, 2014 at 5:22 AM, Amit Behera amit.bd...@gmail.com wrote:
How can I do it? Please help me to do.

Have you considered using groupByKey?
http://spark.apache.org/docs/latest/programming-guide.html#transformations

Tobias
