Spark to HBase Fast Bulk Upload

2016-09-19 Thread Punit Naik
Hi guys, I have a huge dataset (~1 TB) which has about a billion records. I have to transfer it to an HBase table. What is the fastest way of doing it? -- Thank You Regards Punit Naik
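The usual fast path for this is to bypass HBase's write path entirely: range-partition and sort the records by row key, write HFiles (in Spark typically via HFileOutputFormat2 and saveAsNewAPIHadoopFile), then hand them to LoadIncrementalHFiles. The sketch below is only a pure-Python toy of the ordering requirement bulk load imposes; the region split keys and records are made up for illustration.

```python
# Toy simulation of the HFile bulk-load ordering requirement:
# records must be routed to the partition (region) whose key range
# contains their row key, and sorted within each partition, before
# HFiles can be written. Split keys below are hypothetical.
import bisect

def partition_for(row_key, split_keys):
    """Index of the region whose key range contains row_key."""
    return bisect.bisect_right(split_keys, row_key)

split_keys = ["g", "p"]          # 3 regions: (-inf,g), [g,p), [p,+inf)
records = [("zebra", 1), ("apple", 2), ("mango", 3), ("kiwi", 4)]

partitions = [[] for _ in range(len(split_keys) + 1)]
for key, val in records:
    partitions[partition_for(key, split_keys)].append((key, val))
for p in partitions:
    p.sort()                     # sorted within each region partition

print(partitions)
```

In real Spark code the routing step corresponds to repartitioning with a partitioner built from the table's region boundaries.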

Partition RDD based on K-Means Clusters

2016-09-15 Thread Punit Naik
of partitions created are equal to the number of clusters (2 in this case) and each partition has all the elements belonging to a certain cluster in it. -- Thank You Regards Punit Naik
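One way to get this behaviour in Spark is to key each point by its cluster id and call partitionBy with a custom Partitioner whose getPartition simply returns that id. The Python below is a toy stand-in for that routing step; the cluster assignments are hypothetical, not produced by K-Means.

```python
# Toy stand-in for Spark's partitionBy with a custom Partitioner:
# partition index == cluster id, so each partition ends up holding
# exactly the elements of one cluster.
k = 2
keyed = [(0, 1.1), (1, 9.8), (0, 0.9), (1, 10.2)]  # (cluster_id, point)

partitions = [[] for _ in range(k)]
for cluster_id, point in keyed:
    partitions[cluster_id].append(point)

print(partitions)   # one partition per cluster
```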

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
I meant to say that first we can sort the individual partitions and then produce the final ordering by merging them. Sort of a divide-and-conquer mechanism. Does sortByKey take care of all this internally? On Fri, Jul 15, 2016 at 12:08 AM, Punit Naik <naik.puni...@gmail.com> wrote: > Can we increase th
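For the record, Spark's sortByKey does handle this internally: it range-partitions the data so that the per-partition sorted runs concatenate directly, with no merge pass needed. The divide-and-conquer property described above can be sketched in plain Python (a toy, not Spark code):

```python
# "Sort each partition, then merge": sorting p chunks locally and
# k-way merging the runs gives the same result as one global sort.
import heapq

chunks = [[5, 1, 9], [4, 8, 2], [7, 3, 6]]
sorted_chunks = [sorted(c) for c in chunks]   # sort partitions locally
merged = list(heapq.merge(*sorted_chunks))    # k-way merge of runs

assert merged == sorted(x for c in chunks for x in c)
print(merged)
```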

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
Can we increase the sorting speed of an RDD by doing a secondary sort first? On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik <naik.puni...@gmail.com> wrote: > Okay. Can't I supply the same partitioner I used for > "repartitionAndSortWithinPartitions" as an argument to "sor

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
titioner than > repartitionAndSortWithinPartitions you do not get much benefit from running > sortByKey after repartitionAndSortWithinPartitions (because all the data > will get shuffled again) > > > On Thu, Jul 14, 2016 at 1:59 PM, Punit Naik <naik.puni...@gmail.com> >

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
repartitionAndSortWithinPartitions sorts by keys, not values per key, so not > really secondary sort by itself. > > for secondary sort also check out: > https://github.com/tresata/spark-sorted > > > On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik <naik.puni...@gmail.com> > wrote: > >> Hi guys >
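A toy illustration of the distinction drawn above, assuming "secondary sort" means additionally ordering the values within each key (which is what libraries like spark-sorted provide). In plain Python this is just a composite sort key:

```python
# Key-only sort leaves values unordered within a key; a secondary
# sort orders by (key, value) so values are sorted per key too.
pairs = [("b", 3), ("a", 2), ("b", 1), ("a", 5)]

secondary_sorted = sorted(pairs)   # sorts by key, then by value
print(secondary_sorted)   # [('a', 2), ('a', 5), ('b', 1), ('b', 3)]
```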

repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
Will "sortByKey" after "repartitionAndSortWithinPartitions" be faster now that the individual partitions are sorted? -- Thank You Regards Punit Naik
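A small sketch of why no follow-up sort is needed when the same range partitioner is reused: once data is range-partitioned and sorted within each partition, concatenating the partitions in order is already a global sort. The boundaries below are made up for illustration.

```python
# Range-partition, sort within partitions, concatenate in partition
# order: the result is globally sorted with no extra sort pass.
boundaries = [10, 20]   # ranges: (-inf,10), [10,20), [20,+inf)
data = [17, 3, 25, 11, 8, 22]

parts = [[], [], []]
for x in data:
    idx = sum(x >= b for b in boundaries)   # which range x falls in
    parts[idx].append(x)
parts = [sorted(p) for p in parts]          # sort within partitions

concatenated = [x for p in parts for x in p]
assert concatenated == sorted(data)
print(concatenated)
```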

Spark Terasort Help

2016-07-08 Thread Punit Naik
The one which is above is the latest one which is failing. Can anyone help me in designing the configuration or set some properties which will not result in executors failing and let the terasort complete? -- Thank You Regards Punit Naik
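Without the stack trace it is hard to be specific, but executor failures during Terasort-scale shuffles are usually memory-overhead or shuffle-timeout related. Below is a hedged starting point, not a known-good configuration: the flags are standard spark-submit options, but every number is illustrative, and the class name, jar, and paths are placeholders.

```shell
# Illustrative tuning for a large shuffle-heavy job: extra off-heap
# overhead, longer network timeout, and more shuffle fetch retries.
# Adjust all numbers to your cluster; placeholders are hypothetical.
spark-submit \
  --num-executors 50 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.network.timeout=800s \
  --conf spark.shuffle.io.maxRetries=10 \
  --conf spark.shuffle.io.retryWait=30s \
  --class com.example.Terasort \
  terasort.jar hdfs:///input hdfs:///output
```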

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
On 29-Jun-2016 6:31 AM, "Ted Yu" <yuzhih...@gmail.com> wrote: > Since the data.length is variable, I am not sure whether mixing data.length > and the index makes sense. > > Can you describe your use case in bit more detail ? > > Thanks > > On Tue, Jun 28,

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
d: > val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition] > firstParent[T].iterator(split.prev, context).zipWithIndex.map { x => > (x._1, split.startIndex + x._2) > > You can modify the second component of the tuple to take data.length into > account. > >

Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
Hi, I wanted to change the functioning of the "zipWithIndex" function for Spark RDDs so that the output of the function is, just as an example, "(data, prev_index + data.length)" instead of "(data, prev_index + 1)". How can I do this? -- Thank You Regards Punit Naik
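On one reading of "(data, prev_index + data.length)", the desired index is a running offset: each element's index is the sum of the lengths of all previous elements, rather than a plain counter. A toy Python equivalent of that modified zipWithIndex (the data values are illustrative):

```python
# Indices as cumulative lengths (exclusive prefix sum) instead of the
# usual 0, 1, 2, ... counter produced by zipWithIndex.
from itertools import accumulate

data = ["spark", "rdd", "zip"]
offsets = [0] + list(accumulate(len(d) for d in data))[:-1]
zipped = list(zip(data, offsets))
print(zipped)   # [('spark', 0), ('rdd', 5), ('zip', 8)]
```

In Spark itself the analogous change would replace the per-partition `split.startIndex + x._2` logic quoted earlier in the thread with a running sum of element lengths, which requires a first pass to compute each partition's total length.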