Hi guys,
I have a huge dataset (~1 TB, about a billion records) that I have to
transfer to an HBase table. What is the fastest way of doing it?
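A hedged pointer: at this scale the usual fastest route is an HBase bulk load, i.e. have Spark write HFiles through `HFileOutputFormat2` and hand them to `LoadIncrementalHFiles`, rather than issuing puts through the client API. Bulk loads work best when row keys spread evenly across pre-split regions, which is often done by salting the keys. A plain-Scala sketch of the salting part only (`saltedKey` is a hypothetical helper, not an HBase API):

```scala
// Hypothetical sketch: prefix each row key with a stable, zero-padded
// salt so keys distribute evenly over numRegions pre-split regions,
// while lexicographic order still groups keys by target region.
def saltedKey(rowKey: String, numRegions: Int): String = {
  val salt = math.abs(rowKey.hashCode % numRegions)
  f"$salt%02d-$rowKey"
}

val keys = Seq("user-a", "user-b", "user-c").map(saltedKey(_, 4))
```

The salt must be derivable from the key alone so reads can reconstruct it; the two-digit padding is an assumption that there are fewer than 100 regions.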
--
Thank You
Regards
Punit Naik
The number of partitions created is equal to the number of clusters (2 in
this case), and each partition contains all the elements belonging to a
certain cluster.
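As a plain-Scala illustration of that idea (no Spark; `Point` and `partitionByCluster` are illustrative names): in Spark this would be a custom `Partitioner` whose `getPartition` simply returns the cluster id, so every element lands in its cluster's partition.

```scala
// Toy model of partitioning by cluster id: one partition per cluster,
// each holding exactly the elements assigned to that cluster.
case class Point(clusterId: Int, value: Double)

def partitionByCluster(points: Seq[Point], numClusters: Int): Vector[Seq[Point]] =
  Vector.tabulate(numClusters)(c => points.filter(_.clusterId == c))

val pts = Seq(Point(0, 1.0), Point(1, 2.0), Point(0, 3.0), Point(1, 4.0))
val parts = partitionByCluster(pts, 2)
```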
--
Thank You
Regards
Punit Naik
I meant to say that we can first sort the individual partitions and then
combine them with a merge step, a sort of divide-and-conquer mechanism.
Does sortByKey take care of all this internally?
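The divide-and-conquer idea can be sketched in plain Scala: sort each partition locally, then merge the sorted runs pairwise. For context, `sortByKey` does handle global ordering itself: it range-partitions the data by key and sorts within each partition, so the output partitions are already in global order without a separate merge pass. A toy sketch of the merge step (`merge` and `mergeSortedRuns` are illustrative names; not how Spark implements it):

```scala
// Merge two already-sorted lists into one sorted list.
def merge(a: List[Int], b: List[Int]): List[Int] = (a, b) match {
  case (Nil, ys) => ys
  case (xs, Nil) => xs
  case (x :: xs, y :: ys) =>
    if (x <= y) x :: merge(xs, y :: ys) else y :: merge(x :: xs, ys)
}

// Divide and conquer: recursively merge the sorted runs pairwise.
def mergeSortedRuns(runs: List[List[Int]]): List[Int] = runs match {
  case Nil      => Nil
  case r :: Nil => r
  case _ =>
    val (left, right) = runs.splitAt(runs.length / 2)
    merge(mergeSortedRuns(left), mergeSortedRuns(right))
}

// Sort each "partition" locally, then merge them into a global order.
val partitions = List(List(5, 1, 9), List(7, 2), List(8, 3)).map(_.sorted)
val globallySorted = mergeSortedRuns(partitions)
```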
On Fri, Jul 15, 2016 at 12:08 AM, Punit Naik <naik.puni...@gmail.com> wrote:
> Can we increase the sorting speed of an RDD by doing a secondary sort first?
Can we increase the sorting speed of an RDD by doing a secondary sort first?
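For context, "secondary sort" usually means ordering the values within each key's group, not speeding up the sort of the keys themselves. A plain-Scala sketch of the composite-key trick (no Spark; the data and names are illustrative), which is the pattern spark-sorted builds on top of `repartitionAndSortWithinPartitions`:

```scala
// Secondary sort in miniature: sorting by the composite (key, value)
// pair means that within each key's run the values come out ordered.
val events = Seq(("userA", 3), ("userB", 1), ("userA", 1), ("userB", 2), ("userA", 2))

val secondarySorted: Map[String, Seq[Int]] =
  events.sorted                              // orders by (key, value)
    .groupBy(_._1)                           // group preserves that order
    .map { case (k, kvs) => k -> kvs.map(_._2) }
```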
On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik <naik.puni...@gmail.com> wrote:
> Okay. Can't I supply the same partitioner I used for
> "repartitionAndSortWithinPartitions" as an argument to "sortByKey"?
> If sortByKey uses a different partitioner than
> repartitionAndSortWithinPartitions you do not get much benefit from running
> sortByKey after repartitionAndSortWithinPartitions (because all the data
> will get shuffled again)
>
>
> On Thu, Jul 14, 2016 at 1:59 PM, Punit Naik <naik.puni...@gmail.com>
> wrote:
> repartitionAndSortWithinPartitions sorts by keys, not values per key, so not
> really secondary sort by itself.
>
> for secondary sort also check out:
> https://github.com/tresata/spark-sorted
>
>
> On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik <naik.puni...@gmail.com>
> wrote:
>
>> Hi guys
>
Would running "sortByKey" after
"repartitionAndSortWithinPartitions" be faster now that the individual
partitions are sorted?
--
Thank You
Regards
Punit Naik
The one above is the latest one, which is failing.
Can anyone help me in designing the configuration or setting some properties
that will not result in executors failing and will let the terasort complete?
--
Thank You
Regards
Punit Naik
On 29-Jun-2016 6:31 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
> Since the data.length is variable, I am not sure whether mixing data.length
> and the index makes sense.
>
> Can you describe your use case in a bit more detail?
>
> Thanks
>
> On Tue, Jun 28, Ted Yu wrote:
> val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
> firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
>   (x._1, split.startIndex + x._2)
> }
> You can modify the second component of the tuple to take data.length into
> account.
>
>
Hi
I wanted to change the functioning of the "zipWithIndex" function for Spark
RDDs so that the output of the function is, just as an example, "(data,
prev_index + data.length)" instead of "(data, prev_index + 1)".
How can I do this?
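A plain-Scala sketch of the requested behavior (`zipWithLengthOffset` is a hypothetical name): use `scanLeft` to accumulate a running offset that grows by `data.length` instead of by 1. In an RDD version you would mirror `ZippedWithIndexRDD`, but compute each partition's start offset from the total lengths in earlier partitions rather than their element counts.

```scala
// Pair each element with a running offset that advances by the
// element's length, instead of the usual index that advances by 1.
def zipWithLengthOffset(data: Seq[String], start: Long = 0L): Seq[(String, Long)] = {
  // scanLeft produces the offset *before* each element, plus one extra
  // trailing total; zip drops the trailing value automatically.
  val offsets = data.scanLeft(start)((acc, d) => acc + d.length)
  data.zip(offsets)
}

val out = zipWithLengthOffset(Seq("ab", "cdef", "g"))
```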
--
Thank You
Regards
Punit Naik