Hi All,
Do we have any statistics about the current bottlenecks, i.e., which part is
taking more time?
*@Ravindra* Please correct me if I am wrong, but I think our current unsafe sort
is also in-place: it only swaps the offsets, not the data. Only during
comparison does it fetch the data from off-heap to on-heap.
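To illustrate the offset-swapping idea described above (a sketch only, not CarbonData's actual unsafe sort; the single-int row format is an assumption for illustration): the row bytes stay in a direct, off-heap buffer, the sort permutes only an array of row offsets, and the buffer is dereferenced only inside the comparator.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class OffsetSort {
    // Sorts rows stored in an off-heap buffer by swapping offsets only.
    // Each row is assumed to be a single 4-byte int key, for illustration.
    static int[] sortOffsets(ByteBuffer offHeap, int rowCount, int rowSize) {
        Integer[] offsets = new Integer[rowCount];
        for (int i = 0; i < rowCount; i++) offsets[i] = i * rowSize;
        // Only the comparator touches the buffer; the sort itself moves
        // offsets, never the row bytes.
        Arrays.sort(offsets,
            (a, b) -> Integer.compare(offHeap.getInt(a), offHeap.getInt(b)));
        int[] result = new int[rowCount];
        for (int i = 0; i < rowCount; i++) result[i] = offsets[i];
        return result;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(12);
        buf.putInt(0, 30); buf.putInt(4, 10); buf.putInt(8, 20);
        // Offsets ordered by key value:
        System.out.println(Arrays.toString(sortOffsets(buf, 3, 4))); // [4, 8, 0]
    }
}
```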
Yes, and if SORT_COLUMNS can fit in 6 bytes after dictionary encoding, our
approach can be even better, because the 8-byte data can be kept entirely in
cache, without the remaining portion in memory.
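A minimal sketch of the 8-byte RowID + SORT_COLUMNS encoding being discussed. The bit split is an assumption for illustration (upper 48 bits for the dictionary-encoded sort key, lower 16 bits for the row id): sorting the packed longs orders rows by sort key with a primitive-array sort, and the row id survives the sort as a back-pointer.

```java
import java.util.Arrays;

public class PackedKey {
    // Pack a 48-bit dictionary sort key and a 16-bit row id into one long.
    static long pack(long sortKey48, int rowId16) {
        return (sortKey48 << 16) | (rowId16 & 0xFFFFL);
    }
    static long sortKey(long packed) { return packed >>> 16; }
    static int rowId(long packed)    { return (int) (packed & 0xFFFF); }

    public static void main(String[] args) {
        long[] keys = { pack(300, 0), pack(100, 1), pack(200, 2) };
        Arrays.sort(keys);  // primitive sort: no object headers, cache friendly
        for (long k : keys) {
            System.out.println(sortKey(k) + " -> row " + rowId(k));
        }
        // prints: 100 -> row 1 / 200 -> row 2 / 300 -> row 0
    }
}
```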
Regards,
Jacky
> On 22 May 2017, at 17:23, Ravindra Pesala wrote:
>
> Hi,
>
> I think you are referring…
Hi,
I think you are referring to Tungsten sort; there they tried to keep the
pointer and key together to enable cache-aware computation. It is only
possible if the sort keys always start with fixed keys, like dictionary keys.
So basically the first few dictionary columns encountered can be kept along
with…
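A sketch of the Tungsten-style trick described above (illustrative names and layout, not Spark's actual code): a fixed-size key prefix is stored next to each record pointer, prefixes are compared first (cache friendly), and the full record is dereferenced only when prefixes tie.

```java
import java.util.Arrays;
import java.util.Comparator;

public class PrefixPointerSort {
    // Each entry: [0] = key prefix (e.g. first dictionary key), [1] = row pointer.
    static long[][] sort(long[][] entries, long[] fullKeys) {
        Arrays.sort(entries, Comparator
            .<long[]>comparingLong(e -> e[0])                 // cheap prefix compare
            .thenComparingLong(e -> fullKeys[(int) e[1]]));   // dereference only on tie
        return entries;
    }

    public static void main(String[] args) {
        long[] fullKeys = { 9, 7, 8 };                 // indexed by row pointer
        long[][] entries = { {1, 0}, {1, 1}, {0, 2} }; // rows 0 and 1 tie on prefix
        sort(entries, fullKeys);
        System.out.println(Arrays.deepToString(entries)); // [[0, 2], [1, 1], [1, 0]]
    }
}
```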
Yes, Ravindra, unsafe sort will be better.
In my last mail, I mentioned an 8-byte encoded format for RowID + SORT_COLUMNS.
If SORT_COLUMNS are dictionary encoded, I think it is effectively like an
unsafe row whose only type is byte[8], right? So we can do this by ourselves
instead of depending on a 3rd…
-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-Data-loading-improvement-tp11429p13056.html
> Sent from the Apache CarbonData Dev Mailing List archive mailing list archive
> at Nabble.com.
Hi,
Using Object[] as a row while loading is not efficient in terms of memory
usage. It would be more efficient to keep the rows in unsafe memory, as it
can store the data in a more compact way per data type.
And regarding sorting, it would be good to concentrate on a single sorting
solution. Since we already…
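To make the memory argument concrete (an assumed layout, not CarbonData's actual unsafe row): packing one row of (int, short, double) into a buffer takes 4 + 2 + 8 = 14 bytes, versus an Object[] holding three boxed values, each with an object header plus a reference. A direct buffer also keeps the row bytes off the GC heap.

```java
import java.nio.ByteBuffer;

public class CompactRow {
    // Writes one row with an assumed fixed layout:
    // bytes 0-3 int key, 4-5 short key, 6-13 double measure.
    static ByteBuffer writeRow(int k1, short k2, double m) {
        ByteBuffer row = ByteBuffer.allocateDirect(14);
        row.putInt(0, k1);
        row.putShort(4, k2);
        row.putDouble(6, m);
        return row;
    }

    public static void main(String[] args) {
        ByteBuffer row = writeRow(42, (short) 7, 3.5);
        System.out.println(row.getInt(0) + " " + row.getShort(4) + " "
            + row.getDouble(6)); // 42 7 3.5
    }
}
```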
Hi,
While I am working on the data load improvement and encoding override feature,
I found that it is not efficient to use CarbonRow with Object[]. I think a
better way is to use fixed-length primitive types instead of Object.
Since SORT_COLUMNS is now merged, I think it is possible to:
1. …
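One hypothetical shape for the fixed-length-primitive idea above (field names are assumptions for illustration, not a proposed CarbonData API): dictionary-encoded dimensions held as an int[] and measures as a double[], so no per-field boxing or Object references.

```java
public class TypedRow {
    final int[] dictDims;     // dictionary-encoded dimensions, 4 bytes each
    final double[] measures;  // fixed-length measures, no boxing

    TypedRow(int[] dictDims, double[] measures) {
        this.dictDims = dictDims;
        this.measures = measures;
    }

    public static void main(String[] args) {
        TypedRow row = new TypedRow(new int[]{1, 2}, new double[]{3.5});
        System.out.println(row.dictDims[0] + "," + row.measures[0]); // 1,3.5
    }
}
```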
Hi community,
Here are more ideas on the loading process; in my opinion the ideal one should
be as follows, please comment:
1. input step:
- Parse the input data, either CSV or Dataframe; both are converted into
CarbonRow.
- Buffer the rows into CarbonRowBatch.
- Prefetch rows.
2. convert step: …
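The buffering part of the input step above could look roughly like this (a generic sketch; `toBatches` and the list-of-lists batch are stand-ins for a CarbonRowBatch-style unit, not CarbonData's actual classes): rows from the parser are accumulated into fixed-size batches so downstream steps operate on batches instead of single rows.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class RowBatcher {
    // Groups an iterator of parsed rows into fixed-size batches.
    static <T> List<List<T>> toBatches(Iterator<T> rows, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        List<T> current = new ArrayList<>(batchSize);
        while (rows.hasNext()) {
            current.add(rows.next());
            if (current.size() == batchSize) {  // batch full: hand off downstream
                batches.add(current);
                current = new ArrayList<>(batchSize);
            }
        }
        if (!current.isEmpty()) batches.add(current);  // flush the tail
        return batches;
    }

    public static void main(String[] args) {
        List<List<Integer>> b = toBatches(List.of(1, 2, 3, 4, 5).iterator(), 2);
        System.out.println(b); // [[1, 2], [3, 4], [5]]
    }
}
```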
reply inline
> On 26 April 2017, at 02:03, Vimal Das Kammath wrote:
>
> +1 Jacky, it's a very good initiative. I think it will improve the
> performance by reducing GC overhead, as the new approach could
> potentially create fewer short-lived objects.
>
> I have a few concerns:
> 1) I could not follow the…
+1 Jacky, it's a very good initiative. I think it will improve the
performance by reducing GC overhead, as the new approach could potentially
create fewer short-lived objects.
I have a few concerns:
1) I could not follow the sort improvement using the row ID array separately;
could you elaborate more?
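For context, one plausible reading of "sorting using a row ID array" (my interpretation as a sketch, not a confirmed design from the thread): sort an array of row indices by comparing the underlying key column, leaving the rows themselves in place, then read or reorder rows once through the sorted index.

```java
import java.util.Arrays;
import java.util.Comparator;

public class RowIdSort {
    // Returns row ids ordered by key; the keys array is never moved.
    static Integer[] sortedRowIds(long[] keys) {
        Integer[] rowIds = new Integer[keys.length];
        for (int i = 0; i < keys.length; i++) rowIds[i] = i;
        Arrays.sort(rowIds, Comparator.comparingLong(id -> keys[id]));
        return rowIds;
    }

    public static void main(String[] args) {
        long[] keys = { 30, 10, 20 };
        System.out.println(Arrays.toString(sortedRowIds(keys))); // [1, 2, 0]
    }
}
```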
Jacky, thank you for listing these constructive improvements to data loading.
I agree with considering all these improvement points; only on the one below
do I have some concerns.
Before considering open interfaces for data loading, we need to more clearly
define block/blocklet/page and which different roles they play, …
I want to propose the following improvements to the data loading process:
Currently, different steps use different data layouts in CarbonRow, and the
data is converted back and forth between steps. It is not easy for developers
to understand the data structure used in each step, and it increases the memory…