Re: [DISCUSS] Data loading improvement

2017-05-24 Thread Kumar Vishal
Hi All, Do we have any statistics on the current bottlenecks, showing which part takes the most time? *@Ravindra* Please correct me if I am wrong: I think our current unsafe sort is also in-place; it swaps only the offsets, not the data. Only for comparison does it get the data from off-heap to on-hea
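
A minimal sketch of what "swapping only the offsets" can look like, not CarbonData's actual unsafe sort. The class name `OffsetSort`, the 4-byte int key at the start of each row, and the insertion sort (chosen for brevity) are all assumptions made for illustration. The row bytes stay where they are in the direct (off-heap) buffer; only the `int[]` of offsets is reordered, and a key is read from the buffer only when two rows are compared.

```java
import java.nio.ByteBuffer;

final class OffsetSort {
    /**
     * Reorders {@code offsets} so that the int keys stored at those
     * positions in {@code rows} are ascending. The buffer is only read.
     * Hypothetical layout: each row starts with a 4-byte int sort key.
     */
    static void sortOffsets(ByteBuffer rows, int[] offsets) {
        // Insertion sort keeps the sketch short; a real loader would use
        // an O(n log n) algorithm over the same offset array.
        for (int i = 1; i < offsets.length; i++) {
            int current = offsets[i];
            int key = rows.getInt(current);   // fetch key for comparison only
            int j = i - 1;
            while (j >= 0 && rows.getInt(offsets[j]) > key) {
                offsets[j + 1] = offsets[j];  // move the offset, not the row
                j--;
            }
            offsets[j + 1] = current;
        }
    }
}
```

With three rows holding keys 30, 10, 20 at offsets 0, 4, 8, the offset array ends up as 4, 8, 0 while the buffer contents are unchanged.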

Re: [DISCUSS] Data loading improvement

2017-05-22 Thread Jacky Li
Yes, and if, after dictionary encoding, SORT_COLUMNS can fit in 6 bytes, our approach can be even better, because the 8 bytes of data can fit entirely in cache, without a remaining portion in memory. Regards, Jacky > On May 22, 2017, at 5:23 PM, Ravindra Pesala wrote: > > Hi, > > I think you are referri

Re: [DISCUSS] Data loading improvement

2017-05-22 Thread Ravindra Pesala
Hi, I think you are referring to Tungsten sort, where they tried to keep the pointer and key together to simulate cache-aware computation. It is only possible if the sort keys always start with fixed-length keys such as dictionary keys. So basically the first few dictionary columns encountered can be kept along w

Re: [DISCUSS] Data loading improvement

2017-05-21 Thread Jacky Li
Yes, Ravindra, unsafe sort will be better. In my last mail, I mentioned an 8-byte encoded format for RowID + SORT_COLUMNS. If SORT_COLUMNS are dictionary encoded, I think it is effectively like unsafe, which is only of type byte[8], right? So we can do this by ourselves instead of depending on 3r
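
One way to picture the 8-byte RowID + SORT_COLUMNS format is a single `long` per row. The sketch below is an assumption, not the layout proposed in the thread: it puts a 6-byte (48-bit) dictionary-encoded sort key in the high bits and a 16-bit row ID in the low bits, so sorting the longs as unsigned values orders rows by key and then by row ID. The class name `PackedSortKey` and the bit split are hypothetical.

```java
import java.util.Arrays;

final class PackedSortKey {
    // Assumed split: 48 bits of sort key, 16 bits of row ID.
    static final int ROW_ID_BITS = 16;
    static final long ROW_ID_MASK = (1L << ROW_ID_BITS) - 1;
    static final long KEY_MASK = (1L << 48) - 1;

    /** Packs a 48-bit sort key and a 16-bit row ID into one long. */
    static long pack(long sortKey, int rowId) {
        if ((sortKey & ~KEY_MASK) != 0) {
            throw new IllegalArgumentException("sort key needs more than 6 bytes");
        }
        if ((rowId & ~ROW_ID_MASK) != 0) {
            throw new IllegalArgumentException("row ID needs more than 2 bytes");
        }
        return (sortKey << ROW_ID_BITS) | rowId;
    }

    static long sortKey(long packed) {
        return packed >>> ROW_ID_BITS;
    }

    static int rowId(long packed) {
        return (int) (packed & ROW_ID_MASK);
    }

    /** Sorts packed entries in place by key, then row ID (unsigned order). */
    static void sort(long[] packed) {
        // Arrays.sort compares signed longs; flipping the sign bit maps
        // unsigned order onto signed order, and flipping again restores it.
        for (int i = 0; i < packed.length; i++) {
            packed[i] ^= Long.MIN_VALUE;
        }
        Arrays.sort(packed);
        for (int i = 0; i < packed.length; i++) {
            packed[i] ^= Long.MIN_VALUE;
        }
    }
}
```

Because the whole sort entry is a primitive `long`, the sort runs over a flat `long[]` with no per-row objects, and the row ID recovered after sorting can be used to locate the rest of the row.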

Re: [DISCUSS] Data loading improvement

2017-05-21 Thread Jacky Li
-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-Data-loading-improvement-tp11429p13056.html > Sent from the Apache CarbonData Dev Mailing List archive mailing list archive > at Nabble.com.

Re: [DISCUSS] Data loading improvement

2017-05-21 Thread Ravindra Pesala
Hi, Using Object[] as a row while loading is not efficient in terms of memory usage. It would be more efficient to keep the rows in unsafe memory, as it can store the data more compactly according to data type. As for sorting, it would be good to concentrate on a single sorting solution. Since we already
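
To illustrate the memory point, here is a sketch of a fixed-width row stored in a direct `ByteBuffer` rather than in an `Object[]`. It is not CarbonData's row format: the class `CompactRow` and the three-column schema (an int dictionary key, a long measure, a double measure) are hypothetical. Each row takes exactly 20 bytes and no boxed `Integer`, `Long`, or `Double` objects are created.

```java
import java.nio.ByteBuffer;

final class CompactRow {
    // Hypothetical schema: int dictionary key, long measure, double measure.
    static final int ROW_SIZE = Integer.BYTES + Long.BYTES + Double.BYTES;

    /** Writes one row at the slot for {@code rowIndex}. */
    static void write(ByteBuffer buf, int rowIndex,
                      int dictKey, long count, double amount) {
        int base = rowIndex * ROW_SIZE;
        buf.putInt(base, dictKey);
        buf.putLong(base + Integer.BYTES, count);
        buf.putDouble(base + Integer.BYTES + Long.BYTES, amount);
    }

    static int dictKey(ByteBuffer buf, int rowIndex) {
        return buf.getInt(rowIndex * ROW_SIZE);
    }

    static long count(ByteBuffer buf, int rowIndex) {
        return buf.getLong(rowIndex * ROW_SIZE + Integer.BYTES);
    }

    static double amount(ByteBuffer buf, int rowIndex) {
        return buf.getDouble(rowIndex * ROW_SIZE + Integer.BYTES + Long.BYTES);
    }
}
```

A buffer for N rows is allocated once with `ByteBuffer.allocateDirect(N * CompactRow.ROW_SIZE)`, so the garbage collector sees one object instead of several per row.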

Re: [DISCUSS] Data loading improvement

2017-05-21 Thread David CaiQiang
-Data-loading-improvement-tp11429p13056.html Sent from the Apache CarbonData Dev Mailing List archive mailing list archive at Nabble.com.

Re: [DISCUSS] Data loading improvement

2017-05-21 Thread Jacky Li
Hi, While working on the data load improvement and the encoding override feature, I found that it is not efficient to use CarbonRow with Object[]. I think a better way is to use fixed-length primitive types instead of Object. Since SORT_COLUMNS is now merged, I think it is possible to: 1.

Re: [DISCUSS] Data loading improvement

2017-04-26 Thread Jacky Li
Hi community, More ideas on the loading process: in my opinion the ideal one would be as follows, please comment: 1. input step: - Parse the input data, either CSV or Dataframe; both are converted into CarbonRow. - Buffer them into CarbonRowBatch - Prefetch rows 2. convert st

Re: [DISCUSS] Data loading improvement

2017-04-26 Thread Jacky Li
reply inline > On Apr 26, 2017, at 2:03 AM, Vimal Das Kammath wrote: > > +1 Jacky, it's a very good initiative. I think it will improve > performance by reducing GC overhead, as the new approach could > potentially create fewer short-lived objects. > > I have a few concerns: > 1) I could not follow the

Re: [DISCUSS] Data loading improvement

2017-04-26 Thread Vimal Das Kammath
+1 Jacky, it's a very good initiative. I think it will improve performance by reducing GC overhead, as the new approach could potentially create fewer short-lived objects. I have a few concerns: 1) I could not follow the sort improvement using a separate row ID array; could you elaborate more

Re: [DISCUSS] Data loading improvement

2017-04-22 Thread Liang Chen
Jacky, thank you for listing these constructive improvements to data loading. I agree we should consider all these improvement points; I have some concerns about only the one below. Before considering open interfaces for data loading, we need to define more clearly what different roles block/blocklet/page play, t

[DISCUSS] Data loading improvement

2017-04-20 Thread Jacky Li
I want to propose the following improvements to the data loading process: Currently, different steps use different data layouts in CarbonRow, and the data is converted back and forth between steps. It is not easy for developers to understand the data structure used in each step, and it increases the memory