Re: RFR: 8283681: Improve ZonedDateTime offset handling

2022-03-25 Thread Richard Startin
On Fri, 25 Mar 2022 12:28:58 GMT, Claes Redestad  wrote:

> Richard Startin prompted me to have a look at a case where java.time 
> underperforms relative to joda time 
> (https://twitter.com/richardstartin/status/1506975932271190017). 
> 
> It seems his java.time tests suffer from heavy allocations due to 
> ZoneOffset::getRules allocating a new ZoneRules object every time, and escape 
> analysis failing to eliminate the allocation in his test. The patch here adds a 
> simple specialization so that when creating ZonedDateTimes using a ZoneOffset we 
> don't query the rules at all. This removes the risk of extra allocations and 
> slightly speeds up ZonedDateTime creation for both ZoneOffset (+14%) and 
> ZoneRegion (+5%), even when EA works as it should (the case in the 
> microbenchmark provided here).
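
For context, a minimal sketch of the two creation paths being compared (illustrative only; this is not the benchmark added in the PR):

import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class ZonedDateTimeCreation {

    static final Instant NOW = Instant.now();
    static final ZoneOffset OFFSET = ZoneOffset.ofHours(1);   // fixed offset, rules are trivial
    static final ZoneId REGION = ZoneId.of("Europe/London");  // region zone, rules genuinely needed

    // Before the patch this path still went through ZoneOffset::getRules, which
    // allocates a fresh ZoneRules on every call unless escape analysis removes it.
    static ZonedDateTime withOffset() {
        return ZonedDateTime.ofInstant(NOW, OFFSET);
    }

    // The region-based path has to consult the rules to resolve the offset.
    static ZonedDateTime withRegion() {
        return ZonedDateTime.ofInstant(NOW, REGION);
    }
}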

test/micro/org/openjdk/bench/java/time/GetYearBench.java line 70:

> 68: private static final long[] INSTANT_MILLIS = createInstants();
> 69: 
> 70: private static final int[] YEARS = new int[INSTANT_MILLIS.length];

Does it make any difference if these aren't constant?

-

PR: https://git.openjdk.java.net/jdk/pull/7957


Re: [VOTE] Apache Pinot 0.10.0 RC0

2022-03-24 Thread Richard Startin
+1

- verified sha512 hash
- verified signature
- verified git hash
- verified contents based on git commit hash & the downloaded source code
- verified LICENSE, NOTICE are correctly present
- compiled the downloaded source code
- ran quick start scripts

On Mon, Mar 21, 2022 at 7:01 PM Sajjad Moradi  wrote:

> Hi Pinot Community,
>
> This is a call for a vote to release Apache Pinot 0.10.0.
>
> The release candidate:
> https://dist.apache.org/repos/dist/dev/pinot/apache-pinot-0.10.0-rc0/
>
> Git tag for this release:
> https://github.com/apache/pinot/tree/release-0.10.0-rc0
>
> Git hash for this release:
> 30c4635bfeee88f88aa9c9f63b93bcd4a650607f
>
> The artifact has been signed with key: 9079294B, which can be found in
> the following KEYS file:
> https://dist.apache.org/repos/dist/release/pinot/KEYS
>
> Release notes:
> https://github.com/apache/pinot/releases/tag/release-0.10.0-rc0
>
> Staging repository:
> https://repository.apache.org/content/repositories/orgapachepinot-1035
>
> Documentation on verifying a release candidate:
>
> https://cwiki.apache.org/confluence/display/PINOT/Validating+a+release+candidate
>
> The vote will be open for at least 72 hours or until a necessary number of
> votes is reached.
>
> Please vote accordingly,
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> Thanks,
> Apache Pinot team
>


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v2]

2022-03-08 Thread Richard Startin
On Mon, 7 Mar 2022 21:41:05 GMT, Richard Startin  wrote:

>> Ludovic Henry has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Add UTF-16 benchmarks
>
> Great to see this taken up. As it’s implemented here, it’s still scalar, but 
> the unroll prevents a strength reduction of the multiplication in the loop 
> from
> 
> result = 31 * result + element;
> 
> to:
> 
> result = (result << 5) - result + element
> 
> which creates a data dependency and slows the loop down.
> 
> This was first reported by Peter Levart here: 
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-September/028898.html
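
For reference, a minimal sketch of the two scalar loop shapes in question (assuming a 4-element unroll; illustrative only, not the exact code in this PR):

public class PolynomialHash {

    // Simple loop: C2 strength-reduces 31 * h to (h << 5) - h, and each
    // iteration depends on the previous one, so the loop is latency-bound.
    static int simple(byte[] a) {
        int h = 0;
        for (byte b : a) {
            h = 31 * h + (b & 0xFF);
        }
        return h;
    }

    // Unrolled loop: the same polynomial, but the per-element terms use
    // independent constants (31^4, 31^3, ...), breaking the serial dependency.
    static int unrolled(byte[] a) {
        int h = 0;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            h = 31 * 31 * 31 * 31 * h
              + 31 * 31 * 31 * (a[i]     & 0xFF)
              + 31 * 31      * (a[i + 1] & 0xFF)
              + 31           * (a[i + 2] & 0xFF)
              +                (a[i + 3] & 0xFF);
        }
        for (; i < a.length; i++) {
            h = 31 * h + (a[i] & 0xFF);
        }
        return h;
    }
}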

> @richardstartin - does that strength reduction actually happen? The bit-shift 
> transformation is valid only if the original `result` is known to be 
> non-negative.

Yes.


@State(Scope.Benchmark)
public class StringHashCode {

  @Param({"sdjhfklashdfklashdflkashdflkasdhf", "締国件街徹条覧野武鮮覧横営績難比兵州催色"})
  String string;

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  @Benchmark
  public int stringHashCode() {
    return new String(string).hashCode();
  }
}



[Hottest Region 1]..............................................................
c2, level 4, StringHashCode::stringHashCode, version 507 (384 bytes)

                 0x7f2df0142da4: shl    $0x3,%r10
                 0x7f2df0142da8: movabs $0x8,%r12
                 0x7f2df0142db2: add    %r12,%r10
                 0x7f2df0142db5: xor    %r12,%r12
                 0x7f2df0142db8: cmp    %r10,%rax
                 0x7f2df0142dbb: jne    0x7f2de8696080   ;   {runtime_call ic_miss_stub}
                 0x7f2df0142dc1: data16 xchg %ax,%ax
                 0x7f2df0142dc4: nopl   0x0(%rax,%rax,1)
                 0x7f2df0142dcc: data16 data16 xchg %ax,%ax
               [Verified Entry Point]
  0.12%        0x7f2df0142dd0: mov    %eax,-0x14000(%rsp)
  0.84%        0x7f2df0142dd7: push   %rbp
  0.22%        0x7f2df0142dd8: sub    $0x30,%rsp        ;*synchronization entry
                                                        ; - StringHashCode::stringHashCode@-1 (line 14)
               0x7f2df0142ddc: mov    0xc(%rsi),%r8d    ;*getfield string {reexecute=0 rethrow=0 return_oop=0}
                                                        ; - StringHashCode::stringHashCode@5 (line 14)
  0.73%        0x7f2df0142de0: mov    0x10(%r12,%r8,8),%eax  ; implicit exception: dispatches to 0x7f2df0142fc4
  0.10%        0x7f2df0142de5: test   %eax,%eax
         ╭     0x7f2df0142de7: je     0x7f2df0142df9    ;*synchronization entry
         │                                              ; - StringHashCode::stringHashCode@-1 (line 14)
  0.16%  │     0x7f2df0142de9: add    $0x30,%rsp
         │     0x7f2df0142ded: pop    %rbp
         │     0x7f2df0142dee: mov    0x108(%r15),%r10
  0.88%  │     0x7f2df0142df5: test   %eax,(%r10)       ;   {poll_return}
  0.18%  │     0x7f2df0142df8: retq
         ↘     0x7f2df0142df9: mov    0xc(%r12,%r8,8),%ecx  ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                        ; - java.lang.String::<init>@6 (line 236)
                                                        ; - StringHashCode::stringHashCode@8 (line 14)
               0x7f2df0142dfe: mov    0xc(%r12,%rcx,8),%r10d  ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                                        ; - java.lang.String::hashCode@13 (line 1503)
                                                        ; - StringHashCode::stringHashCode@11 (line 14)
                                                        ; implicit exception: dispatches to 0x7f2df0142fd0
  0.83%        0x7f2df0142e03: test   %r10d,%r10d
               0x7f2df0142e06: jbe    0x7f2df0142f86    ;*ifle {reexecute=0 rethrow=0 return_oop=0}
                                                        ; - java.lang.String::hashCode@14 (line 1503)
                                                        ; - StringHashCode::stringHashCode@11 (line 14)
  0.14%        0x7f2df0142e0c: movsbl 0x14(%r12,%r8,8),%r8d  ;*getfield coder {reexecute=0 rethrow=0 return_oop=0}
                                                        ; - java.lang.String::<init>@14 (line 237)
                                                        ; - StringHashCode::stringHashCode@8 (line 14)
  0.02%        0x7f2df0142e12: test   %r8d,%r8d
               0x7f2df0142e15: jne    0x7f2df0142fac    ;*ifne {reexecute=0 rethrow=0 return_oop=0}
                                                        ; - java.lang.String::isLatin1@10 (line 3266)

Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v2]

2022-03-07 Thread Richard Startin
On Fri, 4 Mar 2022 17:44:44 GMT, Ludovic Henry  wrote:

>> Despite the hash value being cached for Strings, computing the hash still 
>> represents significant CPU usage for applications handling lots of text.
>> 
>> Even though it would be generally better to do it through an enhancement to 
>> the autovectorizer, the complexity of doing it by hand is trivial and the 
>> gain is sizable (2x speedup) even without the Vector API. The algorithm has 
>> been proposed by Richard Startin and Paul Sandoz [1].
>> 
>> Speedups are as follows on an `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`
>> 
>> 
>> Benchmark                                          (size)  Mode  Cnt     Score     Error  Units
>> StringHashCode.Algorithm.scalarLatin1                   0  avgt   25     2.111 ±   0.210  ns/op
>> StringHashCode.Algorithm.scalarLatin1                   1  avgt   25     3.500 ±   0.127  ns/op
>> StringHashCode.Algorithm.scalarLatin1                  10  avgt   25     7.001 ±   0.099  ns/op
>> StringHashCode.Algorithm.scalarLatin1                 100  avgt   25    61.285 ±   0.444  ns/op
>> StringHashCode.Algorithm.scalarLatin1                1000  avgt   25   628.995 ±   0.846  ns/op
>> StringHashCode.Algorithm.scalarLatin1               10000  avgt   25  6307.990 ±   4.071  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16         0  avgt   25     2.358 ±   0.092  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16         1  avgt   25     3.631 ±   0.159  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16        10  avgt   25     7.049 ±   0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       100  avgt   25    33.626 ±   1.218  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16      1000  avgt   25   317.811 ±   1.225  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16     10000  avgt   25  3212.333 ±  14.621  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8          0  avgt   25     2.356 ±   0.097  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8          1  avgt   25     3.630 ±   0.158  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8         10  avgt   25     8.724 ±   0.065  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        100  avgt   25    32.402 ±   0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8       1000  avgt   25   321.949 ±   0.251  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8      10000  avgt   25  3202.083 ±   1.667  ns/op
>> StringHashCode.Algorithm.scalarUTF16                    0  avgt   25     2.135 ±   0.191  ns/op
>> StringHashCode.Algorithm.scalarUTF16                    1  avgt   25     5.202 ±   0.362  ns/op
>> StringHashCode.Algorithm.scalarUTF16                   10  avgt   25    11.105 ±   0.112  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  100  avgt   25    75.974 ±   0.702  ns/op
>> StringHashCode.Algorithm.scalarUTF16                 1000  avgt   25   716.429 ±   3.290  ns/op
>> StringHashCode.Algorithm.scalarUTF16                10000  avgt   25  7095.459 ±  43.847  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16          0  avgt   25     2.381 ±   0.038  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16          1  avgt   25     5.268 ±   0.422  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16         10  avgt   25    11.248 ±   0.178  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        100  avgt   25    52.966 ±   0.089  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16       1000  avgt   25   450.912 ±   1.834  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16      10000  avgt   25  4403.988 ±   2.927  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8           0  avgt   25     2.401 ±   0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8           1  avgt   25     5.091 ±   0.396  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8          10  avgt   25    12.801 ±   0.189  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         100  avgt   25    52.068 ±   0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8        1000  avgt   25   453.270 ±   0.340  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8       10000  avgt   25  4433.112 ±   2.699  ns/op
>> 
>> 
>> At Datadog, we handle a great amount of text (through logs management for 
>> example), and hashing String represents a large part of our CPU usage. It's 

Re: [VOTE] Apache Pinot 0.9.3 RC0

2021-12-24 Thread Richard Startin
+1

On Fri, Dec 24, 2021 at 10:17 AM Atri Sharma  wrote:

> +1
>
> On Fri, 24 Dec 2021, 15:41 Xiang Fu,  wrote:
>
>> Hi Pinot Community,
>>
>> This is a call for a vote to release Apache Pinot 0.9.3.
>>
>> This is a bug-fixing release that contains:
>> - Upgrade log4j to 2.17.0 to address CVE-2021-45105 (#7933)
>>
>> The release candidate:
>> https://dist.apache.org/repos/dist/dev/pinot/apache-pinot-0.9.3-rc0
>>
>> Git tag for this release:
>> https://github.com/apache/pinot/tree/release-0.9.3-rc0
>>
>> Git hash for this release:
>> e23f213cf0d16b1e9e086174d734a4db868542cb
>>
>> The artifacts have been signed with the key: CDEDB21B862F6C66, which can
>> be found in the following KEYS file.
>> https://dist.apache.org/repos/dist/release/pinot/KEYS
>>
>> Release notes:
>> https://github.com/apache/pinot/releases/tag/release-0.9.3-rc0
>>
>> Staging repository:
>> https://repository.apache.org/content/repositories/orgapachepinot-1034
>>
>> Documentation on verifying a release candidate:
>>
>> https://cwiki.apache.org/confluence/display/PINOT/Validating+a+release+candidate
>>
>> The vote will be open for at least 72 hours or until a necessary number
>> of votes is reached.
>>
>> Please vote accordingly,
>>
>> [ ] +1 approve
>> [ ] +0 no opinion
>> [ ] -1 disapprove with the reason
>>
>> Thanks,
>>
>> Apache Pinot team
>>
>


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-14 Thread Richard Startin
On Tue, 14 Sep 2021 10:57:17 GMT, Alan Bateman  wrote:

>>> Hi @iaroslavski I'm unconvinced that this work was from 14/06/2020 - I 
>>> believe this work derives from an unsigned radix sort I implemented on 
>>> 10/04/2021 
>>> [richardstartin/radix-sort-benchmark@ab4da23#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR226](https://github.com/richardstartin/radix-sort-benchmark/commit/ab4da230e1d0ac68e5ee2cee38d71c7e7d50f49b#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR226)
>>>  which has numerous structural similarities to this work:
>>> Moreover, @bourgesl forked my repository on 11/04/2021 and communicated 
>>> with me about doing so. On 25/04/2021 there was a new implementation of 
>>> `DualPivotQuicksort` with a signed radix sort but the same structural 
>>> similarities, and with the same method and variable names in places 
>>> [bourgesl/radix-sort-benchmark@90ff7e4#diff-397ce8fd791e2ce508cf9127201bc9ab46264cd2a79fd0487a63569f2e4b59b2R607-R609](https://github.com/bourgesl/radix-sort-benchmark/commit/90ff7e427da0fa49f374bff0241fb2487bd87bde#diff-397ce8fd791e2ce508cf9127201bc9ab46264cd2a79fd0487a63569f2e4b59b2R607-R609)
>> 
>> @iaroslavski The attribution is not clear here. Can you provide a summary as 
>> to who is contributing to this patch? I can't tell if all involved have 
>> signed the OCA or not. I'm sure there will be questions about space/time 
>> trade-offs with radix sort but I think it's important to establish the 
>> origins of this patch first.
>
>> @AlanBateman Vertical pipeline of PR hides comments in the middle and you 
>> have to click on "Show more..." to see all comments. There are no claims 
>> related to the origin of my patch, it doesn't violate any rights.
> 
> There is a comment from richardstartin suggesting that something was derived 
> from code in his repo. Is this a benchmark that is not part of this PR? Only 
> asking because I can't find him on OCA signatories. You can use the Skara 
> /contributor command to list the contributors.

@AlanBateman my claim was that the implementation was derived from mine, and I 
demonstrated a sequence of name changes made after @bourgesl forked my repository 
containing a structurally similar radix sort implementation and benchmarks, in 
order to provide circumstantial evidence for that claim. Via email @iaroslavski 
told me that this was not the case, which I decided to accept at face value. So 
please judge this PR on its merits, and disregard the claims made in these 
comments. I have not signed an OCA but do not want to block this PR if the 
space/time tradeoff is deemed acceptable.

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Richard Startin
On Sat, 8 May 2021 20:54:48 GMT, iaroslavski 
 wrote:

> Sorting:
> 
> - adopt radix sort for sequential and parallel sorts on int/long/float/double 
> arrays (almost random and length > 6K)
> - fix tryMergeRuns() to better handle case when the last run is a single 
> element
> - minor javadoc and comment changes
> 
> Testing:
> - add new data inputs in tests for sorting
> - add min/max/infinity values to float/double testing
> - add tests for radix sort

src/java.base/share/classes/java/util/DualPivotQuicksort.java line 672:

> 670: count2[(a[i] >>>  8) & 0xFF]--;
> 671: count3[(a[i] >>> 16) & 0xFF]--;
> 672: count4[(a[i] >>> 24) ^ 0x80]--;

It seems that C2 can't eliminate the bounds check here because of the `xor`, 
even though the resulting index can never exceed 255. The three masked accesses 
above all have their bounds checks eliminated. Maybe someone could look into 
improving that.
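
For illustration, a minimal sketch of the two indexing shapes (hypothetical method names; the real code lives in DualPivotQuicksort):

public class BoundsCheckShapes {

    // The masked index is provably in [0, 255], so C2 drops the range check
    // against the 256-element count array.
    static void masked(int[] a, int[] count) {
        for (int v : a) {
            count[(v >>> 8) & 0xFF]--;
        }
    }

    // The shifted value is already in [0, 255] and the xor cannot push it out
    // of range, but C2 does not prove that, so the range check survives.
    static void xored(int[] a, int[] count) {
        for (int v : a) {
            count[(v >>> 24) ^ 0x80]--;
        }
    }
}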

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Richard Startin
On Thu, 13 May 2021 10:22:57 GMT, Laurent Bourgès  wrote:

>> Hi @iaroslavski I'm unconvinced that this work was from 14/06/2020 - I 
>> believe this work derives from an unsigned radix sort I implemented on 
>> 10/04/2021 
>> https://github.com/richardstartin/radix-sort-benchmark/commit/ab4da230e1d0ac68e5ee2cee38d71c7e7d50f49b#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR226
>>  which has numerous structural similarities to this work:
>> * Producing all four histograms in one pass
>> * Skipping passes based on detecting the total in the histogram
>> * Bailing out of the skip detection if a nonzero value not equal to the 
>> total is encountered
>> * Manually unrolling the LSD radix sort loop in order to avoid array copies
>> 
>> My implementation from 10th April is below for reference:
>> 
>>   public static void unrollOnePassHistogramsSkipLevels(int[] data) {
>> int[] histogram1 = new int[257];
>> int[] histogram2 = new int[257];
>> int[] histogram3 = new int[257];
>> int[] histogram4 = new int[257];
>> 
>> for (int value : data) {
>>   ++histogram1[(value & 0xFF) + 1];
>>   ++histogram2[((value >>> 8) & 0xFF) + 1];
>>   ++histogram3[((value >>> 16) & 0xFF) + 1];
>>   ++histogram4[(value >>> 24) + 1];
>> }
>> boolean skipLevel1 = canSkipLevel(histogram1, data.length);
>> boolean skipLevel2 = canSkipLevel(histogram2, data.length);
>> boolean skipLevel3 = canSkipLevel(histogram3, data.length);
>> boolean skipLevel4 = canSkipLevel(histogram4, data.length);
>> 
>> if (skipLevel1 && skipLevel2 && skipLevel3 && skipLevel4) {
>>   return;
>> }
>> int[] copy = new int[data.length];
>> 
>> int[] source = data;
>> int[] dest = copy;
>> 
>> if (!skipLevel1) {
>>   for (int i = 1; i < histogram1.length; ++i) {
>> histogram1[i] += histogram1[i - 1];
>>   }
>>   for (int value : source) {
>> dest[histogram1[value & 0xFF]++] = value;
>>   }
>>   if (!skipLevel2 || !skipLevel3 || !skipLevel4) {
>> int[] tmp = dest;
>> dest = source;
>> source = tmp;
>>   }
>> }
>> 
>> if (!skipLevel2) {
>>   for (int i = 1; i < histogram2.length; ++i) {
>> histogram2[i] += histogram2[i - 1];
>>   }
>>   for (int value : source) {
>> dest[histogram2[(value >>> 8) & 0xFF]++] = value;
>>   }
>>   if (!skipLevel3 || !skipLevel4) {
>> int[] tmp = dest;
>> dest = source;
>> source = tmp;
>>   }
>> }
>> 
>> if (!skipLevel3) {
>>   for (int i = 1; i < histogram3.length; ++i) {
>> histogram3[i] += histogram3[i - 1];
>>   }
>>   for (int value : data) {
>> dest[histogram3[(value >>> 16) & 0xFF]++] = value;
>>   }
>>   if (!skipLevel4) {
>> int[] tmp = dest;
>> dest = source;
>> source = tmp;
>>   }
>> }
>> 
>> if (!skipLevel4) {
>>   for (int i = 1; i < histogram4.length; ++i) {
>> histogram4[i] += histogram4[i - 1];
>>   }
>>   for (int value : source) {
>> dest[histogram4[value >>> 24]++] = value;
>>   }
>> }
>> if (dest != data) {
>>   System.arraycopy(dest, 0, data, 0, data.length);
>> }
>>   }
>> 
>>   private static boolean canSkipLevel(int[] histogram, int dataSize) {
>> for (int count : histogram) {
>>   if (count == dataSize) {
>> return true;
>>   } else if (count > 0) {
>> return false;
>>   }
>> }
>> return true;
>>   }
>> 
>> 
>> Moreover, @bourgesl forked my repository on 11/04/2021 and communicated with 
>> me about doing so. On 25/04/2021 there was a new implementation of 
>> `DualPivotQuicksort` with a signed radix sort but the same structural 
>> similarities, and with the same method and variable names in places 
>> https://github.com/bourgesl/radix-sort-benchmark/commit/90ff7e427da0fa49f374bff0241fb2487bd87bde#diff-397ce8fd791e2ce508cf9127201bc9ab46264cd2a79fd0487a63569f2e4b59b2R607-R609
>> 
>> 
>> // TODO add javadoc
>> private static void radixSort(Sorter sorter, int[] a, int low, int high) 
>> {
>> int[] b;
>> // LBO: prealloc (high - low) +1 element:
>> if (sorter == null || (b = sorter.b) == null || b.length < (high - 
>> low)) {
>> // System.out.println("alloc b: " + (high - low));
>> b = new int[high - low];
>> }
>> 
>> int[] count1, count2, count3, count4;
>> if (sorter != null) {
>> sorter.resetRadixBuffers();
>> count1 = sorter.count1;
>> count2 = sorter.count2;
>> count3 = sorter.count3;
>> count4 = sorter.count4;
>> } else {
>> // System.out.println("alloc radix buffers(4x256)");
>> count1 = new int[256];
>> count2 = new int[256];
>> count3 = new int[256];
>> count4 = new int[256];
>> }
>> 
>> 

Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Richard Startin
On Fri, 14 May 2021 07:14:27 GMT, Laurent Bourgès  wrote:

>> So the issue of not skipping passes was my fault in the translation process, 
>> so not something to worry about, though after [fixing 
>> that](https://github.com/richardstartin/radix-sort-benchmark/commit/ccbee984c6a0e0f50c30de59e1a5e9fbcad89510)
>>  the original implementation still has the edge because of the bounds checks 
>> on the `xor` not getting eliminated.
>> 
>> 
>> Benchmark                                         (bits)  (padding)  (scenario)  (seed)   (size)  Mode  Cnt      Score    Error  Units
>> RadixSortBenchmark.jdk                                17          7     UNIFORM       0  1000000  avgt    5  10432.408 ± 87.024  us/op
>> RadixSortBenchmark.jdk                                23          7     UNIFORM       0  1000000  avgt    5   9465.990 ± 40.598  us/op
>> RadixSortBenchmark.jdk                                30          7     UNIFORM       0  1000000  avgt    5  11189.146 ± 50.972  us/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned      17          7     UNIFORM       0  1000000  avgt    5   9546.963 ± 41.698  us/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned      23          7     UNIFORM       0  1000000  avgt    5   9412.114 ± 43.081  us/op
>> RadixSortBenchmark.unrollOnePassSkipLevelsSigned      30          7     UNIFORM       0  1000000  avgt    5  10823.618 ± 64.311  us/op
>
> Great analysis on C2, Richard.
> 
> Maybe (x ^ 0x80) & 0xFF would help C2 to eliminate the bounds check...

I don't know, Laurent; I find the handling of signed order over-complicated. 
Subtracting `Integer.MIN_VALUE` is really cheap...
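
For example, a minimal sketch of two equivalent ways to impose signed order on the top byte (illustrative only):

public class SignedRadixOrder {

    // Flip the sign bit of the most significant byte after extracting it -
    // the (a[i] >>> 24) ^ 0x80 shape discussed above.
    static int bucketByXor(int value) {
        return (value >>> 24) ^ 0x80;
    }

    // Equivalently, subtract Integer.MIN_VALUE first (which just flips the
    // sign bit in two's complement) and extract the top byte as if unsigned.
    static int bucketBySubtraction(int value) {
        return (value - Integer.MIN_VALUE) >>> 24;
    }
}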

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Richard Startin
On Thu, 13 May 2021 14:44:28 GMT, Richard Startin 
 wrote:

>> @iaroslavski I would prefer to discuss this in private than here, but my 
>> argument is that the name `skipByte` came from Laurent's code, and that 
>> Laurent's code was clearly derived from my own within a fork of my 
>> repository. I linked the commits where you changed `skipByte` to `passLevel` 
>> and Laurent changed my name `canSkipLevel` to `skipByte`. 
>> 
>> For me, this raises questions about the independence of your work from 
>> Laurent's, and Laurent's work is clearly derived from my own (and I don't 
>> think anyone is disputing the latter). I would be happy to sort this out in 
>> private.
>
> In private correspondence with Vladimir, it was explained that where 
> Vladimir's code and Laurent's code are identical, including typos 
> ([Vladimir's 
> code](https://github.com/iaroslavski/sorting/commit/f076073b8b819a9687613903a164e3ed71821769#diff-4b4d68fc834c2ad12a9fb9d316a812221af7c398338ed2ee907d0a795e7aadafR672),
>  [Laurent's 
> code](https://github.com/bourgesl/radix-sort-benchmark/commit/a693b26b2e2c14cfeedf9c753c9d643096b0e38d#diff-397ce8fd791e2ce508cf9127201bc9ab46264cd2a79fd0487a63569f2e4b59b2R719))
>  it is because Vladimir sent the code to Laurent, not the other way around, 
> therefore Vladimir's code does not derive from Laurent's, and it does not 
> derive from mine. I can only trust that this is the case, so please disregard 
> my claim that this is derivative work when reviewing this PR.

For what it's worth, I benchmarked this implementation of radix sort ([adapted 
here to fit into my 
harness](https://github.com/richardstartin/radix-sort-benchmark/commit/07169e8e8602152cfda859baa159db165bf5fcab#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR681-R710782)) 
against a [signed 
variant](https://github.com/richardstartin/radix-sort-benchmark/commit/07169e8e8602152cfda859baa159db165bf5fcab#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR396-R478) 
of what I have claimed this work was derived from, and the proposed 
implementation does not perform favourably on uniformly random data:



Benchmark                                         (bits)  (padding)  (scenario)  (seed)   (size)  Mode  Cnt      Score     Error  Units
RadixSortBenchmark.jdk                                17          7     UNIFORM       0  1000000  avgt    5  11301.950 ± 113.691  us/op
RadixSortBenchmark.jdk                                23          7     UNIFORM       0  1000000  avgt    5  11792.351 ±  60.757  us/op
RadixSortBenchmark.jdk                                30          7     UNIFORM       0  1000000  avgt    5  11184.616 ±  67.094  us/op
RadixSortBenchmark.unrollOnePassSkipLevelsSigned      17          7     UNIFORM       0  1000000  avgt    5   9564.626 ±  69.497  us/op
RadixSortBenchmark.unrollOnePassSkipLevelsSigned      23          7     UNIFORM       0  1000000  avgt    5   9432.085 ±  58.983  us/op
RadixSortBenchmark.unrollOnePassSkipLevelsSigned      30          7     UNIFORM       0  1000000  avgt    5  10772.975 ±  51.848  us/op



I believe the root cause is a defect in the mechanism employed to skip passes, 
as can be seen from the increased number of instructions and cycles here. In the 
proposed implementation, the instruction count is roughly constant as a function 
of bits. In the case where all passes must be performed (bits = 30), IPC is 
superior in `unrollOnePassHistogramsSkipLevelsSigned`.


Benchmark                                                        (bits)  (padding)  (scenario)  (seed)   (size)  Mode  Cnt         Score  Error  Units
RadixSortBenchmark.jdk:cycles                                        17          7     UNIFORM       0  1000000  avgt       34976971.877          #/op
RadixSortBenchmark.jdk:instructions                                  17          7     UNIFORM       0  1000000  avgt       70121142.003          #/op
RadixSortBenchmark.jdk:cycles                                        23          7     UNIFORM       0  1000000  avgt       32369970.385          #/op
RadixSortBenchmark.jdk:instructions                                  23          7     UNIFORM       0  1000000  avgt       70201664.963          #/op
RadixSortBenchmark.jdk:cycles                                        30          7     UNIFORM       0  1000000  avgt       30789736.602          #/op
RadixSortBenchmark.jdk:instructions                                  30          7     UNIFORM       0  1000000  avgt       70180942.122          #/op
RadixSortBenchmark.jdk:IPC                                           30          7     UNIFORM       0  1000000  avgt              2.279      insns/clk
RadixSortBenchmark.unrollOnePassSkipLevelsSigned:cycles

Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Richard Startin
On Thu, 13 May 2021 20:23:16 GMT, Richard Startin 
 wrote:

>> In private correspondence with Vladimir, it was explained that where 
>> Vladimir's code and Laurent's code are identical, including typos 
>> ([Vladimir's 
>> code](https://github.com/iaroslavski/sorting/commit/f076073b8b819a9687613903a164e3ed71821769#diff-4b4d68fc834c2ad12a9fb9d316a812221af7c398338ed2ee907d0a795e7aadafR672),
>>  [Laurent's 
>> code](https://github.com/bourgesl/radix-sort-benchmark/commit/a693b26b2e2c14cfeedf9c753c9d643096b0e38d#diff-397ce8fd791e2ce508cf9127201bc9ab46264cd2a79fd0487a63569f2e4b59b2R719))
>>  it is because Vladimir sent the code to Laurent, not the other way around, 
>> therefore Vladimir's code does not derive from Laurent's, and it does not 
>> derive from mine. I can only trust that this is the case, so please 
>> disregard my claim that this is derivative work when reviewing this PR.
>
> For what it's worth, I benchmarked this implementation radix sort ([adapted 
> here to fit in to my 
> harness](https://github.com/richardstartin/radix-sort-benchmark/commit/07169e8e8602152cfda859baa159db165bf5fcab#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR681-R710782))
>  against a [signed 
> variant](https://github.com/richardstartin/radix-sort-benchmark/commit/07169e8e8602152cfda859baa159db165bf5fcab#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR396-R478)
>  of what I have claimed this work was derived from and the proposed 
> implementation does not perform favourably on uniformly random data:
> 
> 
> 
> Benchmark                                         (bits)  (padding)  (scenario)  (seed)   (size)  Mode  Cnt      Score     Error  Units
> RadixSortBenchmark.jdk                                17          7     UNIFORM       0  1000000  avgt    5  11301.950 ± 113.691  us/op
> RadixSortBenchmark.jdk                                23          7     UNIFORM       0  1000000  avgt    5  11792.351 ±  60.757  us/op
> RadixSortBenchmark.jdk                                30          7     UNIFORM       0  1000000  avgt    5  11184.616 ±  67.094  us/op
> RadixSortBenchmark.unrollOnePassSkipLevelsSigned      17          7     UNIFORM       0  1000000  avgt    5   9564.626 ±  69.497  us/op
> RadixSortBenchmark.unrollOnePassSkipLevelsSigned      23          7     UNIFORM       0  1000000  avgt    5   9432.085 ±  58.983  us/op
> RadixSortBenchmark.unrollOnePassSkipLevelsSigned      30          7     UNIFORM       0  1000000  avgt    5  10772.975 ±  51.848  us/op
> 
> 
> 
> I believe the root cause is a defect in the mechanism employed to skip passes, 
> as can be seen from the increased number of instructions and cycles here. In 
> the proposed implementation, the instruction count is roughly constant as a 
> function of bits. In the case where all passes must be performed (bits = 30), 
> IPC is superior in `unrollOnePassHistogramsSkipLevelsSigned`.
> 
> 
> Benchmark                                                        (bits)  (padding)  (scenario)  (seed)   (size)  Mode  Cnt         Score  Error  Units
> RadixSortBenchmark.jdk:cycles                                        17          7     UNIFORM       0  1000000  avgt       34976971.877          #/op
> RadixSortBenchmark.jdk:instructions                                  17          7     UNIFORM       0  1000000  avgt       70121142.003          #/op
> RadixSortBenchmark.jdk:cycles                                        23          7     UNIFORM       0  1000000  avgt       32369970.385          #/op
> RadixSortBenchmark.jdk:instructions                                  23          7     UNIFORM       0  1000000  avgt       70201664.963          #/op
> RadixSortBenchmark.jdk:cycles                                        30          7     UNIFORM       0  1000000  avgt       30789736.602          #/op
> RadixSortBenchmark.jdk:instructions                                  30          7     UNIFORM       0  1000000  avgt       70180942.122          #/op
> RadixSortBenchmark.jdk:IPC                                           30          7     UNIFORM       0  1000000  avgt              2.279      insns/clk
> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:cycles              17          7     UNIFORM       0  1000000  avgt       26983994.479          #/op
> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:instructions        17          7     UNIFORM       0  1000000  avgt       62065304.827          #/op
> RadixSortBenchmark.unrollOnePassSkipLevelsSigned:cycles

Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Richard Startin
On Thu, 13 May 2021 11:31:49 GMT, iaroslavski 
 wrote:

>> Perhaps we can resolve this issue in private - my email address is on my 
>> profile (or in the commits in `radix-sort-benchmark`)?
>
> @richardstartin And one more addon: my first version of Radix sort, see my 
> github https://github.com/iaroslavski/sorting/tree/master/radixsort uses 
> another name, like skipBytes, then renamed to passLevel.
> So, the common part is "skip". And this method has different number of 
> parameters. I don't see any collision with your code.

@iaroslavski I would prefer to discuss this in private than here, but my 
argument is that the name `skipByte` came from Laurent's code, and that 
Laurent's code was clearly derived from my own within a fork of my repository. 
I linked the commits where you changed `skipByte` to `passLevel` and Laurent 
changed my name `canSkipLevel` to `skipByte`. 

For me, this raises questions about the independence of your work from 
Laurent's, and Laurent's work is clearly derived from my own (and I don't think 
anyone is disputing the latter). I would be happy to sort this out in private.

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Richard Startin
On Thu, 13 May 2021 11:47:58 GMT, Richard Startin 
 wrote:

>> @richardstartin And one more addon: my first version of Radix sort, see my 
>> github https://github.com/iaroslavski/sorting/tree/master/radixsort uses 
>> another name, like skipBytes, then renamed to passLevel.
>> So, the common part is "skip". And this method has different number of 
>> parameters. I don't see any collision with your code.
>
> @iaroslavski I would prefer to discuss this in private than here, but my 
> argument is that the name `skipByte` came from Laurent's code, and that 
> Laurent's code was clearly derived from my own within a fork of my 
> repository. I linked the commits where you changed `skipByte` to `passLevel` 
> and Laurent changed my name `canSkipLevel` to `skipByte`. 
> 
> For me, this raises questions about the independence of your work from 
> Laurent's, and Laurent's work is clearly derived from my own (and I don't 
> think anyone is disputing the latter). I would be happy to sort this out in 
> private.

In private correspondence with Vladimir, it was explained that where Vladimir's 
code and Laurent's code are identical, including typos ([Vladimir's 
code](https://github.com/iaroslavski/sorting/commit/f076073b8b819a9687613903a164e3ed71821769#diff-4b4d68fc834c2ad12a9fb9d316a812221af7c398338ed2ee907d0a795e7aadafR672),
 [Laurent's 
code](https://github.com/bourgesl/radix-sort-benchmark/commit/a693b26b2e2c14cfeedf9c753c9d643096b0e38d#diff-397ce8fd791e2ce508cf9127201bc9ab46264cd2a79fd0487a63569f2e4b59b2R719))
 it is because Vladimir sent the code to Laurent, not the other way around, 
therefore Vladimir's code does not derive from Laurent's, and it does not 
derive from mine. I can only trust that this is the case, so please disregard 
my claim that this is derivative work when reviewing this PR.

-

PR: https://git.openjdk.java.net/jdk/pull/3938


Re: RFR: JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

2021-09-13 Thread Richard Startin
On Wed, 12 May 2021 12:20:09 GMT, iaroslavski 
 wrote:

>> src/java.base/share/classes/java/util/DualPivotQuicksort.java line 47:
>> 
>>> 45:  * @author Doug Lea
>>> 46:  *
>>> 47:  * @version 2020.06.14
>> 
>> Vladimir, I would update to 2021.05.06 (+your hash)
>
> Laurent, the date in this class is not the date of our last commit,
> this date is the date when I have final idea regarding to Radix sort,
> therefore, I prefer to keep 2020.06.14

Hi @iaroslavski I'm unconvinced that this work was from 14/06/2020 - I believe 
this work derives from an unsigned radix sort I implemented on 10/04/2021 
https://github.com/richardstartin/radix-sort-benchmark/commit/ab4da230e1d0ac68e5ee2cee38d71c7e7d50f49b#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR226
 which has numerous structural similarities to this work:
* Producing all four histograms in one pass
* Skipping passes based on detecting the total in the histogram
* Bailing out of the skip detection if a nonzero value not equal to the total 
is encountered
* Manually unrolling the LSD radix sort loop in order to avoid array copies

My implementation from 10th April is below for reference:

  public static void unrollOnePassHistogramsSkipLevels(int[] data) {
int[] histogram1 = new int[257];
int[] histogram2 = new int[257];
int[] histogram3 = new int[257];
int[] histogram4 = new int[257];

for (int value : data) {
  ++histogram1[(value & 0xFF) + 1];
  ++histogram2[((value >>> 8) & 0xFF) + 1];
  ++histogram3[((value >>> 16) & 0xFF) + 1];
  ++histogram4[(value >>> 24) + 1];
}
boolean skipLevel1 = canSkipLevel(histogram1, data.length);
boolean skipLevel2 = canSkipLevel(histogram2, data.length);
boolean skipLevel3 = canSkipLevel(histogram3, data.length);
boolean skipLevel4 = canSkipLevel(histogram4, data.length);

if (skipLevel1 && skipLevel2 && skipLevel3 && skipLevel4) {
  return;
}
int[] copy = new int[data.length];

int[] source = data;
int[] dest = copy;

if (!skipLevel1) {
  for (int i = 1; i < histogram1.length; ++i) {
histogram1[i] += histogram1[i - 1];
  }
  for (int value : source) {
dest[histogram1[value & 0xFF]++] = value;
  }
  if (!skipLevel2 || !skipLevel3 || !skipLevel4) {
int[] tmp = dest;
dest = source;
source = tmp;
  }
}

if (!skipLevel2) {
  for (int i = 1; i < histogram2.length; ++i) {
histogram2[i] += histogram2[i - 1];
  }
  for (int value : source) {
dest[histogram2[(value >>> 8) & 0xFF]++] = value;
  }
  if (!skipLevel3 || !skipLevel4) {
int[] tmp = dest;
dest = source;
source = tmp;
  }
}

if (!skipLevel3) {
  for (int i = 1; i < histogram3.length; ++i) {
histogram3[i] += histogram3[i - 1];
  }
  for (int value : data) {
dest[histogram3[(value >>> 16) & 0xFF]++] = value;
  }
  if (!skipLevel4) {
int[] tmp = dest;
dest = source;
source = tmp;
  }
}

if (!skipLevel4) {
  for (int i = 1; i < histogram4.length; ++i) {
histogram4[i] += histogram4[i - 1];
  }
  for (int value : source) {
dest[histogram4[value >>> 24]++] = value;
  }
}
if (dest != data) {
  System.arraycopy(dest, 0, data, 0, data.length);
}
  }

  private static boolean canSkipLevel(int[] histogram, int dataSize) {
for (int count : histogram) {
  if (count == dataSize) {
return true;
  } else if (count > 0) {
return false;
  }
}
return true;
  }


Moreover, @bourgesl forked my repository on 11/04/2021 and communicated with me 
about doing so. On 25/04/2021 there was a new implementation of 
`DualPivotQuicksort` with a signed radix sort but the same structural 
similarities, and with the same method and variable names in places 
https://github.com/bourgesl/radix-sort-benchmark/commit/90ff7e427da0fa49f374bff0241fb2487bd87bde#diff-397ce8fd791e2ce508cf9127201bc9ab46264cd2a79fd0487a63569f2e4b59b2R607-R609


// TODO add javadoc
private static void radixSort(Sorter sorter, int[] a, int low, int high) {
int[] b;
// LBO: prealloc (high - low) +1 element:
if (sorter == null || (b = sorter.b) == null || b.length < (high - 
low)) {
// System.out.println("alloc b: " + (high - low));
b = new int[high - low];
}

int[] count1, count2, count3, count4;
if (sorter != null) {
sorter.resetRadixBuffers();
count1 = sorter.count1;
count2 = sorter.count2;
count3 = sorter.count3;
count4 = sorter.count4;
} else {
// System.out.println("alloc radix buffers(4x256)");
count1 = new int[256];
count2 = new int[256];
count3 = new int[256];
count4 = new int[256];
}

for (int i = low; i < 

Re: Reading data for a particular column-cell with 2 or more values of a same row-key

2017-02-25 Thread Richard Startin
If you operate directly on a Result you only get the latest version of each 
cell. To get older versions of cells you have a few options:


1) Result::getFamilyMap, if you only want versioned cells from a single family 
- 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html#getFamilyMap-byte:A-

2) Result::getMap - If you need versioned cells from all families - 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html#getMap--

3) Get a cell scanner from Result::cellScanner - 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html#cellScanner--

So once you have your rows, add another mapping function using one of the 
methods above to get multi-version rows.
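
For example, a hedged sketch of option 2, printing every stored version (this assumes the column family keeps multiple versions and that the Scan requested them, e.g. via setMaxVersions; names here are illustrative):

import java.util.Map;
import java.util.NavigableMap;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedCells {

    // Walks family -> qualifier -> timestamp -> value, so older versions of
    // cf:ProductFeature are visible, not just the latest one.
    static void printAllVersions(Result result) {
        NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> map = result.getMap();
        for (Map.Entry<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> family : map.entrySet()) {
            for (Map.Entry<byte[], NavigableMap<Long, byte[]>> qualifier : family.getValue().entrySet()) {
                for (Map.Entry<Long, byte[]> version : qualifier.getValue().entrySet()) {
                    System.out.println(Bytes.toString(family.getKey()) + ":"
                            + Bytes.toString(qualifier.getKey()) + " @ " + version.getKey()
                            + " = " + Bytes.toString(version.getValue()));
                }
            }
        }
    }
}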



https://richardstartin.com/



From: Abir Chokraborty 
Sent: 25 February 2017 07:38
To: user@hbase.apache.org
Subject: Reading data for a particular column-cell with 2 or more values of a 
same row-key

HBase table contains the following:

ROW          COLUMN+CELL
Product01    column=cf:ProductFeature, timestamp=1487917201238, value=Feature01
Product01    column=cf:ProductFeature, timestamp=1487917201239, value=Feature02
Product01    column=cf:ProductFeature, timestamp=1487917201240, value=Feature03
Product01    column=cf:Price,          timestamp=1487917201242, value=\x012A\xF8
Product01    column=cf:Location,       timestamp=1487917201244, value=Texas
Here VERSIONS is 3. So it is keeping 3 different values for ProductFeature
column. I wrote the following to create an RDD

val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
  val resultRDD = hbaseRDD.map(tuple => tuple._2)
  val testRDD = resultRDD.map(Row.parseRow)
  val testDF = testRDD.toDF()
Here, parseRow is a method that returns a tuple of
(ROW, ProductFeature, Price, Location). I am only getting:

+-----------+-----------------+--------+----------+
| Row       | ProductFeature  | Price  | Location |
+-----------+-----------------+--------+----------+
| Product01 | Feature03       | 65     | Texas    |
+-----------+-----------------+--------+----------+
Where do I have to change the code so that I can create a DataFrame for the
different values of ProductFeature, like the following:

+-----------+-----------------+--------+----------+
| Row       | ProductFeature  | Price  | Location |
+-----------+-----------------+--------+----------+
| Product01 | Feature01       | 65     | Texas    |
+-----------+-----------------+--------+----------+
| Product01 | Feature02       | 65     | Texas    |
+-----------+-----------------+--------+----------+
| Product01 | Feature03       | 65     | Texas    |
+-----------+-----------------+--------+----------+



--
View this message in context: 
http://apache-hbase.679495.n3.nabble.com/Reading-data-for-a-particular-column-cell-with-2-or-more-values-of-a-same-row-key-tp4086420.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: Parallel Scanner

2017-02-20 Thread Richard Startin
For a client-only solution, have you looked at the RegionLocator interface? It 
gives you a list of pairs of byte[] (the start and stop keys for each region). 
You can easily use a ForkJoinPool recursive task or a Java 8 parallel stream over 
that list. I implemented a Spark RDD to do that and wrote about it with code 
samples here:

https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/

Forget about the Spark details in the post (and forget that Hortonworks have a 
library to do the same thing :)) - the idea of creating one scan per region, 
setting each scan's start and stop rows from the region locator, would give you a 
parallel scan. Note you can also group the scans by region server; a rough sketch 
follows below.
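
A hedged sketch of that idea, assuming the HBase 1.x client API (the table name, row processing and error handling are illustrative only):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Pair;

public class ParallelRegionScan {

    // One Scan per region, bounded by the region's start and stop keys,
    // executed in parallel on the client.
    static void scanAllRegions(Connection connection, TableName tableName) throws IOException {
        List<Scan> scans = new ArrayList<>();
        try (RegionLocator locator = connection.getRegionLocator(tableName)) {
            Pair<byte[][], byte[][]> keys = locator.getStartEndKeys();
            for (int i = 0; i < keys.getFirst().length; i++) {
                Scan scan = new Scan();
                scan.setStartRow(keys.getFirst()[i]);
                scan.setStopRow(keys.getSecond()[i]);
                scans.add(scan);
            }
        }
        scans.parallelStream().forEach(scan -> {
            try (Table table = connection.getTable(tableName);
                 ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    // process each row here
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }
}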

Cheers,
Richard
On 20 Feb 2017, at 07:33, Anil > 
wrote:

Thanks Ram. I will look into EndPoints.

On 20 February 2017 at 12:29, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> 
wrote:

Yes, there is a way.

Have you seen Endpoints? Endpoints are trigger-like points that allow
your client to invoke them in parallel in one or more regions using the
start and end key of the region. This executes in parallel and then you may
have to sort out the results as per your need.

But these endpoints have to be running on your region servers; it is not a
client-only solution.
https://blogs.apache.org/hbase/entry/coprocessor_introduction.

Be careful when you use them. Since these endpoints run on the server, ensure
that they are not heavy or memory-hungry, as that can have
adverse effects on the server.


Regards
Ram

On Mon, Feb 20, 2017 at 12:18 PM, Anil 
> wrote:

Thanks Ram.

So, you mean that there is no harm in using HTable#getRegionsInRange in
the application code.

HTable#getRegionsInRange returned a single entry for all my region start
keys and end keys. I need to explore this more.

"If you know the table region's start and end keys you could create
parallel scans in your application code." - is there any way to scan a
region in the application code other than the one I put in the original
email?

"One thing to watch out is that if there is a split in the region then
this start and end row may change so in that case it is better you try to get
the regions every time before you issue a scan"
- Agree. I am dynamically determining the region start key and end key
before initiating scan operations for every initial load.

Thanks.




On 20 February 2017 at 10:59, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> 
wrote:

Hi Anil,

HBase directly does not provide parallel scans. If you know the table
region's start and end keys you could create parallel scans in your
application code.

In the above code snippet, the intent is right - you get the required
regions and can issue parallel scans from your app.

One thing to watch out is that if there is a split in the region then
this
start and end row may change so in that case it is better you try to
get
the regions every time before you issue a scan. Does that make sense to
you?

Regards
Ram

On Sat, Feb 18, 2017 at 1:44 PM, Anil 
> wrote:

Hi ,

I am building a use case where I have to load the HBase data into an in-memory
database (IMDB). I am scanning each region and loading the data into IMDB.

I am looking at the parallel scanner (https://issues.apache.org/jira/browse/HBASE-8504,
HBASE-1935) to reduce the load time;
HTable#getRegionsInRange(byte[] startKey, byte[] endKey, boolean reload) is
deprecated, and HBASE-1935 is still open.

I see the Connection from ConnectionFactory is HConnectionImplementation by
default and creates HTable instances.

Do you see any issues in using HTable from the Table instance?
for each region {
    int i = 0;
    List<HRegionLocation> regions = hTable.getRegionsInRange(scans.getStartRow(),
            scans.getStopRow(), true);

    for (HRegionLocation region : regions) {
        startRow = i == 0 ? scans.getStartRow() : region.getRegionInfo().getStartKey();
        i++;
        endRow = i == regions.size() ? scans.getStopRow() : region.getRegionInfo().getEndKey();
    }
}

Are there any alternatives to achieve a parallel scan? Thanks.

Thanks






Re: Doubt

2017-02-14 Thread Richard Startin
I took a look at 
https://www.linkedin.com/pulse/hbase-read-write-performance-conversation-gaurhari-dass?trk=prof-post


Looks like an unattributed copy of 
 
http://apache-hbase.679495.n3.nabble.com/HBase-Performance-td4086182.html#a4086185


https://richardstartin.com/



From: Ted Yu 
Sent: 14 February 2017 12:30
To: user@hbase.apache.org
Subject: Re: Doubt

Clicking on both links directed me to:

https://www.linkedin.com/post/new

Do the pages require read permission ?

On Tue, Feb 14, 2017 at 1:46 AM, gaurhari dass 
wrote:

> Hi
>
> I want to post like this
>
> https://www.linkedin.com/post/edit/hbase-read-write-
> performance-conversation-gaurhari-dass
>
> https://www.linkedin.com/post/edit/hbase-performance-
> improvements-gaurhari-dass
>
> it is just for my reference as well can be helpful to others if it is ok.
>
> Thanks
> Gaurhari
>
> On Tue, Feb 14, 2017 at 6:27 AM, Yu Li  wrote:
>
> >
> > Would like to hear more details, wherever it will be posted (smile).
> >
> > Best Regards,
> > Yu
> >
> > On 14 February 2017 at 06:06, Stack  wrote:
> >
> > > You might consider adding list of issues and solutions to the hbase
> > > reference guide?
> > > Yours,
> > > S
> > >
> > > On Fri, Feb 10, 2017 at 1:34 AM, gaurhari dass  >
> > > wrote:
> > >
> > > > Hi ,
> > > >
> > > > I always follow problems users facing with hbase and solutions
> > provided.
> > > >
> > > > these problems and solutions are always helpful to me.
> > > >
> > > > I want to know if it is ok if I share these problems and solutions on
> > my
> > > > linkedin articles as it is.
> > > >
> > > > Thanks
> > > > Gaurhari
> > > >
> > >
> >
> >
>


Re: Kryo On Spark 1.6.0

2017-01-10 Thread Richard Startin
Hi Enrico,


Only set spark.kryo.registrationRequired if you want to forbid any classes you 
have not explicitly registered - see 
http://spark.apache.org/docs/latest/configuration.html.

To enable kryo, you just need 
spark.serializer=org.apache.spark.serializer.KryoSerializer. There is some info 
here - http://spark.apache.org/docs/latest/tuning.html
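
For example, a minimal sketch of doing this from Java (MyClass is a placeholder for your own types; registering scala.collection.mutable.WrappedArray$ofRef via Class.forName is one way to reference a Scala class from Java, assuming it is on the classpath):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class KryoSetup {

    static class MyClass {}  // placeholder for your own classes

    public static void main(String[] args) throws ClassNotFoundException {
        SparkConf conf = new SparkConf()
                .setAppName("kryo-example")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                // optional: fail fast on anything that has not been registered
                .set("spark.kryo.registrationRequired", "true")
                .registerKryoClasses(new Class<?>[] {
                        MyClass.class,
                        Class.forName("scala.collection.mutable.WrappedArray$ofRef")
                });
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code ...
        sc.stop();
    }
}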

Cheers,
Richard



https://richardstartin.com/



From: Enrico DUrso 
Sent: 10 January 2017 11:10
To: user@spark.apache.org
Subject: Kryo On Spark 1.6.0


Hi,

I am trying to use Kryo on Spark 1.6.0.
I am able to register my own classes and it works, but when I set 
"spark.kryo.registrationRequired" to true, I get an error about a Scala class:
"Class is not registered: scala.collection.mutable.WrappedArray$ofRef".

Has any of you already solved this issue in Java? I found the code to solve it 
in Scala, but I am unable to register this class in Java.

Cheers,

enrico



CONFIDENTIALITY WARNING.
This message and the information contained in or attached to it are private and 
confidential and intended exclusively for the addressee. everis informs to whom 
it may receive it in error that it contains privileged information and its use, 
copy, reproduction or distribution is prohibited. If you are not an intended 
recipient of this E-mail, please notify the sender, delete it and do not read, 
act upon, print, disclose, copy, retain or redistribute any portion of this 
E-mail.


Re: ToLocalIterator vs collect

2017-01-05 Thread Richard Startin
Why not do that with Spark SQL to utilise the executors properly, rather than a 
sequential filter on the driver?

Select * from A left join B on A.fk = B.fk where B.pk is NULL limit k

If you were sorting just so you could iterate in order, this might save you a 
couple of sorts too.
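
For example, a rough equivalent with the DataFrame API (assuming Spark 2.x Datasets a and b with columns fk and pk; illustrative only):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class MissingRows {

    // Rows of A whose fk has no match in B, capped at k rows. B's columns come
    // back null and can be projected away afterwards if needed.
    static Dataset<Row> topKMissing(Dataset<Row> a, Dataset<Row> b, int k) {
        return a.join(b, a.col("fk").equalTo(b.col("fk")), "left_outer")
                .filter(b.col("pk").isNull())
                .limit(k);
    }
}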

https://richardstartin.com

> On 5 Jan 2017, at 10:40, Rohit Verma  wrote:
> 
> Hi all,
> 
> I am aware that collect will return a list aggregated on the driver; this will 
> cause an OOM when we have too big a list.
> Is toLocalIterator safe to use with a very big list? I want to access all 
> values one by one.
> 
> Basically the goal is to compare two sorted RDDs (A and B) to find the top k 
> entries missed in B but present in A.
> 
> Rohit
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Lease exception

2016-12-21 Thread Richard Startin
If your client caching is set to a large value, each call to next() will 
occasionally have to do a long scan, and the RPC itself will be expensive in 
terms of IO. So it's worth looking at hbase.client.scanner.caching to see if it 
is too large. If you're scanning the whole table, check you aren't churning the 
block cache.
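
For example, a small sketch of bounding the per-RPC work (the value 500 is a placeholder, not a recommendation):

import org.apache.hadoop.hbase.client.Scan;

public class ScanCaching {

    static Scan boundedScan() {
        Scan scan = new Scan();
        // rows fetched per next() RPC; very large values mean long, IO-heavy calls
        scan.setCaching(500);
        // avoid churning the block cache on full-table scans
        scan.setCacheBlocks(false);
        return scan;
    }
}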

The XML below looks wrong; was that copied verbatim from your site file?

https://richardstartin.com

> On 21 Dec 2016, at 12:02, Rajeshkumar J <rajeshkumarit8...@gmail.com> wrote:
> 
> Hi,
> 
>   Thanks for the reply. I have properties as below
> 
> 
>hbase.regionserver.lease.period
>90
>  
>  
>hbase.rpc.timeout
>90>/value>
>  
> 
> 
> Correct me If I am wrong.
> 
> I know hbase.regionserver.lease.period, which says how long a scanner
> lives between calls to scanner.next().
> 
> As far as I understand, when scanner.next() is called it will fetch the number
> of rows given by hbase.client.scanner.caching. When this fetching
> process takes more than the lease period it will close the scanner object,
> so is that why this exception is occurring?
> 
> 
> Thanks,
> 
> Rajeshkumar J
> 
> 
> 
> On Wed, Dec 21, 2016 at 5:07 PM, Richard Startin <richardstar...@outlook.com
>> wrote:
> 
>> It means your lease on a region server has expired during a call to
>> resultscanner.next(). This happens on a slow call to next(). You can either
>> embrace it or "fix" it by making sure hbase.rpc.timeout exceeds
>> hbase.regionserver.lease.period.
>> 
>> https://richardstartin.com
>> 
>> On 21 Dec 2016, at 11:30, Rajeshkumar J <rajeshkumarit8...@gmail.com<
>> mailto:rajeshkumarit8...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>>  I have faced below issue in our production cluster
>> 
>> org.apache.hadoop.hbase.regionserver.LeaseException:
>> org.apache.hadoop.hbase.regionserver.LeaseException: lease '166881' does
>> not exist
>> at org.apache.hadoop.hbase.regionserver.Leases.
>> removeLease(Leases.java:221)
>> at org.apache.hadoop.hbase.regionserver.Leases.
>> cancelLease(Leases.java:206)
>> at
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.
>> scan(RSRpcServices.java:2491)
>> at
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
>> callBlockingMethod(ClientProtos.java:32205)
>> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
>> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
>> at
>> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>> at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>> at java.lang.Thread.run(Thread.java:744)
>> 
>> 
>> Can any one explain what is lease exception
>> 
>> Thanks,
>> Rajeshkumar J
>> 


Re: Lease exception

2016-12-21 Thread Richard Startin
It means your lease on a region server has expired during a call to 
ResultScanner.next(). This happens on a slow call to next(). You can either 
embrace it or "fix" it by making sure hbase.rpc.timeout exceeds 
hbase.regionserver.lease.period.
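
For illustration, a hedged sketch of aligning the two settings on a client-side Configuration (the 900000 ms value is a placeholder; the property names are the ones used in this thread, and in practice these settings usually belong in hbase-site.xml on the cluster rather than in client code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ScannerTimeouts {

    static Configuration configure() {
        Configuration conf = HBaseConfiguration.create();
        // keep the rpc timeout at least as large as the scanner lease period
        conf.setInt("hbase.regionserver.lease.period", 900000);
        conf.setInt("hbase.rpc.timeout", 900000);
        return conf;
    }
}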

https://richardstartin.com

On 21 Dec 2016, at 11:30, Rajeshkumar J 
> wrote:

Hi,

  I have faced below issue in our production cluster

org.apache.hadoop.hbase.regionserver.LeaseException:
org.apache.hadoop.hbase.regionserver.LeaseException: lease '166881' does
not exist
at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:221)
at org.apache.hadoop.hbase.regionserver.Leases.cancelLease(Leases.java:206)
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2491)
at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
at
org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
at java.lang.Thread.run(Thread.java:744)


Can any one explain what is lease exception

Thanks,
Rajeshkumar J


Re: withColumn gives "Can only zip RDDs with same number of elements in each partition" but not with a LIMIT on the dataframe

2016-12-20 Thread Richard Startin
I think limit repartitions your data into a single partition if called as a 
non-terminal operator. Hence zip works after limit because you only have one 
partition.

In practice, I have found joins to be much more applicable than zip because of 
zip's strict requirement that the partitions line up exactly.

https://richardstartin.com

On 20 Dec 2016, at 16:04, Jack Wenger 
> wrote:

Hello,

I'm facing a strange behaviour with Spark 1.5.0 (Cloudera 5.5.1).
I'm loading data from Hive with HiveContext (~42M records) and then try to add 
a new column with "withColumn" and a UDF.
Finally i'm suppose to create a new Hive table from this dataframe.


Here is the code :

_
_


DATETIME_TO_COMPARE = "-12-31 23:59:59.99"

myFunction = udf(lambda col: 0 if col != DATETIME_TO_COMPARE else 1, 
IntegerType())

df1 = hc.sql("SELECT col1, col2, col3,col4,col5,col6,col7 FROM myTable WHERE 
col4 == someValue")

df2 = df1.withColumn("myNewCol", myFunction(df1.col3))
df2.registerTempTable("df2")

hc.sql("create table my_db.new_table as select * from df2")

_
_


But I get this error :


py4j.protocol.Py4JJavaError: An error occurred while calling o36.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in 
stage 2.0 failed 4 times, most recent failure: Lost task 18.3 in stage 2.0 (TID 
186, lxpbda25.ra1.intra.groupama.fr): 
org.apache.spark.SparkException: Can only zip RDDs with same number of elements 
in each partition
at 
org.apache.spark.rdd.RDD$$anonfun$zip$1$$anonfun$apply$27$$anon$1.hasNext(RDD.scala:832)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:104)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:85)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:85)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)




What is surprising is that if I modify the select statement by adding a LIMIT 
1 (which is more than twice the number of records in my table), then 
it works:

_
_

df1 = hc.sql("SELECT col1, col2, col3,col4,col5,col6,col7 FROM myTable WHERE 
col4 == someValue" LIMIT 1)

_
_

In both cases, if I run a count() on df1, I'm getting the same number : 42 593 
052

Is it a bug or am I missing something ?
If it is not a bug, what am I doing wrong ?


Thank you !


Jack


Re: Spark streaming completed batches statistics

2016-12-07 Thread Richard Startin
Ok it looks like I could reconstruct the logic in the Spark UI from the /jobs 
resource. Thanks.
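
For anyone finding this later, a minimal sketch of pulling that resource over HTTP
(the host, port and application id are placeholders; a running driver serves the API
on its UI port, usually 4040, and the history server on 18080):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SparkRestProbe {
    public static void main(String[] args) throws Exception {
        // placeholder driver host/port and application id
        String base = "http://localhost:4040/api/v1";
        String appId = "app-20161207194900-0000";

        // the /jobs resource returns JSON; feed it to whatever JSON parser you prefer
        HttpURLConnection connection =
                (HttpURLConnection) new URL(base + "/applications/" + appId + "/jobs").openConnection();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
            reader.lines().forEach(System.out::println);
        } finally {
            connection.disconnect();
        }
    }
}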


https://richardstartin.com/



From: map reduced <k3t.gi...@gmail.com>
Sent: 07 December 2016 19:49
To: Richard Startin
Cc: user@spark.apache.org
Subject: Re: Spark streaming completed batches statistics

Have you checked http://spark.apache.org/docs/latest/monitoring.html#rest-api ?

KP

On Wed, Dec 7, 2016 at 11:43 AM, Richard Startin 
<richardstar...@outlook.com> wrote:

Is there any way to get this information as CSV/JSON?


https://docs.databricks.com/_images/CompletedBatches.png



https://richardstartin.com/


________
From: Richard Startin <richardstar...@outlook.com>
Sent: 05 December 2016 15:55
To: user@spark.apache.org
Subject: Spark streaming completed batches statistics

Is there any way to get a more computer-friendly version of the completed 
batches section of the streaming page of the application master? I am very 
interested in the statistics and am currently screen-scraping...

https://richardstartin.com




Re: Spark streaming completed batches statistics

2016-12-07 Thread Richard Startin
Is there any way to get this information as CSV/JSON?


https://docs.databricks.com/_images/CompletedBatches.png



https://richardstartin.com/



From: Richard Startin <richardstar...@outlook.com>
Sent: 05 December 2016 15:55
To: user@spark.apache.org
Subject: Spark streaming completed batches statistics

Is there any way to get a more computer-friendly version of the completed 
batches section of the streaming page of the application master? I am very 
interested in the statistics and am currently screen-scraping...

https://richardstartin.com



Re: Back-pressure to Spark Kafka Streaming?

2016-12-05 Thread Richard Startin
I've seen the feature work very well. For tuning, you've got:

spark.streaming.backpressure.pid.proportional (defaults to 1, non-negative) - 
weight for response to "error" (change between last batch and this batch)
spark.streaming.backpressure.pid.integral (defaults to 0.2, non-negative) - 
weight for the response to the accumulation of error. This has a dampening 
effect.
spark.streaming.backpressure.pid.derived (defaults to zero, non-negative) - 
weight for the response to the trend in error. This can cause 
arbitrary/noise-induced fluctuations in batch size, but can also help react 
quickly to increased/reduced capacity.
spark.streaming.backpressure.pid.minRate - the default value is 100 (must be 
positive), batch size won't go below this.

spark.streaming.receiver.maxRate - batch size won't go above this.
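
As a rough sketch of wiring these up (the values are illustrative rather than
recommendations, and the streaming job itself is elided), they are ordinary Spark
properties, so they can be set on the SparkConf before the StreamingContext is built:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BackpressureConfig {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
                .setAppName("backpressure-demo")
                // turn the PID-based rate estimator on
                .set("spark.streaming.backpressure.enabled", "true")
                // PID weights (illustrative values; the defaults are 1.0 / 0.2 / 0.0)
                .set("spark.streaming.backpressure.pid.proportional", "1.0")
                .set("spark.streaming.backpressure.pid.integral", "0.2")
                .set("spark.streaming.backpressure.pid.derived", "0.0")
                // floor for the computed rate
                .set("spark.streaming.backpressure.pid.minRate", "100")
                // ceiling for receiver-based streams
                .set("spark.streaming.receiver.maxRate", "10000")
                // for the direct Kafka stream the ceiling is per partition instead
                .set("spark.streaming.kafka.maxRatePerPartition", "1000");

        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.minutes(1));
        // ... create the Kafka stream and output operations here ...
        ssc.start();
        ssc.awaitTermination();
    }
}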


Cheers,

Richard


https://richardstartin.com/



From: Liren Ding 
Sent: 05 December 2016 22:18
To: dev@spark.apache.org; u...@spark.apache.org
Subject: Back-pressure to Spark Kafka Streaming?

Hey all,

Does backpressure actually work on Spark Kafka streaming? According to the 
latest spark streaming document:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
"In Spark 1.5, we have introduced a feature called backpressure that eliminate 
the need to set this rate limit, as Spark Streaming automatically figures out 
the rate limits and dynamically adjusts them if the processing conditions 
change. This backpressure can be enabled by setting the configuration parameter 
spark.streaming.backpressure.enabled to true."
But I also see a few open spark jira tickets on this option:
https://issues.apache.org/jira/browse/SPARK-7398
https://issues.apache.org/jira/browse/SPARK-18371

The case in the second ticket describes a similar issue as we have here. We use 
Kafka to send large batches (10~100M) to spark streaming, and the spark 
streaming interval is set to 1~4 minutes. With the backpressure set to true, 
the queued active batches still pile up when average batch processing time 
takes longer than default interval. After the spark driver is restarted, all 
queued batches turn to a giant batch, which block subsequent batches and also 
have a great chance to fail eventually. The only config we found that might 
help is "spark.streaming.kafka.maxRatePerPartition". It does limit the incoming 
batch size, but not a perfect solution since it depends on size of partition as 
well as the length of batch interval. For our case, hundreds of partitions X 
minutes of interval still produce a number that is too large for each batch. So 
we still want to figure out how to make the backpressure work in Spark Kafka 
streaming, if it is supposed to work there. Thanks.


Liren









Spark streaming completed batches statistics

2016-12-05 Thread Richard Startin
Is there any way to get a more computer-friendly version of the completed 
batches section of the streaming page of the application master? I am very 
interested in the statistics and am currently screen-scraping... 

https://richardstartin.com



Re: Livy with Spark

2016-12-05 Thread Richard Startin
There is a great write-up on Livy at
http://henning.kropponline.de/2016/11/06/

On 5 Dec 2016, at 14:34, Mich Talebzadeh wrote:

Hi,

Has there been any experience using Livy with Spark to share multiple Spark 
contexts?

thanks



Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




Re: Storing XML file in Hbase

2016-11-28 Thread Richard Startin
In my experience it's better to keep the number of column families low. When 
flushes occur, they affect all column families in a table, so when the memstore 
fills you'll create an HFile per family. I haven't seen any performance impact 
in having two column families though.


As for the number of columns, there are two extremes - 1) "narrow" - store the 
xml as a blob in a single cell; 2) "wide" - break it out into columns, of which 
you can have thousands.


  1.  In the case where you store XML as a blob you always need to retrieve the 
entire document, and must deserialise it to perform operations. You save space 
in not repeating the row key, you save space on column and column family 
qualifiers
  2.  When you break the XML out into columns you can retrieve data at a 
per-attribute level, which might save IO by filtering unnecessary content, and you 
don't need to break open the XML to perform operations. You incur a cost in 
repeating the row key per tuple (this can add up and will affect read 
performance by limiting the number of rows that can fit into the block cache), 
as well as the extra cost of column families. There is a practical limit to the 
number of columns because a row cannot be split across regions.

You may find optimal performance for your use case somewhere between the two 
extremes and it's best to prototype and measure early.
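
To make the two extremes concrete, here is a minimal sketch (the table name, column
family, qualifiers and XPath expressions are invented for illustration): the "narrow"
put stores the whole document in one cell, the "wide" put breaks attributes out into
their own columns up front:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.xml.sax.InputSource;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import java.io.StringReader;

public class XmlStorageSketch {

    private static final byte[] CF = Bytes.toBytes("d");

    // 1) "narrow": one cell holding the whole document
    static void putNarrow(Table table, String rowKey, String xml) throws Exception {
        Put put = new Put(Bytes.toBytes(rowKey));
        put.addColumn(CF, Bytes.toBytes("xml"), Bytes.toBytes(xml));
        table.put(put);
    }

    // 2) "wide": extract attributes up front so reads can filter by column
    static void putWide(Table table, String rowKey, String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Put put = new Put(Bytes.toBytes(rowKey));
        put.addColumn(CF, Bytes.toBytes("customer"), Bytes.toBytes(
                xpath.evaluate("/order/customer/text()", new InputSource(new StringReader(xml)))));
        put.addColumn(CF, Bytes.toBytes("amount"), Bytes.toBytes(
                xpath.evaluate("/order/amount/text()", new InputSource(new StringReader(xml)))));
        table.put(put);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("orders"))) {
            String xml = "<order><customer>acme</customer><amount>42</amount></order>";
            putNarrow(table, "order#1", xml);
            putWide(table, "order#1", xml);
        }
    }
}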

Cheers,
Richard


https://richardstartin.com/



From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: 28 November 2016 21:57
To: user@hbase.apache.org
Subject: Re: Storing XML file in Hbase

Thanks Richard.

How would one decide on the number of column families and columns?

Is there a ballpark approach?

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 November 2016 at 16:04, Richard Startin <richardstar...@outlook.com>
wrote:

> Hi Mich,
>
> If you want to store the file whole, you'll need to enforce a 10MB limit
> to the file size, otherwise you will flush too often (each time the
> memstore fills up) which will slow down writes.
>
> Maybe you could deconstruct the xml by extracting columns from the xml
> using xpath?
>
> If the files are small there might be a tangible performance benefit by
> limiting the number of columns.
>
> Cheers,
> Richard
>
> Sent from my iPhone
>
> > On 28 Nov 2016, at 15:53, Dima Spivak <dimaspi...@apache.org> wrote:
> >
> > Hi Mich,
> >
> > How many files are you looking to store? How often do you need to read
> > them? What's the total size of all the files you need to serve?
> >
> > Cheers,
> > Dima
> >
> > On Mon, Nov 28, 2016 at 7:04 AM Mich Talebzadeh <
> mich.talebza...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> Storing XML file in Big Data. Are there any strategies to create
> multiple
> >> column families or just one column family and in that case how many
> columns
> >> would be optional?
> >>
> >> thanks
> >>
> >> Dr Mich Talebzadeh
> >>
> >>
> >>
> >> LinkedIn *
> >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw
> >> <
> >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw
> >>> *
> >>
> >>
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >>
> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any
> >> loss, damage or destruction of data or any other property which may
> arise
> >> from relying on this email's technical content is explicitly disclaimed.
> >> The author will in no case be liable for any monetary damages arising
> from
> >> such loss, damage or destruction.
> >>
>


Re: Storing XML file in Hbase

2016-11-28 Thread Richard Startin
Hi Mich,

If you want to store the file whole, you'll need to enforce a 10MB limit to the 
file size, otherwise you will flush too often (each time the memstore fills up) 
which will slow down writes. 

Maybe you could deconstruct the XML by extracting columns from it using 
XPath?

If the files are small there might be a tangible performance benefit by 
limiting the number of columns.

Cheers,
Richard

Sent from my iPhone

> On 28 Nov 2016, at 15:53, Dima Spivak  wrote:
> 
> Hi Mich,
> 
> How many files are you looking to store? How often do you need to read
> them? What's the total size of all the files you need to serve?
> 
> Cheers,
> Dima
> 
> On Mon, Nov 28, 2016 at 7:04 AM Mich Talebzadeh 
> wrote:
> 
>> Hi,
>> 
>> Storing XML file in Big Data. Are there any strategies to create multiple
>> column families or just one column family and in that case how many columns
>> would be optional?
>> 
>> thanks
>> 
>> Dr Mich Talebzadeh
>> 
>> 
>> 
>> LinkedIn *
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>> 
>> 
>> 
>> http://talebzadehmich.wordpress.com
>> 
>> 
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising from
>> such loss, damage or destruction.
>> 


Re: Hbase 1.2 connection pool in java

2016-11-24 Thread Richard Startin
Hi Manjeet,

I wrote about a connection pool I implemented at 
https://richardstartin.com/2016/11/05/hbase-connection-management/
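
Not the code from that post, but as a minimal sketch of the HBase 1.x API that replaced
the deprecated HConnection: ConnectionFactory creates a heavyweight, thread-safe
Connection meant to be shared by the whole application, while the lightweight Table
handles are obtained and closed per operation. Table and row names below are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class SharedHBaseConnection {

    // one Connection per application: it is thread-safe and expensive to create
    private final Connection connection;

    public SharedHBaseConnection(Configuration conf) throws IOException {
        this.connection = ConnectionFactory.createConnection(conf);
    }

    // Table objects are cheap and not thread-safe: get one per operation and close it
    public Result get(String tableName, String rowKey) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            return table.get(new Get(Bytes.toBytes(rowKey)));
        }
    }

    public void close() throws IOException {
        connection.close();
    }

    public static void main(String[] args) throws IOException {
        SharedHBaseConnection client = new SharedHBaseConnection(HBaseConfiguration.create());
        System.out.println(client.get("mytable", "row1"));
        client.close();
    }
}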

Cheers,
Richard

Sent from my iPhone

> On 24 Nov 2016, at 17:43, Manjeet Singh  wrote:
> 
> Hi All
> 
> Can anyone help me out on how to create an HBase connection pool in Java?
> 
> Also, org.apache.hadoop.hbase.client.HConnection is deprecated, so is there
> any new approach?
> 
> Thanks
> Manjeet
> 
> -- 
> luv all