Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Mike Metzger

Hi Kevin -

   There's not really a race condition, as the 64-bit value is split into a
31-bit partition id (the upper portion) and a 33-bit incrementing id (the
lower portion).  In other words, as long as each partition contains fewer
than 2^33 (about 8.6 billion) entries there should be no overlap, and no
communication between executors is needed to get the next id.
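As a rough illustration of the split described above (not Spark's actual
source), the following Java sketch packs a partition index and a
per-partition record number into one 64-bit value. The class and method
names are made up for this example, and the partition index is capped at
2^30 here only so that the result stays non-negative.

```java
// Hypothetical helper illustrating a 31-bit partition id / 33-bit record number layout.
final class MonotonicIdLayout {
    static final int RECORD_BITS = 33;
    // 2^33 = 8,589,934,592 record numbers are available per partition.
    static final long MAX_RECORDS_PER_PARTITION = 1L << RECORD_BITS;
    // Assumption of this sketch: stay below 2^30 partitions so ids never go negative.
    static final int MAX_PARTITIONS = 1 << 30;

    /** Packs the partition index into the upper bits and the record number into the lower 33 bits. */
    static long compose(int partitionIndex, long recordNumber) {
        if (partitionIndex < 0 || partitionIndex >= MAX_PARTITIONS) {
            throw new IllegalArgumentException("partition index out of range: " + partitionIndex);
        }
        if (recordNumber < 0 || recordNumber >= MAX_RECORDS_PER_PARTITION) {
            throw new IllegalArgumentException("record number out of range: " + recordNumber);
        }
        return ((long) partitionIndex << RECORD_BITS) | recordNumber;
    }

    /** Recovers the partition index from an id built by compose. */
    static int partitionOf(long id) {
        return (int) (id >>> RECORD_BITS);
    }

    /** Recovers the per-partition record number from an id built by compose. */
    static long recordOf(long id) {
        return id & (MAX_RECORDS_PER_PARTITION - 1);
    }

    public static void main(String[] args) {
        System.out.println(compose(0, 0)); // 0
        System.out.println(compose(0, 1)); // 1
        System.out.println(compose(1, 0)); // 8589934592
        System.out.println(compose(2, 5)); // 17179869189
    }
}
```

Because each partition only ever writes into its own range of values, two
partitions cannot produce the same id, which is also why the ids jump by
large amounts from one partition to the next.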

Depending on what you mean by duplication, there shouldn't be any within a
column as long as you maintain some sort of state between runs (i.e., the
startval Mich shows, a previous max id, etc.).  While these ids are unique
in that sense, they are not the same as a UUID / GUID, which is generally
unique across all entries, assuming enough randomness.  Think of the
monotonically increasing id as an auto-incrementing column (with
potentially massive gaps in ids) from a relational database.
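One way to picture the "maintain some sort of state" point is the sketch
below, which shifts the ids of a new batch past the largest id already
stored. It is plain Java with hypothetical names (BatchIdOffset, shift,
maxOf) and hard-coded sample ids; it only illustrates the arithmetic, not
how the ids would be read from or written to a table.

```java
import java.util.Arrays;

// Hypothetical helper: offsets a batch's raw ids by the previous maximum id + 1.
final class BatchIdOffset {
    /** Returns a copy of batchIds in which every id is shifted past previousMaxId. */
    static long[] shift(long[] batchIds, long previousMaxId) {
        long offset = Math.addExact(previousMaxId, 1L); // throws on overflow
        long[] shifted = new long[batchIds.length];
        for (int i = 0; i < batchIds.length; i++) {
            shifted[i] = Math.addExact(batchIds[i], offset);
        }
        return shifted;
    }

    /** Largest id in ids, or fallback when the array is empty. */
    static long maxOf(long[] ids, long fallback) {
        return Arrays.stream(ids).max().orElse(fallback);
    }

    public static void main(String[] args) {
        // Raw ids as two runs might produce them: both runs start again at 0.
        long[] firstRun = {0L, 1L, 8589934592L};
        long[] secondRun = {0L, 1L, 8589934592L};

        long previousMax = -1L; // nothing stored yet
        long[] storedFirst = shift(firstRun, previousMax);
        previousMax = maxOf(storedFirst, previousMax);
        long[] storedSecond = shift(secondRun, previousMax);

        System.out.println(Arrays.toString(storedFirst));  // [0, 1, 8589934592]
        System.out.println(Arrays.toString(storedSecond)); // [8589934593, 8589934594, 17179869185]
    }
}
```

Without the offset, the second run would repeat the ids of the first; with
it, the ids stay unique but the gaps between them grow.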

Thanks

Mike



Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Kevin Tran

Hi Mich,
Thank you for your input.
Does monotonically_increasing_id() protect against race conditions, or can
it produce duplicate ids at some point with multiple threads, multiple
instances, ...?

Can even System.currentTimeMillis() still produce duplicates?

Cheers,
Kevin.
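
To make the System.currentTimeMillis() question concrete: a millisecond
timestamp alone repeats whenever two ids are requested within the same
millisecond. The Java sketch below pairs the timestamp with an AtomicLong
sequence, which removes duplicates inside one JVM but not across separate
instances. The class name and the frozen clock value are assumptions made
for the demonstration; in real use the clock would be
System::currentTimeMillis.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical generator: "<millis>-<sequence>", unique only within one instance.
final class TimestampCounterId {
    private final LongSupplier clock;
    private final AtomicLong counter = new AtomicLong(0);

    TimestampCounterId(LongSupplier clock) {
        this.clock = clock;
    }

    /** The sequence part never repeats for this instance, even across threads. */
    String next() {
        return clock.getAsLong() + "-" + counter.getAndIncrement();
    }

    public static void main(String[] args) {
        // Frozen clock standing in for many calls within the same millisecond.
        LongSupplier frozen = () -> 1473000000000L;

        TimestampCounterId a = new TimestampCounterId(frozen);
        System.out.println(a.next()); // 1473000000000-0
        System.out.println(a.next()); // 1473000000000-1

        // A second instance (e.g. another executor) starts its counter at 0 again.
        TimestampCounterId b = new TimestampCounterId(frozen);
        System.out.println(b.next()); // 1473000000000-0, same as a's first id
    }
}
```

Making such ids unique across instances would need an extra component that
differs per instance, which is the role the partition id plays in the
monotonically increasing id.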


Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Mich Talebzadeh

You can create a monotonically incrementing ID column on your table

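scala> import org.apache.spark.sql.functions.monotonically_increasing_id
scala> val ll_18740868 = spark.table("accounts.ll_18740868")
scala> val startval = 1
scala> // keep the DataFrame in df; show() returns Unit, so call it separately
scala> val df = ll_18740868.withColumn("id", monotonically_increasing_id() + startval)
scala> df.show(2)
+---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
|transactiondate|transactiontype| sortcode|accountnumber|transactiondescription|debitamount|creditamount|balance| id|
+---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
|     2011-12-30|            DEB|'30-64-72|     18740868|  WWW.GFT.COM CD 4628 |       50.0|        null| 304.89|  1|
|     2011-12-30|            DEB|'30-64-72|     18740868|  TDA.CONFECC.D.FRE...|      19.01|        null| 354.89|  2|
+---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+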


Now you have a new ID column

HTH

Dr Mich Talebzadeh

On 4 September 2016 at 12:43, Kevin Tran  wrote:

> Hi everyone,
> Please give me your opinions on the best ID generator for an ID field
> in Parquet.
>
> UUID.randomUUID();
> AtomicReference<Long> currentTime = new AtomicReference<>(System.currentTimeMillis());
> AtomicLong counter = new AtomicLong(0);
> 
>
> Thanks,
> Kevin.
>
>
> 
> https://issues.apache.org/jira/browse/SPARK-8406 (Race condition when
> writing Parquet files)
> https://github.com/apache/spark/pull/6864/files
>
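
For comparison with the candidates listed in the question, the sketch below
shows the UUID.randomUUID() option on its own. It needs no shared state or
coordination, at the cost of a 128-bit id with no ordering. The class name
RandomUuidIds and the sample size are assumptions for illustration only.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical helper around java.util.UUID.randomUUID().
final class RandomUuidIds {
    /** Generates n random (version 4) UUIDs and collects them into a set. */
    static Set<UUID> generate(int n) {
        Set<UUID> ids = new HashSet<>();
        for (int i = 0; i < n; i++) {
            ids.add(UUID.randomUUID());
        }
        return ids;
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println(id);           // a different 36-character string on every run
        System.out.println(id.version()); // 4 (randomly generated UUID)

        // A collision among 1000 values is possible in principle but extremely unlikely.
        System.out.println(generate(1000).size());
    }
}
```

Uniqueness here is probabilistic rather than guaranteed by construction,
so it depends on the quality of the random source.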
