Hi Kevin -

   There's not really a race condition: the 64-bit value is split into a
31-bit partition id (the upper bits) and a 33-bit record number that
increments within each partition.  In other words, as long as each
partition contains fewer than about 8.6 billion (2^33) entries there will
be no overlap, and no communication between executors is needed to get the
next id.
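
If it helps to see that layout, here's a quick sketch from the spark-shell
(the repartition(3) and the decoding expressions are just for illustration):
the upper 31 bits hold the partition id and the lower 33 bits hold the
record number within that partition.

scala> import org.apache.spark.sql.functions.monotonically_increasing_id
scala> val df = spark.range(5).repartition(3).withColumn("id", monotonically_increasing_id())
scala> // upper 31 bits = partition id, lower 33 bits = per-partition record number (2^33 - 1 = 8589934591)
scala> df.selectExpr("id", "shiftright(id, 33) as partition_id", "id & 8589934591 as record_in_partition").show()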

Depending on what you mean by duplication, there shouldn't be any within a
column as long as you maintain some sort of state across runs (i.e., the
startval Mich shows, a previous max id, etc.).  While these ids are unique
in that sense, they are not the same as a UUID/GUID, which is effectively
unique across all entries given enough randomness.  Think of the
monotonically increasing id as an auto-incrementing column from a
relational database, but with potentially massive gaps in the ids.
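
As a rough sketch of carrying that state forward (prevMaxId here is a
hypothetical value you would persist yourself, e.g. the max id written by
the previous batch):

scala> val prevMaxId = 1000L  // hypothetical: max id from the previous run, loaded from wherever you track it
scala> val newRows = spark.range(3).toDF("value")
scala> val withIds = newRows.withColumn("id", monotonically_increasing_id() + prevMaxId + 1)

Every new id is at least prevMaxId + 1, so it can't collide with anything
written earlier, though as noted the ids are unique rather than contiguous.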

Thanks

Mike


On Sun, Sep 4, 2016 at 6:41 PM, Kevin Tran <kevin...@gmail.com> wrote:

> Hi Mich,
> Thank you for your input.
> Does monotonically_increasing_id() guard against race conditions, and
> could it duplicate ids at some point with multiple threads, multiple instances, ... ?
>
> Can even System.currentTimeMillis() still produce duplicates?
>
> Cheers,
> Kevin.
>
> On Mon, Sep 5, 2016 at 12:30 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> You can create a monotonically incrementing ID column on your table
>>
>> scala> val ll_18740868 = spark.table("accounts.ll_18740868")
>> scala> val startval = 1
>> scala> val df = ll_18740868.withColumn("id", monotonically_increasing_id() + startval).show(2)
>> +---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
>> |transactiondate|transactiontype| sortcode|accountnumber|transactiondescription|debitamount|creditamount|balance| id|
>> +---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
>> |     2011-12-30|            DEB|'30-64-72|     18740868|  WWW.GFT.COM CD 4628 |       50.0|        null| 304.89|  1|
>> |     2011-12-30|            DEB|'30-64-72|     18740868|  TDA.CONFECC.D.FRE...|      19.01|        null| 354.89|  2|
>> +---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
>>
>>
>> Now you have a new ID column
>>
>> HTH
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 4 September 2016 at 12:43, Kevin Tran <kevin...@gmail.com> wrote:
>>
>>> Hi everyone,
>>> Please give me your opinions on what is the best ID Generator for ID
>>> field in parquet ?
>>>
>>> UUID.randomUUID();
>>> AtomicReference<Long> currentTime = new AtomicReference<>(System.currentTimeMillis());
>>> AtomicLong counter = new AtomicLong(0);
>>> ....
>>>
>>> Thanks,
>>> Kevin.
>>>
>>>
>>> ----
>>> https://issues.apache.org/jira/browse/SPARK-8406 (Race condition when
>>> writing Parquet files)
>>> https://github.com/apache/spark/pull/6864/files
>>>
>>
>>
>
