
2023-07-19 Thread Josh Patterson
unsubscribe


Argo for general purpose k8s scheduling

2023-07-19 Thread Mich Talebzadeh
Hi,

Is there any update on the use case of Argo for Kubernetes scheduling? As I
understand it, Kubeflow uses it for scheduling. Outside of machine learning
and MLOps on Kubernetes, has anyone used Argo for standard ETL as well, and
if so, what was your experience?

Thanks

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Dipayan Dev
Thank you. Will try out these options.



With Best Regards,



On Wed, Jul 19, 2023 at 1:40 PM Mich Talebzadeh 
wrote:

> Sounds like if the mv command is inherently slow, there is little that can
> be done.
>
> The only suggestion I can make is to create the staging table compressed,
> to reduce its size and hence the mv time. Is that feasible? Also, the
> managed table can be created with SNAPPY compression:
>
> STORED AS ORC
> TBLPROPERTIES (
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="KEY",
> "orc.bloom.filter.fpp"="0.05",
> "*orc.compress"="SNAPPY",*
> "orc.stripe.size"="16777216",
> "orc.row.index.stride"="1" )
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>
>
>
>
> On Wed, 19 Jul 2023 at 02:35, Dipayan Dev  wrote:
>
>> Hi Mich,
>> OK, my use case is a bit different.
>> I have a Hive table partitioned by date, and I need to do dynamic partition
>> updates (INSERT OVERWRITE) daily for the last 30 days of partitions.
>> The ETL inside the staging directories completes in barely 5 minutes,
>> but then the rename takes a long time, as it deletes and copies the
>> partitions.
>> My issue is similar to this one:
>> https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1
>>
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
>>
>>
>> On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Spark has no role in creating that Hive staging directory. The
>>> directory belongs to Hive; Spark simply does ETL there, loading into the
>>> Hive managed table, which in your case ends up in the staging directory.
>>>
>>> I suggest that you review your design and use an external Hive table
>>> with an explicit location on GCS that includes the date the data was
>>> loaded. Then push that data into the Hive managed table for today's
>>> partition.
>>>
>>> This was written in bash wrapping Hive HQL, but you can easily adapt it
>>> for Spark:
>>>
>>> TODAY="`date +%Y-%m-%d`"
>>> DateStamp="${TODAY}"
>>> CREATE EXTERNAL TABLE IF NOT EXISTS EXTERNALMARKETDATA (
>>>  KEY string
>>>, TICKER string
>>>, TIMECREATED string
>>>, PRICE float
>>> )
>>> COMMENT 'From prices using Kafka delivered by Flume location by day'
>>> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>>> STORED AS TEXTFILE
>>> LOCATION 'gs://etcbucket/cloud_data_fusion/hive.../';
>>>
>>> --Keep track of daily ingestion into the external table.
>>> ALTER TABLE EXTERNALMARKETDATA set location
>>> 'gs://etcbucket/cloud_data_fusion/hive.../${TODAY}';
>>>
>>> -- create your managed table here and populate it from the Hive external
>>> table
>>> CREATE TABLE IF NOT EXISTS MARKETDATA (
>>>  KEY string
>>>, TICKER string
>>>, TIMECREATED string
>>>, PRICE float
>>>, op_type int
>>>, op_time timestamp
>>> )
>>> PARTITIONED BY (DateStamp  string)
>>> STORED AS ORC
>>> TBLPROPERTIES (
>>> "orc.create.index"="true",
>>> "orc.bloom.filter.columns"="KEY",
>>> "orc.bloom.filter.fpp"="0.05",
>>> "orc.compress"="SNAPPY",
>>> "orc.stripe.size"="16777216",
>>> "orc.row.index.stride"="1" )
>>> ;
>>>
>>> --Populate target table
>>> INSERT OVERWRITE TABLE MARKETDATA PARTITION (DateStamp = "${TODAY}")
>>> SELECT
>>>   KEY
>>> , TICKER
>>> , TIMECREATED
>>> , PRICE
>>> , 1
>>> , CAST(from_unixtime(unix_timestamp()) AS timestamp)
>>> FROM EXTERNALMARKETDATA;
>>>
>>> ANALYZE TABLE MARKETDATA PARTITION (DateStamp) COMPUTE STATISTICS;
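Mich notes this can be adapted for Spark. As a small, hypothetical Python sketch, the dated location and the statements above could be generated as strings and passed to spark.sql; the bucket path below is a placeholder echoing the example, not a real location, and the original path's elided portion is not filled in:

```python
from datetime import date

# Placeholder base path (hypothetical; the real path in the thread is elided).
BASE = "gs://etcbucket/cloud_data_fusion/hive"

def dated_location(day: date) -> str:
    # Mirrors TODAY="`date +%Y-%m-%d`" in the bash wrapper.
    return f"{BASE}/{day:%Y-%m-%d}"

def alter_location_sql(day: date) -> str:
    # Repoints the external table at today's ingestion directory.
    return (f"ALTER TABLE EXTERNALMARKETDATA SET LOCATION "
            f"'{dated_location(day)}'")

def insert_overwrite_sql(day: date) -> str:
    # Populates today's partition of the managed table, as in the HQL above.
    return (f'INSERT OVERWRITE TABLE MARKETDATA '
            f'PARTITION (DateStamp = "{day:%Y-%m-%d}") '
            f"SELECT KEY, TICKER, TIMECREATED, PRICE, 1, "
            f"CAST(from_unixtime(unix_timestamp()) AS timestamp) "
            f"FROM EXTERNALMARKETDATA")

# In a Spark job these would be executed via spark.sql(...), e.g.:
#   spark.sql(alter_location_sql(date.today()))
#   spark.sql(insert_overwrite_sql(date.today()))
```

The Spark calls themselves are left as comments since they require a live session; only the string construction is shown.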
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 18 Jul 2023 at 18:22, Dipayan Dev 
>>> wrote:
>>>
 It does help performance but not significantly.

 I am just wondering: once Spark creates that staging directory along
 with the SUCCESS file, can we just do a gsutil rsync command and move
 these files to the original directory? Anyone tried this approach or
 foresee any concern?
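Nobody in the thread confirms the rsync idea, but as a sketch of what the copy step might look like (paths are hypothetical, and whether this is safe with respect to Spark's commit protocol is exactly the open question):

```python
import subprocess

def rsync_cmd(staging_dir: str, final_dir: str) -> list[str]:
    # gsutil rsync copies objects rather than renaming them, so it avoids
    # per-object rename but still pays for the copies.
    # -m parallelizes; -r recurses into the directory tree.
    return ["gsutil", "-m", "rsync", "-r", staging_dir, final_dir]

def run_rsync(staging_dir: str, final_dir: str) -> None:
    # The caller should verify the _SUCCESS marker exists before invoking,
    # so partially written task output is never copied.
    subprocess.run(rsync_cmd(staging_dir, final_dir), check=True)
```

Only the command construction is exercised here; actually running it requires gsutil and valid GCS credentials.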

Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Mich Talebzadeh
Sounds like if the mv command is inherently slow, there is little that can
be done.

The only suggestion I can make is to create the staging table compressed,
to reduce its size and hence the mv time. Is that feasible? Also, the
managed table can be created with SNAPPY compression:

STORED AS ORC
TBLPROPERTIES (
"orc.create.index"="true",
"orc.bloom.filter.columns"="KEY",
"orc.bloom.filter.fpp"="0.05",
"*orc.compress"="SNAPPY",*
"orc.stripe.size"="16777216",
"orc.row.index.stride"="1" )

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom






On Wed, 19 Jul 2023 at 02:35, Dipayan Dev  wrote:

> Hi Mich,
> OK, my use case is a bit different.
> I have a Hive table partitioned by date, and I need to do dynamic partition
> updates (INSERT OVERWRITE) daily for the last 30 days of partitions.
> The ETL inside the staging directories completes in barely 5 minutes,
> but then the rename takes a long time, as it deletes and copies the
> partitions.
> My issue is similar to this one:
> https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1
>
>
>
> With Best Regards,
>
> Dipayan Dev
>
>
>
> On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Spark has no role in creating that Hive staging directory. The directory
>> belongs to Hive; Spark simply does ETL there, loading into the Hive
>> managed table, which in your case ends up in the staging directory.
>>
>> I suggest that you review your design and use an external Hive table with
>> an explicit location on GCS that includes the date the data was loaded.
>> Then push that data into the Hive managed table for today's partition.
>>
>> This was written in bash wrapping Hive HQL, but you can easily adapt it
>> for Spark:
>>
>> TODAY="`date +%Y-%m-%d`"
>> DateStamp="${TODAY}"
>> CREATE EXTERNAL TABLE IF NOT EXISTS EXTERNALMARKETDATA (
>>  KEY string
>>, TICKER string
>>, TIMECREATED string
>>, PRICE float
>> )
>> COMMENT 'From prices using Kafka delivered by Flume location by day'
>> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>> STORED AS TEXTFILE
>> LOCATION 'gs://etcbucket/cloud_data_fusion/hive.../';
>>
>> --Keep track of daily ingestion into the external table.
>> ALTER TABLE EXTERNALMARKETDATA set location
>> 'gs://etcbucket/cloud_data_fusion/hive.../${TODAY}';
>>
>> -- create your managed table here and populate it from the Hive external
>> table
>> CREATE TABLE IF NOT EXISTS MARKETDATA (
>>  KEY string
>>, TICKER string
>>, TIMECREATED string
>>, PRICE float
>>, op_type int
>>, op_time timestamp
>> )
>> PARTITIONED BY (DateStamp  string)
>> STORED AS ORC
>> TBLPROPERTIES (
>> "orc.create.index"="true",
>> "orc.bloom.filter.columns"="KEY",
>> "orc.bloom.filter.fpp"="0.05",
>> "orc.compress"="SNAPPY",
>> "orc.stripe.size"="16777216",
>> "orc.row.index.stride"="1" )
>> ;
>>
>> --Populate target table
>> INSERT OVERWRITE TABLE MARKETDATA PARTITION (DateStamp = "${TODAY}")
>> SELECT
>>   KEY
>> , TICKER
>> , TIMECREATED
>> , PRICE
>> , 1
>> , CAST(from_unixtime(unix_timestamp()) AS timestamp)
>> FROM EXTERNALMARKETDATA;
>>
>> ANALYZE TABLE MARKETDATA PARTITION (DateStamp) COMPUTE STATISTICS;
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>
>>
>>
>>
>> On Tue, 18 Jul 2023 at 18:22, Dipayan Dev 
>> wrote:
>>
>>> It does help performance but not significantly.
>>>
>>> I am just wondering: once Spark creates that staging directory along
>>> with the SUCCESS file, can we just do a gsutil rsync command and move
>>> these files to the original directory? Anyone tried this approach or
>>> foresee any concern?
>>>
>>>
>>>
>>> On Mon, 17 Jul 2023 at 9:47 PM, Dipayan Dev 
>>> wrote:
>>>
 Thanks Jay. Is there any suggestion on how much I can increase those
 parameters?

 On Mon, 17 Jul 2023 at 8:25 PM, Jay 
 wrote:

> Fileoutputcommitter
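Jay's message is truncated here, so this is general background on the thread's subject only, not a reconstruction of his advice: two real, commonly discussed Spark settings for commit behavior on object stores, sketched as a config dict. Whether v2 is appropriate depends on your failure-semantics requirements.

```python
# General Hadoop/Spark options sometimes tuned for object-store commits.
# Background on the thread's subject, not the truncated suggestion above.
COMMITTER_CONF = {
    # Algorithm v2 commits task output directly to the destination,
    # trading stronger failure semantics for fewer renames.
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    # Needed so INSERT OVERWRITE replaces only the partitions touched,
    # as in the 30-day dynamic-partition use case discussed above.
    "spark.sql.sources.partitionOverwriteMode": "dynamic",
}

# Applied when building a session, e.g.:
#   builder = SparkSession.builder
#   for k, v in COMMITTER_CONF.items():
#       builder = builder.config(k, v)
```

The session-building lines are left as comments since they require a Spark installation.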