OK, good news. You have made some progress here :)

bzip2 works (it is splittable) because it is block-oriented, whereas gzip
is stream-oriented. I also noticed that you are creating a managed ORC
table. You can bucket and partition an ORC (Optimized Row Columnar)
table. An example below:


DROP TABLE IF EXISTS dummy;

CREATE TABLE dummy (
     ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(10)
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
    "orc.create.index"="true",
    "orc.bloom.filter.columns"="ID",
    "orc.bloom.filter.fpp"="0.05",
    "orc.compress"="SNAPPY",
    "orc.stripe.size"="16777216",
    "orc.row.index.stride"="10000"
);
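
The example above only buckets the table. To partition it as well, add a
PARTITIONED BY clause before CLUSTERED BY. A sketch below (the partition
column RUN_DATE is hypothetical; in practice pick a low-cardinality column
from your own data, such as a date or region code):

DROP TABLE IF EXISTS dummy_partitioned;

CREATE TABLE dummy_partitioned (
     ID INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
)
PARTITIONED BY (RUN_DATE DATE)        -- hypothetical partition column
CLUSTERED BY (ID) INTO 256 BUCKETS    -- same bucketing as above
STORED AS ORC
TBLPROPERTIES (
    "orc.compress"="SNAPPY"
);

When you load such a table with INSERT ... SELECT and want the partition
values to come from the data, dynamic partitioning needs to be enabled
(hive.exec.dynamic.partition=true and
hive.exec.dynamic.partition.mode=nonstrict).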

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


View my LinkedIn profile:
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/

https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 26 Jun 2023 at 19:35, Patrick Tucci <patrick.tu...@gmail.com> wrote:

> Hi Mich,
>
> Thanks for the reply. I started running ANALYZE TABLE on the external
> table, but the progress was very slow. The stage had only read about 275MB
> in 10 minutes. That equates to about 5.5 hours just to analyze the table.
>
> This might just be the reality of trying to process a 240m record file
> with 80+ columns, unless there's an obvious issue with my setup that
> someone sees. The solution is likely going to involve increasing
> parallelization.
>
> To that end, I decompressed the file and recompressed it with bzip2. Since
> bzip2 is splittable and gzip is not, Spark can process the bzip2 file in
> parallel.
> The same CTAS query only took about 45 minutes. This is still a bit slower
> than I had hoped, but the import from bzip fully utilized all available
> cores. So we can give the cluster more resources if we need the process to
> go faster.
>
> Patrick
>
> On Mon, Jun 26, 2023 at 12:52 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> OK, for now, have you analyzed statistics on the Hive external table?
>>
>> spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS;
>> spark-sql (default)> DESC EXTENDED test.stg_t2;
>>
>> Hive external tables have little optimization applied to them.
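>>
>> Once the column statistics have been gathered, you can also inspect an
>> individual column (ClaimID here is just an example column from your
>> schema), for example:
>>
>> spark-sql (default)> DESC EXTENDED test.stg_t2 ClaimID;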
>>
>> HTH
>>
>>
>> On Mon, 26 Jun 2023 at 16:33, Patrick Tucci <patrick.tu...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master
>>> node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores
>>> and 64GB of RAM.
>>>
>>> I'm trying to process a large pipe delimited file that has been
>>> compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m records, ~85
>>> columns). I uploaded the gzipped file to HDFS and created an external table
>>> using the attached script. I tried two simpler queries on the same table,
>>> and they finished in ~5 and ~10 minutes respectively:
>>>
>>> SELECT COUNT(*) FROM ClaimsImport;
>>> SELECT COUNT(*) FROM ClaimsImport WHERE ClaimLineID = 1;
>>>
>>> However, when I tried to create a table stored as ORC using this table
>>> as the input, the query ran for almost 4 hours:
>>>
>>> CREATE TABLE Claims STORED AS ORC
>>> AS
>>> SELECT *
>>> FROM ClaimsImport
>>> --Exclude the header record
>>> WHERE ClaimID <> 'ClaimID';
>>>
>>> [image: image.png]
>>>
>>> Why is there such a speed disparity between these different operations?
>>> I understand that this job cannot be parallelized because the file is
>>> compressed with gzip. I also understand that creating an ORC table from the
>>> input will take more time than a simple COUNT(*). But it doesn't feel like
>>> the CREATE TABLE operation should take more than 24x longer than a simple
>>> SELECT COUNT(*) statement.
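>>>
>>> (One possible variant, shown only as a sketch: Spark's REPARTITION hint
>>> forces a shuffle after the single-task gzip scan, so the ORC encoding and
>>> file writing are spread across all cores. The partition count of 64 is
>>> only a placeholder.)
>>>
>>> CREATE TABLE Claims STORED AS ORC
>>> AS
>>> SELECT /*+ REPARTITION(64) */ *
>>> FROM ClaimsImport
>>> --Exclude the header record
>>> WHERE ClaimID <> 'ClaimID';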
>>>
>>> Thanks for any help. Please let me know if I can provide any additional
>>> information.
>>>
>>> Patrick
>>>
>>
>>
