[ 
https://issues.apache.org/jira/browse/PIG-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516745#comment-14516745
 ] 

Ángel Álvarez commented on PIG-4512:
------------------------------------

I've sorted the data as Daniel suggested, and this is what I've got:

                        T1      T2      T3      T4      Average (ms)
HCatLoader              48134   46217   55369   54358   51019.5
OrcStorage              44290   49200   49984   50767   48560.25
PigStorage              19307   24092   20952   24774   22281.25

On average, OrcStorage is only about 2.5 seconds faster than HCatLoader 
(48560.25 ms vs. 51019.5 ms). Curiously, PigStorage is the clear winner by 
far, at about 22 seconds. Splitting the file before importing it into Hive, 
however, does not seem to have any significant influence.
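
For anyone who wants to re-check the averages, here is a minimal awk sketch. The timings are copied from the table above; the function name average_runs is only an illustrative choice, not something from Pig or Hive.

```shell
# Timings in ms copied from the table above: loader name, then runs T1..T4.
timings='HCatLoader 48134 46217 55369 54358
OrcStorage 44290 49200 49984 50767
PigStorage 19307 24092 20952 24774'

# Print the mean of the runs for each loader (hypothetical helper name).
average_runs() {
  awk '{ sum = 0
         for (i = 2; i <= NF; i++) sum += $i
         printf "%s %.2f\n", $1, sum / (NF - 1) }'
}

printf '%s\n' "$timings" | average_runs
```

The gap between the HCatLoader and OrcStorage means is 51019.5 - 48560.25 = 2459.25 ms, which is where the "about 2.5 seconds" figure comes from.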

On the other hand, predicate pushdown is enabled in Hive by default 
(https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties):

   hive.optimize.ppd
   Default Value: true
   Added In: Hive 0.4.0
   Whether to enable predicate pushdown (PPD). 
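
The documented default can differ from what a given cluster actually uses, so it may be worth printing the effective value. A small sketch, assuming the hive CLI is on the PATH; show_hive_setting is a hypothetical helper name. The second property in the example, hive.optimize.index.filter, is listed on the same wiki page, but whether it matters for this ORC test is an assumption to check, not something established here.

```shell
# Print the effective value of one Hive configuration property.
# Assumes the hive CLI is on PATH; show_hive_setting is a hypothetical helper.
show_hive_setting() {
  # "set <property>;" makes Hive print the property as key=value.
  hive -e "set $1;"
}

# Example calls (uncomment on a machine with Hive installed):
# show_hive_setting hive.optimize.ppd
# show_hive_setting hive.optimize.index.filter
```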

So, if I try to do more or less the same operation in Hive:

export HADOOP_OPTS="-Dhive.execution.engine=tez"
hive -e "select uri,count(*) from nasadata_orc where uri=='test' group by uri;"

The one-row result is obtained in only 14048.25 ms (on average). Does this 
mean my test in Pig is not using predicate pushdown?
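
To keep the Hive and Pig numbers comparable, the same timing procedure can be applied to both. The sketch below is one possible way to average wall-clock time over several runs, not necessarily how the figures above were measured. It assumes GNU date (for the %N nanosecond format); time_runs is a hypothetical helper name.

```shell
# Run a command several times and print the mean wall-clock time in ms.
# Assumes GNU date (%N gives nanoseconds); time_runs is a hypothetical helper.
# Usage: time_runs RUNS COMMAND [ARGS...]
time_runs() {
  runs=$1
  shift
  total=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1          # discard the command's own output
    end=$(date +%s%N)
    total=$(( total + (end - start) / 1000000 ))   # ns -> ms
    i=$(( i + 1 ))
  done
  awk -v t="$total" -v n="$runs" 'BEGIN { printf "%.2f\n", t / n }'
}

# Example (uncomment on a machine with Hive installed):
# time_runs 4 hive -e "select uri,count(*) from nasadata_orc where uri=='test' group by uri;"
```

The same helper could wrap the Pig script from the issue description, so that JVM startup and job submission overhead are counted the same way on both sides.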

> No performance improvement using OrcStorage
> -------------------------------------------
>
>                 Key: PIG-4512
>                 URL: https://issues.apache.org/jira/browse/PIG-4512
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>         Environment: Hortonworks 2.2, Pig 14.0, Hive 0.14.0, Tez
>            Reporter: Ángel Álvarez
>            Priority: Minor
>
> I've been doing some tests with Pig & Hive, trying to gain some performance 
> using the OrcStorage class and its "Predicate Push Down" loader. I've 
> followed these steps:
> 1. Download a dataset
> ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz
> 2. Create a new larger file by copying the same original file multiple times.
> cat NASA_access_log_Aug95 NASA_access_log_Aug95 ... > NASA
> 3. Add a new line in the data file
> echo 'slppp6.intermind.net - - [01/Aug/1995:00:00:11 -0400] "GET test 
> HTTP/1.0" 200 9202' >> NASA
> and split the file into different parts
> split -l 1000000 NASA NASA.
> 4. Create the ORC table in Hive
> DROP TABLE nasadata_txt;
> DROP TABLE nasadata_orc;
> CREATE TABLE nasadata_txt(ip VARCHAR(50), user_identifier VARCHAR(50), 
> user_id VARCHAR(50),date_time VARCHAR(50),zone VARCHAR(10),method 
> VARCHAR(5),uri VARCHAR(200),version VARCHAR(10),status DECIMAL(3,0),size 
> DECIMAL(10,0)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS 
> TEXTFILE;
> CREATE TABLE nasadata_orc(ip VARCHAR(50), user_identifier VARCHAR(50), 
> user_id VARCHAR(50),date_time VARCHAR(50),zone VARCHAR(10),method 
> VARCHAR(5),uri VARCHAR(200),version VARCHAR(10),status DECIMAL(3,0),size 
> DECIMAL(10,0)) STORED AS ORC;
> -- Load into Text table
> LOAD DATA LOCAL INPATH 'NASA.*' INTO TABLE nasadata_txt;
> -- Copy to ORC table
> INSERT OVERWRITE TABLE nasadata_orc SELECT * FROM nasadata_txt;
> 5. Execute this Pig script
> rmf /tmp/pruebaPPD;
> A = LOAD '/apps/hive/warehouse/nasadata_orc' using OrcStorage() as 
> (ip,user_identifier,user_id,date_time,zone,method,uri,version,status,size);
> A = foreach A generate ip,uri,status;
> A = filter A by uri == 'test';
> A = group A by uri;
> A = foreach A generate group,COUNT(*);
> store A into '/tmp/pruebaPPD' using PigStorage(';');
> 6. Execute the previous script, replacing OrcStorage with 
> org.apache.hive.hcatalog.pig.HCatLoader.
> I can't see any difference in performance between using OrcStorage and 
> HCatLoader. Is there anything wrong in what I'm doing? Do I have to set any 
> property?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
