Hi Liang,

Thank you very much for your reply. I am answering your questions inline below.
1. Did you use the latest master version, or 1.0? Suggest you use master to test.

I have downloaded the latest version from Git and compiled it. It is carbondata_2.11-1.0.0-incubating-shade-hadoop2.2.0.

2. Have you tested other TPC-H queries which include where/filter?

I have only started recently; my plan is to move on to the full set of TPC-H queries to see CarbonData's performance improvements over Parquet. Right now I am running my own queries with different numbers of projected columns to see how well CarbonData can push down the projection.

3. In your case, is the query slow, or is the "write.format" below slow?

write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

I have the same line for Parquet, and Parquet works perfectly fine, so I don't think the write is causing the problem.

4. Use the master version to do the query test, and set "ENABLE_VECTOR_READER" to true.

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants

CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

Thanks for this suggestion. I will enable it and share the updated results with you.
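To be completely sure about point 3, I intend to time the scan separately from the CSV write after enabling the vectorized reader. Below is a minimal sketch of what I have in mind; the "time" helper and the output path are only illustrative, and "carbon" is the same session used in my queries:

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants

// Enable the vectorized reader, as suggested in point 4.
CarbonProperties.getInstance()
  .addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

// Illustrative helper: run a block and print how long it took.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

val proj1 = carbon.sql("SELECT orderkey, partkey, linenumber FROM lineitem_4")

// Check in the physical plan whether the projection reaches the Carbon scan.
proj1.explain(true)

// Force a full scan without writing anything, to measure the pure read cost.
time("scan only") { proj1.foreach(_ => ()) }

// Write to a fresh path (illustrative name); the difference is the write cost.
time("scan + csv write") {
  proj1.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1_timed/")
}

If the scan-only step already takes hours, the slowness is in the read path rather than in the CSV write.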
Community is doing TPC-H test also currently, do you want to participate in test together?

It would be nice to be part of this. Could you please guide me on how I can contribute?

Thank you

Regards
Faisal

-----Original Message-----
From: Liang Chen [mailto:chenliang6...@gmail.com]
Sent: Thursday, April 13, 2017 1:00 AM
To: dev@carbondata.incubator.apache.org
Subject: Re: CarbonData performance benchmarking

Hi

1. Did you use the latest master version, or 1.0? Suggest you use master to test.

2. Have you tested other TPC-H queries which include where/filter?

3. In your case, is the query slow, or is the "write.format" below slow?

write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

4. Use the master version to do the query test, and set "ENABLE_VECTOR_READER" to true.

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants
CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

Community is doing TPC-H test also currently, do you want to participate in test together?

Regards
Liang

2017-04-13 1:18 GMT+05:30 Rana Faisal Munir <fmu...@essi.upc.edu>:

> Dear all,
>
> I am running some experiments to benchmark the performance of both
> Parquet and CarbonData. I am using the TPC-H lineitem table of size 8 GB.
> It has 16 columns, and I am running different projection queries that
> read different numbers of columns (3, 6, 9, 12, 15). I am facing a
> problem with CarbonData: it seems to be very slow when I select more
> than 8 columns. It takes hours to process my request, whereas Parquet
> is very quick. Could anybody please help me understand this behavior?
>
> This is my cluster configuration:
>
> 3 machines
> 1 driver machine (128 GB, 24 cores)
> 2 worker machines (128 GB, 24 cores)
>
> My configuration settings for Spark are:
>
> spark.executor.instances 12
> spark.executor.memory 18g
> spark.driver.memory 57g
> spark.executor.cores 3
> spark.driver.cores 5
> spark.default.parallelism 72
>
> carbon.sort.file.buffer.size=20
> carbon.graph.rowset.size=100000
> carbon.number.of.cores.while.loading=6
> carbon.sort.size=500000
> carbon.enableXXHash=true
> carbon.number.of.cores.while.compacting=2
> carbon.compaction.level.threshold=4,3
> carbon.major.compaction.size=1024
> carbon.number.of.cores=4
> carbon.inmemory.record.size=120000
> carbon.enable.quick.filter=false
>
> My queries:
>
> carbon.sql("CREATE TABLE IF NOT EXISTS lineitem_4 (orderkey BIGINT,
> partkey BIGINT, suppkey BIGINT, linenumber BIGINT, quantity DOUBLE,
> extendedprice DOUBLE, discount DOUBLE, tax DOUBLE, returnflag STRING,
> linestatus STRING, shipdate DATE, commitdate DATE, receiptdate DATE,
> shipinstruct STRING, shipmode STRING, comment STRING) STORED BY
> 'carbondata' TBLPROPERTIES('TABLE_BLOCKSIZE'='128 MB')")
>
> carbon.sql("LOAD DATA INPATH 'hdfs://hdfsmaster/input/lineitem/' INTO TABLE lineitem_4 OPTIONS ('FILEHEADER' = 'orderkey,partkey,suppkey,linenumber,quantity,extendedprice,discount,tax,returnflag,linestatus,shipdate,commitdate,receiptdate,shipinstruct,shipmode,comment', 'USE_KETTLE' = 'false', 'DELIMITER'='|')")
>
> val proj1 = carbon.sql("SELECT orderkey,partkey,linenumber FROM lineitem_4")
> proj1.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")
>
> val proj2 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag FROM lineitem_4")
> proj2.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj2/")
>
> val proj3 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag,linestatus,commitdate,receiptdate FROM lineitem_4")
> proj3.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj3/")
>
> val proj4 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag,linestatus,commitdate,receiptdate,shipinstruct,shipmode,comment FROM lineitem_4")
> proj4.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj4/")
>
> Thank you
>
> Regards
>
> Faisal

--
Regards
Liang