Hi Liang,

Thank you very much for your reply. I am giving answers side by side to your 

1. Did you use the latest master version, or 1.0? I suggest you use master to test.

I have downloaded the latest version from Git and compiled it. It is 

2. Have you tested other TPC-H queries that include where/filter?

I have only just started, and my plan is to move on to the full set of TPC-H 
queries to compare CarbonData's performance against Parquet. For now, I am 
running my own queries with different projected columns to see how well 
CarbonData pushes down the projection.
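
A minimal sketch of how such a series of projection queries could be generated rather than typed by hand. The object name `ProjectionQueries` and its helpers are illustrative only and are not part of Spark or CarbonData; the column order simply follows the projections quoted later in this thread.

```scala
// Sketch: build projection queries of increasing width over one table.
// `ProjectionQueries`, `projection` and `queries` are hypothetical names.
object ProjectionQueries {
  // Column order taken from the widest projection quoted in this thread.
  val columns: Seq[String] = Seq(
    "orderkey", "partkey", "linenumber", "quantity", "discount", "returnflag",
    "linestatus", "commitdate", "receiptdate", "shipinstruct", "shipmode", "comment")

  // Build "SELECT c1,...,cn FROM table" using the first n columns.
  def projection(table: String, n: Int): String = {
    require(n >= 1 && n <= columns.size, s"n must be in 1..${columns.size}")
    s"SELECT ${columns.take(n).mkString(",")} FROM $table"
  }

  // One (width, query) pair per requested width.
  def queries(table: String, widths: Seq[Int]): Seq[(Int, String)] =
    widths.map(n => n -> projection(table, n))
}
```

Each generated string could then be passed to `carbon.sql(...)` in the same way as the hand-written queries.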

3. In your case, is the query slow, or is the "write.format" below slow?

I use the same line for Parquet, and Parquet works perfectly fine. I don't 
think the write is causing the problem.
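
One way to check this directly would be to time the scan and the write as separate steps. The helper below is only a sketch; `Timing.timed` is a hypothetical name, not a Spark or CarbonData API.

```scala
// Sketch: wall-clock timing for a single block of work.
object Timing {
  // Runs `block` once and returns its result with the elapsed milliseconds.
  def timed[T](block: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}
```

Because Spark evaluates lazily, wrapping only `carbon.sql(...)` would measure almost nothing. The timed block needs to contain an action, for example the `write.format("csv").save(...)` call, or an action that consumes every row without writing.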

4. Use the master version for the query test, and set "ENABLE_VECTOR_READER" to true.
import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants

Thanks for this suggestion. I will enable it and share the updated results 
with you.

The community is also running TPC-H tests at the moment. Do you want to 
take part in the testing?

It would be nice to be part of this. Could you please guide me how I can 

Thank you

-----Original Message-----
From: Liang Chen [mailto:chenliang6...@gmail.com] 
Sent: Thursday, April 13, 2017 1:00 AM
To: dev@carbondata.incubator.apache.org
Subject: Re: CarbonData performance benchmkaring


1. Did you use the latest master version, or 1.0? I suggest you use master to test.
2. Have you tested other TPC-H queries that include where/filter?
3. In your case, is the query slow, or is the "write.format" below slow?

4. Use the master version for the query test, and set "ENABLE_VECTOR_READER" to true.
import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants
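
The two imports above are presumably meant to be followed by a call that sets the property. The call itself does not appear in this message, so the fragment below is an assumption based on CarbonData's `CarbonProperties` API. It would need to run in a spark-shell session with the CarbonData jars on the classpath.

```scala
import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants

// Assumption: set once in the session, before the queries are issued.
CarbonProperties.getInstance()
  .addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")
```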

The community is also running TPC-H tests at the moment. Do you want to 
take part in the testing?


2017-04-13 1:18 GMT+05:30 Rana Faisal Munir <fmu...@essi.upc.edu>:

> Dear all,
> I am running some experiments to benchmark the performance of both 
> Parquet and CarbonData. I am using the TPC-H lineitem table of size 8GB. 
> It has 16 columns and I am running different projection queries that 
> read a different number of columns (3, 6, 9, 12, 15). I am facing a 
> problem with CarbonData: it seems to be very slow when I select more 
> than 8 columns.
> It can take hours to process my request, whereas Parquet is very quick.
> Could anybody please help me understand this behavior?
> This is my cluster configuration:
> 3 Machines
> 1 Driver Machine (128 GB, 24 cores)
> 2 Worker Machines  (128GB, 24 cores)
> My configuration settings for Spark are:
> spark.executor.instances        12
> spark.executor.memory   18g
> spark.driver.memory     57g
> spark.executor.cores    3
> spark.driver.cores      5
> spark.default.parallelism       72
> carbon.sort.file.buffer.size=20
> carbon.graph.rowset.size=100000
> carbon.number.of.cores.while.loading=6
> carbon.sort.size=500000
> carbon.enableXXHash=true
> carbon.number.of.cores.while.compacting=2
> carbon.compaction.level.threshold=4,3
> carbon.major.compaction.size=1024
> carbon.number.of.cores=4
> carbon.inmemory.record.size=120000
> carbon.enable.quick.filter=false
> My queries:
> carbon.sql("""CREATE TABLE IF NOT EXISTS lineitem_4 (orderkey BIGINT,
>   partkey BIGINT, suppkey BIGINT, linenumber BIGINT, quantity DOUBLE,
>   extendedprice DOUBLE, discount DOUBLE, tax DOUBLE, returnflag STRING,
>   linestatus STRING, shipdate DATE, commitdate DATE, receiptdate DATE,
>   shipinstruct STRING, shipmode STRING, comment STRING)
>   STORED BY 'carbondata'""")
> // The FILEHEADER value is kept on one line so the column list is not split.
> carbon.sql("""LOAD DATA INPATH 'hdfs://hdfsmaster/input/lineitem/' INTO
>   lineitem_4 OPTIONS ('FILEHEADER' =
>   'orderkey,partkey,suppkey,linenumber,quantity,extendedprice,discount,tax,returnflag,linestatus,shipdate,commitdate,receiptdate,shipinstruct,shipmode,comment',
>   'USE_KETTLE' = 'false', 'DELIMITER'='|')""")
> val proj1 = carbon.sql("SELECT orderkey,partkey,linenumber FROM lineitem_4")
> proj1.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")
> val proj2 = carbon.sql("""SELECT
>   orderkey,partkey,linenumber,quantity,discount,returnflag
>   FROM lineitem_4""")
> proj2.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj2/")
> val proj3 = carbon.sql("""SELECT
>   orderkey,partkey,linenumber,quantity,discount,returnflag,
>   linestatus,commitdate,receiptdate FROM lineitem_4""")
> proj3.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj3/")
> val proj4 = carbon.sql("""SELECT
>   orderkey,partkey,linenumber,quantity,discount,returnflag,
>   linestatus,commitdate,receiptdate,shipinstruct,shipmode,comment
>   FROM lineitem_4""")
> proj4.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj4/")
> Thank you
> Regards
> Faisal

