Hi Liang,

Thank you very much. I am now running my tests with the updated version and will share the results with you. I have one question: can I adjust the table block size in a DataFrame write?
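To make the question concrete: I don't know whether the DataFrame writer exposes a block-size option directly, but one workaround that only uses constructs already shown in this thread would be to pre-create the table with the desired TABLE_BLOCKSIZE and then insert the DataFrame's rows into it. A rough sketch (table names, block-size value, and the temp-view indirection are illustrative):

```scala
// Pre-create the Carbon table with an explicit TABLE_BLOCKSIZE.
// The block size is a table property, so it applies to all
// subsequent writes into this table.
carbon.sql("""
  CREATE TABLE IF NOT EXISTS lineitem_bs64 (
    orderkey BIGINT, partkey BIGINT, linenumber BIGINT
  ) STORED BY 'carbondata'
  TBLPROPERTIES('TABLE_BLOCKSIZE'='64')
""")

// Route the DataFrame's rows into the pre-created table via a temp view.
val df = carbon.sql("SELECT orderkey, partkey, linenumber FROM lineitem_4")
df.createOrReplaceTempView("src")
carbon.sql("INSERT INTO TABLE lineitem_bs64 SELECT * FROM src")
```

Is something like this the recommended way, or is there a writer option I am missing?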
Regards
Faisal

On 13.04.2017 07:07, Liang Chen wrote:
Hi Rana,

That would be very nice; you could join us in testing TPC-H. A contributor will contact you and share the TPC-H script and DDL. Note that you are using an old version (the January build); the current master has many optimizations for TPC-H, so you may want to clone the master version.

Regards
Liang

2017-04-13 4:38 GMT+05:30 Rana Faisal Munir <fmu...@essi.upc.edu>:

Hi Liang,

Thank you very much for your reply. I am answering your questions inline.

1. Did you use the latest master version, or 1.0? Suggest you use master to test.

I have downloaded the latest version from Git and compiled it. It is carbondata_2.11-1.0.0-incubating-shade-hadoop2.2.0.

2. Have you tested other TPC-H queries which include where/filter?

I just started recently, and my plan is to move on to the full set of TPC-H queries to see CarbonData's performance improvements over Parquet. Right now I am only running my own queries with different projected columns to see how well CarbonData can push down the projection.

3. In your case, is the query slow, or is the "write.format" below slow?

write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

I have the same line for Parquet, and Parquet works perfectly fine, so I don't think the write is causing the problem.

4. Use the master version for the query test, and set "ENABLE_VECTOR_READER" to true:

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants
CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

Thanks for this suggestion. I will enable it and share the updated results with you.

The community is also running TPC-H tests currently; do you want to participate in the test together?

It would be nice to be part of this. Could you please guide me on how I can contribute?
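On point 3, one way to separate the query cost from the write cost is to time a non-writing action (e.g. a count, which forces the scan) before timing the full scan-plus-CSV-write pipeline. A rough sketch, run in the same session as the queries below (the `time` helper and output path are illustrative; timings are wall-clock and should be averaged over warm runs):

```scala
// Helper to wall-clock an arbitrary Spark action.
def time[T](label: String)(body: => T): T = {
  val t0 = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - t0) / 1e9} s")
  result
}

val proj = carbon.sql("SELECT orderkey, partkey, linenumber FROM lineitem_4")

// Scan-only cost: count() forces the query to execute without writing output.
time("scan")(proj.count())

// Scan + write cost: the full pipeline used in the benchmark.
time("scan+write")(
  proj.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1_timed/"))
```

If "scan" is already slow, the reader (not the CSV write) is the bottleneck.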
Thank you
Regards
Faisal

-----Original Message-----
From: Liang Chen [mailto:chenliang6...@gmail.com]
Sent: Thursday, April 13, 2017 1:00 AM
To: dev@carbondata.incubator.apache.org
Subject: Re: CarbonData performance benchmarking

Hi

1. Did you use the latest master version, or 1.0? Suggest you use master to test.

2. Have you tested other TPC-H queries which include where/filter?

3. In your case, is the query slow, or is the "write.format" below slow?

write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

4. Use the master version for the query test, and set "ENABLE_VECTOR_READER" to true:

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants
CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

The community is also running TPC-H tests currently; do you want to participate in the test together?

Regards
Liang

2017-04-13 1:18 GMT+05:30 Rana Faisal Munir <fmu...@essi.upc.edu>:

Dear all,

I am running some experiments to benchmark the performance of both Parquet and CarbonData. I am using the TPC-H lineitem table of size 8 GB. It has 16 columns, and I am running different projection queries that read different numbers of columns (3, 6, 9, 12, 15). I am facing a problem with CarbonData: it seems to be very slow when I select more than 8 columns. It takes almost hours to process my request, whereas Parquet is very quick. Could anybody please help me understand this behavior?
This is my cluster configuration:

3 machines: 1 driver machine (128 GB, 24 cores), 2 worker machines (128 GB, 24 cores)

My configuration settings for Spark are:

spark.executor.instances 12
spark.executor.memory 18g
spark.driver.memory 57g
spark.executor.cores 3
spark.driver.cores 5
spark.default.parallelism 72

carbon.sort.file.buffer.size=20
carbon.graph.rowset.size=100000
carbon.number.of.cores.while.loading=6
carbon.sort.size=500000
carbon.enableXXHash=true
carbon.number.of.cores.while.compacting=2
carbon.compaction.level.threshold=4,3
carbon.major.compaction.size=1024
carbon.number.of.cores=4
carbon.inmemory.record.size=120000
carbon.enable.quick.filter=false

My queries:

carbon.sql("CREATE TABLE IF NOT EXISTS lineitem_4 (orderkey BIGINT, partkey BIGINT, suppkey BIGINT, linenumber BIGINT, quantity DOUBLE, extendedprice DOUBLE, discount DOUBLE, tax DOUBLE, returnflag STRING, linestatus STRING, shipdate DATE, commitdate DATE, receiptdate DATE, shipinstruct STRING, shipmode STRING, comment STRING) STORED BY 'carbondata' TBLPROPERTIES('TABLE_BLOCKSIZE'='128 MB')")

carbon.sql("LOAD DATA INPATH 'hdfs://hdfsmaster/input/lineitem/' INTO TABLE lineitem_4 OPTIONS ('FILEHEADER' = 'orderkey,partkey,suppkey,linenumber,quantity,extendedprice,discount,tax,returnflag,linestatus,shipdate,commitdate,receiptdate,shipinstruct,shipmode,comment', 'USE_KETTLE' = 'false', 'DELIMITER'='|')")

val proj1 = carbon.sql("SELECT orderkey,partkey,linenumber FROM lineitem_4")
proj1.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

val proj2 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag FROM lineitem_4")
proj2.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj2/")

val proj3 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag,linestatus,commitdate,receiptdate FROM lineitem_4")
proj3.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj3/")

val proj4 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag,linestatus,commitdate,receiptdate,shipinstruct,shipmode,comment FROM lineitem_4")
proj4.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj4/")

Thank you
Regards
Faisal

--
Regards
Liang
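For reference, the Parquet side of the comparison uses the same write pattern. A minimal sketch of an equivalent Parquet run (the input path, view name, and `spark` session variable are illustrative, not from the original setup):

```scala
// Same projection query against a Parquet copy of lineitem,
// written out with the same CSV write pattern as the Carbon runs above.
val lineitemPq = spark.read.parquet("hdfs://hdfsmaster/input/lineitem_parquet/")
lineitemPq.createOrReplaceTempView("lineitem_pq")

val projPq = spark.sql("SELECT orderkey, partkey, linenumber FROM lineitem_pq")
projPq.write.format("csv").save("hdfs://hdfsmaster/output/parquet/proj1/")
```

Keeping the write stage identical on both sides isolates the difference to the scan path of each format.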