Hi Rana,

That would be very nice; you are welcome to join us in testing TPC-H. A
contributor will contact you and share the TPC-H scripts and DDL with you.

Actually, you are using an old version (the January build); the current
master has many optimizations for TPC-H, so you may want to clone the
master branch.

Regards,
Liang
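For reference, here is a minimal, self-contained version of the
"ENABLE_VECTOR_READER" suggestion that comes up in the quoted thread
below. It is a sketch only: the property constant and its imports come
from the thread itself, while the CarbonSession builder call, the app
name, and the store path are assumptions based on the master-branch API
of that period.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._
import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants

// Enable the vectorized reader before the first query so that column
// batches are decoded in bulk rather than row by row.
CarbonProperties.getInstance()
  .addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

// Illustrative session setup; the app name and store path are placeholders.
val carbon = SparkSession
  .builder()
  .appName("TpchCarbonTest")
  .getOrCreateCarbonSession("hdfs://hdfsmaster/carbon/store")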
2017-04-13 4:38 GMT+05:30 Rana Faisal Munir <fmu...@essi.upc.edu>:

> Hi Liang,
>
> Thank you very much for your reply. I am answering your questions one
> by one.
>
> 1. Did you use the latest master version, or 1.0? I suggest you use
> master to test.
>
> I have downloaded the latest version from Git and compiled it. It is
> carbondata_2.11-1.0.0-incubating-shade-hadoop2.2.0.
>
> 2. Have you tested other TPC-H queries which include where/filter
> clauses?
>
> I just started recently, and my future plan is to run the full set of
> TPC-H queries to see CarbonData's performance improvements over
> Parquet. Right now, I am just running my own queries with different
> projected columns to see how well CarbonData can push down the
> projection.
>
> 3. In your case, is the query slow, or is the "write.format" below slow?
> write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")
>
> I have the same line for Parquet, and Parquet works perfectly fine. I
> don't think this write is causing any problem.
>
> 4. Use the master version to do the query test, and set
> "ENABLE_VECTOR_READER" to true.
> import org.apache.carbondata.core.util.CarbonProperties
> import org.apache.carbondata.core.constants.CarbonCommonConstants
> CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")
>
> Thanks for this suggestion. I will enable it and share the updated
> results with you.
>
> The community is also doing TPC-H testing currently; do you want to
> participate in the test together?
>
> It would be nice to be part of this. Could you please guide me on how
> I can contribute?
>
> Thank you
>
> Regards
> Faisal
>
> -----Original Message-----
> From: Liang Chen [mailto:chenliang6...@gmail.com]
> Sent: Thursday, April 13, 2017 1:00 AM
> To: dev@carbondata.incubator.apache.org
> Subject: Re: CarbonData performance benchmarking
>
> Hi
>
> 1. Did you use the latest master version, or 1.0? I suggest you use
> master to test.
> 2. Have you tested other TPC-H queries which include where/filter
> clauses?
> 3. In your case, is the query slow, or is the "write.format" below slow?
> write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")
>
> 4. Use the master version to do the query test, and set
> "ENABLE_VECTOR_READER" to true.
> import org.apache.carbondata.core.util.CarbonProperties
> import org.apache.carbondata.core.constants.CarbonCommonConstants
> CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")
>
> The community is also doing TPC-H testing currently; do you want to
> participate in the test together?
>
> Regards
> Liang
>
> 2017-04-13 1:18 GMT+05:30 Rana Faisal Munir <fmu...@essi.upc.edu>:
>
> > Dear all,
> >
> > I am running some experiments to benchmark the performance of both
> > Parquet and CarbonData. I am using the TPC-H lineitem table of size
> > 8 GB. It has 16 columns, and I am running different projection
> > queries that read different numbers of columns (3, 6, 9, 12, 15). I
> > am facing a problem with CarbonData: it seems to be very slow when I
> > select more than 8 columns. It takes hours to process my request,
> > whereas Parquet is very quick. Could anybody please help me
> > understand this behavior?
> >
> > This is my cluster configuration:
> >
> > 3 machines:
> > 1 driver machine (128 GB, 24 cores)
> > 2 worker machines (128 GB, 24 cores)
> >
> > My configuration settings for Spark are:
> >
> > spark.executor.instances 12
> > spark.executor.memory 18g
> > spark.driver.memory 57g
> > spark.executor.cores 3
> > spark.driver.cores 5
> > spark.default.parallelism 72
> >
> > carbon.sort.file.buffer.size=20
> > carbon.graph.rowset.size=100000
> > carbon.number.of.cores.while.loading=6
> > carbon.sort.size=500000
> > carbon.enableXXHash=true
> > carbon.number.of.cores.while.compacting=2
> > carbon.compaction.level.threshold=4,3
> > carbon.major.compaction.size=1024
> > carbon.number.of.cores=4
> > carbon.inmemory.record.size=120000
> > carbon.enable.quick.filter=false
> >
> > My queries:
> >
> > carbon.sql("CREATE TABLE IF NOT EXISTS lineitem_4 (orderkey BIGINT,
> > partkey BIGINT, suppkey BIGINT, linenumber BIGINT, quantity DOUBLE,
> > extendedprice DOUBLE, discount DOUBLE, tax DOUBLE, returnflag STRING,
> > linestatus STRING, shipdate DATE, commitdate DATE, receiptdate DATE,
> > shipinstruct STRING, shipmode STRING, comment STRING) STORED BY
> > 'carbondata' TBLPROPERTIES('TABLE_BLOCKSIZE'='128 MB')")
> >
> > carbon.sql("LOAD DATA INPATH 'hdfs://hdfsmaster/input/lineitem/' INTO TABLE lineitem_4 OPTIONS ('FILEHEADER' = 'orderkey,partkey,suppkey,linenumber,quantity,extendedprice,discount,tax,returnflag,linestatus,shipdate,commitdate,receiptdate,shipinstruct,shipmode,comment', 'USE_KETTLE' = 'false', 'DELIMITER'='|')")
> >
> > val proj1 = carbon.sql("SELECT orderkey,partkey,linenumber FROM lineitem_4")
> > proj1.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")
> >
> > val proj2 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag FROM lineitem_4")
> > proj2.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj2/")
> >
> > val proj3 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag,linestatus,commitdate,receiptdate FROM lineitem_4")
> > proj3.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj3/")
> >
> > val proj4 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag,linestatus,commitdate,receiptdate,shipinstruct,shipmode,comment FROM lineitem_4")
> > proj4.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj4/")
> >
> > Thank you
> >
> > Regards
> > Faisal
>
> --
> Regards
> Liang

--
Regards
Liang
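Question 3 in the thread above asks whether the scan itself or the CSV
write is the slow part. One way to separate the two costs is to time a
count() (which forces the full scan without writing anything) against
the full write. A sketch only, assuming the carbon session and the
lineitem_4 table from the thread; the time helper and the output path
are illustrative, not part of any CarbonData or Spark API.

// Illustrative wall-clock helper.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

// count() forces the scan alone, so comparing the two timings shows
// whether the query or the CSV write dominates. Note the DataFrame is
// recomputed for the write, which is what we want for separate numbers.
val proj1 = carbon.sql("SELECT orderkey,partkey,linenumber FROM lineitem_4")
time("scan only")(proj1.count())
time("scan + csv write") {
  proj1.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1_timed/")
}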