Hi Liang,

Thank you very much for your reply. I am answering your questions inline below.
1. Did you use the latest master version, or 1.0? Suggest you use master to test.

I have downloaded the latest version from Git and compiled it. It is carbondata_2.11-1.0.0-incubating-shade-hadoop2.2.0.

2. Have you tested other TPC-H queries which include where/filter?

I have only started recently; my plan is to move on to the full set of TPC-H queries to see CarbonData's performance improvements over Parquet. Right now I am running my own queries with different numbers of projected columns to see how well CarbonData can push down the projection.

3. In your case, is the query slow, or is the "write.format" below slow?

write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

I have the same line for Parquet, and Parquet works perfectly fine, so I don't think the write is causing the problem.

4. Use the master version to do the query test, and set "ENABLE_VECTOR_READER" to true.

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants

CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

Thanks for this suggestion. I will enable it and share the updated results with you.
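To be completely sure about point 3, I intend to time the scan separately from the CSV write after enabling the vectorized reader. Below is a minimal sketch of what I have in mind; the "time" helper and the output path are only illustrative, and "carbon" is the same session used in my queries:

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants

// Enable the vectorized reader, as suggested in point 4.
CarbonProperties.getInstance()
  .addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

// Illustrative helper: run a block and print how long it took.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

val proj1 = carbon.sql("SELECT orderkey, partkey, linenumber FROM lineitem_4")

// Check in the physical plan whether the projection reaches the Carbon scan.
proj1.explain(true)

// Force a full scan without writing anything, to measure the pure read cost.
time("scan only") { proj1.foreach(_ => ()) }

// Write to a fresh path (illustrative name); the difference is the write cost.
time("scan + csv write") {
  proj1.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1_timed/")
}

If the scan-only step already takes hours, the slowness is in the read path rather than in the CSV write.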
Community is doing TPC-H test also currently, do you want to participate in test together?

It would be nice to be part of this. Could you please guide me on how I can contribute?

Thank you

Regards
Faisal

-----Original Message-----
From: Liang Chen [mailto:chenliang6...@gmail.com]
Sent: Thursday, April 13, 2017 1:00 AM
To: dev@carbondata.incubator.apache.org
Subject: Re: CarbonData performance benchmarking

Hi

1. Did you use the latest master version, or 1.0? Suggest you use master to test.

2. Have you tested other TPC-H queries which include where/filter?

3. In your case, is the query slow, or is the "write.format" below slow?

write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

4. Use the master version to do the query test, and set "ENABLE_VECTOR_READER" to true.

import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants
CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

Community is doing TPC-H test also currently, do you want to participate in test together?

Regards
Liang

2017-04-13 1:18 GMT+05:30 Rana Faisal Munir <fmu...@essi.upc.edu>:

> Dear all,
>
> I am running some experiments to benchmark the performance of both
> Parquet and CarbonData. I am using the TPC-H lineitem table of size 8 GB.
> It has 16 columns, and I am running different projection queries that
> read different numbers of columns (3, 6, 9, 12, 15). I am facing a
> problem with CarbonData: it seems to be very slow when I select more
> than 8 columns. It takes hours to process my request, whereas Parquet
> is very quick. Could anybody please help me understand this behavior?
>
> This is my cluster configuration:
>
> 3 machines
> 1 driver machine (128 GB, 24 cores)
> 2 worker machines (128 GB, 24 cores)
>
> My configuration settings for Spark are:
>
> spark.executor.instances 12
> spark.executor.memory 18g
> spark.driver.memory 57g
> spark.executor.cores 3
> spark.driver.cores 5
> spark.default.parallelism 72
>
> carbon.sort.file.buffer.size=20
> carbon.graph.rowset.size=100000
> carbon.number.of.cores.while.loading=6
> carbon.sort.size=500000
> carbon.enableXXHash=true
> carbon.number.of.cores.while.compacting=2
> carbon.compaction.level.threshold=4,3
> carbon.major.compaction.size=1024
> carbon.number.of.cores=4
> carbon.inmemory.record.size=120000
> carbon.enable.quick.filter=false
>
> My queries:
>
> carbon.sql("CREATE TABLE IF NOT EXISTS lineitem_4 (orderkey BIGINT,
> partkey BIGINT, suppkey BIGINT, linenumber BIGINT, quantity DOUBLE,
> extendedprice DOUBLE, discount DOUBLE, tax DOUBLE, returnflag STRING,
> linestatus STRING, shipdate DATE, commitdate DATE, receiptdate DATE,
> shipinstruct STRING, shipmode STRING, comment STRING) STORED BY
> 'carbondata' TBLPROPERTIES('TABLE_BLOCKSIZE'='128 MB')")
>
> carbon.sql("LOAD DATA INPATH 'hdfs://hdfsmaster/input/lineitem/' INTO TABLE lineitem_4 OPTIONS ('FILEHEADER' = 'orderkey,partkey,suppkey,linenumber,quantity,extendedprice,discount,tax,returnflag,linestatus,shipdate,commitdate,receiptdate,shipinstruct,shipmode,comment', 'USE_KETTLE' = 'false', 'DELIMITER'='|')")
>
> val proj1 = carbon.sql("SELECT orderkey,partkey,linenumber FROM lineitem_4")
> proj1.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")
>
> val proj2 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag FROM lineitem_4")
> proj2.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj2/")
>
> val proj3 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag,linestatus,commitdate,receiptdate FROM lineitem_4")
> proj3.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj3/")
>
> val proj4 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag,linestatus,commitdate,receiptdate,shipinstruct,shipmode,comment FROM lineitem_4")
> proj4.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj4/")
>
> Thank you
>
> Regards
>
> Faisal

--
Regards
Liang