Re: [POSSIBLE BUG] Carbondata 1.1.1 inaccurate results

2017-08-23 Thread Ravindra Pesala
Hi,

I have verified using TPC-H tables with 1 GB of generated data on 1.1.1, but I
got the result below. I don't have the exact schema you mentioned, but I
verified with the original TPC-H schema.

0: jdbc:hive2://localhost:1> select count(c_CustKey),count(o_CustKey)
from customer, orders where o_Custkey = c_CustKey;
+-------------------+-------------------+--+
| count(c_CustKey)  | count(o_CustKey)  |
+-------------------+-------------------+--+
| 150               | 150               |
+-------------------+-------------------+--+


On Parquet with the same data:

0: jdbc:hive2://localhost:1> select count(c_CustKey),count(o_CustKey)
from customer, orders where o_Custkey = c_CustKey;
+-------------------+-------------------+--+
| count(c_CustKey)  | count(o_CustKey)  |
+-------------------+-------------------+--+
| 150               | 150               |
+-------------------+-------------------+--+


Regards,
Ravindra.

On 23 August 2017 at 19:40, Swapnil Shinde  wrote:

> Hello All
> We are observing incorrect query results with carbondata 1.1.1. Please
> find details below -
>
> *Datasets used -*
>  TPC-H star schema based datasets (http://www.cs.umb.edu/~
> poneil/StarSchemaB.PDF)
> *Query - *
> * select cCustKey,loCustKey from customer, lineorder where loCustkey =
> cCustKey*
> *How we load data -*
>  We validated loading data through dataframe and "INSERT" statements
> and both ways produce incorrect results. I am putting one way here-
>
>
> *-- CREATE CUSTOMER TABLE*
>
> *carbon.sql("CREATE TABLE IF NOT EXISTS customer(cCustKey Int, cName
> string, cAddress string, cCity string, cNation string, cRegion string,
> cPhone string, cMktSegment string, dummy string) STORED BY 'carbondata'")*
>
> *carbon.sql("LOAD DATA INPATH '///tmp/ssb_raw/customer' INTO TABLE
> customer
> OPTIONS('DELIMITER'='\t','FILEHEADER'='cCustKey,cName,cAddress,cCity,cNation,cRegion,cPhone,cMktsegment,dummy')")*
>
>
>
> *-- CREATE LINEORDER TABLE*
>
> *carbon.sql("CREATE TABLE IF NOT EXISTS lineorder(loOrderkey
> bigint,loLinenumber Int,loCustkey Int,loPartkey Int,loSuppkey
> Int,loOrderdate Int,loOrderpriority String,loShippriority Int,loQuantity
> Int,loExtendedprice Int,loOrdtotalprice Int,loDiscount Int,loRevenue
> Int,loSupplycost Int,loTax Int,loCommitdate Int,loShipmode String,dummy
> String) STORED BY 'carbondata'")*
>
> *carbon.sql("LOAD DATA INPATH '///tmp/ssb_raw/lineorder' INTO
> TABLE lineorder
> OPTIONS('DELIMITER'='\t','FILEHEADER'='loOrderkey,loLinenumber,loCustkey,loPartkey,loSuppkey,loOrderdate,loOrderpriority,loShippriority,loQuantity,loExtendedprice,loOrdtotalprice,loDiscount,loRevenue,loSupplycost,loTax,loCommitdate,loShipmode,dummy')")*
>
>
> *Results with different version - *
>
> *   1.1.0 - *Provides correct results for above query. Validated with
> results from parquet.
>
> *   1.1.1 - *Built from this
> .
> Join is missing lots of rows compared to parquet.
>
> *   1.1.1 - *Built from source code available for download
> .
> Join is missing lots of rows compared to parquet.
>
> *  1.2 - *Built from master branch. Generated correct results similar
> to parquet.
>
>
> *Debugging further - *
>
> 1. Row counts for the lineorder and customer tables are the same in both
> formats.
>
> 2. If I compare the key columns in CarbonData vs Parquet, they match as
> well -
>
>  val cd = carbon.sql("select cCustKey from customer")
>  cd.distinct.count        // 30,000,000
>
>  val sp = spark.sql("select cCustKey from pcustomer")
>  sp.distinct.count        // 30,000,000
>
>  cd.intersect(sp).count   // 30,000,000 (CarbonData has the same keys as Parquet)
>
>
>  val cd = carbon.sql("select loCustKey from lineorder")
>  cd.distinct.count        // 13,365,986
>
>  val sp = spark.sql("select loCustKey from plineorder")
>  sp.distinct.count        // 13,365,986
>
>  cd.intersect(sp).count   // 13,365,986 (CarbonData has the same keys as Parquet)
>
>
> The above checks show that the CarbonData customer and lineorder tables have
> the same key values as Parquet.
>
> However, when you run the join query above, CarbonData returns only a very
> small subset of the expected rows. If we run a filter query for any specific
> key, that also returns no results.
>
>
> We are not sure why v1.1.1 is producing incorrect results. Our guess is that
> CarbonData is skipping rows in v1.1.1 that it shouldn't.
>
> Any help and suggestions are very much appreciated! Thanks in advance.
>
>
>
> Thanks
>
> Swapnil Shinde


-- 
Thanks & Regards,
Ravi


Re: Apache CarbonData 6th meetup in Shanghai on 2nd Sep,2017 at : https://jinshuju.net/f/X8x5S9?from=timeline

2017-08-23 Thread Erlu Chen
Looking forward to the meetup!

Regards,
Chenerlu





Re: unable to generate the mdkey

2017-08-23 Thread David CaiQiang
hi,
  From the log, I just known some issue happened during writing data file
step in executor side.
  To locate this issue, I need more executor log.

David



-
Best Regards
David Cai


[POSSIBLE BUG] Carbondata 1.1.1 inaccurate results

2017-08-23 Thread Swapnil Shinde
Hello All,
We are observing incorrect query results with CarbonData 1.1.1. Please
find the details below -

*Datasets used -*
 TPC-H star schema based datasets (http://www.cs.umb.edu/~poneil/StarSchemaB.PDF)
*Query -*
* select cCustKey, loCustKey from customer, lineorder where loCustkey = cCustKey*
*How we load data -*
 We validated loading data through both DataFrames and "INSERT" statements,
and both ways produce incorrect results. I am putting one way here -


*-- CREATE CUSTOMER TABLE*

*carbon.sql("CREATE TABLE IF NOT EXISTS customer(cCustKey Int, cName
string, cAddress string, cCity string, cNation string, cRegion string,
cPhone string, cMktSegment string, dummy string) STORED BY 'carbondata'")*

*carbon.sql("LOAD DATA INPATH '///tmp/ssb_raw/customer' INTO TABLE
customer
OPTIONS('DELIMITER'='\t','FILEHEADER'='cCustKey,cName,cAddress,cCity,cNation,cRegion,cPhone,cMktsegment,dummy')")*



*-- CREATE LINEORDER TABLE*

*carbon.sql("CREATE TABLE IF NOT EXISTS lineorder(loOrderkey
bigint,loLinenumber Int,loCustkey Int,loPartkey Int,loSuppkey
Int,loOrderdate Int,loOrderpriority String,loShippriority Int,loQuantity
Int,loExtendedprice Int,loOrdtotalprice Int,loDiscount Int,loRevenue
Int,loSupplycost Int,loTax Int,loCommitdate Int,loShipmode String,dummy
String) STORED BY 'carbondata'")*

*carbon.sql("LOAD DATA INPATH '///tmp/ssb_raw/lineorder' INTO TABLE
lineorder
OPTIONS('DELIMITER'='\t','FILEHEADER'='loOrderkey,loLinenumber,loCustkey,loPartkey,loSuppkey,loOrderdate,loOrderpriority,loShippriority,loQuantity,loExtendedprice,loOrdtotalprice,loDiscount,loRevenue,loSupplycost,loTax,loCommitdate,loShipmode,dummy')")*
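

*-- DATAFRAME LOAD (sketch, for reference)*

A rough sketch of the DataFrame-based load path mentioned above (not the exact
code that was run; the CSV reader options and the "tableName" write option are
assumptions to verify against the CarbonData 1.1.x docs, and the column names
simply mirror the FILEHEADER used above):

import org.apache.spark.sql.SaveMode

val customerDf = carbon.read
  .option("sep", "\t")
  .option("inferSchema", "true")
  .csv("/tmp/ssb_raw/customer")
  .toDF("cCustKey", "cName", "cAddress", "cCity", "cNation",
        "cRegion", "cPhone", "cMktSegment", "dummy")

customerDf.write
  .format("carbondata")                // assumed source name; verify for 1.1.x
  .option("tableName", "customer")
  .mode(SaveMode.Append)
  .save()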


*Results with different versions -*

*   1.1.0 - *Provides correct results for the above query. Validated against
results from Parquet.

*   1.1.1 - *Built from this
.
Join is missing lots of rows compared to Parquet.

*   1.1.1 - *Built from source code available for download
.
Join is missing lots of rows compared to Parquet.

*  1.2 - *Built from the master branch. Generates correct results, matching
Parquet.


*Debugging further -*

1. Row counts for the lineorder and customer tables are the same in both
formats.

2. If I compare the key columns in CarbonData vs Parquet, they match as well -

     val cd = carbon.sql("select cCustKey from customer")
     cd.distinct.count        // 30,000,000

     val sp = spark.sql("select cCustKey from pcustomer")
     sp.distinct.count        // 30,000,000

     cd.intersect(sp).count   // 30,000,000 (CarbonData has the same keys as Parquet)


     val cd = carbon.sql("select loCustKey from lineorder")
     cd.distinct.count        // 13,365,986

     val sp = spark.sql("select loCustKey from plineorder")
     sp.distinct.count        // 13,365,986

     cd.intersect(sp).count   // 13,365,986 (CarbonData has the same keys as Parquet)


The above checks show that the CarbonData customer and lineorder tables have
the same key values as Parquet.

However, when you run the join query above, CarbonData returns only a very
small subset of the expected rows. If we run a filter query for any specific
key, that also returns no results.
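
A follow-up check that could help narrow this down (a sketch only; the Parquet
table names pcustomer/plineorder are assumed from the snippets above): use
except to list the join rows Parquet returns but CarbonData does not, and see
which keys are being dropped.

     val carbonJoin = carbon.sql(
       "select cCustKey, loCustKey from customer, lineorder where loCustkey = cCustKey")
     val parquetJoin = spark.sql(
       "select cCustKey, loCustKey from pcustomer, plineorder where loCustkey = cCustKey")

     // Join rows present in Parquet but missing from CarbonData
     val missing = parquetJoin.except(carbonJoin)
     missing.count()
     missing.select("loCustKey").distinct.show(20)   // sample of affected keys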


We are not sure why v1.1.1 is producing incorrect results. Our guess is that
CarbonData is skipping rows in v1.1.1 that it shouldn't.

Any help and suggestions are very much appreciated! Thanks in advance.



Thanks

Swapnil Shinde


unable to generate the mdkey

2017-08-23 Thread yixu2001
dev,
Hi,

spark 2.1.1 
carbondata 1.1.1 
hadoop 2.7.1

cc.sql("insert into e_carbon.prod_inst_c select prod_inst_id,owner_cust_id,'2' 
mvcc,latn_id,product_id,acc_prod_inst_id,address_id,payment_mode_cd,product_password,important_level,area_code,acc_nbr,exch_id,common_region_id,remark,pay_cycle,begin_rent_time,stop_rent_time,finish_time,stop_status,status_cd,create_date,status_date,update_date,proc_serial,use_cust_id,ext_prod_inst_id,address_desc,area_id,update_staff,create_staff,rec_update_date,account,ext_acc_prod_inst_id,distributor_id,eff_flg
 from e_carbon.prod_inst where latn_id=593").show;


org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException:
at 
org.apache.carbondata.processing.newflow.steps.DataWriterProcessorStepImpl.execute(DataWriterProcessorStepImpl.java:125)
at 
org.apache.carbondata.processing.newflow.DataLoadExecutor.execute(DataLoadExecutor.java:48)
at 
org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD$$anon$2.<init>(NewCarbonDataLoadRDD.scala:434)
at 
org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD.compute(NewCarbonDataLoadRDD.scala:398)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException: 
unable to generate the mdkey
at 
org.apache.carbondata.processing.newflow.steps.DataWriterProcessorStepImpl.processBatch(DataWriterProcessorStepImpl.java:181)
at 
org.apache.carbondata.processing.newflow.steps.DataWriterProcessorStepImpl.execute(DataWriterProcessorStepImpl.java:111)
... 11 more
Caused by: java.util.concurrent.RejectedExecutionException: Task 
java.util.concurrent.FutureTask@23e0197a rejected from 
java.util.concurrent.ThreadPoolExecutor@64c083fa[Shutting down, pool size = 1, 
active threads = 1, queued tasks = 0, completed tasks = 19]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at 
java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
at 
org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar.addDataToStore(CarbonFactDataHandlerColumnar.java:466)
at 
org.apache.carbondata.processing.newflow.steps.DataWriterProcessorStepImpl.processBatch(DataWriterProcessorStepImpl.java:178)
... 12 more
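
For reference, the root "Caused by" is the generic JDK behaviour when a task is
submitted to an executor service that is already shutting down. A minimal
standalone illustration (a sketch only, unrelated to CarbonData internals):

import java.util.concurrent.Executors

val pool = Executors.newFixedThreadPool(1)
pool.submit(new Runnable { def run(): Unit = println("accepted while running") })
pool.shutdown()
// Any submit after shutdown() is rejected, mirroring the
// "Shutting down, pool size = 1" state in the stack trace above.
pool.submit(new Runnable { def run(): Unit = println("never runs") })
// => java.util.concurrent.RejectedExecutionException

So the open question is why the writer step's pool was already shut down when
addDataToStore submitted a task, which the fuller executor logs requested in
the reply above would presumably show.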
 


yixu2001


Re: method not found issue when creating table

2017-08-23 Thread Lu Cao
Hi Manish,
Thank you for the response!
I have found that the issue is caused by the Cloudera version of the
spark-catalyst jar being different from the open source version. The
CatalogTable case class in the Cloudera version has some additional fields.

But how can I change the instance of CatalogTable (tableDesc in the code
below) to the Cloudera version? It comes from a SparkPlan... Though I have
changed all the jar dependencies to the Cloudera version, it still returns the
NoSuchMethodError.


/**
 * Carbon strategies for ddl commands
 */
class DDLStrategy(sparkSession: SparkSession) extends SparkStrategy {

  def apply(plan: LogicalPlan): Seq[SparkPlan] = {
    plan match {
      case ...
      case org.apache.spark.sql.execution.datasources.CreateTable(tableDesc, mode, None)
          if tableDesc.provider.get != DDLUtils.HIVE_PROVIDER
            && tableDesc.provider.get.equals("org.apache.spark.sql.CarbonSource") =>
        val updatedCatalog =
          CarbonSource.updateCatalogTableWithCarbonSchema(tableDesc, sparkSession)
        val cmd =
          CreateDataSourceTableCommand(updatedCatalog, ignoreIfExists = mode == SaveMode.Ignore)
        ExecutedCommandExec(cmd) :: Nil
      case _ => Nil
    }
  }
}

The Cloudera version of the case class is:

case class CatalogTable(
identifier: TableIdentifier,
tableType: CatalogTableType,
storage: CatalogStorageFormat,
schema: StructType,
provider: Option[String] = None,
partitionColumnNames: Seq[String] = Seq.empty,
bucketSpec: Option[BucketSpec] = None,
owner: String = "",
createTime: Long = System.currentTimeMillis,
lastAccessTime: Long = -1,
properties: Map[String, String] = Map.empty,
stats: Option[Statistics] = None,
viewOriginalText: Option[String] = None,
viewText: Option[String] = None,
comment: Option[String] = None,
unsupportedFeatures: Seq[String] = Seq.empty,
tracksPartitionsInCatalog: Boolean = false,
schemaPreservesCase: Boolean = true)

while in the error log, the copy method signature being looked up is:

java.lang.NoSuchMethodError:
org.apache.spark.sql.catalyst.catalog.CatalogTable.copy(
Lorg/apache/spark/sql/catalyst/TableIdentifier;
Lorg/apache/spark/sql/catalyst/catalog/CatalogTableType;
Lorg/apache/spark/sql/catalyst/catalog/CatalogStorageFormat;
Lorg/apache/spark/sql/types/StructType;
Lscala/Option;
Lscala/collection/Seq;
Lscala/Option;
Ljava/lang/String;
  <<<- Miss two long type parameters here
JJLscala/collection/immutable/Map;
Lscala/Option;
Lscala/Option;
Lscala/Option;
Lscala/Option;
Lscala/collection/Seq;Z)
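
A quick way to see which copy signature is actually on the runtime classpath
(a debugging sketch, not something from the thread) is to list the copy
overloads via reflection and print which jar the class was loaded from:

object InspectCatalogTable {
  def main(args: Array[String]): Unit = {
    val cls = Class.forName("org.apache.spark.sql.catalyst.catalog.CatalogTable")
    // Print every copy(...) overload the loaded class exposes
    cls.getDeclaredMethods
      .filter(_.getName == "copy")
      .foreach(m => println(m.getParameterTypes.map(_.getSimpleName).mkString("copy(", ", ", ")")))
    // Print the jar the class came from, to confirm whether it is the
    // Cloudera or the open-source spark-catalyst artifact
    println(cls.getProtectionDomain.getCodeSource.getLocation)
  }
}

If the printed signature does not match the one in the NoSuchMethodError, the
calling code was compiled against a different spark-catalyst than the one on
the classpath, and recompiling the CarbonData integration module against the
Cloudera artifact is the usual way out.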

Thanks,
CaoLu

On Wed, Aug 23, 2017 at 5:10 PM, manishgupta88 wrote:

> Hi Lionel,
>
> Carbon table creation flow is executed on the driver side, Executors do not
> participate in creation of carbon table. From the logs it seems that
> spark-catalyst jar is missing which is generally placed under
> $SPARK_HOME/jars OR $SPARK_HOME/lib directory. Please check if spark jars
> directory is there in the driver classpath. You can follow the below steps:
>
> 1. On the driver node, execute the "jps" command and find the SparkSubmit
> process id.
> 2. Execute "jinfo <process id>" and redirect the output to a file.
> 3. Search for the spark-catalyst jar in that file. If it is not found, it is
> not on the classpath; add the jar to the classpath and run your
> queries again.
>
> Regards
> Manish Gupta
>
>
>
> --
> View this message in context: http://apache-carbondata-dev-
> mailing-list-archive.1130556.n5.nabble.com/method-not-
> found-issue-when-creating-table-tp20640p20702.html
> Sent from the Apache CarbonData Dev Mailing List archive mailing list
> archive at Nabble.com.
>


Re: method not found issue when creating table

2017-08-23 Thread manishgupta88
Hi Lionel,

Carbon table creation flow is executed on the driver side; executors do not
participate in the creation of a carbon table. From the logs it seems that the
spark-catalyst jar is missing, which is generally placed under the
$SPARK_HOME/jars or $SPARK_HOME/lib directory. Please check whether the spark
jars directory is on the driver classpath. You can follow the steps below:

1. On the driver node, execute the "jps" command and find the SparkSubmit
process id.
2. Execute "jinfo <process id>" and redirect the output to a file.
3. Search for the spark-catalyst jar in that file. If it is not found, it is
not on the classpath; add the jar to the classpath and run your queries again.
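
As a programmatic alternative to the steps above (a sketch, not part of the
original suggestion), you can also ask the running driver where a
spark-catalyst class was loaded from:

// Run in the spark-shell / driver: prints the jar providing spark-catalyst,
// or throws ClassNotFoundException if it is not on the classpath at all.
val loc = Class.forName("org.apache.spark.sql.catalyst.TableIdentifier")
  .getProtectionDomain.getCodeSource.getLocation
println(s"spark-catalyst loaded from: $loc")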

Regards
Manish Gupta





Re: Apache CarbonData 6th meetup in Shanghai on 2nd Sep,2017 at : https://jinshuju.net/f/X8x5S9?from=timeline

2017-08-23 Thread Liang Chen

 





Apache CarbonData 6th meetup in Shanghai on 2nd Sep,2017 at : https://jinshuju.net/f/X8x5S9?from=timeline

2017-08-23 Thread Liang Chen