Re: [VOTE] Apache CarbonData 1.6.1(RC1) release

2019-10-24 Thread Raghunandan S
Hi all


The PMC vote has passed for the Apache CarbonData 1.6.1 release. The result
is as below:


+1 (binding): 5 (Kumar Vishal, Ravindra, Liang Chen)


+1 (non-binding): 2


Thanks to all for your votes.


Regards

On Mon, Oct 14, 2019 at 4:56 PM Liang Chen  wrote:

> +1
>
> Please update the release notes accordingly.
>
> Regards
> Liang
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


[ANNOUNCE] Apache CarbonData 1.6.1 release

2019-10-24 Thread Raghunandan S
Hi All,

The Apache CarbonData community is pleased to announce the release of
version 1.6.1 under The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data
analytics scenarios, including BI analysis, ad-hoc SQL query, fast filter
lookup on detail records, and streaming analytics. CarbonData has been
deployed in many enterprise production environments; in one of the largest
scenarios, it supports queries on a single table with 3 PB of data (more
than 5 trillion records) with a response time of less than 3 seconds.

We encourage you to use the release at
https://dist.apache.org/repos/dist/release/carbondata/1.6.1/ and to provide
feedback through the CarbonData user mailing lists!

This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in CarbonData Version 1.6.1?

CarbonData 1.6.1 aims to move closer to unified analytics and to improve
stability. In this version, around 40 JIRA tickets covering improvements
and bug fixes have been resolved. The following is a summary.


Index Server performance improvements for Full Scan and TPC-H queries
Carbon currently prunes and caches all block/blocklet datamap index
information in the driver. If the cache grows large (70-80% of driver
memory), excessive GC in the driver can slow down queries, and the driver
may even go OutOfMemory. Moving the indexes out to a separate JDBCServer
reduced the overhead on the primary JDBCServer but introduced a delay in
fetching the bulk pruned block list from the index server. This has been
improved in this release, and performance now matches running without the
Index Server.
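For context, the distributed index server is enabled through entries in
carbon.properties. The keys below are believed to match the CarbonData
index server documentation, and the host/port values are placeholders;
please verify both against the documentation for your deployed version:

```properties
# Hedged sketch of carbon.properties entries for the index server.
# Verify key names and values against the CarbonData index server
# documentation for your version before use.
carbon.enable.index.server=true
carbon.index.server.ip=<index-server-host>
carbon.index.server.port=<port>
```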

Behaviour Change

None


Please find the detailed JIRA list:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12345993


Sub-task

   - [CARBONDATA-3454] - Optimize the performance of select count(*) for
   index server
   - [CARBONDATA-3462] - Add usage and deployment document for index server

Bug

   - [CARBONDATA-3452] - Select query failure when substring on dictionary
   column with join
   - [CARBONDATA-3474] - Fix validate mvQuery having filter expression and
   correct error message
   - [CARBONDATA-3476] - Read time and scan time stats shown wrong in
   executor log for filter query
   - [CARBONDATA-3477] - Throw out exception when using sql: 'update table
   select\n...'
   - [CARBONDATA-3478] - Fix ArrayIndexOutOfBoundsException issue on
   compaction after alter rename operation
   - [CARBONDATA-3480] - Remove Modified MDT and make relation refresh only
   when schema file is modified
   - [CARBONDATA-3481] - Multi-thread pruning fails when datamaps count is
   just near numOfThreadsForPruning
   - [CARBONDATA-3482] - Null pointer exception when concurrent select
   queries are executed from different beeline terminals
   - [CARBONDATA-3483] - Cannot run horizontal compaction when executing
   update sql
   - [CARBONDATA-3485] - Data loading fails from S3 to HDFS table having
   ~2K carbon files
   - [CARBONDATA-3486] - Serialization/deserialization issue with Datatype
   - [CARBONDATA-3487] - Wrong input metrics (size/record) displayed in
   Spark UI during insert into
   - [CARBONDATA-3490] - Concurrent data load failure with carbondata
   FileNotFound exception
   - [CARBONDATA-3493] - Carbon query fails when enable.query.statistics is
   true in a specific scenario
   - [CARBONDATA-3494] - NullPointerException in case of drop table
   - [CARBONDATA-3495] - Insert into complex data type of Binary fails with
   Carbon & SparkFileFormat
   - [CARBONDATA-3499] - Fix insert failure with customFileProvider
   - [CARBONDATA-3502] - Select query fails with UDF having Match
   expression inside IN expression

Re: [DISCUSSION]Support for Geospatial indexing

2019-10-24 Thread Jacky Li
Thanks for the analysis. Please be careful with code reused from other
"open-source" repos, especially regarding the license.

Regards,
Jacky

On 2019/10/24 06:25:40, Ajantha Bhat  wrote: 
> [quoted message trimmed; Ajantha's full message appears below]


Re: [DISCUSSION]Support for Geospatial indexing

2019-10-24 Thread Ajantha Bhat
Hi Jacky,

We have looked into GeoMesa.

[image: Screenshot from 2019-10-23 16-25-23.png]

a. GeoMesa is tightly coupled with key-value databases such as Accumulo,
HBase, Google Bigtable, and Cassandra, and is used for OLTP queries.
b. GeoMesa's current Spark integration covers only the query flow; loading
from Spark is not supported. Spark can be used for analytics on a GeoMesa
store. They override Spark Catalyst optimizer code to intercept filters
from the logical relation and push them down to the GeoMesa server.
All the query logic, such as space-filling curve construction (Z-curve,
quadtree), does not happen at the Spark layer; it happens in the GeoServer
layer, which is coupled with the key-value databases.
https://www.geomesa.org/documentation/user/architecture.html

https://www.geomesa.org/documentation/user/spark/architecture.html

https://www.youtube.com/watch?v=Otf2jwdNaUY

c. GeoMesa is for spatio-temporal data, not just spatial data.
So we cannot integrate Carbon with GeoMesa directly, but we can reuse
some of the logic present in it, such as quadtree construction and lookup.
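To make the space-filling-curve idea above concrete, here is a minimal
Z-order (Morton) encoding sketch in Python. All names and the 16-bit
resolution are illustrative assumptions for this discussion, not CarbonData
or GeoMesa APIs:

```python
# Minimal sketch of Z-order (Morton) encoding for spatial indexing.
# All names and the 16-bit per-dimension resolution are illustrative
# assumptions, not part of any CarbonData or GeoMesa API.

BITS = 16  # bits per dimension; combined key fits in 32 bits

def quantize(value, lo, hi, bits=BITS):
    """Map a coordinate in [lo, hi] to an integer grid cell index."""
    cells = (1 << bits) - 1
    return int((value - lo) / (hi - lo) * cells)

def interleave(x, y, bits=BITS):
    """Interleave the bits of x and y into one Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return z

def z_value(lon, lat):
    """Encode a (lon, lat) point as a single sortable integer."""
    x = quantize(lon, -180.0, 180.0)
    y = quantize(lat, -90.0, 90.0)
    return interleave(x, y)

# Nearby points tend to get numerically close Z-values, so storing a
# sorted column of Z-values lets range scans approximate spatial filters.
```

The same bit-interleaving is what a quadtree cell numbering produces, which
is why the lookup logic mentioned in point c is reusable independently of
the GeoMesa store.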

Also, I found *another alternative*, *GeoSpark*; this project is not
coupled with any store.
https://datasystemslab.github.io/GeoSpark/

https://www.public.asu.edu/~jiayu2/presentation/jia-icde19-tutorial.pdf
So we will investigate further integrating Carbon with GeoSpark, or
reusing some of its code.

Also, regarding the second point: yes, we can make the Carbon
implementation a generic framework where different logic can be plugged in.

Thanks,
Ajantha





On Mon, Oct 21, 2019 at 6:34 PM Indhumathi  wrote:

> Hi Venu,
>
> I have some questions regarding this feature.
>
> 1. Is the geospatial index supported on streaming tables? If so, will there
> be any impact on generating the geoIndex on streaming data?
> 2. Does it have any restrictions on sort_scope?
> 3. Apart from point and polygon queries, will the geospatial index also
> support aggregation queries on geographical location data?
>
> Thanks & Regards,
> Indhumathi
>


Re: [DISCUSSION]Support for Geospatial indexing

2019-10-24 Thread VenuReddy
1. Would a table with a geospatial location column be allowed to be updated
with non-geospatial data, and vice versa? Or would it follow the existing
behavior, where any unsupported data in the type/column is treated as bad
records?
=> Location columns cannot be allowed to contain invalid datatypes.
Unsupported data in the type/column can be treated as bad records.

2. Would there be any limitations with respect to using the targetColumn
configured as a local dictionary, inverted index, cache column, or range
column in the table properties?
=> I think there shouldn't be any such restriction. The targetColumn is just
an additional column generated internally when the INDEX property is
specified.

3. Would only measure data types be supported for the targetDataType
parameter? The supported types can be mentioned in the design doc.
=> We can treat the generated geohash column as a dimension column, as it
should be part of the sort columns.
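As a rough illustration of why the generated column should be a sortable
dimension column: once rows are ordered by an interleaved-bit value, a 2-D
query window becomes a range scan plus an exact post-filter. Everything
below is a hypothetical sketch, not a committed CarbonData design:

```python
# Sketch: why a generated, sortable geohash/Z-value column helps pruning.
# All names are hypothetical illustrations, not a committed design.
import bisect

def z_encode(x, y, bits=8):
    """Toy Z-order encode of small non-negative integer grid coords."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# A toy "table" of points, stored sorted by the generated Z-value column.
points = [(x, y) for x in range(16) for y in range(16)]
table = sorted((z_encode(x, y), (x, y)) for x, y in points)
keys = [z for z, _ in table]

def range_scan(z_lo, z_hi):
    """Binary-search the sorted Z column, returning candidate rows."""
    lo = bisect.bisect_left(keys, z_lo)
    hi = bisect.bisect_right(keys, z_hi)
    return [pt for _, pt in table[lo:hi]]

# The query window [0,3] x [0,3] maps to the contiguous Z range [0, 15]:
# sorting by Z turns this 2-D window into a single range scan, followed
# by an exact post-filter on the candidates.
candidates = range_scan(z_encode(0, 0), z_encode(3, 3))
result = [(x, y) for x, y in candidates if 0 <= x <= 3 and 0 <= y <= 3]
```

In general a window decomposes into several Z ranges rather than one, but
each range is still a sorted scan, which is why sort_columns membership
matters for the generated column.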


